To work with these data in R we begin by generating two vectors: one for the student-teacher ratios (STR) and one for the test scores (TestScore), both containing the data from the table above.
Furthermore, we might want to add a systematic relationship to the plot. To draw a straight line, R provides the function abline. Calling this function with arguments a (the intercept) and b (the slope) after executing plot adds the line to our plot.
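A minimal sketch in R (the numeric values here are illustrative stand-ins, since the table itself is not reproduced):

    # Illustrative stand-in values for the table of student-teacher
    # ratios and test scores (not the book's actual data)
    STR <- c(15, 17, 19, 20, 22, 23.5, 25)
    TestScore <- c(680, 640, 670, 660, 630, 660, 635)

    # Scatter plot of the observations
    plot(STR, TestScore, xlab = "Student-teacher ratio", ylab = "Test score")

    # Add a straight line with intercept a and slope b (values chosen for illustration)
    abline(a = 713, b = -3)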
If another sample of the same size is taken, a different sample regression equation would be generated. Because the standard deviation of the sampling distribution of the slope, b, is seldom known, statisticians developed a method to estimate it from a single sample. Computing sb is tedious and is almost always left to a computer, especially when there is more than one explanatory variable. The estimate is based on how much the sample points vary from the regression line. If the points in the sample are not very close to the sample regression line, it seems reasonable that the population points are also widely scattered around the population regression line, and different samples could easily produce lines with quite varied slopes.
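For simple regression (one explanatory variable), this estimate, written s_b below, takes a standard form:

$$
s_b = \frac{s_e}{\sqrt{\sum_{i}(x_i-\bar{x})^2}}, \qquad
s_e = \sqrt{\frac{\sum_{i}(y_i-\hat{y}_i)^2}{n-2}}
$$

Here s_e, the standard error of the estimate, captures how far the sample points fall from the regression line, so more scatter around the line produces a larger sb.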
Though there are other factors involved, in general, when the points in the sample are farther from the regression line, sb is greater. Rather than learn how to compute sb, it is more useful to learn how to find it in the regression results you get from statistical software. It is often called the standard error, and there is one for each independent variable; the printout in Figure 8. reports one for each coefficient. If the slope equals zero, then changes in x do not result in any change in y.
To test Ho: β = 0 against Ha: β ≠ 0, substitute zero for β in the t-score equation, t = (b - 0)/sb; if the t-score is small, b is close enough to zero to accept Ho. The degrees of freedom equal n - m - 1, where n is the size of the sample and m is the number of independent x variables.
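As a sketch of where these values appear in software, here is the R version, assuming the STR and TestScore vectors from the earlier sketch:

    # Estimate the simple regression by least squares
    fit <- lm(TestScore ~ STR)

    # The coefficient table reports, for each coefficient, its estimate,
    # standard error (sb), t value, and p-value
    summary(fit)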
There is a separate hypothesis test for each independent variable. This means you test to see if y is a function of each x separately.

Testing your regression: does this equation really help predict?

By testing to see if the regression helps predict, you are testing to see if there is a functional relationship in the population. If you had to predict y without using regression, the best guess for any observation would simply be the mean of y. This is not a very sophisticated prediction technique, but remember that the sample mean is an unbiased estimator of the population mean, so on average you will be right.
Notice that the measures of these differences could be positive or negative numbers, but that "error" or "improvement" implies a positive distance. Squaring each error overcomes the worry about signs, and adding the squared errors together gives a measure of the total mistake made in predicting y. To make this raw measure of the improvement meaningful, you need to compare it to one of the two measures of the total mistake.
One compares the improvement to the mistakes still made with regression. The other compares the improvement to the mistakes that would be made if the mean were used to predict. The first comparison underlies the F-score discussed below; the second is called R², the coefficient of determination. All of these mistakes and improvements have names, and talking about them will be easier once you know those names.
Further, we can write this decomposition of the total variation as SST = SSR + SSE, where SST stands for the sum of squares due to total variation, SSR measures the sum of squares due to the estimated regression model (the variation explained by the variable x), and SSE measures all the variation due to other factors excluded from the estimated model. Going back to the idea of goodness of fit, one can then easily calculate the percentage of each variation with respect to the total variation.
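In the usual notation, with the predicted values written as y-hat and the sample mean as y-bar, these three sums of squares are:

$$
SST=\sum_i (y_i-\bar{y})^2, \qquad
SSR=\sum_i (\hat{y}_i-\bar{y})^2, \qquad
SSE=\sum_i (y_i-\hat{y}_i)^2
$$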
In particular, the strength of the estimated regression model can now be measured by R² = SSR/SST. The closer R² is to one, the stronger the model is. Alternatively, R² is also found by R² = (SST - SSE)/SST. This is the ratio of the improvement made using the regression to the mistakes made using the mean. The numerator, SST - SSE, is the improvement regression makes over using the mean to predict; the denominator, SST, is the mistakes (errors) made using the mean.
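A short R sketch, continuing with the fit object from above, verifies that the two expressions agree:

    # Sums of squares for the fitted simple regression
    SST <- sum((TestScore - mean(TestScore))^2)  # mistakes made using the mean
    SSE <- sum(residuals(fit)^2)                 # mistakes still made with regression
    SSR <- SST - SSE                             # the improvement

    # Two equivalent ways to compute R-squared
    SSR / SST
    1 - SSE / SST
    summary(fit)$r.squared  # matches the printout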
Thus R² simply shows what proportion of the mistakes made using the mean are eliminated by using regression. When R² is low, one typical approach to strengthening the model is to add more relevant factors to the simple regression model. In this case, the estimated model is referred to as a multiple regression model. While R² is not used to test hypotheses, it has a more intuitive meaning than the F-score. The F-score is the measure usually used in a hypothesis test to see if the regression made a significant improvement over using the mean.
It is used because the sampling distribution it follows, the F-distribution, is printed in the tables at the back of most statistics books, so it can be used for hypothesis testing, and it works no matter how many explanatory variables are used. If there is no functional relationship in the population, the regression will explain little of the variation in y; as a result, very few samples from such populations will have a large sum of squares regression and large F-scores. The sum of squares regression is divided by the number of explanatory variables to account for the fact that it always increases when more variables are added.
You can also look at this as finding the improvement per explanatory variable. The sum of squares residual is divided by a number very close to the number of observations, n - m - 1, because it always increases if more observations are added. You can also look at this as the approximate mistake per observation.
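Putting these two pieces together gives the F-score:

$$
F = \frac{SSR/m}{SSE/(n-m-1)}
$$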
To test whether a regression equation was worth estimating, test to see if there seems to be a functional relationship:

Ho: β1 = β2 = ... = βm = 0 (the regression does not help predict y)
Ha: at least one βi ≠ 0

This might look like a two-tailed test since Ho has an equal sign. But, by looking at the equation for the F-score, you should be able to see that the data support Ha only if the F-score is large.
This is because the data support the existence of a functional relationship if the sum of squares regression is large relative to the sum of squares residual.
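In R, the comparison of the computed F-score to the table F-value might look like the following sketch, continuing with the sums of squares computed above (alpha = 0.05 is an illustrative choice):

    # F-score for the simple regression (m = 1 explanatory variable)
    n <- length(TestScore)
    m <- 1
    F_computed <- (SSR / m) / (SSE / (n - m - 1))

    # Critical (table) value at alpha = 0.05
    F_table <- qf(0.95, df1 = m, df2 = n - m - 1)

    F_computed > F_table  # TRUE means the data support Ha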
If the computed F is greater than the table F, then the computed F is unlikely to have occurred if Ho is true, and you can safely decide that the data support Ha: there is a functional relationship in the population. Now that you have learned all the necessary steps in estimating a simple regression model, you may take some time to re-estimate the Nelson apartment model, or any other simple regression model, using the interactive Excel template shown in Figure 8.
As with all other interactive templates in this textbook, you can change the values in the yellow cells only; the results are then updated automatically within the template. This particular template can only estimate simple regression models with 30 observations. The first step is to enter your data under the independent and dependent variables. Next, select your alpha level.
Check your results in terms of both individual and overall significance. Once the model has passed all these requirements, you can select an appropriate value for the independent variable (in this example, the distance to downtown) to estimate both the confidence intervals for the average price of such an apartment and the prediction intervals for the selected distance.
Both these intervals are discussed later in this chapter. Remember that changing any of the values in the yellow areas of this template updates all calculations, including the tests of significance and the values of both the confidence and prediction intervals.

Multiple Regression Analysis

When we add more explanatory variables to our simple regression model to strengthen its ability to explain real-world data, we in fact convert a simple regression model into a multiple regression model.
Obviously, there are more relevant factors that could be added to this model to make it stronger. If we go back to Excel and estimate our model including the newly added variable, we will see the printout shown in Figure 8.
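The same model can be estimated in R; the sketch below uses hypothetical apartment data, with size as a stand-in for whatever variable was added (the book's actual data and added variable are not reproduced here):

    # Hypothetical apartment data: monthly price, distance to downtown (km),
    # and a hypothetical added variable, apartment size (square feet)
    apartments <- data.frame(
      price    = c(750, 820, 690, 900, 640, 880, 710, 780),
      distance = c(5.0, 3.5, 6.0, 2.0, 7.5, 2.5, 5.5, 4.0),
      size     = c(700, 820, 650, 900, 600, 880, 680, 760)
    )

    # Multiple regression: price as a function of distance and size
    mfit <- lm(price ~ distance + size, data = apartments)
    summary(mfit)  # printout analogous to the Excel output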
Before using this estimated model for prediction and decision-making purposes, we should test three hypotheses: a separate test for each of the two estimated slope coefficients, and a test of the overall significance of the estimated model.
In addition to the F-test for overall significance, there is a separate t-test for each independent variable, and the degrees of freedom again equal n - m - 1, where n is the size of the sample and m is the number of independent x variables. Furthermore, the adjusted R², which corrects R² for the number of explanatory variables in the model, indicates how much of the variation in y the model explains after that correction. The interpretation of each estimated slope should also be adjusted slightly: in a multiple regression model, each slope measures the effect of its variable on y while holding the other explanatory variables constant.
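In the R sketch, assuming the mfit object from above, these quantities can be read off directly:

    # One t-test per explanatory variable: estimate, standard error, t, p-value
    coef(summary(mfit))

    # Adjusted R-squared, which corrects for the number of explanatory variables
    summary(mfit)$adj.r.squared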
Predictions using the estimated simple regression

If the estimated regression line fits well into the data, the model can then be used for predictions. This can be done in two ways: by constructing a confidence interval for the average value of y at a chosen value of x, or a prediction interval for an individual value of y at that x, as in the interactive Excel template shown in Figure 8.
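A brief R sketch of both intervals, continuing with the simple regression fit from earlier (the chosen x value of 20 is illustrative):

    # Predictions at a chosen value of the explanatory variable
    new_x <- data.frame(STR = 20)

    # Confidence interval for the average value of y at x = 20
    predict(fit, newdata = new_x, interval = "confidence", level = 0.95)

    # Prediction interval for an individual value of y at x = 20
    predict(fit, newdata = new_x, interval = "prediction", level = 0.95)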