Hi again k9queen,
The dependent variable of the regression is ATTENDANCE. We are hoping
to establish a link between class sizes, SAT scores, and college
attendance, so ATTENDANCE should be the dependent variable, with
CLASS_SIZE and SCORE as the explanatory variables in the regression.
The results of the regression are as follows:
Dependent variable: ATTENDANCE
Current sample: 1 to 18
Number of observations: 18
Mean of dep. var. = 69.7778 LM het. test = .153765 [.695]
Std. dev. of dep. var. = 14.8386 Durbin-Watson = 1.32248 [<.099]
Sum of squared residuals = 2353.88 Jarque-Bera test = .995509 [.608]
Variance of residuals = 147.118 Ramsey's RESET2 = .366230E-02 [.953]
Std. error of regression = 12.1292 F (zero slopes) = 9.44297 [.007]
R-squared = .373102 Schwarz B.I.C. = 72.2923
Adjusted R-squared = .333921 Log likelihood = -69.4019
Variable      Estimated Coefficient   Standard Error   t-statistic   P-value
CLASS_SIZE    -1.06184                .676443          -1.56974      [.136]
SCORE          .094201                .015422           6.10826      [.000]
T(16) Critical Value: 2.119905, Two-tailed area: .05000
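If you would like to reproduce this kind of regression yourself, here is a
minimal sketch in Python using statsmodels. The data values below are
placeholders I made up (your actual 18 observations are not reproduced here),
so the numbers will not match the output above.

    # Minimal sketch: OLS regression of ATTENDANCE on CLASS_SIZE and SCORE.
    # The data frame holds hypothetical placeholder values only.
    import pandas as pd
    import statsmodels.api as sm

    df = pd.DataFrame({
        "ATTENDANCE": [55, 72, 64, 81, 90, 60],             # placeholder values
        "CLASS_SIZE": [30, 22, 27, 18, 15, 28],             # placeholder values
        "SCORE":      [950, 1100, 1010, 1180, 1250, 980],   # placeholder values
    })

    # Add an intercept term (an assumption here; the output above does not
    # show whether a constant was included).
    X = sm.add_constant(df[["CLASS_SIZE", "SCORE"]])
    model = sm.OLS(df["ATTENDANCE"], X).fit()

    # The summary reports R-squared, the F-test for zero slopes, and the
    # coefficient / standard error / t-statistic / P-value table, much like
    # the output above.
    print(model.summary())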
The R-squared is 0.37, so 37% of the variation in ATTENDANCE is explained
by the variables in the model.
The CLASS_SIZE coefficient is -1.06, so larger class sizes are associated
with lower college attendance: each additional student per class is
associated with a drop of roughly one point in the attendance rate. The
coefficient on SCORE is 0.094, so higher SAT scores are associated with
higher college attendance.
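As a quick back-of-the-envelope illustration of how those point estimates
are read (ignoring the intercept, and keeping in mind that the CLASS_SIZE
coefficient is not statistically significant, as discussed below):

    # Hypothetical illustration only: predicted change in ATTENDANCE implied
    # by the point estimates above, for a class-size cut of 5 students and a
    # 50-point rise in average SAT score.
    b_class_size = -1.06184
    b_score = 0.094201

    delta_attendance = b_class_size * (-5) + b_score * 50
    print(round(delta_attendance, 2))   # 10.02, i.e. about 10 points higher attendance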
I don't feel I could describe multicollinearity nearly as eloquently as
this description from graphpad.com:
http://www.graphpad.com/articles/Multicollinearity.htm
"In some cases, multiple regression results may seem paradoxical. Even
though the overall P value is very low, all of the individual P values
are high. This means that the model fits the data well, even though
none of the X variables has a statistically significant impact on
predicting Y. How is this possible? When two X variables are highly
correlated, they both convey essentially the same information. In this
case, neither may contribute significantly to the model after the
other one is included. But together they contribute a lot. If you
removed both variables from the model, the fit would be much worse. So
the overall model fits the data well, but neither X variable makes a
significant contribution when it is added to your model last. When
this happens, the X variables are collinear and the results show
multicollinearity."
It seems unlikely that there is multicollinearity in these data. A number
of methods for detecting multicollinearity are described at
http://www.xycoon.com/detection.htm , but we can stick to a simple one. If
the t-statistics for both variables, CLASS_SIZE and SCORE, were
statistically insignificant while the predictive power of the two together
was high, multicollinearity would be likely. In this case, however, the
critical value of the t-statistic is about 2.12, so the t-statistic for
SCORE is significant while that for CLASS_SIZE is not. Alternatively, if
the P-values for both variables were high we might suspect
multicollinearity, but here the P-value for SCORE is essentially zero.
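If you want a more formal check than eyeballing the t-statistics, a common
screen is the variance inflation factor (VIF), or simply the correlation
between the two regressors. Here is a minimal sketch, again using
hypothetical placeholder data rather than your actual observations; a rule
of thumb treats VIFs above roughly 5-10 as a warning sign.

    # Screen for multicollinearity with variance inflation factors and the
    # pairwise correlation between the regressors (placeholder data only).
    import pandas as pd
    import statsmodels.api as sm
    from statsmodels.stats.outliers_influence import variance_inflation_factor

    df = pd.DataFrame({
        "CLASS_SIZE": [30, 22, 27, 18, 15, 28],             # placeholder values
        "SCORE":      [950, 1100, 1010, 1180, 1250, 980],   # placeholder values
    })

    X = sm.add_constant(df)
    for i, name in enumerate(X.columns):
        if name != "const":              # the intercept's VIF is not meaningful
            print(name, variance_inflation_factor(X.values, i))

    # The pairwise correlation between the regressors tells a similar story.
    print(df["CLASS_SIZE"].corr(df["SCORE"]))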
For the last part of the question, I refer you to another article
lifted from the graphpad site here
http://www.graphpad.com/instatman/Ismulticollinearityaproblem_.htm
"The term multicollinearity is as hard to understand as it is to say.
But it is important to understand, as multicollinearity can interfere
with proper interpretation of multiple regression results. To
understand multicollinearity, first consider an absurd example.
Imagine that you are running multiple regression to predict blood
pressure from age and weight. Now imagine that you've entered
weight-in-pounds and weight-in-kilograms as two separate X variables.
The two X variables measure exactly the same thing - the only
difference is that the two variables have different units. The P value
for the overall fit is likely to be low, telling you that blood
pressure is linearly related to age and weight. Then you'd look at the
individual P values. The P value for weight-in-pounds would be very
high - after including the other variables in the equation, this one
adds no new information. Since the equation has already taken into
account the effect of weight-in-kilograms on blood pressure, adding
the variable weight-in-pounds to the equation adds nothing. But the P
value for weight-in-kilograms would also be high for the same reason.
After you include weight-in-pounds to the model, the goodness-of-fit
is not improved by including the variable weight-in-kilograms. When
you see these results, you might mistakenly conclude that weight does
not influence blood pressure at all since both weight variables have
very high P values. The problem is that the P values only assess the
incremental effect of each variable. In this example, neither variable
has any incremental effect on the model. The two variables are
collinear.
That example is a bit absurd, since the two variables are identical
except for units. The blood pressure example -- model blood pressure
as a function of age, weight and gender - is more typical. It is hard
to separate the effects of age and weight, if the older subjects tend
to weigh more than the younger subjects. It is hard to separate the
effects of weight and gender if the men weigh more than the women.
Since the X variables are intertwined, multicollinearity will make it
difficult to interpret the multiple regression results."
I think that pretty much sums up why overall prediction can remain
acceptable when multicollinearity is present, even though the individual
parameter estimates may be distorted.
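To see this numerically, here is a small simulated version of the
pounds-versus-kilograms example from the quote (all numbers invented for
illustration). The overall F-test comes out highly significant, yet each
weight variable individually looks insignificant once the other is in the
model.

    # Simulate two nearly identical regressors (weight in kg and in pounds)
    # and regress blood pressure on both.  The overall fit is strong, but the
    # individual P-values are large because the two variables are collinear.
    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(0)
    n = 50
    weight_kg = rng.normal(75, 10, n)
    weight_lb = weight_kg * 2.20462 + rng.normal(0, 0.01, n)   # same information
    blood_pressure = 90 + 0.5 * weight_kg + rng.normal(0, 5, n)

    X = sm.add_constant(np.column_stack([weight_kg, weight_lb]))
    fit = sm.OLS(blood_pressure, X).fit()

    print("overall F-test P-value:", fit.f_pvalue)    # very small: the model fits
    print("individual P-values:", fit.pvalues[1:])    # both large: neither helps alone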
This has been a bit of a long answer, and I hope it is clear. As always,
if anything is unclear, please ask for clarification.
Hibiscus