Hello,
In answering this question, I have marked the lines quoted from your
output with >.
> The regression equation is
> Total£ = - 36280 + 84.2 GFarea + 20629 Bedrooms
In regression, we are trying to get an estimate which best predicts
the outcome. The outcome in this case is Total£. If we know GFArea
and Bedrooms, our best guess at Total£ is given by this equation.
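If it helps to see the equation used, here is a minimal Python sketch. Only the three coefficients come from your output; the function name predict_total and the example house (2000 units of ground floor area, 3 bedrooms) are made up for illustration.

```python
def predict_total(gfarea, bedrooms):
    """Best guess at Total£ from the fitted equation in the Minitab output."""
    return -36280 + 84.2 * gfarea + 20629 * bedrooms


# Hypothetical house, not one of the cases in the data.
example_prediction = predict_total(2000, 3)
print(round(example_prediction))  # 194007
```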
> 79 cases used 2 cases contain missing values
You had 81 rows in your data, however two of these rows contained
missing data, for at least one of the variables, and so were not used
in the analysis.
> Predictor      Coef   SE Coef       T       P
> Constant     -36280     20143   -1.80   0.076
> GFarea       84.198     9.779    8.61   0.000
> Bedrooms      20629      4903    4.21   0.000
This next part gives some of the details of the equation.
Each of the estimates (coefficients, indicated with Coef) has a
standard error - this is a measure of how variable the estimate is
likely to be. To get the 95% confidence interval for a coefficient,
we multiply the standard error by 1.96, and add and subtract this
from the coefficient.
So our best guess at the coefficient for GFarea is 84. However, this
estimate has a standard error of (approximately) 10. So the
confidence limits are given by 84 + 1.96 x 10 = 104 and
84 - 1.96 x 10 = 64. If we were to say that the true (population)
value for the coefficient is likely to be from 64 to 104, there is
only a 5% chance (1 in 20) that we are wrong. That is, one more unit
of GFarea adds between 64 and 104 units of Total£, for a house with
the same number of bedrooms.
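The same arithmetic can be sketched in Python, using the unrounded Coef and SE Coef from the output. The function name is my own, and 1.96 is the large-sample multiplier; with 76 residual degrees of freedom the exact multiplier from the t distribution is slightly larger (roughly 1.99), so treat this as an approximation.

```python
def confidence_interval_95(coef, se, multiplier=1.96):
    """Approximate 95% confidence limits: coefficient plus or minus 1.96 standard errors."""
    half_width = multiplier * se
    return coef - half_width, coef + half_width


# Coef and SE Coef for GFarea, copied from the Minitab output.
low, high = confidence_interval_95(84.198, 9.779)
print(f"{low:.1f} to {high:.1f}")  # 65.0 to 103.4
```

With the unrounded figures the interval comes out a little narrower than the 64 to 104 I get from rounding to 84 and 10, but the interpretation is the same.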
The next part of the model is T. T is given by Coef / SE Coef.
So, for the constant: -36280 / 20143 = -1.80
T isn't very useful on its own, but it does give us P - that is the
probability of getting a result at least this far from zero, if the
real value in the population is zero. The fact that GFarea and
Bedrooms both have low probabilities (less than 0.0005) means that it
is very unlikely you would have found this result, if in fact they
had no effect.
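As a rough check, the T column can be reproduced in Python. The P values in this sketch are only an approximation: I am using the normal distribution from the standard library in place of the t distribution with 76 degrees of freedom that Minitab uses, so the P for the constant comes out at about 0.07 rather than 0.076. The function names are my own.

```python
from statistics import NormalDist

# (Coef, SE Coef) for each row, copied from the Minitab output.
rows = {
    "Constant": (-36280, 20143),
    "GFarea": (84.198, 9.779),
    "Bedrooms": (20629, 4903),
}


def t_ratio(coef, se):
    """T is the coefficient divided by its standard error."""
    return coef / se


def approx_two_sided_p(t):
    """Two-sided P from the normal distribution (an approximation to the t-based P)."""
    return 2 * (1 - NormalDist().cdf(abs(t)))


for name, (coef, se) in rows.items():
    t = t_ratio(coef, se)
    print(f"{name:9s} T = {t:6.2f}  approx P = {approx_two_sided_p(t):.3f}")
```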
The constant is a special variable - this is the estimated value of
Total£ when all of the other predictors are zero. It often makes no
sense - as is the case here (I am guessing). Saying that the value of
a house with 0 bedrooms and a ground floor area of 0 is -36280 is
obviously silly.
> S = 49455 R-Sq = 66.5% R-Sq(adj) = 65.7%
OK, so this equation tells us the best guess at Total£, but the next
question is, how good is that? This is given by R-Sq - or R-Squared.
R-Squared is the proportion of variance in the Total£ which is
explained by the predictors - in this case, it is 66.5% - quite a high
prediction. If you take this as 0.665, and find the square root, it
is about 0.82. This is the correlation between the predicted score
(given by the equation) and the actual score.
The next question to ask is whether this is a good prediction - i.e.
is this prediction better than chance.
> Source            DF            SS            MS       F       P
> Regression         2   3.69762E+11   1.84881E+11   75.59   0.000
> Residual Error    76   1.85878E+11    2445767700
> Total             78   5.55641E+11
>
> Source      DF        Seq SS
> GFarea       1   3.26460E+11
> Bedrooms     1   43302466403
This is answered by the next section of the output, quoted above.
This is usually not reported in depth, so I am not going to cover it
here, but request clarification if you need it. The P value again
tells us whether we can make a significant prediction, or whether we
are better off guessing. Because this p-value is very low, the
prediction is highly significant, and better than guessing. The most
common thing here is to report the p-value (again, it's <0.0005, it's
not 0.000). You might also want to report the F, and the DF, in which
case it's F = 75.6, df = 2, 76.
> Obs   GFarea   Total£      Fit   SE Fit   Residual   StResid
>   8     2504   381984   257070     9787     124914     2.58R
>  18     1996   100300   214297     6362    -113997    -2.32R
>  49     3076   274958   284602    17274      -9644    -0.21 X
>  53     1980   374256   295467    19406      78789     1.73 X
>  54     3501   453744   382274    17160      71470     1.54 X
> R denotes an observation with a large standardized residual
> X denotes an observation whose X value gives it large influence.
The final part of the output is some diagnostics, to help you to
interpret the equation. Minitab has selected some cases it believes
you might want to look at. It bases this on the residuals and the
influence.
First, the residuals. The residual is the difference between the
value we would expect, given GFArea and Bedrooms, and what we actually
have. Large residuals are marked with an R. Case 8 has a very large
residual - its value for Total£ is 124914 higher than would be
expected. Similarly, 18 has a much lower value than would be
expected. It is worth looking at these to see if there has been an
error entering the data, or there is something unusual about them.
Maybe they are in a very different area to the others, maybe they are
paved with gold, or come with a free farm (I am guessing, because I
have no idea what the data are about - I am sure you can think of
something more sensible).
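To make the definition of a residual concrete, here is a small Python sketch that recomputes the Residual column from the Total£ and Fit columns for the five cases Minitab flagged. The variable and function names are my own; the numbers are copied from the output.

```python
# (Obs, Total£, Fit) for the cases listed in the output.
flagged = [
    (8, 381984, 257070),
    (18, 100300, 214297),
    (49, 274958, 284602),
    (53, 374256, 295467),
    (54, 453744, 382274),
]


def residual(actual, fit):
    """Residual = actual value minus the value predicted by the equation."""
    return actual - fit


residuals = {obs: residual(total, fit) for obs, total, fit in flagged}
for obs, res in residuals.items():
    print(obs, res)
```

A positive residual means the case has a higher Total£ than the equation predicts, and a negative one means it is lower.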
Second, the influential cases, marked with an X. An influential case
is more important than the others in determining the values of the
coefficients. It isn't necessarily anything to worry about, but again
is worth checking. (I am not going to go into detail on this, because
of your time limit, but if you would like, just request
clarification).
I have written this fairly swiftly, because of your time limit, so if
you think that I have missed anything out, or would like clarification
on anything, please ask, before rating the question.
I will recheck the page fairly regularly for the rest of the day, to
see if a clarification request appears.
Here's a useful site:
http://www.fw.umn.edu/biochr/assoc/dho/107/notes/minitab/REGRESS1.HTM
Here's a useful book. :)
http://www.amazon.co.uk/exec/obidos/ASIN/0761962301/qid%3D1006938682/sr%3D1-1/ref%3Dsr%5Fsp%5Fre/026-9955570-8940457
jeremymiles-ga
Clarification of Answer by jeremymiles-ga on 28 Apr 2003 10:39 PDT
Sure. F is a statistic that is used in ANOVA and regression.
It's kind of hard to tell with your numbers (because they are very
large), but F is given by MSregression / MSError, where MS stands for
"mean squares".
The mean squares are given by the SS (sums of squares) divided by the
degrees of freedom.
So, what are the sums of squares? The sums of squares are what
regression is all about - you might have heard of regression being
called OLS (ordinary least squares) regression - these are the squares
that make the sums of squares. "Sums of squares" is short for sums
of squared deviations from the regression.
The first sum of squares we have is the total sum of squares. This
is calculated by finding the difference between each value and the
mean, squaring it, and then adding the squares up.
The mean is a bit like a regression equation with no predictor
variables. If you want to find a value, such that you minimise the
sum of squares, you will end up with the mean. Here's an example:
Take the numbers: 3, 4, 5, 6
Calculate the mean: 4.5
Find the difference between each value and the mean, and square it:
Value   Difference   Squared
  3        -1.5        2.25
  4        -0.5        0.25
  5         0.5        0.25
  6         1.5        2.25
Add up the squares: 5.
This gives the total sum of squares. It is also impossible to pick
another number and find a lower value for the sum of squares, so we
can think of the mean as a least squares estimator. (But we are
getting off the point). (The sum of squares is also given by the
variance * (N - 1), when the variance is calculated in the usual way
for a sample; the variance is the square of the standard deviation).
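If you want to convince yourself that the mean really does give the lowest sum of squares, here is a short Python sketch using the same four numbers. The list of alternative values to try is arbitrary, and the names are my own.

```python
values = [3, 4, 5, 6]


def sum_of_squares(data, centre):
    """Sum of squared differences between each value and the chosen centre."""
    return sum((x - centre) ** 2 for x in data)


mean = sum(values) / len(values)
ss_about_mean = sum_of_squares(values, mean)
print(mean, ss_about_mean)  # 4.5 5.0

# Try a few other candidate values: none of them beats the mean.
candidates = [4.0, 4.25, 4.5, 4.75, 5.0]
for c in candidates:
    print(c, sum_of_squares(values, c))
```

The candidates on either side of 4.5 give 5.25 and 6, both higher than the 5 we get from the mean.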
Now, where were we.
Now we get our regression equation, and we calculate a predicted
value for each case and find the residual. If we square the
residuals, and add them up, we have the error sum of squares - the
point of regression is to find the values for the equation that give
the lowest possible error sum of squares - this is the least squares
that we are talking about. This will never be higher than the total
sum of squares, but our question is, how much lower.
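Here is a toy illustration in Python, since I don't have your raw data. The x and y values are invented (y is the same four numbers as before, in a different order, so the total sum of squares is again 5), and there is only one predictor to keep the least squares formulas short.

```python
# Invented data for illustration only: one predictor, four cases.
x = [1, 2, 3, 4]
y = [3, 5, 4, 6]


def fit_line(xs, ys):
    """Least squares intercept and slope for a single predictor."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(xs, ys))
    sxx = sum((a - mean_x) ** 2 for a in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


intercept, slope = fit_line(x, y)
predicted = [intercept + slope * a for a in x]

mean_y = sum(y) / len(y)
ss_total = sum((b - mean_y) ** 2 for b in y)                # deviations from the mean
ss_error = sum((b - p) ** 2 for b, p in zip(y, predicted))  # deviations from the line
print(round(ss_total, 6), round(ss_error, 6))  # 5.0 1.8
```

Using the regression line instead of the mean brings the sum of squares down from 5 to 1.8 in this made-up example.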
We have SStotal, and SSerror. We can calculate the regression sum of
squares by subtracting error from total:
SSregression = SStotal - SSerror.
[You might notice at this point that the R-Sq is equal to
SSregression / SStotal - hence it is the proportion of variance
explained (SStotal being proportional to the variance).]
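You can check both of these relationships against the SS column of your output with a few lines of Python. The variable names are my own; the three sums of squares are copied from the table, so the subtraction only agrees to within Minitab's rounding.

```python
# SS column from the analysis of variance table in the output.
ss_regression = 3.69762e11
ss_error = 1.85878e11
ss_total = 5.55641e11

# SSregression = SStotal - SSerror (agrees up to rounding in the printed table).
ss_regression_by_subtraction = ss_total - ss_error

# R-Sq is the regression sum of squares as a proportion of the total.
r_squared = ss_regression / ss_total
print(f"R-Sq = {r_squared:.1%}")  # R-Sq = 66.5%
```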
This gives the three SS that we have in the table. The trouble with
sums of squares is that they are dependent on things like the number
of cases, and the number of variables, so we need to divide each by
its degrees of freedom to get the mean squares. The regression df is
given by k (where k is the number of predictor variables) and the
error df is given by N - k - 1.
We divide the MSregression by the MSerror, and we have F. So, the
smaller the error the larger the F. In addition, more predictors, for
the same value of R-Sq, gives a lower F.
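Putting the steps together for your output, a short Python sketch (names are my own; N = 79 is the number of cases used, and k = 2 is the number of predictors):

```python
n_cases = 79       # cases used in the analysis
k_predictors = 2   # GFarea and Bedrooms

df_regression = k_predictors
df_error = n_cases - k_predictors - 1

# Mean squares: each sum of squares divided by its degrees of freedom.
ms_regression = 3.69762e11 / df_regression
ms_error = 1.85878e11 / df_error

f_statistic = ms_regression / ms_error
print(df_regression, df_error, round(f_statistic, 2))  # 2 76 75.59
```

This reproduces the DF, MS and F columns of the table, give or take rounding in the printed sums of squares.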
F is a test statistic, which has a distribution which depends on the
degrees of freedom, which is why you need to report the degrees of
freedom when you report F. What you are asking is "if there was no
effect in my data, what is the probability of getting a value of F
this high?" High values of F are associated with low probabilities,
which means that it is unlikely that your data are a fluke, and that
you have a real result.
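For your particular output the probability can even be worked out by hand, because when the regression has exactly 2 degrees of freedom the upper tail of the F distribution has a simple closed form: (1 + 2F / df_error) raised to the power -df_error / 2. The Python sketch below uses that shortcut; it only applies when the first df is 2, and for any other number of predictors you would need an F distribution routine from a statistics package. The function name is my own.

```python
def f_upper_tail_df1_2(f_value, df_error):
    """P(F > f_value) for an F distribution with 2 and df_error degrees of freedom.

    Closed form that is valid only when the numerator df is exactly 2.
    """
    return (1 + 2 * f_value / df_error) ** (-df_error / 2)


# F and error df from the analysis of variance table.
p_value = f_upper_tail_df1_2(75.59, 76)
print(p_value < 0.0005)  # True
```

The probability is far below 0.0005, which is why Minitab prints it as 0.000.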
Again, please feel free to request clarification if any of this is
unclear.
jeremymiles-ga