Google Answers: Minitab/ multiple regression

Question

Subject: Minitab/ multiple regression
Category: Business and Money
Asked by: san007-ga
List Price: $20.00

Posted: 28 Apr 2003 06:06 PDT
Expires: 28 May 2003 06:06 PDT
Question ID: 196474

Hi

I'm trying to get to grips with the program Minitab, mainly multiple
regression. Could someone explain each area of the results table
shown below, especially what each part means? This is just an
example, so I just need to know how each value works, i.e. what p
values are, what effect they have, etc.

I need this in the next 4 hours. Thanks very much.

Regression Analysis: Total£ versus GFarea, Bedrooms

The regression equation is
Total£ = - 36280 + 84.2 GFarea + 20629 Bedrooms

79 cases used 2 cases contain missing values

Predictor        Coef     SE Coef          T        P
Constant       -36280       20143      -1.80    0.076
GFarea         84.198       9.779       8.61    0.000
Bedrooms        20629        4903       4.21    0.000

S = 49455       R-Sq = 66.5%     R-Sq(adj) = 65.7%

Analysis of Variance

Source            DF          SS          MS         F        P
Regression         2 3.69762E+11 1.84881E+11     75.59    0.000
Residual Error    76 1.85878E+11  2445767700
Total             78 5.55641E+11

Source       DF      Seq SS
GFarea        1 3.26460E+11
Bedrooms      1 43302466403

Unusual Observations

Obs     GFarea     Total£         Fit      SE Fit    Residual    St Resid
  8       2504     381984      257070        9787      124914       2.58R
 18       1996     100300      214297        6362     -113997      -2.32R
 49       3076     274958      284602       17274       -9644      -0.21 X
 53       1980     374256      295467       19406       78789       1.73 X
 54       3501     453744      382274       17160       71470       1.54 X

R denotes an observation with a large standardized residual
X denotes an observation whose X value gives it large influence.

Answer

Subject: Re: Minitab/ multiple regression
Answered By: jeremymiles-ga on 28 Apr 2003 07:52 PDT
Rated: 5 out of 5 stars

Hello,

In answering this question, I shall precede the lines of your original output with >'s.

> The regression equation is
> Total£ = - 36280 + 84.2 GFarea + 20629 Bedrooms

In regression, we are trying to find the equation which best predicts the outcome. The outcome in this case is Total£. If we know GFarea and Bedrooms, our best guess at Total£ is given by this equation.

> 79 cases used 2 cases contain missing values

You had 81 rows in your data. Two of these rows contained missing data for at least one of the variables, and so were not used in the analysis.

> Predictor        Coef     SE Coef          T        P
> Constant       -36280       20143      -1.80    0.076
> GFarea         84.198       9.779       8.61    0.000
> Bedrooms        20629        4903       4.21    0.000

This next part gives some of the details of the equation. Each of the estimates (coefficients, indicated with Coef) has a standard error - this is a measure of how variable the estimate is likely to be. To get the (approximate) 95% confidence interval of a coefficient, we multiply the standard error by 1.96, and add and subtract this from the coefficient. So our best guess at the coefficient for GFarea is 84. However, this estimate has a standard error of (approximately) 10. So the confidence limits are given by 84 + 1.96 x 10 = 104 and 84 - 1.96 x 10 = 64. If we were to say that the true (population) value for the coefficient is likely to be from 64 to 104, there is only a 5% chance (1 in 20) that we are wrong. That is, one more unit of GFarea adds between 64 and 104 units of Total£ (with Bedrooms held constant).

The next part of the model is T. T is given by Coef / SE Coef. So for the constant: -36280 / 20143 = -1.80. T isn't very useful on its own, but it does give us P - that is the probability of getting a result at least this far from zero, if the real value in the population is zero. The fact that GFarea and Bedrooms both have low probabilities (less than 0.0005) means that it is very unlikely you would have found this result, if in fact they had no effect.

The constant is a special variable - this is the estimated value of Total£ when all of the other predictors are zero. It often makes no sense - as is the case here (I am guessing). Saying that the value of a house with 0 bedrooms and a ground floor area of 0 is -36280 is obviously silly.

> S = 49455       R-Sq = 66.5%     R-Sq(adj) = 65.7%

OK, so this equation tells us the best guess at Total£, but the next question is, how good is that guess? This is given by R-Sq - or R-Squared. R-Squared is the proportion of variance in Total£ which is explained by the predictors - in this case, it is 66.5%, which is quite high. If you take this as 0.665, and find the square root, it is about 0.82. This is the correlation between the predicted score (given by the equation) and the actual score.

The next question to ask is whether this is a good prediction - i.e. is this prediction better than chance.

> Source            DF          SS          MS         F        P
> Regression         2 3.69762E+11 1.84881E+11     75.59    0.000
> Residual Error    76 1.85878E+11  2445767700
> Total             78 5.55641E+11
>
> Source       DF      Seq SS
> GFarea        1 3.26460E+11
> Bedrooms      1 43302466403

This is answered by the next section. This is usually not reported in depth, so I am not going to cover it here, but request clarification if you need it. The P value again tells us whether we can make a significant prediction, or whether we are better off guessing. Because this p-value is very low, the prediction is highly significant, and better than guessing. The most common thing here is to report the p-value (again, it is less than 0.0005, it is not 0.000). You might also want to report the F and the DF, in which case it's F = 75.6, df = 2, 76.

> Obs     GFarea     Total£         Fit      SE Fit    Residual    St Resid
>   8       2504     381984      257070        9787      124914       2.58R
>  18       1996     100300      214297        6362     -113997      -2.32R
>  49       3076     274958      284602       17274       -9644      -0.21 X
>  53       1980     374256      295467       19406       78789       1.73 X
>  54       3501     453744      382274       17160       71470       1.54 X
>
> R denotes an observation with a large standardized residual
> X denotes an observation whose X value gives it large influence.

The final part of the output is some diagnostics, to help you to interpret the equation. Minitab has selected some cases it believes you might want to look at. It bases this on the residuals and the influence.

First, the residuals. The residual is the difference between the value we actually have and the value we would expect, given GFarea and Bedrooms. Large residuals are marked with an R. Case 8 has a very large residual - its value for Total£ is 124914 higher than would be expected. Similarly, case 18 has a much lower value than would be expected. It is worth looking at these to see if there has been an error entering the data, or there is something unusual about them. Maybe they are in a very different area to the others, maybe they are paved with gold, or come with a free farm (I am guessing, because I have no idea what the data are about - I am sure you can think of something more sensible).

Second, the influential cases, marked with an X. An influential case is more important than the others in determining the values of the coefficients. It isn't necessarily anything to worry about, but again is worth checking. (I am not going to go into detail on this, because of your time limit, but if you would like, just request clarification.)

I have written this fairly swiftly, because of your time limit, so if you think that I have missed anything out, or would like clarification on anything, please ask before rating the question. I will recheck the page fairly regularly for the rest of the day, to see if a clarification request appears.

Here's a useful site:
http://www.fw.umn.edu/biochr/assoc/dho/107/notes/minitab/REGRESS1.HTM

Here's a useful book. :)
http://www.amazon.co.uk/exec/obidos/ASIN/0761962301/qid%3D1006938682/sr%3D1-1/ref%3Dsr%5Fsp%5Fre/026-9955570-8940457

jeremymiles-ga
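The arithmetic behind the coefficient table and the unusual observations in the answer above can be checked by hand. What follows is a minimal Python sketch, not part of the original answer. It recomputes T = Coef / SE Coef, the approximate 95% intervals (using the 1.96 multiplier from the answer, which is a large-sample approximation), the correlation implied by R-Sq, and the residuals as Total£ minus Fit. The numbers are copied from the Minitab output; the function and variable names are illustrative assumptions.

```python
import math

# (Coef, SE Coef) pairs copied from the Minitab coefficient table.
coefficients = {
    "Constant": (-36280.0, 20143.0),
    "GFarea": (84.198, 9.779),
    "Bedrooms": (20629.0, 4903.0),
}

# Multiplier used in the answer; an approximation to the exact t value.
MULTIPLIER_95 = 1.96


def t_statistic(coef, se):
    """T is the coefficient divided by its standard error."""
    return coef / se


def confidence_interval(coef, se, multiplier=MULTIPLIER_95):
    """Approximate interval: coefficient plus and minus multiplier * SE."""
    half_width = multiplier * se
    return coef - half_width, coef + half_width


for name, (coef, se) in coefficients.items():
    low, high = confidence_interval(coef, se)
    print(f"{name:9s} T = {t_statistic(coef, se):6.2f}  "
          f"95% CI = ({low:.1f}, {high:.1f})")

# Correlation between predicted and actual Total£ implied by R-Sq = 66.5%.
multiple_r = math.sqrt(0.665)
print(f"Multiple R = {multiple_r:.3f}")

# (Obs, Total£, Fit) copied from the Unusual Observations table.
unusual = [
    (8, 381984, 257070),
    (18, 100300, 214297),
    (49, 274958, 284602),
    (53, 374256, 295467),
    (54, 453744, 382274),
]

# Residual = observed value minus fitted value.
residuals = {obs: total - fit for obs, total, fit in unusual}
for obs, resid in residuals.items():
    print(f"Obs {obs:2d} residual = {resid}")
```

With the unrounded coefficient and standard error, the interval for GFarea comes out at roughly 65 to 103, slightly narrower than the 64 to 104 obtained from the rounded figures 84 and 10.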
Request for Answer Clarification by san007-ga on 28 Apr 2003 09:40 PDT

Hi, is it possible to have further clarification on what the F part of this is, please? It would help me out greatly.

> Source            DF          SS          MS         F        P
> Regression         2 3.69762E+11 1.84881E+11     75.59    0.000
> Residual Error    76 1.85878E+11  2445767700
> Total             78 5.55641E+11
>
> Source       DF      Seq SS
> GFarea        1 3.26460E+11
> Bedrooms      1 43302466403
Clarification of Answer by jeremymiles-ga on 28 Apr 2003 10:39 PDT

Sure. F is a statistic that is used in ANOVA and regression. It's kind of hard to tell with your numbers (because they are very large), but F is given by MSregression / MSerror, where MS stands for "mean squares". The mean squares are given by the SS (sums of squares) divided by the degrees of freedom.

So, what are the sums of squares? The sums of squares are what regression is all about - you might have heard of regression being called OLS (ordinary least squares) regression - these are the squares that make the sums of squares. "Sums of squares" is short for sums of squared deviations.

The first sum of squares we have is the total sum of squares. This is calculated by finding the difference between each value and the mean, squaring it, and then adding them up. The mean is a bit like a regression equation with no predictor variables. If you want to find a value such that you minimise the sum of squares, you will end up with the mean. Here's an example:

Take the numbers: 3 4 5 6
Calculate the mean: 4.5
Find the difference between each value and the mean, and square it:

Value   Difference   Squared
3       -1.5         2.25
4       -0.5         0.25
5        0.5         0.25
6        1.5         2.25

Add up the squares: 5. This gives the total sum of squares. It is also impossible to pick another number and find a lower value for the sum of squares, so we can think of the mean as a least squares estimator. (But we are getting off the point.) (The sum of squares is also given by the population variance * N, or equivalently the usual sample variance * (N - 1); the variance is the square of the standard deviation.)

Now, where were we. Now we get our regression equation, and we calculate a predicted value for each case and find the residual. If we square the residuals, and add them up, we have the error sum of squares - the point of regression is to find values for the equation that give the smallest error sum of squares - this is the least squares that we are talking about. This will never be higher than the total sum of squares, but our question is, how much lower is it.

We have SStotal, and SSerror. We can calculate the regression sum of squares by subtracting error from total: SSregression = SStotal - SSerror. [You might notice at this point that R-Sq is equal to SSregression / SStotal - hence it is the proportion of variance explained (SStotal being the total variation).] This gives the three SS that we have in the table.

The trouble with sums of squares is that they are dependent on things like the number of cases and the number of variables, so we need to divide each one by its degrees of freedom. The regression df is given by k (where k is the number of predictor variables) and the error df is given by N - k - 1. Dividing each SS by its df gives the mean squares, and we divide MSregression by MSerror to get F. So, the smaller the error the larger the F. In addition, more predictors, for the same value of R-Sq, give a lower F.

F is a test statistic, which has a distribution which depends on the degrees of freedom, which is why you need to report the degrees of freedom when you report F. What you are asking is "if there was no effect in my data, what is the probability of getting a value of F this high?" High values of F are associated with low probabilities, which means that it is unlikely that your result is a fluke, and likely that you have a real result.

Again, please feel free to request clarification if any of this is unclear.

jeremymiles-ga
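The sums-of-squares example and the F calculation described above can be reproduced in a few lines. This is a hedged sketch in Python rather than anything Minitab itself runs: the SS values and case counts are copied from the ANOVA table in the question, and the helper name sum_of_squares_about is an illustrative assumption.

```python
def sum_of_squares_about(values, centre):
    """Sum of squared differences between each value and a chosen centre."""
    return sum((v - centre) ** 2 for v in values)


# The small worked example: 3 4 5 6.
values = [3, 4, 5, 6]
mean = sum(values) / len(values)
example_ss_total = sum_of_squares_about(values, mean)
print("Mean:", mean, "Total SS:", example_ss_total)

# Any centre other than the mean gives a larger sum of squares.
print("SS about 4.0:", sum_of_squares_about(values, 4.0))

# Sums of squares copied from the ANOVA table (rounded as printed).
ss_regression = 3.69762e11
ss_error = 1.85878e11
ss_total = 5.55641e11

n_cases = 79       # cases used in the analysis
k_predictors = 2   # GFarea and Bedrooms

df_regression = k_predictors
df_error = n_cases - k_predictors - 1

# Mean squares are the sums of squares divided by their degrees of freedom.
ms_regression = ss_regression / df_regression
ms_error = ss_error / df_error

# F is the ratio of the two mean squares.
f_statistic = ms_regression / ms_error

# R-Sq is the regression share of the total sum of squares.
r_squared = ss_regression / ss_total

print(f"df = {df_regression}, {df_error}")
print(f"F = {f_statistic:.2f}")
print(f"R-Sq = {r_squared:.1%}")
```

Run as written, F comes out at about 75.59 and R-Sq at about 66.5%, matching the Minitab output. Small differences in the last digits of the mean squares arise because the table prints rounded sums of squares.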
Request for Answer Clarification by san007-ga on 28 Apr 2003 11:16 PDT

You have been very helpful, by the way!! If you can and want to, could you clarify how effective you think this model is, or how I can find that out? I think I'm going to have to alter it. Do you know how hypothesis testing (null etc.) would figure in this model?
Clarification of Answer by jeremymiles-ga on 28 Apr 2003 11:23 PDT

The model has a pretty good R-Sq, most people would be happy with that - it's rare to see one higher. This is the most common way to statistically consider a model. The cases with high residuals and influence statistics are not a cause of excessive worry, IMHO.

What do you plan to do to alter the model?

[Given your time limit, I will post this and continue to the next part of your question.]
Clarification of Answer by jeremymiles-ga on 28 Apr 2003 11:26 PDT

The null hypothesis is usually the hypothesis of zero effect. There are two sets of null hypotheses tested in a regression analysis.

First is the null hypothesis that your predictions are no better than chance. This is tested using the F test of the R-Sq. Your data have easily rejected this null hypothesis.

Second is the set of null hypotheses about the parameter estimates or coefficients. Here, for GFarea and Bedrooms, you have rejected the null hypotheses of no effect. For the constant (sometimes called the intercept) the null hypothesis is not rejected (if you use 0.05 as your cut-off); however, as already discussed, the intercept in this case isn't especially useful.
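The two sets of decisions described above can be laid out as a small check against the 0.05 cut-off. This is only an illustrative Python sketch: Minitab prints p-values below 0.0005 as 0.000, so those entries are stored here as the upper bound 0.0005 (an assumption made for the sake of the example), and the names are hypothetical.

```python
ALPHA = 0.05  # cut-off mentioned in the answer

# P values from the output; 0.0005 stands in for "printed as 0.000".
reported_p = {
    "Overall F test": 0.0005,
    "GFarea": 0.0005,
    "Bedrooms": 0.0005,
    "Constant": 0.076,
}


def reject_null(p_value, alpha=ALPHA):
    """Reject the null hypothesis of zero effect when p is below the cut-off."""
    return p_value < alpha


decisions = {name: reject_null(p) for name, p in reported_p.items()}
for name, rejected in decisions.items():
    verdict = "rejected" if rejected else "not rejected"
    print(f"{name:15s} null hypothesis {verdict}")
```

Only the constant fails to reach the cut-off, which agrees with the reading given in the clarification.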
Request for Answer Clarification by san007-ga on 28 Apr 2003 13:20 PDT

I was thinking of taking the rooms variable out, as the R-Sq goes higher. Is it possible to take out the constant?
Clarification of Answer by jeremymiles-ga on 29 Apr 2003 02:16 PDT

R-Sq is higher without the rooms variable? That surprises me. Can you post the correlations between the variables? I can calculate the R-Sq then.

You can remove the intercept - this is called "regression through the origin" - I am not sure if you can do this in Minitab, but it is usually not recommended.

This page: http://mallit.fr.umn.edu/fr5218/reg_refresh/origin.html
Says:
"Regression through the origin should not be used indiscriminately. The variability of the data about the regression line should be compared to that for the equation including b0 (that is, compare corresponding sy.xs). A plot of studentized residuals versus X should be made as well. If a non-horizontal linear trend is apparent, a nonzero intercept should be suspected."

This one: http://www.graphpad.com/instatman/Whentoforcearegressionlinethroughtheorigin_.htm
Says:
"Therefore you might be tempted to force the regression line through the origin. But you don't particularly care where the line is in the vicinity of the origin. You really care only that the line fits the standards very well near the unknowns. You will probably get a better fit by not constraining the line."

Just a stab in the dark here, but have you considered a non-linear effect of GFarea - a small house getting an extra 10m^2 might make a bigger difference than a large house getting an extra 10m^2. Also, have you considered an interaction effect? That is, the effect of GFarea may not be constant across all numbers of bedrooms. (Although these effects may have to be quite strong to detect them with your sample size.) These are two ways to increase the value of R-Sq.

jeremymiles-ga
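Both suggestions above (a non-linear effect of GFarea and a GFarea by Bedrooms interaction) amount to adding extra columns to the data before re-running the regression. Below is a minimal Python sketch, assuming a squared term is an acceptable way to model the non-linearity (the answer does not specify one). The example rows and the column names GFarea_sq and GFarea_x_Bedrooms are hypothetical, not the asker's data.

```python
def expand_predictors(gfarea, bedrooms):
    """Return the original predictors plus a squared and an interaction term."""
    return {
        "GFarea": gfarea,
        "Bedrooms": bedrooms,
        # Squared term: lets the effect of floor area change with size.
        "GFarea_sq": gfarea ** 2,
        # Interaction term: lets the effect of floor area depend on bedrooms.
        "GFarea_x_Bedrooms": gfarea * bedrooms,
    }


# Hypothetical example rows (GFarea, Bedrooms); not taken from the real data.
example_rows = [(1500, 2), (2000, 3), (3000, 4)]

expanded = [expand_predictors(g, b) for g, b in example_rows]
for row in expanded:
    print(row)
```

The new columns would then be entered as additional predictors alongside GFarea and Bedrooms, and the change in R-Sq and the P values of the new terms would show whether they are worth keeping.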

san007-ga rated this answer: 5 out of 5 stars

Amazingly accurate and very very helpful as always. Thank you very
much and if you have a chance could you answer my clarification.
Thanks so much

P.S. I would love to tip but have no funds at present, though I expect
we will meet again in the future as you have been most helpful.

Comments

There are no comments at this time.
