Q: Minitab/ multiple regression (Answered, 0 Comments)
Question
Subject: Minitab/ multiple regression
Category: Business and Money
Asked by: san007-ga
List Price: $20.00
Posted: 28 Apr 2003 06:06 PDT
Expires: 28 May 2003 06:06 PDT
Question ID: 196474
```Hi, I'm trying to get to grips with the program Minitab, mainly multiple regression. My question is: could someone explain each area of the results table shown below, especially their meaning? This is just an example, so I just need to know how each value works, i.e. what are p values and their effect, etc. I need this in the next 4 hours. Thanks very much.

Regression Analysis: Total£ versus GFarea, Bedrooms

The regression equation is
Total£ = - 36280 + 84.2 GFarea + 20629 Bedrooms

79 cases used 2 cases contain missing values

Predictor       Coef    SE Coef      T      P
Constant      -36280      20143  -1.80  0.076
GFarea        84.198      9.779   8.61  0.000
Bedrooms       20629       4903   4.21  0.000

S = 49455   R-Sq = 66.5%   R-Sq(adj) = 65.7%

Analysis of Variance

Source           DF           SS           MS      F      P
Regression        2  3.69762E+11  1.84881E+11  75.59  0.000
Residual Error   76  1.85878E+11   2445767700
Total            78  5.55641E+11

Source     DF       Seq SS
GFarea      1  3.26460E+11
Bedrooms    1  43302466403

Unusual Observations
Obs  GFarea  Total£     Fit  SE Fit  Residual  St Resid
  8    2504  381984  257070    9787    124914     2.58R
 18    1996  100300  214297    6362   -113997    -2.32R
 49    3076  274958  284602   17274     -9644    -0.21 X
 53    1980  374256  295467   19406     78789     1.73 X
 54    3501  453744  382274   17160     71470     1.54 X

R denotes an observation with a large standardized residual
X denotes an observation whose X value gives it large influence.```
Subject: Re: Minitab/ multiple regression
Answered By: jeremymiles-ga on 28 Apr 2003 07:52 PDT
Rated:
```Hello,

In answering this question, I shall precede your initial question with >'s

> The regression equation is
> Total£ = - 36280 + 84.2 GFarea + 20629 Bedrooms

In regression, we are trying to get an estimate which best predicts the outcome. The outcome in this case is Total£. If we know GFarea and Bedrooms, our best guess at Total£ is given by this equation.

> 79 cases used 2 cases contain missing values

You had 81 rows in your data; however, two of these rows contained missing data for at least one of the variables, and so were not used in the analysis.

> Predictor       Coef    SE Coef      T      P
> Constant      -36280      20143  -1.80  0.076
> GFarea        84.198      9.779   8.61  0.000
> Bedrooms       20629       4903   4.21  0.000

This next part gives some of the details of the equation. Each of the estimates (coefficients, indicated with Coef) has a standard error - this is a measure of how variable the estimate is likely to be. To get the 95% confidence interval of a coefficient, we multiply the standard error by 1.96, and add and subtract this from the coefficient. So our best guess at the GFarea coefficient is 84. However, this estimate has a standard error of (approximately) 10. So the confidence limits are given by 84 + 1.96 x 10 = 104 and 84 - 1.96 x 10 = 64. If we were to say that the true (population) value for the coefficient is likely to be from 64 to 104, there is only a 5% chance (1 in 20) that we are wrong. That is, one more unit of GFarea adds between 64 and 104 units of Total£.

The next part of the model is T. T is given by Coef / SE Coef. So, for the constant:

-36280 / 20143 = -1.80

T isn't very useful on its own, but it does give us P - that is the probability of getting a result at least this far from zero, if the real value in the population is zero. The fact that GFarea and Bedrooms both have low probabilities (less than 0.0005) means that it is very unlikely you would have found this result, if in fact they had no effect.

The constant is a special case - this is the estimated value of Total£ when all of the other predictors are zero. It often makes no sense - as is the case here (I am guessing). Saying that the value of a house with 0 bedrooms and 0 ground floor area is -36280 is obviously silly.

> S = 49455 R-Sq = 66.5% R-Sq(adj) = 65.7%

OK, so this equation tells us the best guess at Total£, but the next question is, how good is that? This is given by R-Sq - or R-Squared. R-Squared is the proportion of variance in Total£ which is explained by the predictors - in this case, it is 66.5% - quite a high value. If you take this as 0.665, and find the square root, it is about 0.82.
This is the correlation between the predicted score (given by the equation) and the actual score.

The next question to ask is whether this is a good prediction - i.e. is this prediction better than chance?

> Source           DF           SS           MS      F      P
> Regression        2  3.69762E+11  1.84881E+11  75.59  0.000
> Residual Error   76  1.85878E+11   2445767700
> Total            78  5.55641E+11
> Source     DF       Seq SS
> GFarea      1  3.26460E+11
> Bedrooms    1  43302466403

This is answered by the next section. This is usually not reported in depth, so I am not going to cover it here, but request clarification if you need it. The P value again tells us whether we can make a significant prediction, or whether we are better off guessing. Because this p-value is very low, the prediction is highly significant, and better than guessing. The most common thing here is to report the p-value (again, it's <0.0005, it's not 0.000). You might also want to report the F and the DF, in which case it's F = 75.6, df = 2, 76.

> Obs  GFarea  Total£     Fit  SE Fit  Residual  StResid
>   8    2504  381984  257070    9787    124914    2.58R
>  18    1996  100300  214297    6362   -113997   -2.32R
>  49    3076  274958  284602   17274     -9644   -0.21 X
>  53    1980  374256  295467   19406     78789    1.73 X
>  54    3501  453744  382274   17160     71470    1.54 X
> R denotes an observation with a large standardized residual
> X denotes an observation whose X value gives it large influence.

The final part of the output is some diagnostics, to help you to interpret the equation. Minitab has selected some cases it believes you might want to look at. It bases this on the residuals and the influence.

First, the residuals. The residual is the difference between the value we would expect, given GFarea and Bedrooms, and what we actually have. Large residuals are marked with an R. Case 8 has a very large residual - its value for Total£ is 124914 higher than would be expected. Similarly, case 18 has a much lower value than would be expected.
It is worth looking at these to see if there has been an error entering the data, or there is something unusual about them. Maybe they are in a very different area to the others, maybe they are paved with gold, or come with a free farm (I am guessing, because I have no idea what the data are about - I am sure you can think of something more sensible).

Second, the influential cases, marked with an X. An influential case is more important than the others in determining the values of the coefficients. It isn't necessarily anything to worry about, but again is worth checking. (I am not going to go into detail on this, because of your time limit, but if you would like, just request clarification.)

I have written this fairly swiftly, because of your time limit, so if you think that I have missed anything out, or would like clarification on anything, please ask, before rating the question. I will recheck the page fairly regularly for the rest of the day, to see if a clarification request appears.

Here's a useful site:
http://www.fw.umn.edu/biochr/assoc/dho/107/notes/minitab/REGRESS1.HTM

Here's a useful book. :)
http://www.amazon.co.uk/exec/obidos/ASIN/0761962301/qid%3D1006938682/sr%3D1-1/ref%3Dsr%5Fsp%5Fre/026-9955570-8940457

jeremymiles-ga```
Request for Answer Clarification by san007-ga on 28 Apr 2003 09:40 PDT
```Hi, is it possible to have further clarification on what the F part of this is, please? It would help me out greatly.

> Source           DF           SS           MS      F      P
> Regression        2  3.69762E+11  1.84881E+11  75.59  0.000
> Residual Error   76  1.85878E+11   2445767700
> Total            78  5.55641E+11
> Source     DF       Seq SS
> GFarea      1  3.26460E+11
> Bedrooms    1  43302466403```
Clarification of Answer by jeremymiles-ga on 28 Apr 2003 10:39 PDT
```Sure. F is a statistic that is used in ANOVA and regression. It's kind of hard to tell with your numbers (because they are very large), but F is given by MSregression / MSerror, where MS stands for "mean squares".
The mean squares are given by the SS (sums of squares) divided by the degrees of freedom.

So, what are the sums of squares? The sums of squares are what regression is all about - you might have heard of regression being called OLS (ordinary least squares) regression - these are the squares that make the sums of squares. "Sum of squares" is short for "sum of squared deviations".

The first sum of squares we have is the total sum of squares. This is calculated by finding the difference between each value and the mean, squaring it, and then adding the squares up. The mean is a bit like a regression equation with no predictor variables. If you want to find a value, such that you minimise the sum of squares, you will end up with the mean. Here's an example:

Take the numbers:
3 4 5 6

Calculate the mean: 4.5

Find the difference between each value and the mean:
3  -1.5
4  -0.5
5   0.5
6   1.5

Square them:
3  -1.5  2.25
4  -0.5  0.25
5   0.5  0.25
6   1.5  2.25

Add them up: 5. This gives the total sum of squares.

It is also impossible to pick another number and find a lower value for the sum of squares, so we can think of the mean as a least squares estimator. (But we are getting off the point.) (The sum of squares is also given by the variance * N, or by the variance * (N - 1) if the variance was calculated with N - 1 in the denominator, as most programs do; the variance is the square of the standard deviation.)

Now, where were we? Now we get our regression equation, and we calculate a predicted value for each case and find the residual. If we square the residuals, and add them up, we have the error sum of squares - the point of regression is to find values for the equation that give the lowest possible error sum of squares - this is the least squares that we are talking about. This cannot be higher than the total sum of squares, but our question is, how much lower. We have SStotal, and SSerror. We can calculate the regression sum of squares by subtracting error from total: SSregression = SStotal - SSerror.
[You might notice at this point that the R-Sq is equal to SSregression / SStotal - hence it is the proportion of variance explained (SStotal being the total variation).] This gives the three SS that we have in the table.

The trouble with sums of squares is that they are dependent on things like the number of people, and the number of variables, so we need to divide each sum of squares by its degrees of freedom to get the mean squares. The regression df is given by k (where k is the number of predictor variables) and the error df is given by N - k - 1. We divide MSregression by MSerror, and we have F. So, the smaller the error, the larger the F. In addition, more predictors, for the same value of R-Sq, give a lower F.

F is a test statistic, which has a distribution which depends on the degrees of freedom, which is why you need to report the degrees of freedom when you report F. What you are asking is "if there was no effect in my data, what is the probability of getting a value of F this high?" High values of F are associated with low probabilities, which means that it is unlikely that your data are a fluke, and that you have a real result.

Again, please feel free to request clarification if any of this is unclear.

jeremymiles-ga```
Request for Answer Clarification by san007-ga on 28 Apr 2003 11:16 PDT
```You have been very helpful, by the way!! If you can and want to, could you clarify what you think the effectiveness of this model is, or how I can obtain this? I think I'm going to have to alter it. Do you know how the hypothesis testing (null etc.) would figure into this model??```
Clarification of Answer by jeremymiles-ga on 28 Apr 2003 11:23 PDT
```The model has a pretty good R-Sq; most people would be happy with that - it's rare to see one higher. This is the most common way to statistically consider a model. The cases with high residuals and influence statistics are not a cause for excessive worry, IMHO.

What do you plan to do to alter the model?
[Given your time limit, I will post this and continue to the next part of your question.]```
Clarification of Answer by jeremymiles-ga on 28 Apr 2003 11:26 PDT
```The null hypothesis is *usually* the hypothesis of zero effect. There are two sets of null hypotheses tested in a regression analysis.

First is the null hypothesis that your predictions are no better than chance. This is tested using the F test of the R-Sq. Your data have easily rejected this null hypothesis.

Second is the set of null hypotheses about the parameter estimates or coefficients. Here, for GFarea and Bedrooms, you have rejected the null hypotheses of no effect. For the constant (sometimes called the intercept) the null hypothesis is not rejected (if you use 0.05 as your cut-off); however, as already discussed, the intercept in this case isn't especially useful.```
Request for Answer Clarification by san007-ga on 28 Apr 2003 13:20 PDT
```I was thinking of taking the rooms variable out, as the R-Sq goes higher. Is it possible to take out the constant??```
Clarification of Answer by jeremymiles-ga on 29 Apr 2003 02:16 PDT
```R-Sq is higher without the rooms variable? That surprises me. Can you post the correlations between the variables? I can calculate the R-Sq then.

You can remove the intercept - this is called "regression through the origin" - I am not sure if you can do this in Minitab, but it is usually not recommended.

This page:
http://mallit.fr.umn.edu/fr5218/reg_refresh/origin.html
Says:
"Regression through the origin should not be used indiscriminately. The variability of the data about the regression line should be compared to that for the equation including b0 (that is, compare corresponding sy.xs). A plot of studentized residuals versus X should be made as well. If a non-horizontal linear trend is apparent, a nonzero intercept should be suspected."

This one:
http://www.graphpad.com/instatman/Whentoforcearegressionlinethroughtheorigin_.htm
Says:
"Therefore you might be tempted to force the regression line through the origin. But you don't particularly care where the line is in the vicinity of the origin. You really care only that the line fits the standards very well near the unknowns. You will probably get a better fit by not constraining the line."

Just a stab in the dark here, but have you considered a non-linear effect of GFarea - a small house getting an extra 10m^2 might make a bigger difference than a large house getting an extra 10m^2. Also, have you considered an interaction effect? That is, the effect of GFarea may not be constant across all numbers of bedrooms. (Although these effects may have to be quite strong to detect them with your sample size.) These are two ways to increase the value of R-Sq.

jeremymiles-ga```
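The arithmetic described in the answers above (T = Coef / SE Coef, the 1.96 confidence limits, mean squares, F, R-Sq and S) can be re-derived from the figures in the quoted Minitab output. What follows is only a rough sketch in Python, not anything Minitab itself runs: the variable and function names are made up for illustration, the numbers are copied from the output in the question, and 1.96 is the large-sample multiplier used in the answer.

```python
import math

# Figures copied from the Minitab output quoted in the question.
COEF = {"Constant": -36280.0, "GFarea": 84.198, "Bedrooms": 20629.0}
SE_COEF = {"Constant": 20143.0, "GFarea": 9.779, "Bedrooms": 4903.0}
SS_REGRESSION = 3.69762e11
SS_ERROR = 1.85878e11
SS_TOTAL = 5.55641e11
N_CASES = 79        # cases actually used in the analysis
K_PREDICTORS = 2    # GFarea and Bedrooms


def t_and_interval(name, multiplier=1.96):
    """Return (T, lower, upper) for one coefficient.

    The 1.96 multiplier is the large-sample value used in the answer.
    """
    coef, se = COEF[name], SE_COEF[name]
    return coef / se, coef - multiplier * se, coef + multiplier * se


# Degrees of freedom: k for the regression, N - k - 1 for the error.
df_regression = K_PREDICTORS
df_error = N_CASES - K_PREDICTORS - 1

# Mean squares are sums of squares divided by their degrees of freedom.
ms_regression = SS_REGRESSION / df_regression
ms_error = SS_ERROR / df_error

f_stat = ms_regression / ms_error                      # F = MSregression / MSerror
r_sq = SS_REGRESSION / SS_TOTAL                        # proportion of variance explained
r_sq_adj = 1 - ms_error / (SS_TOTAL / (N_CASES - 1))   # adjusted for the number of predictors
s = math.sqrt(ms_error)                                # the "S" line of the output
multiple_r = math.sqrt(r_sq)                           # correlation of predicted and actual values

if __name__ == "__main__":
    for name in COEF:
        t, low, high = t_and_interval(name)
        print(f"{name:<9} T = {t:6.2f}   95% CI = {low:10.1f} to {high:10.1f}")
    print(f"F = {f_stat:.2f} on {df_regression} and {df_error} df")
    print(f"R-Sq = {r_sq:.1%}, R-Sq(adj) = {r_sq_adj:.1%}, R = {multiple_r:.2f}")
    print(f"S = {s:.0f}")
```

Using the unrounded coefficient and standard error, the GFarea interval comes out at roughly 65 to 103; the 64 to 104 in the answer comes from rounding to 84 and 10 first. The mean square for error differs from the printed one in the last few digits because the sums of squares in the output are themselves rounded.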
san007-ga rated this answer: ```Amazingly accurate and very, very helpful as always. Thank you very much, and if you have a chance could you answer my clarification. Thanks so much. P.S. I would love to tip but have no funds at present, though I expect we will meet again in the future, as you have been most helpful.```
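As a footnote to the clarification on sums of squares, the small example given there (the numbers 3, 4, 5, 6) can be checked in a few lines. This is a minimal sketch in Python using only the standard `statistics` module; the function name and the list of alternative guesses are hypothetical, picked purely for illustration.

```python
from statistics import mean, pvariance, variance

values = [3, 4, 5, 6]


def sum_of_squares(data, centre):
    """Sum of squared differences between each value and `centre`."""
    return sum((x - centre) ** 2 for x in data)


centre = mean(values)
ss_total = sum_of_squares(values, centre)

# Hypothetical alternative guesses, chosen only for illustration:
# each should give a larger sum of squares than the mean does.
other_guesses = [4.0, 4.4, 4.6, 5.0]
ss_others = {guess: sum_of_squares(values, guess) for guess in other_guesses}

# The same total from the variance: times N for the population form,
# times N - 1 for the sample form.
ss_from_pvariance = pvariance(values) * len(values)
ss_from_variance = variance(values) * (len(values) - 1)

if __name__ == "__main__":
    print("mean:", centre)
    print("total sum of squares:", ss_total)
    for guess, ss in ss_others.items():
        print(f"sum of squares around {guess}: {ss}")
    print("from population variance:", ss_from_pvariance)
    print("from sample variance:", ss_from_variance)
```

Every alternative guess gives a sum of squares above the one obtained around the mean, which is the sense in which the mean is a least squares estimator.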