Subject: Validating that predicted values match actual ones
Category: Science > Math
Asked by: enderrob-ga
List Price: $10.00
Posted: 30 Jul 2003 11:24 PDT
Expires: 01 Aug 2003 12:24 PDT
Question ID: 237011
The fundamental question is this: how can I be sure that my predictive model adequately fits the data? I know that one method of validation is to look at the R^2 of the actual vs. predicted numbers. But I feel that R^2 isn't exactly what I'm looking for. I believe that all R^2 will tell me is whether or not the shapes of the distributions match, but I'm also interested in their "height". In other words, given two sets of predicted data that both adequately predict the shape of the distribution of the dependent variable, how can I tell which is better?
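For illustration only (made-up numbers, not the poster's data), the following Python sketch shows why R^2, taken as the squared correlation between predicted and actual values, is insensitive to a systematic scale or offset error -- it measures how well the shapes track each other, not whether the predictions sit at the right level:

    import numpy as np

    # Made-up actual values and two candidate sets of predictions.
    actual = np.array([10_000.0, 30_000.0, 55_000.0, 80_000.0, 120_000.0])
    pred_good = actual + np.array([500.0, -800.0, 300.0, -400.0, 600.0])  # small errors
    pred_biased = 0.5 * pred_good + 20_000.0  # same "shape", wrong "height"

    def r_squared(pred, act):
        """Squared Pearson correlation between predictions and actuals."""
        return np.corrcoef(pred, act)[0, 1] ** 2

    print(r_squared(pred_good, actual))    # high
    print(r_squared(pred_biased, actual))  # identical -- R^2 cannot see the bias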
There is no answer at this time.
Subject: Re: Validating that predicted values match actual ones
From: hfshaw-ga on 01 Aug 2003 10:13 PDT
The statistic you want to use to examine the "goodness of fit" is the chi-squared statistic. To calculate it, you need to know (or be able to estimate) the uncertainties (i.e., errors) in your observed (actual) values.

However, I can tell without calculating anything that there is probably a problem with your fit, and it will yield biased predictions. (To say for certain, I'd need to know more about the magnitude of the errors on your observations.) The problem can be demonstrated by making a plot of the "residuals", which are defined as the differences between your observed (actual) values and your predicted values. For each data point, plot this value on the y-axis against the value of the independent variable for that observation (which you didn't include in the data set above) on the x-axis, and see if there are any patterns or trends in the plot. (Because you didn't provide the values for your independent variables, I used your observed values of the dependent variable as the x-axis variable in my plot, which works just as well in this case.) You should also use your estimates of the errors on the observations to plot error bars for each point. Making a plot of this sort should be your first step in assessing the goodness of any fit, before you attempt a more sophisticated statistical analysis.

For your data set, note that when the dependent variable is less than ~60,000, the residuals are almost all negative -- the predicted values are larger than the observations, and your fit consistently overpredicts the value of your dependent variable. For values of the dependent variable greater than this, the residuals are mostly positive, with the size of the residual generally increasing as the magnitude of the observation increases; in that range, your fit consistently underpredicts the value. This is not good. In a good fit, the residuals should have a random distribution of positive and negative values across the entire range of the fit. The fact that the residuals *do* have some structure probably means that the mathematical model you are using does not correctly account for all the variation in your data. If you are simply fitting a straight line to your data, you might consider adding a quadratic term (i.e., a term that depends on the value of the independent variable squared) to the fit.

I also suspect that you have not adequately accounted for the errors in your observations when doing your fit. You should be weighting each data point by its errors.
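A minimal sketch of the chi-squared calculation and residual plot in Python, using made-up data in place of the poster's actual observations, predictions, and uncertainties (all names and values below are illustrative only):

    import numpy as np
    import matplotlib.pyplot as plt

    # Made-up example data -- substitute your own observations, model
    # predictions, and estimated 1-sigma uncertainties on the observations.
    rng = np.random.default_rng(0)
    x = np.linspace(0, 100_000, 50)              # independent variable
    observed = 1.2 * x + 5_000 + rng.normal(0, 3_000, x.size)
    predicted = 1.15 * x + 7_000                 # some candidate model
    sigma = np.full(x.size, 3_000.0)             # uncertainty of each observation

    # Chi-squared: sum of squared, uncertainty-weighted residuals.
    residuals = observed - predicted
    chi2 = np.sum((residuals / sigma) ** 2)
    dof = x.size - 2                             # points minus fitted parameters (straight-line model)
    print(f"chi-squared = {chi2:.1f}, reduced chi-squared = {chi2 / dof:.2f}")

    # Residual plot: look for structure (trends, curvature) rather than
    # a random scatter about zero.
    plt.errorbar(x, residuals, yerr=sigma, fmt="o")
    plt.axhline(0.0, linestyle="--")
    plt.xlabel("independent variable")
    plt.ylabel("observed - predicted")
    plt.show()

A reduced chi-squared near 1 indicates the model fits the data to within the stated uncertainties; values much larger than 1 suggest a poor model (or underestimated errors), and values much smaller than 1 suggest overestimated errors.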
Subject: Re: Validating that predicted values match actual ones
From: 99of9-ga on 01 Aug 2003 10:16 PDT
An important eyeballing test which may help you improve your model is to simply plot your predictions vs. actual values as an x-y plot, and then also plot the line x = y. If all your points fall on this line, then your prediction is perfect. The greater the spread away from this line, the worse your model. For example, when I do this with the points you've posted, I see that when the model predicts between 10,000 and 70,000, it is consistently underpredicting. Therefore I would tinker with the model to boost such predictions.

Now of course you want a hard number to compare different models. This is a little subjective because it depends on what you would expect the noise in the data to originate from, and hence how you would expect it to be distributed. The first commonly used measure is the so-called RMS ("root mean square") deviation. For each point, calculate (pred - actual) and square this number. Then take the average over all the points. Then take the square root of the answer. The higher your final answer, the "worse" your model.

However, the method above treats an error of, say, 300 in a small value as just as bad as an error of 300 in a large value. But if you were predicting sales on a certain day, for example, getting 10,000 +/- 300 is much worse than 100,000 +/- 300. Here you might think that the relative error of each sample is what's important. So in this case you might calculate the relative errors of each of your predictions [i.e., (pred - actual)/actual] and then do the RMS procedure again.
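A minimal sketch of both calculations in Python, using made-up prediction/actual pairs rather than the poster's data:

    import numpy as np

    def rms_error(predicted, actual):
        """Root-mean-square of the prediction errors (pred - actual)."""
        return np.sqrt(np.mean((predicted - actual) ** 2))

    def relative_rms_error(predicted, actual):
        """Root-mean-square of the relative errors (pred - actual) / actual."""
        return np.sqrt(np.mean(((predicted - actual) / actual) ** 2))

    # Made-up example values -- replace with your own predictions and actuals.
    predicted = np.array([12_000.0, 48_000.0, 71_000.0, 95_000.0])
    actual = np.array([11_500.0, 50_000.0, 69_000.0, 98_000.0])

    print("RMS deviation:         ", rms_error(predicted, actual))
    print("Relative RMS deviation:", relative_rms_error(predicted, actual))

Whichever candidate model gives the smaller value under the measure you care about (absolute or relative) is the better fit by that criterion.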
Subject: Re: Validating that predicted values match actual ones
From: 99of9-ga on 01 Aug 2003 10:21 PDT
hfshaw and I were writing at the same time. He is obviously a "proper" statistician (he knows what he's doing _and_ uses all the right words). So you're welcome to ignore my comment above if you like; we were trying to get at the same thing anyway.
Subject: Re: Validating that predicted values match actual ones
From: enderrob-ga on 01 Aug 2003 12:23 PDT
Thanks to both hfshaw and 99of9. Your comments were very helpful, and I will explore my data along those lines.