Subject: Validating that predicted values match actual ones
Category: Science > Math
Asked by: enderrob-ga
List Price: $10.00
Posted: 30 Jul 2003 11:24 PDT
Expires: 01 Aug 2003 12:24 PDT
Question ID: 237011
The fundamental question is this: how can I be sure that my predictive model adequately fits the data? I know that one method of validation is to look at the R^2 of the actual vs. predicted numbers. But I feel that R^2 isn't exactly what I'm looking for. I believe that all R^2 will tell me is whether or not the shapes of the distributions match, but I'm also interested in their "height". In other words, given two sets of predicted data that both adequately predict the shape of the distribution of the dependent variable, how can I tell which is better?
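For illustration only (made-up numbers, not the poster's data), the following Python sketch shows why R^2, taken as the squared correlation between predicted and actual values, is insensitive to a systematic scale or offset error -- it measures how well the shapes track each other, not whether the predictions sit at the right level:

    import numpy as np

    # Made-up actual values and two candidate sets of predictions.
    actual = np.array([10_000.0, 30_000.0, 55_000.0, 80_000.0, 120_000.0])
    pred_good = actual + np.array([500.0, -800.0, 300.0, -400.0, 600.0])  # small errors
    pred_biased = 0.5 * pred_good + 20_000.0  # same "shape", wrong "height"

    def r_squared(pred, act):
        """Squared Pearson correlation between predictions and actuals."""
        return np.corrcoef(pred, act)[0, 1] ** 2

    print(r_squared(pred_good, actual))    # high
    print(r_squared(pred_biased, actual))  # identical -- R^2 cannot see the bias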
There is no answer at this time.
Subject: Re: Validating that predicted values match actual ones
From: hfshaw-ga on 01 Aug 2003 10:13 PDT
The statistic you want to use to examine the "goodness of fit" is the chi-squared statistic. To calculate it, you need to know (or be able to estimate) the uncertainties (i.e., errors) in your observed (actual) values.

However, I can tell without calculating anything that there is probably a problem with your fit, and it will yield biased predictions. (To say for certain, I'd need to know more about the magnitude of the errors on your observations.) The problem can be demonstrated by making a plot of the "residuals", which are defined as the differences between your observed (actual) values and your predicted values. For each data point, plot this value on the y-axis against the value of the independent variable for that observation (which you didn't include in the data set above) on the x-axis, and see if there are any patterns or trends in the plot. (Because you didn't provide the values for your independent variables, I used your observed values of the dependent variable as the x-axis variable in my plot, which works just as well in this case.) You should also use your estimates of the errors on the observations to plot error bars for each point. Making a plot of this sort should be your first step in assessing the goodness of any fit, before you attempt a more sophisticated statistical analysis.

For your data set, note that when the dependent variable is less than ~60,000, the residuals are almost all negative -- the predicted values are larger than the observations, and your fit consistently overpredicts the value of your dependent variable. For values of the dependent variable greater than this, the residuals are mostly positive, with the size of the residual generally increasing as the magnitude of the observation increases; in that range, your fit consistently underpredicts the value. This is not good. In a good fit, the residuals should have a random distribution of positive and negative values across the entire range of the fit. The fact that the residuals *do* have some structure probably means that the mathematical model you are using does not correctly account for all the variation in your data. If you are simply fitting a straight line to your data, you might consider adding a quadratic term (i.e., a term that depends on the value of the independent variable squared) to the fit.

I also suspect that you have not adequately accounted for the errors in your observations when doing your fit. You should be weighting each data point by its errors.
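A minimal sketch of the chi-squared calculation and residual plot in Python, using made-up data in place of the poster's actual observations, predictions, and uncertainties (all names and values below are illustrative only):

    import numpy as np
    import matplotlib.pyplot as plt

    # Made-up example data -- substitute your own observations, model
    # predictions, and estimated 1-sigma uncertainties on the observations.
    rng = np.random.default_rng(0)
    x = np.linspace(0, 100_000, 50)              # independent variable
    observed = 1.2 * x + 5_000 + rng.normal(0, 3_000, x.size)
    predicted = 1.15 * x + 7_000                 # some candidate model
    sigma = np.full(x.size, 3_000.0)             # uncertainty of each observation

    # Chi-squared: sum of squared, uncertainty-weighted residuals.
    residuals = observed - predicted
    chi2 = np.sum((residuals / sigma) ** 2)
    dof = x.size - 2                             # points minus fitted parameters (straight-line model)
    print(f"chi-squared = {chi2:.1f}, reduced chi-squared = {chi2 / dof:.2f}")

    # Residual plot: look for structure (trends, curvature) rather than
    # a random scatter about zero.
    plt.errorbar(x, residuals, yerr=sigma, fmt="o")
    plt.axhline(0.0, linestyle="--")
    plt.xlabel("independent variable")
    plt.ylabel("observed - predicted")
    plt.show()

A reduced chi-squared near 1 indicates the model fits the data to within the stated uncertainties; values much larger than 1 suggest a poor model (or underestimated errors), and values much smaller than 1 suggest overestimated errors.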
Subject: Re: Validating that predicted values match actual ones
From: 99of9-ga on 01 Aug 2003 10:16 PDT
An important eyeballing test which may help you improve your model is to simply plot your predictions vs. actual values as an x-y plot, and then also plot the line x = y. If all your points fall on this line, then your prediction is perfect. The greater the spread away from this line, the worse your model. For example, when I do this with the points you've posted, I see that when the model predicts between 10,000 and 70,000, it is consistently underpredicting. Therefore I would tinker with the model to boost such predictions.

Now of course you want a hard number to compare different models. This is a little subjective because it depends on what you would expect the noise in the data to originate from, and hence how you would expect it to be distributed. The first commonly used measure is the so-called RMS ("root mean square") deviation. For each point, calculate (pred - actual) and square this number. Then take the average over all the points. Then take the square root of the answer. The higher your final answer, the "worse" your model.

However, the method above treats an error of, say, 300 in a small value as just as bad as an error of 300 in a large value. But if you were predicting sales on a certain day, for example, getting 10,000 +/- 300 is much worse than 100,000 +/- 300. Here you might think that the relative error of each sample is what's important. So in this case you might calculate the relative errors of each of your predictions [i.e., (pred - actual)/actual] and then do the RMS procedure again.
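A minimal sketch of both calculations in Python, using made-up prediction/actual pairs rather than the poster's data:

    import numpy as np

    def rms_error(predicted, actual):
        """Root-mean-square of the prediction errors (pred - actual)."""
        return np.sqrt(np.mean((predicted - actual) ** 2))

    def relative_rms_error(predicted, actual):
        """Root-mean-square of the relative errors (pred - actual) / actual."""
        return np.sqrt(np.mean(((predicted - actual) / actual) ** 2))

    # Made-up example values -- replace with your own predictions and actuals.
    predicted = np.array([12_000.0, 48_000.0, 71_000.0, 95_000.0])
    actual = np.array([11_500.0, 50_000.0, 69_000.0, 98_000.0])

    print("RMS deviation:         ", rms_error(predicted, actual))
    print("Relative RMS deviation:", relative_rms_error(predicted, actual))

Whichever candidate model gives the smaller value under the measure you care about (absolute or relative) is the better fit by that criterion.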
Subject: Re: Validating that predicted values match actual ones
From: 99of9-ga on 01 Aug 2003 10:21 PDT
hfshaw and I were writing at the same time. He is obviously a "proper" statistician (he knows what he's doing _and_ uses all the right words). So you're welcome to ignore my comment above if you like; we were trying to get at the same thing anyway.
Subject: Re: Validating that predicted values match actual ones
From: enderrob-ga on 01 Aug 2003 12:23 PDT
Thanks to both hfshaw and 99of9. Your comments were very helpful, and I will explore my data along those lines.