Q: Validating that predicted values match actual ones ( No Answer,   4 Comments )
Question  
Subject: Validating that predicted values match actual ones
Category: Science > Math
Asked by: enderrob-ga
List Price: $10.00
Posted: 30 Jul 2003 11:24 PDT
Expires: 01 Aug 2003 12:24 PDT
Question ID: 237011
The fundamental question is this:  How can I be sure that my
predictive model adequately fits the data?

I know that one method of validation is to look at the R^2 of the
actual vs. predicted numbers.  But, I feel that R^2 isn't exactly what
I'm looking for.  I believe that all R^2 will tell me is whether or
not the shapes of the distributions match.  But I'm also interested in
their "height".  In other words, given two sets of predicted data that
both adequately predict the shape of the distribution of the dependent
value, how can I tell which is better?
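The asker's intuition can be made concrete: R^2 computed as a squared
correlation is unchanged when every prediction is rescaled, so two
prediction sets can share an R^2 while differing badly in level.  A
minimal sketch with made-up numbers (not from this thread):

```python
# Two prediction sets with (essentially) identical R^2 against the
# same actuals, but very different "height": correlation-based R^2 is
# blind to a constant scale factor.
actual = [10.0, 20.0, 30.0, 40.0]
pred_a = [11.0, 19.0, 31.0, 39.0]   # close in level
pred_b = [2 * p for p in pred_a]    # same shape, doubled

def r_squared(pred, act):
    # squared Pearson correlation between predictions and actuals
    n = len(act)
    mp = sum(pred) / n
    ma = sum(act) / n
    cov = sum((p - mp) * (a - ma) for p, a in zip(pred, act))
    vp = sum((p - mp) ** 2 for p in pred)
    va = sum((a - ma) ** 2 for a in act)
    return cov * cov / (vp * va)

# r_squared(pred_a, actual) and r_squared(pred_b, actual) agree, even
# though pred_b is twice as "tall" -- exactly the gap the asker wants
# a second, accuracy-oriented measure to close.
```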

Request for Question Clarification by answerguru-ga on 30 Jul 2003 11:29 PDT
Hello enderrob-ga,

The short answer to your question is that it depends on the data you
are trying to predict - would it be possible for you to provide a
sample data set so that we can give a more relevant answer?

Thanks,
answerguru-ga

Clarification of Question by enderrob-ga on 30 Jul 2003 11:31 PDT
Absolutely.  Here is the exact data in question:

Proj	Act
522.1234957	747.52
730.7886543	623.13
1688.980008	2236.83
2980.264307	3306.04
6041.793371	6564.06
9928.021432	9132
9535.929362	8586.38
13366.20667	12058.51
13443.47593	10327.05
13248.14974	11256.37
12498.69659	8542.51
15212.21493	14537.47
19316.96364	15080.1
18758.00447	12947.8
20923.71432	15035.16
19688.14052	14187.22
23474.22531	18461.53
20150.95952	16721.8
14021.96931	12102.92
12708.39345	9917.17
16213.62315	13056.52
18601.96511	15739.55
19149.09788	16605.54
19808.78722	17363.51
12491.9434	10905.62
13771.8997	11081.97
15999.16483	10935.45
13285.23522	10777.7
12713.28274	10585.78
13819.638	11125.1
20947.32267	17478.71
29632.69994	24771.68
39739.03201	34291.62
37629.22611	31272.82
35446.73558	29819.37
31627.28442	29370.52
43175.63982	40896.52
40830.40829	40551.46
60321.75809	50858.42
62748.53209	60407.04
73716.28877	77802.91
89181.30103	87328.93
85229.3373	82954.25
85898.89974	84514.09
83108.38179	94102.48
84799.75673	80470.31
81770.09717	84629.28
127220.82	130245.02
134538.9217	132191.12
125578.0476	129521.68
151863.1087	160429.78
160165.0497	169720.66
199049.2055	203118.31
204812.679	210052.42
201353.4333	215837.9
219509.2973	219244.73
194316.8657	209295.78
205014.8405	207920.92
214555.6231	216863.07
243700.9229	232224.52
251781.2262	283689.23
274116.4971	289083.18
291833.2171	320546.62
163476.0784	155119.04

Request for Question Clarification by answerguru-ga on 30 Jul 2003 13:20 PDT
I believe I may have been unclear as to my prior clarification - I was
more interested in what this data actually represented. Can you
provide the actual question?

answerguru-ga

Clarification of Question by enderrob-ga on 30 Jul 2003 13:42 PDT
I'm afraid I can't get too specific with the details, since this is a
public post.  The first column, labelled "Proj", contains the
projected values obtained from a regression equation.  The column
labelled "Act" contains the actual values of our dependent variable.

I'm interested in ascertaining the predictive ability of my equation. 
I can graph those two data sets and see that they *look* close.  I can
look at my R^2 and see that there is some predictive ability there.

But, accuracy is just as important as precision for me.  I think I
have a good test for precision (R^2) but not necessarily for accuracy.
Answer  
There is no answer at this time.

Comments  
Subject: Re: Validating that predicted values match actual ones
From: hfshaw-ga on 01 Aug 2003 10:13 PDT
 
The statistic you want to use to examine the "goodness of fit" is the
chi-squared statistic.  To calculate it, you need to know (or be able
to estimate) the uncertainties (i.e., errors) in your observed
(actual) values.
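A minimal sketch of the calculation.  The thread doesn't give
measurement uncertainties, so the sigma values below are purely
hypothetical placeholders:

```python
# Chi-squared goodness-of-fit sketch.  Assumes each observation has a
# known (or estimated) uncertainty sigma_i; the sigmas here are made
# up for illustration.
predicted = [522.1, 730.8, 1689.0, 2980.3]   # first few "Proj" values
actual    = [747.5, 623.1, 2236.8, 3306.0]   # corresponding "Act" values
sigma     = [100.0, 100.0, 200.0, 300.0]     # hypothetical uncertainties

# chi^2 = sum_i ((actual_i - predicted_i) / sigma_i)^2
chi2 = sum(((a - p) / s) ** 2
           for p, a, s in zip(predicted, actual, sigma))

# With k fitted parameters, compare chi^2 to (n - k) degrees of
# freedom: a reduced chi^2 near 1 suggests the model fits the data
# to within the stated errors; much larger suggests a poor fit (or
# underestimated errors).
n, k = len(actual), 2      # e.g. a two-parameter (linear) fit
reduced_chi2 = chi2 / (n - k)
```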

However, I can tell without calculating anything that there is
probably a problem with your fit, and it will yield biased
predictions.  (To say for certain, I'd need to know more about the
magnitude of the errors on your observations.)  The problem can be
demonstrated by making a plot of the "residuals", which are defined
as the differences between your observed (actual) values and your
predicted values.  For each data point, plot this value on the y-axis
against the value of the independent variable for that observation
(which you didn't include in the data set above) on the x-axis, and
see if there are any patterns or trends in the plot.  (Because you
didn't provide the values for your independent variables, I used your
observed values of the dependent variable as the x-axis variable in
my plot, which works just as well in this case.)  You should also use
your estimates of the errors on the observations to plot error bars
for each point.  Making a plot of this sort should be your first step
in assessing the goodness of any fit, before you attempt a more
sophisticated statistical analysis.
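A quick numeric stand-in for that plot, using a handful of rows from
the posted data, is to tabulate the sign of each residual and look
for runs of one sign:

```python
# Residual-sign check: residual = actual - predicted.  A healthy fit
# mixes "+" and "-" randomly across the range of the data.
pairs = [  # (predicted, actual) -- a few rows from the posted data
    (9928.0, 9132.0), (13366.2, 12058.5), (19316.9, 15080.1),
    (73716.3, 77802.9), (151863.1, 160429.8), (199049.2, 203118.3),
]
residuals = [a - p for p, a in pairs]
signs = ["-" if r < 0 else "+" for r in residuals]

# Here the small observations are all "-" (the fit overpredicts) and
# the large ones all "+" (the fit underpredicts) -- the structure
# described above.
```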

For your data set, note that when the dependent variable is less than
~60,000, the residuals are almost all negative -- the predicted
values are larger than the observations, and your fit consistently
overpredicts the value of your dependent variable.  For values of the
dependent variable greater than this, the residuals are mostly
positive, with the size of the residual generally increasing as the
magnitude of the observation increases.  For values of the dependent
variable > ~60,000, your fit consistently underpredicts the value.
This is not good.  In a good fit, the residuals should have a random
distribution of positive and negative values across the entire range
of the fit.  The fact that the residuals *do* have some structure
probably means that the mathematical model you are using does not
correctly account for all the variation in your data.  If you are
simply fitting a straight line to your data, you might consider
adding a quadratic term (i.e., a term that depends on the value of
the independent variable squared) to the fit.

I also suspect that you have not adequately accounted for the errors
in your observations when doing your fit.  You should be weighting
each data point by its errors.
Subject: Re: Validating that predicted values match actual ones
From: 99of9-ga on 01 Aug 2003 10:16 PDT
 
An important eyeballing test which may improve your model is to simply
plot your predictions vs actual values as an xy plot, then also plot
the line x=y.  If all your points fall on this line, then your
prediction is perfect.  The greater the spread away from this line,
the worse your model.

For example, when I do this with the points you've posted, I see that
when the model predicts between 10,000 and 70,000, it is consistently
overpredicting.  Therefore I would tinker with the model to pull such
predictions down.

Now of course you want a hard number to compare different models. 
This is a little subjective because it depends what you would expect
the noise in the data to originate from, and hence how you would
expect it to be distributed.

The first commonly used measure is the so-called RMS ("Root Mean
Square") deviation.  For each point, calculate (pred - actual) and square
this number.  Then take an average over all the points.  Then take the
square root of the answer.  The higher your final answer, the "worse"
your model.

However, the method above treats an error of, say, 300 in a small
value as just as bad as an error of 300 in a large value.  But if you
were predicting sales on a certain day, for example, an error of 300
on a value of 10,000 is relatively much worse than an error of 300 on
100,000.  Here you might think that the relative error of each sample
is what's important.  So in this case you might calculate the
relative errors of each of your predictions [i.e.,
(pred - actual)/actual] and then do the RMS procedure again.
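Both measures in a few lines (the numbers below are illustrative, not
from the posted data):

```python
from math import sqrt

# RMS deviation and relative-error RMS, as described above.
pairs = [  # (predicted, actual) -- illustrative values
    (110.0, 100.0), (210.0, 200.0), (290.0, 300.0),
]
n = len(pairs)

# Plain RMS: penalizes absolute errors equally, regardless of scale.
rms = sqrt(sum((p - a) ** 2 for p, a in pairs) / n)

# Relative-error RMS: the same procedure applied to (pred-actual)/actual,
# so an error of 10 on 100 counts for more than an error of 10 on 300.
rel_rms = sqrt(sum(((p - a) / a) ** 2 for p, a in pairs) / n)

# The lower the value, the better the model -- and comparing these
# numbers across candidate models gives the "hard number" mentioned above.
```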
Subject: Re: Validating that predicted values match actual ones
From: 99of9-ga on 01 Aug 2003 10:21 PDT
 
hfshaw and I were writing at the same time.
He is obviously a "proper" statistician (he knows what he's doing
_and_ uses all the right words).  So you're welcome to ignore my
comment above if you like; we were trying to get at the same thing
anyway.
Subject: Re: Validating that predicted values match actual ones
From: enderrob-ga on 01 Aug 2003 12:23 PDT
 
Thanks to both hfshaw and 99of9.  Your comments were very helpful, and
I will explore my data along those lines.
