Dear asiatechnicals-ga,
Thank you for your exceptionally well-articulated question. I will
write the answer, but not in Excel notation since I do not use Excel
normally and would probably use erroneous syntax. I hope you don't
mind just plain mathematical notation...
** What's wrong:
As you correctly state, the observed coefficient of determination
(cell D3), which I will denote by CD, is related to, but not equal to,
the original "Weighting" (cell D1), which I will denote by W. The
relation between them is:
W^2
CD = -------------
W^2 + (1-W)^2
( in words, in case I mess up the formatting: CD is W squared over [W
squared plus (1-W) squared] )
The only thing that's "wrong" here is your belief that what you call
the "weighting" is the same as the coefficient of determination; it is
not. I will explain how CD is computed from W, and after that I will
explain "how to fix it".
I hope you don't mind a bit of algebra:
X is the first sample
Y is the second sample, which you define as
Y = W*X + (1-W)*Z,
where Z is a random sample which is independent of Z.
Denote by sd(X) the standard deviation of X, sd(Y) that of Y etc.
The CD is defined as the square of the "correlation coefficient" (CC).
CC(X,Y) is defined as the normalized covariance between X and Y, that
is, the covariance, divided by the product of the standard deviations:
Cov(X,Y)
CC(X,Y) = -----------
sd(X)*sd(Y)
In layman's terms, the covariance measures correlation between X,Y,
but you have to choose the appropriate scale, or units of measure; the
correct units of measure are those of the standard deviations of X and
Y. This gives you as
CC(X,Y) a "pure" (=without units) measure of correlation, which is a
number between -1 and 1.
Now, to compute CC(X,Y), we write
Cov(X,Y) = Cov(X, W*X + (1-W)*Z) = W*Cov(X,X) + (1-W)*Cov(X,Z)
since Cov behaves linearly in the second argument.
Now, Cov(X,X) is just the variance of X, or the square of sd(X). And
Cov(X,Z) is 0 since X and Z are just independent random samples.
Therefore
Cov(X,Y) = W * sd(X)^2 (W times the square of sd(X))
To get CC, we have to divide by the standard deviations. Notice that
sd(Y)^2 = W^2 * sd(X)^2 + (1-W)^2 * sd(Z)^2
since variance of a sum of independent samples is the sum of the
variances.
sd(Z)=sd(X) because they are both independent random samples chosen
from the sample RAND() function, so (taking square roots in the above
equation)
sd(Y) = sd(X) * square root of [W^2 + (1-W)^2]
Therefore
Cov(X,Y) W*sd(X)^2
CC(X,Y) = ----------- = ----------------------------- =
sd(X)*sd(Y) sd(X)*sd(X)*sqrt[W^2+(1-W)^2]
W
= -----------------
sqrt[W^2+(1-W)^2]
(sqrt means square root; "^2" means squared)
Taking the square of both sides of the equation gives the formula for
CD that I wrote at the beginning.
*** How to fix it
I assume that by "fix", you mean how to change this code to generate
two random samples with given CD. Well, you thought you were
specifying CD by specifying the "weighting" W, but this led to a CD
which was related to W by the equation I wrote above. All you need to
do is, starting with a given CD, find the correct W that gives you the
coefficient of determination that you want. In other words, you want
to solve the equation
W^2
CD = -------------
W^2 + (1-W)^2
for W, when CD is given. This can be done with a little bit of
high-school algebra. I won't bother writing the calculation, (if you
want me to write it please ask) but the result is
CD + sqrt[CD*(1-CD)]
W = --------------------
2*CD - 1
except when CD=1/2, in which case W = 1/2. So what you should do is
add two lines at the beginning of the program which are, in
approximate EXCEL notation:
CoefficientDetermination (Cell E1) = 80%
Weighting (Cell D1) = ($E$1+SQRT($E$1*(1-$E$1)))/(2*$E$1-1)
Like I said I don't use Excel so I can't give you a worked example,
but it should be easy enough to implement the changes that I
suggested. If I were to try to explain in layman's terms what led to
your mistake, I would say that it was the introduction of the third
independent random sample Z which affected the correlation coefficient
(and hence the CD) in a way that you did not expect. I can't explain
it much better than that - I guess you have to stick to the formulas.
For your information, I have been working on this question for about
an hour.(It was 3-4 minutes to understand your mistake, then another
7-8 minutes to work out the algebra, and the rest of the time just to
write the solution in a readable and friendly format...) I will leave
it at your discretion to decide which rating to give me and whether
you want to tip. In any case, I hope this answer has been helpful, and
if you still need clarification please ask and I will do my best to
help.
Regards,
dannidin. |
Request for Answer Clarification by
asiatechnicals-ga
on
21 Nov 2002 18:48 PST
dannidin,
Thank you for your excellent and very rapid response. It answered all
I asked for, was easily comprehendible and you have already
comfortably earned a 5-star answer rating. I will be happy to tip
according to the charge rate promised: an additional US$10 for a half
hour well spent.
I have not yet entered a rating and closed out the question, as I need
clarification of your answer. Its beyond the scope of my original
question, but Google Answers wont allow me to request a specific
researcher. Therefore I propose to continue to pay for your time at
the current rate, through the tip. If you arent able to respond,
Ill post your 5-star rating and US$10 tip on Monday and submit the
clarification as a new question.
Clarification
=============
(1) Have I misunderstood the meaning or use of CD? I understood, from
elementary textbooks, that CD is the explanatory power of one data
series in the other.
(2) If I want to produce two paired data series, where Y was W%
explained by X, should I continue to use the weighting procedure,
normalise (adjust) it with your formula, or is there a more
appropriate technique? (Please see question below before answering)
(3) The purpose of these data series is to feed a relatively simple
Monte-Carlo simulation model I am building. I need data series bound
within 0<Y<1 as an input for the MS-Excel formula =NORMINV(Y,Mean,SD).
Your normalisation results in a data series that exceeds these
parameters. What procedure would create appropriately correlated data
series within these parameters?
(4) Any on-line resources I should read?
(5) Finally, please state your time spent. I anticipate the remaining
questions can be completed within an hour.
AsiaTechnicals
|
Clarification of Answer by
dannidin-ga
on
22 Nov 2002 04:01 PST
Hi asiatechnicals,
I didn't answer all of your additional questions. It is possible that
I lack the necessary knowledge to answer (2),(3) - see below. In any
case it is not necessary for you to tip me extra, as your questions
are well within the scope of the original one. If you absolutely
insist, then this clarification has taken me about 30 minutes to
write.
(1) No, you have not misunderstood the definition of CD. It is ONE WAY
TO MEASURE the explanatory power of one data series over another, and
a rather natural way at that. But one thing you have to remember,
saying "Y is W% explained by X" is just our way of thinking about the
precise mathematical statement "Y and X have coefficient of
determination W". People tend to translate mathematical statements
into natural language, and there is some loss of precision in the
process. Many misunderstandings occur that way (your "mistake" that
prompted your original question is a good example). I tend to be wary
of "vague" statements such as
"X explains 80% of the value of Y"
"the standard of living has increased by 5% over the last six months"
"smoking cigarettes increases your chances of having cancer by 25%"
etc., which you often hear on the news. Remember, there is always a
precise meaning to these statements, which can be rather hard to put
into one concise statement. But newspapers (and people) like short and
catchy headlines...
(2),(3) This problem seems to be Excel-related, and I'm not sure I
understand it. Is your problem simply that Y is not between 0 and 1? A
simple linear scaling will solve this problem, e.g. Y_new = a*(Y+b)
where b=minimal value of Y and a = 1/(maximal value of (Y+b)). Or is
your problem that Y does not have s.d. 1? If you clarify the
Excel-related stuff I'll try to answer this.
(one thing I'll need to know, for instance, is what kind of numbers
does the RAND() function give? Is it numbers between 0 and 1? Or
random integers between 1 and some fixed number?)
(4) Good question. I answered the question from personal knowledge so
I didn't think to refer you to any online sources. Doing a (google)
search on "coefficient of determination" gives many results, most of
which have a lot of statistics mumbo-jumbo which I don't like very
much (for reasons explained in (1)). But check out
http://www.richland.cc.il.us/james/lecture/m113/coefficient_determination.html
http://stat.tamu.edu/stat30x/notes/node47.html
which might interest you. Also, a wonderful on-line source for all
things mathematical is
Eric Weisstein's World of Mathematics
http://www.math.wolfram.com
which has an entry (not too enlightening, if you ask my opinion) on
the correlation coefficient
http://mathworld.wolfram.com/CorrelationCoefficient.html
Finally, check out the Google Directory listing for statistics
http://directory.google.com/Top/Science/Math/Statistics/?tc=1
which has many links to on-line resources on statistics.
Hope this helps,
dannidin
|