Google Answers Logo
View Question
 
Q: Simulation of two correlated data series ( Answered 5 out of 5 stars,   0 Comments )
Question  
Subject: Simulation of two correlated data series
Category: Science > Math
Asked by: asiatechnicals-ga
List Price: $14.00
Posted: 20 Nov 2002 19:13 PST
Expires: 20 Dec 2002 19:13 PST
Question ID: 111686
I am simply trying to model two correlated data series.  

Please forgive the mechanical and highly-structured nature of this
question.  I am trying to make myself as clear as possible, so as to
avoid wasting your time with endless clarifications.

Anyway, it should be fun (for the more twisted of us), quick and easy
to give a definitive answer.

Clarifications are welcome, if needed.

- AsiaTechnicals





Background
==========

I am building a simulation model.  I wish to generate two data series
in MS-Excel with a given correlation of deviation from the mean.

Specifically:

Both series are stationary (the Mean does not change with progression
of the dataseries)

Both series are psuedo-random, with white noise (uniform distribution)

Series Y is partially explained by series X

Where:

The explanatory power of X is given by a coefficient of determination,
constrained by -1<CD<1



Theory
======

Coefficient of determination (Y,X)
= The explanatory power of X in Y
= Correlation (Y,X) squared


Experiment
==========

In Excel, I used a weighted-average of two random numbers to test this
empirically

Weighting (Cell D1) =80%

Data series X (Cell A1) =RAND()
Data series Y (Cell B1) =(A1*$D$1)+(RAND()*(1-$D$1))

Copy cells A1:B1 down 1000 rows (A1:B1000)

Correlation (Y,X) (Cell D2) =CORREL(B1:B1000,A1:A1000)
Coefficient of determination (Y,X) (Cell D3) =D2^2

So, if my understanding of the theory is correct, D3 should equal D1
for any weighting.  Instead, there is a curved relationship between
the two.

I have checked MS-Excel's CORREL function.  It is correct in this
case.


Question
========

What's wrong with my methodology and how can I put it right?


Scoring
=======

I will pay the for an answer that scores at least 2 stars.  I will
score your answer as follows:

** What's wrong
*** What should I do to fix it (or plausible explanation if my method
is right)
**** A worked example that gets it right
***** A worked example that gets it right and gives the theory in
layman's terms

I assume the answer will require 30 mins to someone familiar with
Statistics and MS-Excel.  At US$20 per hour after commission, I have
priced my question at US$14.  If the appointed Answerer requires more,
please post a clarification with your price. I am willing to pay
extra, within reason.
Answer  
Subject: Re: Simulation of two correlated data series
Answered By: dannidin-ga on 21 Nov 2002 07:29 PST
Rated:5 out of 5 stars
 
Dear asiatechnicals-ga,

Thank you for your exceptionally well-articulated question. I will
write the answer, but not in Excel notation since I do not use Excel
normally and would probably use erroneous syntax. I hope you don't
mind just plain mathematical notation...

** What's wrong:

As you correctly state, the observed coefficient of determination
(cell D3), which I will denote by CD, is related to, but not equal to,
the original "Weighting" (cell D1), which I will denote by W. The
relation between them is:
               W^2
      CD = -------------
           W^2 + (1-W)^2

( in words, in case I mess up the formatting: CD is W squared over [W
squared plus (1-W) squared] )

The only thing that's "wrong" here is your belief that what you call
the "weighting" is the same as the coefficient of determination; it is
not. I will explain how CD is computed from W, and after that I will
explain "how to fix it".

I hope you don't mind a bit of algebra:

X is the first sample
Y is the second sample, which you define as

Y = W*X + (1-W)*Z,

where Z is a random sample which is independent of Z.

Denote by sd(X) the standard deviation of X, sd(Y) that of Y etc.

The CD is defined as the square of the "correlation coefficient" (CC).

CC(X,Y) is defined as the normalized covariance between X and Y, that
is, the covariance, divided by the product of the standard deviations:

           Cov(X,Y)
CC(X,Y) = -----------
          sd(X)*sd(Y)

In layman's terms, the covariance measures correlation between X,Y,
but you have to choose the appropriate scale, or units of measure; the
correct units of measure are those of the standard deviations of X and
Y. This gives you as
CC(X,Y) a "pure" (=without units) measure of correlation, which is a
number between -1 and 1.

Now, to compute CC(X,Y), we write

Cov(X,Y) = Cov(X, W*X + (1-W)*Z) = W*Cov(X,X) + (1-W)*Cov(X,Z)

since Cov behaves linearly in the second argument.
Now, Cov(X,X) is just the variance of X, or the square of sd(X). And
Cov(X,Z) is 0 since X and Z are just independent random samples.
Therefore

Cov(X,Y) = W * sd(X)^2  (W times the square of sd(X))

To get CC, we have to divide by the standard deviations. Notice that

sd(Y)^2 = W^2 * sd(X)^2 + (1-W)^2 * sd(Z)^2

since variance of a sum of independent samples is the sum of the
variances.
sd(Z)=sd(X) because they are both independent random samples chosen
from the sample RAND() function, so (taking square roots in the above
equation)

sd(Y) = sd(X) * square root of [W^2 + (1-W)^2]

Therefore
           Cov(X,Y)              W*sd(X)^2
CC(X,Y) = ----------- = ----------------------------- =
          sd(X)*sd(Y)   sd(X)*sd(X)*sqrt[W^2+(1-W)^2]


                               W
                    =  -----------------
                       sqrt[W^2+(1-W)^2]

(sqrt means square root; "^2" means squared)

Taking the square of both sides of the equation gives the formula for
CD that I wrote at the beginning.


*** How to fix it

I assume that by "fix", you mean how to change this code to generate
two random samples with given CD. Well, you thought you were
specifying CD by specifying the "weighting" W, but this led to a CD
which was related to W by the equation I wrote above. All you need to
do is, starting with a given CD, find the correct W that gives you the
coefficient of determination that you want. In other words, you want
to solve the equation
              W^2
    CD =  -------------
          W^2 + (1-W)^2

for W, when CD is given. This can be done with a little bit of
high-school algebra. I won't bother writing the calculation, (if you
want me to write it please ask) but the result is

         CD + sqrt[CD*(1-CD)]
    W =  --------------------
              2*CD - 1

except when CD=1/2, in which case W = 1/2. So what you should do is
add two lines at the beginning of the program which are, in
approximate EXCEL notation:

CoefficientDetermination (Cell E1) = 80%
Weighting (Cell D1) = ($E$1+SQRT($E$1*(1-$E$1)))/(2*$E$1-1)

Like I said I don't use Excel so I can't give you a worked example,
but it should be easy enough to implement the changes that I
suggested. If I were to try to explain in layman's terms what led to
your mistake, I would say that it was the introduction of the third
independent random sample Z which affected the correlation coefficient
(and hence the CD) in a way that you did not expect. I can't explain
it much better than that - I guess you have to stick to the formulas.

For your information, I have been working on this question for about
an hour.(It was 3-4 minutes to understand your mistake, then another
7-8 minutes to work out the algebra, and the rest of the time just to
write the solution in a readable and friendly format...) I will leave
it at your discretion to decide which rating to give me and whether
you want to tip. In any case, I hope this answer has been helpful, and
if you still need clarification please ask and I will do my best to
help.

Regards,
dannidin.

Request for Answer Clarification by asiatechnicals-ga on 21 Nov 2002 18:48 PST
dannidin,

Thank you for your excellent and very rapid response.  It answered all
I asked for, was easily comprehendible and you have already
comfortably earned a 5-star answer rating.  I will be happy to tip
according to the charge rate promised: an additional US$10 for a half
hour well spent.

I have not yet entered a rating and closed out the question, as I need
clarification of your answer.  It’s beyond the scope of my original
question, but Google Answers won’t allow me to request a specific
researcher.  Therefore I propose to continue to pay for your time at
the current rate, through the tip.  If you aren’t able to respond,
I’ll post your 5-star rating and US$10 tip on Monday and submit the
clarification as a new question.

Clarification
=============

(1) Have I misunderstood the meaning or use of CD?  I understood, from
elementary textbooks, that CD is the explanatory power of one data
series in the other.

(2) If I want to produce two paired data series, where Y was W%
explained by X, should I continue to use the weighting procedure,
normalise (adjust) it with your formula, or is there a more
appropriate technique? (Please see question below before answering)

(3) The purpose of these data series is to feed a relatively simple
Monte-Carlo simulation model I am building.  I need data series bound
within 0<Y<1 as an input for the MS-Excel formula =NORMINV(Y,Mean,SD).
 Your normalisation results in a data series that exceeds these
parameters.  What procedure would create appropriately correlated data
series within these parameters?

(4) Any on-line resources I should read?

(5) Finally, please state your time spent.  I anticipate the remaining
questions can be completed within an hour.

AsiaTechnicals

Clarification of Answer by dannidin-ga on 22 Nov 2002 04:01 PST
Hi asiatechnicals,

I didn't answer all of your additional questions. It is possible that
I lack the necessary knowledge to answer (2),(3) - see below. In any
case it is not necessary for you to tip me extra, as your questions
are well within the scope of the original one. If you absolutely
insist, then this clarification has taken me about 30 minutes to
write.

(1) No, you have not misunderstood the definition of CD. It is ONE WAY
TO MEASURE the explanatory power of one data series over another, and
a rather natural way at that. But one thing you have to remember,
saying "Y is W% explained by X" is just our way of thinking about the
precise mathematical statement "Y and X have coefficient of
determination W". People tend to translate mathematical statements
into natural language, and there is some loss of precision in the
process. Many misunderstandings occur that way (your "mistake" that
prompted your original question is a good example). I tend to be wary
of "vague" statements such as

"X explains 80% of the value of Y"
"the standard of living has increased by 5% over the last six months"
"smoking cigarettes increases your chances of having cancer by 25%"

etc., which you often hear on the news. Remember, there is always a
precise meaning to these statements, which can be rather hard to put
into one concise statement. But newspapers (and people) like short and
catchy headlines...

(2),(3) This problem seems to be Excel-related, and I'm not sure I
understand it. Is your problem simply that Y is not between 0 and 1? A
simple linear scaling will solve this problem, e.g. Y_new = a*(Y+b)
where b=minimal value of Y and a = 1/(maximal value of (Y+b)). Or is
your problem that Y does not have s.d. 1? If you clarify the
Excel-related stuff I'll try to answer this.
(one thing I'll need to know, for instance, is what kind of numbers
does the RAND() function give? Is it numbers between 0 and 1? Or
random integers between 1 and some fixed number?)

(4) Good question. I answered the question from personal knowledge so
I didn't think to refer you to any online sources. Doing a (google)
search on "coefficient of determination" gives many results, most of
which have a lot of statistics mumbo-jumbo which I don't like very
much (for reasons explained in (1)). But check out

http://www.richland.cc.il.us/james/lecture/m113/coefficient_determination.html
http://stat.tamu.edu/stat30x/notes/node47.html

which might interest you. Also, a wonderful on-line source for all
things mathematical is

Eric Weisstein's World of Mathematics
http://www.math.wolfram.com

which has an entry (not too enlightening, if you ask my opinion) on
the correlation coefficient

http://mathworld.wolfram.com/CorrelationCoefficient.html

Finally, check out the Google Directory listing for statistics
http://directory.google.com/Top/Science/Math/Statistics/?tc=1

which has many links to on-line resources on statistics.

Hope this helps,
dannidin

Request for Answer Clarification by asiatechnicals-ga on 22 Nov 2002 05:11 PST
<<Comment, not a clarification request; response unnecessary>>

Thanks very much for getting back to me so quickly.

FYI, RAND() produces a uniformly distributed psuedorandom number
between 0 and 1.  That's just what I need to input into
NORMINV(Percentile,Mean,SD), which converts it to a normally
distributed number.

The weighting method does not produce an appropriate correlation, but
it does keep the result within bounds.  I will apply a linear scalar
as you suggested, which will produce the data I need.  It's a bit
contrived, but it's practical and adequate for my purposes.  I will
close out this question now and hand over your rating and tip.

Given the professional quality of the responses I have seen, I aim to
offer interesting, adequately priced questions with specific
objectives and scoring.  If you see more of my questions posted, I
hope you'll take time to look them over.

Kindest regards,

AsiaTechnicals

Clarification of Answer by dannidin-ga on 24 Nov 2002 00:46 PST
thanks man ;-)
asiatechnicals-ga rated this answer:5 out of 5 stars and gave an additional tip of: $20.00
Thank you for your complete answer and rapid response to my question
and clarification.  I enclose a tip equivalent to an extra hour's work
well spent.

Comments  
There are no comments at this time.

Important Disclaimer: Answers and comments provided on Google Answers are general information, and are not intended to substitute for informed professional medical, psychiatric, psychological, tax, legal, investment, accounting, or other professional advice. Google does not endorse, and expressly disclaims liability for any product, manufacturer, distributor, service or service provider mentioned or any opinion expressed in answers or comments. Please read carefully the Google Answers Terms of Service.

If you feel that you have found inappropriate content, please let us know by emailing us at answers-support@google.com with the question ID listed above. Thank you.
Search Google Answers for
Google Answers  


Google Home - Answers FAQ - Terms of Service - Privacy Policy