Google Answers: Simulation of two correlated data series

Q: Simulation of two correlated data series (Answered, 5 out of 5 stars, 0 Comments)

Question

Subject: Simulation of two correlated data series
Category: Science > Math
Asked by: asiatechnicals-ga
List Price: $14.00

Posted: 20 Nov 2002 19:13 PST
Expires: 20 Dec 2002 19:13 PST
Question ID: 111686

I am simply trying to model two correlated data series.  

Please forgive the mechanical and highly-structured nature of this
question.  I am trying to make myself as clear as possible, so as to
avoid wasting your time with endless clarifications.

Anyway, it should be fun (for the more twisted of us), quick and easy
to give a definitive answer.

Clarifications are welcome, if needed.

- AsiaTechnicals





Background
==========

I am building a simulation model.  I wish to generate two data series
in MS-Excel with a given correlation of deviation from the mean.

Specifically:

Both series are stationary (the Mean does not change with progression
of the dataseries)

Both series are pseudo-random, with white noise (uniform distribution)

Series Y is partially explained by series X

Where:

The explanatory power of X is given by a coefficient of determination,
constrained by 0<=CD<=1 (it is the square of a correlation that lies
between -1 and 1)



Theory
======

Coefficient of determination (Y,X)
= The explanatory power of X in Y
= Correlation (Y,X) squared


Experiment
==========

In Excel, I used a weighted-average of two random numbers to test this
empirically

Weighting (Cell D1) =80%

Data series X (Cell A1) =RAND()
Data series Y (Cell B1) =(A1*$D$1)+(RAND()*(1-$D$1))

Copy cells A1:B1 down 1000 rows (A1:B1000)

Correlation (Y,X) (Cell D2) =CORREL(B1:B1000,A1:A1000)
Coefficient of determination (Y,X) (Cell D3) =D2^2

So, if my understanding of the theory is correct, D3 should equal D1
for any weighting.  Instead, there is a curved relationship between
the two.

I have checked MS-Excel's CORREL function.  It is correct in this
case.
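
For readers without Excel, the experiment above can be reproduced with a short Python sketch. This is only an illustration under stated assumptions: Python's random.random() stands in for RAND(), the hand-written pearson() helper stands in for CORREL, and the function names and the seed are hypothetical choices, not part of the original spreadsheet.

```python
import random
from math import sqrt

def pearson(xs, ys):
    """Sample Pearson correlation, standing in for Excel's CORREL."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

def observed_cd(weight, rows=1000, seed=0):
    """Build the two columns row by row and return CORREL(Y, X) squared."""
    rng = random.Random(seed)  # fixed seed (an assumption) so runs are repeatable
    xs, ys = [], []
    for _ in range(rows):
        x = rng.random()                                   # cell A: =RAND()
        y = x * weight + rng.random() * (1 - weight)       # cell B: weighted average
        xs.append(x)
        ys.append(y)
    return pearson(xs, ys) ** 2

# Compare the weighting (D1) with the observed coefficient of determination (D3).
for w in (0.2, 0.5, 0.8):
    print(f"weighting = {w:.0%}   observed CD = {observed_cd(w):.3f}")
```

Running it for several weightings shows the same curved relationship between the weighting and the observed coefficient of determination that the spreadsheet produces.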


Question
========

What's wrong with my methodology and how can I put it right?


Scoring
=======

I will pay for an answer that scores at least 2 stars.  I will
score your answer as follows:

** What's wrong
*** What should I do to fix it (or plausible explanation if my method
is right)
**** A worked example that gets it right
***** A worked example that gets it right and gives the theory in
layman's terms

I assume the answer will require 30 mins to someone familiar with
Statistics and MS-Excel.  At US$20 per hour after commission, I have
priced my question at US$14.  If the appointed Answerer requires more,
please post a clarification with your price. I am willing to pay
extra, within reason.

Answer

Subject: Re: Simulation of two correlated data series
Answered By: dannidin-ga on 21 Nov 2002 07:29 PST
Rated: 5 out of 5 stars

Dear asiatechnicals-ga,

Thank you for your exceptionally well-articulated question. I will write the answer, but not in Excel notation, since I do not normally use Excel and would probably get the syntax wrong. I hope you don't mind plain mathematical notation...

** What's wrong:

As you correctly state, the observed coefficient of determination (cell D3), which I will denote by CD, is related to, but not equal to, the original "Weighting" (cell D1), which I will denote by W. The relation between them is:

    CD = W^2 / [W^2 + (1-W)^2]

(in words, in case I mess up the formatting: CD is W squared over [W squared plus (1-W) squared])

The only thing that's "wrong" here is your belief that what you call the "weighting" is the same as the coefficient of determination; it is not. I will explain how CD is computed from W, and after that I will explain "how to fix it".

I hope you don't mind a bit of algebra:

X is the first sample.
Y is the second sample, which you define as Y = WX + (1-W)Z, where Z is a random sample which is independent of X.

Denote by sd(X) the standard deviation of X, sd(Y) that of Y, etc. The CD is defined as the square of the "correlation coefficient" (CC). CC(X,Y) is defined as the normalized covariance between X and Y, that is, the covariance divided by the product of the standard deviations:

    CC(X,Y) = Cov(X,Y) / [sd(X) sd(Y)]

In layman's terms, the covariance measures correlation between X and Y, but you have to choose the appropriate scale, or units of measure; the correct units of measure are those of the standard deviations of X and Y. This gives you as CC(X,Y) a "pure" (= without units) measure of correlation, which is a number between -1 and 1.

Now, to compute CC(X,Y), we write

    Cov(X,Y) = Cov(X, WX + (1-W)Z) = W Cov(X,X) + (1-W) Cov(X,Z)

since Cov behaves linearly in the second argument. Now, Cov(X,X) is just the variance of X, or the square of sd(X). And Cov(X,Z) is 0 since X and Z are independent random samples.

Therefore

    Cov(X,Y) = W sd(X)^2    (W times the square of sd(X))

To get CC, we have to divide by the standard deviations. Notice that

    sd(Y)^2 = W^2 * sd(X)^2 + (1-W)^2 * sd(Z)^2

since the variance of a sum of independent samples is the sum of the variances. sd(Z) = sd(X) because both are independent random samples drawn from the same RAND() function, so (taking square roots in the above equation)

    sd(Y) = sd(X) * sqrt[W^2 + (1-W)^2]

Therefore

    CC(X,Y) = Cov(X,Y) / [sd(X) sd(Y)]
            = W sd(X)^2 / [sd(X) sd(X) sqrt[W^2 + (1-W)^2]]
            = W / sqrt[W^2 + (1-W)^2]

(sqrt means square root; "^2" means squared)

Taking the square of both sides of the equation gives the formula for CD that I wrote at the beginning.

*** How to fix it

I assume that by "fix", you mean how to change this code to generate two random samples with a given CD. Well, you thought you were specifying CD by specifying the "weighting" W, but this led to a CD which was related to W by the equation I wrote above. All you need to do is, starting with a given CD, find the W that gives you the coefficient of determination that you want. In other words, you want to solve the equation

    CD = W^2 / [W^2 + (1-W)^2]

for W, when CD is given. This can be done with a little bit of high-school algebra. I won't bother writing the calculation (if you want me to write it, please ask), but the result is

    W = (CD + sqrt[CD(1-CD)]) / (2CD - 1)

except when CD = 1/2, in which case W = 1/2.

So what you should do is add two lines at the beginning of the program which are, in approximate Excel notation:

CoefficientDetermination (Cell E1) =80%
Weighting (Cell D1) =($E$1+SQRT($E$1*(1-$E$1)))/(2*$E$1-1)

Like I said, I don't use Excel so I can't give you a worked example, but it should be easy enough to implement the changes that I suggested.

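As a hedged illustration of the inversion described above: multiplying out CD = W^2 / [W^2 + (1-W)^2] gives the quadratic (2CD-1)W^2 - 2CD*W + CD = 0, which for CD other than 1/2 has two roots, W = (CD + sqrt[CD(1-CD)]) / (2CD-1) and W = (CD - sqrt[CD(1-CD)]) / (2CD-1). The Python sketch below computes both and checks each against the forward formula. The function names are hypothetical, and it assumes, as the derivation does, that X and Z are independent with equal standard deviation.

```python
from math import sqrt, isclose

def cd_from_weight(w):
    """Theoretical CD of Y = W*X + (1-W)*Z (X, Z independent, equal sd)."""
    return w ** 2 / (w ** 2 + (1 - w) ** 2)

def weights_from_cd(cd):
    """All solutions W of cd = W^2 / (W^2 + (1-W)^2), for 0 < cd < 1."""
    if not 0 < cd < 1:
        raise ValueError("cd must lie strictly between 0 and 1")
    if isclose(cd, 0.5):
        return (0.5,)  # the quadratic term vanishes, leaving a single solution
    root = sqrt(cd * (1 - cd))
    plus = (cd + root) / (2 * cd - 1)    # the root quoted in the answer
    minus = (cd - root) / (2 * cd - 1)   # the other root of the same quadratic
    return (plus, minus)

# Check that each candidate weighting reproduces the requested CD.
for w in weights_from_cd(0.8):
    print(f"W = {w:.4f}  ->  CD = {cd_from_weight(w):.4f}")
```

For CD = 80% the two roots are W = 2 and W = 2/3. Both reproduce the requested CD, but only the second lies between 0 and 1; it equals sqrt(CD) / (sqrt(CD) + sqrt(1-CD)), so it keeps Y a true weighted average of X and Z for any CD between 0 and 1.
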
If I were to try to explain in layman's terms what led to your mistake, I would say that it was the introduction of the third independent random sample Z, which affected the correlation coefficient (and hence the CD) in a way that you did not expect. I can't explain it much better than that - I guess you have to stick to the formulas.

For your information, I have been working on this question for about an hour. (It was 3-4 minutes to understand your mistake, then another 7-8 minutes to work out the algebra, and the rest of the time just to write the solution in a readable and friendly format...) I will leave it to your discretion to decide which rating to give me and whether you want to tip.

In any case, I hope this answer has been helpful, and if you still need clarification please ask and I will do my best to help.

Regards,
dannidin.

Request for Answer Clarification by asiatechnicals-ga on 21 Nov 2002 18:48 PST

dannidin,

Thank you for your excellent and very rapid response. It answered all I asked for, was easily comprehensible, and you have already comfortably earned a 5-star answer rating. I will be happy to tip according to the charge rate promised: an additional US$10 for a half hour well spent.

I have not yet entered a rating and closed out the question, as I need clarification of your answer. It’s beyond the scope of my original question, but Google Answers won’t allow me to request a specific researcher. Therefore I propose to continue to pay for your time at the current rate, through the tip. If you aren’t able to respond, I’ll post your 5-star rating and US$10 tip on Monday and submit the clarification as a new question.

Clarification
=============

(1) Have I misunderstood the meaning or use of CD? I understood, from elementary textbooks, that CD is the explanatory power of one data series in the other.

(2) If I want to produce two paired data series, where Y is W% explained by X, should I continue to use the weighting procedure, normalise (adjust) it with your formula, or is there a more appropriate technique? (Please see question below before answering)

(3) The purpose of these data series is to feed a relatively simple Monte-Carlo simulation model I am building. I need data series bounded within 0<Y<1 as an input for the MS-Excel formula =NORMINV(Y,Mean,SD). Your normalisation results in a data series that exceeds these bounds. What procedure would create appropriately correlated data series within these bounds?

(4) Any on-line resources I should read?

(5) Finally, please state your time spent. I anticipate the remaining questions can be completed within an hour.

AsiaTechnicals

Clarification of Answer by dannidin-ga on 22 Nov 2002 04:01 PST

Hi asiatechnicals,

I didn't answer all of your additional questions. It is possible that I lack the necessary knowledge to answer (2),(3) - see below. In any case it is not necessary for you to tip me extra, as your questions are well within the scope of the original one. If you absolutely insist, then this clarification has taken me about 30 minutes to write.

(1) No, you have not misunderstood the definition of CD. It is ONE WAY TO MEASURE the explanatory power of one data series over another, and a rather natural way at that. But one thing you have to remember: saying "Y is W% explained by X" is just our way of thinking about the precise mathematical statement "Y and X have coefficient of determination W". People tend to translate mathematical statements into natural language, and there is some loss of precision in the process. Many misunderstandings occur that way (your "mistake" that prompted your original question is a good example). I tend to be wary of "vague" statements such as

"X explains 80% of the value of Y"
"the standard of living has increased by 5% over the last six months"
"smoking cigarettes increases your chances of having cancer by 25%"

etc., which you often hear on the news. Remember, there is always a precise meaning to these statements, which can be rather hard to put into one concise statement. But newspapers (and people) like short and catchy headlines...

(2),(3) This problem seems to be Excel-related, and I'm not sure I understand it. Is your problem simply that Y is not between 0 and 1? A simple linear scaling will solve this problem, e.g.

    Y_new = a*(Y+b)

where b = minus the minimal value of Y (so that the smallest value of Y+b is 0) and a = 1/(maximal value of (Y+b)). Or is your problem that Y does not have s.d. 1? If you clarify the Excel-related stuff I'll try to answer this. (One thing I'll need to know, for instance, is what kind of numbers the RAND() function gives. Is it numbers between 0 and 1? Or random integers between 1 and some fixed number?)

(4) Good question. I answered the question from personal knowledge so I didn't think to refer you to any online sources. Doing a (Google) search on "coefficient of determination" gives many results, most of which have a lot of statistics mumbo-jumbo which I don't like very much (for reasons explained in (1)). But check out

http://www.richland.cc.il.us/james/lecture/m113/coefficient_determination.html
http://stat.tamu.edu/stat30x/notes/node47.html

which might interest you. Also, a wonderful on-line source for all things mathematical is Eric Weisstein's World of Mathematics

http://www.math.wolfram.com

which has an entry (not too enlightening, if you ask my opinion) on the correlation coefficient

http://mathworld.wolfram.com/CorrelationCoefficient.html

Finally, check out the Google Directory listing for statistics

http://directory.google.com/Top/Science/Math/Statistics/?tc=1

which has many links to on-line resources on statistics.

Hope this helps,
dannidin

Request for Answer Clarification by asiatechnicals-ga on 22 Nov 2002 05:11 PST

<<Comment, not a clarification request; response unnecessary>>

Thanks very much for getting back to me so quickly.

FYI, RAND() produces a uniformly distributed pseudorandom number between 0 and 1. That's just what I need to input into NORMINV(Percentile,Mean,SD), which converts it to a normally distributed number.

The weighting method does not produce an appropriate correlation, but it does keep the result within bounds. I will apply a linear scalar as you suggested, which will produce the data I need. It's a bit contrived, but it's practical and adequate for my purposes.

I will close out this question now and hand over your rating and tip.

Given the professional quality of the responses I have seen, I aim to offer interesting, adequately priced questions with specific objectives and scoring. If you see more of my questions posted, I hope you'll take time to look them over.

Kindest regards,
AsiaTechnicals

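The linear rescaling discussed in this exchange can be sketched in Python as follows. It is a minimal illustration, not the spreadsheet either party actually built: random.random() stands in for RAND(), pearson() stands in for CORREL, W = 2 is the weighting the answer's formula gives for CD = 80%, and the function names and seed are hypothetical.

```python
import random
from math import sqrt

def pearson(xs, ys):
    """Sample Pearson correlation, standing in for Excel's CORREL."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

def rescale_to_unit_interval(values):
    """Y_new = a*(Y + b), with b = -min(Y) and a = 1/max(Y + b)."""
    b = -min(values)
    a = 1.0 / (max(values) + b)
    return [a * (v + b) for v in values]

rng = random.Random(42)   # fixed seed (an assumption) for repeatability
W = 2.0                   # gives CD = 80% but pushes Y outside 0..1
xs = [rng.random() for _ in range(1000)]
ys = [W * x + (1 - W) * rng.random() for x in xs]
scaled = rescale_to_unit_interval(ys)

print(f"raw Y range:    {min(ys):.3f} .. {max(ys):.3f}")
print(f"scaled Y range: {min(scaled):.3f} .. {max(scaled):.3f}")
print(f"CD before: {pearson(xs, ys) ** 2:.3f}   CD after: {pearson(xs, scaled) ** 2:.3f}")
```

Because the scaling is linear with a positive factor, the correlation with X is unchanged. Two caveats apply under this sketch's assumptions: the smallest and largest rescaled values land on 0 and 1, which an inverse-normal function will typically reject, and a weighted sum of two uniform numbers is not itself uniform, so the output of NORMINV would only be approximately normal.
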
Clarification of Answer by dannidin-ga on 24 Nov 2002 00:46 PST

thanks man ;-)

asiatechnicals-ga rated this answer: 5 out of 5 stars

and gave an additional tip of: $20.00

Thank you for your complete answer and rapid response to my question
and clarification.  I enclose a tip equivalent to an extra hour's work
well spent.

Comments

There are no comments at this time.
