Q: calculate the statistical uniqueness of data (Answered, 5 out of 5 stars, 1 Comment)
Question  
Subject: calculate the statistical uniqueness of data
Category: Science > Math
Asked by: gw-ga
List Price: $5.00
Posted: 17 Jan 2003 12:41 PST
Expires: 16 Feb 2003 12:41 PST
Question ID: 144853
I imagine this must be a common problem in statistics.

Suppose there are N statistical samples.  For the purposes of this
question it should not matter what they represent.

I can easily calculate the correlation coefficient between each pair
of samples via linear least-squares regression, so I can create an NxN
triangular matrix with N * (N - 1) / 2 correlation coefficients (not
counting the diagonal of all 1's):

     0     1     2     3     4
  +-----+-----+-----+-----+-----+
0 | 1.0 |     |     |     |     |
  +-----+-----+-----+-----+-----+
1 | r01 | 1.0 |     |     |     |
  +-----+-----+-----+-----+-----+
2 | r02 | r12 | 1.0 |     |     |
  +-----+-----+-----+-----+-----+
3 | r03 | r13 | r23 | 1.0 |     |
  +-----+-----+-----+-----+-----+
4 | r04 | r14 | r24 | r34 | 1.0 |
  +-----+-----+-----+-----+-----+

Fig. 1: correlation coefficient matrix for N = 5 samples.
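
A minimal sketch of how the matrix in Fig. 1 might be filled in, using
Python as executable pseudo-code.  It assumes each sample is an
equal-length list of numbers with non-zero variance.  The names
pearson and correlation_matrix are illustrative only, and the full
symmetric matrix is stored (not just the lower triangle) to keep later
indexing simple.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    # A constant sample has zero variance and would divide by zero here.
    return sxy / math.sqrt(sxx * syy)

def correlation_matrix(samples):
    """N x N symmetric matrix of pairwise correlations, diagonal 1.0."""
    n = len(samples)
    r = [[1.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i):  # N * (N - 1) / 2 distinct pairs
            r[i][j] = r[j][i] = pearson(samples[i], samples[j])
    return r

# Sample 1 is a scaled copy of sample 0; sample 2 is sample 0 reversed.
samples = [[1, 2, 3, 4], [2, 4, 6, 8], [4, 3, 2, 1]]
for row in correlation_matrix(samples):
    print(row)
```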

What I would like is a function or algorithm to examine these
correlation coefficients and calculate some measure of the
"uniqueness" of each sample.

For example, if all the samples are nearly identical (each pair of
samples has a correlation coefficient near 1.0) then each would have a
uniqueness somewhere near 1/N.  If one sample is not correlated to any
other sample, then it would have a uniqueness near 1.0.  The
uniqueness value for any given sample would always be greater than 0
and less than or equal to 1.

The ideal answer will contain a function in pseudo-code, Pascal, C, or
BASIC.

Clarification of Question by gw-ga on 17 Jan 2003 13:44 PST
For sake of argument, we can imagine that each "sample" mentioned in
my original question is an array containing the percent change in
volume (Y-value) from one moment to the next of an audio stream (the
array index or time index is the X-value).  If several audio streams
are derived from the same source, they will have high correlation
coefficients and low measures of uniqueness.

Request for Question Clarification by jeremymiles-ga on 17 Jan 2003 13:50 PST
Have you considered the tolerance of each sample, or some function of
it?  This is 1-R^2 for each variable, when all other variables are
used as predictors of it.  If the values in a sample could be
predicted perfectly from the other samples, its tolerance is zero.

If you haven't considered this, I think it might solve your problem,
and I will post an answer.  If you have, I will keep thinking,

jeremymiles-ga

Request for Question Clarification by jeremymiles-ga on 17 Jan 2003 13:52 PST
You wrote:
"For example, if all the samples are nearly identical (each pair of
samples has a correlation coefficient near 1.0) then each would have a
uniqueness somewhere near 1/N."

I would have thought it has a uniqueness near to zero, rather than
near to 0.2.  Why does the number of samples alter the uniqueness of
the sample?

Clarification of Question by gw-ga on 17 Jan 2003 14:42 PST
Using the tolerance might work.

The uniqueness value will be used to weight the samples so that each
input pattern (group of highly-correlated inputs) will get roughly
equal attention in the processing that happens later on.  That is why
I suggested 1/N when all signals are equal, or more generally, 1/M
when M signals are equal and none match any of the other (N-M)
signals.

So ideally, the sum of the uniqueness values for any group of similar
inputs would not get too low or they'll be ignored.  A group of ten
inputs that are equal should have about the same combined weight as
one input that is distinct from the other ten, all other things being
equal.
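
To make the trivial cases above concrete, here is one simple heuristic
that happens to reproduce them.  It is not something proposed in this
thread, only an illustrative assumption: weight each input by the
reciprocal of the sum of its squared correlations with every input,
itself included.  The function name redundancy_weights is
hypothetical, and r is assumed to be the full symmetric correlation
matrix.

```python
def redundancy_weights(r):
    """Weight for each input: 1 / (sum of squared correlations in its row).

    The row sum includes the diagonal 1.0, so every weight lies in (0, 1].
    """
    return [1.0 / sum(v * v for v in row) for row in r]

# Three identical inputs plus one input unrelated to the rest.
r = [
    [1.0, 1.0, 1.0, 0.0],
    [1.0, 1.0, 1.0, 0.0],
    [1.0, 1.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
]
weights = redundancy_weights(r)
print(weights)  # the three identical inputs get 1/3 each, the last gets 1.0
```

With M identical inputs that match nothing else, each row sums to M,
so each weight is 1/M and the group's combined weight is 1, the same
as a single distinct input.  For partially correlated inputs this is
only a rough measure.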

I apologize that it's difficult to give an exact definition of
uniqueness--that is, after all, what I'm hoping to get from you.

Clarification of Question by gw-ga on 17 Jan 2003 15:03 PST
Of course, when I mentioned the 1/M value, that was not intended to be
an exact value, but an example for a very trivial case.

Request for Question Clarification by jeremymiles-ga on 18 Jan 2003 02:20 PST
I have two possible solutions, and I will post them both.  However,
my knowledge of programming isn't up to giving you complete
pseudo-code for the tricky bits, so I will first check whether you are
able to program them yourself, or to find (and understand) existing
code for them.

The first thing that you will need is matrix
inversion/multiplication.  The problem is simplified because the
matrices are always symmetric.  There is source code available to do
this (e.g. in the book "Numerical Recipes in C").
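
As a rough sketch of the inversion step (Python as executable
pseudo-code, not the routine from the book mentioned above), here is
Gauss-Jordan elimination with partial pivoting.  The function name
invert and the singularity threshold of 1e-12 are assumptions chosen
for illustration.  It does not exploit symmetry.

```python
def invert(matrix):
    """Return the inverse of a small square matrix (list of lists)."""
    n = len(matrix)
    # Augment each row with the matching row of the identity matrix.
    aug = [
        [float(v) for v in row] + [1.0 if i == j else 0.0 for j in range(n)]
        for i, row in enumerate(matrix)
    ]
    for col in range(n):
        # Partial pivoting: bring the largest remaining entry to the diagonal.
        pivot = max(range(col, n), key=lambda row: abs(aug[row][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular or nearly singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        # Eliminate this column from every other row.
        for row in range(n):
            if row != col:
                factor = aug[row][col]
                aug[row] = [v - factor * w for v, w in zip(aug[row], aug[col])]
    # The right-hand half now holds the inverse.
    return [row[n:] for row in aug]

print(invert([[1.0, 0.7], [0.7, 1.0]]))
# approximately [[1.9608, -1.3725], [-1.3725, 1.9608]]
```

Note that a correlation matrix containing two perfectly correlated
samples is singular, so a check like the one above is worth keeping.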

The second thing is an iterative solver of some sort.  If you are
happy for the code to be pretty inefficient, this isn't too hard to
write; if you want to make nicer code, you need to use something like
the Newton-Raphson algorithm.  Again, you can find the algorithms on
the web.

Some programs, such as Excel, (I believe) Matlab, and Mx (that's
freeware) can do all of these things for you.  However, you would have
to read the results into your program, and if you want to automate the
process, or do it 'on the fly', this isn't going to work.

jeremymiles-ga

Clarification of Question by gw-ga on 18 Jan 2003 09:12 PST
Your solutions sound plausible and I should be able to program them.

And yes, it is most definitely my intention to automate the process.
Answer  
Subject: Re: calculate the statistical uniqueness of data
Answered By: jeremymiles-ga on 19 Jan 2003 12:56 PST
Rated: 5 out of 5 stars
 
OK, here goes. 
The suggestion that I have is to use the tolerance, which is 1-R^2 for
each variable, when all of the others are used as predictors in a
(multiple) regression equation.
You need to do this calculation for every variable.  Pick one
variable, call it the outcome, call the rest the predictors.

     0     1     2     3     4
  +-----+-----+-----+-----+-----+ 
0 | 1.0 |     |     |     |     | 
  +-----+-----+-----+-----+-----+ 
1 | r01 | 1.0 |     |     |     | 
  +-----+-----+-----+-----+-----+ 
2 | r02 | r12 | 1.0 |     |     | 
  +-----+-----+-----+-----+-----+ 
3 | r03 | r13 | r23 | 1.0 |     | 
  +-----+-----+-----+-----+-----+ 
4 | r04 | r14 | r24 | r34 | 1.0 | 
  +-----+-----+-----+-----+-----+ 

Call the matrix of correlations amongst the predictors Rxx, and call
the vector of correlations between the outcome and each predictor Rxy.
We want to calculate the vector B, which has one element for each of
the predictors:
B = Rxx^-1 * Rxy
where Rxx^-1 is the inverse of the matrix Rxx.
Each element of B is multiplied by the corresponding element of Rxy,
and the sum of these is found.  This gives the value of R^2 for that
variable, which is a measure of the amount of variance (information)
in that variable that is shared with the other variables.  The
tolerance is 1 - R^2.
An example:
Here is our original matrix:
1	0.7	0.5
0.7	1	0.6
0.5	0.6	1

Take sample 3 as the outcome, so that Rxx is the 2x2 matrix of
correlations between samples 1 and 2, and Rxy = (0.5, 0.6):
Rxx^-1 =
 1.960784  -1.37255
-1.37255    1.960784
Rxx^-1 * Rxy = B =
 0.157
 0.490

Rxy1 * B1 = 0.5 * 0.157 = 0.078431373
(Where Rxy1 is the first element of the vector Rxy, and B1 is the
first element of the vector B).
Rxy2 * B2 = 0.6 * 0.490 = 0.294117647

The sum of these two is R^2 = 0.373.

So 37.3% of the information in the last column could have been
obtained from the first two columns.  The tolerance of the last sample
is 1 - R^2 = 0.627, so 62.7% of the variance (information) is unique.
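
The procedure above can be sketched in Python (used here as
executable pseudo-code).  This is one possible implementation, not
the only one: instead of forming Rxx^-1 explicitly, it solves
Rxx * B = Rxy with a Cholesky factorisation, which relies on the
matrices being symmetric.  The names cholesky_solve and tolerance are
illustrative, and r is assumed to be the full symmetric correlation
matrix.

```python
import math

def cholesky_solve(a, b):
    """Solve a * x = b for a symmetric positive-definite matrix a."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = a[i][j] - sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                if s <= 0.0:
                    # Happens when predictors are perfectly correlated.
                    raise ValueError("matrix is not positive definite")
                low[i][j] = math.sqrt(s)
            else:
                low[i][j] = s / low[j][j]
    # Forward substitution: low * y = b
    y = [0.0] * n
    for i in range(n):
        y[i] = (b[i] - sum(low[i][k] * y[k] for k in range(i))) / low[i][i]
    # Back substitution: low^T * x = y
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (y[i] - sum(low[k][i] * x[k] for k in range(i + 1, n))) / low[i][i]
    return x

def tolerance(r, outcome):
    """1 - R^2 for one variable, with all the others as predictors."""
    idx = [i for i in range(len(r)) if i != outcome]
    rxx = [[r[i][j] for j in idx] for i in idx]
    rxy = [r[i][outcome] for i in idx]
    b = cholesky_solve(rxx, rxy)
    r_squared = sum(bi * ri for bi, ri in zip(b, rxy))
    return 1.0 - r_squared

r = [[1.0, 0.7, 0.5],
     [0.7, 1.0, 0.6],
     [0.5, 0.6, 1.0]]
print([round(tolerance(r, k), 3) for k in range(3)])
# approximately [0.5, 0.427, 0.627]
```

The last value is the tolerance of sample 3 from the worked example;
the first two repeat the calculation with samples 1 and 2 as the
outcome.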

You could find all this by searching for: "multiple regression"
"matrix algebra"
://www.google.com/search?q=%22matrix+algebra%22+%22multiple+regression%22
(I perhaps shouldn't admit that I didn't actually look it up.  I did
use a computer for the calculations, though.)
Mathtalk also suggests PCA as an alternative.  This might be useful,
but takes a different perspective: it takes the variance that is
shared amongst all of your measures, and looks at the correlation
between each variable and the other shared variables.  This is very
similar to the second approach that I was going to suggest,
confirmatory factor analysis, which is much easier to program (IMHO,
but I am not much of a programmer).  However, the problem with this is
that if two measures are highly related, and a third one is not, you
will end up with something saying that very little variance is shared
amongst all of your measures.  If you are interested, I can point you
towards some links.
gw-ga rated this answer: 5 out of 5 stars
Thanks for all the time you put into this.

Comments  
Subject: Re: calculate the statistical uniqueness of data
From: mathtalk-ga on 18 Jan 2003 18:52 PST
 
Hi, gw-ga:

I'm not clear about the intended application, but you might find
"principal components analysis" a useful tool.  Basically the
principal components are the basis of eigenvectors of the symmetric
correlation or (more commonly) the covariance matrix of a data set. 
The components (eigenvectors) can be ordered by the sizes of their
respective eigenvalues.  See, for example:

http://obelia.jde.aca.mmu.ac.uk/multivar/pca.htm

If all samples are "alike" (perfectly correlated), the correlation
matrix is all ones, and thus has one principal component whose
eigenvalue is positive (the rest of the eigenvalues are zero, as the
correlation matrix is rank one).

If all samples are "independent" (perfectly uncorrelated), the
correlation matrix is the identity matrix, and thus has N principal
components whose eigenvalues are positive (eigenvalues one).
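
The two extreme cases above can be checked numerically.  The
following is a minimal sketch in Python (as executable pseudo-code)
using the classical Jacobi rotation method for symmetric matrices.
The function name symmetric_eigenvalues and its sweeps and tol
parameters are assumptions made for illustration; a library
eigenvalue routine would normally be used instead.

```python
import math

def symmetric_eigenvalues(matrix, sweeps=50, tol=1e-20):
    """Eigenvalues of a symmetric matrix, largest first (Jacobi method)."""
    a = [[float(v) for v in row] for row in matrix]
    n = len(a)
    for _ in range(sweeps):
        # Stop once the off-diagonal part is negligible.
        off = sum(a[i][j] ** 2 for i in range(n) for j in range(n) if i != j)
        if off < tol:
            break
        for p in range(n - 1):
            for q in range(p + 1, n):
                if abs(a[p][q]) < 1e-15:
                    continue
                # Rotation angle that zeroes a[p][q], kept within +/- pi/4.
                diff = a[q][q] - a[p][p]
                if diff == 0.0:
                    theta = math.copysign(math.pi / 4.0, a[p][q])
                else:
                    theta = 0.5 * math.atan(2.0 * a[p][q] / diff)
                c, s = math.cos(theta), math.sin(theta)
                # Apply the rotation to columns p and q ...
                for k in range(n):
                    akp, akq = a[k][p], a[k][q]
                    a[k][p] = c * akp - s * akq
                    a[k][q] = s * akp + c * akq
                # ... and then to rows p and q.
                for k in range(n):
                    apk, aqk = a[p][k], a[q][k]
                    a[p][k] = c * apk - s * aqk
                    a[q][k] = s * apk + c * aqk
    return sorted((a[i][i] for i in range(n)), reverse=True)

n = 5
all_ones = [[1.0] * n for _ in range(n)]
identity = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
print(symmetric_eigenvalues(all_ones))  # one eigenvalue near 5, the rest near 0
print(symmetric_eigenvalues(identity))  # five eigenvalues equal to 1
```

In both cases the eigenvalues sum to N, the trace of the correlation
matrix; what differs is how that total is spread across components.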

regards, mathtalk-ga
