Q: calculate the statistical uniqueness of data (Answered, 1 Comment)
 Question
Subject: calculate the statistical uniqueness of data
Category: Science > Math
Asked by: gw-ga
List Price: $5.00
Posted: 17 Jan 2003 12:41 PST
Expires: 16 Feb 2003 12:41 PST
Question ID: 144853
I imagine this must be a common problem in statistics.

Suppose there are N statistical samples. For the purposes of this question it should not matter what they represent.

I can easily calculate the correlation coefficient between each pair of samples via linear least-squares regression, so I can create an NxN triangular matrix with N * (N - 1) / 2 correlation coefficients (not counting the diagonal of all 1's):

      0     1     2     3     4
   +-----+-----+-----+-----+-----+
 0 | 1.0 |     |     |     |     |
   +-----+-----+-----+-----+-----+
 1 | r01 | 1.0 |     |     |     |
   +-----+-----+-----+-----+-----+
 2 | r02 | r12 | 1.0 |     |     |
   +-----+-----+-----+-----+-----+
 3 | r03 | r13 | r23 | 1.0 |     |
   +-----+-----+-----+-----+-----+
 4 | r04 | r14 | r24 | r34 | 1.0 |
   +-----+-----+-----+-----+-----+

Fig. 1: correlation coefficient matrix for N = 5 samples.

What I would like is a function or algorithm to examine these correlation coefficients and calculate some measure of the "uniqueness" of each sample.

For example, if all the samples are nearly identical (each pair of samples has a correlation coefficient near 1.0) then each would have a uniqueness somewhere near 1/N. If one sample is not correlated to any other sample, then it would have a uniqueness near 1.0. The uniqueness value for any given sample would always be greater than 0 and less than or equal to 1.

The ideal answer will contain a function in pseudo-code, Pascal, C, or BASIC.

Clarification of Question by gw-ga on 17 Jan 2003 13:44 PST
For sake of argument, we can imagine that each "sample" mentioned in my original question is an array containing the percent change in volume (Y-value) from one moment to the next of an audio stream (the array index or time index is the X-value). If several audio streams are derived from the same source, they will have high correlation coefficients and low measures of uniqueness.

Request for Question Clarification by jeremymiles-ga on 17 Jan 2003 13:50 PST
Have you considered the tolerance of the sample, or some function of it?
This is 1 - R^2 for each variable, when all the other variables are used as predictors of it. If the values in a sample can be predicted perfectly from the other samples, its tolerance is zero.

If you haven't considered this, I think it might solve your problem, and I will post an answer. If you have, I will keep thinking.

jeremymiles-ga

Request for Question Clarification by jeremymiles-ga on 17 Jan 2003 13:52 PST
You wrote: "For example, if all the samples are nearly identical (each pair of samples has a correlation coefficient near 1.0) then each would have a uniqueness somewhere near 1/N."

I would have thought it has a uniqueness near to zero, rather than near to 0.2. Why does the number of samples alter the uniqueness of the sample?

Clarification of Question by gw-ga on 17 Jan 2003 14:42 PST
Using the tolerance might work. The uniqueness value will be used to weight the samples so that each input pattern (group of highly-correlated inputs) will get roughly equal attention in the processing that happens later on. That is why I suggested 1/N when all signals are equal, or more generally, 1/M when M signals are equal and none match any of the other (N-M) signals. So ideally, the sum of the uniqueness values for any group of similar inputs should not get too low, or the group will be ignored. A group of ten inputs that are equal should have about the same combined weight as one input that is distinct from the other ten, all other things being equal.

I apologize that it's difficult to give an exact definition of uniqueness--that is, after all, what I'm hoping to get from you.

Clarification of Question by gw-ga on 17 Jan 2003 15:03 PST
Of course, when I mentioned the 1/M value, that was not intended to be an exact value, but an example for a very trivial case.
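
The weighting requirement described above (a group of M matching signals sharing roughly one unit of weight, so that each gets about 1/M) can be illustrated with a small sketch. This is not a method proposed in the thread. It is one simple heuristic, assumed here purely for illustration, that takes each sample's weight as the reciprocal of the sum of its squared correlations with every sample, itself included. The function name `redundancy_weights` and the full symmetric list-of-lists input are hypothetical choices.

```python
def redundancy_weights(corr):
    """Return one weight per sample from a full symmetric correlation matrix.

    Assumed heuristic (not from the thread): weight_i = 1 / sum_j r_ij^2,
    where the sum includes the diagonal term r_ii = 1.
    """
    weights = []
    for row in corr:
        # Squared correlations act as a soft count of samples matching this one.
        effective_count = sum(r * r for r in row)
        weights.append(1.0 / effective_count)
    return weights


# Three identical signals (0, 1, 2) plus one unrelated signal (3).
corr = [
    [1.0, 1.0, 1.0, 0.0],
    [1.0, 1.0, 1.0, 0.0],
    [1.0, 1.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
]
weights = redundancy_weights(corr)
```

With this input the three matching signals each get 1/3 and the distinct one gets 1, so both groups carry the same combined weight. Because the diagonal term is always 1, every weight lies in the range (0, 1].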
Request for Question Clarification by jeremymiles-ga on 18 Jan 2003 02:20 PST
I have two possible solutions, and I will post them both. However, my knowledge of programming isn't up to giving you complete pseudo-code for the tricky bits, so I will check first whether you are able to program them yourself, or are able to find (and understand) existing code for them.

The first thing you will need is matrix inversion/multiplication. The problem is simplified because the matrices are always symmetric. There is source code available to do this (e.g. in the book "Numerical Recipes in C").

The second thing is an iterative solver of some sort. If you are happy for the code to be pretty inefficient, this isn't too hard to write; if you want to make nicer code, you need to use something like the Newton-Raphson algorithm. Again, you can find the algorithms on the web.

Some programs, such as Excel, (I believe) Matlab, and Mx (that's freeware), can do all of these things for you. However, you would have to read the results into your program, and if you want to automate the process, or do it 'on the fly', this isn't going to work.

jeremymiles-ga

Clarification of Question by gw-ga on 18 Jan 2003 09:12 PST
Your solutions sound plausible and I should be able to program them. And yes, it is most definitely my intention to automate the process.
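
One place the matrix inversion mentioned above comes in is the tolerance from earlier in the thread: for a correlation matrix R, the tolerance 1 - R^2 of sample i equals 1 / (R^-1)_ii, the reciprocal of the i-th diagonal entry of the inverse. The following is a minimal sketch under stated assumptions, not jeremymiles-ga's posted solution. It uses plain Gauss-Jordan elimination with partial pivoting rather than a routine specialised for symmetric matrices, and the names `invert` and `tolerances` are hypothetical. It raises an error on a singular matrix, which is exactly the case where one sample is perfectly predictable from the others (tolerance zero).

```python
def invert(matrix):
    """Invert a square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(matrix)
    # Augment each row with the matching row of the identity matrix.
    aug = [list(map(float, row)) + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        # Pick the row with the largest pivot to limit round-off error.
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular or nearly singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        # Clear this column in every other row.
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    # The right-hand half of the augmented matrix is now the inverse.
    return [row[n:] for row in aug]


def tolerances(corr):
    """Tolerance (1 - R^2) of each sample: reciprocal of the inverse's diagonal."""
    inverse = invert(corr)
    return [1.0 / inverse[i][i] for i in range(len(corr))]


# Samples 0 and 1 correlate at 0.9; sample 2 is unrelated to both.
corr = [
    [1.0, 0.9, 0.0],
    [0.9, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
tol = tolerances(corr)
```

Here the two correlated samples each get a tolerance of about 0.19 (that is, 1 - 0.9^2) and the unrelated sample gets 1.0. As the pair's correlation approaches 1 the tolerance falls toward zero rather than toward 1/2, which is the difference jeremymiles-ga raised earlier in the thread.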