Q: probability/statistics, test of distribution (No Answer, 2 Comments)
Question  
Subject: probability/statistics, test of distribution
Category: Science > Math
Asked by: lambertch-ga
List Price: $50.00
Posted: 07 Jun 2005 07:50 PDT
Expires: 07 Jul 2005 07:50 PDT
Question ID: 530344
This is a question on probability/statistics.

If I have one thousand points x_i \in R, i=1,2,...,1000, I can answer
the question "Do they come from a normal distribution, with mean=0 and
std=1?" Various normality tests can be applied here.
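
For concreteness, a minimal sketch of this baseline case in Python, with the
observations assumed to be in an array x, could be a Kolmogorov-Smirnov test
against the standard normal (any standard normality test would do):

    import numpy as np
    from scipy import stats

    x = np.random.normal(size=1000)          # placeholder for the observations x_i
    stat, p_value = stats.kstest(x, 'norm')  # compare against N(0, 1)
    # a small p_value argues against the N(0, 1) hypothesis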

Now if the claim is that each x_i is sampled from a distribution P_i,
where P_i does not have a closed form description but can be sampled
from, can I design a test to accept or reject this claim?

Intuitively, I can represent P_i by drawing a certain number of samples
from it, and see where x_i falls. If it falls "outside" most of the
time, I should certainly reject the claim. My question is how to
quantify this.

In your answer, please comment on whether it can be generalized to
higher dimensions.

Request for Question Clarification by mathtalk-ga on 08 Jun 2005 08:48 PDT
Hi, lambertch-ga:

Is the premise here really that each x_i comes from a different
distribution P_i?  And that all such P_i's are known only by sampling?

Certainly there are "distribution free" tests (eliminating the
assumption of normality) that might be applied if we hypothesize that
all the x_i came from one distribution P that could be independently
sampled.   But if you have a different distribution for every
observation, it's hard to know what confidence one might have about
rejecting or accepting a hypothesis based essentially on a single
observation or even an unrelated collection of such observations.

regards, mathtalk-ga

Clarification of Question by lambertch-ga on 08 Jun 2005 12:09 PDT
Hi mathtalk-ga,

Yes, your re-statement of the premise is correct.

If we consider a special case where each P_i is a known Gaussian and
the P_i's differ from each other only in their means, then "an
unrelated collection of such observations" is not a problem, since we
can shift the data x_i and treat them as coming from one single normal
distribution P. (Or is there a problem even for this case?)

So what breaks down for the general case in my original question?
Would it help if we assume also that the P_i 's have the same "shape?"
How would we define that quantitatively?

Regards,

lambertch-ga

Request for Question Clarification by mathtalk-ga on 08 Jun 2005 18:58 PDT
To try and express the idea informally, the hypothesis that all the
x_i's come from different distributions is simply too broad to allow
us to conclude that the outcome is "unlikely" no matter what
observations are obtained.  After all, you only have one observation
from each distribution.  It would be very hard to dispute that x_i
might have come from distribution P_i, even if we never get that exact
observation again by repeated sampling.

The strength of statistical inference springs from being able to show
that an observed outcome is unlikely to have occurred "by chance". 
Now the outcome might be tested in a variety of ways.  But take as a
specific case the hypothesis that all the P_i's are normal
distributions with equal variance but differing means.  You are
allowing the possibility that the observations x_i are at or near the
unspecified means of those distributions (mostly), which is as likely
an outcome as one should expect.  So under what circumstances would we
reject the "hypothesis"?

Of course in a practical sense one might resample each P_i many times
and estimate each mean and variance, thus drawing a conclusion about
how the x_i's are distributed with respect to the underlying
populations.  But my doubt would be: why do you have to assume at the
outset that all the P_i's are different?  It suggests to me that you
have no way of pinning down that what you are sampling from, from one
observation to the next, is indeed the same distribution.

regards, mathtalk-ga

Clarification of Question by lambertch-ga on 09 Jun 2005 08:37 PDT
Hi mathtalk-ga,

I can more or less see your point, but I'm not sure I have made it
absolutely clear where my difficulty is. So let me try again to see if
it makes any difference. If not, you can give me the answer "This is
an ill-posed problem" or something like that, and I'd be happy to
accept it.

So here it is, starting from scratch:

(1) I have a set of points x_i, and I have a normal distribution with
zero mean and known covariance. I can ask the question whether the set
comes from the distribution, and various normality tests can answer
that.

(2) I have a set of points x_i, and I have a set of distributions P_i.
Each P_i is known to be a normal distribution with KNOWN mean and THE
SAME KNOWN covariance. And I ask the question "Is it likely that each
x_i is a sample of P_i?"

I can think of two answers:

(a) This is an ill-posed problem, because for each distribution P_i we
have only one data point x_i, and there is no relation between the
x_i's.

(b) Just subtract the known mean of P_i from x_i, and the problem is
reduced to that of (1).

If you choose answer (a), I'd like to know what's wrong with answer (b). 

If you choose answer (b), let's continue to the next level.

(3) I have a set x_i, a set P_i, now each P_i is given only in the
form of its independent samples; let's further assume that if you need
more samples, you can get them. We ask the same question as in (1) and
(2).

Again I can think of two answers:

(a) It is an ill-posed problem.

(b) Suppose it is the case that each P_i can be approximated by a
Gaussian with very good accuracy. Then the problem is essentially
reduced to (2); the differing covariance is not a fundamental
obstacle, because we can rescale the data.

Now suppose it is the case that each P_i is multimodal. We can still
look at the histogram of its samples, and see where x_i lies. If, in
an extreme case, 90% of the x_i's lie in the "tail" of their
respective histograms, then it's certainly unlikely that each x_i is a
sample of the corresponding P_i.

If you think (a) is the answer, I'd like to know what's wrong with (b).

If you think (b) is reasonable, then comes my original question: How
do we quantify it? Does it work in two dimensions and higher?

Thanks,

lambertch-ga

Request for Question Clarification by mathtalk-ga on 15 Jun 2005 18:58 PDT
Hi, lambertch-ga:

We agree about (1) and about (2), i.e., that (b) applies in case (2).

In fact if the distributions P_i are known to be normal distributions,
then each is characterized by a mean m_i together with a standard
deviation s_i, and the idea of subtracting off a known mean (with
common variance) can be generalized to doing a transformation of
observations X_i into:

  Y_i = (X_i - m_i)/s_i

which will then share a common "unit normal" (Gaussian) distribution.
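
A minimal sketch of this reduction, assuming arrays x, m, and s hold the
observations, the known means, and the known standard deviations:

    import numpy as np
    from scipy import stats

    # x, m, s: observations X_i, known means m_i, known standard deviations s_i
    y = (np.asarray(x) - np.asarray(m)) / np.asarray(s)  # Y_i = (X_i - m_i)/s_i
    stat, p_value = stats.kstest(y, 'norm')              # test the Y_i against N(0, 1)

Any other normality test could be applied to the Y_i in the same way.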

So to my mind the gap that exists is between a description of the
P_i's as "given only in the form of its independent samples" and "each
P_i can be approximated by a Gaussian with very good accuracy". 
Arbitrary distributions cannot be characterized by a finite number of
parameters, whereas, as we noted, normal distributions can be.

My concern, if you say that each P_i can be sampled as many times as
we wish, is what special status is given, if any, to observation X_i. 
Clearly there would be many possible samplings from an otherwise
unknowable distribution that would make X_i seem to fit in well, along
with a number that would make X_i appear to be an "outlier" to any
such distribution.  Since i runs in your problem formulation to 1000,
one might expect some of each case, observations that seem to fit in
reasonably well and others that seem "odd".

Without clearer hypotheses about the actual distributions P_i, it will
be difficult to predict quantitatively how many should "fit in", say
to within an estimated standard deviation or two.  If you assume the P_i's
are normal distributions, albeit with unknown (but estimated) means
and variances, then you have gone a long way toward providing a
rationale for such estimates.  You have in essence reduced an infinite
number of parameters needed to characterize unknown distributions to
just two (per normal distribution).

There are statistical methods called "distribution-free" because they
do not rely on assumed normality of distributions.  The "power" of
such methods is less than methods assuming the additional
information/restrictions, as reflected for example in the size of
confidence intervals for estimating means and variances.

Perhaps a "middle way" here is to make some assumptions about the P_i
without specifically assuming normality.  That is, one might assume a
continuous, unimodal family of distributions with finite first and
second moments (so that mean and variance are defined).  From there a
weaker prediction of the fraction of "outliers" could be assembled.

regards, mathtalk-ga

Clarification of Question by lambertch-ga on 16 Jun 2005 13:50 PDT
Hi mathtalk-ga,

Thanks for the reply. I don't quite understand your paragraph starting
with "My concern ...." For each P_i, we can draw a sufficient number of
samples to approximate arbitrarily closely its probability density
function f_i(), and consequently the function value f_i(X_i). For i
from 1 to 1000, some of these values are low, and some are not low.
Qualitatively, if most of f_i(X_i)'s are very low, then we can reject
the hypothesis that each X_i comes from the corresponding P_i.
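
As a sketch of what I have in mind, assuming for a given i an array
samples_i of draws from P_i and a value x_i for the observation (both names
are placeholders), one could estimate the density with a kernel density
estimate and evaluate it at X_i; the bandwidth choice makes this an
approximation, of course:

    import numpy as np
    from scipy import stats

    # samples_i: a large (placeholder) sample drawn from P_i; x_i: the single observation X_i
    kde = stats.gaussian_kde(samples_i)    # smoothed estimate of the density f_i
    f_at_xi = kde.evaluate([x_i])[0]       # estimated value f_i(X_i)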

We have encountered two difficulties:

(1) There is only one point X_i for each P_i. I think we have just
resolved that this fact by itself doesn't make the problem ill-posed.

(2) P_i is infinitely parameterized. Again this fact by itself does
not make the problem intractable, since if all P_i's are actually
known to be the same P, then we can pool the X_i's together and
examine how much the set {X_i} and the distribution P differ, by
Kolmogorov-Smirnov Test, for example.
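
For instance, if all the P_i's really were one P known only through
sampling, a sketch of that pooled comparison, assuming x holds the pooled
X_i's and p_sample is a large independent sample drawn from P (both names
are placeholders), could use the two-sample version of the test:

    from scipy import stats

    # x: the pooled X_i's; p_sample: a large (placeholder) sample drawn from P
    stat, p_value = stats.ks_2samp(x, p_sample)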

It's the combination of the above two that makes the problem difficult. 

The raison d'etre of the problem is to be able to handle non-Gaussian
multimodal distributions, so I can't follow the "middle way" you
suggested.

Regards,

lambertch-ga
Answer  
There is no answer at this time.

Comments  
Subject: Re: probability/statistics, test of distribution
From: bocardo-ga on 21 Jun 2005 09:22 PDT
 
One way of tackling this problem might be to use the empirical
probability integral transform for each x_i with respect to the
corresponding P_i.

For each i, generate a large sample from the distribution P_i. Then
calculate the empirical fraction p_i of this sample that lies at or
below x_i.

Under the null hypothesis the p_i are a random sample from the
Uniform(0,1) distribution.

Use some test to assess the distribution of the p_i against this distribution.
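
A rough sketch of the computation, assuming a placeholder function
sample_P(i, n) that returns n draws from P_i and an array x of the
observations:

    import numpy as np

    n_draws = 10000
    p = np.empty(len(x))
    for i in range(len(x)):
        draws = np.asarray(sample_P(i, n_draws))  # large sample from P_i (placeholder sampler)
        p[i] = np.mean(draws <= x[i])             # empirical fraction at or below x_i

The entries of p can then be fed to whatever uniformity test you prefer.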

OK?
Subject: Re: probability/statistics, test of distribution
From: lambertch-ga on 21 Jun 2005 19:02 PDT
 
Hi bocardo-ga,

Thanks a lot for your solution! In fact I have been doing exactly this
the last two days, and I used the Kolmogorov-Smirnov Goodness-of-fit
statistic, i.e., the maximum distance between the cumulative
distribution of the p_i's and that of U[0,1]. It is indeed working,
for the one-dimensional case.
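
Concretely, assuming the p_i's are collected in an array p, the step I used
amounts to something like:

    from scipy import stats

    # p: the empirical probability integral transforms p_i
    stat, p_value = stats.kstest(p, 'uniform')  # compare against U[0, 1]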

Now if only I could figure out what to do with the 2-d case .... Any suggestions?

lambertch-ga
