This is a question on probability/statistics.
If I have one thousand points x_i \in R, i=1,2,...,1000, I can answer
the question "Do they come from a normal distribution, with mean=0 and
std=1?" Various normality tests can be applied here.
Now if the claim is that each x_i is sampled from a distribution P_i,
where P_i does not have a closed form description but can be sampled
from, can I design a test to accept or reject this claim?
Intuitively, I can represent P_i by drawing a certain number of samples
from it, and see where x_i falls. If it falls "outside" most of the
time, I should certainly reject the claim. My question is how to
quantify this.
In your answer, please comment on whether it can be generalized to
higher dimensions. |
Request for Question Clarification by mathtalk-ga on 08 Jun 2005 08:48 PDT
Hi, lambertch-ga:
Is the premise here really that each x_i comes from a different
distribution P_i? And that all such P_i's are known only by sampling?
Certainly there are "distribution free" tests (eliminating the
assumption of normality) that might be applied if we hypothesize that
all the x_i came from one distribution P that could be independently
sampled. But if you have a different distribution for every
observation, it's hard to know what confidence one might have about
rejecting or accepting a hypothesis based essentially on a single
observation or even an unrelated collection of such observations.
regards, mathtalk-ga
|
Clarification of Question by lambertch-ga on 08 Jun 2005 12:09 PDT
Hi mathtalk-ga,
Yes, your re-statement of the premise is correct.
If we consider a special case where each P_i is a known Gaussian and
differs from the others only in its mean, then "an unrelated
collection of such observations" is not a problem, since we can shift
the data x_i and treat them as coming from one single normal
distribution P. (Or is there a problem even for this case?)
So what breaks down for the general case in my original question?
Would it help if we assume also that the P_i 's have the same "shape?"
How would we define that quantitatively?
Regards,
lambertch-ga
|
Request for Question Clarification by mathtalk-ga on 08 Jun 2005 18:58 PDT
To try and express the idea informally, the hypothesis that all the
x_i's come from different distributions is simply too broad to allow
us to conclude that the outcome is "unlikely" no matter what
observations are obtained. After all you only have one observation
from each distribution. It would be very hard to dispute that x_i
might have come from distribution P_i, even if we never get that exact
observation again by repeated sampling.
The strength of statistical inference springs from being able to show
that an observed outcome is unlikely to have occurred "by chance".
Now the outcome might be tested in a variety of ways. But take as a
specific case the hypothesis that all the P_i's are normal
distributions with equal variance but differing means. You are
allowing the possibility that the observations x_i are at or near the
unspecified means of those distributions (mostly), which is as likely
an outcome as one should expect. So under what circumstances would we
reject the "hypothesis"?
Of course in a practical sense one might resample each P_i many times
and estimate each mean and variance, thus drawing a conclusion about
how the x_i's are distributed with respect to the underlying
populations. But my question is why you have to assume at the outset
that all the P_i's are different. It suggests to me that you have no
way of establishing that what you are sampling from, from one
observation to the next, is indeed the same distribution.
regards, mathtalk-ga
|
Clarification of Question by lambertch-ga on 09 Jun 2005 08:37 PDT
Hi mathtalk-ga,
I can more or less see your point, but I'm not sure I have made it
absolutely clear where my difficulty is. So let me try again to see if
it makes any difference. If not, you can give me the answer "This is
an ill-posed problem" or something like that, and I'd be happy to
accept it.
So here it is, starting from scratch:
(1) I have a set of points x_i, and I have a normal distribution with
zero mean and known covariance. I can ask the question whether the set
comes from the distribution, and various normality tests can answer
that.
(2) I have a set of points x_i, and I have a set of distributions P_i.
Each P_i is known to be a normal distribution with KNOWN mean and THE
SAME KNOWN covariance. And I ask the question "Is it likely that each
x_i is a sample of P_i?"
I can think of two answers:
(a) This is an ill-posed problem, because for each distribution P_i we
have only one data point x_i, and there is no relation between the
x_i's.
(b) Just subtract the known mean of P_i from x_i, and the problem is
reduced to that of (1).
If you choose answer (a), I'd like to know what's wrong with answer (b).
If you choose answer (b), let's continue to the next level.
(3) I have a set x_i, a set P_i, now each P_i is given only in the
form of its independent samples; let's further assume that if you need
more samples, you can get them. We ask the same question as in (1) and
(2).
Again I can think of two answers:
(a) It is an ill-posed problem.
(b) Suppose it is the case that each P_i can be approximated by a
Gaussian with very good accuracy. Then the problem is essentially
reduced to (2); the differing covariance is not a fundamental
obstacle, because we can rescale the data.
Now suppose it is the case that each P_i is multimodal. We can still
look at the histogram of its samples, and see where x_i lies. If, for
an extreme case, 90% of the time x_i lies in the "tail," then it's
certainly unlikely that each x_i is a sample of the corresponding P_i.
If you think (a) is the answer, I'd like to know what's wrong with (b).
If you think (b) is reasonable, then comes my original question: How
do we quantify it? Does it work in two dimensions and higher?
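One possible way to quantify answer (b) in one dimension, offered as a sketch rather than a settled answer: replace each x_i by its rank among fresh samples from P_i. If x_i really comes from a continuous P_i, that rank is uniformly distributed whatever the shape of P_i, so the 1000 ranks can be pooled into a histogram and checked for flatness with a chi-square statistic. The bimodal P_i, the sample size m, the ten bins, and all function names below are assumptions made only for illustration.

```python
import random


def rank_pit(x, samples, rng):
    """Randomized rank of x among samples, scaled into (0, 1).

    If x and the samples are independent draws from one continuous
    distribution, the returned value is exactly Uniform(0, 1).
    """
    r = sum(1 for s in samples if s < x)  # r is uniform on 0..m under H0
    return (r + rng.random()) / (len(samples) + 1)


def chi_square_uniform(us, bins=10):
    """Chi-square statistic comparing a rank histogram with a flat one."""
    counts = [0] * bins
    for u in us:
        counts[min(int(u * bins), bins - 1)] += 1
    expected = len(us) / bins
    return sum((c - expected) ** 2 / expected for c in counts)


def draw_p(center, rng):
    """Hypothetical bimodal P_i: equal mix of N(c-3, 0.5^2) and N(c+3, 0.5^2)."""
    mode = center - 3.0 if rng.random() < 0.5 else center + 3.0
    return rng.gauss(mode, 0.5)


rng = random.Random(1)
n, m = 1000, 200  # n observations, m samples drawn from each P_i
centers = [rng.uniform(-10.0, 10.0) for _ in range(n)]

x_good = [draw_p(c, rng) for c in centers]    # truly sampled from P_i
x_bad = [rng.gauss(c, 0.5) for c in centers]  # sits between the two modes


def rank_histogram_stat(xs):
    us = []
    for x, c in zip(xs, centers):
        samples = [draw_p(c, rng) for _ in range(m)]
        us.append(rank_pit(x, samples, rng))
    return chi_square_uniform(us)


stat_good = rank_histogram_stat(x_good)
stat_bad = rank_histogram_stat(x_bad)
CRITICAL_0_001 = 27.88  # chi-square 0.1% critical value, 9 degrees of freedom
print("good:", round(stat_good, 1), "bad:", round(stat_bad, 1))
```

Note that x_bad has the same mean as each P_i, so a check based only on means would not catch it; the rank histogram does, because the ranks pile up near 0.5 instead of spreading evenly. Ranks of this kind need an ordering, so the sketch as written is one-dimensional only.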
Thanks,
lambertch-ga
|
Request for Question Clarification by mathtalk-ga on 15 Jun 2005 18:58 PDT
Hi, lambertch-ga:
We agree about (1) and about (2), i.e., that (b) applies in case (2).
In fact if the distributions P_i are known to be normal distributions,
then each is characterized by a mean m_i together with a standard
deviation s_i, and the idea of subtracting off a known mean (with
common variance) can be generalized to doing a transformation of
observations X_i into:
Y_i = (X_i - m_i)/s_i
which will then share a common "unit normal" (Gaussian) distribution.
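A minimal sketch of this transformation, assuming Python with only the standard library: the means m_i and standard deviations s_i are made-up values, and the Kolmogorov-Smirnov check against the unit normal is one choice among several possible normality tests. The p-value uses the large-sample Kolmogorov series, so it is approximate.

```python
import math
import random
from statistics import NormalDist


def ks_statistic(values, cdf):
    """One-sample Kolmogorov-Smirnov statistic sup|F_n - F|."""
    ys = sorted(values)
    n = len(ys)
    d = 0.0
    for k, y in enumerate(ys):
        f = cdf(y)
        # The empirical CDF steps from k/n to (k+1)/n at y.
        d = max(d, (k + 1) / n - f, f - k / n)
    return d


def ks_pvalue(d, n, terms=100):
    """Asymptotic (large-n) p-value from the Kolmogorov distribution."""
    lam = math.sqrt(n) * d
    if lam < 0.2:
        return 1.0  # the alternating series is unreliable here; p is ~1
    tail = sum((-1) ** (j - 1) * math.exp(-2.0 * j * j * lam * lam)
               for j in range(1, terms + 1))
    return min(1.0, max(0.0, 2.0 * tail))


rng = random.Random(0)
n = 1000
# Hypothetical known parameters m_i and s_i, one pair per observation.
means = [rng.uniform(-5.0, 5.0) for _ in range(n)]
sds = [rng.uniform(0.5, 3.0) for _ in range(n)]

# Case "good": each X_i really is drawn from N(m_i, s_i^2).
x_good = [rng.gauss(m, s) for m, s in zip(means, sds)]
# Case "bad": each X_i is drawn half a standard deviation too high.
x_bad = [rng.gauss(m + 0.5 * s, s) for m, s in zip(means, sds)]

unit_normal_cdf = NormalDist().cdf
results = {}
for label, xs in (("good", x_good), ("bad", x_bad)):
    ys = [(x - m) / s for x, m, s in zip(xs, means, sds)]  # Y_i
    d = ks_statistic(ys, unit_normal_cdf)
    results[label] = (d, ks_pvalue(d, n))
    print(label, "D =", round(d, 4), "p =", results[label][1])
```

Although every X_i has its own mean and variance, the pooled Y_i can be tested as a single sample, which is why only one observation per distribution is enough in this case.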
So to my mind the gap that exists is between a description of the
P_i's as "given only in the form of its independent samples" and "each
P_i can be approximated by a Gaussian with very good accuracy".
Arbitrary distributions cannot be characterized by a finite number of
parameters, as we noted that the normal distributions can be.
My concern, if you say that each P_i can be sampled as many times as
we wish, is what special status is given, if any, to observation X_i.
Clearly there would be many possible samplings from an otherwise
unknowable distribution that would make X_i seem to fit in well, along
with a number that would make X_i appear to be an "outlier" to any
such distribution. Since i runs in your problem formulation to 1000,
one might expect some of each case, observations that seem to fit in
reasonably well and others that seem "odd".
Without clearer hypotheses about the actual distributions P_i, it will
be difficult to predict quantitatively how many should "fit in", say
to an estimated standard deviation or two. If you assume the P_i's
are normal distributions, albeit with unknown (but estimated) means
and variances, then you have gone a long way toward providing a
rationale for such estimates. You have in essence reduced an infinite
number of parameters needed to characterize unknown distributions to
just two (per normal distribution).
There are statistical methods called "distribution-free" because they
do not rely on assumed normality of distributions. The "power" of
such methods is lower than that of methods that assume the additional
information/restrictions, as reflected for example in the size of
confidence intervals for estimating means and variances.
Perhaps a "middle way" here is to make some assumptions about the P_i
without specifically assuming normality. That is, one might assume a
continuous, unimodal family of distributions with finite first and
second moments (so that mean and variance are defined). From there a
weaker prediction of the fraction of "outliers" could be assembled.
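To give a sense of how much weaker such a prediction is, here is a small sketch comparing upper bounds on the fraction of observations lying k or more standard deviations from the mean. It uses Chebyshev's inequality (finite variance only) and the Vysochanskij-Petunin inequality (unimodal, finite variance); choosing these two bounds is an assumption about what the "middle way" might look like, and the function names are illustrative.

```python
import math


def chebyshev_bound(k):
    """P(|X - mean| >= k*sd) <= 1/k^2 for any finite-variance distribution."""
    return min(1.0, 1.0 / (k * k))


def unimodal_bound(k):
    """Vysochanskij-Petunin: 4/(9 k^2) for unimodal X when k > sqrt(8/3).

    For smaller k this sketch falls back to Chebyshev, which is always valid.
    """
    if k > math.sqrt(8.0 / 3.0):
        return 4.0 / (9.0 * k * k)
    return chebyshev_bound(k)


def normal_tail(k):
    """Exact two-sided tail P(|Z| >= k) for a unit normal Z."""
    return math.erfc(k / math.sqrt(2.0))


for k in (1.0, 2.0, 3.0):
    print(k, normal_tail(k), unimodal_bound(k), chebyshev_bound(k))
```

At k = 2, for instance, normality predicts about 4.6% of observations beyond two standard deviations, while the unimodal bound only guarantees at most about 11% and Chebyshev at most 25%.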
regards, mathtalk-ga
|
Clarification of Question by lambertch-ga on 16 Jun 2005 13:50 PDT
Hi mathtalk-ga,
Thanks for the reply. I don't quite understand your paragraph starting
with "My concern ...." For each P_i, we can draw a sufficient number of
samples to approximate its probability density function f_i()
arbitrarily closely, and consequently the function value f_i(X_i). For i
from 1 to 1000, some of these values are low, and some are not low.
Qualitatively, if most of f_i(X_i)'s are very low, then we can reject
the hypothesis that each X_i comes from the corresponding P_i.
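One way this qualitative rule might be made quantitative, sketched under several assumptions: estimate a density score for P_i from one batch of samples, then ask how many fresh samples from P_i score no higher than X_i. If X_i comes from P_i, the resulting tail value falls at or below alpha with probability at most alpha, so the number of "low" values among all i can be compared with a binomial count. Because only a scalar score is ranked, the same code runs in any dimension. The Gaussian kernel, the bandwidth, the batch sizes, and the two-blob P_i below are all hypothetical choices.

```python
import math
import random


def kde_score(point, reference, bandwidth=0.5):
    """Unnormalized Gaussian kernel density estimate at point."""
    total = 0.0
    for ref in reference:
        sq = sum((p - r) ** 2 for p, r in zip(point, ref))
        total += math.exp(-sq / (2.0 * bandwidth ** 2))
    return total / len(reference)


def tail_value(x, draw, rng, m_ref=40, m_rank=39):
    """Fraction of fresh draws whose density score is no higher than x's."""
    reference = [draw(rng) for _ in range(m_ref)]  # used to build the score
    fresh = [draw(rng) for _ in range(m_rank)]     # used only for ranking
    sx = kde_score(x, reference)
    lower = sum(1 for z in fresh if kde_score(z, reference) <= sx)
    return (1 + lower) / (m_rank + 1)


def excess_z(tail_values, alpha=0.1):
    """Normal-approximation z-score for the count of values <= alpha."""
    n = len(tail_values)
    k = sum(1 for p in tail_values if p <= alpha)
    return (k - n * alpha) / math.sqrt(n * alpha * (1.0 - alpha))


def make_sampler(cx, cy):
    """Hypothetical 2-D bimodal P_i: blobs at (cx-2, cy-2) and (cx+2, cy+2)."""
    def draw(rng):
        sign = 1.0 if rng.random() < 0.5 else -1.0
        return (rng.gauss(cx + 2.0 * sign, 0.5),
                rng.gauss(cy + 2.0 * sign, 0.5))
    return draw


rng = random.Random(3)
centers = [(rng.uniform(-10, 10), rng.uniform(-10, 10)) for _ in range(200)]
samplers = [make_sampler(cx, cy) for cx, cy in centers]

x_good = [draw(rng) for draw in samplers]  # truly sampled from P_i
x_bad = [(rng.gauss(cx, 0.3), rng.gauss(cy, 0.3)) for cx, cy in centers]

z_good = excess_z([tail_value(x, d, rng) for x, d in zip(x_good, samplers)])
z_bad = excess_z([tail_value(x, d, rng) for x, d in zip(x_bad, samplers)])
print("z_good =", round(z_good, 2), "z_bad =", round(z_bad, 2))
```

A large positive z-score means far more X_i land in low-density regions than chance allows. Keeping the ranking batch separate from the batch used to build the score is what makes the tail values valid even when the density estimate itself is crude.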
We have encountered two difficulties:
(1) There is only one point X_i for each P_i. I think we have just
established that this fact by itself doesn't make the problem ill-posed.
(2) P_i is infinitely parameterized. Again, this fact by itself does
not make the problem intractable, since if all P_i's are actually
known to be the same P, then we can pool the X_i's together and
examine how much the set {X_i} and the distribution P differ, using
the Kolmogorov-Smirnov test, for example.
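For the pooled case in (2), a rough sketch of a two-sample Kolmogorov-Smirnov comparison between the X_i's and a batch of samples from P. The bimodal P, the sample sizes, and the function names are assumptions for illustration, and the rejection threshold is the usual large-sample approximation.

```python
import math
import random


def ks_two_sample(a, b):
    """Largest gap between the empirical CDFs of samples a and b."""
    a, b = sorted(a), sorted(b)
    i = j = 0
    d = 0.0
    while i < len(a) and j < len(b):
        v = min(a[i], b[j])
        while i < len(a) and a[i] <= v:
            i += 1
        while j < len(b) and b[j] <= v:
            j += 1
        d = max(d, abs(i / len(a) - j / len(b)))
    return d


def ks_critical(n, m, alpha):
    """Approximate rejection threshold for D at significance level alpha."""
    c = math.sqrt(-0.5 * math.log(alpha / 2.0))
    return c * math.sqrt((n + m) / (n * m))


def draw_p(rng):
    """Hypothetical bimodal P: equal mix of N(-3, 0.5^2) and N(3, 0.5^2)."""
    mode = -3.0 if rng.random() < 0.5 else 3.0
    return rng.gauss(mode, 0.5)


rng = random.Random(2)
p_samples = [draw_p(rng) for _ in range(5000)]      # samples standing in for P
x_same = [draw_p(rng) for _ in range(1000)]         # X_i truly drawn from P
x_other = [rng.gauss(0.0, 1.0) for _ in range(1000)]  # X_i from another law

d_same = ks_two_sample(x_same, p_samples)
d_other = ks_two_sample(x_other, p_samples)
threshold = ks_critical(1000, 5000, 0.001)
print("same:", round(d_same, 4), "other:", round(d_other, 4),
      "threshold:", round(threshold, 4))
```

No closed form for P is needed here, only the ability to sample from it, and the test makes no assumption about the number of modes.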
It's the combination of the above two that makes the problem difficult.
The raison d'etre of the problem is to be able to handle non-Gaussian
multimodal distributions, so I can't follow the "middle way" you
suggested.
Regards,
lambertch-ga
|