I am sampling a single discrete random variable with k bins. Each bin
represents an outcome on a nominal scale, so none of the usual
distributions apply. Given k and n (n being the sample size), I need
to compute the probability that the sample represents the distribution
of outcomes in the population. I suppose some sort of confidence
interval comes into play here as well. Is there a direct formula or
set of formulas that determine this probability? If so, what is it,
and what are the caveats when applying it?
Clarification of Question by azeric64-ga on 20 Dec 2004 09:08 PST
Thanks for your comments, and allow me to clarify.
There is a one-to-one correspondence between outcomes and bins: a
sample with a particular outcome always goes into the bin for that
outcome. So all samples with the outcome "A" go into the "A" bin, all
samples with outcome "B" go into the "B" bin, etc. There is typically
a small number of bins, with k ranging from 2 up to maybe 25 as an
upper bound.
Wouldn't the probability of 1/k only apply if I could assume that the
outcomes all have uniform probability? I do know that the outcomes
definitely do not have uniform probability; the idea is that we have a
good guess as to the distribution of the outcomes, but want to use
statistical feedback to correct the distribution, and want to know
when we have collected enough results to replace our initial guess.
Mathtalk, you make an excellent point about using all of the bins in
the sample. One requirement of this measurement is that we want to
collect a "complete" sample in the sense that every bin must contain
at least one outcome. However, certain outcomes that we expect may
never actually occur, especially if we know they will be particularly
rare; we want to know if that's the case too, while still being able
to see how prevalent the other outcomes are.
The way that we are applying this involves using our initial guess for
the prevalence of the outcomes while collecting the actual number of
outcomes until we are confident that the data gives us a better
picture of the prevalence. Ideally, we'd like to make incremental
corrections to the prevalences to gradually bring them into line with
the actual data.
Finally, I should have mentioned at the beginning that the population
is sampled without replacement. I'm not sure whether this makes a
difference or not.
Request for Question Clarification by mathtalk-ga on 20 Dec 2004 13:21 PST
Hi, azeric64-ga:
In general outline one needs to define a measure of how close (or far
away) a sample population is from the hypothesized distribution. A
typical approach uses a weighted sum-of-squares "error" statistic,
i.e. summing over all bins the square of the difference between the
observed and expected bin contents, divided by the expected content
size. Such a statistic has a discrete distribution for any fixed
sample size, but if the sample size is large enough one can apply a
continuous approximation. An "error" statistic large enough to have
only a small probability of arising by "random" chance is then
considered "significant" in the context of rejecting the hypothesized
distribution.
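To make that concrete, here is a rough sketch of how such a statistic
could be computed (in JavaScript, with array names of my own choosing;
observedCounts[i] is the number of sample items in bin i, and
hypothesizedProbs[i] is the hypothesized probability of bin i):

    // Weighted sum-of-squares "error" statistic (Pearson-style):
    // sum over all bins of (observed - expected)^2 / expected,
    // where expected = hypothesized probability * sample size.
    function errorStatistic(observedCounts, hypothesizedProbs) {
      var n = 0;
      for (var i = 0; i < observedCounts.length; i++) {
        n += observedCounts[i];
      }
      var stat = 0;
      for (var j = 0; j < observedCounts.length; j++) {
        var expected = n * hypothesizedProbs[j];
        var diff = observedCounts[j] - expected;
        stat += (diff * diff) / expected;
      }
      return stat;
    }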
Would this sort of information respond to your request for a "direct"
formula or formulas and caveats for applying them?
regards, mathtalk-ga
Clarification of Question by azeric64-ga on 20 Dec 2004 16:18 PST
Hi again, Mathtalk-ga:
I'm not trying to test a hypothesis as such, just trying to adjust
what I initially believe to be the distribution of outcomes. (I'm
building a "self-learning" troubleshooting application.) I don't
think that a weighted sum-of-squares error statistic is quite what I'm
after. In fact, if I can possibly avoid it, I don't want to compare
the hypothesized distribution to the collected sample in order to
determine the significance. Instead, I'd like to look at the sample
by itself and have a function using only the sample size (n) and the
number of bins (k) to give me a numerical rating, say between 0 and 1
inclusive, of how statistically significant the sample is.
I'm assuming that the more bins you have, the larger the sample size
you need to ensure that it is statistically significant. Is this
assumption correct?
I think that what I want is more related to how, in designing a
survey, you would determine the minimum sample size given a discrete
random variable such as I have described, except that I want to apply
it to a sample that I already have to see whether or not it meets
the minimum criteria for being statistically significant. If the
sample is statistically significant, I want to just use the sample as
my measure of the prevalence of each outcome; but if the sample is not
significant, I want to use a measure of its "fractional significance",
if that term makes sense, as a weighting factor against the
hypothesized distribution when I display the prevalence of the
outcomes to the user.
I'm guessing that, for a given value of k, the function of n that
yields the "fractional significance" of a not-yet-significant sample
resembles an elongated "S", much like the cumulative normal
distribution. The inflection point would be where the fractional
significance equals 0.5, and then the fractional significance would
asymptotically approach 1 as n continues to increase.
The effect I'm looking for is that as actual samples accumulate,
one-by-one, the user of the software would observe over time the
prevalence transitioning from the "hypothesized" distribution to the
actual distribution (I will actually be using the entire population as
a sample, with the population starting at zero and growing by
increments of one). So another way of asking this is to determine when
the population is large enough to allow you to accurately determine
the distribution of outcomes as it continues to increase. But I still
need an initial guess to work with until it reaches that point, while
also being able to make gradual adjustments in the interim to give the
best possible guess as to the distribution at every increment of the
population.
I'm looking for something relatively simple because this measure has
to be made by a JavaScript program running in a Web browser, so it can
have a quick-and-dirty aspect that won't necessarily be valid for
boundary conditions, such as when the sample size is below a certain
threshold, which I can test for in my JavaScript program.
Granted, this is a somewhat perverse use of a statistical measurement,
but it seems that it should still yield sensible results.
Thanks for your perseverance!
Regards,
AZeric64-ga
Clarification of Question by azeric64-ga on 20 Dec 2004 16:25 PST
One last point: because the function only equals 1 as a limit as n
approaches infinity, once the fractional significance reaches, say,
0.95 or 0.99, I'd consider the sample to be "fully" significant. This
would not have to be as rigorous as other significance tests that
yield a "p" value, just something that works in a practical
application. It could even use an extra fudge factor to err on the
side of caution so that I would have to have a larger sample size than
is strictly necessary to attain that level of significance.
Thanks again!
AZeric64-ga
Request for Question Clarification by mathtalk-ga on 20 Dec 2004 17:31 PST
Sorry, I'm not grasping what it is you are looking for. Of course you
can constantly adjust your estimated distribution based on what
samples you've observed, and you can base the adjustment on very
trivial calculations if you like.
However you've said above that you want to assign a number between 0
and 1 that describes "fractional significance" of a sample _without_
comparing "the hypothesized distribution to the collected sample in
order to determine the significance."
Now the notion of "statistical significance" is already perverse
enough, so I'm not concerned that we're tampering with sanctified
truth! But I just don't get what makes a sample, in and of itself,
more or less significant. In one respect I suppose you could base the
figure simply on the size of the sample, regardless of what bins the
observations fall into. Bigger is better, and we could just pick a
mathematical function of n that tends to 1 as n goes to infinity.
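(For instance, something as simple as f(n) = n/(n + c) for some
constant c, or 1 - exp(-n/c), would do; I mention these only as
illustrations of what such a function could look like, not as
recommendations.)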
So, I think I'm missing something central about what you'd like to do
with the "significance" values.
regards, mathtalk-ga
Clarification of Question by azeric64-ga on 21 Dec 2004 15:50 PST
Mathtalk,
Thanks for bearing with me through this. Actually, I think we're
already on the same page, the same paragraph in fact; at this point
it's just a matter of particulars. Talking this through with you has
been a great help so far!
I think it's best at this point if I outline a solution that I have
just come up with, and then have you tell me where the flaws are and
whether they really matter in a practical sense.
Approaching the problem requires that we divide it into two parts: The
first part determines the parameters we use for the fractional
significance formula, and the second part determines how we apply it.
First part:
Recall a statement I made in an earlier clarification: "the more bins
you have, the larger the sample size you need to ensure that it is
statistically significant."
A Web page dealing with determining sample size bears this out:
http://www.isixsigma.com/library/content/c000709.asp
This page also gives me some insights that I think will help. The
formula for sample size, if you look 2/3 of the way down the page
(before you get to all the banner ads and just after the illustration
of the normal distribution curve), is
n = [z(sub alpha/2) * sd / E]^2
and it directly supports this idea if you liken E (the margin of
error) to 1/k (the inverse of the number of outcomes). Increasing k
increases the required sample size, and it does so proportionally to
the square of k: when we substitute E with 1/k, the division
disappears and we are left with the formula
s = [z(sub alpha/2) * sd * k]^2
where
s is the required sample size
z(sub alpha/2) is the critical value, and
sd is the standard deviation.
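In JavaScript terms, I imagine computing the constant something like
this (just a sketch under my assumptions above; the function and
argument names are my own):

    // Minimum sample size s for k bins, given a critical value z
    // (roughly 1.96 for 95% confidence, 2.576 for 99%) and a
    // standard deviation sd, using s = (z * sd * k)^2.
    function requiredSampleSize(k, z, sd) {
      return Math.ceil(Math.pow(z * sd * k, 2));
    }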
The critical value and standard deviation ought to be constant for all
applications of the formula. The trick is determining what their
values need to be: the critical value can depend on the degree of
confidence we want, typically 95% but I'd like 99% if that is still
practical, which you can determine with a table of standard Z-scores
(although I haven't taken a statistics class in about 12 years so the
technique escapes me).
Here's the approach to determining the standard deviation that I am
considering: even though we have a nominal scale, can we still assume
that all outcomes would fall within three standard deviations of the
mean? (This covers 99.7% of cases, a close enough approximation to
100%.) That would make the standard deviation equal to one-sixth of
the range within which all the outcomes would fall. Does this make
sense? Remember, it's a nominal scale, so I feel that I'm stretching
it here.
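Just to make the arithmetic concrete with made-up numbers: if sd = 1/6
(about 0.1667) and we use the 99% critical value of roughly 2.576,
then for k = 10 bins the formula gives s = (2.576 * 0.1667 * 10)^2,
which is about 18.4, so we would round up to s = 19. (These values are
only an illustration of the calculation, not numbers I am committed
to.)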
So what we have so far, assuming that I have not completely left the
realm of statistical sanity, is a technique for determining the
minimum population size (which, recall, is equal to the sample size)
necessary to be able to replace the initial guess of the outcome
distribution with the distribution indicated by the collected results.
This will be an integer constant (call it "s", as indicated above)
that we can determine at the time that we build our application, and
will factor into the fractional significance formula that we will
apply to the collected results each time we run it.
So what I would need from you at this point, assuming that I have
explained all this clearly enough, is a determination of what the
critical value and the standard deviation should be. Again, the
critical value should come directly from the standard Z-score for the
confidences I indicated, and if my assumptions about what I can do with
the sd are correct, sd might simply be 0.166667. I need your opinion
on both counts.
Second part:
Now comes the point where, each time we run the program, we apply the
elongated S-curve function I previously described, adjusted according
to k, so that it indicates that the population has reached
statistically significant size when n equals s. For values of n less
than s, the result of the function yields the fractional significance.
The quandary now is that the cumulative normal distribution is
difficult to calculate, and I need to determine my fractional
significance on-the-fly. I could implement the Taylor series for it
if I had to, but instead I am considering a different approximation.
And here is where I really want to commit a statistical heresy for the
sake of simplifying the calculations and being happy with an
approximation (your testimony that significance measures are already
perverse enough gives me some confidence here). The cumulative normal
distribution, for the range of values that we are considering (three
standard deviations on either side of the mean), strongly resembles
the sin(x) function in the range -pi/2 to pi/2, along with a little
translation and scaling to get the outcomes between zero and one.
(sin() is a function included with JavaScript.) So if you ignore the
requirement that the distribution have asymptotic tails and run n
through the sin() function, with the appropriate adjustments to its
argument and its result, it should yield a value between 0 and 1 that
we can use to weight the sample outcomes against our guesses for them.
Low values of n will yield values near zero, and values of n near s
will yield values near one. sin() also satisfies the requirement that
the inflection point occur when the result equals 0.5.
The actual formula would be
[1 + sin(pi*n/s - pi/2)] / 2
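In JavaScript this would look roughly like the following (a sketch
only; note that I clamp n at s, since for n greater than s the raw
sine expression would start decreasing again, and at that point I
would consider the sample fully significant anyway):

    // Fractional significance of a sample of size n, where s is the
    // precomputed "fully significant" sample size for the given k.
    // Returns 0 at n = 0, 0.5 at n = s/2, and 1 at n >= s.
    function fractionalSignificance(n, s) {
      var m = Math.min(n, s);  // clamp so the result never exceeds 1
      return (1 + Math.sin(Math.PI * m / s - Math.PI / 2)) / 2;
    }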
Now to get the outcome distribution that we want to display, we take
the fractional significance value yielded from our formula, and call
it the sample significance. Then we subtract it from one to get the
initial guess significance. Then, for each outcome, the estimated
prevalence is the sample significance multiplied by the observed count
for that outcome, plus the initial guess significance multiplied by
the initial guess for that outcome. In other words, it's just a weighted
sum of the initial guess with the collected results. The initial
guesses, of course, will first have to be normalized according to s.
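In JavaScript, I picture this step roughly as follows (a sketch only;
I am normalizing both the observed counts and the initial guesses to
proportions, which may or may not be the right reading of "normalized
according to s", and the names are my own):

    // Blend the observed outcome counts with the initial guess,
    // weighted by the fractional significance of the sample so far.
    // observedCounts and initialGuess are arrays of the same length;
    // initialGuess need not sum to 1 -- it is normalized below.
    function estimatedPrevalence(observedCounts, initialGuess, s) {
      var n = 0, guessTotal = 0, i;
      for (i = 0; i < observedCounts.length; i++) {
        n += observedCounts[i];
        guessTotal += initialGuess[i];
      }
      var m = Math.min(n, s);  // clamp as in fractionalSignificance
      var w = (1 + Math.sin(Math.PI * m / s - Math.PI / 2)) / 2;
      var result = [];
      for (i = 0; i < observedCounts.length; i++) {
        var observedProp = (n > 0) ? observedCounts[i] / n : 0;
        var guessProp = initialGuess[i] / guessTotal;
        result[i] = w * observedProp + (1 - w) * guessProp;
      }
      return result;
    }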
So there you have it. Again, I think all I need from you at this
point is what the critical value and the standard deviation should
be, and an opinion as to whether I can use the sin() function as I
have described.
Thanks Again,
AZeric64
Request for Question Clarification by mathtalk-ga on 21 Dec 2004 21:40 PST
Hi, azeric64-ga:
I've posted some initial remarks in response to your ideas below, as a Comment.
regards, mathtalk-ga
Request for Question Clarification by mathtalk-ga on 22 Dec 2004 05:07 PST
Maybe a few questions can help me make useful suggestions:
1) When you sample the distribution (I believe you mentioned at one
point that this is "without replacement"), are the outcomes chosen one
at a time? If not one at a time, are the sizes of samples themselves
a random outcome? (For example, in birding, the number of
observations in a day will fluctuate.)
2) Are the samples chosen "independently"? This is a key issue from
the point of view of statistical estimation. For example, there was a
requirement in the construction of the Alaska pipeline that X-rays be
taken of a certain percentage of the welds, as a QA check that the
work was done properly. However it turned out that a contractor had
taken one X-ray and duplicated it hundreds of times! Not exactly
independent observations! But observations can be "correlated" and
fail to be independent due to other more subtle factors.
"Unscientific" polling in which samples are submitted by enthusiasts
will often be "biased" by interested parties providing "duplicate
copies" of favored samples. In trying to estimate frequencies in the
unbiased population, the duplication in sampling creates a significant
challenge.
3) What costs or risks are associated with updating (or failing to
update) your estimated distribution of outcomes? To illustrate,
suppose you were a jelly bean vendor, and you initially estimate the
demand for various colors of jelly beans to be equal. As time goes
by, your inventory of black jelly beans surges, but you continue to
forecast equal sales of each kind. Now as long as your restocking of
inventory is smart enough to only purchase those colors of jelly beans
that are depleted, the economic harm of maintaining the unrealistic
forecast of equal demand for all colors is not too serious. Your
reordering over time would probably provide data for a better
forecast, but a case could be made for not bothering to analyze the
situation too closely.
regards, mathtalk-ga
Clarification of Question by azeric64-ga on 28 Dec 2004 14:23 PST
Hello again Mathtalk-ga,
Sorry about the lag in my response; I was away for the Christmas
holiday and out of Internet range (harder to do every year, but
achievable if you put your mind to it).
To answer your questions:
1) The outcomes are chosen one at a time.
2) Samples are independent.
3) The costs for failing to update the outcome distribution are not
catastrophic. It is more a case of trying to refine the estimated
prevalence of each cause of a problem in order to reduce the overall
cost of troubleshooting.
I also appreciate your latest comments. True, I'm not specifically
trying to measure something like process improvement, but I am
searching for techniques that may have been developed in that venue
that I can still apply here.
Looking forward to your reply,
AZeric64-ga