Q: Significance of discrete random variable with outcomes on a nominal scale (No Answer, 3 Comments)
Question  
Subject: Significance of discrete random variable with outcomes on a nominal scale
Category: Science > Math
Asked by: azeric64-ga
List Price: $20.00
Posted: 16 Dec 2004 10:59 PST
Expires: 15 Jan 2005 10:59 PST
Question ID: 443517
I am sampling a single discrete random variable with k bins.  Each bin
represents an outcome on a nominal scale, so none of the usual
distributions apply.  Given k and n (n being the sample size), I need
to compute the probability that the sample represents the distribution
of outcomes in the population.  I suppose some sort of confidence
interval comes into play here as well.  Is there a direct formula or
set of formulas that determine this probability?  If so, what is it,
and what are the caveats when applying it?

Clarification of Question by azeric64-ga on 20 Dec 2004 09:08 PST
Thanks for your comments, and allow me to clarify.  

There is a one-to-one correspondence between outcomes and bins: a
sample with a particular outcome always goes into the bin for that
outcome.  So all samples with the outcome "A" go into the "A" bin, all
samples with outcome "B" go into the "B" bin, etc.  There is typically
a small number of bins, with k ranging from 2 up to maybe 25 as an
upper bound.

Wouldn't the probability of 1/k only apply if I could assume that the
outcomes all have uniform probability?  I do know that the outcomes
definitely do not have uniform probability; the idea is that we have a
good guess as to the distribution of the outcomes, but want to use
statistical feedback to correct the distribution, and want to know
when we have collected enough results to replace our initial guess.

Mathtalk, you make an excellent point about using all of the bins in
the sample.  One requirement of this measurement is that we want to
collect a "complete" sample in the sense that every bin must contain
at least one outcome.  However, it may be that certain outcomes that
we think will occur may never occur, especially if we know that they
will be particularly rare, and we want to know if that's the case too,
while still being able to know how prevalent the other outcomes are.

The way that we are applying this involves using our initial guess for
the prevalence of the outcomes while collecting the actual number of
outcomes until we are confident that the data gives us a better
picture of the prevalence.  Ideally, we'd like to make incremental
corrections to the prevalences to gradually bring them into line with
the actual data.

Finally, I should have mentioned at the beginning that the population
is sampled without replacement.  I'm not sure if this makes a difference
or not.

Request for Question Clarification by mathtalk-ga on 20 Dec 2004 13:21 PST
Hi, azeric64-ga:

In general outline one needs to define a measure of how close (or far
away) a sample population is from the hypothesized distribution.  A
typical approach does this by a weighted sum-of-squares "error"
statistic, i.e. add for all bins the sum of squares of differences
between observed and expected content size, divided by the expected
content size.  Such a statistic has a discrete distribution for any
fixed sample size, but if the sample size is large enough one can
apply a continuous approximation.  If the "error" statistic is large
enough that it would have only a small probability of arising by
"random" chance, it is then considered "significant" in the context of
rejecting the hypothesized distribution.
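
In JavaScript terms, a minimal sketch of this statistic might look as
follows (assuming observed and expected are equal-length arrays of
per-bin counts; the function name is only a placeholder):

   // Weighted sum-of-squares "error" statistic: for each bin, take
   // (observed - expected)^2 / expected, and add these up over all bins.
   function errorStatistic(observed, expected) {
     var total = 0;
     for (var i = 0; i < observed.length; i++) {
       var diff = observed[i] - expected[i];
       total += (diff * diff) / expected[i];
     }
     return total;
   }

   // Example: 3 bins, 30 observations, hypothesized proportions 0.5, 0.3, 0.2:
   //   errorStatistic([18, 7, 5], [15, 9, 6])  ->  about 1.211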

Would this sort of information respond to your request for a "direct"
formula or formulas and caveats for applying them?

regards, mathtalk-ga

Clarification of Question by azeric64-ga on 20 Dec 2004 16:18 PST
Hi again, Mathtalk-ga: 

I'm not trying to test a hypothesis as such, just trying to adjust
what I initially believe to be the distribution of outcomes.  (I'm
building a "self-learning" troubleshooting application.)  I don't
think that a weighted sum-of-squares error statistic is quite what I'm
after.  In fact, if I can possibly avoid it, I don't want to compare
the hypothesized distribution to the collected sample in order to
determine the significance.  Instead, I'd like to look at the sample
by itself and have a function using only the sample size (n) and the
number of bins (k) to give me a numerical rating, say between 0 and 1
inclusive, of how statistically significant the sample is.

I'm assuming that the more bins you have, the larger the sample size
you need to ensure that it is statistically significant.  Is this
assumption correct?

I think that what I want is more related to how, in designing a
survey, you would determine the minimum sample size given a discrete
random variable such as I have described, except that I want to apply
it to a sample that I already have to see whether or not it meets
the minimum criteria for being statistically significant.  If the
sample is statistically significant, I want to just use the sample as
my measure of the prevalence of each outcome; but if the sample is not
significant, I want to use a measure of its "fractional significance",
if that term makes sense, as a weighting factor against the
hypothesized distribution when I display the prevalence of the
outcomes to the user.

I'm guessing that, for a given value of k, the function of n that
yields the "fractional significance" of a not-yet-significant sample
resembles an elongated "S", much like the cumulative normal
distribution.  The inflection point would be where the fractional
significance equals 0.5, and then the fractional significance would
asymptotically approach 1 as n continues to increase.

The effect I'm looking for is that as actual samples accumulate,
one-by-one, the user of the software would observe over time the
prevalence transitioning from the "hypothesized" distribution to the
actual distribution (I will actually be using the entire population as
a sample, with the population starting at zero and growing by
increments of one).  So another way of asking this is to determine when
the population is large enough to allow you to accurately determine
the distribution of outcomes as it continues to increase.  But I still
need an initial guess to work with until it reaches that point, while
also being able to make gradual adjustments in the interim to give the
best possible guess as to the distribution at every increment of the
population.

I'm looking for something relatively simple because this measure has
to be made by a JavaScript program running in a Web browser, so it can
have a quick-and-dirty aspect that won't necessarily be valid for
boundary conditions, such as when the sample size is below a certain
threshold, which I can test for in my JavaScript program.

Granted, this is a somewhat perverse use of a statistical measurement,
but it seems that it should still yield sensible results.

Thanks for your perseverance!

Regards, 

AZeric64-ga

Clarification of Question by azeric64-ga on 20 Dec 2004 16:25 PST
One last point: because the function only equals 1 as a limit as n
approaches infinity, once the fractional significance reaches, say,
0.95 or 0.99, I'd consider the sample to be "fully" significant.  This
would not have to be as rigorous as other significance tests that
yield a "p" value, just something that works in a practical
application.  It could even use an extra fudge factor to err on the
side of caution so that I would have to have a larger sample size than
is strictly necessary to attain that level of significance.

Thanks again!  

AZeric64-ga

Request for Question Clarification by mathtalk-ga on 20 Dec 2004 17:31 PST
Sorry, I'm not grasping what it is you are looking for.  Of course you
can constantly adjust your estimated distribution based on what
samples you've observed, and you can base the adjustment on very
trivial calculations if you like.

However you've said above that you want to assign a number between 0
and 1 that describes "fractional significance" of a sample _without_
comparing "the hypothesized distribution to the collected sample in
order to determine the significance."

Now the notion of "statistical significance" is already perverse
enough, so I'm not concerned that we're tampering with sanctified
truth!  But I just don't get what makes a sample, in and of itself,
more or less significant.  In one respect I suppose you could base the
figure simply on the size of the sample, regardless of what bins the
observations fall into.  Bigger is better, and we could just pick a
mathematical function of n that tends to 1 as n goes to infinity.

So, I think I'm missing something central about what you'd like to do
with the "significance" values.

regards, mathtalk-ga

Clarification of Question by azeric64-ga on 21 Dec 2004 15:50 PST
Mathtalk, 

Thanks for bearing with me through this.  Actually, we're already on
the same page (the same paragraph, in fact); at this point it's just a
matter of particulars.  Talking this through with you has been
a great help so far!

I think at this point it's best if I outline a solution that I have
just come up with, and then have you tell me where the flaws are and
further if they really matter in a practical sense.

Approaching the problem requires that we divide it into two parts: the
first part determines the parameters we use for the fractional
significance formula, and the second part determines how we apply it.

First part: 

Recall a statement I made in an earlier clarification: "the more bins
you have, the larger the sample size you need to ensure that it is
statistically significant."

A Web page dealing with determining sample size bears this out:
http://www.isixsigma.com/library/content/c000709.asp
This page also gives me some insights that I think will help.  The
formula for sample size (about 2/3 of the way down the page, before
the banner ads and just after the illustration of the normal
distribution curve) directly supports this idea if you liken E (the
margin of error) to 1/k (the inverse of the number of outcomes).
Increasing k then increases the required sample size in proportion to
the square of k, because when we substitute 1/k for E, the denominator
disappears and we are left with the formula

      s = [z(sub alpha/2) * sd * k]^2

where 

   s is the required sample size  

   z(sub alpha/2) is the critical value, and
   
   sd is the standard deviation.  

The critical value and standard deviation ought to be constant for all
applications of the formula.  The trick is determining what their
values need to be.  The critical value depends on the degree of
confidence we want (typically 95%, but I'd like 99% if that is still
practical) and can be read from a table of standard Z-scores (although
I haven't taken a statistics class in about 12 years, so the technique
escapes me).

Here's the approach to determining the standard deviation that I am
considering: even though we have a nominal scale, can we still assume
that all outcomes would fall within three standard deviations of the
mean?  (This covers 99.7% of cases, a close enough approximation to
100%.)  That would make the standard deviation equal to one-sixth of
the range within which all the outcomes would fall.  Does this make
sense?  Remember, it's a nominal scale, so I feel that I'm stretching
it here.
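
For concreteness, here is a minimal JavaScript sketch of that
calculation, assuming the sd = 1/6 heuristic above and the standard
normal critical values 1.960 (for 95%) and 2.576 (for 99%); whether
those assumptions hold is exactly what I'm asking about:

   // s = [ z(sub alpha/2) * sd * k ]^2, after substituting 1/k for E.
   // zCritical: 1.960 for 95% confidence, 2.576 for 99% confidence.
   function requiredSampleSize(k, zCritical, sd) {
     var s = Math.pow(zCritical * sd * k, 2);
     return Math.ceil(s);   // round up to a whole number of observations
   }

   // With the sd = 1/6 heuristic:
   //   requiredSampleSize(10, 1.960, 1/6)  ->  11
   //   requiredSampleSize(10, 2.576, 1/6)  ->  19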

So what we have so far, assuming that I have not completely left the
realm of statistical sanity, is a technique for determining the
minimum population size (which, recall, is equal to the sample size)
necessary to be able to replace the initial guess of the outcome
distribution with the distribution indicated by the collected results.
This will be an integer constant (call it "s", as indicated above)
that we can determine at the time that we build our application, and
will factor into the fractional significance formula that we will
apply to the collected results each time we run it.

So what I would need from you at this point, assuming that I have
explained all this clearly enough, is a determination of what the
critical value and the standard deviation should be. Again, the
critical value should come directly from the standard Z-score for the
confidences I indicated, and if my assumptions about what I can do with
the sd are correct, sd might simply be 0.166667.  I need your opinion
on both counts.



Second part: 

Now comes the point where, each time we run the program, we apply the
elongated S-curve function I previously described, adjusted according
to k, so that it indicates that the population has statistically
significant size when it equals s.  For values less than s, the result
of the function yields the fractional significance.

The quandary now is that the cumulative normal distribution is
difficult to calculate, and I need to determine my fractional
significance on-the-fly.  I could implement the Taylor series for it
if I had to, but instead I am considering a different approximation.

And here is where I really want to commit a statistical heresy for the
sake of simplifying the calculations and being happy with an
approximation (your testimony that significance measures are already
perverse enough gives me some confidence here).  The cumulative normal
distribution, for the range of values that we are considering (three
standard deviations on either side of the mean), strongly resembles
the sin(x) function in the range -pi/2 to pi/2, along with a little
translation and scaling to get the outcomes between zero and one. 
(sin() is a function included with JavaScript.)  So if you ignore the
requirement that the distribution have asymptotic tails and run the
sample size through the sin() function with the appropriate
adjustments to the parameter and the result, it should yield a value
between 0 and 1 that we can use to weight the sample outcomes against
our guesses for them.  Low values of n will yield values near zero,
and values of n near s will yield values near one.  sin() also
satisfies the requirement that the inflection point occur where the
result equals 0.5.

The actual formula would be 

   [1 + sin(pi*n/s - pi/2)] / 2
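
As a minimal JavaScript sketch (the clamp at n >= s is an extra
assumption on my part so the weight never exceeds 1):

   // Fractional significance: [1 + sin(pi*n/s - pi/2)] / 2,
   // an S-curve running from 0 at n = 0 to 1 at n = s,
   // with its inflection point (value 0.5) at n = s/2.
   function fractionalSignificance(n, s) {
     if (n >= s) return 1;             // clamp: treat the sample as fully "significant"
     return (1 + Math.sin(Math.PI * n / s - Math.PI / 2)) / 2;
   }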

Now to get the outcome distribution that we want to display, we take
the fractional significance value yielded by our formula and call it
the sample significance.  Then we subtract it from one to get the
initial-guess significance.  Then, for each outcome, the estimated
prevalence is the sample significance multiplied by the actual number
of times that outcome was observed, plus the initial-guess
significance multiplied by the initial guess for that outcome.  In
other words, it's just a weighted sum of the initial guess with the
collected results.  The initial guesses, of course, will first have to
be normalized according to s.
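
A rough JavaScript sketch of this weighting, under one reading of the
normalization step (both the observed counts and the initial guesses
are expressed as proportions summing to 1 before blending); the
function names are just placeholders:

   // Blend the observed distribution with the initial guess, weighting the
   // observed side by the fractional significance w and the guess by (1 - w).
   function blendedPrevalences(observedCounts, guessedPrevalences, w) {
     var n = 0;
     for (var i = 0; i < observedCounts.length; i++) n += observedCounts[i];
     var result = [];
     for (var i = 0; i < observedCounts.length; i++) {
       var observedProportion = (n > 0) ? observedCounts[i] / n : 0;
       result[i] = w * observedProportion + (1 - w) * guessedPrevalences[i];
     }
     return result;
   }

   // Usage: w = fractionalSignificance(n, s) from the formula above.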

So there you have it.  Again, I think all I need from you at this
point is what the critical value and the standard deviation should
be, and an opinion as to whether I can use the sin() function as I
have described.

Thanks Again, 

AZeric64

Request for Question Clarification by mathtalk-ga on 21 Dec 2004 21:40 PST
Hi, azeric64-ga:

I've posted some initial remarks in response to your ideas below, as a Comment.

regards, mathtalk-ga

Request for Question Clarification by mathtalk-ga on 22 Dec 2004 05:07 PST
Maybe a few questions can help me make useful suggestions:

1) When you sample the distribution (I believe you mentioned at one
point that this is "without replacement"), are the outcomes chosen one
at a time?  If not one at a time, are the sizes of samples themselves
a random outcome?  (For example, in birding, the number of
observations in a day will fluctuate.)

2) Are the samples chosen "independently"?  This is a key issue from
the point of view of statistical estimation.  For example, there was a
requirement in the construction of the Alaska pipeline that X-rays be
taken of a certain percentage of the welds, as a QA check that the
work was done properly.  However it turned out that a contractor had
taken one X-ray and duplicated it hundreds of times!  Not exactly
independent observations!  But observations can be "correlated" and
fail to be independent due to other more subtle factors. 
"Unscientific" polling in which samples are submitted by enthusiasts
will often be "biased" by interested parties providing "duplicate
copies" of favored samples.  In trying to estimate frequencies in the
unbiased population, the duplication in sampling creates a significant
challenge.

3) What costs or risks are associated with updating (or failing to
update) your estimated distribution of outcomes?  To illustrate,
suppose you were a jelly bean vendor, and you initially estimate the
demand for various colors of jelly beans to be equal.  As time goes
by, your inventory of black jelly beans surges, but you continue to
forecast equal sales of each kind.  Now as long as your restocking of
inventory is smart enough to only purchase those colors of jelly beans
that are depleted, the economic harm of maintaining the unrealistic
forecast of equal demand for all colors is not too serious.  Your
reordering over time would probably provide data for a better
forecast, but a case could be made for not bothering to analyze the
situation too closely.

regards, mathtalk-ga

Clarification of Question by azeric64-ga on 28 Dec 2004 14:23 PST
Hello again Mathtalk-ga, 

Sorry about the lag in my response; I was away for the Christmas
holiday and out of Internet range (harder to do every year, but
achievable if you put your mind to it).

To answer your questions: 

1) The outcomes are chosen one at a time.  

2) Samples are independent.  

3) The costs for failing to update the outcome distribution are not
catastrophic.  It is more a case of trying to optimize the prevalence
of each cause of a problem in order to reduce the overall cost of
troubleshooting.


I also appreciate your latest comments.  True, I'm not specifically
trying to measure something like process improvement, but I am
searching for techniques that may have been developed in that venue
that I can still apply here.

Looking forward to your reply, 

AZeric64-ga
Answer  
There is no answer at this time.

Comments  
Subject: Re: Significance of discrete random variable with outcomes on a nominal scale
From: manuka-ga on 16 Dec 2004 16:49 PST
 
This will surely depend on the distribution of your random variable -
more precisely on the probability of the outcome lying within each
bin. At one extreme, if the outcomes are always within a single bin
then the sample will always represent the distribution of outcomes
exactly.

At the other extreme, if each of the k bins has probability 1/k of
being selected by a given outcome, and supposing for convenience that
n is a multiple of k (say n = mk) then the probability of getting an
exactly correct representation is the probability that each bin gets m
outcomes, i.e.

(n, m) (1/k)^m . (n-m, m) (1/k)^m . (n-2m, m) (1/k)^m . ... . (m, m) (1/k)^m

where I am using (n, m) to represent n choose m, i.e. n! / (m! (n-m)!).

This simplifies to n! / [(m!)^k . k^n].
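
A small JavaScript sketch of this probability, computed with
logarithms to avoid overflowing the factorials (the function name is
only illustrative):

   // P(each of k bins receives exactly m of the n = m*k outcomes, uniform case)
   //   = n! / ( (m!)^k * k^n )
   function exactRepresentationProbability(k, m) {
     var n = m * k;
     var logP = 0;
     for (var i = 2; i <= n; i++) logP += Math.log(i);       // + log(n!)
     for (var i = 2; i <= m; i++) logP -= k * Math.log(i);   // - k * log(m!)
     logP -= n * Math.log(k);                                // - n * log(k)
     return Math.exp(logP);
   }

   // Example: k = 2 bins, m = 5 (so n = 10) gives 10!/(5!^2 * 2^10)
   //          = 252/1024 = 0.24609375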

Of course, you're more likely to be interested in the probability that
the distribution among the k bins is "close enough" to the correct
distribution. The answer to that will depend on how you define "close
enough", and as before on the distribution of the variable - though
the uniform random version will at least give you a lower bound.
Subject: Re: Significance of discrete random variable with outcomes on a nominal scale
From: mathtalk-ga on 17 Dec 2004 09:45 PST
 
As manuka-ga's Comment outlines, it is implied that a multinomial
distribution exists among the bins.  Developing a "confidence
interval" for each individual bin might be conservatively treated by
reducing to several binomial distribution cases, e.g. either in Bin X
or not in Bin X.

Significance testing, on the other hand, would be more powerful if all
bins are used in a sampling statistic.
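
To illustrate the "in Bin X or not in Bin X" reduction, a rough
JavaScript sketch of a normal-approximation confidence interval for a
single bin's proportion (the usual caveats about small counts apply,
and the function name is just a placeholder):

   // Approximate (1 - alpha) confidence interval for the proportion of
   // outcomes falling in one bin, treating "in the bin / not in the bin"
   // as a binomial and using the normal approximation.
   // z: 1.960 for a 95% interval, 2.576 for 99%.
   function binProportionInterval(binCount, sampleSize, z) {
     var p = binCount / sampleSize;
     var halfWidth = z * Math.sqrt(p * (1 - p) / sampleSize);
     return {
       lower: Math.max(0, p - halfWidth),
       upper: Math.min(1, p + halfWidth)
     };
   }

   // Example: 18 of 60 outcomes in the bin, 95% interval:
   //   p = 0.3, half-width = 1.960 * sqrt(0.3*0.7/60), about 0.116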

regards, mathtalk-ga
Subject: Re: Significance of discrete random variable with outcomes on a nominal scale
From: mathtalk-ga on 21 Dec 2004 22:09 PST
 
It's important to bear in mind the context of the discussion of sample
size and standard deviation at the site linked above by azeric64-ga.

The premise there is settling a question of whether a change to
process has resulted in improvement, where the measurable improvement
is a quantity that has a random fluctuation according to a normal
distribution quite independent of whether any changes are made to the
process.

One should therefore ask whether an observed improvement in the
measured outcome might reasonably be ascribed to chance rather than to
the designed changes.

In a situation of this kind one asks for repeated independent "trials"
both with and without the designed changes.  The empirical testing
"samples" the outcomes, and a mean of the samples for both treatments
is derived by averaging the respective measurements.

For a normally distributed population, the means of samples of a fixed
size k will themselves be normally distributed, with a standard
deviation that is a factor of SQRT(k) smaller than the standard
deviation of the "host" population.
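
As a quick numerical check of the SQRT(k) factor, a small JavaScript
simulation sketch (the Box-Muller draw and the example parameters are
just for illustration):

   // Draw many samples of size k from a normal(mu, sigma) population and
   // check that the standard deviation of the sample means is near sigma/sqrt(k).
   function standardNormal() {                       // Box-Muller transform
     var u = 1 - Math.random(), v = Math.random();
     return Math.sqrt(-2 * Math.log(u)) * Math.cos(2 * Math.PI * v);
   }

   function sdOfSampleMeans(mu, sigma, k, trials) {
     var means = [], sum = 0;
     for (var t = 0; t < trials; t++) {
       var s = 0;
       for (var i = 0; i < k; i++) s += mu + sigma * standardNormal();
       means[t] = s / k;
       sum += means[t];
     }
     var avg = sum / trials, varSum = 0;
     for (var t = 0; t < trials; t++) varSum += Math.pow(means[t] - avg, 2);
     return Math.sqrt(varSum / trials);
   }

   // sdOfSampleMeans(0, 6, 36, 10000) should come out near 6 / Math.sqrt(36) = 1.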

Assuming that the process change shifts the mean of the population
(but not its variance or standard deviation), one can estimate in
advance how large a sample size will be needed to detect a shift in
the mean of a certain size relative to the underlying standard
deviation with a certain level of "significance".

The notion of "significance" here is tricky, or even as we previously
alluded, perverse.  It does not mean what we would wish it to, namely
how likely it is that an observed impact is real.  Instead it is a
"converse" notion.  It tells us how "unlikely" an observed difference
would be under a hypothesis that the design changes were completely
ineffective (as if nothing whatever had actually been done to change
the setup).  A small "level of significance" is important, in this
context.  The smaller the level of significance, the more unlikely an
observed change _would_ occur by random chance (given the assumed
normal distribution of outcomes).

Now the situation of outcomes being distributed into bins differs from
this "process improvement" case study in a number of ways.  First and
most obviously, the distribution of outcomes is inherently discrete;
only a finite number of outcomes are possible, which is to be compared
with the continuous distribution for which a normal distribution is
the elementary paradigm.

Second, the motive for planning a sufficient sample size in a process
improvement study is clear enough that a rational basis exists for
balancing the costs of increasing sample size versus the potential
benefit of an improvement.  For example, a sample size may be chosen
so that it will be likely to produce a significant "result" _if_ the
magnitude of the improvement is big enough to be profitable.  There is
no absolute notion of a sample being significant or not significant,
even on a sliding scale from 0 to 1.

The part of statistics that is most closely tied to testing the "level
of significance" of experimental results is called statistical
inference.  Basically it incorporates all the methods which might be
invented to estimate parameters of distributions _or_ to make
comparisons between distributions that are known only or partly
through observations.

I have the feeling that some study (a good night's rest!) will be
needed before I can bring into sharp focus the new direction in which
you've pointed.

regards, mathtalk-ga
