Q: Statistics: Hypothesis Testing needed; rigged dice. ( Answered 5 out of 5 stars,   4 Comments )
Question  
Subject: Statistics: Hypothesis Testing needed; rigged dice.
Category: Science > Math
Asked by: donphiltrodt-ga
List Price: $11.00
Posted: 16 Apr 2003 04:28 PDT
Expires: 16 May 2003 04:28 PDT
Question ID: 191123
What if I somehow rigged a theoretical die so that two numbers never 
showed?  How many times would I have to roll that die to ensure that 
my die was successfully rigged?  What if I rigged a pair of dice?  How
many times would I have to roll the pair to ensure that my cheating 
worked, and I wasn't just seeing a "streak"?

Note to Google editors: this is a simplification of my previous
statistical question, not an attempt to swindle.  Cancelleth not mine
question.

Request for Question Clarification by mathtalk-ga on 16 Apr 2003 11:34 PDT
Hi, donphiltrodt-ga:

I'm interested in answering your question, but let's make sure the
question isn't itself "loaded"!

The conventional statistical approach to questions of this type is
called "hypothesis testing".  A critical element is the formulation of
alternatives.

If it were only a matter of distinguishing between normal dice and
dice rigged so that two numbers never appear, then we could proceed as
follows.  You choose a "level of significance", a small probability,
in advance of doing the experiment.  For example a typical
significance value might be 0.5% (one in 200).  You would also choose
a large number of repetitions for rolling the dice and count how many
of those repetitions produce the "forbidden" numbers.

Here the "normal dice" case would probably be called the "null
hypothesis", ie. that there is no difference between these dice and
statistically random ones.  The alternative hypothesis is here is that
the dice have been rigged, and specifically so that two numbers never
appear.

After the experiment is conducted, one calculates the probability that
the observed number of repetitions (of forbidden numbers) would occur
under the null hypothesis, i.e. with regular dice.  If this probability
is less than the foreordained "level of significance", then we reject
the null hypothesis.

If these were the only two actual alternatives, then the test results
would be clear cut.  But are the alternatives really so simple?  Is it
possible, for example, that your tampering with the dice _reduced_ the
probability of getting those numbers but not to zero?  Such an
interpretation might be natural if the number of observed "forbidden"
repetitions were significantly lower than would be expected from
regular dice, but not zero.

Please clarify what alternatives need to be considered here, and with
what precision.

Finally, you are certainly welcome to expire a question for any reason
prior to it being answered, and to repost a revised version of the
question (or not).  The terms of service encourage you to do this
whenever it is in your interest.  It is entirely at your discretion,
and no taint of swindle attaches to this.

However, you may have noticed that this question wound up being locked
for an extended period after you posted it.  This was due to your
having used the G-word (Google) in the post.  When customers do this,
there is an automatic lock placed on the question so that the Google
Answers Editors have an opportunity to respond to items that are
properly referred to them, which I take it was not your intention
here.

Thanks for posting these interesting questions, and I look forward to
reading your clarification.

regards, mathtalk-ga

Clarification of Question by donphiltrodt-ga on 16 Apr 2003 15:46 PDT
Excellent, thought-provoking questions from mathtalk and racecar. 
Thank you.

________Background___________

Now that I've a) reflected on your questions and b) learned more about
hypothesis testing, I'm starting to think that the skill I seek to
develop is the ability to compare and contrast the frequency
distributions of two sample sets (A and B).  Abstract example
questions:...

1. Is set A different than set B because of dumb luck or because of
influence I exerted (changed a policy, advertised, moved a link)?

2. How certain am I of #1?

Such a skill goes far beyond what this forum can teach me.  But I sure
as heck can use the researchers as trail guides to  point me in the
right direction and help frame my questions -- just as you've already
done.  Thank you.

So I hereby alter the question frame and price...

=========================================================


Dice seem like a good realm for comparing frequency distributions. So
here's a hypothetical situation...

________Scenario_______________

Cletis has a way to rig the dice before each roll.  On any given roll,
the cheat (when successful) prevents either die from displaying 2 of
the numbers on it (1-6).  The "forbidden numbers" are chosen by
Cletis.  Therefore, the cheat, when successful, changes the
probability distributions of the dice.  On each roll, the cheat either
works or it doesn't: the cheat is either 100% effective (changes the
probabilities) or 0% effective (no effect).

But Cletis' cheating system doesn't provide any instant feedback on
the success of his cheating, i.e., he has no way of knowing whether or
not his cheat/rig worked after the dice stop moving.  Therefore, the
only way he can see if his cheat is effective is to throw the dice a
lot and see if his cheating changes the frequency distribution of the
results.
 
Cletis is also lazy: He wants to see if his cheat works, but he
doesn't want to roll the dice any more than he has to.  But he knows
that Lady Luck could fool him into thinking his cheat was working when
it wasn't.  For example, let's say he rolled the dice a bunch of
times and the frequency distribution was kinda-maybe-sorta close to
what it would be if the cheat was successful on, say, 50% of the rolls.
He doesn't know for certain whether or not his cheat was successful on
a) all the rolls, b) some of the rolls, or c) none of the rolls.  And
even if he did somehow determine a, b, or c, he doesn't know how
confidently he could state that his cheat worked X% of the time, because
he doesn't know how many times he has to roll the dice to be sure.
 
________Break-down_______________

 Okay.  So that's the scenario.  I see a triangle of three factors:
 
 1) Sample size (number of rolls)
 
 2) Guessed percentage of rolls successfully rigged/cheated
 
 3) Certainty of #2.
 
Further, it seems to me like factors #2 and #3 might actually make
better sense like this:

Column 1: assertion of cheat success percentage
Column 2: certainty of assertion (numbers pulled out of the air)

0-9	-	.6
10-19	-	.31
20-29	-	.23
30-39	-	.83
40-49	-	.52
50-59	-	.09
60-69	-	.61
70-79	-	.55
80-89	-	.98
90-99	-	.31
 

________My Questions_______________

1) What statistical principles are at work here?  I'm not the first
person to think of these things, so surely these concepts must have
names?  What are they?  Where can a non-math person learn more about
these principles, preferably in a step-by-step manner, assisted by
software (next question)?

2) Is there statistical software you'd recommend for a non-math
stats-newbie who is willing to put in some time to learn the concepts
of statistics but would rather the software take care of the math?
Answer  
Subject: Re: Statistics: Hypothesis Testing needed; rigged dice.
Answered By: mathtalk-ga on 23 Apr 2003 21:56 PDT
Rated:5 out of 5 stars
 
Hi, donphiltrodt-ga:

Executive Summary
=================

Let p be the probability that a "cheat" works, preventing two of the
six equally likely faces of a die from appearing.  We can't directly "observe"
p, but in counting how many "forbidden faces" M appear in any sample
of N throws, the observed ratio M/N is on average expected to be:

q = (1 - p)/3

because forbidden faces appear with chance 1/3 when the cheat fails.

As p varies from 1 down to 0, q varies respectively from 0 up to 1/3.
Conversely if we solve for p:

p = 1 - 3q

Our assignment is to explain the relationship between the number of
trials N and the accuracy with which q and thus p can be estimated.

What is a "statistic"?

A statistic is a number which summarizes a set of data.  One of the
most useful (and common) statistics is the mean (average).

In a binomial distribution it is usual to label one outcome as 0 and
the other as 1.  Then the mean of the distribution is also the probability
of the latter outcome.  In our case we'll consider "forbidden face"
throws as being the outcomes with the value 1, so that q is also the
mean of the binomial distribution.  

A binomial distribution is completely characterized by just this one
parameter q as the mean.  For example the mean q also tells us the
variance of the binomial distribution:

s^2 = q(1-q)

In practical terms one can only "sample" a binomially distributed
population through some finite subpopulations, as we do here by 
taking N consecutive throws.

The ratio M/N already discussed is then a "sample mean" statistic,
i.e. the mean of the sample (finite subpopulation).

The statistic M/N is an effective way to estimate the parameter q.
In a role such as this, statistics are called "point estimates".

Note that in forming the "point estimate" of q:

q ~ M/N

we would also have the corresponding point estimate of p:

p ~ 1 - (3M/N)
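
As a quick illustration, here is a minimal Python sketch that simulates N
throws with an assumed cheat success rate and forms these point estimates.
The value of true_p and the choice of 5 and 6 as the forbidden faces are
arbitrary assumptions made only for the example:

  import random

  def throw_die(p_cheat, forbidden=(5, 6)):
      # With probability p_cheat the cheat works and the forbidden faces
      # cannot appear; otherwise all six faces are equally likely.
      if random.random() < p_cheat:
          faces = [f for f in range(1, 7) if f not in forbidden]
      else:
          faces = list(range(1, 7))
      return random.choice(faces)

  true_p = 0.7                     # assumed cheat success rate (illustration only)
  N = 10_000                       # number of throws
  M = sum(throw_die(true_p) in (5, 6) for _ in range(N))

  q_hat = M / N                    # sample mean of "forbidden face" indicators
  p_hat = 1 - 3 * q_hat            # corresponding point estimate of p

  print(q_hat, (1 - true_p) / 3)   # observed vs. expected q
  print(p_hat, true_p)             # estimated vs. assumed p

With 10,000 throws the estimate p_hat typically lands within a couple of
percentage points of the assumed true_p, which previews the accuracy question
taken up next.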

How good is this estimate?  That is the critical question.

In itself a point estimate discloses nothing about the accuracy of
its approximation.  While it may be helpful to know something of the
average error of approximation, this doesn't directly address a need
to assess the accuracy of an estimate made from a single sample.

A classic tool for such analysis is the "confidence interval".  A pair
of values [a,b], bracketing a range of possibilities for p or q, does
a better job than a single number of describing a "likely" truth about
a parameter estimate.

A third number, the "level of confidence" c, is also associated with
a confidence interval.  The level of confidence is the probability,
for a known population distribution, that a random sample (x1,...,xN)
produces an interval that contains the parameter being estimated.

There are many valid recipes for confidence intervals, and I have
been struggling to come up with an accessible yet mathematically
sound approach to explaining the basic ones.  I think I'll have to
punt on this though, because the discussion quickly gets into a lot
of theory of special distributions (normal, chi-square, Student's t,
and Fisher's f are all relevant).

Instead let me point you to a couple of Web pages that have Excel
implementations of some of these calculations.

[Confidence limits]
http://www.quantdec.com/envstats/notes/class_08/confidence.htm

[Exact Binomial and Poisson Confidence Intervals]
http://members.aol.com/johnp71/confint.html

The first of these links you to a download of an Excel spreadsheet
that allows one to plug in N and a confidence level and get the
corresponding confidence limits for various M.  This spreadsheet uses
Excel's built-in functions, which are not all that adequate for very
large N.

The second site has an online calculator, but also allows you to
download an Excel spreadsheet that implements the essential functions
more carefully in VBA "macros" (code).  The author claims to have
tested these for accuracy with very large N.
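
If you prefer code to a spreadsheet, here is a rough Python sketch of the
exact (Clopper-Pearson) binomial confidence interval for q, translated into
an interval for p.  It assumes the scipy library is available, and the counts
N and M below are made up for illustration rather than taken from any real
experiment:

  from scipy.stats import beta

  def clopper_pearson(M, N, conf=0.95):
      # Exact two-sided confidence interval for the binomial proportion q = M/N.
      alpha = 1 - conf
      lo = 0.0 if M == 0 else beta.ppf(alpha / 2, M, N - M + 1)
      hi = 1.0 if M == N else beta.ppf(1 - alpha / 2, M + 1, N - M)
      return lo, hi

  N, M = 600, 55                        # illustrative counts only
  q_lo, q_hi = clopper_pearson(M, N)
  # p = 1 - 3q decreases as q increases, so the interval endpoints swap:
  p_lo, p_hi = 1 - 3 * q_hi, 1 - 3 * q_lo
  print(q_lo, q_hi)                     # 95% confidence interval for q
  print(p_lo, p_hi)                     # 95% confidence interval for p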

More than you probably wanted to know
=====================================

If you decide you want to read up on the mathematics behind these
various approaches, I'd suggest M.G. Bulmer's "Principles of
Statistics", available as an inexpensive Dover edition.  Chapter 10 on
Statistical Inference is where all the conflicting opinions over
confidence intervals are fairly compared.

The "classical" approach to confidence intervals has a characteristic
that, for a fixed confidence level, the width of the confidence
interval is roughly proportional to 1/sqrt(N).  That means to narrow
the "precision" of estimating p or q by a factor of 10 would require
increasing the number of throws by a factor of 100.  Obviously if high
precision estimates were required, this approach would be quite
frustrating.

Better results can be obtained by making some reasonably strong
assumptions about the "a priori" distribution of the parameter p or q,
i.e. what values they are likely to have before any throws are made.

Then one can apply the Bayesian approach to confidence intervals.  I
will sketch the calculations in a simple case, where we assume the
cheat either never works or it always works.

A conditional probability has the form:

Pr(A|B) = Pr( event A occurs, given that event B occurs )

In other words, with a priori knowledge of B, what is the chance of
A happening.  In elementary probability we define a value thus:

Pr(A|B) = Pr(A&B)/Pr(B)

Now what often seems to confuse even experts is the distinction
between Pr(A|B) and Pr(B|A).  Such confusion, esp. in criminal
evidence, has come to be known as the "Prosecutor's fallacy":

[Prosecutor's fallacy - Wikipedia]
http://www.wikipedia.org/wiki/Prosecutor's_fallacy

A correct and rigorous relationship between Pr(A|B) and Pr(B|A)
can be stated, but it requires the a priori probabilities of
both A and B.

We can illustrate Bayesian analysis here by assuming an a priori
distribution of probabilities for probability p.  To keep things
simple, let's assume that p is either 0 or 1, with equal chances
before any experiment is done.

If an experiment, with any number of trials, were to produce a
"forbidden" number on a die, that would establish p = 0 under 
these conditions; the "cheat" must not be of any effect.

But suppose for simplicity that one throw of the die occurs,
and the result is not a forbidden number.  Does this affect the
probability that the "cheat" works, ie. that p=1 ?

Note these easily verified calculations:

Pr( number not forbidden | p=0 ) = 2/3

Pr( number not forbidden | p=1 ) =  1

Bayes formula is derived from the definition of conditional
probability above by rewriting in this way:

Pr(A|B) = Pr(A&B)/Pr(B)

        = Pr(A&B)/[Pr(A&B) + Pr(not(A)&B)]
        
where by further application of conditional probabilities:

Pr(A&B) = Pr(B|A)*Pr(A)

Pr(not(A)&B) = Pr(B|not(A))*Pr(not(A))

Thus, Bayes formula:

Pr(A|B) = Pr(B|A)*Pr(A)/[Pr(B|A)*Pr(A) + Pr(B|not(A))*Pr(not(A))]

To use this in our circumstances, where:

A means "p=1"

B means "number not forbidden (in single throw of die)"

we simply plug in the values previously determined, including the
a priori probabilities for A and not(A) in this case:

Pr(A) = Pr(not(A)) = 1/2

Pr(A|B) = 1*(1/2)/[1*(1/2) + (2/3)*(1/2)] = 3/5

To summarize, assuming that "a priori" the chances that the cheat
either always works or never works are equal (50-50), then the
way that a single roll of the die affects the "a posteriori"
probabilities is that:

If a forbidden number comes up, chance that cheat works is 0%.

If a non-forbidden number comes up, chance that cheat works is 60%.

Notice that a single roll of the die in which no forbidden
number appears raises the chance that p=1 from 50% to 60%.

This sort of calculation is easily extended to the case where
N consecutive rolls of the die all fail to produce forbidden
numbers.  Intuitively as N increases, so should the probability
that p=1, and the calculations bear this out:

Pr( N numbers not forbidden | p=0 ) = (2/3)^N

Pr( N numbers not forbidden | p=1 ) =   1

Bayes formula will then tell us:

Pr( p=1 | N numbers not forbidden )

     = 1*(1/2)/[1*(1/2) + (2/3)^N * (1/2)]
     
     = 1/[1 + (2/3)^N]

which is easily evaluated on a calculator for particular values
of N.  In fact for N = 5:

Pr( p=1 | 5 numbers not forbidden ) = 88 4/11 %

and for N = 20:

Pr( p=1 | 20 numbers not forbidden ) = 99.97% (approx.)
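
These figures are easy to reproduce for any N; the following few lines of
Python evaluate the formula above, assuming nothing beyond the 50-50 a priori
split between p=0 and p=1 already described:

  for N in (1, 5, 20, 50):
      posterior = 1 / (1 + (2 / 3) ** N)
      print(N, posterior)
  # N = 1 gives 0.6 (the single-throw 60% above), N = 5 gives 0.8836...,
  # N = 20 gives 0.9997, and by N = 50 the posterior is within 2e-9 of 1.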

Qualitatively, the chance that p=1 approaches 100% exponentially
fast as N increases.  It never reaches 100% exactly of course, for
any finite value of N; there is always some "doubt" left over as
a result of the increasingly tiny term in the denominator that
corresponds precisely to the chance of a "streak" under the
possibility of p=0 (cheat never works).

But this demonstrates, using a simple assumption about the
a priori probabilities of the cheat working, that practical
inference about p might not require an unduly large number of
trials.

regards, mathtalk-ga

Request for Answer Clarification by donphiltrodt-ga on 02 May 2003 16:15 PDT
Excellent answer.  I need a bit of semantic/structural
clarification...

Is the Bayesian formula you described in the "More than you probably
wanted to know" the same concept at work in the Excel spreadsheets you
provided?

In other words, is your discussion of the Bayesian formula like
"looking under the hood" of the Excel spreadsheets or did you instead
offer the Bayesian formula as an altogether different (or more
accurate) approach to the problem?

Clarification of Answer by mathtalk-ga on 02 May 2003 18:44 PDT
It is the same abstract concept.  However the Bayesian inference
_example_ that I worked with involved a discrete probability
distribution for p (equal chances that p=0 or p=1), where the
spreadsheet presumably works with a continuous probability
distribution for p.

The spreadsheet provides a "Bayesian confidence interval" for p that
has an a posteriori probability of 95% of containing p (given the run
of experimental results).  The Bayesian confidence intervals tend to
be narrower than the "classic" confidence intervals (as might be
expected from their use of the additional a priori assumptions about
p), but the main point I was trying to raise is that these are apples
and oranges.  Bayesian inference takes p as a random variable with a
certain density function; classic inference treats the experimental
results (and thus the confidence interval derived from it) as the
random variable.

If I wanted to do a serious analysis of the experimental results, I'd
use the Bayesian approach with q assumed to have uniform (continuous)
distribution on [0,1/3].
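
That continuous analysis is straightforward to sketch numerically.  The
snippet below (Python, with numpy and scipy assumed available, and with
made-up counts M and N) places a uniform prior on q over [0, 1/3], evaluates
the posterior on a grid, and reads off a 95% Bayesian credible interval for q
and hence for p:

  import numpy as np
  from scipy.stats import binom

  N, M = 600, 55                          # illustrative counts only
  q = np.linspace(0, 1/3, 10_000)         # grid over the allowed range of q
  posterior = binom.pmf(M, N, q)          # flat prior, so the posterior is
  posterior /= posterior.sum()            # proportional to the likelihood

  cdf = np.cumsum(posterior)
  q_lo = q[np.searchsorted(cdf, 0.025)]
  q_hi = q[np.searchsorted(cdf, 0.975)]
  print(q_lo, q_hi)                       # 95% credible interval for q
  print(1 - 3 * q_hi, 1 - 3 * q_lo)       # corresponding interval for p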

Using the spreadsheet's continuous distribution for p, the width of
the confidence interval still appears to shrink in inverse proportion
to the square root of N, so qualitatively the accuracy of the estimate
is roughly the same with either approach.  However my "simplified" use
of the discrete alternatives for p led to exponential convergence of
the "estimate", a dramatic if unfair illustration of the "value" of
the a priori information in estimation.

regards, mathtalk
donphiltrodt-ga rated this answer:5 out of 5 stars and gave an additional tip of: $3.00
Excellent work.  Thank you.

Comments  
Subject: Re: Statistics: Hypothesis Testing needed; rigged dice.
From: racecar-ga on 16 Apr 2003 10:47 PDT
 
Assuming a typical, six-sided (fair) die, the probability that one of
two specified faces will come up on a given roll is 1/3.  So the
probability that neither of those faces will show is 2/3.  So if you
roll the die N times, the probability that you will never see either
of the specified faces is (2/3)^N.  For example, if the die is fair,
you must roll it 8 times for the probability of never seeing either of
two faces to be less than 5% [ (2/3)^8 = .039 ].  You must roll it 12
times for the probability to be less than 1% [ (2/3)^12 = .0077 ].
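
Those thresholds are easy to check with a few lines of Python (plain
arithmetic, assuming nothing beyond a fair die):

  import math

  # Smallest N with (2/3)^N below each significance threshold: 8 and 12.
  for alpha in (0.05, 0.01):
      N = math.ceil(math.log(alpha) / math.log(2 / 3))
      print(N, (2 / 3) ** N)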

All this is straightforward, but it only applies to a fair die.  You
might be tempted to say, "If I know that there is only a 1% chance
that neither face will show in 12 rolls of a fair die, then if that
happens, there's a 99% chance that the die is successfully rigged." 
But that would be wrong.  It would be approximately correct if you
know that half the time the rigging works perfectly, and the other
half it completely fails, leaving a fair die, but this is not a fair
assumption.  A given rigging process is unlikely to work exactly half
the time.  Without knowing anything about the probabilities underlying
the rigging process, it is impossible to give a precise numerical
answer.  It is only possible to give an exact answer to the question:
'what is the probability this would happen IF the die were fair?'. 
Nonetheless, at some point, even without knowledge of the rigging
process, it is possible to be more or less certain the die is rigged. 
If you roll 50 times and never see the two faces, you 'know' the
die is rigged because that would only happen about once in a billion
times with a fair die.
Subject: Re: Statistics: Hypothesis Testing needed; rigged dice.
From: mathtalk-ga on 16 Apr 2003 18:28 PDT
 
Hi, donphiltrodt:

Given the new "frame" of the problem, I would consider this to be a
rather simple estimation problem.  Think of it like this.  Let p be
the fraction of time that the "cheat" works.  Score a 0 if a
non-forbidden number comes up, and score 1 if a forbidden number comes
up.

Assuming the "cheat" is attempted on N tries, and that the total score
on these attempts is M <= N, then the best "unbiased" estimator of p
is given by a simple calculation:

M ~ (1 - p)N/3

p ~ 1 - (3M/N)

This corresponds to estimating the probability (1 - p) that the cheat
fails as three times the number of attempts which result in getting a
"forbidden" number divided by the total number of attempts.

The question of how good an estimate this is for p is then handled by
construction of a "confidence interval" around the estimated value. 
Even though we are dealing with repetitions of a binomial distribution
(either a forbidden number appears or it doesn't), the distribution of
the sample mean (average fraction of time a forbidden number appears)
is close to a normal distribution (by the central limit theorem).
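
As a rough sketch of what such a calculation looks like in code, here is the
usual normal-approximation ("Wald") interval in Python (scipy assumed; the
counts N and M are invented purely for illustration):

  import math
  from scipy.stats import norm

  N, M = 600, 55                          # illustrative counts only
  q_hat = M / N                           # sample proportion of forbidden faces
  z = norm.ppf(0.975)                     # two-sided 95% critical value
  half_width = z * math.sqrt(q_hat * (1 - q_hat) / N)
  q_lo, q_hi = q_hat - half_width, q_hat + half_width
  print(q_lo, q_hi)                       # approximate 95% CI for q
  print(1 - 3 * q_hi, 1 - 3 * q_lo)       # approximate 95% CI for p = 1 - 3q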

Perhaps the most readily available software for doing these sorts of
calculations is Excel.  If you were taking an introductory course in
statistics, there might very well be some "simple" software that only
does these "confidence interval" calculations, but Excel is more the
tool I would choose for the job.

If you like I can post the formulas for you, with a sample
calculation, as an answer.  The conclusions are typically of the form
"p is estimated to lie in an interval [a,b] with confidence level
95%".  The wording is intended to make it sound like one is 95% sure
that p is in the interval, but technically this is not what the
calculations say.  To understand the subject well one has to begin
with conditional probability and graduate to the more complex subject
of Bayesian inference.  There's some free search terms for you,
anyway.

regards, mathtalk-ga
Subject: Re: Statistics: Hypothesis Testing needed; rigged dice.
From: donphiltrodt-ga on 16 Apr 2003 19:38 PDT
 
>> If you like I can post the formulas for 
>> you, with a sample calculation, as an answer.

That'd be great.  Please do.  TIA.
Subject: Re: Statistics: Hypothesis Testing needed; rigged dice.
From: hfshaw-ga on 18 Apr 2003 13:17 PDT
 
Two words:  Chi-squared.

See:
http://www.stat.yale.edu/Courses/1997-98/101/chigf.htm
http://www.ulib.org/webRoot/Books/Numerical_Recipes/bookcpdf/c14-3.pdf
http://virtuallygenetics.net/SeriesI/Mendel/section_05.htm
http://www.anu.edu.au/nceph/surfstat/surfstat-home/4-2-4.html (scroll
down to "example 60")
http://fonsg3.let.uva.nl/Service/Statistics/X2Test.html 
http://www.uwsp.edu/psych/cw/statmanual/chisquare.html
and many more.  Most introductory statistics books will have sections
discussing the use of the chi-squared distribution in hypothesis
testing.

Your original question asked how many times you would have to roll
your rigged die to "ensure" that it was, in fact, not a fair die.  If,
by "ensure" you mean "the probability is equal to zero that the
observed run of results could not be generated by throws of a fair
die", then the answer to *that* question (as I think you must realize)
is "an infinite number of times".  For any finite number of throws,
there is always a nonzero probability that the observed run could be
due to a "streak" (as you put it) of a fair die.  The probability
might be very small, but it is nonzero.  Thus, in the real world, you
must pick some level of confidence at which you are willing to say,
"that's good enough",  I'm willing to live with the residual
uncertainty.”


Racecar-ga gave the method for calculating the probability that a run
of N throws of a fair die would result in a distribution in which two
of the faces never showed up (P = (2/3)^N).  This, however, is *not* the
same as asking what level of confidence one has in saying that the
observed run was generated by a fair die. Formally, you want to test
the hypothesis that a given set of observations (the results of N
throws of your potentially rigged die) was drawn from the uniform
distribution (with p_i=1/6 for each face) generated by a fair die. 
For this, you need to calculate chi-squared for your set of
observations and compare it to the value of the chi-squared
distribution with the appropriate number of degrees of freedom (in
this case equal to 5; see references given above for why this is so)
and at your chosen level of confidence (your comfort level of residual
uncertainty).


The formula for chi squared is simply the sum over all observations of
[(observed value - expected value)^2]/(expected value).  In this case,
the “observed value” is the number of times a given face comes up in
your sequence of N throws, and the expected value is the number of
times that face would be expected to come up if the die were fair (and
is simply equal to N/6).   The sum extends over the results for all
six faces.  The chi-squared statistic measures the "goodness of fit"
between a set of observations and a comparison distribution.  (If you
are familiar with least-squares fitting, this is essentially the
quantity that is being minimized in that procedure.)


There are numerous tables of the chi-squared statistic as a function
of significance level and degrees of freedom available.  The links
above include some references to such tables and to on-line
calculators.  If the value of chi-squared calculated for your set of
data is larger than the tabulated value, then you can reject the
hypothesis that the observed results were a result of throwing a fair
die at the chosen level of confidence (i.e., you can be X% sure that
the results were not from a fair die, where you get to pick the value
of X).


Note that this test works just as well if the rigging of the die is
not perfect; if your method only reduces the probability that a face
will come up to something less than 1/6, but does not reduce the
probability to zero, you can still use this test.  Obviously, the more
subtle the change in "fairness" of the die, the more times you will
have to test it in order to achieve the same level of confidence that
it is, in fact, not fair.


One caveat on the use of this test is that because of some
approximations made in the derivation of the test, the expected number
of observations in any single "bin" must be >5.  That means that for
the results of the test to be even approximately correct, you will
need to throw a die at least 30 times (5 * 6 possible results).


As an example, using the on-line calculator at
http://fonsg3.let.uva.nl/Service/Statistics/X2Test.html, and assuming
you threw a die 36 times, and four of the faces came up 9 times each,
but the other two faces never came up, you could be 98.6% sure that
the die is not fair.  If, on the other hand, your "fix" is not
perfect, and four of the faces come up 8 times each, while the other
two come up twice each, you could only be ~65% sure that the die is
not fair.
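
For anyone who prefers code to an online form, the same kind of calculation
takes a few lines of Python with scipy.  Note that scipy's chisquare applies
no continuity correction, so its p-value can differ somewhat from the
calculator linked above:

  from scipy.stats import chisquare

  observed = [9, 9, 9, 9, 0, 0]     # 36 throws: four faces 9 times each, two never
  expected = [36 / 6] * 6           # a fair die predicts N/6 = 6 per face
  stat, p_value = chisquare(observed, f_exp=expected)

  print(stat)                       # chi-squared statistic (18 on 5 degrees of freedom)
  print(p_value)                    # chance a fair die gives a fit this poor
  print(1 - p_value)                # "confidence" that the die is not fair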


As an aside, it would also be appropriate to use the chi-squared
distribution to test the hypotheses associated with the examples in
your original Google Answers question.
