Google Answers: Probability question

View Question

Q: Probability question ( No Answer, 18 Comments )

Question

Subject: Probability question
Category: Science > Math
Asked by: psychephile-ga
List Price: $25.00

Posted: 09 Jun 2006 07:18 PDT
Expires: 09 Jul 2006 07:18 PDT
Question ID: 736681

Suppose I have a bag with 25000 unique marbles, and I want to identify
them all by reaching in and grabbing 32 at a time.  If I don't replace
the marbles, obviously I only have to sample the bag 25000/32 times.  But if I
do replace the marbles, it's not certain that I'll ever grab each of
the 25000 unique marbles at least once.  My question is:  If I want to
be 95% certain that I've grabbed each of the 25000 marbles, how many
times do I have to reach in the bag to grab 32?

This seems like an easy question, but I suspect it might be tricky for
some reason that I'm too bone-headed to even realize.

Request for Question Clarification by pafalafa-ga on 09 Jun 2006 09:10 PDT

I think it's:

(24968/25000) to the nth power < .05


Solve for n, and that's your answer. 


If it were five unique marbles, 2 at a time, then it would be:

(3/5)^n < .05

n=5 tries


Whaddya think?


pafalafa-ga

Answer

There is no answer at this time.

Comments

Subject: Re: Probability question
From: elids-ga on 09 Jun 2006 08:25 PDT

Not likely that you are boneheaded, given you dilemma it is probable
that you are still a furheaded man.

Subject: Re: Probability question
From: bcattwood-ga on 09 Jun 2006 10:36 PDT

I have a suspicion the answer given by pafalafa is for the number of
withdrawals needed to have 95% confidence that one particular marble
will be chosen at least once and that a correction needs to be made
for it to apply to all 25000 marbles. I THINK that you need to divide
the 0.05 by 25000, but perhaps pafalafa would like to respond.

Subject: Re: Probability question
From: berkeleychocolate-ga on 09 Jun 2006 10:49 PDT

Let p= 1 - 32/25000. Then p^n = the probability that any specific
marble has not been chosen in n draws. So [1 - p^n]^25000 is the
probability that all marbles have been chosen in n draws. Equate this
to .95 to solve for n.

Subject: Re: Probability question
From: youreh-ga on 09 Jun 2006 11:11 PDT

On the clarification, I think it is a little off and only works for
the first step (but I think the idea with < 0.05 is the right idea).
Writing out some thoughts...

Let's say there are N marbles total in the bag and we expect to have
NOT picked out K of them already (K is a real number greater than zero
and less than or equal to N -- it need not be an integer since this is
an expectation). Then, if we pick out P marbles in each step, the
probability that any single marble we pick out is not one we've
already picked out is equal K/N. Further, if we pick out P marbles in
a step, we expect the number of "new" marbles (ones we hadn't picked
out before) to be P*K/N. So, our new expected number of marbles after
this step is K*=K-(P*K/N).

This is our recursion. Running this recursion in MatLab with N and K
both starting at 32000, it took 10245 steps to get K to 0.05. The
program commands were as follows (I used a large matrix for K and so
each row represents a new step):

N=25000
K=25000*ones(15000,1) /* a matrix larger than is needed, but we can
check whichever step we like then */
P=32

for n=1:14999
K(n+1,1) = K(n,1)-P*(K(n,1)/N);
end

Checking the entry K(10246,1), which is the number of marbles after
10245 steps, we find it is 0.05. So, if I remember right (this is the
place I'm a little fuzzy), since our expected number of remaining
marbles is <=0.05, we are 95% certain we have picked them all out. I
think using this assumption may make it an approximation, but I would
bet it is still a good approximation for the problem.

Subject: Re: Probability question
From: youreh-ga on 09 Jun 2006 11:20 PDT

Wow, I should have seen that (shorter) method. I definitely agree with
berkeleychocolate's answer. It should be 10226 after doing the
computations.

Subject: Re: Probability question
From: rracecarr-ga on 09 Jun 2006 12:05 PDT

Way to go berkeleychocolate...

Subject: Re: Probability question
From: pafalafa-ga on 09 Jun 2006 13:18 PDT

I love taking a crack at these types of questions.

I'm always wrong...


paf

Subject: Re: Probability question
From: ansel001-ga on 10 Jun 2006 02:07 PDT

I would like to answer pafalafa's simplified question and suggest that
the same methodology needs to be used to answer the question as posed.

Given a bag with five marbles, draw two at a time and replace them
after each draw of two.  How many draws are necessary to be 95% sure
that each marble has been drawn at least once.

Let's define a state of n as meaning that in all the draws that have
been made so far, n different marbles have been drawn at least once. 
First let's construct a matrix that shows the probability of
transitioning from one state to another in one draw.

State transition matrix

	To						
From	0	1	2	3	4	5	Total
0	0.00	0.00	1.00	0.00	0.00	0.00	1.00
1							
2	0.00	0.00	0.10	0.60	0.30	0.00	1.00
3	0.00	0.00	0.00	0.30	0.60	0.10	1.00
4	0.00	0.00	0.00	0.00	0.60	0.40	1.00
5	0.00	0.00	0.00	0.00	0.00	1.00	1.00

This leads to a distribution of expected results after each draw.

	Distribution of Expected Results					
Draw	0	1	2	3	4	5
0	1.0000	0.0000	0.0000	0.0000	0.0000	0.0000
1	0.0000	0.0000	1.0000	0.0000	0.0000	0.0000
2	0.0000	0.0000	0.1000	0.6000	0.3000	0.0000
3	0.0000	0.0000	0.0100	0.2400	0.5700	0.1800
4	0.0000	0.0000	0.0010	0.0780	0.4890	0.4320
5	0.0000	0.0000	0.0001	0.0240	0.3405	0.6354
6	0.0000	0.0000	0.0000	0.0073	0.2187	0.7740
7	0.0000	0.0000	0.0000	0.0022	0.1356	0.8622
8	0.0000	0.0000	0.0000	0.0007	0.0827	0.9167
9	0.0000	0.0000	0.0000	0.0002	0.0500	0.9498
10	0.0000	0.0000	0.0000	0.0001	0.0301	0.9698

Please note that a state of one is not possible.  On the first draw
you will go from a state of zero to two.  After the second draw there
is a 10% chance that you will achieve a state of two, 60% chance of a
state of three, and a 30% chance of a state of four.  This is the
starting point for the next draw.  For the third draw you will need to
calculate the probabilities of going between the following states:

2 --> 2
2 --> 3
2 --> 4
3 --> 3
3 --> 4
3 --> 5
4 --> 4
4 --> 5

And so on.  As you can see, it takes 10 draws to be 95% sure that all
the marbles have been drawn at least once.  Berekelychocolate's method
is close.  It estimates 8.975 draws.

I'm not sure how much error is built into berekelychocolate's method
for the question as stated.  To work it out exactly using the Bayesian
analysis that I have set forth would require considerable computing
power.

Subject: Re: Probability question
From: activealexaoki-ga on 10 Jun 2006 11:48 PDT

This is a good question. I remember failing to answer to myself in my
high school year.
My opinion is that this is part of combinatoric question using
Pigeonhole Principle. Since your language seems to imply that your
"95%" is not the "Confidence Interval," I may just apply the
principle. The only question is how to substitute pigeons and
pigeonholes.
My suggested solution is to let pigeonholes number of grab you make
(let it x) and let pigeons associated with 32. Combination without
repetition says (25000,32)(24968,32)... and thus
25000!/(32!)^(25000/32). Suppose the integer is N.
General Pigeonhole Principle suggests the probability of picking at
least one same marbles is 1-x!/{(x-N)!*x^N}
In other words, to be 95% sure that you will be able to grab every one
of 25000 marbles, you should solve x!/{(x-N)!*x^N}=.95 Apparently x
must be greater than N. I cannot guarrantee the correctness of my
solution but just as the possibility. Let me know if my comment helped
you.

Subject: Re: Probability question
From: rracecarr-ga on 12 Jun 2006 11:07 PDT

Yeah, it looks like ansel001 is correct that there's a problem with
berkeleychocolate's method.  Too bad, it was so elegant.

The problem is that the method treats the probability of each of the
25000 marbles having been drawn as an independent event.  But if you
know the first 24999 marbles have all been chosen, that 'uses up' some
of the draws, and decreases the probablility that the 25000th marble
has been chosen.  So the actual answer is probably somewhat larger
than 10226.

Subject: Re: Probability question
From: berkeleychocolate-ga on 12 Jun 2006 18:44 PDT

I agree with recent comments. The events "draw the ith marble" and
"draw the jth marble" are not quite independent. Trouble is that even
though ansel001 can do the work for 5 and 2 with a calculator or
computer it's not practical for bigger numbers. Here's some thoughts
someone might like to follow up on using his basic idea.

 Let M be the transition matrix (as discussed in ansel001) for N
marbles , selecting D at a time. Then M is upper triangular. (Its
entries are all 0 or probabilities from the hypergeometric
distribution: the probability of picking x good and n-x bad from a
pile of G good and B bad.) So its eigenvalues are easy to find. They
are along the diagonal. I'm pretty sure that M is diagonalizable.
Starting from the last eigenvector and going backwards, one can easily
write equations to solve for the  N+1 eigenvectors (0 through N). If P
is the matrix of eigenvectors, and L is the diagonal matrix of
diagonal entries in M, then M=PLQ, where Q is the inverse of P. So M^n
= P L^n Q. We want the last entry in the first row to be at least .95.
One can actually do this for N=5 and D=2 since  it just involves 6 by
6 matrices. So one could get a closed expression for this probability
at least for a relatively small N and D.

It doesn't seem practical for larger N and D, though.

Subject: Re: Probability question
From: rracecarr-ga on 14 Jun 2006 16:02 PDT

Here is a piece of Matlab code to get an approximate answer by brute force:

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
clear
N = 25000; % # of balls in bag
P = 32;    % # of balls chosen at a time
n = 1000;  % # of realizations
I = [];    % # of draws required for each realization
C = .95;   % certainty desired

for j=1:n
    s = zeros(N,1);  % number of times each ball has been drawn so far
    i = 0;           % draw number
    while 1
        i = i + 1;              % increment draw number
        p = ceil(rand(P,1)*N);  % pick P random balls
        s(p) = s(p) + 1;        % update times each ball drawn
        x = find(s==0,1);       % look for undrawn balls
        if isempty(x)
            break               % stop when there are none
        end
    end
    I = [I i];       % store # of draws required
end
I = sort(I);
I(round(C*n))
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

I'm running this version right now.  On my laptop it will take about
40 minutes.  I already ran it with n=100, and I know that 10226 is
pretty close...

(I know this is a clumsy method, and that my code is probably
inefficient.  Oh well.)

Subject: Re: Probability question
From: rracecarr-ga on 14 Jun 2006 16:57 PDT

10226 is correct within 1%.  Probably closer than that.

Subject: Re: Probability question
From: berkeleychocolate-ga on 14 Jun 2006 21:23 PDT

There's even more in this interesting question. Referring to my
comments about ansel001's transition matrix, I did the calculations
for N=5, D=2 (picking 2 from 5). Amazingly, P, the matrix of
eigenvectors, is an inverted Pascal's traingle and its inverse Q is
the same but with alternating signs. That is, the first column of P is
1 5 10 10 5 1. The second column is 0 1 4 6 4 1, etc. The first column
of Q is 1 -5 10 -10 5 -1. The second  column is 0 1 -4 6 -4 1, etc.
The eigenvalues are 0 0 .1 .3 .6 1. If L is the matrix whose diagonal
entries are these eigenvalues and all other entries are 0, then we
want the (1,6) entry of P(L^n)Q to be at least .95. Now the (1,6)
entry of P(L^n)Q is the sum over i of p(1,i)r(i)q(i,6), where p(i,j)
is the (i,j)th entry of P, r(i) is the ith eigenvalue raised to the
nth power and q(i,j) is the (i,j)th entry of Q. So the equation
becomes

-10*(.1^n) + 10*(.3^n) - 5*(.6^n) +1 >= .95. 

Just as ansel001 calculated, n must be 10.

Let C(a,b) be the number of combinations of a objects taken b at a time.
Generalizing to N marbles drawing D at a time, the formula becomes

-C(N,D)*[C(D,D)/C(N,D)]^n + C(N,D+1)*[C(D+1,D)/C(N,D)]^n -
C(N,D+2)*[C(D+2,D)/C(N,D)]^n + ... + C(N,N)*[C(N,D)/C(N,D)]^n >= .95.

Now the last term on the left is 1. Subtracting it, the remaining
terms must sum to at least -.05. The most important term is now the
rightmost term on the left since the bigger K is, the bigger C(K,D) is
and when n is large, this term dominates. The next most important term
is the one just before that, etc. Using just the two terms we get

C(N,N-1)*[C(N-1,D)/C(N,D)]^n = .05. 

This simplifies to N*[(N-D)/N]^n = .05, which is a first order
approximation to my original [1 - p^n]^N = .95. Take several terms for
greater accuracy. For example, one more term gives

-C(N,N-2)* [C(N-2,D)/C(N,D)]^n + C(N,N-1)*[C(N-1,D)/C(N,D)]^n = .05. 

This simplifies to -N(N-1)/2 * [(N-D)(N-D-1)/(N(N-1))]^n + N*[(N-D)/N]^n = .05.

Subject: Re: Probability question
From: berkeleychocolate-ga on 15 Jun 2006 12:59 PDT

Referring to my latest comments, when n is 10226 the terms are 1 -
.05125 + .00131 - .000022 + ... . Since this is an alternating
decreasing series, the error is less than the last term. Since 1 -
.05125 + .00131 + an error less than .000022 is still greater than
.95, n = 10226 works. A similar analysis for 10225 produces a number
less than .95. So n = 10226.

Subject: Re: Probability question
From: infimum-ga on 15 Jun 2006 15:45 PDT

With regard to the alternating sums appearing in berkeleychocolate's
previous comment, they have a natural interpretation in terms of the
inclusion-exclusion principle. It is not hard to verify that the
probability that a fixed subset of i marbles remains unpicked after m
draws is given by the formula

C(n,i)*[(n-k)_i / (n)_i ]^m,

where (n)_i denotes the falling factorial n(n-1)(n-2)...(n-i+1).
Inclusion-exclusion says that the probability that there exists an
unpicked marble after m draws is given by an alternating sum of terms
of the above form, where i ranges from 1 to n-k in the summation. In
general, the terms in these alternating sums need not be decreasing in
magnitude, but Bonferroni inequalities guarantee that taking an odd
number of terms gives an upper bound, and taking an even number of
terms gives a lower bound. In this case, with n = 10226, the sum after
three terms is 0.0499602, and thus the probability that an unpicked
marble remains is less than 5%.

By taking complements of everything, we recover berkeleychocolate's sums.

Subject: Re: Probability question
From: infimum-ga on 15 Jun 2006 16:33 PDT

Whoops! I wrote a bit too hastily. My previous post should have said,

> the probability that a fixed subset of i marbles remains unpicked
after m draws > is given by the formula [(n-k)_i / (n)_i ]^m,

and also,

> Inclusion-exclusion says that the probability that there exists an
> unpicked marble after m draws is given by a sum of terms
> of the form (-1)^(i+1) [ (n-k)_i / (n)_i ]^m, where the sum is taken over all 
> i-element subsets as i ranges from 1 to n-k.

In short,

Pr(exists unpicked marble) = sum_{i=1}^{n-k} (-1)^(i+1) C(n,i)*[(n-k)_i/(n)_i]^m.

Subject: Re: Probability question
From: berkeleychocolate-ga on 15 Jun 2006 21:45 PDT

This sure was fun, especially the interaction.

Important Disclaimer: Answers and comments provided on Google Answers are general information, and are not intended to substitute for informed professional medical, psychiatric, psychological, tax, legal, investment, accounting, or other professional advice. Google does not endorse, and expressly disclaims liability for any product, manufacturer, distributor, service or service provider mentioned or any opinion expressed in answers or comments. Please read carefully the Google Answers Terms of Service.

If you feel that you have found inappropriate content, please let us know by emailing us at answers-support@google.com with the question ID listed above. Thank you.

Search Google Answers for

Google Home - Answers FAQ - Terms of Service - Privacy Policy