

Subject:
Probability question
Category: Science > Math Asked by: psychephilega List Price: $25.00 
Posted:
09 Jun 2006 07:18 PDT
Expires: 09 Jul 2006 07:18 PDT Question ID: 736681 
Suppose I have a bag with 25000 unique marbles, and I want to identify them all by reaching in and grabbing 32 at a time. If I don't replace the marbles, obviously I only have to sample the bag 25000/32 times. But if I do replace the marbles, it's not certain that I'll ever grab each of the 25000 unique marbles at least once. My question is: If I want to be 95% certain that I've grabbed each of the 25000 marbles, how many times do I have to reach in the bag to grab 32? This seems like an easy question, but I suspect it might be tricky for some reason that I'm too boneheaded to even realize.  


There is no answer at this time. 

Subject:
Re: Probability question
From: elidsga on 09 Jun 2006 08:25 PDT 
Not likely that you are boneheaded, given you dilemma it is probable that you are still a furheaded man. 
Subject:
Re: Probability question
From: bcattwoodga on 09 Jun 2006 10:36 PDT 
I have a suspicion the answer given by pafalafa is for the number of withdrawals needed to have 95% confidence that one particular marble will be chosen at least once and that a correction needs to be made for it to apply to all 25000 marbles. I THINK that you need to divide the 0.05 by 25000, but perhaps pafalafa would like to respond. 
Subject:
Re: Probability question
From: berkeleychocolatega on 09 Jun 2006 10:49 PDT 
Let p= 1  32/25000. Then p^n = the probability that any specific marble has not been chosen in n draws. So [1  p^n]^25000 is the probability that all marbles have been chosen in n draws. Equate this to .95 to solve for n. 
Subject:
Re: Probability question
From: yourehga on 09 Jun 2006 11:11 PDT 
On the clarification, I think it is a little off and only works for the first step (but I think the idea with < 0.05 is the right idea). Writing out some thoughts... Let's say there are N marbles total in the bag and we expect to have NOT picked out K of them already (K is a real number greater than zero and less than or equal to N  it need not be an integer since this is an expectation). Then, if we pick out P marbles in each step, the probability that any single marble we pick out is not one we've already picked out is equal K/N. Further, if we pick out P marbles in a step, we expect the number of "new" marbles (ones we hadn't picked out before) to be P*K/N. So, our new expected number of marbles after this step is K*=K(P*K/N). This is our recursion. Running this recursion in MatLab with N and K both starting at 32000, it took 10245 steps to get K to 0.05. The program commands were as follows (I used a large matrix for K and so each row represents a new step): N=25000 K=25000*ones(15000,1) /* a matrix larger than is needed, but we can check whichever step we like then */ P=32 for n=1:14999 K(n+1,1) = K(n,1)P*(K(n,1)/N); end Checking the entry K(10246,1), which is the number of marbles after 10245 steps, we find it is 0.05. So, if I remember right (this is the place I'm a little fuzzy), since our expected number of remaining marbles is <=0.05, we are 95% certain we have picked them all out. I think using this assumption may make it an approximation, but I would bet it is still a good approximation for the problem. 
Subject:
Re: Probability question
From: yourehga on 09 Jun 2006 11:20 PDT 
Wow, I should have seen that (shorter) method. I definitely agree with berkeleychocolate's answer. It should be 10226 after doing the computations. 
Subject:
Re: Probability question
From: rracecarrga on 09 Jun 2006 12:05 PDT 
Way to go berkeleychocolate... 
Subject:
Re: Probability question
From: pafalafaga on 09 Jun 2006 13:18 PDT 
I love taking a crack at these types of questions. I'm always wrong... paf 
Subject:
Re: Probability question
From: ansel001ga on 10 Jun 2006 02:07 PDT 
I would like to answer pafalafa's simplified question and suggest that the same methodology needs to be used to answer the question as posed. Given a bag with five marbles, draw two at a time and replace them after each draw of two. How many draws are necessary to be 95% sure that each marble has been drawn at least once. Let's define a state of n as meaning that in all the draws that have been made so far, n different marbles have been drawn at least once. First let's construct a matrix that shows the probability of transitioning from one state to another in one draw. State transition matrix To From 0 1 2 3 4 5 Total 0 0.00 0.00 1.00 0.00 0.00 0.00 1.00 1 2 0.00 0.00 0.10 0.60 0.30 0.00 1.00 3 0.00 0.00 0.00 0.30 0.60 0.10 1.00 4 0.00 0.00 0.00 0.00 0.60 0.40 1.00 5 0.00 0.00 0.00 0.00 0.00 1.00 1.00 This leads to a distribution of expected results after each draw. Distribution of Expected Results Draw 0 1 2 3 4 5 0 1.0000 0.0000 0.0000 0.0000 0.0000 0.0000 1 0.0000 0.0000 1.0000 0.0000 0.0000 0.0000 2 0.0000 0.0000 0.1000 0.6000 0.3000 0.0000 3 0.0000 0.0000 0.0100 0.2400 0.5700 0.1800 4 0.0000 0.0000 0.0010 0.0780 0.4890 0.4320 5 0.0000 0.0000 0.0001 0.0240 0.3405 0.6354 6 0.0000 0.0000 0.0000 0.0073 0.2187 0.7740 7 0.0000 0.0000 0.0000 0.0022 0.1356 0.8622 8 0.0000 0.0000 0.0000 0.0007 0.0827 0.9167 9 0.0000 0.0000 0.0000 0.0002 0.0500 0.9498 10 0.0000 0.0000 0.0000 0.0001 0.0301 0.9698 Please note that a state of one is not possible. On the first draw you will go from a state of zero to two. After the second draw there is a 10% chance that you will achieve a state of two, 60% chance of a state of three, and a 30% chance of a state of four. This is the starting point for the next draw. For the third draw you will need to calculate the probabilities of going between the following states: 2 > 2 2 > 3 2 > 4 3 > 3 3 > 4 3 > 5 4 > 4 4 > 5 And so on. As you can see, it takes 10 draws to be 95% sure that all the marbles have been drawn at least once. Berekelychocolate's method is close. It estimates 8.975 draws. I'm not sure how much error is built into berekelychocolate's method for the question as stated. To work it out exactly using the Bayesian analysis that I have set forth would require considerable computing power. 
Subject:
Re: Probability question
From: activealexaokiga on 10 Jun 2006 11:48 PDT 
This is a good question. I remember failing to answer to myself in my high school year. My opinion is that this is part of combinatoric question using Pigeonhole Principle. Since your language seems to imply that your "95%" is not the "Confidence Interval," I may just apply the principle. The only question is how to substitute pigeons and pigeonholes. My suggested solution is to let pigeonholes number of grab you make (let it x) and let pigeons associated with 32. Combination without repetition says (25000,32)(24968,32)... and thus 25000!/(32!)^(25000/32). Suppose the integer is N. General Pigeonhole Principle suggests the probability of picking at least one same marbles is 1x!/{(xN)!*x^N} In other words, to be 95% sure that you will be able to grab every one of 25000 marbles, you should solve x!/{(xN)!*x^N}=.95 Apparently x must be greater than N. I cannot guarrantee the correctness of my solution but just as the possibility. Let me know if my comment helped you. 
Subject:
Re: Probability question
From: rracecarrga on 12 Jun 2006 11:07 PDT 
Yeah, it looks like ansel001 is correct that there's a problem with berkeleychocolate's method. Too bad, it was so elegant. The problem is that the method treats the probability of each of the 25000 marbles having been drawn as an independent event. But if you know the first 24999 marbles have all been chosen, that 'uses up' some of the draws, and decreases the probablility that the 25000th marble has been chosen. So the actual answer is probably somewhat larger than 10226. 
Subject:
Re: Probability question
From: berkeleychocolatega on 12 Jun 2006 18:44 PDT 
I agree with recent comments. The events "draw the ith marble" and "draw the jth marble" are not quite independent. Trouble is that even though ansel001 can do the work for 5 and 2 with a calculator or computer it's not practical for bigger numbers. Here's some thoughts someone might like to follow up on using his basic idea. Let M be the transition matrix (as discussed in ansel001) for N marbles , selecting D at a time. Then M is upper triangular. (Its entries are all 0 or probabilities from the hypergeometric distribution: the probability of picking x good and nx bad from a pile of G good and B bad.) So its eigenvalues are easy to find. They are along the diagonal. I'm pretty sure that M is diagonalizable. Starting from the last eigenvector and going backwards, one can easily write equations to solve for the N+1 eigenvectors (0 through N). If P is the matrix of eigenvectors, and L is the diagonal matrix of diagonal entries in M, then M=PLQ, where Q is the inverse of P. So M^n = P L^n Q. We want the last entry in the first row to be at least .95. One can actually do this for N=5 and D=2 since it just involves 6 by 6 matrices. So one could get a closed expression for this probability at least for a relatively small N and D. It doesn't seem practical for larger N and D, though. 
Subject:
Re: Probability question
From: rracecarrga on 14 Jun 2006 16:02 PDT 
Here is a piece of Matlab code to get an approximate answer by brute force: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% clear N = 25000; % # of balls in bag P = 32; % # of balls chosen at a time n = 1000; % # of realizations I = []; % # of draws required for each realization C = .95; % certainty desired for j=1:n s = zeros(N,1); % number of times each ball has been drawn so far i = 0; % draw number while 1 i = i + 1; % increment draw number p = ceil(rand(P,1)*N); % pick P random balls s(p) = s(p) + 1; % update times each ball drawn x = find(s==0,1); % look for undrawn balls if isempty(x) break % stop when there are none end end I = [I i]; % store # of draws required end I = sort(I); I(round(C*n)) %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% I'm running this version right now. On my laptop it will take about 40 minutes. I already ran it with n=100, and I know that 10226 is pretty close... (I know this is a clumsy method, and that my code is probably inefficient. Oh well.) 
Subject:
Re: Probability question
From: rracecarrga on 14 Jun 2006 16:57 PDT 
10226 is correct within 1%. Probably closer than that. 
Subject:
Re: Probability question
From: berkeleychocolatega on 14 Jun 2006 21:23 PDT 
There's even more in this interesting question. Referring to my comments about ansel001's transition matrix, I did the calculations for N=5, D=2 (picking 2 from 5). Amazingly, P, the matrix of eigenvectors, is an inverted Pascal's traingle and its inverse Q is the same but with alternating signs. That is, the first column of P is 1 5 10 10 5 1. The second column is 0 1 4 6 4 1, etc. The first column of Q is 1 5 10 10 5 1. The second column is 0 1 4 6 4 1, etc. The eigenvalues are 0 0 .1 .3 .6 1. If L is the matrix whose diagonal entries are these eigenvalues and all other entries are 0, then we want the (1,6) entry of P(L^n)Q to be at least .95. Now the (1,6) entry of P(L^n)Q is the sum over i of p(1,i)r(i)q(i,6), where p(i,j) is the (i,j)th entry of P, r(i) is the ith eigenvalue raised to the nth power and q(i,j) is the (i,j)th entry of Q. So the equation becomes 10*(.1^n) + 10*(.3^n)  5*(.6^n) +1 >= .95. Just as ansel001 calculated, n must be 10. Let C(a,b) be the number of combinations of a objects taken b at a time. Generalizing to N marbles drawing D at a time, the formula becomes C(N,D)*[C(D,D)/C(N,D)]^n + C(N,D+1)*[C(D+1,D)/C(N,D)]^n  C(N,D+2)*[C(D+2,D)/C(N,D)]^n + ... + C(N,N)*[C(N,D)/C(N,D)]^n >= .95. Now the last term on the left is 1. Subtracting it, the remaining terms must sum to at least .05. The most important term is now the rightmost term on the left since the bigger K is, the bigger C(K,D) is and when n is large, this term dominates. The next most important term is the one just before that, etc. Using just the two terms we get C(N,N1)*[C(N1,D)/C(N,D)]^n = .05. This simplifies to N*[(ND)/N]^n = .05, which is a first order approximation to my original [1  p^n]^N = .95. Take several terms for greater accuracy. For example, one more term gives C(N,N2)* [C(N2,D)/C(N,D)]^n + C(N,N1)*[C(N1,D)/C(N,D)]^n = .05. This simplifies to N(N1)/2 * [(ND)(ND1)/(N(N1))]^n + N*[(ND)/N]^n = .05. 
Subject:
Re: Probability question
From: berkeleychocolatega on 15 Jun 2006 12:59 PDT 
Referring to my latest comments, when n is 10226 the terms are 1  .05125 + .00131  .000022 + ... . Since this is an alternating decreasing series, the error is less than the last term. Since 1  .05125 + .00131 + an error less than .000022 is still greater than .95, n = 10226 works. A similar analysis for 10225 produces a number less than .95. So n = 10226. 
Subject:
Re: Probability question
From: infimumga on 15 Jun 2006 15:45 PDT 
With regard to the alternating sums appearing in berkeleychocolate's previous comment, they have a natural interpretation in terms of the inclusionexclusion principle. It is not hard to verify that the probability that a fixed subset of i marbles remains unpicked after m draws is given by the formula C(n,i)*[(nk)_i / (n)_i ]^m, where (n)_i denotes the falling factorial n(n1)(n2)...(ni+1). Inclusionexclusion says that the probability that there exists an unpicked marble after m draws is given by an alternating sum of terms of the above form, where i ranges from 1 to nk in the summation. In general, the terms in these alternating sums need not be decreasing in magnitude, but Bonferroni inequalities guarantee that taking an odd number of terms gives an upper bound, and taking an even number of terms gives a lower bound. In this case, with n = 10226, the sum after three terms is 0.0499602, and thus the probability that an unpicked marble remains is less than 5%. By taking complements of everything, we recover berkeleychocolate's sums. 
Subject:
Re: Probability question
From: infimumga on 15 Jun 2006 16:33 PDT 
Whoops! I wrote a bit too hastily. My previous post should have said, > the probability that a fixed subset of i marbles remains unpicked after m draws > is given by the formula [(nk)_i / (n)_i ]^m, and also, > Inclusionexclusion says that the probability that there exists an > unpicked marble after m draws is given by a sum of terms > of the form (1)^(i+1) [ (nk)_i / (n)_i ]^m, where the sum is taken over all > ielement subsets as i ranges from 1 to nk. In short, Pr(exists unpicked marble) = sum_{i=1}^{nk} (1)^(i+1) C(n,i)*[(nk)_i/(n)_i]^m. 
Subject:
Re: Probability question
From: berkeleychocolatega on 15 Jun 2006 21:45 PDT 
This sure was fun, especially the interaction. 
If you feel that you have found inappropriate content, please let us know by emailing us at answerssupport@google.com with the question ID listed above. Thank you. 
Search Google Answers for 
Google Home  Answers FAQ  Terms of Service  Privacy Policy 