Subject: A problem on probabilities
Category: Science > Math
Asked by: octava-ga
List Price: $20.00
Posted: 25 Jun 2003 23:12 PDT
Expires: 25 Jul 2003 23:12 PDT
Question ID: 221855
This is a problem on probabilities. The conditions are:

1) There is a set A with n elements. Each element has a different probability of being chosen: (p1, p2, ..., pn).
2) In trial I, set B is built by choosing, without replacement, m elements from set A.
3) After returning set A to its original state, in trial II, set C is built by choosing, without replacement, k elements from set A.

Question 1: What is the best guess for the number of elements shared between sets B and C?

The payment I offer is for question 1, but I am also interested in, and will reward answers to, these other questions:

Question 2: If we know the average of the probabilities of the elements that actually went into set B, can we improve on our guess?

Question 3: If we also know the average of the probabilities of the elements that actually went into set C, can we further improve on our guess?

Question 4: Would the problem be simplified if we knew that the probabilities associated with the elements in set A (p1, p2, ..., pn) follow a binomial distribution (or some other well-studied one)?

Note: This problem arose in a project in sequence analysis (bioinformatics), which I would be willing to explain if that input could be helpful!
Subject: Re: A problem on probabilities
Answered By: mathtalk-ga on 02 Jul 2003 10:56 PDT
Hi, octava-ga:

Thanks for posting your very interesting question. I will refer to (without repeating) our earlier discussion in the Requests for Clarification (above) and Comments (below). First let me address the three items you asked about in your last Comment.

1) I'm glad you are undertaking this computational exercise. A great deal of the discussion below may be read as advice on how to implement such calculations. Without knowing what programming language is convenient for you, I've tried to couch it in mainly algorithmic terms.

2) In order to share images and other non-text material, you (or I, in the converse) would need to create a Web page to display the images and/or a link from which the material can be downloaded. In terms of sending you material, I already have a way to do this. Note that the Google Answers Terms of Service forbid the exchange of email addresses, and that it is preferred to have as much content as possible presented within the Google Answers site (although this is currently limited to the sort of text-based discussion we are having).

3) "Amateur" and even "dilettante" are not bad words in mathematics and statistics, reflecting merely an unpaid or part-time status of one's investigations. Of probability books there are many, and for a true beginner I could recommend these inexpensive volumes:

  Schaum's Outline of Probability and Statistics
  Introduction to Probability by John E. Freund (Dover Books on Adv. Math.)

I suspect that you are being overly modest about your background, so these books may easily be too elementary for your needs. Computational probability has undergone significant advances in recent years, thanks in some measure to an influx of hard problems from diverse application areas like quantum physics and genetic mapping. There is a pronounced overlap with classic topics in operations research, such as queuing theory and Markov chains.

NOTATION
========

When one wants to define the expected value of an expression, a probability must be assigned to each possible value. The expected value of a function f of a single random variable X is related to the possible values of X as follows:

  E(f(X)) = SUM f(x) Pr(X=x)

where the sum is taken over all possible values x of X. [The summation must be replaced by an integral in the case of a continuous probability distribution, but your question deals with a discrete probability distribution.]

By convention we should use capital letters for the random variables (to whose outcome values probabilities are attached) and small letters for ordinary ones. As we have not followed this convention very carefully to this point, I will do no more than call attention to it. However, the clarity it provides is, I think, worth the effort, should you wish to publish your investigations. (For more about publication, please see the concluding paragraphs.)

EXPECTATION
===========

Here the basic random variables are the subsets B, C of items (monomers, trimers, etc.) drawn from the universe A of all n items. The quantity of interest in your original question is the expected value of the cardinality of B intersect C, which I will denote in this manner for compactness:

  E(|B&C|)

The use of "absolute value" signs for cardinality is fairly standard, but I confess that the ampersand for intersection is merely the best of several poor choices available to us in this "text only" medium. We can therefore write:

  E(|B&C|) = SUM |b&c| Pr(B=b) Pr(C=c)

where the summation is to be taken over all possible values b for B and c for C.
Implicit in the above is an assumption that the selections of B and C are "independent" in a probabilistic sense. This means formally that the probability that B = b AND C = c equals the product of their individual probabilities:

  Pr(B=b AND C=c) = Pr(B=b) * Pr(C=c)

Note that we already made use of this assumption above when writing the expected value of |B&C| as a summation.

A fuller phrasing of your original question involves specifying the sizes of B and C. For the sake of convenience we will omit an explicit notation for this until necessary, but we acknowledge that the underlying process of selection gives:

  |A| = n    |B| = m    |C| = k

In the underlying assumptions about the distribution of B (resp. C) among subsets of a fixed size, the items that may be drawn are associated with individual (not necessarily equal) probabilities. The "common sense" interpretation of this was discussed above, in the Requests for Clarification. It remains only to work out the algorithmic and practical implications of that, for once these probabilities are known, the computation above is "merely" a matter of plugging in those values. Since the roles of B and C are "symmetric" except for size, it suffices to go into detail about calculating Pr(B=b).

COMPUTATION
===========

How many subsets B of A have size m? Counting the possible subsets is of course a well-known "combinations" function:

  C(n,m) = n!/(m!(n-m)!)

Explicitly listing all the possibilities is perhaps most easily done in a "very high level" language such as Prolog. However an implementation in C or C++ is not difficult. For fixed values of n and m one can use nested loops; code that treats n and m as variable numbers of items can be written recursively. I'm sure you may have your own ideas on this subject, so I'll simply mention one minor aspect that occurs to me, namely to work from an ordering of the items in A which sorts them by descending probability. Although not essential, this ordering (from prior experience) allows the best control over numerical accuracy during implementation.

Once an ordering of items in A is decided, subsets of A may be identified with bit patterns of length |A|. Using 0-based indexing, one can identify the inclusion or exclusion of an item from subset B by the value (1 or 0, resp.) of the i'th bit in the pattern. So one approach is to label all the integers from 0 (the empty set) to 2^|A| - 1 (the entire set) with the number of 1's contained in the binary expansion, and select only those integers having the desired count m.

We now turn to computing the probability Pr(B=b) for specific values of m. The case m=0 is sensible but trivial, as only one empty set is possible. The smallest interesting case is m=1. The probabilities of a single draw are simply those values given to us as "frequencies" for the various items. [Note: See my first Comment for the resulting computation of the expected number of matches with m=1, k=1.]

For m > 1 the naive expression for Pr(B=b) involves a summation over all possible orderings (permutations) of the m elements of B. Thus the complexity of the expression grows like m!. When m=2,3 there's nothing wrong with the naive approach, and it may be helpful to consider the cases m=2 and m=3 more explicitly.

Suppose that b consists of two elements whose "raw" frequencies are p and p'. Then:

  Pr(B=b) = p*p'/(1-p) + p'*p/(1-p')
          = (p/(1-p))*(p'/(1-p'))*(2-p-p')

where the two terms correspond to the two possible orderings of b.

The case m=3 is similar but involves six terms, for the 3! possible orderings. Let p,p',p" denote the frequencies of the three elements of b. Then:

  Pr(B=b) = (p*p'*p") [ 1/((1-p)(1-p-p'))   + 1/((1-p')(1-p-p'))
                      + 1/((1-p')(1-p'-p")) + 1/((1-p")(1-p'-p"))
                      + 1/((1-p)(1-p-p"))   + 1/((1-p")(1-p-p")) ]

Although this expression can be rewritten in alternate forms, this one shows the relation to the six possible orders in which the three given elements can be chosen.

For larger values of m we seek ways to economize on computation, e.g. using a "divide and conquer" approach to recursively partition b into smaller disjoint subsets. Let me sketch how this might work for m=10, n=20. Any subset b of size 10 can be ordered in 10! = 3,628,800 ways. Rather than enumerating all of these, we can instead consider the C(10,5) = 252 ways to partition ten items into a first subset b' and a second subset b", each of size 5. That is to say, suppose we first draw five items B' without replacement from the urn containing A, and then draw another five items B" without replacement from the urn containing A\b' (the complement of b' in A). What is the chance that:

  B = B' U B"

will turn out to equal b? One finds that perfect reuse can be made of the values:

  Pr(B'=b') = Pr(B=b')

if these were previously computed for m=5, since the urn contains A in both cases. However the probabilities of extending a subset b' to b (by later selection of b"=b\b') require a recomputation, similar to the values Pr(B=b') but involving "dilated" underlying frequencies, because the five elements B" are being drawn from the "censored" population A\b' of just 15 items. Then:

  Pr(B=b) = SUM Pr(B'=b') * Pr(B"=b\b')

where the summation is over all five-element subsets b' of b, and where the random variable B" is based on drawing without replacement from A' = A\b' (rather than from A).

It is not essential that the partitioning of b be into two equal subsets. Indeed, if the probabilities for all cardinalities m (or all up to some threshold) need to be computed, it is attractive to partition b into one subset with only one element omitted and its corresponding (singleton set) complement. Then:

  Pr(B=b) = SUM Pr(B'=b\{x}) * p(x)/(1 - SUM p(y))

where the outer summation is over the m elements x belonging to b, and where the inner sum (in the denominator) is over the other m-1 elements y belonging to b\{x}. Here p(x), p(y) denote the original frequencies of elements x, y respectively. In this manner one can systematically build up from the cases m=1,2,3 all the higher values of m in turn.

SUGGESTIONS
===========

As n becomes much larger than 20 the exhaustive calculations outlined above will not be feasible for all m,k. However some approximation techniques are available. A simple Monte Carlo approach would be to carry out a million drawings of sizes m,k from the urn and to tabulate the frequency of matches. While easy to program, such a procedure will probably only give modestly accurate results. Convergence to the mean in Monte Carlo algorithms is typically inversely proportional to the square root of the sample size. However the Monte Carlo approach does allow for "parallel" accumulation of information about cases "below" m,k, i.e. by counting the matches for each m' <= m and k' <= k as the sampling proceeds.

I was originally thinking of creating a hybrid Monte Carlo approach, esp. for the case when B=b is given and we want to know about the distribution of matches with random C, given only that |C|=k.
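To make the foregoing concrete before turning to the hybrid idea, here is a minimal Python sketch (one possible implementation only; the frequencies are made up and n is kept deliberately tiny) of the naive ordering-sum for Pr(B=b) and the resulting double sum for E(|B&C|):

```python
# Illustrative sketch: exact E(|B & C|) for a small toy universe by brute
# force.  The frequencies below are hypothetical; n is tiny so that the m!
# ordering sum and the double sum over subsets remain cheap.
from itertools import combinations, permutations

p = {"a": 0.40, "b": 0.25, "c": 0.20, "d": 0.15}   # made-up frequencies, sum to 1

def prob_subset(subset, freq):
    """Pr(B = subset) for a draw without replacement, summing over all
    m! possible orders of drawing (the naive approach)."""
    total = 0.0
    for order in permutations(subset):
        prob, remaining = 1.0, 1.0
        for item in order:
            prob *= freq[item] / remaining
            remaining -= freq[item]
        total += prob
    return total

def expected_overlap(m, k, freq):
    """E(|B & C|) = SUM |b & c| Pr(B=b) Pr(C=c) over all subsets b, c."""
    items = list(freq)
    total = 0.0
    for b in combinations(items, m):
        pb = prob_subset(b, freq)
        for c in combinations(items, k):
            total += len(set(b) & set(c)) * pb * prob_subset(c, freq)
    return total

print(expected_overlap(2, 2, p))   # equiprobable benchmark: mk/n = 1.0
```

The equiprobable value mk/n serves as a sanity check; with the unequal frequencies in this toy example the exact result departs from it slightly.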
By hybrid Monte Carlo methods are usually meant approaches that combine simulation with explicit computation of probabilities. For example, one might randomly select B, then explicitly calculate the probabilities of matches (with C) for various k.

It is worthwhile to recall from time to time that we are working with a rough approximation, the "random model", and that perfect accuracy in the model is therefore not of exceptional value. However, understanding the characteristics of the model is very worthwhile in knowing what limitations its application may have.

I'm struck by the thought that the expected value of matches may be approximately modelled by a simple second order partial differential equation. My grounds for this optimism are limited, but it very often happens that for a stochastic process governed by a first order differential equation, the non-stochastic quantity obtained by passing to the expected value is governed by a second order differential equation. Also, we can say of the expected value of matches in the equiprobable case, mk/n, that it is exactly the "scaled" solution to the Laplace equation on [0,1] x [0,1], where we use the variables:

  x = m/n
  y = k/n

and take Dirichlet boundary conditions:

  f(0,y) = f(x,0) = 0
  f(x,1) = x
  f(1,y) = y

whose solution is f(x,y) = xy.

Naturally an interesting problem such as you propose sparks many more ideas than can conveniently be summarized in one post. Let me close by sharing some information which I solicited from the Google Answers Editors, about a hypothetical case in which a publication might be based on a Google Answers thread. I realize that publication in the field of bioinformatics may be of some importance to you, and so I wanted to find out the ground rules. As you may have noticed, Google asserts a copyright at the bottom of each Web page on this site. This copyright is acknowledged within the Terms of Service for Customers and Researchers (who are independent contractors). The Editors replied to my question about publication with the following:

  Hello mathtalk-ga,

  Thank you for your email. If the work that is done on Google Answers is
  published in a journal, then firstly the customer would have to take
  permission from Google prior to publishing it. Also, they would have to
  credit Google Answers as well as the Researcher for that work.

  Thank you for bringing up this interesting point!

  The Google Answers Team
octava-ga rated this answer and gave an additional tip of $30.00:

When I posted my question, I had no idea how difficult it was. Mathtalk didn't just give the correct answer, but also provided patience, comprehension, enthusiasm, and effort far beyond what I expected. Thanks to that generosity I learned a lot from the answer and comments. Many thanks! Really!

Octava
Subject: Re: A problem on probabilities
From: mathtalk-ga on 26 Jun 2003 06:54 PDT

The special case m = k = 1 allows only for zero or one shared element between sets B and C of the two trials. We could allow a second set of probabilities q_i for the second trial, as suggested above. In that case the probability of one shared element is:

  d = SUM p_i q_i   [for i = 1, ..., n]

and since otherwise the number of shared elements is zero, this expression is also the expected value or mean of the number of shared elements. Then if d > 1/2, the mode is 1, and if d < 1/2, the mode is 0.

To illustrate, suppose the p's and q's are the same, and that we have three elements whose probabilities of being chosen are 1/2, 1/3, and 1/6 respectively:

  d = (1/2)^2 + (1/3)^2 + (1/6)^2 = 7/18

So here the expected number of shared elements is 7/18, but the mode is zero (since 11 times out of 18 there will be no shared elements). A median value may not be useful in connection with a discrete distribution, but one can "interpolate" between discrete values to define one. In this simplified case the median would probably be identified with the mean, 7/18.

regards, mathtalk-ga
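A tiny Python check of the computation in the comment above (purely illustrative; the probabilities are those of the example):

```python
# Expected number of shared elements for m = k = 1 is d = SUM p_i * q_i.
from fractions import Fraction

p = [Fraction(1, 2), Fraction(1, 3), Fraction(1, 6)]
q = p  # same probabilities in both trials in this example

d = sum(pi * qi for pi, qi in zip(p, q))
print(d)                    # 7/18, the expected number of shared elements
print(d > Fraction(1, 2))   # False, so the mode is 0
```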
Subject: Re: A problem on probabilities
From: mathtalk-ga on 26 Jun 2003 09:53 PDT

Hi, Kalle: Thanks for the fascinating background information. My father is an organic chemist turned physician, and I've "inherited" some of his enthusiasm for biochemical investigations, esp. those with a nice math problem attached.

regards, mathtalk-ga
Subject: Re: A problem on probabilities
From: octava-ga on 26 Jun 2003 12:14 PDT

Hi, mathtalk-ga! Who is Kalle? Are you still working on the question?
Subject: Re: A problem on probabilities
From: mathtalk-ga on 26 Jun 2003 12:58 PDT

Oops, octava-ga... sorry about the name! But yes, I'm working on the problem.

-- mathtalk
Subject: Re: A problem on probabilities
From: octava-ga on 26 Jun 2003 17:28 PDT

Thanks! I look forward to your answer.
Subject: Re: A problem on probabilities
From: mathtalk-ga on 28 Jun 2003 21:07 PDT

Hi, octava:

As a preliminary matter let's discuss the "equiprobable" case. Here if we drew m items without replacement, out of a possible n, in the first trial, followed by drawing k items out of n (again without replacement) in the second trial, then on average the number of shared items (matches) will be:

  E(s(m,k)) = mk/n   for all 0 <= m,k <= n

In other words the expected number of matches is a bilinear function of m and k, with the boundary conditions:

  s(0,k) = s(m,0) = 0
  s(m,n) = m
  s(n,k) = k

It might be of interest to see the effects of unequal probabilities in a case in which n is small enough that the exact computations can be done. For the sake of such a demonstration I would suggest treating the frequency of shared "monomers" (amino acids), where of course n = 20. See here for a tabulation of amino acid frequencies in vertebrates:

[Amino Acid Frequency]
http://www.tiem.utk.edu/~harrell/webmodules/aminoacid.htm

  Amino Acid       Observed Frequency in Vertebrates
  Alanine          7.4 %
  Arginine         4.2 %
  Asparagine       4.4 %
  Aspartic Acid    5.9 %
  Cysteine         3.3 %
  Glutamic Acid    5.8 %
  Glutamine        3.7 %
  Glycine          7.4 %
  Histidine        2.9 %
  Isoleucine       3.8 %
  Leucine          7.6 %
  Lysine           7.2 %
  Methionine       1.8 %
  Phenylalanine    4.0 %
  Proline          5.0 %
  Serine           8.1 %
  Threonine        6.2 %
  Tryptophan       1.3 %
  Tyrosine         3.3 %
  Valine           6.8 %

With n = 20 the largest group of subsets of fixed size is that of cardinality 10, and there are 184,756 such subsets. Computing the probability Pr(B) of any particular subset B of this size being drawn is a difficult but not impossible task (for a computer program). If we were to do it for all such subsets, then the expected number of shared entries between two such subsets would be:

  SUM |B intersect C| Pr(B) Pr(C)

where the summation is over all pairs B, C of subsets of size 10, and |B intersect C| denotes the cardinality of their intersection (the number of shared values).

Note that even in the unequal-probabilities case, the boundary conditions for s(m,k) remain just as before. This leads to some optimism that an exact computation in the n=20 case may give some insight into effective methods of approximation for large n. In particular it might be possible to formulate a continuous model (such as a differential equation) whose solution is the limiting case as n tends to infinity. In this way one might be able to give a nice treatment of n=8000 with a continuous surface.

regards, mathtalk-ga
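As one rough way to experiment with these frequencies, here is an illustrative Monte Carlo sketch in Python (the drawing scheme follows the successive weighted draws discussed earlier; the trial count is an arbitrary choice, and only an estimate, not the exact expectation, is produced):

```python
# Estimate the expected number of shared amino acids between two independent
# draws (without replacement) of sizes m and k, using the n = 20 vertebrate
# frequencies tabulated above.
import random

freq = {
    "Ala": 7.4, "Arg": 4.2, "Asn": 4.4, "Asp": 5.9, "Cys": 3.3,
    "Glu": 5.8, "Gln": 3.7, "Gly": 7.4, "His": 2.9, "Ile": 3.8,
    "Leu": 7.6, "Lys": 7.2, "Met": 1.8, "Phe": 4.0, "Pro": 5.0,
    "Ser": 8.1, "Thr": 6.2, "Trp": 1.3, "Tyr": 3.3, "Val": 6.8,
}   # observed frequencies (percent); weights need not sum exactly to 100

def draw_without_replacement(size):
    """Draw `size` distinct items, each successive draw weighted by the
    frequencies of the items still remaining."""
    remaining = dict(freq)
    chosen = set()
    for _ in range(size):
        items, weights = zip(*remaining.items())
        pick = random.choices(items, weights=weights, k=1)[0]
        chosen.add(pick)
        del remaining[pick]
    return chosen

def estimate_matches(m, k, trials=100_000):
    total = sum(len(draw_without_replacement(m) & draw_without_replacement(k))
                for _ in range(trials))
    return total / trials

print(estimate_matches(10, 10))   # equiprobable benchmark: mk/n = 5.0
```

For m = k = 10 the equiprobable benchmark is mk/n = 5; the departure of the estimate from 5 reflects the unequal frequencies.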
Subject: Re: A problem on probabilities
From: octava-ga on 29 Jun 2003 13:38 PDT

Hi, mathtalk-ga!

First of all, let me tell you that your input has been of great help to me, and that I very much appreciate your effort. If you wish, I would be glad to consider the question answered.

Second, as you say, computing the probability Pr(B) for a set B is difficult. However, given an actual protein B, I can easily compute the average frequency of its trimers. I will call it avgFr(B). Now, let us call len(B) the number of trimers in protein B. Given a query protein Q, and any other protein B, I have found empirically that a good approximation to the number of shared trimers is:

  sqrt( avgFr(Q) * avgFr(B) ) * len(Q) * len(B)

Interestingly, if I disregard the information about avgFr(B) I obtain a marginally better approximation with the formula:

  avgFr(Q) * len(Q) * len(B)

I believe this is due to the fact that if avgFr(Q) is large and avgFr(B) is small, or vice versa, this implies that the two proteins are made of very different trimers, so the overlap of the two sets is smaller than would be predicted from the "weight" of those sets.

Besides, although this works for most proteins, it is a linear model, and I know that it fails when len(B) is near 8000, since at len(B)=8000 the correct number of shared trimers should be equal to len(Q), and this would only happen if avgFr(Q) * len(B) = avgFr(Q) * 8000 = 1. That is, only if avgFr(Q) = 1/8000 = 0.000125, and that is usually false (indeed avgFr(Q) is always between 0.00014 and 0.0004).

To solve this, and without any formal justification, I have tried to predict the number of shared trimers Y(X) as a function of X (where X stands for len(B)), with the following curve:

  Y = X/(a + bX)

because I can easily choose parameters a and b in such a way that for len(B)=0 the slope dY/dX = avgFr(Q) * len(Q), while at len(B)=8000 the number of shared trimers Y = len(Q), which I know must be true:

  a = ((1/avgFr(Q)) - 1)/(1 - (1/8000))
  b = (1/avgFr(Q)) - a

This approximation happens to work better than either of the two I mention above.

However, as you very clearly stated in the comment you posted on 27 Jun 2003 07:27, my problem does not end there: I need to find the number of shared trimers for which the probability is below a threshold that would allow me to say that the proteins are homologous. Inspection of my data tells me that this depends on len(B), because the database is rich in proteins of certain sizes. That is, the distribution of len(B), for all B in my database, has maximal density around len(B)=400. This makes my cloud of points wider in that zone (the dispersion is bigger there). Another factor that has a large impact on the dispersion is the number of trimers in the query, len(Q). Smaller query proteins have overall larger dispersions, because having fewer trimers they cause a sample-size effect. Perhaps I should post another question to seek help in this unsolved part of my problem.

I would very much appreciate your comments! Thanks,
Octava
Subject: Re: A problem on probabilities
From: octava-ga on 29 Jun 2003 13:46 PDT

mathtalk, by the way:

1) As an exercise, I am coding the exact solution you suggested for amino acid monomers.
2) How can I show you images?
3) Any good basic book on probabilities? (You can see I am at best an amateur!)

Regards!
Octava
Subject: Re: A problem on probabilities
From: octava-ga on 01 Jul 2003 21:52 PDT

Hello, mathtalk-ga: You have been quiet for some time now. Any reason?

Greetings!
Octava
Subject: Re: A problem on probabilities
From: mathtalk-ga on 02 Jul 2003 04:55 PDT

Umm... composing a long-winded thesis for you on computational probability? I'll post later today, just want to tweak my exposition a bit.

regards, mathtalk
Subject: Re: A problem on probabilities
From: octava-ga on 02 Jul 2003 21:54 PDT

Hi, mathtalk! Thanks for all your help! You are right in assuming that this question was related to a future publication. Although the solutions you gave me are impeccable (and I enjoyed going through them), I doubt that they would be practical for my purposes in database searching. Instead, I may use a much less elegant solution I have implemented, based on adjusting ad hoc curves to the empirical data. However, you have helped me in a way you might not suspect: being handicapped in math, I am always afraid that it will show if I do something in a contrived way when there was a much simpler way. Now I know there is no simple solution to my problem, so I am confident that I won't appear a fool for providing my own practical solution. This is a great relief!

Thanks again. I've learned a lot from your explanations. I hope you enjoyed the biology of the problem, and that we will have other chances for interaction in the future.

Octava
Subject: Re: A problem on probabilities
From: octava-ga on 02 Jul 2003 22:09 PDT

P.S.: I forgot to confess that I got lost in the part where you propose an approximation based on second order differential equations: "Laplace equation"? "Dirichlet boundary"? ... I need to study a little more.
Subject: Re: A problem on probabilities
From: mathtalk-ga on 03 Jul 2003 08:49 PDT

Let me expand a bit on the differential equation idea. In the case of equiprobable items "in an urn" we have the expected number of matches:

  s(m,k) = mk/n

In order to pass to a "limit" as n goes to infinity (thereby hopefully obtaining a continuous function of simple form), we need to scale both the "inputs" (arguments) and the "output" to make sense independently of n. In other words, let's replace m and k as parameters by the relative fractions:

  x = m/n
  y = k/n

and at the same time scale the output so that the value of the function is expressed, not as the expected number of matches, but as the expected fraction of matching items (out of n):

  f(x,y) = s(m,k)/n

Now as n tends to infinity, the values of x,y behave well. The domain of f(x,y) is [0,1] x [0,1]. Also the value of f behaves nicely, with a range of [0,1] rather than "expanding" with n. In fact, after doing a bit of algebra, we have exactly:

  f(x,y) = (mk/n)/n = (m/n)(k/n) = xy

Boundary conditions refer to information that holds at the "boundary" of the domain, in this case the four edges of the square [0,1] x [0,1]. We know a priori, even in the unequal probabilities case, that:

  f(0,y) = f(x,0) = 0
  f(x,1) = x
  f(1,y) = y

So we know completely the values of f(x,y) around the boundary of the domain. We should be able to fill in the "interior" values of the function if we knew a second order "elliptic" PDE that is satisfied by the function. In the equal probability case, f(x,y) = xy, such a PDE (partial differential equation) is the Laplace equation:

  d(df/dx)/dx + d(df/dy)/dy = 0

In words: the second partial derivative with respect to x plus the second partial derivative with respect to y equals zero.

Such an approach to solving what was originally a "simple" discrete problem, by using an "infinite" continuous approximation, may seem bizarre. However the idea is not a new one. Many questions in statistics are treated by such techniques, as when a binomial distribution involving a large number of trials (also known as a Bernoulli process) is approximated by a normal distribution.

To apply the "PDE" approach to your original problem would require some insight into what changes should be made in the differential equation itself. The roles of x,y continue to be symmetric in the unequal probabilities case, so the modification may be as simple as:

  d(df/dx)/dx + d(df/dy)/dy = g(x,y)

Typically one might proceed to investigate this by doing some numerical experiments to gain insight into what the nature of the function g(x,y) might be. Given the right insight it should be possible to rigorously justify the values of g(x,y). If g(x,y) were known, the computation of f(x,y) would follow easily from either "finite difference" or "finite element" algorithms.

regards, mathtalk-ga

P.S. Thanks for the generous rating (and tip!). I also learned a good bit from working with you!
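As a rough sketch of the finite-difference idea mentioned in the comment above (Python, purely illustrative; the grid size and iteration count are arbitrary choices), one can solve the Laplace equation with these boundary conditions and check that the interior recovers f(x,y) = xy:

```python
# Solve the Laplace equation on [0,1] x [0,1] with f(0,y)=f(x,0)=0,
# f(x,1)=x, f(1,y)=y, by Jacobi iteration on a uniform grid.
N = 41                          # grid points per side (arbitrary)
h = 1.0 / (N - 1)
f = [[0.0] * N for _ in range(N)]

# Dirichlet boundary values; f(0,y) and f(x,0) stay 0
for i in range(N):
    f[i][N - 1] = i * h         # f(x, 1) = x
    f[N - 1][i] = i * h         # f(1, y) = y

for _ in range(5000):           # Jacobi sweeps (plenty for this grid)
    new = [row[:] for row in f]
    for i in range(1, N - 1):
        for j in range(1, N - 1):
            new[i][j] = 0.25 * (f[i+1][j] + f[i-1][j] + f[i][j+1] + f[i][j-1])
    f = new

i, j = N // 2, N // 2
print(f[i][j], (i * h) * (j * h))   # both values should be close to 0.25
```

In the unequal-probabilities case one would replace the zero right-hand side by the (empirically determined) g(x,y) and solve the corresponding Poisson problem in the same way.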