Google Answers Logo
View Question
 
Q: A problem on probabilities ( Answered 5 out of 5 stars,   13 Comments )
Question  
Subject: A problem on probabilities
Category: Science > Math
Asked by: octava-ga
List Price: $20.00
Posted: 25 Jun 2003 23:12 PDT
Expires: 25 Jul 2003 23:12 PDT
Question ID: 221855
This is a problem on probabilities. The conditions are:

1) There is a set A with n elements. Each element has a different
probability of being chosen: (p1, p2, ... ,pn)

2) In trial I, set B is built by choosing, without replacement, m
elements from set A.

3) After returning set A to its original state; in trial II, set C is
built by choosing, without replacement, k elements from set A.

Question 1: Which is the best guess for the number of elements shared
between sets B and C?


The payment I offer is for question 1, but I am also interested in,
and will reward the answers for, these other questions:

Question 2: If we know the average of the probabilities of the
elements that actually went into set B, can we improve on our guess?

Question 3: If we also know the average of the probabilities of the
elements that actually went into set C, can we further improve on our
guess?

Question 4: Would the problem be simplified if we knew that the
probabilities associated to the elements in set A (p1, p2, ... ,pn)
follow a binomial distribution (or some other well studied)?

Note: This problem arised in a project in sequence analysis
(bioinformatics), which I would be willing to explain if that input
could be helpful!

Request for Question Clarification by mathtalk-ga on 26 Jun 2003 06:38 PDT
Hi, octava-ga:

I think to give a rigorous answer we need some clarification from you.
 This might go hand in hand with knowing more about the intended
application (which I'd be interested to hear about, in any case), but
let me raise the math points first:

1.  What makes a "guess" the best?  One usually asks from some
well-defined mathematical quantity, such as the "expected value" of
the number of shared elements.  Note that the expected value is the
mean.  Two other quantities that might be sought are the median
(roughly, a guess half the time too high and half the time too low)
and the mode (a number that occurs more often than any other).

2.  A little more needs to be said about the probabilities.  The
values given:

p1, p2, ... , pn

are apparently the chances of the respective element being drawn from
A, so:

p1 + p2 + ... + pn = 1

However, since items are chosen "without replacement," something needs
to be said about how the probabilities of being chosen change after
one or more elements are removed from A.  The simplest assumption
might be that the chance of an element being drawn is always
proportional to some "weight" (perhaps the length of a genetic
sequence), which is equivalent to a simple "conditional probability"
interpretation.

Incidentally the problem can be generalized to allow for different
probablilties  q1, q2, ... , qn for trial II (the selection of set C).

The special case m = k = 1 can be easily solved; I'll post this as a
comment below while awaiting your clarification.

regards, mathtalk-ga

Clarification of Question by octava-ga on 26 Jun 2003 08:46 PDT
Hi, mathtalk-ga! Thanks for your interest!

A) You are right, what I want to know is the expected value. However,
as you can see from my questions 2,3,4, I assume that the expected
value will change depending on how much information we are made
available.

b) You are also right in assuming that the probabilities are
proportional to some intrinsic "weight" of the elements (so that after
every draw, the ratio pi/pj remains constant, for any two elements i
and j which remain in A).

Perhaps the following explanation of my bioinformatics problem will
help clarify the mathematical problem.

I work on a method of comparing proteins. Proteins are strings of
amino acids. There are 20 different amino acids. For my purpose, I
encode a protein as a vector of "0"s and "1"s. Each 0 or 1 represents
that absence or presence of a particular amino acid trimer in the
protein. There are 20x20x20 (=8000) possible amino acid trimers, so
the length of the vectors is 8000. To compare two proteins, I compare
their vectors, and measure how many "1"s are common between both
vectors. (This I call the "observed" count). Then, what I need to know
is if observed count is significantly higher than would be expected by
assuming that proteins are random collections of trimers (drawn
without replacement, since I only consider presence or absence, and
not how many times a trimer appears in a protein). What I am trying to
determine with the question I posted is what would be the "expected"
count. What I ultimately need to know is what is the probability of
obtaining the observed count under the random model.

In the original posting of the problem, set A is the universe of all
possible trimers. The probabilities(p1,p2,...,pn) correspond to the
frequencies of each trimer in a large collection of proteins. Set B is
the encoding of a protein (the collection of trimers appearing in that
protein) and m is size of this collection (the number of "1"s in the
encoding vector). Similarly, set C is the encoding of another protein,
with k being the number of "1"s in the corresponding vector.

The number of trimers present in proteins is usually in the range of
30-3000. Of course, after comparing two proteins, I know not only how
many, but which trimers were common between them. However, I would
rather not use this trimer-specific information, because it would make
each comparison slow, and I usually have to compare one protein
against many thousands.

Request for Question Clarification by mathtalk-ga on 27 Jun 2003 07:27 PDT
Hi, octava-ga:

I have worked on the problem a good bit, and I can give exact
expressions (very complicated) for the expected number s of shared
elements in B intersect C.

Let me suggest what I think might be a practical implmentation of the
"random" approach.  You have a fixed protein with set B of trimers, to
which the "catalog" of proteins is to be compared.  What I think you'd
like to be able to say about a particular "catalog" entry with set C
of trimers is that given set B and a random selection of k trimers,
the probability of having as many or more shared entries as s is x. 
Thus, when probability x is especially low (lower than some
predetermined threshold) one infers a possible "homology" of the two
proteins.

To make such a claim one needs more than simply the expected number of
shared elements.  For a particular threshold probability x it would
suffice to know a corresponding number of shared elements S(x) such
that:

Pr( s >= S(x) | B and |C| = k ) = x

[NB: The final equality is only approximate since the probability
distribution is discrete; we would pick S(x) conservatively, so that
one has the largest probability less than or equal to x.]

To obtain information about the probability distribution of s (which
should more properly be labeled s_k since it is a random variable
conditioned on the number k of trimers in C) I would consider two
approaches.

One is a Monte Carlo method for counting matches.  Although Monte
Carlo methods are notoriously slow to converge, there are some
efficiencies here in being able to accumulate the distributions of s
for all values of k in "overlapped" fashion.  The other method would
employ a simplified "binary" approximation in which k draws are made
with replacement.  Although both techniques involve approximation,
they complement one another, and I believe a synthesis of the two
approaches might be capable of producing good estimates with rigorous
upper and lower bounds.

The point I'd like to emphasize here is that these computational
approaches, although intense, would need to be done only once for the
given selection B of trimers.  Comparisons to each catalog entry C of
selections would then be as quick as simply counting the shared
elements of B and C.

Any interest in such approaches?

regards, mathtalk-ga

Clarification of Question by octava-ga on 27 Jun 2003 09:56 PDT
Hello, mathtalk-ga:

Certainly!!!; I am very interested in those approaches, since a good
practical solution would be even more satisfactory than a complicated
exact solution.

Let me tell you that I have spent some time working in an empirical
solution based on the tedious, and not completely convincing, analysis
of many cases:

1) For every protein in the target collection, I find, Y, which
represent the fraction of the m trimers in the query protein that are
also among the k trimers in the target protein. Next, I plot Y against
k. The plot has thousands of points, one for every protein in the
target collection.

2) Then, I find the best regression line. The regression coefficient r
is usually above 0.8. However, I can see from the plot that the data
would be better described by a curve: the slope dY/dk is larger near
k=0 and smaller at k=8000. This, seem to be due to the fact that
proteins with few trimers tend to have few of the rare trimers, and
proportionally more of the most common ones, which are also likely to
be in the query.

3) From the linear regression model Y(k)=a + bk, I keep the slope
parameter 'b' (as expected, parameter 'a' is very close to zero).

Since b changes with every query, after repeating steps 1,2,3 with
many query proteins, I plot the 'b' values against m (the number of
trimers in the query protein). Then I find the best regression line
for this data.

My aim is to be able to predict b based on m. That is, I want an
approximation to b(m), so that, for any query/target pair, I can
predict Y, the fraction of shared trimers, based on m and k alone, by
applying the formula Y(k,m)=b(m)*k

Although I achieve some predictive capacity, I feel the approach is a
mess!!!

Sorry for making you read so much. I look forward to your answer, and
very much appreciate your effort!!

Best regards,

octava.
Answer  
Subject: Re: A problem on probabilities
Answered By: mathtalk-ga on 02 Jul 2003 10:56 PDT
Rated:5 out of 5 stars
 
Hi, octava-ga:

Thanks for posting your very interesting question.  I will refer to
(without repeating) our earlier discussion in the Requests for
Clarification (above) and Comments (below).  First let me address the
three items you asked about in your last Comment.

1) I'm glad you are undertaking this computational exercise.  A great
deal of the discussion below may be read as advice on how to implement
such calculations.  Without knowing what programming language is
convenient for you, I've tried to couch it in mainly algorithmic
terms.

2) In order to share images and other non-text material, you (or I, in
the converse) would need to create a Web page to display the images
and/or a link from which the material can be downloaded.  In terms of
sending you material, I already have a way to do this.  Note that the
Google Answers Terms of Service forbid the exchange of email
addresses, and that it is preferred to have as much content presented
within the Google Answers site as possible (although this is currently
limited to the sort of text-based discussion we are having).

3) "Amateur" and even "dilettante" are not bad words in mathematics
and statistics, reflecting merely an unpaid or parttime status of
one's investigations.  Of probability books there are many, and for a
true beginner I could recommend these inexpensive volumes:

Schaum's Outline of Probability and Statistics

Introduction to Probability by John E. Freund 
(Dover Books on Adv. Math.)

I suspect that you are being overly modest about your background, so
these books may easily be too elementary for your needs. 
Computational probability has undergone significant advances in recent
years, thanks in some measure to an influx of hard problems in diverse
application areas like quantum physics and genetic mapping.  There is
a pronounced overlap with classic topics in operations research, such
as queuing theory and Markov chains.

NOTATION
========

When one wants to define the expected value of an expression, a
probability must be assigned to each possible value.  The expected
value of a function f of a single random variable X is related to the
possible values of X as follows:

E(f(X)) = SUM f(x) Pr(X=x)

where the sum is taken over all possible values x of X.  [The
summation must be replaced by an integral in the case of a continuous
probability distribution, but your question deals with a discrete
probability distribution.]

By convention we should use capital letters for the random variables
(for which probabilities are attached to the outcome values) and small
ones for ordinary ones.  As we have not followed this convention very
carefully to this point, I will not try to more than call attention to
it.  However the clarity it provides is I think worth the effort,
should you wish to publish your investigations.  (For more about
publication, please see the concluding paragraphs.)

EXPECTATION
===========

Here the basic random variables are the subsets B,C of items
(monomers, trimers, etc.) drawn from the universe A of all n items. 
The quantity of interest in your original question is the expected
value of the cardinality of B intersect C, which I will denote in this
manner for compactness:

E(|B&C|)

The use of "absolute value" signs for cardinality is fairly standard,
but I confess that the ampersand for intersection is merely the best
of several poor choices available to us in this "text only" medium.

We can therefore write:

E(|B&C|) = SUM |b&c| Pr(B=b) Pr(C=c)

where the summation is to be taken over all possible values b for B
and c for C.

Implicit in the above is an assumption that the selections of B and C
are "independent" in a probabilistic sense.  This means formally that
the probability that B = b AND C = c equals the product of their
individual probailities:

Pr(B=b AND C=c) = Pr(B=b) * Pr(C=c)

Note that we already made use of this assumption above when writing
the expected value of |B&C| as a summation.

A fuller phrasing of your original question involves specifying the
sizes of B and C.  For the sake of convenience we will omit an
explicit notation for this until necessary, but we acknowledge that
the underlying process of selection gives:

|A| = n

|B| = m

|C| = k

In the underlying assumptions about the distribution of B (resp. C)
among subsets of a fixed size, the items that may be drawn are
associated with individual (not necessarily equal) probabilities.  The
"common sense" interpretation of this was discussed above, in the
Requests for Clarification.  It remains only to work out the
algorithmic and practical implications of that, for once these
probabilities are known, the computation above is "merely" a matter of
plugging in those values.

Since the roles of B and C are "symmetric" except for size, it
suffices to go into detail about calculating Pr(B=b).

COMPUTATION
===========

How many subsets B of A have size m?  Counting the possible subsets is
of course a well-known "combinations" function:

C(n,m) = n!/(m!(n-m)!)

Explicitly listing all the possibilites is perhaps most easily done in
a "very high level" language such as Prolog.  However an
implementation in C or C++ is not difficult.  For fixed values of n
and m one can use nested loops; code that treats n and m as variable
numbers of items can be written recursively.  I'm sure you may have
your own ideas on this subject, so I'll simply mention one minor
aspect that occurs to me, namely to work from an ordering of the items
in A which sorts them by descending probability.  Although not
essential, this ordering (from prior experience) allows the best
control over the optimization of numerical accuracy during
implementation.

Once an ordering of items in A is decided, subsets of A may be
identified with bit patterns of length |A|.  Using 0-based indexing,
one can identify the inclusion or exclusion of an item from subset B
by value (1 or 0, resp.) of the i'th bit in the pattern.  So one
approach is to label all the integers from 0 (the empty set) to 2^|A|
- 1 (the entire set) with the number of 1's contained in the binary
expansion, and select only those integers having the desired count m.

We now turn to computing the probability Pr(B=b) for specific values
of m.  The case m=0 is sensible but trivial, as only one empty set is
possible.

The smallest interesting case is m=1.  The probabilities of a single
draw are simply those values given to us as "frequencies" for the
various items.  [Note: See my first Comment for the resulting
computation of expected number of matches with m=1, k=1.]

For m > 1 the naive expression for Pr(B=b) involves a summation over
all possible orderings (permutations) of the m elements of B.  Thus
the complexity of the expression grows like m!.

When m=2,3 there's nothing wrong with the naive approach, and it may
be helpful to consider the cases m=2 and m=3 more explicitly.

Suppose that b consists of two elements whose "raw" frequencies are p
and p'.  Then:

Pr(B=b) = p*p'/(1-p) + p'*p/(1-p') 

        = (p/(1-p))*(p'/(1-p'))*(2-p-p')

where the two terms correspond to the two possible orderings of b.

The case m=3 is similar but involves six terms, for the 3! possible
orderings.  Let p,p',p" denote the frequencies of the three elements
of b.  Then:

Pr(B=b) = (p*p'*p")[ 1/((1-p)(1-p-p') + 1/((1-p')(1-p-p')

                   + 1/((1-p')(1-p'-p") + 1/((1-p")(1-p'-p") 

                   + 1/((1-p)(1-p-p") + 1/((1-p")(1-p-p") ]
 
Although this expression can be rewritten in alternate forms, this one
shows the relation to the six possible orders in which the three given
elements can be chosen.

For larger values of m we seek ways to economize on computation, e.g.
using a "divide and conquer" approach to recursively partition b into
smaller disjoint subsets.

Let me sketch how this might work for m=10, n=20.  Any subset b of
size 10 can be ordered in 10! = 3,628,800 ways.  Rather than
enumerating all of these, we can instead consider the C(10,5) = 252
ways to partition ten items into a first subset b' and a second subset
b", each of size 5.

That is to say, suppose we first draw five items B' without
replacement from the urn containing A, and then draw another five
items B" without replacement from the urn containing A\b' (the
complement of b' in A).  What is the chance that:

B = B' U B

will turn out to equal b?

One finds that perfect reuse can be made of the values:

Pr(B'=b') = Pr(B=b')

if these were previously computed for m=5 since the urn contains A in
both cases.

However the probabilities of extending a subset b' to b (by later
selection of b"=b\b') requires a recomputation, similar to the values
Pr(B=b') but involving "dilated" underlying frequencies because the
five elements B" are being drawn from the "censored" population A\b'
of just 15 items.  Then:

Pr(B=b) = SUM Pr(B'=b') * Pr(B"=b\b')

where the summation is over all five element subsets b' of b and where
the random variable B' is based on drawing without replacement from A'
= A\b' (rather than from A).

It is not essential that the partitioning of b be into two equal
subsets.  Indeed if the probabilities for all cardinalities m (or all
up to some threshold) need to be computed, it is attractive to
partition b into one subset with only one element omitted and its
corresponding (singleton set) complement.  Then:

Pr(B=b) = SUM Pr(B'=b\{x}) * p(x)/(1 - SUM p(y))

where the outer summation is over the m elements x belonging to b, and
where the inner sum (in the denominator) is over the other m-1
elements y belonging to b\{x}.  Here p(x),p(y) denote the original
frequencies of elements x,y respectively.

In this manner one can systematically build up from cases m=1,2,3 all
the higher values of m probabilities in turn.

SUGGESTIONS
===========

As n becomes much larger than 20 the exhaustive calculations outlined
above will not be feasible for all m,k.  However some approximation
techniques are available.

A simple Monte Carlo approach would be to carry out a million drawings
of sizes m,k from the urn and to tabulate the frequency of matches. 
While easy to program, such a procedure will probably only give
modestly accurate results.  Convergence to the mean in Monte Carlo
algorithms is typically inversely proportional to the square root of
the sample size.  However the Monte Carlo approach does allow for
"parallel" accumulation of information about cases "below" m,k, i.e.
by counting the matches for each m' <= m and k' <= k as the sampling
proceeds.

I was originally thinking of creating a hybrid Monte Carlo approach,
esp. for the case when B=b is given and we want to know about the
distribution of matches with random C, given only that |C|=k.  By
hybrid Monte Carlo methods are usually meant approaches that combine
simulation with explicit computation of probabilities.  For example,
one might randomly select B, then explicitly calculate the
probabilities of matches (with C) for various k.

It is worthwhile to recall from time to time that we are working with
a rough approximation the "random model" and that perfect accuracy in
the model is not of exceptional value for that reason.  However
understanding the characteristics of the model is very worthwhile in
knowing what limitations its application may have.

I'm struck by the thought that the expected value of matches may be
approximately modelled by a simple second order partial differential
equation.  My grounds for this optimism are limited, but it very often
happens that for a stochastic process governed by a first order
differential equation, the non-stochastic quantity obtained by passing
to the expected value is then governed by a second order differential
equation.  Also, we can say of the expected value of matches in the
equiprobable case, mk/n, that it is exactly the "scaled" solution to
the Laplace equation on [0,1] x [0,1] where we use variables:

x = m/n

y = k/n

and take Dirichlet boundary conditions:

f(0,y) = f(x,0) = 0

f(x,1) = x

f(1,y) = y

whose solution is f(x,y) = xy.

Naturally an interesting problem such as you propose sparks many more
ideas than can conveniently be summarized in one post,

Let me close this by sharing some information which I solicited from
the Google Answers Editors, about a hypothetical case in which a
publication might be based on a Google Answers thread.  I realize that
publication in the field of bioinformatics may be of some importance
to you, and so I wanted to find out the ground rules.

As you may have noticed, Google asserts a copyright at the bottom of
each Web page on this site.  This copyright is acknowledged within the
Terms of Service for Customers and Researchers (who are independent
contractors).

The Editors replied to my question about publication with the
following:

    Hello mathtalk-ga,

    Thank you for your email.  If the work that is done on Google 
    Answers is published in a journal, then firstly the customer 
    would have to take permission from Google prior to publishing
    it.  Also, they would have to credit Google Answers as well as
    the Researcher for that work. 

    Thank you for bringing up this interesting point!

    The Google Answers Team
octava-ga rated this answer:5 out of 5 stars and gave an additional tip of: $30.00
When I posted my question, I had no idea of how difficult it was.
Mathtalk didn't just give the correct answer, but also provided
patience, comprehension, enthusiasm and effort far beyond I expected.
Thanks to that generosity I learned a lot from the answer and
comments.

Many thanks! Really!

Octava

Comments  
Subject: Re: A problem on probabilities
From: mathtalk-ga on 26 Jun 2003 06:54 PDT
 
The special case m = k = 1 allows only for zero or one shared elements
between sets B and C of the two trials.

We could allow a second set of probabilities q_i for the second trial,
as suggested above.  In that case the probability of one shared
element is:

d = SUM p_i q_i [FOR i=1,..n]

and since otherwise the number of shared elements is zero, this
expression is also the expected value or mean of the number of shared
elements.

Then if d > 1/2, the mode is 1, and if d < 1/2, the mode is 0.

To illustrate suppose the p's and q's are the same, and that we have
three elements whose probabilities of being chosen are 1/2, 1/3, and
1/6 respectively.

d = (1/2)^2 + (1/3)^2 + (1/6)^2 = 7/18

So here the expected number of shared elements is 7/18, but the mode
is zero (since 11 times out of 18 there will be no shared elements).

A median value may not be useful in connection with a discrete
distribution, but one can "interpolate" between discrete values to
define one.  In this simplified case the median would probably be
identified with the mean, 7/18.

regards, mathtalk-ga
Subject: Re: A problem on probabilities
From: mathtalk-ga on 26 Jun 2003 09:53 PDT
 
Hi, Kalle:

Thanks for the fascinating background information.  My father is an
organic chemist turned physician, and I've "inherited" some of his
enthusiasm for biochemical investigations, esp. those with a nice math
problem attached.

regards, mathtalk-ga
Subject: Re: A problem on probabilities
From: octava-ga on 26 Jun 2003 12:14 PDT
 
Hi, mathtalk-ga!

Who is Kalle? 
Are you still working on the question?
Subject: Re: A problem on probabilities
From: mathtalk-ga on 26 Jun 2003 12:58 PDT
 
Oops, octava-ga... sorry about the name!  But yes, I'm working on the problem.

-- mathtalk
Subject: Re: A problem on probabilities
From: octava-ga on 26 Jun 2003 17:28 PDT
 
Thanks! I look forward to your answer
Subject: Re: A problem on probabilities
From: mathtalk-ga on 28 Jun 2003 21:07 PDT
 
Hi, octava:

As a premliminary matter let's discuss the "equiprobable" case.

Here if we drew m items without replacement, out of a possible n, in
the first trial, followed by drawing k items out of n (again without
replacement) in the second trial, then on average the number of shared
items (matches) will be:

E(s(m,k)) = mk/n  for all 0 <= m,k <= n

In other words the expected number of matches is a bilinear function
of m and k, with the boundary conditions:

s(0,k) = s(m,0) = 0

s(m,n) = m

s(n,k) = k

It might be of interest to see the effects of unequal probabilities in
a case in which n is small enough that the exact computations can be
done.  For the sake of such a demonstration I would suggest treating
the frequency of shared "monomers" (amino acids), where of course n =
20.

See here for a tabulation of amino acid frequencies in vertebrates:

[Amino Acid Frequency]
http://www.tiem.utk.edu/~harrell/webmodules/aminoacid.htm

Amino Acids    Observed Frequency in Vertebrates 
 
Alanine                 7.4 %
 
Arginine                4.2 %
 
Asparagine              4.4 %
 
Aspartic Acid           5.9 %
 
Cysteine                3.3 %
 
Glutamic Acid           5.8 %
 
Glutamine               3.7 %
 
Glycine                 7.4 %
 
Histidine               2.9 %
 
Isoleucine              3.8 %
 
Leucine                 7.6 %
 
Lysine                  7.2 %
 
Methionine              1.8 %
 
Phenylalanine           4.0 %
 
Proline                 5.0 %
 
Serine                  8.1 %
 
Threonine               6.2 %
 
Tryptophan              1.3 %
 
Tyrosine                3.3 %
 
Valine                  6.8 %
 
With n = 20 the largest group of subsets of fixed size is that of
cardinality 10, and there are 184756 such subsets.  Computing the
probability Pr(B) of any particular subset B of this size being drawn
is a difficult but not impossible task (for a computer program).  If
we were to do it for all such subsets, then the expected number of
shared entries between two such subsets would be:

SUM |B intersect C| Pr(B) Pr(C) 

where the summation is over all pairs B,C of subsets of size 10, and:

|B intersect C|

denotes the cardinality of their intersection (the number of shared
values).

Note that even in the unequal probabilities case, the boundary
conditions for s(m,k) remain just as before.  This leads to some
optimism that an exact computation in the n=20 case may give some
insight into effect methods of approximation for large n.  In
particular it might be possible to formulate a continuous model (such
as a differential equation) whose solution is the limiting case as n
tends to infinity.  In this way one might be able to give a nice
treatment of n=8000 with a continuous surface.

regards, mathtalk-ga
Subject: Re: A problem on probabilities
From: octava-ga on 29 Jun 2003 13:38 PDT
 
Hi, mathtalk-ga!

First of all, let me tell you that your input has been of great help
to me, and that I very much appreciate your effort. If you wish so, I
would be glad to consider the question answered.

Second, as you say, computing the probability of Pr(B) for a set B is
difficult. However, given an actual protein B, I can easily compute
the average frequency of its trimers. I will call it avgFr(B). Now,
let us call len(B) the number of trimers in protein B. Given a query
protein Q, and any other protein B, I have found empirically that a
good approximation to the number of shared trimers is:

sqrt( avgFr(Q) * avgFr(B) ) * len(Q) * len(B)

Interestingly, if I disregard the information about avgFr(B) I obtain
a marginally better approximation with the formula:

avgFr(Q) * len(Q) * len(B)

I believe this is due to the fact that if avgFr(Q) is large and
avgFr(B) is small, or vice versa, this implies that the two proteins
are made of very different trimers, so the overlap of the two sets is
smaller than would be predicted from the "weight" of those sets.

Besides, although this works for most proteins, this is a linear
model, and I know that it fails when len(B) is near 8000, since at
len(B)=8000 the correct number of shared trimers should be equal to
len(Q) and this would only happen if
 
avgFr(Q) * len(B)= avgFr(Q) * 8000=1. That is, only if 

avgFr(Q)=1/8000= 0.000125, and that is usually false (indeed avgFr(Q)
is always
between 0.00014 and 0.0004.

To solve this, and without any formal justification, I have tried to
predict the number of shared trimers Y(X) as a function of X, (where X
stands for len(B)), with the following curve:

Y= X/(a + bX),  

because I can easily choose parameters a and b in such way that at for
len(B)=0 the slope dY/dX = avgFr(Q) * len(Q),  while at len(B)=8000, 
the number of shared trimers Y=len(Q),  which I know must be true:

a=((1/avgFr(Q)) -1)/(1-(1/8000)),  b=(1/avgFr(Q)) -a

This approximation happens to work better than any of the two I
mention above.

However, as you very clearly stated in the comment you posted on 27
Jun 2003 07:27, my problem does not end there: I need to find the
number of shared trimers for which the probability is below a
threshold that would allow me to say that the proteins are homologous.
Inspection of my data tells me that this depends on len(B), because
the database is rich on proteins of certain sizes. That is
distribution of len(B), for all B in my database, has maximal density
around len(B)=400. This makes makes my cloud of points wider in that
zone ( the dispersion is bigger there). Another factor that has a
large impact on the dispersion is the number of trimers in the query
len(Q). Smaller query proteins have overall large dispersions, because
having less trimers they cause a sample size effect.

Perhaps I should post another question to seek help in this unsolved
part of my problem.

I would very much appreciate your comments!

Thanks 

Octava
Subject: Re: A problem on probabilities
From: octava-ga on 29 Jun 2003 13:46 PDT
 
mathtalk,  by the way:

1) As an exercises, I am coding the exact solution you suggested for
amino acids monomers

2) How can I show you images?

3) Any good basic book on probabilities? (You can see I am at best an
amateur!)

Regards!

Octava
Subject: Re: A problem on probabilities
From: octava-ga on 01 Jul 2003 21:52 PDT
 
Hello, mathtalk-ga:

You have been quiet for sometime now. Any reason?

Greetings!

Octava
Subject: Re: A problem on probabilities
From: mathtalk-ga on 02 Jul 2003 04:55 PDT
 
Umm... composing a long-winded thesis for you on computational
probability?  I'll post later today, just want to tweak my exposition
a bit.

regards, mathtalk
Subject: Re: A problem on probabilities
From: octava-ga on 02 Jul 2003 21:54 PDT
 
Hi, mathtalk!

Thanks for all your help! Your are right in assuming that this
question was related to a future publication. Although the solution
you gave me are impeccable (and I enjoyed going through them), I doubt
that they would be practical for my purposes in database searching.
Instead, I may use a much less elegant solution I have implemented,
based on adjusting ad hoc curves to the empirical data.

However, you have helped me in a way you might not suspect: being
handicapped in math, I am always afraid that it will show if I do
something in a contrived way when there was a much simpler way. Now I
know there is no simple solution to my problem, so I am confident that
I won't appear as a fool for providing my own practical solution. This
is a great relief!

Thanks again. I've learned a lot from your explanations. I hope you
enjoyed the biology of the problem, and that we will have other
chances for interaction in the future.

Octava
Subject: Re: A problem on probabilities
From: octava-ga on 02 Jul 2003 22:09 PDT
 
Postdata: 
I forgot to confess that I got lost in the part where you propose an
approximation based on second order differential equations: "Laplace
equation"?; "Dirichlet boundary"? ... I need to study a little more.
Subject: Re: A problem on probabilities
From: mathtalk-ga on 03 Jul 2003 08:49 PDT
 
Let me expand a bit on the differential equation idea.

In the case of equiprobable items "in an urn" we have the expected
number of matches:

s(m,k) = mk/n

In order to pass to a "limit" as n goes to infinity (thereby hopefully
obtaining a continuous function of simple form), we need to scale both
the "inputs" (arguments) and the "output" to make sense independently
of n.  In other words, let's replace m and k as parameters by the
relative fractions:

x = m/n

y = k/n

and at the same time scale the output so that the value of the
function is expressed, not as the expected number of matches, but the
expected fraction of matching items (out of n):

f(x,y) = s(m,k)/n

Now as n tends to infinity, the values of x,y behave well.  The domain
of f(x,y) is [0,1] x [0,1].  Also the value of f behaves nicely, with
a range of [0,1] rather than "expanding" with n.

In fact after doing a bit of algebra above, we have exactly:

f(x,y) = (mk/n)/n = (m/n)(k/n) = xy

Boundary conditions refer to information that holds at the "boundary"
of the domain, in this case the four edges of the square [0,1] x
[0,1].  We know apriori, even in the unequal probabilities case, that:

f(0,y) = f(x,0) = 0

f(x,1) = x

f(1,y) = y

So we know completely the values of f(x,y) around the boundary of the
domain.  We should be able to fill in the "interior" values of the
function if we knew a second order "elliptic" PDE that is satisfied by
the function.

In the equal probability case, f(x,y) = xy, such a PDE (partial
differential equation) is the Laplace equation:

d(df/dx)/dx + d(df/dy)dy = 0

In words: the second partial derivative with respect to x plus the
second partial derivative with respect to y equals zero.

Such an approach to solving what was originally a "simple" discrete
problem, by using an "infinite" continuous approximation, may seem
bizarre.  However the idea is not a new one.  Many questions in
statistics are treated by such techniques, as when a binomial
distribution involving a large number of trials (also known as a
Bernoulli process) is approximated by a normal distribution.

To apply the "PDE" approach to your original problem would require
some insight into what changes should be made in the differential
equation itself.  The roles of x,y continue to be symmetric in the
unequal probabilities case, so the modification may be as simple as:

d(df/dx)/dx + d(df/dy)dy = g(x,y)

Typically one might proceed to investigate this by doing some
numerical experiments to gain insight into what the nature of function
g(x,y) might be.

Given the right insight it should be possible to rigorously justify
the values of g(x,y).  If g(x,y) were known, the computation of f(x,y)
would follow easily from either "finite difference" or "finite
element" algorithms.

regards, mathtalk-ga

P.S.  Thanks for the generous rating (and tip!).  I also learned a
good bit from working with you!

Important Disclaimer: Answers and comments provided on Google Answers are general information, and are not intended to substitute for informed professional medical, psychiatric, psychological, tax, legal, investment, accounting, or other professional advice. Google does not endorse, and expressly disclaims liability for any product, manufacturer, distributor, service or service provider mentioned or any opinion expressed in answers or comments. Please read carefully the Google Answers Terms of Service.

If you feel that you have found inappropriate content, please let us know by emailing us at answers-support@google.com with the question ID listed above. Thank you.
Search Google Answers for
Google Answers  


Google Home - Answers FAQ - Terms of Service - Privacy Policy