Google Answers Logo
View Question
 
Q: which is the probability distribution ( Answered 5 out of 5 stars,   0 Comments )
Question  
Subject: which is the probability distribution
Category: Science > Math
Asked by: octava-ga
List Price: $15.00
Posted: 01 Nov 2002 09:01 PST
Expires: 01 Dec 2002 09:01 PST
Question ID: 95478
In a series of n random events, outcome "a" occurs exactly m times,
outcome "b" occurs exactly n-m times. I want to know the frequency
distribution of the occurrence these patterns in that series:
aa
aba
abba
...
a[b]n-2a

Of course f(aa)+ f(aba) + f(abba)+ ... + f(a[b]n-2a) adds up to 1.


For instance, if n =6 and m=3, these are all the possible series:
a a a b b b
a a b a b b
a a b b a b
a a b b b a
a b a a b b
a b a b a b
a b a b b a
a b b a a b
a b b a b a
a b b b a a
b a a a b b
b a a b a b
b a a b b a
b a b a a b
b a b a b a
b a b b a a
b b a a a b
b b a a b a
b b a b a a
b b b a a a

and the occurrence of these patterns are:
aa appears 19 times, so its freq is 19/40
aba appears 12 times, its freq is 12/40
abba appears 7 times, its freq is 7/40
abbba appears 2 times, its freq is 2/40

I can estimate the distribution "experimentally" with a perl script,
but I know there has to be a relatively simple analytical solution.
However my skill in probability are insufficient.

Request for Question Clarification by rbnn-ga on 01 Nov 2002 09:21 PST
What do you mean by "know" in the phrase "know the frequency"? I can
give a dynamic programming solution in Java that solves the problem in
time polynomial in the inputs (I think). Would that be sufficient?

Clarification of Question by octava-ga on 01 Nov 2002 11:35 PST
I was hoping for an analytical solution: a formula,  which I assume
would be simple.
However,  if your solution takes as input the values m and n,  and can
produce the frequencies of all posible patterns with 8 digit precision
in reasonable time (when n is in the range 100:500 and m is in the
range 1:100),  it would be enough for me.
Answer  
Subject: Re: which is the probability distribution
Answered By: rbnn-ga on 01 Nov 2002 11:59 PST
Rated:5 out of 5 stars
 
Please ask for clarification using the "request clarification" button
if you have any questions about the answer.

Let me begin by rephrasing your question a bit.

We are given as input positive integers n, n_a and n_b such that
n_a+n_b= n.

I've just changed your "m" to "n_a" since it is easier for me to
remember what it means that way.

For each k, let S_k denote the string ab^ka , that is, 'a' followed k
'b's followed by a.

Let R be the set of words using the characters 'a' and 'b' of length n
that contain exactly n_a 'a' characters and n_b 'b' characters .

The question is, to compute, for each k, 

    (1) The total number of occurrences of the substring S_k in any
word in the set R

    (2) The ratio of the value in (1) to the total number of
occurrences of any substring S_r in any word in the set R .

Let us begin by computing 1.

First, a basic fact of combinatorics is that the number of ways of
choosing a set of  k elements from a larger set of n elements is n
choose k , usually written

   n
(     )
   k

where the parentheses are supposed to contain both values but are hard
to draw.

I will write this value as choose(n,k) .

choose(n,k) is equal to (n!/(k!)(n-k)!) . Here m! , m factorial, is
1*2*3*...*(m-1)*m .

First of all, the number of words in R is just choose(n,n_a).

How many words of R can be formed with the string S_k occurring, say,
at the beginning of the word?

Well, if n_a is <2 or n_b < (k-2) then obviously the answer is 0.

Otherwise, we start the word with

S_k....

and we have left over to form the remainder of the word n_a-2 a
characters, n_b-(k-2) b characters, and n-k total characters.

Hence, the total number of words that can be formed from the remainder
is:

choose (n-k, n_a-2)

Now, the same reasoning holds no matter what place we put S_k. There
are n-k+1 possible positions in the string where S_k can begin.

Hence, the total number of possibilities for placing S_k, that is the
value (1), is:

((n-k+1)*choose(n-k,n_a-2)

unless n_a<2 or n_b<(k-2), in which case the answer is 0 .

For part (2) , we want an analytical expression for the sum of all
occurrences of such words S_k.

Consider any word of length n . A word S_k, for some k, begins at each
position in the word except for positions beginning with the letter
'b' or the last 'a'. Since there are n positions in the word, there
are

n-(n_b+1)=na-1

words in some S_k that occur in the word. Since there are
choose(n,n_a) such words, the total number of S_k occurrences in R is:

choose(n,n_a)*(n_a-1)

The desired frequency for substrings of length k  is the quotient of
the values in (1) and (2) above, namely:

((n-k+1)*choose(n-k,n_a-2))/(choose(n,n_a)*(n_a-1))


To check this, I implemented the formula in a small Java program. I
ran it with the example you gave, and discovered that I believe the
number of "aa" occurrences and the number of "abba" occurrences was
miscounted (that's why we have computers, after all).

Here is the Java program, followed by some sample runs:

public class Prob{
    public static long factorial(long i){
	if (i<=1) return 1;
	return i*factorial(--i);
    }

    public static long choose(long n, long k){
	if (k>n) return 0;
	if (n==0) return 0;
	return factorial(n)/(factorial(k)*factorial(n-k));
    }

    public static long count(long substringlen, long nas, long nbs){
	if (nas<2||nbs<(substringlen-2))return 0;
	long stringlen=nas+nbs;
	return (stringlen-substringlen+1)*choose(stringlen-substringlen,nas-2);
    }

    public static long counttotal(long nas, long nbs){
	long stringlen=nas+nbs;
	return choose(stringlen,nas)*(nas-1);
    }

    public static void main(String[] args){
	int stringlength=Integer.parseInt(args[0]);
	int nas=Integer.parseInt(args[1]);
	int nbs=stringlength-nas;
	
	long counttotal=counttotal(nas,nbs);

	long sum=0; //check that the sums agree
	for (int substringlen=2;substringlen<=stringlength;++substringlen){
	    String result="a";
	    for (int i=0;i<substringlen-2;i++)
		result+="b";
	    result+="a";
	    long count = count(substringlen,nas,nbs);
	    System.out.println(result+": "+count+"/"+counttotal);
	    sum+=count;
	}
	System.out.println("\nCounted "+sum+" strings; expected:
"+counttotal);
    }
	
}


java Prob 6 3
aa: 20/40
aba: 12/40
abba: 6/40
abbba: 2/40
abbbba: 0/40


java Prob 10 3
aa: 72/240
aba: 56/240
abba: 42/240
abbba: 30/240
abbbba: 20/240
abbbbba: 12/240
abbbbbba: 6/240
abbbbbbba: 2/240
abbbbbbbba: 0/240

Counted 240 strings; expected: 240

SEARCH STRATEGY
---------------
An internet search on substrings and substring occurrences was
unsuccessful.

At first I thought an analytical solution would not be possible. Then
I realized the counting argument above yields the solution.

Clarification of Answer by rbnn-ga on 01 Nov 2002 12:02 PST
Oops! 

My definition of S_k was wrong.

It should read "let S_k denote the string ab^{k-2}a, that is, 'a'
followed by (k-2) b's followed by a .

Request for Answer Clarification by octava-ga on 01 Nov 2002 22:45 PST
Thank you for your answer: since you tested it, I can see that it is
correct. But before I officially rate your answer I wish to understand
your analysis, and that will take me some time.

You are right that I miscounted the occurrence of "aa" and "abba".

You have done a great job! Now is time for me to put some effort. If I
have doubts I will reach you. Thanks!

Clarification of Answer by rbnn-ga on 02 Nov 2002 01:54 PST
Thank you for the comment; once again, if there is any place in
particular that seems especially opaque, please just let me know. My
previous job actually involved analysis of genomes, which often boiled
down to looking at all words and counting various things (although the
words there had four letters, AGCT, now just a and b, typically).

By the way, the Java program will overflow for even moderate values of
n. I could write a version without overflow if you wanted to test some
fairly high n; I didn't because I wrote it primarily just for the sake
of clarifying and testing the existing algorithm.

Request for Answer Clarification by octava-ga on 02 Nov 2002 07:23 PST
I've finally understood your analysis.

I'm a molecular biologist turned doing bioinformatics. Unfortunately I
haven't had any formal training in math.

The problem you just solved for me is part of the null model for the 
distribution of the distances between amino acid pairs in proteins. To
get the final answer I just need to consider that symbol "a" is
actually composed of two other symbols (for instance A and G - ala and
gly). I think I can figure out that myself. If not, that might be my
next Google question

If this becomes part of a publication, I will be glad to acknowledge
your participation there. (Just tell me how).

Thanks again!

Clarification of Answer by rbnn-ga on 02 Nov 2002 11:53 PST
Thank you for your comments. Symmetrically, as a computer scientist
with a math background but no biology, I often had great difficulty
understanding the biology used by my coworkers.

In any case, it is not necessary to acknowledge google answers; if you
would like to, you can cite rbnn-ga if it is published. If you need
more detailed information, just email to answers-editors@google.com
when the paper is accepted.
octava-ga rated this answer:5 out of 5 stars and gave an additional tip of: $10.00
The answer was exactly what I needed. The approach was clever. The
explanation was a very welcome lesson in probability thinking. The
java program is another bonus. Many Thanks. Belive me, I wish I could
tip you as you deserve, but I lack the budget for that.

Comments  
There are no comments at this time.

Important Disclaimer: Answers and comments provided on Google Answers are general information, and are not intended to substitute for informed professional medical, psychiatric, psychological, tax, legal, investment, accounting, or other professional advice. Google does not endorse, and expressly disclaims liability for any product, manufacturer, distributor, service or service provider mentioned or any opinion expressed in answers or comments. Please read carefully the Google Answers Terms of Service.

If you feel that you have found inappropriate content, please let us know by emailing us at answers-support@google.com with the question ID listed above. Thank you.
Search Google Answers for
Google Answers  


Google Home - Answers FAQ - Terms of Service - Privacy Policy