Google Answers: which is the probability distribution

View Question

Q: which is the probability distribution ( Answered 5 out of 5 stars

, 0 Comments )

Question

Subject: which is the probability distribution
Category: Science > Math
Asked by: octava-ga
List Price: $15.00

Posted: 01 Nov 2002 09:01 PST
Expires: 01 Dec 2002 09:01 PST
Question ID: 95478

In a series of n random events, outcome "a" occurs exactly m times,
outcome "b" occurs exactly n-m times. I want to know the frequency
distribution of the occurrence these patterns in that series:
aa
aba
abba
...
a[b]n-2a

Of course f(aa)+ f(aba) + f(abba)+ ... + f(a[b]n-2a) adds up to 1.


For instance, if n =6 and m=3, these are all the possible series:
a a a b b b
a a b a b b
a a b b a b
a a b b b a
a b a a b b
a b a b a b
a b a b b a
a b b a a b
a b b a b a
a b b b a a
b a a a b b
b a a b a b
b a a b b a
b a b a a b
b a b a b a
b a b b a a
b b a a a b
b b a a b a
b b a b a a
b b b a a a

and the occurrence of these patterns are:
aa appears 19 times, so its freq is 19/40
aba appears 12 times, its freq is 12/40
abba appears 7 times, its freq is 7/40
abbba appears 2 times, its freq is 2/40

I can estimate the distribution "experimentally" with a perl script,
but I know there has to be a relatively simple analytical solution.
However my skill in probability are insufficient.

Request for Question Clarification by rbnn-ga on 01 Nov 2002 09:21 PST

What do you mean by "know" in the phrase "know the frequency"? I can
give a dynamic programming solution in Java that solves the problem in
time polynomial in the inputs (I think). Would that be sufficient?

Clarification of Question by octava-ga on 01 Nov 2002 11:35 PST

I was hoping for an analytical solution: a formula,  which I assume
would be simple.
However,  if your solution takes as input the values m and n,  and can
produce the frequencies of all posible patterns with 8 digit precision
in reasonable time (when n is in the range 100:500 and m is in the
range 1:100),  it would be enough for me.

Answer

Subject: Re: which is the probability distribution
Answered By: rbnn-ga on 01 Nov 2002 11:59 PST
Rated: 5 out of 5 stars

Please ask for clarification using the "request clarification" button if you have any questions about the answer. Let me begin by rephrasing your question a bit. We are given as input positive integers n, n_a and n_b such that n_a+n_b= n. I've just changed your "m" to "n_a" since it is easier for me to remember what it means that way. For each k, let S_k denote the string ab^ka , that is, 'a' followed k 'b's followed by a. Let R be the set of words using the characters 'a' and 'b' of length n that contain exactly n_a 'a' characters and n_b 'b' characters . The question is, to compute, for each k, (1) The total number of occurrences of the substring S_k in any word in the set R (2) The ratio of the value in (1) to the total number of occurrences of any substring S_r in any word in the set R . Let us begin by computing 1. First, a basic fact of combinatorics is that the number of ways of choosing a set of k elements from a larger set of n elements is n choose k , usually written n ( ) k where the parentheses are supposed to contain both values but are hard to draw. I will write this value as choose(n,k) . choose(n,k) is equal to (n!/(k!)(n-k)!) . Here m! , m factorial, is 123...(m-1)m . First of all, the number of words in R is just choose(n,n_a). How many words of R can be formed with the string S_k occurring, say, at the beginning of the word? Well, if n_a is <2 or n_b < (k-2) then obviously the answer is 0. Otherwise, we start the word with S_k.... and we have left over to form the remainder of the word n_a-2 a characters, n_b-(k-2) b characters, and n-k total characters. Hence, the total number of words that can be formed from the remainder is: choose (n-k, n_a-2) Now, the same reasoning holds no matter what place we put S_k. There are n-k+1 possible positions in the string where S_k can begin. Hence, the total number of possibilities for placing S_k, that is the value (1), is: ((n-k+1)choose(n-k,n_a-2) unless n_a<2 or n_b<(k-2), in which case the answer is 0 . For part (2) , we want an analytical expression for the sum of all occurrences of such words S_k. Consider any word of length n . A word S_k, for some k, begins at each position in the word except for positions beginning with the letter 'b' or the last 'a'. Since there are n positions in the word, there are n-(n_b+1)=na-1 words in some S_k that occur in the word. Since there are choose(n,n_a) such words, the total number of S_k occurrences in R is: choose(n,n_a)(n_a-1) The desired frequency for substrings of length k is the quotient of the values in (1) and (2) above, namely: ((n-k+1)choose(n-k,n_a-2))/(choose(n,n_a)(n_a-1)) To check this, I implemented the formula in a small Java program. I ran it with the example you gave, and discovered that I believe the number of "aa" occurrences and the number of "abba" occurrences was miscounted (that's why we have computers, after all). Here is the Java program, followed by some sample runs: public class Prob{ public static long factorial(long i){ if (i<=1) return 1; return ifactorial(--i); } public static long choose(long n, long k){ if (k>n) return 0; if (n==0) return 0; return factorial(n)/(factorial(k)factorial(n-k)); } public static long count(long substringlen, long nas, long nbs){ if (nas<2\|\|nbs<(substringlen-2))return 0; long stringlen=nas+nbs; return (stringlen-substringlen+1)choose(stringlen-substringlen,nas-2); } public static long counttotal(long nas, long nbs){ long stringlen=nas+nbs; return choose(stringlen,nas)*(nas-1); } public static void main(String[] args){ int stringlength=Integer.parseInt(args[0]); int nas=Integer.parseInt(args[1]); int nbs=stringlength-nas; long counttotal=counttotal(nas,nbs); long sum=0; //check that the sums agree for (int substringlen=2;substringlen<=stringlength;++substringlen){ String result="a"; for (int i=0;i<substringlen-2;i++) result+="b"; result+="a"; long count = count(substringlen,nas,nbs); System.out.println(result+": "+count+"/"+counttotal); sum+=count; } System.out.println("\nCounted "+sum+" strings; expected: "+counttotal); } } java Prob 6 3 aa: 20/40 aba: 12/40 abba: 6/40 abbba: 2/40 abbbba: 0/40 java Prob 10 3 aa: 72/240 aba: 56/240 abba: 42/240 abbba: 30/240 abbbba: 20/240 abbbbba: 12/240 abbbbbba: 6/240 abbbbbbba: 2/240 abbbbbbbba: 0/240 Counted 240 strings; expected: 240 SEARCH STRATEGY --------------- An internet search on substrings and substring occurrences was unsuccessful. At first I thought an analytical solution would not be possible. Then I realized the counting argument above yields the solution.
Clarification of Answer by rbnn-ga on 01 Nov 2002 12:02 PST Oops! My definition of S_k was wrong. It should read "let S_k denote the string ab^{k-2}a, that is, 'a' followed by (k-2) b's followed by a .
Request for Answer Clarification by octava-ga on 01 Nov 2002 22:45 PST Thank you for your answer: since you tested it, I can see that it is correct. But before I officially rate your answer I wish to understand your analysis, and that will take me some time. You are right that I miscounted the occurrence of "aa" and "abba". You have done a great job! Now is time for me to put some effort. If I have doubts I will reach you. Thanks!
Clarification of Answer by rbnn-ga on 02 Nov 2002 01:54 PST Thank you for the comment; once again, if there is any place in particular that seems especially opaque, please just let me know. My previous job actually involved analysis of genomes, which often boiled down to looking at all words and counting various things (although the words there had four letters, AGCT, now just a and b, typically). By the way, the Java program will overflow for even moderate values of n. I could write a version without overflow if you wanted to test some fairly high n; I didn't because I wrote it primarily just for the sake of clarifying and testing the existing algorithm.
Request for Answer Clarification by octava-ga on 02 Nov 2002 07:23 PST I've finally understood your analysis. I'm a molecular biologist turned doing bioinformatics. Unfortunately I haven't had any formal training in math. The problem you just solved for me is part of the null model for the distribution of the distances between amino acid pairs in proteins. To get the final answer I just need to consider that symbol "a" is actually composed of two other symbols (for instance A and G - ala and gly). I think I can figure out that myself. If not, that might be my next Google question If this becomes part of a publication, I will be glad to acknowledge your participation there. (Just tell me how). Thanks again!
Clarification of Answer by rbnn-ga on 02 Nov 2002 11:53 PST Thank you for your comments. Symmetrically, as a computer scientist with a math background but no biology, I often had great difficulty understanding the biology used by my coworkers. In any case, it is not necessary to acknowledge google answers; if you would like to, you can cite rbnn-ga if it is published. If you need more detailed information, just email to answers-editors@google.com when the paper is accepted.

octava-ga rated this answer: 5 out of 5 stars

and gave an additional tip of: $10.00

The answer was exactly what I needed. The approach was clever. The
explanation was a very welcome lesson in probability thinking. The
java program is another bonus. Many Thanks. Belive me, I wish I could
tip you as you deserve, but I lack the budget for that.

Comments

There are no comments at this time.

Important Disclaimer: Answers and comments provided on Google Answers are general information, and are not intended to substitute for informed professional medical, psychiatric, psychological, tax, legal, investment, accounting, or other professional advice. Google does not endorse, and expressly disclaims liability for any product, manufacturer, distributor, service or service provider mentioned or any opinion expressed in answers or comments. Please read carefully the Google Answers Terms of Service.

If you feel that you have found inappropriate content, please let us know by emailing us at answers-support@google.com with the question ID listed above. Thank you.

Search Google Answers for

Google Home - Answers FAQ - Terms of Service - Privacy Policy