View Question
Q: which is the probability distribution ( Answered ,   0 Comments )
 Question
 Subject: which is the probability distribution Category: Science > Math Asked by: octava-ga List Price: \$15.00 Posted: 01 Nov 2002 09:01 PST Expires: 01 Dec 2002 09:01 PST Question ID: 95478
 ```In a series of n random events, outcome "a" occurs exactly m times, outcome "b" occurs exactly n-m times. I want to know the frequency distribution of the occurrence these patterns in that series: aa aba abba ... a[b]n-2a Of course f(aa)+ f(aba) + f(abba)+ ... + f(a[b]n-2a) adds up to 1. For instance, if n =6 and m=3, these are all the possible series: a a a b b b a a b a b b a a b b a b a a b b b a a b a a b b a b a b a b a b a b b a a b b a a b a b b a b a a b b b a a b a a a b b b a a b a b b a a b b a b a b a a b b a b a b a b a b b a a b b a a a b b b a a b a b b a b a a b b b a a a and the occurrence of these patterns are: aa appears 19 times, so its freq is 19/40 aba appears 12 times, its freq is 12/40 abba appears 7 times, its freq is 7/40 abbba appears 2 times, its freq is 2/40 I can estimate the distribution "experimentally" with a perl script, but I know there has to be a relatively simple analytical solution. However my skill in probability are insufficient.``` Request for Question Clarification by rbnn-ga on 01 Nov 2002 09:21 PST ```What do you mean by "know" in the phrase "know the frequency"? I can give a dynamic programming solution in Java that solves the problem in time polynomial in the inputs (I think). Would that be sufficient?``` Clarification of Question by octava-ga on 01 Nov 2002 11:35 PST ```I was hoping for an analytical solution: a formula, which I assume would be simple. However, if your solution takes as input the values m and n, and can produce the frequencies of all posible patterns with 8 digit precision in reasonable time (when n is in the range 100:500 and m is in the range 1:100), it would be enough for me.```
 ```Please ask for clarification using the "request clarification" button if you have any questions about the answer. Let me begin by rephrasing your question a bit. We are given as input positive integers n, n_a and n_b such that n_a+n_b= n. I've just changed your "m" to "n_a" since it is easier for me to remember what it means that way. For each k, let S_k denote the string ab^ka , that is, 'a' followed k 'b's followed by a. Let R be the set of words using the characters 'a' and 'b' of length n that contain exactly n_a 'a' characters and n_b 'b' characters . The question is, to compute, for each k, (1) The total number of occurrences of the substring S_k in any word in the set R (2) The ratio of the value in (1) to the total number of occurrences of any substring S_r in any word in the set R . Let us begin by computing 1. First, a basic fact of combinatorics is that the number of ways of choosing a set of k elements from a larger set of n elements is n choose k , usually written n ( ) k where the parentheses are supposed to contain both values but are hard to draw. I will write this value as choose(n,k) . choose(n,k) is equal to (n!/(k!)(n-k)!) . Here m! , m factorial, is 1*2*3*...*(m-1)*m . First of all, the number of words in R is just choose(n,n_a). How many words of R can be formed with the string S_k occurring, say, at the beginning of the word? Well, if n_a is <2 or n_b < (k-2) then obviously the answer is 0. Otherwise, we start the word with S_k.... and we have left over to form the remainder of the word n_a-2 a characters, n_b-(k-2) b characters, and n-k total characters. Hence, the total number of words that can be formed from the remainder is: choose (n-k, n_a-2) Now, the same reasoning holds no matter what place we put S_k. There are n-k+1 possible positions in the string where S_k can begin. Hence, the total number of possibilities for placing S_k, that is the value (1), is: ((n-k+1)*choose(n-k,n_a-2) unless n_a<2 or n_b<(k-2), in which case the answer is 0 . For part (2) , we want an analytical expression for the sum of all occurrences of such words S_k. Consider any word of length n . A word S_k, for some k, begins at each position in the word except for positions beginning with the letter 'b' or the last 'a'. Since there are n positions in the word, there are n-(n_b+1)=na-1 words in some S_k that occur in the word. Since there are choose(n,n_a) such words, the total number of S_k occurrences in R is: choose(n,n_a)*(n_a-1) The desired frequency for substrings of length k is the quotient of the values in (1) and (2) above, namely: ((n-k+1)*choose(n-k,n_a-2))/(choose(n,n_a)*(n_a-1)) To check this, I implemented the formula in a small Java program. I ran it with the example you gave, and discovered that I believe the number of "aa" occurrences and the number of "abba" occurrences was miscounted (that's why we have computers, after all). Here is the Java program, followed by some sample runs: public class Prob{ public static long factorial(long i){ if (i<=1) return 1; return i*factorial(--i); } public static long choose(long n, long k){ if (k>n) return 0; if (n==0) return 0; return factorial(n)/(factorial(k)*factorial(n-k)); } public static long count(long substringlen, long nas, long nbs){ if (nas<2||nbs<(substringlen-2))return 0; long stringlen=nas+nbs; return (stringlen-substringlen+1)*choose(stringlen-substringlen,nas-2); } public static long counttotal(long nas, long nbs){ long stringlen=nas+nbs; return choose(stringlen,nas)*(nas-1); } public static void main(String[] args){ int stringlength=Integer.parseInt(args[0]); int nas=Integer.parseInt(args[1]); int nbs=stringlength-nas; long counttotal=counttotal(nas,nbs); long sum=0; //check that the sums agree for (int substringlen=2;substringlen<=stringlength;++substringlen){ String result="a"; for (int i=0;i
 octava-ga rated this answer: and gave an additional tip of: \$10.00 ```The answer was exactly what I needed. The approach was clever. The explanation was a very welcome lesson in probability thinking. The java program is another bonus. Many Thanks. Belive me, I wish I could tip you as you deserve, but I lack the budget for that.```