

Subject:
Probability of a specific string in a random number
Category: Science > Math Asked by: tpresteroga List Price: $3.00 
Posted:
11 Apr 2005 14:31 PDT
Expires: 11 May 2005 14:31 PDT Question ID: 508010 
I've been thinking about Jorge Luis Borge's "Library of Babel", whose shelves contain all the possible combinations of the twentyodd orthographic symbols (whose number, though vast, is not infinite); that is, everything which can be expressed in all languages. I love this idea, and ever since I've wondered how likely it is that one could find a string of intelligible text in a random series of sufficient length. According to Richard Preston in the New Yorker article "Mountains of Pi" (02 Feb 1992), assuming you have a system for converting the digits of the transcendental number pi to letters: No book and none but the shortest poems will ever be seen in pi, since it is infinitesimally unlikely that even as brief a text as an English sonnet will appear in the first 10^77 digits of pi, which is the longest piece of pi that can be calculated in this universe. Shakespeare?s sonnet ?Shall I Compare Thee to a Summer?s Day? has less than a thousand, or 10^3, characters. What is, in fact, the probability of finding this sonnet in all the knowable digits of pi? More generally, what is the mathematical expression for the probability of finding a specific sequence that is X characters long, made up from an alphabet of N characters, within a perfectly random string that is itself Y characters long (assuming Y>X)? Please explain any special or nonalgebraic notation. 

Subject:
Re: Probability of a specific string in a random number
Answered By: mathtalkga on 11 May 2005 14:13 PDT Rated: 
Hi, tpresteroga: Let's begin with some exact computations that explore the theme of how many "independent" trials are equivalent to the chance for a match presented by a continuous string of symbols. Assume an alphabet of R > 1 symbols, and some discrete probability distribution for them. Given a string of length n in this alphabet: a_1, a_2, ... , a_n the chance of an exact match to a random sample of length n is: PRODUCT Pr(a_i) FOR i = 1 to n If we assume a uniform distribution for symbols, this becomes R^n. We can model (exactly) the chance of finding a match in strings of length n or more by using a Markov chain approach. Let state k of our process reflect the length of a maximum matching (leading) substring preceding a point in the sampling process. It may be shown that no more than this needs to be "remembered" in tracking progress toward reaching a full match, which if found is an "absorbing state" in Markov chain parlance, as we only care for whether a full match is found (e.g. not for how many such matches may exist beyond the first). To illustrate the idea let's specialize to binary strings & n = 4. If the string is "0000" (or equivalently, "1111") and 0's occur with equal probability as 1's, then the probabilty transition matrix is: / 1/2 1/2 0 0 0 \  1/2 0 1/2 0 0  M =  1/2 0 0 1/2 0   1/2 0 0 0 1/2  \ 0 0 0 0 1 / Taking powers of M, the upper right hand corner entry M^m(1,5) represents the likelihood of a match being found within a random string of length m. The first time this entry becomes nonzero is when m = 4, since we need at least four bits to complete a full match: / 1/2 1/4 1/4 0 0 \  1/2 1/4 0 1/4 0  M^2 =  1/2 1/4 0 0 1/4   1/4 1/4 0 0 1/2  \ 0 0 0 0 1 / / 1/2 1/4 1/8 1/16 1/16\  7/16 1/4 1/8 1/16 1/8  M^4 =  3/8 3/16 1/8 1/16 1/4   1/4 1/8 1/16 1/16 1/2  \ 0 0 0 0 1 / Thus we've shown in part that the chance of a full match to 4 bits out of 4 is 1/16. But consecutive squarings of these probability transition matrices effectively double the number of bits in a random sample with each pass. After the seventh doubling (128 bits) we find the chance of a full match is: 21061092471069230800519058043599728801  21267647932558653966460912964485513216 which is approximately 0.99029. However I wish to emphasize a comparison between the length of such an extended string and the number of independent 4 bit samples that give an equivalent chance of a full match: length of # independent "0000" string  3 4bit trials ratio    1 1.0 5 3.21729350 0.6434587 13 7.78696344 0.5989972 29 16.93280260 0.5838897 61 35.22448982 0.5774507 125 71.80786426 0.5744629 From this it seems the "effective" number of independent trials contained in a string is a fraction of the serially overlapping substrings of length 4. We can repeat the calculations with various other 4bit patterns, say "0110": / 1/2 1/2 0 0 0 \  0 1/2 1/2 0 0  M =  0 1/2 0 1/2 0   1/2 0 0 0 1/2  \ 0 0 0 0 1 / 1/16 75/256 5.3716820141376983206583696743202 [ 2427 ] [  ] 13.910858619517722096894705651687 [ 4096 ] [ 464276391 ] [  ] 31.002697185232184486661818689731 [ 536870912 ] [ 18172063231918459161 ] [  ] 65.186391606117529572124948760419 [ 18446744073709551616 ] [ 340220920378975731900102642850312417515 ] [  ] [ 340282366920938463463374607431768211456 ] 133.55378044791614328781588961545 Remarkably the chance of a full match with this 4bit pattern turns out to be slightly better in an extended string than the repeated independent trials model would predict: length of # independent "0110" string  3 4bit trials ratio    1 1.0 5 5.37168201 1.0743364 13 13.91085862 1.0700660 29 31.00269719 1.0690585 61 65.18639161 1.0686294 125 133.55378045 1.0684302 It would be stretching to conclude anything more than that the method of calculation proposed here allows a numerical comparison of the chance of finding a pattern at least once in an extended string vs. the chance of finding at least once in a series of independent trials. Note however that calculated ratios seems to be monotone decreasing sequences, which would imply they converge to a limit, apparently one that depends on the pattern. Applying this "exact" computation technique will require formulating a transition matrix whose order is the length of the pattern plus 1. We will continue the discussion next with a longer example based on the title of Shakespeare's sonnet, taking up a computation similar to one proposed by volterwdga but based on the exact probability of at least one match to 39 characters (including punctuation) in a string of variable length. regards, mathtalkga  

tpresteroga
rated this answer:
Mathtalkga has done a great job of engaging other experts in the field who'd already left comments. The answer, from my perspective as a nonwhiz in probability, was a bit hard to follow ("probability transition matrix"?) but my question was worded rather ambiguously. I'll be more clear in the future! 

Subject:
Re: Probability of a specific string in a random number
From: pinkfreudga on 11 Apr 2005 14:40 PDT 
You might find this to be interesting: http://pi.nersc.gov/ 
Subject:
Re: Probability of a specific string in a random number
From: willcodeforfoodga on 11 Apr 2005 17:03 PDT 
Assuming truly random distribution, the chances of finding your search string at any given position within the text (probability for each trial) is: 1/N^X You will only be able to search at starting positions from: 1 to 1+YX so the overall probability of finding your search string is determined using the binomial probability formula summarized here: [ http://www.mathwords.com/b/binomial_probability_formula.htm ] You number of trials can be anything you like, but 10^77 is probably a sufficient number to return a decent chance of finding your sonnet. How you map the digits of pi to letters will affect the overall odds of finding your search string at any given location. If you map each two consecutive digits to a single letter (or character) you'll have an alphabet of 100 characters. This is sufficient to write your sonnet. Since your string is 38 characters wide, you'd need 76 consecutive digits to find the sonnet. Your chances of finding the sonnet at any given location is 1 / 10^76. Use the binomial probability formula to calculate the overall odds. I'm guessing you're going to find it :) 
Subject:
Re: Probability of a specific string in a random number
From: bozo99ga on 11 Apr 2005 17:10 PDT 
First imagine searching for 1000 particular decimal digits in a 10^77 extract of pi. There are (10^77  10^3) starting points (approx 10^77) and the probability of the starting point matching your first digit is 0.1, etc and the probability of matching your entire string _at one place_ is 0.1 ^ 1000. So the probability of finding your 1000digit string _anywhere_ arises from 10^77 trials each with p=10^1000. Using the usual q=1p to invert success you get q=110^1000. Notice how this is 1 minus a very small number. The probability of failing to find our desired string at all is q^10^77 = q^770. Using a binomial expansion of that (1p)^770 and taking only the first 2 terms we say the probability of success is 770x10^1000 or 7.7x10^998. (Crude, but that's the way to do some stuff.) You have to define the alphabet you are interested in. Suppose this is 26 letters, space and newline. No punctuation, and case insensitive. You could make other choices here but that fixes your alphabet at 28 chars, where 28 is obviously different from 10 (as in the 10 digits). You'd then have to consider that your 1000char sonnet has 28^1000 possible values and is equivalent to 1447 decimal digits. So now you've got p=10^1447 and still about 10^77 trials. That makes my final estimate 770x10^1447 or 7.70x10^1445. 
Subject:
Re: Probability of a specific string in a random number
From: volterwdga on 11 Apr 2005 22:07 PDT 
Actually you could treat this as a trial from a geometric distribution. A success is if you get the desired sequence. Do it like this... i have Y > X numbers. Thus i have Y  X = Z possible sequences. for example if my total Y numbers were 7 and X was of length 3 Say my sequence was 8098908 desired sequence 000 then i have 809 as my first potential 098 as my second potential... and so on until 908 is my last potential Then the we check the first sequence if it is good we have one if not we check the second possible one... if its good we have one... if not we check the third and so on. If we have none... we have no sequences... Lets check the probability of this. basically we need to fail our series Z = Y  X times. So we have the probability of failure (not at least one) as (11/N^X)^(YX) So the probability of getting at least one sequence in the series is 1(11/N^X)^(YX) Which is an absurd number to calculate :P However we can calculate the mean of the geometric distribution as (11/N^X)/(1/N^X) + 1 = N^X So thats how many wed expect it to take to get one series well that many sequences... so we would need N^X + X  1 actual symbols You could use the geometric distribution to generate the desired quantiles. 
Subject:
Re: Probability of a specific string in a random number
From: hyphengaga on 12 Apr 2005 12:41 PDT 
Based on the above information, I think "Snooker" would be a cute name for a ferret. 
Subject:
Re: Probability of a specific string in a random number
From: mathtalkga on 12 Apr 2005 17:48 PDT 
The website that pinkfreudga links to above unfortunately doesn't locate the first matching "substring" of bits in the fractional part of pi. In fact the search seems to pad the requested pattern with a considerable number of zeroes (in hex). For example the first 4 hex "digits" after the "radix point" in pi would be 243F, but their search engine finds this bitstring way downstream. The "geometric probability" approach outlined by volterwdga and other Commenters correctly gives the probability of a single "random" string of length equal to the requested pattern. However the consecutive substrings of this length within a longer string are not independent "Bernoulli" trials, because the consecutive substrings overlap. This has the effect of reducing the actual "chances" there are for a substring match in a manner that depends on the substring sought. The effect is no worse than dividing the length of the target by the length of the substring pattern however, so this approach does give rigorous upper and lower bounds on the chances of finding at least one match. To give ourselves maximum flexibility, let's suppose that the Shakespearean sonnet is a pattern of 8,000 bits. The putative string of 10^77 digits of pi is then roughly 3.3E+77 bits expressed in binary. While this might seem like a wonderfully long search region within which to locate a poem and "lost works" from Borge's Library of Babel, in fact it is woefully inadequate to hope that ?Shall I Compare Thee to a Summer?s Day? will be found there. Since the chance of any _one_ stretch of 8,000 random bits will match our poem is: p = 2^8000 ~ 5.75E2409 Now the effective number of "independent trials" or chances to match this pattern is somewhere between the upper bound of 3.3E+77 and that figure divided by 8000, ie. 4.1E+74. The chance of no matches at all in N independent trials is: Pr(no match) = (1  p)^N Since for tiny values of p we have an excellent approximation: (1  p)^(1/p) ~ 1/e where e = 2.718281828... is the base of natural logarithms, we can say: Pr(no match) = (1  p)^(Np/p) ~ e^(Np) Depending on whether we use the upper bound on N: N = 3.3E+77, Pr(no match) ~ e^(1.90E2331) or the lower bound on N: N = 4.1E+74, Pr(no match) ~ e^(2.36E2334) Now these exponents are very close to zero (one might say immeasurably close!), and the slope of the exponential function (which has value 1 when the exponent is exactly zero) here is about 1. Therefore: Pr(at least one match) = 1  Pr(no match) 2.36E2334 <= Pr(at least one match) <= 1.90E2331 I really cannot imagine just how unlikely such a match is. The Infinite Improbability Drive of Douglas Adams' Hitchhikers Guide to the Universe comes to mind: [Infinite Improbability Drive] http://technovelgy.com/ct/content.asp?Bnum=134 regards, mathtalkga 
Subject:
Re: Probability of a specific string in a random number
From: volterwdga on 13 Apr 2005 22:39 PDT 
Thanks for pointing out my oversight... id like to add some comments someone said How you map the digits of pi to letters will affect the overall odds Actually... you can avoid this by converting pi to a base N representation and work out the corresponding cut off point in base N that corresponds to the cutoff in base 10... ie 10^77 digits in base 10 is 10^(77/log10(N)) digits in base N Going back the to geometric method... if we consider the overlapping sequences to be independent... we get the expected value as i had mentioned of N^X + X  1 ~ N^X if we assume the overlapping sequences to be fully dependent then the expected value is XN^X so the expected wait would be N^X <= N^X + X  1 < mu < XN^X now i would suggest using N = 26 (only letters no punctuation or spaces) because one could obtain the meaning from that. That would give me X = 30. So my expected wait would be about N^X = 10^(Xlog10(N)) = 10^42.4492 XN^X = 10^(log10(X)+Xlog10(N))=10^43.92632 However if i want to include all punctuation and spaces then X = 38 and N will be say 50 our bounds would be 10^64.56086 and 10^66.14064 Mind you this would be the number of base N digits... To convert to base 10 digits we just multiply the exponent by log10(N) and that will give the bounds as N=26 X=30 10^60.06449 < mu < 10^62.15457 N=50 X=38 10^109.6870 < mu < 10^112.3710 So if we use no spaces or punctuation then its almost certain to find that sequence. However if we use punctuation and spaces (not even accounting for capitols) then its absurd to expect to find it. BTW i used expected value not probability because it gives you an order of how long you need to wait. 
Subject:
Re: Probability of a specific string in a random number
From: racecarga on 18 Apr 2005 22:28 PDT 
mathtalk can you explain briefly why it's more likely to find a given sequence in N independent substrings than in N overlapping substrings? 
Subject:
Re: Probability of a specific string in a random number
From: mathtalkga on 20 Apr 2005 08:39 PDT 
Hi, racecarga: The events of two overlapping strings both matching a pattern are not independent. This motivated my use of N/m independent trials to get a lower bound on the probability of matching a string of length m anywhere in a run of length N. But depending on the pattern, it might actually be (slightly) more or less probable to find a match among the Nm+1 overlapping consecutive substrings of length m > 1 in a run of length N, than to find a match among Nm+1 independent such strings! Here's a simple case, the two consecutive binary strings of length 2 that are found overlapping in a run of 3 bits. If the pattern is 00 or 11, then the chance of matching someplace in the run of 3 bits is 3/8ths. If the pattern is 01 or 10, then the chance of matching either the front pair or the back pair is 4/8ths (it being impossible to have a match in both pairs with the latter patterns). However for two independent binary strings of length 2, the chance of at least one match is 7/16ths. This is the same for all four patterns. If we average out the chances of a front or back pair matching over _all_ four patterns (times the eight possible 3 bit runs), we see that the 3/8ths and the 4/8ths cases average out to 7/16ths, and this is a general circumstance. For a nonconstant pattern (01 or 10) the events of matching the front pair and back pair are mutually exclusive, but by essentially the same token it becomes more likely, given that the front pair match fails, that the back pair match will succeed. regards, mathtalkga 
Subject:
Re: Probability of a specific string in a random number
From: racecarga on 20 Apr 2005 22:59 PDT 
Thanks for the detailed reply. I have a hunch that in the limit of long substrings, the difference vanishes. My guess is that the probability of finding a given sequence of 1000 digits in a random string of 10^77 digits and the probability of finding a match among 10^77 (999 as if that matters)independent random 1000 digit strings are the same to more decimal places than you could count without losing track. I don't have any proof though. 
Subject:
Re: Probability of a specific string in a random number
From: tpresteroga on 21 Apr 2005 12:25 PDT 
Great comments, all. I guess the overwhelming majority of the Librarians in Borges's story are surrounded by nonsense. What if we turn the question aroundhow many digits of pi would I need to have a good (ie 99.9%) chance of finding Shakespeare's sonnet? 
Subject:
Re: Probability of a specific string in a random number
From: mathtalkga on 22 Apr 2005 07:30 PDT 
One vastly expands the number of opportunities if a "stride" between consecutive characters greater than 1 is allowed, cum the "Bible codes" controversy (in which future names and events are purported). [Bible Code Digest] http://www.biblecodedigest.com/ A serious attempt to frame these ideas in a scientific study "was made by three Israelis: Doron Witztum, Eliyahu Rips, and Yoav Rosenberg (WRR)... WRR's 'experiment' appeared in the respectable journal Statistical Science in 1994." However their statistical methodology turned out to be flawed: [Scientific Refutation of the Bible Codes] http://cs.anu.edu.au/~bdm/dilugim/torah.html The number of opportunities to locate a particular string increases roughly by squaring. regards, mathtalkga 
Subject:
Re: Probability of a specific string in a random number
From: racecarga on 12 May 2005 16:36 PDT 
Very enlightening Answer. Repeated squaring of the prob. trans. matrix is an amazingly efficient way to calculate probabilities for long strings. I admit that I was sort of surprised at the results. I still think though that with a large alphabet and a long sequence made of a pretty even mix of the members of the alphabet, the effective number of independent trials approaches the length of the string. Basically, it seems like with a quasirandom sequence, knowing that A(n):A(n+L1) is not a match doesn't really change the probability that A(n+1):A(n+L) is a match. [A is the string (library) and L is the length of the sequence]. However, if you're looking for the sequence 555555555555555555555555555555555555555555555 in the digits of pi, knowing that A(n):A(n+L1) is not a match makes it significantly less probable that A(n+1):A(n+L) is a match (significantly, I mean, when you look at the cumulative effectthat is, the effective number of independent trials). One way to see this is that if you know that A(n):A(n+L1) IS a match (they're all 5's), then the probability that A(n+1):A(n+L) is a match is 10^1 instead of 10^40 or whatever for an independent trial. If knowing the first subtring is a match increases likelihood that the second substring is a match, knowing the first one is not a match must decrease the likelihood that the second one is a match. Awaiting next installment... 
If you feel that you have found inappropriate content, please let us know by emailing us at answerssupport@google.com with the question ID listed above. Thank you. 
Search Google Answers for 
Google Home  Answers FAQ  Terms of Service  Privacy Policy 