Google Answers: Probability of a specific string in a random number

View Question

Q: Probability of a specific string in a random number ( Answered 3 out of 5 stars

Question

Subject: Probability of a specific string in a random number
Category: Science > Math
Asked by: tprestero-ga
List Price: $3.00

Posted: 11 Apr 2005 14:31 PDT
Expires: 11 May 2005 14:31 PDT
Question ID: 508010

I've been thinking about Jorge Luis Borge's "Library of Babel", whose

  shelves contain all the possible
  combinations of the twenty-odd
  orthographic symbols (whose number,
  though vast, is not infinite); that is,
  everything which can be expressed in all
  languages.

I love this idea, and ever since I've wondered how likely it is that
one could find a string of intelligible text in a random series of
sufficient length.  According to Richard Preston in the New Yorker
article "Mountains of Pi" (02 Feb 1992), assuming you have a system
for converting the digits of the transcendental number pi to letters:

  No book and none but the shortest poems
  will ever be seen in pi, since it is
  infinitesimally unlikely that even as
  brief a text as an English sonnet will
  appear in the first 10^77 digits of pi,
  which is the longest piece of pi that
  can be calculated in this universe.

Shakespeare?s sonnet ?Shall I Compare Thee to a Summer?s Day? has less
than a thousand, or 10^3, characters.  What is, in fact, the
probability of finding this sonnet in all the knowable digits of pi?

More generally, what is the mathematical expression for the
probability of finding a specific sequence that is X characters long,
made up from an alphabet of N characters, within a perfectly random
string that is itself Y characters long (assuming Y>X)?  Please
explain any special or non-algebraic notation.

Answer

Subject: Re: Probability of a specific string in a random number
Answered By: mathtalk-ga on 11 May 2005 14:13 PDT
Rated: 3 out of 5 stars

Hi, tprestero-ga:

Let's begin with some exact computations that explore the theme of
how many "independent" trials are equivalent to the chance for a
match presented by a continuous string of symbols.

Assume an alphabet of R > 1 symbols, and some discrete probability
distribution for them.  Given a string of length n in this alphabet:

    a_1, a_2, ... , a_n

the chance of an exact match to a random sample of length n is:

    PRODUCT Pr(a_i)  FOR i = 1 to n

If we assume a uniform distribution for symbols, this becomes R^-n.

We can model (exactly) the chance of finding a match in strings of
length n or more by using a Markov chain approach.

Let state k of our process reflect the length of a maximum matching
(leading) substring preceding a point in the sampling process.  It
may be shown that no more than this needs to be "remembered" in
tracking progress toward reaching a full match, which if found is
an "absorbing state" in Markov chain parlance, as we only care for
whether a full match is found (e.g. not for how many such matches
may exist beyond the first).

To illustrate the idea let's specialize to binary strings & n = 4.

If the string is "0000" (or equivalently, "1111") and 0's occur with
equal probability as 1's, then the probabilty transition matrix is:

        / 1/2  1/2   0    0    0  \
        | 1/2   0   1/2   0    0  |
   M =  | 1/2   0    0   1/2   0  |
        | 1/2   0    0    0   1/2 |
        \  0    0    0    0    1  /

Taking powers of M, the upper right hand corner entry M^m(1,5)
represents the likelihood of a match being found within a random
string of length m.  The first time this entry becomes nonzero is
when m = 4, since we need at least four bits to complete a full
match:

        / 1/2  1/4  1/4   0    0  \
        | 1/2  1/4   0   1/4   0  |
  M^2 = | 1/2  1/4   0    0   1/4 |
        | 1/4  1/4   0    0   1/2 |
        \  0    0    0    0    1  /

        / 1/2  1/4  1/8  1/16 1/16\
        | 7/16 1/4  1/8  1/16 1/8 |
  M^4 = | 3/8  3/16 1/8  1/16 1/4 |
        | 1/4  1/8  1/16 1/16 1/2 |
        \  0    0    0    0    1  /

Thus we've shown in part that the chance of a full match to 4
bits out of 4 is 1/16.  But consecutive squarings of these
probability transition matrices effectively double the number
of bits in a random sample with each pass.

After the seventh doubling (128 bits) we find the chance of a
full match is:

          21061092471069230800519058043599728801
          --------------------------------------
          21267647932558653966460912964485513216

which is approximately 0.99029.

However I wish to emphasize a comparison between the length
of such an extended string and the number of independent 4 bit
samples that give an equivalent chance of a full match:

   length of      # independent         "0000"
   string - 3     4-bit trials          ratio
---------------- -----------------   -----------
        1          1.0
        5          3.21729350         0.6434587
       13          7.78696344         0.5989972
       29         16.93280260         0.5838897
       61         35.22448982         0.5774507
      125         71.80786426         0.5744629

From this it seems the "effective" number of independent trials
contained in a string is a fraction of the serially overlapping
substrings of length 4.

We can repeat the calculations with various other 4-bit patterns,
say "0110":

        / 1/2  1/2   0    0    0  \
        |  0   1/2  1/2   0    0  |
   M =  |  0   1/2   0   1/2   0  |
        | 1/2   0    0    0   1/2 |
        \  0    0    0    0    1  /

1/16

75/256                          5.3716820141376983206583696743202

[    2427     ]
[    ----     ]                13.910858619517722096894705651687
[    4096     ]


[  464276391  ]
[  ---------  ]                31.002697185232184486661818689731
[  536870912  ]


         [ 18172063231918459161 ]
         [ -------------------- ]  65.186391606117529572124948760419
         [ 18446744073709551616 ]


         [ 340220920378975731900102642850312417515 ]
         [ --------------------------------------- ]
         [ 340282366920938463463374607431768211456 ]

                                  133.55378044791614328781588961545


Remarkably the chance of a full match with this 4-bit pattern turns
out to be slightly better in an extended string than the repeated
independent trials model would predict:

   length of      # independent         "0110"
   string - 3     4-bit trials          ratio
---------------- -----------------   -----------
        1          1.0
        5          5.37168201         1.0743364
       13         13.91085862         1.0700660
       29         31.00269719         1.0690585
       61         65.18639161         1.0686294
      125        133.55378045         1.0684302

It would be stretching to conclude anything more than that the method
of calculation proposed here allows a numerical comparison of the 
chance of finding a pattern at least once in an extended string vs.
the chance of finding at least once in a series of independent trials.
Note however that calculated ratios seems to be monotone decreasing 
sequences, which would imply they converge to a limit, apparently one
that depends on the pattern.

Applying this "exact" computation technique will require formulating
a transition matrix whose order is the length of the pattern plus 1.

We will continue the discussion next with a longer example based on
the title of Shakespeare's sonnet, taking up a computation similar to
one proposed by volterwd-ga but based on the exact probability of at
least one match to 39 characters (including punctuation) in a string
of variable length.


regards, mathtalk-ga

Clarification of Answer by mathtalk-ga on 25 Jul 2005 23:15 PDT

Hi, tprestero-ga:

I regret that my attempt to provide "the mathematical expression for the
probability of finding a specific sequence that is X characters long,
made up from an alphabet of N characters, within a perfectly random
string that is itself Y characters long" fell short of your
satisfaction.

In a moment I will attempt to summarize all the aspects of this
Question which were addressed here, potentially to mitigate my
improperly assuming familiarity with the Comments portion of this
thread in my prepared Answer.  But I also would like to acknowledge a
couple of options that are open to you.

First would be to Request a Clarification of something you would like
to understand better.  Pin down for me what is and isn't clear about
the mathematics & computations, and I'll be happy to do what I can to
advance your ability to use these methods.

A second option would be, assuming you are still not satisfied by such
additional information, to consider applying for a Refund.  Google
Answers guarantees your satisfaction within the first 30 days of an
Answer being posted.  While you have waited more than two months to
respond, I have a feeling that the Editors would rather address your
concerns than have them go unnoticed.

See here, toward the end of this page:

[Google Answers: Help & Tips]
http://answers.google.com/answers/help.html

Summary
=======

You began with a mention of Jorge Luis Borge's "Library of Babel", in
which volumes contain arrangements of "the twenty-odd orthographic
symbols," then proceeded to site a claim from Richard Preston's
"Mountains of Pi" about the remote likelihood of finding the text of
an English sonnet within the 10^77 or so digits of pi that one might
reasonably encode using the entire visible universe.

The final paragraph of your original Question asks "generally" how to
solve such problems, subject to explaining "any special or
non-algebraic notation."

Commenter willcodeforfood-ga first points out, justifiably, that a
basic approach to these computations would be to apply a method of
independent Bernoulli or "binomial" trials.  However the intuition
expressed that "10^77 is probably a sufficient number to return a
decent chance of finding your sonnet" proves literally incorrect.

A definite calculation requires at a bare minimum the data you termed:

   X := length of a search string

   N := size of the alphabet in symbols

   Y := length of a random string

Clearly when Y = X, the convenient if not always realistic assumption
that all symbols are equally likely leads to:

   Probability of match  =  (1/N)^X  = N^(-X)

When Y > X, the probability of a match is better due to the increasing
number of possible starting locations for the search string within the
random string.  If we consider all the Y - X + 1 starting locations as
giving an independent trial, an over-estimate of the probability is
obtained:

   Probability of match  <  1 - (1 - N^(-X))^(Y - X + 1)

On the other hand, cutting the random string of length Y into
approximately Y/X separate pieces of length X would give an
under-estimate of the probability:

   Probability of match  >  1 - (1 - N^(-X))^(Y/X)

My Comment of April 12, 2005 applies these formulas, which had to a
great extent already been described by Commenters.  Using a bitwise
representation:

   X = 8000

   N = 2

   Y = 3.3E77

I calculated that:

   2.36E-2334 <= Probability of a match <= 1.90E-2331

and detailed the methods I used to evaluate the limiting expression
here, since they are mind-boggling small (but nonzero!) chances.

A Comment by racecar-ga resulted in my explaining, still as a free
Comment, why I treated the probability as a range instead of just
asserting the upper bound as an accurate estimate.  The issue I tried
to outline is how the patterns of the search string affect the
effective "number of independent trials" presented by the
overlapping/consecutive substrings of length X within a longer random
string of length Y.

I read your Comment of April 21, 2005 as acceptance of the basic
methodology of computations in terms of some number of independent
trials, and I responded on April 22, 2005 with a presentation of links
on an alternative problem interpretation which vastly enlarges the
number of possible matches (varying the "stride" between symbols as
well as the starting locations).

My conceptual framework to "turn the question around" and determine
how many digits are required to promise a very likely match, e.g. at
the 99.9% level, is to ask how many independent trials are needed to
reach this level, then ask what length Y would "effectively" give that
number of trials.

Recall that for M independent trials:

  Probability of a match  =  1 - (1 - N^(-X))^M

which is accurately approximated by 1 - e^(-M * N^(-X)).  Therefore solving:

  1 - e^(-M * N^(-X))  =  0.999

  e^(-M * N^(-X))  =  0.001

  -M * N^(-X)  =  ln(0.001)  =  -3 * ln(10)

  M  =  3 * ln(10) * (N^X)

Using the values N = 2 and X = 8000 as before, this would require M to
be about 1.2E+2409.  The last step of the puzzle is then to
"translate" this number of independent trials into some length Y (in
bits) of a random string.

The simple upper bound on probability of a match corresponds to an
upper bound (optimistic?) on the effective number of independent
trials:

  M  <  Y - X + 1

But as this is perhaps overly optimistic, we recall also the
respective lower bound:

  M  >  Y/X

in which the random string got cut up into nonoverlapping pieces of
size X.  Solving these inequalities for the "unknown" length Y of the
random string gives us at least a range:

  M + X - 1  <  Y  <  M*X

A precise accounting for the effects of a search string having
"patterns" can be managed by some matrix algebra, as outlined in my
Answer.  Unfortunately the size of the matrix increases with the size
X of the search string.  To see why, let's imagine walking along an
avenue of symbols, such as the bits or digits of pi.

With each location we will reach some number of symbols already
matched up to that point, which may be from 0 up to X.  So the "state"
of mind we are in during this travel is based on how many have been
match right to the point we are at any moment.  Such this count ever
reach X, then we are done, but until such time we must always
anticipate a chance of 1/N of increasing the count, and collectively a
chance of 1 - 1/N that the "match count" will drop its state down  by
some amount (quite possibly all the way to zero, though the
possibility exists for many strings of only dropping part of the way
back).

The matrix, whose every row and column reflects the possibility as
either input or output of each possible "match count" from 0 through
X, allows us to combine, through matrix multiplication, these
step-by-step probabilities in exactly the right way, with k steps
corresponding to k multiplications of the matrix (or multiplication by
the k'th power of the same matrix, if you prefer).

It was for this last purpose that the "probability transition
matrices" in my posted Answer were invoked.  At the time I wrote I
intended to present a computation involving a larger string, not of an
entire 8,000 bits, but rather of the title string.  Assuming an
alphabet of 32 symbols (ignoring case and limiting the space &
punctuation operations to a minimum), the sonnet's title:

  SHALL I COMPARE THEE TO A SUMMER'S DAY?

would consist of 40 symbols (including a line termination at its end).
 The corresponding probability transition matrix could be taken in
fairly simple fashion as this "lower Hessenburg" matrix:

         / 31   1   0   0   ...    0 \
         | 30   1   1   0     ...  0 |
         | 30   1   0   1      ... 0 |
  (1/32)*|  .   .   .         .    . |
         |  .   .   .           .  . |
         |  .   .   .            . . |
         | 30   1   0              1 |
         \  0   0   0   0   ...   32 /

Naturally I will not attempt to preserve the exact rational form of
powers of this rational matrix, but this version of the problem
(searching only for the title, as volterwd-ga's second Comment
suggests) has:

   X = 40

   N = 32

with length Y an unknown or perhaps, converting 10^77 digits into
3.3E+77 bits as before, dividing this by 5 (because 32 = 2^5) to get
the number of consecutive symbols corresponding to the alphabet of
size 32.  Accurately omputing high powers of matrices of this size is
possible by using some of the same "shortcuts" presented already, with
a matrix "exponential" available that could play a role like that of
the scalar exponential in my first Comment.


regards, mathtalk-ga

tprestero-ga rated this answer: 3 out of 5 stars

Mathtalk-ga has done a great job of engaging other experts in the
field who'd already left comments.  The answer, from my perspective as
a non-whiz in probability, was a bit hard to follow ("probability
transition matrix"?) but my question was worded rather ambiguously. 
I'll be more clear in the future!

Comments

Subject: Re: Probability of a specific string in a random number
From: pinkfreud-ga on 11 Apr 2005 14:40 PDT

You might find this to be interesting:

http://pi.nersc.gov/

Subject: Re: Probability of a specific string in a random number
From: willcodeforfood-ga on 11 Apr 2005 17:03 PDT

Assuming truly random distribution, the chances of finding your search
string at any given position within the text (probability for each
trial) is: 1/N^X

You will only be able to search at starting positions from: 1 to 1+Y-X
so the overall probability of finding your search string is determined
using the binomial probability formula summarized here:
  [ http://www.mathwords.com/b/binomial_probability_formula.htm ]

You number of trials can be anything you like, but 10^77 is probably a
sufficient number to return a decent chance of finding your sonnet.

How you map the digits of pi to letters will affect the overall odds
of finding your search string at any given location.  If you map each
two consecutive digits to a single letter (or character) you'll have
an alphabet of 100 characters.  This is sufficient to write your
sonnet.  Since your string is 38 characters wide, you'd need 76
consecutive digits to find the sonnet.  Your chances of finding the
sonnet at any given location is 1 / 10^76.  Use the binomial
probability formula to calculate the overall odds.  I'm guessing
you're going to find it :)

Subject: Re: Probability of a specific string in a random number
From: bozo99-ga on 11 Apr 2005 17:10 PDT

First imagine searching for 1000 particular decimal digits in a 10^77
extract of pi.   There are (10^77 - 10^3) starting points (approx
10^77) and the probability of the starting point matching your first
digit is 0.1, etc and the probability of matching your entire string
_at one place_ is 0.1 ^ 1000.

So the probability of finding your 1000-digit string _anywhere_ arises
from 10^77 trials each with p=10^-1000.  Using the usual q=1-p to
invert success you get q=1-10^-1000.  Notice how this is 1 minus a
very small number.   The probability of failing to find our desired
string at all is q^10^77 = q^770.  Using a binomial expansion of that
(1-p)^770 and taking only the first 2 terms we say the probability of
success is 770x10^-1000 or 7.7x10^-998.  (Crude, but that's the way to
do some stuff.)

You have to define the alphabet you are interested in.  Suppose this
is 26 letters, space and newline.  No punctuation, and case
insensitive.   You could make other choices here but that fixes your
alphabet at 28 chars, where 28 is obviously different from 10 (as in
the 10 digits).  You'd then have to consider that your 1000-char
sonnet has 28^1000 possible values and is equivalent to 1447 decimal
digits.  So now you've got p=10^-1447 and still about 10^77 trials. 
That makes my final estimate 770x10^-1447 or 7.70x10^-1445.

Subject: Re: Probability of a specific string in a random number
From: volterwd-ga on 11 Apr 2005 22:07 PDT

Actually you could treat this as a trial from a geometric distribution.

A success is if you get the desired sequence.

Do it like this... i have  Y > X numbers.  Thus i have Y - X = Z
possible sequences.

for example if my total Y numbers were 7 and X was of length 3

Say my sequence was 8098908
desired sequence 000

then i have 809 as my first potential
            098 as my second potential...

and so on until 908 is my last potential

Then the we check the first sequence if it is good we have one if not
we check the second possible one... if its good we have one... if not
we check the third and so on.

If we have none... we have no sequences...

Lets check the probability of this.  

basically we need to fail our series Z = Y - X times.

So we have the probability of failure (not at least one) as 

(1-1/N^X)^(Y-X)

So the probability of getting at least one sequence in the series is

1-(1-1/N^X)^(Y-X)

Which is an absurd number to calculate :P

However we can calculate the mean of the geometric distribution as

(1-1/N^X)/(1/N^X) + 1 = N^X 

So thats how many wed expect it to take to get one series well that
many sequences... so we would need N^X + X - 1 actual symbols

You could use the geometric distribution to generate the desired quantiles.

Subject: Re: Probability of a specific string in a random number
From: hyphenga-ga on 12 Apr 2005 12:41 PDT

Based on the above information, I think "Snooker" would be a cute name for a ferret.

Subject: Re: Probability of a specific string in a random number
From: mathtalk-ga on 12 Apr 2005 17:48 PDT

The website that pinkfreud-ga links to above unfortunately doesn't
locate the first matching "substring" of bits in the fractional part
of pi.  In fact the search seems to pad the requested pattern with a
considerable number of zeroes (in hex).  For example the first 4 hex
"digits" after the "radix point" in pi would be 243F, but their search
engine finds this bit-string way downstream.

The "geometric probability" approach outlined by volterwd-ga and other
Commenters correctly gives the probability of a single "random" string
of length equal to the requested pattern.  However the consecutive
substrings of this length within a longer string are not independent
"Bernoulli" trials, because the consecutive substrings overlap.  This
has the effect of reducing the actual "chances" there are for a
substring match in a manner that depends on the substring sought.  The
effect is no worse than dividing the length of the target by the
length of the substring pattern however, so this approach does give
rigorous upper and lower bounds on the chances of finding at least one
match.

To give ourselves maximum flexibility, let's suppose that the
Shakespearean sonnet is a pattern of 8,000 bits.  The putative string
of 10^77 digits of pi is then roughly 3.3E+77 bits expressed in
binary.

While this might seem like a wonderfully long search region within
which to locate a poem and "lost works" from Borge's Library of Babel,
in fact it is woefully inadequate to hope that ?Shall I Compare Thee
to a Summer?s Day? will be found there.  Since the chance of any _one_
stretch of 8,000 random bits will match our poem is:

  p = 2^-8000 ~ 5.75E-2409

Now the effective number of "independent trials" or chances to match
this pattern is somewhere between the upper bound of 3.3E+77 and that
figure divided by 8000, ie. 4.1E+74.

The chance of no matches at all in N independent trials is:

  Pr(no match) = (1 - p)^N

Since for tiny values of p we have an excellent approximation:

  (1 - p)^(1/p) ~ 1/e

where e = 2.718281828... is the base of natural logarithms, we can say:

  Pr(no match) = (1 - p)^(Np/p) ~ e^(-Np)

Depending on whether we use the upper bound on N:

   N = 3.3E+77,  Pr(no match) ~ e^(-1.90E-2331)

or the lower bound on N:

   N = 4.1E+74,  Pr(no match) ~ e^(-2.36E-2334)

Now these exponents are very close to zero (one might say immeasurably
close!), and the slope of the exponential function (which has value 1
when the exponent is exactly zero) here is about 1.  Therefore:

  Pr(at least one match) = 1 - Pr(no match)

  2.36E-2334 <= Pr(at least one match) <= 1.90E-2331

I really cannot imagine just how unlikely such a match is.  The
Infinite Improbability Drive of Douglas Adams' Hitchhikers Guide to
the Universe comes to mind:

[Infinite Improbability Drive]
http://technovelgy.com/ct/content.asp?Bnum=134

regards, mathtalk-ga

Subject: Re: Probability of a specific string in a random number
From: volterwd-ga on 13 Apr 2005 22:39 PDT

Thanks for pointing out my oversight... id like to add some comments

someone said 

How you map the digits of pi to letters will affect the overall odds

Actually... you can avoid this by converting pi to a base N
representation and  work out the corresponding cut off point in base N
that corresponds to the cutoff in base 10... ie 10^77 digits in base
10 is 10^(77/log10(N)) digits in base N

Going back the to geometric method... if we consider the overlapping
sequences to be independent... we get the expected value as i had
mentioned of

N^X + X - 1 ~ N^X

if we assume the overlapping sequences to be fully dependent then the
expected value is

XN^X

so the expected wait would be

N^X <= N^X + X - 1 < mu < XN^X

now i would suggest using N = 26 (only letters no punctuation or
spaces) because one could obtain the meaning from that.  That would
give me X = 30.  So my expected wait would be about

N^X = 10^(Xlog10(N)) = 10^42.4492

XN^X = 10^(log10(X)+Xlog10(N))=10^43.92632

However if i want to include all punctuation and spaces then X = 38
and N will be say 50

our bounds would be 

10^64.56086
and
10^66.14064

Mind you this would be the number of base N digits... 

To convert to base 10 digits we just multiply the exponent by log10(N)
and that will give the bounds as

N=26 X=30  10^60.06449 < mu < 10^62.15457
N=50 X=38  10^109.6870 < mu < 10^112.3710

So if we use no spaces or punctuation then its almost certain to find
that sequence. However if we use punctuation and spaces (not even
accounting for capitols) then its absurd to expect to find it.

BTW i used expected value not probability because it gives you an
order of how long you need to wait.

Subject: Re: Probability of a specific string in a random number
From: racecar-ga on 18 Apr 2005 22:28 PDT

mathtalk--

can you explain briefly why it's more likely to find a given sequence
in N independent substrings than in N overlapping substrings?

Subject: Re: Probability of a specific string in a random number
From: mathtalk-ga on 20 Apr 2005 08:39 PDT

Hi, racecar-ga:

The events of two overlapping strings both matching a pattern are not
independent.  This motivated my use of N/m independent trials to get a
lower bound on the probability of matching a string of length m
anywhere in a run of length N.

But depending on the pattern, it might actually be (slightly) more or
less probable to find a match among the N-m+1 overlapping consecutive
substrings of length m > 1 in a run of length N, than to find a match
among N-m+1 independent such strings!

Here's a simple case, the two consecutive binary strings of length 2
that are found overlapping in a run of 3 bits.

If the pattern is 00 or 11, then the chance of matching someplace in
the run of 3 bits is 3/8ths.  If the pattern is 01 or 10, then the
chance of matching either the front pair or the back pair is 4/8ths
(it being impossible to have a match in both pairs with the latter
patterns).

However for two independent binary strings of length 2, the chance of
at least one match is 7/16ths.  This is the same for all four
patterns.

If we average out the chances of a front or back pair matching over
_all_ four patterns (times the eight possible 3 bit runs), we see that
the 3/8ths and the 4/8ths cases average out to 7/16ths, and this is a
general circumstance.

For a non-constant pattern (01 or 10) the events of matching the front
pair and back pair are mutually exclusive, but by essentially the same
token it becomes more likely, given that the front pair match fails,
that the back pair match will succeed.

regards, mathtalk-ga

Subject: Re: Probability of a specific string in a random number
From: racecar-ga on 20 Apr 2005 22:59 PDT

Thanks for the detailed reply.  I have a hunch that in the limit of
long substrings, the difference vanishes.  My guess is that the
probability of finding a given sequence of 1000 digits in a random
string of 10^77 digits and the probability of finding a match among
10^77 (-999 as if that matters)independent random 1000 digit strings
are the same to more decimal places than you could count without
losing track.  I don't have any proof though.

Subject: Re: Probability of a specific string in a random number
From: tprestero-ga on 21 Apr 2005 12:25 PDT

Great comments, all.  I guess the overwhelming majority of the
Librarians in Borges's story are surrounded by nonsense.

What if we turn the question around--how many digits of pi would I
need to have a good (ie 99.9%) chance of finding Shakespeare's sonnet?

Subject: Re: Probability of a specific string in a random number
From: mathtalk-ga on 22 Apr 2005 07:30 PDT

One vastly expands the number of opportunities if a "stride" between
consecutive characters greater than 1 is allowed, cum the "Bible
codes" controversy (in which future names and events are purported).

[Bible Code Digest]
http://www.biblecodedigest.com/

A serious attempt to frame these ideas in a scientific study "was made
by three Israelis: Doron Witztum, Eliyahu Rips, and Yoav Rosenberg
(WRR)... WRR's 'experiment' appeared in the respectable journal
Statistical Science in 1994."

However their statistical methodology turned out to be flawed:

[Scientific Refutation of the Bible Codes]
http://cs.anu.edu.au/~bdm/dilugim/torah.html

The number of opportunities to locate a particular string increases
roughly by squaring.

regards, mathtalk-ga

Subject: Re: Probability of a specific string in a random number
From: racecar-ga on 12 May 2005 16:36 PDT

Very enlightening Answer.  Repeated squaring of the prob. trans.
matrix is an amazingly efficient way to calculate probabilities for
long strings.  I admit that I was sort of surprised at the results.  I
still think though that with a large alphabet and a long sequence made
of a pretty even mix of the members of the alphabet, the effective
number of independent trials approaches the length of the string. 
Basically, it seems like with a quasi-random sequence, knowing that
A(n):A(n+L-1) is not a match doesn't really change the probability
that
A(n+1):A(n+L) is a match. [A is the string (library) and L is the
length of the sequence]. However, if you're looking for the sequence
555555555555555555555555555555555555555555555 in the digits of pi,
knowing that A(n):A(n+L-1) is not a match makes it significantly less
probable that A(n+1):A(n+L) is a match (significantly, I mean, when
you look at the cumulative effect--that is, the effective number of
independent trials).  One way to see this is that if you know that
A(n):A(n+L-1) IS a match (they're all 5's), then the probability that
A(n+1):A(n+L) is a match is 10^-1 instead of 10^-40 or whatever for an
independent trial.  If knowing the first subtring is a match increases
likelihood that the second substring is a match, knowing the first one
is not a match must decrease the likelihood that the second one is a
match.

Awaiting next installment...

Important Disclaimer: Answers and comments provided on Google Answers are general information, and are not intended to substitute for informed professional medical, psychiatric, psychological, tax, legal, investment, accounting, or other professional advice. Google does not endorse, and expressly disclaims liability for any product, manufacturer, distributor, service or service provider mentioned or any opinion expressed in answers or comments. Please read carefully the Google Answers Terms of Service.

If you feel that you have found inappropriate content, please let us know by emailing us at answers-support@google.com with the question ID listed above. Thank you.

Search Google Answers for

Google Home - Answers FAQ - Terms of Service - Privacy Policy