Q: Finding patterns in files, given only the length ( No Answer, 18 Comments )
Question  
Subject: Finding patterns in files, given only the length
Category: Computers > Algorithms
Asked by: nediam1234-ga
List Price: $35.00
Posted: 26 Oct 2006 05:58 PDT
Expires: 25 Nov 2006 04:58 PST
Question ID: 777067
I'm looking for pseudocode for a problem I'm having.

I want to find out what the most common byte patterns in a set of
files are, given only the length of the pattern.


I'll try to explain by example. Say I have 3 files containing this binary data:

file 1: A6 74 29 8E 6F 4D 43 67 23 98 FF 4A A6 74 A6 74 A4 A6 74

file 2: 71 6F 4D 43 A6 74 6F 4D 43 75 33

file 3: 4D 43 4D 43 A6 74 4D 43 8F A3 DC FF FF A6 74


If I told the program that I want patterns of length 3, I would get:

pattern "6F 4D 43" is 3 times in the given files.
pattern "4D 43 A6" is 2 times in the given files.
pattern "43 A6 74" is 2 times in the given files.




I could have hundreds of files, each up to 15 MB, and the pattern
length asked for can be up to 64.

I know this would be a very, very slow algorithm in any case.. but
what would be the best way to do this without taking many GB of
memory and without it running for many days? :)  It would be OK for
the code to take 1 GB of memory, but not much more than that.


Did I explain my problem well enough?
Answer  
There is no answer at this time.

Comments  
Subject: Re: Finding patterns in files, given only the length
From: snowmoon-ga on 26 Oct 2006 09:53 PDT
 
1) Do you need to know the exact number of occurrences or would 16+ be
an acceptable answer?

2) Is the input totally random, or is there some order to it?  If you
are trying to search random or pseudo-random sets, this will be time
consuming.

Since I still can't "answer" the question because of some conflict in
Google, here is a quick answer.

If you are willing to accept answers of 0-15 and "16 or more", you
can keep a table of 4-bit saturating counters, one for every possible
4-byte combination, do a single pass over all the files, and then read
the table back. (Strictly, 2^32 counters at 4 bits each is 2 GB; 2-bit
counters would fit the 1 GB budget.) To get exact counts you would
then take the limited set of patterns that hit the saturation point
and iterate over the file set again, counting each of those patterns
exactly.
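
A minimal C++ sketch of that counting pass, under one reading of the
comment (4-bit counters packed two per byte, saturating at 15, so
strictly the buckets are 0-14 and "15 or more"; untested, and note
the full table is 2 GB):

#include <cstdint>
#include <cstdio>
#include <vector>

int main(int argc, char** argv) {
    // One 4-bit counter per possible 4-byte window, two counters per
    // byte: 2^32 counters in 2^31 bytes (2 GB).
    std::vector<uint8_t> table(1ull << 31, 0);

    for (int f = 1; f < argc; ++f) {
        FILE* fp = std::fopen(argv[f], "rb");
        if (!fp) continue;
        uint32_t window = 0;
        int filled = 0, c;
        while ((c = std::fgetc(fp)) != EOF) {
            window = (window << 8) | (uint8_t)c;
            if (++filled < 4) continue;          // need 4 bytes first
            uint8_t& slot = table[window >> 1];  // two counters per byte
            int shift = (window & 1) ? 4 : 0;
            int count = (slot >> shift) & 0xF;
            if (count < 15)                      // saturate at 15 ("15+")
                slot = (uint8_t)((slot & ~(0xF << shift))
                                 | ((count + 1) << shift));
        }
        std::fclose(fp);
    }

    // Report saturated windows; a second pass over the files would
    // recover their exact counts.
    for (uint64_t w = 0; w < (1ull << 32); ++w)
        if (((table[w >> 1] >> ((w & 1) ? 4 : 0)) & 0xF) == 15)
            std::printf("%08llX hit 15+ occurrences\n",
                        (unsigned long long)w);
    return 0;
}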

Sequences larger than 4 bytes must start with a common 4-byte
pattern. You can then walk down the tree from the initial 4-byte
sequences, looking for longer sequences that extend them; each extra
byte of length requires one more pass over the input set. If at any
point in this tree search a branch comes up empty, you can stop
searching it. You can also optimize once you are down to a limited
set of candidates (say 4k 44-byte patterns remain: read them into
memory, 64 bytes each, and count duplicates directly at that point).
While time consuming, I don't know of any shortcuts given the
arbitrary bounds of the input set; hopefully it will quickly resolve
into a handful of patterns that can be read directly into memory and
counted.
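
A sketch of one round of that extension search in C++ (an
interpretation, not tested code; it assumes whole files fit in memory,
which holds at 15 MB, and that patterns do not cross file boundaries):

#include <cstdio>
#include <map>
#include <set>
#include <string>
#include <vector>

// Count every (k+1)-byte extension of the k-byte patterns that
// survived the previous pass.
std::map<std::string, long> extend(const std::set<std::string>& survivors,
                                   const std::vector<std::string>& files,
                                   size_t k) {
    std::map<std::string, long> counts;
    for (size_t f = 0; f < files.size(); ++f) {
        FILE* fp = std::fopen(files[f].c_str(), "rb");
        if (!fp) continue;
        std::string buf;                 // whole file (at most ~15 MB)
        int c;
        while ((c = std::fgetc(fp)) != EOF) buf.push_back((char)c);
        std::fclose(fp);
        for (size_t i = 0; i + k + 1 <= buf.size(); ++i)
            if (survivors.count(buf.substr(i, k)))   // frequent prefix?
                ++counts[buf.substr(i, k + 1)];
    }
    return counts;
}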
Subject: Re: Finding patterns in files, given only the length
From: frankcorrao-ga on 26 Oct 2006 10:00 PDT
 
Are you willing to use disk space instead of memory?  I think a
speedy-ish way to do it is to find every substring of length n, of
which each file of length l contributes l - n + 1, and then sort
them.  Once they're sorted, it is easy to calculate how often each is
present with just one pass over the sorted data.  There is no reason
you need to sort them all in memory.
So here is the idea:
for each file
   read the file
   find every substring of length n
   sort these substrings
   write the sorted substrings to a new file
done

That part should not take more than about twice the memory of the
largest individual file, assuming you free the space used by one file
before you read the next.

Now that you have a bunch of sorted substring files, use an external
merge sort to create one giant file of sorted substrings
http://en.wikipedia.org/wiki/External_sort

Once you have that, all you need to do is read the file one record at
a time, keeping a count for each substring and resetting the count
whenever the substring changes.
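
The counting pass might look like this in C++ (a sketch; the file name
"sorted_substrings.bin" and the record layout of bare n-byte records
with no separators are assumptions, not part of the suggestion above):

#include <cstdio>
#include <cstring>

int main() {
    const size_t n = 3;                 // pattern length asked for
    FILE* fp = std::fopen("sorted_substrings.bin", "rb");
    if (!fp) return 1;

    unsigned char prev[64], cur[64];    // n is at most 64
    long count = 0;
    bool have_prev = false;
    while (std::fread(cur, 1, n, fp) == n) {
        if (have_prev && std::memcmp(prev, cur, n) == 0) {
            ++count;                    // same substring as last record
        } else {
            if (have_prev && count > 1) {   // report repeats only
                for (size_t i = 0; i < n; ++i)
                    std::printf("%02X ", prev[i]);
                std::printf("appears %ld times\n", count);
            }
            std::memcpy(prev, cur, n);
            count = 1;
            have_prev = true;
        }
    }
    if (have_prev && count > 1) {       // flush the final run
        for (size_t i = 0; i < n; ++i) std::printf("%02X ", prev[i]);
        std::printf("appears %ld times\n", count);
    }
    std::fclose(fp);
    return 0;
}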
Subject: Re: Finding patterns in files, given only the length
From: snowmoon-ga on 26 Oct 2006 10:40 PDT
 
Frank is correct... a much better solution as long as disk space is
available.  Here is a link to the Google "MapReduce" paper, which
could easily be applicable to your problem.

http://labs.google.com/papers/mapreduce.html
Subject: Re: Finding patterns in files, given only the length
From: nediam1234-ga on 26 Oct 2006 13:29 PDT
 
>Is the input totally random or is there some order to it?
Yes, the input is totally random.


Thank you both for the good answers. First, I can't accept only a
"16 or more" answer; I need exact counts.

frankcorrao, I like your idea very much, and I can spend some disk
space. But I think the disk space needed would be too much, since

if a file contains, for example:

A0 A1 A2 A3 A4 A5 A6

and I'm looking for length 3,
then the following patterns are available:
A0 A1 A2
A1 A2 A3
A2 A3 A4
A3 A4 A5
A4 A5 A6

not just the patterns:
A0 A1 A2
A3 A4 A5


So if I sorted all these substrings into a file, the file could get very, very big.
Let's say I have an input file that is only 10 MB = 10485760 bytes.
If I'm looking for patterns of length 40, the size would be around
40 * 10 MB ~ 400 MB for that single file.
Am I wrong in this calculation?
Running this on hundreds of files will take a lot of space, and then
merging them into one will make one gigantic file, right?



I've been thinking about this a lot, and one idea I have is to use a
hashtable, like this:


for each file
  for each pattern in the file
    if the pattern is in the hashtable then
      increase the value for this pattern in the hashtable by one
    else
      add the pattern to the hashtable with value 1
  end for
end for

This would create one very big hashtable, and that hashtable would
probably not fit in memory, so I would need a hashtable that keeps its
keys and values on the hard disk. And that raises another question:
what hash function would be good to use in this case?
How is this idea... the same, better, or worse?
Would the hashtable maybe be too slow, reading and writing it to a file?
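
For illustration, a purely in-memory C++ sketch of that loop; it
sidesteps the hash-function question by using std::map, and it
ignores the spill-to-disk problem raised above:

#include <cstdio>
#include <map>
#include <string>

int main(int argc, char** argv) {
    const size_t n = 3;                       // pattern length
    std::map<std::string, long> counts;
    for (int f = 1; f < argc; ++f) {
        FILE* fp = std::fopen(argv[f], "rb");
        if (!fp) continue;
        std::string buf;
        int c;
        while ((c = std::fgetc(fp)) != EOF) buf.push_back((char)c);
        std::fclose(fp);
        for (size_t i = 0; i + n <= buf.size(); ++i)
            ++counts[buf.substr(i, n)];       // add or increment
    }
    std::map<std::string, long>::const_iterator it;
    for (it = counts.begin(); it != counts.end(); ++it)
        if (it->second > 1) {
            for (size_t i = 0; i < n; ++i)
                std::printf("%02X ", (unsigned char)it->first[i]);
            std::printf("appears %ld times\n", it->second);
        }
    return 0;
}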
Subject: Re: Finding patterns in files, given only the length
From: nediam1234-ga on 27 Oct 2006 01:45 PDT
 
After doing some more calculation, I see that what I'm asking for is
not possible (unless I have at least hundreds, up to hundreds of
thousands, of terabytes!).

Just taking all possible 40-byte patterns: 40*8 = 320 bits gives us
2^320 possible patterns. Even if only 0.000000001% of the possible
patterns turned up in my files, the size would be very, very big
(we're talking many, many terabytes). (Even a string of "only" length
20 has too many possibilities.)

It may be acceptable for me to sort a few GB of data using the method
frankcorrao came up with, then throw away all patterns that have come
up only once, and hopefully that will throw away "almost all" of the
patterns found (since the data is totally random), then continue with
the next chunk of data, etc.
Subject: Re: Finding patterns in files, given only the length
From: frde-ga on 27 Oct 2006 03:33 PDT
 
1) The files are up to 15 MB
2) The patterns can be up to 64 bytes
3) ABCDE resolves to ABC, BCD, CDE

I am totally with frankcorrao-ga on this one; actually it was my
immediate reaction, and I was miffed to see he had got there first.

With a 2-byte pattern, a 15 MB file would produce a sort file of just
under 30 MB.
With a 3-byte pattern, a 15 MB file would produce a sort file of ~45 MB.
With a 64-byte pattern we have 15 MB * 64 = 960 MB.

Sorting a file of 960 MB is very feasible.

Actually the data looks like hex, so if the files really are in hex,
each 2 bytes could be collapsed to 1 byte.

Now, if I were doing this, I would read the file in chunks of about
100 KB, perform an initial sort in RAM, and then write out the unique
patterns followed by their frequency, which could substantially reduce
the size of the disk sort file (see the sketch below).
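
A C++ sketch of that run-generation step (the record layout, the
pattern bytes followed by their count, is a choice made here for
illustration, not part of the comment):

#include <algorithm>
#include <cstdio>
#include <string>
#include <vector>

// Sort one chunk's patterns in RAM, then write each distinct pattern
// once, followed by its frequency.
void write_run(std::vector<std::string> patterns, FILE* out) {
    std::sort(patterns.begin(), patterns.end());
    for (size_t i = 0; i < patterns.size(); ) {
        size_t j = i;
        while (j < patterns.size() && patterns[j] == patterns[i])
            ++j;                                  // run of duplicates
        unsigned long freq = (unsigned long)(j - i);
        std::fwrite(patterns[i].data(), 1, patterns[i].size(), out);
        std::fwrite(&freq, sizeof freq, 1, out);  // count after the bytes
        i = j;
    }
}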

OTOH 256^64 (or 256^32 if the hex collapses to binary) is a lot of
potentially unique values, so the saving from pre-processing might
not be that great.

It would be interesting to know what you need this for.
Sequencing analysis sounds interesting.
Subject: Re: Finding patterns in files, given only the length
From: nediam1234-ga on 27 Oct 2006 04:50 PDT
 
Hi again, and thanks for the interest in my problem :)

>It would be interesting to know what you need this for.
Well, I'm sorry, but I'm not allowed to say exactly what I need this
for. I can say that the data is taken by meters that get their data
from "the environment", and I'm actually trying to find "patterns in
the chaos"! .. sounds weird :)
The data is so varied that I have not found any patterns by "simple
methods", so my only option is doing some kind of "brute force
pattern search", like I'm asking for here, to see if I find any
patterns.


>Actually the data looks like Hex, so if the
>files are really in Hex, each 2 bytes could
>be collapsed to 1 byte.
Sorry if I have not explained this before, but when I say a file contains:
A1 A2 A3
then I mean it's a binary file (3 bytes) containing these hex values.


frde, just to see if I understand you correctly:
you're saying that I should use frankcorrao's method, except storing
each pattern only once along with its frequency? I like that.

But then the problem I see is keeping the data for all the files,
since I'm not only searching for patterns inside one file, but also
checking whether the same patterns occur across all (or some) of the
files. And I don't see how I can keep all the patterns across files,
since that would probably take too much space.. right? Any thoughts
about that?
Subject: Re: Finding patterns in files, given only the length
From: frankcorrao-ga on 27 Oct 2006 06:29 PDT
 
You are right, though, about my initial calculation: it is off by a
factor of the length of the substring.  Looking at fpe's thoughts
reminded me of an idea from Programming Pearls that can help you
reduce the maximum size you ever need to hold in memory.  You don't
need to store each substring separately in memory to sort them,
assuming you are using a memory-addressable language like C.

Let's say you have the data stored in an array S.  With this, create
a parallel array of pointers P, where P[i] = &(S[i]) (the address of
the ith element of S).  Now all you have to do is sort the array of
pointers, with a comparison function that is smart enough to look at
only the first n bytes behind each pointer, where n is the length of
the substring you are interested in.

For example, say S holds {a,b,c,d,e} and n is 3.  P[0] points at
"abcde", P[1] at "bcde", etc., but since the comparison function
compares at most the first n bytes, to the sort they look like "abc",
"bcd", etc.  Substrings without actually storing substrings.  This
might not make sense if you are not familiar with C and pointers, but
I am sure it will reduce your memory overhead to at most 4 times the
size of a file.
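
In C++ the same trick might look like this (a sketch; 'data' is
assumed to be one file's contents already read into memory):

#include <algorithm>
#include <cstring>
#include <vector>

// Compare only the first n bytes behind each pointer.
struct FirstNBytesLess {
    size_t n;
    bool operator()(const unsigned char* a, const unsigned char* b) const {
        return std::memcmp(a, b, n) < 0;
    }
};

// Returns the window start pointers in sorted order without copying
// any substrings; only the pointers move during the sort.
std::vector<const unsigned char*>
sorted_windows(const unsigned char* data, size_t len, size_t n) {
    std::vector<const unsigned char*> p;
    for (size_t i = 0; i + n <= len; ++i)
        p.push_back(data + i);            // one pointer per window
    FirstNBytesLess cmp;
    cmp.n = n;
    std::sort(p.begin(), p.end(), cmp);   // sorts pointers, not bytes
    return p;
}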
Subject: Re: Finding patterns in files, given only the length
From: frankcorrao-ga on 27 Oct 2006 06:50 PDT
 
Clearly I meant frde when I said fpe :)

I'm still thinking about the problem of storing all the substrings on
disk or keeping some kind of hash of the counts.
Subject: Re: Finding patterns in files, given only the length
From: nediam1234-ga on 27 Oct 2006 07:00 PDT
 
Thanks for this, frankcorrao.
I would do this in C++, so I'm very familiar with pointers, and
sorting it like you suggest is a very good idea.

But like you say.. the problem now is how to store all the substrings!? :)
If that's possible at all...
Subject: Re: Finding patterns in files, given only the length
From: theline-ga on 27 Oct 2006 08:08 PDT
 
Well,

I'm thinking from another point of view.

You could try a "trie". Independent of the number of files and their
size, 64-byte patterns over 256 possible byte values means (64+1)*256 =
16640 bytes of memory (plus some bookkeeping data) and only one pass
over the file(s).

Building this with care, you can have all the patterns between 1 and 64 bytes.

A trie is a tree shaped like this:
    /
  A   B
 A B A B

And so on.

You can store the occurrence count in each node.

Example:
File: A A A B A B
Trie:
            /
    A(3)        B(2)
 A(2) B(2)   A(1) B(0) (this last node is not necessary).


I think it's easy and time/space efficient.
Good luck!
(Sorry, my English is not good enough).
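
A minimal counting trie along these lines in C++ (a sketch; note the
naive layout with 256 child pointers makes each node roughly 1 KB,
which is where the size concern raised in the following comments
comes from):

#include <cstddef>

// One trie node per distinct prefix seen so far; nodes are allocated
// lazily, so only prefixes that actually occur cost memory.
struct TrieNode {
    long count;
    TrieNode* child[256];
    TrieNode() : count(0) {
        for (int i = 0; i < 256; ++i) child[i] = 0;
    }
};

// Insert one n-byte window; the node at depth d counts the d-byte
// prefix, so one pass records all pattern lengths from 1 to n at once.
void insert(TrieNode* root, const unsigned char* p, size_t n) {
    TrieNode* node = root;
    for (size_t i = 0; i < n; ++i) {
        if (!node->child[p[i]]) node->child[p[i]] = new TrieNode;
        node = node->child[p[i]];
        ++node->count;
    }
}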
Subject: Re: Finding patterns in files, given only the length
From: theline-ga on 27 Oct 2006 08:22 PDT
 
A correction to my last post: a complete trie for 64-byte patterns
and 256 possible bytes requires much more space than the size I wrote
before (you can get by with that size, but it is very difficult). But
it would be very hard to find a file containing all possible 64-byte
patterns :P.

Another correction: the pattern A appears 4 times, not 3.
Subject: Re: Finding patterns in files, given only the length
From: nediam1234-ga on 27 Oct 2006 08:44 PDT
 
theline, I see where you're going with this, and I think using a trie
like you describe is a very good solution (actually I think it could
be the best solution). But like you say, storing all possible 64-byte
patterns is not an option :)

Using a trie is an easy and good way to find the patterns.

Now I just need to find a way to discard the data that I think has the
least chance of coming up again.

One way to do that might be to go through the trie on a regular basis
(after every X files) and delete patterns that have come up only once?
Subject: Re: Finding patterns in files, given only the length
From: theline-ga on 27 Oct 2006 08:48 PDT
 
I don't see why you want to delete some patterns. Deleting a pattern
just means decreasing a number in a trie node, or maybe deleting one
node. Not very useful.
Subject: Re: Finding patterns in files, given only the length
From: nediam1234-ga on 27 Oct 2006 09:29 PDT
 
The reason I think I would need to delete from the trie is that it
will grow very, very big if most patterns do not appear often.
To simplify, let's say that I have 500 files of 10 MB each (5 GB
total). If the data is "very random" (few patterns actually come up
more than once), wouldn't the trie nodes be so many that I would need
many terabytes to store the trie (even if I only searched for 32-byte
patterns)? Is that right, or am I missing something about how this
works and calculating it wrong?
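
(For scale, under the naive layout where each node holds 256 child
pointers, roughly 1 KB per node: 500 files x 10 MB give about 5*10^9
windows, and with highly random data almost every window adds close
to 32 new nodes, i.e. on the order of 10^11 nodes, which is indeed
well over 100 terabytes.)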
Subject: Re: Finding patterns in files, given only the length
From: theline-ga on 27 Oct 2006 09:54 PDT
 
Note that a trie is a way of compressing a stream of data by blocks.
We don't need more space for common prefixes, since they are stored
only once.

I think that with some variant of a trie, the final size would be relatively small.

You can study that. I'll do some tests...
Subject: Re: Finding patterns in files, given only the length
From: frankcorrao-ga on 27 Oct 2006 12:39 PDT
 
I was only vaguely familiar with tries before this, but I read up on
them and I think a trie is the right solution.  When you add a
substring to the trie, if the substring already exists, just increment
the counter stored at its leaf node.  Since all of the substrings are
the same length, the leaves are the only places you need to store
counters.  Then, when you are done, do a depth-first traversal to
retrieve each substring.

Perl has some free trie implementations, but I don't know about C++.
Subject: Re: Finding patterns in files, given only the length
From: frde-ga on 28 Oct 2006 04:18 PDT
 
I totally disagree with using a trie or tree in this case.

The data is in no way dynamic, and adding complication to essentially
simple data is pointless.

I also would not worry about disk space; hard disks are incredibly
cheap, and you can use disk caddies (replaceable drives) and burn
things onto CD/DVD.

Incidentally, a quick and simple test for any form of pattern is to
simply zip the files: the better the compression, the more 'regular'
the data. See the sketch below.
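
For example, using zlib's one-shot compress() (an illustration only;
any archiver gives the same signal):

#include <vector>
#include <zlib.h>

// Returns compressed size / original size; near 1.0 means the data
// looks random, well below 1.0 hints at regularity.
double compression_ratio(const std::vector<unsigned char>& data) {
    if (data.empty()) return -1.0;
    uLongf destLen = compressBound(data.size());
    std::vector<unsigned char> dest(destLen);
    if (compress(&dest[0], &destLen, &data[0], data.size()) != Z_OK)
        return -1.0;                      // compression failed
    return (double)destLen / (double)data.size();
}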

Currently you are at the stage of 'exploration', so speed is not that
important. The only area I would pay attention to is buffering the
disk reads and writes, as raw I/O is the tedious part. A 100 KB
buffer is perfectly adequate.

I have a Delphi DLL for sorting fixed-length records in files at:
http://tinyurl.com/yfrtxo

It will sort files up to 2 GB and internally uses the technique that
frankcorrao-ga described.

You might be well advised to put a header or footer into the files so
they are self-documenting, or else work out a naming convention.
