Google Answers Logo
View Question
 
Q: blaster algorithm (or at least close to it) ( Answered,   0 Comments )
Question  
Subject: blaster algorithm (or at least close to it)
Category: Computers > Software
Asked by: amos88-ga
List Price: $130.00
Posted: 27 Sep 2004 10:53 PDT
Expires: 27 Oct 2004 10:53 PDT
Question ID: 406959
my quesiton concerns the analysis of sequence data. i try to find
sequences in clickstreams of a weblog.
my files looks like this:

user 1: 100 1200 200 1308 309 
user 2: 200 200 1308 309 509
...
user n

each number corresponds to a click in an database. "200 1308 309"
would be a sequence occuring for user1 and user2.
given this intput i need some output telling me whether there are
sequences which iterate.

i THINK the following tool does the same for dna code:
http://www.hip.harvard.edu/informatics/programs/JAVA%20BLAST%20Parser.html

anybody who can provide a software solution for this problem
(preferential running on windows).

best 

amos

Request for Question Clarification by andyt-ga on 27 Sep 2004 11:35 PDT
Hi amos88,
After a quick look at the BLAST tool, I think it would be easiest to
write a custom solution in a scripting language, rather then modify
BLAST.  Would something written using PHP, Perl, or Python meet your
needs?  All of these languages are available for Windows and can be
run from the command line.

Something to the effect of
>>analyzeSequenceData inputFile outputFile

Clarification of Question by amos88-ga on 27 Sep 2004 23:02 PDT
hi andyt,

i have no preferences concerning the language. 


best

amos88

Clarification of Question by amos88-ga on 30 Sep 2004 19:53 PDT
hi andyt!

how are things?
i did not mention that time (as for the most of us) is quite valueable
for me at the moment (have some deadlines to keep). can you already
state an estimated time for answering the problem?

thanks a lot 
amos

Request for Question Clarification by andyt-ga on 30 Sep 2004 23:30 PDT
Amos,
Another GA staffer had been working on this problem over the past
couple of days, although the lock is now off.  I've spent some time
attempting it, but it's turning out to be more of a complex problem
then I thought.  I'm going to be pretty busy over the weekend, so I
doubt I'll be able to spend much time on it.  So in the hopes that it
will help someone else to help you, I'll post what I have.  If no one
else wants to give it a try, I'll continue working on it, but I don't
want to give an estimated time of delivery at this time.

I think the solution involves generating all possible subsequences of
each user, for instance for user 1 that would be [[100, 1200], [100,
1200, 200], [100, 1200, 200, 1308], [100, 1200, 200, 1308, 309],
[1200, 200], [1200, 200, 1308], [1200, 200, 1308, 309], [200, 1308],
[200, 1308, 309], [1308, 309]]

Then comparing that new list against every other user's subsequence list.

Here's some python code I came up with to generate all subsequences
given an input list li:
li = [100, 1200, 200, 1308, 309]
liSub=[] #list to hold output subsequences
count=0
for i in range(len(li)-1):
	for j in range(len(li) -1 - i):
		liSub.append(li[i:j+2+count])
	count=count+1
print liSub

The problem with this approach is that it is very computationally
intensive as the number of users increases, and I'm guessing there may
be a better way to do it.

Thanks for your patience and understanding,
Andy

Request for Question Clarification by hedgie-ga on 01 Oct 2004 00:34 PDT
To feasibility of Andy's suggestion to
 generate all subsequences
depends on length of your sequence - which is a parameter you did not
tell us. In Sequence of length N there are about N*N/2 subsequences ..
Now- to find
repetitions (I assume that's what you mean by 'which iterate')
 you probably have to keep subsequences in memory.
Some spellcheckers do that (they build a sorted sub-dictionary of words in
an article) and then compare it it a full dictionary. That, the first part,
is what you need - and you want it in c or c{{ 0 bir uin a scripting language.
Such programs are around.

Request for Question Clarification by hedgie-ga on 01 Oct 2004 00:36 PDT
correction, 

 c{{ 0 bir uin a


should be

inc or in c++, not in a scripting language 
(it is CPU intensive)

Clarification of Question by amos88-ga on 01 Oct 2004 00:40 PDT
andy, hedgie!

thanks for your suggestions / comments.

the length of the sequences are group dependent: the "shortest" group
has sequences of 9 items, a "medium" group sequences of 20 and the
"longest" group sequences of max. 40.
i need a solution for at least the first two, i could skip the largest
group if necessary.

best
amos

Request for Question Clarification by hedgie-ga on 01 Oct 2004 05:01 PDT
Amos,
   
   allow me to rephrase the task to see if I understand this:
   
   --------------------
   
   There is large number Nl of log sequences 
   (lets say 100 users every day generate 100 sequences so in a month
we have 3000 of them)
   
   each has length N which is 20 or so  and consists of integers
    which are in range  1 -- Nr (I see Nr ~10000 from your examples)
   
   This is your Ensemble to be analysed.
   
   We are interested in extracting a list of Repetitive Subsequences
of maximal lenght  Na
   lets call them Acts (or activities)
   
   Act is  sub sequence of your integer-sequences SHORTER then Na
   
    Frequency f  f(act) says how many times act was fount in the  Ensemmble.
   
   List all acts,  sorted by frequency f,  where f>1
   
   -------------------------------------
   
   There is  Nr to the power of Na of all possible acts. If that
number is small (relative to RAM of your machine)
   task is simple. You all acts then and just run a counter:
   
      Count(act)++   (in c) as you read once your Ensemble
      
      In this number is large, you need to build a binary tree of
those acts which do ocur in the ensamble.
      
      Question then is: 
       
       case 1: Do you want references to programs which do that for
words (made of 25 english letters)
      and which needs to be adapted (your "words" Acts, are made of Nr letters.
      Are you able and willing to adapt them?
      
      Or, case 2:
       do you want a complete program, ready to run, with unlimited
technical support aftertward?
      
       
      Couple references (which would have to be extended andd sorted
and tried for case 1) as sample are here:
      
 http://pyflag.sourceforge.net/

 http://javalinux.rallylobster.com/CDROM/Chapter02/Examples/wordtree.C

  This is the online version of the CD-ROM for Java Programming on
Linux, by Nathan Meyers
 http://javalinux.rallylobster.com/CDROM/#Sources

       Case 2, I am afraid cannot be had for $100
       
       So - how do you plead?

Clarification of Question by amos88-ga on 01 Oct 2004 07:05 PDT
hedgie,

>There is large number Nl of log sequences (lets say 100 users every
day >generate 100 sequences so in a month we have 3000 of them)

things are not that bad: I got 330 log sequences in six subgroups
(analysis will be run for each subgroup with approx. 330/6 sequences)
   
>each has length N which is 20 or so  and consists of integers which
are in >range  1 -- Nr (I see Nr ~10000 from your examples)

0 < N < 40 and the range is 0 < Nr < 100 (i can recode the numbers
from the example)

>We are interested in extracting a list of Repetitive Subsequences of
maximal >lenght Na lets call them Acts (or activities)
>Act is  sub sequence of your integer-sequences SHORTER then Na

hmm - as far as I can see it Na (which is equal to acts) is shorter
than lgo sequence ?!
   
>Frequency f  f(act) says how many times act was fount in the 
Ensemmble. List >all acts,  sorted by frequency f,  where f>1

right.

i would plead for case 2.
can you please tell me a price?

best

amos

Request for Question Clarification by andyt-ga on 02 Oct 2004 22:00 PDT
Hi Amos,
For the output for this, do you want to include matches for all
possible subsequences, or just the largest unique ones.  So for
instance with input:

user 1: 100 1200 200 1308 309 56
user 2: 200 200 1308 309 56 509

output of:
user 1, user 2: 1308 309 56
user 1, user2: 1308 309
user 1, user 2: 309 56

OR

user 1, user 2: 1308, 309 56

If you want the output of all the sequences, I have the code ready
now; otherwise I'll have to do some additional filtering of the
output, which I probably won't be able to have finished until the
middle of next week.

Right now, with 50 users and 20 sets of data each, it takes about 4
minutes on my 867 G4 with a gig of ram to run the script, so I think
it's doable with that amount of data.

Andy

Request for Question Clarification by hedgie-ga on 02 Oct 2004 23:03 PDT
Hello again Amos

    So, as Andy just demonstrated, it can be done for $130. That's great.
 I just want to reach completion and clarity on that parameter Na I
introduced. It is not necessary to have it; in that case we are
looking for 'acts' of any length, limited in fact by N .. by the
lengths of log sequences
(about 20). That works too.
  The idea if ignoring (in the first pass at least) all acts longer than Na
(may be 5 or 8) had to do with processing time and also with ease of reading
the output: There are going to be few log sequences with repeated Acts
which are long, longer than Na. It may be easier to look at them
directly, then having all their subsequences listed. It may be
expedient to sort them out
and do a second pass (with bigger Na) on this smaller ensemble..
Anyway. Good job, Andy.
Hedgie.

Clarification of Question by amos88-ga on 03 Oct 2004 00:08 PDT
hi,

this soudns very good.
i am not sure about the longest sequence issue without seeing the
data. i suggest the following: i take the current version and accept
the question as answered - if it turns out that the output is simply
overwhelming with my data i would ask for solution 2 (longest sequence
only). (don't know yet exactly how we do this - but we'll find out)

does this sound good?

best

amos
Answer  
Subject: Re: blaster algorithm (or at least close to it)
Answered By: andyt-ga on 03 Oct 2004 05:28 PDT
 
Amos,
Attached is a python script that will process an input file of user
sequence data, and an output a file of identical subsequences between
users.  To use this on Windows, you will need to install a windows
version of Python.  You can find a binary download at
http://www.python.org/download.

Once Python is installed, you invoke the script below by saving it in
a text file and calling it like:
python.exe analyzeSequenceData.py inputFileName outputFileName

I have formatted the output in the form:
[['user 4', 'user 1'], ['100', '1200']]
with each new match on a new line.  It prints all matches, even if the
match is a subset of a larger match.  If you require that to be
modified to only include the largest unique matches, just post a
clarification request and I'll try to get it modified for you.

You can download the script at
http://tunebounce.com/answers/406959/analyzeSequenceData.py.txt (you
must remove the .txt to run it) or copy and paste the code below into
a text editor.

Thanks for your great question, I had a lot of fun working on it.

-Andy

#usage:
#python analyzeSequenceData.py inputFileName outputFile

#input file in the format:
#user 1: 100 1200 200 1308 
#user 2: 200 200 1308 309 
#...
#user n: 32 543 53 52 89

import sys

#opening file
#and creating lists
f = open(sys.argv[1]);
inputList=[]
for line in f:
	line = line.replace("\n", "")
	line = line.strip()
	inputList.append(line.split(":"))
f.close()

#cleans trailing/leading spaces and splits on interior spaces
for i in range(len(inputList)):
	inputList[i][1] = inputList[i][1].strip()
	inputList[i][1] = inputList[i][1].split(" ")

#generating subsequences of lists
seqList=[]
for li in inputList:
	liSub=[] 
	count=0
	for i in range(len(li[1])-1):
		for j in range(len(li[1]) -1 - i):
			liSub.append(li[1][i:j+2+count])
		count=count+1
	seqList.append([li[0], liSub])


#finding matches
matches=[]	
for q, i in zip(seqList, range(len(seqList))):
	for r in q[1]:
		#list of all possible sequences, not including the one
		#being compared against
		allOtherSeq = seqList[0:i]+seqList[i+1:len(seqList)]
		for s in  allOtherSeq:
			for t in s[1]:
				if t == r:
					found = 0
					for v, ii in zip(matches, range(len(matches))):
						if t == v[1]:
							#if user is not in the match list
							#but it is matching the sequence modify
							#the user list to include that user
							#if the user isn't in the matches list
							#add it
							if s[0] not in v[0]:
								v[0].append(s[0])
							found = 1							
					if found == 0:
						matches.append([[s[0], q[0]], t])


#print output to file
fOut = open(sys.argv[2], 'w')				
for x in matches:
	print >> fOut, x
	
fOut.close()

Request for Answer Clarification by amos88-ga on 03 Oct 2004 11:01 PDT
hi

thanks a lot for your program.
i tried to run it on a small dataset and it worked out fine however on
the actual data there seems to be a problem: an output file is
generated but it is empty. my input file looks like:

2541:	28	42	36	37	30	43	34	49	32
2535:	33	32	43	29	36	31	30	35
2523:	33	32	35	34	29	28	84	63	62	

any idea?

thanks

amos

Clarification of Answer by andyt-ga on 03 Oct 2004 11:38 PDT
Yep, since the input is separated by more then one space, there is a
small modification needed:
this line:
	inputList[i][1] = inputList[i][1].split(" ")
should be changed to:
	inputList[i][1] = inputList[i][1].split()

I uploaded the new version to
http://tunebounce.com/answers/406959/analyzeSequenceData.py.txt

Request for Answer Clarification by amos88-ga on 03 Oct 2004 13:59 PDT
thanks!
this is ok now. 
question: what is the maximum length for a sequence - i get a "list
index out of range error" with a quite large file

thanks for your help

amos

Clarification of Answer by andyt-ga on 03 Oct 2004 14:46 PDT
I didn't test it with anything larger then 50...what size are we
talking here?  I didn't put any arbitrary limit, and as far as I know
it Python should allow for arbitrarily long lists, up to the limits of
your machine's processing power.  I can try to duplicate it on my end.
 Also, there will be an error of that type if the file contains any
new lines without data.  Let me know if it was the new line issue, or
the number of users that you're having the problem with.

-Andy

Clarification of Answer by andyt-ga on 03 Oct 2004 15:06 PDT
I put some error checking that should handle the problem with lines
that don't contain data.  It's uploaded to
http://tunebounce.com/answers/406959/analyzeSequenceData.py.txt

Clarification of Answer by andyt-ga on 09 Jun 2006 20:21 PDT
Solution link is http://tunebounce.com/answers/406959/analyzeSequenceData.txt
My web host couldn't handle the .py.
Comments  
There are no comments at this time.

Important Disclaimer: Answers and comments provided on Google Answers are general information, and are not intended to substitute for informed professional medical, psychiatric, psychological, tax, legal, investment, accounting, or other professional advice. Google does not endorse, and expressly disclaims liability for any product, manufacturer, distributor, service or service provider mentioned or any opinion expressed in answers or comments. Please read carefully the Google Answers Terms of Service.

If you feel that you have found inappropriate content, please let us know by emailing us at answers-support@google.com with the question ID listed above. Thank you.
Search Google Answers for
Google Answers  


Google Home - Answers FAQ - Terms of Service - Privacy Policy