Google Answers: blaster algorithm (or at least close to it)

View Question

Q: blaster algorithm (or at least close to it) ( Answered, 0 Comments )

Question

Subject: blaster algorithm (or at least close to it)
Category: Computers > Software
Asked by: amos88-ga
List Price: $130.00

Posted: 27 Sep 2004 10:53 PDT
Expires: 27 Oct 2004 10:53 PDT
Question ID: 406959

my quesiton concerns the analysis of sequence data. i try to find sequences in clickstreams of a weblog. my files looks like this: user 1: 100 1200 200 1308 309 user 2: 200 200 1308 309 509 ... user n each number corresponds to a click in an database. "200 1308 309" would be a sequence occuring for user1 and user2. given this intput i need some output telling me whether there are sequences which iterate. i THINK the following tool does the same for dna code: http://www.hip.harvard.edu/informatics/programs/JAVA%20BLAST%20Parser.html anybody who can provide a software solution for this problem (preferential running on windows). best amos
Request for Question Clarification by andyt-ga on 27 Sep 2004 11:35 PDT Hi amos88, After a quick look at the BLAST tool, I think it would be easiest to write a custom solution in a scripting language, rather then modify BLAST. Would something written using PHP, Perl, or Python meet your needs? All of these languages are available for Windows and can be run from the command line. Something to the effect of >>analyzeSequenceData inputFile outputFile
Clarification of Question by amos88-ga on 27 Sep 2004 23:02 PDT hi andyt, i have no preferences concerning the language. best amos88
Clarification of Question by amos88-ga on 30 Sep 2004 19:53 PDT hi andyt! how are things? i did not mention that time (as for the most of us) is quite valueable for me at the moment (have some deadlines to keep). can you already state an estimated time for answering the problem? thanks a lot amos
Request for Question Clarification by andyt-ga on 30 Sep 2004 23:30 PDT Amos, Another GA staffer had been working on this problem over the past couple of days, although the lock is now off. I've spent some time attempting it, but it's turning out to be more of a complex problem then I thought. I'm going to be pretty busy over the weekend, so I doubt I'll be able to spend much time on it. So in the hopes that it will help someone else to help you, I'll post what I have. If no one else wants to give it a try, I'll continue working on it, but I don't want to give an estimated time of delivery at this time. I think the solution involves generating all possible subsequences of each user, for instance for user 1 that would be [[100, 1200], [100, 1200, 200], [100, 1200, 200, 1308], [100, 1200, 200, 1308, 309], [1200, 200], [1200, 200, 1308], [1200, 200, 1308, 309], [200, 1308], [200, 1308, 309], [1308, 309]] Then comparing that new list against every other user's subsequence list. Here's some python code I came up with to generate all subsequences given an input list li: li = [100, 1200, 200, 1308, 309] liSub=[] #list to hold output subsequences count=0 for i in range(len(li)-1): for j in range(len(li) -1 - i): liSub.append(li[i:j+2+count]) count=count+1 print liSub The problem with this approach is that it is very computationally intensive as the number of users increases, and I'm guessing there may be a better way to do it. Thanks for your patience and understanding, Andy
Request for Question Clarification by hedgie-ga on 01 Oct 2004 00:34 PDT To feasibility of Andy's suggestion to generate all subsequences depends on length of your sequence - which is a parameter you did not tell us. In Sequence of length N there are about N*N/2 subsequences .. Now- to find repetitions (I assume that's what you mean by 'which iterate') you probably have to keep subsequences in memory. Some spellcheckers do that (they build a sorted sub-dictionary of words in an article) and then compare it it a full dictionary. That, the first part, is what you need - and you want it in c or c{{ 0 bir uin a scripting language. Such programs are around.
Request for Question Clarification by hedgie-ga on 01 Oct 2004 00:36 PDT correction, c{{ 0 bir uin a should be inc or in c++, not in a scripting language (it is CPU intensive)
Clarification of Question by amos88-ga on 01 Oct 2004 00:40 PDT andy, hedgie! thanks for your suggestions / comments. the length of the sequences are group dependent: the "shortest" group has sequences of 9 items, a "medium" group sequences of 20 and the "longest" group sequences of max. 40. i need a solution for at least the first two, i could skip the largest group if necessary. best amos
Request for Question Clarification by hedgie-ga on 01 Oct 2004 05:01 PDT Amos, allow me to rephrase the task to see if I understand this: -------------------- There is large number Nl of log sequences (lets say 100 users every day generate 100 sequences so in a month we have 3000 of them) each has length N which is 20 or so and consists of integers which are in range 1 -- Nr (I see Nr ~10000 from your examples) This is your Ensemble to be analysed. We are interested in extracting a list of Repetitive Subsequences of maximal lenght Na lets call them Acts (or activities) Act is sub sequence of your integer-sequences SHORTER then Na Frequency f f(act) says how many times act was fount in the Ensemmble. List all acts, sorted by frequency f, where f>1 ------------------------------------- There is Nr to the power of Na of all possible acts. If that number is small (relative to RAM of your machine) task is simple. You all acts then and just run a counter: Count(act)++ (in c) as you read once your Ensemble In this number is large, you need to build a binary tree of those acts which do ocur in the ensamble. Question then is: case 1: Do you want references to programs which do that for words (made of 25 english letters) and which needs to be adapted (your "words" Acts, are made of Nr letters. Are you able and willing to adapt them? Or, case 2: do you want a complete program, ready to run, with unlimited technical support aftertward? Couple references (which would have to be extended andd sorted and tried for case 1) as sample are here: http://pyflag.sourceforge.net/ http://javalinux.rallylobster.com/CDROM/Chapter02/Examples/wordtree.C This is the online version of the CD-ROM for Java Programming on Linux, by Nathan Meyers http://javalinux.rallylobster.com/CDROM/#Sources Case 2, I am afraid cannot be had for $100 So - how do you plead?
Clarification of Question by amos88-ga on 01 Oct 2004 07:05 PDT hedgie, >There is large number Nl of log sequences (lets say 100 users every day >generate 100 sequences so in a month we have 3000 of them) things are not that bad: I got 330 log sequences in six subgroups (analysis will be run for each subgroup with approx. 330/6 sequences) >each has length N which is 20 or so and consists of integers which are in >range 1 -- Nr (I see Nr ~10000 from your examples) 0 < N < 40 and the range is 0 < Nr < 100 (i can recode the numbers from the example) >We are interested in extracting a list of Repetitive Subsequences of maximal >lenght Na lets call them Acts (or activities) >Act is sub sequence of your integer-sequences SHORTER then Na hmm - as far as I can see it Na (which is equal to acts) is shorter than lgo sequence ?! >Frequency f f(act) says how many times act was fount in the Ensemmble. List >all acts, sorted by frequency f, where f>1 right. i would plead for case 2. can you please tell me a price? best amos
Request for Question Clarification by andyt-ga on 02 Oct 2004 22:00 PDT Hi Amos, For the output for this, do you want to include matches for all possible subsequences, or just the largest unique ones. So for instance with input: user 1: 100 1200 200 1308 309 56 user 2: 200 200 1308 309 56 509 output of: user 1, user 2: 1308 309 56 user 1, user2: 1308 309 user 1, user 2: 309 56 OR user 1, user 2: 1308, 309 56 If you want the output of all the sequences, I have the code ready now; otherwise I'll have to do some additional filtering of the output, which I probably won't be able to have finished until the middle of next week. Right now, with 50 users and 20 sets of data each, it takes about 4 minutes on my 867 G4 with a gig of ram to run the script, so I think it's doable with that amount of data. Andy
Request for Question Clarification by hedgie-ga on 02 Oct 2004 23:03 PDT Hello again Amos So, as Andy just demonstrated, it can be done for $130. That's great. I just want to reach completion and clarity on that parameter Na I introduced. It is not necessary to have it; in that case we are looking for 'acts' of any length, limited in fact by N .. by the lengths of log sequences (about 20). That works too. The idea if ignoring (in the first pass at least) all acts longer than Na (may be 5 or 8) had to do with processing time and also with ease of reading the output: There are going to be few log sequences with repeated Acts which are long, longer than Na. It may be easier to look at them directly, then having all their subsequences listed. It may be expedient to sort them out and do a second pass (with bigger Na) on this smaller ensemble.. Anyway. Good job, Andy. Hedgie.
Clarification of Question by amos88-ga on 03 Oct 2004 00:08 PDT hi, this soudns very good. i am not sure about the longest sequence issue without seeing the data. i suggest the following: i take the current version and accept the question as answered - if it turns out that the output is simply overwhelming with my data i would ask for solution 2 (longest sequence only). (don't know yet exactly how we do this - but we'll find out) does this sound good? best amos

Answer

Subject: Re: blaster algorithm (or at least close to it)
Answered By: andyt-ga on 03 Oct 2004 05:28 PDT

Amos, Attached is a python script that will process an input file of user sequence data, and an output a file of identical subsequences between users. To use this on Windows, you will need to install a windows version of Python. You can find a binary download at http://www.python.org/download. Once Python is installed, you invoke the script below by saving it in a text file and calling it like: python.exe analyzeSequenceData.py inputFileName outputFileName I have formatted the output in the form: [['user 4', 'user 1'], ['100', '1200']] with each new match on a new line. It prints all matches, even if the match is a subset of a larger match. If you require that to be modified to only include the largest unique matches, just post a clarification request and I'll try to get it modified for you. You can download the script at http://tunebounce.com/answers/406959/analyzeSequenceData.py.txt (you must remove the .txt to run it) or copy and paste the code below into a text editor. Thanks for your great question, I had a lot of fun working on it. -Andy #usage: #python analyzeSequenceData.py inputFileName outputFile #input file in the format: #user 1: 100 1200 200 1308 #user 2: 200 200 1308 309 #... #user n: 32 543 53 52 89 import sys #opening file #and creating lists f = open(sys.argv[1]); inputList=[] for line in f: line = line.replace("\n", "") line = line.strip() inputList.append(line.split(":")) f.close() #cleans trailing/leading spaces and splits on interior spaces for i in range(len(inputList)): inputList[i][1] = inputList[i][1].strip() inputList[i][1] = inputList[i][1].split(" ") #generating subsequences of lists seqList=[] for li in inputList: liSub=[] count=0 for i in range(len(li[1])-1): for j in range(len(li[1]) -1 - i): liSub.append(li[1][i:j+2+count]) count=count+1 seqList.append([li[0], liSub]) #finding matches matches=[] for q, i in zip(seqList, range(len(seqList))): for r in q[1]: #list of all possible sequences, not including the one #being compared against allOtherSeq = seqList[0:i]+seqList[i+1:len(seqList)] for s in allOtherSeq: for t in s[1]: if t == r: found = 0 for v, ii in zip(matches, range(len(matches))): if t == v[1]: #if user is not in the match list #but it is matching the sequence modify #the user list to include that user #if the user isn't in the matches list #add it if s[0] not in v[0]: v[0].append(s[0]) found = 1 if found == 0: matches.append([[s[0], q[0]], t]) #print output to file fOut = open(sys.argv[2], 'w') for x in matches: print >> fOut, x fOut.close()
Request for Answer Clarification by amos88-ga on 03 Oct 2004 11:01 PDT hi thanks a lot for your program. i tried to run it on a small dataset and it worked out fine however on the actual data there seems to be a problem: an output file is generated but it is empty. my input file looks like: 2541: 28 42 36 37 30 43 34 49 32 2535: 33 32 43 29 36 31 30 35 2523: 33 32 35 34 29 28 84 63 62 any idea? thanks amos
Clarification of Answer by andyt-ga on 03 Oct 2004 11:38 PDT Yep, since the input is separated by more then one space, there is a small modification needed: this line: inputList[i][1] = inputList[i][1].split(" ") should be changed to: inputList[i][1] = inputList[i][1].split() I uploaded the new version to http://tunebounce.com/answers/406959/analyzeSequenceData.py.txt
Request for Answer Clarification by amos88-ga on 03 Oct 2004 13:59 PDT thanks! this is ok now. question: what is the maximum length for a sequence - i get a "list index out of range error" with a quite large file thanks for your help amos
Clarification of Answer by andyt-ga on 03 Oct 2004 14:46 PDT I didn't test it with anything larger then 50...what size are we talking here? I didn't put any arbitrary limit, and as far as I know it Python should allow for arbitrarily long lists, up to the limits of your machine's processing power. I can try to duplicate it on my end. Also, there will be an error of that type if the file contains any new lines without data. Let me know if it was the new line issue, or the number of users that you're having the problem with. -Andy
Clarification of Answer by andyt-ga on 03 Oct 2004 15:06 PDT I put some error checking that should handle the problem with lines that don't contain data. It's uploaded to http://tunebounce.com/answers/406959/analyzeSequenceData.py.txt
Clarification of Answer by andyt-ga on 09 Jun 2006 20:21 PDT Solution link is http://tunebounce.com/answers/406959/analyzeSequenceData.txt My web host couldn't handle the .py.

Comments

There are no comments at this time.

Important Disclaimer: Answers and comments provided on Google Answers are general information, and are not intended to substitute for informed professional medical, psychiatric, psychological, tax, legal, investment, accounting, or other professional advice. Google does not endorse, and expressly disclaims liability for any product, manufacturer, distributor, service or service provider mentioned or any opinion expressed in answers or comments. Please read carefully the Google Answers Terms of Service.

If you feel that you have found inappropriate content, please let us know by emailing us at answers-support@google.com with the question ID listed above. Thank you.

Search Google Answers for

Google Home - Answers FAQ - Terms of Service - Privacy Policy