my quesiton concerns the analysis of sequence data. i try to find
sequences in clickstreams of a weblog.
my files looks like this:
user 1: 100 1200 200 1308 309
user 2: 200 200 1308 309 509
...
user n
each number corresponds to a click in an database. "200 1308 309"
would be a sequence occuring for user1 and user2.
given this intput i need some output telling me whether there are
sequences which iterate.
i THINK the following tool does the same for dna code:
http://www.hip.harvard.edu/informatics/programs/JAVA%20BLAST%20Parser.html
anybody who can provide a software solution for this problem
(preferential running on windows).
best
amos |
Request for Question Clarification by
andyt-ga
on
27 Sep 2004 11:35 PDT
Hi amos88,
After a quick look at the BLAST tool, I think it would be easiest to
write a custom solution in a scripting language, rather then modify
BLAST. Would something written using PHP, Perl, or Python meet your
needs? All of these languages are available for Windows and can be
run from the command line.
Something to the effect of
>>analyzeSequenceData inputFile outputFile
|
Clarification of Question by
amos88-ga
on
27 Sep 2004 23:02 PDT
hi andyt,
i have no preferences concerning the language.
best
amos88
|
Clarification of Question by
amos88-ga
on
30 Sep 2004 19:53 PDT
hi andyt!
how are things?
i did not mention that time (as for the most of us) is quite valueable
for me at the moment (have some deadlines to keep). can you already
state an estimated time for answering the problem?
thanks a lot
amos
|
Request for Question Clarification by
andyt-ga
on
30 Sep 2004 23:30 PDT
Amos,
Another GA staffer had been working on this problem over the past
couple of days, although the lock is now off. I've spent some time
attempting it, but it's turning out to be more of a complex problem
then I thought. I'm going to be pretty busy over the weekend, so I
doubt I'll be able to spend much time on it. So in the hopes that it
will help someone else to help you, I'll post what I have. If no one
else wants to give it a try, I'll continue working on it, but I don't
want to give an estimated time of delivery at this time.
I think the solution involves generating all possible subsequences of
each user, for instance for user 1 that would be [[100, 1200], [100,
1200, 200], [100, 1200, 200, 1308], [100, 1200, 200, 1308, 309],
[1200, 200], [1200, 200, 1308], [1200, 200, 1308, 309], [200, 1308],
[200, 1308, 309], [1308, 309]]
Then comparing that new list against every other user's subsequence list.
Here's some python code I came up with to generate all subsequences
given an input list li:
li = [100, 1200, 200, 1308, 309]
liSub=[] #list to hold output subsequences
count=0
for i in range(len(li)-1):
for j in range(len(li) -1 - i):
liSub.append(li[i:j+2+count])
count=count+1
print liSub
The problem with this approach is that it is very computationally
intensive as the number of users increases, and I'm guessing there may
be a better way to do it.
Thanks for your patience and understanding,
Andy
|
Request for Question Clarification by
hedgie-ga
on
01 Oct 2004 00:34 PDT
To feasibility of Andy's suggestion to
generate all subsequences
depends on length of your sequence - which is a parameter you did not
tell us. In Sequence of length N there are about N*N/2 subsequences ..
Now- to find
repetitions (I assume that's what you mean by 'which iterate')
you probably have to keep subsequences in memory.
Some spellcheckers do that (they build a sorted sub-dictionary of words in
an article) and then compare it it a full dictionary. That, the first part,
is what you need - and you want it in c or c{{ 0 bir uin a scripting language.
Such programs are around.
|
Request for Question Clarification by
hedgie-ga
on
01 Oct 2004 00:36 PDT
correction,
c{{ 0 bir uin a
should be
inc or in c++, not in a scripting language
(it is CPU intensive)
|
Clarification of Question by
amos88-ga
on
01 Oct 2004 00:40 PDT
andy, hedgie!
thanks for your suggestions / comments.
the length of the sequences are group dependent: the "shortest" group
has sequences of 9 items, a "medium" group sequences of 20 and the
"longest" group sequences of max. 40.
i need a solution for at least the first two, i could skip the largest
group if necessary.
best
amos
|
Request for Question Clarification by
hedgie-ga
on
01 Oct 2004 05:01 PDT
Amos,
allow me to rephrase the task to see if I understand this:
--------------------
There is large number Nl of log sequences
(lets say 100 users every day generate 100 sequences so in a month
we have 3000 of them)
each has length N which is 20 or so and consists of integers
which are in range 1 -- Nr (I see Nr ~10000 from your examples)
This is your Ensemble to be analysed.
We are interested in extracting a list of Repetitive Subsequences
of maximal lenght Na
lets call them Acts (or activities)
Act is sub sequence of your integer-sequences SHORTER then Na
Frequency f f(act) says how many times act was fount in the Ensemmble.
List all acts, sorted by frequency f, where f>1
-------------------------------------
There is Nr to the power of Na of all possible acts. If that
number is small (relative to RAM of your machine)
task is simple. You all acts then and just run a counter:
Count(act)++ (in c) as you read once your Ensemble
In this number is large, you need to build a binary tree of
those acts which do ocur in the ensamble.
Question then is:
case 1: Do you want references to programs which do that for
words (made of 25 english letters)
and which needs to be adapted (your "words" Acts, are made of Nr letters.
Are you able and willing to adapt them?
Or, case 2:
do you want a complete program, ready to run, with unlimited
technical support aftertward?
Couple references (which would have to be extended andd sorted
and tried for case 1) as sample are here:
http://pyflag.sourceforge.net/
http://javalinux.rallylobster.com/CDROM/Chapter02/Examples/wordtree.C
This is the online version of the CD-ROM for Java Programming on
Linux, by Nathan Meyers
http://javalinux.rallylobster.com/CDROM/#Sources
Case 2, I am afraid cannot be had for $100
So - how do you plead?
|
Clarification of Question by
amos88-ga
on
01 Oct 2004 07:05 PDT
hedgie,
>There is large number Nl of log sequences (lets say 100 users every
day >generate 100 sequences so in a month we have 3000 of them)
things are not that bad: I got 330 log sequences in six subgroups
(analysis will be run for each subgroup with approx. 330/6 sequences)
>each has length N which is 20 or so and consists of integers which
are in >range 1 -- Nr (I see Nr ~10000 from your examples)
0 < N < 40 and the range is 0 < Nr < 100 (i can recode the numbers
from the example)
>We are interested in extracting a list of Repetitive Subsequences of
maximal >lenght Na lets call them Acts (or activities)
>Act is sub sequence of your integer-sequences SHORTER then Na
hmm - as far as I can see it Na (which is equal to acts) is shorter
than lgo sequence ?!
>Frequency f f(act) says how many times act was fount in the
Ensemmble. List >all acts, sorted by frequency f, where f>1
right.
i would plead for case 2.
can you please tell me a price?
best
amos
|
Request for Question Clarification by
andyt-ga
on
02 Oct 2004 22:00 PDT
Hi Amos,
For the output for this, do you want to include matches for all
possible subsequences, or just the largest unique ones. So for
instance with input:
user 1: 100 1200 200 1308 309 56
user 2: 200 200 1308 309 56 509
output of:
user 1, user 2: 1308 309 56
user 1, user2: 1308 309
user 1, user 2: 309 56
OR
user 1, user 2: 1308, 309 56
If you want the output of all the sequences, I have the code ready
now; otherwise I'll have to do some additional filtering of the
output, which I probably won't be able to have finished until the
middle of next week.
Right now, with 50 users and 20 sets of data each, it takes about 4
minutes on my 867 G4 with a gig of ram to run the script, so I think
it's doable with that amount of data.
Andy
|
Request for Question Clarification by
hedgie-ga
on
02 Oct 2004 23:03 PDT
Hello again Amos
So, as Andy just demonstrated, it can be done for $130. That's great.
I just want to reach completion and clarity on that parameter Na I
introduced. It is not necessary to have it; in that case we are
looking for 'acts' of any length, limited in fact by N .. by the
lengths of log sequences
(about 20). That works too.
The idea if ignoring (in the first pass at least) all acts longer than Na
(may be 5 or 8) had to do with processing time and also with ease of reading
the output: There are going to be few log sequences with repeated Acts
which are long, longer than Na. It may be easier to look at them
directly, then having all their subsequences listed. It may be
expedient to sort them out
and do a second pass (with bigger Na) on this smaller ensemble..
Anyway. Good job, Andy.
Hedgie.
|
Clarification of Question by
amos88-ga
on
03 Oct 2004 00:08 PDT
hi,
this soudns very good.
i am not sure about the longest sequence issue without seeing the
data. i suggest the following: i take the current version and accept
the question as answered - if it turns out that the output is simply
overwhelming with my data i would ask for solution 2 (longest sequence
only). (don't know yet exactly how we do this - but we'll find out)
does this sound good?
best
amos
|
Amos,
Attached is a python script that will process an input file of user
sequence data, and an output a file of identical subsequences between
users. To use this on Windows, you will need to install a windows
version of Python. You can find a binary download at
http://www.python.org/download.
Once Python is installed, you invoke the script below by saving it in
a text file and calling it like:
python.exe analyzeSequenceData.py inputFileName outputFileName
I have formatted the output in the form:
[['user 4', 'user 1'], ['100', '1200']]
with each new match on a new line. It prints all matches, even if the
match is a subset of a larger match. If you require that to be
modified to only include the largest unique matches, just post a
clarification request and I'll try to get it modified for you.
You can download the script at
http://tunebounce.com/answers/406959/analyzeSequenceData.py.txt (you
must remove the .txt to run it) or copy and paste the code below into
a text editor.
Thanks for your great question, I had a lot of fun working on it.
-Andy
#usage:
#python analyzeSequenceData.py inputFileName outputFile
#input file in the format:
#user 1: 100 1200 200 1308
#user 2: 200 200 1308 309
#...
#user n: 32 543 53 52 89
import sys
#opening file
#and creating lists
f = open(sys.argv[1]);
inputList=[]
for line in f:
line = line.replace("\n", "")
line = line.strip()
inputList.append(line.split(":"))
f.close()
#cleans trailing/leading spaces and splits on interior spaces
for i in range(len(inputList)):
inputList[i][1] = inputList[i][1].strip()
inputList[i][1] = inputList[i][1].split(" ")
#generating subsequences of lists
seqList=[]
for li in inputList:
liSub=[]
count=0
for i in range(len(li[1])-1):
for j in range(len(li[1]) -1 - i):
liSub.append(li[1][i:j+2+count])
count=count+1
seqList.append([li[0], liSub])
#finding matches
matches=[]
for q, i in zip(seqList, range(len(seqList))):
for r in q[1]:
#list of all possible sequences, not including the one
#being compared against
allOtherSeq = seqList[0:i]+seqList[i+1:len(seqList)]
for s in allOtherSeq:
for t in s[1]:
if t == r:
found = 0
for v, ii in zip(matches, range(len(matches))):
if t == v[1]:
#if user is not in the match list
#but it is matching the sequence modify
#the user list to include that user
#if the user isn't in the matches list
#add it
if s[0] not in v[0]:
v[0].append(s[0])
found = 1
if found == 0:
matches.append([[s[0], q[0]], t])
#print output to file
fOut = open(sys.argv[2], 'w')
for x in matches:
print >> fOut, x
fOut.close() |
Request for Answer Clarification by
amos88-ga
on
03 Oct 2004 11:01 PDT
hi
thanks a lot for your program.
i tried to run it on a small dataset and it worked out fine however on
the actual data there seems to be a problem: an output file is
generated but it is empty. my input file looks like:
2541: 28 42 36 37 30 43 34 49 32
2535: 33 32 43 29 36 31 30 35
2523: 33 32 35 34 29 28 84 63 62
any idea?
thanks
amos
|
Clarification of Answer by
andyt-ga
on
03 Oct 2004 11:38 PDT
Yep, since the input is separated by more then one space, there is a
small modification needed:
this line:
inputList[i][1] = inputList[i][1].split(" ")
should be changed to:
inputList[i][1] = inputList[i][1].split()
I uploaded the new version to
http://tunebounce.com/answers/406959/analyzeSequenceData.py.txt
|
Request for Answer Clarification by
amos88-ga
on
03 Oct 2004 13:59 PDT
thanks!
this is ok now.
question: what is the maximum length for a sequence - i get a "list
index out of range error" with a quite large file
thanks for your help
amos
|
Clarification of Answer by
andyt-ga
on
03 Oct 2004 14:46 PDT
I didn't test it with anything larger then 50...what size are we
talking here? I didn't put any arbitrary limit, and as far as I know
it Python should allow for arbitrarily long lists, up to the limits of
your machine's processing power. I can try to duplicate it on my end.
Also, there will be an error of that type if the file contains any
new lines without data. Let me know if it was the new line issue, or
the number of users that you're having the problem with.
-Andy
|
Clarification of Answer by
andyt-ga
on
03 Oct 2004 15:06 PDT
I put some error checking that should handle the problem with lines
that don't contain data. It's uploaded to
http://tunebounce.com/answers/406959/analyzeSequenceData.py.txt
|
Clarification of Answer by
andyt-ga
on
09 Jun 2006 20:21 PDT
Solution link is http://tunebounce.com/answers/406959/analyzeSequenceData.txt
My web host couldn't handle the .py.
|