I am facing the following problem: in a database I have got N
information items (datasets) which can be searched through by users. I
can generate a list of accessed items e.g. items 1,10,22,15....130
(this is like a clickstream in web site analysis) were accessed for
each user.
the ideal solution now would be an analysis that shows me an average
"path" through the data. the mentioned clickstream analysis would be
the ideal solution but the problem is, that the server logs do not get
all the necessary information (because the whole process is database
driven) and therefore the analysis does not work ...
as a second analysis a "types" of searches would be interesting. i can
categorize information e.g., items 1 to 10 are category A items 2 to
40 are category B and so on. is it possible to extract information
search types when a user 1 accesses cat A 3 times and B 2 times in
comparison to user 2 who accesses A 1 and B 8 ....
i am aware that these ideas might sound vague :-)
thanks for your help
best
markus |
Request for Question Clarification by
mathtalk-ga
on
30 Jul 2004 11:47 PDT
Hi, amos88-ga:
Is the idea to try to produce "the" average sequence of accessed
items, or to develop a notion of how far a given access sequence
deviates from "average"?
Perhaps your log contains session cookies that allow estimating the
probability that if an item J is accessed, subsequently in the same
session item K will be accessed?
regards, mathtalk-ga
|
Clarification of Question by
amos88-ga
on
31 Jul 2004 05:00 PDT
hi mathtalk,
i have got different search tasks for the user - it is interesting for
me how average sequences change when the instruction is changed.
task 1 - average sequence 1
task 2 - average sequence 2
task 3 - average sequence 3
is there a difference between the three (the database underneath is of
course the same)
non session cookies i am afraid -> really just a sequence for each user in a task.
**************************
another way seeing it in a binary mode:
assuming in my dataset are 5 informations:
USER 1 USER 2 USER 3
i1 1 0 1
i2 0 1 1
i3 1 1 0
i4 0 0 0
i5 0 0 1
1 - hit
0 - no hit
question: are there differences between users?
does this help?
cheers
amos
|
Request for Question Clarification by
mathtalk-ga
on
31 Jul 2004 05:50 PDT
Hi, amos:
It looks like we cannot derive "sequence" information, but there is
"correlation" data: Given that a user has accessed one item, what
other items are "popular".
One approach might be do a "tree clustering", ie. to group together
items that have the strongest correlations, continuing to group groups
together until only one group remains. These groupings create a tree
that ultimately lumps all items together.
While such a tree is fairly easy to draw by itself, I would guess
comparing how two such trees are different is not easily done by
visual inspection (unless the changes happen to be localized in one
part of the tree). Still, if a change in your instructions seems to
cause a large scale rearrangement in the tree this might be worth
"seeing" (as would the case that changes cause little or no
rearrangement in clustering).
Such an approach might also help address your interest in a "types" of
search analysis. I or another Researcher would be glad to post some
algorithmic details and references if this woud be of interest.
regards, mathtalk-ga
|
Clarification of Question by
amos88-ga
on
31 Jul 2004 08:10 PDT
hi again,
maybe we are talking of different things when we mean sequence - just
to be sure ...
i can get sequence information (not from a session cookie, but from my
database) - a sequence of eg IDs of accessed information is no problem
to generate.
in any case - the tree clustering idea sounds interesting, although
the expected change between tasks will not be big, i am afraid.
cheers
amos
|
Request for Question Clarification by
mathtalk-ga
on
01 Aug 2004 20:06 PDT
By "sequence" I meant the order in which items are accessed by a
particular user in a visit. It seemed to me at first that your focus
was on the order in which a "path" was taken through the available
data items in your site database.
Then I got (apparently mistakenly) the notion that such "sequence"
data was not readily available, as it is not reflected in the 'binary
mode' sample (which shows only items hit/not hit for three users).
regards, mathtalk-ga
|
Clarification of Question by
amos88-ga
on
02 Aug 2004 01:02 PDT
new week :-)
hi again,
ok - we are getting closer i think.
i got sequence information (user 1 accessed items 1,15,2,3,74; user 2
accessed items 20,14,144,23,3) which can be transfered into the binary
example.
in the "type" of surfer case a mean sequence is interesting for
describing user's behavior between different tasks.
any idea about the time series - is it a case of this approach?
cheers
a
|
Request for Question Clarification by
mathtalk-ga
on
02 Aug 2004 08:52 PDT
Time series are normally used in connection with quantitative data,
e.g. the price of coffee varying from month to month. Time series can
be analyzed for trends and for correlations.
Your numbered items appear to be "nominal" data. That is the number
associated with an item merely identifies it, without implying special
significance to the magnitude of the numbers. E.g. 10 < 22 has no
intrinsic meaning in terms of relations between the two items so
labelled.
Perhaps you are trying to "group" not so much the database items as
the users of your Web site. We might think in terms of "strength" of
similarity:
1. Two users whose "paths" through database items, given the same
task, are similar both in sequence and in the set of items browsed.
2. Two users whose paths are similar in the set of items browsed but
differ a good bit as to sequence.
3. Two users whose paths are not similar either in sequence or in set
of items browsed.
Of course users might be "similar" for one task and not for others. I
guess it must depend on the purpose of the analysis how detailed it
would be worthwhile to be in distinguishing groups of users (or groups
of items).
With sequence information one can model the space of "paths" as a
Markov chain model. This would approximate your notion of the
"average path" as it would allow the calculation of the likelihood of
any given path through the data.
If you have sequence data, then does it happen that a user will
sometimes "revisit" an item already viewed?
regards, mathtalk-ga
|
Clarification of Question by
amos88-ga
on
02 Aug 2004 09:36 PDT
>Time series are normally used in connection with quantitative data,
>e.g. the price of coffee varying from month to month. Time series can
>be analyzed for trends and for correlations.
right - this makes sense - stupid nomial data ;-)
it's exactltly the idea of grouping the users ...
1. right
2. did not see this yet - but yes
3. exactly
>If you have sequence data, then does it happen that a user will
>sometimes "revisit" an item already viewed?
users see items they have already accessed - in a way you see items in
a shopping bag with eg. amazon ...
so it can happen that they access an item for a second time to make
sure what it is about.
best
amos
|
Request for Question Clarification by
mathtalk-ga
on
13 Aug 2004 14:24 PDT
Hi, amos88-ga:
Perhaps I've been distracted by other problems I've been working on
that involve Markov chains and the like. I work with a lot of folks
who deal with problems of sequence alignments and matching of
"strings". So today I began to think of your problem as being of an
abbreviated form of this problem.
Given the sequences of items chosen by different users for a task, I
think we can define a "distance" between them in a fairly natural way.
The distance between two identical choices would be zero. If two
choices differ in the order of two adjacent selections (one right
after the other), this (it seems to me) would be a slight difference
(and a correspondingly small distance). Omitting an item from one
list (but otherwise the same sequence) would be a greater difference
(but not much). Substituting one unique item for another (at the same
point in the sequences) would be a still greater difference.
The goal would then be to find (using the cumulative distances as
"penalties") the parsimonious distance from one user's sequence to
another's.
In measuring the distance between two groups (as opposed to between
two individuals), one simple approach is to use the minimum distance
between two individuals belonging to the two groups. Maybe a better
measure would be the median distance between all discordant pairs (one
from each group) from the two groups. Both of these approaches would
take advantage of an initial computation of all distances between all
pairs of individuals' selections (for a fixed task).
Certainly this approach would provided the clustering of users that
seems to be of interest to you. Whether it provides valuable insights
is hard to judge without doing the experiment with some real data.
Please kick these ideas around in your own mind, and let me know what
you think. A number of parameters would need to be decided, e.g. the
penalty for an omitted item vs. the penalty for a substituted item,
etc. But I think this could be programmed in a practical way. This
sort of matching with minimum penalty is the bread and butter of a lot
of the processes around me at work, and since the number of items
involved in any one task you deal with is fairly modest, I have to
believe it is computationally feasible to do.
regards, mathtalk-ga
|
Clarification of Question by
amos88-ga
on
16 Aug 2004 07:51 PDT
dear mathtalk,
thanks for your brain-work and the ideas you posted.
i think we are a big step nearer to the goal :-)
in discussions with a friend (mainly he - but lets say) we developed
some ideas based on your approach - i asked him to post them as a
comment into this thread ... just to clarify who is who
cheers
amos
|
Clarification of Question by
amos88-ga
on
16 Aug 2004 09:29 PDT
still me ...
had another disucssion about the distances approach.
my aim is to show differences between tasks - this means that apriori
users should click on (not exactly) the same information (but closely
connected ones) in one task -> doesn't this bring us into trouble with
the distances idea which should result in a very small distance for a
given task. compared to the others this measure would give me more a
"regression" than a value to compare two tasks ...
still the penalties idea would bring a solution - my problem is the
definition of such penalties in connection to my tasks - there are
some but not sufficient predictions for a sequence of information ...
anyway - i will paste in my friends summary now - which is much more
mathematical than my writing anyway BUT before the insight with the
distances problem.
cheers
a
*********
Hi Amos, Hi Mathtalk,
first of all I'd like to clarify the following. I know amos personally and
he showed me what he's currently working on and as I found it quite
interesting I spent some time thinking on that problem, so, I'm not after
the reward set for this question ;)
I guess I'm mainly referring to Mathtalk's last post on 08/13/04.
Just to have things clear and understandable I quickly restate what we've
got.
We got a set of information items (numbered from 1 to n, i.e. I_1,...,I_n)
that can be viewed by a user.
Further more we got a set of categories (C_1,...,C_p). Every information
item I_i belongs to exactly one category.
By S_k I'll denote the kth user's path through the set of information, i.e.
an ordered set consisting of information items (basically these are IDs of
information items but to avoid complicating things we assume that these are
the actual items).
This was called sequence so far, a sequence/path reads then as
S_k=(I_2,I_214,I_120,...)
Abstractly speaking we got different ways (different tasks which all have
the same underlying information items and structure) to generate those
sequences S_i.
What we'd like to show is that there's a difference in the way the user
accessed the information according to setting we had for this particular
user (the group he's in).
The main problem in my eyes is that there is no "natural way" of aggregation
of user's sequences within a group.
What mathalk suggested is to consider some sort of distance. According to
some "live discussion" with amos it seems reasonable to give it try on real
data.
To me there are some three ways of seeing this.
The first thing that came to my mind was to use the distance that is used
for instance on the space of binary sequences, i.e.
that is to compare two user's path by defining
d_i = 0 if the ith Information coincides, 1/2^i if not.
I'm fully aware of the fact that the weighting factor 1/2^i has it's origin
only in mathematics and a phenomenological interpretation might get
difficult.
So, the distance d(S_k,S_l) would then be sum of all d_i.
To me it seems that one point in using such "heavy weighting" factor is to
overcome the problem that the sequences may differ in length.
Another approach would be (inspired by mathtalk) to define specific
penalties for every information item I_k. That means that you still
calculate the difference
d(S_k,S_l) by comparing the actual sequence but weight them differently
(according to whether information items are "close or not" which could
probably be done by using the categorie, I suppose that the information in
the same category is some kind of similar, thus close).
Still we got one further problem, namely the aggretation of distances
calculated and the comparison between the groups. Possible solutions are --
as said before -- minimum distance (which makes the least sence to me
personally)
, median and probably even average.
The approach above leads to another possible way: According to amos there's
some theoretical model which suggests a predefined paths through the
information items.
What could then be done is to define penalties according to the "gravity of
deviation" of the "predefined path".
What's been done so far is do attribute penalites in a static way. Mathtalk
suggested to do it kind of dynamically.
That is if two sequences differ only in order (e.g.
S_1=(...,I_2,I_3,I_4,....) and S_2=(...,I_4,I_3,I_4)) this should correspond
to a minor distance
wheras omitting an item should correspond to a bigger difference. A fairly
natural way here would be the number of clicks (i.e. the difference of
positions of the items in question)
with some upper bound i.e. indicating that either there were too many clicks
(to big difference of positions of the items) or that one user accessed this
specific item and another didn't.
The latter solution is kind of inversing the order: Given the information
items we look for the position (when it was accessed) it is on in a user's
sequence.
To me this seems to be slighty more complicated to implement but as the
amount of data is not outreagous it could be done quite straightforward by
some kind of brute force algorithm.
Another approach which would be quite natural as well is to only compare
distances within items of one category. This would prevent outliers from
having a too big contribution to the "overall mean".
What seems to be the problem here is that my "dynamic analysis" / "dynamic
penalty strategy" would need some refinement as one user could leave a
category and "come back to that category" later which would result in
reaching the upper bound.
To me right now there are tow sub-questions open:
1) Which approach admits the best phenomenological fit?
2) What would be the best technique to compare different groups (the overall
comparison or the sub-comparison at level of categories).
|
Request for Question Clarification by
mathtalk-ga
on
16 Aug 2004 11:33 PDT
Thoughtful (and thought-provoking!) comments.
Let me ask something that occurred to me while reading the above. Are
the information items available the same for different task? I had
assumed that the items were typically unique to the task assigned, but
it dawned on me from the above description that the comparison of
tasks on the basis of a selection of information items requires a
shared set.
regards, mathtalk-ga
|
Clarification of Question by
amos88-ga
on
16 Aug 2004 12:50 PDT
ah - its funny how such basic things miss the chance to communicated ;-)
the answer about the information items is, well, yes and no...
there are 3 tasks (a,b,c) with 3 different sets of items BUT the basic
structure of the tasks (it is a decision task) is the same for the
three and the information items are mapped i.e. there is a
corresponding item for item 1/task a in task b and c. for means of
comparison i assume that the items in the different tasks provide the
same information to the user - they difference is only in task texts.
so thinking again about your question - the answer is: they are the
same.
cheers
a
|