Google Answers: statistics web

View Question

Q: statistics web ( No Answer, 6 Comments )

Question

Subject: statistics web
Category: Science > Math
Asked by: amos88-ga
List Price: $40.00

Posted: 30 Jul 2004 01:32 PDT
Expires: 29 Aug 2004 01:32 PDT
Question ID: 381198

I am facing the following problem: in a database I have got N information items (datasets) which can be searched through by users. I can generate a list of accessed items e.g. items 1,10,22,15....130 (this is like a clickstream in web site analysis) were accessed for each user. the ideal solution now would be an analysis that shows me an average "path" through the data. the mentioned clickstream analysis would be the ideal solution but the problem is, that the server logs do not get all the necessary information (because the whole process is database driven) and therefore the analysis does not work ... as a second analysis a "types" of searches would be interesting. i can categorize information e.g., items 1 to 10 are category A items 2 to 40 are category B and so on. is it possible to extract information search types when a user 1 accesses cat A 3 times and B 2 times in comparison to user 2 who accesses A 1 and B 8 .... i am aware that these ideas might sound vague :-) thanks for your help best markus
Request for Question Clarification by mathtalk-ga on 30 Jul 2004 11:47 PDT Hi, amos88-ga: Is the idea to try to produce "the" average sequence of accessed items, or to develop a notion of how far a given access sequence deviates from "average"? Perhaps your log contains session cookies that allow estimating the probability that if an item J is accessed, subsequently in the same session item K will be accessed? regards, mathtalk-ga
Clarification of Question by amos88-ga on 31 Jul 2004 05:00 PDT hi mathtalk, i have got different search tasks for the user - it is interesting for me how average sequences change when the instruction is changed. task 1 - average sequence 1 task 2 - average sequence 2 task 3 - average sequence 3 is there a difference between the three (the database underneath is of course the same) non session cookies i am afraid -> really just a sequence for each user in a task. ************************** another way seeing it in a binary mode: assuming in my dataset are 5 informations: USER 1 USER 2 USER 3 i1 1 0 1 i2 0 1 1 i3 1 1 0 i4 0 0 0 i5 0 0 1 1 - hit 0 - no hit question: are there differences between users? does this help? cheers amos
Request for Question Clarification by mathtalk-ga on 31 Jul 2004 05:50 PDT Hi, amos: It looks like we cannot derive "sequence" information, but there is "correlation" data: Given that a user has accessed one item, what other items are "popular". One approach might be do a "tree clustering", ie. to group together items that have the strongest correlations, continuing to group groups together until only one group remains. These groupings create a tree that ultimately lumps all items together. While such a tree is fairly easy to draw by itself, I would guess comparing how two such trees are different is not easily done by visual inspection (unless the changes happen to be localized in one part of the tree). Still, if a change in your instructions seems to cause a large scale rearrangement in the tree this might be worth "seeing" (as would the case that changes cause little or no rearrangement in clustering). Such an approach might also help address your interest in a "types" of search analysis. I or another Researcher would be glad to post some algorithmic details and references if this woud be of interest. regards, mathtalk-ga
Clarification of Question by amos88-ga on 31 Jul 2004 08:10 PDT hi again, maybe we are talking of different things when we mean sequence - just to be sure ... i can get sequence information (not from a session cookie, but from my database) - a sequence of eg IDs of accessed information is no problem to generate. in any case - the tree clustering idea sounds interesting, although the expected change between tasks will not be big, i am afraid. cheers amos
Request for Question Clarification by mathtalk-ga on 01 Aug 2004 20:06 PDT By "sequence" I meant the order in which items are accessed by a particular user in a visit. It seemed to me at first that your focus was on the order in which a "path" was taken through the available data items in your site database. Then I got (apparently mistakenly) the notion that such "sequence" data was not readily available, as it is not reflected in the 'binary mode' sample (which shows only items hit/not hit for three users). regards, mathtalk-ga
Clarification of Question by amos88-ga on 02 Aug 2004 01:02 PDT new week :-) hi again, ok - we are getting closer i think. i got sequence information (user 1 accessed items 1,15,2,3,74; user 2 accessed items 20,14,144,23,3) which can be transfered into the binary example. in the "type" of surfer case a mean sequence is interesting for describing user's behavior between different tasks. any idea about the time series - is it a case of this approach? cheers a
Request for Question Clarification by mathtalk-ga on 02 Aug 2004 08:52 PDT Time series are normally used in connection with quantitative data, e.g. the price of coffee varying from month to month. Time series can be analyzed for trends and for correlations. Your numbered items appear to be "nominal" data. That is the number associated with an item merely identifies it, without implying special significance to the magnitude of the numbers. E.g. 10 < 22 has no intrinsic meaning in terms of relations between the two items so labelled. Perhaps you are trying to "group" not so much the database items as the users of your Web site. We might think in terms of "strength" of similarity: 1. Two users whose "paths" through database items, given the same task, are similar both in sequence and in the set of items browsed. 2. Two users whose paths are similar in the set of items browsed but differ a good bit as to sequence. 3. Two users whose paths are not similar either in sequence or in set of items browsed. Of course users might be "similar" for one task and not for others. I guess it must depend on the purpose of the analysis how detailed it would be worthwhile to be in distinguishing groups of users (or groups of items). With sequence information one can model the space of "paths" as a Markov chain model. This would approximate your notion of the "average path" as it would allow the calculation of the likelihood of any given path through the data. If you have sequence data, then does it happen that a user will sometimes "revisit" an item already viewed? regards, mathtalk-ga
Clarification of Question by amos88-ga on 02 Aug 2004 09:36 PDT >Time series are normally used in connection with quantitative data, >e.g. the price of coffee varying from month to month. Time series can >be analyzed for trends and for correlations. right - this makes sense - stupid nomial data ;-) it's exactltly the idea of grouping the users ... 1. right 2. did not see this yet - but yes 3. exactly >If you have sequence data, then does it happen that a user will >sometimes "revisit" an item already viewed? users see items they have already accessed - in a way you see items in a shopping bag with eg. amazon ... so it can happen that they access an item for a second time to make sure what it is about. best amos
Request for Question Clarification by mathtalk-ga on 13 Aug 2004 14:24 PDT Hi, amos88-ga: Perhaps I've been distracted by other problems I've been working on that involve Markov chains and the like. I work with a lot of folks who deal with problems of sequence alignments and matching of "strings". So today I began to think of your problem as being of an abbreviated form of this problem. Given the sequences of items chosen by different users for a task, I think we can define a "distance" between them in a fairly natural way. The distance between two identical choices would be zero. If two choices differ in the order of two adjacent selections (one right after the other), this (it seems to me) would be a slight difference (and a correspondingly small distance). Omitting an item from one list (but otherwise the same sequence) would be a greater difference (but not much). Substituting one unique item for another (at the same point in the sequences) would be a still greater difference. The goal would then be to find (using the cumulative distances as "penalties") the parsimonious distance from one user's sequence to another's. In measuring the distance between two groups (as opposed to between two individuals), one simple approach is to use the minimum distance between two individuals belonging to the two groups. Maybe a better measure would be the median distance between all discordant pairs (one from each group) from the two groups. Both of these approaches would take advantage of an initial computation of all distances between all pairs of individuals' selections (for a fixed task). Certainly this approach would provided the clustering of users that seems to be of interest to you. Whether it provides valuable insights is hard to judge without doing the experiment with some real data. Please kick these ideas around in your own mind, and let me know what you think. A number of parameters would need to be decided, e.g. the penalty for an omitted item vs. the penalty for a substituted item, etc. But I think this could be programmed in a practical way. This sort of matching with minimum penalty is the bread and butter of a lot of the processes around me at work, and since the number of items involved in any one task you deal with is fairly modest, I have to believe it is computationally feasible to do. regards, mathtalk-ga
Clarification of Question by amos88-ga on 16 Aug 2004 07:51 PDT dear mathtalk, thanks for your brain-work and the ideas you posted. i think we are a big step nearer to the goal :-) in discussions with a friend (mainly he - but lets say) we developed some ideas based on your approach - i asked him to post them as a comment into this thread ... just to clarify who is who cheers amos
Clarification of Question by amos88-ga on 16 Aug 2004 09:29 PDT still me ... had another disucssion about the distances approach. my aim is to show differences between tasks - this means that apriori users should click on (not exactly) the same information (but closely connected ones) in one task -> doesn't this bring us into trouble with the distances idea which should result in a very small distance for a given task. compared to the others this measure would give me more a "regression" than a value to compare two tasks ... still the penalties idea would bring a solution - my problem is the definition of such penalties in connection to my tasks - there are some but not sufficient predictions for a sequence of information ... anyway - i will paste in my friends summary now - which is much more mathematical than my writing anyway BUT before the insight with the distances problem. cheers a ********* Hi Amos, Hi Mathtalk, first of all I'd like to clarify the following. I know amos personally and he showed me what he's currently working on and as I found it quite interesting I spent some time thinking on that problem, so, I'm not after the reward set for this question ;) I guess I'm mainly referring to Mathtalk's last post on 08/13/04. Just to have things clear and understandable I quickly restate what we've got. We got a set of information items (numbered from 1 to n, i.e. I_1,...,I_n) that can be viewed by a user. Further more we got a set of categories (C_1,...,C_p). Every information item I_i belongs to exactly one category. By S_k I'll denote the kth user's path through the set of information, i.e. an ordered set consisting of information items (basically these are IDs of information items but to avoid complicating things we assume that these are the actual items). This was called sequence so far, a sequence/path reads then as S_k=(I_2,I_214,I_120,...) Abstractly speaking we got different ways (different tasks which all have the same underlying information items and structure) to generate those sequences S_i. What we'd like to show is that there's a difference in the way the user accessed the information according to setting we had for this particular user (the group he's in). The main problem in my eyes is that there is no "natural way" of aggregation of user's sequences within a group. What mathalk suggested is to consider some sort of distance. According to some "live discussion" with amos it seems reasonable to give it try on real data. To me there are some three ways of seeing this. The first thing that came to my mind was to use the distance that is used for instance on the space of binary sequences, i.e. that is to compare two user's path by defining d_i = 0 if the ith Information coincides, 1/2^i if not. I'm fully aware of the fact that the weighting factor 1/2^i has it's origin only in mathematics and a phenomenological interpretation might get difficult. So, the distance d(S_k,S_l) would then be sum of all d_i. To me it seems that one point in using such "heavy weighting" factor is to overcome the problem that the sequences may differ in length. Another approach would be (inspired by mathtalk) to define specific penalties for every information item I_k. That means that you still calculate the difference d(S_k,S_l) by comparing the actual sequence but weight them differently (according to whether information items are "close or not" which could probably be done by using the categorie, I suppose that the information in the same category is some kind of similar, thus close). Still we got one further problem, namely the aggretation of distances calculated and the comparison between the groups. Possible solutions are -- as said before -- minimum distance (which makes the least sence to me personally) , median and probably even average. The approach above leads to another possible way: According to amos there's some theoretical model which suggests a predefined paths through the information items. What could then be done is to define penalties according to the "gravity of deviation" of the "predefined path". What's been done so far is do attribute penalites in a static way. Mathtalk suggested to do it kind of dynamically. That is if two sequences differ only in order (e.g. S_1=(...,I_2,I_3,I_4,....) and S_2=(...,I_4,I_3,I_4)) this should correspond to a minor distance wheras omitting an item should correspond to a bigger difference. A fairly natural way here would be the number of clicks (i.e. the difference of positions of the items in question) with some upper bound i.e. indicating that either there were too many clicks (to big difference of positions of the items) or that one user accessed this specific item and another didn't. The latter solution is kind of inversing the order: Given the information items we look for the position (when it was accessed) it is on in a user's sequence. To me this seems to be slighty more complicated to implement but as the amount of data is not outreagous it could be done quite straightforward by some kind of brute force algorithm. Another approach which would be quite natural as well is to only compare distances within items of one category. This would prevent outliers from having a too big contribution to the "overall mean". What seems to be the problem here is that my "dynamic analysis" / "dynamic penalty strategy" would need some refinement as one user could leave a category and "come back to that category" later which would result in reaching the upper bound. To me right now there are tow sub-questions open: 1) Which approach admits the best phenomenological fit? 2) What would be the best technique to compare different groups (the overall comparison or the sub-comparison at level of categories).
Request for Question Clarification by mathtalk-ga on 16 Aug 2004 11:33 PDT Thoughtful (and thought-provoking!) comments. Let me ask something that occurred to me while reading the above. Are the information items available the same for different task? I had assumed that the items were typically unique to the task assigned, but it dawned on me from the above description that the comparison of tasks on the basis of a selection of information items requires a shared set. regards, mathtalk-ga
Clarification of Question by amos88-ga on 16 Aug 2004 12:50 PDT ah - its funny how such basic things miss the chance to communicated ;-) the answer about the information items is, well, yes and no... there are 3 tasks (a,b,c) with 3 different sets of items BUT the basic structure of the tasks (it is a decision task) is the same for the three and the information items are mapped i.e. there is a corresponding item for item 1/task a in task b and c. for means of comparison i assume that the items in the different tasks provide the same information to the user - they difference is only in task texts. so thinking again about your question - the answer is: they are the same. cheers a

Answer

There is no answer at this time.

Comments

Subject: Re: statistics web
From: mathtalk-ga on 31 Jul 2004 06:43 PDT

For a brief overview of clustering methods:

[Cluster Analysis]
http://sunsite.univie.ac.at/textbooks/statistics/stcluan.html

regards, mathtalk-ga

Subject: Re: statistics web
From: amos88-ga on 31 Jul 2004 09:10 PDT

great book about stats :-)
surfing around a little bit i was wondering if time-series would be a solution?
a

Subject: Re: statistics web
From: amos88-ga on 04 Aug 2004 23:05 PDT

Mathtalk - are you still there?
thanks
amos

Subject: Re: statistics web
From: mathtalk-ga on 05 Aug 2004 04:54 PDT

Hi, amos:

Yes, thinking... !

To do a clustering one needs a measure of distance (a "metric")
between the items to be clustered, and not only between individual
items but between the subsets of them that are formed as the
clustering proceeds.

Here we need to balance the degree of overlap in the number of items
chosen with the total number of items available.  For example, if two
users only chose one item each and that turns out to be the same item
in each case, then it would be a remarkable similarity.  On the other
hand, if each chooses all the possible items (eg. 15 of 15), then
necessarily the overlap will be complete.

Throwing in the sequence of items chosen and the need to handle a
group of users as well as an individual, and you have an interesting
challenge.

regards, mathtalk-ga

Subject: Re: statistics web
From: amos88-ga on 05 Aug 2004 06:12 PDT

*relieve* :-)

maybe the following info helps: the number of items between tasks
could be kept constant - but i guess if the problem of distance is
solved for one task then the problem of comparing disapears too ...

the case that two users choose the same item (which may also be true
for a lot of items) sounds interesting. the second case that they
choose all can be ruled out (there are too many of them).

i have to admit that i am trying to find a solution with multiple
sources - one concept that comes up again and again are "markov
chains" - does this help?


thanks for still thinking

a

Subject: Re: statistics web
From: amos88-ga on 17 Aug 2004 13:49 PDT

just found an interesting paper dealing with a very similar problem:

http://www.andrew.cmu.edu/user/alm3/papers/purchase%20conversion.pdf

with an appendix:
http://www.andrew.cmu.edu/user/alm3/papers/path%20analysis%20-%20technical%20appendix.pdf

cheers
a

Important Disclaimer: Answers and comments provided on Google Answers are general information, and are not intended to substitute for informed professional medical, psychiatric, psychological, tax, legal, investment, accounting, or other professional advice. Google does not endorse, and expressly disclaims liability for any product, manufacturer, distributor, service or service provider mentioned or any opinion expressed in answers or comments. Please read carefully the Google Answers Terms of Service.

If you feel that you have found inappropriate content, please let us know by emailing us at answers-support@google.com with the question ID listed above. Thank you.

Search Google Answers for

Google Home - Answers FAQ - Terms of Service - Privacy Policy