Q: Affinity grouping? ( Answered 5 out of 5 stars,   0 Comments )
Question  
Subject: Affinity grouping?
Category: Computers > Algorithms
Asked by: mxnmatch-ga
List Price: $20.00
Posted: 11 Dec 2002 18:33 PST
Expires: 10 Jan 2003 18:33 PST
Question ID: 123403
I need to know how to do affinity grouping of information in the
following circumstance (I've simplified it to the bare bones of what
needs to be done).

There's a table GameProfile which contains (userID, gameID). There are
about 20,000 gameIDs and perhaps 1,000,000 userIDs. The number of
gameIDs doesn't increase very quickly, but the number of userIDs may
continue to increase.

For any given gameID, I need to know which other games are most often
owned by users who also have that game. Essentially I'm trying to do
something like Amazon.com's affinity grouping, which lets you look at
the page for a book and see other books bought by people who also
bought that book.

How should I accomplish this? The straightforward way seems far too
inefficient to be practical: make a list of all the gameIDs, group the
gameIDs by userID, and then prune all userID groups that don't contain
the desired gameID. I would then group by gameID and take the most
commonly occurring gameIDs to be the ones with the greatest affinity.

Of course, then there's the problem of popular games which would end
up with an affinity to ALL gameIDs. I'd need to prune those somehow.

So, what should I do? Where can I get information on how to approach
this problem? All I can find on the web are articles that tell me how
important affinity grouping is in data mining. I can't find any
information on what algorithms I actually need to use to accomplish
this.

Request for Question Clarification by mathtalk-ga on 12 Dec 2002 13:48 PST
Hi, mxnmatch-ga:

I think the mathematical framework for what you want is conditional
probability.  For example, let's say that 90% of the game owners have
Simpson's Road Rage, and this is independent of whether or not they
own (say) GTA III, which is true of 40% of the game owners.

But among owners of GTA III, the number who have SimCity Used Car Lot
(I'm making stuff up now) is 30%, not a great number, but much higher
than the 15% of all game owners who have it.

So I think you are looking for a relative comparison of Pr(A|B) to
Pr(A), where A means owning game A, B means owning game B, and the
population is all game owners.
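
Concretely, with those made-up numbers:

Pr(SimCity Used Car Lot | GTA III) = 0.30  versus  Pr(SimCity Used Car Lot) = 0.15

so the comparison flags that game, whereas

Pr(Road Rage | GTA III) = 0.90 = Pr(Road Rage)

so owning GTA III tells us nothing extra about Road Rage.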

Once you have an exact mathematical criterion defined, then we can
proceed to discuss the most efficient database query to retrieve the
answer.  BTW, how are the table entries stored?

regards, mathtalk-ga

Clarification of Question by mxnmatch-ga on 12 Dec 2002 14:17 PST
I like that. I was originally just thinking of getting the most
commonly associated games and then just chopping off the top X in the
list, but I wasn't sure where to draw the line. Your suggestion makes
more sense, but the question is: how easily can it be done?

The data is stored in a simple table (I'm using Oracle):
  gameID  NUMBER
  userID  NUMBER
The original table has more columns than that, but they're not
relevant to this problem, and I intend to copy the two relevant
columns out to a temp table before operating on them anyway (to
minimize tying up the original table).

So, once I have Pr(A|B) and Pr(A), would I just take the difference
between the two, Pr(A|B) - Pr(A), and then order from largest to smallest?

I've already figured out a query which determines the number of times
that a particular game is associated with other games. To get Pr(A|B)
would I just take the count (cnt) and divide that by the total number
of users?

Here's the query:
SELECT
	g1.game g1Game, g2.game g2Game, count(*) cnt
FROM
	(select unum,game from tmp20021212_userGameColl where game=482099) g1,
	tmp20021212_userGameColl g2
WHERE
	g1.unum=g2.unum
	and g1.game!=g2.game
GROUP BY g1.game,g2.game
ORDER BY g1.game, count(*) desc, g2.game
;


"unum" is the actual name of the userID column and "game" is the
actual name of the gameID column. The table "tmp20021212_userGameColl"
is the temp table that contains those 2 columns.

Clarification of Question by mxnmatch-ga on 12 Dec 2002 14:18 PST
To be clear, that query only gets the counts for the game "482099". I
would have to run that query once for each game. It currently takes
about 9 seconds to run.
Answer  
Subject: Re: Affinity grouping?
Answered By: mathtalk-ga on 12 Dec 2002 21:10 PST
Rated:5 out of 5 stars
 
Hi, mxnmatch-ga:

I'm not sure if selecting the given columns into a smaller temp table
is necessarily the best strategy.  It seems likely to me that with
appropriate indexes on the existing table, the queries may execute
just as effectively.  Of course this is a matter for experimentation. 
For the sake of discussion I'll simplify the table name to usergame,
and we'll assume the compound primary key (userid, game).
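
For concreteness, here is a sketch of the layout I am assuming (the
constraint and index names are mine, purely illustrative):

CREATE TABLE usergame
(
   userid  NUMBER  NOT NULL,
   game    NUMBER  NOT NULL,
   CONSTRAINT pk_usergame PRIMARY KEY (userid, game)
);

-- a secondary index keyed on game supports the per-game counts below
CREATE INDEX ix_usergame_game ON usergame(game);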

Here is how we might compute, for a pair of games A and B, the
quantity:

 Pr(A|B) - Pr(A)

Note that by definition Pr(A|B) = Pr(A&B)/Pr(B).  Therefore four
counts are needed:

SELECT count(DISTINCT userid) AS cntU 
FROM usergame;

SELECT count(*) AS cntA 
FROM usergame 
WHERE game = A;

SELECT count(*) AS cntB 
FROM usergame 
WHERE game = B;

SELECT count(*) AS cntAB 
FROM usergame ug1
WHERE game = A
AND EXISTS
 ( SELECT * 
   FROM usergame
   WHERE userid = ug1.userid
   AND game = B );

In the event that we are trying to find the top ten games A having
affinity to a fixed game B, the first and third queries need only be
run once (for fixed B) and the results reused for all games A.  Indeed
the first query, which simply counts the total number of distinct
users in the table, may be reused across all such calculations of
affinity, while the second and third queries might properly be
combined (if a volume of such calculations is required) into:

SELECT game, count(*) 
FROM usergame
GROUP BY game;

whose output could be saved in the form of a summary table.
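
For example, the output of that query could be materialized as follows
(the table name gamecount is just for illustration):

CREATE TABLE gamecount AS
SELECT game, count(*) AS cnt
FROM usergame
GROUP BY game;

and refreshed whenever a new batch of affinity rankings is computed.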

In this fashion the real effort is reduced to the fourth query above.
But let us give the mathematical formula promised earlier:

Pr(A|B) = cntAB/cntB

Pr(A) = cntA/cntU

Pr(A|B) - Pr(A) = (cntAB/cntB) - (cntA/cntU)

 = (cntAB*cntU - cntA*cntB)/(cntB*cntU)

Now in terms of the rankings, the denominator here would be the same
for all values of Pr(A|B) - Pr(A) when game B is fixed.  So we might
as well look for the top ten values of the numerator:

cntAB*cntU - cntA*cntB

as these integer values are related to those expressed above in terms
of conditional probabilities by a simple scaling.  Depending on the
number of records involved, the all-integer computation might be
attractive or not (as when the numbers are so great as to cause
overflow in multiplying).  From the rough numbers mentioned in your
original question (a million userids), I don't think overflow would be
a problem in most cases, but if both A,B happened to be games owned by
most users, then a 4-byte Oracle integer would not be big enough to
hold results like one million squared.
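
Putting these pieces together, a single statement of roughly this shape
would rank candidate games A for a fixed game B (a sketch only; :gameB,
:cntB and :cntU are bind variables holding the fixed game, its owner
count, and the distinct-user count, and gamecount is the illustrative
summary table above):

SELECT ug2.game,
       count(*) * :cntU - gc.cnt * :cntB AS numerator
FROM usergame ug1, usergame ug2, gamecount gc
WHERE ug1.game = :gameB
  AND ug2.userid = ug1.userid
  AND ug2.game != :gameB
  AND gc.game = ug2.game
GROUP BY ug2.game, gc.cnt
ORDER BY numerator DESC;

This is essentially your 482099 query with the summary counts joined
in, so the database hands back the ranking directly.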

I'm unclear how much of the computation needs to be done by the
database.  Often the balance between database work and middleware work
can be tuned so that the arithmetic is done in the quickest fashion.
I'd love to hear more about the architecture of your program before
making a recommendation.

regards, mathtalk-ga

Clarification of Answer by mathtalk-ga on 12 Dec 2002 21:16 PST
The query you propose in your clarification is essentially the same as
my fourth query, done for all possible games A simultaneously (with B
= 482099).  If you can do this in ten seconds or so, the rest of the
computation could probably be done in an additional 3 to 5 seconds by
the database, or even faster if the number crunching is done in the
middleware.

-- mathtalk-ga

Clarification of Answer by mathtalk-ga on 13 Dec 2002 08:44 PST
Hi, mxnmatch-ga:

If you want I can give you a SQL only implementation, or sketch how to
combine the SQL results with some "C" code to compute the rankings. 
Which would best fit your programming model?

-- mathtalk-ga

P.S.  I enjoyed your question about Douglas Adams/Dirk Gently!

Request for Answer Clarification by mxnmatch-ga on 13 Dec 2002 16:57 PST
Here's the PL/SQL I came up with. Does this look OK? It now takes
about 6 seconds per game; the time went down when I added the
"HAVING count(*)>5".


DECLARE UNIQUEUNUMCNT NUMBER;
BEGIN
  SELECT count(DISTINCT unum) INTO UNIQUEUNUMCNT FROM userGameAffinityGames;

  dbms_output.ENABLE(10000);
  dbms_output.put_line('started: '||sysdate);


  -- Fill temp table with (unum,game) pairs
  /*
  CREATE TABLE userGameAffinityGames
  (
     unum  NUMBER(11)  NOT NULL,
     game  NUMBER(20)  NOT NULL
  );
  CREATE index userGameAffinityGames_ixUnum on userGameAffinityGames(unum);
  CREATE index userGameAffinityGames_ixGame on userGameAffinityGames(game);
  */
  DELETE FROM userGameAffinityGames;
  INSERT INTO userGameAffinityGames (unum,game)
  	SELECT unum, game FROM userGameProfile where isowned=1;
  COMMIT;
  dbms_output.put_line('finished copying unum,game pairs: '||sysdate);


  /*
  CREATE TABLE userGameAffinityCounts
  (
     game  NUMBER(20)  NOT NULL,
     cnt  NUMBER(20)  NOT NULL
  );
  CREATE index userGameAffinityCounts_ixGame on userGameAffinityCounts(game);
  */
  DELETE FROM userGameAffinityCounts;
  INSERT INTO userGameAffinityCounts (game,cnt)
  	SELECT game, count(*) cnt FROM userGameAffinityGames GROUP BY game;
  COMMIT;
  dbms_output.put_line('finished getting game counts: '||sysdate);



  FOR theGame IN (SELECT DISTINCT game FROM userGameAffinityGames) LOOP
    BEGIN
		DELETE FROM userGameAffinityGroupings WHERE game=theGame.game;

		INSERT INTO userGameAffinityGroupings (game,orderNum,gameWithAffinity)
		SELECT gB,rownum,gA
		FROM
		(
				SELECT
					gB, gA,
					(gBgACnt*uniqueUnumCnt) - (gAGameCnt*gBGameCnt) justNumerator
				FROM
					(
					SELECT
						g1Game gB
						,g2Game gA
						,gAgBCnts.cnt gBgACnt
						,gBGameCnts.cnt gBGameCnt
						,gAGameCnts.cnt gAGameCnt
					FROM
					(select c.game,c.cnt from userGameAffinityCounts c) gBGameCnts,
					(select c.game,c.cnt from userGameAffinityCounts c) gAGameCnts,
					(
						SELECT rownum rn, g1Game, g2Game, cnt
						FROM
						(
							SELECT
								g1.game g1Game, g2.game g2Game, count(*) cnt
							FROM
								(select unum,game from userGameAffinityGames where game=theGame.game) g1,
								userGameAffinityGames g2
							WHERE
								g1.unum=g2.unum
								and g1.game!=g2.game
							GROUP BY g1.game,g2.game
							HAVING count(*)>5
							ORDER BY g1.game, count(*) desc, g2.game
						)
					) gAgBCnts
					WHERE
						gAGameCnts.game=gAgBCnts.g1Game
						AND gBGameCnts.game=gAgBCnts.g2Game
				)
				ORDER BY justNumerator DESC
		) WHERE rownum <= 10
		;

		COMMIT;

    END;
  END LOOP;
  dbms_output.put_line('finished '||sysdate);

END;

Clarification of Answer by mathtalk-ga on 13 Dec 2002 21:43 PST
This isn't my full analysis, but one thing I'd add would be a
restriction in that final query to justNumerator > 0.  There might not
be as many as ten games with positive "A" affinity for the given "B"
base game, in which case I don't think you'll want to list games with
zero affinity or smallish negative affinities.
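
Concretely, only the outermost filter in your block needs to change:

		) WHERE rownum <= 10 AND justNumerator > 0

since the inline view already exposes the justNumerator column.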

-- mathtalk

Clarification of Answer by mathtalk-ga on 15 Dec 2002 22:38 PST
I might be misunderstanding something, but it seems that the "FOR ...
LOOP" is meant to find all the "top-ten" affinities of all the games. 
Is that indeed the intention?

Note that the "numerator only" formulation is symmetric in A,B:

(# owners of A&B)*(# owners) - (# owners of A)*(# owners of B)

The original post mentioned 20,000 games, so I assumed you'd want to
fix B (say) and just compute the top-ten affinity games A.   This
would probably make the most sense if the underlying data is volatile,
and so apt to make reuse of previous computations questionable.

But if the data were not often changed, then you might save some time
in computing _all_ affinities by making use of the symmetry.  Let me
know if I should elaborate.

regards, mathtalk-ga

Request for Answer Clarification by mxnmatch-ga on 16 Dec 2002 12:02 PST
Before doing the computations I move all the data to the holding table
userGameAffinityGames, so the data is fixed until the function
finishes. Also, we're thinking that, given the cost of doing these
calculations, we would only recompute once a month except possibly for
new games.

However, I'm not sure about the best way to use the symmetry. Would I
just add an additional restriction to the WHERE clause of the
innermost query which does something like "WHERE NOT EXISTS (select 1
FROM userGameAffinityGroupings ag WHERE ag.game=g2.game and
ag.gameWithAffinity=g1.game)"?

Clarification of Answer by mathtalk-ga on 16 Dec 2002 20:11 PST
Hi, mxnmatch-ga:

I'm not sure what the optimal way to exploit symmetry would be, but
here is where my intuition is taking me.  As written you are looping
through records with:

FOR theGame IN (SELECT DISTINCT game FROM userGameAffinityGames)
  LOOP

But we already collected a list of unique values of game in
userGameAffinityCounts, so we could instead loop through:

FOR B IN (SELECT game, cnt FROM userGameAffinityCounts ORDER BY cnt DESC)
  LOOP

For each "B" game in this list (starting with the most popular ones,
as guaranteed by the ordering), we might compute and store all the
positive affinities "A" with "B" as follows within the loop:

/* CREATE TABLE userGamePositiveAffinity
   (
        gA NUMBER(20) NOT NULL,
        gB NUMBER(20) NOT NULL,
        jNum NUMBER  NOT NULL
   );
*/

INSERT INTO userGamePositiveAffinity(gA,gB,jNum)
SELECT gA, B.game,
  cntAB*UNIQUEUNUMCNT - gctA.cnt*B.cnt justNumerator
FROM
  (SELECT ownA.game gA, count(*) cntAB 
    FROM 
      (SELECT * FROM userGameAffinityGames
        WHERE game = B.game) ownB,
      (SELECT * FROM userGameAffinityGames
        WHERE game != B.game) ownA
    WHERE ownA.unum = ownB.unum
    GROUP BY ownA.game) ownAB,
  (SELECT * 
    FROM userGameAffinityCounts) gctA
WHERE ownAB.gA = gctA.game
 /* the alias justNumerator cannot be referenced here, so repeat the expression */
 AND  cntAB*UNIQUEUNUMCNT - gctA.cnt*B.cnt > 0;

Thereafter we can DELETE all the userGameAffinityGames records in
which game is B.game, and move on to the next most popular game in the
"loop".  Since the "scratch" table userGameAffinityGames is shrinking
with each step in the loop, it should run faster as the loop
progresses.
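
In SQL terms that pruning step, which closes out the loop body, is just
(a sketch):

DELETE FROM userGameAffinityGames
WHERE game = B.game;

END LOOP;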

At the end we have all the records:

userGamePositiveAffinity(gA,gB,jNum)

where game gA is _less_ popular than game gB.  We could then supply
symmetry by:

INSERT INTO userGamePositiveAffinity(gA,gB,jNum)
SELECT gB, gA, jNum
FROM userGamePositiveAffinity;

Finally we would loop again through userGameAffinityCounts and, for
each B.game in that table, select the ten highest jNum (affinity)
records from userGamePositiveAffinity where gB = B.game, writing the
results out to userGameAffinityGroupings.
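
That last pass might look something like this (a sketch, reusing the
rownum idiom from your own block):

FOR B IN (SELECT game FROM userGameAffinityCounts) LOOP
  DELETE FROM userGameAffinityGroupings WHERE game = B.game;

  INSERT INTO userGameAffinityGroupings (game, orderNum, gameWithAffinity)
  SELECT B.game, rownum, gA
  FROM
  ( SELECT gA
    FROM userGamePositiveAffinity
    WHERE gB = B.game
    ORDER BY jNum DESC
  )
  WHERE rownum <= 10;

  COMMIT;
END LOOP;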

Only trials with some realistic data can determine if this approach is
reasonably fast or even faster than another plausible process.  I like
the idea of running it as a batch, but you might do it weekly or even
daily if it is not too intensive.

regards, mathtalk-ga

Clarification of Answer by mathtalk-ga on 17 Dec 2002 06:53 PST
One other (undigested) thought... it is possible that you are
recording ratings along with the ownership of games.  A more
sophisticated analysis might then include the ratings information in
predicting which games would be enjoyed (by an owner of a particular
game).

-- mathtalk

Clarification of Answer by mathtalk-ga on 24 Dec 2002 17:55 PST
As I thought about my suggestion for exploiting symmetry (and
computing all the positive affinity rankings), I realized the biggest
obstacle would likely be storage space for all the "numerators"
(cntAB*cntU - cntA*cntB).  True, we only keep the positive ones, and
this cuts down the storage required by at least half.  But I think
that it would be reasonable to reduce it further by setting a higher
(positive) threshold, like:

(cntAB*cntU - cntA*cntB) > cntU/1000

Also, I thought a simple way to incorporate the ratings into the final
"affinity" groupings might be just to give the average rating of each
affinity A game (by those who are also owners of base B game).
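
For instance, if userGameProfile carried a rating column (hypothetical;
I don't know your actual schema), the average for an affinity game A
among owners of a base game B might be fetched with something like:

SELECT avg(pA.rating)   -- rating is a hypothetical column
FROM userGameProfile pA, userGameProfile pB
WHERE pB.game = :gameB AND pB.isowned = 1
  AND pA.unum = pB.unum
  AND pA.game = :gameA AND pA.isowned = 1;

where :gameB and :gameA are bind variables for the base and affinity
games.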

regards and seasons greetings,
mathtalk-ga

Request for Answer Clarification by mxnmatch-ga on 30 Dec 2002 13:41 PST
Ok, I'm back. One of my coworkers thought that Pr(A|B)/Pr(A) made more
sense than Pr(A|B) - Pr(A). What do you think? In either case a larger
Pr(A) will decrease the value.

Clarification of Answer by mathtalk-ga on 30 Dec 2002 15:17 PST
Yes, it certainly is worth considering.  I might have gone that way
myself if you had not seized on the difference rather than the ratio. 
Note that:

Pr(A|B) = Pr(A&B)/Pr(B), so

Pr(A|B)/Pr(A) = Pr(A&B)/(Pr(A)Pr(B))

which again is symmetric in A and B.

My impression is that the ratio approach handicaps the very popular
games more than the (earlier considered) difference approach would. 
Consider this example:

Game B (base) is owned by 25% of those surveyed.  Games A1 and A2 are
owned, resp. by 50% and 5% of the general population, but among owners
of game B these percentages jump to 75% and 20% respectively.

Which has the higher affinity?  Using the ratio approach:

Pr(A1|B)/Pr(A1) = (0.75)/(0.50) = 1.5

Pr(A2|B)/Pr(A2) = (0.20)/(0.05) = 4.0

and game A2 would have the greater measure of affinity.

Using the difference approach:

Pr(A1|B) - Pr(A1) = 0.75 - 0.50 = 0.25

Pr(A2|B) - Pr(A2) = 0.20 - 0.05 = 0.15

and game A1 would have the greater measure of affinity.

Which of these is the correct (or better) measure?  Beats me.  I had
come around to thinking that perhaps the ratio approach makes it too
hard to rank a very popular game as having a high affinity, but on the
other hand popular games are already well known, so perhaps
"suggesting" an affinity that is well known is not as beneficial as
mentioning one that may be lesser known.

I'll have to defer to you guys as "domain experts" on what best
represents the quality you intend to present to Web visitors.

regards, mathtalk

Clarification of Answer by mathtalk-ga on 30 Dec 2002 20:57 PST
Thanks, mxnmatch, for the kind words and generous tip!  I wish you
guys the best with your project and with the New Year...

-- mathtalk
mxnmatch-ga rated this answer: 5 out of 5 stars and gave an additional tip of $10.00
Very helpful. They worked through the problem with me until I had what I needed.

Comments  
There are no comments at this time.
