Q: Affinity grouping? ( Answered 5 out of 5 stars,   0 Comments )
Question  
Subject: Affinity grouping?
Category: Computers > Algorithms
Asked by: mxnmatch-ga
List Price: $20.00
Posted: 11 Dec 2002 18:33 PST
Expires: 10 Jan 2003 18:33 PST
Question ID: 123403
I need to know how to do affinity grouping of information in the
following circumstance (I've simplified it to the bare bones of what
needs to be done).

There's a table GameProfile which contains (userID, gameID). There are
about 20,000 gameIDs and perhaps 1,000,000 userIDs. The number of
gameIDs doesn't increase very quickly, but the number of userIDs may
continue to increase.

For any given gameID, I need to know which other games are most often
owned by users who also have that game. Essentially I'm trying to do
something like Amazon.com's affinity grouping, which lets you look at
the page for a book and see other books bought by people who also
bought that book.

How should I accomplish this? The straightforward way seems far too
inefficient to be practical: make a list of all the gameIDs, group the
gameIDs by userID, and then prune all userID groups that don't contain
the desired gameID. I would then group by gameID and take the most
commonly occurring gameIDs to be the ones with the greatest affinity.

Of course, then there's the problem of popular games which would end
up with an affinity to ALL gameIDs. I'd need to prune those somehow.

So, what should I do? Where can I get information on how to approach
this problem? All I can find on the web are articles that tell me how
important affinity grouping is in data mining. I can't find any
information on what algorithms I actually need to use to accomplish
this.

Request for Question Clarification by mathtalk-ga on 12 Dec 2002 13:48 PST
Hi, mxnmatch-ga:

I think the mathematical framework for what you want is conditional
probability.  For example, let's say that 90% of the game owners have
Simpson's Road Rage, and this is independent of whether or not they
own (say) GTA III, which is true of 40% of the game owners.

But among owners of GTA III, the number who have SimCity Used Car Lot
(I'm making stuff up now) is 30%, not a great number, but much higher
than the 15% of all game owners who have it.

So I think you are looking for a relative comparison of Pr(A|B) to
Pr(A), where A means owning game A, B means owning game B, and the
population is all game owners.
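
Concretely, with those made-up numbers:

Pr(SimCity Used Car Lot | GTA III) = 0.30  versus  Pr(SimCity Used Car Lot) = 0.15

so the comparison flags that game, whereas

Pr(Road Rage | GTA III) = 0.90 = Pr(Road Rage)

so owning GTA III tells us nothing extra about Road Rage.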

Once you have an exact mathematical criterion defined, then we can
proceed to discuss the most efficient database query to retrieve the
answer.  BTW, how are the table entries stored?

regards, mathtalk-ga

Clarification of Question by mxnmatch-ga on 12 Dec 2002 14:17 PST
I like that. I was originally just thinking of getting the most
commonly associated games and then just chopping off the top X in the
list, but I wasn't sure where to draw the line. Your suggestion makes
more sense, but the question is: how easily can it be done?

The data is stored in a simple table (I'm using Oracle):
  gameID  NUMBER
  userID  NUMBER
The original table has more columns than that, but they're not
relevant to this problem, and I intend to copy the two relevant
columns out to a temp table before operating on them anyway (to
minimize tying up the original table).

So, once I have Pr(A|B) and Pr(A), would I just take the difference
between the two, Pr(A|B) - Pr(A), and then order from largest to smallest?

I've already figured out a query which determines the number of times
that a particular game is associated with other games. To get Pr(A|B)
would I just take the count (cnt) and divide that by the total number
of users?

Here's the query:
SELECT
	g1.game g1Game, g2.game g2Game, count(*) cnt
FROM
	(select unum,game from tmp20021212_userGameColl where game=482099) g1,
	tmp20021212_userGameColl g2
WHERE
	g1.unum=g2.unum
	and g1.game!=g2.game
GROUP BY g1.game,g2.game
ORDER BY g1.game, count(*) desc, g2.game
;


"unum" is the actual name of the userID column and "game" is the
actual name of the gameID column. The table "tmp20021212_userGameColl"
is the temp table that contains those 2 columns.

Clarification of Question by mxnmatch-ga on 12 Dec 2002 14:18 PST
To be clear, that query only gets the counts for the game "482099". I
would have to run that query once for each game. It currently takes
about 9 seconds to run.
Answer  
Subject: Re: Affinity grouping?
Answered By: mathtalk-ga on 12 Dec 2002 21:10 PST
Rated:5 out of 5 stars
 
Hi, mxnmatch-ga:

I'm not sure if selecting the given columns into a smaller temp table
is necessarily the best strategy.  It seems likely to me that with
appropriate indexes on the existing table, the queries may execute
just as effectively.  Of course this is a matter for experimentation. 
For the sake of discussion I'll simplify the table name to usergame,
and we'll assume the compound primary key (userid, game).
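
For concreteness, here is a sketch of the layout I am assuming (the
constraint and index names are mine, purely illustrative):

CREATE TABLE usergame
(
   userid  NUMBER  NOT NULL,
   game    NUMBER  NOT NULL,
   CONSTRAINT pk_usergame PRIMARY KEY (userid, game)
);

-- a secondary index keyed on game supports the per-game counts below
CREATE INDEX ix_usergame_game ON usergame(game);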

Here is how we might compute, for a pair of games A and B, the
quantity:

 Pr(A|B) - Pr(A)

Note that by definition Pr(A|B) = Pr(A&B)/Pr(B).  Therefore four
counts are needed:

SELECT count(DISTINCT userid) AS cntU 
FROM usergame;

SELECT count(*) AS cntA 
FROM usergame 
WHERE game = A;

SELECT count(*) AS cntB 
FROM usergame 
WHERE game = B;

SELECT count(*) AS cntAB 
FROM usergame ug1
WHERE game = A
AND EXISTS
 ( SELECT * 
   FROM usergame
   WHERE userid = ug1.userid
   AND game = B );

In the event that we are trying to find the top ten games A having
affinity to a fixed game B, the first and third queries need only be
run once (for fixed B) and the results reused for all games A.  Indeed
the first query, which simply counts the total number of distinct
users in the table, may be reused across all such calculations of
affinity, while the second and third queries might properly be
combined (if a volume of such calculations is required) into:

SELECT game, count(*) 
FROM usergame
GROUP BY game;

whose output could be saved in the form of a summary table.
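
For example, the output of that query could be materialized as follows
(the table name gamecount is just for illustration):

CREATE TABLE gamecount AS
SELECT game, count(*) AS cnt
FROM usergame
GROUP BY game;

and refreshed whenever a new batch of affinity rankings is computed.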

In this fashion the real effort is reduced to the fourth query above.
But let us give the mathematical formula promised earlier:

Pr(A|B) = cntAB/cntB

Pr(A) = cntA/cntU

Pr(A|B) - Pr(A) = (cntAB/cntB) - (cntA/cntU)

 = (cntAB*cntU - cntA*cntB)/(cntB*cntU)

Now in terms of the rankings, the denominator here would be the same
for all values of Pr(A|B) - Pr(A) when game B is fixed.  So we might
as well look for the top ten values of the numerator:

cntAB*cntU - cntA*cntB

as these integer values are related to those expressed above in terms
of conditional probabilities by a simple scaling.  Depending on the
number of records involved, the all-integer computation might be
attractive or not (as when the numbers are so great as to cause
overflow in multiplying).  From the rough numbers mentioned in your
original question (a million userids), I don't think overflow would be
a problem in most cases, but if both A,B happened to be games owned by
most users, then a 4-byte Oracle integer would not be big enough to
hold results like one million squared.
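
Putting these pieces together, a single statement of roughly this shape
would rank candidate games A for a fixed game B (a sketch only; :gameB,
:cntB and :cntU are bind variables holding the fixed game, its owner
count, and the distinct-user count, and gamecount is the illustrative
summary table above):

SELECT ug2.game,
       count(*) * :cntU - gc.cnt * :cntB AS numerator
FROM usergame ug1, usergame ug2, gamecount gc
WHERE ug1.game = :gameB
  AND ug2.userid = ug1.userid
  AND ug2.game != :gameB
  AND gc.game = ug2.game
GROUP BY ug2.game, gc.cnt
ORDER BY numerator DESC;

This is essentially your 482099 query with the summary counts joined
in, so the database hands back the ranking directly.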

I'm unclear how much of the computation needs to be done by the
database.  Often the balance between database work and middleware work
can be tuned so that the arithmetic is done in the quickest fashion.
I'd love to hear more about the architecture of your program before
making a recommendation.

regards, mathtalk-ga

Clarification of Answer by mathtalk-ga on 12 Dec 2002 21:16 PST
The query you propose in your clarification is essentially the same as
my fourth query, done for all possible games A simultaneously (with B
= 482099).  If you can do this in ten seconds or so, the rest of the
computation could probably be done in an additional 3 to 5 seconds by
the database, or even faster if the number crunching is done in the
middleware.

-- mathtalk-ga

Clarification of Answer by mathtalk-ga on 13 Dec 2002 08:44 PST
Hi, mxnmatch-ga:

If you want I can give you a SQL only implementation, or sketch how to
combine the SQL results with some "C" code to compute the rankings. 
Which would best fit your programming model?

-- mathtalk-ga

P.S.  I enjoyed your question about Douglas Adams/Dirk Gently!

Request for Answer Clarification by mxnmatch-ga on 13 Dec 2002 16:57 PST
Here's the PL/SQL I came up with. Does this look OK? It now takes
about 6 seconds per game; the time went down when I added the
"HAVING count(*)>5".


DECLARE UNIQUEUNUMCNT NUMBER;
BEGIN
  SELECT count(DISTINCT unum) INTO UNIQUEUNUMCNT FROM userGameAffinityGames;

  dbms_output.ENABLE(10000);
  dbms_output.put_line('started: '||sysdate);


  -- Fill temp table with (unum,game) pairs
  /*
  CREATE TABLE userGameAffinityGames
  (
     unum  NUMBER(11)  NOT NULL,
     game  NUMBER(20)  NOT NULL
  );
  CREATE index userGameAffinityGames_ixUnum on userGameAffinityGames(unum);
  CREATE index userGameAffinityGames_ixGame on userGameAffinityGames(game);
  */
  DELETE FROM userGameAffinityGames;
  INSERT INTO userGameAffinityGames (unum,game)
  	SELECT unum, game FROM userGameProfile where isowned=1;
  COMMIT;
  dbms_output.put_line('finished copying unum,game pairs: '||sysdate);


  /*
  CREATE TABLE userGameAffinityCounts
  (
     game  NUMBER(20)  NOT NULL,
     cnt  NUMBER(20)  NOT NULL
  );
  CREATE index userGameAffinityCounts_ixGame on userGameAffinityCounts(game);
  */
  DELETE FROM userGameAffinityCounts;
  INSERT INTO userGameAffinityCounts (game,cnt)
  	SELECT game, count(*) cnt FROM userGameAffinityGames GROUP BY game;
  COMMIT;
  dbms_output.put_line('finished getting game counts: '||sysdate);



  FOR theGame IN (SELECT DISTINCT game FROM userGameAffinityGames) LOOP
    BEGIN
		DELETE FROM userGameAffinityGroupings WHERE game=theGame.game;

		INSERT INTO userGameAffinityGroupings (game,orderNum,gameWithAffinity)
		SELECT gB,rownum,gA
		FROM
		(
				SELECT
					gB, gA,
					(gBgACnt*uniqueUnumCnt) - (gAGameCnt*gBGameCnt) justNumerator
				FROM
					(
					SELECT
						g1Game gB
						,g2Game gA
						,gAgBCnts.cnt gBgACnt
						,gBGameCnts.cnt gBGameCnt
						,gAGameCnts.cnt gAGameCnt
					FROM
					(select c.game,c.cnt from userGameAffinityCounts c) gBGameCnts,
					(select c.game,c.cnt from userGameAffinityCounts c) gAGameCnts,
					(
						SELECT rownum rn, g1Game, g2Game, cnt
						FROM
						(
							SELECT
								g1.game g1Game, g2.game g2Game, count(*) cnt
							FROM
								(select unum,game from userGameAffinityGames where game=theGame.game) g1,
								userGameAffinityGames g2
							WHERE
								g1.unum=g2.unum
								and g1.game!=g2.game
							GROUP BY g1.game,g2.game
							HAVING count(*)>5
							ORDER BY g1.game, count(*) desc, g2.game
						)
					) gAgBCnts
					WHERE
						gAGameCnts.game=gAgBCnts.g1Game
						AND gBGameCnts.game=gAgBCnts.g2Game
				)
				ORDER BY justNumerator DESC
		) WHERE rownum <= 10
		;

		COMMIT;

    END;
  END LOOP;
  dbms_output.put_line('finished '||sysdate);

END;

Clarification of Answer by mathtalk-ga on 13 Dec 2002 21:43 PST
This isn't my full analysis, but one thing I'd add would be a
restriction in that final query to justNumerator > 0.  There might not
be as many as ten games with positive "A" affinity for the given "B"
base game, in which case I don't think you'll want to list games with
zero affinity or smallish negative affinities.
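
Concretely, only the outermost filter in your block needs to change:

		) WHERE rownum <= 10 AND justNumerator > 0

since the inline view already exposes the justNumerator column.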

-- mathtalk

Clarification of Answer by mathtalk-ga on 15 Dec 2002 22:38 PST
I might be misunderstanding something, but it seems that the "FOR ...
LOOP" is meant to find all the "top-ten" affinities of all the games. 
Is that indeed the intention?

Note that the "numerator only" formulation is symmetric in A,B:

(# owners of A&B)*(# owners) - (# owners of A)*(# owners of B)

The original post mentioned 20,000 games, so I assumed you'd want to
fix B (say) and just compute the top-ten affinity games A.   This
would probably make the most sense if the underlying data is volatile,
and so apt to make reuse of previous computations questionable.

But if the data were not often changed, then you might save some time
in computing _all_ affinities by making use of the symmetry.  Let me
know if I should elaborate.

regards, mathtalk-ga

Request for Answer Clarification by mxnmatch-ga on 16 Dec 2002 12:02 PST
Before doing the computations I move all the data to the holding table
userGameAffinityGames, so the data is fixed until the function
finishes. Also, we're thinking that, given the cost of doing these
calculations, we would only recompute once a month except possibly for
new games.

However, I'm not sure about the best way to use the symmetry. Would I
just add an additional restriction to the WHERE clause of the
innermost query which does something like "WHERE NOT EXISTS (select 1
FROM userGameAffinityGroupings ag WHERE ag.game=g2.game and
ag.gameWithAffinity=g1.game)"?

Clarification of Answer by mathtalk-ga on 16 Dec 2002 20:11 PST
Hi, mxnmatch-ga:

I'm not sure what the optimal way to exploit symmetry would be, but
here is where my intuition is taking me.  As written you are looping
through records with:

FOR theGame IN (SELECT DISTINCT game FROM userGameAffinityGames)
  LOOP

But we already collected a list of unique values of game in
userGameAffinityCounts, so we could instead loop through:

FOR B IN (SELECT game, cnt FROM userGameAffinityCounts ORDER BY cnt DESC)
  LOOP

For each "B" game in this list (starting with the most popular ones,
as guaranteed by the ordering), we might compute and store all the
positive affinities "A" with "B" as follows within the loop:

/* CREATE TABLE userGamePositiveAffinity
   (
        gA NUMBER(20) NOT NULL,
        gB NUMBER(20) NOT NULL,
        jNum NUMBER  NOT NULL
   );
*/

INSERT INTO userGamePositiveAffinity(gA,gB,jNum)
SELECT gA, B.game,
  cntAB*UNIQUEUNUMCNT - gctA.cnt*B.cnt justNumerator
FROM
  (SELECT ownA.game gA, count(*) cntAB 
    FROM 
      (SELECT * FROM userGameAffinityGames
        WHERE game = B.game) ownB,
      (SELECT * FROM userGameAffinityGames
        WHERE game != B.game) ownA
    WHERE ownA.unum = ownB.unum
    GROUP BY ownA.game) ownAB,
  (SELECT * 
    FROM userGameAffinityCounts) gctA
WHERE ownAB.gA = gctA.game
 /* the alias justNumerator cannot be referenced here, so repeat the expression */
 AND  cntAB*UNIQUEUNUMCNT - gctA.cnt*B.cnt > 0;

Thereafter we can DELETE all the userGameAffinityGames records in
which game is B.game, and move on to the next most popular game in the
"loop".  Since the "scratch" table userGameAffinityGames is shrinking
with each step in the loop, it should run faster as the loop
progresses.
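
In SQL terms that pruning step, which closes out the loop body, is just
(a sketch):

DELETE FROM userGameAffinityGames
WHERE game = B.game;

END LOOP;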

At the end we have all the records:

userGamePositiveAffinity(gA,gB,jNum)

where game gA is _less_ popular than game gB.  We could then supply
symmetry by:

INSERT INTO userGamePositiveAffinity(gA,gB,jNum)
SELECT gB, gA, jNum
FROM userGamePositiveAffinity;

Finally we would loop again through userGameAffinityCounts and, for
each B.game in that table, select the ten highest jNum (affinity)
records from userGamePositiveAffinity where gB = B.game, writing the
results out to userGameAffinityGroupings.
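
That last pass might look something like this (a sketch, reusing the
rownum idiom from your own block):

FOR B IN (SELECT game FROM userGameAffinityCounts) LOOP
  DELETE FROM userGameAffinityGroupings WHERE game = B.game;

  INSERT INTO userGameAffinityGroupings (game, orderNum, gameWithAffinity)
  SELECT B.game, rownum, gA
  FROM
  ( SELECT gA
    FROM userGamePositiveAffinity
    WHERE gB = B.game
    ORDER BY jNum DESC
  )
  WHERE rownum <= 10;

  COMMIT;
END LOOP;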

Only trials with some realistic data can determine if this approach is
reasonably fast or even faster than another plausible process.  I like
the idea of running it as a batch, but you might do it weekly or even
daily if it is not too intensive.

regards, mathtalk-ga

Clarification of Answer by mathtalk-ga on 17 Dec 2002 06:53 PST
One other (undigested) thought... it is possible that you are
recording ratings along with the ownership of games.  A more
sophisticated analysis might then include the ratings information in
predicting which games would be enjoyed (by an owner of a particular
game).

-- mathtalk

Clarification of Answer by mathtalk-ga on 24 Dec 2002 17:55 PST
As I thought about my suggestion for exploiting symmetry (and
computing all the positive affinity rankings), I realized the biggest
obstacle would likely be storage space for all the "numerators"
(cntAB*cntU - cntA*cntB).  True, we only keep the positive ones, and
this cuts down the storage required by at least half.  But I think
that it would be reasonable to reduce it further by setting a higher
(positive) threshold, like:

(cntAB*cntU - cntA*cntB) > cntU/1000

Also, I thought a simple way to incorporate the ratings into the final
"affinity" groupings might be just to give the average rating of each
affinity A game (by those who are also owners of base B game).
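
For instance, if userGameProfile carried a rating column (hypothetical;
I don't know your actual schema), the average for an affinity game A
among owners of a base game B might be fetched with something like:

SELECT avg(pA.rating)   -- rating is a hypothetical column
FROM userGameProfile pA, userGameProfile pB
WHERE pB.game = :gameB AND pB.isowned = 1
  AND pA.unum = pB.unum
  AND pA.game = :gameA AND pA.isowned = 1;

where :gameB and :gameA are bind variables for the base and affinity
games.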

regards and seasons greetings,
mathtalk-ga

Request for Answer Clarification by mxnmatch-ga on 30 Dec 2002 13:41 PST
Ok, I'm back. One of my coworkers thought that Pr(A|B)/Pr(A) made more
sense than Pr(A|B) - Pr(A). What do you think? In either case a larger
Pr(A) will decrease the value.

Clarification of Answer by mathtalk-ga on 30 Dec 2002 15:17 PST
Yes, it certainly is worth considering.  I might have gone that way
myself if you had not seized on the difference rather than the ratio. 
Note that:

Pr(A|B) = Pr(A&B)/Pr(B), so

Pr(A|B)/Pr(A) = Pr(A&B)/(Pr(A)Pr(B))

which again is symmetric in A and B.

My impression is that the ratio approach handicaps the very popular
games more than the (earlier considered) difference approach would. 
Consider this example:

Game B (base) is owned by 25% of those surveyed.  Games A1 and A2 are
owned, resp. by 50% and 5% of the general population, but among owners
of game B these percentages jump to 75% and 20% respectively.

Which has the higher affinity?  Using the ratio approach:

Pr(A1|B)/Pr(A1) = (0.75)/(0.50) = 1.5

Pr(A2|B)/Pr(A2) = (0.20)/(0.05) = 4.0

and game A2 would have the greater measure of affinity.

Using the difference approach:

Pr(A1|B) - Pr(A1) = 0.75 - 0.50 = 0.25

Pr(A2|B) - Pr(A2) = 0.20 - 0.05 = 0.15

and game A1 would have the greater measure of affinity.

Which of these is the correct (or better) measure?  Beats me.  I had
come around to thinking that perhaps the ratio approach makes it too
hard to rank a very popular game as having a high affinity, but on the
other hand popular games are already well known, so perhaps
"suggesting" an affinity that is well known is not as beneficial as
mentioning one that may be lesser known.

I'll have to defer to you guys as "domain experts" on what best
represents the quality you intend to present to Web visitors.

regards, mathtalk

Clarification of Answer by mathtalk-ga on 30 Dec 2002 20:57 PST
Thanks, mxnmatch, for the kind words and generous tip!  I wish you
guys the best with your project and with the New Year...

-- mathtalk
mxnmatch-ga rated this answer: 5 out of 5 stars and gave an additional tip of $10.00
Very helpful. They worked through the problem with me until I had what I needed.

Comments  
There are no comments at this time.
