Assume we have a database table that stores the results of a photo
rating website. People rate the photos of others on a 1 to 10 scale.
The table where we store the results has the following fields:
id (INT) e.g. 17426
viewer_id (INT) e.g. 1023
subject_id (INT) e.g. 4721
rating (TINYINT) e.g. 7
The query we would like is as follows - it's in three parts, but this
is just to make it easier to understand. The MySQL query should be a
single SQL statement:
1. For a given user X, we are only interested in the last 200 photos
that X has rated.
2. We now want to find the 100 other people, who have seen and rated
the most number of same people as X (i.e. people who've seen some, or
all of the same 200 people).
3. We want to sort those 100 people in descending order, with the
person who has seen the most out of those 200 people at the top.
In other words...
John has seen and rated 500 people on the system. We are only
interested in the last 200 people he has rated. Taking those 200
people, we find the 100 people who have seen and rated the most of
those 200 people. E.G.Another user Bill has viewed 163 of the same 200
people. Another user Calvin has viewed 127 of those 200 people.
Another user Todd has viewed 56 of those 200 people. We want to find
the other 97 people who have seen large numbers of those 200 people,
and put them in descending order of priority. Using our example Bill
is at the top of the list with 163/200, a guy by the name of Fred is
at the very bottom with 17/100.
The result should be the 100 user ID's sorted as described. It should
be a single SQL query that is the most efficient as possible. We
expect a high use of this query. You can assume we have choices in
terms of when we create the table, whether to create indexes or not,
and what type of table type we use (MyISAM,InnoDB etc.)
The answer we are seeking is...
1. What the MySQL query should be.
2. Any requirements for indexes or table type.
2. The actual number of queries per second we could achieve for a
given processor speed and RAM configuration (your machine), and a
table size of 10,000,000 ratings, spread evenly among 50,000 users.
3. The actual number of queries per second we could achieve for the
same processor speed and RAM configuration, and a table size of
100,000,000 ratings, spread evenly among 500,000 users.
4. Any general suggestions for optimizing this process.
We do expect someone the answer to include the results of a real
physical test using sample data, which you would generate. |