Google Answers: SQL: How to find similar sets among records.

View Question

Q: SQL: How to find similar sets among records. ( Answered, 1 Comment )

Question

Subject: SQL: How to find similar sets among records.
Category: Computers > Programming
Asked by: davidjonsson-ga
List Price: $10.00

Posted: 16 Feb 2003 08:03 PST
Expires: 18 Mar 2003 08:03 PST
Question ID: 162036

How do I search for most common object in a MySQL and/or PostgrelQL database? All users are recorded in a database. To each user a set of IP numbers are associated. How can I find the users that have most IP-numbers in common? I also need to find users who only have a subset of IP-numbers in common. (Say that the database search find N common IP-numbers for a specific user. I also must be able to search for all users who have N-1, N-2, N-3 ... IP-numbers in common.)
Request for Question Clarification by mathtalk-ga on 17 Feb 2003 07:51 PST Hi, davidjonsson-ga: Is there a fixed set of IP addresses involved? How many are there? How many users, roughly? Are you asking two questions here, one about "most common object" in a database and another about (pairs of?) users with a certain number of IP addresses in common? I'm unclear about the phrases "most common" and "in common" as you have used them. Does the fact that you've posted this in the Computer > Programming section mean you are looking for some working "code" (SQL or otherwise), or (perhaps commensurate with the list price offered) just a few words about a viable approach to this solution? regards, mathtalk
Clarification of Question by davidjonsson-ga on 17 Feb 2003 11:58 PST Hi There is no fixed set so it can be any of 2^32 different IP-numbers. There can be many users as well, 100 000 or more. I will build a big database. If it grows too big it will be distributed over several computers. I want the SQL sentence to use and also how to make the record set that each user will have, Usually a user has never more that 32 IP-number associatied with themselves.
Request for Question Clarification by mathtalk-ga on 18 Feb 2003 12:58 PST Hi, davidjonsson-ga: I have prepared an answer, but I would still like for you to clarify something I asked about earlier. Are you trying to find a "pair" of users who have a certain number of IP-numbers in common (eg. the "most" IP-numbers in common of any pair of users)? If not please explain what it means for "users" to have a subset of IP-numbers "in common". regards, mathtalk-ga

Answer

Subject: Re: SQL: How to find similar sets among records.
Answered By: mathtalk-ga on 18 Feb 2003 13:42 PST

Hi, davidjonsson-ga:

As clarified your question has two parts, as I understand it.  You
want (1) a SQL query that identifies pairs of users with certain
numbers of IP addresses in common, and also (2) directions for making
"the record set that each user will have".

You also want an approach that works with MySQL and/or PostgreSQL.

Please let me know, after reading my answer, if I've interpreted these
parts of your question correctly.  I plan on giving standard SQL
syntax that is correct for your intended purpose, but in addition it
is useful to understand practical optimizations.  SQL is a
"non-procedural" language, especially as it applies across a variety
of relational database platforms.  So that the discussion can address
basic optimizations, the SQL syntax will therefore focus on the MySQL
platform.  For reference the searchable MySQL manual with user
comments can be found here:

[MySQL Manual]
http://www.mysql.com/doc/en/index.html

Let's begin with the creation of a record set to hold information
about users and their IP addresses.  For the sake of this discussion
we can use:

mysql> CREATE TABLE tblUserIp (username CHAR(10), ipnumber CHAR(8));

This MySQL statement defines a new table called tblUserIp in the
current database.  I have chosen to represent ipnumber as eight
"hexadecimal" digits, because this is a natural "fixed length"
approach, but you might prefer INTEGER UNSIGNED since this would put
the 32 bits in a single lump.  Also I'm simply using a fixed-length
character string to identify users in the absence of any specific
information about how you plan to do this.  The fact that both fields
are fixed-length means that rows for this table can be fixed-length
and that makes possible the efficient implementation of the table in
MySQL as a MyISAM type table of static (fixed-length) format.  This
would occur by default with the above statement, but for advanced
configuration one might choose instead the InnoDB table type
(available on MySQL Windows distributions).

In order to complete the definition of this table we have one other
detail to attend to.  Presumably we only want to record the
combination of a particular user and IP address once, so in the argot
of relational databases, the two fields together form a (compound)
primary key for the table.  This could have been specified in the
above CREATE TABLE statement, but it can also be done after the fact
this way:

mysql> ALTER TABLE tblUserIp ADD PRIMARY KEY (ipnumber, username);

Notice that I have placed the field tblUserIp.ipnumber first in
defining the primary key.  This is an attempt at optimization which
relates to how we will use the table below.

At this point we have created the table's structure, but it remains to
fill the table with actual data.  I am unclear where your data will
come from, so I cannot be very specific about this step.  If data is
to be added one row at a time, then an INSERT statement might be
appropriate.  If the data is in bulk from a delimited text file, then
the LOAD DATA statement is perhaps more appropriate.

Once all the valid combinations of users and IP addresses are entered
into the table, we can return to the first part of your question.  How
do we identify pairs of users that have the greatest or some specified
number of IP addresses "in common"?

To expedite the discussion we will give a SQL statement that does the
job, then explain how it works and some optimizations that might be
practical.

mysql> SELECT t1.username, t2.username, count(*) AS incommon
    -> FROM tblUserIp AS t1 INNER JOIN tblUser AS t2
    -> ON t1.ipnumber = t2.ipnumber AND t1.username < t2.username
    -> GROUP BY t1.username, t2.username
    -> HAVING incommon > 0
    -> ORDER BY incommon DESC;

This query involves a self-join of the table tblUserIp with itself. 
This is how we can identify how many IP addresses are shared in common
by pairs of users.  The join condition asks for two records that have
equal values of ipnumber.  Also, to avoid "double counting" we throw
in the condition that the first user's name is strictly less than the
second user's name.
This also eliminates pairings of any user with himself.

In order to make the references to the first and second "copies" of
the table unambiguous, we have "aliased" these with the new names t1
and t2.  These temporary designations only have meaning within the
scope of this single query.

The top line of the query selects the results that will be returned,
ie. the names of the two different users who share one or more IP
addresses and the exact count of how many of these shared addresses
there are.  Notice that this summary count is itself aliased as
"incommon" and is referred to that way in later clauses of the query.

The GROUP BY clause is important because it determines the "scope" for
how to count the shared addresses.  We get only counts for IP
addresses shared by any pair of users denoted by t1.username and
t2.username.

As written, the HAVING clause shown above is merely a placeholder.  It
is automatic that any returned value of incommon will be greater than
zero, since zero would mean no pairing occurred in the underlying
self-join on tblUserIp.  Thus a zero value of incommon can never
appear.  However it might be of interest to restrict the values of
incommmon to either one specific value (HAVING incommon = N) or to a
range of values (HAVING incommon BETWEEN nmin AND nmax).

Finally the ORDER BY clause sorts the resulting output so that pairs
of users with the largest number of IP addresses in common will appear
first.

The subject of how best to optimize such a query is complex,
especially if as you indicate is possible, the records are to be
distributed across a number of different computers.  These next
suggestions should therefore be considered as starting points for
exploring a variety of alternatives and "tweakings" on the way to
finding a practical approach.

It has been suggested that count(1) is evaluated more efficiently by
MySQL than count(*), although the two expressions are logically
equivalent.  This would be an easy change to test, to see which works
faster.

The focus here seems to be one of finding users with "large" numbers
of IP addresses in common.  In that context we should restrict the
query to just those users who have at least the target number of IP
addresses which we seek to find in common.  After all, if a user has
less than N IP addresses, that user cannot share N IP addresses in
common with anyone.

As a last point, the MySQL EXPLAIN SELECT command can be used to
illuminate the way that the query will be executed, especially its use
of indexes.  Recall that above I added a primary key to the table and
ordered its fields in a certain way.  The hope is that this might
create a unique index structured in a manner that facilitates the
"join" operation central to the proposed query.  If MySQL does not
automatically figure out how to do this, it is possible to added
"hints" to the SQL in order to "force" the execution plan in the
desired manner.

regards, mathtalk-ga

Comments

Subject: Re: SQL: How to find similar sets among records.
From: hammer-ga on 16 Feb 2003 08:12 PST

DavidJonsson,

You may want to review the Pricing Guidelines.
http://answers.google.com/answers/pricing.html

- Hammer

Important Disclaimer: Answers and comments provided on Google Answers are general information, and are not intended to substitute for informed professional medical, psychiatric, psychological, tax, legal, investment, accounting, or other professional advice. Google does not endorse, and expressly disclaims liability for any product, manufacturer, distributor, service or service provider mentioned or any opinion expressed in answers or comments. Please read carefully the Google Answers Terms of Service.

If you feel that you have found inappropriate content, please let us know by emailing us at answers-support@google.com with the question ID listed above. Thank you.

Search Google Answers for

Google Home - Answers FAQ - Terms of Service - Privacy Policy