Hi, davidjonsson-ga:
As clarified your question has two parts, as I understand it. You
want (1) a SQL query that identifies pairs of users with certain
numbers of IP addresses in common, and also (2) directions for making
"the record set that each user will have".
You also want an approach that works with MySQL and/or PostgreSQL.
Please let me know, after reading my answer, if I've interpreted these
parts of your question correctly. I plan on giving standard SQL
syntax that is correct for your intended purpose, but in addition it
is useful to understand practical optimizations. SQL is a
"non-procedural" language, especially as it applies across a variety
of relational database platforms. So that the discussion can address
basic optimizations, the SQL syntax will therefore focus on the MySQL
platform. For reference the searchable MySQL manual with user
comments can be found here:
[MySQL Manual]
http://www.mysql.com/doc/en/index.html
Let's begin with the creation of a record set to hold information
about users and their IP addresses. For the sake of this discussion
we can use:
mysql> CREATE TABLE tblUserIp (username CHAR(10), ipnumber CHAR(8));
This MySQL statement defines a new table called tblUserIp in the
current database. I have chosen to represent ipnumber as eight
"hexadecimal" digits, because this is a natural "fixed length"
approach, but you might prefer INTEGER UNSIGNED since this would put
the 32 bits in a single lump. Also I'm simply using a fixed-length
character string to identify users in the absence of any specific
information about how you plan to do this. The fact that both fields
are fixed-length means that rows for this table can be fixed-length
and that makes possible the efficient implementation of the table in
MySQL as a MyISAM type table of static (fixed-length) format. This
would occur by default with the above statement, but for advanced
configuration one might choose instead the InnoDB table type
(available on MySQL Windows distributions).
In order to complete the definition of this table we have one other
detail to attend to. Presumably we only want to record the
combination of a particular user and IP address once, so in the argot
of relational databases, the two fields together form a (compound)
primary key for the table. This could have been specified in the
above CREATE TABLE statement, but it can also be done after the fact
this way:
mysql> ALTER TABLE tblUserIp ADD PRIMARY KEY (ipnumber, username);
Notice that I have placed the field tblUserIp.ipnumber first in
defining the primary key. This is an attempt at optimization which
relates to how we will use the table below.
At this point we have created the table's structure, but it remains to
fill the table with actual data. I am unclear where your data will
come from, so I cannot be very specific about this step. If data is
to be added one row at a time, then an INSERT statement might be
appropriate. If the data is in bulk from a delimited text file, then
the LOAD DATA statement is perhaps more appropriate.
Once all the valid combinations of users and IP addresses are entered
into the table, we can return to the first part of your question. How
do we identify pairs of users that have the greatest or some specified
number of IP addresses "in common"?
To expedite the discussion we will give a SQL statement that does the
job, then explain how it works and some optimizations that might be
practical.
mysql> SELECT t1.username, t2.username, count(*) AS incommon
-> FROM tblUserIp AS t1 INNER JOIN tblUser AS t2
-> ON t1.ipnumber = t2.ipnumber AND t1.username < t2.username
-> GROUP BY t1.username, t2.username
-> HAVING incommon > 0
-> ORDER BY incommon DESC;
This query involves a self-join of the table tblUserIp with itself.
This is how we can identify how many IP addresses are shared in common
by pairs of users. The join condition asks for two records that have
equal values of ipnumber. Also, to avoid "double counting" we throw
in the condition that the first user's name is strictly less than the
second user's name.
This also eliminates pairings of any user with himself.
In order to make the references to the first and second "copies" of
the table unambiguous, we have "aliased" these with the new names t1
and t2. These temporary designations only have meaning within the
scope of this single query.
The top line of the query selects the results that will be returned,
ie. the names of the two different users who share one or more IP
addresses and the exact count of how many of these shared addresses
there are. Notice that this summary count is itself aliased as
"incommon" and is referred to that way in later clauses of the query.
The GROUP BY clause is important because it determines the "scope" for
how to count the shared addresses. We get only counts for IP
addresses shared by any pair of users denoted by t1.username and
t2.username.
As written, the HAVING clause shown above is merely a placeholder. It
is automatic that any returned value of incommon will be greater than
zero, since zero would mean no pairing occurred in the underlying
self-join on tblUserIp. Thus a zero value of incommon can never
appear. However it might be of interest to restrict the values of
incommmon to either one specific value (HAVING incommon = N) or to a
range of values (HAVING incommon BETWEEN nmin AND nmax).
Finally the ORDER BY clause sorts the resulting output so that pairs
of users with the largest number of IP addresses in common will appear
first.
The subject of how best to optimize such a query is complex,
especially if as you indicate is possible, the records are to be
distributed across a number of different computers. These next
suggestions should therefore be considered as starting points for
exploring a variety of alternatives and "tweakings" on the way to
finding a practical approach.
It has been suggested that count(1) is evaluated more efficiently by
MySQL than count(*), although the two expressions are logically
equivalent. This would be an easy change to test, to see which works
faster.
The focus here seems to be one of finding users with "large" numbers
of IP addresses in common. In that context we should restrict the
query to just those users who have at least the target number of IP
addresses which we seek to find in common. After all, if a user has
less than N IP addresses, that user cannot share N IP addresses in
common with anyone.
As a last point, the MySQL EXPLAIN SELECT command can be used to
illuminate the way that the query will be executed, especially its use
of indexes. Recall that above I added a primary key to the table and
ordered its fields in a certain way. The hope is that this might
create a unique index structured in a manner that facilitates the
"join" operation central to the proposed query. If MySQL does not
automatically figure out how to do this, it is possible to added
"hints" to the SQL in order to "force" the execution plan in the
desired manner.
regards, mathtalk-ga |