Hi, shane43-ga:
Sometimes a composite (multi-column) index will be inherent in a
normalized table's design, as when the primary key consists of a
combination of fields. Other times we may choose to create composite
indexes as performance tools.
Let's begin with the sample tables and the query you've provided:
Select count(*) from emails,
orders,
purchases,
customers
WHERE emails.order_id = orders.order_id
and orders.purchase_id = purchases.purchase_id
and purchases.customer_id = 1001
Keeping in mind that this is only an illustration, of course, the
first thing I'd like to do is strip out the table customers from the
query. There aren't any conditions on rows in customers in the WHERE
clause, so in reality the SQL shown would create a lot of duplication!
We'd certainly like to have an index on table purchases for
customer_id, which allows us to quickly pick out all the rows
pertaining to a fixed customer. And if you've already done all the
"one column" indexes, then we have that, and the (unique) purchase_id
values will have a functional dependence then on customer_id.
Next we have the join between table purchases and table orders.
From the look of things, the primary key on table purchases is
purchase_id and the primary key on table orders is order_id. So far
we don't have any requirement to create a composite index just for the
sake of defining a primary key. But between the purchases and orders
tables there seems to be a one-many relationship, because each row in
orders has a field purchase_id (but there is no converse reference in
table purchases to a value of order_id).
This suggests that we might want a composite index on table orders
that has the two columns purchase_id and order_id. While order_id by
itself would be unique (as the primary key), creating the composite
index will most likely allow the MySQL query optimizer to use only the
index and not the entire record in table orders (which involves a lot
of additional data, as the tables above are defined). In effect the
index is a smaller and more easily searched version of the table
itself, but I think we would want to follow the sequence:
CREATE UNIQUE INDEX purchase_order
ON orders (purchase_id, order_id)
[MySQL Manual | 14.2.4 CREATE INDEX Syntax]
http://dev.mysql.com/doc/mysql/en/CREATE_INDEX.html
If the indexes for MySQL tables are done in the usual way, it will
create "branches" for the first field and "leaves" following them for
the subsequent fields (here purchase_id --> order_id), like a directed
graph.
The only advantage here in having the composite index over having the
single column index on purchase_id in the table orders is the savings
in scanning the composite index vs. scanning the index plus the table
(to pull out the order_id).
Finally we come to the table emails, on which we presumably have as
primary key the field email_id and a non-unique index for order_id.
From the given definition there are only two other fields (email_type
and email_address), so it might not be a big advantage to go with any
composite index. On the other hand this is a "simplified" schema; in
your actual application something of the sort might be useful.
* * * * * * * * * * * * * * * * * *
Now let's talk about "denormalizing" the database, as you originally
suggested. The "data-warehouse" approach mentioned by scubajim-ga
could be applied to your existing schema in some minor ways without
creating too many problems, assuming that the volume of updates is
low.
The idea would be to populate tables with "counts" that summarize the
portion of a hierachy below them, so queries can be written with fewer
joins. Here we have a functional dependence (I think) that proceeds:
customers
--> purchases
--> orders
--> emails
by virtue of the synthetic key at each level being used to
characterize rows at the next level down.
In such a strict hierachy one could add a column at each level that
summarizes the counts of records below. For example, we might add a
new column to table orders, which gives the number of rows in emails
which are associated to any given row in orders. Provided the
orders/emails relationship is rarely changed after the fact (which may
or may not correspond to your business model), we might enforce the
accuracy of this summary count by a "trigger". Unfortunately it
appears that triggers have not been implemented in MySQL prior to
version 5.1, so you would have to rely on some regularly timed batch
jobs to maintain these counts until an upgrade becomes available:
[MySQL Manual | 1.8.5.4 Stored Procedures and Triggers]
http://dev.mysql.com/doc/mysql/en/ANSI_diff_Triggers.html
However assuming the existence of an integer-valued column num_emails
added to the table orders, your original query (minus table customers)
would be equivalent to:
Select sum(num_emails) from orders,
purchases
WHERE orders.purchase_id = purchases.purchase_id
and purchases.customer_id = 1001
[MySQL Manual | 13.9.1 GROUP BY (Aggregate) Functions]
http://dev.mysql.com/doc/mysql/en/GROUP-BY-Functions.html
If performance required additional tweaking, you could carry the
pre-computation of counts to the next higher level, adding a column
num_emails to table purchases (which would add the results over
corresponding rows in table orders), or even all the way back to table
customers if that were useful (though of course the higher you go with
the summary pre-computation, the more vulnerable the data is to
becoming outdated).
regards, mathtalk-ga |
Request for Answer Clarification by
shane43-ga
on
28 Jun 2004 10:17 PDT
Thanks for the thorough answer. A few follow-up questions:
1) Can you explain why you suggest putting a unique index on orders
(order_id,purchase_id) but you don't talk about one on purchases
(purchase_id,customer_id). If I understand composite indices
correctly, it seems like that would also help.
2) So, from now on, whenever I have a query that is looking up more
than 1 column from a table, instead of indexing the columns
seperately, I should make a composite index of all of them?
3) you write:
>The only advantage here in having the composite index over having the
>single column index on purchase_id in the table orders is the savings
>in scanning the composite index vs. scanning the index plus the table
>(to pull out the order_id).
Is this still the case if there is a single column index on order_id
in both tables? I'm trying to see why the whole row would need to be
scanned when their is an index for order_id.
FYI, the inclusion of the customer table was my typo. It's not like
that in the real system.
Thanks!
|
Clarification of Answer by
mathtalk-ga
on
28 Jun 2004 11:12 PDT
Hi, shane43-ga:
1) Logically the same reasoning exists for both tables (purchases and
orders). However the point I was trying to make is that from the
standpoint of using the single column index on (say) customer_id to
find the rows in purchases (and thus the value of purchase_id for each
such row), versus using a composite index on purchases consisting of
customer_id and purchase_id, the main advantage would be working with
the index rather than the table as a smaller chunk of memory I/O.
Here the table purchases as you defined it only has one other field
(date), so I presented the case for the composite index on orders as
having a more compelling argument.
2) I wouldn't say that the rule is to replace all single column
indexes with a composite index that combines them all, even if the
query looks them all up. The thing to keep an eye on is how the query
gets narrowed down to the required records, so it's more a matter of
what the query uses in the WHERE clause rather than what it looks up
in the SELECT's head. In some cases like the one here the main hope
is that the index will provide all the information needed to execute
the query without actually reading the table at all. In other cases
the choice of indexes has to be forcefully presented to the query
optimizer so that it does the joins in the most efficient order.
That's where the EXPLAIN facility can give you some feedback, if you
think you've defined enough indexes to make the query run quickly.
The Query Plan will tell whether or not your intentions are being
realized.
3) The flow I visualize for your SQL query is this:
given customer_id, find rows in purchases (get purchase_id)
given purchase_id, find rows in orders (get order_id)
given order_id, find rows in emails (count them)
Now given one index on orders with purchase_id and another (unique)
index there for order_id, there's no navigation from one (single
column) index to another. The index points into the table, so in this
particular query the single column index of orders by order_id doesn't
help (even though it is important to the "business logic" of the
table).
So again, putting both purchase_id and order_id into the _same_ index
on table orders should be useful, if you put them in this order,
because the MySQL query optimizer should be able to figure out how to
implement the flow of information outlined above (e.g. the functional
dependence of order_id on purchase_id).
I don't know how closely the example follows your actual schema, but
it seems from the follow-up questions you are asking, you will
understand how the same sort of experiments could be done with other
queries and schemas. Do some timings to see if things really are
improved by new indexes! Too many times programmers take it on faith
that the index is making things better, but I keep my "chronometer"
watch handy to verify! Timings are also affected by how busy the
machine is with other work (esp. other database transactions).
regards, mathtalk-ga
|