Q: Database Normalization (Answered, 4 out of 5 stars, 11 Comments)
Question  
Subject: Database Normalization
Category: Computers > Programming
Asked by: shane43-ga
List Price: $20.00
Posted: 24 Jun 2004 13:21 PDT
Expires: 24 Jul 2004 13:21 PDT
Question ID: 365759
I created a database a while ago that I thought was pretty "normal",
meaning there is no duplicate data anywhere that could allow for
corruption. I have been pretty satisfied with it for the past year,
but now it is getting pretty big - almost 2 GB in size - and it has
started to lag a lot on certain queries. I know that it could be
faster if I duplicated data, but I am worried about doing that because
I like it being normalized. Can someone provide advice on what (if
any) changes I should make to the database structure in order to make
it reliable, yet speedy?

Here's a breakdown of how it is now. I have removed irrelevant parts:

Customers
---------
customer_id
Name

Purchases
---------
purchase_id
customer_id
date

Orders
---------
order_id
purchase_id
number_ordered
email_type
is_filled
price_paid

Emails
---------
email_id
order_id
email_type
email_address

This isn't our business model, but the structure is similar. Simply
put - A customer will sign up and create a purchase to receive email
addresses from the site. The purchase can contain 1 or more orders of
a particular email type. Each order is linked to the purchase and
tells how many emails of each type were needed. As emails get
collected by affiliate sites, they get assigned to an order that is
not filled. To better explain, here's a sample of rows:

Customers
---------
1001 / Bob

Purchases
---------
2001 / 1001 / Jan 1, 2001


Orders
---------
3001 / 2001 / 3 / British / 0 / $20


Emails
---------
4001 / 3001 / British / somegirl@england.com
4002 / 3001 / British / someguy@england.com
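
Roughly, the assignment step looks for an unfilled order of the
matching type, something like:

Select order_id from orders
 WHERE email_type = 'British'
 and   is_filled = 0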

Anyway - It seems to work well, except since our database is so filled
with orders and emails it takes a while to pull statistics. For
example, in order to find out how many emails customer 1001 has ever
received to date, we have to do:
Select count(*) from emails,orders,purchases,customers
 WHERE emails.order_id = orders.order_id
   and orders.purchase_id = purchases.purchase_id
   and purchases.customer_id = 1001

Since there are a lot of rows in these tables the query could take up
to 10 seconds. I have things indexed too. So - Is there a better way
to be doing this?

Should I keep a field in the orders table that says "number_received"
so that we would just have to total up that column instead of count
all returned email rows?

Should I cache the results from the above query every couple of hours
so that it wouldn't take up time every time it was run?

Or can you recommend a good way of archiving data so that none of the
old rows have to be searched through?

Similar questions for accounting issues. Each order has a price
attached to it. To get the price of the purchase we have to sum the
orders up. Would it be better to store the sum of all the orders in
the purchases table? Keep in mind that we still need to keep a
breakdown of the prices of the orders.
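
For example, to get the total for purchase 2001 we currently run
something like:

Select sum(price_paid) from orders
 WHERE purchase_id = 2001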

What would you do?

Thanks so much!
Shane

Request for Question Clarification by mathtalk-ga on 24 Jun 2004 13:37 PDT
Hi, shane43-ga:

Except for the brief remark "I have things indexed too" you give no
information about what indexes are defined.  This is the first place
to try to improve performance, then by tweaking SQL queries, and only
as a last resort would I consider denormalization as an optimization
technique.

So, how about giving the definitions of indexes on these tables, and
then some suggestions about optimizing the particular query:

Select count(*) from emails,orders,purchases,customers
 WHERE emails.order_id = orders.order_id
   and orders.purchase_id = purchases.purchase_id
   and purchases.customer_id = 1001

can be made.  It would also be helpful to know which relational DBMS
you are using.  Most will display a "query plan" that helps to
identify where the query execution is bogging down.

regards, mathtalk-ga

Clarification of Question by shane43-ga on 24 Jun 2004 15:56 PDT
Hey MathTalk - 

When I say I indexed things I meant I have indexed every single column
that is ever searched. I don't care about how long it takes to insert
data anymore, so I practically have indexed all columns.

I'm using MySQL 3.23. You just made me think about indexing multiple
columns, such as order_id and purchase_id as one index. Is that
supported by MySQL? Would that help at all?

To give you an example of the size of the database, the actual 'email'
table is 2.1 GB, with the indices taking up a little over 1 MB.

A friend of mine told me to increase the memory of our server, so we
increased it to 2 GB. A page that used to take 2 minutes 30 seconds to
load now takes 1 minute 30 seconds, which is an improvement, but we
want better.

If you need more info, please ask. The real database is made up of
57 tables and occupies 2.4 gigs. I was trying to simplify things by
using this model I described, but if you feel like diving deep into
more details I can expand.

Request for Question Clarification by mathtalk-ga on 25 Jun 2004 07:51 PDT
An index which combines two or more columns is called a composite
index, and yes, this is exactly the sort of idea I think you need to
explore.

Indexes on single columns may not be helping the way you want, but
they probably do no harm on SELECT queries.  They may cost a bit in
overhead for UPDATE, INSERT, and DELETE queries though, so it's a good
idea to design indexes with an eye on how the database will be used.

I'll look into the capabilities of MySQL 3.23 and make some suggestions.

--mathtalk-ga

Clarification of Question by shane43-ga on 25 Jun 2004 10:35 PDT
Like I said - we only do a couple of inserts or updates at a time, so
I'm not worried about the time there. When we do selects, we are
usually selecting several thousand rows, so that's where I need the
help. Please point me toward a good tutorial that will help me learn
multiple-column indexing for MySQL.

Thanks!
Answer  
Subject: Re: Database Normalization
Answered By: mathtalk-ga on 27 Jun 2004 15:01 PDT
Rated: 4 out of 5 stars
 
Hi, shane43-ga:

Sometimes a composite (multi-column) index will be inherent in a
normalized table's design, as when the primary key consists of a
combination of fields.  Other times we may choose to create composite
indexes as performance tools.

Let's begin with the sample tables and the query you've provided:

Select count(*) from emails,
                     orders,
                     purchases,
                     customers
 WHERE emails.order_id = orders.order_id
 and   orders.purchase_id = purchases.purchase_id
 and   purchases.customer_id = 1001

Keeping in mind that this is only an illustration, of course, the
first thing I'd like to do is strip out the table customers from the
query.  There aren't any conditions on rows in customers in the WHERE
clause, so the SQL shown actually forms a cross join with customers
and would inflate the count!
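
With customers stripped out, the equivalent query is simply:

Select count(*) from emails,
                     orders,
                     purchases
 WHERE emails.order_id = orders.order_id
 and   orders.purchase_id = purchases.purchase_id
 and   purchases.customer_id = 1001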

We'd certainly like to have an index on table purchases for
customer_id, which allows us to quickly pick out all the rows
pertaining to a fixed customer.  And if you've already created all the
"one column" indexes, then we have that index, and a given customer_id
leads directly to its set of (unique) purchase_id values.

Next we have the join between table purchases and table orders.

From the look of things, the primary key on table purchases is
purchase_id and the primary key on table orders is order_id.  So far
we don't have any requirement to create a composite index just for the
sake of defining a primary key.  But between the purchases and orders
tables there seems to be a one-many relationship, because each row in
orders has a field purchase_id (but there is no converse reference in
table purchases to a value of order_id).

This suggests that we might want a composite index on table orders
that has the two columns purchase_id and order_id.  While order_id by
itself would be unique (as the primary key), creating the composite
index will most likely allow the MySQL query optimizer to use only the
index and not the entire record in table orders (which involves a lot
of additional data, as the tables above are defined).  In effect the
index is a smaller and more easily searched version of the table
itself, but I think we would want to follow the sequence:

CREATE UNIQUE INDEX purchase_order
  ON orders (purchase_id, order_id)

[MySQL Manual | 14.2.4 CREATE INDEX Syntax]
http://dev.mysql.com/doc/mysql/en/CREATE_INDEX.html

If the indexes for MySQL tables are built in the usual way (as
B-trees), a composite index sorts first on the leading field and then
on the subsequent fields within it (here purchase_id --> order_id).

The only advantage here of the composite index over the single column
index on purchase_id in the table orders is the savings of scanning
just the composite index versus scanning the index plus the table (to
pull out the order_id).

Finally we come to the table emails, on which we presumably have as
primary key the field email_id and a non-unique index for order_id. 
From the given definition there are only two other fields (email_type
and email_address), so it might not be a big advantage to go with any
composite index.  On the other hand this is a "simplified" schema; in
your actual application something of the sort might be useful.

* * * * * * * * * * * * * * * * * *

Now let's talk about "denormalizing" the database, as you originally
suggested.  The "data-warehouse" approach mentioned by scubajim-ga
could be applied to your existing schema in some minor ways without
creating too many problems, assuming that the volume of updates is
low.

The idea would be to populate tables with "counts" that summarize the
portion of a hierarchy below them, so queries can be written with
fewer joins.  Here we have a functional dependence (I think) that
proceeds:

customers
--> purchases
    --> orders
        --> emails

by virtue of the synthetic key at each level being used to
characterize rows at the next level down.

In such a strict hierarchy one could add a column at each level that
summarizes the counts of records below.  For example, we might add a
new column to table orders, which gives the number of rows in emails
which are associated to any given row in orders.  Provided the
orders/emails relationship is rarely changed after the fact (which may
or may not correspond to your business model), we might enforce the
accuracy of this summary count by a "trigger".  Unfortunately it
appears that triggers have not been implemented in MySQL prior to
version 5.1, so you would have to rely on some regularly timed batch
jobs to maintain these counts until an upgrade becomes available:

[MySQL Manual | 1.8.5.4 Stored Procedures and Triggers]
http://dev.mysql.com/doc/mysql/en/ANSI_diff_Triggers.html
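
As a sketch, such a batch job could refresh the counts with a
correlated subquery (a feature MySQL only gained in version 4.1; on
3.23 the counts would have to be computed into a temporary table or
from application code):

UPDATE orders
   SET num_emails = (Select count(*) from emails
                      WHERE emails.order_id = orders.order_id)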

However assuming the existence of an integer-valued column num_emails
added to the table orders, your original query (minus table customers)
would be equivalent to:

Select sum(num_emails) from orders,
                            purchases
 WHERE orders.purchase_id = purchases.purchase_id
 and   purchases.customer_id = 1001

[MySQL Manual | 13.9.1 GROUP BY (Aggregate) Functions]
http://dev.mysql.com/doc/mysql/en/GROUP-BY-Functions.html

If performance required additional tweaking, you could carry the
pre-computation of counts to the next higher level, adding a column
num_emails to table purchases (which would add the results over
corresponding rows in table orders), or even all the way back to table
customers if that were useful (though of course the higher you go with
the summary pre-computation, the more vulnerable the data is to
becoming outdated).

regards, mathtalk-ga

Clarification of Answer by mathtalk-ga on 27 Jun 2004 15:07 PDT
Another technique to be aware of is the EXPLAIN option to see what
"query plan" MySQL puts together to execute:

[MySQL Manual | 7.2.1 EXPLAIN Syntax (Get Information About a SELECT)]
http://dev.mysql.com/doc/mysql/en/EXPLAIN.html
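
For example, prefixing the earlier query:

EXPLAIN Select count(*) from emails,
                             orders,
                             purchases
 WHERE emails.order_id = orders.order_id
 and   orders.purchase_id = purchases.purchase_id
 and   purchases.customer_id = 1001

The output shows, for each table, which index was chosen and roughly
how many rows must be examined; "Using index" in the Extra column
means the query is being answered from the index alone.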

If you find that MySQL isn't using your indexes the way you'd hope it
would, some ways to "coax" the engine are here:

[MySQL Manual | A.6 Optimizer-Related Issues]
http://dev.mysql.com/doc/mysql/en/Optimiser_Issues.html

regards, mathtalk-ga

Request for Answer Clarification by shane43-ga on 28 Jun 2004 10:17 PDT
Thanks for the thorough answer. A few follow-up questions:

1) Can you explain why you suggest putting a unique index on orders
(purchase_id, order_id) but don't talk about one on purchases
(customer_id, purchase_id)? If I understand composite indices
correctly, it seems like that would also help.

2) So, from now on, whenever I have a query that is looking up more
than 1 column from a table, instead of indexing the columns
separately, I should make a composite index of all of them?

3) you write:
>The only advantage here in having the composite index over having the
>single column index on purchase_id in the table orders is the savings
>in scanning the composite index vs. scanning the index plus the table
>(to pull out the order_id).

Is this still the case if there is a single column index on order_id
in both tables? I'm trying to see why the whole row would need to be
scanned when there is an index for order_id.

FYI, the inclusion of the customer table was my typo. It's not like
that in the real system.

Thanks!

Clarification of Answer by mathtalk-ga on 28 Jun 2004 11:12 PDT
Hi, shane43-ga:

1) Logically the same reasoning applies to both tables (purchases and
orders).  The point I was trying to make is this: compared with using
the single column index on (say) customer_id to find the rows in
purchases (and thus the value of purchase_id for each such row), a
composite index on purchases consisting of customer_id and purchase_id
mainly wins by letting us work with the index rather than the table,
i.e. a smaller chunk of memory I/O.  Since the table purchases as you
defined it only has one other field (date), the saving is modest
there; that is why I presented the composite index on orders as the
more compelling case.
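
Should you want the analogous index on purchases anyway, it would be
(the index name is arbitrary):

CREATE UNIQUE INDEX customer_purchase
  ON purchases (customer_id, purchase_id)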

2) I wouldn't say that the rule is to replace all single column
indexes with a composite index that combines them all, even if the
query looks them all up.  The thing to keep an eye on is how the query
gets narrowed down to the required records, so it's more a matter of
what the query uses in the WHERE clause than what it returns in the
SELECT list.  In some cases, like the one here, the main hope
is that the index will provide all the information needed to execute
the query without actually reading the table at all.  In other cases
the choice of indexes has to be forcefully presented to the query
optimizer so that it does the joins in the most efficient order. 
That's where the EXPLAIN facility can give you some feedback, if you
think you've defined enough indexes to make the query run quickly. 
The Query Plan will tell whether or not your intentions are being
realized.

3) The flow I visualize for your SQL query is this:

given customer_id, find rows in purchases (get purchase_id)
  given purchase_id, find rows in orders (get order_id)
    given order_id, find rows in emails (count them)

Now given one index on orders with purchase_id and another (unique)
index there for order_id, there's no navigation from one (single
column) index to another.  The index points into the table, so in this
particular query the single column index of orders by order_id doesn't
help (even though it is important to the "business logic" of the
table).

So again, putting both purchase_id and order_id into the _same_ index
on table orders should be useful, if you put them in this order,
because the MySQL query optimizer should be able to figure out how to
implement the flow of information outlined above (e.g. the functional
dependence of order_id on purchase_id).

I don't know how closely the example follows your actual schema, but
it seems from the follow-up questions you are asking, you will
understand how the same sort of experiments could be done with other
queries and schemas.  Do some timings to see if things really are
improved by new indexes!  Too many times programmers take it on faith
that the index is making things better, but I keep my "chronometer"
watch handy to verify!  Timings are also affected by how busy the
machine is with other work (esp. other database transactions).

regards, mathtalk-ga
shane43-ga rated this answer: 4 out of 5 stars
Nice thorough answer. I'm a bit unclear on one thing, but I probably
just need to re-read it a couple more times. Thanks!

Comments  
Subject: Re: Database Normalization
From: scubajim-ga on 24 Jun 2004 15:49 PDT
 
For statistical purposes and data spelunking you need a data
warehouse.  A data warehouse uses different principles for schemas
than OLTP (which is what you have here).
Subject: Re: Database Normalization
From: djlarsu-ga on 24 Jun 2004 18:44 PDT
 
Yes, MySQL supports multi-column indices.  Given your comment that
your e-mail table takes up 2.1 GB while the index takes only 1 MB, my
guess is something isn't right.  The MySQL manual has lots of good stuff on
optimization, and the mailing list is a great resource too.  In
particular, prefix some of your queries with 'explain ' to find out
what indices are being used.

If you're not trying to hide your schema / queries, you can list them
out somewhere, and I'll give whatever help I can.
Subject: Re: Database Normalization
From: crythias-ga on 24 Jun 2004 23:05 PDT
 
Do the owners of the emails know you're selling their emails to other people?
Subject: Re: Database Normalization
From: shane43-ga on 25 Jun 2004 10:32 PDT
 
scubajim - I guess we need something that offers both OLTP and data
mining. Are you suggesting we have 2 databases?

djlarsu - Thanks for the tip about explain, I'll look into that. I
can't release the actual queries and schema, as my boss is very
protective about proprietary stuff.

crythias - As I said before, this is not our business model, but it is
similar. We are not selling email addresses, I just thought that that
was a good example that everyone could understand w/o having to write
out a 3 page explanation of what exactly is going on.
Subject: Re: Database Normalization
From: scubajim-ga on 25 Jun 2004 10:59 PDT
 
You could 
A. Keep it in the same database but have a different set of tables.
B. Create another database and copy the data there.

In either event, a data warehouse involves copying the data from the
OLTP tables into the data warehouse tables.  The data warehouse tables
have a different structure than OLTP tables.  This "copying process"
is called ETL, which stands for extract, transform, load.

A data warehouse usually has fact tables and look up tables.  It makes
it much easier to write reports etc.  It is NOT designed for a
transaction system.
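
As a minimal sketch against the sample schema above (the summary
table name is invented), a nightly ETL step could be as simple as:

CREATE TABLE customer_email_counts
Select purchases.customer_id, count(*) as email_count
  from emails, orders, purchases
 WHERE emails.order_id = orders.order_id
 and   orders.purchase_id = purchases.purchase_id
 GROUP BY purchases.customer_id

Reports then read the small summary table instead of joining the big
OLTP tables.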
Subject: Re: Database Normalization
From: crythias-ga on 26 Jun 2004 06:55 PDT
 
I wonder if you might not add customer_id to Emails and Orders.
*ducking as thousands of database gurus scream in disgust*
Subject: Re: Database Normalization
From: shane43-ga on 26 Jun 2004 09:42 PDT
 
crythias - That's exactly my sort of question. Of course it will make
things speedy, which is nice - but the integrity of the database will
be compromised for sure. I wish there was an easy solution...
Subject: Re: Database Normalization
From: crythias-ga on 26 Jun 2004 20:08 PDT
 
Actually, what corruption of the database can there be by adding
fields? I mean, unless your database is reliant on only so many
fields, and one more field will cause turmoil, ...

The old queries are always valid. A new (one time) update query can be
made based upon the old queries to fill the field in existing orders
and emails, and the INSERT queries merely have one more field to
populate for new records.
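
A sketch of that one-time backfill for emails (this multi-table UPDATE
syntax needs MySQL 4.0 or later; on 3.23 it would have to be
scripted):

UPDATE emails, orders, purchases
   SET emails.customer_id = purchases.customer_id
 WHERE emails.order_id = orders.order_id
 and   orders.purchase_id = purchases.purchase_id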

Of course, backups are recommended, etc., before changing your schema,
but I don't really believe what I proposed is necessarily destructive,
nor does it invalidate anything you currently have in place.

The proposed query:

Select count(*) from emails,orders,purchases,customers
 WHERE emails.order_id = orders.order_id
   and orders.purchase_id = purchases.purchase_id
   and purchases.customer_id = 1001

could be simply select count(*) from emails where customer_id=1001
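
That query would want an index on the new column, e.g. (the index name
is invented):

CREATE INDEX emails_customer ON emails (customer_id)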

Also, and maybe I am wrong here, but the customer-to-email and
customer-to-order are each a one-to-many relationship. Unless you are
in need of a grouping by purchase or order (which doesn't appear to be
the case in a count as above) ...

I have a question: 
Orders
---------
order_id
purchase_id
number_ordered
email_type
is_filled
price_paid

Emails
---------
email_id
order_id
email_type
email_address

Why does Order reference email_type and Emails reference email_type? I
know, I know, it's not your business model, but is it redundant or
loop-referencing?
Subject: Re: Database Normalization
From: shane43-ga on 27 Jun 2004 09:30 PDT
 
Thanks for the input, crythias - 

I consider adding another customer_id to the Emails table as risky
because if somehow the customer_id gets changed on Emails, but the
order_id doesn't get updated to reflect this, then the database is
corrupt. We will never know who that email was assigned to since the
data conflicts.

I know you would respond 'well just make sure that you update it in
both places!' and I would have agreed with you a year ago. A while ago
I used a similar method in the original design of the database. I had
a num_received column in the orders table. To my surprise, every once
in a while that number would be out of sync with the actual number of
rows it represented. I'm sure it was because I was not locking tables
when updating them, simply because there was so much traffic that to
lock them would make users have to wait. So I removed that column and
relied on the count() for that value from then on. Maybe I should have
stuck to the original plan and used transactions, but I think they
weren't supported in MySQL at the time. Is that something you
recommend looking into now?
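
For reference, the transactional version of that double write would
look something like this, assuming a table type that supports
transactions (e.g. InnoDB, which was optional in the 3.23 era):

BEGIN;
INSERT INTO emails (order_id, email_type, email_address)
VALUES (3001, 'British', 'someguy@england.com');
UPDATE orders SET num_received = num_received + 1
 WHERE order_id = 3001;
COMMIT;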

In regards to email_type: Sometimes when an email is submitted, it
isn't immediately assigned to an order, so we 'queue' it and wait
until there is an order that needs it. Then we use the email_type to
match it up with the type of the order when a new order comes along.
It is a little redundant, can you think of a better method?

Thanks!
Shane
Subject: Re: Database Normalization
From: crythias-ga on 27 Jun 2004 16:31 PDT
 
I'm not anywhere close to being an expert, and what I'd suggest may
differ from what data warehousing might offer, or from what an expert
might suggest.

here's yours:
Customers
---------
customer_id
Name

Purchases
---------
purchase_id
customer_id
date

Orders
---------
order_id
purchase_id
number_ordered
email_type
is_filled
price_paid

Emails
---------
email_id
order_id
email_type
email_address

OK, now I have a question. Would "Emails" be an inventory of widgets
that could be sold multiple times, or one widget sold one time?

What I'm trying to say is that you introduced the possibility of the
widget not being assigned to an order ... in that case, you should
have an Email_Inventory table that would be looked-up when an order
was placed. This would be separate from what was actually sold. If the
Email_Inventory item was sold, the quantity needs to either be
decremented (for multiples of the widget) or marked as sold (for
unique of the widget, or do nothing if this widget is reusable, like a
download).

Email_inventory
---------------
email_id
email_type
email_address
email_quantity (potentially optional).

Maybe "Emails" is the inventory list, but it ordered emails really
should be separate from what is in inventory.

Does a purchase have multiple orders? If not, reduce the load by
getting rid of the purchases table.

An order can request multiple email types, that's fine. Or if the
purchase says each order is a different email type, that's fine, too.

The tree (if you're using a relational database) should point one
customer_id to the number of purchases or orders that customer_id makes.
Frankly, I'd consider orders and purchases to be a one-to-one, but
your methods may vary. Each purchase contains x orders of email_types
each. (is order-to-email_type one to one or one-to-many?... based upon
the schema posted, order-to-email_type *should* be one-to-one) The
"Emails" table should actually be:
Emails
------
index_key (optional)
email_id
order_id

Where email_id is a subset of Email_Inventory.email_id.  Actually,
the order doesn't (shouldn't?) even need to carry email_type, since
email_type can be determined from email_id.  (You don't need to ask
the order what email_type was ordered; it's redundant.)

If I'm way off, please let me know. If I'm not helping you, I'm really
sorry, and I'll bug out.
Subject: Re: Database Normalization
From: shane43-ga on 28 Jun 2004 10:23 PDT
 
crythias - how dare you think yourself a bother. You're giving me free
advice! I should feel in debt to you :)

Each 'email' entry is a unique item, and it can be assigned to more
than one order (I left that part out), so the inventory quantity
doesn't really apply to our business model.  The email_type is also
needed since we won't be able to determine the type of an email if it
hasn't been assigned to an order yet.

Thanks for all your suggestions! I think I am going to try out some
composite indices and see if that speeds things up. If not, then I'll
be forced to change the schema :(

Thanks again!
