Q: Database Normalization (Answered, 4 out of 5 stars, 11 Comments)
Question  
Subject: Database Normalization
Category: Computers > Programming
Asked by: shane43-ga
List Price: $20.00
Posted: 24 Jun 2004 13:21 PDT
Expires: 24 Jul 2004 13:21 PDT
Question ID: 365759
I created a database a while ago that I thought was pretty "normal",
meaning there is no duplicate data anywhere that could allow for
corruption. I have been pretty satisfied with it for the past year,
but now it is getting pretty big - almost 2 GB in size - and it has
started to lag a lot on certain queries. I know that it could be
faster if I duplicated data, but I am worried about doing that because
I like it being normalized. Can someone provide advice on what (if
any) changes I should make to the database structure in order to make
it reliable, yet speedy?

Here's a breakdown of how it is now. I have removed irrelevant parts:

Customers
---------
customer_id
Name

Purchases
---------
purchase_id
customer_id
date

Orders
---------
order_id
purchase_id
number_ordered
email_type
is_filled
price_paid

Emails
---------
email_id
order_id
email_type
email_address

This isn't our business model, but the structure is similar. Simply
put - A customer will sign up and create a purchase to receive email
addresses from the site. The purchase can contain 1 or more orders of
a particular email type. Each order is linked to the purchase and
tells how many emails of each type were needed. As emails get
collected by affiliate sites, they get assigned to an order that is
not filled. To better explain, here's a sample of rows:

Customers
---------
1001 / Bob

Purchases
---------
2001 / 1001 / Jan 1, 2001


Orders
---------
3001 / 2001 / 3 / British / 0 / $20


Emails
---------
4001 / 3001 / British / somegirl@england.com
4002 / 3001 / British / someguy@england.com
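
Roughly, the assignment step looks for an unfilled order of the
matching type, something like:

Select order_id from orders
 WHERE email_type = 'British'
 and   is_filled = 0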

Anyway - It seems to work well, except since our database is so filled
with orders and emails it takes a while to pull statistics. For
example, in order to find out how many emails customer 1001 has ever
received to date, we have to do:
Select count(*) from emails,orders,purchases,customers
 WHERE emails.order_id = orders.order_id
   and orders.purchase_id = purchases.purchase_id
   and purchases.customer_id = 1001

Since there are a lot of rows in these tables the query could take up
to 10 seconds. I have things indexed too. So - Is there a better way
to be doing this?

Should I keep a field in the orders table that says "number_received"
so that we would just have to total up that column instead of count
all returned email rows?

Should I cache the results from the above query every couple of hours
so that it wouldn't take up time every time it was run?

Or can you recommend a good way of archiving data so that none of the
old rows have to be searched through?

Similar questions for accounting issues. Each order has a price
attached to it. To get the price of the purchase we have to sum the
orders up. Would it be better to store the sum of all the orders in
the purchases table? Keep in mind that we still need to keep a
breakdown of the prices of the orders.
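
For example, to get the total for purchase 2001 we currently run
something like:

Select sum(price_paid) from orders
 WHERE purchase_id = 2001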

What would you do?

Thanks so much!
Shane

Request for Question Clarification by mathtalk-ga on 24 Jun 2004 13:37 PDT
Hi, shane43-ga:

Except for the brief remark "I have things indexed too" you give no
information about what indexes are defined.  This is the first place
to try to improve performance, then by tweaking SQL queries, and only
as a last resort would I consider denormalization as an optimization
technique.

So, how about giving the definitions of indexes on these tables, and
then some suggestions about optimizing the particular query:

Select count(*) from emails,orders,purchases,customers
 WHERE emails.order_id = orders.order_id
   and orders.purchase_id = purchases.purchase_id
   and purchases.customer_id = 1001

can be made.  It would also be helpful to know which relational DBMS
you are using.  Most will display a "query plan" that helps to
identify where the query execution is bogging down.

regards, mathtalk-ga

Clarification of Question by shane43-ga on 24 Jun 2004 15:56 PDT
Hey MathTalk - 

When I say I indexed things I meant I have indexed every single column
that is ever searched. I don't care about how long it takes to insert
data anymore, so I practically have indexed all columns.

I'm using MySQL 3.23. You just made me think about indexing multiple
columns, such as order_id and purchase_id as one index. Is that
supported by MySQL? Would that help at all?

To give you an example of the size of the database, the actual 'email'
table is 2.1 GB, with the indices taking up a little over 1 MB.

A friend of mine told me to increase the memory of our server, so we
increased it to 2 GB. A page that used to take 2 minutes 30 seconds to
load now takes 1 minute 30 seconds, which is an improvement, but we
want better.

If you need more info, please ask. The real database is made up of
57 tables and occupies 2.4 gigs. I was trying to simplify things by
using this model I described, but if you feel like diving deep into
more details I can expand.

Request for Question Clarification by mathtalk-ga on 25 Jun 2004 07:51 PDT
An index which combines two or more columns is called a composite
index, and yes, this is exactly the sort of idea I think you need to
explore.

Indexes on single columns may not be helping the way you want, but
they probably do no harm on SELECT queries.  They may cost a bit in
overhead for UPDATE, INSERT, and DELETE queries though, so it's a good
idea to design indexes with an eye on how the database will be used.

I'll look into the capabilities of MySQL 3.23 and make some suggestions.

--mathtalk-ga

Clarification of Question by shane43-ga on 25 Jun 2004 10:35 PDT
Like I said - we only do a couple of inserts or updates at a time, so
I'm not worried about the time there. When we do selects, we are
usually selecting several thousand rows, so that's where I need the
help. Please point me toward a good tutorial that will help me learn
multiple-column indexing for MySQL.

Thanks!
Answer  
Subject: Re: Database Normalization
Answered By: mathtalk-ga on 27 Jun 2004 15:01 PDT
Rated: 4 out of 5 stars
 
Hi, shane43-ga:

Sometimes a composite (multi-column) index will be inherent in a
normalized table's design, as when the primary key consists of a
combination of fields.  Other times we may choose to create composite
indexes as performance tools.

Let's begin with the sample tables and the query you've provided:

Select count(*) from emails,
                     orders,
                     purchases,
                     customers
 WHERE emails.order_id = orders.order_id
 and   orders.purchase_id = purchases.purchase_id
 and   purchases.customer_id = 1001

Keeping in mind that this is only an illustration, of course, the
first thing I'd like to do is strip out the table customers from the
query.  There aren't any conditions on rows in customers in the WHERE
clause, so the SQL shown actually forms a cross join with customers
and would inflate the count!
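
With customers stripped out, the equivalent query is simply:

Select count(*) from emails,
                     orders,
                     purchases
 WHERE emails.order_id = orders.order_id
 and   orders.purchase_id = purchases.purchase_id
 and   purchases.customer_id = 1001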

We'd certainly like to have an index on table purchases for
customer_id, which allows us to quickly pick out all the rows
pertaining to a fixed customer.  And if you've already created all the
"one column" indexes, then we have that index, and a given customer_id
leads directly to its set of (unique) purchase_id values.

Next we have the join between table purchases and table orders.

From the look of things, the primary key on table purchases is
purchase_id and the primary key on table orders is order_id.  So far
we don't have any requirement to create a composite index just for the
sake of defining a primary key.  But between the purchases and orders
tables there seems to be a one-many relationship, because each row in
orders has a field purchase_id (but there is no converse reference in
table purchases to a value of order_id).

This suggests that we might want a composite index on table orders
that has the two columns purchase_id and order_id.  While order_id by
itself would be unique (as the primary key), creating the composite
index will most likely allow the MySQL query optimizer to use only the
index and not the entire record in table orders (which involves a lot
of additional data, as the tables above are defined).  In effect the
index is a smaller and more easily searched version of the table
itself, but I think we would want to follow the sequence:

CREATE UNIQUE INDEX purchase_order
  ON orders (purchase_id, order_id)

[MySQL Manual | 14.2.4 CREATE INDEX Syntax]
http://dev.mysql.com/doc/mysql/en/CREATE_INDEX.html

If the indexes for MySQL tables are built in the usual way (as
B-trees), a composite index sorts first on the leading field and then
on the subsequent fields within it (here purchase_id --> order_id).

The only advantage here of the composite index over the single column
index on purchase_id in the table orders is the savings of scanning
just the composite index versus scanning the index plus the table (to
pull out the order_id).

Finally we come to the table emails, on which we presumably have as
primary key the field email_id and a non-unique index for order_id. 
From the given definition there are only two other fields (email_type
and email_address), so it might not be a big advantage to go with any
composite index.  On the other hand this is a "simplified" schema; in
your actual application something of the sort might be useful.

* * * * * * * * * * * * * * * * * *

Now let's talk about "denormalizing" the database, as you originally
suggested.  The "data-warehouse" approach mentioned by scubajim-ga
could be applied to your existing schema in some minor ways without
creating too many problems, assuming that the volume of updates is
low.

The idea would be to populate tables with "counts" that summarize the
portion of a hierarchy below them, so queries can be written with
fewer joins.  Here we have a functional dependence (I think) that
proceeds:

customers
--> purchases
    --> orders
        --> emails

by virtue of the synthetic key at each level being used to
characterize rows at the next level down.

In such a strict hierarchy one could add a column at each level that
summarizes the counts of records below.  For example, we might add a
new column to table orders, which gives the number of rows in emails
which are associated to any given row in orders.  Provided the
orders/emails relationship is rarely changed after the fact (which may
or may not correspond to your business model), we might enforce the
accuracy of this summary count by a "trigger".  Unfortunately it
appears that triggers have not been implemented in MySQL prior to
version 5.1, so you would have to rely on some regularly timed batch
jobs to maintain these counts until an upgrade becomes available:

[MySQL Manual | 1.8.5.4 Stored Procedures and Triggers]
http://dev.mysql.com/doc/mysql/en/ANSI_diff_Triggers.html
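
As a sketch, such a batch job could refresh the counts with a
correlated subquery (a feature MySQL only gained in version 4.1; on
3.23 the counts would have to be computed into a temporary table or
from application code):

UPDATE orders
   SET num_emails = (Select count(*) from emails
                      WHERE emails.order_id = orders.order_id)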

However assuming the existence of an integer-valued column num_emails
added to the table orders, your original query (minus table customers)
would be equivalent to:

Select sum(num_emails) from orders,
                            purchases
 WHERE orders.purchase_id = purchases.purchase_id
 and   purchases.customer_id = 1001

[MySQL Manual | 13.9.1 GROUP BY (Aggregate) Functions]
http://dev.mysql.com/doc/mysql/en/GROUP-BY-Functions.html

If performance required additional tweaking, you could carry the
pre-computation of counts to the next higher level, adding a column
num_emails to table purchases (which would add the results over
corresponding rows in table orders), or even all the way back to table
customers if that were useful (though of course the higher you go with
the summary pre-computation, the more vulnerable the data is to
becoming outdated).

regards, mathtalk-ga

Clarification of Answer by mathtalk-ga on 27 Jun 2004 15:07 PDT
Another technique to be aware of is the EXPLAIN option to see what
"query plan" MySQL puts together to execute:

[MySQL Manual | 7.2.1 EXPLAIN Syntax (Get Information About a SELECT)]
http://dev.mysql.com/doc/mysql/en/EXPLAIN.html
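
For example, prefixing the earlier query:

EXPLAIN Select count(*) from emails,
                             orders,
                             purchases
 WHERE emails.order_id = orders.order_id
 and   orders.purchase_id = purchases.purchase_id
 and   purchases.customer_id = 1001

The output shows, for each table, which index was chosen and roughly
how many rows must be examined; "Using index" in the Extra column
means the query is being answered from the index alone.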

If you find that MySQL isn't using your indexes the way you'd hope it
would, some ways to "coax" the engine are here:

[MySQL Manual | A.6 Optimizer-Related Issues]
http://dev.mysql.com/doc/mysql/en/Optimiser_Issues.html

regards, mathtalk-ga

Request for Answer Clarification by shane43-ga on 28 Jun 2004 10:17 PDT
Thanks for the thorough answer. A few follow-up questions:

1) Can you explain why you suggest putting a unique index on orders
(purchase_id, order_id) but don't talk about one on purchases
(customer_id, purchase_id)? If I understand composite indices
correctly, it seems like that would also help.

2) So, from now on, whenever I have a query that is looking up more
than 1 column from a table, instead of indexing the columns
separately, I should make a composite index of all of them?

3) you write:
>The only advantage here in having the composite index over having the
>single column index on purchase_id in the table orders is the savings
>in scanning the composite index vs. scanning the index plus the table
>(to pull out the order_id).

Is this still the case if there is a single column index on order_id
in both tables? I'm trying to see why the whole row would need to be
scanned when there is an index for order_id.

FYI, the inclusion of the customer table was my typo. It's not like
that in the real system.

Thanks!

Clarification of Answer by mathtalk-ga on 28 Jun 2004 11:12 PDT
Hi, shane43-ga:

1) Logically the same reasoning applies to both tables (purchases and
orders).  The point I was trying to make is this: compared with using
the single column index on (say) customer_id to find the rows in
purchases (and thus the value of purchase_id for each such row), a
composite index on purchases consisting of customer_id and purchase_id
mainly wins by letting us work with the index rather than the table,
i.e. a smaller chunk of memory I/O.  Since the table purchases as you
defined it only has one other field (date), the saving is modest
there; that is why I presented the composite index on orders as the
more compelling case.
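
Should you want the analogous index on purchases anyway, it would be
(the index name is arbitrary):

CREATE UNIQUE INDEX customer_purchase
  ON purchases (customer_id, purchase_id)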

2) I wouldn't say that the rule is to replace all single column
indexes with a composite index that combines them all, even if the
query looks them all up.  The thing to keep an eye on is how the query
gets narrowed down to the required records, so it's more a matter of
what the query uses in the WHERE clause than what it returns in the
SELECT list.  In some cases, like the one here, the main hope
is that the index will provide all the information needed to execute
the query without actually reading the table at all.  In other cases
the choice of indexes has to be forcefully presented to the query
optimizer so that it does the joins in the most efficient order. 
That's where the EXPLAIN facility can give you some feedback, if you
think you've defined enough indexes to make the query run quickly. 
The Query Plan will tell whether or not your intentions are being
realized.

3) The flow I visualize for your SQL query is this:

given customer_id, find rows in purchases (get purchase_id)
  given purchase_id, find rows in orders (get order_id)
    given order_id, find rows in emails (count them)

Now given one index on orders with purchase_id and another (unique)
index there for order_id, there's no navigation from one (single
column) index to another.  The index points into the table, so in this
particular query the single column index of orders by order_id doesn't
help (even though it is important to the "business logic" of the
table).

So again, putting both purchase_id and order_id into the _same_ index
on table orders should be useful, if you put them in this order,
because the MySQL query optimizer should be able to figure out how to
implement the flow of information outlined above (e.g. the functional
dependence of order_id on purchase_id).

I don't know how closely the example follows your actual schema, but
it seems from the follow-up questions you are asking, you will
understand how the same sort of experiments could be done with other
queries and schemas.  Do some timings to see if things really are
improved by new indexes!  Too many times programmers take it on faith
that the index is making things better, but I keep my "chronometer"
watch handy to verify!  Timings are also affected by how busy the
machine is with other work (esp. other database transactions).

regards, mathtalk-ga
shane43-ga rated this answer: 4 out of 5 stars
Nice thorough answer. I'm a bit unclear on one thing, but I probably
just need to re-read it a couple more times. Thanks!

Comments  
Subject: Re: Database Normalization
From: scubajim-ga on 24 Jun 2004 15:49 PDT
 
For statistical purposes and data spelunking you need a data
warehouse.  A data warehouse uses different principles for schemas
than OLTP (which is what you have here).
Subject: Re: Database Normalization
From: djlarsu-ga on 24 Jun 2004 18:44 PDT
 
Yes, MySQL supports multi-column indices.  Given your comment that
your e-mail table takes up 2.1 GB while the index takes only 1 MB, my
guess is something isn't right.  The MySQL manual has lots of good stuff on
optimization, and the mailing list is a great resource too.  In
particular, prefix some of your queries with 'explain ' to find out
what indices are being used.

If you're not trying to hide your schema / queries, you can list them
out somewhere, and I'll give whatever help I can.
Subject: Re: Database Normalization
From: crythias-ga on 24 Jun 2004 23:05 PDT
 
Do the owners of the emails know you're selling their emails to other people?
Subject: Re: Database Normalization
From: shane43-ga on 25 Jun 2004 10:32 PDT
 
scubajim - I guess we need something that offers both OLTP and data
mining. Are you suggesting we have 2 databases?

djlarsu - Thanks for the tip about explain, I'll look into that. I
can't release the actual queries and schema, as my boss is very
protective about proprietary stuff.

crythias - As I said before, this is not our business model, but it is
similar. We are not selling email addresses, I just thought that that
was a good example that everyone could understand w/o having to write
out a 3 page explanation of what exactly is going on.
Subject: Re: Database Normalization
From: scubajim-ga on 25 Jun 2004 10:59 PDT
 
You could 
A. Keep it in the same database but have a different set of tables.
B. Create another database and copy the data there.

In either event, a data warehouse involves copying the data from the
OLTP tables into the data warehouse tables.  The data warehouse tables
have a different structure than OLTP tables.  This "copying process"
is called ETL, which stands for extract, transform, load.

A data warehouse usually has fact tables and look up tables.  It makes
it much easier to write reports etc.  It is NOT designed for a
transaction system.
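
As a minimal sketch against the sample schema above (the summary
table name is invented), a nightly ETL step could be as simple as:

CREATE TABLE customer_email_counts
Select purchases.customer_id, count(*) as email_count
  from emails, orders, purchases
 WHERE emails.order_id = orders.order_id
 and   orders.purchase_id = purchases.purchase_id
 GROUP BY purchases.customer_id

Reports then read the small summary table instead of joining the big
OLTP tables.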
Subject: Re: Database Normalization
From: crythias-ga on 26 Jun 2004 06:55 PDT
 
I wonder if you might not add customer_id to Emails and Orders.
*ducking as thousands of database gurus scream in disgust*
Subject: Re: Database Normalization
From: shane43-ga on 26 Jun 2004 09:42 PDT
 
crythias - That's exactly my sort of question. Of course it will make
things speedy, which is nice - but the integrity of the database will
be compromised for sure. I wish there was an easy solution...
Subject: Re: Database Normalization
From: crythias-ga on 26 Jun 2004 20:08 PDT
 
Actually, what corruption of the database can there be by adding
fields? I mean, unless your database is reliant on only so many
fields, and one more field will cause turmoil, ...

The old queries are always valid. A new (one time) update query can be
made based upon the old queries to fill the field in existing orders
and emails, and the INSERT queries merely have one more field to
populate for new records.
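
A sketch of that one-time backfill for emails (this multi-table UPDATE
syntax needs MySQL 4.0 or later; on 3.23 it would have to be
scripted):

UPDATE emails, orders, purchases
   SET emails.customer_id = purchases.customer_id
 WHERE emails.order_id = orders.order_id
 and   orders.purchase_id = purchases.purchase_id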

Of course, backups are recommended, etc., before changing your schema,
but I don't really believe what I proposed is necessarily destructive,
nor does it invalidate anything you currently have in place.

The proposed query:

Select count(*) from emails,orders,purchases,customers
 WHERE emails.order_id = orders.order_id
   and orders.purchase_id = purchases.purchase_id
   and purchases.customer_id = 1001

could be simply select count(*) from emails where customer_id=1001
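
That query would want an index on the new column, e.g. (the index name
is invented):

CREATE INDEX emails_customer ON emails (customer_id)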

Also, and maybe I am wrong here, but the customer-to-email and
customer-to-order are each a one-to-many relationship. Unless you are
in need of a grouping by purchase or order (which doesn't appear to be
the case in a count as above) ...

I have a question: 
Orders
---------
order_id
purchase_id
number_ordered
email_type
is_filled
price_paid

Emails
---------
email_id
order_id
email_type
email_address

Why does Order reference email_type and Emails reference email_type? I
know, I know, it's not your business model, but is it redundant or
loop-referencing?
Subject: Re: Database Normalization
From: shane43-ga on 27 Jun 2004 09:30 PDT
 
Thanks for the input, crythias - 

I consider adding another customer_id to the Emails table as risky
because if somehow the customer_id gets changed on Emails, but the
order_id doesn't get updated to reflect this, then the database is
corrupt. We will never know who that email was assigned to since the
data conflicts.

I know you would respond 'well just make sure that you update it in
both places!' and I would have agreed with you a year ago. A while ago
I used a similar method in the original design of the database. I had
a num_received column in the orders table. To my surprise, every once
in a while that number would be out of sync with the actual number of
rows it represented. I'm sure it was because I was not locking tables
when updating them, simply because there was so much traffic that to
lock them would make users have to wait. So I removed that column and
relied on the count() for that value from then on. Maybe I should have
stuck to the original plan and used transactions, but I think they
weren't supported in MySQL at the time. Is that something you
recommend looking into now?
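
For reference, the transactional version of that double write would
look something like this, assuming a table type that supports
transactions (e.g. InnoDB, which was optional in the 3.23 era):

BEGIN;
INSERT INTO emails (order_id, email_type, email_address)
VALUES (3001, 'British', 'someguy@england.com');
UPDATE orders SET num_received = num_received + 1
 WHERE order_id = 3001;
COMMIT;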

In regards to email_type: Sometimes when an email is submitted, it
isn't immediately assigned to an order, so we 'queue' it and wait
until there is an order that needs it. Then we use the email_type to
match it up with the type of the order when a new order comes along.
It is a little redundant, can you think of a better method?

Thanks!
Shane
Subject: Re: Database Normalization
From: crythias-ga on 27 Jun 2004 16:31 PDT
 
I'm not anywhere close to being an expert, and what I'd suggest may
differ from what data warehousing might offer, or from what an expert
might suggest.

here's yours:
Customers
---------
customer_id
Name

Purchases
---------
purchase_id
customer_id
date

Orders
---------
order_id
purchase_id
number_ordered
email_type
is_filled
price_paid

Emails
---------
email_id
order_id
email_type
email_address

OK, now I have a question. Would "Emails" be an inventory of widgets
that could be sold multiple times, or one widget sold one time?

What I'm trying to say is that you introduced the possibility of the
widget not being assigned to an order ... in that case, you should
have an Email_Inventory table that would be looked-up when an order
was placed. This would be separate from what was actually sold. If the
Email_Inventory item was sold, the quantity needs to either be
decremented (for multiples of the widget) or marked as sold (for
unique of the widget, or do nothing if this widget is reusable, like a
download).

Email_inventory
---------------
email_id
email_type
email_address
email_quantity (potentially optional).

Maybe "Emails" is the inventory list, but it ordered emails really
should be separate from what is in inventory.

Does a purchase have multiple orders? If not, reduce the load by
getting rid of the purchases table.

An order can request multiple email types, that's fine. Or if the
purchase says each order is a different email type, that's fine, too.

The tree (if you're using a relational database) should point one
customer_id to the number of purchases or orders that customer_id makes.
Frankly, I'd consider orders and purchases to be a one-to-one, but
your methods may vary. Each purchase contains x orders of email_types
each. (is order-to-email_type one to one or one-to-many?... based upon
the schema posted, order-to-email_type *should* be one-to-one) The
"Emails" table should actually be:
Emails
------
index_key (optional)
email_id
order_id

Where email_id is a subset of Email_Inventory.email_id.  Actually,
the order doesn't (shouldn't?) even need to carry email_type, since
email_type can be determined from email_id.  (You don't need to ask
the order what email_type was ordered; it's redundant.)

If I'm way off, please let me know. If I'm not helping you, I'm really
sorry, and I'll bug out.
Subject: Re: Database Normalization
From: shane43-ga on 28 Jun 2004 10:23 PDT
 
crythias - how dare you think yourself a bother. You're giving me free
advice! I should feel in debt to you :)

Each 'email' entry is a unique item, and it can be assigned to more
than one order (I left that part out), so the inventory quantity
doesn't really apply to our business model.  The email_type is also
needed since we won't be able to determine the type of an email if it
hasn't been assigned to an order yet.

Thanks for all your suggestions! I think I am going to try out some
composite indices and see if that speeds things up. If not, then I'll
be forced to change the schema :(

Thanks again!
