Google Answers: Optimum Database Architecture

View Question

Q: Optimum Database Architecture ( No Answer, 4 Comments )

Question

Subject: Optimum Database Architecture
Category: Computers > Programming
Asked by: subrataray-ga
List Price: $50.00

Posted: 17 Mar 2005 06:47 PST
Expires: 16 Apr 2005 07:47 PDT
Question ID: 496077

Hi,

We are developing an application that needs to use a massive back-end
database. The database will contain around 75 million rows with around
80 columns per row. We would prefer to use MySQL as the database
platform as it is free. The MySQL database would be hosted on a
dedicated server that we will purchase from a web hosting company.

This database would be used both by our customers and by our own employees.

The first column will contain some text which will be unique in each
row. 90% of the remaining columns will containing numbers and the
other columns will contain text.

The second column will contain numbers and it needs to be updated on a
monthly basis. But, we also need to store historical data regarding
the value of the second column for each row for the last 24 months, on
a rolling basis. This can either be done by adding more columns to the
same table, or by putting this historical data in a separate table,
depending on your recommendations.

Users will make 2 types of queries on this database:

i) The first type of query is what can be called a mission-critical
query - these queries will be made by our customers and the results of
these queries must be returned within 30 seconds at the most;
otherwise, customers are not going to want to use the application.
This query would basically involve asking the customer for a search
string, searching the FIRST column (and ONLY the first column) of the
entire database to find out each row that contains that search string
(either in whole or in part) and then returning all such rows to the
user sorted in descending order of the SECOND column. Only the
information in the first 2 columns will be returned to the customers -
the information in the other 78 columns will not be returned to the
customers. Customers will also have the option of specifying negative
matches - i.e. if the first column of a particular row contains any
one of a list of banned words or phrases, then that row will not be
returned even if it contained the primary search string.

ii) The second type of queries are non-mission-critical; these would
be run by our employees and it is ok if these queries take as much as
10 minutes to return results. However, the queries that our employees
will run are also much more complex - they will specify multiple
search criteria - for instance, "return all rows for which the 60th
column has a value > 2000 and the minimum value for the columns 40,
41, ... 50 for that row is 20 and the 35th column of that row is < 5"
etc.

It is quite possible that as many as 20 - 30 users will be querying
the database at the same time. Furthermore, there will be 5 - 6
different PHP scripts that are going to constantly update the
different columns and rows of the database with the values.

Here are my questions:

i) Is MySQL a realistic option for this kind of database?

ii) What should be the hardware configuration (processor type, number
of processors, RAM etc.) for the dedicated server that will host this
database. We are interested in the most cost-effective option; i.e.
the least powerful hardware configuration that is suitable for running
this system.

iii) What is the best way of designing the database architecture in
order to ensure that we are able to meet our targets regarding the
query times?

I might need to ask some follow-up questions based on your initial response.

Please be extremely detailed and specific in answering the questions,
especially the third question.

For the third question, please suggest the table structure including
whether we should keep all the data in a single table or whether we
should use multiple tables.

If we use multiple tables, how many such tables should we use, and
which columns should be present in each table? Also, we need to know
what should be the primary key, the unique key etc. for each table and
how the indexes should be defined.

Alternatively, if you recommend that we break up the database into
multiple tables not in terms of columns but in terms of rows (i.e.
keeping a fixed number of rows per table), then what is the ideal
number of rows that should be present in each table?

And, depending on your recommendation regarding the database
structure, please advise us regarding how we should build the queries
for the 2 types of queries that we need to support.

I would prefer that someone who has actual experience designing a
MySQL database with similar or even more demanding characteristics and
requirements answer my query.

I have set an initial price of $50 as I am not sure how many follow-up
questions we might need to ask. If we need to ask a considerable
number of follow-up questions which requires more time on your part,
and assuming the solution that you provide for us is effective, I'd be
more than happy to give you a substantial bonus.

If you need any clarifications from us before you can answer the
question, feel free to ask.

Answer

There is no answer at this time.

Comments

Subject: Re: Optimum Database Architecture
From: willcodeforfood-ga on 17 Mar 2005 08:40 PST

What is the width of the first column?  I'm hoping for your sake that
it's fairly small.

Subject: Re: Optimum Database Architecture
From: martyogelvie-ga on 17 Mar 2005 12:57 PST

considering there will be 75 million unique records in the 1st column
(pk), it would have to be semi wide...

I'm not any good at math but if the column were strictly numbers then
you would have 75,000,000 (8 characters).  Adding in some text would
help eliminate some.. I'm guessing somewhere between 5 and 6
characters wide.. maybe more.

Subject: Re: Optimum Database Architecture
From: subrataray-ga on 17 Mar 2005 20:22 PST

The first column would be around 100 characters long. Does that help?

Subject: Re: Optimum Database Architecture
From: willcodeforfood-ga on 18 Mar 2005 12:04 PST

Here is some feedback for part 3 of your question.

Let me start by saying that MySQL is not one of the tools I work with.
 I've worked with plenty of very large Oracle and SQL Server
databases, but not MySQL.  I think, however, that you will probably
need input from more than just one person to get a good feel for how
to approach this problem, so consider this my two cents worth.  Keep
in mind that I do not have a full knowledge of your system, so upon
further investigation and clarification, I might well revise parts of
this design and/or analysis.

=====================================================

I'll start with your "mission-critical" needs to suggest a database architecture.

In order to examine the first field to determine whether it contains a
particular substring will require that you decompose the field so that
you can utilize indexes and avoid full table scans.  Full table scans
on 75 million rows would require some very fast (and expensive)
hardware and possibly a distributed design to accommodate multiple
users obtaining results within 30 seconds.  It is not entirely clear
whether you are searching for substrings or words.  I've outlined some
tables below that demonstrate the design I'm leaning towards.

Here is your table:

Table: Main
Field1_Text (100 wide)     Field2_Num   F3 F4 F5 ...
-------------------------- ------------ -- -- --
Hello World                12345678     12  A  0 ...
Goodbye World              23456789      2  G  1 ...

=====================================================

This needs to be broken into the following tables:

Table: MainText
ID   Field1_Text (100 wide)      Field2_Num
---- --------------------------- ------------
1    Hello World                 12345678
2    Goodbye World               23456789

Table: MainNumHist
ID    Yr Mn Field2_Num
----- -- -- ------------
1     05 01 12345678
2     05 01 23456789

Table: MainDat
ID   Field1_Text (100 wide)     Field2_Num   F3 F4 F5 ...
---- -------------------------- ------------ -- -- --
1    Hello World                12345678     12  A  0 ...
2    Goodbye World              23456789      2  G  1 ...

=====================================================

Now to decompose Field1 so indexes can be used, add one of the following:

Table: TextParts                  -OR-  Table: TextWords
ID    TextPart                          ID    TextWord
----- --------------------------        ----- --------------------------
1     Hello World                       1     Hello
1     ello World                        1     World
1     llo World                         2     Goodbye
1     lo World                          2     World
1     o World                           ...
1      World
1     World
1     orld
1     rld
1     ld
1     d
2     Goodbye World
2     oodbye World
...

=====================================================

Here's some reasoning for the design:

MainText -- While this table duplicates data from MainDat, you
specified that you will have several applications running that are
going to update fields in your main table frequently.  You do not want
to be doing "mission critical" querying from tables that could be
locked for update by other applications.  That is a recipe for big
problems.  Primary key on the ID field.  No foreign keys.

MainNumHist -- Holds your historical values for Field 2.  You'll
either want to index (ID,Year,Month) or just (Year,Month) depending on
your needs.  No foreign keys.

MainDat -- Holds the data for your internal querying and updates.  ID
is auto-increment field and the primary key.  Apply other indexes
minimally as needed by your own internal processes.  No foreign keys.

TextParts -- Assumes you have to search Field1 for substrings,
otherwise use TextWords.  This table allows you to search for
substring without doing full tablescans on MainText with a costly
operator or function.  If your main table's Field1 is updated
frequently, maintaining the TextParts table could become problematic,
but from your question it sounded like that was not the case.  Primary
key on the ID field and index on the TextPart field.  No foreign keys.

So here are some sample queries to fetch results for your users:

+foo +bar

    select   Field1, Field2 from MainText
    where    ID in ( select ID from TextParts
                     where TextPart >= 'foo' and TextPart < 'foo~' )
      and    ID in ( select ID from TextParts
                     where TextPart >= 'bar' and TextPart < 'bar~' )
    order by Field2 descending

+foo -bar

    select   Field1, Field2 from MainText
    where    ID in ( select ID from TextParts
                     where TextPart >= 'foo' and TextPart < 'foo~' )
      and    ID not in ( select ID from TextParts
                         where TextPart >= 'bar' and TextPart < 'bar~' )
    order by Field2 descending

=====================================================

Now to look at storage requirements:

You say you'll have 75 million records.  At 300 Bytes/record your
primary table is 22.5 GB.

The biggest table by far will be the TextParts table.  Each record in
MainDat results in a maximum of 100 records in TextParts that consume
a total of no more than 5450 bytes (100(100+1)/2 + 4*100), for a grand
total maximum size of 7.5 billion rows taking up 408 GB.

So this design optimizes speed, but it grows your data significantly.

=====================================================

Once you have a final database design that can achieve your needs, you
can start to look at whether MySQL is a realistic solution.  As I've
indicated I'm probably not the best person to give guidance in this
area.  MySQL is clearly capable of handling a system of this size. 
I've definitely seen it used successfully on this scale for read-only
applications.  I believe, however, that the answer has more to do with
your needs regarding recovery should you suffer a hardware or software
failure.  That is one of the primary differences between commercial
and open source DBMS packages.  Here's an independent review of MySQL
and other DBMS:

[ http://www-css.fnal.gov/dsg/external/freeware/mysql-vs-pgsql.html ]

Important Disclaimer: Answers and comments provided on Google Answers are general information, and are not intended to substitute for informed professional medical, psychiatric, psychological, tax, legal, investment, accounting, or other professional advice. Google does not endorse, and expressly disclaims liability for any product, manufacturer, distributor, service or service provider mentioned or any opinion expressed in answers or comments. Please read carefully the Google Answers Terms of Service.

If you feel that you have found inappropriate content, please let us know by emailing us at answers-support@google.com with the question ID listed above. Thank you.

Search Google Answers for

Google Home - Answers FAQ - Terms of Service - Privacy Policy