I like to have an efficient Oracle SQL*Loader program to load a
datafile into an relational table in Oracle 10g (10.1.0.3) database.
I know SQL loader pretty well. Here there is a small twist. The data
file is in a slightly different format.
Please see the example. This is just a sample file that I have created
to explain the problem.
Data file price.txt
===================
Product ID,Product Name,Black List Price,Black Sale Price,White List
Price,White Sale Price,Green List Price,Green Sale Price,
10011,Shoe,89.95,49.99,79.95,39.99,69.95,49.99
10032,Pant,99.99,69.99,89.99,59.99,99.99,69.99
Relational table price
=========================
------------------------------------------------------------------
PRODUCT_ID PRODUCT_NAME COLOR LIST_PRICE SALE_PRICE
------------------------------------------------------------------
10011 Shoe Black 89.95 49.99
10032 Pant Black 99.99 69.99
10011 Shoe White 79.95 39.99
10032 Pant White 89.99 59.99
10011 Shoe Green 69.95 49.99
10032 Pant Green 99.99 69.99
I am looking for the price.ctl file containing the SQL logic to
convert the data to have a "color" column.
The above example is just a sample.
In the real life problem, I have more than 500,000 records to load in
the databse on daily basis.
I also have files with different headers. Instead of "color" it may be
"size" that varies.
The solution must be scalable to high number of records.
I want to avoid running a sql script after loading the data in the
original format. But I will do if it can't be avoided.
Also, I like the solution should be flexible so that I can just change
few things in the ctl file to use the same for other data loads that
requires the same solution.
Thanks
Westonian |
Clarification of Question by
westonian-ga
on
15 Nov 2004 10:35 PST
The header of the test file is
===================================================
Product ID,Product Name,Black List Price,Black Sale Price,White List
Price,White Sale Price,Green List Price,Green Sale Price
===================================================
It got wrapped when send it.
Thanks
|
Request for Question Clarification by
mathtalk-ga
on
15 Nov 2004 16:56 PST
Hi, westonian-ga:
So the basic issue is how to get SQL*Loader to insert multiple rows
per single line of text in the input file, right?
Are there potential issues with "missing colors"? For example, shoes
and pants come in all three colors in your example, but perhaps gloves
are only available in black and white (color green prices are
missing).
regards, mathtalk-ga
|
Clarification of Question by
westonian-ga
on
15 Nov 2004 17:50 PST
Dear Mathtalk,
Thanks for picking this up.
In my data set, there will be always zeros if there is a missing
color. So technically, it is not a problem for our data loading
program. But it is the shortcoming of the system that generated the
original data.
It would be great to optionally NOT to insert a row in the table if
both LIST_PRICE and SALE_PRICE are zero.
We can have two modes.
1. Insert all rows.
2. Insert only the rows that at least have one non-zero value.
You can tell what part of the code I should change when I switch between modes?
As I said, the above example is just a sample. In my real data, there
are more than two (some times up to 15) repeating attributes like
LIST_PRICE and SALE_PRICE.
Thanks
Westonian
|
Clarification of Question by
westonian-ga
on
17 Nov 2004 18:16 PST
Veiledlady,
Thanks for the help. You are right about the difficulty in doing it in
SQL*Loader. I am not a SQL expert. But I have done some simple data
loading with SQL*Loader. If it is difficult to do it with SQL*Loader
or it affects the scalability, then I am ready to use a SQL script to
change the data after it has been loaded.
BTW, somebody told me that the SQL*Loader internally uses SQL commands
to load the data. If that so, will there any performance difference
between SQL*Loader and SQL scripts? In fact I found SQL*Loader is much
faster than SQL insert statements. I could be wrong.
I use Oracle 10g. I think external tables will work in 10g also. But I
don't how to use external tables to solve this problem. In other
words, I need the SQL script to flip the data.
Westonian
Thanks
Anto.
|
Request for Question Clarification by
mathtalk-ga
on
19 Nov 2004 21:05 PST
Hi, Anto.:
I suspect we can trick SQL*Loader into loading multiple records per
line of input, along the lines you want, by varying the concept of
loading multiple tables in one pass described here:
[Can one load data into multiple tables at once? -- orafaq.com]
http://www.orafaq.com/faqloadr.htm#MULTITABLE
It'll take me a while to polish an illustration of this (assuming it
works), and if you are already proficient with control files, you'll
be able to proceed independently in the meantime.
Oracle's own FAQ (the above is a third party) describes the difference
in load methods:
[sql_loader_overview]
http://www.oracle.com/technology/products/database/utilities/htdocs/sql_loader_overview.html
A "convention path" load does indeed use underlying INSERT statements
to import the data, on a record by record basis. A "direct path" load
is an optimization for speed in which memory blocks are built and then
saved "directly into the extents allocated for the table being
loaded."
regards, mathtalk-ga
|
Clarification of Question by
westonian-ga
on
23 Nov 2004 06:04 PST
Dear Mathtalk,
Thanks for the quick reply and sorry for the delay from my side.
The core part of my question is "how to pivot the table". It can be
done at load-time or after the load (using SQL). I assume I can avoid
intermediate steps and tables if I can some how do it at the load
time.
There is another option. Use multiple tables to load as you described
in your email. Then create a "view" joining the two tables and finally
producing my output format. If we go by this path, I still need to
have the option of ignoring or including the records with zero values.
May be this can be done by defining the view different ways (with and
without zero values).
In either case, I like to get the SQL to pivot (or merge in the case
of ?view?) the data. If I write the SQL my own, I doubt it may not be
the most efficient one. I am looking for the solution that works
efficiently in Oracle environment.
Thanks
Westonian
|
Clarification of Question by
westonian-ga
on
24 Nov 2004 11:07 PST
mathtalk-ga,
Please let me know whether you are working on this question.
BTW, I see that this may involve more of your time and I have
increased the List Price to $50.
Thanks
Westonian.
|
Request for Question Clarification by
mathtalk-ga
on
24 Nov 2004 12:47 PST
Thanks, westonian-ga.
I will be happy to share some SQL for pivoting the table, as well as a
stab at trying to make SQL*Loader do the trick. I don't think there
is any special SQL optimization involved here, though. You will be
taking all the rows from one "scratch" table and inserting them into
another table (probably after truncating the records from the latter
"target" table so that you start afresh).
Apart from dropping all or most indexes that you might like to have in
production for this target table, and rebuilding them after the load,
I dont' think that there is any particular trick to doing this
quickly. But I'll be happy to kick around with you the experimental
results & see if some optimizations occur to us.
regards, mathtalk-ga
|
Clarification of Question by
westonian-ga
on
24 Nov 2004 13:40 PST
mathtalk-ga,
Thanks for committing to this. I am already too late in getting this work done.
I hope you will find one of the following options.
Option 1. SQL*Loader does the magic with an option to ignor zero value rows.
Option 2. Load the data into multiple tables (normalized - with an ID
to join the tables). Then create two materialized views for mode 1 and
mode 2. (Mode 1 = ignore records with all zero prices. Mode 2= keep
all rows). The view definition has the pivoting SQL.
Option 1 Advantage: It is simple. It is a single step process. This
is what I asked for in my original question.
Option 2 Advantage: This option has an additional step (view
creation). But this option stores for much lesser amount of data due
to normalization. I guess the performance will be better in this
option because we select on small number of records. For example, if
we have 10 colors, Option 1 may have about 10 times more number of
records than the Option 2. Materialized views (instead of ordinary
views) cache the data and make it really work like a table. The other
advantage is that we don't have to store the zero-value records for
both modes. The view can populate the zero values based on a SQL
statement. This will further reduce storage.
I want to know what you think about these two options. Let me know
which way you are proceeding.
Thanks
Westonian
|
Clarification of Question by
westonian-ga
on
02 Dec 2004 22:10 PST
mathtalk-ga,
Are you there?
I need at least "one" solution by tommorrow (EOB 12/3/04). Even if it
is not the most efficient, it is OK. I am Ok to use SQL after loading
it.
Please help ASAP.
Thanks
westonian
|
Request for Question Clarification by
mathtalk-ga
on
03 Dec 2004 06:02 PST
Hi, westonian-ga:
Let's try this. Naturally I'm guessing at the datatypes based on your example.
CREATE table DataFile (
PRODUCT_ID Number(9),
PRODUCT_NAME Varchar2(30),
BLACK_LIST_P Number(9.2),
BLACK_SALE_P Number(9.2),
WHITE_LIST_P Number(9.2),
WHITE_SALE_P Number(9.2),
GREEN_LIST_P Number(9.2),
GREEN_SALE_P Number(9.2)
);
Then load the comma-delimited data file price.txt straight into table
DATA_FILE using SQL*Loader, skipping if it's present the header row
(or just allowing it to kick out into the error log since "Product ID"
won't convert to a number).
Now assuming that you have the column names/datatypes fixed up in the
above to match your actual target table, you can do this:
INSERT into Product_Price (
PRODUCT_ID,
PRODUCT_NAME,
COLOR,
LIST_PRICE,
SALE_PRICE
)
SELECT PRODUCT_ID,
PRODUCT_NAME,
'Black',
BLACK_LIST_P,
BLACK_SALE_P
from DataFile
where BLACK_LIST_P > 0
and BLACK_SALE_P > 0;
COMMIT;
INSERT into Product_Price (
PRODUCT_ID,
PRODUCT_NAME,
COLOR,
LIST_PRICE,
SALE_PRICE
)
SELECT PRODUCT_ID,
PRODUCT_NAME,
'White',
WHITE_LIST_P,
WHITE_SALE_P
from DataFile
where WHITE_LIST_P > 0
and WHITE_SALE_P > 0;
COMMIT;
INSERT into Product_Price (
PRODUCT_ID,
PRODUCT_NAME,
COLOR,
LIST_PRICE,
SALE_PRICE
)
SELECT PRODUCT_ID,
PRODUCT_NAME,
'Green',
GREEN_LIST_P,
GREEN_SALE_P
from DataFile
where GREEN_LIST_P > 0
and GREEN_SALE_P > 0;
COMMIT;
The speed at which this works for large datasets may be enhanced by
first dropping all the indexes on the target table Product_Price and
rebuilding them at the end. If you have a "development" database to
try this on, I'd certainly give it a try that way. Rebuilding the
indexes can take a long time, but it's usually more efficient to
rebuild them all at once than to incur the penalty of updating them
incrementally as INSERT's take place.
regards, mathtalk-ga
|