I have tab delimited text files that I need to put into an Oracle
database. I think making the tables and putting the data into them is
supposed to be very easy, the sql definitions are here, but they may
need to be changed for Oracle. There are three types of table that
need to be used. They are called kgXref, knownGene and chrX_chainSelf.
There are 25 versions of the third, one for each chromosome. (X, Y, M,
1-22).These are the definitions for the tables:
CREATE TABLE kgXref (
kgID varchar(40) NOT NULL default '',
mRNA varchar(40) default NULL,
spID varchar(40) default NULL,
spDisplayID varchar(40) default NULL,
geneSymbol varchar(40) default NULL,
refseq varchar(40) default NULL,
protAcc varchar(40) default NULL,
description varchar(255) default NULL,
PRIMARY KEY (kgID),
KEY mRNA (mRNA),
KEY spID (spID),
KEY spDisplayID (spDisplayID),
KEY geneSymbol (geneSymbol),
KEY refseq (refseq),
KEY protAcc (protAcc)
) TYPE=MyISAM;
_so the text file looks liek this, without the row labels. (values
have been truncated)
#kgID mRNA spID spDisplayID geneSymbol refseq protAcc description
AB0000 AB0000 P56555 DSR4_HUMAN DSCR4 NM_0057 NP_0058 Down syndrome
critical region gene 4
AB0001 AB0001 Q99983 OMD_HUMAN OMD NM_0050 NP_0050 osteomodulin
AB0001 AB0001 Q99984 Q99984 C1orf29 NM_0068 NP_0068 mRNA expressed in osteoblast
______________________________________________________________________________________
CREATE TABLE knownGene (
name varchar(255) NOT NULL default '',
chrom varchar(255) NOT NULL default '',
strand char(1) NOT NULL default '',
txStart int(10) unsigned NOT NULL default '0',
txEnd int(10) unsigned NOT NULL default '0',
cdsStart int(10) unsigned NOT NULL default '0',
cdsEnd int(10) unsigned NOT NULL default '0',
exonCount int(10) unsigned NOT NULL default '0',
exonStarts longblob NOT NULL,
exonEnds longblob NOT NULL,
proteinID varchar(40) NOT NULL default '',
alignID varchar(8) NOT NULL default '',
KEY name (name(10)),
KEY chrom (chrom(12),txStart),
KEY chrom_2 (chrom(12),txEnd),
KEY protein (proteinID(12)),
KEY align (alignID)
) TYPE=MyISAM;
__so the text file looks liek this:
#name chrom strand txStart txEnd cdsStart cdsEnd exonCount
exonStarts exonEnds proteinID alignID
BC0636 chr1 - 4224 7502 4558 7173 2 4224,4832, 4692,4901, AAH63682 2425
AL1377 chr1 - 4266 19346 7413 19346 2 4266,4832, 4692,6628, Q9NSV7 41739
BC0154 chr1 - 4268 5808 4558 5808 1 4268, 4692, Q96BN3 32430
______________________________________________________________________________________
__ will need one for each chromosome, replacing X with Y, M, 1-22.
CREATE TABLE chrX_chainSelf (
bin smallint(5) unsigned NOT NULL default '0',
score double NOT NULL default '0',
tName varchar(255) NOT NULL default '',
tSize int(10) unsigned NOT NULL default '0',
tStart int(10) unsigned NOT NULL default '0',
tEnd int(10) unsigned NOT NULL default '0',
qName varchar(255) NOT NULL default '',
qSize int(10) unsigned NOT NULL default '0',
qStrand char(1) NOT NULL default '',
qStart int(10) unsigned NOT NULL default '0',
qEnd int(10) unsigned NOT NULL default '0',
id int(10) unsigned NOT NULL default '0',
KEY bin (bin),
KEY tStart (tStart),
KEY tEnd (tEnd),
KEY id (id)
) TYPE=MyISAM;
__so the text file looks like this
#bin score tName tSize tStart tEnd qName qSize qStrand qStart qEnd id
585 63520 chr10 135037 14000 16578 chr7 158545 - 96339 9634 204969
585 63670 chr10 135037 14000 16578 chr7 158545 + 62286 6228 204161
585 81668 chr10 135037 14000 17239 chr7 158545 + 62827 6283 133728
______________________________________________________________________________________
__here are the text file sizes, if that matters.
95560 -rwxr-xr-x 1 sbradley valle 97748346 2003-12-22 12:52
chr10_chainSelf.txt
96460 -rwxr-xr-x 1 sbradley valle 98672181 2003-12-22 12:53
chr11_chainSelf.txt
58964 -rwxr-xr-x 1 sbradley valle 60311109 2003-12-22 12:55
chr12_chainSelf.txt
36596 -rwxr-xr-x 1 sbradley valle 37431021 2003-12-22 12:56
chr13_chainSelf.txt
29804 -rwxr-xr-x 1 sbradley valle 30480088 2003-12-22 12:56
chr14_chainSelf.txt
36828 -rwxr-xr-x 1 sbradley valle 37668773 2003-12-22 12:57
chr15_chainSelf.txt
92684 -rwxr-xr-x 1 sbradley valle 94809905 2003-12-22 12:58
chr16_chainSelf.txt
67224 -rwxr-xr-x 1 sbradley valle 68760641 2003-12-22 12:59
chr17_chainSelf.txt
42204 -rwxr-xr-x 1 sbradley valle 43166380 2003-12-22 13:00
chr18_chainSelf.txt
534956 -rwxr-xr-x 1 sbradley valle 547254235 2003-12-22 13:03
chr19_chainSelf.txt
168984 -rwxr-xr-x 1 sbradley valle 172860089 2003-12-22 13:09
chr1_chainSelf.txt
41448 -rwxr-xr-x 1 sbradley valle 42393466 2003-12-22 13:11
chr20_chainSelf.txt
19048 -rwxr-xr-x 1 sbradley valle 19478514 2003-12-22 13:12
chr21_chainSelf.txt
63368 -rwxr-xr-x 1 sbradley valle 64817190 2003-12-22 13:13
chr22_chainSelf.txt
126324 -rwxr-xr-x 1 sbradley valle 129222607 2003-12-22 13:14
chr2_chainSelf.txt
114164 -rwxr-xr-x 1 sbradley valle 116784863 2003-12-22 13:15
chr3_chainSelf.txt
96836 -rwxr-xr-x 1 sbradley valle 99056270 2003-12-22 13:17
chr4_chainSelf.txt
105668 -rwxr-xr-x 1 sbradley valle 108090354 2003-12-22 13:19
chr5_chainSelf.txt
75448 -rwxr-xr-x 1 sbradley valle 77174930 2003-12-22 13:20
chr6_chainSelf.txt
480440 -rwxr-xr-x 1 sbradley valle 491479445 2003-12-22 13:23
chr7_chainSelf.txt
80096 -rwxr-xr-x 1 sbradley valle 81928240 2003-12-22 13:27
chr8_chainSelf.txt
161140 -rwxr-xr-x 1 sbradley valle 164835742 2003-12-22 13:29
chr9_chainSelf.txt
60 -rwxr-xr-x 1 sbradley valle 55970 2003-12-22 13:31
chrM_chainSelf.txt
86224 -rwxr-xr-x 1 sbradley valle 88198936 2003-12-22 13:31
chrX_chainSelf.txt
118128 -rwxr-xr-x 1 sbradley valle 120836544 2003-12-22 13:33
chrY_chainSelf.txt
9896 -rwxr-xr-x 1 sbradley valle 10114459 2004-04-10 09:39 knownGene.txt
3856 -rwxr-xr-x 1 sbradley valle 3944028 2004-04-10 09:39 kgXref.txt
_______________________________________________________________________________________________
I was told to make the tables and then "wrap insert statements" around
each line in each text file, but I think there are maybe hundred
thousand lines and i find it hard to believe that the program cant
just take in the file somehow.
I want the commands that I use from the sql prompt. Command line thing. SQL> |
Request for Question Clarification by
mathtalk-ga
on
12 Sep 2004 10:02 PDT
Hi, iterative-ga:
You may be trying to use the wrong Oracle utility.
The Oracle tool SQL*Plus is for executing SQL commands, like insert statements.
The Oracle tool SQL*Loader is for bulk loading records into a table.
Since you have the tab delimited text files & the rough table "DDL"
(data definition language), I would use SQL*Plus to create the tables
and use SQL*Loader to populate them.
Here's a link to a third-party FAQ for Oracle's SQL*Loader utility:
http://www.orafaq.com/faqloadr.htm
and here's Oracle's own "overview":
http://www.oracle.com/technology/products/database/utilities/htdocs/sql_loader_overview.html
Let me know if you'd like more information about using SQL*Loader for
your tables and data posted as an Answer.
regards, mathtalk-ga
|
Clarification of Question by
iterative-ga
on
12 Sep 2004 15:44 PDT
I appreciate your information but they are hints I dont understand. If
your more information amounts to the text that I will type at the ssh
or sql prompt to make the tables and input the data, then I would like
you to be the answerer. Im not trying to find out what tools or
command types to use but the actual commands with my variables
included; the total text I will type. Thanks.
|
Request for Question Clarification by
mathtalk-ga
on
12 Sep 2004 19:04 PDT
Hi, iterative-ga:
If you plan to use only SQL*Plus (the program that gives you the SQL
prompt), you will have to transform your tab delimited text files into
SQL statements, namely some text files containing INSERT commands, one
per row of each table that you need to populate. The Oracle syntax of
the INSERT statement in SQL for specific values works like this,
assuming the table kgXref belongs to your own "schema":
INSERT into kgXref (kgID, mRNA, ... , protAcc, description)
VALUES ('AB0000','AB0000', ... , 'NP_0058', 'Down syndrome');
As you can see, some sort of programming would be necessary to turn
your tab delimited files, containing many thousands of records, into
SQL commands that can be processed by SQL*Plus.
The alternative that I'm suggesting would be to use a different
program that Oracle provides, called SQL*Loader. This is intended for
tasks of bulk loading tables from a variety of text-based data files.
Perhaps you can find someone to give you hands on help in 1) creating
the tables using SQL commands in SQL*Plus, and 2) then loading those
tables from your data files using SQL*Loader. Once you've seen it
done for two or three tables, you'll have a recipe to do the last
couple of dozen tables yourself. This is especially true since most
of the "chromosome" tables have the same layout.
From the looks of things, you may be porting data from a MySQL
database to Oracle. So the table definitions need to be changed to
use Oracle syntax, as you observed already. Then you need to make
sure that, in addition to having the SQL*Plus program available (at
the secure shell prompt in Unix?), you can also execute the SQL*Loader
program. If you checkout the links I showed you before, you'll see
that the command to execute SQL*Loader can depend on the platform (and
on the Oracle version, too), so given whether it is DOS, Unix,
Windows, or Linux, the syntax might vary. In general it may be
something like:
ssh> sqlldr <username>/<password>@<dblink> CONTROL=<control_file_name>
The control file has its own extensive syntax rules which allow a lot
of flexibility in importing data. [It will also prompt you for the
password, if you omit that, which is the recommended approach on
Unix.] There are about a dozen other "optional" command line
arguments besides CONTROL=control_file_name, but you certainly can
avoid learning everything at once. The default values for most
parameters will work fine for you. The control file is the critical
piece. It allows some of the fields in a record to be imported and
not others, which might be the desire with the "row labels" in your
data.
Here's another brief introduction with some sample control file
examples. Your Oracle user's guide documentation will have more
details.
[Introduction to SQL*Loader -- O'Reilly]
http://www.oreilly.com/catalog/orsqlloader/chapter/ch01.html
I recommend executing "sqlldr" at your ssh command prompt, just to see
if the program is available to you. With no arguments it spits back
the list of possible command line arguments (most of which you will
subsume into the control file when actually using SQL*Loader).
regards, mathtalk-ga
|
Clarification of Question by
iterative-ga
on
14 Sep 2004 07:53 PDT
Thank you for the additional hints. yes I have the loader
command/program. I am upping the price as it is not readily obvious to
me how to translate the sql scheme into oracle.
|
Request for Question Clarification by
mathtalk-ga
on
14 Sep 2004 13:21 PDT
Let's start with the table definitions. I'll post them as a Comment
since this is not the complete answer to your question, but will be
required to make further progress.
The syntax shown indicates the creation of a number of single column
keys in addition to defining a (unique) primary key on the first
table. The "other" keys are essentially just indexes, and can just as
well be defined after the tables are loaded. Is that an acceptable
approach?
regards, mathtalk-ga
|
Clarification of Question by
iterative-ga
on
14 Sep 2004 20:11 PDT
_I am worried_ about this question not working out because it appears
not as cut and dry as I thought. Only if you think the original
question can be answered should this proceed (what is the text typed
or pasted to load this data into an oracle database). I dont know what
a key is but I think you mean leaving out a column in any text file,
and I dont think that is going to work. This is because I assume all
those columns are functionally relevant. This is because I dont know
what RDMBS or programming or whatever you call this. I am trying to
learn unix and vi and sql in a very uncoordinated manner and so yes I
want your help but only if you think we can answer this question
within a week. Tentatively. THanks.
|
Request for Question Clarification by
mathtalk-ga
on
14 Sep 2004 21:02 PDT
If I were suggesting to leave out a column, I'd say so. Actually I
work all day with a database containing geneSymbol's, refseq's,
protAcc's, etc. So I can well appreciate that leaving the data fields
shown in the table would not be helpful.
A "key" is instead an index on one or more columns. In addition to
building the table that contains rows of data, it is often useful to
creating indexes that help queries to run faster (by speedier location
of the desired rows).
My point was that constructing the indexes can be (and probably should
be) done after the bulk loading of your tables. The "primary key" on
the first table I would make an exception for, because that forces the
records in that table to be uniquely identifiable by the kgID. Note
that this field is not allowed to be Null (missing), where all the
other fields (columns) of that table (kgXref) allow Null values.
I'm sure we can get there, but I can well believe that you thought it
would be much simpler in the beginning. (After we're done, it will
again seem simpler!)
In any case by posting the SQL "data definition language" for the
tables below as Comments, I'm not obligating you to pay for an Answer.
If we are able to completely solve your problem, then I'll post the
Answer and accept the list price payment you've offered. Fair enough?
regards, mathtalk-ga
|
Clarification of Question by
iterative-ga
on
15 Sep 2004 05:39 PDT
Yes. Incidentally it apears that the links I envision between these
tables are broken at one point. Lets move forward though, what you
describe sounds fine. I dont think query speed will be an issue.
|
Request for Question Clarification by
mathtalk-ga
on
15 Sep 2004 06:34 PDT
There are a few changes between syntax on some major versions of
Oracle, so just to expedite testing, what version of Oracle will you
be using? Also, will you be creating the tables in your own schema or
in another "user" schema? Your login to the database will need to
have certain privileges to create tables in either case.
Finally, are there two versions of the database at hand, a "test" and
a "production" platform? Or only one?
regards, mathtalk-ga
|
Clarification of Question by
iterative-ga
on
15 Sep 2004 07:41 PDT
Ocracle 9i, there is only one platform/version of the database, I have
create priviledges, I am creating tables in my own schema. Whatever
that is.
|
Request for Question Clarification by
mathtalk-ga
on
15 Sep 2004 18:43 PDT
Hi, iterative-ga:
I just wanted to make sure you've seen the "data definition" SQL
posted below as a Comment for the first table.
regards,
mathtalk-ga
|
Clarification of Question by
iterative-ga
on
15 Sep 2004 19:56 PDT
teh table is created. I make txt file called control_xgRef containing
your sketch and then invoke call the loader? I can do this for hours
tomorrow, but I dont now know how to call the loader.
|
Request for Question Clarification by
mathtalk-ga
on
15 Sep 2004 20:28 PDT
Hi, iterative-ga:
Yes, the general idea is to call SQL*Loader as I mentioned earlier, in
a similiar way to how you run SQL*Plus.
At the Unix prompt, with the current directory being where you have
the data file and the control file, you would type:
> sqlldr <username>@<dblink> control=control_xgRef
if that's the name you gave the control file. The name of the table
and the data file are going to be read from the control file. Unix
and the Oracle database itself are picky about capitalization, but the
SQL*Loader utility itself is pretty much case insensitive.
However I still want to know whether the kgXref.txt file contains any
extra fields (like the row labels you mentioned) that are _not_ to be
loaded into the table. Please clarify that point before we proceed,
so we can make any adjustments necessary to the layout of the control
file.
regards, mathtalk-ga
|
Clarification of Question by
iterative-ga
on
16 Sep 2004 05:42 PDT
Sorry about that. keeping control file and data separate. No there are
no rows that need to be skipped, those headers do not exist.
|
Request for Question Clarification by
mathtalk-ga
on
16 Sep 2004 06:23 PDT
Okay, I hadn't thought about column headers, but that's good to know.
What I was worried about was that maybe each row in the file had a
"counter" field at the beginning, like a row number, that you don't
want to include in the loaded data.
Anyway, give it a whirl. The SQL*Loader will produce several output
files, and I think it may "append" to the end of existing files when
it runs, so you may need to stay on top of checking the output files
and either deleting them or renaming them to something else each time
you run the program.
One output file is a log of what the run did, e.g. what control file
it opens and how many data records it found to load, etc.
Another output file is a "rejects" pile, the rows it could not load
for some reason. This is actually a very useful feature in many
cases. Someone may give you a datafile with some messed up records,
e.g. a row split across two lines, etc. In your case the way the
table is defined is so flexible (everything is character data) that
you could only have a conflict on the primary key constraint (not
because a messed up line in the file has a string where the table
expects an integer or date, for example).
So be sure each time run to take a look at the output files and let me
know what you get. (With a bit of editing in "vi" the "rejects" file
can be recycled back into an input file.)
Here are some SQL commands that may be useful to you in trying to
practice the load. First of all, instead of DELETE'ing each row in
the table, clearing things out is done more efficiently by:
SQL> TRUNCATE TABLE kgXref;
Also, here's the ever handy way to see how many rows the database
thinks are in the table:
SQL> SELECT COUNT(*) FROM kgXref;
Good luck with the experiment! I suggest keeping all the files in a
single directory from which you run SQL*Loader. If it complains about
not being able to find the control file or the data file, you might
need to qualify the paths. I can't think of anything else that might
go wrong, but data has a surprising way of finding the Achilles heel
when supplied in sufficient quantities.
regards, mathtalk-ga
|
Clarification of Question by
iterative-ga
on
16 Sep 2004 10:39 PDT
This is murder. Things do not work easily. That said, it seams loaded
SQL*Loader: Release 9.2.0.4.0 - Production on Thu Sep 16 13:29:31 2004
Copyright (c) 1982, 2002, Oracle Corporation. All rights reserved.
Control File: control_kgXref.ctl
Data File: kgXref.txt
Bad File: kgXref.bad
Discard File: none specified
(Allow all discards)
Number to load: ALL
Number to skip: 0
Errors allowed: 50
Bind array: 64 rows, maximum of 256000 bytes
Continuation: none specified
Path used: Conventional
Table KGXREF, loaded from every logical record.
Insert option in effect for this table: INSERT
TRAILING NULLCOLS option in effect
Column Name Position Len Term Encl Datatype
------------------------------ ---------- ----- ---- ---- ---------------------
KGID FIRST * WHT CHARACTER
MRNA NEXT * WHT CHARACTER
SPID NEXT * WHT CHARACTER
SPDISPLAYID NEXT * WHT CHARACTER
GENESYMBOL NEXT * WHT CHARACTER
REFSEQ NEXT * WHT CHARACTER
PROTACC NEXT * WHT CHARACTER
DESCRIPTION NEXT * WHT CHARACTER
Table KGXREF:
42026 Rows successfully loaded.
0 Rows not loaded due to data errors.
0 Rows not loaded because all WHEN clauses were failed.
0 Rows not loaded because all fields were null.
Space allocated for bind array: 132096 bytes(64 rows)
Read buffer bytes: 1048576
Total logical records skipped: 0
Total logical records read: 42026
Total logical records rejected: 0
Total logical records discarded: 0
Run began on Thu Sep 16 13:29:31 2004
Run ended on Thu Sep 16 13:30:01 2004
Elapsed time was: 00:00:30.08
CPU time was: 00:00:00.36
why I cant paste anytime in any window is beyond me, but I guess this is progress.
|
Request for Question Clarification by
mathtalk-ga
on
16 Sep 2004 13:43 PDT
Excellent! I suggest you also look in the *.bad file, or just confirm
it's empty, and do a select count(*) on the table to confirm the
number of rows that the database thinks are there matches what
SQL*Loader thinks it inserted.
I can post the SQL for building the extra "single column" indexes on
the kgXref table, but it looks like maybe you've changed the
capitalization of the table name and field names. All uppercase is
certainly easier to remember, but if you confirm this the design, then
I'll modify my own SQL statements going forward.
Next stop: creating and loading knownGene, which will introduce some
non-string datatypes.
regards, mathtalk-ga
|
Clarification of Question by
iterative-ga
on
17 Sep 2004 05:55 PDT
I have to figure out what this means. I am doing this to query the database.
In response to requests from Genome Browser users, we have set up a MySQL
database for public access at genome-mysql.cse.ucsc.edu. This new server
allows MySQL access to the same set of data currently available on our
public Genome Browser site. The data are synchronized daily with the main
databases on http://genome.ucsc.edu.
To connect to the database, you must use a computer on which the MySQL
client libraries have been installed. The UCSC server is running MySQL
4.0.20; for best results, we recommend you use clients of the same revision.
The latest MySQL clients may be downloaded from
http://dev.mysql.com/downloads/mysql/4.0.html. Connect to the MySql server
using the command "mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A".
The -A flag is optional but is recommended for speed.
Once connected to the database, you may use a wide range of MySQL commands
to query the database. As a courtesy to others, please observe the following
guidelines when using the database:
1. Avoid excessive or heavy queries that may impact the server performance.
Inappropriate query use will result in a restriction of access. If you plan
to execute a query that you think may be excessive, contact UCSC first to
avoid the possibility of having your access blocked.
2. Bot access and excessive program-driven use are not permitted.
3. Attachments by local mirror sites are prohibited.
If you prefer a more-structured graphical access to the UCSC database
tables, use the Table Browser at http://genome.ucsc.edu/cgi-bin/hgText.
System problems should be reported to genome-www@soe.ucsc.edu. Send
questions regarding the database contents or queries to genome@soe.ucsc.edu.
-Donna
-----------------------------------
Donna Karolchik
UCSC Genome Bioinformatics Group
http://genome.ucsc.edu
At this point we procede, but does it look to you as though I should
just move on to describing the query I had in mind, which I would love
you to help me with? I dont know the value of the "keys" you are
talking about adding. Also, I had no intention of changing any names
of anything. I think everything should be as it is at UCSC, but I dont
see what I changed.
|
Request for Question Clarification by
mathtalk-ga
on
17 Sep 2004 09:07 PDT
Yes, go ahead and describe the query you had in mind. That would help
to find out what keys/indexes would be useful to make the query run
quickly.
regards, mathtalk-ga
|
Clarification of Question by
iterative-ga
on
17 Sep 2004 09:56 PDT
Right. It convoluted and may be more clear at the browser. I want to
make a list of all the groups of genes formed by chains. A chain is a
"link" to another locus based on homology. They are repeats in a sense
but they have been filtered. Im sure they are of biological
significance, these groups. I want to prove that after making the
groups. The theory is that the presence of one rna species regulates
the expression of another.
For each gene, make list of all the chains it contains, exluding those
that are EXONIC. I want to include 3 and 5' UTR +2kb beyond the listed
boundary. For each of these chains, filter out those that do not point
to another gene (same boundary, no exonic points). The group is the
collection of kgXref rows, mainly name and description. This is the
master that I would then hope to able to filter by chain score,
strand, size, and persistance. persistance is some measure of a chain
representing how many chains share its query location. Im sure this is
gobldgook, or just an insane query, but it will make more sense at the
browser.
http://www.genome.ucsc.edu/cgi-bin/hgGateway?org=Human&db=hg16&hgsid=35845084
pick: human july 2003 chr1:56,374,002-56,374,561
then on that page set the last option on the page "self chain" to
full. The bars are the chains, colored by the chromosome they point
to.
Thanks for all your thoughts on this one Mathtalk.
iterative
|