Google Answers: SQL HELP to mathtalk-ga

View Question

Q: SQL HELP to mathtalk-ga ( No Answer, 1 Comment )

Question

Subject: SQL HELP to mathtalk-ga
Category: Computers > Programming
Asked by: iterative-ga
List Price: $78.00

Posted: 21 Sep 2004 16:46 PDT
Expires: 27 Sep 2004 13:09 PDT
Question ID: 404413

I cant find another way to update you, and trim the answer. The idea is that for every row in knownGene, take the coordinates of the gene (chrom, cdsStart, cdsEnd). Go to table chrN_chainSelf where N=chrom_knownGene, return all chains query coordinates (qName, qStart, qEnd) for chains (rows) that overlap the gene (using tStart, tEnd to see overlap with cdsStart, cdsEnd from knownGene). For each, take its query coordinates (qName, qStart, qEnd) and find what gene it falls in on knownGene by checking all teh query coordinates. Return #name for each knowngene that is found and get matching english name and description from kgXref (#name_knownGene is equivilent to #kgID_kgXref). MySQL table definitions follow. Tables must be loaded into Oracle9i, then keys must be created. &&& CREATE TABLE kgXref ( kgID varchar(40) NOT NULL default '', mRNA varchar(40) default NULL, spID varchar(40) default NULL, spDisplayID varchar(40) default NULL, geneSymbol varchar(40) default NULL, refseq varchar(40) default NULL, protAcc varchar(40) default NULL, description varchar(255) default NULL, PRIMARY KEY (kgID), KEY mRNA (mRNA), KEY spID (spID), KEY spDisplayID (spDisplayID), KEY geneSymbol (geneSymbol), KEY refseq (refseq), KEY protAcc (protAcc) ) TYPE=MyISAM; Genome Browser Database Description for kgXref Fields of kgXref: name SQL type text value histogram kgID varchar(40) mRNA varchar(40) spID varchar(40) spDisplayID varchar(40) geneSymbol varchar(40) refseq varchar(40) protAcc varchar(40) description varchar(255) Table kgXref has 42026 rows total. Example rows of table kgXref (not necessarily from current position!): #kgID mRNA spID spDisplayID geneSymbol refseq protAcc description AB000099 AB000099 P56555 DSR4_HUMAN DSCR4 NM_005867 NP_005858 Down syndrome critical region gene 4 AB000114 AB000114 Q99983 OMD_HUMAN OMD NM_005014 NP_005005 osteomodulin AB000115 AB000115 Q99984 Q99984 C1orf29 NM_006820 NP_006811 mRNA expressed in osteoblast. &&& CREATE TABLE knownGene ( name varchar(255) NOT NULL default '', chrom varchar(255) NOT NULL default '', strand char(1) NOT NULL default '', txStart int(10) unsigned NOT NULL default '0', txEnd int(10) unsigned NOT NULL default '0', cdsStart int(10) unsigned NOT NULL default '0', cdsEnd int(10) unsigned NOT NULL default '0', exonCount int(10) unsigned NOT NULL default '0', exonStarts longblob NOT NULL, exonEnds longblob NOT NULL, proteinID varchar(40) NOT NULL default '', alignID varchar(8) NOT NULL default '', KEY name (name(10)), KEY chrom (chrom(12),txStart), KEY chrom_2 (chrom(12),txEnd), KEY protein (proteinID(12)), KEY align (alignID) ) TYPE=MyISAM; Genome Browser Database Description for knownGene Fields of knownGene: name SQL type text value histogram name varchar(255) chrom varchar(255) strand char(1) txStart int(10) unsigned txEnd int(10) unsigned cdsStart int(10) unsigned cdsEnd int(10) unsigned exonCount int(10) unsigned exonStarts longblob exonEnds longblob proteinID varchar(40) alignID varchar(8) Table knownGene has 43232 rows total. Example rows of table knownGene (not necessarily from current position!): #name chrom strand txStart txEnd cdsStart cdsEnd exonCount exonStarts exonEnds proteinID alignID BC063682 chr1 - 4224 7502 4558 7173 7 4224,4832,5658,6469,6719,7095,7468, 4692,4901,5810,6631,6918,7231,7502, AAH63682 2425 AL137714 chr1 - 4266 19346 7413 19346 8 4266,4832,6720,7095,7777,8130,14600,19183, 4692,6628,6918,7605,7924,8229,14754,19346, Q9NSV7 41739 BC015400 chr1 - 4268 5808 4558 5808 3 4268,4832,5658, 4692,4901,5808, Q96BN3 32430 &&& CREATE TABLE chrX_chainSelf ( bin smallint(5) unsigned NOT NULL default '0', score double NOT NULL default '0', tName varchar(255) NOT NULL default '', tSize int(10) unsigned NOT NULL default '0', tStart int(10) unsigned NOT NULL default '0', tEnd int(10) unsigned NOT NULL default '0', qName varchar(255) NOT NULL default '', qSize int(10) unsigned NOT NULL default '0', qStrand char(1) NOT NULL default '', qStart int(10) unsigned NOT NULL default '0', qEnd int(10) unsigned NOT NULL default '0', id int(10) unsigned NOT NULL default '0', KEY bin (bin), KEY tStart (tStart), KEY tEnd (tEnd), KEY id (id) ) TYPE=MyISAM; Genome Browser Database Description for chrN_chainSelf Fields of chrN_chainSelf: name SQL type bin smallint(5) unsigned score double tName varchar(255) tSize int(10) unsigned tStart int(10) unsigned tEnd int(10) unsigned qName varchar(255) qSize int(10) unsigned qStrand char(1) qStart int(10) unsigned qEnd int(10) unsigned id int(10) unsigned Table chrN_chainSelf has 34253375 rows total. Example rows of table chrN_chainSelf (not necessarily from current position!): #bin score tName tSize tStart tEnd qName qSize qStrand qStart qEnd id 585 63520 chr10 135037215 14000 16578 chr7 158545518 - 96339947 96343245 204969 585 63670 chr10 135037215 14000 16578 chr7 158545518 + 62286685 62289986 204161 585 81668 chr10 135037215 14000 17239 chr7 158545518 + 62827826 62831922 133728 &&& _keys will be created after bulk loading of the data _First create the table in sql: CREATE TABLE kgXref ( kgID VARCHAR2(40) not Null, mRNA VARCHAR2(40) Null, spID VARCHAR2(40) Null, spDisplayID VARCHAR2(40) Null, geneSymbol VARCHAR2(40) Null, refseq VARCHAR2(40) Null, protAcc VARCHAR2(40) Null, description VARCHAR2(255) Null, CONSTRAINT pk_kgXref PRIMARY KEY (kgID) ) ; _then call sqlldr: > sqlldr <username>@<dblink> control=control_xgRef when control_xgRef.ctl contains the following: load data infile 'kgXref.txt' into table kgXref insert fields terminated by x'09' trailing nullcols (kgID CHAR, mRNA CHAR, spID CHAR, spDisplayID CHAR, geneSymbol CHAR, refseq CHAR, protAcc CHAR, description CHAR )
Request for Question Clarification by mathtalk-ga on 21 Sep 2004 18:49 PDT Hi, iterative-ga: You may have had some difficulty accessing the site today. As a researcher, I know that certain functions here were not working off and on for me. To proceed, here's another draft table definition: CREATE TABLE knownGene ( name VARCHAR2(255) not Null, chrom VARCHAR2(255) not Null, strand CHAR(1) not Null, txStart NUMBER(10) not Null, txEnd NUMBER(10) not Null, cdsStart NUMBER(10) not Null, cdsEnd NUMBER(10) not Null, exonCount NUMBER(10) not Null, exonStarts BLOB not Null, exonEnds BLOB not Null, proteinID VARCHAR2(40) not Null, alignID VARCHAR2(8) not Null ); I'm not sure what you've got to put into those "blob" columns. Maybe an explanation of the contents would be in order before we continue. regards, mathtalk-ga
Request for Question Clarification by mathtalk-ga on 22 Sep 2004 06:54 PDT Just to further the "blob" discussion: It looks like you have strings of numbers separated by commas to put in those columns. If you never want to do anything with these values, either BLOB or CLOB (character large "object") will have the effect of putting anything more than 4000 characters in a location physically removed from the corresponding "rows" of the table. However if you have less than 4000 characters in every case, you may want to simply use VARCHAR(4000). If you need to actually do something with these values, the bridge will have to be crossed at some point of getting them into a usable form. These would be consider "repeating groups" in a discussion of normalized (or non-normalized) database design. regards, mathtalk-ga
Clarification of Question by iterative-ga on 22 Sep 2004 11:25 PDT I think these numbers, used to define intron and exon boundaries, will be important almost immediately to mask (ignore) the repeats that fall in exons. I will try to find out what datatype they should be, leaning to VARCHAR. Max number of exons is <100 x <10 digits per coordinate so VARCHAR(4000) seems fine.
Request for Question Clarification by mathtalk-ga on 23 Sep 2004 21:36 PDT Okay, so let's use VARCHAR2(4000) in place of the BLOB's above for the table definition of knownGene. Here's the corresponding table definition template for each of the "chromosome" tables (replacing X by the appropriate chromosome identifier): CREATE TABLE chrX_chainSelf ( bin NUMBER(5) not Null, score FLOAT not Null, tName VARCHAR2(255) not Null, tSize NUMBER(10) not Null, tStart NUMBER(10) not Null, tEnd NUMBER(10) not Null, qName VARCHAR2(255) not Null, qSize NUMBER(10) not Null, qStrand CHAR(1) not Null, qStart NUMBER(10) not Null, qEnd NUMBER(10) not Null, id NUMBER(10) not Null ); Here FLOAT is actually a high-precision (38 decimal digit) floating-point format, but I see no need to reduce precision to some preset level. Also I've ignored the "unsigned" property of some numeric fields. If this is a problem for reasons unanticipated at the present, we can add a CHECK constraint to columns in order to prevent negative values. regards, mathtalk-ga
Request for Question Clarification by mathtalk-ga on 24 Sep 2004 06:04 PDT To modify the control file used to load the first table, most of the changes will be evident. You will need to revise the name of the text file to load, the table to be loaded, and the column names. Also, we now have a wider variety of datatypes to work with, as the first table contained only fields of character data. The VARCHAR2 and CHAR columns for the new tables will again be described to SQL*Loader using CHAR as the datatype description. The numeric columns will require one of these descriptions (depending on whether the column is integer or floating point): INTEGER EXTERNAL FLOAT EXTERNAL Here EXTERNAL has the sense of meaning a human-readable character format for numbers, as opposed to an "internal" binary format. Your example records illustrate that the text files contain only human-readable data. regards, mathtalk-ga
Clarification of Question by iterative-ga on 24 Sep 2004 07:38 PDT "Your example records illustrate that the text files contain only human-readable data" I was very much under the impression that all of this data would be readable by programs parsers etc. Is that wrong? I have failed to load the data continuously but have learned some stuff in the process. SQL> describe knowngene Name Null? Type ----------------------------------------- -------- ---------------------------- NAME NOT NULL VARCHAR2(255) CHROM NOT NULL VARCHAR2(255) STRAND NOT NULL CHAR(1) TXSTART NOT NULL NUMBER(10) TXEND NOT NULL NUMBER(10) CDSSTART NOT NULL NUMBER(10) CDSEND NOT NULL NUMBER(10) EXONCOUNT NOT NULL NUMBER(10) EXONSTARTS NOT NULL VARCHAR2(4000) EXONENDS NOT NULL VARCHAR2(4000) PROTEINID NOT NULL VARCHAR2(40) ALIGNID NOT NULL VARCHAR2(8) with BLOB or VARCHAR2(4000) I get 50 of these and then the loader quits, although there are not 4000 characters there. Record 186: Rejected - Error on table KNOWNGENE, column EXONSTARTS. Field in data file exceeds maximum length. This is the loader: load data infile 'knownGene.txt' into table knownGene insert fields terminated by x'09' trailing nullcols (name CHAR, chrom CHAR, strand CHAR, txStart CHAR, txEnd CHAR, cdsStart CHAR, cdsEnd CHAR, exonCount CHAR, exonStarts CHAR, exonEnds CHAR, proteinID CHAR, alignID CHAR)
Request for Question Clarification by mathtalk-ga on 24 Sep 2004 09:00 PDT Please change the control file for knownGene to reflect that the columns: txStart NUMBER(10) txEnd NUMBER(10) cdsStart NUMBER(10) cdsEnd NUMBER(10) exonCount NUMBER(10) are of INTEGER EXTERNAL type as far as SQL*Loader is concerned. If the numbers were stored in the input file as native binary format, you would not be able to recognize the values. "Human readable" means that you can see the numbers given as ASCII text rather than inscrutable machine language versions. regards, mathtalk-ga
Clarification of Question by iterative-ga on 24 Sep 2004 12:25 PDT that change didnt look like it would effect the problem I am having, and it did not. The logfile looks the same. the VARCHAR2(4000) nor BLOB are big enough to hold something like: 5871398,5875839,5876000,5876221,5879400,5879999,5881380,5881745,5882514,5890710,5891099,5893553,5894122,5894705,5895129,5895371,5896177,5897651,5898104,5898443,5899809,5901236,5903733,5904323,5904835,5906122,5906333,5911733,5912019,5913428,5913629,5915817,5916270,5918459,5918851,5920637,5922017,5924265,5925204,5928941,5937755,5949550, from the knownGene.bad file. The chainSelf worked. I am having trouble loading mulitple tables with multipe files trying load data infile 'chrY_chainSelf.txt' infile 'chrM_chainSelf.txt' infile 'chr1_chainSelf.txt' infile 'chr2_chainSelf.txt' infile 'chr3_chainSelf.txt' infile 'chr4_chainSelf.txt' infile 'chr5_chainSelf.txt' infile 'chr6_chainSelf.txt' infile 'chr7_chainSelf.txt' infile 'chr8_chainSelf.txt' infile 'chr9_chainSelf.txt' infile 'chr10_chainSelf.txt' infile 'chr11_chainSelf.txt' infile 'chr12_chainSelf.txt' infile 'chr13_chainSelf.txt' infile 'chr14_chainSelf.txt' infile 'chr15_chainSelf.txt' infile 'chr16_chainSelf.txt' infile 'chr17_chainSelf.txt' infile 'chr18_chainSelf.txt' infile 'chr19_chainSelf.txt' infile 'chr20_chainSelf.txt' infile 'chr21_chainSelf.txt' infile 'chr22_chainSelf.txt' into table chrY_chainSelf WHEN tName = 'chrY' insert fields terminated by x'09' trailing nullcols #####WHERE DO I PUT THIS LINE? (bin CHAR, score FLOAT EXTERNAL, tName CHAR, tSize CHAR, tStart CHAR, tEnd CHAR, qName CHAR, qSize CHAR, qStrand CHAR, qStart CHAR, qEnd CHAR, id CHAR) I hope your absence of and answer on my last question means 'dont worry about it' Thanks
Request for Question Clarification by mathtalk-ga on 24 Sep 2004 13:10 PDT I'm not sure what "the last question" refers to. I did try to address the phrase "human readable" and clear up any confusion over that wording. In particular it doesn't mean a computer program is unable to parse your data; rather it means that it is not in a form (binary) which _only_ a computer program could love! I will try to reproduce your problem with loading the knownGene records if you can post them on the Internet, e.g. provide a URL from which I can download the whole file or a sample thereof. If in fact you obtained these by extracting from a public database, you might be able instead to explain how to get the file for myself. regards, mathtalk-ga
Clarification of Question by iterative-ga on 24 Sep 2004 13:27 PDT im going to pare this down and repost like i did if you do not object. I can get the file from here. port 21, userID is anonymous, password is iterative@jhmi.edu HGdownload.cse.ucsc.edu/goldenpath/HG16/database/knowngene.gz

Answer

There is no answer at this time.

Comments

Subject: Re: SQL HELP to mathtalk-ga
From: mathtalk-ga on 24 Sep 2004 06:08 PDT

Here's a set of example control files for SQL*Loader:

[Oracle SQL*Loader version 9.2]
http://www.psoug.org/reference/sqlloader.html

One of the examples illustrates the use of INTEGER EXTERNAL.

regards, mathtalk-ga

Important Disclaimer: Answers and comments provided on Google Answers are general information, and are not intended to substitute for informed professional medical, psychiatric, psychological, tax, legal, investment, accounting, or other professional advice. Google does not endorse, and expressly disclaims liability for any product, manufacturer, distributor, service or service provider mentioned or any opinion expressed in answers or comments. Please read carefully the Google Answers Terms of Service.

If you feel that you have found inappropriate content, please let us know by emailing us at answers-support@google.com with the question ID listed above. Thank you.

Search Google Answers for

Google Home - Answers FAQ - Terms of Service - Privacy Policy