Google Answers Logo
View Question
 
Q: SQL HELP to mathtalk-ga ( No Answer,   1 Comment )
Question  
Subject: SQL HELP to mathtalk-ga
Category: Computers > Programming
Asked by: iterative-ga
List Price: $78.00
Posted: 21 Sep 2004 16:46 PDT
Expires: 27 Sep 2004 13:09 PDT
Question ID: 404413
I cant find another way to update you, and trim the answer.

The idea is that for every row in knownGene, take the coordinates of
the gene (chrom, cdsStart, cdsEnd). Go to table chrN_chainSelf where
N=chrom_knownGene, return all chains query coordinates (qName, qStart,
qEnd) for chains (rows) that overlap the gene (using tStart, tEnd to
see overlap with cdsStart, cdsEnd from knownGene). For each, take its
query coordinates (qName, qStart, qEnd) and find what gene it falls in
on knownGene by checking all teh query coordinates. Return #name for
each knowngene that is found and get matching english name and
description from kgXref (#name_knownGene is equivilent to
#kgID_kgXref). MySQL table definitions follow. Tables must be loaded
into Oracle9i, then keys must be created.

&&&

CREATE TABLE kgXref (
  kgID varchar(40) NOT NULL default '',
  mRNA varchar(40) default NULL,
  spID varchar(40) default NULL,
  spDisplayID varchar(40) default NULL,
  geneSymbol varchar(40) default NULL,
  refseq varchar(40) default NULL,
  protAcc varchar(40) default NULL,
  description varchar(255) default NULL,
  PRIMARY KEY  (kgID),
  KEY mRNA (mRNA),
  KEY spID (spID),
  KEY spDisplayID (spDisplayID),
  KEY geneSymbol (geneSymbol),
  KEY refseq (refseq),
  KEY protAcc (protAcc)
) TYPE=MyISAM;


Genome Browser Database Description for kgXref

Fields of kgXref: 
name SQL type text value histogram 
kgID varchar(40)  
mRNA varchar(40)  
spID varchar(40)  
spDisplayID varchar(40)  
geneSymbol varchar(40)  
refseq varchar(40)  
protAcc varchar(40)  
description varchar(255)  
 

Table kgXref has 42026 rows total.
Example rows of table kgXref (not necessarily from current position!):


#kgID	mRNA	spID	spDisplayID	geneSymbol	refseq	protAcc	description	
AB000099	AB000099	P56555	DSR4_HUMAN	DSCR4	NM_005867	NP_005858	Down
syndrome critical region gene 4
AB000114	AB000114	Q99983	OMD_HUMAN	OMD	NM_005014	NP_005005	osteomodulin	
AB000115	AB000115	Q99984	Q99984	C1orf29	NM_006820	NP_006811	mRNA
expressed in osteoblast.


&&&

CREATE TABLE knownGene (
  name varchar(255) NOT NULL default '',
  chrom varchar(255) NOT NULL default '',
  strand char(1) NOT NULL default '',
  txStart int(10) unsigned NOT NULL default '0',
  txEnd int(10) unsigned NOT NULL default '0',
  cdsStart int(10) unsigned NOT NULL default '0',
  cdsEnd int(10) unsigned NOT NULL default '0',
  exonCount int(10) unsigned NOT NULL default '0',
  exonStarts longblob NOT NULL,
  exonEnds longblob NOT NULL,
  proteinID varchar(40) NOT NULL default '',
  alignID varchar(8) NOT NULL default '',
  KEY name (name(10)),
  KEY chrom (chrom(12),txStart),
  KEY chrom_2 (chrom(12),txEnd),
  KEY protein (proteinID(12)),
  KEY align (alignID)
) TYPE=MyISAM;


Genome Browser Database Description for knownGene

Fields of knownGene: 
name SQL type text value histogram 
name varchar(255)  
chrom varchar(255)  
strand char(1)  
txStart int(10) unsigned   
txEnd int(10) unsigned   
cdsStart int(10) unsigned   
cdsEnd int(10) unsigned   
exonCount int(10) unsigned   
exonStarts longblob   
exonEnds longblob   
proteinID varchar(40)  
alignID varchar(8)  
 

Table knownGene has 43232 rows total.
Example rows of table knownGene (not necessarily from current position!):


#name	chrom	strand	txStart	txEnd	cdsStart	cdsEnd	exonCount	exonStarts	exonEnds	proteinID	alignID
BC063682	chr1	-	4224	7502	4558	7173	7	4224,4832,5658,6469,6719,7095,7468,	4692,4901,5810,6631,6918,7231,7502,	AAH63682	2425
AL137714	chr1	-	4266	19346	7413	19346	8	4266,4832,6720,7095,7777,8130,14600,19183,	4692,6628,6918,7605,7924,8229,14754,19346,	Q9NSV7	41739
BC015400	chr1	-	4268	5808	4558	5808	3	4268,4832,5658,	4692,4901,5808,	Q96BN3	32430	


&&&

CREATE TABLE chrX_chainSelf (
  bin smallint(5) unsigned NOT NULL default '0',
  score double NOT NULL default '0',
  tName varchar(255) NOT NULL default '',
  tSize int(10) unsigned NOT NULL default '0',
  tStart int(10) unsigned NOT NULL default '0',
  tEnd int(10) unsigned NOT NULL default '0',
  qName varchar(255) NOT NULL default '',
  qSize int(10) unsigned NOT NULL default '0',
  qStrand char(1) NOT NULL default '',
  qStart int(10) unsigned NOT NULL default '0',
  qEnd int(10) unsigned NOT NULL default '0',
  id int(10) unsigned NOT NULL default '0',
  KEY bin (bin),
  KEY tStart (tStart),
  KEY tEnd (tEnd),
  KEY id (id)
) TYPE=MyISAM;


Genome Browser Database Description for chrN_chainSelf

Fields of chrN_chainSelf: 
name SQL type 
bin smallint(5) unsigned 
score double 
tName varchar(255) 
tSize int(10) unsigned 
tStart int(10) unsigned 
tEnd int(10) unsigned 
qName varchar(255) 
qSize int(10) unsigned 
qStrand char(1) 
qStart int(10) unsigned 
qEnd int(10) unsigned 
id int(10) unsigned 
 

Table chrN_chainSelf has 34253375 rows total.
Example rows of table chrN_chainSelf (not necessarily from current position!):


#bin	score	tName	tSize	tStart	tEnd	qName	qSize	qStrand	qStart	qEnd	id	
585	63520	chr10	135037215	14000	16578	chr7	158545518	-	96339947	96343245	204969	
585	63670	chr10	135037215	14000	16578	chr7	158545518	+	62286685	62289986	204161	
585	81668	chr10	135037215	14000	17239	chr7	158545518	+	62827826	62831922	133728	

&&&

_keys will be created after bulk loading of the data
_First create the table in sql:

CREATE TABLE kgXref
(
  kgID         VARCHAR2(40)  not Null,
  mRNA         VARCHAR2(40)      Null,
  spID         VARCHAR2(40)      Null,
  spDisplayID  VARCHAR2(40)      Null,
  geneSymbol   VARCHAR2(40)      Null,
  refseq       VARCHAR2(40)      Null,
  protAcc      VARCHAR2(40)      Null,
  description  VARCHAR2(255)     Null,
  CONSTRAINT pk_kgXref PRIMARY KEY  (kgID)
) ;

_then call sqlldr:

> sqlldr <username>@<dblink> control=control_xgRef

when control_xgRef.ctl contains the following:

load data
infile 'kgXref.txt'
into table kgXref
insert fields terminated by x'09' trailing nullcols
(kgID         CHAR,
 mRNA         CHAR,
 spID         CHAR,
 spDisplayID  CHAR,
 geneSymbol   CHAR,
 refseq       CHAR,
 protAcc      CHAR,
 description  CHAR )

Request for Question Clarification by mathtalk-ga on 21 Sep 2004 18:49 PDT
Hi, iterative-ga:

You may have had some difficulty accessing the site today.  As a
researcher, I know that certain functions here were not working off
and on for me.

To proceed, here's another draft table definition:

CREATE TABLE knownGene (
  name        VARCHAR2(255) not Null,
  chrom       VARCHAR2(255) not Null,
  strand      CHAR(1)       not Null,
  txStart     NUMBER(10)    not Null,
  txEnd       NUMBER(10)    not Null,
  cdsStart    NUMBER(10)    not Null,
  cdsEnd      NUMBER(10)    not Null,
  exonCount   NUMBER(10)    not Null,
  exonStarts  BLOB          not Null,
  exonEnds    BLOB          not Null,
  proteinID   VARCHAR2(40)  not Null,
  alignID     VARCHAR2(8)   not Null
);

I'm not sure what you've got to put into those "blob" columns.  Maybe
an explanation of the contents would be in order before we continue.

regards, mathtalk-ga

Request for Question Clarification by mathtalk-ga on 22 Sep 2004 06:54 PDT
Just to further the "blob" discussion:

It looks like you have strings of numbers separated by commas to put
in those columns.  If you never want to do anything with these values,
either BLOB or CLOB (character large "object") will have the effect of
putting anything more than 4000 characters in a location physically
removed from the corresponding "rows" of the table.  However if you
have less than 4000 characters in every case, you may want to simply
use VARCHAR(4000).

If you need to actually do something with these values, the bridge
will have to be crossed at some point of getting them into a usable
form.  These would be consider "repeating groups" in a discussion of
normalized (or non-normalized) database design.

regards, mathtalk-ga

Clarification of Question by iterative-ga on 22 Sep 2004 11:25 PDT
I think these numbers, used to define intron and exon boundaries, will
be important almost immediately to mask (ignore) the repeats that fall
in exons. I will try to find out what datatype they should be, leaning
to VARCHAR. Max number of exons is <100 x <10 digits per coordinate so
VARCHAR(4000) seems fine.

Request for Question Clarification by mathtalk-ga on 23 Sep 2004 21:36 PDT
Okay, so let's use VARCHAR2(4000) in place of the BLOB's above for the
table definition of knownGene.  Here's the corresponding table
definition template for each of the "chromosome" tables (replacing X
by the appropriate chromosome identifier):

CREATE TABLE chrX_chainSelf (
  bin            NUMBER(5)     not Null,
  score          FLOAT         not Null,
  tName          VARCHAR2(255) not Null,
  tSize          NUMBER(10)    not Null,
  tStart         NUMBER(10)    not Null,
  tEnd           NUMBER(10)    not Null,
  qName          VARCHAR2(255) not Null,
  qSize          NUMBER(10)    not Null,
  qStrand        CHAR(1)       not Null,
  qStart         NUMBER(10)    not Null,
  qEnd           NUMBER(10)    not Null,
  id             NUMBER(10)    not Null
);

Here FLOAT is actually a high-precision (38 decimal digit)
floating-point format, but I see no need to reduce precision to some
preset level.

Also I've ignored the "unsigned" property of some numeric fields.  If
this is a problem for reasons unanticipated at the present, we can add
a CHECK constraint to columns in order to prevent negative values.

regards, mathtalk-ga

Request for Question Clarification by mathtalk-ga on 24 Sep 2004 06:04 PDT
To modify the control file used to load the first table, most of the
changes will be evident.  You will need to revise the name of the text
file to load, the table to be loaded, and the column names.

Also, we now have a wider variety of datatypes to work with, as the
first table contained only fields of character data.

The VARCHAR2 and CHAR columns for the new tables will again be
described to SQL*Loader using CHAR as the datatype description.  The
numeric columns will require one of these descriptions (depending on
whether the column is integer or floating point):

INTEGER EXTERNAL
FLOAT EXTERNAL

Here EXTERNAL has the sense of meaning a human-readable character
format for numbers, as opposed to an "internal" binary format.

Your example records illustrate that the text files contain only
human-readable data.

regards, mathtalk-ga

Clarification of Question by iterative-ga on 24 Sep 2004 07:38 PDT
"Your example records illustrate that the text files contain only
human-readable data" I was very much under the impression that all of
this data would be readable by programs parsers etc. Is that wrong?
I have failed to load the data continuously but have learned some
stuff in the process.
SQL> describe knowngene
 Name                                      Null?    Type
 ----------------------------------------- -------- ----------------------------
 NAME                                      NOT NULL VARCHAR2(255)
 CHROM                                     NOT NULL VARCHAR2(255)
 STRAND                                    NOT NULL CHAR(1)
 TXSTART                                   NOT NULL NUMBER(10)
 TXEND                                     NOT NULL NUMBER(10)
 CDSSTART                                  NOT NULL NUMBER(10)
 CDSEND                                    NOT NULL NUMBER(10)
 EXONCOUNT                                 NOT NULL NUMBER(10)
 EXONSTARTS                                NOT NULL VARCHAR2(4000)
 EXONENDS                                  NOT NULL VARCHAR2(4000)
 PROTEINID                                 NOT NULL VARCHAR2(40)
 ALIGNID                                   NOT NULL VARCHAR2(8)

with BLOB or VARCHAR2(4000) I get 50 of these and then the loader
quits, although there are not 4000 characters there.
Record 186: Rejected - Error on table KNOWNGENE, column EXONSTARTS.
Field in data file exceeds maximum length. This is the loader:
load data
infile 'knownGene.txt'
into table knownGene
insert fields terminated by x'09' trailing nullcols
(name         CHAR,
 chrom        CHAR,
 strand       CHAR,
 txStart      CHAR,
 txEnd        CHAR,
 cdsStart     CHAR,
 cdsEnd       CHAR,
 exonCount    CHAR,
 exonStarts   CHAR, 
 exonEnds     CHAR, 
 proteinID    CHAR, 
 alignID      CHAR)

Request for Question Clarification by mathtalk-ga on 24 Sep 2004 09:00 PDT
Please change the control file for knownGene to reflect that the columns:

  txStart     NUMBER(10)
  txEnd       NUMBER(10)
  cdsStart    NUMBER(10)
  cdsEnd      NUMBER(10)
  exonCount   NUMBER(10)

are of INTEGER EXTERNAL type as far as SQL*Loader is concerned.  If
the numbers were stored in the input file as native binary format, you
would not be able to recognize the values.  "Human readable" means
that you can see the numbers given as ASCII text rather than
inscrutable machine language versions.

regards, mathtalk-ga

Clarification of Question by iterative-ga on 24 Sep 2004 12:25 PDT
that change didnt look like it would effect the problem I am having,
and it did not. The logfile looks the same. the VARCHAR2(4000) nor
BLOB are big enough to hold something like:
5871398,5875839,5876000,5876221,5879400,5879999,5881380,5881745,5882514,5890710,5891099,5893553,5894122,5894705,5895129,5895371,5896177,5897651,5898104,5898443,5899809,5901236,5903733,5904323,5904835,5906122,5906333,5911733,5912019,5913428,5913629,5915817,5916270,5918459,5918851,5920637,5922017,5924265,5925204,5928941,5937755,5949550,
from the knownGene.bad file. 

The chainSelf worked. I am having trouble loading mulitple tables with
multipe files trying

load data
infile 'chrY_chainSelf.txt'
infile 'chrM_chainSelf.txt'
infile 'chr1_chainSelf.txt'
infile 'chr2_chainSelf.txt'
infile 'chr3_chainSelf.txt'
infile 'chr4_chainSelf.txt'
infile 'chr5_chainSelf.txt'
infile 'chr6_chainSelf.txt'
infile 'chr7_chainSelf.txt'
infile 'chr8_chainSelf.txt'
infile 'chr9_chainSelf.txt'
infile 'chr10_chainSelf.txt'
infile 'chr11_chainSelf.txt'
infile 'chr12_chainSelf.txt'
infile 'chr13_chainSelf.txt'
infile 'chr14_chainSelf.txt'
infile 'chr15_chainSelf.txt'
infile 'chr16_chainSelf.txt'
infile 'chr17_chainSelf.txt'
infile 'chr18_chainSelf.txt'
infile 'chr19_chainSelf.txt'
infile 'chr20_chainSelf.txt'
infile 'chr21_chainSelf.txt'
infile 'chr22_chainSelf.txt'

into table chrY_chainSelf
WHEN tName = 'chrY'
insert fields terminated by x'09' trailing nullcols  #####WHERE DO I PUT THIS LINE?
(bin         CHAR,
 score       FLOAT EXTERNAL,
 tName       CHAR,
 tSize       CHAR,
 tStart      CHAR,
 tEnd        CHAR,
 qName       CHAR,
 qSize       CHAR,
 qStrand     CHAR,
 qStart      CHAR, 
 qEnd        CHAR, 
 id          CHAR)

I hope your absence of and answer on my last question means 'dont worry about it'

Thanks

Request for Question Clarification by mathtalk-ga on 24 Sep 2004 13:10 PDT
I'm not sure what "the last question" refers to.  I did try to address
the phrase "human readable" and clear up any confusion over that
wording.  In particular it doesn't mean a computer program is unable
to parse your data; rather it means that it is not in a form (binary)
which _only_ a computer program could love!

I will try to reproduce your problem with loading the knownGene
records if you can post them on the Internet, e.g. provide a URL from
which I can download the whole file or a sample thereof.  If in fact
you obtained these by extracting from a public database, you might be
able instead to explain how to get the file for myself.

regards, mathtalk-ga

Clarification of Question by iterative-ga on 24 Sep 2004 13:27 PDT
im going to pare this down and repost like i did if you do not object. 

I can get the file from here. port 21, userID is anonymous, password
is iterative@jhmi.edu

HGdownload.cse.ucsc.edu/goldenpath/HG16/database/knowngene.gz
Answer  
There is no answer at this time.

Comments  
Subject: Re: SQL HELP to mathtalk-ga
From: mathtalk-ga on 24 Sep 2004 06:08 PDT
 
Here's a set of example control files for SQL*Loader:

[Oracle SQL*Loader version 9.2]
http://www.psoug.org/reference/sqlloader.html

One of the examples illustrates the use of INTEGER EXTERNAL.

regards, mathtalk-ga

Important Disclaimer: Answers and comments provided on Google Answers are general information, and are not intended to substitute for informed professional medical, psychiatric, psychological, tax, legal, investment, accounting, or other professional advice. Google does not endorse, and expressly disclaims liability for any product, manufacturer, distributor, service or service provider mentioned or any opinion expressed in answers or comments. Please read carefully the Google Answers Terms of Service.

If you feel that you have found inappropriate content, please let us know by emailing us at answers-support@google.com with the question ID listed above. Thank you.
Search Google Answers for
Google Answers  


Google Home - Answers FAQ - Terms of Service - Privacy Policy