I cant find another way to update you, and trim the answer.
The idea is that for every row in knownGene, take the coordinates of
the gene (chrom, cdsStart, cdsEnd). Go to table chrN_chainSelf where
N=chrom_knownGene, return all chains query coordinates (qName, qStart,
qEnd) for chains (rows) that overlap the gene (using tStart, tEnd to
see overlap with cdsStart, cdsEnd from knownGene). For each, take its
query coordinates (qName, qStart, qEnd) and find what gene it falls in
on knownGene by checking all teh query coordinates. Return #name for
each knowngene that is found and get matching english name and
description from kgXref (#name_knownGene is equivilent to
#kgID_kgXref). MySQL table definitions follow. Tables must be loaded
into Oracle9i, then keys must be created.
&&&
CREATE TABLE kgXref (
kgID varchar(40) NOT NULL default '',
mRNA varchar(40) default NULL,
spID varchar(40) default NULL,
spDisplayID varchar(40) default NULL,
geneSymbol varchar(40) default NULL,
refseq varchar(40) default NULL,
protAcc varchar(40) default NULL,
description varchar(255) default NULL,
PRIMARY KEY (kgID),
KEY mRNA (mRNA),
KEY spID (spID),
KEY spDisplayID (spDisplayID),
KEY geneSymbol (geneSymbol),
KEY refseq (refseq),
KEY protAcc (protAcc)
) TYPE=MyISAM;
Genome Browser Database Description for kgXref
Fields of kgXref:
name SQL type text value histogram
kgID varchar(40)
mRNA varchar(40)
spID varchar(40)
spDisplayID varchar(40)
geneSymbol varchar(40)
refseq varchar(40)
protAcc varchar(40)
description varchar(255)
Table kgXref has 42026 rows total.
Example rows of table kgXref (not necessarily from current position!):
#kgID mRNA spID spDisplayID geneSymbol refseq protAcc description
AB000099 AB000099 P56555 DSR4_HUMAN DSCR4 NM_005867 NP_005858 Down
syndrome critical region gene 4
AB000114 AB000114 Q99983 OMD_HUMAN OMD NM_005014 NP_005005 osteomodulin
AB000115 AB000115 Q99984 Q99984 C1orf29 NM_006820 NP_006811 mRNA
expressed in osteoblast.
&&&
CREATE TABLE knownGene (
name varchar(255) NOT NULL default '',
chrom varchar(255) NOT NULL default '',
strand char(1) NOT NULL default '',
txStart int(10) unsigned NOT NULL default '0',
txEnd int(10) unsigned NOT NULL default '0',
cdsStart int(10) unsigned NOT NULL default '0',
cdsEnd int(10) unsigned NOT NULL default '0',
exonCount int(10) unsigned NOT NULL default '0',
exonStarts longblob NOT NULL,
exonEnds longblob NOT NULL,
proteinID varchar(40) NOT NULL default '',
alignID varchar(8) NOT NULL default '',
KEY name (name(10)),
KEY chrom (chrom(12),txStart),
KEY chrom_2 (chrom(12),txEnd),
KEY protein (proteinID(12)),
KEY align (alignID)
) TYPE=MyISAM;
Genome Browser Database Description for knownGene
Fields of knownGene:
name SQL type text value histogram
name varchar(255)
chrom varchar(255)
strand char(1)
txStart int(10) unsigned
txEnd int(10) unsigned
cdsStart int(10) unsigned
cdsEnd int(10) unsigned
exonCount int(10) unsigned
exonStarts longblob
exonEnds longblob
proteinID varchar(40)
alignID varchar(8)
Table knownGene has 43232 rows total.
Example rows of table knownGene (not necessarily from current position!):
#name chrom strand txStart txEnd cdsStart cdsEnd exonCount exonStarts exonEnds proteinID alignID
BC063682 chr1 - 4224 7502 4558 7173 7 4224,4832,5658,6469,6719,7095,7468, 4692,4901,5810,6631,6918,7231,7502, AAH63682 2425
AL137714 chr1 - 4266 19346 7413 19346 8 4266,4832,6720,7095,7777,8130,14600,19183, 4692,6628,6918,7605,7924,8229,14754,19346, Q9NSV7 41739
BC015400 chr1 - 4268 5808 4558 5808 3 4268,4832,5658, 4692,4901,5808, Q96BN3 32430
&&&
CREATE TABLE chrX_chainSelf (
bin smallint(5) unsigned NOT NULL default '0',
score double NOT NULL default '0',
tName varchar(255) NOT NULL default '',
tSize int(10) unsigned NOT NULL default '0',
tStart int(10) unsigned NOT NULL default '0',
tEnd int(10) unsigned NOT NULL default '0',
qName varchar(255) NOT NULL default '',
qSize int(10) unsigned NOT NULL default '0',
qStrand char(1) NOT NULL default '',
qStart int(10) unsigned NOT NULL default '0',
qEnd int(10) unsigned NOT NULL default '0',
id int(10) unsigned NOT NULL default '0',
KEY bin (bin),
KEY tStart (tStart),
KEY tEnd (tEnd),
KEY id (id)
) TYPE=MyISAM;
Genome Browser Database Description for chrN_chainSelf
Fields of chrN_chainSelf:
name SQL type
bin smallint(5) unsigned
score double
tName varchar(255)
tSize int(10) unsigned
tStart int(10) unsigned
tEnd int(10) unsigned
qName varchar(255)
qSize int(10) unsigned
qStrand char(1)
qStart int(10) unsigned
qEnd int(10) unsigned
id int(10) unsigned
Table chrN_chainSelf has 34253375 rows total.
Example rows of table chrN_chainSelf (not necessarily from current position!):
#bin score tName tSize tStart tEnd qName qSize qStrand qStart qEnd id
585 63520 chr10 135037215 14000 16578 chr7 158545518 - 96339947 96343245 204969
585 63670 chr10 135037215 14000 16578 chr7 158545518 + 62286685 62289986 204161
585 81668 chr10 135037215 14000 17239 chr7 158545518 + 62827826 62831922 133728
&&&
_keys will be created after bulk loading of the data
_First create the table in sql:
CREATE TABLE kgXref
(
kgID VARCHAR2(40) not Null,
mRNA VARCHAR2(40) Null,
spID VARCHAR2(40) Null,
spDisplayID VARCHAR2(40) Null,
geneSymbol VARCHAR2(40) Null,
refseq VARCHAR2(40) Null,
protAcc VARCHAR2(40) Null,
description VARCHAR2(255) Null,
CONSTRAINT pk_kgXref PRIMARY KEY (kgID)
) ;
_then call sqlldr:
> sqlldr <username>@<dblink> control=control_xgRef
when control_xgRef.ctl contains the following:
load data
infile 'kgXref.txt'
into table kgXref
insert fields terminated by x'09' trailing nullcols
(kgID CHAR,
mRNA CHAR,
spID CHAR,
spDisplayID CHAR,
geneSymbol CHAR,
refseq CHAR,
protAcc CHAR,
description CHAR ) |
Clarification of Question by
iterative-ga
on
24 Sep 2004 12:25 PDT
that change didnt look like it would effect the problem I am having,
and it did not. The logfile looks the same. the VARCHAR2(4000) nor
BLOB are big enough to hold something like:
5871398,5875839,5876000,5876221,5879400,5879999,5881380,5881745,5882514,5890710,5891099,5893553,5894122,5894705,5895129,5895371,5896177,5897651,5898104,5898443,5899809,5901236,5903733,5904323,5904835,5906122,5906333,5911733,5912019,5913428,5913629,5915817,5916270,5918459,5918851,5920637,5922017,5924265,5925204,5928941,5937755,5949550,
from the knownGene.bad file.
The chainSelf worked. I am having trouble loading mulitple tables with
multipe files trying
load data
infile 'chrY_chainSelf.txt'
infile 'chrM_chainSelf.txt'
infile 'chr1_chainSelf.txt'
infile 'chr2_chainSelf.txt'
infile 'chr3_chainSelf.txt'
infile 'chr4_chainSelf.txt'
infile 'chr5_chainSelf.txt'
infile 'chr6_chainSelf.txt'
infile 'chr7_chainSelf.txt'
infile 'chr8_chainSelf.txt'
infile 'chr9_chainSelf.txt'
infile 'chr10_chainSelf.txt'
infile 'chr11_chainSelf.txt'
infile 'chr12_chainSelf.txt'
infile 'chr13_chainSelf.txt'
infile 'chr14_chainSelf.txt'
infile 'chr15_chainSelf.txt'
infile 'chr16_chainSelf.txt'
infile 'chr17_chainSelf.txt'
infile 'chr18_chainSelf.txt'
infile 'chr19_chainSelf.txt'
infile 'chr20_chainSelf.txt'
infile 'chr21_chainSelf.txt'
infile 'chr22_chainSelf.txt'
into table chrY_chainSelf
WHEN tName = 'chrY'
insert fields terminated by x'09' trailing nullcols #####WHERE DO I PUT THIS LINE?
(bin CHAR,
score FLOAT EXTERNAL,
tName CHAR,
tSize CHAR,
tStart CHAR,
tEnd CHAR,
qName CHAR,
qSize CHAR,
qStrand CHAR,
qStart CHAR,
qEnd CHAR,
id CHAR)
I hope your absence of and answer on my last question means 'dont worry about it'
Thanks
|