Problem: two columns in one table prevent loading.
The goal: for every row in knownGene, take the coordinates of the gene
(chrom, cdsStart, cdsEnd). Go to table chrN_chainSelf where
N=knownGene.chrom, return all chains query coordinates (qName, qStart,
qEnd) for chains (rows) that overlap the gene (using tStart, tEnd to
see overlap with cdsStart, cdsEnd from knownGene). For each, take its
query coordinates (qName, qStart, qEnd) and find what gene it falls in
on knownGene by checking all the query coordinates. Return #name for
each knowngene that is found and get matching english name and
description from kgXref (knownGene.#name is equivilent to
kgXref.#kgID). MySQL table definitions follow. Tables must be loaded
into Oracle9i, then keys must be created. Two of Three table types
created and loaded successfully:
CREATE TABLE kgXref (
kgID VARCHAR2(40) not Null,
mRNA VARCHAR2(40) Null,
spID VARCHAR2(40) Null,
spDisplayID VARCHAR2(40) Null,
geneSymbol VARCHAR2(40) Null,
refseq VARCHAR2(40) Null,
protAcc VARCHAR2(40) Null,
description VARCHAR2(255) Null,
CONSTRAINT pk_kgXref PRIMARY KEY (kgID)
) ;
CREATE TABLE chrX_chainSelf (
bin NUMBER(5) not Null,
score FLOAT not Null,
tName VARCHAR2(255) not Null,
tSize NUMBER(10) not Null,
tStart NUMBER(10) not Null,
tEnd NUMBER(10) not Null,
qName VARCHAR2(255) not Null,
qSize NUMBER(10) not Null,
qStrand CHAR(1) not Null,
qStart NUMBER(10) not Null,
qEnd NUMBER(10) not Null,
id NUMBER(10) not Null
);
&&& The third table is problematic. mysql definition:
CREATE TABLE knownGene (
name varchar(255) NOT NULL default '',
chrom varchar(255) NOT NULL default '',
strand char(1) NOT NULL default '',
txStart int(10) unsigned NOT NULL default '0',
txEnd int(10) unsigned NOT NULL default '0',
cdsStart int(10) unsigned NOT NULL default '0',
cdsEnd int(10) unsigned NOT NULL default '0',
exonCount int(10) unsigned NOT NULL default '0',
exonStarts longblob NOT NULL,
exonEnds longblob NOT NULL,
proteinID varchar(40) NOT NULL default '',
alignID varchar(8) NOT NULL default '',
KEY name (name(10)),
KEY chrom (chrom(12),txStart),
KEY chrom_2 (chrom(12),txEnd),
KEY protein (proteinID(12)),
KEY align (alignID)
) TYPE=MyISAM;
example row, tab delimited:
BC040520 chr1 - 5632418 5762079 5649110 5755895 33 5632418,5633495,5633943,5634707,5635978,5636635,5637345,5642857,5644076,5644479,5645559,5645
991,5646698,5648951,5649719,5656891,5660473,5674222,5674897,5675237,5676720,5678757,5697254,
5702752,5716709,5717675,5722305,5731399,5736904,5738692,5747875,5755760,5761849,
5633011,5633639,5634123,5634879,5636064,5636721,5637502,5642941,5644263,5644706,5645684,5646
129,5646904,5649238,5649845,5657072,5660634,5674410,5675089,5675386,5676828,5678819,5697393,
5702935,5716836,5717857,5722442,5731555,5736969,5738865,5748019,5755933,5762079,
Q8IWC0 23180
&&& table oracle9i attempt:
CREATE TABLE knownGene (
name VARCHAR2(255) not Null,
chrom VARCHAR2(255) not Null,
strand CHAR(1) not Null,
txStart NUMBER(10) not Null,
txEnd NUMBER(10) not Null,
cdsStart NUMBER(10) not Null,
cdsEnd NUMBER(10) not Null,
exonCount NUMBER(10) not Null,
exonStarts VARCHAR2(4000) not Null,
exonEnds VARCHAR2(4000) not Null,
proteinID VARCHAR2(40) not Null,
alignID VARCHAR2(8) not Null
);
loading this table results in a .bad file containing the above example row
using this load file. If the VARCHAR2(4000) both change to BLOB, the same happens.
load data
infile 'knownGene.txt'
into table knownGene
insert fields terminated by x'09' trailing nullcols
(name CHAR,
chrom CHAR,
strand CHAR,
txStart INTEGER EXTERNAL,
txEnd INTEGER EXTERNAL,
cdsStart INTEGER EXTERNAL,
cdsEnd INTEGER EXTERNAL,
exonCount INTEGER EXTERNAL,
exonStarts CHAR,
exonEnds CHAR,
proteinID CHAR,
alignID CHAR)
(all the fields are human readable, I dont know why we are inclined to
make only those 5 INTEGER EXTERNAL)
&&& The file knownGene.gz is at
HGdownload.cse.ucsc.edu/goldenpath/hg16/database/knownGene.gz USERID:
anonymous PWD: anemail@address
Ideas:
1) these columns wont load because of the commas, replace with
something else? These types, exonStarts and exonEnds, contain the
information I will need to filter out all "hits" or chains that fall
into introns, so someone said I should put them in another table and
that this table definition is "stupid". Thanks for all your help.
iterative |