Google Answers Logo
View Question
 
Q: SQLplus for mathtalk ( No Answer,   1 Comment )
Question  
Subject: SQLplus for mathtalk
Category: Computers > Programming
Asked by: iterative-ga
List Price: $79.00
Posted: 27 Sep 2004 13:40 PDT
Expires: 27 Oct 2004 13:40 PDT
Question ID: 407060
Problem: two columns in one table prevent loading. 
The goal: for every row in knownGene, take the coordinates of the gene
(chrom, cdsStart, cdsEnd). Go to table chrN_chainSelf where
N=knownGene.chrom, return all chains query coordinates (qName, qStart,
qEnd) for chains (rows) that overlap the gene (using tStart, tEnd to
see overlap with cdsStart, cdsEnd from knownGene). For each, take its
query coordinates (qName, qStart, qEnd) and find what gene it falls in
on knownGene by checking all the query coordinates. Return #name for
each knowngene that is found and get matching english name and
description from kgXref (knownGene.#name is equivilent to
kgXref.#kgID). MySQL table definitions follow. Tables must be loaded
into Oracle9i, then keys must be created. Two of Three table types
created and loaded successfully:

CREATE TABLE kgXref (
  kgID         VARCHAR2(40)  not Null,
  mRNA         VARCHAR2(40)      Null,
  spID         VARCHAR2(40)      Null,
  spDisplayID  VARCHAR2(40)      Null,
  geneSymbol   VARCHAR2(40)      Null,
  refseq       VARCHAR2(40)      Null,
  protAcc      VARCHAR2(40)      Null,
  description  VARCHAR2(255)     Null,
  CONSTRAINT pk_kgXref PRIMARY KEY  (kgID)
) ;

CREATE TABLE chrX_chainSelf (
  bin            NUMBER(5)     not Null,
  score          FLOAT         not Null,
  tName          VARCHAR2(255) not Null,
  tSize          NUMBER(10)    not Null,
  tStart         NUMBER(10)    not Null,
  tEnd           NUMBER(10)    not Null,
  qName          VARCHAR2(255) not Null,
  qSize          NUMBER(10)    not Null,
  qStrand        CHAR(1)       not Null,
  qStart         NUMBER(10)    not Null,
  qEnd           NUMBER(10)    not Null,
  id             NUMBER(10)    not Null
);

&&& The third table is problematic. mysql definition:

CREATE TABLE knownGene (
  name varchar(255) NOT NULL default '',
  chrom varchar(255) NOT NULL default '',
  strand char(1) NOT NULL default '',
  txStart int(10) unsigned NOT NULL default '0',
  txEnd int(10) unsigned NOT NULL default '0',
  cdsStart int(10) unsigned NOT NULL default '0',
  cdsEnd int(10) unsigned NOT NULL default '0',
  exonCount int(10) unsigned NOT NULL default '0',
  exonStarts longblob NOT NULL,
  exonEnds longblob NOT NULL,
  proteinID varchar(40) NOT NULL default '',
  alignID varchar(8) NOT NULL default '',
  KEY name (name(10)),
  KEY chrom (chrom(12),txStart),
  KEY chrom_2 (chrom(12),txEnd),
  KEY protein (proteinID(12)),
  KEY align (alignID)
) TYPE=MyISAM;

example row, tab delimited:
BC040520	chr1	-	5632418	5762079	5649110	5755895	33	5632418,5633495,5633943,5634707,5635978,5636635,5637345,5642857,5644076,5644479,5645559,5645
991,5646698,5648951,5649719,5656891,5660473,5674222,5674897,5675237,5676720,5678757,5697254,
5702752,5716709,5717675,5722305,5731399,5736904,5738692,5747875,5755760,5761849,	
5633011,5633639,5634123,5634879,5636064,5636721,5637502,5642941,5644263,5644706,5645684,5646
129,5646904,5649238,5649845,5657072,5660634,5674410,5675089,5675386,5676828,5678819,5697393,
5702935,5716836,5717857,5722442,5731555,5736969,5738865,5748019,5755933,5762079,	
Q8IWC0	23180

&&& table oracle9i attempt:

CREATE TABLE knownGene (
  name        VARCHAR2(255) not Null,
  chrom       VARCHAR2(255) not Null,
  strand      CHAR(1)       not Null,
  txStart     NUMBER(10)    not Null,
  txEnd       NUMBER(10)    not Null,
  cdsStart    NUMBER(10)    not Null,
  cdsEnd      NUMBER(10)    not Null,
  exonCount   NUMBER(10)    not Null,
  exonStarts  VARCHAR2(4000)          not Null,
  exonEnds    VARCHAR2(4000)          not Null,
  proteinID   VARCHAR2(40)  not Null,
  alignID     VARCHAR2(8)   not Null
);

loading this table results in a .bad file containing the above example row
using this load file. If the VARCHAR2(4000) both change to BLOB, the same happens. 

load data
infile 'knownGene.txt'
into table knownGene
insert fields terminated by x'09' trailing nullcols
(name         CHAR,
 chrom        CHAR,
 strand       CHAR,
 txStart      INTEGER EXTERNAL,
 txEnd        INTEGER EXTERNAL,
 cdsStart     INTEGER EXTERNAL,
 cdsEnd       INTEGER EXTERNAL,
 exonCount    INTEGER EXTERNAL,
 exonStarts   CHAR, 
 exonEnds     CHAR, 
 proteinID    CHAR, 
 alignID      CHAR)

(all the fields are human readable, I dont know why we are inclined to
make only those 5 INTEGER EXTERNAL)
&&& The file knownGene.gz is at
HGdownload.cse.ucsc.edu/goldenpath/hg16/database/knownGene.gz USERID:
anonymous PWD: anemail@address
Ideas: 
1) these columns wont load because of the commas, replace with
something else? These types, exonStarts and exonEnds, contain the
information I will need to filter out all "hits" or chains that fall
into introns, so someone said I should put them in another table and
that this table definition is "stupid". Thanks for all your help.

iterative

Request for Question Clarification by mathtalk-ga on 27 Sep 2004 19:15 PDT
Hi, iterative-ga:

I downloaded a file called knownGene.txt.gz from the directory path
you gave.  I did not see one called simply knownGene.gz.

I think I recall the discussion (on another thread) about the (poor)
database design involved in lumping all the exon starts and ends into
a comma-delimited field.

It's possible to process them serially in this form, but it is not the
"SQL" way of doing things.  But first let me figure out how to get
these records loaded.

My current suspicion is a problem involving the end of record
terminator, but I don't know what the problem bites us on this file
and the others.

regards, mathtalk-ga

Clarification of Question by iterative-ga on 28 Sep 2004 09:28 PDT
Nobody around here agrees on the problem. I am going to try to find
out what ensemble or others do for storing this type of data. I will
need exon starts and ends to mask exonic chains for the query, so I am
worried about the state of this entry. It still appears to me that
only records with many exons are a problem to load
Answer  
There is no answer at this time.

Comments  
Subject: Re: SQLplus for mathtalk
From: angie_h-ga on 29 Sep 2004 11:46 PDT
 
You can use UCSC's table browser (http://genome.ucsc.edu/, click on Tables) 
to get coordinates of all exons in Known Genes, and then save those locally 
and load them up as a separate table.  

In the table browser, select the Known Genes track, set the position 
to "genome", and click "Advanced query".  Then set the output format 
to "BED", and click "Get results".  Choose exons as the regions to output.  
Then get the results and save the file locally.  Each line will describe 
one exon, and the original knownGene name can be parsed out of the exon name.  
[Note: within a week or so we will be releasing a new & improved  Table 
Browser, and there will be fewer clicks along the way but you'll be able to 
get the same coords via BED output.]  

Hope that helps,

Angie
UCSC Genome Bioinformatics Group

Important Disclaimer: Answers and comments provided on Google Answers are general information, and are not intended to substitute for informed professional medical, psychiatric, psychological, tax, legal, investment, accounting, or other professional advice. Google does not endorse, and expressly disclaims liability for any product, manufacturer, distributor, service or service provider mentioned or any opinion expressed in answers or comments. Please read carefully the Google Answers Terms of Service.

If you feel that you have found inappropriate content, please let us know by emailing us at answers-support@google.com with the question ID listed above. Thank you.
Search Google Answers for
Google Answers  


Google Home - Answers FAQ - Terms of Service - Privacy Policy