Google Answers Logo
View Question
 
Q: SQLplus query needed ( Answered 5 out of 5 stars,   3 Comments )
Question  
Subject: SQLplus query needed
Category: Computers > Algorithms
Asked by: iterative-ga
List Price: $100.00
Posted: 08 Nov 2004 12:57 PST
Expires: 08 Dec 2004 12:57 PST
Question ID: 426258
First dibs on this question must go to Mathtalk, the author of the
comments that go me this far. I would understand if they are not
interested, but I do feel like I have to compensate him or her
somehow.

I need a query to build a map of gene collisions. 
>>> I am using Oracle9i and SQLplus

The "known" genes in the genome are listed in table knownGene. For each gene (row) 
there is a genomic interval or locus with coordinates (chrN, Start position, end 
position). For each chromosome (1-22, X, Y) there is a table of entities called 
chains. These each have two intervals, target and query. A chain is a repeat.

The plan is to make a list for each known gene. The list is named by its target 
gene, and populated by the names and descriptions of all the genes associated via 
the chains. A chain is a link, or edge, from one region to another in the genome. 

For each knowngene, all the chainSelf tables must be scanned for chains whose 
target coordinates indicate they would fall into the genes genomic interval. Each 
genes interval will have to be expanded by 1.5, but that can wait. With a gene's 
interval in mind, a list of chains that are enclosed by or that overlap it can be 
made. Each of these chains has a query locus associated with it. QNAME is the 
chromosome, QSTART, QEND define the interval. For each query interval, table 
knownGene is searched and every overlapping knowngene returned. These are the 
known genes that are linked to or "chain to" the initial target gene for which the 
list is named. They constitute the first of two columns in this is gene's list. 
The second column is populated by taking the name in knownGene and scanning table 
kgXref (were name is equivilent to kgID) to return all teh descriptions of the 
genes whose names we collected for this one target gene. I am not particularly 
clear on what the best final output is. Suffice to say I would want to sort the 
lists by size and see what genes chain to the greatest number of other genes, and 
then make a map of the network created by these associations using something like 
Cytoscape to show that the distribution of links follows a power law.  

>>>Here are the descriptions of the 3 table types:

SQL> describe knownGene
 Name                                      Null?    Type
 ----------------------------------------- -------- ----------------------------
 NAME                                      NOT NULL VARCHAR2(255)
 CHROM                                     NOT NULL VARCHAR2(255)
 STRAND                                    NOT NULL CHAR(1)
 TXSTART                                   NOT NULL NUMBER(10)
 TXEND                                     NOT NULL NUMBER(10)
 CDSSTART                                  NOT NULL NUMBER(10)
 CDSEND                                    NOT NULL NUMBER(10)
 EXONCOUNT                                 NOT NULL NUMBER(10)
 PROTEINID                                 NOT NULL VARCHAR2(40)
 ALIGNID                                   NOT NULL VARCHAR2(8)

>knownGene looks like this, but with no headers. Values abbreviated to
fit page here.
>Table knownGene has 43232 rows total.
#name  chrom strand txStart txEnd cdsStart cdsEnd exonCnt  proteinID alignID
BC0634  chr1  -      4224    7502  4558     7173   2       AAH63682  24255      
AL1373  chr1  -      4266    1936  7413     19346  2       Q9NSV7    41739      
BC0544  chr1  -      4268    5808  4558     5808   1       Q96BN3    32430      


SQL> describe chr10_chainSelf (there is one table for each chromosome 1-22, X, Y)
 Name                                      Null?    Type
 ----------------------------------------- -------- ----------------------------
 BIN                                       NOT NULL NUMBER(5)
 SCORE                                     NOT NULL FLOAT(126)
 TNAME                                     NOT NULL VARCHAR2(255)
 TSIZE                                     NOT NULL NUMBER(10)
 TSTART                                    NOT NULL NUMBER(10)
 TEND                                      NOT NULL NUMBER(10)
 QNAME                                     NOT NULL VARCHAR2(255)
 QSIZE                                     NOT NULL NUMBER(10)
 QSTRAND                                   NOT NULL CHAR(1)
 QSTART                                    NOT NULL NUMBER(10)
 QEND                                      NOT NULL NUMBER(10)
 ID                                        NOT NULL NUMBER(10)

>chr10_chainSelf looks like this with no headers.  
#bin  score  tName  tSize   tStart  tEnd    qName   qSize   qStrand
qStart  qEnd id
585   63520  chr10  135037  14000   16578   chr7    158545  -      
96339   9634 204969
585   63670  chr10  135037  14000   16578   chr7    158545  +      
62286   6228 204161
585   81668  chr10  135037  14000   17239   chr7    158545  +      
62827   6283 133728


SQL> describe kgXref
 Name                                      Null?    Type
 ----------------------------------------- -------- ----------------------------
 KGID                                      NOT NULL VARCHAR2(40)
 MRNA                                               VARCHAR2(40)
 SPID                                               VARCHAR2(40)
 SPDISPLAYID                                        VARCHAR2(40)
 GENESYMBOL                                         VARCHAR2(40)
 REFSEQ                                             VARCHAR2(40)
 PROTACC                                            VARCHAR2(40)
 DESCRIPTION                                        VARCHAR2(255)

>kgXref looks like this but with no headers. Values have been truncated. 
>Table kgXref has 42026 rows total.
#kgID   mRNA    spID    spDisplayID   geneSymbol   refseq   protAcc 
description
AB0000  AB0000  P56555  DSR4_HUMAN    DSCR4        NM_0057  NP_0058 
Down syndrome gene 4
AB0001  AB0001  Q99983  OMD_HUMAN     OMD          NM_0050  NP_0050 
osteomodulin
AB0001  AB0001  Q99984  Q99984        C1orf29      NM_0068  NP_0068 
mRNA expr in osteoblast
 


>>>this is the query that was given to me. It can only check through one of the 
chainself tables instead of all 24 (chr1-22, X, Y). It returns the same results 
over and over. I dont think it ever stops. I have been sending the results to a 
spool file, and clearly the query doesnt do what I intended.  I
definately need it to check all chainSelf tables.

/* query1.sql*/

select a.name, a.exoncount, d.description, c.name
from    knowngene a,
        chr1_chainSelf b,
        knownGene c,
        kgXref d

where   a.txstart <= b.tStart
and     a.txEnd >= b.tEnd
and     b.qstart >= c.Txstart
and     b.qEnd <= c.txend
and     c.name = d.kgId
and     a.chrom = 'chr1'
;

Any direction would be greatly appreciated, but the answer is going to
have to work. I can help develop it but it will essentially be the
text in the query file, and thus syntax specific.

Request for Question Clarification by mathtalk-ga on 11 Nov 2004 05:27 PST
Hi, iterative-ga:

"For each knowngene, all the chainSelf tables must be scanned for
chains whose target coordinates indicate they would fall into the
genes genomic interval."

I think that your sample query illustrates that only the chainSelf
table that corresponds to the same chromosome to which a known gene is
assigned needs to be scanned (for that row in knownGene).

So for rows in knownGene with chrom = 'chr1', we need the
chr1_chainSelf table, but where a gene is located on a different
chromosome, we need the corresponding "chainSelf" table.

This basically means writing a different query for each chromosome (24
queries in all), although SQL has a keyword UNION that allows one to
combine queries with similar columns into one large result set.

Let's try and understand clearly what the existing query does, and
then you can perhaps comment on whether this is what you'd like to do
or not.  The query involves joining two copies of knownGene, and one
copy each of chr1_chainSelf and kgXref.  As we just discussed, the
fact that chr1_chainSelf is being used is determined by the chromosome
location of the "known gene".

The only use that is made in the query at hand for kgXref is to lookup
the description by matching:

knownGene.name = kgXref.kgID

It is the sort of "added information" that can be tacked on later
(because the knownGene.name values are also returned in the query), so
to simplify the discussion, I'm going to omit the kgXref table for the
moment.

Now I believe that what your query is meant to do is to start from
each row in knownGene which (for the sake of illustration) happens to
be on chromosome 1, and to build a list of "related" genes ALSO on the
same chromosome.  If this is so, then there's a condition missing in
the sample query.  It should restrict both "copies" of the knownGene
table to rows for chromosome 1 (or whichever chromosome we'd be
writing the query against).

Thus let's consider this:

/* rewrite of chr1 "collision" query */

select a.name, a.exoncount, c.name
from    knownGene a,
        chr1_chainSelf b,
        knownGene c

where   a.txStart <= b.tStart
and     a.txEnd >= b.tEnd
and     c.txStart <= b.qStart
and     c.txend >= b.qEnd
and     a.chrom = 'chr1'
and     c.chrom = 'chr1';

Now this query, which rewrites the original in some minor ways apart
from adding the requirement that both knownGene rows "live" on
chromosome 1, produces a relationship between pairs of known genes. 
As far as it goes, I think this query will probably do the right and
useful thing.

However I think that there is presumably then a second step you'll
want to take with these relations, and that is to organize them into
"chains" or lists that partition the chromosome.  In mathematics we
might call this "taking the transitive closure", but the idea is
simply that if knownGene A is related to knownGene B, and knownGene B
is related to knownGene C, then we want ultimately to put all three
into a common "linked list" or other data structure that lumps them
together.

This is the impression I have so far about what you are trying to
accomplish.  Please let me know if I'm getting warm.

regards, mathtalk-ga

Clarification of Question by iterative-ga on 12 Nov 2004 09:54 PST
First, I dont know what the problem is but I have been trying to post
this for a while and the clarification button did not exist.

Second, in answer to you getting warmer, sort of. Every known gene is related to
a number, maybe zero, of other genes, each by a chain, which has a
score. Imagine a table with a row for every known gene. The gene that
is linked by the highest scoring chain is put in column two. You are
right that for each gene you only have to look in one chr_NchainSelf
table, but the chains found in that table on that chromosom link to
genes on all chromosomes, c.chrom = 'chr1' is not generally true. The
problem with my model here is that it might require ~500 columns, but
most of the rows in these low scoring columns will be empty.
I have been told this is a stupid or impossible table.

Request for Question Clarification by mathtalk-ga on 12 Nov 2004 20:55 PST
Let me ask this.  Suppose I pick two known genes, and I ask you what
the "score" is that relates them.  Is the score the same regardless of
the ordering of the pair?  Does finding the score between genes A and
B involve for the most part consideration of intermediate
relationships to other genes?

Still trying to grasp why one considers genes related in this fashion.
 Originally you gave me the impression we were talking about
coordinates (start and end positions) within a sequence, but if that
were the case, it scarcely makes sense to confuse coordinates on one
chromosome with numerically similar values on a different chromosome.

regards, mathtalk-ga

Request for Question Clarification by mathtalk-ga on 13 Nov 2004 05:39 PST
Here's an alternative to building the table quite as you described it
(with many columns, most of which are empty on most records).

One instead defines a table in which the key is a pair of known genes,
and a third column gives the "rank" of their scoring.

This is the motivation behind my requests for clarification I'm asking
above.  One part addresses whether the key is an ordered or unordered
pair of known genes; the others seek insight into how the score or
ranking would be found.

As far as the missing "Clarification" button goes, my guess would be
that the site didn't see you as logged in as yourself (only the user
who posts the Question can submit Clarifications).  Rarely I find that
I have to logout, log back in to get recognized by the site.

regards, mathtalk-ga

Clarification of Question by iterative-ga on 14 Nov 2004 11:27 PST
yes, a two or three column table was described to me yesterday as more
sensible, like you describe. Links are birectional, the table only
indicates one degree of linking. Each link is a chain. Column one is
the target gene. The gene (row) where you say for every gene do this.
what you do is find all the chains that link you to other genes.
Column two is the gene you link to, column three is the score for that
link, given by the scoore for the chain that links you. Example:
TargetGene__QueryGene__Score
gene 1      gene 233   500
gene 1      gene 458   302
gene 1      gene 458   302
gene 1      gene 2      58
gene 2      gene 332   403
gene 2      gene 806   255
gene 2      gene 1     58  >> this is the reverse of row 4. There may
be several links between two genes, they all will be listed twice.
Directionality constraint can be induced later by strand information.

The problem was that gmail accounts are not allowed and I was logged in to mail.

Request for Question Clarification by mathtalk-ga on 14 Nov 2004 15:26 PST
Okay, my impression was that the query we looked at before gave "one
degree of linking".  You can still create and populate a table that
will hold _all_ degrees of linking (the transitive closure business I
mentioned earlier).

Basically we can collect all the "one degree" links first, and
populate them as the corresponding pair of known genes with "rank"
column set = 1.

Then we proceed to insert new pairs of genes into the "linkable genes"
tables based on a rule that says:

If known gene A is linked to known gene B with rank = K, and known
gene B is linked to known gene C with rank = 1, then insert the
pairing A,C into the table with rank = K+1 unless it is already there
(with a smaller rank).

We would "iterate" over this rule until new gene pairs stop being
added.  (One iteration with no new pairs means we're done.)

Efficiency requires some indexes be defined on the "linkable genes"
table, for example so that it's pretty fast to search it for gene
pairs and know whether they already exist or not.

I see from your last note that you contemplate storing the same gene
pair multiple times (if they can be linked together in more than one
way).  Personally I would avoid this, as it will likely generate a
high proportion of duplicate records.

regards, mathtalk-ga

Clarification of Question by iterative-ga on 15 Nov 2004 06:31 PST
"However ... then a second step you'll want to take with these
relations, and that is to organize them into "chains" or lists that
partition the chromosome.  In mathematics we might call this "taking
the transitive closure" ... we want ultimately to put all three into a
common "linked list" or other data structure that lumps them
together."

I do not follow what this looks like nor am I familiar with the point,
nor is its really the goal of this question. I want to make a table
wiht one degree of linking for every gene. if I want to make second
degree links, I can get them from that table. If your
conceptualization yeilds a better query to do this, that will be fine.
The query we are refering to above does not do this. In addition it
still has the incorrect assumption that target gene and query gene are
on the same chrom. You were hinting at the creation of a new table for
a new degree of linking, or a new column, either way I dont get it.
Show me the query to run adn I might get it.

Request for Question Clarification by mathtalk-ga on 16 Nov 2004 11:20 PST
Okay, let's back up and discuss where you find shortcomings in the existing query.

One point is to make the query use all the chr#_chainSelf tables.  If
you are satisfied with inserting the results into a common table
(linked_genes), then we could simply modify the query1.sql to process
each of the other 23 "chainSelf" tables one at a time.  If it's
important to have a single query that does everything, then we could
write this as a UNION ALL query.

However I recall your earlier statement: "It returns the same results
over and over. I dont think it ever stops."

Is the problem that when you run the query (once), it never stops?  Or
are you simply saying that the query returns the same rows each time
it runs?

regards, mathtalk-ga

Clarification of Question by iterative-ga on 16 Nov 2004 12:17 PST
both problems. The first thing to understand is that this query does
not work. It was someones idea, it would be best to start from
scratch. One query, multiple queries, this is not really a concern. I
do want one table containing all the links, with score, between genes.
no entire row will be duplicated:
targetGene_queryGene_chainScore
There are 44000 genes, 20 genes on average linked to each, less than a million rows.

Request for Question Clarification by mathtalk-ga on 17 Nov 2004 05:07 PST
Hi, iterative-ga:

Regarding "both problems", I have two suggestions:

1)

I suggest we fix a row in the "a" table and do the query to see
whether the results that come back are the appropriate subset of the
ones you want, ie. all the linked genes from chromosome 1.  Comparing
your description of the "overlapping" genomic intervals/coordinates, I
see that there's some room here for the inequality logic to be wrong.

So, if there are some rows that don't belong, then we can address that
issue sooner than later:

/* query1.sql revised again */

select a.name, c.name, b.score
from    knowngene a,
        chr1_chainSelf b,
        knownGene c
where   a.txStart <= b.tStart
and     a.txEnd >= b.tEnd
and     b.qStart >= c.txStart
and     b.qEnd <= c.txEnd
and     c.name = d.kgId
and     a.chrom = 'chr1'
and     a.name = 'BC0634'
;

2) 

I think the reason the query runs very slowly is because it is doing
the joins on many tables without having a good set of indexes defined
to help.  If you can confirm that indexes have not yet been created,
I'll post some SQL to create the ones that jump out at me.  If you
have created some indexes, please post what they are, and I can review
the list for additions.

regards, mathtalk-ga

Clarification of Question by iterative-ga on 17 Nov 2004 07:56 PST
/* query6.sql from mathtalk 
must change knowngene to knowngenex for exon coordinate absence 
must change a.name to AK000538, which has one chr1 chain to SHARP
must take out kgid reference for now
*/
select a.name, c.name, b.score
from    knowngenex a,
        chr1_chainSelf b,
        knownGenex c
where   a.txStart <= b.tStart
and     a.txEnd >= b.tEnd
and     b.qStart >= c.txStart
and     b.qEnd <= c.txEnd
and     a.chrom = 'chr1'
and     a.name = 'AK000538'
;


&&& This query is simply hanging now for 30 minutes, shoud return one line. 
&&& I thought indexes would show up in DESCRIBE knownGene, etc above,
so they do not exist I guess.
&&& I get the feeling also that the inequality logic is wrong, as
though its not comparing one chain to fufill the position requirements
but multiple. If this would be a good time to incorporate the desire
to expand a genes coordinates by 1.5x with the same center point,
please keep it in mind.
&&& I think we are getting warmer.

Request for Question Clarification by mathtalk-ga on 17 Nov 2004 08:58 PST
As a quick confirmation that we are not missing something fundamental:

select a.name
from    knowngenex a
and     a.chrom = 'chr1'
and     a.name = 'AK000538'
;

should quickly return one row.  Then, let's test the speed of this:

select a.name, b.score
from    knowngenex a,
        chr1_chainSelf b
where   a.txStart <= b.tStart
and     a.txEnd >= b.tEnd
and     a.chrom = 'chr1'
and     a.name = 'AK000538'
;

I think we want to created indexes on knowngenex.name and chr1_chainSelf.tStart.  

The inequality logic here asks that the "target" interval
(b.tStart,btEnd) is entirely contained within the "genomic" interval
(a.txStart,txEnd).  I agree that now would be a good time to think
about the additional 1.5 factor.  First let's confirm that invariably:

  txStart <= txEnd in the knowngenex table

  tStart <= tEnd in the chr1_chainSelf table

  qStart <= qEnd in the chr1_chainSelf table

This may seem "obvious" but as you may have found, sometimes the
endpoints are reversed for "opposite" strand orientation.  I'd just do
a few queries to be sure what we should expect:

select count(*) from knowngenex where txStart > txEnd;

select count(*) from chr1_chainSelf where tStart > tEnd;

select count(*) from chr1_chainSelf where qStart > qEnd;

regards, mathtalk-ga

Clarification of Question by iterative-ga on 17 Nov 2004 14:12 PST
yes quick one line return
yes each of those counts is equal to zero
chain ID someone refered to as the index because it was unique, but I dont know more

Request for Question Clarification by mathtalk-ga on 19 Nov 2004 05:54 PST
Hi, iterative-ga:

See my Comment at bottom about creating a couple of indexes on tables.

regards, mathtalk-ga

Clarification of Question by iterative-ga on 19 Nov 2004 09:29 PST
&&&yes, you understand the concept
&&&yes, genetic intervals to "contain" the query interval are also
subject to 50% enlargement, this may look sticky later because one
chain's query interval may hit two genes. Fine, count them all with
that chain score.
&&&No, you are right, we cannot ignore chromosomal location when
comparing chain query interval to the knownGene table in the final
step. Coordinates include chromosome name and postion.
&&& knowngenex is the table used because we could not get the longblob
exon coordinates into an sql table. This index is created. The
chromosome 10 index is created, but upon repeating for other
chrX_chainSelf files I get
CREATE INDEX tStart_idx ON chr1_chainSelf ( tStart )
             *
ERROR at line 1:
ORA-00955: name is already used by an existing object

Request for Question Clarification by mathtalk-ga on 19 Nov 2004 10:32 PST
Ah, I see we need to make the index names distinct for each of the
chr#_chainSelf tables.  Something along these lines:

DROP INDEX tStart_idx;

CREATE INDEX chr10_tStart_idx ON chr10_chainSelf ( tStart );

Also, thanks for the clarification about the chromosome locations and
intervals.  Given that, I'll amend my recommendation for the index on
knownGenex to:

DROP INDEX txStart_idx;

CREATE INDEX chrom_txStart_idx ON knownGenex ( chrom, txStart );

*  *  *  *  *  *  *  *  *  *  *  *  *  *

When a genetic interval txStart/txEnd is to be increased by 1.5 times,
it would be my intuition that this means to lower txStart by 1/4th of
(txEnd - txStart), and to symmetrically increase txEnd by the same
amount.  Is that your sense as well?

regards, mathtalk-ga

Request for Question Clarification by mathtalk-ga on 23 Nov 2004 06:06 PST
My thinking right now is to create the links between known genes via
chain-self target and query intervals in three steps:

1. Select known genes & "chain IDs" where the known gene's (expanded)
genomic interval contains the chainSelf target interval

2. Select known genes & "chain IDs" where the known gene's (expanded)
genomic interval contains the chainSelf query interval

3. Join the results of 1. and 2. based on equality of chain IDs,
excluding if necessary any rows where the known genes are the same.

So as an experiment, I'll post the SQL to create a couple of
intermediate tables to hold the results of 1. and 2.  Then we can put
an index on these tables for chain ID, to facilitate joining them
(3.).

regards, mathtalk-ga

Clarification of Question by iterative-ga on 30 Nov 2004 08:51 PST
OK, I think that the indexes are created. Im not sure what to clrify
here. What you describe sounds good, but seems to include more steps
than needed. I dont know why a table must be created, I thought we
were already doing joins in the where statement. I just want the table
of one degree links though, so any way is fine.

Request for Question Clarification by mathtalk-ga on 30 Nov 2004 08:57 PST
I hope we will save time overall because of the computation of the
enlarged intervals -- doing it once rather than over and over.

regards, mathtalk-ga

Clarification of Question by iterative-ga on 30 Nov 2004 10:00 PST
bene. Then my clarification is sounds very good. I will keep reading
about joins and formatting outputs
Answer  
Subject: Re: SQLplus query needed
Answered By: mathtalk-ga on 06 Dec 2004 05:12 PST
Rated:5 out of 5 stars
 
Hi, iterative-ga:

Let's outline the steps to take, and then go into the details (see below).

1. Create a table to hold each known genes, its chromosome, and the
enlarged genomic intervals.  Populate this table with an INSERT
statement based on selecting records from table knownGenex.

2a. Create a table to hold pairs of known genes and chainSelf IDs,
such that the start/end coordinates of the chainSelf target interval
are within the enlarged genomic interval.  Populate this table with
several INSERT statements, one for each chromosome, joining the table
from 1. and the respect chromosome's chainSelf table.

2b. Create a table to hold pairs of known genes and chainSelf IDs,
such that the start/end coordinates of the chainSelf query interval
are within the enlarged genomic interval.  Populate this table with
several INSERT statements, one for each chromosome, joining the table
from 1. and the respect chromosome's chainSelf table.

3. Create a table to hold pairs of known genes and the chainSelf IDs
that link them.  Join the tables from 2a/2b by chromosome & chainSelf
IDs, to produce the required "links" among known genes.

It can be seen from this outline that the process can be done
chromosome by chromosome, although we will be using tables that pool
all the chromosomes' results together.  But to prove the ideas, let's
run through the process for chromosome 1.  It may not only find a bug
or two, but point out some indexing issues for performance.

Please run through these steps just on chromosome 1 and let me know
what the timings are like.

regards, mathtalk-ga


*  *  *  *  *  *  *  *  *  *  *  *

Step 1
======

CREATE table knowngene_enlarge (
  name          varchar2(255)          not null,
  chrom         varchar2(255)          not null,
  wxStart       number(10)             not null,
  wxEnd         number(10)             not null
);

INSERT into knowngene_enlarge
SELECT name,
       chrom,
       txStart - (txEnd - txStart)/4,
       txEnd   + (txEnd - txStart)/4
from  knowngenex;

/* Note: We will probably want to add an index here on
         chrom, for the sake of the next steps.
*/

Step 2a
=======

CREATE table knowngene_target (
  name          varchar2(255)          not null,
  chrom         varchar2(255)          not null,
  chain_id      number(10)             not null
);

/* for each chromosome, do like this: */

INSERT into knowngene_target
SELECT a.name,
       a.chrom,
       b.id
from  knowngene_enlarge a, chr1_chainSelf b
where a.chrom = 'chr1'
and   b.tStart between a.wxStart and a.wxEnd
and   b.tEnd <= a.wxEnd;

/* Note: The hope is that the condition b.tStart <= a.wxEnd,
         although redundant, will help speed up the query by
         using an index on chrXX_chainSelf.tStart to limit
         the join.
*/

Step 2b
=======

CREATE table knowngene_query (
  name          varchar2(255)          not null,
  chrom         varchar2(255)          not null,
  chain_id      number(10)             not null,
  score         float(126)             not null
);

/* for each chromosome, do like this: */

INSERT into knowngene_query
SELECT a.name,
       a.chrom,
       b.id,
       b.score
from  knowngene_enlarge a, chr1_chainSelf b
where a.chrom = 'chr1'
and   b.qStart between a.wxStart and a.wxEnd
and   b.qEnd <= a.wxEnd;

/* Note: Very similar to above except using query interval
         instead of target interval, and also picking up
         the score field for later convenience.
*/

Step 3
======

CREATE table knowngene_link (
  name_t        varchar2(255)          not null,
  name_q        varchar2(255)          not null,
  chrom         varchar2(255)          not null,
  chain_id      number(10)             not null,
  score         float(126)             not null
);

INSERT into knowngene_link
SELECT a.name,
       b.name,
       b.chrom,
       b.chain_id,
       b.score
from  knowngene_target a, knowngene_query b
where a.chrom    = b.chrom
and   a.chain_id = b.chain_id;

/* Note: To make this join efficient, we probably want to
         create a composite index on chrom & chain_id for
         knowngene_target and/or knowngene_query.
*/

Request for Answer Clarification by iterative-ga on 06 Dec 2004 12:04 PST
1 is fine, 2a appears to be ok but both statements in the second
comment are not clear to me. I do not intend to do anything explicitly
instructed because it could take serious time to figure out or debug.
Target table made and populated. 2b, query table, created but insert
hangs for 2 hours now, not for any reason I can see. Ideas?

Clarification of Answer by mathtalk-ga on 06 Dec 2004 12:13 PST
Hi, iterative-ga:

Both my comments in Step 2 were intended as explanatory (apparently
clear as mud!) and not as instructive.  I was just trying to make a
remark on why the queries were written the way they were.

If 2a.'s insert went reasonably quickly but 2b.'s is running very
slowly, then I suspect we need to add an index to chr1_chainSelf on
qStart.

Would you mind verifying if we have an index on tStart but not on
qStart for that table?

Thanks,
mathtalk-ga

Clarification of Answer by mathtalk-ga on 06 Dec 2004 12:18 PST
Now that I look back at our earlier exchanges of Clarifications, I see
we did indeed create indexes for the tStart columns in chrXX_chainSelf
tables, but none as far as I know on the qStart column.

Please imitate the SQL we used before, changing the index names, etc.
to reflect qStart rather than tStart as the indexed field.

regards, mathtalk-ga

Request for Answer Clarification by iterative-ga on 07 Dec 2004 08:09 PST
ok part two is fine, part 3 table created, insert hangs. Your comment
about the composite index seems to come into play. Are we supposed to
make another index to insert effectively into the link table?

Clarification of Answer by mathtalk-ga on 07 Dec 2004 10:08 PST
Yes, I think it would be sufficient to put the composite index on just
one of the tables from step 2, but for the sake of symmetry:

CREATE INDEX chrom_chain_t ON knowngene_target ( chrom, chain_id );

CREATE INDEX chrom_chain_q ON knowngene_query ( chrom, chain_id );

regards, mathtalk-ga

Request for Answer Clarification by iterative-ga on 11 Dec 2004 12:34 PST
the final insert statement excedes my tablespace. I talked to system
administrator and he said I was at 10gig and he gave me another 10gig.
I dont know if he has to reboot or something because it still excedes
and quits. I am looking fo commands to see how much of my quote is
free. He also said i should drop the last indexes and make them again
in tablespace INDEX01_TBSP, but I am struggling with the sybtax for
that as well. If you have guidance, it would be great but I will still
work to determine what is wrong. What I dont have though is an answer
that works yet. Thanks.

Clarification of Answer by mathtalk-ga on 11 Dec 2004 21:54 PST
Hi, iterative-ga:

I'm curious to find out a rough number for the records being inserted.
 Let's check just the first chromosome, as I suspect the numbers of
links on different chromosomes will not vary by an order of magnitude:

SELECT count(*)
from  knowngene_target a, knowngene_query b
where a.chrom    = b.chrom
and   a.chain_id = b.chain_id
and   a.chrom    = 'chr1';

At one point you estimated ~ one million rows would be generated as
links, so it's a good time to check that assumption.  This count will
potentially include some pairs of genes that are linked in more than
one way.

Since the rows we are creating for knowngene_link have only five
columns, it's not hard to see that we are overdoing it with how wide
we've made many of these columns.  I used column widths that were
compatible with how you'd defined the columns in the source tables,
but surely 255 characters is overkill for the gene names (name_t and
name_q) almost as much as it is for the chromosome identifier.

How about 30 characters each for the two gene names, and 10 for
chromosome, and making field score be number(10) rather than
float(126).  Judging by your sample data, all the score fields are
whole numbers anyway.  These changes to the knowngene_link table will
reduce the size of rows by about an order of magnitude.

Here's the syntax to drop and recreate the indexes on knowngene_target
and knowngene_query in the specified tablespace your administrator
suggested:

DROP INDEX chrom_chain_t;

CREATE INDEX chrom_chain_t
  ON knowngene_target ( chrom, chain_id )
  TABLESPACE INDEX01_TBSP;

DROP INDEX chrom_chain_q;

CREATE INDEX chrom_chain_q
  ON knowngene_query ( chrom, chain_id )
  TABLESPACE INDEX01_TBSP;

As far as figuring out how much space a table is using, there is a
"system" table that holds some relevant information called DBA_TABLES.
 Often the first step is to estimate it yourself by counting the
number of rows and adding up the size of one row, then multiplying. 
Let me write up some on this topic with a clear mind in the morning.

regards, mathtalk-ga

Clarification of Answer by mathtalk-ga on 12 Dec 2004 12:51 PST
Hi, iterative-ga:

Well, in the clear light of "morning" I'm doubtful how much space
could be saved in the final table by redefining its columns.  The
VARCHAR2 fields are of variable length, so it was incorrect for me to
say we'd save a lot of space by resizing them.  In any case let's get
a count of how many rows are being inserted before we do anything
else.

The database administrator (DBA) is in a much better position to check
a user's quota than the user is, as Oracle puts many of the
tables/views/etc. under the dba permissions.  However you should be
able to check how much free space you have remaining (per tablespace)
under your assigned quota by querying some views that are defined for
user permissions.

Here's a tip from "The Oracle (tm) Users' Co-Operative FAQ" written by
Norman Dunbar:

[How I determine how much free space a tablespace has left?]
http://www.jlcomp.demon.co.uk/faq/freespace.html

select ts.tablespace_name,
       to_char(sum(nvl(fs.bytes,0))/1048576, '99,999,990.99') as MB_FREE, 
       count(*) as FRAGMENTS,
       to_char(max(nvl(fs.bytes,0))/1048576, '99,999,990.99') as BIGGEST_BIT
from  user_free_space fs, user_tablespaces ts
where fs.tablespace_name(+) = ts.tablespace_name
group by ts.tablespace_name;

On the other hand, for a quick summary of how much space has been
allocated (but not necessarily used; a detailed explanation would
involve discussing what Oracle does with "segments" and "extents"):

select segment_type, sum(bytes) from user_segments group by segment_type

which should report allocation totals by combined tables and combined indexes.

regards, mathtalk-ga

Request for Answer Clarification by iterative-ga on 15 Dec 2004 07:36 PST
SQL> select count(*) from knowngene_target;   COUNT(*)-39535334
SQL> select count(*) from knowngene_query;   COUNT(*)-29017023
SQL> SELECT count(*) from  knowngene_target a, knowngene_query b where
a.chrom = b.chrom
  and a.chain_id = b.chain_id and a.chrom = 'chr1';  ***took ~5hrs
hours: COUNT(*)-4995703
SQL> select segment_type, sum(bytes) from user_segments group by segment_type;
SEGMENT_TYPE       SUM(BYTES)
------------------ ----------
INDEX              1383071744
TABLE              4865523712
>>other quota check from Norman Dunbar failed. DBA says I have 20
gigs. Turns out INDEX is INDX.
the last two nights someone has logged me off as the compound index
was being made. but I will continue with the plan, unless there seems
to be something off.

Clarification of Answer by mathtalk-ga on 15 Dec 2004 09:14 PST
Chromosome 1 is the largest (if I remember correctly) chromosome in
the human genome, but that may not translate into the largest number
of known genes (or links).  But at least the 5,000,000 links reported
gives us a good ballpark to plan for, per chromosome.

So I'd rebuild the composite indexes, then proceed to do the final
inserts one chromosome at a time, keeping an eye on the amount of
space used.  Let's check it after two or three chromosomes are
processed and try to extrapolate to the final size of the
knowngene_link table.  We may need to analyze the storage factors
being used for the table to make sure they are reasonable.  By default
the succeeeding allocations (extents) added for a table will be
increased by a certain percentage in size each time.  While this
ordinarily helps to keep the number of extents to a minimum without
requiring much planning, it can be a problem if space starts getting
tight.

regards, mathtalk-ga

Request for Answer Clarification by iterative-ga on 20 Dec 2004 11:36 PST
about face. 

I cant create any tables, like the last one I need, and the last index
creation disconnects me from oracle. We have one DBA for the building
and he is always "migrating" something or otherwise busy. Turns out I
am the only one using oracle, and I do not rate high on the pole, so
he cannot look into these issues.
Soooooooooooo, 
do I change to postgres or mysql. This DB will be on a php webpage once it works. 

Thanks for all your help, I guess this question is over for now.

Clarification of Answer by mathtalk-ga on 20 Dec 2004 12:03 PST
Ouch!  Well, the SQL & indexes should work on either postgres or
MySQL.  It seems like an odd situation, in that Oracle is a commercial
database management system and the situation is impelling you to
migrate to an open source DBMS.

best wishes, mathtalk-ga
iterative-ga rated this answer:5 out of 5 stars
Great effort by researcher, well recommended.

Comments  
Subject: Re: SQLplus query needed
From: mathtalk-ga on 08 Nov 2004 13:03 PST
 
Hi, iterative-ga:

I'm looking at it... will touch base later today.

regards, mathtalk-ga
Subject: Re: SQLplus query needed
From: mathtalk-ga on 08 Nov 2004 19:38 PST
 
I think we may want to build a temp table to hold the results of the
query, and then loop until we no longer find anything new to insert
into the table.

I have a bit of catching up to do with you on populating the tables,
from where we left off before.  But I can start to sketch out the
above idea in a series of comments.

regards, mathtalk-ga
Subject: Re: SQLplus query needed
From: mathtalk-ga on 19 Nov 2004 06:11 PST
 
Here is the SQL to create an index on the starting point of the
"target" interval for each of chr#_chainSelf tables, using chromosome
10 as our model:

CREATE INDEX tStart_idx ON chr10_chainSelf ( tStart );

I'm not clear on the distinction between knownGene and knownGenex
tables, but you referred recently to "exon coordinate absence". 
Anyway, both tables could presumably use an index of this kind to help
our query:

CREATE INDEX txStart_idx ON knownGenex ( txStart );

*  *  *  *  *  *  *  *  *  *  *  *  *

Here's how I understand the intent of the existing query.  One begins
with a "known gene" and its genetic interval on some chromosome, say
chromosome 10.  We then go to chr10_chainSelf, which has information
(about repeating groups???) on  chromosome 10 and find rows there
whose "target" interval is entirely contained in the genetic interval
of the given "known gene".  [This being subject to the 50%
enlargement, ie. making the genetic interval bigger so that more
things from the "chain self" profiles will fit into it.]

Finally we use the corresponding "query" intervals from these
chromosome specific rows to check for "known gene" rows (other than
the original one!) whose genetic intervals entirely contain the
"query" interval.

  -  This last condition is the hardest one to check, I think, from a
database point of view.

  -  I'm not clear if the genetic intervals to "contain" the query
interval are also subject to 50% enlargement.

  -  I'm still very doubtful about your desire to ignore any
chromosome restriction in comparing the genetic interval coordinates
to the query interval.  If the query interval refers to a sequence on
a specific chromosome, the only meaningful sense in which a genetic
interval can contain it is when the "known gene" is located on the
same chromosome.

regards, mathtalk-ga

Important Disclaimer: Answers and comments provided on Google Answers are general information, and are not intended to substitute for informed professional medical, psychiatric, psychological, tax, legal, investment, accounting, or other professional advice. Google does not endorse, and expressly disclaims liability for any product, manufacturer, distributor, service or service provider mentioned or any opinion expressed in answers or comments. Please read carefully the Google Answers Terms of Service.

If you feel that you have found inappropriate content, please let us know by emailing us at answers-support@google.com with the question ID listed above. Thank you.
Search Google Answers for
Google Answers  


Google Home - Answers FAQ - Terms of Service - Privacy Policy