First dibs on this question must go to Mathtalk, the author of the
comments that go me this far. I would understand if they are not
interested, but I do feel like I have to compensate him or her
somehow.
I need a query to build a map of gene collisions.
>>> I am using Oracle9i and SQLplus
The "known" genes in the genome are listed in table knownGene. For each gene (row)
there is a genomic interval or locus with coordinates (chrN, Start position, end
position). For each chromosome (1-22, X, Y) there is a table of entities called
chains. These each have two intervals, target and query. A chain is a repeat.
The plan is to make a list for each known gene. The list is named by its target
gene, and populated by the names and descriptions of all the genes associated via
the chains. A chain is a link, or edge, from one region to another in the genome.
For each knowngene, all the chainSelf tables must be scanned for chains whose
target coordinates indicate they would fall into the genes genomic interval. Each
genes interval will have to be expanded by 1.5, but that can wait. With a gene's
interval in mind, a list of chains that are enclosed by or that overlap it can be
made. Each of these chains has a query locus associated with it. QNAME is the
chromosome, QSTART, QEND define the interval. For each query interval, table
knownGene is searched and every overlapping knowngene returned. These are the
known genes that are linked to or "chain to" the initial target gene for which the
list is named. They constitute the first of two columns in this is gene's list.
The second column is populated by taking the name in knownGene and scanning table
kgXref (were name is equivilent to kgID) to return all teh descriptions of the
genes whose names we collected for this one target gene. I am not particularly
clear on what the best final output is. Suffice to say I would want to sort the
lists by size and see what genes chain to the greatest number of other genes, and
then make a map of the network created by these associations using something like
Cytoscape to show that the distribution of links follows a power law.
>>>Here are the descriptions of the 3 table types:
SQL> describe knownGene
Name Null? Type
----------------------------------------- -------- ----------------------------
NAME NOT NULL VARCHAR2(255)
CHROM NOT NULL VARCHAR2(255)
STRAND NOT NULL CHAR(1)
TXSTART NOT NULL NUMBER(10)
TXEND NOT NULL NUMBER(10)
CDSSTART NOT NULL NUMBER(10)
CDSEND NOT NULL NUMBER(10)
EXONCOUNT NOT NULL NUMBER(10)
PROTEINID NOT NULL VARCHAR2(40)
ALIGNID NOT NULL VARCHAR2(8)
>knownGene looks like this, but with no headers. Values abbreviated to
fit page here.
>Table knownGene has 43232 rows total.
#name chrom strand txStart txEnd cdsStart cdsEnd exonCnt proteinID alignID
BC0634 chr1 - 4224 7502 4558 7173 2 AAH63682 24255
AL1373 chr1 - 4266 1936 7413 19346 2 Q9NSV7 41739
BC0544 chr1 - 4268 5808 4558 5808 1 Q96BN3 32430
SQL> describe chr10_chainSelf (there is one table for each chromosome 1-22, X, Y)
Name Null? Type
----------------------------------------- -------- ----------------------------
BIN NOT NULL NUMBER(5)
SCORE NOT NULL FLOAT(126)
TNAME NOT NULL VARCHAR2(255)
TSIZE NOT NULL NUMBER(10)
TSTART NOT NULL NUMBER(10)
TEND NOT NULL NUMBER(10)
QNAME NOT NULL VARCHAR2(255)
QSIZE NOT NULL NUMBER(10)
QSTRAND NOT NULL CHAR(1)
QSTART NOT NULL NUMBER(10)
QEND NOT NULL NUMBER(10)
ID NOT NULL NUMBER(10)
>chr10_chainSelf looks like this with no headers.
#bin score tName tSize tStart tEnd qName qSize qStrand
qStart qEnd id
585 63520 chr10 135037 14000 16578 chr7 158545 -
96339 9634 204969
585 63670 chr10 135037 14000 16578 chr7 158545 +
62286 6228 204161
585 81668 chr10 135037 14000 17239 chr7 158545 +
62827 6283 133728
SQL> describe kgXref
Name Null? Type
----------------------------------------- -------- ----------------------------
KGID NOT NULL VARCHAR2(40)
MRNA VARCHAR2(40)
SPID VARCHAR2(40)
SPDISPLAYID VARCHAR2(40)
GENESYMBOL VARCHAR2(40)
REFSEQ VARCHAR2(40)
PROTACC VARCHAR2(40)
DESCRIPTION VARCHAR2(255)
>kgXref looks like this but with no headers. Values have been truncated.
>Table kgXref has 42026 rows total.
#kgID mRNA spID spDisplayID geneSymbol refseq protAcc
description
AB0000 AB0000 P56555 DSR4_HUMAN DSCR4 NM_0057 NP_0058
Down syndrome gene 4
AB0001 AB0001 Q99983 OMD_HUMAN OMD NM_0050 NP_0050
osteomodulin
AB0001 AB0001 Q99984 Q99984 C1orf29 NM_0068 NP_0068
mRNA expr in osteoblast
>>>this is the query that was given to me. It can only check through one of the
chainself tables instead of all 24 (chr1-22, X, Y). It returns the same results
over and over. I dont think it ever stops. I have been sending the results to a
spool file, and clearly the query doesnt do what I intended. I
definately need it to check all chainSelf tables.
/* query1.sql*/
select a.name, a.exoncount, d.description, c.name
from knowngene a,
chr1_chainSelf b,
knownGene c,
kgXref d
where a.txstart <= b.tStart
and a.txEnd >= b.tEnd
and b.qstart >= c.Txstart
and b.qEnd <= c.txend
and c.name = d.kgId
and a.chrom = 'chr1'
;
Any direction would be greatly appreciated, but the answer is going to
have to work. I can help develop it but it will essentially be the
text in the query file, and thus syntax specific. |
Request for Question Clarification by
mathtalk-ga
on
11 Nov 2004 05:27 PST
Hi, iterative-ga:
"For each knowngene, all the chainSelf tables must be scanned for
chains whose target coordinates indicate they would fall into the
genes genomic interval."
I think that your sample query illustrates that only the chainSelf
table that corresponds to the same chromosome to which a known gene is
assigned needs to be scanned (for that row in knownGene).
So for rows in knownGene with chrom = 'chr1', we need the
chr1_chainSelf table, but where a gene is located on a different
chromosome, we need the corresponding "chainSelf" table.
This basically means writing a different query for each chromosome (24
queries in all), although SQL has a keyword UNION that allows one to
combine queries with similar columns into one large result set.
Let's try and understand clearly what the existing query does, and
then you can perhaps comment on whether this is what you'd like to do
or not. The query involves joining two copies of knownGene, and one
copy each of chr1_chainSelf and kgXref. As we just discussed, the
fact that chr1_chainSelf is being used is determined by the chromosome
location of the "known gene".
The only use that is made in the query at hand for kgXref is to lookup
the description by matching:
knownGene.name = kgXref.kgID
It is the sort of "added information" that can be tacked on later
(because the knownGene.name values are also returned in the query), so
to simplify the discussion, I'm going to omit the kgXref table for the
moment.
Now I believe that what your query is meant to do is to start from
each row in knownGene which (for the sake of illustration) happens to
be on chromosome 1, and to build a list of "related" genes ALSO on the
same chromosome. If this is so, then there's a condition missing in
the sample query. It should restrict both "copies" of the knownGene
table to rows for chromosome 1 (or whichever chromosome we'd be
writing the query against).
Thus let's consider this:
/* rewrite of chr1 "collision" query */
select a.name, a.exoncount, c.name
from knownGene a,
chr1_chainSelf b,
knownGene c
where a.txStart <= b.tStart
and a.txEnd >= b.tEnd
and c.txStart <= b.qStart
and c.txend >= b.qEnd
and a.chrom = 'chr1'
and c.chrom = 'chr1';
Now this query, which rewrites the original in some minor ways apart
from adding the requirement that both knownGene rows "live" on
chromosome 1, produces a relationship between pairs of known genes.
As far as it goes, I think this query will probably do the right and
useful thing.
However I think that there is presumably then a second step you'll
want to take with these relations, and that is to organize them into
"chains" or lists that partition the chromosome. In mathematics we
might call this "taking the transitive closure", but the idea is
simply that if knownGene A is related to knownGene B, and knownGene B
is related to knownGene C, then we want ultimately to put all three
into a common "linked list" or other data structure that lumps them
together.
This is the impression I have so far about what you are trying to
accomplish. Please let me know if I'm getting warm.
regards, mathtalk-ga
|
Clarification of Question by
iterative-ga
on
12 Nov 2004 09:54 PST
First, I dont know what the problem is but I have been trying to post
this for a while and the clarification button did not exist.
Second, in answer to you getting warmer, sort of. Every known gene is related to
a number, maybe zero, of other genes, each by a chain, which has a
score. Imagine a table with a row for every known gene. The gene that
is linked by the highest scoring chain is put in column two. You are
right that for each gene you only have to look in one chr_NchainSelf
table, but the chains found in that table on that chromosom link to
genes on all chromosomes, c.chrom = 'chr1' is not generally true. The
problem with my model here is that it might require ~500 columns, but
most of the rows in these low scoring columns will be empty.
I have been told this is a stupid or impossible table.
|
Request for Question Clarification by
mathtalk-ga
on
12 Nov 2004 20:55 PST
Let me ask this. Suppose I pick two known genes, and I ask you what
the "score" is that relates them. Is the score the same regardless of
the ordering of the pair? Does finding the score between genes A and
B involve for the most part consideration of intermediate
relationships to other genes?
Still trying to grasp why one considers genes related in this fashion.
Originally you gave me the impression we were talking about
coordinates (start and end positions) within a sequence, but if that
were the case, it scarcely makes sense to confuse coordinates on one
chromosome with numerically similar values on a different chromosome.
regards, mathtalk-ga
|
Request for Question Clarification by
mathtalk-ga
on
13 Nov 2004 05:39 PST
Here's an alternative to building the table quite as you described it
(with many columns, most of which are empty on most records).
One instead defines a table in which the key is a pair of known genes,
and a third column gives the "rank" of their scoring.
This is the motivation behind my requests for clarification I'm asking
above. One part addresses whether the key is an ordered or unordered
pair of known genes; the others seek insight into how the score or
ranking would be found.
As far as the missing "Clarification" button goes, my guess would be
that the site didn't see you as logged in as yourself (only the user
who posts the Question can submit Clarifications). Rarely I find that
I have to logout, log back in to get recognized by the site.
regards, mathtalk-ga
|
Clarification of Question by
iterative-ga
on
14 Nov 2004 11:27 PST
yes, a two or three column table was described to me yesterday as more
sensible, like you describe. Links are birectional, the table only
indicates one degree of linking. Each link is a chain. Column one is
the target gene. The gene (row) where you say for every gene do this.
what you do is find all the chains that link you to other genes.
Column two is the gene you link to, column three is the score for that
link, given by the scoore for the chain that links you. Example:
TargetGene__QueryGene__Score
gene 1 gene 233 500
gene 1 gene 458 302
gene 1 gene 458 302
gene 1 gene 2 58
gene 2 gene 332 403
gene 2 gene 806 255
gene 2 gene 1 58 >> this is the reverse of row 4. There may
be several links between two genes, they all will be listed twice.
Directionality constraint can be induced later by strand information.
The problem was that gmail accounts are not allowed and I was logged in to mail.
|
Request for Question Clarification by
mathtalk-ga
on
14 Nov 2004 15:26 PST
Okay, my impression was that the query we looked at before gave "one
degree of linking". You can still create and populate a table that
will hold _all_ degrees of linking (the transitive closure business I
mentioned earlier).
Basically we can collect all the "one degree" links first, and
populate them as the corresponding pair of known genes with "rank"
column set = 1.
Then we proceed to insert new pairs of genes into the "linkable genes"
tables based on a rule that says:
If known gene A is linked to known gene B with rank = K, and known
gene B is linked to known gene C with rank = 1, then insert the
pairing A,C into the table with rank = K+1 unless it is already there
(with a smaller rank).
We would "iterate" over this rule until new gene pairs stop being
added. (One iteration with no new pairs means we're done.)
Efficiency requires some indexes be defined on the "linkable genes"
table, for example so that it's pretty fast to search it for gene
pairs and know whether they already exist or not.
I see from your last note that you contemplate storing the same gene
pair multiple times (if they can be linked together in more than one
way). Personally I would avoid this, as it will likely generate a
high proportion of duplicate records.
regards, mathtalk-ga
|
Clarification of Question by
iterative-ga
on
15 Nov 2004 06:31 PST
"However ... then a second step you'll want to take with these
relations, and that is to organize them into "chains" or lists that
partition the chromosome. In mathematics we might call this "taking
the transitive closure" ... we want ultimately to put all three into a
common "linked list" or other data structure that lumps them
together."
I do not follow what this looks like nor am I familiar with the point,
nor is its really the goal of this question. I want to make a table
wiht one degree of linking for every gene. if I want to make second
degree links, I can get them from that table. If your
conceptualization yeilds a better query to do this, that will be fine.
The query we are refering to above does not do this. In addition it
still has the incorrect assumption that target gene and query gene are
on the same chrom. You were hinting at the creation of a new table for
a new degree of linking, or a new column, either way I dont get it.
Show me the query to run adn I might get it.
|
Request for Question Clarification by
mathtalk-ga
on
16 Nov 2004 11:20 PST
Okay, let's back up and discuss where you find shortcomings in the existing query.
One point is to make the query use all the chr#_chainSelf tables. If
you are satisfied with inserting the results into a common table
(linked_genes), then we could simply modify the query1.sql to process
each of the other 23 "chainSelf" tables one at a time. If it's
important to have a single query that does everything, then we could
write this as a UNION ALL query.
However I recall your earlier statement: "It returns the same results
over and over. I dont think it ever stops."
Is the problem that when you run the query (once), it never stops? Or
are you simply saying that the query returns the same rows each time
it runs?
regards, mathtalk-ga
|
Clarification of Question by
iterative-ga
on
16 Nov 2004 12:17 PST
both problems. The first thing to understand is that this query does
not work. It was someones idea, it would be best to start from
scratch. One query, multiple queries, this is not really a concern. I
do want one table containing all the links, with score, between genes.
no entire row will be duplicated:
targetGene_queryGene_chainScore
There are 44000 genes, 20 genes on average linked to each, less than a million rows.
|
Request for Question Clarification by
mathtalk-ga
on
17 Nov 2004 05:07 PST
Hi, iterative-ga:
Regarding "both problems", I have two suggestions:
1)
I suggest we fix a row in the "a" table and do the query to see
whether the results that come back are the appropriate subset of the
ones you want, ie. all the linked genes from chromosome 1. Comparing
your description of the "overlapping" genomic intervals/coordinates, I
see that there's some room here for the inequality logic to be wrong.
So, if there are some rows that don't belong, then we can address that
issue sooner than later:
/* query1.sql revised again */
select a.name, c.name, b.score
from knowngene a,
chr1_chainSelf b,
knownGene c
where a.txStart <= b.tStart
and a.txEnd >= b.tEnd
and b.qStart >= c.txStart
and b.qEnd <= c.txEnd
and c.name = d.kgId
and a.chrom = 'chr1'
and a.name = 'BC0634'
;
2)
I think the reason the query runs very slowly is because it is doing
the joins on many tables without having a good set of indexes defined
to help. If you can confirm that indexes have not yet been created,
I'll post some SQL to create the ones that jump out at me. If you
have created some indexes, please post what they are, and I can review
the list for additions.
regards, mathtalk-ga
|
Clarification of Question by
iterative-ga
on
17 Nov 2004 07:56 PST
/* query6.sql from mathtalk
must change knowngene to knowngenex for exon coordinate absence
must change a.name to AK000538, which has one chr1 chain to SHARP
must take out kgid reference for now
*/
select a.name, c.name, b.score
from knowngenex a,
chr1_chainSelf b,
knownGenex c
where a.txStart <= b.tStart
and a.txEnd >= b.tEnd
and b.qStart >= c.txStart
and b.qEnd <= c.txEnd
and a.chrom = 'chr1'
and a.name = 'AK000538'
;
&&& This query is simply hanging now for 30 minutes, shoud return one line.
&&& I thought indexes would show up in DESCRIBE knownGene, etc above,
so they do not exist I guess.
&&& I get the feeling also that the inequality logic is wrong, as
though its not comparing one chain to fufill the position requirements
but multiple. If this would be a good time to incorporate the desire
to expand a genes coordinates by 1.5x with the same center point,
please keep it in mind.
&&& I think we are getting warmer.
|
Request for Question Clarification by
mathtalk-ga
on
17 Nov 2004 08:58 PST
As a quick confirmation that we are not missing something fundamental:
select a.name
from knowngenex a
and a.chrom = 'chr1'
and a.name = 'AK000538'
;
should quickly return one row. Then, let's test the speed of this:
select a.name, b.score
from knowngenex a,
chr1_chainSelf b
where a.txStart <= b.tStart
and a.txEnd >= b.tEnd
and a.chrom = 'chr1'
and a.name = 'AK000538'
;
I think we want to created indexes on knowngenex.name and chr1_chainSelf.tStart.
The inequality logic here asks that the "target" interval
(b.tStart,btEnd) is entirely contained within the "genomic" interval
(a.txStart,txEnd). I agree that now would be a good time to think
about the additional 1.5 factor. First let's confirm that invariably:
txStart <= txEnd in the knowngenex table
tStart <= tEnd in the chr1_chainSelf table
qStart <= qEnd in the chr1_chainSelf table
This may seem "obvious" but as you may have found, sometimes the
endpoints are reversed for "opposite" strand orientation. I'd just do
a few queries to be sure what we should expect:
select count(*) from knowngenex where txStart > txEnd;
select count(*) from chr1_chainSelf where tStart > tEnd;
select count(*) from chr1_chainSelf where qStart > qEnd;
regards, mathtalk-ga
|
Clarification of Question by
iterative-ga
on
17 Nov 2004 14:12 PST
yes quick one line return
yes each of those counts is equal to zero
chain ID someone refered to as the index because it was unique, but I dont know more
|
Request for Question Clarification by
mathtalk-ga
on
19 Nov 2004 05:54 PST
Hi, iterative-ga:
See my Comment at bottom about creating a couple of indexes on tables.
regards, mathtalk-ga
|
Clarification of Question by
iterative-ga
on
19 Nov 2004 09:29 PST
&&&yes, you understand the concept
&&&yes, genetic intervals to "contain" the query interval are also
subject to 50% enlargement, this may look sticky later because one
chain's query interval may hit two genes. Fine, count them all with
that chain score.
&&&No, you are right, we cannot ignore chromosomal location when
comparing chain query interval to the knownGene table in the final
step. Coordinates include chromosome name and postion.
&&& knowngenex is the table used because we could not get the longblob
exon coordinates into an sql table. This index is created. The
chromosome 10 index is created, but upon repeating for other
chrX_chainSelf files I get
CREATE INDEX tStart_idx ON chr1_chainSelf ( tStart )
*
ERROR at line 1:
ORA-00955: name is already used by an existing object
|
Request for Question Clarification by
mathtalk-ga
on
19 Nov 2004 10:32 PST
Ah, I see we need to make the index names distinct for each of the
chr#_chainSelf tables. Something along these lines:
DROP INDEX tStart_idx;
CREATE INDEX chr10_tStart_idx ON chr10_chainSelf ( tStart );
Also, thanks for the clarification about the chromosome locations and
intervals. Given that, I'll amend my recommendation for the index on
knownGenex to:
DROP INDEX txStart_idx;
CREATE INDEX chrom_txStart_idx ON knownGenex ( chrom, txStart );
* * * * * * * * * * * * * *
When a genetic interval txStart/txEnd is to be increased by 1.5 times,
it would be my intuition that this means to lower txStart by 1/4th of
(txEnd - txStart), and to symmetrically increase txEnd by the same
amount. Is that your sense as well?
regards, mathtalk-ga
|
Request for Question Clarification by
mathtalk-ga
on
23 Nov 2004 06:06 PST
My thinking right now is to create the links between known genes via
chain-self target and query intervals in three steps:
1. Select known genes & "chain IDs" where the known gene's (expanded)
genomic interval contains the chainSelf target interval
2. Select known genes & "chain IDs" where the known gene's (expanded)
genomic interval contains the chainSelf query interval
3. Join the results of 1. and 2. based on equality of chain IDs,
excluding if necessary any rows where the known genes are the same.
So as an experiment, I'll post the SQL to create a couple of
intermediate tables to hold the results of 1. and 2. Then we can put
an index on these tables for chain ID, to facilitate joining them
(3.).
regards, mathtalk-ga
|
Clarification of Question by
iterative-ga
on
30 Nov 2004 08:51 PST
OK, I think that the indexes are created. Im not sure what to clrify
here. What you describe sounds good, but seems to include more steps
than needed. I dont know why a table must be created, I thought we
were already doing joins in the where statement. I just want the table
of one degree links though, so any way is fine.
|
Request for Question Clarification by
mathtalk-ga
on
30 Nov 2004 08:57 PST
I hope we will save time overall because of the computation of the
enlarged intervals -- doing it once rather than over and over.
regards, mathtalk-ga
|
Clarification of Question by
iterative-ga
on
30 Nov 2004 10:00 PST
bene. Then my clarification is sounds very good. I will keep reading
about joins and formatting outputs
|