First dibs on this question must go to Mathtalk, the author of the
comments that go me this far. I would understand if they are not
interested, but I do feel like I have to compensate him or her
somehow.
I need a query to build a map of gene collisions.
>>> I am using Oracle9i and SQLplus
The "known" genes in the genome are listed in table knownGene. For each gene (row)
there is a genomic interval or locus with coordinates (chrN, Start position, end
position). For each chromosome (1-22, X, Y) there is a table of entities called
chains. These each have two intervals, target and query. A chain is a repeat.
The plan is to make a list for each known gene. The list is named by its target
gene, and populated by the names and descriptions of all the genes associated via
the chains. A chain is a link, or edge, from one region to another in the genome.
For each knowngene, all the chainSelf tables must be scanned for chains whose
target coordinates indicate they would fall into the genes genomic interval. Each
genes interval will have to be expanded by 1.5, but that can wait. With a gene's
interval in mind, a list of chains that are enclosed by or that overlap it can be
made. Each of these chains has a query locus associated with it. QNAME is the
chromosome, QSTART, QEND define the interval. For each query interval, table
knownGene is searched and every overlapping knowngene returned. These are the
known genes that are linked to or "chain to" the initial target gene for which the
list is named. They constitute the first of two columns in this is gene's list.
The second column is populated by taking the name in knownGene and scanning table
kgXref (were name is equivilent to kgID) to return all teh descriptions of the
genes whose names we collected for this one target gene. I am not particularly
clear on what the best final output is. Suffice to say I would want to sort the
lists by size and see what genes chain to the greatest number of other genes, and
then make a map of the network created by these associations using something like
Cytoscape to show that the distribution of links follows a power law.
>>>Here are the descriptions of the 3 table types:
SQL> describe knownGene
Name Null? Type
----------------------------------------- -------- ----------------------------
NAME NOT NULL VARCHAR2(255)
CHROM NOT NULL VARCHAR2(255)
STRAND NOT NULL CHAR(1)
TXSTART NOT NULL NUMBER(10)
TXEND NOT NULL NUMBER(10)
CDSSTART NOT NULL NUMBER(10)
CDSEND NOT NULL NUMBER(10)
EXONCOUNT NOT NULL NUMBER(10)
PROTEINID NOT NULL VARCHAR2(40)
ALIGNID NOT NULL VARCHAR2(8)
>knownGene looks like this, but with no headers. Values abbreviated to
fit page here.
>Table knownGene has 43232 rows total.
#name chrom strand txStart txEnd cdsStart cdsEnd exonCnt proteinID alignID
BC0634 chr1 - 4224 7502 4558 7173 2 AAH63682 24255
AL1373 chr1 - 4266 1936 7413 19346 2 Q9NSV7 41739
BC0544 chr1 - 4268 5808 4558 5808 1 Q96BN3 32430
SQL> describe chr10_chainSelf (there is one table for each chromosome 1-22, X, Y)
Name Null? Type
----------------------------------------- -------- ----------------------------
BIN NOT NULL NUMBER(5)
SCORE NOT NULL FLOAT(126)
TNAME NOT NULL VARCHAR2(255)
TSIZE NOT NULL NUMBER(10)
TSTART NOT NULL NUMBER(10)
TEND NOT NULL NUMBER(10)
QNAME NOT NULL VARCHAR2(255)
QSIZE NOT NULL NUMBER(10)
QSTRAND NOT NULL CHAR(1)
QSTART NOT NULL NUMBER(10)
QEND NOT NULL NUMBER(10)
ID NOT NULL NUMBER(10)
>chr10_chainSelf looks like this with no headers.
#bin score tName tSize tStart tEnd qName qSize qStrand
qStart qEnd id
585 63520 chr10 135037 14000 16578 chr7 158545 -
96339 9634 204969
585 63670 chr10 135037 14000 16578 chr7 158545 +
62286 6228 204161
585 81668 chr10 135037 14000 17239 chr7 158545 +
62827 6283 133728
SQL> describe kgXref
Name Null? Type
----------------------------------------- -------- ----------------------------
KGID NOT NULL VARCHAR2(40)
MRNA VARCHAR2(40)
SPID VARCHAR2(40)
SPDISPLAYID VARCHAR2(40)
GENESYMBOL VARCHAR2(40)
REFSEQ VARCHAR2(40)
PROTACC VARCHAR2(40)
DESCRIPTION VARCHAR2(255)
>kgXref looks like this but with no headers. Values have been truncated.
>Table kgXref has 42026 rows total.
#kgID mRNA spID spDisplayID geneSymbol refseq protAcc
description
AB0000 AB0000 P56555 DSR4_HUMAN DSCR4 NM_0057 NP_0058
Down syndrome gene 4
AB0001 AB0001 Q99983 OMD_HUMAN OMD NM_0050 NP_0050
osteomodulin
AB0001 AB0001 Q99984 Q99984 C1orf29 NM_0068 NP_0068
mRNA expr in osteoblast
>>>this is the query that was given to me. It can only check through one of the
chainself tables instead of all 24 (chr1-22, X, Y). It returns the same results
over and over. I dont think it ever stops. I have been sending the results to a
spool file, and clearly the query doesnt do what I intended. I
definately need it to check all chainSelf tables.
/* query1.sql*/
select a.name, a.exoncount, d.description, c.name
from knowngene a,
chr1_chainSelf b,
knownGene c,
kgXref d
where a.txstart <= b.tStart
and a.txEnd >= b.tEnd
and b.qstart >= c.Txstart
and b.qEnd <= c.txend
and c.name = d.kgId
and a.chrom = 'chr1'
;
Any direction would be greatly appreciated, but the answer is going to
have to work. I can help develop it but it will essentially be the
text in the query file, and thus syntax specific. |
Request for Question Clarification by
mathtalk-ga
on
11 Nov 2004 05:27 PST
Hi, iterative-ga:
"For each knowngene, all the chainSelf tables must be scanned for
chains whose target coordinates indicate they would fall into the
genes genomic interval."
I think that your sample query illustrates that only the chainSelf
table that corresponds to the same chromosome to which a known gene is
assigned needs to be scanned (for that row in knownGene).
So for rows in knownGene with chrom = 'chr1', we need the
chr1_chainSelf table, but where a gene is located on a different
chromosome, we need the corresponding "chainSelf" table.
This basically means writing a different query for each chromosome (24
queries in all), although SQL has a keyword UNION that allows one to
combine queries with similar columns into one large result set.
Let's try and understand clearly what the existing query does, and
then you can perhaps comment on whether this is what you'd like to do
or not. The query involves joining two copies of knownGene, and one
copy each of chr1_chainSelf and kgXref. As we just discussed, the
fact that chr1_chainSelf is being used is determined by the chromosome
location of the "known gene".
The only use that is made in the query at hand for kgXref is to lookup
the description by matching:
knownGene.name = kgXref.kgID
It is the sort of "added information" that can be tacked on later
(because the knownGene.name values are also returned in the query), so
to simplify the discussion, I'm going to omit the kgXref table for the
moment.
Now I believe that what your query is meant to do is to start from
each row in knownGene which (for the sake of illustration) happens to
be on chromosome 1, and to build a list of "related" genes ALSO on the
same chromosome. If this is so, then there's a condition missing in
the sample query. It should restrict both "copies" of the knownGene
table to rows for chromosome 1 (or whichever chromosome we'd be
writing the query against).
Thus let's consider this:
/* rewrite of chr1 "collision" query */
select a.name, a.exoncount, c.name
from knownGene a,
chr1_chainSelf b,
knownGene c
where a.txStart <= b.tStart
and a.txEnd >= b.tEnd
and c.txStart <= b.qStart
and c.txend >= b.qEnd
and a.chrom = 'chr1'
and c.chrom = 'chr1';
Now this query, which rewrites the original in some minor ways apart
from adding the requirement that both knownGene rows "live" on
chromosome 1, produces a relationship between pairs of known genes.
As far as it goes, I think this query will probably do the right and
useful thing.
However I think that there is presumably then a second step you'll
want to take with these relations, and that is to organize them into
"chains" or lists that partition the chromosome. In mathematics we
might call this "taking the transitive closure", but the idea is
simply that if knownGene A is related to knownGene B, and knownGene B
is related to knownGene C, then we want ultimately to put all three
into a common "linked list" or other data structure that lumps them
together.
This is the impression I have so far about what you are trying to
accomplish. Please let me know if I'm getting warm.
regards, mathtalk-ga
|
Clarification of Question by
iterative-ga
on
12 Nov 2004 09:54 PST
First, I dont know what the problem is but I have been trying to post
this for a while and the clarification button did not exist.
Second, in answer to you getting warmer, sort of. Every known gene is related to
a number, maybe zero, of other genes, each by a chain, which has a
score. Imagine a table with a row for every known gene. The gene that
is linked by the highest scoring chain is put in column two. You are
right that for each gene you only have to look in one chr_NchainSelf
table, but the chains found in that table on that chromosom link to
genes on all chromosomes, c.chrom = 'chr1' is not generally true. The
problem with my model here is that it might require ~500 columns, but
most of the rows in these low scoring columns will be empty.
I have been told this is a stupid or impossible table.
|
Request for Question Clarification by
mathtalk-ga
on
12 Nov 2004 20:55 PST
Let me ask this. Suppose I pick two known genes, and I ask you what
the "score" is that relates them. Is the score the same regardless of
the ordering of the pair? Does finding the score between genes A and
B involve for the most part consideration of intermediate
relationships to other genes?
Still trying to grasp why one considers genes related in this fashion.
Originally you gave me the impression we were talking about
coordinates (start and end positions) within a sequence, but if that
were the case, it scarcely makes sense to confuse coordinates on one
chromosome with numerically similar values on a different chromosome.
regards, mathtalk-ga
|
Request for Question Clarification by
mathtalk-ga
on
13 Nov 2004 05:39 PST
Here's an alternative to building the table quite as you described it
(with many columns, most of which are empty on most records).
One instead defines a table in which the key is a pair of known genes,
and a third column gives the "rank" of their scoring.
This is the motivation behind my requests for clarification I'm asking
above. One part addresses whether the key is an ordered or unordered
pair of known genes; the others seek insight into how the score or
ranking would be found.
As far as the missing "Clarification" button goes, my guess would be
that the site didn't see you as logged in as yourself (only the user
who posts the Question can submit Clarifications). Rarely I find that
I have to logout, log back in to get recognized by the site.
regards, mathtalk-ga
|
Clarification of Question by
iterative-ga
on
14 Nov 2004 11:27 PST
yes, a two or three column table was described to me yesterday as more
sensible, like you describe. Links are birectional, the table only
indicates one degree of linking. Each link is a chain. Column one is
the target gene. The gene (row) where you say for every gene do this.
what you do is find all the chains that link you to other genes.
Column two is the gene you link to, column three is the score for that
link, given by the scoore for the chain that links you. Example:
TargetGene__QueryGene__Score
gene 1 gene 233 500
gene 1 gene 458 302
gene 1 gene 458 302
gene 1 gene 2 58
gene 2 gene 332 403
gene 2 gene 806 255
gene 2 gene 1 58 >> this is the reverse of row 4. There may
be several links between two genes, they all will be listed twice.
Directionality constraint can be induced later by strand information.
The problem was that gmail accounts are not allowed and I was logged in to mail.
|
Request for Question Clarification by
mathtalk-ga
on
14 Nov 2004 15:26 PST
Okay, my impression was that the query we looked at before gave "one
degree of linking". You can still create and populate a table that
will hold _all_ degrees of linking (the transitive closure business I
mentioned earlier).
Basically we can collect all the "one degree" links first, and
populate them as the corresponding pair of known genes with "rank"
column set = 1.
Then we proceed to insert new pairs of genes into the "linkable genes"
tables based on a rule that says:
If known gene A is linked to known gene B with rank = K, and known
gene B is linked to known gene C with rank = 1, then insert the
pairing A,C into the table with rank = K+1 unless it is already there
(with a smaller rank).
We would "iterate" over this rule until new gene pairs stop being
added. (One iteration with no new pairs means we're done.)
Efficiency requires some indexes be defined on the "linkable genes"
table, for example so that it's pretty fast to search it for gene
pairs and know whether they already exist or not.
I see from your last note that you contemplate storing the same gene
pair multiple times (if they can be linked together in more than one
way). Personally I would avoid this, as it will likely generate a
high proportion of duplicate records.
regards, mathtalk-ga
|
Clarification of Question by
iterative-ga
on
15 Nov 2004 06:31 PST
"However ... then a second step you'll want to take with these
relations, and that is to organize them into "chains" or lists that
partition the chromosome. In mathematics we might call this "taking
the transitive closure" ... we want ultimately to put all three into a
common "linked list" or other data structure that lumps them
together."
I do not follow what this looks like nor am I familiar with the point,
nor is its really the goal of this question. I want to make a table
wiht one degree of linking for every gene. if I want to make second
degree links, I can get them from that table. If your
conceptualization yeilds a better query to do this, that will be fine.
The query we are refering to above does not do this. In addition it
still has the incorrect assumption that target gene and query gene are
on the same chrom. You were hinting at the creation of a new table for
a new degree of linking, or a new column, either way I dont get it.
Show me the query to run adn I might get it.
|
Request for Question Clarification by
mathtalk-ga
on
16 Nov 2004 11:20 PST
Okay, let's back up and discuss where you find shortcomings in the existing query.
One point is to make the query use all the chr#_chainSelf tables. If
you are satisfied with inserting the results into a common table
(linked_genes), then we could simply modify the query1.sql to process
each of the other 23 "chainSelf" tables one at a time. If it's
important to have a single query that does everything, then we could
write this as a UNION ALL query.
However I recall your earlier statement: "It returns the same results
over and over. I dont think it ever stops."
Is the problem that when you run the query (once), it never stops? Or
are you simply saying that the query returns the same rows each time
it runs?
regards, mathtalk-ga
|
Clarification of Question by
iterative-ga
on
16 Nov 2004 12:17 PST
both problems. The first thing to understand is that this query does
not work. It was someones idea, it would be best to start from
scratch. One query, multiple queries, this is not really a concern. I
do want one table containing all the links, with score, between genes.
no entire row will be duplicated:
targetGene_queryGene_chainScore
There are 44000 genes, 20 genes on average linked to each, less than a million rows.
|
Request for Question Clarification by
mathtalk-ga
on
17 Nov 2004 05:07 PST
Hi, iterative-ga:
Regarding "both problems", I have two suggestions:
1)
I suggest we fix a row in the "a" table and do the query to see
whether the results that come back are the appropriate subset of the
ones you want, ie. all the linked genes from chromosome 1. Comparing
your description of the "overlapping" genomic intervals/coordinates, I
see that there's some room here for the inequality logic to be wrong.
So, if there are some rows that don't belong, then we can address that
issue sooner than later:
/* query1.sql revised again */
select a.name, c.name, b.score
from knowngene a,
chr1_chainSelf b,
knownGene c
where a.txStart <= b.tStart
and a.txEnd >= b.tEnd
and b.qStart >= c.txStart
and b.qEnd <= c.txEnd
and c.name = d.kgId
and a.chrom = 'chr1'
and a.name = 'BC0634'
;
2)
I think the reason the query runs very slowly is because it is doing
the joins on many tables without having a good set of indexes defined
to help. If you can confirm that indexes have not yet been created,
I'll post some SQL to create the ones that jump out at me. If you
have created some indexes, please post what they are, and I can review
the list for additions.
regards, mathtalk-ga
|
Clarification of Question by
iterative-ga
on
17 Nov 2004 07:56 PST
/* query6.sql from mathtalk
must change knowngene to knowngenex for exon coordinate absence
must change a.name to AK000538, which has one chr1 chain to SHARP
must take out kgid reference for now
*/
select a.name, c.name, b.score
from knowngenex a,
chr1_chainSelf b,
knownGenex c
where a.txStart <= b.tStart
and a.txEnd >= b.tEnd
and b.qStart >= c.txStart
and b.qEnd <= c.txEnd
and a.chrom = 'chr1'
and a.name = 'AK000538'
;
&&& This query is simply hanging now for 30 minutes, shoud return one line.
&&& I thought indexes would show up in DESCRIBE knownGene, etc above,
so they do not exist I guess.
&&& I get the feeling also that the inequality logic is wrong, as
though its not comparing one chain to fufill the position requirements
but multiple. If this would be a good time to incorporate the desire
to expand a genes coordinates by 1.5x with the same center point,
please keep it in mind.
&&& I think we are getting warmer.
|
Request for Question Clarification by
mathtalk-ga
on
17 Nov 2004 08:58 PST
As a quick confirmation that we are not missing something fundamental:
select a.name
from knowngenex a
and a.chrom = 'chr1'
and a.name = 'AK000538'
;
should quickly return one row. Then, let's test the speed of this:
select a.name, b.score
from knowngenex a,
chr1_chainSelf b
where a.txStart <= b.tStart
and a.txEnd >= b.tEnd
and a.chrom = 'chr1'
and a.name = 'AK000538'
;
I think we want to created indexes on knowngenex.name and chr1_chainSelf.tStart.
The inequality logic here asks that the "target" interval
(b.tStart,btEnd) is entirely contained within the "genomic" interval
(a.txStart,txEnd). I agree that now would be a good time to think
about the additional 1.5 factor. First let's confirm that invariably:
txStart <= txEnd in the knowngenex table
tStart <= tEnd in the chr1_chainSelf table
qStart <= qEnd in the chr1_chainSelf table
This may seem "obvious" but as you may have found, sometimes the
endpoints are reversed for "opposite" strand orientation. I'd just do
a few queries to be sure what we should expect:
select count(*) from knowngenex where txStart > txEnd;
select count(*) from chr1_chainSelf where tStart > tEnd;
select count(*) from chr1_chainSelf where qStart > qEnd;
regards, mathtalk-ga
|
Clarification of Question by
iterative-ga
on
17 Nov 2004 14:12 PST
yes quick one line return
yes each of those counts is equal to zero
chain ID someone refered to as the index because it was unique, but I dont know more
|
Request for Question Clarification by
mathtalk-ga
on
19 Nov 2004 05:54 PST
Hi, iterative-ga:
See my Comment at bottom about creating a couple of indexes on tables.
regards, mathtalk-ga
|
Clarification of Question by
iterative-ga
on
19 Nov 2004 09:29 PST
&&&yes, you understand the concept
&&&yes, genetic intervals to "contain" the query interval are also
subject to 50% enlargement, this may look sticky later because one
chain's query interval may hit two genes. Fine, count them all with
that chain score.
&&&No, you are right, we cannot ignore chromosomal location when
comparing chain query interval to the knownGene table in the final
step. Coordinates include chromosome name and postion.
&&& knowngenex is the table used because we could not get the longblob
exon coordinates into an sql table. This index is created. The
chromosome 10 index is created, but upon repeating for other
chrX_chainSelf files I get
CREATE INDEX tStart_idx ON chr1_chainSelf ( tStart )
*
ERROR at line 1:
ORA-00955: name is already used by an existing object
|
Request for Question Clarification by
mathtalk-ga
on
19 Nov 2004 10:32 PST
Ah, I see we need to make the index names distinct for each of the
chr#_chainSelf tables. Something along these lines:
DROP INDEX tStart_idx;
CREATE INDEX chr10_tStart_idx ON chr10_chainSelf ( tStart );
Also, thanks for the clarification about the chromosome locations and
intervals. Given that, I'll amend my recommendation for the index on
knownGenex to:
DROP INDEX txStart_idx;
CREATE INDEX chrom_txStart_idx ON knownGenex ( chrom, txStart );
* * * * * * * * * * * * * *
When a genetic interval txStart/txEnd is to be increased by 1.5 times,
it would be my intuition that this means to lower txStart by 1/4th of
(txEnd - txStart), and to symmetrically increase txEnd by the same
amount. Is that your sense as well?
regards, mathtalk-ga
|
Request for Question Clarification by
mathtalk-ga
on
23 Nov 2004 06:06 PST
My thinking right now is to create the links between known genes via
chain-self target and query intervals in three steps:
1. Select known genes & "chain IDs" where the known gene's (expanded)
genomic interval contains the chainSelf target interval
2. Select known genes & "chain IDs" where the known gene's (expanded)
genomic interval contains the chainSelf query interval
3. Join the results of 1. and 2. based on equality of chain IDs,
excluding if necessary any rows where the known genes are the same.
So as an experiment, I'll post the SQL to create a couple of
intermediate tables to hold the results of 1. and 2. Then we can put
an index on these tables for chain ID, to facilitate joining them
(3.).
regards, mathtalk-ga
|
Clarification of Question by
iterative-ga
on
30 Nov 2004 08:51 PST
OK, I think that the indexes are created. Im not sure what to clrify
here. What you describe sounds good, but seems to include more steps
than needed. I dont know why a table must be created, I thought we
were already doing joins in the where statement. I just want the table
of one degree links though, so any way is fine.
|
Request for Question Clarification by
mathtalk-ga
on
30 Nov 2004 08:57 PST
I hope we will save time overall because of the computation of the
enlarged intervals -- doing it once rather than over and over.
regards, mathtalk-ga
|
Clarification of Question by
iterative-ga
on
30 Nov 2004 10:00 PST
bene. Then my clarification is sounds very good. I will keep reading
about joins and formatting outputs
|
Hi, iterative-ga:
Let's outline the steps to take, and then go into the details (see below).
1. Create a table to hold each known genes, its chromosome, and the
enlarged genomic intervals. Populate this table with an INSERT
statement based on selecting records from table knownGenex.
2a. Create a table to hold pairs of known genes and chainSelf IDs,
such that the start/end coordinates of the chainSelf target interval
are within the enlarged genomic interval. Populate this table with
several INSERT statements, one for each chromosome, joining the table
from 1. and the respect chromosome's chainSelf table.
2b. Create a table to hold pairs of known genes and chainSelf IDs,
such that the start/end coordinates of the chainSelf query interval
are within the enlarged genomic interval. Populate this table with
several INSERT statements, one for each chromosome, joining the table
from 1. and the respect chromosome's chainSelf table.
3. Create a table to hold pairs of known genes and the chainSelf IDs
that link them. Join the tables from 2a/2b by chromosome & chainSelf
IDs, to produce the required "links" among known genes.
It can be seen from this outline that the process can be done
chromosome by chromosome, although we will be using tables that pool
all the chromosomes' results together. But to prove the ideas, let's
run through the process for chromosome 1. It may not only find a bug
or two, but point out some indexing issues for performance.
Please run through these steps just on chromosome 1 and let me know
what the timings are like.
regards, mathtalk-ga
* * * * * * * * * * * *
Step 1
======
CREATE table knowngene_enlarge (
name varchar2(255) not null,
chrom varchar2(255) not null,
wxStart number(10) not null,
wxEnd number(10) not null
);
INSERT into knowngene_enlarge
SELECT name,
chrom,
txStart - (txEnd - txStart)/4,
txEnd + (txEnd - txStart)/4
from knowngenex;
/* Note: We will probably want to add an index here on
chrom, for the sake of the next steps.
*/
Step 2a
=======
CREATE table knowngene_target (
name varchar2(255) not null,
chrom varchar2(255) not null,
chain_id number(10) not null
);
/* for each chromosome, do like this: */
INSERT into knowngene_target
SELECT a.name,
a.chrom,
b.id
from knowngene_enlarge a, chr1_chainSelf b
where a.chrom = 'chr1'
and b.tStart between a.wxStart and a.wxEnd
and b.tEnd <= a.wxEnd;
/* Note: The hope is that the condition b.tStart <= a.wxEnd,
although redundant, will help speed up the query by
using an index on chrXX_chainSelf.tStart to limit
the join.
*/
Step 2b
=======
CREATE table knowngene_query (
name varchar2(255) not null,
chrom varchar2(255) not null,
chain_id number(10) not null,
score float(126) not null
);
/* for each chromosome, do like this: */
INSERT into knowngene_query
SELECT a.name,
a.chrom,
b.id,
b.score
from knowngene_enlarge a, chr1_chainSelf b
where a.chrom = 'chr1'
and b.qStart between a.wxStart and a.wxEnd
and b.qEnd <= a.wxEnd;
/* Note: Very similar to above except using query interval
instead of target interval, and also picking up
the score field for later convenience.
*/
Step 3
======
CREATE table knowngene_link (
name_t varchar2(255) not null,
name_q varchar2(255) not null,
chrom varchar2(255) not null,
chain_id number(10) not null,
score float(126) not null
);
INSERT into knowngene_link
SELECT a.name,
b.name,
b.chrom,
b.chain_id,
b.score
from knowngene_target a, knowngene_query b
where a.chrom = b.chrom
and a.chain_id = b.chain_id;
/* Note: To make this join efficient, we probably want to
create a composite index on chrom & chain_id for
knowngene_target and/or knowngene_query.
*/ |
Request for Answer Clarification by
iterative-ga
on
06 Dec 2004 12:04 PST
1 is fine, 2a appears to be ok but both statements in the second
comment are not clear to me. I do not intend to do anything explicitly
instructed because it could take serious time to figure out or debug.
Target table made and populated. 2b, query table, created but insert
hangs for 2 hours now, not for any reason I can see. Ideas?
|
Clarification of Answer by
mathtalk-ga
on
06 Dec 2004 12:13 PST
Hi, iterative-ga:
Both my comments in Step 2 were intended as explanatory (apparently
clear as mud!) and not as instructive. I was just trying to make a
remark on why the queries were written the way they were.
If 2a.'s insert went reasonably quickly but 2b.'s is running very
slowly, then I suspect we need to add an index to chr1_chainSelf on
qStart.
Would you mind verifying if we have an index on tStart but not on
qStart for that table?
Thanks,
mathtalk-ga
|
Clarification of Answer by
mathtalk-ga
on
06 Dec 2004 12:18 PST
Now that I look back at our earlier exchanges of Clarifications, I see
we did indeed create indexes for the tStart columns in chrXX_chainSelf
tables, but none as far as I know on the qStart column.
Please imitate the SQL we used before, changing the index names, etc.
to reflect qStart rather than tStart as the indexed field.
regards, mathtalk-ga
|
Request for Answer Clarification by
iterative-ga
on
07 Dec 2004 08:09 PST
ok part two is fine, part 3 table created, insert hangs. Your comment
about the composite index seems to come into play. Are we supposed to
make another index to insert effectively into the link table?
|
Clarification of Answer by
mathtalk-ga
on
07 Dec 2004 10:08 PST
Yes, I think it would be sufficient to put the composite index on just
one of the tables from step 2, but for the sake of symmetry:
CREATE INDEX chrom_chain_t ON knowngene_target ( chrom, chain_id );
CREATE INDEX chrom_chain_q ON knowngene_query ( chrom, chain_id );
regards, mathtalk-ga
|
Request for Answer Clarification by
iterative-ga
on
11 Dec 2004 12:34 PST
the final insert statement excedes my tablespace. I talked to system
administrator and he said I was at 10gig and he gave me another 10gig.
I dont know if he has to reboot or something because it still excedes
and quits. I am looking fo commands to see how much of my quote is
free. He also said i should drop the last indexes and make them again
in tablespace INDEX01_TBSP, but I am struggling with the sybtax for
that as well. If you have guidance, it would be great but I will still
work to determine what is wrong. What I dont have though is an answer
that works yet. Thanks.
|
Clarification of Answer by
mathtalk-ga
on
11 Dec 2004 21:54 PST
Hi, iterative-ga:
I'm curious to find out a rough number for the records being inserted.
Let's check just the first chromosome, as I suspect the numbers of
links on different chromosomes will not vary by an order of magnitude:
SELECT count(*)
from knowngene_target a, knowngene_query b
where a.chrom = b.chrom
and a.chain_id = b.chain_id
and a.chrom = 'chr1';
At one point you estimated ~ one million rows would be generated as
links, so it's a good time to check that assumption. This count will
potentially include some pairs of genes that are linked in more than
one way.
Since the rows we are creating for knowngene_link have only five
columns, it's not hard to see that we are overdoing it with how wide
we've made many of these columns. I used column widths that were
compatible with how you'd defined the columns in the source tables,
but surely 255 characters is overkill for the gene names (name_t and
name_q) almost as much as it is for the chromosome identifier.
How about 30 characters each for the two gene names, and 10 for
chromosome, and making field score be number(10) rather than
float(126). Judging by your sample data, all the score fields are
whole numbers anyway. These changes to the knowngene_link table will
reduce the size of rows by about an order of magnitude.
Here's the syntax to drop and recreate the indexes on knowngene_target
and knowngene_query in the specified tablespace your administrator
suggested:
DROP INDEX chrom_chain_t;
CREATE INDEX chrom_chain_t
ON knowngene_target ( chrom, chain_id )
TABLESPACE INDEX01_TBSP;
DROP INDEX chrom_chain_q;
CREATE INDEX chrom_chain_q
ON knowngene_query ( chrom, chain_id )
TABLESPACE INDEX01_TBSP;
As far as figuring out how much space a table is using, there is a
"system" table that holds some relevant information called DBA_TABLES.
Often the first step is to estimate it yourself by counting the
number of rows and adding up the size of one row, then multiplying.
Let me write up some on this topic with a clear mind in the morning.
regards, mathtalk-ga
|
Clarification of Answer by
mathtalk-ga
on
12 Dec 2004 12:51 PST
Hi, iterative-ga:
Well, in the clear light of "morning" I'm doubtful how much space
could be saved in the final table by redefining its columns. The
VARCHAR2 fields are of variable length, so it was incorrect for me to
say we'd save a lot of space by resizing them. In any case let's get
a count of how many rows are being inserted before we do anything
else.
The database administrator (DBA) is in a much better position to check
a user's quota than the user is, as Oracle puts many of the
tables/views/etc. under the dba permissions. However you should be
able to check how much free space you have remaining (per tablespace)
under your assigned quota by querying some views that are defined for
user permissions.
Here's a tip from "The Oracle (tm) Users' Co-Operative FAQ" written by
Norman Dunbar:
[How I determine how much free space a tablespace has left?]
http://www.jlcomp.demon.co.uk/faq/freespace.html
select ts.tablespace_name,
to_char(sum(nvl(fs.bytes,0))/1048576, '99,999,990.99') as MB_FREE,
count(*) as FRAGMENTS,
to_char(max(nvl(fs.bytes,0))/1048576, '99,999,990.99') as BIGGEST_BIT
from user_free_space fs, user_tablespaces ts
where fs.tablespace_name(+) = ts.tablespace_name
group by ts.tablespace_name;
On the other hand, for a quick summary of how much space has been
allocated (but not necessarily used; a detailed explanation would
involve discussing what Oracle does with "segments" and "extents"):
select segment_type, sum(bytes) from user_segments group by segment_type
which should report allocation totals by combined tables and combined indexes.
regards, mathtalk-ga
|
Request for Answer Clarification by
iterative-ga
on
15 Dec 2004 07:36 PST
SQL> select count(*) from knowngene_target; COUNT(*)-39535334
SQL> select count(*) from knowngene_query; COUNT(*)-29017023
SQL> SELECT count(*) from knowngene_target a, knowngene_query b where
a.chrom = b.chrom
and a.chain_id = b.chain_id and a.chrom = 'chr1'; ***took ~5hrs
hours: COUNT(*)-4995703
SQL> select segment_type, sum(bytes) from user_segments group by segment_type;
SEGMENT_TYPE SUM(BYTES)
------------------ ----------
INDEX 1383071744
TABLE 4865523712
>>other quota check from Norman Dunbar failed. DBA says I have 20
gigs. Turns out INDEX is INDX.
the last two nights someone has logged me off as the compound index
was being made. but I will continue with the plan, unless there seems
to be something off.
|
Clarification of Answer by
mathtalk-ga
on
15 Dec 2004 09:14 PST
Chromosome 1 is the largest (if I remember correctly) chromosome in
the human genome, but that may not translate into the largest number
of known genes (or links). But at least the 5,000,000 links reported
gives us a good ballpark to plan for, per chromosome.
So I'd rebuild the composite indexes, then proceed to do the final
inserts one chromosome at a time, keeping an eye on the amount of
space used. Let's check it after two or three chromosomes are
processed and try to extrapolate to the final size of the
knowngene_link table. We may need to analyze the storage factors
being used for the table to make sure they are reasonable. By default
the succeeeding allocations (extents) added for a table will be
increased by a certain percentage in size each time. While this
ordinarily helps to keep the number of extents to a minimum without
requiring much planning, it can be a problem if space starts getting
tight.
regards, mathtalk-ga
|
Request for Answer Clarification by
iterative-ga
on
20 Dec 2004 11:36 PST
about face.
I cant create any tables, like the last one I need, and the last index
creation disconnects me from oracle. We have one DBA for the building
and he is always "migrating" something or otherwise busy. Turns out I
am the only one using oracle, and I do not rate high on the pole, so
he cannot look into these issues.
Soooooooooooo,
do I change to postgres or mysql. This DB will be on a php webpage once it works.
Thanks for all your help, I guess this question is over for now.
|
Clarification of Answer by
mathtalk-ga
on
20 Dec 2004 12:03 PST
Ouch! Well, the SQL & indexes should work on either postgres or
MySQL. It seems like an odd situation, in that Oracle is a commercial
database management system and the situation is impelling you to
migrate to an open source DBMS.
best wishes, mathtalk-ga
|