Google Answers: SQLplus query needed

View Question

Q: SQLplus query needed ( Answered 5 out of 5 stars

Question

Subject: SQLplus query needed
Category: Computers > Algorithms
Asked by: iterative-ga
List Price: $100.00

Posted: 08 Nov 2004 12:57 PST
Expires: 08 Dec 2004 12:57 PST
Question ID: 426258

First dibs on this question must go to Mathtalk, the author of the comments that go me this far. I would understand if they are not interested, but I do feel like I have to compensate him or her somehow. I need a query to build a map of gene collisions. >>> I am using Oracle9i and SQLplus The "known" genes in the genome are listed in table knownGene. For each gene (row) there is a genomic interval or locus with coordinates (chrN, Start position, end position). For each chromosome (1-22, X, Y) there is a table of entities called chains. These each have two intervals, target and query. A chain is a repeat. The plan is to make a list for each known gene. The list is named by its target gene, and populated by the names and descriptions of all the genes associated via the chains. A chain is a link, or edge, from one region to another in the genome. For each knowngene, all the chainSelf tables must be scanned for chains whose target coordinates indicate they would fall into the genes genomic interval. Each genes interval will have to be expanded by 1.5, but that can wait. With a gene's interval in mind, a list of chains that are enclosed by or that overlap it can be made. Each of these chains has a query locus associated with it. QNAME is the chromosome, QSTART, QEND define the interval. For each query interval, table knownGene is searched and every overlapping knowngene returned. These are the known genes that are linked to or "chain to" the initial target gene for which the list is named. They constitute the first of two columns in this is gene's list. The second column is populated by taking the name in knownGene and scanning table kgXref (were name is equivilent to kgID) to return all teh descriptions of the genes whose names we collected for this one target gene. I am not particularly clear on what the best final output is. Suffice to say I would want to sort the lists by size and see what genes chain to the greatest number of other genes, and then make a map of the network created by these associations using something like Cytoscape to show that the distribution of links follows a power law. >>>Here are the descriptions of the 3 table types: SQL> describe knownGene Name Null? Type ----------------------------------------- -------- ---------------------------- NAME NOT NULL VARCHAR2(255) CHROM NOT NULL VARCHAR2(255) STRAND NOT NULL CHAR(1) TXSTART NOT NULL NUMBER(10) TXEND NOT NULL NUMBER(10) CDSSTART NOT NULL NUMBER(10) CDSEND NOT NULL NUMBER(10) EXONCOUNT NOT NULL NUMBER(10) PROTEINID NOT NULL VARCHAR2(40) ALIGNID NOT NULL VARCHAR2(8) >knownGene looks like this, but with no headers. Values abbreviated to fit page here. >Table knownGene has 43232 rows total. #name chrom strand txStart txEnd cdsStart cdsEnd exonCnt proteinID alignID BC0634 chr1 - 4224 7502 4558 7173 2 AAH63682 24255 AL1373 chr1 - 4266 1936 7413 19346 2 Q9NSV7 41739 BC0544 chr1 - 4268 5808 4558 5808 1 Q96BN3 32430 SQL> describe chr10_chainSelf (there is one table for each chromosome 1-22, X, Y) Name Null? Type ----------------------------------------- -------- ---------------------------- BIN NOT NULL NUMBER(5) SCORE NOT NULL FLOAT(126) TNAME NOT NULL VARCHAR2(255) TSIZE NOT NULL NUMBER(10) TSTART NOT NULL NUMBER(10) TEND NOT NULL NUMBER(10) QNAME NOT NULL VARCHAR2(255) QSIZE NOT NULL NUMBER(10) QSTRAND NOT NULL CHAR(1) QSTART NOT NULL NUMBER(10) QEND NOT NULL NUMBER(10) ID NOT NULL NUMBER(10) >chr10_chainSelf looks like this with no headers. #bin score tName tSize tStart tEnd qName qSize qStrand qStart qEnd id 585 63520 chr10 135037 14000 16578 chr7 158545 - 96339 9634 204969 585 63670 chr10 135037 14000 16578 chr7 158545 + 62286 6228 204161 585 81668 chr10 135037 14000 17239 chr7 158545 + 62827 6283 133728 SQL> describe kgXref Name Null? Type ----------------------------------------- -------- ---------------------------- KGID NOT NULL VARCHAR2(40) MRNA VARCHAR2(40) SPID VARCHAR2(40) SPDISPLAYID VARCHAR2(40) GENESYMBOL VARCHAR2(40) REFSEQ VARCHAR2(40) PROTACC VARCHAR2(40) DESCRIPTION VARCHAR2(255) >kgXref looks like this but with no headers. Values have been truncated. >Table kgXref has 42026 rows total. #kgID mRNA spID spDisplayID geneSymbol refseq protAcc description AB0000 AB0000 P56555 DSR4_HUMAN DSCR4 NM_0057 NP_0058 Down syndrome gene 4 AB0001 AB0001 Q99983 OMD_HUMAN OMD NM_0050 NP_0050 osteomodulin AB0001 AB0001 Q99984 Q99984 C1orf29 NM_0068 NP_0068 mRNA expr in osteoblast >>>this is the query that was given to me. It can only check through one of the chainself tables instead of all 24 (chr1-22, X, Y). It returns the same results over and over. I dont think it ever stops. I have been sending the results to a spool file, and clearly the query doesnt do what I intended. I definately need it to check all chainSelf tables. /* query1.sql*/ select a.name, a.exoncount, d.description, c.name from knowngene a, chr1_chainSelf b, knownGene c, kgXref d where a.txstart <= b.tStart and a.txEnd >= b.tEnd and b.qstart >= c.Txstart and b.qEnd <= c.txend and c.name = d.kgId and a.chrom = 'chr1' ; Any direction would be greatly appreciated, but the answer is going to have to work. I can help develop it but it will essentially be the text in the query file, and thus syntax specific.
Request for Question Clarification by mathtalk-ga on 11 Nov 2004 05:27 PST Hi, iterative-ga: "For each knowngene, all the chainSelf tables must be scanned for chains whose target coordinates indicate they would fall into the genes genomic interval." I think that your sample query illustrates that only the chainSelf table that corresponds to the same chromosome to which a known gene is assigned needs to be scanned (for that row in knownGene). So for rows in knownGene with chrom = 'chr1', we need the chr1_chainSelf table, but where a gene is located on a different chromosome, we need the corresponding "chainSelf" table. This basically means writing a different query for each chromosome (24 queries in all), although SQL has a keyword UNION that allows one to combine queries with similar columns into one large result set. Let's try and understand clearly what the existing query does, and then you can perhaps comment on whether this is what you'd like to do or not. The query involves joining two copies of knownGene, and one copy each of chr1_chainSelf and kgXref. As we just discussed, the fact that chr1_chainSelf is being used is determined by the chromosome location of the "known gene". The only use that is made in the query at hand for kgXref is to lookup the description by matching: knownGene.name = kgXref.kgID It is the sort of "added information" that can be tacked on later (because the knownGene.name values are also returned in the query), so to simplify the discussion, I'm going to omit the kgXref table for the moment. Now I believe that what your query is meant to do is to start from each row in knownGene which (for the sake of illustration) happens to be on chromosome 1, and to build a list of "related" genes ALSO on the same chromosome. If this is so, then there's a condition missing in the sample query. It should restrict both "copies" of the knownGene table to rows for chromosome 1 (or whichever chromosome we'd be writing the query against). Thus let's consider this: /* rewrite of chr1 "collision" query */ select a.name, a.exoncount, c.name from knownGene a, chr1_chainSelf b, knownGene c where a.txStart <= b.tStart and a.txEnd >= b.tEnd and c.txStart <= b.qStart and c.txend >= b.qEnd and a.chrom = 'chr1' and c.chrom = 'chr1'; Now this query, which rewrites the original in some minor ways apart from adding the requirement that both knownGene rows "live" on chromosome 1, produces a relationship between pairs of known genes. As far as it goes, I think this query will probably do the right and useful thing. However I think that there is presumably then a second step you'll want to take with these relations, and that is to organize them into "chains" or lists that partition the chromosome. In mathematics we might call this "taking the transitive closure", but the idea is simply that if knownGene A is related to knownGene B, and knownGene B is related to knownGene C, then we want ultimately to put all three into a common "linked list" or other data structure that lumps them together. This is the impression I have so far about what you are trying to accomplish. Please let me know if I'm getting warm. regards, mathtalk-ga
Clarification of Question by iterative-ga on 12 Nov 2004 09:54 PST First, I dont know what the problem is but I have been trying to post this for a while and the clarification button did not exist. Second, in answer to you getting warmer, sort of. Every known gene is related to a number, maybe zero, of other genes, each by a chain, which has a score. Imagine a table with a row for every known gene. The gene that is linked by the highest scoring chain is put in column two. You are right that for each gene you only have to look in one chr_NchainSelf table, but the chains found in that table on that chromosom link to genes on all chromosomes, c.chrom = 'chr1' is not generally true. The problem with my model here is that it might require ~500 columns, but most of the rows in these low scoring columns will be empty. I have been told this is a stupid or impossible table.
Request for Question Clarification by mathtalk-ga on 12 Nov 2004 20:55 PST Let me ask this. Suppose I pick two known genes, and I ask you what the "score" is that relates them. Is the score the same regardless of the ordering of the pair? Does finding the score between genes A and B involve for the most part consideration of intermediate relationships to other genes? Still trying to grasp why one considers genes related in this fashion. Originally you gave me the impression we were talking about coordinates (start and end positions) within a sequence, but if that were the case, it scarcely makes sense to confuse coordinates on one chromosome with numerically similar values on a different chromosome. regards, mathtalk-ga
Request for Question Clarification by mathtalk-ga on 13 Nov 2004 05:39 PST Here's an alternative to building the table quite as you described it (with many columns, most of which are empty on most records). One instead defines a table in which the key is a pair of known genes, and a third column gives the "rank" of their scoring. This is the motivation behind my requests for clarification I'm asking above. One part addresses whether the key is an ordered or unordered pair of known genes; the others seek insight into how the score or ranking would be found. As far as the missing "Clarification" button goes, my guess would be that the site didn't see you as logged in as yourself (only the user who posts the Question can submit Clarifications). Rarely I find that I have to logout, log back in to get recognized by the site. regards, mathtalk-ga
Clarification of Question by iterative-ga on 14 Nov 2004 11:27 PST yes, a two or three column table was described to me yesterday as more sensible, like you describe. Links are birectional, the table only indicates one degree of linking. Each link is a chain. Column one is the target gene. The gene (row) where you say for every gene do this. what you do is find all the chains that link you to other genes. Column two is the gene you link to, column three is the score for that link, given by the scoore for the chain that links you. Example: TargetGene__QueryGene__Score gene 1 gene 233 500 gene 1 gene 458 302 gene 1 gene 458 302 gene 1 gene 2 58 gene 2 gene 332 403 gene 2 gene 806 255 gene 2 gene 1 58 >> this is the reverse of row 4. There may be several links between two genes, they all will be listed twice. Directionality constraint can be induced later by strand information. The problem was that gmail accounts are not allowed and I was logged in to mail.
Request for Question Clarification by mathtalk-ga on 14 Nov 2004 15:26 PST Okay, my impression was that the query we looked at before gave "one degree of linking". You can still create and populate a table that will hold _all_ degrees of linking (the transitive closure business I mentioned earlier). Basically we can collect all the "one degree" links first, and populate them as the corresponding pair of known genes with "rank" column set = 1. Then we proceed to insert new pairs of genes into the "linkable genes" tables based on a rule that says: If known gene A is linked to known gene B with rank = K, and known gene B is linked to known gene C with rank = 1, then insert the pairing A,C into the table with rank = K+1 unless it is already there (with a smaller rank). We would "iterate" over this rule until new gene pairs stop being added. (One iteration with no new pairs means we're done.) Efficiency requires some indexes be defined on the "linkable genes" table, for example so that it's pretty fast to search it for gene pairs and know whether they already exist or not. I see from your last note that you contemplate storing the same gene pair multiple times (if they can be linked together in more than one way). Personally I would avoid this, as it will likely generate a high proportion of duplicate records. regards, mathtalk-ga
Clarification of Question by iterative-ga on 15 Nov 2004 06:31 PST "However ... then a second step you'll want to take with these relations, and that is to organize them into "chains" or lists that partition the chromosome. In mathematics we might call this "taking the transitive closure" ... we want ultimately to put all three into a common "linked list" or other data structure that lumps them together." I do not follow what this looks like nor am I familiar with the point, nor is its really the goal of this question. I want to make a table wiht one degree of linking for every gene. if I want to make second degree links, I can get them from that table. If your conceptualization yeilds a better query to do this, that will be fine. The query we are refering to above does not do this. In addition it still has the incorrect assumption that target gene and query gene are on the same chrom. You were hinting at the creation of a new table for a new degree of linking, or a new column, either way I dont get it. Show me the query to run adn I might get it.
Request for Question Clarification by mathtalk-ga on 16 Nov 2004 11:20 PST Okay, let's back up and discuss where you find shortcomings in the existing query. One point is to make the query use all the chr#_chainSelf tables. If you are satisfied with inserting the results into a common table (linked_genes), then we could simply modify the query1.sql to process each of the other 23 "chainSelf" tables one at a time. If it's important to have a single query that does everything, then we could write this as a UNION ALL query. However I recall your earlier statement: "It returns the same results over and over. I dont think it ever stops." Is the problem that when you run the query (once), it never stops? Or are you simply saying that the query returns the same rows each time it runs? regards, mathtalk-ga
Clarification of Question by iterative-ga on 16 Nov 2004 12:17 PST both problems. The first thing to understand is that this query does not work. It was someones idea, it would be best to start from scratch. One query, multiple queries, this is not really a concern. I do want one table containing all the links, with score, between genes. no entire row will be duplicated: targetGene_queryGene_chainScore There are 44000 genes, 20 genes on average linked to each, less than a million rows.
Request for Question Clarification by mathtalk-ga on 17 Nov 2004 05:07 PST Hi, iterative-ga: Regarding "both problems", I have two suggestions: 1) I suggest we fix a row in the "a" table and do the query to see whether the results that come back are the appropriate subset of the ones you want, ie. all the linked genes from chromosome 1. Comparing your description of the "overlapping" genomic intervals/coordinates, I see that there's some room here for the inequality logic to be wrong. So, if there are some rows that don't belong, then we can address that issue sooner than later: /* query1.sql revised again */ select a.name, c.name, b.score from knowngene a, chr1_chainSelf b, knownGene c where a.txStart <= b.tStart and a.txEnd >= b.tEnd and b.qStart >= c.txStart and b.qEnd <= c.txEnd and c.name = d.kgId and a.chrom = 'chr1' and a.name = 'BC0634' ; 2) I think the reason the query runs very slowly is because it is doing the joins on many tables without having a good set of indexes defined to help. If you can confirm that indexes have not yet been created, I'll post some SQL to create the ones that jump out at me. If you have created some indexes, please post what they are, and I can review the list for additions. regards, mathtalk-ga
Clarification of Question by iterative-ga on 17 Nov 2004 07:56 PST /* query6.sql from mathtalk must change knowngene to knowngenex for exon coordinate absence must change a.name to AK000538, which has one chr1 chain to SHARP must take out kgid reference for now */ select a.name, c.name, b.score from knowngenex a, chr1_chainSelf b, knownGenex c where a.txStart <= b.tStart and a.txEnd >= b.tEnd and b.qStart >= c.txStart and b.qEnd <= c.txEnd and a.chrom = 'chr1' and a.name = 'AK000538' ; &&& This query is simply hanging now for 30 minutes, shoud return one line. &&& I thought indexes would show up in DESCRIBE knownGene, etc above, so they do not exist I guess. &&& I get the feeling also that the inequality logic is wrong, as though its not comparing one chain to fufill the position requirements but multiple. If this would be a good time to incorporate the desire to expand a genes coordinates by 1.5x with the same center point, please keep it in mind. &&& I think we are getting warmer.
Request for Question Clarification by mathtalk-ga on 17 Nov 2004 08:58 PST As a quick confirmation that we are not missing something fundamental: select a.name from knowngenex a and a.chrom = 'chr1' and a.name = 'AK000538' ; should quickly return one row. Then, let's test the speed of this: select a.name, b.score from knowngenex a, chr1_chainSelf b where a.txStart <= b.tStart and a.txEnd >= b.tEnd and a.chrom = 'chr1' and a.name = 'AK000538' ; I think we want to created indexes on knowngenex.name and chr1_chainSelf.tStart. The inequality logic here asks that the "target" interval (b.tStart,btEnd) is entirely contained within the "genomic" interval (a.txStart,txEnd). I agree that now would be a good time to think about the additional 1.5 factor. First let's confirm that invariably: txStart <= txEnd in the knowngenex table tStart <= tEnd in the chr1_chainSelf table qStart <= qEnd in the chr1_chainSelf table This may seem "obvious" but as you may have found, sometimes the endpoints are reversed for "opposite" strand orientation. I'd just do a few queries to be sure what we should expect: select count() from knowngenex where txStart > txEnd; select count() from chr1_chainSelf where tStart > tEnd; select count(*) from chr1_chainSelf where qStart > qEnd; regards, mathtalk-ga
Clarification of Question by iterative-ga on 17 Nov 2004 14:12 PST yes quick one line return yes each of those counts is equal to zero chain ID someone refered to as the index because it was unique, but I dont know more
Request for Question Clarification by mathtalk-ga on 19 Nov 2004 05:54 PST Hi, iterative-ga: See my Comment at bottom about creating a couple of indexes on tables. regards, mathtalk-ga
Clarification of Question by iterative-ga on 19 Nov 2004 09:29 PST &&&yes, you understand the concept &&&yes, genetic intervals to "contain" the query interval are also subject to 50% enlargement, this may look sticky later because one chain's query interval may hit two genes. Fine, count them all with that chain score. &&&No, you are right, we cannot ignore chromosomal location when comparing chain query interval to the knownGene table in the final step. Coordinates include chromosome name and postion. &&& knowngenex is the table used because we could not get the longblob exon coordinates into an sql table. This index is created. The chromosome 10 index is created, but upon repeating for other chrX_chainSelf files I get CREATE INDEX tStart_idx ON chr1_chainSelf ( tStart ) * ERROR at line 1: ORA-00955: name is already used by an existing object
Request for Question Clarification by mathtalk-ga on 19 Nov 2004 10:32 PST Ah, I see we need to make the index names distinct for each of the chr#_chainSelf tables. Something along these lines: DROP INDEX tStart_idx; CREATE INDEX chr10_tStart_idx ON chr10_chainSelf ( tStart ); Also, thanks for the clarification about the chromosome locations and intervals. Given that, I'll amend my recommendation for the index on knownGenex to: DROP INDEX txStart_idx; CREATE INDEX chrom_txStart_idx ON knownGenex ( chrom, txStart ); * * * * * * * * * * * * * * When a genetic interval txStart/txEnd is to be increased by 1.5 times, it would be my intuition that this means to lower txStart by 1/4th of (txEnd - txStart), and to symmetrically increase txEnd by the same amount. Is that your sense as well? regards, mathtalk-ga
Request for Question Clarification by mathtalk-ga on 23 Nov 2004 06:06 PST My thinking right now is to create the links between known genes via chain-self target and query intervals in three steps: 1. Select known genes & "chain IDs" where the known gene's (expanded) genomic interval contains the chainSelf target interval 2. Select known genes & "chain IDs" where the known gene's (expanded) genomic interval contains the chainSelf query interval 3. Join the results of 1. and 2. based on equality of chain IDs, excluding if necessary any rows where the known genes are the same. So as an experiment, I'll post the SQL to create a couple of intermediate tables to hold the results of 1. and 2. Then we can put an index on these tables for chain ID, to facilitate joining them (3.). regards, mathtalk-ga
Clarification of Question by iterative-ga on 30 Nov 2004 08:51 PST OK, I think that the indexes are created. Im not sure what to clrify here. What you describe sounds good, but seems to include more steps than needed. I dont know why a table must be created, I thought we were already doing joins in the where statement. I just want the table of one degree links though, so any way is fine.
Request for Question Clarification by mathtalk-ga on 30 Nov 2004 08:57 PST I hope we will save time overall because of the computation of the enlarged intervals -- doing it once rather than over and over. regards, mathtalk-ga
Clarification of Question by iterative-ga on 30 Nov 2004 10:00 PST bene. Then my clarification is sounds very good. I will keep reading about joins and formatting outputs

Answer

Subject: Re: SQLplus query needed
Answered By: mathtalk-ga on 06 Dec 2004 05:12 PST
Rated: 5 out of 5 stars

Hi, iterative-ga: Let's outline the steps to take, and then go into the details (see below). 1. Create a table to hold each known genes, its chromosome, and the enlarged genomic intervals. Populate this table with an INSERT statement based on selecting records from table knownGenex. 2a. Create a table to hold pairs of known genes and chainSelf IDs, such that the start/end coordinates of the chainSelf target interval are within the enlarged genomic interval. Populate this table with several INSERT statements, one for each chromosome, joining the table from 1. and the respect chromosome's chainSelf table. 2b. Create a table to hold pairs of known genes and chainSelf IDs, such that the start/end coordinates of the chainSelf query interval are within the enlarged genomic interval. Populate this table with several INSERT statements, one for each chromosome, joining the table from 1. and the respect chromosome's chainSelf table. 3. Create a table to hold pairs of known genes and the chainSelf IDs that link them. Join the tables from 2a/2b by chromosome & chainSelf IDs, to produce the required "links" among known genes. It can be seen from this outline that the process can be done chromosome by chromosome, although we will be using tables that pool all the chromosomes' results together. But to prove the ideas, let's run through the process for chromosome 1. It may not only find a bug or two, but point out some indexing issues for performance. Please run through these steps just on chromosome 1 and let me know what the timings are like. regards, mathtalk-ga * * * * * * * * * * * * Step 1 ====== CREATE table knowngene_enlarge ( name varchar2(255) not null, chrom varchar2(255) not null, wxStart number(10) not null, wxEnd number(10) not null ); INSERT into knowngene_enlarge SELECT name, chrom, txStart - (txEnd - txStart)/4, txEnd + (txEnd - txStart)/4 from knowngenex; /* Note: We will probably want to add an index here on chrom, for the sake of the next steps. / Step 2a ======= CREATE table knowngene_target ( name varchar2(255) not null, chrom varchar2(255) not null, chain_id number(10) not null ); / for each chromosome, do like this: / INSERT into knowngene_target SELECT a.name, a.chrom, b.id from knowngene_enlarge a, chr1_chainSelf b where a.chrom = 'chr1' and b.tStart between a.wxStart and a.wxEnd and b.tEnd <= a.wxEnd; / Note: The hope is that the condition b.tStart <= a.wxEnd, although redundant, will help speed up the query by using an index on chrXX_chainSelf.tStart to limit the join. / Step 2b ======= CREATE table knowngene_query ( name varchar2(255) not null, chrom varchar2(255) not null, chain_id number(10) not null, score float(126) not null ); / for each chromosome, do like this: / INSERT into knowngene_query SELECT a.name, a.chrom, b.id, b.score from knowngene_enlarge a, chr1_chainSelf b where a.chrom = 'chr1' and b.qStart between a.wxStart and a.wxEnd and b.qEnd <= a.wxEnd; / Note: Very similar to above except using query interval instead of target interval, and also picking up the score field for later convenience. / Step 3 ====== CREATE table knowngene_link ( name_t varchar2(255) not null, name_q varchar2(255) not null, chrom varchar2(255) not null, chain_id number(10) not null, score float(126) not null ); INSERT into knowngene_link SELECT a.name, b.name, b.chrom, b.chain_id, b.score from knowngene_target a, knowngene_query b where a.chrom = b.chrom and a.chain_id = b.chain_id; / Note: To make this join efficient, we probably want to create a composite index on chrom & chain_id for knowngene_target and/or knowngene_query. */
Request for Answer Clarification by iterative-ga on 06 Dec 2004 12:04 PST 1 is fine, 2a appears to be ok but both statements in the second comment are not clear to me. I do not intend to do anything explicitly instructed because it could take serious time to figure out or debug. Target table made and populated. 2b, query table, created but insert hangs for 2 hours now, not for any reason I can see. Ideas?
Clarification of Answer by mathtalk-ga on 06 Dec 2004 12:13 PST Hi, iterative-ga: Both my comments in Step 2 were intended as explanatory (apparently clear as mud!) and not as instructive. I was just trying to make a remark on why the queries were written the way they were. If 2a.'s insert went reasonably quickly but 2b.'s is running very slowly, then I suspect we need to add an index to chr1_chainSelf on qStart. Would you mind verifying if we have an index on tStart but not on qStart for that table? Thanks, mathtalk-ga
Clarification of Answer by mathtalk-ga on 06 Dec 2004 12:18 PST Now that I look back at our earlier exchanges of Clarifications, I see we did indeed create indexes for the tStart columns in chrXX_chainSelf tables, but none as far as I know on the qStart column. Please imitate the SQL we used before, changing the index names, etc. to reflect qStart rather than tStart as the indexed field. regards, mathtalk-ga
Request for Answer Clarification by iterative-ga on 07 Dec 2004 08:09 PST ok part two is fine, part 3 table created, insert hangs. Your comment about the composite index seems to come into play. Are we supposed to make another index to insert effectively into the link table?
Clarification of Answer by mathtalk-ga on 07 Dec 2004 10:08 PST Yes, I think it would be sufficient to put the composite index on just one of the tables from step 2, but for the sake of symmetry: CREATE INDEX chrom_chain_t ON knowngene_target ( chrom, chain_id ); CREATE INDEX chrom_chain_q ON knowngene_query ( chrom, chain_id ); regards, mathtalk-ga
Request for Answer Clarification by iterative-ga on 11 Dec 2004 12:34 PST the final insert statement excedes my tablespace. I talked to system administrator and he said I was at 10gig and he gave me another 10gig. I dont know if he has to reboot or something because it still excedes and quits. I am looking fo commands to see how much of my quote is free. He also said i should drop the last indexes and make them again in tablespace INDEX01_TBSP, but I am struggling with the sybtax for that as well. If you have guidance, it would be great but I will still work to determine what is wrong. What I dont have though is an answer that works yet. Thanks.
Clarification of Answer by mathtalk-ga on 11 Dec 2004 21:54 PST Hi, iterative-ga: I'm curious to find out a rough number for the records being inserted. Let's check just the first chromosome, as I suspect the numbers of links on different chromosomes will not vary by an order of magnitude: SELECT count(*) from knowngene_target a, knowngene_query b where a.chrom = b.chrom and a.chain_id = b.chain_id and a.chrom = 'chr1'; At one point you estimated ~ one million rows would be generated as links, so it's a good time to check that assumption. This count will potentially include some pairs of genes that are linked in more than one way. Since the rows we are creating for knowngene_link have only five columns, it's not hard to see that we are overdoing it with how wide we've made many of these columns. I used column widths that were compatible with how you'd defined the columns in the source tables, but surely 255 characters is overkill for the gene names (name_t and name_q) almost as much as it is for the chromosome identifier. How about 30 characters each for the two gene names, and 10 for chromosome, and making field score be number(10) rather than float(126). Judging by your sample data, all the score fields are whole numbers anyway. These changes to the knowngene_link table will reduce the size of rows by about an order of magnitude. Here's the syntax to drop and recreate the indexes on knowngene_target and knowngene_query in the specified tablespace your administrator suggested: DROP INDEX chrom_chain_t; CREATE INDEX chrom_chain_t ON knowngene_target ( chrom, chain_id ) TABLESPACE INDEX01_TBSP; DROP INDEX chrom_chain_q; CREATE INDEX chrom_chain_q ON knowngene_query ( chrom, chain_id ) TABLESPACE INDEX01_TBSP; As far as figuring out how much space a table is using, there is a "system" table that holds some relevant information called DBA_TABLES. Often the first step is to estimate it yourself by counting the number of rows and adding up the size of one row, then multiplying. Let me write up some on this topic with a clear mind in the morning. regards, mathtalk-ga
Clarification of Answer by mathtalk-ga on 12 Dec 2004 12:51 PST Hi, iterative-ga: Well, in the clear light of "morning" I'm doubtful how much space could be saved in the final table by redefining its columns. The VARCHAR2 fields are of variable length, so it was incorrect for me to say we'd save a lot of space by resizing them. In any case let's get a count of how many rows are being inserted before we do anything else. The database administrator (DBA) is in a much better position to check a user's quota than the user is, as Oracle puts many of the tables/views/etc. under the dba permissions. However you should be able to check how much free space you have remaining (per tablespace) under your assigned quota by querying some views that are defined for user permissions. Here's a tip from "The Oracle (tm) Users' Co-Operative FAQ" written by Norman Dunbar: [How I determine how much free space a tablespace has left?] http://www.jlcomp.demon.co.uk/faq/freespace.html select ts.tablespace_name, to_char(sum(nvl(fs.bytes,0))/1048576, '99,999,990.99') as MB_FREE, count(*) as FRAGMENTS, to_char(max(nvl(fs.bytes,0))/1048576, '99,999,990.99') as BIGGEST_BIT from user_free_space fs, user_tablespaces ts where fs.tablespace_name(+) = ts.tablespace_name group by ts.tablespace_name; On the other hand, for a quick summary of how much space has been allocated (but not necessarily used; a detailed explanation would involve discussing what Oracle does with "segments" and "extents"): select segment_type, sum(bytes) from user_segments group by segment_type which should report allocation totals by combined tables and combined indexes. regards, mathtalk-ga
Request for Answer Clarification by iterative-ga on 15 Dec 2004 07:36 PST SQL> select count() from knowngene_target; COUNT()-39535334 SQL> select count() from knowngene_query; COUNT()-29017023 SQL> SELECT count() from knowngene_target a, knowngene_query b where a.chrom = b.chrom and a.chain_id = b.chain_id and a.chrom = 'chr1'; *took ~5hrs hours: COUNT()-4995703 SQL> select segment_type, sum(bytes) from user_segments group by segment_type; SEGMENT_TYPE SUM(BYTES) ------------------ ---------- INDEX 1383071744 TABLE 4865523712 >>other quota check from Norman Dunbar failed. DBA says I have 20 gigs. Turns out INDEX is INDX. the last two nights someone has logged me off as the compound index was being made. but I will continue with the plan, unless there seems to be something off.
Clarification of Answer by mathtalk-ga on 15 Dec 2004 09:14 PST Chromosome 1 is the largest (if I remember correctly) chromosome in the human genome, but that may not translate into the largest number of known genes (or links). But at least the 5,000,000 links reported gives us a good ballpark to plan for, per chromosome. So I'd rebuild the composite indexes, then proceed to do the final inserts one chromosome at a time, keeping an eye on the amount of space used. Let's check it after two or three chromosomes are processed and try to extrapolate to the final size of the knowngene_link table. We may need to analyze the storage factors being used for the table to make sure they are reasonable. By default the succeeeding allocations (extents) added for a table will be increased by a certain percentage in size each time. While this ordinarily helps to keep the number of extents to a minimum without requiring much planning, it can be a problem if space starts getting tight. regards, mathtalk-ga
Request for Answer Clarification by iterative-ga on 20 Dec 2004 11:36 PST about face. I cant create any tables, like the last one I need, and the last index creation disconnects me from oracle. We have one DBA for the building and he is always "migrating" something or otherwise busy. Turns out I am the only one using oracle, and I do not rate high on the pole, so he cannot look into these issues. Soooooooooooo, do I change to postgres or mysql. This DB will be on a php webpage once it works. Thanks for all your help, I guess this question is over for now.
Clarification of Answer by mathtalk-ga on 20 Dec 2004 12:03 PST Ouch! Well, the SQL & indexes should work on either postgres or MySQL. It seems like an odd situation, in that Oracle is a commercial database management system and the situation is impelling you to migrate to an open source DBMS. best wishes, mathtalk-ga

iterative-ga rated this answer: 5 out of 5 stars

Great effort by researcher, well recommended.

Comments

Subject: Re: SQLplus query needed
From: mathtalk-ga on 08 Nov 2004 13:03 PST

Hi, iterative-ga:

I'm looking at it... will touch base later today.

regards, mathtalk-ga

Subject: Re: SQLplus query needed
From: mathtalk-ga on 08 Nov 2004 19:38 PST

I think we may want to build a temp table to hold the results of the
query, and then loop until we no longer find anything new to insert
into the table.

I have a bit of catching up to do with you on populating the tables,
from where we left off before.  But I can start to sketch out the
above idea in a series of comments.

regards, mathtalk-ga

Subject: Re: SQLplus query needed
From: mathtalk-ga on 19 Nov 2004 06:11 PST

Here is the SQL to create an index on the starting point of the
"target" interval for each of chr#_chainSelf tables, using chromosome
10 as our model:

CREATE INDEX tStart_idx ON chr10_chainSelf ( tStart );

I'm not clear on the distinction between knownGene and knownGenex
tables, but you referred recently to "exon coordinate absence". 
Anyway, both tables could presumably use an index of this kind to help
our query:

CREATE INDEX txStart_idx ON knownGenex ( txStart );

*  *  *  *  *  *  *  *  *  *  *  *  *

Here's how I understand the intent of the existing query.  One begins
with a "known gene" and its genetic interval on some chromosome, say
chromosome 10.  We then go to chr10_chainSelf, which has information
(about repeating groups???) on  chromosome 10 and find rows there
whose "target" interval is entirely contained in the genetic interval
of the given "known gene".  [This being subject to the 50%
enlargement, ie. making the genetic interval bigger so that more
things from the "chain self" profiles will fit into it.]

Finally we use the corresponding "query" intervals from these
chromosome specific rows to check for "known gene" rows (other than
the original one!) whose genetic intervals entirely contain the
"query" interval.

  -  This last condition is the hardest one to check, I think, from a
database point of view.

  -  I'm not clear if the genetic intervals to "contain" the query
interval are also subject to 50% enlargement.

  -  I'm still very doubtful about your desire to ignore any
chromosome restriction in comparing the genetic interval coordinates
to the query interval.  If the query interval refers to a sequence on
a specific chromosome, the only meaningful sense in which a genetic
interval can contain it is when the "known gene" is located on the
same chromosome.

regards, mathtalk-ga

Important Disclaimer: Answers and comments provided on Google Answers are general information, and are not intended to substitute for informed professional medical, psychiatric, psychological, tax, legal, investment, accounting, or other professional advice. Google does not endorse, and expressly disclaims liability for any product, manufacturer, distributor, service or service provider mentioned or any opinion expressed in answers or comments. Please read carefully the Google Answers Terms of Service.

If you feel that you have found inappropriate content, please let us know by emailing us at answers-support@google.com with the question ID listed above. Thank you.

Search Google Answers for

Google Home - Answers FAQ - Terms of Service - Privacy Policy