Hello,
I'm working on an affiliate marketing project, written in PHP 4 and
driven by a PostgreSQL database. In the past I have worked only with
MS Access (for a really short while) and, for a few years, with MySQL.
Because I wanted to take full advantage of PostgreSQL's features, I
decided to try its OOP feature (table inheritance) for the tracking
tables, which contain all impressions, clicks, leads and sales. The
structure of the transaction tables looks like this (a simplified
sketch of the table definitions follows the list):
- Transaction (14202042 rows)
- - CampaignTransaction (14202042 rows)
- - - CampaignTransactionImpression (13395944 rows)
- - - CampaignTransactionClick (798597 rows)
- - - CampaignTransactionLead (5204 rows)
- - - CampaignTransactionSale (2297 rows)
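Roughly, the definitions look like this. I've simplified heavily (the
real tables have many more columns); only "ID", "campaignID" and
"transactionStatusID" are actually used in the queries below, the rest
is just there to show the inheritance:

CREATE TABLE "Transaction" (
    "ID"    serial PRIMARY KEY,
    "date"  timestamp NOT NULL DEFAULT now()
);

CREATE TABLE "CampaignTransaction" (
    "campaignID"          integer NOT NULL,
    "transactionStatusID" integer NOT NULL
) INHERITS ("Transaction");

-- one child table per transaction type
CREATE TABLE "CampaignTransactionImpression" () INHERITS ("CampaignTransaction");
CREATE TABLE "CampaignTransactionClick"      () INHERITS ("CampaignTransaction");
CREATE TABLE "CampaignTransactionLead"       () INHERITS ("CampaignTransaction");
CREATE TABLE "CampaignTransactionSale"       () INHERITS ("CampaignTransaction");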
The problem occurs when I try to use aggregate functions to count the
number of rows in one of these tables. I wrote a small benchmark tool
to measure the average execution time of the following query:
SELECT COUNT("ID") FROM "CampaignTransactionClick"
The output from the benchmark script (averaged over 25 runs) is a
horrible 9.612 seconds! I tried tweaking the PostgreSQL configuration
file and creating/removing indexes (although that only helps with a
WHERE clause), but I just can't figure out how to solve this problem.
The server I'm running the query on is a Pentium 4 2.66 GHz machine
with just 256 MB of memory and PostgreSQL 8. I also tried running it
on a dual Xeon 2 GHz with 4 GB of memory and PostgreSQL 7.4. The
result on that machine was 1.095 seconds, which is still a lot. I'm
afraid to even try it on the "CampaignTransactionImpression" table
:-(.
Another simple query, with fully indexed conditions:
SELECT COUNT("ID") FROM "CampaignTransactionClick" WHERE
"transactionStatusID" = 2 AND "campaignID" = 62
Takes an average of 12.723 seconds (1.272 on the dual Xeon server).
The count it returns is 66325, by the way.
These execution times are terrible, especially compared to what I was
used to with MySQL. I have already tried increasing "shared_buffers",
"work_mem", "max_fsm_pages", "max_fsm_relations" etc. in the config.
The only way I can think of to solve this is running a cronjob every
hour or so to update the counts, but that's a terrible solution too: a
lot of data would have to be recalculated even though everything needs
to be (near) real-time, and the counts have to be generated per
affiliate as well. With 10,000 affiliate sites, that means an enormous
number of (heavy) queries. :(
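To show what I mean by that workaround, the cronjob would maintain a
summary table roughly like this (the "ClickCount" table is completely
hypothetical, nothing of it exists yet, and a real version would also
need the per-affiliate breakdown that I left out here):

-- hypothetical summary table, refreshed by the hourly cronjob
CREATE TABLE "ClickCount" (
    "campaignID"          integer NOT NULL,
    "transactionStatusID" integer NOT NULL,
    "clicks"              integer NOT NULL,
    PRIMARY KEY ("campaignID", "transactionStatusID")
);

-- what the cronjob would run every hour
BEGIN;
DELETE FROM "ClickCount";
INSERT INTO "ClickCount" ("campaignID", "transactionStatusID", "clicks")
    SELECT "campaignID", "transactionStatusID", COUNT("ID")
    FROM "CampaignTransactionClick"
    GROUP BY "campaignID", "transactionStatusID";
COMMIT;

A lookup would then be a cheap primary-key select on "ClickCount"
instead of a COUNT over ~800k rows, but the numbers would always be up
to an hour stale.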
Is there another way to solve this while still keeping things (near)
real-time?
Thank you so much!
Tim
PS: here is the execution plan for the second query, in case it helps:
Aggregate (cost=31364.85..31364.85 rows=1 width=4) (actual
time=1247.717..1247.718 rows=1 loops=1)
-> Seq Scan on "CampaignTransactionClick" (cost=0.00..31190.72
rows=69652 width=4) (actual time=0.085..1177.993 rows=68907 loops=1)
Filter: (("transactionStatusID" = 2) AND ("campaignID" = 62))
Total runtime: 1248.098 ms
And every ID/date column is indexed; specifically for this query, I
tried this index:
CREATE INDEX "idx_CTC_transactionStatusID_campaignID" ON
"CampaignTransactionClick" USING btree ("transactionStatusID",
"campaignID")
I also tried disabling sequential scans to force an index scan. This
is the new execution plan:
Aggregate (cost=114882.80..114882.80 rows=1 width=4)
-> Index Scan using "idx_CTC_transactionStatusID_campaignID" on
"CampaignTransactionClick" (cost=0.00..114737.00 rows=58319 width=4)
Index Cond: (("transactionStatusID" = 2) AND ("campaignID" = 62))
The query now takes 7.985 seconds on the Pentium 4 server instead of
12 seconds. A slight improvement, but not nearly enough :(.