Subject: Duplicate-checking a MySQL text field with Python
Category: Computers > Programming
Asked by: abgillette-ga
List Price: $50.00
Posted: 09 May 2006 13:39 PDT
Expires: 08 Jun 2006 13:39 PDT
Question ID: 727053
I'm looking for a Python script/module/sample code that can intelligently determine duplicate records based on the contents of a text field in a MySQL database. The task is complicated by the fact that the dupes won't be exact. The table contains the full text of articles from several different sources, but many of them contain mostly the same content, loaded into different templates (think of an AP article that ends up listed on several different web sites). So, there will be quite a bit of fuzzy logic and guesswork involved in determining what is a duplicate and what is not. The table contains three fields: id, date and article. Ideally, the script would allow me to run a full-text search on the article field, and produce a list of the ids of the non-dupe results with the ids of the duplicates listed as a separate array. Or something like that. |
There is no answer at this time. |
Subject: Re: Duplicate-checking a MySQL text field with Python
From: andyt-ga on 09 May 2006 19:18 PDT
There is a great answer to a related question here: http://answers.google.com/answers/threadview?id=337832
Subject: Re: Duplicate-checking a MySQL text field with Python
From: mfripp-ga on 20 May 2006 21:09 PDT
Can you use MySQL's full-text indexing system? (See http://dev.mysql.com/doc/refman/5.0/en/fulltext-search.html) If you are only trying to match a single article, which you've stored in a single row in a table called "new", against a number of articles in a table called "archive", you could try something like this:

    DROP TEMPORARY TABLE IF EXISTS matches;
    CREATE TEMPORARY TABLE matches
      SELECT a.* FROM new n, archive a
      WHERE MATCH (a.article) AGAINST (n.article) > 0.8;
    # Retrieve the matches
    SELECT * FROM matches;
    # Retrieve the non-matches
    SELECT * FROM archive WHERE id NOT IN (SELECT id FROM matches);

Alternatively, you could pick 5 or 10 phrases at random out of your new article and then search for each of them exactly (i.e., using some system that doesn't do fuzzy matching). Then if, say, 80% of your random phrases are found in the same existing article, you could call that article a duplicate.
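The random-phrase idea could be sketched in Python roughly as follows. This is only an illustration, not a tested solution for the original table: the function names (normalize, sample_phrases, find_duplicates), the phrase length of 6 words, and the normalization step are all assumptions, and the archive rows are assumed to have already been fetched from MySQL as (id, article) pairs. The 80% threshold and the 10 phrases come from the suggestion above.

```python
import random
import re


def normalize(text):
    """Lowercase and collapse every run of non-alphanumerics to one space."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def sample_phrases(text, n_phrases=10, phrase_len=6, rng=random):
    """Pick up to n_phrases random runs of phrase_len consecutive words."""
    words = normalize(text).split()
    if len(words) <= phrase_len:
        # Article too short to sample from: use it whole (or nothing if empty).
        return [" ".join(words)] if words else []
    starts = range(len(words) - phrase_len + 1)
    picks = rng.sample(starts, min(n_phrases, len(starts)))
    return [" ".join(words[s:s + phrase_len]) for s in picks]


def find_duplicates(new_article, archive, threshold=0.8, **kwargs):
    """Return ids of archive articles containing at least `threshold`
    of the phrases sampled from new_article.

    archive is an iterable of (id, article_text) pairs (hypothetical shape).
    """
    phrases = sample_phrases(new_article, **kwargs)
    if not phrases:
        return []
    dupes = []
    for article_id, text in archive:
        # Pad with spaces so phrases only match on whole-word boundaries.
        haystack = " " + normalize(text) + " "
        hits = sum(1 for p in phrases if " " + p + " " in haystack)
        if hits / len(phrases) >= threshold:
            dupes.append(article_id)
    return dupes
```

Calling find_duplicates(new_text, rows) returns the ids judged to be duplicates; every other id in rows is a non-duplicate. Because each phrase is matched exactly after normalization, template text wrapped around the article should not matter, but a heavily edited copy could fall below the threshold, so the phrase length and threshold would need tuning on real data.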