Google Answers: Duplicate-checking a MySQL text field with Python

Question

Subject: Duplicate-checking a MySQL text field with Python
Category: Computers > Programming
Asked by: abgillette-ga
List Price: $50.00

Posted: 09 May 2006 13:39 PDT
Expires: 08 Jun 2006 13:39 PDT
Question ID: 727053

I'm looking for a Python script/module/sample code that can
intelligently determine duplicate records based on the contents of a
text field in a MySQL database.

The task is complicated by the fact that the dupes won't be exact. The
table contains the full text of articles from several different
sources, but many of them contain mostly the same content, loaded into
different templates (think of an AP article that ends up listed on
several different web sites). So, there will be quite a bit of fuzzy
logic and guesswork involved in determining what is a duplicate and
what is not.

The table contains three fields: id, date and article. Ideally, the
script would allow me to run a full-text search on the article field,
and produce a list of the ids of the non-dupe results with the ids of
the duplicates listed as a separate array. Or something like that.

Answer

There is no answer at this time.

Comments

Subject: Re: Duplicate-checking a MySQL text field with Python
From: andyt-ga on 09 May 2006 19:18 PDT

There is a great answer to a related question here:
http://answers.google.com/answers/threadview?id=337832

Subject: Re: Duplicate-checking a MySQL text field with Python
From: mfripp-ga on 20 May 2006 21:09 PDT

Can you use MySQL's full-text indexing system? (see
http://dev.mysql.com/doc/refman/5.0/en/fulltext-search.html)

If you are only trying to match a single article, which you've stored
in a single row in a table called "new", against a number of articles
in a table called "archive", you could try something like this:

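# Assumes archive.article has a FULLTEXT index.
# AGAINST() needs a search string that is constant during the query,
# so it cannot take a column; load the new article into a user variable.
SELECT article INTO @new_article FROM `new` LIMIT 1;
DROP TEMPORARY TABLE IF EXISTS matches;
# 0.8 is an arbitrary cutoff: relevance scores are not capped at 1,
# so tune it against your own data.
CREATE TEMPORARY TABLE matches
  SELECT a.* FROM archive a
    WHERE MATCH (a.article) AGAINST (@new_article) > 0.8;
# Retrieve the matches
SELECT * FROM matches;
# Retrieve the non-matches
SELECT * FROM archive WHERE id NOT IN (SELECT id FROM matches);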

Alternatively, you could pick 5 or 10 phrases at random out of your
new article and then search for each of them exactly (i.e., using
plain substring matching rather than fuzzy matching). Then if, say,
80% of your random phrases are found in the same existing article,
you could call that a duplicate.
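
A rough Python sketch of that random-phrase idea follows. It is only an illustration under some assumptions: the archive has already been read into a dict mapping id to article text (how you fetch the rows from MySQL is left out), and the function names, the 6-word phrase length and the 0.8 threshold are all made up for the example, not taken from any library.

```python
import random
import re


def normalize(text):
    """Lowercase the text and collapse it to single-spaced words."""
    return " ".join(re.findall(r"\w+", text.lower()))


def sample_phrases(text, n_phrases=10, phrase_len=6, seed=None):
    """Pick up to n_phrases random runs of phrase_len consecutive words."""
    words = normalize(text).split()
    if not words:
        return []
    if len(words) < phrase_len:
        return [" ".join(words)]
    rng = random.Random(seed)
    starts = range(len(words) - phrase_len + 1)
    chosen = rng.sample(starts, min(n_phrases, len(starts)))
    return [" ".join(words[s:s + phrase_len]) for s in chosen]


def find_duplicates(new_article, archive, threshold=0.8,
                    n_phrases=10, phrase_len=6, seed=None):
    """Return ids of archive articles containing at least `threshold`
    of the phrases sampled from new_article.

    archive is assumed to be a dict of {id: article_text}.
    """
    phrases = sample_phrases(new_article, n_phrases, phrase_len, seed)
    if not phrases:
        return []
    dupes = []
    for article_id, text in archive.items():
        # Pad with spaces so phrases only match on whole-word boundaries.
        haystack = " " + normalize(text) + " "
        hits = sum(1 for p in phrases if " " + p + " " in haystack)
        if hits / len(phrases) >= threshold:
            dupes.append(article_id)
    return dupes
```

Calling find_duplicates(new_text, archive) gives the list of duplicate ids; the ids in archive that are not in that list are the non-dupes. Longer phrases cut down on accidental matches, but they are more likely to be broken by small edits between sites, so both the phrase length and the threshold would need tuning on real articles.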
