Google Answers Logo
View Question
Q: Text/HTML Similarity Algorithms ( Answered 5 out of 5 stars,   1 Comment )
Subject: Text/HTML Similarity Algorithms
Category: Computers > Algorithms
Asked by: nathanntg-ga
List Price: $35.00
Posted: 28 Apr 2004 13:54 PDT
Expires: 28 May 2004 13:54 PDT
Question ID: 337832
I am looking for an algorithm that will allow for two HTML or text
pages to be compared and in return a numerical value will be provided
that somehow corelates to the level of similarity. In paticular, I am
looking for the name of a public algorithm or a link to a page that
describes strategies for writing such algorithms.

Clarification of Question by nathanntg-ga on 28 Apr 2004 15:24 PDT
I am not as much looking for the diff tool, but for a way of comparing
two documents based on keywords and looking for exact quotes or common
Subject: Re: Text/HTML Similarity Algorithms
Answered By: eiffel-ga on 29 Apr 2004 05:48 PDT
Rated:5 out of 5 stars
Hi nathanntg,

Using computers for sophisticated similarly analysis of text documents
is an area of active research interest, but straightforward similarity
measurement is a "solved problem".

Jack Lynch, of the University of Pennsylvania, has written a web page
that provides an overview of algorithms for text comparison, working
from the simplest to the most sophisticated. The methods include:

   -  Measuring shared letters (discarded as being not useful)
   -  Measuring shared runs of letters
   -  Measuring shared words
   -  Measuring shared runs of words
   -  Measuring shared runs of word stems
   -  Measuring shared vocabularies
   -  Measuring shared meanings
   -  Measuring shared syntax, even when vocabulary differs

Lynch discusses the advantages and disadvantages for each algorithm,
and provides open source code in C (for measuring shared runs of
letters) and in Perl (for measuring shared vocabularies):

Text Analysis with Compare

Lynch dismisses as impractical the measuring of shared meanings,
because it "would involve not only the most powerful processor, but an
almost unimaginably large database -- the sort of thing that would
require a large team working full-time for years".

But he was writing in 1995, and since then processors have become more
powerful, databases have become larger, and a large team has been
working for years...

A paper published by Michael Lee, Brandon Pincombe and Matthew Welsh
of the University of Adelaide discusses text similarity comparison
using word-based, n-gram and Latent Semantic Analysis methods. They
correlate the results of these comparison methods with human
comparison and find that the machine comparisons are useful, but
inferior compared to human comparison:

A Comparison of Machine Measures of Document Similarity with Human Judgements

Further practical work on document comparison using Latent Semantic
Analysis has been done since then. Thomas Landauer describes LSA in
detail in the following article, which doesn't include a complete
algorithm but does contain a complete textual description of the
method from which an algorithm can be derived:

Chapter 13. Landauer - The computational basis of learning

Web applications have been written to demonstrate that it is possible
to implement the concepts of LSA described in the above paper. I found
the "One to many comparison" application to be most interesting. I
entered the text of your question as the main text. For the texts to
compare against, I entered the text of your clarification and an
unrelated text. It picked your two pieces as being semantically
closely related, compared to the third piece:

Latent Semantic Analysis at Colorado University Boulder

If you want something more down-to-earth and less cutting-edge, you
could consider Dick Grune's software and text similarity tester. This
has been in use at Vrije University Amsterdam for a number of years to
compare submitted assignments - to check for copying. It compares
texts in programming languages or in natural language. It is available
for free as C source code and as DOS binaries:

The software and text similarity tester SIM

The SIM algorithms are available as pseudocode:

Concise Report on the Algorithms in SIM

The above applications are designed to work with plain text files,
whereas you also wish to work with HTML files. Two approaches are

1. Treat the HTML file as if it were a text file, in which case
   your comparisons will be based on similarity in the HTML markup
   as well as similarity in the text, or

2. Convert each HTML file to a text file (on the fly if necessary).

An easy way to convert HTML to text is to use a text-mode web browser
that has a command-line "dump to file" option. The lynx browser can do
this using the "-dump" option:

Lynx Information

You can try out the HTML-to-text functionality of the lynx browser
online. Given a URL it will display the text content of that web page:

Lynx Viewer

I trust that this answer provides the information that you are
seeking. Please request clarification if this answer does not yet meet
your needs.

Google Search Strategy:

"compare two text files" "index of similarity"

text similarity comparing OR compare OR comparison

"text mode browser"

nathanntg-ga rated this answer:5 out of 5 stars and gave an additional tip of: $10.00
Thank you. Your answer was extremely helpful and detailed. I
appreciate your support.

Subject: Re: Text/HTML Similarity Algorithms
From: eiffel-ga on 29 Apr 2004 23:46 PDT
Thanks, nathanntg, for the kind comments and tip.


Important Disclaimer: Answers and comments provided on Google Answers are general information, and are not intended to substitute for informed professional medical, psychiatric, psychological, tax, legal, investment, accounting, or other professional advice. Google does not endorse, and expressly disclaims liability for any product, manufacturer, distributor, service or service provider mentioned or any opinion expressed in answers or comments. Please read carefully the Google Answers Terms of Service.

If you feel that you have found inappropriate content, please let us know by emailing us at with the question ID listed above. Thank you.
Search Google Answers for
Google Answers  

Google Home - Answers FAQ - Terms of Service - Privacy Policy