Google Answers: Clustering Resemble Documents

Q: Clustering Resemble Documents ( No Answer, 2 Comments )

Question

Subject: Clustering Resemble Documents
Category: Computers > Algorithms
Asked by: mluggy2-ga
List Price: $2.00

Posted: 01 Jan 2003 15:07 PST
Expires: 31 Jan 2003 15:07 PST
Question ID: 136116

I'm trying to find a free tool, idea, theory, or method I can use to
find and "cluster" similar documents. I have a text document (in
Hebrew) from source1 which I want to compare to hundreds of thousands
of documents from other sources. Currently I'm squeezing CPU power to
find the 30 most frequent words in each document and comparing those
to the 30 most frequent words of other documents. Obviously, this simple
task does not scale too good.
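
For reference, the approach described above might look roughly like the following sketch. The question does not say how the two 30-word lists are compared, so the Jaccard overlap used here is an assumption, as are the function names and the whitespace tokenization (real Hebrew text would likely need a proper tokenizer).

```python
from collections import Counter


def top_words(text, k=30):
    """Return the set of the k most frequent whitespace-separated tokens."""
    counts = Counter(text.lower().split())
    return {word for word, _ in counts.most_common(k)}


def overlap(a, b):
    """Jaccard overlap of two word sets (an assumed comparison measure)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def most_similar(query, corpus, k=30):
    """Scan every document in a non-empty corpus; return (index, score) of the best match."""
    q = top_words(query, k)
    scores = [(i, overlap(q, top_words(doc, k))) for i, doc in enumerate(corpus)]
    return max(scores, key=lambda pair: pair[1])
```

Every query rescans the whole corpus in this form; caching the per-document word sets would avoid recomputing them each time.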

Any idea on where I should look?

Request for Question Clarification by mathtalk-ga on 08 Jan 2003 19:28 PST

Hi, mluggy2:

I was wondering if you had any thoughts in response to the comment I
posted a few days ago.  There are some open source tools and papers I
can point you to on the Web, but I'd like to gauge your level of
interest before making the effort.

thanks, mathtalk-ga

Clarification of Question by mluggy2-ga on 08 Jan 2003 23:59 PST

Thank you for your comments,

By scaling I actually meant using this algorithm with a much bigger
database and hundreds of thousands of documents.

I would love to see some open source tools and documents on ways to
measure document resemblance.

Thanks,

Michael.

Answer

There is no answer at this time.

Comments

Subject: Re: Clustering Resemble Documents
From: mathtalk-ga on 02 Jan 2003 12:41 PST

I think you are looking at the wrong end of the frequency
distribution.  It seems to me that documents are not so well
characterized by their most common words as they are by their least
frequently used words (or by words that occur in one document but in
few others).
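
One common way to act on this idea is TF-IDF weighting, which discounts words that appear in many documents, combined with cosine similarity. The sketch below is only an illustration under that assumption, not something specified in this thread; the function names and the whitespace tokenization are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Map each document to a {word: tf * idf} dict, with idf = log(N / df)."""
    tokenized = [doc.lower().split() for doc in docs]
    n = len(tokenized)
    # df[w] = number of documents that contain word w at least once
    df = Counter(word for tokens in tokenized for word in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({w: c * math.log(n / df[w]) for w, c in tf.items()})
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(word, 0.0) for word, weight in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)
```

A word that occurs in every document gets weight zero here, so the very common words stop contributing to the comparison at all.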

I'm unclear about your assertion, "Obviously, this simple task does
not scale too [well]."  Good scaling generally means that one can
accomplish twice as much by throwing twice as much hardware (or twice
as much time) at the job.  Is that not the case here?  In rare
circumstances one is able to do a job twice as big with less than
twice the resources, but these circumstances are so few that the
algorithms for them are generally labelled "fast" in recognition of
this.  In combinatorial algorithms one often does not get even linear
scaling.
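
To make the distinction concrete, here is a small sketch that counts comparisons under two assumed workloads: matching one document against n others, and comparing every pair among n documents (as a full clustering might). The function names are hypothetical.

```python
def one_vs_all(n):
    """Comparisons needed to match one query document against n others."""
    return n


def all_pairs(n):
    """Comparisons needed to compare every unordered pair among n documents."""
    return n * (n - 1) // 2


for n in (100_000, 200_000):
    print(n, one_vs_all(n), all_pairs(n))
```

Doubling n doubles the first count but roughly quadruples the second, so only the one-against-many case scales linearly in the sense described above.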

regards, mathtalk-ga

Subject: Re: Clustering Resemble Documents
From: mathtalk-ga on 09 Jan 2003 06:59 PST

A library specifically for text analysis and document classification
is Bow, written in C (see especially the Rainbow and Crossbow
components):

http://www-2.cs.cmu.edu/~mccallum/bow/

You may also be interested in Weka, an open source Java-based
collection of tools aimed at data mining/machine learning:

http://www.cs.waikato.ac.nz/~ml/

This paper is useful, not only for its algorithmic ideas, but for its
survey of the existing literature:

[Centroid-based Document Classification]
http://www-users.cs.umn.edu/~karypis/publications/Papers/PDF/centroid.pdf
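
As a rough illustration of the centroid idea (not the paper's exact procedure), the sketch below averages the unit-length term vectors of each class into one centroid and assigns a new document to the class whose centroid it is most similar to by cosine. The input vectors are assumed to be {word: weight} dicts, for example TF-IDF weights, and all names are hypothetical.

```python
import math
from collections import defaultdict


def normalize(vec):
    """Scale a sparse vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec.values()))
    return {w: x / norm for w, x in vec.items()} if norm else dict(vec)


def centroids(vectors, labels):
    """Average the unit-length vectors of each class into one centroid per class."""
    sums = defaultdict(lambda: defaultdict(float))
    counts = defaultdict(int)
    for vec, label in zip(vectors, labels):
        counts[label] += 1
        for w, x in normalize(vec).items():
            sums[label][w] += x
    return {
        label: {w: x / counts[label] for w, x in s.items()}
        for label, s in sums.items()
    }


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(x * v.get(w, 0.0) for w, x in u.items())
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0


def classify(vec, cents):
    """Return the label of the centroid most similar to vec."""
    return max(cents, key=lambda label: cosine(vec, cents[label]))
```

Classifying a document then costs one comparison per class instead of one per stored document, which is part of the appeal for large collections.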

This next paper, although slanted toward text retrieval algorithms
rather than document classification, nonetheless has a useful
discussion of terms:

[On Domain Modelling for Technical Documentation Retrieval]
http://www.cc.jyu.fi/~pttyrvai/ISBN%20951-666-406-7.pdf

This paper references some ideas about using word-frequencies to
produce a two-dimensional mapping of "document space", borrowing some
approaches from neural networks:

[Multilingual Application of the SOMlib Digital Library System]
http://www.ifs.tuwien.ac.at/ifs/research/pub_pdf/rau_rcdl01.pdf

regards, mathtalk-ga
