Subject: Clustering Resemble Documents
Category: Computers > Algorithms
Asked by: mluggy2-ga
List Price: $2.00
Posted: 01 Jan 2003 15:07 PST
Expires: 31 Jan 2003 15:07 PST
Question ID: 136116
I'm trying to find a free tool, idea, theory, or method I can use to find and "cluster" similar documents. I have a textual document (in Hebrew) from source1 which I want to compare against hundreds of thousands of documents from other sources. Currently I'm squeezing CPU power to find the top 30 repeated words in each document and comparing those against the top 30 words of every other document. Obviously, this simple approach does not scale well. Any idea where I should look?
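(For concreteness, the comparison described above boils down to something like the following Python sketch; the function names and the overlap threshold are illustrative assumptions, not part of the question. It also makes the scaling problem visible: every document signature is compared against every other, so the work grows with the square of the number of documents.)

from collections import Counter
from itertools import combinations

def top_words(text, n=30):
    """Return the set of the n most frequent words in a document."""
    return {w for w, _ in Counter(text.split()).most_common(n)}

def similar_pairs(docs, threshold=10):
    """docs: dict mapping a document id to its text.  Pairs of documents
    whose top-word sets share at least `threshold` words are reported."""
    signatures = {doc_id: top_words(text) for doc_id, text in docs.items()}
    pairs = []
    for a, b in combinations(signatures, 2):   # O(n^2) pairwise comparisons
        if len(signatures[a] & signatures[b]) >= threshold:
            pairs.append((a, b))
    return pairs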
There is no answer at this time.
Subject: Re: Clustering Resemble Documents
From: mathtalk-ga on 02 Jan 2003 12:41 PST
I think you are looking at the wrong end of the frequency distribution. It seems to me that documents are not so well characterized by their most common words as by their least frequently used words (or by words that occur in one document but few others).

I'm unclear about your assertion, "Obviously, this simple approach does not scale well." Good scaling generally means that one can accomplish twice as much by throwing twice as much hardware (or twice as much time) at the job. Is that not the case here? In rare circumstances one can do a job twice as big with less than twice the resources, but such circumstances are so few that the algorithms for them are generally labelled "fast" in recognition of this. In combinatorial algorithms one often does not get even linear scaling.

regards, mathtalk-ga
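(A minimal sketch of the idea in the comment above, assuming the documents are available as plain-text strings: count how many documents each word occurs in, then characterize a document by words with a low document frequency. The function names and the max_df/n cutoffs are illustrative assumptions, not from the comment.)

from collections import Counter

def document_frequencies(docs):
    """docs: dict mapping a document id to its text.  Returns, for each
    word, the number of documents it occurs in."""
    df = Counter()
    for text in docs.values():
        df.update(set(text.split()))
    return df

def rare_word_signature(text, df, max_df=3, n=30):
    """Keep up to n words of this document that occur in at most max_df
    documents corpus-wide; these rare words characterize the document."""
    candidates = [w for w in set(text.split()) if df[w] <= max_df]
    return set(sorted(candidates)[:n])   # sorted only for a stable result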
Subject: Re: Clustering Resemble Documents
From: mathtalk-ga on 09 Jan 2003 06:59 PST
A library specifically for text analysis and document classification is Bow, written in C (see especially the Rainbow and Crossbow components):
http://www-2.cs.cmu.edu/~mccallum/bow/

You may also be interested in Weka, an open-source Java-based collection of tools aimed at data mining/machine learning:
http://www.cs.waikato.ac.nz/~ml/

This paper is useful, not only for its algorithmic ideas, but for its survey of the existing literature:
[Centroid-based Document Classification]
http://www-users.cs.umn.edu/~karypis/publications/Papers/PDF/centroid.pdf

This next paper, although slanted toward text retrieval algorithms rather than document classification, nonetheless has a useful discussion of terms:
[On Domain Modelling for Technical Documentation Retrival]
http://www.cc.jyu.fi/~pttyrvai/ISBN%20951-666-406-7.pdf

This paper references some ideas about using word frequencies to produce a two-dimensional mapping of "document space", borrowing some approaches from neural networks:
[Multilingual Application of the SOMlib Digital Library System]
http://www.ifs.tuwien.ac.at/ifs/research/pub_pdf/rau_rcdl01.pdf

regards, mathtalk-ga
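(As a rough illustration of the centroid-based approach surveyed in the Karypis paper above; this is a simplified sketch of the general technique, not code from the paper or from Bow/Weka. Documents become tf-idf-weighted vectors, each cluster is summarized by the average of its member vectors, and a new document is assigned to the cluster whose centroid is closest by cosine similarity. All names here are illustrative.)

import math
from collections import Counter, defaultdict

def tfidf_vector(text, df, n_docs):
    """Weight each word count by log(N / document frequency), so words that
    are rare across the corpus dominate the vector."""
    counts = Counter(text.split())
    return {w: c * math.log(n_docs / df[w]) for w, c in counts.items() if w in df}

def cosine(u, v):
    dot = sum(x * v[w] for w, x in u.items() if w in v)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def centroid(vectors):
    """Average a list of sparse vectors into a single centroid vector."""
    total = defaultdict(float)
    for vec in vectors:
        for w, x in vec.items():
            total[w] += x / len(vectors)
    return dict(total)

def assign(doc_vec, centroids):
    """centroids: dict mapping a cluster label to its centroid vector.
    Returns the label of the most similar centroid."""
    return max(centroids, key=lambda label: cosine(doc_vec, centroids[label]))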