Subject: Automatic content categorization
Category: Computers
Asked by: david7777-ga
List Price: $75.00
Posted: 09 Nov 2006 08:29 PST
Expires: 29 Nov 2006 00:41 PST
Question ID: 781356
I'm in the process of putting a print dictionary online. To make it more valuable to users, I'd like to indicate a category for each term (or, to be precise, a sub-category, since the entire dictionary is specific to one broad subject). Since there are over 10,000 terms, doing this manually would be very time-consuming.

Is there a way to fully or mostly automate the categorization of the terms based on the text of their definitions? (Typical definitions are 1-4 sentences long.) For example, by seeing which terms mention other terms in their definitions, or which terms talk about similar subjects.

Alternatively, if some source has already classified all (or nearly all) words by category, it would presumably include all the words in our dictionary, and we could use it as a starting point.

I have a good idea of what the sub-categories should be (probably about 50-60). I did some Google searches for software that would let me do this but didn't find anything very good.
There is no answer at this time.
The following answer was rejected by the asker (they received a refund for the question).
Subject: Re: Automatic content categorization
Answered By: easterangel-ga on 09 Nov 2006 18:46 PST
Hi! Thanks for the question.

Automated text categorization or taxonomy software could help with your project. I found only a handful of packages that appear to fit your requirements. The following automated text categorization software could assist you in building such a dictionary.

Wordstat
http://www.provalisresearch.com/wordstat/WordstatFeatures.html

Inxight Software
http://www.inxight.com/products/

Autonomy
http://www.autonomy.com/content/Products/Taxonomy/index.en.html

Data Harmony
http://www.dataharmony.com/products/tm.htm

Entrieva
http://www.entrieva.com/entrieva/semiotagger.htm

You may also be interested in taxonomy software for thesaurus products.

a.k.a. Classification Software
http://a-k-a.com.au/aka_classification/

Multisystems
http://www.multites.com/

Term Tree
http://www.termtree.com.au/

Search terms used:
text taxonomy thesaurus dictionary
automated text taxonomy categorizer

I hope this helps with your research. Before rating this answer, please ask for a clarification if you have a question or need further information.

Regards,
Easterangel-ga
Google Answers Researcher
Subject: Re: Automatic content categorization
From: singbat-ga on 23 Nov 2006 05:48 PST
You are looking for text clustering software. Try the open source Apache Lucene project as a base indexing engine (a very good one, with a fairly low entry cost in terms of skills required) in combination with the open source Carrot2 clustering framework. Carrot2 sits on top of Lucene and other indexes and automatically groups the results of queries. That is, at least, the standard usage. However, with some modest programming effort you can download and configure Carrot2 to read your corpus and cluster your definitions (and thereby your words). The developers of Carrot2 offer technical development support for a fee, though the software itself is free.

My company is currently building a framework using these components to cluster the text of medical records automatically. Early results are promising. There should be substantial similarities to your situation, though the business domain is clearly different.

There are other clustering tools available; many are commercial and may require less technical help to set up. The underlying algorithms are well known and are typically covered in computer science textbooks. Good luck!
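To make the clustering idea concrete, here is a minimal sketch in plain Python. It is not Carrot2's algorithm or API, only a stand-in: it weights words by TF-IDF, compares definitions by cosine similarity, and groups them with a simple single-pass "leader" scheme. The sample entries, the stopword list, the function names, and the 0.15 threshold are all assumptions made for illustration.

```python
import math
import re
from collections import Counter

# Hypothetical sample entries; a real run would load the whole dictionary.
DEFINITIONS = {
    "mortgage": "A loan secured by real property, repaid with interest.",
    "lien": "A legal claim on property that secures repayment of a loan.",
    "dividend": "A payment of company profits to holders of its stock.",
    "share": "A unit of stock representing ownership in a company.",
}

# Assumed, deliberately tiny stopword list.
STOPWORDS = {"a", "an", "the", "of", "by", "to", "in", "on",
             "that", "with", "its", "is"}


def tokenize(text):
    """Lowercase the text, keep alphabetic words, drop stopwords."""
    return [w for w in re.findall(r"[a-z]+", text.lower())
            if w not in STOPWORDS]


def tfidf_vectors(docs):
    """Return one length-normalized TF-IDF vector (a dict) per term."""
    tokens = {term: tokenize(text) for term, text in docs.items()}
    doc_freq = Counter(w for words in tokens.values() for w in set(words))
    n_docs = len(docs)
    vectors = {}
    for term, words in tokens.items():
        counts = Counter(words)
        vec = {w: c * (math.log(n_docs / doc_freq[w]) + 1.0)
               for w, c in counts.items()}
        norm = math.sqrt(sum(v * v for v in vec.values())) or 1.0
        vectors[term] = {w: v / norm for w, v in vec.items()}
    return vectors


def cosine(a, b):
    """Cosine similarity of two already-normalized sparse vectors."""
    return sum(weight * b[w] for w, weight in a.items() if w in b)


def cluster(docs, threshold=0.15):
    """Single-pass clustering: join the first cluster whose leader
    (its first member) is similar enough, else start a new cluster."""
    vectors = tfidf_vectors(docs)
    clusters = []
    for term in docs:
        for members in clusters:
            if cosine(vectors[term], vectors[members[0]]) >= threshold:
                members.append(term)
                break
        else:
            clusters.append([term])
    return clusters


if __name__ == "__main__":
    for group in cluster(DEFINITIONS):
        print(group)
```

With definitions only 1-4 sentences long, word overlap is sparse, so the threshold would likely need tuning, and stemming (so that "secured" and "secures" match) would probably help. The resulting clusters are unlabeled groups; mapping them onto the 50-60 sub-categories the asker has in mind would still be a manual step.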
Subject: Re: Automatic content categorization
From: funtick-ga on 23 Nov 2006 11:07 PST
Apache Lucene is great, and it has a Solr subproject that can simplify the task. You have a Term and a Definition, and the Definition may include references to other Terms. By calculating scores for the referenced Terms, you can define the Category automatically (the referenced Terms with the highest scores). The Solr project has a feature called 'Faceted Browsing': a Term may belong to several different Categories, and such a design might be attractive in some cases (when you don't need a strict tree of categories and subcategories).
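The reference-scoring idea above can be sketched without Lucene or Solr at all. The following is one possible reading of it, not the Solr feature itself: a Term's score is the number of other definitions that mention it, and each Term is filed under the highest-scoring Term its own definition mentions. The sample data and function names are hypothetical, only single-word Terms are matched, and letting an unreferencing Term act as its own category is an assumption.

```python
import re
from collections import Counter

# Hypothetical sample entries; a real run would load the whole dictionary.
DEFINITIONS = {
    "stock": "Ownership in a company, divided into units.",
    "share": "A single unit of stock.",
    "dividend": "A payment made to holders of stock.",
    "loan": "Money lent and repaid with interest.",
    "mortgage": "A loan secured by real property.",
    "lien": "A claim on property that secures a loan.",
}


def referenced_terms(term, definition, vocabulary):
    """Other dictionary terms that appear as whole words in a definition.
    Limitation: only single-word terms are matched, with no stemming."""
    words = set(re.findall(r"[a-z]+", definition.lower()))
    return [t for t in vocabulary if t != term and t in words]


def reference_scores(docs):
    """Score each term by how many other definitions mention it."""
    scores = Counter()
    for term, definition in docs.items():
        scores.update(referenced_terms(term, definition, docs))
    return scores


def categorize(docs):
    """Assign each term the highest-scoring term its definition mentions.
    Assumption: a term whose definition mentions no other term is
    treated as its own category."""
    scores = reference_scores(docs)
    categories = {}
    for term, definition in docs.items():
        refs = referenced_terms(term, definition, docs)
        if refs:
            categories[term] = max(refs, key=lambda t: scores[t])
        else:
            categories[term] = term
    return categories


if __name__ == "__main__":
    for term, category in categorize(DEFINITIONS).items():
        print(f"{term}: {category}")
```

To allow a Term to sit in several Categories, as faceted browsing does, `categorize` could keep the top few referenced Terms instead of only the best one.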