Q: NLP software for keyword extraction ( No Answer,   1 Comment )
Question  
Subject: NLP software for keyword extraction
Category: Computers
Asked by: newbie99-ga
List Price: $2.00
Posted: 17 Aug 2004 04:08 PDT
Expires: 16 Sep 2004 04:08 PDT
Question ID: 388902
I have a smallish corpus of documents (personal, not in any special
field) -- about 700 documents, total 10MB.  I want to identify common
topics between documents--if, say, between 2 and 30 of them mention a
common topic, I want to have a keyword or description of that topic
listed with pointers to the relevant documents.

The following might make more sense of my request: Say you wanted to
use NLP to analyze messages posted to a listserv--it would extract
topics or keywords and provide pointers to relevant messages for each
keyword.

I would expect to have to do a fair amount of manual fiddling with
whatever list of topics the software came up with.  But once I have a
decent keyword list, I should be able to build, say, an HTML topic
listing with links from each topic to its set of documents.
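
A minimal sketch of that last step, assuming the keyword list has already
been reduced to a mapping from each topic to its documents. It is written
in Python 3 for illustration only; the function name topic_listing and the
sample file names are hypothetical.

```python
from html import escape

def topic_listing(topics):
    """Return an HTML page listing each topic with links to its documents."""
    lines = ["<html><body>", "<h1>Topics</h1>", "<ul>"]
    for topic in sorted(topics):
        # One link per document; escape names so odd characters stay valid HTML.
        links = ", ".join(
            f'<a href="{escape(doc, quote=True)}">{escape(doc)}</a>'
            for doc in topics[topic]
        )
        lines.append(f"<li>{escape(topic)}: {links}</li>")
    lines += ["</ul>", "</body></html>"]
    return "\n".join(lines)

# Hypothetical hand-edited keyword list.
topics = {"garden": ["a.txt", "b.txt"], "taxes": ["c.txt"]}
page = topic_listing(topics)
print(page)
```

The hard part remains producing the topic-to-document mapping; once it
exists, the listing itself is mechanical.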

My preferred tools are Unix ones.  I have access to Linux and to Mac OS X
(preferred, though it may be harder to find something for), and I prefer
Perl (although I could manage OK with C/C++ or Java).

I will consider my question answered once I get to the point that I
can actually generate a keyword list (I'm happy to do whatever work it
takes, but the answerer has to give me enough information to create
something that really works.)

I'm going to set the price to $40.  If you think you can give me real
help with this but want the price to be higher, let me know and I'll
consider it.

Thank you

Clarification of Question by newbie99-ga on 18 Aug 2004 09:48 PDT
The Etymon software is probably not what I want because it's text
retrieval software.  (I do like that it's Unix command-line based.) 
Maybe I could use the index files it builds for my keyword
generation...but it would take some work before I could see if that
would be helpful.

The Dublin Core tools look possibly closer to what I want.  I haven't
had enough time yet to check them out.

So far I've spent the most time with GATE (http://gate.ac.uk/).  It
probably has enough stuff in it to accomplish my goals, but I'm having
a hard time figuring it out.

Ellogon also looks good (http://www.ellogon.org/), but I haven't tried
it because I don't have Tcl installed in the places where I plan to
use it.

The problem I'm running into is that there's tons of software
(http://registry.dfki.de/ and other places), and much of it might do
what I want, but I don't know enough about the field to be able to
efficiently recognize and use a piece of software that would work for
me and to choose one that fits my technical skills and requirements.

In question http://answers.google.com/answers/threadview?id=248829
someone called Yosarian (who might be able to answer my present
question) recommends a Perl package,
http://www.d.umn.edu/~tpederse/nsp.html, which I tried to use (although
I think that n-grams might be a little less than the degree of
analysis I need), but it depends on a WordNet installation that I
couldn't get running on Mac OS X.  (I would have gone to the
further trouble of installing it on my Linux host if I had some
assurance it would have done what I wanted, but I didn't.)

Anyway, maybe all this gives enough background for someone to help me.

Thanks.

Clarification of Question by newbie99-ga on 20 Aug 2004 09:57 PDT
I just raised the price to $70.  

I might be satisfied with a decent concordance generator I can use
from the Unix command line.  Of course, something smarter than that
would be nice.  Particularly, I'd like to be able to index names.  And
it's essential that the output be something I can process using Perl
(i.e., it probably shouldn't be a binary file.)
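
A minimal sketch of such a concordance generator, assuming Python 3 and
plain-text input. It writes one tab-separated line per word occurrence
(keyword, document name, surrounding context), a format that Perl can
read with split. The function name, the context width, and the sample
sentence are all hypothetical.

```python
import re

def concordance(name, text, width=3):
    """Yield (keyword, document, context) for every word occurrence."""
    words = re.findall(r"[A-Za-z']+", text)
    for i, word in enumerate(words):
        # Up to `width` words on each side, with the keyword in brackets.
        left = " ".join(words[max(0, i - width):i])
        right = " ".join(words[i + 1:i + 1 + width])
        yield word.lower(), name, f"{left} [{word}] {right}".strip()

# Hypothetical one-line document.
rows = sorted(concordance("note1.txt", "Call Alice about the garden party"))
for keyword, doc, context in rows:
    print(f"{keyword}\t{doc}\t{context}")
```

Sorting the rows groups all occurrences of a keyword together. This does
nothing smart about names; it only shows the plain-text output shape.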

Request for Question Clarification by andyt-ga on 20 Aug 2004 11:20 PDT
Hello newbie99,
Would a solution using the Python scripting language work for you?

-Andyt-ga

Clarification of Question by newbie99-ga on 20 Aug 2004 11:44 PDT
> Would a solution using the Python scripting language work for you?

I've got Python 2.3 on my Mac OS X machine and Python 2.2.1 on my Linux host. 
I'd like it to work on both.  And I don't know Python (I played with
it for a few hours once), so the solution shouldn't require a lot of
fiddling with the code.

The long and short of it is, if I can get it working and it does what
I'm asking, it'll be fine.

Clarification of Question by newbie99-ga on 20 Aug 2004 15:56 PDT
If the Python thing you're talking about is an existing tool, you
should go ahead and recommend it.  If I can get it working, I'll
consider my question answered.  If it's something that you're thinking
of putting together yourself, you should communicate with me about it
before putting work into it.  I don't need a whole long explanation or
research into NLP techniques and software.  I just want to get this
thing done.  It's a small part of a bigger project.

Request for Question Clarification by andyt-ga on 20 Aug 2004 19:24 PDT
The Python program I was talking about was the Python Natural Language
Toolkit (NLTK) - http://nltk.sf.net.  However, I wasn't able to determine
whether it would help you with your problem, because I'm on dialup
for the next week or so, and the data files were pretty big.

I found a Unix concordance generator called hum, which lists the
frequency of each word in a document, like this:
    1  analysis
    4  and
    1  are
    1  at

Hum is available here: ftp://clr.nmsu.edu/CLR/tools/concordances/ 
It's very old, but it compiled without any problems on my Mac. 
However, I was unable to think of a good algorithm for the next step,
i.e., picking out which of those words are unique to that document in
order to identify common topics between groups.  If I think of
something or find another tool that could help you, I'll let you know.
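
One possible approach to that next step, offered as a sketch rather than
a tested solution: count how many documents each word appears in, and
keep only the words found in between 2 and 30 documents, the range given
in the question. The function names and the sample corpus are
hypothetical, and the code assumes a current Python 3 rather than the
2.x versions mentioned above.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Lowercase the text and return its alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())

def shared_terms(docs, min_docs=2, max_docs=30):
    """Map each term to the sorted names of the documents containing it,
    keeping only terms found in min_docs..max_docs documents."""
    postings = defaultdict(set)
    for name, text in docs.items():
        for word in set(tokenize(text)):  # count each word once per document
            postings[word].add(name)
    return {
        word: sorted(names)
        for word, names in postings.items()
        if min_docs <= len(names) <= max_docs
    }

# Hypothetical three-document corpus, purely for illustration.
docs = {
    "a.txt": "The garden needs water and the roses need pruning",
    "b.txt": "Roses and tulips grow in the garden",
    "c.txt": "The invoice and the receipt are in the drawer",
}
index = shared_terms(docs, min_docs=2, max_docs=2)
for word in sorted(index):
    print(word, " ".join(index[word]))
```

On this toy corpus the filter drops "the" and "and" (present in all three
documents) and keeps "garden", "in", and "roses". The survival of "in"
shows that a stopword list and some manual editing would still be needed.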

Clarification of Question by newbie99-ga on 21 Aug 2004 03:09 PDT
I glanced at the documentation for NLTK.  It looks good, but at first
glance it didn't seem like it would have the functionality of
identifying proper names (people, places), which GATE has.  I think
Hum might be too rudimentary.  The names thing is seeming increasingly
important to me.

I think the best way to proceed might be for me to use GATE.  Would
you be interested in helping me figure out how to get GATE, run from a
Unix command line, to produce name indexes and a couple of other indexes
(probably indexes whose keywords I can hand-edit, based on
n-grams or something)?
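
For comparison, a very rough stand-in for a name index, assuming Python 3.
This is not how GATE works and it is not real named-entity recognition:
it simply treats any run of two or more capitalized words as a candidate
name and records which documents contain it. The pattern, the function
name, and the sample text are hypothetical.

```python
import re
from collections import defaultdict

# Runs of two or more capitalized words, e.g. "Mary Smith".
NAME_RUN = re.compile(r"\b[A-Z][a-z]+(?:\s+[A-Z][a-z]+)+\b")

def name_candidates(docs):
    """Map each candidate name to the sorted documents that mention it."""
    index = defaultdict(set)
    for name, text in docs.items():
        for match in NAME_RUN.findall(text):
            # Normalize internal whitespace so line breaks do not split entries.
            index[" ".join(match.split())].add(name)
    return {cand: sorted(names) for cand, names in index.items()}

# Hypothetical sample documents.
docs = {
    "a.txt": "I met Mary Smith in New York last week.",
    "b.txt": "Mary Smith called again.",
}
names = name_candidates(docs)
for cand in sorted(names):
    print(cand + "\t" + " ".join(names[cand]))
```

The heuristic misses single-word names and picks up capitalized sentence
openers (for example "Yesterday Mary Smith"), so its output would need
the same hand editing as any other candidate list.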

Clarification of Question by newbie99-ga on 22 Aug 2004 09:26 PDT
I'm posting a new question, to simplify what's here.  I'm leaving this
(with $2 price) for reference.

Clarification of Question by newbie99-ga on 22 Aug 2004 09:36 PDT
The new question is at: http://answers.google.com/answers/threadview?id=391093
Answer  
There is no answer at this time.

Comments  
Subject: Re: NLP software for keyword extraction
From: nixit-ga on 17 Aug 2004 16:31 PDT
 
http://www.etymon.com/tr.html describes open-source software that might
be of interest to you.

Also see http://dublincore.org/tools/ for tools for metadata
generation and extraction.
