Google Answers: help with NLP task using GATE

Q: help with NLP task using GATE ( No Answer, 9 Comments )

Question

Subject: help with NLP task using GATE
Category: Computers
Asked by: newbie99-ga
List Price: $100.00

Posted: 22 Aug 2004 09:35 PDT
Expires: 21 Sep 2004 09:35 PDT
Question ID: 391093

[This question takes the place of
http://answers.google.com/answers/threadview?id=388902 which got too
long and confusing.]

I would like help using the GATE (General Architecture for Text
Engineering, http://gate.ac.uk/) for a small natural language
processing task.

I want to index a small corpus of documents--extracting index terms
such as proper names and significant key words/phrases shared between
documents.  I need the index to be in a non-binary form that I can
parse in Perl to generate HTML indexes that link back to the original
documents.  I would also like to be able to manually improve the
keyword list after it has been automatically generated and to use the
manually improved list for index generation.

I don't know the field of NLP, I don't know the GATE framework, and I
don't know Java (which it's written in) very well.  That being said,
I'm a decent programmer and will do the work of assembling the thing
that I want--but I need some guidance and help from someone who knows
this stuff better than I do.

An 'answer' to this question will consist of: 1) assurance that what I
want to do is possible, and, if not, discussion of other approaches;
2) description of the steps needed to accomplish my goal; 3) help if I
get stuck.

Answer

There is no answer at this time.

Comments

Subject: Re: help with NLP task using GATE
From: andyt-ga on 23 Aug 2004 15:41 PDT

I have a solution that doesn't use gate.  It uses ConceptNet
(http://web.media.mit.edu/~hugo/)conceptnet/, which extracts keywords
from documents using a commonsense database, a process it describes as
topic-gisting.  The keywords and percent match from the text of your
question above are below.  This is from a command-line tool, written
in Python, that could process all documents in a directory, or all
tables in a database.  If this is an acceptable answer, let me know
and I'll post it.

('write', 0.68882988287653213)
('gate', 0.48439127793177872)
('do', 0.48969719844767445)
('stick', 0.428354972498245)
('document', 0.37425316663587266)
('list', 0.35665178907153694)
('use', 0.34857019081656077)
('help', 0.35102943408088871)
('phrase', 0.31505912954976761)
('stuff', 0.29834468122950747)
('answer', 0.30827976908653165)
('field', 0.28358292078786629)
('question', 0.2581860479796626)
('rider', 0.24992628766121872)
('id', 0.23594067440995611)
('pen', 0.22406136257520062)
('discussion', 0.20208053268321652)
('perl', 0.20208053268321652)
('step', 0.20208053268321652)
('com', 0.21165904043316788)
('work', 0.21036842875426234)
('it', 0.19649227620969983)
('man', 0.17360376667804922)
('saddle', 0.17688068828682652)
('key', 0.17324510537102999)
('description', 0.16763713511304498)
('link', 0.15386522247401335)
('http', 0.15386522247401335)
('java', 0.15386522247401335)
("person 's goal", 0.15386522247401335)
('place', 0.1637051924071676)
('passport', 0.14591808978237122)
('desk', 0.15548756473305567)
('helmet', 0.14335352401100654)
('paper', 0.14167240373955273)
('typewriter', 0.13035173990309293)
('which', 0.12956439557354296)
('take place', 0.12951759566589174)
('generate', 0.12951759566589174)
('google', 0.13871378043049046)
('uk', 0.13871378043049046)
('uniform', 0.14620284421347129)
('pencil', 0.13619612345723794)
('do work', 0.12951759566589174)
('something', 0.11718800287048317)
('think', 0.12271809818344746)
('consist', 0.12951759566589174)
('take place of http', 0.12951759566589174)
('get long', 0.12951759566589174)
('use gate', 0.12951759566589174)
('index small corpus', 0.12951759566589174)
('index small corpus of documents--extracting index term', 0.12951759566589174)
('share between document', 0.12951759566589174)
('need index', 0.12951759566589174)
('be in non-binary form', 0.12951759566589174)
('parse in perl', 0.12951759566589174)
('generate html', 0.12951759566589174)
('index link', 0.12951759566589174)
('be able', 0.12951759566589174)
('improve keyword list', 0.12951759566589174)
('improve keyword list after it', 0.12951759566589174)
('improve list', 0.12951759566589174)
('improve list for index generation', 0.12951759566589174)
('know field', 0.12951759566589174)
('know field of nlp', 0.12951759566589174)
('know gate framework', 0.12951759566589174)
('know java', 0.12951759566589174)
('assemble thing', 0.12951759566589174)
('need guidance', 0.12951759566589174)
('help from person', 0.12951759566589174)
('know stuff', 0.12951759566589174)
('be possible', 0.12951759566589174)
("accomplish person 's goal", 0.12951759566589174)
('help if person', 0.12951759566589174)
('not', 0.11425807404517607)
('guidance', 0.12201949563929163)
('get pay', 0.10458377205627771)
('long', 0.11425807404517607)
('machine', 0.10374805110153379)
('communicate', 0.10560132379302133)
('skate', 0.08314228749939831)
('of step', 0.090662316966124215)
('natural', 0.090089146988051608)
('student', 0.099896869766910507)
('writer', 0.097314876207260509)
('very', 0.10104026634160826)
('cloud', 0.097322830515959802)
('threadview', 0.1036140765327134)
('388902', 0.1036140765327134)
('general architecture', 0.1036140765327134)
('text engineering http', 0.1036140765327134)
('ac', 0.1036140765327134)
('small natural language processing task', 0.1036140765327134)
('small corpus', 0.1036140765327134)
('documents--extracting index term', 0.1036140765327134)
('proper name and significant key word', 0.1036140765327134)
('index', 0.1036140765327134)
('non-binary form', 0.1036140765327134)
('html', 0.1036140765327134)
('original document', 0.1036140765327134)
('keyword list', 0.1036140765327134)
('index generation', 0.1036140765327134)
('gate framework', 0.1036140765327134)
("be say i'm", 0.1036140765327134)
('decent programmer', 0.1036140765327134)
('thing', 0.1036140765327134)
('who', 0.1036140765327134)
("'answer'", 0.1036140765327134)
('1', 0.1036140765327134)
('assurance', 0.1036140765327134)
('what', 0.1036140765327134)
('other approach', 0.1036140765327134)
('2', 0.1036140765327134)
('3', 0.1036140765327134)
('nlp', 0.1036140765327134)

Subject: Re: help with NLP task using GATE
From: andyt-ga on 23 Aug 2004 15:43 PDT

Sorry, that url is http://web.media.mit.edu/~hugo/conceptnet/.

Subject: Re: help with NLP task using GATE
From: newbie99-ga on 24 Aug 2004 04:54 PDT

Here are my main reservations about what you've posted:

1) The output you provided includes a bunch of words and concepts not
mentioned in the input text (write, stick, rider, pen, saddle,
passport, helmet, cloud).
2) Of the relevant words that are included, the order doesn't come
close to their order of actual relevance in the input document.
3) Indexing of proper names is particularly important to me.  The
software should have at least recognized that GATE and NLP, being in
all caps, had some special significance.
4) The software should know not to index excessively common words like
do, use, it, very, thing, who.
5) One part of the output has to include an index.  So if you had
processed multiple documents, the second line of your output might
say, "('gate', 0.484..., doc1: 1 occurrence: offset 137; doc8: 3
occurrences: offsets 20,300,502)"
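
A minimal sketch of the kind of index described in point 5, written in
Python.  It is not output from GATE or ConceptNet.  The stop list, the
function names build_index and format_entry, the sample documents, and
the tab-separated output layout are all assumptions made for
illustration.

```python
import re
from collections import defaultdict

# Hypothetical stop list (see point 4); a real one would be much longer.
STOP_WORDS = {"do", "use", "it", "very", "thing", "who"}


def build_index(docs):
    """Map each term to {doc_name: [character offsets]}.

    docs is a dict of {doc_name: text}; the names are illustrative.
    """
    index = defaultdict(lambda: defaultdict(list))
    for name, text in docs.items():
        for match in re.finditer(r"[A-Za-z][A-Za-z'-]*", text):
            term = match.group().lower()
            if term in STOP_WORDS:
                continue
            index[term][name].append(match.start())
    return index


def format_entry(term, postings):
    """Render one index entry as a single plain-text line."""
    parts = [
        f"{name}:{','.join(str(o) for o in offsets)}"
        for name, offsets in sorted(postings.items())
    ]
    return f"{term}\t{' '.join(parts)}"


docs = {
    "doc1": "I want to use GATE for indexing.",
    "doc8": "GATE is written in Java. GATE finds names.",
}
index = build_index(docs)
print(format_entry("gate", index["gate"]))
```

For the two sample documents, the printed entry for "gate" lists offset
14 in doc1 and offsets 0 and 25 in doc8.  One term per line, with a tab
after the term, keeps the file non-binary and easy to split in Perl.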

I don't know for sure, but I suspect that the GATE software could
address these issues.  That being said, I certainly would prefer
something written in Python over something written in Java, so if
ConceptNet can address these issues, it would be a good solution.

Subject: Re: help with NLP task using GATE
From: newbie99-ga on 26 Aug 2004 10:55 PDT

I'm really not having much luck getting this question answered.  Does
anyone have any general advice about how I should go about getting
help (or directly about keyword extraction)?

Subject: Re: help with NLP task using GATE
From: andyt-ga on 26 Aug 2004 13:10 PDT

Hello.  I apologize for the long delay since my last response.  I
have another interpretation of keywords (unranked this time) for the
original post which addresses most of the issues.  You can find the
keywords here: http://www.tunebounce.com/ga/keywords.txt  There are
fewer of the words that seemed to come out of nowhere; I honestly don't
know why some of them got in there.  For the #5 request about indexes,
I think it is possible to write a script in any scripting language
that takes as input the keyword list and original text, and finds what
line numbers each match is on.  This seems like the easiest way,
compared to editing the ConceptNet source code.  As for #3, it
recognized both GATE and NLP.
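
A script of the kind described for #5 might look like the sketch below.
It is only an illustration: the function name keyword_lines, the sample
text, and the choice of case-insensitive whole-word matching are
assumptions, since the thread does not say how a match should be
defined.

```python
import re


def keyword_lines(text, keywords):
    """Return {keyword: [1-based line numbers where it occurs]}.

    Matching is case-insensitive and on whole words (an assumption).
    """
    hits = {kw: [] for kw in keywords}
    patterns = {
        kw: re.compile(r"\b" + re.escape(kw) + r"\b", re.IGNORECASE)
        for kw in keywords
    }
    for lineno, line in enumerate(text.splitlines(), start=1):
        for kw, pattern in patterns.items():
            if pattern.search(line):
                hits[kw].append(lineno)
    return hits


sample = (
    "I would like help using GATE\n"
    "for a small NLP task.\n"
    "GATE is written in Java."
)
result = keyword_lines(sample, ["gate", "nlp", "perl"])
print(result)
```

A keyword that never matches keeps an empty list, so it is easy to spot
terms that the extractor produced but that do not occur in the text.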

If someone else wants to jump in and answer this, that would be great.
 I'm definitely not an expert, this is just an interesting problem I
think.  In fact if someone here has past experience in this, I would
welcome their response simply for the selfish reason to learn more
about this.   Otherwise let me know what needs to be done for this to
be considered answered.

Regards,
Andyt-ga

Subject: Re: help with NLP task using GATE
From: newbie99-ga on 26 Aug 2004 13:34 PDT

Andy,  that definitely looks like a more usable list.  On the indexing
issue (#5) I think you're right.  It's easy enough to index back to
the document.  Two remaining concerns:

1) Proper names.  In the little time I played with GATE, I saw that it
was actually able to identify names as being the names of people. 
This would be extremely valuable to me.

2) I'm afraid of being overwhelmed by the number of keywords generated
(which is why I didn't just want a concordance or n-grams).  So some
method of saying that certain words/phrases have special significance
in the CORPUS (not just in an individual document) would be very
helpful.  A keyword/phrase is more interesting if it appears across a
few documents, but more likely to be just some common English word if
it appears in most or all documents.  I'm not very clear on how to
deal with this concern--hand editing, yes, but if I'm going to have to
hand edit tens of thousands of terms, the program won't have helped me
much.
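
One way to express this concern in code is to filter on document
frequency: keep a keyword only if it shows up in more than one document
but not in most of them.  The sketch below is a hedged illustration in
Python.  The function name corpus_keywords, the sample data, and the
two thresholds are assumptions, not tuned values or anything GATE
provides.

```python
def corpus_keywords(doc_keywords, min_docs=2, max_fraction=0.5):
    """Keep keywords found in at least min_docs documents but in no
    more than max_fraction of the corpus.

    doc_keywords maps a document name to the keywords found in it.
    Returns {keyword: number of documents containing it}.
    """
    total = len(doc_keywords)
    doc_freq = {}
    for keywords in doc_keywords.values():
        # set() so a keyword counts once per document.
        for kw in set(keywords):
            doc_freq[kw] = doc_freq.get(kw, 0) + 1
    return {
        kw: count
        for kw, count in doc_freq.items()
        if count >= min_docs and count / total <= max_fraction
    }


sample = {
    "d1": ["gate", "index", "the"],
    "d2": ["gate", "perl", "the"],
    "d3": ["corpus", "the"],
    "d4": ["java", "the"],
}
kept = corpus_keywords(sample)
print(kept)
```

In the sample, "the" is dropped because it occurs in every document,
the single-document terms are dropped as too rare, and only "gate"
survives.  The surviving list is the part that would still need hand
editing.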

Thanks for sticking with this.

Subject: Re: help with NLP task using GATE
From: andyt-ga on 26 Aug 2004 16:09 PDT

1) I tried another sample with some proper names.  I think it did OK. 
Check http://www.tunebounce.com/ga/samplewithnames.txt
2) I think it would be possible to write a script to find the
uniqueness of each keyword.  How are you storing the texts, in files
or a database?

General algorithm for keyword list post-processing:
- Input each keyword list from each text into a data structure.
- Strip out a list of common words, maybe the 100 most common English words.
- For each word in each keyword list, check if the word occurs anywhere
else by looping through every other keyword list.  Each time there is
a match, increment a counter for that keyword.
- When you have all the keywords through the previous process, you can
do some analyzing:
   - how many keywords appear once in a document, twice, n times?
   - average number of documents a keyword appears in
   - for a given keyword, find a uniqueness rating from 0 to 1 with 0
being very unique and 1 being very common.  Take the max number of
times a keyword appears in a document, and the min number of times and
use: uniqueness = samplekeyword/(max-min)
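
The steps above can be sketched in Python as follows.  This is an
illustration under stated assumptions, not the script used to produce
the posted samples.  The names summarize and COMMON_WORDS and the
sample lists are hypothetical, the counts are taken across documents,
and the rating uses the fraction of documents containing the keyword as
a stand-in for the formula above, since that formula does not define
samplekeyword precisely.

```python
from collections import Counter

# Hypothetical stand-in for "the 100 most common English words".
COMMON_WORDS = {"the", "of", "and", "a", "to", "in", "is", "it", "do", "use"}


def summarize(keyword_lists):
    """keyword_lists maps a document name to its list of keywords."""
    # Lowercase, drop duplicates within a document, strip common words.
    cleaned = {
        doc: {kw.lower() for kw in kws} - COMMON_WORDS
        for doc, kws in keyword_lists.items()
    }
    # Number of documents each keyword appears in.
    doc_count = Counter(kw for kws in cleaned.values() for kw in kws)
    total_docs = len(cleaned)
    # How many keywords appear in exactly 1, 2, ... n documents.
    histogram = Counter(doc_count.values())
    # Average number of documents a keyword appears in.
    average = sum(doc_count.values()) / len(doc_count) if doc_count else 0.0
    # Stand-in rating: near 0 = rare (unique), 1 = in every document.
    rating = {kw: n / total_docs for kw, n in doc_count.items()}
    return doc_count, histogram, average, rating


lists = {
    "a.txt": ["GATE", "index", "the"],
    "b.txt": ["gate", "Perl"],
    "c.txt": ["gate", "index", "use"],
}
doc_count, histogram, average, rating = summarize(lists)
print(doc_count.most_common())
print(dict(histogram), average)
```

A Counter replaces the explicit loop over every other keyword list; it
yields the same per-keyword totals in one pass.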

I could work on coming up with the above scripts, but it may take some
time.  Alternatively I could write a quick how-to guide on generating
keyword lists like the samples I have posted with conceptnet, and then
describe the algorithms so that you could implement them.

Regards,
Andyt-ga

Subject: Re: help with NLP task using GATE
From: newbie99-ga on 26 Aug 2004 18:07 PDT

Andy, the new sample doesn't handle names the way I was talking about.
 GATE identifies names AS names.  It actually pulls out words and
classifies them in different categories: personal names, addresses,
dates, other things.

I feel bad.  It seems like you've done a fair amount of work, but the
work is not helping me much.

I've ordered a copy of 
         Foundations of Statistical Natural Language Processing
         by Manning and Schutze,
         MIT Press. Cambridge, MA: May 1999. 
         book site: http://www-nlp.Stanford.EDU/fsnlp/
which will hopefully help me better understand and describe and maybe
accomplish what I'm trying to do.

I appreciate that you're working on this as an interesting problem. 
But if you are hoping to get paid, this is what I could offer:  If you
give me the script(s) you used to produce those two lists, I could pay
$20.  It doesn't really do what I want, but presumably it would be
easy to use and could stand in until I'm able to replace it. 
Alternatively, you could help me figure out how to use GATE and I'll
pay the full $100.

Thanks,
Sigfried

Subject: Re: help with NLP task using GATE
From: andyt-ga on 26 Aug 2004 21:57 PDT

OK, no problem.  I'll leave this open to someone else.  Good luck!
