Subject: help with NLP task using GATE
Category: Computers
Asked by: newbie99-ga
List Price: $100.00
Posted: 22 Aug 2004 09:35 PDT
Expires: 21 Sep 2004 09:35 PDT
Question ID: 391093
[This question takes the place of http://answers.google.com/answers/threadview?id=388902 which got too long and confusing.]

I would like help using the GATE (General Architecture for Text Engineering, http://gate.ac.uk/) for a small natural language processing task. I want to index a small corpus of documents--extracting index terms such as proper names and significant key words/phrases shared between documents. I need the index to be in a non-binary form that I can parse in Perl to generate HTML indexes that link back to the original documents. I would also like to be able to manually improve the keyword list after it has been automatically generated and to use the manually improved list for index generation.

I don't know the field of NLP, I don't know the GATE framework, and I don't know Java (which it's written in) very well. That being said, I'm a decent programmer and will do the work of assembling the thing that I want--but I need some guidance and help from someone who knows this stuff better than I do.

An 'answer' to this question will consist of:
1) assurance that what I want to do is possible, and, if not, discussion of other approaches;
2) description of the steps needed to accomplish my goal;
3) help if I get stuck.
There is no answer at this time.
Subject: Re: help with NLP task using GATE
From: andyt-ga on 23 Aug 2004 15:41 PDT

I have a solution that doesn't use GATE. It uses ConceptNet (http://web.media.mit.edu/~hugo/)conceptnet/, which extracts keywords from documents using a commonsense database, a process described as topic-gisting. The keywords and percent match from the text of your question above are below. This is from a command line tool, written in Python, that could process all documents in a directory, or all tables in a database. If this is an acceptable answer, let me know and I'll post it.

('write', 0.68882988287653213)
('gate', 0.48439127793177872)
('do', 0.48969719844767445)
('stick', 0.428354972498245)
('document', 0.37425316663587266)
('list', 0.35665178907153694)
('use', 0.34857019081656077)
('help', 0.35102943408088871)
('phrase', 0.31505912954976761)
('stuff', 0.29834468122950747)
('answer', 0.30827976908653165)
('field', 0.28358292078786629)
('question', 0.2581860479796626)
('rider', 0.24992628766121872)
('id', 0.23594067440995611)
('pen', 0.22406136257520062)
('discussion', 0.20208053268321652)
('perl', 0.20208053268321652)
('step', 0.20208053268321652)
('com', 0.21165904043316788)
('work', 0.21036842875426234)
('it', 0.19649227620969983)
('man', 0.17360376667804922)
('saddle', 0.17688068828682652)
('key', 0.17324510537102999)
('description', 0.16763713511304498)
('link', 0.15386522247401335)
('http', 0.15386522247401335)
('java', 0.15386522247401335)
("person 's goal", 0.15386522247401335)
('place', 0.1637051924071676)
('passport', 0.14591808978237122)
('desk', 0.15548756473305567)
('helmet', 0.14335352401100654)
('paper', 0.14167240373955273)
('typewriter', 0.13035173990309293)
('which', 0.12956439557354296)
('take place', 0.12951759566589174)
('generate', 0.12951759566589174)
('google', 0.13871378043049046)
('uk', 0.13871378043049046)
('uniform', 0.14620284421347129)
('pencil', 0.13619612345723794)
('do work', 0.12951759566589174)
('something', 0.11718800287048317)
('think', 0.12271809818344746)
('consist', 0.12951759566589174)
('take place of http', 0.12951759566589174)
('get long', 0.12951759566589174)
('use gate', 0.12951759566589174)
('index small corpus', 0.12951759566589174)
('index small corpus of documents--extracting index term', 0.12951759566589174)
('share between document', 0.12951759566589174)
('need index', 0.12951759566589174)
('be in non-binary form', 0.12951759566589174)
('parse in perl', 0.12951759566589174)
('generate html', 0.12951759566589174)
('index link', 0.12951759566589174)
('be able', 0.12951759566589174)
('improve keyword list', 0.12951759566589174)
('improve keyword list after it', 0.12951759566589174)
('improve list', 0.12951759566589174)
('improve list for index generation', 0.12951759566589174)
('know field', 0.12951759566589174)
('know field of nlp', 0.12951759566589174)
('know gate framework', 0.12951759566589174)
('know java', 0.12951759566589174)
('assemble thing', 0.12951759566589174)
('need guidance', 0.12951759566589174)
('help from person', 0.12951759566589174)
('know stuff', 0.12951759566589174)
('be possible', 0.12951759566589174)
("accomplish person 's goal", 0.12951759566589174)
('help if person', 0.12951759566589174)
('not', 0.11425807404517607)
('guidance', 0.12201949563929163)
('get pay', 0.10458377205627771)
('long', 0.11425807404517607)
('machine', 0.10374805110153379)
('communicate', 0.10560132379302133)
('skate', 0.08314228749939831)
('of step', 0.090662316966124215)
('natural', 0.090089146988051608)
('student', 0.099896869766910507)
('writer', 0.097314876207260509)
('very', 0.10104026634160826)
('cloud', 0.097322830515959802)
('threadview', 0.1036140765327134)
('388902', 0.1036140765327134)
('general architecture', 0.1036140765327134)
('text engineering http', 0.1036140765327134)
('ac', 0.1036140765327134)
('small natural language processing task', 0.1036140765327134)
('small corpus', 0.1036140765327134)
('documents--extracting index term', 0.1036140765327134)
('proper name and significant key word', 0.1036140765327134)
('index', 0.1036140765327134)
('non-binary form', 0.1036140765327134)
('html', 0.1036140765327134)
('original document', 0.1036140765327134)
('keyword list', 0.1036140765327134)
('index generation', 0.1036140765327134)
('gate framework', 0.1036140765327134)
("be say i'm", 0.1036140765327134)
('decent programmer', 0.1036140765327134)
('thing', 0.1036140765327134)
('who', 0.1036140765327134)
("'answer'", 0.1036140765327134)
('1', 0.1036140765327134)
('assurance', 0.1036140765327134)
('what', 0.1036140765327134)
('other approach', 0.1036140765327134)
('2', 0.1036140765327134)
('3', 0.1036140765327134)
('nlp', 0.1036140765327134)
Subject: Re: help with NLP task using GATE
From: andyt-ga on 23 Aug 2004 15:43 PDT

Sorry, that URL is http://web.media.mit.edu/~hugo/conceptnet/.
Subject: Re: help with NLP task using GATE
From: newbie99-ga on 24 Aug 2004 04:54 PDT

Here are my main reservations about what you've posted:

1) The output you provided includes a bunch of words and concepts not mentioned in the input text (write, stick, rider, pen, saddle, passport, helmet, cloud).
2) Of the relevant words that are included, the order doesn't come close to their order of actual relevance in the input document.
3) Indexing of proper names is particularly important to me. The software should have at least recognized that GATE and NLP, being in all caps, had some special significance.
4) The software should know not to index excessively common words like do, use, it, very, thing, who.
5) One part of the output has to include an index. So if you had processed multiple documents, the second line of your output might say, "('gate', 0.484..., doc1: 1 occurrence: offset 137; doc8: 3 occurrences: offsets 20,300,502)"

I don't know for sure, but I suspect that the GATE software could address these issues. That being said, I certainly would prefer something written in Python over something written in Java, so if ConceptNet can address these issues, it would be a good solution.
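
The index described in point 5 can be sketched independently of either tool. The following is a rough illustration only, not part of ConceptNet or GATE: the function names `build_index` and `format_index`, the document ids, and the tab-separated output layout are all assumptions made for the example, and matching is a simple case-insensitive whole-word search.

```python
import re
from collections import defaultdict


def build_index(docs, keywords):
    """Map each keyword to {doc_id: [character offsets of its matches]}.

    docs     -- dict of doc_id -> text (ids such as "doc1" are hypothetical)
    keywords -- iterable of words or phrases to look up
    """
    index = defaultdict(dict)
    for kw in keywords:
        # Case-insensitive whole-word match; re.escape keeps phrases literal.
        pattern = re.compile(r"\b" + re.escape(kw) + r"\b", re.IGNORECASE)
        for doc_id, text in docs.items():
            offsets = [m.start() for m in pattern.finditer(text)]
            if offsets:
                index[kw][doc_id] = offsets
    return dict(index)


def format_index(index):
    """Render one tab-separated line per keyword and document (assumed layout)."""
    lines = []
    for kw in sorted(index):
        for doc_id in sorted(index[kw]):
            offsets = ",".join(str(o) for o in index[kw][doc_id])
            lines.append(f"{kw}\t{doc_id}\t{offsets}")
    return "\n".join(lines)


if __name__ == "__main__":
    docs = {
        "doc1": "GATE is a framework. I tried GATE once.",
        "doc8": "Nothing about that framework here.",
    }
    print(format_index(build_index(docs, ["gate", "framework"])))
```

A plain-text layout like this is non-binary and could be read back in Perl by splitting each line on tabs. Keywords that begin or end with punctuation would need a different boundary rule than `\b`.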
Subject: Re: help with NLP task using GATE
From: newbie99-ga on 26 Aug 2004 10:55 PDT

I'm really not having much luck getting this question answered. Does anyone have any general advice about how I should go about getting help (or directly about keyword extraction)? |
Subject: Re: help with NLP task using GATE
From: andyt-ga on 26 Aug 2004 13:10 PDT

Hello. I apologize for the long delay since my last response. I have another interpretation of keywords (unranked this time) for the original post, which addresses most of the issues. You can find the keywords here: http://www.tunebounce.com/ga/keywords.txt

There are fewer of the words that seemed to come out of nowhere; I honestly don't know why some of them got in there. For the #5 request about indexes, I think it is possible to write a script in any scripting language that takes the keyword list and original text as input and finds which line numbers each match is on. This seems like the easiest way, compared to editing the ConceptNet source code. As for #3, it recognized both GATE and NLP.

If someone else wants to jump in and answer this, that would be great. I'm definitely not an expert; I just think this is an interesting problem. In fact, if someone here has past experience in this, I would welcome their response, if only for the selfish reason that I'd like to learn more about it. Otherwise, let me know what needs to be done for this to be considered answered.

Regards,
Andyt-ga
Subject: Re: help with NLP task using GATE
From: newbie99-ga on 26 Aug 2004 13:34 PDT

Andy, that definitely looks like a more usable list. On the indexing issue (#5) I think you're right. It's easy enough to index back to the document.

Two remaining concerns:

1) Proper names. In the little time I played with GATE, I saw that it was actually able to identify names as being the names of people. This would be extremely valuable to me.
2) I'm afraid of being overwhelmed by the number of keywords generated (which is why I didn't just want a concordance or n-grams). So some method of saying that certain words/phrases have special significance in the CORPUS (not just in an individual document) would be very helpful. A keyword/phrase is more interesting if it appears across a few documents, but more likely to be just some common English word if it appears in most or all documents. I'm not very clear on how to deal with this concern--hand editing, yes, but if I'm going to have to hand edit tens of thousands of terms, the program won't have helped me much.

Thanks for sticking with this.
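
One way to express concern 2 in code is a document-frequency band: keep a term only if it shows up in more than one document but not in most of them. This is a minimal sketch under stated assumptions; the function name `corpus_keywords` and both threshold defaults are invented for illustration and would need tuning on a real corpus.

```python
from collections import Counter


def corpus_keywords(doc_terms, min_docs=2, max_fraction=0.5):
    """Keep terms found in at least min_docs documents but in no more
    than max_fraction of the corpus.

    doc_terms -- dict of doc_id -> iterable of terms found in that document
    Both thresholds are illustrative defaults, not recommended values.
    """
    n_docs = len(doc_terms)
    doc_freq = Counter()
    for terms in doc_terms.values():
        doc_freq.update(set(terms))  # count each term once per document
    return {
        term: df
        for term, df in doc_freq.items()
        if df >= min_docs and df / n_docs <= max_fraction
    }


if __name__ == "__main__":
    doc_terms = {
        "a": ["the", "gate", "index"],
        "b": ["the", "gate", "perl"],
        "c": ["the", "corpus"],
        "d": ["the", "python"],
    }
    # "the" is in every document and is dropped; terms seen in only one
    # document are dropped too, leaving only "gate".
    print(corpus_keywords(doc_terms))
```

A filter like this only shrinks the list that still has to be hand edited; it does not replace the hand editing.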
Subject: Re: help with NLP task using GATE
From: andyt-ga on 26 Aug 2004 16:09 PDT

1) I tried another sample with some proper names. I think it did OK. Check http://www.tunebounce.com/ga/samplewithnames.txt

2) I think it would be possible to write a script to find the uniqueness of each keyword. How are you storing the texts, in files or a database?

General algorithm for keyword list post-processing:
- Read the keyword list from each text into a data structure.
- Strip out a list of common words, maybe the 100 most common English words.
- For each word in each keyword list, check whether the word occurs anywhere else by looping through every other keyword list. Each time there is a match, increment a counter for that keyword.
- When all the keywords have been through the previous process, you can do some analysis:
  - How many keywords appear once in a document, twice, n times?
  - What is the average number of documents a keyword appears in?
  - For a given keyword, find a uniqueness rating from 0 to 1, with 0 being very unique and 1 being very common. Take the max number of times a keyword appears in a document, and the min number of times, and use: uniqueness = samplekeyword/(max-min)

I could work on coming up with the above scripts, but it may take some time. Alternatively, I could write a quick how-to guide on generating keyword lists like the samples I have posted with ConceptNet, and then describe the algorithms so that you could implement them.

Regards,
Andyt-ga
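
The post-processing steps above can be sketched roughly as follows. Several things here are assumptions rather than anything stated in the thread: the names `COMMON_WORDS`, `document_frequency` and `uniqueness` are made up, the stop list is a tiny stand-in, a `Counter` replaces the nested loop over every other keyword list, and the rating uses min-max scaling of the per-keyword document count, which is only one possible reading of the `samplekeyword/(max-min)` formula.

```python
from collections import Counter

# Tiny stand-in stop list; a real run would load a fuller list of common words.
COMMON_WORDS = {"do", "use", "it", "very", "thing", "who", "which"}


def document_frequency(keyword_lists):
    """Count, for each keyword, how many documents' keyword lists contain it."""
    counts = Counter()
    for keywords in keyword_lists.values():
        # Lowercase, de-duplicate within the document, drop common words.
        kept = {kw.lower() for kw in keywords} - COMMON_WORDS
        counts.update(kept)
    return counts


def uniqueness(counts):
    """Scale document counts to 0..1 (0 = rarest, 1 = most widespread)."""
    if not counts:
        return {}
    lo, hi = min(counts.values()), max(counts.values())
    if hi == lo:
        return {kw: 0.0 for kw in counts}
    return {kw: (n - lo) / (hi - lo) for kw, n in counts.items()}


if __name__ == "__main__":
    keyword_lists = {
        "doc1": ["GATE", "index", "do"],
        "doc2": ["gate", "perl"],
        "doc3": ["gate", "index", "corpus"],
    }
    counts = document_frequency(keyword_lists)
    # How many keywords appear in exactly 1, 2, ... documents.
    print(sorted(Counter(counts.values()).items()))
    # Average number of documents a keyword appears in.
    print(sum(counts.values()) / len(counts))
    print(uniqueness(counts))
```

Counting with a `Counter` gives the same per-keyword totals as comparing every list against every other one, without the quadratic loop.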
Subject: Re: help with NLP task using GATE
From: newbie99-ga on 26 Aug 2004 18:07 PDT

Andy, the new sample doesn't handle names the way I was talking about. GATE identifies names AS names. It actually pulls out words and classifies them in different categories: personal names, addresses, dates, other things.

I feel bad. It seems like you've done a fair amount of work, but the work is not helping me much. I've ordered a copy of Foundations of Statistical Natural Language Processing by Manning and Schütze (MIT Press, Cambridge, MA, May 1999; book site: http://www-nlp.Stanford.EDU/fsnlp/), which will hopefully help me better understand, describe, and maybe accomplish what I'm trying to do.

I appreciate that you're working on this as an interesting problem. But if you are hoping to get paid, this is what I could offer: If you give me the script(s) you used to produce those two lists, I could pay $20. It doesn't really do what I want, but presumably it would be easy to use and could stand in until I'm able to replace it. Alternatively, you could help me figure out how to use GATE and I'll pay the full $100.

Thanks,
Sigfried
Subject: Re: help with NLP task using GATE
From: andyt-ga on 26 Aug 2004 21:57 PDT

OK, no problem. I'll leave this open to someone else. Good luck!