Q: Word Analysis ( Answered,   3 Comments )
Question  
Subject: Word Analysis
Category: Science > Technology
Asked by: broker-ga
List Price: $30.00
Posted: 26 Aug 2003 02:43 PDT
Expires: 25 Sep 2003 02:43 PDT
Question ID: 248829
I need to find a service, ideally a free one, that can analyze a
database or document and summarize by percentages the top 1-word,
2-word and 3-word phrases found.  To put this into perspective, I am
running a survey that will have lots of data and want to mine that
data for common language patterns.
Answer  
Subject: Re: Word Analysis
Answered By: pafalafa-ga on 26 Aug 2003 07:37 PDT
 
Hello broker-ga, 


I had to smile when I saw your question, because I've had a
long-standing interest in the very same topic -- analyzing phrase
frequencies in text -- and have been surprised over the years at how
frustratingly difficult it can be to find the right tools.

To cut to the chase, there is free software out there that will do the
job for you (and software will have to do...I don't think there's a
service in existence that will do the analysis for you at no charge).

However, be warned...the software is pretty cumbersome to get used to;
it's DOS-based, doesn't come with much documentation, and tends to use
the arcane lingo of linguistics, which is almost incomprehensible to
ordinary human beings.

Still, even though it's pretty dated at this point, I haven't found
anything better among the freeware offerings.

That said, some time ago I tinkered with the software for a while, and
then ran a phrase analysis on a downloadable text of the Bible.  Here
are some of the results: the most frequently occurring five-word (or
longer) phrases:

132  occurrences of the phrase:  of the children of israel 
 95  and the lord said to 
 91  the lord spoke to moses 
 83  and the lord spoke to 
 81  and the lord spoke to moses 
 72  the children of israel and 
 70  to the children of israel 
 66  the lord spoke to moses saying 
 63  the tabernacle of the testimony 
 63  and the lord spoke to moses saying 
 59  and the children of israel 
 59  is the family of the 
 55  as the lord had commanded 
 54  out of the land of 
 52  the lord said to moses 

The same could be done for any text document, and for any group of
n-word phrases, where n is 2, 3, 4, or any number of your choosing.
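For a sense of what any such tool is doing under the hood, here is a minimal sketch in Python (not TACT itself -- just an illustration of counting n-word phrases and reporting each as a percentage of the total, the way the question asks; the tokenizer and the sample text are my own):

```python
from collections import Counter
import re


def ngram_percentages(text, n, top=5):
    """Count n-word phrases and report each as a percentage of all n-grams."""
    # Crude tokenizer: lowercase, keep runs of letters and apostrophes.
    words = re.findall(r"[a-z']+", text.lower())
    # Every window of n consecutive words is one phrase.
    ngrams = [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]
    counts = Counter(ngrams)
    total = len(ngrams)
    return [(phrase, count, 100.0 * count / total)
            for phrase, count in counts.most_common(top)]


text = "the lord spoke to moses and the lord said to moses"
for phrase, count, pct in ngram_percentages(text, 2, top=3):
    print(f"{count:3d}  {pct:5.1f}%  {phrase}")
```

Run the same function with n=1, 2, and 3 to get the one-, two-, and three-word tables at once; on survey-sized data this runs in seconds.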

The software itself is called TACT -- Text Analysis Computing Tools --
and can be found here:

http://www.chass.utoronto.ca/cch/tact.html

Click on the links "Disk 1" and "Disk 2" to download the software
(shows you how long ago it was created -- they were still using
floppies!).  It might help to right-click and then choose "Save Target
As" on the pulldown menu.

There's a link on the page for "ordering information", but that's not
for the software (which is free); it's for documentation about the
program -- your call as to whether you want to spring for it or not.

Like I said, the software itself is very useful, but there *is* a
learning curve.  There's no way to fully talk you through the initial
stages -- best thing I can suggest is to simply ask me any questions
here.  Just post a "Request for Clarification" to let me know what
questions come up as you play with the software.  I'll do my best to
respond promptly.

One more thing.  Here's a link to other text analysis tools.  Again,
the lingo is hard to wade through, but the tools themselves are very
interesting...some of them may be of use to you:

http://www.sil.org/linguistics/computing.html 

Good luck in your ventures.  

pafalafa-ga

Search strategy: None -- used bookmarked sites and personal knowledge.

Request for Answer Clarification by broker-ga on 28 Aug 2003 05:32 PDT
Thanks for the detailed response.  Since I like having my cake and
eating it too :-), I should have clarified that I want it to be
extremely easy to use, which for me means Windows-based.  Can you find
something out there that's Windows-based and easy to use?

Thanks!

Clarification of Answer by pafalafa-ga on 28 Aug 2003 07:21 PDT
Broker-ga,

The text analysis you want to do is fairly sophisticated and fairly
arcane -- a combination that doesn't lend itself to software solutions
that are both easy and free.  As the comment from yosarian-ga notes,
you are really trying to do n-gram linguistic analysis, and
easy-to-use tools are hard to come by in this field.

That said, I urge you to try out the TACT software I found for you. 
Although it is DOS-based, it will run on your Windows system, so that
shouldn't be an obstacle (Windows itself, until recently, was
DOS-based as well).

However, if you want Windows alternatives, they are out there, but: 
(1) there's no reason to think they're any easier to use and (2)
they're not free.

I've no direct familiarity with these other programs, but you might
want to explore them if you're looking for options:

WordStat text analysis software at:

http://www.simstat.com/home.html


HyperRESEARCH at:

http://www.researchware.com/

These *probably* will do the trick, but I have no direct familiarity
with them, so I'm afraid I can't make any promises.

I really do think that the TACT software is your absolute best bet. 
Try it out, and if you run into any difficulties, just post a note
here to let me know, and I'll be glad to try to walk through the
set-up and use of the software.

Good luck.

pafalafa-ga
Comments  
Subject: Re: Word Analysis
From: yosarian-ga on 28 Aug 2003 00:30 PDT
 
Hi broker.
In case you have access to a unix/linux machine,
I suggest you use the CMU-CAM Language Modelling Toolkit:
http://mi.eng.cam.ac.uk/~prc14/toolkit.html
Its basic operation is to build n-grams (sorry for the lingo,
but that's what statistical natural language processing people
call what you're asking for).

Ted Pedersen's n-gram statistics package is another option.
It is distributed under the GPL and is written in Perl,
so it should work on any operating system that supports Perl:
http://www.d.umn.edu/~tpederse/nsp.html

A third, do-it-yourself approach to building n-gram frequencies
can be found in "Unix for Poets" by Kenneth Church (Unix again, sorry):
http://www.stanford.edu/class/cs224n/new_handouts/kwc-unix-for-poets.pdf
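Church's handout builds frequencies with a pipeline of small tools (roughly `tr | sort | uniq -c | sort -rn`, with a shifted copy of the word list pasted alongside the original to form bigrams). The same idea fits in a few lines of Python -- this is a sketch of the technique, not code from the handout, and the sample text is my own:

```python
import re
from collections import Counter

text = "to be or not to be that is the question"
words = re.findall(r"[a-z']+", text.lower())

# Pair each word with its successor -- the Python analogue of pasting
# the word list against a copy of itself shifted by one line -- then
# count the pairs, which replaces the sort | uniq -c step.
bigrams = Counter(" ".join(pair) for pair in zip(words, words[1:]))

for phrase, count in bigrams.most_common(3):
    print(count, phrase)
```

`most_common` does the final descending sort; change `zip(words, words[1:])` to a three-way zip for trigrams.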

For some theoretical background you can look at chapters 5 and 6 of:
Foundations of Statistical Natural Language Processing
by Manning and Schütze,
MIT Press, Cambridge, MA, May 1999.
Book site: http://www-nlp.Stanford.EDU/fsnlp/

HTH
yosarian-ga
Subject: Re: Word Analysis
From: newbie99-ga on 17 Aug 2004 11:25 PDT
 
Yosarian,

I posted a question to computers regarding NLP this morning that I
imagine you could help with.  Love to hear from you.

Thanks
Subject: Re: Word Analysis
From: taradfong-ga on 19 Nov 2004 13:48 PST
 
FWIW, I have a very simple script for doing word counting in Perl here...

http://www.mattwalsh.com/twiki/bin/view/Main/PerlWordCounter
