Q: MS Word Document Data Mining ( No Answer,   7 Comments )
Question  
Subject: MS Word Document Data Mining
Category: Computers > Software
Asked by: steveb-ga
List Price: $10.00
Posted: 19 Jun 2002 10:19 PDT
Expires: 11 Jul 2002 09:48 PDT
Question ID: 29245
I have an interesting problem; I am hoping to automate extraction of
name, contact, and address information from hundreds of word documents
into a database or spreadsheet.  All the information is within the
first 15-20 lines of each document, below letterhead, and letter
header information (from, date, etc).  All the documents vary slightly
and do not use a template or standard layout.

Is there existing software or scripts, or some other means to pull
this data?  Any comments are appreciated!

Thanks!
 Steve
Answer  
There is no answer at this time.

Comments  
Subject: Re: MS Word Document Data Mining
From: chilledout-ga on 19 Jun 2002 11:54 PDT
 
Steve,
  Both Word and Excel support a scripting language called VBA, which
is a form of Visual Basic.  You could automate this using VBA, which
is not too difficult for someone with programming or scripting
experience.  The main problem you will encounter is inconsistencies
between different documents.  For example, how would the script know
the difference between a name and an address?  Without seeing your
exact documents I can't really determine how difficult it would be. 
Hope that helps!

Joshua
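
To illustrate the name-versus-address problem raised above, here is a rough sketch in Python (rather than VBA) of how a script might guess what each line contains. The patterns, the `classify_line` name, and the US-style phone and ZIP formats are all assumptions made for illustration; they would have to be tuned against the real letters.

```python
import re

# Hypothetical patterns, assuming US-style phone numbers and ZIP codes.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
PHONE = re.compile(r"\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}")
STREET = re.compile(
    r"^\d+\s+\w+.*\b(?:St|Street|Ave|Avenue|Rd|Road|Blvd|Dr|Drive|Ln|Lane)\b",
    re.IGNORECASE,
)
STATE_ZIP = re.compile(r"\b[A-Z]{2}\s+\d{5}(?:-\d{4})?\b")
# Optional title, first name, optional middle initial, last name.
NAME = re.compile(
    r"^(?:(?:Mr|Mrs|Ms|Dr)\.?\s+)?[A-Z][a-z]+(?:\s+[A-Z]\.?)?\s+[A-Z][a-z]+$"
)


def classify_line(line):
    """Return a best-guess label for one line of a letter header."""
    text = line.strip()
    if not text:
        return "blank"
    if EMAIL.search(text):
        return "email"
    if PHONE.search(text):
        return "phone"
    if STREET.search(text) or STATE_ZIP.search(text):
        return "address"
    if NAME.match(text):
        return "name"
    return "unknown"


if __name__ == "__main__":
    for sample in ["Mr. John Smith", "123 Main Street", "Springfield, IL 62704"]:
        print(sample, "->", classify_line(sample))
```

Lines that come back as "unknown" are the ones that would still need a manual look, which is where the inconsistencies between documents tend to show up.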
Subject: Re: MS Word Document Data Mining
From: steveb-ga on 19 Jun 2002 12:28 PDT
 
Thanks Joshua

That is my backup plan; I am hoping for existing software, as
programming such a script will take many hours.

Thanks! 
  Steve
Subject: Re: MS Word Document Data Mining
From: anand_suhana-ga on 20 Jun 2002 05:26 PDT
 
Mail me two sample documents and mention the fields.  I have done this
kind of thing before and will see if I can work out a very quick fix.

ANAND
artyfact_in@hotmail.com
Subject: Re: MS Word Document Data Mining
From: anand_suhana-ga on 20 Jun 2002 05:32 PDT
 
Also try a programme called 80:20.  I do not know whether it will
work, but it is worth a try.

ANAND
Subject: Re: MS Word Document Data Mining
From: ddent-ga on 20 Jun 2002 13:54 PDT
 
One method would be to use a utility such as catdoc
(http://www.ice.ru/~vitus/catdoc/), for the Linux operating system, to
extract the text from the files, and then to pipe the output through
'head' (a common utility included with most Linux distributions that
gives you the first n lines of a file).  From that point, you can
either add the information to the database manually or use a pattern
matching tool such as AAC
(http://www.patrice.ch/en/computer/programs/aac/aac.html) to identify
phone numbers and addresses automatically, at least to some extent.
If you want to spend some money on a commercial product (where the
vendor may be able to help get you going),
http://www.vedit.com/office-tasks.htm may be of use.

Hope this helps!
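
The catdoc-then-head approach above could be scripted roughly as follows. This is only a sketch: it assumes catdoc is installed and on the PATH, and the function names (`doc_to_text`, `head`, `write_rows`) and the one-row-per-document CSV layout are made up for illustration. Unlike the real 'head' utility, the `head` function here also skips blank lines.

```python
import csv
import subprocess
import sys


def doc_to_text(path):
    """Run catdoc on one file and return the extracted text.

    Assumes the catdoc command is installed and on the PATH.
    """
    result = subprocess.run(
        ["catdoc", path], capture_output=True, text=True, check=True
    )
    return result.stdout


def head(text, n=20):
    """Return the first n non-blank lines, with surrounding spaces removed."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    return lines[:n]


def write_rows(docs, out):
    """Write one CSV row per (name, text) pair: the name, then its top lines."""
    writer = csv.writer(out)
    for name, text in docs:
        writer.writerow([name] + head(text))


if __name__ == "__main__":
    # Usage: python extract.py letter1.doc letter2.doc > headers.csv
    docs = [(path, doc_to_text(path)) for path in sys.argv[1:]]
    write_rows(docs, sys.stdout)
```

The resulting CSV opens directly in a spreadsheet, so the remaining work is sorting each column into name, contact, and address fields, either by hand or with a pattern matching step.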
Subject: Re: MS Word Document Data Mining
From: shivakumar-ga on 23 Jun 2002 00:49 PDT
 
Hi Steve,

There is software that does exactly what you require.

It identifies all the contact information you need.  It is called
Address Grabber; to learn more about the product, see the website:
http://www.egrabber.com/addressgrabberdeluxe/addressgrabber_s.htm

They also offer a 15-day free trial:
http://www.egrabber.com/addressgrabberdeluxe/trial.htm

Rgds,
sivakumar.
Subject: Re: MS Word Document Data Mining
From: heinrich-ga on 26 Jun 2002 21:18 PDT
 
Email me a sample document; perhaps I can do it for you.

kochhw@intekom.co.za
