Q: Text import/cleanup tool needed (No Answer, 4 Comments)
Question  
Subject: Text import/cleanup tool needed
Category: Computers > Software
Asked by: sherpaj-ga
List Price: $22.00
Posted: 21 Apr 2003 16:09 PDT
Expires: 21 May 2003 16:09 PDT
Question ID: 193543
We are experimenting with software that analyzes written text.  The
program only accepts text files, or you can copy/paste from Word and
it will only recognize the text.  You need to give it clean text
files: no junk characters, extra line breaks, etc.

We have a folder full of various articles in the following formats:
PDF, text, and HTML.  We want to run those articles through the
program but have run into an unexpected snag.
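
A first step toward automating a mixed folder like this is simply to sort
the files by format, so each kind can get its own cleanup. The following is
a minimal Python sketch, not a finished tool. Python, the function name
group_by_format, and the extension table are all assumptions made for
illustration; nothing in this thread specifies them.

```python
from pathlib import Path

# Assumed mapping from file extension to the three formats mentioned above.
FORMATS = {".pdf": "pdf", ".txt": "text", ".htm": "html", ".html": "html"}


def group_by_format(folder):
    """Return file names in `folder`, grouped by article format."""
    groups = {"pdf": [], "text": [], "html": [], "other": []}
    for path in sorted(Path(folder).iterdir()):
        if path.is_file():
            # Compare extensions case-insensitively (REPORT.PDF vs report.pdf).
            kind = FORMATS.get(path.suffix.lower(), "other")
            groups[kind].append(path.name)
    return groups
```

Calling group_by_format on the articles folder gives one list per format;
the "other" list shows anything that would need manual handling.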

The time it takes to copy the text into Word, remove the extra line
breaks, clean it up a bit (not often needed), and save it back out as
a text file is turning out to be long.  PDFs are turning out to
be a nightmare.  You have to copy individual blocks of text.
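
The "remove the extra line breaks" step is the easiest part to script. The
sketch below is one possible approach in Python, assuming that a blank line
marks a real paragraph break and that any single line break inside a
paragraph is unwanted. The function name unwrap_paragraphs is hypothetical.

```python
import re


def unwrap_paragraphs(text: str) -> str:
    """Join hard-wrapped lines so each paragraph sits on one line."""
    # Normalize Windows and old Mac line endings to "\n".
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # One or more blank (or whitespace-only) lines count as a paragraph break.
    paragraphs = re.split(r"\n\s*\n", text)
    cleaned = []
    for para in paragraphs:
        # Collapse every run of whitespace, including single line breaks,
        # into one space.
        joined = " ".join(para.split())
        if joined:
            cleaned.append(joined)
    return "\n\n".join(cleaned) + "\n"
```

Text pasted from a PDF often loses its blank lines, in which case this
assumption fails and every paragraph would be merged into one.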

Is there some inexpensive software that will automate most of this? If
it could do an “export” on the PDFs, that would be even better.

We are on WinXP, Acrobat 5, and Office XP.

thanks in advance,

Request for Question Clarification by pafalafa-ga on 21 Apr 2003 16:16 PDT
Hello,

It's amazing, isn't it, that the simple transfer of text is still a
substantial road block in computing systems.

You said that, for PDFs, "You have to copy individual blocks of
text".  Can you clarify this?  It should be possible to select the
text of the entire document, and copy it to a text-only program.  What
is the particular obstacle with PDF that you are having?

Request for Question Clarification by chris2002micrometer-ga on 21 Apr 2003 20:53 PDT
I wrote a BASIC program to do just that. It strips HTML from text to
load bus schedules from the local transit authority into my "Bus
Wizard" that plans trips. I'd be delighted to send the EXE and the
source (after I locate it) for the posted price. I believe it would be
useful to you but I really don't want to get kicked around about my
response/answer for 22 bucks. It does not handle PDFs (yet).
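
For readers curious what stripping HTML from text involves, here is a
minimal sketch using Python's standard html.parser module. It is not the
BASIC program described above, whose source is not shown in this thread.
The class name TextExtractor, the function name strip_html, and the list of
tags treated as line breaks are assumptions for illustration.

```python
from html.parser import HTMLParser

# Assumed set of tags that should start a new line in the output.
BLOCK_TAGS = {"p", "br", "div", "tr", "li", "h1", "h2", "h3", "h4"}


class TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> contents."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
        elif tag in BLOCK_TAGS:
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def strip_html(html: str) -> str:
    """Return the text of an HTML document, one non-empty line per block."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    text = "".join(parser.parts)
    # Tidy whitespace inside each line and drop empty lines.
    lines = [" ".join(line.split()) for line in text.splitlines()]
    return "\n".join(line for line in lines if line)
```

Entities such as &amp; are decoded by the parser, so the output contains
the plain character instead.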

Request for Question Clarification by webadept-ga on 24 Apr 2003 00:03 PDT
pafalafa-ga, what he is talking about with the paragraphs is that even
if the paragraphs are ended with a double return inside the PDF file,
they are not when you paste them outside in a text program such as
Notepad or vi. I'm very aware of that huge drawback in PDF files. I
made the mistake myself in putting reams of text data in PDF format
that I couldn't do anything with for a long time.

sherpaj-ga: Can you install Perl on your system? Or do you know
enough to work a DOS prompt program if I sent you one?

webadept-ga

Clarification of Question by sherpaj-ga on 25 Apr 2003 01:19 PDT
Here are my clarifications:


Request for Question Clarification by pafalafa-ga 
PDFs often have callouts (in a frame), pictures, columns, etc.  These
things always cause me grief when I try to do a copy/paste into a text
editor.  The callout gets inserted in weird ways with the text.  I
just want the raw text to export, no pictures and such.

I agree, it is amazing that the simple transfer of text is still a
substantial road block in computing systems.  Well said.

 
Request for Question Clarification by chris2002micrometer-ga 
Please let me know when it does PDF.  Sounds like a great program.
Thanks.



 
Request for Question Clarification by webadept-ga 
Thanks for the offer, but Perl and DOS stuff is more intense than I
want to go. I need something that is easy.

 
 From: smudgy-ga 
Thanks, I’ll check it out and give you a shout if it is simple to use
and meets my needs.


From: anti_virus-ga: What exactly is your budget?
Under $50 for the software

Request for Question Clarification by pafalafa-ga on 25 Apr 2003 06:36 PDT
Sherpaj-ga,

Thanks for clarifying things (and thanks to my colleagues for their
observations as well).

One more question about PDF.  I understand that if you select an
entire document, and copy and paste it, you end up with a bunch of
garbage from images, embedded codes, etc.

But...if you paste it in text-only format, then the non-ASCII junk
should all disappear.  On the one hand, fonts, formatting, etc. are
all stripped out, but on the other hand, all the garbage is stripped
out as well.  All that remains is the bare-bones text.

There still remains the problem of PDF columns, which don't "map"
properly when they're copied into simple text documents.  But the rest
of the problems should be pretty much taken care of.  Am I missing
something here?
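
The "text-only" filtering described here can be approximated in a script.
The following Python sketch keeps printable ASCII plus newlines and tabs,
and drops everything else. The function name to_plain_ascii and the
punctuation table are assumptions for illustration; what Word or Notepad
actually does on paste may differ.

```python
import unicodedata

# Assumed replacements for common "smart" punctuation.
PUNCTUATION = {
    "\u2018": "'", "\u2019": "'",   # curly single quotes
    "\u201c": '"', "\u201d": '"',   # curly double quotes
    "\u2013": "-", "\u2014": "-",   # en and em dashes
    "\u2026": "...",                # ellipsis
    "\u00a0": " ",                  # non-breaking space
}


def to_plain_ascii(text: str) -> str:
    """Reduce text to printable ASCII, newlines, and tabs."""
    for fancy, plain in PUNCTUATION.items():
        text = text.replace(fancy, plain)
    # Decompose accented letters so the base letter survives the filter.
    text = unicodedata.normalize("NFKD", text)
    kept = []
    for ch in text:
        # Keep newline, tab, and the printable range from space to tilde.
        if ch in "\n\t" or " " <= ch <= "~":
            kept.append(ch)
    return "".join(kept)
```

This handles junk characters only. It does nothing about the column
ordering problem, which needs knowledge of the page layout.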
Answer  
There is no answer at this time.

Comments  
Subject: Re: Text import/cleanup tool needed
From: smudgy-ga on 21 Apr 2003 17:17 PDT
 
There's a simple text editing program called Kedit which runs on the PC
and has lots of powerful semi-automatic tools. For instance, you could
set it up to strip the first five or seven or ten leading characters
from each line of a text file, or delete every string bounded by
pointy brackets.

I believe it can also set up these commands in macro format so you
could set up all the tidying you want to do in a single macro and zap
the text files one by one.
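
To make the two operations above concrete, here is a rough Python
equivalent. It is not Kedit macro code, which is not shown in this thread.
The function names strip_leading_chars and delete_bracketed are
hypothetical.

```python
import re


def strip_leading_chars(text: str, count: int) -> str:
    """Remove the first `count` characters from every line."""
    return "\n".join(line[count:] for line in text.split("\n"))


def delete_bracketed(text: str) -> str:
    """Delete every string bounded by pointy brackets, brackets included."""
    # [^<>]* stops at the first closing bracket, so each pair is removed
    # separately and the text between pairs is kept.
    return re.sub(r"<[^<>]*>", "", text)
```

The bracket rule is crude: it treats any "<...>" as removable, so a
sentence such as "if a < b and c > d" would lose its middle.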

I don't know whether this is exactly what you need, nor do I know too
many of the details of the program, but you may want to look into it.

Their web page is
www.kedit.com

Good luck,
smudgy

Google search: "kedit"
Subject: Re: Text import/cleanup tool needed
From: anti_virus-ga on 22 Apr 2003 06:37 PDT
 
What exactly is your budget?
Subject: Re: Text import/cleanup tool needed
From: sherpaj-ga on 25 Apr 2003 01:21 PDT
 
Kedit costs over $150.  Looks complicated by the written description.  It
won't work for me.
Subject: Re: Text import/cleanup tool needed
From: srik123-ga on 19 May 2003 03:42 PDT
 
The best tool is probably TextPipe Pro
(http://www.crystalsoftware.com.au/textpipe/features.html). I have
used it extensively and can vouch for it. It's easy to use and is also
stable.
