Google Answers: Text import/cleanup tool needed

Q: Text import/cleanup tool needed ( No Answer, 4 Comments )

Question

Subject: Text import/cleanup tool needed
Category: Computers > Software
Asked by: sherpaj-ga
List Price: $22.00

Posted: 21 Apr 2003 16:09 PDT
Expires: 21 May 2003 16:09 PDT
Question ID: 193543

We are experimenting with software that analyzes written text. The program only accepts text files, or you can copy/paste from Word and it will only recognize the text. You need to give it clean text files: no junk characters, extra line breaks, etc. We have a folder full of various articles in the following formats: PDF, text, and HTML. We want to run those articles through the program but have run into an unexpected snag. Copying the text into Word, removing the extra line breaks, cleaning it up a bit (not often needed), and saving it back out as a text file is turning out to take a long time. PDFs are turning out to be a nightmare. You have to copy individual blocks of text. Is there some inexpensive software that will automate most of this? If it could do an “export” on the PDFs, that would be even better. We are on WinXP, Acrobat 5, and Office XP. Thanks in advance,
Request for Question Clarification by pafalafa-ga on 21 Apr 2003 16:16 PDT Hello, It's amazing, isn't it, that the simple transfer of text is still a substantial road block in computing systems. You said that, for PDFs, "You have to copy individual blocks of text". Can you clarify this? It should be possible to select the text of the entire document and copy it to a text-only program. What is the particular obstacle you are having with PDF?
Request for Question Clarification by chris2002micrometer-ga on 21 Apr 2003 20:53 PDT I wrote a BASIC program to do just that. It strips HTML from text to load bus schedules from the local transit authority into my "Bus Wizard" that plans trips. I'd be delighted to send the EXE and the source (after I locate it) for the posted price. I believe it would be useful to you but I really don't want to get kicked around about my response/answer for 22 bucks. It does not handle PDFs (yet).
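
For readers who want to try the HTML-stripping idea themselves, here is a minimal Python sketch. It is not the BASIC program mentioned above, which was never posted; the names `TextExtractor` and `html_to_text` are made up for illustration, and only the standard library is assumed.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects the text of an HTML document, skipping script/style bodies."""

    def __init__(self):
        super().__init__(convert_charrefs=True)  # turns &amp; into & etc.
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def html_to_text(html):
    """Return the text content of an HTML string with all tags removed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)
```

Note that this sketch does not insert line breaks for block tags such as `<p>` or `<br>`, so real pages may still need some whitespace cleanup afterwards.
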
Request for Question Clarification by webadept-ga on 24 Apr 2003 00:03 PDT pafalafa-ga, what he is talking about with the paragraphs is that even if the paragraphs end with a double return inside the PDF file, the double returns are gone when you paste the text into a text program such as Notepad or vi. I'm very aware of that huge drawback in PDF files. I made the mistake myself of putting reams of text data in PDF format that I couldn't do anything with for a long time. sherpaj-ga: Can you install Perl on your system? Or do you know enough to work a DOS prompt program if I sent you one? webadept-ga
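
When the blank lines between paragraphs are lost on paste, one rough way to get them back is to guess: a line that ends a sentence and is clearly shorter than the widest line probably ends a paragraph. The Python sketch below shows that heuristic only. The function name `guess_paragraphs` and the 0.8 threshold are assumptions for illustration, and the guess will be wrong for some texts (headings, short lines in the middle of a paragraph).

```python
def guess_paragraphs(text, short_ratio=0.8):
    """Rebuild paragraphs from hard-wrapped text that has lost its blank lines.

    Heuristic (an assumption, not a rule): a line ends a paragraph when it
    ends with sentence punctuation AND is shorter than short_ratio times
    the longest line in the text.
    """
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if not lines:
        return []
    width = max(len(ln) for ln in lines)
    paragraphs, current = [], []
    for ln in lines:
        current.append(ln)
        ends_sentence = ln.endswith((".", "!", "?", '."', '!"', '?"'))
        if ends_sentence and len(ln) < short_ratio * width:
            paragraphs.append(" ".join(current))
            current = []
    if current:  # whatever is left forms the last paragraph
        paragraphs.append(" ".join(current))
    return paragraphs
```

Writing the result back out with `"\n\n".join(paragraphs)` gives one paragraph per block with no line breaks inside it.
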
Clarification of Question by sherpaj-ga on 25 Apr 2003 01:19 PDT Here are my clarifications:

Request for Question Clarification by pafalafa-ga: PDFs often have callouts (in a frame), pictures, columns, etc. These things always cause me grief when I try to do a copy/paste into a text editor. The callout gets inserted in weird ways with the text. I just want the raw text to export, no pictures and such. I agree, it is amazing that the simple transfer of text is still a substantial road block in computing systems. Well said.

Request for Question Clarification by chris2002micrometer-ga: Please let me know when it does PDF. Sounds like a great program. Thanks.

Request for Question Clarification by webadept-ga: Thanks for the offer, but Perl and DOS stuff is more intense than I want to go. I need something that is easy.

From: smudgy-ga: Thanks, I’ll check it out and give you a shout if it is simple to use and meets my needs.

From: anti_virus-ga: "What exactly is your budget?" Under $50 for the software.
Request for Question Clarification by pafalafa-ga on 25 Apr 2003 06:36 PDT Sherpaj-ga, Thanks for clarifying things (and thanks to my colleagues for their observations as well). One more question about PDF. I understand that if you select an entire document, and copy and paste it, you end up with a bunch of garbage from images, embedded codes, etc. But if you paste it in text-only format, then the non-ASCII junk should all disappear. On the one hand, fonts, formatting, etc. are all stripped out, but on the other hand, all the garbage is stripped out as well. All that remains is the bare-bones text. There still remains the problem of PDF columns, which don't "map" properly when they're copied into simple text documents. But the rest of the problems should be pretty much taken care of. Am I missing something here?
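
The "text-only" pass described above can be approximated in a few lines of Python. This is a sketch under assumptions: the function name `to_plain_ascii` and the replacement table are invented for illustration, and the choice to keep only printable ASCII plus newlines and tabs is one possible definition of "clean". It does nothing about the column-order problem.

```python
import unicodedata

# Common typographic characters mapped to plain equivalents (illustrative list).
REPLACEMENTS = {
    "\u2018": "'", "\u2019": "'",   # curly single quotes
    "\u201c": '"', "\u201d": '"',   # curly double quotes
    "\u2013": "-", "\u2014": "-",   # en and em dashes
    "\u00a0": " ",                  # non-breaking space
}


def to_plain_ascii(text):
    """Reduce text to printable ASCII, keeping newlines and tabs."""
    for src, dst in REPLACEMENTS.items():
        text = text.replace(src, dst)
    # NFKD splits accented letters into base letter + accent and expands
    # ligatures (for example the single-character "fi" ligature).
    text = unicodedata.normalize("NFKD", text)
    kept = []
    for ch in text:
        if ch in "\n\t" or 32 <= ord(ch) < 127:
            kept.append(ch)
    return "".join(kept)
```

Control characters such as form feeds are dropped, and accented letters lose their accents rather than disappearing entirely.
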

Answer

There is no answer at this time.

Comments

Subject: Re: Text import/cleanup tool needed
From: smudgy-ga on 21 Apr 2003 17:17 PDT

There's a simple text editing program called Kedit which runs on the PC
and has lots of powerful semi-automatic tools. For instance, you could
set it up to strip the first five or seven or ten leading characters
from each line of a text file, or delete every string bounded by
pointy brackets.
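
The two operations mentioned here are easy to picture in code. The Python sketch below is not Kedit syntax; it only illustrates the same two edits, and the function names `strip_leading` and `delete_bracketed` are hypothetical.

```python
import re


def strip_leading(text, n):
    """Remove the first n characters from every line of text."""
    return "\n".join(line[n:] for line in text.split("\n"))


def delete_bracketed(text):
    """Delete every string bounded by pointy brackets, brackets included."""
    return re.sub(r"<[^<>]*>", "", text)
```

For example, `strip_leading("0001 alpha\n0002 beta", 5)` drops a five-character line-number prefix from each line, and `delete_bracketed("a <b>bold</b> word")` leaves only the words.
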

I believe it can also set up these commands in macro format so you
could set up all the tidying you want to do in a single macro and zap
the text files one by one.
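
The "single macro" idea amounts to running a fixed list of cleanup steps over every file in a folder. A rough Python equivalent is sketched below, assuming the inputs are already `.txt` files. The names `clean_folder`, `trim_trailing_spaces`, and `collapse_blank_lines` are made up for illustration, and the two steps are only examples of what a tidy-up list might contain.

```python
import re
from pathlib import Path


def trim_trailing_spaces(text):
    """Remove spaces and tabs at the end of each line."""
    return re.sub(r"[ \t]+$", "", text, flags=re.MULTILINE)


def collapse_blank_lines(text):
    """Reduce runs of blank lines to a single blank line."""
    return re.sub(r"\n{3,}", "\n\n", text)


def clean_folder(src_dir, dst_dir, steps):
    """Apply each step in order to every .txt file in src_dir.

    Cleaned copies are written to dst_dir under the same file name, so the
    originals are left untouched. Returns the list of files written.
    """
    src, dst = Path(src_dir), Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    written = []
    for path in sorted(src.glob("*.txt")):
        text = path.read_text(encoding="utf-8", errors="replace")
        for step in steps:
            text = step(text)
        out = dst / path.name
        out.write_text(text, encoding="utf-8")
        written.append(out)
    return written
```

A call such as `clean_folder("articles", "cleaned", [trim_trailing_spaces, collapse_blank_lines])` would process a whole folder in one go; the folder names here are placeholders.
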

I don't know whether this is exactly what you need, nor do I know too
many of the details of the program, but you may want to look into it.

Their web page is
www.kedit.com

Good luck,
smudgy

Google search: "kedit"

Subject: Re: Text import/cleanup tool needed
From: anti_virus-ga on 22 Apr 2003 06:37 PDT

What exactly is your budget?

Subject: Re: Text import/cleanup tool needed
From: sherpaj-ga on 25 Apr 2003 01:21 PDT

Kedit costs over $150 and looks complicated from the written
description. It won't work for me.

Subject: Re: Text import/cleanup tool needed
From: srik123-ga on 19 May 2003 03:42 PDT

The best tool is probably TextPipe Pro
(http://www.crystalsoftware.com.au/textpipe/features.html). I have
used it extensively and can vouch for it. It's easy to use and is also
stable.
