We are experimenting with software that analyzes written text. The
program only accepts text files, or you can copy/paste from Word and
it will only recognize the text. You need to give it clean text
files: no junk characters, extra line breaks, etc.
We have a folder full of various articles in the following formats:
PDF, text, and HTML. We want to run those articles through the
program but have run into an unexpected snag.
The time it takes to copy the text into Word, remove the extra line
breaks, clean it up a bit (not often needed), and save it back out as
a text file is turning out to be considerable. PDFs are turning out to
be a nightmare. You have to copy individual blocks of text.
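For illustration, the line-break cleanup described above could be scripted rather than done by hand in Word. The following is only a minimal Python sketch, assuming the pasted text keeps a blank line between paragraphs; the function name unwrap_paragraphs is made up for this example and does not come from any existing tool.

```python
import re


def unwrap_paragraphs(text: str) -> str:
    """Join hard-wrapped lines, keeping blank lines as paragraph breaks.

    Assumption: paragraphs in the input are separated by at least one
    blank line. Text that has lost its blank lines needs another heuristic.
    """
    # Normalize Windows and old Mac line endings to "\n".
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # Split wherever one or more blank (or whitespace-only) lines occur.
    paragraphs = re.split(r"\n\s*\n", text)
    cleaned = []
    for para in paragraphs:
        # Collapse every run of whitespace, including single newlines,
        # into one space so each paragraph becomes a single line.
        joined = " ".join(para.split())
        if joined:
            cleaned.append(joined)
    return "\n\n".join(cleaned)


if __name__ == "__main__":
    sample = "The time it takes\nto copy the text.\n\n\nPDFs are a\r\nnightmare.\n"
    print(unwrap_paragraphs(sample))
```

Each paragraph comes out as one long line with a single blank line between paragraphs, which is roughly what the manual cleanup produces.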
Is there some inexpensive software that will automate most of this? If
it could do an export on the PDFs, that would be even better.
We are running WinXP, Acrobat 5, and Office XP.
thanks in advance, |
Request for Question Clarification by
pafalafa-ga
on
21 Apr 2003 16:16 PDT
Hello,
It's amazing, isn't it, that the simple transfer of text is still a
substantial road block in computing systems.
You said that, for PDFs, "You have to copy individual blocks of
text". Can you clarify this? It should be possible to select the
text of the entire document, and copy it to a text-only program. What
is the particular obstacle with PDF that you are having?
|
Request for Question Clarification by
chris2002micrometer-ga
on
21 Apr 2003 20:53 PDT
I wrote a BASIC program to do just that. It strips HTML from text to
load bus schedules from the local transit authority into my "Bus
Wizard" that plans trips. I'd be delighted to send the EXE and the
source (after I locate it) for the posted price. I believe it would be
useful to you but I really don't want to get kicked around about my
response/answer for 22 bucks. It does not handle PDFs (yet).
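For readers curious what stripping HTML down to text involves, here is a rough Python sketch using only the standard library's html.parser module. It is not the BASIC program mentioned above; the class name TextExtractor, the function html_to_text, and the list of tags treated as line breaks are all assumptions made for this example.

```python
from html.parser import HTMLParser

# Tags that usually start a new line of visible text (an assumed, partial list).
BLOCK_TAGS = {"p", "br", "div", "tr", "li", "h1", "h2", "h3", "h4", "h5", "h6"}


class TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> contents."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
        elif tag in BLOCK_TAGS:
            # Start block-level elements on a fresh line.
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def html_to_text(html: str) -> str:
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    # Tidy whitespace inside each line and drop lines that end up empty.
    lines = (" ".join(line.split()) for line in "".join(parser.parts).splitlines())
    return "\n".join(line for line in lines if line)


if __name__ == "__main__":
    page = "<html><body><p>Route <b>12</b> &amp; 14</p><p>Departs 8:05</p></body></html>"
    print(html_to_text(page))
```

Real-world pages with tables or nested layout would likely need more handling than this sketch provides.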
|
Request for Question Clarification by
webadept-ga
on
24 Apr 2003 00:03 PDT
pafalafa-ga, what he is talking about with the paragraphs is that even
if the paragraphs are ended with a double return inside the PDF file,
they are not when you paste them outside in a text program such as
Notepad or vi. I'm very aware of that huge drawback in PDF files. I
made the mistake myself in putting reams of text data in PDF format
that I couldn't do anything with for a long time.
sherpaj-ga: Can you install Perl on your system? Or do you know
enough to work a DOS prompt program if I sent you one?
webadept-ga
|
Clarification of Question by
sherpaj-ga
on
25 Apr 2003 01:19 PDT
Here are my clarifications:
Request for Question Clarification by pafalafa-ga
PDFs often have callouts (in a frame), pictures, columns, etc. These
things always cause me grief when I try to do a copy/paste into a text
editor. The callout gets inserted in weird ways with the text. I
just want the raw text to export, no pictures and such.
I agree, it is amazing that the simple transfer of text is still a
substantial road block in computing systems. Well said.
Request for Question Clarification by chris2002micrometer-ga
Please let me know when it does PDF. Sounds like a great program.
Thanks.
Request for Question Clarification by webadept-ga
Thanks for the offer, but Perl and DOS stuff is more intense than I
want to go. I need something that is easy.
From: smudgy-ga
Thanks, I'll check it out and give you a shout if it is simple to use
and meets my needs.
From: anti_virus-ga- What exactly is your budget?
Under $50 for the software
|
Request for Question Clarification by
pafalafa-ga
on
25 Apr 2003 06:36 PDT
Sherpaj-ga,
Thanks for clarifying things (and thanks to my colleagues for their
observations as well).
One more question about PDF. I understand that if you select an
entire document, and copy and paste it, you end up with a bunch of
garbage from images, embedded codes, etc.
But...if you paste it in text-only format, then the non-ASCII junk
should all disappear. On the one hand, fonts, formatting, etc. are
all stripped out, but on the other hand, all the garbage is stripped
out as well. All that remains is the bare-bones text.
There still remains the problem of PDF columns, which don't "map"
properly when they're copied into simple text documents. But the rest
of the problems should be pretty much taken care of. Am I missing
something here?
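As a rough illustration of what stripping the non-ASCII junk could look like if done in a script instead of through a text-only paste, here is a small Python sketch. The function name strip_junk is invented for this example, and the rule it applies (keep printable ASCII, tabs, and newlines; drop everything else) is an assumption about what "junk" means here, not a description of what any particular program does.

```python
import unicodedata


def strip_junk(text: str) -> str:
    """Keep printable ASCII plus tabs and newlines; drop everything else.

    Accented letters are decomposed first so that the base letter
    survives. Symbols with no ASCII equivalent (bullets, curly quotes)
    are simply removed.
    """
    # NFKD splits characters such as an accented "e" into "e" plus a
    # combining accent, which the filter below then discards.
    decomposed = unicodedata.normalize("NFKD", text)
    kept = []
    for ch in decomposed:
        if ch in "\n\t" or " " <= ch <= "~":
            kept.append(ch)
    return "".join(kept)


if __name__ == "__main__":
    print(strip_junk("caf\u00e9 text\x00\x0c here\u2022\n"))
```

This does nothing about the column problem, which concerns the order of the text rather than the characters in it.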
|