Please see the script below. Its goal is to convert HTML to text only,
while preserving text links and turning all relative URLs into
absolute URLs. By "text links" I mean text that is linked, as opposed
to images that are linked. I want to keep text links (such as
headlines that link to an article) but get rid of everything else
that is not text.
After
$content = $tf->filter($content);
$content still contains JavaScript portions.
After
$content = $hss->scrub($content);
the JavaScript portions are gone as desired; however, some of the text
links that should have stayed are gone as well (the linked headlines).
Specifically, Scrubber seems to erroneously remove links whose href URL
contains a "=" or "?" character (typical of query-string URLs, e.g. an
href ending in something like "story?id=123").
As an answer to this question, please post a complete, revised version
of the script so that no text links get erroneously removed anymore.
Also, please change it so that all relative URLs in $content are
converted to absolute URLs.
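Something roughly like this is what I have in mind for the URL
conversion (untested sketch using the URI module; the href/src regex is
only a guess, assumes double-quoted attributes, and uses $content and
$url from the script below):

use URI;

# Rough idea only: rewrite every double-quoted href/src in place,
# resolving it against the page URL with URI->new_abs. A real HTML
# parser would be safer, but this shows the direction I mean.
$content =~ s{(href|src)="([^"]+)"}{$1 . '="' . URI->new_abs($2, $url) . '"'}ge;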
Thank you.
Marc.
#!/usr/bin/perl
use LWP::Simple;
use HTML::TagFilter;
use HTML::Scrubber::StripScripts;
require HTTP::Request;
$url = "http://www.orlandosentinel.com/news/opinion/";
$content = get($url);
my $tf = HTML::TagFilter->new(
    strip_comments => 1,
    allow => {
        a      => { 'any' },
        br     => { 'any' },
        p      => { 'any' },
        script => { 'any' },
        style  => { 'any' },
    },
);
$content = $tf->filter($content);
# now $content still contains javascript stuff; am using the code
# below to get rid of this
my $hss = HTML::Scrubber::StripScripts->new(
Allow_src => 1,
Allow_href => 1,
Allow_a_mailto => 1,
Whole_document => 1,
Block_tags => ['hr'],
);
$content = $hss->scrub($content);
# now, JS stuff is gone, BUT some links for the headlines are gone as
# well (reason seems to be that Scrubber doesn't like links containing
# "=" and "?" chars)
# need to add code for converting relative to absolute URLs here, too
print $content;
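For what it's worth, here is a rough way to see exactly which hrefs the
scrub step drops (just a sketch: $before and $after stand for copies of
$content saved immediately before and after the $hss->scrub call, and
the regex only handles double-quoted hrefs):

# collect the hrefs that survive scrubbing, then report the ones that vanished
my %kept = map { $_ => 1 } ($after =~ m{href="([^"]+)"}g);
for my $href ($before =~ m{href="([^"]+)"}g) {
    print "dropped: $href\n" unless $kept{$href};
}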