Please see the script below. Its goal is to convert HTML to text only,
while preserving text links and turning all relative URLs into
absolute URLs. By "text links" I mean text that is linked, as opposed
to images that are linked. I want to keep text links (such as
headlines that link to an article) but get rid of everything else
that is not text.
After
$content = $tf->filter($content);
$content still contains JavaScript portions.
After
$content = $hss->scrub($content);
the JavaScript portions are gone as desired; however, some of the text
links that should have stayed are gone as well (the linked headlines).
Specifically, Scrubber seems to erroneously remove links whose href URL
contains a "=" or "?" character (typical of query-string URLs, e.g. an
href ending in something like "story?id=123").
As an answer to this question, please post a complete, revised version
of the script so that no text links get erroneously removed anymore.
Also, please change it so that all relative URLs in $content are
converted to absolute URLs.
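Something roughly like this is what I have in mind for the URL
conversion (untested sketch using the URI module; the href/src regex is
only a guess, assumes double-quoted attributes, and uses $content and
$url from the script below):

use URI;

# Rough idea only: rewrite every double-quoted href/src in place,
# resolving it against the page URL with URI->new_abs. A real HTML
# parser would be safer, but this shows the direction I mean.
$content =~ s{(href|src)="([^"]+)"}{$1 . '="' . URI->new_abs($2, $url) . '"'}ge;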
Thank you.
Marc.
#!/usr/bin/perl
use LWP::Simple;
use HTML::TagFilter;
use HTML::Scrubber::StripScripts;
require HTTP::Request;
$url = "http://www.orlandosentinel.com/news/opinion/";
$content = get($url);
my $tf = HTML::TagFilter->new(
    strip_comments => 1,
    allow => {
        a      => { 'any' },
        br     => { 'any' },
        p      => { 'any' },
        script => { 'any' },
        style  => { 'any' },
    },
);
$content = $tf->filter($content);
# now $content still contains javascript stuff; am using the code
# below to get rid of this
my $hss = HTML::Scrubber::StripScripts->new(
Allow_src => 1,
Allow_href => 1,
Allow_a_mailto => 1,
Whole_document => 1,
Block_tags => ['hr'],
);
$content = $hss->scrub($content);
# now, JS stuff is gone, BUT some links for the headlines are gone as
# well (reason seems to be that Scrubber doesn't like links containing
# "=" and "?" chars)
# need to add code for converting relative to absolute URLs here, too
print $content;
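For what it's worth, here is a rough way to see exactly which hrefs the
scrub step drops (just a sketch: $before and $after stand for copies of
$content saved immediately before and after the $hss->scrub call, and
the regex only handles double-quoted hrefs):

# collect the hrefs that survive scrubbing, then report the ones that vanished
my %kept = map { $_ => 1 } ($after =~ m{href="([^"]+)"}g);
for my $href ($before =~ m{href="([^"]+)"}g) {
    print "dropped: $href\n" unless $kept{$href};
}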