Hello, marcfest:
I have written a simple script using the HTML::TagFilter class that
removes all tags from an HTML input except links, so only text links
remain visible. This is a start; if you need further customization,
please don't hesitate to request a clarification.
The TagFilter class is documented here, with plenty of useful information:
HTML::TagFilter
http://search.cpan.org/~wross/HTML-TagFilter-0.075/TagFilter.pm
The script to dump out the HTML code with just text links is this:
--
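use strict;
use warnings;
use HTML::TagFilter;

# Allow only <a> tags, with any attribute; every other tag is stripped.
my $tf = HTML::TagFilter->new(allow => { a => { any => [] } });

my $string = '<b>tags</b> to delete<a href="http://www.url.com">hello, world</A>';

# To filter a file instead, remove the $string line above and
# uncomment the following lines:
#my $file = 'index.html';
#open F, '<', $file or die "can't open the file: $!\n";
#my $count = 0;
#my $string = "";
#my $size = -s $file;
#$count += sysread(F, $string, $size - $count, $count) until $count == $size;

my $clean_html = $tf->filter($string);
print $clean_html;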
--
As a bonus, I have included (commented out) a piece of code that reads
a file into a variable, so you can test it easily. The script creates a
new TagFilter object that allows only <a> tags, with any attributes,
so the resulting code will contain only such tags. For example, this
input:
<b>tags</b> to delete<a href="http://www.url.com">hello, world</A>
produces this output:
tags to delete<a href="http://www.url.com">hello, world</a>
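The commented-out file-reading lines in the script can also be written
more compactly with Perl's "slurp" idiom. This is only a minimal sketch
using core Perl; the helper name slurp_file is my own and is not part of
HTML::TagFilter:

```perl
use strict;
use warnings;

# Hypothetical helper: read a whole file into one scalar.
sub slurp_file {
    my ($path) = @_;
    open my $fh, '<', $path or die "can't open $path: $!\n";
    local $/;               # no record separator: read everything at once
    my $content = <$fh>;
    close $fh;
    return $content;
}

# Usage (assuming index.html exists next to the script):
# my $string = slurp_file('index.html');
```

The returned string can then be passed to $tf->filter() exactly like the
hard-coded $string.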
I hope this is what you were looking for. If you need any additional
help or any modification to this source, please don't hesitate to
request a clarification. I'm here to help until you get what you need.
Regards.
Clarification of Answer by joseleon-ga on 12 Jan 2004 08:49 PST
Hello, marcfest:
Then try this script instead:
--
#!/usr/bin/perl
use strict;
use warnings;
use LWP::Simple;
use HTML::TagFilter;
use HTML::Scrubber::StripScripts;

# First pass: strip comments and keep only <a>, <br>, <script> and <style> tags.
my $tf = HTML::TagFilter->new(
    strip_comments => 1,
    allow => {
        a      => { any => [] },
        br     => { any => [] },
        script => { any => [] },
        style  => { any => [] },
    },
);

my $url = "http://www.fuckedcompany.com/";
my $content = get($url);
die "can't fetch $url\n" unless defined $content;
$content = $tf->filter($content);

# Second pass: let HTML::Scrubber::StripScripts clean up what is left.
my $hss = HTML::Scrubber::StripScripts->new(
    Allow_src      => 1,
    Allow_href     => 1,
    Allow_a_mailto => 1,
    Whole_document => 1,
    Block_tags     => ['hr'],
);
my $clean_html = $hss->scrub($content);

print "Content-type: text/html\n";
print "Pragma: no-cache\n\n";
print $clean_html;
--
In addition to LWP::Simple and HTML::TagFilter, you will need these modules:
http://search.cpan.org/~podmaster/HTML-Scrubber-0.06/Scrubber.pm
http://search.cpan.org/~ncleaton/HTML-Scrubber-StripScripts-0.01/StripScripts.pm
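Before running the script, it may help to check which of the modules are
already installed. The following is only a sketch using core Perl; the
helper name missing_modules is my own invention:

```perl
use strict;
use warnings;

# Hypothetical helper: return the module names that cannot be loaded here.
sub missing_modules {
    my @names = @_;
    my @missing;
    for my $name (@names) {
        (my $file = "$name.pm") =~ s{::}{/}g;    # Foo::Bar -> Foo/Bar.pm
        push @missing, $name unless eval { require $file; 1 };
    }
    return @missing;
}

my @missing = missing_modules(qw(
    LWP::Simple HTML::TagFilter HTML::Scrubber HTML::Scrubber::StripScripts
));
print @missing ? "Missing: @missing\n" : "All modules found\n";
```

Any module reported as missing has to be installed from CPAN first.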
Just tell me if this is what you need, and keep requesting
clarifications if it is not.
Regards.
Clarification of Answer by joseleon-ga on 13 Jan 2004 01:09 PST
Hello, marcfest:
Thanks for the rating and for the tip. Unfortunately, we are not able
to make contact outside Google Answers, as it is against the Researcher
Guidelines. In any case, if you want to contact me here again, you can
post a question whose subject includes "For joseleon only"; that way,
the question will be reserved for me. I hope to see you again around
here!
Regards.