Q: PERL function for removing html tags except for text links (Answered, 5 out of 5 stars, 4 Comments)
Question  
Subject: PERL function for removing html tags except for text links
Category: Computers > Programming
Asked by: marcfest-ga
List Price: $25.00
Posted: 10 Jan 2004 17:33 PST
Expires: 09 Feb 2004 17:33 PST
Question ID: 295173
I need a perl function ("striphtml") that strips out all html tags
from html contained in a variable except for those that create text
links.

I would like to be able to use it like this:

$content_without_tags_except_links = &striphtml($content_with_tags)

Thank you.

Clarification of Question by marcfest-ga on 11 Jan 2004 06:54 PST
Yes, links to images should also be removed, and the image itself as
well. The function html_to_ascii that I posted in the comments actually
converts images to the string "[image]" in the stripped version, which
is fine.
Answer  
Subject: Re: PERL function for removing html tags except for text links
Answered By: joseleon-ga on 11 Jan 2004 07:51 PST
Rated:5 out of 5 stars
 
Hello, marcfest:
  I have written a simple script using the HTML::TagFilter class that
removes all tags from an HTML input except the links, so only text
links remain in the output. This is a start; if you need further
customization, please don't hesitate to request a clarification.
  
The TagFilter class is documented here, along with a lot of useful
information about it:

HTML::TagFilter  
http://search.cpan.org/~wross/HTML-TagFilter-0.075/TagFilter.pm  

The script to dump out the HTML code with just text links is this:

--
use strict;
use warnings;
use HTML::TagFilter;

# Allow only <a> tags, with any attribute; every other tag is stripped.
my $tf = HTML::TagFilter->new(allow => { a => { any => [] } });
my $string = "<b>tags</b> to delete<a href=\"http://www.url.com\">hello, world</A>";

# Optional: read the input from a file instead of the literal above.
#my $file='index.html';
#open F,"$file" or die "can't open the file: $!\n";
#my $count = 0;
#my $string = "";
#my $size = -s $file;
#$count += sysread(F,$string,$size-$count,$count) until $count == $size;

my $clean_html = $tf->filter($string);

print $clean_html;
--
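
The commented-out lines in the script above read the file with a sysread loop. As a rough alternative sketch, the common Perl idiom is to undefine the input record separator and read the whole file in one go. The helper name slurp_file is only illustrative and is not part of the original answer.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the original answer): return the whole
# contents of a file as one string.
sub slurp_file {
    my ($path) = @_;
    open my $fh, '<', $path or die "can't open $path: $!\n";
    local $/;                 # no record separator: read everything at once
    my $data = <$fh>;
    close $fh;
    return defined $data ? $data : '';
}
```

With that in place, the input could be loaded as my $string = slurp_file('index.html'); before it is passed to the filter.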

As a bonus, I have added a commented-out piece of code that reads a
file into a variable, so you can test it easily. The script creates a
new TagFilter object that only allows <a> tags, with any attribute, so
the resulting code will contain only such tags. For example, this
input:

<b>tags</b> to delete<a href="http://www.url.com">hello, world</A>

produces this output:

tags to delete<a href="http://www.url.com">hello, world</a>
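
The question asked for a function named striphtml, so here is a minimal sketch of how the filter could be wrapped to match that interface. It assumes HTML::TagFilter is installed, and it spells the allow rule as any => [], which is the form I recall from the module's documentation; treat that spelling as an assumption and check it against your installed version.

```perl
use strict;
use warnings;
use HTML::TagFilter;

# Sketch of the striphtml() interface requested in the question.
# Assumption: <a> tags are kept with any attribute, every other tag
# is dropped and its text content is left in place.
sub striphtml {
    my ($html) = @_;
    my $tf = HTML::TagFilter->new(allow => { a => { any => [] } });
    return $tf->filter($html);
}

my $content_with_tags =
    '<b>tags</b> to delete<a href="http://www.url.com">hello, world</A>';
my $content_without_tags_except_links = striphtml($content_with_tags);
print $content_without_tags_except_links, "\n";
```

A new filter object is built on every call to keep the sketch simple; if the function runs often, the object could be created once and reused.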

I hope this is what you were looking for. If you need any additional
help or any modification to this source, please don't hesitate to
request a clarification. I'm here to help until you get what you
need.

Regards.

Request for Answer Clarification by marcfest-ga on 11 Jan 2004 08:51 PST
Hi. Please go to http://www.marcfest.com/changedetect/x.cgi . See
below for the code of x.cgi. Please note how a lot of javascript is
still visible at the top of the page. I'd like that removed as well.
Also, can you tweak it so it re-inserts line breaks (<br>) where
needed for better html display?

Thank you!

Code:
#!/usr/bin/perl

use LWP::Simple;
use HTML::TagFilter;
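my $tf = HTML::TagFilter->new(allow => { a => { any => [] } });

my $url = "http://www.fuckedcompany.com/";

# get() returns undef when the page cannot be fetched
my $content = get($url);
die "can't fetch $url\n" unless defined $content;
$content = $tf->filter($content);

print "Content-type: text/html", "\n"; print "Pragma: no-cache", "\n\n";
print $content;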

Clarification of Answer by joseleon-ga on 12 Jan 2004 08:49 PST
Hello, marcfest:

Then try this script instead:

--
#!/usr/bin/perl

use LWP::Simple;
use HTML::TagFilter;
use HTML::Scrubber::StripScripts;

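# script and style pass through here, presumably so that the scrubber
# below can remove them together with their contents
my $tf = HTML::TagFilter->new(
    strip_comments => 1,
    allow => { a => { any => [] }, br => { any => [] },
               script => { any => [] }, style => { any => [] } },
);

my $url = "http://www.fuckedcompany.com/";

my $content = get($url);
$content = $tf->filter($content);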


my $hss = HTML::Scrubber::StripScripts->new(
      Allow_src      => 1,
      Allow_href     => 1,
      Allow_a_mailto => 1,
      Whole_document => 1,
      Block_tags     => ['hr'],
   );

my $clean_html = $hss->scrub($content);

print "Content-type: text/html", "\n"; print "Pragma: no-cache", "\n\n";
print $clean_html;
--

You will need these modules:

http://search.cpan.org/~podmaster/HTML-Scrubber-0.06/Scrubber.pm
http://search.cpan.org/~ncleaton/HTML-Scrubber-StripScripts-0.01/StripScripts.pm

Just tell me if this is what you need, and keep requesting
clarifications if not.

Regards.
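
The script above keeps the <br> tags that are already in the page. If breaks are also wanted where paragraphs, table rows or list items end, one possible approach is to add a <br> after those closing tags before filtering. This is only a sketch: the helper name mark_line_breaks and the choice of tags are assumptions, not part of the original answer.

```perl
use strict;
use warnings;

# Hypothetical pre-processing step: append <br> after closing tags that
# normally end a line, so the break survives once the filter strips the
# tags themselves.
sub mark_line_breaks {
    my ($html) = @_;
    $html =~ s{(</(?:p|div|tr|li|h[1-6])\s*>)}{$1<br>}gi;
    return $html;
}
```

It would be called on $content before $tf->filter($content), since that filter already lets br through.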

Request for Answer Clarification by marcfest-ga on 12 Jan 2004 19:31 PST
This works great. Can you please contact me at
http://www.marcfest.com/email.html ? I'd like to collaborate in the
future.

Regards,

Marc.

Clarification of Answer by joseleon-ga on 13 Jan 2004 01:09 PST
Hello, marcfest:
  Thanks for the rating and for the tip. Unfortunately, we are not able
to make contact outside Google Answers, as it is against the Researcher
Guidelines. In any case, if you want to contact me here again, you can
post a question whose subject includes "For joseleon only"; that way,
the question will be reserved for me. I hope to see you again around
here!

Regards.
marcfest-ga rated this answer:5 out of 5 stars and gave an additional tip of: $5.00
Excellent answers, clarifications.

Comments  
Subject: Re: PERL function for removing html tags except for text links
From: iwb-ga on 11 Jan 2004 03:25 PST
 
Hm, 

I'm not sure what you want. What's a text link? Do you have an
example? Perhaps you just want to extract links:

http://search.cpan.org/~podmaster/HTML-LinkExtractor-0.09/LinkExtractor.pm

IWB
Subject: Re: PERL function for removing html tags except for text links
From: marcfest-ga on 11 Jan 2004 06:17 PST
 
A text link is something like <a href = "url here">text here</a>. 

I do not want to "extract" text links. I want the input to be html
source code and the output to be a text-only version, however, with
the "text links" being preserved, and all other html tags, javascript
etc being removed.
Subject: Re: PERL function for removing html tags except for text links
From: joseleon-ga on 11 Jan 2004 06:47 PST
 
Hello, marcfest:
  OK, in that case, links to images must also be removed, for example:

<A href="url here"><img src=""></A>

All of these tags must be removed from the output, is that right?

Regards.
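
When a filter strips <img> but keeps <a>, an image-only link like the one above is left behind as an empty <a> element. A minimal sketch of one way to clean that up after filtering follows; the helper name drop_empty_links is hypothetical, and the regular expression only covers anchors whose body is empty or whitespace.

```perl
use strict;
use warnings;

# Hypothetical post-processing step: remove <a> elements with an empty
# or whitespace-only body, which is what an image-only link becomes once
# its <img> tag has been stripped.
sub drop_empty_links {
    my ($html) = @_;
    $html =~ s{<a\b[^>]*>\s*</a\s*>}{}gi;
    return $html;
}
```

Links that still contain text are left untouched, so text links survive this step.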
Subject: Re: PERL function for removing html tags except for text links
From: marcfest-ga on 11 Jan 2004 06:52 PST
 
PS: below is a function I've been using successfully to strip *all*
html tags from a page. But as I said, I want to modify this so that
text links are preserved.

use HTML::TreeBuilder;
use HTML::FormatText;

sub html_to_ascii {
    my ($document) = @_;
    my $html = HTML::TreeBuilder->new();
    $html->parse($document);
    $html->eof();    # signal end of input so the tree is complete
    my $formatter = HTML::FormatText->new(leftmargin => 0, rightmargin => 0);
    my $text = $formatter->format($html);
    $html->delete(); # free the parse tree
    return $text;
}
