Google Answers: PERL function for removing html tags except for text links

Question

Subject: PERL function for removing html tags except for text links
Category: Computers > Programming
Asked by: marcfest-ga
List Price: $25.00

Posted: 10 Jan 2004 17:33 PST
Expires: 09 Feb 2004 17:33 PST
Question ID: 295173

I need a perl function ("striphtml") that strips out all html tags
from html contained in a variable except for those that create text
links.

I would like to be able to use it like this:

$content_without_tags_except_links = &striphtml($content_with_tags)

Thank you.

Clarification of Question by marcfest-ga on 11 Jan 2004 06:54 PST

Yes, links to images should also be removed, and the image itself as
well. The function html_to_ascii that I posted in the comments actually
converts images to the string "[image]" in the stripped version, which
is fine.

Answer

Subject: Re: PERL function for removing html tags except for text links
Answered By: joseleon-ga on 11 Jan 2004 07:51 PST
Rated: 5 out of 5 stars

Hello, marcfest:

I have written a simple script using the HTML::TagFilter class that
removes all tags from an HTML input except links, so only text links
remain in the output. This is a start; if you need further
customization, please don't hesitate to request a clarification.

The TagFilter class is documented here, with a lot of useful
information about it:

HTML::TagFilter
http://search.cpan.org/~wross/HTML-TagFilter-0.075/TagFilter.pm

The script to dump out the HTML code with just text links is this:

--
use HTML::TagFilter;

# Allow only <a> tags, with any attributes; all other tags are stripped.
my $tf = HTML::TagFilter->new(allow => { a => { any => [] } });
my $string = "<b>tags</b> to delete<a href=\"http://www.url.com\">hello, world</A>";

# Optional: read the HTML from a file instead of using the string above.
#my $file='index.html';
#open F,"$file" or die "can't open the file: $!\n";
#my $count = 0;
#my $string = "";
#my $size = -s $file;
#$count += sysread(F,$string,$size-$count,$count) until $count == $size;

my $clean_html = $tf->filter($string);
print $clean_html;
--

As a bonus, I have added a piece of code that reads a file into a
variable, so you can test it easily.

This script creates a new TagFilter object that only allows <a> tags,
with any attributes, so the resulting code will contain only such tags.

For example, this input:

<b>tags</b> to delete<a href="http://www.url.com">hello, world</A>

produces this output:

tags to delete<a href="http://www.url.com">hello, world</a>

I hope this is what you were looking for. If you need any additional
help or any modification to this source, please don't hesitate to
request a clarification. I'm here to help until you get what you need.

Regards.

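The filter in the answer can be wrapped in the striphtml function that the question asks for. The following is only a minimal sketch, assuming HTML::TagFilter is installed from CPAN. The `any => []` spelling of the attribute rule is an assumption based on the module's documented rule format and should be checked against the installed version.

```perl
use strict;
use warnings;
use HTML::TagFilter;

# Strip every tag except <a>, keeping the attributes on <a> (such as href).
sub striphtml {
    my ($content_with_tags) = @_;
    # Assumption: { any => [] } permits every attribute on the allowed tag.
    my $tf = HTML::TagFilter->new(allow => { a => { any => [] } });
    return $tf->filter($content_with_tags);
}

my $content_with_tags =
    '<b>tags</b> to delete<a href="http://www.url.com">hello, world</A>';
my $content_without_tags_except_links = striphtml($content_with_tags);
print $content_without_tags_except_links, "\n";
```

A new filter object is built on every call here for simplicity; a script that filters many pages could create it once and reuse it.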
Request for Answer Clarification by marcfest-ga on 11 Jan 2004 08:51 PST

Hi. Please go to http://www.marcfest.com/changedetect/x.cgi . See below
for the code of x.cgi. Please note how a lot of javascript is still
visible on top of the page. I'd like that removed as well. Also, can
you tweak it so it re-inserts line returns <br> where needed for better
html display? Thank you!

Code:

#!/usr/bin/perl
use LWP::Simple;
use HTML::TagFilter;
my $tf = HTML::TagFilter->new(allow=>{a=>{'any'}});
$url = "http://www.fuckedcompany.com/";
$content = get $url;
my $content = $tf->filter($content);
print "Content-type: text/html", "\n";
print "Pragma: no-cache", "\n\n";
print $content;

Clarification of Answer by joseleon-ga on 12 Jan 2004 08:49 PST

Hello, marcfest:

Then try this script instead:

--
#!/usr/bin/perl
use LWP::Simple;
use HTML::TagFilter;
use HTML::Scrubber::StripScripts;

# First pass: keep only <a> and <br>, and drop comments. <script> and
# <style> are let through here so the second pass can remove them.
my $tf = HTML::TagFilter->new(
    strip_comments => 1,
    allow => {
        a      => { any => [] },
        br     => { any => [] },
        script => { any => [] },
        style  => { any => [] },
    },
);

my $url = "http://www.fuckedcompany.com/";
my $content = get $url;
$content = $tf->filter($content);

# Second pass: strip scripts from what is left.
my $hss = HTML::Scrubber::StripScripts->new(
    Allow_src      => 1,
    Allow_href     => 1,
    Allow_a_mailto => 1,
    Whole_document => 1,
    Block_tags     => ['hr'],
);
my $clean_html = $hss->scrub($content);

print "Content-type: text/html", "\n";
print "Pragma: no-cache", "\n\n";
print $clean_html;
--

You will need these modules:

http://search.cpan.org/~podmaster/HTML-Scrubber-0.06/Scrubber.pm
http://search.cpan.org/~ncleaton/HTML-Scrubber-StripScripts-0.01/StripScripts.pm

Just tell me if this is what you need, and keep requesting
clarifications if not.

Regards.

Request for Answer Clarification by marcfest-ga on 12 Jan 2004 19:31 PST

This works great. Can you please contact me at
http://www.marcfest.com/email.html ? I'd like to collaborate in the
future.

Regards,
Marc.

Clarification of Answer by joseleon-ga on 13 Jan 2004 01:09 PST

Hello, marcfest:

Thanks for the rating and for the tip. Unfortunately, we are not able
to make contact outside Google Answers, as it is against the Researcher
Guidelines. In any case, if you want to contact me here again, you can
post a question whose subject includes "For joseleon only"; that way,
the question will be reserved for me. I hope to see you again around
here!

Regards.

marcfest-ga rated this answer: 5 out of 5 stars

and gave an additional tip of: $5.00

Excellent answers, clarifications.

Comments

Subject: Re: PERL function for removing html tags except for text links
From: iwb-ga on 11 Jan 2004 03:25 PST

Hm, 

I'm not sure what you want. What's a text link? Do you have an
example? Perhaps you just want to extract links:

http://search.cpan.org/~podmaster/HTML-LinkExtractor-0.09/LinkExtractor.pm

IWB

Subject: Re: PERL function for removing html tags except for text links
From: marcfest-ga on 11 Jan 2004 06:17 PST

A text link is something like <a href = "url here">text here</a>.

I do not want to "extract" text links. I want the input to be html
source code and the output to be a text-only version, however, with
the "text links" being preserved, and all other html tags, javascript
etc being removed.

Subject: Re: PERL function for removing html tags except for text links
From: joseleon-ga on 11 Jan 2004 06:47 PST

Hello, marcfest:
  Ok, in that case, links to images must also be removed, for example:

<A href="url here"><img src=""></A>

All of these tags must be removed from the output, right?

Regards.
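An image link such as the one in this comment is left as an empty <a ...></a> pair once a filter has stripped the <img> tag. One possible way to drop such leftovers is a small post-processing step, sketched below. The function name drop_empty_links is hypothetical, and the regular expression assumes that attribute values contain no ">" character, so this is a rough sketch rather than a robust HTML parser.

```perl
use strict;
use warnings;

# Remove <a ...></a> pairs whose body is empty or whitespace only, such as
# what remains of <a href="..."><img src="..."></a> once <img> is stripped.
# Assumption: no attribute value contains a ">" character.
sub drop_empty_links {
    my ($html) = @_;
    $html =~ s{<a\b[^>]*>\s*</a\s*>}{}gi;
    return $html;
}

my $filtered =
    'see <a href="pic.html"></a><a href="http://www.url.com">hello, world</a>';
print drop_empty_links($filtered), "\n";
# prints: see <a href="http://www.url.com">hello, world</a>
```

Links that still contain text are left untouched, because the pattern only matches when nothing but whitespace sits between the opening and closing tag.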

Subject: Re: PERL function for removing html tags except for text links
From: marcfest-ga on 11 Jan 2004 06:52 PST

PS: below is a function I've been using successfully to strip *all*
html tags from a page. But as I said, I want to modify this so that
text links are preserved.

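use HTML::TreeBuilder;
use HTML::FormatText;

# Convert an HTML string to plain text, dropping every tag.
sub html_to_ascii {
    my ($document) = @_;
    my $html = HTML::TreeBuilder->new();
    $html->parse($document);
    $html->eof();       # signal end of input so the tree is complete
    my $formatter = HTML::FormatText->new(leftmargin => 0, rightmargin => 0);
    my $text = $formatter->format($html);
    $html->delete();    # free the parse tree
    return $text;
}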
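One possible way to get a variant of this function that preserves text links is to use HTML::Parser directly instead of HTML::FormatText. The sketch below is not the approach taken in the answer; it assumes HTML::Parser is installed, and the function name html_to_ascii_with_links is hypothetical. It copies text and <a> tags through, and skips <script> and <style> elements together with their contents.

```perl
use strict;
use warnings;
use HTML::Parser;

# Return the text of an HTML document, keeping only <a ...> and </a> tags.
sub html_to_ascii_with_links {
    my ($document) = @_;
    my $out = '';
    # Append the original tag text only when the tag is <a>.
    my $keep_anchor = sub {
        my ($tagname, $text) = @_;
        $out .= $text if $tagname eq 'a';
    };
    my $parser = HTML::Parser->new(
        api_version => 3,
        start_h     => [ $keep_anchor, 'tagname, text' ],
        end_h       => [ $keep_anchor, 'tagname, text' ],
        # Copy ordinary text through unchanged (entities stay encoded).
        text_h      => [ sub { $out .= $_[0] }, 'text' ],
    );
    # Skip <script> and <style> elements together with their contents.
    $parser->ignore_elements(qw(script style));
    $parser->parse($document);
    $parser->eof;
    return $out;
}

print html_to_ascii_with_links(
    '<script>var x = 1;</script><p>Read <a href="http://www.url.com">this</a> now</p>'
), "\n";
```

Unlike html_to_ascii, this sketch does no text formatting or line wrapping, and a link that wraps only an image comes out as an empty <a> pair.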
