Q: PERL: HTML to TEXT (lynx-style) (No Answer, 2 Comments)
Question  
Subject: PERL: HTML to TEXT (lynx-style)
Category: Computers > Programming
Asked by: marcfest-ga
List Price: $20.00
Posted: 21 Feb 2004 06:27 PST
Expires: 21 Feb 2004 13:00 PST
Question ID: 309148
Hi - the script below is supposed to convert the content from any Web
page into lynx-style format, i.e. text-only with text links intact.
The script works in most cases, however with http://www.news.com it
bunches together lots of stuff into a single line instead of putting
it onto separate lines like lynx would do. Basically, I am missing line
breaks.

See www.marcfest.com/qxi5/news.cgi to see what I'm talking about.

Doing $content = `lynx ...` is not an option.

As an answer to this question, please rewrite the filter part in the
script below so that it will display all pages, including
www.news.com, lynx-style, i.e. converted to text with text links
intact, plus appropriate line breaks.

Thank you.


#!/usr/bin/perl

use LWP::Simple;

use HTML::TagFilter;

$content = get ("http://www.news.com");
my $tf = HTML::TagFilter->new(
    strip_comments => 1,
    allow => {
        a      => {'any'},
        br     => {'any'},
        p      => {'any'},
        script => {'any'},
        style  => {'any'},
    },
);


$content = $tf->filter($content);

print $content; exit;
Answer  
There is no answer at this time.

Comments  
Subject: Re: PERL: HTML to TEXT (lynx-style)
From: dewolfe001-ga on 21 Feb 2004 10:15 PST
 
One of your problems is the presence of XML-style (self-closing) HTML
tags on the news.com page (e.g. <br />).

In going through the documentation
(http://search.cpan.org/~wross/HTML-TagFilter-0.075/TagFilter.pm), I
think this is how your filter rules should work:

my $tf = HTML::TagFilter->new(
    strip_comments => 1,
    allow => {
        a      => {'any'},
        br     => {'none'},
        p      => {'any'},
        script => {'any'},
        style  => {'any'},
    },
);

This will allow empty <br> tags but not those with the slash.

If this helps, feel free to throw me a tip.

Subject: Re: PERL: HTML to TEXT (lynx-style)
From: marcfest-ga on 21 Feb 2004 13:00 PST
 
I've replaced the filter at http://www.marcfest.com/qxi5/news.cgi but
it does not make a difference. Thanks for looking into this though.
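
A likely reason the changed rule makes no difference: the filter drops
block-level tags such as <div>, <td> and <li> without putting anything in
their place, so text from neighbouring blocks ends up on one line. The
sketch below is one possible way to get lynx-like line breaks. It is not
HTML::TagFilter usage and not a real HTML parser; it swaps in plain regular
expressions from core Perl. The helper name html_to_text, the list of
block-level tags, and the "text [url]" link format are all assumptions made
for illustration (lynx itself numbers its links), and unusual markup may
well defeat the patterns.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not part of HTML::TagFilter): converts an HTML string
# to plain text, keeps links as "text [url]", and turns block-level tags
# into line breaks. Regex-based, so only a rough approximation of a parser.
sub html_to_text {
    my ($html) = @_;

    # Drop comments and the full contents of <script> and <style>.
    $html =~ s/<!--.*?-->//gs;
    $html =~ s/<(script|style)\b[^>]*>.*?<\/\1\s*>//gis;

    # Whitespace in the HTML source is not significant: collapse it first.
    $html =~ s/\s+/ /g;

    # Keep each link as "text [url]" (assumes a quoted href attribute).
    $html =~ s/<a\b[^>]*?\bhref\s*=\s*["']([^"']*)["'][^>]*>(.*?)<\/a\s*>/$2 [$1]/gis;

    # <br>, <br/> and <br /> all become a line break.
    $html =~ s/<br\b[^>]*>/\n/gi;

    # Opening and closing block-level tags become line breaks too.
    # The tag list is an assumption; extend it as needed.
    $html =~ s/<\/?(?:p|div|table|tr|td|th|ul|ol|li|h[1-6]|blockquote|pre|hr)\b[^>]*>/\n/gi;

    # Strip every remaining tag.
    $html =~ s/<[^>]+>//g;

    # Decode a few common entities in one pass, so "&amp;lt;" ends up as "&lt;".
    my %entity = (nbsp => ' ', lt => '<', gt => '>', quot => '"', amp => '&');
    $html =~ s/&(nbsp|lt|gt|quot|amp);/$entity{$1}/g;

    # Tidy up: trim spaces around line breaks, collapse runs of line breaks,
    # and remove leading and trailing whitespace.
    $html =~ s/[ \t]*\n[ \t]*/\n/g;
    $html =~ s/\n{2,}/\n/g;
    $html =~ s/\A\s+|\s+\z//g;

    return "$html\n";
}

# Usage with a literal sample; in the script above, the string returned by
# LWP::Simple's get() would be passed in instead.
my $sample = '<div>Headline one</div><div>Headline two<br />More</div>'
           . '<p>See <a href="http://www.news.com/">News.com</a></p>';
print html_to_text($sample);
```

Converting the links before stripping tags is what keeps the URLs in the
output. If installing CPAN modules is an option, a parser-based converter
would probably be more robust than these patterns.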
