Subject: PERL: HTML to TEXT (lynx-style)
Category: Computers > Programming
Asked by: marcfest-ga
List Price: $20.00
Posted: 21 Feb 2004 06:27 PST
Expires: 21 Feb 2004 13:00 PST
Question ID: 309148
Hi - the script below is supposed to convert the content of any Web page into lynx-style format, i.e. text-only with text links intact. The script works in most cases; however, with http://www.news.com it bunches lots of content together on a single line instead of putting it on separate lines as lynx would. Basically, I am missing line breaks. See www.marcfest.com/qxi5/news.cgi to see what I'm talking about. Doing $content = `lynx ...` is not an option.

As an answer to this question, please rewrite the filter part of the script below so that it displays all pages, including www.news.com, lynx-style, i.e. converted to text with text links intact, plus appropriate line breaks. Thank you.

#!/usr/bin/perl
use LWP::Simple;
use HTML::TagFilter;
$content = get ("http://www.news.com");
my $tf = HTML::TagFilter->new(strip_comments => 1,allow=>{a=>{'any'},br=>{'any'},p=>{'any'},script=>{'any'},style=>{'any'}});
$content = $tf->filter($content);
print $content;
exit;

There is no answer at this time.

Subject: Re: PERL: HTML to TEXT (lynx-style)
From: dewolfe001-ga on 21 Feb 2004 10:15 PST
One of your problems is the presence of XML-friendly HTML tags (e.g. <br />) in the news.com article. Going through the documentation (http://search.cpan.org/~wross/HTML-TagFilter-0.075/TagFilter.pm), I think this is how your filter rules should work:

my $tf = HTML::TagFilter->new(strip_comments => 1,allow=>{a=>{'any'},br=>{'none'},p=>{'any'},script=>{'any'},style=>{'any'}});

This will allow empty <br> tags but not those with the slash. If this helps, feel free to throw me a tip.

Subject: Re: PERL: HTML to TEXT (lynx-style)
From: marcfest-ga on 21 Feb 2004 13:00 PST
I've replaced the filter at http://www.marcfest.com/qxi5/news.cgi but it does not make a difference. Thanks for looking into this though.
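
The missing line breaks described in the question are what one would expect when block-level tags such as <div>, <tr>, or <li> are stripped without anything taking their place. A minimal sketch of one possible workaround follows, assuming plain regular expressions are acceptable in place of HTML::TagFilter. The helper name html_to_linked_text and the list of block-level tags are illustrative assumptions, not part of the original script or of any module, and regex-based stripping is only an approximation of what lynx does.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not from HTML::TagFilter): convert HTML to text
# with line breaks, keeping <a> tags so that links stay intact.
sub html_to_linked_text {
    my ($html) = @_;

    # Drop comments and the contents of script/style blocks.
    $html =~ s/<!--.*?-->//gs;
    $html =~ s/<(script|style)\b[^>]*>.*?<\/\1\s*>//gis;

    # Source newlines are just whitespace in HTML, so flatten them first.
    $html =~ s/\s+/ /g;

    # Assumed list of block-level tags; each one becomes a line break.
    my $block = 'br|p|div|tr|li|ul|ol|dl|dt|dd|table|h[1-6]|hr|blockquote|pre|form';
    $html =~ s/<\/?(?:$block)\b[^>]*>/\n/gi;

    # Table cells become spaces so adjacent cells do not run together.
    $html =~ s/<\/?t[dh]\b[^>]*>/ /gi;

    # Remove every remaining tag except <a ...> and </a>.
    $html =~ s/<(?!\/?a\b)[^>]*>//gi;

    # Tidy up: squeeze spaces, trim each line, drop empty lines.
    $html =~ s/[ \t]+/ /g;
    $html =~ s/[ \t]*\n[ \t]*/\n/g;
    $html =~ s/\n{2,}/\n/g;
    $html =~ s/^\s+|\s+$//g;

    return $html;
}

# Example input is made up for illustration; no network access needed.
my $sample = '<div>Top <a href="http://www.news.com/">News</a></div>'
           . '<div>Second<br />Third</div><script>var x = 1;</script>';
print html_to_linked_text($sample), "\n";
```

In the original script, the call would replace the two HTML::TagFilter lines, e.g. $content = html_to_linked_text($content). Note that this sketch also flattens whitespace inside <pre> blocks and does not decode entities such as &amp;.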