Q: perl function improvement (Answered, 5 out of 5 stars, 0 Comments)
Question  
Subject: perl function improvement
Category: Computers > Programming
Asked by: marcfest-ga
List Price: $20.00
Posted: 18 Jan 2004 10:41 PST
Expires: 17 Feb 2004 10:41 PST
Question ID: 297701
Please look at the Perl script below. It uses a function called "grab"
to fetch URLs off the Web.

I need the function's timeout feature fixed. Right now, when including
a non-existent server like "http://216.239.39.111/" (a bogus URL meant
to simulate a server that's down; it should trigger the timeout) in
@urls, the script will hang for up to 3 minutes instead of timing out
after 3 seconds. This may be due to a bug in the PUA
(LWP::Parallel::UserAgent) Perl module.

Please make it so the "grab" function times out appropriately, i.e. in
this case after 3 seconds. You may have to stop using PUA to
accomplish this. Make it so that when a request times out, the value
"timeout" is assigned to the variable that would otherwise hold the
fetched HTML. Please try to keep grab working as fast as possible.

Thank you.

Marc.



#!/usr/bin/perl

#Uncomment to get full debug info
#use LWP::Debug qw(+ -conns);
use LWP::Simple;
require LWP::Parallel::UserAgent;
require HTTP::Request;


@urls = (
"http://www.yahoo.com/",
"http://216.239.39.111/",
);

$timeout = 3; # each request times out after 3 seconds
@content = grab(@urls);

#This prints the contents of http://www.yahoo.com/
print $content[0];

exit;

sub grab
{
   @results;

   $ua = LWP::Parallel::UserAgent->new();
   $ua->agent("MS Internet Explorer");
   $ua->redirect (0); # prevents automatic following of redirects
   $ua->max_hosts(6); # sets maximum number of locations accessed in parallel
   $ua->max_req  (6); # sets maximum number of parallel requests per host

  foreach $url (@_)
  {

       $ua->register(HTTP::Request->new(GET => $url), \&callback);
  }

  $ua->wait ( $timeout );

  return @results;

}

sub callback
{
        my($data, $response, $protocol) = @_;

        #Comment out this line to stop printing the URL
        print $response->base."\n";
        for ($i=0; $i<@urls; $i++)
        {
                if ( index( $response->base, $urls[$i]) != -1 )
                {
                        $results[$i].=$data;
                        last;
                }
        }
}

Request for Question Clarification by haversian-ga on 20 Jan 2004 16:48 PST
Hello marcfest-ga

I know how to fix your grab function, but in neither the original nor
the fixed version does the callback seem to do much.  In particular,
its print function doesn't print anything.

Do you want just a fix for grab, or do you want me to see what I can
do with callback?  If you'd like me to work on callback, what is it
supposed to do?  What parts of the response do you want stored in
@results?

-Haversian

Clarification of Question by marcfest-ga on 20 Jan 2004 19:48 PST
Hi Haversian - 

Please change the script as you see fit. What matters is that it works
fast and that the timeout works. Please test it with
http://216.239.39.111/, which should cause the timeout to be triggered
since it's a bogus URL.

Thanks.

Request for Question Clarification by haversian-ga on 20 Jan 2004 20:51 PST
Thanks for the quick reply.

I wouldn't feel comfortable taking your $20 if I'm not getting you what you want.

Are you happy with the callback function?  That is, is the script
giving you the output you want, just taking too long to do it?  If so,
I'll post my changes to the grab function and get you on your way.

-Haversian

Clarification of Question by marcfest-ga on 21 Jan 2004 03:36 PST
The only thing that I'm unhappy about is that the script's timeout
function does not work. This problem surfaces when using a bogus URL
like "http://216.239.39.111/" to simulate a server that's unavailable.
Instead of timing out after 3 seconds, the current script will try to
get this URL for up to 3 minutes. So what I need fixed is this faulty
timeout behavior.
Answer  
Subject: Re: perl function improvement
Answered By: haversian-ga on 21 Jan 2004 05:41 PST
Rated:5 out of 5 stars
 
Good morning marcfest,

You've made a small error in using $ua->wait in your script.  Since
the call to wait occurs after your foreach loop, it does not impact
the execution of any code within that loop.  Instead, you have to use
$ua->timeout() to set the timeout value while you're setting up the
other parameters governing the behavior of LWP::Parallel in your
script:

sub grab
{
   @results;

   $ua = LWP::Parallel::UserAgent->new();
   $ua->agent("MS Internet Explorer");
   $ua->timeout ($timeout); #  <--- ADD THIS
   $ua->redirect (0); # prevents automatic following of redirects
   $ua->max_hosts(6); # sets maximum number of locations accessed in parallel
   $ua->max_req  (6); # sets maximum number of parallel requests per host

  foreach $url (@_)
  {
       $res = $ua->register(HTTP::Request->new('GET', $url), \$callback);
  }

  $ua->wait ();  #  <--- simply wait until all registered URLs are dealt with
  return @results;
}



LINK:

http://search.cpan.org/~marclang/ParallelUserAgent-2.56/lib/LWP/Parallel.pm
  The CPAN page on LWP::Parallel was invaluable in answering this
question for you.  It has several examples that you may find useful in
continuing to work with this script.


-Haversian

Request for Answer Clarification by marcfest-ga on 21 Jan 2004 07:31 PST
Running the script below, which contains your suggested changes, only
produces "xx" as output. Something's not working. Please advise.

#!/usr/bin/perl

#Uncomment to get full debug info
#use LWP::Debug qw(+ -conns);
use LWP::Simple;
require LWP::Parallel::UserAgent;
require HTTP::Request;


@urls = (
"http://www.yahoo.com/",
"http://216.239.39.111/",
);

$timeout = 10; # each request times out after 10 seconds
@content = grab(@urls);

print "xx $content[0]";

exit;


sub grab
{
   @results;

   $ua = LWP::Parallel::UserAgent->new();
   $ua->agent("MS Internet Explorer");
   $ua->timeout ($timeout); #  <--- ADD THIS
   $ua->redirect (0); # prevents automatic following of redirects
   $ua->max_hosts(6); # sets maximum number of locations accessed in parallel
   $ua->max_req  (6); # sets maximum number of parallel requests per host

  foreach $url (@_)
  {
       $res = $ua->register(HTTP::Request->new('GET', $url), \$callback);
  }

  $ua->wait ();  #  <--- simply wait until all registered URLs are dealt with
  return @results;
}


sub callback
{
        my($data, $response, $protocol) = @_;

        #Comment out this line to stop printing the URL
        print $response->base."\n";
        for ($i=0; $i<@urls; $i++)
        {
                if ( index( $response->base, $urls[$i]) != -1 )
                {
                        $results[$i].=$data;
                        last;
                }
        }
}

Clarification of Answer by haversian-ga on 21 Jan 2004 09:40 PST
'Afternoon,

That's what I was referring to when I kept asking about your callback
function.  It doesn't seem to be properly loading values into
@results.  Does it work (albeit slowly) when using your original code?
It didn't for me.

I have some code I placed in the grab function for testing purposes
while answering your question.  It's at home, but I could post it here
if you're interested.  It prints the response status (200 OK, 404 Not
Found, etc.) for each URL, and could probably be extended to record
more of the response and place it into the results variable, but as I
understood it you wanted the callback function to handle that instead.

Let me know what the status is; I'll get back to you this evening.

-Haversian

Request for Answer Clarification by marcfest-ga on 21 Jan 2004 10:44 PST
The callback part wasn't written by me. I don't know how it works. 

What I would like as an "answer" to this question is a posting of the
complete, reworked script functioning according to the specs: i.e. it
grabs the sites quickly and times out appropriately. I'd also
appreciate if you could test it before posting it; make sure to use
"http://216.239.39.111/" to simulate a time out and make sure that it
grabs the other sites OK.

Thanks a bunch.

Clarification of Answer by haversian-ga on 21 Jan 2004 13:48 PST
> The callback part wasn't written by me. I don't know how it works.

Yes, but *does* it work?  On your system, with your original script,
do you get entries in the content variable?

Since I wasn't sure what callback's purpose was, I simply ignored it
and included my own test code right in the grab subroutine, which
works fine.

Simply replace:

  $ua->wait ();

in the code I posted as an answer with:

  $entries = $ua->wait ();
  foreach (keys %$entries) {
    my $res = $entries->{$_}->response;
    print "RESULTS: " . $res->message . "\n";
  }

Also remove the reference to callback like so:

       $res = $ua->register(HTTP::Request->new('GET', $url));#, \$callback);

You'll see the HTTP status messages for the URLs tested (I tested
your original two and added several more URLs, both good and bad, as well).


If you tell me what you mean by "grabs the other sites OK.", I can try
to rewrite callback or incorporate the code into the grab subroutine
for you.

-Haversian

Request for Answer Clarification by marcfest-ga on 21 Jan 2004 14:55 PST
Haversian - 

Please post a complete script for me, so that all I have to do is cut
and paste it and run it, without having to replace stuff and change
lines. Try to make it as safe and convenient for me as you can. I'll
appreciate it. By "grabbing" a URL I mean what the script is supposed
to do, i.e. fetch a URL off the Web. Thanks.

Request for Answer Clarification by marcfest-ga on 21 Jan 2004 15:02 PST
Also, please assume that I am a total idiot when it comes to Perl
code. The original script was written by someone else, so I cannot
answer any of your code-related questions. The original script ran OK
as long as it did not encounter any URLs whose pages were unavailable
(those URLs would cause the timeout mechanism to not work, as
explained in my original specs, which is why I posted this question).

Again, what I hope to receive from an expert here is a complete
script (not bits and pieces) that is tested and will solve the timeout
issue. If you would rather get rid of this case and have me ask for a
refund, that's no problem. Please let me know. Otherwise, please post
the complete code of a script that works according to my specs. Thank
you. Sorry about being so difficult.

Clarification of Answer by haversian-ga on 22 Jan 2004 22:12 PST
Marcfest,

The difficulty of communication isn't a problem - some questions just
take more back-and-forth than others.  I had assumed you wrote the
script, which turns out to be a bad assumption - that clears some
things up.

As to the specifications, the script as you presented it to me does
not work on my system.  Since it worked on yours, I will assume this
is a misconfiguration issue on my part, and will work to fix it
tomorrow.  In either case, the script only prints out $content[0],
that is, the content of the first URL given to it.  Would you like it
changed to
print out the content of all the webpages, one by one?  Would you like
some sort of divider between each URL?

I'll sleep on it, and tackle the problem anew tomorrow.  I'm confident
the hard part is behind us and at long last I should be able to get
you the response you've been looking for.  My apologies for the delay.

-Haversian

Request for Answer Clarification by marcfest-ga on 23 Jan 2004 02:08 PST
Printing out just $content[0] is OK. This line is for testing purposes only.

Clarification of Answer by haversian-ga on 24 Jan 2004 09:09 PST
Ok, I've got things working (I had problems with my HTML::Parser module).

When I changed things so they worked (for my test code), I ended up
breaking them so they didn't work (for your code).  I've backed out of
some of those changes, and have only a few changes left in the code.

As you requested, here is the completed, tested, script:

#!/usr/bin/perl

#Uncomment to get full debug info
#use LWP::Debug qw(+ -conns);
use LWP::Simple;
require LWP::Parallel::UserAgent;
require HTTP::Request;


@urls = (
"http://www.yahoo.com/",
"http://216.239.39.111/",
);

$timeout = 3; # each request times out after 3 seconds
@content = grab(@urls);

#This prints the contents of one of the URLs, probably the first
print $content[0];

exit;

sub grab
{
   @results;

   $ua = LWP::Parallel::UserAgent->new();
   $ua->agent("MS Internet Explorer");
   $ua->timeout ($timeout);
   $ua->redirect (0); # prevents automatic following of redirects
   $ua->max_hosts(6); # sets maximum number of locations accessed in parallel
   $ua->max_req  (6); # sets maximum number of parallel requests per host

  foreach $url (@_)
  {
       $ua->register(HTTP::Request->new(GET => $url), \&callback);
  }

  $ua->wait ();
  return @results;

}


sub callback
{
        my($data, $response, $protocol) = @_;

        #Comment out this line to stop printing the URL
        print $response->base."\n";
        for ($i=0; $i<@urls; $i++)
        {
                if ( index( $response->base, $urls[$i]) != -1 )
                {
                        $results[$i].=$data;
                        last;
                }
        }
}
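
One thing the script above still does not do is what the original
question asked for literally: putting the string "timeout" into the
content slot of any URL that fails to respond in time. Below is a
minimal, untested sketch of one way to add that, using only calls
already shown in this thread ($ua->timeout, the hash of entries
returned by $ua->wait, and standard HTTP::Response methods such as
is_success and content). The subroutine name grab_with_timeout_marker
is illustrative and not part of the original script.

#!/usr/bin/perl

use LWP::Simple;
require LWP::Parallel::UserAgent;
require HTTP::Request;

@urls = (
"http://www.yahoo.com/",
"http://216.239.39.111/",
);

$timeout = 3; # each request times out after 3 seconds

@content = grab_with_timeout_marker(@urls);

print "$content[0]\n";   # content of the first URL (or "timeout")
print "$content[1]\n";   # the bogus URL should come back as "timeout"

exit;

sub grab_with_timeout_marker
{
   my @list    = @_;
   my @results = ("timeout") x @list;   # every slot defaults to "timeout"

   my $ua = LWP::Parallel::UserAgent->new();
   $ua->agent("MS Internet Explorer");
   $ua->timeout ($timeout);  # per-request timeout, as in the answer above
   $ua->redirect (0);        # prevents automatic following of redirects
   $ua->max_hosts(6);        # maximum number of locations accessed in parallel
   $ua->max_req  (6);        # maximum number of parallel requests per host

   # Register without a callback; responses are read after wait() instead.
   foreach my $url (@list)
   {
       $ua->register(HTTP::Request->new(GET => $url));
   }

   my $entries = $ua->wait();   # hash of entry objects, as shown earlier
   foreach my $key (keys %$entries)
   {
       my $res = $entries->{$key}->response;
       next unless $res && $res->is_success;
       # Match the response back to its URL the same way callback() did.
       for (my $i = 0; $i < @list; $i++)
       {
           if ( index( $res->base, $list[$i]) != -1 )
           {
               $results[$i] = $res->content;
               last;
           }
       }
   }
   return @results;
}

The design point is simply that every slot starts out as "timeout" and
is only overwritten when a successful response can be matched back to
its URL, so a dead server such as http://216.239.39.111/ leaves the
marker behind without any special-case code.
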
marcfest-ga rated this answer: 5 out of 5 stars
Put in a lot of effort to get it right and got it right. Walked the extra mile!

Comments  
There are no comments at this time.
