Subject: perl: fetching Web pages / redirect issue
Category: Computers > Programming
Asked by: marcfest-ga
List Price: $30.00
Posted: 24 Feb 2004 11:22 PST
Expires: 25 Mar 2004 11:22 PST
Question ID: 310341
Hi - The script below fetches Web page content from specified URLs. I'd like to have code added to the grab function so that, in case of a redirect, I'll have a way of knowing what URL is being redirected to by looking at the contents of @endurl. In the example below, for instance, "http://www.news.com" redirects to "http://news.com.com", therefore $endurl[0] should be "http://news.com.com". $endurl[1] should be "http://www.yahoo.com", since there is no redirect for that address. As an answer to this question, please add the necessary code to the grab function and post the entire modified, tested script. Thank you.

Marc

#!/usr/bin/perl

use LWP::Simple;
require LWP::Parallel::UserAgent;

@urls = (
    "http://www.news.com",
    "http://www.yahoo.com"
);

$timeout = 60;    # request timeout in seconds

@content = &grab(@urls);

# Add code to the grab function below so that @endurl is filled with the
# actual URLs that content is being fetched from.
# In cases where there is no redirect, those will be the same as $urls[n].
# In this example, $endurl[0] should be "http://news.com.com" (since this is
# what www.news.com redirects to) and $endurl[1] should be "http://www.yahoo.com".

print "$endurl[0] \n";
print $content[0];
exit;

sub grab {
    @results = ();
    $ua = LWP::Parallel::UserAgent->new();
    $ua->agent("MS Internet Explorer");
    $ua->timeout($timeout);
    $ua->redirect(1);    # allows automatic following of redirects
    $ua->max_hosts(6);   # sets maximum number of locations accessed in parallel
    $ua->max_req(2);     # sets maximum number of parallel requests per host
    foreach $url2 (@_) {
        $ua->register(HTTP::Request->new(GET => $url2), \&callback);
    }
    $ua->wait();
    return @results;
}

sub callback {
    my ($data, $response, $protocol) = @_;
    # Comment out this line to stop printing each fetched URL
    print $response->base . "\n";
    # majortom-ga's change: walk back to the first response in the redirect chain
    my $oresponse = $response;
    while (defined($oresponse->previous)) {
        $oresponse = $oresponse->previous;
    }
    for ($i = 0; $i < @urls; $i++) {
        # check $oresponse->base, not $response->base
        if (index($oresponse->base, $urls[$i]) != -1) {
            $results[$i] .= $data;
            last;
        }
    }
}

############
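For reference, the redirect bookkeeping the question relies on can be seen in isolation with the plain, non-parallel LWP::UserAgent, which follows redirects automatically by default. This is a minimal standalone sketch, not part of the question's script; the URL is the one from the question and the variable names are illustrative.

#!/usr/bin/perl
# Minimal sketch: reading the final URL after redirects with the plain
# (non-parallel) LWP::UserAgent, which follows redirects by default.
use strict;
use warnings;
use LWP::UserAgent;

my $ua = LWP::UserAgent->new(timeout => 30);
my $response = $ua->get("http://www.news.com");

# After LWP has followed any redirects, $response->request is the *last*
# request issued, so its URI is the URL the content actually came from.
my $final_url = $response->request->uri;

# Each redirect hop keeps a link to the response before it; walking
# ->previous back to the start recovers the originally requested URL.
my $r = $response;
$r = $r->previous while defined $r->previous;
my $original_url = $r->request->uri;

print "requested: $original_url\n";
print "fetched:   $final_url\n";

This is exactly the chain the parallel version walks in its callback: $response->request->uri for the end of the chain, and ->previous links back to the start.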
Subject: Re: perl: fetching Web pages / redirect issue
Answered By: majortom-ga on 27 Feb 2004 08:47 PST
Hello, marcfest-ga,

It's a pleasure to hear from you again. I have made the change you asked for, and I also made an improvement to the callback. Instead of searching for $urls[$i] in $oresponse->base (which just happens to work when there is no <base> tag, or when the <base> tag points to a deeper URL on the same web site), I now compare $urls[$i] directly to $oresponse->request->uri, the document actually asked for on this particular request. That may still not be the *original* document asked for, due to redirects, so we deal with that as we always have: by walking the list of responses.

Here is the code with the new @endurl feature. Let me know if you have any questions! Thanks for the opportunity to work on this interesting program.

Sources of information: "perldoc HTTP::Request", "perldoc HTTP::Response"

* * * CUT HERE * * *

#!/usr/bin/perl

use LWP::Simple;
require LWP::Parallel::UserAgent;

@urls = (
    "http://www.news.com",
    "http://www.yahoo.com"
);

$timeout = 60;    # request timeout in seconds

@content = &grab(@urls);

# The grab function now fills @endurl with the actual URLs that content is
# fetched from. Where there is no redirect, those are the same as $urls[n].
# In this example, $endurl[0] should be "http://news.com.com" (since this is
# what www.news.com redirects to) and $endurl[1] should be "http://www.yahoo.com".

print "ENDURLS ARE:\n";
for $u (@endurl) {
    print $u, "\n";
}
print "END OF ENDURLS\n";
print $content[0];
exit;

sub grab {
    @results = ();
    @endurl  = ();
    $ua = LWP::Parallel::UserAgent->new();
    $ua->agent("MS Internet Explorer");
    $ua->timeout($timeout);
    $ua->redirect(1);    # ALLOWS automatic following of redirects
    $ua->max_hosts(6);   # sets maximum number of locations accessed in parallel
    $ua->max_req(2);     # sets maximum number of parallel requests per host
    foreach $url2 (@_) {
        $ua->register(HTTP::Request->new(GET => $url2), \&callback);
    }
    $ua->wait();
    return @results;
}

sub callback {
    my ($data, $response, $protocol) = @_;
    # Comment out this line to stop printing each fetched URL
    print "URL: ", $response->base, "\n";
    # majortom-ga's change: walk back to the first response in the redirect chain
    my $oresponse = $response;
    while (defined($oresponse->previous)) {
        $oresponse = $oresponse->previous;
    }
    for ($i = 0; $i < @urls; $i++) {
        # majortom-ga: instead of trying to find $urls[$i] in
        # $oresponse->base, which won't necessarily contain it
        # if a Content-Base: header or <base> tag is present,
        # we match it exactly against $oresponse->request->uri
        if ($oresponse->request->uri eq $urls[$i]) {
            # majortom-ga: record the URL of the last page
            # we were redirected to
            $endurl[$i] = $response->request->uri;
            $results[$i] .= $data;
            last;
        }
    }
}
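One caveat with the exact match above: $oresponse->request->uri returns a URI object, and eq compares its stringified form, so the test only succeeds when the entry in @urls is byte-for-byte identical to the URI as issued. Here that holds, because both strings come from the same @urls array and the round trip through URI preserves them; but if the registered URLs and the comparison strings ever come from different sources, comparing canonical forms is more robust. A minimal sketch, assuming the URI module that ships with LWP; the same_url helper is illustrative, not part of majortom-ga's answer:

use strict;
use warnings;
use URI;

# Illustrative helper (not from the answer above): compare two URL
# strings by canonical form instead of raw eq, so that, e.g.,
# "http://WWW.Yahoo.com" and "http://www.yahoo.com/" count as equal.
sub same_url {
    my ($u1, $u2) = @_;
    return URI->new($u1)->canonical eq URI->new($u2)->canonical;
}

# In the callback, the exact-match test could then read:
#   if ( same_url($oresponse->request->uri, $urls[$i]) ) { ... }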
marcfest-ga rated this answer: "excellent solution. thank you."