Subject: perl: fetching Web pages / redirect issue
Category: Computers > Programming
Asked by: marcfest-ga
List Price: $30.00
Posted: 24 Feb 2004 11:22 PST
Expires: 25 Mar 2004 11:22 PST
Question ID: 310341
Hi - The script below fetches Web page content from specified URLs. I'd like to have code added to the grab function so that, in case of a redirect, I'll have a way of knowing what URL is being redirected to by looking at the contents of @endurl. In the example below, for instance, "http://www.news.com" redirects to "http://news.com.com", therefore $endurl[0] should be "http://news.com.com". $endurl[1] should be "http://www.yahoo.com", since there is no redirect for that address. As an answer to this question, please add the necessary code to the grab function and post the entire modified, tested script. Thank you.

Marc

#!/usr/bin/perl

use LWP::Simple;
require LWP::Parallel::UserAgent;

@urls = (
    "http://www.news.com",
    "http://www.yahoo.com"
);

$timeout = 60;    # request timeout in seconds

@content = &grab(@urls);

# Add code to the grab function below so that @endurl is filled with the
# actual URLs that content is being fetched from.
# In cases where there is no redirect, those will be the same as $urls[n].
# In this example, $endurl[0] should be "http://news.com.com" (since this is
# what www.news.com redirects to) and $endurl[1] should be "http://www.yahoo.com".

print "$endurl[0] \n";
print $content[0];
exit;

sub grab {
    @results = ();
    $ua = LWP::Parallel::UserAgent->new();
    $ua->agent("MS Internet Explorer");
    $ua->timeout($timeout);
    $ua->redirect(1);    # allows automatic following of redirects
    $ua->max_hosts(6);   # sets maximum number of locations accessed in parallel
    $ua->max_req(2);     # sets maximum number of parallel requests per host
    foreach $url2 (@_) {
        $ua->register(HTTP::Request->new(GET => $url2), \&callback);
    }
    $ua->wait();
    return @results;
}

sub callback {
    my ($data, $response, $protocol) = @_;
    # Comment out this line to stop printing each fetched URL
    print $response->base . "\n";
    # majortom-ga's change: walk back to the first response in the redirect chain
    my $oresponse = $response;
    while (defined($oresponse->previous)) {
        $oresponse = $oresponse->previous;
    }
    for ($i = 0; $i < @urls; $i++) {
        # check $oresponse->base, not $response->base
        if (index($oresponse->base, $urls[$i]) != -1) {
            $results[$i] .= $data;
            last;
        }
    }
}

############
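For reference, the redirect bookkeeping the question relies on can be seen in isolation with the plain, non-parallel LWP::UserAgent, which follows redirects automatically by default. This is a minimal standalone sketch, not part of the question's script; the URL is the one from the question and the variable names are illustrative.

#!/usr/bin/perl
# Minimal sketch: reading the final URL after redirects with the plain
# (non-parallel) LWP::UserAgent, which follows redirects by default.
use strict;
use warnings;
use LWP::UserAgent;

my $ua = LWP::UserAgent->new(timeout => 30);
my $response = $ua->get("http://www.news.com");

# After LWP has followed any redirects, $response->request is the *last*
# request issued, so its URI is the URL the content actually came from.
my $final_url = $response->request->uri;

# Each redirect hop keeps a link to the response before it; walking
# ->previous back to the start recovers the originally requested URL.
my $r = $response;
$r = $r->previous while defined $r->previous;
my $original_url = $r->request->uri;

print "requested: $original_url\n";
print "fetched:   $final_url\n";

This is exactly the chain the parallel version walks in its callback: $response->request->uri for the end of the chain, and ->previous links back to the start.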
Subject: Re: perl: fetching Web pages / redirect issue
Answered By: majortom-ga on 27 Feb 2004 08:47 PST
Hello, marcfest-ga,

It's a pleasure to hear from you again. I have made the change you asked for, and I also made an improvement to the callback. Instead of searching for $urls[$i] in $oresponse->base (which just happens to work when there is no <base> tag, or when the <base> tag points to a deeper URL on the same web site), I now compare $urls[$i] directly to $oresponse->request->uri, the document actually asked for on this particular request. That may still not be the *original* document asked for, due to redirects, so we deal with that as we always have: by walking the list of responses.

Here is the code with the new @endurl feature. Let me know if you have any questions! Thanks for the opportunity to work on this interesting program.

Sources of information: "perldoc HTTP::Request", "perldoc HTTP::Response"

* * * CUT HERE * * *

#!/usr/bin/perl

use LWP::Simple;
require LWP::Parallel::UserAgent;

@urls = (
    "http://www.news.com",
    "http://www.yahoo.com"
);

$timeout = 60;    # request timeout in seconds

@content = &grab(@urls);

# The grab function now fills @endurl with the actual URLs that content is
# fetched from. Where there is no redirect, those are the same as $urls[n].
# In this example, $endurl[0] should be "http://news.com.com" (since this is
# what www.news.com redirects to) and $endurl[1] should be "http://www.yahoo.com".

print "ENDURLS ARE:\n";
for $u (@endurl) {
    print $u, "\n";
}
print "END OF ENDURLS\n";
print $content[0];
exit;

sub grab {
    @results = ();
    @endurl  = ();
    $ua = LWP::Parallel::UserAgent->new();
    $ua->agent("MS Internet Explorer");
    $ua->timeout($timeout);
    $ua->redirect(1);    # ALLOWS automatic following of redirects
    $ua->max_hosts(6);   # sets maximum number of locations accessed in parallel
    $ua->max_req(2);     # sets maximum number of parallel requests per host
    foreach $url2 (@_) {
        $ua->register(HTTP::Request->new(GET => $url2), \&callback);
    }
    $ua->wait();
    return @results;
}

sub callback {
    my ($data, $response, $protocol) = @_;
    # Comment out this line to stop printing each fetched URL
    print "URL: ", $response->base, "\n";
    # majortom-ga's change: walk back to the first response in the redirect chain
    my $oresponse = $response;
    while (defined($oresponse->previous)) {
        $oresponse = $oresponse->previous;
    }
    for ($i = 0; $i < @urls; $i++) {
        # majortom-ga: instead of trying to find $urls[$i] in
        # $oresponse->base, which won't necessarily contain it
        # if a Content-Base: header or <base> tag is present,
        # we match it exactly against $oresponse->request->uri
        if ($oresponse->request->uri eq $urls[$i]) {
            # majortom-ga: record the URL of the last page
            # we were redirected to
            $endurl[$i] = $response->request->uri;
            $results[$i] .= $data;
            last;
        }
    }
}
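One caveat with the exact match above: $oresponse->request->uri returns a URI object, and eq compares its stringified form, so the test only succeeds when the entry in @urls is byte-for-byte identical to the URI as issued. Here that holds, because both strings come from the same @urls array and the round trip through URI preserves them; but if the registered URLs and the comparison strings ever come from different sources, comparing canonical forms is more robust. A minimal sketch, assuming the URI module that ships with LWP; the same_url helper is illustrative, not part of majortom-ga's answer:

use strict;
use warnings;
use URI;

# Illustrative helper (not from the answer above): compare two URL
# strings by canonical form instead of raw eq, so that, e.g.,
# "http://WWW.Yahoo.com" and "http://www.yahoo.com/" count as equal.
sub same_url {
    my ($u1, $u2) = @_;
    return URI->new($u1)->canonical eq URI->new($u2)->canonical;
}

# In the callback, the exact-match test could then read:
#   if ( same_url($oresponse->request->uri, $urls[$i]) ) { ... }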
marcfest-ga rated this answer: "excellent solution. thank you."