Q: Scraping ascii file with redirected URL using Perl (Answered, 4 out of 5 stars, 0 Comments)
Question  
Subject: Scraping ascii file with redirected URL using Perl
Category: Computers > Programming
Asked by: grabby-ga
List Price: $40.00
Posted: 27 Jul 2004 07:33 PDT
Expires: 26 Aug 2004 07:33 PDT
Question ID: 379640
I am trying to scrape a page that is generated on the fly by a
webserver. Consistent data is submitted (by me) to the CGI on the
target machine, then a report is generated on the fly and a redirect
is issued. I need the script to obtain and follow this redirect, match
an "ascii download" link on the target page, and "click" this link to
download the file. I was thinking of using WWW::Mechanize to achieve
this. A full code example would be required.
Answer  
Subject: Re: Scraping ascii file with redirected URL using Perl
Answered By: palitoy-ga on 27 Jul 2004 09:06 PDT
Rated:4 out of 5 stars
 
Hello grabby-ga

There are few exact details in your question, so I will give a generic
Perl script that you should be able to modify to produce the solution
you require.

This script will:

1) Go to the page and retrieve the redirect
2) Download the redirect page
3) Match the text link
4) Download the text link into a file

#BEGIN
#!/usr/bin/perl

# modules to use
use LWP::UserAgent;
use HTTP::Request;

# this is the url of the page that gets redirected
$url = "http://redirectedurl.com";

# get this redirected page and find out where it is redirected to
$ua = new LWP::UserAgent;
$request = new HTTP::Request HEAD => $url;
$response = $ua->request($request);
$url = $response->request->url;

# download the redirected page
$browser = LWP::UserAgent->new();
$response = $browser->get($url);
$page_content = $response->content;

# use a regular expression to match the text file
# this will need to be nailed down more to ensure the link is correct
# but this is difficult to do without seeing a copy of the page it is on
$page_content =~ m/http\:\/\/(.*)\.txt/ ;
$text_link = "http://" . $1 . ".txt";

# download the text link (re-using the browser object created above)
$response = $browser->get($text_link);

# save the output
open(OUT,">config.txt") || die "cannot write config.txt: $!";
print OUT $response->content;
close(OUT);

# end the program
exit(0);
#END

If you have any questions or need some more help in adapting this to
your situation please ask for clarification and give as much further
information as you can for your exact requirements.

Request for Answer Clarification by grabby-ga on 27 Jul 2004 09:38 PDT
This is a great start. The actual file that I am trying to download is
in the following format (it is in a frame, BTW):

ENVOY1_SG1.NEWARK-RH-VNNN_2004_07_27_17_30_29_556.ascii

The name and date change daily, but this should illustrate it somewhat.

Clarification of Answer by palitoy-ga on 27 Jul 2004 09:44 PDT
How much of the name changes every day?  Do you have a link to the
page that you are trying to scrape?  If the page you are trying to
scrape is in a frame, you should be able to discover the name of the
page by looking at the frameset of the enclosing pages...

Request for Answer Clarification by grabby-ga on 27 Jul 2004 09:55 PDT
The format of the name stays the same; only the date changes, and the
unique numeric identifier directly before .ascii.
Can you update the regex to scrape this?

Clarification of Answer by palitoy-ga on 27 Jul 2004 09:59 PDT
Another way to match the ASCII file to download would be to parse the
HTML in the downloaded file.  This can be done something like this:

use HTML::TokeParser;
$stream = HTML::TokeParser->new( $response->content_ref );
# process the <a> tags
while ( my $tag = $stream->get_tag('a') ) {
  # get the href from the <a> tag
  $text_link = $tag->[1]{'href'};
  # if the href contains .ascii...
  if ( $text_link =~ m/\.ascii/ ) {
    # it is the link we are looking for, so stop searching
    # (note: "last" is Perl's loop-exit keyword; "break" does not exist)
    last;
  }
}

Clarification of Answer by palitoy-ga on 27 Jul 2004 10:04 PDT
I was writing my last comment as you posted yours :-)

The updated regex part would require you to change these lines:

$page_content =~ m/http\:\/\/(.*)\.txt/ ;
$text_link = "http://" . $1 . ".txt";

to:

$page_content =~ m/http\:\/\/(.*)ENVOY1\_SG1\.NEWARK\-RH\-VNNN\_(.*)\.ascii/ ;
$text_link = "http://" . $1 . "ENVOY1_SG1.NEWARK-RH-VNNN_" . $2 . ".ascii";

This could be simplified further still by removing the first (.*) in
the regex if the domain name/directory information also does not
change.
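For reference, here is a minimal, self-contained sketch of that updated regex in action. The filename format comes from the question above; the host and path (example.com/reports/) are invented purely for illustration:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical page fragment: only the filename format is from the
# question; the domain and directory are made up for this example.
my $page_content =
  '<a href="http://example.com/reports/ENVOY1_SG1.NEWARK-RH-VNNN_2004_07_27_17_30_29_556.ascii">ascii download</a>';

my $text_link = '';
if ( $page_content =~ m/http\:\/\/(.*)ENVOY1\_SG1\.NEWARK\-RH\-VNNN\_(.*)\.ascii/ ) {
    # $1 is the host/directory part, $2 is the date and unique identifier
    $text_link = "http://" . $1 . "ENVOY1_SG1.NEWARK-RH-VNNN_" . $2 . ".ascii";
}
print "$text_link\n";
```

Note that the greedy (.*) captures are only safe here because the link appears once in the fragment; on a busier page you may want non-greedy (.*?) instead.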

Request for Answer Clarification by grabby-ga on 27 Jul 2004 10:21 PDT
Eek! 501 Protocol scheme 'javascript' is not supported

Anything I can do here?

I tip well ;-)

Clarification of Answer by palitoy-ga on 27 Jul 2004 10:36 PDT
I don't think I can help much with this error without being able to
see exactly what you are doing.  My guess is that the site is checking
whether JavaScript is supported when the page is fetched with LWP.

I don't know if this will work but it could be worth a try...

After this line:
$ua = new LWP::UserAgent;

Add this:
@pretend_to_be_netscape = (
    'User-Agent' => 'Mozilla/4.76 [en] (Win98; U)',
    'Accept-Language' => 'en-US',
    'Accept-Charset' => 'iso-8859-1,*,utf-8',
    # note: no Accept-Encoding header -- asking for gzip would leave
    # the downloaded content compressed and break the regex matching
    'Accept' => "image/gif, image/x-xbitmap, image/jpeg, image/pjpeg, image/png, */*",
    'Referer' => "http://www.google.com"
);

Then change this line:
$response = $browser->get($url);

to:
$response = $browser->get($url, @pretend_to_be_netscape);

And this line:
$response = $browser->get($text_link);

to:
$response = $browser->get($text_link, @pretend_to_be_netscape);

This should make the website think it is being queried by a Netscape
browser... if this doesn't help I am afraid I am not sure how to solve
the problem.

Clarification of Answer by palitoy-ga on 27 Jul 2004 11:15 PDT
Thanks for the tip and rating!  I'm sorry I couldn't work out the last
error part for you.

Request for Answer Clarification by grabby-ga on 27 Jul 2004 12:34 PDT
I think I have worked out why this is barfing, when I submit the
precomposed URL to the servers' CGI the page that is returned is a
"Processing" holding page, then 30-40 seconds later a redirect is
issued, so is this why the "$request = new HTTP::Request HEAD =>
$url;" doesnt work? Is there a way of doing this with WWW::Mechanize ?
Any other ideas?

Clarification of Answer by palitoy-ga on 27 Jul 2004 12:51 PDT
If the redirect does not happen for a period of time, try looking at
the source of the "processing" page (you can print this out using
print $response->content;).  The processing page may contain an HTTP
redirect command which will include the URL you need.
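As a hypothetical sketch of pulling such a redirect target out of the page source: the snippet below assumes the page uses either a meta-refresh tag or a JavaScript location.replace() call (the sample markup is invented; the real page will differ):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Invented "processing" page source for illustration only.
my $processing_page = <<'HTML';
<html><head>
<meta http-equiv="refresh" content="30;url=/cgi/nhWeb?func=viewProgress">
</head><body>Processing, please wait...</body></html>
HTML

my $redirect_url = '';
if ( $processing_page =~ m/content="\d+;\s*url=([^"]+)"/i ) {
    # meta-refresh style redirect
    $redirect_url = $1;
}
elsif ( $processing_page =~ m/location\.replace\s*\(\s*'([^']+)'/ ) {
    # JavaScript location.replace() style redirect
    $redirect_url = $1;
}
print "$redirect_url\n";
```

Once you have $redirect_url you can feed it straight back into $browser->get(), possibly after prepending the host part of the original URL if the target is a relative path.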

Request for Answer Clarification by grabby-ga on 27 Jul 2004 13:45 PDT
excellent, I'm almost there... This URL when pasted in gives me the
result I am looking for!!! Is there a way that I can pull the
following out of the HTML and assign a variable with anything after
the "/cgi/" up to the apostrophe/closing paranthasis. I'm thinking
regex trickery ;-)

 function reloadPage()
{
        location.replace
('/cgi/nhWeb?func=viewProgress&rptHtml=users/ENVOY1_SG1/mark/stats/ENVOY1_SG1.Newark-RH-VNNN_2004_07_27_21_37_14_217/&rptPid=6232&rptTimeStamp=1090960641&rptET=routerSwitch&subjectType=element&subject=ENVOY_1_SG1.Newark-RH-VNNN&report=Standard&isShortCut=No&includeNav=');
}

Clarification of Answer by palitoy-ga on 28 Jul 2004 01:30 PDT
The regex should be something like this:

$page_content =~ m/\/cgi\/(.*)includeNav\=\'/ ;
$text_link = $1 . "includeNav=";
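As a quick self-contained check, that regex run against the reloadPage() fragment quoted above extracts the CGI query string as intended:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# The reloadPage() fragment from the question, kept intact.
my $page_content = <<'JS';
location.replace
('/cgi/nhWeb?func=viewProgress&rptHtml=users/ENVOY1_SG1/mark/stats/ENVOY1_SG1.Newark-RH-VNNN_2004_07_27_21_37_14_217/&rptPid=6232&rptTimeStamp=1090960641&rptET=routerSwitch&subjectType=element&subject=ENVOY_1_SG1.Newark-RH-VNNN&report=Standard&isShortCut=No&includeNav=');
JS

my $text_link = '';
if ( $page_content =~ m/\/cgi\/(.*)includeNav\=\'/ ) {
    # $1 is everything between /cgi/ and includeNav='
    $text_link = $1 . "includeNav=";
}
print "$text_link\n";
```

The result starts at nhWeb?func=viewProgress; if you need the full path for a later $browser->get(), remember to prepend the host and the /cgi/ prefix yourself.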

grabby-ga rated this answer: 4 out of 5 stars and gave an additional tip of: $25.00
It didn't work straight-away but it has given me *lots* of ideas.
Thanks for the excellent answers and guidance.

Comments  
There are no comments at this time.
