Q: Scraping ascii file with redirected URL using Perl (Answered, 4 out of 5 stars, 0 Comments)
Question  
Subject: Scraping ascii file with redirected URL using Perl
Category: Computers > Programming
Asked by: grabby-ga
List Price: $40.00
Posted: 27 Jul 2004 07:33 PDT
Expires: 26 Aug 2004 07:33 PDT
Question ID: 379640
I am trying to scrape a page that is generated on the fly by a
webserver. Consistent data is submitted (by me) to the CGI on the
target machine, then a report is generated on the fly and a redirect
is issued. I need the script to obtain and follow this redirect, match
an "ascii download" link on the target page, and "click" this link to
download the file. I was thinking of using WWW::Mechanize to achieve
this. A full code example would be required.
Answer  
Subject: Re: Scraping ascii file with redirected URL using Perl
Answered By: palitoy-ga on 27 Jul 2004 09:06 PDT
Rated:4 out of 5 stars
 
Hello grabby-ga

There are few exact details in your question, so I will give a generic
Perl script that you should be able to modify to produce the solution
you require.

This script will:

1) Go to the page and retrieve the redirect
2) Download the redirect page
3) Match the text link
4) Download the text link into a file

#BEGIN
#!/usr/bin/perl

# modules to use
use LWP::UserAgent;
use HTTP::Request;

# this is the url of the page that gets redirected
$url = "http://redirectedurl.com";

# get this redirected page and find out where it is redirected to
$ua = new LWP::UserAgent;
$request = new HTTP::Request HEAD => $url;
$response = $ua->request($request);
$url = $response->request->url;

# download the redirected page
$browser = LWP::UserAgent->new();
$response = $browser->get($url);
$page_content = $response->content;

# use a regular expression to match the text file
# this will need to be nailed down more to ensure the link is correct
# but this is difficult to do without seeing a copy of the page it is on
$page_content =~ m/http\:\/\/(.*)\.txt/ ;
$text_link = "http://" . $1 . ".txt";

# download the text link (re-using the browser object created above)
$response = $browser->get($text_link);

# save the output
open(OUT,">config.txt") || die "cannot write config.txt: $!";
print OUT $response->content;
close(OUT);

# end the program
exit(0);
#END

If you have any questions or need some more help in adapting this to
your situation please ask for clarification and give as much further
information as you can for your exact requirements.

Request for Answer Clarification by grabby-ga on 27 Jul 2004 09:38 PDT
This is a great start. The actual file that I am trying to download is
in the following format (it is in a frame, BTW):

ENVOY1_SG1.NEWARK-RH-VNNN_2004_07_27_17_30_29_556.ascii

The name and date change daily, but this should illustrate it somewhat.

Clarification of Answer by palitoy-ga on 27 Jul 2004 09:44 PDT
How much of the name changes every day?  Do you have a link to the
page that you are trying to scrape?  If the page you are trying to
scrape is in a frame, you should be able to discover the name of the
page by looking at the frameset of the enclosing pages...

Request for Answer Clarification by grabby-ga on 27 Jul 2004 09:55 PDT
The format of the name stays the same; only the date changes, and the
unique numeric identifier directly before .ascii.
Can you update the regex to scrape this?

Clarification of Answer by palitoy-ga on 27 Jul 2004 09:59 PDT
Another way to match the ASCII file to download would be to parse the
HTML in the downloaded file.  This can be done something like this:

use HTML::TokeParser;
$stream = HTML::TokeParser->new( $response->content_ref );
# process the <a> tags
while ( my $tag = $stream->get_tag('a') ) {
  # get the href from the <a> tag
  $text_link = $tag->[1]{'href'};
  # if the href contains .ascii...
  if ( $text_link =~ m/\.ascii/ ) {
    # it is the link we are looking for, so stop searching
    # (note: "last" is Perl's loop-exit keyword; "break" does not exist)
    last;
  }
}

Clarification of Answer by palitoy-ga on 27 Jul 2004 10:04 PDT
I was writing my last comment as you posted yours :-)

The updated regex part would require you to change these lines:

$page_content =~ m/http\:\/\/(.*)\.txt/ ;
$text_link = "http://" . $1 . ".txt";

to:

$page_content =~ m/http\:\/\/(.*)ENVOY1\_SG1\.NEWARK\-RH\-VNNN\_(.*)\.ascii/ ;
$text_link = "http://" . $1 . "ENVOY1_SG1.NEWARK-RH-VNNN_" . $2 . ".ascii";

This could be simplified further still by removing the first (.*) in
the regex if the domain name/directory information also does not
change.
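For reference, here is a minimal, self-contained sketch of that updated regex in action. The filename format comes from the question above; the host and path (example.com/reports/) are invented purely for illustration:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical page fragment: only the filename format is from the
# question; the domain and directory are made up for this example.
my $page_content =
  '<a href="http://example.com/reports/ENVOY1_SG1.NEWARK-RH-VNNN_2004_07_27_17_30_29_556.ascii">ascii download</a>';

my $text_link = '';
if ( $page_content =~ m/http\:\/\/(.*)ENVOY1\_SG1\.NEWARK\-RH\-VNNN\_(.*)\.ascii/ ) {
    # $1 is the host/directory part, $2 is the date and unique identifier
    $text_link = "http://" . $1 . "ENVOY1_SG1.NEWARK-RH-VNNN_" . $2 . ".ascii";
}
print "$text_link\n";
```

Note that the greedy (.*) captures are only safe here because the link appears once in the fragment; on a busier page you may want non-greedy (.*?) instead.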

Request for Answer Clarification by grabby-ga on 27 Jul 2004 10:21 PDT
Eek! 501 Protocol scheme 'javascript' is not supported

Anything I can do here?

I tip well ;-)

Clarification of Answer by palitoy-ga on 27 Jul 2004 10:36 PDT
I don't think I can help much with this error without being able to
see exactly what you are doing.  My guess is that the site is checking
whether JavaScript is supported when the page is fetched with LWP.

I don't know if this will work but it could be worth a try...

After this line:
$ua = new LWP::UserAgent;

Add this:
@pretend_to_be_netscape = (
    'User-Agent' => 'Mozilla/4.76 [en] (Win98; U)',
    'Accept-Language' => 'en-US',
    'Accept-Charset' => 'iso-8859-1,*,utf-8',
    # note: no Accept-Encoding header -- asking for gzip would leave
    # the downloaded content compressed and break the regex matching
    'Accept' => "image/gif, image/x-xbitmap, image/jpeg, image/pjpeg, image/png, */*",
    'Referer' => "http://www.google.com"
);

Then change this line:
$response = $browser->get($url);

to:
$response = $browser->get($url, @pretend_to_be_netscape);

And this line:
$response = $browser->get($text_link);

to:
$response = $browser->get($text_link, @pretend_to_be_netscape);

This should make the website think it is being queried by a Netscape
browser... if this doesn't help I am afraid I am not sure how to solve
the problem.

Clarification of Answer by palitoy-ga on 27 Jul 2004 11:15 PDT
Thanks for the tip and rating!  I'm sorry I couldn't work out the last
error part for you.

Request for Answer Clarification by grabby-ga on 27 Jul 2004 12:34 PDT
I think I have worked out why this is barfing, when I submit the
precomposed URL to the servers' CGI the page that is returned is a
"Processing" holding page, then 30-40 seconds later a redirect is
issued, so is this why the "$request = new HTTP::Request HEAD =>
$url;" doesnt work? Is there a way of doing this with WWW::Mechanize ?
Any other ideas?

Clarification of Answer by palitoy-ga on 27 Jul 2004 12:51 PDT
If the redirect does not happen for a period of time, try looking at
the source of the "processing" page (you can print this out using
print $response->content;).  The processing page may contain an HTTP
redirect command which will include the URL you need.
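As a hypothetical sketch of pulling such a redirect target out of the page source: the snippet below assumes the page uses either a meta-refresh tag or a JavaScript location.replace() call (the sample markup is invented; the real page will differ):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Invented "processing" page source for illustration only.
my $processing_page = <<'HTML';
<html><head>
<meta http-equiv="refresh" content="30;url=/cgi/nhWeb?func=viewProgress">
</head><body>Processing, please wait...</body></html>
HTML

my $redirect_url = '';
if ( $processing_page =~ m/content="\d+;\s*url=([^"]+)"/i ) {
    # meta-refresh style redirect
    $redirect_url = $1;
}
elsif ( $processing_page =~ m/location\.replace\s*\(\s*'([^']+)'/ ) {
    # JavaScript location.replace() style redirect
    $redirect_url = $1;
}
print "$redirect_url\n";
```

Once you have $redirect_url you can feed it straight back into $browser->get(), possibly after prepending the host part of the original URL if the target is a relative path.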

Request for Answer Clarification by grabby-ga on 27 Jul 2004 13:45 PDT
excellent, I'm almost there... This URL when pasted in gives me the
result I am looking for!!! Is there a way that I can pull the
following out of the HTML and assign a variable with anything after
the "/cgi/" up to the apostrophe/closing paranthasis. I'm thinking
regex trickery ;-)

 function reloadPage()
{
        location.replace
('/cgi/nhWeb?func=viewProgress&rptHtml=users/ENVOY1_SG1/mark/stats/ENVOY1_SG1.Newark-RH-VNNN_2004_07_27_21_37_14_217/&rptPid=6232&rptTimeStamp=1090960641&rptET=routerSwitch&subjectType=element&subject=ENVOY_1_SG1.Newark-RH-VNNN&report=Standard&isShortCut=No&includeNav=');
}

Clarification of Answer by palitoy-ga on 28 Jul 2004 01:30 PDT
The regex should be something like this:

$page_content =~ m/\/cgi\/(.*)includeNav\=\'/ ;
$text_link = $1 . "includeNav=";
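As a quick self-contained check, that regex run against the reloadPage() fragment quoted above extracts the CGI query string as intended:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# The reloadPage() fragment from the question, kept intact.
my $page_content = <<'JS';
location.replace
('/cgi/nhWeb?func=viewProgress&rptHtml=users/ENVOY1_SG1/mark/stats/ENVOY1_SG1.Newark-RH-VNNN_2004_07_27_21_37_14_217/&rptPid=6232&rptTimeStamp=1090960641&rptET=routerSwitch&subjectType=element&subject=ENVOY_1_SG1.Newark-RH-VNNN&report=Standard&isShortCut=No&includeNav=');
JS

my $text_link = '';
if ( $page_content =~ m/\/cgi\/(.*)includeNav\=\'/ ) {
    # $1 is everything between /cgi/ and includeNav='
    $text_link = $1 . "includeNav=";
}
print "$text_link\n";
```

The result starts at nhWeb?func=viewProgress; if you need the full path for a later $browser->get(), remember to prepend the host and the /cgi/ prefix yourself.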

grabby-ga rated this answer: 4 out of 5 stars and gave an additional tip of: $25.00
It didn't work straight-away but it has given me *lots* of ideas.
Thanks for the excellent answers and guidance.

Comments  
There are no comments at this time.
