Hello grabby-ga
Your question gives few exact details, so here is a generic Perl
script that you should be able to adapt to produce the solution you
require.
This script will:
1) Go to the page and retrieve the redirect
2) Download the redirect page
3) Match the text link
4) Download the text link into a file
#BEGIN
#!/usr/bin/perl
# modules to use
use LWP::UserAgent;
# this is the url of the page that gets redirected
$url = "http://redirectedurl.com";
# get this redirected page and find out where it is redirected to
$ua = new LWP::UserAgent;
$request = new HTTP::Request HEAD => $url;
$response = $ua->request($request);
$url = $response->request->url;
# download the redirected page
$browser = LWP::UserAgent->new();
$response = $browser->get($url);
$page_content = $response->content;
# use a regular expression to match the text file
# this will need to be nailed down more to ensure the link is correct
# but this is difficult to do without seeing a copy of the page it is on
$page_content =~ m/http\:\/\/(.*)\.txt/ ;
$text_link = "http://" . $1 . ".txt";
# download the text link page
$browser = LWP::UserAgent->new();
$response = $browser->get($text_link);
# save the output
open(OUT,">config.txt") || die $!;
print OUT $response->content;
close(OUT);
# end the program
exit(0);
#END
If you have any questions or need more help adapting this to your
situation, please ask for clarification and give as much further
information as you can about your exact requirements.
|
Request for Answer Clarification by grabby-ga on 27 Jul 2004 09:38 PDT
This is a great start. The actual file that I am trying to download is
in the following format (it is in a frame, BTW):
ENVOY1_SG1.NEWARK-RH-VNNN_2004_07_27_17_30_29_556.ascii
The name and date change daily but this should illustrate it somewhat.
|
Clarification of Answer by palitoy-ga on 27 Jul 2004 09:44 PDT
How much of the name changes every day? Do you have a link to the
page that you are trying to scrape? If the page you are trying to
scrape is in a frame, you should be able to discover its name by
looking at the frameset of the pages...
|
Request for Answer Clarification by grabby-ga on 27 Jul 2004 09:55 PDT
The format of the name stays the same; only the date and the unique
numeric identifier directly before .ascii change.
Can you update the regex to scrape this?
|
Clarification of Answer by palitoy-ga on 27 Jul 2004 09:59 PDT
Another way to match the ASCII file to download would be to parse the
HTML in the downloaded file. This can be done something like this:
use HTML::TokeParser;
$stream = HTML::TokeParser->new( $response->content_ref );
# process the <a> tags
while ( my $tag = $stream->get_tag('a') ) {
    # get the href from the <a> tag (skip anchors without one)
    my $href = $tag->[1]{'href'};
    next unless defined $href;
    # if the href contains .ascii...
    if ( $href =~ m/\.ascii/ ) {
        # it is the link we are looking for, so leave the while loop
        $text_link = $href;
        last;
    }
}
|
Clarification of Answer by palitoy-ga on 27 Jul 2004 10:04 PDT
I was writing my last comment as you posted yours :-)
The updated regex part would require you to change these lines:
$page_content =~ m/http\:\/\/(.*)\.txt/ ;
$text_link = "http://" . $1 . ".txt";
to:
$page_content =~ m/http\:\/\/(.*)ENVOY1\_SG1\.NEWARK\-RH\-VNNN\_(.*)\.ascii/ ;
$text_link = "http://" . $1 . "ENVOY1_SG1.NEWARK-RH-VNNN_" . $2 . ".ascii";
This could be simplified further still by removing the first (.*) in
the regex if the domain name/directory information also does not
change.
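For what it's worth, here is a minimal sketch of a tighter version of
that match. It assumes the link appears in the page inside a quoted
attribute (href or src), which I cannot confirm without seeing the
page. The sub name find_ascii_link and the sample HTML are made up for
illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: return the first quoted attribute value in $html
# that ends in the report file name, or undef if nothing matches.
sub find_ascii_link {
    my ($html) = @_;
    # [^"']*? keeps the match inside one quoted attribute value;
    # \d+(?:_\d+)* covers the date, time and numeric identifier.
    if ( $html =~ m{["']([^"']*?ENVOY1_SG1\.NEWARK-RH-VNNN_\d+(?:_\d+)*\.ascii)["']} ) {
        return $1;
    }
    return;
}

# Made-up sample page, only to show the call.
my $sample = '<frame src="http://example.com/reports/ENVOY1_SG1.NEWARK-RH-VNNN_2004_07_27_17_30_29_556.ascii">';
my $link = find_ascii_link($sample);
print defined $link ? "$link\n" : "no link found\n";
```

Checking whether the sub returned a defined value before downloading
avoids fetching a half-built URL when the pattern does not match.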
|
Request for Answer Clarification by grabby-ga on 27 Jul 2004 10:21 PDT
Eek! 501 Protocol scheme 'javascript' is not supported
Anything I can do here?
I tip well ;-)
|
Clarification of Answer by palitoy-ga on 27 Jul 2004 10:36 PDT
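I don't think I can help much with this error without being able to
see exactly what you are doing. LWP reports "Protocol scheme
'javascript' is not supported" when it is asked to fetch a URL that
begins with javascript:, so my guess is that the link being matched is
a JavaScript link rather than a plain http:// one.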
I don't know if this will work but it could be worth a try...
After this line:
$ua = new LWP::UserAgent;
Add this:
@pretend_to_be_netscape = (
    'User-Agent' => 'Mozilla/4.76 [en] (Win98; U)',
    'Accept-Language' => 'en-US',
    'Accept-Charset' => 'iso-8859-1,*,utf-8',
    # if the server does send gzip, read the body with $response->decoded_content
    'Accept-Encoding' => 'gzip',
    'Accept' => "image/gif, image/x-xbitmap, image/jpeg, image/pjpeg, image/png, */*",
    'Referer' => "http://www.google.com"
);
Then change this line:
$response = $browser->get($url);
to:
$response = $browser->get($url, @pretend_to_be_netscape);
And this line:
$response = $browser->get($text_link);
to:
$response = $browser->get($text_link, @pretend_to_be_netscape);
This should make the website think it is being queried by a Netscape
browser... if this doesn't help I am afraid I am not sure how to solve
the problem.
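If the matched link does turn out to begin with javascript:, LWP
cannot fetch it, so it may help to check the scheme before calling
get. A minimal sketch of such a guard follows; the sub name
is_fetchable and the example links are made up, and I am assuming the
links you want are absolute http or https URLs.

```perl
use strict;
use warnings;

# Hypothetical helper: true only for absolute http/https links.
sub is_fetchable {
    my ($link) = @_;
    return 0 unless defined $link;
    return $link =~ m{^https?://}i ? 1 : 0;
}

# Made-up examples of what the match might have returned.
for my $candidate ( 'http://example.com/report.ascii', 'javascript:openReport()' ) {
    if ( is_fetchable($candidate) ) {
        print "would fetch: $candidate\n";
    }
    else {
        print "skipping (not an http link): $candidate\n";
    }
}
```

Printing the skipped value also shows you exactly what the regex
captured, which should make it easier to adjust the pattern.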
|
Clarification of Answer by palitoy-ga on 27 Jul 2004 11:15 PDT
Thanks for the tip and rating! I'm sorry I couldn't work out the last
error part for you.
|
Request for Answer Clarification by grabby-ga on 27 Jul 2004 12:34 PDT
I think I have worked out why this is barfing. When I submit the
precomposed URL to the server's CGI, the page that is returned is a
"Processing" holding page, and 30-40 seconds later a redirect is
issued. Is this why the "$request = new HTTP::Request HEAD => $url;"
line doesn't work? Is there a way of doing this with WWW::Mechanize?
Any other ideas?
|
Clarification of Answer by palitoy-ga on 27 Jul 2004 12:51 PDT
If the redirect does not happen for a period of time, try looking at
the code of the "processing" page (you can print it out using
print $response->content;). The processing page may contain a redirect
instruction (for example a meta refresh tag or a piece of JavaScript)
that includes the URL you need.
|
Request for Answer Clarification by grabby-ga on 27 Jul 2004 13:45 PDT
Excellent, I'm almost there... This URL, when pasted in, gives me the
result I am looking for! Is there a way that I can pull the following
out of the HTML and assign to a variable everything after the "/cgi/"
up to the apostrophe/closing parenthesis? I'm thinking regex
trickery ;-)
function reloadPage()
{
location.replace
('/cgi/nhWeb?func=viewProgress&rptHtml=users/ENVOY1_SG1/mark/stats/ENVOY1_SG1.Newark-RH-VNNN_2004_07_27_21_37_14_217/&rptPid=6232&rptTimeStamp=1090960641&rptET=routerSwitch&subjectType=element&subject=ENVOY_1_SG1.Newark-RH-VNNN&report=Standard&isShortCut=No&includeNav=');
}
|
Clarification of Answer by palitoy-ga on 28 Jul 2004 01:30 PDT
The regex should be something like this:
$page_content =~ m/\/cgi\/(.*)includeNav\=\'/ ;
$text_link = $1 . "includeNav=";
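As a slightly more defensive variant, the sketch below anchors on the
location.replace call itself and stops at the closing quote, so it
does not depend on includeNav being the last parameter. It assumes the
holding page always uses location.replace with a quoted /cgi/ path, as
in your sample. The sub name extract_reload_path and the shortened
sample script are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: pull the quoted argument out of location.replace(...)
# and return the part after "/cgi/", or undef if the call is not found.
sub extract_reload_path {
    my ($html) = @_;
    # \s* allows the line break between "location.replace" and "(";
    # [^'"]+ stops at the closing quote.
    if ( $html =~ m{location\.replace\s*\(\s*['"]/cgi/([^'"]+)['"]\s*\)} ) {
        return $1;
    }
    return;
}

# Shortened, made-up version of the holding page's script.
my $sample = "function reloadPage()\n{\nlocation.replace\n('/cgi/nhWeb?func=viewProgress&includeNav=');\n}\n";
my $path = extract_reload_path($sample);
print defined $path ? "/cgi/$path\n" : "no redirect found\n";
```

You would still need to put the server's address in front of the
result to get a full URL for $browser->get.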
|