Google Answers: Logging in to Yahoo! From a Perl Script?


Q: Logging in to Yahoo! From a Perl Script? ( No Answer, 1 Comment )

Question

Subject: Logging in to Yahoo! From a Perl Script?
Category: Computers > Algorithms
Asked by: mattk11-ga
List Price: $20.00

Posted: 20 Apr 2006 20:44 PDT
Expires: 20 May 2006 20:44 PDT
Question ID: 721176

I'd like a simple Perl script that will automatically log in to a
Yahoo page (like http://finance.yahoo.com) and output the page to a
text file. I've tried some of the scripts online and I can't seem to
get them to work. Specifically this Google Answer from 2002 does not
work anymore: http://answers.google.com/answers/threadview?id=31379

Answer

There is no answer at this time.

Comments

Subject: Re: Logging in to Yahoo! From a Perl Script?
From: popoff-ga on 20 May 2006 20:06 PDT

I'm not sure, but it sounds like you don't really need anything to log
in exactly, just download a file using HTTP.  There's a program you
can run (available for Windows and Linux) called wget which does this,
though not directly from Perl.  You use it from a DOS or Shell prompt
like this:

$ wget <url>

e.g.

$ wget http://www.somedomain.com/somefile.txt

This would save the text file as "somefile.txt" in your current
working directory, which is probably something like what you're after.
(Above, the $ is the prompt; you don't type it.  In DOS you'll see
the usual C:\> prompt instead.)  Any shell command can be
called from Perl using, if I remember correctly:

system("command here")

if wget is installed on your path.  Wget takes various options to let
you specify things like retry behaviour, output filename and so on. 
-O (dash capital O) sets the output file.  So to download the URL you
suggest and save it as temp.html, you could use:

system("wget http://finance.yahoo.com/ -O temp.html")

Then your script could open the file and read it or do whatever else
it wanted with it.  This is a not-too-pretty method, but it works and
is dead simple, while allowing you to use all the features wget offers
(like retrying and even spider-like link-following and incremental
mirroring behaviour, which is very neat).
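
For what it's worth, here is a minimal sketch of the wget approach
wrapped in a Perl script.  The sub names and the default file name
temp.html are just illustrative choices, and it assumes wget is on
your path.  It uses the list form of system() so the URL never passes
through the shell, and it checks the exit status before reading the
file back in.

```perl
use strict;
use warnings;

# Build the wget argument list; -O names the output file.
sub wget_command {
    my ($url, $outfile) = @_;
    return ('wget', $url, '-O', $outfile);
}

# Run wget, then read the saved page back into a string.
sub fetch_with_wget {
    my ($url, $outfile) = @_;
    # The list form of system() bypasses the shell, so characters such
    # as & or ? in the URL need no quoting.
    my $status = system(wget_command($url, $outfile));
    die "could not run wget: $!\n" if $status == -1;
    die "wget failed (exit status " . ($status >> 8) . ")\n" if $status != 0;

    open my $in, '<', $outfile or die "Cannot open $outfile: $!";
    local $/;                      # slurp the whole file at once
    my $page = <$in>;
    close $in;
    return $page;
}

# Only run wget when a URL is given on the command line.
if (@ARGV) {
    my $page = fetch_with_wget($ARGV[0], 'temp.html');
    print length($page), " bytes saved to temp.html\n";
}
```

Saved under a name of your choosing, it would be run with the URL as
its only argument.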

On Unix-like systems, you probably already have wget.  For Win32
systems, probably a binary download is the easiest method.  You can
get it from:

    http://users.ugent.be/~bpuype/wget/

If you actually want to log in, supplying a username and password,
then the good news is that it can be done but the bad news is that
it's not that easy and requires a bit of background information.

When a browser logs in using the login form, it sends a request -
either an HTTP GET request or an HTTP POST request - to the server.
The request carries name-value pairs; for a login, one will probably
be username=something and another password=somethingelse (though the
names, contents, and overall strategy may well differ).  If you see
these pairs in the address bar when using the site, they're being
sent with a GET request.  If you don't, it's a POST request, where the
browser sends the pairs in the body of the request rather than in the
URL.  Either way, the request goes to the URL of a server-side script
or server-side application of some kind.
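
To make the GET/POST difference concrete, here is a small sketch that
encodes the same two name-value pairs both ways.  It assumes the core
HTTP::Tiny module (shipped with Perl since 5.14) purely for its
www_form_urlencode helper; the login URL is made up for illustration
and nothing is sent over the network.

```perl
use strict;
use warnings;
use HTTP::Tiny;

my $http   = HTTP::Tiny->new;
my %fields = (username => 'something', password => 'somethingelse');

# Joins the pairs as name=value&name=value and escapes special
# characters.  With a hash reference the pairs come out sorted.
my $encoded = $http->www_form_urlencode(\%fields);

# GET: the pairs ride in the URL after a "?", which is why they show
# up in the address bar.  (Hypothetical URL.)
my $get_url = "http://www.somedomain.com/login?$encoded";

# POST: the same string travels in the request body instead, with a
# Content-Type of application/x-www-form-urlencoded.
my $post_body = $encoded;

print "GET URL:   $get_url\n";
print "POST body: $post_body\n";
```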

When you want your program to be able to fake these post and get
requests (although I suspect wget might well be able to do it for you)
you probably need to do the HTTP stuff inside Perl.  This can be done
with a library or module for the purpose; see the following reference
for more information:

  http://www.devdaily.com/perl/edu/qanda/plqa00020.shtml
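
As a rough idea of what "doing the HTTP stuff inside Perl" can look
like, here is a sketch that fetches a page and writes it to a text
file.  It assumes the core HTTP::Tiny module rather than whichever
module the reference above describes, and the sub name and the default
file name page.txt are illustrative only.

```perl
use strict;
use warnings;
use HTTP::Tiny;

# Fetch $url with the given HTTP object and save the body to $path.
# Returns 1 on success, 0 if the server or connection reported failure.
sub fetch_to_file {
    my ($http, $url, $path) = @_;
    my $response = $http->get($url);
    unless ($response->{success}) {
        warn "GET $url failed: $response->{status} $response->{reason}\n";
        return 0;
    }
    open my $out, '>', $path or die "Cannot write $path: $!";
    binmode $out;                  # write the bytes exactly as received
    print {$out} $response->{content};
    close $out or die "Cannot close $path: $!";
    return 1;
}

# Only touch the network when a URL is given on the command line.
if (@ARGV) {
    my ($url, $path) = @ARGV;
    $path = 'page.txt' unless defined $path;
    exit(fetch_to_file(HTTP::Tiny->new, $url, $path) ? 0 : 1);
}
```

For a POST, HTTP::Tiny also has a post_form method that takes the URL
and a reference to the name-value pairs.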

The really hard part is in knowing what is actually posted by your
browser so that you can have your program post the same sort of
information.  Finding this out requires a certain amount of HTML
knowledge.  I can give you a quick overview of how you find out here
but you'll probably need to do more reading if this is what you're
after (see http://www.w3schools.com/ - great site generally -  for
some good introductory HTML and other programming information if you
need it).

Basically, the method is to go to the site you want your program to
interact with using your browser and then use the 'view source'
command a lot and pore over the source you see.  Usually you should be
able to find the login form in there - starting with a <FORM> tag. 
This tag includes an "action" attribute which will tell you the URL
your program needs to interact with, and it will also specify a
"method" (like GET or POST).  If you don't see method, the method is
GET (the default).  If you don't see action (rarer), then the target
is the URL you're looking at the source of - the same URL as the form
itself.
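
If you'd rather let a script do the hunting, here is a small sketch
that pulls the method and action out of the first form tag in a page.
It uses plain regular expressions, which is an assumption that the
markup is reasonably tidy; a real parser such as the HTML::Form module
from CPAN is more robust.  The sub name is made up, and a relative
action is returned as-is rather than resolved against the page URL.

```perl
use strict;
use warnings;

# Return (method, action) for the first <form> tag found in $html.
# $page_url is the address the HTML was fetched from.
sub form_target {
    my ($html, $page_url) = @_;
    my ($tag) = $html =~ /(<form\b[^>]*>)/i or return;

    # A missing method attribute means GET.
    my $method = $tag =~ /\bmethod\s*=\s*["']?(\w+)/i ? uc($1) : 'GET';

    # A missing or empty action means the form submits to its own URL.
    my $action = $page_url;
    if ($tag =~ /\baction\s*=\s*(?:"([^"]*)"|'([^']*)'|([^\s>]+))/i) {
        $action = defined $1 ? $1 : defined $2 ? $2 : $3;
        $action = $page_url if $action eq '';
    }
    return ($method, $action);
}

# Illustrative markup only, not copied from any real site.
my $sample = '<html><FORM METHOD="post" ACTION="/login.cgi"></FORM></html>';
my ($method, $action) = form_target($sample, 'http://www.somedomain.com/');
print "$method $action\n";
```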

Right, great so far.  Now look over the <INPUT> tags.  In there
somewhere will be the ones that take the username and password, and
you will be able to get their names from here.  Remember if there are
any funny hidden and other options, you will probably have to have
your program send these too to make sure the server side code believes
you're a browser.
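
Here is a sketch of collecting the input names (and any preset values,
which is where the hidden fields matter) from a page.  As before it
leans on regular expressions and assumes fairly tidy markup; the sub
names and the sample form are invented for illustration.

```perl
use strict;
use warnings;

# Parse the attributes of a single tag into a hash with lowercase keys.
# Handles double-quoted, single-quoted and unquoted values.
sub tag_attributes {
    my ($tag) = @_;
    my %attr;
    while ($tag =~ /([A-Za-z][-\w]*)\s*=\s*(?:"([^"]*)"|'([^']*)'|([^\s>]+))/g) {
        $attr{lc $1} = defined $2 ? $2 : defined $3 ? $3 : $4;
    }
    return %attr;
}

# Collect name => value for every <input> tag that has a name.
# Inputs without a value attribute get an empty string.
sub input_fields {
    my ($html) = @_;
    my %fields;
    while ($html =~ /(<input\b[^>]*>)/gi) {
        my %attr = tag_attributes($1);
        next unless defined $attr{name};
        $fields{ $attr{name} } = defined $attr{value} ? $attr{value} : '';
    }
    return %fields;
}

# Hypothetical login form with one hidden field.
my $sample_form = <<'HTML';
<form method="post" action="/login.cgi">
<input type="hidden" name="token" value="xyz">
<input type="text" name="username">
<input type="password" name="password">
<input type="submit" value="Log in">
</form>
HTML

my %found = input_fields($sample_form);
print "$_ = '$found{$_}'\n" for sort keys %found;
```

The hidden fields come back with their preset values, so the script
only has to fill in the username and password before sending the lot.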

There are just a couple of other complications: first, the server side
code will probably send you cookies and your program will often need
to handle these in order to log in and use features of a site that
require it.  The library you're using to access HTTP information will
probably provide a cookie mechanism of some kind - you just have to
accept these cookies and send them back to the server with any future
requests you send and the server will be happy.
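
To show what "accept these cookies and send them back" boils down to,
here is a deliberately naive sketch.  It assumes responses shaped like
the ones HTTP::Tiny returns (a hash with a headers entry whose keys are
lowercase), keeps only the name=value part of each cookie, and ignores
domain, path and expiry, so treat it as an illustration rather than a
real cookie jar.  In practice a module such as HTTP::Cookies (used with
LWP::UserAgent) does this properly.

```perl
use strict;
use warnings;

# Store the name=value part of each Set-Cookie header from a response.
sub remember_cookies {
    my ($jar, $response) = @_;
    my $set = $response->{headers}{'set-cookie'};
    return unless defined $set;
    # A repeated header arrives as an array reference, a single one
    # as a plain string.
    for my $line (ref($set) eq 'ARRAY' ? @$set : ($set)) {
        $jar->{$1} = $2 if $line =~ /^\s*([^=;\s]+)=([^;]*)/;
    }
    return;
}

# Build the value of the Cookie request header from the stored pairs.
sub cookie_header {
    my ($jar) = @_;
    return join '; ', map { "$_=$jar->{$_}" } sort keys %$jar;
}

# A made-up login response carrying two cookies.
my %jar;
my $login_response = {
    headers => { 'set-cookie' => [ 'session=abc123; path=/', 'lang=en; path=/' ] },
};
remember_cookies(\%jar, $login_response);

# Later requests would pass this value along, for example with
# $http->get($url, { headers => { Cookie => cookie_header(\%jar) } })
print "Cookie: ", cookie_header(\%jar), "\n";
```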

Second, the server might just not like the fact that you're not a real
browser.  To fake it, you need to find out the user agent string that
you want to impersonate and send it along in the HTTP headers.  Wget
allows you to set the user agent string if you want to (see the wget
documentation for details), and most HTTP libraries will also allow
you to set arbitrary headers.  The user agent strings of well-known
browsers, which a server is likely to accept if it cares at all, are
widely available online.  Most services will be happy to deal with
you anyway, and most of those that distinguish between browsers are
happy if you send a string that contains "Mozilla/4.0 Compatible" or
contains "MSIE" or both.  If in doubt, search http://www.google.com/
for sites that list valid user agent strings.  There are plenty of
such sites out there and you can probably make your program perfectly
impersonate any browser you like with these.
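
Setting the user agent from Perl is usually a one-liner.  This sketch
again assumes the core HTTP::Tiny module; the string is an Internet
Explorer 6 style value picked only as an example, and no request is
actually sent.

```perl
use strict;
use warnings;
use HTTP::Tiny;

# Example browser string; swap in whichever one you want to imitate.
my $browser_agent = 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)';

# Every request made through this object sends the string above in
# its User-Agent header.
my $agent_http = HTTP::Tiny->new(agent => $browser_agent);

print "Requests will identify as: ", $agent_http->agent, "\n";
```

With wget the equivalent is the --user-agent option.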

Most web services aren't that picky about the browser, though, as long
as you get the arguments, method and usually cookie handling right.

Hope this helps!

