Hello!
I'll show you another way to do the research for your program. Instead
of sniffing what your browser does, I'll explain how to read HTML
forms so you know what to send at each step of the authentication
process and how to get the info you're looking for.
First of all, this method is no less powerful than the one with the
proxy. It takes more time, but it's better, because you have to
understand what your browser does at each step.
------------------------
Web developers use HTML forms to send the information that the user
enters to the server.
Forms have input elements, for example a text box, a check box, a
button, etc.
Each element has a name and a value (entered by the user or fixed in
the HTML), and when the user clicks on the 'Submit' button, the
browser sends those name/value pairs to the server.
An example of a simple HTML form:
<form action='login.php' method='post'>
Username <input type='text' name='username' /><br />
Password <input type='password' name='password' /><br />
<input type='submit' value='Send' />
</form>
When this form is rendered by a browser it will show 2 text boxes and
a button.
Then, when the user clicks the submit button (labelled 'Send' here),
the browser will send the information to the program 'login.php' on
the server side.
There are 2 ways to send the data to the server, specified by the
'method' attribute on the form tag: via POST or via GET.
GET: When using GET the browser passes the parameters in the URL,
after the name of the file. For example, if the previous form used
'GET' to send the information, then after clicking on the button the
browser would show in the Location bar:
http://www.mysite.com/login.php?username=entered_username&password=entered_password
If you want to make a robot log in to that server, you only have
to tell it to ask for that URL!!
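As a minimal sketch of how such a URL is put together, the URI module
(which LWP itself depends on) can build the query string from
name/value pairs. The host and the values are just the placeholders
from the example above, not a real site.

```perl
use strict;
use warnings;
use URI;

# Build the URL a browser would request if the example form used GET.
# Host and field values are the placeholders from the example above.
my $url = URI->new('http://www.mysite.com/login.php');
$url->query_form(
    username => 'entered_username',
    password => 'entered_password',
);

print "$url\n";
```

query_form also takes care of escaping characters that are not allowed
in a URL, which is easy to get wrong when building the string by hand.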
POST: This is more 'secure' because the sent data is not shown in the
Location bar (it travels in the body of the request instead). You will
probably need to log in to your account using one of these forms.
If you want your robot to log in to this site (using the above form),
check this example:
use LWP::UserAgent;
use HTTP::Request::Common;
$ua = LWP::UserAgent->new;
# Send the two form fields as a POST request to the 'action' URL
$ua->request(POST 'http://www.mysite.com/login.php', ["username" =>
"my_username", "password" => "password"]);
Please note that every element in a form that has a name will be
passed as a parameter to the 'action' script.
[ http://www.w3.org/TR/REC-html40/interact/forms.html ]
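To see what a POST actually carries, here is a small sketch that builds
the same request with HTTP::Request::Common but only prints it instead
of sending it. The URL and values are the placeholders from the example
above.

```perl
use strict;
use warnings;
use HTTP::Request::Common qw(POST);

# POST() only builds the request object; nothing goes over the network
# until the object is handed to $ua->request.
my $req = POST 'http://www.mysite.com/login.php',
    [ username => 'my_username', password => 'password' ];

print $req->method, ' ', $req->uri, "\n";
print $req->header('Content-Type'), "\n";
print $req->content, "\n";   # the form fields, encoded like a query string
```

The fields are encoded the same way as in the GET case; they simply
move from the URL into the body of the request.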
----
To get the information you're looking for, the steps are:
1) Log in to the system
2) Go to the page that has the information
3) Parse it using regex
4) Mail it, print it, etc.
(To show you how to understand the forms, I'll develop a little
program to log into Google Answers and get the status of the account:
https://answers.google.com/answers/main?cmd=myinvoices )
1) Login
To start, go to the main page of the site you're interested in and
follow the links until you get to the 'login page'.
In my case, this will be
https://answers.google.com/answers/main?cmd=login
Then, right-click on the page and choose 'View Source'. Find where the
form starts (<form ...) and check whether it's using GET or POST and
where the information is submitted after clicking on the submit
button (the 'action' attribute).
In my case, the form starts like this:
<form method="post" action="main?cmd=login">
Note that 'action' doesn't have the full URI of the file, so it is
relative to the address of the page that contains the form: you have
to prepend the current directory (here https://answers.google.com/answers/).
After clicking on the 'Login' button, the information will be sent to
https://answers.google.com/answers/main?cmd=login
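If you'd rather not do that by hand, this sketch lets the URI module
resolve the relative 'action' against the address of the login page,
using the two values from my case.

```perl
use strict;
use warnings;
use URI;

# Address of the page the form was found on, and the form's 'action' value.
my $page   = 'https://answers.google.com/answers/main?cmd=login';
my $action = 'main?cmd=login';

# new_abs resolves a relative reference against a base URL.
my $target = URI->new_abs($action, $page);
print "$target\n";
```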
Now, check which input elements the form has: look for <input ...>,
<textarea ...> and <select ...> tags. Get their names and figure out
the value to send to the server for each one.
In my case, the input elements are:
<input type="text" name="email" size="20">
<input type="password" name="password" size="20">
<input type="submit" name="submit" value="Login">
So, I have to send 3 variables to the server:
'email' with my email address
'password' with my password
'submit' with the value 'Login'
Please note that 'submit' is a button, so its default value ('Login')
can't be changed by the user. But if this variable is not sent, you
won't be logged in.
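Reading the source by eye works, but the same inventory can be sketched
in code. This assumes the HTML::Form module is installed (it is a
separate CPAN distribution in recent LWP versions); the HTML is just
the form from my case pasted into a string.

```perl
use strict;
use warnings;
use HTML::Form;

# The login form from the example, pasted in as a string.
my $html = <<'HTML';
<form method="post" action="main?cmd=login">
<input type="text" name="email" size="20">
<input type="password" name="password" size="20">
<input type="submit" name="submit" value="Login">
</form>
HTML

# The second argument is the page address, used to resolve 'action'.
my ($form) = HTML::Form->parse($html,
    'https://answers.google.com/answers/main?cmd=login');

print $form->method, ' ', $form->action, "\n";
for my $input ($form->inputs) {
    printf "%-9s %-9s %s\n", $input->type, $input->name // '',
        $input->value // '';
}

# Fill in the fields and build (not send) the request a click would make.
$form->value(email    => 'my_email@address.com');
$form->value(password => 'my_google_answers_password');
my $req = $form->click('submit');
print $req->method, ' ', $req->uri, "\n", $req->content, "\n";
```

The request returned by click can be passed to $ua->request as is, so
the button's own name and value are included without typing them.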
Try this little program:
------------------------------------
use LWP;
use HTTP::Request::Common;
use HTTP::Cookies;
$email='my_email@address.com';
$pass='my_google_answers_password';
$ua = LWP::UserAgent->new;
$ua->cookie_jar(HTTP::Cookies->new);
$req=$ua->request(POST
'https://answers.google.com/answers/main?cmd=login', ['email'=>$email,
'password'=>$pass, 'submit'=>'Login']);
if ($req->content=~ /Invalid login/){
print "invalid login!\n";
}else{
print "welcome to google answers :)\n";
}
-----------------------------------
In the 8th line, I tell LWP to request
'https://answers.google.com/answers/main?cmd=login' and pass the
parameters 'email'=$email, 'password'=$pass and 'submit'='Login'.
Set $email and $pass with your info and try it!
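The check in the little program only looks for the error text, so a
failed request (server down, wrong URL) would also print the welcome
message. Here is a sketch of a stricter check; login_ok is a helper
name I made up, and the hand-made responses only stand in for what
$ua->request would return.

```perl
use strict;
use warnings;
use HTTP::Response;

# Hypothetical helper: decide whether a login response looks successful.
# 'Invalid login' is the error text used in the program above.
sub login_ok {
    my ($res) = @_;
    return 0 unless $res->is_success;              # HTTP status is not 2xx
    return 0 if $res->content =~ /Invalid login/;  # site rejected the login
    return 1;
}

# Hand-made responses instead of real network traffic.
my $good = HTTP::Response->new(200, 'OK', [], '<html>My Account</html>');
my $bad  = HTTP::Response->new(200, 'OK', [], '<html>Invalid login</html>');
my $down = HTTP::Response->new(500, 'Internal Server Error', [], '');

print login_ok($_) ? "logged in\n" : "not logged in\n" for $good, $bad, $down;
```

Keep in mind that some sites answer a successful login with a redirect,
which is_success does not count as success, so this check may need
adjusting for the site you're working with.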
2) Getting the info
Now that you're in the system, you have to go to the page that has the
info you're looking for. Click on the link that takes you there and
write down the address shown in your browser's Location bar when you
get there.
For example, if I want to get the status of my account, I'll have to
go to https://answers.google.com/answers/main?cmd=myinvoices
So, after logging in to the system, I'll go to that address:
$req=$ua->request(GET
'https://answers.google.com/answers/main?cmd=myinvoices');
and inside $req->content I'll have the contents of the page. Then, I
have to parse it:
$req->content=~/<td> Current Earnings \(what you will be paid\) for Answering Questions: <\/td> <td width="1%"> \$([0-9]+(?:\.[0-9]+)?)/;
$ear=$1;
$req->content=~/<td> Current Balance \(what you will be charged\) for Asked Questions: <\/td> <td width="1%"> \$([0-9]+(?:\.[0-9]+)?)/;
$char=$1;
print "Will be paid: $ear \nWill be charged: $char\n";
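You can try the parsing step without logging in at all. This sketch
runs a shortened version of the pattern against a made-up fragment
shaped like the invoices page; the real page may look different.

```perl
use strict;
use warnings;

# Made-up fragment in the shape the patterns above expect.
my $html = '<td> Current Earnings (what you will be paid) for Answering '
         . 'Questions: </td> <td width="1%"> $12.50';

# \$ matches a literal dollar sign and \. a literal decimal point;
# a bare . would match any character.
my ($earnings) = $html =~
    /Current Earnings .*?<td width="1%"> \$([0-9]+(?:\.[0-9]+)?)/;

if (defined $earnings) {
    print "Will be paid: $earnings\n";
} else {
    print "Earnings not found; the page layout may have changed\n";
}
```

Checking whether the match succeeded before using the captured value
tells you right away when the site changes its HTML.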
--------------------------
The finished script will be:
use strict;
use warnings;
use LWP::UserAgent;
use HTTP::Request::Common;
use HTTP::Cookies;

my $email = 'my_email@address.com';
my $pass  = 'my_google_answers_password';

my $ua = LWP::UserAgent->new;
$ua->cookie_jar(HTTP::Cookies->new);   # keep the session cookie after login

# 1) Log in
my $req = $ua->request(POST
    'https://answers.google.com/answers/main?cmd=login',
    ['email' => $email, 'password' => $pass, 'submit' => 'Login']);

if ($req->content =~ /Invalid login/) {
    print "invalid login!\n";
} else {
    print "welcome to google answers :)\n";

    # 2) Go to the page that has the information
    $req = $ua->request(GET
        'https://answers.google.com/answers/main?cmd=myinvoices');

    # 3) Parse it (\. matches a literal decimal point)
    $req->content =~ /<td> Current Earnings \(what you will be paid\) for Answering Questions: <\/td> <td width="1%"> \$([0-9]+(?:\.[0-9]+)?)/;
    my $ear = $1;
    $req->content =~ /<td> Current Balance \(what you will be charged\) for Asked Questions: <\/td> <td width="1%"> \$([0-9]+(?:\.[0-9]+)?)/;
    my $char = $1;

    # 4) Print it
    print "Will be paid: $ear \nWill be charged: $char\n";
}
-------------------------
It probably won't be this straightforward with a bank (you know, their
HTML will be very messy: they don't understand the beauty of simple
things, as Google does ;) but it won't be very hard if you have
patience :)
Good luck with your program, and feel free to ask for all the
clarifications you need!
Additional links:
LWP
[ http://www.linpro.no/lwp/ ]
HTML Forms
[ http://www.w3.org/TR/REC-html40/interact/forms.html ]
Search Strategy:
Personal experience
Clarification of Answer by runix-ga on 23 Jun 2002 15:43 PDT
(I posted this clarification as a comment, please ignore it)
Gerbil,
When a site works dynamically (i.e. CGI, PHP, JSP, ASP, etc.) it still
sends pure HTML to the browser. The 'dynamic' part stays on the server
side (DB access, etc.), and there's no way to get at the server-side
information!
Dynamically generated pages are plain HTML pages.
Think about this: your browser only knows about the HTML the site
sent. It knows what to do when you press the 'submit' button from the
form definition.
I can give you examples of how to handle redirections, if you ask
me to.
Other technologies that the site may use are cookies, which are
handled automatically by HTTP::Cookies.
If you tell me which sophisticated interaction you have to do with
the site, I will be happy to help you!
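As a starting point, here is a sketch of a user agent set up for both
cookies and redirections. It only configures the object and sends
nothing, so it is safe to run as is.

```perl
use strict;
use warnings;
use LWP::UserAgent;
use HTTP::Cookies;

my $ua = LWP::UserAgent->new;

# Keep cookies between requests so the session survives after login.
$ua->cookie_jar(HTTP::Cookies->new);

# By default LWP follows redirects only for GET and HEAD requests.
# Adding POST makes it also follow a redirect sent in reply to a form.
push @{ $ua->requests_redirectable }, 'POST';

print join(', ', @{ $ua->requests_redirectable }), "\n";
```

Whether a given site needs the POST redirect is something you'd have
to check case by case; I'm assuming a login form that answers with a
redirect, which is common but not universal.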
Request for Answer Clarification by gerbil-ga on 23 Jun 2002 19:50 PDT
I understand that there is no way for me to find out what the server
does internally. The next best thing, as far as I'm concerned, is to
be able to fully listen in on the communication between server and
browser; this gives *me* all the information that *I* need to
replicate the interaction in a Perl/LWP script. That was the
objective of my original query, and I don't think it has been met.
When I try the approach you proposed and programmatically request page
X, the contents (of the HTTP::Response object) are often completely
different from the source that I get if I request page X via the
browser. In other words, the browser and the server have a
communication that is very different from what I can achieve with LWP
and the limited information that I have at my disposal by using the
approach you propose.
I have no doubt that someone like you could achieve my ultimate goals
without needing all the information that I need, but I am not you.
And I am also sure that I could achieve my ultimate goal if my query
had been answered in the way I originally posed it. Even if I could
retain you as a consultant for every single page that I may want to
add to the list of sites that my bot would have to visit (I'm sure
each one would have idiosyncrasies that would need to be dealt with
specifically), I would have to reveal to you private information
(usernames, passwords, etc.), and that's just not possible. The
approach I originally asked about does not have any of these
drawbacks: it is completely general, it allows me to listen in on the
communication between the browser and the server, so that I can
*trivially* replicate it in a Perl script. Your approach, on the
other hand, requires a completely ad hoc analysis of each specific
site, which, from my vantage point is far from trivial. In other
words, I want my money back.