Hi,
Both Barnes & Noble and Amazon have this information on their pages.
Barnes & Noble looks like the easiest to rip from.
Both sites check for a browser and try to set a cookie, so set up
something like this to tell the web server that you are a browser and
that you accept cookies.
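use LWP::UserAgent;
use HTTP::Cookies;
use HTML::LinkExtractor;

# Identify as a browser and accept cookies.
my $ua = LWP::UserAgent->new;
$ua->agent('Mozilla/5.0');
# cookie_jar needs a real jar object; an undefined value turns cookies off.
$ua->cookie_jar(HTTP::Cookies->new);
my $timeout = 20;    # seconds
$ua->timeout($timeout);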
Start at the genre page:
http://video.barnesandnoble.com/browse_genre.asp
Using HTML::LinkExtractor you can grab the categories by matching the
link strings against a regex for cat=. Grab each category title as
well so you can use it in your database.
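A minimal sketch of that filtering step, assuming the link records look
like the hashes HTML::LinkExtractor hands back (an href key plus a _TEXT
key holding the link markup). The function name category_links and the
sample records are made up for illustration; only the cat number comes
from the URLs in this answer.

```perl
use strict;
use warnings;

# Keep only links whose href carries a cat= parameter, and return the
# category number together with the visible link text.
# ASSUMPTION: each record is a hash with href and _TEXT keys.
sub category_links {
    my ($links) = @_;
    my @cats;
    for my $link (@$links) {
        my $href = $link->{href};
        next unless defined $href && $href =~ /\bcat=(\d+)/;
        my $cat  = $1;
        my $name = defined $link->{_TEXT} ? $link->{_TEXT} : '';
        $name =~ s/<[^>]*>//g;      # drop the markup around the title
        $name =~ s/^\s+|\s+$//g;    # trim leading/trailing whitespace
        push @cats, { cat => $cat, name => $name };
    }
    return @cats;
}

# Hypothetical records standing in for @{ $LX->links }.
my @sample = (
    { href  => 'results.asp?cat=1005126&parent_id=1005125',
      _TEXT => '<a href="results.asp?cat=1005126">Example Genre</a>' },
    { href  => 'help.asp', _TEXT => '<a href="help.asp">Help</a>' },
);
print "$_->{cat},$_->{name}\n" for category_links(\@sample);
```

The \b in front of cat= keeps the match from firing on parameter names
that merely end in "cat".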
On the listing page the titles are again pretty easy to grab, and the
site uses a "Next" link for the next page, so use the LinkExtractor
again to grab that link for your next run. If it doesn't find one,
you know you are at the end of this category and can start on the
next one.
http://video.barnesandnoble.com/search/results.asp?userid=52YVL9FXQV&cat=1005126&parent_id=1005125
I've already checked, and neither the "userid" nor the "parent_id" is
necessary:
http://video.barnesandnoble.com/search/results.asp?cat=1005126
goes to the same page. So you can make things easier by taking all the
category numbers off the front page, putting them in an array, and
walking through it with a for loop.
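Roughly, that loop could look like the following. The helper name
category_url is invented here, and the array holds only the one
category number that appears in this answer; the rest would come from
the genre page.

```perl
use strict;
use warnings;

# Build the short listing URL (no userid, no parent_id) for a category.
sub category_url {
    my ($cat) = @_;
    return "http://video.barnesandnoble.com/search/results.asp?cat=$cat";
}

# Only 1005126 is taken from the answer; add the other numbers you
# collect from the genre page.
my @categories = (1005126);

for my $cat (@categories) {
    my $url = category_url($cat);
    print "$url\n";    # fetch and parse this page in the real script
}
```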
Grab the titles by looking for links with product.asp in them; they
are the only links on this page that point to the movies. You can grab
other information here, but it is better to wait until you get to the
product page. Keep the title, though; you are going to need it.
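Putting the listing-page steps together, here is one way the paging
loop might be sketched. Everything named here is hypothetical:
crawl_category, the $fetch_links callback (which would wrap the
LWP::UserAgent request and the LinkExtractor parse, and turn relative
links into absolute ones), and the fake two-page site used to exercise
it. The link records are again assumed to carry href and _TEXT keys.

```perl
use strict;
use warnings;

# Walk a category page by page: collect product.asp links, then follow
# the "Next" link until there is none.
sub crawl_category {
    my ($url, $fetch_links, $max_pages) = @_;
    $max_pages = 100 unless defined $max_pages;   # guard against endless loops
    my @products;
    while (defined $url && $max_pages-- > 0) {
        my $links = $fetch_links->($url);
        my $next;
        for my $link (@$links) {
            my $href = $link->{href};
            next unless defined $href;
            my $text = defined $link->{_TEXT} ? $link->{_TEXT} : '';
            if ($href =~ /product\.asp/i) {
                push @products, $href;            # a movie link
            }
            elsif ($text =~ />\s*Next\b/i) {
                $next = $href;                    # link to the following page
            }
        }
        $url = $next;    # undef when no Next link was found: category done
    }
    return @products;
}

# Fake site for illustration only; the real callback would hit the network.
my %site = (
    page1 => [
        { href => 'product.asp?EAN=1', _TEXT => '<a href="product.asp?EAN=1">Movie A</a>' },
        { href => 'page2',             _TEXT => '<a href="page2">Next</a>' },
    ],
    page2 => [
        { href => 'product.asp?EAN=2', _TEXT => '<a href="product.asp?EAN=2">Movie B</a>' },
    ],
);
my @found = crawl_category('page1', sub { $site{ $_[0] } || [] });
print "$_\n" for @found;
```

Passing the fetcher in as a callback keeps the loop itself free of
network code, so it can be tried out on canned data first.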
http://video.barnesandnoble.com/search/product.asp?userid=52YVL9FXQV&EAN=85392246724
Here we want to split the page using the title.
# \Q...\E keeps punctuation in the title from being read as regex syntax
my ($top, $bottom) = split /\Q$title\E/, $html;
Then split again on the <p> tag; the first variable you get back now
holds the stars of the show, separated by commas, which is very
convenient. Use a regex to turn the commas into pipes so that they
won't mess up your comma-delimited output and will go into a database
easily. Then clean up all the HTML using this function I found a while
back; it is very fast and clean.
# This function created by Powerman on Perlmonks #
sub untag {
    local $_ = $_[0] || $_;
    # ALGORITHM:
    #   find < ,
    #     comment <!-- ... -->,
    #     or comment <? ... ?> ,
    #     or one of the start tags which require correspond
    #       end tag plus all to end tag
    #     or if \s or ="
    #       then skip to next "
    #       else [^>]
    #   >
    s{
      <                     # open tag
      (?:                   # open group (A)
        (!--) |             # comment (1) or
        (\?) |              # another comment (2) or
        (?i:                # open group (B) for /i
          ( TITLE |         # one of start tags
            SCRIPT |        # for which
            APPLET |        # must be skipped
            OBJECT |        # all content
            STYLE           # to correspond
          )                 # end tag (3)
        ) |                 # close group (B), or
        ([!/A-Za-z])        # one of these chars, remember in (4)
      )                     # close group (A)
      (?(4)                 # if previous case is (4)
        (?:                 # open group (C)
          (?!               # and next is not : (D)
            [\s=]           # \s or "="
            ["`']           # with open quotes
          )                 # close (D)
          [^>] |            # and not close tag or
          [\s=]             # \s or "=" with
          `[^`]*` |         # something in quotes ` or
          [\s=]             # \s or "=" with
          '[^']*' |         # something in quotes ' or
          [\s=]             # \s or "=" with
          "[^"]*"           # something in quotes "
        )*                  # repeat (C) 0 or more times
      |                     # else (if previous case is not (4))
        .*?                 # minimum of any chars
      )                     # end if previous char is (4)
      (?(1)                 # if comment (1)
        (?<=--)             # wait for "--"
      )                     # end if comment (1)
      (?(2)                 # if another comment (2)
        (?<=\?)             # wait for "?"
      )                     # end if another comment (2)
      (?(3)                 # if one of tags-containers (3)
        </                  # wait for end
        (?i:\3)             # of this tag
        (?:\s[^>]*)?        # skip junk to ">"
      )                     # end if (3)
      >                     # tag closed
    }{}gsx;                 # STRIP THIS TAG
    return $_ ? $_ : "";
}
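As a rough sketch of the split and comma-to-pipe steps described before
the function, the following runs on a made-up page fragment (the real
product page markup may well differ). The name stars_field and the
sample title and actors are placeholders, and a one-line tag stripper
stands in for untag only so the example is self-contained.

```perl
use strict;
use warnings;

# Pull the comma-separated stars that follow the title and return them
# pipe-delimited with the markup removed.
sub stars_field {
    my ($html, $title) = @_;
    # Everything after the first occurrence of the title.
    my (undef, $bottom) = split /\Q$title\E/, $html, 2;
    return '' unless defined $bottom;
    # Everything before the first <p> tag.
    my ($stars) = split /<p>/i, $bottom, 2;
    $stars = '' unless defined $stars;
    $stars =~ s/<[^>]*>//g;       # crude stand-in for untag()
    $stars =~ s/\s*,\s*/|/g;      # commas become pipes
    $stars =~ s/^\s+|\s+$//g;     # trim
    return $stars;
}

# Hypothetical fragment, not copied from the real site.
my $html = '<b>Some Movie</b><br><a href="a">Actor One</a>, '
         . '<a href="b">Actor Two</a>, Actor Three<p>Retail Price: $26.98';
print stars_field($html, 'Some Movie'), "\n";
```

In the real script you would call untag on the piece instead of the
one-line substitution.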
If you append #castcrew to the end of the link, the page will list out
the whole cast and crew for you as well, which is really easy to
snatch from there. But you said you were only interested in the
stars.
The rest of the product details on the page look like this:
Retail Price: $26.98
Our Price: $21.98
Saving: $5.00 (18.5%)
Readers' Advantage Price: $20.88 Join Now
In Stock:Ships within 24 hours
Same Day Delivery in Manhattan.
Format: Wide Screen
Region Code: 1
Rating:
Original release date: 2001
Video/DVD Release Date: 5/28/2002
UPC: 85392246724
WARNER HOME VIDEO
They were kind enough to use the same keywords followed by colons
there, so that part is easy: use split, or if you want to stretch your
regex muscles you can do that too.
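For instance, a small parser along these lines would turn the
"keyword: value" lines into a hash. The function name parse_details is
invented for this sketch, and it assumes the details have already been
reduced to plain text, one field per line, as in the listing above.

```perl
use strict;
use warnings;

# Turn "Label: value" lines into a hash; lines without a colon
# (such as the studio name) are skipped.
sub parse_details {
    my ($text) = @_;
    my %details;
    for my $line (split /\n/, $text) {
        next unless $line =~ /^\s*([^:]+?)\s*:\s*(.*?)\s*$/;
        $details{$1} = $2;
    }
    return %details;
}

# Sample lines taken from the product listing quoted in this answer.
my $text = <<'END';
Retail Price: $26.98
Our Price: $21.98
In Stock:Ships within 24 hours
Format: Wide Screen
Region Code: 1
Rating:
Video/DVD Release Date: 5/28/2002
UPC: 85392246724
WARNER HOME VIDEO
END

my %details = parse_details($text);
print "$_ => $details{$_}\n" for sort keys %details;
```

The pattern stops at the first colon on each line, so a value that
itself contains a colon stays in one piece.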
Thanks for the question,
webadept-ga
Clarification of Answer by webadept-ga on 06 Jan 2003 17:07 PST
The snippets I put in the answer are in Perl, yes. It's not full
code of course, just enough to outline what I would do if I were
writing the script. If this is going to be your first Perl code, it
will help you out a great deal. You will want to install those two
modules at the very least. Since you are not familiar with the modules,
LinkExtractor in particular, I'll post an example of how to use it
here as well.
# Fetch the page (assumes $ua and $url are already set up).
my $r    = HTTP::Request->new(GET => $url);
my $res  = $ua->request($r);
my $html = $res->decoded_content;    # public accessor for the page body
my $LX   = HTML::LinkExtractor->new();
$LX->parse(\$html);
for my $Link ( @{ $LX->links } )
{
    # Only links whose text contains "Next"
    if ( $$Link{_TEXT} =~ m/Next/i )
    {
        # Do something with the Next link at the bottom of the page
    }
}
undef $LX;
I don't code in Python at all, never even looked at it, so I couldn't
say how much of a learning curve there is between the two.
These sites will help a great deal:
http://www.cpan.org
http://search.cpan.org/
thanks,
webadept-ga