Google Answers: American Movie List

Q: American Movie List (Answered, 5 out of 5 stars, 0 Comments)

Question

Subject: American Movie List
Category: Arts and Entertainment > Movies and Film
Asked by: flyboyutah-ga
List Price: $20.00

Posted: 06 Jan 2003 13:15 PST
Expires: 05 Feb 2003 13:15 PST
Question ID: 138398

I'm looking for a delimited text list of American movies released on
either DVD or VHS over the past 30 years.
At minimum, this list would contain each movie's title, stars' names,
year published, and UPC code.

Request for Question Clarification by webadept-ga on 06 Jan 2003 14:04 PST

Hi, 

This can be done by writing a Perl script to gather all the
information you requested into a file. I've done a similar task in the
past, gathering a list of all DVD/VHS anime titles, so that isn't a
problem. But your bid suggests you are simply looking for a place
to purchase a list from a company. Since these lists are rather
expensive, I thought I would check.

If you are looking to purchase the list, please let us know so we can
answer your question; otherwise, you may want to consider adjusting
your bid. If you decide to adjust it, you may want to read this page:
http://answers.google.com/answers/pricing.html

thanks, 

webadept-ga

Clarification of Question by flyboyutah-ga on 06 Jan 2003 14:32 PST

Hello,
Well, to be honest, I haven't even found where they can be purchased.
But I would be satisfied if you were to tell me where I could use a
Perl script to get this info, as I program and could do that part
myself.

Answer

Subject: Re: American Movie List
Answered By: webadept-ga on 06 Jan 2003 15:10 PST
Rated: 5 out of 5 stars

Hi, 

Both Barnes and Noble and Amazon have this in their pages.

Barnes and Noble looks like the easiest to rip from.

Both check for a browser and try to set a cookie, so set up something
like this to tell the web server you are a browser and can accept
cookies.

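use LWP::UserAgent;
use HTML::LinkExtractor;

# Identify as a browser and accept cookies, since both sites check for these.
my $ua = LWP::UserAgent->new;
$ua->agent('Mozilla/5.0');
$ua->cookie_jar({});    # an empty hash ref creates an in-memory cookie jar
my $timeout = 20;       # seconds
$ua->timeout($timeout);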


Start at the genre listing page:

http://video.barnesandnoble.com/browse_genre.asp

Using HTML::LinkExtractor, you can grab the category links by matching
a regex for cat= in the link string. Grab each category's title as well
so you can use it in your database.
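
As a rough sketch of that filtering step, the snippet below works on a hand-made list of links instead of live output from HTML::LinkExtractor. The `category_map` helper, the link titles, and the second category number are made up for illustration; only 1005126 appears in the example URLs further down.

```perl
use strict;
use warnings;

# Hypothetical sample of links as they might come back from the genre page.
# Each entry holds the href and the visible link text; real values will differ.
my @links = (
    { href => '/search/results.asp?userid=52YVL9FXQV&cat=1005126&parent_id=1005125',
      text => 'Action' },
    { href => '/help/index.asp',                 text => 'Help' },
    { href => '/search/results.asp?cat=1005130', text => 'Comedy' },
);

# Keep only links whose query string carries a cat= parameter and
# return a hash of category number => category title.
sub category_map {
    my @candidates = @_;
    my %cats;
    for my $link (@candidates) {
        next unless $link->{href} =~ /[?&]cat=(\d+)/;
        $cats{$1} = $link->{text};
    }
    return %cats;
}

my %cats = category_map(@links);
print "$_\t$cats{$_}\n" for sort keys %cats;
```

Keying the hash on the category number also drops duplicates if the same category is linked more than once on the page.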

On the listing page, the titles are again pretty easy to grab, and the
page uses a link labeled Next for the next page of results. Use the
LinkExtractor again to grab that link for your next request. If the
script doesn't find it, you'll know you are at the end of this category
and can start on the next one.
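
The follow-the-Next-link loop might look something like this. It is only a sketch: `%fake_site` stands in for real HTTP requests plus link extraction, and the `page=2` parameter, the second EAN, and the titles are invented so the loop can run offline.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the site: maps a URL to the links found on that page.
my %fake_site = (
    'results.asp?cat=1005126' => [
        { href => 'product.asp?EAN=85392246724',     text => 'Some Title' },
        { href => 'results.asp?cat=1005126&page=2',  text => 'Next' },
    ],
    'results.asp?cat=1005126&page=2' => [
        { href => 'product.asp?EAN=11111111111',     text => 'Another Title' },
    ],
);

# Walk one category: visit a page, look for a Next link, and stop when
# there is none (or when a URL repeats, to avoid looping forever).
sub crawl_category {
    my ($start, $site) = @_;
    my (@visited, %seen);
    my $url = $start;
    while (defined $url && !$seen{$url}++) {
        push @visited, $url;
        my $links  = $site->{$url} || [];
        my ($next) = grep { $_->{text} =~ /\bNext\b/i } @$links;
        $url = $next ? $next->{href} : undef;
    }
    return @visited;
}

print "$_\n" for crawl_category('results.asp?cat=1005126', \%fake_site);
```

The `%seen` check is a small safeguard in case a page's Next link ever points back at a page already visited.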

http://video.barnesandnoble.com/search/results.asp?userid=52YVL9FXQV&cat=1005126&parent_id=1005125

I've already checked, and neither the "userid" nor the parent id is
necessary:
http://video.barnesandnoble.com/search/results.asp?cat=1005126
goes to the same page. So you can make this easier by taking all the
category numbers off the front page, putting them in an array, and
running through them with a for loop.
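
A minimal sketch of that loop, assuming the category numbers have already been collected. The `listing_url` helper is a hypothetical name, and the second category number is made up for illustration.

```perl
use strict;
use warnings;

my $base = 'http://video.barnesandnoble.com/search/results.asp';

# Category numbers gathered from the front page. Only the first comes from
# the example URL above; the second is a placeholder.
my @categories = (1005126, 1005130);

# Build the listing URL with only the cat parameter, since userid and
# parent_id turned out to be unnecessary.
sub listing_url {
    my ($base_url, $cat) = @_;
    return $base_url . '?cat=' . $cat;
}

for my $cat (@categories) {
    print listing_url($base, $cat), "\n";
}
```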

Grab the titles by looking for links with product.asp in them; they
are the only links on this page that point to the movies. You can grab
other information here, but it is better to wait until you get to the
product page. Keep the title; you are going to need it.
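
Picking out the product links could be sketched as follows. The `product_links` helper and the sample title are hypothetical; pulling the EAN out of the query string is an extra convenience based on the product URL shown just below, not something the listing page is confirmed to guarantee.

```perl
use strict;
use warnings;

# Hypothetical links from one listing page.
my @page_links = (
    { href => '/search/product.asp?userid=52YVL9FXQV&EAN=85392246724',
      text => 'Example Movie' },
    { href => '/search/results.asp?cat=1005126', text => 'Next' },
);

# Keep links that point at product.asp, remembering the title for later
# and the EAN from the query string when one is present.
sub product_links {
    my @candidates = @_;
    my @products;
    for my $link (@candidates) {
        next unless $link->{href} =~ /product\.asp/i;
        my ($ean) = $link->{href} =~ /[?&]EAN=(\d+)/i;
        push @products, {
            title => $link->{text},
            ean   => $ean,
            href  => $link->{href},
        };
    }
    return @products;
}

for my $p (product_links(@page_links)) {
    print join('|', $p->{title}, defined $p->{ean} ? $p->{ean} : ''), "\n";
}
```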

http://video.barnesandnoble.com/search/product.asp?userid=52YVL9FXQV&EAN=85392246724

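On the product page, we want to split the HTML using the title.

# \Q...\E keeps characters like ( ) or ? in the title from being read as regex syntax
my ($top, $bottom) = split /\Q$title\E/, $html;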

Then split $bottom again on the <p> tag; the first piece now holds the
stars of the show, separated by commas, which is very convenient. Use a
regex to turn the commas into pipes so that they won't interfere with
your comma-delimited output and will load into a database easily. Then
clean up all the HTML using this function I found a while back, which
is very fast and clean.

# This function created by Powerman on Perlmonks #
sub untag {
  local $_ = $_[0] || $_;
# ALGORITHM:
#   find < ,
#       comment <!-- ... -->,
#       or comment <? ... ?> ,
#       or one of the start tags which require correspond
#           end tag plus all to end tag
#       or if \s or ="
#           then skip to next "
#           else [^>]
#   >
  s{
    <               # open tag
    (?:             # open group (A)
      (!--) |       #   comment (1) or
      (\?) |        #   another comment (2) or
      (?i:          #   open group (B) for /i
        ( TITLE  |  #     one of start tags
          SCRIPT |  #     for which
          APPLET |  #     must be skipped
          OBJECT |  #     all content
          STYLE     #     to correspond
        )           #     end tag (3)
      ) |           #   close group (B), or
      ([!/A-Za-z])  #   one of these chars, remember in (4)
    )               # close group (A)
    (?(4)           # if previous case is (4)
      (?:           #   open group (C)
        (?!         #     and next is not : (D)
          [\s=]     #       \s or "="
          ["`']     #       with open quotes
        )           #     close (D)
        [^>] |      #     and not close tag or
        [\s=]       #     \s or "=" with
        `[^`]*` |   #     something in quotes ` or
        [\s=]       #     \s or "=" with
        '[^']*' |   #     something in quotes ' or
        [\s=]       #     \s or "=" with
        "[^"]*"     #     something in quotes "
      )*            #   repeat (C) 0 or more times
    |               # else (if previous case is not (4))
      .*?           #   minimum of any chars
    )               # end if previous char is (4)
    (?(1)           # if comment (1)
      (?<=--)       #   wait for "--"
    )               # end if comment (1)
    (?(2)           # if another comment (2)
      (?<=\?)       #   wait for "?"
    )               # end if another comment (2)
    (?(3)           # if one of tags-containers (3)
      </            #   wait for end
      (?i:\3)       #   of this tag
      (?:\s[^>]*)?  #   skip junk to ">"
    )               # end if (3)
    >               # tag closed
   }{}gsx;          # STRIP THIS TAG
  return $_ ? $_ : "";
}
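
Putting the split and clean-up steps together might look roughly like this. The sample HTML is invented, and the real page layout is an assumption; the simple tag-stripping substitution here only stands in for untag() so the sketch is self-contained.

```perl
use strict;
use warnings;

# Hypothetical product-page fragment: title, then the stars, then a <p>.
my $title = 'Example Movie (Widescreen)';
my $html  = '<h1>Example Movie (Widescreen)</h1> <a href="#">Actor One</a>, '
          . '<a href="#">Actor Two</a>, Actor Three<p>Retail Price: $26.98';

sub extract_stars {
    my ($movie_title, $page) = @_;
    # Everything after the title; \Q...\E treats the title as literal text.
    my (undef, $bottom) = split /\Q$movie_title\E/, $page;
    return '' unless defined $bottom && length $bottom;
    # The stars are whatever comes before the first <p>.
    my ($stars) = split /<p>/i, $bottom;
    $stars =~ s/<[^>]*>//g;     # crude tag strip; untag() would go here
    $stars =~ s/\s*,\s*/|/g;    # commas become pipes
    $stars =~ s/^\s+|\s+$//g;   # trim leading and trailing whitespace
    return $stars;
}

print extract_stars($title, $html), "\n";
```

With this sample input the function returns the three names joined by pipes, which can then sit in a single field of a comma-delimited record.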

If you append #castcrew to the end of the link, the page will list the
whole cast and crew for you as well, which is really easy to grab. But
you said you were only interested in the stars.

The rest of the product details appear on the page like this:

Retail Price: $26.98
Our Price: $21.98
Saving: $5.00 (18.5%)
Readers' Advantage Price: $20.88 Join Now
In Stock:Ships within 24 hours 
Same Day Delivery in Manhattan.

Format:    Wide Screen
Region Code:  1 
Rating:  
Original release date: 2001
Video/DVD Release Date: 5/28/2002
UPC: 85392246724
WARNER HOME VIDEO

They were kind enough to use the same keywords followed by colons for
every product, so that part is easy: use split, or, if you want to
stretch your regex muscles, you can do that too.
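
One way to turn those "keyword: value" lines into fields is sketched below, using part of the listing above as sample input. It assumes the tags have already been stripped and each item sits on its own line; `parse_details` is a hypothetical helper name.

```perl
use strict;
use warnings;

# Sample text taken from the product details shown above.
my $details = <<'END';
Retail Price: $26.98
Format:    Wide Screen
Region Code:  1
Rating:
Original release date: 2001
Video/DVD Release Date: 5/28/2002
UPC: 85392246724
WARNER HOME VIDEO
END

# Split each line at its first colon into keyword => value.
# Lines without a colon (such as the studio name) are skipped.
sub parse_details {
    my ($text) = @_;
    my %field;
    for my $line (split /\n/, $text) {
        next unless $line =~ /^\s*([^:]+?)\s*:\s*(.*?)\s*$/;
        $field{$1} = $2;
    }
    return %field;
}

my %f = parse_details($details);
print join('|', $f{'UPC'}, $f{'Original release date'}), "\n";
# prints: 85392246724|2001
```

From here, a finished record is just the title, the pipe-separated stars, the year, and the UPC joined with whatever delimiter you prefer.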

Thanks for the question,

webadept-ga

Request for Answer Clarification by flyboyutah-ga on 06 Jan 2003 16:14 PST
I'm setting my system up for Perl now and wanted to confirm that
the script in your answer is in Perl. (I've coded in Python before,
but not Perl.)

Clarification of Answer by webadept-ga on 06 Jan 2003 17:07 PST

The snippets I put in the answer are in Perl, yes. It's not full
code, of course, just enough to outline what I would do if I were
writing the script. If this is going to be your first Perl code, it
will help you out a great deal. You will want to install those two
modules at the very least. Since you are not familiar with the modules,
particularly LinkExtractor, I'll post an example of how to use it here
as well.

   # Fetch the page; $ua is the LWP::UserAgent set up earlier.
   my $r    = HTTP::Request->new('GET', $url);
   my $res  = $ua->request($r);
   my $this = $res->content;

   my $LX = HTML::LinkExtractor->new();
   $LX->parse(\$this);

   for my $Link ( @{ $LX->links } )
   {
       # _TEXT holds the link's text; look for the one that says Next
       if ( defined $$Link{_TEXT} && $$Link{_TEXT} =~ m/Next/si )
       {
           # Do something with the Next link at the bottom of the page
       }
   }
   undef $LX;

I don't code in Python at all, never even looked at it, so I couldn't
say how much of a learning curve there is between the two.

These sites will help a great deal

http://www.cpan.org
http://search.cpan.org/

thanks, 

webadept-ga

flyboyutah-ga rated this answer: 5 out of 5 stars
and gave an additional tip of: $3.00

It follows the old saying about teaching a man to fish. Thanks very
much for the prompt and very complete reply. Well worth the money.
Thanks a bunch.

