I need a PHP guru to write a script for me to data mine movie synopsis
information, in addition to a MySQL table structure that can be
imported. The mined info will be written to a MySQL database, and the
script will be scheduled to run once a week. (How do I schedule it, BTW?)
I want to gather initial info from these pages:
http://www.formovies.com/video/top_rentals.html
http://www.formovies.com/video/new_releases.html
http://www.formovies.com/video/upcoming_releases.html
Let's start with the top rentals page...
The script should first read "http://www.formovies.com/video/top_rentals.html"
Here I want the week date to be written to a variable (i.e. "Top
Rentals (07/02/2006)"), then each top 50 movie should be "crawled" to
include the following:
· Rank number
· Movie Title
· product_no (will be our unique ID too in SQL)
· box office gross
code should check for existing 'product_no' in SQL before crawling...
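A minimal sketch of that "check before crawling" step, assuming a table named `movies` keyed on `product_no` (the table name and both helper functions are hypothetical, as is the third product number). It loads the known IDs with one query and keeps only the ones not yet stored:

```php
<?php
// Assumption: a table `movies` whose primary key column is `product_no`.
// Returns every product_no already stored, as integers.
function known_product_numbers(PDO $db): array
{
    $rows = $db->query('SELECT product_no FROM movies')->fetchAll(PDO::FETCH_COLUMN);
    return array_map('intval', $rows);
}

// Keep only the listed product numbers that are not stored yet.
// array_values() re-indexes the result so it starts at 0 again.
function product_numbers_to_crawl(array $listed, array $known): array
{
    return array_values(array_diff($listed, $known));
}

// Example with made-up data: 256033 is already stored, so it is skipped.
$listed = [253142, 256033, 250001];
$known  = [256033];
$todo   = product_numbers_to_crawl($listed, $known);
```

Loading the known IDs once keeps the crawl to a single lookup query per run instead of one query per movie.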
Next, I need the detail crawled for each movie at
http://www.formovies.com/title/detail.html?product_no=######
Here I need the following:
· boxart image (would like to store it in MySQL database, need
instructions on doing this)
· trailer 'popurl' location
· movie rating
· theater release date
· video release date
· cast (separated; contributor_no will be unique ID in SQL)
· genre(s)
· Director(s) (separated; contributor_no will be unique ID in SQL)
· length
· synopsis
· reviews (separated)
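On storing the boxart image in MySQL: one possible approach is a BLOB column written through a PDO prepared statement. This is only a sketch; the `movies` table, its columns, and both function names are assumptions, not part of any existing schema.

```php
<?php
// Hypothetical table (MySQL syntax), one row per movie:
//   CREATE TABLE movies (
//     product_no INT UNSIGNED NOT NULL PRIMARY KEY,
//     title      VARCHAR(255) NOT NULL,
//     boxart     MEDIUMBLOB NULL
//   );

// Insert one movie together with the raw bytes of its boxart image.
function store_boxart(PDO $db, int $productNo, string $title, string $imageBytes): void
{
    $stmt = $db->prepare(
        'INSERT INTO movies (product_no, title, boxart) VALUES (:product_no, :title, :boxart)'
    );
    $stmt->bindValue(':product_no', $productNo, PDO::PARAM_INT);
    $stmt->bindValue(':title', $title, PDO::PARAM_STR);
    // PARAM_LOB marks the value as binary data rather than text.
    $stmt->bindValue(':boxart', $imageBytes, PDO::PARAM_LOB);
    $stmt->execute();
}

// Read the image bytes back; null when the movie or the image is missing.
// Assumes the driver returns BLOB columns as strings (MySQL and SQLite do).
function load_boxart(PDO $db, int $productNo): ?string
{
    $stmt = $db->prepare('SELECT boxart FROM movies WHERE product_no = :product_no');
    $stmt->execute([':product_no' => $productNo]);
    $bytes = $stmt->fetchColumn();
    return $bytes === false ? null : $bytes;
}
```

The image bytes themselves could come from a second fetch of the boxart URL; when serving the image back, send the matching `Content-Type` header before echoing the bytes.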
The http://www.formovies.com/video/new_releases.html page will be crawled
the same way (week dates kept separate). The code should check for
existing data in SQL before crawling previous weeks.
The http://www.formovies.com/video/upcoming_releases.html page will be crawled
the same way (week dates kept separate). The code should check for
existing data in SQL before crawling previous weeks.
<?php
if (!empty($_POST["product_no"])) {
// product_no: 253142 = Failure to Launch (single director & genre)
// product_no: 256033 = Eight Below (example of multiple directors & genres)
// .. etc
$url = "http://www.formovies.com/title/detail.html?product_no=".$_POST["product_no"];
include "Snoopy.class.php"; #sourceforge.net/projects/snoopy/
$snoopy = new Snoopy;
$snoopy->agent = "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)";
$snoopy->referer = "http://www.formovies.com/video/top_rentals.html";
$snoopy->expandlinks = true;
if($snoopy->fetch($url))
{
$html = $snoopy->results;
// your code here .. ;-)
}
else
echo "error fetching document: ".$snoopy->error."\n";
exit;
}
echo "<table align=\"center\">";
echo "<h1>Movie Detail Lookup</h1>\r";
echo "<form method=\"POST\" action=\"".$_SERVER['PHP_SELF']."\">\r";
echo "<tr><td><label for=\"domain\">Movie Product No: </label></td>\r";
echo "<td><input type=\"text\" name=\"product_no\" value=\"".(isset($_POST["product_no"]) ? htmlspecialchars($_POST["product_no"]) : "")."\" size=\"50\"></td></tr>\r";
echo "<tr><td/><td><input type=\"submit\" value=\"Get Details\"></td></tr>\r";
echo "</form>\r";
echo "</table><br><br>\r";
?>
Note: I will be authorized to mine information from
www.MyStoreName.formovies.com shortly; this is the exact same site.
Clarification of Question by evobox-ga on 18 Jul 2006 01:04 PDT
Please COMMENT heavily so I can learn, such as:
ereg("^(GIR0AA)|(TDCU1ZZ)|((([A-PR-UWYZ][0-9][0-9]?)|" #comment
."(([A-PR-UWYZ][A-HK-Y][0-9][0-9]?)|" #comment
."(([A-PR-UWYZ][0-9][A-HJKSTUW])|" #comment
."([A-PR-UWYZ][A-HK-Y][0-9][ABEHMNPRVWXY]))))" #comment
."[0-9][ABD-HJLNP-UW-Z]{2})$", $postcode)) #comment
Clarification of Question by evobox-ga on 18 Jul 2006 01:08 PDT
COMMENTS like these would be EXTREMELY helpful.
# this is the text to be searched
$subject = "The first meeting is <!--comment741-->Monday, April
7<!--comment000-->. Future meetings will be announced
<!--comment751-->here<!--comment000--> on this site.";
# we now decode the &lt; and &gt; entities back to < and > for ease of use
$subject = html_entity_decode($subject);
# this is the regular expression pattern
$pattern = '/--comment(\d{3})-->(.*)<!--comment000-->.*<!--comment(\d{3})-->(.*)</';
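For illustration, here is how that pattern could be applied and its capture groups read back, commented in the style requested above. The subject is written on one line here because `.` does not match a newline by default; the variable names are only for this sketch.

```php
<?php
# the text to be searched, kept on a single line
$subject = 'The first meeting is <!--comment741-->Monday, April 7<!--comment000-->. '
         . 'Future meetings will be announced <!--comment751-->here<!--comment000--> on this site.';

# the regular expression pattern, piece by piece:
#   --comment(\d{3})-->      group 1: the three digits of the first marker
#   (.*)<!--comment000-->    group 2: everything up to a closing 000 marker
#   .*<!--comment(\d{3})-->  group 3: the three digits of a later marker
#   (.*)<                    group 4: the text after that marker, up to a "<"
$pattern = '/--comment(\d{3})-->(.*)<!--comment000-->.*<!--comment(\d{3})-->(.*)</';

# preg_match returns 1 on a match; $matches[0] holds the whole match
if (preg_match($pattern, $subject, $matches) === 1) {
    $firstId   = $matches[1]; # "741"
    $firstText = $matches[2]; # "Monday, April 7"
    $nextId    = $matches[3]; # "751"
    $nextText  = $matches[4]; # "here"
}
```

The greedy `(.*)` groups settle on these values only after backtracking, so a subject with more markers could capture more than expected; `(.*?)` is the lazy alternative.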
Clarification of Question by evobox-ga on 18 Jul 2006 03:23 PDT
Been working on this a while (having never done it before), and came up
with the following... as you can see, it works but I'm sure there is a
better way...
$html = $snoopy->results;
$strip = array("\t","\n","\r", "\x20\x20", "\0", "\x0B");
$html = str_replace($strip,"",html_entity_decode($html));
$titlepattern = '/title-detail">"(.*)"<\/h1>/';
preg_match($titlepattern, $html,$title);
$boxartpattern = '/"boxart"><img src="(.*).jpg"/';
preg_match($boxartpattern, $html,$boxart);
$trailerpattern = '/popurl="(.*)"winpops=/';
preg_match($trailerpattern, $html,$trailer);
$ratingspattern = '/\/ratings\/(.*).gif" \/>/';
preg_match($ratingspattern, $html,$rating);
$therelpattern = '/Theater Rel<\/b>...<\/td><td>(.*)<\/td><\/tr><tr><td><b>Video Rel/';
preg_match($therelpattern, $html,$therel);
$vidrelpattern = '/Video Rel<\/b>.......<\/td><td align="left">(.*)<\/td><\/tr><tr><td valign="top"><b>Starring/';
preg_match($vidrelpattern, $html,$vidrel);
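One possible tightening of the patterns above, as a sketch only. Once the newlines are stripped the whole page is a single line, so a greedy `(.*)` can run on to the last occurrence of the closing text. Bounded character classes and an escaped dot avoid that. The helper `first_capture` and the HTML fragment are hypothetical; the fragment is only shaped after the patterns above and the real page may differ.

```php
<?php
// Return the first capture group of $pattern in $html, or null when nothing matches.
function first_capture(string $pattern, string $html): ?string
{
    return preg_match($pattern, $html, $m) === 1 ? $m[1] : null;
}

// Hypothetical fragment, modelled on the patterns above (not copied from the site).
$html = '<div class="boxart"><img src="/images/253142.jpg" alt=""/></div>'
      . '<a popurl="/trailer/253142.html" winpops="1">Trailer</a>'
      . '<img src="/ratings/PG-13.gif" />';

// [^"]* cannot run past the closing quote, and \. matches a literal dot only.
$boxart  = first_capture('/"boxart"><img src="([^"]*?)\.jpg"/', $html); // "/images/253142"
$trailer = first_capture('/popurl="([^"]*)"/', $html);                   // "/trailer/253142.html"
$rating  = first_capture('/\/ratings\/([^".]+)\.gif"/', $html);          // "PG-13"
$missing = first_capture('/synopsis">([^<]*)</', $html);                 // null, no such markup here
```

Returning null on a failed match makes a site layout change visible at once instead of surfacing later as an undefined index.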
Clarification of Question by evobox-ga on 20 Jul 2006 17:11 PDT
example code snippet:
if($snoopy->fetch($url))
{
$html = $snoopy->results;
$strip = array("\t","\n","\r", "\x20\x20", "\0", "\x0B");
$html = str_replace($strip,"",html_entity_decode($html));
// Get Last Page Update Timestamp
$updatepattern = '/<!-- CREATED\|top_rentals.html\|(.*)--><div class="float">/';
preg_match($updatepattern, $html, $update); # 20060713 12:09:42
if (($updatestamp = strtotime($update[1])) === false) {
echo "ERROR: Update timestamp ($update[1]) is bogus, code change?";
exit;
} else {
echo "Last Updated: " . date('YmdHis', $updatestamp) . "<br>\r";
}
// Get Top Rental Week Datestamp
$weekpattern = '/<h1 id="page-heading">Top Rentals \((.*)\)<\/h1><div class="top-title">/';
preg_match($weekpattern, $html, $week); # 07/02/2006
if (($weekstamp = strtotime($week[1])) === false) {
echo "ERROR: Top Rental Week ($week[1]) is bogus, code changed?";
exit;
} else {
echo "Week: " . date('Ymd', $weekstamp) . "<br>\r";
}
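A stricter alternative to `strtotime()` for the two timestamps above, sketched under the assumption that the week date is month/day/year and the update stamp is `Ymd H:i:s`, as the sample values suggest. The helper name `parse_strict` is hypothetical. `strtotime()` accepts many loosely formatted strings, whereas `createFromFormat()` only accepts text in the stated format.

```php
<?php
// Parse $text against one exact format; return null instead of guessing.
function parse_strict(string $format, string $text): ?DateTimeImmutable
{
    $date   = DateTimeImmutable::createFromFormat($format, $text);
    $errors = DateTimeImmutable::getLastErrors(); // array, or false when clean (PHP 8.2+)
    if ($date === false) {
        return null; // the text does not fit the format at all
    }
    if (is_array($errors) && ($errors['warning_count'] > 0 || $errors['error_count'] > 0)) {
        return null; // e.g. an impossible day such as 02/30/2006
    }
    return $date;
}

// The leading "!" resets unparsed fields, so the time is 00:00:00 rather than "now".
$week   = parse_strict('!m/d/Y', '07/02/2006');
$update = parse_strict('!Ymd H:i:s', '20060713 12:09:42');

if ($week === null || $update === null) {
    echo "ERROR: date could not be parsed, page layout changed?\n";
} else {
    echo "Week: " . $week->format('Ymd') . "\n";
    echo "Last Updated: " . $update->format('YmdHis') . "\n";
}
```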