Q: Using PHP (preg_match_all ) and regex to data mine movie synopsis into MySQL ( No Answer,   0 Comments )
Question  
Subject: Using PHP (preg_match_all ) and regex to data mine movie synopsis into MySQL
Category: Computers > Programming
Asked by: evobox-ga
List Price: $40.00
Posted: 18 Jul 2006 01:01 PDT
Expires: 17 Aug 2006 01:01 PDT
Question ID: 747290
I need a PHP guru to write a script that data-mines movie synopsis
information, plus a MySQL table structure that I can import. The mined
info will be written to a MySQL database, and the script will be
scheduled to run once a week.  (How do I schedule it, BTW?)

I want to gather initial info from these pages:
	http://www.formovies.com/video/top_rentals.html
	http://www.formovies.com/video/new_releases.html
	http://www.formovies.com/video/upcoming_releases.html

Let's start with the top rentals page...

The script should first read "http://www.formovies.com/video/top_rentals.html"

Here I want the week date written to a variable (i.e. the date in "Top
Rentals (07/02/2006)"), then each of the top 50 movies should be
"crawled" to capture the following:
	· Rank number
	· Movie Title
	· product_no (will be our unique ID too in SQL)
	· box office gross

The code should check for an existing 'product_no' in SQL before crawling a movie.
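
One way the "check before crawling" step could look is sketched below. It assumes the listing page links to each movie with the same `detail.html?product_no=...` URL form used for the detail pages, and that the list of already stored IDs comes from a query against a `movies` table. The table name, both function names and the sample markup are hypothetical.

```php
<?php
/**
 * Collect the distinct product numbers linked from a listing page.
 * Assumes links look like detail.html?product_no=253142 (assumed markup).
 */
function extractProductNumbers(string $html): array
{
    // (\d+) captures the digits that follow "product_no=" in every link
    preg_match_all('/detail\.html\?product_no=(\d+)/', $html, $matches);

    // $matches[1] holds the captured digits; drop repeats, keep first-seen order
    return array_values(array_unique(array_map('intval', $matches[1])));
}

/**
 * Return only the product numbers that are not stored yet.
 * $known would normally come from the database, e.g. the result of
 * SELECT product_no FROM movies (hypothetical table).
 */
function productsToCrawl(array $found, array $known): array
{
    return array_values(array_diff($found, $known));
}

// Usage with a made-up listing fragment
$listing = '<a href="/title/detail.html?product_no=253142">Failure to Launch</a>'
         . '<a href="/title/detail.html?product_no=256033">Eight Below</a>'
         . '<a href="/title/detail.html?product_no=253142">(boxart link)</a>';

$found = extractProductNumbers($listing);    // 253142 and 256033
$todo  = productsToCrawl($found, [253142]);  // only 256033 still needs a crawl
print_r($todo);
```

Keeping the extraction and the "already known" filter as separate functions means the filter can later be swapped for a real database lookup without touching the regex.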

Next, I need the detail crawled for each movie at
http://www.formovies.com/title/detail.html?product_no=######

Here I need the following:
	· boxart image (would like to store it in the MySQL database; need instructions on doing this)
	· trailer 'popurl' location
	· movie rating
	· theater release date
	· video release date
	· cast (separated; contributor_no will be unique ID in SQL)
	· genre(s)
	· Director(s) (separated; contributor_no will be unique ID in SQL)
	· length
	· synopsis
	· reviews (separated)
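
Splitting the cast and directors into separate rows could be sketched as follows. The detail page markup is not shown in this question, so the sketch assumes each name is a link carrying a `contributor_no` query parameter. The function name, the link shape, the IDs and the names are all placeholders.

```php
<?php
/**
 * Split a block of contributor links into [contributor_no => name] pairs.
 * The <a href="...contributor_no=NNN">Name</a> shape is an assumption.
 */
function extractContributors(string $html): array
{
    // contributor_no=(\d+)  the numeric ID inside the href
    // [^>]*>                the rest of the opening tag
    // ([^<]+)<\/a>          the link text up to the closing tag
    $pattern = '/contributor_no=(\d+)[^>]*>([^<]+)<\/a>/';
    preg_match_all($pattern, $html, $matches, PREG_SET_ORDER);

    $contributors = [];
    foreach ($matches as $m) {
        // $m[1] = numeric ID, $m[2] = link text (the person's name)
        $contributors[(int) $m[1]] = trim(html_entity_decode($m[2]));
    }
    return $contributors;
}

// Made-up fragment; IDs, names and markup are placeholders
$starring = '<a href="/contributor/detail.html?contributor_no=101">Actor One</a>, '
          . '<a href="/contributor/detail.html?contributor_no=102">Actor Two</a>';

print_r(extractContributors($starring));
```

Keying the result by contributor_no fits the plan of using it as the unique ID in SQL, and a person who appears twice on a page collapses to one entry.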
	

The http://www.formovies.com/video/new_releases.html page should be
crawled the same way (week dates kept separate); the code should check
for existing data in SQL before crawling previous weeks.
The http://www.formovies.com/video/upcoming_releases.html page should be
crawled the same way (week dates kept separate); the code should check
for existing data in SQL before crawling previous weeks.


<?php

if (!empty($_POST["product_no"])) {

	// product_no: 253142 = Failure to Launch (single director & genre)
	// product_no: 256033 = Eight Below (example of multiple directors & genres)
	// .. etc
	
	$url = "http://www.formovies.com/title/detail.html?product_no=".urlencode($_POST["product_no"]);
	include "Snoopy.class.php"; #sourceforge.net/projects/snoopy/
	$snoopy = new Snoopy;
	
	$snoopy->agent = "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)";
	$snoopy->referer = "http://www.formovies.com/video/top_rentals.html";
	$snoopy->expandlinks = true;
	
if($snoopy->fetch($url))
	{

	$html = $snoopy->results;

	// your code here .. ;-)
	
	}
	else
		echo "error fetching document: ".$snoopy->error."\n";
		
exit;

}

echo "<table align=\"center\">";
echo "<h1>Movie Detail Lookup</h1>\r";
echo "<form method=\"POST\" action=\"".$_SERVER['PHP_SELF']."\">\r";
echo "<tr><td><label for=\"domain\">Movie Product No: </label></td>\r";
echo "<td><input type=\"text\" name=\"product_no\" value=\"\" size=\"50\"></td></tr>\r";
echo "<tr><td/><td><input type=\"submit\" value=\"Get Details\"></td></tr>\r";
echo "</form>\r";
echo "</table><br><br>\r";

?>

Note: I will be authorized to mine information from
www.MyStoreName.formovies.com shortly; this is the exact same site.

Clarification of Question by evobox-ga on 18 Jul 2006 01:04 PDT
Please COMMENT heavily so I can learn, such as:

ereg("^(GIR0AA)|(TDCU1ZZ)|((([A-PR-UWYZ][0-9][0-9]?)|"  #comment
."(([A-PR-UWYZ][A-HK-Y][0-9][0-9]?)|"                   #comment
."(([A-PR-UWYZ][0-9][A-HJKSTUW])|"                      #comment
."([A-PR-UWYZ][A-HK-Y][0-9][ABEHMNPRVWXY]))))"          #comment
."[0-9][ABD-HJLNP-UW-Z]{2})$", $postcode))              #comment

Clarification of Question by evobox-ga on 18 Jul 2006 01:08 PDT
COMMENTS like these would be EXTREMELY helpful.

# this is the text to be searched
$subject = "The first meeting is &lt;!--comment741--&gt;Monday, April 7"
         . "&lt;!--comment000--&gt;. Future meetings will be announced "
         . "&lt;!--comment751--&gt;here&lt;!--comment000--&gt; on this site.";

# we now decode the &lt; and &gt; signs back to < and > for ease of use
$subject = html_entity_decode($subject);

# this is the regular expression pattern
$pattern = '/--comment(\d{3})-->(.*)<!--comment000-->.*<!--comment(\d{3})-->(.*)</';
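
A possible alternative to one long pattern, sketched under the assumption that every marked span opens with `<!--commentNNN-->` and closes with `<!--comment000-->` as in the sample text: a short non-greedy pattern applied with `preg_match_all` returns one row per span, however many spans the text holds.

```php
<?php
# sample text with the < and > signs still encoded as entities
$subject = "The first meeting is &lt;!--comment741--&gt;Monday, April 7"
         . "&lt;!--comment000--&gt;. Future meetings will be announced "
         . "&lt;!--comment751--&gt;here&lt;!--comment000--&gt; on this site.";

# decode &lt; and &gt; back to < and >
$subject = html_entity_decode($subject);

# <!--comment(\d{3})-->   opening marker; capture its 3-digit number
# (.*?)                   the marked text; "?" makes it stop at the FIRST closer
# <!--comment000-->       closing marker
# s modifier              lets "." cross line breaks inside a span
$pattern = '/<!--comment(\d{3})-->(.*?)<!--comment000-->/s';

# PREG_SET_ORDER groups the results per match: [full match, number, text]
preg_match_all($pattern, $subject, $spans, PREG_SET_ORDER);

foreach ($spans as $span) {
    echo $span[1] . ": " . $span[2] . "\n";
}
```

For the sample text this prints "741: Monday, April 7" and "751: here". Without the "?" the first `(.*)` would run on to the last closing marker in the text and swallow everything in between.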

Clarification of Question by evobox-ga on 18 Jul 2006 03:23 PDT
Been working on this awhile (having never done it before), and came up
with the following... as you can see, it works but i'm sure there is a
better way...

		$html = $snoopy->results;
		$strip = array("\t","\n","\r", "\x20\x20", "\0", "\x0B");
		$html = str_replace($strip,"",html_entity_decode($html));
		
		$titlepattern = '/title-detail">"(.*)"<\/h1>/';
		preg_match($titlepattern, $html,$title);
		
		$boxartpattern = '/"boxart"><img src="(.*).jpg"/';
		preg_match($boxartpattern, $html,$boxart);
		
		$trailerpattern = '/popurl="(.*)"winpops=/';
		preg_match($trailerpattern, $html,$trailer);
		
		$ratingspattern = '/\/ratings\/(.*).gif" \/>/';
		preg_match($ratingspattern, $html,$rating);
		
		$therelpattern = '/Theater Rel<\/b>...<\/td><td>(.*)<\/td><\/tr><tr><td><b>Video Rel/';
		preg_match($therelpattern, $html,$therel);
		
		$vidrelpattern = '/Video Rel<\/b>.......<\/td><td align="left">(.*)<\/td><\/tr><tr><td valign="top"><b>Starring/';
		preg_match($vidrelpattern, $html,$vidrel);
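
One possible "better way", offered as a sketch rather than a finished answer: keep the patterns in a named list and run them through a single helper. The helper name `extractFields` and the sample HTML are made up for illustration; the three patterns are adapted from the attempt above.

```php
<?php
/**
 * Run a list of named patterns against a page and return the first capture
 * of each, or null when a pattern does not match.
 */
function extractFields(string $html, array $patterns): array
{
    $fields = [];
    foreach ($patterns as $name => $pattern) {
        $fields[$name] = preg_match($pattern, $html, $m) === 1 ? trim($m[1]) : null;
    }
    return $fields;
}

// Changes from the original patterns: [^"]* stops at the nearest quote
// instead of the last one on the page, "\." matches a literal dot, and the
// boxart capture keeps its .jpg extension.
$patterns = [
    'boxart'  => '/"boxart"><img src="([^"]*\.jpg)"/',
    'trailer' => '/popurl="([^"]*)"/',
    'rating'  => '/\/ratings\/([^".]*)\.gif"/',
];

// Made-up fragment standing in for the real detail page
$html = '<div class="boxart"><img src="/images/253142.jpg" /></div>'
      . '<a popurl="/trailers/253142.html" winpops="1">Trailer</a>'
      . '<img src="/images/ratings/pg13.gif" />';

print_r(extractFields($html, $patterns));
```

A null value marks a field whose pattern no longer matches, which makes it easy to notice when the site changes its markup.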

Clarification of Question by evobox-ga on 20 Jul 2006 17:11 PDT
example code snippet:

if ($snoopy->fetch($url))
	{

		$html = $snoopy->results;
		// Strip tabs, line breaks, double spaces, NUL bytes and vertical tabs
		$strip = array("\t","\n","\r", "\x20\x20", "\0", "\x0B");
		$html = str_replace($strip,"",html_entity_decode($html));

		// Get Last Page Update Timestamp
		$updatepattern = '/<!-- CREATED\|top_rentals\.html\|(.*)--><div class="float">/';
		preg_match($updatepattern, $html, $update); # 20060713 12:09:42
		if (($updatestamp = strtotime($update[1])) === false) {
			echo "ERROR: Update timestamp ($update[1]) is bogus, code change?";
			exit;
		} else {
			echo "Last Updated: " . date('YmdHis', $updatestamp) . "<br>\r";
		}

		// Get Top Rental Week Datestamp
		$weekpattern = '/<h1 id="page-heading">Top Rentals \((.*)\)<\/h1><div class="top-title">/';
		preg_match($weekpattern, $html, $week); # 07/02/2006
		if (($weekstamp = strtotime($week[1])) === false) {
			echo "ERROR: Top Rental Week ($week[1]) is bogus, code changed?";
			exit;
		} else {
			echo "Week: " . date('Ymd', $weekstamp) . "<br>\r";
		}
	}
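
strtotime() accepts many date formats and reads slash-separated dates as month/day/year. A stricter check for the week label could be sketched as follows, assuming the label is always mm/dd/yyyy as in the "07/02/2006" sample; the function name `parseWeekDate` is hypothetical.

```php
<?php
/**
 * Parse a week label such as "07/02/2006" strictly as month/day/year.
 * Returns the date as "Ymd", or null when the text is not a real date.
 */
function parseWeekDate(string $text): ?string
{
    // "!" resets the fields that are not parsed (time of day) to zero
    $date = DateTime::createFromFormat('!m/d/Y', $text);

    // createFromFormat() rolls impossible dates over (02/31 becomes 03/03),
    // so formatting the result back must reproduce the input exactly
    if ($date === false || $date->format('m/d/Y') !== $text) {
        return null;
    }
    return $date->format('Ymd');
}

var_dump(parseWeekDate('07/02/2006')); // string "20060702"
var_dump(parseWeekDate('02/31/2006')); // NULL, there is no February 31
var_dump(parseWeekDate('soon'));       // NULL, not a date at all
```

The round-trip comparison also rejects labels in an unexpected layout, which is a useful early warning that the page markup has changed.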
Answer  
There is no answer at this time.

Comments  
There are no comments at this time.
