Q: Use PHP (preg_match_all) and regex to suck player scores from website
Question  
Subject: Use PHP (preg_match_all ) and regex to suck player scores from website
Category: Computers > Programming
Asked by: srv-ga
List Price: $25.00
Posted: 04 Jul 2002 16:54 PDT
Expires: 03 Aug 2002 16:54 PDT
Question ID: 36615
I need to use PHP to suck down all the Australian and New Zealand
golfers from the live scoreboard at the European Tour site. The
players are marked as "AUS" and "NZL" on the scoreboard under the
"Country" column. One thing makes this difficult: part of the
scoreboard page URL changes each week -
http://scoring.europeantour.com/tourn2002038/htmlscores/lb_0_0.html
(this week's URL) - in that example the "tourn2002038" changes
every week and doesn't seem to be the week of the year or date or
anything that can be guessed.

So the best "automagic" way I can think of to get that part of the url
would be to first suck the front page of the site -
http://www.europeantour.com/home/ and then grab the "number" or "url"
from the right hand side column about half way down under "Live
Scoring". There are several events each week (could be  1, 2, 3 or 4
events listed) but the first one is always the main European PGA Tour
match we want. Then that will give you the full url for the scoreboard
page then it is just a matter of pulling out the rows that have "AUS"
or "NZL" in the "Country" column from that url.

Also another thing needed is that the player names be changed from
this format - "FICHARDT, Darren" to this format "Darren Fichardt".

So the scoreboard that I need outputted would have the following
columns left to right:
Rank-Name-Hole- Par-1-2-3-4-Total

Once the first two rounds of a tournament have been played then all
players below a certain score "Miss the cut" and don't play the final
two rounds. I also need their scores to come in below the current
scores but only to have their "Name" and the first two rounds of
scores "R1" and "R2" under the appropriate columns. The blank cells on
those rows, e.g. R3 and R4, should be filled with a "-". The missed cut
scores are held on a separate page, which you can reach from -
http://scoring.europeantour.com/tourn2002036/htmlscores/lb_0_0.html -
As the tournament this week hasn't got any missed cuts yet, I gave the
above address from last week's completed match. Use the drop-down menu
just above the scoreboard to find the "Missed the Cut" page -
http://scoring.europeantour.com/tourn2002036/htmlscores/lb_9_0.html

So we should see at the end a full scoreboard with all Australians and
New Zealanders:
Rank-Name-Hole- Par-1-2-3-4-Total
List Aussie/NZ players here who play all rounds
Row header here - "Missed Cut"
List Aussie/NZ players here who "missed the cut"

The missed cut page only appears after the end of the second round, so
there will need to be some error checking: if you call the missed cut
page in, say, the first round then it won't exist and the script will
need to die quietly.

Phew! - Hope this explains it all...it is pretty simple stuff but just
complicated to explain :))
Answer  
Subject: Re: Use PHP (preg_match_all ) and regex to suck player scores from website
Answered By: e_murphy-ga on 05 Jul 2002 07:29 PDT
Rated:5 out of 5 stars
 
Hello srv-ga.  I have written the PHP script just as you describe it,
but I would also like to give you some additional advice.  I have done
things similar to this in the past and have run into two problems.

Problem number one is load time.  When someone visits this page it
will have the load time of visiting three pages.  To solve this
problem in the past I have used a UNIX cron job to periodically (maybe
once an hour) cache the pages to a local file.  I'm not sure if you
are using UNIX, but it is very easy to call the command "wget" at
intervals to cache pages.
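
If cron and wget are not available, the same caching idea can be
approximated inside PHP itself. This is only a rough sketch: the helper
name cached_fetch, the cache file path and the one hour lifetime are all
assumptions made for illustration, not part of the script below.

```php
<?php
// Hypothetical helper: return the cached copy if it is younger than
// $ttl seconds, otherwise call $fetch and store whatever it returns.
function cached_fetch($cacheFile, $ttl, $fetch)
{
    if (is_file($cacheFile) && (time() - filemtime($cacheFile)) < $ttl) {
        return file_get_contents($cacheFile);
    }

    $body = call_user_func($fetch);
    if ($body === false) {
        // fetch failed: fall back to a stale copy if there is one
        return is_file($cacheFile) ? file_get_contents($cacheFile) : '';
    }

    file_put_contents($cacheFile, $body, LOCK_EX);
    return $body;
}

// Possible usage (path and lifetime are illustrative):
// $html = cached_fetch('/tmp/scoreboard.html', 3600, function () {
//     return @file_get_contents('http://www.europeantour.com/home/');
// });
```

Passing the fetch step in as a callback keeps the cache logic separate
from the network call, so it can be tried out without hitting the site.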

The second problem is the fragility of the script and its dependence
on its input not changing.  I have tried to make it as flexible as
possible (using case insensitive matches, etc) but there is no telling
how they will change the page you are parsing.  The ultimate way to do
this is to have an agreement with the other web site to put the
content in a certain format in a hidden location.  For example, many
web sites use an XML format called RSS to transmit news headlines
between sites.  RSS is a standard format you can rely on.  I
understand this probably wasn't possible in your scenario or you
probably would have done it.  So, keep your eye on the script and make
sure it doesn't break if they change the formatting of their HTML.
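
One concrete way the script can break: a scan loop that keeps calling
array_shift() until a pattern matches will never finish if the pattern
has disappeared from the page. A minimal sketch of a bounded scan is
shown here; the helper name skip_to is hypothetical and is not used in
the script itself.

```php
<?php
// Hypothetical helper: discard lines until one matches $pattern.
// Returns the matching line, or null when the input runs out, so the
// caller can stop instead of looping forever.
function skip_to(&$lines, $pattern)
{
    // array_shift() returns null once the array is empty
    while (($line = array_shift($lines)) !== null) {
        if (preg_match($pattern, $line)) {
            return $line;
        }
    }
    return null;
}

// Possible usage:
// if (skip_to($homecontent, '/Live Scoring/i') === null) {
//     exit; // page layout changed, give up quietly
// }
```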

With those problems in mind, here is the script I came up with:

--Begin--
<?php

// get the home page content
$homecontent = file('http://www.europeantour.com/home/');

// Look for the section with the "Live Scoring" URL
do {
  $line = array_shift($homecontent);
} while (!preg_match('/Live Scoring/i', $line));

// Look for the actual URL
do {
  $line = array_shift($homecontent);
  if (preg_match('/http:\/\/scoring.europeantour.com\//i', $line)) {
    $url = preg_replace('/^.*<a href="(.*?)".*/', "$1", chop($line));
  }
} while (!preg_match('/http:\/\/scoring.europeantour.com\//i', $line));

unset($homecontent); // free up a little memory

// get the score content
$content = file($url);

// look for table header

do {
  $line = array_shift($content);
} while (!preg_match('/"smallgreen">Rank<\/font>/i', $line));

$alldone   = false;   // whether we are done parsing the score table
$allscores = array(); // initialize the scores array

do {

  // look for beginning of row
  do {
    $line = array_shift($content);
    if (preg_match('/<\/table/i', $line)) {
      $alldone = true;
      break;
    }
  } while (!preg_match('/<tr/i', $line));  

  if ($alldone == false) {

    $row = array(); // initialize this row

    do {
      $line = array_shift($content);

      if (preg_match('/<td/i', $line)) {
        $data = rtrim(strip_tags($line));
        array_push($row, $data);
      }

    } while (!preg_match('/<\/tr/i', $line));

    if (count($row) == 9 &&
        (preg_match('/AUS/i', $row[2]) || preg_match('/NZL/i', $row[2]))) {
      list($lname, $fname) = preg_split('/,\s*/', $row[1]);
      $fname = preg_replace('/\s+/', '', $fname);
      $lname = preg_replace('/\s+/', '', $lname);
      $row[1] = ucfirst(strtolower($fname))." ".ucfirst(strtolower($lname));

      // missed cut
      $row[7] = preg_replace('/&nbsp;/', '-', $row[7]);
      $row[8] = preg_replace('/&nbsp;/', '-', $row[8]);

      array_push($allscores, $row);
    }

  }

} while ($alldone == false);

unset($content); // free up some more memory

?>
<html>
<body bgcolor="#ffffff">

<table border=1>

<tr>
  <th>Rank</th>
  <th>Name</th>
  <th>Hole</th>
  <th>Par</th>
  <th>1</th>
  <th>2</th>
  <th>3</th>
  <th>4</th>
  <th>Total</th>
</tr>

<?php

foreach ($allscores as $score) {

?>
<tr>
  <td><?= $score[0] ?></td>
  <td><?= $score[1] ?></td>
  <td><?= $score[3] ?></td>
  <td><?= $score[4] ?></td>
  <td><?= $score[5] ?></td>
  <td><?= $score[6] ?></td>
  <td><?= $score[7] ?></td>
  <td><?= $score[8] ?></td>
  <td><?= $score[4] ?></td>
</tr>
<?php

}

?>

</table>

</body>
</html>
--End--

I hope this works for you.  If I missed anything you asked for just
let me know and I'll add it.  Here are a couple sites that explain
some of the concepts in this script.  I used mainly Perl style regular
expression functions in PHP because those are the ones I am used to.

The PHP Manual (source of infinite wisdom)
http://www.php.net/manual/en/

Perl Regular Expressions (the "preg" functions in PHP use the same
syntax)
http://www.perldoc.com/perl5.6.1/pod/perlre.html
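
Since the question title mentions preg_match_all, here is a rough sketch
of how the same row filtering could be done with it on a whole page held
in one string, rather than line by line. The sample markup and the
helper name rows_for_countries are made up for illustration; the real
scoreboard HTML may well be laid out differently.

```php
<?php
// Made-up sample markup; the real scoreboard HTML may differ.
$html = '<tr><td>1</td><td>FICHARDT, Darren</td><td>RSA</td></tr>'
      . '<tr><td>2</td><td>O\'MALLEY, Peter</td><td>AUS</td></tr>'
      . '<tr><td>3</td><td>CAMPBELL, Michael</td><td>NZL</td></tr>';

// Hypothetical helper: return the cells of every row whose third cell
// is one of the given country codes.
function rows_for_countries($html, $countries)
{
    $found = array();

    // PREG_SET_ORDER gives one entry per matched <tr>...</tr>
    preg_match_all('/<tr[^>]*>(.*?)<\/tr>/is', $html, $rows, PREG_SET_ORDER);

    foreach ($rows as $row) {
        // $cells[1] holds the contents of each <td> in this row
        preg_match_all('/<td[^>]*>(.*?)<\/td>/is', $row[1], $cells);
        $values = array_map('trim', array_map('strip_tags', $cells[1]));

        if (isset($values[2]) &&
            in_array(strtoupper($values[2]), $countries)) {
            $found[] = $values;
        }
    }

    return $found;
}

// Possible usage:
// $rows = rows_for_countries($html, array('AUS', 'NZL'));
```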

Request for Answer Clarification by srv-ga on 06 Jul 2002 04:16 PDT
Hi Mate
Thanks for the answer...fantastic stuff and almost there! :))
You have just missed the part at the end where the players that
"Missed the cut" are now off the main scoreboard page and are put onto
another page. We need to pull the scores also from that page and put
under the other scores from the players who are still on the main
scoreboard page. If you go to the following page -
http://scoring.europeantour.com/tourn2002038/htmlscores/lb_0_0.html
and hit the drop-down menu you will see the "Missed the cut" option -
The page is http://scoring.europeantour.com/tourn2002038/htmlscores/lb_9_0.html.
So we should end up with the scoreboard of all the players that are
still playing plus the players that didn't make the cut. It all needs
to sit in the one table as described at the end of my original
question.

A couple of small things:
1. The players' names such as O'Malley are coming out as O'malley
because of the ucfirst command, so those just need to be as they should,
with the capital letter after the apostrophe
2. There is also a "&nbsp;" before the info in the "Place" columns, so
can they be stripped out please

With regards to your statements about asking the website and the
caching stuff....Yep totally agree but these guys are a little behind
the times and we have tried to get the info but it is all a lost case.
We keep saying that things would be so much simpler if they offered an
XML feed to everyone and they could probably charge also but
everything is either Unisys or IBM or such and it is all too hard. We
won't be putting the code up to get live feeds as that is against
copyright and would also be, as you say, much too server intensive.
Basically we will be pulling the feed at the end of play then
inserting the final scores into a MySQL db which will then be queried
on our page.

Totally agreed about relying on the page not changing but we have no
options unfortunately...the golf industry is very antiquated in
general....

Thanks again for the answer and hear from you soon!

Clarification of Answer by e_murphy-ga on 06 Jul 2002 09:22 PDT
OK, sorry about not understanding the "missed the cut" part.  I was
getting lost in the golf terminology.  Like most programmers, I don't
know a thing about sports.  In any event, here are the requested
changes:

--Begin--
<?php

// get the home page content
$homecontent = file('http://www.europeantour.com/home/');

// Look for the section with the "Live Scoring" URL
do {
  $line = array_shift($homecontent);
} while (!preg_match('/Live Scoring/i', $line));

// Look for the actual URL
do {
  $line = array_shift($homecontent);
  if (preg_match('/http:\/\/scoring.europeantour.com\//i', $line)) {
    $url = preg_replace('/^.*<a href="(.*?)".*/', "$1", chop($line));
  }
} while (!preg_match('/http:\/\/scoring.europeantour.com\//i', $line));

unset($homecontent); // free up a little memory

$allscores = array(); // initialize the scores array

// doing it all twice: main scoreboard, then "missed the cut"
for ($i = 0; $i < 2; $i++) {

  if ($i == 1) { // make the "missed the cut" url
    $url = substr($url, 0, -8)."9_0.html";
  }

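  // get the score content; the "missed the cut" page does not exist
  // until round two is over, so skip quietly if it cannot be fetched
  $content = @file($url);
  if ($content === false) {
    continue;
  }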

  // look for table header
  do {
    $line = array_shift($content);
  } while (!preg_match('/"smallgreen">Rank<\/font>/i', $line));

  $alldone   = false; // whether we are done parsing the score table

  do {

    // look for beginning of row
    do {
      $line = array_shift($content);
      if (preg_match('/<\/table/i', $line)) {
        $alldone = true;
        break;
      }
    } while (!preg_match('/<tr/i', $line));  

    if ($alldone == false) {

      $row = array(); // initialize this row

      do {
        $line = array_shift($content);

        if (preg_match('/<td/i', $line)) {
          $data = rtrim(strip_tags($line));
          array_push($row, $data);
        }

      } while (!preg_match('/<\/tr/i', $line));

      if (count($row) == 9 &&
          (preg_match('/AUS/i', $row[2]) || preg_match('/NZL/i', $row[2]))) {
        list($lname, $fname) = preg_split('/,\s*/', $row[1]);
        $fname = preg_replace('/\s+/', '', $fname);
        $lname = preg_replace('/\s+/', '', $lname);
        $row[1] = beautify_name($fname)." ".beautify_name($lname);

        // extra stuff in rank
        $row[0] = preg_replace('/&nbsp;/', '', $row[0]);

        if ($i == 1) {
          // missed cut
          $row[7] = preg_replace('/&nbsp;/', '-', $row[7]);
          $row[8] = preg_replace('/&nbsp;/', '-', $row[8]);
        }

        array_push($allscores, $row);
      }

    }

  } while ($alldone == false);

  unset($content); // free up some more memory

}

// Fix name capitalizations, you can put this somewhere else if you
// want
function beautify_name ($name) {
  $newname = ucfirst(strtolower($name));

  $apos = strpos($newname, "'"); // find apostrophe
  if ($apos !== false) { // strict test: strpos() can return 0
    $newname = substr($newname, 0, $apos)."'".
               ucfirst(substr($newname, $apos+1, 1)).
               substr($newname, $apos+2);
  }

  return $newname;
}

?>
<html>
<body bgcolor="#ffffff">

<table border=1>

<tr>
  <th>Rank</th>
  <th>Name</th>
  <th>Hole</th>
  <th>Par</th>
  <th>1</th>
  <th>2</th>
  <th>3</th>
  <th>4</th>
  <th>Total</th>
</tr>

<?php

foreach ($allscores as $score) {

?>
<tr>
  <td><?= $score[0] ?></td>
  <td><?= $score[1] ?></td>
  <td><?= $score[3] ?></td>
  <td><?= $score[4] ?></td>
  <td><?= $score[5] ?></td>
  <td><?= $score[6] ?></td>
  <td><?= $score[7] ?></td>
  <td><?= $score[8] ?></td>
  <td><?= $score[4] ?></td>
</tr>
<?php

}

?>

</table>

</body>
</html>
--End--
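
The name handling above (preg_split on the comma, then beautify_name on
each half) could also be folded into a single function. This is only a
sketch under assumptions: the helper name display_name is hypothetical,
and treating spaces and hyphens like apostrophes goes beyond what was
asked for.

```php
<?php
// Hypothetical helper: "O'MALLEY, Peter" -> "Peter O'Malley".
function display_name($raw)
{
    $parts = preg_split('/,\s*/', trim($raw), 2);
    if (count($parts) < 2) {
        return trim($raw); // no comma: leave the name untouched
    }
    list($last, $first) = $parts;

    $fix = function ($name) {
        // lowercase everything, then capitalise the first letter and
        // any letter that follows an apostrophe, space or hyphen
        return preg_replace_callback(
            "/(^|[' -])([a-z])/",
            function ($m) { return $m[1] . strtoupper($m[2]); },
            strtolower(trim($name))
        );
    };

    return $fix($first) . ' ' . $fix($last);
}

// Possible usage:
// $row[1] = display_name($row[1]);
```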
srv-ga rated this answer: 5 out of 5 stars
Great answer from a reasonably complex question....good stuff!
