****Background****
I'm developing an application that uses Rubyful Soup to scrape the
Daily Racing Form charts and insert the scraped data into MySQL
tables. Given a URL, the application goes to the specified race
card and scrapes the related race charts. It starts with the first
race, scrapes the race data, running line data and race call data
identified below, and inserts the data into the corresponding tables.
When the first race is completed, the program continues to the second
race. The process is repeated until all races are processed.
Typically there are 8 to 12 races per race card.
*** Requirement ****
Develop a script/program using Ruby and Rubyful Soup that, given a URL
for a race card, scrapes the race, running line and race call data
items identified below for each chart in the race card and inserts the
scraped data items into the appropriate MySQL table.
***Terminology *****
A chart describes a particular race, at a particular track on a particular day.
A race card is the set of race charts for a particular day and track.
****Resources*****
Rubyful Soup: http://www.crummy.com/software/RubyfulSoup/
Daily Racing Form chart: http://www.drf.com/charts/13/cBEL13.html?rn=212054
Daily Racing Form chart index: http://www.drf.com/charts/cindex.html
**** Chart Definition *******
Each race chart is comprised of the following elements:
1. Track Name & Date *
2. Race Condition *
3. Running Lines[1..N] *
4. Race Line *
5. Results and Payoffs
6. Winner
7. Trainer
8. Scratched
9. Claimed
10. Comments *
Sample data and explanations for the pertinent chart elements (marked *)
are set forth below. The sample data was derived from the first race
at Belmont Park on May 13, 2006.
http://www.drf.com/charts/13/cBEL13.html?rn=212054
***Track Name****
Belmont Park
Saturday, May 13, 2006
Track: Belmont Park
Date: 5/13/2006
DayofWeek: Saturday
- The track name, date and day of week appear once at the top of
each race card (set of charts).
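A minimal sketch of this extraction, assuming the two header lines have already been pulled out of the page as plain text. The method name parse_card_header is hypothetical, and only Ruby's standard date library is used:

```ruby
require 'date'

# Parse the two header lines of a race card into track, date and day of week.
# Assumption: the date line always looks like "Saturday, May 13, 2006".
def parse_card_header(track_line, date_line)
  date = Date.strptime(date_line.strip, '%A, %B %d, %Y')
  {
    track: track_line.strip,
    date: date,
    day_of_week: date.strftime('%A')
  }
end

header = parse_card_header('Belmont Park', 'Saturday, May 13, 2006')
puts header[:track]         # prints Belmont Park
puts header[:date].iso8601  # prints 2006-05-13
puts header[:day_of_week]   # prints Saturday
```

The ISO form of the date (2006-05-13) is the literal format a MySQL date column accepts.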
***Race Number***
1st Race
RaceNumber: 1
- Each race chart starts with a race number.
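As a small illustration, the race number can be taken from the heading with a regular expression. This is only a sketch; parse_race_number is a hypothetical helper, and it assumes headings of the form "1st Race", "2nd Race" and so on:

```ruby
# Extract the race number from a chart heading such as "1st Race".
# Returns nil when the text does not look like a race heading.
def parse_race_number(heading)
  match = heading.match(/\A\s*(\d+)(?:st|nd|rd|th)\s+Race\b/i)
  match ? match[1].to_i : nil
end

puts parse_race_number('1st Race')   # prints 1
puts parse_race_number('12th Race')  # prints 12
```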
***Race Condition***
1 Mile Dirt ALLOWANCE OPTIONAL CLAIMING PURSE $46,000 Open
Value of Race 46000 Value to Winner 27,600 2nd 9,200 3rd 4,600
4th 2,300 5th 1,380 6th 920 Mutuel Pool $295,494.00 Exacta Pool
$309,392.00 Trifecta Pool $205,001.00
- The race condition includes several data elements. The data elements
and extract definitions (source text | extracted value) follow:
Distance: 1 Mile | 8 (furlongs)
Surface: Dirt | Dirt
Type: ALLOWANCE OPTIONAL CLAIMING | Allowance Optional Claiming
Purse: PURSE $46,000 | 46000
ToWinner: Value to Winner 27,600 | 27600
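The extract definitions above could be sketched as follows. This assumes the condition text has the same layout as the sample (distance, surface, race type, then "PURSE $..." and "Value to Winner ..."), that distances are given in miles or furlongs, and that the surface is Dirt or Turf. All method and constant names here are hypothetical:

```ruby
# Assumed layout of the start of a race condition block.
CONDITION_PATTERN = %r{
  \A\s*(?<amount>\d+(?:\s+\d+/\d+)?)\s+(?<unit>Miles?|Furlongs?)\s+
  (?<surface>Dirt|Turf)\s+
  (?<type>.+?)\s+PURSE\s+\$(?<purse>[\d,]+)
  .*?Value\s+to\s+Winner\s+\$?(?<to_winner>[\d,]+)
}xmi

# Convert "1", "1 1/16" (miles) or "6 1/2" (furlongs) to furlongs.
def distance_in_furlongs(amount, unit)
  whole, fraction = amount.split
  value = Rational(whole)
  value += Rational(fraction) if fraction
  value *= 8 if unit =~ /\Amile/i   # 1 mile = 8 furlongs
  value.to_f
end

# Strip "$" and "," from a money string and return a float.
def money(text)
  text.delete('$,').to_f
end

def parse_race_condition(text)
  m = CONDITION_PATTERN.match(text)
  return nil unless m
  {
    distance: distance_in_furlongs(m[:amount], m[:unit]),
    surface: m[:surface].capitalize,
    race_type: m[:type].split.map(&:capitalize).join(' '),
    purse: money(m[:purse]),
    to_winner: money(m[:to_winner])
  }
end

sample = '1 Mile Dirt ALLOWANCE OPTIONAL CLAIMING PURSE $46,000 Open ' \
         'Value of Race 46000 Value to Winner 27,600 2nd 9,200 3rd 4,600'
p parse_race_condition(sample)
```

Real charts may use other distance wordings or surfaces, so the pattern would likely need extending once more cards have been inspected.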
***Sample Running Line and explanation***
Each race will have 3 to 24 running lines, depending on the number of horses.
PN Horse M Eq Wt PP SP 1/4 1/2 3/4 Str Fin Jockey Odds
6 Touchdown Kid LB b 124 4 4 2 1 1 ½ 1 1½ 1 5 1 4¼ Luzzi M J .60
PN: The horse's program number = 6
Horse: The horse's name = Touchdown Kid
M: The horse was medicated with L(asix) and B(ute)
Eq: The horse races with equipment - b(linkers)
Wt: The horse was carrying 124 pounds
PP: The horse broke from post position 4
SP: The horse's starting position right after the break was 4th.
1/4: At the 1/4 pole, the horse was in 2nd, 1 length ahead of the 3rd horse
1/2: At the 1/2 pole, the horse was in 1st, 1/2 length ahead of the 2nd horse
3/4: At the 3/4 pole, the horse was in 1st, 1 1/2 lengths ahead of the 2nd horse
Str: In the stretch, the horse was in 1st, 5 lengths ahead of the 2nd horse.
Fin: At the finish, the horse was in 1st, 4 1/4 lengths ahead of the 2nd horse.
Jockey: The jockey's name was Luzzi M J
Odds: The horse's odds were .60
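A rough sketch of turning one running line into structured data. It assumes the line has already been split into one string per column (for example, one per table cell), that there are exactly five calls as in the sample, and that margins use only whole numbers and the characters ¼, ½ and ¾. All names are hypothetical:

```ruby
CALL_CODES = %w[1/4 1/2 3/4 Str Fin].freeze
FRACTIONS = { '¼' => 0.25, '½' => 0.5, '¾' => 0.75 }.freeze

# Convert a margin such as "5", "½" or "1½" to lengths as a float.
def parse_lengths(text)
  return nil if text.empty?
  match = text.match(/\A(\d*)([¼½¾]?)\z/)
  return nil unless match
  match[1].to_i + FRACTIONS.fetch(match[2], 0.0)
end

# Split a call cell such as "1 1½" into position and lengths.
def parse_call(cell)
  position, margin = cell.split(' ', 2)
  { position: position.to_i, lengths: parse_lengths(margin.to_s.delete(' ')) }
end

def parse_running_line(cells)
  pn, horse, med, eq, wt, pp, sp, *rest = cells
  jockey, odds = rest.pop(2)   # the last two columns; rest now holds the calls
  calls = CALL_CODES.zip(rest).map { |code, cell| parse_call(cell).merge(code: code) }
  {
    prog_num: pn.to_i, horse: horse, med: med, equ: eq, wgt: wt.to_i,
    pp: pp.to_i, sp: sp.to_i, jockey: jockey, odds: odds.to_f, calls: calls
  }
end

cells = ['6', 'Touchdown Kid', 'LB', 'b', '124', '4', '4',
         '2 1', '1 ½', '1 1½', '1 5', '1 4¼', 'Luzzi M J', '.60']
line = parse_running_line(cells)
puts line[:horse]                 # prints Touchdown Kid
puts line[:calls].last[:lengths]  # prints 4.25
```

Margins that are not numeric are not handled here; parse_lengths returns nil for them.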
*** Sample Race Lines ********
Off at 1:00 Start Good for all . Won Driving. Time , :22 4/5, :45
3/5, 1:10 2/5, 1:37, Clear63. Track: Fast.
Off at: 1:00
Start: Good for all
Won: Driving
Time1: :22 4/5
Time2: :45 3/5
Time3: 1:10 2/5
Time4: 1:37
Note: The number of times depends on the distance of the race
Weather1: Clear
Weather2: 63
Track: Fast
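The race line fields above could be pulled apart roughly as follows. This is a sketch that assumes the text follows the sample layout exactly (including the weather and temperature running together as "Clear63"); parse_race_line is a hypothetical name:

```ruby
def parse_race_line(text)
  line = text.gsub(/\s+/, ' ').strip            # undo line wrapping
  timing = line[/Time\s*,?(.*?)Track:/, 1].to_s # times plus weather
  weather = timing.match(/([A-Za-z]+)\s*(\d+)\s*\./)
  {
    off_at: line[/Off at\s+(\d{1,2}:\d{2})/, 1],
    start: line[/Start\s+(.+?)\s*\.\s*Won/, 1],
    won: line[/Won\s+(.+?)\s*\./, 1],
    # The number of fractional times varies with the race distance.
    times: timing.scan(/\d*:\d{2}(?:\s+\d\/5)?/),
    weather1: weather && weather[1],
    weather2: weather && weather[2].to_i,
    track: line[/Track:\s*(\w+)/, 1]
  }
end

sample = "Off at 1:00 Start Good for all . Won Driving. Time , :22 4/5, :45\n" \
         '3/5, 1:10 2/5, 1:37, Clear63. Track: Fast.'
race_line = parse_race_line(sample)
puts race_line[:start]        # prints Good for all
puts race_line[:times].size   # prints 4
puts race_line[:weather2]     # prints 63
```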
***Sample Comments****
TOUCHDOWN KID quickly showed in front, set the pace while in hand,
drew away when roused and was kept to a drive to the wire. SEEKING
THE MONEY raced close up along the inside and rallied on the rail to
get the place spot. HEATHROW chased the pace while three wide and was
outfinished for the place. DUKE'S CROSSING was outrun early, came
wide for the drive and offered a mild rally outside. CHAMPCHU raced
close up early and lacked a rally. HARD IRON was outrun along the
inside.
*** Database Tables ***
The extracted data should be inserted into one of three MySQL tables:
Races, Running_Lines, and Race_Calls. There is a one-to-many
relationship between the Races table and the Running_Lines table and a
one-to-many relationship between the Running_Lines table and the
Race_Calls table. The structure for each table and related sample
data values based on the sample data above follow:
Races:
Name       | Type     | Sample Data (Domain)
RaceId     | Int      | Auto Increment
RaceDate   | date     | 5/13/06 (all valid dates)
DayofWeek  | char     | Saturday
Track      | char     | Belmont Park
RaceNumber | Int      | 1 (integer values from 1 to 20)
Distance   | float    | 8 (distance in furlongs; 1 furlong = 1/8 mile)
Surface    | char     | Dirt
Purse      | float    | 46000
ToWinner   | float    | 27600
RaceType   | char     | Allowance Optional Claiming
FieldSize  | int      | Number of horses in the race
Winner     | char     | The name of the winning horse
WinOdds    | float    | The odds for the winning horse
Offat      | time     | 1:00
Start      | char     | Good for all
Won        | char     | Driving
Weather1   | char     | Clear
Weather2   | int      | 63
Track      | char     | Fast (track condition)
Comments   | Longtext | TOUCHDOWN KID quickly showed in front, set the pace...
Running_Lines:
Name    | Type  | Sample Data
RaceId  | Int   | From Races table
ProgNum | Int   | 6
Med     | char  | LB
Equ     | char  | b
Wgt     | int   | 124
PP      | int   | 4
SP      | int   | 4
Jockey  | char  | Luzzi M J
Odds    | float | .60
Race_Calls:
Name      | Type  | Sample Data (Domain)
RaceId    | Int   | From Races table
CallNum   | Int   | 1 (an integer value 1 through 6)
CallCode  | Char  | 1/4 (1/4, 1/2, 3/4, 1, Str, Fin)
CallPos   | Int   | 2
CallLen   | float | 1
Timevalue | Char  | 22 4/5
Time      | Float | 22.80 (time expressed in seconds)
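To illustrate how a Timevalue such as "22 4/5" maps to the Time column in seconds, here is a minimal sketch. It assumes times are written as optional minutes, seconds, and an optional fraction in fifths of a second; time_in_seconds is a hypothetical helper:

```ruby
# Convert a chart time such as ":22 4/5", "22 4/5", "1:10 2/5" or "1:37"
# to seconds. Returns nil when the text is not in one of those forms.
def time_in_seconds(text)
  match = text.strip.match(/\A(?:(\d+):)?:?(\d{1,2})(?:\s+(\d)\/5)?\z/)
  return nil unless match
  minutes = match[1].to_i   # nil.to_i is 0 when minutes are absent
  seconds = match[2].to_i
  fifths = match[3].to_i    # fractions are fifths of a second
  (minutes * 60 + seconds + fifths / 5.0).round(2)
end

puts time_in_seconds('22 4/5')    # prints 22.8
puts time_in_seconds('1:10 2/5')  # prints 70.4
puts time_in_seconds('1:37')      # prints 97.0
```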
Thank you.