I've been looking at various HTML scrapers, and I've found that evaluating them is a lot like evaluating bug-tracking tools: most of them suck and the rest cost ridiculous amounts of money.
I'm building a collecting site (knowlist.com if you're curious; it's still in beta, but it's reasonably solid right now) where one of the core ideas is to import data from, and link to, lots of small websites. I've already hooked up with two sites that have OK'd the use of their content. The problem is that the content is only semi-regular.
The first site I'm trying this on has content that's reasonably regular, but it's not perfect. For instance, some pages have two images per item while other pages have only one. Some entries have the description wrapped in <b> tags while others use <div style="font-weight: bold">.
All the pages on that site are very kind in that every single entry starts with either <hr> or <Hr>, so that's a lucky break. Basically, I need a tool that starts by applying one rule to cut the page into individual entries, and then tries a group of rules on each entry until one fits. (Try format 1; if that doesn't match, try format 2, etc.)
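In rough Python terms, the flow I'm imagining looks something like this (just a sketch to show the idea; the file name and the exact patterns are made up and would depend on the real pages):

    import re

    page_html = open("somepage.html").read()  # a page saved from the site (made-up name)

    # Rule 1: cut the page into individual entries at <hr> or <Hr>
    entries = re.split(r'(?i)<hr\b[^>]*>', page_html)

    # Rules 2..n: known entry formats, tried in order
    formats = [
        re.compile(r'<b>(?P<desc>.*?)</b>', re.DOTALL),                                # format 1
        re.compile(r'<div style="font-weight: bold">(?P<desc>.*?)</div>', re.DOTALL),  # format 2
    ]

    def parse_entry(entry):
        for fmt in formats:
            match = fmt.search(entry)
            if match:
                return match.groupdict()
        return None  # no known format fits; flag for manual review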
If there were a tool that actually worked, I'd be willing to pay a few hundred dollars for it. But paying more than $1000 is simply out of the question, and if the tool is going to cost anywhere near that much, it had better save me a lot of time. If it requires a huge amount of work, it would probably be easier for me to just use a free Java HTML parsing library and do the scraping myself. Perhaps something like ANTLR would help?
I looked at screen-scraper.com and that tool looks promising, but I just can't get the stupid thing to use multiple rules. And it's kind of annoying that it requires me to use it as a proxy for my browser.
I don't need to pull from these sites regularly. I suppose that's a theoretical possibility, but I don't foresee that being how things will work. I'm primarily interested in doing one big scrape and then having future entries maintained by the site operators. But I suppose having that capability wouldn't hurt.
|
Request for Question Clarification by leapinglizard-ga on 16 Apr 2005 15:48 PDT
Have you thought of using a scripting language and specifying the
structures to extract with regular expressions? For even more parsing
power, you could use a context-free grammar, but I doubt it's
necessary.
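For instance, a few lines of Python handle the sort of irregularity you describe, such as entries having one image or two, because a regular expression simply returns however many matches it finds (a sketch only; the exact pattern would be tuned to the actual pages):

    import re

    img_pattern = re.compile(r'(?i)<img[^>]*\bsrc="([^"]*)"')

    def extract_images(entry_html):
        # Returns a list, so an entry with one image and an entry with two
        # are handled by exactly the same code.
        return img_pattern.findall(entry_html)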
If you don't want to do the coding yourself, I'm at your service.
Given the specifications and ideally a sample web page, whether
fragmentary or whole, I could write a Python script to do the
extraction. You'll see from the following links that I've done
programming projects here before.
http://answers.google.com/answers/threadview?id=449715
http://answers.google.com/answers/threadview?id=402414
http://answers.google.com/answers/threadview?id=402277
http://answers.google.com/answers/threadview?id=121280
One condition is that we must confine all our communication to this
site. If you'd rather not have the code posted here for all to see, I
can't help you in this fashion.
If, however, you want some generalized advice on how to extract HTML
structures with regular expressions, along with some concrete examples,
I could provide that too as an answer to the present question.
leapinglizard
|
Clarification of Question by mxnmatch-ga on 16 Apr 2005 17:07 PDT
Sounds great! How much would you charge?
The first site is a pac-man collection website:
http://www.zutco.com/pacman.htm
All of the pages can be navigated to from that page.
For each item I want to get the following values when available:
- image (some items have 2 images)
- description (some items have no description)
- url of page the item was on
- inner title of the page (not the HTML <title>; I mean the text at the top of each page that describes what that page contains)
- the "tree" for each item. By that I mean: each time you click a link to reach a page, capture that link's text, so if you navigate three links deep and then parse the page, you retain the text of all three of those links. I'll use that information to create a hierarchy for these objects (roughly as in the sketch below).
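To make the "tree" idea concrete, here's roughly the kind of traversal I have in mind (a Python-ish sketch, not a requirement to do it this way; the link-matching pattern is deliberately simplistic):

    import re
    from urllib.request import urlopen
    from urllib.parse import urljoin

    link_pat = re.compile(r'(?is)<a[^>]*\bhref="([^"]*)"[^>]*>(.*?)</a>')

    def crawl(url, trail=(), seen=None):
        """Fetch a page and remember the text of every link clicked to reach it."""
        seen = seen if seen is not None else set()
        if url in seen:
            return
        seen.add(url)
        html = urlopen(url).read().decode("latin-1", "replace")
        yield url, list(trail), html              # trail becomes the item's hierarchy
        for href, text in link_pat.findall(html):
            child = urljoin(url, href)
            if child.startswith("http://www.zutco.com/") and child not in seen:
                yield from crawl(child, trail + (text,), seen)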
The second site is trickier. It's a pez dispenser museum at:
http://www.burlingamepezmuseum.com/pezexhibit.html
I want to capture information about each individual pez dispenser. So,
let's say you're on
http://www.burlingamepezmuseum.com/botleft.html
One of the items to capture would be "Cat with Derby". For that pez
dispenser I would want the following data:
name="Cat with Derby"
manufactureDate="60" (this is the number in parentheses after the name)
image="http://www.burlingamepezmuseum.com/bl.gif"
section="Bottom Left" (this is from the title near the top of the page)
rowLocation="TOP ROW" (note that one page doesn't indicate the row; in that case just say "TOP ROW")
rowOffset="4" (this is how far into the row from the left it is)
url="http://www.burlingamepezmuseum.com/botleft.html"
I think that's everything. Let me know if you can do these. Thanks!
|
Clarification of Question by mxnmatch-ga on 19 Apr 2005 19:42 PDT
So, do you think you can scrape those sites?
|
Request for Question Clarification by leapinglizard-ga on 20 Apr 2005 13:09 PDT
Quite possibly, but I'm very busy at the moment and I don't think I
can get to it until the weekend. How urgent is it for you?
leapinglizard
|
Clarification of Question by mxnmatch-ga on 20 Apr 2005 19:49 PDT
I guess it's not particularly urgent. I just wanted to know if it's
something you could do or if I need to start working on it myself.
|
Request for Question Clarification by leapinglizard-ga on 20 Apr 2005 20:10 PDT
After looking over the sites and reviewing your specs, I'm confident I
can scrape both sites this Saturday. I'm thinking $80 would be a fair
fee, although that might change depending on how irregular the HTML
is, and consequently how many program adjustments and manual data
corrections I have to make. I would post the results as a pair of text
files or, if you prefer, Excel-compatible spreadsheet files. Tell me
if you're agreeable to these terms.
leapinglizard
|
Clarification of Question by mxnmatch-ga on 20 Apr 2005 21:13 PDT
Sounds good. Please do the Pac-Man site first. I still need to get
written permission from the pez site. I've got verbal permission from
them, but I would still prefer an email from them to make it official.
If possible, I'd prefer the results in XML form.
|
Request for Question Clarification by leapinglizard-ga on 20 Apr 2005 21:57 PDT
Sure, I can output XML. Do you have a template, or should I come up
with something reasonable on my own?
leapinglizard
|
Clarification of Question by mxnmatch-ga on 21 Apr 2005 15:09 PDT
Anything reasonable is fine. My system is set up so that each XML data file has a corresponding XML translation file; the translation file contains lists of XPaths indicating what info to use and how that info is to be imported into the DB.
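To give you an idea of what "reasonable" means on my end, the import works roughly like this (the element and field names below are made up, and I've shown the translation as a plain dict rather than its actual XML form just to keep the sketch short):

    import xml.etree.ElementTree as ET

    # A data file you might produce (made-up element names)
    data = ET.fromstring("""
    <items>
      <item>
        <name>Cat with Derby</name>
        <image>http://www.burlingamepezmuseum.com/bl.gif</image>
      </item>
    </items>""")

    # My translation file maps DB columns to XPaths into the data file
    translation = {
        "name":  "./name",
        "image": "./image",
    }

    for item in data.findall("./item"):
        row = dict((col, item.findtext(path)) for col, path in translation.items())
        # row is what gets inserted into the DB
        print(row)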
|
Request for Question Clarification by leapinglizard-ga on 25 Apr 2005 10:33 PDT
I'm afraid I was away over the weekend, but I haven't forgotten this question. I'll get the work done soon.
leapinglizard
|
Clarification of Question by mxnmatch-ga on 01 May 2005 19:16 PDT
Any progress?
|
Clarification of Question by mxnmatch-ga on 08 May 2005 19:18 PDT
I've done the Pac-Man site myself.
I can do the pez site too if necessary, since you're apparently not interested.
|