I've been looking at various HTML scrapers, and I've found that evaluating them is a lot like evaluating bug-tracking tools: most of them suck and the rest cost ridiculous amounts of money.
I'm building a collecting site (knowlist.com if you're curious; it's still in beta, but it's reasonably solid right now) where one of the core ideas is to import data from, and link to, lots of small websites. I've already hooked up with two sites that have OK'd the use of their content. The problem is that the content is only semi-regular.
The first site I'm trying this on has content that's reasonably regular, but it's not perfect. For instance, some pages have two images per item while other pages have only one. Some entries have the description wrapped in <b> tags while others use <div style="font-weight: bold">.
All the pages on that site are very kind in that every single entry starts with either <hr> or <Hr>, so that's a lucky break. Basically, I need a tool that starts by applying one rule to cut the page into individual entries, and then tries a group of rules on each entry until one fits. (Try format 1; if that doesn't match, try format 2, etc.)
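In rough Python terms, the flow I'm imagining looks something like this (just a sketch to show the idea; the file name and the exact patterns are made up and would depend on the real pages):

    import re

    page_html = open("somepage.html").read()  # a page saved from the site (made-up name)

    # Rule 1: cut the page into individual entries at <hr> or <Hr>
    entries = re.split(r'(?i)<hr\b[^>]*>', page_html)

    # Rules 2..n: known entry formats, tried in order
    formats = [
        re.compile(r'<b>(?P<desc>.*?)</b>', re.DOTALL),                                # format 1
        re.compile(r'<div style="font-weight: bold">(?P<desc>.*?)</div>', re.DOTALL),  # format 2
    ]

    def parse_entry(entry):
        for fmt in formats:
            match = fmt.search(entry)
            if match:
                return match.groupdict()
        return None  # no known format fits; flag for manual review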
If there were a tool that actually worked, I'd be willing to pay a few hundred dollars for it. But paying more than $1000 is simply out of the question, and if the tool is going to cost anywhere near that much, it had better save me a lot of time. If it requires a huge amount of work, it would probably be easier for me to just use a free Java HTML parsing library and do the scraping myself. Perhaps something like ANTLR would help?
I looked at screen-scraper.com and that tool looks promising, but I just can't get the stupid thing to use multiple rules. And it's kind of annoying that it requires me to use it as a proxy for my browser.
I don't need to pull from these sites regularly. I suppose that's a theoretical possibility, but I don't foresee that being how things will work. I'm primarily interested in doing one big scrape and then having future entries maintained by the site operators. But I suppose having that capability wouldn't hurt.
|
Request for Question Clarification by leapinglizard-ga on 16 Apr 2005 15:48 PDT
Have you thought of using a scripting language and specifying the
structures to extract with regular expressions? For even more parsing
power, you could use a context-free grammar, but I doubt it's
necessary.
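For instance, a few lines of Python handle the sort of irregularity you describe, such as entries having one image or two, because a regular expression simply returns however many matches it finds (a sketch only; the exact pattern would be tuned to the actual pages):

    import re

    img_pattern = re.compile(r'(?i)<img[^>]*\bsrc="([^"]*)"')

    def extract_images(entry_html):
        # Returns a list, so an entry with one image and an entry with two
        # are handled by exactly the same code.
        return img_pattern.findall(entry_html)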
If you don't want to do the coding yourself, I'm at your service.
Given the specifications and ideally a sample web page, whether
fragmentary or whole, I could write a Python script to do the
extraction. You'll see from the following links that I've done
programming projects here before.
http://answers.google.com/answers/threadview?id=449715
http://answers.google.com/answers/threadview?id=402414
http://answers.google.com/answers/threadview?id=402277
http://answers.google.com/answers/threadview?id=121280
One condition is that we must confine all our communication to this
site. If you'd rather not have the code posted here for all to see, I
can't help you in this fashion.
If, however, you want some generalized advice on how to extract HTML
structures with regular expressions, along with some concrete examples,
I could provide that too as an answer to the present question.
leapinglizard
|
Clarification of Question by mxnmatch-ga on 16 Apr 2005 17:07 PDT
Sounds great! How much would you charge?
The first site is a pac-man collection website:
http://www.zutco.com/pacman.htm
All of the pages can be navigated to from that page.
For each item I want to get the following values when available:
- image (some items have 2 images)
- description (some items have no description)
- url of page the item was on
- inner title of the page (not the HTML <title>; I mean the text at the top of each page that describes what that page contains)
- the "tree" for each item. By that I mean: each time you click a link to reach a page, capture that link's text, so if you navigate three links deep and then parse the page, you retain the text of all three of those links. I'll use that information to create a hierarchy for these objects (roughly as in the sketch below).
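To make the "tree" idea concrete, here's roughly the kind of traversal I have in mind (a Python-ish sketch, not a requirement to do it this way; the link-matching pattern is deliberately simplistic):

    import re
    from urllib.request import urlopen
    from urllib.parse import urljoin

    link_pat = re.compile(r'(?is)<a[^>]*\bhref="([^"]*)"[^>]*>(.*?)</a>')

    def crawl(url, trail=(), seen=None):
        """Fetch a page and remember the text of every link clicked to reach it."""
        seen = seen if seen is not None else set()
        if url in seen:
            return
        seen.add(url)
        html = urlopen(url).read().decode("latin-1", "replace")
        yield url, list(trail), html              # trail becomes the item's hierarchy
        for href, text in link_pat.findall(html):
            child = urljoin(url, href)
            if child.startswith("http://www.zutco.com/") and child not in seen:
                yield from crawl(child, trail + (text,), seen)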
The second site is trickier. It's a pez dispenser museum at:
http://www.burlingamepezmuseum.com/pezexhibit.html
I want to capture information about each individual pez dispenser. So,
let's say you're on
http://www.burlingamepezmuseum.com/botleft.html
One of the items to capture would be "Cat with Derby". For that pez
dispenser I would want the following data:
name="Cat with Derby"
manufactureDate="60" (this is the number in parentheses after the name)
image="http://www.burlingamepezmuseum.com/bl.gif"
section="Bottom Left" (this is from the title near the top of the page)
rowLocation="TOP ROW" (note that one page doesn't indicate the row; in that case just say "TOP ROW")
rowOffset="4" (this is how far into the row from the left it is)
url="http://www.burlingamepezmuseum.com/botleft.html"
I think that's everything. Let me know if you can do these. Thanks!
|
Clarification of Question by mxnmatch-ga on 19 Apr 2005 19:42 PDT
So, do you think you can scrape those sites?
|
Request for Question Clarification by leapinglizard-ga on 20 Apr 2005 13:09 PDT
Quite possibly, but I'm very busy at the moment and I don't think I
can get to it until the weekend. How urgent is it for you?
leapinglizard
|
Clarification of Question by mxnmatch-ga on 20 Apr 2005 19:49 PDT
I guess it's not particularly urgent. I just wanted to know if it's
something you could do or if I need to start working on it myself.
|
Request for Question Clarification by leapinglizard-ga on 20 Apr 2005 20:10 PDT
After looking over the sites and reviewing your specs, I'm confident I
can scrape both sites this Saturday. I'm thinking $80 would be a fair
fee, although that might change depending on how irregular the HTML
is, and consequently how many program adjustments and manual data
corrections I have to make. I would post the results as a pair of text
files or, if you prefer, Excel-compatible spreadsheet files. Tell me
if you're agreeable to these terms.
leapinglizard
|
Clarification of Question by mxnmatch-ga on 20 Apr 2005 21:13 PDT
Sounds good. Please do the Pac-Man site first. I still need to get
written permission from the pez site. I've got verbal permission from
them, but I would still prefer an email from them to make it official.
If possible, I'd prefer the results in XML form.
|
Request for Question Clarification by leapinglizard-ga on 20 Apr 2005 21:57 PDT
Sure, I can output XML. Do you have a template, or should I come up
with something reasonable on my own?
leapinglizard
|
Clarification of Question by mxnmatch-ga on 21 Apr 2005 15:09 PDT
Anything reasonable is fine. My system is set up so that each XML data file has a corresponding XML translation file; the translation file contains lists of XPaths indicating what info to use and how that info is to be imported into the DB.
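To give you an idea of what "reasonable" means on my end, the import works roughly like this (the element and field names below are made up, and I've shown the translation as a plain dict rather than its actual XML form just to keep the sketch short):

    import xml.etree.ElementTree as ET

    # A data file you might produce (made-up element names)
    data = ET.fromstring("""
    <items>
      <item>
        <name>Cat with Derby</name>
        <image>http://www.burlingamepezmuseum.com/bl.gif</image>
      </item>
    </items>""")

    # My translation file maps DB columns to XPaths into the data file
    translation = {
        "name":  "./name",
        "image": "./image",
    }

    for item in data.findall("./item"):
        row = dict((col, item.findtext(path)) for col, path in translation.items())
        # row is what gets inserted into the DB
        print(row)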
|
Request for Question Clarification by leapinglizard-ga on 25 Apr 2005 10:33 PDT
I'm afraid I was away over the weekend, but I haven't forgotten this question. I'll get the work done soon.
leapinglizard
|
Clarification of Question by mxnmatch-ga on 01 May 2005 19:16 PDT
Any progress?
|
Clarification of Question by mxnmatch-ga on 08 May 2005 19:18 PDT
I've done the Pac-Man site myself.
I can do the pez site too if necessary, since you're apparently not interested.
|