Dear stevegill,
I base my answer on prior experience with programming and web scraping,
as well as on research I carried out specially for this question.
I have downloaded and evaluated the trial version of iMacros Scripting
Edition, which costs $499 for a single-user license. The less expensive
Pro Edition ($199) can also generate scripts for automated web scraping,
but it doesn't permit you to transfer the scripts to other users. With
the Scripting Edition, the licensed user can make a script and distribute
it to any number of additional users, who can then execute it without
paying extra fees.
iOpus: Feature Comparison Chart
http://www.iopus.com/imacros/compare/
In order to run iMacros through its paces, I planned to apply it to a
simple web scraping task in which data is collected from several web
pages and consolidated locally. Specifically, my goal was to extract the
current local temperature reported by the New York Times and by the Los
Angeles Times, and then to inject these readings into an HTML file.
New York Times: Weather
http://nytimes.com/weather
Los Angeles Times: Weather: Los Angeles
http://weather.latimes.com/US/CA/Los_Angeles.html
The iMacros package includes a souped-up web browser that lets you record
various actions, such as surfing, downloading, and data extraction. The
actions are stored as a script which you can replay at your leisure,
with or without the browser. This approach to web automation sounds fine
in theory, but in practice I found that the script recording process is
burdened by severe limitations.
I did find it possible to record the rudiments of my web scraping task
by opening the New York Times weather page in the iMacros browser,
activating the iMacros "Extract Data" feature, and clicking near the
temperature reading on the web page. This procedure extracts the snippet
68°F (20°C) Cloudy
from the web page. Alternatively, one can click directly on the
centigrade reading to extract
(20°C)
but it is impossible to select just the Fahrenheit reading due to the
HTML structure of the web page.
As a result of these actions, iMacros generates the following script in
its proprietary web-automation language.
VERSION BUILD=5211028
TAB T=1
TAB CLOSEALLOTHERS
URL GOTO=http://nytimes.com/weather
SIZE X=819 Y=764
EXTRACT POS=1 TYPE=TXT ATTR=<DIV<SP>class=summary>*
In order to store the extracted text in a local file, one would have to
manually add something like
SAVEAS TYPE=EXTRACT FOLDER=C:\scraping\extracts FILE=weather.csv
to the script. However, the iMacros language doesn't have a facility for
manipulating the extracted text in more interesting ways. To combine
the weather data from several different sources in a single file, it
would be necessary to write a plugin in a general-purpose language,
such as Java or Perl, that interfaces with the iMacros script.
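To give you a sense of what that extra layer would involve, here is a
minimal sketch in Python. It assumes that two separate iMacros scripts
have already saved their extracts with SAVEAS into the folder shown
earlier, under the hypothetical names ny_weather.csv and la_weather.csv,
and it merely merges the two snippets into one report.
#===begin merge_extracts.py
# A minimal sketch: merge snippets saved by the iMacros SAVEAS command.
# The folder matches the SAVEAS example above; the file names are
# hypothetical and would have to match your own scripts.
import csv

extract_dir = 'C:/scraping/extracts/'
sources = [('New York', 'ny_weather.csv'),
           ('Los Angeles', 'la_weather.csv')]

report = open(extract_dir + 'combined.txt', 'w')
for city, filename in sources:
    for row in csv.reader(open(extract_dir + filename)):
        if row:                                   # skip any blank lines
            report.write('%s: %s\n' % (city, row[0]))
report.close()
#===end merge_extracts.py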
The bottom line is that while iMacros provides a friendly visual interface
for automating simple surfing and downloading tasks, it cannot execute
more useful extraction duties unless the user is able to write scripts
in the iMacros language and, in addition, to write iMacros plugins in
a more powerful programming language.
In my opinion, one may as well dispense with iMacros and proceed directly
to writing a custom web scraper in the more powerful language.
VelocityScape's Web Scraper Plus+, priced at $419, does not seem any
more useful than iMacros. The whole package is poorly assembled, with
textual errors sprinkled liberally throughout its documentation and
user interface. I was able to download Web Scraper on a Windows 2000
Pro system, but I could not get any version of it to run, even after
installing the auxiliary DLL.
VelocityScape: Compare Versions
http://velocityscape.com/WebScraperCompare.aspx
From perusing the Web Scraper manual, I have learned that its web
automation is less visually oriented than that offered by iMacros, though
also more precise. One must first construct a so-called Datapage that
specifies where on the target page the desired data is to be found. This
is done by stepping through the page's HTML structure with the help of
Web Scraper's parser, ultimately specifying which line contains the data
and what tags enclose it.
This Datapage acts as a template for periodic extraction of text
snippets to a Dataset, which can then export the data automatically to a
spreadsheet file or a relational database. However, there is no provision
for manipulating the extracted and stored data in any way. The user is
left with the task of assembling the data into a useful end form. In
essence, you would be back to cutting and pasting.
I cannot recommend either of these software packages for automated web
scraping. In fact, I am not sure if it is possible to make a web scraping
tool that is very friendly and very powerful at the same time. Any visual
means of specifying target data on a web page is bound to be less precise
than direct textual access to the underlying HTML.
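To illustrate the point, consider the snippet extracted earlier, 68°F
(20°C) Cloudy. The iMacros browser could only grab it wholesale, but
with direct access to the text, a one-line regular expression isolates
just the Fahrenheit figure. The following sketch simply hard-codes the
snippet to show the principle.
#===begin fahrenheit.py
# A minimal sketch: pick out just the Fahrenheit reading from the
# snippet that iMacros could only extract as a whole.
import re

snippet = '68°F (20°C) Cloudy'            # text extracted earlier
match = re.search(r'(\d+)°F', snippet)    # capture the digits before the F
if match:
    print match.group(1)                  # prints: 68
#===end fahrenheit.py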
There is some room for improvement at the other end of the process, I
think, where the extracted data could be easily plugged into prefabricated
output templates, but neither iMacros nor Web Scraper Plus+ is capable
of this. In any case, such output templates are readily implemented in
a general-purpose scripting language.
For example, the following Python script carries out precisely the task
I attempted and failed to accomplish with iMacros and Web Scraper. It
comprises about three dozen lines of code that I wrote in less than half
an hour.
#===begin weather.py
output_filename = 'weather.html'

# Output template; the *_TEMP markers below are replaced with the
# extracted temperature readings.
template = """
<html>
<body>
<h3> NY and LA weather </h3>
<p> The temperature in New York is <b>NY_TEMP</b> Fahrenheit. </p>
<p> The temperature in Los Angeles is <b>LA_TEMP</b> Fahrenheit. </p>
</body>
</html>
"""

# For each source: a regular expression that captures the temperature,
# the address of the page, and the template marker to replace.
ny_pattern = r'(\d+°)F'
ny_address = 'http://nytimes.com/weather'
ny_marker = 'NY_TEMP'
ny_data = (ny_pattern, ny_address, ny_marker)

la_pattern = r'(\d+°)'
la_address = 'http://weather.latimes.com/US/CA/Los_Angeles.html'
la_marker = 'LA_TEMP'
la_data = (la_pattern, la_address, la_marker)

import re, urllib

for (pattern, address, marker) in [ny_data, la_data]:
    page = urllib.urlopen(address).read()    # fetch the page source
    match = re.search(pattern, page)         # look for the temperature
    if not match:
        continue                             # leave the marker untouched
    template = re.sub(marker, match.group(1), template)

open(output_filename, 'w').write(template)
#===end weather.py
This script can be executed at regular intervals with Python, which is
a completely free scripting environment. It visits the NYT and LAT
websites, grabs the temperature readings, plugs them into a little
template, and writes out the result to an HTML file that can be viewed
locally with any web browser.
Python: Download
http://www.python.org/download/
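The periodic execution itself is usually best left to the operating
system, with a cron job on Unix or a scheduled task on Windows, but if
you would rather stay inside Python, a small wrapper along the following
lines would rerun the scraper once an hour. It assumes that weather.py
sits in the current directory and that the python executable is on
your path.
#===begin run_weather.py
# A minimal sketch: rerun weather.py once an hour.
# Assumes weather.py is in the current directory and python is on the path.
import os, time

while True:
    os.system('python weather.py')    # regenerate weather.html
    time.sleep(3600)                  # wait one hour before the next run
#===end run_weather.py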
Even if you are unable to write this script yourself, I think the code
is straightforward enough that you can see how to change some of the
parameters without expert assistance. For example, you could change the
web page addresses and modify the output template, perhaps even fiddle
with the data-extraction patterns.
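For instance, adding a third city would take nothing more than a new
marker in the template and one more tuple in the list. The lines below
are only a sketch; the address and pattern are placeholders that you
would have to adapt to whatever page you choose.
# Lines that could be added to weather.py for a hypothetical third city.
# The address and pattern are placeholders, not tested values.
chicago_pattern = r'(\d+°)F'
chicago_address = 'http://example.com/chicago/weather'
chicago_marker = 'CHICAGO_TEMP'    # also add <b>CHICAGO_TEMP</b> to the template
chicago_data = (chicago_pattern, chicago_address, chicago_marker)

# ...and extend the loop to include it:
# for (pattern, address, marker) in [ny_data, la_data, chicago_data]: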
Without knowing exactly what kind of web scraping you want to do, I
imagine that for about the same price as either of the software packages I
have reviewed, you could hire a competent programmer to do the scripting
for you. Moreover, no one would have to spend time acquiring expertise
particular to iMacros or Web Scraper.
I wish you all the best with your venture.
Regards,
leapinglizard

Clarification of Answer by leapinglizard-ga on 04 Dec 2006 05:23 PST
Pascal is an enjoyable language with an elegant syntax. I, like millions
of other schoolchildren in the eighties, learned to program in Pascal, and
I can still recommend it as an instructional language. On the other hand,
if you want to get something done in short order, I believe there are more
practical options today. The great advantage of using a scripting language
such as Perl, Python, or Ruby, or a modern general-purpose programming
language such as Java, is the ready availability of software libraries
that handle much of the gruntwork for you. When it comes to industrial
tasks like parsing web pages, the built-in libraries of these languages
make quick work of what would otherwise be tedious chores.
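As a small illustration of the gruntwork that the standard library takes
off your hands, the following sketch uses nothing but Python's built-in
urllib and HTMLParser modules to list every link on a page. The address
is just an example; any page would do.
#===begin list_links.py
# A minimal sketch: list the links on a web page using only modules
# from Python's standard library.
import urllib
from HTMLParser import HTMLParser

class LinkLister(HTMLParser):
    def handle_starttag(self, tag, attrs):
        # report the href attribute of every anchor tag
        if tag == 'a':
            for name, value in attrs:
                if name == 'href':
                    print value

page = urllib.urlopen('http://www.python.org/').read()    # example address
LinkLister().feed(page)
#===end list_links.py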
I do not agree that it would help to learn JavaScript and PHP for the
purpose of web scraping. These languages are designed to function at
the other end of the transaction, generating web pages on a server
(in the case of PHP) or dynamically manipulating a page in the browser
(JavaScript). Because PHP code is executed on the server, which sends
you only the resulting HTML, you should never even see it in the web
pages you download. You will
find JavaScript sprinkled through the HTML source and you will see its
effects on-screen, but it is not suitable for doing much outside the
browser. You would not want to write a JavaScript program to process
the contents of a downloaded web page.
I try not to pressure people into learning Python, because I am aware of
my emotional bias toward it, but I must say that I find it very convenient
for web chores and for rapid script development in general. It is one of
the three major scripting languages, along with Perl and Ruby, and though
not the most concise of these three, it is surely the most readable. If
you want to learn more about Python without committing yourself, I
suggest you read a few chapters of a tutorial written by its designer.
Guido van Rossum: Python Tutorial: Chapter 1
http://docs.python.org/tut/node3.html
leapinglizard