Dear stevegill,
I base my answer on prior experience with programming and web scraping,
as well as on research I carried out specially for this question.
I have downloaded and evaluated the trial version of iMacros Scripting
Edition, which costs $499 for a single-user license. The less expensive
Pro Edition ($199) can also generate scripts for automated web scraping,
but it doesn't permit you to transfer the scripts to other users. With
the Scripting Edition, the licensed user can make a script and distribute
it to any number of additional users, who can then execute it without
paying extra fees.
iOpus: Feature Comparison Chart
http://www.iopus.com/imacros/compare/
In order to run iMacros through its paces, I planned to apply it to a
simple web scraping task in which data is collected from several web
pages and consolidated locally. Specifically, my goal was to extract the
current local temperature reported by the New York Times and by the Los
Angeles Times, and then to inject these readings into an HTML file.
New York Times: Weather
http://nytimes.com/weather
Los Angeles Times: Weather: Los Angeles
http://weather.latimes.com/US/CA/Los_Angeles.html
The iMacros package includes a souped-up web browser that lets you record
various actions, such as surfing, downloading, and data extraction. The
actions are stored as a script which you can replay at your leisure,
with or without the browser. This approach to web automation sounds fine
in theory, but in practice I found that the script recording process is
burdened by severe limitations.
I did find it possible to record the rudiments of my web scraping task
by opening the New York Times weather page in the iMacros browser,
activating the iMacros "Extract Data" feature, and clicking near the
temperature reading on the web page. This procedure extracts the snippet
68°F (20°C) Cloudy
from the web page. Alternatively, one can click directly on the
centigrade reading to extract
(20°C)
but it is impossible to select just the Fahrenheit reading due to the
HTML structure of the web page.
As a result of these actions, iMacros generates the following script in
its proprietary web-automation language.
VERSION BUILD=5211028
TAB T=1
TAB CLOSEALLOTHERS
URL GOTO=http://nytimes.com/weather
SIZE X=819 Y=764
EXTRACT POS=1 TYPE=TXT ATTR=<DIV<SP>class=summary>*
In order to store the extracted text in a local file, one would have to
manually add something like
SAVEAS TYPE=EXTRACT FOLDER=C:\scraping\extracts FILE=weather.csv
to the script. However, the iMacros language doesn't have a facility for
manipulating the extracted text in more interesting ways. To combine
the weather data from several different sources in a single file, it
would be necessary to write a plugin in a general-purpose language,
such as Java or Perl, that interfaces with the iMacros script.
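To give you a sense of what that extra layer would involve, here is a
minimal sketch in Python. It assumes that two separate iMacros scripts
have already saved their extracts with SAVEAS into the folder shown
earlier, under the hypothetical names ny_weather.csv and la_weather.csv,
and it merely merges the two snippets into one report.
#===begin merge_extracts.py
# A minimal sketch: merge snippets saved by the iMacros SAVEAS command.
# The folder matches the SAVEAS example above; the file names are
# hypothetical and would have to match your own scripts.
import csv

extract_dir = 'C:/scraping/extracts/'
sources = [('New York', 'ny_weather.csv'),
           ('Los Angeles', 'la_weather.csv')]

report = open(extract_dir + 'combined.txt', 'w')
for city, filename in sources:
    for row in csv.reader(open(extract_dir + filename)):
        if row:                                   # skip any blank lines
            report.write('%s: %s\n' % (city, row[0]))
report.close()
#===end merge_extracts.py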
The bottom line is that while iMacros provides a friendly visual interface
for automating simple surfing and downloading tasks, it cannot execute
more useful extraction duties unless the user is able to write scripts
in the iMacros language and, in addition, to write iMacros plugins in
a more powerful programming language.
In my opinion, one may as well dispense with iMacros and proceed directly
to writing a custom web scraper in the more powerful language.
VelocityScape's Web Scraper Plus+, priced at $419, does not seem any
more useful than iMacros. The whole package is poorly assembled, with
textual errors sprinkled liberally throughout its documentation and
user interface. I was able to download Web Scraper on a Windows 2000
Pro system, but I could not get any version of it to run, even after
installing the auxiliary DLL.
VelocityScape: Compare Versions
http://velocityscape.com/WebScraperCompare.aspx
From perusing the Web Scraper manual, I have learned that its web
automation is less visually oriented than that offered by iMacros, though
also more precise. One must first construct a so-called Datapage that
specifies where on the target page the desired data is to be found. This
is done by stepping through the page's HTML structure with the help of
Web Scraper's parser, ultimately specifying which line contains the data
and what tags enclose it.
This Datapage acts as a template for periodic extraction of text
snippets to a Dataset, which can then export the data automatically to a
spreadsheet file or a relational database. However, there is no provision
for manipulating the extracted and stored data in any way. The user is
left with the task of assembling the data into a useful end form. In
essence, you would be back to cutting and pasting.
I cannot recommend either of these software packages for automated web
scraping. In fact, I am not sure if it is possible to make a web scraping
tool that is very friendly and very powerful at the same time. Any visual
means of specifying target data on a web page is bound to be less precise
than direct textual access to the underlying HTML.
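To illustrate the point, consider the snippet extracted earlier, 68°F
(20°C) Cloudy. The iMacros browser could only grab it wholesale, but
with direct access to the text, a one-line regular expression isolates
just the Fahrenheit figure. The following sketch simply hard-codes the
snippet to show the principle.
#===begin fahrenheit.py
# A minimal sketch: pick out just the Fahrenheit reading from the
# snippet that iMacros could only extract as a whole.
import re

snippet = '68°F (20°C) Cloudy'            # text extracted earlier
match = re.search(r'(\d+)°F', snippet)    # capture the digits before the F
if match:
    print match.group(1)                  # prints: 68
#===end fahrenheit.py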
There is some room for improvement at the other end of the process, I
think, where the extracted data could be easily plugged into prefabricated
output templates, but neither iMacros nor Web Scraper Plus+ is capable
of this. In any case, such output templates are readily implemented in
a general-purpose scripting language.
For example, the following Python script carries out precisely the task
I attempted and failed to accomplish with iMacros and Web Scraper. It
comprises about three dozen lines of code that I wrote in less than half
an hour.
#===begin weather.py
output_filename = 'weather.html'

# Output template; the *_TEMP markers below are replaced with the
# extracted temperature readings.
template = """
<html>
<body>
<h3> NY and LA weather </h3>
<p> The temperature in New York is <b>NY_TEMP</b> Fahrenheit. </p>
<p> The temperature in Los Angeles is <b>LA_TEMP</b> Fahrenheit. </p>
</body>
</html>
"""

# For each source: a regular expression that captures the temperature,
# the address of the page, and the template marker to replace.
ny_pattern = r'(\d+°)F'
ny_address = 'http://nytimes.com/weather'
ny_marker = 'NY_TEMP'
ny_data = (ny_pattern, ny_address, ny_marker)

la_pattern = r'(\d+°)'
la_address = 'http://weather.latimes.com/US/CA/Los_Angeles.html'
la_marker = 'LA_TEMP'
la_data = (la_pattern, la_address, la_marker)

import re, urllib

for (pattern, address, marker) in [ny_data, la_data]:
    page = urllib.urlopen(address).read()    # fetch the page source
    match = re.search(pattern, page)         # look for the temperature
    if not match:
        continue                             # leave the marker untouched
    template = re.sub(marker, match.group(1), template)

open(output_filename, 'w').write(template)
#===end weather.py
This script can be executed at regular intervals with Python, which is
a completely free scripting environment. It visits the NYT and LAT
websites, grabs the temperature readings, plugs them into a little
template, and writes out the result to an HTML file that can be viewed
locally with any web browser.
Python: Download
http://www.python.org/download/
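The periodic execution itself is usually best left to the operating
system, with a cron job on Unix or a scheduled task on Windows, but if
you would rather stay inside Python, a small wrapper along the following
lines would rerun the scraper once an hour. It assumes that weather.py
sits in the current directory and that the python executable is on
your path.
#===begin run_weather.py
# A minimal sketch: rerun weather.py once an hour.
# Assumes weather.py is in the current directory and python is on the path.
import os, time

while True:
    os.system('python weather.py')    # regenerate weather.html
    time.sleep(3600)                  # wait one hour before the next run
#===end run_weather.py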
Even if you are unable to write this script yourself, I think the code
is straightforward enough that you can see how to change some of the
parameters without expert assistance. For example, you could change the
web page addresses and modify the output template, perhaps even fiddle
with the data-extraction patterns.
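For instance, adding a third city would take nothing more than a new
marker in the template and one more tuple in the list. The lines below
are only a sketch; the address and pattern are placeholders that you
would have to adapt to whatever page you choose.
# Lines that could be added to weather.py for a hypothetical third city.
# The address and pattern are placeholders, not tested values.
chicago_pattern = r'(\d+°)F'
chicago_address = 'http://example.com/chicago/weather'
chicago_marker = 'CHICAGO_TEMP'    # also add <b>CHICAGO_TEMP</b> to the template
chicago_data = (chicago_pattern, chicago_address, chicago_marker)

# ...and extend the loop to include it:
# for (pattern, address, marker) in [ny_data, la_data, chicago_data]: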
Without knowing exactly what kind of web scraping you want to do, I
imagine that for about the same price as either of the software packages I
have reviewed, you could hire a competent programmer to do the scripting
for you. Moreover, no one would have to spend time acquiring expertise
particular to iMacros or Web Scraper.
I wish you all the best with your venture.
Regards,
leapinglizard

Clarification of Answer by leapinglizard-ga on 04 Dec 2006 05:23 PST
Pascal is an enjoyable language with an elegant syntax. I, like millions
of other schoolchildren in the eighties, learned to program in Pascal, and
I can still recommend it as an instructional language. On the other hand,
if you want to get something done in short order, I believe there are more
practical options today. The great advantage of using a scripting language
such as Perl, Python, or Ruby, or a modern general-purpose programming
language such as Java, is the ready availability of software libraries
that handle much of the gruntwork for you. When it comes to industrial
tasks like parsing web pages, the built-in libraries of these languages
make quick work of what would otherwise be tedious chores.
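As a small illustration of the gruntwork that the standard library takes
off your hands, the following sketch uses nothing but Python's built-in
urllib and HTMLParser modules to list every link on a page. The address
is just an example; any page would do.
#===begin list_links.py
# A minimal sketch: list the links on a web page using only modules
# from Python's standard library.
import urllib
from HTMLParser import HTMLParser

class LinkLister(HTMLParser):
    def handle_starttag(self, tag, attrs):
        # report the href attribute of every anchor tag
        if tag == 'a':
            for name, value in attrs:
                if name == 'href':
                    print value

page = urllib.urlopen('http://www.python.org/').read()    # example address
LinkLister().feed(page)
#===end list_links.py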
I do not agree that it would help to learn JavaScript and PHP for the
purpose of web scraping. These languages are designed to function at
the other end of the transaction, generating web pages on a server
(in the case of PHP) or dynamically manipulating a page in the browser
(JavaScript). Because PHP code is executed on the server, which sends
you only the resulting HTML, you should never even see it in the web
pages you download. You will
find JavaScript sprinkled through the HTML source and you will see its
effects on-screen, but it is not suitable for doing much outside the
browser. You would not want to write a JavaScript program to process
the contents of a downloaded web page.
I try not to pressure people into learning Python, because I am aware of
my emotional bias toward it, but I must say that I find it very convenient
for web chores and for rapid script development in general. It is one of
the three major scripting languages, along with Perl and Ruby, and though
not the most concise of these three, it is surely the most readable. If
you want to learn more about Python without committing yourself, I
suggest you read a few chapters of a tutorial written by its designer.
Guido van Rossum: Python Tutorial: Chapter 1
http://docs.python.org/tut/node3.html
leapinglizard