Dear austinjavaman,
There are many programming languages that could crawl a website, extract
text, and connect with a database. These are for the most part high-level
programming languages such as C++, Java, and C#, and scripting languages
such as Python, Perl, and Ruby. The problem could also be solved in
a lower-level language such as C or Pascal, but any performance gains
would be offset by the added pain of working in a language that is more
error-prone and harder to maintain.
Since you ask for one programming language, let me argue in favor of
the high-level, dynamic, interpreted scripting language Python. Other
scripting languages would do about as well, and I shall mention reasons
why they are generally preferable to the compiled object-oriented
languages for this kind of task.
The scripting languages, of which Python is one, are high-level, dynamic,
interpreted languages. High-level means that they offer complex data
structures and powerful libraries for carrying out a broad variety of
tasks with minimal effort on the programmer's part. To make a connection
with a web URL and download the page's contents into a text file takes but
a single line in Python, Perl, Ruby, and other scripting languages. While
Java and C# are also packaged with libraries that allow the programmer
to carry out high-level web functions, not even they can accomplish such
a feat in one line.
Witness the single line of Python that writes the contents of a web page
at address URL to a file named FNAME.
open(FNAME, 'wb').write(urllib.request.urlopen(URL).read())
If you want to be pedantic about it, the urllib.request module must
first be brought in with a single import statement per program. At most,
then, the above line would have to be preceded by one additional
instruction.
import urllib.request; open(FNAME, 'wb').write(urllib.request.urlopen(URL).read())
As an open-source software product, Python can be downloaded for free
and used however the user likes, provided the terms of its open-source
license are respected.
python.org: Download Standard Python Software
http://python.org/download/
Its competitors in the scripting arena, namely Ruby and Perl, are also
freely available open-source products.
ruby-lang.org: Download Ruby
http://www.ruby-lang.org/en/20020102.html
perl.com: Downloading the Latest Version of Perl
http://www.perl.com/download.csp
Like Python, Ruby and Perl come with powerful libraries and high-level
data structures. They are also interpreted, dynamic languages, making
it easy for the programmer to inspect and modify the objects in a
program, or even the program as a whole, while it runs. I would opt
for Python instead of the other two major scripting languages because
it has a clearer syntax and a more object-oriented approach than Perl,
while offering the support of a larger user community than Ruby. These
characteristics are very helpful when it comes to debugging a script.
Scripting languages are generally well suited to the task of crawling
a web page and parsing its contents because they emerged originally as
text-processing tools. Pattern matching and substring extraction are a
pleasure in these languages, for they are in their element when asked to
chop up the raw code of a web page. Indeed, Python comes with a ready-made
HTML parser that makes it a breeze to navigate the implicit structure of
a web page. But the scripting languages are also general-purpose, making
it easy for novice and professional programmers alike to code
mathematical functions, user interfaces, and custom data structures.
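To illustrate the ready-made HTML parser mentioned above, here is a
minimal sketch using the html.parser module from the standard library
(the Python 3 module name; the class and sample HTML snippet are my own
illustration, not part of the original answer).

```python
from html.parser import HTMLParser

class LinkAndTextExtractor(HTMLParser):
    """Collect hyperlink targets and visible text from raw HTML."""

    def __init__(self):
        super().__init__()
        self.links = []       # href values of <a> tags
        self.text_parts = []  # non-empty runs of visible text

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

    def handle_data(self, data):
        stripped = data.strip()
        if stripped:
            self.text_parts.append(stripped)

parser = LinkAndTextExtractor()
parser.feed('<p>See <a href="http://python.org/">Python</a> for details.</p>')
print(parser.links)                 # → ['http://python.org/']
print(" ".join(parser.text_parts))  # → See Python for details.
```

The same subclassing pattern extends naturally to extracting whatever
tags or attributes a crawler cares about.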
When it comes to development speed, ease of use, and breadth of
application, the scripting languages are clear winners. But the
convenience and power come at a price, and the weakness of scripting
languages is their execution speed. Low-level code of the kind that is
just as easily written in a spartan language such as C, tight loops and
arithmetic in particular, can run 50 to 100 times slower in Python.
High-level code, such as that involved in text processing and file
manipulation, fares far better, since most of the work is done by
library routines implemented in C, and typically runs only 5 to 10
times slower. Another interesting limitation of Python is
the bound on the number of recursive function calls one can make. The
maximum depth of recursion is 1000 by default, and although this limit
can be raised, it stands in contrast to languages that allow recursion
to the full depth of the stack.
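The recursion limit can be inspected and changed at run time through the
sys module; the following sketch (the depth function is my own toy
example) shows a call that would exceed the default limit succeeding
once the ceiling has been raised.

```python
import sys

# The default limit is implementation-dependent, commonly 1000.
print(sys.getrecursionlimit())

def depth(n):
    """Recurse n levels deep, returning n."""
    if n == 0:
        return 0
    return 1 + depth(n - 1)

sys.setrecursionlimit(5000)  # raise the ceiling for deeper recursion
print(depth(2000))           # would raise RecursionError under the default limit
```

Note that raising the limit too far can still crash the interpreter if
the underlying C stack is exhausted, so it is a knob to turn with care.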
When it comes to the task of crawling, parsing, and storing web-page data,
however, these shortcomings should not pose any obstacle. Recursion is not
likely to be a feature of such an application. When it comes to speed,
the bottleneck will be in the network, not in the local processing. A
Python script, slow as it may be compared to an equivalent C++ or Java
program, will still process each page orders of magnitude faster than
the web connection can deliver it, so it won't be
caught flatfooted. For this reason, and thanks to the ease of debugging,
readability, and maintainability, I recommend Python as the implementation
language for your next web project.
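To show how little code the whole crawl-parse-store pipeline demands,
here is a minimal sketch using only the standard library: urllib for
fetching, html.parser for text extraction, and the bundled sqlite3
module for storage. The sample HTML string and the example.com URL are
stand-ins for a real fetched page; in a live crawler the commented
urlopen call would supply the HTML instead.

```python
import sqlite3
from html.parser import HTMLParser
from urllib.request import urlopen  # fetching is one call: urlopen(url).read()

class TextExtractor(HTMLParser):
    """Accumulate the visible text of an HTML document."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        if data.strip():
            self.parts.append(data.strip())

def extract_text(html):
    parser = TextExtractor()
    parser.feed(html)
    return " ".join(parser.parts)

def store_page(conn, url, text):
    conn.execute("CREATE TABLE IF NOT EXISTS pages (url TEXT, content TEXT)")
    conn.execute("INSERT INTO pages VALUES (?, ?)", (url, text))
    conn.commit()

# In a real crawler: html = urlopen(url).read().decode("utf-8", "replace")
sample_html = "<html><body><h1>Hello</h1><p>web page</p></body></html>"
conn = sqlite3.connect(":memory:")  # in-memory database for the demo
store_page(conn, "http://example.com/", extract_text(sample_html))
row = conn.execute("SELECT url, content FROM pages").fetchone()
print(row)  # → ('http://example.com/', 'Hello web page')
```

Swapping the in-memory database for a file, or sqlite3 for another
database driver, changes only the connect call.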
I have enjoyed replying to your question about the possibilities and
limitations of programming languages. If you should feel that any part
of my answer requires correction or elaboration, do let me know through
a Clarification Request so that I have a chance to fully meet your needs
before you assign a rating.
Regards,
leapinglizard