Q: Programming Languages: Possibilities and Limitations (Answered, 5 out of 5 stars, 2 Comments)
Question  
Subject: Programming Languages: Possibilities and Limitations
Category: Computers > Programming
Asked by: austinjavaman-ga
List Price: $30.00
Posted: 28 Oct 2004 12:05 PDT
Expires: 27 Nov 2004 11:05 PST
Question ID: 421296
I would like to know what programming language would be necessary to
develop a program capable of spidering a particular website,
extracting specific data (e.g. product information, title, etc.) and
then parsing that information into a database.  What are the
advantages and disadvantages of this particular programming language?
Are there any particular limits on what is possible with this
particular language?  In terms of cost, speed, ease of use, and
universality, what languages are most useful for this type of
application?
Answer  
Subject: Re: Programming Languages: Possibilities and Limitations
Answered By: leapinglizard-ga on 30 Oct 2004 21:18 PDT
Rated: 5 out of 5 stars
 
Dear austinjavaman,

There are many programming languages that could crawl a website, extract
text, and connect with a database. These are for the most part high-level
programming languages such as C++, Java, and C#, and scripting languages
such as Python, Perl, and Ruby. The problem could also be solved in
a lower-level language such as C or Pascal, but any performance gains
would be offset by the added pain of working in an error-prone, poorly
maintainable low-level language.

Since you ask for one programming language, let me argue in favor of
the high-level, dynamic, interpreted scripting language Python. Other
scripting languages would do about as well, and I shall mention reasons
why they are generally preferable to the compiled object-oriented
languages for this task.

The scripting languages, of which Python is one, are high-level, dynamic,
interpreted languages. High-level means that they offer complex data
structures and powerful libraries for carrying out a broad variety of
tasks with minimal effort on the programmer's part. To make a connection
with a web URL and download the page's contents into a text file takes but
a single line in Python, Perl, Ruby, and other scripting languages. While
Java and C# are also packaged with libraries that allow the programmer
to carry out high-level web functions, not even they can accomplish such
a feat in one line.

Witness the single line of Python that writes the contents of a web page
at address URL to a file named FNAME.

open(FNAME, 'w').write(urllib.urlopen(URL).read())

If you want to be pedantic about it, the use of the urllib module requires
a single import statement in each program that uses it. At the most,
then, the above line would have to be preceded by one additional
instruction.

import urllib; open(FNAME, 'w').write(urllib.urlopen(URL).read())
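For readers on a current interpreter, here is a minimal sketch of the same idea in Python 3, where the urlopen function lives at urllib.request.urlopen rather than urllib.urlopen. The helper name save_page is only an illustration, and the file is opened in binary mode on the assumption that the raw bytes of the page should be stored unchanged.

```python
import urllib.request


def save_page(url, fname):
    """Download the resource at `url` and write its raw bytes to `fname`."""
    # Both the HTTP response and the output file are closed automatically.
    with urllib.request.urlopen(url) as response, open(fname, "wb") as out:
        out.write(response.read())
```

A call such as save_page(URL, FNAME) then does the work of the one-liner, at the cost of a few extra lines that make sure the connection and the file are closed.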

As an open-source software product, Python can be downloaded for free
and used however the user likes as long as the open-source licensing is
not compromised.

python.org: Download Standard Python Software
http://python.org/download/

Its competitors in the scripting arena, namely Ruby and Perl, are also
freely available open-source products.

ruby-lang.org: Download Ruby
http://www.ruby-lang.org/en/20020102.html

perl.com: Downloading the Latest Version of Perl
http://www.perl.com/download.csp

Like Python, Ruby and Perl come with powerful libraries and high-level
data structures. They are also interpreted, dynamic languages, making
it easy for the programmer to inspect and modify the objects in a
program, or even the program as a whole, while it runs. I would opt
for Python instead of the other two major scripting languages because
it has a clearer syntax and more object-oriented approach than Perl,
while offering the support of a larger user community than Ruby. These
characteristics are very helpful when it comes to debugging a script.

Scripting languages are generally well suited to the task of crawling
a web page and parsing its contents because they emerged originally as
text-processing tools. Matching patterns and extracting substrings is a
pleasure with these languages, for they are in their element when asked to
chop up the raw code of a web page. Indeed, Python comes with a ready-made
HTML parser that makes it a breeze to navigate the implicit structure of 
a web page. But the scripting languages are also universally applicable,
making it easy for programmers novice and professional alike to code
mathematical functions, user interfaces, and custom data structures.
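To give a feel for the ready-made HTML parser, here is a short sketch that pulls the page title and the link targets out of a page. It assumes a current Python 3, where the standard-library parser is html.parser.HTMLParser; the class name TitleAndLinkParser and the function extract_title_and_links are hypothetical names chosen for this example.

```python
from html.parser import HTMLParser


class TitleAndLinkParser(HTMLParser):
    """Collects the text of <title> and the href of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.links = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lowercased; attrs is a list of (name, value) pairs.
        if tag == "title":
            self._in_title = True
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        # Text may arrive in several pieces, so accumulate it.
        if self._in_title:
            self.title += data


def extract_title_and_links(html_text):
    """Return (title, list_of_hrefs) for one page of HTML text."""
    parser = TitleAndLinkParser()
    parser.feed(html_text)
    parser.close()
    return parser.title.strip(), parser.links
```

Product names, prices, and similar fields could be collected the same way by watching for whatever tags or attributes the target site happens to use.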

When it comes to speed of development, ease of use, and universality,
the scripting languages are clear winners. But the convenience and power
come at a price, and the weakness of scripting languages is their execution
speed. Lower-level code of the kind that is just as easily written in a
spartan language such as C runs 5 to 10 times slower in Python. High-level
code, such as that involved in text processing and file manipulation,
is 50 to 100 times slower. Another interesting limitation of Python is
the bound on the number of recursive function calls one can make. The
maximum depth of recursion is 1000 by default, and although this can be
altered, it is a great change from languages that allow recursion depth
to the full extent of the stack size.
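The recursion bound can be inspected and changed through the sys module. The sketch below assumes a current Python 3, where exceeding the bound raises RecursionError; the functions depth and recursion_overflows are hypothetical helpers written only for this demonstration.

```python
import sys


def depth(n):
    """Recurse n levels deep and return n; deliberately not iterative."""
    return 0 if n == 0 else 1 + depth(n - 1)


def recursion_overflows(n):
    """Report whether recursing n levels deep exceeds the current limit."""
    try:
        depth(n)
        return False
    except RecursionError:
        return True


if __name__ == "__main__":
    print("current limit:", sys.getrecursionlimit())
    print("3000 levels overflow?", recursion_overflows(3000))
    sys.setrecursionlimit(10000)  # raise the bound for deeper recursion
    print("after raising the limit:", recursion_overflows(3000))
```

Raising the limit far beyond its default is at the programmer's own risk, since the interpreter can still run out of real stack space.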

When it comes to the task of crawling, parsing, and storing web-page data,
however, these shortcomings should not pose any obstacle. Recursion is not
likely to be a feature of such an application. When it comes to speed,
the bottleneck will be in the network, not in the local processing. A
Python script, as slow as it may be compared to the equivalent C++
or Java program, will still be faster by orders of magnitude than the
web connection through which web pages are downloaded, so it won't be
caught flatfooted. For this reason, and thanks to the ease of debugging,
readability, and maintainability, I recommend Python as the implementation
language for your next web project.
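To make the crawl, parse, and store steps concrete, here is a rough sketch under several assumptions of my own: Python 3, the standard-library sqlite3 module as the database, only the page title as the extracted field, and a crawl restricted to the starting host. The names PageParser, fetch, and crawl are hypothetical. Note that it uses a queue rather than recursion, so the recursion bound never comes into play.

```python
import sqlite3
import urllib.request
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class PageParser(HTMLParser):
    """Collects the <title> text and all href targets of one page."""

    def __init__(self):
        super().__init__()
        self.title, self.links, self._in_title = "", [], False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def fetch(url):
    """Default fetcher: download a page and decode it leniently."""
    with urllib.request.urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")


def crawl(start_url, db, fetch_page=fetch, max_pages=50):
    """Breadth-first crawl of one host; stores (url, title) rows in `db`."""
    db.execute(
        "CREATE TABLE IF NOT EXISTS pages (url TEXT PRIMARY KEY, title TEXT)"
    )
    host = urlparse(start_url).netloc
    queue, seen = deque([start_url]), {start_url}
    stored = 0
    while queue and stored < max_pages:
        url = queue.popleft()
        try:
            html_text = fetch_page(url)
        except OSError:
            continue  # skip pages that fail to download
        parser = PageParser()
        parser.feed(html_text)
        db.execute(
            "INSERT OR REPLACE INTO pages VALUES (?, ?)",
            (url, parser.title.strip()),
        )
        stored += 1
        for href in parser.links:
            link = urljoin(url, href).split("#")[0]  # absolute, no fragment
            if urlparse(link).netloc == host and link not in seen:
                seen.add(link)
                queue.append(link)
    db.commit()
    return stored
```

It could be run with something like crawl(URL, sqlite3.connect(FNAME)). The fetch_page parameter exists so the crawler can be exercised against canned pages without touching the network; a real spider would also want politeness delays and should respect the site's crawling rules.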

I have enjoyed replying to your question about the possibilities and
limitations of programming languages. If you should feel that any part
of my answer requires correction or elaboration, do let me know through
a Clarification Request so that I have a chance to fully meet your needs
before you assign a rating.

Regards,

leapinglizard
austinjavaman-ga rated this answer: 5 out of 5 stars
Excellent!  Very thorough and articulate answer helped get me jump
started on my project.  Thank you for your help!

Comments  
Subject: Re: Programming Languages: Possibilities and Limitations
From: obten-ga on 28 Oct 2004 21:22 PDT
 
I don't know exactly which language, but you should take into account the
stack required for recursion. Typical programming languages are
implemented with a limited one, so it is important to take
care over the quality of the program (read about CPS). Some library
for searching with regular expressions is also helpful.
Speed depends more on the complexity of the algorithm than on the
programming language.
Ease of use? I would say ease of maintenance. It depends on the
education of the programmer, who needs to write clean, elegant,
efficient, well-documented, and easy-to-modify code, not on the
programming language.
A program written by trial and error using a paradigm the programmer
does not understand well (like object orientation) may be very difficult
to maintain and use, inefficient, and full of hard-to-find errors (bugs)
such as stack overflows, memory overflows, and bad concurrency (which
does not seem to be needed for spidering websites).

Good luck!
Subject: Re: Programming Languages: Possibilities and Limitations
From: digitaltechnic-ga on 09 Nov 2004 21:31 PST
 
I could implement this in Delphi if you're open to such.
