Q: Website Spidering Software ( No Answer,   1 Comment )
Question  
Subject: Website Spidering Software
Category: Computers
Asked by: joel1357-ga
List Price: $100.00
Posted: 25 Jul 2005 22:05 PDT
Expires: 27 Jul 2005 23:45 PDT
Question ID: 547943
I have a list of a few hundred thousand URLs in an Access database
(but I guess I could put them in any common format).

I need a piece of software that can

1) Look at the list of URLs
2) Go to each one and search the entire site for a certain keyword
3) Return a new list of the URLs that contain that keyword

For example, for a list of 5 sites:
http://www.nasa.com
http://www.londonzoo.com
http://www.mickeymouse.com
http://www.americanidol.com
http://www.yahoo.com

And a keyword "space shuttle"

The results would be something like:
http://www.nasa.com/spaceshuttle/page1.html
http://www.nasa.com/spaceshuttle/page2.html
http://www.nasa.com/spaceshuttle/page3.html
http://www.nasa.com/spaceshuttle/page4.html
http://www.nasa.com/spaceshuttle/page5.html
http://www.yahoo.com/space/nasa/shuttle/index.html
...because they are pages that contain the phrase "space shuttle" 
somewhere on the page. Looking at file names or meta-tags is not
enough - the program needs to search entire pages.

The program needs to use its own spidering - we don't want a
metasearch product or one that relies on search engines like Google or
Yahoo.

The spider needs to follow every internal link from the one given, and
check the entire site. So, if we gave it nasa.com, it would follow
links from the nasa.com home page, and follow links of the resulting
pages and so on until it had looked at the entire navigable site -
thousands of pages.

If it has Boolean-type features - like searching for "space shuttle"
but not "discovery" - that would be preferable.

If a program can do the job but several hundred thousand URLs is too
much for it, something that can do blocks of 10,000 URLs would be ok.
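
To make the expected behaviour concrete, here is a rough Python sketch
of the kind of crawl described above. It is illustrative only: the file
names "urls.txt" and "matching_urls.txt", the phrases and the per-site
page cap are placeholders, and a real product would also need
robots.txt handling, politeness delays and proper error handling.

import urllib.request
import urllib.parse
from collections import deque
from html.parser import HTMLParser

KEYWORD = "space shuttle"       # phrase that must appear somewhere in the page
EXCLUDE = "discovery"           # optional phrase that must NOT appear
MAX_PAGES_PER_SITE = 1000       # placeholder safety cap on pages fetched per site

class LinkParser(HTMLParser):
    """Collects the href values of <a> tags on a page."""
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl_site(start_url, matches):
    """Follow internal links from start_url, recording pages that match."""
    host = urllib.parse.urlparse(start_url).netloc
    seen, queue, fetched = {start_url}, deque([start_url]), 0
    while queue and fetched < MAX_PAGES_PER_SITE:
        url = queue.popleft()
        fetched += 1
        try:
            with urllib.request.urlopen(url, timeout=15) as resp:
                if "html" not in resp.headers.get("Content-Type", ""):
                    continue
                page = resp.read().decode("utf-8", errors="replace")
        except Exception:
            continue
        # check the whole page text, not just the file name or meta-tags
        text = page.lower()
        if KEYWORD in text and EXCLUDE not in text:
            matches.append(url)
        parser = LinkParser()
        parser.feed(page)
        for href in parser.links:
            link = urllib.parse.urljoin(url, href).split("#")[0]
            # internal links only: stay on the same host
            if urllib.parse.urlparse(link).netloc == host and link not in seen:
                seen.add(link)
                queue.append(link)

if __name__ == "__main__":
    matches = []
    with open("urls.txt") as f:               # one start URL per line
        for line in f:
            if line.strip():
                crawl_site(line.strip(), matches)
    with open("matching_urls.txt", "w") as out:
        out.write("\n".join(matches) + "\n")

Any product that implements roughly this loop, reliably and at the
scale described, would answer the question.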

Price range - free to $5000

Number of sites we would like in an answer - at least 5, and hopefully more than 10.

Clarification of Question by joel1357-ga on 27 Jul 2005 20:04 PDT
We have found two programs that almost do what we want:
 
VelocityScape WebScraper Plus
http://www.velocityscape.com/
This looks like it will do what we require, but we are unable to get
the evaluation version to run, and are waiting to hear from their
support people.

NetTools Spider 
http://www.questronixsoftware.com/Products/NetToolsSpider/NetToolsSpider.aspx
We got the evaluation version to work, but we could not get it to do
most of the things it should.
 
Please use the above as examples of the type of product we are looking for. 
We are not after help to get those 2 programs running.
 
We do not need to copy entire sites to our hard drive, and if the
keyword search aspect requires entire sites to be downloaded and
stored, we are not interested.
 
However, we will now be happy with just one working program that can
fulfil our needs.
 
 
--------
 
And a response to the comment:
 
1. That would potentially require dozens of terabytes of hard drive space. We
do not need to copy entire sites to our hard drive, and if the keyword
search aspect requires entire sites to be downloaded and stored, we
are not interested.
 
2a. Would be nice, but not necessary
2b. Would be nice, but not necessary
Answer  
There is no answer at this time.

Comments  
Subject: Re: Website Spidering Software
From: pturing-ga on 27 Jul 2005 11:15 PDT
 
There are two questions that will determine what software you will be able to use:
1. Will the machine have sufficient disk space to download an entire
site and/or all the sites before searching them and returning the new
list?
2a. Do you need the software to behave like a real web browser, to get
at content that has been placed behind JavaScript links (in many cases
to prevent it from being mirrored)?
2b. Should the software be impolite and ignore the site's directions
for spiders in robots.txt?

If you have enough disk space, and don't need the software to be
'impolite', you may be able to do it with Free Software, possibly even
by typing a few commands.
For example:
wget (http://www.gnu.org/software/wget/wget.html) could take the list
of URLs (as a plain text file) and download them all using the options:
wget -m -np -i input_url_list
after which the files could then be searched using find and grep:
find . -type f | xargs grep -i -l "search phrase 1" > matched_keywords1_urls.txt
find . -type f | xargs grep -i -l "search phrase 2" > matched_keywords2_urls.txt
cat "matched_keywords?_urls.txt" | sort | uniq > matched_keywords_all_urls.txt
find . -type f | xargs grep -i -l -v "not phrase" >
has_no_blacklisted_keywords_urls.txt
comm -1 -2 matched_keywords_all_urls.txt has_no_blacklisted_keywords_urls.txt

All of the necessary programs for that route can be acquired in cygwin:
http://www.cygwin.com
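
One small gap in that route: grep and comm print local file paths under
the mirror directory, not URLs. Since wget -m saves each site under a
directory named after its host by default, the final list can be mapped
back to URLs with a short filter - a rough sketch only, assuming that
default layout and plain http (https sites would need the scheme
adjusted):

# paths_to_urls.py - rough sketch, assuming wget's default host-named
# directories and plain http. Reads mirrored file paths such as
# ./www.nasa.com/spaceshuttle/page1.html on stdin and prints
# http://www.nasa.com/spaceshuttle/page1.html.
import sys

for line in sys.stdin:
    path = line.strip()
    if path.startswith("./"):
        path = path[2:]
    if path:
        print("http://" + path)

Piping the output of the final comm command through this script would
produce the list of matching URLs in the form the question asks for.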
