I have a list of a few hundred thousand URLs in an Access database
(but I guess I could put them in any common format).
I need a piece of software that can
1) Look at the list of URLs
2) Go to each one and search the entire site for a certain keyword
3) Return a new list of the URLs that contain that keyword
For example, for a list of 5 sites:
http://www.nasa.com
http://www.londonzoo.com
http://www.mickeymouse.com
http://www.americanidol.com
http://www.yahoo.com
And a keyword "space shuttle"
The results would be something like:
http://www.nasa.com/spaceshuttle/page1.html
http://www.nasa.com/spaceshuttle/page2.html
http://www.nasa.com/spaceshuttle/page3.html
http://www.nasa.com/spaceshuttle/page4.html
http://www.nasa.com/spaceshuttle/page5.html
http://www.yahoo.com/space/nasa/shuttle/index.html
...because they are pages that contain the phrase "space shuttle"
somewhere on the page. Looking at file names or meta-tags is not
enough - the program needs to search entire pages.
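To illustrate what we mean by searching the entire page rather than file names or meta-tags, here is a rough Python sketch. It assumes the page's HTML has already been fetched as a string; the names page_text and page_contains are made up for this example and are not from any product.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects the text content of a page, skipping script and style blocks."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        # Attribute values (e.g. meta keywords) are deliberately ignored.
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)


def page_text(html):
    """Return the page's text with runs of whitespace collapsed to one space."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.chunks).split())


def page_contains(html, phrase):
    """Case-insensitive check for a phrase anywhere in the page text."""
    needle = " ".join(phrase.lower().split())
    return needle in page_text(html).lower()
```

With this approach a page that mentions "space shuttle" only in a meta keywords tag would not count as a hit, while a page with the phrase in its body text would.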
The program needs to use its own spidering - we don't want a
metasearch product or one that relies on search engines like Google or
Yahoo.
The spider needs to follow every internal link from the one given, and
check the entire site. So, if we gave it nasa.com, it would follow
links from the nasa.com home page, and follow links of the resulting
pages and so on until it had looked at the entire navigable site -
thousands of pages.
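The spidering behaviour we have in mind could be sketched roughly as below, in Python. This is only an assumption about how such a tool might work, not a description of any existing product: crawl_site, fetch_html and the max_pages cap are names we invented, and a real spider would also need politeness delays, robots.txt handling and better error reporting. Pages are checked in memory and discarded, so nothing is stored on disk.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse
from urllib.request import urlopen


class _LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def fetch_html(url):
    """Download one page and return it as text (assumes UTF-8 is acceptable)."""
    with urlopen(url, timeout=10) as resp:
        return resp.read().decode("utf-8", errors="replace")


def crawl_site(start_url, is_match, fetch=fetch_html, max_pages=10000):
    """Breadth-first crawl of one site; returns URLs whose HTML satisfies is_match."""
    host = urlparse(start_url).netloc
    queue = deque([start_url])
    seen = {start_url}
    matches = []
    fetched = 0
    while queue and fetched < max_pages:
        url = queue.popleft()
        try:
            html = fetch(url)
        except Exception:
            continue  # unreachable or broken page: skip it
        fetched += 1
        if is_match(html):
            matches.append(url)
        parser = _LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            link = urldefrag(urljoin(url, href)).url  # absolute URL, no #fragment
            # Only follow internal links, i.e. links on the same host.
            if urlparse(link).netloc == host and link not in seen:
                seen.add(link)
                queue.append(link)
    return matches
```

For one of the sites in our example, a call might look like crawl_site("http://www.nasa.com", lambda html: "space shuttle" in html.lower()). That lambda checks the raw HTML, which is cruder than we want; a proper check would look at the page text only.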
If it has Boolean-type features like - search for "space shuttle" but
not "discovery", that would be preferable.
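As a small illustration of the Boolean-type matching we mean, here is a Python sketch. The function name matches_query and its include/exclude parameters are hypothetical, and it assumes the page text has already been extracted as a plain string.

```python
def matches_query(text, include, exclude=()):
    """True if text contains every phrase in include and none in exclude.

    Matching is case-insensitive and ignores differences in whitespace.
    """
    def normalise(value):
        return " ".join(value.lower().split())

    haystack = normalise(text)
    has_all = all(normalise(p) in haystack for p in include)
    has_none = not any(normalise(p) in haystack for p in exclude)
    return has_all and has_none
```

So matches_query(text, ["space shuttle"], ["discovery"]) would accept a page about the space shuttle Atlantis and reject one about the space shuttle Discovery.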
If a program can do the job but several hundred thousand URLs is too
much for it, something that can do blocks of 10,000 URLs would be ok.
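Working in blocks could be as simple as the following Python sketch. It assumes the URLs have been exported from the database into any iterable, such as the lines of a text file; the name batches is made up for this example.

```python
from itertools import islice


def batches(urls, size=10000):
    """Yield successive lists of up to size URLs without loading them all at once."""
    iterator = iter(urls)
    while True:
        block = list(islice(iterator, size))
        if not block:
            return
        yield block
```

Each block could then be handed to the spider in turn, so a few hundred thousand URLs would become a few dozen runs of 10,000.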
Price range - free to $5000
Number of sites we would like in an answer - at least 5, and hopefully more than 10.
Clarification of Question by joel1357-ga on 27 Jul 2005 20:04 PDT
We have found two programs that almost do what we want:
VelocityScape WebScraper Plus
http://www.velocityscape.com/
This looks like it will do what we require, but we are unable to get
the evaluation version to run, and are waiting to hear from their
support people.
NetTools Spider
http://www.questronixsoftware.com/Products/NetToolsSpider/NetToolsSpider.aspx
We got the evaluation version to work, but we could not get it to do
most of the things it should.
Please use the above as examples of the type of product we are looking for.
We are not after help to get those 2 programs running.
We do not need to copy entire sites to our hard drive, and if the
keyword search aspect requires entire sites to be downloaded and
stored, we are not interested.
However, we will now be happy with just one working program that can
fulfil our needs.
--------
And a response to the comment:
1. That would potentially require dozens of terabytes of hard drive space. We
do not need to copy entire sites to our hard drive, and if the keyword
search aspect requires entire sites to be downloaded and stored, we
are not interested.
2a Would be nice, but not necessary
2b Would be nice, but not necessary