I have a list of a few hundred thousand URLs in an Access database
(though I could export them to any common format).
I need a piece of READY-MADE software that can
1) Look at the list of URLs
2) Go to each one and search the entire site for a certain keyword
3) Return a new list of the URLs that contain that keyword
For example, for a list of 5 sites:
http://www.nasa.com
http://www.mickeymouse.com
http://www.londonzoo.com
http://www.americanidol.com
http://www.johndoe.com
And a keyword "space shuttle"
The results would be something like:
http://www.nasa.com/spaceshuttle/page1.html
http://www.nasa.com/spaceshuttle/page2.html
http://www.nasa.com/spaceshuttle/page3.html
http://www.nasa.com/spaceshuttle/page4.html
http://www.nasa.com/spaceshuttle/page5.html
http://www.mickeymouse.com/spaceshuttle/index.html
...because they are pages that contain the phrase "space shuttle"
somewhere on the page. Looking at file names or meta-tags is not
enough - the program needs to search entire pages.
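To be concrete about "entire pages": the test we mean runs against the
visible text of a fetched page. A rough do-it-yourself sketch (Python,
standard library only; the regex tag strip is a crude illustration, not
how a real product would parse HTML):

    import re
    import urllib.request

    def page_contains(url, phrase):
        # Fetch one page and test its visible text for the phrase.
        try:
            with urllib.request.urlopen(url, timeout=10) as resp:
                html = resp.read().decode("utf-8", errors="replace")
        except Exception:
            return False  # an unreachable page simply doesn't match
        text = re.sub(r"<[^>]+>", " ", html)  # crude tag strip
        return phrase.lower() in text.lower()

The page's file name and meta-tags never enter into it - only the text.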
The program needs to use its own spidering - we don't want a
metasearch product or one that relies on search engines like Google or
Yahoo.
The spider needs to follow every internal link from the one given, and
check the entire site. So, if we gave it nasa.com, it would follow
links from the nasa.com home page, and follow links of the resulting
pages and so on until it had looked at the entire navigable site -
thousands of pages.
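To make the expected traversal unambiguous, here is our own rough
sketch of it (Python 3, standard library only; crawl_site, the page
cap, and the ten-second timeout are just illustrative assumptions, and
a real product would add robots.txt handling, politeness delays, error
logging, and so on):

    import re
    import urllib.request
    from collections import deque
    from html.parser import HTMLParser
    from urllib.parse import urldefrag, urljoin, urlparse

    class LinkParser(HTMLParser):
        # Collect the href targets of <a> tags on a page.
        def __init__(self):
            super().__init__()
            self.links = []
        def handle_starttag(self, tag, attrs):
            if tag == "a":
                for name, value in attrs:
                    if name == "href" and value:
                        self.links.append(value)

    def crawl_site(start_url, phrase, max_pages=10000):
        # Breadth-first crawl from start_url, never leaving its host;
        # returns the URLs of pages whose text contains the phrase.
        host = urlparse(start_url).netloc
        seen, queue, matches = {start_url}, deque([start_url]), []
        visited = 0
        while queue and visited < max_pages:
            url = queue.popleft()
            visited += 1
            try:
                with urllib.request.urlopen(url, timeout=10) as resp:
                    if "html" not in resp.headers.get("Content-Type", ""):
                        continue
                    html = resp.read().decode("utf-8", errors="replace")
            except Exception:
                continue  # dead links, timeouts, etc. are skipped
            # Crude tag strip so we test visible text, not markup.
            if phrase.lower() in re.sub(r"<[^>]+>", " ", html).lower():
                matches.append(url)
            parser = LinkParser()
            parser.feed(html)
            for link in parser.links:
                # Resolve relative links; drop #fragments to avoid
                # queueing the same page twice.
                absolute, _ = urldefrag(urljoin(url, link))
                # Internal links only: stay on the starting host.
                if urlparse(absolute).netloc == host and absolute not in seen:
                    seen.add(absolute)
                    queue.append(absolute)
        return matches

    # e.g. for hit in crawl_site("http://www.nasa.com", "space shuttle"):
    #          print(hit)

Note that nothing in such a scheme is written to disk: each page is
examined in memory and discarded, and only the matching URLs are kept.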
If it has Boolean-type features - for example, search for "space
shuttle" but not "discovery" - that would be preferable.
If a program can do the job but several hundred thousand URLs is too
much for it, something that can work in blocks of 1,000 URLs would be
OK.
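Splitting the list up that way is easy on our end. A sketch, assuming
the Access table has been exported to a one-URL-per-line text file
(urls.txt is a made-up name):

    with open("urls.txt") as f:
        urls = [line.strip() for line in f if line.strip()]
    # Hand the program one block of 1,000 start URLs at a time.
    for i in range(0, len(urls), 1000):
        batch = urls[i:i + 1000]  # ...process this block...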
We have found two programs that almost do what we want:
VelocityScape WebScraper Plus
http://www.velocityscape.com/
This looks like it will do what we require, but we are unable to get
the evaluation version to run, and are waiting to hear from their
support people.
NetTools Spider
http://www.questronixsoftware.com/Products/NetToolsSpider/NetToolsSpider.aspx
We got the evaluation version to work, but we could not get it to do
most of the things it should.
Please use the above as examples of the type of product we are looking for.
We are not after help to get those 2 programs running.
We do not need to copy entire sites to our hard drive, and if the
keyword-search aspect requires entire sites to be downloaded and
stored, we are not interested. We believe this would require several
terabytes and wouldn't be cost-effective.
Price range of the product - free to $10,000.

Request for Question Clarification by pafalafa-ga on 30 Jul 2005 17:47 PDT
joel1357-ga,
I believe a program called Web Data Extractor will do ALMOST
everything you need, with one exception (which may be a
deal-breaker!):
http://www.webextractor.com/index.htm
The program can access a large list of URLs from your Access database,
search each site completely, look for desired terms, use Boolean
logic, and return the URLs of pages that match your terms.
At least, I'm pretty sure it can do all this (I used WDE on a trial
basis a while back, but I can't re-use it to double-check its
capabilities, since my trial has long expired).
The one caveat is that WDE uses multiple search engines (18
altogether, such as Google, Yahoo, AltaVista, etc.) for its search
capability, rather than a built-in spider.
Does that make it useless for your needs?
Have a look at their capabilities, and let me know what you think.
Cheers,
pafalafa-ga

Clarification of Question by joel1357-ga on 01 Aug 2005 08:41 PDT
pafalafa,
The software we need absolutely has to have its own spidering
capabilities and cannot rely on search engines. We did extensive
research on our own before ever asking our questions here, and we
actually came across the site you referenced. Unfortunately, what they
offer will not take care of our needs.
Thanks

Request for Question Clarification by pafalafa-ga on 02 Aug 2005 06:20 PDT
Joel,
I'm looking into a piece of software called Teleport Pro, which seems
to get you a long way toward what you need. I'm not clear (yet)
whether it can handle a large list of URLs, but it seems to have most
of the other capabilities.
Meanwhile, the same company that makes Teleport Pro also offers a
data-mining service called Dataplex, which might be of interest to you:
http://www.tenmax.com/dataplex/home.htm
Take a look. I'll let you know what I find out about Teleport Pro.
paf

Request for Question Clarification by wengland-ga on 16 Aug 2005 09:53 PDT
1) How quickly do you want this index to complete?
2) How frequently will you re-run this index?
3) What platform should it run on?
4) Does it need to be commercial boxed software?
5) Can the spider be restricted to the source domain (nasa.com, for
example, would return only things from www.nasa.com, space.nasa.com,
etc.; see the sketch after this list), or do you want it to follow
external links?
6) What types of documents do you wish to search?
7) When do you want it by?
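(To pin down what I mean by the source-domain restriction in question
5, a quick sketch; "nasa.com" is just your example:)

    from urllib.parse import urlparse

    def same_source_domain(url, domain="nasa.com"):
        # True for nasa.com itself and any subdomain, e.g. space.nasa.com.
        host = urlparse(url).hostname or ""
        return host == domain or host.endswith("." + domain)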