Q: Website Spidering - $400 question - paid through a $200 question and a $200 tip (No Answer, 4 Comments)
Question  
Subject: Website Spidering - $400 question - paid through a $200 question and a $200 tip
Category: Computers
Asked by: joel1357-ga
List Price: $200.00
Posted: 30 Jul 2005 01:47 PDT
Expires: 20 Aug 2005 00:08 PDT
Question ID: 549729
I have a list of a few hundred thousand URLs in an Access database
(but I guess I could put them in any common format).

I need a piece of READY-MADE software that can

1) Look at the list of URLs
2) Go to each one and search the entire site for a certain keyword
3) Return a new list of the URLs that contain that keyword

For example, for a list of 5 sites:
http://www.nasa.com
http://www.mickeymouse.com
http://www.londonzoo.com
http://www.americanidol.com
http://www.johndoe.com

And a keyword "space shuttle"

The results would be something like:
http://www.nasa.com/spaceshuttle/page1.html
http://www.nasa.com/spaceshuttle/page2.html
http://www.nasa.com/spaceshuttle/page3.html
http://www.nasa.com/spaceshuttle/page4.html
http://www.nasa.com/spaceshuttle/page5.html
http://www.mickeymouse.com/spaceshuttle/index.html
...because they are pages that contain the phrase "space shuttle" 
somewhere on the page. Looking at file names or meta-tags is not
enough - the program needs to search entire pages.
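
Purely to illustrate the per-page check we mean - a rough sketch, not
something we have built (we want ready-made software, not code). It
assumes Python with the third-party requests and beautifulsoup4
packages, and the URL and phrase below are placeholders:

    import requests
    from bs4 import BeautifulSoup

    def page_contains(url, phrase):
        # Fetch the page; treat any network error as "no match".
        try:
            resp = requests.get(url, timeout=10)
        except requests.RequestException:
            return False
        # Search ALL the visible text, not just the file name or meta-tags.
        soup = BeautifulSoup(resp.text, "html.parser")
        return phrase.lower() in soup.get_text(" ").lower()

    # Hypothetical example from the list above:
    print(page_contains("http://www.nasa.com/spaceshuttle/page1.html", "space shuttle"))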

The program needs to use its own spidering - we don't want a
metasearch product or one that relies on search engines like Google or
Yahoo.

The spider needs to follow every internal link from the one given, and
check the entire site. So, if we gave it nasa.com, it would follow
links from the nasa.com home page, and follow links of the resulting
pages and so on until it had looked at the entire navigable site -
thousands of pages.
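
Again only as a sketch of the crawl we have in mind (same Python
assumptions as above; this version stays on the starting host and caps
itself at max_pages so it is guaranteed to stop):

    from urllib.parse import urljoin, urlparse

    import requests
    from bs4 import BeautifulSoup

    def crawl_site(start_url, phrase, max_pages=10000):
        # Breadth-first crawl of one site; returns pages whose text contains phrase.
        host = urlparse(start_url).netloc
        queue, seen, hits = [start_url], {start_url}, []
        while queue and len(seen) <= max_pages:
            url = queue.pop(0)
            try:
                resp = requests.get(url, timeout=10)
            except requests.RequestException:
                continue
            soup = BeautifulSoup(resp.text, "html.parser")
            if phrase.lower() in soup.get_text(" ").lower():
                hits.append(url)
            # Follow internal links only: same host, not seen before.
            for a in soup.find_all("a", href=True):
                link = urljoin(url, a["href"]).split("#")[0]
                if urlparse(link).netloc == host and link not in seen:
                    seen.add(link)
                    queue.append(link)
        return hits

    # e.g. crawl_site("http://www.nasa.com", "space shuttle")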

If it has Boolean-type features - e.g., search for "space shuttle" but
not "discovery" - that would be preferable.

If a program can do the job but several hundred thousand URLs is too
much for it, something that can do blocks of 1,000 URLs would be ok.
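
One last sketch, of the Boolean include/exclude test and the 1,000-URL
batching we could live with (same Python assumptions; all names are
placeholders):

    def matches(page_text, include="space shuttle", exclude="discovery"):
        # Boolean-style test: must contain the include phrase but not the exclude one.
        text = page_text.lower()
        return include in text and exclude not in text

    def batches(urls, size=1000):
        # Split the full URL list into blocks a smaller tool could handle.
        for i in range(0, len(urls), size):
            yield urls[i:i + size]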

We have found two programs that almost do what we want:
 
VelocityScape WebScraper Plus
http://www.velocityscape.com/
This looks like it will do what we require, but we are unable to get
the evaluation version to run, and are waiting to hear from their
support people.

NetTools Spider 
http://www.questronixsoftware.com/Products/NetToolsSpider/NetToolsSpider.aspx
We got the evaluation version to work, but we could not get it to do
most of the things it should.
 
Please use the above as examples of the type of product we are looking for. 
We are not after help to get those 2 programs running.
 
We do not need to copy entire sites to our hard drive, and if the
keyword search aspect requires entire sites to be downloaded and
stored, we are not interested. We believe this would require several 
terabytes and wouldn't be cost effective.
 
Price range of the product - free to $10,000

Request for Question Clarification by pafalafa-ga on 30 Jul 2005 17:47 PDT
joel1357-ga,

I believe a program called Web Data Extractor will do ALMOST
everything you need, with one exception (which may be a
deal-breaker!):


http://www.webextractor.com/index.htm


The program can take a large list of URLs from your Access database,
search each site completely, look for desired terms, use Boolean
logic, and return the URLs of the pages that match your terms.

At least, I'm pretty sure it can do all this (I used WDE on a trial
basis a while back, but I can't re-use it to double-check its
capabilities, since my trial has long expired).

The one caveat is that WDE uses multiple search engines (18
altogether, including Google, Yahoo, AltaVista, etc.) for its search
capability, rather than a built-in spider.

Does that make it useless for your needs?


Have a look at their capabilities, and let me know what you think.


Cheers,

pafalafa-ga

Clarification of Question by joel1357-ga on 01 Aug 2005 08:41 PDT
pafalafa,

The software we need absolutely has to have its own spidering
capabilities and cannot rely on search engines. We did extensive
research on our own before ever asking our questions here and actually
came across the site you referenced. Unfortunately what they offer
will not take care of our needs.

Thanks

Request for Question Clarification by pafalafa-ga on 02 Aug 2005 06:20 PDT
Joel,

I'm looking into a piece of software called Teleport Pro that seems
to get you a long way toward what you need -- I'm not yet clear
whether it can handle a large list of URLs, but it seems to have most
of the other capabilities.

Meanwhile, the same company that makes Teleport Pro also offers a
datamining service called Dataplex which might be of interest to you:


http://www.tenmax.com/dataplex/home.htm


Take a look.  I'll let you know what I find out about Teleport Pro.


paf

Request for Question Clarification by wengland-ga on 16 Aug 2005 09:53 PDT
1) How quickly do you want this index to complete?
2) How frequently will you re-run this index?
3) What platform should it run on?
4) Does it need to be commercial boxed software?
5) Can the spider be restricted to the source domain (nasa.com for
example would return only things from www.nasa.com, space.nasa.com,
etc) or do you want it to follow external links?
6) What types of documents do you wish to search?  
7) When do you want it by?
Answer  
There is no answer at this time.

Comments  
Subject: Re: Website Spidering - $400 question - paid through a $200 question and a $200 tip
From: ijazpk-ga on 05 Aug 2005 09:50 PDT
 
There is a site, http://www.firststopwebsearch.com/, where you will
find their products. I have downloaded a search program called
"FirstStop WebSearch", which is a powerful search tool much like what
you are looking for. However, this software searches via specific
search engines, not from a list of URLs as you require. I sent an
email to the company, and in reply to my query Mr. Denis Sinegubko
told me that they would include 5 search engines free of cost. I
think that if you offer them more, they would modify the software so
that you could add an unlimited number of URLs.

Another product from firststopwebsearch.com is "WebFinalist"
(http://www.webfinalist.com). This software is exactly what you want:
you can add as many URLs as you wish and search from your own list. I
have compared both programs and find FirstStop WebSearch more
powerful than WebFinalist, provided the owner modifies it according
to your requirements.

Please check both programs and, if needed, contact Mr. Denis
Sinegubko. I am quite confident that your problem will be solved.

If possible, please let us know whether you are satisfied.

Best Regards
Ijaz Ahmad
Subject: Re: Website Spidering - $400 question - paid through a $200 question and a $200 tip
From: v9seo-ga on 08 Aug 2005 13:38 PDT
 
Hi, joel1357-ga!

I have my own developments in web spidering and Internet indexing
technologies; they are used in the well-known Russian search engine
"Aport!" (www.aport.ru). I can easily and quickly adjust this package
for your purposes. All I need is an exact program specification and
some test sets. And, obviously, our agreement about pricing. ;-)

INetWorm (spider) details: it is built for the simultaneous download
of tens to hundreds of sites and is ready to process millions of
sites with millions of pages each (current limits: about
4,000,000,000 sites and about 4,000,000,000 pages per site). Download
speed is limited mainly by the speed of the Internet connection. In a
typical environment, INetWorm processes several million pages per day.

The HTML indexing engine is independent of INetWorm, but I understand
that for your purposes the two programs should be combined (so as not
to flood your hard disks with uninteresting data). That is the only
real coding I would need to do to satisfy your requirements.

Working environment: Windows NT 4.0 or any later Win32 implementation.

What do you think about all of this? I hope that we can continue
discussion for future collaboration.

Regards, Eugene V. Bondarenko,
Leading Developer of SE "Aport!".
Subject: Re: Website Spidering - $400 question - paid through a $200 question and a $200 tip
From: sgtcory-ga on 15 Aug 2005 08:15 PDT
 
I'm still not sure I follow your requirements 100%, but it seems
like all you need is a solution like this:

Mnogo Search
http://search.mnogo.ru/

The *nix variations are free, and there is a fee for the supported
Windows versions. From what I can find so far, it is scalable up to 1
million documents using certain database storage options.

SgtCory
Subject: Re: Website Spidering - $400 question - paid through a $200 question and a $200 tip
From: valhoffman-ga on 19 Aug 2005 05:16 PDT
 
I have a program I wrote myself, and it does what you want. There are options:

1) You get the program. 
2) You get the program with source code.
3) You get the program with source code and support.
4) All above + full ownership.

The choice is yours.
Will gladly satisfy your needs.

Val.
