This is my first question, really, so I'm just exploring, but the
question is real. Very real.
I'm trying to locate an efficient method of spidering complete
websites.
I'm talking about hundreds of thousands of websites with literally
millions of files. The content needs to be mirrored LOCALLY,
preferably on a Windows OS. I should be able to define a list of URLs
to go through (with restrictions such as timeouts, level depth and
file types). Mainly, I would need to spider all text files
(txt, htm*, asp, php...), image files (bmp, jpg, gif, png) and rich
content files such as pdf, rtf, doc, xls, ppt.
Please do not refer me to offline explorer programs such as
Teleporter or BlackWidow, as they will not be able to manage this
amount of data.
Also, ASPSEEK is no good for me; I've tried that.
Let me know if you need more info.
Thanks.
Request for Question Clarification by wengland-ga on 23 Jun 2002 15:44 PDT
Do you have a budget limit?
Are you able / willing to move to a Unix platform?
How many machines will this run on?
What kind of incoming bandwidth do you have?
Do you need to mirror the data, or simply keep a searchable index?
Thanks for the additional information!
Request for Question Clarification by aditya2k-ga on 23 Jun 2002 16:12 PDT
Hi mluggy,
Please have a look at http://www.jocsoft.com/jws/ and tell me if it
is of any use to you. If so, then I will answer this question formally
with more such programs.
Answer by runix-ga
Hello!
The definitive answer to your question is 'wget'.
Wget is a Unix program that recursively downloads a site. Now, there's
a Win32 version of Wget!
Wget doesn't have as nice a frontend as Teleport or BlackWidow: it just
does what it has to do, and it does it well.
You can download from HTTP or FTP, and you can specify level depths,
file types and domains.
It supports background operation (it runs completely unattended), HTTP
cookies, resuming of cancelled downloads, etc.
Plus, it's free and open source :)
I think it's definitely what you're looking for.
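For example (just a sketch; the host name is a placeholder and the
numbers would need tuning for a real job), a single command that
limits depth and file types and runs unattended in the background
could look like this:
wget -r -l 3 -A htm,html,jpg,gif,png -T 30 -b http://www.example.com/
Here -r turns on recursion, -l limits the link depth, -A restricts the
accepted file extensions, -T sets the network timeout in seconds, and
-b sends wget to the background, logging to a file.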
The main page of the project is:
http://www.gnu.org/software/wget/wget.html
The main page of the Windows port is
http://space.tin.it/computer/hherold/
You can download it from here:
ftp://sunsite.dk/projects/wget/windows/
ftp://sunsite.dk/projects/wget/windows/wget20020612b.zip
There's also a Windows frontend (it uses the same wget program
underneath, but adds a graphical interface):
http://www.jensroesner.de/wgetgui/
You can see a screenshot here
http://www.jensroesner.de/wgetgui/#screen
Good luck!
Request for Answer Clarification by mluggy-ga on 24 Jun 2002 00:07 PDT
From the product documentation:
http://www.gnu.org/manual/wget/html_mono/wget.html
--spider
When invoked with this option, Wget will behave as a Web spider, which
means that it will not download the pages, just check that they are
there. You can use it to check your bookmarks, e.g. with:
wget --spider --force-html -i bookmarks.html
This feature needs much more work for Wget to get close to the
functionality of real WWW spiders.
I'm looking for a "real" WWW spider. wget, Teleporter and JWS, which
were suggested here, are all billed as "download explorers" or
"offline browsers".
Thanks.
Clarification of Answer by runix-ga on 24 Jun 2002 10:31 PDT
I guess that by 'real spider' you mean 'recursively download
everything on a website'; well, wget does that with the '-r'
(recursive) option.
If this is not what you have in mind when you ask for a 'real spider',
please tell me what your requirements are, and I will search again!
Request for Answer Clarification by mluggy-ga on 25 Jun 2002 05:10 PDT
By real spider I mean a program, code or application that was designed
to run as a web crawler, and as such opens many internet connections
and recursively downloads (and mirrors) complete websites.
By real spider, I also mean efficient code that can handle a large
number of websites and links.
Thanks!
Clarification of Answer by runix-ga on 25 Jun 2002 06:13 PDT
Well, wget does that!
For example:
To recursively download http://www.gnu.org (only accepting .html,
.htm, .gif, .jpg and .png files):
wget -r -A html,htm,gif,jpg,png http://www.gnu.org
To download ALL the files needed for a web page to display properly
(this includes such things as inlined images, sounds, and referenced
stylesheets):
wget -r --page-requisites http://www.gnu.org
To recursively download a site without going out to other domains
(don't follow links that are outside this domain):
wget -r -D www.freshmeat.net http://www.freshmeat.net
To stop downloading once a 2 MB quota is exceeded:
wget -r -Q2m http://www.freshmeat.net
To save the server headers with the file, perhaps for post-processing:
wget -s http://www.lycos.com/
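Putting several of these options together for the kind of job
described in the question (urls.txt and the mirror directory are only
illustrative names), a run over a prepared list of start URLs might
look like:
wget -r -l 10 -T 30 -A txt,htm,html,asp,php,bmp,jpg,gif,png,pdf,rtf,doc,xls,ppt -i urls.txt -P mirror
-i reads the start URLs from a file and -P sets the local directory
the files are written into; the -A list mirrors the file types listed
in the original question.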
wget doesn't support parallel processing, but since you're
downloading thousands of sites, you can launch several wget processes
(to download different sites) at the same time; it will have the same
effect.
Plus, wget can make its requests using different IPs, which means that
if you have more than one internet connection, you can distribute the
work between them.
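As a rough sketch of that idea on Windows (the sites*.txt file names
and the IP addresses are only placeholders), a small batch file can
fan the work out across several wget processes:
rem each list file holds a share of the sites to mirror
start wget -r -l 5 -i sites1.txt -o log1.txt
start wget -r -l 5 -i sites2.txt -o log2.txt
start wget -r -l 5 -i sites3.txt --bind-address=192.168.0.2 -o log3.txt
start wget -r -l 5 -i sites4.txt --bind-address=192.168.0.3 -o log4.txt
Each 'start' launches an independent wget process, so the lists are
fetched in parallel; --bind-address makes a process send its requests
from a specific local IP when more than one connection is available.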
good luck!
Clarification of Answer by runix-ga on 25 Jun 2002 07:01 PDT
Please give wget a try, and if it doesn't meet your requirements I'll
search for another one.
Try it! I'm 99% sure that it will work for you!
best regards
runix
Request for Answer Clarification by mluggy-ga on 25 Jun 2002 07:49 PDT
Is there something else I need to do in order to release the money?
This is my first time, as you may know.
I will try wget soon.
>and if it doesn't meet your requirements I'll
>search for another one.
and I might take you up on this ;-)
Michael.
Clarification of Answer by runix-ga on 25 Jun 2002 10:37 PDT
I think your money is on its way now.
Just to let you know, if you don't like the answer you can ask
for a refund [ https://answers.google.com/answers/faq.html#refund ].
And this is my personal view:
If you don't like the answer, or you think it is not what you're
looking for, you should ask for clarifications instead of ranking the
question low.
This will help both you and the researcher: you will have your
answer (or at least somewhere to start), and the GA researcher will
understand exactly what you wanted to know and get a better rating.
Good luck!
Please ask for clarifications if you need help with wget or if you
want me to search for other programs.