Q: Looking for a spidering solution (Answered, 2 out of 5 stars, 5 Comments)
Question  
Subject: Looking for a spidering solution
Category: Computers > Internet
Asked by: mluggy-ga
List Price: $15.00
Posted: 23 Jun 2002 15:36 PDT
Expires: 23 Jul 2002 15:36 PDT
Question ID: 32044
This is my first question, really, so I'm just exploring, but the
question is real. Very real.

I'm trying to locate an efficient method of spidering complete
websites.
I'm talking about hundreds of thousands of websites with literally
millions of files. The content needs to be mirrored LOCALLY, preferably
on a Windows OS. I should be able to define a list of URLs to go
through (with restrictions such as timeouts, level depth and
file types). Mainly, I would need to spider all text files
(txt, htm*, asp, php...), image files (bmp, jpg, gif, png) and rich content
files such as pdf, rtf, doc, xls, ppt.

Please do not refer me to offline explorer programs such as
Teleport or BlackWidow, as they will not be able to manage this
amount of data.
Also, ASPseek is no good for me; I've tried that.

Let me know if you need more info.

Thanks.

Request for Question Clarification by wengland-ga on 23 Jun 2002 15:44 PDT
Do you have a budget limit?  

Are you able / willing to move to a Unix platform?

How many machines will this run on?

What kind of incoming bandwidth do you have?

Do you need to mirror the data, or simply keep a searchable index?

Thanks for the additional information!

Request for Question Clarification by aditya2k-ga on 23 Jun 2002 16:12 PDT
Hi mluggy,


   Please have a look at http://www.jocsoft.com/jws/ and tell me if it
is of any use to you. If so, then I will answer this question formally
with more such programs.
Answer  
Subject: Re: Looking for a spidering solution
Answered By: runix-ga on 23 Jun 2002 16:25 PDT
Rated: 2 out of 5 stars
 
Hello!

The definitive answer to your question is 'wget'.

Wget is a Unix program that recursively downloads a site. Now, there's
a Win32 version of Wget!

Wget doesn't have a nice frontend like Teleport or BlackWidow; it just
does what it has to do, and it does it well.

You can download over HTTP or FTP, and you can specify level depths,
file types and domains.
It supports background operation (it runs completely unattended), HTTP
cookies, resuming cancelled downloads, etc.

Plus, it's free and open source :)

I think it's definitely what you're looking for.

The main page of the project is:

http://www.gnu.org/software/wget/wget.html

The main page of the Windows port is 

http://space.tin.it/computer/hherold/

You can download it from here:

ftp://sunsite.dk/projects/wget/windows/
ftp://sunsite.dk/projects/wget/windows/wget20020612b.zip

There's also a Windows frontend (it uses the same wget program, but
adds a nice GUI):

http://www.jensroesner.de/wgetgui/

You can see a screenshot here
http://www.jensroesner.de/wgetgui/#screen
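
For example, to handle the kind of job you describe (a list of start
URLs with limits on timeouts, depth and file types), a single invocation
along these lines should work. The file name urls.txt, the limit values
and the target directory below are only placeholders, so adjust them to
your setup:

wget -r -l 5 -T 30 -i urls.txt -A txt,htm,html,asp,php,bmp,jpg,gif,png,pdf,rtf,doc,xls,ppt -P c:\mirror

Here -r recurses into each site, -l limits the recursion depth, -T sets
the timeout in seconds, -i reads the start URLs from a file, -A restricts
which file types are saved, and -P sets the local directory the mirror is
written to.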


Good luck!

Request for Answer Clarification by mluggy-ga on 24 Jun 2002 00:07 PDT
From the product documentation:
http://www.gnu.org/manual/wget/html_mono/wget.html
'--spider'
When invoked with this option, Wget will behave as a Web spider, which
means that it will not download the pages, just check that they are
there. You can use it to check your bookmarks, e.g. with:
wget --spider --force-html -i bookmarks.html

This feature needs much more work for Wget to get close to the
functionality of real WWW spiders.

I'm looking for a "real" WWW spider. wget, Teleport and jws, as
someone suggested, are all billed as "download explorers" or "offline
browsers".

Thanks.


Clarification of Answer by runix-ga on 24 Jun 2002 10:31 PDT
I guess that by 'real spider' you mean 'recursively download
everything on a website'; well, wget does that with the '-r'
(recursive) option.

If this is not what you have in mind when you ask for a 'real spider',
please tell me what your requirements are, and I will search again!

Request for Answer Clarification by mluggy-ga on 25 Jun 2002 05:10 PDT
By real spider I mean a program, code or application that was designed
to run as a web crawler, and as such opens many internet connections
and recursively downloads (and mirrors) complete websites.

By real spider, I also mean efficient code that can support a
large number of websites and links.

Thanks!

Clarification of Answer by runix-ga on 25 Jun 2002 06:13 PDT
Well, wget does that! 

For example:

To recursively download http://www.gnu.org (only accepting .html,
.htm, .gif, .jpg and .png files):
wget -r -A html,htm,gif,jpg,png http://www.gnu.org

To download ALL the files needed for a page to be displayed properly
(this includes such things as inlined images, sounds, and referenced
stylesheets):
wget -r --page-requisites http://www.gnu.org

To recursively download a site without going out to other domains
(don't follow links that point outside this domain):
wget -r -D www.freshmeat.net http://www.freshmeat.net

To stop downloading once a 2 MB quota has been reached:
wget -r -Q2m http://www.freshmeat.net

To save the server headers with the file, perhaps for post-processing:
wget -s http://www.lycos.com/

wget doesn't support parallel processing, but since you're
downloading thousands of sites, you can launch several wget instances
(to download different sites) at the same time; it will have the same
effect.
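
For instance, on Windows you could split your URL list into several
files and start one wget per list from a batch file (the list and
directory names here are just placeholders):

start wget -r -i sites-1.txt -P c:\mirror\1
start wget -r -i sites-2.txt -P c:\mirror\2
start wget -r -i sites-3.txt -P c:\mirror\3

Each 'start' launches an independent wget process, so the three lists
are downloaded in parallel.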

Plus, wget can make its requests from different source IPs, which means
that if you have more than one internet connection, you can distribute
the work between them.
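
For example, something like this (the addresses are placeholders for
whatever local IPs your connections use):

wget --bind-address=10.0.0.1 -r -i sites-1.txt
wget --bind-address=10.0.0.2 -r -i sites-2.txt

--bind-address makes wget bind to the given local address, so each
instance sends its traffic out through a different connection.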

good luck!

Clarification of Answer by runix-ga on 25 Jun 2002 07:01 PDT
Please give wget a try, and if it doesn't meet your requirements I'll
search for another one.
Try it! I'm 99% sure that it will work for you!

best regards
runix

Request for Answer Clarification by mluggy-ga on 25 Jun 2002 07:49 PDT
Is there something else I need to do in order to release the money?
This is my first time, as you may know.

I will try wget soon.

>and if it doesn't meet your requirements I'll
>search for another one.

and I might take you up on this ;-)

Michael.

Clarification of Answer by runix-ga on 25 Jun 2002 10:37 PDT
I think your money is on its way now.
Just to let you know, if you don't like the answer you can ask
for a refund [ https://answers.google.com/answers/faq.html#refund ].

And this is my personal view:

If you don't like the answer, or you think it is not what you're
looking for, you should ask for clarifications instead of rating the
question low.

This will help both you and the researcher: you will have your
answer (or at least somewhere to start) and the researcher will
understand exactly what you wanted to know and get a better rating.

Good luck!

Please ask for clarifications if you need help with wget or if you
want me to search for other programs.
mluggy-ga rated this answer: 2 out of 5 stars
I didn't like the idea of runix being stuck on wget as the only
answer. This is obviously not a web spider application, but more of an
offline browser. I'm skeptical as to whether this will be able to handle
my requirements.

Comments  
Subject: Re: Looking for a spidering solution
From: googlebrain-ga on 23 Jun 2002 16:30 PDT
 
Quite frankly, I don't see why Teleport Pro wouldn't work for what you
ask. It does everything that you need. I use it to Hoover websites all
the time. It seems quite able to slam my DSL connection to the wall,
so I don't think the software would be any sort of download
bottleneck.

Just my $1.00/50
googlebrain-ga
Subject: Re: Looking for a spidering solution
From: mluggy-ga on 24 Jun 2002 00:09 PDT
 
Teleport Pro would NOT handle that amount of files.
I've tried almost every offline browser that exists, and the best one is
certainly Offline Explorer Enterprise (from
http://www.metaproducts.com).

These programs may work, but their parsing speed degrades on large
projects.
Subject: Re: Looking for a spidering solution
From: ghs1-ga on 27 Jun 2002 21:24 PDT
 
Another solution is htdig.  http://www.htdig.org is the URL.
Subject: Re: Looking for a spidering solution
From: gardium-ga on 23 Feb 2005 14:20 PST
 
I ended up building my own solution at:
http://www.2find.co.il/????_?????/
Subject: Re: Looking for a spidering solution
From: gardium-ga on 23 Feb 2005 14:21 PST
 
the correct url would be:
http://www.2find.co.il/%D7%9C%D7%AA%D7%95%D7%A8_%D7%9E%D7%95%D7%98%D7%95%D7%A8/
