This is my first question, really, so I'm just exploring, but the
question is real. Very real.
I'm trying to locate an efficient method of spidering complete
websites.
I'm talking about hundreds of thousands of websites with literally
millions of files. The content needs to be mirrored LOCALLY,
preferably on a Windows OS. I should be able to define a list of URLs
to go through (with restrictions such as timeouts, level depth and
file types). Mainly, I would need to spider all text files
(txt, htm*, asp, php...), image files (bmp, jpg, gif, png) and rich
content files such as pdf, rtf, doc, xls, ppt.
Please do not refer me to offline explorer programs such as
Teleporter or BlackWidow, as they will not be able to manage this
amount of data.
Also, ASPSEEK is no good for me; I've tried that.
Let me know if you need more info.
Thanks.
Request for Question Clarification by wengland-ga on 23 Jun 2002 15:44 PDT
Do you have a budget limit?
Are you able / willing to move to a Unix platform?
How many machines will this run on?
What kind of incoming bandwidth do you have?
Do you need to mirror the data, or simply keep a searchable index?
Thanks for the additional information!
Request for Question Clarification by aditya2k-ga on 23 Jun 2002 16:12 PDT
Hi mluggy,
Please have a look at http://www.jocsoft.com/jws/ and tell me if it
is of any use to you. If so, then I will answer this question formally
with more such programs.
Answer by runix-ga
Hello!
The definitive answer to your question is 'wget'.
Wget is a Unix program that recursively downloads a site. Now, there's
a Win32 version of Wget!
Wget doesn't have as nice a frontend as Teleport or BlackWidow: it just
does what it has to do, and it does it well.
You can download from HTTP or FTP, and you can specify level depths,
file types and domains.
It supports background operation (it runs completely unattended), HTTP
cookies, resuming of cancelled downloads, etc.
Plus, it's free and open source :)
I think it's definitely what you're looking for.
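For example (just a sketch; the host name is a placeholder and the
numbers would need tuning for a real job), a single command that
limits depth and file types and runs unattended in the background
could look like this:
wget -r -l 3 -A htm,html,jpg,gif,png -T 30 -b http://www.example.com/
Here -r turns on recursion, -l limits the link depth, -A restricts the
accepted file extensions, -T sets the network timeout in seconds, and
-b sends wget to the background, logging to a file.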
The main page of the project is:
http://www.gnu.org/software/wget/wget.html
The main page of the Windows port is
http://space.tin.it/computer/hherold/
You can download it from here:
ftp://sunsite.dk/projects/wget/windows/
ftp://sunsite.dk/projects/wget/windows/wget20020612b.zip
There's also a Windows frontend (it uses the same wget program
underneath, but adds a graphical interface):
http://www.jensroesner.de/wgetgui/
You can see a screenshot here
http://www.jensroesner.de/wgetgui/#screen
Good luck!
Request for Answer Clarification by mluggy-ga on 24 Jun 2002 00:07 PDT
From the product documentation:
http://www.gnu.org/manual/wget/html_mono/wget.html
--spider
When invoked with this option, Wget will behave as a Web spider, which
means that it will not download the pages, just check that they are
there. You can use it to check your bookmarks, e.g. with:
wget --spider --force-html -i bookmarks.html
This feature needs much more work for Wget to get close to the
functionality of real WWW spiders.
I'm looking for a "real" WWW spider. wget, Teleporter and JWS, which
were suggested here, are all billed as "download explorers" or
"offline browsers".
Thanks.
Clarification of Answer by runix-ga on 24 Jun 2002 10:31 PDT
I guess that by 'real spider' you mean 'recursively download
everything on a website'; well, wget does that with the '-r'
(recursive) option.
If this is not what you have in mind when you ask for a 'real spider',
please tell me what your requirements are, and I will search again!
Request for Answer Clarification by mluggy-ga on 25 Jun 2002 05:10 PDT
By real spider I mean a program, code or application that was designed
to run as a web crawler, and as such opens many internet connections
and recursively downloads (and mirrors) complete websites.
By real spider, I also mean efficient code that can handle a large
number of websites and links.
Thanks!
Clarification of Answer by runix-ga on 25 Jun 2002 06:13 PDT
Well, wget does that!
For example:
To recursively download http://www.gnu.org (only accepting .html,
.htm, .gif, .jpg and .png files):
wget -r -A html,htm,gif,jpg,png http://www.gnu.org
To download ALL the files needed for a web page to display properly
(this includes such things as inlined images, sounds, and referenced
stylesheets):
wget -r --page-requisites http://www.gnu.org
To recursively download a site without going out to other domains
(don't follow links that are outside this domain):
wget -r -D www.freshmeat.net http://www.freshmeat.net
To stop downloading once a 2 MB quota is exceeded:
wget -r -Q2m http://www.freshmeat.net
To save the server headers with the file, perhaps for post-processing:
wget -s http://www.lycos.com/
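Putting several of these options together for the kind of job
described in the question (urls.txt and the mirror directory are only
illustrative names), a run over a prepared list of start URLs might
look like:
wget -r -l 10 -T 30 -A txt,htm,html,asp,php,bmp,jpg,gif,png,pdf,rtf,doc,xls,ppt -i urls.txt -P mirror
-i reads the start URLs from a file and -P sets the local directory
the files are written into; the -A list mirrors the file types listed
in the original question.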
wget doesn't support parallel processing, but since you're
downloading thousands of sites, you can launch several wget processes
(to download different sites) at the same time; it will have the same
effect.
Plus, wget can make its requests using different IPs, which means that
if you have more than one internet connection, you can distribute the
work between them.
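As a rough sketch of that idea on Windows (the sites*.txt file names
and the IP addresses are only placeholders), a small batch file can
fan the work out across several wget processes:
rem each list file holds a share of the sites to mirror
start wget -r -l 5 -i sites1.txt -o log1.txt
start wget -r -l 5 -i sites2.txt -o log2.txt
start wget -r -l 5 -i sites3.txt --bind-address=192.168.0.2 -o log3.txt
start wget -r -l 5 -i sites4.txt --bind-address=192.168.0.3 -o log4.txt
Each 'start' launches an independent wget process, so the lists are
fetched in parallel; --bind-address makes a process send its requests
from a specific local IP when more than one connection is available.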
good luck!
Clarification of Answer by runix-ga on 25 Jun 2002 07:01 PDT
Please give wget a try, and if it doesn't meet your requirements I'll
search for another one.
Try it! I'm 99% sure that it will work for you!
best regards
runix
Request for Answer Clarification by mluggy-ga on 25 Jun 2002 07:49 PDT
Is there something else I need to do in order to release the money?
This is my first time, as you may know.
I will try wget soon.
>and if it doesn't meet your requirements I'll
>search for another one.
and I might take you up on this ;-)
Michael.
Clarification of Answer by runix-ga on 25 Jun 2002 10:37 PDT
I think your money is on its way now.
Just to let you know, if you don't like the answer you can ask
for a refund [ https://answers.google.com/answers/faq.html#refund ].
And this is my personal view:
If you don't like the answer, or you think it is not what you're
looking for, you should ask for clarifications instead of ranking the
question low.
This will help both you and the researcher: you will have your
answer (or at least somewhere to start), and the GA researcher will
understand exactly what you wanted to know and get a better rating.
Good luck!
Please ask for clarifications if you need help with wget or if you
want me to search for other programs.