Q: Find a software that will download all documents on a series of websites ( No Answer,   10 Comments )
Question  
Subject: Find a software that will download all documents on a series of websites
Category: Computers > Internet
Asked by: goodwasabi79-ga
List Price: $10.00
Posted: 06 Feb 2006 13:05 PST
Expires: 08 Mar 2006 13:05 PST
Question ID: 442258
I would like to know the best software solution (free or paid, it doesn't matter) that will allow me to download the entire contents of a large website.

The only content I am interested in is the documents hosted on the pages (Word, Excel, PowerPoint, Acrobat, ZIP, ARJ...), basically all documents that can be downloaded by clicking on a link or on a "click here to download" button. I am not interested in storing the pages themselves.

The content must be downloaded and stored in a structured manner (not all just dumped into one "incoming" folder). One possibility would be to store the content in a structure that resembles that of the pages it was downloaded from (i.e. all documents from the root page would be stored in a root directory, all documents from the "about us" page would be stored in an "about us" directory, and so on). Another option would be to create a searchable database that is linked to the documents (like Google Desktop does).

The website I am targeting is a secured site to which I have fully authorised access, so the program must be able to perform a secure login.

The website I am targeting is a dynamic site (aspx) that is very large and that branches off to other sites I would also like to selectively download (same principle: only downloadable documents).

I must be able to download only parts of the site at a time and resume the download later on; the program must keep track of the pages it has already downloaded and resume from where it left off.

I am basically looking for software that will crawl and back up a website in a way that is very similar to what Google does, but on a smaller scale.

So far I have searched download.com and found some "offline browsing" utilities, but they are not exactly what I need.
Answer  
There is no answer at this time.

Comments  
Subject: Re: Find a software that will download all documents on a series of websites
From: canadianhelper-ga on 06 Feb 2006 13:50 PST
 
I've tried to match your needs with the 'functions' listed on this site... there is a free 30-day trial, so it might be worth looking at.

http://www.metaproducts.com/mp/mpProducts_Detail.asp?id=3
Subject: Re: Find a software that will download all documents on a series of websites
From: goodwasabi79-ga on 07 Feb 2006 03:26 PST
 
Mass Downloader is more like a download manager: you have to tell it what to download, and then it takes care of queuing everything up and managing the download sequence. Mass Downloader doesn't really include crawling capabilities.

The idea is really to be able to feed the program my URL, the username and password, and then watch it crawl through the site and download all the documents.
Subject: Re: Find a software that will download all documents on a series of websites
From: frankcorrao-ga on 07 Feb 2006 13:12 PST
 
Hmm, I don't know if there is a generalized solution available, but I have worked on some web scrapers at work. It's not too difficult to write in Perl, given the very nice HTML package. Check out sections 20.18 - 20.21 in the Perl Cookbook by Christiansen & Torkington for some very relevant examples of website scraping. If you are not a programmer, a Perl consultant could probably do this for you in a day or two. I'll take a look around Freshmeat and the like to see if something generic and free is out there.
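
(The Cookbook examples are in Perl; purely as a rough illustration of the same idea, here is a minimal single-page scraper sketch in Python. It assumes the third-party requests and beautifulsoup4 packages, and the URL, credentials and file-extension list are placeholders, not details from the actual site.)

import os
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup

START_URL = "https://www.example.com/"   # placeholder for the real site
AUTH = ("username", "password")          # placeholder HTTP credentials
DOC_EXTENSIONS = (".doc", ".xls", ".ppt", ".pdf", ".zip", ".arj")

def scrape_page(url, out_dir="downloads"):
    """Fetch one page and download every document linked from it."""
    page = requests.get(url, auth=AUTH, timeout=30)
    page.raise_for_status()
    soup = BeautifulSoup(page.text, "html.parser")

    for link in soup.find_all("a", href=True):
        target = urljoin(url, link["href"])
        if not target.lower().endswith(DOC_EXTENSIONS):
            continue  # skip HTML pages and anything else that is not a document

        # Mirror the URL path so files stay grouped by the page they came from.
        rel_path = urlparse(target).path.lstrip("/")
        dest = os.path.join(out_dir, rel_path)
        os.makedirs(os.path.dirname(dest), exist_ok=True)

        doc = requests.get(target, auth=AUTH, timeout=60)
        doc.raise_for_status()
        with open(dest, "wb") as fh:
            fh.write(doc.content)

if __name__ == "__main__":
    scrape_page(START_URL)

(A real tool would also need to follow links to further pages and remember which ones it has already visited, which is exactly the bookkeeping the asker wants a packaged product to handle.)
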
Subject: Re: Find a software that will download all documents on a series of websites
From: frankcorrao-ga on 07 Feb 2006 13:13 PST
 
I should note that all of Chapter 20 in the Perl Cookbook is about web automation.
Subject: Re: Find a software that will download all documents on a series of websites
From: frankcorrao-ga on 07 Feb 2006 13:18 PST
 
Sorry for the multiple posts, but what you want to search Google for to get relevant results is "web scraper". I took a look and there are many commercial ones available. Freshmeat (www.freshmeat.net) has some as well.
Subject: Re: Find a software that will download all documents on a series of websites
From: larkas-ga on 07 Feb 2006 22:25 PST
 
Although you may not want to store the pages, they must at least be downloaded so that they can be crawled for the documents you are interested in.

I would suggest using wget (with SSL support) with wgetGUI as a front end. wget can do the job, but it is very hard to manage all the options from the command line, which is why you will need a front end. It will preserve the directory structure. However, this will leave you with the pages being stored; you can always delete them later. You can choose which file types to accept or reject for download (except for the HTML pages themselves).

wget (Windows)
http://xoomer.virgilio.it/hherold/

wgetGUI
http://www.jensroesner.de/wgetgui/
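
(As a rough illustration of the kind of command line a front end such as wgetGUI would build, a recursive invocation along the following lines keeps only the listed document types while mirroring the site's directory layout. The host, credentials and extension list are placeholders, and --http-user/--http-password assume the site uses HTTP authentication rather than a login form.)

wget --recursive --level=5 --no-parent \
     --accept=doc,xls,ppt,pdf,zip,arj \
     --http-user=USERNAME --http-password=PASSWORD \
     --no-clobber \
     --directory-prefix=downloads \
     https://www.example.com/

(wget still has to fetch the HTML pages in order to follow their links, which is the point made above; --no-clobber skips files that are already present locally, which helps when resuming an interrupted crawl.)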

If you really don't want the pages stored even temporarily, then I would suggest using Xenu (http://home.snafu.de/tilman/xenulink.html) to crawl the site and find out which files to download, then exporting the list of URLs as a tab-delimited file (keeping only the file types you are interested in) and using Mass Downloader or wget to download those files.
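
(For that two-step route, the download half could again be a single wget call; "document-urls.txt" is a placeholder for whatever list Xenu exports, trimmed down to one URL per line. --force-directories recreates a directory per URL path, so the files still end up grouped roughly the way they are on the site.)

wget --input-file=document-urls.txt \
     --http-user=USERNAME --http-password=PASSWORD \
     --force-directories \
     --directory-prefix=downloads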

Also you may want to read the following Google Answers thread and
check out the product mentioned:

http://answers.google.com/answers/threadview?id=32044
Subject: Re: Find a software that will download all documents on a series of websites
From: rfremon-ga on 12 Feb 2006 06:26 PST
 
Have you considered WebSnake?
http://www.websnake.com/
Subject: Re: Find a software that will download all documents on a series of websites
From: goodwasabi79-ga on 13 Feb 2006 00:25 PST
 
I just got back from a couple of days of vacation. I am going to look at your suggestions this evening and get back to you within the next couple of days. In the meantime, thank you very much for your comments.
Subject: Re: Find a software that will download all documents on a series of websites
From: goodwasabi79-ga on 15 Feb 2006 04:46 PST
 
I looked at the GA thread you suggested; I basically have the same needs, except that for me it's going to be a one-time deal (downloading the site before it goes offline). My only extra requirement is authentication, but that feature is included in most solutions anyway.

To respond to your comments: I am not looking at creating my own solution, so programming in Perl is out of scope.

wget and wgetGUI are technically good solutions, but they lack ease of use, and since the person who will actually have to do the job is not very computer literate, I am looking mostly at packaged solutions.

I wasn't able to try out WebSnake because there is no downloadable trial version.

I am currently trying out a program called BlackWidow that seems to be a good compromise for what I need; my only doubt about BlackWidow is whether it will be able to handle the size of the site.
Subject: Re: Find a software that will download all documents on a series of websites
From: goodwasabi79-ga on 20 Feb 2006 01:51 PST
 
In the end I used BlackWidow, but the information I found here was very useful, so I consider the question answered. Thank you very much.
