Subject: Find a software that will download all documents on a series of websites
Category: Computers > Internet
Asked by: goodwasabi79-ga
List Price: $10.00
Posted: 06 Feb 2006 13:05 PST
Expires: 08 Mar 2006 13:05 PST
Question ID: 442258
I would like to know the best software solution (free or paid, it doesn't matter) that will allow me to download the entire contents of a large website. The content I am interested in is only the documents hosted on the pages (Word, Excel, PowerPoint, Acrobat, ZIP, ARJ...), basically all documents that are downloadable by clicking on a link or on a "click here to download" button. I am not interested in storing the pages themselves.

The content must be downloaded and stored in a structured manner (not all just dumped into one "incoming" folder). One possibility would be to store the content in a structure that resembles that of the pages it was downloaded from (i.e. all documents from the root page would be stored in a root directory, all documents from the "about us" page would be stored in an "about us" directory, and so on). Another option would be to create a searchable database that is linked to the documents (like Google Desktop does).

The website I am targeting is a secured site to which I have fully authorised access, so the program must be able to perform a secure login. It is a dynamic site (aspx) that is very large and branches off to other sites I would also like to selectively download (same principle, only downloadable documents). I must be able to download only parts of the site at a time and resume the downloading later on; the program should keep track of the pages it has already downloaded and resume from where it left off.

I am basically looking for software that will crawl and back up a website in a way that is very similar to what Google does, but on a smaller scale. For the moment I have searched download.com and found some "offline browsing" utilities that are not exactly what I need.
There is no answer at this time.
Subject: Re: Find a software that will download all documents on a series of websites
From: canadianhelper-ga on 06 Feb 2006 13:50 PST

I've tried to match your needs with the functions listed on this site... there is a free 30-day trial, so it might be worth looking at.
http://www.metaproducts.com/mp/mpProducts_Detail.asp?id=3
Subject: Re: Find a software that will download all documents on a series of websites
From: goodwasabi79-ga on 07 Feb 2006 03:26 PST

Mass Downloader is more like a download manager: you have to tell it what to download, and then it takes care of queuing everything up and managing the download sequence. Mass Downloader doesn't really include crawling capabilities. The idea is really to be able to feed the program my URL, the username and password, and watch it crawl through the site and download all the documents.
Subject: Re: Find a software that will download all documents on a series of websites
From: frankcorrao-ga on 07 Feb 2006 13:12 PST

Hmm, I don't know if there is a generalized solution available, but I have worked on some web scrapers at work. It's not too difficult to write in Perl, given the very nice HTML modules. Check out sections 20.18 - 20.21 in the Perl Cookbook by Christiansen & Torkington for some very relevant examples of website scraping. If you are not a programmer, a Perl consultant could probably do this for you in a day or two. I'll take a look around Freshmeat and the like to see if something generic and free is out there.
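[Editor's note: as an illustration of the approach frankcorrao-ga describes, here is a minimal sketch of a Perl scraper using the LWP::UserAgent and HTML::LinkExtor modules from CPAN, the kind of modules the Perl Cookbook recipes rely on. The URL, realm, username and password are placeholders, and a real crawler would also need to follow page links recursively and handle the site's aspx login.]

    #!/usr/bin/perl
    # Sketch only: fetch one page, extract its links, and save the linked
    # documents whose extensions match the types the questioner listed.
    use strict;
    use warnings;
    use LWP::UserAgent;
    use HTML::LinkExtor;
    use URI;

    my $start = 'https://www.example.com/';   # placeholder starting page
    my $ua    = LWP::UserAgent->new;
    # Placeholder credentials for a site protected by HTTP authentication.
    $ua->credentials('www.example.com:443', 'Some Realm', 'username', 'password');

    my $page = $ua->get($start);
    die $page->status_line unless $page->is_success;

    # Collect every href found in an <a> tag on the page.
    my @links;
    my $parser = HTML::LinkExtor->new(sub {
        my ($tag, %attr) = @_;
        push @links, $attr{href} if $tag eq 'a' && defined $attr{href};
    });
    $parser->parse($page->decoded_content);

    for my $link (@links) {
        my $url = URI->new_abs($link, $start);             # resolve relative links
        next unless $url->path =~ /\.(doc|xls|ppt|pdf|zip|arj)$/i;
        (my $name = $url->path) =~ s{.*/}{};               # keep only the file name
        $ua->mirror($url, $name);                          # download the document
    }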
Subject: Re: Find a software that will download all documents on a series of websites
From: frankcorrao-ga on 07 Feb 2006 13:13 PST

I should note that all of Chapter 20 in the Perl Cookbook is about web automation.
Subject: Re: Find a software that will download all documents on a series of websites
From: frankcorrao-ga on 07 Feb 2006 13:18 PST

Sorry for the multiple posts, but what you want to search Google for to get relevant results is "web scraper". I took a look and there are many commercial ones available. Freshmeat has some as well: www.freshmeat.net.
Subject: Re: Find a software that will download all documents on a series of websites
From: larkas-ga on 07 Feb 2006 22:25 PST

Although you may not want to store the pages, the pages must at least be downloaded so that they can be crawled for the documents you are interested in. I would suggest using wget (with SSL support) and wgetGUI as a front end. wget can do the job, but it is very hard to manage all the options from the command line, which is why you will want a front end. It will preserve the directory structure. However, this will leave you with the pages being stored; you can always delete the pages later. You can choose which file types to accept or reject for downloading (except for the HTML pages themselves).

wget (Windows): http://xoomer.virgilio.it/hherold/
wgetGUI: http://www.jensroesner.de/wgetgui/

If you really don't want the pages stored even temporarily, then I would suggest crawling with Xenu (http://home.snafu.de/tilman/xenulink.html) to find out which files to download, exporting the list of URLs as a tab-delimited file (with only the file types you are interested in), and then using Mass Downloader or wget to download the files you want.

You may also want to read the following Google Answers thread and check out the product mentioned there:
http://answers.google.com/answers/threadview?id=32044
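[Editor's note: to give a concrete idea of what the wget approach above looks like, here is a rough command-line sketch using standard GNU wget options. The URL, username and password are placeholders, and the exact flags, particularly for logging in to an aspx site, would need adjusting for the real site.]

    wget --recursive --level=5 \
         --accept=doc,xls,ppt,pdf,zip,arj \
         --http-user=USERNAME --http-password=PASSWORD \
         --no-clobber \
         https://www.example-intranet.com/

With --recursive, wget follows the site's links and recreates the directory structure locally; --accept limits what is kept to the listed document extensions; and --no-clobber lets an interrupted run be restarted without re-downloading files that are already present. A front end such as wgetGUI essentially builds this kind of command for you.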
Subject: Re: Find a software that will download all documents on a series of websites
From: rfremon-ga on 12 Feb 2006 06:26 PST

Have you considered WebSnake? http://www.websnake.com/
Subject: Re: Find a software that will download all documents on a series of websites
From: goodwasabi79-ga on 13 Feb 2006 00:25 PST

I just got back from a couple of days of vacation. I am going to look at your suggestions this evening and get back to you within the next couple of days. In the meantime, thank you very much for your comments.
Subject: Re: Find a software that will download all documents on a series of websites
From: goodwasabi79-ga on 15 Feb 2006 04:46 PST

I looked at the GA thread you suggested; I basically have the same needs, except that for me it's going to be a one-time deal (download the site before it goes offline). My only extra need is authentication, but this feature is included in most solutions anyway.

To respond to your comments: I am not looking at creating my own solution, so programming in Perl is out of scope. wget and wgetGUI are technically good solutions, but they lack ease of use, and since the person who will actually have to do the job is not very computer literate, I am looking mostly at packaged solutions. I wasn't able to try out WebSnake because there is no downloadable trial version.

I am currently trying out a program called BlackWidow that seems to be a good compromise for what I need; my only doubt about BW is whether it will be able to handle the size of the site.
Subject: Re: Find a software that will download all documents on a series of websites
From: goodwasabi79-ga on 20 Feb 2006 01:51 PST

In the end I used BlackWidow, but the information I found here was very useful, so I consider the question to be answered. Thank you very much.