Google Answers: web surfer that ignores robot.txt meta

Q: web surfer that ignores robot.txt meta ( Answered, 0 Comments )

Question

Subject: web surfer that ignores robot.txt meta
Category: Computers > Internet
Asked by: ralph_job-ga
List Price: $20.00

Posted: 08 Aug 2005 04:32 PDT
Expires: 07 Sep 2005 04:32 PDT
Question ID: 552989

I am looking for a web crawler that can download pages from sites that
block web crawlers with robots.txt or meta tags. The crawler should
handle multiple sites and password-protected sites. I subscribe to a
couple of journals that block Yahoo/IE sync with robots.txt, if that
helps show what type I am looking for. It would be nice if the crawler
could parse out the HTML.
Thank you

Request for Question Clarification by pafalafa-ga on 08 Aug 2005 04:53 PDT

ralph_job-ga,

Can you give us an example of one of the sites you want to crawl?
The only way to know if a crawler will crawl a given site is to give
it a test run and see what happens.

Clarification of Question by ralph_job-ga on 08 Aug 2005 06:23 PDT
One site would be WWW.WALLSTREETJOURNAL.COM, to read offline. When I
try to sync this site, it comes back with robots.txt

R

Answer

Subject: Re: web surfer that ignores robot.txt meta
Answered By: theta-ga on 08 Aug 2005 11:59 PDT

Hi ralph_job-ga,

Based on your requirements, I recommend that you try out HTTrack
Website Copier. It is free, available for both Windows and Linux,
supports multiple site downloads at once, supports password protected
websites, and has configurable robots.txt support. You can get it
from:
   - HTTrack Website Copier
     (http://www.httrack.com/)

Follow the instructions in the Quickstart Guide to create a new
download project:
   - HTTrack manual: How to start, Step-by-step
     (http://www.httrack.com/html/step.html)

Once you create a new project, you will be allowed to configure the
various options for it. In the options window, click on the 'Spider'
tab, and set the spider option to 'no robots.txt rules'.
   - HTTrack manual: Spider Options Panel
     (http://www.httrack.com/html/step9_opt6.html)

======================================================

Hope this helps.
If you need any clarifications, just ask!

Regards,
Theta-ga

======================================================

Google Search Terms Used
   HTTrack ignore robots.txt
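
For readers wondering what a "no robots.txt rules" style setting
changes, here is a minimal Python sketch of the decision a crawler
makes before each download. It is not HTTrack's implementation. The
function name may_fetch, the obey_robots switch and the example.com
URL are made up for illustration; only urllib.robotparser comes from
the Python standard library.

```python
from urllib.robotparser import RobotFileParser


def may_fetch(url, robots_lines, user_agent="*", obey_robots=True):
    """Return True if a crawler would download `url`.

    `robots_lines` is the site's robots.txt content as a list of lines.
    `obey_robots=False` is a hypothetical switch that mimics a crawler
    configured to ignore robots.txt entirely.
    """
    if not obey_robots:
        # Rules are skipped, so every URL is considered fetchable.
        return True
    parser = RobotFileParser()
    parser.parse(robots_lines)
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # A robots.txt that blocks every crawler from the whole site.
    robots_lines = ["User-agent: *", "Disallow: /"]
    url = "http://www.example.com/article.html"
    print(may_fetch(url, robots_lines))
    print(may_fetch(url, robots_lines, obey_robots=False))
```

A polite crawler takes the first path and refuses the page; a crawler
told to ignore the rules takes the second and downloads it anyway.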

Request for Answer Clarification by ralph_job-ga on 09 Aug 2005 04:35 PDT

Hi

I tried the software. It has one issue in that it does not have a
setup for normal password files. It has a setup for a proxy, in case
you are in an office environment and need the password to access the
internet, but there was no box to store the user id/password for the
web site.

Ralph

Clarification of Answer by theta-ga on 09 Aug 2005 05:49 PDT

Hi ralph_job-ga,

HTTrack supports password protected websites, you just have to encode
the username and password in the website URL.

For example, if you want to copy the website:
      www.mywebsite.com
and the website requires the following login info:
      Username: uname
      Password: upwd
then ask HTTrack to download the following URL:
      http://uname:upwd@www.mywebsite.com

See the following FAQ entries:
   - HTTrack: Can I use username/password authentication on a site?
     (http://www.httrack.com/html/faq.html#QM6)
   - HTTrack: Using user:password@address is not working!
     (http://www.httrack.com/html/faq.html#QT6)

Hope this helps.
If you need further clarification, just ask.

Regards,
Theta-ga
:)
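
If the username or password contains characters such as '@', ':' or
'/', the plain user:password@host form becomes ambiguous. The Python
sketch below builds such a URL and percent-encodes the credentials
first. The helper name add_credentials is made up for illustration, it
assumes the input URL carries no credentials yet, and whether a given
download tool decodes the escapes is an assumption worth testing.
Credentials in the URL are normally only useful for sites that use
HTTP authentication, not for sites with a login form.

```python
from urllib.parse import quote, urlsplit, urlunsplit


def add_credentials(url, username, password):
    """Return `url` rewritten as scheme://user:password@host/...

    Both credentials are percent-encoded so that characters like
    '@', ':' and '/' cannot be confused with URL delimiters.
    """
    parts = urlsplit(url)
    user = quote(username, safe="")
    pwd = quote(password, safe="")
    netloc = f"{user}:{pwd}@{parts.netloc}"
    return urlunsplit(
        (parts.scheme, netloc, parts.path, parts.query, parts.fragment)
    )


if __name__ == "__main__":
    # The example from the answer above.
    print(add_credentials("http://www.mywebsite.com", "uname", "upwd"))
    # A password with special characters gets escaped.
    print(add_credentials("http://www.mywebsite.com/journal", "uname", "p@ss:w/rd"))
```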

Clarification of Answer by theta-ga on 09 Aug 2005 06:06 PDT

Hi ralph_job-ga,

This is regarding your optional requirement that the "crawler parse
out the html." I assume that you want to save the crawled webpages as
plain text instead of HTML. While there doesn't seem to be a web
crawler that offers this functionality along with the others that you
require, I was able to find some standalone utilities that can
accomplish this task.
   - Web2Text
     (http://www.jetman.dircon.co.uk/software/web2text.html)
   - W3C.ORG: Converting from HTML
     (http://www.w3.org/Tools/html2things.html)

Hope this helps! :)

Regards,
Theta-ga
:)
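
For anyone who would rather script the HTML-to-text step than install
a separate utility, a rough Python sketch using only the standard
library follows. It is not how the tools listed above work; the names
TextExtractor and html_to_text are made up for illustration, and the
output is a single line of whitespace-normalized text with no attempt
to preserve layout.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> content."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Simplification: chunks are joined with a space, so a word
        # split by inline tags (wor<b>ld</b>) comes out as two words.
        return " ".join(" ".join(self._chunks).split())


def html_to_text(html):
    """Return the visible text of an HTML string as one line."""
    extractor = TextExtractor()
    extractor.feed(html)
    extractor.close()
    return extractor.text()


if __name__ == "__main__":
    page = "<html><body><h1>Title</h1><p>Hello <b>world</b></p></body></html>"
    print(html_to_text(page))
```

Running it over the files a crawler saved to disk would mean reading
each .html file and writing the returned text next to it.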

Comments

There are no comments at this time.
