Q: web surfer that ignores robot.txt meta ( Answered,   0 Comments )
Question  
Subject: web surfer that ignores robot.txt meta
Category: Computers > Internet
Asked by: ralph_job-ga
List Price: $20.00
Posted: 08 Aug 2005 04:32 PDT
Expires: 07 Sep 2005 04:32 PDT
Question ID: 552989
I am looking for a web crawler that can download pages from sites that
block web crawlers with robots.txt/meta tags. The crawler should handle
multiple sites and password-protected sites. I subscribe to a couple of
journals that block Yahoo/IE sync with robots.txt, if that helps show
what type I am looking for. It would be nice if the crawler could parse
out the HTML.
Thank you

Request for Question Clarification by pafalafa-ga on 08 Aug 2005 04:53 PDT
ralph_job-ga,

Can you give us an example of one of the sites you want to crawl? 
The only way to know if a crawler will crawl a given site is to give
it a test run and see what happens.

Clarification of Question by ralph_job-ga on 08 Aug 2005 06:23 PDT
One site would be WWW.WALLSTREETJOURNAL.COM, to read offline. When I
try to sync this site, it comes back with robots.txt

R
Answer  
Subject: Re: web surfer that ignores robot.txt meta
Answered By: theta-ga on 08 Aug 2005 11:59 PDT
 
Hi ralph_job-ga,
   Based on your requirements, I recommend that you try out HTTrack
Website Copier. It is free, available for both Windows and Linux,
supports multiple site downloads at once, supports password protected
websites, and has configurable robots.txt support. You can get it
from:
      - HTTrack Website Copier
        (http://www.httrack.com/)
   
   Follow the instructions in the Quickstart Guide to create a new
download project:
      - HTTrack manual: How to start, Step-by-step
        (http://www.httrack.com/html/step.html)
   Once you create a new project, you will be allowed to configure the
various options for it. In the options window, click on the 'Spider'
tab, and set the spider option to 'no robots.txt rules'.
      - HTTrack manual: Spider Options Panel
        (http://www.httrack.com/html/step9_opt6.html)
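
   If you prefer the command line over the options window, the same
setting can be sketched in a small Python helper that only builds the
HTTrack argument list. This is a minimal sketch, not the official
procedure: it assumes the httrack executable is on your PATH, that -O
sets the output directory, and that -s0 is the command-line counterpart
of 'no robots.txt rules'. The helper name build_httrack_command and the
example.com URL are hypothetical.

```python
import shlex


def build_httrack_command(urls, output_dir, ignore_robots=True):
    """Build (but do not run) an HTTrack command line for the given URLs."""
    cmd = ["httrack", *urls, "-O", output_dir]
    if ignore_robots:
        # Assumption: -s0 tells HTTrack never to follow robots.txt
        # or meta robots tags.
        cmd.append("-s0")
    return cmd


# Hypothetical site and output directory, for illustration only.
cmd = build_httrack_command(["http://www.example.com/"], "./mirror")
print(shlex.join(cmd))
```

   To actually start the download, pass the list to subprocess.run(cmd).
Several URLs can be given in the same list to copy multiple sites in
one project.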

======================================================

Hope this helps.
If you need any clarifications, just ask!

Regards,
Theta-ga


======================================================
Google Search Terms Used
   HTTrack ignore robots.txt

Request for Answer Clarification by ralph_job-ga on 09 Aug 2005 04:35 PDT
Hi
I tried the software. It has one issue: it does not have a
setup for ordinary website passwords. It has a setup for a proxy, in case
you are in an office environment and need a password to access the
Internet, but there was no box to store the user ID/password for the
website.

Ralph

Clarification of Answer by theta-ga on 09 Aug 2005 05:49 PDT
Hi ralph_job-ga,
   HTTrack supports password-protected websites; you just have to
encode the username and password in the website URL.
   For example, if you want to copy the website:
                  www.mywebsite.com
   and the website requires the following login info:
                Username: uname
                Password: upwd
   then ask HTTrack to download the following URL:
           http://uname:upwd@www.mywebsite.com

   See the following FAQ entries:
       - HTTrack: Can I use username/password authentication on a site?
         (http://www.httrack.com/html/faq.html#QM6)
       - HTTrack: Using user:password@address is not working!
         (http://www.httrack.com/html/faq.html#QT6)
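
   If the username or password contains characters such as '@' or ':',
pasting them straight into the URL will break it. The sketch below
builds the user:password@host form with Python's standard library and
percent-encodes the credentials first. The function name
url_with_credentials is hypothetical, and whether a given site or tool
accepts the encoded form is an assumption you should test. Note that
this only applies to sites using HTTP authentication, not to form-based
logins.

```python
from urllib.parse import quote, urlsplit, urlunsplit


def url_with_credentials(url, username, password):
    """Return url with username:password@ inserted before the host."""
    parts = urlsplit(url)
    # Percent-encode so '@', ':' or '/' inside the credentials
    # cannot be mistaken for URL delimiters.
    userinfo = f"{quote(username, safe='')}:{quote(password, safe='')}"
    netloc = f"{userinfo}@{parts.netloc}"
    return urlunsplit(
        (parts.scheme, netloc, parts.path, parts.query, parts.fragment)
    )


# The example from the answer above.
print(url_with_credentials("http://www.mywebsite.com", "uname", "upwd"))
```

   With a username like user@domain.com, the '@' is emitted as %40, so
the only literal '@' left in the URL is the one separating the
credentials from the host name.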

Hope this helps.
If you need further clarification, just ask.
Regards,
Theta-ga
:)

Clarification of Answer by theta-ga on 09 Aug 2005 06:06 PDT
Hi ralph_job-ga,
   This is regarding your optional requirement that the "crawler parse
out the html." I assume that you want to save the crawled webpages as
plain text instead of HTML. While there doesn't seem to be a
webcrawler that offers this functionality along with the others that
you require, I was able to find some standalone utilities that can
accomplish this task.
   - Web2Text
     (http://www.jetman.dircon.co.uk/software/web2text.html)
   - W3C.ORG: Converting from HTML
     (http://www.w3.org/Tools/html2things.html)
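
   If you would rather script the conversion yourself, the following is
a minimal sketch of stripping HTML down to plain text using only
Python's standard library. It is not how the utilities above work; the
class and function names (TextExtractor, html_to_text) are hypothetical,
and it makes the simplifying assumption that script and style contents
should be dropped and every other text run kept on its own line.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> contents."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style blocks.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def html_to_text(html):
    """Return the text content of an HTML string, one run per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)


sample = (
    "<html><head><title>Demo</title><style>p{color:red}</style></head>"
    "<body><h1>Headline</h1><p>First &amp; second.</p></body></html>"
)
print(html_to_text(sample))
```

   To convert a page saved by the crawler, read the file into a string
and pass it to html_to_text, then write the result to a .txt file.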

Hope this helps! :)
Regards,
Theta-ga
:)
