Q: Blocking searches of selected files (Answered, 5 out of 5 stars, 0 Comments)
Question  
Subject: Blocking searches of selected files
Category: Reference, Education and News > General Reference
Asked by: davidsclarke-ga
List Price: $10.00
Posted: 19 Nov 2002 10:42 PST
Expires: 19 Dec 2002 10:42 PST
Question ID: 110690
I am a member of a Unitarian Fellowship in Carbondale, Illinois.  We
would like some of our posted web files to be accessible to the
general public, others to be available only to Fellowship members in
order to protect privacy.

Is there some means of preventing selected pages from being subject to
your searches?  We find that names on untitled pages will appear in
searches.

A second question: We have document and Excel files on the web as
archived material.  Are these files subject to search?  We would like
to prevent this.
Answer  
Subject: Re: Blocking searches of selected files
Answered By: aceresearcher-ga on 19 Nov 2002 12:44 PST
Rated: 5 out of 5 stars
 
davidsclarke,

There is indeed a way to discourage robots from crawling and indexing
certain areas of your site.

From "How to Set Up a robots.txt to Control Search Engine Spiders:
Using the Robots Exclusion Standard to exclude spiders" by Christopher
Heng, The SiteWizard.com:

"Writing a robots.txt file could not be easier. It's just an ASCII
text file that you place at the root of your domain. For example, if
your domain is
http://www.yourdomain.com, you will place the file at
http://www.yourdomain.com/robots.txt."

http://www.thesitewizard.com/archive/robotstxt.shtml


This sample robots.txt file tells all search engines (user-agent *
means "all") not to index any of the pages under
http://www.yourdomain.com/privatedirectory, and not to index
http://www.yourdomain.com/membership/privatefile.html :

User-agent: * 
Disallow: /privatedirectory/
Disallow: /membership/privatefile.html
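
If you would like to confirm how a well-behaved robot would read the
sample file above before you upload it, here is a minimal sketch using
Python's standard urllib.robotparser module. The helper name
is_allowed, the user-agent string "ExampleBot", and the page names are
placeholders I made up for illustration. It only models robots that
choose to obey the standard.

```python
from urllib.robotparser import RobotFileParser

# The sample rules from above, held in a string instead of a live file.
ROBOTS_TXT = """\
User-agent: *
Disallow: /privatedirectory/
Disallow: /membership/privatefile.html
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if a compliant crawler may fetch the URL under these rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    base = "http://www.yourdomain.com"
    # Hypothetical page names, chosen only to exercise each rule.
    for path in ("/index.html",
                 "/privatedirectory/minutes.html",
                 "/membership/privatefile.html"):
        print(path, is_allowed(ROBOTS_TXT, "ExampleBot", base + path))
```

A result of False means only that a robot honoring robots.txt would
skip the page; the page itself remains reachable by anyone who has
the address.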


The statement

User-agent: * 
Disallow: /

tells search engines not to index your site at all.


The statement 

User-agent: googlebot
Disallow: /

will tell Google not to index your site at all; however, other search
engines that come along will go ahead and spider your site, since you
have NOT asked them not to do so.
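
To see the difference between the two statements above, the following
sketch (again using Python's standard urllib.robotparser) asks each
rule set whether two robots may fetch the home page. "OtherBot" is a
made-up user-agent name standing in for any other search engine, and
may_crawl is a hypothetical helper name.

```python
from urllib.robotparser import RobotFileParser

BLOCK_EVERYONE = """\
User-agent: *
Disallow: /
"""

BLOCK_GOOGLE_ONLY = """\
User-agent: googlebot
Disallow: /
"""


def may_crawl(robots_txt, user_agent, url="http://www.yourdomain.com/"):
    """Return True if the named robot may fetch the URL under the given rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for label, rules in (("all robots blocked", BLOCK_EVERYONE),
                         ("googlebot blocked", BLOCK_GOOGLE_ONLY)):
        for agent in ("Googlebot", "OtherBot"):
            print(label, "|", agent, "|", may_crawl(rules, agent))
```

Under the second rule set only the robot whose name matches
"googlebot" is turned away; every other robot finds no rule that
applies to it and is therefore free to proceed.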


However, the SiteWizard offers this additional advice:
"Common Mistakes in Robots.txt:

1. It's Not Guaranteed to Work
As mentioned earlier, although the robots.txt format is listed in a
document called "A Standard for Robots Exclusion", not all spiders and
robots actually bother to heed it. Listing something in your
robots.txt is no guarantee that it will be excluded...[To be more
secure, you should probably implement password protection on those
areas of your site which you do not want to be accessible to the
general public.]

2. Don't List Your Secret Directories
Anyone can access your robots file, not just robots... Listing a
directory in a robots.txt file often attracts attention to the
directory! In fact, some spiders (like certain spammers' email
harvesting robots) make it a point to check the robots.txt for
excluded directories to spider.

3. Only One Directory/File per Disallow line
Don't try to be smart and put multiple directories on your Disallow
line..."
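
Mistake 3 is easy to catch mechanically. Here is a rough sketch of a
check that flags any Disallow line listing more than one path. The
function name find_multi_path_lines and the sample paths are my own
inventions; this is not part of any official validator.

```python
def find_multi_path_lines(robots_txt):
    """Return 1-based line numbers of Disallow lines that list several paths."""
    suspicious = []
    for number, raw in enumerate(robots_txt.splitlines(), start=1):
        line = raw.split("#", 1)[0].strip()      # ignore trailing comments
        field, sep, value = line.partition(":")
        if not sep or field.strip().lower() != "disallow":
            continue
        if len(value.split()) > 1:               # whitespace-separated paths
            suspicious.append(number)
    return suspicious


if __name__ == "__main__":
    # Hypothetical file with the mistake on line 2.
    sample = "User-agent: *\nDisallow: /cgi-bin/ /tmp/\nDisallow: /private/\n"
    print(find_multi_path_lines(sample))
```

Each flagged line should be split so that every directory or file
gets a Disallow line of its own.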


Furthermore, the SiteWizard advises:
"Even if you want all your directories to be accessed by spiders, a
simple robots file with the following may be useful:

User-agent: *
Disallow: 

With no file or directory listed in the Disallow line, you're implying
that every directory on your site may be accessed."


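To keep robots away from certain file types, such as your archived
document and Excel files, you can use the wildcard patterns below.
Note that the "*" and "$" pattern matching is an extension to the
original Robots Exclusion Standard that Googlebot supports; robots
that implement only the original standard may ignore these lines: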

User-agent: * 
Disallow: /*.doc$ # disallow access to Word Documents
Disallow: /*.xls$ # disallow access to Excel Spreadsheets
Disallow: /*.ppt$ # disallow access to PowerPoint Presentations
Disallow: /*.mdb$ # disallow access to Access Databases
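
To illustrate what these patterns are meant to match, here is a small
sketch that translates each one into a regular expression, treating
"*" as "any run of characters" and a trailing "$" as "end of the
path". This is my own approximation of the matching rule, not any
search engine's actual code, and the names pattern_to_regex and
is_blocked and the sample paths are hypothetical.

```python
import re

DISALLOW_PATTERNS = ["/*.doc$", "/*.xls$", "/*.ppt$", "/*.mdb$"]


def pattern_to_regex(pattern):
    """Translate a wildcard Disallow pattern into a compiled regex.

    '*' matches any run of characters; a trailing '$' anchors the end of
    the path. Everything else must match literally from the path's start.
    """
    anchored = pattern.endswith("$")
    body = pattern[:-1] if anchored else pattern
    regex = ".*".join(re.escape(part) for part in body.split("*"))
    return re.compile("^" + regex + ("$" if anchored else ""))


def is_blocked(path, patterns=DISALLOW_PATTERNS):
    """Return True if the path matches any of the Disallow patterns."""
    return any(pattern_to_regex(p).match(path) for p in patterns)


if __name__ == "__main__":
    for path in ("/archive/minutes.doc", "/archive/budget.xls",
                 "/index.html", "/archive/minutes.doc.html"):
        print(path, is_blocked(path))
```

Because of the trailing "$", a page such as /archive/minutes.doc.html
is not caught by the /*.doc$ pattern; only paths that actually end in
.doc are.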


Microsoft Knowledge Base Article - 217103 
"How to Write a Robots.txt File"
http://support.microsoft.com/default.aspx?scid=KB;en-us;q217103

"A Standard for Robot Exclusion", by Martijn Koster, The Web Robots
pages:
http://www.robotstxt.org/wc/norobots.html


If you want to check the syntax of your robots.txt file (or the
robots.txt file of any other site on the web), you may do so by
entering the path in Search Engine World's Spider Search Engine
Simulator tool at:
http://www.searchengineworld.com/cgi-bin/robotcheck.cgi


Information about adding username/password protection to your
webpages:

"Questions about Services and Scripts: Can I require a password for
access to a web page?", Massachusetts Institute of Technology (MIT)
http://www.mit.edu/faq/password.html

"Password protecting a directory from web display", College of
Agriculture and Life Sciences, University of Arizona (September 3,
2002):
http://ag.arizona.edu/ecat/web/password-protect.html


Search Strategy

how to set up a robots.txt file
://www.google.com/search?hl=en&ie=UTF-8&oe=UTF-8&q=how+to+set+up+a+robots.txt+file&btnG=Google+Search

disallow access to certain file types robots.txt
://www.google.com/search?hl=en&lr=&ie=UTF-8&oe=UTF-8&q=disallow+access+to+certain+file+types+robots.txt&btnG=Google+Search

how to require "username and password"
://www.google.com/search?hl=en&ie=UTF-8&oe=UTF-8&q=how+to+require+%22username+and+password%22&btnG=Google+Search


Before Rating my Answer, if you have questions, please post a Request
for Clarification, and I will be glad to see what I can do for you.
 
I hope this Answer provides you with exactly the information you
needed!
 
Regards, 
 
aceresearcher

Clarification of Answer by aceresearcher-ga on 19 Nov 2002 14:09 PST
davidsclarke,

Computer guru/Researcher webadept has very kindly provided me with
some additional information which you may find helpful:

"You could put all the files you don't want robots to visit in a
separate subdirectory, make that directory un-listable on the web (by
configuring your server), then place your files in there, and list
only the directory name in the /robots.txt. Now an ill-willed robot
can't traverse that directory unless you or someone else puts a direct
link on the web to one of your files, and then it's not /robots.txt
fault.

For example, rather than: 

User-Agent: * 
Disallow: /foo.html 
Disallow: /bar.html 

do: 
User-Agent: * 
Disallow: /norobots/ 

and make a "norobots" directory, put foo.html and bar.html into it,
and configure your server to not generate a directory listing for that
directory. Now all an attacker would learn is that you have a
"norobots" directory, but he won't be able to list the files in there;
he'd need to guess their names.

However, in practice this is a bad idea -- it's too fragile. Someone
may publish a link to your files on their site. Or it may turn up in a
publicly accessible log file, say of your user's proxy server, or maybe
it will show up in someone's web server log as a Referer. Or someone
may misconfigure your server at some future date, "fixing" it to show
a directory listing. Which leads me to the real answer:

The real answer is that /robots.txt is not intended for access
control, so don't try to use it as such. Think of it as a "No Entry"
sign, not a locked door. If you have files on your web site that you
don't want unauthorized people to access, then configure your server
to do authentication, and configure appropriate authorization. Basic
Authentication has been around since the early days of the web (and in
e.g. Apache on UNIX is trivial to configure), and if you're really
serious, SSL is commonplace in web servers."
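
For the curious, here is a rough sketch of what Basic Authentication
does behind the scenes: the browser sends the username and password,
joined by a colon and base64-encoded, in an "Authorization" header,
and the server compares them with what it has on file. The function
names and the sample credentials are made up for illustration; on a
real site you would let the web server (for example, Apache with
.htaccess) handle this rather than writing it yourself.

```python
import base64
import hmac


def basic_auth_header(username, password):
    """Build the Authorization header value a browser would send."""
    raw = "{}:{}".format(username, password).encode("utf-8")
    return "Basic " + base64.b64encode(raw).decode("ascii")


def is_authorized(header_value, expected_user, expected_password):
    """Check a header value against one expected username/password pair."""
    if not header_value or not header_value.startswith("Basic "):
        return False
    try:
        token = header_value[len("Basic "):]
        decoded = base64.b64decode(token, validate=True).decode("utf-8")
    except ValueError:                      # malformed base64 or bad UTF-8
        return False
    user, sep, password = decoded.partition(":")
    if not sep:
        return False
    # compare_digest avoids leaking information through comparison timing.
    user_ok = hmac.compare_digest(user.encode("utf-8"),
                                  expected_user.encode("utf-8"))
    pass_ok = hmac.compare_digest(password.encode("utf-8"),
                                  expected_password.encode("utf-8"))
    return user_ok and pass_ok


if __name__ == "__main__":
    # Hypothetical member credentials, for demonstration only.
    header = basic_auth_header("member", "fellowship-secret")
    print(is_authorized(header, "member", "fellowship-secret"))
    print(is_authorized(header, "member", "wrong-guess"))
```

Note that base64 is an encoding, not encryption, so the password
travels in a form anyone listening can decode. That is why the advice
above pairs Basic Authentication with SSL for anything truly
sensitive.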

HTAccess Authentication Tutorial 
http://faq.clever.net/htaccess.htm 

If you need further assistance on setting up a username/password
system for your website, I encourage you to post a question on that
subject (probably with a higher fee than this one), asking for
webadept by name.

Again, before Rating my Answer, if you have questions, please post a
Request for Clarification, and I will be glad to see what I can do for
you.
  
Regards,  
  
aceresearcher
davidsclarke-ga rated this answer: 5 out of 5 stars and gave an additional tip of: $5.00
A very helpful answer that we will act on.  Thanks very much.

Comments  
There are no comments at this time.
