davidsclarke,
There is indeed a way to discourage robots from crawling and indexing
certain areas of your site.
From "How to Set Up a robots.txt to Control Search Engine Spiders:
Using the Robots Exclusion Standard to Exclude Spiders" by Christopher
Heng, thesitewizard.com:
"Writing a robots.txt file could not be easier. It's just an ASCII
text file that you place at the root of your domain. For example, if
your domain is
http://www.yourdomain.com, you will place the file at
http://www.yourdomain.com/robots.txt."
http://www.thesitewizard.com/archive/robotstxt.shtml
This sample robots.txt file tells all search engines (user-agent *
means "all") not to index any of the pages under
http://www.yourdomain.com/privatedirectory, and not to index
http://www.yourdomain.com/membership/privatefile.html :
User-agent: *
Disallow: /privatedirectory/
Disallow: /membership/privatefile.html
The statement
User-agent: *
Disallow: /
tells search engines not to index your site at all.
The statement
User-agent: googlebot
Disallow: /
tells Google not to index your site at all; however, other search
engines that come along will go ahead and spider your site, since you
have NOT asked them not to do so.
However, the SiteWizard offers this additional advice:
"Common Mistakes in Robots.txt:
1. It's Not Guaranteed to Work
As mentioned earlier, although the robots.txt format is listed in a
document called "A Standard for Robots Exclusion", not all spiders and
robots actually bother to heed it. Listing something in your
robots.txt is no guarantee that it will be excluded...[To be more
secure, you should probably implement password protection on those
areas of your site which you do not want to be accessible to the
general public.]
2. Don't List Your Secret Directories
Anyone can access your robots file, not just robots... Listing a
directory in a robots.txt file often attracts attention to the
directory! In fact, some spiders (like certain spammers' email
harvesting robots) make it a point to check the robots.txt for
excluded directories to spider.
3. Only One Directory/File per Disallow line
Don't try to be smart and put multiple directories on your Disallow
line..."
Furthermore, the SiteWizard advises:
"Even if you want all your directories to be accessed by spiders, a
simple robots file with the following may be useful:
User-agent: *
Disallow:
With no file or directory listed in the Disallow line, you're implying
that every directory on your site may be accessed."
As for disabling access to certain file types, the following uses
wildcard patterns (note that the * and $ pattern-matching syntax is an
extension honored by Google and some other major engines; it is not
part of the original Robots Exclusion Standard, so lesser spiders may
ignore it):
User-agent: *
Disallow: /*.doc$ # disallow access to Word Documents
Disallow: /*.xls$ # disallow access to Excel Spreadsheets
Disallow: /*.ppt$ # disallow access to PowerPoint Presentations
Disallow: /*.mdb$ # disallow access to Access Databases
Microsoft Knowledge Base Article - 217103
"How to Write a Robots.txt File"
http://support.microsoft.com/default.aspx?scid=KB;en-us;q217103
"A Standard for Robot Exclusion", by Martijn Koster, The Web Robots
pages:
http://www.robotstxt.org/wc/norobots.html
If you want to check the syntax of your robots.txt file (or the
robots.txt file of any other site on the web), you may do so by
entering the path in Search Engine World's Spider Search Engine
Simulator tool at:
http://www.searchengineworld.com/cgi-bin/robotcheck.cgi
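You can also test your rules programmatically: Python's standard
library ships a robots.txt parser. This sketch checks the hypothetical
rules from the sample file above (the paths are the example ones, not
real directories):

```python
from urllib.robotparser import RobotFileParser

# The hypothetical rules from the sample robots.txt above
rules = """\
User-agent: *
Disallow: /privatedirectory/
Disallow: /membership/privatefile.html
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

# can_fetch(useragent, url) applies the rules for that user-agent
print(rp.can_fetch("*", "/privatedirectory/secret.html"))   # False
print(rp.can_fetch("*", "/membership/privatefile.html"))    # False
print(rp.can_fetch("*", "/index.html"))                     # True
```

Note that this checks only what a well-behaved robot would do; as the
advice above says, it is no guarantee that any given spider will
actually obey the rules.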
Information about adding username/password protection to your
webpages:
"Questions about Services and Scripts: Can I require a password for
access to a web page?", Massachusetts Institute of Technology (MIT)
http://www.mit.edu/faq/password.html
"Password protecting a directory from web display", College of
Agriculture and Life Sciences, University of Arizona (September 3,
2002):
http://ag.arizona.edu/ecat/web/password-protect.html
Search Strategy
how to set up a robots.txt file
http://www.google.com/search?hl=en&ie=UTF-8&oe=UTF-8&q=how+to+set+up+a+robots.txt+file&btnG=Google+Search
disallow access to certain file types robots.txt
http://www.google.com/search?hl=en&lr=&ie=UTF-8&oe=UTF-8&q=disallow+access+to+certain+file+types+robots.txt&btnG=Google+Search
how to require "username and password"
http://www.google.com/search?hl=en&ie=UTF-8&oe=UTF-8&q=how+to+require+%22username+and+password%22&btnG=Google+Search
Before Rating my Answer, if you have questions, please post a Request
for Clarification, and I will be glad to see what I can do for you.
I hope this Answer provides you with exactly the information you
needed!
Regards,
aceresearcher
Clarification of Answer by aceresearcher-ga on 19 Nov 2002 14:09 PST
davidsclarke,
Computer guru/Researcher webadept has very kindly provided me with
some additional information which you may find helpful:
"You could put all the files you don't want robots to visit in a
separate subdirectory, make that directory un-listable on the web (by
configuring your server), then place your files in there, and list
only the directory name in the /robots.txt. Now an ill-willed robot
can't traverse that directory unless you or someone else puts a direct
link on the web to one of your files, and then it's not /robots.txt's
fault.
For example, rather than:
User-Agent: *
Disallow: /foo.html
Disallow: /bar.html
do:
User-Agent: *
Disallow: /norobots/
and make a "norobots" directory, put foo.html and bar.html into it,
and configure your server to not generate a directory listing for that
directory. Now all an attacker would learn is that you have a
"norobots" directory, but he won't be able to list the files in there;
he'd need to guess their names.
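On Apache, disabling the directory listing that webadept describes can
be done with an .htaccess file placed inside that directory. This is a
minimal sketch (the "norobots" directory name comes from the example
above, and it assumes your server is configured to allow .htaccess
overrides):

```apacheconf
# .htaccess inside the norobots/ directory:
# tell Apache not to generate an automatic index (file listing)
# when someone requests the directory itself
Options -Indexes
```

With this in place, a request for the bare directory returns a
"403 Forbidden" error instead of a list of its files.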
However, in practice this is a bad idea -- it's too fragile. Someone
may publish a link to your files on their site. Or it may turn up in a
publicly accessible log file, say of your users' proxy server, or maybe
it will show up in someone's web server log as a Referer. Or someone
may misconfigure your server at some future date, "fixing" it to show
a directory listing. Which leads me to the real answer:
The real answer is that /robots.txt is not intended for access
control, so don't try to use it as such. Think of it as a "No Entry"
sign, not a locked door. If you have files on your web site that you
don't want unauthorized people to access, then configure your server
to do authentication, and configure appropriate authorization. Basic
Authentication has been around since the early days of the web (and in
e.g. Apache on UNIX is trivial to configure), and if you're really
serious, SSL is commonplace in web servers."
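For reference, a minimal Apache Basic Authentication setup might look
like the following. This is a sketch only: the AuthUserFile path and
the realm name "Members Only" are placeholders, and the tutorial below
walks through the details:

```apacheconf
# .htaccess in the directory you want to protect
AuthType Basic
AuthName "Members Only"
# password file created with Apache's htpasswd utility;
# keep it OUTSIDE the web-accessible document root
AuthUserFile /home/yoursite/.htpasswd
Require valid-user
```

Remember that Basic Authentication sends credentials effectively in
the clear, which is why the quote above recommends SSL if you are
really serious about protecting the content.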
HTAccess Authentication Tutorial
http://faq.clever.net/htaccess.htm
If you need further assistance on setting up a username/password
system for your website, I encourage you to post a question on that
subject (probably with a higher fee than this one), asking for
webadept by name.
Again, before Rating my Answer, if you have questions, please post a
Request for Clarification, and I will be glad to see what I can do for
you.
Regards,
aceresearcher