ealdahonda-ga:
Perhaps the easiest way is to set up a robots.txt file with the following rules:
User-agent: *
Disallow: /
User-agent: Slurp
Disallow: /
User-agent: YahooSeeker
Disallow: /
User-agent: Googlebot
Disallow:
Now, if you're picky like me, you can specify exactly which directories
Googlebot is forbidden from crawling. For instance, on my website,
Googlebot is able to crawl everything but the following directories:
User-agent: Googlebot
Disallow: /database/
Disallow: /images/
Disallow: /modules/
Disallow: /includes/
Disallow: /admin/
Now dealing with Inktomi is a bit of a different issue. First, there
are two main bots that we're dealing with: Slurp (the main robot) and
YahooSeeker. If you want to ban Inktomi entirely, disallow both
robots. If you just hate Slurp, ignore my recommendation and only
disallow Slurp. Both Inktomi and Yahoo note that the Inktomi robots
_will_ follow a valid robots.txt file.
(http://help.yahoo.com/help/us/ysearch/slurp/index.html) and
(http://216.239.63.104/search?q=cache:-4DYg5HC5EIJ:support.inktomi.com/searchfaq.html+Inktomi+%2Bslurp+%2Bstop&hl=en)
(cached because at the time of this writing, the site was down for
maintenance ^__^)
Inktomi crawls on behalf of a variety of different companies, and they
report that "[s]lurp sends an average of up to 60 requests per minute
to a single web server, and uses no more than one active connection to
a single web server. We determine a single "web server" by IP address,
so if your host is serving multiple IPs it may see higher levels of
activity."
Inktomi also notes that the problem may simply be that your robots.txt
file isn't actually readable by their crawler:
"Check that your /robots.txt file is readable by web clients from the
URL "http://mywebsite.com/robots.txt". Verify that the robots.txt
syntax is correct per the Robots Exclusion Standard. For performance
reasons, and to reduce the load on your web server, Slurp caches
robots.txt files internally. So if you have modified your exclusion
rules in robots.txt Slurp might not recognize the change immediately."
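A quick way to double-check the "readable by web clients" part is to
fetch the file yourself and look at what comes back. Here's a rough
sketch in PHP (assuming allow_url_fopen is enabled, and with
example.com standing in for your own domain):
<?php
// Fetch your own robots.txt the same way a crawler would.
// Replace example.com with your real domain.
$robots = @file_get_contents("http://example.com/robots.txt");
if ($robots === false) {
    print "Could not fetch robots.txt -- check the URL and file permissions.";
} else {
    print $robots; // these are the rules crawlers actually see
}
?>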
Further, "Disallow rules in /robots.txt apply to absolute paths, so
the disallow values must begin with a "/" to be effective.
Instructions for specific user-agent values apply instead of general
user-agent instructions. So if /robots.txt includes instruction lines
for "User-agent: Slurp", only those instructions will apply to Slurp.
Any instructions for "User-agent: *" will be ignored if a more
specific user-agent match exists."
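As a small illustration of that last point: with a file like the one
below, Slurp will obey only its own section and will still crawl
/images/, because the "User-agent: Slurp" block replaces (rather than
adds to) the "User-agent: *" block for that robot:
User-agent: *
Disallow: /images/
Disallow: /admin/

User-agent: Slurp
Disallow: /admin/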
Remember, Inktomi scours the web for a variety of different search
engines, including, but not limited to: AllTheWeb (which powers Lycos),
AltaVista, Inktomi partner sites (such as MSN), and Yahoo.
Finally, I thought I'd pass along this little gem, aptly titled
'Spider Trap', that I came across on Webmaster World. It's a
three-parter, but it has had quite a bit of success with other folks on
the board, who have been using it to alert themselves to the various
incarnations of badly behaved robots/spiders.
This was originally posted by Birdman, on the Webmaster World message
boards (http://www.webmasterworld.com/forum88/3104.htm) (subscription
required).
I include the notes for better clarity:
--------------------
*Notes:
1. Add the robots.txt snippet days before luring bots to the trap
(the lure is simply a link to getout.php placed somewhere on your
pages). This gives the good bots time to read the disallow and obey;
anything that requests getout.php after that is, by definition,
ignoring your robots.txt.
2. chmod .htaccess to 666 and getout.php to 755.
3. The .htaccess snippet below uses a solid pipe (|); the original
forum post showed a broken pipe (¦) because of how the board renders
it, so make sure yours is the solid one.
4. Edit getout.php with the real path to your .htaccess file, and
also change the email address to your own so you will receive the
"spider alert".
1.
Robots.txt
User-agent: *
Disallow: /getout.php
2.
.htaccess (keep this code at the top of the file)
SetEnvIf Request_URI "^(/403.*\.htm|/robots\.txt)$" allowsome
<Files *>
order deny,allow
deny from env=getout
allow from env=allowsome
</Files>
3.
getout.php
<?php
// Path to the .htaccess file that the ban rules will be prepended to.
$filename = "/var/www/html/.htaccess";

// Build a SetEnvIf line that flags this visitor's IP as "getout".
$content = "SetEnvIf Remote_Addr ^"
    . str_replace(".", "\.", $_SERVER["REMOTE_ADDR"]) . "$ getout\r\n";

// Read the existing .htaccess so it can be appended after the ban line.
$handle = fopen($filename, 'r');
$content .= fread($handle, filesize($filename));
fclose($handle);

// Write everything back, with the new ban line now at the top of the file.
$handle = fopen($filename, 'w+');
fwrite($handle, $content, strlen($content));
fclose($handle);

// Email yourself the details of the trapped spider.
mail("me@mysite.com",
     "Spider Alert!",
     "The following ip just got banned because it accessed the spider trap.\r\n\r\n"
     . $_SERVER["REMOTE_ADDR"] . "\r\n"
     . $_SERVER["HTTP_USER_AGENT"] . "\r\n"
     . $_SERVER["HTTP_REFERER"],
     "FROM: trap@mysite.com");
print "Goodbye!";
?>
Obviously, me@mysite.com is changed to the appropriate email address.
This will let you ban Inktomi (or others) outright when they decide to
be annoying little snits and ignore your otherwise valid robots.txt
file.
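For what it's worth, once a bot trips the trap, the top of your
.htaccess should end up looking something like this (the IP is just a
made-up example):
SetEnvIf Remote_Addr ^192\.0\.2\.55$ getout
SetEnvIf Request_URI "^(/403.*\.htm|/robots\.txt)$" allowsome
<Files *>
order deny,allow
deny from env=getout
allow from env=allowsome
</Files>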
I hope this helps to fix the problem. Good luck!
bitmaven-ga