ealdahonda-ga:
Perhaps the easiest way is to set up a robots.txt file with the following rules:
User-agent: *
Disallow: /
User-agent: Slurp
Disallow: /
User-agent: YahooSeeker
Disallow: /
User-agent: Googlebot
Disallow:
Now, if you're picky like me, you can specify exactly which directories
Googlebot is forbidden from crawling. For instance, on my website,
Googlebot is able to crawl everything but the following directories:
User-agent: Googlebot
Disallow: /database/
Disallow: /images/
Disallow: /modules/
Disallow: /includes/
Disallow: /admin/
Now dealing with Inktomi is a bit of a different issue. First, there
are two main bots that we're dealing with: Slurp (the main robot) and
YahooSeeker. If you want to ban Inktomi entirely, disallow both
robots. If you just hate Slurp, ignore my recommendation and only
disallow Slurp. Both Inktomi and Yahoo note that the Inktomi robots
_will_ follow a valid robots.txt file.
(http://help.yahoo.com/help/us/ysearch/slurp/index.html) and
(http://216.239.63.104/search?q=cache:-4DYg5HC5EIJ:support.inktomi.com/searchfaq.html+Inktomi+%2Bslurp+%2Bstop&hl=en)
(cached because at the time of this writing, the site was down for
maintenance ^__^)
Inktomi crawls on behalf of a variety of different companies, and they
report that "[s]lurp sends an average of up to 60 requests per minute
to a single web server, and uses no more than one active connection to
a single web server. We determine a single "web server" by IP address,
so if your host is serving multiple IPs it may see higher levels of
activity."
Inktomi also notes that the problem may simply be that your robots.txt
file isn't actually readable by their crawler:
"Check that your /robots.txt file is readable by web clients from the
URL "http://mywebsite.com/robots.txt". Verify that the robots.txt
syntax is correct per the Robots Exclusion Standard. For performance
reasons, and to reduce the load on your web server, Slurp caches
robots.txt files internally. So if you have modified your exclusion
rules in robots.txt Slurp might not recognize the change immediately."
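A quick way to double-check the "readable by web clients" part is to
fetch the file yourself and look at what comes back. Here's a rough
sketch in PHP (assuming allow_url_fopen is enabled, and with
example.com standing in for your own domain):
<?php
// Fetch your own robots.txt the same way a crawler would.
// Replace example.com with your real domain.
$robots = @file_get_contents("http://example.com/robots.txt");
if ($robots === false) {
    print "Could not fetch robots.txt -- check the URL and file permissions.";
} else {
    print $robots; // these are the rules crawlers actually see
}
?>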
Further, "Disallow rules in /robots.txt apply to absolute paths, so
the disallow values must begin with a "/" to be effective.
Instructions for specific user-agent values apply instead of general
user-agent instructions. So if /robots.txt includes instruction lines
for "User-agent: Slurp", only those instructions will apply to Slurp.
Any instructions for "User-agent: *" will be ignored if a more
specific user-agent match exists."
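As a small illustration of that last point: with a file like the one
below, Slurp will obey only its own section and will still crawl
/images/, because the "User-agent: Slurp" block replaces (rather than
adds to) the "User-agent: *" block for that robot:
User-agent: *
Disallow: /images/
Disallow: /admin/

User-agent: Slurp
Disallow: /admin/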
Remember, Inktomi scours the web for a variety of different search
engines, including, but not limited to: AllTheWeb (which powers Lycos),
AltaVista, Inktomi partner sites (such as MSN), and Yahoo.
Finally, I thought I'd pass along this little gem, aptly titled
'Spider Trap', that I came across on Webmaster World. It's a
three-parter, but it has had quite a bit of success with other folks on
the board, who have been using it to alert themselves to the various
incarnations of badly behaved robots/spiders.
This was originally posted by Birdman, on the Webmaster World message
boards (http://www.webmasterworld.com/forum88/3104.htm) (subscription
required).
I include the notes for better clarity:
--------------------
*Notes:
1. Add the robots.txt snippet days before luring bots to the trap
(the lure is simply a link to getout.php placed somewhere on your
pages). This gives the good bots time to read the disallow and obey;
anything that requests getout.php after that is, by definition,
ignoring your robots.txt.
2. chmod .htaccess to 666 and getout.php to 755.
3. The .htaccess snippet below uses a solid pipe (|); the original
forum post showed a broken pipe (¦) because of how the board renders
it, so make sure yours is the solid one.
4. Edit getout.php with the real path to your .htaccess file, and
also change the email address to your own so you will receive the
"spider alert".
1.
Robots.txt
User-agent: *
Disallow: /getout.php
2.
.htaccess (keep this code at the top of the file)
SetEnvIf Request_URI "^(/403.*\.htm|/robots\.txt)$" allowsome
<Files *>
order deny,allow
deny from env=getout
allow from env=allowsome
</Files>
3.
getout.php
<?php
// Path to the .htaccess file that the ban rules will be prepended to.
$filename = "/var/www/html/.htaccess";

// Build a SetEnvIf line that flags this visitor's IP as "getout".
$content = "SetEnvIf Remote_Addr ^"
    . str_replace(".", "\.", $_SERVER["REMOTE_ADDR"]) . "$ getout\r\n";

// Read the existing .htaccess so it can be appended after the ban line.
$handle = fopen($filename, 'r');
$content .= fread($handle, filesize($filename));
fclose($handle);

// Write everything back, with the new ban line now at the top of the file.
$handle = fopen($filename, 'w+');
fwrite($handle, $content, strlen($content));
fclose($handle);

// Email yourself the details of the trapped spider.
mail("me@mysite.com",
     "Spider Alert!",
     "The following ip just got banned because it accessed the spider trap.\r\n\r\n"
     . $_SERVER["REMOTE_ADDR"] . "\r\n"
     . $_SERVER["HTTP_USER_AGENT"] . "\r\n"
     . $_SERVER["HTTP_REFERER"],
     "FROM: trap@mysite.com");
print "Goodbye!";
?>
Obviously, me@mysite.com is changed to the appropriate email address.
This will let you ban Inktomi (or others) outright when they decide to
be annoying little snits and ignore your otherwise valid robots.txt
file.
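For what it's worth, once a bot trips the trap, the top of your
.htaccess should end up looking something like this (the IP is just a
made-up example):
SetEnvIf Remote_Addr ^192\.0\.2\.55$ getout
SetEnvIf Request_URI "^(/403.*\.htm|/robots\.txt)$" allowsome
<Files *>
order deny,allow
deny from env=getout
allow from env=allowsome
</Files>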
I hope this helps to fix the problem. Good luck!
bitmaven-ga