Q: Is it legal to crawl search engines and display the results on another site (Answered, 5 out of 5 stars, 4 Comments)
Question  
Subject: Is it legal to crawl search engines and display the results on another site
Category: Computers
Asked by: searching777-ga
List Price: $30.00
Posted: 09 Dec 2004 12:08 PST
Expires: 08 Jan 2005 12:08 PST
Question ID: 440474
Hi,

I have a specialist search engine and have recently had the problem of
being crawled and the results being displayed in another site.

Is there any protection against this kind of thing?

I have looked around on the net. There are thousands of sites running
crawler software [e.g. Smartsearch, offered by
http://smarterscripts.com/, which crawls ODP, MSN and/or AltaVista and
is apparently used by thousands of sites], which seems to suggest there
is no protection. I have also found freelance jobs posted specifically
asking for crawlers to be made for many engines: Google, AltaVista
images and videos, etc.

Here's an example http://search-them-all.com/

I had thought that metasearches must be licenced. But it seems I was being naive.

Also I found this article
http://www.ivanhoffman.com/database.html

which further suggests that there is no protection in this matter.

If even the mighty MSN is being crawled regularly by specialist
software it seems like there is no chance. :(

All this has led me to think why don't I just start crawling other
engines myself to expand my site? !!!

What I don't get is this: I would have thought that there was a way to
block being crawled.

-- A few questions in the above ramble:

1. Is it legal to crawl a search engine and display the results in a
non-related site in search engine style [different colours etc. of
course] without permission?

2. If yes any exceptions.

3. Any way to block crawlers.

Thanks
Answer  
Subject: Re: Is it legal to crawl search engines and display the results on another site
Answered By: sgtcory-ga on 20 Dec 2004 06:16 PST
Rated:5 out of 5 stars
 
Hello searching777,

Thanks for the questions. I have some experience within this market,
so I'll do my best to fully answer your questions.


1. Is it legal to crawl a search engine and ...?
-------------------------------------------------------------------------------

If the website being crawled offers a 'terms of use' (or anything
similar), then usage of that website is governed by those terms. Here
is an example from Google's agreement, stating that 'metasearching' or
crawling of their content is not permitted:

Google - Personal Use Only
"You may not take the results from a Google search and reformat and
display them, or mirror the Google home page or results pages on your
Web site. You may not "meta-search" Google..."
http://www.google.com/terms_of_service.html


Here is an excerpt from the MSN website as well, that seems very clear :

MSN Terms of Use
"The MSN Web Sites are only for your personal use. You will not use
the MSN Web Sites for commercial purposes....you may not use the MSN
Web Sites in any manner that could damage, disable, overburden, or
impair any MSN Web Site..."
http://privacy.msn.com/tou/


Let's look past the database issue, and skip right to bandwidth and
server drain. When a remote computer crawls another website, it uses
bandwidth and resources paid for by the company that is being crawled.
In some cases the information may be free, but the process of
retrieving that information comes at a cost that is covered by the
company hosting the information.

In short, the best protection is to offer your users a terms of use
that is clear, and warns against illegal usage. This will give you
solid ground to stand on should any litigation arise.

Many important websites carry these agreements. Here is an example of
the Superior Courts of California's agreement :

County Website within the Superior Courts of CA
http://www.siskiyou.courts.ca.gov/disclaimer.asp


2. If yes any exceptions
-------------------------------------------------------------------------------

There are exceptions. It depends on the information being requested,
and the guidelines of the offering entity. The DMOZ may be one
exception, although their agreement says nothing about live retrieval
of their data; rather, it refers to the usage of their RDF dumps for
local use. I couldn't find a single site whose terms directly allow you
to crawl their content, although I am certain such sites exist. When in
doubt, the best approach is to ask. I did this with a few companies in
Ireland and the United Kingdom, and a majority of them allowed me to
crawl their content, simply because I was the only one to ask.


3. Any way to block crawlers
-------------------------------------------------------------------------------

There are a couple of methods, with the robots exclusion being the
preferred method :

Robots Exclusion
http://www.robotstxt.org/wc/exclusion.html
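
As a rough illustration of how a well-behaved crawler reads a robots
exclusion file, the sketch below feeds a sample robots.txt to Python's
standard urllib.robotparser module. The crawler name "ExampleMetaBot",
the /search path and the example.com URLs are made up for the example.
It only shows how compliant bots interpret the file; it does nothing
against crawlers that ignore it.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: ban one named crawler entirely and keep
# every other robot out of the search results pages.
ROBOTS_TXT = """\
User-agent: ExampleMetaBot
Disallow: /

User-agent: *
Disallow: /search
"""


def is_allowed(user_agent: str, url: str) -> bool:
    """Return True if a compliant crawler with this name may fetch url."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    checks = [
        ("ExampleMetaBot", "http://example.com/index.html"),
        ("SomeOtherBot", "http://example.com/search?q=test"),
        ("SomeOtherBot", "http://example.com/index.html"),
    ]
    for agent, url in checks:
        print(agent, url, is_allowed(agent, url))
```

With this file, the named crawler is refused everywhere, while other
robots are only asked to stay out of /search.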

If the crawler does not adhere to the robots exclusion, you can use a
firewall to block access, assuming the I.P. address(es) are known.
Here is an example firewall for a *nix web server :

KISS Firewall
http://www.geocities.com/steve93138/

This simple firewall allows you to add individual I.P. addresses as
well as ranges simply by dropping them into a configuration file.
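
The matching a firewall does against such a list can be sketched with
Python's standard ipaddress module. This is not the KISS firewall's
configuration format, only an illustration of checking one address
against individual addresses and ranges. The blocklist entries are
placeholder addresses from the documentation ranges, not real crawlers.

```python
import ipaddress

# Hypothetical blocklist: single addresses and CIDR ranges
# (documentation-only address space, not real crawler addresses).
BLOCKLIST = ["192.0.2.15", "198.51.100.0/24"]

# A bare address parses as a one-address network, so both forms
# can be handled the same way.
BLOCKED_NETWORKS = [ipaddress.ip_network(entry) for entry in BLOCKLIST]


def is_blocked(address: str) -> bool:
    """Return True if the address falls inside any blocked network."""
    ip = ipaddress.ip_address(address)
    return any(ip in network for network in BLOCKED_NETWORKS)


if __name__ == "__main__":
    for candidate in ["192.0.2.15", "192.0.2.16", "198.51.100.77"]:
        print(candidate, is_blocked(candidate))
```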

If these methods don't work, then you can always contact their
internet provider, stating clearly which terms of your agreement are
being broken. Most upstream/hosting providers understand these issues,
as they too carry usage guidelines that they do not want to see
abused.


To assist with this answer, I referred directly to the terms of use on
a few search engines. Most of the information provided is from first
hand experience.

Should you need further clarification, please do not hesitate to ask.
I will do my best to assist!

SgtCory

Request for Answer Clarification by searching777-ga on 20 Dec 2004 07:06 PST
Hi, thanks for your answer. 

Interesting to read up on terms of service. That is pretty clear.

One point you did not answer was opinion given here

http://www.ivanhoffman.com/database.html

---

dmoz actually lists software services that grab results live btw
http://dmoz.org/Computers/Internet/Searching/Directories/Open_Directory_Project/Use_of_ODP_Data/Upload_Tools/

phpODP for instance.

I guess ip blocking is the best way to be sure.

Clarification of Answer by sgtcory-ga on 20 Dec 2004 07:31 PST
Thanks for the request.

"dmoz actually lists software services that grab results live btw.."

I knew they listed sites that use their live results. This is probably
an example of several departments not collaborating, and Netscape
keeping a tight grip on the 'free' data.

Here's the terms of use :

Terms of Use
"You will not disrupt the functioning of the ODP or otherwise act in a
way that interferes with other users' use of the ODP"

And here is the page that tells you how to get the data. You'll notice
that there are no options for live data usage :

Getting the DMOZ Data
http://dmoz.org/help/getdata.html

If we can reasonably assume that 10,000 sites are using live data, we
can see the effect this would have, and has had, on the availability of
the DMOZ website at times. It's a 'feature' they can do away with, or
say they never agreed to allow, at a time of their choosing.

"One point you did not answer was opinion given here.."

It's a very old argument - 'Data wants to be free'. It does. I'm not
certain how well this would hold up in the courts. The internet has
had a major effect on many things, including how law is defined. If
the page you reference was the definitive source, then the license
Netscape has on the DMOZ data would be invalid.

Thanks for the great clarification request. This is truly a question
I've enjoyed answering. Should you still need more clarification -
I'll be happy to assist.

SgtCory

Request for Answer Clarification by searching777-ga on 20 Dec 2004 09:42 PST
"If the page you reference was the definitive source"

....interesting. That's maybe where the crawlers get their legs, so to
speak...anyway...thanks for the info. All very interesting.

If you dig up/find anything new anytime please post it.

Thanks

Clarification of Answer by sgtcory-ga on 20 Dec 2004 11:47 PST
Thank you for the great rating.

I've spent the last hour or so looking at most of the major search
providers' terms of use. All of them have similar wordings and
licensing schemes. I ran a metasearch function for a while and decided
it would be in my best interest to contact these companies
individually. Almost all replied with the same response, which was to
work out a commercial agreement with them.

I hope this helps - and best of luck with your venture(s)!

SgtCory
searching777-ga rated this answer:5 out of 5 stars

Comments  
Subject: Re: Is it legal to crawl search engines and display the results on another site
From: crythias-ga on 09 Dec 2004 20:25 PST
 
Stop legit bots from crawling:
http://www.searchengineworld.com/robots/robots_tutorial.htm
This is a free comment.

Of course, if unscrupulous crawl bots ignore the robots.txt file, all
bets are off, but at least you'll stop the legit bots from crawling
your site.

Is it legal? Sorta kinda, but I bet Copyright Lawyers would say your
content is copyright to you and unauthorized reproduction is
prohibited, etc. etc. etc.
Subject: Re: Is it legal to crawl search engines and display the results on another site
From: southof40-ga on 15 Dec 2004 12:33 PST
 
Can't speak for legal stuff - here are some technical ideas.

One thing you can do is to insert some of your content via Javascript.
Most crawlers are either not capable of executing the JS within a
crawled page or would find the effort to do so too expensive given the
number of pages they would have to crawl.

Some crawlers reveal who they are via the User Agent string delivered
along with the request for the page - obviously it's then possible to
drop all requests from such UA's into a bucket. However I have
identified crawlers who are faking their UA (and so would almost
certainly be ignoring any robots.txt you might have put in place).

Another suggestion, at least with respect to identifying them (perhaps
for the purposes of IP blocking): crawlers rarely fetch the graphics
associated with a page. So embed a non-existent graphic in your page
(make it small and it won't trouble real users) and then wait to see
which UAs don't ask for it. They should be asking for it, because it
cannot be cached anywhere when you've never issued it! A UA that never
asks is probably a crawler (of course it could be a legitimate user
who has graphics turned off, but they are few and far between).
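
The image check could be run over an access log along these lines. This
is only a sketch: it assumes the log has already been parsed into
(ip, user_agent, path) tuples, and the beacon path /img/beacon.gif is a
made-up name. Clients are keyed by IP and UA together, and anyone
browsing with graphics turned off will be flagged too.

```python
from collections import defaultdict

# Hypothetical path of the tiny image embedded in every page.
BEACON_PATH = "/img/beacon.gif"


def suspected_crawlers(requests):
    """Find clients that fetched pages but never requested the beacon.

    requests: iterable of (ip, user_agent, path) tuples from an access log.
    Returns a sorted list of (ip, user_agent) pairs.
    """
    pages = defaultdict(int)   # client -> number of non-beacon requests
    fetched_beacon = set()     # clients that asked for the beacon image
    for ip, user_agent, path in requests:
        client = (ip, user_agent)
        if path == BEACON_PATH:
            fetched_beacon.add(client)
        else:
            pages[client] += 1
    return sorted(c for c in pages if c not in fetched_beacon)


if __name__ == "__main__":
    sample_log = [
        ("192.0.2.10", "Mozilla/4.0", "/results?q=a"),
        ("192.0.2.10", "Mozilla/4.0", BEACON_PATH),
        ("198.51.100.7", "Mozilla/4.0", "/results?q=a"),
        ("198.51.100.7", "Mozilla/4.0", "/results?q=b"),
    ]
    print(suspected_crawlers(sample_log))
```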
Subject: Re: Is it legal to crawl search engines and display the results on another site
From: tigerheart-ga on 20 Dec 2004 18:31 PST
 
depending on what scripting language you use serverside (I use php for
example), you can create a simple script to include at the start of
each of your files that will deter (most of) these crawlers and
bots...

here's an example I use for my pages:

<?php 
    $botlist = array(    
                "Teoma",                    
                "alexa", 
                "froogle", 
                "inktomi", 
                "looksmart", 
                "URL_Spider_SQL", 
                "Firefly", 
                "NationalDirectory", 
                "Ask Jeeves", 
                "TECNOSEEK", 
                "InfoSeek", 
                "WebFindBot", 
                "girafabot", 
                "crawler", 
                "www.galaxy.com", 
                "Googlebot", 
                "Scooter", 
                "Slurp", 
                "appie", 
                "FAST", 
                "WebBug", 
                "Spade", 
                "ZyBorg", 
                "rabaz"); 

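    // $HTTP_USER_AGENT depends on register_globals and ereg() was removed
    // in PHP 7, so read the header from $_SERVER and use stripos() instead.
    $agent = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';

    foreach ($botlist as $bot) {
        // Case-insensitive substring match against the User-Agent header.
        if (stripos($agent, $bot) !== false) {
            exit(); // send the bot an empty page
        }
    }
?>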

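this little script catches bots and crawlers whose User-Agent string
contains one of the names in the list, and gives them a blank page when
they try to visit the site. Note that the list includes regular search
engine bots such as Googlebot, so trim it if you still want those
engines to index your pages.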
I can prepare a script like this for ASP if you like too... just let me know.
Subject: Re: Is it legal to crawl search engines and display the results on another site
From: searching777-ga on 23 Dec 2004 17:04 PST
 
Hey tigerheart,

THANKS!..Really appreciate the free code.
I use php so no need for the asp version.

:)

searching777
