Google Answers: Is it legal to crawl search engines and display the results on another site

Q: Is it legal to crawl search engines and display the results on another site (Answered, 5 out of 5 stars)

Question

Subject: Is it legal to crawl search engines and display the results on another site
Category: Computers
Asked by: searching777-ga
List Price: $30.00

Posted: 09 Dec 2004 12:08 PST
Expires: 08 Jan 2005 12:08 PST
Question ID: 440474

Hi,

I have a specialist search engine and have recently had the problem of
being crawled and the results being displayed in another site.

Is there any protection against this kind of thing?

I have looked around on the net. There are thousands of sites running
crawler software [e.g. Smartsearch, offered by
http://smarterscripts.com/, which crawls ODP, MSN and/or AltaVista and
is apparently used by thousands of sites], which seems to suggest there
is no protection. I have also found freelance jobs posted specifically
asking for crawlers to be made for many engines: Google, AltaVista
images and videos, etc.

Here's an example http://search-them-all.com/

I had thought that metasearches must be licensed. But it seems I was being naive.

Also I found this article
http://www.ivanhoffman.com/database.html

which further suggests that there is no protection in this matter.

If even the mighty MSN is being crawled regularly by specialist
software it seems like there is no chance. :(

All this has led me to think why don't I just start crawling other
engines myself to expand my site? !!!

What I don't get is I would have thought that there was a way to block
being crawled.

-- A few questions in the above ramble:

1. Is it legal to crawl a search engine and display the results on an
unrelated site in search engine style [different colours etc. of
course] without permission?

2. If yes, are there any exceptions?

3. Is there any way to block crawlers?

Thanks

Answer

Subject: Re: Is it legal to crawl search engines and display the results on another site
Answered By: sgtcory-ga on 20 Dec 2004 06:16 PST
Rated: 5 out of 5 stars

Hello searching777,

Thanks for the questions. I have some experience within this market,
so I'll do my best to fully answer your questions.

1. Is it legal to crawl a search engine and ...?
-------------------------------------------------------------------------------

If the website being crawled offers a 'terms of use' (or anything
similar), then usage of that website is governed by those terms. Here
is an example of Google's agreement, stating that 'metasearching' or
crawling of their content is not permitted:

Google - Personal Use Only
"You may not take the results from a Google search and reformat and
display them, or mirror the Google home page or results pages on your
Web site. You may not "meta-search" Google..."
http://www.google.com/terms_of_service.html

Here is an excerpt from the MSN website as well, which seems very
clear:

MSN Terms of Use
"The MSN Web Sites are only for your personal use. You will not use
the MSN Web Sites for commercial purposes....you may not use the MSN
Web Sites in any manner that could damage, disable, overburden, or
impair any MSN Web Site..."
http://privacy.msn.com/tou/

Let's look past the database issue, and skip right to bandwidth and
server drain. When a remote computer crawls another website, it uses
bandwidth and resources paid for by the company that is being crawled.
In some cases the information may be free, but the process of
retrieving that information comes at a cost that is covered by the
company hosting the information.

In short, the best protection is to offer your users a terms of use
that is clear and warns against unauthorized usage. This will give you
solid ground to stand on should any litigation arise. Many important
websites carry these agreements. Here is an example of the Superior
Courts of California's agreement:

County Website within the Superior Courts of CA
http://www.siskiyou.courts.ca.gov/disclaimer.asp

2. If yes any exceptions
-------------------------------------------------------------------------------

There are exceptions. It depends on the information being requested,
and the guidelines of the offering entity. The DMOZ may be one
exception, although their agreement says nothing about live retrieval
of their data; rather, it refers to the usage of their RDF dumps for
local use. I couldn't find one example that directly allows you to
crawl their content, although I am certain they exist.

When in doubt, the best approach is to ask. I did this with a few
companies in Ireland and the United Kingdom, and a majority of them
allowed me to crawl their content, simply because I was the only one
to ask.

3. Any way to block crawlers
-------------------------------------------------------------------------------

There are a couple of methods, with the robots exclusion being the
preferred method:

Robots Exclusion
http://www.robotstxt.org/wc/exclusion.html

If the crawler does not adhere to the robots exclusion, you can use a
firewall to block access, assuming the I.P. address(es) are known.
Here is an example firewall for a *nix web server:

KISS Firewall
http://www.geocities.com/steve93138/

This simple firewall allows you to add individual I.P. addresses as
well as ranges simply by dropping them into a configuration file.

If these methods don't work, then you can always contact their
internet provider, stating clearly which terms of your agreement are
being broken. Most upstream/hosting providers understand these issues,
as they too carry usage guidelines that they do not want to see
abused.

To assist with this answer, I referred directly to the terms of use on
a few search engines. Most of the information provided is from first
hand experience.

Should you need further clarification, please do not hesitate to ask.
I will do my best to assist!

SgtCory
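
As a rough illustration of the same IP-blocking idea at the application
level (this is not the KISS firewall, and it is no substitute for a
real firewall), a PHP include could refuse requests from listed
addresses and ranges. The blocklist entries and the function names
ip_matches() and is_blocked() are hypothetical, the addresses are
documentation-only examples, and the sketch assumes IPv4 on 64-bit PHP.

```php
<?php
// Hypothetical blocklist: single IPv4 addresses and CIDR ranges.
$blocked = array('192.0.2.15', '198.51.100.0/24');

// Return true if IPv4 address $ip equals $entry or lies inside the
// CIDR range given by $entry (e.g. "198.51.100.0/24").
function ip_matches($ip, $entry) {
    if (strpos($entry, '/') === false) {
        return $ip === $entry;
    }
    list($network, $bits) = explode('/', $entry, 2);
    $bits    = (int) $bits;
    $ipLong  = ip2long($ip);
    $netLong = ip2long($network);
    if ($ipLong === false || $netLong === false || $bits < 0 || $bits > 32) {
        return false; // not a valid IPv4 address or prefix length
    }
    // Build the netmask, e.g. /24 becomes 255.255.255.0.
    $mask = (~0 << (32 - $bits)) & 0xFFFFFFFF;
    return ($ipLong & $mask) === ($netLong & $mask);
}

// Return true if $ip matches any entry in the blocklist.
function is_blocked($ip, array $blocked) {
    foreach ($blocked as $entry) {
        if (ip_matches($ip, $entry)) {
            return true;
        }
    }
    return false;
}

// At the top of each page: refuse blocked clients with a 403.
$remote = isset($_SERVER['REMOTE_ADDR']) ? $_SERVER['REMOTE_ADDR'] : '';
if (is_blocked($remote, $blocked)) {
    http_response_code(403);
    exit();
}
```

Blocking in PHP still costs a request per hit, so a firewall rule
remains the cheaper option once the offending addresses are known.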
Request for Answer Clarification by searching777-ga on 20 Dec 2004 07:06 PST

Hi, thanks for your answer. Interesting to read up on terms of
service. That is pretty clear.

One point you did not answer was the opinion given here:
http://www.ivanhoffman.com/database.html

---

dmoz actually lists software services that grab results live, btw:
http://dmoz.org/Computers/Internet/Searching/Directories/Open_Directory_Project/Use_of_ODP_Data/Upload_Tools/
phpODP, for instance.

I guess IP blocking is the best way to be sure.
Clarification of Answer by sgtcory-ga on 20 Dec 2004 07:31 PST

Thanks for the request.

"dmoz actually lists software services that grab results live btw.."

I knew they listed sites that use their live results. This is probably
an example of several departments not collaborating, and Netscape
keeping a tight grip on the 'free' data. Here's the terms of use:

Terms of Use
"You will not disrupt the functioning of the ODP or otherwise act in a
way that interferes with other users' use of the ODP"

And here is the page that tells you how to get the data. You'll notice
that there are no options for live data usage:

Getting the DMOZ Data
http://dmoz.org/help/getdata.html

If we can reasonably assume that 10,000 sites are using live data, we
can see the effect this would have, and has had, on the availability
of the DMOZ website at times. It's a 'feature' they can do away with,
or say they never agreed to allow, at a time of their choosing.

"One point you did not answer was opinion given here.."

It's a very old argument - 'Data wants to be free'. It does. I'm not
certain how well this would hold up in the courts. The internet has
had a major effect on many things, including how law is defined. If
the page you reference was the definitive source, then the license
Netscape has on the DMOZ data would be invalid.

Thanks for the great clarification request. This is truly a question
I've enjoyed answering. Should you still need more clarification -
I'll be happy to assist.

SgtCory
Request for Answer Clarification by searching777-ga on 20 Dec 2004 09:42 PST

"If the page you reference was the definitive source"

....interesting. That's maybe where the crawlers get their legs, so to
speak...anyway...thanks for the info. All very interesting. If you dig
up/find anything new anytime please post it.

Thanks
Clarification of Answer by sgtcory-ga on 20 Dec 2004 11:47 PST

Thank you for the great rating.

I've spent the last hour or so looking at most of the major search
providers' terms of use. All of them have similar wordings and
licensing schemes. I ran a metasearch function for a while and decided
it would be in my best interest to contact these companies
individually. Almost all replied with the same response, which was to
work out a commercial agreement with them.

I hope this helps - and best of luck with your venture(s)!

SgtCory

searching777-ga rated this answer: 5 out of 5 stars

Comments

Subject: Re: Is it legal to crawl search engines and display the results on another site
From: crythias-ga on 09 Dec 2004 20:25 PST

Stop legit bots from crawling:
http://www.searchengineworld.com/robots/robots_tutorial.htm
This is a free comment.

Of course, if unscrupulous crawl bots ignore the robots.txt file, all
bets are off, but at least you'll stop the legit bots from crawling
your site.
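
For reference, a minimal robots.txt along those lines might look like
the fragment below. "BadBot" and the /search path are placeholders
chosen for illustration, not names taken from this thread. The file
has to be served from the root of the site (e.g. /robots.txt).

```
# Ask one specific crawler to stay out of the whole site.
User-agent: BadBot
Disallow: /

# Ask every other crawler to stay out of the search result pages only.
User-agent: *
Disallow: /search
```

Compliance is voluntary, so this only affects crawlers that choose to
read and honour the file.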

Is it legal? Sorta kinda, but I bet Copyright Lawyers would say your
content is copyright to you and unauthorized reproduction is
prohibited, etc. etc. etc.

Subject: Re: Is it legal to crawl search engines and display the results on another site
From: southof40-ga on 15 Dec 2004 12:33 PST

Can't speak for legal stuff - here's some technical ideas.

One thing you can do is to insert some of your content via JavaScript.
Most crawlers are either not capable of executing the JS within a
crawled page or would find the effort to do so too expensive given the
number of pages they would have to crawl.
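
A minimal sketch of that idea in PHP, assuming the content to hide is
an HTML fragment produced server-side. The function name js_wrapped()
and the example link are made up for illustration. Keep in mind that
visitors with JavaScript disabled will not see the wrapped content
either.

```php
<?php
// Wrap an HTML fragment in a <script> block so that it only appears
// for clients that actually execute JavaScript.
function js_wrapped($html) {
    // json_encode() yields a valid JavaScript string literal. The
    // JSON_HEX_* flags escape <, >, &, ' and " so the fragment cannot
    // close the surrounding <script> element early.
    $flags   = JSON_HEX_TAG | JSON_HEX_AMP | JSON_HEX_APOS | JSON_HEX_QUOT;
    $literal = json_encode($html, $flags);
    return '<script type="text/javascript">document.write('
         . $literal . ');</script>';
}

// Example: emit one (hypothetical) search result via JavaScript.
echo js_wrapped('<a href="http://www.example.com/">Example result</a>');
```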

Some crawlers reveal who they are via the User Agent string delivered
along with the request for the page - obviously it's then possible to
drop all requests from such UA's into a bucket. However I have
identified crawlers who are faking their UA (and so would almost
certainly be ignoring any robots.txt you might have put in place).

Another suggestion, at least with respect to identifying them (perhaps
for the purposes of IP blocking): crawlers rarely fetch the graphics
associated with a page. So if you embed a non-existent graphic in your
page (make it small and it won't trouble real users) and then wait to
see which UAs don't ask for it, you've probably found a UA which is a
crawler. Real browsers should always ask for it, because it cannot be
cached anywhere: you've never issued it! Of course it could be a
legitimate user who has graphics turned off, but they are few and far
between.
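
The missing-graphic check could be run against an access log roughly
as follows. This is only a sketch under several assumptions: the log
is in Apache "combined" format, the trap image path /img/t.gif and the
function name find_suspected_crawlers() are invented for the example,
and "pages" are taken to be URLs ending in /, .php, .htm or .html.

```php
<?php
// Hypothetical path of the tiny trap image embedded in every page.
define('TRAP_PATH', '/img/t.gif');

// Given access-log lines in Apache "combined" format, return the
// clients (IP + User-Agent) that requested at least $minPages pages
// but never requested the trap image. Result: client => page count.
function find_suspected_crawlers(array $lines, $minPages = 5) {
    $pattern = '/^(\S+) \S+ \S+ \[[^\]]*\] "(?:GET|HEAD|POST) (\S+)[^"]*"'
             . ' \d{3} \S+ "[^"]*" "([^"]*)"/';
    $pages   = array(); // client => number of page requests
    $gotTrap = array(); // client => true once the trap was requested
    foreach ($lines as $line) {
        if (!preg_match($pattern, $line, $m)) {
            continue; // skip lines that are not in the expected format
        }
        $client = $m[1] . ' | ' . $m[3];
        $path   = (string) parse_url($m[2], PHP_URL_PATH);
        if ($path === TRAP_PATH) {
            $gotTrap[$client] = true; // a 404 still counts as a request
        } elseif (preg_match('/(\/|\.php|\.html?)$/', $path)) {
            $pages[$client] = isset($pages[$client]) ? $pages[$client] + 1 : 1;
        }
    }
    $suspects = array();
    foreach ($pages as $client => $count) {
        if ($count >= $minPages && !isset($gotTrap[$client])) {
            $suspects[$client] = $count;
        }
    }
    return $suspects;
}

// Usage (the log path is an assumption):
// $lines = file('/var/log/apache2/access.log', FILE_IGNORE_NEW_LINES);
// print_r(find_suspected_crawlers($lines));
```

The $minPages threshold is there to avoid flagging a visitor after a
single page view; anything this reports is a candidate to look at, not
proof of a crawler.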

Subject: Re: Is it legal to crawl search engines and display the results on another site
From: tigerheart-ga on 20 Dec 2004 18:31 PST

depending on what scripting language you use serverside (I use php for
example), you can create a simple script to include at the start of
each of your files that will deter (most of) these crawlers and
bots...

here's an example I use for my pages:

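<?php
    // Substrings that identify known bots in the User-Agent header.
    $botlist = array(
                "Teoma",
                "alexa",
                "froogle",
                "inktomi",
                "looksmart",
                "URL_Spider_SQL",
                "Firefly",
                "NationalDirectory",
                "Ask Jeeves",
                "TECNOSEEK",
                "InfoSeek",
                "WebFindBot",
                "girafabot",
                "crawler",
                "www.galaxy.com",
                "Googlebot",
                "Scooter",
                "Slurp",
                "appie",
                "FAST",
                "WebBug",
                "Spade",
                "ZyBorg",
                "rabaz");

    // Read the User-Agent from $_SERVER (works without register_globals).
    $agent = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';

    foreach($botlist as $bot) {
      // Plain case-sensitive substring match; no regular expression needed.
      if(strpos($agent, $bot) !== false) {
        exit();   // send nothing further, so the bot gets a blank page
      }
    }
?>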

this little script detects most bots and crawlers and gives them a
blank page when they try to visit the site.
I can prepare a script like this for ASP if you like too... just let me know.

Subject: Re: Is it legal to crawl search engines and display the results on another site
From: searching777-ga on 23 Dec 2004 17:04 PST

Hey tigerheart,

THANKS!..Really appreciate the free code.
I use php so no need for the asp version.

:)

searching777

Important Disclaimer: Answers and comments provided on Google Answers are general information, and are not intended to substitute for informed professional medical, psychiatric, psychological, tax, legal, investment, accounting, or other professional advice. Google does not endorse, and expressly disclaims liability for any product, manufacturer, distributor, service or service provider mentioned or any opinion expressed in answers or comments. Please read carefully the Google Answers Terms of Service.

