Hello searching777,
Thanks for the questions. I have some experience within this market,
so I'll do my best to fully answer your questions.
1. Is it legal to crawl a search engine and ...?
-------------------------------------------------------------------------------
If the website being crawled publishes a 'terms of use' (or anything
similar), then your use of that website is governed by those
details. Here is an example of Google's agreement, stating that
'metasearching' or crawling of their content is not permitted :
Google - Personal Use Only
"You may not take the results from a Google search and reformat and
display them, or mirror the Google home page or results pages on your
Web site. You may not "meta-search" Google..."
http://www.google.com/terms_of_service.html
Here is an excerpt from the MSN website as well, that seems very clear :
MSN Terms of Use
"The MSN Web Sites are only for your personal use. You will not use
the MSN Web Sites for commercial purposes....you may not use the MSN
Web Sites in any manner that could damage, disable, overburden, or
impair any MSN Web Site..."
http://privacy.msn.com/tou/
Let's look past the database issue, and skip right to bandwidth and
server drain. When a remote computer crawls another website, it uses
bandwidth and resources paid for by the company that is being crawled.
In some cases the information may be free, but the process of
retrieving that information comes at a cost that is covered by the
company hosting the information.
In short, the best protection is to offer your users a terms of use
that is clear, and warns against illegal usage. This will give you
solid ground to stand on should any litigation arise.
Many important websites carry these agreements. Here is an example of
the Superior Courts of California's agreement :
County Website within the Superior Courts of CA
http://www.siskiyou.courts.ca.gov/disclaimer.asp
2. If yes any exceptions
-------------------------------------------------------------------------------
There are exceptions. It depends on the information being requested,
and the guidelines of the offering entity. The DMOZ may be one
exception, although their agreement says nothing about live retrieval
of their data, rather it refers to the usage of their RDF dumps for
local use. I couldn't find a single site that explicitly allows you to
crawl its content, although I am certain such sites exist. When in doubt,
the best approach is to ask. I did this with a few companies in
Ireland and the United Kingdom, and a majority of them allowed me to
crawl their content, simply because I was the only one to ask.
3. Any way to block crawlers
-------------------------------------------------------------------------------
There are a couple of methods, with the robots exclusion standard (a
robots.txt file at the root of your site) being the preferred method :
Robots Exclusion
http://www.robotstxt.org/wc/exclusion.html
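As a rough illustration of how the robots exclusion rules are read by a
well-behaved crawler, here is a small Python sketch using the standard
library's urllib.robotparser. The robots.txt content, the paths, and the
"BadBot" / "GoodBot" user-agent names are made up for the example and are
not taken from any real site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, for illustration only.
# "BadBot" is shut out entirely; everyone else is kept out of /search.
ROBOTS_TXT = """\
User-agent: BadBot
Disallow: /

User-agent: *
Disallow: /search
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text into a parser a crawler can query."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# A polite crawler asks before fetching each URL.
for agent, path in [
    ("BadBot", "/index.html"),
    ("GoodBot", "/index.html"),
    ("GoodBot", "/search?q=test"),
]:
    print(agent, path, parser.can_fetch(agent, path))
```

Keep in mind that this only works for crawlers that choose to check the
file. Nothing in robots.txt actually stops a crawler that ignores it.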
If the crawler does not adhere to the robots exclusion, you can use a
firewall to block access, assuming the I.P. address(es) are known.
Here is an example firewall for a *nix web server :
KISS Firewall
http://www.geocities.com/steve93138/
This simple firewall allows you to add individual I.P. addresses as
well as ranges simply by dropping them into a configuration file.
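To show the idea of matching a visitor against a list of individual
addresses and ranges, here is a minimal Python sketch using the standard
ipaddress module. This is not how the KISS firewall itself is implemented
(a real firewall filters at the packet level); the blocklist entries and
function names are hypothetical, and the addresses come from ranges
reserved for documentation.

```python
import ipaddress

# Hypothetical blocklist: one single address and one range in CIDR form.
BLOCKLIST = ["192.0.2.15", "198.51.100.0/24"]


def load_networks(entries):
    """Turn each entry into a network; a bare address becomes a /32."""
    return [ipaddress.ip_network(entry, strict=False) for entry in entries]


def is_blocked(ip, networks):
    """Return True if the address falls inside any blocked network."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in networks)


networks = load_networks(BLOCKLIST)

for ip in ["192.0.2.15", "198.51.100.77", "203.0.113.9"]:
    print(ip, "blocked" if is_blocked(ip, networks) else "allowed")
```

The same check could sit in front of a web application if you have no
access to the server's firewall, although blocking at the firewall saves
more bandwidth because the request never reaches the web server.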
If these methods don't work, then you can always contact their
internet provider, stating clearly which terms of your agreement are
being broken. Most upstream/hosting providers understand these issues,
as they too carry usage guidelines that they do not want to see
abused.
To assist with this answer, I referred directly to the terms of use on
a few search engines. Most of the information provided is from first
hand experience.
Should you need further clarification, please do not hesitate to ask.
I will do my best to assist!
SgtCory

Clarification of Answer by sgtcory-ga on 20 Dec 2004 07:31 PST
Thanks for the request.
"dmoz actually lists software services that grab results live btw.."
I knew they listed sites that use their live results. This is probably
an example of several departments not collaborating, and Netscape
keeping a tight grip on the 'free' data.
Here's the terms of use :
Terms of Use
"You will not disrupt the functioning of the ODP or otherwise act in a
way that interferes with other users' use of the ODP"
And here is the page that tells you how to get the data. You'll notice
that there are no options for live data usage :
Getting the DMOZ Data
http://dmoz.org/help/getdata.html
If we can reasonably assume that 10,000 sites are using live data, we
can see the effect this would have, and at times has had, on the
availability of the DMOZ website. It's a 'feature' they can do away
with, or say they never agreed to allow, at a time of their choosing.
"One point you did not answer was opinion given here.."
It's a very old argument - 'Data wants to be free'. It does. I'm not
certain how well this would hold up in the courts. The internet has
had a major effect on many things, including how law is defined. If
the page you reference was the definitive source, then the license
Netscape has on the DMOZ data would be invalid.
Thanks for the great clarification request. This is truly a question
I've enjoyed answering. Should you still need more clarification -
I'll be happy to assist.
SgtCory