A. Is it possible for a web site to "hide" content from a search
engine spider program? An answer to this question is essential to
answering the second question, which gets to the meat of what you need
to know:
B. Could content-creation tools be deliberately designed to hide
content from spiders?
* * *
A. The answer to the first question is certainly yes. There are
several means by which this could be done:
1. Major search engine spiders obey the "robot exclusion standard,"
which allows site designers to expressly request that spiders not
visit certain parts of the site. This can be used to exclude the
entire site, or (more commonly) to exclude URLs that have unwanted
side effects (such as triggering financial transactions) or are
otherwise undesirable for spiders to fetch. This is the most
straightforward method. The robot exclusion standard is a voluntary
mechanism, of course. Please see:
A Standard for Robot Exclusion
http://www.robotstxt.org/wc/norobots.html
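For illustration, a robots.txt file placed at the root of the site,
written using only the directives defined by that standard, might look
like the following (the paths are invented for this example):

  # robots.txt -- illustrative example; the paths are placeholders
  User-agent: *
  Disallow: /cgi-bin/checkout/
  Disallow: /private/

A compliant spider fetching this file would index the rest of the site
but skip the two listed paths; nothing technical forces a spider to
honor it.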
2. Reputable, major search engine spiders also identify themselves
when communicating with web servers. They do so by supplying a "user
agent" browser identification text string. Ordinary web browsers
supply a string such as "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT
5.1)" or similar. Spiders supply strings that often include the URL of
a page that explains their function, so that web server administrators
will understand the value of being indexed and not misunderstand the
spiders' intentions. You can find these browser identification strings
in the
reports of any quality web server log analysis product. It is
technically feasible, using straightforward, widely understood
techniques such as "server side includes" as well as PHP, ASP and
other languages, to provide different content based on the browser or
spider that is accessing the page. For one such mechanism, see:
Apache module mod_setenvif
http://httpd.apache.org/docs/mod/mod_setenvif.html
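As a rough sketch of this approach in one of the languages mentioned
above (PHP is assumed here; the spider name and the two text variants
are invented for illustration), a page could branch on the User-Agent
header like this:

  <?php
  // Hypothetical sketch: serve different text when the User-Agent header
  // contains a particular spider's name ("ExampleSpider" is invented).
  $agent = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';

  if (strpos(strtolower($agent), 'examplespider') !== false) {
      echo '<p>Placeholder text delivered only to that spider.</p>';
  } else {
      echo '<p>The full article text, delivered to ordinary browsers.</p>';
  }
  ?>

The mod_setenvif module cited above performs the same sort of
classification at the server-configuration level rather than in page
scripting.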
3. Spiders typically run on a limited, although sometimes large,
number of servers. Although there are spider programs for private
individuals, whose IP addresses will vary wildly, large organizations
that operate search indexing services on a global scale will of
necessity tend to have a fixed pool of IP addresses. Even if,
hypothetically, spiders were to cease respecting the robot exclusion
standard and also begin to supply a user-agent string identical to
that of a "normal" web browser, the IP addresses from which they come
would still be recognizable and could be catalogued, due to their regular
habits. Again, readily available tools such as server side includes,
ASP, PHP and so on can be used to provide differing content based on
the IP address of the client the server is talking to. See:
http://httpd.apache.org/docs/mod/mod_setenvif.html#setenvif
Note that the Remote_Addr variable contains the IP address of the
client. It is this information which could be matched to a database of
known spider IP addresses.
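A sketch of the same idea keyed to the client address follows (again
assuming PHP; the addresses are drawn from a reserved documentation
range and the list itself is invented):

  <?php
  // Hypothetical sketch: REMOTE_ADDR carries the same information as
  // Apache's Remote_Addr variable; compare it against a list of
  // addresses believed to belong to spiders.
  $known_spider_ips = array('192.0.2.10', '192.0.2.11');

  $client = isset($_SERVER['REMOTE_ADDR']) ? $_SERVER['REMOTE_ADDR'] : '';

  if (in_array($client, $known_spider_ips)) {
      echo '<p>Abbreviated content for recognized spider addresses.</p>';
  } else {
      echo '<p>Full content for everyone else.</p>';
  }
  ?>

A real deployment would maintain a much larger, regularly updated table
of addresses rather than a two-entry array.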
4. Even if the server delivers the same content to all users, spiders cannot
read all types of content that human users can read. The most obvious example
is text placed in a graphic. While perfectly readable to humans, it is
difficult for computers to read, and it is impractical for search
engine indexing spiders to run optical character recognition on every
image they encounter. At the very least, increased
use of this technique would greatly increase the expense involved in
spidering web pages.
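To make the image technique concrete, here is a minimal sketch assuming
PHP with the GD graphics extension (the text, font and dimensions are
arbitrary choices for the example):

  <?php
  // Hypothetical sketch: emit a sentence as a PNG image so the words
  // never appear as indexable text in the page's HTML.
  $im    = imagecreate(420, 24);                   // width x height in pixels
  $bg    = imagecolorallocate($im, 255, 255, 255); // first color = background
  $black = imagecolorallocate($im, 0, 0, 0);
  imagestring($im, 4, 5, 4, 'This sentence is delivered only as pixels.', $black);

  header('Content-Type: image/png');
  imagepng($im);
  imagedestroy($im);
  ?>

The page itself would then carry only an <img> tag pointing at this
script, leaving nothing for a spider to index unless it performed
optical character recognition on the image.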
There are other methods, such as delivering pages as JavaScript source
code that expands quoted strings, not readable in themselves, into the
correct HTML through some simple transformation. One can also
effectively hide a site by building it entirely in Macromedia Flash, in
Java, or as an ActiveX control. But such non-image methods, readable as
text by those who designed them, would not take long for other parties
to decipher as well, by studying the JavaScript source code or whatever
other obfuscation method was used. All of these techniques share the
property that they are not selective; all spiders would receive the
same material and face the same challenges in deciphering it. In
addition, all of these methods except for images run the risk of
antagonizing and/or shutting out users who do not have, or are not
permitted for security reasons to use, optional technologies such as
JavaScript, Flash and Java, especially in the workplace. Finally, none
of these methods is compatible with accessibility guidelines for
disabled users, except (in principle) for JavaScript, which browsers
for the blind could support but usually do not.
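As a rough illustration of the JavaScript-expansion idea (a sketch
only, assuming PHP on the server; the encoding scheme is my own choice
and not taken from any particular product), a page could carry its real
markup only as an encoded string that a short script decodes in the
browser:

  <?php
  // Hypothetical sketch: send the page body base64-encoded, so a spider
  // that does not execute JavaScript sees no indexable text.
  $html    = '<p>This paragraph is visible only after client-side decoding.</p>';
  $encoded = base64_encode($html);
  ?>
  <html>
    <body>
      <script type="text/javascript">
        // atob() reverses the base64 encoding applied on the server.
        document.write(atob('<?php echo $encoded; ?>'));
      </script>
      <noscript>This page requires JavaScript.</noscript>
    </body>
  </html>

A spider fetching this page sees only the script and the encoded
string; a browser with scripting enabled sees the paragraph.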
* * *
B. Content creation tools could be designed to use some of the
techniques above, conferring some possible short-term advantage to a
particular search engine affiliated with the creator of those tools.
However, not all of the techniques mentioned above are practical in
content creation tools, and the presence of such mechanisms would be
readily detected by third parties and would quickly create an extensive
backlash against the content creation products in question.
Taking each of the techniques in turn:
1. The robot exclusion standard technique could be employed by content
creation software, provided that the software had direct access to
replace files on the web site, which is often the case in major
content creation programs today. Such software could, openly or
without fanfare, upload a new robots.txt file to the root directory of
the web site containing directives to exclude one or more spiders (a
sketch of such a file appears at the end of this item). This, of
course, would be obvious grounds for legal action and a clearly
anticompetitive practice.
Alternatively, the software could upload a robots.txt file excluding
all spiders. The affiliated search engine software would then have to
be designed to ignore such files and index the site anyway, possibly
looking for those that contained specific patterns of whitespace or
other normally-ignored characters in order to recognize the specific
robots.txt files created for this purpose. Such a technique would also
doubtless create grounds for legal action as an anticompetitive
practice. However, the plaintiff would have a somewhat harder time
proving that the decision to "helpfully" block spiders in the content
creation program and the decision to ignore spider blocking in the
spider program were not made independently.
Of course, other major search engines would no doubt quickly respond
by ignoring such "involuntarily" installed robots.txt files or
ignoring robots.txt altogether. The advantage conferred by this tactic
would be very short-term, with considerable potential for negative PR
among web designers and others who purchase content creation products.
It must be remembered that web designers are extremely eager to see
their work indexed by search engines and would not take kindly to any
indication of sabotage.
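As an illustration of the kind of file such a tool might install (the
spider name is invented and stands in for no actual company), the
uploaded robots.txt could single out one competitor's crawler while
leaving all others unrestricted:

  # robots.txt -- hypothetical example of selective exclusion
  # "CompetitorBot" is an invented name for a rival's spider.
  User-agent: CompetitorBot
  Disallow: /

  User-agent: *
  Disallow:

Because any site owner can fetch /robots.txt and read it, directives of
this kind would be among the easiest forms of sabotage to detect.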
2. The user-agent approach is also possible, but much more difficult,
in that the content creation software would have to (a) take into
account the type of server the content would be deployed on, (b)
potentially reconfigure that server, something most customers of
hosting firms lack the privileges to do themselves, and (c) supply
scripting appropriate to the server type and to the scripting languages
that the server supports. The content could then fail to work if moved to a new
server, an unpleasant surprise for users who assumed the content was
ordinary, portable HTML. Also, a nontrivial percentage of users have
some familiarity with HTML and would notice the unusual directives in
their code, as well as a possible need to use a different file
extension. And the advantage would be very short-term as other spiders
would quickly switch to simply identifying themselves identically to
common web browsers. There would be fallout as well from users of web
browsers not on an "approved" list; alternatively the code would have
to "exclude" a specific list of spiders, a clear declaration of intent
to force out certain companies. Once again, as soon as the nature of
the "feature" was publicized, there would be an extensive backlash
among users. Finally, this technique is open to clear legal challenges
similar to those facing technique #1.
3. The IP address method is possible, subject to the same problems as
the user-agent approach with regard to the need to insert script code
into the customer's pages, possibly breaking them when moving from
server type to server type, and/or a need for the content creation
software to reconfigure the customer's web server software, which the
customer often does not have the privileges to do. While there would
be no potential here for a spider to escape detection by changing its
user agent string, there would also have to be a clearly spelled out
list of excluded IP addresses, which would again be a clear case of
anticompetitive practices open to easy legal challenge. In the medium
term, competing spiders could move their servers to different IP
addresses -- an inconvenience they would no doubt cite in court.
4. The obfuscation method is more practical than methods #2 and #3 to
implement in content creation software; the difficulty is somewhat
similar to method #1. However, the impact on human users is very
significant, and the advantage quite temporary, as any encoding that
one company's spider was designed to understand would soon be
deciphered and supported by competing spiders. Finally, the
perceived obnoxiousness to customers who understand HTML would be
highest with any variant of this method.
* * *
In conclusion, it is technically feasible for content creation
software to deliberately frustrate web spidering software to the
advantage of an affiliated web spider, at least temporarily. However,
the available techniques for doing so all involve serious negative PR
consequences among customers and invite legal challenges as well. The
most technically practical technique with the smallest chance of
unintended consequences for the customer is probably #1, the robot
exclusion standard approach. However, this method confers only a
temporary advantage, as competing spiders will be able to compensate
for it. The most effective method in terms of frustrating competing
spiders is method #3, the IP address method, as this is the most
difficult to compensate for. However, this method involves the
insertion of scripting into web pages, which may or may not work with
particular servers, and is the most readily challenged in court. All of
these methods antagonize customers, and given the many alternatives
now available -- including open source and commercial software -- no
company is presently in a position to court negative customer opinion
of its content creation tools. In my opinion, none of these methods is
practical from a legal and economic perspective, given the temporary
advantage and the substantial legal risk.