A. Is it possible for a web site to "hide" content from a search
engine spider program? An answer to this question is essential to
answering the second question, which gets to the meat of what you need
to know:
B. Could content-creation tools be deliberately designed to hide
content from spiders?
* * *
A. The answer to the first question is certainly yes. There are
several means by which this could be done:
1. Major search engine spiders obey the "robot exclusion standard,"
which allows site designers to expressly request that spiders not
visit certain parts of the site. This can be used to exclude the
entire site, or (more commonly) to exclude URLs that have unwanted
side effects (such as triggering financial transactions) or are
otherwise undesirable for spiders to fetch. This is the most
straightforward method. The robot exclusion standard is a voluntary
mechanism, of course. Please see:
A Standard for Robot Exclusion
http://www.robotstxt.org/wc/norobots.html
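For illustration, a robots.txt file placed at the root of the site,
written using only the directives defined by that standard, might look
like the following (the paths are invented for this example):

  # robots.txt -- illustrative example; the paths are placeholders
  User-agent: *
  Disallow: /cgi-bin/checkout/
  Disallow: /private/

A compliant spider fetching this file would index the rest of the site
but skip the two listed paths; nothing technical forces a spider to
honor it.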
2. Reputable, major search engine spiders also identify themselves
when communicating with web servers. They do so by supplying a "user
agent" browser identification text string. Ordinary web browsers
supply a string such as "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT
5.1)" or similar. Spiders supply strings that often include the URL of
a page that explains their function, so that web server administrators
will understand the value of being indexed and not misunderstand the
spiders' intentions. You can find these browser identification strings
in the
reports of any quality web server log analysis product. It is
technically feasible, using straightforward, widely understood
techniques such as "server side includes" as well as PHP, ASP and
other languages, to provide different content based on the browser or
spider that is accessing the page. For one such mechanism, see:
Apache module mod_setenvif
http://httpd.apache.org/docs/mod/mod_setenvif.html
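As a rough sketch of this approach in one of the languages mentioned
above (PHP is assumed here; the spider name and the two text variants
are invented for illustration), a page could branch on the User-Agent
header like this:

  <?php
  // Hypothetical sketch: serve different text when the User-Agent header
  // contains a particular spider's name ("ExampleSpider" is invented).
  $agent = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';

  if (strpos(strtolower($agent), 'examplespider') !== false) {
      echo '<p>Placeholder text delivered only to that spider.</p>';
  } else {
      echo '<p>The full article text, delivered to ordinary browsers.</p>';
  }
  ?>

The mod_setenvif module cited above performs the same sort of
classification at the server-configuration level rather than in page
scripting.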
3. Spiders typically run on a limited, although sometimes large,
number of servers. Although there are spider programs for private
individuals, whose IP addresses will vary wildly, large organizations
that operate search indexing services on a global scale will of
necessity tend to have a fixed pool of IP addresses. Even if,
hypothetically, spiders were to cease respecting the robot exclusion
standard and also begin to supply a user-agent string identical to
that of a "normal" web browser, the IP addresses from which they come
would still be recognizable and could be catalogued, due to their regular
habits. Again, readily available tools such as server side includes,
ASP, PHP and so on can be used to provide differing content based on
the IP address of the client the server is talking to. See:
http://httpd.apache.org/docs/mod/mod_setenvif.html#setenvif
Note that the Remote_Addr variable contains the IP address of the
client. It is this information which could be matched to a database of
known spider IP addresses.
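A sketch of the same idea keyed to the client address follows (again
assuming PHP; the addresses are drawn from a reserved documentation
range and the list itself is invented):

  <?php
  // Hypothetical sketch: REMOTE_ADDR carries the same information as
  // Apache's Remote_Addr variable; compare it against a list of
  // addresses believed to belong to spiders.
  $known_spider_ips = array('192.0.2.10', '192.0.2.11');

  $client = isset($_SERVER['REMOTE_ADDR']) ? $_SERVER['REMOTE_ADDR'] : '';

  if (in_array($client, $known_spider_ips)) {
      echo '<p>Abbreviated content for recognized spider addresses.</p>';
  } else {
      echo '<p>Full content for everyone else.</p>';
  }
  ?>

A real deployment would maintain a much larger, regularly updated table
of addresses rather than a two-entry array.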
4. Even if the server delivers the same content to all users, spiders cannot
read all types of content that human users can read. The most obvious example
is text placed in a graphic. While perfectly readable to humans, it is
difficult for computers to read, and it is impractical for search
engine indexing spiders to run optical character recognition on every
image they encounter. At the very least, increased
use of this technique would greatly increase the expense involved in
spidering web pages.
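To make the image technique concrete, here is a minimal sketch assuming
PHP with the GD graphics extension (the text, font and dimensions are
arbitrary choices for the example):

  <?php
  // Hypothetical sketch: emit a sentence as a PNG image so the words
  // never appear as indexable text in the page's HTML.
  $im    = imagecreate(420, 24);                   // width x height in pixels
  $bg    = imagecolorallocate($im, 255, 255, 255); // first color = background
  $black = imagecolorallocate($im, 0, 0, 0);
  imagestring($im, 4, 5, 4, 'This sentence is delivered only as pixels.', $black);

  header('Content-Type: image/png');
  imagepng($im);
  imagedestroy($im);
  ?>

The page itself would then carry only an <img> tag pointing at this
script, leaving nothing for a spider to index unless it performed
optical character recognition on the image.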
There are other methods, such as delivering pages as JavaScript source
code that expands quoted strings, not readable in themselves, into the
correct HTML through some simple transformation. One can also
effectively hide a site by building it entirely in Macromedia Flash, in
Java, or as an ActiveX control. But such non-image methods, readable as
text by those who designed them, would not take long for other parties
to decipher as well, by studying the JavaScript source code or whatever
other obfuscation method was used. All of these techniques share the
property that they are not selective; all spiders would receive the
same material and face the same challenges in deciphering it. In
addition, all of these methods except for images run the risk of
antagonizing and/or shutting out users who do not have, or are not
permitted for security reasons to use, optional technologies such as
JavaScript, Flash and Java, especially in the workplace. Finally, none
of these methods is compatible with accessibility guidelines for
disabled users, except (in principle) for JavaScript, which browsers
for the blind could support but usually do not.
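As a rough illustration of the JavaScript-expansion idea (a sketch
only, assuming PHP on the server; the encoding scheme is my own choice
and not taken from any particular product), a page could carry its real
markup only as an encoded string that a short script decodes in the
browser:

  <?php
  // Hypothetical sketch: send the page body base64-encoded, so a spider
  // that does not execute JavaScript sees no indexable text.
  $html    = '<p>This paragraph is visible only after client-side decoding.</p>';
  $encoded = base64_encode($html);
  ?>
  <html>
    <body>
      <script type="text/javascript">
        // atob() reverses the base64 encoding applied on the server.
        document.write(atob('<?php echo $encoded; ?>'));
      </script>
      <noscript>This page requires JavaScript.</noscript>
    </body>
  </html>

A spider fetching this page sees only the script and the encoded
string; a browser with scripting enabled sees the paragraph.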
* * *
B. Content creation tools could be designed to use some of the
techniques above, conferring some possible short-term advantage to a
particular search engine affiliated with the creator of those tools.
However, not all of the techniques mentioned above are practical in
content creation tools, and the presence of such mechanisms would be
readily detected by third parties and would quickly create an extensive
backlash against the content creation products in question.
Taking each of the techniques in turn:
1. The robot exclusion standard technique could be employed by content
creation software, provided that the software had direct access to
replace files on the web site, which is often the case in major
content creation programs today. Such software could, openly or
without fanfare, upload a new robots.txt file to the root directory of
the web site containing directives to exclude one or more spiders (a
sketch of such a file appears at the end of this item). This, of
course, would be obvious grounds for legal action and a clearly
anticompetitive practice.
Alternatively, the software could upload a robots.txt file excluding
all spiders. The affiliated search engine software would then have to
be designed to ignore such files and index the site anyway, possibly
looking for those that contained specific patterns of whitespace or
other normally-ignored characters in order to recognize the specific
robots.txt files created for this purpose. Such a technique would also
doubtless create grounds for legal action as an anticompetitive
practice. However, the plaintiff would have a somewhat harder time
proving that the decision to "helpfully" block spiders in the content
creation program and the decision to ignore spider blocking in the
spider program were not made independently.
Of course, other major search engines would no doubt quickly respond
by ignoring such "involuntarily" installed robots.txt files or
ignoring robots.txt altogether. The advantage conferred by this tactic
would be very short-term, with considerable potential for negative PR
among web designers and others who purchase content creation products.
It must be remembered that web designers are extremely eager to see
their work indexed by search engines and would not take kindly to any
indication of sabotage.
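As an illustration of the kind of file such a tool might install (the
spider name is invented and stands in for no actual company), the
uploaded robots.txt could single out one competitor's crawler while
leaving all others unrestricted:

  # robots.txt -- hypothetical example of selective exclusion
  # "CompetitorBot" is an invented name for a rival's spider.
  User-agent: CompetitorBot
  Disallow: /

  User-agent: *
  Disallow:

Because any site owner can fetch /robots.txt and read it, directives of
this kind would be among the easiest forms of sabotage to detect.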
2. The user-agent approach is also possible, but much more difficult,
in that the content creation software would have to (a) take into
account the type of server the content would be deployed on, (b)
potentially reconfigure that server, something most customers of
hosting firms lack the privileges to do themselves, and (c) supply
scripting appropriate to the server type and to the scripting languages
that the server supports. The content could then fail to work if moved to a new
server, an unpleasant surprise for users who assumed the content was
ordinary, portable HTML. Also, a nontrivial percentage of users have
some familiarity with HTML and would notice the unusual directives in
their code, as well as a possible need to use a different file
extension. And the advantage would be very short-term as other spiders
would quickly switch to simply identifying themselves identically to
common web browsers. There would be fallout as well from users of web
browsers not on an "approved" list; alternatively the code would have
to "exclude" a specific list of spiders, a clear declaration of intent
to force out certain companies. Once again, as soon as the nature of
the "feature" was publicized, there would be an extensive backlash
among users. Finally, this technique is open to clear legal challenges
similar to those facing technique #1.
3. The IP address method is possible, subject to the same problems as
the user-agent approach with regard to the need to insert script code
into the customer's pages, possibly breaking them when moving from
server type to server type, and/or a need for the content creation
software to reconfigure the customer's web server software, which the
customer often does not have the privileges to do. While there would
be no potential here for a spider to escape detection by changing its
user agent string, there would also have to be a clearly spelled out
list of excluded IP addresses, which would again be a clear case of
anticompetitive practices open to easy legal challenge. In the medium
term, competing spiders could move their servers to different IP
addresses -- an inconvenience they would no doubt cite in court.
4. The obfuscation method is more practical than methods #2 and #3 to
implement in content creation software; the difficulty is somewhat
similar to method #1. However, the impact on human users is very
significant, and the advantage quite temporary, as any encoding that
one company's spider was designed to understand would soon be
deciphered and supported by competing spiders. Finally, the
perceived obnoxiousness to customers who understand HTML would be
highest with any variant of this method.
* * *
In conclusion, it is technically feasible for content creation
software to deliberately frustrate web spidering software to the
advantage of an affiliated web spider, at least temporarily. However,
the available techniques for doing so all involve serious negative PR
consequences among customers and invite legal challenges as well. The
most technically practical technique with the smallest chance of
unintended consequences for the customer is probably #1, the robot
exclusion standard approach. However, this method confers only a
temporary advantage, as competing spiders will be able to compensate
for it. The most effective method in terms of frustrating competing
spiders is method #3, the IP address method, as this is the most
difficult to compensate for. However, this method involves the
insertion of scripting into web pages, which may or may not work with
particular servers, and is the most readily challenged in court. All of
these methods antagonize customers, and given the many alternatives
now available -- including open source and commercial software -- no
company is presently in a position to court negative customer opinion
of its content creation tools. In my opinion, none of these methods is
practical from a legal and economic perspective, given the temporary
advantage and the substantial legal risk.