Greetings, logisnews!
There are a couple of things you can try to rectify this situation.
1) To maintain their reputation for objectivity and integrity, Google
is very reluctant to tamper with their search results. However, since
you are the website owner, they may be willing to erase or replace
their cached version of this .pdf document at your request. It
certainly can't hurt to try. Contact them via e-mail at
help@google.com. Explain your situation as you did above; give them
your full name, address, phone number, and website URL; ask them if
they would be willing to update the cached version of this .pdf
document for you.
2) Submit the URL of this .pdf document to Google as if you were
asking the Googlebot to index your site for the first time.
http://www.google.com/addurl.html
This should speed up the process whereby the cached version gets
replaced by the new file. Once you've done this, make a slight change
to the document (maybe add one space or a carriage return to the end
of a sentence), then save it and upload the new version to the web.
Hopefully, the Googlebot will then fetch and cache the updated version,
replacing the old one. This may still take a few days
to a few weeks, but the confidential information in the cache should
get replaced at some point.
Now, a couple of things to bear in mind:
The Googlebot is pretty polite, and will honor requests made in a
/robots.txt file. Thus, if you save a /robots.txt file on your server
that contains the following lines:
User-agent: *
Disallow: /*.pdf$ # disallow access to Acrobat Documents
the Googlebot will not index any .pdf documents on your website. (The
* and $ wildcards used above are pattern-matching extensions that
Google honors; they are not part of the original robots.txt standard,
so other crawlers may ignore them.)
See
Microsoft Knowledge Base Article - 217103
"How to Write a Robots.txt File"
http://support.microsoft.com/default.aspx?scid=KB;en-us;q217103
and
"A Standard for Robot Exclusion", by Martijn Koster, The Web Robots
pages:
http://www.robotstxt.org/wc/norobots.html
for further information.
***HOWEVER***
Not all search engines are as polite as Google, and they may index
your .pdf documents anyway. In fact, some hackers and other nefarious
characters actually troll the web for directories disallowed in
/robots.txt files, probing them specifically for confidential
information. A /robots.txt file is just NOT a good way to secure
confidential data.
If you MUST make confidential information available on the web for
clients, you can put all the files you don't want robots to visit in a
separate subdirectory, configure your server to make that directory
un-listable on the web, and list only the directory name in the
/robots.txt. Now a malicious robot can't traverse that directory
unless you or someone else puts a direct link on the web to one of
your files (something you DON'T want to do!).
For example, rather than putting in your robots.txt file:
User-Agent: *
Disallow: /*.pdf$
do:
User-Agent: *
Disallow: /norobots/
and make a "norobots" directory, put your .pdf documents into it, and
configure your server to not generate a directory listing for that
directory. Now all an attacker would learn is that you have a
"norobots" directory, but he won't be able to list the files in there;
he'd need to guess their names.
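If your server happens to be Apache, for example, the directory
listing can be turned off with a one-line .htaccess file dropped into
the "norobots" directory (a minimal sketch; it assumes your main Apache
configuration allows .htaccess overrides of Options):
# .htaccess placed inside the norobots directory
# requires "AllowOverride Options" (or "AllowOverride All") in the
# main server configuration
Options -Indexes
With listings disabled, a request for the bare directory URL will
typically come back as "403 Forbidden" rather than a list of your
files.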
However, in practice this is a bad idea -- it's just not secure
enough. Someone may publish a link to your files on their site. Or a
URL may turn up in a publicly accessible log file, say on a user's
proxy server, or show up in someone else's web server log as a
Referer. Or someone may misconfigure your server at some future
date, "fixing" it to show a directory listing. Which leads me to the
real answer:
The real answer is that /robots.txt is not intended for access
control, so don't try to use it as such. Think of it as a "No Entry"
sign, not a locked door. If you have files on your web site that you
don't want unauthorized people to access, then configure your server
to do authentication, and configure appropriate authorization. Basic
Authentication has been around since the early days of the web (and is
trivial to configure on certain systems, such as Apache on UNIX), and
if you're really serious about security, SSL support is commonplace in
web servers as well.
HTAccess Authentication Tutorial
http://faq.clever.net/htaccess.htm
Information about adding username/password protection to your
webpages:
"Questions about Services and Scripts: Can I require a password for
access to a web page?", Massachusetts Institute of Technology (MIT)
http://www.mit.edu/faq/password.html
"Password protecting a directory from web display", College of
Agriculture and Life Sciences, University of Arizona (September 3,
2002):
http://ag.arizona.edu/ecat/web/password-protect.html
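To give a rough idea, on Apache a password-protected directory is
commonly set up with an .htaccess file along these lines (just a
sketch; the realm name and password-file path are placeholders you
would change):
# .htaccess in the directory you want to protect
AuthType Basic
AuthName "Restricted Client Area"
AuthUserFile /home/yoursite/.htpasswd
Require valid-user
The password file itself is created with Apache's htpasswd utility and
should be kept outside your web-accessible directories. Bear in mind
that Basic Authentication sends passwords essentially in the clear,
which is one more reason to add SSL if the material is truly sensitive.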
And of course, if the confidential information doesn't NEED to be
available on the web, just don't upload it to your web server at all.
(My thanks to Researcher webadept for some of the information included
in this Answer.)
Before Rating this Answer, if you have any questions about this
information, please post a Request for Clarification, and I will be
glad to see what I can do for you.
I hope that this Answer has provided you with exactly the information
you needed!
Regards,
aceresearcher