Q: How do I get confidential data posted accidentally OFF Google? ( Answered 5 out of 5 stars,   0 Comments )
Question  
Subject: How do I get confidential data posted accidentally OFF Google?
Category: Miscellaneous
Asked by: logisnews-ga
List Price: $10.00
Posted: 24 Feb 2003 20:09 PST
Expires: 26 Mar 2003 20:09 PST
Question ID: 166697
I accidentally posted confidential legal data on my web site and it got
indexed by Google.  I didn't know Google would index and search .pdf
files... I thought they were secure.  Anyway, I need to get it OFF
Google before my client finds out.  I have fixed the .pdf so that
there's no confidential data in it anymore; however, when you search on
Google it pulls up the 'new' cleansed document (that's fine), but it
also has a "view html" function which, if clicked, will show the
original document with the confidential data.  How do I get rid of that
old data from the "view html" function?  I was hoping that since I
changed the source document, Google would reindex it and generate
new HTML to view based on the new document.  But so far, after 5
days, that hasn't happened. Google's the only search engine this is a
problem on, because the other ones that I've searched, while they show
my new "cleansed" document, don't have a "view html" function.

Can anyone help?  Thanks.
Answer  
Subject: Re: How do I get confidential data posted accidentally OFF Google?
Answered By: aceresearcher-ga on 25 Feb 2003 22:31 PST
Rated:5 out of 5 stars
 
greetings, logisnews!

There are a couple of things you can try to rectify this situation.

1) To maintain their reputation for objectivity and integrity, Google
is very reluctant to tamper with their search results. However, since
you are the website owner, they may be willing to erase or replace
their cached version of this .pdf document at your request. It
certainly can't hurt to try. Contact them via e-mail at
help@google.com . Explain your situation as you did above; give them
your full name, address, phone number, and website URL; ask them if
they would be willing to update the cached version of this .pdf
document for you.

2) Submit the URL of this .pdf document to Google as if you were
asking the Googlebot to index your site for the first time.
http://www.google.com/addurl.html
This should speed up the process whereby the cached version gets
replaced by the new file. Once you've done this, make a slight change
to the document (maybe add one space or a carriage return to the end
of a sentence), then save it and upload the new version to the web.
The changed file should prompt the Googlebot to re-fetch the document
and replace its cached copy. This may still take a few days
to a few weeks, but the confidential information in the cache should
get replaced at some point.


Now, a couple of things to bear in mind:

The Googlebot is pretty polite, and will honor requests made in a
/robots.txt file. Thus, if you save a /robots.txt file on your server
that contains the following lines:

User-agent: *  
Disallow: /*.pdf$ # disallow access to Acrobat Documents 

the Googlebot will not index any .pdf documents on your website.
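(One caveat: the `*` wildcard and the `$` end-anchor in that Disallow line are Google extensions; the original robots.txt standard only does simple prefix matching, so other crawlers may ignore the pattern. As a rough, unofficial sketch of how such a pattern gets interpreted, translated into a regular expression:)

```python
import re

def googlebot_matches(pattern: str, path: str) -> bool:
    """Rough sketch of Googlebot-style Disallow matching: '*' matches
    any run of characters, and a trailing '$' anchors the end of the
    URL path. (Google extensions; the original robots.txt standard
    matches on the literal prefix only.)"""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape literal parts, join them with ".*" where the "*"s were
    regex = "^" + ".*".join(re.escape(part) for part in pattern.split("*"))
    if anchored:
        regex += "$"
    return re.match(regex, path) is not None

print(googlebot_matches("/*.pdf$", "/reports/contract.pdf"))  # matches
print(googlebot_matches("/*.pdf$", "/contract.pdf.html"))     # does not match
print(googlebot_matches("/norobots/", "/norobots/file.pdf"))  # prefix match
```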

See
Microsoft Knowledge Base Article - 217103  
"How to Write a Robots.txt File" 
http://support.microsoft.com/default.aspx?scid=KB;en-us;q217103 
and 
"A Standard for Robot Exclusion", by Martijn Koster, The Web Robots
pages:
http://www.robotstxt.org/wc/norobots.html 
for further information.

***HOWEVER***
Not all Search Engines are as polite as Google, and they may index
your .pdf documents anyway. In fact, some hackers and other nefarious
characters actually troll the web looking for directories disallowed
in /robots.txt files, specifically looking in them for confidential
information. /robots.txt files are just NOT a good way to secure
confidential data.

If you MUST make confidential information available on the web for
clients, you can put all the files you don't want robots to visit in a
separate subdirectory, configure your server to make that directory
un-listable on the web, and list only the directory name in the
/robots.txt. Now a malicious robot can't traverse that directory
unless you or someone else puts a direct link on the web to one of
your files (something you DON'T want to do!).
 
For example, rather than putting in your robots.txt file:  
 
User-Agent: *  
Disallow: /*.pdf$  
 
do:  
User-Agent: *  
Disallow: /norobots/  
 
and make a "norobots" directory, put your .pdf documents into it, and
configure your server to not generate a directory listing for that
directory. Now all an attacker would learn is that you have a
"norobots" directory, but he won't be able to list the files in there;
he'd need to guess their names.
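On Apache (an assumption; other servers have equivalent settings), suppressing the directory listing is one directive. The filesystem path here is illustrative:

```apache
# In httpd.conf, or in the norobots directory's .htaccess
# (requires AllowOverride Options if placed in .htaccess):
<Directory "/var/www/norobots">
    Options -Indexes
</Directory>
```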
 
However, in practice this is a bad idea -- it's just not secure
enough. Someone may publish a link to your files on their site. Or it
may turn up in a publicly accessible log file, say of your user's
proxy server, or maybe it will show up in someone's web server log as
a Referrer. Or someone may misconfigure your server at some future
date, "fixing" it to show a directory listing. Which leads me to the
real answer:
 
The real answer is that /robots.txt is not intended for access
control, so don't try to use it as such. Think of it as a "No Entry"
sign, not a locked door. If you have files on your web site that you
don't want unauthorized people to access, then configure your server
to do authentication, and configure appropriate authorization. Basic
Authentication has been around since the early days of the web (and is
trivial to configure on certain systems, such as Apache on UNIX), and
if you're really serious, SSL is commonplace in web servers.
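As a minimal sketch, assuming Apache with its standard Basic Authentication module, protecting a directory takes only a few lines in an .htaccess file plus a password file created with the `htpasswd` utility (the paths and realm name here are illustrative):

```apache
# .htaccess in the directory to protect. Create the password file
# first, outside the web root, with:
#   htpasswd -c /home/youraccount/.htpasswd clientname
AuthType Basic
AuthName "Client Documents"
AuthUserFile /home/youraccount/.htpasswd
Require valid-user
```

Bear in mind that Basic Authentication sends the password essentially in the clear, which is why SSL is the next step if you're really serious.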
 
HTAccess Authentication Tutorial  
http://faq.clever.net/htaccess.htm  
 
Information about adding username/password protection to your
webpages:
 
"Questions about Services and Scripts: Can I require a password for
access to a web page?", Massachusetts Institute of Technology (MIT)
http://www.mit.edu/faq/password.html 
 
"Password protecting a directory from web display", College of
Agriculture and Life Sciences, University of Arizona (September 3,
2002):
http://ag.arizona.edu/ecat/web/password-protect.html 

And of course, if the confidential information doesn't NEED to be
available on the web, just don't upload it to your web server at
all.


(My thanks to Researcher webadept for some of the information included
in this Answer.)


Before Rating this Answer, if you have any questions about this
information, please post a Request for Clarification, and I will be
glad to see what I can do for you.

I hope that this Answer has provided you with exactly the information
you needed!

Regards,

aceresearcher
logisnews-ga rated this answer: 5 out of 5 stars

Comments  
There are no comments at this time.
