Q: Making "pseudo-private" webpages (Answered, 5 out of 5 stars, 0 Comments)
Question  
Subject: Making "pseudo-private" webpages
Category: Computers > Internet
Asked by: philroy-ga
List Price: $20.00
Posted: 07 Jun 2006 05:20 PDT
Expires: 07 Jul 2006 05:20 PDT
Question ID: 736039
Background: My Internet Service Provider offers space for personal
webpages, so occasionally if I want to share something with a friend
(for example a collection of images too big to directly e-mail), I
create and post a simple temporary "private" webpage in my home
directory, or a subdirectory of it, with a name like
"hellosusan.html", then e-mail the URL of the page only to Susan.
There are no links from anywhere to that page, but I do maintain a
simple "index.html" page in my home directory (that does not link to
the page). The pseudo-privacy of the webpage is simply that no one
else knows what URL to type to get to that page.

I thought that search engines generally can't find such
"pseudo-private" pages, since they have no way of knowing the spelling
of the URL (particularly the filename of the page) to go to. But a
friend of mine thinks that search engines are capable of
walking/iterating the entire directory hierarchy of a domain,
discovering any and all files and subdirectories in each directory.
While uncommon, I have encountered a few URLs that appear to be an
automated listing of the entire directory, with links to each file. I
thought I once heard that the presence of an index.html page in the
same directory blocks such a listing, but I'm not clear on that point.

I recently started using the tag:
	<meta name="robots" content="noindex, nofollow" />
to tell search engines not to index my page if they do happen to find
it. But I know that some rogue search engine or bot won't necessarily
respect that tag.

I also realize that if I have a link on my page to an external
website, a web browser can include the URL of my referring page as a
field in the request to the external site.

So that leads to the obvious questions:

1. Are search engines capable of iterating the entire contents of a
given known directory in a domain? Or do they have to know the
spellings of the files in the directory to find them (for example, due
to links from an already-known page)?

2. What causes (or prevents) an automated listing of every file in a
directory to appear in a browser? Does the presence or absence of an
index.html file in a given directory affect that in any way?

3. Is there any way to edit my webpage to prevent a browser from
including its URL as a field in requests to linked sites?

4. Assuming my friend doesn't forward the URL I e-mailed to her to
others, are there any other gotchas in how search engines, or other
people, could discover my pseudo-private webpage?

Thanks.
Answer  
Subject: Re: Making "pseudo-private" webpages
Answered By: eiffel-ga on 07 Jun 2006 12:49 PDT
Rated: 5 out of 5 stars
 
Hi philroy-ga,

Taking your questions in turn:

> 1. Are search engines capable of iterating the entire contents
> of a given known directory in a domain?

In general, search engines cannot discover the entire contents of a
directory on your webserver. However, there are ways in which this
could be made possible. For example, IF your ISP was serving the files
by anonymous FTP (file transfer protocol) in addition to HTTP (the
web's hypertext transfer protocol), AND a link existed from a web page
to an FTP address in your website, AND the search engine was a
specialized one that wanted to crawl FTP sites (e.g. to compile a list
of downloadable files), then the search engine's crawler could request
a directory listing by issuing an FTP command.

But it's extremely unlikely that your ISP would be serving your files
by anonymous FTP without your knowledge.

In the normal situation, a search engine must follow a link to get to your webpage.
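
To illustrate why an unlinked page stays undiscovered, here is a minimal
sketch of a link-following crawler. It is not how any particular search
engine works; the in-memory SITE dictionary, the example.com URLs and
the function names are all hypothetical, chosen only for the demonstration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag found in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(site, start_url):
    """Return every URL reachable from start_url by following links.

    `site` is a hypothetical stand-in for the web: a dict of URL -> HTML.
    """
    seen, queue = set(), [start_url]
    while queue:
        url = queue.pop()
        if url in seen or url not in site:
            continue
        seen.add(url)
        parser = LinkCollector()
        parser.feed(site[url])
        # Resolve relative links against the page they appear on.
        queue.extend(urljoin(url, link) for link in parser.links)
    return seen


SITE = {
    "http://example.com/~phil/index.html": '<a href="photos.html">Photos</a>',
    "http://example.com/~phil/photos.html": "<p>Public photos</p>",
    "http://example.com/~phil/hellosusan.html": "<p>Hello Susan</p>",
}

found = crawl(SITE, "http://example.com/~phil/index.html")
# hellosusan.html is absent from `found` because nothing links to it.
print(sorted(found))
```

The crawler only ever requests URLs it has seen spelled out in a link,
which is why the file exists on the server yet never enters its queue.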

> 2. What causes (or prevents) an automated listing of every
> file in a directory to appear in a browser? Does the presence
> or absence of an index.html file in a given directory affect
> that in any way?

An automated directory listing is produced by the webserver only when
it is configured to do so. The listing is generated if the user types
in (or clicks on a link to) a directory name AND the webserver can't
find an ordinary page to serve (or has been instructed not to serve
one). The webserver will look for pages such as index.html, index.php,
index.cgi, etc., and will display the first one that it finds. The
webserver will generate the autoindex only if none of these pages is
found.

The exact list of pages that the webserver looks for will differ
according to how the webserver is configured, but unless your ISP has
grossly misconfigured its webserver you can bet that index.html will
be one of the pages that the webserver will display in preference to
an autoindex.

The Apache webserver uses the "DirectoryIndex" directive for this purpose:

  "The DirectoryIndex directive sets the list of resources
   to look for, when the client requests an index of the
   directory by specifying a / at the end of the directory
   name ... If none of the resources exist and the Indexes
   option is set, the server will generate its own listing
   of the directory."

   Apache HTTP Server Documentation
   http://httpd.apache.org/docs/2.0/mod/mod_dir.html#directoryindex
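
The decision described above can be sketched in a few lines. This is a
simplified model, not Apache's actual code: the function name, the
default list of index names and the "forbidden" outcome when listings
are switched off are assumptions made for illustration.

```python
def resolve_directory_request(files,
                              index_names=("index.html", "index.php", "index.cgi"),
                              autoindex=True):
    """Model what a webserver does when a directory URL is requested.

    files       -- names of the files present in the directory
    index_names -- candidate index pages, in order of preference
                   (plays the role of Apache's DirectoryIndex list)
    autoindex   -- whether generated listings are enabled
                   (plays the role of Apache's Indexes option)
    """
    for name in index_names:
        if name in files:
            # An index page exists, so it is served and no listing is made.
            return ("serve", name)
    if autoindex:
        # No index page: every file name in the directory is exposed.
        return ("listing", sorted(files))
    # No index page and listings disabled: the request is refused.
    return ("forbidden", None)


print(resolve_directory_request({"index.html", "hellosusan.html"}))
print(resolve_directory_request({"hellosusan.html"}))
print(resolve_directory_request({"hellosusan.html"}, autoindex=False))
```

Only the second call reveals hellosusan.html, which is why keeping an
index.html in each directory is worthwhile.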

> 3. Is there any way to edit my webpage to prevent a browser
> from including its URL as a field in requests to linked sites?

The URL of the page that you are leaving is called the referrer URL.
It's not up to the webserver whether this is sent; it's up to the
browser. Some browsers can be configured so that they do not send the
referrer URL, but this is not usually satisfactory because some web
pages depend on the presence of the referrer URL to function properly.
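
To make the leak concrete, here is a rough sketch of the request a
browser might send to an external site after a link on your page is
clicked. The helper name and both URLs are hypothetical, and a real
browser sends many more headers. Note that HTTP spells the header
"Referer".

```python
from urllib.parse import urlsplit


def build_get_request(target_url, referring_url=None):
    """Build a bare-bones HTTP GET request as text.

    If referring_url is given, it is included in the Referer header,
    which is how the linked site learns the address of the page that
    contained the link.
    """
    parts = urlsplit(target_url)
    path = parts.path or "/"
    if parts.query:
        path += "?" + parts.query
    lines = [f"GET {path} HTTP/1.1", f"Host: {parts.netloc}"]
    if referring_url is not None:
        lines.append(f"Referer: {referring_url}")
    return "\r\n".join(lines) + "\r\n\r\n"


# With referrers enabled, the external site sees the private URL.
print(build_get_request("http://othersite.example/page",
                        "http://example.com/~phil/hellosusan.html"))

# With referrers disabled in the browser, the header is simply absent.
print(build_get_request("http://othersite.example/page"))
```

The external site's logs (and any published statistics derived from
them) would then contain the private URL in the first case.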

> 4. Assuming my friend doesn't forward the URL I e-mailed to her
> to others, are there any other gotchas in how search engines, or
> other people, could discover my pseudo-private webpage?

You should avoid using guessable names for the HTML files,
particularly standard names that are frequently used (such as
sitemap.html, login.html etc), because people might construct these
URLs directly rather than following links.
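
One way to get a name nobody will guess is to include a random token
in it. The sketch below uses Python's standard secrets module; the
function name, the "page" prefix and the choice of 16 random bytes are
assumptions of mine rather than any established convention.

```python
import secrets


def private_page_name(prefix="page", random_bytes=16):
    """Return an HTML filename containing a hard-to-guess random token.

    secrets.token_urlsafe produces text that is safe to use in a URL,
    drawn from a source intended for security-sensitive randomness.
    """
    token = secrets.token_urlsafe(random_bytes)
    return f"{prefix}-{token}.html"


# Each call yields a different name, so print it and keep a note of it.
print(private_page_name("susan"))
```

The trade-off is that such a URL is awkward to read out or retype, so
it works best when sent by e-mail as in your example.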

You also need to ensure that the webserver statistics for your website
are not posted on the web, because they will contain links to your
URLs. Similarly, your webserver logs must not be published on the web.

Make sure too that the intended viewer does not bookmark your URLs
publicly, for example by using a social bookmarking site such as
http://del.icio.us/

Make sure that your ISP is not participating in any scheme to promote
your URLs (e.g. by submitting URLs to search engines, or by submitting
a sitemap to a service such as Google Sitemaps).

To summarise: you can make your pages "pseudo-private" by:
1. Not divulging the URL to anyone except the intended viewer
2. Trusting the intended viewer to do likewise
3. Having an index.html file in each pseudo-private directory
4. Turning off the sending of referrers in the browser used to view the page
5. Keeping your stats off the web
6. Using the robots meta-tag to keep the honest robots out (and
   a robots.txt file if your hosting arrangements permit)
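
As an illustration of point 6, the sketch below shows how a
well-behaved robot would interpret a robots.txt file, using Python's
standard urllib.robotparser module. The "/private/" directory, the
"ExampleBot" user agent and the URLs are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# A hypothetical robots.txt asking all robots to stay out of /private/.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def may_fetch(url, user_agent="ExampleBot"):
    """Return True if an honest robot would consider itself allowed to fetch url."""
    return parser.can_fetch(user_agent, url)


print(may_fetch("http://example.com/private/hellosusan.html"))
print(may_fetch("http://example.com/index.html"))
```

Bear in mind that robots.txt is itself publicly readable, so it is
better to disallow a directory than to list individual private
filenames in it.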

When all is said and done, it seems a lot more straightforward to
forget about "pseudo-private" and go for password-protected. You can
then email a password to the desired recipients.

The procedure to set up a password-protected directory will differ
according to which kind of webserver your ISP is using, and may not be
possible with all ISPs. However, it is often straightforward, and will
certainly keep the search engine crawlers out.
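
For a feel of what password protection involves, here is a minimal
sketch of the check behind HTTP Basic authentication, which is one
common mechanism. I am assuming that mechanism for illustration; your
ISP may use something different. The function names and credentials
are hypothetical.

```python
import base64
import hmac


def basic_auth_header(username, password):
    """Return the Authorization header value a browser would send."""
    raw = f"{username}:{password}".encode("utf-8")
    return "Basic " + base64.b64encode(raw).decode("ascii")


def is_authorized(header_value, username, password):
    """Check a received Authorization header against the expected credentials."""
    if not header_value:
        # No credentials supplied: a crawler following a link ends up here.
        return False
    expected = basic_auth_header(username, password)
    # compare_digest avoids leaking information through timing differences.
    return hmac.compare_digest(header_value.encode("utf-8"),
                               expected.encode("utf-8"))


print(is_authorized(basic_auth_header("susan", "s3cret"), "susan", "s3cret"))
print(is_authorized(None, "susan", "s3cret"))
```

Basic authentication only encodes the password rather than encrypting
it, so it keeps crawlers and casual visitors out but is not a defence
against someone who can observe the network traffic.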

If that is not possible, you could consider an online service that
allows you to create webpages to be shared with people who you invite,
for example:

   MyFamily
   http://www.myfamily.com/

I trust this answers your questions. If not, feel free to request clarification.

Regards,
eiffel-ga


Google Search Strategy:

apache autoindex
://www.google.com/search?hl=en&q=apache+autoindex

"private website"
://www.google.com/search?hl=en&q=%22private+website%22

"keep out search engines"
://www.google.com/search?hl=en&q=%22keep+out+search+engines%22
philroy-ga rated this answer: 5 out of 5 stars and gave an additional tip of: $5.00
Thorough and useful answer; thanks!

I checked and discovered that my web provider, Apple's .Mac, with
their "HomePage" builder, does offer password protection as you
recommend, but also slightly but irritatingly modifies the appearance
of the HTML pages I import through that service. I'll look further
into how much I can edit around that without breaking HomePage's
infrastructure. But it's good to know it's available. Thanks.
