Hi philroy-ga,
Taking your questions in turn:
> 1. Are search engines capable of iterating the entire contents
> of a given known directory in a domain?
In general, search engines cannot discover the entire contents of a
directory on your webserver. However, there are ways in which this
could be made possible. For example, IF your ISP was serving the files
by anonymous FTP (file transfer protocol) in addition to HTTP (the
web's hypertext transfer protocol), AND a link existed from a web page
to an FTP address in your website, AND the search engine was a
specialized one that wanted to crawl FTP sites (e.g. to compile a list
of downloadable files), then the search engine's crawler could request
a directory listing by issuing an FTP command.
But it's extremely unlikely that your ISP would be serving your files
by anonymous FTP without your knowledge.
In the normal situation, a search engine must follow a link to get to your webpage.
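To illustrate why an unlinked page normally stays undiscovered, here is a minimal sketch of the link-following step of a crawler. The class and function names, the example.com URLs and the sample page are all made up for illustration; real crawlers are far more elaborate.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects the absolute URLs of all <a href="..."> links on a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    # Resolve relative links against the page's own URL.
                    self.links.append(urljoin(self.base_url, value))


def discover_links(base_url, html):
    """Return every URL that a link-following crawler would learn from html."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    return collector.links


# Hypothetical page: it links to public.html but not to secret.html,
# so a crawler that only follows links never learns that secret.html exists.
page = '<a href="public.html">Public</a> <p>Nothing else is linked here.</p>'
found = discover_links("http://www.example.com/photos/", page)
```

In this sketch, found contains only the URL of public.html. A file that sits in the same directory but is not linked from anywhere never enters the crawler's queue.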
> 2. What causes (or prevents) an automated listing of every
> file in a directory to appear in a browser? Does the presence
> or absence of an index.html file in a given directory affect
> that in any way?
An automated directory listing is produced by the webserver only when
it is configured to do so. The listing is generated if the user types
in (or clicks on a link to) a directory name AND the webserver can't
find an ordinary page to serve (or has been instructed not to serve
one). The webserver will look for pages such as index.html, index.php,
index.cgi, etc., and will serve the first one that it finds. It
generates the autoindex only if none of these pages is found and
automatic listings are enabled.
The exact list of pages that the webserver looks for will differ
according to how the webserver is configured, but unless your ISP has
grossly misconfigured its webserver you can bet that index.html will
be one of the pages that the webserver will display in preference to
an autoindex.
The Apache webserver uses the "DirectoryIndex" directive for this purpose:
"The DirectoryIndex directive sets the list of resources
to look for, when the client requests an index of the
directory by specifying a / at the end of the directory
name ... If none of the resources exist and the Indexes
option is set, the server will generate its own listing
of the directory."
Apache HTTP Server Documentation
http://httpd.apache.org/docs/2.0/mod/mod_dir.html#directoryindex
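The decision described above can be sketched roughly as follows. This is not Apache's code, only an illustration of the logic; the candidate list, the function name and the "forbidden" outcome (Apache, for example, typically answers 403 Forbidden when no index page exists and listings are off) are assumptions for the example.

```python
import os
import tempfile

# Assumed list of index pages; the real list depends on the server's configuration.
INDEX_CANDIDATES = ["index.html", "index.php", "index.cgi"]


def resolve_directory_request(directory, autoindex_enabled):
    """Decide what a request for a directory URL would return.

    Returns ("page", name) if an index page exists, ("listing", names) if
    none exists and automatic listings are enabled, else ("forbidden", None).
    """
    for name in INDEX_CANDIDATES:
        if os.path.isfile(os.path.join(directory, name)):
            return ("page", name)  # the first match wins
    if autoindex_enabled:
        return ("listing", sorted(os.listdir(directory)))
    return ("forbidden", None)


# Demonstration in a throwaway directory with one hypothetical private file.
with tempfile.TemporaryDirectory() as demo_dir:
    open(os.path.join(demo_dir, "secret.html"), "w").close()
    without_index = resolve_directory_request(demo_dir, autoindex_enabled=True)

    # Adding an index.html hides the listing, even with autoindex enabled.
    open(os.path.join(demo_dir, "index.html"), "w").close()
    with_index = resolve_directory_request(demo_dir, autoindex_enabled=True)
```

Before index.html is added, the sketch returns a listing that exposes secret.html. Afterwards it returns the index page instead, which is why an index.html file in each directory prevents the automated listing.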
> 3. Is there any way to edit my webpage to prevent a browser
> from including its URL as a field in requests to linked sites?
The URL of the page that you are leaving is called the referrer URL.
It's not up to the webserver whether this is sent; it's up to the
browser. Some browsers can be configured so that they do not send the
referrer URL, but this is not usually satisfactory because some web
pages depend on the presence of the referrer URL to function properly.
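The referrer URL travels in an HTTP request header named "Referer" (the misspelling is part of the HTTP standard). As a rough illustration, the two requests below differ only in whether that header is present. The URLs are hypothetical, and Python's urllib merely stands in for a browser here.

```python
from urllib.request import Request

# Hypothetical URLs, used only for illustration.
PRIVATE_PAGE = "http://www.example.com/private/holiday.html"
LINKED_SITE = "http://www.example.org/"

# Default browser behaviour when a link on PRIVATE_PAGE is clicked:
# the request to the linked site names the page the user came from.
with_referrer = Request(LINKED_SITE, headers={"Referer": PRIVATE_PAGE})

# A browser configured to suppress referrers simply omits the header.
without_referrer = Request(LINKED_SITE)

# No network traffic happens here; the Request objects are only inspected.
leaked_url = with_referrer.get_header("Referer")
header_sent = without_referrer.has_header("Referer")
```

Here leaked_url is the private page's URL, which would end up in the linked site's logs, while header_sent is False for the second request.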
> 4. Assuming my friend doesn't forward the URL I e-mailed to her
> to others, are there any other gotchas in how search engines, or
> other people, could discover my pseudo-private webpage?
You should avoid using guessable names for the HTML files,
particularly standard names that are frequently used (such as
sitemap.html, login.html etc), because people might construct these
URLs directly rather than following links.
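One way to get a name that nobody will construct by guessing is to generate it randomly. The following is a minimal sketch using Python's standard secrets module; the function name and the default length are arbitrary choices for the example.

```python
import secrets


def unguessable_filename(extension=".html", num_bytes=16):
    """Return a random, URL-safe filename that is impractical to guess.

    num_bytes is the amount of randomness; 16 bytes (128 bits) is an
    assumed default for this sketch, not a recommendation from any standard.
    """
    return secrets.token_urlsafe(num_bytes) + extension


# Each call yields a different name, made of letters, digits, "-" and "_".
name = unguessable_filename()
```

A name like this does nothing against the other leaks described in this answer (referrers, published statistics, bookmarks); it only defeats guessing.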
You also need to ensure that the webserver statistics for your website
are not posted on the web, because they will contain links to your
URLs. Similarly, your webserver logs must not be published on the web.
Make sure too that the intended viewer does not bookmark your URLs
publicly, for example by using a social bookmarking site such as
http://del.icio.us/
Make sure that your ISP is not participating in any scheme to promote
your URLs (e.g. by submitting URLs to search engines, or by submitting
a sitemap to a service such as Google Sitemaps).
To summarise: you can make your pages "pseudo-private" by:
1. Not divulging the URL to anyone except the intended viewer
2. Trusting the intended viewer to do likewise
3. Having an index.html file in each pseudo-private directory
4. Turning off the sending of referrers by your browser
5. Keeping your stats off the web
6. Using the robots meta-tag to keep the honest robots out (and
a robots.txt file if your hosting arrangements permit)
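For item 6, the robots meta-tag goes in the head of each page, in the form <meta name="robots" content="noindex,nofollow">. The robots.txt alternative can be checked with Python's standard robots.txt parser, as in the sketch below. The /private/ directory and the example.com URLs are hypothetical, and only well-behaved robots honour these rules.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt asking all robots to stay out of /private/.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Ask the parser what an honest robot would conclude for two URLs.
private_allowed = parser.can_fetch(
    "*", "http://www.example.com/private/holiday.html"
)
public_allowed = parser.can_fetch("*", "http://www.example.com/index.html")
```

With this file in place, private_allowed is False and public_allowed is True.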
When all is said and done, it seems a lot more straightforward to
forget about "pseudo-private" and go for password-protected. You can
then email a password to the desired recipients.
The procedure to set up a password-protected directory will differ
according to which kind of webserver your ISP is using, and may not be
possible with all ISPs. However, it is often straightforward, and will
certainly keep the search engine crawlers out.
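As one illustration of what password protection involves, many webservers support HTTP Basic authentication, where the browser sends the user name and password base64-encoded in an "Authorization" header. The sketch below assumes that scheme; whether your ISP uses it is something you would need to check, and the function names are made up for the example.

```python
import base64
import hmac


def basic_auth_header(username, password):
    """Build the Authorization header value a browser sends for Basic auth."""
    credentials = f"{username}:{password}".encode("utf-8")
    return "Basic " + base64.b64encode(credentials).decode("ascii")


def is_authorized(header_value, username, password):
    """Check a received Authorization header against the expected login."""
    expected = basic_auth_header(username, password).encode("utf-8")
    received = (header_value or "").encode("utf-8")
    # compare_digest avoids leaking information through timing differences.
    return hmac.compare_digest(received, expected)


# Hypothetical credentials, as they might be emailed to the intended viewer.
header = basic_auth_header("friend", "s3cret")
accepted = is_authorized(header, "friend", "s3cret")
rejected = is_authorized(None, "friend", "s3cret")  # e.g. a crawler
```

A crawler that arrives without the header is refused, as in the rejected case. Note that Basic authentication only encodes the password and does not encrypt it, so it is best combined with HTTPS where available.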
If that is not possible, you could consider an online service that
allows you to create webpages to be shared only with people you invite,
for example:
MyFamily
http://www.myfamily.com/
I trust this answers your questions. If not, feel free to request clarification.
Regards,
eiffel-ga
Google Search Strategy:
apache autoindex
://www.google.com/search?hl=en&q=apache+autoindex
"private website"
://www.google.com/search?hl=en&q=%22private+website%22
"keep out search engines"
://www.google.com/search?hl=en&q=%22keep+out+search+engines%22