Search engine technology is an exciting and rapidly evolving subject
right now. Your question goes to the heart of how these engines decide
the order in which to present what may be an enormous number of "hits"
on your search topic.
First, a little vocabulary: some people use the term "document weight"
as you do, to describe a measure of the relevance of a particular web
page to the search being conducted. In this context it is generally
synonymous with "page rank," which seems to be the more widely used
term (judging, at least, from the number of relevant hits a Google
search on "page rank" turns up compared with a search on "document
weight"). "Document weight" is also used to describe the size of a
"page" (the amount of material downloaded via a single URL), but we
will ignore that meaning in this discussion. I'll use the term "page
rank" here since it appears in more references.
Also, I'm going to assume that your interest is primarily academic.
There is a growing business in trying to trick search engines into
giving a higher page rank to particular web pages as a marketing tool.
The makers of search engines, of course, strive to make sure that such
tricks are ineffective. One way they do this is by keeping their
algorithmic details confidential. They also may play the spy vs. spy
game of watching for the use of such tricks and refining their ranking
algorithms to circumvent the tricks. At the same time, some search
companies try to play double agent by selling improved page rank
(positioning in search results).
As maxhodges notes in his comment, each search engine uses its own
proprietary algorithms to determine page rank. Probably the most
authoritative descriptions of these algorithms are in the material
released by each company. A few of these are excerpted below:
Google:
Google page rank is described at the Google website:
://www.google.com/technology/index.html
"PageRank relies on the uniquely democratic nature of the web by using
its vast link structure as an indicator of an individual page's value.
In essence, Google interprets a link from page A to page B as a vote,
by page A, for page B. But, Google looks at more than the sheer volume
of votes, or links a page receives; it also analyzes the page that
casts the vote. Votes cast by pages that are themselves 'important'
weigh more heavily and help to make other pages 'important.'
"Important, high-quality sites receive a higher PageRank, which Google
remembers each time it conducts a search. Of course, important pages
mean nothing to you if they don't match your query. So, Google
combines PageRank with sophisticated text-matching techniques to find
pages that are both important and relevant to your search. Google goes
far beyond the number of times a term appears on a page and examines
all aspects of the page's content (and the content of the pages
linking to it) to determine if it's a good match for your query.
"Google's complex, automated methods make human tampering with our
results extremely difficult. And though we do run relevant ads above
and next to our results, Google does not sell placement within the
results themselves (i.e., no one can buy a higher PageRank). A Google
search is an easy, honest and objective way to find high-quality
websites with information relevant to your search."
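To make the "links as votes" idea concrete, here is a minimal sketch
of the link-analysis part only. It is not Google's actual
implementation, which is proprietary and combines PageRank with text
matching. The function name, the 0.85 damping factor, the fixed
iteration count and the handling of pages with no outgoing links are
my assumptions, not something stated in the quoted material.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Illustrative link-voting score (an assumption-laden sketch).

    `links` maps each page to the list of pages it links to.
    Returns a dict mapping every page to a score; scores sum to 1.
    """
    # Collect every page that appears as a source or as a target.
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        # Every page starts each round with a small baseline score.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its vote among the pages it links to,
                # so a vote from a high-scoring page carries more weight.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Assumption: a page with no outgoing links spreads its
                # vote evenly over all pages.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank


if __name__ == "__main__":
    example = {"A": ["B"], "C": ["B"], "B": ["A"]}
    for page, score in sorted(pagerank(example).items()):
        print(page, round(score, 3))
```

In this toy graph, B collects votes from both A and C and ends up with
the highest score. A comes second because its single vote comes from
the "important" page B, while C receives no votes at all.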
Inktomi ( http://www.inktomi.com ):
Inktomi provides the search engine used by a number of Internet
portals. They don't describe their page ranking strategy in detail,
but they do provide some indications as to how they try to prevent
attempts to circumvent it:
Inktomi Content Guidelines
http://www.inktomi.com/products/web_search/guidelines.html
"Inktomi strives to provide the best search experience on the Web by
directing searchers to high-quality and relevant Web content in
response to a search query.
"Pages Inktomi Wants Included in Its Index
- Original and unique content of genuine value
- Pages designed primarily for humans, with search engine
considerations secondary
- Hyperlinks intended to help people find interesting, related
content, when applicable
- Metadata (including title and description) that accurately describes
the contents of a Web page
- Good Web design in general
"Unfortunately, not all Web pages contain information that is valuable
to a user. Some pages are created deliberately to trick the search
engine into offering inappropriate, redundant or poor-quality search
results; this is often called 'spam.' Inktomi does not want these
pages in the index.
"What Inktomi Considers Unwanted
"Some, but not all, examples of the more common types of pages that
Inktomi does not want include:
- Pages that harm accuracy, diversity or relevance of search results
- Pages dedicated to directing the user to another page
- Pages that have substantially the same content as other pages
- Sites with numerous, unnecessary virtual hostnames
- Pages in great quantity, automatically generated or of little value
- Pages using methods to artificially inflate search engine ranking
- The use of text that is hidden from the user
- Pages that give the search engine a different page than the public
sees (cloaking)
- Excessively cross-linking sites to inflate a site's apparent
popularity
- Pages built primarily for the search engines
- Misuse of competitor names
- Multiple sites offering the same content
- Pages that use excessive pop-ups, interfering with user navigation
- Pages that seem deceptive, fraudulent or provide a poor user
experience
"Inktomi Guidelines
"Inktomi's policies are designed to ensure that poor-quality pages do
not degrade the user experience in any way. As with Inktomi's other
guidelines, Inktomi reserves the right, at its sole discretion, to
take any and all action it deems appropriate to insure the quality of
its index."
Teoma ( http://www.teoma.com ):
A new player in the search engine market, Teoma describes their
ranking at:
"How Teoma Works"
http://static.wc.teoma.com/docs/teoma/about/searchWithAuthority.html
"Teoma employs a technique called Subject-Specific Popularity.
Subject-Specific Popularity analyzes the relationship of sites within
a community, ranking a site based on the number of same-subject pages
that reference it, among other things. In other words, Teoma
determines the best answer for a search by asking experts within a
specific subject community about who they believe is the best resource
for that subject. By assessing the opinions of a sites peers, Teoma
establishes authority for the search result. Relevant search results
ranked by Subject-Specific Popularity are presented under the heading
Results on the Teoma.com results page."
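As a rough illustration of "the number of same-subject pages that
reference it," the sketch below counts only links that come from pages
on the same subject. Teoma does not say how it identifies subject
communities or what the "other things" are, so the topic labels here
are simply supplied as input, and the function and variable names are
hypothetical.

```python
def subject_specific_popularity(links, topics, subject):
    """Rank pages on `subject` by how many other same-subject pages link to them.

    `links` maps each page to the list of pages it links to.
    `topics` maps each page to a subject label (assumed to be known).
    Returns (page, score) pairs, highest score first.
    """
    community = {page for page, topic in topics.items() if topic == subject}
    scores = {page: 0 for page in community}

    for source, targets in links.items():
        if source not in community:
            continue  # votes from outside the subject community are ignored
        for target in set(targets):  # count each linking page only once
            if target in community and target != source:
                scores[target] += 1

    # Highest score first; ties are broken alphabetically for stable output.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


if __name__ == "__main__":
    topics = {"a": "jazz", "b": "jazz", "c": "jazz", "d": "cooking"}
    links = {"a": ["c"], "b": ["c", "a"], "c": [], "d": ["a", "a"]}
    print(subject_specific_popularity(links, topics, "jazz"))
```

Here the cooking page "d" links to "a" twice, but those links do not
count, because "d" is not one of the jazz community's "peers."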
AltaVista ( http://www.altavista.com ):
For a time, AltaVista was one of the most popular search engines. If
they use a page-ranking scheme, it is not described in detail. The
only reference to results placement I could find on their site says:
"AltaVista first looks for Web pages that contain all of the words.
Pages with all those words will be at the top of your results; those
with only one of the words will be at the bottom."
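The ordering AltaVista describes can be sketched as a sort on the
number of distinct query words each page contains. This is only my
reading of that one sentence. The tokenizing, the tie-breaking and the
function name are assumptions, and their real engine surely does more.

```python
import re


def rank_by_matched_terms(query, documents):
    """Order documents by how many distinct query words they contain.

    `documents` maps a document name to its text. Documents matching
    none of the words are left out of the results.
    """
    terms = set(query.lower().split())
    results = []
    for name, text in documents.items():
        words = set(re.findall(r"\w+", text.lower()))
        matched = len(terms & words)
        if matched:
            results.append((name, matched))

    # Pages with all of the words come first, pages with one word last.
    results.sort(key=lambda item: (-item[1], item[0]))
    return results


if __name__ == "__main__":
    docs = {
        "x": "Search engine ranking explained",
        "y": "A steam engine",
        "z": "Gardening tips",
    }
    print(rank_by_matched_terms("search engine ranking", docs))
```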
Most other major search engines that appear to compete with those
described above now use results or techniques from one of them. A good
listing of the major players can be found at SearchEngineWatch.com:
http://searchenginewatch.com/links/major.html
The article was last updated only three months ago, yet some of its
information is already out of date. It is still a handy source of
links to the major players.
You may also be interested in the article "How Search Engines Work,"
by Danny Sullivan, posted at SearchEngineWatch.com:
http://searchenginewatch.com/webmasters/work.html
For more information, you can use your favorite search engine to
search for such terms as
"search engine"
"page rank"
"document weight"
Direct links to these searches using Google:
://www.google.com/search?q="search engine"
://www.google.com/search?q="page rank"
://www.google.com/search?q="document weight"
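If you want to build exact-phrase search links like these yourself,
the quotation marks and spaces need to be URL-encoded. A small sketch
using Python's standard library follows. The helper name and the
"http://" base address in the default argument are my assumptions.

```python
from urllib.parse import urlencode


def phrase_search_url(phrase, base="http://www.google.com/search"):
    """Build a search URL for an exact phrase (base URL is an assumption)."""
    # Wrapping the phrase in double quotes asks for an exact-phrase match;
    # urlencode turns the quotes into %22 and the spaces into +.
    return base + "?" + urlencode({"q": '"' + phrase + '"'})


if __name__ == "__main__":
    for phrase in ("search engine", "page rank", "document weight"):
        print(phrase_search_url(phrase))
```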
And you may also want to read the April-fool description of Google's
"PigeonRank" system at
://www.google.com/technology/pigeonrank.html
Happy searching!