The Bright Planet "Deep Web" white paper is still one of the best
investigations of the web. You linked the FAQ; the actual 2001
white paper, by Michael K. Bergman, is here:
"Deep Web White Paper," (Bergman, July 2001)
It's important because of the claims it makes about the web,
specifically that the "deep web" is 400 to 550 times larger than the
public web. It consists of databases that are not searchable by a
searchbot, including public databases with a CGI interface; private
Intranet pages; proprietary databases such as Lexis/Nexis or the
Thomson Gale databases; and pages that block searchbots.
Note here that in mid-2003, Google itself estimated that it was
reaching about 50% of the 4 billion pages on the Internet. Then, in
February 2004, Google was reaching 4.28 billion web pages (and with
images and message boards indexing 6 billion items):
"Google Achieves Search Milestone," (Feb. 17, 2004)
The copyright notice on today's Google home page now claims
8,058,044,651 web pages.
But we don't know whether Google has increased the percentage of web
pages it indexes.
Probably one of the most exhaustive studies done to assess the amount
of existing information is a study titled "How Much Information?" done
at the University of California at Berkeley. A good aspect of this
study is that it was done in 2000, then again in 2003. It also
attempts to measure the TOTAL amount of information, including that on
paper in library stacks and in other areas where search engines are
starting to make inroads, like TV, film, and other recorded media.
Between the two studies, the size of the public web went from the 14-28
terabyte range in 2000 (a terabyte is 1 million megabytes) to 167
terabytes in 2003, a growth of 6x to 12x. Their estimate of the
"deep web" was the same figure used by Bright Planet: 400 to 550
times larger.
Both studies use the same average web page size of 18.7 KB, taken
from a Nature article that studied the statistical average of web
page sizes back in 1999. Though I'm skeptical that web page sizes have
remained constant, readability makes it unlikely that they've doubled
or tripled, so we can make some estimates of web page growth from the
"How Much Information?" study:
Low-end 2000 (14 terabytes or 14 x 10^12 bytes): 749 million pages
High-end 2000 (28 terabytes): 1.50 billion pages
2003 estimate (167 terabytes): 8.93 billion pages
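The page counts above are just each study's byte total divided by the 18.7 KB average page size. A quick back-of-the-envelope sketch in Python (the labels are mine; the figures are those quoted above):

```python
# Average web page size in bytes, from the 1999 Nature study cited above.
AVG_PAGE_BYTES = 18.7e3

# Public-web size estimates (terabytes) from the Berkeley studies.
estimates_tb = {
    "Low-end 2000": 14,
    "High-end 2000": 28,
    "2003": 167,
}

for label, tb in estimates_tb.items():
    pages = tb * 1e12 / AVG_PAGE_BYTES  # terabytes -> bytes -> pages
    print(f"{label}: {pages / 1e9:.2f} billion pages")

# Implied growth of the public web from 2000 to 2003:
print(f"Growth: {167 / 28:.1f}x to {167 / 14:.1f}x")
```

The growth result (roughly 6x to 12x) is where the range quoted elsewhere in this piece comes from.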
Note that there are some significant differences here between the data
gathered by Bright Planet and the Berkeley study. For example,
Bright Planet claims that "Sixty of the largest Deep Web sites
collectively contain about 750 terabytes of information." But Bright
Planet is also counting all documents and document types, including
images, which the Berkeley studies seem to exclude. The Berkeley
studies break the data types (e-mail, blogs, spam, web pages, web
images) down in great detail.
Here are links to the two studies:
"How Much Information?" (2000)
"How Much Information?" (2003)
What is safe to conclude from all of this?
• that digitization of information is growing rapidly
• that it is highly segmented, including non-searchable and proprietary databases
• that it's difficult to measure because of the combination of image
and page objects. Here's an example: are these two different pages
or really just one?
Mooney Owners Poll
Are You Afraid of Heights?
• that search engine reach is growing, but the creation of electronic
content is growing faster
• that a 6x-12x growth in web pages during the 2000-2003 period is reasonable
• that we need another study soon