Q: Non-HTML files on the Internet ( No Answer,   10 Comments )
Question  
Subject: Non-HTML files on the Internet
Category: Computers > Internet
Asked by: ll-ga
List Price: $15.00
Posted: 12 May 2002 22:52 PDT
Expires: 16 May 2002 20:21 PDT
Question ID: 15420
Information on the Internet in non-HTML formats:  
1) Which non-HTML formats are most popular on the Internet? 
2) What is the percentage of PDF, DOC, XLS, PS, ZIP etc. files on the Internet  
   both in terms of size and total number of files? 
3) What are the trends (the dynamics of the above numbers)?  
Answer  
There is no answer at this time.

Comments  
Subject: Re: Non-HTML files on the Internet
From: till-ga on 13 May 2002 01:32 PDT
 
Are you sure that ANYBODY will be able to answer that? No one knows
how many pages there are on the web, so how can you expect exact
figures as percentages?
till
Subject: Re: Non-HTML files on the Internet
From: cheese-ga on 13 May 2002 04:37 PDT
 
You can get a pretty good idea of some of these files by looking at
Google's database:


PDF files: 4,390,000
GIF files: 1,900,000
JPG files: 1,860,000
SWF files: 1,720,000
DOC files: 1,470,000
PS files:    578,000
XLS files:   337,000
PPT files:   322,000
RTF files:   226,000
WP files:     41,800
WP5 files:    21,300
WP6 files:     9,580
WPD files:        78
WP4 files:        10
Subject: Re: Non-HTML files on the Internet
From: jessamyn-ga on 13 May 2002 08:06 PDT
 
I gave it my best shot, but this data is particularly tough to come
by. Hardest part for me was trying to find file size information.

The so-called science of poll-taking is not a science at all but a
mere necromancy. People are unpredictable by nature, and although you
can take a nation's pulse, you can't be sure that the nation hasn't
just run up a flight of stairs.
--E.B. White, The New Yorker, Nov. 1948

While till is correct that an exact count of file types and sizes is
impossible [what is the count now? .....how about now?] researchers
have been trying to nail down rough statistics on this for some time,
particularly since answering this question helps them build better
web tools and plan the caching capacity of servers. As a result, a
lot of the good data on this comes up in
research papers studying caching and traffic analysis. And of course,
you need at least two studies using the same metrics or the data
results are non-comparable.

Here are some interesting statistics I came up with when I was looking
this up:

- Although the number of GIF files is twice the number of JPEG
files, the total bytes of GIF files is pretty close to that of JPEG,
meaning that the average JPEG is twice the size of the average GIF.

- Applications account for 2.0% of the total requests but 31.5% of
the total bytes, which is the largest fraction of the total bytes.

source: The Cacheability of Web Objects 
     http://cs-www.bu.edu/techreports/pdf/2000-019-web-cachability.pdf
Google cache: 
     http://216.239.33.100/search?q=cache:Sw96xn7L0iEC:cs-www.bu.edu/techreports/pdf/2000-019-web-cachability.pdf+%22file+type+distribution%22+statistics&hl=en
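
The GIF/JPEG point above is plain arithmetic: if one format has twice as many files but about the same total bytes, its average file is half the size. A minimal sketch in Python, using invented counts and byte totals (they are hypothetical illustrations, not figures from the paper):

```python
# Hypothetical illustration: these counts and byte totals are invented,
# chosen only to match the 2:1 file-count ratio described above.
gif_count = 2_000              # assumed: twice as many GIFs as JPEGs
jpeg_count = 1_000
gif_total_bytes = 10_000_000   # assumed: roughly equal total bytes
jpeg_total_bytes = 10_000_000

# Average size is total bytes divided by number of files.
avg_gif = gif_total_bytes / gif_count
avg_jpeg = jpeg_total_bytes / jpeg_count
size_ratio = avg_jpeg / avg_gif

print(f"average GIF:  {avg_gif:.0f} bytes")
print(f"average JPEG: {avg_jpeg:.0f} bytes")
print(f"JPEG/GIF average size ratio: {size_ratio:.1f}")
```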

[Google stats from November 2001; I'm not sure when cheese's are from]

Portable Document Format (.pdf) about 4,690,000
Microsoft Word (.doc) about 1,430,000
Postscript (.ps) about 632,000
Microsoft Excel (.xls) about 357,000
Microsoft PowerPoint (.ppt) about 312,000
Rich Text Format (.rtf) about 262,000
Corel WordPerfect (.wpd & .wp5 & .wp4 & .wp) about 56,841

[source: http://www.faganfinder.com/site/news2.shtml ]

For a good read on measuring the internet, see
"Measuring the Immeasurable: Global Internet Measurement
Infrastructure"
     http://www.caida.org/outreach/papers/2001/MeasInfra/

For a historical viewpoint, see these reports

Why web usage statistics are (worse than) meaningless
     http://www.goldmark.org/netrants/webstats/
"An Investigation of Documents from the World Wide Web"
     http://www5conf.inria.fr/fich_html/papers/P7/Overview.html
     [great charts]
"Multimedia Traffic Analysis Using CHITRA95"
     http://ei.cs.vt.edu/~succeed/95multimediaAWAFPR/95multimediaAWAFPR.html
Subject: Re: Non-HTML files on the Internet
From: cheese-ga on 13 May 2002 12:22 PDT
 
I had a good laugh over your answer! You quoted figures from
http://www.faganfinder.com/ , my own website :-) . The figures from my
first post are up to date ones, I checked Google's database as I was
writing the post.

I won't say what search terms I used to find those numbers because
Google probably wouldn't like it.
Subject: Re: Non-HTML files on the Internet
From: jzig-ga on 13 May 2002 14:35 PDT
 
Taking the idea of a Google survey further, I think I'll have a rough
estimate of the documents on the 19th, and a trend reading. It would
be more reliable, though, if you extended the duration of the question
so I could do a better trend report. Note to cheese: your figures are
around 2 million low on PDF, and low on everything else too. Try
different methods of searching Google.
Subject: Re: Non-HTML files on the Internet
From: cheese-ga on 13 May 2002 15:13 PDT
 
Yes, the numbers do seem a little small. Remember that Google
maintains a number of databases, and so results can be different all
the time. I re-ran all of my searches on all of the databases and
these were the largest numbers that I found (I also included txt,
which I didn't before):

PDF 4,470,000
TXT 2,300,000
SWF 1,940,000
GIF 1,900,000
JPG 1,860,000
DOC 1,580,000
JPEG  624,000
PS    546,000
XLS   398,000
PPT   390,000
RTF   261,000
WP     43,800
WP5    21,500
WP6     9,620
WPD        79
WP4        11
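
For anyone who wants relative shares rather than raw counts, here is a rough Python sketch that turns the list above into percentages. It assumes the counts are comparable with each other, and the shares are only relative to the formats listed here, not to the whole web. JPG and JPEG are kept separate, as in the list.

```python
# Counts copied from the list above (largest Google result counts found).
counts = {
    "PDF": 4_470_000,
    "TXT": 2_300_000,
    "SWF": 1_940_000,
    "GIF": 1_900_000,
    "JPG": 1_860_000,
    "DOC": 1_580_000,
    "JPEG": 624_000,
    "PS": 546_000,
    "XLS": 398_000,
    "PPT": 390_000,
    "RTF": 261_000,
    "WP": 43_800,
    "WP5": 21_500,
    "WP6": 9_620,
    "WPD": 79,
    "WP4": 11,
}

total = sum(counts.values())


def share(fmt: str) -> float:
    """Percentage of the listed files that have the given format."""
    return 100.0 * counts[fmt] / total


# Print formats from most to least common with their share of the listed total.
for fmt, n in sorted(counts.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{fmt:<5} {n:>10,} {share(fmt):6.2f}%")
```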
Subject: Re: Non-HTML files on the Internet
From: mikepake-ga on 14 May 2002 02:51 PDT
 
There was an interesting article on this topic in the Guardian
newspaper in the UK:

'Search for the invisible web'
http://www.guardian.co.uk/online/story/0,3605,547140,00.html 

If you read down to the bottom of the page there are lots of links to
search engines dedicated to uncovering 'hidden' information and
unusual file formats that conventional search engines don't uncover.
Like the other people above, I'd have to say that it's pretty much
impossible to give a definitive answer to your question. For example,
MP3 files are very popular on the Internet because of file sharing
programs like AudioGalaxy and Kazaa but many of those files are on
personal computers and are only available 'online' while a person is
connected.

Cheers,
       Mike
Subject: Re: Non-HTML files on the Internet
From: rajeevsmind-ga on 15 May 2002 10:16 PDT
 
A good source on trends in search engine capacity would be 'The
Invisible Web: Searching the sources search engines can't see' by
Chris Sherman.
Subject: Re: Non-HTML files on the Internet
From: huntsman-ga on 15 May 2002 16:15 PDT
 
For the curious, 

To do a general search of file types, you can use the search terms
shown below <between the brackets, including quotes and spaces> in the
appropriate Google database:

________________________________________________________

<Search Terms> ----- Google Database ----- Result

HTML files -

<"*.html" filetype:html> ----- Web ----- "about 7,410,000"
<"*.htm" filetype:htm> ----- Web ----- "about 7,380,000"

Document files -

<"*.pdf" filetype:pdf> ----- Web ----- "about 4,270,000"
<"*.txt" filetype:txt> ----- Web ----- "about 2,000,000"
<"*.swf" filetype:swf> ----- Web ----- "about 1,710,000"
<"*.doc" filetype:doc> ----- Web ----- "about 1,390,000"
<"*.ps" filetype:ps> ----- Web ----- "about 549,000"
<"*.xls" filetype:xls> ----- Web ----- "about 333,000"
<"*.ppt" filetype:ppt> ----- Web ----- "about 350,000"
<"*.rtf" filetype:rtf> ----- Web ----- "about 222,000"

Image files -

<"*.gif" filetype:gif> ----- Images ----- "about 1,900,000"
<"*.jpg" filetype:jpg> ----- Images ----- "about 1,860,000"
<"*.jpeg" filetype:jpeg> ----- Images ----- "about 624,000"

________________________________________________________


While this type of search might give us some relative ratios of the
file types available, it doesn't find everything. Google's "Image
Search Help" page states:

"About Google's Image Search" 
http://images.google.com/help/faq_images.html
"Google's Image Search is the most comprehensive on the Web, with more
than 330 million images indexed and available for viewing."

Yet the total GIF and JPEG file results shown above are less than
4-1/2 million.
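
A small Python sketch of the two things above: building the search strings, and checking how small the image results are next to the 330 million figure. The function name `filetype_query` is just made up for illustration, and nothing here actually contacts Google.

```python
def filetype_query(ext: str) -> str:
    """Build a search string of the form shown above, e.g. "*.pdf" filetype:pdf."""
    return f'"*.{ext}" filetype:{ext}'


# Image result counts quoted above (Google Images database).
image_results = {"gif": 1_900_000, "jpg": 1_860_000, "jpeg": 624_000}

# "more than 330 million images", per the Image Search help page quote.
indexed_images = 330_000_000

found = sum(image_results.values())
coverage = 100.0 * found / indexed_images

for ext in image_results:
    print(filetype_query(ext))
print(f"{found:,} results, about {coverage:.1f}% of {indexed_images:,} indexed images")
```

So this kind of search surfaces only a small fraction of what the image index is said to hold.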

FYI,
huntsman
Subject: Re: Non-HTML files on the Internet
From: ll-ga on 16 May 2002 20:21 PDT
 
Thanks to everyone who posted comments on my question.
Using those I was able to come up with the following:

Format  Total # of files  Average file size  Total size  % in number  % in size  Convert. ratio

All     2,450,000,000     22 KB              53.9 TB     -            -          -
PDF     6,040,000         485 KB             2,926 GB    0.25%        5.4%       3.0
PS      557,000           495 KB             275 GB      0.02%        0.5%       12.7
DOC     1,580,000         212 KB             334 GB      0.06%        0.62%      6.9
XLS     398,000           41 KB              16 GB       0.02%        0.03%      0.36
PPT     390,000           417 KB             162 GB      0.02%        0.3%       20.6
RTF     261,000           389 KB             102 GB      0.01%        0.2%       5.2
TXT     2,300,000         22 KB              51 GB       0.09%        0.09%      1.0
TEX     179,000           236 KB             42 GB       0.01%        0.08%      1.0

Data from Google Web Database as of 05/15/2002.
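
The derived columns can be rechecked from the file counts and average sizes. Below is a rough Python sketch of that arithmetic, assuming decimal units (1 GB = 1,000,000 KB), which is what the "All" row implies (2,450,000,000 files at 22 KB is 53.9 TB). Recomputed totals can differ slightly from the posted ones (PDF comes out near 2,929 GB rather than 2,926 GB), presumably because the averages shown are rounded. The conversion ratio column is not reproduced because its definition is not given.

```python
# Rows copied from the table above: (format, number of files, average size in KB).
rows = [
    ("PDF", 6_040_000, 485),
    ("PS", 557_000, 495),
    ("DOC", 1_580_000, 212),
    ("XLS", 398_000, 41),
    ("PPT", 390_000, 417),
    ("RTF", 261_000, 389),
    ("TXT", 2_300_000, 22),
    ("TEX", 179_000, 236),
]

ALL_FILES = 2_450_000_000   # "All" row: total number of files
ALL_AVG_KB = 22             # "All" row: average file size in KB
KB_PER_GB = 1_000_000       # assumption: decimal units, not 1024-based

all_total_kb = ALL_FILES * ALL_AVG_KB


def derive(count: int, avg_kb: int) -> tuple[float, float, float]:
    """Return (total size in GB, % of all files, % of all bytes) for one format."""
    total_kb = count * avg_kb
    return (
        total_kb / KB_PER_GB,
        100.0 * count / ALL_FILES,
        100.0 * total_kb / all_total_kb,
    )


for fmt, count, avg_kb in rows:
    total_gb, pct_number, pct_size = derive(count, avg_kb)
    print(f"{fmt:<4} {total_gb:>8,.0f} GB  {pct_number:.2f}% of files  {pct_size:.2f}% of bytes")
```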
