Subject:
Non-HTML files on the Internet
Category: Computers > Internet
Asked by: ll-ga
List Price: $15.00
Posted:
12 May 2002 22:52 PDT
Expires: 16 May 2002 20:21 PDT
Question ID: 15420
Information on the Internet in non-HTML formats:

1) Which non-HTML formats are most popular on the Internet?
2) What is the percentage of PDF, DOC, XLS, PS, ZIP etc. files on the Internet, both in terms of size and total number of files?
3) What are the trends (the dynamics of the above numbers)?
There is no answer at this time. |
Subject:
Re: Non-HTML files on the Internet
From: till-ga on 13 May 2002 01:32 PDT |
Are you sure that ANYBODY will be able to answer that? No one knows how many pages there are on the web, so how can you expect exact percentage figures?

till
Subject:
Re: Non-HTML files on the Internet
From: cheese-ga on 13 May 2002 04:37 PDT |
You can get a pretty good idea of some of these files by looking at Google's database:

- PDF files: 4,390,000
- GIF files: 1,900,000
- JPG files: 1,860,000
- SWF files: 1,720,000
- DOC files: 1,470,000
- PS files: 578,000
- XLS files: 337,000
- PPT files: 322,000
- RTF files: 226,000
- WP files: 41,800
- WP5 files: 21,300
- WP6 files: 9,580
- WPD files: 78
- WP4 files: 10
Subject:
Re: Non-HTML files on the Internet
From: jessamyn-ga on 13 May 2002 08:06 PDT |
I gave it my best shot, but this data is particularly tough to come by. The hardest part for me was trying to find file size information.

"The so-called science of poll-taking is not a science at all but a mere necromancy. People are unpredictable by nature, and although you can take a nation's pulse, you can't be sure that the nation hasn't just run up a flight of stairs."
--E.B. White, The New Yorker, Nov. 1948

While till is correct that an exact count of file types and sizes is impossible [what is the count now? .....how about now?], researchers have been trying to nail down rough statistics on this for some time, particularly since an answer to this question helps them build better web tools and plan the caching capacity of servers. As a result, a lot of the good data on this comes up in research papers studying caching and traffic analysis. And of course, you need at least two studies using the same metrics, or the results are not comparable.

Here are some interesting statistics I came up with when I was looking this up:

- Although the number of GIF files is twice the number of JPEG files, the total bytes of GIF files is pretty close to that of JPEG, meaning that the average JPEG is about twice the size of the average GIF.
- Applications account for 2.0% of the total requests but 31.5% of the total bytes, which is the largest share of the total bytes.
source: The Cacheability of Web Objects
http://cs-www.bu.edu/techreports/pdf/2000-019-web-cachability.pdf
Google cache: http://216.239.33.100/search?q=cache:Sw96xn7L0iEC:cs-www.bu.edu/techreports/pdf/2000-019-web-cachability.pdf+%22file+type+distribution%22+statistics&hl=en

[Google stats from November 2001; I'm not sure when cheese's are from]

- Portable Document Format (.pdf): about 4,690,000
- Microsoft Word (.doc): about 1,430,000
- Postscript (.ps): about 632,000
- Microsoft Excel (.xls): about 357,000
- Microsoft PowerPoint (.ppt): about 312,000
- Rich Text Format (.rtf): about 262,000
- Corel WordPerfect (.wpd & .wp5 & .wp4 & .wp): about 56,841

[source: http://www.faganfinder.com/site/news2.shtml ]

For a good read on measuring the internet, see:

- "Measuring the Immeasurable: Global Internet Measurement Infrastructure"
  http://www.caida.org/outreach/papers/2001/MeasInfra/

For a historical viewpoint, see these reports:

- "Why web usage statistics are (worse than) meaningless"
  http://www.goldmark.org/netrants/webstats/
- "An Investigation of Documents from the World Wide Web" [great charts]
  http://www5conf.inria.fr/fich_html/papers/P7/Overview.html
- "Multimedia Traffic Analysis Using CHITRA95"
  http://ei.cs.vt.edu/~succeed/95multimediaAWAFPR/95multimediaAWAFPR.html
Subject:
Re: Non-HTML files on the Internet
From: cheese-ga on 13 May 2002 12:22 PDT |
I had a good laugh over your answer! You quoted figures from http://www.faganfinder.com/ , my own website :-) . The figures from my first post are up to date; I checked Google's database as I was writing the post. I won't say what search terms I used to find those numbers because Google probably wouldn't like it.
Subject:
Re: Non-HTML files on the Internet
From: jzig-ga on 13 May 2002 14:35 PDT |
Taking the idea of a Google survey further, I think I'll have a rough estimate of the documents on the 19th, along with a trend reading. It would be more reliable, though, if you extended the length of the question so I could do a better trend report.

Note to cheese: your figures are around 2 million low on PDF, and low on everything else too. Try different methods of searching Google.
Subject:
Re: Non-HTML files on the Internet
From: cheese-ga on 13 May 2002 15:13 PDT |
Yes, the numbers do seem a little small. Remember that Google maintains a number of databases, and so results can be different all the time. I re-ran all of my searches on all of the databases and these were the largest numbers that I found (I also included txt, which I didn't before):

- PDF: 4,470,000
- TXT: 2,300,000
- SWF: 1,940,000
- GIF: 1,900,000
- JPG: 1,860,000
- DOC: 1,580,000
- JPEG: 624,000
- PS: 546,000
- XLS: 398,000
- PPT: 390,000
- RTF: 261,000
- WP: 43,800
- WP5: 21,500
- WP6: 9,620
- WPD: 79
- WP4: 11
Subject:
Re: Non-HTML files on the Internet
From: mikepake-ga on 14 May 2002 02:51 PDT |
There was an interesting article on this topic in the Guardian newspaper in the UK:

'Search for the invisible web'
http://www.guardian.co.uk/online/story/0,3605,547140,00.html

If you read down to the bottom of the page, there are lots of links to search engines dedicated to uncovering 'hidden' information and unusual file formats that conventional search engines don't uncover.

Like the other people above, I'd have to say that it's pretty much impossible to give a definitive answer to your question. For example, MP3 files are very popular on the Internet because of file sharing programs like AudioGalaxy and Kazaa, but many of those files are on personal computers and are only available 'online' while a person is connected.

Cheers,
Mike
Subject:
Re: Non-HTML files on the Internet
From: rajeevsmind-ga on 15 May 2002 10:16 PDT |
A good source on trends in search engine capacity would be 'The Invisible Web: Searching the sources search engines can't see' by Chris Sherman. |
Subject:
Re: Non-HTML files on the Internet
From: huntsman-ga on 15 May 2002 16:15 PDT |
For the curious,

To do a general search of file types, you can use the search terms shown below <between the brackets, including quotes and spaces> in the appropriate Google database:

________________________________________________________

<Search Terms> ----- Google Database ----- Result

HTML files -

- <"*.html" filetype:html> ----- Web ----- "about 7,410,000"
- <"*.htm" filetype:htm> ----- Web ----- "about 7,380,000"

Document files -

- <"*.pdf" filetype:pdf> ----- Web ----- "about 4,270,000"
- <"*.txt" filetype:txt> ----- Web ----- "about 2,000,000"
- <"*.swf" filetype:swf> ----- Web ----- "about 1,710,000"
- <"*.doc" filetype:doc> ----- Web ----- "about 1,390,000"
- <"*.ps" filetype:ps> ----- Web ----- "about 549,000"
- <"*.xls" filetype:xls> ----- Web ----- "about 333,000"
- <"*.ppt" filetype:ppt> ----- Web ----- "about 350,000"
- <"*.rtf" filetype:rtf> ----- Web ----- "about 222,000"

Image files -

- <"*.gif" filetype:gif> ----- Images ----- "about 1,900,000"
- <"*.jpg" filetype:jpg> ----- Images ----- "about 1,860,000"
- <"*.jpeg" filetype:jpeg> ----- Images ----- "about 624,000"

________________________________________________________

While this type of search might give us some relative ratios of the file types available, it doesn't find everything. Google's "Image Search Help" page states:

"About Google's Image Search"
http://images.google.com/help/faq_images.html

"Google's Image Search is the most comprehensive on the Web, with more than 330 million images indexed and available for viewing."

Yet the total GIF and JPEG file results shown above are less than 4-1/2 million.

FYI,
huntsman
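As a rough illustration of the search-term pattern above, here is a minimal Python sketch that builds the same query strings and checks the image totals against the 330 million figure quoted from Google's help page. The function name `filetype_query` is made up for this example, and nothing here talks to Google; it only formats strings and does arithmetic on the numbers already posted.

```python
def filetype_query(ext: str) -> str:
    """Build a search term of the form "*.ext" filetype:ext (hypothetical helper)."""
    ext = ext.lower().lstrip(".")
    return f'"*.{ext}" filetype:{ext}'

# Extensions taken from the list in this comment.
extensions = ["html", "htm", "pdf", "txt", "swf", "doc", "ps", "xls", "ppt", "rtf"]
for ext in extensions:
    print(filetype_query(ext))

# Image results reported above (GIF, JPG, JPEG), compared with the
# "more than 330 million images" claim on the help page.
image_results = {"gif": 1_900_000, "jpg": 1_860_000, "jpeg": 624_000}
image_total = sum(image_results.values())
indexed_images = 330_000_000
print(f"image results found: {image_total:,}")
print(f"share of indexed images: {100 * image_total / indexed_images:.1f}%")
```

In other words, the wildcard search surfaces only a small slice (a bit over 1%) of what the image index is said to contain, which is why these counts are better read as relative ratios than as totals.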
Subject:
Re: Non-HTML files on the Internet
From: ll-ga on 16 May 2002 20:21 PDT |
Thanks to everyone who posted comments on my question. Using those I was able to come up with the following:

| Format | Total # of files | Average file size | Total size | % in number | % in size | Convert. ratio |
|--------|------------------|-------------------|------------|-------------|-----------|----------------|
| All | 2,450,000,000 | 22 KB | 53.9 TB | - | - | - |
| PDF | 6,040,000 | 485 KB | 2,926 GB | 0.25% | 5.4% | 3.0 |
| PS | 557,000 | 495 KB | 275 GB | 0.02% | 0.5% | 12.7 |
| DOC | 1,580,000 | 212 KB | 334 GB | 0.06% | 0.62% | 6.9 |
| XLS | 398,000 | 41 KB | 16 GB | 0.02% | 0.03% | 0.36 |
| PPT | 390,000 | 417 KB | 162 GB | 0.02% | 0.3% | 20.6 |
| RTF | 261,000 | 389 KB | 102 GB | 0.01% | 0.2% | 5.2 |
| TXT | 2,300,000 | 22 KB | 51 GB | 0.09% | 0.09% | 1.0 |
| TEX | 179,000 | 236 KB | 42 GB | 0.01% | 0.08% | 1.0 |

Data from Google Web Database as of 05/15/2002.
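For anyone who wants to re-check the arithmetic, the following Python sketch recomputes total size and the two percentage columns from the file counts and average sizes posted above. It assumes decimal units (1 GB = 1,000,000 KB), which is an assumption that happens to reproduce the posted totals closely; small differences (for example PDF) presumably come from rounding of the average sizes. The conversion-ratio column is not recomputed, since the post does not say how it was derived.

```python
KB_PER_GB = 1_000_000  # assumption: decimal units, not 1024-based

TOTAL_FILES = 2_450_000_000  # "All" row: number of files
TOTAL_AVG_KB = 22            # "All" row: average file size in KB

# format -> (number of files, average file size in KB), as posted
formats = {
    "PDF": (6_040_000, 485),
    "PS": (557_000, 495),
    "DOC": (1_580_000, 212),
    "XLS": (398_000, 41),
    "PPT": (390_000, 417),
    "RTF": (261_000, 389),
    "TXT": (2_300_000, 22),
    "TEX": (179_000, 236),
}

def size_gb(count: int, avg_kb: float) -> float:
    """Total size in GB from a file count and an average size in KB."""
    return count * avg_kb / KB_PER_GB

def share_pct(part: float, whole: float) -> float:
    """Part as a percentage of whole."""
    return 100 * part / whole

total_gb = size_gb(TOTAL_FILES, TOTAL_AVG_KB)  # about 53,900 GB, i.e. 53.9 TB

for name, (count, avg_kb) in formats.items():
    gb = size_gb(count, avg_kb)
    print(f"{name:4} {gb:8.0f} GB  "
          f"{share_pct(count, TOTAL_FILES):.2f}% of files  "
          f"{share_pct(gb, total_gb):.2f}% of bytes")
```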