Trunks1054-
The working behind many compression algorithms is one of the more
artistic aspects of computer science: it makes elegant sense, but can
get really complicated at higher levels.
The basic idea behind a program like WinZip is to compress the number
of 1's and 0's required to store a set of static data. The premise
behind the most simplistic compression means is to simply come up with
a "key", which the program recognizes as a key and replaces with the
original data. So, a program might go through a file, and replace
every occurrence of 001011011011001001110001 with a shorter key-- say,
11100010. This cuts that part of the the file's size by 67%; the
tradeoff is, there must be a key kept somewhere separately, which
instructs the decompressor what to do when it encounters 11100010.
Obviously, then, the most efficient way of compressing data is to find
the sequence of 1's and 0's that is the MOST often repeated, and
replace it with a key. That way, the key can be stored within the
data as an 8-bit sequence, and the 24-bit "decryption" key needs only
be stored once. If the 24-bit pattern is replaced 20 times within the
document, that's a net savings of 296 bits. That doesn't sound like
much, but if the same procedure is applied to multiple series of bits
multiple times, compression can reduce a file's size TREMENDOUSLY- in
some cases, upwards of 80%. Efficiency increases as the size of a
file increases, and as its pattern of repeating chunks of data
increases.
To give a more concrete example:
(As quoted from http://www.howstuffworks.com/file-compression1.htm)
"As an example, let's look at a type of information we're all familiar
with: words.
In John F. Kennedy's 1961 inaugural address, he delivered this famous
line:
"Ask not what your country can do for you -- ask what you can do for
your country."
The quote has 17 words, made up of 61 letters, 16 spaces, one dash and
one period. If each letter, space or punctuation mark takes up one
unit of memory, we get a total file size of 79 units. To get the file
size down, we need to look for redundancies.
Immediately, we notice that:
"ask" appears two times
"what" appears two times
"your" appears two times
"country" appears two times
"can" appears two times
"do" appears two times
"for" appears two times
"you" appears two times
Ignoring the difference between capital and lower-case letters,
roughly half of the phrase is redundant. Nine words -- ask, not, what,
your, country, can, do, for, you -- give us almost everything we need
for the entire quote. To construct the second half of the phrase, we
just point to the words in the first half and fill in the spaces and
punctuation.
Most compression programs use a variation of the LZ adaptive
dictionary-based algorithm to shrink files. "LZ" refers to Lempel and
Ziv, the algorithm's creators, and "dictionary" refers to the method
of cataloging pieces of data.
The system for arranging dictionaries varies, but it could be as
simple as a numbered list. When we go through Kennedy's famous words,
we pick out the words that are repeated and put them into the numbered
index. Then, we simply write the number instead of writing out the
whole word.
So, if this is our dictionary:
1. ask
2. what
3. your
4. country
5. can
6. do
7. for
8. you
Our sentence now reads:
"1 not 2 3 4 5 6 7 8 -- 1 2 8 5 6 7 3 4"
If you knew the system, you could easily reconstruct the original
phrase using only this dictionary and number pattern. This is what the
expansion program on your computer does when it expands a downloaded
file. You might also have encountered compressed files that open
themselves up. To create this sort of file, the programmer includes a
simple expansion program with the compressed file. It automatically
reconstructs the original file once it's downloaded.
But how much space have we actually saved with this system? "1 not 2 3
4 5 6 7 8 -- 1 2 8 5 6 7 3 4" is certainly shorter than "Ask not what
your country can do for you; ask what you can do for your country;"
but keep in mind that we need to save the dictionary itself along with
the file.
In an actual compression scheme, figuring out the various file
requirements would be fairly complicated; but for our purposes, let's
go back to the idea that every character and every space takes up one
unit of memory. We already saw that the full phrase takes up 79 units.
Our compressed sentence (including spaces) takes up 37 units, and the
dictionary (words and numbers) also takes up 37 units. This gives us a
file size of 74, so we haven't reduced the file size by very much.
But this is only one sentence! You can imagine that if the compression
program worked through the rest of Kennedy's speech, it would find
these words and others repeated many more times.
In our example, we picked out all the repeated words and put those in
a dictionary. To us, this is the most obvious way to write a
dictionary. But a compression program sees it quite differently: It
doesn't have any concept of separate words -- it only looks for
patterns. And in order to reduce the file size as much as possible, it
carefully selects which patterns to include in the dictionary.
If we approach the phrase from this perspective, we end up with a
completely different dictionary.
If the compression program scanned Kennedy's phrase, the first
redundancy it would come across would be only a couple of letters
long. In "ask not what your," there is a repeated pattern of the
letter "t" followed by a space -- in "not" and "what." If the
compression program wrote this to the dictionary, it could write a "1"
every time a "t" were followed by a space. But in this short phrase,
this pattern doesn't occur enough to make it a worthwhile entry, so
the program would eventually overwrite it.
The next thing the program might notice is "ou," which appears in both
"your" and "country." If this were a longer document, writing this
pattern to the dictionary could save a lot of space -- "ou" is a
fairly common combination in the English language. But as the
compression program worked through this sentence, it would quickly
discover a better choice for a dictionary entry: Not only is "ou"
repeated, but the entire words "your" and "country" are both repeated,
and they are actually repeated together, as the phrase "your country."
In this case, the program would overwrite the dictionary entry for
"ou" with the entry for "your country."
The phrase "can do for" is also repeated, one time followed by "your"
and one time followed by "you," giving us a repeated pattern of "can
do for you." This lets us write 15 characters (including spaces) with
one number value, while "your country" only lets us write 13
characters (with spaces) with one number value, so the program would
overwrite the "your country" entry as just "r country," and then write
a separate entry for "can do for you." The program proceeds in this
way, picking up all repeated bits of information and then calculating
which patterns it should write to the dictionary. This ability to
rewrite the dictionary is the "adaptive" part of LZ adaptive
dictionary-based algorithm. The way a program actually does this is
fairly complicated, as you can see by the discussions on this site.
No matter what specific method you use, this in-depth searching system
lets you compress the file much more efficiently than you could by
just picking out words. Using the patterns we picked out above, and
adding "__" for spaces, we come up with this larger dictionary:
1. ask__
2. what__
3. you
4. r__country
5. __can__do__for__you
And this smaller sentence:
"1not__2345__--__12354"
The sentence now takes up 18 units of memory, and our dictionary takes
up 41 units. So we've compressed the total file size from 79 units to
59 units! This is just one way of compressing the phrase, and not
necessarily the most efficient one."
The further study of compression involves exactly HOW to maximize the
efficiency of this algorithm; this is a great part of the inner
workings of DivX (MPEG-4 compression) and many multimedia formats, as
well as the entire purpose of WinZip, PKZip, WinARJ, and various other
compression utilities.
Many or most of the software products on the market use some form of
compression at one point or another; Microsoft Word might partially
compress a saved file to save space, MP3 players decompress music
encoded and compressed, files distributed over the Internet might be
ZIP'ed (compressed using the ZIP protocol) to reduce their size for
transfers, and Media Player decompresses multimedia files in realtime
(Which is why you can now find video files at 1/10th the size they
used to be).
If you'd like more details about any subject I've mentioned, or have
any questions, please feel free to post a clarification request!
Bananarchy-ga |