Hello Jason,
1 Data Compression Algorithms
The effectiveness of algorithms varies with the type of data being
compressed. A good on line reference that describes a variety of
methods is
http://www.ics.uci.edu/~dan/pubs/DataCompression.html
Titled "Data Compression" - this is a little dated - the last
algorithm was defined in 1986, but is a good survey of methods and
effectiveness.
Section 1 describes the differences between block and variable
data/codes, and dynamic or static algorithms. The paper focuses on
lossless (same data out as went in) methods, there are also "lossy"
algorithms that provide better compression while sacrificing some
accuracy in reproducing the original data. Section 2 looks at
algorithms that depend on the type of data (e.g., run length encoding
for images).
Section 3 describes static methods like Huffman where fixed length
characters are coded into variable length bit patterns based on the
probability of occurrence. It achieves its compression be reducing the
average bit size from 8 to a smaller value.
Section 4 describes some adaptive methods (including Adaptive Huffman
- used in the Unix utility compact). Compression in ranges of 30-40%
are typical.
Section 5 describes other adaptive methods including Lempel Ziv - used
in the Unix utility compress. In Lempel-Ziv, fixed length code words
are used to encode (usually longer) strings of characters. Compression
in ranges of 50-60% are typical.
Section 6 has some empirical results. Section 7 describes what happens
when errors occur. Section 8 describes "new directions". Again - this
appears to be a good overview and should be used to get background on
the general topic.
Another good starting point is at
http://www.3dcompression.com/background.phtml
or
http://www.cs.pdx.edu/~idr/compression/
which have a number of links to relevant papers, source code, and
other materials.
For multimedia methods check out
http://www.cs.cf.ac.uk/Dave/Multimedia/node141.html
which has references to a number of current audio and video (both
lossless and lossy algorithms). Check the description of MPEG
effectiveness (a lossy algorithm) which describes a range of 1/4 to
1/12 the size of the compressed data when compared to the original
size.
To get source code, try some of the following sites:
http://datacompression.info/SourceCode.shtml
An extensive list of algorithms, source code references with ratings
of many items.
http://www.hn.is.uec.ac.jp/~arimura/compression_links.html
Another huge list of links - source code references are mixed in with
links to technical articles, standards groups, and so on.
http://directory.google.com/Top/Computers/Algorithms/Compression/
or
http://directory.google.com/Top/Computers/Software/Data_Compression/
Of course, the Google directory has a number of links in page rank
order. The second reference has a mix of free and commercial products
listed.
2 "Scaleable" Algorithms
To a certain point, lossy algorithms such as MPEG provide a range of
compression rates / quality of results. However, based on the way you
phrased #2, it looks like you are more interested in lossless
algorithms. Most of the modern lossless algorithms already have
heuristics that balance the size of the dictionary with the size of
the encoded data to get good compression with minimal overhead. You
will need to check the specific documentation with the source code you
get to get the real answer to this question.
3 Repeated Compression
The data compression section (1.3) at
http://www.ics.uci.edu/~dan/pubs/DC-Sec1.html
describes low redundancy as a goal of good compression. This was
described by Shannon in the late 40's. Basically - a good compression
algorithm has little redundancy, the coded information has the same
number of bits as the "essential content". As a result, repeated
application of a good compression algorithm will fail to reduce the
size of the data and often increase it due to overhead.
If you want, use a clarification request to provide additional
information of the type of data you are trying to compress, and I can
provide some more focused references. I would be glad to go into more
detail on any part of the answer.
--Maniac |
Request for Answer Clarification by
jroff-ga
on
06 Dec 2002 20:38 PST
You rock... you gave me a lot to look into.
If you don't mind...
I am interested in writing a backup utility (lossless) for any data.
I had a thought a while back to scan many machines and try to come up
with a representation of the most common patterns and creating a huge
dictionary. This would be effective for larger sequences. For
instance, the most common 200byte groups found on all the systems I
can find. I don't care if my library is 10GB and I don't care if it
takes a long time to run the compression (or decompression).
Therefore, many passes would be acceptible.
If I could represent common 200byte groups with 5bytes I would have a
potential (but not realistic) compression ratio as high as 1/40. For
the sections that weren't found, I could match them against 100byte
groups on the second pass. Then 50 byte groups on the third pass...
then 25.
With this, if I started over again, this has nothing to do with
reduncancy. The file changed and new groups can be found.
Ultimately, I thought I would have to implement a number of
algorithms. For instance, if I only searched for common byte patterns
in ZIP files, I can then ZIP the files first then work on making that
better with this library.
This was just something that came to me on the two hour commute on the
LIRR last year. Let me know if you think this has any value to it.
Any information on which algorithms I can implement in my product
(without compromising copy rights)?
Thanks again.
|
Clarification of Answer by
maniac-ga
on
07 Dec 2002 15:43 PST
Hello Jason,
If you are interested in compressing data in backups for a number of
systems, there are a few ways you can do this.
#1 - compress the data
Most backup programs use a well established algorithm such as those
I referred to. For example
http://support.discusware.com/center/resources/admdoc40/bkupmgr01.php
under "Compression" describes how they use the zlib library which is
free and not subject to patent license issues.
For reference, most of the links to source code I provided are to
"free software". There may be some restrictions such as the GNU Public
License,
#2 - backup less information
Your request for clarification has an excellent example, but let's
take it a little farther. When scanning a number of systems, you will
find many files duplicated on each machine. Recognize them (as a whole
or as disk blocks) using a checksum (e.g., MD5), record that data once
and keep and mark the duplicates for each machine. This would catch
system files - stored on each machine, but will also handle duplicate
email messages, extra copies of files created by users, and so on.
That would replace a (possibly) large file of many K or M bytes in
size with a single entry (the checksum). The dictionary becomes
relatively easy to generate (a copy of each unique file on your
systems) - store as compressed data, and the data "backed up" is the
list of checksums (and location on each system).
Hmm. Now that I think of it, at my "regular job", each software
developer has a copy of the baseline source code (several thousand
files). They would only have changes for the parts they are working on
- the rest is pretty much a duplicate of what is on the baseline
server. This kind of method would greatly reduce the amount of
information to be backed up.
On the copyright question - there generally is not a problem with
copyrights. The problem is with either:
- license of the library
- patents on the method
There are plenty of methods that can be implemented without
restriction. The methods used by zlib are documented in RFC 1952 and
is not subject to patent. In a similar way PNG was created as an
alternative to GIF for images.
Since you are trying to produce a commercial product, avoid packages
licensed with the GPL. The lesser (or library) GPL may be suitable -
but you may be better off with implementing yourself or getting a good
legal opinion on the rights. Good luck.
--Maniac
|