 
Q: Compression Algorithms ( Answered 5 out of 5 stars,   3 Comments )
Question  
Subject: Compression Algorithms
Category: Computers > Algorithms
Asked by: jroff-ga
List Price: $20.00
Posted: 06 Dec 2002 11:26 PST
Expires: 05 Jan 2003 11:26 PST
Question ID: 120440
I would like to know some specific information about data compression
algorithms.
 
1. What are some data compression algorithms (names and maximum
compression ratios for each) that can be used and integrated into my
product to be sold? The product will be a developer's tool that allows
data compression, but I don't want to step on any toes, so this list
should include 'open' algorithms that aren't copyrighted.
 
2. Of this list, which ones are scalable? I am looking for a higher
compression ratio based upon the size of the lookup library (if one is
being used). I know very little about this now, but my understanding
is that there is usually a library created. Which one gets the most
benefit from using a larger library without dead-ending at a specific
size?
 
3. What is the technical reason that, when a file is zipped with two
different compression tools, there is no additional compression gain?
I can understand that if I ZIP a file, then ZIP it again, it is at
its maximum compression. However, if I use another tool, such as RAR
(which uses a different algorithm), it cannot compress it any further
(beyond another 0.5% or so).

Thanks for your help.

-Jason
Answer  
Subject: Re: Compression Algorithms
Answered By: maniac-ga on 06 Dec 2002 20:02 PST
Rated: 5 out of 5 stars
 
Hello Jason,

1 Data Compression Algorithms

The effectiveness of algorithms varies with the type of data being
compressed. A good online reference that describes a variety of
methods is
  http://www.ics.uci.edu/~dan/pubs/DataCompression.html
Titled "Data Compression", it is a little dated (the last algorithm it
covers was defined in 1986), but it is a good survey of methods and
their effectiveness.

Section 1 describes the differences between block and variable
data/codes, and dynamic versus static algorithms. The paper focuses on
lossless methods (the same data comes out as went in); there are also
"lossy" algorithms that provide better compression while sacrificing
some accuracy in reproducing the original data. Section 2 looks at
algorithms that depend on the type of data (e.g., run length encoding
for images).
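
As a rough illustration of run length encoding, here is a minimal
sketch in Python (illustrative only; the sample data below is made up):

  # Tiny run-length-encoding sketch: replace a run of identical bytes
  # with a (count, byte) pair. This only pays off for data with long
  # runs, such as simple bitmap images.
  def rle_encode(data: bytes) -> bytes:
      out = bytearray()
      i = 0
      while i < len(data):
          run = 1
          while i + run < len(data) and data[i + run] == data[i] and run < 255:
              run += 1
          out += bytes([run, data[i]])
          i += run
      return bytes(out)

  print(rle_encode(b"\x00" * 100 + b"\xff" * 50))   # 150 bytes -> 4 bytes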

Section 3 describes static methods like Huffman coding, where fixed
length characters are coded into variable length bit patterns based on
their probability of occurrence. It achieves its compression by
reducing the average bit size from 8 to a smaller value.
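
To make that concrete, here is a hedged sketch in Python that builds
Huffman code lengths from symbol frequencies and reports the average
bits per symbol (the sample text is an arbitrary placeholder):

  # Build a Huffman code bottom-up with a priority queue and report the
  # resulting code lengths. Illustrative only - a real coder would also
  # assign the actual bit patterns and serialize the tree.
  import heapq
  from collections import Counter

  def huffman_code_lengths(data: bytes) -> dict:
      """Return a mapping of symbol -> Huffman code length in bits."""
      freq = Counter(data)
      # Heap entries are (weight, tiebreaker, {symbol: depth so far}).
      heap = [(w, i, {sym: 0}) for i, (sym, w) in enumerate(freq.items())]
      heapq.heapify(heap)
      if len(heap) == 1:                      # degenerate one-symbol input
          return {sym: 1 for sym in heap[0][2]}
      tiebreak = len(heap)
      while len(heap) > 1:
          w1, _, left = heapq.heappop(heap)
          w2, _, right = heapq.heappop(heap)
          merged = {s: d + 1 for s, d in {**left, **right}.items()}
          tiebreak += 1
          heapq.heappush(heap, (w1 + w2, tiebreak, merged))
      return heap[0][2]

  sample = b"this is some sample text with plenty of repeated letters " * 20
  lengths = huffman_code_lengths(sample)
  freq = Counter(sample)
  bits = sum(freq[s] * lengths[s] for s in freq)
  print("average bits per symbol:", round(bits / len(sample), 2), "vs. 8")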

Section 4 describes some adaptive methods (including Adaptive Huffman,
used in the Unix utility compact). Compression in the range of 30-40%
is typical.

Section 5 describes other adaptive methods, including Lempel-Ziv, used
in the Unix utility compress. In Lempel-Ziv, fixed length code words
are used to encode (usually longer) strings of characters. Compression
in the range of 50-60% is typical.
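
As an illustration of the dictionary idea behind the Lempel-Ziv family,
here is a minimal LZW-style sketch in Python (illustrative only; a real
implementation would cap or reset the dictionary and pack the codes
into bits rather than keep them as a Python list):

  # LZW-style compression: dictionary codes of fixed width stand in for
  # progressively longer strings of bytes seen earlier in the input.
  def lzw_compress(data: bytes) -> list:
      dictionary = {bytes([i]): i for i in range(256)}   # all single bytes
      next_code = 256
      current = b""
      output = []
      for byte in data:
          candidate = current + bytes([byte])
          if candidate in dictionary:
              current = candidate                 # keep extending the match
          else:
              output.append(dictionary[current])  # emit code for longest match
              dictionary[candidate] = next_code   # learn the new string
              next_code += 1
              current = bytes([byte])
      if current:
          output.append(dictionary[current])
      return output

  sample = b"abababababababababababab" * 10
  codes = lzw_compress(sample)
  # each code might be stored in 16 bits, versus 8 bits per input byte
  print(len(sample), "bytes in,", len(codes), "codes out")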

Section 6 has some empirical results. Section 7 describes what happens
when errors occur. Section 8 describes "new directions". Again - this
appears to be a good overview and should be used to get background on
the general topic.

Another good starting point is at
  http://www.3dcompression.com/background.phtml
or
  http://www.cs.pdx.edu/~idr/compression/
which have a number of links to relevant papers, source code, and
other materials.

For multimedia methods check out
  http://www.cs.cf.ac.uk/Dave/Multimedia/node141.html
which has references to a number of current audio and video methods
(both lossless and lossy algorithms). Check the description of MPEG
effectiveness (a lossy algorithm), which describes compressed data in
the range of 1/4 to 1/12 the size of the original.

To get source code, try some of the following sites:
  http://datacompression.info/SourceCode.shtml
An extensive list of algorithms and source code references, with
ratings for many items.
  http://www.hn.is.uec.ac.jp/~arimura/compression_links.html
Another huge list of links - source code references are mixed in with
links to technical articles, standards groups, and so on.
  http://directory.google.com/Top/Computers/Algorithms/Compression/
or
  http://directory.google.com/Top/Computers/Software/Data_Compression/
Of course, the Google directory has a number of links in page rank
order. The second reference has a mix of free and commercial products
listed.

2 "Scaleable" Algorithms

To a certain point, lossy algorithms such as MPEG provide a range of
compression rates and quality of results. However, based on the way
you phrased #2, it looks like you are more interested in lossless
algorithms. Most of the modern lossless algorithms already have
heuristics that balance the size of the dictionary with the size of
the encoded data to get good compression with minimal overhead. You
will need to check the documentation that comes with the source code
you choose to get the real answer to this question.
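
As one concrete, hedged illustration of that trade-off (my own sketch,
not taken from any particular package), Python's zlib module exposes
the size of the DEFLATE sliding-window "dictionary" through the wbits
parameter of compressobj; the random test data below is an assumption:

  # zlib/DEFLATE keeps a sliding window of 2**wbits bytes (wbits may be
  # 9..15, i.e. 512 bytes up to 32 KB). A repeated block can only be
  # matched if the window still contains the previous copy, so larger
  # windows compress this particular sample much better.
  import os
  import zlib

  block = os.urandom(2048)        # incompressible on its own
  sample = block * 16             # but highly redundant across copies

  for wbits in range(9, 16):
      comp = zlib.compressobj(level=9, method=zlib.DEFLATED, wbits=wbits)
      out = comp.compress(sample) + comp.flush()
      print(f"window {2**wbits:>6} bytes: {len(sample)} -> {len(out)} bytes")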

3 Repeated Compression

The data compression section (1.3) at
  http://www.ics.uci.edu/~dan/pubs/DC-Sec1.html
describes low redundancy as a goal of good compression. This was
described by Shannon in the late 40's. Basically, a good compression
algorithm leaves little redundancy: the coded information has about
the same number of bits as the "essential content". As a result,
repeated application of a good compression algorithm will fail to
reduce the size of the data and will often increase it slightly due to
overhead.
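
A quick way to see this for yourself, assuming Python with its standard
zlib and bz2 modules (the sample data below is made up):

  # The first pass removes most of the redundancy, so a second pass -
  # with the same algorithm or a different one - has almost nothing
  # left to squeeze out.
  import bz2
  import zlib

  original = b"backup data with lots of repeated content " * 2000
  once = zlib.compress(original, 9)
  twice = zlib.compress(once, 9)   # same algorithm again
  other = bz2.compress(once, 9)    # a different algorithm (like RAR after ZIP)

  print("original:  ", len(original), "bytes")
  print("zlib once: ", len(once), "bytes")
  print("zlib twice:", len(twice), "bytes")  # about the same, often a bit larger
  print("then bz2:  ", len(other), "bytes")  # likewise, no meaningful gain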

If you want, use a clarification request to provide additional
information about the type of data you are trying to compress, and I
can provide some more focused references. I would be glad to go into
more detail on any part of the answer.

  --Maniac

Request for Answer Clarification by jroff-ga on 06 Dec 2002 20:38 PST
You rock... you gave me a lot to look into.

If you don't mind... 

I am interested in writing a backup utility (lossless) for any data.

I had a thought a while back to scan many machines and try to come up
with a representation of the most common patterns, creating a huge
dictionary. This would be effective for larger sequences - for
instance, the most common 200-byte groups found on all the systems I
can find. I don't care if my library is 10GB and I don't care if it
takes a long time to run the compression (or decompression).
Therefore, many passes would be acceptable.

If I could represent common 200-byte groups with 5 bytes, I would have
a potential (but not realistic) compression ratio as high as 1/40. For
the sections that weren't found, I could match them against 100-byte
groups on the second pass, then 50-byte groups on the third pass, then
25.

With this, if I started over again, this has nothing to do with
redundancy. The file changed and new groups can be found.

Ultimately, I thought I would have to implement a number of
algorithms. For instance, if I only searched for common byte patterns
in ZIP files, I could ZIP the files first and then work on making that
better with this library.

This was just something that came to me on the two hour commute on the
LIRR last year. Let me know if you think this has any value to it.

Any information on which algorithms I can implement in my product
(without infringing copyrights)?

Thanks again.

Clarification of Answer by maniac-ga on 07 Dec 2002 15:43 PST
Hello Jason,

If you are interested in compressing data in backups for a number of
systems, there are a few ways you can do this.

#1 - compress the data
  Most backup programs use a well-established algorithm such as those
I referred to. For example,
  http://support.discusware.com/center/resources/admdoc40/bkupmgr01.php
under "Compression" describes how they use the zlib library, which is
free and not subject to patent licensing issues.

For reference, most of the links to source code I provided are to
"free software". There may be some restrictions, such as the GNU
General Public License (GPL).
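
As a hedged sketch of the zlib approach (the file paths and chunk size
below are placeholders, not taken from any particular backup tool), a
backup utility might stream each file through a compression object
like this:

  # Compress src_path into dst_path with zlib, one chunk at a time, so
  # that arbitrarily large files never have to fit in memory.
  import zlib

  def compress_stream(src_path, dst_path, chunk_size=64 * 1024):
      comp = zlib.compressobj(level=6)
      with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
          while True:
              chunk = src.read(chunk_size)
              if not chunk:
                  break
              dst.write(comp.compress(chunk))
          dst.write(comp.flush())    # emit whatever zlib still has buffered

  # usage (hypothetical paths):
  # compress_stream("/var/backups/home.tar", "/var/backups/home.tar.z")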

#2 - backup less information

Your request for clarification has an excellent example, but let's
take it a little further. When scanning a number of systems, you will
find many files duplicated on each machine. Recognize them (as a whole
or as disk blocks) using a checksum (e.g., MD5), record that data
once, and mark the duplicates for each machine. This would catch
system files stored on each machine, but would also handle duplicate
email messages, extra copies of files created by users, and so on.
That would replace a (possibly) large file of many kilobytes or
megabytes with a single entry (the checksum). The dictionary becomes
relatively easy to generate (a copy of each unique file on your
systems, stored as compressed data), and the data "backed up" is the
list of checksums (and their locations on each system).
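
Here is a minimal sketch of that idea in Python, assuming whole-file
checksums (the root path is a placeholder, and a real tool would work
block by block and compress the stored contents):

  # Hash every file under root; keep each unique content once and
  # record only (path, checksum) pairs, so duplicates cost one entry.
  import hashlib
  import os

  def build_backup_index(root):
      store = {}    # checksum -> file contents (stored once)
      index = []    # (path, checksum) for every file seen
      for dirpath, _, filenames in os.walk(root):
          for name in filenames:
              path = os.path.join(dirpath, name)
              with open(path, "rb") as f:
                  data = f.read()
              digest = hashlib.md5(data).hexdigest()
              if digest not in store:
                  store[digest] = data
              index.append((path, digest))
      return store, index

  # usage (hypothetical path):
  # store, index = build_backup_index("/home")
  # print(len(index), "files,", len(store), "unique contents")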

Hmm. Now that I think of it, at my "regular job", each software
developer has a copy of the baseline source code (several thousand
files). They would only have changes for the parts they are working on
- the rest is pretty much a duplicate of what is on the baseline
server. This kind of method would greatly reduce the amount of
information to be backed up.

On the copyright question - there generally is not a problem with
copyrights. The problem is with either:
 - license of the library
 - patents on the method
There are plenty of methods that can be implemented without
restriction. The methods used by zlib are documented in RFCs 1950-1952
and are not subject to patent. In a similar way, PNG was created as an
alternative to GIF for images.

Since you are trying to produce a commercial product, avoid packages
licensed under the GPL. The Lesser (or Library) GPL may be suitable,
but you may be better off implementing the algorithm yourself or
getting a good legal opinion on the rights. Good luck.

  --Maniac
jroff-ga rated this answer: 5 out of 5 stars
Thank you for the extra effort in the reply. You gave me some great ideas.

Comments  
Subject: Re: Compression Algorithms
From: mathtalk-ga on 06 Dec 2002 17:40 PST
 
Hi, Jason:

I would be interested in answering the three broad questions you have
posted here, but the price you list does not seem consistent with the
effort it would require to do a good job on even one part of your
three requests.  You might want to review Google's pricing guidelines:
 
https://answers.google.com/answers/pricing.html 

Of course it's always possible that someone with an even greater
interest in this subject would be willing to provide an acceptable
answer at the price you offer.
 
regards, mathtalk-ga
Subject: Re: Compression Algorithms
From: jroff-ga on 06 Dec 2002 18:31 PST
 
Okay, okay... so I'm cheap. ;) You can't blame a guy for trying. My
wife is getting on my case because I keep posting questions out of
curiosity. I tell her that if the Discovery Channel had a software
development show, I would probably be set.

I raised the price to $20.00. That fairly represents my level of
interest in the answer! ;)

-Jason
Subject: Re: Compression Algorithms
From: mathtalk-ga on 07 Dec 2002 09:20 PST
 
Hi, Jason:

Here is an open source implementation of a lossless data compression algorithm:

http://www.gzip.org/

Check out in particular the FAQ there and its discussion of the patent
issues in the area of data compression.

regards, mathtalk-ga
