Note that the chances of false positives quoted by frde-ga apply only
if the received data are random. If someone is trying to spoof your
system, CRC-32 is easy to beat, because a CRC is linear and data can be
altered to yield any chosen checksum. Using several variants in
conjunction helps, but if the variations employed are known it will
still be simple to beat. Cryptographic hash functions such as SHA-1 and
MD5 would be better in this respect. However, frde-ga is incorrect when
s/he says that
reconstruction of the data would be possible. If you are only storing
the length of the file plus, say, 1000 32-bit CRCs, you have only about
32000 bits of information. On average, a given set of CRC values will be
matched by one possible 32000-bit source file, 2 possible 32001-bit
source files, and so on, doubling with each extra bit. By the time you
get up to, say, a 5 kilobyte (40960-bit) source file, there are 2^8960,
or approximately 1.7 * 10^2697, possible source files.
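
To make the spoofing point concrete, here is a minimal Python sketch (an illustration, not something taken from frde-ga's answer). It checks that `zlib.crc32` is affine over XOR for equal-length inputs, which is the kind of structure that lets an attacker predict and steer checksums. It then runs the same check on SHA-256, chosen here as an example hash because MD5 and SHA-1 now have known collision attacks. The three sample messages and the helper names are arbitrary.

```python
import hashlib
import zlib


def xor_bytes(*chunks: bytes) -> bytes:
    """XOR equal-length byte strings together, position by position."""
    length = len(chunks[0])
    if any(len(c) != length for c in chunks):
        raise ValueError("all inputs must have the same length")
    out = bytearray(length)
    for chunk in chunks:
        for i, byte in enumerate(chunk):
            out[i] ^= byte
    return bytes(out)


def crc_is_affine(a: bytes, b: bytes, c: bytes) -> bool:
    """True if crc32(a ^ b ^ c) == crc32(a) ^ crc32(b) ^ crc32(c)."""
    combined = zlib.crc32(xor_bytes(a, b, c))
    return combined == zlib.crc32(a) ^ zlib.crc32(b) ^ zlib.crc32(c)


def sha256_is_affine(a: bytes, b: bytes, c: bytes) -> bool:
    """Same test for SHA-256 digests, XORed byte by byte."""
    def digest(data: bytes) -> bytes:
        return hashlib.sha256(data).digest()

    combined = digest(xor_bytes(a, b, c))
    return combined == xor_bytes(digest(a), digest(b), digest(c))


# Arbitrary equal-length sample messages (illustrative only).
A = b"pay alice 100 usd"
B = b"pay bob   999 usd"
C = b"hello, checksums!"

print("CRC-32 affine over XOR: ", crc_is_affine(A, B, C))
print("SHA-256 affine over XOR:", sha256_is_affine(A, B, C))
```

The CRC-32 relation should hold for any three inputs of the same length, while the SHA-256 digests should show no such relation.
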
To *guarantee* uniqueness, your description must (on average) be at
least as long as the file. If you want the process not to be easily
reversible, your description will probably be significantly longer
than the file.
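
The counting argument can be checked directly. The sketch below is illustrative only, and its function names are made up for this example. It reproduces the 2^8960 figure with exact integer arithmetic, then brute-forces a scaled-down case: all 65536 two-byte files, each described by only the low 8 bits of its CRC-32, so that many files must share each description.

```python
import zlib
from collections import Counter


def average_candidates(file_bits: int, description_bits: int) -> int:
    """Average number of files of a given length per description value.

    There are 2**file_bits files and at most 2**description_bits
    descriptions, so the average is 2**(file_bits - description_bits).
    """
    if file_bits < description_bits:
        raise ValueError("file is shorter than its description")
    return 2 ** (file_bits - description_bits)


def scientific(n: int) -> tuple[str, int]:
    """Return (leading digits truncated to 3 figures, decimal exponent)."""
    digits = str(n)
    return f"{digits[0]}.{digits[1:3]}", len(digits) - 1


def tiny_description(data: bytes) -> int:
    """A deliberately short 8-bit description: low byte of the CRC-32."""
    return zlib.crc32(data) & 0xFF


# 5 kilobyte file (40960 bits) described by 1000 CRC-32 values (32000 bits).
mantissa, exponent = scientific(average_candidates(40960, 32000))
print(f"about {mantissa} * 10^{exponent} candidate files")

# Scaled-down brute force: every 2-byte file, 8-bit description.
buckets = Counter(tiny_description(bytes([hi, lo]))
                  for hi in range(256) for lo in range(256))
print("distinct descriptions:", len(buckets))
print("largest group sharing one description:", max(buckets.values()))

# With a description longer than the file, the full 32-bit CRC-32 separates
# all 2-byte files (checked here, not assumed).
full = {zlib.crc32(bytes([hi, lo])) for hi in range(256) for lo in range(256)}
print("full CRC-32 unique on 2-byte files:", len(full) == 65536)
```

The 8-bit case cannot avoid collisions, since 65536 files are squeezed into at most 256 descriptions. The 32-bit case has room to keep every 16-bit file distinct.
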
In principle, if a unique description is saved, it is (at least
potentially) possible to reverse the process and derive the original
file. Every one-to-one mapping has an inverse. However, reversing it
may be computationally difficult (as long as no new techniques are
invented), much as with the most common encryption schemes today.
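
Here is a toy illustration of a mapping that is one-to-one yet has no shortcut for inversion in this sketch. It is meant for intuition only: the modular exponentiation, the tiny prime 101, and the base 2 are illustrative assumptions, not anything from the original question, and at this size brute force is trivial. The forward direction is a single cheap call, while the inverse written here falls back on trying every input.

```python
P = 101   # small prime, chosen only for illustration
G = 2     # base; the code below checks that it makes the map one-to-one


def describe(x: int) -> int:
    """Forward direction: one fast modular exponentiation."""
    return pow(G, x, P)


def recover(description: int) -> int:
    """Inverse direction by exhaustive search over every possible input."""
    for candidate in range(1, P):
        if describe(candidate) == description:
            return candidate
    raise ValueError("no input maps to this description")


inputs = range(1, P)
images = [describe(x) for x in inputs]

# One-to-one: no two inputs share a description.
print("one-to-one:", len(set(images)) == len(images))

# So an inverse exists, and every input can be recovered from its description.
print("all recovered:", all(recover(describe(x)) == x for x in inputs))
```

With realistically large parameters, an exhaustive search like `recover` becomes impractical. That is the sense in which a mapping can have an inverse and still be hard to reverse.
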
The goal of comparing files based on the description and the goal of
having a description that cannot be transformed into the original file
are diametrically opposed. The first wants a description that tells
you as much as possible about the original file; the second wants one
that says as little as possible about the original. The only way to
reconcile this is to assume that the party doing the comparison has
access to information that the party attempting the reversal does not.
For instance, you could stipulate that the Unique-O-Matic could know
how to turn a description back into the original (or something
similar), but that users would not. This is known as "security through
obscurity" and is generally regarded as a bad idea, because once the
system is cracked (and sooner or later they all are) the security is
no more.