Google Answers: use generated information to indirectly compare the similarity of digital files

View Question

Q: use generated information to indirectly compare the similarity of digital files ( No Answer, 2 Comments )

Question

Subject: use generated information to indirectly compare the similarity of digital files
Category: Science > Math
Asked by: johnclubvec-ga
List Price: $5.00

Posted: 08 Jun 2005 17:17 PDT
Expires: 08 Jul 2005 17:17 PDT
Question ID: 531131

How can one use generated information to indirectly compare the
similarity of digital files?

I'll just lay out my question in the homely terms I'm thinking about it.

IMAGINARY SCENARIO:
There exists on the Internet a special machine, the "Unique-o-Matic."
Its purpose is to tell you if it has ever previously seen the digital
file (of an arbitrarily long length) that you've got in your hands. It
also supplies you with a list of the top ten digital files it's ever
seen that are most like your file, and, in quantitative terms, how
"close" the similarity is.

The rules are:

a. You let the Unique-o-Matic read your entire file, from beginning to end.
b. The Unique-o-Matic does not save a copy of anything it reads.
c. However, it does generate -- something; some information, from
reading your file, which it does permanently save.
d. The Unique-o-Matic does not know whether your file is a digital
representation of a fingerprint, DNA, the Mona Lisa, or anything else.
So, for example, it generally can't use domain-specific knowledge to
generate its information.
e. There are sound mathematical reasons to be confident that there is
no way to re-assemble or reproduce your file from the information the
Unique-o-Matic generated and saved.
f. The Unique-o-Matic supplies you with the list of the top ten
most-similar files, not by comparing the files themselves (none of
which it saved), but purely by comparing the information it generated
by reading each of the files.

So, in mathematical/statistical (not engineering) terms, how would you
build a Unique-o-Matic?

Clarification of Question by johnclubvec-ga on 10 Jun 2005 05:59 PDT

Naively, I have been assuming that this problem is a type of sampling
problem. That is, using some algorithm, you pull n white or black
beads off of a long string of black and white beads and save them.
What is the probability that the bead sample you pulled off just now
comes from the same string as another bead sample you pulled off at
some other time? More broadly, I am assuming that a practical approach
would be inherently statistical rather than quantitatively definitive.

Answer

There is no answer at this time.

Comments

Subject: Re: use generated information to indirectly compare the similarity of digital files
From: frde-ga on 09 Jun 2005 01:51 PDT

I would use a CRC32  (Cyclic Redundancy Check - 32 bit)

Actually I would use several CRCs calculated on different readings of
the data, say reading forwards, reading backwards, reversing each
adjacent byte.

Also as a quick check, (obviously) store the length of the data.

With one CRC32 the chances of a 'false positive' is 1 in 4 billion
- with more CRCs the chances decline

Obviously there is still a chance of 'false positives' turning up, but
storing (say) 64 bytes of information on each file will make the
chance incredibly low.

This would not fulfill your requirement f) as the 'keys' would be
radically different for two files that are identical /except/ for one
bit.

Also, it would, theoretically be possible to reconstruct the original
data, but only by repeatedly creating new files and checking if they
have the same 'key'
- which is a bit like getting monkeys to type up the bible.

Subject: Re: use generated information to indirectly compare the similarity of digital files
From: manuka-ga on 10 Jun 2005 00:59 PDT

Note that the chances of false positives quoted by frde-ga apply only
if the received data are random. If someone is trying to spoof your
system, CRC-32 is easy to beat. Using several variants in conjunction
helps, but if the variations employed are known it will still be
simple to beat. Hash functions such as SHA-1 and MD5 would be better
in this respect. However, frde-ga is incorrect when s/he says that
reconstruction of the data would be possible. If you are only storing
the length of the file plus, say, 1000 32-bit CRCs or whatever, you
only have 32000 bits of information. On average there will be one
possible 32000-bit source file, 2 possible 32001-bit source files,
etc. By the time you get up to, say, a 5 kilobyte (40960 bits) source
file, there are 2^8960 or approximately 1.7 * 10^2697 possible source
files.

To *guarantee* uniqueness, your description must (on average) be at
least as long as the file. If you want the process not to be easily
reversible, probably your description will be significantly longer
than the file.

In principle, if a unique description is saved, it is (at least
potentially) possible to reverse the process and derive the original
file. Every one-to-one mapping has an inverse. However, it may be
difficult computationally to reverse it (as long as no new techniques
are invented), much like the most common encryption schemes today.

The goal of comparing files based on the description and the goal of
having a description that cannot be transformed into the original file
are diametrically opposed. The first wants a description that tells
you as much as possible about the original file, the second wants one
that says as little as possible about the original. The only way to
reconcile this is to assume that the party doing the comparison has
access to information that the party attempting the reversal does not.
For instance, you could stipulate that the Unique-O-Matic could know
how to turn a description back into the original (or something
similar), but that users would not. This is known as "security through
obscurity" and is generally regarded as a bad idea, because once the
system is cracked (and sooner or later they all are) the security is
no more.

Important Disclaimer: Answers and comments provided on Google Answers are general information, and are not intended to substitute for informed professional medical, psychiatric, psychological, tax, legal, investment, accounting, or other professional advice. Google does not endorse, and expressly disclaims liability for any product, manufacturer, distributor, service or service provider mentioned or any opinion expressed in answers or comments. Please read carefully the Google Answers Terms of Service.

If you feel that you have found inappropriate content, please let us know by emailing us at answers-support@google.com with the question ID listed above. Thank you.

Search Google Answers for

Google Home - Answers FAQ - Terms of Service - Privacy Policy