Basically (hehe), I want to generate a 9 (base 62) character CRC for a file. In
other words, assuming you represent each base 62 digit by 0-9, a-b,
and A-B, I want the end CRC to look something like "e9MPq3VoR".
Another way to look at it is as a decimal value from 0 to
13537086546263551, inclusive.
I found some good code for generating a CRC for any number of bits
from 16 to 64 at:
http://apollo.backplane.com/matt/crc64.html
But, of course, base 62 and base 2 don't line up nicely, so I can't
use it as-is.
I had the idea of making a full 64-bit CRC, converting it to base 62, and
lopping off the most significant characters, but I have NO IDEA if
this preserves the beneficial properties of the CRC algorithm. I'd
like someone who understands the math behind CRC calculation to let me
know. :)
FYI, I've already considered using a base 32 or base 64 alphabet
instead, as well as just making a 53-bit CRC and throwing away some range.
A qualifying answer to this question is either:
1) Code or pseudo-code for generating said base 62 CRC, or how to modify
above code to do so
or
2) An explanation why this is impossible to do, since there is
something intrinsically base 2 about CRC generation, and why you can't
lop bits (or parts of bits ;) off the final CRC without destroying its properties.
If your eyes don't glaze over at math, maybe you'll find this page useful:
http://www2.rad.com/networks/1994/err_con/crc_how.htm
I'm kind of fearing that the talk about binary polynomials is going to
make the answer to this question fall into category (2) above, but my
eyes are already starting to glaze over... |
Clarification of Question by
alexander-ga
on
15 Jan 2003 01:36 PST
Hm, I just took a closer look at that second page, and have a feeling
that this might be a surprisingly stupid question...:)
|
Request for Question Clarification by
mathtalk-ga
on
15 Jan 2003 07:49 PST
While the CRC algorithm described on the second page "works above the
binary field" (uses polynomial arithmetic over Z/2Z), there is no
reason one could not apply the same approach to polynomials over any
finite field.
Now there are no finite fields that contain exactly 62 elements, so we
cannot get a base 62 string that way. But one could do the arithmetic
over both Z/2Z and Z/31Z and combine the results to get a base 62
result.
The larger picture is that a CRC calculation is a hash function which
is "sensitive" to all parts of a file, so changing a single character
anywhere in the file is apt to produce a different CRC "checksum".
Functions with this "desirable property" can be computed in a variety
of ways. If you want to follow the CRC model closely, I'd recommend
the Z/2Z + Z/31Z approach.
I'll post a more detailed explanation as an answer, if you wish it.
regards, mathtalk-ga
|
Clarification of Question by
alexander-ga
on
15 Jan 2003 15:37 PST
Hi mathtalk. :)
A detailed explanation of how to do a Z/31Z calculation and then
combine it with the Z/2Z one would be acceptable.
I'd also appreciate pointers to other algorithms that are "sensitive
to all parts of a file", but you don't need to go into depth on them.
I don't need a CRC per se, but whatever algorithm I use does need to
have similar collision resistance.
As far as taking a 54-bit CRC and reducing it to 62^9, that would be
fine with me, but I'm not sure if it preserves said collision
resistance as the Z/31Z Z/2Z approach above presumably would. Do you
know?
|
Hi, alexander:
Thanks for posting this interesting question. To answer your last
question first: Yes, you can safely reduce the 54-bit CRC to a base
62, 9 place number without losing the property that the result is
"sensitive" to all parts of the file. I'll return to this idea again
at the end of my answer.
Perhaps we should back up and discuss the usual purpose of a CRC
value, which may differ in some respects from your own purpose. A CRC
is very useful in detecting accidental changes to a file or message.
CRC32, for example, is used in the algorithm for Ethernet packets as
well as in PKZip to detect transmission errors. For a list of some
CRC type algorithms and their parameters, together with implementation
details:
[A Painless Guide To CRC Error Detection Algorithms]
http://www.repairfaq.org/filipg/LINK/F_crc_v35.html
In this respect one relies on having a well-known standard function
that calculates the values, so that the comparison can be made "at the
far end" independently. By the same token, however, it makes a
vulnerability for intentional tampering with a file or message.
Although the CRC32 value is extremely unlikely to be preserved by
accidental changes, widespread knowledge of the algorithm makes it
feasible for a malicious change to be made and the CRC value altered
in an intercepted message to make it appear that the transmission is
correct.
It is hard for me to guess what utility the novel form of CRC you
propose will have, so let me first present the algorithms as requested
and make a few editorial comments at the end.
The "field" Z/2Z is a mathematical ring containing two elements, 0 and
1, with the arithmetic operations:
0 + 0 = 0; 0 + 1 = 1
1 + 0 = 0; 1 + 1 = 0
0 * 0 = 0; 0 * 1 = 0
1 * 0 = 0; 1 * 1 = 1
Because + here is XOR and * is AND, these arithmetic operations are
easily implemented as bitwise "logic" in many computer languages.
Interpretation of a file or message as a stream of bits (and hence as
a polynomial over Z/2Z) is almost a given on binary computers, apart
from big-endian vs. little-endian (byte ordering) worries.
The case of a field Z/pZ, p any prime (or even more generally, of a
finite field containing p^n elements) is cumbersome for two reasons.
One is that implementing the arithmetic operations will have more
"overhead". Another is that the interpretation of a file or message
as coefficients over Z/pZ for prime p > 2 requires some intellectual
effort. Given a resolution of those design issues, then here is a
hash function algorithm which "reduces" to the CRC case when p = 2:
(1) Let P(x) be a monic (leading coefficient 1) polynomial over Z/pZ.
(2) Divide the message polynomial M(x) by P(x), obtaining a remainder
R(x) whose degree is less than that of P(x).
(3) Return the coefficients of R(x) as the result (hash function).
Let us now consider each of these steps in more detail for case p =
31:
(1) Since you wish to get nine "digits" (radix 62), we would choose
P(x) to be a polynomial of degree 9. If one considers the the
"cyclic" nature of this "cyclic redundancy" check or code (CRC), it is
desirable for the cycle length to be as long as possible. Consulting
Knuth's Art of Computer Programming vol. II (Semi-numerical
Algorithms), we learn that the length of the cycles are maximized by
making P(x) "irreducible", i.e. unfactorable over Z/pZ. I would have
to do a bit of programming to determine one, but would gladly clarify
my answer in this way if you are interested.
For now I will propose:
P(x) = x^9 + x^2 + 1
as I have checked that this at least has no linear factors over Z/31Z.
The small coefficients will also contribute to a faster
implementation, at least when the length of "message" is fairly long.
(2) The naive procedure for dividing polynomial M(x) by P(x) is
essentially long division. Because the leading coefficient of P(x) is
chosen to be 1, the "trial divisors" are simply the leading
coefficients of the partial remainders. I can point you to a nice C++
library in which this sort of polynomial arithmetic is already
implemented:
[Victor Shoup's Numerical Template Library (NTL)]
http://www.shoup.net/ntl/
A tailored implementation for particular p would naturally be more
compact and/or potentially faster. While lookup tables are often
recommended for a standard binary implementation, that looks to me
unwieldy for base 31 (or base 62, see below). However an iterative
(looping) approach may be entirely suitable for your intended
application.
The procedure begins with the interpretation of the message (or file)
as a polynomial with coefficients in Z/31Z. The simplest approach, I
think, is to break each "byte" of the message into two "nibbles", ie.
4 bits in a group. That gives us coefficients 0 to 15, which are then
treated as modulo 31 values. Also, it is a good idea to "prime" the
message stream with an initial nonzero "frame" (because otherwise the
message might begin with an arbitrary size block of zeroes without
effecting the hash function value; unlikely in some contexts but not
in others). For our purposes I think it sufficient to consider the
polynomial M(x) to begin with a leading 1, and coefficients thereafter
to be drawn from the "nibbles" of the actual message or file.
So we can think of the first polynomial division as being done like
this:
partial M(x) = x^9 + nibble(1)*x^8 + nibble(2)*x^7 + . . . +
nibble(7)*x^2 + nibble(8)*x + nibble(9)
Since the leading coefficient is 1, we subtract a copy of P(x) to get
the partial remainder:
partial M(x) - P(x) = nibble(1)*x^8 + nibble(2)*x^7 + . . . +
[nibble(7)-1]*x^2 + nibble(8)*x + [nibble(9)-1]
where of course each coefficient is computed mod 31.
To continue we "multiply" this partial remainder by x (shift
coefficients one place to the left) and tack on the next nibble from
the input message stream. We then subtract the multiple of P(x)
indicated by the leading coefficient, in this case nibble(1) as above.
The process continues until the input stream of nibbles is exhausted,
and the partial remainder at that point is our final remainder R(x).
We insert at this point the observation that this polynomial division
procedure can be carried out over Z/62Z, using the same stream of
nibbles interpreted as values 0 to 15 within the mod 62 range. This
would probably be faster than doing the computations of Z/2Z and Z/31Z
separately and combining them at the end, though the results would not
be equal. We would want to check the irreducibility of P(x) over both
fields Z/2Z and Z/31Z in this approach.
(3) The reporting of remainder R(x) as a nine "digit" result using
0-9,A-Z,a-z is essentially a matter of deciding in what order these
characters are to be taken. That is, suppose we have separately taken
nine coefficients from Z/2Z:
B1 B2 . . . B9
and nine coefficients from Z/31Z, i.e. 0 through 30:
C1 C2 . . . C9
They could be combined simply by taking each corresponding "digit" D =
B + 2C to get values in the range 0 through 61. The mapping from
there to specific characters would perhaps be most expeditiously done
with a lookup table, i.e. a character array of length 62.
Summary
=======
It is possible to compute a hash function using the CRC approach but
taking values in base other than 2, though the implementation will be
disadvantaged by a need to implement arithmetic in another base and to
come up with an interpretation of messages as streams of values from
that base. Here base 2 enjoys a "beaten path".
Having sketched out the steps involved a CRC-like algorithm using a
base other than 2, it seems that such an option would be most
attractive if one wished to have a "privately known" hash function,
perhaps for the purpose of ensuring a license agreement or other
"tamper-proof" application. A standard CRC algorithm would suffer in
such a comparison, because of the well-known details of the CRC
calculations.
On the other hand the standard ensures that a check can be easily done
at a "destination" of a message with little prior arrangement.
If it were desirable to obfuscate the precise details of a hash
functions implementation (while providing a built-in implementation of
some kind at the destination for checking purposes), then it might
arguably be a good idea to use a standard CRC value with a few more
bits than 54, say a 60 bit check, and reduce that (through repeated
division by 62 with remainders) to a nine character "base 62" string.
The effect would be that by wrapping the range around so many
additional times, it would be extremely difficult to detect variations
in the distribution of hash values that might suggest their origin.
Thanks again for posting this very interesting question, and please
let me know if something I have said (or failed to say!) requires
additional clarification.
regards, mathtalk-ga
Search Strategy
Keywords: CRC algorithm polynomial
://www.google.com/search?hl=en&ie=UTF-8&oe=UTF-8&q=CRC+algorithm+polynomial&btnG=Google+Search
Keywords: Shoup NTL
://www.google.com/search?hl=en&ie=UTF-8&oe=UTF-8&q=Shoup+NTL&btnG=Google+Search |
Request for Answer Clarification by
alexander-ga
on
16 Jan 2003 17:52 PST
Wow. Keep in mind that that's already a more-than-satisfactory answer,
and your tip is already racking up like a taxi meter...
First, some background, in case you're interested:
This will be used to create a more-or-less "unique" identifier for a
database of millions of files. Collisions are obviously unwanted, but
having one or two will actually not be the end of the world. More than
a handful, however, starts to become a problem, but the twist is that
this won't even be easily detectable below an extreme level once said
database is in operation. The identifier will be represented in ASCII,
and must be as short as possible, as well as not containing any
non-alphanumeric characters.
Malicious attacks or needing to rely on a standard algorithm are not a
concern.
Because of the speed concerns you mentioned (it does need to be
reasonably fast), I think it might be better to go with the reduction
of a larger CRC. As far as performing that, is this the gist of what
you meant? :
long long crc= Get60BitCRC();
for(x= 0; x < 9; x++) {
id[x]= crc%62;
crc/= 62;
}
What I need is some sort of evidence that shows that this will be at
least as collision resistant as a 53-bit CRC (and preferably a little
bit more), as well as reasonably sensitive to all parts of the file.
Can you help there?
|
Clarification of Answer by
mathtalk-ga
on
17 Jan 2003 13:58 PST
Hi, alexander:
Thanks for the additional information and the kind words of
encouragement.
Your code snippet, using long long integer to represent an unsigned
value of up to 64 bits, is just the "reduction to base 62" string
computation I had in mind.
You also ask for "some sort of evidence" that this approach is "at
least as collision resistant as a 53-bit CRC (and preferably a little
bit more), as well as reasonably sensitive to all parts of the file."
Such evidence might, I suppose, be either of a empirical or analytical
kind. Without having access to even a sample distribution of the
actual files you wish to process, I'll present what can be said
analytically.
Background
==========
In your scheme you wish to design for one or two (unmonitored)
collisions over some millions of function. As collisions in "hash
value space" are a normal and expected part of many applications, even
where one needs to maintain distinct "slots" for objects with equal
hashed values, I suspect that the best advice would depend on an even
clearer understanding of your application. In particular detection of
hash value collisions is ordinarily quite easy in storage and
retrieval applications.
The natural approach for such a design would be to _assume_ that the
outputs of the function calls are uniformly randomly distributed
across the output range. A simple probability model can then be
applied to compute the chance of no collisions, one collision, two,
etc.
However this begs the question of whether the outputs are really
uniformly distributed. A brief discussion of CRC's as random number
generators may be found here:
[Ritter, T. 1991. The Efficient Generation of Cryptographic Confusion
Sequences. Cryptologia. 15(2): 81-139]
(see esp. Section 5.3)
http://www.ciphersbyritter.com/ARTS/CRNG2ART.HTM
The CRC algorithm is essentially a type of random number generator
(RNG) known as a "linear feedback shift register" (LFSR), or sometimes
by the initials TLP of three researchers (Tausworthe and subsequently
Lewis and Payne) who studied their efficacy, except that with the CRC
"external data are fed into the mechanism." That is, the bit contents
of the message or file are sequentially "added" to the CRC
computational results. For an equivalent RNG we would simply have a
short message (seed) to which thousands of zero bits are appended.
As the word "linear" indicates, the effects of changing a single bit
anywhere in the file can be analyzed exactly. If the Nth bit from the
end of the file is changed, the remainder of the polynomial division
changes by this amount:
x^N mod P(x)
Since P(x) is chosen to be irreducible, this change remainder is never
zero. However it would be possible to change several bits with "zero
effect" on the remainder. In order for this to be the case, the bit
pattern would have to be one whose corresponding polynomial is
divisible by P(x). The width of such a bit pattern would have to be
at least degree of P(x) plus 1, and in general it seems reasonable to
estimate the chances of such a pattern occurring "by accident" at 1 in
2^k where k is the degree of P(x).
To summarize, the sort of calculation that underlies the CRC
calculation has been studied in connection with random number
generation since at least:
R. C. Tausworthe, Math. Comp. 19 (1965), 201-209
A battery of tests, such as Knuth outlines in Sec. 3.3 of Art of
Computer Programming, can be applied to particular choices of P(x) to
confirm their "randomness". It should be noted, however, that the
design feature needed here is perhaps narrower than what is sought in
general RNG design for statistical purposes, but applied to a more
complex (and potentially biased) set of inputs.
The sole feature that I think relevant is the uniformity of the
output. E.g. the runs-up and runs-down or other measures of
"correlations" in outputs are not significant, so long as there is no
"clustering" of outputs to increase the likelihood of collisions.
However without some specific information about the inputs, I don't
think we can carry the discussion of uniformity much further forward
with analytical arguments. We could, of course, _assume_ random
inputs, but this would be begging an even larger question than the
assumption of random outputs we wish to avoid.
So next let's turn to calculating the likelihood of collisions,
identifying as an assumption that our CRC outputs are assumed
uniformly distributed.
Collisions with uniform outputs
===============================
You may be familiar with the non-intuitive "birthday problem", in
which one asks for variously sized groups of people the likelihood of
two sharing a birthday. The number of people needed to provide a
likely match is surprisingly small. The probability exceeds 50% with
as few as 23 people, and closely approaches 100% with 50 or more:
[Birthday Problem -- from MathWorld]
http://mathworld.wolfram.com/BirthdayProblem.html
Thus the requirement of no or very few collisions, achieved without
being able to monitor the outcomes, is a surprisingly strict one.
Let d = 2^k where k is the number of bits in a CRC result (or degree
of P(x) as discussed above). Then the exact chance of no collisions
in n CRC invocations or "trials" is:
F(n) = Pr(n collisionless trials) = d!/[(d-n)!d^n]
as shown in the link above.
You said that the hash values will be computed for "a database of
millions of files", so I've computed a few values of this probability
for varying n and d:
d n F(n)
===== ===== ==========
2^53 10^6 0.99994449
2^53 10^7 0.99446426
2^53 10^8 0.57400825
62^9 10^6 0.99996307
62^9 10^7 0.99631326
62^9 10^8 0.69117952
As you can see, the probability of collisions is not too much
different with between a million and ten million trials, whether we
have a "short" 2^53 range or a slightly larger 69^9 range. One starts
to see a noticeable difference as the number of trials (files)
increases towards one hundred million. I'm not sure where your
architecture fits in this range, but a more precise estimate based on
actual number of files might be warranted.
I'm going to have to take a break and put pencil to paper to calculate
the effects of reducing a larger CRC value to the smaller 62^9 range.
The drawback of reducing a larger CRC value back to a base 62 string
of length 9 is that the results are _not_ uniformly distributed
(because no power of two can be an exact multiple of 62^9). The
larger the number of bits used in the CRC computation, the smaller the
deviation from uniformity, so I expect that in the limit, the
probability of no collisions approaches that of the "ideal" uniform
distribution over a range of 62^9.
To make an accurate comparison, however, requires either a stupendous
amount of extended precision arithmetic or evaluation of some tricky
integrals. I'll let you know when I get the details sorted out.
regards, mathtalk
|
Clarification of Answer by
mathtalk-ga
on
18 Jan 2003 06:34 PST
Hi, alexander:
Thanks for the kind words (and awesome tip)! Let me at least unburden
myself of the "torturous" idea for handling the wrap around effects of
reducing from a CRC with more than 53 bits to a base 62 string of
length 9.
If we use a 54 bit CRC value, then the range is 0 to 2^54 - 1.
Repeated reductions mod 62 produce successive "digits" base 62 and
give a value in range 0 to 62^9 - 1. However 2^54 is about one-third
more than 62^9, so that we "double up" the distribution of values
within the range 0 to 2^54 - 62^9 - 1.
We can consider "collisions" in [0,62^9) as occurring either in this
doubled-up lower range [0,2^54 - 62^9) or in the "flat" upper range
[2^54 - 62^9, 62^9). Because the lower range is about a third of the
length but gets "covered" twice, and the upper range is about
two-thirds and covered once, the chances are about equal of a value
falling in either interval. More precisely, one has a binomial
distribution between falling in the lower range vs. upper range with
probabilities of roughly 49.7% vs. 50.3% respectively.
Suppose we did an "experiment" with n = one million trials, where p ~
49.7% is the probability of being in the lower range and 1-p the
complementary probability of being in the upper range. Then the
chance of no collisions can be expressed as:
SUM F(j,d) F(n-j,62^9 - d) C(n,j) p^j (1-p)^(n-j)
FOR j = 0 to n
where:
d = 2^54 - 62^9
is the length of lower range,
F(j,d) = d!/[(d-j)!(d^j)]
is the chance of j collisionless trials in that range, and similarly:
F(n-j,62^9 - d)
is the chance of n-j collisionless trials in the upper range.
C(n,j) = n!/[(n-j)!j!]
is combinations of n things taken j at a time, and:
C(n,j) p^j (1-p)^(n-j)
is the binomial probability of having j trials wind up in the lower
range and n-j in the upper range.
Attempts to directly compute this expression proved futile, because
individual terms are tiny. It dawned on me that this ought to be
computed by an integral rather than a sum, and indeed the binomial
probability C(n,j) p^j (1-p)^(n-j) can be nicely approximated by
integrating from j-0.5 to j+0.5 a normal distribution which shares the
same mean and variance as the binomial distribution:
mean = np variance = np(1-p)
The details of reducing a 60 bit CRC value to the range 62^9 are
similiar except that the deviation from uniformity is less noticeable.
The 2^60 range wraps around 85 times and leaves an "extra" layer in
the distribution over about the first sixth of the reduced range.
If I find time, I plan to compute these results to satisfy my
curiousity about whether reducing the 54 bit CRC does increase the
chance of collisions (compared to using the 53 bit CRC) and my
intuition that reducing the 60 bit CRC provides a slight improvement
over the 53 bit CRC (approaching the ideal efficiency for a uniform
62^9 range).
regards, mathtalk-ga
|
Request for Answer Clarification by
alexander-ga
on
18 Jan 2003 08:05 PST
Interesting analysis. So by definition, since the CRC algorithm
provides a uniform distribution, reducing the resulting value by an
amount that causes the "wraparound" to wrap exactly would also yield a
uniform distribution. So you could take any integer x-bit CRC and
reduce it to any smaller integer x-bit CRC without any loss in
collision resistance. And if the wrap isn't exact, but is such that
the "incomplete" part is a very small part of the total range, the
resistance *is* decreased, but by a very small amount indeed.
If so, wouldn't it be even better to take a 64-bit CRC for reduction
to 62^9? If I'm doing the numbers right, its "incomplete" part is just
under 0.05% of the total range, where a 60-bit one would be just under
0.2%. Minute difference, sure, but a 64-bit CRC isn't any more
expensive to calculate.
And since I appear to really be using it as a hash function instead of
an error detection algorithm, any small reduction in sensitivity is
likely irrelevant.
|
Clarification of Answer by
mathtalk-ga
on
18 Jan 2003 16:16 PST
Yes, your analysis is right. The 64-bit CRC would be the best choice.
There is code for that which is highly optimized and may very well be
faster than a decent 60-bit implementation.
It may be a little tricky to compare how much closer the 64-bit
reduction is to a uniform distribution over 62^9 than a 60-bit
reduction would be, but I think your figures err if anything on the
conservative side. I make out the 60-bit reduction's deviation from
uniformity to be 0.16%, consistent with your figure of 0.2%, and the
64-bit reduction's deviation to be 0.016% (an order of magnitude less)
using "L1" norms of the distributions' discrepancies.
regards, mathtalk
|