Q: Non-base-2 CRCs? An impossibility? (Answered, 5 out of 5 stars, 1 Comment)
Question  
Subject: Non-base-2 CRCs? An impossibility?
Category: Computers > Algorithms
Asked by: alexander-ga
List Price: $30.00
Posted: 15 Jan 2003 00:59 PST
Expires: 14 Feb 2003 00:59 PST
Question ID: 142910
Basically (hehe), I want to generate a 9-character (base 62) CRC for a file. In
other words, assuming you represent each base 62 digit by 0-9, a-z,
and A-Z, I want the end CRC to look something like "e9MPq3VoR".
Another way to look at it is as a decimal value from 0 to
13537086546263551, inclusive.

I found some good code for generating a CRC for any number of bits
from 16 to 64 at:
http://apollo.backplane.com/matt/crc64.html
But, of course, base 62 and base 2 don't line up nicely, so I can't
use it as-is.

I had the idea of making a full 64-bit CRC, converting it to base 62, and
lopping off the most significant characters, but I have NO IDEA if
this preserves the beneficial properties of the CRC algorithm. I'd
like someone who understands the math behind CRC calculation to let me
know. :)

FYI, I've already considered using a base 32 or base 64 alphabet
instead, as well as just making a 53-bit CRC and throwing away some range.

A qualifying answer to this question is either:
1) Code or pseudo-code for generating said base 62 CRC, or how to modify
above code to do so
or
2) An explanation why this is impossible to do, since there is
something intrinsically base 2 about CRC generation, and why you can't
lop bits (or parts of bits ;) off the final CRC without destroying its properties.

If your eyes don't glaze over at math, maybe you'll find this page useful:
http://www2.rad.com/networks/1994/err_con/crc_how.htm
I'm kind of fearing that the talk about binary polynomials is going to
make the answer to this question fall into category (2) above, but my
eyes are already starting to glaze over...

Clarification of Question by alexander-ga on 15 Jan 2003 01:36 PST
Hm, I just took a closer look at that second page, and have a feeling
that this might be a surprisingly stupid question...:)

Request for Question Clarification by mathtalk-ga on 15 Jan 2003 07:49 PST
While the CRC algorithm described on the second page "works over the
binary field" (uses polynomial arithmetic over Z/2Z), there is no
reason one could not apply the same approach to polynomials over any
finite field.

Now there are no finite fields that contain exactly 62 elements, so we
cannot get a base 62 string that way.  But one could do the arithmetic
over both Z/2Z and Z/31Z and combine the results to get a base 62
result.

The larger picture is that a CRC calculation is a hash function which
is "sensitive" to all parts of a file, so changing a single character
anywhere in the file is apt to produce a different CRC "checksum". 
Functions with this "desirable property" can be computed in a variety
of ways.  If you want to follow the CRC model closely, I'd recommend
the Z/2Z + Z/31Z approach.

I'll post a more detailed explanation as an answer, if you wish it.

regards, mathtalk-ga

Clarification of Question by alexander-ga on 15 Jan 2003 15:37 PST
Hi mathtalk. :)

A detailed explanation of how to do a Z/31Z calculation and then
combine it with the Z/2Z one would be acceptable.

I'd also appreciate pointers to other algorithms that are "sensitive
to all parts of a file", but you don't need to go into depth on them.
I don't need a CRC per se, but whatever algorithm I use does need to
have similar collision resistance.

As far as taking a 54-bit CRC and reducing it to 62^9, that would be
fine with me, but I'm not sure if it preserves said collision
resistance as the Z/31Z Z/2Z approach above presumably would. Do you
know?
Answer  
Subject: Re: Non-base-2 CRCs? An impossibility?
Answered By: mathtalk-ga on 16 Jan 2003 09:27 PST
Rated: 5 out of 5 stars
 
Hi, alexander:

Thanks for posting this interesting question.  To answer your last
question first: Yes, you can safely reduce the 54-bit CRC to a base
62, 9 place number without losing the property that the result is
"sensitive" to all parts of the file.  I'll return to this idea again
at the end of my answer.

Perhaps we should back up and discuss the usual purpose of a CRC
value, which may differ in some respects from your own purpose.  A CRC
is very useful in detecting accidental changes to a file or message. 
CRC32, for example, is used in the algorithm for Ethernet packets as
well as in PKZip to detect transmission errors.  For a list of some
CRC type algorithms and their parameters, together with implementation
details:

[A Painless Guide To CRC Error Detection Algorithms]
http://www.repairfaq.org/filipg/LINK/F_crc_v35.html

In this respect one relies on having a well-known standard function
that calculates the values, so that the comparison can be made "at the
far end" independently.  By the same token, however, it makes a
vulnerability for intentional tampering with a file or message. 
Although the CRC32 value is extremely unlikely to be preserved by
accidental changes, widespread knowledge of the algorithm makes it
feasible for a malicious change to be made and the CRC value altered
in an intercepted message to make it appear that the transmission is
correct.

It is hard for me to guess what utility the novel form of CRC you
propose will have, so let me first present the algorithms as requested
and make a few editorial comments at the end.

The "field" Z/2Z is a mathematical ring containing two elements, 0 and
1, with the arithmetic operations:

0 + 0 = 0;  0 + 1 = 1
1 + 0 = 1;  1 + 1 = 0

0 * 0 = 0;  0 * 1 = 0
1 * 0 = 0;  1 * 1 = 1

Because + here is XOR and * is AND, these arithmetic operations are
easily implemented as bitwise "logic" in many computer languages. 
Interpretation of a file or message as a stream of bits (and hence as
a polynomial over Z/2Z) is almost a given on binary computers, apart
from big-endian vs. little-endian (byte ordering) worries.
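
For concreteness, here is a rough sketch in C of the "beaten path" case: a
bit-at-a-time polynomial division over Z/2Z.  The same P(x) = x^9 + x^2 + 1
proposed below is used purely for illustration, the function name is mine,
and this is untested sketch code rather than any standard CRC:

#include <stddef.h>

/* Remainder of the bit stream (primed with a leading 1) modulo
   P(x) = x^9 + x^2 + 1 over Z/2Z.  POLY holds the low nine
   coefficients of P(x); the x^9 term is handled by the overflow test. */
#define CRC_BITS 9
#define POLY 0x005u                 /* x^2 + 1 */

unsigned crc_bits_z2(const unsigned char *msg, size_t len)
{
    unsigned rem = 1;               /* the priming "frame" bit */
    for (size_t i = 0; i < len; i++) {
        for (int b = 7; b >= 0; b--) {
            rem = (rem << 1) | ((msg[i] >> b) & 1u);  /* shift in next bit */
            if (rem & (1u << CRC_BITS))               /* x^9 coefficient set? */
                rem ^= (1u << CRC_BITS) | POLY;       /* subtract P(x), i.e. XOR */
        }
    }
    return rem;                     /* the nine remainder bits */
}

Because addition mod 2 is XOR and the leading coefficient of P(x) is always 1,
the whole division collapses to shifts and XORs.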

The case of a field Z/pZ, p any prime (or even more generally, of a
finite field containing p^n elements) is cumbersome for two reasons. 
One is that implementing the arithmetic operations will have more
"overhead".  Another is that the  interpretation of a file or message
as coefficients over Z/pZ for prime p > 2 requires some intellectual
effort.  Given a resolution of those design issues, then here is a
hash function algorithm which "reduces" to the CRC case when p = 2:

(1)  Let P(x) be a monic (leading coefficient 1) polynomial over Z/pZ.

(2)  Divide the message polynomial M(x) by P(x), obtaining a remainder
R(x) whose degree is less than that of P(x).

(3)  Return the coefficients of R(x) as the result (hash function).

Let us now consider each of these steps in more detail for case p =
31:

(1) Since you wish to get nine "digits" (radix 62), we would choose
P(x) to be a polynomial of degree 9.  If one considers the
"cyclic" nature of this "cyclic redundancy" check or code (CRC), it is
desirable for the cycle length to be as long as possible.  Consulting
Knuth's Art of Computer Programming vol. II (Semi-numerical
Algorithms), we learn that the cycle lengths are maximized by
making P(x) "irreducible", i.e. unfactorable over Z/pZ.  I would have
to do a bit of programming to determine one, but would gladly clarify
my answer in this way if you are interested.

For now I will propose:

P(x) = x^9 + x^2 + 1

as I have checked that this at least has no linear factors over Z/31Z.
 The small coefficients will also contribute to a faster
implementation, at least when the length of "message" is fairly long.
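
That check is easy to reproduce.  A minimal sketch in C, simply evaluating
P(a) mod 31 for every a in Z/31Z and reporting any root (a root would
correspond to a linear factor x - a):

#include <stdio.h>

int main(void)
{
    /* P(x) = x^9 + x^2 + 1 has a linear factor over Z/31Z
       exactly when P(a) = 0 (mod 31) for some a in 0..30. */
    for (int a = 0; a < 31; a++) {
        int p = 1, x2 = 0, x9 = 0;
        for (int k = 1; k <= 9; k++) {
            p = (p * a) % 31;       /* p = a^k mod 31 */
            if (k == 2) x2 = p;
            if (k == 9) x9 = p;
        }
        if ((x9 + x2 + 1) % 31 == 0)
            printf("root found: x = %d\n", a);
    }
    return 0;
}

(It prints nothing, confirming the absence of linear factors; a full
irreducibility test would also have to rule out factors of degree 2
through 4.)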

(2) The naive procedure for dividing polynomial M(x) by P(x) is
essentially long division.  Because the leading coefficient of P(x) is
chosen to be 1, the "trial divisors" are simply the leading
coefficients of the partial remainders.  I can point you to a nice C++
library in which this sort of polynomial arithmetic is already
implemented:

[Victor Shoup's NTL: A Library for Doing Number Theory]
http://www.shoup.net/ntl/

A tailored implementation for a particular p would naturally be more
compact and/or potentially faster.  While lookup tables are often
recommended for a standard binary implementation, that looks to me
unwieldy for base 31 (or base 62, see below).  However, an iterative
(looping) approach may be entirely suitable for your intended
application.

The procedure begins with the interpretation of the message (or file)
as a polynomial with coefficients in Z/31Z.  The simplest approach, I
think, is to break each "byte" of the message into two "nibbles", i.e.
4 bits in a group.  That gives us coefficients 0 to 15, which are then
treated as modulo 31 values.  Also, it is a good idea to "prime" the
message stream with an initial nonzero "frame" (because otherwise the
message might begin with an arbitrarily large block of zeroes without
affecting the hash function value; unlikely in some contexts but not
in others).  For our purposes I think it sufficient to consider the
polynomial M(x) to begin with a leading 1, and coefficients thereafter
to be drawn from the "nibbles" of the actual message or file.

So we can think of the first polynomial division as being done like
this:

partial M(x) = x^9 + nibble(1)*x^8 + nibble(2)*x^7 + . . . +
nibble(7)*x^2 + nibble(8)*x + nibble(9)

Since the leading coefficient is 1, we subtract a copy of P(x) to get
the partial remainder:

partial M(x) - P(x) = nibble(1)*x^8 + nibble(2)*x^7 + . . . +
[nibble(7)-1]*x^2 + nibble(8)*x + [nibble(9)-1]

where of course each coefficient is computed mod 31.

To continue we "multiply" this partial remainder by x (shift
coefficients one place to the left) and tack on the next nibble from
the input message stream.  We then subtract the multiple of P(x)
indicated by the leading coefficient, in this case nibble(1) as above.

The process continues until the input stream of nibbles is exhausted,
and the partial remainder at that point is our final remainder R(x). 
We insert at this point the observation that this polynomial division
procedure can be carried out over Z/62Z, using the same stream of
nibbles interpreted as values 0 to 15 within the mod 62 range.  This
would probably be faster than doing the computations of Z/2Z and Z/31Z
separately and combining them at the end, though the results would not
be equal.  We would want to check the irreducibility of P(x) over both
fields Z/2Z and Z/31Z in this approach.
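
To make steps (1) and (2) concrete, here is one way the division loop might
look in C.  This is a sketch only: the function names are mine, the modulus m
is left as a parameter so the same loop can serve Z/31Z or Z/62Z, and it has
not been tested against any reference implementation.

#include <stddef.h>

/* Feed one coefficient c into the partial remainder over Z/mZ, using
   P(x) = x^9 + x^2 + 1.  rem[0] holds the x^8 coefficient, rem[8] the
   constant term. */
static void feed(int rem[9], int c, int m)
{
    int q = rem[0];                        /* coefficient shifted out onto x^9 */
    for (int i = 0; i < 8; i++)
        rem[i] = rem[i + 1];               /* multiply partial remainder by x */
    rem[8] = c % m;                        /* tack on the next coefficient */
    rem[6] = ((rem[6] - q) % m + m) % m;   /* subtract q * x^2 */
    rem[8] = ((rem[8] - q) % m + m) % m;   /* subtract q * 1 */
}

/* Remainder of the primed message polynomial modulo P(x) over Z/mZ,
   taking two 4-bit nibbles per byte, high nibble first. */
void crc_modm(const unsigned char *msg, size_t len, int m, int rem[9])
{
    for (int i = 0; i < 9; i++)
        rem[i] = 0;
    feed(rem, 1, m);                       /* the leading 1 that frames the message */
    for (size_t n = 0; n < len; n++) {
        feed(rem, msg[n] >> 4, m);         /* high nibble */
        feed(rem, msg[n] & 0x0F, m);       /* low nibble  */
    }
}

Calling crc_modm(..., 31, ...) together with a separate Z/2Z computation gives
the two sets of coefficients combined in step (3) below; calling it with
m = 62 is the single-pass alternative just mentioned.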

(3) The reporting of remainder R(x) as a nine "digit" result using
0-9,A-Z,a-z is essentially a matter of deciding in what order these
characters are to be taken.  That is, suppose we have separately taken
nine coefficients from Z/2Z:

B1 B2 . . . B9

and nine coefficients from Z/31Z, i.e. 0 through 30:

C1 C2 . . . C9

They could be combined simply by taking each corresponding "digit" D =
B + 2C to get values in the range 0 through 61.  The mapping from
there to specific characters would perhaps be most expeditiously done
with a lookup table, i.e. a character array of length 62.
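
A sketch of that last step in C, assuming arrays b[] and c[] already hold the
nine Z/2Z and Z/31Z remainder coefficients (the ordering of the 62 characters
below is an arbitrary choice, not a standard):

/* Map each pair (B, C) to the "digit" D = B + 2C in 0..61, then to a
   character through a 62-entry lookup table. */
static const char ALPHABET[] =
    "0123456789"
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz";

void combine_digits(const int b[9], const int c[9], char out[10])
{
    for (int i = 0; i < 9; i++)
        out[i] = ALPHABET[b[i] + 2 * c[i]];
    out[9] = '\0';                  /* NUL-terminate the 9-character string */
}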

Summary
=======

It is possible to compute a hash function using the CRC approach but
taking values in base other than 2, though the implementation will be
disadvantaged by a need to implement arithmetic in another base and to
come up with an interpretation of messages as streams of values from
that base.  Here base 2 enjoys a "beaten path".

Having sketched out the steps involved in a CRC-like algorithm using a
base other than 2, it seems that such an option would be most
attractive if one wished to have a "privately known" hash function,
perhaps for the purpose of enforcing a license agreement or some other
"tamper-proof" application.  A standard CRC algorithm would suffer in
such a comparison, because of the well-known details of the CRC
calculations.

On the other hand the standard ensures that a check can be easily done
at a "destination" of a message with little prior arrangement.

If it were desirable to obfuscate the precise details of a hash
function's implementation (while providing a built-in implementation of
some kind at the destination for checking purposes), then it might
arguably be a good idea to use a standard CRC value with a few more
bits than 54, say a 60 bit check, and reduce that (through repeated
division by 62 with remainders) to a nine character "base 62" string. 
The effect would be that by wrapping the range around so many
additional times, it would be extremely difficult to detect variations
in the distribution of hash values that might suggest their origin.

Thanks again for posting this very interesting question, and please
let me know if something I have said (or failed to say!) requires
additional clarification.

regards, mathtalk-ga


Search Strategy

Keywords: CRC algorithm polynomial
http://www.google.com/search?hl=en&ie=UTF-8&oe=UTF-8&q=CRC+algorithm+polynomial&btnG=Google+Search

Keywords: Shoup NTL
http://www.google.com/search?hl=en&ie=UTF-8&oe=UTF-8&q=Shoup+NTL&btnG=Google+Search

Request for Answer Clarification by alexander-ga on 16 Jan 2003 17:52 PST
Wow. Keep in mind that that's already a more-than-satisfactory answer,
and your tip is already racking up like a taxi meter...

First, some background, in case you're interested:

This will be used to create a more-or-less "unique" identifier for a
database of millions of files. Collisions are obviously unwanted, but
having one or two will actually not be the end of the world. More than
a handful, however, starts to become a problem; the twist is that
collisions won't even be easily detectable (below an extreme level) once said
database is in operation. The identifier will be represented in ASCII,
must be as short as possible, and must not contain any
non-alphanumeric characters.

Malicious attacks or needing to rely on a standard algorithm are not a
concern.

Because of the speed concerns you mentioned (it does need to be
reasonably fast), I think it might be better to go with the reduction
of a larger CRC. As far as performing that, is this the gist of what
you meant? :

unsigned long long crc = Get60BitCRC();   /* whatever CRC routine ends up being used */
int id[9];                                /* base-62 digits, least significant first */
for (int x = 0; x < 9; x++) {
   id[x] = crc % 62;
   crc /= 62;
}

What I need is some sort of evidence that shows that this will be at
least as collision resistant as a 53-bit CRC (and preferably a little
bit more), as well as reasonably sensitive to all parts of the file.

Can you help there?

Clarification of Answer by mathtalk-ga on 17 Jan 2003 13:58 PST
Hi, alexander:

Thanks for the additional information and the kind words of
encouragement.

Your code snippet, using long long integer to represent an unsigned
value of up to 64 bits, is just the "reduction to base 62" string
computation I had in mind.
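
Spelled out with the character mapping included, it might look like this in C
(the alphabet ordering is an arbitrary choice, and the CRC value is assumed to
come from whatever 60- or 64-bit routine you settle on):

/* Reduce a wide CRC value to a nine character base 62 string,
   least significant "digit" first, exactly as in your loop. */
void crc_to_base62(unsigned long long crc, char id[10])
{
    static const char ALPHABET[] =
        "0123456789"
        "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
        "abcdefghijklmnopqrstuvwxyz";
    for (int x = 0; x < 9; x++) {
        id[x] = ALPHABET[crc % 62];
        crc /= 62;
    }
    id[9] = '\0';
}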

You also ask for "some sort of evidence" that this approach is "at
least as collision resistant as a 53-bit CRC (and preferably a little
bit more), as well as reasonably sensitive to all parts of the file."

Such evidence might, I suppose, be either of an empirical or analytical
kind.  Without having access to even a sample distribution of the
actual files you wish to process, I'll present what can be said
analytically.

Background
==========

In your scheme you wish to design for one or two (unmonitored)
collisions over some millions of function evaluations.  As collisions in "hash
value space" are a normal and expected part of many applications, even
where one needs to maintain distinct "slots" for objects with equal
hashed values, I suspect that the best advice would depend on an even
clearer understanding of your application.  In particular detection of
hash value collisions is ordinarily quite easy in storage and
retrieval applications.

The natural approach for such a design would be to _assume_ that the
outputs of the function calls are uniformly randomly distributed
across the output range.  A simple probability model can then be
applied to compute the chance of no collisions, one collision, two,
etc.

However this begs the question of whether the outputs are really
uniformly distributed.  A brief discussion of CRC's as random number
generators may be found here:

[Ritter, T. 1991. The Efficient Generation of Cryptographic Confusion
Sequences. Cryptologia. 15(2): 81-139]
(see esp. Section 5.3)
http://www.ciphersbyritter.com/ARTS/CRNG2ART.HTM

The CRC algorithm is essentially a type of random number generator
(RNG) known as a "linear feedback shift register" (LFSR), or sometimes
by the initials TLP of three researchers (Tausworthe and subsequently
Lewis and Payne) who studied their efficacy, except that with the CRC
"external data are fed into the mechanism."  That is, the bit contents
of the message or file are sequentially "added" to the CRC
computational results.  For an equivalent RNG we would simply have a
short message (seed) to which thousands of zero bits are appended.

As the word "linear" indicates, the effects of changing a single bit
anywhere in the file can be analyzed exactly.  If the Nth bit from the
end of the file is changed, the remainder of the polynomial division
changes by this amount:

x^N mod P(x)

Since P(x) is chosen to be irreducible, this change remainder is never
zero.  However it would be possible to change several bits with "zero
effect" on the remainder.  In order for this to be the case, the bit
pattern would have to be one whose corresponding polynomial is
divisible by P(x).  The width of such a bit pattern would have to be
at least the degree of P(x) plus 1, and in general it seems reasonable to
estimate the chances of such a pattern occurring "by accident" at 1 in
2^k where k is the degree of P(x).

To summarize, the sort of calculation that underlies the CRC
calculation has been studied in connection with random number
generation since at least:

R. C. Tausworthe, Math. Comp. 19 (1965), 201-209

A battery of tests, such as Knuth outlines in Sec. 3.3 of Art of
Computer Programming, can be applied to particular choices of P(x) to
confirm their "randomness".  It should be noted, however, that the
design feature needed here is perhaps narrower than what is sought in
general RNG design for statistical purposes, but applied to a more
complex (and potentially biased) set of inputs.

The sole feature that I think relevant is the uniformity of the
output.  E.g. the runs-up and runs-down or other measures of
"correlations" in outputs are not significant, so long as there is no
"clustering" of outputs to increase the likelihood of collisions.

However without some specific information about the inputs, I don't
think we can carry the discussion of uniformity much further forward
with analytical arguments.  We could, of course, _assume_ random
inputs, but this would be begging an even larger question than the
assumption of random outputs we wish to avoid.

So next let's turn to calculating the likelihood of collisions,
identifying as an assumption that our CRC outputs are assumed
uniformly distributed.

Collisions with uniform outputs
===============================

You may be familiar with the non-intuitive "birthday problem", in
which one asks for variously sized groups of people the likelihood of
two sharing a birthday.  The number of people needed to provide a
likely match is surprisingly small.  The probability exceeds 50% with
as few as 23 people, and closely approaches 100% with 50 or more:

[Birthday Problem -- from MathWorld]
http://mathworld.wolfram.com/BirthdayProblem.html

Thus the requirement of no or very few collisions, achieved without
being able to monitor the outcomes, is a surprisingly strict one.  
Let d = 2^k where k is the number of bits in a CRC result (or degree
of P(x) as discussed above).  Then the exact chance of no collisions
in n CRC invocations or "trials" is:

F(n) = Pr(n collisionless trials) = d!/[(d-n)!d^n]

as shown in the link above.

You said that the hash values will be computed for "a database of
millions of files", so I've computed a few values of this probability
for varying n and d:

   d      n        F(n)
 =====  =====   ==========

  2^53   10^6   0.99994449
  2^53   10^7   0.99446426
  2^53   10^8   0.57400825

  62^9   10^6   0.99996307
  62^9   10^7   0.99631326
  62^9   10^8   0.69117952

As you can see, the probability of collisions is not too much
different between a million and ten million trials, whether we
have a "short" 2^53 range or a slightly larger 62^9 range.  One starts
to see a noticeable difference as the number of trials (files)
increases towards one hundred million.  I'm not sure where your
architecture fits in this range, but a more precise estimate based on
the actual number of files might be warranted.
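
(For what it's worth, figures of this kind can be reproduced without extended
precision arithmetic via the standard approximation F(n) ~ exp(-n(n-1)/(2d)),
which is accurate when n is far smaller than d.  A sketch in C, purely to
recompute the table:

#include <math.h>
#include <stdio.h>

/* Approximate probability of n collisionless trials over a uniform
   range of size d: F(n) ~ exp(-n(n-1)/(2d)). */
static double no_collision(double n, double d)
{
    return exp(-n * (n - 1.0) / (2.0 * d));
}

int main(void)
{
    double d[2] = { 9007199254740992.0,      /* 2^53 */
                    13537086546263552.0 };   /* 62^9 */
    double n[3] = { 1e6, 1e7, 1e8 };
    for (int i = 0; i < 2; i++)
        for (int j = 0; j < 3; j++)
            printf("d = %.0f  n = %.0e  F(n) = %.8f\n",
                   d[i], n[j], no_collision(n[j], d[i]));
    return 0;
}

The output agrees closely with the table above.)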

I'm going to have to take a break and put pencil to paper to calculate
the effects of reducing a larger CRC value to the smaller 62^9 range. 
The drawback of reducing a larger CRC value back to a base 62 string
of length 9 is that the results are _not_ uniformly distributed
(because no power of two can be an exact multiple of 62^9).  The
larger the number of bits used in the CRC computation, the smaller the
deviation from uniformity, so I expect that in the limit, the
probability of no collisions approaches that of the "ideal" uniform
distribution over a range of 62^9.

To make an accurate comparison, however, requires either a stupendous
amount of extended precision arithmetic or evaluation of some tricky
integrals.  I'll let you know when I get the details sorted out.

regards, mathtalk

Clarification of Answer by mathtalk-ga on 18 Jan 2003 06:34 PST
Hi, alexander:

Thanks for the kind words (and awesome tip)!  Let me at least unburden
myself of the "torturous" idea for handling the wrap around effects of
reducing from a CRC with more than 53 bits to a base 62 string of
length 9.

If we use a 54 bit CRC value, then the range is 0 to 2^54 - 1. 
Repeated reductions mod 62 produce successive "digits" base 62 and
give a value in range 0 to 62^9 - 1.  However 2^54 is about one-third
more than 62^9, so that we "double up" the distribution of values
within the range 0 to 2^54 - 62^9 - 1.

We can consider "collisions" in [0,62^9) as occurring either in this
doubled-up lower range [0,2^54 - 62^9) or in the "flat" upper range
[2^54 - 62^9, 62^9).  Because the lower range is about a third of the
length but gets "covered" twice, and the upper range is about
two-thirds and covered once, the chances are about equal of a value
falling in either interval.  More precisely, one has a binomial
distribution between falling in the lower range vs. upper range with
probabilities of roughly 49.7% vs. 50.3% respectively.
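
(In round numbers: 2^54 is about 1.8014e16 and 62^9 about 1.3537e16, so the
doubled-up lower range has length d = 2^54 - 62^9, about 4.477e15.  The chance
of a value landing there is 2d/2^54, about 49.7%, and of landing in the flat
upper range (62^9 - d)/2^54, about 50.3%.)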

Suppose we did an "experiment" with n = one million trials, where p ~
49.7% is the probability of being in the lower range and 1-p the
complementary probability of being in the upper range.  Then the
chance of no collisions can be expressed as:

SUM F(j,d) F(n-j,62^9 - d) C(n,j) p^j (1-p)^(n-j)
FOR j = 0 to n

where:
   d = 2^54 - 62^9 
is the length of lower range,
   F(j,d) = d!/[(d-j)!(d^j)]
is the chance of j collisionless trials in that range, and similarly:
   F(n-j,62^9 - d)
is the chance of n-j collisionless trials in the upper range.
   C(n,j) = n!/[(n-j)!j!]
is combinations of n things taken j at a time, and:
   C(n,j) p^j (1-p)^(n-j)
is the binomial probability of having j trials wind up in the lower
range and n-j in the upper range.

Attempts to directly compute this expression proved futile, because
individual terms are tiny.  It dawned on me that this ought to be
computed by an integral rather than a sum, and indeed the binomial
probability C(n,j) p^j (1-p)^(n-j) can be nicely approximated by
integrating from j-0.5 to j+0.5 a normal distribution which shares the
same mean and variance as the binomial distribution:

   mean = np  variance = np(1-p)

The details of reducing a 60 bit CRC value to the range 62^9 are
similar except that the deviation from uniformity is less noticeable.
The 2^60 range wraps around 85 times and leaves an "extra" layer in
the distribution over about the first sixth of the reduced range.

If I find time, I plan to compute these results to satisfy my
curiosity about whether reducing the 54 bit CRC does increase the
chance of collisions (compared to using the 53 bit CRC) and my
intuition that reducing the 60 bit CRC provides a slight improvement
over the 53 bit CRC (approaching the ideal efficiency for a uniform
62^9 range).

regards, mathtalk-ga

Request for Answer Clarification by alexander-ga on 18 Jan 2003 08:05 PST
Interesting analysis. So by definition, since the CRC algorithm
provides a uniform distribution, reducing the resulting value by an
amount that causes the "wraparound" to wrap exactly would also yield a
uniform distribution. So you could take any x-bit CRC and
reduce it to any smaller number of bits without any loss in
collision resistance. And if the wrap isn't exact, but is such that
the "incomplete" part is a very small part of the total range, the
resistance *is* decreased, but by a very small amount indeed.

If so, wouldn't it be even better to take a 64-bit CRC for reduction
to 62^9? If I'm doing the numbers right, its "incomplete" part is just
under 0.05% of the total range, whereas a 60-bit one's would be just under
0.2%. Minute difference, sure, but a 64-bit CRC isn't any more
expensive to calculate.

And since I appear to really be using it as a hash function instead of
an error detection algorithm, any small reduction in sensitivity is
likely irrelevant.

Clarification of Answer by mathtalk-ga on 18 Jan 2003 16:16 PST
Yes, your analysis is right.  The 64-bit CRC would be the best choice.
 There is code for that which is highly optimized and may very well be
faster than a decent 60-bit implementation.

It may be a little tricky to compare how much closer the 64-bit
reduction is to a uniform distribution over 62^9 than a 60-bit
reduction would be, but I think your figures err if anything on the
conservative side.  I make out the 60-bit reduction's deviation from
uniformity to be 0.16%, consistent with your figure of 0.2%, and the
64-bit reduction's deviation to be 0.016% (an order of magnitude less)
using "L1" norms of the distributions' discrepancies.

regards, mathtalk
alexander-ga rated this answer: 5 out of 5 stars and gave an additional tip of $50.00
Absolutely amazing answer; goes well beyond what I expected, including
addressing underlying issues. Thank you very much.

To save you from the tricky integrals (sounds like torture), it
occurred to me that I could tweak the code I mentioned in my first
link, which was (now it hits me) designed for testing collision
resistance. The ultimate dataset doesn't exist yet, but I think I have
enough samples to sufficiently test the algorithm. This should
actually provide a pretty good "real-world" indicator of how it will
perform in my application, including sensitivity.

Thank you again for your prodigious help. :)

Comments  
Subject: Re: Non-base-2 CRCs? An impossibility?
From: mathtalk-ga on 15 Jan 2003 08:05 PST
 
Hi, alexander:

Since 2^53 < 62^9 < 2^54, you could either compute the 53-bit CRC and
express this "base 62" (without any loss of information) or compute
the 54-bit CRC and reduce it mod 62^9 to the desired form (perhaps
another sense of "throwing away some range").
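
(Concretely, 2^53 = 9,007,199,254,740,992, 62^9 = 13,537,086,546,263,552, and
2^54 = 18,014,398,509,481,984.)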

best wishes, mathtalk
