Q: Optimized algorithm for constant division by 255 ( Answered, 1 Comment )
Subject: Optimized algorithm for constant division by 255
Category: Computers > Algorithms
Asked by: mathiasm-ga
List Price: $2.00
Posted: 29 Mar 2005 07:40 PST
Expires: 28 Apr 2005 08:40 PDT
Question ID: 502016
```
What's the fastest algorithm for dividing a 32-bit integer variable by a constant of (or as close as possible to) 255? I.e., instead of using the quite slow integer division assembler instruction, it is possible to get the same result, or a result close enough, by multiplying and shifting the variable. What is the algorithm for dividing by 255, or by 254 or 256 if that gives a faster algorithm? The algorithm will be used for alpha-blending pixel calculations, so it is not important that the algorithm gives an answer that is 100% the same as actually dividing by 255.
```

Request for Question Clarification by mathtalk-ga on 29 Mar 2005 18:58 PST

```
Hi, mathiasm-ga:

The "algorithm" for dividing by 256 is to shift the number right by 8 bits, i.e. by one byte. So it's going to be hard to beat the speed of that. You basically just add one to the address of the operand on a byte-addressed computer, and you have the address of the result (assuming the most significant word is zero-padded by at least one additional byte).

Getting the result for dividing by 255 is only a little harder. Saying what precisely is the "fastest algorithm" is either trivial but infeasible (create a lookup table for all possible 32-bit values; you've got 4 GB free, don't you?), or at least somewhat dependent on the machine language instructions available to you.

If you like, I'll outline the high-level options for computing the quotient on dividing by 255, either exactly or "close enough", depending on your criteria. For $2.00 I can't promise to write the assembly language code for you!

regards, mathtalk-ga
```

```
Hi, mathiasm-ga:

In byte arithmetic, dividing by 255 is analogous to dividing by 9 in base ten arithmetic. We sketch first an approach for doing division by 255 in byte arithmetic, and then illustrate the ideas with some familiar base ten calculations. At the end are some remarks on doing the arithmetic with an Intel-like instruction set.
* * * * * * * * * * * * * * * * * * * * * * *

Suppose that a 32-bit value W is stored in computer memory as:

    W : b1 | b2 | b3 | b4

where we have depicted the four bytes as ranging from most significant (b1) to least significant (b4). The exact quotient of this value divided by 256 is the right shift of these by one byte (padded with zero at the "left"):

    W/256 : 0 | b1 | b2 | b3

Also, b4 is the exact remainder after division of W by 256.

How good an approximation is this to the quotient when dividing by 255? First, 255 is less than 256, so the quotient from dividing by 255 is always greater than or equal to the quotient from dividing by 256. So the shift-by-one-byte approach underestimates the true value. If we were doing the arithmetic as fractions (rational values):

    (W/255) - (W/256) = W/(255*256)

then the "error" of approximation would be precisely 1/256'th of the exact answer W/255. Since 1/256 is about 0.39%, it might be accurate enough for some applications.

Our accuracy can be improved a good bit by doing one 32-bit addition:

    W/256   :  0 | b1 | b2 | b3
  + W/256^2 :  0 |  0 | b1 | b2
  -----------------------------
    Qa      : c0 | c1 | c2 | c3

Note that a carry will potentially produce a nonzero value in the most significant byte c0 of the value Qa. Qa represents the sum of the quotients when W is divided by 256 (shift by one byte) and by 256*256 (shift by two bytes). This is still an underestimate of the true quotient when dividing by 255, but:

    (W/255) - (W/256 + W/(256*256)) = (W/255) * (1/(256*256))

Since 1/(256*256) is about 1.53E-5, that one 32-bit addition reduces the relative error by more than two orders of magnitude. One way to think about this is that since the error of our initial approximation was 1/256 of the exact answer, by adding 1/256 of the approximate answer to itself, we eliminate most of the error.

In fact one can express the exact quotient by 255 as a sequence of similarly repeated additions in "radix 256" arithmetic:

      0 | b1 | b2 | b3 # b4 |  0 |  0 ...
  +   0 |  0 | b1 | b2 # b3 | b4 |  0 ...
  +   0 |  0 |  0 | b1 # b2 | b3 | b4 ...
  +   0 |  0 |  0 |  0 # b1 | b2 | b3 ...
    ...
  ---------------------------------------
     d0 | d1 | d2 | d3 # d4 | d5 | d6 ...

where # denotes the "radix point" separating the whole number value of the quotient from the fractional part (this corresponding to a discrete remainder over 255). Although this appears to require an infinite number of additions, essentially evaluating a geometric series:

     W     W      W      W      W
    --- = --- + ---- + ---- + ---- + ...
    255   256   2^16   2^24   2^32

with a bit of cleverness and attention to detail, the exact quotient can actually be obtained in a finite number of steps.

* * * * * * * * * * * * * * * * * * * * * * *

Let's illustrate the ideas by translating them into examples where we divide a four-digit (base ten) number by 9. Of course the effect of shifting digits over once is dividing by 10.

Example: Divide 1776 by 9.
=======

Since 1776 divided by 10 gives 177 (and remainder 6), we know that our division by 9 should give a bigger quotient, bigger actually by about one tenth:

      177
    +  17
    -----
      194

This is still an underestimate of the true quotient when dividing by 9, but it's pretty close. 1776 divided by 9 gives quotient 197 with remainder 3.

To see how one might push these ideas to obtain the exact answer, let's begin by redoing the addition, preserving values to the right of the decimal point:

      177.6
    +  17.76
    --------
      195.36

This approximation falls short of the true answer by 1/100'th, just as the "first" approximation 177.6 falls short by 1/10'th. So the next correction is to add to it the value shifted over by two decimal places:

      195.36
    +   1.9536
    ----------
      197.3136

If we do one more addition, this time shifting the value over by four decimal places before adding:

      197.3136
    +    .01973136
    --------------
      197.33333136

it becomes clear that the limit is 197 and one third. The size of the shift, and the number of correct digits, doubles with each iteration, as this is a disguised "Newton" iteration for the reciprocal (1/9 here, 1/255 in the byte-arithmetic case).
* * * * * * * * * * * * * * * * * * * * * * *

Some remarks on an implementation in Intel-like machine instructions will compare the integer division and integer addition approaches.

Before the introduction of the Pentium line of microprocessors, it appeared that Intel was willing to sacrifice valuable space on the chip's surface to lower the time needed for an integer divide. However, current designs reflect different priorities, and general division instructions (as opposed to divisions by powers of two, say) are somewhat regarded as rarely used and hence of little importance to overall optimization.

For some general remarks on the integer division operations, by operand size and signed vs. unsigned, see midway down on this page:

[Pentium Discussion on Special Instructions -- GameDev.net]
http://www.gamedev.net/reference/articles/article208.asp

Keeping up with the instruction timings is complicated by various model, pipeline, and address decoding issues, and no detailed analysis is attempted here. However, as a proxy for such considerations we will reference the following technical document by AMD:

[AMD Athlon Processor x86 Code Optimization Guide]
http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/22007.pdf

and especially Appendix F, "Instruction Dispatch and Execution Resources/Timing".

The most natural way to do the division is with a DIV instruction:

[DIV -- Unsigned Divide]
http://www.cs.ucla.edu/~kohler/class/04f-aos/ref/i386/DIV.htm

Placing the 32-bit dividend into EAX, with EDX set to zero, you can divide it by a register containing 255, say EBX. You then get the quotient in EAX (and the remainder in EDX). The operation requires about 40 clocks, which is a pretty hefty cost, but there is no prospect of overflow.

One gimmick to keep in mind is the option of using a floating-point operation to get the desired quotient.
The floating-point operation can be executed in parallel with an integer divide, so with careful code there is some prospect here of effectively cutting the time by maybe half.

Although I won't go into any level of detail, the ADD instruction is so much faster that the amount of time spent moving data into or out of registers becomes more significant. If one intends to do just the single 32-bit ADD, then it probably doesn't matter too much whether the ADD or the LOAD will bear the penalty for the unaligned operand.

Let us suppose, for definiteness' sake, that the original 32-bit dividend is aligned on a word boundary (even address), say:

    W : b1 | b2 | b3 | b4       [address of W = mem]

and that the bytes "above" mem + 4 are set to zero. Then we can load EAX with the shifted 32-bit value:

    W/256 : 0 | b1 | b2 | b3    [address is mem + 1]

with a MOV instruction:

[MOV -- Move Data]
http://www.cs.ucla.edu/~kohler/class/04f-aos/ref/i386/MOV.htm

at a cost of probably 4 clocks, including the alignment penalty. This "three byte" 32-bit value can then be added to the zero-padded upper word of W in memory:

    W/256^2 : 0 | 0 | b1 | b2   [address is mem + 2]

without any (additional) alignment penalty, in another 4 clocks:

[ADD -- Add]
http://www.cs.ucla.edu/~kohler/class/04f-aos/ref/i386/ADD.htm

We now have the (approximate) quotient stored in memory at mem + 2. This net of 8 clocks leaves out the overhead involved in positioning and disposing of the dividend/quotient data, but of course that is something any algorithm/application code will have to deal with. It certainly makes for a favorable comparison with the integer divide instruction, especially if the slight loss of precision is of no consequence.

Furthermore, current chip architectures seem to boast more than one integer execution unit, and if anything the prospects for overlapped execution of the MOV/ADD operations outlined above are better than for overlapping integer and floating-point divisions.
regards, mathtalk-ga
```

```
The result of dividing a 32-bit number by 255 will fit in 32 bits. While this is a trivial observation, division is one of the arithmetic operations that can cause overflow at the machine instruction level (multiplication cannot!). So it's a good idea to keep in mind the maximum possible result, which would be:

    0xFFFFFFFF / 0xFF = 0x01010101

Notice that dividing (with truncation, i.e. discarding any remainder) by 256 for this example would give:

    0xFFFFFFFF / 0x0100 = 0x00FFFFFF

and the difference between the two results is 0x00010102, which is also the maximum possible (absolute) discrepancy for using the quotient by 256 to approximate the quotient by 255. In general the quotient by 256 is less than or equal to the quotient by 255 (since dividing by a larger number gives a potentially smaller quotient), but the _relative_ discrepancy is no more than about 0.4%.

If a more precise value were important, then on 64-bit Intel hardware and similar CPUs I'd look into doing two 32-bit calculations in parallel using the MMX instructions. However, the Answer I'd propose providing is simply the "algorithm" for computing the more precise value and not a detailed analysis at the machine instruction level.

regards, mathtalk-ga
```