Q: Optimized algorithm for constant division by 255 ( Answered, 1 Comment )
Subject: Optimized algorithm for constant division by 255
Category: Computers > Algorithms
Asked by: mathiasm-ga
List Price: $2.00
Posted: 29 Mar 2005 07:40 PST
Expires: 28 Apr 2005 08:40 PDT
Question ID: 502016
What's the fastest algorithm for dividing a 32-bit integer variable by
a constant of (or as close as possible to) 255? I.e. instead of using
the quite slow integer division assembler operator, it is possible to
get the same result, or a result close enough, by multiplying and
rotating the variable. What is the algorithm for dividing by 255, or
by 254 or 256 if that gives a faster algorithm? The algorithm will be
used for alpha blending pixel calculations, so it is not important
that the algorithm gives an answer that is 100% the same as actually
dividing by 255.

Request for Question Clarification by mathtalk-ga on 29 Mar 2005 18:58 PST
Hi, mathiasm-ga:

The "algorithm" for dividing by 256 is to shift the number over
8 bits, i.e. by one byte.  So it's going to be hard to beat the speed
of that.  You basically just add one to the address of the operand on
a byte-addressed computer, and you have the address of the result
(assuming the most significant word is zero-padded by at least one
additional byte).

Getting the result for dividing by 255 is only a little harder. 
Saying what precisely is the "fastest algorithm" is either trivial
but infeasible (create a lookup table for all possible 32-bit values;
you've got 4 GB free, don't you?) or at least somewhat dependent on
the machine language instructions available to you.

If you like I'll outline the high-level options for computing the
quotient on dividing by 255, either exactly or "close enough",
depending on your criteria.

For $2.00 I can't promise to write the assembly language code for you!

regards, mathtalk-ga
Subject: Re: Optimized algorithm for constant division by 255
Answered By: mathtalk-ga on 09 Apr 2005 16:22 PDT
Hi, mathiasm-ga:

In byte arithmetic, dividing by 255 is analogous to dividing by 9 in
base ten arithmetic.  We sketch first an approach for doing division
by 255 in byte arithmetic, and then illustrate the ideas with some
familiar base ten calculations.  At the end see some remarks on doing
the arithmetic with an Intel-like instruction set.

*  *  *  *  *  *  *  *  *  *  *  *  *  *  *  *  *  *  *  *  *  *  *

Suppose that a 32-bit value W is stored in computer memory as:

    W :       b1 | b2 | b3 | b4
where we have depicted the four bytes as ranging from most significant
b1 to least significant b4.

The exact quotient of this value divided by 256 is the right shift of
these by one byte (padded by zero at "left"):

    W/256:     0 | b1 | b2 | b3

Also b4 is the exact remainder after division of W by 256.

How good an approximation is this to the quotient when dividing by 255?

First, 255 is less than 256, so the quotient from dividing by 255
is always greater than or equal to the quotient from dividing by 256.
So the shift-by-one-byte approach underestimates the true value.

If we were doing the arithmetic as fractions (rational values):

    (W/255) - (W/256) = W/(255*256)

then the "error" of approximation would be precisely 1/256'th of the
exact answer W/255.  Since 1/256 is about 0.39%, it might be accurate
enough for some applications.
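
In C the byte-shift approximation is a one-liner; a minimal sketch
(the function name is just illustrative):

```c
#include <stdint.h>

/* Approximate w / 255 by w / 256, i.e. a right shift of one byte.
   This never overestimates the true quotient, and the relative
   error is about 1/256, or 0.39%. */
static uint32_t div255_shift(uint32_t w)
{
    return w >> 8;
}
```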

Our accuracy can be improved a good bit by doing one 32-bit addition:

    W/256:     0 | b1 | b2 | b3
            +  0 |  0 | b1 | b2
     Qa:      c0   c1   c2   c3

Note that a carry will potentially produce a nonzero value in the most
significant byte c0 of value Qa.

Qa represents the sum of the quotients when W is divided by 256 (shift
by one byte) and by 256*256 (shift by two bytes).  This is still an
underestimate of the true quotient when dividing by 255, but:

    (W/255) - (W/256 + W/(256*256)) = (W/255) * (1/(256*256))

Since 1/(256*256) is about 1.53E-5, that one 32-bit addition reduces
the relative error by more than two orders of magnitude.

One way to think about this is that since the error of our initial
approximation was 1/256 of the exact answer, by adding 1/256 of the
approximate answer to itself, we eliminate most of the error.
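
That one extra addition, again as an illustrative C sketch:

```c
#include <stdint.h>

/* Two-term approximation W/256 + W/65536.  Still a slight
   underestimate of W/255, but the relative error is only about
   1/65536 (1.5e-5).  The sum is at most 0x0100FFFE, so it cannot
   overflow 32 bits, although a carry can reach the top byte
   (the c0 above). */
static uint32_t div255_approx2(uint32_t w)
{
    return (w >> 8) + (w >> 16);
}
```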

In fact one can express the exact quotient by 255 by a sequence of
similarly repeated additions in "radix 256" arithmetic:

               0 | b1 | b2 | b3 # b4 |  0 |  0 ...
            +  0 |  0 | b1 | b2 # b3 | b4 |  0 ...
            +  0 |  0 |  0 | b1 # b2 | b3 | b4 ...
            +  0 |  0 |  0 |  0 # b1 | b2 | b3 ...
              d0 | d1 | d2 | d3 # d4 | d5 | d6 ...

where # denotes the "radix point" separating the whole-number value
of the quotient from the fractional part (which corresponds to the
remainder over 255).

Although this appears to require an infinite number of additions,
essentially evaluating a geometric series:

     W     W      W       W       W
    --- = --- + ----- + ----- + ----- + ...
    255   256    2^16    2^24    2^32

with a bit of cleverness and attention to details, the exact quotient
could actually be obtained in a finite number of steps.
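
Summing the series to full precision is in effect what optimizing
compilers do when they replace x / 255 by a multiplication with a
fixed-point reciprocal.  A sketch of that trick (the constant
0x80808081 is ceil(2^39 / 255), i.e. the series rounded up; one can
verify the truncated product matches exact division for all 32-bit x):

```c
#include <stdint.h>

/* Exact unsigned division by 255: 0x80808081 is the fixed-point
   reciprocal ceil(2^39 / 255), the series 1/2^8 + 1/2^16 + ...
   rounded up.  The rounding error stays small enough that the
   floor survives for every x < 2^32. */
static uint32_t div255_exact(uint32_t x)
{
    return (uint32_t)(((uint64_t)x * 0x80808081u) >> 39);
}
```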

*  *  *  *  *  *  *  *  *  *  *  *  *  *  *  *  *  *  *  *  *  *  *

Let's illustrate the ideas by translating them into examples where we
divide a four-digit (base ten) number by 9.

Of course the effect of shifting digits over once is dividing by 10.

Example:  Divide 1776 by 9.

Since 1776 divided by 10 gives 177 (and remainder 6), we know that our
division by 9 should give a bigger quotient, bigger by about one tenth:

                   177
                 +  17
                 -----
                   194

This is still an underestimate of the true quotient when dividing by 9,
but it's pretty close.  

1776 divided by 9 gives quotient 197 with remainder 3.

To see how one might push these ideas to obtain the exact answer, let's
begin by redoing the addition, preserving the values to the right of the
decimal point:

                   177.6
                 +  17.76
                 --------
                   195.36

This approximation falls short of the true answer by 1/100'th, just as
the "first" approximation 177.6 falls short by 1/10'th.  So the next
correction is to add to itself the value shifted over by two decimal
places:

                   195.36
                 +   1.9536
                 ----------
                   197.3136

If we do one more addition, this time shifting the value over by four
decimal places before adding:

                   197.3136
                 +   0.01973136
                 --------------
                   197.33333136

it becomes clear that the limit is 197 and one third.

The size of the shift and the number of correct digits double with each
iteration, as this is a disguised "Newton" iteration for the reciprocal
(1/9 here, 1/255 in the byte version).

*  *  *  *  *  *  *  *  *  *  *  *  *  *  *  *  *  *  *  *  *  *  *

Some remarks on an implementation in an Intel-like machine instruction
set will compare the integer division and integer addition approaches.

Before the introduction of the Pentium line of microprocessors, it
appeared that Intel was willing to sacrifice valuable space on the
chip's surface to lower the time needed for an integer divide.

However current designs reflect different priorities, and general
division instructions (versus divisions by powers of two, say) are
somewhat regarded as rarely used and hence of little importance to
overall optimization.

For some general remarks on the integer division operations, by
operand size and signed vs. unsigned, see midway down on this page:

[Pentium Discussion on Special Instructions --]

Keeping up with the instruction timings is complicated by various
model, pipeline, and address decoding issues, and no detailed 
analysis is attempted.  However as a proxy for such considerations
we will reference the following technical document by AMD:

[AMD Athlon Processor x86 Code Optimization Guide]

and esp. Appendix F, "Instruction Dispatch and Execution Resources/Timing".

The most natural way to do the division is with a DIV instruction:

[DIV -- Unsigned Divide]

Placing the 32-bit dividend into EAX, with EDX set to zero, you can
divide it by a register containing 255, say EBX.  Then you get the
quotient in EAX (and the remainder in EDX).

The operation requires about 40 clocks, which is a pretty hefty cost,
but there is no prospect of overflow.

One gimmick to keep in mind is the option of using a floating-point
operation to get the desired quotient.  The floating-point operation
can be executed in parallel with an integer divide, so with careful
code there is some prospect here of effectively cutting the time by
maybe half.
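
The floating-point route can also be sketched in C, without any
assembly (a sketch, not tuned code): every 32-bit integer is exactly
representable in an IEEE double and double division is correctly
rounded, so simply truncating x / 255.0 gives the exact quotient.

```c
#include <stdint.h>

/* Exact quotient via floating point.  When x/255 is an integer it is
   representable and the division returns it exactly; otherwise the
   true value is at least 1/255 away from any integer, far more than
   the half-ulp rounding error, so truncation is safe. */
static uint32_t div255_fp(uint32_t x)
{
    return (uint32_t)(x / 255.0);
}
```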

Although I won't go into any level of detail, the ADD instruction is
so much faster that the amount of time spent moving data into or out
of registers becomes more significant.  If one intends to do just the
single 32-bit ADD, then it probably doesn't matter too much whether
the ADD or the LOAD will bear the penalty for the unaligned operand.

Let us suppose, for definiteness' sake, that the original 32-bit
dividend is aligned on a word boundary (even address), say:

    W :       b1 | b2 | b3 | b4  [address of W = mem]

and that the bytes from mem + 4 upward are set to zero.  Then we can
load EAX with the shifted 32-bit value:

    W/256:     0 | b1 | b2 | b3  [address is mem + 1]

with a MOV instruction:

[MOV -- Move Data]

at a cost of probably 4 clocks, including the alignment penalty.

This "three byte" 32-bit value can then be added to the zero-padded
upper word of W in memory:

    W/256^2:   0 |  0 | b1 | b2  [address is mem + 2]

without any (additional) alignment penalty in another 4 clocks.

[ADD -- Add]

We now have the (approximate) quotient stored in memory at mem + 2.

This net of 8 clocks excludes the overhead involved in positioning
and disposing of the dividend/quotient data, but of course that is
something any algorithm/application code will have to deal with.
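
The MOV/ADD scheme can be mimicked in C on a little-endian machine (a
sketch under that byte-order assumption; memcpy stands in for the
unaligned loads, and the function name is illustrative):

```c
#include <stdint.h>
#include <string.h>

/* Emulate the two unaligned loads: with W stored little-endian in a
   zero-padded buffer, the 32-bit value at address buf+1 is W >> 8 and
   the value at buf+2 is W >> 16; their sum approximates W / 255. */
static uint32_t div255_mem(uint32_t w)
{
    unsigned char buf[8] = {0};          /* bytes above W stay zero */
    uint32_t q256, q65536;

    memcpy(buf, &w, sizeof w);           /* assumes little-endian */
    memcpy(&q256, buf + 1, sizeof q256);     /* the MOV: W / 256 */
    memcpy(&q65536, buf + 2, sizeof q65536); /* the ADD operand  */
    return q256 + q65536;
}
```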

It makes for a favorable comparison with the integer divide
instruction, certainly if the slight loss of precision is of no
consequence.  Furthermore, current chip architectures seem to boast
more than one integer execution unit, and if anything the prospect
for overlapped execution of the MOV/ADD operations outlined above is
better than for the overlapping of integer and floating-point
divisions.

regards, mathtalk-ga
Subject: Re: Optimized algorithm for constant division by 255
From: mathtalk-ga on 30 Mar 2005 07:36 PST
The result of dividing a 32-bit number by 255 will fit in 32 bits. 
While this is a trivial observation, division is one of the arithmetic
operations that can cause overflow at the machine instruction level
(multiplication cannot!).

So it's a good idea to keep in mind the maximum possible result, which would be:

  0xFFFFFFFF / 0xFF  =  0x01010101

Notice that dividing (with truncation, i.e. discarding any remainder)
by 256 for this example would give:

  0xFFFFFFFF / 0x0100 = 0x00FFFFFF

and the difference in the two results is 0x00010102, which is also the
maximum possible (absolute) discrepancy for using the quotient by 256
to approximate the quotient by 255.

In general the quotient by 256 is less than or equal to the quotient
by 255 (since dividing by a larger number gives potentially a smaller
quotient), but the _relative_ discrepancy is no more than about 0.4%.
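
Those worst-case figures are easy to confirm in C:

```c
#include <stdint.h>

/* The maximum 32-bit quotient and the maximum absolute gap between
   division by 255 and truncated division by 256. */
static uint32_t max_gap(void)
{
    uint32_t q255 = 0xFFFFFFFFu / 0xFFu;   /* 0x01010101 */
    uint32_t q256 = 0xFFFFFFFFu >> 8;      /* 0x00FFFFFF */
    return q255 - q256;
}
```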

If a more precise value were important, then on 64-bit Intel hardware
and similar CPUs I'd look into doing two 32-bit calculations in
parallel using the MMX instructions.  However, the Answer I propose
to provide is simply the "algorithm" for computing the more precise
value, and not a detailed analysis at the machine instruction level.

regards, mathtalk-ga
