| Hi, mathiasm-ga:
In byte arithmetic, dividing by 255 is analogous to dividing by 9 in
base ten arithmetic.  We sketch first an approach for doing division
by 255 in byte arithmetic, and then illustrate the ideas with some
familiar base ten calculations.  At the end see some remarks on doing
the arithmetic with an Intel-like instruction set.
*  *  *  *  *  *  *  *  *  *  *  *  *  *  *  *  *  *  *  *  *  *  *
Suppose that a 32-bit value W is stored in computer memory as:
    W :       b1 | b2 | b3 | b4
   
where we have depicted the four bytes as ranging from most significant
b1 to least significant b4.
The exact quotient of this value divided by 256 is the right shift of
these by one byte (padded by zero at "left"):
    W/256:     0 | b1 | b2 | b3
Also b4 is the exact remainder after division of W by 256.
How good is an approximation is this to the quotient when dividing by
255?  
First, 255 is less than 256, so the quotient from dividing by 255
is always greater than or equal to the quotient from dividing by 256.
So the shift-by-one-byte approach underestimates the true value.
If we were doing the arithmetic as fractions (rational values):
    (W/255) - (W/256) = W/(255*256)
then the "error" of approximation would be precisely 1/256'th of the
exact answer W/255.  Since 1/256 is about 0.39%, it might be accurate
enough for some applications.
Our accuracy can be improved a good bit by doing one 32-bit addition:
    W/256:     0 | b1 | b2 | b3
            +  0 |  0 | b1 | b2
           ---------------------
     Qa:      c0   c1   c2   c3
Note that a carry will potentially produce a nonzero value in the most
significant byte c0 of value Qa.
Qa represents the sum of the quotients when W is divided by 256 (shift
by one byte) and by 256*256 (shift by two bytes).  This is still an
underestimate of the true quotient when dividing by 255, but:
    (W/255) - (W/256 + W/(256*256)) = (W/255) * (1/(256*256))
Since 1/(256*256) is about 1.53E-5, that one 32-bit addition reduces
the relative error by more than two orders of magnitude.
One way to think about this is that since the error of our initial
approximation was 1/256 of the exact answer, by adding 1/256 of the
approximate answer to itself, we eliminate most of the error.
In fact one can express the exact quotient by 255 by a sequence of
similarly repeated additions in "radix 256" arithmetic:
               0 | b1 | b2 | b3 # b4 |  0 |  0 ...
            +  0 |  0 | b1 | b2 # b3 | b4 |  0 ...
            +  0 |  0 |  0 | b1 # b2 | b3 | b4 ...
            +  0 |  0 |  0 |  0 # b1 | b2 | b3 ...
               ...
           -----------------------------------------
              d0 | d1 | d2 | d3 # d4 | d5 | d6 ...
where # denotes the "radix point" separating the whole number value
of the quotient from the fractional part (this corresponding to a
discrete remainder over 255).
Although this appears to require an infinite number of additions,
essentially evaluating a geometric series:
     W     W      W       W       W
    --- = --- + ----- + ----- + ----- + ...
    255   256    2^16    2^24    2^32
with a bit of cleverness and attention to details, the exact quotient
could actually be obtained in a finite number of steps.
*  *  *  *  *  *  *  *  *  *  *  *  *  *  *  *  *  *  *  *  *  *  *
Let's illustrate the ideas by translating them into examples where we
divide a four-digit (base ten) number by 9.
Of course the effect of shifting digits over once is dividing by 10.
Example:  Divide 1776 by 9.
=======
Since 1776 divided by 10 gives 177 (and remainder 6), we know that our
division by 9 should give a bigger quotient, and bigger actually by
about one tenth:
                  177
                 + 17
                 ----
                  194
This is still an underestimate of the true quotient when dividing by 9,
but it's pretty close.  
1776 divided by 9 gives quotient 197 with remainder 3.
To see how one might push these ideas to obtain the exact answer, let's
begin by doing the addition with preserving values to the right of the
decimal point:
                  177.6
                 + 17.76
                 -------
                  195.36
This approximation falls short of the true answer by 1/100'th, just as
the "first" approximation 177.6 falls short by 1/10'th.  So the next
correction is to add to itself the value shifted over by two decimal
places:
                  195.36
                 +  1.9536
                 ---------
                  197.3136
If we do one more addition, this time shifting the value over by four
decimal places before adding:
                  197.3136
                 +   .01973136
                 -------------
                  197.33333136
it becomes clear that the limit is 197 and one third.
The size of the shift and number of correct digits doubles with each
iteration, as this is a disguised "Newton" iteration for 1/255.
*  *  *  *  *  *  *  *  *  *  *  *  *  *  *  *  *  *  *  *  *  *  *
Some remarks on an implementation in Intel-like machine instruction
will compare the integer division and integer addition approaches.
Before the introduction of the Pentium line of microprocessors, it
appeared that Intel was willing to sacrifice valuable space on the
chip's surface to lower the time needed for an integer divide.
However current designs reflect different priorities, and general
division instructions (versus divisions by powers of two, say) are
somewhat regarded as rarely used and hence of little importance to
overall optimization.
For some general remarks on the integer division operations, by
operand size and signed vs. unsigned, see midway down on this page:
[Pentium Discussion on Special Instructions -- GameDev.net]
http://www.gamedev.net/reference/articles/article208.asp
Keeping up with the instruction timings is complicated by various
model, pipeline, and address decoding issues, and no detailed 
analysis is attempted.  However as a proxy for such considerations
we will reference the following technical document by AMD:
[AMD Athlon Processor x86 Code Optimization Guide]
http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/22007.pdf
and esp. Appendix F, "Instruction Dispatch and Execution Resources/Timing".
The most natural way to do the division is with a DIV instruction:
[DIV -- Unsigned Divide]
http://www.cs.ucla.edu/~kohler/class/04f-aos/ref/i386/DIV.htm
Placing the 32-bit dividend into EAX, with EDX set to zero, you can
divide it by a register containing 255, say EBX.  Then you get the
quotient in EDX (and remainder in EAX).
The operation requires about 40 clocks, which is a pretty hefty cost,
but there is no prospect of overflow.
One gimmick to keep in mind is the option of using a floating-point
operation to get the desired quotient.  The floating-point operation
can be executed in parallel with an integer divide, so with careful
code there is some prospect here of effectively cutting the time by
maybe half.
Although I won't go into any level of detail, the ADD instruction is
so much faster that the amount of time spent moving data into or out
of registers becomes more significant.  If one intends to do just the
single 32-bit ADD, then it probably doesn't matter too much whether
the ADD or the LOAD will bear the penalty for the unaligned operand.
Let us suppose for definiteness sake that the original 32-bit dividend
is aligned on a word boundary (even address), say:
    W :       b1 | b2 | b3 | b4  [address of W = mem]
and that the bytes "above" mem + 4 are set to zero.  Then we can load
EAX with the shifted 32-bit value:
    W/256:     0 | b1 | b2 | b3  [address is mem + 1]
with a MOV instruction:
[MOV -- Move Data]
http://www.cs.ucla.edu/~kohler/class/04f-aos/ref/i386/MOV.htm
at a cost of probably 4 clocks, including the alignment penalty.
This "three byte" 32-bit value can then be added to the zero-padded
upper word of W in memory:
    W/256^2:   0 |  0 | b1 | b2  [address is mem + 2]
without any (additional) alignment penalty in another 4 clocks.
[ADD -- Add]
http://www.cs.ucla.edu/~kohler/class/04f-aos/ref/i386/ADD.htm
We now have the (approximate) quotient stored in memory at mem + 2.
This net of 8 clocks avoids the overhead involved in positioning
and disposing of the dividend/quotient data, but of course that is
something any algorithm/application code will have to deal with.
It certainly makes for a favorable comparison with the integer
divide instruction, certainly if the slight loss of precision is
of no consequence.  Furthermore the current chip architectures
seem to boast more than one integer execution unit, and if anything
the prospect for overlapped execution of the MOV/ADD operations
outlined above are better than for the overlapping of integer and
floating-point divisions.
regards, mathtalk-ga |