John West wrote:
Code:
x % 120= d*2^9 + c*2^8 + b*2^7 + a
(d, c, b are 0 or 1. a is between 0 and 63)
= (d*(2^9 % 120) + c*(2^8 % 120) + b*(2^7 % 120) + a%120) % 120
= (d*32 + c*16 + b*8 + a) % 120
= ((x>>4)&56 + x&127) % 120
Since this hasn't been elaborated upon yet, here's what's happening. First, there is a minor typo -- "a" ranges from 0 to 127, rather than 0 to 63. Anyway, to compute:
result = x % 120
where
x = u * 128 + a
i.e. a is the lower 7 bits of x and u is the upper bits.
result = (u * 128 + a) % 120
which is the same as:
result = (u * (120 + 8) + a) % 120
which is the same as:
result = (u * 120 + u * 8 + a) % 120
since u * 120 is a multiple of 120 it won't affect the result, so that term can be discarded, leaving:
result = (u * 8 + a) % 120
So far, none of this depends on how many bits u consists of. In the example, x ranged from 0 to 1023, so u is 3 bits.
u = (x >> 7) & 7, so u * 8 is the same as (x >> 4) & (7 << 3), and a = (x & 127), so the formula can be written as:
result = ((x >> 4) & 56 + x & 127) % 120
since u can be at most 7 and a can be at most 127, u * 8 + a can be at most 183, which is less than 2 * 120, so only one comparison to 120 (and subtracting 120 if greater than or equal) is needed to compute the final % 120. In other words, the calculation can be performed in two steps:
1. result = (x >> 4) & 56 + x & 127
2. if (result >= 120) result = result - 120
This is much, much faster than a general purpose shift-compare-subtract division routine.
Different moduli give different optimized formulas though, some of which are smoother than others. (This is analogous to multiplication for a specific multiplier, where e.g. 10 * x can be quickly computed as x * 8 + x * 2 (or usually, as (x * 4 + x) * 2), and 255 * x = 256 * x - x, but 163 * x (163 = $A3), while not bad, isn't as smooth a case as the other two.) For example, (hi * 256 + lo) % 255, where hi and lo both range from 0 to 255, is:
result = (256 * hi + lo) % 255
result = (255 * hi + hi + lo) % 255
result = (hi + lo) % 255
hi + lo can be at most 510, which isn't less than 2 * 255 ... but, if the sum is split up as follows:
hi + lo = sumh * 256 + suml
where sumh is the upper bit (i.e. bit 8) of the sum and suml is the lower 8 bits, so:
result = (sumh * 256 + suml) % 255
result = (sumh * 255 + sumh + suml) % 255
result = (sumh + suml) % 255
sumh + suml will be at most 255 (since 510 is the maximum sum, when sumh is 1, suml can be at most 254) and this is less than 2 * 255, so only a single compare and subtract if greater than or equal to is necessary. In 6502 assembly, this is:
Code:
CLC
LDA HI
ADC LO
ADC #0 ; add the high bit of sum
CMP #$FF ; subtract 255 if greater than 255
BCC SKIP
SBC #$FF ; note that carry known to be set beforehand
SKIP
In fact, this can be simplified even further. Consider the result of SBC #$FF, which is A - $FF if the carry was set (beforehand), and A - $FF - 1 if the carry was clear; A - $FF - 1 is A - $100, which leaves A the same and clears the carry. So the BCC is completely unnecessary, and can be discarded, leaving:
Code:
CLC
LDA HI
ADC LO
ADC #0 ; add the high bit of sum
CMP #$FF ; subtract 255 if greater than 255
SBC #$FF
It seems like it ought to be possible to optimize that even further, especially since the carry out of the ADC #0 is not used by the subsequent CMP #$FF instruction. In fact it is possible. After the ADC LO instruction, the accumulator and carry will be:
Code:
1. A = 0 to 254, C = 0 if HI + LO was 0 to 254
2. A = 255, C = 0 if HI + LO was 255
3. A = 0 to 253, C = 1 if HI + LO was 256 to 509
4. A = 254, C = 1 if HI + LO was 510
If the ADC LO instruction is followed by a ADC #1 instruction, A and C (after the ADC #1 instruction) will be:
Code:
1. A = 1 to 255, C = 0 if HI + LO was 0 to 254
2. A = 0, C = 1 if HI + LO was 255
3. A = 2 to 255, C = 0 if HI + LO was 256 to 509
4. A = 0, C = 1 if HI + LO was 510
So all that is necessary is to subtract 1 from A in cases 1 and 3, or in other words, subtract 1 when the carry is 0. This can be done with a SBC #0 instruction, which subtracts 0 when C = 1 and subtracts 1 when C = 0. So the code becomes:
Code:
CLC
LDA HI
ADC LO
ADC #1
SBC #0
That's a 16-bit mod 255 calculation in only 5 instructions, without a loop!
Another example is mod 65521 (used in Adler-32 checksum calculations). For a * 65536 + b (where b ranges from 0 to 65535):
result = (a * 65536 + b) % 65521
result = (a * 65521 + 15 * a + b) % 65521
result = (15 * a + b) % 65521
result = (16 * a - a + b) % 65521
which is a faster computation, and again, the lower 16 bits of 16 * a - a + b can be split off to reduce it further.
Basically, this mod optimization technique is the same principle as the old "a number is divisible by 9 if its digits sum to 9" trick.