Re: UM* (multiplication) bug in common 6502 Forths
Posted: Sat Aug 25, 2012 3:53 pm
I wanted to follow up on this topic, and opted for a picture rather than 1,000 words. Or would that be 1024 words..??
Anyway, for clarity (and as a tutorial in case anyone's not 100% up to speed) here's an animation of the three multiplication routines mentioned earlier in this thread:.All three routines operate by testing one bit at a time and, if the bit is a 1, performing an addition. The values involved are:
(So, if there are two 16-bit and one 32-bit values involved, why do the diagrams show 6 bytes of storage rather than 8? Folks familiar with this sort of code will recognize an ubiquitous optimization. The 32-bit result begins as a 16-bit result, gradually expanding as the routine executes. At the same time the bit-tested multiplicand shrinks -- it can gradually be discarded -- as the routine executes. The upshot is that a 16-bit portion of these two values can cohabitate in the same storage! This boosts speed. Both values can be shifted at once, and they still maintain their individual functions. BTW if you find all this shifting bewildering, it'll help to think of how the columns need to be skewed when you do multiplication with a pencil and paper. 'Nuff said -- please refer to the code!)
The other day I ran these routines on actual 65xx hardware, bracketed with code snippets to read a 6522 timer and derive the execution times. The graph below presents these figures; also the memory requirements. Not surprisingly, there's a clear tradeoff between memory usage and execution speed. The graph lists five versions of Forth's U* (unsigned multiply):
Your choice of the "best" routine will vary, of course, depending on how much of a premium you place on speed (versus memory consumption). The basic Right-Shift version is compact and succinct, whereas the fully-unrolled modified Right-Shift is plainly at the point of diminishing returns (burning memory to scrounge every possible cycle, I mean). It's not a criticism; in certain circumstances profligate memory usage is perfectly justifiable. The next step of course would be to adopt a Lookup Table approach: Garth explores this powerful concept here.
-- Jeff
- 16-bit Multiplicand "a" (the one that remains static)
- 16-bit Multiplicand "b" (the one that gets shifted and tested)
- the 32-bit result -- the Product
(So, if there are two 16-bit and one 32-bit values involved, why do the diagrams show 6 bytes of storage rather than 8? Folks familiar with this sort of code will recognize an ubiquitous optimization. The 32-bit result begins as a 16-bit result, gradually expanding as the routine executes. At the same time the bit-tested multiplicand shrinks -- it can gradually be discarded -- as the routine executes. The upshot is that a 16-bit portion of these two values can cohabitate in the same storage! This boosts speed. Both values can be shifted at once, and they still maintain their individual functions. BTW if you find all this shifting bewildering, it'll help to think of how the columns need to be skewed when you do multiplication with a pencil and paper. 'Nuff said -- please refer to the code!)
The other day I ran these routines on actual 65xx hardware, bracketed with code snippets to read a 6522 timer and derive the execution times. The graph below presents these figures; also the memory requirements. Not surprisingly, there's a clear tradeoff between memory usage and execution speed. The graph lists five versions of Forth's U* (unsigned multiply):
- the FIG code Garth posted (bug-fix included)
- the Right-Shift version dclxvi posted
- the modified Right-Shift version I posted
- another modified Right-Shift version (not posted) with the inner loop partially unrolled (2 bits at a time, 4 iterations)
- another modified Right-Shift version (not posted) with no inner loop (inline code for all 8 bits)
Your choice of the "best" routine will vary, of course, depending on how much of a premium you place on speed (versus memory consumption). The basic Right-Shift version is compact and succinct, whereas the fully-unrolled modified Right-Shift is plainly at the point of diminishing returns (burning memory to scrounge every possible cycle, I mean). It's not a criticism; in certain circumstances profligate memory usage is perfectly justifiable. The next step of course would be to adopt a Lookup Table approach: Garth explores this powerful concept here.
-- Jeff