6502.org

Posted: **Sat Aug 25, 2012 3:53 pm**

I wanted to follow up on this topic, and opted for a picture rather than 1,000 words. Or would that be 1024 words..??

Anyway, for clarity (and as a tutorial in case anyone's not 100% up to speed) here's an animation of the three multiplication routines mentioned earlier in this thread:.

All three routines operate by testing one bit at a time and, if the bit is a 1, performing an addition. The values involved are:

16-bit Multiplicand "a" (the one that remains static)
16-bit Multiplicand "b" (the one that gets shifted and tested)
the 32-bit result -- the Product

I have labelled the locations (eg: 0,X; eg: A) where the six bytes of data are held. For example: Garth posted the FIG multiply routine -- and the animation's upper panel is your visual reference when reading that piece of code. Likewise the animation's middle panel illustrates the Right-shifting algorithm dclxvi posted (the basic version), and the bottom panel hopefully clarifies my "Modified Right-shift" approach.

(So, if there are two 16-bit and one 32-bit values involved, why do the diagrams show 6 bytes of storage rather than 8? Folks familiar with this sort of code will recognize an ubiquitous optimization. The 32-bit result begins as a 16-bit result, gradually expanding as the routine executes. At the same time the bit-tested multiplicand shrinks -- it can gradually be discarded -- as the routine executes. The upshot is that a 16-bit portion of these two values can cohabitate in the same storage! This boosts speed. Both values can be shifted at once, and they still maintain their individual functions. BTW if you find all this shifting bewildering, it'll help to think of how the columns need to be skewed when you do multiplication with a pencil and paper. 'Nuff said -- please refer to the code!)

The other day I ran these routines on actual 65xx hardware, bracketed with code snippets to read a 6522 timer and derive the execution times. The graph below presents these figures; also the memory requirements. Not surprisingly, there's a clear tradeoff between memory usage and execution speed. The graph lists five versions of Forth's U* (unsigned multiply):

the FIG code Garth posted (bug-fix included)
the Right-Shift version dclxvi posted
the modified Right-Shift version I posted
another modified Right-Shift version (not posted) with the inner loop partially unrolled (2 bits at a time, 4 iterations)
another modified Right-Shift version (not posted) with no inner loop (inline code for all 8 bits)

Each code version has its execution time listed for three multiplication tests: 1 times 1, 5555h times 5555h, and FFFFh times FFFFh. Regarding execution time, all versions except FIG respond to the number of ones in multiplicand "b," and multiplicand "a" is irrelevant. The Right-Shift version takes 21~ extra for each additional one, and the modified Right-Shift versions takes 17~ extra (15~ extra for the fully-unrolled version, which uses Y in place of N+1). The FIG version takes 33~ extra for each additional one and (by inspecting the code) seems to also delay nonlinearly according to the value of multiplicand "a"; this is due to the potential carry that may or may not occur, propagating "backwards" into bits that've already been shifted.

Your choice of the "best" routine will vary, of course, depending on how much of a premium you place on speed (versus memory consumption). The basic Right-Shift version is compact and succinct, whereas the fully-unrolled modified Right-Shift is plainly at the point of diminishing returns (burning memory to scrounge every possible cycle, I mean). It's not a criticism; in certain circumstances profligate memory usage is perfectly justifiable. The next step of course would be to adopt a Lookup Table approach: Garth explores this powerful concept here.

-- Jeff

Posted: **Sat Aug 25, 2012 4:26 pm**

When you list memory usage, you're counting the size of the actual routine, not the working set of the problem, correct?

Posted: **Sat Aug 25, 2012 4:54 pm**

Wow, thankyou Jeff!

Posted: **Sat Aug 25, 2012 7:03 pm**

Just a thought: Booth's method allows for a little less work on average, by opting for subtraction instead of addition according to a cunning plan. The state of play could be encoded in the PC by having two branches to the code (no need to explicitly hold two bits of operand to make a decision)

Posted: **Sun Aug 26, 2012 6:02 pm**

Cheers, fellas. And thanks for the suggestion, Ed. I wasn't familiar with Booth's method, but eventually it began to make sense, and I scribbled out a flow chart in preparation to start coding. (Your "two branches to the code" idea works well here.) But on further reflection I'm not sure I understand your assertion that, "Booth's method allows for a little less work on average."

As I understand the matter (and sticking with the example of 16 bit operands), the worst case example for Booth's method will require 8 additions and 8 subtractions -- scarcely any faster or slower than 16 additions (as required by the worst case for the more conventional algorithms above). Likewise the speed in the best cases is the same. But, instead of the worst-case operand being FFFFh (or any number with lots of 1 bits), with Booth's method the worst case operand is something like 5555h or AAAAh -- where there are lots of 1-to-0 or 0-to-1 transitions between adjacent bits. So, how common are these transitions? By my calculations (I wrote a little Forth program

) if you look at each number from 0 to FFFFh and count the transitions seen by Booth in each, the total number is 557,056. This is slightly worse than 524,288 -- the number of 1 bits in those same 65,536 16-bit numbers. I suppose this might be a flaw in my notion of the algorithm -- and unless I code and successfully test it I won't know for sure...

BTW in the context of Forth and U* is it reasonable to optimize our coding based on the assumption that we have a random series of operands -- that all of the 64K possible values are equally probable? I'm skeptical of that assumption, but OTOH haven't come up with anything that's clearly superior. Really we're just guessing what sort of math the Forth user is coding, and what sort of numbers are likely to come up.

Comments, anyone?

-- Jeff
ps- (for Will): when I mentioned memory usage I was refering to program space: the amount of storage occupied by the instructions.
pps- I wonder if Booth's method only shows a performance edge for numbers much longer than 16 bits...

Posted: **Sun Aug 26, 2012 6:22 pm**

Hi Jeff
that doesn't sound so positive, but I think it must be a flawed analysis. Looking for example at some short strings in the middle of a word of zeros:

Code: Select all

00000000 - no action
00000100 - single addition
00001000 - ditto
00001100 - one addition, one subtraction
00010000 - single add
00010100 - two additions
00011000 - one addition, one subtraction
00011100 - ditto (saving one op)
00100000 - single addition
00100100 - two additions
00101000 - ditto
00101100 - 3 ops: two additions and a subtraction
00110000 - one addition, one subtraction
00110100 - 3 ops: two additions and a subtraction
00111000 - one addition, one subtraction (saving one op)
00111100 - ditto (saving two ops)

Overall, we saved 4 operations (from an original 32.) Which isn't a huge proportion. But, how would your counting have analysed these?
Cheers
Ed

Posted: **Sun Aug 26, 2012 6:34 pm**

Code: Select all

00000100 - single addition

I see this (and similar cases) as 2 transitions. It contains the string 10 and the string 01. So, one subtraction and one addition, respectively (if we're shifting to the right). Have I misinterpreted the method?

Posted: **Sun Aug 26, 2012 6:39 pm**

ok, that would explain the pessimistic result. It's clear that a single addition is enough: can it be arranged? Material explaining radix-4 Booth's tells us to look at overlapping sets of 3 bits. That's more obvious for hardware, especially where a shift is free (no carry chain) and the overall result is half the number of operations. In our case, shift is just as expensive as add, unless we were going to do it anyway.
Maybe this idea won't fly - a saving of one-eighth isn't much.

Posted: **Sun Aug 26, 2012 7:17 pm**

Looking at Table 1 of http://www.geoffknagge.com/fyp/booth.shtml, if we take an 8-way branch on three bits of the operand, we can arrange for a single op (or no op) to cater for 2 bits.

If we can arrange to do everything from MSB first, that might be even easier
- the top bit is shifted out, and we BCC/BCS to the add or to the subtract branch of the code
- we shift once more
- a combination of BMI and BCS takes us to the four possibilities: an add then shift, a shift then add, or just a shift.

I am hand-waving rather than coding, but it's a saving of a quarter in additions. Would be more worthwhile for a 4-byte routine of course because the additions are more expensive. For 16-bit routines, we'd normally have an average of 8 additions, and we get that down to 6. Is that going to cover the overhead and make it worthwhile?

Cheers
Ed

Posted: **Sun Aug 26, 2012 7:32 pm**

OK, you're getting ahead of me -- I'm not that quick. But definitely you are onto something. My approach was based on the Wikipedia article, under the section "A typical implementation." Their version only looks at two bits at a time. It works -- and is simple enough to make a good tutorial -- but, as you noted, it makes dumb decisions in certain cases. Three bits at a time would be better -- and even eight bits at a time may not be entirely ludicrous! It's all about anayzing tradeoffs... If I get time I'll have a go at coding a three-at-a-time version. Edit: Or at least predict the benefit, if any. Thanks for the info, Ed

J.

Posted: **Wed Aug 29, 2012 5:13 am**

Not meaning to distract you, but I did some digging. I knew vaguely that there's a trick to long multiplication which uses fewer digit multiplications than you'd expect, and I wondered if a fast 4x4 or 8x8 multiplication could be used with this trick to do a fast 16x16. It turns out the trick is by Karatsuba (and was revolutionary): unfortunately it seems to demand one extra bit, so you end up needing 5x5 or 9x9 as a primitive: not so handy. Instead of doing that, and needing only 3 digit-sized multiplications but having to deal with 9x9, might be better to use 4 digit-sized multiplications in the obvious way.

How to get a fast 8x8? It turns out that there's a very neat Quarter-Square method. For the squares, you need a lookup table: seems to need 1k or 2k of table. (Would be much smaller if we proceed by nibbles and use a 4x4 primitive, but then 16x16 would take 16 of those and it's not obvious that we'd be winning.)

Inevitably, people have already tackled this, so here are some links:
26 cycles (??) for 8x8 A Different Perspective, part II by Taylor and Judd (search for "code to multiply")
66 cycles for 8x8 Fastest 6502 Multiplication Yet in AAL by Charles Putney using 1k of tables
90 cycles for 8x8 Fast 6502 multiplication by damian yerrick
8x8 and 16x16 Seriously fast multiplication (8/16 bit signed/unsigned multiplication) - by JackAsser/Instinct (uses 2k of tables)
More multiplication routines at codebase64
Sam Falvo's post covering 6502 and 65816
8x8 Fast multiplication by Fox, originally from Polish disk magazine "Syzygy" no. 6
Shift and add method explained by Jim Butterfield

Cheers
Ed

Posted: **Wed Aug 29, 2012 7:15 am**

'kay, Ed -- I'll start checking all that out in the morning. I've taken a new tack with my own doodlings, and it'll be interesting to compare my efforts with what's out already there. As you say, many minds have already addressed the problem. I'll be happy if I can simply add a useful selection point on the speed-vs-memory tradeoff.

in your PREVIOUS post you wrote:

if we take an 8-way branch on three bits [...] For 16-bit routines, we'd normally have an average of 8 additions, and we get that down to 6. Is that going to cover the overhead and make it worthwhile?

Either there's a minor slip-up here or else I misunderstand your drift. Is it Radix 8 you were assuming? Although 6 out of 8 possible bit-patterns result in an operation (ie, an add or subtract), you mustn't suppose all patterns are equally probable. If we test the numbers 0 to FFFFh for pattern matching we see...

557,056 operations result using Radix 4, where 2 out of 4 patterns operate
425,984 operations result using Radix 8, where 6 out of 8 patterns operate
335,872 operations result using Radix 16 where 14 out of 16 patterns operate
278,528 operations result using Radix 32 where 30 out of 32 patterns operate

The explanation mostly has to do with the fact that Radix 32 (for example) shifts the number 4 bits at a time (compared to 3, 2, or 1 bit at a time), but there's more to it than that. BTW the Wikipedia article is flawed in that it scarcely even mentions radix -- you're out of luck unless you spot the external links.

For a while I was playing with a radix 32 scheme, but the cost in (zero-page) ram was gonna be pretty high, and that's much less acceptable than simply burning up a bunch of EPROM space.

Quote:

you need a lookup table: seems to need 1k or 2k of table

Lookups are definitely the secret weapon of choice when you need to nuke a speed problem! I've chosen a strategic target, though, and that is shifting -- which is the main bottleneck if you look at the graph I posted. (For the multiply-by-one cases, addition is pretty well out of the picture, and most of the remaining overhead is shifting.) The scheme I'm working on shifts 4 bits at a time, using two 256-byte Lookup Tables to keep things snappy. In the setup it retains the "times 1" multiplicand, of course, and it also prepares a times 2, 4 and 8 version. But that means most of the additions end up spanning 3 bytes instead of two, and it remains to be seen how much the accelerated shifting outweighs that.

cheers

-- Jeff

Posted: **Wed Aug 29, 2012 8:53 pm**

I've been thinking about 4-bit shifts by lookup table, so it's good to hear about that!

I don't disbelieve your numbers, but I can't quite see why the patterns shouldn't be evenly distributed. (Is it because of the overlap??)

In that Table 1, it's a Radix 4 approach, but there are 8 patterns because of the need to include the leading bit. 6 of those 8 do some work, and as it happens each does the same amount of work, a single operation, for an average of 0.75 ops per two bits. The two bits we've dealt with would conventionally have cost us an average of one op.

So, in Radix 8 we should have 16 patterns, and again all but 2 will need to do some work, but in this case sometimes we need more than one operation. It seems(*) to take 9 ops per 3 bits consumed, compared to 12 for longhand multiplication. That is, no gain... hmm. (In hardware, it's only 7 ops because plus or minus 3 is arranged to done in a single cycle. In software plus or minus 3 takes two ops.)

At least one of us is missing something! (The material in https://www.google.co.uk/search?q=%22Ra ... gorithm%22 seems helpful.)

Cheers
Ed
(*) See for example page 2 of http://www.ijvspa.net/docs/IJVSPA20120215.pdf

Posted: **Wed Aug 29, 2012 10:36 pm**

Quote:

So, in Radix 8 we should have 16 patterns, and again all but 2 will need to do some work, but in this case sometimes we need more than one operation. [...] (In hardware, it's only 7 ops because plus or minus 3 is arranged to done in a single cycle. In software plus or minus 3 takes two ops.)

I'm a newcomer to the topic of Booth's method, and readily admit I may be missing something. But my conception of a software implementation is that it would match what you seem to be saying about a hardware implementation -- that each pattern will correspond to zero or one op, not more. This assumption is implicit in the numbers I presented.

If all patterns must process with one op maximum then this demands some pre-processing (some addition that takes place in advance) unless the radix only relies on powers of two (as with radix 4 and radix 2). As noted, radix 8 needs to have 3 times the multiplicand available -- as a single entity, not as a postponed obligation to add 1+2. The doc on the hardware models is pretty dense and hard to take in, but a quick perusal left me with the understanding that they use enough preliminary logic (including addition) that eventually there's a multiplexer which can instantly choose 3 times the multiplicand just as easily as any of the simpler numbers (ie, the powers of two) -- is that a reasonable way of putting it?

I left off working on the Booth radix 16 code because my understanding is I'd need to compute 3, 5, 6 and 7 times the multiplicand in advance, and buffer these values in ram so that -- as in hardware -- they are available on an equal footing compared with power-of-two numbers. But in software that wasn't looking very attractive, due to the ram required (I would use zero-page) and due to the delay of pre-processing -- although the latter may be recouped, perhaps many times over, when multiplying very large numbers. For a 16 by 16 multiply on 6502 it didn't seem worth pursuing. I switched to a scheme that does shifting in advance (ie, the powers of two) but no addition in advance. So I buffer 1, 2, 4 and 8 times the multiplicand in ram but not 3, 5, 6 or 7. It'll be at least a few days before I can find the time to finish this and report back.

Cheers, Ed, and thanks for the never-ending supply of links

Jeff

Posted: **Thu Aug 30, 2012 12:58 am**

Jeff:

I developed a hardware Booth multiplier some time ago. My reference for the algorithm was "Computer Organization", Hamacher et al, McGraw-Hill Book Company, New York, NY, 1978, ISBN: 0-07-025681-0. It does a very nice job of laying out the algorithm for 1 bit at a time algorithm. Like you, some additional research on the web let me extend the Booth algorithm for any number of bits. I specifically put three Booth multiplier modules together: 1x, 2x, and 4x. The following two excerpts may be of help with regard to the Booth "recoding" tables that will feed a look-up table:

(Booth 2 bits at a time Multiplier)

Code: Select all

// Additional Comments:
//
//  The basic operations follow those of the standard Booth multiplier except
//  that the transitions are being tracked across 2 bits plus the guard bit.
//  The result is that the operations required are 0, ±1, and ±2 times the 
//  multiplicand (M). That is:
//  
//  Prod[2:0]   Operation
//     000      Prod <= (Prod + 0*M) >> 2;
//     001      Prod <= (Prod + 1*M) >> 2;
//     010      Prod <= (Prod + 1*M) >> 2;
//     011      Prod <= (Prod + 2*M) >> 2;
//     100      Prod <= (Prod - 2*M) >> 2;
//     101      Prod <= (Prod - 1*M) >> 2;
//     110      Prod <= (Prod - 1*M) >> 2;
//     111      Prod <= (Prod - 0*M) >> 2;
//
//  The operations in this table can be seen as direct extensions of the four
//  conditions used in the standard Booth algorithm. The first and last terms
//  simply indicate to skip over runs of 1s and 0s. The terms 001 and 110 are
//  indicative of the 01 (+1) and 10 (-1) operations of the standard Booth al-
//  gorithm. The terms 010 (+2,-1) and 101 (-2,+1) reduce to the operations
//  noted in the table, namely ±1*M, respectively. The terms 011 (+2,0) and
//  100 (-2, 0) are the two new operations required for this modified Booth
//  algorithm.
//
//  The algorithm could be extended to any number of bits as required by noting
//  that as more bits are added to the left, the number of terms (operations)
//  required increases. Addding another bit, i.e. a 3-bit Booth recoding, means
//  that the operations required of the adder/partial product are 0, ±1, ±2,
//  and ±4. The guard bits provided for the sign bit accomodates the ±2 opera-
//  tion required for the 2-bit modified booth algorithm. Adding a third bit
//  means that a third guard bit will be required for the sign bit to insure
//  that there is no overflow in the partial product when ±4 is the operation.
//

(Booth 4 bits at a time multiplier)

Code: Select all

// Additional Comments:
//
//  The basic operations follow those of the standard Booth multiplier except
//  that the transitions are being tracked across 4 bits plus the guard bit.
//  The result is that the operations required are 0, ±1, ±2, ±3, ±4, ±5, ±6,
//  ±7, and ±8 times the multiplicand (M). However, it is possible to reduce 
//  the number of partial products required to implement the multiplication to
//  two. That is, ±3, ±5, ±6, and ±7 can be written in terms of combinations of
//  ±1, ±2, ±4, and ±8. For example, 3M = (2M + 1M), 5M = (4M + M), 6M = (4M
//  + 2M), and 7M = (8M - M). Thus, the following 32 entry table defines the
//  operations required for generating the partial products through each pass
//  of the algorithm over the multiplier:
//  
//  Prod[4:0]       Operation
//    00000      Prod <= (Prod + 0*M + 0*M) >> 4;
//    00001      Prod <= (Prod + 0*M + 1*M) >> 4;
//    00010      Prod <= (Prod + 0*M + 1*M) >> 4;
//    00011      Prod <= (Prod + 2*M + 0*M) >> 4;
//    00100      Prod <= (Prod + 2*M + 0*M) >> 4;
//    00101      Prod <= (Prod + 2*M + 1*M) >> 4;
//    00110      Prod <= (Prod + 2*M + 1*M) >> 4;
//    00111      Prod <= (Prod + 4*M + 0*M) >> 4;
//    01000      Prod <= (Prod + 4*M + 0*M) >> 4;
//    01001      Prod <= (Prod + 4*M + 1*M) >> 4;
//    01010      Prod <= (Prod + 4*M + 1*M) >> 4;
//    01011      Prod <= (Prod + 4*M + 2*M) >> 4;
//    01100      Prod <= (Prod + 4*M + 2*M) >> 4;
//    01101      Prod <= (Prod + 8*M - 1*M) >> 4;
//    01110      Prod <= (Prod + 8*M - 1*M) >> 4;
//    01111      Prod <= (Prod + 8*M + 0*M) >> 4;
//    10000      Prod <= (Prod - 8*M - 0*M) >> 4;
//    10001      Prod <= (Prod - 8*M + 1*M) >> 4;
//    10010      Prod <= (Prod - 8*M + 1*M) >> 4;
//    10011      Prod <= (Prod - 4*M - 2*M) >> 4;
//    10100      Prod <= (Prod - 4*M - 2*M) >> 4;
//    10101      Prod <= (Prod - 4*M - 1*M) >> 4;
//    10110      Prod <= (Prod - 4*M - 1*M) >> 4;
//    10111      Prod <= (Prod - 4*M - 0*M) >> 4;
//    11000      Prod <= (Prod - 4*M - 0*M) >> 4;
//    11001      Prod <= (Prod - 2*M - 1*M) >> 4;
//    11010      Prod <= (Prod - 2*M - 1*M) >> 4;
//    11011      Prod <= (Prod - 2*M - 0*M) >> 4;
//    11100      Prod <= (Prod - 2*M - 0*M) >> 4;
//    11101      Prod <= (Prod - 0*M - 1*M) >> 4;
//    11110      Prod <= (Prod - 0*M - 1*M) >> 4;
//    11111      Prod <= (Prod - 0*M - 0*M) >> 4;
//

I've successfully implemented these multipliers in a signal processing application. I believe that the recoding tables provided above are accurate.

6502.org

UM* (multiplication) bug in common 6502 Forths

Re: UM* (multiplication) bug in common 6502 Forths

Re: UM* (multiplication) bug in common 6502 Forths

Re: UM* (multiplication) bug in common 6502 Forths

Re: UM* (multiplication) bug in common 6502 Forths

Re: UM* (multiplication) bug in common 6502 Forths

Re: UM* (multiplication) bug in common 6502 Forths

Re: UM* (multiplication) bug in common 6502 Forths

Re: UM* (multiplication) bug in common 6502 Forths

Re: UM* (multiplication) bug in common 6502 Forths

Re: UM* (multiplication) bug in common 6502 Forths

Re: UM* (multiplication) bug in common 6502 Forths

Re: UM* (multiplication) bug in common 6502 Forths

Re: UM* (multiplication) bug in common 6502 Forths

Re: UM* (multiplication) bug in common 6502 Forths

Re: UM* (multiplication) bug in common 6502 Forths