Comparing 6502 multiply routines

TobyLobster · Post by **TobyLobster** » Mon Jan 02, 2023 11:39 pm

I have exhaustively tested and compared the performance and memory use of over 60 integer multiplication routines from the internet. (EDIT: Now over 120 routines)

I also explore the algorithms and techniques behind them:

https://github.com/TobyLobster/multiply_test

There was more to say on this subject than I thought when I started, so it was interesting to explore. If there are other good or interesting multiply routines out there let me know!

BigDumbDinosaur · Post by **BigDumbDinosaur** » Tue Jan 03, 2023 2:03 am

TobyLobster wrote:

I have exhaustively tested and compared the performance and memory use of over 60 integer multiplication routines from the internet.

I also explore the algorithms and techniques behind them:

https://github.com/TobyLobster/multiply_test

There was more to say on this subject than I thought when I started, so it was interesting to explore. If there are other good or interesting multiply routines out there let me know!

It might be better if you post them here as a message attachment. Aside from the possibility that your link could go 404 one day, some of us don't do Github.

GARTHWILSON · Post by **GARTHWILSON** » Tue Jan 03, 2023 2:58 am

BigDumbDinosaur wrote:

It might be better if you post them here as a message attachment. Aside from the possibility that your link could go 404 one day, some of us don't do Github.

Attaching it would be excellent, if it's practical. One of my problems with github is that I can never can find what I'm looking for on github when someone gives us a link to their repo; but in this case, your detailed summary on the front page was excellent. I added the link at the end of the UM* topic in the Forth section of the forum at viewtopic.php?f=9&t=689 . I also put it on my links page.

BigEd · Post by **BigEd** » Tue Jan 03, 2023 9:08 am

Well done TobyLobster - a tour de force!

(All the static parts of 6502.org are mastered on github... in that sense, we all use it. But way to derail an interesting offering...)

teamtempest · Post by **teamtempest** » Tue Jan 03, 2023 3:50 pm

That is a lot of very interesting information. One piece of additional data that might be helpful is a 'mean' time in cycles as well as an 'average'.

All the 'shift-and-add' routines I looked at all used an index register to count loops. In that case, it seems to me a fairly simple optimization for the 16x16 multipliers would be to check if the high byte of the multiplier is zero. If so, nothing in it can affect the result, and so the loop counter could be reduced from 16 to 8. It's only a few bytes to check, and any multiplication by less than 256 would benefit. To go further, you could also check the multiplicand and swap if the multiplier was greater than 256 but the multiplicand was less (more code, of course). To go further, you could write 8x8, 8x16 and full 16x16 routines and branch to one of them based on examining the multiplicand and multiplier (more code, of course).

Instead of counting, another way to control the loop is to check if the multiplier has become zero after each iteration. In a 16x16 multiplier, OR-ing the high and low bytes of the multiplier together would terminate as soon as the highest set bit was shifted out. A multiplier of zero or one would terminate after one iteration with no special check required.

TobyLobster · Post by **TobyLobster** » Tue Jan 03, 2023 4:50 pm

teamtempest wrote:

That is a lot of very interesting information. One piece of additional data that might be helpful is a 'mean' time in cycles as well as an 'average'.

Just to be clear, the 'average' figures I give are the arithmetic mean: the sum of all cycle counts divided by the number of counts. The error bars shown in the graphs show this average as the dot between the minimum and maximum cycle counts. Many of the routines have a low variance in the min and max values (bars of small height). Those that don't is usually because they have special case code for handling zero.

teamtempest wrote:

All the 'shift-and-add' routines I looked at all used an index register to count loops. In that case, it seems to me a fairly simple optimization for the 16x16 multipliers would be to check if the high byte of the multiplier is zero. If so, nothing in it can affect the result, and so the loop counter could be reduced from 16 to 8. It's only a few bytes to check, and any multiplication by less than 256 would benefit. To go further, you could also check the multiplicand and swap if the multiplier was greater than 256 but the multiplicand was less (more code, of course). To go further, you could write 8x8, 8x16 and full 16x16 routines and branch to one of them based on examining the multiplicand and multiplier (more code, of course).

Instead of counting, another way to control the loop is to check if the multiplier has become zero after each iteration. In a 16x16 multiplier, OR-ing the high and low bytes of the multiplier together would terminate as soon as the highest set bit was shifted out. A multiplier of zero or one would terminate after one iteration with no special check required.

Good ideas. I wonder if the time to check for these cases is small enough to get win across all inputs? I'd be interested to see. The OR-ing trick is only done by omult1.a I think, but is it worth it? Certainly if you know in advance you are likely to be multiplying small numbers these will win.

barrym95838 · Post by **barrym95838** » Tue Jan 03, 2023 4:53 pm

teamtempest wrote:

Instead of counting, another way to control the loop is to check if the multiplier has become zero after each iteration. In a 16x16 multiplier, OR-ing the high and low bytes of the multiplier together would terminate as soon as the highest set bit was shifted out. A multiplier of zero or one would terminate after one iteration with no special check required.

That's how I did it in VTL02. It's very quick if the first term is small or the second term is a large-ish power of two, but probably suffers a bit with random inputs. I won't use tables in VTL02, but I might peek at a few of the smaller routines Toby has to see if they might be more suitable for an unsigned 16x16->16:

Code: Select all

;-----------------------------------------------------;
; var[x] = (var[x]*var[x+2])mod65536 (unsigned)
; uses:    plus:, {>}
; exit:    var[x+2] and {>} are modified
; 39 bytes
mul:
    lda  0,x        ;
    sta  gthan      ;
    lda  1,x        ; copy multiplicand to {>}
    sta  gthan+1    ;
    stz  0,x        ; zero the product to begin
    stz  1,x        ;
    .db  $cd        ; "cmp abs" naked op-code
mul1:
    rol  3,x        ;
mul2:
    lda  gthan      ;
    ora  gthan+1    ;
    beq  mulrts     ; exit early if multiplicand = 0
    lsr  gthan+1    ;
    ror  gthan      ; right-shift multiplicand
    bcc  mul3       ; check the bit shifted out
    jsr  plus       ; form the product in var[x]
mul3:
    asl  2,x        ; left-shift multiplier
    bne  mul1       ;
    rol  3,x        ;
    bne  mul2       ; loop until multiplier = 0
mulrts:
    rts             ;

teamtempest · Post by **teamtempest** » Tue Jan 03, 2023 7:37 pm

Quote:

One piece of additional data that might be helpful is a 'mean' time in cycles as well as an 'average'.

My bad. I meant to say 'median'. And maybe not the cycle time but what the actual operands are at that point (ie., do they have more than eight bits?).

BigEd · Post by **BigEd** » Tue Jan 03, 2023 7:59 pm

It's a tricky question, I think, because the distribution of input values will come into it. It may well be that in many applications small multipliers will be found often, but in many others they won't.

TobyLobster · Post by **TobyLobster** » Tue Jan 03, 2023 8:36 pm

barrym95838 wrote:

That's how I did it in VTL02

Interesting technique to index with '0,X' to allow any zero page variable. If speed is important it may actually be faster to copy those inputs into temporary zp storage to avoid losing a cycle on every access.

barrym95838 · Post by **barrym95838** » Tue Jan 03, 2023 9:15 pm

Well, I'm already blowing 12 cycles per iteration by not in-lining the add. The VTL02 "IDE" is a balancing act between small size and high performance, with a (not so) slight bias in favor of small size.

TobyLobster · Post by **TobyLobster** » Wed Jan 04, 2023 12:12 am

The idea of testing for an 8x16 bit multiply inside a 16x16 bit multiply was implemented back in 1980, in 'Micro - The 6502 Journal' magazine https://archive.org/details/micro-6502- ... 1/mode/2up (they also have a separate 8x8 multiply).

teamtempest · Post by **teamtempest** » Wed Jan 04, 2023 8:00 pm

Intuitively it makes sense to shortcut, but it may not be so in practice. Suppose we add a header to an 16x16 multiply routine like this:

Code: Select all

       ldx #8             ; assume 8 loops
       lda MPLIER+1    ; can we get away with that ?
       beq @1            ; b: yes
       ldx #16            ; no, full mutiply
@1

That seems simple enough, but throughout the full 16-bit range of operands will add 30,081,548,288 cycles (2^24 * 8 branch taken + (2^32 -2^24) * 7 branch not taken) to the overall execution time. On the other hand, it will only save 5,100,273,664 or so cycles overall (2^24 * (8 * 38)), depending on how long a 16x16 do-nothing loop takes. So a net slowdown.

Half measures may not do. Rather than try to optimize one loop, it may be best to split into multiple loops, as the article suggests.

TobyLobster · Post by **TobyLobster** » Thu Jan 05, 2023 8:01 pm

I added the Micro magazine listing that I linked to above into my tests including its branch to 8x16 multiply (mult48). It's not very competitive these days, coming in at 707 cycles average, but should be representative of a typical shift-and-add.

If I remove the supposed 'shortcut' to the 8x16 routine it comes in slightly faster at 703 cycles average (mult49). So the supposed shortcut is actually slightly counterproductive if all inputs are equally likely (as we suspected). However, if in the context of a hypothetical use case the 'acl/h' input number of this routine is limited to < 32768 (i.e. top bit clear) then the shortcut gives a minor improvement. The smaller the top of the range the better the improvement:

Code: Select all

(acl,ach) less than | average cycles
--------------------+---------------
65536 (no limit)    | 707    
32768               | 696
16384               | 685
8192                | 672
4096                | 656
2048                | 632
1024                | 594
512                 | 527
256                 | 404    (always uses the 8x16 code)

This is for unsigned numbers. (Negative numbers will have the top bit set, so don't get a benefit if the unsigned multiply is used before adjusting for signedness).
So a shortcut can be useful if one of the inputs has a smaller range.

TobyLobster · Post by **TobyLobster** » Fri Jan 06, 2023 4:23 pm

I've added some more good routines to the page, mostly 16 bit.

Comparing 6502 multiply routines

Comparing 6502 multiply routines

Re: Comparing 6502 multiply routines

Re: Comparing 6502 multiply routines

Re: Comparing 6502 multiply routines

Re: Comparing 6502 multiply routines

Re: Comparing 6502 multiply routines

Re: Comparing 6502 multiply routines

Re: Comparing 6502 multiply routines

Re: Comparing 6502 multiply routines

Re: Comparing 6502 multiply routines

Re: Comparing 6502 multiply routines

Re: Comparing 6502 multiply routines

Re: Comparing 6502 multiply routines

Re: Comparing 6502 multiply routines

Re: Comparing 6502 multiply routines