
Re: Comparing 6502 multiply routines

Posted: Sun Jan 08, 2023 3:31 am
by dmsc
Hi!
TobyLobster wrote:
I have exhaustively tested and compared the performance and memory use of over 60 integer multiplication routines from the internet.

I also explore the algorithms and techniques behind them:

https://github.com/TobyLobster/multiply_test

There was more to say on this subject than I thought when I started, so it was interesting to explore. If there are other good or interesting multiply routines out there let me know!
Thanks for doing this, it is very useful!

My interest is in the 16x16=16 routine, as this is needed for language implementations. In FastBasic I use this simple code: https://github.com/dmsc/fastbasic/blob/ ... rp/mul.asm , you could add it to your benchmark.

Seeing your results, I'm interested if adapting mult60 to 16x16=16 could be faster.

Have Fun!

Re: Comparing 6502 multiply routines

Posted: Sun Jan 08, 2023 1:18 pm
by TobyLobster
dmsc wrote:
Hi!
My interest is in the 16x16=16 routine, as this is needed for language implementations. In FastBasic I use this simple code: https://github.com/EricCarrGH/fastbasic ... rp/mul.asm , you could add it to your benchmark.

Seeing your results, I'm interested if adapting mult60 to 16x16=16 could be faster.
Have Fun!
Thanks, I've added your FastBasic code as `omult14.a`, which works out at 575 average cycles in 43 bytes.

I cut down mult61 a little to give a 16 bit result (mult61 is more amenable to this optimisation than mult60), which I've added as `omult15.a`. This gives a faster 390 cycles average in 47 bytes.

But, if you don't care about the answer if there is an overflow, then omult6.a wins with 153.46 average cycles in 38 bytes.

EDIT: I updated omult15 so have updated the numbers above to the latest.

Re: Comparing 6502 multiply routines

Posted: Sun Jan 08, 2023 10:41 pm
by dmsc
Hi!
TobyLobster wrote:
Thanks, I've added your FastBasic code as `omult14.a`, which works out at 575 average cycles in 43 bytes.
Thank you!
In the above I wrote the wrong URL for the FastBasic code; the correct one is https://github.com/dmsc/fastbasic/blob/ ... rp/mul.asm , but the code is the same. (This is what happens when it is late and you use Google to search for your own code :-P )
Quote:
I cut down mult61 a little to give a 16 bit result (mult61 is more amenable to this optimisation than mult60), which I've added as `omult15.a`. This gives a faster 390 cycles average in 47 bytes.
That is very reasonable for 47 bytes! In the FastBasic case, I would have to modify it a little to get the multiplicand from A/X and the stack, and get the result into A/X, but I will try to see if it is worth it.
Quote:
But, if you don't care about the answer if there is an overflow, then omult6.a wins with 153.46 average cycles in 38 bytes.
I don't care about the answer in the overflow case, but I *do* care about the sign of the result, so that $FFFF times $FFFF is $FFFF, and I suspect this is not the case on that code. Modifying it to return the correct signed result would make the code a lot bigger, and slower.... the 6502 certainly misses a "NEG" opcode :-P

Have Fun!

Re: Comparing 6502 multiply routines

Posted: Mon Jan 09, 2023 4:05 pm
by TobyLobster
dmsc wrote:
Hi!
I don't care about the answer in the overflow case, but I *do* care about the sign of the result, so that $FFFF times $FFFF is $FFFF, and I suspect this is not the case on that code. Modifying it to return the correct signed result would make the code a lot bigger, and slower.... the 6502 certainly misses a "NEG" opcode :-P

Have Fun!
If, on an overflow, you only care about the sign bit of the result, then I imagine adding this before the omult6 routine:

Code: Select all

    lda multiplicand+1      ;
    eor multiplier+1        ;
    sta temp                ; store sign bit
and adding `ldy temp` at the end, before returning in the out_of_memory condition, would keep the sign bit correct.
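The reasoning behind that snippet, as a sketch: for a nonzero signed product, the sign is the XOR of the two operands' sign bits, so saving `multiplicand+1 EOR multiplier+1` before the multiply preserves the correct bit 7 even when the magnitude overflows. In Python terms (a hypothetical helper, not code from the thread; arguments are 16-bit values in unsigned form):

```python
def product_sign_bit(multiplicand, multiplier):
    """Sign bit of the true signed product of two 16-bit words.

    Bit 7 of (high byte EOR high byte) is the product's sign,
    valid whenever the product is nonzero (a zero product is
    non-negative regardless of what the XOR says).
    """
    return (((multiplicand >> 8) ^ (multiplier >> 8)) & 0x80) >> 7
```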

Re: Comparing 6502 multiply routines

Posted: Mon Jan 09, 2023 7:01 pm
by dmsc
Hi!
TobyLobster wrote:
dmsc wrote:
Hi!
I don't care about the answer in the overflow case, but I *do* care about the sign of the result, so that $FFFF times $FFFF is $FFFF, and I suspect this is not the case on that code. Modifying it to return the correct signed result would make the code a lot bigger, and slower.... the 6502 certainly misses a "NEG" opcode :-P

Have Fun!
If on an overflow you only care about the sign bit of the result then I imagine adding this before the omult6 routine:

Code: Select all

    lda multiplicand+1      ;
    eor multiplier+1        ;
    sta temp                ; store sign bit
and adding `ldy temp` at the end, before returning in the out_of_memory condition would keep the sign bit correct.
No, what I tried to say is that the omult6.a routine will not give the correct result for any negative argument, as it will detect overflow instead of giving the correct answer - my example was -1 times -1, should give 1, but the omult6.a routine will detect overflow and give an invalid answer (in that case, it will give "-2" as answer).

Have Fun!

Re: Comparing 6502 multiply routines

Posted: Mon Jan 09, 2023 7:12 pm
by TobyLobster
dmsc wrote:
Hi!
No, what I tried to say is that the omult6.a routine will not give the correct result for any negative argument, as it will detect overflow instead of giving the correct answer - my example was -1 times -1, should give 1, but the omult6.a routine will detect overflow and give an invalid answer (in that case, it will give "-2" as answer).

Have Fun!
Ah, I understand now, thanks.

Re: Comparing 6502 multiply routines

Posted: Mon Dec 11, 2023 10:17 am
by repose
TobyLobster wrote:
I have exhaustively tested and compared the performance and memory use of over 60 integer multiplication routines from the internet. (EDIT: Now over 120 routines)

I also explore the algorithms and techniques behind them:

https://github.com/TobyLobster/multiply_test

There was more to say on this subject than I thought when I started, so it was interesting to explore. If there are other good or interesting multiply routines out there let me know!
Hi,

I was pleased to find your comparison, and I must say thrilled to see that I am the world record holder :) I have improved my published version; the fastest unsigned 16x16=32 is now 194.1 cycles. I'll publish it on codebase when I can. I've also updated my previous entry. I estimate that using a 128k table (C64 REU) would take 134 cycles.

I could easily make versions to claim the top spot of signed, partial results etc.

I'm also working on x^2 (about 160 cycles), 32x32=64 (about 800), and 7x8 (about 30). I'm putting all these into an integrated library.
There are many more ways of multiplying, including trig identities, Knuth, Karatsuba, nybble, residue, table, tree, jmp, and convolutions.

Could you add integer division and floating point as well?

Formula-based mults:

1 - sin(a)*cos(b) = (sin(a+b) + sin(a-b))/2
2 - cos(a)*cos(b) = (cos(a+b) + cos(a-b))/2
3 - sin(a)*sin(b) = (cos(a-b) - cos(a+b))/2
4 - a*b = exp(log(a) + log(b))
5 - a*b = [(a+b)² - (a-b)²]/4
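Formula 5 (the quarter-square identity) is the basis of the table-driven routines discussed in this thread, such as the p_sqr tables below: with a table of f(n) = n²/4, any product reduces to two lookups and a subtraction. A minimal Python sketch (the table name and the 8-bit range are illustrative):

```python
# Quarter-square multiply: a*b = f(a+b) - f(a-b) with f(n) = n*n // 4.
# Exact for integers: (a+b) and (a-b) have the same parity, so the
# truncation in the floor division cancels out of the difference.
TABLE = [n * n // 4 for n in range(511)]  # covers sums up to 255 + 255

def qmul(a, b):
    """8x8=16 unsigned multiply via two table lookups and one subtraction."""
    return TABLE[a + b] - TABLE[abs(a - b)]
```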


Knuth mult (for x = 256*a + b, y = 256*c + d):
k=(a-b)*(c-d)
l=a*c
m=b*d
n=l+m-k
x*y = l*65536 + n*256 + m (for 8 bit a/b/c/d)
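The Knuth/Karatsuba trick trades one of the four 8-bit partial products for extra adds and subtracts: the middle term a*d + b*c equals l + m - k. A quick Python check (names follow the post):

```python
def knuth_mul(x, y):
    """16x16=32 unsigned multiply using three 8-bit products instead of four."""
    a, b = x >> 8, x & 0xFF  # high and low bytes of x
    c, d = y >> 8, y & 0xFF  # high and low bytes of y
    k = (a - b) * (c - d)    # may be negative: signed arithmetic is required
    l = a * c
    m = b * d
    n = l + m - k            # equals a*d + b*c
    return l * 65536 + n * 256 + m
```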

Residue uses modular arithmetic, which can reduce the multiplies in trade for conversions. Convolution reduces multiplies at the cost of a large number of additions. Nybble trades them for a large number of tables and additions. Tree selects code fragments for each bit of the multiplier. Preprocessing can speed up any shift-and-add. Stats can optimize for specific ranges.

Re: Comparing 6502 multiply routines

Posted: Mon Dec 11, 2023 2:05 pm
by repose
TobyLobster wrote:
I have exhaustively tested and compared the performance and memory use of over 60 integer multiplication routines from the internet. (EDIT: Now over 120 routines)

I also explore the algorithms and techniques behind them:

https://github.com/TobyLobster/multiply_test

There was more to say on this subject than I thought when I started, so it was interesting to explore. If there are other good or interesting multiply routines out there let me know!
I do have a suggestion though: I'd like to see the results sorted by time and size ascending, with the optimal ones highlighted in yellow like you do in the graphs. There's no real use in looking them up by number.

You could also tabulate code vs data, whether they are self-modified, and the size of the init routines (including table building).

Re: Comparing 6502 multiply routines

Posted: Sun Jan 07, 2024 2:06 pm
by TobyLobster
I've updated the results with repose's improved code.
repose wrote:
I do have a suggestion though, I'd like to see the results sorted by time, size ascending, and highlight the optimal ones as yellow like you do in the graphs.
Sadly GitHub's markdown doesn't allow for colour or dynamically sorting tables by column, so no dice.

Re: Comparing 6502 multiply routines

Posted: Sun Jan 28, 2024 2:58 am
by repose
Thanks for the update! I'm a bit worried though, as I did label my routine as updated in 2023, which is true, but only for clarity. My actual 188.1 routine is still not published, so I'm hoping you didn't think my existing 2023 update was the fast one, though I'm glad a more clear version was updated in your mult15.a.

Re: Comparing 6502 multiply routines

Posted: Sun Jan 28, 2024 7:56 am
by TobyLobster
repose wrote:
Thanks for the update! I'm a bit worried though, as I did label my routine as updated in 2023, which is true, but only for clarity. My actual 188.1 routine is still not published, so I'm hoping you didn't think my existing 2023 update was the fast one, though I'm glad a more clear version was updated in your mult15.a.
Hi repose. That's ok. If you do publish a faster version, I will be happy to include it in the list.

Re: Comparing 6502 multiply routines

Posted: Mon Feb 12, 2024 6:28 am
by repose
Here you go, turns out I lost it :(

Code: Select all

* = $c000

p_sqr_lo = $8b
p_sqr_hi = $8d
p_neg_sqr_lo = $a3
p_neg_sqr_hi = $a5
x0 = $fb
x1 = $fc
y0 = $fd
y1 = $fe
z0 = $8004
z1 = $8005
z2 = $8006
z3 = $8007

umult16:
        ;unsigned 16x16 mult, 188.1 cycle version
        ;111 (code) + 2044 (data) = 2155 bytes
        ;inputs:
        ; x0 x1 
        ; y0 y1 
        ;outputs:
        ; z3 A z1 z0
        
        ;set multiplier as x1
        lda x1
        sta p_sqr_lo
        sta p_sqr_hi
        eor #$ff
        sta p_neg_sqr_lo
        sta p_neg_sqr_hi
        
        ;set multiplicand as y0
        ldy y0

        ;x1y0l =  low(x1*y0)
        ;x1y0h = high(x1*y0)
        sec
        lda ( p_sqr_lo ), y
        sbc ( p_neg_sqr_lo ), y
        sta x1y0l+1
        lda ( p_sqr_hi ), y
        sbc ( p_neg_sqr_hi ), y
        sta x1y0h+1
        
        ;set multiplicand as y1
        ldy y1
        
        ;x1y1l =  low(x1*y1)
        ;z3    = high(x1*y1)
        lda ( p_sqr_lo ), y
        sbc ( p_neg_sqr_lo ), y
        sta x1y1l+1
        lda ( p_sqr_hi ), y
        sbc ( p_neg_sqr_hi ), y
        sta z3
        
        ;set multiplier as x0
        lda x0
        sta p_sqr_lo
        sta p_sqr_hi
        eor #$ff
        sta p_neg_sqr_lo
        sta p_neg_sqr_hi
        
        ;x0y1l =  low(x0*y1)
        ;X     = high(x0*y1)
        lda ( p_sqr_lo ), y
        sbc ( p_neg_sqr_lo ), y
        sta x0y1l+1
        lda ( p_sqr_hi ), y
        sbc ( p_neg_sqr_hi ), y
        tax
        
        ;set multiplicand as y0
        ldy y0
        
        ;z0    =  low(x0*y0)
        ;A     = high(x0*y0)
        lda ( p_sqr_lo ), y
        sbc ( p_neg_sqr_lo ), y
        sta z0
        lda ( p_sqr_hi ), y
        sbc ( p_neg_sqr_hi ), y
        
        clc
do_adds:
        ;add the first two numbers of column 1
x0y1l:  adc #0 ;x0y0h + x0y1l
        tay ;6

        ;continue to first two numbers of column 2
        txa
x1y0h:  adc #0 ;x0y1h + x1y0h
        tax ;2 X=z2 so far, 6 cycles
        bcc + ;9
        
        ;column 3
        inc z3
        clc ;(+6) taken 7% of the time

        ;add last number of column 1
+       tya
x1y0l:  adc #0 ;+ x1y0l
        sta z1 ;7

        ;add last number of column 2
        txa
x1y1l:  adc #0 ;+ x1y1l
        ;A=z2
        bcc fin ;7

        ;column 3
        inc z3 ;(+4) taken 42% of the time

        ;A=z2
        ;add part 29-39, avg 31.1
        
        ;total = 156.96875+31.1=188.06875 (178-204)
        ;sta z2 ; optional
fin:    rts
        ;+6 = 194.06875
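As a cross-check on the routine, the column arithmetic that the do_adds section implements can be modeled in a few lines of Python (the quarter-square table lookups are replaced here by plain products; variable names mirror the assembly):

```python
def umult16_model(x, y):
    """32-bit product assembled from four 8-bit partial products, as in umult16."""
    x0, x1 = x & 0xFF, x >> 8
    y0, y1 = y & 0xFF, y >> 8
    # The four partials the routine reads from its square tables:
    x0y0 = x0 * y0
    x0y1 = x0 * y1
    x1y0 = x1 * y0
    x1y1 = x1 * y1
    # Column sums: z0 comes straight from x0y0's low byte; z1 and z2 each
    # add two partial-product bytes; carries propagate up into z3.
    return x0y0 + ((x0y1 + x1y0) << 8) + (x1y1 << 16)
```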

Re: Comparing 6502 multiply routines

Posted: Mon Feb 12, 2024 6:34 am
by repose
It could go faster yet - maybe save another 20 cycles. I look forward to trying the idea.

Re: Comparing 6502 multiply routines

Posted: Mon Feb 12, 2024 12:05 pm
by TobyLobster
repose wrote:
Here you go
Very nice Repose! That takes the top spot for the fastest 16 bit multiply. I published your code with cosmetic changes and a couple of trivial tweaks: I put z0-z3 in zero page and replaced "sta z1" with "tay" to return the value in a register instead of memory.

Published as mult86 at the usual page https://github.com/tobyLobster/multiply_test/
repose wrote:
It could go faster yet - maybe save another 20 cycles, look forward to trying the idea.
Sounds promising!

Re: Comparing 6502 multiply routines

Posted: Tue Feb 13, 2024 7:54 am
by repose
I've properly published this now at https://codebase64.org/doku.php?id=base ... ation_2023

Note, there are some changes. My post above was missing an important comment (at first?); in any case, your version is missing it:

Code: Select all

    ; set multiplicand as y0
    ldy y0
    
    ;x1y0l =  low(x1*y0)
    ;x1y0h = high(x1*y0)
    sec
It would be good if you added that to your copy.