6502.org Forum  Projects  Code  Documents  Tools  Forum
It is currently Sat Apr 27, 2024 11:47 am

All times are UTC




Post new topic Reply to topic  [ 37 posts ]  Go to page Previous  1, 2, 3  Next
Author Message
PostPosted: Sun Jan 08, 2023 3:31 am 
Offline

Joined: Mon Sep 17, 2018 2:39 am
Posts: 132
Hi!

TobyLobster wrote:
I have exhaustively tested and compared the performance and memory use of over 60 integer multiplication routines from the internet.

I also explore the algorithms and techniques behind them:

https://github.com/TobyLobster/multiply_test

There was more to say on this subject than I thought when I started, so it was interesting to explore. If there are other good or interesting multiply routines out there let me know!


Thanks for doing this, it is very useful!

My interest is in the 16x16=16 routine, as this is needed for language implementations. In FastBasic I use this simple code: https://github.com/dmsc/fastbasic/blob/ ... rp/mul.asm , you could add it to your benchmark.

Seeing your results, I'm interested if adapting mult60 to 16x16=16 could be faster.

Have Fun!


Last edited by dmsc on Sun Jan 08, 2023 5:27 pm, edited 1 time in total.

Top
 Profile  
Reply with quote  
PostPosted: Sun Jan 08, 2023 1:18 pm 
Offline

Joined: Thu Jun 17, 2021 7:53 am
Posts: 34
dmsc wrote:
Hi!
My interest is in the 16x16=16 routine, as this is needed for language implementations. In FastBasic I use this simple code: https://github.com/EricCarrGH/fastbasic ... rp/mul.asm , you could add it to your benchmark.

Seeing your results, I'm interested if adapting mult60 to 16x16=16 could be faster.
Have Fun!


Thanks, I've added your FastBasic code code as `omult14.a` which works out as 575 average cycles in 43 bytes.

I cut down mult61 a little to give a 16 bit result (mult61 is more amenable to this optimisation than mult60), which I've added as `omult15.a`. This gives a faster 390 cycles average in 47 bytes.

But, if you don't care about the answer if there is an overflow, then omult6.a wins with 153.46 average cycles in 38 bytes.

EDIT: I updated omult15 so have updated the numbers above to the latest.


Top
 Profile  
Reply with quote  
PostPosted: Sun Jan 08, 2023 10:41 pm 
Offline

Joined: Mon Sep 17, 2018 2:39 am
Posts: 132
Hi!

TobyLobster wrote:
Thanks, I've added your FastBasic code code as `omult14.a` which works out as 575 average cycles in 43 bytes.


Thank you!
In the above, I wrote the wrong URL for the FastBasic code, correct one is https://github.com/dmsc/fastbasic/blob/ ... rp/mul.asm , but the code is the same. (this is what happens when it is late and you use google to search to your own code :-P )

Quote:
I cut down mult61 a little to give a 16 bit result (mult61 is more amenable to this optimisation than mult60), which I've added as `omult15.a`. This gives a faster 390 cycles average in 47 bytes.


That is very reasonable for 47 bytes! In the FastBasic case, I would have to modify it a little to get the multiplicand from A/X and the stack, and get the result into A/X, but I will try to see if it is worth.

Quote:
But, if you don't care about the answer if there is an overflow, then omult6.a wins with 153.46 average cycles in 38 bytes.


I don't care about the answer in the overflow case, but I *do* care about the sign of the result, so that $FFFF times $FFFF is $FFFF, and I suspect this is not the case on that code. Modifying it to return the correct signed result would make the code a lot bigger, and slower.... the 6502 certainly misses a "NEG" opcode :-P

Have Fun!


Top
 Profile  
Reply with quote  
PostPosted: Mon Jan 09, 2023 4:05 pm 
Offline

Joined: Thu Jun 17, 2021 7:53 am
Posts: 34
dmsc wrote:
Hi!
I don't care about the answer in the overflow case, but I *do* care about the sign of the result, so that $FFFF times $FFFF is $FFFF, and I suspect this is not the case on that code. Modifying it to return the correct signed result would make the code a lot bigger, and slower.... the 6502 certainly misses a "NEG" opcode :-P

Have Fun!

If on an overflow you only care about the sign bit of the result then I imagine adding this before the omult6 routine:

Code:
    lda multiplicand+1      ;
    eor multiplier+1        ;
    sta temp                ; store sign bit


and adding `ldy temp` at the end, before returning in the out_of_memory condition would keep the sign bit correct.


Top
 Profile  
Reply with quote  
PostPosted: Mon Jan 09, 2023 7:01 pm 
Offline

Joined: Mon Sep 17, 2018 2:39 am
Posts: 132
Hi!

TobyLobster wrote:
dmsc wrote:
Hi!
I don't care about the answer in the overflow case, but I *do* care about the sign of the result, so that $FFFF times $FFFF is $FFFF, and I suspect this is not the case on that code. Modifying it to return the correct signed result would make the code a lot bigger, and slower.... the 6502 certainly misses a "NEG" opcode :-P

Have Fun!

If on an overflow you only care about the sign bit of the result then I imagine adding this before the omult6 routine:

Code:
    lda multiplicand+1      ;
    eor multiplier+1        ;
    sta temp                ; store sign bit


and adding `ldy temp` at the end, before returning in the out_of_memory condition would keep the sign bit correct.


No, what I tried to say is that the omult6.a routine will not give the correct result for any negative argument, as it will detect overflow instead of giving the correct answer - my example was -1 times -1, should give 1, but the omult6.a routine will detect overflow and give an invalid answer (in that case, it will give "-2" as answer).

Have Fun!


Top
 Profile  
Reply with quote  
PostPosted: Mon Jan 09, 2023 7:12 pm 
Offline

Joined: Thu Jun 17, 2021 7:53 am
Posts: 34
dmsc wrote:
Hi!
No, what I tried to say is that the omult6.a routine will not give the correct result for any negative argument, as it will detect overflow instead of giving the correct answer - my example was -1 times -1, should give 1, but the omult6.a routine will detect overflow and give an invalid answer (in that case, it will give "-2" as answer).

Have Fun!

Ah, I understand now, thanks.


Top
 Profile  
Reply with quote  
PostPosted: Mon Dec 11, 2023 10:17 am 
Offline

Joined: Mon Feb 20, 2012 1:46 pm
Posts: 23
Location: America
TobyLobster wrote:
I have exhaustively tested and compared the performance and memory use of over 60 integer multiplication routines from the internet. (EDIT: Now over 120 routines)

I also explore the algorithms and techniques behind them:

https://github.com/TobyLobster/multiply_test

There was more to say on this subject than I thought when I started, so it was interesting to explore. If there are other good or interesting multiply routines out there let me know!


Hi,

I was pleased to find your comparison, and I must say thrilled to see that I am the world record holder :) I have improved my published version; the fastest unsigned 16x16=32 is now 194.1 cycles. I'll publish it on codebase when I can. I've also updated my previous entry. I estimate that using a 128k table (C64 REU) would take 134 cycles.

I could easily make versions to claim the top spot of signed, partial results etc.

I'm also working on x^2 (about 160), 32x32=64 (about 800), and 7x8 (about 30). I'm putting all these into an integrated library.
There's many more ways of multiplying, including trig identities, knuth, karatsuba, nybble, residue, table, tree, jmp, and convolutions.

Could you add integer division and floating point as well?

formula based mults

1 - sin(a)*cos(b)=(sin(a+b)+sin(a-b))/2
2 - cos(a)*cos(b)=(cos(a+b)+cos(a-b))/2
3 - sin(a)*sin(b)=(cos(a-b)-cos(a+b))/2
4 - a*b=exp(log(a)+log(b))
5 - a*b = [(a + b)² - (a - b)²]/4


Knuth mult
k=(a-b)*(c-d)
l=a*c
m=b*d
n=l+m-k
x*y=l*256+n*16+m (for 8 bit a/b/c/d)

Residue uses modulo numbers, which can reduce the multiplies to trade for conversions. Convolution reduces multiplies for a large number of additions. Nybble trades for a large number of tables and additions. Tree selects code fragments for each bit of the multiplier. Preprocessing can speed up any shift and add. Stats can optimize for specific ranges.


Top
 Profile  
Reply with quote  
PostPosted: Mon Dec 11, 2023 2:05 pm 
Offline

Joined: Mon Feb 20, 2012 1:46 pm
Posts: 23
Location: America
TobyLobster wrote:
I have exhaustively tested and compared the performance and memory use of over 60 integer multiplication routines from the internet. (EDIT: Now over 120 routines)

I also explore the algorithms and techniques behind them:

https://github.com/TobyLobster/multiply_test

There was more to say on this subject than I thought when I started, so it was interesting to explore. If there are other good or interesting multiply routines out there let me know!


I do have a suggestion though, I'd like to see the results sorted by time, size ascending, and highlight the optimal ones as yellow like you do in the graphs. There's no real use of looking them up by numbering.

You could also tabulate code vs data, if they are self-modified, and the size of the init routines (including table building).


Top
 Profile  
Reply with quote  
PostPosted: Sun Jan 07, 2024 2:06 pm 
Offline

Joined: Thu Jun 17, 2021 7:53 am
Posts: 34
I've updated the results with repose's improved code.

repose wrote:
I do have a suggestion though, I'd like to see the results sorted by time, size ascending, and highlight the optimal ones as yellow like you do in the graphs.

Sadly github's markdown doesn't allow for colour, nor dynamically sorting tables by column, so no dice.


Top
 Profile  
Reply with quote  
PostPosted: Sun Jan 28, 2024 2:58 am 
Offline

Joined: Mon Feb 20, 2012 1:46 pm
Posts: 23
Location: America
Thanks for the update! I'm a bit worried though, as I did label my routine as updated in 2023, which is true, but only for clarity. My actual 188.1 routine is still not published, so I'm hoping you didn't think my existing 2023 update was the fast one, though I'm glad a more clear version was updated in your mult15.a.


Top
 Profile  
Reply with quote  
PostPosted: Sun Jan 28, 2024 7:56 am 
Offline

Joined: Thu Jun 17, 2021 7:53 am
Posts: 34
repose wrote:
Thanks for the update! I'm a bit worried though, as I did label my routine as updated in 2023, which is true, but only for clarity. My actual 188.1 routine is still not published, so I'm hoping you didn't think my existing 2023 update was the fast one, though I'm glad a more clear version was updated in your mult15.a.

Hi repose. That's ok, If you do publish a faster version, I will be happy to include it in the list.


Top
 Profile  
Reply with quote  
PostPosted: Mon Feb 12, 2024 6:28 am 
Offline

Joined: Mon Feb 20, 2012 1:46 pm
Posts: 23
Location: America
Here you go, turns out I lost it :(

Code:
* = $c000

p_sqr_lo = $8b
p_sqr_hi = $8d
p_neg_sqr_lo = $a3
p_neg_sqr_hi = $a5
x0 = $fb
x1 = $fc
y0 = $fd
y1 = $fe
z0 = $8004
z1 = $8005
z2 = $8006
z3 = $8007

umult16:
        ;unsigned 16x16 mult, 188.1 cycle version
        ;111 (code) + 2044 (data) = 2155 bytes
        ;inputs:
        ; x0 x1
        ; y0 y1
        ;outputs:
        ; z3 A z1 z0
       
        ;set multiplier as x1
        lda x1
        sta p_sqr_lo
        sta p_sqr_hi
        eor #$ff
        sta p_neg_sqr_lo
        sta p_neg_sqr_hi
       
        ;set multiplicand as y0
        ldy y0

        ;x1y0l =  low(x1*y0)
        ;x1y0h = high(x1*y0)
        sec
        lda ( p_sqr_lo ), y
        sbc ( p_neg_sqr_lo ), y
        sta x1y0l+1
        lda ( p_sqr_hi ), y
        sbc ( p_neg_sqr_hi ), y
        sta x1y0h+1
       
        ;set multiplicand as y1
        ldy y1
       
        ;x1y1l =  low(x1*y1)
        ;z3    = high(x1*y1)
        lda ( p_sqr_lo ), y
        sbc ( p_neg_sqr_lo ), y
        sta x1y1l+1
        lda ( p_sqr_hi ), y
        sbc ( p_neg_sqr_hi ), y
        sta z3
       
        ;set multiplier as x0
        lda x0
        sta p_sqr_lo
        sta p_sqr_hi
        eor #$ff
        sta p_neg_sqr_lo
        sta p_neg_sqr_hi
       
        ;x0y1l =  low(x0*y1)
        ;X     = high(x0*y1)
        lda ( p_sqr_lo ), y
        sbc ( p_neg_sqr_lo ), y
        sta x0y1l+1
        lda ( p_sqr_hi ), y
        sbc ( p_neg_sqr_hi ), y
        tax
       
        ;set multiplicand as y0
        ldy y0
       
        ;z0    =  low(x0*y0)
        ;A     = high(x0*y0)
        lda ( p_sqr_lo ), y
        sbc ( p_neg_sqr_lo ), y
        sta z0
        lda ( p_sqr_hi ), y
        sbc ( p_neg_sqr_hi ), y
       
        clc
do_adds:
        ;add the first two numbers of column 1
x0y1l:  adc #0 ;x0y0h + x0y1l
        tay ;6

        ;continue to first two numbers of column 2
        txa
x1y0h:  adc #0 ;x0y1h + x1y0h
        tax ;2 X=z2 so far, 6 cycles
        bcc + ;9
       
        ;column 3
        inc z3
        clc ;(+6) taken 7% of the time

        ;add last number of column 1
+       tya
x1y0l:  adc #0 ;+ x1y0l
        sta z1 ;7

        ;add last number of column 2
        txa
x1y1l:  adc #0 ;+ x1y1l
        ;A=z2
        bcc fin ;7

        ;column 3
        inc z3 ;(+4) taken 42% of the time

        ;A=z2
        ;add part 29-39, avg 31.1
       
        ;total = 156.96875+31.1=188.06875 (178-204)
        ;sta z2 ; optional
fin:    rts
        ;+6 = 194.06875


Top
 Profile  
Reply with quote  
PostPosted: Mon Feb 12, 2024 6:34 am 
Offline

Joined: Mon Feb 20, 2012 1:46 pm
Posts: 23
Location: America
It could go faster yet - maybe save another 20 cycles, look forward to trying the idea.


Top
 Profile  
Reply with quote  
PostPosted: Mon Feb 12, 2024 12:05 pm 
Offline

Joined: Thu Jun 17, 2021 7:53 am
Posts: 34
repose wrote:
Here you go

Very nice Repose! That takes the top spot for the fastest 16 bit multiply. I published your code with cosmetic changes and a couple of minor trivial tweaks: I put the z0-z3 in zero page and replaced "sta z1" with "tay" to return the value in a register instead of memory.

Published as mult86 at the usual page https://github.com/tobyLobster/multiply_test/

repose wrote:
It could go faster yet - maybe save another 20 cycles, look forward to trying the idea.

Sounds promising!


Top
 Profile  
Reply with quote  
PostPosted: Tue Feb 13, 2024 7:54 am 
Offline

Joined: Mon Feb 20, 2012 1:46 pm
Posts: 23
Location: America
I've properly published this now at https://codebase64.org/doku.php?id=base ... ation_2023

Note, there's some changes. My post above was missing an important comment (at first?), in any case your version is missing it
Code:
    ; set multiplicand as y0
    ldy y0
   
    ;x1y0l =  low(x1*y0)
    ;x1y0h = high(x1*y0)
    sec


It would be good if you added that to your copy.


Top
 Profile  
Reply with quote  
Display posts from previous:  Sort by  
Post new topic Reply to topic  [ 37 posts ]  Go to page Previous  1, 2, 3  Next

All times are UTC


Who is online

Users browsing this forum: No registered users and 20 guests


You cannot post new topics in this forum
You cannot reply to topics in this forum
You cannot edit your posts in this forum
You cannot delete your posts in this forum
You cannot post attachments in this forum

Search for:
Jump to: