6502.org • View topic - Comparing 6502 multiply routines

View unanswered posts | View active topics

Board index » 6502.org Users Forum » Programming

All times are UTC

Comparing 6502 multiply routines

Page 3 of 3

[ 37 posts ]

Go to page Previous 1, 2, 3

Previous topic | Next topic

Author

Message

TobyLobster

Post subject: Re: Comparing 6502 multiply routines

Posted: Tue Feb 13, 2024 9:30 am

Joined: Thu Jun 17, 2021 7:53 am
Posts: 34

repose wrote:

I've properly published this now at https://codebase64.org/doku.php?id=base ... ation_2023

Note, there's some changes. My post above was missing an important comment (at first?), in any case your version is missing it

Code:

    ; set multiplicand as y0
    ldy y0
    
    ;x1y0l =  low(x1*y0)
    ;x1y0h = high(x1*y0)
    sec

It would be good if you added that to your copy.

Done.

Top

repose

Post subject: Re: Comparing 6502 multiply routines

Posted: Tue Feb 13, 2024 11:29 pm

Joined: Mon Feb 20, 2012 1:46 pm
Posts: 23
Location: America

trick:

Code:

x1 = p_sqr_lo

umult16
; set multiplier as x1
lda x1
;sta p_sqr_lo
sta p_sqr_hi
eor #$ff
sta p_neg_sqr_lo
sta p_neg_sqr_hi

; set multiplier as x0
; *x1 is no longer preserved

To easily save 3 cycles and 1 byte zp.

I could also accept Y0 in Y, but that can place restrictions on the calling code. What if Y0 were in X? Stx y0 is still faster than txa:tay.

I could save 6 cycles with two sets of pointers, and psqrlo1 is x0 and psqrlo2 is x1, but that adds 6 zp, which can be quite precious. I could also save more memory by overwriting the inputs with the outputs. Finally, I could have the caller set up the pointers itself, but could force it to reserve the A register in the first setup. The A register can be assumed free in the 2nd setup because by then, its just about to call umult16 which trashes all registers anyhow.

So, there is still the potential to save 10% or more just in little tweaks.

Top

TobyLobster

Post subject: Re: Comparing 6502 multiply routines

Posted: Wed Feb 14, 2024 2:13 pm

Joined: Thu Jun 17, 2021 7:53 am
Posts: 34

repose wrote:

trick:

Code:

x1 = p_sqr_lo

umult16
; set multiplier as x1
lda x1
;sta p_sqr_lo
sta p_sqr_hi
eor #$ff
sta p_neg_sqr_lo
sta p_neg_sqr_hi

; set multiplier as x0
; *x1 is no longer preserved

I like the first trick, saving bytes and cycles. Not so sure about the Y0 stuff?
I like setting x1 = psqrlo2, which means you get back one zero page variable at least.

Those two tricks don't affect the calling code, and give 187.07 average cycles (including RTS).

EDIT: I updated mult86.a to include these two tricks.

Top

repose

Post subject: Re: Comparing 6502 multiply routines

Posted: Sun Feb 18, 2024 10:02 am

Joined: Mon Feb 20, 2012 1:46 pm
Posts: 23
Location: America

Great! However, I would suggest

Code:

x0  = p_sqr_lo1      ; multiplier, 2 bytes
x1  = p_sqr_lo2

as it seems more logical to me (yes, I did lead you down the wrong path with my mistake).

I've also done a first pass of a signed 16x16.

The fastest of course, is to use signed magnitude.

As for 2's complement, the post-fix method is the fastest. It also opens up some interesting tricks.
-Postfix is fastest with z2 in X and z3 in Y, and can finish with z3 in A and z2 in Y. Perhaps a different permutation of my routine will be faster overall with postfix. My original version ended with z2 in A and z3 in Y. I like z3 ending in A because you can immediately check the magnitude or possibly sign of the product. On the other hand, its not so great if you're doing a "MAC", multiply accumulate, or adding a value immediately to the product.
-With the prefix method, you do abs(x)*abs(y)*sgn(x eor y), since you are only multiplying 15-bit numbers, there's a carry check in the additions of column 2 you can skip, which saves 3.42 cycles. In any case, if the unsigned version is domain restricted to 15-bit numbers, you can make it faster.
-There's an interesting case of manipulating the tables to either add a fixed value to the product, or do a restricted domain (~6 bit) signed multiply directly at the same speed. This opens some possibilities like signed or unsigned upper half with correct rounding.
-I still haven't tried my other idea. It involves adding from the sqr tables directly, no after step of adding columns. It needs special tables to deal with the carry complications.

Top

TobyLobster

Post subject: Re: Comparing 6502 multiply routines

Posted: Mon Feb 19, 2024 1:46 pm

Joined: Thu Jun 17, 2021 7:53 am
Posts: 34

repose wrote:

Great! However, I would suggest

Code:

x0  = p_sqr_lo1      ; multiplier, 2 bytes
x1  = p_sqr_lo2

as it seems more logical to me (yes, I did lead you down the wrong path with my mistake).

Done.

repose wrote:

I've also done a first pass of a signed 16x16.

Sounds good. I've just included a 8x8 signed multiply that is the fastest yet, https://github.com/TobyLobster/multiply ... /smult10.a - perhaps it can be expanded to 16 bit.

Top

repose

Post subject: Re: Comparing 6502 multiply routines

Posted: Sun Feb 25, 2024 12:44 am

Joined: Mon Feb 20, 2012 1:46 pm
Posts: 23
Location: America

I believe that is the fastest possible 8-bit signed multiply. I would point out a few things though, the total size is wrong. I break it down like this:

Code:

        ;zp 0
        ;data 2044
        ;data total 2044
        ;code 35
        ;code+data 2079

and the table generation is slightly wrong. lda $x0FF,X where X=$ff can only reach to $x1FE, since 255+255=510 (511 bytes). I wrote the tables like this (!align is definitely compatible with ACME):

Code:

; Tables must be aligned to page boundary
!align 255,0 ;fill with 0 until $xxFF
sqr_lo
!for i = -256 TO 254
          !byte <((i*i)/4)
!end

free1 = *

!align 255,0
sqr_hi
!for i = -256 TO 254
          !byte >((i*i)/4)
!end

free2 = *

!align 255,0
neg_sqr_lo
!for i = -255 TO 255
          !byte <((i*i)/4)
!end

free3 = *

!align 255,0
neg_sqr_hi
!for i = -255 TO 255
          !byte >((i*i)/4)
!end

free4 = *

I wrote my main routine slightly differently (input in Y is consistent with some other routines that have to use (zp),Y in them):

Code:

smult8
        ;signed 8 bit multiply, selfmod version
        ;perform A:X = A*Y
        ;inputs:
        ; A, Y
        ;outputs:
        ; X, A
        ;registers used: A X Y
        ;cycles: 53.99
        ;memory used:
        ;zp 0
        ;data 2044
        ;data total 2044
        ;code 35
        ;code+data 2079

        eor #$80
        sta p_sqr_lo+1
        sta p_sqr_hi+1
        eor #$ff
        sta p_neg_sqr_lo+1
        sta p_neg_sqr_hi+1
        tya
        eor #$80
        tay
        
        sec
p_sqr_lo
        lda sqr_lo,y
p_neg_sqr_lo
        sbc neg_sqr_lo,y
        tax
p_sqr_hi
        lda sqr_hi,y
p_neg_sqr_hi
        sbc neg_sqr_hi,y
        rts

In order to use a table for (x+y)^2/4, it has to be monotonic (linear), that's the purpose of the eor #$80, because then the values range from -128 to +127 in order as 0 to 255. Otherwise, its the same as the unsigned version. There is a way to avoid it, but then you have to restrict the domain (input) so that all pairs have a unique place in the table.

As far as extending to 16-bit, its not that simple unfortunately. I can comment later.

Top

TobyLobster

Post subject: Re: Comparing 6502 multiply routines

Posted: Mon Feb 26, 2024 8:36 am

Joined: Thu Jun 17, 2021 7:53 am
Posts: 34

repose wrote:

I believe that is the fastest possible 8-bit signed multiply. I would point out a few things though...

I have amended to show the unused bytes.
I've used Y on input for smult11, a variant that saves two cycles by using another 256 byte lookup table.

Top

Page 3 of 3

[ 37 posts ]

Go to page Previous 1, 2, 3

Board index » 6502.org Users Forum » Programming

All times are UTC

Who is online

Users browsing this forum: Google [Bot] and 27 guests

You cannot post new topics in this forum
You cannot reply to topics in this forum
You cannot edit your posts in this forum
You cannot delete your posts in this forum
You cannot post attachments in this forum