6502.org Forum  Projects  Code  Documents  Tools  Forum

All times are UTC




Post new topic Reply to topic  [ 48 posts ]  Go to page Previous  1, 2, 3, 4  Next
PostPosted: Tue Feb 13, 2024 9:30 am 

Joined: Thu Jun 17, 2021 7:53 am
Posts: 37
repose wrote:
I've properly published this now at https://codebase64.org/doku.php?id=base ... ation_2023

Note, there are some changes. My post above was missing an important comment (at first, anyway); in any case your version is missing it.
Code:
    ; set multiplicand as y0
    ldy y0
   
    ;x1y0l =  low(x1*y0)
    ;x1y0h = high(x1*y0)
    sec


It would be good if you added that to your copy.

Done.
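
(For anyone following the thread: the routine under discussion is a table-driven quarter-square multiply, built on the identity a*b = f(a+b) - f(a-b) with f(i) = floor(i*i/4) — the same f that the !byte <((i*i)/4) table generators later in the thread produce. A quick Python sanity check of the identity; f and qmult are illustrative names, not labels from the routine:)

```python
# Quarter-square multiplication: the identity behind the sqr/neg_sqr tables.
def f(i):
    return (i * i) // 4  # table entry: floor(i^2 / 4)

def qmult(a, b):
    # a+b and a-b always have the same parity, so the two floor errors cancel
    return f(a + b) - f(a - b)

# exhaustive check over all unsigned 8-bit inputs
assert all(qmult(a, b) == a * b for a in range(256) for b in range(256))
```

On the 6502 the two f(.) lookups become table reads through pointers, which is why the setup fragments in this thread are all about initialising the p_sqr_* / p_neg_sqr_* pointers.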


PostPosted: Tue Feb 13, 2024 11:29 pm 

Joined: Mon Feb 20, 2012 1:46 pm
Posts: 25
Location: America
trick:

Code:
x1 = p_sqr_lo

umult16
; set multiplier as x1
lda x1
;sta p_sqr_lo
sta p_sqr_hi
eor #$ff
sta p_neg_sqr_lo
sta p_neg_sqr_hi

; set multiplier as x0
; *x1 is no longer preserved


To easily save 3 cycles and 1 byte zp.

I could also accept Y0 in Y, but that can place restrictions on the calling code. What if Y0 were in X? stx y0 is still faster than txa:tay.

I could save 6 cycles with two sets of pointers, with p_sqr_lo1 as x0 and p_sqr_lo2 as x1, but that adds 6 bytes of zp, which can be quite precious. I could also save more memory by overwriting the inputs with the outputs. Finally, I could have the caller set up the pointers itself, though that could force it to reserve the A register in the first setup. The A register can be assumed free in the 2nd setup because by then it's just about to call umult16, which trashes all registers anyhow.

So, there is still the potential to save 10% or more just in little tweaks.


PostPosted: Wed Feb 14, 2024 2:13 pm 

Joined: Thu Jun 17, 2021 7:53 am
Posts: 37
repose wrote:
trick: ... To easily save 3 cycles and 1 byte zp. ... So, there is still the potential to save 10% or more just in little tweaks.


I like the first trick, saving bytes and cycles. I'm not so sure about the Y0 stuff.
I like setting x1 = p_sqr_lo2, which means you get back at least one zero-page variable.

Those two tricks don't affect the calling code, and give 187.07 average cycles (including RTS).

EDIT: I updated mult86.a to include these two tricks.


PostPosted: Sun Feb 18, 2024 10:02 am 

Joined: Mon Feb 20, 2012 1:46 pm
Posts: 25
Location: America
Great! However, I would suggest
Code:
x0  = p_sqr_lo1      ; multiplier, 2 bytes
x1  = p_sqr_lo2

as it seems more logical to me (yes, I did lead you down the wrong path with my mistake).

I've also done a first pass of a signed 16x16.

The fastest, of course, is to use signed magnitude.

As for 2's complement, the post-fix method is the fastest. It also opens up some interesting tricks.
- Postfix is fastest with z2 in X and z3 in Y, and can finish with z3 in A and z2 in Y. Perhaps a different permutation of my routine will be faster overall with postfix. My original version ended with z2 in A and z3 in Y. I like z3 ending in A because you can immediately check the magnitude or possibly the sign of the product. On the other hand, it's not so great if you're doing a MAC (multiply-accumulate), i.e. adding a value immediately to the product.
- With the prefix method, you do abs(x)*abs(y)*sgn(x eor y); since you are only multiplying 15-bit numbers, there's a carry check in the additions of column 2 you can skip, which saves 3.42 cycles. In any case, if the unsigned version is domain-restricted to 15-bit numbers, you can make it faster.
- There's an interesting case of manipulating the tables to either add a fixed value to the product, or do a restricted-domain (~6 bit) signed multiply directly at the same speed. This opens some possibilities like a signed or unsigned upper half with correct rounding.
- I still haven't tried my other idea. It involves adding from the sqr tables directly, with no after-step of adding columns. It needs special tables to deal with the carry complications.
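
For concreteness, the prefix method can be sketched in Python (a model of the approach, not the assembly; smult16_prefix is a made-up name, and abs(x)*abs(y) stands in for the unsigned multiply core):

```python
def smult16_prefix(x, y):
    # Prefix method: compute abs(x)*abs(y) unsigned, then apply sgn(x eor y).
    # For x, y in -32767..32767 the magnitudes fit in 15 bits, which is what
    # lets the column-2 carry check be skipped in the unsigned core.
    negative = (x ^ y) < 0          # product is negative iff the signs differ
    p = abs(x) * abs(y)             # stand-in for the unsigned 16x16 multiply
    return -p if negative else p

assert smult16_prefix(-3, 1234) == -3702
assert smult16_prefix(-100, -200) == 20000
```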


PostPosted: Mon Feb 19, 2024 1:46 pm 

Joined: Thu Jun 17, 2021 7:53 am
Posts: 37
repose wrote:
Great! However, I would suggest
Code:
x0  = p_sqr_lo1      ; multiplier, 2 bytes
x1  = p_sqr_lo2

as it seems more logical to me (yes, I did lead you down the wrong path with my mistake).

Done.

repose wrote:
I've also done a first pass of a signed 16x16.

Sounds good. I've just included an 8x8 signed multiply that is the fastest yet, https://github.com/TobyLobster/multiply ... /smult10.a - perhaps it can be expanded to 16 bit.


PostPosted: Sun Feb 25, 2024 12:44 am 

Joined: Mon Feb 20, 2012 1:46 pm
Posts: 25
Location: America
I believe that is the fastest possible 8-bit signed multiply. I would point out a few things though. First, the total size is wrong; I break it down like this:
Code:
        ;zp 0
        ;data 2044
        ;data total 2044
        ;code 35
        ;code+data 2079


And the table generation is slightly wrong: lda $x0FF,X with X=$ff can only reach $x1FE, since 255+255=510 (511 bytes). I wrote the tables like this (!align is definitely compatible with ACME):
Code:
; Tables must be aligned to page boundary
!align 255,0 ;fill with 0 until $xxFF
sqr_lo
!for i = -256 TO 254
          !byte <((i*i)/4)
!end

free1 = *

!align 255,0
sqr_hi
!for i = -256 TO 254
          !byte >((i*i)/4)
!end

free2 = *

!align 255,0
neg_sqr_lo
!for i = -255 TO 255
          !byte <((i*i)/4)
!end

free3 = *

!align 255,0
neg_sqr_hi
!for i = -255 TO 255
          !byte >((i*i)/4)
!end

free4 = *


I wrote my main routine slightly differently (input in Y is consistent with some other routines that have to use (zp),Y in them):
Code:
smult8
        ;signed 8 bit multiply, selfmod version
        ;perform A:X = A*Y
        ;inputs:
        ; A, Y
        ;outputs:
        ; X, A
        ;registers used: A X Y
        ;cycles: 53.99
        ;memory used:
        ;zp 0
        ;data 2044
        ;data total 2044
        ;code 35
        ;code+data 2079

        eor #$80
        sta p_sqr_lo+1
        sta p_sqr_hi+1
        eor #$ff
        sta p_neg_sqr_lo+1
        sta p_neg_sqr_hi+1
        tya
        eor #$80
        tay
       
        sec
p_sqr_lo
        lda sqr_lo,y
p_neg_sqr_lo
        sbc neg_sqr_lo,y
        tax
p_sqr_hi
        lda sqr_hi,y
p_neg_sqr_hi
        sbc neg_sqr_hi,y
        rts


In order to use a table for (x+y)^2/4 on signed inputs, the index has to be monotonic, and that's the purpose of the eor #$80: the values -128 to +127 then map, in order, onto 0 to 255. Otherwise, it's the same as the unsigned version. There is a way to avoid it, but then you have to restrict the domain (input) so that all pairs have a unique place in the table.
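
The eor #$80 trick can be modelled in Python to check the table arithmetic (this mirrors the ACME !for loops above; the list indices play the role of the self-modified table base addresses, and smult8_model is an illustrative name):

```python
def f(i):
    return (i * i) // 4

# The tables exactly as the !for loops generate them:
# sqr holds f(-256..254), neg_sqr holds f(-255..255).
sqr = [f(i) for i in range(-256, 255)]
neg_sqr = [f(i) for i in range(-255, 256)]

def smult8_model(a, b):
    # eor #$80 on a two's-complement byte is excess-128 re-encoding:
    # -128..127 maps monotonically onto 0..255.
    ai, bi = a + 128, b + 128
    # index ai+bi lands on f(a+b); the eor #$ff pointer gives (255-ai)+bi,
    # which lands on f(b-a) = f(a-b), so the difference is exactly a*b.
    return sqr[ai + bi] - neg_sqr[(255 - ai) + bi]

# exhaustive check over all signed 8-bit pairs
assert all(smult8_model(a, b) == a * b
           for a in range(-128, 128) for b in range(-128, 128))
```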

As far as extending to 16-bit goes, it's not that simple, unfortunately. I can comment later.


PostPosted: Mon Feb 26, 2024 8:36 am 

Joined: Thu Jun 17, 2021 7:53 am
Posts: 37
repose wrote:
I believe that is the fastest possible 8-bit signed multiply. I would point out a few things though...

I have amended to show the unused bytes.
I've used Y on input for smult11, a variant that saves two cycles by using another 256 byte lookup table.


PostPosted: Mon Jul 22, 2024 3:32 pm 

Joined: Tue Sep 26, 2023 11:09 am
Posts: 105
This thread is fantastic, along with the github for all the variations! Thanks for all the work and discussion.

I'm curious about the distribution of the input data. It looks like the test results measure average performance over all possible input values, e.g. for a 16x16 multiply it averages over every possible pair of inputs. Lately I've been fooling with taliforth's `m*/` and `um*` (in fact it looks like the latter was based on this thread). I've noticed anecdotally that the distribution of inputs for my benchmark suite is far from uniform. As one example, the 32-bit input to `m*/` often gets extended from 16 bits via `s>d`, so its high bytes are zero. That means it pays off to optimize for some of those special cases.

I wonder if the same applies to `um*`? For example, you might add a short-circuit if one of the inputs is zero, or swap the multiplier/multiplicand if one has a zero high byte so you could do only 8 shifts instead of 16. Obviously in the current test framework, these would "look bad" because both average cycles and size would increase. But I wonder if it'd be interesting to measure a second average over a more realistic input distribution.

Of course that just begs the question of what "realistic" distribution to use. Clearly a benchmark isn't "real world", but https://en.wikipedia.org/wiki/Benford%27s_law might be one way: instead of assuming every 16-bit input is equally likely, you'd pick inputs whose logs are uniformly distributed. Something like int(pow(2, random.uniform(0,16))-1) gives you unsigned 16-bit values like that. This approach gives a value of exactly 0 about 6% of the time (1/16 rather than 1/65536), and a value < 256 (zero high byte) half of the time (1/2 rather than 256/65536=1/256), so special-case optimizations would presumably pay off.
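
A quick simulation of that suggestion, using the exact expression above (the 6% and 50% figures check out against a large sample):

```python
import random

def log_uniform_u16(rng):
    # unsigned 16-bit inputs whose logs are uniformly distributed
    return int(pow(2, rng.uniform(0, 16)) - 1)

rng = random.Random(1)                       # fixed seed for repeatability
samples = [log_uniform_u16(rng) for _ in range(100_000)]

frac_zero = sum(s == 0 for s in samples) / len(samples)
frac_byte = sum(s < 256 for s in samples) / len(samples)
assert abs(frac_zero - 1 / 16) < 0.01        # ~6.25% exact zeros
assert abs(frac_byte - 0.5) < 0.01           # ~half fit in one byte
```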

Is that something you've considered? If not I might have a go at running some of the 16x16 -> 32 benchmarks from GH but collecting a couple of additional stats, like average cycles when one or both inputs is zero, or when one or both inputs fit in 8 bits.

Addendum: I just found a nice paper by Hamming (Bell Labs, 1970), "On the Distribution of Numbers", that talks about the same idea in the context of optimizing floating-point representations.


PostPosted: Tue Jul 23, 2024 11:49 pm 

Joined: Thu Jun 17, 2021 7:53 am
Posts: 37
Thanks. It should be possible to account for a skew in the probability of input data, as long as that skew is properly defined. I've not attempted that.

I could imagine (a) the usual algorithms running over all possible inputs, but with each result weighted depending on the size of the inputs; or (b) a Monte Carlo approach, which could be fast and accurate enough, creating non-uniform random inputs each time.

To reduce the number of calculations required, it's worth looking at the type of algorithm used: the table-lookup algorithms don't vary their performance with the size of the input, and the shift-and-add algorithms slow down based only on the number of set bits in one of the two inputs.

The use of signed multiply can skew the results: some signed multiply routines use an unsigned multiply as the first step. If that's a shift-and-add algorithm, then this can slow down the calculations, since a negative number close to zero (likely to happen often) has a lot of 1 bits set.
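
The point about small negatives is easy to see from the two's-complement bit patterns; each set bit costs a shift-and-add routine an extra add step (popcount16 is a made-up helper):

```python
def popcount16(v):
    # number of set bits in the 16-bit two's-complement representation,
    # i.e. the number of add steps a shift-and-add multiply would take
    return bin(v & 0xFFFF).count("1")

assert popcount16(3) == 2      # small positive: few adds
assert popcount16(-3) == 15    # $FFFD: nearly every bit set
assert popcount16(-1) == 16    # $FFFF: worst case
```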


PostPosted: Wed Jul 24, 2024 12:00 pm 

Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10977
Location: England
Nice idea - and thanks for the (1970) Hamming paper - very interesting!

> This paper examines the distribution of the mantissas of floating point numbers and shows how the arithmetic operations of a computer transform various distributions toward the limiting distribution


PostPosted: Sat Jul 27, 2024 5:40 pm 

Joined: Tue Sep 26, 2023 11:09 am
Posts: 105
Here's a small contribution for the thread.

I'd seen Karatsuba's algorithm mentioned a couple of times but didn't see a 6502 implementation so I gave it a go. It uses a trick to calculate a 16x16=32 unsigned multiply using only three 8x8=16 multiplies plus some extra addition.

TL;DR: the size/speed (adjusting for ,x indexing) is in the same ballpark as Toby's mult63.a, so the saved multiply seems a wash with the extra juggling. However, I suspect it might win for 32x32=64. This version is Forth style (everything is ,x indexed), which I estimated to cost about 30 cycles or so.

I'm sure there's more performance to be squeezed out if anyone is interested - would love to see any tricks you come up with. Here's the core routine (written for 64tass). See attachment for full source file which includes a small test harness and 256 randomized test cases.

Code:
karatsuba:

        .comment

Karatsuba divide and conquer for multiplication.
Use 3 8x8=16 multiplies to do a 16x16=32 multiplication
See https://en.wikipedia.org/wiki/Karatsuba_algorithm

We want to calculate <a0,1> x <b0,1> = <d0,1,2,3>
= a0 x b0 + (a0 x b1 + a1 x b0) << 8 + (a1 x b1) << 16

i.e. decomposing the four byte result <d0-3> as:

    LSB |   d0  |   d1  |   d2  |   d3  | MSB
        |    a0 x b0    |
+               | a0 b1 + a1 b0 |  <= this is a 17 bit result
+                       |    a1 x b1    |

This expression still has four multiplies (and several 16 bit adds)
so doesn't help us much.  But consider the product (a0 - a1) x (b0 - b1)
which involves the same terms:

    (a1 - a0) x (b1 - b0) = a0 x b0 - (a0 x b1 + a1 x b0) + a1 x b1

rearranging we get

    (a0 x b1 + a1 x b0) = (a1 - a0) x (b1 - b0) + a0 x b0 + a1 x b1

so we can rewrite our original product as

        |   d0  |   d1  |   d2  |   d3  |
        |    a0 x b0    |
-               | (a1-a0)(b1-b0)|  <= signed multiply
+               |    a0 x b0    |
+               |    a1 x b1    |
+                       |    a1 x b1    |

This only has three 8 bit multiplies, two of them used twice,
but one of them is signed.  We can avoid that by
evaluating |a1-a0| x |b1-b0| = | (a1-a0)(b1-b0) |
as an unsigned multiply, and tracking a sign bit based on
whether we have to flip 0, 1 or 2 signs to get the two |.| terms.
Writing |a1 - a0| as a01 and likewise for b01, with sign bit s01
our result becomes

        |   d0  |   d1  |   d2  |   d3  |
        |    a0 x b0    |    a1 x b1    |
+               |    a0 x b0    |
+               |    a1 x b1    |
-s01            | |a01| x |b01| |


Let's try a worked example with $FEED * $CAFE = $CA23F126.
So we have:

    a0 = ED, a1 = FE, b0 = FE, b1 = CA
    a0 x b0 = EB26, a1 x b1 = C86C
    a01 = |ED-FE| = 11, b01 = |FE-CA| = 34, s01 = -1
    a01 * b01 = 0374

Plugging in to our tableau we get the desired
result in d0..d3, mapped to our normal forth
NUXI layout of 2,x 3,x 0,x and 1,x:

        |   d0  |   d1  |   d2  |   d3  |
        |   26      EB  |   6C      C8  |
+               |   26      EB  |
+               |   6C      C8  |
--              |   74      03  |
-----------------------------------------
        |   26  |   F1  |   23  |   CA  |
        |  2,x  |  3,x  |  0,x  |  1,x  |

Our implementation will arrange the stack like this:

        |  2,x  |  3,x  |  0,x  |  1,x  | fe,x  | ff,x  |
        |   a0  |   a1  |   b0  |   b1  | |a01| | |b01| |

              ... computations ensue ...

        |   d0  |   d1  |   d2  |   d3  |

.endcomment

        ; first set up |a01| and |b01| past top of stack
        ; and swap the positions of a1 and b0

        sec
        ldy 0,x         ; stash Y=b0 so we can swap a1 <-> b0
        lda 3,x         ; a1
        sta 0,x         ; a1 -> b0 for a1 x b1
        sty 3,x         ; b0 -> a1 for a0 x b0

        ldy #1          ; track the sign of (a1-a0)(b1-b0): Y ends odd if the product is positive, even if negative

        sbc 2,x         ; a1 - a0
        bcs +           ; no borrow so a1-a0 is positive
        eor #$ff        ; otherwise negate result
        ina
        iny             ; flip sign
        sec             ; reset carry
+
        sta $fe,x       ; | a1 - a0 |

        lda 1,x         ; b1
        sbc 3,x         ; b0
        bcs +
        eor #$ff
        ina
        iny             ; flip sign
+
        sta $ff,x       ; | b1 - b0 |
        sty sgn         ; y ends at 1, 2 or 3 after 0, 1 or 2 flips, so bit 0 gives the sign (set = subtract)

        ; now we do three in-place multiplies of the stack values
        ; to get a0 x b0, a1 x b1 and a01 x b01

        ; we'll cycle x => x-2 (mul) => x (mul) => x + 2 (mul) => x
        ; to multiply at fe/ff, 0/1, 2/3 relative to original x
        ; and leave with it unchanged.
        ; there's probably room for improvement here

        ldy #3
        dex
        dex

_mul8x8:
        ; multiply 0,x by 1,x in place
        lda 1,x         ; short-circuit if either input is zero
        beq _z1
        lda 0,x
        beq _z0

        dec 1,x         ; decrement multiplicand so we can add with carry set

        lsr             ; shift low bit of multiplier
        sta 0,x         ; and store it

        lda #0          ; A is now current 8-bit window on result
        bcc +

        adc 1,x         ; add the multiplicand - 1 + carry=1
        lsr             ; shift out low bit of result
                        ; we can skip this when A is still zero, since C=0 already
+
        ror 0,x
        bcc +
        adc 1,x
+
.rept 6
        ror             ; shift out low bit of result window
        ror 0,x         ; roll it in to multiplier, fetching next bit
        bcc +
        adc 1,x
+
.endrept
        ror
        ror 0,x
_z0:                    ; exit case for 0,x is zero, when A is also zero
        sta 1,x         ; store the high byte

_mul_end:

        dey
        beq _loop_end
        inx
        inx
        bra _mul8x8

_z1:    stz 0,x         ; exit case when 1,x is 0
        bra _mul_end    ; just set 0,x to 0

_loop_end:
        dex
        dex

.comment

After the in-place multiplies we've got this layout, with sign s01 from the parity bit of sgn

            |  2,x  |  3,x  |  0,x  |  1,x  | fe,x  | ff,x  |
            |    a0 x b0    |    a1 x b1    |  |a01 x b01|  |
            |   v0  |   v1  |   v2  |   v3  |   v4  |   v5  |

And still need:

        +           |    a0 x b0    |
        +           |    a1 x b1    |
        -s01        |  |a01 x b01|  |

to finally arrive at:

            |   d0  |   d1  |   d2  |   d3  |

In terms of the values we currently have:

    d0 = v0 (done!)
    d1 = v1 + v0 + v2 +/- v4
    d2 = v2 + v1 + v3 +/- v5 (+ carries)
    d3 = v3 (+ carries)

Note the shared v1+v2 term in both d1 and d2
.endcomment

        clc
        lda 3,x
        adc 0,x         ; v1 + v2
        tay
        ; carry is tricky, since it should increment both d2 (from d1)
        ; and d3 (from d2).  incrementing v3 accomplishes both
        ; since we'll add it to d2 momentarily
        bcc +
        inc 1,x         ; d3 ++
        clc
+
        adc 2,x
        sta 3,x         ; v0 + (v1 + v2) => d1
        tya
        adc 1,x
        sta 0,x         ; v3 + (v1 + v2) => d2
        bcc +
        inc 1,x         ; carry to d3
+

        ; Finally add or subtract | a01 x b01 | based on sgn

        lda 3,x         ; start with d1 in either case
        ror sgn         ; sign bit to carry (with clc for add, sec for subtract)
        bcc _add

        ; subtract:  d1/d2 -= | a01 x b01 |
        sbc $fe,x
        sta 3,x
        lda 0,x
        sbc $ff,x
        sta 0,x
        bcs +
        dec 1,x         ; borrow from d3
+
        rts

_add:
        ; add: d1/d2 += | a01 x b01 |
        adc $fe,x
        sta 3,x
        lda 0,x
        adc $ff,x
        sta 0,x
        bcc +
        inc 1,x         ; carry to d3
+
        rts

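The arithmetic in the tableau can be cross-checked with a quick Python model (this mirrors the math only, not the stack juggling; karatsuba16 is just an illustrative name):

```python
def karatsuba16(a, b):
    # 16x16 -> 32 unsigned multiply from three 8x8 products
    a0, a1 = a & 0xFF, a >> 8
    b0, b1 = b & 0xFF, b >> 8
    lo = a0 * b0                           # a0 x b0
    hi = a1 * b1                           # a1 x b1
    m = abs(a1 - a0) * abs(b1 - b0)        # |a01| x |b01|, unsigned
    if (a1 < a0) != (b1 < b0):             # s01: odd number of sign flips
        m = -m
    mid = lo + hi - m                      # a0 x b1 + a1 x b0 (a 17-bit result)
    return lo + (mid << 8) + (hi << 16)

# the worked example from the comment block
assert karatsuba16(0xFEED, 0xCAFE) == 0xCA23F126
```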

Attachments:
karatsuba.asm [15.42 KiB]
Downloaded 14 times
PostPosted: Sun Jul 28, 2024 5:55 am 

Joined: Thu Jun 17, 2021 7:53 am
Posts: 37
Thank you for contributing the implementation, pdragon. Very interesting to see a new algorithm being used. I had to convert your 65C02 code to plain old 6502 for my test framework, but fortunately I managed to convert it without any loss of cycles, and only taking an extra 4 bytes. I switched two bra -> jmp, one stz -> sta (where A is known to be zero), and two ina -> adc #1 (where carry is known to be clear).
I've added this as mult89 in my tests. The results come out to an average cycle count of 557.02 and 192 bytes, so not as competitive as mult63 (yet) but like you I feel there's more that can be saved.

EDIT: I tweaked the internal 8x8 multiply, cycle count is now 525.55 and 183 bytes.
EDIT: I removed some of the Forth-like 0,x addressing as you suggest to get the cycle count down to 492.95 average and 175 bytes.


PostPosted: Sun Jul 28, 2024 10:33 am 

Joined: Tue Sep 26, 2023 11:09 am
Posts: 105
Nice! Yup I imagine the zero checks and clc avoidance in mul8x8 aren’t worth it with uniform random inputs. One day if I have the energy I might try the 32x32 version where avoiding the extra multiply might pay off.


PostPosted: Sun Jul 28, 2024 11:53 am 

Joined: Thu Jun 17, 2021 7:53 am
Posts: 37
pdragon wrote:
Nice! Yup I imagine the zero checks and clc avoidance in mul8x8 aren’t worth it with uniform random inputs. One day if I have the energy I might try the 32x32 version where avoiding the extra multiply might pay off.


Yeah, I would certainly be very interested to see a full 32x32 version.
EDIT: I imagine this would involve a regular 16x16 multiply used three times.
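
If it helps, the recursion is easy to sketch in Python: the same identity applied twice gives 32x32=64 from three 16x16 multiplies, each of which is itself three 8x8s (kmul is a made-up name; the base case stands in for the table-driven 8x8 multiply):

```python
def kmul(a, b, bits):
    # recursive Karatsuba: a (2n)x(2n) multiply from three nxn multiplies
    if bits == 8:
        return a * b                  # stand-in for the 8x8 table multiply
    h = bits // 2
    mask = (1 << h) - 1
    a0, a1 = a & mask, a >> h
    b0, b1 = b & mask, b >> h
    lo = kmul(a0, b0, h)
    hi = kmul(a1, b1, h)
    m = kmul(abs(a1 - a0), abs(b1 - b0), h)   # magnitudes fit in h bits
    if (a1 < a0) != (b1 < b0):                # odd number of sign flips
        m = -m
    mid = lo + hi - m
    return lo + (mid << h) + (hi << bits)

assert kmul(0xDEADBEEF, 0xCAFEBABE, 32) == 0xDEADBEEF * 0xCAFEBABE
```

So the 32x32 case would also need the carry/sign bookkeeping done at 16-bit width, which is where the cycle cost will hide.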


PostPosted: Sun Jul 28, 2024 8:14 pm 

Joined: Tue Sep 26, 2023 11:09 am
Posts: 105
Yes, the algorithm is the same, so the 8x8 becomes a 16x16 multiply and the adc/sec become 16-bit equivalents.

