Re: Comparing 6502 multiply routines
Posted: Thu Aug 08, 2024 8:50 am
>Yeah, I would certainly be very interested to see a full 32x32 version.
It will be close. Three 16x16 multiplies in the best case is already 561 cycles, then a lot of adding, say another 160, estimating over 720 cycles.
There is already a 32x32=64 in <750 cycles. It uses the sqr tables but with a variation on the inner columns where most of the adding of partial products is done in the column. In that case, optimized adding is most important. I believe it comes down to
txa:adc #k3:ldx #0:ldy y3:adc (sqr),y:bcc s1:inx:clc s1...sta z+3.
This looks strange, but remember the sqr method subtracts two table look ups. Some adjustments have to be made, but I add the 2's complement so that I add them all in a row. There's some adjustments which creates constants kn. I have to fudge the sqr tables. There's up to 14 adds in a row in the middle columns. There's a total of 64 adds which are averaging <11 cycles each, so the bulk of the routine is <704 cycles.
I haven't published that one yet.
You could also try the sqr formula with larger inputs, or the Knuth method.
Given x=x1*256+x0, y=y1*256+y0
y1 y0
*x1 x0
Calculation Range
k=(y1-y0)*(x1-x0) +-FE01
l=y1*x1 0-FE01
m=y0*x0 0-FE01
n=l+m-k 0-1FC02
l+m 0-1FC02
l-k or m-k +-FE01
x*y=l*65536+n*256+m (for 8 bit Xn/Yn)
Example (16x16 bits):
y1 y0 ->$20 10
*x1 x0 40 30
k=(20-10)*(40-30)=100
l=20*40=800
m=10*30=300
n=800+300-100=a00
llll0000 -> 08000000
nnnnn00 00a0000
mmmm 0300
----------------
2010*4030 = 080a0300
x*y=l*4294967296+n*65536+m (for 16 bit Xn/Yn)
---
In the situation of multiplying negative numbers via shift and add, indeed you can save time by checking if there's a lot of 1 bits in the multiplier then eor #$ff, and fix up later.
It will be close. Three 16x16 multiplies in the best case is already 561 cycles, then a lot of adding, say another 160, estimating over 720 cycles.
There is already a 32x32=64 in <750 cycles. It uses the sqr tables but with a variation on the inner columns where most of the adding of partial products is done in the column. In that case, optimized adding is most important. I believe it comes down to
txa:adc #k3:ldx #0:ldy y3:adc (sqr),y:bcc s1:inx:clc s1...sta z+3.
This looks strange, but remember the sqr method subtracts two table look ups. Some adjustments have to be made, but I add the 2's complement so that I add them all in a row. There's some adjustments which creates constants kn. I have to fudge the sqr tables. There's up to 14 adds in a row in the middle columns. There's a total of 64 adds which are averaging <11 cycles each, so the bulk of the routine is <704 cycles.
I haven't published that one yet.
You could also try the sqr formula with larger inputs, or the Knuth method.
Given x=x1*256+x0, y=y1*256+y0
y1 y0
*x1 x0
Calculation Range
k=(y1-y0)*(x1-x0) +-FE01
l=y1*x1 0-FE01
m=y0*x0 0-FE01
n=l+m-k 0-1FC02
l+m 0-1FC02
l-k or m-k +-FE01
x*y=l*65536+n*256+m (for 8 bit Xn/Yn)
Example (16x16 bits):
y1 y0 ->$20 10
*x1 x0 40 30
k=(20-10)*(40-30)=100
l=20*40=800
m=10*30=300
n=800+300-100=a00
llll0000 -> 08000000
nnnnn00 00a0000
mmmm 0300
----------------
2010*4030 = 080a0300
x*y=l*4294967296+n*65536+m (for 16 bit Xn/Yn)
---
In the situation of multiplying negative numbers via shift and add, indeed you can save time by checking if there's a lot of 1 bits in the multiplier then eor #$ff, and fix up later.