>Yeah, I would certainly be very interested to see a full 32x32 version.
It will be close. Three 16x16 multiplies in the best case is already 561 cycles, then a lot of adding, say another 160, estimating over 720 cycles.
There is already a 32x32=64 in <750 cycles. It uses the sqr tables but with a variation on the inner columns where most of the adding of partial products is done in the column. In that case, optimized adding is most important. I believe it comes down to
txa:adc #k3:ldx #0:ldy y3:adc (sqr),y:bcc s1:inx:clc s1...sta z+3.
This looks strange, but remember the sqr method subtracts two table look ups. Some adjustments have to be made, but I add the 2's complement so that I add them all in a row. There's some adjustments which creates constants kn. I have to fudge the sqr tables. There's up to 14 adds in a row in the middle columns. There's a total of 64 adds which are averaging <11 cycles each, so the bulk of the routine is <704 cycles.
I haven't published that one yet.
You could also try the sqr formula with larger inputs, or the Knuth method.
Given x=x1*256+x0, y=y1*256+y0
y1 y0
*x1 x0
Calculation Range
k=(y1-y0)*(x1-x0) +-FE01
l=y1*x1 0-FE01
m=y0*x0 0-FE01
n=l+m-k 0-1FC02
l+m 0-1FC02
l-k or m-k +-FE01
x*y=l*65536+n*256+m (for 8 bit Xn/Yn)
Example (16x16 bits):
y1 y0 ->$20 10
*x1 x0 40 30
k=(20-10)*(40-30)=100
l=20*40=800
m=10*30=300
n=800+300-100=a00
llll0000 -> 08000000
nnnnn00 00a0000
mmmm 0300
----------------
2010*4030 = 080a0300
x*y=l*4294967296+n*65536+m (for 16 bit Xn/Yn)
---
In the situation of multiplying negative numbers via shift and add, indeed you can save time by checking if there's a lot of 1 bits in the multiplier then eor #$ff, and fix up later.
Comparing 6502 multiply routines
Re: Comparing 6502 multiply routines
I just realized a note I had. Since the carries are noted in X, you can eliminate all the CLC and just sbc id,x at the end, where id(x)=x. In this case, I save 14xCLC or 28 cycles but add 4, still saving 24 cycles. These savings apply to two inner columns of 14 adds and more, until a break even amount of adds is reached in the outer columns.
Put another way, the adds average to about 9.5 cycles each (instead of 11). That makes the bulk of the routine in the 608 cycle range.
Put another way, the adds average to about 9.5 cycles each (instead of 11). That makes the bulk of the routine in the 608 cycle range.
Re: Comparing 6502 multiply routines
Nice optimisations!
- lightbeing
- Posts: 11
- Joined: 15 Oct 2023
- Location: Paris (France)
Re: Comparing 6502 multiply routines
TobyLobster wrote:
I have exhaustively tested and compared the performance and memory use of over 60 integer multiplication routines from the internet. (EDIT: Now over 120 routines)
I also explore the algorithms and techniques behind them:
https://github.com/TobyLobster/multiply_test
I also explore the algorithms and techniques behind them:
https://github.com/TobyLobster/multiply_test
I recommended your GitHub page through a video explaining, in French, the subtleties of multiplication with the processor 6502.
https://youtu.be/k3cSGC1skDs
Re: Comparing 6502 multiply routines
Hi!
Over the weekend, I found myself writing a routine to calculate "factor1*factor2+offset", a multiply-add, with X and Y 8 bit numbers, and P an 16 bit number. Some of the routines can be reworked to do this faster, I settled in the mult21 with adding only one "adc" at the end, for 20 bytes:
This is useful for calculating screen pointers, where you need "Y * WIDTH + X", or "X * WIDTH + SCREEN_ADDRESS". Perhaps other routines could be modified in the same way.
Have Fun!
Over the weekend, I found myself writing a routine to calculate "factor1*factor2+offset", a multiply-add, with X and Y 8 bit numbers, and P an 16 bit number. Some of the routines can be reworked to do this faster, I settled in the mult21 with adding only one "adc" at the end, for 20 bytes:
Code: Select all
; Input in FACTOR1, FACTOR2, OFFSET_LOW and OFFSET_HI, result in FACTOR1 (low) and FACTOR2 (high)
mult
lda offset_low
ldx #8
lsr factor1
loop
bcc no_add
clc
adc factor2
no_add
ror
ror factor1
dex
bne loop
adc offset_hi
; sta factor2
rts
Have Fun!
- BigDumbDinosaur
- Posts: 9425
- Joined: 28 May 2009
- Location: Midwestern USA (JB Pritzker’s dystopia)
- Contact:
Re: Comparing 6502 multiply routines
dmsc wrote:
This is useful for calculating screen pointers, where you need "Y * WIDTH + X", or "X * WIDTH + SCREEN_ADDRESS". Perhaps other routines could be modified in the same way.
For an X × Y text display, in which cursor position is zero-based from top left of the display, use of a lookup table to map Y to the starting address for the row on which the cursor is to be positioned is bound to be faster than computing that address by multiplying. That’s how it was done in the Commodore 64.
x86? We ain't got no x86. We don't NEED no stinking x86!
Re: Comparing 6502 multiply routines
@TobyLobster
You might find this interesting
https://github.com/halifaxgeorge-bot/Em ... 8/plot.png
Still need to double-check the cycle counts to see if they are the same as yours. You wanna hear something funny? I wrote it because I couldn't compile yours...
There's also a new record in there btw.
You might find this interesting
https://github.com/halifaxgeorge-bot/Em ... 8/plot.png
Still need to double-check the cycle counts to see if they are the same as yours. You wanna hear something funny? I wrote it because I couldn't compile yours...
There's also a new record in there btw.