Comparing 6502 multiply routines

repose · Post by **repose** » Thu Aug 08, 2024 8:50 am

>Yeah, I would certainly be very interested to see a full 32x32 version.
It will be close. Three 16x16 multiplies in the best case is already 561 cycles, then a lot of adding, say another 160, estimating over 720 cycles.
There is already a 32x32=64 in <750 cycles. It uses the sqr tables but with a variation on the inner columns where most of the adding of partial products is done in the column. In that case, optimized adding is most important. I believe it comes down to
txa:adc #k3:ldx #0:ldy y3:adc (sqr),y:bcc s1:inx:clc s1...sta z+3.
This looks strange, but remember the sqr method subtracts two table look ups. Some adjustments have to be made, but I add the 2's complement so that I add them all in a row. There's some adjustments which creates constants kn. I have to fudge the sqr tables. There's up to 14 adds in a row in the middle columns. There's a total of 64 adds which are averaging <11 cycles each, so the bulk of the routine is <704 cycles.
I haven't published that one yet.
You could also try the sqr formula with larger inputs, or the Knuth method.
Given x=x1*256+x0, y=y1*256+y0
y1 y0
*x1 x0

Calculation Range
k=(y1-y0)*(x1-x0) +-FE01
l=y1*x1 0-FE01
m=y0*x0 0-FE01
n=l+m-k 0-1FC02
l+m 0-1FC02
l-k or m-k +-FE01

x*y=l*65536+n*256+m (for 8 bit Xn/Yn)

Example (16x16 bits):
y1 y0 ->$20 10
*x1 x0 40 30

k=(20-10)*(40-30)=100
l=20*40=800
m=10*30=300
n=800+300-100=a00

llll0000 -> 08000000
nnnnn00 00a0000
mmmm 0300
----------------
2010*4030 = 080a0300

x*y=l*4294967296+n*65536+m (for 16 bit Xn/Yn)
---
In the situation of multiplying negative numbers via shift and add, indeed you can save time by checking if there's a lot of 1 bits in the multiplier then eor #$ff, and fix up later.

repose · Post by **repose** » Thu Aug 08, 2024 10:00 am

I just realized a note I had. Since the carries are noted in X, you can eliminate all the CLC and just sbc id,x at the end, where id(x)=x. In this case, I save 14xCLC or 28 cycles but add 4, still saving 24 cycles. These savings apply to two inner columns of 14 adds and more, until a break even amount of adds is reached in the outer columns.
Put another way, the adds average to about 9.5 cycles each (instead of 11). That makes the bulk of the routine in the 608 cycle range.

BigEd · Post by **BigEd** » Thu Aug 08, 2024 10:25 am

Nice optimisations!

lightbeing · Post by **lightbeing** » Mon Jun 16, 2025 2:20 pm

TobyLobster wrote:

I have exhaustively tested and compared the performance and memory use of over 60 integer multiplication routines from the internet. (EDIT: Now over 120 routines)

I also explore the algorithms and techniques behind them:

https://github.com/TobyLobster/multiply_test

Impressive job. Thank you so much!
I recommended your GitHub page through a video explaining, in French, the subtleties of multiplication with the processor 6502.
https://youtu.be/k3cSGC1skDs

dmsc · Post by **dmsc** » Tue Jun 17, 2025 1:45 am

Hi!

Over the weekend, I found myself writing a routine to calculate "factor1*factor2+offset", a multiply-add, with X and Y 8 bit numbers, and P an 16 bit number. Some of the routines can be reworked to do this faster, I settled in the mult21 with adding only one "adc" at the end, for 20 bytes:

Code: Select all

; Input in FACTOR1, FACTOR2, OFFSET_LOW and OFFSET_HI, result in FACTOR1 (low) and FACTOR2 (high)
mult
    lda offset_low
    ldx #8
    lsr factor1
loop
    bcc no_add
    clc
    adc factor2
no_add
    ror
    ror factor1
    dex
    bne loop
    adc offset_hi
;    sta factor2
    rts

This is useful for calculating screen pointers, where you need "Y * WIDTH + X", or "X * WIDTH + SCREEN_ADDRESS". Perhaps other routines could be modified in the same way.

Have Fun!

BigDumbDinosaur · Post by **BigDumbDinosaur** » Tue Jun 17, 2025 6:38 am

dmsc wrote:

This is useful for calculating screen pointers, where you need "Y * WIDTH + X", or "X * WIDTH + SCREEN_ADDRESS". Perhaps other routines could be modified in the same way.

For an X × Y text display, in which cursor position is zero-based from top left of the display, use of a lookup table to map Y to the starting address for the row on which the cursor is to be positioned is bound to be faster than computing that address by multiplying. That’s how it was done in the Commodore 64.

repose · Post by **repose** » Fri Mar 06, 2026 9:14 pm

@TobyLobster
You might find this interesting
https://github.com/halifaxgeorge-bot/Em ... 8/plot.png

Still need to double-check the cycle counts to see if they are the same as yours. You wanna hear something funny? I wrote it because I couldn't compile yours...
There's also a new record in there btw.

Comparing 6502 multiply routines

Re: Comparing 6502 multiply routines

Re: Comparing 6502 multiply routines

Re: Comparing 6502 multiply routines

Re: Comparing 6502 multiply routines

Re: Comparing 6502 multiply routines

Re: Comparing 6502 multiply routines

Re: Comparing 6502 multiply routines