I'm fairly sure I've read this before - here or over on the retrocomputing forum, but can't for the life of my find a recent thread on it (pointers welcome!)
However this site pops up from time to time:
http://www.txbobsc.com/aal/1986/aal8603.htmland in it there is possibly the fastest unsigned 8x8 multiply routine for the 6502. I decided to implement it in my Ruby system to see how it compares to my existing code... Especially when used to implement a 32x32 multiply as is used inside my BCPL system which is a 32-bit bytecode VM.
The good news is that it's faster than my native 32x32 multiply, the bad news is that it's slower than my "hardware arithmetic accelerator" which is an ATmega which is effectively an 8-bit CPU which also runs at 16Mhz. The advantage the ATmega has is that it has an on-board single cycle 8x8 multiply instruction. However it's only just faster due mostly to the latency of the calls between the 6502/816 side and the ATmega side, so while an interesting exercise, I'm yet again reminded of Knuth's
Premature Optimisation quote.
The code below is in my BCPL VM minus the entry and exit stuff and handling signed numbers - regA and regQ are 32-bits wide located in zero/direct page, regW is 64-bits wide, again in zero page. It implements what's essentially a long multiplication of the bytes, adding them into regW at each step which was previously zeroed. It only needs to do enough to produce a 32-bit result.
Inlining the calls to mul8 will save a JSR/RTS, but again, Knuth is tapping my shoulder as there's only 10 calls, so that would save (6+6) * 10 = 120 cycles... Hmm. Well, might be worth it but at the expense of a lot more code space...
There is no doubt this is faster, so if you need an 8x8 (or more) multiply then this is it. My initial 65816 code for 32x32 mul takes just over 2000 cycles...
Also on that site is code for an 816 ('802) but it's going to be very much slower (at least I think it will be - I'll test it at some point). My own thought are that if I have enough RAM (needs 128KB) I can implement a lookup table which would enable me to do an 8x8 multiply in 16-bit mode with 2 simple lookups and none of the arithmetic needed to handle the table of squares / 4. I will give that a go when I upgrade my Ruby816 board to 1MB of RAM, but with 512KB it's a bit tight to take half it's RAM and leave enough to run the BCPL compiler in.
Code:
aix8
; 32-32 multiply of
; P Q R S
; x T U V W
digS = regA+0
digR = regA+1
digQ = regA+2
digP = regA+3
digW = regQ+0
digV = regQ+1
digU = regQ+2
digT = regQ+3
.macro doMul8 d1,d2,res
lda d1
ldx d2
jsr mul8
tay ; Temp. save high byte of result
txa ; Low byte of result into A
clc
adc res
sta res
tya ; Restore high byte of result
adc res+1
sta res+1
.endmacro
doMul8 digW,digS,regW+0
doMul8 digW,digR,regW+1
doMul8 digW,digQ,regW+2
doMul8 digW,digP,regW+3
doMul8 digV,digS,regW+1
doMul8 digV,digR,regW+2
doMul8 digV,digQ,regW+3
; doMul8 digV,digP,regW+4
doMul8 digU,digS,regW+2
doMul8 digU,digR,regW+3
; doMul8 digU,digQ,regW+4
; doMul8 digU,digP,regW+5
doMul8 digT,digS,regW+3
; doMul8 digT,digR,regW+4
; doMul8 digT,digQ,regW+5
; doMul8 digT,digP,regW+6
aix16
I've attached my versions of the code on the page above:
Attachment:
mul8.s [8.62 KiB]
Downloaded 93 times