I've been following this discussion and have to say it's been interesting.
That said, if you really want to see an improvement in performance as well as a reduction in code size, try writing the multiply algorithm for the 65C816 running in native mode with a 16 bit accumulator. You will be pleasantly amazed at what is gained by processing the operands a word at a time.
Code:
;imul: FACA = FACA × FACB (64 bit integer multiplication)
;
; —————————————————————————————————————————————————
; Preparatory Ops : FACA: 32 bit multiplicand
; FACB: 32 bit multiplier
;
; Computed Returns: FACA: 64 bit product
; FACB: entry value
;
; Register Usage : .A: used
; .B: used
; .X: used
; .Y: truncated to 8 bits
;
; MPU Flags: NVmxDIZC
; ||||||||
; |||||||+———> 0
; ||||||+————> undefined
; |||||+—————> entry value
; ||+++——————> 0
; ++—————————> undefined
;
; Notes: 1) Product is undefined if either operand
; is greater than $FFFFFFFF.
; 2) A call to CLRFACA must be made before
; FACA is loaded with the multiplicand.
; —————————————————————————————————————————————————
;
imul longa ;16 bit accumulator
shortx ;8 bit index
cld ;ensure binary mode
ldx #s_bdword+1 ;bits to process +1
clc
;
.0000010 ror faca+s_long+s_word;rotate...
ror faca+s_long ;out...
ror faca+s_word ;a...
ror faca ;bit
bcc .0000020 ;0, skip ahead
;
clc
lda faca+s_long ;partial product
adc facb ;multiplier
sta faca+s_long ;new partial product
lda faca+s_long+s_word;ditto
adc facb+s_word ;ditto
sta faca+s_long+s_word;ditto
;
.0000020 dex
bne .0000010 ;next bit
;
longx ;16 bit index
rts
In the above, symbols such as S_WORD and S_LONG define data type sizes in bytes. S_BDWORD defines the number of bits in a double word (DWORD).
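For anyone who wants to check the logic without a 65C816 at hand, here is a rough Python model of the IMUL loop. It is only a sketch: the name `imul_model` is made up for illustration, and it assumes S_BDWORD is 32 (a 32 bit DWORD), that FACA is held as four little-endian 16 bit words, and that the upper half of FACA is zero on entry, as a prior call to CLRFACA would leave it.

```python
MASK16 = 0xFFFF
S_BDWORD = 32  # assumption: number of bits in a 32 bit double word


def imul_model(multiplicand, multiplier):
    """Model of the IMUL loop: 32 x 32 -> 64 bit shift-and-add multiply."""
    # FACA as four little-endian 16 bit words; the upper two start cleared.
    faca = [multiplicand & MASK16, (multiplicand >> 16) & MASK16, 0, 0]
    # FACB as two little-endian 16 bit words.
    facb = [multiplier & MASK16, (multiplier >> 16) & MASK16]
    carry = 0  # CLC before the loop

    for _ in range(S_BDWORD + 1):  # LDX #s_bdword+1
        # Four chained RORs, most significant word first.
        for i in (3, 2, 1, 0):
            wide = faca[i] | (carry << 16)
            carry = wide & 1
            faca[i] = wide >> 1
        if carry:
            # A 1 bit was rotated out: add FACB to the upper half of FACA.
            carry = 0  # CLC
            for i in (0, 1):
                total = faca[2 + i] + facb[i] + carry
                faca[2 + i] = total & MASK16
                carry = total >> 16
        # The carry (0, or the carry out of the add) feeds the next ROR chain.

    return sum(word << (16 * i) for i, word in enumerate(faca))


if __name__ == "__main__":
    print(hex(imul_model(0xFFFFFFFF, 0xFFFFFFFF)))
```

Each pass handles one bit of the 32 bit value originally in FACA but moves 16 bits per instruction, which is where the gain over an 8 bit implementation comes from. The extra (33rd) pass is the final shift that lines up the product.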
Code:
;idiv: FACA = FACA ÷ FACB (64 bit integer division)
;
; —————————————————————————————————————————————————
; Preparatory Ops : FACA: 64 bit dividend
; FACB: 64 bit divisor
;
; Computed Returns: FACA: 64 bit quotient
; FACB: entry value
; FACC: used
;
; Register Returns: .A: used
; .B: used
; .X: remainder LSW
; .Y: remainder MSW
;
; MPU Flags: NVmxDIZC
; ||||||||
; |||||||+———> 0: quotient valid
; ||||||| 1: division by zero
; ||||||+————> 0: remainder != zero
; |||||| 1: remainder == zero
; |||||+—————> entry value
; ||+++——————> 0
; ++—————————> undefined
;
; NOTES: 1) All values are in little-endian format.
; 2) The remainder will also be in _OVRFLO_.
; —————————————————————————————————————————————————
;
idiv longa ;16 bit accumulator
shortx ;8 bit index
cld ;ensure binary mode
jsr clrfacc ;clear tertiary accumulator
jsr clrovrfl ;clear overflow
ldx #s_dlong-s_word
;
;
; perform divide-by-zero check...
;
.0000010 lda facb,x
bne idiv01
;
.rept s_word ;decrement .X twice
dex
.endr
bpl .0000010
;
tax
tay
longx
sec ;division by zero error
rts
;
idiv01 ldy #s_blword
clc
;
; shift FACA left one bit, moving its top bit into _OVRFLO_...
;
.0000010 phy
ldx #0
ldy #_loopct_
;
.0000020 rol faca,x
.rept s_word
inx
.endr
dey
bne .0000020
;
ldx #0
ldy #_loopct_
;
.0000030 rol _ovrflo_,x
.rept s_word
inx
.endr
dey
bne .0000030 ;next word
;
; trial subtraction: FACC = _OVRFLO_ - FACB...
;
ldx #0
ldy #_loopct_
sec
;
.0000040 lda _ovrflo_,x
sbc facb,x
sta facc,x
.rept s_word
inx
.endr
dey
bne .0000040
;
bcc .0000060
;
ldx #s_dlong-s_word ;index of most significant word
;
.0000050 lda facc,x
sta _ovrflo_,x
.rept s_word
dex
.endr
bpl .0000050
;
.0000060 ply
dey
bne .0000010
;
; rotate the final quotient bit into FACA...
;
tyx
ldy #_loopct_
;
.0000070 rol faca,x
.rept s_word
inx
.endr
dey
bne .0000070
;
longx
ldx _ovrflo_
ldy _ovrflo_+s_word
txa
ora _ovrflo_+s_word
clc
rts
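The division can be modeled the same way. The Python below is a sketch of the restoring (shift-and-subtract) division that the IDIV listing appears to implement, not a transcription of it. The names `idiv_model`, `to_words`, `from_words` and `rol_chain` are made up for illustration. It assumes 64 bit operands held as four little-endian 16 bit words, 64 passes through the main loop (the listing does not show the value of S_BLWORD), and that every multi-word loop covers all four words.

```python
MASK16 = 0xFFFF


def to_words(value, count):
    """Split an integer into `count` little-endian 16 bit words."""
    return [(value >> (16 * i)) & MASK16 for i in range(count)]


def from_words(words):
    """Join little-endian 16 bit words back into one integer."""
    return sum(word << (16 * i) for i, word in enumerate(words))


def rol_chain(words, carry):
    """ROL each word in place, least significant first; return the carry out."""
    for i in range(len(words)):
        wide = (words[i] << 1) | carry
        words[i] = wide & MASK16
        carry = wide >> 16
    return carry


def idiv_model(dividend, divisor, bits=64):
    """Return (quotient, remainder) using shift-and-subtract division."""
    if divisor == 0:
        # The listing signals this case by returning with carry set.
        raise ZeroDivisionError("division by zero")
    count = bits // 16
    faca = to_words(dividend, count)  # dividend, becomes the quotient
    facb = to_words(divisor, count)   # divisor, unchanged
    ovrflo = [0] * count              # running remainder
    carry = 0                         # CLC before the loop

    for _ in range(bits):
        # Shift the previous quotient bit in and the next dividend bit out...
        carry = rol_chain(faca, carry)
        # ...and move that dividend bit into the running remainder.
        rol_chain(ovrflo, carry)
        # Trial subtraction: SEC, then SBC one word at a time.
        carry = 1
        facc = []
        for i in range(count):
            diff = ovrflo[i] - facb[i] - (1 - carry)
            facc.append(diff & MASK16)
            carry = 0 if diff < 0 else 1
        if carry:
            ovrflo = facc  # no borrow: keep the difference

    rol_chain(faca, carry)  # rotate the final quotient bit into place
    return from_words(faca), from_words(ovrflo)


if __name__ == "__main__":
    print(idiv_model(100, 7))
```

The carry left by the trial subtraction doubles as the quotient bit, which is why the quotient register needs one more rotate after the main loop finishes.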
I haven't yet had the pleasure of trying out the 65C816, but I'm looking forward to it. I hear great things about it, and your listing looks quite interesting. Thank you. Cheers and a Happy New Year to you as well!