The following is a simple 32 bit multiply routine.
I'm sure there are better ways to approach this problem,
but I think material for comparison and study is always useful.
77 bytes.
If you compare your routine to this one: http://www.6502.org/source/integers/32muldiv.htm you'll find that it does the multiply top down. This eliminates the need to expand one of the operands to 64 bit and at the same time reduces the add routine to 32 bit. It is only larger than yours to avoid most indexed loops and improve the cycle count.
On the other hand not doing the 32 bit shift for the multiplier but dividing it into 8 bit single shift groups may actually be faster. However in the top down multiply the multiplier could be overlapped with the lower 32 bits of the product avoiding the extra multiplier shifts completely.
edit: to ilustrate I modified the top down multiply as follows.
;32 bit multiply with 64 bit product
.data
PROD: .ds 8 ;64 bit product
MULR: .equ PROD ;32 bit multiplier overlaps product
MULND: .ds 4 ;32 bit multiplicand
.code
MULTIPLY: lda #$00
sta PROD+4 ;Clear upper half of
sta PROD+5 ;product
sta PROD+6
sta PROD+7
ldx #$21 ;Set binary count to 32+1
bne START_R
SHIFT_R:
; lsr MULR+3 ;Shift multiplyer right
; ror MULR+2 ; now part of product shift
; ror MULR+1
; ror MULR
bcc ROTATE_R ;Go rotate right if c = 0
lda PROD+4 ;Get upper half of product
clc ; and add multiplicand to
adc MULND ; it
sta PROD+4
lda PROD+5
adc MULND+1
sta PROD+5
lda PROD+6
adc MULND+2
sta PROD+6
lda PROD+7
adc MULND+3
ROTATE_R: ror a ;Rotate partial product
sta PROD+7 ; right
ror PROD+6
ror PROD+5
ror PROD+4
START_R: ror PROD+3 ;shift multiplier while shifting product
ror PROD+2
ror PROD+1
ror PROD
dex ;Decrement bit count and
bne SHIFT_R ; loop until 32 bits are done
; floating point only
; clc ;Add dps and put sum in MULXP2
; lda MULXP1
; adc MULXP2
; sta MULXP2
rts
edit2: of course the loop count in the modified code must be +1 for the initial multiplier shift!
Last edited by Klaus2m5 on Fri Dec 22, 2017 12:01 pm, edited 1 time in total.
If you compare your routine to this one: http://www.6502.org/source/integers/32muldiv.htm you'll find that it does the multiply top down. This eliminates the need to expand one of the operands to 64 bit and at the same time reduces the add routine to 32 bit. It is only larger than yours to avoid most indexed loops and improve the cycle count.
Thank you! All those zeroes did 'bother' me. Maybe I'll try this alternative some time. Your way is clearly better.
Quote:
On the other hand not doing the 32 bit shift for the multiplier but dividing it into 8 bit single shift groups may actually be faster. However in the top down multiply the multiplier could be overlapped with the lower 32 bits of the product avoiding the extra multiplier shifts completely.
Very interesting. I thought of not using indexing and loops, but then I decided to aim for a lower byte count this time. I wonder how large an 'un-looped' alternative would be...
Thank you for your comment and very good looking SBC, by the way.
(I see the 'edit' to your post even as I type this reply. Thank you, very clear code.)
[corrected typing error: 'lower' instead of 'slower' around "...to aim for a lower byte count..."]
;=================================================================
;=================================================================
;=================================================================
;32bit multiply -- CT32X02 -- 2017, jsii
;(not thoroughly tested)
;Uses .AXY +1
;
;Operation: C = A * B
; where C = 64 bit result
; A = 32 bit #1
; B = 32 bit #2
;
; *** #1 changed
;
;Howto:
;
;$C200 - $C203 = #1 (ALSO SET $C204-$C208 TO 00)
;$C210 - $C213 = #2
;
;
; -> CT32X02 -> $C200 - $C207 = #1 * #2
;
;Ex.: (assuming $C200-C213 = 00)
;
; LDA #$07
; LDX #$03
; STA $C200
; STX $C210
; JSR CT32X02
; LDA $C200
; LDX $C201
; LDY $C202
; ; .. Here, .Y, .X, .A = first 24 bits of result
; ; .. so that .Y * 65536 + .X *256 + .A = 24 bit result number
;
CT32X02:
LDY #$21
YMEX00:
LDX #$08
LSR $C208
XMEX01:
ROR $C1FF,X
DEX
BNE XMEX01
BCC YMEX01
CLC
PHP
XMEX00:
PLP
LDA $C204,X
ADC $C210,X
STA $C204,X
PHP
INX
CPX #$05
BNE XMEX00
PLP
YMEX01:
DEY
BNE YMEX00
RTS ;38 BYTES
;
;=================================================================
;=================================================================
;=================================================================
;>>> END ROUTINE
I suppose all that x86 programming finally shows its effect; must shape up. I'm sure there must be better ways to do this yet both in terms of byte count and speed. 37 bytes would be good.
Based on the routine placed just above, if we don't mind 'upsetting' the carry down below,
I think we can take it down to 37 bytes; 35 bytes if no 'user bytes' are needed.
37 bytes.
;=================================================================
;=================================================================
;=================================================================
;32bit multiply -- CT32X02 -- 2017, jsii
;(not thoroughly tested)
;Uses .AXY +1
;
;Operation: C = A * B
; where C = 64 bit result
; A = 32 bit #1
; B = 32 bit #2
;
; *** #1 changed
;
;Howto:
;
;$C200 - $C203 = #1 (ALSO SET $C204-$C208 TO 00)
;$C210 - $C213 = #2
;
;
; -> CT32X02 -> $C200 - $C207 = #1 * #2
;
;Ex.: (assuming $C200-C213 = 00)
;
; LDA #$07
; LDX #$03
; STA $C200
; STX $C210
; JSR CT32X02
; LDA $C200
; LDX $C201
; LDY $C202
; ; .. Here, .Y, .X, .A = first 24 bits of result
; ; .. so that .Y * 65536 + .X *256 + .A = 24 bit result number
;
CT32X02:
LDY #$21
YMEX00:
LDX #$09
XMEX01:
ROR $C1FF,X
DEX
BNE XMEX01
BCC YMEX01
CLC
PHP
XMEX00:
PLP
LDA $C204,X
ADC $C210,X
STA $C204,X
PHP
INX
CPX #$05
BNE XMEX00
PLP
YMEX01:
DEY
BNE YMEX00
RTS ;35 BYTES
NOP ;USER OR F.E.
NOP ;USER OR F.E.
;37 BYTES.
;
;=================================================================
;=================================================================
;=================================================================
;>>> END ROUTINE
Definitely must shape up. I'm sure there must be better ways to do this yet both in terms of byte count and speed. 32 bytes would be good. "A 32 byte 32-bit multiply routine" sounds good to me.
out of the function, too? Further, if you inline this, you can get rid of the RTS (and everything else; the actual length of the subroutine becomes 0...)
out of the function, too? Further, if you inline this, you can get rid of the RTS (and everything else; the actual length of the subroutine becomes 0...)
I hear that bile is good around the liver. Anyway, I'm not sure what you mean by 'cheat', by 'function' or by 'inline' for that matter. I wonder how a 6502 feels about cheating, about "well behaved" code, about pretty comments or about nifty indenting altogether. Cheers.