I've attached a 16 x 16 => 32 Booth multiplication routine that I introduced
here. Using the PyAsm65 assembler and the Py65 M65C02A model, I've been able to test it more extensively. I found a couple of errors in the originally defined implementation having to do with operations modifying or not modifying the status flags: NVZC. I also found an error in the implementation in the Py65 cpu model for the M65C02A-specific asr.w a, arithmetic shift rigth accumulator, instruction.
With these corrections, the routine correctly multiplies -32768 * -32768, and several other test cases. Alternating 1's and 0's in the multiplier results in the longest execution time: 411 cycles. The shortest execution time is 0 * x: 303 cycles. The CPI is 1.76 and the average instruction length is 1.63. (Note: different arguments will vary these values a bit.) These results compare favorably with the 16 x16 => 16 rem 16 unsigned division algorithm described / reported yesterday.
Like the division routine, this routine is intended to be an element in the run-time library of the M65C02A Pascal Compiler. Therefore, the return address is at offset 1 from the stack pointer, and the multiplier and the multiplicand are at offset 3 and 5 from the stack pointer, respectively.
Unlike a normal implementation of the Booth algorithm, the multiplier is not stored in the lower half of the product register and right shifted as the algorithm is executed. Instead, the product register, {ATOS, ANOS}, is initialized to 0, and the multiplier is stored in ABOS in bit reversed order so that the Booth recoding logic can use the C and N flags. This approach was possible because the M65C02A includes an instruction to bit reverse the ATOS register. All three registers of the A register are used, and YTOS is used as the loop counter.
Code:
( 1) ; ; signed multiplication: 16 x 16 => 32
( 2) ; .cod 512
( 3) ; _imul .proc
( 4) 0200 A010 ; ldy #16
( 5) 0202 A900 ; lda #0
( 6) 0204 0B ; dup a
( 7) 0205 0B ; dup a
( 8) 0206 CBB503 ; lda.w 3,S
( 9) 0209 9B2B ; rev
( 10) 020B AB090000 ; ora.w #0
( 11) 020F 18 ; clc
( 12) 0210 2B ; rot a
( 13) ; ;
( 14) 0211 8003 ; bra _imul_TstB
( 15) ; ;
( 16) ; _imul_Lp
( 17) 0213 AB0A ; asl.w a
( 18) 0215 2B ; rot a
( 19) ; _imul_TstB
( 20) 0216 9008 ; bcc _imul_SubShft
( 21) ; ;
( 22) ; _imul_AddShft
( 23) 0218 300C ; bmi _imul_ShftP
( 24) ; _imul_AddM
( 25) 021A 18 ; clc
( 26) 021B CB7505 ; adc.w 5,S
( 27) 021E 8006 ; bra _imul_ShftP
( 28) ; ;
( 29) ; _imul_SubShft
( 30) 0220 1004 ; bpl _imul_ShftP
( 31) ; _imul_SubM
( 32) 0222 38 ; sec
( 33) 0223 CBF505 ; sbc.w 5,S
( 34) ; ;
( 35) ; _imul_ShftP
( 36) 0226 BB4A ; asr.w a
( 37) 0228 2B ; rot a
( 38) 0229 AB6A ; ror.w a
( 39) 022B 2B ; rot a
( 40) ; ;
( 41) ; _imul_Dec
( 42) 022C 88 ; dey
( 43) 022D D0E4 ; bne _imul_Lp
( 44) ; ;
( 45) ; _imul_Exit
( 46) 022F 2B ; rot a
( 47) 0230 60 ; rts
( 48) ; ;
( 49) ; .endp
( 50) ; .end
I would like to see a similar multiplication routine for the '816 for comparison purposes.