16 x 16 Signed Multiplication with M65C02A

MichaelM · Post by **MichaelM** » Thu Oct 11, 2018 12:12 am

I've attached a 16 x 16 => 32 Booth multiplication routine that I introduced here. Using the PyAsm65 assembler and the Py65 M65C02A model, I've been able to test it more extensively. I found a couple of errors in the originally defined implementation having to do with operations modifying or not modifying the status flags: NVZC. I also found an error in the implementation in the Py65 cpu model for the M65C02A-specific asr.w a, arithmetic shift rigth accumulator, instruction.

With these corrections, the routine correctly multiplies -32768 * -32768, and several other test cases. Alternating 1's and 0's in the multiplier results in the longest execution time: 411 cycles. The shortest execution time is 0 * x: 303 cycles. The CPI is 1.76 and the average instruction length is 1.63. (Note: different arguments will vary these values a bit.) These results compare favorably with the 16 x16 => 16 rem 16 unsigned division algorithm described / reported yesterday.

Like the division routine, this routine is intended to be an element in the run-time library of the M65C02A Pascal Compiler. Therefore, the return address is at offset 1 from the stack pointer, and the multiplier and the multiplicand are at offset 3 and 5 from the stack pointer, respectively.

Unlike a normal implementation of the Booth algorithm, the multiplier is not stored in the lower half of the product register and right shifted as the algorithm is executed. Instead, the product register, {ATOS, ANOS}, is initialized to 0, and the multiplier is stored in ABOS in bit reversed order so that the Booth recoding logic can use the C and N flags. This approach was possible because the M65C02A includes an instruction to bit reverse the ATOS register. All three registers of the A register are used, and YTOS is used as the loop counter.

Code: Select all

(   1)                  ; ; signed multiplication: 16 x 16 => 32
(   2)                  ;         .cod 512
(   3)                  ; _imul   .proc
(   4) 0200 A010        ;         ldy #16
(   5) 0202 A900        ;         lda #0
(   6) 0204 0B          ;         dup a
(   7) 0205 0B          ;         dup a
(   8) 0206 CBB503      ;         lda.w 3,S
(   9) 0209 9B2B        ;         rev
(  10) 020B AB090000    ;         ora.w #0
(  11) 020F 18          ;         clc
(  12) 0210 2B          ;         rot a
(  13)                  ; ;
(  14) 0211 8003        ;         bra _imul_TstB
(  15)                  ; ;
(  16)                  ; _imul_Lp
(  17) 0213 AB0A        ;         asl.w a
(  18) 0215 2B          ;         rot a
(  19)                  ; _imul_TstB
(  20) 0216 9008        ;         bcc _imul_SubShft
(  21)                  ; ;
(  22)                  ; _imul_AddShft
(  23) 0218 300C        ;         bmi _imul_ShftP
(  24)                  ; _imul_AddM
(  25) 021A 18          ;         clc
(  26) 021B CB7505      ;         adc.w 5,S
(  27) 021E 8006        ;         bra _imul_ShftP
(  28)                  ; ;
(  29)                  ; _imul_SubShft
(  30) 0220 1004        ;         bpl _imul_ShftP
(  31)                  ; _imul_SubM
(  32) 0222 38          ;         sec
(  33) 0223 CBF505      ;         sbc.w 5,S
(  34)                  ; ;
(  35)                  ; _imul_ShftP
(  36) 0226 BB4A        ;         asr.w a
(  37) 0228 2B          ;         rot a
(  38) 0229 AB6A        ;         ror.w a
(  39) 022B 2B          ;         rot a
(  40)                  ; ;
(  41)                  ; _imul_Dec
(  42) 022C 88          ;         dey
(  43) 022D D0E4        ;         bne _imul_Lp
(  44)                  ; ;
(  45)                  ; _imul_Exit
(  46) 022F 2B          ;         rot a
(  47) 0230 60          ;         rts
(  48)                  ; ;
(  49)                  ;         .endp
(  50)                  ;         .end

I would like to see a similar multiplication routine for the '816 for comparison purposes.

BitWise · Post by **BitWise** » Thu Oct 11, 2018 10:44 am

I had a go for the 65C816. This works for normal values but not -32768 - Not sure where the extra bit P can be held.

The '816 shift opcodes don't support stack relative access so I opted to put the 32-bit result (initially the multiplier) and the multiplicand in direct page memory.

Code: Select all

                             ;===============================================================================
                             ;
                             ; Andrew Jacobs
                             ; 2018-10-11
                             ;-------------------------------------------------------------------------------
                             
                                             .65816
                                             
         00000000          = TRACE           .equ    0               ; Set to 1 to print PH/PL during calc.
                             
                             ;===============================================================================
                             ; Workspace
                             ;-------------------------------------------------------------------------------
                             
                                             .page0
                                             .org    $10
                                             
00:0010  0000              : PL              .space  2               ; Result
00:0012  0000              : PH              .space  2
00:0014  0000              : MU              .space  2               ; Multiplicand
                             
                             ;===============================================================================
                             ; The emulator expects the S28 to contain a ROM image so this is kicked off via
                             ; the dummy reset vector.
                             ;-------------------------------------------------------------------------------
                             
                                             .code
                                             .org    $f000
                                             
                             Reset:
00:F000  18                :                 clc                     ; Switch to native mode
00:F001  FB                :                 xce
00:F002  C230              :                 rep     #$30            ; A/X/Y 16-bit
                                             .longa  on
                                             .longi  on
                                             
                             ; Try combinations of signs
                             
00:F004  A90400            :                 lda     #4              ; Load multiplier
00:F007  A20300            :                 ldx     #3              ; .. and multiplicand
00:F00A  207DF0            :                 jsr     BoothMult
00:F00D  2053F0            :                 jsr     Hex8
00:F010  2078F0            :                 jsr     NewLine
                                             
00:F013  A9FCFF            :                 lda     #-4             ; Load multiplier
00:F016  A20300            :                 ldx     #3              ; .. and multiplicand
00:F019  207DF0            :                 jsr     BoothMult
00:F01C  2053F0            :                 jsr     Hex8
00:F01F  2078F0            :                 jsr     NewLine
                             
00:F022  A90400            :                 lda     #4              ; Load multiplier
00:F025  A2FDFF            :                 ldx     #-3             ; .. and multiplicand
00:F028  207DF0            :                 jsr     BoothMult
00:F02B  2053F0            :                 jsr     Hex8

Portable 65xx Assembler [17.12]

00:F02E  2078F0            :                 jsr     NewLine
                             
00:F031  A9FCFF            :                 lda     #-4             ; Load multiplier
00:F034  A2FDFF            :                 ldx     #-3             ; .. and multiplicand
00:F037  207DF0            :                 jsr     BoothMult
00:F03A  2053F0            :                 jsr     Hex8
00:F03D  2078F0            :                 jsr     NewLine
                                             
00:F040  A90080            :                 lda     #-32768         ; Load multiplier
00:F043  A20080            :                 ldx     #-32768         ; .. and multiplicand
00:F046  207DF0            :                 jsr     BoothMult
00:F049  2053F0            :                 jsr     Hex8
00:F04C  2078F0            :                 jsr     NewLine
                                             
                             
                             AllDone:
00:F04F  42FF              :                 wdm     #$ff            ; Stop the emulator
00:F051  80FC              :                 bra     AllDone 
                             
                             ; Output Utilities
                             
                                             .longa  on
                                             .longi  on
                             Hex8:
00:F053  48                :                 pha                     ; Save A
00:F054  8A                :                 txa                     ; Print X
00:F055  2059F0            :                 jsr     Hex4
00:F058  68                :                 pla                     ; Then  A
                             Hex4:
00:F059  48                :                 pha                     ; Save A
00:F05A  EB                :                 xba                     ; Show hi byte
00:F05B  205FF0            :                 jsr     Hex2
00:F05E  68                :                 pla                     ; Then lo byte
                             Hex2:
00:F05F  48                :                 pha                     
00:F060  4A                :                 lsr     a               ; Show hi nybble
00:F061  4A                :                 lsr     a
00:F062  4A                :                 lsr     a
00:F063  4A                :                 lsr     a
00:F064  2068F0            :                 jsr     Hex
00:F067  68                :                 pla                     ; Then lo nybble
                             Hex:
00:F068  E220              :                 sep     #$20
                                             .longa  off
00:F06A  290F              :                 and     #$0f            ; Binary -> ASCII
00:F06C  F8                :                 sed
00:F06D  18                :                 clc
00:F06E  6990              :                 adc     #$90
00:F070  6940              :                 adc     #$40
00:F072  D8                :                 cld
00:F073  C220              :                 rep     #$20
                                             .longa  on
                             
                             UartTx:
00:F075  4201              :                 wdm     #$01            ; Print a character
00:F077  60                :                 rts
                             
                             NewLine:
00:F078  A90A00            :                 lda     #10             ; LF
00:F07B  80F8              :                 bra     UartTx
                             
                             ;-------------------------------------------------------------------------------
                             
                                             .longa  on
                                             .longi  on
                             BoothMult:
00:F07D  8510              :                 sta     PL              ; P =  multiplier
00:F07F  6412              :                 stz     PH
00:F081  8614              :                 stx     MU              ; A =  multiplicand
                                             
00:F083  18                :                 clc
00:F084  208CF0            :                 jsr     BoothStep       ; Perform 16 multiply steps
                             
00:F087  A510              :                 lda     PL              ; Load the result
00:F089  A612              :                 ldx     PH
00:F08B  60                :                 rts
                                             
                             BoothStep:
00:F08C  208FF0            :                 jsr     $+3             ; x16
00:F08F  2092F0            :                 jsr     $+3             ; x8
00:F092  2095F0            :                 jsr     $+3             ; x4
00:F095  2098F0            :                 jsr     $+3             ; x2
                                             
                                             .if     TRACE
                           -                 php
                           -                 php
                           -                 lda     PH
                           -                 jsr     Hex4
                           -                 lda     PL
                           -                 jsr     Hex4
                           -                 lda     #'-'
                           -                 jsr     UartTx  
                           -                 plp
                           -                 lda     #'0'
                           -                 adc     #0
                           -                 jsr     UartTx
                           -                 lda     #'-'
                           -                 jsr     UartTx
                           -                 lda     #'>'
                           -                 jsr     UartTx
                           -                 plp
                                             .endif
                             
00:F098  A510              :                 lda     PL
00:F09A  B00B              :                 if cc
00:F09C  6A                :                  ror    a
00:F09D  A512              :                  lda    PH
00:F09F  9004              :                  if cs
00:F0A1  E514              :                   sbc   MU              ; 10: P += S (or P -= A)
00:F0A3  8512              :                   sta   PH
                                              endif
00:F0A5  8009              :                 else
00:F0A7  6A                :                  ror    a
00:F0A8  A512              :                  lda    PH
00:F0AA  B004              :                  if cc
00:F0AC  6514              :                   adc   MU              ; 01: P += A
00:F0AE  8512              :                   sta   PH
                                              endif
                                             endif
00:F0B0  2A                :                 rol     a               ; ASR P
00:F0B1  6612              :                 ror     PH
00:F0B3  6610              :                 ror     PL
                                     
                                             .if     TRACE
                           -                 php
                           -                 php
                           -                 lda     PH
                           -                 jsr     Hex4
                           -                 lda     PL
                           -                 jsr     Hex4
                           -                 lda     #'-'
                           -                 jsr     UartTx  
                           -                 plp
                           -                 lda     #'0'
                           -                 adc     #0
                           -                 jsr     UartTx
                           -                 jsr     NewLine
                           -                 plp
                                             .endif
                                             
00:F0B5  60                :                 rts
                                             
                                             .org    $fffc
00:FFFC  00F0              :                 .word   Reset           ; Dummy RESET vector
                                             
                                             .end

I found a bug in my emulation of ROR debugging this. The full materials are in the zip.

MichaelM · Post by **MichaelM** » Thu Oct 11, 2018 11:25 am

Thanks for the response.

From a size perspective the two solutions appear about the same. My calculations indicate that your implementation of the multiplication routine for the '816 is 57 bytes in length. My implementation of the multiplication routine is 49 bytes in length.

I looked in the log file, and it looks like the multiplication required about 31 us. Do you have a cycle count rather than a time so that the clock rate of the processors are not a factor?

I am wondering if you felt the same call to develop and test a standard unsigned 16 x 16 division routine? I also posted a routine for that and would like to see an '816 implementation.

BitWise · Post by **BitWise** » Thu Oct 11, 2018 12:07 pm

In the first version I used X & Y to hold the +/- multiplicand value and had to use recursion to execute the 'multiply step' 16 times. As I now keep the multiplicand in memory you can remove a few instructions and a lot of cycles with a simple loop.

Code: Select all

                .longa  on
                .longi  on
BoothMult:
                sta     PL              ; P =  multiplier
                stz     PH
                stx     MU              ; A =  multiplicand

                clc
                ldx     #16
                repeat
                
                .if     TRACE
                php
                php
                lda     PH
                jsr     Hex4
                lda     PL
                jsr     Hex4
                lda     #'-'
                jsr     UartTx  
                plp
                lda     #'0'
                adc     #0
                jsr     UartTx
                lda     #'-'
                jsr     UartTx
                lda     #'>'
                jsr     UartTx
                plp
                .endif

                 lda    PL
                 if cc
                  ror   a
                  lda   PH
                  if cs
                   sbc  MU              ; 10: P += S (or P -= A)
                   sta  PH
                  endif
                 else
                  ror   a
                  lda   PH
                  if cc
                   adc  MU              ; 01: P += A
                   sta  PH
                  endif
                 endif
                 rol    a               ; ASR P
                 ror    PH
                 ror    PL
        
                .if     TRACE
                php
                php
                lda     PH
                jsr     Hex4
                lda     PL
                jsr     Hex4
                lda     #'-'
                jsr     UartTx  
                plp
                lda     #'0'
                adc     #0
                jsr     UartTx
                jsr     NewLine
                plp
                .endif

                 dex
                until eq

                lda     PL              ; Load the result
                ldx     PH
                rts

Which is 47 bytes and takes around 700 cycles to run.

MichaelM · Post by **MichaelM** » Thu Oct 11, 2018 12:35 pm

Cool. Since I don't have access to any '816-related development or simulation / emulation tools, I certainly appreciate you taking the time to write, debug, and test the routine you posted above.

The reduction in the routine length makes the M65C02A and '816 signed 16 x 16 multiplication routines essentially the same length.

The architectural differences, e.g. additional internal registers in the M65C02A register stacks, probably contribute the most to the differences in the cycle counts: 420 (M65C02A) vs 700 ('816).

Perhaps I should have expected it because both routines are implementing a common algorithm using similar instruction sets, but I was somewhat surprised by the similarity in the two routines.

BigEd · Post by **BigEd** » Thu Oct 11, 2018 2:34 pm

This is great stuff - thanks to both of you for coding and sharing.

MichaelM · Post by **MichaelM** » Tue Oct 16, 2018 8:36 pm

BitWise:

Something you said regarding the -32768 * -32768 case made me go back and check my implementation. I found that my result was 0xC000_0000 instead of the correct 0x4000_0000.

In getting the bugs out of the M65C02A Python model and the _imul() source, I had removed a feature of the M65C02A-specific asr.w a instruction that enables the restoration of the correct sign bit when the preceding add/sub operation of the Booth multiplication operation generates an arithmetic overflow. When -32768 is added to -32768 in the upper product register, the result is 0x8000_0000. The V flag correctly indicates an overflow.

I had modified the asr.w a to restore the sign by using the V flag and the N flag. If V is set, then the correct sign bit is the complement of the N flag. (If the V flag is not set, then the sign is the same as the N flag.) With this feature restored to the asr.w a instruction, the _imul() does correctly compute the product of -32768 * -32768.

16 x 16 Signed Multiplication with M65C02A

16 x 16 Signed Multiplication with M65C02A

Re: 16 x 16 Signed Multiplication with M65C02A

Re: 16 x 16 Signed Multiplication with M65C02A

Re: 16 x 16 Signed Multiplication with M65C02A

Re: 16 x 16 Signed Multiplication with M65C02A

Re: 16 x 16 Signed Multiplication with M65C02A

Re: 16 x 16 Signed Multiplication with M65C02A