Simple 32 bit multiply

jsii · Post by **jsii** » Sat Dec 30, 2017 9:58 pm

BigDumbDinosaur wrote:

I've been following this discussion and have to say it's been interesting.

That said, if you really want to see an improvement in performance as well as a reduction in code size, try writing the multiply algorithm for the 65C816 running in 16 bit native mode.

You will be pleasantly amazed at what is gained by processing the operands a word at a time.

Code: Select all

;imul: FACA = FACA × FACB  (64 bit integer multiplication)
;
;	—————————————————————————————————————————————————
;	Preparatory Ops : FACA: 32 bit multiplicand
;	                  FACB: 32 bit multplier
;
;	Computed Returns: FACA: 64 bit product
;	                  FACB: entry value
;
;	Register Usage  : .A: used
;	                  .B: used
;	                  .X: used
;	                  .Y: truncated to 8 bits
;
;	MPU Flags: NVmxDIZC
;	           ||||||||
;	           |||||||+———> 0
;	           ||||||+————> undefined
;	           |||||+—————> entry value
;	           ||+++——————> 0
;	           ++—————————> undefined
;
;	Notes: 1) Product is undefined if either operand
;	          is greater than $FFFFFFFF.
;	       2) A call to CLRFACA must be made before
;	          FACA is loaded with the multiplicand.
;	—————————————————————————————————————————————————
;
imul     longa                 ;16 bit accumulator
         shortx                ;8 bit index
         cld                   ;ensure binary mode
         ldx #s_bdword+1       ;bits to process +1
         clc
;
.0000010 ror faca+s_long+s_word;rotate...
         ror faca+s_long       ;out...
         ror faca+s_word       ;a...
         ror faca              ;bit
         bcc .0000020          ;0, skip ahead
;
         clc
         lda faca+s_long       ;partial product
         adc facb              ;multiplier
         sta faca+s_long       ;new partial product
         lda faca+s_long+s_word;ditto
         adc facb+s_word       ;ditto
         sta faca+s_long+s_word;ditto
;
.0000020 dex
         bne .0000010          ;next bit
;
         longx                 ;16 bit index
         rts

In the above, symbols such as S_WORD and S_LONG define data type sizes in bytes. S_BDWORD defines the number of bits in a double word (DWORD).

Code: Select all

;idiv: FACA = FACA ÷ FACB  (64 bit integer division)
;
;	—————————————————————————————————————————————————
;	Preparatory Ops : FACA: 64 bit dividend
;	                  FACB: 64 bit divisor
;
;	Computed Returns: FACA: 64 bit quotient
;	                  FACB: entry value
;	                  FACC: used
;
;	Register Returns: .A: used
;	                  .B: used
;	                  .X: remainder LSW
;	                  .Y: remainder MSW
;
;	MPU Flags: NVmxDIZC
;	           ||||||||
;	           |||||||+———> 0: quotient valid
;	           |||||||      1: division by zero
;	           ||||||+————> 0: remainder != zero
;	           ||||||       1: remainder == zero
;	           |||||+—————> entry value
;	           ||+++——————> 0
;	           ++—————————> undefined
;
;	NOTES: 1) All values are in little-endian format.
;	       2) The remainder will also be in _OVRFLO_.
;	—————————————————————————————————————————————————
;
idiv     longa                 ;16 bit accumulator
         shortx                ;8 bit index
         cld                   ;ensure binary mode
         jsr clrfacc           ;clear tertiary accumulator
         jsr clrovrfl          ;clear overflow
         ldx #s_dlong-s_word
;
;
;	perform divide-by-zero check...
;
.0000010 lda facb,x
         bne idiv01
;
         .rept s_word          ;decrement .X twice
           dex
         .endr
         bpl .0000010
;
         tax
         tay
         longx
         sec                   ;division by zero error
         rts
;
idiv01   ldy #s_blword
         clc
;
.0000010 phy
         ldx #0
         ldy #_loopct_
;
.0000020 rol faca,x
         .rept s_word
           inx
         .endr
         dey
         bne .0000020
;
         ldx #0
         ldy #_loopct_
;
.0000030 rol _ovrflo_,x
         .rept s_word
           inx
         .endr
         dey
         bpl .0000030
;
         ldx #0
         ldy #_loopct_
         sec
;
.0000040 lda _ovrflo_,x
         sbc facb,x
         sta facc,x
         .rept s_word
             inx
         .endr
         dey
         bne .0000040
;
         bcc .0000060
;         
         ldx #_loopct_
;
.0000050 lda facc,x
         sta _ovrflo_,x
         .rept s_word
           dex
         .endr
         bpl .0000050
;
.0000060 ply
         dey
         bne .0000010
;
         tyx
         ldy #_loopct_
;
.0000070 rol faca,x
         .rept s_word
           inx
         .endr
         dey
         bne .0000070
;
         longx
         ldx _ovrflo_
         ldy _ovrflo_+s_word
         txa
         ora _ovrflo_+s_word
         clc
         rts

I haven't yet had the pleasure of trying out the 65C816 but I'm looking forward to it. I hear great things about it and your listing looks quite interesting, Thank you. Cheers and a Happy New Year to you as well!

cbmeeks · Post by **cbmeeks** » Tue Jan 02, 2018 3:46 pm

sark02 wrote:

Isn't it interesting that after 41 or so years, people still find fun in implementing MUL, and finding ways to optimize.

I find this amusing as well. Not just with 6502 code, but 8/16-bit computers in general. I still come across forum posts were people argue over which 8-bit is best. Atari or Commodore. Sega or NES. It's hilarious and (mostly) good fun.

whartung · Post by **whartung** » Tue Jan 02, 2018 9:58 pm

cbmeeks wrote:

I still come across forum posts were people argue over which 8-bit is best. Atari or Commodore. Sega or NES. It's hilarious and (mostly) good fun.

Of course it's hilarious. Everyone knows it's the Atari...

GARTHWILSON · Post by **GARTHWILSON** » Tue Jan 02, 2018 10:18 pm

...and it brings to our attention the fact that there are plenty of factors besides the processor that determine a computer's capabilities.

jsii · Post by **jsii** » Thu Jan 04, 2018 9:30 pm

cbmeeks wrote:

sark02 wrote:

Isn't it interesting that after 41 or so years, people still find fun in implementing MUL, and finding ways to optimize.

I find this amusing as well. Not just with 6502 code, but 8/16-bit computers in general. I still come across forum posts were people argue over which 8-bit is best. Atari or Commodore. Sega or NES. It's hilarious and (mostly) good fun.

This should tell us something, shouldn't it.

jsii · Post by **jsii** » Thu Jan 04, 2018 9:32 pm

whartung wrote:

cbmeeks wrote:

I still come across forum posts were people argue over which 8-bit is best. Atari or Commodore. Sega or NES. It's hilarious and (mostly) good fun.

Of course it's hilarious. Everyone knows it's the Atari...

This kind of discussion happens all the time.

jsii · Post by **jsii** » Thu Jan 04, 2018 9:33 pm

GARTHWILSON wrote:

...and it brings to our attention the fact that there are plenty of factors besides the processor that determine a computer's capabilities.

Worth noting.

Klaus2m5 · Post by **Klaus2m5** » Fri Jan 05, 2018 2:13 pm

I made a macro for my preferred version of the multiply representing a good balance of speed vs. size. Its size is 58 bytes for 32 by 32 and includes clearing the high bytes of the product. It was tested in Kowalski and executes in 1483-2699 cycles for a 32 bits by 32 bits multiply. The 27 byte subroutine of jsii executes in 3202-5442 cycles and has a bug when the subroutine is entered with carry set. The lower number of cycles is for a multiplier of 0 and the higher number of cycles is caused by the worst case multiplier of $ffffffff.

Code: Select all

;
; MULTIPLY macro
;
; params:   %1 - address of multiplicand
;           %2 - address of multiplier (replaced by product)
;           %3 - byte count of multiplicand
;           %4 - byte count of multiplier
; product byte count = %3 + %4    !! enough space must be reserved for %2 !!
;
multiply    .macro ...
            lda #$00        ;Clear upper half of product (MSB kept in A)
.offset     .set	%4
            .repeat %3-1
                sta %2+.offset
.offset         .set .offset+1
            .endr
            ldy #%4*8+1     ;Set binary count to # multiplier bits + 1
            bne .start_r
.loop       bcc .skip_add   ;nothing to add on c = 0
            clc             ;add multiplicand to product
            .if   %3 > 1
                pha             ;save MSB of product
.offset         .set  0
                .repeat %3-1    
                    lda %2+%4+.offset
                    adc %1+.offset
                    sta %2+%4+.offset
.offset             .set  .offset+1
                .endr
                pla             ;restore MSB
            .endif
            adc %1+.offset  ;add MSB
.skip_add   ror             ;rotate upper product with carry from MSB
.offset     .set  %3+%4-2
            .repeat %3-1
                ror %2+.offset
.offset         .set .offset-1
            .endr
.start_r    .repeat %4      ;rotate multiplier and lower product
                ror %2+.offset
.offset         .set  .offset-1
            .endr
            dey             ;decrement multiplier bit count
            bne .loop       ;loop until bit count is exhausted
            sta %2+%3+%4-1  ;finally we need to store the MSB
            .endm
            
; testing the macro
            .org    $40
PROD                                ;product overlaps multiplier
MULR        .db     $ff,$ff,$ff,$ff ;multiplier (replaced by product)
            .ds     4               ;additional space for product
MULND       .db     $ff,$ff,$ff,$ff ;multiplicand 

            .org    $1000
            .start  *
            multiply MULND,MULR,4,4
            rts
            .end

Greetings to ELIZA.

jsii · Post by **jsii** » Fri Jan 05, 2018 9:26 pm

Klaus2m5 wrote:

I made a macro for my preferred version of the multiply representing a good balance of speed vs. size. Its size is 58 bytes for 32 by 32 and includes clearing the high bytes of the product. It was tested in Kowalski and executes in 1483-2699 cycles for a 32 bits by 32 bits multiply. The 27 byte subroutine of jsii executes in 3202-5442 cycles and has a bug when the subroutine is entered with carry set. The lower number of cycles is for a multiplier of 0 and the higher number of cycles is caused by the worst case multiplier of $ffffffff.

Code: Select all

;
; MULTIPLY macro
;
; params:   %1 - address of multiplicand
;           %2 - address of multiplier (replaced by product)
;           %3 - byte count of multiplicand
;           %4 - byte count of multiplier
; product byte count = %3 + %4    !! enough space must be reserved for %2 !!
;
multiply    .macro ...
            lda #$00        ;Clear upper half of product (MSB kept in A)
.offset     .set	%4
            .repeat %3-1
                sta %2+.offset
.offset         .set .offset+1
            .endr
            ldy #%4*8+1     ;Set binary count to # multiplier bits + 1
            bne .start_r
.loop       bcc .skip_add   ;nothing to add on c = 0
            clc             ;add multiplicand to product
            .if   %3 > 1
                pha             ;save MSB of product
.offset         .set  0
                .repeat %3-1    
                    lda %2+%4+.offset
                    adc %1+.offset
                    sta %2+%4+.offset
.offset             .set  .offset+1
                .endr
                pla             ;restore MSB
            .endif
            adc %1+.offset  ;add MSB
.skip_add   ror             ;rotate upper product with carry from MSB
.offset     .set  %3+%4-2
            .repeat %3-1
                ror %2+.offset
.offset         .set .offset-1
            .endr
.start_r    .repeat %4      ;rotate multiplier and lower product
                ror %2+.offset
.offset         .set  .offset-1
            .endr
            dey             ;decrement multiplier bit count
            bne .loop       ;loop until bit count is exhausted
            sta %2+%3+%4-1  ;finally we need to store the MSB
            .endm
            
; testing the macro
            .org    $40
PROD                                ;product overlaps multiplier
MULR        .db     $ff,$ff,$ff,$ff ;multiplier (replaced by product)
            .ds     4               ;additional space for product
MULND       .db     $ff,$ff,$ff,$ff ;multiplicand 

            .org    $1000
            .start  *
            multiply MULND,MULR,4,4
            rts
            .end

Greetings to ELIZA.

Based on your tests mentioned above, your subroutine is clearly better with a combination of size and speed; congratulations and thank you for the good work. Regarding the bug on the 25 byte subroutine (mentioning '25' now just for fun) and speaking of the carry flag, I wonder what the decimal mode flag does to it (just imagine). Next I might point out "clear the carry and the decimal mode flags before using it" in the instructions, but that might actually be cheating now, wouldn't it (based on the 'rules' -- just a light joke); or we can add a good "CLC/CLD" combination and call it a 27 byte subroutine at that, at least for the time being. I'd still like to see a 21 byte improvement anyway, though.

Thank you again and very good work.
Happy New Year.
Cheers!

Simple 32 bit multiply

Re: Simple 32 bit multiply

Re: Simple 32 bit multiply

Re: Simple 32 bit multiply

Re: Simple 32 bit multiply

Re: Simple 32 bit multiply

Re: Simple 32 bit multiply

Re: Simple 32 bit multiply

Re: Simple 32 bit multiply

Re: Simple 32 bit multiply