Simple 32 bit multiply

jsii · Post by **jsii** » Thu Dec 21, 2017 9:37 pm

The following is a simple 32 bit multiply routine.
I'm sure there are better ways to approach this problem,
but I think material for comparison and study is always useful.
77 bytes.

Code: Select all

;=================================================================
;=================================================================
;=================================================================
       ;32bit multiply
       ;(not thoroughly tested)
       ;Uses .AXY +4
       ;
       ;Operation:  C = A * B            (actually, C = C + A * B)
       ;                       where C = 64 bit result
       ;                             A = 32 bit #1
       ;                             B = 32 bit #2
       ;
       ; *** #1 changed
       ;
       ;Howto:
       ;                       
       ;$1500 - $1503 = #1 (SET $1504-$1507 TO 00)  ('A')
       ;$1510 - $1513 = #2                          ('B')
       ;$1520 - $152f = 00                          ('C')
       ;
       ; -> CT32X01 -> $1520 - $1527 = #1 * #2
       ;
       ;Ex.:  (assuming $1500-152F = 00)
       ;
       ;      LDA #$07
       ;      LDX #$03
       ;      STA $1500
       ;      STX $1510
       ;      JSR CT32X01
       ;      LDA $1520
       ;      LDX $1521
       ;      LDY $1522
       ;          ; .. Here, .Y, .X, .A = first 24 bits of result
       ;          ; ..                    so that .Y * 65536 + .X *256 + .A = first 24 bits result number value
       ;

CT32X01:
        LDX $1510
        JSR ACX02
        LDX $1511
        JSR ACX02
        LDX $1512
        JSR ACX02
        LDX $1513
ACX02:
        LDY #$08
YMXE02:
        TXA
        LSR
        TAX
        BCC YX02

        PHA
        LDX #$00
        CLC
        PHP
XMXE02:
        PLP
        LDA $1500,X
        ADC $1520,X
        STA $1520,X
        PHP
        INX
        CPX #$08
        BNE XMXE02
        PLP
        PLA
        TAX
YX02:
        ASL $1500
        TXA
        PHA
        LDX #$01
        PHP
XMXE0202:
        PLP
        ROL $1500,X
        PHP
        INX
        CPX #$08
        BNE XMXE0202
        PLP
        PLA
        TAX
        DEY
        BNE YMXE02
        RTS
;
;=================================================================
;=================================================================
;=================================================================
;>>> END ROUTINE

Klaus2m5 · Post by **Klaus2m5** » Fri Dec 22, 2017 10:25 am

If you compare your routine to this one: http://www.6502.org/source/integers/32muldiv.htm you'll find that it does the multiply top down. This eliminates the need to expand one of the operands to 64 bit and at the same time reduces the add routine to 32 bit. It is only larger than yours to avoid most indexed loops and improve the cycle count.

On the other hand not doing the 32 bit shift for the multiplier but dividing it into 8 bit single shift groups may actually be faster. However in the top down multiply the multiplier could be overlapped with the lower 32 bits of the product avoiding the extra multiplier shifts completely.

edit: to ilustrate I modified the top down multiply as follows.

Code: Select all

;32 bit multiply with 64 bit product
            .data
PROD:       .ds      8        ;64 bit product
MULR:       .equ     PROD     ;32 bit multiplier overlaps product
MULND:      .ds      4        ;32 bit multiplicand 

            .code
MULTIPLY:   lda      #$00
            sta      PROD+4   ;Clear upper half of
            sta      PROD+5   ;product
            sta      PROD+6
            sta      PROD+7
            ldx      #$21     ;Set binary count to 32+1
            bne      START_R
SHIFT_R:
;            lsr      MULR+3   ;Shift multiplyer right
;            ror      MULR+2   ; now part of product shift
;            ror      MULR+1
;            ror      MULR
            bcc      ROTATE_R ;Go rotate right if c = 0
            lda      PROD+4   ;Get upper half of product
            clc               ; and add multiplicand to
            adc      MULND    ; it
            sta      PROD+4
            lda      PROD+5
            adc      MULND+1
            sta      PROD+5
            lda      PROD+6
            adc      MULND+2
            sta      PROD+6
            lda      PROD+7
            adc      MULND+3
ROTATE_R:   ror      a        ;Rotate partial product
            sta      PROD+7   ; right
            ror      PROD+6
            ror      PROD+5
            ror      PROD+4
START_R:    ror      PROD+3   ;shift multiplier while shifting product
            ror      PROD+2
            ror      PROD+1
            ror      PROD
            dex               ;Decrement bit count and
            bne      SHIFT_R  ; loop until 32 bits are done
; floating point only
;            clc               ;Add dps and put sum in MULXP2
;            lda      MULXP1
;            adc      MULXP2
;            sta      MULXP2
            rts

edit2: of course the loop count in the modified code must be +1 for the initial multiplier shift!

jsii · Post by **jsii** » Fri Dec 22, 2017 11:44 am

Klaus2m5 wrote:

If you compare your routine to this one: http://www.6502.org/source/integers/32muldiv.htm you'll find that it does the multiply top down. This eliminates the need to expand one of the operands to 64 bit and at the same time reduces the add routine to 32 bit. It is only larger than yours to avoid most indexed loops and improve the cycle count.

Thank you! All those zeroes did 'bother' me. Maybe I'll try this alternative some time. Your way is clearly better.

Quote:

On the other hand not doing the 32 bit shift for the multiplier but dividing it into 8 bit single shift groups may actually be faster. However in the top down multiply the multiplier could be overlapped with the lower 32 bits of the product avoiding the extra multiplier shifts completely.

Very interesting. I thought of not using indexing and loops, but then I decided to aim for a lower byte count this time. I wonder how large an 'un-looped' alternative would be...

Thank you for your comment and very good looking SBC, by the way.
(I see the 'edit' to your post even as I type this reply. Thank you, very clear code.)

[corrected typing error: 'lower' instead of 'slower' around "...to aim for a lower byte count..."]

jsii · Post by **jsii** » Wed Dec 27, 2017 12:25 pm

Considering the exchange above, here's a re-written 32-bit multiply routine.
38 bytes.

Code: Select all

;=================================================================
;=================================================================
;=================================================================
       ;32bit multiply -- CT32X02 -- 2017, jsii
       ;(not thoroughly tested)
       ;Uses .AXY +1
       ;
       ;Operation:  C = A * B
       ;                       where C = 64 bit result
       ;                             A = 32 bit #1
       ;                             B = 32 bit #2
       ;
       ; *** #1 changed
       ;
       ;Howto:
       ;                       
       ;$C200 - $C203 = #1 (ALSO SET $C204-$C208 TO 00)
       ;$C210 - $C213 = #2
       ;
       ;
       ; -> CT32X02 -> $C200 - $C207 = #1 * #2
       ;
       ;Ex.:  (assuming $C200-C213 = 00)
       ;
       ;      LDA #$07
       ;      LDX #$03
       ;      STA $C200
       ;      STX $C210
       ;      JSR CT32X02
       ;      LDA $C200
       ;      LDX $C201
       ;      LDY $C202
       ;          ; .. Here, .Y, .X, .A = first 24 bits of result
       ;          ; ..                    so that .Y * 65536 + .X *256 + .A = 24 bit result number
       ;


CT32X02:
        LDY #$21
YMEX00:
        LDX #$08
        LSR $C208
XMEX01:
        ROR $C1FF,X
        DEX
        BNE XMEX01
        BCC YMEX01
        CLC
        PHP
XMEX00:
        PLP
        LDA $C204,X
        ADC $C210,X
        STA $C204,X
        PHP
        INX
        CPX #$05
        BNE XMEX00
        PLP
YMEX01:
        DEY
        BNE YMEX00
        RTS          ;38 BYTES
;
;=================================================================
;=================================================================
;=================================================================
;>>> END ROUTINE

I suppose all that x86 programming finally shows its effect; must shape up. I'm sure there must be better ways to do this yet both in terms of byte count and speed. 37 bytes would be good.

jsii · Post by **jsii** » Wed Dec 27, 2017 1:59 pm

Based on the routine placed just above, if we don't mind 'upsetting' the carry down below,
I think we can take it down to 37 bytes; 35 bytes if no 'user bytes' are needed.
37 bytes.

Code: Select all

;=================================================================
;=================================================================
;=================================================================
       ;32bit multiply -- CT32X02 -- 2017, jsii
       ;(not thoroughly tested)
       ;Uses .AXY +1
       ;
       ;Operation:  C = A * B
       ;                       where C = 64 bit result
       ;                             A = 32 bit #1
       ;                             B = 32 bit #2
       ;
       ; *** #1 changed
       ;
       ;Howto:
       ;                       
       ;$C200 - $C203 = #1 (ALSO SET $C204-$C208 TO 00)
       ;$C210 - $C213 = #2
       ;
       ;
       ; -> CT32X02 -> $C200 - $C207 = #1 * #2
       ;
       ;Ex.:  (assuming $C200-C213 = 00)
       ;
       ;      LDA #$07
       ;      LDX #$03
       ;      STA $C200
       ;      STX $C210
       ;      JSR CT32X02
       ;      LDA $C200
       ;      LDX $C201
       ;      LDY $C202
       ;          ; .. Here, .Y, .X, .A = first 24 bits of result
       ;          ; ..                    so that .Y * 65536 + .X *256 + .A = 24 bit result number
       ;


CT32X02:
        LDY #$21
YMEX00:
        LDX #$09
XMEX01:
        ROR $C1FF,X
        DEX
        BNE XMEX01
        BCC YMEX01
        CLC
        PHP
XMEX00:
        PLP
        LDA $C204,X
        ADC $C210,X
        STA $C204,X
        PHP
        INX
        CPX #$05
        BNE XMEX00
        PLP
YMEX01:
        DEY
        BNE YMEX00
        RTS          ;35 BYTES
        NOP          ;USER OR F.E.
        NOP          ;USER OR F.E.
                     ;37 BYTES.
;
;=================================================================
;=================================================================
;=================================================================
;>>> END ROUTINE

Definitely must shape up. I'm sure there must be better ways to do this yet both in terms of byte count and speed. 32 bytes would be good. "A 32 byte 32-bit multiply routine" sounds good to me.

Klaus2m5 · Post by **Klaus2m5** » Wed Dec 27, 2017 4:06 pm

How about 31 bytes?

Code: Select all

CT32X02:
        CLC             ;bugfix!
        LDY #$21
YMEX00:
        LDX #$09
XMEX01:
        ROR $C1FF,X
        DEX
        BNE XMEX01
        BCC YMEX01
        LDX #$fc        ;added to replace cpx #$05
        CLC
;        PHP            ;carry save no longer needed
XMEX00:
;        PLP            ;carry save no longer needed
        LDA $C204-$fc,X
        ADC $C210-$fc,X
        STA $C204-$fc,X
;        PHP            ;carry save no longer needed
        INX
;        CPX #$05       ;replaced by ldx #$fc
        BNE XMEX00
;        PLP            ;carry save no longer needed
YMEX01:
        DEY
        BNE YMEX00
        RTS          ;32 BYTES after bugfix

Of course this will cause page crossing indexes. So the 7 cycles you gain by avoiding the PHP/PLP will be reduced by 2 cycles per add loop.

edit: Found a bug in the original code. Entering the subroutine with carry set will cause the result to be plus $80000000.

jsii · Post by **jsii** » Wed Dec 27, 2017 5:27 pm

Klaus2m5 wrote:

How about 31 bytes?

Code: Select all

CT32X02:
        LDY #$21
YMEX00:
        LDX #$09
XMEX01:
        ROR $C1FF,X
        DEX
        BNE XMEX01
        BCC YMEX01
        LDX #$fc        ;added to replace cpx #$05
        CLC
;        PHP            ;carry save no longer needed
XMEX00:
;        PLP            ;carry save no longer needed
        LDA $C204-$fc,X
        ADC $C210-$fc,X
        STA $C204-$fc,X
;        PHP            ;carry save no longer needed
        INX
;        CPX #$05       ;replaced by ldx #$fc
        BNE XMEX00
;        PLP            ;carry save no longer needed
YMEX01:
        DEY
        BNE YMEX00
        RTS          ;31 BYTES

Of course this will cause page crossing indexes. So the 7 cycles you gain by avoiding the PHP/PLP will be reduced by 2 cycles per add loop.

There you go, well done, glad you caught it!
So now, how about 27 bytes?

jsii · Post by **jsii** » Wed Dec 27, 2017 6:58 pm

Well anyway... 27 bytes.

Code: Select all

;=================================================================
;=================================================================
;=================================================================
       ;32bit multiply -- CT32X03 -- 2017, jsii
       ;(not thoroughly tested)
       ;Uses .AXY +0
       ;
       ;Operation:  C = A * B
       ;                       where C = 64 bit result
       ;                             A = 32 bit #1
       ;                             B = 32 bit #2
       ;
       ; *** #1 changed
       ;
       ;Howto:
       ;                       
       ;$61 - $64 = #1 (ALSO SET $65-$69 TO 00)
       ;$6A - $6D = #2
       ;
       ;
       ; -> CT32X03 -> $61 - $68 = #1 * #2
       ;
       ;Ex.:  (assuming $61-$6D = 00)
       ;
       ;      LDA #$07
       ;      LDX #$03
       ;      STA $61
       ;      STX $6A
       ;      JSR CT32X03
       ;      LDA $61
       ;      LDX $62
       ;      LDY $63
       ;          ; .. Here, .Y, .X, .A = first 24 bits of result
       ;          ; ..                    so that .Y * 65536 + .X *256 + .A = 24 bit result number
       ;

CT32X03:
        LDY #$21
YMEX00:
        LDX #$09
XMEX01:
        ROR $60,X  
        DEX
        BNE XMEX01
        BCC YMEX01
        LDX #$FC
        CLC
XMEX00:
        LDA $69,X        
        ADC $6E,X        
        STA $69,X        
        INX
        BNE XMEX00
YMEX01:
        DEY
        BNE YMEX00
        RTS          ;27 BYTES
;
;=================================================================
;=================================================================
;=================================================================
;>>> END ROUTINE

21 Bytes?

Klaus2m5 · Post by **Klaus2m5** » Wed Dec 27, 2017 9:37 pm

Quote:

21 Bytes?

Don't be silly.

jsii · Post by **jsii** » Wed Dec 27, 2017 11:08 pm

Klaus2m5 wrote:

Quote:

21 Bytes?

Don't be silly.

How?

jsii · Post by **jsii** » Fri Dec 29, 2017 5:09 am

25 bytes.

Code: Select all

;=================================================================
;=================================================================
;=================================================================
       ;32bit multiply and 'variable bit' multiply -- CT32X04 -- 2017, jsii
       ;(not thoroughly tested)
       ;Uses .AXY +0
       ;
       ;Operation:  C = A * B
       ;                       where C = 64 bit result (or 'variable bit' result)(see below)
       ;                             A = 32 bit #1     (or 'variable bit' #1)
       ;                             B = 32 bit #2     (or 'variable bit' #2)
       ;
       ; *** #1 changed
       ;
       ;Howto:
       ;                       
       ;$61 - $64 = #1 (ALSO SET $65-$69 TO 00)
       ;$6A - $6D = #2
       ;.Y = 33, 25, 17 or 9, for example.     (33 -> 32 bit operation, 25 -> 24 bit operation,
       ;                                        17 -> 16 bit operation,  9 ->  8 bit operation)                               
       ;
       ;
       ;
       ; -> CT32X04 -> $61 - $68 = #1 * #2    (for a 32 bit operation)
       ;               $62 - $67 = #1 * #2    (for a 24 ''    ''     )
       ;               $63 - $66 = #1 * #2    (for 16 bits)
       ;               $64 - $65 = #1 * #2    (for  8 bits)
       ;
       ;
       ;Ex.:  (assuming $61-$6D = 00)
       ;
       ;      LDA #$07
       ;      LDX #$03
       ;      STA $61
       ;      STX $6A
       ;      LDY #$21    ;32 bit operation
       ;      JSR CT32X04
       ;      LDA $61
       ;      LDX $62
       ;      LDY $63
       ;          ; .. Here, .Y, .X, .A = first 24 bits of result
       ;          ; ..                    so that .Y * 65536 + .X *256 + .A = 24 bit result number
       ;
       ;
       ;Ex.:  (assuming $61-$6D = 00)
       ;
       ;      LDA #$30
       ;      LDX #$67
       ;      STA $61
       ;      STX $6A
       ;      LDY #$09    ;8 bit operation
       ;      JSR CT32X04
       ;      LDA $64
       ;      LDX $65
       ;          ; .. Here,  .X, .A = 16 bit result, .A=80  .X=19
       ;


CT32X04:
        LDX #$09
XMEX01:
        ROR $60,X  
        DEX
        BNE XMEX01
        BCC YMEX01
        LDX #$FC
        CLC
XMEX00:
        LDA $69,X        
        ADC $6E,X        
        STA $69,X        
        INX
        BNE XMEX00
YMEX01:
        DEY
        BNE CT32X04
        RTS          ;25 BYTES.

;
;=================================================================
;=================================================================
;=================================================================
;>>> END ROUTINE

21 Bytes?

rwiker · Post by **rwiker** » Fri Dec 29, 2017 6:55 am

If you're going to cheat, why not move the

Code: Select all

LDX #$09

out of the function, too? Further, if you inline this, you can get rid of the RTS (and everything else; the actual length of the subroutine becomes 0...)

BigEd · Post by **BigEd** » Fri Dec 29, 2017 7:17 am

(Can we keep this good-natured? Enjoy yourselves.)

jsii · Post by **jsii** » Fri Dec 29, 2017 10:53 am

rwiker wrote:

If you're going to cheat, why not move the

Code: Select all

LDX #$09

out of the function, too? Further, if you inline this, you can get rid of the RTS (and everything else; the actual length of the subroutine becomes 0...)

I hear that bile is good around the liver. Anyway, I'm not sure what you mean by 'cheat', by 'function' or by 'inline' for that matter. I wonder how a 6502 feels about cheating, about "well behaved" code, about pretty comments or about nifty indenting altogether. Cheers.

jsii · Post by **jsii** » Fri Dec 29, 2017 10:55 am

I still say there must be better ways to do this yet both in terms of byte count and speed. 21 bytes? Cheers.

Simple 32 bit multiply

Simple 32 bit multiply

Re: Simple 32 bit multiply

Re: Simple 32 bit multiply

Re: Simple 32 bit multiply

Re: Simple 32 bit multiply

Re: Simple 32 bit multiply

Re: Simple 32 bit multiply

Re: Simple 32 bit multiply

Re: Simple 32 bit multiply

Re: Simple 32 bit multiply

Re: Simple 32 bit multiply

Re: Simple 32 bit multiply

Re: Simple 32 bit multiply

Re: Simple 32 bit multiply

Re: Simple 32 bit multiply