6502.org Forum  Projects  Code  Documents  Tools  Forum
It is currently Fri Nov 08, 2024 10:38 pm

All times are UTC




Post new topic Reply to topic  [ 39 posts ]  Go to page 1, 2, 3  Next
Author Message
 Post subject: Simple 32 bit multiply
PostPosted: Thu Dec 21, 2017 9:37 pm 
Offline

Joined: Wed Mar 16, 2016 8:19 pm
Posts: 38
The following is a simple 32 bit multiply routine.
I'm sure there are better ways to approach this problem,
but I think material for comparison and study is always useful.
77 bytes.

Code:
;=================================================================
;=================================================================
;=================================================================
       ;32bit multiply
       ;(not thoroughly tested)
       ;Uses .AXY +4
       ;
       ;Operation:  C = A * B            (actually, C = C + A * B)
       ;                       where C = 64 bit result
       ;                             A = 32 bit #1
       ;                             B = 32 bit #2
       ;
       ; *** #1 changed
       ;
       ;Howto:
       ;                       
       ;$1500 - $1503 = #1 (SET $1504-$1507 TO 00)  ('A')
       ;$1510 - $1513 = #2                          ('B')
       ;$1520 - $152f = 00                          ('C')
       ;
       ; -> CT32X01 -> $1520 - $1527 = #1 * #2
       ;
       ;Ex.:  (assuming $1500-152F = 00)
       ;
       ;      LDA #$07
       ;      LDX #$03
       ;      STA $1500
       ;      STX $1510
       ;      JSR CT32X01
       ;      LDA $1520
       ;      LDX $1521
       ;      LDY $1522
       ;          ; .. Here, .Y, .X, .A = first 24 bits of result
       ;          ; ..                    so that .Y * 65536 + .X *256 + .A = first 24 bits result number value
       ;

CT32X01:
        LDX $1510
        JSR ACX02
        LDX $1511
        JSR ACX02
        LDX $1512
        JSR ACX02
        LDX $1513
ACX02:
        LDY #$08
YMXE02:
        TXA
        LSR
        TAX
        BCC YX02

        PHA
        LDX #$00
        CLC
        PHP
XMXE02:
        PLP
        LDA $1500,X
        ADC $1520,X
        STA $1520,X
        PHP
        INX
        CPX #$08
        BNE XMXE02
        PLP
        PLA
        TAX
YX02:
        ASL $1500
        TXA
        PHA
        LDX #$01
        PHP
XMXE0202:
        PLP
        ROL $1500,X
        PHP
        INX
        CPX #$08
        BNE XMXE0202
        PLP
        PLA
        TAX
        DEY
        BNE YMXE02
        RTS
;
;=================================================================
;=================================================================
;=================================================================
;>>> END ROUTINE


Last edited by jsii on Wed Dec 27, 2017 12:27 pm, edited 1 time in total.

Top
 Profile  
Reply with quote  
PostPosted: Fri Dec 22, 2017 10:25 am 
Offline

Joined: Sat Jul 28, 2012 11:41 am
Posts: 442
Location: Wiesbaden, Germany
If you compare your routine to this one: http://www.6502.org/source/integers/32muldiv.htm you'll find that it does the multiply top down. This eliminates the need to expand one of the operands to 64 bit and at the same time reduces the add routine to 32 bit. It is only larger than yours to avoid most indexed loops and improve the cycle count.

On the other hand not doing the 32 bit shift for the multiplier but dividing it into 8 bit single shift groups may actually be faster. However in the top down multiply the multiplier could be overlapped with the lower 32 bits of the product avoiding the extra multiplier shifts completely.

edit: to ilustrate I modified the top down multiply as follows.
Code:
;32 bit multiply with 64 bit product
            .data
PROD:       .ds      8        ;64 bit product
MULR:       .equ     PROD     ;32 bit multiplier overlaps product
MULND:      .ds      4        ;32 bit multiplicand

            .code
MULTIPLY:   lda      #$00
            sta      PROD+4   ;Clear upper half of
            sta      PROD+5   ;product
            sta      PROD+6
            sta      PROD+7
            ldx      #$21     ;Set binary count to 32+1
            bne      START_R
SHIFT_R:
;            lsr      MULR+3   ;Shift multiplyer right
;            ror      MULR+2   ; now part of product shift
;            ror      MULR+1
;            ror      MULR
            bcc      ROTATE_R ;Go rotate right if c = 0
            lda      PROD+4   ;Get upper half of product
            clc               ; and add multiplicand to
            adc      MULND    ; it
            sta      PROD+4
            lda      PROD+5
            adc      MULND+1
            sta      PROD+5
            lda      PROD+6
            adc      MULND+2
            sta      PROD+6
            lda      PROD+7
            adc      MULND+3
ROTATE_R:   ror      a        ;Rotate partial product
            sta      PROD+7   ; right
            ror      PROD+6
            ror      PROD+5
            ror      PROD+4
START_R:    ror      PROD+3   ;shift multiplier while shifting product
            ror      PROD+2
            ror      PROD+1
            ror      PROD
            dex               ;Decrement bit count and
            bne      SHIFT_R  ; loop until 32 bits are done
; floating point only
;            clc               ;Add dps and put sum in MULXP2
;            lda      MULXP1
;            adc      MULXP2
;            sta      MULXP2
            rts
edit2: of course the loop count in the modified code must be +1 for the initial multiplier shift!

_________________
6502 sources on GitHub: https://github.com/Klaus2m5


Last edited by Klaus2m5 on Fri Dec 22, 2017 12:01 pm, edited 1 time in total.

Top
 Profile  
Reply with quote  
PostPosted: Fri Dec 22, 2017 11:44 am 
Offline

Joined: Wed Mar 16, 2016 8:19 pm
Posts: 38
Klaus2m5 wrote:
If you compare your routine to this one: http://www.6502.org/source/integers/32muldiv.htm you'll find that it does the multiply top down. This eliminates the need to expand one of the operands to 64 bit and at the same time reduces the add routine to 32 bit. It is only larger than yours to avoid most indexed loops and improve the cycle count.
Thank you! All those zeroes did 'bother' me. Maybe I'll try this alternative some time. Your way is clearly better.

Quote:
On the other hand not doing the 32 bit shift for the multiplier but dividing it into 8 bit single shift groups may actually be faster. However in the top down multiply the multiplier could be overlapped with the lower 32 bits of the product avoiding the extra multiplier shifts completely.
Very interesting. I thought of not using indexing and loops, but then I decided to aim for a lower byte count this time. I wonder how large an 'un-looped' alternative would be...

Thank you for your comment and very good looking SBC, by the way.
(I see the 'edit' to your post even as I type this reply. Thank you, very clear code.)

[corrected typing error: 'lower' instead of 'slower' around "...to aim for a lower byte count..."]


Top
 Profile  
Reply with quote  
PostPosted: Wed Dec 27, 2017 12:25 pm 
Offline

Joined: Wed Mar 16, 2016 8:19 pm
Posts: 38
Considering the exchange above, here's a re-written 32-bit multiply routine.
38 bytes.

Code:
;=================================================================
;=================================================================
;=================================================================
       ;32bit multiply -- CT32X02 -- 2017, jsii
       ;(not thoroughly tested)
       ;Uses .AXY +1
       ;
       ;Operation:  C = A * B
       ;                       where C = 64 bit result
       ;                             A = 32 bit #1
       ;                             B = 32 bit #2
       ;
       ; *** #1 changed
       ;
       ;Howto:
       ;                       
       ;$C200 - $C203 = #1 (ALSO SET $C204-$C208 TO 00)
       ;$C210 - $C213 = #2
       ;
       ;
       ; -> CT32X02 -> $C200 - $C207 = #1 * #2
       ;
       ;Ex.:  (assuming $C200-C213 = 00)
       ;
       ;      LDA #$07
       ;      LDX #$03
       ;      STA $C200
       ;      STX $C210
       ;      JSR CT32X02
       ;      LDA $C200
       ;      LDX $C201
       ;      LDY $C202
       ;          ; .. Here, .Y, .X, .A = first 24 bits of result
       ;          ; ..                    so that .Y * 65536 + .X *256 + .A = 24 bit result number
       ;


CT32X02:
        LDY #$21
YMEX00:
        LDX #$08
        LSR $C208
XMEX01:
        ROR $C1FF,X
        DEX
        BNE XMEX01
        BCC YMEX01
        CLC
        PHP
XMEX00:
        PLP
        LDA $C204,X
        ADC $C210,X
        STA $C204,X
        PHP
        INX
        CPX #$05
        BNE XMEX00
        PLP
YMEX01:
        DEY
        BNE YMEX00
        RTS          ;38 BYTES
;
;=================================================================
;=================================================================
;=================================================================
;>>> END ROUTINE
I suppose all that x86 programming finally shows its effect; must shape up. I'm sure there must be better ways to do this yet both in terms of byte count and speed. 37 bytes would be good.


Top
 Profile  
Reply with quote  
PostPosted: Wed Dec 27, 2017 1:59 pm 
Offline

Joined: Wed Mar 16, 2016 8:19 pm
Posts: 38
Based on the routine placed just above, if we don't mind 'upsetting' the carry down below,
I think we can take it down to 37 bytes; 35 bytes if no 'user bytes' are needed.
37 bytes.

Code:
;=================================================================
;=================================================================
;=================================================================
       ;32bit multiply -- CT32X02 -- 2017, jsii
       ;(not thoroughly tested)
       ;Uses .AXY +1
       ;
       ;Operation:  C = A * B
       ;                       where C = 64 bit result
       ;                             A = 32 bit #1
       ;                             B = 32 bit #2
       ;
       ; *** #1 changed
       ;
       ;Howto:
       ;                       
       ;$C200 - $C203 = #1 (ALSO SET $C204-$C208 TO 00)
       ;$C210 - $C213 = #2
       ;
       ;
       ; -> CT32X02 -> $C200 - $C207 = #1 * #2
       ;
       ;Ex.:  (assuming $C200-C213 = 00)
       ;
       ;      LDA #$07
       ;      LDX #$03
       ;      STA $C200
       ;      STX $C210
       ;      JSR CT32X02
       ;      LDA $C200
       ;      LDX $C201
       ;      LDY $C202
       ;          ; .. Here, .Y, .X, .A = first 24 bits of result
       ;          ; ..                    so that .Y * 65536 + .X *256 + .A = 24 bit result number
       ;


CT32X02:
        LDY #$21
YMEX00:
        LDX #$09
XMEX01:
        ROR $C1FF,X
        DEX
        BNE XMEX01
        BCC YMEX01
        CLC
        PHP
XMEX00:
        PLP
        LDA $C204,X
        ADC $C210,X
        STA $C204,X
        PHP
        INX
        CPX #$05
        BNE XMEX00
        PLP
YMEX01:
        DEY
        BNE YMEX00
        RTS          ;35 BYTES
        NOP          ;USER OR F.E.
        NOP          ;USER OR F.E.
                     ;37 BYTES.
;
;=================================================================
;=================================================================
;=================================================================
;>>> END ROUTINE
Definitely must shape up. I'm sure there must be better ways to do this yet both in terms of byte count and speed. 32 bytes would be good. "A 32 byte 32-bit multiply routine" sounds good to me.


Top
 Profile  
Reply with quote  
PostPosted: Wed Dec 27, 2017 4:06 pm 
Offline

Joined: Sat Jul 28, 2012 11:41 am
Posts: 442
Location: Wiesbaden, Germany
How about 31 bytes?
Code:
CT32X02:
        CLC             ;bugfix!
        LDY #$21
YMEX00:
        LDX #$09
XMEX01:
        ROR $C1FF,X
        DEX
        BNE XMEX01
        BCC YMEX01
        LDX #$fc        ;added to replace cpx #$05
        CLC
;        PHP            ;carry save no longer needed
XMEX00:
;        PLP            ;carry save no longer needed
        LDA $C204-$fc,X
        ADC $C210-$fc,X
        STA $C204-$fc,X
;        PHP            ;carry save no longer needed
        INX
;        CPX #$05       ;replaced by ldx #$fc
        BNE XMEX00
;        PLP            ;carry save no longer needed
YMEX01:
        DEY
        BNE YMEX00
        RTS          ;32 BYTES after bugfix
Of course this will cause page crossing indexes. So the 7 cycles you gain by avoiding the PHP/PLP will be reduced by 2 cycles per add loop.

edit: Found a bug in the original code. Entering the subroutine with carry set will cause the result to be plus $80000000.

_________________
6502 sources on GitHub: https://github.com/Klaus2m5


Last edited by Klaus2m5 on Fri Jan 05, 2018 1:57 pm, edited 1 time in total.

Top
 Profile  
Reply with quote  
PostPosted: Wed Dec 27, 2017 5:27 pm 
Offline

Joined: Wed Mar 16, 2016 8:19 pm
Posts: 38
Klaus2m5 wrote:
How about 31 bytes?
Code:
CT32X02:
        LDY #$21
YMEX00:
        LDX #$09
XMEX01:
        ROR $C1FF,X
        DEX
        BNE XMEX01
        BCC YMEX01
        LDX #$fc        ;added to replace cpx #$05
        CLC
;        PHP            ;carry save no longer needed
XMEX00:
;        PLP            ;carry save no longer needed
        LDA $C204-$fc,X
        ADC $C210-$fc,X
        STA $C204-$fc,X
;        PHP            ;carry save no longer needed
        INX
;        CPX #$05       ;replaced by ldx #$fc
        BNE XMEX00
;        PLP            ;carry save no longer needed
YMEX01:
        DEY
        BNE YMEX00
        RTS          ;31 BYTES
Of course this will cause page crossing indexes. So the 7 cycles you gain by avoiding the PHP/PLP will be reduced by 2 cycles per add loop.
There you go, well done, glad you caught it!
So now, how about 27 bytes?


Top
 Profile  
Reply with quote  
PostPosted: Wed Dec 27, 2017 6:58 pm 
Offline

Joined: Wed Mar 16, 2016 8:19 pm
Posts: 38
Well anyway... 27 bytes.

Code:
;=================================================================
;=================================================================
;=================================================================
       ;32bit multiply -- CT32X03 -- 2017, jsii
       ;(not thoroughly tested)
       ;Uses .AXY +0
       ;
       ;Operation:  C = A * B
       ;                       where C = 64 bit result
       ;                             A = 32 bit #1
       ;                             B = 32 bit #2
       ;
       ; *** #1 changed
       ;
       ;Howto:
       ;                       
       ;$61 - $64 = #1 (ALSO SET $65-$69 TO 00)
       ;$6A - $6D = #2
       ;
       ;
       ; -> CT32X03 -> $61 - $68 = #1 * #2
       ;
       ;Ex.:  (assuming $61-$6D = 00)
       ;
       ;      LDA #$07
       ;      LDX #$03
       ;      STA $61
       ;      STX $6A
       ;      JSR CT32X03
       ;      LDA $61
       ;      LDX $62
       ;      LDY $63
       ;          ; .. Here, .Y, .X, .A = first 24 bits of result
       ;          ; ..                    so that .Y * 65536 + .X *256 + .A = 24 bit result number
       ;

CT32X03:
        LDY #$21
YMEX00:
        LDX #$09
XMEX01:
        ROR $60,X 
        DEX
        BNE XMEX01
        BCC YMEX01
        LDX #$FC
        CLC
XMEX00:
        LDA $69,X       
        ADC $6E,X       
        STA $69,X       
        INX
        BNE XMEX00
YMEX01:
        DEY
        BNE YMEX00
        RTS          ;27 BYTES
;
;=================================================================
;=================================================================
;=================================================================
;>>> END ROUTINE
21 Bytes?


Last edited by jsii on Fri Dec 29, 2017 5:10 am, edited 1 time in total.

Top
 Profile  
Reply with quote  
PostPosted: Wed Dec 27, 2017 9:37 pm 
Offline

Joined: Sat Jul 28, 2012 11:41 am
Posts: 442
Location: Wiesbaden, Germany
Quote:
21 Bytes?
Don't be silly.

_________________
6502 sources on GitHub: https://github.com/Klaus2m5


Top
 Profile  
Reply with quote  
PostPosted: Wed Dec 27, 2017 11:08 pm 
Offline

Joined: Wed Mar 16, 2016 8:19 pm
Posts: 38
Klaus2m5 wrote:
Quote:
21 Bytes?
Don't be silly.
How?


Top
 Profile  
Reply with quote  
PostPosted: Fri Dec 29, 2017 5:09 am 
Offline

Joined: Wed Mar 16, 2016 8:19 pm
Posts: 38
25 bytes.

Code:
;=================================================================
;=================================================================
;=================================================================
       ;32bit multiply and 'variable bit' multiply -- CT32X04 -- 2017, jsii
       ;(not thoroughly tested)
       ;Uses .AXY +0
       ;
       ;Operation:  C = A * B
       ;                       where C = 64 bit result (or 'variable bit' result)(see below)
       ;                             A = 32 bit #1     (or 'variable bit' #1)
       ;                             B = 32 bit #2     (or 'variable bit' #2)
       ;
       ; *** #1 changed
       ;
       ;Howto:
       ;                       
       ;$61 - $64 = #1 (ALSO SET $65-$69 TO 00)
       ;$6A - $6D = #2
       ;.Y = 33, 25, 17 or 9, for example.     (33 -> 32 bit operation, 25 -> 24 bit operation,
       ;                                        17 -> 16 bit operation,  9 ->  8 bit operation)                               
       ;
       ;
       ;
       ; -> CT32X04 -> $61 - $68 = #1 * #2    (for a 32 bit operation)
       ;               $62 - $67 = #1 * #2    (for a 24 ''    ''     )
       ;               $63 - $66 = #1 * #2    (for 16 bits)
       ;               $64 - $65 = #1 * #2    (for  8 bits)
       ;
       ;
       ;Ex.:  (assuming $61-$6D = 00)
       ;
       ;      LDA #$07
       ;      LDX #$03
       ;      STA $61
       ;      STX $6A
       ;      LDY #$21    ;32 bit operation
       ;      JSR CT32X04
       ;      LDA $61
       ;      LDX $62
       ;      LDY $63
       ;          ; .. Here, .Y, .X, .A = first 24 bits of result
       ;          ; ..                    so that .Y * 65536 + .X *256 + .A = 24 bit result number
       ;
       ;
       ;Ex.:  (assuming $61-$6D = 00)
       ;
       ;      LDA #$30
       ;      LDX #$67
       ;      STA $61
       ;      STX $6A
       ;      LDY #$09    ;8 bit operation
       ;      JSR CT32X04
       ;      LDA $64
       ;      LDX $65
       ;          ; .. Here,  .X, .A = 16 bit result, .A=80  .X=19
       ;


CT32X04:
        LDX #$09
XMEX01:
        ROR $60,X 
        DEX
        BNE XMEX01
        BCC YMEX01
        LDX #$FC
        CLC
XMEX00:
        LDA $69,X       
        ADC $6E,X       
        STA $69,X       
        INX
        BNE XMEX00
YMEX01:
        DEY
        BNE CT32X04
        RTS          ;25 BYTES.

;
;=================================================================
;=================================================================
;=================================================================
;>>> END ROUTINE
21 Bytes?


Top
 Profile  
Reply with quote  
PostPosted: Fri Dec 29, 2017 6:55 am 
Offline

Joined: Thu Mar 03, 2011 5:56 pm
Posts: 284
If you're going to cheat, why not move the
Code:
LDX #$09
out of the function, too? Further, if you inline this, you can get rid of the RTS (and everything else; the actual length of the subroutine becomes 0...)


Top
 Profile  
Reply with quote  
PostPosted: Fri Dec 29, 2017 7:17 am 
Offline
User avatar

Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10975
Location: England
(Can we keep this good-natured? Enjoy yourselves.)


Top
 Profile  
Reply with quote  
PostPosted: Fri Dec 29, 2017 10:53 am 
Offline

Joined: Wed Mar 16, 2016 8:19 pm
Posts: 38
rwiker wrote:
If you're going to cheat, why not move the
Code:
LDX #$09
out of the function, too? Further, if you inline this, you can get rid of the RTS (and everything else; the actual length of the subroutine becomes 0...)
I hear that bile is good around the liver. Anyway, I'm not sure what you mean by 'cheat', by 'function' or by 'inline' for that matter. I wonder how a 6502 feels about cheating, about "well behaved" code, about pretty comments or about nifty indenting altogether. Cheers.


Top
 Profile  
Reply with quote  
PostPosted: Fri Dec 29, 2017 10:55 am 
Offline

Joined: Wed Mar 16, 2016 8:19 pm
Posts: 38
I still say there must be better ways to do this yet both in terms of byte count and speed. 21 bytes? Cheers.


Top
 Profile  
Reply with quote  
Display posts from previous:  Sort by  
Post new topic Reply to topic  [ 39 posts ]  Go to page 1, 2, 3  Next

All times are UTC


Who is online

Users browsing this forum: No registered users and 6 guests


You cannot post new topics in this forum
You cannot reply to topics in this forum
You cannot edit your posts in this forum
You cannot delete your posts in this forum
You cannot post attachments in this forum

Search for:
Jump to: