6502.org Forum  Projects  Code  Documents  Tools  Forum
It is currently Tue Sep 17, 2024 3:09 am

All times are UTC




Post new topic Reply to topic  [ 39 posts ]  Go to page Previous  1, 2, 3
Author Message
PostPosted: Sat Dec 30, 2017 9:58 pm 
Offline

Joined: Wed Mar 16, 2016 8:19 pm
Posts: 38
BigDumbDinosaur wrote:
I've been following this discussion and have to say it's been interesting.

That said, if you really want to see an improvement in performance as well as a reduction in code size, try writing the multiply algorithm for the 65C816 running in 16 bit native mode. :D You will be pleasantly amazed at what is gained by processing the operands a word at a time.

Code:
;imul: FACA = FACA × FACB  (64 bit integer multiplication)
;
;   —————————————————————————————————————————————————
;   Preparatory Ops : FACA: 32 bit multiplicand
;                     FACB: 32 bit multplier
;
;   Computed Returns: FACA: 64 bit product
;                     FACB: entry value
;
;   Register Usage  : .A: used
;                     .B: used
;                     .X: used
;                     .Y: truncated to 8 bits
;
;   MPU Flags: NVmxDIZC
;              ||||||||
;              |||||||+———> 0
;              ||||||+————> undefined
;              |||||+—————> entry value
;              ||+++——————> 0
;              ++—————————> undefined
;
;   Notes: 1) Product is undefined if either operand
;             is greater than $FFFFFFFF.
;          2) A call to CLRFACA must be made before
;             FACA is loaded with the multiplicand.
;   —————————————————————————————————————————————————
;
imul     longa                 ;16 bit accumulator
         shortx                ;8 bit index
         cld                   ;ensure binary mode
         ldx #s_bdword+1       ;bits to process +1
         clc
;
.0000010 ror faca+s_long+s_word;rotate...
         ror faca+s_long       ;out...
         ror faca+s_word       ;a...
         ror faca              ;bit
         bcc .0000020          ;0, skip ahead
;
         clc
         lda faca+s_long       ;partial product
         adc facb              ;multiplier
         sta faca+s_long       ;new partial product
         lda faca+s_long+s_word;ditto
         adc facb+s_word       ;ditto
         sta faca+s_long+s_word;ditto
;
.0000020 dex
         bne .0000010          ;next bit
;
         longx                 ;16 bit index
         rts

In the above, symbols such as S_WORD and S_LONG define data type sizes in bytes. S_BDWORD defines the number of bits in a double word (DWORD).

Code:
;idiv: FACA = FACA ÷ FACB  (64 bit integer division)
;
;   —————————————————————————————————————————————————
;   Preparatory Ops : FACA: 64 bit dividend
;                     FACB: 64 bit divisor
;
;   Computed Returns: FACA: 64 bit quotient
;                     FACB: entry value
;                     FACC: used
;
;   Register Returns: .A: used
;                     .B: used
;                     .X: remainder LSW
;                     .Y: remainder MSW
;
;   MPU Flags: NVmxDIZC
;              ||||||||
;              |||||||+———> 0: quotient valid
;              |||||||      1: division by zero
;              ||||||+————> 0: remainder != zero
;              ||||||       1: remainder == zero
;              |||||+—————> entry value
;              ||+++——————> 0
;              ++—————————> undefined
;
;   NOTES: 1) All values are in little-endian format.
;          2) The remainder will also be in _OVRFLO_.
;   —————————————————————————————————————————————————
;
idiv     longa                 ;16 bit accumulator
         shortx                ;8 bit index
         cld                   ;ensure binary mode
         jsr clrfacc           ;clear tertiary accumulator
         jsr clrovrfl          ;clear overflow
         ldx #s_dlong-s_word
;
;
;   perform divide-by-zero check...
;
.0000010 lda facb,x
         bne idiv01
;
         .rept s_word          ;decrement .X twice
           dex
         .endr
         bpl .0000010
;
         tax
         tay
         longx
         sec                   ;division by zero error
         rts
;
idiv01   ldy #s_blword
         clc
;
.0000010 phy
         ldx #0
         ldy #_loopct_
;
.0000020 rol faca,x
         .rept s_word
           inx
         .endr
         dey
         bne .0000020
;
         ldx #0
         ldy #_loopct_
;
.0000030 rol _ovrflo_,x
         .rept s_word
           inx
         .endr
         dey
         bpl .0000030
;
         ldx #0
         ldy #_loopct_
         sec
;
.0000040 lda _ovrflo_,x
         sbc facb,x
         sta facc,x
         .rept s_word
             inx
         .endr
         dey
         bne .0000040
;
         bcc .0000060
;         
         ldx #_loopct_
;
.0000050 lda facc,x
         sta _ovrflo_,x
         .rept s_word
           dex
         .endr
         bpl .0000050
;
.0000060 ply
         dey
         bne .0000010
;
         tyx
         ldy #_loopct_
;
.0000070 rol faca,x
         .rept s_word
           inx
         .endr
         dey
         bne .0000070
;
         longx
         ldx _ovrflo_
         ldy _ovrflo_+s_word
         txa
         ora _ovrflo_+s_word
         clc
         rts
I haven't yet had the pleasure of trying out the 65C816 but I'm looking forward to it. I hear great things about it and your listing looks quite interesting, Thank you. Cheers and a Happy New Year to you as well!


Last edited by jsii on Thu Jan 04, 2018 9:28 pm, edited 1 time in total.

Top
 Profile  
Reply with quote  
PostPosted: Tue Jan 02, 2018 3:46 pm 
Offline
User avatar

Joined: Wed Aug 17, 2005 12:07 am
Posts: 1228
Location: Soddy-Daisy, TN USA
sark02 wrote:
Isn't it interesting that after 41 or so years, people still find fun in implementing MUL, and finding ways to optimize.


I find this amusing as well. Not just with 6502 code, but 8/16-bit computers in general. I still come across forum posts were people argue over which 8-bit is best. Atari or Commodore. Sega or NES. It's hilarious and (mostly) good fun.

_________________
Cat; the other white meat.


Top
 Profile  
Reply with quote  
PostPosted: Tue Jan 02, 2018 9:58 pm 
Offline

Joined: Sat Dec 13, 2003 3:37 pm
Posts: 1004
cbmeeks wrote:
I still come across forum posts were people argue over which 8-bit is best. Atari or Commodore. Sega or NES. It's hilarious and (mostly) good fun.

Of course it's hilarious. Everyone knows it's the Atari...


Top
 Profile  
Reply with quote  
PostPosted: Tue Jan 02, 2018 10:18 pm 
Offline
User avatar

Joined: Fri Aug 30, 2002 1:09 am
Posts: 8508
Location: Southern California
...and it brings to our attention the fact that there are plenty of factors besides the processor that determine a computer's capabilities.

_________________
http://WilsonMinesCo.com/ lots of 6502 resources
The "second front page" is http://wilsonminesco.com/links.html .
What's an additional VIA among friends, anyhow?


Top
 Profile  
Reply with quote  
PostPosted: Thu Jan 04, 2018 9:30 pm 
Offline

Joined: Wed Mar 16, 2016 8:19 pm
Posts: 38
cbmeeks wrote:
sark02 wrote:
Isn't it interesting that after 41 or so years, people still find fun in implementing MUL, and finding ways to optimize.


I find this amusing as well. Not just with 6502 code, but 8/16-bit computers in general. I still come across forum posts were people argue over which 8-bit is best. Atari or Commodore. Sega or NES. It's hilarious and (mostly) good fun.
This should tell us something, shouldn't it.


Last edited by jsii on Thu Jan 04, 2018 9:32 pm, edited 1 time in total.

Top
 Profile  
Reply with quote  
PostPosted: Thu Jan 04, 2018 9:32 pm 
Offline

Joined: Wed Mar 16, 2016 8:19 pm
Posts: 38
whartung wrote:
cbmeeks wrote:
I still come across forum posts were people argue over which 8-bit is best. Atari or Commodore. Sega or NES. It's hilarious and (mostly) good fun.

Of course it's hilarious. Everyone knows it's the Atari...
This kind of discussion happens all the time.


Top
 Profile  
Reply with quote  
PostPosted: Thu Jan 04, 2018 9:33 pm 
Offline

Joined: Wed Mar 16, 2016 8:19 pm
Posts: 38
GARTHWILSON wrote:
...and it brings to our attention the fact that there are plenty of factors besides the processor that determine a computer's capabilities.
Worth noting.


Top
 Profile  
Reply with quote  
PostPosted: Fri Jan 05, 2018 2:13 pm 
Offline

Joined: Sat Jul 28, 2012 11:41 am
Posts: 442
Location: Wiesbaden, Germany
I made a macro for my preferred version of the multiply representing a good balance of speed vs. size. Its size is 58 bytes for 32 by 32 and includes clearing the high bytes of the product. It was tested in Kowalski and executes in 1483-2699 cycles for a 32 bits by 32 bits multiply. The 27 byte subroutine of jsii executes in 3202-5442 cycles and has a bug when the subroutine is entered with carry set. The lower number of cycles is for a multiplier of 0 and the higher number of cycles is caused by the worst case multiplier of $ffffffff.
Code:
;
; MULTIPLY macro
;
; params:   %1 - address of multiplicand
;           %2 - address of multiplier (replaced by product)
;           %3 - byte count of multiplicand
;           %4 - byte count of multiplier
; product byte count = %3 + %4    !! enough space must be reserved for %2 !!
;
multiply    .macro ...
            lda #$00        ;Clear upper half of product (MSB kept in A)
.offset     .set   %4
            .repeat %3-1
                sta %2+.offset
.offset         .set .offset+1
            .endr
            ldy #%4*8+1     ;Set binary count to # multiplier bits + 1
            bne .start_r
.loop       bcc .skip_add   ;nothing to add on c = 0
            clc             ;add multiplicand to product
            .if   %3 > 1
                pha             ;save MSB of product
.offset         .set  0
                .repeat %3-1   
                    lda %2+%4+.offset
                    adc %1+.offset
                    sta %2+%4+.offset
.offset             .set  .offset+1
                .endr
                pla             ;restore MSB
            .endif
            adc %1+.offset  ;add MSB
.skip_add   ror             ;rotate upper product with carry from MSB
.offset     .set  %3+%4-2
            .repeat %3-1
                ror %2+.offset
.offset         .set .offset-1
            .endr
.start_r    .repeat %4      ;rotate multiplier and lower product
                ror %2+.offset
.offset         .set  .offset-1
            .endr
            dey             ;decrement multiplier bit count
            bne .loop       ;loop until bit count is exhausted
            sta %2+%3+%4-1  ;finally we need to store the MSB
            .endm
           
; testing the macro
            .org    $40
PROD                                ;product overlaps multiplier
MULR        .db     $ff,$ff,$ff,$ff ;multiplier (replaced by product)
            .ds     4               ;additional space for product
MULND       .db     $ff,$ff,$ff,$ff ;multiplicand

            .org    $1000
            .start  *
            multiply MULND,MULR,4,4
            rts
            .end
Greetings to ELIZA.

_________________
6502 sources on GitHub: https://github.com/Klaus2m5


Top
 Profile  
Reply with quote  
PostPosted: Fri Jan 05, 2018 9:26 pm 
Offline

Joined: Wed Mar 16, 2016 8:19 pm
Posts: 38
Klaus2m5 wrote:
I made a macro for my preferred version of the multiply representing a good balance of speed vs. size. Its size is 58 bytes for 32 by 32 and includes clearing the high bytes of the product. It was tested in Kowalski and executes in 1483-2699 cycles for a 32 bits by 32 bits multiply. The 27 byte subroutine of jsii executes in 3202-5442 cycles and has a bug when the subroutine is entered with carry set. The lower number of cycles is for a multiplier of 0 and the higher number of cycles is caused by the worst case multiplier of $ffffffff.
Code:
;
; MULTIPLY macro
;
; params:   %1 - address of multiplicand
;           %2 - address of multiplier (replaced by product)
;           %3 - byte count of multiplicand
;           %4 - byte count of multiplier
; product byte count = %3 + %4    !! enough space must be reserved for %2 !!
;
multiply    .macro ...
            lda #$00        ;Clear upper half of product (MSB kept in A)
.offset     .set   %4
            .repeat %3-1
                sta %2+.offset
.offset         .set .offset+1
            .endr
            ldy #%4*8+1     ;Set binary count to # multiplier bits + 1
            bne .start_r
.loop       bcc .skip_add   ;nothing to add on c = 0
            clc             ;add multiplicand to product
            .if   %3 > 1
                pha             ;save MSB of product
.offset         .set  0
                .repeat %3-1   
                    lda %2+%4+.offset
                    adc %1+.offset
                    sta %2+%4+.offset
.offset             .set  .offset+1
                .endr
                pla             ;restore MSB
            .endif
            adc %1+.offset  ;add MSB
.skip_add   ror             ;rotate upper product with carry from MSB
.offset     .set  %3+%4-2
            .repeat %3-1
                ror %2+.offset
.offset         .set .offset-1
            .endr
.start_r    .repeat %4      ;rotate multiplier and lower product
                ror %2+.offset
.offset         .set  .offset-1
            .endr
            dey             ;decrement multiplier bit count
            bne .loop       ;loop until bit count is exhausted
            sta %2+%3+%4-1  ;finally we need to store the MSB
            .endm
           
; testing the macro
            .org    $40
PROD                                ;product overlaps multiplier
MULR        .db     $ff,$ff,$ff,$ff ;multiplier (replaced by product)
            .ds     4               ;additional space for product
MULND       .db     $ff,$ff,$ff,$ff ;multiplicand

            .org    $1000
            .start  *
            multiply MULND,MULR,4,4
            rts
            .end
Greetings to ELIZA.
Based on your tests mentioned above, your subroutine is clearly better with a combination of size and speed; congratulations and thank you for the good work. Regarding the bug on the 25 byte subroutine (mentioning '25' now just for fun) and speaking of the carry flag, I wonder what the decimal mode flag does to it (just imagine). Next I might point out "clear the carry and the decimal mode flags before using it" in the instructions, but that might actually be cheating now, wouldn't it (based on the 'rules' -- just a light joke); or we can add a good "CLC/CLD" combination and call it a 27 byte subroutine at that, at least for the time being. I'd still like to see a 21 byte improvement anyway, though.

Thank you again and very good work.
Happy New Year.
Cheers!


Top
 Profile  
Reply with quote  
Display posts from previous:  Sort by  
Post new topic Reply to topic  [ 39 posts ]  Go to page Previous  1, 2, 3

All times are UTC


Who is online

Users browsing this forum: No registered users and 8 guests


You cannot post new topics in this forum
You cannot reply to topics in this forum
You cannot edit your posts in this forum
You cannot delete your posts in this forum
You cannot post attachments in this forum

Search for:
Jump to:  
cron