6502.org Forum  Projects  Code  Documents  Tools  Forum
It is currently Fri Nov 22, 2024 7:01 pm

All times are UTC




Post new topic Reply to topic  [ 23 posts ]  Go to page Previous  1, 2
Author Message
PostPosted: Tue Oct 10, 2023 8:01 pm 
Offline

Joined: Thu Jun 17, 2021 7:53 am
Posts: 37
and3rson wrote:
This thread totally nerd-sniped me. I decided to go for time efficiency and implement a solution similar to Chromatix's idea of a decision-based algorithm that picks one of 8 cases (2 of which are invalid).
I was tempted to do some self-modifying code magic (opcodes of "STY/STA/STX zp" and "LDY/LDA/LDX zp" are dangerously close to each other, and they all differ only with 2 lowest bits! :lol:) - but I didn't have guts to make any use of that...

Snipe :D ! This is an interesting version, similar to gfoot's thoughts at one point too. I tried out the code here and noticed the second test is swapped with the third test (second test is really zp0 >= zp2 and third test is zp1 >= zp2) which means the table is mis-ordered, but this works otherwise. I only have humble 6502 here, not 65C02, so I replaced the "TAX: JMP (table,X)" with
Code:
        STA .addr_low
.addr_low = *+1
        JMP (.table)

for a bit of self modifying code, and put the table at the start of a page.

and3rson wrote:
EDIT: Turns out `reorder` macro can be easily improved:

Careful: The reads need to happen before the writes:

Code:
!macro  reorder .aa, .bb, .cc {
    !if .aa != zp0 {
        LDA zp0
    }
    !if .cc != zp2 {
        LDX zp2
    }

    !if .aa != zp0 {
        STA .aa
    }
    !if .bb != zp1 {
        ; Y is already loaded with zp1
        STY .bb
    }
    !if .cc != zp2 {
        STX .cc
    }
}


With these changes I get 84 bytes of code, and ~50 cycles on average (including the RTS).

EDIT: Actually, using the 6502 indirect jump leaves X register free which we can use to store zp0, reducing the size of the reorder operations. 75 bytes, ~48 cycles average.

Revised version for reference:
Code:
zp0 = $02
zp1 = $03
zp2 = $04

!macro  reorder .aa, .bb, .cc {
    !if .cc != zp2 {
        LDA zp2
    }
    !if .aa != zp0 {
        STX .aa
    }
    !if .bb != zp1 {
        STY .bb
    }
    !if .cc != zp2 {
        STA .cc
    }
}

* = $0200

; table at start of page
.table
        !word .swap0  ; zp0 <  zp1, zp0 <  zp2, zp1 <  zp2
        !word .swap2  ; zp0 <  zp1, zp0 <  zp2, zp1 >= zp2
        !word 0       ; zp0 <  zp1, zp0 >= zp2, zp1 <  zp2 - not possible
        !word .swap3  ; zp0 <  zp1, zp0 >= zp2, zp1 >= zp2
        !word .swap4  ; zp0 >= zp1, zp0 <  zp2, zp1 <  zp2
        !word 0       ; zp0 >= zp1, zp0 <  zp2, zp1 >= zp2 - not possible
        !word .swap5  ; zp0 >= zp1, zp0 >= zp2, zp1 <  zp2
        !word .swap7  ; zp0 >= zp1, zp0 >= zp2, zp1 >= zp2

.sort
        LDA #0
        LDX zp0
        CPX zp1
        ROL             ; 1 if zp0 >= zp1
        CPX zp2
        ROL             ; 1 if zp0 >= zp2
        LDY zp1
        CPY zp2
        ROL             ; 1 if zp1 >= zp2
        ASL

        STA .addr_low
.addr_low = *+1
        JMP (.table)

.swap2
        +reorder zp0, zp2, zp1
.swap0
        RTS

.swap3
        +reorder zp1, zp2, zp0
        RTS

.swap4
        +reorder zp1, zp0, zp2
        RTS

.swap5
        +reorder zp2, zp0, zp1
        RTS

.swap7
        +reorder zp2, zp1, zp0
        RTS


Top
 Profile  
Reply with quote  
PostPosted: Tue Oct 10, 2023 8:33 pm 
Offline
User avatar

Joined: Fri Feb 17, 2023 11:59 pm
Posts: 163
Location: Lviv, Ukraine
TobyLobster wrote:
This is an interesting version, similar to gfoot's thoughts at one point too. I tried out the code here and noticed the second test is swapped with the third test (second test is really zp0 >= zp2 and third test is zp1 >= zp2) which means the table is mis-ordered, but this works otherwise. I only have humble 6502 here, not 65C02, so I replaced the "TAX: JMP (table,X)" with
Code:
        STA .addr_low
.addr_low = *+1
        JMP (.table)

for a bit of self modifying code, and put the table at the start of a page.

Ah shoot, my bad! Thanks for pointing out.
The JMP trick is very clever - I haven't thought of it!

TobyLobster wrote:
Careful: The reads need to happen before the writes:

Code:
!macro  reorder .aa, .bb, .cc {
    !if .aa != zp0 {
        LDA zp0
    }
    !if .cc != zp2 {
        LDX zp2
    }

    !if .aa != zp0 {
        STA .aa
    }
    !if .bb != zp1 {
        ; Y is already loaded with zp1
        STY .bb
    }
    !if .cc != zp2 {
        STX .cc
    }
}


With these changes I get 84 bytes of code, and ~50 cycles on average (including the RTS).

EDIT: Actually, using the 6502 indirect jump leaves X register free which we can use to store zp0, reducing the size of the reorder operations. 76 bytes, ~48 cycles average.


Another oversight from my end! I wish I did more testing before posting it. :D Thanks!
Using self-modifying JMP is a really cool trick, I didn't think about it. I can see how it can nicely save X from getting clobbered - of course, as long as the code is in RAM.

_________________
/Andrew

deck65 - 6502 slab with screen and keyboard | ПК-88 - SBC based on KM1810VM88 (Ukrainian i8088 clone) | leo80 - simple Z80 SBC
nice65 - 6502 assembly linter | My parts, footprints & 3D models for KiCad/FreeCAD


Top
 Profile  
Reply with quote  
PostPosted: Wed Oct 11, 2023 3:51 pm 
Offline

Joined: Thu Jun 17, 2021 7:53 am
Posts: 37
gfoot wrote:
Edit: And here's a first pass speed-optimised version - just writing out all the comparisons and swaps that are required, without attempting to share any code between branches.

A slight improvement to gfoot's speedy version: I managed to shave 3 bytes off with the same cycle count as before just by merging .z2_lt_z1_lt_z0 into .z1_lt_z0_lt_z2. The branches seem to be the right way around. So this is the fastest version known:

Code:
; 60 bytes, 34.02 cycles
SortThree
    LDA z0
    CMP z1
    BCC .z0_lt_z1

    ; z1 < z0
    CMP z2
    BCC .z1_lt_z0_lt_z2

    ; z1 < z0 and z2 < z0
    LDX z2
    STA z2
    CPX z1
    BCC .z2_lt_z1_lt_z0

    ; z1 < z2 < z0
    LDA z1
    STA z0
    STX z1
    RTS

.z1_lt_z0_lt_z2
    LDX z1
    STA z1
.z2_lt_z1_lt_z0
    STX z0
    RTS

.z0_lt_z1
    CMP z2
    BCS .z2_lt_z0_lt_z1

    ; z0 < z1 and z0 < z2
    LDA z1
    CMP z2
    BCC .return

    LDX z2
    STX z1
    STA z2
.return
    RTS

.z2_lt_z0_lt_z1
    LDX z1
    LDY z2
    STY z0
    STA z1
    STX z2
    RTS


Top
 Profile  
Reply with quote  
PostPosted: Wed Oct 11, 2023 4:19 pm 
Offline

Joined: Fri Jul 09, 2021 10:12 pm
Posts: 741
Cool. I did wonder whether starting with loading z1 might be more efficient somehow. In case it needs moving, we learn that quickly, and otherwise we are well placed to immediately compare it with the other candidate. I'm not sure though it would help or hinder overall but the flow at least would be subtly different.


Top
 Profile  
Reply with quote  
PostPosted: Wed Oct 11, 2023 5:37 pm 
Offline

Joined: Thu Jun 17, 2021 7:53 am
Posts: 37
gfoot wrote:
Cool. I did wonder whether starting with loading z1 might be more efficient somehow. In case it needs moving, we learn that quickly, and otherwise we are well placed to immediately compare it with the other candidate. I'm not sure though it would help or hinder overall but the flow at least would be subtly different.

I just had a go at this. It saves three bytes, but 0.5 cycles slower. :?


Top
 Profile  
Reply with quote  
PostPosted: Thu Oct 12, 2023 2:20 am 
Offline

Joined: Mon Sep 17, 2018 2:39 am
Posts: 138
Hi!

TobyLobster wrote:
A slight improvement to gfoot's speedy version: I managed to shave 3 bytes off with the same cycle count as before just by merging .z2_lt_z1_lt_z0 into .z1_lt_z0_lt_z2. The branches seem to be the right way around. So this is the fastest version known:



And here, 6 bytes less:

Code:
; 54 bytes, 34.02 cycles
SortThree
    LDA z0
    CMP z1
    BCC .z0_lt_z1

    ; z1 < z0
    CMP z2
    BCC .z1_lt_z0_lt_z2

    ; z1 < z0 and z2 < z0
    LDX z2
    STA z2
    CPX z1
    BCC .z2_lt_z1_lt_z0

    ; z1 < z2 < z0
    LDA z1
    STA z0
    STX z1
    RTS

.z1_lt_z0_lt_z2
    LDX z1
    STA z1
.z2_lt_z1_lt_z0
    STX z0
    RTS

.z0_lt_z1
    CMP z2
    BCS .z2_lt_z0_lt_z1

    ; z0 < z1 and z0 < z2
    LDA z2
    CMP z1
    BCC .swap_z1_z2

    RTS

.z2_lt_z0_lt_z1
    LDY z2
    STY z0
.swap_z1_z2
    LDX z1
    STA z1
    STX z2
    RTS


Testing all 16M cases, I get 34.002 cycles/call, this is the profile of the full run:

Code:
  CYCLES    CYCLES     INSTRUCTION                            BYTES
  TOTAL     /CALL
--------------------------------------------------------------------------------------------
 50331648   3.0000     SORTTHREE:       LDA Z0              ; [A5][80]
 50331648   3.0000                      CMP Z1              ; [C5][81]
 41910272   2.4980                      BCC .Z0_LT_Z1       ; [90][1A] (8355840 times taken)
 25264128   1.5059                      CMP Z2              ; [C5][82]
 19638912   1.1706                      BCC .Z1_LT_Z0_LT_Z2 ; [90][0F] (2796160 times taken)
 16875648   1.0059                      LDX Z2              ; [A6][82]
 16875648   1.0059                      STA Z2              ; [85][82]
 16875648   1.0059                      CPX Z1              ; [E4][81]
 14046592   0.8372                      BCC .Z2_LT_Z1_LT_Z0 ; [90][0B] (2796160 times taken)
  8487168   0.5059                      LDA Z1              ; [A5][81]
  8487168   0.5059                      STA Z0              ; [85][80]
  8487168   0.5059                      STX Z1              ; [86][81]
 16974336   1.0117                      RTS                 ; [60]
  8388480   0.5000     .Z1_LT_Z0_LT_Z2: LDX Z1              ; [A6][81]
  8388480   0.5000                      STA Z1              ; [85][81]
 16776960   1.0000     .Z2_LT_Z1_LT_Z0: STX Z0              ; [86][80]
 33553920   2.0000                      RTS                 ; [60]
 25067520   1.4941     .Z0_LT_Z1:       CMP Z2              ; [C5][82]
 19507840   1.1628                      BCS .Z2_LT_Z0_LT_Z1 ; [B0][07] (2796160 times taken)
 16679040   0.9941                      LDA Z2              ; [A5][82]
 16679040   0.9941                      CMP Z1              ; [C5][81]
 13882880   0.8275                      BCC .SWAP_Z1_Z2     ; [90][05] (2763520 times taken)
 16776960   1.0000                      RTS                 ; [60]
  8388480   0.5000     .Z2_LT_Z0_LT_Z1: LDY Z2              ; [A4][82]
  8388480   0.5000                      STY Z0              ; [84][80]
 16679040   0.9941     .SWAP_Z1_Z2:     LDX Z1              ; [A6][81]
 16679040   0.9941                      STA Z1              ; [85][81]
 16679040   0.9941                      STX Z2              ; [86][82]
 33358080   1.9883                      RTS                 ; [60]
--------------------------------------------------------------------------------------------
570459264  34.0020


Have Fun!


Top
 Profile  
Reply with quote  
PostPosted: Thu Oct 12, 2023 2:46 am 
Offline
User avatar

Joined: Sun Jun 30, 2013 10:26 pm
Posts: 1949
Location: Sacramento, CA, USA
gfoot wrote:
teamtempest, nice, I just did roughly the same thing - it was 20 bytes, 90 cycles.

Edit: but this is even better, 18 bytes and 84 cycles - note the entry point is not at the start:

Code:
.swap
    ldy z0,X
    sty z1,X
    sta z0,X
.SortThree
    ldx #1
.cmp
    lda z1,X
    cmp z0,X
    bcc swap
.noswap
    dex
    bpl cmp
    rts

That's almost certainly how Woz would've done it in 1976. I approve!

_________________
Got a kilobyte lying fallow in your 65xx's memory map? Sprinkle some VTL02C on it and see how it grows on you!

Mike B. (about me) (learning how to github)


Top
 Profile  
Reply with quote  
PostPosted: Thu Oct 12, 2023 7:19 am 
Offline

Joined: Thu Jun 17, 2021 7:53 am
Posts: 37
dmsc wrote:
Hi!
And here, 6 bytes less:

Excellent! Every time I think "surely that's the limit. Can't get any better than that.", but I always seem to be wrong!


Top
 Profile  
Reply with quote  
Display posts from previous:  Sort by  
Post new topic Reply to topic  [ 23 posts ]  Go to page Previous  1, 2

All times are UTC


Who is online

Users browsing this forum: No registered users and 26 guests


You cannot post new topics in this forum
You cannot reply to topics in this forum
You cannot edit your posts in this forum
You cannot delete your posts in this forum
You cannot post attachments in this forum

Search for:
Jump to: