Mini-challenge: Sort three bytes

TobyLobster · Post by **TobyLobster** » Tue Oct 10, 2023 8:01 pm

and3rson wrote:

This thread totally nerd-sniped me. I decided to go for time efficiency and implement a solution similar to Chromatix's idea of a decision-based algorithm that picks one of 8 cases (2 of which are invalid).
I was tempted to do some self-modifying code magic (opcodes of "STY/STA/STX zp" and "LDY/LDA/LDX zp" are dangerously close to each other, and they all differ only with 2 lowest bits!

) - but I didn't have guts to make any use of that...

Snipe

! This is an interesting version, similar to gfoot's thoughts at one point too. I tried out the code here and noticed the second test is swapped with the third test (second test is really zp0 >= zp2 and third test is zp1 >= zp2) which means the table is mis-ordered, but this works otherwise. I only have humble 6502 here, not 65C02, so I replaced the "TAX: JMP (table,X)" with

Code: Select all

        STA .addr_low
.addr_low = *+1
        JMP (.table)

for a bit of self modifying code, and put the table at the start of a page.

and3rson wrote:

EDIT: Turns out `reorder` macro can be easily improved:

Careful: The reads need to happen before the writes:

Code: Select all

!macro  reorder .aa, .bb, .cc {
    !if .aa != zp0 {
        LDA zp0
    }
    !if .cc != zp2 {
        LDX zp2
    }

    !if .aa != zp0 {
        STA .aa
    }
    !if .bb != zp1 {
        ; Y is already loaded with zp1
        STY .bb
    }
    !if .cc != zp2 {
        STX .cc
    }
}

With these changes I get 84 bytes of code, and ~50 cycles on average (including the RTS).

EDIT: Actually, using the 6502 indirect jump leaves X register free which we can use to store zp0, reducing the size of the reorder operations. 75 bytes, ~48 cycles average.

Revised version for reference:

Code: Select all

zp0 = $02
zp1 = $03
zp2 = $04

!macro  reorder .aa, .bb, .cc {
    !if .cc != zp2 {
        LDA zp2
    }
    !if .aa != zp0 {
        STX .aa
    }
    !if .bb != zp1 {
        STY .bb
    }
    !if .cc != zp2 {
        STA .cc
    }
}

* = $0200

; table at start of page
.table
        !word .swap0  ; zp0 <  zp1, zp0 <  zp2, zp1 <  zp2
        !word .swap2  ; zp0 <  zp1, zp0 <  zp2, zp1 >= zp2
        !word 0       ; zp0 <  zp1, zp0 >= zp2, zp1 <  zp2 - not possible
        !word .swap3  ; zp0 <  zp1, zp0 >= zp2, zp1 >= zp2
        !word .swap4  ; zp0 >= zp1, zp0 <  zp2, zp1 <  zp2
        !word 0       ; zp0 >= zp1, zp0 <  zp2, zp1 >= zp2 - not possible
        !word .swap5  ; zp0 >= zp1, zp0 >= zp2, zp1 <  zp2
        !word .swap7  ; zp0 >= zp1, zp0 >= zp2, zp1 >= zp2

.sort
        LDA #0
        LDX zp0
        CPX zp1
        ROL             ; 1 if zp0 >= zp1
        CPX zp2
        ROL             ; 1 if zp0 >= zp2
        LDY zp1
        CPY zp2
        ROL             ; 1 if zp1 >= zp2
        ASL

        STA .addr_low
.addr_low = *+1
        JMP (.table)

.swap2
        +reorder zp0, zp2, zp1
.swap0
        RTS

.swap3
        +reorder zp1, zp2, zp0
        RTS

.swap4
        +reorder zp1, zp0, zp2
        RTS

.swap5
        +reorder zp2, zp0, zp1
        RTS

.swap7
        +reorder zp2, zp1, zp0
        RTS

and3rson · Post by **and3rson** » Tue Oct 10, 2023 8:33 pm

TobyLobster wrote:

This is an interesting version, similar to gfoot's thoughts at one point too. I tried out the code here and noticed the second test is swapped with the third test (second test is really zp0 >= zp2 and third test is zp1 >= zp2) which means the table is mis-ordered, but this works otherwise. I only have humble 6502 here, not 65C02, so I replaced the "TAX: JMP (table,X)" with

Code: Select all

        STA .addr_low
.addr_low = *+1
        JMP (.table)

for a bit of self modifying code, and put the table at the start of a page.

Ah shoot, my bad! Thanks for pointing out.
The JMP trick is very clever - I haven't thought of it!

TobyLobster wrote:

Careful: The reads need to happen before the writes:

Code: Select all

!macro  reorder .aa, .bb, .cc {
    !if .aa != zp0 {
        LDA zp0
    }
    !if .cc != zp2 {
        LDX zp2
    }

    !if .aa != zp0 {
        STA .aa
    }
    !if .bb != zp1 {
        ; Y is already loaded with zp1
        STY .bb
    }
    !if .cc != zp2 {
        STX .cc
    }
}

With these changes I get 84 bytes of code, and ~50 cycles on average (including the RTS).

EDIT: Actually, using the 6502 indirect jump leaves X register free which we can use to store zp0, reducing the size of the reorder operations. 76 bytes, ~48 cycles average.

Another oversight from my end! I wish I did more testing before posting it.

Thanks!
Using self-modifying JMP is a really cool trick, I didn't think about it. I can see how it can nicely save X from getting clobbered - of course, as long as the code is in RAM.

TobyLobster · Post by **TobyLobster** » Wed Oct 11, 2023 3:51 pm

gfoot wrote:

Edit: And here's a first pass speed-optimised version - just writing out all the comparisons and swaps that are required, without attempting to share any code between branches.

A slight improvement to gfoot's speedy version: I managed to shave 3 bytes off with the same cycle count as before just by merging .z2_lt_z1_lt_z0 into .z1_lt_z0_lt_z2. The branches seem to be the right way around. So this is the fastest version known:

Code: Select all

; 60 bytes, 34.02 cycles
SortThree
    LDA z0
    CMP z1
    BCC .z0_lt_z1

    ; z1 < z0
    CMP z2
    BCC .z1_lt_z0_lt_z2

    ; z1 < z0 and z2 < z0
    LDX z2
    STA z2
    CPX z1
    BCC .z2_lt_z1_lt_z0

    ; z1 < z2 < z0
    LDA z1
    STA z0
    STX z1
    RTS

.z1_lt_z0_lt_z2
    LDX z1
    STA z1
.z2_lt_z1_lt_z0
    STX z0
    RTS

.z0_lt_z1
    CMP z2
    BCS .z2_lt_z0_lt_z1

    ; z0 < z1 and z0 < z2
    LDA z1
    CMP z2
    BCC .return

    LDX z2
    STX z1
    STA z2
.return
    RTS

.z2_lt_z0_lt_z1
    LDX z1
    LDY z2
    STY z0
    STA z1
    STX z2
    RTS

gfoot · Post by **gfoot** » Wed Oct 11, 2023 4:19 pm

Cool. I did wonder whether starting with loading z1 might be more efficient somehow. In case it needs moving, we learn that quickly, and otherwise we are well placed to immediately compare it with the other candidate. I'm not sure though it would help or hinder overall but the flow at least would be subtly different.

TobyLobster · Post by **TobyLobster** » Wed Oct 11, 2023 5:37 pm

gfoot wrote:

Cool. I did wonder whether starting with loading z1 might be more efficient somehow. In case it needs moving, we learn that quickly, and otherwise we are well placed to immediately compare it with the other candidate. I'm not sure though it would help or hinder overall but the flow at least would be subtly different.

I just had a go at this. It saves three bytes, but 0.5 cycles slower.

dmsc · Post by **dmsc** » Thu Oct 12, 2023 2:20 am

Hi!

TobyLobster wrote:

A slight improvement to gfoot's speedy version: I managed to shave 3 bytes off with the same cycle count as before just by merging .z2_lt_z1_lt_z0 into .z1_lt_z0_lt_z2. The branches seem to be the right way around. So this is the fastest version known:

And here, 6 bytes less:

Code: Select all

; 54 bytes, 34.02 cycles
SortThree
    LDA z0
    CMP z1
    BCC .z0_lt_z1

    ; z1 < z0
    CMP z2
    BCC .z1_lt_z0_lt_z2

    ; z1 < z0 and z2 < z0
    LDX z2
    STA z2
    CPX z1
    BCC .z2_lt_z1_lt_z0

    ; z1 < z2 < z0
    LDA z1
    STA z0
    STX z1
    RTS

.z1_lt_z0_lt_z2
    LDX z1
    STA z1
.z2_lt_z1_lt_z0
    STX z0
    RTS

.z0_lt_z1
    CMP z2
    BCS .z2_lt_z0_lt_z1

    ; z0 < z1 and z0 < z2
    LDA z2
    CMP z1
    BCC .swap_z1_z2

    RTS

.z2_lt_z0_lt_z1
    LDY z2
    STY z0
.swap_z1_z2
    LDX z1
    STA z1
    STX z2
    RTS

Testing all 16M cases, I get 34.002 cycles/call, this is the profile of the full run:

Code: Select all

  CYCLES    CYCLES     INSTRUCTION                            BYTES
  TOTAL     /CALL
--------------------------------------------------------------------------------------------
 50331648   3.0000     SORTTHREE:       LDA Z0              ; [A5][80]
 50331648   3.0000                      CMP Z1              ; [C5][81]
 41910272   2.4980                      BCC .Z0_LT_Z1       ; [90][1A] (8355840 times taken)
 25264128   1.5059                      CMP Z2              ; [C5][82]
 19638912   1.1706                      BCC .Z1_LT_Z0_LT_Z2 ; [90][0F] (2796160 times taken)
 16875648   1.0059                      LDX Z2              ; [A6][82]
 16875648   1.0059                      STA Z2              ; [85][82]
 16875648   1.0059                      CPX Z1              ; [E4][81]
 14046592   0.8372                      BCC .Z2_LT_Z1_LT_Z0 ; [90][0B] (2796160 times taken)
  8487168   0.5059                      LDA Z1              ; [A5][81]
  8487168   0.5059                      STA Z0              ; [85][80]
  8487168   0.5059                      STX Z1              ; [86][81]
 16974336   1.0117                      RTS                 ; [60]
  8388480   0.5000     .Z1_LT_Z0_LT_Z2: LDX Z1              ; [A6][81]
  8388480   0.5000                      STA Z1              ; [85][81]
 16776960   1.0000     .Z2_LT_Z1_LT_Z0: STX Z0              ; [86][80]
 33553920   2.0000                      RTS                 ; [60]
 25067520   1.4941     .Z0_LT_Z1:       CMP Z2              ; [C5][82]
 19507840   1.1628                      BCS .Z2_LT_Z0_LT_Z1 ; [B0][07] (2796160 times taken)
 16679040   0.9941                      LDA Z2              ; [A5][82]
 16679040   0.9941                      CMP Z1              ; [C5][81]
 13882880   0.8275                      BCC .SWAP_Z1_Z2     ; [90][05] (2763520 times taken)
 16776960   1.0000                      RTS                 ; [60]
  8388480   0.5000     .Z2_LT_Z0_LT_Z1: LDY Z2              ; [A4][82]
  8388480   0.5000                      STY Z0              ; [84][80]
 16679040   0.9941     .SWAP_Z1_Z2:     LDX Z1              ; [A6][81]
 16679040   0.9941                      STA Z1              ; [85][81]
 16679040   0.9941                      STX Z2              ; [86][82]
 33358080   1.9883                      RTS                 ; [60]
--------------------------------------------------------------------------------------------
570459264  34.0020

Have Fun!

barrym95838 · Post by **barrym95838** » Thu Oct 12, 2023 2:46 am

gfoot wrote:

teamtempest, nice, I just did roughly the same thing - it was 20 bytes, 90 cycles.

Edit: but this is even better, 18 bytes and 84 cycles - note the entry point is not at the start:

Code: Select all

.swap
    ldy z0,X
    sty z1,X
    sta z0,X
.SortThree
    ldx #1
.cmp
    lda z1,X
    cmp z0,X
    bcc swap
.noswap
    dex
    bpl cmp
    rts

That's almost certainly how Woz would've done it in 1976. I approve!

TobyLobster · Post by **TobyLobster** » Thu Oct 12, 2023 7:19 am

dmsc wrote:

Hi!
And here, 6 bytes less:

Excellent! Every time I think "surely that's the limit. Can't get any better than that.", but I always seem to be wrong!

Mini-challenge: Sort three bytes

Re: Mini-challenge: Sort three bytes

Re: Mini-challenge: Sort three bytes

Re: Mini-challenge: Sort three bytes

Re: Mini-challenge: Sort three bytes

Re: Mini-challenge: Sort three bytes

Re: Mini-challenge: Sort three bytes

Re: Mini-challenge: Sort three bytes

Re: Mini-challenge: Sort three bytes