6502.org

Posted: **Mon Oct 09, 2023 8:25 am**

Anyone up for a fun little challenge?

Suppose there are three bytes in zero page: z0, z1, z2.
The result should be the same three bytes sorted in place from smallest to largest: z0 <= z1 <= z2.

Two categories: smallest code, and fastest code!

Posted: **Mon Oct 09, 2023 10:55 am**

Let's start with a straightforward approach, then:

Code: Select all

SortThree:
  LDA z0
  LDX z1
  LDY z2

  CMP z1
  BCS :+
  STX z0
  TAX
  LDA z0

: CMP z2
  BCS :+
  STA z2
  STY z0
  TYA

: CPX z2
  BCS :+
  STX z2
  STY z1
  RTS

: STX z1
  RTS

The fastest approach on average is probably a decision tree, loading all the values into registers (as above), then taking the three comparisons in sequence to produce eight leaves, and just perform up to three stores in each leaf. That would be a lot bigger than the above, however. The tricky part would be making the above smaller.

Posted: **Mon Oct 09, 2023 12:06 pm**

My first thought was to CMP pairs and ROL the carry bits into the bottom of the accumulator, then figure out what to do. But that's probably slow!

Based on Chromatix's code, I made some changes that make it a bit shorter, I haven't thought through the speed implications though:

Code: Select all

SortThree:
  LDA z0
  LDX z1

  CMP z1
  BCC :+
  STX z0
  TAX

: CPX z2
  BCC :+

  LDA z2
  STX z2
  TAX

  CPX z0
  BCS :+

  LDX z0
  STA z0

: STX z1
  RTS

Edit: And here's a first pass speed-optimised version - just writing out all the comparisons and swaps that are required, without attempting to share any code between branches. It may be possible to reduce the time taken by changing the order of comparisons in some cases. By my count the worst case here is 33 cycles until the RTS, not including it.

Code: Select all

SortThree:
    LDA z0                 ; 33
    CMP z1                 ; 30
    BCC z0_lt_z1           ; 27 / 25

    ; z1 < z0
    CMP z2                 ; 25
    BCC z1_lt_z0_lt_z2     ; 22 / 12

    ; z1 < z0 and z2 < z0
    LDX z2                 ; 20
    STA z2                 ; 17
    CPX z1                 ; 14
    BCC z2_lt_z1_lt_z0     ; 11 / 6

    ; z1 < z2 < z0
    LDA z1                 ; 9
    STA z0                 ; 6
    STX z1                 ; 3
    RTS

z2_lt_z1_lt_z0:
    STX z0                 ; 3
    RTS

z1_lt_z0_lt_z2:
    LDX z1                 ; 9
    STX z0                 ; 6
    STA z1                 ; 3
    RTS

z0_lt_z1:
    CMP z2                 ; 22
    BCS z2_lt_z0_lt_z1     ; 19 / 18

    ; z0 < z1 and z0 < z2
    LDA z1                 ; 17
    CMP z2                 ; 14
    BCC return             ; 11 / 3

    LDX z2                 ; 9
    STX z1                 ; 6
    STA z2                 ; 3
return:
    RTS

z2_lt_z0_lt_z1:
    LDX z1                 ; 15
    LDY z2                 ; 12
    STY z0                 ; 9
    STA z1                 ; 6
    STX z2                 ; 3
    RTS

Posted: **Mon Oct 09, 2023 1:11 pm**

This is Chromatix version but with fixes added:

Code: Select all

; 38 bytes, 44.61 cycles average
SortThree
  LDA z0
  LDX z1
  LDY z2

  CMP z1
  BCS +         ; branch if z0 >= z1
  STX z0
  STA z1    ; TOBY: added
  TAX
  LDA z0

+ CMP z2
  BCS +         ; branch if z0 >= z2
  STA z2
  STY z0
  TYA
  LDY z2    ; TOBY: added

+ CPX z2
  BCS +         ; branch if z1 >= z2
  STX z2
  STY z1
+           ; TOBY: added
  RTS

 ;STX z1   ; TOBY: removed
 ; RTS

This is 38 bytes, 44.61 cycles average.

gfoot: your versions may not be correct without these fixes.

Posted: **Mon Oct 09, 2023 1:28 pm**

TobyLobster wrote:

gfoot: your versions may not be correct without these fixes.

I tested my first one in Owlet: https://bbcmic.ro/#%7B%22v%22%3A1%2C%22 ... C%2C%22%7D

But didn't test the second one.

Edit: enhanced test framework that measures and times each situation - applied to my second (speed) one: https://bbcmic.ro/#%7B%22v%22%3A1%2C%22 ... PROC%22%7D

Posted: **Mon Oct 09, 2023 2:15 pm**

gfoot wrote:

My first thought was to CMP pairs and ROL the carry bits into the bottom of the accumulator, then figure out what to do. But that's probably slow!

Based on Chromatix's code, I made some changes that make it a bit shorter, I haven't thought through the speed implications though:

Code: Select all

SortThree:
  LDA z0
  LDX z1

  CMP z1
  BCC :+
  STX z0
  TAX

: CPX z2
  BCC :+

  LDA z2
  STX z2
  TAX

  CPX z0
  BCS :+

  LDX z0
  STA z0

: STX z1
  RTS

The BCC :+ could go straight to the RTS to save a couple cycles. The only difference between this one and the optimized speed version that I can see is the two TAX's for a total of 4 cycles of savings thru the routine.

This routine has the added advantage that it preserves the Y-reg. Which means it can be used to seed the 3 bytes from a table or input register of some kind.

Very nice.

Posted: **Mon Oct 09, 2023 2:27 pm**

gfoot wrote:

TobyLobster wrote:

gfoot: your versions may not be correct without these fixes.

I tested my first one in Owlet

My mistake: I've run it though my own tests here and your code is correct. I have C code that runs through all possible input values doing the cycle counting.
I make your first routine 31 bytes long and ~39 cycles average, the second is 63 bytes long ~34 cycles average (including the final RTS, or ~28 without).
Good stuff.

Posted: **Mon Oct 09, 2023 2:50 pm**

IamRob wrote:

The BCC :+ could go straight to the RTS to save a couple cycles.

I think the final ": STX z1" seems to be needed, e.g. when [z0,z1,z2] = [1,2,0].

Posted: **Mon Oct 09, 2023 3:19 pm**

I can do a smaller 29 byte version (which is much slower though) using a different approach, a simple sorting network:

Code: Select all

; 29 bytes, 99.35 cycles average
.sort3
    ldx #0
    ldy #1
    jsr .swap
    iny
    jsr .swap
    inx
.swap
    lda z0,y
    cmp z0,x
    bcs .done
    pha
    lda z0,x
    sta z0,y
    pla
    sta z0,x
.done
    rts

It assumes z0,z1,z2 are adjacent in memory.

Posted: **Mon Oct 09, 2023 3:35 pm**

Oh that's very neat.

I got sidetracked making my VIA-based timing code as accurate as possible, timing down to the individual cycle even though the VIAs only run at half the CPU clock speed

Posted: **Mon Oct 09, 2023 4:01 pm**

Riffing on yours tobylobster - 22 bytes, 111 cycles on average including the RTS:

Code: Select all

.SortThree
    jsr s3
.s3
    ldx #z0
    jsr swap
    inx
.swap
    lda 0,X
    cmp 1,X
    bcc done
    ldy 1,X
    sty 0,X
    sta 1,X
.done
    rts

Posted: **Mon Oct 09, 2023 4:16 pm**

Not the world's fastest code, but only 20 bytes. Untested and no time estimate.

Code: Select all

swap:
	ldx #1

next:
	lda z1,x
	cmp z0,x
	bcs noswap
	ldy z0,x
	sta z0,x
	sty z1,x
	bcc swap

noswap:
	dex
	beq next
	rts

; worst case: z0 > z1 > z2
; 1: z1 > z2; swapped
; 2: z1 < z2; no swap; z0 > z1; swapped
: 3: z1 < z2; no swap; z0 < z1; no swap

Incidentally, I compared the way I did in case zx == zy, to prevent unnecessary swaps of equal values.

Posted: **Mon Oct 09, 2023 4:28 pm**

teamtempest, nice, I just did roughly the same thing - it was 20 bytes, 90 cycles.

Edit: but this is even better, 18 bytes and 84 cycles - note the entry point is not at the start:

Code: Select all

.swap
    ldy z0,X
    sty z1,X
    sta z0,X
.SortThree
    ldx #1
.cmp
    lda z1,X
    cmp z0,X
    bcc swap
.noswap
    dex
    bpl cmp
    rts

Posted: **Mon Oct 09, 2023 7:12 pm**

Ooh I didn't think it could get that small. Very nice.

Posted: **Tue Oct 10, 2023 6:22 pm**

This thread totally nerd-sniped me. I decided to go for time efficiency and implement a solution similar to Chromatix's idea of a decision-based algorithm that picks one of 8 cases (2 of which are invalid).
I was tempted to do some self-modifying code magic (opcodes of "STY/STA/STX zp" and "LDY/LDA/LDX zp" are dangerously close to each other, and they all differ only with 2 lowest bits!

) - but I didn't have guts to make any use of that...

Here's what I ended up with.
NOTE: This is not tested. I did some very basic sanity checks with sim65 in order to count clock cycles. But this code should give you a general idea.

Code: Select all

; O(1) time sort of three values from zero page
; 48 cycles in most cases (excluding RTS); 33 cycles if input is already sorted
; - Requires 65C02 for JMP (abs,X)
; - Not very memory-efficient
; - Requires extra cycle if `table` crosses page boundary
; - swap2..swap7 can be optimized even further by not using reorder macro (which sometimes does redundant writes)
; - zp0..zp2 can be located anywhere

; ZP addresses of inputs
zp0 = 253
zp1 = 254
zp2 = 255

.macro  reorder aa, bb, cc
        LDA zp0
        ; Y is already loaded with zp1
        LDX zp2
        STA aa
        STY bb
        STX cc
.endmacro

sort:
        LDA #0
        LDY zp0
        CPY zp1
        ROL             ; 1 if zp0 >= zp1
        CPY zp2
        ROL             ; 1 if zp0 >= zp2
        LDY zp1
        CPY zp2
        ROL             ; 1 if zp1 >= zp2
        ASL

        TAX
        JMP (table, x)  ; (jsr, rts)

table:
        .word swap0  ; zp0 <  zp1, zp1 <  zp2, zp0 <  zp2
        .word 0      ; zp0 <  zp1, zp1 <  zp2, zp0 >= zp2 - not possible
        .word swap2  ; zp0 <  zp1, zp1 >= zp2, zp0 <  zp2
        .word swap3  ; zp0 <  zp1, zp1 >= zp2, zp0 >= zp2
        .word swap4  ; zp0 >= zp1, zp1 <  zp2, zp0 <  zp2
        .word swap5  ; zp0 >= zp1, zp1 <  zp2, zp0 >= zp2
        .word 0      ; zp0 >= zp1, zp1 >= zp2, zp0 <  zp2 - not possible
        .word swap7  ; zp0 >= zp1, zp1 >= zp2, zp0 >= zp2

swap0:
        ; Already sorted (-15 cycles)
        RTS

swap2:
        reorder zp0, zp2, zp1
        RTS

swap3:
        reorder zp1, zp2, zp0
        RTS

swap4:
        reorder zp1, zp0, zp2
        RTS

swap5:
        reorder zp2, zp0, zp1
        RTS

swap7:
        reorder zp2, zp1, zp0
        RTS

EDIT: Turns out `reorder` macro can be easily improved:
EDIT 2: This updated macro has a bug. See TobyLobster's next post with a fix.

Code: Select all

.macro  reorder aa, bb, cc
    .if aa != zp0  ; save 6 cycles if already sorted
        LDA zp0
        STA aa
    .endif
    .if bb != zp1  ; save 3 cycles if already sorted
        ; Y is already loaded with zp1
        STY bb
    .endif
    .if cc != zp2  ; save 6 cycles if already sorted
        LDA zp2
        STA cc
    .endif
.endmacro

This saves 6 cycles if `zp0` or `zp2` are already sorted, and 3 cycles if `zp1` is already sorted - up to 15 saved cycles total.

EDIT 3: Here's an updated table & functions with correct equations (thanks TobyLobster!)

Code: Select all

table:
        .word swap0  ; zp0 <  zp1, zp0 <  zp2, zp1 <  zp2
        .word swap1  ; zp0 <  zp1, zp0 <  zp2, zp1 >= zp2
        .word 0      ; zp0 <  zp1, zp0 >= zp2, zp1 <  zp2 - not possible
        .word swap3  ; zp0 <  zp1, zp0 >= zp2, zp1 >= zp2
        .word swap4  ; zp0 >= zp1, zp0 <  zp2, zp1 <  zp2
        .word 0      ; zp0 >= zp1, zp0 <  zp2, zp1 >= zp2 - not possible
        .word swap6  ; zp0 >= zp1, zp0 >= zp2, zp1 <  zp2
        .word swap7  ; zp0 >= zp1, zp0 >= zp2, zp1 >= zp2

swap0:
        reorder zp0, zp1, zp2; no-op
        RTS

swap1:
        reorder zp0, zp2, zp1
        RTS

swap3:
        reorder zp1, zp2, zp0
        RTS

swap4:
        reorder zp1, zp0, zp2
        RTS

swap6:
        reorder zp2, zp0, zp1
        RTS

swap7:
        reorder zp2, zp1, zp0
        RTS

6502.org

Mini-challenge: Sort three bytes

Mini-challenge: Sort three bytes

Re: Mini-challenge: Sort three bytes

Re: Mini-challenge: Sort three bytes

Re: Mini-challenge: Sort three bytes

Re: Mini-challenge: Sort three bytes

Re: Mini-challenge: Sort three bytes

Re: Mini-challenge: Sort three bytes

Re: Mini-challenge: Sort three bytes

Re: Mini-challenge: Sort three bytes

Re: Mini-challenge: Sort three bytes

Re: Mini-challenge: Sort three bytes

Re: Mini-challenge: Sort three bytes

Re: Mini-challenge: Sort three bytes

Re: Mini-challenge: Sort three bytes

Re: Mini-challenge: Sort three bytes