6502.org Forum  Projects  Code  Documents  Tools  Forum
It is currently Wed May 15, 2024 5:30 pm

All times are UTC




Post new topic Reply to topic  [ 23 posts ]  Go to page 1, 2  Next
Author Message
PostPosted: Mon Oct 09, 2023 8:25 am 
Offline

Joined: Thu Jun 17, 2021 7:53 am
Posts: 34
Anyone up for a fun little challenge?

Suppose there are three bytes in zero page: z0, z1, z2.
The result should be the same three bytes sorted in place from smallest to largest: z0 <= z1 <= z2.

Two categories: smallest code, and fastest code!


Top
 Profile  
Reply with quote  
PostPosted: Mon Oct 09, 2023 10:55 am 
Offline

Joined: Mon May 21, 2018 8:09 pm
Posts: 1462
Let's start with a straightforward approach, then:
Code:
SortThree:
  LDA z0
  LDX z1
  LDY z2

  CMP z1
  BCS :+
  STX z0
  TAX
  LDA z0

: CMP z2
  BCS :+
  STA z2
  STY z0
  TYA

: CPX z2
  BCS :+
  STX z2
  STY z1
  RTS

: STX z1
  RTS
The fastest approach on average is probably a decision tree, loading all the values into registers (as above), then taking the three comparisons in sequence to produce eight leaves, and just perform up to three stores in each leaf. That would be a lot bigger than the above, however. The tricky part would be making the above smaller.


Top
 Profile  
Reply with quote  
PostPosted: Mon Oct 09, 2023 12:06 pm 
Offline

Joined: Fri Jul 09, 2021 10:12 pm
Posts: 741
My first thought was to CMP pairs and ROL the carry bits into the bottom of the accumulator, then figure out what to do. But that's probably slow!

Based on Chromatix's code, I made some changes that make it a bit shorter, I haven't thought through the speed implications though:

Code:
SortThree:
  LDA z0
  LDX z1

  CMP z1
  BCC :+
  STX z0
  TAX

: CPX z2
  BCC :+

  LDA z2
  STX z2
  TAX

  CPX z0
  BCS :+

  LDX z0
  STA z0

: STX z1
  RTS


Edit: And here's a first pass speed-optimised version - just writing out all the comparisons and swaps that are required, without attempting to share any code between branches. It may be possible to reduce the time taken by changing the order of comparisons in some cases. By my count the worst case here is 33 cycles until the RTS, not including it.

Code:
SortThree:
    LDA z0                 ; 33
    CMP z1                 ; 30
    BCC z0_lt_z1           ; 27 / 25

    ; z1 < z0
    CMP z2                 ; 25
    BCC z1_lt_z0_lt_z2     ; 22 / 12

    ; z1 < z0 and z2 < z0
    LDX z2                 ; 20
    STA z2                 ; 17
    CPX z1                 ; 14
    BCC z2_lt_z1_lt_z0     ; 11 / 6

    ; z1 < z2 < z0
    LDA z1                 ; 9
    STA z0                 ; 6
    STX z1                 ; 3
    RTS

z2_lt_z1_lt_z0:
    STX z0                 ; 3
    RTS

z1_lt_z0_lt_z2:
    LDX z1                 ; 9
    STX z0                 ; 6
    STA z1                 ; 3
    RTS

z0_lt_z1:
    CMP z2                 ; 22
    BCS z2_lt_z0_lt_z1     ; 19 / 18

    ; z0 < z1 and z0 < z2
    LDA z1                 ; 17
    CMP z2                 ; 14
    BCC return             ; 11 / 3

    LDX z2                 ; 9
    STX z1                 ; 6
    STA z2                 ; 3
return:
    RTS

z2_lt_z0_lt_z1:
    LDX z1                 ; 15
    LDY z2                 ; 12
    STY z0                 ; 9
    STA z1                 ; 6
    STX z2                 ; 3
    RTS


Top
 Profile  
Reply with quote  
PostPosted: Mon Oct 09, 2023 1:11 pm 
Offline

Joined: Thu Jun 17, 2021 7:53 am
Posts: 34
This is Chromatix version but with fixes added:

Code:
; 38 bytes, 44.61 cycles average
SortThree
  LDA z0
  LDX z1
  LDY z2

  CMP z1
  BCS +         ; branch if z0 >= z1
  STX z0
  STA z1    ; TOBY: added
  TAX
  LDA z0

+ CMP z2
  BCS +         ; branch if z0 >= z2
  STA z2
  STY z0
  TYA
  LDY z2    ; TOBY: added

+ CPX z2
  BCS +         ; branch if z1 >= z2
  STX z2
  STY z1
+           ; TOBY: added
  RTS

 ;STX z1   ; TOBY: removed
 ; RTS

This is 38 bytes, 44.61 cycles average.

gfoot: your versions may not be correct without these fixes.


Top
 Profile  
Reply with quote  
PostPosted: Mon Oct 09, 2023 1:28 pm 
Offline

Joined: Fri Jul 09, 2021 10:12 pm
Posts: 741
TobyLobster wrote:
gfoot: your versions may not be correct without these fixes.

I tested my first one in Owlet: https://bbcmic.ro/#%7B%22v%22%3A1%2C%22 ... C%2C%22%7D

But didn't test the second one.

Edit: enhanced test framework that measures and times each situation - applied to my second (speed) one: https://bbcmic.ro/#%7B%22v%22%3A1%2C%22 ... PROC%22%7D


Top
 Profile  
Reply with quote  
PostPosted: Mon Oct 09, 2023 2:15 pm 
Offline

Joined: Sun Apr 26, 2020 3:08 am
Posts: 357
gfoot wrote:
My first thought was to CMP pairs and ROL the carry bits into the bottom of the accumulator, then figure out what to do. But that's probably slow!

Based on Chromatix's code, I made some changes that make it a bit shorter, I haven't thought through the speed implications though:

Code:
SortThree:
  LDA z0
  LDX z1

  CMP z1
  BCC :+
  STX z0
  TAX

: CPX z2
  BCC :+

  LDA z2
  STX z2
  TAX

  CPX z0
  BCS :+

  LDX z0
  STA z0

: STX z1
  RTS



The BCC :+ could go straight to the RTS to save a couple cycles. The only difference between this one and the optimized speed version that I can see is the two TAX's for a total of 4 cycles of savings thru the routine.

This routine has the added advantage that it preserves the Y-reg. Which means it can be used to seed the 3 bytes from a table or input register of some kind.

Very nice.


Top
 Profile  
Reply with quote  
PostPosted: Mon Oct 09, 2023 2:27 pm 
Offline

Joined: Thu Jun 17, 2021 7:53 am
Posts: 34
gfoot wrote:
TobyLobster wrote:
gfoot: your versions may not be correct without these fixes.

I tested my first one in Owlet


My mistake: I've run it though my own tests here and your code is correct. I have C code that runs through all possible input values doing the cycle counting.
I make your first routine 31 bytes long and ~39 cycles average, the second is 63 bytes long ~34 cycles average (including the final RTS, or ~28 without).
Good stuff.


Top
 Profile  
Reply with quote  
PostPosted: Mon Oct 09, 2023 2:50 pm 
Offline

Joined: Thu Jun 17, 2021 7:53 am
Posts: 34
IamRob wrote:
The BCC :+ could go straight to the RTS to save a couple cycles.


I think the final ": STX z1" seems to be needed, e.g. when [z0,z1,z2] = [1,2,0].


Top
 Profile  
Reply with quote  
PostPosted: Mon Oct 09, 2023 3:19 pm 
Offline

Joined: Thu Jun 17, 2021 7:53 am
Posts: 34
I can do a smaller 29 byte version (which is much slower though) using a different approach, a simple sorting network:

Code:
; 29 bytes, 99.35 cycles average
.sort3
    ldx #0
    ldy #1
    jsr .swap
    iny
    jsr .swap
    inx
.swap
    lda z0,y
    cmp z0,x
    bcs .done
    pha
    lda z0,x
    sta z0,y
    pla
    sta z0,x
.done
    rts
It assumes z0,z1,z2 are adjacent in memory.


Top
 Profile  
Reply with quote  
PostPosted: Mon Oct 09, 2023 3:35 pm 
Offline

Joined: Fri Jul 09, 2021 10:12 pm
Posts: 741
Oh that's very neat.

I got sidetracked making my VIA-based timing code as accurate as possible, timing down to the individual cycle even though the VIAs only run at half the CPU clock speed :)


Top
 Profile  
Reply with quote  
PostPosted: Mon Oct 09, 2023 4:01 pm 
Offline

Joined: Fri Jul 09, 2021 10:12 pm
Posts: 741
Riffing on yours tobylobster - 22 bytes, 111 cycles on average including the RTS:

Code:
.SortThree
    jsr s3
.s3
    ldx #z0
    jsr swap
    inx
.swap
    lda 0,X
    cmp 1,X
    bcc done
    ldy 1,X
    sty 0,X
    sta 1,X
.done
    rts


Top
 Profile  
Reply with quote  
PostPosted: Mon Oct 09, 2023 4:16 pm 
Offline

Joined: Sun Nov 08, 2009 1:56 am
Posts: 388
Location: Minnesota
Not the world's fastest code, but only 20 bytes. Untested and no time estimate.

Code:
swap:
   ldx #1

next:
   lda z1,x
   cmp z0,x
   bcs noswap
   ldy z0,x
   sta z0,x
   sty z1,x
   bcc swap

noswap:
   dex
   beq next
   rts

; worst case: z0 > z1 > z2
; 1: z1 > z2; swapped
; 2: z1 < z2; no swap; z0 > z1; swapped
: 3: z1 < z2; no swap; z0 < z1; no swap    


Incidentally, I compared the way I did in case zx == zy, to prevent unnecessary swaps of equal values.


Top
 Profile  
Reply with quote  
PostPosted: Mon Oct 09, 2023 4:28 pm 
Offline

Joined: Fri Jul 09, 2021 10:12 pm
Posts: 741
teamtempest, nice, I just did roughly the same thing - it was 20 bytes, 90 cycles.

Edit: but this is even better, 18 bytes and 84 cycles - note the entry point is not at the start:

Code:
.swap
    ldy z0,X
    sty z1,X
    sta z0,X
.SortThree
    ldx #1
.cmp
    lda z1,X
    cmp z0,X
    bcc swap
.noswap
    dex
    bpl cmp
    rts


Top
 Profile  
Reply with quote  
PostPosted: Mon Oct 09, 2023 7:12 pm 
Offline

Joined: Thu Jun 17, 2021 7:53 am
Posts: 34
Ooh I didn't think it could get that small. Very nice.


Top
 Profile  
Reply with quote  
PostPosted: Tue Oct 10, 2023 6:22 pm 
Offline
User avatar

Joined: Fri Feb 17, 2023 11:59 pm
Posts: 163
Location: Lviv, Ukraine
This thread totally nerd-sniped me. I decided to go for time efficiency and implement a solution similar to Chromatix's idea of a decision-based algorithm that picks one of 8 cases (2 of which are invalid).
I was tempted to do some self-modifying code magic (opcodes of "STY/STA/STX zp" and "LDY/LDA/LDX zp" are dangerously close to each other, and they all differ only with 2 lowest bits! :lol:) - but I didn't have guts to make any use of that...

Here's what I ended up with.
NOTE: This is not tested. I did some very basic sanity checks with sim65 in order to count clock cycles. But this code should give you a general idea.

Code:
; O(1) time sort of three values from zero page
; 48 cycles in most cases (excluding RTS); 33 cycles if input is already sorted
; - Requires 65C02 for JMP (abs,X)
; - Not very memory-efficient
; - Requires extra cycle if `table` crosses page boundary
; - swap2..swap7 can be optimized even further by not using reorder macro (which sometimes does redundant writes)
; - zp0..zp2 can be located anywhere

; ZP addresses of inputs
zp0 = 253
zp1 = 254
zp2 = 255

.macro  reorder aa, bb, cc
        LDA zp0
        ; Y is already loaded with zp1
        LDX zp2
        STA aa
        STY bb
        STX cc
.endmacro

sort:
        LDA #0
        LDY zp0
        CPY zp1
        ROL             ; 1 if zp0 >= zp1
        CPY zp2
        ROL             ; 1 if zp0 >= zp2
        LDY zp1
        CPY zp2
        ROL             ; 1 if zp1 >= zp2
        ASL

        TAX
        JMP (table, x)  ; (jsr, rts)

table:
        .word swap0  ; zp0 <  zp1, zp1 <  zp2, zp0 <  zp2
        .word 0      ; zp0 <  zp1, zp1 <  zp2, zp0 >= zp2 - not possible
        .word swap2  ; zp0 <  zp1, zp1 >= zp2, zp0 <  zp2
        .word swap3  ; zp0 <  zp1, zp1 >= zp2, zp0 >= zp2
        .word swap4  ; zp0 >= zp1, zp1 <  zp2, zp0 <  zp2
        .word swap5  ; zp0 >= zp1, zp1 <  zp2, zp0 >= zp2
        .word 0      ; zp0 >= zp1, zp1 >= zp2, zp0 <  zp2 - not possible
        .word swap7  ; zp0 >= zp1, zp1 >= zp2, zp0 >= zp2

swap0:
        ; Already sorted (-15 cycles)
        RTS

swap2:
        reorder zp0, zp2, zp1
        RTS

swap3:
        reorder zp1, zp2, zp0
        RTS

swap4:
        reorder zp1, zp0, zp2
        RTS

swap5:
        reorder zp2, zp0, zp1
        RTS

swap7:
        reorder zp2, zp1, zp0
        RTS


EDIT: Turns out `reorder` macro can be easily improved:
EDIT 2: This updated macro has a bug. See TobyLobster's next post with a fix.
Code:
.macro  reorder aa, bb, cc
    .if aa != zp0  ; save 6 cycles if already sorted
        LDA zp0
        STA aa
    .endif
    .if bb != zp1  ; save 3 cycles if already sorted
        ; Y is already loaded with zp1
        STY bb
    .endif
    .if cc != zp2  ; save 6 cycles if already sorted
        LDA zp2
        STA cc
    .endif
.endmacro


This saves 6 cycles if `zp0` or `zp2` are already sorted, and 3 cycles if `zp1` is already sorted - up to 15 saved cycles total.

EDIT 3: Here's an updated table & functions with correct equations (thanks TobyLobster!)

Code:
table:
        .word swap0  ; zp0 <  zp1, zp0 <  zp2, zp1 <  zp2
        .word swap1  ; zp0 <  zp1, zp0 <  zp2, zp1 >= zp2
        .word 0      ; zp0 <  zp1, zp0 >= zp2, zp1 <  zp2 - not possible
        .word swap3  ; zp0 <  zp1, zp0 >= zp2, zp1 >= zp2
        .word swap4  ; zp0 >= zp1, zp0 <  zp2, zp1 <  zp2
        .word 0      ; zp0 >= zp1, zp0 <  zp2, zp1 >= zp2 - not possible
        .word swap6  ; zp0 >= zp1, zp0 >= zp2, zp1 <  zp2
        .word swap7  ; zp0 >= zp1, zp0 >= zp2, zp1 >= zp2

swap0:
        reorder zp0, zp1, zp2; no-op
        RTS

swap1:
        reorder zp0, zp2, zp1
        RTS

swap3:
        reorder zp1, zp2, zp0
        RTS

swap4:
        reorder zp1, zp0, zp2
        RTS

swap6:
        reorder zp2, zp0, zp1
        RTS

swap7:
        reorder zp2, zp1, zp0
        RTS

_________________
/Andrew

deck65 - 6502 slab with screen and keyboard | ПК-88 - SBC based on KM1810VM88 (Ukrainian i8088 clone) | leo80 - simple Z80 SBC
nice65 - 6502 assembly linter | My parts, footprints & 3D models for KiCad/FreeCAD


Last edited by and3rson on Tue Oct 10, 2023 9:04 pm, edited 6 times in total.

Top
 Profile  
Reply with quote  
Display posts from previous:  Sort by  
Post new topic Reply to topic  [ 23 posts ]  Go to page 1, 2  Next

All times are UTC


Who is online

Users browsing this forum: No registered users and 8 guests


You cannot post new topics in this forum
You cannot reply to topics in this forum
You cannot edit your posts in this forum
You cannot delete your posts in this forum
You cannot post attachments in this forum

Search for:
Jump to: