Mini-challenge: Sort three bytes
-
TobyLobster
- Posts: 37
- Joined: 17 Jun 2021
Mini-challenge: Sort three bytes
Anyone up for a fun little challenge?
Suppose there are three bytes in zero page: z0, z1, z2.
The result should be the same three bytes sorted in place from smallest to largest: z0 <= z1 <= z2.
Two categories: smallest code, and fastest code!
Suppose there are three bytes in zero page: z0, z1, z2.
The result should be the same three bytes sorted in place from smallest to largest: z0 <= z1 <= z2.
Two categories: smallest code, and fastest code!
Re: Mini-challenge: Sort three bytes
Let's start with a straightforward approach, then: The fastest approach on average is probably a decision tree, loading all the values into registers (as above), then taking the three comparisons in sequence to produce eight leaves, and just perform up to three stores in each leaf. That would be a lot bigger than the above, however. The tricky part would be making the above smaller.
Code: Select all
SortThree:
LDA z0
LDX z1
LDY z2
CMP z1
BCS :+
STX z0
TAX
LDA z0
: CMP z2
BCS :+
STA z2
STY z0
TYA
: CPX z2
BCS :+
STX z2
STY z1
RTS
: STX z1
RTS
Re: Mini-challenge: Sort three bytes
My first thought was to CMP pairs and ROL the carry bits into the bottom of the accumulator, then figure out what to do. But that's probably slow!
Based on Chromatix's code, I made some changes that make it a bit shorter, I haven't thought through the speed implications though:
Edit: And here's a first pass speed-optimised version - just writing out all the comparisons and swaps that are required, without attempting to share any code between branches. It may be possible to reduce the time taken by changing the order of comparisons in some cases. By my count the worst case here is 33 cycles until the RTS, not including it.
Based on Chromatix's code, I made some changes that make it a bit shorter, I haven't thought through the speed implications though:
Code: Select all
SortThree:
LDA z0
LDX z1
CMP z1
BCC :+
STX z0
TAX
: CPX z2
BCC :+
LDA z2
STX z2
TAX
CPX z0
BCS :+
LDX z0
STA z0
: STX z1
RTS
Code: Select all
SortThree:
LDA z0 ; 33
CMP z1 ; 30
BCC z0_lt_z1 ; 27 / 25
; z1 < z0
CMP z2 ; 25
BCC z1_lt_z0_lt_z2 ; 22 / 12
; z1 < z0 and z2 < z0
LDX z2 ; 20
STA z2 ; 17
CPX z1 ; 14
BCC z2_lt_z1_lt_z0 ; 11 / 6
; z1 < z2 < z0
LDA z1 ; 9
STA z0 ; 6
STX z1 ; 3
RTS
z2_lt_z1_lt_z0:
STX z0 ; 3
RTS
z1_lt_z0_lt_z2:
LDX z1 ; 9
STX z0 ; 6
STA z1 ; 3
RTS
z0_lt_z1:
CMP z2 ; 22
BCS z2_lt_z0_lt_z1 ; 19 / 18
; z0 < z1 and z0 < z2
LDA z1 ; 17
CMP z2 ; 14
BCC return ; 11 / 3
LDX z2 ; 9
STX z1 ; 6
STA z2 ; 3
return:
RTS
z2_lt_z0_lt_z1:
LDX z1 ; 15
LDY z2 ; 12
STY z0 ; 9
STA z1 ; 6
STX z2 ; 3
RTS
-
TobyLobster
- Posts: 37
- Joined: 17 Jun 2021
Re: Mini-challenge: Sort three bytes
This is Chromatix version but with fixes added:
This is 38 bytes, 44.61 cycles average.
gfoot: your versions may not be correct without these fixes.
Code: Select all
; 38 bytes, 44.61 cycles average
SortThree
LDA z0
LDX z1
LDY z2
CMP z1
BCS + ; branch if z0 >= z1
STX z0
STA z1 ; TOBY: added
TAX
LDA z0
+ CMP z2
BCS + ; branch if z0 >= z2
STA z2
STY z0
TYA
LDY z2 ; TOBY: added
+ CPX z2
BCS + ; branch if z1 >= z2
STX z2
STY z1
+ ; TOBY: added
RTS
;STX z1 ; TOBY: removed
; RTS
gfoot: your versions may not be correct without these fixes.
Re: Mini-challenge: Sort three bytes
TobyLobster wrote:
gfoot: your versions may not be correct without these fixes.
But didn't test the second one.
Edit: enhanced test framework that measures and times each situation - applied to my second (speed) one: https://bbcmic.ro/#%7B%22v%22%3A1%2C%22 ... PROC%22%7D
Re: Mini-challenge: Sort three bytes
gfoot wrote:
My first thought was to CMP pairs and ROL the carry bits into the bottom of the accumulator, then figure out what to do. But that's probably slow!
Based on Chromatix's code, I made some changes that make it a bit shorter, I haven't thought through the speed implications though:
Based on Chromatix's code, I made some changes that make it a bit shorter, I haven't thought through the speed implications though:
Code: Select all
SortThree:
LDA z0
LDX z1
CMP z1
BCC :+
STX z0
TAX
: CPX z2
BCC :+
LDA z2
STX z2
TAX
CPX z0
BCS :+
LDX z0
STA z0
: STX z1
RTS
The BCC :+ could go straight to the RTS to save a couple cycles. The only difference between this one and the optimized speed version that I can see is the two TAX's for a total of 4 cycles of savings thru the routine.
This routine has the added advantage that it preserves the Y-reg. Which means it can be used to seed the 3 bytes from a table or input register of some kind.
Very nice.
-
TobyLobster
- Posts: 37
- Joined: 17 Jun 2021
Re: Mini-challenge: Sort three bytes
gfoot wrote:
TobyLobster wrote:
gfoot: your versions may not be correct without these fixes.
I make your first routine 31 bytes long and ~39 cycles average, the second is 63 bytes long ~34 cycles average (including the final RTS, or ~28 without).
Good stuff.
-
TobyLobster
- Posts: 37
- Joined: 17 Jun 2021
Re: Mini-challenge: Sort three bytes
IamRob wrote:
The BCC :+ could go straight to the RTS to save a couple cycles.
-
TobyLobster
- Posts: 37
- Joined: 17 Jun 2021
Re: Mini-challenge: Sort three bytes
I can do a smaller 29 byte version (which is much slower though) using a different approach, a simple sorting network:
It assumes z0,z1,z2 are adjacent in memory.
Code: Select all
; 29 bytes, 99.35 cycles average
.sort3
ldx #0
ldy #1
jsr .swap
iny
jsr .swap
inx
.swap
lda z0,y
cmp z0,x
bcs .done
pha
lda z0,x
sta z0,y
pla
sta z0,x
.done
rtsRe: Mini-challenge: Sort three bytes
Oh that's very neat.
I got sidetracked making my VIA-based timing code as accurate as possible, timing down to the individual cycle even though the VIAs only run at half the CPU clock speed
I got sidetracked making my VIA-based timing code as accurate as possible, timing down to the individual cycle even though the VIAs only run at half the CPU clock speed
Re: Mini-challenge: Sort three bytes
Riffing on yours tobylobster - 22 bytes, 111 cycles on average including the RTS:
Code: Select all
.SortThree
jsr s3
.s3
ldx #z0
jsr swap
inx
.swap
lda 0,X
cmp 1,X
bcc done
ldy 1,X
sty 0,X
sta 1,X
.done
rts
-
teamtempest
- Posts: 443
- Joined: 08 Nov 2009
- Location: Minnesota
- Contact:
Re: Mini-challenge: Sort three bytes
Not the world's fastest code, but only 20 bytes. Untested and no time estimate.
Incidentally, I compared the way I did in case zx == zy, to prevent unnecessary swaps of equal values.
Code: Select all
swap:
ldx #1
next:
lda z1,x
cmp z0,x
bcs noswap
ldy z0,x
sta z0,x
sty z1,x
bcc swap
noswap:
dex
beq next
rts
; worst case: z0 > z1 > z2
; 1: z1 > z2; swapped
; 2: z1 < z2; no swap; z0 > z1; swapped
: 3: z1 < z2; no swap; z0 < z1; no swap
Re: Mini-challenge: Sort three bytes
teamtempest, nice, I just did roughly the same thing - it was 20 bytes, 90 cycles.
Edit: but this is even better, 18 bytes and 84 cycles - note the entry point is not at the start:
Edit: but this is even better, 18 bytes and 84 cycles - note the entry point is not at the start:
Code: Select all
.swap
ldy z0,X
sty z1,X
sta z0,X
.SortThree
ldx #1
.cmp
lda z1,X
cmp z0,X
bcc swap
.noswap
dex
bpl cmp
rts
-
TobyLobster
- Posts: 37
- Joined: 17 Jun 2021
Re: Mini-challenge: Sort three bytes
Ooh I didn't think it could get that small. Very nice.
Re: Mini-challenge: Sort three bytes
This thread totally nerd-sniped me. I decided to go for time efficiency and implement a solution similar to Chromatix's idea of a decision-based algorithm that picks one of 8 cases (2 of which are invalid).
I was tempted to do some self-modifying code magic (opcodes of "STY/STA/STX zp" and "LDY/LDA/LDX zp" are dangerously close to each other, and they all differ only with 2 lowest bits!
) - but I didn't have guts to make any use of that...
Here's what I ended up with.
NOTE: This is not tested. I did some very basic sanity checks with sim65 in order to count clock cycles. But this code should give you a general idea.
EDIT: Turns out `reorder` macro can be easily improved:
EDIT 2: This updated macro has a bug. See TobyLobster's next post with a fix.
This saves 6 cycles if `zp0` or `zp2` are already sorted, and 3 cycles if `zp1` is already sorted - up to 15 saved cycles total.
EDIT 3: Here's an updated table & functions with correct equations (thanks TobyLobster!)
I was tempted to do some self-modifying code magic (opcodes of "STY/STA/STX zp" and "LDY/LDA/LDX zp" are dangerously close to each other, and they all differ only with 2 lowest bits!
Here's what I ended up with.
NOTE: This is not tested. I did some very basic sanity checks with sim65 in order to count clock cycles. But this code should give you a general idea.
Code: Select all
; O(1) time sort of three values from zero page
; 48 cycles in most cases (excluding RTS); 33 cycles if input is already sorted
; - Requires 65C02 for JMP (abs,X)
; - Not very memory-efficient
; - Requires extra cycle if `table` crosses page boundary
; - swap2..swap7 can be optimized even further by not using reorder macro (which sometimes does redundant writes)
; - zp0..zp2 can be located anywhere
; ZP addresses of inputs
zp0 = 253
zp1 = 254
zp2 = 255
.macro reorder aa, bb, cc
LDA zp0
; Y is already loaded with zp1
LDX zp2
STA aa
STY bb
STX cc
.endmacro
sort:
LDA #0
LDY zp0
CPY zp1
ROL ; 1 if zp0 >= zp1
CPY zp2
ROL ; 1 if zp0 >= zp2
LDY zp1
CPY zp2
ROL ; 1 if zp1 >= zp2
ASL
TAX
JMP (table, x) ; (jsr, rts)
table:
.word swap0 ; zp0 < zp1, zp1 < zp2, zp0 < zp2
.word 0 ; zp0 < zp1, zp1 < zp2, zp0 >= zp2 - not possible
.word swap2 ; zp0 < zp1, zp1 >= zp2, zp0 < zp2
.word swap3 ; zp0 < zp1, zp1 >= zp2, zp0 >= zp2
.word swap4 ; zp0 >= zp1, zp1 < zp2, zp0 < zp2
.word swap5 ; zp0 >= zp1, zp1 < zp2, zp0 >= zp2
.word 0 ; zp0 >= zp1, zp1 >= zp2, zp0 < zp2 - not possible
.word swap7 ; zp0 >= zp1, zp1 >= zp2, zp0 >= zp2
swap0:
; Already sorted (-15 cycles)
RTS
swap2:
reorder zp0, zp2, zp1
RTS
swap3:
reorder zp1, zp2, zp0
RTS
swap4:
reorder zp1, zp0, zp2
RTS
swap5:
reorder zp2, zp0, zp1
RTS
swap7:
reorder zp2, zp1, zp0
RTS
EDIT 2: This updated macro has a bug. See TobyLobster's next post with a fix.
Code: Select all
.macro reorder aa, bb, cc
.if aa != zp0 ; save 6 cycles if already sorted
LDA zp0
STA aa
.endif
.if bb != zp1 ; save 3 cycles if already sorted
; Y is already loaded with zp1
STY bb
.endif
.if cc != zp2 ; save 6 cycles if already sorted
LDA zp2
STA cc
.endif
.endmacro
EDIT 3: Here's an updated table & functions with correct equations (thanks TobyLobster!)
Code: Select all
table:
.word swap0 ; zp0 < zp1, zp0 < zp2, zp1 < zp2
.word swap1 ; zp0 < zp1, zp0 < zp2, zp1 >= zp2
.word 0 ; zp0 < zp1, zp0 >= zp2, zp1 < zp2 - not possible
.word swap3 ; zp0 < zp1, zp0 >= zp2, zp1 >= zp2
.word swap4 ; zp0 >= zp1, zp0 < zp2, zp1 < zp2
.word 0 ; zp0 >= zp1, zp0 < zp2, zp1 >= zp2 - not possible
.word swap6 ; zp0 >= zp1, zp0 >= zp2, zp1 < zp2
.word swap7 ; zp0 >= zp1, zp0 >= zp2, zp1 >= zp2
swap0:
reorder zp0, zp1, zp2; no-op
RTS
swap1:
reorder zp0, zp2, zp1
RTS
swap3:
reorder zp1, zp2, zp0
RTS
swap4:
reorder zp1, zp0, zp2
RTS
swap6:
reorder zp2, zp0, zp1
RTS
swap7:
reorder zp2, zp1, zp0
RTS
Last edited by and3rson on Tue Oct 10, 2023 9:04 pm, edited 6 times in total.
/Andrew
deck65 - 6502 slab with screen and keyboard | ПК-88 - SBC based on KM1810VM88 (Ukrainian i8088 clone) | leo80 - simple Z80 SBC
nice65 - 6502 assembly linter | My parts, footprints & 3D models for KiCad/FreeCAD
deck65 - 6502 slab with screen and keyboard | ПК-88 - SBC based on KM1810VM88 (Ukrainian i8088 clone) | leo80 - simple Z80 SBC
nice65 - 6502 assembly linter | My parts, footprints & 3D models for KiCad/FreeCAD