I experimented with the first (small array) routine a bit and came up with a few optimizations. Here is the new routine:
Code:
; Sort
;
; 33 * N + 30 * R + 6 = 33 * N + 7686 cycles (add 12 cycles for JSR and RTS)
;
; N = number of elements to sort
; R = size of range = number of possible values an element can have = 256
INDEX = $FD
SORT LDA #0 ; clear counters
TAX
SORT1 STA TEMP,X
INX
BNE S1
DEY ; (A) see below
BEQ S3 ; branch if this is the last element
NOP ; (B) see below
SORT2 LDA (BUFFER),Y ; count how many of each value there are
TAX
DEC TEMP,X
DEY
BNE SORT2
SORT3 LDA (BUFFER),Y ; handle last element specially
TAX
DEC TEMP,X
LDX #0 ; write the values in numerical order
SEC
SORT4 TYA ; INDEX = INDEX - TEMP
SBC TEMP,X
STA INDEX
CPY INDEX
BEQ SORT6
TXA
SORT5 STA (BUFFER),Y
INY
CPY INDEX
BNE SORT5
SORT6 INX ; we enter here with Y = INDEX and the carry set
BNE SORT4
The cycle count (which I've measured) assumes that branches and LDA (BUFFER),Y do not cross a page boundaries, and that TEMP starts on a page boundary. The NOP at line (B) was added just to make the formula neat and can be deleted to save 2 cycles (except when N=1). I like to use a TYA at line (A) instead of DEY. You would then call the routine with Y=N-1, so that Y=$00 when N=1 and Y=$FF when N=256. I prefer that to using Y=$01 when N=1, Y=$FF when N=255, and Y=$00 when N=256.
As for the optimizations, in the second section, we can get rid of the CPY #$00 by handling the last element specially, which saves 2*N cycles. In the third section, we reduce the TEMP buffer to 256 bytes by comparing Y to a zero page location rather than to TEMP,X which means we don't have to put Y in A for the comparison. In a sense, we're taking the 257th byte of the buffer and moving it to the zero page. At SORT6, Y = INDEX and the carry is set, since we always get there by a CPY INDEX that is equal, so instead of adding the count (TEMP) to the index, we store the negative of count in TEMP and subtract TEMP from INDEX, i.e. INDEX=INDEX+count is the same as INDEX=INDEX-(-count). Note that we enter SORT4 the first time with Y=0, so INDEX = TEMP the first time through. Finally, rather than jumping back to the first CPY, we simply add another CPY which allows us to tighten the SORT5 loop.
One easy way to save a bunch of cycles is to change the first section to:
Code:
LDA #0
TAX
SORT1 STA TEMP,X
STA TEMP+$80,X
INX
BPL :1
which saves 640 cycles, and adds only a single instruction. You can unroll the loop a bit more and save an additional 320 cycles with:
Code:
LDA #0
LDX #$3F
SORT1 STA TEMP,X
STA TEMP+$40,X
STA TEMP+$80,X
STA TEMP+$C0,X
DEX
BPL SORT1
Which adds an additional 7 bytes to the routine, so you'll have to decide where the point of diminishing returns is.
The second (large array) routine has three bugs.
1. LDY #0 should be moved before the label COUNT, since Y is not initialize before then.
2. BUFFERLENGTH is not decremented properly.
3. BUFFERSTART (in the FILL loop) must be incremented every time you store, so it must come before the BCC F2
After fixing and tweaking it, this is what I ended up with (I haven't counted or measured cycles on this one, but the timing isn't going to be perfectly linear with all the high byte updating anyway):
Code:
; Sort
BUFH = $FB
SORT LDY #0 ; initialize counters
SORT1 LDA #0
STA TEMP,Y
LDA #1
STA TEMP+256,Y
INX
BNE S1
LDA BUFFERSTART+1 ; save hi byte of buffer pointer
STA BUFH
LDA BUFFERLENGTH ; count how many of each values there are
BEQ SORT2
INC BUFFERLENGTH+1
SORT2 LDA (BUFFERSTART),Y
TAX
INC TEMP,X
BNE SORT3
INC TEMP+256,X
SORT3 INY
BNE SORT4
INC BUFFERSTART+1
SORT4 DEC BUFFERLENGTH
BNE SORT2
DEC BUFFERLENGTH+1
BNE SORT2
LDA BUFH ; restore hi byte of buffer pointer
STA BUFFERSTART+1
LDX #0 ; store the values in numerical order
LDY #0
SORT5 LDA TEMP,X
BNE SORT6
DEC TEMP+256,X
BEQ SORT9
SORT6 TXA
SORT7 STA (BUFFERSTART,Y)
INY
BNE SORT8
INC BUFFERSTART+1
SORT8 DEC TEMP,X
BNE SORT7
DEC TEMP+256,X
BNE SORT7
SORT9 INX
BNE SORT5
Note that this uses one fewer zero page location. You can free up the zero page location used by BUFH by using PHA and PLA which only adds 1 cycle.
This routine is fairly tricky. It does decrement properly even though it may not look like it. It makes use of the fact that:
Code:
INC NUML
INC NUMH
LOOP
DEC NUML
BNE LOOP
DEC NUMH
BNE LOOP
loops NUM+1 times. If you print the value of NUM inside the loop, you'll see that it counts funny, but it does loop NUM+1 times. This works because
Code:
LOOP
DEC NUM
BNE LOOP
counts NUM times if NUM is non-zero, or 256 times if NUM is zero. By adding an INC NUM before the loop, it will count NUM+1 times if NUM was 0 to 254, or 256 times if NUM was 255, so it counts NUM+1 times in all cases.
In the SORT2 loop, we want to loop BUFFERLENGTH times, not BUFFERLENGTH+1 times, so we prepare BUFFERLENGTH, with the INCs, then decrement once.
Code:
INC BUFFERLENGTH
INC BUFFERLENGTH+1
DEC BUFFERLENGTH
BNE SKIP
DEC BUFFERLENGTH+1
SKIP
Moving the INC BUFFERLENGTH+1 to the end doesn't change the final result.
Code:
INC BUFFERLENGTH
DEC BUFFERLENGTH
BNE SKIP
DEC BUFFERLENGTH+1
SKIP INC BUFFERLENGTH+1
This is the same as:
Code:
LDA BUFFERLENGTH
BNE SKIP
DEC BUFFERLENGTH+1
SKIP INC BUFFERLENGTH+1
but BUFFERLENGTH+1 is incremented when BUFFERLENGTH is non-zero, and unchanged (incremented then decremented) when BUFFERLENGTH is zero, so we can change it to:
Code:
LDA BUFFERLENGTH
BEQ SKIP
INC BUFFERLENGTH+1
SKIP
The SORT5 loop is also very sneaky. We want to branch to SORT9 if the count is zero, and loop count times, not count+1 times. Using the previous optimization, we start with:
Code:
SORT5 LDA TEMP,X
BNE SKIP1
LDA TEMP+256,X
BEQ SORT9
SKIP1 LDA TEMP,X
BEQ SKIP2
INC TEMP+256,X
SKIP2
The BNE LOOP6 can be changed to branch to the INC since with already know that TEMP,X is non-zero.
Code:
SORT5 LDA TEMP,X
BNE SKIP1
LDA TEMP+256,X
BEQ LOOP9
LDA TEMP,X
BEQ SKIP2
SKIP1 INC TEMP+256,X
SKIP2
Now when we fall through to the second LDA TEMP,X, we already know that LDA TEMP,X is non-zero, so we can replace it and the BEQ SKIP2 with a BNE SKIP2
Code:
SORT5 LDA TEMP,X
BNE SKIP1
LDA TEMP+256,X
BEQ LOOP9
BNE SKIP2
SKIP1 INC TEMP+256,X
SKIP2
To get rid of the BNE SKIP2 we'll move the INC TEMP+256,X before the LDA TEMP,X and then undo the INC after the branch:
Code:
SORT5 INC TEMP+256,X
LDA TEMP,X
BNE SKIP
DEC TEMP+256,X
LDA TEMP+256,X
BEQ LOOP9
SKIP
The LDA TEMP,X is now unneccesary, since DEC TEMP+256,X affects the Z flag. The INC TEMP+256,X will be performed one time for each value of X, so we can eliminate the INC TEMP+256,X by initializing the TEMP+256 buffer to 1 rather than 0, leaving us with:
Code:
SORT5 LDA TEMP,X
BNE SORT6
DEC TEMP+256,X
BEQ LOOP9
SORT6
Finally, note that we're not using a CPY INDEX like we did in the small array routine. It might be possible to further optimize the large array routine by using a CPY INDEX to replace DEC TEMP,X at SORT8, since the CPY will be faster than the DEC.
If anyone has questions about any of this (hopefully my explanations aren't too terse), please feel free to post or drop me an e-mail.