6502.org Forum  Projects  Code  Documents  Tools  Forum
It is currently Thu May 16, 2024 6:59 pm

All times are UTC




Post new topic Reply to topic  [ 26 posts ]  Go to page 1, 2  Next
Author Message
 Post subject: JeffSort on the 6502
PostPosted: Thu Mar 25, 2004 3:23 pm 
Offline

Joined: Wed Oct 22, 2003 4:07 am
Posts: 51
Location: Norway
After looking at the sorting algorithms in the Source Code Repository I was inspired to make a contribution myself. I choose to make an implementation of the JeffSort algorithm. (Not this one but this one! :wink: )

It was done from memory before I found that page, so it might not be quite right.

On that page Jeff claims it's an O(1) algorithm, but as far as I can see it's really O(n)... but I'm not sure i remember how that O(x) stuff works so maybe I'm wrong... or maybe I've made a different sorting algorithm :)

Anyway, here is what I came up with:
Code:
;
;   JeffSort, an O(n) sort for small numbers, in 6502 assembly.
;
;
;   Input:    Pointer to buffer to be sorted in $FE.
;             Number of elements in buffer (n) in register Y.
;
;   Output:   Sorted buffer starting at adress pointed to by $FE.
;             Number of elements in buffer (n) in register Y and A.
;             X=0.
;      
;   This version support a maximum of 256 8bit values.
;   (Use Y=0 for 256 elements.)
;
;   Estimated execution time, not including jsr/rts:
;
;      9987 + (n * 42) cycles
;
;   That estimate may WAY off... It can be sped up a bit with loop
;   unrolling and stuff like that, but that is left as an exercise
;   for the reader :P
;
;
;   Code is Copyright 2004 Erik Lien.
;   Algorithm by Jeffrey Franklin.
;
;   License: Use as you wish, but at your own risk.
;

   !to "jeffsort.o"

buffer=$FE             ;pointer to n byte buffer with values to be sorted.
temp=$C100             ;257 byte working buffer.

*=$C000


Sort    lda #0      
        tax      
clear   sta temp,x     ;Clear first 256 bytes of working buffer.
        dex      
        bne clear   

count   dey            ;Count how many elements there is of each type.
        lda (buffer),y
        tax
        inc temp,x
        cpy #0
        bne count
   
        ldx #0
fill    tya
        cmp temp,x
        beq next
        txa
        sta (buffer),y
        iny
        bne fill
next    lda temp,x
        clc
        adc temp+1,x   ;Here is why temp has to be 257 bytes, not 256.
        sta temp+1,x   ;Easily fixed, but my fix added n*2 cycles, no good...
        inx
        bne fill

done    rts
It seems to do what it's suposed to from my testing, but it's not unlikely I've missed something. I'm sure you guys could think of lots of optimizations, so if you have any sugestions, I'm all ears.

Oh, and I'm probably way off on that cycle count, so if anybody could check out that I would be gratefull. I think each additional element should add the same amount of extra time (except when going from 255 to 256 elements, a a few cycles is saved there because of the 6502s lack of BRA opcode :D . Also, if there is 256 identical elements, then it will run a lot faster than when there is 256 varying elements because nothing gets written. That however only happens with 256 elements, not any fewer. I think, I'm probably wrong :P )



Edit: Decided to include the code from Jeffs page to show what the algorithm is suposed to do:
Code:
Algorithm: procedure JeffSort (var a: array of integer; lo, hi,
                    rangelo, rangehi: integer);
var
  JeffCount: array [rangelo..rangehi] of integer;
  i, j: integer;
begin
  for i:=rangelo to rangehi do
    JeffCount[i] := 0;

  for i:=lo to hi do
    inc (JeffCount[a[i]]);

  for i:=rangelo to rangehi do
    if (JeffCount[i] > 0) then
      for j := 1 to JeffCount[i] do
      begin
        a[lo] := i;
        inc (lo);
      end;
end;

 

Looking at that I realised that maybe I should have included rangehi and rangelo as parameters in my implementation to speed it up. And maybe I should do a version for signed numbers too...


Top
 Profile  
Reply with quote  
 Post subject:
PostPosted: Fri Mar 26, 2004 4:52 am 
Offline

Joined: Wed Sep 04, 2002 4:08 pm
Posts: 57
Location: Iowa
I haven't tried your sort algorithm, I'm only commenting on the O stuff. O(1) I believe would mean that the algorithm takes a fixed amount of time no matter how many elements there are (which, as you might imagine, nobody really talks about because it's silly). O(n) is linear, and O(n^2) is square, in other words, the amount of time an algorithm takes is porportional to the square of the number of elements. Insertion/Bubble/Selection sorts are O(n^2). Quick sort and Binary Tree sort are between O(n*log(n)) and O(n^2). AVL Tree is O(n*log(n)), which, in my understanding, is as good as sorts get. Therefore I am skeptical that your sort is O(n), which would beat O(n*log(n)), but I could be wrong.

Scott


Top
 Profile  
Reply with quote  
 Post subject:
PostPosted: Fri Mar 26, 2004 5:43 am 
Offline

Joined: Wed Oct 22, 2003 4:07 am
Posts: 51
Location: Norway
schidester wrote:
I haven't tried your sort algorithm, I'm only commenting on the O stuff. O(1) I believe would mean that the algorithm takes a fixed amount of time no matter how many elements there are (which, as you might imagine, nobody really talks about because it's silly). O(n) is linear, and O(n^2) is square, in other words, the amount of time an algorithm takes is porportional to the square of the number of elements. Insertion/Bubble/Selection sorts are O(n^2). Quick sort and Binary Tree sort are between O(n*log(n)) and O(n^2). AVL Tree is O(n*log(n)), which, in my understanding, is as good as sorts get. Therefore I am skeptical that your sort is O(n), which would beat O(n*log(n)), but I could be wrong.

Scott
Thank you for your reply. I've tested the execution time now, and after the large constant time part, it does increase linearly with the number of elements. (and the order of the elements doesn't matter; no worst case, always identical execution time for any given size)

But the way it achieves this could be called cheating. The algorithm doesn't actually do any sorting, it just counts how many elements there are of each value, and then it makes a new table with that many elements of each value.

So, as far as I can see, it's usefullness is very limited. You can't use it to sort a bunch of objects depending on a value in that object. (Example: trying to sort x,y,z vectors acording to z wouldn't work). In fact I can't really think of any situation where it could be used.

But if you encountered a situation where it could be used it could be a huge timesaver. The Combination Sort example in the Source Code Repository mentioned 2 second sort time for bubble sort, and 0.2 second for Combination Sort on an 255 element array. My code only uses ~0.02 seconds on a 255 element array on a 1Mhz C64. The Optimal Sort code used 0.0642 seconds, but that was on a 8Mhz device.

So, despite it's limited usefullness, it's still an interesting example because of it's speed.


Top
 Profile  
Reply with quote  
 Post subject:
PostPosted: Fri Mar 26, 2004 8:48 am 
Offline
User avatar

Joined: Tue Mar 02, 2004 8:55 am
Posts: 996
Location: Berkshire, UK
An interesting algorithm, but the complexity is misleading because depends more on the range of the values than the number of values to be sorted.

For example sorting 20 values in the range (0-255) takes 256 iterations to clear the counts, 20 iterations to count the value frequences and 276 iterations to regenerate the sorted data.

If the range was increased to word size (0-65535) then aside from the fact that we might not have enough memory to store the counts, the number of iterations increases to 65536, 20 and 65556.

In can see that under some very specific conditions (e.g. small range and small number of values) that this may be the optimal solution but it lacks the generality of the other sort algorithms.

_________________
Andrew Jacobs
6502 & PIC Stuff - http://www.obelisk.me.uk/
Cross-Platform 6502/65C02/65816 Macro Assembler - http://www.obelisk.me.uk/dev65/
Open Source Projects - https://github.com/andrew-jacobs


Top
 Profile  
Reply with quote  
 Post subject: Re: JeffSort on the 6502
PostPosted: Fri Mar 26, 2004 12:54 pm 
Offline

Joined: Thu Oct 23, 2003 9:07 am
Posts: 12
> After looking at the sorting algorithms in the Source Code Repository I
> was inspired to make a contribution myself. I choose to make an
> implementation of the JeffSort algorithm.

I'll throw in a few free optimisations for you:

ldx #0
fill ;tya
;cmp temp,x
cpy temp,x
beq next
;txa
;sta (buffer),y
stx (buffer),y
iny
bne fill

Saves a couple of cycles,
Sprow.


Top
 Profile  
Reply with quote  
 Post subject: Re: JeffSort on the 6502
PostPosted: Fri Mar 26, 2004 2:29 pm 
Offline

Joined: Wed Oct 22, 2003 4:07 am
Posts: 51
Location: Norway
Sprow wrote:
> After looking at the sorting algorithms in the Source Code Repository I
> was inspired to make a contribution myself. I choose to make an
> implementation of the JeffSort algorithm.

I'll throw in a few free optimisations for you:

ldx #0
fill ;tya
;cmp temp,x
cpy temp,x
beq next
;txa
;sta (buffer),y
stx (buffer),y
iny
bne fill

Saves a couple of cycles,
Sprow.

Thank you for the sugestion, but according to http://www.6502.org/tutorials/6502opcodes.htm the 6502 has neither a cpy $xxxx,Y nor a stx ($xx),y instruction as far as I can see...

Bitwise, thank you for you reply. I have some comments to what you said, but I'm a bit busy now, so I'll post it in an hour or two from now...


Top
 Profile  
Reply with quote  
 Post subject:
PostPosted: Fri Mar 26, 2004 5:13 pm 
Offline

Joined: Wed Oct 22, 2003 4:07 am
Posts: 51
Location: Norway
OK, now I don't really remember what I was going to say... :oops:

BitWise wrote:
An interesting algorithm, but the complexity is misleading because depends more on the range of the values than the number of values to be sorted.
That's not always true. If the number of values to be sorted are larger the their range, then it depends more on the number of values to be sorted than the range.

Quote:
For example sorting 20 values in the range (0-255) takes 256 iterations to clear the counts, 20 iterations to count the value frequences and 276 iterations to regenerate the sorted data.

If the range was increased to word size (0-65535) then aside from the fact that we might not have enough memory to store the counts, the number of iterations increases to 65536, 20 and 65556.
Yes, increasing the range to a full 16 bits would make the counting buffer too large. But what about sorting a larger number of 8 bit values? Sorting 64000 8 bit values would require 256 iterations to clear the counting buffer 64000 iterations to count and 64256 iterations to regenerate. I'll think I'll go an make new version for sorting more than 256 values now...

Quote:
In can see that under some very specific conditions (e.g. small range and small number of values that this may be the optimal solution but it lacks the generality of the other sort algorithms.
Yes, the range has to be small for this to be an optimal solution, but I strongly disagree with the 'small number of values' part. On the contrary, the algorithm is most optimal for large number of values. However I'm still unable to think of any situation where it could be usefull... But instead of trying to find a problem to the soultion, I thought I would share the solution, you never know, it might prove useful for somebody... :)


Top
 Profile  
Reply with quote  
 Post subject:
PostPosted: Fri Mar 26, 2004 6:05 pm 
Offline

Joined: Wed Oct 22, 2003 4:07 am
Posts: 51
Location: Norway
Here is a new version that can sort much larger arrays of 8bit values:
Code:
;
;   Jeffsort2, an O(n) sort for small numbers, in 6502 assembly.
;
;
;
;   Input:   Pointer to buffer to be sorted in $FE.
;      Length of buffer (n) in $FC
;
;   Output:   Sorted buffer, but $FE now points to the end of it.
;
;      
;   This version support up to several thousand 8bit values.
;
;   Estimated execution time, not including jsr/rts:
;
;      (not timed yet)
;
;
;   Code is Copyright 2004 Erik Lien.
;   Algorithm by Jeffrey Franklin.
;
;   License: Use as you wish, but at your own risk.
;

   !to "jeffsort2.o"

bufferstart =$FE       ;pointer to n byte buffer with values to be sorted.
bufferlength=$FC
currentpos  =$FA
temp=$C100         ;512 byte counting buffer.

*=$0400

Sort    lda #0
        tax
clear   sta temp,x   ;Clear first 512 bytes of counting buffer.      
        sta temp+256,x
        dex      
        bne clear   

        lda bufferstart
        sta currentpos
        lda bufferstart+1
        sta currentpos+1

count   lda (currentpos),y   ;Count how many elements there is of each type.
        tax
        inc temp,x
        bne c1      ;Did it overflow?
        inc temp+256,x   ;increase high byte if it did
c1      inc currentpos
        bne c2
        inc currentpos+1
c2      dec bufferlength
        bne count
        dec bufferlength+1
        bne count

        ldx #0
        ldy #0
fill    lda temp,x
        bne f1
        lda temp+256,x
        beq f2
f1      txa
        sta (bufferstart),y
        lda temp,x
        sec
        sbc #1
        sta temp,x
        lda temp+256,x
        sbc #0
        sta temp+256,x
        bcc f2      ;we are done
        inc bufferstart
        bne f1
        inc bufferstart+1
        bcs f1      ;use a bra if you got it... that came out wrong... :)
f2      inx
        bne fill   

done    rts
It seems to work, I'll do some more testing, comments are greatly appreciated.


Edit: Warning! This code is buggy, don't use it!


Last edited by Thowllly on Mon Apr 12, 2004 5:43 pm, edited 1 time in total.

Top
 Profile  
Reply with quote  
 Post subject: Re: JeffSort on the 6502
PostPosted: Fri Mar 26, 2004 7:22 pm 
Offline

Joined: Thu Oct 23, 2003 9:07 am
Posts: 12
> > Saves a couple of cycles
> Thank you for the sugestion, but according to 6502opcodes.htm the 6502 has neither a [b]cpy $xxxx,Y[/b] nor a [b]stx ($xx),y[/b] instruction as far as I can see...

My brain sort of said that was the case, and the assembler didn't complain when I typed them in in interactive mode, but I guess I should have got up off my chair and got the opcode chart out - you're right of course!
Sprow.


Top
 Profile  
Reply with quote  
 Post subject:
PostPosted: Sat Mar 27, 2004 1:18 am 
Offline
User avatar

Joined: Thu Mar 11, 2004 7:42 am
Posts: 362
I experimented with the first (small array) routine a bit and came up with a few optimizations. Here is the new routine:

Code:
; Sort
;
; 33 * N + 30 * R + 6 = 33 * N + 7686 cycles (add 12 cycles for JSR and RTS)
;
; N = number of elements to sort
; R = size of range = number of possible values an element can have = 256

INDEX =   $FD

SORT  LDA #0         ; clear counters
      TAX
SORT1 STA TEMP,X
      INX
      BNE S1

      DEY            ; (A) see below
      BEQ S3         ; branch if this is the last element
      NOP            ; (B) see below
SORT2 LDA (BUFFER),Y ; count how many of each value there are
      TAX
      DEC TEMP,X
      DEY
      BNE SORT2
SORT3 LDA (BUFFER),Y ; handle last element specially
      TAX
      DEC TEMP,X

      LDX #0         ; write the values in numerical order
      SEC
SORT4 TYA            ; INDEX = INDEX - TEMP
      SBC TEMP,X
      STA INDEX     
      CPY INDEX
      BEQ SORT6
      TXA
SORT5 STA (BUFFER),Y
      INY
      CPY INDEX
      BNE SORT5
SORT6 INX            ; we enter here with Y = INDEX and the carry set
      BNE SORT4


The cycle count (which I've measured) assumes that branches and LDA (BUFFER),Y do not cross a page boundaries, and that TEMP starts on a page boundary. The NOP at line (B) was added just to make the formula neat and can be deleted to save 2 cycles (except when N=1). I like to use a TYA at line (A) instead of DEY. You would then call the routine with Y=N-1, so that Y=$00 when N=1 and Y=$FF when N=256. I prefer that to using Y=$01 when N=1, Y=$FF when N=255, and Y=$00 when N=256.

As for the optimizations, in the second section, we can get rid of the CPY #$00 by handling the last element specially, which saves 2*N cycles. In the third section, we reduce the TEMP buffer to 256 bytes by comparing Y to a zero page location rather than to TEMP,X which means we don't have to put Y in A for the comparison. In a sense, we're taking the 257th byte of the buffer and moving it to the zero page. At SORT6, Y = INDEX and the carry is set, since we always get there by a CPY INDEX that is equal, so instead of adding the count (TEMP) to the index, we store the negative of count in TEMP and subtract TEMP from INDEX, i.e. INDEX=INDEX+count is the same as INDEX=INDEX-(-count). Note that we enter SORT4 the first time with Y=0, so INDEX = TEMP the first time through. Finally, rather than jumping back to the first CPY, we simply add another CPY which allows us to tighten the SORT5 loop.

One easy way to save a bunch of cycles is to change the first section to:

Code:
      LDA #0
      TAX
SORT1 STA TEMP,X
      STA TEMP+$80,X
      INX
      BPL :1


which saves 640 cycles, and adds only a single instruction. You can unroll the loop a bit more and save an additional 320 cycles with:

Code:
      LDA #0
      LDX #$3F
SORT1 STA TEMP,X
      STA TEMP+$40,X
      STA TEMP+$80,X
      STA TEMP+$C0,X
      DEX
      BPL SORT1


Which adds an additional 7 bytes to the routine, so you'll have to decide where the point of diminishing returns is.

The second (large array) routine has three bugs.

1. LDY #0 should be moved before the label COUNT, since Y is not initialize before then.
2. BUFFERLENGTH is not decremented properly.
3. BUFFERSTART (in the FILL loop) must be incremented every time you store, so it must come before the BCC F2

After fixing and tweaking it, this is what I ended up with (I haven't counted or measured cycles on this one, but the timing isn't going to be perfectly linear with all the high byte updating anyway):

Code:
; Sort

BUFH  =   $FB

SORT  LDY #0              ; initialize counters
SORT1 LDA #0
      STA TEMP,Y
      LDA #1
      STA TEMP+256,Y
      INX
      BNE S1

      LDA BUFFERSTART+1   ; save hi byte of buffer pointer
      STA BUFH

      LDA BUFFERLENGTH    ; count how many of each values there are
      BEQ SORT2
      INC BUFFERLENGTH+1
SORT2 LDA (BUFFERSTART),Y
      TAX
      INC TEMP,X
      BNE SORT3
      INC TEMP+256,X
SORT3 INY
      BNE SORT4
      INC BUFFERSTART+1
SORT4 DEC BUFFERLENGTH
      BNE SORT2
      DEC BUFFERLENGTH+1
      BNE SORT2

      LDA BUFH            ; restore hi byte of buffer pointer
      STA BUFFERSTART+1

      LDX #0              ; store the values in numerical order
      LDY #0
SORT5 LDA TEMP,X
      BNE SORT6
      DEC TEMP+256,X
      BEQ SORT9
SORT6 TXA
SORT7 STA (BUFFERSTART,Y)
      INY
      BNE SORT8
      INC BUFFERSTART+1
SORT8 DEC TEMP,X
      BNE SORT7
      DEC TEMP+256,X
      BNE SORT7
SORT9 INX
      BNE SORT5


Note that this uses one fewer zero page location. You can free up the zero page location used by BUFH by using PHA and PLA which only adds 1 cycle.

This routine is fairly tricky. It does decrement properly even though it may not look like it. It makes use of the fact that:

Code:
      INC NUML
      INC NUMH
LOOP
      DEC NUML
      BNE LOOP
      DEC NUMH
      BNE LOOP


loops NUM+1 times. If you print the value of NUM inside the loop, you'll see that it counts funny, but it does loop NUM+1 times. This works because

Code:
LOOP
      DEC NUM
      BNE LOOP


counts NUM times if NUM is non-zero, or 256 times if NUM is zero. By adding an INC NUM before the loop, it will count NUM+1 times if NUM was 0 to 254, or 256 times if NUM was 255, so it counts NUM+1 times in all cases.

In the SORT2 loop, we want to loop BUFFERLENGTH times, not BUFFERLENGTH+1 times, so we prepare BUFFERLENGTH, with the INCs, then decrement once.

Code:
      INC BUFFERLENGTH
      INC BUFFERLENGTH+1
      DEC BUFFERLENGTH
      BNE SKIP
      DEC BUFFERLENGTH+1
SKIP


Moving the INC BUFFERLENGTH+1 to the end doesn't change the final result.

Code:
      INC BUFFERLENGTH
      DEC BUFFERLENGTH
      BNE SKIP
      DEC BUFFERLENGTH+1
SKIP  INC BUFFERLENGTH+1


This is the same as:

Code:
      LDA BUFFERLENGTH
      BNE SKIP
      DEC BUFFERLENGTH+1
SKIP  INC BUFFERLENGTH+1


but BUFFERLENGTH+1 is incremented when BUFFERLENGTH is non-zero, and unchanged (incremented then decremented) when BUFFERLENGTH is zero, so we can change it to:

Code:
      LDA BUFFERLENGTH
      BEQ SKIP
      INC BUFFERLENGTH+1
SKIP


The SORT5 loop is also very sneaky. We want to branch to SORT9 if the count is zero, and loop count times, not count+1 times. Using the previous optimization, we start with:

Code:
SORT5 LDA TEMP,X
      BNE SKIP1
      LDA TEMP+256,X
      BEQ SORT9
SKIP1 LDA TEMP,X
      BEQ SKIP2
      INC TEMP+256,X
SKIP2


The BNE LOOP6 can be changed to branch to the INC since with already know that TEMP,X is non-zero.

Code:
SORT5 LDA TEMP,X
      BNE SKIP1
      LDA TEMP+256,X
      BEQ LOOP9
      LDA TEMP,X
      BEQ SKIP2
SKIP1 INC TEMP+256,X
SKIP2


Now when we fall through to the second LDA TEMP,X, we already know that LDA TEMP,X is non-zero, so we can replace it and the BEQ SKIP2 with a BNE SKIP2

Code:
SORT5 LDA TEMP,X
      BNE SKIP1
      LDA TEMP+256,X
      BEQ LOOP9
      BNE SKIP2
SKIP1 INC TEMP+256,X
SKIP2


To get rid of the BNE SKIP2 we'll move the INC TEMP+256,X before the LDA TEMP,X and then undo the INC after the branch:

Code:
SORT5 INC TEMP+256,X
      LDA TEMP,X
      BNE SKIP
      DEC TEMP+256,X
      LDA TEMP+256,X
      BEQ LOOP9
SKIP


The LDA TEMP,X is now unneccesary, since DEC TEMP+256,X affects the Z flag. The INC TEMP+256,X will be performed one time for each value of X, so we can eliminate the INC TEMP+256,X by initializing the TEMP+256 buffer to 1 rather than 0, leaving us with:

Code:
SORT5 LDA TEMP,X
      BNE SORT6
      DEC TEMP+256,X
      BEQ LOOP9
SORT6


Finally, note that we're not using a CPY INDEX like we did in the small array routine. It might be possible to further optimize the large array routine by using a CPY INDEX to replace DEC TEMP,X at SORT8, since the CPY will be faster than the DEC.

If anyone has questions about any of this (hopefully my explanations aren't too terse), please feel free to post or drop me an e-mail.


Top
 Profile  
Reply with quote  
 Post subject:
PostPosted: Sat Mar 27, 2004 5:20 am 
Offline

Joined: Wed Mar 24, 2004 6:32 pm
Posts: 59
Location: Bay Area, CA
I remember my algorythms class... ;)

Traditional sorting algorythms can sort any arbitrary set of operations using only swapping and a single comparison.

This is different, so it can be much more efficent. There's a whole load of algorythms that are designed to be more efficent over data where you have values.

Still, In the rather large stock tracking application that I work on in the Day Job, there's lots of sorting but none of the sorts could be replaced with that sort. It's also not particularly memory efficent. ;)


Top
 Profile  
Reply with quote  
 Post subject:
PostPosted: Sat Mar 27, 2004 4:56 pm 
Offline

Joined: Wed Oct 22, 2003 4:07 am
Posts: 51
Location: Norway
wow, great work with those optimizations dclxvi! There really isn't much to add after a great post like that...

Quote:
The second (large array) routine has three bugs.

1. LDY #0 should be moved before the label COUNT, since Y is not initialize before then.
2. BUFFERLENGTH is not decremented properly.
3. BUFFERSTART (in the FILL loop) must be incremented every time you store, so it must come before the BCC F2
Yeah, you're right. My test cases worked despite those bugs; in other words they were poorly choosen. I fixed those bugs, so now I hope it's actually working... It ended up looking a bit like your code, except not quite as optimal (of course), and I solved bug #2 by inverting BUFFERLENGTH and counting up instead of down. It seems to work, but I'm sure I've managed to introduce a few new bugs in the process :P I think your version is better anyway...

Quote:
Finally, note that we're not using a CPY INDEX like we did in the small array routine. It might be possible to further optimize the large array routine by using a CPY INDEX to replace DEC TEMP,X at SORT8, since the CPY will be faster than the DEC.
I think you're on to something, but right now my brain hurts just trying to think about it...

Anyway, thanks for the enlightening reply.

Edit: I realised you can speed up the counting loop in the large array version by using a end of array pointer instead of a length of array variable. I gained 2*n cycles, I think...


Top
 Profile  
Reply with quote  
 Post subject:
PostPosted: Wed Mar 31, 2004 1:07 am 
Offline
User avatar

Joined: Thu Mar 11, 2004 7:42 am
Posts: 362
> 3. BUFFERSTART (in the FILL loop) must be incremented every time you store, so it must come before the BCC F2

I got this one wrong. The problem is that the byte after the buffer will be overwritten. For example, if the count is 1, we store the byte, subtract 1 (which sets the carry), increment the buffer pointer, store the byte, subtract 1 (which clears the carry), then branch to F2. The correct solution is to move the two instructions at F1 after the BCC so that only one byte (for this case) gets stored.

Anyway, Here is a different method that is faster in most cases. In the second section of the small array routine, replace both DEC TEMP,X instructions with INC TEMP,X instructions, then replace the third section with this:

Code:
      CLC            ; write the values in numerical order
      LDX #0
SORT4 LDY TEMP,X
      BEQ SORT6
      TXA
SORT5 DEY
      STA (BUFFER),Y ; STA doesn't affect the Z flag!
      BNE SORT5
      LDA TEMP,X     ; BUFFER = BUFFER + TEMP,X
      ADC BUFFER     ; only ADC and CLC affect the carry in this section
      STA BUFFER
      BCC SORT6
      INC BUFFER+1
      CLC
SORT6 INX
      BNE SORT4


The worst case is a 256 element array that contains each value just once. In this case, the new routine will be 2 * 256 cycles slower than the old one. However, by specifying that the buffer must be located on a page boundary (since the maximum buffer size is 256 bytes, this isn't an unreasonable restriction), we can get rid of the three instructions before SORT6 and this method is faster in all cases as far as I can tell. (Actually, as long as the low byte of the buffer address + the buffer length <= 256 we can still eliminate those three instructions.). Note that INDEX is no longer used, which frees up that zero page location.

We can apply a similar technique to the large array routine. Another optimization is to work a page at a time when possible. Actually, this is a good optimization priniciple for practically any 6502 routine. We win big time in the middle section, since there will usually be at least 256 elements. In the last section, we can only work a page at a time when the count is at least 256, which is possible, but not as likely.

Since the three major sections have all changed, here is the new large array routine in its entirety. There isn't anything especially tricky this time around.

Code:
BUFH  =   $FB
LASTL =   $FC              ; LAST = buffer length - 1
LASTH =   $FD              ;

SORT   LDA #0              ; clear the counters
       TAY
SORT1  STA TEMP,Y
       STA TEMP+256,Y
       INX
       BNE SORT1

       LDA BUFFERSTART+1   ; save the high byte of the buffer pointer
       STA BUFH

       LDA LASTH           ; count how many of each value there are
       BEQ SORT4
SORT2  LDA (BUFFERSTART),Y ; count a page (256 elements) at a time
       TAX
       INC TEMP,X
       BNE SORT3
       INC TEMP+256,X
SORT3  INY
       BNE SORT2
       INC BUFFERSTART+1
       DEC LASTH
       BNE SORT2
SORT4  LDY LASTL
       BEQ SORT7
SORT5  LDA (BUFFERSTART),Y ; count what's left
       TAX
       INC TEMP,X
       BNE SORT6
       INC TEMP+256,X
SORT6  DEY
       BNE SORT5
SORT7  LDA (BUFFERSTART),Y ; handle Y=0 separately
       TAX
       INC TEMP,X
       BNE SORT8
       INC TEMP+256,X

SORT8  LDA BUFH            ; restore the high byte of the buffer pointer
       STA BUFFERSTART+1

       CLC
       LDX #0              ; store the values in numerical order
SORT9  LDA TEMP+256,X
       BEQ SORT11
       TXA
SORT10 STA (BUFFERSTART),Y ; store a page at a time
       INY
       BNE SORT10
       INC BUFFERSTART+1   ; update the high byte of BUFFERSTART
       DEC TEMP+256,X      ; decrement the high byte of the count
       BNE SORT10
SORT11 LDY TEMP,X
       BEQ SORT13
       TXA                 ; necessary when the BEQ SORT11 above was taken
SORT12 DEY                 ; store what's left
       STA (BUFFERSTART),Y ; STA doesn't affect the Z flag!
       BNE SORT12
       LDA TEMP,X          ; BUFFERSTART = BUFFERSTART + TEMP,X
       ADC BUFFERSTART     ; only ADC & CLC affect the carry in this section
       STA BUFFERSTART
       BCC SORT13
       INC BUFFERSTART+1
       CLC
SORT13 INX
       BNE SORT9


Even though from SORT11 on it's basically the same as the new third section of the small array routine, note that the SORT10 loop makes use of the fact that Y=0 upon entry, so the INDEX method can't be used at SORT11 as is. Note that the first section is faster than it was previously, which compensates for the worst case of the small array routine (a 256 element array containing each value just once). Also, note that the three instructions before SORT13 are now necessary, since the buffer can be greater than 256 bytes.

Finally, here's one last optimization suggestion (which I haven't applied above) that can be applied to a lot of routines. For speed, the usual:

Code:
         INC NUML
         BNE SKIP
         INC NUMH
SKIP


isn't really optimal since most of the time the branch will be taken. If the code is rewritten as follows:

Code:
         INC NUML
         BEQ CARRY
CONTINUE

CARRY    INC NUMH
         JMP CONTINUE


that branch is not taken most of the time. It looks ugly and can make a program more difficult to follow, but it will save 251 cycles (assuming the branch doesn't cross a page boundary) for every 256 times it gets executed.


Top
 Profile  
Reply with quote  
 Post subject:
PostPosted: Thu Apr 08, 2004 8:49 pm 
Offline

Joined: Sun Aug 24, 2003 7:17 pm
Posts: 111
I checked the execution time for the new "Shell Sort" algorithm of the software repository provided by Fredrik Ramsberg. My first test was to sort a random permutation of the values 0 - 255.

1398 *256 oscillator periods were needed corresponding to 0.0447 seconds with 8 MHz. Corresponding figures with the simple algorithm I provided the software repository with earlier are 2006 *256 oscillator periods corresponding to 0.0642 seconds.

Then I sorted 8192 values (16384 bytes) with "Shell Sort". For this 87503 *256 oscillator periods corresponding to 2.800 seconds was needed. The ratio 1398/(256 * ln 256) = 0.180 and the ratio 87503/(8192 * ln 8192) = 0.110 .

But trying to sort even more values there is a problem! There seems to be some bug in the program that only shows when many values are sorted!

Sorting for example 10000 values the subroutine returned prematurely with only a part of the sequence sorted. The input data was a permutation of the values 0 - 9999 properly in RAM $1183 - $5FA2.

Everybody (specially Fredrik Ramsberg) is invited to verify my results! And provided I am right, the code should be corrected.

If somebody ( Fredrik Ramsberg!) wants to have my test data send me a private message with email address!


Top
 Profile  
Reply with quote  
 Post subject:
PostPosted: Thu Jun 24, 2004 12:59 am 
Offline

Joined: Thu Jun 24, 2004 12:05 am
Posts: 7
Mats,

I hadn't discovered the forum until now (I haven't been a frequent visitor until now, but now that I know there's a forum, that might change).

Glad to see your test results. I'd quite like to have your test data, so I can reproduce the bug. My address is fredrikXyzzYramsberg.net (replace XyzzY with "@" to get the real address).

I did spot that my Shell Sort was gone from the repository, and mailed Mike about it, but he didn't reply or his reply didn't reach me.

I think the bug has to do with using the highest value of h. If that is true, then sorting less than 19682/2 values should always work correctly, while a greater number of values may fail. I don't know what the bug is yet though. The code is still available here:

http://www.6502.org/source/sorting/shell.htm

I intend to return after further investigations.

Best regards,

/Fredrik


Top
 Profile  
Reply with quote  
Display posts from previous:  Sort by  
Post new topic Reply to topic  [ 26 posts ]  Go to page 1, 2  Next

All times are UTC


Who is online

Users browsing this forum: No registered users and 4 guests


You cannot post new topics in this forum
You cannot reply to topics in this forum
You cannot edit your posts in this forum
You cannot delete your posts in this forum
You cannot post attachments in this forum

Search for:
Jump to:  
cron