 Post subject: Re: 6502 vs 6800
PostPosted: Sat Sep 26, 2020 1:32 pm 

Joined: Thu Mar 12, 2020 10:04 pm
Posts: 702
Location: North Tejas
A linked list is a very useful data structure. It allows information to be inserted, removed and reordered without having to move much of the data around, as would be the case with an array. In addition, the individual entries need not all be the same size.

The variable List contains the address of the first node in the list. It is zero when the list is empty.

The first two bytes of each entry contain the address of the next node in the list. That address is zero in the last node of the list.

This exercise is to find and remove a node from a list. The variable Node contains its address.

Code:
                          00008 *
                          00009 * Remove a node from a linked list.
                          00010 *
                          00011 * Input:
                          00012 *        List = the address of the first node in the list.
                          00013 *        Node = the address of the node to remove
                          00014 *
                          00015 * Scoreboard:
                          00016 *
                          00017 *   21 cycles if the list is empty
                          00018 *
                          00019 * otherwise
                          00020 *
                          00021 *   24 cycles to start
                          00022 *
                          00023 *   23 cycles for each node before the specified one
                          00024 *
                          00025 *   20 cycles to exit if the node is not found
                          00026 *   38 cycles to exit if the node is found
                          00027 *
 0100                     00028 Remove
 0100 CE 0002         [3] 00029          ldx    #List     ; This is the "previous" of the first node
 0103 DF 04           [5] 00030          stx    Prev
                          00031
 0105 DE 02           [4] 00032          ldx    List      ; Point to the first node
 0107 27 0A (0113)    [4] 00033          beq    Done      ; End of the list?
                          00034
 0109                     00035 Loop
 0109 9C 00           [4] 00036          cpx    Node      ; Is this the node?
 010B 27 07 (0114)    [4] 00037          beq    Found     ; Yes
                          00038
 010D DF 04           [5] 00039          stx    Prev      ; This will be the previous node
                          00040
 010F EE 00           [6] 00041          ldx    ,X        ; Point to the next node
                          00042
 0111 26 F6 (0109)    [4] 00043          bne    Loop      ; Repeat
                          00044
 0113                     00045 Done
 0113 39              [5] 00046          rts
                          00047
 0114                     00048 Found
 0114 EE 00           [6] 00049          ldx    ,X        ; Get the address of the next node
 0116 DF 06           [5] 00050          stx    Next
                          00051
 0118 DE 04           [4] 00052          ldx    Prev      ; Point back to the previous node
                          00053
 011A 96 06           [3] 00054          ldaa   Next      ; Link it to the next node
 011C A7 00           [6] 00055          staa   ,X
 011E 96 07           [3] 00056          ldaa   Next+1
 0120 A7 01           [6] 00057          staa   1,X
                          00058
 0122 39              [5] 00059          rts
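
For readers who find the listing hard to follow, here is a minimal C sketch of the same removal (the type and function names are hypothetical, not part of the original listing). Like the 6800 code, it treats the List variable itself as the "previous" link of the first node, so deleting the head needs no special case:

Code:
#include <stddef.h>

/* Hypothetical C equivalent of the Remove routine above. */
struct node { struct node *next; /* payload follows */ };

struct node *List;                       /* head of the list; NULL when empty */

void remove_node(struct node *target)    /* target plays the role of Node */
{
    struct node **prev = &List;          /* "previous" link of the first node */
    for (struct node *n = List; n != NULL; n = n->next) {
        if (n == target) {
            *prev = n->next;             /* link previous to the next node */
            return;
        }
        prev = &n->next;                 /* this will be the previous node */
    }
}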


 Post subject: Re: 6502 vs 6800
PostPosted: Sat Sep 26, 2020 5:55 pm 

Joined: Sat Jul 09, 2016 6:01 pm
Posts: 180
barrym95838 wrote:
I don't know for sure if I made any rookie coding mistakes or cycle counting mistakes, but it looks like the 6800 is 56 ticks per char.
[I must admit that the 6800 source looks tidier than the 6502's, even though its performance isn't exactly brilliant.]

It is actually 57 ticks, so the 6502/6800 performance ratio is approximately 2.4 for this task.
I can't understand why the 6800 code looks tidier to you: it has 13 lines in its main loop, while the 6502 code has only 10.

BillG wrote:
A SWTPC 6800 computer running FLEX only uses interrupts for background print spooling and that is only if a special timer option is installed. The worst case is the printer pausing while interrupts are disabled.

Many 8-bit computers do not make much use of interrupts. The Apple 2 Woz Disk 2 controller uses software timing to make its magic work. Most systems with a Western Digital 17xx disk controller use a tight loop in the driver software for data transfers; most disable interrupts while doing this.

Until mice became common, interrupts were mostly used for three things: keyboard input, receiving data from a serial port and video synchronization. Only the latter is time critical and programs doing that tend to finish processing and wait in a spin loop for the event.


Timer interrupts are quite common on 8-bit computers. Disabling interrupts during serial I/O can be critical... Raster interrupts are quite common too, and disabling them may disturb the picture on the screen.

BillG wrote:
Now that is contrived, to use your word. We all agree that the 6800 is hampered by having only one index register, yet you insist on continuing to propose examples to prove that one point.

I have put forth examples of things a computer may be expected to do and you denigrate those, saying that is not used in 8-bit programming or only a compiler does that. Do you use an editor? 16-bit arithmetic is needed when dealing with the addresses of data in memory. Do you use an assembler? Plenty of 16-bit math in there.


It can't be contrived; string translation is quite a typical task. However, we have many other typical cases, like the pointer operations mentioned earlier.

BillG wrote:
barrym95838 wrote:
[Edit: How about a little 6809 action?]

It is 23 cycles, 1 tick (about 4%) less than the 6502. That is not a great achievement for a processor which is so much more complex than the 6502.

BillG wrote:
Because a zero translates to a zero, it gets even better...

That is not true. We can use a reduced translation table, for instance one covering visible ASCII only.

BigEd wrote:
Two important points here: first, that's a historical claim from a supplier, which is clearly a sales position and may bear no relation to reality.

IMHO it is rather disputable. Maybe somebody can ask Bill Mensch about this claim? He has mentioned this ratio too.

BillG wrote:
A linked list is a very useful data structure. It allows information to be inserted, removed and reordered without having to move much of the data around as would be the case of using an array. In addition, the individual entries need not be the same size.

Thank you, it is a good example, but too simple. The 6800 has 16-bit operations like CPX, LDX, ... Indeed, these alone give it an advantage.
I have written 6502 code for the main loop.
Code:
    ldy #0
    sty prev+1
    lda list
    ldx list+1
    beq done
    bne cont1

loop
    lda (prev),y
cont1
    cpx node+1
    bne cont2

    cmp node
    beq found
cont2
    sta prev
    stx prev+1
    iny
    lda (prev),y
    dey
    tax
    bne loop

It is 31 ticks (the 6800 code takes 23), but your code is just a sequence of 16-bit operations. If we add some more calculation, this 6800 advantage disappears. For instance, it is not typical to seek a node by its address; usually we seek a node by the value of one of its fields. We can compare two lists, process several lists, etc. More complex processing gives more of an advantage to the 6502.
IMHO sort algorithms are good for benchmarking. Sorting is a typical task, and it is not very simple.



Last edited by litwr on Sat Sep 26, 2020 7:55 pm, edited 1 time in total.

 Post subject: Re: 6502 vs 6800
PostPosted: Sat Sep 26, 2020 7:31 pm 

Joined: Mon May 21, 2018 8:09 pm
Posts: 1462
For problem sizes likely to fit into 64KB RAM, Shellsort is an appropriate algorithm. It's very simple (just a sequence of strided insertion sorts), doesn't have nasty implementation details as Quicksort does, and doesn't consume temporary memory as Mergesort does. Performance characteristics of Shellsort are mainly determined by the gap sequence, which is usually precomputed up to the largest practical problem size for the computer involved, and thus doesn't affect the code directly; for these small computers, we could take the Ciura sequence (1, 4, 10, 23, 57, 132, 301, 701) directly without extension. For all these reasons, Shellsort is used to implement qsort() in µClibc.

Sometimes the sorting task is just to put some numbers in order. Sometimes, however, there are sorting keys with attached data. And sometimes the keys to be sorted are actually strings of text. So we should first define what we are sorting. I think a good test would be 24-bit keys with a 40-bit attached record, for a total of 8 bytes in each record. A theoretical 64KB of such records would be 8192 of them, which is not too large for the Ciura sequence to apply. For the sake of sanity, let's stipulate that the implementation has to handle up to 4096 records, 32KB, stored contiguously.
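
To make the proposed benchmark concrete, a minimal C reference of Shellsort over such 8-byte records with the Ciura gaps might look like this (the exact record layout and the little-endian key placement are assumptions for illustration):

Code:
#include <stdint.h>

typedef struct { uint8_t b[8]; } rec;    /* 3-byte key + 5 bytes of payload (assumed layout) */

static uint32_t key24(const rec *r)      /* little-endian 24-bit key, an assumption */
{
    return (uint32_t)r->b[0] | ((uint32_t)r->b[1] << 8) | ((uint32_t)r->b[2] << 16);
}

void shellsort(rec *a, int n)            /* n up to 4096 per the stipulation above */
{
    static const int gaps[] = { 701, 301, 132, 57, 23, 10, 4, 1 };  /* Ciura sequence */
    for (unsigned g = 0; g < sizeof gaps / sizeof gaps[0]; g++) {
        int gap = gaps[g];
        for (int i = gap; i < n; i++) {  /* strided insertion sort for this gap */
            rec tmp = a[i];
            int j = i;
            while (j >= gap && key24(&a[j - gap]) > key24(&tmp)) {
                a[j] = a[j - gap];       /* shift records up by one stride */
                j -= gap;
            }
            a[j] = tmp;
        }
    }
}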

I actually would like to see how that ends up comparing between the two.


 Post subject: Re: 6502 vs 6800
PostPosted: Sun Sep 27, 2020 11:29 am 

Joined: Thu Mar 12, 2020 10:04 pm
Posts: 702
Location: North Tejas
litwr wrote:
I have made the 6502 code for the main loop.


Don't you believe in comments?

I fixed a problem in your code...

Code:
 0200 A0 02           [2] 00008          ldy    #list     ; This is needed to handle deleting first node
 0202 84 04           [3] 00009          sty    prev
                          00010
 0204 A0 00           [2] 00011          ldy    #0
 0206 84 05           [3] 00012          sty    prev+1
 0208 A5 02           [3] 00013          lda    list
 020A A6 03           [3] 00014          ldx    list+1
 020C F0 17 (0225)  [2/3] 00015          beq    done
 020E D0 02 (0212)  [2/3] 00016          bne    cont1
                          00017
 0210                     00018 loop
 0210 B1 04         [5/6] 00019          lda    (prev),y
 0212                     00020 cont1
 0212 E4 01           [3] 00021          cpx    node+1
 0214 D0 04 (021A)  [2/3] 00022          bne    cont2
                          00023
 0216 C5 00           [3] 00024          cmp    node
 0218 F0 0C (0226)  [2/3] 00025          beq    found
 021A                     00026 cont2
 021A 85 04           [3] 00027          sta    prev
 021C 86 05           [3] 00028          stx    prev+1
 021E C8              [2] 00029          iny
 021F B1 04         [5/6] 00030          lda    (prev),y
 0221 88              [2] 00031          dey
 0222 AA              [2] 00032          tax
 0223 D0 EB (0210)  [2/3] 00033          bne    loop


You did not include anything for found. It cannot be substantially better than the 6800 code.

Edit: corrected my misinterpretation of code.


Last edited by BillG on Sun Sep 27, 2020 12:52 pm, edited 2 times in total.

 Post subject: Re: 6502 vs 6800
PostPosted: Sun Sep 27, 2020 11:34 am 

Joined: Thu Mar 12, 2020 10:04 pm
Posts: 702
Location: North Tejas
litwr wrote:
It is 31 ticks (the 6800 code takes 23), but your code is just a sequence of 16-bit operations. If we add some more calculation, this 6800 advantage disappears. For instance, it is not typical to seek a node by its address; usually we seek a node by the value of one of its fields.


To quote the late, great Ronald Reagan, "There you go again..."

Fine. Modify the task to match the value in the Tag variable against the third byte of each node.

This change adds one cycle to the inner loop of the 6800, but IMHO hurts the 6502 by at least 10 cycles. Even with three byte-sized registers, you are now one short. Too bad the 6800 cannot loan you its unused B register...

Code:
                          00010 *
                          00011 * Remove a node from a linked list.
                          00012 *
                          00013 * Input:
                          00014 *        List = the address of the first node in the list.
                          00015 *        Tag = the tag value (third byte) of the node to remove
                          00016 *
                          00017 * Scoreboard:
                          00018 *
                          00019 *   21 cycles if the list is empty
                          00020 *
                          00021 * otherwise
                          00022 *
                          00023 *   28 cycles to start
                          00024 *
                          00025 *   24 cycles for each node before the specified one
                          00026 *
                          00027 *   20 cycles to exit if the node is not found
                          00028 *   38 cycles to exit if the node is found
                          00029 *
 0100                     00030 Remove
 0100 CE 0001         [3] 00031          ldx    #List     ; This is the "previous" of the first node
 0103 DF 03           [5] 00032          stx    Prev
                          00033
 0105 DE 01           [4] 00034          ldx    List      ; Point to the first node
 0107 27 0C (0115)    [4] 00035          beq    Done      ; End of the list?
                          00036
 0109 96 00           [3] 00037          ldaa   Tag       ; Get tag to match
                          00038
 010B                     00039 Loop
 010B A1 02           [5] 00040          cmpa   Tag_,X    ; Is this the node?
 010D 27 07 (0116)    [4] 00041          beq    Found     ; Yes
                          00042
 010F DF 03           [5] 00043          stx    Prev      ; This will be the previous node
                          00044
 0111 EE 00           [6] 00045          ldx    ,X        ; Point to the next node
                          00046
 0113 26 F6 (010B)    [4] 00047          bne    Loop      ; Repeat
                          00048
 0115                     00049 Done
 0115 39              [5] 00050          rts
                          00051
 0116                     00052 Found
 0116 EE 00           [6] 00053          ldx    ,X        ; Get the address of the next node
 0118 DF 05           [5] 00054          stx    Next
                          00055
 011A DE 03           [4] 00056          ldx    Prev      ; Point back to the previous node
                          00057
 011C 96 05           [3] 00058          ldaa   Next      ; Link it to the next node
 011E A7 00           [6] 00059          staa   ,X
 0120 96 06           [3] 00060          ldaa   Next+1
 0122 A7 01           [6] 00061          staa   1,X
                          00062
 0124 39              [5] 00063          rts


 Post subject: Re: 6502 vs 6800
PostPosted: Wed Sep 30, 2020 8:54 am 

Joined: Sat Jul 09, 2016 6:01 pm
Posts: 180
BillG wrote:
Fine. Modify the task to match the value in the Tag variable with the third byte of each node.

The Tag can be a string. ;)



 Post subject: Re: 6502 vs 6800
PostPosted: Mon Oct 05, 2020 6:16 pm 

Joined: Sat Jul 09, 2016 6:01 pm
Posts: 180
Chromatix wrote:
For problem sizes likely to fit into 64KB RAM, Shellsort is an appropriate algorithm. It's very simple (just a sequence of strided insertion sorts), doesn't have nasty implementation details as Quicksort does, and doesn't consume temporary memory as Mergesort does. Performance characteristics of Shellsort are mainly determined by the gap sequence, which is usually precomputed up to the largest practical problem size for the computer involved, and thus doesn't affect the code directly; for these small computers, we could take the Ciura sequence (1, 4, 10, 23, 57, 132, 301, 701) directly without extension. For all these reasons, Shellsort is used to implement qsort() in µClibc.

Sometimes the sorting task is just to put some numbers in order. Sometimes, however, there are sorting keys with attached data. And sometimes the keys to be sorted are actually strings of text. So we should first define what we are sorting. I think a good test would be 24-bit keys with a 40-bit attached record, for a total of 8 bytes in each record. A theoretical 64KB of such records would be 8192 of them, which is not too large for the Ciura sequence to apply. For the sake of sanity, let's stipulate that the implementation has to handle up to 4096 records, 32KB, stored contiguously.

I actually would like to see how that ends up comparing between the two.

It is done. I obtained the following cycle counts.
Quote:
random (gcc) 16,735,556
random (gen) 16,512,696
ascending 4,371,746
descending 9,959,798
constant 4,371,746

So, generally, it takes about 16 seconds on a 6502 @ 1 MHz. For the random sequence, I used two variants:
gcc) the C standard call rand()%256 for every byte;
gen) the following code
Code:
for (int i = 1; i < 8*4096; i++) mem[i] = (i >> 2 | (mem[i - 1]  & 3) << 6) + i*i%43;
where mem is an array of bytes.
I added one more value to the gap sequence: 1750.
I have also found that my quicksort is about 2.6 times faster than my shellsort on 30,000 random two-byte (uint16_t) numbers. I am almost sure that the 6800 is at least 3 times slower for the quicksort. The 6800 may use any sort algorithm; it just needs the fastest one to compete. ;)
Code:
shelltabidx2 = 18 ;it points to the 9th position in the gap sequence
addrhi = 10   ;zp locations for parameters
addrlo = 11

data = $400     ;start of the array
sz = 4096       ;size of the array in elements
ESZ = 8         ;size of element
sz2 = sz*ESZ    ;size of the array in bytes

        * = $200
        lda #<data
        sta addrlo
        lda #>data
        sta addrhi
        ldx #<sz2
        ldy #>sz2
        lda #shelltabidx2
        jsr shellsort
        nop               ;stop here

        .include "shell-contest.s"

The source of my shellsort routine is attached; the routine is 205 bytes long and uses 10 zero-page bytes. What about the 6800 shellsort speed? ;)


Attachments:
shell-contest.zip [707 Bytes]
Downloaded 67 times

 Post subject: Re: 6502 vs 6800
PostPosted: Mon Oct 05, 2020 8:03 pm 

Joined: Mon May 21, 2018 8:09 pm
Posts: 1462
Running an instrumented version of Shellsort on a PC, I estimate your code is using about 130 cycles for a compare, and about 200 or so cycles for a compare and swap, in both cases including all the loop overhead for moving on to the next pair of records. This is in a plausible range for 6502 code on this size of record. I haven't yet dug into your code to drill any deeper than that.


 Post subject: Re: 6502 vs 6800
PostPosted: Tue Oct 06, 2020 7:43 pm 

Joined: Sat Jul 09, 2016 6:01 pm
Posts: 180
I have made some code improvements: the code is now 10-40% faster and 6 bytes shorter, and it requires 4 fewer zero-page bytes. For the task I obtained the following results:
Quote:
random (gcc) 14,841,052
random (gen) 14,633,170
descending 8,509,600
ascending 3,219,746
constant 3,219,746

In the attached archive I have added the C source of the shellsort algorithm used; maybe it will help with writing the 6800 equivalent.


Attachments:
shell-contest-2.zip [1.74 KiB]
Downloaded 76 times

 Post subject: Re: 6502 vs 6800
PostPosted: Sun Oct 11, 2020 3:44 pm 

Joined: Sat Jul 09, 2016 6:01 pm
Posts: 180
I have done experiments with quicksort and selection sort. Their sources are attached; their results follow below.
The following table shows the code size of each sorting routine and the amount of zero-page memory required.
Code:
             size    zp
quick        271     12(min=6)
shell        196      6
selection    118      6


The following tables show the speed of the routines in clock cycles for 4096 and 8063 elements. I use Chromatix's 8-byte data structure in all cases.
Code:
4096             quick         shell        selection
random (gcc)      6,945,023    14,841,052     546,689,912
random (gen)      7,078,469    14,633,170     546,689,967
qs-slow (16B)    11,780,993    11,767,274     546,474,774
descending        3,405,405     8,509,600     592,522,281
ascending         2,833,756     3,219,746     546,384,937
constant          9,610,629     3,219,746     546,384,937

8063             quick         shell        selection
random (gcc)     14,658,383    33,545,066   2,117,270,138
random (gen)     14,979,654    32,611,660   2,117,259,303
qs-slow (16B)    23,994,141    25,221,514   2,116,756,702
descending        7,084,521    19,542,402   2,295,383,997
ascending         5,960,625     6,614,042   2,116,601,085
constant         18,379,125     6,614,042   2,116,601,085


The quicksort routine uses 16 as a stack limit. The 'qs-slow' filling usually kills quicksort with large enough data. This filling uses the following algorithm.
Code:
mem[0] = 1;
mem[1] = 3;
mem[2] = 5;
mem[3] = 0;
mem[4] = 2;
mem[5] = 4;
for (int i = 3; i < SIZE*2; i++) {
    mem[2*i] = mem[i];
    mem[2*i + 1] = mem[2*i - 1] + 2;
    mem[i] = 2*i + 1;
}


So it would also be interesting to see 6800 variants of quicksort and selection sort. ;)
Attachment:
sort-quick-selection.zip [2.14 KiB]
Downloaded 75 times



Last edited by litwr on Mon Oct 12, 2020 8:24 am, edited 1 time in total.

 Post subject: Re: 6502 vs 6800
PostPosted: Mon Oct 12, 2020 12:41 am 

Joined: Mon Sep 17, 2018 2:39 am
Posts: 137
Hi!

litwr wrote:
I have done experiments with quicksort and selection sort. Their sources are attached; their results follow below.
The following table shows the code size of each sorting routine and the amount of zero-page memory required.

The quicksort routine uses 16 as a stack limit. The 'qs-slow' filling usually kills quicksort with large enough data. This filling uses the following algorithm.

You can implement Quicksort so that stack usage is bounded by log2(n); since log2(4096) = 12, a 12-entry stack should be enough. The trick is to always recurse on the smaller partition first.

See this FastBasic code, sorting 16-bit numbers:
Code:
num = 4096
dim p(num)

' FILL
for i=0 to num
  p(i) = rand(10000)
next

dim stack0(12), stack1(12)
lvl = 1    ' Stack level
i0 = 0     ' Sorts from...
i1 = num   ' ... to

repeat
  if i1 > i0

    ' Set pivot to 'i1'
    temp = p(i1)
    j0 = i0 - 1
    j1 = i1
    do
      repeat : inc j0 :                        until (p(j0)) >= (temp)
      repeat : dec j1 : if j1 = i0 then exit : until (temp) >= (p(j1))
      if j0 >= j1 then exit
      x = p(j0) : p(j0) = p(j1) : p(j1) = x
    loop
    p(i1) = p(j0) : p(j0) = temp

   ' Recurse over smaller sub-array:
    if i1 - j0  >  j1 - i0
      stack0(lvl) = j0 + 1
      stack1(lvl) = i1
      i1 = j0 - 1
    else
      stack0(lvl) = i0
      stack1(lvl) = j0 - 1
      i0 = j0 + 1
    endif
    inc lvl
  else
    ' Empty partition, go down in the stack
    dec lvl
    i0 = stack0(lvl)
    i1 = stack1(lvl)
  endif
until not lvl


The FastBasic 6502 interpreter runs the above code, sorting the 4097-element array in approximately 74 megacycles.

Adding an insertion sort pass for partitions smaller than 8 elements reduces the runtime to 66 megacycles, about 11% faster.


 Post subject: Re: 6502 vs 6800
PostPosted: Mon Oct 12, 2020 1:18 am 

Joined: Mon May 21, 2018 8:09 pm
Posts: 1462
Indeed, that is one of the nasty little implementation details that plague Quicksort, as I mentioned earlier. It's very easy to write a Quicksort that works well on shuffled unique (or nearly unique) data. It's significantly harder to write one that is not trivially susceptible to pathological behaviour on inputs that are easy to construct (and may even be encountered by accident). Shellsort handles such cases far more naturally in a small code size, which is why I consider it more appropriate for small machines.

The four basic features of a competent Quicksort are:
1: Pivot is chosen well - ideally the median of a modest random sample - to avoid quadratic behaviour on ascending, descending, "organ pipes", or median-3-killer lists. On a small machine it can be hard to generate good random numbers, but Rand48 is better than nothing.
2: Ternary comparisons against the pivot, to efficiently handle runs of equal values. This results in three partitions, but you only need to recurse into two of them. A single-valued array goes from a pathological worst case to an optimal best case when you do this (a C sketch of such a partition follows this list).
3: Smallest open partition is handled first, to minimise stack/queue depth. It's sensible to use a priority queue for this, as you effectively do in the above, and this avoids having to depend on tail recursion optimisations since you can simply iterate on the queue.
4: Small partitions are handled with insertion sort or similar, since Quicksort does not handle them efficiently. On modern hardware, I would use Splaysort from 256 elements down; this can be implemented with a fixed 1KB auxiliary buffer, with a little imagination, and is exceptionally efficient. But even this is too much when you only have 64KB total RAM. Insertion sort is efficient from about 8 elements down.
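
As a concrete illustration of point 2, here is a minimal C sketch of a ternary ("Dutch national flag") partition; the names are illustrative, and the code is not taken from any routine in this thread:

Code:
/* Partition a[0..n) into  < pivot | == pivot | > pivot.
   On return, a[0..*lt) < pivot, a[*lt..*gt) == pivot, a[*gt..n) > pivot,
   so only the two outer regions need further sorting. */
void partition3(int *a, int n, int pivot, int *lt, int *gt)
{
    int lo = 0, i = 0, hi = n;
    while (i < hi) {
        if (a[i] < pivot) {
            int t = a[i]; a[i++] = a[lo]; a[lo++] = t;
        } else if (a[i] > pivot) {
            int t = a[i]; a[i] = a[--hi]; a[hi] = t;
        } else {
            i++;                         /* equal to pivot: leave in the middle */
        }
    }
    *lt = lo;
    *gt = hi;
}

On a single-valued array the middle region swallows everything, leaving nothing to recurse into, which is the best-case behaviour described above.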

After taking care of all the above points, the code complexity starts making Shellsort look very attractive for a small machine, even though it's a bit slower. Most of the descriptions of Quicksort that you can easily find do not make the above points as clear as they should be.

In a sorting visualisation framework that I've been playing with, I have a working septenary quicksort, using ternary comparisons with three pivots at a time, and delegating to Splaysort for small partitions. The three pivots are chosen as quartiles of a 7-element random sample, which ensures only a 25% chance that the largest open partition is more than half the original. It uses effectively O(1) memory and makes Timsort start to look slow, which is really saying something. And I have a version that does stable sorting as well, which was the original motivation for the septenary structure. But I wouldn't try to squeeze all that into a 6502.


 Post subject: Re: 6502 vs 6800
PostPosted: Mon Oct 12, 2020 2:27 am 

Joined: Mon Sep 17, 2018 2:39 am
Posts: 137
Hi!

Chromatix wrote:
Indeed, that is one of the nasty little implementation details that plague Quicksort, as I mentioned earlier. It's very easy to write a Quicksort that works well on shuffled unique (or nearly unique) data. It's significantly harder to write one that is not trivially susceptible to pathological behaviour on inputs that are easy to construct (and may even be encountered by accident). Shellsort handles such cases far more naturally in a small code size, which is why I consider it more appropriate for small machines.
I agree with you, but have some comments on Quicksort for small machines:

Quote:
The four basic features of a competent Quicksort are:
1: Pivot is chosen well - ideally the median of a modest random sample - to avoid quadratic behaviour on ascending, descending, "organ pipes", or median-3-killer lists. On a small machine it can be hard to generate good random numbers, but Rand48 is better than nothing.
2: Ternary comparisons against the pivot, to efficiently handle runs of equal values. This results in three partitions, but you only need to recurse into two of them. A single-valued array goes from a pathological worst case to an optimal best case when you do this.

I tried implementing a ternary-partition qsort in FastBasic, and found it to be always slower than the standard two-way partitioning. I suspect this is because the problem sizes are not really that big (I tried up to 8k elements) and the insertion sort pass at the small partition size is not affected.

And about pivot selection, yes, selecting a good pivot is essential. Instead of one random element you can take the median of three; see the complete code:
Code:
' QuickSort with median of three random values as pivot
dim stack0(12), stack1(12)
lvl = 1
i0 = 0
i1 = num
repeat
  ' Only recurse on partitions with more than 8 elements
  if i1 > i0 + 8
    ' Set pivot to median of 3 random values
    p0 = rand(i1 - i0) + i0
    p1 = rand(i1 - i0) + i0
    p2 = rand(i1 - i0) + i0
    t0 = p(p0)
    t1 = p(p1)
    t2 = p(p2)
    if (t0 > t1) = (t0 < t2)
      p(p0) = p(i1)
      temp = t0
    elif (t1 > t0) = (t1 < t2)
      p(p1) = p(i1)
      temp = t1
    else
      p(p2) = p(i1)
      temp = t2
    endif
    p(i1) = temp

    j0 = i0 - 1
    j1 = i1
    do
      repeat : inc j0 :                        until (p(j0)) >= (temp)
      repeat : dec j1 : if j1 = i0 then exit : until (temp) >= (p(j1))
      if j0 >= j1 then exit
      x = p(j0) : p(j0) = p(j1) : p(j1) = x
    loop
    p(i1) = p(j0) : p(j0) = temp

   ' Recurse over smaller sub-array:
    if i1 - j0  >  j1 - i0
      stack0(lvl) = j0 + 1
      stack1(lvl) = i1
      i1 = j0 - 1
    else
      stack0(lvl) = i0
      stack1(lvl) = j0 - 1
      i0 = j0 + 1
    endif
    inc lvl
  else
    for i=i0 to i1-1
      temp = p(i+1)
      for j=i to i0 step -1
        if (p(j)) <= (temp) then exit
        p(j+1) = p(j)
      next
      p(j+1) = temp
    next
    dec lvl
    i0 = stack0(lvl)
    i1 = stack1(lvl)
  endif
until not lvl


Quote:
3: Smallest open partition is handled first, to minimise stack/queue depth. It's sensible to use a priority queue for this, as you effectively do in the above, and this avoids having to depend on tail recursion optimisations since you can simply iterate on the queue.


I don't know why this method is not shown in more places; for me it is the best way to implement quicksort. The "recursive quicksort" tends to be worse than the iterative one, and it is more difficult to get right.

Quote:
After taking care of all the above points, the code complexity starts making Shellsort look very attractive for a small machine, even though it's a bit slower.


I agree, and I really like the Shellsort algorithm; it is really simple and quite fast, normally faster than Heapsort in most implementations.

Quote:
In a sorting visualisation framework that I've been playing with, I have a working septenary quicksort, using ternary comparisons with three pivots at a time, and delegating to Splaysort for small partitions. The three pivots are chosen as quartiles of a 7-element random sample, which ensures only a 25% chance that the largest open partition is more than half the original. It uses effectively O(1) memory and makes Timsort start to look slow, which is really saying something. And I have a version that does stable sorting as well, which was the original motivation for the septenary structure. But I wouldn't try to squeeze all that into a 6502.


Yeah, the above sounds like very tricky code to get right and fast, and I suspect that the speedup is only noticeable for very big inputs.

Have Fun!


 Post subject: Re: 6502 vs 6800
PostPosted: Mon Oct 12, 2020 3:12 pm 

Joined: Mon May 21, 2018 8:09 pm
Posts: 1462
Quote:
I tried implementing a ternary-partition qsort in FastBasic, and found it to be always slower than the standard two-way partitioning. I suspect this is because the problem sizes are not really that big (I tried up to 8k elements) and the insertion sort pass at the small partition size is not affected.

If it is always slower, that implies it is slower even on the single-valued array which ternary partitioning is supposed to help massively, which rather suggests you got it wrong somehow. But it is expected to be a little slower on normal data, since more elements have to be moved and a more complicated comparison has to be performed.

Closely related to the ternary quicksort is the dual-pivot quicksort, which can use the same partitioning algorithm as ternary since it also produces three partitions, but must recurse into all three of them. Here the extra cost of the three-way partitioning is offset by the smaller partition size at each step, and thus the reduced depth of recursion. I have a version which compares the two pivots to each other after selecting them, and performs a ternary partition ( <p | =p | >p ) if they are equal - which implies that there's a big run of equal values for which the ternary scheme works best - and a dual-pivot partition ( <=p | >p && <q | >=q ) otherwise. The latter construction guarantees that the outer partitions have at least one element each, ie. the pivots themselves.
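
A minimal C sketch of that dual-pivot partition step, under the ( <=p | >p && <q | >=q ) scheme just described (the names and element type are assumptions; the post describes the scheme, not this exact code):

Code:
/* Partition a[lo..hi] with two pivots p <= q already placed at a[lo] and a[hi].
   On return p sits at *lp and q at *gp, both in final position; recurse on
   a[lo..*lp-1] (<= p), a[*lp+1..*gp-1] (> p and < q) and a[*gp+1..hi] (>= q). */
void partition_dual(int *a, int lo, int hi, int *lp, int *gp)
{
    int p = a[lo], q = a[hi];
    int l = lo + 1, g = hi - 1, i = lo + 1;
    while (i <= g) {
        if (a[i] <= p) {                 /* belongs to the left part */
            int t = a[i]; a[i] = a[l]; a[l] = t; l++; i++;
        } else if (a[i] >= q) {          /* belongs to the right part */
            int t = a[i]; a[i] = a[g]; a[g] = t; g--;
        } else {
            i++;                         /* stays in the middle part */
        }
    }
    l--; g++;
    { int t = a[lo]; a[lo] = a[l]; a[l] = t; }   /* move pivot p into place */
    { int t = a[hi]; a[hi] = a[g]; a[g] = t; }   /* move pivot q into place */
    *lp = l;
    *gp = g;
}

Here the pivots themselves land in the outer regions, which is what guarantees those regions are non-empty.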


 Post subject: Re: 6502 vs 6800
PostPosted: Mon Oct 12, 2020 4:07 pm 

Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10953
Location: England
(Interesting article on the practical and modern Timsort here.)

