Re: 6502 vs 6800
Posted: Sat Sep 26, 2020 1:32 pm
by BillG
A linked list is a very useful data structure. It allows information to be inserted, removed and reordered without having to move much of the data around, as would be the case with an array. In addition, the individual entries need not be the same size.
The variable List contains the address of the first node in the list. It is zero when the list is empty.
The first two bytes of each entry contain the address of the next node in the list. That address is zero in the last node of the list.
This exercise is to find and remove a node from a list. The variable Node contains its address.
Code: Select all
00008 *
00009 * Remove a node from a linked list.
00010 *
00011 * Input:
00012 * List = the address of the first node in the list.
00013 * Node = the address of the node to remove
00014 *
00015 * Scoreboard:
00016 *
00017 * 21 cycles if the list is empty
00018 *
00019 * otherwise
00020 *
00021 * 24 cycles to start
00022 *
00023 * 23 cycles for each node before the specified one
00024 *
00025 * 20 cycles to exit if the node is not found
00026 * 38 cycles to exit if the node is found
00027 *
0100 00028 Remove
0100 CE 0002 [3] 00029 ldx #List ; This is the "previous" of the first node
0103 DF 04 [5] 00030 stx Prev
00031
0105 DE 02 [4] 00032 ldx List ; Point to the first node
0107 27 0A (0113) [4] 00033 beq Done ; End of the list?
00034
0109 00035 Loop
0109 9C 00 [4] 00036 cpx Node ; Is this the node?
010B 27 07 (0114) [4] 00037 beq Found ; Yes
00038
010D DF 04 [5] 00039 stx Prev ; This will be the previous node
00040
010F EE 00 [6] 00041 ldx ,X ; Point to the next node
00042
0111 26 F6 (0109) [4] 00043 bne Loop ; Repeat
00044
0113 00045 Done
0113 39 [5] 00046 rts
00047
0114 00048 Found
0114 EE 00 [6] 00049 ldx ,X ; Get the address of the next node
0116 DF 06 [5] 00050 stx Next
00051
0118 DE 04 [4] 00052 ldx Prev ; Point back to the previous node
00053
011A 96 06 [3] 00054 ldaa Next ; Link it to the next node
011C A7 00 [6] 00055 staa ,X
011E 96 07 [3] 00056 ldaa Next+1
0120 A7 01 [6] 00057 staa 1,X
00058
0122 39 [5] 00059 rts
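For readers more comfortable with high-level code, the routine above corresponds to roughly this C (a sketch, not BillG's code: the two-byte link at the start of each node becomes a `next` pointer, and starting `prev` at the list head itself is what removes the special case for deleting the first node):

```c
#include <assert.h>
#include <stddef.h>

struct node { struct node *next; /* payload follows */ };

/* Remove the node at address 'target' from the singly linked list
   rooted at *list.  'prev' starts out pointing at the head pointer
   itself, so unlinking the first node needs no special case - this
   mirrors the "ldx #List / stx Prev" prologue of the 6800 routine. */
static void remove_node(struct node **list, struct node *target) {
    struct node **prev = list;            /* "previous" of the first node */
    for (struct node *p = *list; p != NULL; p = p->next) {
        if (p == target) {                /* is this the node? */
            *prev = p->next;              /* link previous to next */
            return;
        }
        prev = &p->next;                  /* this will be the previous node */
    }
}
```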
Re: 6502 vs 6800
Posted: Sat Sep 26, 2020 5:55 pm
by litwr
I don't know for sure if I made any rookie coding mistakes or cycle counting mistakes, but it looks like the 6800 is 56 ticks per char.
[I must admit that the 6800 source looks tidier than the 6502's, even though its performance isn't exactly brilliant.]
It is actually 57 ticks, so the 6502/6800 performance ratio is approx. 2.4 for this task.
I can't understand why the 6800 code looks tidier to you: it has 13 lines in its main loop, while the 6502 code has only 10.
A SWTPC 6800 computer running FLEX only uses interrupts for background print spooling and that is only if a special timer option is installed. The worst case is the printer pausing while interrupts are disabled.
Many 8-bit computers do not make much use of interrupts. The Apple II's Woz-designed Disk II controller uses software timing to work its magic. Most systems with a Western Digital 17xx disk controller use a tight loop in the driver software for data transfers; most disable interrupts while doing this.
Until mice became common, interrupts were mostly used for three things: keyboard input, receiving data from a serial port and video synchronization. Only the latter is time critical and programs doing that tend to finish processing and wait in a spin loop for the event.
Timer interrupts are quite common among 8-bit computers. Disabling interrupts during serial I/O can be critical... Raster interrupts are quite common too, and disabling them may disturb the screen picture.
Now that is contrived, to use your word. We all agree that the 6800 is hampered by having only one index register, yet you insist on continuing to propose examples to prove that one point.
I have put forth examples of things a computer may be expected to do and you denigrate those, saying that is not used in 8-bit programming or only a compiler does that. Do you use an editor? 16-bit arithmetic is needed when dealing with the addresses of data in memory. Do you use an assembler? Plenty of 16-bit math in there.
It can't be contrived; string translation is quite typical. However, we have many other typical cases, like the pointer operations mentioned.
[Edit: How about a little 6809 action?]
It is 23 cycles, 1 tick (about 4%) less than the 6502. That is not a great achievement for a processor so much more complex than the 6502.
Because a zero translates to a zero, it gets even better...
That is not true. We can use a reduced translation table, for instance for visible ASCII only.
Two important points here: first, that's a historical claim from a supplier, which is clearly a sales position and may bear no relation to reality.
IMHO it is rather disputable. Maybe somebody can ask Bill Mensch about this claim? He has mentioned this ratio too.
A linked list is a very useful data structure. It allows information to be inserted, removed and reordered without having to move much of the data around, as would be the case with an array. In addition, the individual entries need not be the same size.
Thank you, it is a good example, but too simple. The 6800 has 16-bit ops like CPX, LDX, ...; indeed, they alone give it an advantage.
I have made the 6502 code for the main loop.
Code: Select all
     ldy #0
     sty prev+1
     lda list
     ldx list+1
     beq done
     bne cont1
loop
     lda (prev),y
cont1
     cpx node+1
     bne cont2
     cmp node
     beq found
cont2
     sta prev
     stx prev+1
     iny
     lda (prev),y
     dey
     tax
     bne loop
It is 31 ticks (the 6800 code is 23), but your code is just a sequence of 16-bit operations. If we add some more calculations, this 6800 advantage disappears. For instance, it is not typical to seek a node by its address; usually we seek a node by the value of one of its fields. We can compare two lists, process several lists, etc. And more complex processing gives more advantages to the 6502.
IMHO sort algorithms are good for benchmarking. Sorting is a typical task and it is not very simple.
Re: 6502 vs 6800
Posted: Sat Sep 26, 2020 7:31 pm
by Chromatix
For problem sizes likely to fit into 64KB RAM, Shellsort is an appropriate algorithm. It's very simple (just a sequence of strided insertion sorts), doesn't have nasty implementation details as Quicksort does, and doesn't consume temporary memory as Mergesort does. Performance characteristics of Shellsort are mainly determined by the gap sequence, which is usually precomputed up to the largest practical problem size for the computer involved, and thus doesn't affect the code directly; for these small computers, we could take the Ciura sequence (1, 4, 10, 23, 57, 132, 301, 701) directly without extension. For all these reasons, Shellsort is used to implement qsort() in µClibc.
Sometimes the sorting task is just to put some numbers in order. Sometimes, however, there are sorting keys with attached data. And sometimes the keys to be sorted are actually strings of text. So we should first define what we are sorting. I think a good test would be 24-bit keys with a 40-bit attached record, for a total of 8 bytes in each record. A theoretical 64KB of such records would be 8192 of them, which is not too large for the Ciura sequence to apply. For the sake of sanity, let's stipulate that the implementation has to handle up to 4096 records, 32KB, stored contiguously.
I actually would like to see how that ends up comparing between the two.
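As a reference point, the "sequence of strided insertion sorts" structure with the Ciura gaps can be sketched in C; plain int keys are used here for clarity, and the 8-byte records proposed above would change only the comparison and the element copy:

```c
#include <assert.h>
#include <stddef.h>

/* Shellsort with the Ciura gap sequence (1, 4, 10, 23, 57, 132, 301,
   701), largest gap first.  Each gap pass is an insertion sort over
   elements 'gap' apart; the final pass with gap 1 is a plain
   insertion sort over a nearly ordered array. */
static const size_t ciura[] = {701, 301, 132, 57, 23, 10, 4, 1};

static void shellsort(int *a, size_t n) {
    for (size_t g = 0; g < sizeof ciura / sizeof ciura[0]; g++) {
        size_t gap = ciura[g];
        if (gap >= n) continue;             /* skip gaps too large for n */
        for (size_t i = gap; i < n; i++) {  /* strided insertion sort */
            int v = a[i];
            size_t j = i;
            for (; j >= gap && a[j - gap] > v; j -= gap)
                a[j] = a[j - gap];
            a[j] = v;
        }
    }
}
```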
Re: 6502 vs 6800
Posted: Sun Sep 27, 2020 11:29 am
by BillG
I have made the 6502 code for the main loop.
Don't you believe in comments?
I fixed a problem in your code...
Code: Select all
0200 A0 02 [2] 00008 ldy #list ; This is needed to handle deleting first node
0202 84 04 [3] 00009 sty prev
00010
0204 A0 00 [2] 00011 ldy #0
0206 84 05 [3] 00012 sty prev+1
0208 A5 02 [3] 00013 lda list
020A A6 03 [3] 00014 ldx list+1
020C F0 17 (0225) [2/3] 00015 beq done
020E D0 02 (0212) [2/3] 00016 bne cont1
00017
0210 00018 loop
0210 B1 04 [5/6] 00019 lda (prev),y
0212 00020 cont1
0212 E4 01 [3] 00021 cpx node+1
0214 D0 04 (021A) [2/3] 00022 bne cont2
00023
0216 C5 00 [3] 00024 cmp node
0218 F0 0C (0226) [2/3] 00025 beq found
021A 00026 cont2
021A 85 04 [3] 00027 sta prev
021C 86 05 [3] 00028 stx prev+1
021E C8 [2] 00029 iny
021F B1 04 [5/6] 00030 lda (prev),y
0221 88 [2] 00031 dey
0222 AA [2] 00032 tax
0223 D0 EB (0210) [2/3] 00033 bne loop
You did not include anything for found. It cannot be substantially better than the 6800 code.
Edit: corrected my misinterpretation of code.
Re: 6502 vs 6800
Posted: Sun Sep 27, 2020 11:34 am
by BillG
It is 31 ticks (the 6800 code is 23), but your code is just a sequence of 16-bit operations. If we add some more calculations, this 6800 advantage disappears. For instance, it is not typical to seek a node by its address; usually we seek a node by the value of one of its fields.
To quote the late, great Ronald Reagan, "There you go again..."
Fine. Modify the task to match the value in the Tag variable with the third byte of each node.
This change adds one cycle to the inner loop of the 6800, but IMHO hurts the 6502 by at least 10 cycles. Even with three byte-wide registers, you are now one short. Too bad the 6800 cannot lend you its unused B register...
Code: Select all
00010 *
00011 * Remove a node from a linked list.
00012 *
00013 * Input:
00014 * List = the address of the first node in the list.
00015 * Tag = the tag value (third byte) of the node to remove
00016 *
00017 * Scoreboard:
00018 *
00019 * 21 cycles if the list is empty
00020 *
00021 * otherwise
00022 *
00023 * 28 cycles to start
00024 *
00025 * 24 cycles for each node before the specified one
00026 *
00027 * 20 cycles to exit if the node is not found
00028 * 38 cycles to exit if the node is found
00029 *
0100 00030 Remove
0100 CE 0001 [3] 00031 ldx #List ; This is the "previous" of the first node
0103 DF 03 [5] 00032 stx Prev
00033
0105 DE 01 [4] 00034 ldx List ; Point to the first node
0107 27 0C (0115) [4] 00035 beq Done ; End of the list?
00036
0109 96 00 [3] 00037 ldaa Tag ; Get tag to match
00038
010B 00039 Loop
010B A1 02 [5] 00040 cmpa Tag_,X ; Is this the node?
010D 27 07 (0116) [4] 00041 beq Found ; Yes
00042
010F DF 03 [5] 00043 stx Prev ; This will be the previous node
00044
0111 EE 00 [6] 00045 ldx ,X ; Point to the next node
00046
0113 26 F6 (010B) [4] 00047 bne Loop ; Repeat
00048
0115 00049 Done
0115 39 [5] 00050 rts
00051
0116 00052 Found
0116 EE 00 [6] 00053 ldx ,X ; Get the address of the next node
0118 DF 05 [5] 00054 stx Next
00055
011A DE 03 [4] 00056 ldx Prev ; Point back to the previous node
00057
011C 96 05 [3] 00058 ldaa Next ; Link it to the next node
011E A7 00 [6] 00059 staa ,X
0120 96 06 [3] 00060 ldaa Next+1
0122 A7 01 [6] 00061 staa 1,X
00062
0124 39 [5] 00063 rts
Re: 6502 vs 6800
Posted: Wed Sep 30, 2020 8:54 am
by litwr
Fine. Modify the task to match the value in the Tag variable with the third byte of each node.
The Tag can be a string.

Re: 6502 vs 6800
Posted: Mon Oct 05, 2020 6:16 pm
by litwr
For problem sizes likely to fit into 64KB RAM, Shellsort is an appropriate algorithm. It's very simple (just a sequence of strided insertion sorts), doesn't have nasty implementation details as Quicksort does, and doesn't consume temporary memory as Mergesort does. Performance characteristics of Shellsort are mainly determined by the gap sequence, which is usually precomputed up to the largest practical problem size for the computer involved, and thus doesn't affect the code directly; for these small computers, we could take the Ciura sequence (1, 4, 10, 23, 57, 132, 301, 701) directly without extension. For all these reasons, Shellsort is used to implement qsort() in µClibc.
Sometimes the sorting task is just to put some numbers in order. Sometimes, however, there are sorting keys with attached data. And sometimes the keys to be sorted are actually strings of text. So we should first define what we are sorting. I think a good test would be 24-bit keys with a 40-bit attached record, for a total of 8 bytes in each record. A theoretical 64KB of such records would be 8192 of them, which is not too large for the Ciura sequence to apply. For the sake of sanity, let's stipulate that the implementation has to handle up to 4096 records, 32KB, stored contiguously.
I actually would like to see how that ends up comparing between the two.
It is done. I have obtained the following cycle counts:
random (gcc) 16,735,556
random (gen) 16,512,696
ascending 4,371,746
descending 9,959,798
constant 4,371,746
So, generally, it takes about 16 sec. on a 6502 @ 1 MHz. For the random sequence, I used two variants:
gcc) C standard call rand()%256 for every byte;
gen) the following code:
Code: Select all
for (int i = 1; i < 8*4096; i++) mem[i] = (i >> 2 | (mem[i - 1] & 3) << 6) + i*i%43;
mem is an array of bytes.
I added one more value, 1750, to the gap sequence.
I have also found out that my quicksort is about 2.6 times faster than my shellsort on 30,000 random two-byte (uint16_t) numbers. I am almost sure that the 6800 is at least 3 times slower for the quicksort. The 6800 might use any sort algorithm; it just needs the fastest one to compete.
Code: Select all
shelltabidx2 = 18 ;it points to the 9th position in the gap sequence
addrhi = 10 ;zp locations for parameters
addrlo = 11
data = $400 ;start of the array
sz = 4096 ;size of the array in elements
ESZ = 8 ;size of element
sz2 = sz*ESZ ;size of the array in bytes
* = $200
lda #<data
sta addrlo
lda #>data
sta addrhi
ldx #<sz2
ldy #>sz2
lda #shelltabidx2
jsr shellsort
nop ;stop here
.include "shell-contest.s"
The source of my shellsort routine is attached; the routine is 205 bytes and also uses 10 zero-page bytes. What about the 6800 shellsort speed?

Re: 6502 vs 6800
Posted: Mon Oct 05, 2020 8:03 pm
by Chromatix
Running an instrumented version of Shellsort on a PC, I estimate your code is using about 130 cycles for a compare, and about 200 or so cycles for a compare and swap, in both cases including all the loop overhead for moving on to the next pair of records. This is in a plausible range for 6502 code on this size of record. I haven't yet dug into your code to drill any deeper than that.
Re: 6502 vs 6800
Posted: Tue Oct 06, 2020 7:43 pm
by litwr
I have made some code improvements which make the code 10-40% faster, 6 bytes shorter, and require 4 fewer zero-page bytes. For the task I obtained the following results:
random (gcc) 14,841,052
random (gen) 14,633,170
descending 8,509,600
ascending 3,219,746
constant 3,219,746
In the attached archive I have added the C source of the shellsort algorithm used - maybe it will help in writing the 6800 equivalent.
Re: 6502 vs 6800
Posted: Sun Oct 11, 2020 3:44 pm
by litwr
I have done experiments with quicksort and selection sort. Their sources are attached; their results follow below.
The following table shows the code size of the sorting routines and the amount of zero-page memory required.
Code: Select all
           size  zp
quick       271  12 (min=6)
shell       196   6
selection   118   6
The following tables show the speed of the routines in clock cycles for 4096 and 8063 elements. I use Chromatix's 8-byte data structure in all cases.
Code: Select all
4096           quick       shell        selection
random (gcc)    6,945,023  14,841,052   546,689,912
random (gen)    7,078,469  14,633,170   546,689,967
qs-slow (16B)  11,780,993  11,767,274   546,474,774
descending      3,405,405   8,509,600   592,522,281
ascending       2,833,756   3,219,746   546,384,937
constant        9,610,629   3,219,746   546,384,937

8063           quick       shell        selection
random (gcc)   14,658,383  33,545,066  2,117,270,138
random (gen)   14,979,654  32,611,660  2,117,259,303
qs-slow (16B)  23,994,141  25,221,514  2,116,756,702
descending      7,084,521  19,542,402  2,295,383,997
ascending       5,960,625   6,614,042  2,116,601,085
constant       18,379,125   6,614,042  2,116,601,085
The quicksort routine uses 16 as a stack limit. The 'qs-slow' filling usually kills quicksort with large enough data; it uses the following algorithm.
Code: Select all
mem[0] = 1;
mem[1] = 3;
mem[2] = 5;
mem[3] = 0;
mem[4] = 2;
mem[5] = 4;
for (int i = 3; i < SIZE*2; i++) {
    mem[2*i] = mem[i];
    mem[2*i + 1] = mem[2*i - 1] + 2;
    mem[i] = 2*i + 1;
}
So it is also interesting to see the 6800 variants of quicksort and selection sort.
Re: 6502 vs 6800
Posted: Mon Oct 12, 2020 12:41 am
by dmsc
Hi!
I have done experiments with quicksort and selection sort. Their sources are attached; their results follow below.
The following table shows the code size of the sorting routines and the amount of zero-page memory required.
The quicksort routine uses 16 as a stack limit. The 'qs-slow' filling usually kills quicksort with large enough data; it uses the following algorithm.
You can implement Quicksort so the stack usage is bounded at log_2(n), so a 12 entry stack should be enough. The trick is to always recurse on the smallest partition first.
See this FastBasic code, sorting 16 bit numbers:
Code: Select all
num = 4096
dim p(num)
' FILL
for i=0 to num
  p(i) = rand(10000)
next
dim stack0(12), stack1(12)
lvl = 1 ' Stack level
i0 = 0 ' Sorts from...
i1 = num ' ... to
repeat
  if i1 > i0
    ' Set pivot to 'i1'
    temp = p(i1)
    j0 = i0 - 1
    j1 = i1
    do
      repeat : inc j0 : until (p(j0)) >= (temp)
      repeat : dec j1 : if j1 = i0 then exit : until (temp) >= (p(j1))
      if j0 >= j1 then exit
      x = p(j0) : p(j0) = p(j1) : p(j1) = x
    loop
    p(i1) = p(j0) : p(j0) = temp
    ' Recurse over smaller sub-array:
    if i1 - j0 > j1 - i0
      stack0(lvl) = j0 + 1
      stack1(lvl) = i1
      i1 = j0 - 1
    else
      stack0(lvl) = i0
      stack1(lvl) = j0 - 1
      i0 = j0 + 1
    endif
    inc lvl
  else
    ' Empty partition, go down in the stack
    dec lvl
    i0 = stack0(lvl)
    i1 = stack1(lvl)
  endif
until not lvl
The FastBasic 6502 interpreter runs the above code, sorting the 4097-element array, in approx 74 mega-cycles.
Adding an insertion sort pass to sort partitions smaller than 8 elements reduces the runtime to 66 mega-cycles, 11% faster.
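The bounded-stack trick reads similarly in C. This is only a sketch of the technique, not the interpreter's actual code (it uses a Lomuto partition with a last-element pivot for brevity): always push the larger partition and loop on the smaller one, so the explicit stack never grows beyond about log2(n) entries.

```c
#include <assert.h>
#include <stddef.h>

/* Iterative quicksort with a bounded explicit stack: after each
   partition, the larger side is pushed and the smaller side is
   processed immediately, so the stack depth stays at or below
   log2(n).  32 entries is ample for any size_t range. */
static void quicksort(int *a, size_t n) {
    if (n < 2) return;
    size_t lo_stk[32], hi_stk[32];
    int top = 0;
    lo_stk[0] = 0; hi_stk[0] = n - 1;
    while (top >= 0) {
        size_t lo = lo_stk[top], hi = hi_stk[top];
        top--;
        while (lo < hi) {
            int pivot = a[hi];               /* last element as pivot */
            size_t i = lo;
            for (size_t j = lo; j < hi; j++) /* Lomuto partition pass */
                if (a[j] < pivot) { int t = a[i]; a[i] = a[j]; a[j] = t; i++; }
            int t = a[i]; a[i] = a[hi]; a[hi] = t;   /* place the pivot */
            /* push the larger side, continue with the smaller one */
            if (i - lo < hi - i) {
                if (i + 1 < hi) { top++; lo_stk[top] = i + 1; hi_stk[top] = hi; }
                if (i == lo) break;          /* left side is empty */
                hi = i - 1;
            } else {
                if (i > lo) { top++; lo_stk[top] = lo; hi_stk[top] = i - 1; }
                lo = i + 1;
            }
        }
    }
}
```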
Re: 6502 vs 6800
Posted: Mon Oct 12, 2020 1:18 am
by Chromatix
Indeed, that is one of the nasty little implementation details that plagues Quicksort, as I mentioned earlier. It's very easy to write a Quicksort that works well on shuffled unique (or nearly unique) data. It's significantly harder to write one that is not trivially susceptible to pathological behaviour on inputs that are easy to construct (and may even be encountered by accident). Shellsort handles such cases far more naturally in a small code size, which is why I consider it more appropriate for small machines.
The four basic features of a competent Quicksort are:
1: Pivot is chosen well - ideally the median of a modest random sample - to avoid quadratic behaviour on ascending, descending, "organ pipes", or median-3-killer lists. On a small machine it can be hard to generate good random numbers, but Rand48 is better than nothing.
2: Ternary comparisons against the pivot, to efficiently handle runs of equal values. This results in three partitions, but you only need to recurse into two of them. A single-valued array goes from a pathological worst case to an optimal best case when you do this.
3: Smallest open partition is handled first, to minimise stack/queue depth. It's sensible to use a priority queue for this, as you effectively do in the above, and this avoids having to depend on tail recursion optimisations since you can simply iterate on the queue.
4: Small partitions are handled with insertion sort or similar, since Quicksort does not handle them efficiently. On modern hardware, I would use Splaysort from 256 elements down; this can be implemented with a fixed 1KB auxiliary buffer, with a little imagination, and is exceptionally efficient. But even this is too much when you only have 64KB total RAM. Insertion sort is efficient from about 8 elements down.
After taking care of all the above points, the code complexity starts making Shellsort look very attractive for a small machine, even though it's a bit slower. Most of the descriptions of Quicksort that you can easily find do not make the above points as clear as they should be.
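Point 2, the ternary comparison pass, is essentially the Dutch-national-flag partition; a minimal sketch:

```c
#include <assert.h>
#include <stddef.h>

/* One ternary (three-way) partitioning pass: afterwards the array is
   [ < pivot | == pivot | > pivot ], and only the two outer regions
   need further sorting.  Reports the bounds of the == region. */
static void partition3(int *a, size_t n, int pivot,
                       size_t *eq_lo, size_t *eq_hi) {
    size_t lo = 0, mid = 0, hi = n;
    while (mid < hi) {
        if (a[mid] < pivot) {                /* grow the < region */
            int t = a[lo]; a[lo] = a[mid]; a[mid] = t;
            lo++; mid++;
        } else if (a[mid] > pivot) {         /* grow the > region */
            hi--;
            int t = a[mid]; a[mid] = a[hi]; a[hi] = t;
        } else {                             /* equal: leave in place */
            mid++;
        }
    }
    *eq_lo = lo;   /* first index equal to pivot */
    *eq_hi = hi;   /* one past the last index equal to pivot */
}
```

With a single-valued array the whole input lands in the == region after one pass, so there is nothing left to recurse into - which is exactly the pathological-to-optimal flip described above.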
In a sorting visualisation framework that I've been playing with, I have a working septenary quicksort, using ternary comparisons with three pivots at a time, and delegating to Splaysort for small partitions. The three pivots are chosen as quartiles of a 7-element random sample, which ensures only a 25% chance that the largest open partition is more than half the original. It uses effectively O(1) memory and makes Timsort start to look slow, which is really saying something. And I have a version that does stable sorting as well, which was the original motivation for the septenary structure. But I wouldn't try to squeeze all that into a 6502.
Re: 6502 vs 6800
Posted: Mon Oct 12, 2020 2:27 am
by dmsc
Hi!
Indeed, that is one of the nasty little implementation details that plagues Quicksort, as I mentioned earlier. It's very easy to write a Quicksort that works well on shuffled unique (or nearly unique) data. It's significantly harder to write one that is not trivially susceptible to pathological behaviour on inputs that are easy to construct (and may even be encountered by accident). Shellsort handles such cases far more naturally in a small code size, which is why I consider it more appropriate for small machines.
I agree with you, but have some comments on Quicksort for small machines:
The four basic features of a competent Quicksort are:
1: Pivot is chosen well - ideally the median of a modest random sample - to avoid quadratic behaviour on ascending, descending, "organ pipes", or median-3-killer lists. On a small machine it can be hard to generate good random numbers, but Rand48 is better than nothing.
2: Ternary comparisons against the pivot, to efficiently handle runs of equal values. This results in three partitions, but you only need to recurse into two of them. A single-valued array goes from a pathological worst case to an optimal best case when you do this.
I tried implementing ternary partition qsort in FastBasic, and found it to be always slower than standard two-way partitioning. I suspect this is because the problem sizes are not really that big (I tried up to 8k elements) and the insertion sort pass at the small partition size is unaffected.
And about pivot selection, yes, selecting a good pivot is essential. Instead of one random element you can do best of three, see the complete code:
Code: Select all
' QuickSort with median of three random values as pivot
dim stack0(12), stack1(12)
lvl = 1
i0 = 0
i1 = num
repeat
  ' Only recurse on partitions with more than 8 elements
  if i1 > i0 + 8
    ' Set pivot to median of 3 random values
    p0 = rand(i1 - i0) + i0
    p1 = rand(i1 - i0) + i0
    p2 = rand(i1 - i0) + i0
    t0 = p(p0)
    t1 = p(p1)
    t2 = p(p2)
    if (t0 > t1) = (t0 < t2)
      p(p0) = p(i1)
      temp = t0
    elif (t1 > t0) = (t1 < t2)
      p(p1) = p(i1)
      temp = t1
    else
      p(p2) = p(i1)
      temp = t2
    endif
    p(i1) = temp
    j0 = i0 - 1
    j1 = i1
    do
      repeat : inc j0 : until (p(j0)) >= (temp)
      repeat : dec j1 : if j1 = i0 then exit : until (temp) >= (p(j1))
      if j0 >= j1 then exit
      x = p(j0) : p(j0) = p(j1) : p(j1) = x
    loop
    p(i1) = p(j0) : p(j0) = temp
    ' Recurse over smaller sub-array:
    if i1 - j0 > j1 - i0
      stack0(lvl) = j0 + 1
      stack1(lvl) = i1
      i1 = j0 - 1
    else
      stack0(lvl) = i0
      stack1(lvl) = j0 - 1
      i0 = j0 + 1
    endif
    inc lvl
  else
    for i=i0 to i1-1
      temp = p(i+1)
      for j=i to i0 step -1
        if (p(j)) <= (temp) then exit
        p(j+1) = p(j)
      next
      p(j+1) = temp
    next
    dec lvl
    i0 = stack0(lvl)
    i1 = stack1(lvl)
  endif
until not lvl
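The (t0 > t1) = (t0 < t2) test above picks the element that is neither the minimum nor the maximum of the three; in C the same median-of-three selection might read:

```c
#include <assert.h>

/* Median of three values without sorting them: the median is the one
   element that is neither the maximum nor the minimum, which is what
   the (a > b) == (a < c) comparison detects (true when 'a' lies
   between 'b' and 'c' in either direction). */
static int median3(int a, int b, int c) {
    if ((a > b) == (a < c)) return a;
    if ((b > a) == (b < c)) return b;
    return c;
}
```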
3: Smallest open partition is handled first, to minimise stack/queue depth. It's sensible to use a priority queue for this, as you effectively do in the above, and this avoids having to depend on tail recursion optimisations since you can simply iterate on the queue.
I don't know why this method is not shown in more places, as for me it is the best way to implement quicksort; the recursive quicksort tends to be worse than the iterative one, and it is more difficult to get right.
After taking care of all the above points, the code complexity starts making Shellsort look very attractive for a small machine, even though it's a bit slower.
I agree, and I really like the Shellsort algorithm; it is really simple and quite fast, normally faster than Heapsort in most implementations.
In a sorting visualisation framework that I've been playing with, I have a working septenary quicksort, using ternary comparisons with three pivots at a time, and delegating to Splaysort for small partitions. The three pivots are chosen as quartiles of a 7-element random sample, which ensures only a 25% chance that the largest open partition is more than half the original. It uses effectively O(1) memory and makes Timsort start to look slow, which is really saying something. And I have a version that does stable sorting as well, which was the original motivation for the septenary structure. But I wouldn't try to squeeze all that into a 6502.
Yeah, that sounds like very tricky code to get right and fast - and I suspect that the speedup is only noticeable for very big inputs.
Have Fun!
Re: 6502 vs 6800
Posted: Mon Oct 12, 2020 3:12 pm
by Chromatix
I tried implementing ternary partition qsort in FastBasic, and found it to be always slower than standard two-way partitioning. I suspect this is because the problem sizes are not really that big (I tried up to 8k elements) and the insertion sort pass at the small partition size is unaffected.
If it is always slower, that implies also for the single-valued array which it is supposed to help massively on, which rather suggests you got it wrong somehow. But it is expected to be a little slower on normal data, since more elements have to be moved and a more complicated comparison has to be performed.
Closely related to the ternary quicksort is the dual-pivot quicksort, which can use the same partitioning algorithm as ternary since it also produces three partitions, but must recurse into all three of them. Here the extra cost of the three-way partitioning is offset by the smaller partition size at each step, and thus the reduced depth of recursion. I have a version which compares the two pivots to each other after selecting them, and performs a ternary partition ( <p | =p | >p ) if they are equal - which implies that there's a big run of equal values for which the ternary scheme works best - and a dual-pivot partition ( <=p | >p && <q | >=q ) otherwise. The latter construction guarantees that the outer partitions have at least one element each, ie. the pivots themselves.
Re: 6502 vs 6800
Posted: Mon Oct 12, 2020 4:07 pm
by BigEd
(Interesting article on the practical and modern Timsort here.)