6502 vs 6800
Re: 6502 vs 6800
I heard that Timsort was considered for implementing sort in Go. It was rejected because of the O(n) auxiliary space requirement. So even on modern computers, in-place algorithms remain valuable.
Re: 6502 vs 6800
Ah, maybe the space usage rules it out.
Re: 6502 vs 6800
Hi!
Chromatix wrote:
Quote:
I tried implementing ternary-partition qsort in FastBasic, and found it to be always slower than the standard two-way partitioning. I suspect this is because the problem sizes are not really that big (I tried up to 8k elements) and the insertion-sort pass at the small partition size is not affected.
If it were always slower, that would also hold for the single-valued array, which it is supposed to help with massively - that rather suggests you got something wrong somewhere. But it is expected to be a little slower on normal data, since more elements have to be moved and a more complicated comparison has to be performed.
These are the measurements of my algorithm, sorting 4097 integers:
Random data:
- 3-way qsort: 105.5 mega-cycles
- 2-way qsort: 63.6 mega-cycles
Only 2 values:
- 3-way qsort: 9.1 mega-cycles
- 2-way qsort: 65.0 mega-cycles
Only 1 value:
- 3-way qsort: 5.4 mega-cycles
- 2-way qsort: 65.9 mega-cycles
Only 150 values:
- 3-way qsort: 59.3 mega-cycles
- 2-way qsort: 59.0 mega-cycles
So, no: the 3-way version is certainly faster if there are few distinct values in the array, but it is not true that the standard 2-way quicksort is pathological in that case - it is only slightly slower.
This is the code for my 3-way partitioning:
Have Fun!
Code: Select all
' QuickSort - using 3-way partitioning
num = 4096
dim p(num)

' FILL VALUES
for i=0 to num
  p(i) = 123 + rand(150)
  x = x + p(i)
next

' QuickSort
dim stack0(12), stack1(12)
lvl = 1
i0 = 0
i1 = num
repeat
  if i1 > i0 + 12
    ' Set pivot to median of 3 random values
    p0 = rand(i1 - i0) + i0
    p1 = rand(i1 - i0) + i0
    p2 = rand(i1 - i0) + i0
    t0 = p(p0)
    t1 = p(p1)
    t2 = p(p2)
    if (t0 > t1) = (t0 < t2)
      p(p0) = p(i0)
      temp = t0
    elif (t1 > t0) = (t1 < t2)
      p(p1) = p(i0)
      temp = t1
    else
      p(p2) = p(i0)
      temp = t2
    endif
    p(i0) = temp
    ' 3-way partition: "<" left, "=" middle, ">" right
    j0 = i0 - 1
    jm = i0 + 1
    j1 = i1 + 1
    while jm < j1
      v = p(jm)
      if (v) < (temp)
        inc j0
        p(jm) = temp
        p(j0) = v
        inc jm
      elif (v) = (temp)
        inc jm
      else
        dec j1
        p(jm) = p(j1)
        p(j1) = v
      endif
    wend
    ' Push the bigger partition, continue with the smaller one
    if i1 - j1 > j0 - i0
      stack0(lvl) = j1
      stack1(lvl) = i1
      i1 = j0
    else
      stack0(lvl) = i0
      stack1(lvl) = j0
      i0 = j1
    endif
    inc lvl
  else
    ' Small partition, do insertion sort:
    for i=i0 to i1-1
      temp = p(i+1)
      for j=i to i0 step -1
        if (p(j)) <= (temp) then exit
        p(j+1) = p(j)
      next
      p(j+1) = temp
    next
    dec lvl
    i0 = stack0(lvl)
    i1 = stack1(lvl)
  endif
until not lvl
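For readers without a FastBasic setup, the same 3-way (Dutch national flag) partitioning idea can be sketched in Python. This is an illustrative rewrite, not a line-by-line translation of the code above; for brevity it picks a single random pivot rather than a median of three:

```python
import random

def quicksort3(a, lo=0, hi=None):
    """Quicksort with 3-way partitioning: elements equal to the pivot
    are gathered in the middle and never recursed into, which is why
    arrays with few distinct values sort so quickly."""
    if hi is None:
        hi = len(a) - 1
    if lo >= hi:
        return
    pivot = a[random.randint(lo, hi)]
    lt, i, gt = lo, lo, hi            # a[lo:lt] < pivot, a[gt+1:hi+1] > pivot
    while i <= gt:
        if a[i] < pivot:
            a[lt], a[i] = a[i], a[lt]
            lt += 1
            i += 1
        elif a[i] > pivot:
            a[i], a[gt] = a[gt], a[i]
            gt -= 1
        else:
            i += 1                    # equal to pivot: stays in the middle
    quicksort3(a, lo, lt - 1)         # only the "<" and ">" regions recurse
    quicksort3(a, gt + 1, hi)
```

On an array with only one distinct value, the first partition puts everything in the middle region and the sort finishes in a single pass, matching the measurements above.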
Re: 6502 vs 6800
dmsc wrote:
You can implement Quicksort so the stack usage is bounded at log_2(n), so a 12 entry stack should be enough. The trick is to always recurse on the smallest partition first.
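The smallest-partition-first trick can be sketched like this (my own Python illustration, not code from the thread): push the larger side onto an explicit stack and keep iterating on the smaller side, so each stacked entry came from a range at most half the size of the one below it, bounding the stack at about log2(n) entries.

```python
def partition(a, lo, hi):
    """Plain Lomuto partition around a[hi]; returns the pivot's final index."""
    pivot = a[hi]
    i = lo
    for j in range(lo, hi):
        if a[j] < pivot:
            a[i], a[j] = a[j], a[i]
            i += 1
    a[i], a[hi] = a[hi], a[i]
    return i

def quicksort_bounded(a):
    """Iterative quicksort; returns the peak explicit-stack depth.
    The larger partition is pushed for later and the loop continues
    with the smaller one, so the peak stays around log2(n)."""
    stack = [(0, len(a) - 1)]
    peak = 0
    while stack:
        peak = max(peak, len(stack))
        lo, hi = stack.pop()
        while lo < hi:
            p = partition(a, lo, hi)
            if p - lo > hi - p:              # left side is larger:
                stack.append((lo, p - 1))    # save it for later,
                lo = p + 1                   # continue with the right side
            else:
                stack.append((p + 1, hi))
                hi = p - 1
    return peak
```

For 4097 elements that is at most 13 concurrent entries, which is why a 12-entry stack plus the working range suffices in the FastBasic code.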
dmsc wrote:
See this FastBasic code, sorting 16 bit numbers:
Code: Select all
num = 4096
dim p(num)
...
endif
until not lvl
The FastBasic 6502 interpreter runs the above code sorting the 4097 element array in approx 74 mega-cycles.
Code: Select all
num = 4096
DIM p(num)
' FILL
FOR i = 0 TO num: p(i) = INT(10000 * RND(1)): NEXT

DIM stack0(12), stack1(12)
lvl = 1  ' Stack level
i0 = 0   ' Sorts from...
i1 = num ' ... to
DO
  IF i1 > i0 THEN
    ' Set pivot to 'i1'
    temp = p(i1)
    j0 = i0 - 1
    j1 = i1
    DO
      DO: j0 = j0 + 1: LOOP UNTIL p(j0) >= temp
      DO: j1 = j1 - 1: IF j1 = i0 THEN EXIT DO
      LOOP UNTIL temp >= p(j1)
      IF j0 >= j1 THEN EXIT DO
      x = p(j0): p(j0) = p(j1): p(j1) = x
    LOOP
    p(i1) = p(j0): p(j0) = temp
    ' Recurse over smaller sub-array:
    IF i1 - j0 > j1 - i0 THEN
      stack0(lvl) = j0 + 1
      stack1(lvl) = i1
      i1 = j0 - 1
    ELSE
      stack0(lvl) = i0
      stack1(lvl) = j0 - 1
      i0 = j0 + 1
    END IF
    lvl = lvl + 1
  ELSE
    ' Empty partition, go down in the stack
    lvl = lvl - 1
    i0 = stack0(lvl)
    i1 = stack1(lvl)
  END IF
LOOP UNTIL NOT lvl
FOR i = 0 TO 10: PRINT p(i); : NEXT: PRINT
dmsc wrote:
Adding an insertion sort pass to sort partitions smaller than 8 elements reduces the runtime to 66 mega-cycles, 11% faster.
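The cutoff idea can be sketched as follows (a hypothetical Python illustration, not dmsc's code): quicksort simply ignores partitions below the cutoff, and one final insertion-sort pass finishes cheaply because every element is already near its final position. The middle element serves as pivot here just to keep the sketch short.

```python
CUTOFF = 8

def insertion_sort(a):
    """Finishing pass: near-linear when elements are nearly in place."""
    for i in range(1, len(a)):
        v = a[i]
        j = i - 1
        while j >= 0 and a[j] > v:
            a[j + 1] = a[j]
            j -= 1
        a[j + 1] = v

def qsort_partial(a, lo, hi):
    """Quicksort that leaves partitions smaller than CUTOFF unsorted."""
    if hi - lo < CUTOFF:
        return
    mid = (lo + hi) // 2          # middle element as pivot (for the sketch)
    a[mid], a[hi] = a[hi], a[mid]
    pivot = a[hi]
    i = lo
    for j in range(lo, hi):       # Lomuto partition
        if a[j] < pivot:
            a[i], a[j] = a[j], a[i]
            i += 1
    a[i], a[hi] = a[hi], a[i]
    qsort_partial(a, lo, i - 1)
    qsort_partial(a, i + 1, hi)

def quicksort_with_cutoff(a):
    qsort_partial(a, 0, len(a) - 1)
    insertion_sort(a)             # one pass cleans up all small partitions
```

The win comes from skipping quicksort's bookkeeping on ranges where insertion sort's simple inner loop is already faster.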
Re: 6502 vs 6800
dmsc wrote:
I agree, and I really like the Shellsort algorithm: it is really simple and quite fast, normally faster than Heapsort in most implementations.
BigEd wrote:
(Interesting article on the practical and modern Timsort here.)
IMHO Timsort is overrated; it is just a current fashion. It requires a lot of additional memory and is much slower than other methods - slower than Shellsort in many important cases.
A year ago I did some small research on sorting methods; the results are in a dynamic table - https://litwr2.github.io/ros/table1.html
dmsc wrote:
And about pivot selection, yes, selecting a good pivot is essential. Instead of one random element you can do best of three, see the complete code:
Re: 6502 vs 6800
A randomly sampled pivot avoids worst-case behaviour pretty reliably, so long as the data is constructed without knowledge of the random sampling sequence (which is assured if you use true entropy to seed a decent PRNG after generating the data). Formally, the worst-case O(n^2) behaviour is still possible but becomes vanishingly improbable for any problem size where it actually matters, so the average-case O(n log n) behaviour is what you will get. That's true for even a single sample per pivot, and the average quality of the pivot improves with the size of the sample.
With only 64KB RAM and decidedly primitive CPUs, we're working with small enough problem sizes that even going quadratic is not completely catastrophic, though it will definitely hurt. More importantly, we probably don't have an adversary constructing our data with the intention of burning CPU time. So we can relax slightly and use a sloppily seeded PRNG for sampling the pivot.
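For instance, a 16-bit xorshift generator of the kind popular on 8-bit CPUs (the 7/9/8 shift triple below is a well-known choice for 16-bit state; the Python is my sketch) is plenty for pivot sampling, since the only real requirement is that it be uncorrelated with the input data:

```python
class XorShift16:
    """16-bit xorshift PRNG - cheap enough for an 8-bit CPU.
    Statistical quality is modest, but for pivot sampling it only
    has to be independent of the input data."""
    def __init__(self, seed=1):
        self.state = (seed & 0xFFFF) or 1   # state must never be zero

    def next(self):
        x = self.state
        x ^= (x << 7) & 0xFFFF
        x ^= x >> 9
        x ^= (x << 8) & 0xFFFF
        self.state = x
        return x

def pick_pivot_index(rng, lo, hi):
    """Pivot index in [lo, hi]; the modulo reduction is slightly
    biased, which is harmless for this purpose."""
    return lo + rng.next() % (hi - lo + 1)
```

Seeding it from, say, a frame counter or keyboard timing is "sloppy" in the cryptographic sense but entirely adequate when no adversary is crafting the input.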
Re: 6502 vs 6800
Chromatix wrote:
A randomly sampled pivot avoids worst-case behaviour pretty reliably, so long as the data is constructed without knowledge of the random sampling sequence (which is assured if you use true entropy to seed a decent PRNG after generating the data). Formally, the worst-case O(n^2) behaviour is still possible but becomes vanishingly improbable for any problem size where it actually matters, so the average-case O(n log n) behaviour is what you will get. That's true for even a single sample per pivot, and the average quality of the pivot improves with the size of the sample.
Re: 6502 vs 6800
It seems to me moderately likely that in the real world the input data might be partially sorted, or even reverse-sorted. So the question of what the probable behaviour of an algorithm is should be weighted by the probable distribution of the inputs. For sure, random or nearly so is an important case: but I think it's not the only case, in practice.
Whether the worst case is important or not is another question. Is it incredibly expensive, or just a bit slower? Is it vanishingly improbable, or just uncommon? Will the system fail in its mission if the worst case is encountered, fall below a service level agreement, or merely be a bit slow?
Re: 6502 vs 6800
I assume you mean the data generator in fill_for_quadratic_qsort_hoare()? I just ported that over to my own system, and it's easily handled by my own quicksorts (and indeed by the standard ones, some of which choke much harder on a mere descending list). Though std::sort() did resort to heapsort as you say, it looks like it's just doing median-of-three, which is not sophisticated pivot selection by my standards. Median-of-three is well known to be susceptible to adversarially constructed data; its only redeeming qualities are that it is cheap and deterministic - the same qualities that inherently make it vulnerable.
I've heard that system library authors don't really like to use RNGs in library functions that are not explicitly related to random number generation. I'm reasonably sure the design of std::sort() is an example of that.
Applying my own (relatively old) Introsort implementation to this data yields quite decent results. I use median-of-three by default in that, but dynamically switch to a linear-time median-of-medians approximation if the partition size does not converge as quickly as expected for the recursion depth. There is no recourse to heapsort there, just quicksort with an actually sophisticated (but still deterministic) pivot selection. Within a recursion depth of about 4, the killer pattern is broken up enough that median-of-three works well again.
My newer quicksorts - the ones that use a priority queue, multiple pivots and random pivot sampling - have absolutely no problem with it. With random sampling for pivot selection, any data looks like it is randomly shuffled, which is quicksort's favoured case.
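For readers unfamiliar with the technique, a median-of-medians pivot selector can be sketched as follows (my illustration - the actual selector referred to above is not shown in this thread, and the textbook linear-time version recurses rather than iterating). Repeatedly taking medians of groups of five yields a pivot that can never sit at the very extremes of the range, so the partition around it is never totally degenerate:

```python
def median_of_medians(a, lo, hi):
    """Approximate median of a[lo..hi]: repeatedly replace the working
    list by the medians of its groups of five. Each surviving value has
    elements above and below it in its group, so the final pivot is
    guaranteed to be away from the extremes of the range."""
    vals = a[lo:hi + 1]
    while len(vals) > 5:
        groups = [vals[i:i + 5] for i in range(0, len(vals), 5)]
        vals = [sorted(g)[len(g) // 2] for g in groups]
    return sorted(vals)[len(vals) // 2]
```

Sampling that much data costs more per pivot than median-of-three, which is why it makes sense as a fallback rather than the default.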
Re: 6502 vs 6800
Just briefly on sorting... One thing to check is the number of comparisons needed - and what it is we're comparing. I found recently that a shellsort was faster than quicksort for sorting strings because for that particular data set I was working with (in my '816 BCPL system), the string compare was relatively slow, so it's always going to be hard to pick a "best" algorithm for everything. I don't think there is a "one size fits all" here.
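That trade-off is easy to measure directly: instrument the comparison and count how many times each algorithm calls it. A rough Python harness (mine, purely illustrative) might look like:

```python
import functools

def count_comparisons(sort_fn, data):
    """Sort the data with an instrumented comparison and return
    (sorted_correctly, number_of_comparisons). When each comparison
    is expensive - e.g. string compares on a small machine - the
    algorithm making the fewest comparisons can win regardless of
    its big-O reputation."""
    counter = {'n': 0}

    @functools.total_ordering
    class Keyed:
        def __init__(self, v):
            self.v = v
        def __lt__(self, other):
            counter['n'] += 1
            return self.v < other.v
        def __eq__(self, other):
            counter['n'] += 1
            return self.v == other.v

    wrapped = [Keyed(v) for v in data]
    sort_fn(wrapped)
    ok = all(wrapped[k].v <= wrapped[k + 1].v for k in range(len(wrapped) - 1))
    return ok, counter['n']
```

Running this over the same data set with a Shellsort and a quicksort makes the comparison budget of each algorithm concrete.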
-Gordon
--
Gordon Henderson.
See my Ruby 6502 and 65816 SBC projects here: https://projects.drogon.net/ruby/
Re: 6502 vs 6800
Correct, different sorting algorithms have strengths in different situations. Often the main weakness in one algorithm is a strength of another.
Re: 6502 vs 6800
Hi!
This trick helps to save stack but it can also make sorting much slower.
Try to use 'qs-slow' filling...
The order of recursion does not change the speed, as the work done is the same.
It is sad that FastBasic is available only for the Atari. I missed out on the Atari world.
It would be good to have FastBasic for the 6800 for the present contest.
Porting FastBasic to other 6502 computers should not be hard - it is open source and I certainly would accept contributions!
Porting to other processors would be a lot of work, as you would need to rewrite all the interpreter and the parser machine, but at least the parser bytecode can be used as is.
That line should be "LOOP UNTIL lvl = 0", as QBasic defines NOT as a bitwise operator (NOT A = 65535 - A), so you must always compare with 0. FastBasic is modeled on Atari BASIC, with the "C-like" convention that treats integers as booleans and AND/NOT/OR as boolean operators.
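The point about NOT is easy to demonstrate: as a bitwise complement, NOT A equals -1 - A in two's complement (65535 - A viewed as unsigned 16-bit), and it only yields 0 ("false") when A is all ones. A quick Python check of that arithmetic:

```python
def qbasic_not(a):
    """QBasic's NOT applied to a 16-bit value: bitwise complement."""
    return ~a & 0xFFFF

# NOT lvl is non-zero ("true") for every lvl except 65535, so
# "LOOP UNTIL NOT lvl" would exit after the first pass no matter
# what lvl holds; the loop must test "lvl = 0" explicitly.
```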
This is my working QBasic code, with timers to measure the speed:
Have Fun!
litwr wrote:
dmsc wrote:
You can implement Quicksort so the stack usage is bounded at log_2(n), so a 12 entry stack should be enough. The trick is to always recurse on the smallest partition first.
Quote:
dmsc wrote:
See this FastBasic code, sorting 16 bit numbers:
Code: Select all
num = 4096
dim p(num)
...
endif
until not lvl
The FastBasic 6502 interpreter runs the above code sorting the 4097 element array in approx 74 mega-cycles.
Quote:
So I have converted the code to Microsoft QBasic and run it under Dosbox. I have got
Code: Select all
LOOP UNTIL NOT lvl
Code: Select all
' QuickSort - using 2-way partitioning
' 2-way partitioning is faster if there are not a lot of repeated elements
DEFINT A-Z
num = 4096
DIM SHARED p(num + 1)

' FILL
x = 0 ' X is a "check-sum" of the values, to verify that they are all the same
S! = TIMER
FOR i = 0 TO num
  p(i) = INT(10000 * RND)
  x = (x + p(i)) AND 16383
NEXT
S! = TIMER - S!
PRINT "Fill: "; S!; "sec", "check: "; x

' SORT
S! = TIMER
DIM stack0(12), stack1(12)
lvl = 1
i0 = 0
i1 = num
DO
  IF i1 > i0 + 8 THEN
    ' Set pivot to 'i1'
    temp = p(i1)
    j0 = i0 - 1
    j1 = i1
    DO
      DO: j0 = j0 + 1: LOOP UNTIL (p(j0)) >= (temp)
      DO: j1 = j1 - 1: IF j1 = i0 THEN EXIT DO
      LOOP UNTIL (temp) >= (p(j1))
      IF j0 >= j1 THEN EXIT DO
      x = p(j0): p(j0) = p(j1): p(j1) = x
    LOOP
    p(i1) = p(j0): p(j0) = temp
    ' Recurse over smaller sub-array:
    IF i1 - j0 > j1 - i0 THEN
      stack0(lvl) = j0 + 1
      stack1(lvl) = i1
      i1 = j0 - 1
    ELSE
      stack0(lvl) = i0
      stack1(lvl) = j0 - 1
      i0 = j0 + 1
    END IF
    lvl = lvl + 1
  ELSE
    ' Small partition, do insertion sort:
    FOR i = i0 TO i1 - 1
      temp = p(i + 1)
      FOR j = i TO i0 STEP -1
        IF (p(j)) <= (temp) THEN EXIT FOR
        p(j + 1) = p(j)
      NEXT
      p(j + 1) = temp
    NEXT
    lvl = lvl - 1
    i0 = stack0(lvl)
    i1 = stack1(lvl)
  END IF
LOOP UNTIL lvl = 0
S! = TIMER - S!
PRINT "Sort: "; S!; "sec",
' CHECK
x = p(0)
FOR i = 0 TO num - 1
  x = (x + p(i + 1)) AND 16383
  IF (p(i)) > (p(i + 1)) THEN
    PRINT "Error at "; i; ": "; (p(i)); " > "; (p(i + 1))
    EXIT FOR
  END IF
NEXT
PRINT "check: "; x
Re: 6502 vs 6800
Hi!
The "trick" of selecting the pivot as the median of three random values does in fact reduce the probability of quadratic behavior - on big inputs it is so unlikely that it is more probable that your computer simply breaks down than that the sorting time exceeds double the mean.
The reason QuickSort is avoided on uncontrolled inputs is that if you don't use a secure RNG, an attacker can give you a crafted input, and using a secure RNG can be tricky.
Have Fun!
litwr wrote:
I can only repeat that tricks around pivot selection do not reduce the probability of the worst cases. So the standard industrial C++ sort (Introsort) just switches to Heapsort if there are too many recursive calls. BTW 'qs-slow' kills standard C-qsort (gcc) which uses sophisticated ways to get the pivot and reduce the stack load.
Re: 6502 vs 6800
Chromatix wrote:
I assume you mean the data generator in fill_for_quadratic_qsort_hoare()? ...
I agree with you that switching to Heapsort seems a bit too complicated. We also know that Shellsort on numbers is noticeably faster than Heapsort on modern cached hardware. However, pivot manipulations do not guard against a pattern that can kill the sorting. IMHO my 6502 quicksort guarantees N*log(N) speed in every case.
It would be interesting to check your quicksorts and especially their 6502 implementations.
drogon wrote:
Just briefly on sorting... One thing to check is the number of comparisons needed - and what it is we're comparing. I found recently that a shellsort was faster than quicksort for sorting strings because for that particular data set I was working with (in my '816 BCPL system), the string compare was relatively slow, so it's always going to be hard to pick a "best" algorithm for everything. I don't think there is a "one size fits all" here.
dmsc wrote:
The order of recursion does not change the speed, as the work done is the same.
dmsc wrote:
Porting FastBasic to other 6502 computers should not be hard - it is open source and I certainly would accept contributions!
Porting to other processors would be a lot of work, as you would need to rewrite all the interpreter and the parser machine, but at least the parser bytecode can be used as is.
Indeed it is not easy to make a port for a different processor, but maybe the 6800 people will help?
dmsc wrote:
That line should be "LOOP UNTIL lvl = 0", as QBasic defines NOT as a binary operator (NOT A = 65535 - A), so you must always compare with 0. FastBasic is modeled on Atari BASIC, with the "C like" convention that uses the boolean as a type and AND/NOT/OR as boolean operators.
dmsc wrote:
This is my working QBasic code, with timers to measure the speed:
Code: Select all
' QuickSort - using 2-way partitioning
' 2-way partition is faster if there is not a lot of repeated elements
DEFINT A-Z
...
What do you think about dual-pivot quicksort? https://github.com/litwr2/research-of-s ... ick-dp.cpp IMHO it is one of the best; it was made the standard sort in Java. However it seems that it is possible to design a bad sequence for it too.
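For reference, the core of a dual-pivot quicksort in the style of Yaroslavskiy (the variant adopted for Java's Arrays.sort on primitives) can be sketched in Python; this is my own simplified illustration, not the code from the linked repository:

```python
def dual_pivot_quicksort(a, lo=0, hi=None):
    """Dual-pivot quicksort: two pivots p1 <= p2 split the range into
    three parts (< p1, between p1 and p2, > p2), each sorted recursively."""
    if hi is None:
        hi = len(a) - 1
    if lo >= hi:
        return
    if a[lo] > a[hi]:
        a[lo], a[hi] = a[hi], a[lo]
    p1, p2 = a[lo], a[hi]             # pivots taken from the ends
    lt, gt, i = lo + 1, hi - 1, lo + 1
    while i <= gt:
        if a[i] < p1:
            a[i], a[lt] = a[lt], a[i]
            lt += 1
            i += 1
        elif a[i] > p2:
            a[i], a[gt] = a[gt], a[i]
            gt -= 1                   # swapped-in element is unexamined
        else:
            i += 1                    # between the pivots
    lt -= 1
    gt += 1
    a[lo], a[lt] = a[lt], a[lo]       # move pivots to their boundaries
    a[hi], a[gt] = a[gt], a[hi]
    dual_pivot_quicksort(a, lo, lt - 1)
    dual_pivot_quicksort(a, lt + 1, gt - 1)
    dual_pivot_quicksort(a, gt + 1, hi)
```

Taking the pivots from the two ends keeps the sketch short; like plain quicksort, it inherits their weaknesses on crafted input.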
BTW this discussion has forced me to start a new small github project - https://github.com/litwr2/6502-sorting
dmsc wrote:
The "trick" of selecting the pivot from the median of three random values does in fact reduce the probability of quadratic behavior - in fact it is so unlikely on big inputs that it is much more probable that your computer just breaks than the sorting time is more than double of the mean.
The reason QuickSort is avoided on uncontrolled inputs is that if you don't use a secure RNG, an attacker can give you a crafted input, and using a secure RNG can be tricky.
Re: 6502 vs 6800
We have to draw a firm distinction between deterministic pivot selection and random-sample pivot selection. The two behave very differently. We can illustrate that difference in terms of probabilities. In probability analysis, it is typical to draw, or at least think about, a Cumulative Distribution Function (CDF), upon which the median and quartiles (or any other desired quantile) can easily be extracted.
For Quicksort, the important metric to plot on the X axis of the CDF is the size of the largest partition, as that determines the required depth of recursion. Assuming a single pivot and unique elements, the ideal outcome is that the largest partition is exactly half the original, but that will rarely be the case in practice. Quicksort will be O(n log n) if the largest partition is consistently a multiplicative factor smaller than the original, but O(n^2) if it is merely a constant number of elements smaller than the original.
With deterministic pivot selection, the outcome will always be the same for any given selection algorithm and input data. So, for any given selection algorithm, it is straightforward to construct input data which, with 100% probability, exercises the worst case behaviour. All other possible outcomes have 0% probability, unless and until the selection algorithm or the input data are changed. Median-of-three, Ninther, and Median-of-Medians are all deterministic selections - but the latter has a much more favourable worst-case behaviour, since it considers a larger sample. I have a linear-time median-of-medians selector which is guaranteed to produce a worst-case O(n^1.25) Quicksort - but the constant scaling factor is somewhat higher than for simple selectors, because sampling all that data is costly. That median-of-medians selector is built into one of my older Introsort implementations, where it acts as the safety valve to avoid quadratic behaviour.
With random-sample pivot selection, we assume for analysis purposes that the outcome will change for each run, and that the random sequence is statistically independent of the data, and these are in fact good assumptions for a quality implementation. So any given outcome has a probability which, for typical data, is both non-zero and less than certainty. For a single random sample, the rank of the selected element is uniformly distributed over the possibilities, so the expected size of the largest partition is ¾ of the original; since this is a multiplicative factor, Quicksort with simple random pivot selection is O(n log n) with reasonably high probability. More sophisticated random sampling can move the expected largest partition size towards ½ of the original, and decrease the probability of largest partitions being more than ¾ of the original, improving the constant factor while retaining O(n log n) asymptotic behaviour.
It is, admittedly, still possible for random pivot selection to result in quadratic behaviour. But we can calculate the probability of that actually occurring, and find that it is vanishingly small, again on the assumption that our random sampling is independent of the input data. It requires that the selected pivot be within the top K or the bottom K ranks, by chance, repeatedly, where K is an arbitrarily chosen constant. The chance of that occurring with simple random pivot selection is 2K/N for a single pass, 2K/(N-K) for the next pass, and so on until the number of passes is itself O(N/K) = O(N). If we assume N = 1000 and K = 4, that's a 1/125 chance on the first pass - rare but could happen - then 2/249 on the second pass, and 1/124 on the third. All three of those probabilities must come to pass for quadratic behaviour to persist through merely the first three passes; multiplying them together yields a 1/1929750 probability, which continues to shrink exponentially towards zero as more passes are added. And that is for simple random pivot selection; larger random samples are even more robust against this behaviour.
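The arithmetic in that worked example can be checked exactly, for instance in Python:

```python
from fractions import Fraction

# Probability that a uniformly random pivot lands in the top K or
# bottom K ranks on three successive passes over an N-element range
# (the worked example above: N = 1000, K = 4).
N, K = 1000, 4
passes = [Fraction(2 * K, N - i * K) for i in range(3)]
assert passes == [Fraction(1, 125), Fraction(2, 249), Fraction(1, 124)]
total = passes[0] * passes[1] * passes[2]
assert total == Fraction(1, 1929750)   # roughly 5e-7, and still shrinking
```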
You may recall, earlier in this thread, that I asserted that Quicksort has some unfortunate implementation complexities which make it less suitable for small machines than Shellsort. That is why I don't have any Quicksort implementations for the 6502. I do, however, have this, which is a modified clone of Timo Bingmann's modestly famous work, meant to run on a modern computer with a C++ compiler. There you can find both improved versions of the sorting algorithms that were included in the original (in some cases fixing some very silly bugs) and several that I've devised myself. Have fun.