6502 vs 6800
Re: 6502 vs 6800
I heard that Timsort was considered for implementing sort in Go. It was rejected because of the O(n) auxiliary space requirement. So even on modern computers, in-place algorithms remain valuable.
Re: 6502 vs 6800
Ah, maybe the space usage rules it out.
Re: 6502 vs 6800
Hi!
Chromatix wrote:
Quote:
I tried implementing ternary-partition qsort in FastBasic, and found it to be always slower than the standard two-way partitioning. I suspect this is because the problem sizes are not really that big (I tried up to 8k elements) and the insertion-sort pass at the small partition size is not affected.
If it were always slower, that would also hold for the single-valued array, which it is supposed to help with massively - that rather suggests you got something wrong somewhere. But it is expected to be a little slower on normal data, since more elements have to be moved and a more complicated comparison has to be performed.
These are the measurements of my algorithm, sorting 4097 integers:
Random data:
- 3-way qsort: 105.5 mega-cycles
- 2-way qsort: 63.6 mega-cycles
Only 2 values:
- 3-way qsort: 9.1 mega-cycles
- 2-way qsort: 65.0 mega-cycles
Only 1 value:
- 3-way qsort: 5.4 mega-cycles
- 2-way qsort: 65.9 mega-cycles
Only 150 values:
- 3-way qsort: 59.3 mega-cycles
- 2-way qsort: 59.0 mega-cycles
So, no: the 3-way version is certainly faster if there are few distinct values in the array, but it is not true that the standard 2-way quicksort is pathological in that case - it is only slightly slower.
This is the code for my 3-way partitioning:
Have Fun!
Code: Select all
' QuickSort - using 3-way partitioning
num = 4096
dim p(num)

' FILL VALUES
for i=0 to num
  p(i) = 123 + rand(150)
  x = x + p(i)
next

' QuickSort
dim stack0(12), stack1(12)
lvl = 1
i0 = 0
i1 = num
repeat
  if i1 > i0 + 12
    ' Set pivot to median of 3 random values
    p0 = rand(i1 - i0) + i0
    p1 = rand(i1 - i0) + i0
    p2 = rand(i1 - i0) + i0
    t0 = p(p0)
    t1 = p(p1)
    t2 = p(p2)
    if (t0 > t1) = (t0 < t2)
      p(p0) = p(i0)
      temp = t0
    elif (t1 > t0) = (t1 < t2)
      p(p1) = p(i0)
      temp = t1
    else
      p(p2) = p(i0)
      temp = t2
    endif
    p(i0) = temp
    ' 3-way partition: "<" left, "=" middle, ">" right
    j0 = i0 - 1
    jm = i0 + 1
    j1 = i1 + 1
    while jm < j1
      v = p(jm)
      if (v) < (temp)
        inc j0
        p(jm) = temp
        p(j0) = v
        inc jm
      elif (v) = (temp)
        inc jm
      else
        dec j1
        p(jm) = p(j1)
        p(j1) = v
      endif
    wend
    ' Push the bigger partition, continue with the smaller one
    if i1 - j1 > j0 - i0
      stack0(lvl) = j1
      stack1(lvl) = i1
      i1 = j0
    else
      stack0(lvl) = i0
      stack1(lvl) = j0
      i0 = j1
    endif
    inc lvl
  else
    ' Small partition, do insertion sort:
    for i=i0 to i1-1
      temp = p(i+1)
      for j=i to i0 step -1
        if (p(j)) <= (temp) then exit
        p(j+1) = p(j)
      next
      p(j+1) = temp
    next
    dec lvl
    i0 = stack0(lvl)
    i1 = stack1(lvl)
  endif
until not lvl
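For readers without a FastBasic setup, the same 3-way (Dutch national flag) partitioning idea can be sketched in Python. This is an illustrative rewrite, not a line-by-line translation of the code above; for brevity it picks a single random pivot rather than a median of three:

```python
import random

def quicksort3(a, lo=0, hi=None):
    """Quicksort with 3-way partitioning: elements equal to the pivot
    are gathered in the middle and never recursed into, which is why
    arrays with few distinct values sort so quickly."""
    if hi is None:
        hi = len(a) - 1
    if lo >= hi:
        return
    pivot = a[random.randint(lo, hi)]
    lt, i, gt = lo, lo, hi            # a[lo:lt] < pivot, a[gt+1:hi+1] > pivot
    while i <= gt:
        if a[i] < pivot:
            a[lt], a[i] = a[i], a[lt]
            lt += 1
            i += 1
        elif a[i] > pivot:
            a[i], a[gt] = a[gt], a[i]
            gt -= 1
        else:
            i += 1                    # equal to pivot: stays in the middle
    quicksort3(a, lo, lt - 1)         # only the "<" and ">" regions recurse
    quicksort3(a, gt + 1, hi)
```

On an array with only one distinct value, the first partition puts everything in the middle region and the sort finishes in a single pass, matching the measurements above.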
Re: 6502 vs 6800
dmsc wrote:
You can implement Quicksort so the stack usage is bounded at log_2(n), so a 12 entry stack should be enough. The trick is to always recurse on the smallest partition first.
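The smallest-partition-first trick can be sketched like this (my own Python illustration, not code from the thread): push the larger side onto an explicit stack and keep iterating on the smaller side, so each stacked entry came from a range at most half the size of the one below it, bounding the stack at about log2(n) entries.

```python
def partition(a, lo, hi):
    """Plain Lomuto partition around a[hi]; returns the pivot's final index."""
    pivot = a[hi]
    i = lo
    for j in range(lo, hi):
        if a[j] < pivot:
            a[i], a[j] = a[j], a[i]
            i += 1
    a[i], a[hi] = a[hi], a[i]
    return i

def quicksort_bounded(a):
    """Iterative quicksort; returns the peak explicit-stack depth.
    The larger partition is pushed for later and the loop continues
    with the smaller one, so the peak stays around log2(n)."""
    stack = [(0, len(a) - 1)]
    peak = 0
    while stack:
        peak = max(peak, len(stack))
        lo, hi = stack.pop()
        while lo < hi:
            p = partition(a, lo, hi)
            if p - lo > hi - p:              # left side is larger:
                stack.append((lo, p - 1))    # save it for later,
                lo = p + 1                   # continue with the right side
            else:
                stack.append((p + 1, hi))
                hi = p - 1
    return peak
```

For 4097 elements that is at most 13 concurrent entries, which is why a 12-entry stack plus the working range suffices in the FastBasic code.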
dmsc wrote:
See this FastBasic code, sorting 16 bit numbers:
Code: Select all
num = 4096
dim p(num)
...
endif
until not lvl
The FastBasic 6502 interpreter runs the above code sorting the 4097 element array in approx 74 mega-cycles.
Code: Select all
num = 4096
DIM p(num)
' FILL
FOR i = 0 TO num: p(i) = INT(10000 * RND(1)): NEXT

DIM stack0(12), stack1(12)
lvl = 1  ' Stack level
i0 = 0   ' Sorts from...
i1 = num ' ... to
DO
  IF i1 > i0 THEN
    ' Set pivot to 'i1'
    temp = p(i1)
    j0 = i0 - 1
    j1 = i1
    DO
      DO: j0 = j0 + 1: LOOP UNTIL p(j0) >= temp
      DO: j1 = j1 - 1: IF j1 = i0 THEN EXIT DO
      LOOP UNTIL temp >= p(j1)
      IF j0 >= j1 THEN EXIT DO
      x = p(j0): p(j0) = p(j1): p(j1) = x
    LOOP
    p(i1) = p(j0): p(j0) = temp
    ' Recurse over smaller sub-array:
    IF i1 - j0 > j1 - i0 THEN
      stack0(lvl) = j0 + 1
      stack1(lvl) = i1
      i1 = j0 - 1
    ELSE
      stack0(lvl) = i0
      stack1(lvl) = j0 - 1
      i0 = j0 + 1
    END IF
    lvl = lvl + 1
  ELSE
    ' Empty partition, go down in the stack
    lvl = lvl - 1
    i0 = stack0(lvl)
    i1 = stack1(lvl)
  END IF
LOOP UNTIL NOT lvl
FOR i = 0 TO 10: PRINT p(i); : NEXT: PRINT
dmsc wrote:
Adding an insertion sort pass to sort partitions smaller than 8 elements reduces the runtime to 66 mega-cycles, 11% faster.
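The cutoff idea can be sketched as follows (a hypothetical Python illustration, not dmsc's code): quicksort simply ignores partitions below the cutoff, and one final insertion-sort pass finishes cheaply because every element is already near its final position. The middle element serves as pivot here just to keep the sketch short.

```python
CUTOFF = 8

def insertion_sort(a):
    """Finishing pass: near-linear when elements are nearly in place."""
    for i in range(1, len(a)):
        v = a[i]
        j = i - 1
        while j >= 0 and a[j] > v:
            a[j + 1] = a[j]
            j -= 1
        a[j + 1] = v

def qsort_partial(a, lo, hi):
    """Quicksort that leaves partitions smaller than CUTOFF unsorted."""
    if hi - lo < CUTOFF:
        return
    mid = (lo + hi) // 2          # middle element as pivot (for the sketch)
    a[mid], a[hi] = a[hi], a[mid]
    pivot = a[hi]
    i = lo
    for j in range(lo, hi):       # Lomuto partition
        if a[j] < pivot:
            a[i], a[j] = a[j], a[i]
            i += 1
    a[i], a[hi] = a[hi], a[i]
    qsort_partial(a, lo, i - 1)
    qsort_partial(a, i + 1, hi)

def quicksort_with_cutoff(a):
    qsort_partial(a, 0, len(a) - 1)
    insertion_sort(a)             # one pass cleans up all small partitions
```

The win comes from skipping quicksort's bookkeeping on ranges where insertion sort's simple inner loop is already faster.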
Re: 6502 vs 6800
dmsc wrote:
I agree, and I really like the Shellsort algorithm: it is really simple and quite fast, normally faster than Heapsort in most implementations.
BigEd wrote:
(Interesting article on the practical and modern Timsort here.)
IMHO Timsort is overrated; it is just a current fashion. It requires a lot of additional memory and is much slower than other methods - slower than Shellsort in many important cases.
A year ago I did some small research on sorting methods; the results are in a dynamic table - https://litwr2.github.io/ros/table1.html
dmsc wrote:
And about pivot selection, yes, selecting a good pivot is essential. Instead of one random element you can do best of three, see the complete code:
Re: 6502 vs 6800
A randomly sampled pivot avoids worst-case behaviour pretty reliably, so long as the data is constructed without knowledge of the random sampling sequence (which is assured if you use true entropy to seed a decent PRNG after generating the data). Formally, the worst-case O(n^2) behaviour is still possible but becomes vanishingly improbable for any problem size where it actually matters, so the average-case O(n log n) behaviour is what you will get. That's true for even a single sample per pivot, and the average quality of the pivot improves with the size of the sample.
With only 64KB RAM and decidedly primitive CPUs, we're working with small enough problem sizes that even going quadratic is not completely catastrophic, though it will definitely hurt. More importantly, we probably don't have an adversary constructing our data with the intention of burning CPU time. So we can relax slightly and use a sloppily seeded PRNG for sampling the pivot.
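For instance, a 16-bit xorshift generator of the kind popular on 8-bit CPUs (the 7/9/8 shift triple below is a well-known choice for 16-bit state; the Python is my sketch) is plenty for pivot sampling, since the only real requirement is that it be uncorrelated with the input data:

```python
class XorShift16:
    """16-bit xorshift PRNG - cheap enough for an 8-bit CPU.
    Statistical quality is modest, but for pivot sampling it only
    has to be independent of the input data."""
    def __init__(self, seed=1):
        self.state = (seed & 0xFFFF) or 1   # state must never be zero

    def next(self):
        x = self.state
        x ^= (x << 7) & 0xFFFF
        x ^= x >> 9
        x ^= (x << 8) & 0xFFFF
        self.state = x
        return x

def pick_pivot_index(rng, lo, hi):
    """Pivot index in [lo, hi]; the modulo reduction is slightly
    biased, which is harmless for this purpose."""
    return lo + rng.next() % (hi - lo + 1)
```

Seeding it from, say, a frame counter or keyboard timing is "sloppy" in the cryptographic sense but entirely adequate when no adversary is crafting the input.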
Re: 6502 vs 6800
Chromatix wrote:
A randomly sampled pivot avoids worst-case behaviour pretty reliably, so long as the data is constructed without knowledge of the random sampling sequence (which is assured if you use true entropy to seed a decent PRNG after generating the data). Formally, the worst-case O(n^2) behaviour is still possible but becomes vanishingly improbable for any problem size where it actually matters, so the average-case O(n log n) behaviour is what you will get. That's true for even a single sample per pivot, and the average quality of the pivot improves with the size of the sample.
Re: 6502 vs 6800
It seems to me moderately likely that in the real world the input data might be partially sorted, or even reverse-sorted. So the question of what the probable behaviour of an algorithm is should be weighted by the probable distribution of the inputs. For sure, random or nearly so is an important case: but I think it's not the only case, in practice.
Whether the worst case is important or not is another question. Is it incredibly expensive, or just a bit slower? Is it vanishingly improbable, or just uncommon? Will the system fail in its mission if the worst case is encountered, fall below a service level agreement, or merely be a bit slow?
Re: 6502 vs 6800
I assume you mean the data generator in fill_for_quadratic_qsort_hoare()? I just ported that over to my own system, and it's easily handled by my own quicksorts (and indeed by the standard ones, some of which choke much harder on a mere descending list). Though std::sort() did resort to heapsort as you say, it looks like it's just doing median-of-three, which is not sophisticated pivot selection by my standards. Median-of-three is well known to be susceptible to adversarially constructed data; its only redeeming qualities are that it is cheap and deterministic - the same qualities that inherently make it vulnerable.
I've heard that system library authors don't really like to use RNGs in library functions that are not explicitly related to random number generation. I'm reasonably sure the design of std::sort() is an example of that.
Applying my own (relatively old) Introsort implementation to this data yields quite decent results. I use median-of-three by default in that, but dynamically switch to a linear-time median-of-medians approximation if the partition size does not converge as quickly as expected for the recursion depth. There is no recourse to heapsort there, just quicksort with an actually sophisticated (but still deterministic) pivot selection. Within a recursion depth of about 4, the killer pattern is broken up enough that median-of-three works well again.
My newer quicksorts - the ones that use a priority queue, multiple pivots and random pivot sampling - have absolutely no problem with it. With random sampling for pivot selection, any data looks like it is randomly shuffled, which is quicksort's favoured case.
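For readers unfamiliar with the technique, a median-of-medians pivot selector can be sketched as follows (my illustration - the actual selector referred to above is not shown in this thread, and the textbook linear-time version recurses rather than iterating). Repeatedly taking medians of groups of five yields a pivot that can never sit at the very extremes of the range, so the partition around it is never totally degenerate:

```python
def median_of_medians(a, lo, hi):
    """Approximate median of a[lo..hi]: repeatedly replace the working
    list by the medians of its groups of five. Each surviving value has
    elements above and below it in its group, so the final pivot is
    guaranteed to be away from the extremes of the range."""
    vals = a[lo:hi + 1]
    while len(vals) > 5:
        groups = [vals[i:i + 5] for i in range(0, len(vals), 5)]
        vals = [sorted(g)[len(g) // 2] for g in groups]
    return sorted(vals)[len(vals) // 2]
```

Sampling that much data costs more per pivot than median-of-three, which is why it makes sense as a fallback rather than the default.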
Re: 6502 vs 6800
Just briefly on sorting... One thing to check is the number of comparisons needed - and what it is we're comparing. I found recently that a shellsort was faster than quicksort for sorting strings because for that particular data set I was working with (in my '816 BCPL system), the string compare was relatively slow, so it's always going to be hard to pick a "best" algorithm for everything. I don't think there is a "one size fits all" here.
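That trade-off is easy to measure directly: instrument the comparison and count how many times each algorithm calls it. A rough Python harness (mine, purely illustrative) might look like:

```python
import functools

def count_comparisons(sort_fn, data):
    """Sort the data with an instrumented comparison and return
    (sorted_correctly, number_of_comparisons). When each comparison
    is expensive - e.g. string compares on a small machine - the
    algorithm making the fewest comparisons can win regardless of
    its big-O reputation."""
    counter = {'n': 0}

    @functools.total_ordering
    class Keyed:
        def __init__(self, v):
            self.v = v
        def __lt__(self, other):
            counter['n'] += 1
            return self.v < other.v
        def __eq__(self, other):
            counter['n'] += 1
            return self.v == other.v

    wrapped = [Keyed(v) for v in data]
    sort_fn(wrapped)
    ok = all(wrapped[k].v <= wrapped[k + 1].v for k in range(len(wrapped) - 1))
    return ok, counter['n']
```

Running this over the same data set with a Shellsort and a quicksort makes the comparison budget of each algorithm concrete.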
-Gordon
--
Gordon Henderson.
See my Ruby 6502 and 65816 SBC projects here: https://projects.drogon.net/ruby/
Re: 6502 vs 6800
Correct, different sorting algorithms have strengths in different situations. Often the main weakness in one algorithm is a strength of another.
Re: 6502 vs 6800
Hi!
This trick helps to save stack but it can also make sorting much slower.
Try to use 'qs-slow' filling...
The order of recursion does not change the speed, as the work done is the same.
It is sad that FastBasic is available only for the Atari. I missed out on the Atari world.
It would be good to have FastBasic for the 6800 for the present contest.
Porting FastBasic to other 6502 computers should not be hard - it is open source and I certainly would accept contributions!
Porting to other processors would be a lot of work, as you would need to rewrite all the interpreter and the parser machine, but at least the parser bytecode can be used as is.
That line should be "LOOP UNTIL lvl = 0", as QBasic defines NOT as a bitwise operator (NOT A = 65535 - A), so you must always compare with 0. FastBasic is modeled on Atari BASIC, with the "C-like" convention that treats integers as booleans and AND/NOT/OR as boolean operators.
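The point about NOT is easy to demonstrate: as a bitwise complement, NOT A equals -1 - A in two's complement (65535 - A viewed as unsigned 16-bit), and it only yields 0 ("false") when A is all ones. A quick Python check of that arithmetic:

```python
def qbasic_not(a):
    """QBasic's NOT applied to a 16-bit value: bitwise complement."""
    return ~a & 0xFFFF

# NOT lvl is non-zero ("true") for every lvl except 65535, so
# "LOOP UNTIL NOT lvl" would exit after the first pass no matter
# what lvl holds; the loop must test "lvl = 0" explicitly.
```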
This is my working QBasic code, with timers to measure the speed:
Have Fun!
litwr wrote:
dmsc wrote:
You can implement Quicksort so the stack usage is bounded at log_2(n), so a 12 entry stack should be enough. The trick is to always recurse on the smallest partition first.
Quote:
dmsc wrote:
See this FastBasic code, sorting 16 bit numbers:
Code: Select all
num = 4096
dim p(num)
...
endif
until not lvl
The FastBasic 6502 interpreter runs the above code sorting the 4097 element array in approx 74 mega-cycles.
Quote:
So I have converted the code to Microsoft QBasic and run it under Dosbox. I have got
Code: Select all
LOOP UNTIL NOT lvl
Code: Select all
' QuickSort - using 2-way partitioning
' 2-way partitioning is faster if there are not a lot of repeated elements
DEFINT A-Z
num = 4096
DIM SHARED p(num + 1)

' FILL
x = 0 ' X is a "check-sum" of the values, to verify that they are all the same
S! = TIMER
FOR i = 0 TO num
  p(i) = INT(10000 * RND)
  x = (x + p(i)) AND 16383
NEXT
S! = TIMER - S!
PRINT "Fill: "; S!; "sec", "check: "; x

' SORT
S! = TIMER
DIM stack0(12), stack1(12)
lvl = 1
i0 = 0
i1 = num
DO
  IF i1 > i0 + 8 THEN
    ' Set pivot to 'i1'
    temp = p(i1)
    j0 = i0 - 1
    j1 = i1
    DO
      DO: j0 = j0 + 1: LOOP UNTIL (p(j0)) >= (temp)
      DO: j1 = j1 - 1: IF j1 = i0 THEN EXIT DO
      LOOP UNTIL (temp) >= (p(j1))
      IF j0 >= j1 THEN EXIT DO
      x = p(j0): p(j0) = p(j1): p(j1) = x
    LOOP
    p(i1) = p(j0): p(j0) = temp
    ' Recurse over smaller sub-array:
    IF i1 - j0 > j1 - i0 THEN
      stack0(lvl) = j0 + 1
      stack1(lvl) = i1
      i1 = j0 - 1
    ELSE
      stack0(lvl) = i0
      stack1(lvl) = j0 - 1
      i0 = j0 + 1
    END IF
    lvl = lvl + 1
  ELSE
    ' Small partition, do insertion sort:
    FOR i = i0 TO i1 - 1
      temp = p(i + 1)
      FOR j = i TO i0 STEP -1
        IF (p(j)) <= (temp) THEN EXIT FOR
        p(j + 1) = p(j)
      NEXT
      p(j + 1) = temp
    NEXT
    lvl = lvl - 1
    i0 = stack0(lvl)
    i1 = stack1(lvl)
  END IF
LOOP UNTIL lvl = 0
S! = TIMER - S!
PRINT "Sort: "; S!; "sec",
' CHECK
x = p(0)
FOR i = 0 TO num - 1
  x = (x + p(i + 1)) AND 16383
  IF (p(i)) > (p(i + 1)) THEN
    PRINT "Error at "; i; ": "; (p(i)); " > "; (p(i + 1))
    EXIT FOR
  END IF
NEXT
PRINT "check: "; x
Re: 6502 vs 6800
Hi!
The "trick" of selecting the pivot as the median of three random values does in fact reduce the probability of quadratic behavior - on big inputs it is so unlikely that it is more probable that your computer simply breaks down than that the sorting time exceeds double the mean.
The reason QuickSort is avoided on uncontrolled inputs is that if you don't use a secure RNG, an attacker can give you a crafted input, and using a secure RNG can be tricky.
Have Fun!
litwr wrote:
I can only repeat that tricks around pivot selection do not reduce the probability of the worst cases. So the standard industrial C++ sort (Introsort) just switches to Heapsort if there are too many recursive calls. BTW 'qs-slow' kills standard C-qsort (gcc) which uses sophisticated ways to get the pivot and reduce the stack load.
Re: 6502 vs 6800
Chromatix wrote:
I assume you mean the data generator in fill_for_quadratic_qsort_hoare()? ...
I agree with you that switching to Heapsort seems a bit too complicated. We also know that Shellsort on numbers is noticeably faster than Heapsort on modern cached hardware. However, pivot manipulations do not guard against a pattern that can kill the sorting. IMHO my 6502 quicksort guarantees N*log(N) speed in every case.
It would be interesting to check your quicksorts and especially their 6502 implementations.
drogon wrote:
Just briefly on sorting... One thing to check is the number of comparisons needed - and what it is we're comparing. I found recently that a shellsort was faster than quicksort for sorting strings because for that particular data set I was working with (in my '816 BCPL system), the string compare was relatively slow, so it's always going to be hard to pick a "best" algorithm for everything. I don't think there is a "one size fits all" here.
dmsc wrote:
The order of recursion does not change the speed, as the work done is the same.
dmsc wrote:
Porting FastBasic to other 6502 computers should not be hard - it is open source and I certainly would accept contributions!
Porting to other processors would be a lot of work, as you would need to rewrite all the interpreter and the parser machine, but at least the parser bytecode can be used as is.
Indeed it is not easy to make a port for a different processor, but maybe the 6800 people will help?
dmsc wrote:
That line should be "LOOP UNTIL lvl = 0", as QBasic defines NOT as a binary operator (NOT A = 65535 - A), so you must always compare with 0. FastBasic is modeled on Atari BASIC, with the "C like" convention that uses the boolean as a type and AND/NOT/OR as boolean operators.
dmsc wrote:
This is my working QBasic code, with timers to measure the speed:
Code: Select all
' QuickSort - using 2-way partitioning
' 2-way partition is faster if there is not a lot of repeated elements
DEFINT A-Z
...
What do you think about dual-pivot quicksort? https://github.com/litwr2/research-of-s ... ick-dp.cpp IMHO it is one of the best; it was made the standard sort in Java. However it seems that it is possible to design a bad sequence for it too.
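For reference, the core of a dual-pivot quicksort in the style of Yaroslavskiy (the variant adopted for Java's Arrays.sort on primitives) can be sketched in Python; this is my own simplified illustration, not the code from the linked repository:

```python
def dual_pivot_quicksort(a, lo=0, hi=None):
    """Dual-pivot quicksort: two pivots p1 <= p2 split the range into
    three parts (< p1, between p1 and p2, > p2), each sorted recursively."""
    if hi is None:
        hi = len(a) - 1
    if lo >= hi:
        return
    if a[lo] > a[hi]:
        a[lo], a[hi] = a[hi], a[lo]
    p1, p2 = a[lo], a[hi]             # pivots taken from the ends
    lt, gt, i = lo + 1, hi - 1, lo + 1
    while i <= gt:
        if a[i] < p1:
            a[i], a[lt] = a[lt], a[i]
            lt += 1
            i += 1
        elif a[i] > p2:
            a[i], a[gt] = a[gt], a[i]
            gt -= 1                   # swapped-in element is unexamined
        else:
            i += 1                    # between the pivots
    lt -= 1
    gt += 1
    a[lo], a[lt] = a[lt], a[lo]       # move pivots to their boundaries
    a[hi], a[gt] = a[gt], a[hi]
    dual_pivot_quicksort(a, lo, lt - 1)
    dual_pivot_quicksort(a, lt + 1, gt - 1)
    dual_pivot_quicksort(a, gt + 1, hi)
```

Taking the pivots from the two ends keeps the sketch short; like plain quicksort, it inherits their weaknesses on crafted input.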
BTW this discussion has forced me to start a new small github project - https://github.com/litwr2/6502-sorting
dmsc wrote:
The "trick" of selecting the pivot from the median of three random values does in fact reduce the probability of quadratic behavior - in fact it is so unlikely on big inputs that it is much more probable that your computer just breaks than the sorting time is more than double of the mean.
The reason QuickSort is avoided on uncontrolled inputs is that if you don't use a secure RNG, an attacker can give you a crafted input, and using a secure RNG can be tricky.
Re: 6502 vs 6800
We have to draw a firm distinction between deterministic pivot selection and random-sample pivot selection. The two behave very differently. We can illustrate that difference in terms of probabilities. In probability analysis, it is typical to draw, or at least think about, a Cumulative Distribution Function (CDF), upon which the median and quartiles (or any other desired quantile) can easily be extracted.
For Quicksort, the important metric to plot on the X axis of the CDF is the size of the largest partition, as that determines the required depth of recursion. Assuming a single pivot and unique elements, the ideal outcome is that the largest partition is exactly half the original, but that will rarely be the case in practice. Quicksort will be O(n log n) if the largest partition is consistently a multiplicative factor smaller than the original, but O(n^2) if it is merely a constant number of elements smaller than the original.
With deterministic pivot selection, the outcome will always be the same for any given selection algorithm and input data. So, for any given selection algorithm, it is straightforward to construct input data which, with 100% probability, exercises the worst case behaviour. All other possible outcomes have 0% probability, unless and until the selection algorithm or the input data are changed. Median-of-three, Ninther, and Median-of-Medians are all deterministic selections - but the latter has a much more favourable worst-case behaviour, since it considers a larger sample. I have a linear-time median-of-medians selector which is guaranteed to produce a worst-case O(n^1.25) Quicksort - but the constant scaling factor is somewhat higher than for simple selectors, because sampling all that data is costly. That median-of-medians selector is built into one of my older Introsort implementations, where it acts as the safety valve to avoid quadratic behaviour.
With random-sample pivot selection, we assume for analysis purposes that the outcome will change for each run, and that the random sequence is statistically independent of the data, and these are in fact good assumptions for a quality implementation. So any given outcome has a probability which, for typical data, is both non-zero and less than certainty. For a single random sample, the rank of the selected element is uniformly distributed over the possibilities, so the expected size of the largest partition is ¾ of the original; since this is a multiplicative factor, Quicksort with simple random pivot selection is O(n log n) with reasonably high probability. More sophisticated random sampling can move the expected largest partition size towards ½ of the original, and decrease the probability of largest partitions being more than ¾ of the original, improving the constant factor while retaining O(n log n) asymptotic behaviour.
It is, admittedly, still possible for random pivot selection to result in quadratic behaviour. But we can calculate the probability of that actually occurring, and find that it is vanishingly small, again on the assumption that our random sampling is independent of the input data. It requires that the selected pivot be within the top K or the bottom K ranks, by chance, repeatedly, where K is an arbitrarily chosen constant. The chance of that occurring with simple random pivot selection is 2K/N for a single pass, 2K/(N-K) for the next pass, and so on until the number of passes is itself O(N/K) = O(N). If we assume N = 1000 and K = 4, that's a 1/125 chance on the first pass - rare but could happen - then 2/249 on the second pass, and 1/124 on the third. All three of those probabilities must come to pass for quadratic behaviour to persist through merely the first three passes; multiplying them together yields a 1/1929750 probability, which continues to shrink exponentially towards zero as more passes are added. And that is for simple random pivot selection; larger random samples are even more robust against this behaviour.
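The arithmetic in that worked example can be checked exactly, for instance in Python:

```python
from fractions import Fraction

# Probability that a uniformly random pivot lands in the top K or
# bottom K ranks on three successive passes over an N-element range
# (the worked example above: N = 1000, K = 4).
N, K = 1000, 4
passes = [Fraction(2 * K, N - i * K) for i in range(3)]
assert passes == [Fraction(1, 125), Fraction(2, 249), Fraction(1, 124)]
total = passes[0] * passes[1] * passes[2]
assert total == Fraction(1, 1929750)   # roughly 5e-7, and still shrinking
```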
You may recall, earlier in this thread, that I asserted that Quicksort has some unfortunate implementation complexities which make it less suitable for small machines than Shellsort. That is why I don't have any Quicksort implementations for the 6502. I do, however, have this, which is a modified clone of Timo Bingmann's modestly famous work, meant to run on a modern computer with a C++ compiler. There you can find both improved versions of the sorting algorithms that were included in the original (in some cases fixing some very silly bugs) and several that I've devised myself. Have fun.