slightly OT: a simple Benchmark

BillO · Post by **BillO** » Fri Jul 06, 2018 3:07 am

commodorejohn wrote:

Ooh, now that's a sharp-looking little beast.

Thanks!.

Quote:

I don't see a thread for it on the forum here

There was a trouble shooting thread where I got a lot of help with getting the prototype to behave, but I do plan to do an introduction thread. Just need to find the time.

Quote:

- did you design it?

Yes I did - the idea was rattling around my head for many years. Finally had a chance to work on this last winter.

Quote:

What are the specs

That will be in the intro thread.

Quote:

...and is there anyplace to get a bare board?

I suppose that could be arranged. I had 5 bare boards made up, but will not need all of them.

BigEd · Post by **BigEd** » Fri Jul 06, 2018 3:08 am

It's an interesting and attractive board - I second the idea of a new thread for it!

BillO · Post by **BillO** » Fri Jul 06, 2018 3:27 am

Chromatix wrote:

Well, no, because I have no idea how to do that on your hardware. On the Beeb I just built a small BASIC wrapper around it.

Hmmm, I haven't read through your code yet, but what I was thinking is that you would increment a cycle counter for each time through a known routine - or something like that - so it would be machine independent.

In any case, I will probably need to add some I/O writes to detect when it starts and ends a run using an external interval counter.

BigEd · Post by **BigEd** » Fri Jul 06, 2018 11:09 am

I may be veering slightly off topic here...

GaBuZoMeu wrote:

The 68000 has a different clock scheme than the 6502. A cycle took 4 clocks (IIRC the TAS instruction took 6). I don't look at the clock speed anymore - it isn't helpful. I think it is better to ask for the required memory spped (access time). Using the required memory speed (e.g. 500ns for a 1 MHz 6502 system) you can compare various clock architectures easily.

Agreed, as memory was a dominant part of cost, an interesting question is how much computation you can do once you've bought your memory, which has a particular speed and a particular width, in choosing a CPU wisely.

In the previous discussion on benchmarking using a pi spigot program - results here - I see something a bit surprising. The Amiga with a 68000 at 7MHz is, of course, faster than a 6502 at 2MHz, even though the memory access rate might be a little worse, because it has a wider path to memory, lots of registers which reduce memory traffic, and the registers are wide, and the instruction set includes MUL and DIV. But I'm a little surprised that the ARM1 at 6.67MHz or the ARM2 at 8MHz (with MUL) doesn't make a better showing: the memory path is wider again and it makes efficient use of page mode. Perhaps the 32 bit wide instructions are a hindrance: too much memory bandwidth used for instruction fetch - no cache yet in that generation.

Chromatix · Post by **Chromatix** » Fri Jul 06, 2018 12:51 pm

The answer is simple: ARM2 doesn't have hardware divide, which is used by the algorithm. Implementing divide in software is expensive, even though ARM is much better at it than most with its barrel shifter and conditional execution.

BigEd · Post by **BigEd** » Fri Jul 06, 2018 12:52 pm

That makes sense, thanks!

GaBuZoMeu · Post by **GaBuZoMeu** » Fri Jul 06, 2018 4:06 pm

BigEd, you may also remember that the upcoming first RISC machines in that time had have a heavy stand. The lack of high speed memory, the requirement of wide memory, and the "so lala" performance triggers a huge discussion about whats right or wrong. The reinvention of the cache memory, a new sort of memory access (bulk access), and the tremendous gain in efficiency of the upcoming compilers changes a lot then. Next was once the memory speed doesn't need to keep up with the CPU speed the CPUs clockspeed could freely increase. A well chosen cache architecture could moderate the slow responding memory - memory bandwidth was never a problem.

Cheers
Arne

BigEd · Post by **BigEd** » Fri Jul 06, 2018 4:10 pm

It's true, the appearance of small instruction buffers and then small caches made a lot of difference. I remember back in the 80s, in my first job, someone saying that there was a problem in knowing what to do with all the transistors that we'd have as processes shrunk. I thought this odd - why not let the chips get smaller and cheaper - but as it turns out we did use lots more transistors!

BTW, you, and others interested in such questions, might also like some of the discussions over on anycpu. For example, we added a very small cache to our OPC7 minimal 32 bit CPU, and it made a big difference.
http://anycpu.org/forum/viewtopic.php?f=3&t=480

BigEd · Post by **BigEd** » Fri Jul 06, 2018 4:29 pm

(Sorry, a better thread on the cache performance of the OPC range of CPUs is this one:
http://anycpu.org/forum/viewtopic.php?p=2751#p2751
)

Chromatix · Post by **Chromatix** » Sat Jul 07, 2018 4:00 am

As an illustration of how *algorithmic* optimisations can be just as important - or even more so - compared to micro-optimisations or choice of language, here are results from the emulated 3MHz Second Processor running HiBASIC - but with a program which performs trial divisions over a list of discovered primes, rather than over every odd integer. This also serves as a minor demonstration of "structured BASIC" to contrast with the programming style enforced by standard BASIC.

Code: Select all

A:    3.87
B:    6.02
C:   56.66
D:  133.57
E:  236.45
F: 5263.5

So this version turns out to be slightly slower than my best brute-force BASIC code for the first two cases - which is not so surprising since there are more lines of code to execute per iteration - but gets progressively faster in relative terms as the problem difficulty is increased. Given a big enough problem to work on, this would undoubtedly outperform any assembly implementation of the brute-force algorithm; in case F it is already only 4x slower than my assembly code, and faster than some of the older, larger machines on which the F case has been benched.

Here's the code. Notice I've left out the line numbers - there are no GOTOs or GOSUBs, so they literally don't matter. Load your Beeb emulator, type AUTO, and paste this lot in.

Code: Select all

DIM PR%(4000) : MAXPR%=4000 : NUMPR%=1 : PR%(0)=3
DATA 20,30,35,50,70,100,0
REPEAT : READ G% : IF G%=0 THEN END
T%=TIME
NUMPR%=1 : LIMPR%=9 : C%=3 : D%=3
REPEAT
C%=FNnextprime(C%)
IF (C%-D%) < G% THEN D%=C%
UNTIL (C%-D%) >= G%
U%=TIME
PRINT C%,D%,(C%-D%),(U%-T%)/100
UNTIL FALSE
REM ---
DEF FNnextprime(S%)
LOCAL I%,N%,POP%
FOR POP%=1 TO 0 STEP 1
FOR N%=S%+2 TO 16000000 STEP 2
IF N% > LIMPR% THEN PR%(NUMPR%)=FNnextprime(PR%(NUMPR%-1)) : LIMPR%=PR%(NUMPR%)*PR%(NUMPR%) : NUMPR%=NUMPR%+1 : IF NUMPR% > MAXPR% THEN STOP
FOR I%=0 TO NUMPR%-1
IF (N% MOD PR%(I%))=0 THEN NEXT N%
NEXT
NEXT POP% : =N%

The last line of the above code cleans up the FOR-NEXT stack, which in BBC BASIC is apparently independent of the FN call stack, before returning the newly discovered prime. Without that, you'll quickly end up with "Too many FORs". Note also the use of recursion, calling FNnextprime within itself to update the list of primes to use as divisors.

BigEd · Post by **BigEd** » Sat Jul 07, 2018 5:21 am

I'm going to have to digest that... but meantime I pasted it over serial into my Master with the Matchbox running an approx 64MHz Arlet 65C02 design, and got this:

Code: Select all

>RUN
       907       887        20      0.21
      1361      1327        34      0.33
      9587      9551        36      3.26
     19661     19609        52      7.75
     31469     31397        72     13.75
    370373    370261       112    308.76

And just for laughs, switched to the internally-mounted PiTubeDirect which runs at about 275MHz...

Code: Select all

>CH."primp"
       907       887        20      6E-2
      1361      1327        34      8E-2
      9587      9551        36      0.78
     19661     19609        52      1.86
     31469     31397        72      3.29
    370373    370261       112     73.22

And then I switched to the native ARM, in effect running BBC Basic on the Pi Zero but with a BBC Micro as the front end host:

Code: Select all

>CH."primp"
       907       887        20      0.01
      1361      1327        34         0
      9587      9551        36      0.04
     19661     19609        52      0.09
     31469     31397        72      0.16
    370373    370261       112      3.37
>

Chromatix · Post by **Chromatix** » Sat Jul 07, 2018 5:31 am

Now, looking at your 68K assembly, I can see a number of improvements to make, without changing the underlying algorithm.

In common with most CISC machines, the 68K updates the condition codes as a side-effect of almost every instruction, so some of your explicit test instructions can be removed. It's also worth paying close attention to the operand sizes, since the original 68000 has only a 16-bit ALU and thus takes longer to perform 32-bit operations.

Also, your subroutine is called only once from one place, so should instead be placed inline to avoid the call overhead.

Code: Select all

; Enter with D2.L containing minimum gap to search for
PrimeGap:
  moveq.l #3,D0    ; high (active) prime
UpdateTarget:
  move.l D0,D4
  move.l D2,D5
  add.l D4,D5      ; D5.L can be compared directly with a newly discovered prime
NextPrime:
  addq.l #2,D0
  moveq.w #1,D7    ; trial divisor
NextDivisor:
  addq.w #2,D7
  move.l D0,D1
  divu.w D7,D1     ; divides D1.L by D7.W, leaving quotient in D1.W and remainder in high word of D1
  move.w D1,D3     ; save quotient
  clr.w D1         ; zero quotient field to permit testing the remainder alone
  swap.w D1        ; retrieve remainder and test for zero
  beq NextPrime    ; even division means this number is composite, not prime, don't test any more divisors
  cmp.w D7,D3      ; check whether divisor has reached the square root of the prime candidate
  bcc NextDivisor  ; ie. if the quotient is still greater than the divisor, we should try another divisor

; Here we have discovered a new prime.
  cmp.l D0,D5
  bcc UpdateTarget   ; if gap isn't big enough, try again
  rts
; on exit, D0 and D4 contain the first two primes with sufficient gap.

The above is completely untested. Most likely its performance is still dominated by the DIVU, but now there is less overhead surrounding it and the code is also smaller.

ETA: Thinking about it, I'm reminded that DIVU.W has a quirk which will stop this code from being able to handle primes larger than 3*64K. That is, if the 16-bit result quotient overflows (which will occur if the dividend is at least as large as the divisor shifted left 16 bits), the entire division operation is aborted and the destination/dividend register is left unchanged. Therefore all integers greater than that limit will be detected as composite, even if they're really prime.

Unfortunately there's no simple workaround that is both fast and preserves the character of the algorithm. The nearest I can think of is to use DIVU twice in a multi-precision operation, but DIVU is already by far the most expensive instruction so this would have to be done selectively. The 68020 has a version of DIVU that can produce a quotient the same width as the dividend, but that's not what is already in the bench table.

Chromatix · Post by **Chromatix** » Sat Jul 07, 2018 8:49 pm

It might be worth building a new table based on a more efficient algorithm. With that in mind, I've taken the approach from one of my recent posts and reimplemented it in C, with a slightly different approach to the task being measured. The differences from the original are:

- Instead of using every odd integer as trial divisors, a list of actual primes is kept. The complete set of 16-bit primes can easily be stored in about 13KB, which is still small enough for most 8-bit micros, and this permits testing primality relatively efficiently on any 32-bit integer.

- Instead of searching for gaps of particular sizes, the program lists all maximal prime gaps in the 32-bit integer range, and also notes how long it took to reach each one. The times printed are cumulative, so they can be compared with other approaches to the same problem.

The complete run takes about an hour and a half on my Sandy Bridge based MacBook Pro. It also takes several hours on my Raspberry Pi - so it's feasible to run overnight. Small 8-bit and 16-bit machines are unlikely to be able to perform a complete run, but they can be stopped after producing whatever they are capable of. And the algorithm is simple enough to implement in BASIC or assembly, though most BASICs will use double the memory that C does to store the list of small primes.

In the C source, I've provided programmatic hints for implementing an assembly version to use a 32/16 bit divide instruction efficiently. I suspect most C compilers for small machines won't do a good job of that by themselves. Modern CPUs can produce a 32-bit quotient without triggering overflow semantics, so don't need that special assistance.

Sample output:

Code: Select all

Maxgap    1:          3 -          2,     0.000 sec
Maxgap    2:          5 -          3,     0.000 sec
Maxgap    4:         11 -          7,     0.000 sec
Maxgap    6:         29 -         23,     0.000 sec
Maxgap    8:         97 -         89,     0.000 sec
Maxgap   14:        127 -        113,     0.000 sec
Maxgap   18:        541 -        523,     0.000 sec
Maxgap   20:        907 -        887,     0.000 sec
Maxgap   22:       1151 -       1129,     0.000 sec
Maxgap   34:       1361 -       1327,     0.000 sec
Maxgap   36:       9587 -       9551,     0.000 sec
Maxgap   44:      15727 -      15683,     0.000 sec
Maxgap   52:      19661 -      19609,     0.001 sec
Maxgap   72:      31469 -      31397,     0.001 sec
Found 6542 16-bit primes in 0.002 secs.
Maxgap   86:     156007 -     155921,     0.006 sec
Maxgap   96:     360749 -     360653,     0.017 sec
Maxgap  112:     370373 -     370261,     0.018 sec
Maxgap  114:     492227 -     492113,     0.026 sec
Maxgap  118:    1349651 -    1349533,     0.100 sec
Maxgap  132:    1357333 -    1357201,     0.100 sec
Maxgap  148:    2010881 -    2010733,     0.165 sec
Maxgap  154:    4652507 -    4652353,     0.508 sec
Maxgap  180:   17051887 -   17051707,     2.941 sec
Maxgap  210:   20831533 -   20831323,     3.873 sec
Maxgap  220:   47326913 -   47326693,    11.763 sec
Maxgap  222:  122164969 -  122164747,    42.470 sec
Maxgap  234:  189695893 -  189695659,    77.317 sec
Maxgap  248:  191913031 -  191912783,    78.493 sec
Maxgap  250:  387096383 -  387096133,   203.349 sec
Maxgap  282:  436273291 -  436273009,   239.639 sec
Maxgap  288: 1294268779 - 1294268491,  1087.127 sec
Maxgap  292: 1453168433 - 1453168141,  1274.887 sec
Maxgap  320: 2300942869 - 2300942549,  2399.465 sec
Maxgap  336: 3842611109 - 3842610773,  4813.874 sec
Reached end of 32-bit integer space, in 5623.393 secs.

Noteworthy in the above is that my MBP finds 16-bit primes faster than a BBC Micro's CPU clock ticks!

GaBuZoMeu · Post by **GaBuZoMeu** » Sun Jul 08, 2018 10:38 am

Thank you for your submissions Chromatix!

The idea of creating a list of primes during seeking them I have had once. But with only 4K RAM

Then hacking this into a machine somewhere in a shop would take more time and might be annoying

I will give it a try in Basic but I assume the enhancements will take effect mostly for problems D, E, and F (and higher).

Nevertheless your "pushing" did take effect

I found my old old 68K "Force ProfiKit 1" !

mc-1982-07-pg9-ForceProfiKit.pdf: (314.89 KiB) Downloaded 77 times

I hope it will work for me again, but I have only one suitable power supply at hands - an open frame SPS. That is a bit too dangerous without a cabinet. And as I only need a solution for the +/-12V (RS-232 section) I think it is easier to use a small DC/DC converter with 5V input using the main 5V supply. But such one I haven't at hands.

Cheers,
Arne

Chromatix · Post by **Chromatix** » Sun Jul 08, 2018 2:24 pm

Or you could wire up an adapter to a standard PC-ATX PSU. That would provide the voltages you need.

slightly OT: a simple Benchmark

Re: slightly OT: a simple Benchmark

Re: slightly OT: a simple Benchmark

Re: slightly OT: a simple Benchmark

Re: slightly OT: a simple Benchmark

Re: slightly OT: a simple Benchmark

Re: slightly OT: a simple Benchmark

Re: slightly OT: a simple Benchmark

Re: slightly OT: a simple Benchmark

Re: slightly OT: a simple Benchmark

Re: slightly OT: a simple Benchmark

Re: slightly OT: a simple Benchmark

Re: slightly OT: a simple Benchmark

Re: slightly OT: a simple Benchmark

Re: slightly OT: a simple Benchmark

Re: slightly OT: a simple Benchmark