slightly OT: a simple Benchmark
Re: slightly OT: a simple Benchmark
commodorejohn wrote:
Ooh, now that's a sharp-looking little beast.
Quote:
I don't see a thread for it on the forum here
Quote:
- did you design it?
Quote:
What are the specs
Quote:
...and is there anyplace to get a bare board?
Last edited by BillO on Fri Jul 06, 2018 3:38 am, edited 1 time in total.
Bill
Re: slightly OT: a simple Benchmark
It's an interesting and attractive board - I second the idea of a new thread for it!
Re: slightly OT: a simple Benchmark
Chromatix wrote:
Well, no, because I have no idea how to do that on your hardware. On the Beeb I just built a small BASIC wrapper around it.
In any case, I will probably need to add some I/O writes to detect when it starts and ends a run using an external interval counter.
Bill
Re: slightly OT: a simple Benchmark
I may be veering slightly off topic here...
Agreed, as memory was a dominant part of cost, an interesting question is how much computation you can do once you've bought your memory, which has a particular speed and a particular width, in choosing a CPU wisely.
In the previous discussion on benchmarking using a pi spigot program - results here - I see something a bit surprising. The Amiga with a 68000 at 7MHz is, of course, faster than a 6502 at 2MHz, even though the memory access rate might be a little worse, because it has a wider path to memory, lots of registers which reduce memory traffic, and the registers are wide, and the instruction set includes MUL and DIV. But I'm a little surprised that the ARM1 at 6.67MHz or the ARM2 at 8MHz (with MUL) doesn't make a better showing: the memory path is wider again and it makes efficient use of page mode. Perhaps the 32 bit wide instructions are a hindrance: too much memory bandwidth used for instruction fetch - no cache yet in that generation.
GaBuZoMeu wrote:
The 68000 has a different clock scheme than the 6502. A cycle took 4 clocks (IIRC the TAS instruction took 6). I don't look at the clock speed anymore - it isn't helpful. I think it is better to ask for the required memory spped (access time). Using the required memory speed (e.g. 500ns for a 1 MHz 6502 system) you can compare various clock architectures easily.
In the previous discussion on benchmarking using a pi spigot program - results here - I see something a bit surprising. The Amiga with a 68000 at 7MHz is, of course, faster than a 6502 at 2MHz, even though the memory access rate might be a little worse, because it has a wider path to memory, lots of registers which reduce memory traffic, and the registers are wide, and the instruction set includes MUL and DIV. But I'm a little surprised that the ARM1 at 6.67MHz or the ARM2 at 8MHz (with MUL) doesn't make a better showing: the memory path is wider again and it makes efficient use of page mode. Perhaps the 32 bit wide instructions are a hindrance: too much memory bandwidth used for instruction fetch - no cache yet in that generation.
Re: slightly OT: a simple Benchmark
The answer is simple: ARM2 doesn't have hardware divide, which is used by the algorithm. Implementing divide in software is expensive, even though ARM is much better at it than most with its barrel shifter and conditional execution.
Re: slightly OT: a simple Benchmark
That makes sense, thanks!
Re: slightly OT: a simple Benchmark
BigEd, you may also remember that the upcoming first RISC machines in that time had have a heavy stand. The lack of high speed memory, the requirement of wide memory, and the "so lala" performance triggers a huge discussion about whats right or wrong. The reinvention of the cache memory, a new sort of memory access (bulk access), and the tremendous gain in efficiency of the upcoming compilers changes a lot then. Next was once the memory speed doesn't need to keep up with the CPU speed the CPUs clockspeed could freely increase. A well chosen cache architecture could moderate the slow responding memory - memory bandwidth was never a problem.
Cheers
Arne
Cheers
Arne
Re: slightly OT: a simple Benchmark
It's true, the appearance of small instruction buffers and then small caches made a lot of difference. I remember back in the 80s, in my first job, someone saying that there was a problem in knowing what to do with all the transistors that we'd have as processes shrunk. I thought this odd - why not let the chips get smaller and cheaper - but as it turns out we did use lots more transistors!
BTW, you, and others interested in such questions, might also like some of the discussions over on anycpu. For example, we added a very small cache to our OPC7 minimal 32 bit CPU, and it made a big difference.
http://anycpu.org/forum/viewtopic.php?f=3&t=480
BTW, you, and others interested in such questions, might also like some of the discussions over on anycpu. For example, we added a very small cache to our OPC7 minimal 32 bit CPU, and it made a big difference.
http://anycpu.org/forum/viewtopic.php?f=3&t=480
Re: slightly OT: a simple Benchmark
(Sorry, a better thread on the cache performance of the OPC range of CPUs is this one:
http://anycpu.org/forum/viewtopic.php?p=2751#p2751
)
http://anycpu.org/forum/viewtopic.php?p=2751#p2751
)
Re: slightly OT: a simple Benchmark
As an illustration of how *algorithmic* optimisations can be just as important - or even more so - compared to micro-optimisations or choice of language, here are results from the emulated 3MHz Second Processor running HiBASIC - but with a program which performs trial divisions over a list of discovered primes, rather than over every odd integer. This also serves as a minor demonstration of "structured BASIC" to contrast with the programming style enforced by standard BASIC.
So this version turns out to be slightly slower than my best brute-force BASIC code for the first two cases - which is not so surprising since there are more lines of code to execute per iteration - but gets progressively faster in relative terms as the problem difficulty is increased. Given a big enough problem to work on, this would undoubtedly outperform any assembly implementation of the brute-force algorithm; in case F it is already only 4x slower than my assembly code, and faster than some of the older, larger machines on which the F case has been benched.
Here's the code. Notice I've left out the line numbers - there are no GOTOs or GOSUBs, so they literally don't matter. Load your Beeb emulator, type AUTO, and paste this lot in.
The last line of the above code cleans up the FOR-NEXT stack, which in BBC BASIC is apparently independent of the FN call stack, before returning the newly discovered prime. Without that, you'll quickly end up with "Too many FORs". Note also the use of recursion, calling FNnextprime within itself to update the list of primes to use as divisors.
Code: Select all
A: 3.87
B: 6.02
C: 56.66
D: 133.57
E: 236.45
F: 5263.5
Here's the code. Notice I've left out the line numbers - there are no GOTOs or GOSUBs, so they literally don't matter. Load your Beeb emulator, type AUTO, and paste this lot in.
Code: Select all
DIM PR%(4000) : MAXPR%=4000 : NUMPR%=1 : PR%(0)=3
DATA 20,30,35,50,70,100,0
REPEAT : READ G% : IF G%=0 THEN END
T%=TIME
NUMPR%=1 : LIMPR%=9 : C%=3 : D%=3
REPEAT
C%=FNnextprime(C%)
IF (C%-D%) < G% THEN D%=C%
UNTIL (C%-D%) >= G%
U%=TIME
PRINT C%,D%,(C%-D%),(U%-T%)/100
UNTIL FALSE
REM ---
DEF FNnextprime(S%)
LOCAL I%,N%,POP%
FOR POP%=1 TO 0 STEP 1
FOR N%=S%+2 TO 16000000 STEP 2
IF N% > LIMPR% THEN PR%(NUMPR%)=FNnextprime(PR%(NUMPR%-1)) : LIMPR%=PR%(NUMPR%)*PR%(NUMPR%) : NUMPR%=NUMPR%+1 : IF NUMPR% > MAXPR% THEN STOP
FOR I%=0 TO NUMPR%-1
IF (N% MOD PR%(I%))=0 THEN NEXT N%
NEXT
NEXT POP% : =N%
Re: slightly OT: a simple Benchmark
I'm going to have to digest that... but meantime I pasted it over serial into my Master with the Matchbox running an approx 64MHz Arlet 65C02 design, and got this:
And just for laughs, switched to the internally-mounted PiTubeDirect which runs at about 275MHz...
And then I switched to the native ARM, in effect running BBC Basic on the Pi Zero but with a BBC Micro as the front end host:
Code: Select all
>RUN
907 887 20 0.21
1361 1327 34 0.33
9587 9551 36 3.26
19661 19609 52 7.75
31469 31397 72 13.75
370373 370261 112 308.76
Code: Select all
>CH."primp"
907 887 20 6E-2
1361 1327 34 8E-2
9587 9551 36 0.78
19661 19609 52 1.86
31469 31397 72 3.29
370373 370261 112 73.22
Code: Select all
>CH."primp"
907 887 20 0.01
1361 1327 34 0
9587 9551 36 0.04
19661 19609 52 0.09
31469 31397 72 0.16
370373 370261 112 3.37
>Re: slightly OT: a simple Benchmark
Now, looking at your 68K assembly, I can see a number of improvements to make, without changing the underlying algorithm.
In common with most CISC machines, the 68K updates the condition codes as a side-effect of almost every instruction, so some of your explicit test instructions can be removed. It's also worth paying close attention to the operand sizes, since the original 68000 has only a 16-bit ALU and thus takes longer to perform 32-bit operations.
Also, your subroutine is called only once from one place, so should instead be placed inline to avoid the call overhead.
The above is completely untested. Most likely its performance is still dominated by the DIVU, but now there is less overhead surrounding it and the code is also smaller.
ETA: Thinking about it, I'm reminded that DIVU.W has a quirk which will stop this code from being able to handle primes larger than 3*64K. That is, if the 16-bit result quotient overflows (which will occur if the dividend is at least as large as the divisor shifted left 16 bits), the entire division operation is aborted and the destination/dividend register is left unchanged. Therefore all integers greater than that limit will be detected as composite, even if they're really prime.
Unfortunately there's no simple workaround that is both fast and preserves the character of the algorithm. The nearest I can think of is to use DIVU twice in a multi-precision operation, but DIVU is already by far the most expensive instruction so this would have to be done selectively. The 68020 has a version of DIVU that can produce a quotient the same width as the dividend, but that's not what is already in the bench table.
In common with most CISC machines, the 68K updates the condition codes as a side-effect of almost every instruction, so some of your explicit test instructions can be removed. It's also worth paying close attention to the operand sizes, since the original 68000 has only a 16-bit ALU and thus takes longer to perform 32-bit operations.
Also, your subroutine is called only once from one place, so should instead be placed inline to avoid the call overhead.
Code: Select all
; Enter with D2.L containing minimum gap to search for
PrimeGap:
moveq.l #3,D0 ; high (active) prime
UpdateTarget:
move.l D0,D4
move.l D2,D5
add.l D4,D5 ; D5.L can be compared directly with a newly discovered prime
NextPrime:
addq.l #2,D0
moveq.w #1,D7 ; trial divisor
NextDivisor:
addq.w #2,D7
move.l D0,D1
divu.w D7,D1 ; divides D1.L by D7.W, leaving quotient in D1.W and remainder in high word of D1
move.w D1,D3 ; save quotient
clr.w D1 ; zero quotient field to permit testing the remainder alone
swap.w D1 ; retrieve remainder and test for zero
beq NextPrime ; even division means this number is composite, not prime, don't test any more divisors
cmp.w D7,D3 ; check whether divisor has reached the square root of the prime candidate
bcc NextDivisor ; ie. if the quotient is still greater than the divisor, we should try another divisor
; Here we have discovered a new prime.
cmp.l D0,D5
bcc UpdateTarget ; if gap isn't big enough, try again
rts
; on exit, D0 and D4 contain the first two primes with sufficient gap.
ETA: Thinking about it, I'm reminded that DIVU.W has a quirk which will stop this code from being able to handle primes larger than 3*64K. That is, if the 16-bit result quotient overflows (which will occur if the dividend is at least as large as the divisor shifted left 16 bits), the entire division operation is aborted and the destination/dividend register is left unchanged. Therefore all integers greater than that limit will be detected as composite, even if they're really prime.
Unfortunately there's no simple workaround that is both fast and preserves the character of the algorithm. The nearest I can think of is to use DIVU twice in a multi-precision operation, but DIVU is already by far the most expensive instruction so this would have to be done selectively. The 68020 has a version of DIVU that can produce a quotient the same width as the dividend, but that's not what is already in the bench table.
Re: slightly OT: a simple Benchmark
It might be worth building a new table based on a more efficient algorithm. With that in mind, I've taken the approach from one of my recent posts and reimplemented it in C, with a slightly different approach to the task being measured. The differences from the original are:
- Instead of using every odd integer as trial divisors, a list of actual primes is kept. The complete set of 16-bit primes can easily be stored in about 13KB, which is still small enough for most 8-bit micros, and this permits testing primality relatively efficiently on any 32-bit integer.
- Instead of searching for gaps of particular sizes, the program lists all maximal prime gaps in the 32-bit integer range, and also notes how long it took to reach each one. The times printed are cumulative, so they can be compared with other approaches to the same problem.
The complete run takes about an hour and a half on my Sandy Bridge based MacBook Pro. It also takes several hours on my Raspberry Pi - so it's feasible to run overnight. Small 8-bit and 16-bit machines are unlikely to be able to perform a complete run, but they can be stopped after producing whatever they are capable of. And the algorithm is simple enough to implement in BASIC or assembly, though most BASICs will use double the memory that C does to store the list of small primes.
In the C source, I've provided programmatic hints for implementing an assembly version to use a 32/16 bit divide instruction efficiently. I suspect most C compilers for small machines won't do a good job of that by themselves. Modern CPUs can produce a 32-bit quotient without triggering overflow semantics, so don't need that special assistance.
Sample output:
Noteworthy in the above is that my MBP finds 16-bit primes faster than a BBC Micro's CPU clock ticks!
- Instead of using every odd integer as trial divisors, a list of actual primes is kept. The complete set of 16-bit primes can easily be stored in about 13KB, which is still small enough for most 8-bit micros, and this permits testing primality relatively efficiently on any 32-bit integer.
- Instead of searching for gaps of particular sizes, the program lists all maximal prime gaps in the 32-bit integer range, and also notes how long it took to reach each one. The times printed are cumulative, so they can be compared with other approaches to the same problem.
The complete run takes about an hour and a half on my Sandy Bridge based MacBook Pro. It also takes several hours on my Raspberry Pi - so it's feasible to run overnight. Small 8-bit and 16-bit machines are unlikely to be able to perform a complete run, but they can be stopped after producing whatever they are capable of. And the algorithm is simple enough to implement in BASIC or assembly, though most BASICs will use double the memory that C does to store the list of small primes.
In the C source, I've provided programmatic hints for implementing an assembly version to use a 32/16 bit divide instruction efficiently. I suspect most C compilers for small machines won't do a good job of that by themselves. Modern CPUs can produce a 32-bit quotient without triggering overflow semantics, so don't need that special assistance.
Sample output:
Code: Select all
Maxgap 1: 3 - 2, 0.000 sec
Maxgap 2: 5 - 3, 0.000 sec
Maxgap 4: 11 - 7, 0.000 sec
Maxgap 6: 29 - 23, 0.000 sec
Maxgap 8: 97 - 89, 0.000 sec
Maxgap 14: 127 - 113, 0.000 sec
Maxgap 18: 541 - 523, 0.000 sec
Maxgap 20: 907 - 887, 0.000 sec
Maxgap 22: 1151 - 1129, 0.000 sec
Maxgap 34: 1361 - 1327, 0.000 sec
Maxgap 36: 9587 - 9551, 0.000 sec
Maxgap 44: 15727 - 15683, 0.000 sec
Maxgap 52: 19661 - 19609, 0.001 sec
Maxgap 72: 31469 - 31397, 0.001 sec
Found 6542 16-bit primes in 0.002 secs.
Maxgap 86: 156007 - 155921, 0.006 sec
Maxgap 96: 360749 - 360653, 0.017 sec
Maxgap 112: 370373 - 370261, 0.018 sec
Maxgap 114: 492227 - 492113, 0.026 sec
Maxgap 118: 1349651 - 1349533, 0.100 sec
Maxgap 132: 1357333 - 1357201, 0.100 sec
Maxgap 148: 2010881 - 2010733, 0.165 sec
Maxgap 154: 4652507 - 4652353, 0.508 sec
Maxgap 180: 17051887 - 17051707, 2.941 sec
Maxgap 210: 20831533 - 20831323, 3.873 sec
Maxgap 220: 47326913 - 47326693, 11.763 sec
Maxgap 222: 122164969 - 122164747, 42.470 sec
Maxgap 234: 189695893 - 189695659, 77.317 sec
Maxgap 248: 191913031 - 191912783, 78.493 sec
Maxgap 250: 387096383 - 387096133, 203.349 sec
Maxgap 282: 436273291 - 436273009, 239.639 sec
Maxgap 288: 1294268779 - 1294268491, 1087.127 sec
Maxgap 292: 1453168433 - 1453168141, 1274.887 sec
Maxgap 320: 2300942869 - 2300942549, 2399.465 sec
Maxgap 336: 3842611109 - 3842610773, 4813.874 sec
Reached end of 32-bit integer space, in 5623.393 secs.
- Attachments
-
- primegap.c
- (4.55 KiB) Downloaded 74 times
Re: slightly OT: a simple Benchmark
Thank you for your submissions Chromatix!
The idea of creating a list of primes during seeking them I have had once. But with only 4K RAM
Then hacking this into a machine somewhere in a shop would take more time and might be annoying
I will give it a try in Basic but I assume the enhancements will take effect mostly for problems D, E, and F (and higher).
Nevertheless your "pushing" did take effect
I found my old old 68K "Force ProfiKit 1" !I hope it will work for me again, but I have only one suitable power supply at hands - an open frame SPS. That is a bit too dangerous without a cabinet. And as I only need a solution for the +/-12V (RS-232 section) I think it is easier to use a small DC/DC converter with 5V input using the main 5V supply. But such one I haven't at hands.
Cheers,
Arne
The idea of creating a list of primes during seeking them I have had once. But with only 4K RAM
Nevertheless your "pushing" did take effect
Cheers,
Arne
Re: slightly OT: a simple Benchmark
Or you could wire up an adapter to a standard PC-ATX PSU. That would provide the voltages you need.