Mike B.
slightly OT: a simple Benchmark
- barrym95838
- Posts: 2056
- Joined: 30 Jun 2013
- Location: Sacramento, CA, USA
Re: slightly OT: a simple Benchmark
BigEd wrote:
It's a pity that no languages I can think of offer a DIVMOD which can perform the single computation but return both the division and the remainder.
Mike B.
Re: slightly OT: a simple Benchmark
Most CISC machines that offer a divide instruction, in fact, will fill two registers with the quotient and remainder simultaneously. That's how x86 behaves as well; the 6309 is another example.
RISC machines tend to have a 3-operand instruction format and permit producing only one result per instruction. To obtain the remainder on those, it is necessary to divide, multiply, and subtract in sequence. But the ARMv6 core in the original Raspberry Pi doesn't have a divide instruction at all; you have to do it "by hand".
RISC machines tend to have a 3-operand instruction format and permit producing only one result per instruction. To obtain the remainder on those, it is necessary to divide, multiply, and subtract in sequence. But the ARMv6 core in the original Raspberry Pi doesn't have a divide instruction at all; you have to do it "by hand".
Last edited by Chromatix on Tue Jul 03, 2018 3:19 pm, edited 1 time in total.
Re: slightly OT: a simple Benchmark
That's good to know, thanks Mike. (And Chromatix, crossed in post)
My habit when writing this sort of trial-division prime-tester is to use the division to see if the divisor has got too large, which saves the square root (or the squaring) and use the remainder to see if the division was exact.
My habit when writing this sort of trial-division prime-tester is to use the division to see if the divisor has got too large, which saves the square root (or the squaring) and use the remainder to see if the division was exact.
Re: slightly OT: a simple Benchmark
That is a good strategy if you've got the registers for it. But I think keeping the quotient would be slower than computing the square root separately on the 6502, given that the latter strategy takes only a few instructions in the *outer* loop.
With an 8-bit divisor I only need 5 instructions per bit to produce the remainder, and I'd have to add at least one more (probably a 5-cycle ROL zp) to produce the quotient from that. When we get into 16-bit divisors (which are needed to complete the testing of 24-bit primes), things get even more complicated. All this is in the inner loop, which is executed several times on average per outer loop.
With an 8-bit divisor I only need 5 instructions per bit to produce the remainder, and I'd have to add at least one more (probably a 5-cycle ROL zp) to produce the quotient from that. When we get into 16-bit divisors (which are needed to complete the testing of 24-bit primes), things get even more complicated. All this is in the inner loop, which is executed several times on average per outer loop.
Re: slightly OT: a simple Benchmark
It's a fair point that the square root can be computed once - not sure why I never realised that!
Re: slightly OT: a simple Benchmark
Wow - what a resonance
When I started to build this list everyone was just judging the raw performance of a (home)computer. And there was only Basic - at least for home computers. So a small program with a moderate amount of required runtime was exactly what could give an idea of how fast ("good") a machine was. In this way the "specs" of the first seven entries are fairly comparable. The eighth entry is already an attempt to tune the program to let the particular machine (and its language implementation) look a little better. This is of course counterproductive in terms of comparability.
When home computers turns to personal computers (having an OS, disks, etc.) another aspect was to compare various languages (either interpreters or compilers) against each other. Of course one can still compare the results of two machines, but now a more careful interpretation of these numbers is required. At least the used algorithm should be the same (as long as this is possible for the chosen language).
That (and because the machines getting faster and faster) is the reason why I never tried to use advanced algorithms. Don't misunderstood me, I don't want to disqualify alternative algorithms! I really like to read the various and very very interesting suggestions here e.g. the Wheel (!) or Chromatix' attempt to avoid squaring and multiplication at all - great! But of course all these kind of advanced algorithms are only comparable if one runs them on the very same machine, and then they could (perhaps) give one an idea of what is feasible on that particular machine.
At last - I am waiting for the scores of VAX and Cray - don't forget to specify the machine, compiler, program, and what else precisely.
P.S.:
I am going to edit the table this evening. John Wests LUA results will be appended. Great! Never before I could so easily follow a LUA program. THX.
Chromatix: your first numbers for the BBC will entered as well - perhaps you can run one of these C/Pascal "V2" versions that avoid using SQR() and use MOD and DIV?
When I started to build this list everyone was just judging the raw performance of a (home)computer. And there was only Basic - at least for home computers. So a small program with a moderate amount of required runtime was exactly what could give an idea of how fast ("good") a machine was. In this way the "specs" of the first seven entries are fairly comparable. The eighth entry is already an attempt to tune the program to let the particular machine (and its language implementation) look a little better. This is of course counterproductive in terms of comparability.
When home computers turns to personal computers (having an OS, disks, etc.) another aspect was to compare various languages (either interpreters or compilers) against each other. Of course one can still compare the results of two machines, but now a more careful interpretation of these numbers is required. At least the used algorithm should be the same (as long as this is possible for the chosen language).
That (and because the machines getting faster and faster) is the reason why I never tried to use advanced algorithms. Don't misunderstood me, I don't want to disqualify alternative algorithms! I really like to read the various and very very interesting suggestions here e.g. the Wheel (!) or Chromatix' attempt to avoid squaring and multiplication at all - great! But of course all these kind of advanced algorithms are only comparable if one runs them on the very same machine, and then they could (perhaps) give one an idea of what is feasible on that particular machine.
At last - I am waiting for the scores of VAX and Cray - don't forget to specify the machine, compiler, program, and what else precisely.
P.S.:
I am going to edit the table this evening. John Wests LUA results will be appended. Great! Never before I could so easily follow a LUA program. THX.
Chromatix: your first numbers for the BBC will entered as well - perhaps you can run one of these C/Pascal "V2" versions that avoid using SQR() and use MOD and DIV?
Re: slightly OT: a simple Benchmark
I've just spotted you have a T800 transputer in there - excellent!
You just might be interested to try pypy on the Pi - sometimes it gives a very impressive speed improvement.
You just might be interested to try pypy on the Pi - sometimes it gives a very impressive speed improvement.
Re: slightly OT: a simple Benchmark
I've almost got my assembly version finished, so I'll run that on the same emulator to give a sense of the interpreter's efficiency (or otherwise). I'm using the cc65 toolchain for that, so I might try to get a C version running as well, but no promises; cc65 doesn't have direct support for Acorn MOS, so I'll have to bodge it.
Be sure to note that my numbers are from a Second Processor running at twice the clock speed of the BBC Micro itself.
My assembly implementation now has *three* separate division routines in it - for 16/8, 24/8 and 24/16 bit operands. Separating the latter two makes both considerably simpler and faster, which is important since even with larger prime candidates, most are eliminated with *small* trial divisions.
Incidentally, many C compilers will optimise paired division and remainder operations so that only one division is actually performed. So it doesn't really matter that there is no explicit "divmod" operator.
Be sure to note that my numbers are from a Second Processor running at twice the clock speed of the BBC Micro itself.
My assembly implementation now has *three* separate division routines in it - for 16/8, 24/8 and 24/16 bit operands. Separating the latter two makes both considerably simpler and faster, which is important since even with larger prime candidates, most are eliminated with *small* trial divisions.
Incidentally, many C compilers will optimise paired division and remainder operations so that only one division is actually performed. So it doesn't really matter that there is no explicit "divmod" operator.
Re: slightly OT: a simple Benchmark
BigEd wrote:
I've just spotted you have a T800 transputer in there - excellent!
In their Technical Application Handbook was an article about benchmarks. I found this remarkable technical note on the net: Lies, damned lies and benchmarks
BigEd wrote:
You just might be interested to try pypy on the Pi - sometimes it gives a very impressive speed improvement.
Well, for all of you who are going to squeeze out every cycle and every bit of their machines: try to find the first gap that spans across 222. The numbers end with xxx969 and xxx747. The T800 took 11264,5781 seconds for that
(To be fair: the algorithm was slightly changed, and the program just start with some low number and continues to seek for larger and larger gaps):
Code: Select all
PROC INMOS.ENTRY.POINT (CHAN OF ANY from.link, to.link)
-- DO NOT CHANGE PROCEDURE NAME, IT IS IMPORTANT.
#USE "\toolset\flibs" -- File server library
#USE "\toolset\realio" -- IO-conversions
PROC is.it.prim(VAL INT testnum, BOOL result)
VAL step IS 2:
VAL init IS 3:
INT i:
SEQ
i := init
WHILE (( (i TIMES i) < testnum) AND (( testnum REM i) <> 0))
i := i PLUS step
result := (testnum < (i TIMES i))
:
VAL Close.Option IS 0:
VAL f.ok IS 0:
INT result, screen.id, key.id:
SEQ
open.output.stream(from.link, to.link, 0, screen.id, result)
open.input.stream(from.link, to.link, 0, key.id, result)
IF
result = f.ok
[20]BYTE count.str:
INT len, loprim, hiprim, diff, lauf, time1, time2:
TIMER clock:
BOOL error,prim:
VAL step IS 2:
VAL limit IS ((MOSTPOS INT) >> 1):
SEQ
SEQ a = 0 FOR 20
count.str[a] := ' ' -- init count.string
write.block(from.link, to.link, screen.id,
"minimale Primzahldifferenz ? :", len, result)
result := f.ok - 1
WHILE result <> f.ok
read.block(from.link, to.link, key.id, 20, count.str, len, result)
STRINGTOINT(error, diff, [count.str FROM 0 FOR len])
clock ? time1
loprim := 3
hiprim := 7
WHILE (diff < 300)
SEQ
WHILE ((hiprim MINUS loprim) < diff)
SEQ
lauf := hiprim MINUS loprim
WHILE (lauf > 0)
SEQ
is.it.prim(loprim PLUS lauf,prim)
IF
prim
SEQ
loprim := lauf PLUS loprim
hiprim := loprim
lauf := 0
TRUE
lauf := lauf MINUS step
IF
hiprim = loprim
SEQ
hiprim := (loprim PLUS diff) MINUS step
TRUE
SEQ
lauf := step
hiprim := hiprim PLUS step
WHILE (lauf > 0)
SEQ
is.it.prim(hiprim,prim)
IF
prim
SEQ
lauf := 0
TRUE
hiprim := hiprim PLUS step
clock ? time2
SEQ a = 0 FOR 20
count.str[a] := ' ' -- init count.string
INTTOSTRING(len, count.str, hiprim)
write.block(from.link, to.link, screen.id,
count.str, len, result)
SEQ a = 0 FOR 20
count.str[a] := ' ' -- init count.string
INTTOSTRING(len, count.str, loprim)
write.block(from.link, to.link, screen.id,
count.str, len, result)
SEQ a = 0 FOR 20
count.str[a] := ' ' -- init count.string
INTTOSTRING(len, count.str, (hiprim-loprim))
write.block(from.link, to.link, screen.id,
[count.str FROM 0 FOR len], len, result)
time2 := time2 - time1
SEQ a = 0 FOR 20
count.str[a] := ' ' -- init count.string
REAL32TOSTRING(len, count.str,
(( REAL32 ROUND time2 ) / 15625.0 (REAL32)), 5, 5)
-- INTTOSTRING(len, count.str, time2)
write.block(from.link, to.link, screen.id,
[count.str FROM 0 FOR len], len, result)
write.block(from.link, to.link, screen.id,
" Sek. *N", len, result)
diff := (hiprim-loprim)+2
terminate.filer(from.link, to.link, result)
:
Re: slightly OT: a simple Benchmark
Porting the C/Pascal version to PLASMA 2.0 and running on a 128K Apple //e (so the JIT compiler gets involved):
Yields:
Not bad for a 1 MHz 6502. The JIT compiler will convert the prim() function into machine code after 44 calls using the default JIT values. Compiling it immediately by tuning the JIT compiler gets it a smidge faster, but not appreciably.
Dave...
Code: Select all
include "inc/cmdsys.plh"
var a, b, i,loprim,hiprim
def getnum
var str, num
num = 0
str = gets(':'|$80) + 1
while ^str >= '0' and ^str <= '9'
num = num * 10 + (^str - '0')
str++
loop
return num
end
def prim(x)
var i
i = 3
while i*i < x and x % i; i = i + 2; loop
return x < i*i
end
puts("Geben Sie den oberen Grenzwert ein")
a = getnum
puts("Geben Sie die minimale Differenz ein")
b = getnum
loprim, hiprim, i = 1, 1, 3
while i <= a and hiprim-loprim < b
if prim(i)
loprim = hiprim
hiprim = i
fin
i = i + 2
loop
if hiprim-loprim < b
puts("keine Loesung gefunden !")
else
puti(hiprim); putc(' '); puti(loprim); putc(' '); puti(hiprim-loprim)
fin
putln
done
Code: Select all
A: 5.72s
B: 9.30s
C: 118.36s
Dave...
Last edited by resman on Tue Jul 03, 2018 6:48 pm, edited 1 time in total.
Re: slightly OT: a simple Benchmark
barrym95838 wrote:
BigEd wrote:
It's a pity that no languages I can think of offer a DIVMOD which can perform the single computation but return both the division and the remainder.
Mike B.
E.g,
Code: Select all
(truncate 5 2)
2
1
Code: Select all
(multiple-value-bind (div mod)
(truncate 5 2)
(format t "Div: ~a, Mod: ~a~%" div mod))
Re: slightly OT: a simple Benchmark
@rwiker: is there anything LISP can't do ?
@resman: very impressive numbers, I assume there is no much room left for assembly. Thanks Dave!
@Chromatix: I hope my abbreviations according to BBC/HiBASIC are acceptable.
I found something strange:
I rerun the original version ("BasBenc1") and verified the same speed. Then I tried to "cache" the SQR() with
30 SQ=INT(SQR(C))+1
35 FOR D=3 TO SQ STEP 2
instead of
30 FOR D=3 TO SQR(C) STEP 2
To my surprise the program runs slower !??
@resman: very impressive numbers, I assume there is no much room left for assembly. Thanks Dave!
@Chromatix: I hope my abbreviations according to BBC/HiBASIC are acceptable.
I found something strange:
I rerun the original version ("BasBenc1") and verified the same speed. Then I tried to "cache" the SQR() with
30 SQ=INT(SQR(C))+1
35 FOR D=3 TO SQ STEP 2
instead of
30 FOR D=3 TO SQR(C) STEP 2
To my surprise the program runs slower !??
Re: slightly OT: a simple Benchmark
GaBuZoMeu wrote:
Well, for all of you who are going to squeeze out every cycle and every bit of their machines: try to find the first gap that spans across 222. The numbers end with xxx969 and xxx747. The T800 took 11264,5781 seconds for that
Code: Select all
A <1ms
B <1ms
C 2ms
D 4ms
E 7ms
F 183ms
G (gap 222) - 122164747, 122164969, 222, 627.332 seconds
Code: Select all
#include <stdio.h>
#include <stdint.h>
#include <sys/time.h>
void findGap(const uint8_t target)
{
struct timeval begin,end;
uint32_t c=3, z=3, l=2, ll=4;
gettimeofday(&begin, NULL);
for(;;) {
if(c > ll) {
ll += l+l+1;
l++;
}
for(uint32_t d=3; d < l; d += 2)
if(!(c%d))
goto notPrime;
if(c-z >= target)
break;
z = c;
notPrime:
c += 2;
}
gettimeofday(&end, NULL);
printf("%u\t%u\t%u\t%.3f\n", z, c, c-z, (end.tv_sec - begin.tv_sec) + 0.000001 * (end.tv_usec - begin.tv_usec));
}
int main(void)
{
findGap(20);
findGap(30);
findGap(35);
findGap(50);
findGap(70);
findGap(100);
findGap(222);
return 0;
}
My Sandy Bridge based MacBook Pro managed case F in 40ms, case G in 148.254 seconds.
- commodorejohn
- Posts: 299
- Joined: 21 Jan 2016
- Location: Placerville, CA
- Contact:
Re: slightly OT: a simple Benchmark
GaBuZoMeu wrote:
commodorejohn wrote:
(Drat...even the POSIX time functions don't track to sub-second accuracy on NetBSD/vax 6.1.5, which means that all but the last one or two example values are effectively unmeasurable...)
You may modify the input. Skip it and work with constants. Then run it thousand times or so. That should sum up to something detectable.
Code: Select all
A B C D E F
int 2.48 4.17 53.61 139.26 262.22 7653.10
long* 13.24 22.61 312.63 826.16 1570.51 46712.30MicroVAX 3100/90, NetBSD 6.1.5 - all times in milliseconds per run.
- barrym95838
- Posts: 2056
- Joined: 30 Jun 2013
- Location: Sacramento, CA, USA
Re: slightly OT: a simple Benchmark
GaBuZoMeu wrote:
30 SQ=INT(SQR(C))+1
35 FOR D=3 TO SQ STEP 2
instead of
30 FOR D=3 TO SQR(C) STEP 2
To my surprise the program runs slower !??
35 FOR D=3 TO SQ STEP 2
instead of
30 FOR D=3 TO SQR(C) STEP 2
To my surprise the program runs slower !??
Mike B.