Page 2 of 14
Re: slightly OT: a simple Benchmark
Posted: Tue Jul 03, 2018 3:13 pm
by barrym95838
It's a pity that no languages I can think of offer a DIVMOD which can perform the single computation but return both the division and the remainder.
VTL-2, Forth and 68k machine language instantly jump to my mind, FWIW.
Mike B.
Re: slightly OT: a simple Benchmark
Posted: Tue Jul 03, 2018 3:18 pm
by Chromatix
Most CISC machines that offer a divide instruction, in fact, will fill two registers with the quotient and remainder simultaneously. That's how x86 behaves as well; the 6309 is another example.
RISC machines tend to have a 3-operand instruction format and permit producing only one result per instruction. To obtain the remainder on those, it is necessary to divide, multiply, and subtract in sequence. But the ARMv6 core in the original Raspberry Pi doesn't have a divide instruction at all; you have to do it "by hand".
Re: slightly OT: a simple Benchmark
Posted: Tue Jul 03, 2018 3:19 pm
by BigEd
That's good to know, thanks Mike. (And Chromatix, crossed in post)
My habit when writing this sort of trial-division prime-tester is to use the division to see if the divisor has got too large, which saves the square root (or the squaring) and use the remainder to see if the division was exact.
Re: slightly OT: a simple Benchmark
Posted: Tue Jul 03, 2018 3:33 pm
by Chromatix
That is a good strategy if you've got the registers for it. But I think keeping the quotient would be slower than computing the square root separately on the 6502, given that the latter strategy takes only a few instructions in the *outer* loop.
With an 8-bit divisor I only need 5 instructions per bit to produce the remainder, and I'd have to add at least one more (probably a 5-cycle ROL zp) to produce the quotient from that. When we get into 16-bit divisors (which are needed to complete the testing of 24-bit primes), things get even more complicated. All this is in the inner loop, which is executed several times on average per outer loop.
Re: slightly OT: a simple Benchmark
Posted: Tue Jul 03, 2018 3:34 pm
by BigEd
It's a fair point that the square root can be computed once - not sure why I never realised that!
Re: slightly OT: a simple Benchmark
Posted: Tue Jul 03, 2018 4:33 pm
by GaBuZoMeu
Wow - what a resonance
When I started to build this list everyone was just judging the raw performance of a (home)computer. And there was only Basic - at least for home computers. So a small program with a moderate amount of required runtime was exactly what could give an idea of how fast ("good") a machine was. In this way the "specs" of the first seven entries are fairly comparable. The eighth entry is already an attempt to tune the program to let the particular machine (and its language implementation) look a little better. This is of course counterproductive in terms of comparability.
When home computers turns to personal computers (having an OS, disks, etc.) another aspect was to compare various languages (either interpreters or compilers) against each other. Of course one can still compare the results of two machines, but now a more careful interpretation of these numbers is required. At least the used algorithm should be the same (as long as this is possible for the chosen language).
That (and because the machines getting faster and faster) is the reason why I never tried to use advanced algorithms. Don't misunderstood me, I don't want to disqualify alternative algorithms! I really like to read the various and very very interesting suggestions here e.g. the Wheel (!) or Chromatix' attempt to avoid squaring and multiplication at all - great! But of course all these kind of advanced algorithms are only comparable if one runs them on the very same machine, and then they could (perhaps) give one an idea of what is feasible on that particular machine.
At last - I am waiting for the scores of VAX and Cray - don't forget to specify the machine, compiler, program, and what else precisely.
P.S.:
I am going to edit the table this evening. John Wests LUA results will be appended. Great! Never before I could so easily follow a LUA program. THX.
Chromatix: your first numbers for the BBC will entered as well - perhaps you can run one of these C/Pascal "V2" versions that avoid using SQR() and use MOD and DIV?
Re: slightly OT: a simple Benchmark
Posted: Tue Jul 03, 2018 4:45 pm
by BigEd
I've just spotted you have a T800 transputer in there - excellent!
You just might be interested to try
pypy on the Pi - sometimes it gives a very impressive speed improvement.
Re: slightly OT: a simple Benchmark
Posted: Tue Jul 03, 2018 4:46 pm
by Chromatix
I've almost got my assembly version finished, so I'll run that on the same emulator to give a sense of the interpreter's efficiency (or otherwise). I'm using the cc65 toolchain for that, so I might try to get a C version running as well, but no promises; cc65 doesn't have direct support for Acorn MOS, so I'll have to bodge it.
Be sure to note that my numbers are from a Second Processor running at twice the clock speed of the BBC Micro itself.
My assembly implementation now has *three* separate division routines in it - for 16/8, 24/8 and 24/16 bit operands. Separating the latter two makes both considerably simpler and faster, which is important since even with larger prime candidates, most are eliminated with *small* trial divisions.
Incidentally, many C compilers will optimise paired division and remainder operations so that only one division is actually performed. So it doesn't really matter that there is no explicit "divmod" operator.
Re: slightly OT: a simple Benchmark
Posted: Tue Jul 03, 2018 5:20 pm
by GaBuZoMeu
I've just spotted you have a T800 transputer in there - excellent!
Back in 199X this was a dream of a machine. I once had two of them running but one is gone
In their Technical Application Handbook was an article about benchmarks. I found this remarkable technical note on the net:
Lies, damned lies and benchmarks
You just might be interested to try
pypy on the Pi - sometimes it gives a very impressive speed improvement.
I have used pypy with tremendous results. Whether the given numbers are already coming from using pypy I can't recall, but I think I would have mentioned that instead of referring to Python 3.4.2. The biggest problem this moment: I couldn't find my RasPi
Well, for all of you who are going to squeeze out every cycle and every bit of their machines: try to find the first gap that spans across 222. The numbers end with xxx969 and xxx747. The T800 took 11264,5781 seconds for that
(To be fair: the algorithm was slightly changed, and the program just start with some low number and continues to seek for larger and larger gaps):
Code: Select all
PROC INMOS.ENTRY.POINT (CHAN OF ANY from.link, to.link)
-- DO NOT CHANGE PROCEDURE NAME, IT IS IMPORTANT.
#USE "\toolset\flibs" -- File server library
#USE "\toolset\realio" -- IO-conversions
PROC is.it.prim(VAL INT testnum, BOOL result)
VAL step IS 2:
VAL init IS 3:
INT i:
SEQ
i := init
WHILE (( (i TIMES i) < testnum) AND (( testnum REM i) <> 0))
i := i PLUS step
result := (testnum < (i TIMES i))
:
VAL Close.Option IS 0:
VAL f.ok IS 0:
INT result, screen.id, key.id:
SEQ
open.output.stream(from.link, to.link, 0, screen.id, result)
open.input.stream(from.link, to.link, 0, key.id, result)
IF
result = f.ok
[20]BYTE count.str:
INT len, loprim, hiprim, diff, lauf, time1, time2:
TIMER clock:
BOOL error,prim:
VAL step IS 2:
VAL limit IS ((MOSTPOS INT) >> 1):
SEQ
SEQ a = 0 FOR 20
count.str[a] := ' ' -- init count.string
write.block(from.link, to.link, screen.id,
"minimale Primzahldifferenz ? :", len, result)
result := f.ok - 1
WHILE result <> f.ok
read.block(from.link, to.link, key.id, 20, count.str, len, result)
STRINGTOINT(error, diff, [count.str FROM 0 FOR len])
clock ? time1
loprim := 3
hiprim := 7
WHILE (diff < 300)
SEQ
WHILE ((hiprim MINUS loprim) < diff)
SEQ
lauf := hiprim MINUS loprim
WHILE (lauf > 0)
SEQ
is.it.prim(loprim PLUS lauf,prim)
IF
prim
SEQ
loprim := lauf PLUS loprim
hiprim := loprim
lauf := 0
TRUE
lauf := lauf MINUS step
IF
hiprim = loprim
SEQ
hiprim := (loprim PLUS diff) MINUS step
TRUE
SEQ
lauf := step
hiprim := hiprim PLUS step
WHILE (lauf > 0)
SEQ
is.it.prim(hiprim,prim)
IF
prim
SEQ
lauf := 0
TRUE
hiprim := hiprim PLUS step
clock ? time2
SEQ a = 0 FOR 20
count.str[a] := ' ' -- init count.string
INTTOSTRING(len, count.str, hiprim)
write.block(from.link, to.link, screen.id,
count.str, len, result)
SEQ a = 0 FOR 20
count.str[a] := ' ' -- init count.string
INTTOSTRING(len, count.str, loprim)
write.block(from.link, to.link, screen.id,
count.str, len, result)
SEQ a = 0 FOR 20
count.str[a] := ' ' -- init count.string
INTTOSTRING(len, count.str, (hiprim-loprim))
write.block(from.link, to.link, screen.id,
[count.str FROM 0 FOR len], len, result)
time2 := time2 - time1
SEQ a = 0 FOR 20
count.str[a] := ' ' -- init count.string
REAL32TOSTRING(len, count.str,
(( REAL32 ROUND time2 ) / 15625.0 (REAL32)), 5, 5)
-- INTTOSTRING(len, count.str, time2)
write.block(from.link, to.link, screen.id,
[count.str FROM 0 FOR len], len, result)
write.block(from.link, to.link, screen.id,
" Sek. *N", len, result)
diff := (hiprim-loprim)+2
terminate.filer(from.link, to.link, result)
:
Re: slightly OT: a simple Benchmark
Posted: Tue Jul 03, 2018 6:43 pm
by resman
Porting the C/Pascal version to PLASMA 2.0 and running on a 128K Apple //e (so the JIT compiler gets involved):
Code: Select all
include "inc/cmdsys.plh"
var a, b, i,loprim,hiprim
def getnum
var str, num
num = 0
str = gets(':'|$80) + 1
while ^str >= '0' and ^str <= '9'
num = num * 10 + (^str - '0')
str++
loop
return num
end
def prim(x)
var i
i = 3
while i*i < x and x % i; i = i + 2; loop
return x < i*i
end
puts("Geben Sie den oberen Grenzwert ein")
a = getnum
puts("Geben Sie die minimale Differenz ein")
b = getnum
loprim, hiprim, i = 1, 1, 3
while i <= a and hiprim-loprim < b
if prim(i)
loprim = hiprim
hiprim = i
fin
i = i + 2
loop
if hiprim-loprim < b
puts("keine Loesung gefunden !")
else
puti(hiprim); putc(' '); puti(loprim); putc(' '); puti(hiprim-loprim)
fin
putln
done
Yields:
Not bad for a 1 MHz 6502. The JIT compiler will convert the prim() function into machine code after 44 calls using the default JIT values. Compiling it immediately by tuning the JIT compiler gets it a smidge faster, but not appreciably.
Dave...
Re: slightly OT: a simple Benchmark
Posted: Tue Jul 03, 2018 6:46 pm
by rwiker
It's a pity that no languages I can think of offer a DIVMOD which can perform the single computation but return both the division and the remainder.
VTL-2, Forth and 68k machine language instantly jump to my mind, FWIW.
Mike B.
Common Lisp too, in the form of #'floor, #'ceiling, #'round and #'truncate... and possibly others:
E.g,
... which could be used like this:
Code: Select all
(multiple-value-bind (div mod)
(truncate 5 2)
(format t "Div: ~a, Mod: ~a~%" div mod))
Re: slightly OT: a simple Benchmark
Posted: Tue Jul 03, 2018 8:51 pm
by GaBuZoMeu
@rwiker: is there anything LISP can't do ?
@resman: very impressive numbers, I assume there is no much room left for assembly. Thanks Dave!
@Chromatix: I hope my abbreviations according to BBC/HiBASIC are acceptable.
I found something strange:
I rerun the original version ("BasBenc1") and verified the same speed. Then I tried to "cache" the SQR() with
30 SQ=INT(SQR(C))+1
35 FOR D=3 TO SQ STEP 2
instead of
30 FOR D=3 TO SQR(C) STEP 2
To my surprise the program runs slower !??
Re: slightly OT: a simple Benchmark
Posted: Tue Jul 03, 2018 9:59 pm
by Chromatix
Well, for all of you who are going to squeeze out every cycle and every bit of their machines: try to find the first gap that spans across 222. The numbers end with xxx969 and xxx747. The T800 took 11264,5781 seconds for that
Well, I took that as a challenge and plugged in my Raspberry Pi 2B (since I happen to have one). A short C implementation later, compiled specifically for Cortex-A7 (which, unlike many older ARM cores, does have hardware divide)...
Code: Select all
A <1ms
B <1ms
C 2ms
D 4ms
E 7ms
F 183ms
G (gap 222) - 122164747, 122164969, 222, 627.332 seconds
Code: Select all
#include <stdio.h>
#include <stdint.h>
#include <sys/time.h>
void findGap(const uint8_t target)
{
struct timeval begin,end;
uint32_t c=3, z=3, l=2, ll=4;
gettimeofday(&begin, NULL);
for(;;) {
if(c > ll) {
ll += l+l+1;
l++;
}
for(uint32_t d=3; d < l; d += 2)
if(!(c%d))
goto notPrime;
if(c-z >= target)
break;
z = c;
notPrime:
c += 2;
}
gettimeofday(&end, NULL);
printf("%u\t%u\t%u\t%.3f\n", z, c, c-z, (end.tv_sec - begin.tv_sec) + 0.000001 * (end.tv_usec - begin.tv_usec));
}
int main(void)
{
findGap(20);
findGap(30);
findGap(35);
findGap(50);
findGap(70);
findGap(100);
findGap(222);
return 0;
}
Meanwhile, my old PowerBook G4, with a 1.5GHz PowerPC 7447A, took 133ms over case F, and 497.957 seconds over case G.
My Sandy Bridge based MacBook Pro managed case F in 40ms, case G in 148.254 seconds.
Re: slightly OT: a simple Benchmark
Posted: Tue Jul 03, 2018 11:27 pm
by commodorejohn
(Drat...even the POSIX time functions don't track to sub-second accuracy on NetBSD/vax 6.1.5, which means that all but the last one or two example values are effectively unmeasurable...)
You may modify the input. Skip it and work with constants. Then run it thousand times or so. That should sum up to something detectable.
Good point

Funnily enough, I also had to modify the
long version to use
long long because of the ambiguity with C type sizes (
long and
int are both 32-bit under GCC for VAX, apparently.) Here's the results (problems A-E run 1000x apiece, problem F run 100x):
Code: Select all
A B C D E F
int 2.48 4.17 53.61 139.26 262.22 7653.10
long* 13.24 22.61 312.63 826.16 1570.51 46712.30
* (long = long long)
MicroVAX 3100/90, NetBSD 6.1.5 - all times in milliseconds per run.
Re: slightly OT: a simple Benchmark
Posted: Wed Jul 04, 2018 2:12 am
by barrym95838
30 SQ=INT(SQR(C))+1
35 FOR D=3 TO SQ STEP 2
instead of
30 FOR D=3 TO SQR(C) STEP 2
To my surprise the program runs slower !??
Erm ... aren't you adding an extra iteration about 50% of the time with that +1? I can't think of any BASICs that re-calculate the limit on every iteration anyhow ...
Mike B.