slightly OT: a simple Benchmark

barrym95838 · Post by **barrym95838** » Tue Jul 03, 2018 3:13 pm

BigEd wrote:

It's a pity that no languages I can think of offer a DIVMOD which can perform the single computation but return both the division and the remainder.

VTL-2, Forth and 68k machine language instantly jump to my mind, FWIW.

Mike B.

Chromatix · Post by **Chromatix** » Tue Jul 03, 2018 3:18 pm

Most CISC machines that offer a divide instruction, in fact, will fill two registers with the quotient and remainder simultaneously. That's how x86 behaves as well; the 6309 is another example.

RISC machines tend to have a 3-operand instruction format and permit producing only one result per instruction. To obtain the remainder on those, it is necessary to divide, multiply, and subtract in sequence. But the ARMv6 core in the original Raspberry Pi doesn't have a divide instruction at all; you have to do it "by hand".

BigEd · Post by **BigEd** » Tue Jul 03, 2018 3:19 pm

That's good to know, thanks Mike. (And Chromatix, crossed in post)

My habit when writing this sort of trial-division prime-tester is to use the division to see if the divisor has got too large, which saves the square root (or the squaring) and use the remainder to see if the division was exact.

Chromatix · Post by **Chromatix** » Tue Jul 03, 2018 3:33 pm

That is a good strategy if you've got the registers for it. But I think keeping the quotient would be slower than computing the square root separately on the 6502, given that the latter strategy takes only a few instructions in the *outer* loop.

With an 8-bit divisor I only need 5 instructions per bit to produce the remainder, and I'd have to add at least one more (probably a 5-cycle ROL zp) to produce the quotient from that. When we get into 16-bit divisors (which are needed to complete the testing of 24-bit primes), things get even more complicated. All this is in the inner loop, which is executed several times on average per outer loop.

BigEd · Post by **BigEd** » Tue Jul 03, 2018 3:34 pm

It's a fair point that the square root can be computed once - not sure why I never realised that!

GaBuZoMeu · Post by **GaBuZoMeu** » Tue Jul 03, 2018 4:33 pm

Wow - what a resonance

When I started to build this list everyone was just judging the raw performance of a (home)computer. And there was only Basic - at least for home computers. So a small program with a moderate amount of required runtime was exactly what could give an idea of how fast ("good") a machine was. In this way the "specs" of the first seven entries are fairly comparable. The eighth entry is already an attempt to tune the program to let the particular machine (and its language implementation) look a little better. This is of course counterproductive in terms of comparability.

When home computers turns to personal computers (having an OS, disks, etc.) another aspect was to compare various languages (either interpreters or compilers) against each other. Of course one can still compare the results of two machines, but now a more careful interpretation of these numbers is required. At least the used algorithm should be the same (as long as this is possible for the chosen language).

That (and because the machines getting faster and faster) is the reason why I never tried to use advanced algorithms. Don't misunderstood me, I don't want to disqualify alternative algorithms! I really like to read the various and very very interesting suggestions here e.g. the Wheel (!) or Chromatix' attempt to avoid squaring and multiplication at all - great! But of course all these kind of advanced algorithms are only comparable if one runs them on the very same machine, and then they could (perhaps) give one an idea of what is feasible on that particular machine.

At last - I am waiting for the scores of VAX and Cray - don't forget to specify the machine, compiler, program, and what else precisely.

P.S.:
I am going to edit the table this evening. John Wests LUA results will be appended. Great! Never before I could so easily follow a LUA program. THX.
Chromatix: your first numbers for the BBC will entered as well - perhaps you can run one of these C/Pascal "V2" versions that avoid using SQR() and use MOD and DIV?

BigEd · Post by **BigEd** » Tue Jul 03, 2018 4:45 pm

I've just spotted you have a T800 transputer in there - excellent!

You just might be interested to try pypy on the Pi - sometimes it gives a very impressive speed improvement.

Chromatix · Post by **Chromatix** » Tue Jul 03, 2018 4:46 pm

I've almost got my assembly version finished, so I'll run that on the same emulator to give a sense of the interpreter's efficiency (or otherwise). I'm using the cc65 toolchain for that, so I might try to get a C version running as well, but no promises; cc65 doesn't have direct support for Acorn MOS, so I'll have to bodge it.

Be sure to note that my numbers are from a Second Processor running at twice the clock speed of the BBC Micro itself.

My assembly implementation now has *three* separate division routines in it - for 16/8, 24/8 and 24/16 bit operands. Separating the latter two makes both considerably simpler and faster, which is important since even with larger prime candidates, most are eliminated with *small* trial divisions.

Incidentally, many C compilers will optimise paired division and remainder operations so that only one division is actually performed. So it doesn't really matter that there is no explicit "divmod" operator.

GaBuZoMeu · Post by **GaBuZoMeu** » Tue Jul 03, 2018 5:20 pm

BigEd wrote:

I've just spotted you have a T800 transputer in there - excellent!

Back in 199X this was a dream of a machine. I once had two of them running but one is gone

In their Technical Application Handbook was an article about benchmarks. I found this remarkable technical note on the net: Lies, damned lies and benchmarks

BigEd wrote:

You just might be interested to try pypy on the Pi - sometimes it gives a very impressive speed improvement.

I have used pypy with tremendous results. Whether the given numbers are already coming from using pypy I can't recall, but I think I would have mentioned that instead of referring to Python 3.4.2. The biggest problem this moment: I couldn't find my RasPi

Well, for all of you who are going to squeeze out every cycle and every bit of their machines: try to find the first gap that spans across 222. The numbers end with xxx969 and xxx747. The T800 took 11264,5781 seconds for that

(To be fair: the algorithm was slightly changed, and the program just start with some low number and continues to seek for larger and larger gaps):

Code: Select all

PROC INMOS.ENTRY.POINT (CHAN OF ANY from.link, to.link)
  -- DO NOT CHANGE PROCEDURE NAME, IT IS IMPORTANT.

  #USE "\toolset\flibs"      -- File server library
  #USE "\toolset\realio"     -- IO-conversions

  PROC is.it.prim(VAL INT testnum, BOOL result)
    VAL step IS 2:
    VAL init IS 3:
    INT i:
    SEQ
      i := init
      WHILE (( (i TIMES i) < testnum) AND (( testnum REM i) <> 0))
        i := i PLUS step
      result := (testnum < (i TIMES i))
  :

  VAL Close.Option IS 0:
  VAL f.ok IS 0:

  INT result, screen.id, key.id:
  SEQ
    open.output.stream(from.link, to.link, 0, screen.id, result)
    open.input.stream(from.link, to.link, 0, key.id, result)
    IF
      result = f.ok
        [20]BYTE count.str:

        INT len, loprim, hiprim, diff, lauf, time1, time2:
        TIMER clock:
        BOOL error,prim:
        VAL step IS 2:
        VAL limit IS ((MOSTPOS INT) >> 1):
        SEQ
          SEQ a = 0 FOR 20
            count.str[a] := ' '   -- init count.string

          write.block(from.link, to.link, screen.id,
                      "minimale Primzahldifferenz ?  :", len, result)

          result := f.ok - 1
          WHILE result <> f.ok
            read.block(from.link, to.link, key.id, 20, count.str, len, result)
          STRINGTOINT(error, diff, [count.str FROM 0 FOR len])

          clock ? time1
          loprim := 3
          hiprim := 7

          WHILE (diff < 300)
            SEQ           
              WHILE ((hiprim MINUS loprim) < diff)
                SEQ
                  lauf := hiprim MINUS loprim
                  WHILE (lauf > 0)
                    SEQ
                      is.it.prim(loprim PLUS lauf,prim)
                      IF
                        prim
                          SEQ
                            loprim := lauf PLUS loprim
                            hiprim := loprim
                            lauf := 0
                        TRUE
                          lauf := lauf MINUS step
                  IF
                    hiprim = loprim
                      SEQ
                        hiprim := (loprim PLUS diff) MINUS step
                    TRUE
                      SEQ
                        lauf := step
                        hiprim := hiprim PLUS step
                        WHILE (lauf > 0)
                          SEQ
                            is.it.prim(hiprim,prim)
                            IF
                              prim
                                SEQ
                                  lauf := 0
                              TRUE
                                hiprim := hiprim PLUS step



              clock ? time2

              SEQ a = 0 FOR 20
                count.str[a] := ' '   -- init count.string
              INTTOSTRING(len, count.str, hiprim)
              write.block(from.link, to.link, screen.id,
                          count.str, len, result)
              SEQ a = 0 FOR 20
                count.str[a] := ' '   -- init count.string
              INTTOSTRING(len, count.str, loprim)
              write.block(from.link, to.link, screen.id,
                          count.str, len, result)
              SEQ a = 0 FOR 20
                count.str[a] := ' '   -- init count.string
              INTTOSTRING(len, count.str, (hiprim-loprim))
              write.block(from.link, to.link, screen.id,
                          [count.str FROM 0 FOR len], len, result)
              time2 := time2 - time1
              SEQ a = 0 FOR 20
                count.str[a] := ' '   -- init count.string
              REAL32TOSTRING(len, count.str,
                            (( REAL32 ROUND time2 ) / 15625.0 (REAL32)), 5, 5)
    --          INTTOSTRING(len, count.str, time2)
              write.block(from.link, to.link, screen.id,
                          [count.str FROM 0 FOR len], len, result)
              write.block(from.link, to.link, screen.id,
                          " Sek. *N", len, result)

              diff := (hiprim-loprim)+2
    
          terminate.filer(from.link, to.link, result)
:

resman · Post by **resman** » Tue Jul 03, 2018 6:43 pm

Porting the C/Pascal version to PLASMA 2.0 and running on a 128K Apple //e (so the JIT compiler gets involved):

Code: Select all

include "inc/cmdsys.plh"

var a, b, i,loprim,hiprim

def getnum
    var str, num
    num = 0
    str = gets(':'|$80) + 1
    while ^str >= '0' and ^str <= '9'
        num = num * 10 + (^str - '0')
        str++
    loop
    return num
end

def prim(x)
    var i
    i = 3
    while i*i < x and x % i; i = i + 2; loop
    return x < i*i
end

puts("Geben Sie den oberen Grenzwert ein")
a = getnum
puts("Geben Sie die minimale Differenz ein")
b = getnum
loprim, hiprim, i = 1, 1, 3
while i <= a and hiprim-loprim < b
    if prim(i)
        loprim = hiprim
        hiprim = i
    fin
    i = i + 2
loop
if hiprim-loprim < b
    puts("keine Loesung gefunden !")
else
    puti(hiprim); putc(' '); puti(loprim); putc(' '); puti(hiprim-loprim)
fin
putln
done

Yields:

Code: Select all

A: 5.72s
B: 9.30s
C: 118.36s

Not bad for a 1 MHz 6502. The JIT compiler will convert the prim() function into machine code after 44 calls using the default JIT values. Compiling it immediately by tuning the JIT compiler gets it a smidge faster, but not appreciably.

Dave...

rwiker · Post by **rwiker** » Tue Jul 03, 2018 6:46 pm

barrym95838 wrote:

BigEd wrote:

It's a pity that no languages I can think of offer a DIVMOD which can perform the single computation but return both the division and the remainder.

VTL-2, Forth and 68k machine language instantly jump to my mind, FWIW.

Mike B.

Common Lisp too, in the form of #'floor, #'ceiling, #'round and #'truncate... and possibly others:

E.g,

Code: Select all

(truncate 5 2) 
2
1

... which could be used like this:

Code: Select all

(multiple-value-bind (div mod)
    (truncate 5 2)
  (format t "Div: ~a, Mod: ~a~%" div mod))

GaBuZoMeu · Post by **GaBuZoMeu** » Tue Jul 03, 2018 8:51 pm

@rwiker: is there anything LISP can't do ?

@resman: very impressive numbers, I assume there is no much room left for assembly. Thanks Dave!

@Chromatix: I hope my abbreviations according to BBC/HiBASIC are acceptable.

I found something strange:
I rerun the original version ("BasBenc1") and verified the same speed. Then I tried to "cache" the SQR() with
30 SQ=INT(SQR(C))+1
35 FOR D=3 TO SQ STEP 2
instead of
30 FOR D=3 TO SQR(C) STEP 2

To my surprise the program runs slower !??

Chromatix · Post by **Chromatix** » Tue Jul 03, 2018 9:59 pm

GaBuZoMeu wrote:

Well, for all of you who are going to squeeze out every cycle and every bit of their machines: try to find the first gap that spans across 222. The numbers end with xxx969 and xxx747. The T800 took 11264,5781 seconds for that

Well, I took that as a challenge and plugged in my Raspberry Pi 2B (since I happen to have one). A short C implementation later, compiled specifically for Cortex-A7 (which, unlike many older ARM cores, does have hardware divide)...

Code: Select all

A <1ms
B <1ms
C 2ms
D 4ms
E 7ms
F 183ms
G (gap 222) - 122164747, 122164969, 222, 627.332 seconds

Code: Select all

#include <stdio.h>
#include <stdint.h>
#include <sys/time.h>

void findGap(const uint8_t target)
{
    struct timeval begin,end;
    uint32_t c=3, z=3, l=2, ll=4;

    gettimeofday(&begin, NULL);

    for(;;) {
        if(c > ll) {
            ll += l+l+1;
            l++;
        }
        for(uint32_t d=3; d < l; d += 2)
            if(!(c%d))
                goto notPrime;
        if(c-z >= target)
            break;
        z = c;

    notPrime:
        c += 2;
    }

    gettimeofday(&end, NULL);
    printf("%u\t%u\t%u\t%.3f\n", z, c, c-z, (end.tv_sec - begin.tv_sec) + 0.000001 * (end.tv_usec - begin.tv_usec));
}

int main(void)
{
    findGap(20);
    findGap(30);
    findGap(35);
    findGap(50);
    findGap(70);
    findGap(100);
    findGap(222);

    return 0;
}

Meanwhile, my old PowerBook G4, with a 1.5GHz PowerPC 7447A, took 133ms over case F, and 497.957 seconds over case G.

My Sandy Bridge based MacBook Pro managed case F in 40ms, case G in 148.254 seconds.

commodorejohn · Post by **commodorejohn** » Tue Jul 03, 2018 11:27 pm

GaBuZoMeu wrote:

commodorejohn wrote:

(Drat...even the POSIX time functions don't track to sub-second accuracy on NetBSD/vax 6.1.5, which means that all but the last one or two example values are effectively unmeasurable...)

You may modify the input. Skip it and work with constants. Then run it thousand times or so. That should sum up to something detectable.

Good point

Funnily enough, I also had to modify the long version to use long long because of the ambiguity with C type sizes (long and int are both 32-bit under GCC for VAX, apparently.) Here's the results (problems A-E run 1000x apiece, problem F run 100x):

Code: Select all

        A       B       C       D       E       F
int     2.48    4.17    53.61   139.26  262.22  7653.10
long*   13.24   22.61   312.63  826.16  1570.51 46712.30

* (long = long long)
MicroVAX 3100/90, NetBSD 6.1.5 - all times in milliseconds per run.

barrym95838 · Post by **barrym95838** » Wed Jul 04, 2018 2:12 am

GaBuZoMeu wrote:

30 SQ=INT(SQR(C))+1
35 FOR D=3 TO SQ STEP 2
instead of
30 FOR D=3 TO SQR(C) STEP 2

To my surprise the program runs slower !??

Erm ... aren't you adding an extra iteration about 50% of the time with that +1? I can't think of any BASICs that re-calculate the limit on every iteration anyhow ...

Mike B.

slightly OT: a simple Benchmark

Re: slightly OT: a simple Benchmark

Re: slightly OT: a simple Benchmark

Re: slightly OT: a simple Benchmark

Re: slightly OT: a simple Benchmark

Re: slightly OT: a simple Benchmark

Re: slightly OT: a simple Benchmark

Re: slightly OT: a simple Benchmark

Re: slightly OT: a simple Benchmark

Re: slightly OT: a simple Benchmark

Re: slightly OT: a simple Benchmark

Re: slightly OT: a simple Benchmark

Re: slightly OT: a simple Benchmark

Re: slightly OT: a simple Benchmark

Re: slightly OT: a simple Benchmark

Re: slightly OT: a simple Benchmark