6502.org Forum  Projects  Code  Documents  Tools  Forum
It is currently Sat Apr 27, 2024 11:34 am

All times are UTC




Post new topic Reply to topic  [ 210 posts ]  Go to page Previous  1, 2, 3, 4, 5 ... 14  Next
Author Message
PostPosted: Tue Jul 03, 2018 3:13 pm 
Offline
User avatar

Joined: Sun Jun 30, 2013 10:26 pm
Posts: 1926
Location: Sacramento, CA, USA
BigEd wrote:
It's a pity that no languages I can think of offer a DIVMOD which can perform the single computation but return both the division and the remainder.

VTL-2, Forth and 68k machine language instantly jump to my mind, FWIW.

Mike B.


Top
 Profile  
Reply with quote  
PostPosted: Tue Jul 03, 2018 3:18 pm 
Offline

Joined: Mon May 21, 2018 8:09 pm
Posts: 1462
Most CISC machines that offer a divide instruction, in fact, will fill two registers with the quotient and remainder simultaneously. That's how x86 behaves as well; the 6309 is another example.

RISC machines tend to have a 3-operand instruction format and permit producing only one result per instruction. To obtain the remainder on those, it is necessary to divide, multiply, and subtract in sequence. But the ARMv6 core in the original Raspberry Pi doesn't have a divide instruction at all; you have to do it "by hand".


Last edited by Chromatix on Tue Jul 03, 2018 3:19 pm, edited 1 time in total.

Top
 Profile  
Reply with quote  
PostPosted: Tue Jul 03, 2018 3:19 pm 
Offline
User avatar

Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10793
Location: England
That's good to know, thanks Mike. (And Chromatix, crossed in post)

My habit when writing this sort of trial-division prime-tester is to use the division to see if the divisor has got too large, which saves the square root (or the squaring) and use the remainder to see if the division was exact.


Top
 Profile  
Reply with quote  
PostPosted: Tue Jul 03, 2018 3:33 pm 
Offline

Joined: Mon May 21, 2018 8:09 pm
Posts: 1462
That is a good strategy if you've got the registers for it. But I think keeping the quotient would be slower than computing the square root separately on the 6502, given that the latter strategy takes only a few instructions in the *outer* loop.

With an 8-bit divisor I only need 5 instructions per bit to produce the remainder, and I'd have to add at least one more (probably a 5-cycle ROL zp) to produce the quotient from that. When we get into 16-bit divisors (which are needed to complete the testing of 24-bit primes), things get even more complicated. All this is in the inner loop, which is executed several times on average per outer loop.


Top
 Profile  
Reply with quote  
PostPosted: Tue Jul 03, 2018 3:34 pm 
Offline
User avatar

Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10793
Location: England
It's a fair point that the square root can be computed once - not sure why I never realised that!


Top
 Profile  
Reply with quote  
PostPosted: Tue Jul 03, 2018 4:33 pm 
Offline
User avatar

Joined: Wed Mar 01, 2017 8:54 pm
Posts: 660
Location: North-Germany
Wow - what a resonance :shock: :) :)

When I started to build this list everyone was just judging the raw performance of a (home)computer. And there was only Basic - at least for home computers. So a small program with a moderate amount of required runtime was exactly what could give an idea of how fast ("good") a machine was. In this way the "specs" of the first seven entries are fairly comparable. The eighth entry is already an attempt to tune the program to let the particular machine (and its language implementation) look a little better. This is of course counterproductive in terms of comparability.

When home computers turns to personal computers (having an OS, disks, etc.) another aspect was to compare various languages (either interpreters or compilers) against each other. Of course one can still compare the results of two machines, but now a more careful interpretation of these numbers is required. At least the used algorithm should be the same (as long as this is possible for the chosen language).

That (and because the machines getting faster and faster) is the reason why I never tried to use advanced algorithms. Don't misunderstood me, I don't want to disqualify alternative algorithms! I really like to read the various and very very interesting suggestions here e.g. the Wheel (!) or Chromatix' attempt to avoid squaring and multiplication at all - great! But of course all these kind of advanced algorithms are only comparable if one runs them on the very same machine, and then they could (perhaps) give one an idea of what is feasible on that particular machine.

At last - I am waiting for the scores of VAX and Cray - don't forget to specify the machine, compiler, program, and what else precisely. :D


P.S.:
I am going to edit the table this evening. John Wests LUA results will be appended. Great! Never before I could so easily follow a LUA program. THX.
Chromatix: your first numbers for the BBC will entered as well - perhaps you can run one of these C/Pascal "V2" versions that avoid using SQR() and use MOD and DIV?


Top
 Profile  
Reply with quote  
PostPosted: Tue Jul 03, 2018 4:45 pm 
Offline
User avatar

Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10793
Location: England
I've just spotted you have a T800 transputer in there - excellent!

You just might be interested to try pypy on the Pi - sometimes it gives a very impressive speed improvement.


Top
 Profile  
Reply with quote  
PostPosted: Tue Jul 03, 2018 4:46 pm 
Offline

Joined: Mon May 21, 2018 8:09 pm
Posts: 1462
I've almost got my assembly version finished, so I'll run that on the same emulator to give a sense of the interpreter's efficiency (or otherwise). I'm using the cc65 toolchain for that, so I might try to get a C version running as well, but no promises; cc65 doesn't have direct support for Acorn MOS, so I'll have to bodge it.

Be sure to note that my numbers are from a Second Processor running at twice the clock speed of the BBC Micro itself.

My assembly implementation now has *three* separate division routines in it - for 16/8, 24/8 and 24/16 bit operands. Separating the latter two makes both considerably simpler and faster, which is important since even with larger prime candidates, most are eliminated with *small* trial divisions.

Incidentally, many C compilers will optimise paired division and remainder operations so that only one division is actually performed. So it doesn't really matter that there is no explicit "divmod" operator.


Top
 Profile  
Reply with quote  
PostPosted: Tue Jul 03, 2018 5:20 pm 
Offline
User avatar

Joined: Wed Mar 01, 2017 8:54 pm
Posts: 660
Location: North-Germany
BigEd wrote:
I've just spotted you have a T800 transputer in there - excellent!

Back in 199X this was a dream of a machine. I once had two of them running but one is gone :cry:

In their Technical Application Handbook was an article about benchmarks. I found this remarkable technical note on the net: Lies, damned lies and benchmarks


BigEd wrote:
You just might be interested to try pypy on the Pi - sometimes it gives a very impressive speed improvement.

I have used pypy with tremendous results. Whether the given numbers are already coming from using pypy I can't recall, but I think I would have mentioned that instead of referring to Python 3.4.2. The biggest problem this moment: I couldn't find my RasPi :!: :lol:


Well, for all of you who are going to squeeze out every cycle and every bit of their machines: try to find the first gap that spans across 222. The numbers end with xxx969 and xxx747. The T800 took 11264,5781 seconds for that 8)
(To be fair: the algorithm was slightly changed, and the program just start with some low number and continues to seek for larger and larger gaps):
Code:
PROC INMOS.ENTRY.POINT (CHAN OF ANY from.link, to.link)
  -- DO NOT CHANGE PROCEDURE NAME, IT IS IMPORTANT.

  #USE "\toolset\flibs"      -- File server library
  #USE "\toolset\realio"     -- IO-conversions

  PROC is.it.prim(VAL INT testnum, BOOL result)
    VAL step IS 2:
    VAL init IS 3:
    INT i:
    SEQ
      i := init
      WHILE (( (i TIMES i) < testnum) AND (( testnum REM i) <> 0))
        i := i PLUS step
      result := (testnum < (i TIMES i))
  :

  VAL Close.Option IS 0:
  VAL f.ok IS 0:

  INT result, screen.id, key.id:
  SEQ
    open.output.stream(from.link, to.link, 0, screen.id, result)
    open.input.stream(from.link, to.link, 0, key.id, result)
    IF
      result = f.ok
        [20]BYTE count.str:

        INT len, loprim, hiprim, diff, lauf, time1, time2:
        TIMER clock:
        BOOL error,prim:
        VAL step IS 2:
        VAL limit IS ((MOSTPOS INT) >> 1):
        SEQ
          SEQ a = 0 FOR 20
            count.str[a] := ' '   -- init count.string

          write.block(from.link, to.link, screen.id,
                      "minimale Primzahldifferenz ?  :", len, result)

          result := f.ok - 1
          WHILE result <> f.ok
            read.block(from.link, to.link, key.id, 20, count.str, len, result)
          STRINGTOINT(error, diff, [count.str FROM 0 FOR len])

          clock ? time1
          loprim := 3
          hiprim := 7

          WHILE (diff < 300)
            SEQ           
              WHILE ((hiprim MINUS loprim) < diff)
                SEQ
                  lauf := hiprim MINUS loprim
                  WHILE (lauf > 0)
                    SEQ
                      is.it.prim(loprim PLUS lauf,prim)
                      IF
                        prim
                          SEQ
                            loprim := lauf PLUS loprim
                            hiprim := loprim
                            lauf := 0
                        TRUE
                          lauf := lauf MINUS step
                  IF
                    hiprim = loprim
                      SEQ
                        hiprim := (loprim PLUS diff) MINUS step
                    TRUE
                      SEQ
                        lauf := step
                        hiprim := hiprim PLUS step
                        WHILE (lauf > 0)
                          SEQ
                            is.it.prim(hiprim,prim)
                            IF
                              prim
                                SEQ
                                  lauf := 0
                              TRUE
                                hiprim := hiprim PLUS step



              clock ? time2

              SEQ a = 0 FOR 20
                count.str[a] := ' '   -- init count.string
              INTTOSTRING(len, count.str, hiprim)
              write.block(from.link, to.link, screen.id,
                          count.str, len, result)
              SEQ a = 0 FOR 20
                count.str[a] := ' '   -- init count.string
              INTTOSTRING(len, count.str, loprim)
              write.block(from.link, to.link, screen.id,
                          count.str, len, result)
              SEQ a = 0 FOR 20
                count.str[a] := ' '   -- init count.string
              INTTOSTRING(len, count.str, (hiprim-loprim))
              write.block(from.link, to.link, screen.id,
                          [count.str FROM 0 FOR len], len, result)
              time2 := time2 - time1
              SEQ a = 0 FOR 20
                count.str[a] := ' '   -- init count.string
              REAL32TOSTRING(len, count.str,
                            (( REAL32 ROUND time2 ) / 15625.0 (REAL32)), 5, 5)
    --          INTTOSTRING(len, count.str, time2)
              write.block(from.link, to.link, screen.id,
                          [count.str FROM 0 FOR len], len, result)
              write.block(from.link, to.link, screen.id,
                          " Sek. *N", len, result)

              diff := (hiprim-loprim)+2
   
          terminate.filer(from.link, to.link, result)
:


Top
 Profile  
Reply with quote  
PostPosted: Tue Jul 03, 2018 6:43 pm 
Offline

Joined: Sat Dec 12, 2015 7:48 pm
Posts: 122
Location: Lake Tahoe
Porting the C/Pascal version to PLASMA 2.0 and running on a 128K Apple //e (so the JIT compiler gets involved):
Code:
include "inc/cmdsys.plh"

var a, b, i,loprim,hiprim

def getnum
    var str, num
    num = 0
    str = gets(':'|$80) + 1
    while ^str >= '0' and ^str <= '9'
        num = num * 10 + (^str - '0')
        str++
    loop
    return num
end

def prim(x)
    var i
    i = 3
    while i*i < x and x % i; i = i + 2; loop
    return x < i*i
end

puts("Geben Sie den oberen Grenzwert ein")
a = getnum
puts("Geben Sie die minimale Differenz ein")
b = getnum
loprim, hiprim, i = 1, 1, 3
while i <= a and hiprim-loprim < b
    if prim(i)
        loprim = hiprim
        hiprim = i
    fin
    i = i + 2
loop
if hiprim-loprim < b
    puts("keine Loesung gefunden !")
else
    puti(hiprim); putc(' '); puti(loprim); putc(' '); puti(hiprim-loprim)
fin
putln
done


Yields:
Code:
A: 5.72s
B: 9.30s
C: 118.36s


Not bad for a 1 MHz 6502. The JIT compiler will convert the prim() function into machine code after 44 calls using the default JIT values. Compiling it immediately by tuning the JIT compiler gets it a smidge faster, but not appreciably.

Dave...


Last edited by resman on Tue Jul 03, 2018 6:48 pm, edited 1 time in total.

Top
 Profile  
Reply with quote  
PostPosted: Tue Jul 03, 2018 6:46 pm 
Offline

Joined: Thu Mar 03, 2011 5:56 pm
Posts: 277
barrym95838 wrote:
BigEd wrote:
It's a pity that no languages I can think of offer a DIVMOD which can perform the single computation but return both the division and the remainder.

VTL-2, Forth and 68k machine language instantly jump to my mind, FWIW.

Mike B.


Common Lisp too, in the form of #'floor, #'ceiling, #'round and #'truncate... and possibly others:

E.g,
Code:
(truncate 5 2)
2
1

... which could be used like this:

Code:
(multiple-value-bind (div mod)
    (truncate 5 2)
  (format t "Div: ~a, Mod: ~a~%" div mod))


Top
 Profile  
Reply with quote  
PostPosted: Tue Jul 03, 2018 8:51 pm 
Offline
User avatar

Joined: Wed Mar 01, 2017 8:54 pm
Posts: 660
Location: North-Germany
@rwiker: is there anything LISP can't do ? :lol:

@resman: very impressive numbers, I assume there is no much room left for assembly. Thanks Dave!

@Chromatix: I hope my abbreviations according to BBC/HiBASIC are acceptable.

I found something strange:
I rerun the original version ("BasBenc1") and verified the same speed. Then I tried to "cache" the SQR() with
30 SQ=INT(SQR(C))+1
35 FOR D=3 TO SQ STEP 2
instead of
30 FOR D=3 TO SQR(C) STEP 2

To my surprise the program runs slower !??


Top
 Profile  
Reply with quote  
PostPosted: Tue Jul 03, 2018 9:59 pm 
Offline

Joined: Mon May 21, 2018 8:09 pm
Posts: 1462
GaBuZoMeu wrote:
Well, for all of you who are going to squeeze out every cycle and every bit of their machines: try to find the first gap that spans across 222. The numbers end with xxx969 and xxx747. The T800 took 11264,5781 seconds for that 8)

Well, I took that as a challenge and plugged in my Raspberry Pi 2B (since I happen to have one). A short C implementation later, compiled specifically for Cortex-A7 (which, unlike many older ARM cores, does have hardware divide)...
Code:
A <1ms
B <1ms
C 2ms
D 4ms
E 7ms
F 183ms
G (gap 222) - 122164747, 122164969, 222, 627.332 seconds
Code:
#include <stdio.h>
#include <stdint.h>
#include <sys/time.h>

void findGap(const uint8_t target)
{
    struct timeval begin,end;
    uint32_t c=3, z=3, l=2, ll=4;

    gettimeofday(&begin, NULL);

    for(;;) {
        if(c > ll) {
            ll += l+l+1;
            l++;
        }
        for(uint32_t d=3; d < l; d += 2)
            if(!(c%d))
                goto notPrime;
        if(c-z >= target)
            break;
        z = c;

    notPrime:
        c += 2;
    }

    gettimeofday(&end, NULL);
    printf("%u\t%u\t%u\t%.3f\n", z, c, c-z, (end.tv_sec - begin.tv_sec) + 0.000001 * (end.tv_usec - begin.tv_usec));
}

int main(void)
{
    findGap(20);
    findGap(30);
    findGap(35);
    findGap(50);
    findGap(70);
    findGap(100);
    findGap(222);

    return 0;
}

Meanwhile, my old PowerBook G4, with a 1.5GHz PowerPC 7447A, took 133ms over case F, and 497.957 seconds over case G.

My Sandy Bridge based MacBook Pro managed case F in 40ms, case G in 148.254 seconds.


Top
 Profile  
Reply with quote  
PostPosted: Tue Jul 03, 2018 11:27 pm 
Offline

Joined: Thu Jan 21, 2016 7:33 pm
Posts: 269
Location: Placerville, CA
GaBuZoMeu wrote:
commodorejohn wrote:
(Drat...even the POSIX time functions don't track to sub-second accuracy on NetBSD/vax 6.1.5, which means that all but the last one or two example values are effectively unmeasurable...)

:lol: :lol: :lol:
You may modify the input. Skip it and work with constants. Then run it thousand times or so. That should sum up to something detectable.

Good point :D Funnily enough, I also had to modify the long version to use long long because of the ambiguity with C type sizes (long and int are both 32-bit under GCC for VAX, apparently.) Here's the results (problems A-E run 1000x apiece, problem F run 100x):
Code:
        A       B       C       D       E       F
int     2.48    4.17    53.61   139.26  262.22  7653.10
long*   13.24   22.61   312.63  826.16  1570.51 46712.30

* (long = long long)
MicroVAX 3100/90, NetBSD 6.1.5 - all times in milliseconds per run.


Top
 Profile  
Reply with quote  
PostPosted: Wed Jul 04, 2018 2:12 am 
Offline
User avatar

Joined: Sun Jun 30, 2013 10:26 pm
Posts: 1926
Location: Sacramento, CA, USA
GaBuZoMeu wrote:
30 SQ=INT(SQR(C))+1
35 FOR D=3 TO SQ STEP 2
instead of
30 FOR D=3 TO SQR(C) STEP 2

To my surprise the program runs slower !??

Erm ... aren't you adding an extra iteration about 50% of the time with that +1? I can't think of any BASICs that re-calculate the limit on every iteration anyhow ...

Mike B.


Top
 Profile  
Reply with quote  
Display posts from previous:  Sort by  
Post new topic Reply to topic  [ 210 posts ]  Go to page Previous  1, 2, 3, 4, 5 ... 14  Next

All times are UTC


Who is online

Users browsing this forum: No registered users and 17 guests


You cannot post new topics in this forum
You cannot reply to topics in this forum
You cannot edit your posts in this forum
You cannot delete your posts in this forum
You cannot post attachments in this forum

Search for:
Jump to: