6502.org • View topic - slightly OT: a simple Benchmark

View unanswered posts | View active topics

Board index » 6502.org Users Forum » General Discussions

All times are UTC

slightly OT: a simple Benchmark

Page 8 of 14

[ 210 posts ]

Go to page Previous 1 ... 5, 6, 7, 8, 9, 10, 11 ... 14 Next

Previous topic | Next topic

Author

Message

BigEd

Post subject: Re: slightly OT: a simple Benchmark

Posted: Sun Jul 08, 2018 4:55 pm

Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10793
Location: England

If we're still thinking of speeding up some trial-division code, it strikes me that dividing by 33 or 35 and looking up in a table to sift out the numbers divisible by small primes might be faster than doing two divisions. At the cost of a larger table, divide by 1155 or some other product of small primes. Maybe a whole series of smaller tables is a bigger win than one large table. Such a table could be implemented in basic or any other language but might fit especially well in machine code - especially if it's quite small.

I like the idea of doing this task within a reasonable space like 4k. Feels like the kind of constraint which improves solutions.

Top

Chromatix

Post subject: Re: slightly OT: a simple Benchmark

Posted: Sun Jul 08, 2018 5:41 pm

Joined: Mon May 21, 2018 8:09 pm
Posts: 1462

Well, one option is to perform GCD with a series of constants like 15015 (=3*5*7*11*13) to get through the small divisors quickly - but I was keen to keep the algorithm very simple, so that it could be implemented directly in other languages, while still being much faster than the brute-force algorithm originally used. I'll see what I can do to translate it into 6502 and 68K assembly myself, and into BASIC as well.

I'm also still interested in the practical performance of the 6309 when not hobbled by a language interpreter. Its implementation of division is supposedly a *lot* faster than that of the 68000.

Top

BigEd

Post subject: Re: slightly OT: a simple Benchmark

Posted: Sun Jul 08, 2018 5:44 pm

Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10793
Location: England

If you're already dividing by a series of constants from an array, you're not far from instead multiplying by a series of reciprocals, which should be quicker. Just need to figure out the fixed point and the rounding. How hard can it be.

With a wheel incrementer, I think a much smaller table of primes would suffice, as you get the most gains (I think) from the smallest primes, which are most likely to be factors.

Top

Chromatix

Post subject: Re: slightly OT: a simple Benchmark

Posted: Sun Jul 08, 2018 6:08 pm

Joined: Mon May 21, 2018 8:09 pm
Posts: 1462

The trouble is that multiplying by a reciprocal doesn't give you an accurate remainder without re-multiplying. And multiplication is almost as expensive as division in software.

Top

BigEd

Post subject: Re: slightly OT: a simple Benchmark

Posted: Sun Jul 08, 2018 6:12 pm

Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10793
Location: England

Not with lookup tables... and isn't it the case that testing against an integer can't need (many) more bits after the point than the original divisor?

Top

Chromatix

Post subject: Re: slightly OT: a simple Benchmark

Posted: Sun Jul 08, 2018 6:47 pm

Joined: Mon May 21, 2018 8:09 pm
Posts: 1462

On the 68000, a 16x16=32 bit multiply (the only one available) takes 70 cycles - so two of them, as would be needed when candidate primes exceed 16 bits, would be just as expensive as a 32/16 division. So that's not a win, no matter how you spin it. Similar arguments apply against multiply-reciprocal on the 6502, unless your multiplication lookup tables occupy a very large proportion of the memory.

In BASIC, good luck obtaining the high half of a 64-bit or even 48-bit result from a multiply operation. That will seriously complicate porting to many old languages. In any case, overhead of interpretation is the usual limiting factor, not the actual arithmetic. Taking the remainder directly is better all round.

Conversely, on the 6502 I can run through 8-bit divisors relatively quickly for the most common elimination cases, and switch to the slower 16-bit divisors only rarely. I can't think of any way to do that faster by multiplication, even if I could somehow make multiplication faster.

And on the crazy homebrew end of things, what if someone built a memory-mapped divider and used it to run this algorithm? Now *that* would be interesting.

Top

BigEd

Post subject: Re: slightly OT: a simple Benchmark

Posted: Sun Jul 08, 2018 6:51 pm

Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10793
Location: England

What did you mean by your GCD comment? Did you miss the idea of a table of results, or did I miss something?

Top

GARTHWILSON

Post subject: Re: slightly OT: a simple Benchmark

Posted: Sun Jul 08, 2018 7:01 pm

Joined: Fri Aug 30, 2002 1:09 am
Posts: 8428
Location: Southern California

Chromatix wrote:

Similar arguments apply against multiply-reciprocal on the 6502, unless your multiplication lookup tables occupy a very large proportion of the memory

or you put the large tables in ROM, interfaced through I/O, which you can do with my large tables at http://wilsonminesco.com/16bitMathTables/ even though they were primarily intended to go in a 65816's memory map. It gives ways to do it though; and even if using a serial interface, it's still much faster than actually calculating the answers. The reciprocal table is the largest, giving a 32-bit answer for every 16-bit input. (That's not to say you have to use all 32 bits for every application; but they're available.)

_________________
http://WilsonMinesCo.com/ lots of 6502 resources
The "second front page" is http://wilsonminesco.com/links.html .
What's an additional VIA among friends, anyhow?

Top

Chromatix

Post subject: Re: slightly OT: a simple Benchmark

Posted: Sun Jul 08, 2018 7:04 pm

Joined: Mon May 21, 2018 8:09 pm
Posts: 1462

GCD can be done with one big remainder operation and a bunch of smaller (thus faster) ones, and you don't need a lookup table to interpret the results. But I'm not completely convinced that it's faster overall than individual, well-optimised remainders over 8-bit divisors.

A property of my code is that it starts with practically no prior information, only the first *two* primes. The lookup table of all other 16-bit primes is constructed during the search. On an 8-bit machine, that will take a usefully measurable amount of time by itself, while the table and code combined can fit within 16KB.

Top

John West

Post subject: Re: slightly OT: a simple Benchmark

Posted: Mon Jul 09, 2018 10:45 am

Joined: Tue Sep 03, 2002 12:58 pm
Posts: 293

If you're changing the algorithm completely, it's possible to do without any multiplication or division in the main loop.

The trick is to use a sieve. Allocate a block containing one bit for each odd number up to the square root of the largest number you expect to encounter (or more, if you've got the memory). If you want to stay within 32 bits, that's at least 32768 bits, or 4KB. To start with, they're all clear. You know that 3 is prime, so mark every odd number from 3^2, stepping by 3, until the end of the buffer as composite. The next clear bit after 3 is 5, so mark every number from 5^2 to the end, stepping by 5. Keep doing this until you the starting point is off the end of the buffer. You've now got all primes up to 65536.

Now create another buffer, representing numbers in blocks starting from 65536. These blocks can be as large as you have memory for. For each prime in the first block whose square is less than the end of the new buffer, mark multiples of that prime that are in the buffer. Then you can scan through the buffer looking for consecutive primes with a large gap between them. If you don't find any, move on to the next block of numbers. This will eventually find every prime less than 2^32. And it's fast.

If you've got the memory, you can have one massive buffer initial buffer, and don't bother with the sliding one. Odd numbers up to 2^32 will need 256MB, which is a lot for an 8-bitter, but nothing on a modern PC. You need one multiply for each prime up to the square root of your limit. The rest is addition. The sliding buffer needs a division per prime per buffer as well, to find which multiple to start with.

As a benchmark, it's testing different things: memory access and addition, mostly. The original algorithm is more of a test of division.

Top

BigEd

Post subject: Re: slightly OT: a simple Benchmark

Posted: Mon Jul 09, 2018 11:05 am

Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10793
Location: England

BYTE was pretty keen on the sieve as a benchmark back in the 80s! Always just a single buffer though.

As it's evolved, this thread has come to investigate more than one kind of thing... it could arguably be two or even three threads by now. On this forum, that only happens when a poster sees fit to start a new thread, when they realise they have a new idea - the mods don't rearrange threads.

The question for a benchmark really is what approximates an application, and that's something which varies according to the use. Is Basic to be used more like Fortran or more like Cobol? The question for an interesting programming exercise is different!

For the prime-gaps benchmark as tabulated at the top of the thread, at minimum, we're either going to be benchmarking division (or modulus) or array accesses - possibly both. In the earlier revisions, we're also benchmarking square root, or multiplication, or both.

In interpreted languages we're benchmarking the interpreter, and in versions which use GOTO we're also benchmarking that.

I quite like the use of integer variables (but again there's a choice - one might want to be benchmarking the floating point)

I have got one contribution which is a cheap speedup of one of the early versions: it's like a poor man's wheel.

Code:

    1  REM Basic-Bench  la SYM etc.
    2  REM *** Integerversion + Modulofunktion ***
   10 Z% = 3: INPUT A%,B%
   15 TIME = 0
   20 FOR C% = 5 TO A% STEP 2
   25 IF C% MOD 3 = 0 THEN C%=C%+2
   30 FOR D% = 5 TO SQR(C%) STEP 2
   40 IF C% MOD D% = 0 THEN 80
   50 NEXT D%
   60 IF C%-Z% >= B% THEN PRINT C%,Z%,C%-Z%: PRINT TIME: GOTO 10
   70 Z% = C%
   80 NEXT C%
   90 PRINT "keine Lsung gefunden !": GOTO 10

I'm quite interested to roll my sleeves up and try our OPC CPUs with (some version of) this benchmark, but I haven't got to it. To have any kind of useful comparison you surely need to be coding the same algorithm with the same optimisations, except where one CPU has to do something by hand which another can do with an instruction. For me, comparing CPUs is interesting, taking care to normalise to memory performance or similar.

Top

BillO

Post subject: Re: slightly OT: a simple Benchmark

Posted: Mon Jul 09, 2018 8:19 pm

Joined: Fri Dec 12, 2008 10:40 pm
Posts: 1000
Location: Canada

Chromatix wrote:

How fast can it run my code? I have emulation numbers for 16MHz, but it'd be nice to verify them with real hardware.

The code is assembled to $4000 by default (easily changed), and uses 16 zero-page bytes beginning at $70 (again, easily changed). The main entry point is at $4000 with $70 set to the gap size to search for; it returns with the following six bytes containing the pair of primes found with at least that gap. The eighth byte is zeroed so that the second prime can be read as a 4-byte integer by BASIC.

At the end of the source file is a test routine covering cases A-F, verifying that it produces the correct result. If it gets as far as the final NOP-NOP-NOP-NOP-BRA, it's all correct. If it goes into any of the other BNE loops, there's a mismatch. My emulator is set up to detect these tight loops and treat them as an exit condition, due to the test suite.

Okay, I may have time tonight to give this a try.

This is real hardware, so I will not be able to detect your BNE traps.

Can you give me an idea on how this works? Does it run all cases (A-F) on a single call, or do you need to set up each case?

_________________
Bill

Top

BillO

Post subject: Re: slightly OT: a simple Benchmark

Posted: Mon Jul 09, 2018 9:17 pm

Joined: Fri Dec 12, 2008 10:40 pm
Posts: 1000
Location: Canada

A simple sieve

Code:

PR=O
L=1000
PRINT "SIEVE OF ERATOSTHENESE"
PRINT "COUNT THE PRIME NUMBERS UP TO ";L
DIM N(L)
FOR I=1 TO L:N(I)=I:NEXT I
P=2
GOSUB 200
P=P+1
IF N(P)<>0 THEN GOSUB 200
IF P < L THEN GOTO 90
PRINT:PRINT PR;" PRIME NUMBERS FOUND."
END
PR=PR+1
FOR I=P TO L STEP P:N(I)=0:NEXT I
RETURN

_________________
Bill

Top

Chromatix

Post subject: Re: slightly OT: a simple Benchmark

Posted: Mon Jul 09, 2018 9:25 pm

Joined: Mon May 21, 2018 8:09 pm
Posts: 1462

The BNE traps are effectively part of a unit-test routine. They're not intended for use in an actual benchmark run, since on real hardware (or an emulator for a real micro) they just provoke an infinite loop.

Instead, use the main entry point at $4000 as previously described, after setting $70 to the gap size to find. On return, bytes $71-73 and $74-76 contain little-endian 24-bit representations of the two primes enclosing that gap, and $77 is zeroed so you can read one of them as a 32-bit integer. Outputting these numbers and measuring the time interval is up to the caller, since there's so much variation in the hardware and OS facilities to do so.

In BBC BASIC I used the following:

Code:

DATA 20,30,35,50,70,100
*LOAD O.PRIMEGP 4000
FOR I%=1 TO 6
READ G%
?&70=G%
T%=TIME
CALL(&4000)
U%=TIME
PRINT !&74, (U%-T%)/100
NEXT

Top

BillO

Post subject: Re: slightly OT: a simple Benchmark

Posted: Mon Jul 09, 2018 9:50 pm

Joined: Fri Dec 12, 2008 10:40 pm
Posts: 1000
Location: Canada

Got it.

Without a time function in EhBAsic (or the hardware to supp0ort it) I'll have to run them individually.

_________________
Bill

Top

Page 8 of 14

[ 210 posts ]

Go to page Previous 1 ... 5, 6, 7, 8, 9, 10, 11 ... 14 Next

Board index » 6502.org Users Forum » General Discussions

All times are UTC

Who is online

Users browsing this forum: No registered users and 19 guests

You cannot post new topics in this forum
You cannot reply to topics in this forum
You cannot edit your posts in this forum
You cannot delete your posts in this forum
You cannot post attachments in this forum