slightly OT: a simple Benchmark
Re: slightly OT: a simple Benchmark
If we're still thinking of speeding up some trial-division code, it strikes me that dividing by 33 or 35 and looking up in a table to sift out the numbers divisible by small primes might be faster than doing two divisions. At the cost of a larger table, divide by 1155 or some other product of small primes. Maybe a whole series of smaller tables is a bigger win than one large table. Such a table could be implemented in basic or any other language but might fit especially well in machine code - especially if it's quite small.
I like the idea of doing this task within a reasonable space like 4k. Feels like the kind of constraint which improves solutions.
I like the idea of doing this task within a reasonable space like 4k. Feels like the kind of constraint which improves solutions.
Re: slightly OT: a simple Benchmark
Well, one option is to perform GCD with a series of constants like 15015 (=3*5*7*11*13) to get through the small divisors quickly - but I was keen to keep the algorithm very simple, so that it could be implemented directly in other languages, while still being much faster than the brute-force algorithm originally used. I'll see what I can do to translate it into 6502 and 68K assembly myself, and into BASIC as well.
I'm also still interested in the practical performance of the 6309 when not hobbled by a language interpreter. Its implementation of division is supposedly a *lot* faster than that of the 68000.
I'm also still interested in the practical performance of the 6309 when not hobbled by a language interpreter. Its implementation of division is supposedly a *lot* faster than that of the 68000.
Re: slightly OT: a simple Benchmark
If you're already dividing by a series of constants from an array, you're not far from instead multiplying by a series of reciprocals, which should be quicker. Just need to figure out the fixed point and the rounding. How hard can it be.
With a wheel incrementer, I think a much smaller table of primes would suffice, as you get the most gains (I think) from the smallest primes, which are most likely to be factors.
With a wheel incrementer, I think a much smaller table of primes would suffice, as you get the most gains (I think) from the smallest primes, which are most likely to be factors.
Re: slightly OT: a simple Benchmark
The trouble is that multiplying by a reciprocal doesn't give you an accurate remainder without re-multiplying. And multiplication is almost as expensive as division in software.
Re: slightly OT: a simple Benchmark
Not with lookup tables... and isn't it the case that testing against an integer can't need (many) more bits after the point than the original divisor?
Re: slightly OT: a simple Benchmark
On the 68000, a 16x16=32 bit multiply (the only one available) takes 70 cycles - so two of them, as would be needed when candidate primes exceed 16 bits, would be just as expensive as a 32/16 division. So that's not a win, no matter how you spin it. Similar arguments apply against multiply-reciprocal on the 6502, unless your multiplication lookup tables occupy a very large proportion of the memory.
In BASIC, good luck obtaining the high half of a 64-bit or even 48-bit result from a multiply operation. That will seriously complicate porting to many old languages. In any case, overhead of interpretation is the usual limiting factor, not the actual arithmetic. Taking the remainder directly is better all round.
Conversely, on the 6502 I can run through 8-bit divisors relatively quickly for the most common elimination cases, and switch to the slower 16-bit divisors only rarely. I can't think of any way to do that faster by multiplication, even if I could somehow make multiplication faster.
And on the crazy homebrew end of things, what if someone built a memory-mapped divider and used it to run this algorithm? Now *that* would be interesting.
In BASIC, good luck obtaining the high half of a 64-bit or even 48-bit result from a multiply operation. That will seriously complicate porting to many old languages. In any case, overhead of interpretation is the usual limiting factor, not the actual arithmetic. Taking the remainder directly is better all round.
Conversely, on the 6502 I can run through 8-bit divisors relatively quickly for the most common elimination cases, and switch to the slower 16-bit divisors only rarely. I can't think of any way to do that faster by multiplication, even if I could somehow make multiplication faster.
And on the crazy homebrew end of things, what if someone built a memory-mapped divider and used it to run this algorithm? Now *that* would be interesting.
Re: slightly OT: a simple Benchmark
What did you mean by your GCD comment? Did you miss the idea of a table of results, or did I miss something?
- GARTHWILSON
- Forum Moderator
- Posts: 8775
- Joined: 30 Aug 2002
- Location: Southern California
- Contact:
Re: slightly OT: a simple Benchmark
Chromatix wrote:
Similar arguments apply against multiply-reciprocal on the 6502, unless your multiplication lookup tables occupy a very large proportion of the memory
http://WilsonMinesCo.com/ lots of 6502 resources
The "second front page" is http://wilsonminesco.com/links.html .
What's an additional VIA among friends, anyhow?
The "second front page" is http://wilsonminesco.com/links.html .
What's an additional VIA among friends, anyhow?
Re: slightly OT: a simple Benchmark
GCD can be done with one big remainder operation and a bunch of smaller (thus faster) ones, and you don't need a lookup table to interpret the results. But I'm not completely convinced that it's faster overall than individual, well-optimised remainders over 8-bit divisors.
A property of my code is that it starts with practically no prior information, only the first *two* primes. The lookup table of all other 16-bit primes is constructed during the search. On an 8-bit machine, that will take a usefully measurable amount of time by itself, while the table and code combined can fit within 16KB.
A property of my code is that it starts with practically no prior information, only the first *two* primes. The lookup table of all other 16-bit primes is constructed during the search. On an 8-bit machine, that will take a usefully measurable amount of time by itself, while the table and code combined can fit within 16KB.
Re: slightly OT: a simple Benchmark
If you're changing the algorithm completely, it's possible to do without any multiplication or division in the main loop.
The trick is to use a sieve. Allocate a block containing one bit for each odd number up to the square root of the largest number you expect to encounter (or more, if you've got the memory). If you want to stay within 32 bits, that's at least 32768 bits, or 4KB. To start with, they're all clear. You know that 3 is prime, so mark every odd number from 3^2, stepping by 3, until the end of the buffer as composite. The next clear bit after 3 is 5, so mark every number from 5^2 to the end, stepping by 5. Keep doing this until you the starting point is off the end of the buffer. You've now got all primes up to 65536.
Now create another buffer, representing numbers in blocks starting from 65536. These blocks can be as large as you have memory for. For each prime in the first block whose square is less than the end of the new buffer, mark multiples of that prime that are in the buffer. Then you can scan through the buffer looking for consecutive primes with a large gap between them. If you don't find any, move on to the next block of numbers. This will eventually find every prime less than 2^32. And it's fast.
If you've got the memory, you can have one massive buffer initial buffer, and don't bother with the sliding one. Odd numbers up to 2^32 will need 256MB, which is a lot for an 8-bitter, but nothing on a modern PC. You need one multiply for each prime up to the square root of your limit. The rest is addition. The sliding buffer needs a division per prime per buffer as well, to find which multiple to start with.
As a benchmark, it's testing different things: memory access and addition, mostly. The original algorithm is more of a test of division.
The trick is to use a sieve. Allocate a block containing one bit for each odd number up to the square root of the largest number you expect to encounter (or more, if you've got the memory). If you want to stay within 32 bits, that's at least 32768 bits, or 4KB. To start with, they're all clear. You know that 3 is prime, so mark every odd number from 3^2, stepping by 3, until the end of the buffer as composite. The next clear bit after 3 is 5, so mark every number from 5^2 to the end, stepping by 5. Keep doing this until you the starting point is off the end of the buffer. You've now got all primes up to 65536.
Now create another buffer, representing numbers in blocks starting from 65536. These blocks can be as large as you have memory for. For each prime in the first block whose square is less than the end of the new buffer, mark multiples of that prime that are in the buffer. Then you can scan through the buffer looking for consecutive primes with a large gap between them. If you don't find any, move on to the next block of numbers. This will eventually find every prime less than 2^32. And it's fast.
If you've got the memory, you can have one massive buffer initial buffer, and don't bother with the sliding one. Odd numbers up to 2^32 will need 256MB, which is a lot for an 8-bitter, but nothing on a modern PC. You need one multiply for each prime up to the square root of your limit. The rest is addition. The sliding buffer needs a division per prime per buffer as well, to find which multiple to start with.
As a benchmark, it's testing different things: memory access and addition, mostly. The original algorithm is more of a test of division.
Re: slightly OT: a simple Benchmark
BYTE was pretty keen on the sieve as a benchmark back in the 80s! Always just a single buffer though.
As it's evolved, this thread has come to investigate more than one kind of thing... it could arguably be two or even three threads by now. On this forum, that only happens when a poster sees fit to start a new thread, when they realise they have a new idea - the mods don't rearrange threads.
The question for a benchmark really is what approximates an application, and that's something which varies according to the use. Is Basic to be used more like Fortran or more like Cobol? The question for an interesting programming exercise is different!
For the prime-gaps benchmark as tabulated at the top of the thread, at minimum, we're either going to be benchmarking division (or modulus) or array accesses - possibly both. In the earlier revisions, we're also benchmarking square root, or multiplication, or both.
In interpreted languages we're benchmarking the interpreter, and in versions which use GOTO we're also benchmarking that.
I quite like the use of integer variables (but again there's a choice - one might want to be benchmarking the floating point)
I have got one contribution which is a cheap speedup of one of the early versions: it's like a poor man's wheel.
I'm quite interested to roll my sleeves up and try our OPC CPUs with (some version of) this benchmark, but I haven't got to it. To have any kind of useful comparison you surely need to be coding the same algorithm with the same optimisations, except where one CPU has to do something by hand which another can do with an instruction. For me, comparing CPUs is interesting, taking care to normalise to memory performance or similar.
As it's evolved, this thread has come to investigate more than one kind of thing... it could arguably be two or even three threads by now. On this forum, that only happens when a poster sees fit to start a new thread, when they realise they have a new idea - the mods don't rearrange threads.
The question for a benchmark really is what approximates an application, and that's something which varies according to the use. Is Basic to be used more like Fortran or more like Cobol? The question for an interesting programming exercise is different!
For the prime-gaps benchmark as tabulated at the top of the thread, at minimum, we're either going to be benchmarking division (or modulus) or array accesses - possibly both. In the earlier revisions, we're also benchmarking square root, or multiplication, or both.
In interpreted languages we're benchmarking the interpreter, and in versions which use GOTO we're also benchmarking that.
I quite like the use of integer variables (but again there's a choice - one might want to be benchmarking the floating point)
I have got one contribution which is a cheap speedup of one of the early versions: it's like a poor man's wheel.
Code: Select all
1 REM Basic-Bench la SYM etc.
2 REM *** Integerversion + Modulofunktion ***
10 Z% = 3: INPUT A%,B%
15 TIME = 0
20 FOR C% = 5 TO A% STEP 2
25 IF C% MOD 3 = 0 THEN C%=C%+2
30 FOR D% = 5 TO SQR(C%) STEP 2
40 IF C% MOD D% = 0 THEN 80
50 NEXT D%
60 IF C%-Z% >= B% THEN PRINT C%,Z%,C%-Z%: PRINT TIME: GOTO 10
70 Z% = C%
80 NEXT C%
90 PRINT "keine Lsung gefunden !": GOTO 10Re: slightly OT: a simple Benchmark
Chromatix wrote:
How fast can it run my code? I have emulation numbers for 16MHz, but it'd be nice to verify them with real hardware.
The code is assembled to $4000 by default (easily changed), and uses 16 zero-page bytes beginning at $70 (again, easily changed). The main entry point is at $4000 with $70 set to the gap size to search for; it returns with the following six bytes containing the pair of primes found with at least that gap. The eighth byte is zeroed so that the second prime can be read as a 4-byte integer by BASIC.
At the end of the source file is a test routine covering cases A-F, verifying that it produces the correct result. If it gets as far as the final NOP-NOP-NOP-NOP-BRA, it's all correct. If it goes into any of the other BNE loops, there's a mismatch. My emulator is set up to detect these tight loops and treat them as an exit condition, due to the test suite.
The code is assembled to $4000 by default (easily changed), and uses 16 zero-page bytes beginning at $70 (again, easily changed). The main entry point is at $4000 with $70 set to the gap size to search for; it returns with the following six bytes containing the pair of primes found with at least that gap. The eighth byte is zeroed so that the second prime can be read as a 4-byte integer by BASIC.
At the end of the source file is a test routine covering cases A-F, verifying that it produces the correct result. If it gets as far as the final NOP-NOP-NOP-NOP-BRA, it's all correct. If it goes into any of the other BNE loops, there's a mismatch. My emulator is set up to detect these tight loops and treat them as an exit condition, due to the test suite.
This is real hardware, so I will not be able to detect your BNE traps.
Can you give me an idea on how this works? Does it run all cases (A-F) on a single call, or do you need to set up each case?
Bill
Re: slightly OT: a simple Benchmark
A simple sieve
Code: Select all
10 PR=O
20 L=1000
30 PRINT "SIEVE OF ERATOSTHENESE"
40 PRINT "COUNT THE PRIME NUMBERS UP TO ";L
50 DIM N(L)
60 FOR I=1 TO L:N(I)=I:NEXT I
70 P=2
80 GOSUB 200
90 P=P+1
100 IF N(P)<>0 THEN GOSUB 200
110 IF P < L THEN GOTO 90
120 PRINT:PRINT PR;" PRIME NUMBERS FOUND."
130 END
200 PR=PR+1
210 FOR I=P TO L STEP P:N(I)=0:NEXT I
220 RETURNBill
Re: slightly OT: a simple Benchmark
The BNE traps are effectively part of a unit-test routine. They're not intended for use in an actual benchmark run, since on real hardware (or an emulator for a real micro) they just provoke an infinite loop.
Instead, use the main entry point at $4000 as previously described, after setting $70 to the gap size to find. On return, bytes $71-73 and $74-76 contain little-endian 24-bit representations of the two primes enclosing that gap, and $77 is zeroed so you can read one of them as a 32-bit integer. Outputting these numbers and measuring the time interval is up to the caller, since there's so much variation in the hardware and OS facilities to do so.
In BBC BASIC I used the following:
Instead, use the main entry point at $4000 as previously described, after setting $70 to the gap size to find. On return, bytes $71-73 and $74-76 contain little-endian 24-bit representations of the two primes enclosing that gap, and $77 is zeroed so you can read one of them as a 32-bit integer. Outputting these numbers and measuring the time interval is up to the caller, since there's so much variation in the hardware and OS facilities to do so.
In BBC BASIC I used the following:
Code: Select all
DATA 20,30,35,50,70,100
*LOAD O.PRIMEGP 4000
FOR I%=1 TO 6
READ G%
?&70=G%
T%=TIME
CALL(&4000)
U%=TIME
PRINT !&74, (U%-T%)/100
NEXT
Re: slightly OT: a simple Benchmark
Got it.
Without a time function in EhBAsic (or the hardware to supp0ort it) I'll have to run them individually.
Without a time function in EhBAsic (or the hardware to supp0ort it) I'll have to run them individually.
Bill