6502.org Forum  Projects  Code  Documents  Tools  Forum
It is currently Fri Nov 22, 2024 6:26 am

All times are UTC




Post new topic Reply to topic  [ 210 posts ]  Go to page Previous  1 ... 5, 6, 7, 8, 9, 10, 11 ... 14  Next
Author Message
PostPosted: Sun Jul 08, 2018 4:55 pm 
Offline
User avatar

Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10985
Location: England
If we're still thinking of speeding up some trial-division code, it strikes me that dividing by 33 or 35 and looking up in a table to sift out the numbers divisible by small primes might be faster than doing two divisions. At the cost of a larger table, divide by 1155 or some other product of small primes. Maybe a whole series of smaller tables is a bigger win than one large table. Such a table could be implemented in basic or any other language but might fit especially well in machine code - especially if it's quite small.

I like the idea of doing this task within a reasonable space like 4k. Feels like the kind of constraint which improves solutions.


Top
 Profile  
Reply with quote  
PostPosted: Sun Jul 08, 2018 5:41 pm 
Offline

Joined: Mon May 21, 2018 8:09 pm
Posts: 1462
Well, one option is to perform GCD with a series of constants like 15015 (=3*5*7*11*13) to get through the small divisors quickly - but I was keen to keep the algorithm very simple, so that it could be implemented directly in other languages, while still being much faster than the brute-force algorithm originally used. I'll see what I can do to translate it into 6502 and 68K assembly myself, and into BASIC as well.

I'm also still interested in the practical performance of the 6309 when not hobbled by a language interpreter. Its implementation of division is supposedly a *lot* faster than that of the 68000.


Top
 Profile  
Reply with quote  
PostPosted: Sun Jul 08, 2018 5:44 pm 
Offline
User avatar

Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10985
Location: England
If you're already dividing by a series of constants from an array, you're not far from instead multiplying by a series of reciprocals, which should be quicker. Just need to figure out the fixed point and the rounding. How hard can it be.

With a wheel incrementer, I think a much smaller table of primes would suffice, as you get the most gains (I think) from the smallest primes, which are most likely to be factors.


Top
 Profile  
Reply with quote  
PostPosted: Sun Jul 08, 2018 6:08 pm 
Offline

Joined: Mon May 21, 2018 8:09 pm
Posts: 1462
The trouble is that multiplying by a reciprocal doesn't give you an accurate remainder without re-multiplying. And multiplication is almost as expensive as division in software.


Top
 Profile  
Reply with quote  
PostPosted: Sun Jul 08, 2018 6:12 pm 
Offline
User avatar

Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10985
Location: England
Not with lookup tables... and isn't it the case that testing against an integer can't need (many) more bits after the point than the original divisor?


Top
 Profile  
Reply with quote  
PostPosted: Sun Jul 08, 2018 6:47 pm 
Offline

Joined: Mon May 21, 2018 8:09 pm
Posts: 1462
On the 68000, a 16x16=32 bit multiply (the only one available) takes 70 cycles - so two of them, as would be needed when candidate primes exceed 16 bits, would be just as expensive as a 32/16 division. So that's not a win, no matter how you spin it. Similar arguments apply against multiply-reciprocal on the 6502, unless your multiplication lookup tables occupy a very large proportion of the memory.

In BASIC, good luck obtaining the high half of a 64-bit or even 48-bit result from a multiply operation. That will seriously complicate porting to many old languages. In any case, overhead of interpretation is the usual limiting factor, not the actual arithmetic. Taking the remainder directly is better all round.

Conversely, on the 6502 I can run through 8-bit divisors relatively quickly for the most common elimination cases, and switch to the slower 16-bit divisors only rarely. I can't think of any way to do that faster by multiplication, even if I could somehow make multiplication faster.

And on the crazy homebrew end of things, what if someone built a memory-mapped divider and used it to run this algorithm? Now *that* would be interesting.


Top
 Profile  
Reply with quote  
PostPosted: Sun Jul 08, 2018 6:51 pm 
Offline
User avatar

Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10985
Location: England
What did you mean by your GCD comment? Did you miss the idea of a table of results, or did I miss something?


Top
 Profile  
Reply with quote  
PostPosted: Sun Jul 08, 2018 7:01 pm 
Offline
User avatar

Joined: Fri Aug 30, 2002 1:09 am
Posts: 8543
Location: Southern California
Chromatix wrote:
Similar arguments apply against multiply-reciprocal on the 6502, unless your multiplication lookup tables occupy a very large proportion of the memory

or you put the large tables in ROM, interfaced through I/O, which you can do with my large tables at http://wilsonminesco.com/16bitMathTables/ even though they were primarily intended to go in a 65816's memory map. It gives ways to do it though; and even if using a serial interface, it's still much faster than actually calculating the answers. The reciprocal table is the largest, giving a 32-bit answer for every 16-bit input. (That's not to say you have to use all 32 bits for every application; but they're available.)

_________________
http://WilsonMinesCo.com/ lots of 6502 resources
The "second front page" is http://wilsonminesco.com/links.html .
What's an additional VIA among friends, anyhow?


Top
 Profile  
Reply with quote  
PostPosted: Sun Jul 08, 2018 7:04 pm 
Offline

Joined: Mon May 21, 2018 8:09 pm
Posts: 1462
GCD can be done with one big remainder operation and a bunch of smaller (thus faster) ones, and you don't need a lookup table to interpret the results. But I'm not completely convinced that it's faster overall than individual, well-optimised remainders over 8-bit divisors.

A property of my code is that it starts with practically no prior information, only the first *two* primes. The lookup table of all other 16-bit primes is constructed during the search. On an 8-bit machine, that will take a usefully measurable amount of time by itself, while the table and code combined can fit within 16KB.


Top
 Profile  
Reply with quote  
PostPosted: Mon Jul 09, 2018 10:45 am 
Offline

Joined: Tue Sep 03, 2002 12:58 pm
Posts: 336
If you're changing the algorithm completely, it's possible to do without any multiplication or division in the main loop.

The trick is to use a sieve. Allocate a block containing one bit for each odd number up to the square root of the largest number you expect to encounter (or more, if you've got the memory). If you want to stay within 32 bits, that's at least 32768 bits, or 4KB. To start with, they're all clear. You know that 3 is prime, so mark every odd number from 3^2, stepping by 3, until the end of the buffer as composite. The next clear bit after 3 is 5, so mark every number from 5^2 to the end, stepping by 5. Keep doing this until you the starting point is off the end of the buffer. You've now got all primes up to 65536.

Now create another buffer, representing numbers in blocks starting from 65536. These blocks can be as large as you have memory for. For each prime in the first block whose square is less than the end of the new buffer, mark multiples of that prime that are in the buffer. Then you can scan through the buffer looking for consecutive primes with a large gap between them. If you don't find any, move on to the next block of numbers. This will eventually find every prime less than 2^32. And it's fast.

If you've got the memory, you can have one massive buffer initial buffer, and don't bother with the sliding one. Odd numbers up to 2^32 will need 256MB, which is a lot for an 8-bitter, but nothing on a modern PC. You need one multiply for each prime up to the square root of your limit. The rest is addition. The sliding buffer needs a division per prime per buffer as well, to find which multiple to start with.

As a benchmark, it's testing different things: memory access and addition, mostly. The original algorithm is more of a test of division.


Top
 Profile  
Reply with quote  
PostPosted: Mon Jul 09, 2018 11:05 am 
Offline
User avatar

Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10985
Location: England
BYTE was pretty keen on the sieve as a benchmark back in the 80s! Always just a single buffer though.

As it's evolved, this thread has come to investigate more than one kind of thing... it could arguably be two or even three threads by now. On this forum, that only happens when a poster sees fit to start a new thread, when they realise they have a new idea - the mods don't rearrange threads.

The question for a benchmark really is what approximates an application, and that's something which varies according to the use. Is Basic to be used more like Fortran or more like Cobol? The question for an interesting programming exercise is different!

For the prime-gaps benchmark as tabulated at the top of the thread, at minimum, we're either going to be benchmarking division (or modulus) or array accesses - possibly both. In the earlier revisions, we're also benchmarking square root, or multiplication, or both.

In interpreted languages we're benchmarking the interpreter, and in versions which use GOTO we're also benchmarking that.

I quite like the use of integer variables (but again there's a choice - one might want to be benchmarking the floating point)

I have got one contribution which is a cheap speedup of one of the early versions: it's like a poor man's wheel.
Code:
    1  REM Basic-Bench  la SYM etc.
    2  REM *** Integerversion + Modulofunktion ***
   10 Z% = 3: INPUT A%,B%
   15 TIME = 0
   20 FOR C% = 5 TO A% STEP 2
   25 IF C% MOD 3 = 0 THEN C%=C%+2
   30 FOR D% = 5 TO SQR(C%) STEP 2
   40 IF C% MOD D% = 0 THEN 80
   50 NEXT D%
   60 IF C%-Z% >= B% THEN PRINT C%,Z%,C%-Z%: PRINT TIME: GOTO 10
   70 Z% = C%
   80 NEXT C%
   90 PRINT "keine Lsung gefunden !": GOTO 10


I'm quite interested to roll my sleeves up and try our OPC CPUs with (some version of) this benchmark, but I haven't got to it. To have any kind of useful comparison you surely need to be coding the same algorithm with the same optimisations, except where one CPU has to do something by hand which another can do with an instruction. For me, comparing CPUs is interesting, taking care to normalise to memory performance or similar.


Top
 Profile  
Reply with quote  
PostPosted: Mon Jul 09, 2018 8:19 pm 
Offline
User avatar

Joined: Fri Dec 12, 2008 10:40 pm
Posts: 1007
Location: Canada
Chromatix wrote:
How fast can it run my code? I have emulation numbers for 16MHz, but it'd be nice to verify them with real hardware.

The code is assembled to $4000 by default (easily changed), and uses 16 zero-page bytes beginning at $70 (again, easily changed). The main entry point is at $4000 with $70 set to the gap size to search for; it returns with the following six bytes containing the pair of primes found with at least that gap. The eighth byte is zeroed so that the second prime can be read as a 4-byte integer by BASIC.

At the end of the source file is a test routine covering cases A-F, verifying that it produces the correct result. If it gets as far as the final NOP-NOP-NOP-NOP-BRA, it's all correct. If it goes into any of the other BNE loops, there's a mismatch. My emulator is set up to detect these tight loops and treat them as an exit condition, due to the test suite.
Okay, I may have time tonight to give this a try.

This is real hardware, so I will not be able to detect your BNE traps.

Can you give me an idea on how this works? Does it run all cases (A-F) on a single call, or do you need to set up each case?

_________________
Bill


Top
 Profile  
Reply with quote  
PostPosted: Mon Jul 09, 2018 9:17 pm 
Offline
User avatar

Joined: Fri Dec 12, 2008 10:40 pm
Posts: 1007
Location: Canada
A simple sieve

Code:
10 PR=O
20 L=1000
30 PRINT "SIEVE OF ERATOSTHENESE"
40 PRINT "COUNT THE PRIME NUMBERS UP TO ";L
50 DIM N(L)
60 FOR I=1 TO L:N(I)=I:NEXT I
70 P=2
80 GOSUB 200
90 P=P+1
100 IF N(P)<>0 THEN GOSUB 200
110 IF P < L THEN GOTO 90
120 PRINT:PRINT PR;" PRIME NUMBERS FOUND."
130 END
200 PR=PR+1
210 FOR I=P TO L STEP P:N(I)=0:NEXT I
220 RETURN

_________________
Bill


Top
 Profile  
Reply with quote  
PostPosted: Mon Jul 09, 2018 9:25 pm 
Offline

Joined: Mon May 21, 2018 8:09 pm
Posts: 1462
The BNE traps are effectively part of a unit-test routine. They're not intended for use in an actual benchmark run, since on real hardware (or an emulator for a real micro) they just provoke an infinite loop.

Instead, use the main entry point at $4000 as previously described, after setting $70 to the gap size to find. On return, bytes $71-73 and $74-76 contain little-endian 24-bit representations of the two primes enclosing that gap, and $77 is zeroed so you can read one of them as a 32-bit integer. Outputting these numbers and measuring the time interval is up to the caller, since there's so much variation in the hardware and OS facilities to do so.

In BBC BASIC I used the following:
Code:
DATA 20,30,35,50,70,100
*LOAD O.PRIMEGP 4000
FOR I%=1 TO 6
READ G%
?&70=G%
T%=TIME
CALL(&4000)
U%=TIME
PRINT !&74, (U%-T%)/100
NEXT


Top
 Profile  
Reply with quote  
PostPosted: Mon Jul 09, 2018 9:50 pm 
Offline
User avatar

Joined: Fri Dec 12, 2008 10:40 pm
Posts: 1007
Location: Canada
Got it.

Without a time function in EhBAsic (or the hardware to supp0ort it) I'll have to run them individually.

_________________
Bill


Top
 Profile  
Reply with quote  
Display posts from previous:  Sort by  
Post new topic Reply to topic  [ 210 posts ]  Go to page Previous  1 ... 5, 6, 7, 8, 9, 10, 11 ... 14  Next

All times are UTC


Who is online

Users browsing this forum: Google [Bot] and 21 guests


You cannot post new topics in this forum
You cannot reply to topics in this forum
You cannot edit your posts in this forum
You cannot delete your posts in this forum
You cannot post attachments in this forum

Search for:
Jump to: