PostPosted: Wed Oct 11, 2023 10:24 pm 

Joined: Fri Jul 09, 2021 10:12 pm
Posts: 741
I made a 6502 prime sieve to exercise my new computer. I think it's pretty fast, but I'm not an expert on these algorithms. It finds all the primes up to about 950,000 in about 1.3 seconds, at least when it doesn't have to print them all out. It doesn't quite get to a million because this system only has 32K of RAM, and one byte of the sieve can only cover 30 numbers' worth - not quite enough!

I've shared the code here in case anyone is interested in stealing it, or just seeing how it works:

https://github.com/gfoot/primesieve6502

The main technique is working through the number space in steps of 30. For each 30 numbers there are 8 prime candidates that are worth testing - the others are all multiples of 2, 3, or 5. These 8 fit nicely in a byte and allow some fast-seeming techniques to work. I've gone over this in a bit more detail in the readme and in the code itself.
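
To make the byte layout concrete, here's a rough BBC BASIC sketch of just the number-to-bit mapping - this isn't the code from the repo, and R%() and the other names are only for illustration:
Code:
   10 REM Each sieve byte covers 30 numbers: bit B% of byte K% stands for 30*K%+R%(B%),
   20 REM where R%() holds the 8 residues that are coprime to 30
   30 DIM R%(7)
   40 FOR B%=0 TO 7:READ R%(B%):NEXT
   50 DATA 1,7,11,13,17,19,23,29
   60 REM example: where does 997 live?
   70 N%=997:K%=N% DIV 30:M%=N% MOD 30
   80 FOR B%=0 TO 7:IF R%(B%)=M% THEN PRINT "byte ";K%;", bit ";B%
   90 NEXT

It should print byte 33, bit 1 - anything whose remainder isn't in the table is a multiple of 2, 3 or 5 and doesn't get a bit at all.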

It's not something I plan to keep developing, I just wanted a test program really! I've also been in need of "real world code" like this, to test a hypothesis regarding 6502 bus cycle type frequencies, and this is at least one data point there.


PostPosted: Thu Oct 12, 2023 7:07 am 

Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10985
Location: England
Good fun! I'm just a little bit baffled by this
Code:
   ; If n is really large, we don't need to update the sieve
   lda zp_n+2 : bne continue
   lda zp_n+1 : bmi continue

The best test, I think, is to compare against a thousand, which is bigger than the square root of your max value. But even with a weaker test, I can't quite see why that's BMI rather than BPL. I'd be interested to understand this.


PostPosted: Thu Oct 12, 2023 8:29 am 

Joined: Fri Jul 09, 2021 10:12 pm
Posts: 741
Yes, I think you're right. The logic there was: if n is large enough that adding it to addr would overflow the end of the sieve, then skip updating the sieve - and N>=$8000 is definitely that large. It was based on how the sieve update code works rather than any proper maths. Now that you point out the square root thing, though, it doesn't feel correct, and it should be checking against the square root instead, which is generally a stricter test.


PostPosted: Thu Oct 12, 2023 8:44 am 

Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10985
Location: England
Ah, of course - that at least explains the BMI. I had a bit of a mental block there - it's best thought of, in this case, as a comparison with &80 (or &7F, depending on how you think about it, perhaps).

Edit squared: I think I get it now...


PostPosted: Thu Oct 12, 2023 10:06 am 

Joined: Fri Jul 09, 2021 10:12 pm
Posts: 741
It wasn't terribly well thought-out to be honest, and I'm now second-guessing a lot of how setbits works!

The "setbits" code mostly works by scanning the next n addresses after addr, in case one of the eight bits in that byte represents a multiple of n - in fact we expect this to be true once for each bit because these remainders modulo 30 form a field (I believe - it's decades since I studied maths). When each bit matches up for an address, we mark it as non-prime (in case n<30) and then can quickly add n to the address - adding 30*n to the number we just found - and mark that as non-prime as well, repeating until we reach the end of the sieve. Many of these will already be marked non-prime but some may not be.

For n>30, when we find each bit's next multiple of n, it will already be marked as non-prime because it is a multiple of a number smaller than n. But the value of finding these multiples is that we can then eliminate the higher multiples later in the sieve. My original test here was really checking whether there actually were any higher multiples within the sieve - in a very generous way - and not bothering with the scan at all if there were not going to be any.
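
To illustrate, striking out one prime might look something like this in BBC BASIC - this is only a sketch of the idea, not the actual assembly, and the names are made up:
Code:
   10 REM Strike out multiples of one prime P% in a bit-packed sieve of L% bytes,
   20 REM where bit B% of byte K% represents the number 30*K%+R%(B%)
   30 DIM R%(7),M%(7):T%=1
   40 FOR B%=0 TO 7:READ R%(B%):M%(B%)=T%:T%=T%*2:NEXT
   50 DATA 1,7,11,13,17,19,23,29
   60 L%=1000:DIM S% L%-1
   70 FOR K%=0 TO L%-1:S%?K%=255:NEXT:REM every candidate starts out marked prime
   80 P%=7
   90 FOR B%=0 TO 7
  100   REM within any P% consecutive bytes this bit lines up with a multiple of P% exactly once
  110   REM (start at byte 1 so that P% itself is never struck out)
  120   K%=0:REPEAT:K%=K%+1:UNTIL (30*K%+R%(B%)) MOD P%=0
  130   REM adding P% to the byte index adds 30*P% to the number, so it stays a multiple
  140   REPEAT
  150     IF K%<L% THEN S%?K%=S%?K% AND (255-M%(B%))
  160     K%=K%+P%
  170   UNTIL K%>=L%
  180 NEXT

The stepping loop at the end is the cheap part - in assembly it's just an add to the byte pointer and a bit clear - while the initial scan over P% bytes is, I think, the work that the size test is there to skip once it can no longer pay off.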

The square root check is much better though, speeding things up by a factor of 2 to about 650ms - I tested against 1024 as it's a nice round number:

Code:
    ; If n is really large, we don't need to update the sieve
    ; Compare against 1024 as that's more than the square root of the sieve size
    lda zp_n+2 : bne continue
    lda #>1024 : cmp zp_n+1 : bcc continue   ; inverted test so that carry is clear if we branch

There's also an sbc in the next instruction which should be modified as the carry will now be in the opposite state to what it was before.


PostPosted: Thu Oct 12, 2023 4:56 pm 

Joined: Fri Jul 09, 2021 10:12 pm
Posts: 741
Here's a BBC BASIC port of the same algorithm, but using video memory to visualise the sieve getting filled up. If you wanted to print out the primes, that would go around line 120.

I avoided things like procedures and multi-letter variable names so it should be possible to port it to the TinyBASICs that have been discussed recently. The only obscure feature I used was direct memory access for byte arrays, especially for the sieve itself.

I also don't know whether this algorithm is useful in BASIC - it may be that BASIC is faster at running a simpler algorithm.

https://bbcmic.ro/?experimental=true#%7 ... %201%22%7D


PostPosted: Thu Oct 12, 2023 6:32 pm 

Joined: Wed Feb 14, 2018 2:33 pm
Posts: 1488
Location: Scotland
gfoot wrote:
Here's a BBC BASIC port of the same algorithm, but using video memory to visualise the sieve getting filled up. If you wanted to print out the primes, that would go around line 120.

I avoided things like procedures and multi-letter variable names so it should be possible to port it to the TinyBASICs that have been discussed recently. The only obscure feature I used was direct memory access for byte arrays, especially for the sieve itself.

I also don't know whether this algorithm is useful in BASIC - it may be that BASIC is faster at running a simpler algorithm.


The main thing with BASICs might be bit manipulation - it's possible, but might not be fast. The example I put together for my TinyBasic is just the classic sieve, one byte per number, so the limitation there is a pool of 32767 bytes due to the 16-bit signed integers with overflow detection. Remove the overflow detection and the limit becomes available RAM.
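
For reference, the classic one-flag-per-number sieve is just this sort of thing - sketched in BBC BASIC here rather than my TinyBasic, with N% kept under 32767 to suit 16-bit integers:
Code:
   10 REM Classic sieve: one flag byte per number, no bit packing
   20 N%=10000:REM keep below 32767 for a 16-bit signed integer BASIC
   30 DIM F% N%
   40 FOR I%=0 TO N%:F%?I%=0:NEXT
   50 FOR I%=2 TO SQR(N%)
   60   IF F%?I%=0 THEN FOR J%=I%*I% TO N% STEP I%:F%?J%=1:NEXT
   70 NEXT
   80 C%=0
   90 FOR I%=2 TO N%:IF F%?I%=0 THEN C%=C%+1
  100 NEXT
  110 PRINT "Found ";C%;" primes up to ";N%

It should report 1229 primes for N%=10000.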

Running a 32-bit BCPL version with an array of 100,000 bytes, naively one prime per byte takes 9.5 seconds and finds 9592 primes which I think is right. I could put one together that uses bits with the same algorithm but it would just be slower, so win for assembler there... It would be much slower to do the packing 8 primes per byte...

-Gordon

_________________
--
Gordon Henderson.
See my Ruby 6502 and 65816 SBC projects here: https://projects.drogon.net/ruby/


PostPosted: Thu Oct 12, 2023 8:50 pm 

Joined: Wed Feb 14, 2018 2:33 pm
Posts: 1488
Location: Scotland
drogon wrote:
Running a 32-bit BCPL version with an array of 100,000 bytes, naively one prime per byte takes 9.5 seconds and finds 9592 primes which I think is right. I could put one together that uses bits with the same algorithm but it would just be slower, so win for assembler there... It would be much slower to do the packing 8 primes per byte...


And just to follow up, my TinyBasic on the same 16MHz system with an array of just 10,000 byte-wide elements takes 20 seconds - and that's without any form of 8:1 compression going on ...

-Gordon

_________________
--
Gordon Henderson.
See my Ruby 6502 and 65816 SBC projects here: https://projects.drogon.net/ruby/


PostPosted: Thu Oct 12, 2023 9:07 pm 

Joined: Fri Jul 09, 2021 10:12 pm
Posts: 741
Yes, for sure - BASIC is always going to be bad at both bitwise operations (unless it has specific optimisations for them) and read-modify-write operations. And I expect even in assembly it's going to be much faster to store one prime per byte - but hugely limiting on sieve size.


PostPosted: Thu Oct 12, 2023 9:39 pm 

Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10985
Location: England
Bytes for speed, bits for capacity, I think. It's tantalising that 32K isn't quite enough for primes up to a million, but it seems that's so, for this approach.

I think, though, that it's possible to compute a large sieve in sections. It's extra complexity though, which counts against.

We need primes up to 1000 to calculate the sieve up to a million. Those primes fit in the first 34 bytes, or so, of a bitmask. This small part of the sieve contains the working divisors, the rest contains the result. If we break up that larger part, the result, into big blocks, we can compute them independently of each other, and so we don't need them all in memory at once. To do the filling-in of block N, having previously computed block N-1, we need to sync up the phases of the divisors. One way would be by doing some sums. Another way is to store them, from one block to the next: we have maybe 200 primes to keep track of, so we have a list of starting points, one for each, which is where we left off with the previous block and need to start for the next.

For example, take a block size of 10000. When counting up by 13 and recording the multiples in the block, we'll overflow when we get to 10010, and will record that for when we start the next block. When counting up by 17, we'll get to 10013. And so on: one starting point for each of the primes we need to use, and it turns out there's not so many of those. Next block, the 13 will overflow at 20007 and the 17 at 20009.

Of course, a block size of 8K or 16K might make the accounting easier.
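
Here's a minimal BBC BASIC sketch of that bookkeeping - not gfoot's program, and the 100,000 limit, block size and names are just for illustration. P%() holds the small primes and N%() holds each one's next multiple, carried from block to block:
Code:
   10 REM Block-at-a-time sieve: find the primes up to SQR(LIMIT%) first, then sieve
   20 REM each block of BS% numbers, carrying every prime's next multiple across blocks
   30 LIMIT%=100000:BS%=10000:S%=SQR(LIMIT%)
   40 DIM T% S%:DIM F% BS%-1:DIM P%(200):DIM N%(200)
   50 FOR I%=0 TO S%:T%?I%=0:NEXT
   60 FOR I%=2 TO SQR(S%)
   70   IF T%?I%=0 THEN FOR J%=I%*I% TO S% STEP I%:T%?J%=1:NEXT
   80 NEXT
   90 C%=0
  100 FOR I%=2 TO S%:IF T%?I%=0 THEN C%=C%+1:P%(C%)=I%
  110 NEXT
  120 NP%=C%
  130 REM each prime's first multiple beyond the small sieve
  140 FOR I%=1 TO NP%:N%(I%)=((S%+P%(I%)) DIV P%(I%))*P%(I%):NEXT
  150 FOR LO%=S%+1 TO LIMIT% STEP BS%
  160   FOR K%=0 TO BS%-1:F%?K%=0:NEXT
  170   FOR I%=1 TO NP%
  180     REM strike this prime's multiples inside the block, remembering where we stop
  190     REPEAT
  200       IF N%(I%)<LO%+BS% THEN F%?(N%(I%)-LO%)=1:N%(I%)=N%(I%)+P%(I%)
  210     UNTIL N%(I%)>=LO%+BS%
  220   NEXT
  230   FOR K%=0 TO BS%-1
  240     IF LO%+K%<=LIMIT% AND F%?K%=0 THEN C%=C%+1
  250   NEXT
  260 NEXT
  270 PRINT "Found ";C%;" primes up to ";LIMIT%

With these numbers it should agree with the 9592 figure mentioned earlier.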

Edit: oh, and thanks for the Owlet link! I changed the initialisation of W% to &3400, and it computes 3314 primes in a shade under 70 seconds.


PostPosted: Thu Oct 12, 2023 10:14 pm 

Joined: Fri Jul 09, 2021 10:12 pm
Posts: 741
It's a nice idea - I think it would suffer due to the storage needed to keep track of past primes though, as I've done something similar before. (Edit - ah, I see I keep overlooking that we only need to store them up to 1000 - which is not very many, as you said!)

It was a BBC BASIC program I wrote for demoing Supershadow's large memory space - it's a regular sieve (using mod 6 to skip multiples of 2 or 3 I think, but probably only storing one candidate per byte in the sieve) but as it goes it keeps track of all the primes it finds, overwriting the start of the sieve with those, as the start is no longer needed. When the sieve runs out it wipes everything from the end of the prime list onwards, making a new smaller sieve, then carries on, after first using the known primes to set up the new sieve.

It was able to count up pretty high - at least I thought so at the time! I didn't have much to compare against. The theoretical maximum of that approach was one prime per byte, as it needs to actually store the primes. Actually I guess this assembly version blows that out of the water, finding more primes than can fit in memory in under a second!

Though it strikes me that as a variation you could do a very fast sieve using one byte per candidate to find the early primes quickly, then pack them into a bitmap at the start of a larger, 8-per-byte sieve, to go on to find larger ones.
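
The packing step might look something like this - again a BBC BASIC sketch with made-up names, not anything from the repo:
Code:
   10 REM Variation: quick one-flag-per-number sieve up to 999, then pack the results
   20 REM into the first 34 bytes of a 30-wheel bitmap (bit B% of byte K% = 30*K%+R%(B%))
   30 DIM R%(7),M%(7):T%=1
   40 FOR B%=0 TO 7:READ R%(B%):M%(B%)=T%:T%=T%*2:NEXT
   50 DATA 1,7,11,13,17,19,23,29
   60 N%=999:DIM F% N%
   70 FOR I%=0 TO N%:F%?I%=0:NEXT
   80 FOR I%=2 TO SQR(N%)
   90   IF F%?I%=0 THEN FOR J%=I%*I% TO N% STEP I%:F%?J%=1:NEXT
  100 NEXT
  110 DIM S% 33
  120 FOR K%=0 TO 33:S%?K%=0:NEXT
  130 FOR K%=0 TO 33
  140   FOR B%=0 TO 7
  150     Q%=30*K%+R%(B%)
  160     IF Q%>1 THEN IF Q%<=N% THEN IF F%?Q%=0 THEN S%?K%=S%?K% OR M%(B%)
  170   NEXT
  180 NEXT
  190 REM S% now holds the first 34 bytes of the big sieve, ready to carry on from 1020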


PostPosted: Fri Oct 13, 2023 12:19 am 

Joined: Fri Jul 09, 2021 10:12 pm
Posts: 741
So this is maybe what you meant? https://bbcmic.ro/#%7B%22v%22%3A1%2C%22 ... Q%25%22%7D

It does a basic sieve for primes up to 1000, then repeats a larger sieve until it gets to 1000000. The sizes should be pretty configurable - I don't know where the efficiency sweet spot would lie. As written it uses a 10K sieve.

It doesn't try to do anything smart with the remainders when initialising the new sieves - you could precalculate how much they change by with each sieve pass, to save a bit of time if lots of passes are needed. Which they are, I guess we should expect about 100 passes (1000000 / 10240).

This algorithm is very simple and should also work well in assembler, I think - it's mostly integer addition.
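
On the precalculation point above: each prime only needs its block-to-block shift, which is the block length mod the prime. A tiny sketch, reusing the 13 and 17 example from earlier in the thread (names made up):
Code:
   10 REM For each prime, precompute how far its first in-block multiple shifts per pass:
   20 REM D% = BS% MOD P% once, then for each new block O% = (O%+P%-D%) MOD P%
   30 BS%=10000
   40 FOR Z%=1 TO 2
   50   READ P%:D%=BS% MOD P%
   60   LO%=BS%:O%=(P%-(LO% MOD P%)) MOD P%:REM only the first block needs a real MOD
   70   FOR PASS%=1 TO 2
   80     PRINT "p=";P%;" block at ";LO%;" first multiple ";LO%+O%
   90     O%=(O%+P%-D%) MOD P%:LO%=LO%+BS%
  100   NEXT
  110 NEXT
  120 DATA 13,17

It should print 10010 and 20007 for 13, and 10013 and 20009 for 17, matching the earlier example, with no division inside the per-block loop.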


PostPosted: Fri Oct 13, 2023 12:46 pm 

Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10985
Location: England
Thanks - that took just over 39 mins to report the correct answer (of course):
Quote:
Found 78499 primes up to 1000003

Or 26 mins if it doesn't print out the primes.

I do like the finding that 8 bits covers 30 numbers - that's a good size wheel. Looks (superficially) as if this latest block-sieve program only skips odd numbers - is that right? That's so much easier of course.

A combination of a block sieve, byte-sized flags, and a wheel of 30... that would be something!


PostPosted: Fri Oct 13, 2023 1:05 pm 

Joined: Fri Jul 09, 2021 10:12 pm
Posts: 741
I don't think the wheel of 30 helps in that case. However I did try making the block size a multiple of 3*5*7*11=1155, so that blocks can easily be prepopulated with all the multiples of those primes struck out - but it makes the code more complex and I'm not convinced it really sped things up! You can play with it if you like: https://bbcmic.ro/#%7B%22v%22%3A1%2C%22 ... C%22%22%7D
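
For anyone curious, the prepopulation amounts to something like this (a BBC BASIC sketch with made-up names - the real version is in the link above):
Code:
   10 REM Build a reusable 1155-byte template with multiples of 3, 5, 7 and 11 struck,
   20 REM then start each block (whose base is a multiple of 1155) from a copy of it
   30 W%=3*5*7*11
   40 DIM T% W%-1:DIM F% W%-1
   50 FOR J%=0 TO W%-1:T%?J%=0:NEXT
   60 FOR J%=0 TO W%-1
   70   IF J% MOD 3=0 OR J% MOD 5=0 OR J% MOD 7=0 OR J% MOD 11=0 THEN T%?J%=1
   80 NEXT
   90 REM starting a new block: copy the template instead of re-striking 3, 5, 7 and 11
  100 FOR J%=0 TO W%-1:F%?J%=T%?J%:NEXT

Copying 1155 bytes per block replaces re-striking those four primes on every pass.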

I will try a 6502 version of one of these later on.

Edit: Here's the "intermediate" version that tracks the residues for each prime as it goes rather than using MOD, but doesn't force the sieve size to be a multiple of 1155 and so has to reapply the smallest primes on each pass: https://bbcmic.ro/#%7B%22v%22%3A1%2C%22 ... C%22%22%7D

If we call this interim version "B", and the one at the top of this post "C", and the previous one (that used MOD) "A", then times for primes up to 1000000, with an 11550-byte sieve, were 39:06 for A, 41:00 for B, and 37:12 for C - so C is quite a bit faster due to being able to reuse some of the sieve contents between batches, but for some reason B is slower than A, which was unexpected.


PostPosted: Fri Oct 13, 2023 6:15 pm 

Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10985
Location: England
Interesting timings... The prefilled blocks of 1155 are a neat idea.

