6502.org Forum  Projects  Code  Documents  Tools  Forum
It is currently Fri Nov 22, 2024 5:58 am

All times are UTC




Post new topic Reply to topic  [ 22 posts ]  Go to page Previous  1, 2
Author Message
PostPosted: Fri Oct 13, 2023 7:24 pm 
Offline

Joined: Fri Jul 09, 2021 10:12 pm
Posts: 741
I've done a 6502 implementation of B now, because it's the simplest one to do in assembler: https://github.com/gfoot/primesieve6502 ... mesieve2.s

However, surprisingly it seems to come in slower than my original 30-based one - it takes about twice as long as the optimised version, and I didn't expect that. Perhaps I did something silly in the code here.

Actually it's a good opportunity to try out hoglet's profiler - it showed that about 15% of time is spent in "clear_sieve", 15% in "do_sieve", and about 60% in "init_new_sieve", mostly in init_new_sieve's "elimloop". This is the loop eliminating multiples of low primes from newly-initialised sieves. I suspect it's particularly expensive for the small primes that appear many times in the sieves - 3, 5, 7, 11 - and avoiding this was the idea behind the 1155 trick in the BASIC version. Perhaps that would be more effective here than it was in BASIC.

Here's some of the output from the profiler - 030c is the start of init_new_sieve's elimloop:
Code:
030c :  1613590 cycles (  3.847899%)   806795 ins (2.00 cpi) ********************
030e :  4033975 cycles (  9.619747%)   806795 ins (5.00 cpi) **************************************************
0310 :  1613590 cycles (  3.847899%)   806795 ins (2.00 cpi) ********************
0311 :  2420385 cycles (  5.771848%)   806795 ins (3.00 cpi) ******************************
0313 :  2420385 cycles (  5.771848%)   806795 ins (3.00 cpi) ******************************
0315 :  2420385 cycles (  5.771848%)   806795 ins (3.00 cpi) ******************************
0317 :  2420385 cycles (  5.771848%)   806795 ins (3.00 cpi) ******************************
0319 :  2420385 cycles (  5.771848%)   806795 ins (3.00 cpi) ******************************
031b :  2420385 cycles (  5.771848%)   806795 ins (3.00 cpi) ******************************
031d :  1613590 cycles (  3.847899%)   806795 ins (2.00 cpi) ********************
031f :  2415542 cycles (  5.760299%)   806795 ins (2.99 cpi) *****************************


I'm continually amazed by how useful hoglet's decoder manages to be!


Top
 Profile  
Reply with quote  
PostPosted: Sat Oct 14, 2023 7:56 am 
Offline
User avatar

Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10985
Location: England
Thanks for the commented code and README - I think this kind of code is going to be a twisty maze shortly after being written, if not well-commented!

With up to 256 primes to keep track of, so you can use the split-pages single-index method, I think primes up to 2 million (but not 3 million) should be possible. And 2 million is a nice round number.

I'd like to return to the business of there being 8 prime-friendly residues in every set of 30. In your first cut, it was a great idea to pack 8 flags into one byte. But in the bigger more complicated world of huge sieves constructed block by block, using a whole byte for a flag is a better bet - as you've shown. Here's my thought: split each block into 8, where each of the 8 sub-blocks stands for one of the 8 prime-friendly residues. Now, for each prime, you walk each of the 8 sub-blocks in turn, skipping over P bytes at a time as usual. For each prime you now have 8 pieces of state, to say where you've got to, each of which will overflow into the next sub-block with its own phase. So that's more state. But, you get something of a speed up, because the sieve flags are only 8 in 30 instead of 1 in 2 of the full set of integers.

Possibly you can make the overflow check easier too, if each sub-block is just 256 bytes. That's a guess though - it makes for quite a small sub-block. 8 of those would comprise 2k flags, representing 7500 integers, more or less, if my maths is right.

Forgive me if this is nonsense, or something you've already tried and shown wanting...

Edit: a thought, that inner loop, which adds P, checks for overflow, sets a flag byte... perhaps that could be dynamically constructed code, modified each time you have a new P? Perhaps in zero page, so the flag byte setting is an absolute address updated with a zero page write?


Top
 Profile  
Reply with quote  
PostPosted: Sat Oct 14, 2023 9:24 am 
Offline

Joined: Fri Jul 09, 2021 10:12 pm
Posts: 741
BigEd wrote:
Thanks for the commented code and README - I think this kind of code is going to be a twisty maze shortly after being written, if not well-commented!

It was a twisty maze while being written, as well!

Quote:
With up to 256 primes to keep track of, so you can use the split-pages single-index method, I think primes up to 2 million (but not 3 million) should be possible. And 2 million is a nice round number.

Indeed and with a little bit of code duplication it would allow twice as many primes, using a second set of pages, without much effort. That would probably allow us to reach 10 million or more.

Quote:
I'd like to return to the business of there being 8 prime-friendly residues in every set of 30. In your first cut, it was a great idea to pack 8 flags into one byte. But in the bigger more complicated world of huge sieves constructed block by block, using a whole byte for a flag is a better bet - as you've shown.

That's certainly true in BASIC because read-modify-write is so slow there - however it's not as bad in machine code, the cost of reading each byte before making changes to it seems fairly low compared to the cost of deciding which byte is next. Adding the current prime to the address pointer, for example, is more expensive than setting the flag!

I haven't actually profiled that version yet though, I should do that and see where the bottlenecks are.

Quote:
Here's my thought: split each block into 8, where each of the 8 sub-blocks stands for one of the 8 prime-friendly residues. Now, for each prime, you walk each of the 8 sub-blocks in turn, skipping over P bytes at a time as usual. For each prime you now have 8 pieces of state, to say where you've got to, each of which will overflow into the next sub-block with its own phase. So that's more state. But, you get something of a speed up, because the sieve flags are only 8 in 30 instead of 1 in 2 of the full set of integers.

That's nice, yes. Actually there's also the mod-6 option, where you don't store bits for multiples of 2 or 3. This way there are two candidates per 6 numbers, so it requires 10 bits per 30 numbers, pretty close to 8 but with less complication as the relationship is easier to express.

Quote:
Possibly you can make the overflow check easier too, if each sub-block is just 256 bytes. That's a guess though - it makes for quite a small sub-block. 8 of those would comprise 2k flags, representing 7500 integers, more or less, if my maths is right.

In theory the overhead of small blocks is mostly in the need to clear the sieve between blocks. The population phase seems costly but aside from overheads it is in theory the same cost whether you do it many times for smaller blocks or fewer times with larger blocks.

In the assembler version I added yesterday, I initially used 4 pages for the main sieve, then increased it to 8, then 16, then 64. I think increasing it to 8 saved maybe 10%-20% of execution time, but beyond that the savings were more modest. So 8x 256-byte blocks is probably OK.

Quote:
Edit: a thought, that inner loop, which adds P, checks for overflow, sets a flag byte... perhaps that could be dynamically constructed code, modified each time you have a new P? Perhaps in zero page, so the flag byte setting is an absolute address updated with a zero page write?

I hadn't thought of that. I did wonder about unrolling or having special loops for each of the first few primes but didn't want to do that by hand - however you could generate the update loops for the lowest primes in zero page once and reuse them many times.


Top
 Profile  
Reply with quote  
PostPosted: Sat Oct 14, 2023 9:36 am 
Offline
User avatar

Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10985
Location: England
(To clarify, perhaps, what I'm thinking, or hoping, is that there's no need to do any calculation mod 6 or mod 30, because all we ever do is increment by P and check against the bound. It's true that mod 6 means we only need pairs of sub-blocks, which is fewer. But from 10 to 8 is quite a chunky improvement, surely - 20% faster?)


Top
 Profile  
Reply with quote  
PostPosted: Sat Oct 14, 2023 11:56 am 
Offline

Joined: Fri Jul 09, 2021 10:12 pm
Posts: 741
Yes it might help in that way - also because it completely removes the need for some low primes to be eliminated from the sieve explicitly, and they are the most expensive ones because they occur so frequently.

I have added profiling data to github, side-by-side with the disassembly:

I also made some optimisations based on the profiling data, to both algorithms - mostly short-circuiting some 16-bit arithmetic where possible, and to enable that, splitting the elimination loops into 8-bit and 16-bit versions. As the very small primes are the worst cases, it's the 8-bit versions that matter most.

Looking at the profile, the routines to clear the sieves, eliminate small primes, and find the next prime to print out, are very good candidates for self-modification and maybe moving to page zero.


Top
 Profile  
Reply with quote  
PostPosted: Sat Oct 14, 2023 12:52 pm 
Offline
User avatar

Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10985
Location: England
Ah, I did have a thought that dealing with one-byte primes as a special case might help - sounds like you had the same thought!

Another thought: the sub-block size should be at least as large as the largest prime you plan to sieve with, because that way incrementing the pointer will always land in the same sub-block or the next one. If the pointer skips a sub-block that's a different code path, which can be avoided.

I'm thinking clearing the sieve must be rather like clearing the screen - some unrolling will help.


Top
 Profile  
Reply with quote  
PostPosted: Sat Oct 14, 2023 1:13 pm 
Offline

Joined: Fri Jul 09, 2021 10:12 pm
Posts: 741
I did wonder about that constraint on sub-block sizes. I ignored it in the current implementation but I think it would not work for block sizes less than 1024 right now. I think 4 pages was the smallest I tried.

However I don't think it's hard to support - e.g. here: https://github.com/gfoot/primesieve6502 ... ve2.s#L251 We can just do a cmp at that point to determine whether elimptr is beyond the end of the block, and jump to store_residue on line 284 as usual if it is.


Top
 Profile  
Reply with quote  
Display posts from previous:  Sort by  
Post new topic Reply to topic  [ 22 posts ]  Go to page Previous  1, 2

All times are UTC


Who is online

Users browsing this forum: No registered users and 24 guests


You cannot post new topics in this forum
You cannot reply to topics in this forum
You cannot edit your posts in this forum
You cannot delete your posts in this forum
You cannot post attachments in this forum

Search for:
Jump to: