BigEd wrote:
Thanks for the commented code and README - I think this kind of code is going to be a twisty maze shortly after being written, if not well-commented!
It was a twisty maze while being written, as well!
Quote:
With up to 256 primes to keep track of, so you can use the split-pages single-index method, I think primes up to 2 million (but not 3 million) should be possible. And 2 million is a nice round number.
Indeed and with a little bit of code duplication it would allow twice as many primes, using a second set of pages, without much effort. That would probably allow us to reach 10 million or more.
Quote:
I'd like to return to the business of there being 8 prime-friendly residues in every set of 30. In your first cut, it was a great idea to pack 8 flags into one byte. But in the bigger more complicated world of huge sieves constructed block by block, using a whole byte for a flag is a better bet - as you've shown.
That's certainly true in BASIC because read-modify-write is so slow there - however it's not as bad in machine code, the cost of reading each byte before making changes to it seems fairly low compared to the cost of deciding which byte is next. Adding the current prime to the address pointer, for example, is more expensive than setting the flag!
I haven't actually profiled that version yet though, I should do that and see where the bottlenecks are.
Quote:
Here's my thought: split each block into 8, where each of the 8 sub-blocks stands for one of the 8 prime-friendly residues. Now, for each prime, you walk each of the 8 sub-blocks in turn, skipping over P bytes at a time as usual. For each prime you now have 8 pieces of state, to say where you've got to, each of which will overflow into the next sub-block with its own phase. So that's more state. But, you get something of a speed up, because the sieve flags are only 8 in 30 instead of 1 in 2 of the full set of integers.
That's nice, yes. Actually there's also the mod-6 option, where you don't store bits for multiples of 2 or 3. This way there are two candidates per 6 numbers, so it requires 10 bits per 30 numbers, pretty close to 8 but with less complication as the relationship is easier to express.
Quote:
Possibly you can make the overflow check easier too, if each sub-block is just 256 bytes. That's a guess though - it makes for quite a small sub-block. 8 of those would comprise 2k flags, representing 7500 integers, more or less, if my maths is right.
In theory the overhead of small blocks is mostly in the need to clear the sieve between blocks. The population phase seems costly but aside from overheads it is in theory the same cost whether you do it many times for smaller blocks or fewer times with larger blocks.
In the assembler version I added yesterday, I initially used 4 pages for the main sieve, then increased it to 8, then 16, then 64. I think increasing it to 8 saved maybe 10%-20% of execution time, but beyond that the savings were more modest. So 8x 256-byte blocks is probably OK.
Quote:
Edit: a thought, that inner loop, which adds P, checks for overflow, sets a flag byte... perhaps that could be dynamically constructed code, modified each time you have a new P? Perhaps in zero page, so the flag byte setting is an absolute address updated with a zero page write?
I hadn't thought of that. I did wonder about unrolling or having special loops for each of the first few primes but didn't want to do that by hand - however you could generate the update loops for the lowest primes in zero page once and reuse them many times.