Hi Ed,
BigEd wrote:
A couple of things stand out for me, having looked a bit more at the Life88 diagram
- the extreme benefit of using lookup tables to transform a byte into a byte (I think there are 7 page-sized lookup tables in the code)
- the clever avoidance of shifts, in marshalling the partial bytes into the appropriate collections, to deal with the overlapping windows
I've got to the point now where I can try the three different algorithms with arbitrary RLE patterns:
- Atom Life
- List Life
- List Life with 8-bit chunks using the Kernel of Life88 for computation
What's rather surprising is that the original list_life is actually a bit faster than the chunked version.
Running the "rabbits" pattern to 17400 generation:
- List Life takes 71s (245 generations/sec)
- List Life with 8-cell chunks takes 82s (212 generations/sec)
(On a 225MHz 6502 Co Pro, with rendering turned down to once every 32 frames to focus on just the life algorithms)
This is somewhat of a surprise, as my cycle counting lead me to believe the Life88 kernel would be about 3x faster.
As an estimate of the kernel complexity, I had simply counted 6502 cycles:
- List Life is 112.5 cycles (per cell)
- List Life with 8-cell chunks is 266 cycles (per 8 cells)
When I compared these a few days ago, I had rather simplistically just multiplied the List Life figure by 8 to make them comparable. Doing this suggests moving to 8 cell chunks should be 3.4x faster.
Unfortunately, it isn't.
To better understand this, I've added some instrumentation to the C version of each of these algorithms. This counts the total number of times the kernel of the algorithm is executed.
Running the "rabbits" pattern to 17400 generation:
- List Life executes the kernel 109,115,829 times
- List Life with 8-bit chunks executes the kernel 49,264,473 times
Translating everything to kernel operations per generation:
- List Life: 6,271 ops per generation
- List Life with 8-bit chunks: 2,831 per generation
For me the real surprise here is the ratio of the number of times the kernel is executed between the two algorithms. I had expected 8:1 and its more like 2:1.
The average number of cells per generation is 1,246.
Using these numbers, the total number of 6502 cycles should be:
- List Life: 112.5 * 109,115,829 = 12,273,530,762
- List Life with 8-bit chunks: 266 * 49,264,473 = 13,104,349,818
Dividing by a 225MHz 6502 clock speed gives:
- List Life: 112.5 * 109,115,829 = 54.6s (actual was 71s)
- List Life with 8-bit chunks: 266 * 49,259,965 = 58.2s (actual was 82s)
The difference compared to the actual is because there is obviously more than just the Kernel being executed.
There is still a mystery here, which is why is the original life88 implementation (within a 1Kx1K universe) so much than list life comparable hardware.
It's possible we are still missing an important optimization.
Dave