 Post subject: Conway's Life on 6502
PostPosted: Fri Nov 04, 2016 5:27 pm 
Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10986
Location: England
Recently we've had an interesting thread going, over on Stardot, about implementations of Conway's Game of Life on Acorn machines. But of course other 6502 machines have also seen implementations, so Dave (hoglet) and I thought it would be good to discuss here. There are questions of how to display the board, which are platform specific, and then there are questions of how to compute the next generation, which are algorithmic and where 6502 coding skill can really make a difference.

We (André and I) posted about this previously.

In addition to commercial and type-in offerings, three implementations have jumped out in the recent discussion:
- a version for the Atom was ported by Dave to the Beeb's second processor, and then converted to use Tony Finch's List Life algorithm, which had to be ported from C to 6502. This turns out to be nice and fast and offers a large 32k by 32k universe.
- Litwr brought in his Xlife-8 implementations, which use some of the tactics from the Unix program Xlife, again ported from C to 6502. This ends up having a smaller universe, but is the fastest we've seen, especially on small patterns.
- A friend of mine made available his 6502 code from 1988, which uses a different list-based approach, turning out to be faster than Dave's version but with a universe of 1k by 1k. There are ways to improve this, which Dave is presently pursuing. Indeed, Dave has a great diagram of the innards of the present code - see attached.

Here's a screen from Xlife-8:
[screenshot]

Here's a screen from Life88:
[screenshot]

Here's a video of some patterns running on these programs, and here's Xlife-8 running on a C64.

What we note, then, is
- the fastest evaluation is on a finite grid
- a list structure allows for a much larger universe
- a combination of a list with a bitmap gives the best result
- modern life programs use huge datastructures, huge tables, and 32-bit or 64-bit arithmetic, none of which are much help for a 6502.
- it's interesting to find a good way to partition computation and display, on a two-processor Acorn BBC.

I expect there's a lot more to be said, about algorithms used and ideas for improving them!
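For readers joining fresh, the rule all of these programs compute (B3/S23) is tiny when stated naively. Here's a set-based prototype in Python - my sketch, nothing like the optimised 6502 code - which is handy as a baseline and as an oracle for checking cleverer kernels:

```python
from collections import Counter

def step(cells):
    """One Life generation (B3/S23). cells: set of (x, y) live coordinates."""
    # Count how many live neighbours each coordinate has.
    counts = Counter((x + dx, y + dy)
                     for (x, y) in cells
                     for dx in (-1, 0, 1) for dy in (-1, 0, 1)
                     if (dx, dy) != (0, 0))
    # Born on exactly 3 neighbours; survive on 2 or 3.
    return {c for c, n in counts.items()
            if n == 3 or (n == 2 and c in cells)}
```

Because it only touches live cells and their neighbours, this is essentially a (very slow) relative of the list-based approaches discussed above.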


Attachments:
File comment: Data flow in Life88, drawn by DaveB
Life88-diagram-DaveB.png


Last edited by BigEd on Sun Apr 12, 2020 5:03 pm, edited 1 time in total.
PostPosted: Fri Nov 04, 2016 6:05 pm 
Joined: Tue Nov 16, 2010 8:00 am
Posts: 2353
Location: Gouda, The Netherlands
I've always wondered about writing out the logic for implementing a bunch of full adders for determining the sum of neighbors, and then performing this logic bytewise parallel on 8 cells at a time. I don't think this would beat Dave's flow chart though. Maybe on a 32 bit system.
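For concreteness, here's a Python sketch of that bit-parallel full-adder idea, operating on 8 cells packed in a byte. The names (L/M/H for bits 0-2 of the per-cell neighbour count) and the shift-based alignment are my own choices; a real 6502 version would marshal the shifted bytes differently, and the two edge cells of each byte need data from adjacent bytes, which this sketch simply ignores:

```python
def accumulate(n, L, M, H):
    """Full-adder step: add neighbour byte n into the per-cell 3-bit
    counters, bitwise parallel across 8 cells (H saturates at >= 4)."""
    carry = L & n
    L ^= n
    H |= M & carry
    M ^= carry
    return L, M, H

def next_byte(above, row, below, old):
    """Next state of 8 cells packed in `row`, given the rows above and
    below. Edge bits (0 and 7) are wrong, as their neighbours live in
    adjacent bytes."""
    L = M = H = 0
    for src in (above, below):                           # 3 cells per row
        L, M, H = accumulate(src, L, M, H)               # vertical
        L, M, H = accumulate((src >> 1) & 0xFF, L, M, H) # diagonal
        L, M, H = accumulate((src << 1) & 0xFF, L, M, H) # diagonal
    L, M, H = accumulate((row >> 1) & 0xFF, L, M, H)     # left
    L, M, H = accumulate((row << 1) & 0xFF, L, M, H)     # right
    # Alive next iff count == 3, or count == 2 and already alive.
    return M & ~H & (L | old) & 0xFF
```

Eight `accumulate` calls per byte of cells, so the per-cell cost is one full-adder network per neighbour, which is where a 32-bit machine would win by doing 32 cells per pass.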


PostPosted: Sat Nov 05, 2016 10:01 am 
Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10986
Location: England
A couple of things stand out for me, having looked a bit more at the Life88 diagram:
- the extreme benefit of using lookup tables to transform a byte into a byte (I think there are 7 page-sized lookup tables in the code)
- the clever avoidance of shifts, in marshalling the partial bytes into the appropriate collections, to deal with the overlapping windows

I think perhaps the trick with a fast Life - arguably, with any fast algorithm - is to do as little computation as possible. Because Life uses overlapping 3x3 windows to calculate each cell, a naive approach will be doing the same work several times.

Xlife-8 uses 8x8 tiles - figuring out what happens in the interior of a tile is probably quite efficient, but there are 8 neighbouring tiles, the edges and corners of which have to be brought into play. It still wins on speed, but only manages a relatively small universe. It's possible it could be enhanced.


PostPosted: Sat Nov 05, 2016 4:59 pm 
Joined: Sun Jun 29, 2014 5:42 am
Posts: 352
Hi Ed,
BigEd wrote:
A couple of things stand out for me, having looked a bit more at the Life88 diagram
- the extreme benefit of using lookup tables to transform a byte into a byte (I think there are 7 page-sized lookup tables in the code)
- the clever avoidance of shifts, in marshalling the partial bytes into the appropriate collections, to deal with the overlapping windows

I've got to the point now where I can try the three different algorithms with arbitrary RLE patterns:
- Atom Life
- List Life
- List Life with 8-bit chunks using the Kernel of Life88 for computation

What's rather surprising is that the original list_life is actually a bit faster than the chunked version.

Running the "rabbits" pattern to 17400 generations:
- List Life takes 71s (245 generations/sec)
- List Life with 8-cell chunks takes 82s (212 generations/sec)

(On a 225MHz 6502 Co Pro, with rendering turned down to once every 32 frames to focus on just the life algorithms)

This is something of a surprise, as my cycle counting led me to believe the Life88 kernel would be about 3x faster.

As an estimate of the kernel complexity, I had simply counted 6502 cycles:
- List Life is 112.5 cycles (per cell)
- List Life with 8-cell chunks is 266 cycles (per 8 cells)

When I compared these a few days ago, I had rather simplistically just multiplied the List Life figure by 8 to make them comparable. Doing this suggests moving to 8 cell chunks should be 3.4x faster.

Unfortunately, it isn't.

To better understand this, I've added some instrumentation to the C version of each of these algorithms. This counts the total number of times the kernel of the algorithm is executed.

Running the "rabbits" pattern to 17400 generations:
- List Life executes the kernel 109,115,829 times
- List Life with 8-bit chunks executes the kernel 49,264,473 times

Translating everything to kernel operations per generation:
- List Life: 6,271 ops per generation
- List Life with 8-bit chunks: 2,831 ops per generation

For me the real surprise here is the ratio of the number of times the kernel is executed between the two algorithms. I had expected 8:1 and it's more like 2:1.

The average number of cells per generation is 1,246.

Using these numbers, the total number of 6502 cycles should be:
- List Life: 112.5 * 109,115,829 = 12,275,530,762
- List Life with 8-bit chunks: 266 * 49,264,473 = 13,104,349,818

Dividing by a 225MHz 6502 clock speed gives:
- List Life: 54.6s (actual was 71s)
- List Life with 8-bit chunks: 58.2s (actual was 82s)

The difference compared to the actual times is because there is obviously more than just the kernel being executed.

There is still a mystery here, which is why the original Life88 implementation (within a 1Kx1K universe) is so much faster than List Life on comparable hardware.

It's possible we are still missing an important optimization.

Dave


PostPosted: Tue Nov 08, 2016 9:46 am 
Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10986
Location: England
In the stardot thread, we're directed to read the Black Book (chapters 17 and 18), where a couple of chapters deal with Life in x86 land. Now, they have 32 bit operations available. But even so, one idea there for speed, if not for memory efficiency, is to hold data about a cell, the row (or column) count around it, and the previous state - all of which can fit into one nibble. So in our case, we get two cells' worth of rich data in a byte, and can do a very fast lookup to a byte function of that data, or indeed to a number of byte functions.

Having only two cells' worth of data in a byte seems like poor packing, but if the world is sparse then a collection of 8 cells will be mostly zeros, so it might not be so bad.
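To make the nibble idea concrete, here's a sketch of the kind of 256-entry byte-to-byte table such a packing might drive. The bit layout (cell state in bit 3, neighbour count in bits 0-2) is my assumption, not the Black Book's exact scheme, and a 3-bit count can't represent the rare all-eight-neighbours case:

```python
def build_table():
    """256-entry table: a byte packing two (state, count) nibbles maps
    to a 2-bit result - the next state of both cells in one lookup."""
    table = bytearray(256)
    for b in range(256):
        out = 0
        for shift in (0, 4):
            nib = (b >> shift) & 0x0F
            state, count = nib >> 3, nib & 7  # bit 3: alive, bits 0-2: count
            if count == 3 or (state and count == 2):
                out |= 1 << (shift // 4)
        table[b] = out
    return table
```

A nice property of this packing is that bumping the neighbour count of both cells in a byte is a single add of $11 (as long as neither nibble overflows), which suits the 6502's byte-wide ALU.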

I'd also like to mention the idea of a two-dimensional payload in a list life. That might be two bit-packed bytes covering 2 rows of 8 cells, or it might be two nibble-packed bytes as above, covering 2 rows of 2 cells. There are a couple of reasons why this might be an advantage:
- we only have to traverse two lists, not three, to get our three rows of data. One of our vertical neighbours is immediately accessible in the same payload
- active areas are likely to span more than one row, so the second row in the payload is moderately likely to contain something interesting, and therefore not be wasted.

Arlet, you mention addition by bit-planes - I think Dave has also thought about that.

(Edit to add: Isaac Kuo's JavaScript Life also uses word operations on a nibble packing, which might possibly translate to byte-sized operations. View the source of this page, particularly the opening block comment.)


PostPosted: Tue Nov 08, 2016 10:11 am 
Joined: Sun Jun 29, 2014 5:42 am
Posts: 352
Hi Guys,
Arlet wrote:
I've always wondered about writing out the logic for implementing a bunch of full adders for determining the sum of neighbors, and then performing this logic bytewise parallel on 8 cells at a time. I don't think this would beat Dave's flow chart though. Maybe on a 32 bit system.

There is a bit more on this technique here:
https://www.cs.uaf.edu/2012/spring/cs64 ... _SIMD.html

I did try working out the instructions for doing this kind of bit-plane addition (i.e. 8 cells at a time) on a 65C02.

This uses three zero page locations to hold bits 0, 1 and 2 of the total.
- Lbits = bit 0 of the neighbours counts for each of eight cells
- Mbits = bit 1 of the neighbours counts for each of eight cells
- Hbits = bit 2 of the neighbours counts for each of eight cells

These are first initialized to zero:
Code:
MACRO init
        STZ LBits       3
        STZ MBits       3
        STZ HBits       3
ENDMACRO

The best code I could come up with for the accumulate N function was:
Code:
MACRO accumulate
;; Update bit 0
        TAY             ; stash N                               2
        AND LBits       ; A = LBits & N = Lcarry                3
        TAX             ; stash A (Lcarry)                      2
        TYA             ; restore N                             2
        EOR LBits       ; A = LBits ^ N = Lsum                  3
        STA LBits       ;                                       3
;; Update bit 2
        TXA             ; restore A (Lcarry)                    2
        AND MBits       ; A = Mbits & Lcarry = Mcarry           3
        ORA HBits       ; A = HBits | Mcarry = Hsum             3
        STA HBits                                               3
;; Update bit 1
        TXA             ; restore A (Lcarry)                    2
        EOR MBits       ; A = Mbits ^ Lcarry = Msum             3
        STA MBits                                               3       34
ENDMACRO

This uses simple saturated arithmetic for the MS bit, as suggested in the above link.


This then needs to be used eight times:
Code:
AAAA BBBB CCCC
DDDD EEEE FFFF
GGGG HHHH IIII

MACRO init              9       9

LDA BBBB                4
MACRO accumulate        34      38

LDA HHHH                4
MACRO  accumulate       34      38

LDA AAAA                4
ROR A                   2
LDA BBBB                4
ROR A                   2
MACRO accumulate        34      46

LDA DDDD                4
ROR  A                  2
LDA EEEE                4
ROR  A                  2
MACRO accumulate        34      46

LDA GGGG                4
ROR A                   2
LDA HHHH                4
ROR A                   2
MACRO accumulate        34      46

LDA CCCC                4
ROL A                   2
LDA BBBB                4
ROL A                   2
MACRO accumulate        34      46

LDA FFFF                4
ROL A                   2
LDA EEEE                4
ROL A                   2
MACRO accumulate        34      46

LDA IIII                4
ROL A                   2
LDA HHHH                4
ROL A                   2
MACRO accumulate        34      46


Finally, to work out the new cells, you need to:
Code:
LDA Lbits               3
ORA old                 3
STA temp                3
LDA Hbits               3
EOR #&FF                2
AND Mbits               3
AND temp                3
STA new                 3       23      384
RTS

It turns out 384 cycles is somewhat more than the 266 cycles for the lookup table based approach posted earlier.

But it may be there is a more efficient way to code this.
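As a quick sanity check that the saturating counter plus the final expression new = (Lbits | old) & ~Hbits & Mbits really implements B3/S23 for every possible neighbour count, here's the same logic transliterated to Python, one bit position wide:

```python
def count_bits(n_neighbours):
    """The accumulate macro for a single bit position: L, M, H are bits
    0-2 of the neighbour count, with H saturating once the count hits 4."""
    L = M = H = 0
    for _ in range(n_neighbours):  # each live neighbour contributes 1
        carry = L & 1              # AND LBits -> Lcarry
        L ^= 1                     # EOR LBits -> Lsum
        H |= M & carry             # ORA HBits (saturated MS bit)
        M ^= carry                 # EOR MBits -> Msum
    return L, M, H

def next_state(old, n):
    """Final step: alive next iff count == 3, or count == 2 and alive."""
    L, M, H = count_bits(n)
    return (L | old) & ~H & M & 1

# Exhaustive check against the Life rule, counts 0..8, both cell states.
for n in range(9):
    for old in (0, 1):
        assert next_state(old, n) == (1 if n == 3 or (n == 2 and old) else 0)
```

The saturation matters: counts of 4 to 8 all leave H set, so ~H kills the cell regardless of what wraps into L and M.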

Dave


PostPosted: Tue Nov 08, 2016 10:38 am 
Joined: Sun Jun 29, 2014 5:42 am
Posts: 352
So far, I think this was the most revealing finding...

Translating everything to kernel operations per generation: [ these numbers are Rabbits to 17400 generations ]
- List Life with 1-cell granularity: 6,271 ops per generation
- List Life with 8-cell granularity: 2,831 ops per generation

By shifting the computation to 8 cells at a time, you are not reducing the workload by a factor of 8. In fact, far from it, because as the granularity increases you end up picking up far more empty space, which at 1-cell granularity would not be included in the calculation at all.

Making this more concrete, the approaches I have considered so far are:
(A) - my own 6502 translation of Tony Finch's list life - 1 cell at a time - 110 cycles (approx)
(B) - life88 - from Ed's friend - 8 cells at a time - 266 cycles
(C) - bit plane addition - described above - 8 cells at a time - 384 cycles

Doing the sums, you get:
(A) - 6,271 x 110 = 689,810 cycles / generation
(B) - 2,831 x 266 = 753,046 cycles / generation
(C) - 2,831 x 384 = 1,087,104 cycles / generation

So actually the 1-cell at a time approach ought to be faster, which is what I'm observing in the real system.

[ The mystery of why the original Life88 implementation was faster in some circumstances turned out to be down to the approach taken to updating the display, which did less work on the Co Processor, and more on the host ]

Dave


PostPosted: Tue Nov 08, 2016 10:40 am 
Joined: Tue Nov 16, 2010 8:00 am
Posts: 2353
Location: Gouda, The Netherlands
384 cycles is pretty decent. That's actually better than I would have expected on the 6502. May be a viable method on an ARM with plenty of registers and 32 bit wide operations.


PostPosted: Tue Nov 08, 2016 11:23 am 
Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10986
Location: England
Thanks for the numbers Dave! I suppose in principle we shouldn't be surprised to see a time/space tradeoff - the denser data structure is not the fastest. But then I think it becomes clear on 32-bit machines that wider is better - if not packed bitwise then perhaps packed nibblewise.

But capacity is also an interesting metric! It was very nice to see the Breeder running, and with Acorn's co-processing architecture we get almost 60k of RAM space for our program and data.


PostPosted: Tue Nov 08, 2016 12:34 pm 
Joined: Tue Nov 16, 2010 8:00 am
Posts: 2353
Location: Gouda, The Netherlands
On the FPGA-based 6502 boards, you could make a Game of Life accelerator ...


PostPosted: Tue Nov 22, 2016 5:07 pm 
Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10986
Location: England
Dave and I made a bit more progress, mostly Dave of course, and Life on the 6502 is now 50% faster!

We've called the new approach Life42, as it uses one byte as a 4x2 array of cells. The speed advantage comes from a couple of sources, I think:
- by making the next generation offset by (1,1) we only need to consider a 2x2 array of our 4x2 tiles, instead of a 3x3 array
- by using lookup tables we can compute the next generation in one go

To compute the future of a 4x2 patch you need to consider a 6x4 area, which means a 16MByte lookup table. That's a lot, even with banked memory, although it made our C prototype fairly straightforward to code. If instead we consider the two 2x2 halves of the patch, each of which has a 4x4 ancestor, we need to do two lookups in a 64k nibble (or 32kByte) table - large, but more manageable. We had a look at that, and it looks like it should be very fast, but we actually went one step further and now take four lookups in a 4kByte table instead - it's not quite as fast, but is much more compact.

We now take well under 200 cycles to compute the next generation of a 4x2 patch, but there's overhead in traversing the data structure, so the average performance overall is 354 cycles per 4x2 operation. We could probably knock 50 cycles off that by moving from the 4k to the 32k table.
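The 64k-nibble version of the table is easy to prototype. Here's my sketch of one entry: the 16-bit index is a 4x4 neighbourhood, and the nibble result is the next state of its centre 2x2 patch (the bit layout and the choice of centre cells are my assumptions, not necessarily what the Life42 code does):

```python
def life42_entry(bits16):
    """Map a 4x4 neighbourhood (bit y*4+x of bits16 is cell (x, y)) to
    a nibble giving the next generation of the centre 2x2 cells."""
    cell = lambda x, y: (bits16 >> (y * 4 + x)) & 1
    out = 0
    for i, (x, y) in enumerate([(1, 1), (2, 1), (1, 2), (2, 2)]):
        n = sum(cell(x + dx, y + dy)
                for dx in (-1, 0, 1) for dy in (-1, 0, 1)
                if (dx, dy) != (0, 0))
        if n == 3 or (n == 2 and cell(x, y)):
            out |= 1 << i
    return out
```

Tabulating this for all 65536 indices gives 32kBytes of packed nibbles, matching the table size quoted above, and the (1,1) offset of the result relative to the index is what lets the full algorithm get away with a 2x2 window of source tiles.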

The latest code is always at
https://github.com/hoglet67/6502Life
and you'll find versions built for the beeb at
http://stardot.org.uk/forums/viewtopic. ... 21#p153721
(Extract the SSD file and point JSBeeb at the result.)


PostPosted: Wed Nov 23, 2016 4:40 pm 
Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10986
Location: England
Arlet wrote:
On the FPGA-based 6502 boards, you could make a Game of Life accelerator ...

And this, of course, is an interesting idea! I do like the possibility of point accelerations. Sometimes, like a CRC or a multiply instruction, it's easy to see what to do and what the benefit is. I'm not quite sure what a Life accelerator would look like - any idea?


PostPosted: Wed Nov 23, 2016 5:02 pm 
Joined: Tue Nov 16, 2010 8:00 am
Posts: 2353
Location: Gouda, The Netherlands
A simple idea would be to replace your lookup tables with a logic expression implemented in the FPGA, so you could do the same thing you're doing now, but without having to deal with huge tables.


PostPosted: Wed Nov 23, 2016 5:09 pm 
Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10986
Location: England
Ah, of course - 24 bits in, 8 bits out, just a big mass of logic in between. But it's all quite simple stuff, so single clock cycle for sure.
Edit: even quicker if it's 32 bits in, 8 bits out!


PostPosted: Wed Nov 23, 2016 5:12 pm 
Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10986
Location: England
Actually I'm almost tempted to consider a writeable control store approach, using distributed RAM to control a function generator which maps 32->8 bits and has enough generality to compute a Life algorithm. Like having an FPGA in your FPGA.

