Conway's Life on 6502

Arlet · Post by **Arlet** » Wed Nov 23, 2016 5:24 pm

On Xilinx, you can configure a LUT as an SRL16 shift register. The cool trick is that you get 4 address lines that allow you to pick any of the 16 bits as output, but you also get a shift register that you can use to dynamically alter the table.

It's almost like a dynamically programmable LUT, but unfortunately for this application, you don't get access to the carry chain.

Edit: and of course, you also don't get to dynamically update the routing between various LUTs.

MichaelM · Post by **MichaelM** » Wed Nov 23, 2016 5:30 pm

Alternatively, you could take a microprogrammed approach like the one that Dr Jefyll linked to on anycpu.org. You can use distributed RAM or Block RAM to implement microprogram control store in a writeable manner.

BigEd · Post by **BigEd** » Wed Nov 23, 2016 5:35 pm

Arlet wrote:

On Xilinx, you can configure a LUT as an SRL16 shift register

That's rather interesting - so we could reprogram a patch of the FPGA by serial shifting. We could presumably architect in some muxes which drive some adder or multiply resources if we wished.

For Life, we're only counting up to 3 or 4 (or 8 or 9, depending on how we do it) so a fast carry shouldn't be needed. But making this a general piece of reprogrammable fabric has an attraction.

(And of course, the serial shifting could be done automatically, taking inputs from byte-wide input registers, for easier programming by the host micro.)

MichaelM wrote:

Alternatively, you could take a microprogrammed approach

That's also interesting, although it will, I think, generally mean taking several cycles to run a microprogram to get something done. Sometimes this will be a good way to go.

Arlet · Post by **Arlet** » Wed Nov 23, 2016 5:37 pm

You could also implement a 'fast forward' algorithm where you look at bigger squares and do several generations in one step.

Arlet · Post by **Arlet** » Wed Nov 23, 2016 5:39 pm

BigEd wrote:

That's also interesting, although it will, I think, generally mean taking several cycles to run a microprogram to get something done. Sometimes this will be a good way to go.

If you have a 6502 pushing and pulling the data, you do have several cycles available.

BigEd · Post by **BigEd** » Wed Nov 23, 2016 5:44 pm

That's true. Much depends on what you're trying to compute - the balance of computation and communication, and the amount of intermediate state. I can imagine a crypto module which has to compute several rounds is well-suited to a microsequenced implementation running several iterations of a program, for example.

Arlet · Post by **Arlet** » Wed Nov 23, 2016 6:56 pm

BigEd wrote:

Ah, of course - 24 bits in, 8 bits out, just a big mass of logic in between. But it's all quite simple stuff, so single clock cycle for sure.
Edit: even quicker if it's 32 bits in, 8 bits out!

Even quicker if it's 8 bits in, 8 bits out where you use a bit of FPGA memory to remember the surrounding area.

BigEd · Post by **BigEd** » Wed Nov 23, 2016 7:07 pm

Hmm, I'm not sure how to do that - the surrounding area is also being updated. It's true that in the present algorithm we tend to shift the window to the right, so the current generation's context usually inherits half its values from the previous generation. But sometimes we skip from one patch to a non-adjacent patch... maybe that's the same thing, as we shifted in zeros last time.

So, maybe we need to update 16 bits of context, and we inherit 16 bits from the previous iteration - and indeed, that's quite a saving.

(The above numbers are from Life42 - other approaches will of course differ.)

Arlet · Post by **Arlet** » Wed Nov 23, 2016 7:46 pm

Yes, to calculate N^2 block of new generation you need to send over (N+2)^2 of old generation, but as N increases the ratio will get closer to 1.

Naturally, sparse representations of the field will have more overhead.

BigEd · Post by **BigEd** » Wed Nov 23, 2016 7:56 pm

The problem with larger tiles is that they tend to bring in more empty space - 4x2 is working pretty well, and 4x4 isn't too bad but we can't yet tell if it's better or worse overall.

The nice thing about what we've just noticed is that it breaks down into

16 bits of state inside the engine
16 bit of input
8 bits of output
A big (but shallow) cloud of logic to make the 8 bits from the 32 inputs
A trivial transformation to update the 16 bits of state from the 32 inputs

To figure out what kind of architecture would be useful to implement the big shallow cloud, without losing too much generality, I'd have to think a bit. Possibly Dave's dataflow diagram of Life88 will help.

We know that fundamentally we'll need 9 bits of input to go to each bit of output, for the case of Life. We don't need a fully general function of 9 bits (fortunately) so we implement one or two computation layers in between input and output. If we accumulate a four-bit sum for each output bit, that's 8 four-bit totals as intermediate results.

But is population count of 8 or 9 inputs very much logic? I don't know.

Arlet · Post by **Arlet** » Wed Nov 23, 2016 8:15 pm

Just a quick and dirty 9 input saturating summation takes 11 LUTs on a Spartan-3 or Spartan-6.

BigEd · Post by **BigEd** » Wed Nov 23, 2016 8:29 pm

Thanks - in fact for Life there are two cases, so it's more of an 8 input count with the 9th input helping to reach the 1 bit conclusion. But clearly this gives an idea of the total amount of stuff in this shallow cloud.

Probably, by building a cloud which can implement a 4x2 life, we'll have something which is fairly local and has a bit of a 2-d flavour to it.

Arlet · Post by **Arlet** » Wed Nov 23, 2016 8:49 pm

If I did it right, the total logic for 9 inputs, 1 live/dead output is also 11 LUTs.

Code: Select all

module life(
	 input self,		// center cell
    input [7:0] n,	// 8 neighbour cells
    output alive		// output
    );

// 3 partial sums, all fit in 2 bits
wire [1:0] sum12  = n[0] + n[1];
wire [1:0] sum345 = n[2] + n[3] + n[4];
wire [1:0] sum678 = n[5] + n[6] + n[7];

// saturated add of partial sums using 3 bits
wire [2:0] sum12345 = sum12 + sum345;
wire [2:0] sum = sum12345[1:0] + sum678;

// dead when 4 or more neighbours.
wire dead4 = (sum12345 >=4 ) | (sum >= 4);

// lower 2 bits of sum for 0..3 neighbours
wire [1:0] sum3 = sum[1:0];

assign alive = self ? (~dead4 & (sum3 >= 2) ) 
						  : (~dead4 & (sum3 == 3) );

endmodule

Actually the 11 LUTs was for Spartan-3. It's only 6 LUTs on Spartan6.

BigEd · Post by **BigEd** » Wed Nov 23, 2016 9:03 pm

Nicely done! The next step, I suppose, is to extract the connectivity of the 11 LUTs and make a general 'cloudlet' module which includes that connectivity and preferably extends it in a generally useful way by filling out unused connections and adding some small number of LUTs as appropriate. (It would be nice if it projected onto a two or three layer PLA kind of structure, then we know we're quite general.)

Then, 8 or more of these cloudlets make up the logic cloud.

I'm thinking this looks a bit like an image processing kernel. As a quite different possible application, would something of this shape be useful for ECC decoding or similar?

Arlet · Post by **Arlet** » Wed Nov 23, 2016 9:12 pm

Adding 3 cells at a time is probably optimal because it can be done in a single LUT using the carry input and the carry output. I tried adding 2 groups of 4, and it took more LUTs.

Conway's Life on 6502

Re: Conway's Life on 6502

Re: Conway's Life on 6502

Re: Conway's Life on 6502

Re: Conway's Life on 6502

Re: Conway's Life on 6502

Re: Conway's Life on 6502

Re: Conway's Life on 6502

Re: Conway's Life on 6502

Re: Conway's Life on 6502

Re: Conway's Life on 6502

Re: Conway's Life on 6502

Re: Conway's Life on 6502

Re: Conway's Life on 6502

Re: Conway's Life on 6502

Re: Conway's Life on 6502