Conway's Life on 6502

Programming the 6502 microprocessor and its relatives in assembly and other languages.
User avatar
Arlet
Posts: 2353
Joined: 16 Nov 2010
Location: Gouda, The Netherlands
Contact:

Re: Conway's Life on 6502

Post by Arlet »

On Xilinx, you can configure a LUT as an SRL16 shift register. The cool trick is that you get 4 address lines that allow you to pick any of the 16 bits as output, but you also get a shift register that you can use to dynamically alter the table.

It's almost like a dynamically programmable LUT, but unfortunately for this application, you don't get access to the carry chain.

Edit: and of course, you also don't get to dynamically update the routing between various LUTs.
Attachments
SRL16.png
User avatar
MichaelM
Posts: 761
Joined: 23 Apr 2012
Location: Huntsville, AL

Re: Conway's Life on 6502

Post by MichaelM »

Alternatively, you could take a microprogrammed approach like the one that Dr Jefyll linked to on anycpu.org. You can use distributed RAM or Block RAM to implement microprogram control store in a writeable manner.
Michael A.
User avatar
BigEd
Posts: 11463
Joined: 11 Dec 2008
Location: England
Contact:

Re: Conway's Life on 6502

Post by BigEd »

Arlet wrote:
On Xilinx, you can configure a LUT as an SRL16 shift register
That's rather interesting - so we could reprogram a patch of the FPGA by serial shifting. We could presumably architect in some muxes which drive some adder or multiply resources if we wished.

For Life, we're only counting up to 3 or 4 (or 8 or 9, depending on how we do it) so a fast carry shouldn't be needed. But making this a general piece of reprogrammable fabric has an attraction.

(And of course, the serial shifting could be done automatically, taking inputs from byte-wide input registers, for easier programming by the host micro.)
MichaelM wrote:
Alternatively, you could take a microprogrammed approach
That's also interesting, although it will, I think, generally mean taking several cycles to run a microprogram to get something done. Sometimes this will be a good way to go.
User avatar
Arlet
Posts: 2353
Joined: 16 Nov 2010
Location: Gouda, The Netherlands
Contact:

Re: Conway's Life on 6502

Post by Arlet »

You could also implement a 'fast forward' algorithm where you look at bigger squares and do several generations in one step.
User avatar
Arlet
Posts: 2353
Joined: 16 Nov 2010
Location: Gouda, The Netherlands
Contact:

Re: Conway's Life on 6502

Post by Arlet »

BigEd wrote:
That's also interesting, although it will, I think, generally mean taking several cycles to run a microprogram to get something done. Sometimes this will be a good way to go.
If you have a 6502 pushing and pulling the data, you do have several cycles available.
User avatar
BigEd
Posts: 11463
Joined: 11 Dec 2008
Location: England
Contact:

Re: Conway's Life on 6502

Post by BigEd »

That's true. Much depends on what you're trying to compute - the balance of computation and communication, and the amount of intermediate state. I can imagine a crypto module which has to compute several rounds is well-suited to a microsequenced implementation running several iterations of a program, for example.
User avatar
Arlet
Posts: 2353
Joined: 16 Nov 2010
Location: Gouda, The Netherlands
Contact:

Re: Conway's Life on 6502

Post by Arlet »

BigEd wrote:
Ah, of course - 24 bits in, 8 bits out, just a big mass of logic in between. But it's all quite simple stuff, so single clock cycle for sure.
Edit: even quicker if it's 32 bits in, 8 bits out!
Even quicker if it's 8 bits in, 8 bits out where you use a bit of FPGA memory to remember the surrounding area.
User avatar
BigEd
Posts: 11463
Joined: 11 Dec 2008
Location: England
Contact:

Re: Conway's Life on 6502

Post by BigEd »

Hmm, I'm not sure how to do that - the surrounding area is also being updated. It's true that in the present algorithm we tend to shift the window to the right, so the current generation's context usually inherits half its values from the previous generation. But sometimes we skip from one patch to a non-adjacent patch... maybe that's the same thing, as we shifted in zeros last time.

So, maybe we need to update 16 bits of context, and we inherit 16 bits from the previous iteration - and indeed, that's quite a saving.

(The above numbers are from Life42 - other approaches will of course differ.)
User avatar
Arlet
Posts: 2353
Joined: 16 Nov 2010
Location: Gouda, The Netherlands
Contact:

Re: Conway's Life on 6502

Post by Arlet »

Yes, to calculate N^2 block of new generation you need to send over (N+2)^2 of old generation, but as N increases the ratio will get closer to 1.

Naturally, sparse representations of the field will have more overhead.
User avatar
BigEd
Posts: 11463
Joined: 11 Dec 2008
Location: England
Contact:

Re: Conway's Life on 6502

Post by BigEd »

The problem with larger tiles is that they tend to bring in more empty space - 4x2 is working pretty well, and 4x4 isn't too bad but we can't yet tell if it's better or worse overall.

The nice thing about what we've just noticed is that it breaks down into
  • 16 bits of state inside the engine
  • 16 bit of input
  • 8 bits of output
  • A big (but shallow) cloud of logic to make the 8 bits from the 32 inputs
  • A trivial transformation to update the 16 bits of state from the 32 inputs
To figure out what kind of architecture would be useful to implement the big shallow cloud, without losing too much generality, I'd have to think a bit. Possibly Dave's dataflow diagram of Life88 will help.

We know that fundamentally we'll need 9 bits of input to go to each bit of output, for the case of Life. We don't need a fully general function of 9 bits (fortunately) so we implement one or two computation layers in between input and output. If we accumulate a four-bit sum for each output bit, that's 8 four-bit totals as intermediate results.

But is population count of 8 or 9 inputs very much logic? I don't know.
User avatar
Arlet
Posts: 2353
Joined: 16 Nov 2010
Location: Gouda, The Netherlands
Contact:

Re: Conway's Life on 6502

Post by Arlet »

Just a quick and dirty 9 input saturating summation takes 11 LUTs on a Spartan-3 or Spartan-6.
User avatar
BigEd
Posts: 11463
Joined: 11 Dec 2008
Location: England
Contact:

Re: Conway's Life on 6502

Post by BigEd »

Thanks - in fact for Life there are two cases, so it's more of an 8 input count with the 9th input helping to reach the 1 bit conclusion. But clearly this gives an idea of the total amount of stuff in this shallow cloud.

Probably, by building a cloud which can implement a 4x2 life, we'll have something which is fairly local and has a bit of a 2-d flavour to it.
User avatar
Arlet
Posts: 2353
Joined: 16 Nov 2010
Location: Gouda, The Netherlands
Contact:

Re: Conway's Life on 6502

Post by Arlet »

If I did it right, the total logic for 9 inputs, 1 live/dead output is also 11 LUTs.

Code: Select all

module life(
	 input self,		// center cell
    input [7:0] n,	// 8 neighbour cells
    output alive		// output
    );

// 3 partial sums, all fit in 2 bits
wire [1:0] sum12  = n[0] + n[1];
wire [1:0] sum345 = n[2] + n[3] + n[4];
wire [1:0] sum678 = n[5] + n[6] + n[7];

// saturated add of partial sums using 3 bits
wire [2:0] sum12345 = sum12 + sum345;
wire [2:0] sum = sum12345[1:0] + sum678;

// dead when 4 or more neighbours.
wire dead4 = (sum12345 >=4 ) | (sum >= 4);

// lower 2 bits of sum for 0..3 neighbours
wire [1:0] sum3 = sum[1:0];

assign alive = self ? (~dead4 & (sum3 >= 2) ) 
						  : (~dead4 & (sum3 == 3) );

endmodule
Actually the 11 LUTs was for Spartan-3. It's only 6 LUTs on Spartan6.
User avatar
BigEd
Posts: 11463
Joined: 11 Dec 2008
Location: England
Contact:

Re: Conway's Life on 6502

Post by BigEd »

Nicely done! The next step, I suppose, is to extract the connectivity of the 11 LUTs and make a general 'cloudlet' module which includes that connectivity and preferably extends it in a generally useful way by filling out unused connections and adding some small number of LUTs as appropriate. (It would be nice if it projected onto a two or three layer PLA kind of structure, then we know we're quite general.)

Then, 8 or more of these cloudlets make up the logic cloud.

I'm thinking this looks a bit like an image processing kernel. As a quite different possible application, would something of this shape be useful for ECC decoding or similar?
User avatar
Arlet
Posts: 2353
Joined: 16 Nov 2010
Location: Gouda, The Netherlands
Contact:

Re: Conway's Life on 6502

Post by Arlet »

Adding 3 cells at a time is probably optimal because it can be done in a single LUT using the carry input and the carry output. I tried adding 2 groups of 4, and it took more LUTs.
Post Reply