6502.org Forum  Projects  Code  Documents  Tools  Forum
It is currently Sun Nov 24, 2024 6:36 pm

All times are UTC




Post new topic Reply to topic  [ 50 posts ]  Go to page Previous  1, 2, 3, 4  Next
Author Message
PostPosted: Wed Nov 23, 2016 5:24 pm 
Offline
User avatar

Joined: Tue Nov 16, 2010 8:00 am
Posts: 2353
Location: Gouda, The Netherlands
On Xilinx, you can configure a LUT as an SRL16 shift register. The cool trick is that you get 4 address lines that allow you to pick any of the 16 bits as output, but you also get a shift register that you can use to dynamically alter the table.

It's almost like a dynamically programmable LUT, but unfortunately for this application, you don't get access to the carry chain.

Edit: and of course, you also don't get to dynamically update the routing between various LUTs.


Attachments:
SRL16.png
SRL16.png [ 30.63 KiB | Viewed 1929 times ]
Top
 Profile  
Reply with quote  
PostPosted: Wed Nov 23, 2016 5:30 pm 
Offline
User avatar

Joined: Mon Apr 23, 2012 12:28 am
Posts: 760
Location: Huntsville, AL
Alternatively, you could take a microprogrammed approach like the one that Dr Jefyll linked to on anycpu.org. You can use distributed RAM or Block RAM to implement microprogram control store in a writeable manner.

_________________
Michael A.


Top
 Profile  
Reply with quote  
PostPosted: Wed Nov 23, 2016 5:35 pm 
Offline
User avatar

Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10986
Location: England
Arlet wrote:
On Xilinx, you can configure a LUT as an SRL16 shift register


That's rather interesting - so we could reprogram a patch of the FPGA by serial shifting. We could presumably architect in some muxes which drive some adder or multiply resources if we wished.

For Life, we're only counting up to 3 or 4 (or 8 or 9, depending on how we do it) so a fast carry shouldn't be needed. But making this a general piece of reprogrammable fabric has an attraction.

(And of course, the serial shifting could be done automatically, taking inputs from byte-wide input registers, for easier programming by the host micro.)

MichaelM wrote:
Alternatively, you could take a microprogrammed approach

That's also interesting, although it will, I think, generally mean taking several cycles to run a microprogram to get something done. Sometimes this will be a good way to go.


Top
 Profile  
Reply with quote  
PostPosted: Wed Nov 23, 2016 5:37 pm 
Offline
User avatar

Joined: Tue Nov 16, 2010 8:00 am
Posts: 2353
Location: Gouda, The Netherlands
You could also implement a 'fast forward' algorithm where you look at bigger squares and do several generations in one step.


Top
 Profile  
Reply with quote  
PostPosted: Wed Nov 23, 2016 5:39 pm 
Offline
User avatar

Joined: Tue Nov 16, 2010 8:00 am
Posts: 2353
Location: Gouda, The Netherlands
BigEd wrote:
That's also interesting, although it will, I think, generally mean taking several cycles to run a microprogram to get something done. Sometimes this will be a good way to go.

If you have a 6502 pushing and pulling the data, you do have several cycles available.


Top
 Profile  
Reply with quote  
PostPosted: Wed Nov 23, 2016 5:44 pm 
Offline
User avatar

Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10986
Location: England
That's true. Much depends on what you're trying to compute - the balance of computation and communication, and the amount of intermediate state. I can imagine a crypto module which has to compute several rounds is well-suited to a microsequenced implementation running several iterations of a program, for example.


Top
 Profile  
Reply with quote  
PostPosted: Wed Nov 23, 2016 6:56 pm 
Offline
User avatar

Joined: Tue Nov 16, 2010 8:00 am
Posts: 2353
Location: Gouda, The Netherlands
BigEd wrote:
Ah, of course - 24 bits in, 8 bits out, just a big mass of logic in between. But it's all quite simple stuff, so single clock cycle for sure.
Edit: even quicker if it's 32 bits in, 8 bits out!

Even quicker if it's 8 bits in, 8 bits out where you use a bit of FPGA memory to remember the surrounding area.


Top
 Profile  
Reply with quote  
PostPosted: Wed Nov 23, 2016 7:07 pm 
Offline
User avatar

Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10986
Location: England
Hmm, I'm not sure how to do that - the surrounding area is also being updated. It's true that in the present algorithm we tend to shift the window to the right, so the current generation's context usually inherits half its values from the previous generation. But sometimes we skip from one patch to a non-adjacent patch... maybe that's the same thing, as we shifted in zeros last time.

So, maybe we need to update 16 bits of context, and we inherit 16 bits from the previous iteration - and indeed, that's quite a saving.

(The above numbers are from Life42 - other approaches will of course differ.)


Top
 Profile  
Reply with quote  
PostPosted: Wed Nov 23, 2016 7:46 pm 
Offline
User avatar

Joined: Tue Nov 16, 2010 8:00 am
Posts: 2353
Location: Gouda, The Netherlands
Yes, to calculate N^2 block of new generation you need to send over (N+2)^2 of old generation, but as N increases the ratio will get closer to 1.

Naturally, sparse representations of the field will have more overhead.


Top
 Profile  
Reply with quote  
PostPosted: Wed Nov 23, 2016 7:56 pm 
Offline
User avatar

Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10986
Location: England
The problem with larger tiles is that they tend to bring in more empty space - 4x2 is working pretty well, and 4x4 isn't too bad but we can't yet tell if it's better or worse overall.

The nice thing about what we've just noticed is that it breaks down into
  • 16 bits of state inside the engine
  • 16 bit of input
  • 8 bits of output
  • A big (but shallow) cloud of logic to make the 8 bits from the 32 inputs
  • A trivial transformation to update the 16 bits of state from the 32 inputs

To figure out what kind of architecture would be useful to implement the big shallow cloud, without losing too much generality, I'd have to think a bit. Possibly Dave's dataflow diagram of Life88 will help.

We know that fundamentally we'll need 9 bits of input to go to each bit of output, for the case of Life. We don't need a fully general function of 9 bits (fortunately) so we implement one or two computation layers in between input and output. If we accumulate a four-bit sum for each output bit, that's 8 four-bit totals as intermediate results.

But is population count of 8 or 9 inputs very much logic? I don't know.


Top
 Profile  
Reply with quote  
PostPosted: Wed Nov 23, 2016 8:15 pm 
Offline
User avatar

Joined: Tue Nov 16, 2010 8:00 am
Posts: 2353
Location: Gouda, The Netherlands
Just a quick and dirty 9 input saturating summation takes 11 LUTs on a Spartan-3 or Spartan-6.


Top
 Profile  
Reply with quote  
PostPosted: Wed Nov 23, 2016 8:29 pm 
Offline
User avatar

Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10986
Location: England
Thanks - in fact for Life there are two cases, so it's more of an 8 input count with the 9th input helping to reach the 1 bit conclusion. But clearly this gives an idea of the total amount of stuff in this shallow cloud.

Probably, by building a cloud which can implement a 4x2 life, we'll have something which is fairly local and has a bit of a 2-d flavour to it.


Top
 Profile  
Reply with quote  
PostPosted: Wed Nov 23, 2016 8:49 pm 
Offline
User avatar

Joined: Tue Nov 16, 2010 8:00 am
Posts: 2353
Location: Gouda, The Netherlands
If I did it right, the total logic for 9 inputs, 1 live/dead output is also 11 LUTs.
Code:
module life(
    input self,      // center cell
    input [7:0] n,   // 8 neighbour cells
    output alive      // output
    );

// 3 partial sums, all fit in 2 bits
wire [1:0] sum12  = n[0] + n[1];
wire [1:0] sum345 = n[2] + n[3] + n[4];
wire [1:0] sum678 = n[5] + n[6] + n[7];

// saturated add of partial sums using 3 bits
wire [2:0] sum12345 = sum12 + sum345;
wire [2:0] sum = sum12345[1:0] + sum678;

// dead when 4 or more neighbours.
wire dead4 = (sum12345 >=4 ) | (sum >= 4);

// lower 2 bits of sum for 0..3 neighbours
wire [1:0] sum3 = sum[1:0];

assign alive = self ? (~dead4 & (sum3 >= 2) )
                    : (~dead4 & (sum3 == 3) );

endmodule


Actually the 11 LUTs was for Spartan-3. It's only 6 LUTs on Spartan6.


Top
 Profile  
Reply with quote  
PostPosted: Wed Nov 23, 2016 9:03 pm 
Offline
User avatar

Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10986
Location: England
Nicely done! The next step, I suppose, is to extract the connectivity of the 11 LUTs and make a general 'cloudlet' module which includes that connectivity and preferably extends it in a generally useful way by filling out unused connections and adding some small number of LUTs as appropriate. (It would be nice if it projected onto a two or three layer PLA kind of structure, then we know we're quite general.)

Then, 8 or more of these cloudlets make up the logic cloud.

I'm thinking this looks a bit like an image processing kernel. As a quite different possible application, would something of this shape be useful for ECC decoding or similar?


Top
 Profile  
Reply with quote  
PostPosted: Wed Nov 23, 2016 9:12 pm 
Offline
User avatar

Joined: Tue Nov 16, 2010 8:00 am
Posts: 2353
Location: Gouda, The Netherlands
Adding 3 cells at a time is probably optimal because it can be done in a single LUT using the carry input and the carry output. I tried adding 2 groups of 4, and it took more LUTs.


Top
 Profile  
Reply with quote  
Display posts from previous:  Sort by  
Post new topic Reply to topic  [ 50 posts ]  Go to page Previous  1, 2, 3, 4  Next

All times are UTC


Who is online

Users browsing this forum: jds and 11 guests


You cannot post new topics in this forum
You cannot reply to topics in this forum
You cannot edit your posts in this forum
You cannot delete your posts in this forum
You cannot post attachments in this forum

Search for:
Jump to: