PostPosted: Tue Oct 02, 2012 6:25 pm 
Michael has twice recommended this white paper by Ken Chapman:
http://www.xilinx.com/support/documenta ... /wp275.pdf
"Get your Priorities Right – Make your Design Up to 50% Smaller"

I thought it deserved a new thread in case it sparks a discussion.

Other Xilinx white papers by Ken: https://www.google.co.uk/search?q=%22ke ... ite_papers

Cheers
Ed


PostPosted: Wed Oct 03, 2012 9:51 am 
I think this quote by Ken is key:
Quote:
Although the FPGA as a whole is programmable, each low-level feature is
actually fixed. The programmability comes in the ability to decide if the feature is used
or not.

The features he addresses are those of the FDRSE flip-flop: a D-type flip-flop with Reset, Set, and Enable features.

In the white paper you mentioned, he points out repeatedly that the order within the process block is critical. The Reset is defined first (he emphasizes that Reset is not needed, since the FPGA boots up in a known state), then the Set, and lastly the Enable logic.

That's what I got out of it anyway.
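
A minimal sketch of that ordering, written in Verilog like the rest of this thread rather than in the process-block form the white paper discusses. The module name ff_rse and its port names are hypothetical and chosen only for illustration; the point is the Reset, then Set, then Enable priority described above.

```verilog
// Hypothetical example: one flip-flop coded with the priority order
// Reset > Set > clock Enable, as described for the FDRSE primitive.
module ff_rse (
    input  wire Clk,
    input  wire Rst,   // synchronous reset, highest priority
    input  wire Set,   // synchronous set, second priority
    input  wire CE,    // clock enable, lowest priority
    input  wire D,
    output reg  Q
);

always @(posedge Clk)
begin
    if (Rst)
        Q <= 1'b0;     // reset wins over everything else
    else if (Set)
        Q <= 1'b1;     // set applies only when reset is inactive
    else if (CE)
        Q <= D;        // data is captured only when enabled
end

endmodule
```

Leaving out the Rst or Set branch simply leaves that feature of the flip-flop unused; the remaining branches keep the same relative order.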

_________________
65Org16:https://github.com/ElEctric-EyE/verilog-6502


PostPosted: Thu Oct 04, 2012 12:16 am 
That's part of what he was trying to convey.

In a previous post I made the point that Verilog and VHDL are hardware modeling languages. As part of that discussion, I tried to convey to you the need to understand that the syntax of HDLs is designed for simulation, and not for synthesis. This means that for the synthesizer to determine what physical circuits are desired, it must limit the syntax of the language that it can accept. Now that it has a smaller syntax to evaluate, it is still faced with a monumental task of trying to determine your actual circuit. On top of that, the synthesizer and the simulator need to be able to provide equivalent results for the same code.

Ken's main point is that the synthesizer analyzes the RTL source in a certain order, and that order is determined by the priorities of the fundamental circuit components of its physical FPGA. The highest priority physical feature of a Xilinx CLB FF is its reset feature. The synthesizer parses your source, and if the first portion of your code that it parses can be decomposed into something that can be connected to its reset feature, then that signal is connected to the reset feature of the FF. If it cannot be connected in that manner, then the reset feature is tied off, and the synthesizer goes on to the next part of your source to determine whether it appears to be something that can be used to gate the clock. If it can, then that signal is connected to the clock enable feature of the FF. This process repeats for each feature supported by the CLB FF. When all fundamental features have been determined, the remaining logic is implemented in the attached LUTs.

The key takeaway is that the synthesizer is expecting you to write your code in a specific manner in order for it to be able to parse it to use the built-in features of the CLBs. If you review some of the code that I've posted on GitHub, you'll see that I've taken Ken's admonitions to heart, and I develop my source in a very regular manner. If I want the FF to be initialized, I always use a reset. If I want the clock enable to be used, then I always terminate the logic in a final else if (CE) to ensure the clock enable feature of the FF is used. For example, the following code fragment reverses the priorities of reset and clock enable, which means that the synthesizer will only be able to implement the reset logic with a multiplexer in the LUTs instead of using the built-in features of the CLBs.
Code:
//  Reversed Priorities Clock Enable and Reset

always @(posedge Clk)
begin
    if(CE)
        D <= #1 D + 1;
    else if(Rst)
        D <= #1 0;
end

The same result applies if you code a horizontal line counter as follows:
Code:
// RST of FF not used either

always @(posedge Clk50)
begin
    if(pclk)
        D <= #1 ((hstart | hend) ? 0 : D + 1);
end

Neither of these two examples is meant to make you change your coding style. In most cases, the performance of the FPGA far exceeds your requirements, and the results would be equivalent to those of a version that takes into consideration the lessons from Ken Chapman's white paper:
Code:
// RST and CE of FF are used in this example

assign Rst_D = (hstart | hend) & pclk;

always @(posedge Clk50)
begin
    if(Rst_D)
        D <= #1 0;
    else if(pclk)
        D <= #1 (D + 1);
end

Furthermore, the Spartan 6, with its 6-input LUTs, is much more capable of merging a simple 2:1 multiplexer into the LUT without overflowing the input limits and expanding the required logic into another LUT. All that being said, it's been my experience that if you choose a coding style early that adheres to the lessons of Ken's white paper, then your code will be more efficient and generally have better performance. In addition, when you run out of resources in some future application, it will not be as a result of poor resource utilization but because you truly have exceeded the capabilities of the part.

I will make the point that the reset input of the FF is used to determine the initial state of the FF after configuration. In the configuration process, the bit stream configures whether the FF is initialized as a 1 or a 0. Further, during configuration, an internal reset is asserted, and the FFs are initialized in accordance with their assigned initial values. You have been initializing your FFs by assigning them an initial value when you declare them. In simulation, this value is loaded into the FFs when the simulation starts or restarts. In this manner, all of your FFs have a known value, and you can run a simulation.
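
As a small illustration of initializing a FF at its declaration, here is a sketch with a hypothetical module name (init_demo) and an assumed 8-bit width. It assumes the synthesis tool honours declaration initial values for FPGA targets; the simulator loads the value at time zero either way.

```verilog
// Hypothetical example: the counter register receives its starting value
// from the declaration, so no reset branch is coded at all.
module init_demo (
    input  wire       Clk,
    input  wire       CE,
    output wire [7:0] Q
);

reg [7:0] D = 8'h00;   // initial value, loaded when the simulation starts

always @(posedge Clk)
begin
    if (CE)
        D <= D + 8'd1; // only the clock enable feature is used here
end

assign Q = D;

endmodule
```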

However, FF reset can be used to initialize the FF to a known value, and this value is inherent in the configuration of the FPGA and not found in the LUTs. In the last example above, the Rst_D signal reinitialized the counter to its initial condition without using an extra LUT input to merge in a multiplexer. But as I said above, the enhanced capabilities of the 6-input Spartan 6 LUTs make some of these optimizations moot. The logic for the lower bits of the counter will fit within a single LUT, and the extra input bit will not have a significant effect on the size or speed of the example counter.
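
The two horizontal-counter styles above describe the same behaviour as long as the clock-enable branch is the final else if. A minimal simulation sketch to check this follows. The testbench name, the 10-bit width, the register names D_mux and D_rst, and the random stimulus are all assumptions made for illustration.

```verilog
`timescale 1ns/1ps
// Hypothetical testbench: runs the multiplexer style and the explicit
// reset / clock-enable style side by side and counts disagreements.
module tb_counter_styles;
    reg        Clk50  = 1'b0;
    reg        pclk   = 1'b0;
    reg        hstart = 1'b0;
    reg        hend   = 1'b0;
    reg  [9:0] D_mux  = 10'd0;   // counter written with the ?: multiplexer
    reg  [9:0] D_rst  = 10'd0;   // counter written with Rst_D and else if
    wire       Rst_D  = (hstart | hend) & pclk;
    integer    i;
    integer    errors;

    always #10 Clk50 = ~Clk50;   // 50 MHz clock, 20 ns period

    always @(posedge Clk50)
    begin
        if (pclk)
            D_mux <= #1 ((hstart | hend) ? 10'd0 : D_mux + 10'd1);
    end

    always @(posedge Clk50)
    begin
        if (Rst_D)
            D_rst <= #1 10'd0;
        else if (pclk)
            D_rst <= #1 D_rst + 10'd1;
    end

    initial begin
        errors = 0;
        for (i = 0; i < 2000; i = i + 1) begin
            @(negedge Clk50);                 // registers have settled by now
            if (D_mux !== D_rst)
                errors = errors + 1;
            pclk   = $random & 1;             // new stimulus for the next edge
            hstart = (($random & 15) == 0);   // occasional reset requests
            hend   = (($random & 15) == 0);
        end
        $display("mismatches: %0d", errors);
        $finish;
    end
endmodule
```

The check only covers simulation behaviour. Whether the two forms map to the same FPGA resources is a separate question that depends on the synthesizer and the target device.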

_________________
Michael A.


PostPosted: Thu Oct 04, 2012 7:21 am 
Code:
// RST and CE of FF are used in this example

assign Rst_D = (hstart | hend) & pclk;

always @(posedge Clk50)
begin
    if(Rst_D)
        D <= #1 0;
    else if(pclk)
        D <= #1 (D + 1);
end

It seems to me that on a Spartan-6 device, a naive implementation of this code would require more resources than the original, because you need an extra LUT to evaluate the Rst_D expression, in addition to 1 LUT per bit in the 'D' register.


PostPosted: Thu Oct 04, 2012 10:06 am 
Your assertion is very likely correct for the newer generation FPGAs (Virtex-5, 6, 7, and Spartan-6) which use 6-input LUTs. The problem lies not in the simple LUT for implementing the (pclk & (hstart | hend)) equation, which requires three inputs into a LUT, but with the incorporation of the two inputs, hstart and hend, into the equations of the counter at each bit position.

Bit 0 is a simple toggle, so its LUT should only require 1 input due to the counting function, ~D, and then the two other inputs, hstart and hend. Bit 1's equation will expand to include bit 0 so that it can toggle when bit 0 is a 1 and bit 1 is a 0. Thus, bit 1 of the counter requires 4 inputs to the LUT: D[1:0], hstart, hend. This expansion continues with each successive bit adding as input all bits of lower significance plus the two bits for resetting the counter.

Thus, by my count, at bit 3 all the inputs of the LUT are used up because the logic for reset has been incorporated into each LUT. Perhaps other synthesizers have the ability to automatically extract the "common subexpression", but it's been my experience with ISE's synthesizer that it simply does not perform optimizations that I would expect. (So it is only after bit 3 that an additional LUT is required for each bit instead of after bit 6.)
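
The input count described above can be tabulated with a few lines of simulation-only Verilog. This is a naive fan-in count under the stated assumptions (bit n depends on D[n:0], plus hstart and hend when the reset is merged into the LUT); it ignores carry chains and anything else the mapper may do, and the module name is hypothetical.

```verilog
// Hypothetical helper: prints the naive number of LUT inputs needed per
// counter bit, with the reset merged into the LUT versus routed to the
// dedicated reset input of the FF.
module lut_fanin_count;
    integer n;

    initial begin
        for (n = 0; n < 8; n = n + 1)
            $display("bit %0d: merged reset = %0d inputs, dedicated reset = %0d inputs",
                     n, (n + 1) + 2, n + 1);
        $finish;
    end
endmodule
```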

Therefore, even though the example code required an explicit assignment for the local reset function, I think that, on the whole, the localized optimization that the explicit assignment represents will allow the synthesizer greater opportunity for optimization, since it provides either a brute force local optimization directly, or it provides a clue regarding common subexpressions in the logic.

The primary problem that I've always had with RTL synthesis is just the large number of code templates that the synthesizer must try to match in order to create efficient implementations. The expressiveness of the language encourages experimentation with various syntax or coding styles by the designer. Unlike a programming language compiler, the synthesizer does not have the same benefits of a structured syntax with which to map each designer's coding style into actual FPGA logic. In other words, the expression **P++ to a C/C++ compiler has a definite code outcome for the target processor, but to the synthesizer,
Code:
// RST and CE of FF are used in this example

assign Rst_D = (hstart | hend) & pclk;

always @(posedge Clk50)
begin
    if(Rst_D)
        D <= #1 0;
    else if(pclk)
        D <= #1 (D + 1);
end

or
Code:
// RST of FF not used either

always @(posedge Clk50)
begin
    if(pclk)
        D <= #1 ((hstart | hend) ? 0 : D + 1);
end

can make it difficult to recognize the underlying desired logic. I frequently fall into the trap of generating code that places multiple complex logic functions into a single always block. Invariably, as I am looking to improve performance, I find myself achieving my desired performance just by simplifying the expressions and then letting the synthesizer/mapper combine them into packed logic blocks. In general, I have found that if local resets, loads, and clock enables are declared and used consistently, then there is more opportunity for the synthesizer/mapper to perform optimization because it has more contextual information regarding your desires. But as you point out, the target architecture may provide other opportunities for optimization, so that the additional coding effort that I recommend may not deliver the expected benefits. I will soon be undertaking my own Spartan 6 design, and may come to the same conclusions that you have reached.

_________________
Michael A.

