HDL coding for 6502 cores

MichaelM · Post by **MichaelM** » Wed Sep 05, 2012 12:02 pm

Ed:

The 40/10 cycles referred to my initial problem on the Virtex 5 FPGA, which uses a 40-bit fixed-point integer. In the comparative analysis I chose a 16-bit format, so the 1x algorithm requires 16 cycles to complete the product and the 4x algorithm requires only 4. I have adjusted the table from your earlier post accordingly. If the constraint line is assumed to be the operating clock period, the relative performance of the 4x algorithm still overcomes the shorter path delay of the 1x algorithm. Interestingly, slowing the clock down and processing more bits per iteration speeds up the overall operation. In the table, a speedup of 3 (96 ns down to 32 ns) is achieved by slowing the clock from 166.67 MHz to 125 MHz and switching from the 1x (original single case statement implementation) to the 4xA (optimized two-adder implementation).

-------------------------------------------------------
Algorithm             |    1x |   1xA |    4x |   4xA |
-------------------------------------------------------
FFs Used              |    87 |    87 |    87 |    89 |
LUTs Used             |   129 |    62 |   647 |   168 |
Slices Used           |    93 |    70 |   362 |   115 |
-------------------------------------------------------
Constraint (ns)       |     6 |     5 |    10 |     8 |
Reported Period (ns)  |  5.72 |  4.99 |  9.96 |  7.91 |
-------------------------------------------------------
Cycles                |    16 |    16 |     4 |     4 |
Min Time (ns)         |    96 |    80 |    40 |    32 |
-------------------------------------------------------
100 MHz Time (ns)     |   160 |   160 |    40 |    40 |
-------------------------------------------------------
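
The time rows are just cycles multiplied by the clock period. A small Python sketch reproduces them along with the 3x figure; the numbers are copied from the table, while the dictionary layout and function names are only illustrative.

```python
# Cycles and clock constraint (ns) per algorithm, copied from the table.
table = {
    "1x":  {"cycles": 16, "period_ns": 6},
    "1xA": {"cycles": 16, "period_ns": 5},
    "4x":  {"cycles": 4,  "period_ns": 10},
    "4xA": {"cycles": 4,  "period_ns": 8},
}

def min_time_ns(name):
    """Time for one product when clocked at the constraint period."""
    row = table[name]
    return row["cycles"] * row["period_ns"]

def time_at_ns(name, clock_ns):
    """Time for one product at a common clock period (10 ns for 100 MHz)."""
    return table[name]["cycles"] * clock_ns

# 1x at its 6 ns constraint versus 4xA at its 8 ns constraint.
speedup = min_time_ns("1x") / min_time_ns("4xA")
```

At a common 100 MHz clock only the cycle count matters, which is why both 4-cycle variants come out at 40 ns.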

EEyE:

It can be just aesthetics, but it is also for supporting multipliers in FPGAs that don't have the built-in capability, like the older Altera 10k, Xilinx 4k, 5200, Spartan, Spartan XL, and Spartan II, the older Actel anti-fuse FPGAs, etc. There is no real reason not to use the built-in multipliers in the Spartan 3AN or 6 families. They are, however, only 18x18 multipliers, and several of them will have to be put together to form 32x32 or 40x40 multipliers. In addition, they are generally built in the vicinity of the embedded block RAMs, and generally restrict the use of the block RAMs because their inputs share the address lines of the RAMs. In most designs, this is not an issue. The multiply-accumulate (MAC) structure that they implement is generally tied to some real-time data stream that would not naturally be stored in the attached block RAM. Thus, you may trade some block RAM for a single-cycle MAC function that allows you to build a really fast digital filter, signal transform, etc.
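
To give a feel for what "putting several together" means, here is a rough Python model of an unsigned wide multiply built only from small limb-by-limb products. The 17-bit limb size is an assumption (an unsigned 17-bit value fits a signed 18-bit input), the function names are made up, and a real signed design would treat the top limb differently.

```python
LIMB_BITS = 17  # assumption: unsigned 17-bit limbs fit a signed 18x18 primitive

def split_limbs(value, width):
    """Split an unsigned `width`-bit integer into LIMB_BITS-wide limbs, LSB first."""
    mask = (1 << LIMB_BITS) - 1
    count = -(-width // LIMB_BITS)  # ceiling division
    return [(value >> (i * LIMB_BITS)) & mask for i in range(count)]

def wide_multiply(a, b, width=40):
    """Unsigned width x width product using only limb x limb multiplies.

    Returns (product, number_of_small_multiplies).
    """
    a_limbs = split_limbs(a, width)
    b_limbs = split_limbs(b, width)
    product = 0
    primitives = 0
    for i, x in enumerate(a_limbs):
        for j, y in enumerate(b_limbs):
            # Each x * y stands in for one 18x18 hardware multiply; the
            # shift places the partial product at its weight.
            product += (x * y) << ((i + j) * LIMB_BITS)
            primitives += 1
    return product, primitives
```

Under these assumptions a 40x40 product takes nine small multiplies and a 32x32 product takes four, plus the adders that sum the partial products.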

In my situation, the task is to preclude design obsolescence over a large number of years. Once the components are obsolete, porting the design from one multiplier architecture to another may or may not be an issue, but having all of the design components in RTL of our own design and under our control provides an easier path to the inevitable redesign. In addition to component obsolescence, tool obsolescence is also a big consideration. Building 40x40 multipliers from the built-in 18x18 multipliers is generally done with the vendor's IP generator, and that has a stability of 0 years. (I expect that the competition between Altera and Xilinx for the fastest and biggest FPGA will eventually change the built-in multipliers into floating-point devices, but that is likely a few years out.)

You are correct that the notation 1x and 4x refers to the number of bits that the algorithm processes on each iteration. I expect to release the code for these four (or six) modules shortly on Github. I am considering using Hanson's license (see the LCC license), but have not yet contacted him for permission to plagiarize (copy and modify) the license in the LCC Github repository. If that permission is not granted, then I'll likely release the source under the LGPL license I used for the MAM65C02 and MiniCPU Serial ALU releases.
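
For anyone who wants to see the bits-per-iteration idea in action, here is a behavioral Python sketch of Booth recoding with a selectable number of bits per iteration. It is a generic textbook formulation, not the Verilog from the modules, and the function and parameter names are illustrative only.

```python
def booth_multiply(multiplicand, multiplier, width=16, bits_per_iter=1):
    """Signed multiply by Booth recoding; returns (product, iterations).

    `multiplier` must fit in `width` bits as a two's complement value,
    and `width` must be a multiple of `bits_per_iter`.
    """
    assert width % bits_per_iter == 0
    k = bits_per_iter
    m = multiplier & ((1 << width) - 1)  # two's complement bit pattern
    prev = 0          # implied bit to the right of the LSB
    product = 0
    iterations = 0
    for i in range(0, width, k):
        group = (m >> i) & ((1 << k) - 1)   # next k multiplier bits
        top = group >> (k - 1)              # MSB of the group
        # Booth digit: group read as signed, plus the previous group's MSB.
        digit = group - (top << k) + prev   # in [-2**(k-1), 2**(k-1)]
        product += (digit * multiplicand) << i
        prev = top
        iterations += 1
    return product, iterations
```

With `width=16`, `bits_per_iter=1` takes 16 iterations and `bits_per_iter=4` takes 4, matching the Cycles row of the table. In hardware the `digit * multiplicand` term is what costs adders and path delay, since a 4-bit digit ranges over -8 to 8.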

BigEd · Post by **BigEd** » Wed Sep 05, 2012 5:11 pm

Thanks for the extra detail Michael.

Quote:

[multiplier] inputs share the address lines of the RAMs

Interesting!

I'd counsel against using the LCC license, or a non-commercial license, because it may make it difficult to use your work embedded in another work. In particular, difficult to use in a GPL or LGPL work. Unless you really expect to make revenue from the design, and believe that revenue to be at risk from others, which is a reasonable motivation and a more or less probable event depending on the design!

Cheers
Ed

MichaelM · Post by **MichaelM** » Thu Sep 06, 2012 2:56 am

EEyE:

Check the following Github repository for Verilog source code for the various Booth Multipliers I have implemented.

ElEctric_EyE · Post by **ElEctric_EyE** » Thu Sep 06, 2012 7:22 am

Thanks for sharing the code. As a newbie to Verilog I will point out what I comprehend on Booth_Multiplier.v
#1, it is a sequential circuit by the use of 'always @ (posedge clk)'
#2, it uses blocking arguments as opposed to non-blocking, by the use of '<=' which can change a circuit greatly
#3, you use delays which are not synthesizable, i.e.'#1'.
#4, you use a reset which adds a layer of logic, although it is desirable for simulation.
#5, the case statement infers a multiplexer?
Question: Why is there no test for Reset on the case statement, like it is tested in the other FF's?
#6, you use '2**' exclusively to shift right?, instead of '<<'?

EDIT: Forgive me, I am only on Pg. 36 of Bhasker.

MichaelM · Post by **MichaelM** » Thu Sep 06, 2012 11:33 am

Always look at the code as a specification for simulation purposes not for HW synthesis. Neither Verilog nor VHDL were designed for synthesis, but were designed for modeling of the behavior of HW. There is a big conceptual gap between these HW modeling languages and what the synthesizers require for a specification. You can close the gap by quickly learning the parts of the language to which synthesizers respond, but that may be somewhat disadvantageous with regard to testing and simulating your designs. If there is one overarching recommendation I can make, it is to never implement a design and test it in the FPGA. Always write a testbench, even if it is nothing more than a clock driver.

EEyE wrote:

#1, it is a sequential circuit by the use of 'always @ (posedge clk)'

The list of signals specified as arguments of @() are simply instructions to the simulator about which input signals are to be monitored during each simulation cycle. A common failure of both experienced and inexperienced HDL designers is to leave out some of the input signals used by the equations within the always block (process). Synthesis is only concerned with which signal may be used as the clock for any sequential logic in the always block (process). Simulation, OTOH, is concerned with speed, and most HDL simulators are event driven. As such, they attempt to evaluate equations only when their input signals change. This is the reason that the arguments of @() are known as the sensitivity list of the always block (process).

In the referenced always block, the @(posedge Clk) tells the simulator that the input signal Clk is the only signal that needs to be monitored for changes, and then only for rising edges. This means that the simulator only evaluates the equations within the block on this event. Since the intent is to specify a sequential circuit, including any of the other input signals in the sensitivity list would not produce the same results, because the equations would change state whenever one or more of those signals transitioned.

When viewed in this manner, it becomes clear how the synthesizer interprets the sensitivity list of such an always block. It recognizes that the block is sensitive only to the rising edge of Clk, but it doesn't find Clk included in any of the equations within the block. From this it infers that the equations within the block are sequential logic, so it infers the required FFs. If additional signals had been specified within the sensitivity list and none of them were included in the equations, then the synthesizer would be very confused because that would have implied a sequential circuit with FFs using multiple clocks. If a second signal is specified such as @(posedge Clk or posedge Rst), and the Rst signal is included in the equations of the block, then the synthesizer interprets the Rst signal as an asynchronous SET/CLR signal a la 7474. If a signal such as Rst is used within the always block, but is not included in the sensitivity list, then it is assumed to be a synchronous SET/CLR signal. (The recommended practice in Altera and Xilinx FPGAs is to implement local synchronous resets.)
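
A toy Python model of the two reset styles may make the difference concrete. It is a behavioral sketch rather than HDL, and the class and method names are invented for illustration; `async_reset=True` stands for listing Rst in the sensitivity list.

```python
class DFlipFlop:
    """Toy D flip-flop with either a synchronous or an asynchronous reset."""

    def __init__(self, async_reset):
        self.async_reset = async_reset
        self.q = 0

    def posedge_clk(self, d, rst):
        # Both styles test Rst inside the block on a clock edge.
        self.q = 0 if rst else d

    def posedge_rst(self):
        # Only the asynchronous style is woken by a Rst edge; the
        # synchronous style does nothing until the next clock edge.
        if self.async_reset:
            self.q = 0


sync_ff = DFlipFlop(async_reset=False)
async_ff = DFlipFlop(async_reset=True)
for ff in (sync_ff, async_ff):
    ff.posedge_clk(d=1, rst=0)  # both flip-flops now hold 1
    ff.posedge_rst()            # Rst rises between clock edges
```

After the loop the asynchronous flip-flop has already cleared, while the synchronous one still holds 1 and clears only on the next clock edge with Rst high.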

EEyE wrote:

#2, it uses blocking arguments as opposed to non-blocking, by the use of '<=' which can change a circuit greatly

Blocking and non-blocking assignment concepts are not synthesizable. They are concepts for controlling the simulators. (Note that '<=' is the non-blocking assignment; '=' is the blocking one.) From a synthesis perspective, non-blocking assignments are used within always blocks and their assignments are made to Verilog reg variables. That's why I said in an earlier post that reg variables are not FFs. They are only FFs if the sensitivity list identifies a posedge/negedge signal as a clock; otherwise they are simply the combinatorial signals to which the assignment is made within the block. Continuous assignments are made with assign statements, and those only synthesize to combinatorial logic.

Conceptually, the non-blocking assignments within a synthesizable always block function like a sample and hold in front of an ADC. On the transition of any of the signals in the sensitivity list, the simulator samples and holds the value of all the signals at that instant and then evaluates all of the equations. Notice that I did not mention the synthesizer in this explanation. The logic equations derived by the synthesizer operate in the HW continuously, without regard to signal transitions and other whatnots.
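
The sample-and-hold behavior can be mimicked in a few lines of Python. In standard Verilog terms '<=' is the non-blocking assignment and '=' is the blocking one; the sketch below models one clock edge under each rule. The function names and the dictionary-of-lambdas representation are assumptions for illustration, not how any simulator is actually built.

```python
def clock_edge_nonblocking(regs, rules):
    """Evaluate every right-hand side on the sampled state, then commit together."""
    sampled = dict(regs)  # sample and hold
    updates = {name: fn(sampled) for name, fn in rules.items()}
    return {**regs, **updates}

def clock_edge_blocking(regs, rules):
    """Commit each assignment immediately, so later rules see earlier results."""
    state = dict(regs)
    for name, fn in rules.items():
        state[name] = fn(state)
    return state

# Models "a <= b; b <= a;" (or "a = b; b = a;" under the blocking rule).
swap = {"a": lambda s: s["b"], "b": lambda s: s["a"]}

after_nonblocking = clock_edge_nonblocking({"a": 1, "b": 2}, swap)
after_blocking = clock_edge_blocking({"a": 1, "b": 2}, swap)
```

Under the non-blocking rule the two registers exchange values; under the blocking rule `b` picks up the value that was just written to `a`, so both end up equal.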

EEyE wrote:

#3, you use delays which are not synthesizable, i.e.'#1'.

There is a lot of disagreement on my team about my use of this construct. It is true that it is not synthesizable, but it is simulated. The issue that it resolves is a simulation issue. Some simulators, notably ISE (but I've had the same problem with ModelSimXE), fail to correctly resolve simultaneous transitions on inputs and outputs. This issue has caused me untold hours of debugging perfectly valid synthesizable code. The solution is to give the simulator a little help and specify that the output of the assignment will be delayed from the evaluation by some small amount. I have chosen to use #1 (after 1 ns in VHDL) to avoid this issue. I religiously apply these delays on all assignments for sequential circuits. It is just two simple keystrokes and a space, and it saves an enormous amount of time when debugging.

EEyE wrote:

#4, you use a reset which adds a layer of logic, although it is desirable for simulation.

Relying on the initialization feature of the FPGA is flirting with disaster. The fact that the external logic will almost always beat an FPGA in coming out of reset means that some of the external signals may be nice and stable, but others may not because they are adversely affected by the FPGA not being fully operational. There are lots of things I do to avoid this situation, but the best is to have a reset input that is held off until the DONE pin is high. I also change the initialization order in the configuration panel so that DONE is asserted high only after all internal initialization is completed. Finally, the conceptual disconnect between an HDL-based design and a schematic-based design means that simulation is absolutely required. As I said above, you should commit yourself to simulating each and every module. This can save you countless hours, and can be as simple as just setting up the clock and releasing reset after a minimum amount of time. (For ISim it is 100 ns.)

EEyE wrote:

#5, the case statement infers a multiplexer?

Yes, but it can also infer a ROM.

EEyE wrote:

Question: Why is there no test for Reset on the case statement, like it is tested in the other FF's?

The always @(*) block in which the partial sums are computed does not declare a clock. Without a clock, the block specifies combinatorial logic, and Rst is not required as a gating function in the logic.

EEyE wrote:

#6, you use '2**' exclusively to shift right?, instead of '<<'?

No. Exponentiation is used to calculate the width of the vectors. I changed the parameterization in the other modules to use N as the bit width directly. It is more natural.

I have found that >> and << don't synthesize reliably in ISE. So I generally write the left and right shifts explicitly.

BigEd · Post by **BigEd** » Thu Sep 06, 2012 7:21 pm

Quote:

Neither Verilog nor VHDL were designed for synthesis, but were designed for modeling of the behavior of HW

Glad you said this - it's a major practical failing, but these are still the two languages we have. I've been pondering writing something on the subject of Verilog particularly, but I don't actually know it well enough to be clear. Although I'm sure Verilog is the right choice for me, I use it rarely enough that the best I can do is copy idiomatically from other people.
Cheers
Ed
