PostPosted: Sun Apr 22, 2012 9:23 am 

Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10799
Location: England
To continue a conversation started in another thread: we happened to be talking about adding some registers to a 6502 variant based on Arlet's core, which is written in Verilog. But I think we could usefully have a place to discuss, more generally and widely, the ways to code in HDL and the implementations that we get from synthesis as a result.


PostPosted: Sun Apr 22, 2012 9:26 am

Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10799
Location: England
I won't bring in the specific code we were discussing, but the context probably starts here. (Bear in mind, please, that the link to github below is to an experimental branch which doesn't simulate correctly, although that's not important here.)

BigEd wrote:
I confess that my B and D reg code is just copied from the nearby regfile code without much thought - it is possible that it can be simplified. It does look like load_b_reg is in fact only valid at decode and rdy.

Note that the code might need to take the existing form if instead of two or more zero-operand opcodes we had a one-operand opcode as I suggested at one point, picking up the special register address from the data bus in a later cycle. I'm not certain of this. I find it all rather difficult!

Edit: hmm, that might not make sense, in which case I apologise for posting while tired. I've got XBA and TXD which are both zero operand. XSR #reg would be one operand.

Cheers
Ed


MichaelA wrote:
Ed:

I appreciate you linking your code for me in your last post. I had a nice reply in the works, and then I decided to click through to your code for a second look; when I got back, my tome was lost. So I will try to reconstruct it as best I can.

Your code is very clean, and clearly follows the structure and style of Arlet Ottens. Like most members of this forum, I am also working on an FPGA implementation of the WDC 65C02. I have completed one core and am working on my second. My general objective is a faithful reproduction of the WDC 65C02 instruction set, but not necessarily the instruction cycle times. Actually, my objective in that regard is to reduce the number of clock cycles required for most instructions without resorting to a wider data bus. My approach to the implementation of the core also differs from those that others, such as yourself and Arlet Ottens, have posted: I use a microprogrammed instruction sequencer.

My first core was intended to be used with a single-cycle asynchronous memory such as can be synthesized from Xilinx LUTs. It assumes that the address is output and the data is returned in the same cycle. At the performance I was targeting, 100 MHz in a Xilinx XC3S200AN-5, this approach is not particularly practical, since any reasonable program would require too many of the free LUTs and would not provide a sufficiently large memory for anything but toy applications. My second core is a more complete implementation with BRAM included in the core. The difference in the access method has caused me to rethink some of the asynchronous logic that I used for single-cycle address calculation in the first core. Without considering the delays in the address output path or the input data path (as is the case in functional simulation), my first attempt is complete, but it will never work in a practical system. Thus, the second core has been derived from the first core, and it deals with these practical considerations. With respect to the second core, I have completed its re-implementation and reworked the microprogram's control fields, but I've not yet begun to debug the microprogram. Given the amount of time that I have available to pursue this hobby, it's going to be several weeks before that task can be completed.

Back to your core. You and Arlet are using a one-hot state machine methodology for the instruction sequencer. I've not been much of a fan of the one-hot state machine approach for a number of reasons, although I often use one-hot control fields. However, the performance and resource utilization that Arlet achieved (as you detailed in an earlier post comparing various cores), coupled with the cleanliness of the implementation, has convinced me to put some effort into studying the methodology once I've completed my second core. It appears that the base design uses the register file to provide the AI input to the ALU module. I have used the LUT RAMs in the Xilinx FPGA in this manner for many years; the multiplexer inherent in the RAM is one of the fastest ways to improve the operating speed of a design in a Xilinx FPGA.

It also appears that the register file is implemented as a single-port RAM and that the core does not use multiple, independent adders to provide parallel computation of addresses and/or ALU results. Therefore, I am going to suggest expanding the register file so that you can also use it for temporary storage instead of wiping out your S storage location. Since, for the Xilinx FPGA that you are using, any use of a LUT as a single- or dual-port RAM will always make available a minimum of 16 "registers", the synthesizer is currently tying off the two (16-deep) or four (64-deep) most significant address lines. (I am sure that you are aware that Xilinx FPGAs older than the Spartan 6 and Virtex 7 families employ 4-input LUTs, while these two FPGA families employ 6-input LUTs.) In this manner, you can place your B register into the register file, along with any other registers or temporary values that you may need. I don't think that this suggestion applies to the shift count/direction register, but I've not spent any time exploring the implementation or the instruction set that you are presently implementing with your core.

Once again, thanks for the link to your code.

BTW, you need not modify the present definition of the regfile address/select variable, "regsel". You can simply extend it using a bit-vector construction, {extreg, regsel}, where "extreg" is declared as a 2-bit or a 4-bit vector (set by the FPGA family that you are using). If you fail to include the expanded RAM select vector when addressing the register file, then Verilog's default behavior will left-fill the RAM address with 0s, and the register file will behave like your current implementation: a four-location RAM. (A synthesizer or PAR warning should be issued, and it will indicate that the RAM address is zero-filled because it's not completely specified.)
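To make the suggestion concrete, here is a minimal sketch of a register file addressed by the concatenated select vector. The module and signal names (write_register, DO, AI) are illustrative assumptions, not taken from Arlet's core; a 2-bit extreg is shown, giving the 16 locations that a 4-input-LUT RAM provides anyway.

Code:
module regfile_ext (
    input  wire       clk,
    input  wire       write_register,   // register write strobe (assumed name)
    input  wire [1:0] regsel,           // existing 2-bit register select
    input  wire [1:0] extreg,           // new upper select bits (4 bits on 6-input-LUT parts)
    input  wire [7:0] DO,               // data to be written (assumed name)
    output wire [7:0] AI                // read data feeding the ALU (assumed name)
);
    wire [3:0] regaddr = {extreg, regsel};
    reg  [7:0] regfile [0:15];          // 16-deep LUT RAM: the minimum depth in any case

    always @(posedge clk)
        if (write_register)
            regfile[regaddr] <= DO;     // synchronous write

    assign AI = regfile[regaddr];       // asynchronous LUT-RAM read
endmodule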


PostPosted: Sun Apr 22, 2012 9:54 am

Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10799
Location: England
Hi Michael,
Thanks for your input, and the information about your core. Getting to know the available resources on an FPGA is crucial to getting the best possible implementation, I'm sure, even if that means needing to target a specific manufacturer or a specific generation. But equally, I like to see portable HDL which has at least a reasonable implementation across a range of devices. I haven't considered as deeply as you have the implications of the 6-input LUTs in the Spartan-6 generation.

You're right that Arlet's core, like the 6502, does address calculations with the ALU, which helps with the density but might be a hindrance for clock cycle counts.

With the single-port register file connected to the ALU as it is I couldn't offer a multiply instruction with two operands from registers, which is why my B reg sits outside. (One way to implement a dual-port register file is to build two register files which are always written together and therefore always have the same content. That might be efficient.) EEye did see the performance drop when he unintentionally strayed from the single-port architecture, so it is buying us something.

I do see that a larger register file is possible, and indeed almost free. I'm not so sure we can use it for temporaries, if we need those temporaries in a dual-ported sense, or indeed for special registers such as a direct page or stack page base, because we may need those in the same cycle that we need conventional registers. (Or we may not. I'm actually insufficiently familiar with the overview of what happens when in Arlet's core.)

As my core is sitting in a github fork of Arlet's project, I'm trying to make minimal changes. This might be misguided, acting as extra friction on my progress which I hardly need, but it's a reason why I don't rejig the existing code to help my navigation and reduce the number of lines.

Funnily enough, even the smallest FPGA is now large enough that the size of the core doesn't matter too much: the clock speed and the maintainability of the code are probably more important.

I'm happy for this thread to meander into discussions of one-hot encoded state machines versus microcoded machines!

Cheers
Ed


PostPosted: Sun Apr 22, 2012 10:01 am

Joined: Tue Nov 16, 2010 8:00 am
Posts: 2353
Location: Gouda, The Netherlands
BigEd wrote:
I'm happy for this thread to meander into discussions of one-hot encoded state machines versus microcoded machines!

Note that the one-hot encoding is just a choice made by the synthesis tools. I wrote the code as a generic state machine that can be implemented in hardware in different ways. I've written other state machines in exactly the same style and ended up with Gray encoding instead. In this case, it seems that the synthesis tools prefer the one-hot encoding for performance reasons. Due to the abundance of flip-flops, one-hot encoding is generally well suited to FPGAs.
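For readers following along, a generic state machine in this style might look like the sketch below; the states and signals are illustrative, not the actual states of the core. The binary codes in the source are only the RTL view: XST may re-encode them as one-hot, Gray, or another scheme according to its FSM extraction settings (reported in the synthesis log, and controllable with the fsm_encoding option/attribute).

Code:
module fsm_example (
    input  wire clk,
    input  wire reset,
    input  wire need_writeback,
    output wire busy
);
    // State codes as written; the synthesis tool is free to choose a different encoding.
    localparam FETCH  = 2'd0,
               DECODE = 2'd1,
               EXEC   = 2'd2,
               WRITE  = 2'd3;

    reg [1:0] state;

    always @(posedge clk)
        if (reset)
            state <= FETCH;
        else
            case (state)
                FETCH:  state <= DECODE;
                DECODE: state <= EXEC;
                EXEC:   state <= need_writeback ? WRITE : FETCH;
                WRITE:  state <= FETCH;
            endcase

    assign busy = (state != FETCH);
endmodule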


PostPosted: Sun Apr 22, 2012 3:40 pm

Joined: Tue Nov 16, 2010 8:00 am
Posts: 2353
Location: Gouda, The Netherlands
BigEd wrote:
One way to implement a dual-port register file is to build two register files which are always written together and therefore always have the same content. That might be efficient.

I don't think you have to do that. If you write the HDL in the right way, the tools can infer multi-ported memories. Of course, as we've seen before, this may require a watchful eye on the synthesis report, and perhaps some rewriting.
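For reference, the usual inference template is something like the sketch below (illustrative names): one clocked write process plus two asynchronous reads, which the Xilinx tools can map onto the dual-port distributed RAM primitives instead of duplicating the storage.

Code:
module regfile_2r1w (
    input  wire       clk,
    input  wire       we,
    input  wire [3:0] waddr,    // write (and first read) address
    input  wire [3:0] raddr,    // second, independent read address
    input  wire [7:0] din,
    output wire [7:0] douta,    // read data at waddr
    output wire [7:0] doutb     // read data at raddr
);
    reg [7:0] mem [0:15];

    always @(posedge clk)
        if (we) mem[waddr] <= din;  // single write port

    assign douta = mem[waddr];      // asynchronous reads map onto the LUT-RAM output muxes
    assign doutb = mem[raddr];
endmodule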


PostPosted: Mon Apr 23, 2012 3:27 am

Joined: Mon Apr 23, 2012 12:28 am
Posts: 760
Location: Huntsville, AL
Ed/Arlet/EE:

I just learned that you can't hit tab in this tool. I posted my first core as an attachment on the thread where you were responding to Windfall about proper 65C02 cores. I would appreciate it if you guys would take control of that archive and place it in a repository where others may be able to access it more easily.

I was going to keep the discussion going on multi-port register files, but it's getting late and I've got a long day tomorrow.

_________________
Michael A.


PostPosted: Tue Apr 24, 2012 12:30 pm

Joined: Mon Apr 23, 2012 12:28 am
Posts: 760
Location: Huntsville, AL
Ed:

I will investigate your statement:

Quote:
With the single-port register file connected to the ALU as it is I couldn't offer a multiply instruction with two operands from registers, which is why my B reg sits outside. (One way to implement a dual-port register file is to build two register files which are always written together and therefore always have the same content. That might be efficient.) EEye did see the performance drop when he unintentionally strayed from the single-port architecture, so it is buying us something.


In Xilinx xapp065 (XC4000 Series Edge-Triggered and Dual-Port RAM Capability), there is an excellent figure, Figure 1, of how a "register file" based on a Xilinx LUT is constructed by the synthesizer within a single CLB/Slice. Over the years, the exact implementation may have changed somewhat, but this structure was key to the performance of the Xilinx LUT, so I suspect that the private (internal CLB/Slice) connections that make the single-port and dual-port LUT-based RAM possible are retained. As you can see in that figure (the PDF attachment was rejected), the RAM is constructed as an edge-triggered set of FFs followed by the F multiplexer (i.e. the LUT). If used in the simplest manner, that arrangement forms a single-port, rising-edge-triggered RAM with an active-high WE and independent input and output data busses. If the second multiplexer, i.e. the LUT of the adjacent CLB, is connected as shown, the result is a "dual-port" RAM with one read/write port (F) and one read-only port (G). The two addresses can be independent, but only one port can be used to write data into the RAM.

The figure also shows the reason why the multiplexers are so fast: they are actually implemented with the RAM multiplexers of the CLB/Slice. If you use two separate RAM declarations and attach a common address bus and data bus, the result will not be the structure indicated in the figure. In that case, I suspect that there will be additional delays because the two output ports are not located in the same CLB/Slice.

The ISE "light bulb" tool (the Language Templates) has a very good example of how to construct the HDL to get a LUT-based dual-port RAM as discussed in the referenced app note.

_________________
Michael A.


PostPosted: Tue Apr 24, 2012 5:54 pm

Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10799
Location: England
Thanks for that - I had managed to miss the point that two read ports are available. Will look up the app note and example.
Cheers
Ed
Edit: the app note for Spartan-3 (it uses two bits of storage per bit, so FPGA capacity for dual-port is halved, if there was any danger of using it all) - compare Figures 5 and 6.
The app note for Spartan-6 works the same way.
The historical xapp065 is linked for reference - as you say, it shows output muxes allowing for non-duplication of the storage. It seems the implementation produced by synthesis has changed.


PostPosted: Wed Apr 25, 2012 12:12 am

Joined: Mon Apr 23, 2012 12:28 am
Posts: 760
Location: Huntsville, AL
Ed:

It's still the same implementation, but the difference is that the architecture has changed from two LUTs and FFs per CLB to four slices per CLB, with two LUT/FF pairs per slice. The Spartan-3 slices are composed of LUTs which have different characteristics: the LUTs in the two SLICEMs of each CLB can be configured just like the old XC4000XL LUTs, but the LUTs in the two SLICELs do not support the dual-port distributed RAM construct.

You are right that using the dual-port RAM construct means that both LUTs are used for one function. However, the two associated FFs may still be used as FFs. In the more modern families, this reduction in LUTs when a DPRAM is constructed may appear to lead to a loss of resources, but given the higher density this is not the case in practice. Furthermore, although you lose some LUTs, there is a vast saving in the total number of LUT/FF pairs when a DPRAM is used for read/write registers instead of discrete FF-only registers. In addition, the built-in multiplexer operates at logic speeds and does not suffer from the general performance loss that occurs when the multiplexer is explicitly synthesized across several LUT levels.

In the case of the 6502 cores, the read/write port should be tied to the destination register address, and the read-only port should be connected to the source register address. Please note that, as Arlet has pointed out, the RAM itself would be considered a single port device. Only the Block RAMs offer true dual-port RAM cells. In other words, only the block RAMs allow two independent read/write ports.

That may sound attractive, particularly since independent clocks can be applied to each of these read/write ports, but there is only one RAM cell for each location. It is specialized, and its contents are undefined if it is written simultaneously from both sides; that is, it is impossible for the physical circuit to resolve the value to be stored. Also, the value returned is indeterminate if a read on one port occurs simultaneously with a write to the same address from the other port. The block RAMs do have additional mode logic that helps ameliorate the second situation, but my recommendation is to religiously avoid both of these corner cases when using block RAM.
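For completeness, a true dual-port block-RAM template along these lines might look like the sketch below (generic inference style, illustrative names). Note that nothing in the RTL resolves the two hazards just described; the surrounding design has to guarantee that they never occur.

Code:
module bram_tdp #(
    parameter DW = 8,           // data width
    parameter AW = 10           // address width
)(
    input  wire          clka, clkb,
    input  wire          wea,  web,
    input  wire [AW-1:0] addra, addrb,
    input  wire [DW-1:0] dina,  dinb,
    output reg  [DW-1:0] douta, doutb
);
    reg [DW-1:0] mem [0:(1<<AW)-1];

    // Port A: synchronous write and read on clka.
    always @(posedge clka) begin
        if (wea) mem[addra] <= dina;
        douta <= mem[addra];
    end

    // Port B: synchronous write and read on clkb.
    always @(posedge clkb) begin
        if (web) mem[addrb] <= dinb;
        doutb <= mem[addrb];
    end
endmodule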

_________________
Michael A.


PostPosted: Fri Aug 31, 2012 1:05 pm

Joined: Mon Apr 23, 2012 12:28 am
Posts: 760
Location: Huntsville, AL
Ed:

As you suspect, the amount of logic increases as the number of bits processed increases. My objective for the Booth multipliers was not efficiency so much as FPGA vendor independence. My project attempts to ensure that the IP used in the design is 100% unencumbered by vendor-specific IP so that it can be ported seamlessly from one FPGA vendor/family to another. Since some FPGA families do not include built-in hardware multipliers, I needed a relatively fast way of doing 40-bit fixed-point arithmetic. The 40 clock cycles required by the 1-bit Booth multiplier would be acceptable, but I wanted a bit more performance since I had the resources to spare. Unlike the fully pipelined arithmetic shifter which follows it in the arithmetic processor, the Booth multipliers are not pipelined. Like all of the arithmetic elements in the signal processor, they register the multiplicand and multiplier when an enable/start signal is asserted and provide a RDY output when the product is ready. Thus, if a future target is resource limited, the 1-bit Booth multiplier can be dropped in, in place of the 4-bit Booth multiplier. If this change were required, no changes would be needed to the signal processor microcode, because the module interfaces provide the necessary timing control.
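To make the interface concrete, below is a minimal sketch of a radix-2 (1-bit-per-cycle) Booth multiplier for 16-bit two's-complement operands with the same load/ready style of handshake. It illustrates the algorithm only; the names and coding details are assumptions, not MichaelA's source.

Code:
module booth_mult_16x16 (
    input  wire        clk,
    input  wire        rst,
    input  wire        load,          // register the operands and start
    input  wire [15:0] multiplicand,
    input  wire [15:0] multiplier,
    output wire [31:0] product,
    output reg         valid          // asserted when the product is ready
);
    reg signed [16:0] M;              // sign-extended multiplicand
    reg        [33:0] prod;           // {partial sum[16:0], multiplier[15:0], Q-1}
    reg        [4:0]  cntr;           // iterations remaining

    // Booth recoding of the two low bits selects add, subtract, or pass.
    reg signed [16:0] sum;
    always @(*)
        case (prod[1:0])
            2'b01:   sum = $signed(prod[33:17]) + M;
            2'b10:   sum = $signed(prod[33:17]) - M;
            default: sum = $signed(prod[33:17]);
        endcase

    always @(posedge clk)
        if (rst) begin
            cntr  <= 5'd0;
            valid <= 1'b0;
        end else if (load) begin
            M     <= {multiplicand[15], multiplicand};
            prod  <= {17'd0, multiplier, 1'b0};
            cntr  <= 5'd16;
            valid <= 1'b0;
        end else if (cntr != 0) begin
            prod  <= {sum[16], sum, prod[16:1]};  // add/subtract, then arithmetic shift right
            cntr  <= cntr - 5'd1;
            valid <= (cntr == 5'd1);              // high once the final shift completes
        end

    assign product = prod[32:1];      // 32-bit two's-complement product
endmodule

A radix-4 (2-bit) or radix-16 (4-bit) version examines more multiplier bits per iteration, needs extra guard bits in the partial sum, and selects among more multiples of the multiplicand, which is where the additional adders and multiplexers in the reports below come from.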

I set the bit width to 16 for comparison purposes and synthesized three different multipliers. The following table provides a comparative look at the increased resource utilization and decreased performance that you referred to in the *UM Forth thread:

Code:
-----------------------------------------
                |   x1  |   x2  |   x4  |
-----------------------------------------
#FFs            |   87  |   87  |   87  |
#LUTs           |  129  |  183  |  647  |
#Slices         |   93  |  136  |  362  |
-----------------------------------------
Constraint (ns) |   6   |   8   |  10   |
Achieved (ns)   | 5.719 | 7.118 | 9.960 |
-----------------------------------------


As expected, the number of logic implementation elements (LUTs, for the Xilinx Spartan-3AN used here) increases as the number of bits processed per cycle increases. The additional adders and multiplexers needed also decrease the performance from a base of 5.72 ns to 9.96 ns; the cycle time grows by approximately 75%, a significant decrease in per-cycle performance. However, the performance did not degrade as much as the resource usage grew: from 129 to 647 LUTs. That 5x increase in LUTs is significant.

The supporting data is provided below.

The following is taken from the synthesis report for the 1 bit Booth multiplier. It will be used as the base for comparison purposes.

Code:
=========================================================================
*                           HDL Synthesis                               *
=========================================================================

Performing bidirectional port resolution...

Synthesizing Unit <Booth_Multiplier>.
    Related source file is "Booth_Multiplier.v".
    Found 32-bit register for signal <P>.
    Found 1-bit register for signal <Valid>.
    Found 17-bit register for signal <A>.
    Found 5-bit down counter for signal <Cntr>.
    Found 34-bit register for signal <Prod>.
    Found 17-bit 4-to-1 multiplexer for signal <S>.
    Found 17-bit adder for signal <S$addsub0000> created at line 91.
    Found 17-bit subtractor for signal <S$addsub0001> created at line 92.
    Summary:
   inferred   1 Counter(s).
   inferred  84 D-type flip-flop(s).
   inferred   2 Adder/Subtractor(s).
   inferred  17 Multiplexer(s).
Unit <Booth_Multiplier> synthesized.


=========================================================================
HDL Synthesis Report

Macro Statistics
# Adders/Subtractors                                   : 2
 17-bit adder                                          : 1
 17-bit subtractor                                     : 1
# Counters                                             : 1
 5-bit down counter                                    : 1
# Registers                                            : 4
 1-bit register                                        : 1
 17-bit register                                       : 1
 32-bit register                                       : 1
 34-bit register                                       : 1
# Multiplexers                                         : 1
 17-bit 4-to-1 multiplexer                             : 1

=========================================================================


And the following is the MAP report for the 1 bit Booth Multiplier showing the final resource utilization used when meeting a 6 ns PERIOD constraint.

Code:
Design Summary
--------------
Number of errors:      0
Number of warnings:    0
Logic Utilization:
  Number of Slice Flip Flops:            87 out of   3,584    2%
  Number of 4 input LUTs:               129 out of   3,584    3%
Logic Distribution:
  Number of occupied Slices:             93 out of   1,792    5%
    Number of Slices containing only related logic:      93 out of      93 100%
    Number of Slices containing unrelated logic:          0 out of      93   0%
      *See NOTES below for an explanation of the effects of unrelated logic.
  Total Number of 4 input LUTs:         129 out of   3,584    3%
  Number of bonded IOBs:                 68 out of     195   34%
  Number of BUFGMUXs:                     1 out of      24    4%


The synthesis report for the 2-bit Booth multiplier shows the expected growth in the product register, needed to guard against arithmetic overflow when adding/subtracting the ±2*M terms into the partial product.

Code:
=========================================================================
HDL Synthesis Report

Macro Statistics
# Adders/Subtractors                                   : 4
 18-bit adder                                          : 2
 18-bit subtractor                                     : 2
# Counters                                             : 1
 5-bit down counter                                    : 1
# Registers                                            : 5
 1-bit register                                        : 2
 18-bit register                                       : 1
 32-bit register                                       : 1
 34-bit register                                       : 1
# Multiplexers                                         : 1
 18-bit 8-to-1 multiplexer                             : 1

=========================================================================


and its MAP report

Code:
Design Summary
--------------
Number of errors:      0
Number of warnings:    0
Logic Utilization:
  Number of Slice Flip Flops:            87 out of   3,584    2%
  Number of 4 input LUTs:               183 out of   3,584    5%
Logic Distribution:
  Number of occupied Slices:            136 out of   1,792    7%
    Number of Slices containing only related logic:     136 out of     136 100%
    Number of Slices containing unrelated logic:          0 out of     136   0%
      *See NOTES below for an explanation of the effects of unrelated logic.
  Total Number of 4 input LUTs:         183 out of   3,584    5%
  Number of bonded IOBs:                 68 out of     195   34%
  Number of BUFGMUXs:                     1 out of      24    4%


The reports for the 4-bit Booth multiplier follow:

Code:
=========================================================================
HDL Synthesis Report

Macro Statistics
# Adders/Subtractors                                   : 16
 20-bit adder                                          : 8
 20-bit subtractor                                     : 8
# Counters                                             : 1
 5-bit down counter                                    : 1
# Registers                                            : 5
 1-bit register                                        : 2
 20-bit register                                       : 1
 32-bit register                                       : 1
 36-bit register                                       : 1
# Multiplexers                                         : 1
 20-bit 32-to-1 multiplexer                            : 1

=========================================================================

Code:
Design Summary
--------------
Number of errors:      0
Number of warnings:    0
Logic Utilization:
  Number of Slice Flip Flops:            87 out of   3,584    2%
  Number of 4 input LUTs:               647 out of   3,584   18%
Logic Distribution:
  Number of occupied Slices:            362 out of   1,792   20%
    Number of Slices containing only related logic:     362 out of     362 100%
    Number of Slices containing unrelated logic:          0 out of     362   0%
      *See NOTES below for an explanation of the effects of unrelated logic.
  Total Number of 4 input LUTs:         647 out of   3,584   18%
  Number of bonded IOBs:                 68 out of     195   34%
  Number of BUFGMUXs:                     1 out of      24    4%

_________________
Michael A.


PostPosted: Mon Sep 03, 2012 7:24 pm

Joined: Mon Apr 23, 2012 12:28 am
Posts: 760
Location: Huntsville, AL
Just another note on the Booth Multipliers.

My initial objective was an FPGA-vendor-independent multiplier that would synthesize and run at 125 MHz or better in a Virtex-5 SX50T part. The implementations of the 1x, 2x, and 4x Booth multipliers all met this objective. In reaction to Big Ed's enquiry regarding the resources utilized and the loss of performance that may be expected as the number of bits processed by the Booth multiplication algorithm increases, I went back and ran each multiplier for the case where its inputs are 16-bit 2sC integers and the output is a 32-bit 2sC integer. An earlier post summarized the results of this effort.

The simple results indicate that Big Ed's expectations are correct insofar as the resources required increase and the achievable speed decreases as the number of bits processed by the multiplication algorithm increases. This knowledge can be used to temper the desire to increase the number of bits processed by a software implementation: the look-up tables which drive such an implementation, and the multi-bit shifts required, may not provide an improvement in performance relative to the fundamental, 1-bit-at-a-time Booth algorithm.

The increased resource utilization that a HW implementation must deal with is a trade-off that must be made as well. In the problem that I was concerned about, the operands were 40-bit fixed-point 2sC integers. A 1-bit algorithm would impose a 40-clock delay in 93 slices, a 2-bit implementation imposes a 20-clock delay in 136 slices, and a 4-bit algorithm imposes a 10-clock delay in 362 slices. In relative terms, the speed improves by a factor of 2 for each doubling of the number of bits processed, but resource utilization increases by a factor of 1.46x when moving from 1 to 2 bits, and by a factor of 3.89x when moving from 1 to 4 bits. Since the synthesized logic met the overall performance requirement of 125 MHz, and the 4-bit algorithm did not use a significant portion of the FPGA resources, the 4-bit algorithm was used.

Unfortunately for me, Big Ed's enquiry forced me to review the synthesis results more closely. Clearly, the 5x increase in LUTs between the 1- and 4-bit algorithms does not fall into the reasonable-results category. It is natural to expect the number of LUTs to increase when moving from 1 to 4 bits, but the increase between the two implementations is out of alignment with expectations. First, three additional sign-extension guard bits are required for the 4-bit algorithm compared to one sign-extension guard bit for the 1-bit algorithm; this is not even an issue for a 16x16 multiplier. Second, the 1-bit algorithm requires just one adder/subtractor to implement the partial sums, and the 4-bit algorithm requires just two such adders/subtractors. The increase from 17 to 20 bits is also not an issue between the two algorithms.

Thus, there must be something wrong with the specification of these two multipliers. Their testbenches check that the products they generate are correct, and the real-world application in which the 4-bit multiplier was used also confirms their ability to correctly multiply two signed integers. So I looked more closely at the synthesis results for all three of the multipliers. I found that the 1-bit algorithm was using two adders instead of the one I expected: one adder added the multiplicand to the partial product, and the second subtracted the multiplicand from the partial product. In the 4-bit algorithm, I found that synthesis generated 8 adders and 8 subtractors instead of the expected number: 2 adders.

Clearly, this is the reason why the LUTs used increase so markedly between the two multipliers. In both of these implementations, I used a case statement to unambiguously define the add/subtract operations of the multiplicand into the partial product register. The synthesizer clearly recognizes the multiplexer implied by the case statement, and it optimizes the number of terms in the case statement. However, it is unable to transform the basic structure implied by the case statement, which is a set of adders/subtractors followed by a multiplexer, into the complementary structure: a set of multiplexers (involving the various multiplicand products 0, ±1M, ±2M, ±4M, and ±8M) followed by one or two adders.
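The restructuring can be illustrated with the radix-2 recoding (a sketch with assumed names, not the actual source): rather than placing the arithmetic inside each arm of the case statement, the case selects only the operand and the add/subtract control, and a single shared adder/subtractor follows the multiplexer.

Code:
module booth_pp_select (
    input  wire signed [16:0] P,     // current partial sum
    input  wire signed [16:0] M,     // sign-extended multiplicand
    input  wire        [1:0]  booth, // {Q0, Q-1} recoding bits
    output reg  signed [16:0] S      // next partial sum (before the shift)
);
    // Original style (each arm can infer its own adder or subtractor):
    //   case (booth)
    //       2'b01:   S = P + M;
    //       2'b10:   S = P - M;
    //       default: S = P;
    //   endcase

    // Complementary structure: multiplex the operand first, then share one adder/subtractor.
    reg signed [16:0] operand;
    reg               subtract;

    always @(*) begin
        case (booth)
            2'b01:   begin operand = M;     subtract = 1'b0; end
            2'b10:   begin operand = M;     subtract = 1'b1; end
            default: begin operand = 17'd0; subtract = 1'b0; end
        endcase
        S = subtract ? (P - operand) : (P + operand);
    end
endmodule

For the 4-bit version the same idea applies, with the multiplexers selecting among the 0, ±1M, ±2M, ±4M, and ±8M terms and one or two adders combining them.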

I therefore optimized the implementations of these two multipliers by making the complementary structure explicit in the source. The result of this optimization effort yields significant reductions in resources for both multipliers, and also significantly improves reported clock speed. The following table combines the previous results with those obtained using the optimized algorithms.
Code:
-------------------------------------------------
Algorithm       |   1x  |  1xA  |   4x  |  4xA  |
-------------------------------------------------
FFs Used        |   87  |   87  |  87   |   89  |
LUTs Used       |   129 |   62  |  647  |   168 |
Slices Used     |   93  |   70  |  362  |   115 |
-------------------------------------------------
Constraint      |  6 ns |  5 ns |  10ns |  8 ns |
Reported Speed  |  5.72 |  4.99 |  9.96 |  7.91 |
-------------------------------------------------

It is clear from the table that there is a 2:1 LUT reduction between the two 1x algorithms, and a 4:1 LUT reduction between the two 4x algorithms. Each of the optimized algorithms also shows a marked improvement in the reported operating clock period. The clock period improvements are not as great as those obtained for LUT utilization, which indicates that the performance of these circuits is approaching the fundamental limits of the FPGA family used in the study: the Xilinx Spartan-3AN.

_________________
Michael A.


PostPosted: Tue Sep 04, 2012 6:44 pm

Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10799
Location: England
Thanks for the update!

On the face of it, your tables don't show any tradeoff (things just get bigger and slower), but on careful reading we are to understand that you will ultimately pipeline the design. At that point, we'll see different numbers of FFs according to the number of pipeline stages. (Is this right?)

Also, we note that the 40-clock design has a shorter critical path, so the 10-clock design wouldn't be 4x faster, if the multiplier is the overall critical path. (Is that right?)

Cheers
Ed


PostPosted: Tue Sep 04, 2012 11:37 pm

Joined: Mon Apr 23, 2012 12:28 am
Posts: 760
Location: Huntsville, AL
BigEd wrote:
On the face of it, your tables don't show any tradeoff (things just get bigger and slower),
On the contrary, the 2.71x increase in resources results in a 2.52x improvement in performance: (40 * 4.99) / (10 * 7.91) = 199.6 / 79.1 = 2.52. Although the maximum operating frequency decreases from 200 MHz to 125 MHz, the 37.5% decrease in clock speed is compensated for by the smaller number of cycles required to complete the operation.
BigEd wrote:
you will ultimately pipeline the design. At that point, we'll see different numbers of FFs according to the number of pipeline stages. (Is this right?)

It was not my plan to pipeline the design. Instead, I would consider increasing the number of functional units required. If pure performance is an issue, then I would just abandon the vendor independence strategy, and use the built-in multipliers. On the other hand, there are many applications where reducing the clock speed and improving multiplication speed with an iterative solution is more advantageous. In many circumstances, the operands required are not in the registers, and must be fetched from memory. If memory is multi-cycle and the multiplier is independent from the load/store operations, then the multiplication operation can be completed while the next operands are simultaneously fetched from memory. The lower operating speed results in lower power, and most likely provides a better match to the external memory's read/write characteristics.
BigEd wrote:
Also, we note that the 40-clock design has a shorter critical path, so the 10-clock design wouldn't be 4x faster, if the multiplier is the overall critical path. (Is that right?)

There is a fundamental speed limit in the FPGA, and the 1x multiplier is operating near that limit. To achieve the same performance as the 4x multiplier at this 200 MHz rate, three 1x iterative multipliers would be required (with a bit better overall performance), but the resources required would exceed those of a single 4x iterative multiplier running at a lower operating frequency, and hence lower power.

Furthermore, as a system design grows beyond a simple function such as a multiplier, the limits of the FPGA impose operating frequency limits that tend toward the frequency limit of the 4x multiplier. As you, EEyE, Arlet, Windfall, myself and others on this forum have demonstrated, a fully functional 6502 FPGA implementation is limited to around 100 MHz (Spartan-3AN) operation. Under those conditions, a 4x iterative Booth multiplier will outperform the 1x iterative Booth multiplier every time. A 4x Booth multiplier for 8-bit integers would complete the 16-bit product in two clock cycles, which would be infinitely more desirable than the 8 cycles needed by the 1x Booth multiplier.

_________________
Michael A.


PostPosted: Wed Sep 05, 2012 6:12 am

Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10799
Location: England
Ah, I see: it's not pipelined, but it is a multi-cycle operation. So something like this:

Code:
-------------------------------------------------
Algorithm       |   1x  |  1xA  |   4x  |  4xA  |
-------------------------------------------------
FFs Used        |   87  |   87  |  87   |   89  |
LUTs Used       |   129 |   62  |  647  |   168 |
Slices Used     |   93  |   70  |  362  |   115 |
-------------------------------------------------
Constraint      |  6 ns |  5 ns |  10ns |  8 ns |
Reported Speed  |  5.72 |  4.99 |  9.96 |  7.91 |
-------------------------------------------------
Cycles          |    40 |    40 |    10 |    10 |
Min Time (ns)   | 228.8 | 199.6 |  99.6 |  79.1 |
-------------------------------------------------
100MHz Time (ns)| 400.0 | 400.0 | 100.0 | 100.0 |
-------------------------------------------------

Where 'Min Time' assumes the multiplier is critical to the clock speed. As you say, if the best cycle time for other reasons is 10 ns, the 4x looks better and gives the expected algorithmic speed-up.
Cheers
Ed


PostPosted: Wed Sep 05, 2012 10:55 am

Joined: Mon Mar 02, 2009 7:27 pm
Posts: 3258
Location: NC, USA
Please let me interject a question so I can try to understand. Is this multiplier project for older families of FPGAs, or even CPLDs, without built-in multipliers? I ask because the Spartan-3AN FPGA was mentioned, which has built-in multipliers; the XC4000XL was also mentioned, and those do not have them.
Is there an advantage to using the Booth multiplication algorithm over the built-in multipliers?
Also, do '1x' and '4x' mean 1-bit and 4-bit?
I would like to see the code if it's not too long, and if it's not proprietary.

_________________
65Org16:https://github.com/ElEctric-EyE/verilog-6502

