The 40/10 cycles referred to my initial problem on the Virtex 5 FPGA, which used a 40-bit fixed-point integer. For the comparative analysis I chose a 16-bit format, so the 1x algorithm requires 16 cycles to complete the product and the 4x algorithm requires only 4. I have adjusted the table from your earlier post accordingly. If the constraint line is assumed to be the operating clock period, the 4x algorithm still comes out ahead in spite of the shorter path delay of the 1x algorithm. Interestingly, slowing the clock down and processing more bits per iteration speeds up the overall operation. In the table, a speedup of 3 (96 ns down to 32 ns) is achieved by slowing the clock from 166.67 MHz to 125 MHz and switching from the 1x algorithm (the original single case statement implementation) to the 4x algorithm in its optimized two-adder form (the 4xA column).
---------------------------------------------------------
Algorithm           |   1x   |  1xA   |   4x   |  4xA   |
---------------------------------------------------------
FFs Used            |   87   |   87   |   87   |   89   |
LUTs Used           |  129   |   62   |  647   |  168   |
Slices Used         |   93   |   70   |  362   |  115   |
---------------------------------------------------------
Constraint (ns)     |   6    |   5    |   10   |   8    |
Reported Speed (ns) |  5.72  |  4.99  |  9.96  |  7.91  |
---------------------------------------------------------
Cycles              |   16   |   16   |   4    |   4    |
Min Time (ns)       |   96   |   80   |   40   |   32   |
---------------------------------------------------------
100MHz Time (ns)    |  160   |  160   |   40   |   40   |
---------------------------------------------------------
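The timing rows of the table are simple arithmetic: cycles = operand width / bits per iteration, and time = cycles x clock period. A minimal Python sketch of that calculation follows. The names (`WIDTH`, `variants`, `cycles`, `product_time_ns`) are made up for illustration, and the periods are the constraint values from the table, assumed to be the operating clock period.

```python
import math

WIDTH = 16  # operand width used in the comparison (bits)

# variant name -> (bits processed per iteration, constraint period in ns)
# Values copied from the table; the dictionary itself is illustrative.
variants = {
    "1x":  (1, 6.0),
    "1xA": (1, 5.0),
    "4x":  (4, 10.0),
    "4xA": (4, 8.0),
}

def cycles(width, bits_per_iter):
    """Iterations needed to consume every multiplier bit."""
    return math.ceil(width / bits_per_iter)

def product_time_ns(width, bits_per_iter, period_ns):
    """Time for one product if the clock runs at the given period."""
    return cycles(width, bits_per_iter) * period_ns

# "Min Time" row: each variant clocked at its own constraint.
min_time = {name: product_time_ns(WIDTH, k, p) for name, (k, p) in variants.items()}

# "100MHz Time" row: every variant clocked at a common 10 ns period.
time_100mhz = {name: product_time_ns(WIDTH, k, 10.0) for name, (k, _) in variants.items()}

# Speedup from 1x at 6 ns (166.67 MHz) to 4xA at 8 ns (125 MHz).
speedup = min_time["1x"] / min_time["4xA"]

for name in variants:
    print(f"{name:>3}: min {min_time[name]:5.1f} ns, at 100 MHz {time_100mhz[name]:5.1f} ns")
print(f"speedup 1x -> 4xA: {speedup:.1f}")
```

The same `cycles` function gives 40 and 10 for a 40-bit operand, which is where the 40/10 figures come from.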
It can be just aesthetics, but it is also about supporting multipliers in FPGAs that lack the built-in capability, such as the older Altera 10k; the Xilinx 4k, 5200, Spartan, Spartan XL, and Spartan II; the older Actel anti-fuse FPGAs; etc. There is no real reason not to use the built-in multipliers in the Spartan 3AN or Spartan 6 families. They are, however, only 18x18 multipliers, and several of them have to be combined to form 32x32 or 40x40 multipliers. In addition, they are generally located next to the embedded block RAMs and generally restrict the use of those block RAMs, because their inputs share the address lines of the RAMs. In most designs, this is not an issue. The multiply-accumulate (MAC) structure that they implement is generally tied to some real-time data stream that would not naturally be stored in the attached block RAM. Thus, you may trade some block RAM for a single-cycle MAC function that allows you to build a really fast digital filter, signal transform, etc.
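To illustrate what combining 18x18 multipliers into a wider one involves, here is a minimal Python sketch of the usual partial-product decomposition. It is a behavioral model only, not vendor IP and not any particular device's wiring. The function names and the 17-bit limb width are assumptions: the limb width is a parameter, and 17 is chosen on the assumption that the hard multipliers are signed 18x18, which leaves 17 bits for an unsigned limb.

```python
def split_limbs(x, width, limb_bits):
    """Split an unsigned `width`-bit value into little-endian limbs."""
    n_limbs = -(-width // limb_bits)  # ceiling division
    mask = (1 << limb_bits) - 1
    return [(x >> (i * limb_bits)) & mask for i in range(n_limbs)]

def wide_multiply(a, b, width=40, limb_bits=17):
    """Unsigned width x width product built from limb x limb products.

    Returns (product, number_of_small_multiplies). Each small multiply
    stands in for one use of a narrow hardware multiplier.
    """
    a_limbs = split_limbs(a, width, limb_bits)
    b_limbs = split_limbs(b, width, limb_bits)
    total = 0
    count = 0
    for i, ai in enumerate(a_limbs):
        for j, bj in enumerate(b_limbs):
            # Weight each partial product by the position of its limbs.
            total += (ai * bj) << ((i + j) * limb_bits)
            count += 1
    return total, count

a = (1 << 40) - 1
b = 0x123456789A
product, n_mults = wide_multiply(a, b)
print(product == a * b, n_mults)
```

Under these assumptions a 40x40 product takes nine small multiplies and a 32x32 product takes four, plus the adders that sum the shifted partial products.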
In my situation, the task is to preclude design obsolescence over a large number of years. Once the components are obsolete, porting the design from one multiplier architecture to another may or may not be an issue, but having all of the design components in our own RTL and under our control provides an easier path to the inevitable redesign. In addition to component obsolescence, tool obsolescence is also a big consideration. Building 40x40 multipliers from the built-in 18x18 multipliers is generally done with the vendor's IP generator, and that has a stability of 0 years. (I expect that the competition between Altera and Xilinx for the fastest and biggest FPGA will eventually turn the built-in multipliers into floating-point devices, but that is likely a few years out.)
You are correct that the notation 1x and 4x refers to the number of bits that the algorithm processes on each iteration. I expect to release the code for these four (or six) modules shortly on GitHub. I am considering using Hanson's license (see LCC license), but have not yet contacted him for permission to plagiarize (copy and modify) the license in the LCC GitHub repository. If that permission is not granted, then I'll likely release the source under the LGPL license I used for the MAM65C02 and MiniCPU Serial ALU releases.
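To make the 1x/4x notation concrete, here is a minimal Python model of an iterative multiplier that consumes a configurable number of multiplier bits per iteration. It is a plain unsigned shift-and-add sketch written for illustration, not the modules mentioned above, which may handle signed operands or use a different recoding. The function and parameter names are hypothetical.

```python
def shift_add_multiply(a, b, width=16, bits_per_iter=1):
    """Unsigned iterative multiply; returns (product, iterations).

    Each loop pass stands in for one clock cycle of a sequential
    multiplier that examines `bits_per_iter` bits of `b`.
    """
    if width % bits_per_iter != 0:
        raise ValueError("width must be a multiple of bits_per_iter")
    mask = (1 << bits_per_iter) - 1
    product = 0
    iterations = 0
    for shift in range(0, width, bits_per_iter):
        digit = (b >> shift) & mask      # next group of multiplier bits
        # In hardware, a * digit would be a small multiple of `a`
        # formed with adders rather than a real multiplier.
        product += (a * digit) << shift
        iterations += 1
    return product, iterations

p1, n1 = shift_add_multiply(0xBEEF, 0x1234, width=16, bits_per_iter=1)
p4, n4 = shift_add_multiply(0xBEEF, 0x1234, width=16, bits_per_iter=4)
print(p1 == p4 == 0xBEEF * 0x1234, n1, n4)
```

With a 16-bit operand the 1x setting takes 16 iterations and the 4x setting takes 4, at the cost of more logic per iteration to form the larger multiples.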