Ed:
I appreciate you linking your code for me in your last post. I had this nice reply in the works and then I decided to click through to your code for a second look, and when I got back, my tome was lost. So I will try and reconstruct it as best I can.
Your code is very clean, and clearly follows the structure and style of Arlet Ottens. Like most members of this forum, I am also working on an FPGA implementation of the WDC 65C02. I have currently completed one core and am working on my second. My general objective is a faithful reproduction of the WDC 65C02 instruction set, but not necessarily of the instruction cycle time. Actually, my objective in that regard is to reduce the number of clock cycles required for most instructions without resorting to a wider data bus. My approach to the implementation of the core also differs from those that others, such as yourself and Arlet Ottens, have posted. That is, I use a microprogrammed instruction sequencer.
My first core was intended to be used with a single cycle asynchronous memory such as can be synthesized from Xilinx LUTs. It assumes that the address is output and the data is returned on the same cycle. At the performance that I was targeting, 100 MHz in a Xilinx XC3S200AN-5, this approach is not particularly practical since any reasonable program would require too many of the free LUTs and would not provide a sufficiently large memory for anything but toy applications. My second core is a more complete implementation with BRAM included in the core. The difference in the access method has caused me to rethink some of the asynchronous logic that I used for single cycle address calculation in the first core. Without considering the delays in the address output path or the input data path delays (as is the case in functional simulation), my first attempt is complete. But it will never work in a practical system. Thus, the second core has been derived from the first core, and it deals with these practical considerations. With respect to the second core, I have completed its re-implementation and reworked the microprogram's control fields, but I've not yet begun to debug the microprogram. Given the amount of time that I have available to pursue this hobby, it's going to be several weeks before that task can be completed.
Back to your core. You and Arlet are using a one-hot state machine methodology for the instruction sequencer. I've not been much of a fan of the one-hot state machine approach for a number of reasons, although I often use one-hot control fields. However, the performance and resource utilization that Arlet achieved (as you detailed in an earlier post comparing various cores), coupled with the cleanliness of the implementation, has convinced me to put some effort into studying the methodology once I've completed my second core. It appears that the base design uses the register file to provide the AI input to the ALU module. I have used the LUT RAMs in the Xilinx FPGA in this manner for many years. The inherent multiplexer of the RAM is the fastest way to improve the operating speed of a Xilinx FPGA.
It also appears that the register file is implemented as a single port RAM and that the core does not use multiple, independent adders to provide parallel computation of addresses and/or ALU results. Therefore, I am going to suggest expanding the register file so that you can also use it for temporary storage instead of wiping out your S storage location. For the Xilinx FPGA that you are using, any use of a LUT as a single or dual port RAM will always make available a minimum of 16 "registers". Since your register file only uses four of them, the synthesizer is currently tying off the two (for a 16-deep LUT RAM) or four (for a 64-deep LUT RAM) most significant address lines. (I am sure that you are aware that Xilinx FPGAs older than the Spartan 6 and Virtex 7 families employ 4 input LUTs, while these two FPGA families employ 6 input LUTs.) By using those otherwise idle locations, you can place your B register into the register file, along with any other registers or temporary values that you may need. I don't think that this suggestion applies to the shift count/direction register, but I've not spent any time exploring the implementation or the instruction set that you are presently implementing with your core.
Once again, thanks for the link to your code.
BTW, you need not modify the present definition of the regfile address/select variable: "regsel". You can simply extend it using a bit vector concatenation, {extreg, regsel}, where "extreg" is declared as a 2-bit or a 4-bit vector (depending on the FPGA family that you are using). If you fail to include the expanded RAM select vector when addressing the register file, then Verilog's default behavior will left-fill the RAM address with 0s, and the register file will behave like your current implementation: a four location RAM. (A synthesizer or PAR warning should be issued, and it will indicate that the RAM address is zero filled because it's not completely specified.)
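To make the suggestion concrete, here is a minimal sketch of what I have in mind, assuming a 4 input LUT part (16-deep LUT RAM, so "extreg" is 2 bits). Only "regsel" and "extreg" come from the discussion above; the module name, the port names, and the idea of treating bank 0 as your existing four registers are my own assumptions for illustration, not something taken from your code.

```verilog
// Hypothetical sketch only: regfile16, clk, we, din, and dout are assumed names.
// Assumes a 16-deep LUT RAM, so the full address is 4 bits wide.
module regfile16 (
    input  wire       clk,
    input  wire       we,      // write enable
    input  wire [1:0] extreg,  // extension bits; 2'b00 selects the existing four registers
    input  wire [1:0] regsel,  // existing 2-bit register select, left unchanged
    input  wire [7:0] din,     // write data
    output wire [7:0] dout     // read data (feeds the ALU input)
);
    // 16 x 8 storage array; with an asynchronous read this is the usual
    // pattern that synthesizers map onto distributed (LUT) RAM.
    reg [7:0] regs [0:15];

    // Extended address: extension bits on the left, original select on the right.
    wire [3:0] addr = {extreg, regsel};

    // Synchronous write
    always @(posedge clk)
        if (we)
            regs[addr] <= din;

    // Asynchronous read
    assign dout = regs[addr];
endmodule
```

With extreg held at 2'b00 the file behaves like the present four location RAM, and the other twelve locations become available for B or for temporaries. On a 6 input LUT part the same idea would apply with a 4-bit extreg and a 64-deep array.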