Jeff:
I am not sure I can resolve all of the points you raise in your post, but I will try. (Any list formatting does seem to be a goner when placed in a quote block.)
I'm definitely in favor of writeable control store!! But what's the best implementation? You could either add data paths to the core, or do an FPGA reboot -- both are viable ways to reload the control store. At present I lean slightly toward the FPGA reboot.
From the perspective of maintaining performance, an FPGA reboot is the most viable approach. While constructing the M65C02A core as a microcomputer, I attempted to partition the internal memory into 4 kB blocks that my rudimentary MMU could relocate as desired by the user. The demonstration project uses 4 kB of the available block RAM as microprogram memory. The remaining 28 kB is implemented as internal block RAM in three blocks: (1) 16 kB of User RAM, (2) 8 kB of User RAM/ROM, and (3) 4 kB of Monitor RAM/ROM. As it sits today, the M65C02A soft-core microcomputer project reliably and easily synthesizes, MAPs, and PARs into a Spartan-3A XC3S200A-4VQG100I FPGA with a clock period of just under 33 ns. (My target speed for the part is 29.4912 MHz, a baud rate frequency.) Splitting the 28 kB into seven independent 4 kB blocks dramatically reduces the maximum reported clock speed. I suspect that this result is driven by the significant increase in the number of bus connections that the fine-grained BRAMs require from the FPGA. The three-block configuration with which I am meeting my goals appears to hit a sweet spot for this particular FPGA family. Thus, although all BRAMs are dual-ported, and the M65C02A microprogram memories are no exception, setting up the microprogram BRAMs to be accessible as a Writable Control Store (WCS) may push the design back into the regime where the project no longer meets my performance goals.
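As a rough check of the numbers above, the timing margin and the memory split can be worked out in a few lines. This is only a sketch: the variable names are illustrative, the 33 ns figure is treated as an upper bound on the reported period, and 115200 baud is an assumed example rate that the post does not mention.

```python
# Figures quoted in the post; names are illustrative, not from the project.
TARGET_HZ = 29_491_200        # 29.4912 MHz target clock
REPORTED_PERIOD_NS = 33.0     # "just under 33 ns", used here as an upper bound
EXAMPLE_BAUD = 115_200        # assumed example baud rate (not from the post)

# Period the target clock requires, and the margin against the reported period.
target_period_ns = 1e9 / TARGET_HZ
min_fmax_mhz = 1e3 / REPORTED_PERIOD_NS
slack_ns = target_period_ns - REPORTED_PERIOD_NS

# A "baud rate frequency" divides down to the baud rate with no remainder.
baud_divisor, baud_remainder = divmod(TARGET_HZ, EXAMPLE_BAUD)

# Internal block RAM in kB, as partitioned in the demonstration project.
blocks_kb = {
    "microprogram": 4,
    "user_ram": 16,
    "user_ram_rom": 8,
    "monitor_ram_rom": 4,
}
total_kb = sum(blocks_kb.values())
# Number of 4 kB blocks if everything except the microprogram store is split up.
fine_grained_blocks = (total_kb - blocks_kb["microprogram"]) // 4

print(f"target period : {target_period_ns:.3f} ns")
print(f"minimum fmax  : {min_fmax_mhz:.2f} MHz")
print(f"timing slack  : {slack_ns:.3f} ns")
print(f"baud divisor  : {baud_divisor} (remainder {baud_remainder})")
print(f"block RAM     : {total_kb} kB total, {fine_grained_blocks} fine-grained 4 kB blocks")
```

The slack comes out at a bit under 1 ns, which illustrates how little room the three-block configuration leaves for extra routing.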
One significant issue with using a reboot of the FPGA to load new microcode is that special care must be taken in the implementation of all of the external logic while the FPGA is being rebooted. The outputs of the FPGA will float during configuration and, for a brief time between the completion of the configuration image load and the transfer of control to the new user application, may exhibit unreliable logic levels. This issue can be mitigated by monitoring the FPGA's DONE pin to hold external circuits in reset during configuration, and by ensuring that the configuration control logic releases DONE only after all configuration activities have fully completed.
Another issue with the FPGA reboot approach is that the external configuration memory must be programmed with an image first. The 32 kb required to program the microprogram memories is a far cry from the nearly 1.2 Mb required to program the entire FPGA. Partial reconfiguration of portions of an FPGA is now available in newer-generation FPGA families. I have not used this capability in the Virtex-5 family with which I am currently working, so I can't make any statements regarding its applicability for the purpose of WCS updates. Further, Virtex-5 FPGAs are much more expensive for general applications than Spartan-3A FPGAs, which is the primary reason that I am focused on the XC3S200A-4VQG100I FPGA; it can be purchased for as low as $7.00 in reasonable volumes from Avnet or similar distributors.
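To put the 32 kb versus 1.2 Mb comparison in numbers, here is a minimal sketch. It counts only the raw microprogram payload and takes the full image size as the approximate 1.2 Mb quoted above; any framing overhead in the real bit stream is not modeled, and the names are illustrative.

```python
# Sizes quoted in the post; names are illustrative.
MICROPROGRAM_KB = 4            # microprogram memory, in kB
FULL_IMAGE_BITS = 1.2e6        # "nearly 1.2 Mb" full configuration image (approximate)

# 4 kB of microprogram memory expressed in bits (the post's "32 kb").
microprogram_bits = MICROPROGRAM_KB * 1024 * 8
# Share of the full configuration image that the microcode payload represents.
fraction = microprogram_bits / FULL_IMAGE_BITS

print(f"microprogram payload : {microprogram_bits} bits")
print(f"share of full image  : {fraction:.1%}")
```

On these figures the microcode is under 3% of the image, so a full reboot rewrites far more configuration data than the change itself needs.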
This puts me in a quandary: I would like to be able to implement a classic WCS, but I would also like to keep the performance for the XC3S200A-4VQG100I FPGA at or above 30 MHz.
I'll list the cons before the pros:
Cons:
[*] an application can't load new microcode except via a reboot. IMO this limitation is trivial.
[*] there'd have to be write access provided to the configuration ROM -- not a major challenge.
[*] given that the configuration ROM includes lots of data not pertaining to microcode, you'd need to know where the microcode resides and what the format is -- I mean so you could specifically alter only the contents of the control store. But sussing the format is something that'd only have to be done once. [Edit: or is it?]
Pros:
[*] the FPGA-reload approach doesn't consume resources or complicate the HDL code, and it places no potential constraint on clock speed. IOW it allows a no-compromise core.
Xilinx provides some tools that ElEctricEyE, enso, and I have used to reload the contents of the BRAMs without rebuilding the FPGA project. I don't think that Xilinx has released the particular details of how the initialization of the BRAMs is mapped into the bit stream, which a user would need in order to write their own utility for this purpose.
Re that last point, I admit I don't have good sense of how serious the compromise is. Any comment, Michael?
I don't think the compromise is particularly serious. I am leaning toward implementing the user WCS approach on a "let's see what the performance issues are" basis. The WCS concepts used in the PDP 11/60, the IBM 360/370, etc. just appeal to me, as does being able to dynamically load a new instruction sequence into the microprogram store and access it from a user application.
I've had this concept investigated on one of the projects that I lead, and it was successful. I keep this capability out of the discussion regarding options for those projects because there's too much risk of the project becoming classified as "SW". There's been a move to classify HDL-based FPGA projects as "SW", but I resist that each and every time.
On a historical note, I recently went on a buying spree and bought several PDP 11 backplanes, processor cards, and memory cards with the objective of building up a PDP 11/83 processor. In the process, I purchased a PDP 11/03 processor card, and while researching that card, I discovered that you could develop user microcode for that family of LSI 11 processors. The WCS was on a card that was mapped into the I/O space on the LSI 11 Q-bus. The internal microcode bus was slow enough that a 40-pin ribbon cable with a DIP-40 header from the WCS card in an adjacent card slot could be plugged into one of the microcode ROM sockets of the processor chip set. (The LSI 11/03 operated at about 2-3 MHz and consisted of 3-4 40-pin chips.)
Edit: Corrected some misspelled words.
Michael A.