ttlworks wrote:
If there isn't enough RAM in the FPGA, external RAM is required, and external RAM
probably would be slower than the CPU, so "CPU cache" might become a topic...
a topic where I have little to no knowledge.
If nothing else, someone could do it in a hybrid fashion: keep part of the address space in BRAM and put the rest in external RAM. You'd need to code address decoders into the FPGA to route each access, though this could become confusing or create complications unless one is careful.
One could have 20 address lines internally if you have 512 KB of BRAM (2^19 bytes, so the internal half needs only A0-A18), and provide 19 of them to the outside world to connect the other memory. However, unless you plan on shadowing any memory-mapped devices, you'd likely need A19 (or at least /A19, since an inverter is practically free in an FPGA) to mute the external memory when an external device needs to access the low internal memory.
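Just to make that decode concrete, here's a minimal C sketch of the logic (the signal names and the 1 MB split are my assumptions; in the fabric this is only a gate or two):

[code]
#include <stdint.h>
#include <stdbool.h>

/* Hypothetical decode for a 1 MB space: A19 = 0 selects the 512 KB of
   internal BRAM, A19 = 1 selects the external RAM on A0..A18.
   Chip enables are active-low, as is usual for SRAM parts. */
typedef struct {
    bool bram_ce_n;   /* internal BRAM enable, active low */
    bool ext_ce_n;    /* external RAM enable, active low  */
} decode_t;

decode_t decode(uint32_t addr)
{
    bool a19 = (addr >> 19) & 1;
    decode_t d;
    d.bram_ce_n = a19;    /* low half  -> BRAM         */
    d.ext_ce_n  = !a19;   /* high half -> external RAM */
    return d;             /* /A19 is what mutes the external part
                             during accesses to the low half      */
}
[/code]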
If the external memory is too slow, then you'd likely need your processor to have a halt or data-ready line for dealing with it (particularly during a cache miss, if you have a cache). That way the CPU can wait on the slow memory while still running at full speed out of BRAM. One way to do that would be a counter that starts when either control signal goes active and asserts ready when the wired/jumpered count is reached. To be more granular, one could clock the ready counter faster than the CPU in case you need only a half cycle longer, though I guess the stall would still round up to a whole CPU cycle: just because the count is reached doesn't mean the CPU can continue at that instant; the CPU clock is the deciding factor.
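Here's one way to model that ready counter in C, ticked once per ready-clock edge (WAIT_COUNT stands in for the wired/jumpered value; the names are mine):

[code]
#include <stdint.h>
#include <stdbool.h>

#define WAIT_COUNT 3   /* wired/jumpered number of wait states (assumed) */

static uint8_t count;

/* Call once per ready-clock edge. 'cs_active' is true while either
   external control signal (read or write strobe) is asserted.
   Returns the RDY line: the CPU stalls while this is false, and only
   resumes on its own next clock edge once the count is reached. */
bool tick(bool cs_active)
{
    if (!cs_active) {
        count = 0;             /* idle: hold the counter in reset */
        return true;
    }
    if (count < WAIT_COUNT)
        count++;
    return count >= WAIT_COUNT;
}
[/code]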
Yes, caches can get messy, and coherence can be a problem. Too large a cache can give diminishing returns if too much time is spent in the cache logic. One could have a simple read buffer that is flushed whenever /WE goes active, for safety; a little more sophisticated would be to monitor for writes that contradict what is in the read buffer and either invalidate or replace those entries. A write cache would offer more performance gains, but would require more caution: it doesn't reduce bus activity as the read buffer does, it only defers it. If you use both, then a read following a write to the same address would likely need to force a flush of that write and an update of the read buffer at that address.
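As a sketch of the "monitor for writes" variant, here's a small direct-mapped read buffer with write-through snooping in C (the size, the names, and the byte-wide bus are all my assumptions):

[code]
#include <stdint.h>
#include <stdbool.h>

#define LINES    16u      /* small direct-mapped read buffer (assumed) */
#define MEM_SIZE 65536u   /* stand-in for the slow external RAM       */

typedef struct {
    bool     valid;
    uint32_t addr;
    uint8_t  data;
} line_t;

static line_t  buf[LINES];
static uint8_t ext_mem[MEM_SIZE];

static uint8_t mem_read(uint32_t a)             { return ext_mem[a % MEM_SIZE]; }
static void    mem_write(uint32_t a, uint8_t d) { ext_mem[a % MEM_SIZE] = d; }

uint8_t cached_read(uint32_t addr)
{
    line_t *l = &buf[addr % LINES];
    if (l->valid && l->addr == addr)
        return l->data;        /* hit: no external cycle needed */
    l->valid = true;           /* miss: fetch and remember      */
    l->addr  = addr;
    l->data  = mem_read(addr);
    return l->data;
}

/* Snoop every write: a contradicted entry is replaced rather than
   invalidated, which keeps it useful for a later read of the same
   address. Write-through, so nothing is deferred here. */
void snooped_write(uint32_t addr, uint8_t data)
{
    line_t *l = &buf[addr % LINES];
    if (l->valid && l->addr == addr)
        l->data = data;
    mem_write(addr, data);
}
[/code]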
Caching can complicate life for other things. If you use DMA, you may need a signal to tell other devices when they can safely access the memory, so the write cache has time to flush before a device uses the RAM. However, that could be a problem if the DMA access is time-critical. And when the CPU gets the memory back, it may need to invalidate the read cache in case the device changed the contents of the cached addresses. Plus, I don't think this strategy could be used with cycle-stealing, since every other access could theoretically risk coherence. So it might not be good if you are building an ultra-fast Commodore 64. An Atari 8-bit clone may be more forgiving, though, and the altered cache behavior could be limited to the range that ANTIC uses (not sure about POKEY).
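The handover could be as simple as this C sketch (flush_write_cache(), invalidate_read_cache(), and mem_safe are hypothetical names for the cache controls and for the signal to the other device):

[code]
#include <stdbool.h>

static bool mem_safe;   /* tells the DMA device it may touch memory */

static void flush_write_cache(void)     { /* drain deferred writes */ }
static void invalidate_read_cache(void) { /* clear all valid bits  */ }

/* Device asks for the bus: drain the write cache first, then raise
   the 'safe' signal. The flush time is why this is awkward for
   time-critical DMA or cycle-stealing. */
void dma_request(void)
{
    flush_write_cache();
    mem_safe = true;
}

/* Device is done: drop the signal and assume it may have changed
   anything, so the read cache contents can no longer be trusted. */
void dma_done(void)
{
    mem_safe = false;
    invalidate_read_cache();
}
[/code]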
Another thing of note is that FPGA block RAM is often 9 bits wide. Perhaps the extra bit could be used as tag RAM, or for parity/ECC purposes, as sketched below. Or, one could make a 9-bit (or 18-bit) CPU if you want to only use BRAM.
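If the ninth bit were used for error checking, note that one bit per byte only gets you parity (detection, not correction; correction would need grouping the spare bits across a wider word). A parity check in C might look like this (the names are mine):

[code]
#include <stdint.h>
#include <stdbool.h>

/* XOR of all eight data bits; storing this as the ninth BRAM bit on
   write makes the total number of ones even (even parity). */
bool parity8(uint8_t b)
{
    b ^= b >> 4;
    b ^= b >> 2;
    b ^= b >> 1;
    return b & 1;
}

/* On read: true if the stored ninth bit still matches the data.
   A single parity bit flags one-bit errors but cannot fix them. */
bool check_byte(uint8_t data, bool stored_parity)
{
    return parity8(data) == stored_parity;
}
[/code]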