Rob:
I second Garth's welcome. It's good to have you join the discussions here again. I have synthesized your BC6502 core to compare against mine. Perhaps you'd share with us some of your experiences with your core, since it preceded my efforts and those of several others. For example, did you have the opportunity to apply it in a commercial project?
On the subject of your post:
Rob Finch wrote:
Using an i-cache there's the opportunity to eliminate some of the opcode fetches by reading multiple bytes of the opcode all at once. Using a triple ported cache can read PC,PC+1 and PC+2 bytes all at the same time. This means the i-fetch states of the core has to be modified. But if we're going for performance we might as well use a 24 bit instruction port and modify the state machine. Just thinking.
Caveat: the following applies to an 8-bit 6502-like processor.
I agree in general. However, I disagree with the recommended implementation: a 3-port cache. I have looked at implementing internal memory in this style, but limiting it to the dual-port nature of the built-in block RAMs. In my opinion, extending the native number of ports of the internal block RAMs to support additional ports adds more complexity than is needed. (Note: Xilinx and others have demonstrated overclocking the dual-port interface of the block RAMs to generate quad-port memories. However, once we start talking about overclocking an interface or peripheral, the discussion quickly degenerates into an argument about whether the core operates in one, two, or more cycles. So let's keep the discussion focused on the concepts and not the detailed implementation.)
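For reference, here is a minimal Verilog sketch of the kind of true dual-port block RAM the synthesis tools will infer natively (the module name, port names, and widths are mine, purely illustrative); anything beyond these two ports has to be built from multiple primitives or by overclocking:

Code:
module dp_bram #(
    parameter AW = 10,                   // address width: 1K x 8 fits one block RAM
    parameter DW = 8                     // data width
)(
    input               clk,
    // Port A: read/write
    input               we_a,
    input  [AW-1:0]     addr_a,
    input  [DW-1:0]     din_a,
    output reg [DW-1:0] dout_a,
    // Port B: read-only in this sketch
    input  [AW-1:0]     addr_b,
    output reg [DW-1:0] dout_b
);
    reg [DW-1:0] ram [0:(2**AW)-1];

    always @(posedge clk) begin          // port A
        if (we_a) ram[addr_a] <= din_a;
        dout_a <= ram[addr_a];           // read-first behavior
    end

    always @(posedge clk)                // port B
        dout_b <= ram[addr_b];
endmodule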
With a processor that has variable-length instructions, instruction fetches will inevitably cross a buffer/cache line boundary. So even with a 16-bit dual-ported, 24-bit triple-ported, or 32-bit quad-ported memory/cache interface, there's a very high likelihood that the instruction stream will cross a memory/cache line boundary. In a cache, the line buffer multiplexer can ameliorate this issue for a time, but eventually the instruction stream will cross a line boundary, and the instruction fetch unit will have to deal with the break in the instruction stream, especially if the next byte/word is not in the following buffer/cache line. It will have to hold/buffer the part of the instruction fetched from one memory/cache segment (or restart the fetch sequence), fetch from the next buffer/cache line, and align the required data into the required instruction packet. Whether the interface is 16, 24, or 32 bits wide, this issue with the 6502 instruction stream is going to occur.
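To make the boundary condition concrete, here is a small sketch of the check the fetch unit would need. The 8-byte line size and all the names are my assumptions, not anything from an actual core:

Code:
module line_cross (
    input  [15:0] pc,                    // address of the opcode byte
    input  [1:0]  len_m1,                // instruction length minus 1 (0..2)
    output        crosses                // 1 => instruction spans two lines
);
    // With 8-byte buffer/cache lines, the instruction crosses a line
    // whenever offset-within-line + (length - 1) carries out of 3 bits.
    wire [3:0] last = {1'b0, pc[2:0]} + {2'b00, len_m1};
    assign crosses = last[3];
endmodule

Whenever crosses is asserted, the fetch unit must hold the bytes it already has and start a fetch of the next line before it can deliver the complete instruction.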
Instructions in the 6502 instruction set are 1, 2, or 3 bytes in length. If memory is not an issue, then one solution would be to create an internal memory space in which each complete instruction would be assembled and held: 1-byte instructions would leave two of the bytes unused; 2-byte instructions would leave one byte unused; and 3-byte instructions would use all three bytes. As the instruction fetch unit requests instructions, the entire instruction is written to (as read from main memory on a buffer/cache miss) or read from (on a buffer/cache hit) the instruction buffer. In addition, either the buffer or the instruction fetch unit provides the increment value for the program counter.
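A sketch of such a buffer, again with all names and sizes chosen purely for illustration: each slot is a fixed 26 bits, i.e. a 2-bit length field (which doubles as the PC increment) plus three bytes, used or not:

Code:
module ibuf #(
    parameter AW = 9                     // 512 instruction slots
)(
    input             clk,
    input             we,               // write an assembled instruction (miss)
    input  [AW-1:0]   waddr,
    input  [1:0]      wlen,             // instruction length: 1, 2, or 3
    input  [23:0]     winst,            // {opcode, operand1, operand2}
    input  [AW-1:0]   raddr,
    output reg [1:0]  rlen,             // PC increment for this slot
    output reg [23:0] rinst
);
    reg [25:0] mem [0:(2**AW)-1];        // 2-bit length + 24-bit instruction

    always @(posedge clk) begin
        if (we) mem[waddr] <= {wlen, winst};
        {rlen, rinst} <= mem[raddr];     // whole instruction in one read
    end
endmodule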
Although I don't have any hard data on the average instruction length of a 6502 program, Klaus' 6502 functional test program (the biggest 6502 program I have access to at the moment) appears, without explicit counting, to average two bytes per instruction, or just under. If this perception of average instruction length holds, then about one third of the memory cells in this type of buffer would be unused. This penalty of underutilized block RAM storage may be worthwhile if it essentially turns most zero page and absolute direct access instructions into 1 or 2 cycle instructions instead of the normal 3 or 4 cycle instructions.
One issue with such an instruction buffer is that the addresses of the instructions it holds are not linear. I can think of a number of possibilities for aligning blocks of the instruction buffer with blocks of addresses. Flushing blocks in the event of an instruction buffer miss can probably be implemented in a manner similar to that of a standard instruction cache. Thus, there is some additional complexity related to accessing whole instructions that may not be worth the effort.
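One such possibility, sketched below entirely under my own assumptions: a small side RAM, indexed by the low bits of the PC and tagged with the remaining bits, that records which slot (if any) holds the instruction starting at a given address. A miss simply overwrites the entry, much as a direct-mapped cache would:

Code:
module islot_map #(
    parameter PW = 8,                    // low PC bits used as index
    parameter AW = 9                     // instruction buffer slot index width
)(
    input           clk,
    input           we,                  // record a newly assembled slot
    input  [15:0]   wpc,                 // instruction address being recorded
    input  [AW-1:0] wslot,               // slot it was assembled into
    input  [15:0]   rpc,                 // PC presented by the fetch unit
    output          hit,                 // 1 => buffer holds this instruction
    output [AW-1:0] slot
);
    localparam TW = 16 - PW;             // tag covers the remaining PC bits
    reg [TW+AW:0] map [0:(2**PW)-1];     // {valid, tag, slot}
    reg [TW+AW:0] ent;
    reg [15:0]    rpc_q;

    always @(posedge clk) begin
        if (we) map[wpc[PW-1:0]] <= {1'b1, wpc[15:PW], wslot};
        ent   <= map[rpc[PW-1:0]];
        rpc_q <= rpc;                    // delay PC to line up with the read
    end

    assign slot = ent[AW-1:0];
    assign hit  = ent[TW+AW] & (ent[TW+AW-1:AW] == rpc_q[15:PW]);
endmodule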
Another issue with either your approach or mine is that the basic nature of the processor has not changed: it is still an 8-bit machine internally. Thus, to accommodate the wider instruction fetch data path and to update the instruction register and any operand registers from a data stream of variable width, additional combinatorial logic may be needed in the input data path to align the instruction data with the correct internal registers. This may or may not affect the overall performance, but it will most likely increase the path delays and lower the overall clock frequency that can be achieved. Sometimes a lower operating clock frequency still provides better overall performance, and as a side benefit, it marginally reduces the dynamic power dissipation.
One other thought that I've had along these lines is to use a dual-port block RAM with two independent address buses to read indirect address operands from memory in a single cycle. One address is the address of the LSB, and the other is that address + 1. In this manner, the 8-bit nature of the instruction/data spaces is accommodated naturally, i.e. there's no need to waste storage as in the instruction buffer described above. One port is always a read/write port (the LSB port), and the other is a read-only port (the MSB port).
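A sketch of that arrangement, sized here for zero page (the module name and ports are mine):

Code:
module zp_ram (
    input            clk,
    input            we,                 // writes go through the LSB port only
    input  [7:0]     addr,               // zero page address of the pointer LSB
    input  [7:0]     din,
    output reg [7:0] lsb,                // pointer low byte, read at addr
    output reg [7:0] msb                 // pointer high byte, read at addr + 1
);
    reg [7:0] ram [0:255];

    always @(posedge clk) begin          // port A: read/write
        if (we) ram[addr] <= din;
        lsb <= ram[addr];
    end

    always @(posedge clk)                // port B: read-only at addr + 1
        msb <= ram[addr + 8'd1];
endmodule

As a bonus, the 8-bit address adder wraps from $FF to $00 for free, which matches the zero page wraparound behavior of the 6502's indirect modes.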
When used in combination with the instruction buffer, zero page indirect, {(zp); (zp,X); (zp),Y} and absolute indirect, {(abs); (abs,X)} instructions will experience a savings of at least three clock cycles per instruction. These savings and the corresponding increase in performance may be sufficient to warrant the additional effort needed to implement the instruction and data caches in the manner described here. (Note: the absolute indirect instructions, {jmp (abs) and jmp (abs,X)}, are likely to be used infrequently, so the performance benefit gained from the instruction buffer and data cache can generally be discounted for these two instructions.)
Toshi recently made a good point regarding the performance of direct mapped caches. The concepts described here are not limited to direct mapped caches, but they are certainly significantly easier to implement if a direct mapped approach is taken as a first step. The cache thrashing he described for his SH4 application, although undesirable, is possible in any cache, but it is much more apparent in a direct mapped one. Regardless of the cache implementation strategy, cache thrashing is statistically possible in any program unless all instructions and data reside in the cache. This last point should raise the question in everyone's mind: is a cache even needed for a 6502-like processor in an FPGA with at least as much block RAM as program space?
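For anyone following along, the thrashing mechanism is easy to see in the index calculation of a direct mapped cache. A toy sketch, with a 2 kB cache and 8-byte lines assumed purely for illustration:

Code:
module dm_index (
    input  [15:0] addr_a,                // two "hot" addresses
    input  [15:0] addr_b,
    output        collide                // 1 => they evict each other
);
    // 2 kB direct mapped cache, 8-byte lines => 256 lines,
    // line index = addr[10:3], tag = addr[15:11]
    wire [7:0] index_a = addr_a[10:3];
    wire [7:0] index_b = addr_b[10:3];
    // Any two addresses exactly a multiple of 2 kB apart share a line
    // and thrash, even while the rest of the cache sits idle.
    assign collide = (index_a == index_b) && (addr_a[15:11] != addr_b[15:11]);
endmodule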
In my opinion, driving the memory cycle rate of a 6502-like processor much beyond what can reasonably be achieved using internal block RAM is not very productive. As a controller, the 6502 and its kin are great devices, but the 6502 has some limitations that should raise concerns about its applicability outside its normal application domain. In its domain of 8-bit embedded systems, the 6502 is undoubtedly an outstanding processor.
As BigEd, EEyE, and others have demonstrated with the 65Org16 and 65Org32 variants, the 6502 architecture can be extended into other domains, but in doing so, the low-cost nature of the memory subsystem is lost. Because these variants have larger address spaces, wider instructions, etc., a cache subsystem tied to a large external memory like a 16 MB SDRAM is much more reasonable for them than it is for a 64 kB address space.
Toshi:
I am sure that after all of the effort compiler writers and software developers have devoted to extracting the most performance from an instruction set, they could have devoted at least some effort to static analysis of the link image for cache thrashing issues. In fact, I would have thought that the garbage collector for the Java virtual machine would have been designed to spot issues such as cache thrashing and automatically adjust the relative positions of the offending code segments; after all, it has essentially brought the machine to a standstill anyway. This is such a simple concept that it should be relatively easy to implement, since the cache design and strategy of the target is a known quantity. Hennessy and Patterson claimed that these types of problems should be left to the software tools and not implemented in the computer's hardware, i.e. the cache subsystem or the execution pipeline. After driving processor design to virtually unimaginable levels of complexity, it's time for a bit of innovation on the software development side of the house by the so-called computer scientists.