(Long reply. Executive summary: yes, use a cache whenever your memory bandwidth is not enough to keep your CPU running at full speed.)
(BTW, for me, before you necessarily add an MMU to a design with caches, you need something in between the caches and the memory proper - whether you call it a memory controller, or memory interface, or whatever. In other words, an MMU, which as you say may have several possible purposes, is an optional extra introducing even more questions as to what can be done and what is worth doing.)
I was just about to post about this kind of topic. I guess I can paste it here instead without disturbing your intention too much:
Caches, pipelines, microarchitecture - what's all the fuss?
Elsewhere in a discussion about a pipelined 6502 implementation with caches I was a bit grumpy, so as penance here's an attempt to explain why we sometimes find ourselves looking at CPU designs which are not straightforward clock-for-clock compatible with the original 6502. Others here also understand this territory and can surely help when I miss the mark.
The original 6502 is a fairly simple machine, with a slight degree of pipelining, intended to be a low-cost MPU connected to bytewide memory and I/O devices. It is in some ways a simplification and improvement on the 6800.
Why do we sometimes discuss and implement more complicated approaches? Why is the original not perfect for all purposes?
Performance.
Or, more usefully, increased performance at moderate cost. Why bring in cost? Because it's always a real-world consideration, and without cost restriction, we could just build a much faster version of the original - and for some purposes that's just not practical. (By faster, we mean something with a faster clock, and of course also with faster memory.)
The speed of light is finite, and not especially high in this context, and that turns out to limit the speed of large memory systems connected to a CPU. So, you can either build a really fast memory system of low capacity, or build a much larger one and find the access time is somewhat slower.
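(A quick back-of-envelope sketch, in Python for convenience; the propagation speed and clock rates are rough illustrative assumptions, not measurements. The point is how little distance a signal can cover, out and back, within one clock period.)

[code]
# Rough back-of-envelope: how far can a signal travel out and back in one clock?
# The propagation speed and clock rates below are illustrative assumptions.
C = 3e8               # speed of light in vacuum, metres per second
PROP_FRACTION = 0.5   # signals on a board travel at roughly half of c

for clock_mhz in (8, 100, 1000):
    period_s = 1e-6 / clock_mhz
    # Round trip: address out to the memory, data back again.
    one_way_m = C * PROP_FRACTION * period_s / 2
    print(f"{clock_mhz:5d} MHz: one-way reach roughly {one_way_m * 100:7.1f} cm")
[/code]

At 1GHz that's only a handful of centimetres, before we've even counted the access time of the RAM itself.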
Or, it turns out you can use a combination of a small fast memory and a large slow memory and get better performance at a given cost. This is what caching is all about. What do we mean by a small memory? That depends on the technology we're using to implement the system. A small cheap FPGA might have much less than 64kB of memory onboard. If we're building a KIM-1 then that's enough memory. If we're building a 6502 system with 16MByte of (paged) memory then it isn't enough.
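(To see why the combination wins, the usual back-of-envelope figure is the average access time. A tiny Python sketch, with all the timings and the hit rate made up purely for illustration:)

[code]
# Average access time for a small fast memory in front of a large slow one.
# All figures below are made-up illustrative assumptions.
hit_time_ns = 10       # access time of the small fast memory
miss_penalty_ns = 70   # extra time to go out to the large slow memory
hit_rate = 0.95        # fraction of accesses the small memory can satisfy

average_ns = hit_time_ns + (1 - hit_rate) * miss_penalty_ns
print(f"with the small fast memory: {average_ns:.1f} ns on average")     # 13.5 ns
print(f"large slow memory only:     {hit_time_ns + miss_penalty_ns} ns")  # 80 ns
[/code]

Most accesses see the speed of the small memory, and only the occasional one pays the full price of the large one - provided the hit rate is high, and locality (below) is the reason it usually is.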
So, if we're talking about caches, we just need to make a quick check that, in the context of the problem we're trying to solve and the technology we intend to use, the full complement of RAM won't run at the speed we intend to run the CPU at. Conversely, if it will, then caches are not needed.
What's the magic which allows a small fast memory near the CPU to increase the performance of the system, when there's still a large slow memory out there which needs also to be accessed? It's locality. It's the idea that if I grab a byte here, now, then I will probably be accessing nearby bytes pretty soon. And conversely, if I haven't touched this little area of memory for a while, I probably won't touch it soon.
So, most often, a cache is organised as lines, of some number of bytes like 8 or 16, and the entire line will be present or absent. When the necessary data is not present, it is fetched as a line, and when a line must be evicted to make space, it is written back as a line.
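(Here's a minimal sketch of that bookkeeping, assuming a direct-mapped cache with 16-byte lines - the sizes and organisation are illustrative, not any particular design:)

[code]
# Minimal direct-mapped cache with 16-byte lines, filled a whole line at a time.
# Sizes and organisation are illustrative assumptions only.
LINE_BYTES = 16
NUM_LINES = 64                       # 64 lines x 16 bytes = 1kB of fast memory

tags = [None] * NUM_LINES            # which region of memory each line holds
lines = [bytes(LINE_BYTES)] * NUM_LINES
misses = 0

def read_byte(addr, slow_memory):
    """Return one byte, fetching a whole line from slow memory on a miss."""
    global misses
    offset = addr % LINE_BYTES                   # which byte within the line
    index = (addr // LINE_BYTES) % NUM_LINES     # which line slot to look in
    tag = addr // (LINE_BYTES * NUM_LINES)       # the rest of the address
    if tags[index] != tag:                       # miss: fill the whole line
        misses += 1
        base = addr - offset
        lines[index] = bytes(slow_memory[base:base + LINE_BYTES])
        tags[index] = tag
    return lines[index][offset]

slow = bytes(range(256)) * 256                   # a 64kB 'slow' memory
for a in range(256):                             # walk through 256 bytes in order
    read_byte(a, slow)
print(f"256 sequential reads cost only {misses} line fills")   # 16 fills
[/code]

Sequential code and data walk through memory in order, so most reads land in a line that's already present.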
This leads to a second advantage: modern large memories are not really random access. Sending a whole address and getting back just one byte isn't good use of the bus or of the internal organisation of the RAM. So instead we can send half an address and get back a bunch of bytes in quick succession. This fits well with filling and spilling cache lines.
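(A toy timing model of that, with the cycle counts invented for illustration: the expensive part is pointing the memory at the right row, after which consecutive bytes stream out quickly.)

[code]
# Toy burst-read timing: the row/half-address setup is slow, the bytes that
# follow within the burst are fast. Cycle counts are invented for illustration.
ROW_OPEN_CYCLES = 6     # send the first half of the address, wait for the array
BYTE_CYCLES = 1         # each further byte in the burst

def single_byte_reads(n):
    return n * (ROW_OPEN_CYCLES + BYTE_CYCLES)

def one_burst(n):
    return ROW_OPEN_CYCLES + n * BYTE_CYCLES

print("16 separate one-byte reads:", single_byte_reads(16), "cycles")  # 112
print("one 16-byte burst:         ", one_burst(16), "cycles")          # 22
[/code]

So filling or spilling a whole line costs far less than sixteen individual accesses would.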
There's a third advantage: multi-port memory is more expensive than single port, and wide busses are more expensive than narrow ones. A small fast memory near the CPU can afford to have wide access, and to have multiple ports. Now we can do things like fetch multiple bytes in one cycle into our decoder, and have both opcode and operands available in the same cycle. We can fetch both bytes of a two-byte address in direct page, in one cycle, and we can push or pull two bytes at a time from the stack. Or, when we talk about pipelined architectures, we might be able to write from the write stage in the same cycle as we read from the address generation stage.
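(As a tiny sketch of what width buys us - a fast memory organised as 16-bit words hands over both bytes of a little-endian pointer in one access. The layout follows the 6502 convention; the contents are made up:)

[code]
# Sketch: a fast memory organised as 16-bit words returns both bytes of a
# direct-page pointer in a single access. Contents are made-up illustration.
fast_mem = bytearray(256)                    # a 256-byte direct page
fast_mem[0x80:0x82] = bytes([0x34, 0x12])    # pointer $1234, low byte first

def read_word(addr):
    """One wide access returns two adjacent bytes at once."""
    lo, hi = fast_mem[addr], fast_mem[addr + 1]
    return lo | (hi << 8)

print(hex(read_word(0x80)))   # 0x1234, fetched in one cycle rather than two
[/code]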
Let's talk about pipelining. Why does it make sense to break an instruction down into three, four, five, six or more small steps? We have to execute them all anyway...
Clock speed depends on the slowest part of the design. It depends on how much logic sits between two flops - and to some extent on distance too. If you have a really fast ALU, but it takes a while to increment the PC, you can only run at the rate the PC increment allows.
Pipelining breaks down the actions that need to be taken into smaller steps which are simpler, and therefore faster. The pipelined design can be clocked faster than the unpipelined design. If it can commence one instruction per clock, every clock, then it goes faster even if some instructions take six clocks to finish.
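(A toy cycle count, assuming a six-stage pipeline that can start a new instruction every clock, versus a machine that runs each instruction's steps to completion before starting the next - the stage count and the steady one-per-clock issue rate are illustrative assumptions:)

[code]
# Toy comparison: unpipelined vs pipelined execution of n instructions.
# The stage count and one-issue-per-clock assumption are illustrative.
STAGES = 6

def unpipelined_cycles(n):
    return n * STAGES            # every instruction runs all its steps alone

def pipelined_cycles(n):
    return STAGES + (n - 1)      # fill the pipe once, then one retires per clock

n = 1000
print("unpipelined:", unpipelined_cycles(n), "cycles")   # 6000
print("pipelined:  ", pipelined_cycles(n), "cycles")     # 1005
[/code]

And on top of that, each of the six small steps is simpler than the whole, so the pipelined clock itself can be faster.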
But when the program is not purely sequential - when it hits a branch or a jump - the next address to fetch from may not be known until the previous instruction has finished, which is still several clocks away. So pipelined processors always have some penalty for executing branches, and that's one reason the pipelines don't get arbitrarily long. It's also the reason why a pipelined design can generally be sped up further by spending more logic and complexity on predicting branches, and correctly abandoning mispredicted ones. But that's an optional refinement - it's possible and sensible in some situations to have a pipeline and no branch prediction.
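(To put a rough number on that penalty, here's an effective cycles-per-instruction sketch; the branch frequency, flush penalty and prediction accuracy are plucked out of the air for illustration:)

[code]
# Rough effective cycles-per-instruction with a branch penalty.
# Branch frequency, flush penalty and accuracy are invented for illustration.
BASE_CPI = 1.0          # one instruction per clock when everything flows
BRANCH_FRACTION = 0.2   # roughly one instruction in five redirects the flow
FLUSH_PENALTY = 4       # clocks lost refilling the pipeline after a redirect

def effective_cpi(predict_accuracy):
    mispredicted = BRANCH_FRACTION * (1 - predict_accuracy)
    return BASE_CPI + mispredicted * FLUSH_PENALTY

print(f"no prediction:  {effective_cpi(0.0):.2f} clocks per instruction")   # 1.80
print(f"90% prediction: {effective_cpi(0.9):.2f} clocks per instruction")   # 1.08
[/code]

The longer the pipeline, the bigger the flush penalty, which is exactly the tension described above.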