I did some measurements this morning which might help inform the question of how and where to use caches for the 6502.
The context is the Matchbox CoPro hooked up to an Acorn BBC Micro: a 64MHz FPGA implementation of a 65C02, with 64kByte of single-cycle on-FPGA RAM and 1MByte of 5-cycle off-FPGA RAM.
The benchmark used is "clocksp", a BASIC program exercising various types of code. The experiments involved remapping various parts of the memory space to the slower off-chip memory. Looked at the other way around, each result gives an indication of the possible benefit of putting a single-cycle cache in front of the slower memory.
- With only fast RAM in play, as a baseline, clocksp reports 65.33MHz
- Moving BASIC (&8000 to &C000) from fast RAM to slow RAM: 17.48MHz
- Also moving &2000 to &8000 to slow RAM: 17.37MHz
- Also moving PAGE up to &2000: 17.10MHz
- Moving everything from &0000 up to &E000 to slow RAM: 13.09MHz
- Moving only the &0000 to &2000 area: 29.58MHz
- Moving only the &0000 to &2000 area and setting PAGE to &2000: 30.57MHz
So, with slow RAM in play everywhere, we get 13MHz, close to the 64MHz/5 = 12.8MHz that 5-cycle accesses imply. If we can keep page zero and the stack in fast RAM, we're up to 17MHz, which is substantially faster. If instead we could run our code, but only the code, from a single-cycle on-chip cache, we might get 30MHz.
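The arithmetic behind these figures can be sketched with a simple model: the 65C02 makes one memory access per clock, so the average cycles per access sets the effective clock rate directly. This is a rough sketch with the cycle counts from above, not something clocksp itself computes.

```python
# Simple model: a 64MHz core where a fraction of memory accesses
# take 5 cycles (slow off-FPGA RAM) and the rest take 1 cycle
# (fast on-FPGA RAM). Effective clock = core clock / average
# cycles per access.

def effective_mhz(slow_fraction, core_mhz=64.0, slow_cycles=5, fast_cycles=1):
    """Effective clock in MHz given the fraction of accesses that hit slow RAM."""
    avg_cycles = slow_fraction * slow_cycles + (1 - slow_fraction) * fast_cycles
    return core_mhz / avg_cycles

print(effective_mhz(1.0))  # everything slow: 12.8, near the measured 13.09MHz
print(effective_mhz(0.0))  # everything fast: 64.0, the measured baseline
```

Working backwards, the 17MHz and 30MHz results correspond to roughly 70% and 30% of accesses landing in slow RAM, which is a useful way to read off how much traffic each region carries.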
That is to say: there's a big gain from speeding up instruction (and operand) reads. I would have expected the zero page speedup to be more important, but on reflection I'm not surprised to see that it isn't.