A good old-fashioned 6502 system can't really benefit from cache, because memory accesses are already single-cycle. (Unless you have a slow ROM with wait-states...)
But when the CPU clock exceeds the access time of the memory, yes, I think there are possibilities for benefit. Even the 14MHz parts we buy are not really too fast for memory - which is to say, we can buy adequately fast memories. So it's in the realm of FPGA implementations that caches could be interesting: the CPUs can run at 50MHz up to 100MHz or so, and it's difficult to build a memory system quite that fast. But there's more: there's scope for performing some accesses in parallel, such as the two bytes read and written for JSR and RTS, or the two bytes read for indirect accesses, or indeed the two or three operand bytes of each opcode. And there's scope to run the instruction fetching in parallel with the write-back of stores, to improve the pipelining.
As far as I know, no-one has tackled these ideas yet in practice.
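As a rough illustration of why a small cache could pay off on an FPGA core, here's a toy direct-mapped cache model in Python. Everything in it is made up for the sketch - the line size, cache size, and the sample address trace (straight-line code plus some JSR/RTS-style stack traffic) - but it shows the basic mechanism: sequential fetches and stack pushes/pulls land in the same few lines and hit almost every time after the first fill.

```python
# Toy direct-mapped cache model. All parameters are illustrative,
# not taken from any real design.
LINE_SIZE = 8      # bytes per cache line
NUM_LINES = 64     # 64 lines x 8 bytes = 512-byte cache

def simulate(trace):
    """Return (hits, misses) for a sequence of byte addresses."""
    tags = [None] * NUM_LINES              # tag store, one tag per line
    hits = misses = 0
    for addr in trace:
        index = (addr // LINE_SIZE) % NUM_LINES   # which line this address maps to
        tag = addr // (LINE_SIZE * NUM_LINES)     # which block of memory it came from
        if tags[index] == tag:
            hits += 1
        else:
            misses += 1
            tags[index] = tag              # fill the line on a miss
    return hits, misses

# A toy trace: 64 sequential opcode fetches, then repeated stack
# pushes and pulls, roughly what JSR/RTS-heavy code might generate.
trace = list(range(0x0200, 0x0240))
trace += [0x01FF, 0x01FE, 0x01FE, 0x01FF] * 8

hits, misses = simulate(trace)
print(hits, misses)
```

With this trace the code fetches miss once per 8-byte line and the stack lines miss only on first touch, so the hit rate comes out well above 90% - which is the kind of access pattern that makes a cache worthwhile when main memory can't keep up with the core clock.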
There's another case: Shane Gough's emulated 6502 on ARM has very slow access to main memory over an SPI connection. There's scope there for some local memory or caching. See
viewtopic.php?f=1&t=3146