I've always thought (I'm summarizing here) that a 6502 core should not spend such an awful amount of time fetching instructions. Experimentation time. What did I do? I took Michael Morris's 65C02 core (an old version; let's leave the 'why' of that undiscussed ...), and made the following changes:
a) All reads from <address> also read <address + 1> and <address + 2> (delivering either the opcode's argument bytes or up to 16 extra data bits). A simple memory moulding operation in an FPGA environment (more about that later).
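To make the idea concrete, here's a minimal behavioural sketch of such a widened read, in Python rather than HDL. All names (`MEM`, `wide_read`) are illustrative, not taken from the actual core:

```python
# Model of the widened fetch: every read of <address> also returns the
# bytes at <address + 1> and <address + 2> in the same notional bus cycle.
MEM = bytearray(65536)          # 64 KiB address space

def wide_read(addr):
    """Return the byte at addr plus the next two bytes, as one access."""
    return (MEM[addr & 0xFFFF],
            MEM[(addr + 1) & 0xFFFF],
            MEM[(addr + 2) & 0xFFFF])

# Example: an 'LDA #imm' at $0200 delivers opcode and operand together,
# so no separate argument-byte fetch cycle is needed.
MEM[0x0200] = 0xA9              # LDA #
MEM[0x0201] = 0x42              # immediate operand
opcode, op1, op2 = wide_read(0x0200)
```

In an FPGA, the equivalent is reading three (byte-interleaved) block RAM banks in parallel, which is why the widening is essentially free there.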
b) Eliminated the argument byte fetches from the microcode.
c) Similarly, coalesced vector reads (both bytes of a reset or interrupt vector) into a single access.
This shortens, e.g., "LDA (dp),Y" or "BIT abs" by two cycles, just for the price of widening the data bus ... Of course, especially in FPGA environments (which I'm targeting), this is all basically free. And therefore it should be exploited! Get off your lazy asses.
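The cycle accounting for those two examples, spelled out (base 65C02 timings; the savings assume operand-byte fetches ride along with the opcode fetch and paired pointer bytes are read in one access):

```python
# Base 65C02 cycle counts for the two instructions named above.
base = {"BIT abs": 4, "LDA (dp),Y": 5}

# Cycles removed by the wide fetch:
saved = {
    "BIT abs": 2,      # both address operand bytes folded into the opcode fetch
    "LDA (dp),Y": 2,   # dp operand folded in, plus both pointer bytes in one read
}

wide = {op: base[op] - saved[op] for op in base}
```

So "BIT abs" drops from 4 cycles to 2, and "LDA (dp),Y" from 5 to 3.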
After hitting most of the instruction set (all zero page and absolute addressing related instructions, plus lucrative instructions like JSR, JMP and RTS), I had already improved core performance by around 50% in one of my own creations (the 'soft' Acorn 6502 Second Processor, see http://www.zeridajh.org/hardware/soft65 ... /index.htm).
And the end is not in sight. Optimization of some instructions is impossible due to contention between the instruction fetch and another memory operation. But what if we gave zero page and stack space separate storage, reflecting all writes to two copies: main memory (for degenerate accesses, like LDA 0010h,X or LDA 0100h,X), and a faster copy in e.g. registers (for natural accesses like LDA 10h,X or PHA)? The latter copy would have no contention with instruction fetches, and would open the door to single cycle zero page operations.
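A behavioural sketch of that two-copy scheme, again in Python with made-up names (`main_mem`, `fast_zp`): every zero page write lands in both copies, 'natural' zero page reads use the contention-free fast copy, and degenerate absolute accesses to the same addresses still go through main memory.

```python
main_mem = bytearray(65536)   # ordinary memory
fast_zp = bytearray(256)      # register-file mirror of $0000-$00FF

def write(addr, value):
    """All writes go to main memory; zero page writes are mirrored."""
    main_mem[addr & 0xFFFF] = value & 0xFF
    if addr < 0x100:
        fast_zp[addr] = value & 0xFF

def read_zp(addr):
    """Natural zero page access (e.g. LDA 10h,X): fast copy, no
    contention with the instruction fetch."""
    return fast_zp[addr & 0xFF]

def read_abs(addr):
    """Degenerate access (e.g. LDA 0010h,X): main memory path."""
    return main_mem[addr & 0xFFFF]

write(0x10, 0xAB)   # mirrored, so both paths see the same value
```

Since both copies always agree, the core is free to route each access down whichever path is cheapest for that addressing mode.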
In short: there's a lot to be exploited here!