Thanks for all the ideas and background information. The 32-bit data bus (or even 64-bit) is very interesting for feeding a cache and I have already been thinking a lot about that, but I had never seen the 65m02 effort.
From my understanding, a pipelined microarchitecture requires running several instructions in parallel, with each instruction divided into micro-opcodes (u-opcodes). The idea is that a u-opcode runs faster since it does less, but several of them are needed to complete a full opcode, so each opcode takes more clock cycles. I am not sure if that is what you mean by an x86-like direction, but that is what I have looked into.
For example, "STA ($ZP,X)" is quite a complicated instruction which requires fetching from Zeropage, adding X and storing A. At first I thought that Zeropage could be treated as a 256-register area, but in order to run several instructions in parallel, one either has each u-opcode access a given register at a fixed cycle (for example the 6th cycle from the start of the opcode), or ends up with very complicated pre- or post-fetch handling of such a register.
For example, the above instruction could be divided into 8 u-opcodes (or maybe even more):
1) Fetch instruction byte
2) Fetch data byte
3) Fetch Zeropage to work register
4) Fetch and add X to work register
5) Fetch MSB to storage address
6) Fetch LSB to storage address
7) Fetch data from Accumulator to work register
8) Store work register to storage address
So the instruction would take 8 cycles to complete, but each u-opcode being simpler could (in theory) allow the core to run much faster, which is how the x86 does it. But as you point out, this might not be true for an FPGA.
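A minimal sketch of the decomposition above, assuming each u-opcode takes exactly one cycle (the u-op names here are hypothetical labels, not from any real design):

```python
# Hypothetical micro-op decomposition of STA ($ZP,X), mirroring the
# 8-step list above. Each u-op is modelled as taking one cycle.
UOPS_STA_IZX = [
    "fetch_opcode",      # 1) fetch instruction byte
    "fetch_operand",     # 2) fetch data byte (Zeropage offset)
    "read_zp_to_work",   # 3) fetch Zeropage to work register
    "add_x_to_work",     # 4) fetch and add X to work register
    "read_ptr_msb",      # 5) fetch MSB to storage address
    "read_ptr_lsb",      # 6) fetch LSB to storage address
    "read_accumulator",  # 7) fetch data from Accumulator to work register
    "store_work",        # 8) store work register to storage address
]

def cycles(uops):
    """One cycle per u-op in this simple model."""
    return len(uops)
```

With this model `cycles(UOPS_STA_IZX)` gives the 8 cycles mentioned in the text.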
Anyway, always manipulating the Accumulator at cycle 7) would enable one to run many instructions in parallel, so that effectively one has 8 parallel pipelines and one instruction finishing every cycle. Because of this, ALL instructions would need to take 8 cycles to finish, since for example Zeropage has to be accessed at cycle 3) to prevent conflicts with other instructions accessing it.
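The throughput claim above can be sketched in a few lines: with 8 fixed stages and no stalls, the first instruction retires after 8 cycles and each later one retires one cycle after the previous (this is a back-of-the-envelope model, not a real pipeline simulator):

```python
# Sketch of the fixed 8-slot pipeline idea: every instruction passes
# through all 8 stages, each stage is used by exactly one in-flight
# instruction per cycle, so once the pipe is full one instruction
# retires every cycle.
def total_cycles(n_instructions, stages=8):
    """Cycles to retire n instructions, assuming no wait-states."""
    if n_instructions == 0:
        return 0
    # First instruction finishes after `stages` cycles; afterwards
    # one instruction completes per cycle.
    return stages + (n_instructions - 1)
```

For a long run of instructions this approaches one instruction per cycle, versus the several cycles per instruction of a plain 6502.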
Now if one stores the Accumulator into Zeropage, that is obviously not going to work, so one will need to put in wait-states for this to work in practice (at least for upcoming instructions that manipulate the same data in Zeropage). This is also true for the X and Y registers (which in the above example are accessed in cycle 4), so that for example TAX and TXA would need wait-states if they happened right after each other.
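The wait-state requirement follows from the fixed slots: a store lands in Zeropage at cycle 8, but a later instruction reads Zeropage at its own cycle 3, so a dependent instruction issued too soon must stall. A toy calculation of that (the exact cycle numbers and the "read may happen in the same cycle as the write" assumption are mine, for illustration):

```python
# Toy hazard check for the fixed-slot pipeline: a store to Zeropage
# completes in cycle 8 of its instruction, while a following
# instruction reads Zeropage in its cycle 3 (both relative to issue).
WRITE_CYCLE = 8
READ_CYCLE = 3

def stalls_needed(issue_gap):
    """Wait-states for a dependent instruction issued `issue_gap`
    cycles after a writer of the same Zeropage address. Assumes the
    read may occur in the same cycle the write completes."""
    # Require: issue_gap + stalls + READ_CYCLE >= WRITE_CYCLE
    return max(0, WRITE_CYCLE - READ_CYCLE - issue_gap)
```

So an instruction issued on the very next cycle after the writer would need 4 wait-states, while one issued 5 or more cycles later needs none.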
So all in all I think this would allow for faster 6502 execution, but whether it would actually be any faster in real life would depend heavily on the code.
The "other way" would be to compile several instructions into more complicated ones for internal execution, for example "LDA #$xx + STA $yyyy" into "LSA #$xx $yyyy" (LSA would be load A and store to $yyyy), or longer sets of instructions. That would not enable pipelining, but it would reduce the number of cycles per opcode. But maybe the 65m02 is a better way in that case, since it is the loading from and storing to RAM that is the bottleneck of most opcodes.
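The fusion idea above can be sketched as a simple peephole pass over the instruction stream (the tuple encoding and the "LSA" mnemonic follow the example in the text; everything else here is a made-up illustration):

```python
# Toy peephole "fusion" pass: the pattern LDA #imm followed by STA abs
# collapses into a single internal op, called "LSA" as in the text.
def fuse(instructions):
    """instructions: list of tuples like ('LDA_IMM', 0x42) or
    ('STA_ABS', 0x1234). Returns a new list with each matching
    LDA_IMM/STA_ABS pair replaced by one fused LSA op."""
    out, i = [], 0
    while i < len(instructions):
        if (i + 1 < len(instructions)
                and instructions[i][0] == "LDA_IMM"
                and instructions[i + 1][0] == "STA_ABS"):
            # Fused op: load immediate and store it in one internal step.
            out.append(("LSA", instructions[i][1], instructions[i + 1][1]))
            i += 2
        else:
            out.append(instructions[i])
            i += 1
    return out
```

For example, `fuse([("LDA_IMM", 0x42), ("STA_ABS", 0x1234)])` yields the single internal op `[("LSA", 0x42, 0x1234)]`, while instructions that don't match the pattern pass through unchanged.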
Well, at least those are my thoughts, but then I don't have that much experience with this.