I suppose I could just admit that it's impossible to have two threads, one for each of two different ideas. But as a consequence, it's very difficult to have a satisfactory discussion about either idea.
Let me just say: this here thread was intended for implementation ideas that lead to a higher-performance 6502; the other is intended for value-added, near-6502 ideas.
Z-80s can take 3, 4, or 5 clock cycles per M-cycle, depending on the instruction executed, and even then, an instruction can take several M-cycles as well.
The reason the average is between 3 and 5 is that the chip is microcoded underneath. That is, the Z-80 and 68000 CPUs actually contain a simpler, yet functionally independent, CPU (of sorts) whose job it is to interpret the programmer-visible instruction set. This is how, for example, the 68020 could get away with such ridiculously complicated addressing modes.
In the Z-80's case, the microcode is used to multiplex data over a finite number of internal buses, all of which were laid out by hand back then. It's also used to enforce the bus protocol:
1. Put address on the bus.
2. (While waiting for data to arrive, increment the PC.)
3w. Sample the RDY or similar signal, and if negated, wait here.
3. When RDY is asserted, accept the data and terminate the bus transaction.
E.g., if RDY is asserted during step 1 above, it means nothing.
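That handshake is easy to model in software. Here's a toy Python sketch of a single read M-cycle, with the ready line modeled as a callable (the function and parameter names are mine, not Zilog's):

```python
def bus_read(address, memory, rdy, pc):
    """Toy model of the read handshake above: drive the address,
    bump the PC while waiting, and insert wait states until RDY."""
    bus_address = address          # 1. Put the address on the bus.
    pc += 1                        # 2. Increment the PC while waiting for data.
    wait_states = 0
    while not rdy(wait_states):    # 3w. Sample RDY; if negated, wait here.
        wait_states += 1
    data = memory[bus_address]     # 3. RDY asserted: accept the data and
    return data, pc, wait_states   #    terminate the bus transaction.
```

A slow device that needs two wait states before asserting RDY would look like `rdy = lambda waits: waits >= 2`. Note that asserting RDY "early" (during step 1) changes nothing in this model, because the signal simply isn't sampled until step 3w.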
The genius of the 6502 is that its bus is truly single-phase (phi1 and phi2 are conveniences to simplify the NMOS implementation; they're not at all a requirement for the CMOS process, which is why the 65816 doesn't generate them -- you drive PHI2 externally). You drive the address bus and R/W lines during phase 1, and capture the data during phase 2. If the data wasn't ready, that's OK -- just repeat the current cycle. The 65816 is only marginally more sophisticated than this, due to multiplexing the bank-address byte on D7-D0.
As for proof of the 6502's performance relative to other CPUs at a normalized clock, all you need to do is count the cycles in your particular application. Yes, yes, the 68000 can pull off 2 MIPS at 8MHz, but you know what? A 65816 at 8MHz will pull off 4 MIPS on average. At this point, the 65816 and 68000 compete head to head, with the former on average attaining close to 80% of the performance of the 68000,
despite having only an 8-bit wide bus. Proof: sit an Apple IIgs on a table next to a classic Macintosh and run a paint program on both of them. (Remember, the Mac is monochrome, while the IIgs is 16-color; while the Mac has more pixels, the IIgs actually has more bits to push around in total.) You'd totally expect the IIgs at 2.8MHz to be slower at video updates than the 8MHz Mac; however, this is not the case. Grab a brush from some picture and drag it around the screen. The Mac will show very noticeable rip and tear, while the IIgs will appear to be about as fast as a Commodore Amiga using its blitter to draw.
As a final analysis, let's normalize bus widths and optimize our microarchitectures too (in fact, we have an existence proof: the 68008), and what you'll find is that the 68000 is abysmally sluggish compared to the 65816. The only reason the 68000 gets the performance it does is that it has a true 16-bit wide bus. Slap a 16-bit wide bus on the 65816, changing nothing else, and I'm willing to bet money that the 65816 will meet or even slightly exceed the 68000.
If we take this opportunity to really widen the data bus, then a single instruction fetch can grab a whole handful of instructions. This is quite useful thanks to something called macro-op fusion. If you augment the instruction decode logic to perform macro-op fusion, your instruction timings will now look like this:
Code:
; Assuming native mode, 16-bit accumulator.
; For the core that implements macro-op fusion, I further assume a 64-bit bus.
;
; CODE                 AS-IS   Macro-Op Fusion
  CLC          ;         2     1  [1, 3]
  LDA addend1L ;         5     1  [2, 5]
  ADC addend2L ;         5     1  [2, 5]
  STA resultL  ;         5     2  [3, 4, 5]
  LDA addend1H ;         5     1  [2, 5]
  ADC addend2H ;         5     2  [2, 3, 5]
  STA resultH  ;         5     1  [4, 5]
; TOTAL                 32     9  (best case)
Notes:
1. Out of context, CLC would normally take the usual 2 cycles; but since it's recognized as part of a more complex code pattern, its behavior can be rolled into the mechanics of the surrounding code.
2. This instruction takes 2 cycles to fetch a 16-bit word from memory.
3. There is an additional cycle of overhead for instruction fetch on this byte.
4. This instruction takes 2 cycles to store a 16-bit word to memory.
5. Add 1 cycle if the 16-bit operand crosses an 8-byte boundary.
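The column totals are easy to sanity-check mechanically. This little Python table just restates the per-instruction cycle counts from above and sums both columns:

```python
# (mnemonic, as-is cycles, fused cycles), copied from the table above.
timings = [
    ("CLC",          2, 1),
    ("LDA addend1L", 5, 1),
    ("ADC addend2L", 5, 1),
    ("STA resultL",  5, 2),
    ("LDA addend1H", 5, 1),
    ("ADC addend2H", 5, 2),
    ("STA resultH",  5, 1),
]

as_is = sum(c for _, c, _ in timings)
fused = sum(c for _, _, c in timings)
print(as_is, fused)  # 32 9
```

That's a best-case speedup of roughly 3.5x for this particular sequence.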
The CPU is now looking not just at individual instructions to determine what to do, but also at the context surrounding them:
clc, lda, adc is a single "macro-op" instruction.
sta, lda occurs far too frequently to pass this one up, either.
adc, sta occurs less frequently, but fusing it is strongly desirable, for what I hope are obvious reasons.
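In hardware this amounts to a peephole pattern-match over the mnemonics sitting in the fetch buffer. Here's a toy Python model of just the grouping step, using the three patterns above (the pattern table and greedy longest-first strategy are illustrative, not a real decoder design):

```python
# Fusible mnemonic sequences, longest first, so CLC/LDA/ADC is
# preferred over matching any of its members individually.
FUSION_PATTERNS = [
    ("CLC", "LDA", "ADC"),
    ("STA", "LDA"),
    ("ADC", "STA"),
]

def fuse(mnemonics):
    """Greedily group a stream of mnemonics into macro-ops."""
    macro_ops, i = [], 0
    while i < len(mnemonics):
        for pattern in FUSION_PATTERNS:
            if tuple(mnemonics[i:i + len(pattern)]) == pattern:
                macro_ops.append(pattern)
                i += len(pattern)
                break
        else:                           # no pattern matched: issue as-is
            macro_ops.append((mnemonics[i],))
            i += 1
    return macro_ops
```

Running the 32-bit add sequence through it, `fuse(["CLC", "LDA", "ADC", "STA", "LDA", "ADC", "STA"])` yields `[("CLC", "LDA", "ADC"), ("STA", "LDA"), ("ADC", "STA")]` -- seven instructions issued as three fused groups.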
According to http://oldwww.nvg.ntnu.no/amiga/MC680x0 ... ndard.HTML , a single ADD.L instruction takes 12 cycles. The code above fetches, adds, and stores a 32-bit quantity in memory and, assuming alignment with respect to the 64-bit bus, actually runs 3 cycles faster. Again, this is a hypothetical case, so don't expect to see this technology become widespread in the 65xx community any time soon. All I'm saying is that it's relatively easy to do if you truly compare apples to apples.