Some text from the "Hardware section" of the forum
related to speeding up a hypothetical 6502 compatible TTL CPU
by messing with the architecture...
http://forum.6502.org/viewtopic.php?f=4&t=3493&start=150&sid=efd8854f1f3dc4e9a3568dbc1aa779c3&sid=efd8854f1f3dc4e9a3568dbc1aa779c3#p46625
BigEd suggested that I post it here, too:
;------
In the cycle after an instruction fetch, the ALU is sitting idle.
So it would be possible to keep the ALU outputs and some of the
ALU inputs in latches and to do the flag evaluation in the cycle
after an ALU operation (like in M02, which wasn't timing compatible with the 6502).
But this would complicate the branch logic, especially keeping a "branch not taken"
within two cycles.
:)
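A minimal behavioral sketch of that deferred flag evaluation, in Python rather than TTL. The class and method names are made up for illustration, and this is not how M02 actually does it; it only shows why a branch right behind an ALU operation would see stale flags for one cycle:

```python
class DeferredFlagUnit:
    """Hypothetical model: latch the ALU result, derive N/Z one cycle later."""

    def __init__(self):
        self.result_latch = None  # ALU output waiting for flag evaluation
        self.n = False            # negative flag
        self.z = False            # zero flag

    def alu_cycle(self, result):
        # Cycle 1: the ALU result goes into a latch, flags are NOT updated yet.
        self.result_latch = result & 0xFF

    def next_cycle(self):
        # Cycle 2 (the ALU would be idle anyway): evaluate flags from the latch.
        if self.result_latch is not None:
            self.n = bool(self.result_latch & 0x80)
            self.z = self.result_latch == 0
            self.result_latch = None

    def flags_valid(self):
        # A branch must not trust N/Z while a result is still pending.
        return self.result_latch is None
```

A Bxx that directly follows the ALU operation would have to check something like `flags_valid()` (or get the flags forwarded), and that is where the two-cycle "branch not taken" gets tricky.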
Getting rid of the multiplexers in front of the registers
by using bypass latches and "register renaming".
But this would complicate the interrupt logic.
Then, there would be "out of order execution".
We have that circuitry for calculating A&X and DP+1 for the UFOs.
("Unforeseen Opcodes" by the designers of the NMOS 6502).
Now imagine having some circuitry that calculates X+1 and X-1
(plus the flag results) if X is loaded... same thing for Y, A, etc.
When reading the instruction stream a little bit ahead,
and making use of some "register renaming" tricks (for flags in the
status register too), such things as INX, DEX etc. could be done
"in the background".
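As a rough sketch of the "precalculate X+1 and X-1 on load" idea (Python as a behavioral model only; `ShadowedRegister` and its methods are hypothetical names, and the instruction look-ahead and renaming parts are left out):

```python
from dataclasses import dataclass


def nz_flags(value):
    """Return (N, Z) for an 8-bit value, the way INX/DEX set them."""
    return bool(value & 0x80), value == 0


@dataclass
class ShadowedRegister:
    """Hypothetical register with precomputed +1 / -1 neighbours."""
    value: int = 0
    plus_one: int = 1
    minus_one: int = 0xFF

    def load(self, value):
        # Whenever the register is written, the side circuitry
        # recalculates both neighbours "in the background".
        self.value = value & 0xFF
        self.plus_one = (self.value + 1) & 0xFF
        self.minus_one = (self.value - 1) & 0xFF

    def inc(self):
        # INX-style step: just select the precomputed value, no ALU pass.
        self.load(self.plus_one)
        return nz_flags(self.value)

    def dec(self):
        # DEX-style step: same thing in the other direction.
        self.load(self.minus_one)
        return nz_flags(self.value)
```

For example, after `x.load(0xFF)` the value for INX is already sitting in `x.plus_one`, so the increment boils down to a register select instead of a trip through the ALU.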
That brings up another interesting but complicated topic:
Instruction and data prefetching.
BTW: 65CE02 was microcoded, but the datasheet mentions that the code
would run up to 25% faster than on a mere 6502 running at the same
clock frequency.
Interestingly, the "secret weapons of Commodore" C65 page
has a line of text saying that the designer of the 4510
(which had a 65CE02 core) later "went on to design the K7 for AMD".
:mrgreen:
Hmm... branch prediction. Like calculating the effective address
resulting from a Bxx instruction taken in advance...
;)
Which would complicate branch instructions a bit, of course.
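The "calculate the taken address in advance" part could be sketched like this (Python, behavioral only; the function names are hypothetical, and `pc` is assumed to be the address of the Bxx opcode itself):

```python
def branch_target(pc, offset_byte):
    """Effective address of a taken 6502 relative branch at address pc."""
    # The operand is a signed 8-bit offset relative to the next instruction.
    offset = offset_byte - 0x100 if offset_byte & 0x80 else offset_byte
    return (pc + 2 + offset) & 0xFFFF


def next_pc(pc, offset_byte, taken):
    """Both candidates are computed up front; the flag only selects one."""
    fall_through = (pc + 2) & 0xFFFF
    target = branch_target(pc, offset_byte)
    return target if taken else fall_through
```

With both addresses ready before the flags are known, the branch decision is only a 2:1 select on the PC. A real design would still have to deal with the page-crossing penalty if it wants to stay cycle compatible.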
Oh, and there also are tricks to speed up the microcode...
the simplest trick would be using fast RAMs instead of ROMs,
and to load them from a serial EEPROM after a hardware reset
(hey, that's the _standard_ when using FPGAs).
But it would also be possible to have the "microcode ROM" rolled out "flat",
which means the "ROMs" just have 8 address lines fed by the OpCode,
but the "ROM" outputs are much wider than 32 bits...
and then to tap into them by using fast multiplexers controlled by a state machine...
like in X02.
;)
So the propagation delay of the "ROMs" only would kick in after fetching
the OpCode...
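A small sketch of the "flat" microcode idea (Python model; the slice width, the step count and the control values are made-up assumptions, not taken from X02):

```python
SLICE_BITS = 16                      # assumed width of one control word
MAX_STEPS = 8                        # assumed maximum cycles per opcode
SLICE_MASK = (1 << SLICE_BITS) - 1


def pack_steps(steps):
    """Pack per-cycle control words into one wide 'ROM' output word."""
    assert len(steps) <= MAX_STEPS
    word = 0
    for i, control in enumerate(steps):
        word |= (control & SLICE_MASK) << (i * SLICE_BITS)
    return word


def select_slice(wide_word, step):
    """The fast multiplexer: the state machine picks one slice per cycle."""
    return (wide_word >> (step * SLICE_BITS)) & SLICE_MASK


# The "ROM" is addressed by the opcode only (8 address lines).
# 0xE8 is INX; the two control words here are placeholders.
flat_rom = {0xE8: pack_steps([0x0001, 0x0002])}
```

The slow lookup `flat_rom[opcode]` happens once per instruction, and after that every cycle only needs `select_slice(word, step)`, which in hardware is just multiplexer delay.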
...Which brings us to "Barrel Processors".
Where the idea is to have a fast ALU and instruction decoder,
and then to have multiple register sets (including multiple status registers)
to make efficient use of that speed...
from the hardware point of view, it's still a single CPU of course...
but for the "end user", it just might look like a dual or quad core.
:)
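A toy model of the barrel processor idea (Python; `RegisterSet` and `BarrelCore` are hypothetical names, and a strict round robin with one step per thread is assumed):

```python
from dataclasses import dataclass


@dataclass
class RegisterSet:
    """One complete 6502 programmer's model per hardware thread."""
    a: int = 0
    x: int = 0
    y: int = 0
    sp: int = 0xFF
    p: int = 0        # status register, one per thread
    pc: int = 0


class BarrelCore:
    """One shared ALU/decoder, several register sets, round-robin select."""

    def __init__(self, n_threads):
        self.contexts = [RegisterSet() for _ in range(n_threads)]
        self.current = 0

    def step(self, execute):
        # Run one step on the active register set, then rotate to the next.
        ran = self.current
        execute(self.contexts[ran])
        self.current = (self.current + 1) % len(self.contexts)
        return ran
```

With `BarrelCore(2)`, successive calls to `step()` alternate between register set 0 and 1, so from the software side it looks like two slower CPUs sharing one fast data path.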
;------
But another problem is that when getting past 20 MHz,
the address decoder and the peripheral chips might be getting too slow.
Also, implementing things like those mentioned above tends to be quite
difficult and labor intensive, because when trying to increase speed,
the complexity of the CPU design will increase in a nonlinear fashion.
:mrgreen:
Oh... and it's all half-baked stuff, of course.