Re: New Verilog 6502 core
Posted: Sun Jan 08, 2017 10:18 pm
I think technically it might be possible to make a significantly higher-performance 6502, in the sense of fewer clocks per instruction, but it would take a very sophisticated microarchitecture, much more than just pipelining. (One caveat: I think it would have to have some means of knowing which accesses are to I/O devices, to make good use of cache.)
The idea is that reads can be satisfied from cache, and writes can go through deep queues and may never need to reach RAM at all (for example, when a later write to the same address supersedes an earlier one). That means fetches, reads, and writes inside the core can use wider paths than the external path to RAM. More instructions can be in flight, including speculatively executed ones. I think complex instructions would have to be broken into micro-operations, so that those can be reordered, dispatched, and executed speculatively as resources allow. As with the x86, you'd need many more registers than are architecturally defined - you might even bring in some of page zero and some part of the top of stack as if they were registers. And indeed, you'd need to decode more than one instruction per cycle, but that might even be the easy part.
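To make the write-queue idea a bit more concrete, here's a minimal behavioural sketch in Python (not Verilog, and not taken from any real core). It only models the bookkeeping: repeated writes to the same RAM address collapse into one, while anything in an I/O window goes straight to the bus in order. The class name `WriteQueue`, the queue depth, and the `0xD000`-`0xDFFF` I/O window are all assumptions for illustration; a real system would have to be told where its I/O actually lives.

```python
from collections import OrderedDict

# Hypothetical I/O window; on a real 6502 system this depends on the board.
IO_START, IO_END = 0xD000, 0xDFFF


def is_io(addr):
    """True if the address falls in the (assumed) I/O window."""
    return IO_START <= addr <= IO_END


class WriteQueue:
    """Toy model of a coalescing write queue sitting in front of external RAM."""

    def __init__(self, depth=8):
        self.depth = depth
        self.pending = OrderedDict()  # addr -> latest byte, oldest entry first
        self.bus_writes = []          # (addr, value) pairs that reached the bus

    def write(self, addr, value):
        value &= 0xFF
        if is_io(addr):
            # I/O writes have side effects: drain older writes first so
            # ordering is preserved, then send this one out unmerged.
            self.flush()
            self.bus_writes.append((addr, value))
            return
        if addr in self.pending:
            # A newer write supersedes the queued one; the older value
            # never needs to reach RAM.
            self.pending[addr] = value
            return
        if len(self.pending) >= self.depth:
            # Queue full: retire the oldest entry to the bus.
            self.bus_writes.append(self.pending.popitem(last=False))
        self.pending[addr] = value

    def read(self, addr):
        """Forward a queued value if there is one, else None (go to cache/RAM)."""
        if is_io(addr):
            return None  # I/O reads must always go to the device
        return self.pending.get(addr)

    def flush(self):
        """Send every queued write to the bus, oldest first."""
        while self.pending:
            self.bus_writes.append(self.pending.popitem(last=False))


q = WriteQueue(depth=4)
for v in (1, 2, 3):
    q.write(0x0010, v)   # three writes to the same page-zero location
q.write(0xD020, 5)       # an I/O write forces the queue to drain first
print(q.bus_writes)
```

In this toy model the three page-zero writes turn into a single bus write, which is the sense in which some writes need not happen at all. It also shows why the core has to know which accesses are I/O: merging or reordering those would change what the device sees.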
One difficulty is this: it would take an absurd amount of effort - with the necessary knowledge and that amount of effort, a person could build a rather good implementation of a more regular machine - RISC-V, any other RISC, or something of their own invention, and they might well do that instead.
Another difficulty is that as complexity increases, it gets harder to reach a high clock speed, especially in an FPGA. If the clock speed is not going to be so high, there's less advantage in trying to make such an aggressive microarchitecture.
I should say, I'm not an expert at any of this, just a keen follower of developments, which I find very interesting. Here's a diagram describing AMD's Barcelona generation, as an idea of how things can be arranged: (From Computer Organization and Design: The Hardware/Software Interface, by David A. Patterson and John L. Hennessy)
There's a heap of good information at http://agner.org/optimize/ too, if you're prepared to consider things from an x86 perspective. The x86 is register-poor and has similar status bits, so it is similar enough to the 6502 for these purposes.