Re: 32 is the new 8-bit
Posted: Sat Jun 08, 2013 7:37 pm
OK, I've done some mockups on the possibility of such a "RISC" machine. I've ended up with something not too complicated in concept, but it will definitely be a major change from a usual RISC machine.
I'll explain myself. My goals were the following :
1) All direct or immediate instructions should be in 1 cycle, including RMW instructions like inc $memory
2) All sta and lda indexed should be done in 1 cycle
3) sta (),Y and lda (),Y should be done in 1 cycle
4) I don't care if adc (),Y, lda (,X), sta (,X), adc (,X), etc. take more than 1 cycle, as those are rarely used
5) Anything not mentioned yet has not been thought through yet, but I don't care if it takes more than 1 cycle
In the end I have come to the conclusion that 4) does not change much, because if I want to fulfil 1-3, it costs nothing more to add support for 4). I might have missed something here; if that is the case, please tell me so.
Because I want to do RMW in 1 cycle, it is necessary to have separate memory read and memory write stages which are independent from each other. And because I want to do indexed instructions in a single cycle, we need an address adder which is independent from the ALU.
So, here are the details of the RISC-like pipeline which I think could execute the vast majority of microcoded 6502 instructions in a single cycle:
Stage 1: Register prefetch. Sends the address of the index registers to the register file. Also reads two zero page locations for indirect addressing modes. The reads are done as two separate byte reads instead of a single 16-bit read, to handle all cases easily: non-aligned reads and the usage of ($ff),Y.
Hardware resources for stage 1: 1 register read port, 2 zero page byte read ports
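For illustration, the two independent byte reads of stage 1 could be modeled like this in Python (hypothetical helper name; the wrap at $ff matches original 6502 behavior):

```python
def fetch_zp_pointer(zp, operand):
    """Stage 1: fetch a 16-bit pointer from zero page as two separate
    byte reads, so ($ff),Y wraps its high byte to $00 like a real 6502."""
    lo = zp[operand]                # first byte read port
    hi = zp[(operand + 1) & 0xFF]  # second byte read port, wraps in page zero
    return lo | (hi << 8)

# Pointer stored at $ff: the high byte comes from $00, not $100.
zp = [0] * 256
zp[0xFF], zp[0x00] = 0x34, 0x12
assert fetch_zp_pointer(zp, 0xFF) == 0x1234
```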
Stage 2 : Address calculation, memory load and accumulator prefetch.
This stage contains a 16-bit adder and is able to add a register with a pointer (all values fetched in stage 1), a register with a literal (fetched from the opcode), etc.
The result of the adder is (optionally) sent to the data (cache) bus for a load access. The accumulator is prefetched in this stage for most instructions (this could easily be extended to prefetch any other register, for instance).
Hardware resources for stage 2: 1 register read port, 1 data master (read only), 1 16-bit+8-bit adder with 16-bit result
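As a sketch, the stage-2 adder just does a 16-bit + 8-bit addition, separate from the 8-bit ALU (Python, hypothetical helper name):

```python
def effective_address(base, index):
    """Stage 2: 16-bit base (pointer or absolute operand) plus 8-bit
    index register, 16-bit result; page crossings cost nothing extra."""
    return (base + (index & 0xFF)) & 0xFFFF

# lda $12F0,X with X = $20 crosses a page but still needs one add:
assert effective_address(0x12F0, 0x20) == 0x1310
```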
Stage 3: Execute stage, and status flags.
This stage contains the normal 8-bit ALU and performs all the arithmetic and logic of the instructions. One operand of the ALU is the register fetched in stage 2; the other operand is the data line from the memory read. It keeps the status flags and updates them, which makes this stage the best candidate to execute branches as well.
Hardware resources for stage 3: 1 8-bit ALU, several flip-flops for the status register, 1 16-bit+8-bit adder with 16-bit result for branches (could possibly be merged with the ALU)
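For example, a binary-mode ADC in stage 3 would combine the prefetched accumulator with the byte loaded in stage 2 and update the flags (Python sketch; decimal mode omitted):

```python
def alu_adc(a, operand, carry_in):
    """Stage 3: 8-bit ADC. One operand is the register prefetched in
    stage 2, the other is the byte on the memory data line."""
    total = a + operand + carry_in
    result = total & 0xFF
    flags = {
        'C': total > 0xFF,                                # unsigned carry out
        'Z': result == 0,
        'N': bool(result & 0x80),
        'V': bool(~(a ^ operand) & (a ^ result) & 0x80),  # signed overflow
    }
    return result, flags

# $50 + $50 overflows signed arithmetic: N and V set, no carry.
res, f = alu_adc(0x50, 0x50, 0)
assert res == 0xA0 and f['V'] and f['N'] and not f['C']
```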
Stage 4 : Register and memory writeback
This stage writes back the results of the instruction into the register file and/or into memory (it is to be determined whether both will be required at the same time or not).
Hardware resources for stage 4: 1 write port to memory, 1 write port to the register file
This stage could be merged with stage 3, but chances are that splitting it like this shortens the critical path.
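Tying the four stages together, here is a toy Python schedule (my own notation, not real hardware) showing how independent instructions would flow down the pipe, one completing per cycle once it is full — assuming no stalls or hazards:

```python
def schedule(program, n_stages=4):
    """Return, for each cycle, which instruction (if any) occupies each
    of the 4 stages; assumes no stalls and no data hazards."""
    rows = []
    for cycle in range(len(program) + n_stages - 1):
        rows.append([program[cycle - s] if 0 <= cycle - s < len(program)
                     else None
                     for s in range(n_stages)])
    return rows

# On cycle 3, inc $10 retires from stage 4 (writeback) while
# lda $20,X is already in stage 3 (ALU/flags):
rows = schedule(["inc $10", "lda $20,X", "sta ($30),Y"])
assert rows[3][3] == "inc $10" and rows[3][2] == "lda $20,X"
```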
If someone has any optimization or improvement, I'm all ears.
So the total external resources we need are:
- 2 register read ports
- 2 specific zero page read ports
- 1 general purpose memory read port
- 1 register write port
- 1 general purpose memory write port
I see two possible approaches for implementing this.
1) Zero page is registers approach
The register file would be 260 bytes (ZP, A, X, Y, SP). It would have 1 write port and 4 read ports. It would be quite wasteful of the FPGA's SRAM, as it will very likely synthesize into 2 KB of SRAM (260 rounded up to the next power of two, times 4). However, this is less SRAM than my current 32-bit ARM core.
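As a sanity check on that sizing (assuming, as the "* 4" implies, one SRAM copy per read port):

```python
# Approach 1 register-file sizing (assumption: the synthesizer
# replicates the SRAM once per read port, hence the "* 4").
entries = 256 + 4                          # ZP + A, X, Y, SP = 260 bytes
rounded = 1 << (entries - 1).bit_length()  # round up to power of two -> 512
total = rounded * 4                        # 4 read ports -> 2048 bytes = 2 KB
print(rounded, total)
```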
Some logic in the memory read and write paths detects whenever the high byte of the address is null. In this case, the memory can't be read/written as normal, and the register file should be accessed instead. I don't know exactly how this would be done, but it does not matter if this process costs additional cycles, as long as the logic is there.
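The address-decoding idea can be sketched as follows (hypothetical helper, in Python for brevity):

```python
def route_access(addr):
    """Approach 1: an access whose high address byte is zero targets
    the register file; anything else goes to the normal memory port."""
    return "register file" if (addr >> 8) == 0 else "memory"

assert route_access(0x0042) == "register file"
assert route_access(0x0142) == "memory"
```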
There are two separate masters for reading and writing data to/from non-zero-page memory. Either this is done at the D-Cache level, or there are simply two actually different masters on the external bus. In any case, the write master should be given higher priority than the read one, and the pipeline would be stalled whenever a read should happen in stage 2 while a write is happening in stage 4.
2) Zero page is cached approach
In this approach, the register file is only 4 bytes (A, X, Y, SP), plus perhaps a couple of extra temp registers used exclusively for microcoding purposes, and it would have 2 read ports. This would ease up on FPGA resources for the register file A LOT.
However, ZP has to be stored somewhere. It would not be acceptable to have it on a different chip, and even if it were, we would need 3 data masters, which does not make any sense and has little chance of being efficient (tell me if I am wrong, please). So this means a D-Cache is mandatory (well, not technically, but it would not make sense not to have one).
The D-Cache would have 3 read ports and 1 write port. Some strategy would be needed to make sure zero page is always, or at least preferentially, cached, although it would end up like this with normal replacement algorithms anyway. It would also be smart to ensure $fff0-$ffff are cached, as the interrupt vectors are there; this would grant fast access to them.
A cache coherency mechanism would allow any memory, including zero page, to be shared with the rest of the system. If the I-Cache contains a coherency mechanism too, executing self-modifying code will be possible. However, the main bus has to support this.
Conclusion
I think a super-efficient 6502 is perfectly possible with a bit of thought put into it. I also forgot to mention that some aggressive forwarding will have to be done.
I think approach 2) is probably cleaner, and closer to what the 6502 originally wanted to be (while being much more efficient). However, I feel like approach 1) is easier to implement, as there is no need for a D-Cache, and it is more independent from the external bus, as long as a multi-master bus with fixed priorities is there. Solution 2) needs a powerful bidirectional D-Cache with a coherency mechanism, which is easier said than done.
So if it were me, I'd try to do it like 1) and see the results. If it happens to be successful, it would not be hard to turn it into 2) at a later stage. If anyone has suggestions or comments, I am all ears.
HOWEVER, I do not want people trolling or starting flame wars as was done in the first 6 pages of this thread. If anyone does, I will ignore their comments, consider them to be stupid and not worth listening to, and will implement my core my own way.
Finally, I am not doing this against the original 6502; I just personally want to see how efficient the 6502 instruction set can become.