Hello everyone.
As a personal project I am trying to create a fast 6502-like CPU.
This "evolution" (a prototype of one) of the 6502 drops the 8-bit opcode scheme. Sorry. Even Intel, when it noticed that the 8080 was wasting half of its opcode space, decided to recode the whole thing while keeping it assembly compatible. I am trying to do the same. 99% instruction-level compatibility is the goal. Some things are simply too much effort for too little gain. The good news is that RMW instructions are not dropped, only scheduled for later. Plus, if I am able to, I will extend RMW with one instruction: DEP (DEC if the memory is not zero) for atomic operations.
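To make the intent of DEP concrete, here is a minimal behavioural sketch in Python. It only models "DEC if the memory is not zero" on a byte. The function name, the choice to return the old value, and the semaphore-style usage are assumptions for illustration; how the real instruction reports its outcome (flags, for instance) is not specified here.

```python
def dep(mem, addr):
    """Hypothetical model of DEP on a byte: decrement unless already zero.

    Returns the value that was in memory before the operation (an
    assumption made for this sketch), so a nonzero result means the
    decrement actually happened.
    """
    old = mem[addr]
    if old != 0:
        mem[addr] = old - 1   # old is 1..255 here, so no wraparound is possible
    return old


# Usage: a counting semaphore with two free slots at address 0.
mem = bytearray([2])
acquired = [dep(mem, 0) != 0 for _ in range(3)]
print(acquired, mem[0])
```

Because the test and the decrement are one RMW operation, a count of zero can never wrap to 255, which is what makes it usable as an atomic "try acquire".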
Project Goals:
- All addressing modes are preserved, some are merged.
- All implemented features are fast. There is no point in investing time to implement instructions and then recommending against their use.
- The added registers are already present in past official extensions. The two additions are B (second accumulator) and Z (zero page pointer); Z is conventionally Z, but writable.
- Instructions can operate on bytes or words when memory is involved.
- No mode bits.
- RMW implemented.
- No excessive number of memory channels required. Channels are limited to 2: one Read-Only driven by the PC, one Read/Write driven by internal state.
- Synthesizable.
- Wait for slow RAM.
- Mostly predictable, no branch prediction or speculative execution.
Dropped:
- BCD
- Unaligned 16-bit loads and stores inside the core. The core does not fault on unaligned loads/stores; it is expected that the memory controller/SoC keeps RDY low when the requested bus width is 16 bits but the address is odd, and does the work instead.
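As an illustration of what "does the work instead" could mean on the memory controller side, here is a small Python sketch. It assumes the controller splits an odd-addressed 16-bit access into two byte accesses while RDY is held low, that data is little-endian as on the 6502, and that addresses wrap at 64 KiB. None of these details are specified by the project, and the function names are made up for the example.

```python
def bus_cycles(addr, width_bits):
    """Return the list of (address, width) bus cycles the controller performs.

    Assumption: an odd-addressed 16-bit request becomes two 8-bit cycles;
    everything else goes through as a single cycle.
    """
    if width_bits == 16 and addr & 1:
        return [(addr, 8), ((addr + 1) & 0xFFFF, 8)]
    return [(addr, width_bits)]


def read(mem, addr, width_bits):
    """Assemble a little-endian value from the cycles computed above."""
    value = 0
    shift = 0
    for a, w in bus_cycles(addr, width_bits):
        if w == 8:
            chunk = mem[a]
        else:  # aligned 16-bit cycle: low byte first
            chunk = mem[a] | (mem[(a + 1) & 0xFFFF] << 8)
        value |= chunk << shift
        shift += w
    return value


mem = bytearray(0x10000)
mem[1], mem[2] = 0x34, 0x12          # a word stored at odd address 1
print(hex(read(mem, 1, 16)), len(bus_cycles(1, 16)))
```

From the core's point of view nothing changes: it simply sees RDY stay low for one extra cycle in the unaligned case.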
Unspecified:
- MHz, don't hold your breath for the speed. I stopped short of implementing a full Tomasulo algorithm in the back end.
Not in the first version, followed by the reason:
- ADC/SBC, for the moment ADD and SUB are enough, pending verification of first batch of opcodes.
- INC/DEC/ASL/LSR/ROL/ROR/DEP memory, pending verification of first batch of opcodes.
- JSR/BSR/PHR/PLR, pending verification of first batch of opcodes.
- WAI/STP/BRK/RTI, pending implementation of interrupt handler.
Design:
- Highly regular encoding scheme with 16 bit/32 bit
- 2-stage front-end pipeline + 3-stage back-end pipeline of fragmented operations (uOps, so to say)
- Yes, it is a complex design. In fact I completed the first version of the internal micro core, which I will need to modify further to implement ROR/ROL/DEP/ADC/SBC.
"Completed" here is a placeholder for "the logic gates for the full implementation are there, but there are no tests yet".
- Logic gate simulation first, because I am bad at planning projects.
- The IF stage fetches and reorders the instruction.
- The ID stage translates a 16-bit opcode into 1-3 20-bit uOps.
- The SCHED stage picks a ready uOp from one of the Reservation Stations and puts it in the Scheduled_Instruction_Register. 80 bits of MUX x2, YAY! At least that is all the work this stage needs to do.
- The ALU stage picks an A register and a B register, then selects one of B, K or T (temp storage of the RS) as the second input of the ALU. B is always forwarded to Mem. The ALU can store the register back to the Register File or write the MAR register to start a memory cycle.
- The MEM stage starts with the address in the MAR register, waits for the RDY signal and then writes to TA or TB (Temp RSA or Temp RSB).
GitHub of the project:
https://github.com/aleferri/65HE06
The current partial implementation is P1. You'll need LogicCircuit to open the file.
Expected FAQ
Why have you decided to make this monster that will run at 10 kHz at max?
First of all: this will not run at 10 kHz. The Intel Nehalem core runs at 500 kHz on 4 Virtex FPGAs. I think I cannot do worse, so at a minimum this will run at 500 kHz.
Second: memory is the bottleneck of any 6502 or 6502-like* CPU. To avoid the inevitable death of the clock frequency I had to make the MEM step as empty as possible: the memory bus sees a single 16-bit mux on TA/TB, operated by the Load Flag in front of each flip-flop. On an FPGA this MUX is already there, so MEM adds no delay at all.
If I had stopped there, there would have been a dead cycle after each memory access, so I had to find a way to make good use of that single dead cycle. I decided to reuse it for address generation of other operations. The reuse rule is:
Call:
- Main (M) the first allocated operation
- Next the other allocated operation
- u0 the last uOp of Main, uNext the candidate uOp of Next, and uMain the currently executing uOp
Provided that:
- uMain has written MAR at the end of an ALU stage (beginning a memory access)
- Main is not a store (possible memory hazard; Main is a store if either u0 or uMain writes memory)
- uNext's A register is not u0's D (WAR)
- uNext's B register is not u0's D (WAR again)
- uNext is not the final uOp of Next (if it were, execution would be out of order; a much bigger core and a much better me are needed for that)
Then:
- uNext can be executed instead of stalling
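The rule above can be written down as a single predicate. The Python sketch below is only a software model of the conditions as listed, not the actual gate-level logic; the `UOp` record and its field names are hypothetical and do not reflect the real 20-bit uOp layout.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class UOp:
    """Hypothetical uOp record used only for this sketch."""
    a: Optional[str] = None     # A source register, None if unused
    b: Optional[str] = None     # B source register, None if unused
    d: Optional[str] = None     # destination register D, None if none
    writes_mar: bool = False    # writes MAR at the end of the ALU stage
    writes_mem: bool = False    # performs a store
    is_final: bool = False      # last uOp of its operation


def can_issue_next(u_main, u0, u_next):
    """True if u_next may use the dead cycle instead of stalling."""
    if not u_main.writes_mar:                  # no memory access has started
        return False
    if u0.writes_mem or u_main.writes_mem:     # Main is a store
        return False
    if u0.d is not None and u0.d in (u_next.a, u_next.b):
        return False                           # register hazard on u0's D
    if u_next.is_final:                        # would complete out of order
        return False
    return True


# Main: a load followed by "A = A op T"; Next: address generation from Y.
u_main = UOp(a="X", writes_mar=True)
u0 = UOp(a="A", b="T", d="A", is_final=True)
next_addr = UOp(a="Y", writes_mar=True)
next_uses_a = UOp(a="A", writes_mar=True)
print(can_issue_next(u_main, u0, next_addr), can_issue_next(u_main, u0, next_uses_a))
```

In the example, the address generation that reads Y can fill the dead cycle, while the one that reads A has to wait because A is the destination of Main's last uOp.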
I never heard of this weird architecture
Because this is a botched OoO core that decayed into an InOrderExecution InterleavedPartials InOrderStore. Unless you are trying to pipeline** a heavy Accumulator / Memory ISA there is zero reason to do this. It is probable that only the PDP-8 can reuse such an architecture.
*: Unless you expand to 16 registers, drop indirect indexed mode, RMW, flags and find yourself with a RISC-V or a MIPS or a 16 bit AVR or a PIC24.
**: And gain performance. You can pipeline the 6502 just fine; you just keep stalling everything until you match the original performance with 300% of the resource utilization.
EDIT: spelling