Personal 6502 "evolution" project
Hello everyone.
As a personal project I am trying to create a fast 6502-like CPU.
This "evolution" (a prototype of one) of the 6502 drops the 8 bit opcode scheme. Sorry. Even Intel, when it noticed that the 8080 was wasting half of its opcode space, decided to recode the whole thing while keeping it assembly compatible. I am trying to do the same. 99% instruction level compatibility is a goal; some things are simply too much effort for too little gain. The good news is that RMW instructions are not dropped, only scheduled for later. Plus, if I am able to, I will extend RMW with one new instruction for atomic operations: DEP (DEC if the memory is not zero).
Project Goals:
- All addressing modes are preserved, some are merged.
- All implemented features are fast. There is no point in investing time to implement instructions and then recommending against their use.
- Registers added are already present in past official extensions. The two additions are B (second accumulator) and Z (zero page pointer), Z is conventionally Z, but writable.
- Instructions can operate on bytes or words when memory is involved.
- No mode bits.
- RMW implemented.
- No excessive number of memory channels required. Channels are limited to 2: one Read-Only driven by PC, one Read/Write driven by internal state.
- Synthesizable.
- Wait for slow RAM.
- Mostly predictable, no branch prediction or speculative execution.
Dropped:
- BCD
- Unaligned 16 bit loads and stores inside the core. The core does not fault on unaligned loads/stores; the memory controller/SoC is expected to keep RDY low when the requested bus width is 16 bit but the address is odd, and to do the work instead.
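A minimal Python sketch of how an external memory controller could take over the unaligned case. Everything here is an assumption made for illustration: the names `ByteMemory`, `controller_read16` and `controller_write16`, the little-endian byte order (borrowed from the original 6502), the 64 KiB address space, and the cost of exactly one extra bus cycle with RDY held low.

```python
class ByteMemory:
    """Byte-addressable backing store (hypothetical stand-in for the real RAM)."""

    def __init__(self, size=0x10000):
        self.cells = bytearray(size)


def controller_read16(mem, addr):
    """Return (value, bus_cycles) for a 16 bit little-endian read.

    An aligned read takes one bus cycle. For an odd address the controller
    does two byte reads itself and holds RDY low for the extra cycle,
    so the core never sees the unaligned access.
    """
    lo = mem.cells[addr & 0xFFFF]
    hi = mem.cells[(addr + 1) & 0xFFFF]
    cycles = 1 if addr % 2 == 0 else 2
    return lo | (hi << 8), cycles


def controller_write16(mem, addr, value):
    """Write a 16 bit value and return the number of bus cycles used."""
    mem.cells[addr & 0xFFFF] = value & 0xFF
    mem.cells[(addr + 1) & 0xFFFF] = (value >> 8) & 0xFF
    return 1 if addr % 2 == 0 else 2


if __name__ == "__main__":
    ram = ByteMemory()
    print(controller_write16(ram, 0x1001, 0xBEEF))  # odd address: extra cycle
    print(controller_read16(ram, 0x1001))
```

The core only waits on RDY in this model; the splitting logic lives entirely outside it, which is the point of dropping unaligned accesses from the core.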
Unspecified:
- MHz. Don't hold your breath for the speed. I stopped short of implementing a full Tomasulo algorithm in the back end.
Not in the first version, followed by the reason:
- ADC/SBC, for the moment ADD and SUB are enough, pending verification of first batch of opcodes.
- INC/DEC/ASL/LSR/ROL/ROR/DEP memory, pending verification of first batch of opcodes.
- JSR/BSR/PHR/PLR, pending verification of first batch of opcodes.
- WAI/STP/BRK/RTI, pending implementation of interrupt handler.
Design:
- Highly regular encoding scheme with 16 bit/32 bit opcodes
- 2-stage front-end pipeline + 3-stage back-end pipeline of fragmented operations (uOps, so to speak)
- Yes, it is a complex design. In fact I have completed the first version of the internal micro core, which I will need to modify further to implement ROR/ROL/DEP/ADC/SBC.
"Completed" here is a placeholder for "the logic gates for the full implementation are there, but there are no tests yet".
- Logic gate simulation first, because I am bad at planning projects.
- The IF stage fetches and reorders the instruction.
- The ID stage translates a 16 bit opcode into 1-3 20 bit uOps.
- The SCHED stage picks a ready uOp from one of the Reservation Stations and puts it in the Scheduled_Instruction_Register. 80 bits of MUX x2, YAY! At least that's all the work this stage needs to do.
- The ALU stage picks an A register and a B register, then selects one of B, K or T (temp storage of the RS) as the second ALU input. B is always forwarded to Mem. The ALU can store the result back to the Register File or write the MAR register to start a memory cycle.
- The MEM stage starts with the address in the MAR register, waits for the RDY signal and then writes to TA or TB (Temp RSA or Temp RSB).
GitHub of the project
https://github.com/aleferri/65HE06
The current partial implementation is P1. You'll need LogicCircuit to open the file.
Expected FAQ
Why have you decided to make this monster that will run at 10 kHz at most?
First of all: this will not run at 10 kHz. The Intel Nehalem core runs at 500 kHz on 4 Virtex FPGA. I think I cannot do worse, so at a minimum this will run at 500 kHz.
Second: memory is the bottleneck of any 6502 or 6502-like* CPU. To avoid the inevitable death of the clock frequency I had to make the MEM step as empty as possible: the mem bus sees a single 16 bit mux on TA/TB, operated by the Load Flag in front of each flip-flop. On an FPGA this MUX is already there, so MEM adds no delay at all.
If I had stopped there, there would have been a dead cycle after each memory access, so I had to find a way to make good use of it. I decided to reuse it for the address generation of other operations. The reuse rule is:
Call Main (M) the first allocated operation and Next the other allocated operation; u0 is the last uOp of Main, uNext is the candidate uOp of Next, and uMain is the currently executing uOp.
Provided that:
uMain has written MAR at the end of an ALU stage (beginning a memory access)
Main is not a store (possible memory hazard; Main is a store if either u0 or uMain writes mem)
uNext's A register is not u0's D (RAW: uNext would read D before u0 has written it)
uNext's B register is not u0's D (RAW again)
uNext is not the final uOp of Next (if it were, execution would be out of order; a much bigger core and a much better me would be needed for that)
Then
uNext can be executed instead of stalling
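The reuse rule above can be written down as a single predicate. This is only a sketch in Python: the `UOp` fields and the function name `can_fill_dead_cycle` are hypothetical and do not come from the project sources; they just mirror the conditions listed in the rule.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class UOp:
    """Hypothetical uOp summary holding only what the reuse rule looks at."""
    dest: Optional[str] = None    # D: destination register, if any
    src_a: Optional[str] = None   # A register read by the ALU stage
    src_b: Optional[str] = None   # B register read by the ALU stage
    writes_mar: bool = False      # starts a memory access
    writes_mem: bool = False      # the access is a store
    is_final: bool = False        # last uOp of its operation


def can_fill_dead_cycle(u_main: UOp, u0: UOp, u_next: UOp) -> bool:
    """True if u_next may run in the dead cycle instead of stalling."""
    if not u_main.writes_mar:
        return False              # no memory access has just started
    if u0.writes_mem or u_main.writes_mem:
        return False              # Main is a store: possible memory hazard
    if u0.dest is not None and u0.dest in (u_next.src_a, u_next.src_b):
        return False              # register hazard against u0's destination
    if u_next.is_final:
        return False              # would complete Next out of order
    return True


if __name__ == "__main__":
    u_main = UOp(writes_mar=True)                 # load address written to MAR
    u0 = UOp(dest="A")                            # Main finishes by writing A
    u_next = UOp(dest="T", src_a="Y", src_b="Z", writes_mar=True)
    print(can_fill_dead_cycle(u_main, u0, u_next))
```

In hardware this would be a handful of comparators and an AND gate; the Python form is only meant to make the conditions easy to check against each other.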
I never heard of this weird architecture
Because this is a botched OoO core that decayed into an InOrderExecution InterleavedPartials InOrderStore. Unless you are trying to pipeline** a heavy Accumulator / Memory ISA there is zero reason to do this. Probably only the PDP-8 could reuse such an architecture.
*: Unless you expand to 16 registers, drop indirect indexed mode, RMW and flags, and find yourself with a RISC-V or a MIPS or a 16 bit AVR or a PIC24.
**: And gain performance. You can pipeline the 6502 just fine, but you keep stalling everything until you match the original performance with 300% of the resource utilization.
EDIT: spelling
- GARTHWILSON
- Forum Moderator
- Posts: 8773
- Joined: 30 Aug 2002
- Location: Southern California
Re: Personal 6502 "evolution" project
It sounds like waiting for slow RAM won't be necessary. 10ns SRAM is pretty normal today, and I've seen down to 6ns.
http://WilsonMinesCo.com/ lots of 6502 resources
The "second front page" is http://wilsonminesco.com/links.html .
What's an additional VIA among friends, anyhow?
Re: Personal 6502 "evolution" project
Thanks for starting a new technical thread!
This sounds very exciting and I look forward to seeing developments.
As ever, the choices you've made are your own choices, and we will see where they take you. Others will of course have other preferences, and can start their own discussion threads about the merits of the projects they build.
Re: Personal 6502 "evolution" project
Update:
I completed the Opcode -> micro-Op decoder. Now I am debugging it, and debugging the Back-End too.
I will update my GitHub once I am almost sure it works, at least for the testbench (the testbench being a memcopy implementation).
I also re-read my intro for this topic. I feel I can read it because I am the designer, but it isn't very clear to anyone else.
Just to be sure, I will dive a little deeper into the details, beginning with the Back End.
Terminology
First, let's establish a classification for microprocessor cores.
1) In Order Issue Core
The core issues instructions to the EU (Execution Unit) in the order specified by the programmer.
2) In Order Execution Core
The EU must complete the instructions in the issued order.
3) In Order Commit Core
The core has to commit instructions to the shared processor state in the issued order.
4) Out Of Order Issue Core
The core employs a scheduler to issue instructions to the EU when they are ready (executable without hazards).
5) Out Of Order Execution Core
The EU does not need to wait for one instruction to complete before completing another, later instruction. Swapped instructions need to be independent of each other.
6) Out Of Order Commit Core
The commit order of completed instructions depends on completion order.
7) Speculative Core
The core can execute instructions that are not officially issued yet because of control hazards. Instructions executed this way need to be dropped if the speculation turns out to be false. If the core commits out of order, one or more ROBs (Reorder Buffers) may be needed.
CPU cores in the wild belong to at least 3 of those classes.
EDIT: grammar
Re: Personal 6502 "evolution" project
OK, now that cores are classified, we need to introduce other important terms.
Micro-Operations
A large (the Pentium Pro had 118 bit micro-ops) and minimally encoded operation that activates or deactivates functionality of one or more units inside the core. Cores that are Out Of Order in some way usually break the opcode into multiple micro-ops during the Instruction Decoding phase and then discard the opcode. If the number of micro-ops that an opcode generates is known, or is at least bounded (e.g. ARM cores generate 1 to X micro-ops where X is a single digit), the decoding can be done in a single clock. If the number of micro-ops is not bounded, the decoder must employ a microcoded control store to generate them (e.g. rep movsb on x86). Unbounded micro-op generation is also a problem for interrupts. Again, x86 has all sorts of quirks to pause and restart micro-op generation from a microcoded instruction.
Micro-Operations Scheduler
If N micro-operations are in flight but there are only M execution units, some circuit is required to select the next M micro-ops to execute. This circuit is called the scheduler: it checks which ops are ready and enqueues them in the appropriate EU (adds go to ALUs, xors go to ALUs, divs go to DIVs, loads go to LSUs, vectors go to VPUs and so on). The spot where micro-ops wait for their turn to run is the Reservation Station.
Reservation Station
This is the place where micro-ops wait for their turn and optionally catch values returning from other EUs to complete their operands if some are missing (e.g. they require the result of a not-yet-committed instruction).
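To make the scheduler and reservation station definitions concrete, here is a small Python sketch of a generic scheduler. It is a textbook-style toy and not the 65HE06 back end: the class names, the "oldest ready first" policy and the one-uOp-per-unit limit are all assumptions chosen to keep the example short.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class MicroOp:
    name: str
    unit: str                      # which kind of EU it needs, e.g. "ALU" or "LSU"
    sources: Tuple[str, ...] = ()  # registers that must be ready first
    dest: Optional[str] = None


class ReservationStation:
    """Holds micro-ops that are waiting for their operands or for a free EU."""

    def __init__(self):
        self.waiting = []

    def add(self, uop):
        self.waiting.append(uop)


def schedule(station, ready_regs, free_units):
    """Issue at most one ready micro-op per free EU, oldest first."""
    issued = []
    free = set(free_units)
    for uop in list(station.waiting):          # iterate over a copy while removing
        operands_ready = all(src in ready_regs for src in uop.sources)
        if uop.unit in free and operands_ready:
            issued.append(uop)
            free.remove(uop.unit)
            station.waiting.remove(uop)
    return issued


if __name__ == "__main__":
    rs = ReservationStation()
    rs.add(MicroOp("load", "LSU", ("X",), "T"))
    rs.add(MicroOp("add", "ALU", ("A", "T"), "A"))   # must wait for the load
    rs.add(MicroOp("xor", "ALU", ("A", "B"), "B"))
    first = schedule(rs, {"A", "B", "X"}, {"ALU", "LSU"})
    print([u.name for u in first])
```

The `add` stays in the station until `T` shows up in `ready_regs`, which models a micro-op catching a value returned by another EU.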
Re: Personal 6502 "evolution" project
Hello all,
I will be brief, because it is late.
I leave you with a diagram drawn by myself, the inevitable nightmares you will face after looking at it, and finally a promise of a better explanation of my back end design.
Re: Personal 6502 "evolution" project
Another day, another update.
Just don't expect them every day.
I completed the design and implementation of the Instruction Fetch stage.
I attach the diagram here in case someone asks himself why. I also updated the GitHub of the project with the added circuit.
- Attachments
-
- FetchDiagram.odg
- Fetch Stage diagram.
- (13.8 KiB) Downloaded 80 times
Re: Personal 6502 "evolution" project
(Thanks - but please could you attach a PDF too?)
Re: Personal 6502 "evolution" project
BigEd wrote:
(Thanks - but please could you attach a PDF too?)
- Attachments
-
- FetchStatesDiagram.pdf
- Diagram of Fetch States in PDF
- (35.15 KiB) Downloaded 98 times
Re: Personal 6502 "evolution" project
New Update:
I must have been drunk when I wrote "Fetch now works".
I found 4 bugs in 5 minutes this evening:
1) My CPU ops are 16 or 32 bit. I had not even put the required signal (pc_inc_2) in the fetch module. Now resolved.
2) I forgot about the behaviour of the hold signal. Now fixed.
3) Fetch output an invalid instruction after a PC update or a branch. Wrong timing: IR & K16 should read data from memory AFTER and not DURING a PC update. Now fixed.
4) While waiting for the updated result from the ALU, PC incremented by 2 every cycle. Now resolved.
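As a rough illustration of the PC behaviour that bugs 1, 2 and 4 are about, here is a Python sketch of a next-PC function. It is a guess at the intent and not a transcription of the fetch module: the parameter names, the priority of hold over a branch, byte addressing (so a 16 bit op advances PC by 2 and a 32 bit op by 4) and the 16 bit wrap-around are all assumptions.

```python
def next_pc(pc, *, hold, waiting_for_alu, branch_target, is_32bit):
    """Return the PC value for the next cycle.

    hold            -- the fetch stage is told to freeze (bug 2)
    waiting_for_alu -- a new PC is still being computed by the ALU (bug 4)
    branch_target   -- new PC when a branch or PC update lands, else None
    is_32bit        -- the current op occupies 32 bits instead of 16 (bug 1)
    """
    if hold or waiting_for_alu:
        return pc                      # do not creep forward while stalled
    if branch_target is not None:
        return branch_target & 0xFFFF
    step = 4 if is_32bit else 2
    return (pc + step) & 0xFFFF


if __name__ == "__main__":
    pc = 0x0010
    pc = next_pc(pc, hold=False, waiting_for_alu=False,
                 branch_target=None, is_32bit=True)
    print(hex(pc))
```

Bug 3 is about when IR and K16 are loaded and is not modelled here; a fuller model would mark the fetched word invalid on the cycle in which the PC is updated.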
While I was at it I translated the fetch stage into Verilog.
I attach the PDF of the Decoder states and transitions here. Tomorrow I hope to cause fewer disasters than today.
- Attachments
-
- Decoder_States_Diagram.pdf
- Decoder internal states and transitions diagram.
- (39.98 KiB) Downloaded 94 times
Re: Personal 6502 "evolution" project
It's me again.
The following program now correctly gets decoded by the Front End:
_reset: LDA:Y #100          ; Y = number of bytes to copy
        LDA:A #$C000
        STA:A $A0           ; source pointer at $A0
        LDA:A #$B000
        STA:A $A2           ; destination pointer at $A2
_loop:  SUB:Y #1 ?
        LDA:A byte ($A0),Y
        STA:A byte ($A2),Y
        BNE _loop
        LDA:A #0
        STA:A byte $FFFF    ; stop the clock
Without manually setting the write of status flags, this sequence stalls the CPU (predicated operation, but flags not ready). I assembled the offset wrongly, so PC jumps back to 0x12 instead of 0x20 as it should. Not that big a problem. Tomorrow I will try to put everything together (FrontEnd + BackEnd) to discover more bugs. If I am lucky I will be able to test performance too.
If everything goes fine, I will translate everything into Verilog next week.
EDIT: moved sentence up.
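For readers who do not want to decode the listing by hand, this is a Python model of what the test program appears to do. It rests on several assumptions: `($A0),Y` is read as 6502-style indirect indexed addressing, BNE is taken to test the flags written by `SUB:Y`, and the final store to $FFFF (which stops the simulated clock) is left out. The function name `memcopy_model` is made up for this sketch.

```python
def memcopy_model(mem, src=0xC000, dst=0xB000, count=100):
    """Copy `count` bytes (count >= 1) from src to dst, highest index first."""
    y = count                          # LDA:Y #100
    while True:
        y -= 1                         # SUB:Y #1, with the flag write requested
        zero = (y == 0)                # Z flag as BNE will see it
        mem[dst + y] = mem[src + y]    # LDA:A byte ($A0),Y / STA:A byte ($A2),Y
        if zero:                       # BNE _loop falls through once Y hits 0
            break
    return y


if __name__ == "__main__":
    ram = bytearray(0x10000)
    ram[0xC000:0xC064] = bytes(range(100))
    memcopy_model(ram)
    print(ram[0xB000:0xB064] == ram[0xC000:0xC064])
```

Because Y is decremented before the copy and tested after it, the loop touches indices 99 down to 0, so exactly 100 bytes are moved.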
Re: Personal 6502 "evolution" project
There are some bugs in LogicCircuit regarding bits. With the complexity of the back end, some bits go missing, then reappear at the next cycle. I also found a few bugs in the decoding and a timing miscalculation: I accidentally overwrite a value in the constant register one cycle before it is needed, causing a serious bug. I will double K and bring it forward to the current uOp module. Since I would need to rewrite a moderate part of the back end anyway, I will move everything to Verilog and continue there.
Re: Personal 6502 "evolution" project
Fetching unaligned opcodes is bad. It requires 3 port memory and doubles the amount of logic (2 adders, 1 for PC, 1 for Next).
It could have been worse: I only have 16b/32b ops. The original 6502 with 8b/16b/24b would have been a mess.
The good news is that I mostly need to port the decoder to Verilog to complete the project. While I was reworking the back end I added all the logic required to handle INC, DEP, ROL, ROR, ADC, SBC.
PUSH and POP are doable without major reworking.
PUSH is "MOV R -> (S+0); MOV S-1 -> S;" (2 uOps). POP is "MOV (S+1) -> T; MOV S+1 -> S; MOV T -> R;" (3 uOps).
Any RMW is either "MOV (PIDX, offset) -> T; MOV (IDX, T) -> T; MOV OP(T) -> implicit;" (3 uOps) or "MOV (IDX, offset) -> T; MOV OP(T) -> implicit" (2 uOps).
BSR can be translated to PUSH(PC) because branches with immediates are executed directly in the Fetch_Unit.
RTS is POP(PC).
JSR on the other hand is not so easy. PUSH requires 2 uOps and the addressing mode requires 1 to 3 uOps, so I would need to translate it into 2 instructions: PUSH(PC) followed by LOAD(PC).
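The PUSH and POP expansions above can be tried out with a toy interpreter. The sketch below is in Python and is only a model: the tuple encoding of uOps and the names `expand` and `run` are invented here, and each stack slot is treated as a single memory cell, which sidesteps the byte/word question.

```python
def expand(op, reg):
    """Translate PUSH/POP of `reg` into the uOp sequences quoted in the post."""
    if op == "PUSH":
        return [("store", reg, "S", 0),    # MOV R -> (S+0)
                ("add", "S", "S", -1)]     # MOV S-1 -> S
    if op == "POP":
        return [("load", "T", "S", 1),     # MOV (S+1) -> T
                ("add", "S", "S", 1),      # MOV S+1 -> S
                ("move", reg, "T")]        # MOV T -> R
    raise ValueError(f"unsupported operation: {op}")


def run(uops, regs, mem):
    """Execute uOps in order against a register dict and a memory dict."""
    for uop in uops:
        kind = uop[0]
        if kind == "store":
            _, src, base, offset = uop
            mem[regs[base] + offset] = regs[src]
        elif kind == "load":
            _, dst, base, offset = uop
            regs[dst] = mem[regs[base] + offset]
        elif kind == "add":
            _, dst, src, constant = uop
            regs[dst] = regs[src] + constant
        elif kind == "move":
            _, dst, src = uop
            regs[dst] = regs[src]
        else:
            raise ValueError(f"unknown uOp: {kind}")


if __name__ == "__main__":
    regs = {"A": 0x42, "S": 0x01FF, "T": 0}
    mem = {}
    run(expand("PUSH", "A"), regs, mem)
    regs["A"] = 0
    run(expand("POP", "A"), regs, mem)
    print(hex(regs["A"]), hex(regs["S"]))
```

Under this model BSR would be `expand("PUSH", "PC")` and RTS would be `expand("POP", "PC")`, matching the description above.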
I also decided to drop DEC. DEP will work the same except when the operand is zero, in which case DEP refuses to decrement it. If requested, DEP sets the A flag (Acquired) if the operand is not zero.
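A small Python sketch of the DEP semantics as described, plus the kind of counting-semaphore use the atomic behaviour seems to be aimed at. The function names and the 8 bit default width are assumptions; the real instruction works on memory in one RMW operation, which plain Python cannot show.

```python
def dep(value, bits=8):
    """Model DEP: decrement unless zero. Returns (new_value, acquired)."""
    mask = (1 << bits) - 1
    value &= mask
    if value == 0:
        return 0, False          # refuse to decrement, A flag stays clear
    return value - 1, True       # decremented, A flag (Acquired) set


def try_acquire(memory, address):
    """Take one unit from a counting semaphore stored at `address`."""
    memory[address], acquired = dep(memory[address])
    return acquired


if __name__ == "__main__":
    ram = {0x0200: 2}            # two free slots
    print([try_acquire(ram, 0x0200) for _ in range(3)])
```

With a plain DEC the third attempt would wrap the counter around instead of failing, which is why the zero check has to be part of the same instruction.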
Re: Personal 6502 "evolution" project
Up.
Still working on the decode stage. I hope to avoid generating many muxes.