6502.org Forum

Posted: Mon May 03, 2021 12:01 am
Joined: Sun May 02, 2021 10:24 am
Posts: 21
Hello everyone.

As a personal project I am trying to create a fast 6502-like CPU.
This "evolution" of the 6502 (a prototype of one, anyway) drops the 8-bit opcode scheme. Sorry. Even Intel, when it noticed that the 8080 was wasting half of its opcode space, decided to recode the whole thing while keeping it assembly compatible. I am trying to do the same. The goal is 99% instruction-level compatibility; some things are simply too much effort for too little gain. The good news is that RMW instructions are not dropped, only scheduled for later. Plus, if I am able to, I will extend RMW with one instruction for atomic operations: DEP (decrement memory if it is not zero).

Project Goals:
- All addressing modes are preserved, some are merged.
- All implemented features are fast. There is no point in investing time to implement instructions and then recommending against their use.
- Added registers are already present in past official extensions. The two additions are B (second accumulator) and Z (zero page pointer); Z is conventionally Z, but writable.
- Instructions can operate on bytes or words when memory is involved.
- No mode bits.
- RMW implemented.
- No excessive number of memory channels required. Channels are limited to 2: one read-only, driven by PC, and one read/write, driven by internal state.
- Synthesizable.
- Waits for slow RAM.
- Mostly predictable, no branch prediction or speculative execution.

Dropped:
- BCD
- Unaligned 16-bit loads and stores inside the core. The core does not fault on unaligned loads/stores; the memory controller/SoC is expected to keep RDY low when the requested bus width is 16 bits but the address is odd, and to do the work itself.
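A rough Python sketch of that division of labour, seen from the memory controller's side. Everything here is an assumption made for illustration: the `MemoryController` class, the `rdy_low_cycles` counter, little-endian byte order, address wrap-around at 64 KiB, and the cost of exactly one extra stalled cycle per unaligned access. None of it is taken from the project.

```python
class MemoryController:
    """Toy model (hypothetical): 64 KiB byte-addressed RAM behind a 16-bit port."""

    def __init__(self):
        self.ram = bytearray(0x10000)
        self.rdy_low_cycles = 0  # cycles the core was held off through RDY

    def _stall_if_unaligned(self, addr):
        # Odd address on a 16-bit request: the controller needs a second
        # bus access, so it keeps RDY low for one extra cycle (assumed cost).
        if addr & 1:
            self.rdy_low_cycles += 1

    def read16(self, addr):
        addr &= 0xFFFF
        self._stall_if_unaligned(addr)
        lo = self.ram[addr]
        hi = self.ram[(addr + 1) & 0xFFFF]
        return lo | (hi << 8)  # little-endian, as on the 6502

    def write16(self, addr, value):
        addr &= 0xFFFF
        self._stall_if_unaligned(addr)
        self.ram[addr] = value & 0xFF
        self.ram[(addr + 1) & 0xFFFF] = (value >> 8) & 0xFF


mc = MemoryController()
mc.write16(0x1001, 0xBEEF)     # unaligned store: controller does the split
value = mc.read16(0x1001)      # unaligned load: same
mc.write16(0x2000, 0x1234)     # aligned store: no stall
```

The core itself never sees the difference: it issues one 16-bit request and simply waits for RDY, which is the point of pushing the work outside.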

Unspecified:
- MHz. Don't hold your breath for the speed. I stopped short of implementing a full Tomasulo algorithm in the back end.

Not in the first version, followed by the reason:
- ADC/SBC, for the moment ADD and SUB are enough, pending verification of first batch of opcodes.
- INC/DEC/ASL/LSR/ROL/ROR/DEP memory, pending verification of first batch of opcodes.
- JSR/BSR/PHR/PLR, pending verification of first batch of opcodes.
- WAI/STP/BRK/RTI, pending implementation of interrupt handler.

Design:
- Highly regular encoding scheme with 16-bit and 32-bit instructions.
- 2-stage front-end pipeline + 3-stage back-end pipeline of fragmented operations (uOps, so to say).
- Yes, it is a complex design. In fact I have completed the first version of the internal micro core, which I will need to modify further to implement ROR/ROL/DEP/ADC/SBC.
"Completed" here is a placeholder for "the logic gates for the full implementation are there, but there are no tests yet".
- Logic gate simulation first, because I am bad at planning projects.
- The IF stage fetches and reorders the instruction.
- The ID stage translates a 16-bit opcode into 1-3 20-bit uOps.
- The SCHED stage picks a ready uOp from one of the Reservation Stations and puts it in the Scheduled_Instruction_Register. 80 bits of MUX x2, YAY! At least that's all the work this stage needs to do.
- The ALU stage picks an A register and a B register, then selects one of B, K or T (temp storage of the RS) as the second input of the ALU. B is always forwarded to MEM. The ALU can store the result back to the Register File or write the MAR register to start a memory cycle.
- The MEM stage starts with the address in the MAR register, waits for the RDY signal and then writes to TA or TB (Temp RSA or Temp RSB).

Github of the project

https://github.com/aleferri/65HE06
The current partial implementation is P1. You'll need LogicCircuit to open the file.

Expected FAQ

Why have you decided to make this monster that will run at 10 kHz at most?
First of all: this will not run at 10 kHz. The Intel Nehalem core runs at 500 kHz on 4 Virtex FPGAs. I think I cannot do worse, so at a minimum this will run at 500 kHz.
Second: memory is the bottleneck of any 6502 or 6502-like* CPU. To avoid the inevitable death of the clock frequency I had to make the MEM step as empty as possible: the memory bus sees a single 16-bit mux on TA/TB, operated by the Load Flag in front of each flip-flop. On an FPGA this mux is already there, so MEM adds no delay at all.
If I had stopped there, there would have been a dead cycle after each memory access, so I had to find a way to make good use of it. I decided to reuse it for the address generation of other operations. The reuse rule is as follows.
Call Main (M) the first allocated operation and Next the other allocated operation; u0 is the last uOp of Main, uNext is the candidate uOp of Next, and uMain is the currently executing uOp.
Provided that:
- uMain has written MAR at the end of an ALU stage (beginning a memory access)
- Main is not a store (possible memory hazard; Main is a store if either u0 or uMain writes memory)
- the A register of uNext is not the D (destination) register of u0 (RAW hazard: uNext would read a value that u0 has not written yet)
- the B register of uNext is not the D register of u0 (RAW again)
- uNext is not the final uOp of Next (if it were, execution would be out of order; a much bigger core and a much better me are needed for that)
Then:
- uNext can be executed instead of stalling
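The reuse rule can be written down as a single predicate. The following Python sketch is only an illustration: the `UOp` fields, the function name `can_fill_dead_cycle` and the example uOp sequences are assumptions, not the project's actual uOp format.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class UOp:
    """Hypothetical uOp summary: only the fields the reuse rule looks at."""
    src_a: Optional[str]       # A-side source register, None if unused
    src_b: Optional[str]       # B-side source register, None if unused
    dest: Optional[str]        # D register, None if no register is written
    writes_mar: bool = False   # starts a memory access at the end of ALU
    writes_mem: bool = False   # the memory access is a store
    is_final: bool = False     # last uOp of its instruction


def can_fill_dead_cycle(u_main: UOp, u0: UOp, u_next: UOp) -> bool:
    """True if u_next may run in the dead cycle behind u_main's memory access."""
    if not u_main.writes_mar:                 # no memory access, no dead cycle
        return False
    if u0.writes_mem or u_main.writes_mem:    # Main is a store: memory hazard
        return False
    if u0.dest is not None and u0.dest in (u_next.src_a, u_next.src_b):
        return False                          # u_next needs u0's result
    if u_next.is_final:                       # would complete out of order
        return False
    return True


# Assumed example: Main loads through a pointer, Next starts its own address
# generation. Register names TA/TB/Z/Y/A follow the thread, the split is a guess.
u_main = UOp(src_a="Z", src_b=None, dest="TA", writes_mar=True)
u0 = UOp(src_a="TA", src_b="Y", dest="A", writes_mar=True, is_final=True)
addr_gen = UOp(src_a="Z", src_b=None, dest="TB", writes_mar=True)
needs_a = UOp(src_a="A", src_b=None, dest="TB")

ok = can_fill_dead_cycle(u_main, u0, addr_gen)      # independent address work
blocked = can_fill_dead_cycle(u_main, u0, needs_a)  # depends on u0's result
```

In hardware this is a handful of comparators and AND gates in the scheduler, which is why the rule is kept this restrictive.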

I never heard of this weird architecture
Because this is a botched OoO core that decayed into an InOrderExecution InterleavedPartials InOrderStore design. Unless you are trying to pipeline** a heavy accumulator/memory ISA there is zero reason to do this. Probably only the PDP-8 could reuse such an architecture.

*: Unless you expand to 16 registers, drop indirect indexed mode, RMW and flags, and find yourself with a RISC-V or a MIPS or a 16-bit AVR or a PIC24.
**: And gain performance, that is. You can pipeline the 6502 just fine: you keep stalling everything until you match the original performance with 300% of the resource utilization.

EDIT: spelling


Posted: Mon May 03, 2021 12:55 am
Joined: Fri Aug 30, 2002 1:09 am
Posts: 8543
Location: Southern California
It sounds like waiting for slow RAM won't be necessary. 10ns SRAM is pretty normal today, and I've seen down to 6ns.

_________________
http://WilsonMinesCo.com/ lots of 6502 resources
The "second front page" is http://wilsonminesco.com/links.html .
What's an additional VIA among friends, anyhow?


Posted: Mon May 03, 2021 7:23 am
Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10985
Location: England
Thanks for starting a new technical thread!

This sounds very exciting and I look forward to seeing developments.

As ever, the choices you've made are your own choices, and we will see where they take you. Others will of course have other preferences, and can start their own discussion threads about the merits of the projects they build.


Posted: Tue May 04, 2021 2:36 pm
Joined: Sun May 02, 2021 10:24 am
Posts: 21
Update:
I completed the opcode -> micro-op decoder. Now I am debugging it, and debugging the Back End too.
I will update my GitHub once I am almost sure it works, at least for the testbench (the testbench being a memcopy implementation).

I also re-read my intro for this topic. I feel I can read it because I am the designer, but it isn't very clear to anyone else.

Just to be sure, I will dive a little more into the details, beginning with the Back End.

Terminology
First, let's establish a classification for microprocessor cores.

1) In Order Issue Core
The core issues instructions to the EU (Execution Unit) in the order specified by the programmer.

2) In Order Execution Core
The EU must complete instructions in the order they were issued.

3) In Order Commit Core
The core has to commit instructions to the shared processor state in the order they were issued.

4) Out Of Order Issue Core
The core employs a scheduler to issue instructions to the EU when they are ready (executable without hazards).

5) Out Of Order Execution Core
The EU does not need to wait for one instruction to complete before completing another, later instruction. Swapped instructions need to be independent of each other.

6) Out Of Order Commit Core
The commit order of completed instructions depends on their completion order.

7) Speculative Core
The core can execute instructions that are not officially issued yet because of control hazards. Instructions executed this way need to be dropped if the speculation turns out to be false. If the core commits out of order, one or more ROBs (Reorder Buffers) may be needed.

CPU cores in the wild belong to at least 3 of those classes.

EDIT: grammar


Posted: Tue May 04, 2021 7:23 pm
Joined: Sun May 02, 2021 10:24 am
Posts: 21
OK, now that cores are classified, we need to introduce some other important terms.

Micro-Operations
A large (the Pentium Pro had 118-bit micro-ops) and minimally encoded operation that activates or deactivates the functionality of one or more units inside the core. Cores that are Out Of Order in some way usually break the opcode into multiple micro-ops during the Instruction Decoding phase and then discard it. If the number of micro-operations that an opcode generates is known, or is at least bounded (e.g. ARM cores generate 1 to X micro-ops, where X is a single digit), the decoding can be done in a single clock. If the number of micro-ops is not bounded, the decoder must employ a microcoded control store to generate them (e.g. rep movsb on x86). Unbounded micro-op generation is also a problem for interrupts. Again, x86 has all sorts of quirks to pause and restart micro-op generation from a microcoded instruction.

Micro-Operations Scheduler
If N micro-operations are in flight, but there are only M execution units, some circuit is required to select the next M micro-ops to execute. This circuit is called the scheduler; it checks which ops are ready and enqueues them in the appropriate EU (adds go to ALUs, xors go to ALUs, divs go to DIVs, loads go to LSUs, vectors go to VPUs and so on). The spot where micro-ops wait for their turn to run is the Reservation Station.

Reservation Station
This is the place where micro-ops wait for their turn and optionally catch values returning from other EUs to complete their operands if some are missing (e.g. they require the result of a not-yet-committed instruction).
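To make the scheduler and Reservation Station roles concrete, here is a small Python model. It is a generic textbook-style sketch, not this project's back end: the `Entry` and `ReservationStation` names, the tag-based `broadcast` and the oldest-first policy are all assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class Entry:
    """A waiting micro-op (hypothetical layout)."""
    name: str
    unit: str                                        # EU kind it needs: "ALU", "LSU", ...
    operands: dict = field(default_factory=dict)     # operand -> value, None if missing
    waiting_on: dict = field(default_factory=dict)   # operand -> tag of the producer

    def ready(self):
        return all(v is not None for v in self.operands.values())


class ReservationStation:
    def __init__(self):
        self.entries = []

    def add(self, entry):
        self.entries.append(entry)

    def broadcast(self, tag, value):
        """An EU finished: every operand waiting on `tag` catches the value."""
        for e in self.entries:
            for operand, producer in list(e.waiting_on.items()):
                if producer == tag:
                    e.operands[operand] = value
                    del e.waiting_on[operand]

    def schedule(self, free_units):
        """Scheduler: issue at most one ready entry per free EU, oldest first."""
        free = list(free_units)
        issued = []
        for e in list(self.entries):
            if e.ready() and e.unit in free:
                free.remove(e.unit)
                self.entries.remove(e)
                issued.append(e)
        return issued


rs = ReservationStation()
rs.add(Entry("add", "ALU", {"a": 1, "b": None}, {"b": "load0"}))  # waits for load0
rs.add(Entry("xor", "ALU", {"a": 3, "b": 5}))
rs.add(Entry("load1", "LSU", {"addr": 0x10}))

first = [e.name for e in rs.schedule(["ALU", "LSU"])]   # add is skipped, not ready
rs.broadcast("load0", 7)                                # load0's result returns
second = [e.name for e in rs.schedule(["ALU", "LSU"])]  # now add can issue
```

Note that "add" is older than "xor" but issues later: that is Out Of Order Issue in the classification from the previous post.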


Posted: Wed May 05, 2021 12:35 am
Joined: Sun May 02, 2021 10:24 am
Posts: 21
Hello all,
I will be brief, because it is late.
I leave you with a diagram I drew myself, the inevitable nightmares you will have after looking at it, and finally a promise of a better explanation of my back end design.


Attachments:
File comment: Back End diagram
Back End.png [ 37.98 KiB ]

Posted: Wed May 05, 2021 7:25 pm
Joined: Sun May 02, 2021 10:24 am
Posts: 21
Another day, another update.
Just don't expect them everyday.

I completed the design and implementation of the Instruction Fetch stage.

I attach the diagram here in case anyone is wondering why. I also updated the project's GitHub with the added circuit.


Attachments:
File comment: Fetch Stage diagram.
FetchDiagram.odg [13.8 KiB]

Posted: Wed May 05, 2021 9:12 pm
Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10985
Location: England
(Thanks - but please could you attach a PDF too?)


Posted: Wed May 05, 2021 9:51 pm
Joined: Sun May 02, 2021 10:24 am
Posts: 21
BigEd wrote:
(Thanks - but please could you attach a PDF too?)


Yes, sorry.


Attachments:
File comment: Diagram of Fetch States in PDF
FetchStatesDiagram.pdf [35.15 KiB]

Posted: Wed May 05, 2021 10:03 pm
Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10985
Location: England
Thanks!


Posted: Thu May 06, 2021 12:17 am
Joined: Sun May 02, 2021 10:24 am
Posts: 21
New Update:
I must have been drunk when I wrote "Fetch now works".

I found 4 bugs in 5 minutes this evening:
1) My CPU ops are 16 or 32 bits wide. I had not even put the required signal (pc_inc_2) in the fetch module. Now resolved.
2) I forgot about the behaviour of the hold signal. Now fixed.
3) Fetch output an invalid instruction after a PC update or a branch. Wrong timing: IR & K16 should read data from memory AFTER a PC update, not DURING it. Now fixed.
4) While waiting for the updated result from the ALU, PC was incremented by 2 every cycle. Now resolved.
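As a way to pin down the intended behaviour behind bugs 1, 2 and 4, here is a tiny Python model of the PC update. It is a guess at the fixed behaviour, not the actual fetch module: the function `next_pc`, its parameter names, the 16-bit PC width and the priority of a redirect over hold are all assumptions. Bug 3 is about when IR and K16 sample memory and is not modelled here.

```python
def next_pc(pc, op_is_32bit, hold, waiting_for_alu, redirect=None):
    """Return the PC for the next cycle (hypothetical model of the fetch stage).

    pc              -- current program counter, in bytes
    op_is_32bit     -- True if the op being fetched carries a 16-bit constant
    hold            -- back end asked the front end to stop
    waiting_for_alu -- a new PC is being computed by the ALU
    redirect        -- new PC from a branch or PC update, None if there is none
    """
    if redirect is not None:
        return redirect & 0xFFFF      # assumed: a redirect wins over everything
    if hold or waiting_for_alu:
        return pc                     # bugs 2 and 4: PC must not move
    step = 4 if op_is_32bit else 2    # bug 1: 32-bit ops need the extra +2
    return (pc + step) & 0xFFFF


after_short = next_pc(0x0010, op_is_32bit=False, hold=False, waiting_for_alu=False)
after_long = next_pc(0x0010, op_is_32bit=True, hold=False, waiting_for_alu=False)
held = next_pc(0x0010, op_is_32bit=True, hold=True, waiting_for_alu=False)
stalled = next_pc(0x0010, op_is_32bit=False, hold=False, waiting_for_alu=True)
jumped = next_pc(0x0010, op_is_32bit=False, hold=False, waiting_for_alu=False,
                 redirect=0x0020)
```

A model like this can serve as the reference side of a testbench: drive the same inputs into the Verilog fetch stage and compare PCs cycle by cycle.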

While I was at it I translated the fetch stage into Verilog.

I attach here the PDF of the decoder states and transitions. Tomorrow I hope to cause fewer disasters than today.


Attachments:
File comment: Decoder internal states and transitions diagram.
Decoder_States_Diagram.pdf [39.98 KiB]

Posted: Thu May 06, 2021 10:13 pm
Joined: Sun May 02, 2021 10:24 am
Posts: 21
It's me again.

The following program now correctly gets decoded by the Front End:
Code:
_reset:     LDA:Y   #100
            LDA:A   #$C000
            STA:A   $A0
            LDA:A   #$B000
            STA:A   $A2
_loop:      SUB:Y   #1 ?
            LDA:A   byte ($A0),Y
            STA:A   byte ($A2),Y
            BNE     _loop
            LDA:A   #0
            STA:A   byte $FFFF ; stop the clock


Without manually setting the write of status flags, this sequence stalls the CPU (predicated operation, but flags not ready). I assembled the offset wrongly, so PC jumps back to 0x12 instead of 0x20 as it should. Not that big a problem. Tomorrow I will try to put everything together (Front End + Back End) to discover more bugs. If I am lucky I will be able to test performance too.
If everything goes fine, I will translate everything into Verilog next week.
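For checking the simulated core against something, a software model of what the test program should do to memory is handy. The Python below is a reading of the listing, with several assumptions stated loudly: the trailing "?" is taken to mean "write the flags", LDA and STA are assumed not to touch the flags, pointers at $A0 and $A2 are assumed to be stored as little-endian 16-bit words, and `byte (ptr),Y` is assumed to be a byte-wide indirect indexed access. The helper names are made up.

```python
def read16(mem, addr):
    return mem[addr] | (mem[addr + 1] << 8)


def write16(mem, addr, value):
    mem[addr] = value & 0xFF
    mem[addr + 1] = (value >> 8) & 0xFF


def run_memcopy_model(mem):
    """Apply the intended effect of the test program to `mem` (a bytearray)."""
    y = 100                                   # LDA:Y #100
    write16(mem, 0xA0, 0xC000)                # LDA:A #$C000 / STA:A $A0
    write16(mem, 0xA2, 0xB000)                # LDA:A #$B000 / STA:A $A2
    while True:
        y = (y - 1) & 0xFFFF                  # SUB:Y #1 ?   (assumed: sets Z)
        zero = (y == 0)
        a = mem[(read16(mem, 0xA0) + y) & 0xFFFF]     # LDA:A byte ($A0),Y
        mem[(read16(mem, 0xA2) + y) & 0xFFFF] = a     # STA:A byte ($A2),Y
        if zero:                              # BNE _loop not taken when Z is set
            break
    mem[0xFFFF] = 0                           # LDA:A #0 / STA:A byte $FFFF
    return mem


mem = bytearray(0x10000)
mem[0xC000:0xC064] = bytes(range(100))        # 100 source bytes at $C000
run_memcopy_model(mem)
copied = bytes(mem[0xB000:0xB064])            # destination block at $B000
```

Under these assumptions the loop copies offsets 99 down to 0, so exactly 100 bytes move from $C000 to $B000.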

EDIT: moved sentence up.


Posted: Fri May 07, 2021 6:20 pm
Joined: Sun May 02, 2021 10:24 am
Posts: 21
There are some bugs in LogicCircuit regarding bits. With the complexity of the back end, some bits go missing, then reappear at the next cycle. I also found a few bugs in the decoding and a timing miscalculation: I accidentally overwrite a value in the constant register one cycle before it is needed, causing a serious bug. I will double K and bring it forward to the current uOp module. Since I would need to rewrite a moderate part of the back end anyway, I will move everything to Verilog and continue there.


Posted: Sun May 09, 2021 1:03 pm
Joined: Sun May 02, 2021 10:24 am
Posts: 21
Fetching unaligned opcodes is bad. It requires 3-port memory and doubles the amount of logic (2 adders: 1 for PC, 1 for Next).
It could have been worse: I only have 16-bit/32-bit ops. The original 6502 with 8-bit/16-bit/24-bit instructions would have been a mess.

The good news is that I mostly just need to port the decoder to Verilog to complete the project. While I was reworking the back end I added all the logic required to handle INC, DEP, ROL, ROR, ADC and SBC.
PUSH and POP are doable without major rework.
PUSH is "MOV R -> (S+0); MOV S-1 -> S;" (2 uOps). POP is "MOV (S+1) -> T; MOV S+1 -> S; MOV T -> R;" (3 uOps).
Any RMW is either "MOV (PIDX, offset) -> T; MOV (IDX, T) -> T; MOV OP(T) -> implicit;" (3 uOps) or "MOV (IDX, offset) -> T; MOV OP(T) -> implicit;" (2 uOps).
BSR can be translated to PUSH(PC) because branches with immediates are executed directly in the Fetch_Unit.
RTS is POP(PC).
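The PUSH and POP expansions are easy to check with a toy interpreter. This Python sketch follows the uOp sequences quoted above, but the tuple encoding, the function names and the register set are invented for the example, and operand width is ignored (each stack slot holds one whole value).

```python
def expand_push(reg):
    # PUSH: MOV R -> (S+0); MOV S-1 -> S        (2 uOps)
    return [("store", reg, "S", 0), ("addi", "S", -1, "S")]


def expand_pop(reg):
    # POP: MOV (S+1) -> T; MOV S+1 -> S; MOV T -> R   (3 uOps)
    return [("load", "S", 1, "T"), ("addi", "S", 1, "S"), ("move", "T", reg)]


def expand_rts():
    # RTS is POP(PC)
    return expand_pop("PC")


def execute(uops, regs, mem):
    """Run hypothetical uOp tuples against a register dict and a memory dict."""
    for op in uops:
        kind = op[0]
        if kind == "store":            # ("store", src, base, offset)
            _, src, base, off = op
            mem[regs[base] + off] = regs[src]
        elif kind == "load":           # ("load", base, offset, dst)
            _, base, off, dst = op
            regs[dst] = mem[regs[base] + off]
        elif kind == "addi":           # ("addi", src, immediate, dst)
            _, src, imm, dst = op
            regs[dst] = regs[src] + imm
        elif kind == "move":           # ("move", src, dst)
            _, src, dst = op
            regs[dst] = regs[src]
        else:
            raise ValueError(f"unknown uOp kind: {kind}")


regs = {"A": 0x42, "B": 0, "S": 0x01FF, "T": 0, "PC": 0x8000}
mem = {}

execute(expand_push("A"), regs, mem)    # A goes to the stack, S moves down
s_after_push = regs["S"]
execute(expand_push("PC"), regs, mem)   # what BSR does before the fetch unit jumps
regs["PC"] = 0x9000                     # pretend the branch was taken
execute(expand_rts(), regs, mem)        # POP(PC) restores the return address
execute(expand_pop("B"), regs, mem)     # and the original A value lands in B
```

The POP sequence goes through T because the load result only becomes available in the temporary after the memory cycle, which matches the MEM stage writing to TA or TB.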

JSR, on the other hand, is not so easy. PUSH requires 2 uOps, and the addressing mode requires 1 to 3 uOps. So I would need to translate it into 2 instructions: PUSH(PC) followed by LOAD(PC).

I also decided to drop DEC. DEP will work the same except when the operand is zero: in that case DEP refuses to decrement it. If requested, DEP sets the A flag (Acquired) if the operand is not zero.
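The DEP semantics fit in a few lines of Python. This is a behavioural sketch only: the function name `dep` and the use of its return value to stand for the A (Acquired) flag are assumptions, and the model says nothing about how atomicity is achieved in hardware.

```python
def dep(mem, addr):
    """Model of DEP: decrement mem[addr] unless it is zero.

    Returns True when the operand was not zero (the A / Acquired flag),
    False when it was zero and has been left untouched.
    """
    value = mem[addr]
    if value == 0:
        return False
    mem[addr] = value - 1
    return True


# Assumed use case: a counting semaphore with two free slots at address $0200.
mem = bytearray(0x10000)
SEM = 0x0200
mem[SEM] = 2

first = dep(mem, SEM)    # slot taken, counter 2 -> 1
second = dep(mem, SEM)   # slot taken, counter 1 -> 0
third = dep(mem, SEM)    # nothing left: counter stays at 0, not acquired
```

With a plain DEC the third call would wrap the counter around, which is why a decrement that refuses to go below zero is useful for atomic operations.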


Posted: Tue May 18, 2021 10:44 am
Joined: Sun May 02, 2021 10:24 am
Posts: 21
Up.

Still working on the decode stage. I hope to avoid generating many muxes.

