All times are UTC
PostPosted: Sun Nov 08, 2020 1:50 am 
Joined: Sun Oct 18, 2015 11:02 pm
Posts: 428
Location: Toronto, ON
I wanted to touch on a final aspect of this pipeline design and the specific problems it helps to overcome. I struggled a bit with this explanation, so apologies in advance for any confusion. I’ll be happy to try to clarify so please just ask. Here it goes ...

Unlike the atomic instructions we find in a traditional RISC pipeline, even basic operations in this design are spread over two microinstructions. For example, a typical RISC style add operation, like this:
Code:
add r1, r2, r3
takes two microinstructions to specify in this pipeline, corresponding to the 6502 FetchOperand and FetchOpcode bus cycles, like this:
Code:
ALUin(A, DB, C); PC += 1                # FetchOperand
A := ALUop(ADD); IR := DB; PC += 1      # FetchOpcode
The first microinstruction loads the inputs of the ALU and the second performs the ALU operation itself. To be clear, the RISC form of the instruction would execute the same sequence of steps with respect to the ALU as it works down the pipeline. But it remains an atomic unit that describes only a single operation.

By contrast, a single microinstruction in this design can specify the ALUop for one operation and the ALU inputs for the next. For example, during indexing, we might see the following microinstruction:
Code:
ADL := ALUop(ADD); ALUin(ADH, 0, Cout)
This microinstruction completes the sum for the low-byte of the address (ADL) and also sets up the ALU inputs to adjust the high-byte (ADH). Because of their dual-function, only one of these microinstructions is required to manage the activity across both the DECODE and EXECUTE stages.
Attachment: D72D8FF6-B463-4F19-8A4F-42F9AD9BCFA1.jpeg
As a bonus, that one microinstruction can also specify whether any ALU outputs need to be recirculated (as is the case with Cout in the microinstruction above).
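
To make the dual role concrete, here is a minimal Python sketch of the idea. It is a toy model, not the C74 hardware or its control-word format: the `MicroInstr` fields, the `run` helper, and the `ZERO`/`COUT` source names are assumptions invented for illustration. Each microinstruction may finish the ALU operation whose inputs were latched in the previous cycle and, in the same cycle, latch the inputs for the next one.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class MicroInstr:
    """One 'split' microinstruction (hypothetical encoding, for illustration only)."""
    alu_op: Optional[str] = None                    # operation on inputs latched last cycle
    dest: Optional[str] = None                      # register receiving the ALU result
    alu_in: Optional[Tuple[str, str, str]] = None   # (A, B, carry) sources for the next op


def run(program, regs):
    """Execute one microinstruction per cycle and return the updated registers."""
    latched = None   # ALU inputs set up by the previous microinstruction
    cout = 0         # carry out of the most recent ALU operation
    for mi in program:
        result = None
        if mi.alu_op == "ADD" and latched is not None:
            a, b, c = latched
            total = a + b + c
            result, cout = total & 0xFF, total >> 8
        # Latch inputs for the next operation; "COUT" recirculates the fresh carry.
        if mi.alu_in is not None:
            sources = dict(regs, COUT=cout, ZERO=0)
            latched = tuple(sources[name] for name in mi.alu_in)
        else:
            latched = None
        if result is not None and mi.dest is not None:
            regs[mi.dest] = result
    return regs


# Index $12F0 by X = $20: low-byte add, then high-byte adjust using the carry.
program = [
    MicroInstr(alu_in=("ADL", "X", "ZERO")),
    MicroInstr(alu_op="ADD", dest="ADL", alu_in=("ADH", "ZERO", "COUT")),  # the split one
    MicroInstr(alu_op="ADD", dest="ADH"),
]
final = run(program, {"ADL": 0xF0, "ADH": 0x12, "X": 0x20})
print(f"{final['ADH']:02X}{final['ADL']:02X}")  # prints 1310
```

The middle entry plays the part of the indexing microinstruction quoted above: it writes ADL and sets up the ADH adjustment in a single step.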

Most importantly, though, the arrangement allows the pipeline to avoid control stalls. To see how, consider the effect of a FetchOpcode operation on the pipeline. A FetchOpcode causes microinstructions for a new opcode to begin to be fetched from a new location. In that sense, we can think of the opcode as an address and of FetchOpcode microinstructions as unconditional branches to opcode "subroutines". From the point of view of the microinstruction stream, a FetchOpcode is in fact an unconditional branch.
Attachment: 0AF13418-5175-4786-9C77-AC7FA5652D4B.jpeg
And just like all pipelined branches, FetchOpcode invalidates any instructions that have already been pre-fetched into the pipeline at the time it executes. This is effectively a "branch delay slot". With a traditional pipeline, there are two such invalid instructions, one in the FETCH stage and another in the DECODE stage. In this case we can use the opcode itself in place of the microinstruction in the FETCH stage. But the microinstruction in the DECODE stage has to be discarded. Left as is, the pipeline would stall.
Attachment: 7F61A5E2-7825-4C8E-A82A-DAAF19009B02.jpeg
This is where the "split" microinstructions come in handy. Since there is only one microinstruction active for both DECODE and EXECUTE, we can take the opcode and just keep going. No control stall is triggered and no extra cycles added to the processing as a result.
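
A rough way to picture "opcode as branch target" is the Python sketch below. It is only a toy sequencer under stated assumptions: the `MICROCODE` dictionary, the `run` function and the string match on "IR := DB" are illustrative inventions, and the stage overlap of the real pipeline is not modelled. The microinstruction text is the ORA # and ORA zpg microcode from this thread, keyed by the standard 6502 opcodes $09 and $05.

```python
# Microcode "subroutines" keyed by opcode; the opcode acts as the branch target.
MICROCODE = {
    0x09: [  # ORA #
        "ALUin(A, DB, C); AD += 1",
        "A := ALUop(OR, NZ); IR := DB; PC := AD += 1",
    ],
    0x05: [  # ORA zpg
        "ADL := DB; ADH := 0; PC += 1",
        "ALUin(A, DB, C); AD := PC",
        "A := ALUop(OR, NZ); IR := DB; PC := AD += 1",
    ],
}


def run(opcode_stream):
    """Return a log of (cycle, opcode, step) for a stream of opcodes."""
    stream = iter(opcode_stream)
    ir, step, cycle = next(stream), 0, 0
    log = []
    while True:
        text = MICROCODE[ir][step]
        log.append((cycle, f"{ir:02X}", step))
        cycle += 1
        if "IR := DB" in text:
            # FetchOpcode: an unconditional branch to the new opcode's microcode.
            nxt = next(stream, None)
            if nxt is None:
                break
            ir, step = nxt, 0
        else:
            step += 1
    return log


log = run([0x09, 0x05, 0x09])
for entry in log:
    print(entry)
```

In this model the first microinstruction of the new opcode runs in the cycle immediately after the FetchOpcode, so the three instructions take 2 + 3 + 2 = 7 cycles with no bubble between them.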

Alright, with that final gremlin banished, here now is a high-level block diagram for this CPU (which once again Dr. Jefyll helped to edit. Thanks Jeff!).
Attachment: BDE32586-357C-494B-BDF1-F5C63F430764.png

Cheers for now,
Drass.

_________________
C74-6502 Website: https://c74project.com


PostPosted: Mon Nov 09, 2020 10:04 pm
Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10986
Location: England
(Ken Shirriff's latest blog, on the 8008's carry lookahead, has many interesting-looking links in the footnotes, about fast adders.)


PostPosted: Tue Nov 10, 2020 9:12 am
Joined: Thu Apr 11, 2019 7:22 am
Posts: 40
Hi Drass,

Quote:
The first microinstruction loads the inputs of the ALU and the second performs the ALU operation itself. To be clear, the RISC form of the instruction would execute the same sequence of steps with respect to the ALU as it works down the pipeline. But it remains an atomic unit that describes only a single operation.

I initially found the order in which you mentioned the fetches a bit confusing, but then I understood that by FetchOperand you mean the cycle where the current instruction is being decoded, and by FetchOpcode you mean the next opcode being fetched while the current instruction is executing.

So without the aim to be accurate or exhaustive, I made a quick drawing as an attempt for me to understand and visualise what's going on in your pipeline:
Attachment: 6502.png

The drawing represents a 2 cycle instruction, followed by a 4 cycle instruction, followed by a 2 cycle instruction. Operations related to every instruction are represented in different colours. Bold texts represent actual operations in the pipeline, while plain texts represent pipeline steps that either do nothing or are discarded for the next cycle. I have also displayed the write back cycle with half the length to represent that results are available before the decoding cycle of the next instruction finishes.

I believe this shows why there are no hazards, and why it is safe to recirculate data through the ALU. For example, the two consecutive executing steps of the green instruction can be a "recirculation" of the ALU.

In this case, keeping the minimum 2 cycles per instruction of the 6502, is key for this to be possible, because execution cycles of any two consecutive instructions are always separated by at least one cycle between them. Is this right?

Thanks


PostPosted: Wed Nov 11, 2020 12:34 am
Joined: Sun Oct 18, 2015 11:02 pm
Posts: 428
Location: Toronto, ON
BigEd wrote:
(Ken Shirriff's latest blog, on the 8008's carry lookahead, has many interesting-looking links in the footnotes, about fast adders.)
Thanks for the link BigEd. I was interested to see the carry lookahead circuit in the 8008 described as follows:
Attachment: 4B998B1C-30C6-45F3-90E8-E2C5F292D36D.jpeg
“You might wonder why this carry lookahead circuit is any faster than a plain ripple-carry adder, since the carry signal has to go through up to seven large gates to generate the last carry bit. The trick is that the entire circuit is electrically a single large gate due to the dynamic design. All the transistors are activated in parallel, and then the 5-volt signal can pass through them all rapidly. Although there is still a delay as this signal travels through the circuit, the circuit is faster than the standard ripple carry adder which activates transistors in sequence.”

It seems that Dieter’s FET Adder is doing much the same thing and activating the FET switches in parallel.
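
For anyone who wants the logic behind carry lookahead spelled out, here is a small Python sketch of the textbook generate/propagate formulation. It is only the Boolean view: it says nothing about the 8008's dynamic circuit or the FET adder's electrical behaviour, and the function names are my own. Each carry is written directly in terms of the input bits, so no carry has to wait for the one below it to be computed.

```python
def all_ones(bits):
    """1 if every bit in the list is 1 (an empty list counts as 1)."""
    return int(all(bits))


def lookahead_carries(a, b, cin=0, width=8):
    """Return [c0, c1, ..., c_width], each derived from the inputs alone."""
    g = [(a >> i) & (b >> i) & 1 for i in range(width)]      # generate: both bits set
    p = [((a >> i) | (b >> i)) & 1 for i in range(width)]    # propagate: either bit set
    carries = [cin]
    for i in range(width):
        # c[i+1] = g[i] | p[i]g[i-1] | ... | p[i]..p[1]g[0] | p[i]..p[0]cin
        c = cin & all_ones(p[0:i + 1])
        for j in range(i + 1):
            c |= g[j] & all_ones(p[j + 1:i + 1])
        carries.append(c)
    return carries


def add(a, b, cin=0, width=8):
    """Add two values using the lookahead carries; returns (sum, carry out)."""
    c = lookahead_carries(a, b, cin, width)
    total = 0
    for i in range(width):
        total |= (((a >> i) ^ (b >> i) ^ c[i]) & 1) << i
    return total, c[width]


print(add(0xF0, 0x20))  # prints (16, 1)
```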

PostPosted: Wed Nov 11, 2020 1:06 am
Joined: Sun Oct 18, 2015 11:02 pm
Posts: 428
Location: Toronto, ON
joanlluch wrote:
In this case, keeping the minimum 2 cycles per instruction of the 6502, is key for this to be possible, because execution cycles of any two consecutive instructions are always separated by at least one cycle between them.
Yes, I think this is right. On the 6502, a FetchOperand always follows a FetchOpcode. The ALU operation can be performed concurrently with a FetchOpcode but never with a FetchOperand. Hence, FetchOperand always provides the minimum space needed in the pipeline to avoid data hazards.

In addition, the following two modifications were required to avoid FetchOpcode control hazards:
1) Use the opcode itself in place of the microinstruction in the FETCH stage at the time FetchOpcode is executed, and
2) Use “split” microinstructions to eliminate the need for a separate microinstruction in the DECODE stage.

Hope that helps.

PostPosted: Sat Nov 14, 2020 1:01 pm
Joined: Sun Oct 18, 2015 11:02 pm
Posts: 428
Location: Toronto, ON
The cycle-by-cycle execution of 6502 instructions will look familiar as it simply reflects the regular bus cycles of a 6502. Let's take a look at an ORA # instruction, for example, as follows:
Code:
[ORA #]
ALUin(A, DB, C); AD += 1                          # Load the inputs to the ALU, increment bus address
A := ALUop(OR, NZ); IR := DB; PC := AD += 1       # ALU operation; FetchOpcode
As expected, the ALU operation is split across two microinstructions; one to load the ALU inputs and another to carry out the ALU operation. The operand is fetched in the first cycle and the next opcode in the second. Flags and registers are updated in the third cycle, following the ALU operation. Meanwhile, this ORA # operation will execute exactly the same bus cycles as any other 6502, which is exactly the result we are after.

Here are a few more addressing modes:
Code:
[ORA zpg]
ADL := DB; ADH := 0; PC += 1                      # Use the operand byte as a zero-page address; increment PC
ALUin(A, DB, C); AD := PC                         # Load the inputs to the ALU, restore PC
A := ALUop(OR, NZ); IR := DB; PC := AD += 1       # ALU operation; FetchOpcode

[ORA abs]
ALUin(0, DB, 0); PC := AD += 1                    # Read Address low-byte
ADL := ALUop(ADD); ADH := DB; PC += 1             # Read Address high-byte
ALUin(A, DB, 0); AD := PC                         # Load the inputs to the ALU, restore PC
A := ALUop(OR, NZ); IR := DB; PC := AD += 1       # ALU operation; FetchOpcode

[ORA zpg,x]
ALUin(X, DB, 0); PC += 1                          # Add X to the operand, increment PC
ADL := ALUop(ADD); ADH := 0                       # Use the indexed zero-page address
ALUin(A, DB, C); AD := PC                         # Load the inputs to the ALU, restore PC
A := ALUop(OR, NZ); IR := DB; PC := AD += 1       # ALU operation; FetchOpcode

Absolute Indexed addressing needs a little gymnastics to add a cycle when a page is crossed:
Code:
[ORA abs,X]
ALUin(X, DB, 0); PC := AD += 1                                # Add X to the Address low-byte
ADL := ALUop(ADD); ADH := DB; ALUin(0, DB, COUT); PC += 1     # Read address high-byte
ALUin(A, DB, C); CC:AD := PC; CS:[ADH := ALUop(ADD); REPEAT]  # Adjust high-byte if necessary
A := ALUop(OR, NZ); IR := DB; PC := AD += 1; END              # ALU operation; FetchOpcode
Note in the third microinstruction the "CC:AD := PC; CS:[ADH := ALUop(ADD); REPEAT]" conditional execution construct. Here "AD := PC" will execute if the carry is clear, and "[ADH := ALUop(ADD); REPEAT]" if it is set. The REPEAT keyword will force a pipeline stall if the carry is set. This will execute the same microinstruction again, which has the effect of adding an extra cycle if a page was crossed.
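
The net effect of the REPEAT construct can be sketched in a few lines of Python. This is an arithmetic model only, assuming one cycle per microinstruction; the function name `ora_abs_x` is made up for the example and nothing here reflects the actual control logic.

```python
def ora_abs_x(base, x):
    """Return (effective address, cycle count) for ORA abs,X in this toy model."""
    low = (base & 0xFF) + x
    carry = low >> 8                 # COUT from the low-byte add
    adh = base >> 8
    cycles = 4                       # four microinstructions when no page is crossed
    if carry:                        # CS: [ADH := ALUop(ADD); REPEAT]
        adh = (adh + carry) & 0xFF   # adjust the high byte
        cycles += 1                  # the repeated microinstruction costs one cycle
    return (adh << 8) | (low & 0xFF), cycles


print(ora_abs_x(0x1200, 0x20))  # prints (4640, 4), i.e. $1220 with no page crossing
print(ora_abs_x(0x12F0, 0x20))  # prints (4880, 5), i.e. $1310 with a page crossing
```

The 4 and 4+1 cycle counts match the documented timing of ORA abs,X on the 6502.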

Here is a branch instruction:
Code:
ALUin(ADL+1, DB, 0); PC := AD += 1; EXIT.BTF      # EXIT if the Branch Test Fails (generates a FetchOpcode)
ADL := ALUop(ADD); ALUin(PCH, SE, COUT); EXIT.CC  # EXIT if a page boundary is not crossed (generates a FetchOpcode)
ADH := ALUop(ADD)                                 # Adjust the high-byte on a page crossing
IR := DB; PC := AD += 1; END                      # FetchOpcode
There are a couple of special registers being used here -- "ADL+1" is the low-byte output of the 16-bit incrementer and "SE" is the Sign-Extended value of the ALUB input in the prior cycle. The "EXIT.BTF" keyword will generate a FetchOpcode in the next cycle if the branch test fails. The EXIT.CC keyword will do the same if the ALU output carry (COUT) is clear (i.e., a page was not crossed). A FetchOpcode is specified as an all-zeroes control word, so a FetchOpcode can be generated easily by clearing the control word.
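
Here is a small Python sketch of the arithmetic behind the branch sequence: the low-byte add, the "SE" sign extension, and the high-byte adjust through PCH + SE + COUT. The function names are hypothetical, and the early-exit decision is simplified. In particular, the sketch decides "page crossed" by comparing the resulting high byte with PCH rather than by testing COUT alone, because for backward branches the carry out of the low-byte add is set when the page is not crossed. How the hardware conditions EXIT.CC for negative offsets is not modelled here.

```python
def sign_extend(offset):
    """The 'SE' value: $FF for a negative 8-bit offset, $00 otherwise."""
    return 0xFF if offset & 0x80 else 0x00


def branch(pc_next, offset, taken):
    """Return (next PC, cycles); pc_next is the address following the branch."""
    if not taken:
        return pc_next, 2                        # EXIT.BTF: straight to FetchOpcode
    low = (pc_next & 0xFF) + offset              # ADL := ALUop(ADD)
    cout = low >> 8
    pch = pc_next >> 8
    adh = (pch + sign_extend(offset) + cout) & 0xFF   # ALUin(PCH, SE, COUT)
    crossed = adh != pch
    return (adh << 8) | (low & 0xFF), 4 if crossed else 3


print(branch(0x1002, 0x10, True))   # prints (4114, 3), i.e. $1012
print(branch(0x10F0, 0x20, True))   # prints (4368, 4), i.e. $1110
print(branch(0x1005, 0xF0, True))   # prints (4085, 4), i.e. $0FF5
```

The 2, 3 and 4 cycle results line up with the documented 6502 branch timings for not taken, taken, and taken across a page boundary.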

The Logisim model for the pipeline is now complete and I am happy to report that these microcode sequences have all passed their unit tests. So far so good and testing continues. :)

Cheers for now,
Drass

PostPosted: Sat Nov 14, 2020 1:29 pm
Joined: Thu Apr 11, 2019 7:22 am
Posts: 40
Hi Drass, I'm curious whether you use your text notation for microinstructions only as a reference document for the actual control signals required, or whether you have created some sort of software utility to convert them to the actual contents of the decode ROM(s) (not sure the latter would really be worth having, though). Also, when you talk about unit tests, do you mean instructions, or sequences of instructions, being executed in the Logisim model with all the possible, or at least the edge case, operand combinations?


PostPosted: Sat Nov 14, 2020 3:40 pm
Joined: Sun Oct 18, 2015 11:02 pm
Posts: 428
Location: Toronto, ON
Hi Joan,

I considered a purpose-built assembler but I found it much faster to use a little VisualBasic Script to generate the ROMs. The microcode is housed in MS Excel. I hand parse the text microinstructions into a master lookup table which the script then uses to generate the ROMs. I really like using readable microcode, so this is a good tradeoff for me. It’s relatively simple to change the text microcode, update the corresponding lookup table entry and press the “Write ROMs” button.
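
For readers who would rather not use Excel, the "lookup table to ROM images" step might look roughly like the Python sketch below. This is not the VBScript described above, and every detail is an assumption for illustration: the (opcode, step) addressing, eight steps per opcode, the 16-bit control word split over two byte-wide ROMs, and the example control-word values are all hypothetical.

```python
def build_roms(table, steps_per_opcode=8, width_bytes=2):
    """Pack {(opcode, step): control_word} into one bytearray per byte-wide ROM."""
    size = 256 * steps_per_opcode
    roms = [bytearray(size) for _ in range(width_bytes)]   # unprogrammed entries stay 0
    for (opcode, step), word in table.items():
        addr = opcode * steps_per_opcode + step
        for i, rom in enumerate(roms):
            rom[addr] = (word >> (8 * i)) & 0xFF           # ROM i holds byte i of the word
    return roms


# Hypothetical control words for the two ORA # microinstructions (opcode $09).
table = {
    (0x09, 0): 0x1234,
    (0x09, 1): 0x0000,
}
roms = build_roms(table)
print(len(roms), len(roms[0]), hex(roms[0][0x09 * 8]), hex(roms[1][0x09 * 8]))
```

Each bytearray could then be written out as a binary image for a ROM programmer or simulator. One side effect of this layout is that unprogrammed locations hold an all-zeroes word.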

Regarding testing, it’s very iterative. It takes a while to get the basic piping working, and then a lot of stuff works all at once. I start with “component” testing of the circuitry — does the ALU work, does the incrementer work, etc. Then I test the control unit decoders and basic logic (i.e. flag evaluation, branch evaluation, etc.). Then I test individual instructions to exercise the various 6502 addressing modes, both reads and writes, including edge cases (which mostly translates into page boundary conditions). I then will test jumps and branches. By the time I get to this stage, I’m writing little code fragments into Logisim and running them. That’s what I call “unit testing”.

Once all the unit testing is done, I feel fairly comfortable the CPU basically works. Then it’s time to spark up the Dormann Test Suite: https://github.com/Klaus2m5/6502_65C02_functional_tests. It tests a large number of edge cases and subtle behaviours of the CPU to ensure compatibility. My tests with C64 games showed that once the Dormann test suite worked, only timing and undocumented behaviours were a source of problems. Straight ahead code execution all checked out. All in all, the Dormann Test Suite is an invaluable tool and we’re very lucky to have it.

PostPosted: Sat Nov 14, 2020 4:13 pm
Joined: Thu Apr 11, 2019 7:22 am
Posts: 40
Hi Drass, thanks for that. It makes a lot of sense. I'm essentially doing the same with my CPU74 Logisim model. My "text" microinstructions are actually defined as expressions in my software simulator. From that, I manually create a table in a text file with the control signals for each instruction. Instead of ROMs, I use PLAs for the decoding. The Logisim holycross fork that I'm using has a combinational analysis import feature that allows importing a truth table and converting it into 'sums of products' equations, or into a circuit in Logisim. The same table, or the resulting 'sum of products' equations, will eventually be used to program the physical decoding PLAs. Unfortunately, I do not have any standard test suite for the ultimate stress test. The most I can do is to compile and run C-based test code, but this does not always guarantee that all machine instructions or edge cases for the particular architecture will be fully tested.
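
As an illustration of the truth-table-to-equation step, here is a minimal Python sketch that emits an unminimized sum of products (one product term per row whose output is 1). It is not what Logisim does internally, since real tools also minimize the result. The function name, the `!`/`&`/`|` notation and the signal names are assumptions for the example.

```python
def sum_of_products(inputs, rows):
    """Build an unminimized SOP string from {input bit tuple: output bit}."""
    terms = []
    for bits, out in sorted(rows.items()):
        if out:
            # One literal per input: the plain name for 1, the negated name for 0.
            literals = [name if bit else f"!{name}" for name, bit in zip(inputs, bits)]
            terms.append("(" + " & ".join(literals) + ")")
    return " | ".join(terms) if terms else "0"


# Hypothetical two-input control signal that behaves like XOR.
table = {(0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): 0}
print(sum_of_products(("A", "B"), table))  # prints (!A & B) | (A & !B)
```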


PostPosted: Sat Nov 14, 2020 5:10 pm
Joined: Sun Oct 18, 2015 11:02 pm
Posts: 428
Location: Toronto, ON
joanlluch wrote:
Instead of ROMs, I use PLAs for the decoding.
Yes, I like the idea of using GALs for decoding. They’re certainly fast enough, and nice and tidy I think. If I ever get the chance to build a RISC TTL CPU I would go that way.

Quote:
The Logisim holycross fork that I'm using has a combinational analysis import feature that allows importing a truth table and converting it into 'sums of products' equations, or into a circuit in Logisim. The same table, or the resulting 'sum of products' equations, will eventually be used to program the physical decoding PLAs.
That’s very handy. I’ll have to have a look.

Quote:
Unfortunately, I do not have any standard test suite for the ultimate stress test. The most I can do is to compile and run C-based test code, but this does not always guarantee that all machine instructions or edge cases for the particular architecture will be fully tested.
Making a CPU that could run C64 games would have been MUCH harder without a Test Suite! You might want to have a look at the Dormann test suite. It might give you some ideas about how to write one for your ISA. (in your case, you get to define what those subtle undocumented behaviours are so it might not be quite as difficult to test things).

PostPosted: Mon Nov 16, 2020 12:53 am
Joined: Sun Oct 18, 2015 11:02 pm
Posts: 428
Location: Toronto, ON
Good news! The Logisim model for the CPU has now completed the 6502 Dormann test suite except for the final Decimal Mode test. That means the pipeline is solid and working across the instruction set, which I’m very happy about. There will be some fine tuning for sure, but the critical path seems in range. All very exciting.

Looking forward to completing V1 of the model soon.

Cheers for now,
Drass

PostPosted: Mon Nov 16, 2020 7:52 am
Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10986
Location: England
Very nice milestone!


PostPosted: Mon Nov 16, 2020 8:04 am
Joined: Thu Apr 11, 2019 7:22 am
Posts: 40
Hi Drass, congratulations on that, and on the short time in which you achieved it!


PostPosted: Mon Nov 16, 2020 11:49 pm
Joined: Sun Oct 18, 2015 11:02 pm
Posts: 428
Location: Toronto, ON
Thanks! Much appreciate the interest and support. Cheers.

PostPosted: Wed Nov 18, 2020 1:28 am
Joined: Sun Oct 18, 2015 11:02 pm
Posts: 428
Location: Toronto, ON
UPDATE: The Decimal Mode logic is now fixed and the model passes the full Dormann 6502 test suite.

It’s time to take a fine-tooth comb to the design and validate that every path is in fact within the 10ns critical threshold. Drum roll ..... :)
