I'll continue to chronicle the work here. As always, thanks in advance for all comments and suggestions.
Cheers for now,
Drass
--------------- Summary of discussion to date ---------------
- Drass: Original post (replicated below)
- MichaelM discusses the importance of memory bandwidth in this context, and suggests that a dual cache design (an I-Cache and a D-Cache) may be sufficient to support the pipeline well.
- Chromatix draws a comparison to the 68040 pipeline
- ttlworks discusses the Intel Itanium VLIW architecture and his own TREX TTL design
- John West suggests pre-decoding instructions as they enter the I-Cache
Having had the chance to study this thread more closely, I began to experiment with a new “reduced-CPI” 6502 design. After significant noodling, I settled on a seven-stage pipelined architecture, as follows:
[FEtch] —> [DEcode] —> [ADdress] —> [IndeX] —> [DAta] —> [EXecute] —> [WriteBack]
- FE: Fetch an instruction (opcode and operands) from Program Memory
- DE: Decode opcode, marshal operands, calculate pc-relative, stack-relative and pre-index offsets
- AD: Fetch indirect address, including jump vectors, from Address Memory
- IX: Apply post-indexed address offsets
- DA: Read Data Memory
- EX: Perform ALU operation, update register file
- WB: Update P register, verify branch prediction, write to Data Memory
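To make the stage-by-stage flow concrete, here is a minimal Python sketch of how such a pipeline can be modeled cycle by cycle. (This is illustrative only, not the P7 code; instructions simply shift one stage per clock, with None standing in for a bubble, and hazards are ignored.)

```python
# Minimal model of a seven-stage pipeline: one register per stage,
# with every register advancing one stage per clock cycle.
STAGES = ["FE", "DE", "AD", "IX", "DA", "EX", "WB"]

def step(pipeline, next_instruction):
    """Advance every stage register one cycle; FE accepts a new instruction."""
    retired = pipeline[-1]            # instruction leaving WB this cycle
    pipeline[1:] = pipeline[:-1]      # each stage receives from the stage above
    pipeline[0] = next_instruction    # FE fetches the next instruction
    return retired

pipeline = [None] * len(STAGES)       # start the pipe full of bubbles
program = ["lda #$01", "sta $0200", "nop"]
retired = []
for cycle in range(len(STAGES) + len(program)):
    op = program[cycle] if cycle < len(program) else None
    done = step(pipeline, op)
    if done is not None:
        retired.append(done)
# instructions emerge from WB in order, seven cycles after being fetched
```

With all seven stages full of independent instructions, one instruction retires per cycle, which is where the ideal CPI of 1 comes from.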
As others have suggested, pipelining a 6502 beyond just a few stages requires concurrent access to Program, Address and Data memory. For example, we could have Program Memory supplying a new instruction, Address Memory reading an indirect address from zero page, and Data Memory reading a byte of data... all at the same time. In concept, the requirement is analogous to a large multiport RAM with three simultaneous accesses possible. In practice, modern pipelines are supported by sophisticated multi-level cached memory for this purpose. But we will want to keep things as simple as possible to start -- we can always optimize later.
For now, let's propose three distinct single port RAMs working together. We will want to be able to read complete instructions and addresses as often as possible, so 32-bit wide memory is appropriate. Let's also assume 64K bytes of addressable space. One consequence of using three single port RAMs is that writes must be mirrored across all memories to keep them synchronized. To prevent writes from stalling the pipeline excessively, each memory is outfitted with a one-deep write buffer. Each memory can independently postpone a write while busy, until a cycle when it would otherwise have been inactive.
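As a sketch of the write-buffer idea (illustrative Python with names of my own choosing, not anything from the P7 source): each memory parks an incoming write in a one-deep buffer and drains it on the first cycle its port would otherwise sit idle.

```python
# One-deep write buffer in front of a single-port RAM: a write is parked,
# then drained on the first cycle the port is not busy with a read.
class BufferedRAM:
    def __init__(self, size=0x10000):
        self.mem = bytearray(size)
        self.pending = None               # parked (address, value), or None

    def write(self, addr, value):
        if self.pending is not None:
            return False                  # buffer occupied: caller must stall
        self.pending = (addr, value)
        return True

    def cycle(self, busy_reading):
        """Call once per clock; drain the parked write when the port is idle."""
        if self.pending is not None and not busy_reading:
            addr, value = self.pending
            self.mem[addr] = value
            self.pending = None

ram = BufferedRAM()
accepted = ram.write(0x1FC, 0x30)
ram.cycle(busy_reading=True)              # port busy this cycle: write waits
ram.cycle(busy_reading=False)             # idle cycle: write drains to the array
```

A second write arriving while one is still parked is what triggers a Cache Busy stall in the statistics further down.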
Beyond memory, certain accessory mechanisms are needed. 6502 variable length instructions mean the instruction boundaries don't always neatly word-align in 32-bit memory. For this reason Program memory feeds into a prefetch buffer, and any instruction that might have been divided because its bytes span a word boundary will appear intact and complete in the buffer. There is also an alignment problem in Address Memory. Here we can use a second parallel Address Memory to enable fetching of two consecutive words in one cycle. This guarantees that we will get a complete address, even when that address spans a 32-bit boundary. In addition, as instructions are processed, branches are identified and their target addresses loaded into a Branch Target Buffer. The BTB is then consulted on every fetch to resolve branch targets as early as possible, and to aid in branch prediction. Finally, with all this concurrent activity, a complement of hazard checks must constantly police the pipeline to make sure the correct program semantics are maintained.
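The BTB lookup itself can be sketched very simply (again illustrative Python, not the P7 implementation): the fetch address indexes a table of previously seen branch targets, and a hit lets the fetch stage redirect without waiting for decode.

```python
# Branch Target Buffer: maps the address of a seen branch to its target,
# so the fetch stage can redirect as early as possible.
class BranchTargetBuffer:
    def __init__(self):
        self.entries = {}                 # branch address -> target address

    def record(self, branch_addr, target):
        self.entries[branch_addr] = target

    def lookup(self, fetch_addr):
        return self.entries.get(fetch_addr)   # None on a miss

btb = BranchTargetBuffer()
btb.record(0xF0F2, 0xF0E8)                # a looping backward branch
hit = btb.lookup(0xF0F2)
miss = btb.lookup(0xC000)
```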
Ok, so that’s a lot of stuff to keep straight. After some paper-and-pencil exploration, I decided to write a simulation to flesh out the concept and work out the kinks. The result is the P7 Simulator. It’s a fairly straightforward Python script written to simulate the action of pipeline registers working cycle by cycle on 6502 opcodes. It currently supports NMOS opcodes and implements the seven-stage pipeline outlined above. It makes use of the following facilities across the various pipeline stages:
- FE: Program Memory Read, Pre-Fetch Code Buffer and associated Byte-Select Mux Matrix, Branch Target Buffer
- DE: Decode RAM, PC Increment Adder, Pre-Index Adder
- AD: Address Memory Read
- IX: Post-Index Adder
- DA: Data Memory Read, Register File Read
- EX: ALU, Register File Write
- WB: Data Memory Write, Branch evaluation/validation
One significant fly in the memory-management ointment is self-modifying code. Unfortunately, an instruction that is already in the pipeline will not be modified by a write to Program Memory. It will therefore not reflect a recent modification when it executes. Rather than instrumenting the pipeline to detect self-modifying writes, the workaround is to impose a minimum distance between dynamically modified instructions and the instructions which modify them. Not the end of the world, but certainly a limitation in terms of compatibility.
Finally, one other architectural note: instructions which require both stack operations and jumps — namely JSR/RTS and BRK/RTI — are done by feeding multiple successive lines of microcode down the pipeline for each of these instructions (whereas the norm is just a single line of microcode). So a JSR <abs>, for example, generates a PUSH PCH / PUSH PCL / JMP <abs> sequence in the pipeline. Similarly, BRK generates three pushes, an SEI, and then a JMP (abs) instruction to jump through the correct interrupt vector. (Speaking of interrupts, I’ve yet to instrument the simulation to handle them. Intuitively, it seems like inserting a BRK into the instruction stream should work, but it will be interesting to see how that bears out, and where interrupt latency ends up).
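A sketch of this expansion step (the micro-op mnemonics here are my own shorthand, not P7's actual internal encoding):

```python
# Translate multi-action instructions into a sequence of simpler pipeline
# operations; ordinary instructions pass through as a single operation.
def expand(opcode, operand=None):
    if opcode == "jsr":
        # JSR <abs> becomes PUSH PCH / PUSH PCL / JMP <abs>
        return ["push pch", "push pcl", "jmp " + operand]
    if opcode == "brk":
        # three pushes, SEI, then a jump through the interrupt vector
        return ["push pch", "push pcl", "push p", "sei", "jmp ($fffe)"]
    return [opcode + " " + operand] if operand else [opcode]
```

Each line of the returned list is fed down the pipeline on a successive cycle, which is where the "Generated" cycles in the statistics below come from.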
Ok, with that as a design outline, I got the simulation more or less put together. I then loaded the Dormann NMOS 6502 Test Suite to give it a whirl. Sure enough, there is self-modifying code in the suite! I had to insert a couple of strategically placed NOPs in the code to create the requisite minimum distance between self-modifying instructions. (Two NOPs just prior to each of the JSRs in the CHKADD routine did the trick. The simulator does not currently support decimal mode, so I didn’t bother to similarly change CHKDADD).
And with that, I was ready for the big test. I let P7 run the test suite for a half million cycles, and then the simulation generated the following summary output:
Code: Select all
(FOR EXPLANATION OF THE FIGURES SEE DISCUSSION BELOW)
------------ Run Stats ------------
Cycles: 500000
NMOS Cycles: 1424511
- - - - - - - - - - - - - - - - - -
Instructions: 462936 (93%) <-- (JSR and RTS are counted as one instruction each)
Data Stalls 1641 (0%)
Control Stalls: 1104 (0%)
Delay Slots: 13831 (3%)
Generated: 12849 (3%)
Cache Busy: 7640 (2%)
Total: 500000 (100%)
- - - - - - - - - - - - - - - - - -
Data Fwd: 131477 (26%)
Operand Fwd: 3755 (1%)
- - - - - - - - - - - - - - - - - -
Branches: 76460 (17%)
Taken: 4993 (7%)
btb Hit: 4571 (6%)
Mispredictions: 276 (0%)
- - - - - - - - - - - - - - - - - -
bCache Access: 2 (0%)
----------- Performance -----------
CPI: 1.08
NMOS CPI: 3.08
Boost Factor: 2.9x
Some other interesting things to highlight in the numbers above:
- On the whole, the pipeline manages to resolve interdependencies between instructions very successfully, and very few Data Stalls were triggered on this run.
- Control Stalls too are very few, which suggests the branch logic manages pretty well despite a 7-cycle branch mis-prediction penalty. (Branches are resolved in the final WB pipeline stage, so a branch mis-prediction requires that all pipeline stages be flushed and reloaded).
- 3% of cycles are Delay Slots — a direct branch (or jump) will incur a single delay slot penalty. Indirect jumps trigger two delay slots. No attempt is made to fill the slots, so in fact they are full-fledged stalls. Conditional branches and direct jumps that have been seen at least once in the instruction stream do not incur a delay slot penalty; they are managed by the Branch Target Buffer. For various reasons, we don’t use the btb for indirect jumps.
- Relatively few Cache Busy stalls are generated (about 2% of cycles), which suggests that single-entry cache write buffers are more than adequate in this case. There are usually plenty of idle cycles available to complete writes in a timely manner. Even so, this aspect of pipeline performance may be further improved. Running the simulation with dual port RAMs enabled produced a CPI of 1.06, as all Cache Busy stalls are eliminated. That's a significant improvement that is well worth considering, particularly if full caching is implemented and the inevitable cache misses begin to tax performance (although the locality of the working-set of most 6502 code is probably good enough so that cache miss rates may be very low. Anyway, the right tradeoffs should become clear with further experimentation).
- About 3% of cycles are additional cycles Generated by JSR and RTS instructions during their translation to multiple pipeline operations. This seems like a very reasonable tradeoff. With it, we avoid undue complication in the pipeline, and pay a relatively modest performance penalty.
- Fully 26% of cycles needed Data Forwarding to avoid stalls. Data is “forwarded” when it is passed directly from one pipeline stage to another as soon as it is generated (from the output of the ALU back into its input, for example). It’s very interesting to see just how critical this mechanism is. Without it, the pipeline would degrade significantly.
- It’s interesting also that Operand Forwarding does not have nearly the same impact. The FE Stage goes to some lengths to have complete instructions ready when the DE Stage needs them. However, the first couple of cycles after a branch can be tricky because a branch invalidates the Pre-Fetch Code Buffer. If the branch target instruction spans a word boundary, one or more operands will be missing from the initial fetch, and the FE Stage falls behind. In some cases, it is possible for FE to forward the missing operands from memory directly to the DE stage, and thereby avoid a stall. I thought this feature would be well worth the added complexity, but the results above suggest otherwise. In this run at least, it would be better to tolerate a few more stalls rather than over-burden the FE stage with this additional logic.
- 17% of instructions executed were Conditional Branches. Of those, only 7% were Taken Branches. This may be because the Dormann test suite is peppered throughout with error-checking branches which are never taken (once the emulation is working correctly that is). I suspect we will see a much higher proportion of taken branches with other code.
- The Branch Target Buffer (BTB) seems to perform very well, and generates hits for nearly all taken branches. This is not altogether surprising given that looping branches are taken many times while not-taken branches repeat infrequently.
- Meanwhile, Branch Prediction is “off-the-charts” accurate. I am using a 2-bit Branch Prediction scheme here, with a modified Backward-Taken-Forward-Not-Taken static prediction for the first encounter of a given branch. I say modified because a backward branch to itself is predicted as Not-Taken. There are a large number of these in the Dormann code, which probably accounts for the unusually high accuracy in this run. I would not expect prediction accuracy to stay at these levels for the common case workload.
- Finally, it’s interesting to note that (contrary to my expectation) the bCache is hardly ever used. The bCache is a parallel Address Memory used by the AD Stage to fetch a complete address in a single cycle, even when that address spans a word-boundary. The above result suggests it might be best to dispense with the bCache altogether, and simply tolerate the occasional pipeline stall when necessary.
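For anyone curious, the branch prediction scheme described above can be sketched as follows (illustrative Python; the table organization in P7 may well differ): a 2-bit saturating counter per branch, seeded on first encounter with backward-taken/forward-not-taken, where a strict "backward" test means a branch targeting itself is seeded not-taken.

```python
# 2-bit saturating-counter branch predictor with modified BTFN seeding.
class Predictor:
    def __init__(self):
        self.counters = {}                # branch address -> counter in 0..3

    def predict(self, addr, target):
        if addr not in self.counters:
            # First encounter: backward-taken / forward-not-taken. Using a
            # strict comparison means a branch to itself (target == addr)
            # falls out as not-taken, per the modification described above.
            taken = target < addr
            self.counters[addr] = 2 if taken else 1
        return self.counters[addr] >= 2   # 2,3 predict taken; 0,1 not-taken

    def update(self, addr, taken):
        c = self.counters[addr]
        self.counters[addr] = min(3, c + 1) if taken else max(0, c - 1)
```

The 2-bit counter means a loop branch mispredicts only once on exit rather than twice, which helps keep the 7-cycle flush penalty rare.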
Speaking of fun, one highlight for me was being able to witness first-hand how the pipeline stages interact with one another. The P7 simulator reports its progress and displays pipeline activity as it goes. It’s fascinating to see it work through different instruction stream scenarios. To that end, below is a brief snippet of the last 10 cycles of the Dormann test run. (In the report, cycles are delimited by a line showing the cycle number, and one line per pipeline stage is shown below that. Each pipeline stage is annotated with relevant pipeline activity as appropriate. The FE Stage shows the next Program Memory fetch address every cycle, and also dumps the 8-byte Prefetch Code Buffer whenever a new 32-bit word is brought in from Program Memory. In the report, iCache refers to Program Memory, aCache to Address Memory and dCache to Data Memory. Pipeline “bubbles” are indicated by “---“ in the opcode field. Note that an RTS instruction generates two pipeline operations, one to adjust SP, and the other to jump to the return address. This is an indirect jump which triggers two Delay Slots).
I would be delighted to share the python code with anyone wishing to try it out. (Please PM me). Alternatively, feel free to suggest any 6502 code you think might be an instructive test case. All I need to run a test is a binary image of the code (up to 64k), a load address, and a start address to jump to. A listing of the code is useful in case I run into any issues (e.g., self-modifying code). The only caveat is that the test must run without any user input or interrupts.
(One thing I’d like to test is loading BASIC and running one of the sieve type benchmarks. Would anyone be willing to create a binary image containing, say, ehBASIC with a BASIC program already loaded in memory such that it would auto-run without user input? That would be a cool thing to try).
As always, many thanks to Dr Jefyll and ttlworks for their continued help, and thanks in advance for any thoughts and comments on the above.
Cheers for now,
Drass
Code: Select all
Program Start: $c000
------------- 499990 --------------
WB: js* $0213
EX: sbc #$fb
... A = $2e
DA: rts
IX: rt*
AD: ---
DE: ---
... iCache Write Pending
FE: $f0e8: 20 13 2 8 - 20 13 2 8
------------- 499991 --------------
WB: sbc #$fb
EX: rts
DA: rt*
IX: ---
AD: ---
DE: php
... iCache Write Pending
FE: $f0ec: 20 13 2 8 - c5 f d0 fe
------------- 499992 --------------
WB: rts
EX: rt*
DA: ---
IX: ---
AD: php
... iCache Write Pending
DE: cmp $0f
FE: $f0f0: 68 29 c3 c5 - c5 f d0 fe
------------- 499993 --------------
WB: rt*
EX: ---
DA: ---
IX: php
AD: cmp $0f
DE: bne $fe
... iCache Write Pending
FE: $f0f4: 68 29 c3 c5 - 11 d0 fe 28
------------- 499994 --------------
WB: ---
EX: ---
DA: php
IX: cmp $0f
AD: bne $fe
DE: pla
FE: $f0f8:
... Drain: iCache[$1fb] = $ea
------------- 499995 --------------
WB: ---
EX: php
DA: cmp $0f
... Alub = $2e = [$f]
IX: bne $fe
AD: pla
DE: and #$c3
FE: $f0f8:
------------- 499996 --------------
WB: php
... dCache[$1fc] = $30
EX: cmp $0f
DA: bne $fe
IX: pla
AD: and #$c3
... iCache Write Pending
DE: cmp $11
FE: $f0f8: 8 a5 d 75 - 11 d0 fe 28
... Drain: dCache[$1fc] = $30
... Drain: aCache[$1fc] = $30
------------- 499997 --------------
WB: cmp $0f
EX: bne $fe
DA: pla
... Alub = $30 = [$1fc]
IX: and #$c3
AD: cmp $11
DE: bne $fe
FE: $f0fc:
... Drain: iCache[$1fc] = $30
------------- 499998 --------------
WB: bne $fe
EX: pla
... A = $30
DA: and #$c3
... Forward A = $30
IX: cmp $11
AD: bne $fe
DE: plp
FE: $f0fc: 8 a5 d 75 - 0 8 c5 f
------------- 499999 --------------
WB: pla
EX: and #$c3
... A = $0
DA: cmp $11
... Forward A = $0
... Alub = $0 = [$11]
IX: bne $fe
AD: plp
DE: php
FE: $f100:
------------- 500000 --------------
WB: and #$c3
EX: cmp $11
DA: bne $fe
IX: plp
AD: php
DE: lda $0d
FE: $f100: