On another thread, I posted an outline of a pipelined 6502. BigEd then suggested I start a new thread on the topic. It's a little awkward to split off an existing discussion, so I’ve tried to provide some context below (including replicating the original post).
I'll continue to chronicle the work here. As always, thanks in advance for all comments and suggestions.
Cheers for now,
Drass
--------------- Summary of discussion to date ---------------
- Drass: Original post (replicated below)
- MichaelM discusses the importance of memory bandwidth in this context, and suggests that a dual cache design (an I-Cache and a D-Cache) may be sufficient to support the pipeline well.
- Chromatix draws a comparison to the 68040 pipeline
- ttlworks discusses the Intel Itanium VLIW architecture and his own TREX TTL design
- John West suggests pre-decoding instructions as they enter the I-Cache
--------------- Original Post Replicated ------------------
Having had the chance to study this thread more closely, I began to experiment with a new “reduced-CPI” 6502 design. After significant noodling, I settled on a seven-stage pipelined architecture, as follows:
[FEtch] —> [DEcode] —> [ADdress] —> [IndeX] —> [DAta] —> [EXecute] —> [WriteBack]
- FE: Fetch an instruction (opcode and operands) from Program Memory
- DE: Decode opcode, marshal operands, calculate pc-relative, stack-relative and pre-index offsets
- AD: Fetch indirect address, including jump vectors, from Address Memory
- IX: Apply post-indexed address offsets
- DA: Read Data Memory
- EX: Perform ALU operation, update register file
- WB: Update P register, verify branch prediction, write to Data Memory
The basic idea is that a seven-stage pipeline can support the various 6502 addressing modes (including RMW instructions) in one pass. Because the first stage simultaneously fetches both the opcode and its inline operands, each instruction can work its way down the pipeline as an atomic unit. Instructions advance through the pipeline at a rate of one stage per cycle, making use of specific resources as they go — indirection, indexing, memory, the ALU, etc. Once the pipeline fills, the CPU is able to complete one instruction in something approaching one cycle on average (i.e. CPI = 1.x). (Various pipeline hazards make the ideal one-instruction-per-cycle pipeline somewhat of a chimera, but we can try to get close.)
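To make the flow concrete, here is a minimal python sketch (my own toy model, not the P7 code itself) of instructions advancing through seven stage registers, one stage per cycle:

Code:
# Toy model of a seven-stage pipeline: one register per stage,
# shifted one position per cycle. The real design layers hazard
# checks, stalls and forwarding on top of this skeleton.

STAGES = ["FE", "DE", "AD", "IX", "DA", "EX", "WB"]

def run(program, cycles):
    pipe = [None] * len(STAGES)      # pipeline registers, FE..WB
    pc, completed = 0, 0
    for _ in range(cycles):
        if pipe[-1] is not None:     # instruction leaving WB retires
            completed += 1
        pipe = [None] + pipe[:-1]    # everything advances one stage
        if pc < len(program):        # FE fetches the next instruction
            pipe[0] = program[pc]
            pc += 1
    return completed

print(run(["lda #$01", "clc", "adc #$02", "sta $10"], 11))   # -> 4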
As others have suggested, pipelining a 6502 beyond just a few stages requires concurrent access to Program, Address and Data memory. For example, we could have Program Memory supplying a new instruction, Address Memory reading an indirect address from zero page, and Data Memory reading a byte of data... all at the same time. In concept, the requirement is analogous to a large multiport RAM with three simultaneous accesses possible. In practice, modern pipelines are supported by sophisticated multi-level cached memory for this purpose. But we will want to keep things as simple as possible to start -- we can always optimize later.
For now, let's propose three distinct single-port RAMs working together. We will want to be able to read complete instructions and addresses as often as possible, so 32-bit wide memory is appropriate. Let's also assume 64K bytes of addressable space. One consequence of using three single-port RAMs is that writes must be mirrored across all memories to keep them synchronized. To prevent writes from stalling the pipeline excessively, each memory is outfitted with a one-deep write buffer. Each memory can independently postpone a write while busy, completing it on a cycle when it would otherwise have been inactive.
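As a rough sketch of this arrangement (my own simplification, with invented names, not the simulator's internals), each memory can be modeled with a one-deep buffer that drains on any cycle without a read:

Code:
# Single-port RAM with a one-deep write buffer. The caller mirrors
# every write to all three memories; each memory retires its
# buffered write on the first cycle it has no read to service.

class BufferedRAM:
    def __init__(self, size=0x10000):
        self.mem = bytearray(size)
        self.pending = None                  # (addr, value) or None

    def write(self, addr, value):
        if self.pending is not None:
            return False                     # buffer occupied: stall
        self.pending = (addr, value)
        return True

    def cycle(self, read_addr=None):
        if read_addr is not None:
            # A read must see the buffered write if addresses match.
            if self.pending and self.pending[0] == read_addr:
                return self.pending[1]
            return self.mem[read_addr]
        if self.pending is not None:         # idle cycle: drain
            addr, value = self.pending
            self.mem[addr] = value
            self.pending = None
        return None

# Mirroring a write across Program, Address and Data Memory:
icache, acache, dcache = BufferedRAM(), BufferedRAM(), BufferedRAM()
for m in (icache, acache, dcache):
    m.write(0x01fc, 0x30)        # each drains independently when idle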
Beyond memory, certain accessory mechanisms are needed. The 6502's variable-length instructions mean that instruction boundaries don't always neatly word-align in 32-bit memory. For this reason, Program Memory feeds into a prefetch buffer, and any instruction that might have been divided because its bytes span a word boundary will appear intact and complete in the buffer. There is also an alignment problem in Address Memory. Here we can use a second, parallel Address Memory to enable fetching two consecutive words in one cycle. This guarantees that we will get a complete address, even when that address spans a 32-bit boundary. In addition, as instructions are processed, branches are identified and their target addresses loaded into a Branch Target Buffer. The BTB is then consulted on every fetch to resolve branch targets as early as possible, and to aid in branch prediction. Finally, with all this concurrent activity, a complement of hazard checks must constantly police the pipeline to make sure the correct program semantics are maintained.
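To illustrate the prefetch side of this (a hypothetical rendering; the real byte-select mux matrix is more involved), the buffer can be modeled as a small byte queue that only releases an instruction once all of its bytes have arrived:

Code:
# Toy prefetch buffer: 32-bit words from Program Memory land in an
# 8-byte queue, so an instruction whose bytes straddle a word
# boundary still comes out whole once the second word arrives.

class PrefetchBuffer:
    def __init__(self):
        self.queue = []                    # up to 8 pending bytes

    def fill(self, word):                  # word = 4 bytes of code
        if len(self.queue) <= 4:
            self.queue.extend(word)

    def take(self, length):
        if len(self.queue) < length:
            return None                    # stall: bytes missing
        insn = self.queue[:length]
        del self.queue[:length]
        return insn

buf = PrefetchBuffer()
buf.fill([0xa5, 0x0f, 0x20, 0x13])         # lda $0f, then jsr $0213
print(buf.take(2))                         # [0xa5, 0x0f]
print(buf.take(3))                         # None: jsr spans the word
buf.fill([0x02, 0xea, 0xea, 0xea])
print(buf.take(3))                         # [0x20, 0x13, 0x02]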
Ok, so that’s a lot of stuff to keep straight. After some paper and pencil exploration, I decided to write a simulation to flesh out the concept and work out the kinks. The result is the P7 Simulator. It’s a fairly straightforward python script written to simulate the action of pipeline registers working cycle by cycle on 6502 opcodes. It currently supports NMOS opcodes and implements the seven-stage pipeline outlined above. It makes use of the following facilities across the various pipeline stages:
- FE: Program Memory Read, Pre-Fetch Code Buffer and associated Byte-Select Mux Matrix, Branch Target Buffer
- DE: Decode RAM, PC Increment Adder, Pre-Index Adder
- AD: Address Memory Read
- IX: Post-Index Adder
- DA: Data Memory Read, Register File Read
- EX: ALU, Register File Write
- WB: Data Memory Write, Branch evaluation/validation
As mentioned above, writes to the three memories are completed during idle cycles. The Branch Target Buffer (btb), on the other hand, is read every cycle and so has no naturally occurring idle cycles. Thankfully, btb reads can simply be skipped when the need to write arises. We might incur a few more branch stalls as a result, but as we’ll see, this all works out pretty well. (One alternative is to run memory at twice the pipeline clock rate and time-multiplex separate read and write ports. There are a variety of low-cost, high-speed synchronous RAMs available that are quite suitable for this purpose. And indeed, this can work equally well for all three memories in addition to the btb. The P7 simulation includes a run-time switch to simulate dual-port RAMs across the board.)
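A toy illustration of that priority rule (my own sketch, with invented names):

Code:
# When a BTB update is pending, it wins the single port and the
# fetch simply goes without a prediction for that one cycle.

def btb_cycle(btb, fetch_addr, pending_write=None):
    if pending_write is not None:
        addr, entry = pending_write
        btb[addr] = entry            # the write takes the port
        return None                  # no lookup: treat as a miss
    return btb.get(fetch_addr)       # normal cycle: look up fetch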
One significant fly in the memory-management ointment is self-modifying code. Unfortunately, an instruction that is already in the pipeline will not be modified by a write to Program Memory. It will therefore not reflect a recent modification when it executes. Rather than instrumenting the pipeline to detect self-modifying writes, the workaround is to impose a minimum distance between dynamically modified instructions and the instructions which modify them. Not the end of the world, but certainly a limitation in terms of compatibility.
Finally, one other architectural note: instructions which require both stack operations and jumps — namely JSR/RTS and BRK/RTI — are done by feeding multiple successive lines of microcode down the pipeline for each of these instructions (whereas the norm is just a single line of microcode). So a JSR <abs>, for example, generates a PUSH PCH / PUSH PCL / JMP <abs> sequence in the pipeline. Similarly, BRK generates three pushes, an SEI, and then a JMP (abs) instruction to jump through the correct interrupt vector. (Speaking of interrupts, I’ve yet to instrument the simulation to handle them. Intuitively, it seems like inserting a BRK into the instruction stream should work, but it will be interesting to see how that bears out, and where interrupt latency ends up).
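For example (with made-up micro-op names; the simulator's actual encoding will differ), the expansion could be table-driven:

Code:
# Instructions that combine stack operations and jumps expand into
# several successive pipeline lines; everything else stays single.

EXPANSIONS = {
    "jsr": ["push pch", "push pcl", "jmp <abs>"],
    "rts": ["adjust sp", "jmp (return)"],
    "brk": ["push pch", "push pcl", "push p", "sei", "jmp ($fffe)"],
}

def pipeline_ops(opcode):
    return EXPANSIONS.get(opcode, [opcode])

print(pipeline_ops("jsr"))    # three lines enter the pipeline
print(pipeline_ops("lda"))    # ['lda'] -- the common one-line case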
Ok, with that as a design outline, I got the simulation more or less put together. I then loaded the Dormann NMOS 6502 Test Suite to give it a whirl. Sure enough, there is self-modifying code in the suite! I had to insert a couple of strategically placed NOPs in the code to create the requisite minimum distance between the modifying and modified instructions. (Two NOPs just prior to each of the JSRs in the CHKADD routine did the trick. The simulator does not currently support decimal mode, so I didn’t bother to similarly change CHKDADD.)
And with that, I was ready for the big test. I let P7 run the test suite for a half million cycles, and then the simulation generated the following summary output:
Code:
(FOR EXPLANATION OF THE FIGURES SEE DISCUSSION BELOW)
------------ Run Stats ------------
Cycles:          500000
NMOS Cycles:    1424511
- - - - - - - - - - - - - - - - - -
Instructions:    462936  (93%)  <-- (JSR and RTS are counted as one instruction each)
Data Stalls:       1641   (0%)
Control Stalls:    1104   (0%)
Delay Slots:      13831   (3%)
Generated:        12849   (3%)
Cache Busy:        7640   (2%)
Total:           500000 (100%)
- - - - - - - - - - - - - - - - - -
Data Fwd:        131477  (26%)
Operand Fwd:       3755   (1%)
- - - - - - - - - - - - - - - - - -
Branches:         76460  (17%)
Taken:             4993   (7%)
btb Hit:           4571   (6%)
Mispredictions:     276   (0%)
- - - - - - - - - - - - - - - - - -
bCache Access:        2   (0%)
----------- Performance -----------
CPI:               1.08
NMOS CPI:          3.08
Boost Factor:      2.9x
There’s a lot of information here, but the key takeaway is the performance summary at the bottom. P7 averaged 1.08 CPI during the run! Honestly, I was not expecting such a good result. We shall see how performance holds up under varying workloads. For now I’m quite happy to report a 2.9x performance boost over the NMOS equivalent CPI for this run. Fantastic!
Some other interesting things to highlight in the numbers above:
- On the whole, the pipeline manages to resolve interdependencies between instructions very successfully, and very few Data Stalls were triggered on this run.
- Control Stalls too are very few, which suggests the branch logic manages pretty well despite a 7-cycle branch mis-prediction penalty. (Branches are resolved in the final WB pipeline stage, so a branch mis-prediction requires that all pipeline stages be flushed and reloaded).
- 3% of cycles are Delay Slots — a direct branch (or jump) will incur a single delay slot penalty. Indirect jumps trigger two delay slots. No attempt is made to fill the slots, so in fact they are full-fledged stalls. Conditional branches and direct jumps that have been seen at least once in the instruction stream do not incur a delay slot penalty; they are managed by the Branch Target Buffer. For various reasons, we don’t use the btb for indirect jumps.
- Relatively few Cache Busy stalls are generated (about 2% of cycles), which suggests that single-entry cache write buffers are more than adequate in this case. There are usually plenty of idle cycles available to complete writes in a timely manner. Even so, this aspect of pipeline performance may be further improved. Running the simulation with dual port RAMs enabled produced a CPI of 1.06, as all Cache Busy stalls are eliminated. That's a significant improvement that is well worth considering, particularly if full caching is implemented and the inevitable cache misses begin to tax performance (although the locality of the working-set of most 6502 code is probably good enough so that cache miss rates may be very low. Anyway, the right tradeoffs should become clear with further experimentation).
- About 3% of cycles are additional cycles Generated by JSR and RTS instructions during their translation to multiple pipeline operations. This seems like a very reasonable tradeoff. With it, we avoid undue complication in the pipeline, and pay a relatively modest performance penalty.
- Fully 26% of cycles needed Data Forwarding to avoid stalls. Data is “forwarded” when it is passed directly from one pipeline stage to another as soon as it is generated (from the output of the ALU back into its input, for example). It’s very interesting to see just how critical this mechanism is. Without it, the pipeline would degrade significantly. (See the first sketch following this list.)
- It’s interesting also that Operand Forwarding does not have nearly the same impact. The FE Stage goes to some lengths to have complete instructions ready when the DE Stage needs them. However, the first couple of cycles after a branch can be tricky because a branch invalidates the Pre-Fetch Code Buffer. If the branch target instruction spans a word boundary, one or more operands will be missing from the initial fetch, and the FE Stage falls behind. In some cases, it is possible for FE to forward the missing operands from memory directly to the DE stage, and thereby avoid a stall. I thought this feature would be well worth the added complexity, but the results above suggest otherwise. In this run at least, it would be better to tolerate a few more stalls rather than over-burden the FE stage with this additional logic.
- 17% of instructions executed were Conditional Branches. Of those, only 7% were Taken Branches. This may be because the Dormann test suite is peppered throughout with error-checking branches which are never taken (once the emulation is working correctly that is). I suspect we will see a much higher proportion of taken branches with other code.
- The Branch Target Buffer (BTB) seems to perform very well, and generates hits for nearly all taken branches. This is not altogether surprising given that looping branches are taken many times while not-taken branches repeat infrequently.
- Meanwhile, Branch Prediction is “off-the-charts” accurate. I am using a 2-bit Branch Prediction scheme here, with a modified Backward-Taken-Forward-Not-Taken static prediction for the first encounter of a given branch. I say modified because a backward branch to itself is predicted as Not-Taken. There are a large number of these in the Dormann code, which probably accounts for the unusually high accuracy in this run. I would not expect prediction accuracy to stay at these levels for the common case workload. (A sketch of this scheme also follows the list.)
- Finally, it’s interesting to note that (contrary to my expectation) the bCache is hardly ever used. The bCache is a parallel Address Memory used by the AD Stage to fetch a complete address in a single cycle, even when that address spans a word-boundary. The above result suggests it might be best to dispense with the bCache altogether, and simply tolerate the occasional pipeline stall when necessary.
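Here is the promised sketch of forwarding (a simplification under my own assumptions, not the P7 implementation): an operand read consults the in-flight results of later stages, youngest first, before falling back to the register file:

Code:
from collections import namedtuple

# An in-flight result: the register it targets and its value.
InFlight = namedtuple("InFlight", "dest result")

def read_operand(reg, regfile, in_flight):
    # in_flight lists later-stage results youngest-first, so the
    # most recently generated value for the register wins.
    for stage in in_flight:
        if stage is not None and stage.dest == reg:
            return stage.result      # forwarded: no stall needed
    return regfile[reg]

# e.g. A was just recomputed in EX but not yet written back:
regfile = {"A": 0x2e}
print(hex(read_operand("A", regfile, [InFlight("A", 0x30), None])))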
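And a sketch of the prediction scheme (again my own rendering, not the simulator's code): one 2-bit saturating counter per branch, seeded by the modified static rule on first encounter:

Code:
# 2-bit saturating counters: values 0..1 predict not-taken, 2..3
# predict taken. A first encounter uses Backward-Taken /
# Forward-Not-Taken, except that a branch to itself starts
# as not-taken.

counters = {}   # branch address -> 2-bit counter

def predict(branch_addr, target_addr):
    if branch_addr not in counters:
        to_self = target_addr == branch_addr
        backward = target_addr <= branch_addr
        counters[branch_addr] = 2 if backward and not to_self else 1
    return counters[branch_addr] >= 2          # True: predict taken

def update(branch_addr, taken):                # on resolution in WB
    c = counters[branch_addr]
    counters[branch_addr] = min(c + 1, 3) if taken else max(c - 1, 0)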
Alright, a lot remains to be worked out here, but I found these results very encouraging indeed. The next step is to profile a wider variety of workloads and see what can be learned. Once the design is fine-tuned, I’ll want to port the simulation to Logisim. That will provide further validation, and some sense of what it would take to implement something like this in the real world. As always, the objective here is to learn and to have fun, so I’ll still consider the exercise a success even if the design proves too ambitious for discrete logic.
Speaking of fun, one highlight for me was being able to witness first-hand how the pipeline stages interact with one another. The P7 simulator reports its progress and displays pipeline activity as it goes. It’s fascinating to see it work through different instruction stream scenarios. To that end, below is a brief snippet of the last 10 cycles of the Dormann test run. (In the report, cycles are delimited by a line showing the cycle number, and one line per pipeline stage is shown below that. Each pipeline stage is annotated with relevant pipeline activity as appropriate. The FE Stage shows the next Program Memory fetch address every cycle, and also dumps the 8-byte Prefetch Code Buffer whenever a new 32-bit word is brought in from Program Memory. In the report, iCache refers to Program Memory, aCache to Address Memory and dCache to Data Memory. Pipeline “bubbles” are indicated by “---” in the opcode field. Note that an RTS instruction generates two pipeline operations, one to adjust SP and the other to jump to the return address. This is an indirect jump which triggers two Delay Slots.)
I would be delighted to share the python code with anyone wishing to try it out. (Please PM me). Alternatively, feel free to suggest any 6502 code you think might be an instructive test case. All I need to run a test is a binary image of the code (up to 64k), a load address, and a start address to jump to. A listing of the code is useful in case I run into any issues (e.g., self-modifying code). The only caveat is that the test must run without any user input or interrupts.
(One thing I’d like to test is loading BASIC and running one of the sieve type benchmarks. Would anyone be willing to create a binary image containing, say, ehBASIC with a BASIC program already loaded in memory such that it would auto-run without user input? That would be a cool thing to try).
As always, many thanks to Dr Jefyll and ttlworks for their continued help, and thanks in advance for any thoughts and comments on the above.
Cheers for now,
Drass
Code:
Program Start: $c000
------------- 499990 --------------
WB: js* $0213
EX: sbc #$fb
... A = $2e
DA: rts
IX: rt*
AD: ---
DE: ---
... iCache Write Pending
FE: $f0e8: 20 13 2 8 - 20 13 2 8
------------- 499991 --------------
WB: sbc #$fb
EX: rts
DA: rt*
IX: ---
AD: ---
DE: php
... iCache Write Pending
FE: $f0ec: 20 13 2 8 - c5 f d0 fe
------------- 499992 --------------
WB: rts
EX: rt*
DA: ---
IX: ---
AD: php
... iCache Write Pending
DE: cmp $0f
FE: $f0f0: 68 29 c3 c5 - c5 f d0 fe
------------- 499993 --------------
WB: rt*
EX: ---
DA: ---
IX: php
AD: cmp $0f
DE: bne $fe
... iCache Write Pending
FE: $f0f4: 68 29 c3 c5 - 11 d0 fe 28
------------- 499994 --------------
WB: ---
EX: ---
DA: php
IX: cmp $0f
AD: bne $fe
DE: pla
FE: $f0f8:
... Drain: iCache[$1fb] = $ea
------------- 499995 --------------
WB: ---
EX: php
DA: cmp $0f
... Alub = $2e = [$f]
IX: bne $fe
AD: pla
DE: and #$c3
FE: $f0f8:
------------- 499996 --------------
WB: php
... dCache[$1fc] = $30
EX: cmp $0f
DA: bne $fe
IX: pla
AD: and #$c3
... iCache Write Pending
DE: cmp $11
FE: $f0f8: 8 a5 d 75 - 11 d0 fe 28
... Drain: dCache[$1fc] = $30
... Drain: aCache[$1fc] = $30
------------- 499997 --------------
WB: cmp $0f
EX: bne $fe
DA: pla
... Alub = $30 = [$1fc]
IX: and #$c3
AD: cmp $11
DE: bne $fe
FE: $f0fc:
... Drain: iCache[$1fc] = $30
------------- 499998 --------------
WB: bne $fe
EX: pla
... A = $30
DA: and #$c3
... Forward A = $30
IX: cmp $11
AD: bne $fe
DE: plp
FE: $f0fc: 8 a5 d 75 - 0 8 c5 f
------------- 499999 --------------
WB: pla
EX: and #$c3
... A = $0
DA: cmp $11
... Forward A = $0
... Alub = $0 = [$11]
IX: bne $fe
AD: plp
DE: php
FE: $f100:
------------- 500000 --------------
WB: and #$c3
EX: cmp $11
DA: bne $fe
IX: plp
AD: php
DE: lda $0d
FE: $f100: