I'll continue to chronicle the work here. As always, thanks in advance for all comments and suggestions.
Cheers for now,
Drass
--------------- Summary of discussion to date ---------------
- Drass: Original post (replicated below)
- MichaelM discusses the importance of memory bandwidth in this context, and suggests that a dual cache design (an I-Cache and a D-Cache) may be sufficient to support the pipeline well.
- Chromatix draws a comparison to the 68040 pipeline
- ttlworks discusses the Intel Itanium VLIW architecture and his own TREX TTL design
- John West suggests pre-decoding instructions as they enter the I-Cache
Having had the chance to study this thread more closely, I began to experiment with a new “reduced-CPI” 6502 design. After significant noodling, I settled on a seven-stage pipelined architecture, as follows:
[FEtch] —> [DEcode] —> [ADdress] —> [IndeX] —> [DAta] —> [EXecute] —> [WriteBack]
- FE: Fetch an instruction (opcode and operands) from Program Memory
- DE: Decode opcode, marshal operands, calculate pc-relative, stack-relative and pre-index offsets
- AD: Fetch indirect address, including jump vectors, from Address Memory
- IX: Apply post-indexed address offsets
- DA: Read Data Memory
- EX: Perform ALU operation, update register file
- WB: Update P register, verify branch prediction, write to Data Memory
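To make the stage-by-stage flow concrete, here is a minimal Python sketch of how such a pipeline can be modeled cycle by cycle. (This is illustrative only, not the P7 code; instructions simply shift one stage per clock, with None standing in for a bubble, and hazards are ignored.)

```python
# Minimal model of a seven-stage pipeline: one register per stage,
# with every register advancing one stage per clock cycle.
STAGES = ["FE", "DE", "AD", "IX", "DA", "EX", "WB"]

def step(pipeline, next_instruction):
    """Advance every stage register one cycle; FE accepts a new instruction."""
    retired = pipeline[-1]            # instruction leaving WB this cycle
    pipeline[1:] = pipeline[:-1]      # each stage receives from the stage above
    pipeline[0] = next_instruction    # FE fetches the next instruction
    return retired

pipeline = [None] * len(STAGES)       # start the pipe full of bubbles
program = ["lda #$01", "sta $0200", "nop"]
retired = []
for cycle in range(len(STAGES) + len(program)):
    op = program[cycle] if cycle < len(program) else None
    done = step(pipeline, op)
    if done is not None:
        retired.append(done)
# instructions emerge from WB in order, seven cycles after being fetched
```

With all seven stages full of independent instructions, one instruction retires per cycle, which is where the ideal CPI of 1 comes from.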
As others have suggested, pipelining a 6502 beyond just a few stages requires concurrent access to Program, Address and Data memory. For example, we could have Program Memory supplying a new instruction, Address Memory reading an indirect address from zero page, and Data Memory reading a byte of data... all at the same time. In concept, the requirement is analogous to a large multiport RAM with three simultaneous accesses possible. In practice, modern pipelines are supported by sophisticated multi-level cached memory for this purpose. But we will want to keep things as simple as possible to start -- we can always optimize later.
For now, let's propose three distinct single port RAMs working together. We will want to be able to read complete instructions and addresses as often as possible, so 32-bit wide memory is appropriate. Let's also assume 64K bytes of addressable space. One consequence of using three single port RAMs is that writes must be mirrored across all memories to keep them synchronized. To prevent writes from stalling the pipeline excessively, each memory is outfitted with a one-deep write buffer. Each memory can independently postpone a write while busy, until a cycle when it would otherwise have been inactive.
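As a sketch of the write-buffer idea (illustrative Python with names of my own choosing, not anything from the P7 source): each memory parks an incoming write in a one-deep buffer and drains it on the first cycle its port would otherwise sit idle.

```python
# One-deep write buffer in front of a single-port RAM: a write is parked,
# then drained on the first cycle the port is not busy with a read.
class BufferedRAM:
    def __init__(self, size=0x10000):
        self.mem = bytearray(size)
        self.pending = None               # parked (address, value), or None

    def write(self, addr, value):
        if self.pending is not None:
            return False                  # buffer occupied: caller must stall
        self.pending = (addr, value)
        return True

    def cycle(self, busy_reading):
        """Call once per clock; drain the parked write when the port is idle."""
        if self.pending is not None and not busy_reading:
            addr, value = self.pending
            self.mem[addr] = value
            self.pending = None

ram = BufferedRAM()
accepted = ram.write(0x1FC, 0x30)
ram.cycle(busy_reading=True)              # port busy this cycle: write waits
ram.cycle(busy_reading=False)             # idle cycle: write drains to the array
```

A second write arriving while one is still parked is what triggers a Cache Busy stall in the statistics further down.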
Beyond memory, certain accessory mechanisms are needed. 6502 variable length instructions mean the instruction boundaries don't always neatly word-align in 32-bit memory. For this reason Program memory feeds into a prefetch buffer, and any instruction that might have been divided because its bytes span a word boundary will appear intact and complete in the buffer. There is also an alignment problem in Address Memory. Here we can use a second parallel Address Memory to enable fetching of two consecutive words in one cycle. This guarantees that we will get a complete address, even when that address spans a 32-bit boundary. In addition, as instructions are processed, branches are identified and their target addresses loaded into a Branch Target Buffer. The BTB is then consulted on every fetch to resolve branch targets as early as possible, and to aid in branch prediction. Finally, with all this concurrent activity, a complement of hazard checks must constantly police the pipeline to make sure the correct program semantics are maintained.
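The BTB lookup itself can be sketched very simply (again illustrative Python, not the P7 implementation): the fetch address indexes a table of previously seen branch targets, and a hit lets the fetch stage redirect without waiting for decode.

```python
# Branch Target Buffer: maps the address of a seen branch to its target,
# so the fetch stage can redirect as early as possible.
class BranchTargetBuffer:
    def __init__(self):
        self.entries = {}                 # branch address -> target address

    def record(self, branch_addr, target):
        self.entries[branch_addr] = target

    def lookup(self, fetch_addr):
        return self.entries.get(fetch_addr)   # None on a miss

btb = BranchTargetBuffer()
btb.record(0xF0F2, 0xF0E8)                # a looping backward branch
hit = btb.lookup(0xF0F2)
miss = btb.lookup(0xC000)
```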
Ok, so that’s a lot of stuff to keep straight. After some paper-and-pencil exploration, I decided to write a simulation to flesh out the concept and work out the kinks. The result is the P7 Simulator. It’s a fairly straightforward Python script written to simulate the action of pipeline registers working cycle by cycle on 6502 opcodes. It currently supports NMOS opcodes and implements the seven-stage pipeline outlined above. It makes use of the following facilities across the various pipeline stages:
- FE: Program Memory Read, Pre-Fetch Code Buffer and associated Byte-Select Mux Matrix, Branch Target Buffer
- DE: Decode RAM, PC Increment Adder, Pre-Index Adder
- AD: Address Memory Read
- IX: Post-Index Adder
- DA: Data Memory Read, Register File Read
- EX: ALU, Register File Write
- WB: Data Memory Write, Branch evaluation/validation
One significant fly in the memory-management ointment is self-modifying code. Unfortunately, an instruction that is already in the pipeline will not be modified by a write to Program Memory. It will therefore not reflect a recent modification when it executes. Rather than instrumenting the pipeline to detect self-modifying writes, the workaround is to impose a minimum distance between dynamically modified instructions and the instructions which modify them. Not the end of the world, but certainly a limitation in terms of compatibility.
Finally, one other architectural note: instructions which require both stack operations and jumps — namely JSR/RTS and BRK/RTI — are done by feeding multiple successive lines of microcode down the pipeline for each of these instructions (whereas the norm is just a single line of microcode). So a JSR <abs>, for example, generates a PUSH PCH / PUSH PCL / JMP <abs> sequence in the pipeline. Similarly, BRK generates three pushes, an SEI, and then a JMP (abs) instruction to jump through the correct interrupt vector. (Speaking of interrupts, I’ve yet to instrument the simulation to handle them. Intuitively, it seems like inserting a BRK into the instruction stream should work, but it will be interesting to see how that bears out, and where interrupt latency ends up).
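A sketch of this expansion step (the micro-op mnemonics here are my own shorthand, not P7's actual internal encoding):

```python
# Translate multi-action instructions into a sequence of simpler pipeline
# operations; ordinary instructions pass through as a single operation.
def expand(opcode, operand=None):
    if opcode == "jsr":
        # JSR <abs> becomes PUSH PCH / PUSH PCL / JMP <abs>
        return ["push pch", "push pcl", "jmp " + operand]
    if opcode == "brk":
        # three pushes, SEI, then a jump through the interrupt vector
        return ["push pch", "push pcl", "push p", "sei", "jmp ($fffe)"]
    return [opcode + " " + operand] if operand else [opcode]
```

Each line of the returned list is fed down the pipeline on a successive cycle, which is where the "Generated" cycles in the statistics below come from.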
Ok, with that as a design outline, I got the simulation more or less put together. I then loaded the Dormann NMOS 6502 Test Suite to give it a whirl. Sure enough, there is self-modifying code in the suite! I had to insert a couple of strategically placed NOPs in the code to create the requisite minimum distance between self-modifying instructions. (Two NOPs just prior to each of the JSRs in the CHKADD routine did the trick. The simulator does not currently support decimal mode, so I didn’t bother to similarly change CHKDADD).
And with that, I was ready for the big test. I let P7 run the test suite for a half million cycles, and then the simulation generated the following summary output:
Code: Select all
(FOR EXPLANATION OF THE FIGURES SEE DISCUSSION BELOW)
------------ Run Stats ------------
Cycles: 500000
NMOS Cycles: 1424511
- - - - - - - - - - - - - - - - - -
Instructions: 462936 (93%) <-- (JSR and RTS are counted as one instruction each)
Data Stalls 1641 (0%)
Control Stalls: 1104 (0%)
Delay Slots: 13831 (3%)
Generated: 12849 (3%)
Cache Busy: 7640 (2%)
Total: 500000 (100%)
- - - - - - - - - - - - - - - - - -
Data Fwd: 131477 (26%)
Operand Fwd: 3755 (1%)
- - - - - - - - - - - - - - - - - -
Branches: 76460 (17%)
Taken: 4993 (7%)
btb Hit: 4571 (6%)
Mispredictions: 276 (0%)
- - - - - - - - - - - - - - - - - -
bCache Access: 2 (0%)
----------- Performance -----------
CPI: 1.08
NMOS CPI: 3.08
Boost Factor: 2.9x
Some other interesting things to highlight in the numbers above:
- On the whole, the pipeline manages to resolve interdependencies between instructions very successfully, and very few Data Stalls were triggered on this run.
- Control Stalls too are very few, which suggests the branch logic manages pretty well despite a 7-cycle branch mis-prediction penalty. (Branches are resolved in the final WB pipeline stage, so a branch mis-prediction requires that all pipeline stages be flushed and reloaded).
- 3% of cycles are Delay Slots — a direct branch (or jump) will incur a single delay slot penalty. Indirect jumps trigger two delay slots. No attempt is made to fill the slots, so in fact they are full-fledged stalls. Conditional branches and direct jumps that have been seen at least once in the instruction stream do not incur a delay slot penalty; they are managed by the Branch Target Buffer. For various reasons, we don’t use the btb for indirect jumps.
- Relatively few Cache Busy stalls are generated (about 2% of cycles), which suggests that single-entry cache write buffers are more than adequate in this case. There are usually plenty of idle cycles available to complete writes in a timely manner. Even so, this aspect of pipeline performance may be further improved. Running the simulation with dual port RAMs enabled produced a CPI of 1.06, as all Cache Busy stalls are eliminated. That's a significant improvement that is well worth considering, particularly if full caching is implemented and the inevitable cache misses begin to tax performance (although the locality of the working-set of most 6502 code is probably good enough so that cache miss rates may be very low. Anyway, the right tradeoffs should become clear with further experimentation).
- About 3% of cycles are additional cycles Generated by JSR and RTS instructions during their translation to multiple pipeline operations. This seems like a very reasonable tradeoff. With it, we avoid undue complication in the pipeline, and pay a relatively modest performance penalty.
- Fully 26% of cycles needed Data Forwarding to avoid stalls. Data is “forwarded” when it is passed directly from one pipeline stage to another as soon as it is generated (from the output of the ALU back into its input, for example). It’s very interesting to see just how critical this mechanism is. Without it, the pipeline would degrade significantly.
- It’s interesting also that Operand Forwarding does not have nearly the same impact. The FE Stage goes to some lengths to have complete instructions ready when the DE Stage needs them. However, the first couple of cycles after a branch can be tricky because a branch invalidates the Pre-Fetch Code Buffer. If the branch target instruction spans a word boundary, one or more operands will be missing from the initial fetch, and the FE Stage falls behind. In some cases, it is possible for FE to forward the missing operands from memory directly to the DE stage, and thereby avoid a stall. I thought this feature would be well worth the added complexity, but the results above suggest otherwise. In this run at least, it would be better to tolerate a few more stalls rather than over-burden the FE stage with this additional logic.
- 17% of instructions executed were Conditional Branches. Of those, only 7% were Taken Branches. This may be because the Dormann test suite is peppered throughout with error-checking branches which are never taken (once the emulation is working correctly that is). I suspect we will see a much higher proportion of taken branches with other code.
- The Branch Target Buffer (BTB) seems to perform very well, and generates hits for nearly all taken branches. This is not altogether surprising given that looping branches are taken many times while not-taken branches repeat infrequently.
- Meanwhile, Branch Prediction is “off-the-charts” accurate. I am using a 2-bit Branch Prediction scheme here, with a modified Backward-Taken-Forward-Not-Taken static prediction for the first encounter of a given branch. I say modified because a backward branch to itself is predicted as Not-Taken. There are a large number of these in the Dormann code, which probably accounts for the unusually high accuracy in this run. I would not expect prediction accuracy to stay at these levels for the common case workload.
- Finally, it’s interesting to note that (contrary to my expectation) the bCache is hardly ever used. The bCache is a parallel Address Memory used by the AD Stage to fetch a complete address in a single cycle, even when that address spans a word-boundary. The above result suggests it might be best to dispense with the bCache altogether, and simply tolerate the occasional pipeline stall when necessary.
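For anyone curious, the branch prediction scheme described above can be sketched as follows (illustrative Python; the table organization in P7 may well differ): a 2-bit saturating counter per branch, seeded on first encounter with backward-taken/forward-not-taken, where a strict "backward" test means a branch targeting itself is seeded not-taken.

```python
# 2-bit saturating-counter branch predictor with modified BTFN seeding.
class Predictor:
    def __init__(self):
        self.counters = {}                # branch address -> counter in 0..3

    def predict(self, addr, target):
        if addr not in self.counters:
            # First encounter: backward-taken / forward-not-taken. Using a
            # strict comparison means a branch to itself (target == addr)
            # falls out as not-taken, per the modification described above.
            taken = target < addr
            self.counters[addr] = 2 if taken else 1
        return self.counters[addr] >= 2   # 2,3 predict taken; 0,1 not-taken

    def update(self, addr, taken):
        c = self.counters[addr]
        self.counters[addr] = min(3, c + 1) if taken else max(0, c - 1)
```

The 2-bit counter means a loop branch mispredicts only once on exit rather than twice, which helps keep the 7-cycle flush penalty rare.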
Speaking of fun, one highlight for me was being able to witness first-hand how the pipeline stages interact with one another. The P7 simulator reports its progress and displays pipeline activity as it goes. It’s fascinating to see it work through different instruction stream scenarios. To that end, below is a brief snippet of the last 10 cycles of the Dormann test run. (In the report, cycles are delimited by a line showing the cycle number, and one line per pipeline stage is shown below that. Each pipeline stage is annotated with relevant pipeline activity as appropriate. The FE Stage shows the next Program Memory fetch address every cycle, and also dumps the 8-byte Prefetch Code Buffer whenever a new 32-bit word is brought in from Program Memory. In the report, iCache refers to Program Memory, aCache to Address Memory and dCache to Data Memory. Pipeline “bubbles” are indicated by “---“ in the opcode field. Note that an RTS instruction generates two pipeline operations, one to adjust SP, and the other to jump to the return address. This is an indirect jump which triggers two Delay Slots).
I would be delighted to share the python code with anyone wishing to try it out. (Please PM me). Alternatively, feel free to suggest any 6502 code you think might be an instructive test case. All I need to run a test is a binary image of the code (up to 64k), a load address, and a start address to jump to. A listing of the code is useful in case I run into any issues (e.g., self-modifying code). The only caveat is that the test must run without any user input or interrupts.
(One thing I’d like to test is loading BASIC and running one of the sieve type benchmarks. Would anyone be willing to create a binary image containing, say, ehBASIC with a BASIC program already loaded in memory such that it would auto-run without user input? That would be a cool thing to try).
As always, many thanks to Dr Jefyll and ttlworks for their continued help, and thanks in advance for any thoughts and comments on the above.
Cheers for now,
Drass
Code: Select all
Program Start: $c000
------------- 499990 --------------
WB: js* $0213
EX: sbc #$fb
... A = $2e
DA: rts
IX: rt*
AD: ---
DE: ---
... iCache Write Pending
FE: $f0e8: 20 13 2 8 - 20 13 2 8
------------- 499991 --------------
WB: sbc #$fb
EX: rts
DA: rt*
IX: ---
AD: ---
DE: php
... iCache Write Pending
FE: $f0ec: 20 13 2 8 - c5 f d0 fe
------------- 499992 --------------
WB: rts
EX: rt*
DA: ---
IX: ---
AD: php
... iCache Write Pending
DE: cmp $0f
FE: $f0f0: 68 29 c3 c5 - c5 f d0 fe
------------- 499993 --------------
WB: rt*
EX: ---
DA: ---
IX: php
AD: cmp $0f
DE: bne $fe
... iCache Write Pending
FE: $f0f4: 68 29 c3 c5 - 11 d0 fe 28
------------- 499994 --------------
WB: ---
EX: ---
DA: php
IX: cmp $0f
AD: bne $fe
DE: pla
FE: $f0f8:
... Drain: iCache[$1fb] = $ea
------------- 499995 --------------
WB: ---
EX: php
DA: cmp $0f
... Alub = $2e = [$f]
IX: bne $fe
AD: pla
DE: and #$c3
FE: $f0f8:
------------- 499996 --------------
WB: php
... dCache[$1fc] = $30
EX: cmp $0f
DA: bne $fe
IX: pla
AD: and #$c3
... iCache Write Pending
DE: cmp $11
FE: $f0f8: 8 a5 d 75 - 11 d0 fe 28
... Drain: dCache[$1fc] = $30
... Drain: aCache[$1fc] = $30
------------- 499997 --------------
WB: cmp $0f
EX: bne $fe
DA: pla
... Alub = $30 = [$1fc]
IX: and #$c3
AD: cmp $11
DE: bne $fe
FE: $f0fc:
... Drain: iCache[$1fc] = $30
------------- 499998 --------------
WB: bne $fe
EX: pla
... A = $30
DA: and #$c3
... Forward A = $30
IX: cmp $11
AD: bne $fe
DE: plp
FE: $f0fc: 8 a5 d 75 - 0 8 c5 f
------------- 499999 --------------
WB: pla
EX: and #$c3
... A = $0
DA: cmp $11
... Forward A = $0
... Alub = $0 = [$11]
IX: bne $fe
AD: plp
DE: php
FE: $f100:
------------- 500000 --------------
WB: and #$c3
EX: cmp $11
DA: bne $fe
IX: plp
AD: php
DE: lda $0d
FE: $f100: