On another note, I mentioned earlier that it may be possible to optimize JMP <abs> instructions out of the pipeline using zero-cycle jumps. This is a relatively straightforward change whereby the btb entry for a JMP is applied to the instruction preceding it, rather than to the JMP itself. The CPU will then skip the JMP in future encounters and jump from the preceding instruction directly to the target address of the JMP. (Recall that the Branch Target Buffer stores the address of the Next PC for the Fetch stage.) The JMP instruction itself is never executed, so the jump completes in zero cycles.
There is one caveat to this mechanism. An unconditional JMP btb entry cannot be applied to the preceding instruction if that instruction requires a btb entry of its own (i.e., the instruction is a branch, an RTS, or an indirect jump). It turns out that about half of the executed JMP instructions in the figFORTH code originate from a single JMP -- the one in the NEXT routine. As (bad) luck would have it, this JMP is preceded by a conditional branch (the BCC that Dr Jefyll refers to above). This JMP will therefore be optimized out only about once every 128 iterations, when the preceding branch is not taken.
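To make the mechanism concrete, here is a hedged sketch in Python of how the btb training might work. This is an illustrative model only, not the actual simulator source; the names (BTB, train_jmp, note_control_flow) are my own assumptions.

```python
# Toy model of zero-cycle jump training in a Branch Target Buffer (BTB).
# Illustrative only -- not the actual simulator implementation.

class BTB:
    def __init__(self):
        self.entries = {}   # fetch PC -> Next PC for the Fetch stage
        self.owned = set()  # PCs that need a btb entry of their own
                            # (branches, RTS, indirect jumps)

    def note_control_flow(self, pc):
        """Mark an instruction (branch/RTS/indirect jump) as owning its btb slot."""
        self.owned.add(pc)

    def train_jmp(self, jmp_pc, target, prev_pc):
        """On seeing an unconditional JMP <abs>, try to attach its target to the
        *preceding* instruction so future fetches skip the JMP entirely."""
        if prev_pc in self.owned:
            # Caveat: the preceding instruction needs its own entry (e.g. the
            # BCC before the JMP in NEXT), so the JMP cannot be skipped here.
            self.entries[jmp_pc] = target   # fall back: redirect at the JMP itself
            return False
        self.entries[prev_pc] = target      # zero-cycle: prev_pc jumps straight to target
        return True

    def next_pc(self, pc, fallthrough):
        return self.entries.get(pc, fallthrough)


btb = BTB()
# Ordinary case: preceding instruction is a plain INX -> JMP gets skipped.
btb.train_jmp(jmp_pc=0x0246, target=0x0244, prev_pc=0x0245)
print(hex(btb.next_pc(0x0245, 0x0246)))   # fetch redirects from the INX itself

# Caveat case: preceding instruction is a branch that owns its btb slot.
btb.note_control_flow(0x0300)
print(btb.train_jmp(jmp_pc=0x0302, target=0x0244, prev_pc=0x0300))  # False
```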
Even so, the change will reliably optimize the other half of the JMPs, which is beneficial. Let's look at the impact on the Prime Sieve run:
Code:
fig-FORTH 1.0
100 sieve Primes: 2 3 5 7 11 13 17 19 23 29 31 37 41 43 47 53 59 61 67 71 73 79 83 89 97 OK
------------ Run Stats ------------
Dual Port Caches: True
btb Enabled: True
rtb Enabled: True
ind Enabled: True
jmp zero Enabled: True
Branch Prediction: TOUR 2-bit
Run: ForthPrimeJMP.65b
- - - - - - - - - - - - - - - - - -
Cycles: 290000
NMOS Cycles: 808472
- - - - - - - - - - - - - - - - - -
Instructions: 236473 (82%)
Data Stalls: 27748 (10%)
Control Stalls: 5532 (2%)
Delay Slots: 20112 (7%)
Generated: 136 (0%)
Total: 290000 (100%)
- - - - - - - - - - - - - - - - - -
Data Fwd: 129143 (45%)
Operand Fwd: 6529 (2%)
- - - - - - - - - - - - - - - - - -
JMP Executed: 8885 (4%)
JMP Skipped: 8747 (4%) <-- note the "skipped" jumps here
- - - - - - - - - - - - - - - - - -
Branches: 21874 (9%)
Taken: 20548 (94%)
Mispredictions: 1383 (6%)
- - - - - - - - - - - - - - - - - -
bCache Access: 3365 (1%)
aCache Conflicts: 324 (0%)
----------- Performance -----------
CPI: 1.18
NMOS CPI: 3.30
Boost Factor: 2.8x
As we can see, just about half of the JMP instructions were successfully "skipped" in this run. These skipped JMPs have legitimately completed, though, so their count is added to the total of completed instructions when calculating CPI. As a result, CPI drops from 1.22 in the previous run to 1.18 in this run -- a meaningful improvement.
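The arithmetic behind that CPI figure can be checked directly from the stats above (assuming, as described, that skipped JMPs are credited as completed instructions):

```python
# CPI from the run stats above, crediting skipped JMPs as completed instructions.
cycles       = 290_000
instructions = 236_473
jmp_skipped  = 8_747

cpi = cycles / (instructions + jmp_skipped)
print(f"CPI: {cpi:.2f}")                      # -> CPI: 1.18

# Without crediting the skipped JMPs, this run's figure would look worse:
print(f"CPI (uncredited): {cycles / instructions:.2f}")   # -> CPI (uncredited): 1.23
```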
Interestingly, we also see that Data Stalls have increased slightly, from 9% to 10%. This is because some of the eliminated jumps sat between instructions that share a mutual dependency. With the jumps gone, those instructions now execute back-to-back in the code stream and trigger a stall as a result. In effect, the JMP is immediately replaced by a Data Stall in these instances. Thankfully, this happens only rarely.
For a bit of fun, it's pretty magical to see the skipped JMPs disappear from the pipeline. Note the jmp $0244 in the cycle below (between the inx and ldy #1 instructions):
Code:
------------- 962 --------------
WB: adc $03,x
EX: sta $03,x
DA: inx
IX: inx
AD: jmp $0244
DE: ldy #$1
FE: $248: 85 b2 88 b1 - a0 1 b1 ae
And now, we run the same test with zero-cycle jumps enabled, and ... Shazzam!
Code:
------------- 946 --------------
WB: adc $03,x
EX: sta $03,x
DA: inx
IX: inx
AD: ldy #$1
DE: lda ($ae),y
FE: $24c: 85 b2 88 b1 - ae 85 b1 18
... the jmp is gone (note that inx is now followed immediately by ldy #1). Neat.
Cheers for now,
Drass