 Post subject: Re: Pipelining the 6502
PostPosted: Fri Dec 06, 2019 4:55 pm 

Joined: Sun Oct 18, 2015 11:02 pm
Posts: 428
Location: Toronto, ON
GaBuZoMeu wrote:
To get an idea of how frequently two (or more) branches are simultaneously in the pipe while running the examples given so far, I was asking for another counter.
Ah! Gotcha. Thanks for clarifying. From having looked at execution traces, I would say we do encounter multiple branches in the pipeline at any one time with some frequency. Comparing a byte to several values in quick succession is one such example, but there are others. I’ll see about adding a counter to find out how it plays out.

Great analysis above regarding branch types, thank you — a very useful perspective. One comment is that “data” branches may not always be entirely unpredictable ... sometimes they do break one way or the other reflecting some regularity in the data. But I definitely take your point on this.

What a fascinating subject! :)

Cheers.

_________________
C74-6502 Website: https://c74project.com


 Post subject: Re: Pipelining the 6502
PostPosted: Fri Dec 06, 2019 6:03 pm 

Joined: Wed Mar 01, 2017 8:54 pm
Posts: 660
Location: North-Germany
Drass wrote:
Is there a version of Python that runs on a 6502?
My mind is blank :(

Drass wrote:
One comment is that “data” branches may not always be entirely unpredictable ... sometimes they do break one way or the other reflecting some regularity in the data.
I'm with you - “entirely unpredictable” is somewhat exaggerated. I just wanted to underline how little chance there is to apply prediction strategies to them.

Drass wrote:
What a fascinating subject! :)
Indeed! :)


Cheers,
Arne


 Post subject: Re: Pipelining the 6502
PostPosted: Sun Dec 08, 2019 8:38 am 

Joined: Sun Oct 18, 2015 11:02 pm
Posts: 428
Location: Toronto, ON
Before finally zeroing in on branch prediction, I wanted to give some attention to inefficiencies with JSR, RTS and indirect jumps. The BASIC interpreter has lots of these, and the latest tests have put their shortcomings in sharp relief. JSR and RTS erode CPI because each is executed as multiple micro-instructions. Indirect jumps, on the other hand, trigger two Delay Slots on every execution. We can do better on both fronts.

One simple enhancement is to combine micro-instructions together where possible. For JSR, pushing the hi-byte of the return address and jumping to the target address can be done in a single step. A second micro-instruction is still required to push the return address low-byte onto the stack, but that can occur without penalty during the Delay Slot which follows the jump. Condensing further would require multi-byte writes to Data Memory, which is something I'd rather avoid. (The writes are simple enough to buffer, but the hazard logic would have to grow significantly.) Let's leave that be for the moment and go on to the next item.

RTS is executed as a four-step sequence: increment SP by two, fetch the return address, add one to it, and jump to it. The vast majority of RTS instructions merely pop from the stack an address which was pushed by a JSR sometime earlier. If we can keep that address handy, then RTS can execute in a single step. Here's how:
  • On the first encounter with a specific RTS instruction, we make an entry in the btb to flag it as such.
  • On a JSR, we write the return address directly to a sixteen-entry Return Target Buffer (rtb) indexed by SP[4..1] (i.e., SP/2 & $f). The rtb is wide enough to store a complete address as a single word.
  • In the FE stage, we consult the btb and the rtb at the same time. If the btb entry indicates that the current instruction is an RTS, then we use the address from the rtb as a prediction and jump to it.
  • In the AD stage, we fetch the actual return address from the normal stack.
  • In the IX stage, we increment the return address and, if there was no btb entry for this RTS, we jump to it.
  • In the DA stage, we validate the predicted address by comparing it with the fully resolved return address from the stack. If they don't match, we flush the pipeline and jump to the correct address.

The net effect of all this is that RTS instructions will complete in one cycle (once they have acquired a btb entry). If the program modifies the stack directly or we exceed the sixteen-entry limit, then the affected RTS instructions will incur a four-cycle penalty. But this should happen only rarely. (Many thanks to Dr Jefyll for sorting me out on a couple of aspects of this mechanism).
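
For the curious, here is a rough Python sketch of the bookkeeping involved. This is purely illustrative (not the simulator code); the exact SP timing is glossed over, and I assume the rtb holds final return targets with the +1 adjustment already folded in:
Code:

# Illustrative sketch of the btb/rtb return-address mechanism.
# Assumption: rtb holds final return targets (the +1 already applied);
# flush_pipeline_and_jump() is a stand-in for the real recovery logic.

RTB_SIZE = 16
rtb = [0] * RTB_SIZE      # predicted return addresses
btb_is_rts = {}           # PC -> True once an RTS has been seen there

def rtb_index(sp):
    return (sp >> 1) & 0xF           # SP[4..1], i.e. SP/2 & $f

def on_jsr(sp, return_target):
    rtb[rtb_index(sp)] = return_target   # record target for a later RTS

def on_fetch(pc, sp):
    # FE stage: consult btb and rtb together; predict if flagged as RTS
    if btb_is_rts.get(pc):
        return rtb[rtb_index(sp)]    # jump to the predicted target now
    return None                      # unflagged: resolve in IX as before

def on_resolve(pc, predicted, actual):
    # DA stage: validate the prediction against the real stack contents
    btb_is_rts[pc] = True            # flag this RTS for next encounter
    if predicted is not None and predicted != actual:
        flush_pipeline_and_jump(actual)   # mispredicted: flush and redirect

def flush_pipeline_and_jump(addr):
    pass                             # stand-in only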

Finally, we have indirect jumps to deal with. The BASIC interpreter uses indirect jumps for vectored I/O. Although the jumps are indirect, the target address in fact seldom changes. A simple optimization is to predict the target address for indirect jumps when we see it is the same for two consecutive iterations. As usual, we validate the prediction by comparing it to the fully resolved address when it’s available. If the prediction fails, we correct the jump and disable predicting it until the target address is once again unchanged for two consecutive iterations.
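
In pseudo-Python, the two-in-a-row rule might look like this (again, just an illustration):
Code:

# Illustrative last-target predictor for indirect jumps.
last_target = {}    # jump PC -> most recently resolved target
confident = {}      # jump PC -> target unchanged two runs in a row?

def predict(pc):
    # FE stage: only predict once the target has repeated
    return last_target.get(pc) if confident.get(pc) else None

def update(pc, resolved):
    # Once the indirect target is fully resolved: validate and re-arm
    confident[pc] = (last_target.get(pc) == resolved)
    last_target[pc] = resolved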

Alright, with JSR, RTS and JMP (abs) thus improved, CPI for the BASIC Benchmark dropped significantly, from 1.42 to 1.30, and performance was enhanced from 2.3x to 2.5x — certainly a worthwhile improvement (see stats report below).

And this now sets the stage nicely for branch prediction — the main remaining inefficiency is a 15% mis-prediction rate which manifests as Control Stalls on 11% of cycles and Delay Slots on 7% of cycles. We return to that next.

Cheers for now,
Drass

Code:

------------ Run Stats ------------
Dual Port Caches:  True
btb Enabled:       True
rtb Enabled:       True
ind Enabled:       True
Branch Prediction: 2-bit
Run: BasicBenchmark.65b
- - - - - - - - - - - - - - - - - -
Cycles:         9000000
NMOS Cycles:    22235056
- - - - - - - - - - - - - - - - - -
Instructions:   6941299   (77%)
Data Stalls:     238725   (3%)
Control Stalls:  945948   (11%)
Delay Slots:     632116   (7%)
Generated:       241913   (3%)
Total:          9000000   (100%)
- - - - - - - - - - - - - - - - - -
Data Fwd:       2240865   (25%)
Operand Fwd:     256221   (3%)
- - - - - - - - - - - - - - - - - -
Branches:       1539697   (22%)
Taken:           779351   (51%)
Mispredictions:  236487   (15%)
- - - - - - - - - - - - - - - - - -
bCache Access:    56744   (1%)
aCache Conflicts: 62369   (1%)

----------- Performance -----------
CPI:               1.30
NMOS CPI:          3.20
Boost Factor:      2.5x

_________________
C74-6502 Website: https://c74project.com



 Post subject: Re: Pipelining the 6502
PostPosted: Sun Dec 08, 2019 1:45 pm 

Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10793
Location: England
Very interesting! Can you see how things vary with rtb size? 8, 16, 24 (or perhaps 8, 16, 32)


 Post subject: Re: Pipelining the 6502
PostPosted: Mon Dec 09, 2019 11:03 am 

Joined: Sun Oct 18, 2015 11:02 pm
Posts: 428
Location: Toronto, ON
BigEd wrote:
Very interesting! Can you see how things vary with rtb size? 8, 16, 24 (or perhaps 8, 16, 32)
What a great question. The sweet spot for the size of the rtb might surprise you (it certainly surprised me). It's quite small.

Here is a table of results:
Code:
rtb     Instructions     Delay Slots     CPI       Boost
0           72%              10%         1.38       2.3x
4           76%               8%         1.31       2.5x
8           77%               8%         1.30       2.5x
16          77%               8%         1.30       2.5x
256         77%               7%         1.30       2.5x

Just 4 entries makes a big difference! The fact is that the BASIC interpreter spends most of its time within a fairly shallow call tree, and a small table captures it well. The gains beyond that are marginal.

I seem to recall reading something that Garth wrote in which he mentions that most programs require relatively little stack space. This seems to bear that out. Compiled code which uses stack frames to pass arguments might be an exception. One possible change in that case is to keep a separate register for the rtb index so only return addresses consume rtb space. There is a tradeoff in that the rtb becomes a little more sensitive (manually pushing a return address on the stack will invalidate the whole rtb, not just one entry), but this may be sensible nevertheless. We'll see.

I am using 16 entries for testing but my hope is that we can get away with a much smaller rtb in the implementation.

_________________
C74-6502 Website: https://c74project.com


 Post subject: Re: Pipelining the 6502
PostPosted: Mon Dec 09, 2019 11:11 am 

Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10793
Location: England
That's a good result! I think it's a generally good idea, when you consider the benefit of a mechanism, to see how the size of the mechanism trades off. In the world of chip design, silicon area is paramount (or it was in my day) and so you'd always want to get a good deal out of the area you spend. And if an rtb is smaller, perhaps that leaves room for a larger btb, or whatever.


 Post subject: Re: Pipelining the 6502
PostPosted: Mon Dec 09, 2019 7:20 pm 

Joined: Fri Aug 30, 2002 1:09 am
Posts: 8427
Location: Southern California
Drass wrote:
I seem to recall reading something that Garth wrote in which he mentions that most programs require relatively little stack space. This seems to bear that out. Compiled code which uses stack frames to pass arguments might be an exception.

I've only been mildly following this topic; but yes, the above is on page 16 of the 6502 stacks treatise, titled "Does the 6502 have enough stack space?" It's one of the shorter pages of the treatise. There are exceptions of course, especially in higher-level languages, and it mentions that. The '816 has a 16-bit stack pointer, plus instructions and stack-relative addressing modes, which make large stack frames much more practical than they are on the '02 (regardless of whether or not the code using these is compiled from an HLL).

_________________
http://WilsonMinesCo.com/ lots of 6502 resources
The "second front page" is http://wilsonminesco.com/links.html .
What's an additional VIA among friends, anyhow?


 Post subject: Re: Pipelining the 6502
PostPosted: Mon Dec 09, 2019 8:17 pm 

Joined: Wed Mar 01, 2017 8:54 pm
Posts: 660
Location: North-Germany
Yes, very interesting results!

Perhaps someone who deals with Forth could prepare a memory image of a Forth program for Drass? My fairly limited knowledge of Forth tells me that there are at least two variants (STC and DTC). So perhaps two images?


Although I believe Forth doesn't use the 6502 stack heavily, it would be good to know the required rtb depth.


Again, great work!


 Post subject: Re: Pipelining the 6502
PostPosted: Mon Dec 09, 2019 9:17 pm 

Joined: Fri Aug 30, 2002 1:09 am
Posts: 8427
Location: Southern California
GaBuZoMeu wrote:
Yes, very interesting results!

Perhaps someone who deals with Forth could prepare a memory image of a Forth program for Drass? My fairly limited knowledge of Forth tells me that there are at least two variants (STC and DTC). So perhaps two images?


Although I believe Forth doesn't use the 6502 stack heavily, it would be good to know the required rtb depth.

  • STC = subroutine-threaded code, basically compiled to machine language
  • DTC = direct-threaded code
  • ITC = indirect-threaded code, which is what figForth is, along with other people's improved versions
  • TTC = token-threaded code, which is the most memory-efficient but adds an extra step in the inner loop
  • (and I know I'm forgetting one)

Dr. Brad Rodriguez has a page on how various threading methods work, at http://www.bradrodriguez.com/papers/moving1.htm .

My tests show that Forth uses the 6502 hardware stack very roughly the same amount as the ZP data stack. Separating most data from the return addresses solves a lot of problems. It has only a very small handful of ZP variables.

_________________
http://WilsonMinesCo.com/ lots of 6502 resources
The "second front page" is http://wilsonminesco.com/links.html .
What's an additional VIA among friends, anyhow?


 Post subject: Re: Pipelining the 6502
PostPosted: Fri Dec 13, 2019 4:15 am 

Joined: Sun Oct 18, 2015 11:02 pm
Posts: 428
Location: Toronto, ON
I finally had a chance to run the test Arne suggested above. That is, to find out how many times we encounter a second branch when we have one already in the pipeline. This would inform the feasibility of pursuing both sides of a branch in order to avoid branch-related stalls. Here is the result for ehBasic and the BASIC Benchmark after 9 million cycles:
Code:
Cycles:         9000000
Instructions:   6939506   (77%)
Branches:       1539273   (22%)
Branch Forks:   1111193   (72%)
Taken:           779133   (51%)
22% of executed instructions were branches, and 72% of those had at least one other branch in the pipeline at the same time (labeled "Branch Forks" above). That's not great. It suggests branches in fact cluster together, and we would need to pursue several paths at once to make this work. Either that, or we would be forced to stall quite frequently in order to resolve into one path or another.

Sorry we didn't get a better answer Arne. I'm happy to try any other tests which might shed further light on this.

_________________
C74-6502 Website: https://c74project.com


 Post subject: Re: Pipelining the 6502
PostPosted: Fri Dec 13, 2019 10:01 am 

Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10793
Location: England
Interesting - that would argue for a shorter pipeline and better branch prediction. If only those were free choices! To put it another way, it argues against making a longer pipeline in an effort to improve clock speed, and it argues that spending resources on branch prediction is good value.

Presumably, in a model like this, it would be possible to dial the accuracy of branch prediction up and down without needing to build a mechanism? The branch predictor could see the future.


 Post subject: Re: Pipelining the 6502
PostPosted: Fri Dec 13, 2019 9:33 pm 

Joined: Wed Mar 01, 2017 8:54 pm
Posts: 660
Location: North-Germany
Wow, 72%!! I never thought it would happen so frequently :shock:

Indeed, that renders a dual pipeline nearly useless :( . It may also be counterproductive for predictions, not sure.

Hmm, perhaps the missing unsigned branches for <= (BCC & BEQ = BLS) and > (BCS & BNE = BHI) cause those high counts. Signed comparisons are even more complex...

Nevertheless thank you for these numbers, Drass!


Cheers,
Arne


 Post subject: Re: Pipelining the 6502
PostPosted: Sat Dec 14, 2019 12:41 am 

Joined: Sun Oct 18, 2015 11:02 pm
Posts: 428
Location: Toronto, ON
GaBuZoMeu wrote:
Wow, 72% !! Never thought that it might happen so frequently :shock:
I was surprised as well, but actually it makes sense. About 20% of executed instructions are branches, so that’s one in five. Assuming a fairly even distribution, odds are that there will be more than one branch in a seven-stage pipeline at any one time.
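
A quick back-of-the-envelope check bears this out (assuming branches land independently, which is of course only roughly true):
Code:

# A branch shares a 7-stage pipe with up to 6 other instructions.
# P(at least one of those is also a branch) at 1-in-5 branch density:
p = 0.20
print(1 - (1 - p) ** 6)   # ~0.74, close to the measured 72%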

Quote:
Nevertheless thank you for these numbers, Drass!
Happy to do it Arne. Regards.

_________________
C74-6502 Website: https://c74project.com


 Post subject: Re: Pipelining the 6502
PostPosted: Sat Dec 14, 2019 2:35 am 

Joined: Sun Oct 18, 2015 11:02 pm
Posts: 428
Location: Toronto, ON
Quote:
Presumably, in a model like this, it would be possible to dial the accuracy of branch prediction up and down
Yes, agreed. It’s fair to ask: what would it take to dial up accuracy from the current 85% to, say, 95%? Let's look at it ...

One shortcoming of the simple Global History scheme described above is that branches which share the same global history will also share a single prediction entry in the Pattern History Table (PHT). In other words, their results will step on each other and distort the prediction record. By contrast, our 2-bit prediction scheme maintains a unique prediction counter for every branch in the btb. Ideally, we want both: a global branch history to capture correlation *and* good differentiation between individual branches.

We can achieve this with a larger PHT and by using a combination of PC and the Global History Register (GHR) to index it. An 8-bit index, for instance, might use the low-order nibble from PC together with a 4-bit GHR. We would then have up to sixteen different prediction counters for each GHR value, and we can use the lower nibble of PC to select between them. This kind of scheme is appropriately referred to as "Global Select" or GSEL for short.
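
Concretely, the index is just bit-picking and concatenation, and the PHT entries are ordinary saturating 2-bit counters. Here is a minimal sketch of the 8-bit example above (the initial counter value is my assumption):
Code:

# Illustrative GSel(4,4): low nibble of PC concatenated with a 4-bit GHR.
pht = [1] * 256     # 2-bit saturating counters; start "weakly not-taken"

def gsel_index(pc, ghr):
    return ((pc & 0xF) << 4) | (ghr & 0xF)    # 8-bit PHT index

def predict(pc, ghr):
    return pht[gsel_index(pc, ghr)] >= 2      # 2 or 3 means "taken"

def train(pc, ghr, taken):
    i = gsel_index(pc, ghr)
    pht[i] = min(3, pht[i] + 1) if taken else max(0, pht[i] - 1)

def update_ghr(ghr, taken):
    return ((ghr << 1) | (1 if taken else 0)) & 0xF   # shift in outcome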

The tradeoff is the size of the PHT, as well as the proportion of bits used from PC and the GHR. Standard 2-bit prediction is effectively a special case of GSEL which uses a 16-bit all-PC index. We can "dial up" prediction by introducing bits from the GHR and playing with the blend. To that end, below is a graph showing the results of applying various combinations to the BASIC Benchmark test we have been running. GSel(6,4), for instance, uses six bits from PC and four bits from GHR to form a 10-bit index. It yields just over 85% accuracy, or slightly better than 2-bit prediction alone:
[Attachment: graph of prediction accuracy for various GSel(PC bits, GHR bits) combinations]
We can see from the graph that accuracy improves steadily as we introduce more bits from PC, reaching 90% with a 16-bit index at GSel(12,4). Shifting emphasis to the GHR thereafter pushes accuracy to 94% with GSel(6,10). We would need a wider index to better that, but thankfully there is an alternative. We can treat the PHT as a hash table rather than a sparse array. That is, rather than concatenating, we can XOR a full 16-bit PC and a 16-bit GHR to form a 16-bit index. This is the so-called "Global Share" scheme, or GSHR for short. It yields better accuracy for a given index width at the expense of an additional XOR gate delay. On the graph, we can see that GShr(16,16) gets us to our goal of 95% accuracy. This requires a PHT with 64k entries, which can be stored in 16k bytes.
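
The gshare variant changes only the index function; prediction and training stay the same:
Code:

# Illustrative GShr(16,16): XOR-hash the full PC with a 16-bit GHR.
def gshare_index(pc, ghr):
    return (pc ^ ghr) & 0xFFFF    # 16-bit index -> 64k two-bit counters

The XOR spreads each (PC, history) pair across the whole table instead of dedicating a fixed region to each history value, which is why it makes better use of a given index width.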

Now, let's stand back for a moment. The notion that anything can predict branches correctly 95% of the time is nothing short of amazing to me -- much less something as simple as what we have here: 2-bit counters, a Global History Register and a Pattern History Table. These are well-known techniques, but that in no way diminishes my astonishment. Wow!

Having said that, I do have some concerns. While a 16k PHT is not prohibitive, a quick search shows that much better results are possible with smaller PHTs. According to this presentation, some commercial gshare implementations use PHTs that are just 512 bytes (MIPS R12000). And this chart shows excellent results for PHTs below 1k:
[Attachment: chart of gshare predictor accuracy vs. PHT size, from the cited presentation]
Meanwhile, our results plummet below 1k, reaching just 70% at 64 bytes. By way of comparison, below is a graph showing gshare accuracy as a function of PHT size from our tests:
[Attachment: graph of gshare accuracy as a function of PHT size from our tests]
So, something seems off. It may be that ehBasic presents a unique challenge, or that I am doing something wrong. The proportion of branches executed seems pretty standard at about 20% of instructions, but perhaps it's something to do with the distribution of those branches in the code space. Or perhaps it’s the kinds of branches they are (i.e. "data related" branches, as Arne called them). Not sure.

At this point I'm very happy with the result, but not quite satisfied with the means to get there. I'm going to spend some time thinking about how best to investigate. In the meantime, all thoughts and suggestions welcome!

Cheers for now,
Drass

P.S. I hope these explanations are clear, as some of this gets pretty abstract. Please don’t hesitate to ask if anything is confusing.

_________________
C74-6502 Website: https://c74project.com


 Post subject: Re: Pipelining the 6502
PostPosted: Sat Dec 14, 2019 4:46 am 

Joined: Sun Dec 29, 2002 8:56 pm
Posts: 449
Location: Canada
For the rtf65004 (the micro-op 6502) the pattern history table is only 128 bytes (actually 512 two-bit entries), and it seems to come close to the results you’re getting. I think it depends a lot on encountering loops, which boost the apparent accuracy. I ran a Fibonacci calculation and got a relatively low accuracy percentage. Then again, running something like a memcpy() that loops hundreds of times is bound to increase the accuracy stats. For the BASIC interpreter there’s probably a lot of random branching, which would lower accuracy. Other programs might have different results. Larger tables in the predictor might help, but a small percentage gain might be offset by a lower fmax.
The lookup for prediction must be available within a clock cycle, which means using synchronous memories with registered outputs won’t work. That in turn means using the much smaller distributed memories in an FPGA.

_________________
http://www.finitron.ca

