6502.org Forum  Projects  Code  Documents  Tools  Forum
It is currently Sun May 05, 2024 1:35 pm

All times are UTC




Post new topic Reply to topic  [ 89 posts ]  Go to page Previous  1, 2, 3, 4, 5, 6  Next
Author Message
 Post subject: Re: Pipelining the 6502
PostPosted: Sun Dec 01, 2019 2:55 am 
Offline
User avatar

Joined: Sun Oct 18, 2015 11:02 pm
Posts: 428
Location: Toronto, ON
For an illustration, here is an annotated trace of a mis-predicted branch ...
Code:
------------- 821 --------------
WB: lda #$a
EX: bne $52     <— This branch is mis-predicted
DA: lda $0e
... Alub = $0 = [$e]
IX: cmp $10
AD: bcc $05
DE: jsr $c892
... Branch Taken: PC = $c892
FE: $c8a8: 64 b0 fc 49 - d0 27 38 e5
------------- 822 --------------
WB: bne $52     <— mis-prediction detected, PC set to the alternate path, SP restored
... Bad Branch: PC = $c8ed, SP = $fd
EX: ---         <— Flushed instructions
DA: ---                 
IX: ---
AD: ---
DE: ---
FE: $c890: a0 2 a9 d - d0 27 38 e5  <— Code buffer data is invalid
------------- 823 --------------
WB: ---
EX: ---
DA: ---
IX: ---
AD: ---
DE: ---
FE: $c8ec: a0 2 a9 d - 3f c9 20 90  <— Code buffer word loaded from the new PC address
------------- 824 --------------
WB: ---
EX: ---
DA: ---
IX: ---
AD: ---
DE: cmp #$20                        <— instruction decoded from the correct path
FE: $c8f0: 19 48 a5 f - 3f c9 20 90 <— Next word from I-Cache loaded into Code Buffer
------------- 825 --------------
WB: ---
EX: ---
DA: ---
IX: ---
AD: cmp #$20
DE: bcc $19     <— execution continues on the right path
FE: $c8f4: 19 48 a5 f - d0 a a5 e

_________________
C74-6502 Website: https://c74project.com


Top
 Profile  
Reply with quote  
 Post subject: Re: Pipelining the 6502
PostPosted: Mon Dec 02, 2019 1:48 pm 
Offline
User avatar

Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10797
Location: England
Lazy question: does the 2-bit prediction also use the branch direction as a hint?

(Indeed, I wonder if the branch type: Z, C, V, N could also be taken into account - you could perhaps have 8 prediction tables, if I've not completely misunderstood how prediction might work.)


Top
 Profile  
Reply with quote  
 Post subject: Re: Pipelining the 6502
PostPosted: Mon Dec 02, 2019 9:34 pm 
Offline
User avatar

Joined: Sun Oct 18, 2015 11:02 pm
Posts: 428
Location: Toronto, ON
Well, I’m hardly an expert as I just learned about this myself! That said, I decided the model would refer to the direction of the branch as a hint for the first prediction (the first time we encounter the branch). Since we have no better information at that point, an initial static prediction makes sense. A backward branch is started with a “strongly-taken” prediction, forward branches with a “strongly not-taken” prediction. Thereafter, we use only the recorded 2-bit counter to try to dynamically adapt to the branches actual behaviour.

It’s an interesting idea to look at the type of branch — perhaps there is some correlation between the branch type and its behaviour. As it happens, the current model uses the PC address so there is only one table and we identify branches by their address.

_________________
C74-6502 Website: https://c74project.com


Top
 Profile  
Reply with quote  
 Post subject: Re: Pipelining the 6502
PostPosted: Mon Dec 02, 2019 10:33 pm 
Offline
User avatar

Joined: Wed Mar 01, 2017 8:54 pm
Posts: 660
Location: North-Germany
Hi Drass,

thank you for your explanations! My last question
Drass wrote:
Quote:
Can you add a stat that tells how many times a jump or branch hits a pipeline that isn't fully established? Or can this count be deducted from the given ones? Somehow I think this does happen during the Benchmark (and other BASIC progs) frequently.
I think you are referring to the number of times we have to reload the pipeline, which is the mis-prediction count above.
was to get an idea of how many times a second branch/jump/return/break does occur while the pipe is resurrected. So the mis-prediction already has happened and another is imminent.

If this is a rare case (of course it could be constructed to happen frequently) a second pipeline could process the alternate way. Once it is known which choice actually was taken you could continue with the "right" pipe sparing the other for the next branch. (I think feeding two pipes with information from different origins might stress the caches and/or memory interface.)

Drass wrote:
Just for fun, I looked at other tabulated results reported on Arne’s benchmark. BillO's Jaguar homebrew 65C02 SBC, for example, ran this test in 1.46 seconds. It will be fun (eventually) to see how a pipelined 6502 might do against some of the faster entries on this scoreboard.
The Jaguar figure is impressive but actually exactly the same figure that my 2 MHz check on the 6502-badge delivers if you consider the clock speed. More intersting would be to have the required RAM/ROM images to simulate the performance improvement and CPI figure for the Durex forth approaches of Jeff_Birt and/or the Plasma2.0 (JIT) approach from Dave (resman).

Perhaps the VTL02C interpreter is worth another verification. As all of these are different languages but interpreted (well Plasma can be tuned a little) it would be interesting whether the CPI figures are similar or not.


Cheers,
Arne


Top
 Profile  
Reply with quote  
 Post subject: Re: Pipelining the 6502
PostPosted: Mon Dec 02, 2019 11:39 pm 
Offline
User avatar

Joined: Fri Aug 30, 2002 1:09 am
Posts: 8432
Location: Southern California
We have a topic discussing branch prediction on AnyCPU at http://anycpu.org/forum/viewtopic.php?f=3&t=612 which Ed started after I had written in another topic, "I have wondered why not let the programmer tell it if it's likely to branch or not. He always knows the goal of the particular part of the code, and what it's used for, unlike even the smartest compilers. Then the processor could use information from the programmer to prepare the most efficient way to do what he says will usually happen."

_________________
http://WilsonMinesCo.com/ lots of 6502 resources
The "second front page" is http://wilsonminesco.com/links.html .
What's an additional VIA among friends, anyhow?


Top
 Profile  
Reply with quote  
 Post subject: Re: Pipelining the 6502
PostPosted: Tue Dec 03, 2019 4:21 am 
Offline
User avatar

Joined: Sun Oct 18, 2015 11:02 pm
Posts: 428
Location: Toronto, ON
GaBuZoMeu wrote:
a second pipeline could process the alternate way. Once it is known which choice actually was taken you could continue with the "right" pipe sparing the other for the next branch. (I think feeding two pipes with information from different origins might stress the caches and/or memory interface.)
It’s very good idea, which I think is called speculative execution. It requires significant resources of course, but I think it has been done in commercial processors. It’s not something I’m prepared to attempt at the moment. I’m focusing instead on trying to improve the branch prediction accuracy, which feels more feasible.

Drass wrote:
The Jaguar figure is impressive but actually exactly the same figure that my 2 MHz check on the 6502-badge delivers if you consider the clock speed. More intersting would be to have the required RAM/ROM images to simulate the performance improvement and CPI figure for the Durex forth approaches of Jeff_Birt and/or the Plasma2.0 (JIT) approach from Dave (resman).

Perhaps the VTL02C interpreter is worth another verification. As all of these are different languages but interpreted (well Plasma can be tuned a little) it would be interesting whether the CPI figures are similar or not.
I’m certainly interested in testing with as many of these as are available. Is there a version of python that runs on a 6502?

_________________
C74-6502 Website: https://c74project.com


Top
 Profile  
Reply with quote  
 Post subject: Re: Pipelining the 6502
PostPosted: Tue Dec 03, 2019 5:11 am 
Offline
User avatar

Joined: Sun Jun 30, 2013 10:26 pm
Posts: 1928
Location: Sacramento, CA, USA
GaBuZoMeu wrote:
Perhaps the VTL02C interpreter is worth another verification.

VTL02C was coded to be compact, and achieves some of this by using "unstructured" techniques like branching to common code fragments whenever the opportunity presents itself. Since it doesn't use BRA (to maintain NMOS compatibility), I've a feeling that mis-predictions could factor in significantly.

_________________
Got a kilobyte lying fallow in your 65xx's memory map? Sprinkle some VTL02C on it and see how it grows on you!

Mike B. (about me) (learning how to github)


Top
 Profile  
Reply with quote  
 Post subject: Re: Pipelining the 6502
PostPosted: Tue Dec 03, 2019 7:04 am 
Offline
User avatar

Joined: Fri Dec 11, 2009 3:50 pm
Posts: 3352
Location: Ontario, Canada
Drass wrote:
Since we have no better information at that point, an initial static prediction makes sense.
Yup.
Quote:
A backward branch is started with a “strongly-taken” prediction, forward branches with a “strongly not-taken” prediction.
Huh? I would've thought weakly-taken/not-taken are better initial choices. The initial static prediction is little more than a guess. In a case where that guess gets contradicted by actual runtime results, shouldn't the latter should take precedence immediately? (Strongly taken/not-taken as the initial setting means the guess would have to be contradicted twice before the predictor changes its tune.) Am I overlooking something?

-- Jeff

_________________
In 1988 my 65C02 got six new registers and 44 new full-speed instructions!
https://laughtonelectronics.com/Arcana/ ... mmary.html


Top
 Profile  
Reply with quote  
 Post subject: Re: Pipelining the 6502
PostPosted: Tue Dec 03, 2019 12:26 pm 
Offline
User avatar

Joined: Sat Dec 01, 2018 1:53 pm
Posts: 727
Location: Tokyo, Japan
barrym95838 wrote:
VTL02C was coded to be compact, and achieves some of this by using "unstructured" techniques like branching to common code fragments whenever the opportunity presents itself. Since it doesn't use BRA (to maintain NMOS compatibility), I've a feeling that mis-predictions could factor in significantly.

I would think that it should be possible to predict very well for those techniques by keeping track of flag states that can be known through static analysis (just like a human 6502 programmer does).

If you see an instruction that clears or sets a flag, you know the setting until the next instruction that could affect that flag, so for a sequence like CLV; ...; BVC loop, so long as nothing in the `...` section is any of the instructions that could set the overflow flag (ADC, SBC, BIT)* you know that the BVC will always be taken. This can be a fair whack of instructions; looking through my recent code I have one section that does this with thirteen instructions between the CLV and BVC. (Pulling the CLV back that far was an optimization to keep it out of a loop.)

Further, if you see a branch based on a flag and it's not taken, you also know the value of that flag until the next instruction that can set it. So for example (from the same file linked above), after a loop:

Code:
10$:        sta [*buf0ptr],y
            dey
            bne 10$
            sta *curdigit       ; start with most signficant digit (at pos. 0)
            beq 33$             ; BRA; skip initial multiply of zeroed buffer

The BNE was not taken if you've reached the BEQ, and the itnervening STA does not affect the flags, so the Z flag must be set and the BEQ must always be taken.

_______________
* Well, yeah, assuming that the SOB pin isn't used in your sytem. It's probably worth having a user-settable flag for that because the O flag is changed by far fewer instructions than any other, so it seems to me the most likely to be used for BRA emulation, as in my example linked above. I couldn't have used N, Z or C flags there without pulling the flag set instruction into the loop because all of those flags can be changed by instructions in the loop.

_________________
Curt J. Sampson - github.com/0cjs


Top
 Profile  
Reply with quote  
 Post subject: Re: Pipelining the 6502
PostPosted: Tue Dec 03, 2019 6:33 pm 
Offline
User avatar

Joined: Sun Oct 18, 2015 11:02 pm
Posts: 428
Location: Toronto, ON
barrym95838 wrote:
VTL02C was coded to be compact ... I've a feeling that mis-predictions could factor in significantly.
I’m curious to try it. I’m still working on BASIC, and then will come FORTH and others. I’m hoping that after BASIC things will progress more quickly between tests (and of course hope springs eternal ... :) ).

_________________
C74-6502 Website: https://c74project.com


Top
 Profile  
Reply with quote  
 Post subject: Re: Pipelining the 6502
PostPosted: Tue Dec 03, 2019 10:25 pm 
Offline
User avatar

Joined: Sun Oct 18, 2015 11:02 pm
Posts: 428
Location: Toronto, ON
Dr Jefyll wrote:
I would've thought weakly-taken/not-taken are better initial choices.
Indeed, that’s where I started as well. After some testing, however, I wasn’t very sure anymore.

Consider for example a branch which alternates between not-taken and taken (as in NTNTNTN) ... If the initial setting is weakly-taken, then the prediction will change on every iteration, and will be wrong each time. On the other hand, if we start with a strongly-taken setting instead, then the prediction is never changed and we get 50% accuracy for this same sequence. For the reverse pattern (i.e., TNTNTNT), both weakly and strongly-taken yield 50% accuracy. At the other end of the spectrum, not-taken initial predictions yield the opposite behaviour — we get 50% accuracy for all but the combination of TNTNTNT and weakly-not-taken, which will be wrong every time. In short, the more emphatic predictions are preferable for these alternating patterns.

But that’s not always true. If the sequence starts with two consecutive branches in one direction and then alternates thereafter, then the weaker initial predictions are better. I went with strong predictions only because my early tests happened to show a bias towards that. But that does not mean it’s a good choice in the general case (nor is it a particularly bad choice for that matter).

In the end I convinced myself that if the prediction scheme is good, then the repeating behaviour of the branch should bias the prediction toward the right choice over time, and that’s more important than any one guess. I have since spent some time learning about other prediction schemes to try to better understand the tradeoffs. More on that later ... :)

_________________
C74-6502 Website: https://c74project.com


Top
 Profile  
Reply with quote  
 Post subject: Re: Pipelining the 6502
PostPosted: Tue Dec 03, 2019 11:02 pm 
Offline
User avatar

Joined: Sun Oct 18, 2015 11:02 pm
Posts: 428
Location: Toronto, ON
cjs wrote:
The BNE was not taken if you've reached the BEQ, and the itnervening STA does not affect the flags, so the Z flag must be set and the BEQ must always be taken.
This is a great example of two correlated branches. If you know the outcome of one, then you know the outcome of the other. And as I’ve learned, this particular property is known as global branch correlation, and is exploited by a Global History prediction schemes.

The basic idea is that the CPU keeps a Global History Register (GHR) which records the outcomes of all branches. A 1 is left-shifted into the register for a taken branch and a 0 for a not-taken branch. The width of the register is arbitrary, but intuitively we can see that a wider register will keep track of a longer history. The value of the register at any one time corresponds to the unique history we have accumulated to that point in the program stream.

Armed with the global history of branches, we can use that to help us predict what will happen next. The accepted mechanism is to maintain a table of 2-bit prediction counters in a Pattern History Table (PHT). We then use the GHR to index into this table to fetch a prediction. The 2-bit prediction counters behave as usual, but instead of being associated with a particular branch, they are instead associated with a particular history. We are essentially saying something like this:

Last time I saw a TNTT sequence, the next branch was taken. Now that I am seeing that the same sequence again, I will predict that the next branch will also be taken.

It should be clear that this prediction scheme is likely to do much better at identifying the alternating patterns we discussed above, the ones that defeated the 2-bit branch prediction.

So, in 2-bit branch prediction we have a table of 2-bit counters and use PC to index into it to retrieve a prediction. The prediction is associated with a specific branch. The problem is this misses patterns in the relationship of a given branch with the outcome of other branches, and patterns in the history of the branch itself. The Global History scheme captures this nuance beautifully.

One obvious question is how wide to make the register. Wider registers keep track of more history, but also make the PHT bigger (and harder to train). It’s tradeoff which I suspect is best resolved in practise, as we will see.

More to come ... :)

_________________
C74-6502 Website: https://c74project.com


Top
 Profile  
Reply with quote  
 Post subject: Re: Pipelining the 6502
PostPosted: Wed Dec 04, 2019 5:03 am 
Offline
User avatar

Joined: Sat Dec 01, 2018 1:53 pm
Posts: 727
Location: Tokyo, Japan
Oh, wow, that global history thing is brilliant. Still, this seems different from the static analysis in that the static analysis doesn't rely on history at all but just gives you the correct prediciton 100% of the time for the particular instructions to which it applies, right?

It would be interesting to know if, once one has global history, how useful static analysis would be as an addition to that. Does global history get enough right on most workloads (and, in particular, on VTL02C :-)) that static analysis adds nothing to do that? Or does static analysis perhaps allow some savings in global history memory by allowing you to avoid recording certain branches that always have the same result?

_________________
Curt J. Sampson - github.com/0cjs


Top
 Profile  
Reply with quote  
 Post subject: Re: Pipelining the 6502
PostPosted: Wed Dec 04, 2019 10:09 pm 
Offline
User avatar

Joined: Sun Oct 18, 2015 11:02 pm
Posts: 428
Location: Toronto, ON
cjs wrote:
It would be interesting to know if, once one has global history, how useful static analysis would be as an addition to that. Does global history get enough right on most workloads (and, in particular, on VTL02C :-)) that static analysis adds nothing to do that? Or does static analysis perhaps allow some savings in global history memory by allowing you to avoid recording certain branches that always have the same result?
I’m with you on this cjs, and thanks for the suggestion.

One thing to note is that always-taken BRA like branches are handled reasonably well by 2-bit prediction. If the initial assumption is wrong (as it would be for a forward branch), then it will be quickly corrected by the consistent behaviour of the branch thereafter. We therefore get at most two mis-predictions in the life of the branch, which is tolerable in the context of many iterations. (Consider that we are mainly concerned with branches which occur many times in the regular course, since those that don’t also won’t impact performance much either).

Still, we can try to do better. In general, we can observe that if the value of a flag is not changed by an instruction currently in the pipeline, then the flag is an excellent predictor of a branch which tests that flag. Conversely, all bets are off otherwise. Intuitively, it's not likely that the Z flag and N flags will survive very long in the pipeline unchanged, but the C and V flags should fare better. Still, it's hard to know how often things will play out favourably. We could build logic to check, of course, but as is usual in the pipeline, it's best to just go ahead and take our chances. So, rather than slowing down to make sure, we just predict based on the flag's value and move on.

As you suggest, this Flags predictor is probably good as a supplement to a main predictor. In effect, we run two predictions in parallel, and then track which delivers the more accurate results for a given branch. The Flags predictor will easily beat other approaches for those branches where the tested flag remains unchanged. Conversely, if there is no relationship between the eventual outcome of the branch and the value of the flag at the time we sample it, then this predictor will be duely demoted and quickly ignored. Either way, the overall prediction accuracy benefits from the combination.

And once again there is a general approach here, the so-called "tournament" or "hybrid" predictors. The observation is that different branches are better predicted differently, and there is no "best" scheme for all cases. So we run several predictors concurrently and pick a winner for each branch based on its specific dynamic behaviour. The "tournament" method learns soon enough what works best for each branch, and overall accuracy improves as a result. Moreover, we also adapt very well should circumstances change. All in all, it's an excellent approach.

But of course nothing's free in this world. A tournament predictor implies a full implemention of several prediction schemes at once, as well as the logic to choose between them. This adds complexity, hardware and delay, so it's hard to know whether it's justified here. But we should know better soon enough. I'm currently running some tests to improve the pipeline's main predictor. Thereafter we will be in a better position to asses other supplementary approaches.

Once again, thanks cjs for these excellent suggestions and dialogue. I'm sure the end result will be much better for it.

Cheers for now,
Drass

_________________
C74-6502 Website: https://c74project.com


Top
 Profile  
Reply with quote  
 Post subject: Re: Pipelining the 6502
PostPosted: Thu Dec 05, 2019 11:58 pm 
Offline
User avatar

Joined: Wed Mar 01, 2017 8:54 pm
Posts: 660
Location: North-Germany
As I C it there are four different flavors of branches to deal with:

The first flavor is fully predictable (like CLV ... BVC and "correlated" branches) by tracking a known state of a flag down to the branch. These predictions require some efforts - various triggers like "CLV" or LD* #* and for each trigger a corresponding tracker ...

The second flavor are "loop" branches that usually were taken and occasionally not - here only a minor percentage of mispredictions occur so it doesn't matter. These were well covered by the tiny branch prediction you have already established.

The third flavor are "pattern" branches that mostly could be predicted by using a GHR and a PHT. Well, I assume quite a lot of housekeeping to do (and I could hardly imagine the amount of additional hardware). The drawback me think is that these methods requires a sort of training until they show success and obviously there is always a chance that the outgoing prediction is wrong.

The last flavor are "data related" branches as shown in the multiplication example. I would consider them as not predictable.

If all of these predictors were implemented a sort of prediction selector is required too. :shock:

I cannot but think this would be a huge effort to avoid pipeline stalls due to mispredicted branches.


On the other hand "speculative execution" would not require any branch prediction at all. And as long as there is only one branch in the pipeline at a time two pipelines would be sufficient. Once there are two or more branches down their way through the pipeline(s) things get ugly. Either there are provisions to establish one or two more pipelines (and all the stuff they require :shock: ) or you are facing the imminent "desaster".

To get an idea of how frequently two (or more) branches are simultaneously in one pipe running the given examples so far, I was asking for another counter. I assume that this situation (more than one branch in the pipe) occurs less frequently than mispredicted branches. Thus in turn would mean a second pipeline would result in a better CPI than branch prediction(s).


Cheers,
Arne


Top
 Profile  
Reply with quote  
Display posts from previous:  Sort by  
Post new topic Reply to topic  [ 89 posts ]  Go to page Previous  1, 2, 3, 4, 5, 6  Next

All times are UTC


Who is online

Users browsing this forum: No registered users and 15 guests


You cannot post new topics in this forum
You cannot reply to topics in this forum
You cannot edit your posts in this forum
You cannot delete your posts in this forum
You cannot post attachments in this forum

Search for:
Jump to: