Pipelining the 6502

GARTHWILSON · Post by **GARTHWILSON** » Sat Dec 14, 2019 5:51 am

I'll repeat my comment from the last page, since no one here has commented on it: Why not let the programmer tell it if it's likely to branch or not. He always knows the goal of the particular part of the code, and what it's used for, unlike even the smartest compilers. From what the programmer says is likely, the compiler sets or clears a bit in the instruction to tell the processor if it's likely to branch or not. At run time, the processor uses that information to prepare the most efficient way to do what the programmer says is most likely to happen. If the problem is that that would require breaking away from a standard like C, then it goes back to something I said earlier, that maybe it's time to do so.

barrym95838 · Post by **barrym95838** » Sat Dec 14, 2019 6:09 am

GARTHWILSON wrote:

... the compiler sets or clears a bit in the instruction to tell the processor if it's likely to branch or not. At run time, the processor uses that information to prepare the most efficient way to do what the programmer says is most likely to happen. If the problem is that that would require breaking away from a standard like C, then it goes back to something I said earlier, that maybe it's time to do so.

Maybe I'm not understanding the situation completely, but ISTM that you're asking for an instruction bit that simply can't exist unless you implement 9-bit bytes or something similar. You're instantly losing 65xx machine code compatibility no matter how you slice it, right?

GARTHWILSON · Post by **GARTHWILSON** » Sat Dec 14, 2019 6:23 am

barrym95838 wrote:

GARTHWILSON wrote:

... the compiler sets or clears a bit in the instruction to tell the processor if it's likely to branch or not. At run time, the processor uses that information to prepare the most efficient way to do what the programmer says is most likely to happen. If the problem is that that would require breaking away from a standard like C, then it goes back to something I said earlier, that maybe it's time to do so.

Maybe I'm not understanding the situation completely, but ISTM that you're asking for an instruction bit that simply can't exist unless you implement 9-bit bytes or something similar. You're instantly losing 65xx machine code compatibility no matter how you slice it, right?

Well, that's partly what I was wondering about, if the code has to be 100% compatible with earlier versions and software that's already out there. The requirements in the head post seem to mean that software will need to be re-written. I believe Rob had a post on the other forum that I didn't read thoroughly but seemed to suggest that another way to do it is to use a bit in the operand, meaning branches would have to be always even or always odd (meaning the compiler might have to pad code with a NOP once in a while), unless you cut the available branch distance in half (something Jeff and I have talked about, and it would still cover most situations), or add another operand byte and make it able to branch ±16K. Otherwise I have a hard time believing that any scheme could be devised that will predict the branching much better than just flipping a coin, or Ed's suggestion that reverse branches tend to be taken more than forward branches are.

Rob Finch · Post by **Rob Finch** » Sat Dec 14, 2019 9:09 am

It’s amazing what a modern branch predictor can achieve, but it ultimately requires a loop backwards so that the branch is repeated. I suspect for startup code where no branches are being repeated there isn’t a predictor with a lot of accuracy, so I wouldn’t worry about accuracy in those cases. The predictors really gain “accuracy” from branches that are repeated lots of times. Branch prediction really helps a lot with tight loops in a pipelined processor.

Quote:

Why not let the programmer tell it if it's likely to branch or not. He always knows the goal of the particular part of the code

There is a way in ‘C’.
The standard approach to specifying branch predictability in ‘C’ is to use __branch_likely() constructs in code (I hope I got that right IIRC).
I had adapted a ‘C’ compiler at one point to accept an optional second parameter in if () statements. The second parameter allowed the programmer to specify the likelihood of the if statement being true. That propagated out to the machine code as a bit setting in the branch instruction to set the branches’ predictability. So “if (a)” was a normal if with no prediction, “if (a;1)” meant predict true. “if (a;0)” meant predict false. I was going to extend the idea to allow specifying the branch predictor to for machines that have multiple predictors. So, “if (a;2)” would use an alternate predictor. The constant had to be a compile time constant since it gets encoded in a branch instruction.

Drass · Post by **Drass** » Sat Dec 14, 2019 9:19 am

Rob Finch wrote:

I think it depends a lot on encountering loops which boost the apparent accuracy. ... For the basic interpreter there’s probably a lot of random branching which would lower accuracy.

Thanks for this commentary Rob. Very helpful.

Yes, there are far fewer looping branches in the Basic interpreter than we might see in other code, so many branches will appear quite random. The fact that we do see good accuracy with longer histories suggests that there is regularity there, but just over many more iterations than is typical for looping branches.

I feel the Basic interpreter is an important target, so it might be that a larger predictor is a necessity here. But this does suggest that testing more broadly is important before making final decisions on the tradeoffs.

Quote:

using synchronous memories with registered outputs won’t work.

I was thinking one might use a “flow-through” synch RAM for this, like the GS880Z36CGT-250I. Maybe not the tidiest solution, but would that not work?

Drass · Post by **Drass** » Sat Dec 14, 2019 10:14 am

GARTHWILSON wrote:

Why not let the programmer tell it if it's likely to branch or not.

I’ve been thinking about this actually. The existing predictor does offer up an initial static prediction for every branch, namely Backward-Taken-Forward-Not-Taken. It would be beneficial to have this initial prediction be more sophisticated, including perhaps taking a hint from the programmer.

Quote:

Maybe I'm not understanding the situation completely, but ISTM that you're asking for an instruction bit that simply can't exist unless you implement 9-bit bytes or something similar. You're instantly losing 65xx machine code compatibility no matter how you slice it, right?

Yes, the question is how to provide the CPU additional information while still maintaining compatibility. One way I’ve been considering is to allow for some mechanism whereby certain internal structures, like the btb and the prediction table, are pre-loaded. This would essentially be metadata that the CPU can really benefit from. We could pass initial predictions for various branches, for example, whether those are provided by the programmer or otherwise arrived at by profiling the code and observing how certain branches actually behave — or perhaps a combination of these two where code is profiled and the programmer can override the automatic hints.

A related thought here is that the simulator might eventually do double duty as a static profiler/optimizer and generate metadata exactly for this purpose. It’s only a concept now, but probably worth mentioning given the discussion. One obvious challenge is how exactly to load this metadata into the CPU. Another is that performance during profiling will vary depending on the inputs to a given program, so one has use a “representative” input set. Probably easier said than done, but it’s interesting nevertheless.

Quote:

Well, that's partly what I was wondering about, if the code has to be 100% compatible with earlier versions and software that's already out there.

I have not quite given up on 100% compatibility. Self-modifying code is a pain, but certainly not an insurmountable problem. For example, the pipeline can check if any writes will modify instructions that are already loaded and flush them in that case. That would mean that self-modifying code would run slowly, but that’s better than not running at all. The profiler/optimizer might then be a means of “fixing” self-modifying code on an optional basis, perhaps by flagging it and offering btb entries which can be either dynamically or statically applied to the code. Again, only a concept now.

Quote:

It’s amazing what a modern branch predictor can achieve

I am becoming increasingly convinced of this myself. After all, the current optimizer does reach 95% accuracy on this test despite the challenging branches. Yes, it needs 16k bytes to do it, but that’s still a great result. At minimum, I would say the approach is promising, and perhaps can be made even better when supplemented by good static analysis as described above.

Quote:

I had adapted a ‘C’ compiler at one point to accept an optional second parameter in if () statements.

Really cool Rob. Thanks for mentioning it.

Rob Finch · Post by **Rob Finch** » Sat Dec 14, 2019 12:27 pm

Quote:

Quote:
using synchronous memories with registered outputs won’t work.
I was thinking one might use a “flow-through” synch RAM for this, like the GS880Z36CGT-250I. Maybe not the tidiest solution, but would that not work?

Depending on how the predictor is implemented it might need dual-ported ram. The prediction must be updated at writeback, but read and predicted during fetch/decode. I notice that ram is very fast. With fast enough ram and multiplexors it may be possible to clock the ram at twice or more the rate of the pipeline in order to get the access at the different stages. I was thinking in terms of an FPGA, but if using "real" hardware a flow-through ram would be fine.

Drass · Post by **Drass** » Sat Dec 14, 2019 7:34 pm

Rob Finch wrote:

With fast enough ram and multiplexors it may be possible to clock the ram at twice or more the rate of the pipeline in order to get the access at the different stages.

Indeed that’s what I’m contemplating ... running RAM at twice the pipeline clock-rate to time-multiplex dual ports.

MichaelM · Post by **MichaelM** » Sun Dec 15, 2019 2:39 pm

I've been following this discussion and that regarding the "micro op 6502" on anycpu.org.

One thing that I've noticed is that there does not seem to be any consideration given to the branch instruction itself except as a conditional branch or a subroutine call/return. In my limited experience in working with the 6502 instruction set, there seems to be commonly accepted ways that loops, if/then/else, switch statements, etc. are implemented. My observation is that the 6502 conditional branches are used in specific ways to implement these programming constructs.

By not including the conditional branch instruction in the branch predictor, there appears to be an implication that 6502 conditional branches are equally likely in any of these common programming constructs. It's been my experience in programming with the 6502 that its limited set of conditional branch instructions are commonly used in particular arrangements in order to implement each of the structured programming constructs that are commonly part of any program.

I think that it would be worthwhile investigating the frequency of occurrence of a particular conditional branch instruction in particular programming constructs. For the tests being conducted using EhBasic, I suspect that this should be possible using the bytecodes for the various keywords in the language. This information could then be used to determine the frequency of occurrence of particular conditional branch instructions, and also the most likely branch, i.e. taken or not taken, for the conditional branch instruction in each of these situations.

Reading some of the reference material on this subject leads me to conclude that the conditional branch instructions in a particular instruction set are assumed to be equally likely. Nothing that I've been able to read so far indicates that there is any consideration given to the use of particular conditional branch instructions in particular programming contexts. For example, a "while" loop is likely to use conditional branch instructions like bne, bcs, and bmi individually or in rapid combination more often than their complements: beq, bcc, and bpl.

For a particular instruction set, its conditional branch instructions are likely to be used in particular, repeatable ways to implement the various structured programming constructs commonly employed in High Level Languages (HLLs) like C, Pascal, and even Basic. The limited number of conditional branch instructions in the 6502 instruction set, and the fact that they only operate on a single flag at a time, is more likely to result in specific usage patterns when used to implement the commonly used programming structures.

In my M65C02A instruction set, I added approximately 24 conditional branch instructions. These additional conditional branch instructions perform many of the tests in a single instruction that typically require at least two, if not three, closely spaced 6502 branch instructions to implement. When adding this new conditional instructions to the code generator for my Pascal compiler, I found that I used both the original and the new conditional branch instructions in very particular patterns when implementing "for", "while", "repeat", and "case" statements.

Although a general solution is likely the sought after solution, I think that whatever solution is chosen for the pipelined 6502 project should be skewed toward the particular characteristics of 6502 programs. The nature of the 6502 is such that generalizations that apply to register rich CISC/RISC architectures are not likely to apply directly to the 6502. I've always been amazed by the programming idioms which apply well to the 6502, but not as well to other architectures.

Edit: added missing word -- seem -- to second paragraph.

barrym95838 · Post by **barrym95838** » Sun Dec 15, 2019 5:12 pm

Good point, Michael. It seems self-evident to me that BNE would naturally lean toward true and BEQ would naturally lean toward false, but it would take some further analysis to get a feel for the leanings of the others in real-world contexts.

Drass · Post by **Drass** » Sun Dec 15, 2019 10:16 pm

That’s a great idea! A 6502 specific predictor can certainly skew towards particular regularities in 6502 code if they can be identified.

Let’s have a look ...

These tables show the results of the Basic Benchmark, a simple FOR Loop, the Eratosthenes Sieve and the multi-byte multiply. In all but the Eratosthenes Sieve, there is a clear bias towards taking branches where the tested status flag is clear, and not taking branches where the flag is set (i.e. BNE is likely to be taken, and BEQ not taken — just as Mike observed).

I’ll want to sample more code just to be sure this holds, but it is encouraging to say the least.

Nicely done gents!

MichaelM · Post by **MichaelM** » Mon Dec 16, 2019 12:36 am

The larger application, the Basic Benchmark, appears to provide some generalizations that may be applied to the 8 different 6502 conditional branch instructions. The Sieve of Eratosthenes seems to throw a wrench into the rule for the BEQ instruction that seems to be indicated by the Basic Benchmark and the Basic For Loop.

There are, of course, studies for other conditional branch instruction combinations that you've not yet completed, but I am thinking that, for the 6502, the simple two bit saturating counter that you originally started with appears to be able to handle the case for the BEQ instruction when presented with Basic Benchmark in one test, and the Sieve of Eratosthenes in the another. If you extend your branch predictor logic to include some heuristic regarding the branch instruction at a particular address, then you could initialize the two bit saturating branch predictor to '00' or '11' based on the measured frequency for branch taken / not taken as your spreadsheet shows. A few mis-predictions would occur, but they would quickly move to one or the other extreme of the two bit saturating branch prediction counter and provide nearly 100% correct branch prediction. Thus, the simple two bit saturating counter branch predictor should prove to be very effective for the 6502. (If I recall, you've already proven this hypothesis in some of your early work on the branch predictor logic.)

If you're committed to considering a more sophisticated branch predictor, such as the ones that you and Rob have been discussing, then I think that, in addition to constructing Pattern History Table (PHT) using the gshare (XOR) or the gselect (concatenation) methods, the branch instruction type should be used to further qualify / extend that table for conditional branch instructions individually (BPL, BMI, BVC, BVS, BNE, BEQ, BCC, BCS) or by conditional branch instruction classes (N, V, Z, C). Naturally, for the 6502, the coding for individual instructions would require 3 bits and the classes would require 2 bits in the PHT.

The likelihood that these instructions or classes would clash in the PHT when located in different parts of the program should be fairly low. Thus, extending the PHT index with the instruction or class should reduce the need to have a large number of bits from the PC to adequately hash the branches being cached. In fact, I think that you should be able to reduce the number of PC bits into something manageable like 6 bits, which will allow each PHT to be implemented in a dual-ported Xilinx LUT-based RAM (with a synchronous read/write port and an asynchronous read port, if a small FPGA option is in your design space.) If the PHT implements the two bit saturating branch predictor for a conditional branch instruction class, then I think the result wold be very satisfactory for your Pipelined 6502 project.

Drass · Post by **Drass** » Wed Dec 18, 2019 2:03 am

Some great suggestions here Michael. Thank you. I’ll experiment a bit and report back.

MichaelM · Post by **MichaelM** » Wed Dec 18, 2019 2:45 am

Sorry that time does not permit me to help investigate this with you. Looking forward to your next post on the subject.

Drass · Post by **Drass** » Fri Dec 20, 2019 4:41 am

I’m sorry to report that using the branch-type to influence prediction seems to have produced rather lackluster results. I tested a couple of variations:

Use the branch-type to select an initial value for the 2-bit counters in a bi-modal predictor. I used values for each branch-type as well as each direction individually and in combination (i.e. a BPL Backward might have the same or a different initial value than a BPL Forward). Unfortunately, in each case the value is quickly over-written and has little influence on the overall performance of the predictor beyond a handful of cycles.
Use a 3-bit branch-type as part of the PHT index in a GSEL correlating predictor. In this case, performance actually deteriorated. It turns out that adding more bits from PC is preferable to additional bits that reflect a particular branch attribute.

One variation I did not try is to have a wider prediction counter, and use the proportion of taken vs not-taken branches as a guide when updating the counter. The thought was to increment or decrement the counter by different amounts depending the relative proportion of branch outcomes. To reflect a 75/25 ratio, for example, we might increment the counter by 4 when a branch is taken and decrement it by 1 when it is not. The predictor would then naturally favour taken branches in the right proportion.

I’m certainly open to further experiments on this, but for now, I decided to go back to the original GShare scheme. Further tests showed that GShare in fact does quite well with a small table in many instances. It seems that the BASIC interpreter is simply a particularly difficult case (probably because of the large number of data-dependent branches and relative lack of regular looping branches). Be that as it may, it’s an important target, so a large PHT it will have to be. (In a discrete logic implementation, any reasonable size for a PHT is going to need a separate RAM IC in any case, so a large PHT is not as burdensome as it might be in a FPGA).

But there is one drawback to a large table: it makes the predictor harder to train. By way of illustration, see the graph below:

This is the GShare branch predictor running on the Eratosthenes code with various widths for the Global History Register. At just 2-bits, we have a 4-entry pht, and the predictor performs much like a simple bi-modal 2-bit predictor. But performance suffers significantly as we add more history bits. While a 16k pht performs very well for the BASIC Benchmark with 9 Million cycles, a simple program like Eratosthenes is much better off with a simple bi-modal predictor. In general, it’s clear that GShare alone will struggle in situations where bi-modal will do well, and vice-versa. The obvious implication is to blend the two in a tournament predictor.

I tried a tournament predictor with a bi-modal 2-bit predictor in the btb and a GShare predictor in the pht. The simulation already has both, so the only addition was a 2-bit Selector counter for each entry in the btb. The Selector is initialized to point to the bi-modal predictor by default, and is updated at run-time based on which predictor is found to be correct for a given branch, as follows:

If bi-modal is correct AND GShare is incorrect, then decrement selector.
If bi-modal is incorrect AND Gshare is correct, then increment selector.

As before, this is a saturating counter. The net effect is that the selector will default to the fast-acting bi-modal predictor while GShare is busy being trained. Over time, GShare will improve and specific branches may graduate to it. That may occur sooner or later for various branches, and not at all for others. Either way, the predictor will settle onto the preferred scheme for each branch in particular, and more stable performance will result.

The graph below shows the effect of adding a tournament predictor to the Eratosthenes test above:

As we can see, a much more stable performance from the smallest to the largest table.

It’s comforting to know that we have here a fairly well-rounded predictor which can be more easily tuned. The good news is that this is not much more complicated than GShare. It requires that we keep a separate Selector counter in the btb and that we use its high-order bit as input to a 2:1 MUX which selects one or the other predictor. There is additional propagation delay to contend with, but overall this seems straight forward enough and may prove feasible after all. Time will tell. Certainly, I’m glad to have tried it.

Cheers for now,
Drass.

Pipelining the 6502

Re: Pipelining the 6502

Re: Pipelining the 6502

Re: Pipelining the 6502

Re: Pipelining the 6502

Re: Pipelining the 6502

Re: Pipelining the 6502

Re: Pipelining the 6502

Re: Pipelining the 6502

Re: Pipelining the 6502

Re: Pipelining the 6502

Re: Pipelining the 6502

Re: Pipelining the 6502

Re: Pipelining the 6502

Re: Pipelining the 6502

Re: Pipelining the 6502