Pipelining the 6502
- GARTHWILSON
- Forum Moderator
- Posts: 8773
- Joined: 30 Aug 2002
- Location: Southern California
- Contact:
Re: Pipelining the 6502
I'll repeat my comment from the last page, since no one here has commented on it: Why not let the programmer tell it if it's likely to branch or not. He always knows the goal of the particular part of the code, and what it's used for, unlike even the smartest compilers. From what the programmer says is likely, the compiler sets or clears a bit in the instruction to tell the processor if it's likely to branch or not. At run time, the processor uses that information to prepare the most efficient way to do what the programmer says is most likely to happen. If the problem is that that would require breaking away from a standard like C, then it goes back to something I said earlier, that maybe it's time to do so.
http://WilsonMinesCo.com/ lots of 6502 resources
The "second front page" is http://wilsonminesco.com/links.html .
What's an additional VIA among friends, anyhow?
The "second front page" is http://wilsonminesco.com/links.html .
What's an additional VIA among friends, anyhow?
- barrym95838
- Posts: 2056
- Joined: 30 Jun 2013
- Location: Sacramento, CA, USA
Re: Pipelining the 6502
GARTHWILSON wrote:
... the compiler sets or clears a bit in the instruction to tell the processor if it's likely to branch or not. At run time, the processor uses that information to prepare the most efficient way to do what the programmer says is most likely to happen. If the problem is that that would require breaking away from a standard like C, then it goes back to something I said earlier, that maybe it's time to do so.
Got a kilobyte lying fallow in your 65xx's memory map? Sprinkle some VTL02C on it and see how it grows on you!
Mike B. (about me) (learning how to github)
Mike B. (about me) (learning how to github)
- GARTHWILSON
- Forum Moderator
- Posts: 8773
- Joined: 30 Aug 2002
- Location: Southern California
- Contact:
Re: Pipelining the 6502
barrym95838 wrote:
GARTHWILSON wrote:
... the compiler sets or clears a bit in the instruction to tell the processor if it's likely to branch or not. At run time, the processor uses that information to prepare the most efficient way to do what the programmer says is most likely to happen. If the problem is that that would require breaking away from a standard like C, then it goes back to something I said earlier, that maybe it's time to do so.
http://WilsonMinesCo.com/ lots of 6502 resources
The "second front page" is http://wilsonminesco.com/links.html .
What's an additional VIA among friends, anyhow?
The "second front page" is http://wilsonminesco.com/links.html .
What's an additional VIA among friends, anyhow?
Re: Pipelining the 6502
It’s amazing what a modern branch predictor can achieve, but it ultimately requires a loop backwards so that the branch is repeated. I suspect for startup code where no branches are being repeated there isn’t a predictor with a lot of accuracy, so I wouldn’t worry about accuracy in those cases. The predictors really gain “accuracy” from branches that are repeated lots of times. Branch prediction really helps a lot with tight loops in a pipelined processor.
There is a way in ‘C’.
The standard approach to specifying branch predictability in ‘C’ is to use __branch_likely() constructs in code (I hope I got that right IIRC).
I had adapted a ‘C’ compiler at one point to accept an optional second parameter in if () statements. The second parameter allowed the programmer to specify the likelihood of the if statement being true. That propagated out to the machine code as a bit setting in the branch instruction to set the branches’ predictability. So “if (a)” was a normal if with no prediction, “if (a;1)” meant predict true. “if (a;0)” meant predict false. I was going to extend the idea to allow specifying the branch predictor to for machines that have multiple predictors. So, “if (a;2)” would use an alternate predictor. The constant had to be a compile time constant since it gets encoded in a branch instruction.
Quote:
Why not let the programmer tell it if it's likely to branch or not. He always knows the goal of the particular part of the code
The standard approach to specifying branch predictability in ‘C’ is to use __branch_likely() constructs in code (I hope I got that right IIRC).
I had adapted a ‘C’ compiler at one point to accept an optional second parameter in if () statements. The second parameter allowed the programmer to specify the likelihood of the if statement being true. That propagated out to the machine code as a bit setting in the branch instruction to set the branches’ predictability. So “if (a)” was a normal if with no prediction, “if (a;1)” meant predict true. “if (a;0)” meant predict false. I was going to extend the idea to allow specifying the branch predictor to for machines that have multiple predictors. So, “if (a;2)” would use an alternate predictor. The constant had to be a compile time constant since it gets encoded in a branch instruction.
Re: Pipelining the 6502
Rob Finch wrote:
I think it depends a lot on encountering loops which boost the apparent accuracy. ... For the basic interpreter there’s probably a lot of random branching which would lower accuracy.
Yes, there are far fewer looping branches in the Basic interpreter than we might see in other code, so many branches will appear quite random. The fact that we do see good accuracy with longer histories suggests that there is regularity there, but just over many more iterations than is typical for looping branches.
I feel the Basic interpreter is an important target, so it might be that a larger predictor is a necessity here. But this does suggest that testing more broadly is important before making final decisions on the tradeoffs.
Quote:
using synchronous memories with registered outputs won’t work.
C74-6502 Website: https://c74project.com
Re: Pipelining the 6502
GARTHWILSON wrote:
Why not let the programmer tell it if it's likely to branch or not.
Quote:
Maybe I'm not understanding the situation completely, but ISTM that you're asking for an instruction bit that simply can't exist unless you implement 9-bit bytes or something similar. You're instantly losing 65xx machine code compatibility no matter how you slice it, right?
A related thought here is that the simulator might eventually do double duty as a static profiler/optimizer and generate metadata exactly for this purpose. It’s only a concept now, but probably worth mentioning given the discussion. One obvious challenge is how exactly to load this metadata into the CPU. Another is that performance during profiling will vary depending on the inputs to a given program, so one has use a “representative” input set. Probably easier said than done, but it’s interesting nevertheless.
Quote:
Well, that's partly what I was wondering about, if the code has to be 100% compatible with earlier versions and software that's already out there.
Quote:
It’s amazing what a modern branch predictor can achieve
Quote:
I had adapted a ‘C’ compiler at one point to accept an optional second parameter in if () statements.
C74-6502 Website: https://c74project.com
Re: Pipelining the 6502
Quote:
Quote:
using synchronous memories with registered outputs won’t work.
I was thinking one might use a “flow-through” synch RAM for this, like the GS880Z36CGT-250I. Maybe not the tidiest solution, but would that not work?
Re: Pipelining the 6502
Rob Finch wrote:
With fast enough ram and multiplexors it may be possible to clock the ram at twice or more the rate of the pipeline in order to get the access at the different stages.
C74-6502 Website: https://c74project.com
Re: Pipelining the 6502
I've been following this discussion and that regarding the "micro op 6502" on anycpu.org.
One thing that I've noticed is that there does not seem to be any consideration given to the branch instruction itself except as a conditional branch or a subroutine call/return. In my limited experience in working with the 6502 instruction set, there seems to be commonly accepted ways that loops, if/then/else, switch statements, etc. are implemented. My observation is that the 6502 conditional branches are used in specific ways to implement these programming constructs.
By not including the conditional branch instruction in the branch predictor, there appears to be an implication that 6502 conditional branches are equally likely in any of these common programming constructs. It's been my experience in programming with the 6502 that its limited set of conditional branch instructions are commonly used in particular arrangements in order to implement each of the structured programming constructs that are commonly part of any program.
I think that it would be worthwhile investigating the frequency of occurrence of a particular conditional branch instruction in particular programming constructs. For the tests being conducted using EhBasic, I suspect that this should be possible using the bytecodes for the various keywords in the language. This information could then be used to determine the frequency of occurrence of particular conditional branch instructions, and also the most likely branch, i.e. taken or not taken, for the conditional branch instruction in each of these situations.
Reading some of the reference material on this subject leads me to conclude that the conditional branch instructions in a particular instruction set are assumed to be equally likely. Nothing that I've been able to read so far indicates that there is any consideration given to the use of particular conditional branch instructions in particular programming contexts. For example, a "while" loop is likely to use conditional branch instructions like bne, bcs, and bmi individually or in rapid combination more often than their complements: beq, bcc, and bpl.
For a particular instruction set, its conditional branch instructions are likely to be used in particular, repeatable ways to implement the various structured programming constructs commonly employed in High Level Languages (HLLs) like C, Pascal, and even Basic. The limited number of conditional branch instructions in the 6502 instruction set, and the fact that they only operate on a single flag at a time, is more likely to result in specific usage patterns when used to implement the commonly used programming structures.
In my M65C02A instruction set, I added approximately 24 conditional branch instructions. These additional conditional branch instructions perform many of the tests in a single instruction that typically require at least two, if not three, closely spaced 6502 branch instructions to implement. When adding this new conditional instructions to the code generator for my Pascal compiler, I found that I used both the original and the new conditional branch instructions in very particular patterns when implementing "for", "while", "repeat", and "case" statements.
Although a general solution is likely the sought after solution, I think that whatever solution is chosen for the pipelined 6502 project should be skewed toward the particular characteristics of 6502 programs. The nature of the 6502 is such that generalizations that apply to register rich CISC/RISC architectures are not likely to apply directly to the 6502. I've always been amazed by the programming idioms which apply well to the 6502, but not as well to other architectures.
Edit: added missing word -- seem -- to second paragraph.
One thing that I've noticed is that there does not seem to be any consideration given to the branch instruction itself except as a conditional branch or a subroutine call/return. In my limited experience in working with the 6502 instruction set, there seems to be commonly accepted ways that loops, if/then/else, switch statements, etc. are implemented. My observation is that the 6502 conditional branches are used in specific ways to implement these programming constructs.
By not including the conditional branch instruction in the branch predictor, there appears to be an implication that 6502 conditional branches are equally likely in any of these common programming constructs. It's been my experience in programming with the 6502 that its limited set of conditional branch instructions are commonly used in particular arrangements in order to implement each of the structured programming constructs that are commonly part of any program.
I think that it would be worthwhile investigating the frequency of occurrence of a particular conditional branch instruction in particular programming constructs. For the tests being conducted using EhBasic, I suspect that this should be possible using the bytecodes for the various keywords in the language. This information could then be used to determine the frequency of occurrence of particular conditional branch instructions, and also the most likely branch, i.e. taken or not taken, for the conditional branch instruction in each of these situations.
Reading some of the reference material on this subject leads me to conclude that the conditional branch instructions in a particular instruction set are assumed to be equally likely. Nothing that I've been able to read so far indicates that there is any consideration given to the use of particular conditional branch instructions in particular programming contexts. For example, a "while" loop is likely to use conditional branch instructions like bne, bcs, and bmi individually or in rapid combination more often than their complements: beq, bcc, and bpl.
For a particular instruction set, its conditional branch instructions are likely to be used in particular, repeatable ways to implement the various structured programming constructs commonly employed in High Level Languages (HLLs) like C, Pascal, and even Basic. The limited number of conditional branch instructions in the 6502 instruction set, and the fact that they only operate on a single flag at a time, is more likely to result in specific usage patterns when used to implement the commonly used programming structures.
In my M65C02A instruction set, I added approximately 24 conditional branch instructions. These additional conditional branch instructions perform many of the tests in a single instruction that typically require at least two, if not three, closely spaced 6502 branch instructions to implement. When adding this new conditional instructions to the code generator for my Pascal compiler, I found that I used both the original and the new conditional branch instructions in very particular patterns when implementing "for", "while", "repeat", and "case" statements.
Although a general solution is likely the sought after solution, I think that whatever solution is chosen for the pipelined 6502 project should be skewed toward the particular characteristics of 6502 programs. The nature of the 6502 is such that generalizations that apply to register rich CISC/RISC architectures are not likely to apply directly to the 6502. I've always been amazed by the programming idioms which apply well to the 6502, but not as well to other architectures.
Edit: added missing word -- seem -- to second paragraph.
Last edited by MichaelM on Sun Dec 15, 2019 9:31 pm, edited 1 time in total.
Michael A.
- barrym95838
- Posts: 2056
- Joined: 30 Jun 2013
- Location: Sacramento, CA, USA
Re: Pipelining the 6502
Good point, Michael. It seems self-evident to me that BNE would naturally lean toward true and BEQ would naturally lean toward false, but it would take some further analysis to get a feel for the leanings of the others in real-world contexts.
Got a kilobyte lying fallow in your 65xx's memory map? Sprinkle some VTL02C on it and see how it grows on you!
Mike B. (about me) (learning how to github)
Mike B. (about me) (learning how to github)
Re: Pipelining the 6502
That’s a great idea! A 6502 specific predictor can certainly skew towards particular regularities in 6502 code if they can be identified.
Let’s have a look ...These tables show the results of the Basic Benchmark, a simple FOR Loop, the Eratosthenes Sieve and the multi-byte multiply. In all but the Eratosthenes Sieve, there is a clear bias towards taking branches where the tested status flag is clear, and not taking branches where the flag is set (i.e. BNE is likely to be taken, and BEQ not taken — just as Mike observed).
I’ll want to sample more code just to be sure this holds, but it is encouraging to say the least.
Nicely done gents!
Let’s have a look ...These tables show the results of the Basic Benchmark, a simple FOR Loop, the Eratosthenes Sieve and the multi-byte multiply. In all but the Eratosthenes Sieve, there is a clear bias towards taking branches where the tested status flag is clear, and not taking branches where the flag is set (i.e. BNE is likely to be taken, and BEQ not taken — just as Mike observed).
I’ll want to sample more code just to be sure this holds, but it is encouraging to say the least.
Nicely done gents!
C74-6502 Website: https://c74project.com
Re: Pipelining the 6502
The larger application, the Basic Benchmark, appears to provide some generalizations that may be applied to the 8 different 6502 conditional branch instructions. The Sieve of Eratosthenes seems to throw a wrench into the rule for the BEQ instruction that seems to be indicated by the Basic Benchmark and the Basic For Loop.
There are, of course, studies for other conditional branch instruction combinations that you've not yet completed, but I am thinking that, for the 6502, the simple two bit saturating counter that you originally started with appears to be able to handle the case for the BEQ instruction when presented with Basic Benchmark in one test, and the Sieve of Eratosthenes in the another. If you extend your branch predictor logic to include some heuristic regarding the branch instruction at a particular address, then you could initialize the two bit saturating branch predictor to '00' or '11' based on the measured frequency for branch taken / not taken as your spreadsheet shows. A few mis-predictions would occur, but they would quickly move to one or the other extreme of the two bit saturating branch prediction counter and provide nearly 100% correct branch prediction. Thus, the simple two bit saturating counter branch predictor should prove to be very effective for the 6502. (If I recall, you've already proven this hypothesis in some of your early work on the branch predictor logic.)
If you're committed to considering a more sophisticated branch predictor, such as the ones that you and Rob have been discussing, then I think that, in addition to constructing Pattern History Table (PHT) using the gshare (XOR) or the gselect (concatenation) methods, the branch instruction type should be used to further qualify / extend that table for conditional branch instructions individually (BPL, BMI, BVC, BVS, BNE, BEQ, BCC, BCS) or by conditional branch instruction classes (N, V, Z, C). Naturally, for the 6502, the coding for individual instructions would require 3 bits and the classes would require 2 bits in the PHT.
The likelihood that these instructions or classes would clash in the PHT when located in different parts of the program should be fairly low. Thus, extending the PHT index with the instruction or class should reduce the need to have a large number of bits from the PC to adequately hash the branches being cached. In fact, I think that you should be able to reduce the number of PC bits into something manageable like 6 bits, which will allow each PHT to be implemented in a dual-ported Xilinx LUT-based RAM (with a synchronous read/write port and an asynchronous read port, if a small FPGA option is in your design space.) If the PHT implements the two bit saturating branch predictor for a conditional branch instruction class, then I think the result wold be very satisfactory for your Pipelined 6502 project.
There are, of course, studies for other conditional branch instruction combinations that you've not yet completed, but I am thinking that, for the 6502, the simple two bit saturating counter that you originally started with appears to be able to handle the case for the BEQ instruction when presented with Basic Benchmark in one test, and the Sieve of Eratosthenes in the another. If you extend your branch predictor logic to include some heuristic regarding the branch instruction at a particular address, then you could initialize the two bit saturating branch predictor to '00' or '11' based on the measured frequency for branch taken / not taken as your spreadsheet shows. A few mis-predictions would occur, but they would quickly move to one or the other extreme of the two bit saturating branch prediction counter and provide nearly 100% correct branch prediction. Thus, the simple two bit saturating counter branch predictor should prove to be very effective for the 6502. (If I recall, you've already proven this hypothesis in some of your early work on the branch predictor logic.)
If you're committed to considering a more sophisticated branch predictor, such as the ones that you and Rob have been discussing, then I think that, in addition to constructing Pattern History Table (PHT) using the gshare (XOR) or the gselect (concatenation) methods, the branch instruction type should be used to further qualify / extend that table for conditional branch instructions individually (BPL, BMI, BVC, BVS, BNE, BEQ, BCC, BCS) or by conditional branch instruction classes (N, V, Z, C). Naturally, for the 6502, the coding for individual instructions would require 3 bits and the classes would require 2 bits in the PHT.
The likelihood that these instructions or classes would clash in the PHT when located in different parts of the program should be fairly low. Thus, extending the PHT index with the instruction or class should reduce the need to have a large number of bits from the PC to adequately hash the branches being cached. In fact, I think that you should be able to reduce the number of PC bits into something manageable like 6 bits, which will allow each PHT to be implemented in a dual-ported Xilinx LUT-based RAM (with a synchronous read/write port and an asynchronous read port, if a small FPGA option is in your design space.) If the PHT implements the two bit saturating branch predictor for a conditional branch instruction class, then I think the result wold be very satisfactory for your Pipelined 6502 project.
Michael A.
Re: Pipelining the 6502
Some great suggestions here Michael. Thank you. I’ll experiment a bit and report back.
C74-6502 Website: https://c74project.com
Re: Pipelining the 6502
Sorry that time does not permit me to help investigate this with you. Looking forward to your next post on the subject.
Michael A.
Re: Pipelining the 6502
I’m sorry to report that using the branch-type to influence prediction seems to have produced rather lackluster results. I tested a couple of variations:
I’m certainly open to further experiments on this, but for now, I decided to go back to the original GShare scheme. Further tests showed that GShare in fact does quite well with a small table in many instances. It seems that the BASIC interpreter is simply a particularly difficult case (probably because of the large number of data-dependent branches and relative lack of regular looping branches). Be that as it may, it’s an important target, so a large PHT it will have to be. (In a discrete logic implementation, any reasonable size for a PHT is going to need a separate RAM IC in any case, so a large PHT is not as burdensome as it might be in a FPGA).
But there is one drawback to a large table: it makes the predictor harder to train. By way of illustration, see the graph below:This is the GShare branch predictor running on the Eratosthenes code with various widths for the Global History Register. At just 2-bits, we have a 4-entry pht, and the predictor performs much like a simple bi-modal 2-bit predictor. But performance suffers significantly as we add more history bits. While a 16k pht performs very well for the BASIC Benchmark with 9 Million cycles, a simple program like Eratosthenes is much better off with a simple bi-modal predictor. In general, it’s clear that GShare alone will struggle in situations where bi-modal will do well, and vice-versa. The obvious implication is to blend the two in a tournament predictor.
I tried a tournament predictor with a bi-modal 2-bit predictor in the btb and a GShare predictor in the pht. The simulation already has both, so the only addition was a 2-bit Selector counter for each entry in the btb. The Selector is initialized to point to the bi-modal predictor by default, and is updated at run-time based on which predictor is found to be correct for a given branch, as follows:
The graph below shows the effect of adding a tournament predictor to the Eratosthenes test above:As we can see, a much more stable performance from the smallest to the largest table.
It’s comforting to know that we have here a fairly well-rounded predictor which can be more easily tuned. The good news is that this is not much more complicated than GShare. It requires that we keep a separate Selector counter in the btb and that we use its high-order bit as input to a 2:1 MUX which selects one or the other predictor. There is additional propagation delay to contend with, but overall this seems straight forward enough and may prove feasible after all. Time will tell. Certainly, I’m glad to have tried it.
Cheers for now,
Drass.
- Use the branch-type to select an initial value for the 2-bit counters in a bi-modal predictor. I used values for each branch-type as well as each direction individually and in combination (i.e. a BPL Backward might have the same or a different initial value than a BPL Forward). Unfortunately, in each case the value is quickly over-written and has little influence on the overall performance of the predictor beyond a handful of cycles.
- Use a 3-bit branch-type as part of the PHT index in a GSEL correlating predictor. In this case, performance actually deteriorated. It turns out that adding more bits from PC is preferable to additional bits that reflect a particular branch attribute.
I’m certainly open to further experiments on this, but for now, I decided to go back to the original GShare scheme. Further tests showed that GShare in fact does quite well with a small table in many instances. It seems that the BASIC interpreter is simply a particularly difficult case (probably because of the large number of data-dependent branches and relative lack of regular looping branches). Be that as it may, it’s an important target, so a large PHT it will have to be. (In a discrete logic implementation, any reasonable size for a PHT is going to need a separate RAM IC in any case, so a large PHT is not as burdensome as it might be in a FPGA).
But there is one drawback to a large table: it makes the predictor harder to train. By way of illustration, see the graph below:This is the GShare branch predictor running on the Eratosthenes code with various widths for the Global History Register. At just 2-bits, we have a 4-entry pht, and the predictor performs much like a simple bi-modal 2-bit predictor. But performance suffers significantly as we add more history bits. While a 16k pht performs very well for the BASIC Benchmark with 9 Million cycles, a simple program like Eratosthenes is much better off with a simple bi-modal predictor. In general, it’s clear that GShare alone will struggle in situations where bi-modal will do well, and vice-versa. The obvious implication is to blend the two in a tournament predictor.
I tried a tournament predictor with a bi-modal 2-bit predictor in the btb and a GShare predictor in the pht. The simulation already has both, so the only addition was a 2-bit Selector counter for each entry in the btb. The Selector is initialized to point to the bi-modal predictor by default, and is updated at run-time based on which predictor is found to be correct for a given branch, as follows:
- If bi-modal is correct AND GShare is incorrect, then decrement selector.
- If bi-modal is incorrect AND Gshare is correct, then increment selector.
The graph below shows the effect of adding a tournament predictor to the Eratosthenes test above:As we can see, a much more stable performance from the smallest to the largest table.
It’s comforting to know that we have here a fairly well-rounded predictor which can be more easily tuned. The good news is that this is not much more complicated than GShare. It requires that we keep a separate Selector counter in the btb and that we use its high-order bit as input to a 2:1 MUX which selects one or the other predictor. There is additional propagation delay to contend with, but overall this seems straight forward enough and may prove feasible after all. Time will tell. Certainly, I’m glad to have tried it.
Cheers for now,
Drass.
C74-6502 Website: https://c74project.com