 Post subject: Re: Pipelining the 6502
PostPosted: Fri Dec 20, 2019 4:43 am 

Joined: Sun Oct 18, 2015 11:02 pm
Posts: 428
Location: Toronto, ON
MichaelM wrote:
Sorry that time does not permit me to help investigate this with you. Looking forward to your next post on the subject.
No worries Michael. I’m enjoying working through this. Thanks for your interest and support. Cheers.

_________________
C74-6502 Website: https://c74project.com


 Post subject: Re: Pipelining the 6502
PostPosted: Fri Dec 20, 2019 6:10 am 

Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10838
Location: England
> It turns out that adding more bits from PC is preferable to additional bits that reflect a particular branch attribute.

Very good result - this is deep into the kind of performance engineering where it's useful to have intuition, but crucial to perform experiments.

Interesting findings with the tournament.


 Post subject: Re: Pipelining the 6502
PostPosted: Sat Dec 21, 2019 10:47 am 

Joined: Sun Dec 29, 2002 8:56 pm
Posts: 449
Location: Canada
Interesting reading. (I have not tried the following yet.) Since the gshare predictor with a limited global history performs like a gselect predictor, I wonder if it could be used by itself, without tournament prediction, by restricting the global-history bits in use to begin with and then increasing them after a number of branches. For instance, with a 16-bit global history, masking off 14 bits using an ‘and’ gate to begin with should make it work much like a gselect predictor. Then, after eight or ten branches for example, allow more bits of global history, scaling the history allowed to the number of branches encountered.
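
Something like the following, perhaps (a minimal Python sketch of the idea, untested; the table size and the rate of one extra history bit every 4096 branches are arbitrary assumptions):
Code:
class SlidingGshare:
    def __init__(self, table_bits=12, max_hist=16, grow_every=4096):
        self.table = [1] * (1 << table_bits)   # 2-bit counters, weakly not-taken
        self.table_bits = table_bits
        self.max_hist = max_hist
        self.grow_every = grow_every
        self.ghr = 0                           # global history register
        self.hist_bits = 0                     # no history in use to begin with
        self.branches = 0

    def _index(self, pc):
        mask = (1 << self.hist_bits) - 1       # the 'and' gate masking history
        return ((self.ghr & mask) ^ pc) & ((1 << self.table_bits) - 1)

    def predict(self, pc):
        return self.table[self._index(pc)] >= 2    # True = predict taken

    def update(self, pc, taken):
        i = self._index(pc)
        self.table[i] = min(3, self.table[i] + 1) if taken else max(0, self.table[i] - 1)
        self.ghr = ((self.ghr << 1) | int(taken)) & ((1 << self.max_hist) - 1)
        self.branches += 1
        if self.hist_bits < self.max_hist and self.branches % self.grow_every == 0:
            self.hist_bits += 1                # unmask one more history bit

With hist_bits at zero the table behaves as a plain bimodal predictor indexed by PC, and each unmasked bit moves it a step closer to full gshare.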

_________________
http://www.finitron.ca


 Post subject: Re: Pipelining the 6502
PostPosted: Sun Dec 22, 2019 12:01 am 

Joined: Sun Oct 18, 2015 11:02 pm
Posts: 428
Location: Toronto, ON
Good idea! I’ve not tried it yet either, but I do have this graph from a previous test run:
[Attachment: 06C3B7EF-B17E-4132-BD99-C38B93C04E93.jpeg]
The GShare predictor takes about 100,000 cycles to exceed the bi-modal predictor’s 85% threshold. Perhaps simply easing the GHR from 0 to 16 bits at the rate of 1 bit every few thousand cycles might do the trick.

I do worry that changing the width of the GHR will change the location of branches in the addressable space and cause the predictor to take even longer to train. Does that make sense? Also, one advantage of the tournament predictor is that it will adapt fairly quickly to a varying workload (with bi-modal once again taking over as the accuracy of GShare drops). Maybe the “sliding” GShare scheme should have some heuristic to once again restrict the width of the GHR if accuracy drops suddenly.

I’ll be interested to see how this works out.

_________________
C74-6502 Website: https://c74project.com


 Post subject: Re: Pipelining the 6502
PostPosted: Sun Dec 22, 2019 12:21 am 

Joined: Mon Apr 23, 2012 12:28 am
Posts: 760
Location: Huntsville, AL
I think if you add the bits from the left to the right, then the locations of the first elements will remain the same. In other words, left justify your sliding / variable index into the memory, filling unused bits with zeroes.
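
In code, the left-justified index might look something like this (a sketch of my reading of the idea, with gshare's XOR-with-PC kept for illustration):
Code:
def left_justified_index(ghr, hist_bits, pc, table_bits):
    # History bits in use occupy the top of the index; the unused low bits
    # are zero-filled, so entries trained with a narrow history keep their
    # locations as the history widens.
    assert hist_bits <= table_bits
    hist = (ghr & ((1 << hist_bits) - 1)) << (table_bits - hist_bits)
    return (hist ^ pc) & ((1 << table_bits) - 1)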

_________________
Michael A.


 Post subject: Re: Pipelining the 6502
PostPosted: Sun Dec 22, 2019 11:34 pm 

Joined: Sun Oct 18, 2015 11:02 pm
Posts: 428
Location: Toronto, ON
Thanks Michael. Will give it a try.

Btw, it looks like Rob didn’t have much luck with it. :(

http://anycpu.org/forum/viewtopic.php?f=23&t=655&sid=05c4e6715aff5675442c1277a6df8135&start=45#p5199

_________________
C74-6502 Website: https://c74project.com


 Post subject: Re: Pipelining the 6502
PostPosted: Mon Dec 23, 2019 2:54 pm 

Joined: Sun Oct 18, 2015 11:02 pm
Posts: 428
Location: Toronto, ON
As an aside, I ran across this technical article describing the Intel 486 as "Executing Instructions in One Clock Cycle". I was interested immediately. There are some very intriguing takeaways here, and it's very cool to see how the major enhancements over the 386 were presented at the time:

    “We revamped the prefetch, instruction decode, and control logic to optimally integrate the cache into the instruction execution pipeline. This critical improvement allowed the i486 CPU to execute the most frequent, simplest instructions in one clock cycle.”

The i486 is a classic example of pipelining a CISC instruction set. It featured a five-stage pipeline, as follows:

  • Fetch -- retrieves 16 bytes from the cache into a pre-fetch buffer
  • D1 -- decodes the opcode (and operands, i.e. "directs the instruction aligner/prefetch queue to step to the next instruction")
  • D2 -- "computes memory addresses"
  • EX -- performs the ALU operation or memory access
  • WB -- writes to the register file

The D2 stage uses two cycles for indexed-addressing operations, which is analogous to the IX stage in the P7 pipeline simulation. Effectively, the i486 stalls the pipeline to apply an index to a base address, instead of adding a separate pipeline stage to do the job. The stall hurts CPI, but the shorter pipeline also reduces the branch mis-prediction penalty. If we're mostly executing non-indexed operations, adding a cycle to the few that are indexed is the obvious choice. In the general case, though, the tradeoff is between the frequency of these more complex instructions (which trigger nops by stalling the pipeline) and the quality of branch prediction (which governs the nops triggered by mis-predictions). A longer pipeline which can handle indexed addressing in one pass will also require stronger branch prediction.
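
To put rough numbers on that tradeoff (every figure below is an invented placeholder, not a measurement from the article or the simulator):
Code:
f_indexed = 0.25   # fraction of instructions paying the extra index cycle
f_branch  = 0.22   # fraction of instructions that are branches

# Shorter pipeline, trivial prediction: stall on indexed ops, but a taken
# branch costs only 2 extra cycles (the i486 figure quoted below)
cpi_short = 1 + f_indexed * 1 + f_branch * 0.51 * 2   # assume ~51% taken

# Longer pipeline, real prediction: indexed ops in one pass, pricier flushes
cpi_long  = 1 + f_branch * 0.05 * 5    # assume 5% mispredicted, 5-cycle flush

print(round(cpi_short, 3), round(cpi_long, 3))        # 1.474 vs. 1.055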

In this regard, the i486 makes a very clear choice -- it uses trivial branch prediction and focuses instead on minimizing the mis-prediction penalty. Hence the shorter pipeline, but also an intriguing pre-fetch scheme in the EX stage. The FE stage makes an implicit "not-taken" assumption for all conditional branches and continues fetching the next sequential instruction as usual. Then a speculative fetch is also performed while the branch is being evaluated in the EX stage, this time fetching the branch target instruction. If the branch turns out to be "taken", the pipeline is flushed and the target instruction is sent directly to the D1 stage for execution.

This scheme has the effect of reducing the branch mis-prediction penalty by two cycles -- one cycle is eliminated by the pre-fetch itself, and the other by the fact that the fetch is performed in the EX stage. This means that there are only three instructions flushed in the pipeline, rather than four. The net result is a two-cycle penalty for every "taken" branch:

    “the CPU runs a speculative fetch cycle to the target of the jump during the Ex stage of the jump instruction. ... If the CPU evaluates the condition as true, the fetch to the target address refills the pipeline. ... In summary, conditional jumps take one clock cycle for a jump that is not taken, and three cycles for a taken jump.”

The EX-stage pre-fetch is analogous to the speculative pipeline Arne suggested, but with only one stage! Very cool. Other key aspects of the design include:

  • "No Data Delay", which is equivalent to "data-forwarding" in the P7 simulation -- "... because the CPU contains the necessary bypass paths, it can write this data to the register file and forward it to the ALU at the same time".
  • "Pointer Load Delay", which is effectively a RAW hazard stall due to the short pipeline -- "if a value is loaded from memory into a register, and that register is then used as a base register in the next instruction, the CPU will stall for one cycle". The P7 pipeline has a dedicated AD stage to avoid this stall.
  • "Unified Write-Through Cache", which according to the article also means that "no special logic is required to handle self-modifying code". I don't quite see how a unified cache helps with code that modifies instructions that are already loaded in the pre-fetch buffer or elsewhere in the pipeline -- but maybe I'm missing something?

So there is a lot here that resonates. The article does not quote any CPI figures, so it's difficult to assess performance in those terms. Still, what emerges is an approach that emphasizes simplicity and economy, at the expense of tolerating stalls on the more complex instructions. We get a shorter pipeline with no branch prediction to speak of. That means no btb, no rtb, no pht, fewer and narrower pipeline registers, and a small unified on-chip write-through cache. And then there is that ingenious speculative fetch ... wow, viewed from this vantage point you can really admire this design.

So interesting! :)

_________________
C74-6502 Website: https://c74project.com



 Post subject: Re: Pipelining the 6502
PostPosted: Mon Dec 23, 2019 3:29 pm 

Joined: Mon Apr 23, 2012 12:28 am
Posts: 760
Location: Huntsville, AL
Drass:

I am not yet completely convinced that linking the type of conditional branch instruction, or the exact instruction, to the branch predictor is a bad idea. However, your and Rob's results clearly demonstrate that a simple use of the information does not appear to improve the branch predictor's accuracy as I expected. I will have to continue to research the topic, and possibly postulate another approach to incorporating the conditional branch information into a 6502-architecture branch predictor.

I am a bit reluctant to accept the taken/not-taken statistic as a satisfactory measure of conditional branch behavior when it is clear that some conditional branch instructions are more likely than others, and that some structured-programming control structures use the available 6502 conditional branch instructions in specific ways. Reducing the conditional branches to simple taken/not-taken statistics seems to assume that the conditional branch instructions are equally likely, and that is just not the case for the 6502 architecture. (I will have to collect the frequency of occurrence of the 8 conditional branch instructions to see how they are truly distributed, but at the moment I don't have a suitably large set of programs from which to collect that kind of population data.)
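
Once a suitable trace is available, the tally itself is simple. A sketch in Python, assuming a trace file of one hex opcode byte per line (the format is a placeholder):
Code:
from collections import Counter

BRANCH_OPCODES = {0x10: "BPL", 0x30: "BMI", 0x50: "BVC", 0x70: "BVS",
                  0x90: "BCC", 0xB0: "BCS", 0xD0: "BNE", 0xF0: "BEQ"}

def branch_histogram(trace_path):
    # Count how often each 6502 conditional branch appears in the trace
    counts = Counter()
    with open(trace_path) as f:
        for line in f:
            opcode = int(line.strip(), 16)
            if opcode in BRANCH_OPCODES:
                counts[BRANCH_OPCODES[opcode]] += 1
    return counts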

As I was reading more of the literature last night, I was struck by what appeared to be a common assumption regarding the x86 instruction set: namely, that conditional return-from-subroutine instructions are not present or not used. I suppose that could very well be the case, but the x86 instruction set that I remember using had a full set of conditional return-from-subroutine instructions. I remember using them fairly infrequently when programming in assembly language, but I don't recall ever seeing them used by an HLL compiler. (It could also be that I'm confusing x86 assembly language with 8080/8085/Z80 assembly language.) I wonder what effect conditional return instructions would have on the internal return stack modern processors use to speed up that process.

BTW: Thanks for the article. It's a good read. I like their two-stage decoding. I am glad to see that the '486 treated prefix instructions in the same manner that I do in the M65C02A: one at a time.

_________________
Michael A.


 Post subject: Re: Pipelining the 6502
PostPosted: Mon Dec 23, 2019 5:20 pm 

Joined: Sun Oct 18, 2015 11:02 pm
Posts: 428
Location: Toronto, ON
MichaelM wrote:
I will have to continue to research the topic, and possibly postulate another approach to incorporating the conditional branch information into a 6502-architecture branch predictor.
I’m totally up for that. I love the idea of a custom solution tuned to the 6502.

Quote:
I wonder what effect conditional return instructions would have on the internal return stack modern processors use to speed that process.
Interesting. A conditional return is really a conditional indirect branch, so you'd have to predict the branch outcome and also predict the return address. Both are currently done in the FE stage and then validated later in the pipeline. So I assume you just predict as normal, and flush the pipeline if you are wrong on either count?
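
Roughly like this, perhaps (a sketch only -- the 6502 has no conditional return, so the instruction here is purely hypothetical):
Code:
class ConditionalReturnPredictor:
    def __init__(self):
        self.ras = []                        # return address stack

    def on_call(self, return_addr):
        self.ras.append(return_addr)

    def fe_stage(self, predict_taken):
        # Speculate in FE: pop and redirect fetch only if predicted taken
        if predict_taken and self.ras:
            return self.ras.pop()            # predicted return target
        return None                          # keep fetching sequentially

    def resolve(self, predicted_taken, popped, actual_taken, actual_target):
        # Validated later in the pipeline: flush on wrong direction or target
        if predicted_taken and not actual_taken:
            self.ras.append(popped)          # repair: undo the speculative pop
            return "flush"
        if actual_taken and not predicted_taken:
            if self.ras:
                self.ras.pop()               # consume the entry on the repair path
            return "flush"
        if actual_taken and popped != actual_target:
            return "flush"                   # right direction, wrong address
        return "ok"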

Quote:
I am glad to see that the '486 treated the prefix instructions in the same manner that I do in the M65C02A: one at a time.
Nice. I’m not very familiar with these prefixed instructions, but I can certainly appreciate the sense of confirmation here.

_________________
C74-6502 Website: https://c74project.com


 Post subject: Re: Pipelining the 6502
PostPosted: Fri Dec 27, 2019 3:51 am 

Joined: Sun Oct 18, 2015 11:02 pm
Posts: 428
Location: Toronto, ON
Alright, after much tweaking, the latest run-stats for the BASIC Benchmark look pretty good:
Code:
Ready
RUN
 907           887           20

------------ Run Stats ------------
Dual Port Caches:  True
btb Enabled:       True
rtb Enabled:       True
ind Enabled:       True
Branch Prediction: TOUR 2-bit
Run: BasicBenchmark.65b
- - - - - - - - - - - - - - - - - -
Cycles:         8200000
NMOS Cycles:    22848842
- - - - - - - - - - - - - - - - - -
Instructions:   7132071   (87%)
Data Stalls:     245622   (3%)
Control Stalls:  308360   (4%)
Delay Slots:     260187   (3%)
Generated:       253761   (3%)
Total:          8200000   (100%)
- - - - - - - - - - - - - - - - - -
Data Fwd:       2303899   (28%)
Operand Fwd:     241342   (3%)
- - - - - - - - - - - - - - - - - -
Branches:       1581138   (22%)
Taken:           800529   (51%)
Mispredictions:   77090   (5%)
- - - - - - - - - - - - - - - - - -
bCache Access:    55266   (1%)
aCache Conflicts: 63305   (1%)

----------- Performance -----------
CPI:               1.15
NMOS CPI:          3.20
Boost Factor:      2.8x
CPI is at just 1.15 for the BASIC Benchmark, or 2.8x NMOS. The tournament branch predictor performs well here, as well as on the much shorter Eratosthenes Sieve. It also does much better than gshare alone at smaller table sizes, making it much more FPGA-friendly:
[Attachment: B330FC52-1FD8-4A8F-A626-B0C969AD0332.jpeg]
All in all, I'm pretty happy with that. I didn't have much luck with the "sliding gshare" scheme, unfortunately, but there are two additional enhancements that I'm holding in abeyance for the moment. They are:

1) JSR currently executes in two micro-instructions. Reducing it to one would eliminate all Generated cycles in the run above. The pipeline would then complete an additional 253,761 instructions in the same run, and CPI would be 1.11. The complication is that JSR would be writing to two addresses on the stack (for the low and high bytes of the return address). Unfortunately, that would double the number of hazard comparators required to maintain consistency (from 14 to 28!). That's a lot of ICs.

2) Implementing the i486's speculative Execute-stage fetch would eliminate two cycles from every branch mis-prediction. That translates into 154,180 additional instructions in the run above, or a CPI of 1.09 when taken together with the JSR enhancement. The complication here is that additional memory bandwidth is required for the I-Cache to support the extra fetch. A quick test shows that without it, the speculative fetch conflicts with the primary FE-stage fetch 70% of the time in this run. There are a number of options for increasing bandwidth (including doubling the width of the I-Cache bus), but each comes at a cost.
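
For reference, the arithmetic behind those figures, using only the run stats above:
Code:
cycles       = 8_200_000
instructions = 7_132_071
generated    = 253_761     # JSR second micro-instruction cycles
mispredicts  = 77_090

print(cycles / instructions)           # 1.1497 -> CPI 1.15 today
new_insns = instructions + generated   # 1) single-cycle JSR
print(cycles / new_insns)              # 1.1102 -> CPI 1.11
saved = 2 * mispredicts                # 2) = 154,180 more instructions
print(cycles / (new_insns + saved))    # 1.0875 -> CPI 1.09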

We're at a pretty good place with the BASIC interpreter for now, so I'm going to resist the impulse to optimize further. Instead, let's see what happens with FORTH and VTL02 ... :)

As always, thanks in advance for any thoughts and comments.

Cheers for now,
Drass

_________________
C74-6502 Website: https://c74project.com


 Post subject: Re: Pipelining the 6502
PostPosted: Mon Dec 30, 2019 9:42 pm 

Joined: Wed Mar 02, 2016 12:00 pm
Posts: 343
I haven't had time to follow this thread and forum for some time, but I think the work you are doing with branch prediction and everything looks pretty amazing.

The 6502-like core I made a couple of years ago was able to do 1 instruction/cycle, but with all sorts of limiting design criteria (separate program/data space, cut-down opcode set, fixed opcode length). I did not have a pipeline in the usual sense, so there was no branch prediction, and its usefulness was pretty limited. I wanted to expand that design with a pipeline, but thought that it would hurt the speed to a level that made it as slow as all the other cores. I never imagined that branch prediction could do so much to reduce the delays with 7 levels of pipelining! Amazing.

Keep up the good work and happy new year (eventually).


 Post subject: Re: Pipelining the 6502
PostPosted: Wed Jan 01, 2020 10:19 am 

Joined: Sun Oct 18, 2015 11:02 pm
Posts: 428
Location: Toronto, ON
Thank you so much kakemoms, and Happy New Year to you as well.

I recall your project well, both in terms of the CPI and the multi-core approach. Very cool. It was necessary to extend the pipeline to deal with addressing modes efficiently, but honestly I did not realize how critical branch prediction would be as a result. Thankfully, the various predictors truly are amazing! :shock:

This document (suggested by hmn here) is a very useful overview of branch prediction schemes. This document provides a little more perspective on their relative performance.

_________________
C74-6502 Website: https://c74project.com


 Post subject: Re: Pipelining the 6502
PostPosted: Thu Jan 02, 2020 8:59 pm 

Joined: Wed Mar 02, 2016 12:00 pm
Posts: 343
Drass wrote:
Thank you so much kakemoms, and Happy New Year to you as well.

I recall your project well, both in terms of the CPI and the multi-core approach. Very cool. It was necessary to extend the pipeline to deal with addressing modes efficiently, but honestly I did not realize how critical branch prediction would be as a result. Thankfully, the various predictors truly are amazing! :shock:

This document (suggested by hmn here) is a very useful overview of branch prediction schemes. This document provides a little more perspective on their relative performance.


Thank you for the documents Drass! They were really interesting as the prediction scheme has always been a black box for me.

I have started to analyze my previous effort, and it took some time to get back into it. One of the main things (which has been pointed out in this thread as well, I believe) is that memory bandwidth is essential. My core was running the memory at 400MiB/s, with the core itself at 100MHz and 1 CPI. With 2-byte instructions and 1-byte read-modify-write, that was plenty! Not very optimized though, since even simple instructions such as INX took the same time.


 Post subject: Re: Pipelining the 6502
PostPosted: Fri Jan 03, 2020 2:02 pm 

Joined: Sun Oct 18, 2015 11:02 pm
Posts: 428
Location: Toronto, ON
kakemoms wrote:
the prediction scheme has always been a black box for me.
I know what you mean. I had heard the term for years and it just seemed like black magic to me. It was a thrill to finally get a sense of what it entails.

Quote:
I have started to analyze my previous effort, and it took some time to get back into it.
Good news. There appeared to be some very interesting possibilities for a multi-core design (like neural-net evaluation! :shock: ). It frankly goes a bit over my head, but I'd love to learn more from a project like yours.

Quote:
One of the main things (which has been pointed out in this thread as well, I believe) is that memory bandwidth is essential. My core was running the memory at 400MiB/s, with the core itself at 100MHz and 1 CPI. With 2-byte instructions and 1-byte read-modify-write, that was plenty!
Yes, agreed. The penny dropped for me when reviewing a discussion regarding separate memories in this thread.

Quote:
Not very optimized though, since even simple instructions such as INX took the same time.
It appears your core is performing all the processing required for any given instruction in one cycle. If that’s true, even minimal pipelining should deliver significant gains. To that end, you might consider placing a single pipeline register somewhere along the datapath to store intermediate results. Registers at the inputs of the ALU might work well, for example, or at its outputs (or both). You may then be able to overlap the ALU execution time with the ALU input-fetch/select time and/or the flags-evaluation time to shrink the critical path (and increase the clock-rate).
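
To put rough numbers on the potential gain (all delays below are invented placeholders, in ns):
Code:
fetch_select = 4.0   # ALU input fetch/select
alu_op       = 6.0   # ALU operation
flags_eval   = 3.0   # flags evaluation

t_single = fetch_select + alu_op + flags_eval     # all in series: 13 ns
t_piped  = max(fetch_select, alu_op, flags_eval)  # stages overlap:  6 ns

print(1000 / t_single, 1000 / t_piped)  # ~77 MHz vs. ~167 MHz, ignoring
                                        # register setup/clock-to-q overhead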

_________________
C74-6502 Website: https://c74project.com


 Post subject: Re: Pipelining the 6502
PostPosted: Sun Jan 05, 2020 10:29 am 

Joined: Wed Mar 02, 2016 12:00 pm
Posts: 343
Drass wrote:
It appears your core is performing all the processing required for any given instruction in one cycle. If that’s true, even minimal pipelining should deliver significant gains. To that end, you might consider placing a single pipeline register somewhere along the datapath to store intermediate results. Registers at the inputs of the ALU might work well, for example, or at its outputs (or both). You may then be able to overlap the ALU execution time with the ALU input-fetch/select time and/or the flags-evaluation time to shrink the critical path (and increase the clock-rate).


Yes, it manages to do the actual opcode in about half a cycle. It starts at the positive CPU clock edge by sending the address to the memory, which half a cycle later returns the opcode. Then at the negative edge the next address is fetched, with the operand returned in the second half-cycle. In the third half-cycle, the address to which the operand points is read (done in parallel with reading the next opcode through the dual-port memory), and in the fourth half-cycle the opcode is actually carried out. The result of the opcode is written in the sixth half-cycle.
I can't really remember, but I think a taken branch requires an extra CPU cycle to fetch the opcode+operand at the new location. Or maybe it's handled differently than other opcodes.

