On another note, I mentioned earlier that it may be possible to optimize JMP <abs> instructions out of the pipeline using zero-cycle jumps. This is a relatively straightforward change whereby the btb entry for a JMP is applied to the instruction preceding it, rather than to the JMP itself. The CPU will then skip the JMP in future encounters and jump from the preceding instruction directly to the target address of the JMP. (Recall that the Branch Target Buffer stores the address of the Next PC for the Fetch stage.) The JMP instruction itself is never executed, so the jump completes in zero cycles.
There is one caveat to this mechanism. An unconditional JMP btb entry cannot be applied to the preceding instruction if that instruction requires a btb entry of its own (i.e., the instruction is a branch, an RTS, or an indirect jump). It turns out that about half of the executed JMP instructions in the figFORTH code originate from a single JMP -- the one in the NEXT routine. As (bad) luck would have it, this JMP is preceded by a conditional branch (the BCC that Dr Jefyll refers to above). This JMP will therefore be optimized out only about once every 128 iterations, when the preceding branch is not taken.
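To make the mechanism concrete, here is a hedged sketch in Python of how the btb training might work. This is an illustrative model only, not the actual simulator source; the names (BTB, train_jmp, note_control_flow) are my own assumptions.

```python
# Toy model of zero-cycle jump training in a Branch Target Buffer (BTB).
# Illustrative only -- not the actual simulator implementation.

class BTB:
    def __init__(self):
        self.entries = {}   # fetch PC -> Next PC for the Fetch stage
        self.owned = set()  # PCs that need a btb entry of their own
                            # (branches, RTS, indirect jumps)

    def note_control_flow(self, pc):
        """Mark an instruction (branch/RTS/indirect jump) as owning its btb slot."""
        self.owned.add(pc)

    def train_jmp(self, jmp_pc, target, prev_pc):
        """On seeing an unconditional JMP <abs>, try to attach its target to the
        *preceding* instruction so future fetches skip the JMP entirely."""
        if prev_pc in self.owned:
            # Caveat: the preceding instruction needs its own entry (e.g. the
            # BCC before the JMP in NEXT), so the JMP cannot be skipped here.
            self.entries[jmp_pc] = target   # fall back: redirect at the JMP itself
            return False
        self.entries[prev_pc] = target      # zero-cycle: prev_pc jumps straight to target
        return True

    def next_pc(self, pc, fallthrough):
        return self.entries.get(pc, fallthrough)


btb = BTB()
# Ordinary case: preceding instruction is a plain INX -> JMP gets skipped.
btb.train_jmp(jmp_pc=0x0246, target=0x0244, prev_pc=0x0245)
print(hex(btb.next_pc(0x0245, 0x0246)))   # fetch redirects from the INX itself

# Caveat case: preceding instruction is a branch that owns its btb slot.
btb.note_control_flow(0x0300)
print(btb.train_jmp(jmp_pc=0x0302, target=0x0244, prev_pc=0x0300))  # False
```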
Even so, the change will reliably optimize the other half of the JMPs, which is beneficial. Let's look at the impact on the Prime Sieve run:
Code:
fig-FORTH 1.0
100 sieve Primes: 2 3 5 7 11 13 17 19 23 29 31 37 41 43 47 53 59 61 67 71 73 79 83 89 97 OK
------------ Run Stats ------------
Dual Port Caches: True
btb Enabled: True
rtb Enabled: True
ind Enabled: True
jmp zero Enabled: True
Branch Prediction: TOUR 2-bit
Run: ForthPrimeJMP.65b
- - - - - - - - - - - - - - - - - -
Cycles: 290000
NMOS Cycles: 808472
- - - - - - - - - - - - - - - - - -
Instructions: 236473 (82%)
Data Stalls: 27748 (10%)
Control Stalls: 5532 (2%)
Delay Slots: 20112 (7%)
Generated: 136 (0%)
Total: 290000 (100%)
- - - - - - - - - - - - - - - - - -
Data Fwd: 129143 (45%)
Operand Fwd: 6529 (2%)
- - - - - - - - - - - - - - - - - -
JMP Executed: 8885 (4%)
JMP Skipped: 8747 (4%) <-- note the "skipped" jumps here
- - - - - - - - - - - - - - - - - -
Branches: 21874 (9%)
Taken: 20548 (94%)
Mispredictions: 1383 (6%)
- - - - - - - - - - - - - - - - - -
bCache Access: 3365 (1%)
aCache Conflicts: 324 (0%)
----------- Performance -----------
CPI: 1.18
NMOS CPI: 3.30
Boost Factor: 2.8x
As we can see, just about half of the JMP instructions were successfully "skipped" in this run. These skipped JMPs have legitimately completed, though, so their count is added to the total of completed instructions when calculating CPI. As a result, CPI drops from 1.22 in the previous run to 1.18 in this run -- a meaningful improvement.
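The arithmetic behind that CPI figure can be checked directly from the stats above (assuming, as described, that skipped JMPs are credited as completed instructions):

```python
# CPI from the run stats above, crediting skipped JMPs as completed instructions.
cycles       = 290_000
instructions = 236_473
jmp_skipped  = 8_747

cpi = cycles / (instructions + jmp_skipped)
print(f"CPI: {cpi:.2f}")                      # -> CPI: 1.18

# Without crediting the skipped JMPs, this run's figure would look worse:
print(f"CPI (uncredited): {cycles / instructions:.2f}")   # -> CPI (uncredited): 1.23
```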
Interestingly, we also see that Data Stalls have increased slightly, from 9% to 10%. This is because some of the eliminated jumps sat between instructions that share a mutual dependency. With the jumps gone, those instructions now execute back-to-back in the code stream and trigger a stall as a result. In effect, the JMP is immediately replaced by a Data Stall in these instances. Thankfully, this happens only rarely.
For a bit of fun, it's pretty magical to see the skipped JMPs disappear from the pipeline. Note the jmp $0244 in the cycle below (between the inx and ldy #1 instructions):
Code:
------------- 962 --------------
WB: adc $03,x
EX: sta $03,x
DA: inx
IX: inx
AD: jmp $0244
DE: ldy #$1
FE: $248: 85 b2 88 b1 - a0 1 b1 ae
And now, we run the same test with zero-cycle jumps enabled, and ... Shazzam!
Code:
------------- 946 --------------
WB: adc $03,x
EX: sta $03,x
DA: inx
IX: inx
AD: ldy #$1
DE: lda ($ae),y
FE: $24c: 85 b2 88 b1 - ae 85 b1 18
... the jmp is gone (note that inx is now followed immediately by ldy #1). Neat.
Cheers for now,
Drass