As an aside, I ran across
this technical article describing the Intel 486 as
"Executing Instructions in One Clock Cycle". I was interested immediately. There are some very intriguing takeaways here, and it's very cool to see how the major enhancements over the 386 were presented at the time:
“We revamped the prefetch, instruction decode, and control logic to optimally integrate the cache into the instruction execution pipeline. This critical improvement allowed the i486 CPU to execute the most frequent, simplest instructions in one clock cycle.”
The i486 is a classic example of pipelining a CISC instruction set. It featured a five-stage pipeline as follows:
- Fetch -- retrieves 16 bytes from the cache into a pre-fetch buffer
- D1 -- decodes the opcode (and operands, i.e. "directs the instruction aligner/ prefetch queue to step to the next instruction")
- D2 -- "computes memory addresses"
- EX -- performs the ALU operation or memory access
- WB -- Writes to the register file
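The flow through these five stages can be sketched with a toy cycle-by-cycle trace. This is a minimal illustration, not a model of the actual i486 hardware: it assumes one instruction enters the pipeline per cycle with no stalls or hazards.

```python
# Hypothetical sketch: ideal flow through the i486's five stages,
# one instruction entering per cycle, no stalls or hazards.
STAGES = ["FE", "D1", "D2", "EX", "WB"]

def pipeline_trace(n_instructions, n_cycles):
    """Return, for each cycle, the list of (instruction, stage) pairs in flight."""
    trace = []
    for cycle in range(n_cycles):
        active = []
        for i in range(n_instructions):
            stage = cycle - i  # instruction i enters FE at cycle i
            if 0 <= stage < len(STAGES):
                active.append((i, STAGES[stage]))
        trace.append(active)
    return trace

for cycle, active in enumerate(pipeline_trace(3, 7)):
    print(cycle, active)
```

Once the pipeline fills, one instruction retires from WB every cycle, which is exactly the "one clock cycle" claim from the article: the latency of any single instruction is still five stages, but the throughput is one per clock.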
The D2 stage uses two cycles for indexed-addressing operations, which is analogous to the IX stage in the P7 pipeline simulation. Effectively, the i486 stalls the pipeline to apply an index to a base address, instead of adding a separate pipeline stage to do the job. The stall hurts CPI, but the resulting shorter pipeline also reduces the branch mis-prediction penalty. If we're mostly executing non-indexed operations, adding a cycle is the obvious choice. But in the general case, the tradeoff is between the frequency of these more complex instructions (which trigger nops from stalling the pipeline) vs. the quality of branch prediction (which triggers nops on mis-predictions). A longer pipeline that can handle indexed addressing in one pass will also require stronger branch prediction.
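That tradeoff is easy to put into back-of-the-envelope numbers. All the frequencies below are made-up illustrative values, not measurements of the i486 or any real workload:

```python
# Back-of-the-envelope CPI comparison. Every frequency here is an
# assumed illustrative number, not a measurement of real hardware.
def cpi(base, stall_events):
    """Base CPI plus sum of (frequency * penalty) over each stall source."""
    return base + sum(freq * penalty for freq, penalty in stall_events)

indexed_freq = 0.15   # assumed fraction of indexed-addressing ops
branch_freq  = 0.20   # assumed fraction of conditional branches
taken_rate   = 0.60   # assumed fraction of branches that are taken
mispredict   = 0.10   # assumed miss rate of a real branch predictor

# Short i486-style pipeline: +1 cycle per indexed op, and (with its
# implicit "not-taken" prediction) a 2-cycle penalty on every taken branch.
short = cpi(1.0, [(indexed_freq, 1), (branch_freq * taken_rate, 2)])

# Longer pipeline with a dedicated index stage: indexed ops are free,
# but the flush penalty grows (say 4 cycles) and pays only on mis-predictions.
longer = cpi(1.0, [(branch_freq * mispredict, 4)])

print(f"short={short:.2f} longer={longer:.2f}")
```

With these particular (invented) numbers the longer pipeline wins; crank up the mis-prediction rate or dial down the indexed-op frequency and the balance shifts, which is the whole design tension in a nutshell.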
In this regard, the i486 makes a very clear choice -- it uses trivial branch prediction and focuses instead on minimizing the mis-prediction penalty. Hence the shorter pipeline, but also an intriguing pre-fetch scheme in the EX stage. The FE stage makes an implicit "not-taken" assumption for all conditional branches and continues fetching the next sequential instruction as usual. Then, a speculative fetch is also performed while the branch is being evaluated in the EX stage. This time the branch target instruction is fetched. If the branch result turns out to be "taken", then the pipeline is flushed, and the target instruction is sent directly to the D1 stage for execution.
This scheme has the effect of reducing the branch mis-prediction penalty by two cycles -- one cycle is eliminated by the pre-fetch itself, and the other by the fact that the fetch is performed in the EX stage. This means that there are only three instructions flushed in the pipeline, rather than four. The net result is a two-cycle penalty for every "taken" branch:
“the CPU runs a speculative fetch cycle to the target of the jump during the Ex stage of the jump instruction. ... If the CPU evaluates the condition as true, the fetch to the target address refills the pipeline. ... In summary, conditional jumps take one clock cycle for a jump that is not taken, and three cycles for a taken jump.”
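The cycle accounting works out as follows. The 4-cycle "naive" penalty is my own reading of what the scheme would cost without the EX-stage fetch (implied by the article's claim of a two-cycle saving); the 3-cycle taken-jump figure is straight from the quote:

```python
# Cycle accounting for conditional jumps on the i486.
# Stages: FE -> D1 -> D2 -> EX -> WB; the branch resolves in EX.
not_taken = 1        # predict-not-taken: the fall-through path streams on

# Taken, WITHOUT the EX-stage trick (my inferred baseline): the target
# fetch could only start after the branch resolves, so the target walks
# FE and D1 before reaching D2 -- a 4-cycle penalty on top of the
# branch's own cycle.
taken_naive = 1 + 4

# Taken, WITH the speculative fetch: the target is fetched during the
# branch's EX cycle and enters D1 directly, saving two cycles.
taken_i486 = 1 + 2

print(not_taken, taken_naive, taken_i486)
```

The `taken_i486` total of three cycles matches the article's "three cycles for a taken jump".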
The EX-stage pre-fetch is analogous to the speculative pipeline Arne suggested, but with only one stage! Very cool. Other key aspects of the design include:
- "No Data Delay", which is equivalent to "data-forwarding" in the P7 simulation -- "... because the CPU contains the necessary bypass paths, it can write this data to the register file and forward it to the ALU at the same time".
- "Pointer Load Delay", which is effectively a RAW hazard stall due to the short pipeline -- "if a value is loaded from memory into a register, and that register is then used as a base register in the next instruction, the CPU will stall for one cycle". The P7 pipeline has a dedicated AD stage to avoid this stall.
- "Unified Write-Through Cache", which according to the article also means that "no special logic is required to handle self-modifying code". I don't quite see how a unified cache helps with code that modifies instructions that are already loaded in the pre-fetch buffer or elsewhere in the pipeline -- but maybe I'm missing something?
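The first two bullets can be captured in a toy hazard check: an ALU result is forwarded for free through the bypass paths, while a value loaded from memory and immediately used as a base register costs a one-cycle stall. The instruction encoding here is entirely invented for illustration:

```python
# Toy hazard check in the spirit of "no data delay" vs. "pointer load
# delay". Instructions are plain dicts; the format is invented.
def stall_cycles(producer, consumer):
    """Cycles the consumer must stall behind the producer, if any."""
    if producer["dest"] not in consumer["srcs"]:
        return 0       # no RAW dependence at all
    if producer["op"] == "load" and consumer.get("addr_base") == producer["dest"]:
        return 1       # pointer load delay: loaded base used for addressing
    return 0           # ALU result: the bypass path forwards it for free

add   = {"op": "alu",  "dest": "eax", "srcs": []}
use   = {"op": "alu",  "dest": "ebx", "srcs": ["eax"]}
load  = {"op": "load", "dest": "esi", "srcs": []}
deref = {"op": "load", "dest": "ecx", "srcs": ["esi"], "addr_base": "esi"}

print(stall_cycles(add, use))     # forwarded through the bypass: 0
print(stall_cycles(load, deref))  # pointer load delay: 1
```

The P7 pipeline's dedicated AD stage dissolves the second case by separating address generation from execution, at the cost of one more stage to flush on a mis-predicted branch.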
So there is a lot here that resonates. The article does not quote any CPI figures, so it's difficult to assess performance in those terms. But still, what emerges is an approach which emphasizes simplicity and economy, at the expense of tolerating stalls from more complex instructions. We get a shorter pipeline with no branch prediction to speak of. That means no BTB, no RTB, no PHT, fewer and narrower pipeline registers, and a small unified on-chip write-through cache. And then there is that ingenious speculative fetch ... wow, viewed from this vantage point you can really admire this design.
So interesting!