Quote:
The 6502 is, with respect to these, very simple. It has no true pipelining - i.e. there is no pre-fetching of the next instruction before the present one has finished executing.
Actually there is a little bit of true pipelining, and a lot of instructions do finish up while the next one is being fetched. An example given in WDC's programming manual (page 40 in my edition) is ADC#, which requires 5 distinct steps, but only two clocks' time:
Step 1: Fetch the instruction opcode ADC.
Step 2: Interpret the opcode to be ADC of a constant.
Step 3: Fetch the operand, the constant to be added.
Step 4: Add the constant to the accumulator contents.
Step 5: Store the result back to the accumulator.
Steps 2 and 3 both happen in a single clock. The processor fetches the next byte not knowing yet if it will need it or what it will be for. Steps 4 and 5 occur during the next instruction's step 1, eliminating the need for two more clocks. It cannot do steps 3 and 4 in one clock, because the memory being read may not have its data valid and stable until a small set-up time before phase 2 falls and the data actually gets taken into the processor; so step 4 cannot begin until step 3 is totally finished. But doing steps 2 and 3 simultaneously, and then doing steps 4 and 5 simultaneously with step 1 of the next instruction, makes the whole 5-step process appear to take only 2 clocks.
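To make the overlap easier to see, here is a rough sketch in Python that lays the five steps of several back-to-back ADC# instructions onto a clock timeline. The function name schedule and the table layout are made up for illustration; it only models the step placement described above, not the real internal timing of the chip.

```python
# Illustrative model only: places the 5 steps of an immediate-mode
# instruction (like ADC#) onto clocks, following the overlap described above.
def schedule(n_instructions):
    """Return {clock: [(instruction_index, step), ...]} for back-to-back ADC#."""
    timeline = {}
    for i in range(n_instructions):
        first = 2 * i + 1  # clock on which this instruction's opcode is fetched
        placement = {
            1: first,      # step 1: fetch opcode
            2: first + 1,  # step 2: interpret opcode
            3: first + 1,  # step 3: fetch operand (same clock as step 2)
            4: first + 2,  # step 4: add (overlaps the next opcode fetch)
            5: first + 2,  # step 5: store result in the accumulator
        }
        for step, clock in placement.items():
            timeline.setdefault(clock, []).append((i, step))
    return timeline


for clock, work in sorted(schedule(3).items()):
    print(clock, work)
```

In this model, clock 3 holds steps 4 and 5 of the first instruction together with step 1 of the second, so each additional instruction adds only 2 clocks to the total.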
Another part of the pipelining is the reason why operands are low-byte-first. The processor starts fetching the operand's low byte before the instruction decode has figured out how many bytes the instruction will have (1, 2, or 3). In the case of indexing before or without any indirection, the low byte needs to be added to the index register first anyway, so the 6502 gets that going before the high byte has finished arriving at the processor. In the case of something like LDA(abs), the first indirect address is fetched before the carry from the low-byte addition is added to the high byte. Then if it finds out there was no carry generated, it already has what it needs, and there's no need to add another cycle to read another address 256 bytes higher in the memory map. This way the whole 7-step instruction process requires only 4 clocks. (This is from the next page of the same programming manual.)
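The low-byte-first arithmetic can be sketched the same way. The example below assumes a plain absolute-indexed read such as LDA abs,X (4 clocks, plus 1 if a page boundary is crossed) rather than the indirect case; the function name indexed_read and its return values are hypothetical. What actually appears on the bus during the extra cycle differs between 6502 variants, so treat first_try as illustrative only.

```python
# Illustrative sketch of an absolute-indexed read such as LDA abs,X.
# Assumption: base timing of 4 clocks, plus 1 when the index addition
# carries out of the low byte (page crossing).
def indexed_read(lo, hi, index):
    """Return (first_try, address, cycles) for operand bytes lo/hi and an index."""
    total = lo + index
    new_lo = total & 0xFF
    carry = total >> 8  # 1 if the low-byte addition crossed a page boundary
    # Address formed before the carry has been added to the high byte.
    first_try = (hi << 8) | new_lo
    # Final address once the carry has been added to the high byte.
    address = (((hi + carry) & 0xFF) << 8) | new_lo
    cycles = 4 + carry  # the extra clock is needed only when there was a carry
    return first_try, address, cycles


print([hex(v) for v in indexed_read(0xF0, 0x12, 0x05)[:2]], indexed_read(0xF0, 0x12, 0x05)[2])
print([hex(v) for v in indexed_read(0xF0, 0x12, 0x20)[:2]], indexed_read(0xF0, 0x12, 0x20)[2])
```

With no carry, the first address tried is already the right one and no extra clock is spent; with a carry, the real address is 256 bytes higher and one more clock is needed.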