If by "LDA 0200, FF" you mean that X holds $FF and you're doing LDA $0200,X, I think the idea is that it simply does not
add the extra cycle if the indexing did not make it cross a page boundary.
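As a rough illustration of that rule, here is a small Python sketch. The function names are made up for this post, and the base count of 4 cycles for LDA abs,X is the usual datasheet figure; treat it as a model of the timing rule, not of the actual silicon.

```python
def crosses_page(base: int, index: int) -> bool:
    """True if adding the index to the low address byte carries into the high byte."""
    return ((base & 0xFF) + (index & 0xFF)) > 0xFF


def lda_abs_x_cycles(base: int, x: int) -> int:
    """Cycles for LDA abs,X: 4, plus 1 only when the indexing crosses a page."""
    return 4 + (1 if crosses_page(base, x) else 0)


print(lda_abs_x_cycles(0x0200, 0xFF))  # $0200 + $FF = $02FF, same page
print(lda_abs_x_cycles(0x0201, 0xFF))  # $0201 + $FF = $0300, next page
```

So with $FF in X, LDA $0200,X stays within page $02 and pays no penalty, while LDA $0201,X spills into page $03 and costs one more cycle.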
From an older post: The 6502 does however have minor pipelining and it has many cycles where more than one operation takes place. An example given in WDC's programming manual (page 40 in my edition) is ADC#, which requires 5 distinct steps, but only two clocks' time:
Step 1: Fetch the instruction opcode ADC.
Step 2: Interpret the opcode to be ADC of a constant.
Step 3: Fetch the operand, the constant to be added.
Step 4: Add the constant to the accumulator contents.
Step 5: Store the result back to the accumulator.
Steps 2 and 3 both happen in a single clock. The processor fetches the next byte not knowing yet if it will need it or what it will be for. Steps 4 and 5 occur during the next instruction's step 1, eliminating the need for two more clocks. It cannot do steps 3 and 4 in one clock because the memory being read may not have the data valid and stable any more than a small set-up time before phase 2 falls and the data actually gets taken into the processor; so step 4 cannot begin until after step 3 is totally finished. But doing 2 and 3 simultaneously, and then doing 4 and 5 simultaneously with step 1 of the next instruction, makes the whole 5-step process appear to take only 2 clocks.
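The overlap described above can be tabulated in a few lines of Python. This is only a bookkeeping sketch: the clock numbers are my own labelling of the five steps (clock 3 being the clock in which the next opcode is fetched), and the names are invented for illustration.

```python
# (clock, step) pairs for ADC #immediate, following the five steps above.
ADC_IMM_STEPS = [
    (1, "fetch the opcode ADC"),
    (2, "interpret the opcode"),
    (2, "fetch the operand"),
    (3, "add the operand to the accumulator"),
    (3, "store the result back to the accumulator"),
]


def clocks_charged(steps, overlapped_clock):
    """Count the clocks an instruction appears to take, leaving out the
    clock that is shared with the next instruction's opcode fetch."""
    return len({clock for clock, _ in steps if clock != overlapped_clock})


print(len(ADC_IMM_STEPS))                # number of distinct steps
print(clocks_charged(ADC_IMM_STEPS, 3))  # clocks that ADC # is charged for
```

Five steps, but only clocks 1 and 2 are charged to ADC #, because clock 3 belongs to the next instruction's opcode fetch.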
Another part of the pipelining is the reason why operands are low-byte-first. The processor starts fetching the operand's low byte before the instruction decode has figured out how many bytes the instruction will have (1, 2, or 3). In the case of indexing before or without any indirection, the low byte needs to be added to the index register first anyway, so the 6502 gets that going before the high byte has finished arriving at the processor. In the case of something like LDA(abs), the first indirect address is fetched before the carry from the low-byte addition is added to the high byte. Then if it finds out there was no carry generated, it already has what it needs, and there's no need to add another cycle to read another address 256 bytes higher in the memory map. This way the whole 7-step instruction process requires only 4 clocks. (This is from the next page of the same programming manual.)
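To make the low-byte-first point concrete, here is a hedged Python sketch of the bus reads for an absolute,X load. It follows the commonly documented NMOS 6502 behaviour, where the first data read uses the high byte before any carry has been applied; other variants may put a different address on the bus during the extra cycle. The function name and the `pc` parameter are assumptions made for this example.

```python
def lda_abs_x_bus_trace(pc: int, base: int, x: int):
    """Return a list of (address, description) reads for LDA base,X."""
    lo, hi = base & 0xFF, (base >> 8) & 0xFF
    total = lo + (x & 0xFF)            # low byte + X, done while the high byte arrives
    carry = total > 0xFF
    uncorrected = (hi << 8) | (total & 0xFF)
    trace = [
        (pc & 0xFFFF, "fetch opcode"),
        ((pc + 1) & 0xFFFF, "fetch address low byte"),
        ((pc + 2) & 0xFFFF, "fetch address high byte"),
        (uncorrected, "read using high byte without the carry"),
    ]
    if carry:
        # The first read was one page too low, so read again 256 bytes higher.
        trace.append(((uncorrected + 0x100) & 0xFFFF, "read again with carry applied"))
    return trace


for addr, what in lda_abs_x_bus_trace(0x8000, 0x0201, 0xFF):
    print(f"${addr:04X}  {what}")
```

When no carry comes out of the low-byte addition, the fourth read is already the right one and the instruction ends there; the fifth read only appears when a page boundary is crossed.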
The 2-clock NOP has to do with the 2-clock minimum for the instruction step counter. I don't know why that is, but maybe the
visual6502 website would help. Commodore had a patent on a process they used in the 65CE02
that allowed them to eliminate virtually all the dead bus cycles and make at least 30 op codes take just one clock. There were not very many 65CE02s made, and the '816 was a better upgrade to the 65C02 than the CE02 was, so I'm glad to have that.