I might get flamed for this but it is not the intention:
- immediate: Use 8 bit operand from instruction stream. One step. Simple.
- absolute: Read 16 bit operand from instruction stream. Read or write 8 bit value at literal 16 bit address. Two steps. Simple.
- (zp),Y: Read 8 bit operand from instruction stream. Read 16 bit address from two sequential locations in zero page while adding and carrying RegY. Read or write 8 bit value at computed 16 bit address. Three steps. Complex.
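For anyone who prefers to see the step counts as code, here is a rough Python sketch of the three modes above. It is a toy model, not an emulator: the function names, the flat `bytearray` memory and the addresses are my own assumptions for illustration, and it ignores cycle timing entirely.

```python
# Toy model of three 6502 addressing modes. Not a full emulator.
# All names and addresses here are illustrative assumptions.

def immediate(mem, pc):
    """One step: the operand byte in the instruction stream is the value."""
    return mem[pc]

def absolute(mem, pc):
    """Two steps: fetch a little-endian 16 bit address, then read from it."""
    addr = mem[pc] | (mem[pc + 1] << 8)
    return mem[addr]

def indirect_indexed(mem, pc, y):
    """Three steps: fetch zp operand, fetch pointer from zero page, read at pointer + Y."""
    zp = mem[pc]
    base = mem[zp] | (mem[(zp + 1) & 0xFF] << 8)  # pointer fetch stays inside zero page
    addr = (base + y) & 0xFFFF                    # carry from the low byte reaches the high byte
    return mem[addr]

mem = bytearray(65536)
mem[0x0600] = 0x10   # operand byte (also used as the zero page address)
mem[0x0601] = 0x30   # high byte of the absolute address 0x3010
mem[0x3010] = 0xAA   # value behind the absolute address
mem[0x0010] = 0xF0   # zero page pointer, low byte
mem[0x0011] = 0x20   # zero page pointer, high byte -> 0x20F0
mem[0x2110] = 0x55   # 0x20F0 + 0x20 crosses a page boundary

imm_value = immediate(mem, 0x0600)
abs_value = absolute(mem, 0x0600)
ind_value = indirect_indexed(mem, 0x0600, 0x20)
```

The third function is the only one where the address is computed from data rather than read straight from the instruction stream, which is what makes it useful for pointers.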
Simple addressing modes are sufficient to do microcontroller stuff like respond to a keypad. Complex addressing modes are sufficient to do computer science data structure stuff, like binary trees and iterating through a linked list of strings. Some CISC processors allow five or more steps. Some RISC processors only allow one step. If index registers are large enough to hold a full address pointer then one step may be sufficient, however two steps is often faster and more convenient.
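To make the linked list case concrete, here is a hedged Python sketch of a list walk that only touches memory through a zero page pointer plus a Y offset, which is roughly how it would be done with (zp),Y. The node layout (two byte next pointer followed by one payload byte, null pointer as terminator), the zero page location 0xFB and the helper names are assumptions I made up for the example.

```python
# Sketch of walking a linked list through a zero page pointer.
# Assumed node layout: bytes 0-1 = next pointer (little-endian, 0 = end), byte 2 = payload.

def read_ptr_y(mem, zp, y):
    """Model of LDA (zp),Y: read the byte at (16 bit pointer stored at zp) + y."""
    base = mem[zp] | (mem[(zp + 1) & 0xFF] << 8)
    return mem[(base + y) & 0xFFFF]

def walk_list(mem, zp, head):
    """Collect payload bytes, following next pointers until a null pointer."""
    mem[zp] = head & 0xFF
    mem[zp + 1] = head >> 8
    payloads = []
    while mem[zp] | mem[zp + 1]:          # both bytes zero means end of list
        next_lo = read_ptr_y(mem, zp, 0)
        next_hi = read_ptr_y(mem, zp, 1)
        payloads.append(read_ptr_y(mem, zp, 2))
        mem[zp], mem[zp + 1] = next_lo, next_hi
    return payloads

def put_node(mem, addr, next_addr, payload):
    """Helper for the demo only: write one node into memory."""
    mem[addr] = next_addr & 0xFF
    mem[addr + 1] = next_addr >> 8
    mem[addr + 2] = payload

mem = bytearray(65536)
put_node(mem, 0x3000, 0x4100, 1)
put_node(mem, 0x4100, 0x2050, 2)
put_node(mem, 0x2050, 0x0000, 3)

result = walk_list(mem, 0xFB, 0x3000)
```

With only immediate and absolute modes, the same walk would need self-modifying code to patch the address into the instruction stream.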
If you aren't familiar with 6502, zero page indirection may seem counter-intuitive. With the very limited transistor budget, six bytes of processor state wasn't generous but it provided a good hedge between speed and flexibility. Compare 6502 to processors with similar transistor budgets, such as RCA1802 or TMS9900. RCA1802 has 16 internal registers which are each 16 bits, plus flags. It has little budget left for ALU functionality and everything takes a multiple of four clock cycles. At the other extreme, TMS9900 has a 16 bit workspace pointer which allows a stack window of similar size. It has sufficient budget for multiplication and division. However, all ALU operations require two or three memory cycles because, in the minimal implementation, all useful state is external to the processor. With 10x transistors, RCA1802 and TMS9900 successors would have been formidable. See Harris RTX2000 for an example of what would have been possible. However, before that happened, Commodore could sell a 6502 computer for less money than RCA or Texas Instruments could make a computer.
The fun part comes with the cross-cutting concern of interrupts. It is too easy to add registers to a processor design - until you want interrupts. At this point, you may have too many registers. Again, 6502 is a good hedge between speed and flexibility. TMS9900 is extremely fast at processing interrupts because it only has to dump program counter, workspace pointer and flags. RCA1802 is extremely slow. 6502 is more similar to TMS9900 because it only dumps program counter and flags. Indeed, it only does the minimum because anything more would have required more micro-code. The remainder has to be saved manually. Counter-intuitively, it is possible to have an interrupt which doesn't use any registers, as our most musical Paganini recently attempted. On 14MHz 6502, INC zp // RTI can be called 300000 times per second while the processor does other work.
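A back-of-envelope check of that figure, as a small Python sketch. It assumes the usual documented cycle counts (7 cycles for interrupt entry, 5 for INC zp, 6 for RTI) and ignores the wait for the current instruction to finish before the interrupt is taken, so treat the result as a lower bound on the cost. The constant and function names are mine.

```python
# Rough interrupt budget for an INC zp // RTI handler.
# Cycle counts are the commonly documented 6502 values; latency jitter is ignored.

CLOCK_HZ = 14_000_000
IRQ_ENTRY_CYCLES = 7   # push PC and flags, fetch vector
INC_ZP_CYCLES = 5      # read-modify-write on a zero page byte
RTI_CYCLES = 6         # pull flags and PC

cycles_per_irq = IRQ_ENTRY_CYCLES + INC_ZP_CYCLES + RTI_CYCLES

def irq_load(rate_hz, clock_hz=CLOCK_HZ):
    """Fraction of all clock cycles spent servicing interrupts at the given rate."""
    return rate_hz * cycles_per_irq / clock_hz

load = irq_load(300_000)
print(f"{cycles_per_irq} cycles per interrupt, {load:.1%} of the processor")
```

Under these assumptions the handler costs 18 cycles per call, so 300000 calls per second leave well over half of the 14MHz clock for the foreground program.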
When 6502 executes an opcode with (zp),Y address mode, the performance penalty is worse than TMS9900. However, a good programmer minimizes use of this addressing mode. In a different case, where RegX is used as a loop counter, DEX takes two cycles on 6502 and 65816, and one cycle on 65CE02. Overall, a good programmer will ensure that execution averages four cycles per instruction. With the asymmetry of the 6502's RegX and RegY addressing modes, 6502 can source more data per cycle than many contemporary processors. In many cases, the volume of data processed with 8 bit 6502 exceeds throughput of 16 bit designs. In a minority of cases, 65816 with 15 bytes of visible processor state may exceed 68000 with 74 bytes of visible processor state.