Google's translation of Beregnyei Balazs' 6502 reverse-engineering site isn't excellent, but you can make out some of the story about the T-state shift register
here - including the RDY signal acting as a multiplexor, to stop or allow the shift. Oddly, the RDY input pad is missing - the schematic isn't complete.
(I should say, the T-states are labels for full clock cycles, there is indeed no internal fast clock as with the early x86.)
If you want to know more about MOS technology, and how transistors are used to make up logic gates, see
here
Although it's said (repeatedly and emphatically) that the 6502 isn't micro-coded, it is notable that the bulk of the control complexity is hidden in the large PLA, which is a regular structure, and which takes six of the T-state signals as inputs. So, not an addressable ROM, and not built by writing microcode, but nonetheless a structured approach which must have helped to get the product to market.
As for the T-states, I'll see if I can sketch an operation like your
Code:
1000 LDA $4321
as accurately as I can - some guesswork involved!
Code:
Cycle 0, start: PC value of 1000 placed on address bus, Sync high, RnW high.
Cycle 0, middle: computing PC=PC+1, possibly committing results of previous opcode to registers.
Cycle 0, end: capture data bus (opcode) into internal PreDecode Register
Cycle 1, start: PC value of 1001 placed on address bus, Sync low, RnW high, PreDecode Register transferred to Intruction Register
Cycle 1, middle: computing PC=PC+1, deriving control signals for next cycle
Cycle 1, end: capture data bus (low address operand) into internal Input Data Latch
Cycle 2, start: PC value of 1002 placed on address bus, Sync low, RnW high, Input Data Latch transferred to B input register
Cycle 2, middle: computing PC=PC+1, deriving control signals for next cycle, add 0 to B register
Cycle 2, end: capture data bus (high address operand) into internal Input Data Latch, capture adder output (low address operand) into Adder Hold Register
Cycle 3, start: Input Data Latch value placed on high byte of address bus, Adder Hold Register placed on low byte of address bus, Sync low, RnW high
Cycle 3, middle: deriving control signals for next cycle, PC not incremented
Cycle 3, end: capture data bus (content of $4321) into internal Input Data Latch
Cycle 4, start: PC value of 1003 placed on address bus, Sync high, RnW high, Input Data Latch transferred to Accumulator
Note that the final action of the opcode occurs during the first cycle of the next instruction. The control signals for this action were set up in the previous cycle, and this action will occur even if (for example) an interrupt is taken.
Note that cycle 1 must always do the same work: there has not yet been time to act on the instruction fetch. This is why the second cycle always reads the byte after the opcode, and why NOP takes two cycles. An instruction which takes only two cycles (like, say, DEX) will use the ALU and commit the result in the third cycle, which overlaps with the fetch of the next instruction.
(I'm not quite sure what happens with PC. Here's a guess: at the end of Cycle1 of a NOP, although PC=PC+1 has been computed for the second time, it won't be captured, and so the start of Cycle2 will present the same singly-incremented value of PC onto the bus, to fetch the next instruction. The schematic shows 4 control signals for each of PC high and low, and the x3 signal might be used to accept or ignore the incremented value.)