Ok, I think I may understand what's happening here.
My design is simpler but does more work in each cycle, and therefore incurs a greater propagation delay per step. The 6502, on the other hand, cleverly executes several threads in parallel, using additional registers as intermediate stages, and gets it all done in the same number of steps (with each step being faster).
INX provides a good illustration. It takes two cycles on each architecture. My microcode is as follows:
Step 1) IR := *PC; PC += 1 - fetch opcode, increment PC
Step 2) X := X + 1 - round trip through the ALU, incrementing X and storing the result back in X
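To make the two steps above concrete, here is a rough Python sketch of that microcode. It is only a behavioural model, not the actual hardware: the function name run_inx, the dictionary used as memory, the 16-bit PC and 8-bit X widths, and the use of the 6502's INX opcode value ($E8) are all assumptions made for illustration.

```python
# Hypothetical behavioural model of the two-step INX microcode.
# Assumptions: 16-bit PC, 8-bit X, and the 6502 opcode value for INX.
INX = 0xE8

def run_inx(memory, pc, x):
    """Run one instruction and return (ir, pc, x)."""
    # Step 1: IR := *PC; PC += 1 (fetch opcode, increment PC)
    ir = memory[pc]
    pc = (pc + 1) & 0xFFFF
    # Step 2: X := X + 1 (round trip through the ALU, 8-bit wraparound)
    if ir == INX:
        x = (x + 1) & 0xFF
    return ir, pc, x

ir, pc, x = run_inx({0x0200: INX}, pc=0x0200, x=0xFF)
print(hex(ir), hex(pc), hex(x))
```

Each step here does all of its work before the next one starts, which is the point: nothing overlaps, so the whole fetch or the whole ALU round trip has to fit inside one clock period.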
The 6502 on the other hand performs several things in parallel:
Step 1) XSB, SBADD - move X to the ALU; PCLADL, PCHADH - move PC to address bus; ADLPCL, ADHPCH - move PC to inc circuit
Step 2) ADDSB, SBX - store X+1 back into X; PCLADL, PCHADH - set up the next read; IPC - increment PC; fetch opcode from the address set up in step 1
The tradeoff, I think, is that the 6502 will manage a faster clock rate (since it is doing less for each thread in a given cycle) at the expense of more registers (which it requires as intermediate stages). By comparison, my design seems slow and simplistic, forcing signals to travel through much more circuitry on every cycle. And my design actually requires quite a few registers itself (12 vs. the 6502's 17) and also quite a few bus buffers to boot.
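The clock-rate argument can be sketched as simple arithmetic: the clock period has to cover the slowest series path in any step, and with both designs taking two steps for INX, the instruction time is just two periods. The delay numbers below are invented placeholders, not measurements of either design, and min_clock_period is a hypothetical helper name.

```python
def min_clock_period(steps):
    """steps: list of steps; each step is a list of parallel paths;
    each path is a list of series delays in ns.
    The clock must be at least as long as the slowest single path."""
    return max(sum(path) for step in steps for path in step)

# Placeholder delays in ns, made up purely for illustration.
# Simple design: one long path per step.
simple = [[[30, 10]], [[10, 40, 10]]]
# Overlapped design: several shorter paths running side by side in each step.
overlapped = [[[10, 10], [30]], [[40], [30]]]

for name, steps in (("simple", simple), ("overlapped", overlapped)):
    period = min_clock_period(steps)
    print(name, "period:", period, "ns; INX time:", len(steps) * period, "ns")
```

With these made-up numbers the overlapped design gets a shorter period for the same step count, at the cost of the extra latches needed to hold values between steps. Real figures would depend on the logic family and the actual paths.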
At least this is what I think is going on ... I'm not actually sure if the signals in the 6502 are doing what I think they are doing. Perhaps someone can comment here.
Either way, I can safely say I am more impressed than ever with the 6502. When I get a chance, I will take a look at the second trace BigEd provided, as I'm very curious about the sequencing of the indirect/indexed addressing mode. In the meantime, I will press on with my simpler design.
Maybe I can gain some speed just by moving to a faster TTL logic family.