As said before, the original 6502 used latches, but modern cores use edge-triggered flip-flops. My core also uses only positive-edge-triggered flip-flops, yet it still takes the same number of cycles per instruction as the original NMOS 6502.
The trick is that the total instruction fetch and execution of an ADC takes more than 2 cycles, but it overlaps with the previous and following instructions, so the effective delay is only 2 cycles. Below you can see the timing diagram of the execution of this bit of code on my core. It should be instructive to compare this to the same instructions executed on visual6502.
Code:
FD2D 69 1F ADC #$1F
FD2F 30 02 BMI $FD33
There's a total of 5 cycles from start to finish, but one overlaps with the previous instruction and two overlap with the next instruction, so only 2 are counted towards the ADC instruction time.
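To make that accounting explicit, here is a minimal sketch of the arithmetic. The function name and its parameters are made up for illustration and are not part of the core.

```python
# Hypothetical bookkeeping helper, not taken from the core's source.
def effective_cycles(total, overlap_prev, overlap_next):
    """Cycles charged to an instruction after removing the cycles it
    shares with its neighbours."""
    return total - overlap_prev - overlap_next

# ADC #$1F: 5 cycles start to finish, 1 shared with the previous
# instruction, 2 shared with the following BMI.
adc_cycles = effective_cycles(total=5, overlap_prev=1, overlap_next=2)
print(adc_cycles)  # prints 2
```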
In the first cycle on the diagram, you see PC = FD2D, and also AB=FD2D. The AB signal is the Address Bus. So in this cycle, the memory is instructed to fetch the byte at location FD2D, which is the opcode of the ADC instruction. This cycle overlaps with the previous instruction, so we don't count it towards the ADC instruction.
In the second cycle, you see PC has incremented to FD2E and the AB follows it. The AB is actually not a register. It is a simple MUX, which happens to follow PC in this state. The default behavior is for the PC to increment and fetch the next byte. That byte is either the operand or the next opcode, so we'll need it anyway (there are some exceptions, like jumps). In the meantime, the memory read from cycle #1 has resulted in DI=69, which is the opcode for the ADC #$1F instruction. The state machine is now in the DECODE state, and decodes the instruction as using immediate addressing. There's nothing more it can do until the operand has been fetched.
In the third cycle, DI=1F, so the operand has been fetched. Now, look at the ALU inputs. There's an AI (A Input) signal that is 41, and a BI (B Input) that's 1F. These values are the result of a set of MUXes. The AI MUX selects the accumulator (A=41), and the BI MUX selects DI (databus in). The CI is the carry input, which is set. So, there are no registers in front of the ALU. It's a combinatorial path straight from the data input bus and from the register file. At the same time, we see that PC=AB=FD2F, so the core is already telling memory to fetch the next instruction.
In the fourth cycle, the ADD output shows the output register of the ALU. This is a true register. It is equal to 61 (=41 + 1F + 1). On the DI input you can already see the opcode of the BMI instruction, and the core will start to decode it.
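As a quick check of the value in the ADD register, the addition can be sketched in a few lines. This assumes binary (non-decimal) mode and only models the 8-bit result and the carry out. The function name is illustrative and the real ALU in the core is of course written as hardware, not like this.

```python
# Illustrative model of an 8-bit add with carry in (binary mode assumed).
def alu_add(ai, bi, ci):
    total = ai + bi + ci
    result = total & 0xFF          # 8-bit value captured in the ADD register
    carry_out = (total >> 8) & 1   # ninth bit becomes the carry out
    return result, carry_out

# Values from the third cycle: AI=41, BI=1F, CI=1
result, carry_out = alu_add(0x41, 0x1F, 1)
print(f"ADD={result:02X} carry={carry_out}")  # prints ADD=61 carry=0
```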
In the fifth cycle, the accumulator is written with the new value (A=61), taken from the ADD register in cycle #4. In the meantime, the core is in the BRA0 (Branch 0) state, and has to decide whether to take the branch or not. The result of the previous instruction arrives just in time to be used for that decision.
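For reference, here is a sketch of the decision BMI has to make, at the architectural level only. It uses the standard 6502 rules: N is bit 7 of the ADC result, and the branch offset is a signed byte relative to the address after the branch instruction. The helper names are made up, and nothing here models how the BRA0 state is implemented inside the core.

```python
# Architectural model of BMI, not the core's state machine.
def sign_extend(offset):
    """Interpret an 8-bit branch offset as a signed value."""
    return offset - 0x100 if offset & 0x80 else offset

def bmi_next_pc(branch_addr, offset, result):
    n_flag = (result >> 7) & 1                  # N = bit 7 of the ADC result
    fallthrough = (branch_addr + 2) & 0xFFFF    # BMI is two bytes long
    if n_flag:
        return (fallthrough + sign_extend(offset)) & 0xFFFF
    return fallthrough

# BMI at FD2F with offset 02, after ADC produced 61 (N clear)
next_pc = bmi_next_pc(0xFD2F, 0x02, 0x61)
print(f"{next_pc:04X}")  # prints FD31
```

With a negative result such as 80, the same call returns FD33, the branch target shown in the listing.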