If you want it to be fast above all else, while still doing individual decrements, loop unrolling is the order of the day.
Code:
; load A, X, Y with High, Mid, Low bytes of counter respectively
BogoMIPS:
        ; start by accurately decrementing Low byte to zero
        CPY #0
        BEQ :++
:       DEY
        BNE :-
        ; from now on, low byte will always go through a complete 256-count cycle, so we can unroll it by any submultiple of 256
        ; decrementing the high 16 bits is then slow-path
:       CPX #0
        BNE :++
        CMP #0
        BNE :+
        RTS             ; all counters have reached zero
:       DEC A           ; 65C02 instruction, not available on the NMOS 6502
:       DEX
:       DEY             ; 16 unrolled decrements of the low byte
        DEY
        DEY
        DEY
        DEY
        DEY
        DEY
        DEY
        DEY
        DEY
        DEY
        DEY
        DEY
        DEY
        DEY
        DEY
        BNE :-          ; next pass of 16 until Y wraps round to zero
        BEQ :----       ; always taken: back to the test of X
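This should typically take just over 2 cycles per count. Note that DEC A is a 65C02 instruction; on an NMOS 6502 you would have to substitute something like SEC / SBC #1, which costs 2 more cycles each time A is decremented. Let's work it out more-or-less exactly: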
If the low byte (in Y) is zero, the overhead before main loop entry is 5 cycles. If it's non-zero, it's nominally 4 cycles to start plus 5 cycles per count (strictly one cycle less, since the final BNE falls through in 2 cycles rather than 3). The idea is that you use the low byte only for fine-tuning. The overhead to test the high bytes during function exit is 8 cycles. The minimum execution time is thus 19 cycles including the RTS (5 + 8 + 6), when all three counters are zero on entry. None of these figures include the caller's JSR.
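The inner part of the main loop decrements Y 16 times using 35 cycles (taken branch). Sixteen passes cover all 256 values of Y: 15 * 35 cycles, plus 34 for the last pass whose BNE falls through, plus 10 cycles to branch back, test X and decrement it. A complete 256-count cycle that also tests and decrements X (but not A) therefore takes 569 cycles. That's 2.223 cycles per count. Testing and decrementing A as well adds 6 more cycles, but this happens only once every 256 main loops, so it averages out to an insignificant 0.004% overhead.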
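The per-loop figures can be re-derived from the standard 6502 timings: 2 cycles for DEY, DEX and CPX #imm, 3 for a taken branch and 2 for one that falls through. Here is a small Python sketch of that arithmetic. The variable names are made up for illustration, and it assumes no branch crosses a page boundary (which would add a cycle).

```python
# Standard 6502 timings, assuming no branch crosses a page boundary.
DEY = DEX = CPX_IMM = 2
BRANCH_TAKEN = 3
BRANCH_NOT_TAKEN = 2
UNROLL = 16                                   # DEYs per pass of the inner loop

pass_taken = UNROLL * DEY + BRANCH_TAKEN      # inner pass, BNE loops back
pass_last = UNROLL * DEY + BRANCH_NOT_TAKEN   # final pass, BNE falls through
passes = 256 // UNROLL                        # passes per full cycle of Y

# BEQ back to the test, CPX #0, BNE taken to DEX, DEX
x_step = BRANCH_TAKEN + CPX_IMM + BRANCH_TAKEN + DEX

main_loop = (passes - 1) * pass_taken + pass_last + x_step
print(main_loop, round(main_loop / 256, 3))   # prints: 569 2.223
```

Changing UNROLL to another submultiple of 256 shows how the unroll factor trades code size against cycles per count.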
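On a 2MHz CPU, a one-second delay is 2000000 cycles. That requires 3514 ($0DBA) main loops; after also subtracting the 13 * 6 = 78 cycles spent decrementing A, 456 cycles are left over. Of these, 17 cycles are fixed overhead when Y is nonzero (3 on entry once the fall-through of the last BNE is counted, 8 for the exit tests, 6 for the RTS), leaving 439, so at 5 cycles per count the correct value of Y is 87 ($57). An entry value of $0DBA57 should actually make the function return after 1999996 cycles.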
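The same bookkeeping can be turned into a small calculator. The Python sketch below is only a model of the routine, not something that runs on the 6502. The function names are hypothetical. It counts the fall-through of the final BNE in the fine-tuning loop as 2 cycles, which makes its totals one cycle lower than a flat 5-per-count estimate. It excludes the caller's JSR and assumes no branch crosses a page boundary.

```python
MAIN_LOOP = 569      # cycles per decrement of the 16-bit A:X counter
A_STEP = 6           # extra cycles each time A is decremented
PER_COUNT = 5        # DEY + taken BNE in the fine-tuning loop
EXIT = 8 + 6         # final tests of X and A, then RTS

def total_cycles(a, x, y):
    """Cycles from entry to the end of RTS under the model described above."""
    entry = 5 if y == 0 else 3 + PER_COUNT * y   # last BNE falls through
    return entry + ((a << 8) | x) * MAIN_LOOP + a * A_STEP + EXIT

def entry_value(target):
    """Longest delay not exceeding target cycles, as (a, x, y).

    Targets below the 19-cycle minimum simply return (0, 0, 0).
    """
    loops = min(0xFFFF, target // MAIN_LOOP)
    # Back off until the main loops plus the minimum overhead fit.
    while loops > 0 and total_cycles(loops >> 8, loops & 0xFF, 0) > target:
        loops -= 1
    a, x = loops >> 8, loops & 0xFF
    y = 0
    # Spend the remaining cycles on the fine-tuning byte.
    while y < 255 and total_cycles(a, x, y + 1) <= target:
        y += 1
    return a, x, y

a, x, y = entry_value(2_000_000)
print(f"${a:02X}{x:02X}{y:02X}", total_cycles(a, x, y))   # prints: $0DBA57 1999996
```

For a different clock rate, pass the cycle count for the delay you want, e.g. `entry_value(1_000_000)` for one second at 1MHz.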
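There is still plenty of headroom here to accommodate, say, a 20MHz CPU. The main loop runs once for every value of the 16-bit A:X counter, so 65535 times at most, and A is decremented 255 times. Counting the final fall-through BNE of the fine-tuning loop as 2 cycles, the actual maximum number of cycles you can request (with $FFFFFF) is 17 + 255*5 + 255*6 + 65535*569 = 37,292,237, which is about 1.86 seconds at 20MHz.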
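One way to gain confidence in cycle formulas like these is to walk the routine's control flow in a small model and compare the result with the closed form. The Python sketch below does that for a few small entry values. It is not a 6502 emulator, just a hand-written mirror of the branches above using standard cycle counts. The function names are hypothetical, and it assumes no page-crossing branches and excludes the caller's JSR.

```python
def simulate(a, x, y):
    """Follow the routine's branches, charging standard 6502 cycle counts."""
    cycles = 2                       # CPY #0
    if y == 0:
        cycles += 3                  # BEQ taken, skip fine-tuning
    else:
        cycles += 2                  # BEQ not taken
        while True:
            y = (y - 1) & 0xFF
            cycles += 2              # DEY
            if y == 0:
                cycles += 2          # BNE falls through
                break
            cycles += 3              # BNE taken
    while True:
        cycles += 2                  # CPX #0
        if x != 0:
            cycles += 3              # BNE taken, straight to DEX
        else:
            cycles += 2 + 2          # BNE not taken, CMP #0
            if a == 0:
                return cycles + 2 + 6    # BNE not taken, RTS
            cycles += 3 + 2          # BNE taken, DEC A
            a -= 1
        x = (x - 1) & 0xFF
        cycles += 2                  # DEX
        while True:                  # unrolled inner loop
            y = (y - 16) & 0xFF
            cycles += 16 * 2         # sixteen DEYs
            if y == 0:
                cycles += 2          # BNE falls through
                break
            cycles += 3              # BNE taken
        cycles += 3                  # BEQ taken, back to the X test

def closed_form(a, x, y):
    """Entry overhead + main loops + A decrements + exit tests and RTS."""
    entry = 5 if y == 0 else 3 + 5 * y
    return entry + ((a << 8) | x) * 569 + a * 6 + 8 + 6

for args in [(0, 0, 0), (0, 1, 0), (0, 0, 7), (1, 0, 0), (2, 3, 200)]:
    assert simulate(*args) == closed_form(*args), args
print("closed form matches the step-by-step model")
```

Only small values are stepped through here, since walking all of $FFFFFF this way would take a long time in Python.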