Most of my assembly coding is part of reverse-engineering video game consoles, and a common need is to delay N clocks, where N is a constant or run-time value. I figured I'd share the 6502 routines I use, since I found them fun to write. This is the key routine:
Code:
; Delays A+20 clocks (excluding JSR)
; Preserved: X, Y
delay_a_clocks:
        lsr a
        bcs @b0c        ; 2/3
@b0c:   lsr a
        bcs @b1s        ; 2/3
        lsr a
        bcs @b2s        ; 2/3
@b2c:   bne @ge8        ; 3
                        ; -1
@ret:   rts
@ge8:   sec             ; 2
        sbc #1          ; 2
        beq @ret        ; 3
                        ; -1
        nop             ; 2
@eights:
        bne :+          ; *3
:       sbc #1          ; *2
        bne @eights     ; *3
                        ; -1
        rts
@b1s:   lsr a           ; 2
        bcc @b2c        ; 3/2
        nop             ; 2
@b2s:   bcs @b2c        ; 3
It's somewhat obfuscated because I wanted to reduce the overhead (the +20) as much as possible. In addition to the above, there are routines that delay A*256+10 clocks and A*65536+10 clocks, allowing 16-bit and 24-bit run-time delays by simply calling the three routines with the low, mid, and high bytes of the delay. There will be an overhead of some constant number of clocks, but that's usually not a problem.
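As a rough sketch of the three-call scheme, a run-time 24-bit delay might look like the following. The names delay_256a_clocks and delay_65536a_clocks are placeholders for the A*256+10 and A*65536+10 routines (their real names aren't given here), and delay_24bit, delay_lo, delay_mid, and delay_hi are hypothetical names made up for the example:

```asm
; Sketch only, not taken from the original source.
; delay_a_clocks is the routine shown above; delay_256a_clocks and
; delay_65536a_clocks are placeholder names for the A*256+10 and
; A*65536+10 routines, assumed to be defined elsewhere in the same file.

.zeropage
delay_lo:   .res 1              ; hypothetical: low byte of clock count
delay_mid:  .res 1              ; hypothetical: mid byte of clock count
delay_hi:   .res 1              ; hypothetical: high byte of clock count

.code
; Delays (delay_hi:delay_mid:delay_lo) clocks plus a constant overhead
; Preserved: X, Y
delay_24bit:
        lda delay_lo            ; low byte:  A + 20 clocks
        jsr delay_a_clocks
        lda delay_mid           ; mid byte:  A*256 + 10 clocks
        jsr delay_256a_clocks
        lda delay_hi            ; high byte: A*65536 + 10 clocks
        jsr delay_65536a_clocks
        rts
```

The total is the 24-bit value plus a fixed number of clocks (the loads, the JSRs, and each routine's constant part), so when exact timing matters, that constant can be subtracted from the value before it is stored.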
For constant delays, I use a macro (ca65 assembler) that selects between several strategies depending on the delay, which can be any expression evaluating to zero or to a value from 2 to 16777216 (a delay of 1 isn't possible, since no 6502 instruction takes fewer than 2 clocks). For delays of less than 28, it uses either an inline delay made up of short instructions, or a JSR to a bunch of NOPs followed by a return. For 28 and larger, it uses a call to delay_a_clocks, plus optional calls to variants of the "delay A*256 clocks" and "delay A*65536 clocks" routines when the mid and high bytes are non-zero. All routines and macros preserve the X and Y registers but not A, since it's easy enough to save and restore it. At one point I had a version with the full 24-bit delay stored inline after the JSR, which shortened the delay calls a bit, but this made the delay code much more complex, so I went back to the simpler scheme I use now.
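The 28-clock threshold is consistent with simple arithmetic: LDA #imm takes 2 clocks and JSR takes 6, which together with the routine's A+20 makes 28 the shortest delay reachable through delay_a_clocks. Below is a simplified ca65 sketch of that kind of selection logic, not the actual macro. The name delay_const is made up, it handles only 0 and 2 to 283 clocks, and it leaves out the JSR-to-NOPs strategy and the mid/high-byte calls:

```asm
; Sketch only: delay_const is a hypothetical name, and this covers just
; a subset of the range described above (0, and 2 to 283 clocks).
; Preserved: X, Y (A is overwritten when n >= 28)
.macro delay_const n
    .if (n) = 1 .or (n) < 0 .or (n) > 283
        .error "delay_const: delay not supported by this sketch"
    .elseif (n) >= 28
        ; LDA # (2) + JSR (6) + routine (A + 20) = A + 28 clocks
        lda #(n) - 28
        jsr delay_a_clocks
    .elseif (n) & 1
        ; Odd, 3 to 27: one 3-clock instruction, then NOPs
        jmp *+3                 ; 3 clocks, falls through to next instruction
        .repeat ((n) - 3) / 2
            nop                 ; 2 clocks each
        .endrepeat
    .else
        ; Even, 0 to 26: NOPs only
        .repeat (n) / 2
            nop                 ; 2 clocks each
        .endrepeat
    .endif
.endmacro

; Example uses
        delay_const 7           ; jmp *+3 and two NOPs
        delay_const 100         ; lda #72, jsr delay_a_clocks
```

Here `jmp *+3` stands in for the odd clock because it changes no registers or flags. A real macro would likely pick shorter encodings for the longer inline delays, since a run of NOPs costs one byte per two clocks.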
Here's the full commented source code I use:
6502_delay.asm