While considering what the optimum degree of unrolling might be, I had an amusing idea. I had designed the inner loop to unroll by an exact submultiple of 256 (in this case, 16) so that the outer loop could cleanly decrement the upper bytes by one. But what if I chose a degree that was coprime to 256, such as 7? This would cause the inner loop to decrement by 7*256 before exiting. The speed would still be high in this case, just below 2.5 cycles per count on average, even though we must now complicate the control structure.
Code:
; load A, X, Y with High, Mid, Low bytes of counter respectively
BogoMIPS:
; start by accurately decrementing Low byte to zero
CPY #0
BEQ @mainLoop
: DEY
BNE :-
@mainLoop:
CPX #7
BCC @single
DEX
DEX
DEX
DEX
DEX
DEX
DEX
BCS @multiple
@single:
CPX #0
BNE :++
CMP #0
BNE :+
RTS
: SBC #1 ; Carry is already set by CMP
: DEX
DEY ; 256 % 7 = 4
DEY
DEY
DEY
@multiple:
DEY
DEY
DEY
DEY
DEY
DEY
DEY
BNE @multiple
BEQ @mainLoop
The innermost loop performs 7 decrements of Y in 17 cycles, for 2.429 cycles per count. On the fast path this repeats 256 times, after decrementing X 7 times, then returns to the top of the main loop. This takes 4375 cycles to perform 1792 decrements, for 2.441 cycles per count.
The slower path is typically performed 4 times per high-byte count, in which X is decremented once, and if necessary A is tested and decremented, while Y is pre-decremented 4 times so that the inner loop performs 252 decrements instead of 1792. This slow-path loop takes 634 cycles to count down by 256, for 2.476 cycles per count, if X is nonzero. If X is zero, add 6 cycles to test and decrement A, for exactly 2.5 cycles per count over that batch of 256.
It takes 63 more cycles to run the slow path 7 times than to run the fast path once.
The total cycle count per A decrement (ie. 65536 counts) is 36 fast paths, 4 slow paths, and 1 supplement for actually decrementing A. That's 160,042 cycles, so 2.442 cycles per count.
As before, we have a 5 cycle per count loop on entry for fine tuning. The cycle count if entered with zeroed counters is 24, including the RTS - this is due to the additional overhead of comparing X against 7. If Y is nonzero on entry, the overhead beyond the various loop execution times above is reduced to 23 cycles. The maximum cycle count is 255 A-decrements, 36 more fast-paths, 3 more slow-paths, 255 fine-tuning loops, and the function overhead: 40,971,410. That's enough to implement a 2-second delay at 20MHz.
I have not yet checked how densely the possibilities for precise cycle counts are covered. The calculations will be more complex than for my earlier version.