Mini-challenge - fastest 24 bit countdown

BigEd · Post by **BigEd** » Mon Aug 28, 2023 1:44 pm

I've been noodling about, trying to measure BogoMIPS, which I take to mean counting down to zero from some number, with the number chosen such that it takes exactly a second to do it, on a particular machine.

A 24 bit number should be enough for the purpose.

I chose to pass my number in using A, X and Y. We have to count down the multi-byte number, until all bytes are zero, at which point we RTS.

To get the highest BogoMIPS rating, we need this multibyte countdown to run as fast as possible.

I thought I had some nice code for this, but it was wrong!

teamtempest · Post by **teamtempest** » Mon Aug 28, 2023 2:29 pm

Just thinking out loud...

Code: Select all

     cpy #0
     beq  f1    ; high byte already zero

 b1  dey
     bne b1

f1   cpx #0
     beq f2

b2   dex
     bne  b2

f2   tax
     beq f3

b3   dex
     bne b3

f3   rts

gfoot · Post by **gfoot** » Mon Aug 28, 2023 3:10 pm

I'd just use one label, decrementing each register in turn with a bne back to the label after each decrement. I use this for delay loops often. The caveat is that the initial value isn't quite what you'd expect! But I believe it achieves the right result if you increment all the bytes by one before entering the loop, i.e. 010101 is actually "zero" and doesn't loop at all.

An interesting side challenge would be to make the loop constant time for each iteration regardless of rounding, there could be some value in that.

Proxy · Post by **Proxy** » Mon Aug 28, 2023 6:37 pm

on the 65816 you can omit one of the registers as the low word fits a single 16-bit register. X could also be made 16-bits wide and it would only make the CPX instruction 1 byte larger and take 1 cycle longer to execute.

Code: Select all

; 24-bit input in A (low word) and X (high byte)
count_down:
    .A16
    .I8
    CMP #0          ; 3 cycles
    BEQ @decx       ; 2 cycles (3 if taken)
    @loop:
        DEC A       ; 2 cycles
        BNE @loop   ; 2 cycles (3 if taken)
        @decx:
        CPX #0      ; 2 cycles
        BEQ @exit   ; 2 cycles (3 if taken)
        DEX         ; 2 cycles
    BRA @loop       ; 3 cycles
    @exit:
RTS                 ; 6 cycles

though i don't know if i'm skilled enough to make it take a constant amount of time per operation. but i can try!
getting to @loop from the start of the function can either take 5 cycles, or 6+9=15 cycles, depending on if the Low Word starts at 0 or not.
so you could add 5 NOPs before @loop to make both paths take 15 cycles.
one iteration of decrementing A takes 5 cycles, if A is 1 at the start of @loop then it will take 4+9=13 cycles.
you could rework the iteration loop to branch to @decx when A = 0. so when A is 1 at the start of @loop then it will take 5+9=14 cycles. but when A != 1 it will take 4 cycles to pass the BEQ and then wastes an additional 10 cycles to get back to @loop, so 4+10=14 cycles in total.

i think my math is correct...? though i'd like to see some other attempts at making it take a constant time per iteration

Code: Select all

; 24-bit input in A (low word) and X (high byte)
count_down:
    .A16
    .I8
    CMP #0          ; 3 cycles
    BEQ @decx       ; 2 cycles (3 if taken)
    NOP
    NOP
    NOP
    NOP
    NOP             ; 10 cycles
    @loop:
        DEC A       ; 2 cycles
        BEQ @decx   ; 2 cycles (3 if taken)
        NOP
        NOP         ; 4 cycles
        BRA @n      ; 3 cycles
    @n: BRA @loop   ; 3 cycles
        @decx:
        CPX #0      ; 2 cycles
        BEQ @exit   ; 2 cycles (3 if taken)
        DEX         ; 2 cycles
    BRA @loop       ; 3 cycles
    @exit:
RTS                 ; 6 cycles

floobydust · Post by **floobydust** » Mon Aug 28, 2023 7:26 pm

So, if I understand this correctly, you are using a 24-bit number as a value to count down to zero (from that value).

If we just use a register decrement instruction on the 6502, that takes 2 clock cycles. You'll also need a branch instruction for looping, which will take 3 clock cycles to branch and 2 clock cycles if you don't branch. That's a minimum of 5 clock cycles per decrement. With a 24-bit value, just multiply 5 clock cycles times 16,777,215 and you get 83,886,075 clock cycles. You'll still need some additional clock cycles to get things setup correctly as you zero the high order byte and move downwards.

Based on the above (83,886,075) clock cycles with a 1 MHz 6502, that will take 83.886075 seconds to execute. Backing into a 1-second countdown time (with a 1 MHz clock), you could only manage a count value of less than 200,000.

Unless I'm not clearly understanding the problem, you'll need a pretty fast CPU clock rate or limit the number of bits you use for counting, perhaps 18 bits?

BigEd · Post by **BigEd** » Mon Aug 28, 2023 8:10 pm

@floobydust - ah, no, I mean that 16 bits won't be enough but 24 bits should be enough, to hold the system-specific constant which causes the countdown to take one second.

@gfoot - I did wonder about offsetting each byte by one. It doesn't quite feel right for the challenge, as we're changing the constant before we start decrementing. (In fact, I was inclined to go in a different direction, to invert the value and count up to zero, because with an up counter the overflow to the next byte should happen at zero. And I might do that, in my own program, but I don't regard it as a fastest down counter.)

@teamtempest - did you miss out a txa somewhere? I see a tax but no txa.

@proxy - I think you're solving a different challenge, with a different micro!

I should note that I'm thinking of the 6502, not the 65C02, so DEC A isn't available.

So, I think so far we don't have any code which takes a multibyte value and decrements it, detecting all-zero and returning. Maybe I'm being too picky...

A simple approach, I think, would be (Edit: the following code totally not working! Thanks George.)

Code: Select all

STA $70
.loop
DEX
BEQ done
CPX #$FF
BNE loop
DEY
BEQ done
CPY #$FF
BNE loop
DEC $70
BNE loop
.done
RTS

That'll run at a little more than 9 cycles per count.

It might be that comparing with zero before decrementing would be faster, somehow?

Proxy · Post by **Proxy** » Mon Aug 28, 2023 8:26 pm

Quote:

@proxy - I think you're solving a different challenge, with a different micro!

to be fair you didn't specify the CPU in the first post

on another note, i somehow read gfoot's reply and thought that the "constant time per iteration" thing was part of your original post. oops!

BigEd · Post by **BigEd** » Mon Aug 28, 2023 8:27 pm

On second thoughts, perhaps adding 010101 is a good way to go... it's a matter of judgement, and it's more fun if we're not strict about these kinds of challenges.

And yes, solving for the 816 is a different challenge, but go for it if you like! We should hope, I think, to get a slightly faster routine and therefore a slightly higher bogomips rating at the same clock speed.

gfoot · Post by **gfoot** » Mon Aug 28, 2023 8:29 pm

I don't think your code would work Ed - it will count X down until it reaches zero, then return without touching the other registers

If you want the values to strictly count down to zero, and it's not just about looping the right number of times, then I guess these are two non-optimal ways to do it - first the very naive approach:

Code: Select all

STX $70
STY $71
STA $72
SEC
.loop
LDA &70 : ORA &71 : ORA &72 : BEQ done
LDA &70 : SBC #1 : STA &70
LDA &71 : SBC #0 : STA &71
LDA &72 : SBC #0 : STA &72
BCS loop
RTS

That one even requires a consistent number of cycles per loop, so it is in some sense "linear". edit - I added the "LDA/ORA/ORA/BEQ" line, as otherwise it would count down to -1 instead of 0.

Code: Select all

.loop
CPX #0 : BEQ xdone
DEX : JMP loop
.xdone
CPY #0 : BEQ ydone
DEY : JMP loop
.ydone
CMP #0 : BEQ adone
SBC #1 : JMP loop
.adone
RTS

This one checks for zero before all decrements, so if you enter with all registers zero, it will not decrement any of them, and return. It still loops the right number of times, I think. But it leaves me with an urge to optimize it, but also a feeling that any optimization would stop it from strictly adhering to the rules!

Osi · Post by **Osi** » Mon Aug 28, 2023 8:41 pm

Here is my 24 bit countdown proposal. The shown numbers represent 1sec @ 1Mhz clock speed
So it takes 1 sec to count down from 0x030AC7 to zero.

Code: Select all

; A high byte
; y,x low word


	.org $1000
	
	lda #$03
	ldy #$0A
	ldx #$C7
	jsr Count
	brk
	
	
Count	pha
	txa
	eor #$FF
	tax
	tya
	eor #$FF
	tay
	pla
	eor #$FF
	clc

b1	inx
	bne b1

	iny
	bne b1

	adc #01
	bne b1

	rts

BigEd · Post by **BigEd** » Mon Aug 28, 2023 8:53 pm

> I don't think your code would work Ed
Oops! Quite so. I find your second offering there interesting...

Thanks @Osi I think that invert-then-increment is where I was probably headed, even though I don't feel too good about it as a tactic.

The Wikipedia article for BogoMIPS offers us this pseudocode (which I note passes zero before returning, but I'm not going to worry about that.)

Code: Select all

static void delay_loop(long loops)
{
  long d0 = loops;
  do {
    --d0;
  } while (d0 >= 0);
}

Chromatix · Post by **Chromatix** » Mon Aug 28, 2023 9:30 pm

If you want it to be fast above all else, while still doing individual decrements, loop unrolling is the order of the day.

Code: Select all

; load A, X, Y with High, Mid, Low bytes of counter respectively
BogoMIPS:
  ; start by accurately decrementing Low byte to zero
  CPY #0
  BEQ :++
: DEY
  BNE :-

  ; from now on, low byte will always go through a complete 256-count cycle, so we can unroll it by any submultiple of 256
  ; decrementing the high 16 bits is then slow-path
: CPX #0
  BNE :++
  CMP #0
  BNE :+
  RTS      ; all counters have reached zero

: DEC A
: DEX
: DEY
  DEY
  DEY
  DEY
  DEY
  DEY
  DEY
  DEY
  DEY
  DEY
  DEY
  DEY
  DEY
  DEY
  DEY
  DEY
  BNE :-
  BEQ :----

This should be typically just over 2 cycles per count, and should work on any 6502. Let's work it out more-or-less exactly:

If the low byte (in Y) is zero, the overhead before main loop entry is 5 cycles. If it's non-zero, then it's 4 cycles to start plus 5 cycles per count. The idea is that you use the low byte only for fine-tuning. The overhead to test the high bytes during function exit is 8 cycles. The minimum execution time is thus 19 cycles including the RTS, when all three counters are zero on entry.

The inner part of the main loop decrements Y 16 times using 35 cycles (taken branch). A complete 256-count cycle that also tests and decrements X (but not A) takes 569 cycles. That's 2.223 cycles per count. Testing and decrementing A as well adds 6 more cycles, but this is only an insignificant 0.01% overhead.

On a 2MHz CPU, 2000000 cycles requires 3514 ($0DBA) main loops with 456 cycles left over. Of these, 18 cycles are occupied in overhead when Y is nonzero, and the correct value of Y is 87 ($57). An entry value of $0DBA57 should actually make the function return after 1999997 cycles.

There is still plenty of headroom here to accommodate, say, a 20MHz CPU. The actual maximum number of cycles you can request (with $FFFFFF) is 18 + 255*5 + 255*(6 + 256*569) = 37,147,143.

BigEd · Post by **BigEd** » Tue Aug 29, 2023 6:33 am

Oh I like the loop unrolling - not too messy at all, having synchronised the count.

I did have an idea overnight, in a quest to get closer to the 5 cycle count promised - but not delivered - by

Code: Select all

.loop
DEX
BNE loop

and the idea is to split the positive and negative halves of the countdown, so we can use BPL to detect the FF:

Code: Select all

.xstillnegative
DEX
BMI xstillnegative
.xstillpositive
DEX
BPL xstillpositive

As with the unrolled code, one needs to prepare X to the right state before starting off. That's a bit too fiddly for me to attempt off-the-cuff.

As with the unrolled code, we can decrement Y in a more pedestrian way because it's a rare case.

Also, it looks like this will always take X around to FF before falling out of the loop - rather like Wikipedia's C code, we'll finish at -1 rather than 0. Edit: but maybe I can use BNE xnotyetzero as the second branch-back...

Osi · Post by **Osi** » Tue Aug 29, 2023 9:34 am

In the two's complement proposal, we actually create a negative number equal the positive one and count the two's complement down to zero. I would consider this as a allowed approach.
The cycle count is 5.02 in the first proposal and can be shorten by the unrolling trick down to 2.2 cycles or 0x06EC67 counts @ 1Mhz clock, if needed.

Code: Select all

Count	pha
	txa
	eor #$FF
	tax
	tya
	eor #$FF
	tay
	pla
	eor #$FF
	clc
b0	inx
	bne b0
	beq b2

b1	inx
	inx
	inx
	inx
	inx
	inx
	inx
	inx
	inx
	inx
	inx
	inx
	inx
	inx
	inx
	inx
	bne b1

b2	iny
	bne b1

	adc #01
	bne b1
	rts

Chromatix · Post by **Chromatix** » Tue Aug 29, 2023 1:03 pm

So, my code nominally has a speed of under two-and-a-quarter cycles per count, but a resolution of 5 cycles. There is a wrinkle however, which may allow for increasing that resolution in many cases. That wrinkle is the fact that the maximum number of 5-cycle loops takes significantly longer (5*255 = 1275) than one batch of 256 (569, which is not a multiple of 5). Almost two-and-a-quarter times as long, funnily enough.

Thus in the vast majority of cases (when X is nonzero), one batch of 256 could be substituted by 113 ($71) individual counts to subtract 4 cycles, or by 114 ($72) to add one cycle. In nearly a quarter of cases, a second such substitution could be made to subtract 3 from or add 2 to the total. Both these latter options are available if the initial value of Y is less than 28; the "subtract 3" option is also available for Y=28 or 29 ($1C or $1D).

To demonstrate this with my example of a 2 million cycle request, revising the counter from $0DBA57 to $0DB9C9 yields 1,999,998 cycles, closer by one cycle than the original 1,999,997 cycles - and prepending the entire routine with a NOP would cover the remainder. Of course, simply using $0DBA58 would yield 2,000,002 cycles, which is equally close.

In the corner case where X is decremented past zero to perform the above substitution, A also has to be decremented which effectively removes an additional 6 cycles from the total. This requires compensating by incrementing Y and further allowing for the remaining missing cycle.

Mini-challenge - fastest 24 bit countdown

Mini-challenge - fastest 24 bit countdown

Re: Mini-challenge - fastest 24 bit countdown

Re: Mini-challenge - fastest 24 bit countdown

Re: Mini-challenge - fastest 24 bit countdown

Re: Mini-challenge - fastest 24 bit countdown

Re: Mini-challenge - fastest 24 bit countdown

Re: Mini-challenge - fastest 24 bit countdown

Re: Mini-challenge - fastest 24 bit countdown

Re: Mini-challenge - fastest 24 bit countdown

Re: Mini-challenge - fastest 24 bit countdown

Re: Mini-challenge - fastest 24 bit countdown

Re: Mini-challenge - fastest 24 bit countdown

Re: Mini-challenge - fastest 24 bit countdown

Re: Mini-challenge - fastest 24 bit countdown

Re: Mini-challenge - fastest 24 bit countdown