6502.org Forum  Projects  Code  Documents  Tools  Forum
It is currently Thu May 09, 2024 8:04 am

All times are UTC




Post new topic Reply to topic  [ 33 posts ]  Go to page 1, 2, 3  Next
Author Message
PostPosted: Mon Aug 28, 2023 1:44 pm 
Offline
User avatar

Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10800
Location: England
I've been noodling about, trying to measure BogoMIPS, which I take to mean counting down to zero from some number, with the number chosen such that it takes exactly a second to do it, on a particular machine.

A 24 bit number should be enough for the purpose.

I chose to pass my number in using A, X and Y. We have to count down the multi-byte number, until all bytes are zero, at which point we RTS.

To get the highest BogoMIPS rating, we need this multibyte countdown to run as fast as possible.

I thought I had some nice code for this, but it was wrong!


Top
 Profile  
Reply with quote  
PostPosted: Mon Aug 28, 2023 2:29 pm 
Offline

Joined: Sun Nov 08, 2009 1:56 am
Posts: 387
Location: Minnesota
Just thinking out loud...

Code:
     cpy #0
     beq  f1    ; high byte already zero

 b1  dey
     bne b1

f1   cpx #0
     beq f2

b2   dex
     bne  b2

f2   tax
     beq f3

b3   dex
     bne b3

f3   rts


Top
 Profile  
Reply with quote  
PostPosted: Mon Aug 28, 2023 3:10 pm 
Offline

Joined: Fri Jul 09, 2021 10:12 pm
Posts: 741
I'd just use one label, decrementing each register in turn with a bne back to the label after each decrement. I use this for delay loops often. The caveat is that the initial value isn't quite what you'd expect! But I believe it achieves the right result if you increment all the bytes by one before entering the loop, i.e. 010101 is actually "zero" and doesn't loop at all.

An interesting side challenge would be to make the loop constant time for each iteration regardless of rounding, there could be some value in that.


Top
 Profile  
Reply with quote  
PostPosted: Mon Aug 28, 2023 6:37 pm 
Offline
User avatar

Joined: Fri Aug 03, 2018 8:52 am
Posts: 746
Location: Germany
on the 65816 you can omit one of the registers as the low word fits a single 16-bit register. X could also be made 16-bits wide and it would only make the CPX instruction 1 byte larger and take 1 cycle longer to execute.

Code:
; 24-bit input in A (low word) and X (high byte)
count_down:
    .A16
    .I8
    CMP #0          ; 3 cycles
    BEQ @decx       ; 2 cycles (3 if taken)
    @loop:
        DEC A       ; 2 cycles
        BNE @loop   ; 2 cycles (3 if taken)
        @decx:
        CPX #0      ; 2 cycles
        BEQ @exit   ; 2 cycles (3 if taken)
        DEX         ; 2 cycles
    BRA @loop       ; 3 cycles
    @exit:
RTS                 ; 6 cycles


though i don't know if i'm skilled enough to make it take a constant amount of time per operation. but i can try!
getting to @loop from the start of the function can either take 5 cycles, or 6+9=15 cycles, depending on if the Low Word starts at 0 or not.
so you could add 5 NOPs before @loop to make both paths take 15 cycles.
one iteration of decrementing A takes 5 cycles, if A is 1 at the start of @loop then it will take 4+9=13 cycles.
you could rework the iteration loop to branch to @decx when A = 0. so when A is 1 at the start of @loop then it will take 5+9=14 cycles. but when A != 1 it will take 4 cycles to pass the BEQ and then wastes an additional 10 cycles to get back to @loop, so 4+10=14 cycles in total.

i think my math is correct...? though i'd like to see some other attempts at making it take a constant time per iteration

Code:
; 24-bit input in A (low word) and X (high byte)
count_down:
    .A16
    .I8
    CMP #0          ; 3 cycles
    BEQ @decx       ; 2 cycles (3 if taken)
    NOP
    NOP
    NOP
    NOP
    NOP             ; 10 cycles
    @loop:
        DEC A       ; 2 cycles
        BEQ @decx   ; 2 cycles (3 if taken)
        NOP
        NOP         ; 4 cycles
        BRA @n      ; 3 cycles
    @n: BRA @loop   ; 3 cycles
        @decx:
        CPX #0      ; 2 cycles
        BEQ @exit   ; 2 cycles (3 if taken)
        DEX         ; 2 cycles
    BRA @loop       ; 3 cycles
    @exit:
RTS                 ; 6 cycles


Top
 Profile  
Reply with quote  
PostPosted: Mon Aug 28, 2023 7:26 pm 
Offline
User avatar

Joined: Tue Mar 05, 2013 4:31 am
Posts: 1373
So, if I understand this correctly, you are using a 24-bit number as a value to count down to zero (from that value).

If we just use a register decrement instruction on the 6502, that takes 2 clock cycles. You'll also need a branch instruction for looping, which will take 3 clock cycles to branch and 2 clock cycles if you don't branch. That's a minimum of 5 clock cycles per decrement. With a 24-bit value, just multiply 5 clock cycles times 16,777,215 and you get 83,886,075 clock cycles. You'll still need some additional clock cycles to get things setup correctly as you zero the high order byte and move downwards.

Based on the above (83,886,075) clock cycles with a 1 MHz 6502, that will take 83.886075 seconds to execute. Backing into a 1-second countdown time (with a 1 MHz clock), you could only manage a count value of less than 200,000.

Unless I'm not clearly understanding the problem, you'll need a pretty fast CPU clock rate or limit the number of bits you use for counting, perhaps 18 bits?

_________________
Regards, KM
https://github.com/floobydust


Top
 Profile  
Reply with quote  
PostPosted: Mon Aug 28, 2023 8:10 pm 
Offline
User avatar

Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10800
Location: England
@floobydust - ah, no, I mean that 16 bits won't be enough but 24 bits should be enough, to hold the system-specific constant which causes the countdown to take one second.

@gfoot - I did wonder about offsetting each byte by one. It doesn't quite feel right for the challenge, as we're changing the constant before we start decrementing. (In fact, I was inclined to go in a different direction, to invert the value and count up to zero, because with an up counter the overflow to the next byte should happen at zero. And I might do that, in my own program, but I don't regard it as a fastest down counter.)

@teamtempest - did you miss out a txa somewhere? I see a tax but no txa.

@proxy - I think you're solving a different challenge, with a different micro!

I should note that I'm thinking of the 6502, not the 65C02, so DEC A isn't available.

So, I think so far we don't have any code which takes a multibyte value and decrements it, detecting all-zero and returning. Maybe I'm being too picky...

A simple approach, I think, would be (Edit: the following code totally not working! Thanks George.)
Code:
STA $70
.loop
DEX
BEQ done
CPX #$FF
BNE loop
DEY
BEQ done
CPY #$FF
BNE loop
DEC $70
BNE loop
.done
RTS


That'll run at a little more than 9 cycles per count.

It might be that comparing with zero before decrementing would be faster, somehow?


Last edited by BigEd on Tue Aug 29, 2023 6:35 am, edited 1 time in total.

Top
 Profile  
Reply with quote  
PostPosted: Mon Aug 28, 2023 8:26 pm 
Offline
User avatar

Joined: Fri Aug 03, 2018 8:52 am
Posts: 746
Location: Germany
Quote:
@proxy - I think you're solving a different challenge, with a different micro!


to be fair you didn't specify the CPU in the first post :wink:

on another note, i somehow read gfoot's reply and thought that the "constant time per iteration" thing was part of your original post. oops!


Top
 Profile  
Reply with quote  
PostPosted: Mon Aug 28, 2023 8:27 pm 
Offline
User avatar

Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10800
Location: England
On second thoughts, perhaps adding 010101 is a good way to go... it's a matter of judgement, and it's more fun if we're not strict about these kinds of challenges.

And yes, solving for the 816 is a different challenge, but go for it if you like! We should hope, I think, to get a slightly faster routine and therefore a slightly higher bogomips rating at the same clock speed.


Top
 Profile  
Reply with quote  
PostPosted: Mon Aug 28, 2023 8:29 pm 
Offline

Joined: Fri Jul 09, 2021 10:12 pm
Posts: 741
I don't think your code would work Ed - it will count X down until it reaches zero, then return without touching the other registers

If you want the values to strictly count down to zero, and it's not just about looping the right number of times, then I guess these are two non-optimal ways to do it - first the very naive approach:

Code:
STX $70
STY $71
STA $72
SEC
.loop
LDA &70 : ORA &71 : ORA &72 : BEQ done
LDA &70 : SBC #1 : STA &70
LDA &71 : SBC #0 : STA &71
LDA &72 : SBC #0 : STA &72
BCS loop
RTS

That one even requires a consistent number of cycles per loop, so it is in some sense "linear". edit - I added the "LDA/ORA/ORA/BEQ" line, as otherwise it would count down to -1 instead of 0.

Code:
.loop
CPX #0 : BEQ xdone
DEX : JMP loop
.xdone
CPY #0 : BEQ ydone
DEY : JMP loop
.ydone
CMP #0 : BEQ adone
SBC #1 : JMP loop
.adone
RTS

This one checks for zero before all decrements, so if you enter with all registers zero, it will not decrement any of them, and return. It still loops the right number of times, I think. But it leaves me with an urge to optimize it, but also a feeling that any optimization would stop it from strictly adhering to the rules!


Top
 Profile  
Reply with quote  
PostPosted: Mon Aug 28, 2023 8:41 pm 
Offline
User avatar

Joined: Wed May 11, 2022 10:34 am
Posts: 16
Location: Germany
Here is my 24 bit countdown proposal. The shown numbers represent 1sec @ 1Mhz clock speed
So it takes 1 sec to count down from 0x030AC7 to zero.

Code:
; A high byte
; y,x low word


   .org $1000
   
   lda #$03
   ldy #$0A
   ldx #$C7
   jsr Count
   brk
   
   
Count   pha
   txa
   eor #$FF
   tax
   tya
   eor #$FF
   tay
   pla
   eor #$FF
   clc

b1   inx
   bne b1

   iny
   bne b1

   adc #01
   bne b1

   rts


Top
 Profile  
Reply with quote  
PostPosted: Mon Aug 28, 2023 8:53 pm 
Offline
User avatar

Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10800
Location: England
> I don't think your code would work Ed
Oops! Quite so. I find your second offering there interesting...

Thanks @Osi I think that invert-then-increment is where I was probably headed, even though I don't feel too good about it as a tactic.

The Wikipedia article for BogoMIPS offers us this pseudocode (which I note passes zero before returning, but I'm not going to worry about that.)
Code:
static void delay_loop(long loops)
{
  long d0 = loops;
  do {
    --d0;
  } while (d0 >= 0);
}


Top
 Profile  
Reply with quote  
PostPosted: Mon Aug 28, 2023 9:30 pm 
Offline

Joined: Mon May 21, 2018 8:09 pm
Posts: 1462
If you want it to be fast above all else, while still doing individual decrements, loop unrolling is the order of the day.
Code:
; load A, X, Y with High, Mid, Low bytes of counter respectively
BogoMIPS:
  ; start by accurately decrementing Low byte to zero
  CPY #0
  BEQ :++
: DEY
  BNE :-

  ; from now on, low byte will always go through a complete 256-count cycle, so we can unroll it by any submultiple of 256
  ; decrementing the high 16 bits is then slow-path
: CPX #0
  BNE :++
  CMP #0
  BNE :+
  RTS      ; all counters have reached zero

: DEC A
: DEX
: DEY
  DEY
  DEY
  DEY
  DEY
  DEY
  DEY
  DEY
  DEY
  DEY
  DEY
  DEY
  DEY
  DEY
  DEY
  DEY
  BNE :-
  BEQ :----
This should be typically just over 2 cycles per count, and should work on any 6502. Let's work it out more-or-less exactly:

If the low byte (in Y) is zero, the overhead before main loop entry is 5 cycles. If it's non-zero, then it's 4 cycles to start plus 5 cycles per count. The idea is that you use the low byte only for fine-tuning. The overhead to test the high bytes during function exit is 8 cycles. The minimum execution time is thus 19 cycles including the RTS, when all three counters are zero on entry.

The inner part of the main loop decrements Y 16 times using 35 cycles (taken branch). A complete 256-count cycle that also tests and decrements X (but not A) takes 569 cycles. That's 2.223 cycles per count. Testing and decrementing A as well adds 6 more cycles, but this is only an insignificant 0.01% overhead.

On a 2MHz CPU, 2000000 cycles requires 3514 ($0DBA) main loops with 456 cycles left over. Of these, 18 cycles are occupied in overhead when Y is nonzero, and the correct value of Y is 87 ($57). An entry value of $0DBA57 should actually make the function return after 1999997 cycles.

There is still plenty of headroom here to accommodate, say, a 20MHz CPU. The actual maximum number of cycles you can request (with $FFFFFF) is 18 + 255*5 + 255*(6 + 256*569) = 37,147,143.


Top
 Profile  
Reply with quote  
PostPosted: Tue Aug 29, 2023 6:33 am 
Offline
User avatar

Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10800
Location: England
Oh I like the loop unrolling - not too messy at all, having synchronised the count.

I did have an idea overnight, in a quest to get closer to the 5 cycle count promised - but not delivered - by
Code:
.loop
DEX
BNE loop

and the idea is to split the positive and negative halves of the countdown, so we can use BPL to detect the FF:
Code:
.xstillnegative
DEX
BMI xstillnegative
.xstillpositive
DEX
BPL xstillpositive

As with the unrolled code, one needs to prepare X to the right state before starting off. That's a bit too fiddly for me to attempt off-the-cuff.

As with the unrolled code, we can decrement Y in a more pedestrian way because it's a rare case.

Also, it looks like this will always take X around to FF before falling out of the loop - rather like Wikipedia's C code, we'll finish at -1 rather than 0. Edit: but maybe I can use BNE xnotyetzero as the second branch-back...


Top
 Profile  
Reply with quote  
PostPosted: Tue Aug 29, 2023 9:34 am 
Offline
User avatar

Joined: Wed May 11, 2022 10:34 am
Posts: 16
Location: Germany
In the two's complement proposal, we actually create a negative number equal the positive one and count the two's complement down to zero. I would consider this as a allowed approach.
The cycle count is 5.02 in the first proposal and can be shorten by the unrolling trick down to 2.2 cycles or 0x06EC67 counts @ 1Mhz clock, if needed.

Code:
Count   pha
   txa
   eor #$FF
   tax
   tya
   eor #$FF
   tay
   pla
   eor #$FF
   clc
b0   inx
   bne b0
   beq b2

b1   inx
   inx
   inx
   inx
   inx
   inx
   inx
   inx
   inx
   inx
   inx
   inx
   inx
   inx
   inx
   inx
   bne b1

b2   iny
   bne b1

   adc #01
   bne b1
   rts


Top
 Profile  
Reply with quote  
PostPosted: Tue Aug 29, 2023 1:03 pm 
Offline

Joined: Mon May 21, 2018 8:09 pm
Posts: 1462
So, my code nominally has a speed of under two-and-a-quarter cycles per count, but a resolution of 5 cycles. There is a wrinkle however, which may allow for increasing that resolution in many cases. That wrinkle is the fact that the maximum number of 5-cycle loops takes significantly longer (5*255 = 1275) than one batch of 256 (569, which is not a multiple of 5). Almost two-and-a-quarter times as long, funnily enough.

Thus in the vast majority of cases (when X is nonzero), one batch of 256 could be substituted by 113 ($71) individual counts to subtract 4 cycles, or by 114 ($72) to add one cycle. In nearly a quarter of cases, a second such substitution could be made to subtract 3 from or add 2 to the total. Both these latter options are available if the initial value of Y is less than 28; the "subtract 3" option is also available for Y=28 or 29 ($1C or $1D).

To demonstrate this with my example of a 2 million cycle request, revising the counter from $0DBA57 to $0DB9C9 yields 1,999,998 cycles, closer by one cycle than the original 1,999,997 cycles - and prepending the entire routine with a NOP would cover the remainder. Of course, simply using $0DBA58 would yield 2,000,002 cycles, which is equally close.

In the corner case where X is decremented past zero to perform the above substitution, A also has to be decremented which effectively removes an additional 6 cycles from the total. This requires compensating by incrementing Y and further allowing for the remaining missing cycle.


Top
 Profile  
Reply with quote  
Display posts from previous:  Sort by  
Post new topic Reply to topic  [ 33 posts ]  Go to page 1, 2, 3  Next

All times are UTC


Who is online

Users browsing this forum: No registered users and 5 guests


You cannot post new topics in this forum
You cannot reply to topics in this forum
You cannot edit your posts in this forum
You cannot delete your posts in this forum
You cannot post attachments in this forum

Search for:
Jump to: