6502.org Forum  Projects  Code  Documents  Tools  Forum
It is currently Thu Oct 31, 2024 10:57 pm

All times are UTC




Post new topic Reply to topic  [ 33 posts ]  Go to page Previous  1, 2, 3  Next
Author Message
PostPosted: Tue Aug 29, 2023 1:54 pm 
Offline

Joined: Sun Nov 08, 2009 1:56 am
Posts: 408
Location: Minnesota
Quote:
@teamtempest - did you miss out a txa somewhere? I see a tax but no txa.


I don't think so. There's no reason to put the value of X (the middle byte in my arrangement) anywhere once it's been counted down to zero. I moved the value of A- into X- because, of course, the 6502 can't decrement A- directly.

On the other hand I did miss the simple loop unrolling trick. Y- and X- hold values that are multiples of 256, which lend themselves to at least double decrements before looping back (the greatest payoff comes from the first unroll, with decreasing returns the more you unroll, ie., the percentage gain decreases with each expansion. But if you want the fastest, that's the way to go).

I don't think it matters much if the low byte gets decremented first or last. I put it last so I wouldn't trash the value in X- until I was done with it. But if you want to unroll even that, you can do something like:

Code:
     tax
     beq f2
     lsr
     bcs f1   ; b: odd number count

b1   dex

f1   dex

     bne b1

f2   rts


Top
 Profile  
Reply with quote  
PostPosted: Tue Aug 29, 2023 2:29 pm 
Offline
User avatar

Joined: Tue Mar 05, 2013 4:31 am
Posts: 1385
I put this together the other day, but never got around to posting it.... I also changed it as Ed noted 6502, not 65C02. Loop unrolling is certainly faster.... but it all depends on how much speed you need.

Code:
; 24-bit decrement routine for BogoMIPS
;
; The registers are usd to pass the 24-bit count value as follows:
;
;       A register:     High order byte
;       X register:     Middle order byte
;       Y register:     Low order byte
;
; To decrement as quicky as possible, register decrement instructions are used
;
; To start, we decrement the high order byte first.
;  this requires that the X and Y registers are zeroed first, the the X/Y register pair
;  are decremented as a 16-bit pair for each decrement of the A register.
;
; Once the high order byte is decremented to zero, the next decrement is done by loading
; the Y register to zero, then loading the X register with the middle order byte value.
;
; The X/Y register is decremented as a 16-bit pair until the X register is zero.
;
; Once the middle order byte is decremented to zero, we simply load the Y register with the low
; order byte and decrement it to zero, then return.
;
; 24-bit decrement routine:
; To call this routine, load the registers in the following order:
;       LDY     low
;       LDX     middle
;       LDA     high
;       JSR     DEC24BIT
;       do whatever you need to on return
;
DEC24BIT
;
                PHY             ; Save Low order byte to Stack (3)
                PHX             ; Save Middle order byte to stack (3)
                STA     HIGH    ; Save High order (no DEC A on NMOS) (3)
;
                BEQ     SKIP1   ; If A reg is zero, skip down (2/3)
;
                LDX     #$00    ; Clear X reg (2)
                LDY     #$00    ; Clear Y reg (2)
;
LOOP1
                DEY             ; Decrement low count (2)
                BNE     LOOP1   ; Loop back until zero (2/3)
                DEX             ; Decrement high count (2)
                BNE     LOOP1   ; Loop back until zero (2/3)
;
                DEC     HIGH    ; Decrement high order count (5)
                BNE     LOOP1   ; Loop back for another 16-bit decrement (2/3)
;
SKIP1
                TAY             ; Zero Y reg (as A reg is zero from above) (2)
                PLX             ; Get Middle order byte from stack (4)
                BEQ     SKIP2   ; If zero, skip down to low order decrement (2/3)
;
LOOP2
                DEY             ; Decrement Y reg (2)
                BNE     LOOP2   ; Loop back unti zero (2/3)
                DEX             ; Decrement X reg (2)
                BNE     LOOP2   ; Loop back until done (2/3)
;
SKIP2
                PLY             ; Get low order byte back (4)
                BEQ     DONE    ; If zero, branch (2/3)
;
LOOP3           DEY             ; Decrement Y reg (2)
                BNE     LOOP3   ; Loop back until done (2/3)
;
DONE
                RTS             ; Return to caller (6)
;


Using a 65C02 would reduce the size by a few bytes, but no real advantage in overall speed.

_________________
Regards, KM
https://github.com/floobydust


Top
 Profile  
Reply with quote  
PostPosted: Thu Aug 31, 2023 2:51 pm 
Offline

Joined: Sun Nov 08, 2009 1:56 am
Posts: 408
Location: Minnesota
On second thought...


Code:
entry   cmp #$00   ; low byte zero ?
        beq +      ; b: yes
        sec
-       sbc #$01   ; count down A
        bne -      ; b: not done

+       cpx #$00   ; mid byte zero ?
        beq +        ; b: yes

-       sec
-       sbc #$01   ; count down A
        bne -      ; b: not done
 
        dex         ; count down X
        bne --      ; b: not done

+       cpy #$00    ; high byte zero ?
        beq +       ; b: yes
 
-       sec
-       sbc #$01   ; count down A
        bne -

        dex         ; count down X
        bne --

        dey         ; count down Y
        bne --

+       rts


The second and third subtractions from A- could be multiples of two if you wanted it to run faster...you could even eliminate them altogether and just decrement X- and Y- if you really wanted fast times...though you'd get wildly inconsistent timings, I think, depending on the input values...


Top
 Profile  
Reply with quote  
PostPosted: Thu Aug 31, 2023 5:59 pm 
Offline

Joined: Mon May 21, 2018 8:09 pm
Posts: 1462
While considering what the optimum degree of unrolling might be, I had an amusing idea. I had designed the inner loop to unroll by an exact submultiple of 256 (in this case, 16) so that the outer loop could cleanly decrement the upper bytes by one. But what if I chose a degree that was coprime to 256, such as 7? This would cause the inner loop to decrement by 7*256 before exiting. The speed would still be high in this case, just below 2.5 cycles per count on average, even though we must now complicate the control structure.
Code:
; load A, X, Y with High, Mid, Low bytes of counter respectively
BogoMIPS:
  ; start by accurately decrementing Low byte to zero
  CPY #0
  BEQ @mainLoop
: DEY
  BNE :-

@mainLoop:
  CPX #7
  BCC @single
  DEX
  DEX
  DEX
  DEX
  DEX
  DEX
  DEX
  BCS @multiple

@single:
  CPX #0
  BNE :++
  CMP #0
  BNE :+
  RTS

: SBC #1   ; Carry is already set by CMP
: DEX
  DEY        ; 256 % 7 = 4
  DEY
  DEY
  DEY

@multiple:
  DEY
  DEY
  DEY
  DEY
  DEY
  DEY
  DEY
  BNE @multiple
  BEQ @mainLoop
The innermost loop performs 7 decrements of Y in 17 cycles, for 2.429 cycles per count. On the fast path this repeats 256 times, after decrementing X 7 times, then returns to the top of the main loop. This takes 4375 cycles to perform 1792 decrements, for 2.441 cycles per count.

The slower path is typically performed 4 times per high-byte count, in which X is decremented once, and if necessary A is tested and decremented, while Y is pre-decremented 4 times so that the inner loop performs 252 decrements instead of 1792. This slow-path loop takes 634 cycles to count down by 256, for 2.476 cycles per count, if X is nonzero. If X is zero, add 6 cycles to test and decrement A, for exactly 2.5 cycles per count over that batch of 256.

It takes 63 more cycles to run the slow path 7 times than to run the fast path once.

The total cycle count per A decrement (ie. 65536 counts) is 36 fast paths, 4 slow paths, and 1 supplement for actually decrementing A. That's 160,042 cycles, so 2.442 cycles per count.

As before, we have a 5 cycle per count loop on entry for fine tuning. The cycle count if entered with zeroed counters is 24, including the RTS - this is due to the additional overhead of comparing X against 7. If Y is nonzero on entry, the overhead beyond the various loop execution times above is reduced to 23 cycles. The maximum cycle count is 255 A-decrements, 36 more fast-paths, 3 more slow-paths, 255 fine-tuning loops, and the function overhead: 40,971,410. That's enough to implement a 2-second delay at 20MHz.

I have not yet checked how densely the possibilities for precise cycle counts are covered. The calculations will be more complex than for my earlier version.


Top
 Profile  
Reply with quote  
PostPosted: Sun Sep 03, 2023 6:33 pm 
Offline
User avatar

Joined: Wed Feb 13, 2013 1:38 pm
Posts: 589
Location: Michigan, USA
BigEd and gang:

I still can't figure out what bogoMIPS is or does. If one second requires 1 million clock cycles on a 1-MHz 6502 or 4 million clock cycles on a 4-MHz 6502, what does bogoMIPS actually measure?

I wonder if my fixed delay sub-system for PIC microcontrollers might be useful if ported to the 65C02? It uses a macro that calls an isochronous timing loop that counts loops rather than clock cycles with provision for "cycle accurate" delays. It's kind of neat because you can write code with delays that will work with almost any clock by simply changing the 'clock' constant in the source and possibly the 'loop' size to accommodate larger delays. Please note that larger loops do not affect the minimum 24 cycle delay. Anyway, here's my untested amateur attempt at something that perhaps looks a little bit like 65C02 assembler code. Will something like this work?

Code:
;~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
;  65C02 "cycle accurate" delay sub-system concept    Mike, K8LH  ~
;                                                                 ~
;~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

clock   equ     4               ; 1, 2, 4, 8, 16 (MHz clock)
usecs   equ     clock           ; cycles/microsecond multiplier
msecs   equ     usecs*1000      ; cycles/millisecond multiplier
dloop   equ     9               ; loop size (minimum 9)

delayCy macro   cycles                      ; 24 cycles minimum
    if (cycles<24)|(cycles>(dloop*65536+23))
        error " delayCy() range error "
    endif
        ldx     hi((cycles-24)/dloop)+1     ;                (2)
        lda     lo((cycles-24)/dloop)+1     ;                (2)
        jsr     uloop-((cycles-24)%dloop)   ;                (6)
        endm

;~~~~< example macro call >~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

        delayCy(104*usecs-5)  ; 104 microseconds minus 5 cycles

;~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
;  K8LH 65C02 delayCy() subsystem 16-bit delay subroutine         ~
;                                                                 ~
;  cyc = $0B (a single-cycle 'nop' on 65C02 devices)              ~
;~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
        cyc                 ; (cycles-24)%dloop = 8 entry    (1)
        cyc                 ; (cycles-24)%dloop = 7 entry    (1)
        cyc                 ; (cycles-24)%dloop = 6 entry    (1)
        cyc                 ; (cycles-24)%dloop = 5 entry    (1)
        cyc                 ; (cycles-24)%dloop = 4 entry    (1)  <
        cyc                 ; (cycles-24)%dloop = 3 entry    (1)   |
        cyc                 ; (cycles-24)%dloop = 2 entry    (1)   |
        cyc                 ; (cycles-24)%dloop = 1 entry    (1)   |
uloop   dec A               ; decrement 16-bit loop counter  (2)   |
        bne uloop-dloop+5   ; zero? no, branch, else         (2)(3)^
        dex                 ; decrement 16-bit hi byte       (2)
        bne uloop-dloop+9   ; zero? no, branch, else         (3)(2)
        rts                 ; exit                              (6)
;~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~


Top
 Profile  
Reply with quote  
PostPosted: Sun Sep 03, 2023 6:40 pm 
Offline
User avatar

Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10971
Location: England
Let me see if I can illuminate... the point of BogoMIPS, initially, as part of the Linux kernel, is to equip a dumb loop with a suitable count such that it produces a useful delay. It's not intended as a benchmark... but of course it feels like one. Of course, different CPUs at different clock speeds need different counts to make the same real delay. So a tabulation of the counts-per-second starts to look like a ranking of CPUs.

It's also true that different C compilers (and options) will make a better or worse job of the decrementing loop. (They might even optimise the loop away into nothing, which would be bad in the this case.)

If a C compiler isn't available, or is too smart, one might write the loop by hand. And as it turns out, that's what we're most likely to do with 6502. (It would be interesting to see what various C compilers can manage, here. One might see, for example, a full four byte subtraction with a constant $0001, followed by a four byte comparison with $0000. Or one might possibly see a byte-by-byte decrement.)


Top
 Profile  
Reply with quote  
PostPosted: Tue Sep 05, 2023 10:09 am 
Offline
User avatar

Joined: Wed Feb 13, 2013 1:38 pm
Posts: 589
Location: Michigan, USA
So is code with the highest "counts per second" the goal of this exercise?


Top
 Profile  
Reply with quote  
PostPosted: Tue Sep 05, 2023 10:50 am 
Offline
User avatar

Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10971
Location: England
I think so, as well as having fun!

I did realise something: on an '02, we can't DEC A. But we can SBC #1, and although it costs two bytes, it's still only two cycles. All we have to do is take care of the carry bit, and we can use the accumulator for the LSB, which for other reasons has an appeal to me. (ADC #$FF similarly, if counting down, and similarly if we decide counting up is fair game. I think perhaps ADC #$FF will almost always set the carry, so ADC #$FE might be the ticket. Too difficult to think of the equivalent for SBC!)


Top
 Profile  
Reply with quote  
PostPosted: Tue Sep 05, 2023 10:59 am 
Offline

Joined: Mon May 21, 2018 8:09 pm
Posts: 1462
If the carry is set, then SBC #1 is correct. If the carry is clear, then SBC #0 does the same. This is because the carry bit becomes an inverted borrow bit for SBC. In fact, SBC is exactly equivalent to ADC with a bitwise-complemented operand.


Top
 Profile  
Reply with quote  
PostPosted: Tue Sep 05, 2023 11:24 am 
Offline
User avatar

Joined: Wed May 11, 2022 10:34 am
Posts: 20
Location: Germany
@BigEd
this is what my preferred compiler spits out:
Ref: C-Compiler example: KickC version 0.85
Code:
static void delay_loop(long loops)
{
  long d0 = loops;
  do {
    --d0;
  } while (d0 >= 0);
}

translates to

.segment Code
delay_loop: {
    .label d0 = $a
  __b1:
    lda.z d0
    sec
    sbc #1
    sta.z d0
    lda.z d0+1
    sbc #0
    sta.z d0+1
    lda.z d0+2
    sbc #0
    sta.z d0+2
    lda.z d0+3
    sbc #0
    sta.z d0+3
    bpl __b1
    rts
}

translates to

$05A1   LDA $0A
$05A3   SEC
$05A4   SBC #01
$05A6   STA $0A
$05A8   LDA $0B
$05AA   SBC #00
$05AC   STA $0B
$05AE   LDA $0C
$05B0   SBC #00
$05B2   STA $0C
$05B4   LDA $0D
$05B6   SBC #00
$05B8   STA $0D
$05BA   BPL $05A1
$05BC   RTS


Top
 Profile  
Reply with quote  
PostPosted: Tue Sep 05, 2023 11:29 am 
Offline
User avatar

Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10971
Location: England
Thanks Osi, that's interesting. A simple subtraction with a 32 bit constant, but a trivial test for going past zero.


Top
 Profile  
Reply with quote  
PostPosted: Tue Sep 05, 2023 12:31 pm 
Offline

Joined: Mon May 21, 2018 8:09 pm
Posts: 1462
If I take the above and hand-massage it, while keeping the 32-bit data type:
Code:
DelayLoop:
  .label d0 = $A
  INC d0
  INC d0+1
  INC d0+2
: DEC d0
  BNE :-
  DEC d0+1
  BNE :-
  DEC d0+2
  BNE :-
  DEC d0+3
  BPL :-
  RTS
This has an innermost loop taking 8 cycles per count, and the efficiency is barely diminished by the length of the counter.


Top
 Profile  
Reply with quote  
PostPosted: Tue Sep 05, 2023 3:32 pm 
Offline

Joined: Sun Apr 26, 2020 3:08 am
Posts: 357
This would be faster for OSI's idea.

SEC
]LP LDA $0
SBC #1
STA $0
TXA
SBC #0
TAX
TYA
SBC #0
TAY
BNE ]LP
RTS


For the unrolling, not sure why the CPX #0 or CPY #0 are being used as the branch instruction already takes care of the case where they are equal to zero. Just add 1 to the each of the registers. Maybe something like this then?

; Add 1 to each register so they end in zero and not -1
INX
INY
CLC
ADC #1
SEC

:1 DEX
BNE :1
BEQ :3

:2 DEX
... ; do DEX 64 times - can't do 128 as the branches below don't go that far and we must be a multiple to get to 256

BNE :2

:3 DEY
BNE :2

SBC #1
BNE :2 ; carry will always be set

RTS


Last edited by IamRob on Wed Sep 06, 2023 2:48 am, edited 1 time in total.

Top
 Profile  
Reply with quote  
PostPosted: Tue Sep 05, 2023 3:43 pm 
Offline
User avatar

Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10971
Location: England
George mentioned offsetting the registers earlier. I found I couldn't quite think clearly enough to see if that's a byte-wise offset or a full width offset with carries...


Top
 Profile  
Reply with quote  
PostPosted: Tue Sep 05, 2023 3:58 pm 
Offline

Joined: Mon May 21, 2018 8:09 pm
Posts: 1462
It's byte-wise; see my previous post for an example. The entire point is that decrements don't update the carry, so you can't detect the transition from $00 to $FF which you would like to use to "borrow" from the next higher byte or, in the case of the highest byte, terminate the entire routine. By pre-incrementing the bytes, you can instead detect the transition to $00.


Top
 Profile  
Reply with quote  
Display posts from previous:  Sort by  
Post new topic Reply to topic  [ 33 posts ]  Go to page Previous  1, 2, 3  Next

All times are UTC


Who is online

Users browsing this forum: No registered users and 1 guest


You cannot post new topics in this forum
You cannot reply to topics in this forum
You cannot edit your posts in this forum
You cannot delete your posts in this forum
You cannot post attachments in this forum

Search for:
Jump to:  
cron