Page 2 of 3
Re: Mini-challenge - fastest 24 bit countdown
Posted: Tue Aug 29, 2023 1:54 pm
by teamtempest
@teamtempest - did you miss out a txa somewhere? I see a tax but no txa.
I don't think so. There's no reason to put the value of X (the middle byte in my arrangement) anywhere once it's been counted down to zero. I moved the value of A- into X- because, of course, the 6502 can't decrement A- directly.
On the other hand I did miss the simple loop unrolling trick. Y- and X- hold values that are multiples of 256, which lend themselves to at least double decrements before looping back (the greatest payoff comes from the first unroll, with decreasing returns the more you unroll, ie., the percentage gain decreases with each expansion. But if you want the fastest, that's the way to go).
I don't think it matters much if the low byte gets decremented first or last. I put it last so I wouldn't trash the value in X- until I was done with it. But if you want to unroll even that, you can do something like:
Code: Select all
tax
beq f2
lsr
bcs f1 ; b: odd number count
b1 dex
f1 dex
bne b1
f2 rts
Re: Mini-challenge - fastest 24 bit countdown
Posted: Tue Aug 29, 2023 2:29 pm
by floobydust
I put this together the other day, but never got around to posting it.... I also changed it as Ed noted 6502, not 65C02. Loop unrolling is certainly faster.... but it all depends on how much speed you need.
Code: Select all
; 24-bit decrement routine for BogoMIPS
;
; The registers are usd to pass the 24-bit count value as follows:
;
; A register: High order byte
; X register: Middle order byte
; Y register: Low order byte
;
; To decrement as quicky as possible, register decrement instructions are used
;
; To start, we decrement the high order byte first.
; this requires that the X and Y registers are zeroed first, the the X/Y register pair
; are decremented as a 16-bit pair for each decrement of the A register.
;
; Once the high order byte is decremented to zero, the next decrement is done by loading
; the Y register to zero, then loading the X register with the middle order byte value.
;
; The X/Y register is decremented as a 16-bit pair until the X register is zero.
;
; Once the middle order byte is decremented to zero, we simply load the Y register with the low
; order byte and decrement it to zero, then return.
;
; 24-bit decrement routine:
; To call this routine, load the registers in the following order:
; LDY low
; LDX middle
; LDA high
; JSR DEC24BIT
; do whatever you need to on return
;
DEC24BIT
;
PHY ; Save Low order byte to Stack (3)
PHX ; Save Middle order byte to stack (3)
STA HIGH ; Save High order (no DEC A on NMOS) (3)
;
BEQ SKIP1 ; If A reg is zero, skip down (2/3)
;
LDX #$00 ; Clear X reg (2)
LDY #$00 ; Clear Y reg (2)
;
LOOP1
DEY ; Decrement low count (2)
BNE LOOP1 ; Loop back until zero (2/3)
DEX ; Decrement high count (2)
BNE LOOP1 ; Loop back until zero (2/3)
;
DEC HIGH ; Decrement high order count (5)
BNE LOOP1 ; Loop back for another 16-bit decrement (2/3)
;
SKIP1
TAY ; Zero Y reg (as A reg is zero from above) (2)
PLX ; Get Middle order byte from stack (4)
BEQ SKIP2 ; If zero, skip down to low order decrement (2/3)
;
LOOP2
DEY ; Decrement Y reg (2)
BNE LOOP2 ; Loop back unti zero (2/3)
DEX ; Decrement X reg (2)
BNE LOOP2 ; Loop back until done (2/3)
;
SKIP2
PLY ; Get low order byte back (4)
BEQ DONE ; If zero, branch (2/3)
;
LOOP3 DEY ; Decrement Y reg (2)
BNE LOOP3 ; Loop back until done (2/3)
;
DONE
RTS ; Return to caller (6)
;
Using a 65C02 would reduce the size by a few bytes, but no real advantage in overall speed.
Re: Mini-challenge - fastest 24 bit countdown
Posted: Thu Aug 31, 2023 2:51 pm
by teamtempest
On second thought...
Code: Select all
entry cmp #$00 ; low byte zero ?
beq + ; b: yes
sec
- sbc #$01 ; count down A
bne - ; b: not done
+ cpx #$00 ; mid byte zero ?
beq + ; b: yes
- sec
- sbc #$01 ; count down A
bne - ; b: not done
dex ; count down X
bne -- ; b: not done
+ cpy #$00 ; high byte zero ?
beq + ; b: yes
- sec
- sbc #$01 ; count down A
bne -
dex ; count down X
bne --
dey ; count down Y
bne --
+ rts
The second and third subtractions from A- could be multiples of two if you wanted it to run faster...you could even eliminate them altogether and just decrement X- and Y- if you really wanted fast times...though you'd get wildly inconsistent timings, I think, depending on the input values...
Re: Mini-challenge - fastest 24 bit countdown
Posted: Thu Aug 31, 2023 5:59 pm
by Chromatix
While considering what the optimum degree of unrolling might be, I had an amusing idea. I had designed the inner loop to unroll by an exact submultiple of 256 (in this case, 16) so that the outer loop could cleanly decrement the upper bytes by one. But what if I chose a degree that was coprime to 256, such as 7? This would cause the inner loop to decrement by 7*256 before exiting. The speed would still be high in this case, just below 2.5 cycles per count on average, even though we must now complicate the control structure.
Code: Select all
; load A, X, Y with High, Mid, Low bytes of counter respectively
BogoMIPS:
; start by accurately decrementing Low byte to zero
CPY #0
BEQ @mainLoop
: DEY
BNE :-
@mainLoop:
CPX #7
BCC @single
DEX
DEX
DEX
DEX
DEX
DEX
DEX
BCS @multiple
@single:
CPX #0
BNE :++
CMP #0
BNE :+
RTS
: SBC #1 ; Carry is already set by CMP
: DEX
DEY ; 256 % 7 = 4
DEY
DEY
DEY
@multiple:
DEY
DEY
DEY
DEY
DEY
DEY
DEY
BNE @multiple
BEQ @mainLoop
The innermost loop performs 7 decrements of Y in 17 cycles, for 2.429 cycles per count. On the fast path this repeats 256 times, after decrementing X 7 times, then returns to the top of the main loop. This takes 4375 cycles to perform 1792 decrements, for 2.441 cycles per count.
The slower path is typically performed 4 times per high-byte count, in which X is decremented once, and if necessary A is tested and decremented, while Y is pre-decremented 4 times so that the inner loop performs 252 decrements instead of 1792. This slow-path loop takes 634 cycles to count down by 256, for 2.476 cycles per count, if X is nonzero. If X is zero, add 6 cycles to test and decrement A, for exactly 2.5 cycles per count over that batch of 256.
It takes 63 more cycles to run the slow path 7 times than to run the fast path once.
The total cycle count per A decrement (ie. 65536 counts) is 36 fast paths, 4 slow paths, and 1 supplement for actually decrementing A. That's 160,042 cycles, so 2.442 cycles per count.
As before, we have a 5 cycle per count loop on entry for fine tuning. The cycle count if entered with zeroed counters is 24, including the RTS - this is due to the additional overhead of comparing X against 7. If Y is nonzero on entry, the overhead beyond the various loop execution times above is reduced to 23 cycles. The maximum cycle count is 255 A-decrements, 36 more fast-paths, 3 more slow-paths, 255 fine-tuning loops, and the function overhead: 40,971,410. That's enough to implement a 2-second delay at 20MHz.
I have not yet checked how densely the possibilities for precise cycle counts are covered. The calculations will be more complex than for my earlier version.
Re: Mini-challenge - fastest 24 bit countdown
Posted: Sun Sep 03, 2023 6:33 pm
by Michael
BigEd and gang:
I still can't figure out what bogoMIPS is or does. If one second requires 1 million clock cycles on a 1-MHz 6502 or 4 million clock cycles on a 4-MHz 6502, what does bogoMIPS actually measure?
I wonder if my
fixed delay sub-system for PIC microcontrollers might be useful if ported to the 65C02? It uses a macro that calls an isochronous timing loop that counts loops rather than clock cycles with provision for "cycle accurate" delays. It's kind of neat because you can write code with delays that will work with almost any clock by simply changing the 'clock' constant in the source and possibly the 'loop' size to accommodate larger delays. Please note that larger loops do not affect the minimum 24 cycle delay. Anyway, here's my untested amateur attempt at something that perhaps looks a little bit like 65C02 assembler code. Will something like this work?
Code: Select all
;~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
; 65C02 "cycle accurate" delay sub-system concept Mike, K8LH ~
; ~
;~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
clock equ 4 ; 1, 2, 4, 8, 16 (MHz clock)
usecs equ clock ; cycles/microsecond multiplier
msecs equ usecs*1000 ; cycles/millisecond multiplier
dloop equ 9 ; loop size (minimum 9)
delayCy macro cycles ; 24 cycles minimum
if (cycles<24)|(cycles>(dloop*65536+23))
error " delayCy() range error "
endif
ldx hi((cycles-24)/dloop)+1 ; (2)
lda lo((cycles-24)/dloop)+1 ; (2)
jsr uloop-((cycles-24)%dloop) ; (6)
endm
;~~~~< example macro call >~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
delayCy(104*usecs-5) ; 104 microseconds minus 5 cycles
;~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
; K8LH 65C02 delayCy() subsystem 16-bit delay subroutine ~
; ~
; cyc = $0B (a single-cycle 'nop' on 65C02 devices) ~
;~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
cyc ; (cycles-24)%dloop = 8 entry (1)
cyc ; (cycles-24)%dloop = 7 entry (1)
cyc ; (cycles-24)%dloop = 6 entry (1)
cyc ; (cycles-24)%dloop = 5 entry (1)
cyc ; (cycles-24)%dloop = 4 entry (1) <
cyc ; (cycles-24)%dloop = 3 entry (1) |
cyc ; (cycles-24)%dloop = 2 entry (1) |
cyc ; (cycles-24)%dloop = 1 entry (1) |
uloop dec A ; decrement 16-bit loop counter (2) |
bne uloop-dloop+5 ; zero? no, branch, else (2)(3)^
dex ; decrement 16-bit hi byte (2)
bne uloop-dloop+9 ; zero? no, branch, else (3)(2)
rts ; exit (6)
;~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Re: Mini-challenge - fastest 24 bit countdown
Posted: Sun Sep 03, 2023 6:40 pm
by BigEd
Let me see if I can illuminate... the point of BogoMIPS, initially, as part of the Linux kernel, is to equip a dumb loop with a suitable count such that it produces a useful delay. It's not intended as a benchmark... but of course it feels like one. Of course, different CPUs at different clock speeds need different counts to make the same real delay. So a tabulation of the counts-per-second starts to look like a ranking of CPUs.
It's also true that different C compilers (and options) will make a better or worse job of the decrementing loop. (They might even optimise the loop away into nothing, which would be bad in the this case.)
If a C compiler isn't available, or is too smart, one might write the loop by hand. And as it turns out, that's what we're most likely to do with 6502. (It would be interesting to see what various C compilers can manage, here. One might see, for example, a full four byte subtraction with a constant $0001, followed by a four byte comparison with $0000. Or one might possibly see a byte-by-byte decrement.)
Re: Mini-challenge - fastest 24 bit countdown
Posted: Tue Sep 05, 2023 10:09 am
by Michael
So is code with the highest "counts per second" the goal of this exercise?
Re: Mini-challenge - fastest 24 bit countdown
Posted: Tue Sep 05, 2023 10:50 am
by BigEd
I think so, as well as having fun!
I did realise something: on an '02, we can't DEC A. But we can SBC #1, and although it costs two bytes, it's still only two cycles. All we have to do is take care of the carry bit, and we can use the accumulator for the LSB, which for other reasons has an appeal to me. (ADC #$FF similarly, if counting down, and similarly if we decide counting up is fair game. I think perhaps ADC #$FF will almost always set the carry, so ADC #$FE might be the ticket. Too difficult to think of the equivalent for SBC!)
Re: Mini-challenge - fastest 24 bit countdown
Posted: Tue Sep 05, 2023 10:59 am
by Chromatix
If the carry is set, then SBC #1 is correct. If the carry is clear, then SBC #0 does the same. This is because the carry bit becomes an inverted borrow bit for SBC. In fact, SBC is exactly equivalent to ADC with a bitwise-complemented operand.
Re: Mini-challenge - fastest 24 bit countdown
Posted: Tue Sep 05, 2023 11:24 am
by Osi
@BigEd
this is what my preferred compiler spits out:
Ref: C-Compiler example: KickC version 0.85
Code: Select all
static void delay_loop(long loops)
{
long d0 = loops;
do {
--d0;
} while (d0 >= 0);
}
translates to
.segment Code
delay_loop: {
.label d0 = $a
__b1:
lda.z d0
sec
sbc #1
sta.z d0
lda.z d0+1
sbc #0
sta.z d0+1
lda.z d0+2
sbc #0
sta.z d0+2
lda.z d0+3
sbc #0
sta.z d0+3
bpl __b1
rts
}
translates to
$05A1 LDA $0A
$05A3 SEC
$05A4 SBC #01
$05A6 STA $0A
$05A8 LDA $0B
$05AA SBC #00
$05AC STA $0B
$05AE LDA $0C
$05B0 SBC #00
$05B2 STA $0C
$05B4 LDA $0D
$05B6 SBC #00
$05B8 STA $0D
$05BA BPL $05A1
$05BC RTS
Re: Mini-challenge - fastest 24 bit countdown
Posted: Tue Sep 05, 2023 11:29 am
by BigEd
Thanks Osi, that's interesting. A simple subtraction with a 32 bit constant, but a trivial test for going past zero.
Re: Mini-challenge - fastest 24 bit countdown
Posted: Tue Sep 05, 2023 12:31 pm
by Chromatix
If I take the above and hand-massage it, while keeping the 32-bit data type:
Code: Select all
DelayLoop:
.label d0 = $A
INC d0
INC d0+1
INC d0+2
: DEC d0
BNE :-
DEC d0+1
BNE :-
DEC d0+2
BNE :-
DEC d0+3
BPL :-
RTS
This has an innermost loop taking 8 cycles per count, and the efficiency is barely diminished by the length of the counter.
Re: Mini-challenge - fastest 24 bit countdown
Posted: Tue Sep 05, 2023 3:32 pm
by IamRob
This would be faster for OSI's idea.
SEC
]LP LDA $0
SBC #1
STA $0
TXA
SBC #0
TAX
TYA
SBC #0
TAY
BNE ]LP
RTS
For the unrolling, not sure why the CPX #0 or CPY #0 are being used as the branch instruction already takes care of the case where they are equal to zero. Just add 1 to the each of the registers. Maybe something like this then?
; Add 1 to each register so they end in zero and not -1
INX
INY
CLC
ADC #1
SEC
:1 DEX
BNE :1
BEQ :3
:2 DEX
... ; do DEX 64 times - can't do 128 as the branches below don't go that far and we must be a multiple to get to 256
BNE :2
:3 DEY
BNE :2
SBC #1
BNE :2 ; carry will always be set
RTS
Re: Mini-challenge - fastest 24 bit countdown
Posted: Tue Sep 05, 2023 3:43 pm
by BigEd
George mentioned offsetting the registers earlier. I found I couldn't quite think clearly enough to see if that's a byte-wise offset or a full width offset with carries...
Re: Mini-challenge - fastest 24 bit countdown
Posted: Tue Sep 05, 2023 3:58 pm
by Chromatix
It's byte-wise; see my previous post for an example. The entire point is that decrements don't update the carry, so you can't detect the transition from $00 to $FF which you would like to use to "borrow" from the next higher byte or, in the case of the highest byte, terminate the entire routine. By pre-incrementing the bytes, you can instead detect the transition to $00.