Fast and elegant 6502 code to transfer a buffer into display
Posted: Sat Jan 13, 2024 5:31 pm
by Osi
I had a look at several code examples on this subject, as well as the Practical Memory Move Routines by Bruce Clark, but none seems to be fast enough.
Basically we have a double buffer that needs to be block moved into the display memory.
The code should be for a standard 6502 CPU. A 160x160 pixel buffer has to be transferred into a 256x256 pixel monochrome display.
So we are talking about 3200 bytes.
My code example takes about 52 ms for the transfer (or 16.25 usec per byte), which is about three times the 16.6 ms frame time at 60 Hz.
The 160x160 pixel buffer needs to be positioned on the display in the Y and X directions in line/byte increments.
I hope to find a faster method; it may be obvious, but I can't figure it out.
Thomas
Code: Select all
; Double buffer fast block move for standard 6502 CPU
; Transfer of an 160x160 pixel buffer into a 256x256 pixel display (monochrome)
; 60 Hz full frame time is about 16.6ms (60Hz)
Hires=$8000+512 ; pixel display 256x256 (8kb) start at line 16
Buffer=$6c00 ; double buffer size 5120 bytes, 3200 bytes used (160x160 pixel)
source=$13 ; Source address pointer
target=$15 ; Target address pointer
shift=5 ; shift byte offset of source data (variable)
offset=2 ; horizontal output display offset 2=16 pixel (basically fixed)
.org $1000 ; transfer time 51950 cycles (almost 52msec) needed for 3200 bytes
lda #>Buffer
sta source+1
lda #>Hires
sta target+1
ldx #19 ; 20*8 lines to transfer
L1: lda #20+shift ; variable source offset for horizontal shifts to display output
sta source
lda #20+offset ; Output at display + 16 pixel
sta target
jsr transfer
inc source+1 ; next eight lines + 256 byte
inc target+1
dex
bpl L1
brk
transfer: ; transfer group of 8 lines
ldy #$00
lda (source),y
sta (target),y
ldy #$20
lda (source),y
sta (target),y
ldy #$40
lda (source),y
sta (target),y
ldy #$60
lda (source),y
sta (target),y
ldy #$80
lda (source),y
sta (target),y
ldy #$A0
lda (source),y
sta (target),y
ldy #$C0
lda (source),y
sta (target),y
ldy #$E0
lda (source),y
sta (target),y
dec source
dec target ; dec lower byte of address
lda target
cmp #offset-1
bne transfer
rts
Re: Fast and elegant 6502 code to transfer a buffer into dis
Posted: Sat Jan 13, 2024 6:07 pm
by BigEd
The core of one of Bruce's routines is
Code: Select all
MU1 LDA (FROM),Y
STA (TO),Y
DEY
BNE MU1
and it feels like this could be unrolled to some extent, saving some of the cost of the branching.
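As a rough sketch of that unrolling (not taken from Bruce's article), here are two copies per pass, so the branch is paid once per two bytes. The equates for FROM and TO are placeholder zero-page addresses, and the byte count in Y is assumed to be even; with an odd count Y is never zero when the BNE is reached, so the loop would not terminate.
Code: Select all
; Sketch only. FROM/TO addresses are assumptions; both pointers must be set up beforehand.
FROM = $13 ; zero-page source pointer (placeholder address)
TO = $15 ; zero-page destination pointer (placeholder address)
; On entry: Y = byte count, must be even (Y = 0 copies 256 bytes)
MU1 LDA (FROM),Y ; 5 cycles (6 if a page boundary is crossed)
STA (TO),Y ; 6 cycles
DEY ; 2 cycles
LDA (FROM),Y
STA (TO),Y
DEY
BNE MU1 ; 3 cycles when taken, paid once per two bytes
By the usual cycle counts that is 29 cycles per two bytes (14.5 per byte) against 16 per byte for the original loop, ignoring page-crossing penalties.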
But repeated DEY is also something you can avoid, if you adopt self-modifying code. As a minimally advantageous example, the following will transfer at two locations with the same Y value:
Code: Select all
MU1
LDA FROM1,Y
STA TO1,Y
LDA FROM2,Y
STA TO2,Y
DEY
BNE MU1
Once you've done this, it's perhaps less advantageous to unroll to save branch cost, as you're already doing multiple moves per loop.
If you put this code in zero page, it's then cheaper to update the (in this case four) pointers because you can use zero page addressing. But then again, you do this update relatively rarely. If that's fine, all you need is to place the code in RAM rather than ROM. As it turns out abs,Y is faster than (zp),Y too, so you might just use self modifying code for that reason, even without reusing Y.
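A minimal sketch of the self-modifying variant, assuming the routine is assembled into RAM. The labels SRC1/DST1, the start addresses and the page count are placeholders, not values from the thread. The page pointers are simply the high bytes of the two operands, which the code bumps itself:
Code: Select all
; Sketch only: copies 4 pages (1024 bytes). Must run from RAM.
LDX #4 ; number of 256-byte pages (placeholder)
LDY #0
SRC1 LDA $6C00,Y ; 4 cycles, versus 5 for LDA (zp),Y
DST1 STA $8200,Y ; 5 cycles, versus 6 for STA (zp),Y
INY
BNE SRC1 ; Y wraps to 0 after 256 bytes
INC SRC1+2 ; next source page: high byte of the LDA operand
INC DST1+2 ; next destination page: high byte of the STA operand
DEX
BNE SRC1
RTS
Because the operands are overwritten as the copy proceeds, their high bytes have to be reset before the routine is used again. With the code in zero page the two INCs would use zero-page addressing (5 cycles each instead of 6).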
Re: Fast and elegant 6502 code to transfer a buffer into dis
Posted: Sat Jan 13, 2024 7:29 pm
by BigDumbDinosaur
I had a look to several code examples on this subject and as well into the Practical Memory Move Routines by Bruce Clark, but none seems to be fast enough...
My code example takes about 52ms for the transfer (or 16.25 usec per byte) and that is three times the frame rate at 60Hz...
At what clock rate is your MPU being run? It’s hard to get perspective without knowing that. To me, 52ms seems like a long time to copy a couple of KB.
Re: Fast and elegant 6502 code to transfer a buffer into dis
Posted: Sat Jan 13, 2024 9:02 pm
by Osi
We are running at the stock speed of 1 MHz.
In a "best case" scenario, a loop of LDA source,y / STA target,y / INY costs 4+5+2 = 11 cycles (11 usec) per byte, or 35.2 ms for 3200 bytes.
My assumption is that this is probably the best we can get.
BigEd's proposal to make use of zero page to speed things up reminds me of some BASIC interpreters that use this method for screen scrolling or code scanning.
I will give it a try, maybe backing up some zero-page bytes to get more space for the unrolled code.
Re: Fast and elegant 6502 code to transfer a buffer into dis
Posted: Sat Jan 13, 2024 10:15 pm
by BigEd
Note that you don't need INY for every pair of loads and stores, so it's not quite 2 cycles of cost.
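To put rough numbers on that: a single load/store pair with its own index update and taken branch costs 4+5+2+3 = 14 cycles per byte. Here is a sketch with four pairs sharing one index update, using placeholder addresses and a 4 x 64-byte split that are not from the thread:
Code: Select all
; Sketch only: copies 256 bytes as four interleaved 64-byte runs.
SRC = $6C00 ; placeholder source address (page aligned)
DST = $8200 ; placeholder destination address
ldy #63
copy: lda SRC,y ; 4 cycles (no page crossing here)
sta DST,y ; 5 cycles
lda SRC+64,y
sta DST+64,y
lda SRC+128,y
sta DST+128,y
lda SRC+192,y
sta DST+192,y
dey ; 2 cycles, shared by four bytes
bpl copy ; 3 cycles when taken, shared by four bytes
That works out to (4*9 + 2 + 3) / 4 = 10.25 cycles per byte, assuming no page crossings on the loads.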
But I do feel this must have been done many times before - there must be existing discussions or writeups. See perhaps
https://forums.atariage.com/topic/17590 ... mory-copy/
Re: Fast and elegant 6502 code to transfer a buffer into dis
Posted: Sat Jan 13, 2024 11:19 pm
by Osi
Interesting discussion on AtariAge. I couldn't find it the other day.
Using your zero-page suggestion, I was able to reduce the transfer time to 37 ms. That's a big step forward.
We are now at 11.62 usec per byte. The code is a bit longer, but that's not a problem.
Here is what it looks like:
Code: Select all
; Double buffer fast block move for standard 6502 CPU
; Transfer of an 160x160 pixel buffer into a 256x256 pixel display
; 60 Hz full frame time is about 16.6ms (60Hz)
Hires=$8000+512 ; pixel display 256x256 (8kb) start at line 16
Buffer=$6c00 ; double buffer size 5120 bytes, 3200 bytes used (160x160 pixel)
Backup=$6B80
ZPtransfer=$80
shift=5 ; shift byte offset of source data (variable)
offset=2 ; horizontal output display offset 2=16 pixel (basically fixed)
.org ZPtransfer
LZP: ldy #19 ; 20 bytes (takes 49 bytes in ZeroPage)
S0: lda Buffer,y
T0: sta Hires,y
S1: lda Buffer,y ; here Buffer + 5*256
T1: sta Hires,y
S2: lda Buffer,y ; here Buffer + 10*256
T2: sta Hires,y
S3: lda Buffer,y ; here Buffer + 15*256
T3: sta Hires,y
dey
bpl S0
inc S0+2
inc S1+2
inc S2+2
inc S3+2
inc T0+2
inc T1+2
inc T2+2
inc T3+2
dex
bne LZP ; Loop for 5x4
rts
.org $1000 ; transfer time 37234 cycles (37.2msec) needed for 3200 bytes
ldy #48 ; Backup ZeroPage
B1: lda ZPtransfer,y
sta Backup,y
lda ZPCode,y ; store ZeroPage code
sta ZPtransfer,y
dey
bpl B1
ldx #shift
lda #offset
L1: stx S0+1
stx S1+1
stx S2+1
stx S3+1
sta T0+1
sta T1+1
sta T2+1
sta T3+1
clc
lda #>Buffer
sta S0+2
adc #5
sta S1+2
adc #5
sta S2+2
adc #5
sta S3+2
lda #>Hires
sta T0+2
adc #5
sta T1+2
adc #5
sta T2+2
adc #5
sta T3+2
ldx #5 ; 5*4*1 lines to transfer
jsr LZP
clc
lda S0+1
adc #$20 ; next line
bcs Exit ; we have reached end of 8 lines
tax
lda T0+1
adc #$20 ; next line
bcc L1
Exit: ldy #48 ; Restore ZeroPage
B2: lda Backup,y
sta ZPtransfer,y
dey
bpl B2
brk
ZPCode: .db $A0, $13, $B9, $00, $6C, $99, $00, $82, $B9, $00, $6C, $99
.db $00, $82, $B9, $00, $6C, $99, $00, $82, $B9, $00, $6C, $99
.db $00, $82, $88, $10, $E5, $E6, $84, $E6, $8A, $E6, $90, $E6
.db $96, $E6, $87, $E6, $8D, $E6, $93, $E6, $99, $CA, $D0, $D0
.db $60
Re: Fast and elegant 6502 code to transfer a buffer into dis
Posted: Sun Jan 14, 2024 9:33 pm
by drogon
Maybe I'm missing something here, but why can't you unroll the whole loop if you want the ultimate in speed?
So copying a 160x160 pixel image - and it doesn't matter what you're copying it to as long as it's a linear framebuffer - that's 160 lines of 20 bytes:
Code: Select all
ldy #19 ; Copy 160 pixels of 1 bit each (20 bytes) per line
loop:
lda source,y ; 4 cycles
sta dest,y ; 5 cycles
lda source+20,y
sta dest+20,y
lda source+40,y
sta dest+40,y
... and so on
dey
bpl loop
Wouldn't that be the fastest?
9 cycles per byte plus some small overhead per loop.
the size is 6 bytes per copy, so 6 * 160 = 960 bytes for the body of the loop plus a few bytes overhead.
If you use ca65 then you can do it in a repeat loop. Here is an example from my own framebuffer code:
Code: Select all
fbCls:
ldy #SCREEN_WIDTHm1 ; Screen width minus 1
lda #' '
clearLoop:
.repeat SCREEN_HEIGHT,i
sta frameBuffer + i * SCREEN_WIDTH,y
.endrep
dey
bpl clearLoop
rts
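The same .repeat idea applied to the copy might look like the sketch below. The addresses, the constant names and the two row strides are assumptions for illustration (20-byte packed source rows as in the loop further up, 32-byte rows for a 256-pixel-wide display). With 160 pairs the loop body is 960 bytes, which is beyond the reach of a relative branch, so this sketch closes the loop with bmi/jmp instead of bpl.
Code: Select all
; ca65 sketch only; every name and address below is a placeholder.
source = $6C00
dest = $8200
LINES = 160
BYTES = 20 ; bytes copied per line
SRC_STRIDE = 20 ; distance between source rows
DST_STRIDE = 32 ; distance between display rows
copyFrame:
ldy #BYTES-1
copyLoop:
.repeat LINES, i
lda source + i * SRC_STRIDE, y ; 4 cycles (+1 on a page crossing)
sta dest + i * DST_STRIDE, y ; 5 cycles
.endrep
dey
bmi copyDone ; body is too long for "bpl copyLoop"
jmp copyLoop
copyDone:
rts
Each pass costs 160*9 + 7 = 1447 cycles, so 20 passes come to roughly 29,000 cycles (about 29 ms at 1 MHz) before any page-crossing penalties on the loads.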
-Gordon
Re: Fast and elegant 6502 code to transfer a buffer into dis
Posted: Sun Jan 14, 2024 10:15 pm
by Osi
Almost 1kB of code is quite a lot.
But the main issue is that the source address changes quite a lot.
The destination address on the display is more or less fixed, so that part would work.
I'm not using ca65; I use an older 6502 Macro Assembler/Simulator, which is quite helpful for getting correct cycle counts.
Re: Fast and elegant 6502 code to transfer a buffer into dis
Posted: Mon Jan 15, 2024 8:38 am
by drogon
Almost 1kB of code is quite a lot.
You wanted fast...
But the main issue is, that the source address changes quite a lot.
Right. So that's the bit I was missing...
Depending on how often it changes, you could write a subroutine that modifies every LDA there to update the source address. If it changes every frame, that would be no good, but there may be a trade-off there somewhere...
(And of course that routine would be unrolled too, e.g.:)
Code: Select all
ldx #<source
ldy #>source
jsr updateSource
...
updateSource:
stx loop+1
sty loop+2
stx loop+7
sty loop+8
... repeat for the 160 LDA's..
rts
"only" another 480 bytes of code plus a few bytes of overhead... Hmm..
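One way the patching could be narrowed, as a hedged ca65 sketch: it assumes that only the horizontal shift of the source changes, that the source buffer is page aligned with rows 32 bytes apart (8 rows per page), and that each unrolled lda/sta pair is 6 bytes. None of these names or addresses come from the thread. Under those assumptions only the low byte of each LDA operand needs rewriting, and there are just 8 distinct values.
Code: Select all
; ca65 sketch only; all names and addresses are placeholders. Must run from RAM.
Buffer = $6C00 ; source, page aligned, 32 bytes per row
Hires = $8200 ; destination, 32 bytes per row
PAIR = 6 ; size of one lda abs,y / sta abs,y pair
copyFrame:
ldy #19
loop:
.repeat 160, i
lda Buffer + i * 32, y ; low byte of this operand is patched
sta Hires + 2 + i * 32, y ; +2 = fixed 16-pixel display offset
.endrep
dey
bmi done ; body is too long for a relative branch back
jmp loop
done:
rts
; In: A = horizontal shift in bytes (0..12). A is changed on exit.
updateShift:
.repeat 8, row
.repeat 20, page
sta loop + 1 + (page * 8 + row) * PAIR ; low byte of one LDA operand
.endrep
.if row < 7
clc
adc #32 ; same shift, next row within the page
.endif
.endrep
rts
Calling updateShift with the shift in A before copyFrame costs about 670 cycles (160 stores at 4 cycles each plus the adds), and the patch routine itself is about 500 bytes.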
-Gordon
Re: Fast and elegant 6502 code to transfer a buffer into dis
Posted: Thu Jan 18, 2024 6:08 pm
by Dr Jefyll
We are now at 11.62 usec per byte.
Does that calculation include the overhead of saving the zero-page area beforehand and restoring it afterward?
I was doodling with this problem and came up with some self-modifying code that's about 11% slower. But it doesn't use zero-page, so there isn't any added overhead for save/restore.
-- Jeff