6502.org

Posted: **Sat Jan 13, 2024 5:31 pm**

I have looked at several code examples on this subject, as well as the Practical Memory Move Routines by Bruce Clark, but none seems to be fast enough.

Basically, we have a double buffer that needs to be block-moved into display memory.
The code should be for a standard 6502 CPU. A 160x160 pixel buffer has to be transferred into a 256x256 monochrome pixel display.
So we are talking about 3200 bytes.
My code example takes about 52 ms for the transfer (or 16.25 usec per byte), which is about three times the 16.6 ms frame time at 60 Hz.
The 160x160 pixel buffer needs to be positioned on the display in the Y and X directions in line/byte increments.

I hope to find a faster method; it may be obvious, but I can't figure it out.
Thomas

Code: Select all

; Double buffer fast block move for standard 6502 CPU
; Transfer of a 160x160 pixel buffer into a 256x256 pixel display (monochrome)
; Full frame time at 60 Hz is about 16.6 ms



Hires=$8000+512			; pixel display 256x256 (8kb) start at line 16
Buffer=$6c00			; double buffer size 5120 bytes, 3200 bytes used (160x160 pixel)

source=$13			; Source address pointer
target=$15			; Target address pointer

shift=5				; shift byte offset of source data (variable)
offset=2			; horizontal output display offset 2=16 pixel (basically fixed)

	.org $1000		; transfer time 51950 cycles (almost 52msec) needed for 3200 bytes

	lda #>Buffer
	sta source+1
	
	lda #>Hires
	sta target+1
	
	ldx #19			; 20*8 lines to transfer
	
L1:	lda #20+shift		; variable source offset for horizontal shifts to display output
	sta source
	lda #20+offset		; Output at display + 16 pixel
	sta target
	
	jsr transfer
	inc source+1		; next eight lines + 256 byte
	inc target+1
	dex
	bpl L1
	brk
	
	
transfer:			; transfer group of 8 lines
	ldy #$00
	lda (source),y
	sta (target),y
	ldy #$20
	lda (source),y
	sta (target),y
	ldy #$40
	lda (source),y
	sta (target),y
	ldy #$60
	lda (source),y
	sta (target),y
	ldy #$80
	lda (source),y
	sta (target),y
	ldy #$A0
	lda (source),y
	sta (target),y
	ldy #$C0
	lda (source),y
	sta (target),y
	ldy #$E0
	lda (source),y
	sta (target),y
	dec source
	dec target		; dec lower byte of address
	lda target
	cmp #offset-1
	bne transfer
	rts

Posted: **Sat Jan 13, 2024 6:07 pm**

The core of one of Bruce's routines is

Code: Select all

MU1      LDA (FROM),Y
         STA (TO),Y
         DEY
         BNE MU1

and it feels like this could be unrolled to some extent, saving some of the cost of the branching.

But repeated DEY is also something you can avoid, if you adopt self-modifying code. As a minimally advantageous example, the following will transfer at two locations with the same Y value:

Code: Select all

MU1      LDA FROM1,Y
         STA TO1,Y
         LDA FROM2,Y
         STA TO2,Y
         DEY
         BNE MU1

Once you've done this, it's perhaps less advantageous to unroll to save branch cost, as you're already doing multiple moves per loop.

If you put this code in zero page, it's then cheaper to update the (in this case four) pointers because you can use zero page addressing. But then again, you do this update relatively rarely. If that's fine, all you need is to place the code in RAM rather than ROM. As it turns out abs,Y is faster than (zp),Y too, so you might just use self modifying code for that reason, even without reusing Y.
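
As a rough illustration of the abs,Y point, here is a minimal sketch of a page-wise copy whose operands are patched in place. The labels COPY, LOOP and STORE and the addresses SRC and DST are made up for this example, and the routine is assumed to sit in RAM. LDA abs,Y takes 4 cycles (5 if the index crosses a page) and STA abs,Y takes 5, against 5 (or 6) and 6 for the (zp),Y forms.

```
; Sketch only: SRC, DST and all labels are assumptions for illustration.
; Must be assembled into RAM, because it modifies its own operands.
SRC=$6C00			; assumed page-aligned source
DST=$8200			; assumed page-aligned destination

COPY:	ldy #0			; caller sets X = number of 256-byte pages
LOOP:	lda SRC,y		; 4 cycles, high byte of operand is at LOOP+2
STORE:	sta DST,y		; 5 cycles, high byte of operand is at STORE+2
	iny
	bne LOOP		; until Y wraps: one page copied
	inc LOOP+2		; patch source to the next page
	inc STORE+2		; patch destination to the next page
	dex
	bne LOOP		; Y is 0 again here
	rts
```

Per byte this costs 4+5+2+3 = 14 cycles, compared with 16 for the same loop using (zp),Y. Because the operands are left pointing past the copied block, they have to be reset before the routine is used again.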

Posted: **Sat Jan 13, 2024 7:29 pm**

Osi wrote:

I have looked at several code examples on this subject, as well as the Practical Memory Move Routines by Bruce Clark, but none seems to be fast enough...
My code example takes about 52 ms for the transfer (or 16.25 usec per byte), which is about three times the 16.6 ms frame time at 60 Hz...

At what clock rate is your MPU being run? It’s hard to get perspective without knowing that. To me, 52ms seems like a long time to copy a couple of KB.

Posted: **Sat Jan 13, 2024 9:02 pm**

We run at the stock speed of 1 MHz.
In a "best case" scenario, with LDA source,y / STA target,y / INY ..., we have 4+5+2 cycles (usec) per byte, or 35.2 ms for 3200 bytes.
My assumption is that this is probably the best we can get.

BigEd's proposal to make use of zero page to speed things up reminds me of some BASIC interpreters that use this method for screen scrolling or code scanning.
I will give it a try, maybe backing up some zero-page bytes to get more space for the unrolled code.

Posted: **Sat Jan 13, 2024 10:15 pm**

Note that you don't need INY for every pair of loads and stores, so it's not quite 2 cycles of cost.

But I do feel this must have been done many times before - there must be existing discussions or writeups. See perhaps
https://forums.atariage.com/topic/17590 ... mory-copy/
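
To put a number on the shared index update, here is a sketch with four load/store pairs per pass. The SRC0..SRC3 and DST0..DST3 addresses are placeholders chosen for this example (four consecutive 32-byte lines), and the cycle counts assume no page crossing.

```
; Sketch only: all addresses and labels are assumptions for illustration.
SRC0=$6C00
SRC1=SRC0+$20
SRC2=SRC0+$40
SRC3=SRC0+$60
DST0=$8200
DST1=DST0+$20
DST2=DST0+$40
DST3=DST0+$60

	ldy #19			; 20 bytes per line, Y = 19..0
LOOP:	lda SRC0,y		; 4 cycles
	sta DST0,y		; 5 cycles
	lda SRC1,y
	sta DST1,y
	lda SRC2,y
	sta DST2,y
	lda SRC3,y
	sta DST3,y
	dey			; 2 cycles, shared by four bytes
	bpl LOOP		; 3 cycles when taken
	rts
; Per pass: 4*(4+5) + 2 + 3 = 41 cycles for 4 bytes = 10.25 cycles per byte
```

With n pairs per pass the cost is 9 + 5/n cycles per byte: 14 for one pair, 11.5 for two, 10.25 for four. For 3200 bytes at 1 MHz, four pairs come to about 32.8 ms before any pointer-update overhead.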

Posted: **Sat Jan 13, 2024 11:19 pm**

Interesting discussion on AtariAge. I couldn't find it the other day.
Using your zero-page suggestion, I was able to reduce the transfer time to 37 ms. That's a big step forward.
We are now at 11.62 usec per byte. The code is a bit longer, but that's not a problem.
Here is what it looks like:

Code: Select all

; Double buffer fast block move for standard 6502 CPU
; Transfer of a 160x160 pixel buffer into a 256x256 pixel display
; Full frame time at 60 Hz is about 16.6 ms



Hires=$8000+512			; pixel display 256x256 (8kb) start at line 16
Buffer=$6c00			; double buffer size 5120 bytes, 3200 bytes used (160x160 pixel)

Backup=$6B80
ZPtransfer=$80

shift=5				; shift byte offset of source data (variable)
offset=2			; horizontal output display offset 2=16 pixel (basically fixed)


	.org ZPtransfer
LZP:	ldy #19		; 20 bytes (takes 49 bytes in ZeroPage)
S0:	lda Buffer,y	
T0:	sta Hires,y
S1:	lda Buffer,y	; here Buffer + 5*256
T1:	sta Hires,y
S2:	lda Buffer,y	; here Buffer + 10*256
T2:	sta Hires,y
S3:	lda Buffer,y	; here Buffer + 15*256
T3:	sta Hires,y
	dey
	bpl S0
	inc S0+2
	inc S1+2
	inc S2+2
	inc S3+2
	inc T0+2
	inc T1+2
	inc T2+2
	inc T3+2
	dex
	bne LZP		; Loop for 5x4
	rts	

	.org $1000		; transfer time 37234 cycles (37.2msec) needed for 3200 bytes


	ldy #48			; Backup ZeroPage
B1:	lda ZPtransfer,y
	sta Backup,y
	lda ZPCode,y		; store ZeroPage code
	sta ZPtransfer,y
	dey
	bpl B1
	

	ldx #shift
	lda #offset
L1:	stx S0+1
	stx S1+1
	stx S2+1	
	stx S3+1
	
	sta T0+1
	sta T1+1
	sta T2+1	
	sta T3+1
	
	clc
	lda #>Buffer
	sta S0+2
	adc #5
	sta S1+2
	adc #5	
	sta S2+2
	adc #5	
	sta S3+2
	
	lda #>Hires
	sta T0+2
	adc #5
	sta T1+2
	adc #5	
	sta T2+2
	adc #5	
	sta T3+2
	
	ldx #5			; 5*4*1 lines to transfer
	jsr LZP

	clc
	lda S0+1
	adc #$20		; next line
	bcs Exit		; we have reached end of 8 lines
	tax
	lda T0+1
	adc #$20		; next line
	bcc L1
	
Exit:	ldy #48			; Restore ZeroPage
B2:	lda Backup,y
	sta ZPtransfer,y
	dey
	bpl B2
		
	brk
	
	
ZPCode: .db $A0, $13, $B9, $00, $6C, $99, $00, $82, $B9, $00, $6C, $99
	.db $00, $82, $B9, $00, $6C, $99, $00, $82, $B9, $00, $6C, $99
	.db $00, $82, $88, $10, $E5, $E6, $84, $E6, $8A, $E6, $90, $E6
	.db $96, $E6, $87, $E6, $8D, $E6, $93, $E6, $99, $CA, $D0, $D0
	.db $60

Posted: **Sun Jan 14, 2024 9:33 pm**

Maybe I'm missing something here, but why can't you unroll the whole loop if you want the ultimate in speed?

So copying a 160x160 pixel image - and it doesn't matter what you're copying it to as long as it's a linear framebuffer - that's 160 lines of 20 bytes:

Code: Select all

ldy #19 ; Copy 160 pixels of 1 bit each (20 bytes per line)
loop: 
lda source,y ; 4 cycles
sta dest,y ; 5 cycles
lda source+20,y
sta dest+20,y
lda source+40,y
sta dest+40,y
... and so on
dey
bpl loop

Wouldn't that be the fastest?

9 cycles per byte plus some small overhead per loop.

The size is 6 bytes per copy, so 6 * 160 = 960 bytes for the body of the loop, plus a few bytes of overhead.

If you use ca65 then you can do it in a repeat loop:

Here is an example from my own framebuffer code:

Code: Select all

fbCls: 
        ldy     #SCREEN_WIDTHm1    ; Screen width minus 1
        lda     #' ' 
clearLoop: 
        .repeat SCREEN_HEIGHT,i 
        sta     frameBuffer + i * SCREEN_WIDTH,y 
        .endrep
        dey
        bpl     clearLoop
        rts

-Gordon
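
One detail to watch when unrolling all 160 lines: relative branches on the 6502 only reach about 128 bytes backwards, so a 960-byte loop body cannot be closed with BPL. A minimal ca65-style sketch, using the same .repeat directive as above, closes the loop with BMI/JMP instead. SRC, DST, SRC_STRIDE and DST_STRIDE are names invented for this example; the stride of 32 is an assumption for a 256-pixel-wide monochrome layout and would be 20 for a tightly packed buffer.

```
; Sketch only: all names and addresses here are assumptions.
SRC        = $6C00              ; assumed source buffer
DST        = $8200              ; assumed destination in display memory
SRC_STRIDE = 32                 ; assumed bytes per source line
DST_STRIDE = 32                 ; assumed bytes per display line

copyAll:
        ldy     #19             ; 20 bytes per line, Y = 19..0
copyLoop:
        .repeat 160, i
        lda     SRC + i * SRC_STRIDE, y   ; 4 cycles
        sta     DST + i * DST_STRIDE, y   ; 5 cycles
        .endrep
        dey
        bmi     copyDone        ; leave when Y wraps below 0
        jmp     copyLoop        ; JMP has no range limit
copyDone:
        rts
```

The body is 9 cycles per byte, so 3200 bytes take about 28.8 ms at 1 MHz plus a few cycles of loop overhead per pass, provided no LDA crosses a page boundary.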

Posted: **Sun Jan 14, 2024 10:15 pm**

Almost 1 kB of code is quite a lot.
But the main issue is that the source address changes quite a lot.
The destination address on the display is more or less fixed, so that part would work.

I'm not using ca65; I use an older 6502 Macro Assembler/Simulator, which is quite helpful for correct cycle counts.

Posted: **Mon Jan 15, 2024 8:38 am**

Osi wrote:

Almost 1kB of code is quite a lot.

You wanted fast...

Quote:

But the main issue is that the source address changes quite a lot.

Right. So that's the bit I was missing...

Depending on how often it changes, you could write a subroutine that modifies every LDA there to update the source address. If it changes every frame, that would be no good, but there may be a trade-off there somewhere...

And of course that routine would be unrolled, e.g.:

Code: Select all

ldx #<source
ldy #>source
jsr updateSource
...

updateSource:
  stx loop+1
  sty loop+2
  stx loop+7
  sty loop+8
... repeat for the 160 LDA's..
  rts

"only" another 480 bytes of code plus a few bytes of overhead... Hmm..

-Gordon

Posted: **Thu Jan 18, 2024 6:08 pm**

Osi wrote:

We are now at 11.62 usec per byte.

Does that calculation include the overhead of saving the zero-page area beforehand and restoring it afterward?

I was doodling with this problem and came up with some self-modifying code that's about 11% slower. But it doesn't use zero page, so there isn't any added overhead for save/restore.

-- Jeff

Fast and elegant 6502 code to transfer a buffer into display
