6502.org Forum

All times are UTC




Posted: Sat Jan 13, 2024 5:31 pm

Joined: Wed May 11, 2022 10:34 am
Posts: 20
Location: Germany
I had a look at several code examples on this subject, as well as the Practical Memory Move Routines by Bruce Clark, but none seems to be fast enough.

Basically, we have a double buffer that needs to be block-moved into display memory.
The code should be for a standard 6502 CPU. A 160x160 pixel buffer has to be transferred into a 256x256 monochrome pixel display,
so we are talking about 3200 bytes.
My code example takes about 52 ms for the transfer (or 16.25 usec per byte), which is about three times the 16.6 ms frame time at 60 Hz.
The 160x160 pixel buffer needs to be positioned on the display in the Y and X directions in line/byte increments.

I hope to find a faster method; it may be obvious, but I can't figure it out.
Thomas
Code:
; Double buffer fast block move for standard 6502 CPU
; Transfer of a 160x160 pixel buffer into a 256x256 pixel display (monochrome)
; Full frame time at 60 Hz is about 16.6 ms



Hires=$8000+512         ; pixel display 256x256 (8kb) start at line 16
Buffer=$6c00         ; double buffer size 5120 bytes, 3200 bytes used (160x160 pixel)

source=$13         ; Source address pointer
target=$15         ; Target address pointer

shift=5            ; shift byte offset of source data (variable)
offset=2         ; horizontal output display offset 2=16 pixel (basically fixed)

   .org $1000      ; transfer time 51950 cycles (almost 52msec) needed for 3200 bytes

   lda #>Buffer
   sta source+1
   
   lda #>Hires
   sta target+1
   
   ldx #19         ; 20*8 lines to transfer
   
L1:   lda #20+shift      ; variable source offset for horizontal shifts of the display output
   sta source
   lda #20+offset      ; Output at display + 16 pixel
   sta target      ; note: counting from 20+offset down to offset copies 21 columns per line
   
   jsr transfer
   inc source+1      ; next eight lines + 256 byte
   inc target+1
   dex
   bpl L1
   brk
   
   
transfer:         ; transfer group of 8 lines
   ldy #$00
   lda (source),y
   sta (target),y
   ldy #$20
   lda (source),y
   sta (target),y
   ldy #$40
   lda (source),y
   sta (target),y
   ldy #$60
   lda (source),y
   sta (target),y
   ldy #$80
   lda (source),y
   sta (target),y
   ldy #$A0
   lda (source),y
   sta (target),y
   ldy #$C0
   lda (source),y
   sta (target),y
   ldy #$E0
   lda (source),y
   sta (target),y
   dec source
   dec target      ; dec lower byte of address
   lda target
   cmp #offset-1
   bne transfer
   rts   
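
As a rough cross-check of the figures above, the cycle count of this listing can be modelled from standard NMOS 6502 instruction timings. This is only a sketch: it assumes a 1 MHz clock, no page-crossing penalties, and that the inner loop runs 21 times per group (the low byte runs from 20+offset down to offset inclusive, as the code is written). The variable names are made up for illustration.

```python
# Rough cycle model of the (zp),Y listing, using standard NMOS 6502 timings.
# Assumption: no page-crossing penalties (source low byte + $E0 stays below $100).
CLOCK_HZ = 1_000_000            # stock 1 MHz clock (assumed)
BYTES = 160 * 160 // 8          # 160x160 monochrome pixels -> 3200 bytes

LDY_IMM, LDA_INDY, STA_INDY = 2, 5, 6
DEC_ZP, LDA_ZP, CMP_IMM, BRANCH_TAKEN, RTS, JSR = 5, 3, 2, 3, 6, 6

column = 8 * (LDY_IMM + LDA_INDY + STA_INDY)              # 8 lines per column
column += 2 * DEC_ZP + LDA_ZP + CMP_IMM + BRANCH_TAKEN    # pointer update and test

COLUMNS = 21                    # 20+offset down to offset, inclusive, as written
transfer = COLUMNS * column - 1 + RTS                     # final BNE falls through

outer = 2 + 3 + 2 + 3 + JSR + 2 * 5 + 2 + 3   # L1 setup, JSR, 2x INC zp, DEX, BPL
total = 12 + 20 * (transfer + outer) - 1      # 12 cycles of one-time setup

print(total, "cycles =", total / CLOCK_HZ * 1e3, "ms")
print("per byte:", round(total / BYTES, 2), "cycles")
print("frames at 60 Hz:", round(total / (CLOCK_HZ / 60), 2))
```

Under these assumptions the model lands within a few dozen cycles of the 51950 quoted in the listing. Copying exactly 20 columns would save one inner iteration (122 cycles) per group of 8 lines.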


Posted: Sat Jan 13, 2024 6:07 pm

Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10986
Location: England
The core of one of Bruce's routines is
Code:
MU1      LDA (FROM),Y
         STA (TO),Y
         DEY
         BNE MU1

and it feels like this could be unrolled to some extent, saving some of the cost of the branching.

But repeated DEY is also something you can avoid, if you adopt self-modifying code. As a minimally advantageous example, the following will transfer at two locations with the same Y value:
Code:
MU1      LDA FROM1,Y
         STA TO1,Y
         LDA FROM2,Y
         STA TO2,Y
         DEY
         BNE MU1

Once you've done this, it's perhaps less advantageous to unroll to save branch cost, as you're already doing multiple moves per loop.

If you put this code in zero page, it's then cheaper to update the (in this case four) pointers because you can use zero page addressing. But then again, you do this update relatively rarely. If that's fine, all you need is to place the code in RAM rather than ROM. As it turns out abs,Y is faster than (zp),Y too, so you might just use self modifying code for that reason, even without reusing Y.
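
To make the trade-off concrete, here is a small sketch of the per-byte cost of such a loop, using standard NMOS 6502 timings and assuming no page crossings. The function name and parameters are illustrative only.

```python
# Per-byte cost of an indexed copy loop on an NMOS 6502 (no page crossings assumed).
LDA_ABS_Y, STA_ABS_Y = 4, 5     # absolute,Y: operand patched by self-modifying code
LDA_IND_Y, STA_IND_Y = 5, 6     # (zp),Y: pointer kept in zero page
DEY, BNE_TAKEN = 2, 3

def cycles_per_byte(load, store, pairs_per_loop):
    """Average cycles per byte when one DEY/BNE serves several LDA/STA pairs."""
    return load + store + (DEY + BNE_TAKEN) / pairs_per_loop

for pairs in (1, 2, 4, 8):
    print(pairs,
          cycles_per_byte(LDA_IND_Y, STA_IND_Y, pairs),
          cycles_per_byte(LDA_ABS_Y, STA_ABS_Y, pairs))
```

With one pair per loop, abs,Y costs 14 cycles per byte against 16 for (zp),Y; sharing Y between more pairs moves the abs,Y figure toward the 9-cycle floor of the LDA/STA pair itself.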


Posted: Sat Jan 13, 2024 7:29 pm

Joined: Thu May 28, 2009 9:46 pm
Posts: 8509
Location: Midwestern USA
Osi wrote:
I had a look to several code examples on this subject and as well into the Practical Memory Move Routines by Bruce Clark, but none seems to be fast enough...
My code example takes about 52ms for the transfer (or 16.25 usec per byte) and that is three times the frame rate at 60Hz...

At what clock rate is your MPU being run? It’s hard to get perspective without knowing that. To me, 52ms seems like a long time to copy a couple of KB.

_________________
x86?  We ain't got no x86.  We don't NEED no stinking x86!


Posted: Sat Jan 13, 2024 9:02 pm

Joined: Wed May 11, 2022 10:34 am
Posts: 20
Location: Germany
We are running at the stock speed of 1 MHz.
In a "best case" scenario with LDA source,y / STA target,y / INY, we have 4+5+2 = 11 cycles (usec) per byte, or 35.2 ms for 3200 bytes.
My assumption is that this is probably the best we can get.

BigEd's proposal to use zero page to speed things up reminds me of some BASIC interpreters that use this method for screen scrolling or code scanning.
I will give it a try, maybe backing up some zero-page bytes to get more space for the unrolled code.
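
For context, the estimate can be set against the frame budget. The sketch below assumes a 1 MHz clock and a 60 Hz frame, and compares the 11-cycle case with the bare 9-cycle LDA abs,Y / STA abs,Y pair.

```python
# How much can an LDA abs,Y / STA abs,Y copy move in one 60 Hz frame at 1 MHz?
CLOCK_HZ = 1_000_000
FRAME_HZ = 60
BYTES = 3200

cycles_per_frame = CLOCK_HZ // FRAME_HZ          # about 16666 cycles
for cycles_per_byte in (11, 9):                  # with INY per byte / bare LDA+STA
    total = BYTES * cycles_per_byte
    print(cycles_per_byte, "cycles/byte:", total / 1000, "ms,",
          cycles_per_frame // cycles_per_byte, "bytes per frame")
```

Even at 9 cycles per byte, 3200 bytes need about 28.8 ms, so under these assumptions a full copy with this instruction pair takes closer to two frames than one.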


Posted: Sat Jan 13, 2024 10:15 pm

Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10986
Location: England
Note that you don't need INY for every pair of loads and stores, so it's not quite 2 cycles of cost.

But I do feel this must have been done many times before - there must be existing discussions or writeups. See perhaps
https://forums.atariage.com/topic/17590 ... mory-copy/


Posted: Sat Jan 13, 2024 11:19 pm

Joined: Wed May 11, 2022 10:34 am
Posts: 20
Location: Germany
Interesting discussion on AtariAge. I couldn't find it the other day.
Using your zero-page suggestion, I was able to reduce the transfer time to 37 ms. That's a big step forward.
We are now at 11.62 usec per byte. The code is a bit longer, but that's not a problem.
Here is what it looks like:

Code:
; Double buffer fast block move for standard 6502 CPU
; Transfer of an 160x160 pixel buffer into a 256x256 pixel display
; Full frame time at 60 Hz is about 16.6 ms



Hires=$8000+512         ; pixel display 256x256 (8kb) start at line 16
Buffer=$6c00         ; double buffer size 5120 bytes, 3200 bytes used (160x160 pixel)

Backup=$6B80
ZPtransfer=$80

shift=5            ; shift byte offset of source data (variable)
offset=2         ; horizontal output display offset 2=16 pixel (basically fixed)


   .org ZPtransfer
LZP:   ldy #19      ; 20 bytes (takes 49 bytes in ZeroPage)
S0:   lda Buffer,y   
T0:   sta Hires,y
S1:   lda Buffer,y   ; here Buffer + 5*256
T1:   sta Hires,y
S2:   lda Buffer,y   ; here Buffer + 10*256
T2:   sta Hires,y
S3:   lda Buffer,y   ; here Buffer + 15*256
T3:   sta Hires,y
   dey
   bpl S0
   inc S0+2
   inc S1+2
   inc S2+2
   inc S3+2
   inc T0+2
   inc T1+2
   inc T2+2
   inc T3+2
   dex
   bne LZP      ; Loop for 5x4
   rts   

   .org $1000      ; transfer time 37234 cycles (37.2msec) needed for 3200 bytes


   ldy #48         ; Backup ZeroPage
B1:   lda ZPtransfer,y
   sta Backup,y
   lda ZPCode,y      ; store ZeroPage code
   sta ZPtransfer,y
   dey
   bpl B1
   

   ldx #shift
   lda #offset
L1:   stx S0+1
   stx S1+1
   stx S2+1   
   stx S3+1
   
   sta T0+1
   sta T1+1
   sta T2+1   
   sta T3+1
   
   clc
   lda #>Buffer
   sta S0+2
   adc #5
   sta S1+2
   adc #5   
   sta S2+2
   adc #5   
   sta S3+2
   
   lda #>Hires
   sta T0+2
   adc #5
   sta T1+2
   adc #5   
   sta T2+2
   adc #5   
   sta T3+2
   
   ldx #5         ; 5*4*1 lines to transfer
   jsr LZP

   clc
   lda S0+1
   adc #$20      ; next line
   bcs Exit      ; we have reached end of 8 lines
   tax
   lda T0+1
   adc #$20      ; next line
   bcc L1
   
Exit:   ldy #48         ; Restore ZeroPage
B2:   lda Backup,y
   sta ZPtransfer,y
   dey
   bpl B2
      
   brk
   
   
ZPCode: .db $A0, $13, $B9, $00, $6C, $99, $00, $82, $B9, $00, $6C, $99
   .db $00, $82, $B9, $00, $6C, $99, $00, $82, $B9, $00, $6C, $99
   .db $00, $82, $88, $10, $E5, $E6, $84, $E6, $8A, $E6, $90, $E6
   .db $96, $E6, $87, $E6, $8D, $E6, $93, $E6, $99, $CA, $D0, $D0
   .db $60
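
Summing standard NMOS 6502 timings over this listing reproduces the 37234 cycles given in the comment, which suggests that figure already includes the zero-page backup and restore loops. The sketch below works under that reading, assuming no page-crossing penalties and leaving out the final BRK; the variable names are illustrative.

```python
# Cycle model of the zero-page listing (standard NMOS 6502 timings, no page crossings).
# LDA/STA ZPtransfer,y are counted as absolute,Y since the 6502 has no LDA zp,Y form.
BYTES = 3200
ZP_LEN = 49                                          # bytes of code copied to zero page

backup  = 2 + ZP_LEN * (4 + 5 + 4 + 5 + 2 + 3) - 1   # save ZP and install the code
restore = 2 + ZP_LEN * (4 + 5 + 2 + 3) - 1           # put ZP back

inner = 20 * (4 * (4 + 5) + 2 + 3) - 1               # 20 Y values, 4 LDA/STA pairs each
lzp   = 5 * (2 + inner + 8 * 5 + 2 + 3) - 1 + 6      # 5 rounds, 8x INC zp, then RTS

setup    = 8 * 3 + (2 + 2 + 4 * 3 + 3 * 2) + (2 + 4 * 3 + 3 * 2) + 2 + 6  # L1 .. JSR
advance  = 2 + 3 + 2 + 2 + 2 + 3 + 2 + 3             # CLC .. BCC L1 (taken)
last_adv = 2 + 3 + 2 + 3                             # CLC, LDA, ADC, BCS Exit (taken)

# 4 cycles for LDX #shift / LDA #offset, then 8 passes (one per line within a group)
copy  = 4 + 7 * (setup + lzp + advance) + (setup + lzp + last_adv)
total = backup + copy + restore

print("copy only:", copy, "cycles,", round(copy / BYTES, 2), "per byte")
print("with save/restore:", total, "cycles,", round(total / BYTES, 2), "per byte")
```

Under these assumptions the copy itself takes about 35.4k cycles, and the save and restore of zero page add roughly 1.8k more.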


Posted: Sun Jan 14, 2024 9:33 pm

Joined: Wed Feb 14, 2018 2:33 pm
Posts: 1488
Location: Scotland
Maybe I'm missing something here, but why can't you unroll the whole loop if you want the ultimate in speed?

So copying a 160x160 pixel image - and it doesn't matter what you're copying it to as long as it's a linear framebuffer - that's 160 lines of 20 bytes

Code:
ldy #19 ; Copy 160 pixels at 1 bit each = 20 bytes per line
loop:
lda source,y ; 4 cycles
sta dest,y ; 5 cycles
lda source+20,y
sta dest+20,y
lda source+40,y
sta dest+40,y
... and so on
dey
bpl loop ; only reaches back 128 bytes; a longer body needs bmi/jmp instead



Wouldn't that be the fastest?

9 cycles per byte plus some small overhead per loop.

the size is 6 bytes per copy, so 6 * 160 = 960 bytes for the body of the loop plus a few bytes overhead.
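
A quick sketch of the size and time of the fully unrolled version, using standard NMOS 6502 timings and assuming no page crossings. One caveat that is an assumption of this sketch rather than part of the post: a relative branch only reaches -128..+127 bytes, so a 960-byte body cannot be closed with a plain BPL, and the model uses DEY / BMI done / JMP loop instead.

```python
# Size and time of a fully unrolled 160-line copy (NMOS 6502, no page crossings).
LINES, BYTES_PER_LINE = 160, 20
PAIR_BYTES, PAIR_CYCLES = 3 + 3, 4 + 5          # LDA abs,Y + STA abs,Y

body_bytes = LINES * PAIR_BYTES                  # code size of the unrolled body
# A relative branch only reaches -128..+127 bytes, so the loop is assumed to
# close with DEY / BMI done / JMP loop instead of DEY / BPL loop.
loop_tail = 2 + 2 + 3                            # DEY, BMI (not taken), JMP
last_tail = 2 + 3                                # DEY, BMI (taken)

total = (2                                       # LDY #19
         + (BYTES_PER_LINE - 1) * (LINES * PAIR_CYCLES + loop_tail)
         + (LINES * PAIR_CYCLES + last_tail))
print(body_bytes, "bytes of code,", total, "cycles,",
      round(total / (LINES * BYTES_PER_LINE), 2), "cycles per byte")
```

At 1 MHz that is a little under 29 ms for the 3200 bytes, so the loop overhead is almost negligible next to the 9 cycles of each LDA/STA pair.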

If you use ca65 then you can do it in a repeat loop:

Here is an example from my own framebuffer code:

Code:
fbCls:
        ldy     #SCREEN_WIDTHm1    ; Screen width minus 1
        lda     #' '
clearLoop:
        .repeat SCREEN_HEIGHT,i
        sta     frameBuffer + i * SCREEN_WIDTH,y
        .endrep
        dey
        bpl     clearLoop
        rts



-Gordon

_________________
--
Gordon Henderson.
See my Ruby 6502 and 65816 SBC projects here: https://projects.drogon.net/ruby/


Posted: Sun Jan 14, 2024 10:15 pm

Joined: Wed May 11, 2022 10:34 am
Posts: 20
Location: Germany
Almost 1 kB of code is quite a lot.
But the main issue is that the source address changes quite a lot.
The destination address on the display is more or less fixed, so that part would work.

I'm not using ca65; I use an older 6502 Macro Assembler/Simulator, which is quite helpful for getting correct cycle counts.
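
For assemblers without a repeat directive, one option is to generate the unrolled body with a short script and paste or include the result. The sketch below is hypothetical: the label names, the `lda addr,y` syntax and the 32-byte row pitch are assumptions and would need adjusting to the assembler and memory layout in use. The caller is expected to load Y with the last column index before entering the loop.

```python
# Emit an unrolled copy loop as plain assembler text, for assemblers without .repeat.
# Label names, the 'lda addr,y' syntax and the 32-byte row pitch are assumptions.
def unrolled_copy(lines, pitch, src="source", dst="dest"):
    out = ["loop:"]
    for i in range(lines):
        off = i * pitch                          # byte offset of line i
        out.append(f"    lda {src}+{off},y")
        out.append(f"    sta {dst}+{off},y")
    # BMI/JMP instead of BPL, since a long body is out of relative-branch range
    out += ["    dey", "    bmi done", "    jmp loop", "done:"]
    return "\n".join(out)

print(unrolled_copy(lines=4, pitch=32))
```

Calling `unrolled_copy(160, 32)` would produce the full 160-pair body for a buffer with 32 bytes per row.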


Posted: Mon Jan 15, 2024 8:38 am

Joined: Wed Feb 14, 2018 2:33 pm
Posts: 1488
Location: Scotland
Osi wrote:
Almost 1kB of code is quite a lot.


You wanted fast...

Quote:
But the main issue is, that the source address changes quite a lot.


Right. So that's the bit I was missing...

Depending on how often it changes, you could write a subroutine that modifies every LDA there to update the source address. If it changes every frame, that would be no good, but there may be a trade-off there, somewhere...

And of course that routine would be unrolled, so e.g.:

Code:
ldx #<source
ldy #>source
jsr updateSource
...

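updateSource:
  stx loop+1      ; low byte of the first LDA operand
  sty loop+2      ; high byte of the first LDA operand
  stx loop+7      ; second LDA (each LDA/STA pair is 6 bytes)
  sty loop+8
                  ; (as sketched, every LDA gets the same address; each needs its own line offset added)
... repeat for the 160 LDA's..
  rts


"only" another 960 bytes of code (two 3-byte stores for each of the 160 LDAs) plus a few bytes of overhead... Hmm..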

-Gordon
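
A rough estimate of what the patch routine sketched above would cost, assuming absolute STX/STY stores (3 bytes, 4 cycles each) and two stores per LDA. It ignores the extra work of giving each LDA its own line offset, so the real figure would be higher.

```python
# Cost of patching every LDA operand in a 160-pair unrolled loop (NMOS 6502).
LDAS = 160
STX_ABS = STY_ABS = (3, 4)                       # (bytes, cycles) for an absolute store

patch_bytes  = LDAS * (STX_ABS[0] + STY_ABS[0])  # two stores per LDA
patch_cycles = LDAS * (STX_ABS[1] + STY_ABS[1]) + 6 + 6   # plus JSR and RTS
print(patch_bytes, "bytes,", patch_cycles, "cycles per source change")
```

That comes to 6 bytes of patch code per LDA (3 if only one address byte needed patching) and roughly 1.3 ms per source change at 1 MHz.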

_________________
--
Gordon Henderson.
See my Ruby 6502 and 65816 SBC projects here: https://projects.drogon.net/ruby/


Posted: Thu Jan 18, 2024 6:08 pm

Joined: Fri Dec 11, 2009 3:50 pm
Posts: 3367
Location: Ontario, Canada
Osi wrote:
We are now at 11.62 usec per byte.
Does that calculation include the overhead of saving the zero-page area beforehand and restoring it afterward?

I was doodling with this problem and came up with some self-modifying code that's about 11% slower. But it doesn't use zero-page, so there isn't any added overhead for save/restore.

-- Jeff

_________________
In 1988 my 65C02 got six new registers and 44 new full-speed instructions!
https://laughtonelectronics.com/Arcana/ ... mmary.html

