Posted: Sat May 04, 2019 8:42 pm

Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10760
Location: England
Just read this and thought it worth repeating here:
Quote:
So provided that the memory block I wanted to move resided in the bottom 64K of RAM, and provided I didn't need to move it outside that region (both of which conditions were true for the screen-scrolling stuff I was writing at the time) I could load Direct Page with the end address of the source block, turn off interrupts, load the stack pointer with the end address of the destination, and run an unrolled sequence of PEI instructions to do the move. At 6 cycles per two bytes moved that came to 3 cycles per byte, more than twice as fast as the inbuilt memory move instruction.

(Posted by flabdablet here with this preamble for context)


Posted: Sat May 04, 2019 8:45 pm

Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10760
Location: England
(Lots of other interesting low-level programming quips in that thread.)


Posted: Sun May 05, 2019 10:04 am

Joined: Tue Mar 02, 2004 8:55 am
Posts: 996
Location: Berkshire, UK
I don't see the optimization here.

OK, so PEI pushes two bytes (6 cycles), but you then have to modify the address to point at the next pair (two zero-page DECs, as the transfer is towards low memory: 10 cycles), test for the end of the transfer (assuming the word count is in X or Y, DEX/DEY is 2 cycles), and branch back to go around again (BNE is 3 cycles on all but the last branch).

So in total it's 6+10+2+3 = 21 cycles per word, or 10.5 per byte, which is worse than the 7 cycles per byte of MVP/MVN.
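For anyone who wants to check the arithmetic, here is a small Python sketch of the same tally. The per-instruction costs are the ones quoted in this post, and the 7 cycles per byte for MVN/MVP is the usual figure for those instructions. The function names are made up for illustration only.

```python
# Cycle costs as quoted in the post (direct-page DEC in 8-bit mode assumed).
PEI_CYCLES = 6            # PEI with the low byte of D equal to zero
DEC_DP_CYCLES = 5         # one direct-page DEC; two are needed per word
DEX_CYCLES = 2            # DEX or DEY
BNE_TAKEN_CYCLES = 3      # taken branch
MVN_CYCLES_PER_BYTE = 7   # MVN/MVP cost per byte moved


def looped_pei_cycles_per_byte():
    """One PEI per loop iteration, pointer adjusted by two DECs each time."""
    per_word = PEI_CYCLES + 2 * DEC_DP_CYCLES + DEX_CYCLES + BNE_TAKEN_CYCLES
    return per_word / 2


def unrolled_pei_cycles_per_byte():
    """Back-to-back PEIs with no per-word loop overhead."""
    return PEI_CYCLES / 2


print(looped_pei_cycles_per_byte())    # 10.5
print(unrolled_pei_cycles_per_byte())  # 3.0
```

The gap between the two results is entirely the per-word loop overhead, which is what the "unrolled sequence" in the quoted text avoids.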

_________________
Andrew Jacobs
6502 & PIC Stuff - http://www.obelisk.me.uk/
Cross-Platform 6502/65C02/65816 Macro Assembler - http://www.obelisk.me.uk/dev65/
Open Source Projects - https://github.com/andrew-jacobs


Posted: Sun May 05, 2019 10:10 am

Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10760
Location: England
Hmm. Indeed. How odd - it works as a block fill with a fully unrolled sequence of PEI, which could be useful in a scroll routine to clear the new line, but it isn't a block move.


Posted: Sun May 05, 2019 1:24 pm

Joined: Fri Dec 11, 2009 3:50 pm
Posts: 3328
Location: Ontario, Canada
This flabdablet character is capable of some seriously twisted thinking. 8) :shock: Here's my take on his/her impressive hack.

The inner loop is tidy, and would look like this (below). I chose to unroll by eight, but only to keep the illustration brief. To make the thing worthwhile you'd want to unroll further than that.

All the surrounding code (which I didn't bother to work out) would be rather untidy. You'd have to start by setting up the Stack Pointer and the Direct Page Register; and check that at least 16 bytes remain to be moved before each iteration of the inner loop is allowed to proceed. After the final iteration you'd need to use conventional coding to move the remaining bytes, if any. (Depending on circumstances, you might be able to ensure there are none.)

Code:
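PEI $E      ; push the 16-bit word at D+$E (high byte first, so byte order is preserved)
PEI $C
PEI $A
PEI $8
PEI $6
PEI $4
PEI $2
PEI $0      ; last word of this 16-byte chunk
TDC         ; copy D into the 16-bit accumulator
SEC
SBC #$10    ; step down 16 bytes (assumes 16-bit accumulator, M=0)
TCD         ; new Direct Page base for the next pass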

To avoid a one-cycle penalty for every PEI, you need the low eight bits of the Direct Page Register to equal zero -- IOW keep D page-aligned. For that you'd have to unroll the loop so it contains 128 PEIs (256 bytes, one full page per pass, so D steps by $100 and stays aligned) -- a reasonable tradeoff, if speed is a priority and memory is cheap.
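To make the data flow concrete, here is a rough Python model of the inner loop above. It is only a sketch under stated assumptions: everything lives in bank 0, the length is an exact multiple of the unrolled chunk (the leftover-bytes case is not modelled), and the per-pass overhead of 9 cycles (TDC 2, SEC 2, 16-bit SBC immediate 3, TCD 2) comes from standard 65816 timings rather than from this thread. The function name `pei_block_move` is hypothetical.

```python
def pei_block_move(mem, src_end, dst_end, length, unroll=8):
    """Model the unrolled-PEI move; returns inner-loop cycles.

    src_end and dst_end are the addresses of the LAST byte of each block.
    length must be a multiple of 2 * unroll.
    """
    chunk = 2 * unroll
    if length % chunk:
        raise ValueError("length must be a multiple of the unrolled chunk")
    d = src_end - (chunk - 1)   # Direct Page register: base of the top chunk
    s = dst_end                 # stack pointer: next byte to be written
    cycles = 0
    for _ in range(length // chunk):
        # PEI operands run from chunk-2 down to 0 ($E, $C, ... $0 for unroll=8).
        for offset in range(chunk - 2, -2, -2):
            lo = mem[(d + offset) & 0xFFFF]
            hi = mem[(d + offset + 1) & 0xFFFF]
            mem[s] = hi                 # high byte is pushed first
            s = (s - 1) & 0xFFFF
            mem[s] = lo
            s = (s - 1) & 0xFFFF
            # 6 cycles, plus 1 when the low byte of D is non-zero.
            cycles += 6 + (1 if d & 0xFF else 0)
        d = (d - chunk) & 0xFFFF        # TDC / SEC / SBC / TCD
        cycles += 2 + 2 + 3 + 2
    return cycles


mem = bytearray(0x10000)
mem[0x2000:0x2100] = bytes(range(256))
cycles = pei_block_move(mem, src_end=0x20FF, dst_end=0x40FF, length=256, unroll=128)
assert mem[0x4000:0x4100] == mem[0x2000:0x2100]
print(cycles / 256)   # cycles per byte for one page-aligned 256-byte pass
```

With `unroll=128` and a page-aligned source, D stays aligned and the result lands just above 3 cycles per byte. With `unroll=8` on the same data, most passes pay the extra cycle per PEI.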

Edit: this hack works out wonderfully if the block-move source, destination and length are known in advance (as in flabdablet's screen-scrolling scenario). For general use it'll be comparatively clumsy, but still worth pursuing if the block-moves are fairly large.

-- Jeff

_________________
In 1988 my 65C02 got six new registers and 44 new full-speed instructions!
https://laughtonelectronics.com/Arcana/ ... mmary.html


Posted: Sun May 05, 2019 2:34 pm

Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10760
Location: England
Ah! It's the PEI argument which increments! (Edit: double-decrements. Oops. Not my day.)

BTW I think this might have been noted before, but with a bit of care, or luck, there's no need to disable interrupts: the ISR's stack frame will shortly be overwritten as the block move proceeds. I think.


Posted: Sun May 05, 2019 3:12 pm

Joined: Tue Mar 02, 2004 8:55 am
Posts: 996
Location: Berkshire, UK
BigEd wrote:
Ah! It's the PEI argument which increments!

It's a rather specialised block move. Up to 256 bytes from a fixed set of addresses.

_________________
Andrew Jacobs


Posted: Mon May 06, 2019 1:54 am

Joined: Fri Dec 11, 2009 3:50 pm
Posts: 3328
Location: Ontario, Canada
Bitwise wrote:
It's a rather specialised block move. Up to 256 bytes from a fixed set of addresses.
A fixed set of addresses is what will produce maximum efficiency, because the registers' initial values won't have to be computed at run-time as part of the setup. And of course fixed length would improve efficiency, too.

That said, a more general routine (one that accepts arbitrary addresses and length) will more than recoup the setup overhead if enough bytes need to be moved. I don't have an exact figure for "enough" because I haven't coded the routine. But it won't take much to pass break-even and start to show a profit. The inner loop is more than twice as fast as the next-best alternative, MVP / MVN (roughly 3 cycles per byte versus 7). Even if we incur the one-cycle penalty mentioned in my last post we still have a bulk speed of 3.5 cycles per byte. And there needn't be a length limitation of 256 bytes; the inner loop could be called repeatedly. Hmm, looks like I've almost talked myself into coding the routine! But if anyone else wants to give it a go, please do.
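As a back-of-envelope illustration of the break-even point, the Python sketch below estimates how many bytes it takes to pay back whatever extra setup a general routine needs. The setup cost is a purely hypothetical input, since no such routine has been coded here. The per-byte figures are the ones discussed in this thread (6 or 7 cycles per PEI, 7 cycles per byte for MVN/MVP), and the small per-pass cost of adjusting D is ignored.

```python
import math

MVN_CYCLES_PER_BYTE = 7.0


def pei_cycles_per_byte(page_aligned=True):
    """Bulk speed of the unrolled PEI sequence, ignoring per-pass overhead."""
    return (6 if page_aligned else 7) / 2


def break_even_bytes(extra_setup_cycles, page_aligned=True):
    """Bytes needed before the PEI move beats MVN/MVP.

    extra_setup_cycles is the setup cost of the PEI routine beyond what an
    MVN/MVP call would need (a hypothetical figure supplied by the caller).
    """
    saving_per_byte = MVN_CYCLES_PER_BYTE - pei_cycles_per_byte(page_aligned)
    return math.ceil(extra_setup_cycles / saving_per_byte)


print(break_even_bytes(100))                      # 25
print(break_even_bytes(100, page_aligned=False))  # 29
```

So if the extra setup cost were, say, 100 cycles, the move would only need a few dozen bytes to come out ahead, which fits the "it won't take much" estimate.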

BigEd wrote:
with a bit of care, or luck, there's no need to disable interrupts: the ISR's stack frame will shortly be overwritten as the block move proceeds.
Luck will usually be enough! :P But if the interrupt happens to arrive just as the block move ends then you'll overrun the destination buffer. If allowing interrupts is a priority you could perhaps pad the destination buffer to allow a little extra space.
Edit: besides allowing extra space, another option is to always use MVN to move the last few bytes.
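The overrun risk can be pictured with a small Python sketch. It assumes a native-mode interrupt, which pushes 4 bytes (program bank, two bytes of PC, status) starting at the current stack pointer. Anything the ISR itself pushes would add to that, so `frame_bytes` is a parameter the reader has to supply. The function name is hypothetical.

```python
def interrupt_overrun(s, frame_bytes, dst_start):
    """Return the addresses below dst_start that an interrupt would clobber.

    s           -- stack pointer at the moment the interrupt is taken
    frame_bytes -- bytes pushed by the interrupt plus the ISR (assumed)
    dst_start   -- lowest address of the destination buffer
    """
    touched = range(s, s - frame_bytes, -1)   # stack grows downward
    return [addr for addr in touched if addr < dst_start]


# Mid-move: the frame lands in destination bytes that later PEIs overwrite.
print(interrupt_overrun(0x4080, 4, 0x4000))        # []

# Near the end: part of the frame falls below the buffer.
print(len(interrupt_overrun(0x4001, 4, 0x4000)))   # 2
```

On this model, padding the destination by at least `frame_bytes` below its start, or finishing the last few bytes with MVN as suggested above, keeps the frame from touching anything else.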

-- Jeff


