Buffer indirection

Post by **Paganini** » Sun Feb 22, 2026 6:15 pm

So in the course of working on a CF driver / filesystem for PUNIX I have encountered this general problem.

Suppose I have an input buffer whose base address is stored in a zero page location, like IBUFF, and an output buffer whose base address is also stored in a zero page location, like OBUFF. Normally I would do something like this:

LDY #<number of bytes to copy>
.loop:
DEY
BMI     .continue
LDA     (IBUFF),Y
STA     (OBUFF),Y
BRA     .loop
.continue:

The problem is, how to copy from an offset in the source buffer to a *different* offset in the destination buffer. E.g., the CF firmware ID string starts at byte 46 in IBUFF, and I want to put that string at byte 0 in OBUFF. I solved this problem by unrolling the loop (fine for 8 bytes, I guess):

		LDY	#47
		LDA	(RP0),Y
		LDY	#0
		STA	(RP1),Y

		LDY	#46
		LDA	(RP0),Y
		LDY	#1
		STA	(RP1),Y

		LDY	#49
		LDA	(RP0),Y
		LDY	#2
		STA	(RP1),Y

		LDY	#48
		LDA	(RP0),Y
		LDY	#3
		STA	(RP1),Y

		LDY	#51
		LDA	(RP0),Y
		LDY	#4
		STA	(RP1),Y

		LDY	#50
		LDA	(RP0),Y
		LDY	#5
		STA	(RP1),Y

		LDY	#53
		LDA	(RP0),Y
		LDY	#6
		STA	(RP1),Y

		LDY	#52
		LDA	(RP0),Y
		LDY	#7
		STA	(RP1),Y

but what if I wanted to copy a lot of bytes? Is there a more general and elegant solution to this problem? Basically, I've only got one indirect index register, but I need to track two indices...

Post by **barnacle** » Sun Feb 22, 2026 8:43 pm

I've solved it the depressingly slow way: incrementing one or both buffer pointers and ignoring X and Y. That works only on the 65c02; otherwise you have to use Y. But it does mean I can use any length of transfer and not worry about the boundaries.

I suppose you might set one pointer and add the offset to it, removing it after the transfer?

Neil
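
A minimal sketch of the pointer-incrementing approach described above, assuming a 65C02 (for the unindexed `(zp)` mode and `BRA`). SRC, DST and LEN are hypothetical zero-page locations, not names from the original posts; all three are modified by the loop.

```asm
; Sketch only. SRC, DST: 16-bit pointers in zero page (hypothetical).
; LEN: 16-bit byte count in zero page (hypothetical).
copy:   LDA LEN
        ORA LEN+1
        BEQ done        ; count has reached zero
        LDA (SRC)       ; 65C02 zero-page indirect, no index register
        STA (DST)
        INC SRC         ; 16-bit increment of the source pointer
        BNE nosrc
        INC SRC+1
nosrc:  INC DST         ; 16-bit increment of the destination pointer
        BNE nodst
        INC DST+1
nodst:  LDA LEN         ; 16-bit decrement of the count
        BNE nolen
        DEC LEN+1       ; borrow from the high byte when the low byte is 0
nolen:  DEC LEN
        BRA copy
done:
```

The per-byte overhead is high, but any length works and the two buffers need no particular alignment.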

Post by **BigEd** » Sun Feb 22, 2026 8:56 pm

Yes, either adjust one of the pointers so you can reuse the Y value, or copy the pointer and modify the copy.

Post by **commodorejohn** » Sun Feb 22, 2026 8:57 pm

Yeah, the simplest way is gonna be to adjust the pointers so that you can use the same index value for both; in the provided example, if you add 47 to the pointer value in RP0, you can loop through index values 0-7 for both.
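
One way this could look, as a sketch rather than anyone's actual driver code. It copies the adjusted pointer into SRC, a hypothetical spare zero-page pointer, so RP0 itself is left alone. It adds 46 rather than 47 and swaps each byte pair, because the unrolled code in the first post stores byte 47 before byte 46.

```asm
; Sketch only. SRC is a hypothetical spare zero-page pointer.
        CLC
        LDA RP0
        ADC #46         ; offset of the string within the source buffer
        STA SRC
        LDA RP0+1
        ADC #0          ; propagate the carry into the high byte
        STA SRC+1

        LDY #0
.loop:  INY
        LDA (SRC),Y     ; odd byte of the pair
        DEY
        STA (RP1),Y     ; goes to the even position
        LDA (SRC),Y     ; even byte of the pair
        INY
        STA (RP1),Y     ; goes to the odd position
        INY
        CPY #8          ; 8 bytes in total
        BNE .loop
```

For a straight copy with no swapping, the loop body reduces to `LDA (SRC),Y` / `STA (RP1),Y` with a single `INY`.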

Post by **jgharston** » Mon Feb 23, 2026 1:37 am

Yah, the annoying omission of (zp),X. What I end up with is:

loop:
STY tmpY
TXA
TAY
LDA (zp1),Y
LDY tmpY
STA (zp2),Y
INY
INX
BR_some_condition loop

If you can push/pop X and Y, you can do:
loop:
PHY
TXA
TAY
LDA (zp1),Y
PLY
STA (zp2),Y
INY
INX
BR_some_condition loop

Post by **BigDumbDinosaur** » Mon Feb 23, 2026 7:25 am

Don’t be clever, just efficient. Compute the offset and add it to one of the pointers. You can then use .Y to index over both buffers. Incrementing pointers on the fly is always expensive, even with the 65C816 doing it 16 bits at a time. Incrementing .Y only costs two clocks.

Post by **Paganini** » Mon Feb 23, 2026 4:40 pm

Thanks everyone!

BigDumbDinosaur wrote:

Don’t be clever, just efficient. Compute the offset and add it to one of the pointers. You can then use .Y to index over both buffers. Incrementing pointers on the fly is always expensive, even with the 65C816 doing it 16 bits at a time. Incrementing .Y only costs two clocks.

This is what I ended up doing a bit later on when I had to copy 40 bytes to extract the model string. My buffers are page-aligned, so it turned out to not be too painful. I just put the offset of the first byte in the low byte of the pointer, then indexed with .Y. Afterwards, STZ to the low byte of the pointer restores it to being a base pointer.
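
A sketch of what that might look like, assuming a 65C02 and a page-aligned source buffer. OFFSET is a hypothetical symbol; the value 54 is an assumption for where the model string starts, not something stated in the post. This version is a straight copy and does no byte swapping.

```asm
; Sketch only. RP0 points at a page-aligned buffer, so its low byte is 0.
OFFSET  = 54            ; assumed start of the string within the buffer

        LDA #OFFSET
        STA RP0         ; low byte of the pointer now holds the offset
        LDY #39         ; 40 bytes, indices 39 down to 0
.loop:  LDA (RP0),Y
        STA (RP1),Y
        DEY
        BPL .loop       ; fine for counts up to 128
        STZ RP0         ; restore the base pointer
```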

Post by **rudla.kudla** » Tue Feb 24, 2026 6:52 am

If your code is in RAM, you can use self-modifying code very effectively.

Even if in ROM, it may be useful to reserve some small area of zero page for a copy routine.

loop:
lda xxxx,Y
sta yyyy,Y
dey
bne loop
rts

This effectively costs you just 6 bytes in ZP (as xxxx and yyyy would have to be in ZP anyways).
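
To make the byte count concrete, here is a sketch of how such a routine might be laid out. The load address and the label zpcopy are hypothetical, and the origin and equate syntax varies between assemblers. On a ROM-based system the ten bytes would have to be copied into zero page at startup.

```asm
; Sketch only: load address and labels are assumptions.
        * = $00f0       ; assumed free zero-page area (10 bytes)
zpcopy: lda $ffff,y     ; 3 bytes; operand is the source address (placeholder)
        sta $ffff,y     ; 3 bytes; operand is the destination address (placeholder)
        dey             ; 1 byte
        bne zpcopy      ; 2 bytes; runs for Y = n down to 1
        rts             ; 1 byte

xxx_adr = zpcopy+1      ; low/high bytes of the lda operand
yyy_adr = zpcopy+4      ; low/high bytes of the sta operand
```

Four of the ten bytes are the two operands, which replace the zero-page pointers an indirect loop would need; that leaves the six extra bytes mentioned above. Because the loop ends when Y reaches 0, it moves the bytes at indices n down to 1, so a caller wanting bytes 0 to n-1 would store each buffer address minus one.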

Post by **BigDumbDinosaur** » Tue Feb 24, 2026 9:19 pm

rudla.kudla wrote:

If your code is in RAM, you can use self-modifying code very effectively.

Even if in ROM, it may be useful to reserve some small area of zero page for a copy routine.

loop:
lda xxxx,Y
sta yyyy,Y
dey
bne loop
rts

This effectively costs you just 6 bytes in ZP (as xxxx and yyyy would have to be in ZP anyways).

There is no particular reason to run that routine on zero page—it won’t go any faster than if in absolute memory. All instructions will require the same number of clock cycles as if running in absolute RAM. Zero page performance gain is coupled to load/store instructions that operate on a zero page address. None of the above instructions do so.

Thinking that running a routine on zero page will gain performance is a common fallacy. Opcode fetch, decode and execution speed is a constant, regardless of where the code is running.

Post by **BigEd** » Tue Feb 24, 2026 9:31 pm

But surely having those two pointers serving double duty is a nice idea?

Post by **jgharston** » Wed Feb 25, 2026 2:30 am

When I absolutely must run a bit of code in RAM I push it onto the stack and call it there.

Post by **BigDumbDinosaur** » Wed Feb 25, 2026 5:29 am

BigEd wrote:

But surely having those two pointers serving double duty is a nice idea?

Probably...if you are going to use them as actual pointers, rather than as operands, as they are being used in rudla.kudla’s code.

There are some occasions in which I will use self-modifying code in place of indirection. Mostly, that would be in loops with many iterations, such as in reading/writing a disk block. There, the execution speed of a loop using indirection (especially long indirection with the 65C816) will be slightly worse than with the same loop in which the source/destination address is a changeable operand. I do this in my SCSI driver’s quasi-DMA code, which could end up executing many thousands of iterations in a single transaction if multiple disk blocks are being accessed.
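
For a rough sense of the difference, here are the two inner loops side by side with the documented 6502/65C02 cycle counts. This is an illustration only, not the SCSI driver code mentioned above; SRC and DST are hypothetical zero-page pointers, and 65C816 timings differ with register width.

```asm
; Indirect version: pointers held in zero page.
.ind:   LDA (SRC),Y     ; 5 cycles (+1 if indexing crosses a page)
        STA (DST),Y     ; 6 cycles
        DEY             ; 2 cycles
        BNE .ind        ; 3 cycles when taken
                        ; about 16 cycles per byte

; Self-modified version: operands patched before entering the loop.
.abs:   LDA $FFFF,Y     ; 4 cycles (+1 if indexing crosses a page)
        STA $FFFF,Y     ; 5 cycles
        DEY             ; 2 cycles
        BNE .abs        ; 3 cycles when taken
                        ; about 14 cycles per byte
```

Two cycles per byte is a small saving, but it adds up over many thousands of iterations.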

Post by **rudla.kudla** » Wed Feb 25, 2026 7:01 am

You are right about this routine not going any faster in ZP. However, copying data is a common operation, and in such a case you save bytes on setting the start and end address.

So the code

lda #<start
sta xxx_adr
lda #>start
sta xxx_adr+1
lda #<dest
sta yyy_adr
lda #>dest
sta yyy_adr+1

is four bytes shorter and four cycles faster, because each sta can use the zero-page form.

However, I agree that the gain is probably small. Still, if you have free space in ZP, putting the routine there seems reasonable.
And it's a useful technique I felt was worth mentioning.

Post by **BigDumbDinosaur** » Wed Feb 25, 2026 9:39 am

rudla.kudla wrote:

You are right about this routine not going any faster in ZP. However, copying data is a common operation, and in such a case you save bytes on setting the start and end address...However, I agree that the gain is probably small.

When looking for ways to improve performance, it’s more profitable to focus on code segments that will be heavily used, rather than on the onesies and twosies gains.

Yes, setting the source and destination addresses if the routine is on zero page will save four clock cycles and bytes over doing the same in absolute memory. However, the grunt work is in the loop, not the setup, so you are sacrificing a piece of valuable real estate to set up a garden that is going to grow one potato. Zero page’s value is in the addressing modes it supports and the approximately 25 percent performance gain with fetch/store operations. Such real estate needs to be conserved, not squandered.

Relative to the total number of clocks needed to run the loop, saving four clock cycles is not doing anything significant for performance—unless the routine as a whole is called hundreds or thousands of times per second.

Post by **barrym95838** » Wed Feb 25, 2026 2:29 pm

Unless you're a brutal byte-miser, anything you can do to speed up the inner-most loop is going to pay the best dividends.
