drogon provided an interesting approach for printing strings in which the string is embedded immediately after the call to the string output routine. His approach uses self-modifying code in RAM together with the return address: his routine pulls the return address from the stack and places it in the absolute address bytes of an lda abs instruction, then increments these two bytes to step through the string.
This approach reminded me in some ways of how I used to print strings on my PDP-11/24 when programming in MACRO-11 assembler.
I wrote four routines to implement drogon's string output approach using my M65C02A ISA.
The first, strout.asm, used the M65C02A's stack-relative direct addressing mode to increment the return address directly on the stack, and the M65C02A's stack-relative indirect addressing mode to load the accumulator with the character from the string.
The second, strout1.asm, used the Y register, in 16-bit mode, as an index to the return address on the stack to point to the string and load the character into the accumulator. After the string has been output, the Y register is transferred to the accumulator, the return address from the stack is added to it, and the adjusted return address is written over the return address on the stack.
The third, strout2.asm, pulls the return address from the stack into a 16-bit Y register. The 16-bit Y register is added to a constant 16-bit offset of one to index through the string (in order to use the lda abs,Y instruction). The 16-bit Y register is incremented for each character in the string, and when the last character has been output to the console, the 16-bit Y register is pushed onto the stack as the new return address for the string output subroutine.
The fourth, strout3.asm, pulls the return address into a 16-bit X register. The 16-bit X register is added to a constant 8-bit offset of 1 to index through the string (in order to use the lda zp,X instruction). The 16-bit X register is incremented for each character in the string, and when the last character has been output to the console, the 16-bit X register is pushed onto the stack as the new return address of the string output subroutine.
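To make the pull/index/push sequence used by strout2.asm and strout3.asm easier to follow, here is a rough Python model of it. It is only a sketch, not the attached assembler source: the zero-byte string terminator, the routine address $0280, and the list standing in for the hardware stack (holding whole 16-bit values) are all assumptions made for illustration. It does rely on the standard 6502 behavior that JSR pushes the address of its own last byte and RTS adds one to the pulled address, which is why a constant offset of one is needed to reach the string.

```python
JSR_OPCODE = 0x20                      # 6502 JSR abs


def strout(memory, stack):
    """Model of an inline-string print routine (assumes a zero terminator)."""
    ptr = stack.pop()                  # return address: last byte of the JSR
    out = []
    while True:
        ch = memory[ptr + 1]           # constant offset of one reaches the next byte
        ptr += 1                       # index register advances once per byte read
        if ch == 0:                    # assumed terminator, also stepped over
            break
        out.append(chr(ch))
    stack.append(ptr)                  # new return address for RTS
    return "".join(out)


def rts(stack):
    """RTS pulls the return address and resumes at that address plus one."""
    return stack.pop() + 1


memory = bytearray(0x0300)
call_site = 0x0200
text = b"Hello World\n\x00"

memory[call_site] = JSR_OPCODE
memory[call_site + 1:call_site + 3] = (0x0280).to_bytes(2, "little")  # hypothetical routine address
memory[call_site + 3:call_site + 3 + len(text)] = text

stack = [call_site + 2]                # what JSR would have pushed
printed = strout(memory, stack)
resume = rts(stack)

print(repr(printed))
print(hex(resume))
```

In this model execution resumes at the first byte after the terminator, so the caller continues as if the string were not there.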
The assembler outputs for these four routines are attached. The performance comparison shows that strout3.asm is the fastest in total number of instruction cycles: 185, versus 198 for strout2.asm, 231 for strout1.asm, and 283 for strout.asm. All four routines output "Hello World\n" to the py65 console.
Code:
.load strout.bin 200
Wrote +40 bytes from $0200 to $0227
.g 200
Hello World
.cycles
Total = 283, Num Inst = 67, Pgm Rd = 174, Data Rd = 67, Data Wr = 42, Dummy Cycles = 0
CPI = 4.22, Avg Inst Len = 2.60, Time = 0.002505, time/cycle = 8.851 us, MIPS = 26748.643
--------------------------------------------------------------------------------
.load strout1.bin 200
Wrote +47 bytes from $0200 to $022E
.g 200
Hello World
.cycles
Total = 231, Num Inst = 71, Pgm Rd = 170, Data Rd = 43, Data Wr = 18, Dummy Cycles = 0
CPI = 3.25, Avg Inst Len = 2.39, Time = 0.004371, time/cycle = 18.923 us, MIPS = 16243.051
--------------------------------------------------------------------------------
.load strout2.bin 200
Wrote +40 bytes from $0200 to $0227
.g 200
Hello World
.cycles
Total = 198, Num Inst = 68, Pgm Rd = 163, Data Rd = 17, Data Wr = 18, Dummy Cycles = 0
CPI = 2.91, Avg Inst Len = 2.40, Time = 0.003645, time/cycle = 18.408 us, MIPS = 18657.228
--------------------------------------------------------------------------------
.load strout3.bin 200
Wrote +39 bytes from $0200 to $0226
.g 200
Hello World
.cycles
Total = 185, Num Inst = 68, Pgm Rd = 150, Data Rd = 17, Data Wr = 18, Dummy Cycles = 0
CPI = 2.72, Avg Inst Len = 2.21, Time = 0.004635, time/cycle = 25.052 us, MIPS = 14671.931
--------------------------------------------------------------------------------
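As a quick cross-check of the listings above, the figures can be put into a few lines of Python. This is only a sanity check on the numbers as printed. The relationships it tests (Total equals Pgm Rd plus Data Rd plus Data Wr, and CPI equals Total divided by Num Inst) are my reading of the monitor output, not something the monitor documents here.

```python
# Figures copied from the .cycles output for each routine.
runs = {
    "strout":  {"total": 283, "inst": 67, "pgm_rd": 174, "data_rd": 67, "data_wr": 42, "cpi": 4.22},
    "strout1": {"total": 231, "inst": 71, "pgm_rd": 170, "data_rd": 43, "data_wr": 18, "cpi": 3.25},
    "strout2": {"total": 198, "inst": 68, "pgm_rd": 163, "data_rd": 17, "data_wr": 18, "cpi": 2.91},
    "strout3": {"total": 185, "inst": 68, "pgm_rd": 150, "data_rd": 17, "data_wr": 18, "cpi": 2.72},
}

for name, r in runs.items():
    # Assumed relationship: every cycle is a program read, data read, or data write.
    assert r["pgm_rd"] + r["data_rd"] + r["data_wr"] == r["total"], name
    # Assumed relationship: CPI is total cycles per executed instruction.
    assert abs(r["total"] / r["inst"] - r["cpi"]) < 0.005, name

fastest = min(runs, key=lambda n: runs[n]["total"])
saved = runs["strout"]["total"] - runs[fastest]["total"]
print(fastest, saved)
```

Both relationships hold for all four runs. Most of the difference between strout.asm and the other three routines shows up in the Data Rd and Data Wr columns rather than in the instruction count.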