The thing I was proud of here was using LSR and a series of bits to manage a pair of nested loops. The outer loop executes twice, and the inner loop executes twice on each pass through it. A single byte controls everything by shifting flags off the end, which kind of reminded me of a Turing machine. I stored the control bits in TOS not because it's part of the stack, but because TOS became an available zero page scratch location once I had pushed its contents onto the real stack. The POP at the end overwrites TOS, and that's okay. The only reason the POP is there is to leave the same number of bytes on the stack (12) as we came in with.
Ultimately, this approach didn't save me any memory vs. just juggling the stack around, or I would have left it in. I didn't really care about the clock cycles for this primitive, because I can't recall a single occasion when I've used 2ROT or done much with double precision in a Forth program. But I thought the LSR ... BCS thing was pretty nifty. Self-modifying code is usually a bad thing (like where I toggle L057 between an INX and a DEX instruction on alternating iterations through the loop), and my inner jury is still out on whether using EOR to toggle the X register between the STACKL and STACKH areas was ugly or slick.
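For anyone who hasn't seen the LSR ... BCS trick, here is a minimal sketch of the idea. It is not the old 2ROT code: the labels, the flag pattern, and the choice of N as a home for the control byte and the two counters are all made up for illustration, and the INC instructions just stand in for real loop bodies.
Code:
ctl     = n                 ; hypothetical: control bits live in N scratch
outers  = n+1               ; hypothetical pass counters, only here to
inners  = n+2               ;   make the loop bodies visible
demo    lda #0
        sta outers
        sta inners
        lda #%00001101      ; flags, consumed right to left
        sta ctl
outer   inc outers          ; outer loop body
inner   inc inners          ; inner loop body
        lsr ctl             ; next flag -> carry
        bcs inner           ; 1 = run the inner body again
        lsr ctl             ; next flag -> carry
        bcs outer           ; 1 = run the outer loop again
        rts                 ; outers = 2, inners = 4
The flags come off the right-hand end in the order 1 (repeat inner), 0 (leave inner), 1 (repeat outer), 1 (repeat inner), and after that only zeros are left, so both branches fall through and the routine exits.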
Here's 2ROT now, same size, probably 1/2 the clocks, and I'll still never use it.
Code:
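tworot  ldy stackh+4,x      ; high bytes of cells x, x+2, x+4:
        lda stackh+2,x      ;   rotate them as a group of three
        sta stackh+4,x
        lda stackh,x
        sta stackh+2,x
        sty stackh,x
        ldy stackl+4,x      ; same rotation for the low bytes
        lda stackl+2,x
        sta stackl+4,x
        lda stackl,x
        sta stackl+2,x
        sty stackl,x
        ldy stackh+3,x      ; high bytes of cells x+1, x+3 and TOS
        lda stackh+1,x
        sta stackh+3,x
        lda tos+1
        sta stackh+1,x
        sty tos+1
        ldy stackl+3,x      ; low bytes of cells x+1, x+3 and TOS
        lda stackl+1,x
        sta stackl+3,x
        lda tos
        sta stackl+1,x
        sty tos
        jmp next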
But back to your question, "What is POP?" Okay... on the PET, the kernel uses the three zero page addresses $8D-$8F for the jiffy clock (a 24-hour clock that ticks every 1/60th of a second) and $90-$FF for whatever else it needs. I'm just going to leave everything from $8D to the top of zero page alone, and content myself with the 141 bytes ($00-$8C) I get.
Code:
; zero page usage
stackl = $00 ; stackl = $00..$3b (60 bytes)
stackh = $3c ; stackh = $3c..$79 (60 bytes)
bos = stackh-stackl ; includes TOS
up = $76 ; user area pointer
n = $78 ; scratch space
w = $7e ; w overlaps n
tos = $80 ; top of stack
zi = $82 ; innermost DO LOOP counter/limit
next = $86
;0086 next inc ip
next1 = $88
;0088 next1 inc ip
nexto = $8a
ip = $8b
;008a nexto jmp ($cafe)
I do a few nontraditional things here, like having a NEXT that ignores page boundaries. Instead, I leave it up to the compiler to insert a call to PAGE at $xxFD or $xxFE. There's more too, like padding at compile time and skipping to the next page at runtime (when necessary) after a literal or string if it leaves me at $xxFF. This hardware is a real NMOS 6502, complete with the JMP ($xxFF) bug, where the indirect jump fetches its high byte from $xx00 instead of the start of the next page. The key benefit of this approach is getting NEXT down to 15 clocks.
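Spelled out with cycle counts, the zero page NEXT looks roughly like this. The origin directive is an assumption about assembler syntax, and $cafe is only a placeholder for whatever IP holds at the moment.
Code:
        *=$0086             ; origin syntax is an assumption, yours may differ
next    inc ip              ; 5 clocks: IP low byte + 1
next1   inc ip              ; 5 clocks: IP low byte + 1 again
nexto   jmp ($cafe)         ; 5 clocks: jump through the cell IP points at
ip      = nexto+1           ; IP is the operand of the JMP itself ($8b)
That's 5 + 5 + 5 = 15 clocks through NEXT, 10 through NEXT1, and 5 through NEXTO. INC only touches the low byte of IP, so nothing here ever carries into the high byte, which is why the compiler has to plant PAGE before the end of each page.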
It is typical to have an 8-byte region called "N" in zero page for primitive scratch space. It's also typical to have the parameter stack in zero page and index it with the zp,X addressing mode.
I haven't seen a Forth that puts the innermost DO LOOP index and limit on the zero page (ZI), but this one does. Usually DO LOOP counters and limits live on the return stack. Also, for (I hope) an overall speed boost, I keep the topmost value on the parameter stack in a separate 2-byte area (TOS) and split the rest of the stack into low-byte and high-byte areas. Finally, instead of a two-byte pointer to code at the code field address (CFA), there is always actual executable machine code. That saves one level of indirection and is called DTC (direct-threaded code) vs. the traditional ITC (indirect-threaded code).
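To show what the split stack with a separate TOS buys, here is roughly what a 16-bit add looks like with this layout. It's a sketch rather than the actual dictionary entry: the label is hypothetical and the dictionary header is left out.
Code:
plus    clc
        lda stackl,x        ; low byte of the second item
        adc tos             ; + low byte of TOS
        sta tos
        lda stackh,x        ; high byte of the second item
        adc tos+1           ; + high byte of TOS + carry
        sta tos+1
        inx                 ; drop the second item with a single INX
        jmp next
With a conventional interleaved stack, dropping a cell costs two INX instructions. Here TOS sits at a fixed address and one INX drops a whole cell.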
The first word in the dictionary is (LIT), used for pushing 2-byte values that were embedded in the dictionary onto the stack. Down at the bottom of (LIT) we have a few useful ways to exit primitives: PUSHN, PUSH, and PUT. Both PUSH and PUT assume you've loaded Y:A (high byte in Y, low byte in A) with the two bytes you want to drop on the stack.
Code:
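pushn   ldy n+1             ; Y:A = the 2-byte value in N
        lda n
push    sta n               ; park A (low byte) while we move TOS
        dex                 ; make room on the split stack
        lda tos+1
        sta stackh,x        ; old TOS high byte
        lda tos
        sta stackl,x        ; old TOS low byte
        lda n               ; recover the low byte
put     sty tos+1           ; new TOS = Y:A
        sta tos
        jmp next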
A few definitions later, we run into (DO), which is the business end of setting up a DO LOOP. It's also got a couple of useful ways to exit a primitive, POPTWO and POP. I just have to make sure to slide the last thing off the bottom into TOS when I POP, and also be sure to move TOS onto the real stack when I PUSH (PUT just overwrites TOS in place).
Code:
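poptwo  inx                 ; discard one item from the split stack
pop     ldy stackh,x        ; Y:A = the next item down
        lda stackl,x
        inx                 ; remove it from the split stack
        jmp put             ; and make it the new TOS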
To recap, POPTWO, POP, PUSHN, PUSH, PUT, NEXT and NEXTO are all ways to get out of a primitive and move the IP (instruction pointer) to the next word.
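With those exits in place, a lot of primitives shrink to a few instructions. Here is a sketch of three of them, assuming the labels above. The headers are omitted, and these are illustrations rather than the actual dictionary entries.
Code:
drop    jmp pop             ; next item slides into TOS, stack shrinks by one
dup     lda tos             ; A = low byte of TOS
        ldy tos+1           ; Y = high byte of TOS
        jmp push            ; old TOS goes to the split stack, Y:A is the new TOS
over    lda stackl,x        ; A = low byte of the second item
        ldy stackh,x        ; Y = high byte of the second item
        jmp push            ; old TOS goes to the split stack, Y:A is the new TOS
PUSH only uses A and N while it moves the old TOS, so Y survives untouched until PUT stores it.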