GARTHWILSON wrote:
The data stack kind of has to be in ZP though, because so many cells will be addresses that you'll need to use to fetch/store from/to, with the (ZP,X) addressing mode. You could keep the data stack in another page and copy addresses to a ZP location (like N mentioned above) for indirection, but I think the penalty in performance and required memory would make it not worth it.
A net four bytes of extra memory, because you have two utility vectors in the zero page plus the stack somewhere in general memory rather than the stack in the zero page, is a negligible memory penalty. And a two-cycle-faster DROP, typically the most frequently executed primitive, pays off some of the net clock penalty that exists for memory access op execution overall.
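To make the DROP comparison concrete, here is a minimal sketch. It assumes the in-ZP stack is the usual interleaved layout (low byte at T,X, high byte at T+1,X), that the out-of-ZP stack is split into separate low-byte and high-byte arrays TL and TH, and that the stack grows downward so a drop increments X. The labels DROP_ZP and DROP_SPLIT are made up for illustration:

```asm
; Interleaved cells, stack in zero page: each cell is two adjacent bytes
DROP_ZP:
        INX             ; 2 clocks
        INX             ; 2 clocks
        JMP NEXT

; Split stack (TL = low bytes, TH = high bytes): each cell is one index step
DROP_SPLIT:
        INX             ; 2 clocks
        JMP NEXT
```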
And for memory access primitive execution speed, there are trade-offs. You get to use (ZP) and (ZP),Y to fetch and store, which are both normally one cycle faster than (ZP,X)*. That also reduces the 16-bit access penalty, since "INC T,X : BNE + : INC T+1,X : + ..." costs 9 clocks (6 for the INC plus 3 for the taken branch, and more when the increment crosses a page boundary)*, as opposed to 2 for "LDY #1" or "INY". And for fetches beyond C@, the "TAY" to postpone the store of the first fetched byte, because you are still using the vector sitting on the stack, costs another two. So the 14 cycles to move a memory vector is not normally a net 14 cycle cost; it's more like a net 0-2 cycles on 16-bit memory ops. For 8-bit memory ops it will be a bigger net cost, but OTOH, for the double-cell memory ops, it's probably a net gain:
TWOFETCH:               ; 2@  ( addr -- x1 x2 )
        LDA TL,X        ; copy address low byte to W: costs 4 clocks
        STA W           ; costs 3 clocks
        LDA TH,X        ; copy address high byte to W+1: costs 4 clocks
        STA W+1         ; costs 3 clocks
        DEX             ; make room for the second cell
        LDA (W)         ; saves 1 clock*
        STA TL,X
        LDY #1          ; saves clocks on T,X/T+1,X increment
        LDA (W),Y       ; saves 1 clock*
        STA TH,X
        INY             ; saves clocks on T,X/T+1,X increment
        LDA (W),Y       ; saves 1 clock*
        STA TL+1,X      ; saves clocks on preserving (T,X)
        INY             ; saves clocks on T,X/T+1,X increment
        LDA (W),Y       ; saves 1 clock*
        STA TH+1,X
        JMP NEXT
__________________
{* Note: if the (W),Y crosses a page boundary, because the low byte of the base address is $FD to $FF, there is no clock saving in the post-indexed versus pre-indexed address mode. On the other hand, in that same case the "INC T,X : BNE + : INC T+1,X" falls through the branch once, at a net cost of 5 more clock cycles (6 for the second INC, less 1 for the branch not taken), so it just juggles around where the clock cycles are saved.}
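For reference, the in-ZP pattern that the "saves clocks" comments are measured against could look something like the single-cell fetch below. This is only a sketch under assumptions: an interleaved zero-page stack with the low byte at T,X and the high byte at T+1,X, the same NEXT as above, and an assembler that accepts "+" as an anonymous forward label. The label FETCH_ZP is hypothetical:

```asm
FETCH_ZP:               ; @  ( addr -- x ), data stack interleaved in zero page
        LDA (T,X)       ; fetch low byte through the address cell on the stack
        TAY             ; park it: the address cell is still needed
        INC T,X         ; step the address in place
        BNE +
        INC T+1,X       ; low byte wrapped, so carry into the high byte
+       LDA (T,X)       ; fetch high byte
        STA T+1,X       ; overwrite address high byte with data high byte
        STY T,X         ; now store the parked low byte
        JMP NEXT
```

The TAY and the in-place increment are the two costs that the copy-to-W approach replaces with LDY #1 or INY and an immediate store.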