In his "Zero" Overhead Forth Interrupts article, Garth mentions a four-instruction (non-IRQ section) NEXT. I'm not sure which four instructions he has in mind there, but I thought I'd point out that a two-instruction indirect threaded NEXT routine is possible on the 65816, namely:
Code:
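PLX          ; W = next cell of the thread, pulled through S (the IP)
JMP (0,X)    ; jump through the code field that W points to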
X is Forth's W register, and S (the 65816 stack pointer!) is Forth's IP register. With 16-bit index registers, this NEXT routine is only 11 cycles long (5 for PLX, 6 for JMP (0,X)), which is faster than the 12 cycles of a JSR RTS pair in subroutine threading. (As kc5tja & I have alluded to in other forum posts, optimization possibilities are available to eliminate many JSRs and RTSs in subroutine threading.) Since this indirect threaded NEXT takes only 4 bytes, it can be placed inline at the end of each primitive rather than reached with a 3-byte JMP NEXT. That adds 1 byte per primitive, but saves the 3 cycles of the JMP.
This particular NEXT implementation is not as impractical as it may seem at first. On the 65816, it's usually just as convenient to keep TOS in the accumulator. This makes using the Y register as the data stack pointer (where the nth cell on the data stack is n-2,Y) much more efficient, since it offsets the lack of Y-indexed forms of ASL, INC, etc. A word such as D2*, where the second cell on the data stack is shifted, will be less efficient than if X were the data stack pointer, but these sorts of words are not all that common.
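To make the register assignments concrete, here is a rough, untested sketch of how two primitives might look with NEXT placed inline. It assumes native mode with 16-bit A and index registers, a data bank register of 0, a data stack that grows downward with the second cell at 0,Y and the third at 2,Y, and a code field that simply points at the machine code following it. The labels and the .word directive are only illustrative, and assembler syntax varies.
Code:
PLUS:   .word PLUS+2   ; code field: address of the machine code below
        CLC
        ADC $0000,Y    ; TOS (in A) + second cell, absolute indexed by Y
        INY            ; drop the second cell
        INY
        PLX            ; inline NEXT
        JMP (0,X)
DUP:    .word DUP+2    ; code field
        DEY            ; make room for one more cell
        DEY
        STA $0000,Y    ; copy TOS into the new second cell; A is unchanged
        PLX            ; inline NEXT
        JMP (0,X)
Neither primitive needs X for anything, so the fact that NEXT overwrites it costs nothing here.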
The return stack pointer (RP) could just be stored on the zero page. The X register is free to be used by primitives; it will just get overwritten by NEXT. So, LDX RP and so on would be used when needed. It would be nice if there were a register available for RP, but there aren't a whole lot of words that need to access the return stack, so I don't think this will be too bad.
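As an untested sketch of what "LDX RP and so on" might look like, here are >R and R> with RP kept in a zero-page variable. The assumptions are mine: native mode with 16-bit A and index registers, direct page and data bank registers both 0, a return stack in bank 0 that grows downward with RP pointing at its top cell, TOS in A, and Y as the data stack pointer with the second cell at 0,Y. The labels are illustrative.
Code:
TOR:    .word TOR+2    ; >R
        LDX RP         ; X is free here; NEXT reloads it anyway
        DEX
        DEX
        STA 0,X        ; push TOS onto the return stack
        STX RP
        LDA $0000,Y    ; the new TOS comes from the data stack
        INY
        INY
        PLX            ; inline NEXT
        JMP (0,X)
RFROM:  .word RFROM+2  ; R>
        DEY
        DEY
        STA $0000,Y    ; old TOS becomes the second cell
        LDX RP
        LDA 0,X        ; pop the return stack into TOS
        INX
        INX
        STX RP
        PLX            ; inline NEXT
        JMP (0,X)
The extra cost over a register-resident RP is the LDX RP and STX RP in each word.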
The downside...
The usurping of the S register for the IP introduces some complications. One is that a JSR or JSL to a ROM subroutine (for EMIT or KEY, e.g.) pushes its return address through S, and so can overwrite the indirect threaded code (ITC) being executed by the inner interpreter. There are a couple of solutions. One is to simply write a special routine for Forth and not rely on a ROM library (often this is very reasonable). The second is to change S temporarily to point to a sufficiently large area of memory that can be safely overwritten, then use JSR or JSL, then restore S. This will add a few cycles, but not a whole lot.
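The second solution might look something like the following untested sketch of EMIT. ROMEMIT is a hypothetical ROM routine that takes the character in A, can be called in native mode with 16-bit registers, and returns with RTS. IPSAVE is a hypothetical zero-page variable, and SCRATCH is a hypothetical label for the top of a small bank 0 area that may be freely overwritten. As before, TOS is in A and Y is the data stack pointer.
Code:
EMIT:   .word EMIT+2
        TSX            ; X = S, which is the IP
        STX IPSAVE     ; park the IP in a zero-page variable
        LDX #SCRATCH   ; top of the area that may be overwritten
        TXS            ; pushes now land in the scratch area
        PHY            ; the scratch stack can also protect Y
        JSR ROMEMIT    ; character to send is in A (the TOS)
        PLY
        LDX IPSAVE
        TXS            ; S is the IP again
        LDA $0000,Y    ; EMIT consumes its argument: pop a new TOS
        INY
        INY
        PLX            ; inline NEXT
        JMP (0,X)
The overhead is the two pairs of transfers around the call. X can be used for them because a primitive is free to clobber it.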
Another, more serious, complication is that dealing with interrupts is far trickier. If it is known an IRQ won't occur until after a WAI instruction, then S can be temporarily changed, just as for a JSR or JSL above. A RESET is not necessarily a problem either. RESET could always just cold start Forth, so that nothing in RAM would need to be preserved.
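For the WAI case, an untested fragment inside a primitive might look like this. It assumes IRQs are kept masked everywhere else, that no NMI can occur, and that the interrupt handler is happy to run on the scratch stack. IPSAVE (a zero-page variable) and SCRATCH (the top of an area that may be overwritten) are hypothetical names.
Code:
        TSX            ; X = S, which is the IP
        STX IPSAVE
        LDX #SCRATCH
        TXS            ; interrupt pushes now land in the scratch area
        CLI            ; allow IRQs
        WAI            ; sleep until one arrives; the handler runs and returns here
        SEI            ; mask IRQs again before S goes back to being the IP
        LDX IPSAVE
        TXS            ; S is the IP again
An IRQ that arrives between the CLI and the SEI is harmless, since S points into the scratch area for that whole window.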
However, it is often desirable to have any non-power-on RESET do a warm start rather than a cold start. The same problem exists with an NMI or IRQ, namely how to either keep from overwriting the ITC in RAM, or how to restore what did get overwritten. There are some possible solutions to this problem, but nothing particularly efficient that I can see.
To keep the example somewhat simple, let's assume (a) all of the ITC is in bank 0, (b) there are two 64k RAM chips, U0 and U1, and (c) U0 is (ordinarily) mapped to bank 0. If you could somehow figure out when an interrupt was pushing on the stack (I don't see a simple way of doing this offhand -- the VPB signal seems to be too late, since the vector is pulled AFTER pushing on the stack), you might be able to map bank 0 to U1 while pushing onto the stack and the ITC (stored in U0) won't be overwritten at all.
An alternative is to set up the hardware so that a read from bank 0 reads from U0, and a read from bank 1 reads from U1, but a write to bank 0 writes to both U0 and U1. Then U1 contains a copy of the ITC that was overwritten (which can be read from bank 1) that can be used to restore the ITC.
This could also be done from software without any special hardware support. In this case, banks 0 and 1 would be ordinary RAM without any unusual mapping. The software would then write to both banks 0 and 1 when storing ITC. Since this double write occurs only at compile time, it's only a one-time cost and won't be a huge performance hit. Restoring the ITC is really where the efficiency will take a hit, with or without hardware support. The best solution is not to use any interrupts.
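The software double write could be done in the word that compiles a cell (a "comma" primitive). This is an untested sketch under my own assumptions: HEREPTR is a hypothetical zero-page variable holding the dictionary pointer, the shadow copy lives at the same offset in bank 1, the registers are 16 bits wide in native mode, and TOS is in A with Y as the data stack pointer. The spelling of absolute long addresses differs between assemblers.
Code:
COMMA:  .word COMMA+2
        LDX HEREPTR    ; dictionary pointer
        STA $000000,X  ; the cell the inner interpreter will execute (bank 0)
        STA $010000,X  ; shadow copy at the same offset in bank 1
        INX
        INX
        STX HEREPTR
        LDA $0000,Y    ; pop a new TOS
        INY
        INY
        PLX            ; inline NEXT
        JMP (0,X)
The only added cost is the second STA, paid once per compiled cell. This sketch covers only the write side; restoring overwritten cells from bank 1 is the harder part and is not shown.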
In summary, if you have a self-contained system (no JSRs or JSLs to general-purpose ROM subroutines for things like EMIT), do not use interrupts, and use RESET only to cold start Forth -- none of which is unreasonable in Forth -- you can have a very fast ITC Forth. My curiosity is sufficiently piqued, so I'd like to actually experiment with this at some point. I don't see any other major downsides to this approach off the top of my head.