65816 indirect threaded NEXT

dclxvi · Post by **dclxvi** » Fri Feb 04, 2005 7:54 am

In his "Zero" Overhead Forth Interrupts article, Garth mentions a four-instruction (non-IRQ section) NEXT. I'm not sure which four instructions he has in mind there, but I thought I'd point out that a two-instruction indirect threaded NEXT routine is possible on the 65816, namely:

PLX
JMP (0,X)

X is Forth's W register, and S (the 65816 stack pointer!) is Forth's IP register. This NEXT routine is only 11 cycles long, which is faster than a JSR RTS pair in subroutine threading. (As kc5tja & I have alluded to in other forum posts, optimization possibilities are available to eliminate many JSRs and RTSs in subroutine threading.) Since this indirect threaded NEXT only takes 4 bytes, it can be placed inline rather than using JMP NEXT, adding 1 byte per primitive, but saving 3 cycles.
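
For anyone who wants to trace the mechanics, here is a rough C model of that NEXT. It is only a sketch: `mem`, `S`, `X`, `peek16`, `poke16`, and `next_target` are names made up for illustration, and it assumes 16-bit index registers with everything in bank 0. The 11 cycles are 5 for PLX (with 16-bit X) plus 6 for JMP (0,X), against 6 + 6 = 12 for a JSR/RTS pair.

```c
#include <stdint.h>

uint8_t  mem[65536];  /* bank 0, little-endian (illustrative model only) */
uint16_t S;           /* 65816 stack pointer, used here as Forth's IP */
uint16_t X;           /* Forth's W */

uint16_t peek16(uint16_t a)
{
    return (uint16_t)(mem[a] | (mem[(uint16_t)(a + 1)] << 8));
}

void poke16(uint16_t a, uint16_t v)
{
    mem[a] = (uint8_t)(v & 0xFF);
    mem[(uint16_t)(a + 1)] = (uint8_t)(v >> 8);
}

/* Models PLX followed by JMP (0,X).
   Returns the address at which execution would continue. */
uint16_t next_target(void)
{
    X = peek16((uint16_t)(S + 1));  /* PLX: pull the next cell of the thread into W */
    S = (uint16_t)(S + 2);          /* S ends up on the last byte pulled */
    return peek16(X);               /* JMP (0,X): jump through the code field at W */
}
```

Note that S always sits one byte below the next cell to be executed, since a pull increments S before reading.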

This particular NEXT implementation is not as impractical as it may seem at first. On the 65816, it's usually just as convenient to keep TOS in the accumulator. This makes using the Y register as the data stack pointer (where the nth cell on the data stack is n-2,Y) much more efficient, since it offsets the fact that there are no ASL, INC, etc. indexed by Y instructions. A word such as D2*, where the second cell on the data stack is shifted, will be less efficient than if X were the data stack pointer, but these sorts of words are not all that common.
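
A small C sketch of the idea, with the cached TOS standing in for the accumulator and an index standing in for Y. The names (`A`, `Y`, `stack_mem`, `f_dup`, and so on) and the direction of stack growth are assumptions for illustration only. The point is that words operating on TOS alone never need an indexed read-modify-write instruction.

```c
#include <stdint.h>

uint16_t stack_mem[64];  /* data stack cells below TOS (illustrative size) */
int      Y;              /* index of the next free slot: models the Y register */
uint16_t A;              /* cached top of stack: models the accumulator */

/* Push a literal: spill the old TOS to memory, then load the new one. */
void f_lit(uint16_t v) { stack_mem[Y++] = A; A = v; }

/* DUP: spill a copy of TOS; the cached value is unchanged. */
void f_dup(void) { stack_mem[Y++] = A; }

/* +: fold the second cell into the cached TOS. */
void f_plus(void) { A = (uint16_t)(A + stack_mem[--Y]); }

/* 2*: touches only the cached TOS, like ASL A; no indexed shift needed. */
void f_two_star(void) { A = (uint16_t)(A << 1); }
```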

The return stack pointer (RP) could just be stored on the zero page. The X register is free to be used by primitives, it will just get overwritten by NEXT. So, LDX RP and so on would be used when needed. It would be nice if there were a register available for RP, but there aren't a whole lot of words that need to access the return stack, so I don't think this will be too bad.

The downside...

The usurping of the S register for the IP introduces some complications. One is that a JSR or JSL to a ROM subroutine (for EMIT or KEY, e.g.) can overwrite the ITC being executed by the inner interpreter. There are a couple of solutions. One is to simply write a special routine for Forth and not rely on a ROM library (often this is very reasonable). The second is to change S temporarily to point to a sufficiently sized area of memory that can be safely overwritten, then use JSR or JSL, then restore S. This will add a few cycles, but not a whole lot.
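
The second solution can be modeled in C as follows. This is a sketch under assumptions: `rom_call` is a stand-in for whatever the JSR target does to the stack, `SCRATCH_TOP` is a hypothetical safe area, and everything lives in bank 0.

```c
#include <stdint.h>

uint8_t  mem[65536];  /* bank 0 (illustrative model only) */
uint16_t S;           /* stack pointer, doubling as Forth's IP */

#define SCRATCH_TOP 0x01FFu  /* hypothetical area that may be safely overwritten */

/* A push writes at S and then decrements, as the hardware stack does. */
void push16(uint16_t v)
{
    mem[S] = (uint8_t)(v >> 8);   S = (uint16_t)(S - 1);
    mem[S] = (uint8_t)(v & 0xFF); S = (uint16_t)(S - 1);
}

/* A pull increments S and then reads. */
uint16_t pull16(void)
{
    uint16_t lo, hi;
    S = (uint16_t)(S + 1); lo = mem[S];
    S = (uint16_t)(S + 1); hi = mem[S];
    return (uint16_t)(lo | (hi << 8));
}

/* Stand-in for JSR to a ROM routine followed by its RTS:
   the return address is pushed, then pulled again. */
void rom_call(void)
{
    push16(0xBEEF);
    (void)pull16();
}

/* Point S at the scratch area around the call so the thread is untouched. */
void safe_rom_call(void)
{
    uint16_t saved_ip = S;  /* save IP */
    S = SCRATCH_TOP;
    rom_call();
    S = saved_ip;           /* restore IP */
}
```

In this model, calling `rom_call` directly clobbers the two bytes at and below S (the thread cell just executed), while `safe_rom_call` leaves them alone.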

Another, more serious, complication is that dealing with interrupts is far trickier. If it is known an IRQ won't occur until after a WAI instruction, then S can be temporarily changed, just as for a JSR or JSL above. A RESET is not necessarily a problem either. RESET could always just cold start Forth, so that nothing in RAM would need to be preserved.

However, it is often desirable to have any non-power-on RESET do a warm start rather than a cold start. The same problem exists with an NMI or IRQ, namely how to either keep from overwriting the ITC in RAM, or how to restore what did get overwritten. There are some possible solutions to this problem, but nothing particularly efficient that I can see.

To keep the example somewhat simple, let's assume (a) all of the ITC is in bank 0, (b) there are two 64k RAM chips, U0 and U1, and (c) U0 is (ordinarily) mapped to bank 0. If you could somehow figure out when an interrupt was pushing on the stack (I don't see a simple way of doing this offhand -- the VPB signal seems to be too late, since the vector is pulled AFTER pushing on the stack), you might be able to map bank 0 to U1 while pushing onto the stack and the ITC (stored in U0) won't be overwritten at all.

An alternative is to set up the hardware so that a read from bank 0 reads from U0, and a read from bank 1 reads from U1, but a write to bank 0 writes to both U0 and U1. Then U1 contains a copy of the ITC that was overwritten (which can be read from bank 1) that can be used to restore the ITC.

This could also be done from software without any special hardware support. In this case, banks 0 and 1 would be ordinary RAM without any unusual mapping. The software would then write to both banks 0 and 1 when storing ITC. Since this double write occurs only at compile time, it's only a one time cost and won't be a huge performance hit. Restoring the ITC is really where the efficiency will take a hit, with or without hardware support. The best solution is not to use any interrupts.
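
A C sketch of the software-only version, under assumptions: `bank0`, `bank1`, `compile_cell`, and `restore_itc` are invented names, and the caller is assumed to know which bytes the interrupt pushed (a native-mode interrupt pushes four: PBR, PCH, PCL, P).

```c
#include <stdint.h>
#include <string.h>

uint8_t bank0[65536];  /* live ITC; the interrupt pushes land here */
uint8_t bank1[65536];  /* shadow copy, never used as stack */

/* Compile-time store: write each cell to both banks. */
void compile_cell(uint16_t addr, uint16_t cell)
{
    uint16_t hi_addr = (uint16_t)(addr + 1);
    bank0[addr]    = bank1[addr]    = (uint8_t)(cell & 0xFF);
    bank0[hi_addr] = bank1[hi_addr] = (uint8_t)(cell >> 8);
}

/* After an interrupt has pushed `len` bytes, the highest at address `top`,
   copy the damaged range back from the shadow bank.
   Assumes the range does not wrap around the end of the bank. */
void restore_itc(uint16_t top, uint16_t len)
{
    uint16_t start = (uint16_t)(top - len + 1);
    memcpy(&bank0[start], &bank1[start], len);
}
```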

In summary, if you have a self-contained system (no JSRs or JSLs to general-purpose ROM subroutines for things like EMIT), do not use interrupts, and use RESET only to cold start Forth -- none of which is unreasonable in Forth -- you can have a very fast ITC Forth. My curiosity is sufficiently piqued, so I'd like to actually experiment with this at some point. I don't see any other major downsides to this approach off the top of my head.

GARTHWILSON · Post by **GARTHWILSON** » Fri Feb 04, 2005 10:24 am

Doggone if you don't turn my thinking all upside-down, Bruce! This is great. Keep it comin'. With work the way it is now, it'll be a while before I can really think this out thoroughly and do it justice, but I'd like to evaluate the possibilities. Thanks. I'm sure I'll learn something good even if I decide the penalties outweigh the advantages.

BTW, I have almost no JSRs, but >DOES does compile JSR does, and ;CODE does compile JSR doCODE. I was thinking I might also want to use JSR occasionally with the online assembler for small, time-critical code portions, but now that you mention it, I can't think of any cases that I've ever used it there either—probably just because they are small and time-critical.

GARTHWILSON · Post by **GARTHWILSON** » Mon Feb 07, 2005 7:43 am

Far better than going without any interrupt capability at all: how about having certain regularly executed primitives (like nest and maybe a couple of others) set up the conditions for handling interrupts without overwriting ITC, and then do CLI SEI, giving frequent opportunities for interrupts to get noticed? Average response will be slow, not because of huge overhead, but because the interrupt is simply not acted upon until you get to those certain places in the code.

If you can run the IRQ\ line to bit 6 or 7 of an I/O port, it would present less overhead in the majority of cases (i.e., when no interrupt is pending) to actually poll in those same places in the code, with something like BIT followed by a BMI to where "interrupts" can get handled safely.
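
In C terms the polling version looks roughly like this. It is a sketch only: `io_port`, `service_pending`, and `poll_point` are hypothetical names, and it assumes bit 7 reads as 1 when a request is pending (matching the BMI above), so the test would need flipping if the line arrives active-low.

```c
#include <stdint.h>

volatile uint8_t io_port;  /* hypothetical input port; bit 7 = request pending */
int serviced;              /* counts handled requests, for illustration */

/* Hypothetical handler, run only at a point where the thread is safe. */
void service_pending(void)
{
    serviced++;
    io_port &= 0x7F;       /* model the device dropping its request */
}

/* Called from the chosen primitives (nest, for example).
   Equivalent of BIT port followed by BMI handler. */
void poll_point(void)
{
    if (io_port & 0x80)
        service_pending();
}
```

When nothing is pending the cost is just the test and an untaken branch.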

dclxvi · Post by **dclxvi** » Thu Feb 10, 2005 6:38 am

That's a good idea. But you can do that without adding ANY cycles when interrupts aren't being used, at least for ITC.

First, I'd use unnest (EXIT) rather than nest, because unnest overwrites the S register, so its value does not need to be saved or restored. Nest also overwrites S, but not until the middle of the word.

Then, locate the code field of (in this case) EXIT (i.e. the DW "instruction" that contains the address of the actual 6502 instructions) in I/O (address) space. Thus, rather than the 2-byte code field being read from RAM or ROM, it's read from a pair of (consecutive) I/O locations. "Hard-wire" all but one of the 16 bits of the code field (to a fixed value), and connect the remaining bit to the IRQ signal. This puts the address of the IRQ and non-IRQ versions of EXIT 2^N (i.e. a power of two) bytes apart, which may be a little space inefficient, but it is probably better to have only one bit in the code field change than to conserve space.
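
To illustrate what the inner interpreter would read from those two I/O locations, here is a C sketch. The base address and the choice of bit are made-up values, not anything from a real design.

```c
#include <stdint.h>

#define EXIT_CODE_BASE 0x8000u  /* hypothetical address of the non-IRQ EXIT code */
#define IRQ_BIT        8        /* hypothetical: the one code field bit driven by IRQ */

/* Value seen when the code field of EXIT is read from I/O space:
   15 bits are hard-wired, and one bit follows the IRQ signal. */
uint16_t exit_code_field(int irq_pending)
{
    uint16_t irq = irq_pending ? 1u : 0u;
    return (uint16_t)(EXIT_CODE_BASE | (irq << IRQ_BIT));
}
```

With these values the two versions of EXIT sit at $8000 and $8100, i.e. 2^8 bytes apart, and JMP (0,X) picks between them with no extra instructions.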

Also, using WAI (with the I flag set) rather than CLI in the IRQ portion of EXIT allows the interrupt to be serviced without having to push anything onto the stack, so you may not need to change the S register to point to a "safe" area at all. In that case, you may want to latch IRQ to make sure you don't hang if the signal could become inactive before you hit the WAI instruction.

BTW, I haven't written a program that used interrupts in probably 4 or 5 years (yes, I USE programs that use interrupts), so going without them isn't much of a limitation, at least not to me.

asmlang_6 · Post by **asmlang_6** » Fri Jul 22, 2005 12:00 am

Why do we even need NEXT at all? I'm writing a Forth system without NEXT.

kc5tja · Post by **kc5tja** » Fri Jul 22, 2005 2:58 pm

asmlang_6 wrote:

Why do we even need NEXT at all? I'm writing a Forth system without NEXT.

Subroutine-threaded Forths produce binaries that are 33% larger (on average) than direct- or indirect-threaded Forths. For space-conscious designs, this may be an issue.

I agree, however, that subroutine threading is the best all-around solution to the problem for the 65xx series, due to its lack of registers.

asmlang_6 · Post by **asmlang_6** » Fri Jul 22, 2005 10:56 pm

kc5tja wrote:

Subroutine-threaded Forths produce binaries that are 33% larger (on average) than direct- or indirect-threaded Forths. For space-conscious designs, this may be an issue.

I'm not writing a STC Forth. I'm writing an ITC Forth. My NEXTless implementation of EXECUTE in C:

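#include <stdint.h>

/* Assumed to be defined elsewhere: nonzero if the cell holds a primitive's address. */
int isprim(intptr_t cell);

void thr_execute(intptr_t *thread)
{
        /* a zero cell terminates the thread */
        for ( ; *thread; thread++) {
                if (isprim(*thread)) { /* executing native code as though it was Forth is a bad idea ;-) */
                        void (*prim)(void) = (void (*)(void))*thread;
                        prim(); /* it all comes down to this sooner or later (in Forth usually later ;-) */
                } else {
                        thr_execute((intptr_t *)*thread); /* here we go again! */
                }
        }
}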

Oh, and I know the difference between ITC and STC.

kc5tja · Post by **kc5tja** » Sun Jul 24, 2005 6:16 am

asmlang_6 wrote:

Oh, and I know the difference between ITC and STC.

Actually, the code you posted would be qualified as a "Call"-threaded code (CTC). That would work too, but I don't know its performance characteristics.
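
For contrast, here is a sketch of the same kind of interpreter with an explicit IP and return stack instead of recursion on the C stack, which is closer to what NEXT, nest, and unnest do. All names (`word`, `run`, `BUMP`, `TWICE`, and so on) are invented for the example, and there is no return stack overflow check.

```c
#include <stddef.h>
#include <stdint.h>

typedef struct word word;
struct word {
    void (*prim)(void);  /* non-NULL for a primitive */
    const word **body;   /* NULL-terminated thread for a colon definition */
};

void run(const word **ip)
{
    const word **rstack[32];  /* explicit return stack (no overflow check) */
    int rp = 0;

    for (;;) {
        const word *w = *ip++;       /* NEXT: fetch the next cell, advance IP */
        if (w == NULL) {             /* end of thread acts as EXIT (unnest) */
            if (rp == 0)
                return;
            ip = rstack[--rp];
        } else if (w->prim) {
            w->prim();               /* primitive: run native code */
        } else {
            rstack[rp++] = ip;       /* nest: save IP, enter the body */
            ip = w->body;
        }
    }
}

/* Tiny demo dictionary. */
int counter;
void bump(void) { counter++; }

const word BUMP = { bump, NULL };
const word *twice_body[] = { &BUMP, &BUMP, NULL };
const word TWICE = { NULL, twice_body };
const word *main_thread[] = { &TWICE, &BUMP, &TWICE, NULL };
```

Calling `run(main_thread)` executes `bump` five times without any C-level recursion.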

asmlang_6 · Post by **asmlang_6** » Sun Jul 24, 2005 8:09 am

kc5tja wrote:

Actually, the code you posted would be qualified as a "Call"-threaded code (CTC). That would work too, but I don't know its performance characteristics.

Why good grief, I think I just discovered a new implementation technique for Forth without knowing about it!