65816 indirect threaded NEXT

Topics relating to various Forth models on the 6502, 65816, and related microprocessors and microcontrollers.
dclxvi
Posts: 362
Joined: 11 Mar 2004


Post by dclxvi »

In his "Zero" Overhead Forth Interrupts article, Garth mentions a four-instruction (non-IRQ section) NEXT. I'm not sure which four instructions he has in mind there, but I thought I'd point out that a two-instruction indirect threaded NEXT routine is possible on the 65816, namely:

Code: Select all

PLX
JMP (0,X)
X is Forth's W register, and S (the 65816 stack pointer!) is Forth's IP register. This NEXT routine is only 11 cycles long (5 for PLX with 16-bit index registers, 6 for JMP (0,X)), which is faster than the 12 cycles of a JSR RTS pair in subroutine threading. (As kc5tja & I have alluded to in other forum posts, optimization possibilities are available to eliminate many JSRs and RTSs in subroutine threading.) Since this indirect threaded NEXT only takes 4 bytes, it can be placed inline rather than using a 3-byte, 3-cycle JMP NEXT, adding 1 byte per primitive, but saving 3 cycles.
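For readers who think more easily in C, here is a rough model of what those two instructions do. It is only a sketch under stated assumptions: the names (word_t, ip, w, next, run_thread and the three toy primitives) are hypothetical, C function pointers stand in for code field addresses, and a loop stands in for each primitive ending in its own inline NEXT.

```c
typedef struct word word_t;
struct word { void (*code)(void); };   /* code field: address of the machine code */

static const word_t *const *ip;  /* Forth IP (the S register in the scheme above) */
static const word_t *w;          /* Forth W (the X register in the scheme above) */
static int  running;
static long acc;                 /* hypothetical test value, not a Forth register */

/* Three toy primitives, invented for this example. */
static void do_inc(void)  { acc += 1; }
static void do_dbl(void)  { acc *= 2; }
static void do_halt(void) { running = 0; }

static const word_t w_inc = { do_inc }, w_dbl = { do_dbl }, w_halt = { do_halt };

static void next(void)
{
    w = *ip++;     /* PLX: pull the next cell of the thread, advancing IP */
    w->code();     /* JMP (0,X): jump through the code field W points to */
}

long run_thread(const word_t *const *thread)
{
    acc = 0;
    ip = thread;
    running = 1;
    while (running)
        next();    /* real primitives end in NEXT; here they return to this loop */
    return acc;
}

long demo_thread(void)
{
    static const word_t *const thread[] = { &w_inc, &w_dbl, &w_inc, &w_halt };
    return run_thread(thread);
}
```

Calling demo_thread() walks the thread inc, dbl, inc, halt and returns 3.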

This particular NEXT implementation is not as impractical as it may seem at first. On the 65816, it's usually just as convenient to keep TOS in the accumulator. This makes using the Y register as the data stack pointer (where the nth cell on the data stack is n-2,Y) much more efficient, since it offsets the fact that there are no ASL, INC, etc. indexed by Y instructions. A word such as D2*, where the second cell on the data stack is shifted, will be less efficient than if X were the data stack pointer, but these sorts of words are not all that common.
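As a small illustration of keeping TOS in a register, here is a minimal C sketch in which the variable tos plays the accumulator's role and dsp plays the Y register's role. The names, the stack depth and the downward-growing layout are assumptions made for the example, not part of any particular Forth.

```c
#include <stdint.h>

#define DEPTH 32                 /* hypothetical stack size in cells */

static int16_t stack[DEPTH];     /* in-memory part of the data stack */
static int     dsp = DEPTH;      /* index of the second cell (the Y register's role) */
static int16_t tos;              /* cached top of stack (the accumulator's role) */

/* Spill the old TOS to memory, then load the new value into the register. */
void push(int16_t v) { stack[--dsp] = tos; tos = v; }

/* Reload TOS from the second cell. */
void drop(void)      { tos = stack[dsp++]; }

/* Add the second cell into TOS: one memory read, no memory write. */
void plus(void)      { tos = (int16_t)(tos + stack[dsp++]); }

/* Exchange TOS with the second cell. */
void swap(void)      { int16_t t = stack[dsp]; stack[dsp] = tos; tos = t; }

int16_t top(void)    { return tos; }
```

With TOS cached, a word like + touches memory once instead of three times, which is where the Y-indexed addressing limitations stop mattering so much.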

The return stack pointer (RP) could just be stored on the zero page. The X register is free to be used by primitives; it will just get overwritten by NEXT. So LDX RP and so on would be used when needed. It would be nice if there were a register available for RP, but there aren't a whole lot of words that need to access the return stack, so I don't think this will be too bad.

The downside...

The usurping of the S register for the IP introduces some complications. One is that a JSR or JSL to a ROM subroutine (for EMIT or KEY, e.g.) can overwrite the ITC being executed by the inner interpreter. There are a couple of solutions. One is to simply write a special routine for Forth and not rely on a ROM library (often this is very reasonable). The second is to change S temporarily to point to a sufficiently sized area of memory that can be safely overwritten, then use JSR or JSL, then restore S. This will add a few cycles, but not a whole lot.
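The hazard and the second workaround can be sketched with a tiny simulation of bank 0 and the S register. This is an illustration only: the addresses THREAD and SCRATCH, the cell value $1234 and the pushed return address $BEEF are all made up, and only the pull and push byte order follows the 65816.

```c
#include <stdint.h>

static uint8_t  mem[0x10000];    /* simulated bank 0 */
static uint16_t s;               /* simulated S register, doubling as Forth IP */

#define THREAD  0x2000u          /* hypothetical address of a compiled ITC cell */
#define SCRATCH 0x01FFu          /* hypothetical top of an area that may be overwritten */

/* 16-bit pull, as PLX or RTS does: S is incremented before each byte is read. */
static uint16_t pull16(void)
{
    uint16_t lo = mem[++s];
    uint16_t hi = mem[++s];
    return (uint16_t)(lo | (hi << 8));
}

/* 16-bit push, as JSR does for its return address: high byte first. */
static void push16(uint16_t v)
{
    mem[s--] = (uint8_t)(v >> 8);
    mem[s--] = (uint8_t)(v & 0xFF);
}

static uint16_t peek16(uint16_t addr)
{
    return (uint16_t)(mem[addr] | (mem[(uint16_t)(addr + 1)] << 8));
}

/* Returns the thread cell as it reads after a simulated JSR/RTS pair. */
uint16_t cell_after_call(int use_scratch)
{
    uint16_t saved;

    mem[THREAD]     = 0x34;      /* the cell holds $1234 */
    mem[THREAD + 1] = 0x12;
    s = THREAD - 1;              /* IP sits just below the next cell to fetch */
    (void)pull16();              /* NEXT fetches the cell; S is now THREAD+1 */

    saved = s;
    if (use_scratch)
        s = SCRATCH;             /* point S at the safe area first */
    push16(0xBEEF);              /* JSR pushes a return address ... */
    (void)pull16();              /* ... and RTS pulls it again */
    s = saved;                   /* restore IP (a no-op if S was not switched) */

    return peek16(THREAD);
}
```

cell_after_call(0) returns $BEEF, meaning the cell that was just executed has been overwritten, which matters as soon as that code runs again. cell_after_call(1) returns the original $1234.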

Another, more serious, complication is that dealing with interrupts is far trickier. If it is known that an IRQ won't occur until after a WAI instruction, then S can be temporarily changed, just as for a JSR or JSL above. A RESET is not necessarily a problem either. RESET could always just cold start Forth, so that nothing in RAM would need to be preserved.

However, it is often desirable to have any non-power-on RESET do a warm start rather than a cold start. The same problem exists with an NMI or IRQ, namely how to either keep from overwriting the ITC in RAM, or how to restore what did get overwritten. There are some possible solutions to this problem, but nothing particularly efficient that I can see.

To keep the example somewhat simple, let's assume (a) all of the ITC is in bank 0, (b) there are two 64k RAM chips, U0 and U1, and (c) U0 is (ordinarily) mapped to bank 0. If you could somehow figure out when an interrupt was pushing on the stack (I don't see a simple way of doing this offhand -- the VPB signal seems to be too late, since the vector is pulled AFTER pushing on the stack), you might be able to map bank 0 to U1 while pushing onto the stack and the ITC (stored in U0) won't be overwritten at all.

An alternative is to set up the hardware so that a read from bank 0 reads from U0, and a read from bank 1 reads from U1, but a write to bank 0 writes to both U0 and U1. Then U1 contains a copy of the ITC that was overwritten (which can be read from bank 1) that can be used to restore the ITC.

This could also be done from software without any special hardware support. In this case, banks 0 and 1 would be ordinary RAM without any unusual mapping. The software would then write to both banks 0 and 1 when storing ITC. Since this double write occurs only at compile time, it's only a one time cost and won't be a huge performance hit. Restoring the ITC is really where the efficiency will take a hit, with or without hardware support. The best solution is not to use any interrupts.
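The software-only variant might look roughly like the following C sketch, with two arrays standing in for banks 0 and 1. The function names and the pushed byte values are hypothetical. The one 65816 fact relied on is that a native-mode interrupt pushes four bytes (PBR, PCH, PCL, P). The sketch also assumes addresses never wrap around the end of the bank.

```c
#include <stdint.h>
#include <string.h>

static uint8_t bank0[0x10000];   /* the ITC executes from here; S points into it */
static uint8_t bank1[0x10000];   /* shadow copy; never used as a stack */

/* Compile-time store: write one 16-bit cell (little-endian) to both banks.
   Assumes addr < 0xFFFF. */
void compile_cell(uint16_t addr, uint16_t value)
{
    bank0[addr]     = bank1[addr]     = (uint8_t)(value & 0xFF);
    bank0[addr + 1] = bank1[addr + 1] = (uint8_t)(value >> 8);
}

uint16_t read_cell(uint16_t addr)
{
    return (uint16_t)(bank0[addr] | (bank0[addr + 1] << 8));
}

/* Simulate a native-mode interrupt pushing PBR, PCH, PCL and P at S.
   The byte values are arbitrary. Returns the new S. */
uint16_t interrupt_push(uint16_t s)
{
    static const uint8_t pushed[4] = { 0x00, 0xC0, 0xDE, 0x34 };
    for (int i = 0; i < 4; i++)
        bank0[s--] = pushed[i];
    return s;
}

/* Copy the bytes from new_s + 1 up to and including old_s back from bank 1. */
void restore_itc(uint16_t old_s, uint16_t new_s)
{
    size_t first = (size_t)new_s + 1;
    memcpy(&bank0[first], &bank1[first], (size_t)(old_s - new_s));
}
```

The double write in compile_cell is the one-time cost. restore_itc is the part that has to run after every interrupt, which is where the efficiency goes.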

In summary, if you have a self-contained system (no JSRs or JSLs to general-purpose ROM subroutines for things like EMIT), do not use interrupts, and use RESET only to cold start Forth -- none of which is unreasonable in Forth -- you can have a very fast ITC Forth. My curiosity is sufficiently piqued, so I'd like to actually experiment with this at some point. I don't see any other major downsides to this approach off the top of my head.
GARTHWILSON
Forum Moderator
Posts: 8773
Joined: 30 Aug 2002
Location: Southern California

Post by GARTHWILSON »

Doggone if you don't turn my thinking all upside down, Bruce! This is great. Keep it comin'. With work the way it is now, it'll be a while before I can really think this out thoroughly and do it justice, but I'd like to evaluate the possibilities. Thanks. I'm sure I'll learn something good even if I decide the penalties outweigh the advantages.

BTW, I have almost no JSRs, but >DOES does compile JSR does, and ;CODE does compile JSR doCODE. I was thinking I might also want to use JSR occasionally with the online assembler for small, time-critical code portions, but now that you mention it, I can't think of any cases that I've ever used it there either—probably just because they are small and time-critical.
Last edited by GARTHWILSON on Mon Feb 07, 2005 7:47 am, edited 1 time in total.
http://WilsonMinesCo.com/ lots of 6502 resources
The "second front page" is http://wilsonminesco.com/links.html .
What's an additional VIA among friends, anyhow?
GARTHWILSON
Forum Moderator
Posts: 8773
Joined: 30 Aug 2002
Location: Southern California

Post by GARTHWILSON »

Far better than going without any interrupt capability at all, how about having certain regularly executed primitives (like nest and maybe a couple of others) set up the conditions for handling interrupts without overwriting ITC, and then do CLI SEI, giving frequent opportunities for interrupts to get noticed? Average response will be slow, not from huge overhead, but because the interrupt is simply not acted upon until you get to those certain places in the code.

If you can run the IRQ\ line to bit 6 or 7 of an I/O port, it would present less overhead in the majority of cases (ie, when no interrupt is pending) to actually poll in those same places in the code, with something like BIT followed by a BMI to where "interrupts" can get handled safely.
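In C terms the polling idea comes down to something like the sketch below. It assumes the port's bit 7 reads 1 while a request is pending, so that it matches a BIT followed by BMI. With the active-low IRQ\ line wired straight to the bit, the test would be inverted. The names fake_port, handle_request and serviced_count are hypothetical stand-ins.

```c
#include <stdint.h>

#define IRQ_PENDING 0x80u        /* assumption: bit 7 reads 1 while a request is pending */

volatile uint8_t fake_port;      /* stand-in for the real I/O location */

static unsigned serviced;        /* counts requests handled, for demonstration only */

static void handle_request(void) { serviced++; }   /* stand-in for the real handler */

/* Called from the chosen primitives, at a point where handling is safe. */
void poll_irq(const volatile uint8_t *port)
{
    if (*port & IRQ_PENDING)     /* BIT port, then BMI to the handler */
        handle_request();
}

unsigned serviced_count(void) { return serviced; }
```

When nothing is pending, the cost per polled primitive is just the read and the untaken branch.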
dclxvi
Posts: 362
Joined: 11 Mar 2004

Post by dclxvi »

That's a good idea. But you can do that without adding ANY cycles when interrupts aren't being used, at least for ITC.

First, I'd use unnest (EXIT) rather than nest, because unnest overwrites the S register, so its value does not need to be saved or restored. Nest also overwrites S, but not until the middle of the word.

Then, locate the code field of (in this case) EXIT (i.e. the DW "instruction" that contains the address of the actual 6502 instructions) in I/O (address) space. Thus, rather than the 2-byte code field being read from RAM or ROM, it's read from a pair of (consecutive) I/O locations. "Hard-wire" all but one of the 16 bits of the code field (to a fixed value), and connect the remaining bit to the IRQ signal. This puts the address of the IRQ and non-IRQ versions of EXIT 2^N (i.e. a power of two) bytes apart, which may be a little space inefficient, but it is probably better to have only one bit in the code field change than to conserve space.
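As a sketch of what the CPU would read from that I/O-mapped code field, here is the address computation in C. The base address $8000, the choice of N = 8, and the convention that the IRQ version lives at the higher address are all assumptions made for illustration.

```c
#include <stdint.h>

#define EXIT_CODE    0x8000u     /* hypothetical address of the non-IRQ EXIT code */
#define IRQ_ADDR_BIT 8           /* hypothetical: the IRQ signal drives bit 8 (N = 8) */

/* Value of EXIT's 16-bit code field as seen through the I/O locations.
   irq_pending is 1 while an interrupt request is asserted, otherwise 0.
   Fifteen bits are fixed; only bit IRQ_ADDR_BIT follows the IRQ signal. */
uint16_t exit_code_field(unsigned irq_pending)
{
    return (uint16_t)(EXIT_CODE | ((irq_pending & 1u) << IRQ_ADDR_BIT));
}
```

With these numbers, JMP (0,X) lands at $8000 when no interrupt is pending and at $8100 when one is, so the two versions of EXIT sit 2^8 = 256 bytes apart and the non-IRQ path pays nothing extra.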

Also, using WAI (with the I flag set) rather than CLI in the IRQ portion of EXIT allows the interrupt to be serviced without having to push anything onto the stack, so you may not need to change the S register to point to a "safe" area at all. In that case, you may want to latch IRQ to make sure you don't hang if the signal could become inactive before you hit the WAI instruction.

BTW, I haven't written a program that used interrupts in probably 4 or 5 years (yes, I USE programs that use interrupts), so going without them isn't much of a limitation, at least not to me.
asmlang_6
Posts: 53
Joined: 20 Jul 2005
Location: Hawaii

Post by asmlang_6 »

Why do we even need NEXT at all? I'm writing a Forth system without NEXT.
Sam

---
"OK, let's see, A0 on the 6502 goes to the ROM. Now where was that reset vector?"
kc5tja
Posts: 1706
Joined: 04 Jan 2003

Post by kc5tja »

asmlang_6 wrote:
Why do we even need NEXT at all? I'm writing a Forth system without NEXT.
Subroutine-threaded Forths produce binaries that are 33% larger (on average) than direct- or indirect-threaded Forths. For space-conscious designs, this may be an issue.

I agree, however, that subroutine threading is the best all-around solution to the problem for the 65xx series, due to its lack of registers.
asmlang_6
Posts: 53
Joined: 20 Jul 2005
Location: Hawaii

Post by asmlang_6 »

kc5tja wrote:
Subroutine-threaded Forths produce binaries that are 33% larger (on average) than direct- or indirect-threaded Forths. For space-conscious designs, this may be an issue.
I'm not writing an STC Forth. I'm writing an ITC Forth. My NEXTless implementation of EXECUTE in C:

Code: Select all

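#include <stdint.h>

int isprim(intptr_t cell); /* defined elsewhere: returns 1 if the cell is the address of a primitive */

void thr_execute(intptr_t *thread) /* intptr_t, so a cell can hold an address on any platform */
{
        for ( ; *thread; thread++) { /* a zero cell ends the thread */
                if (isprim(*thread) == 1) { /* executing native code as though it was Forth is a bad idea ;-) */
                        void (*prim)(void) = (void (*)(void))*thread;
                        (*prim)(); /* it all comes down to this sooner or later (in Forth usually later ;-) */
                } else {
                        thr_execute((intptr_t *)*thread); /* here we go again! */
                }
        }
}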
Oh, and I know the difference between ITC and STC.
kc5tja
Posts: 1706
Joined: 04 Jan 2003

Post by kc5tja »

asmlang_6 wrote:
Oh, and I know the difference between ITC and STC.
Actually, the code you posted would qualify as "call"-threaded code (CTC). That would work too, but I don't know its performance characteristics.
asmlang_6
Posts: 53
Joined: 20 Jul 2005
Location: Hawaii

Post by asmlang_6 »

kc5tja wrote:
Actually, the code you posted would qualify as "call"-threaded code (CTC). That would work too, but I don't know its performance characteristics.
Why good grief, I think I just discovered a new implementation technique for Forth without knowing about it!