In the thread 65816 direct threaded NEXT
discusses a slick idea for a smokin' fast 65c816 Forth implementation. It re-purposes the RTS instruction as a one-byte NEXT that executes in just 6 cycles! But...
the trick I used to accomplish this has a tradeoff, and in this case it's a doozy. Your program is stored on the stack (!) which means that anything that pushes onto the stack will clobber your programThe problem has no solution for the platform dclxvi is targeting. In this post I solve a different problem, taking his subversive use of RTS into an alternative environment where hardware can be redefined.
Edit: Later in this topic I belatedly added some basic background info about RTS threading. If you are reading this for the first time I suggest you either scroll down or else click here
The problem lies in getting SP to do the job of two registers
. We want to use SP conventionally as a pointer to short-term read/write storage (for interrupts and for explicit pushes and pulls), and (for the fast NEXT) we also want to use SP as a pointer into a list of pseudo-subroutine addresses -- a list that mustn't be written to. This puzzle has been teasing my imagination, and I wondered what a hardware solution might look like. I ended up with two approaches (both untested). The one shown here is the simplest. If I've overlooked anything I hope someone'll mention.
The '816 is capable of addressing memory as multiple banks, 64KB max per bank. (dclxvi
operates using the simplest model, where all banks map to the same 64K.) We can't have two SP registers, but we can
trick SP into addressing two different banks. All we have to do is tweak the CPU's policy of steering all stack-related operations to Bank 0.
For ease of discussion let's assume we have a machine with 128K of memory -- a single 128KB chip, perhaps. The RAM's A0 - A15 inputs are fed straight from the CPU, and A16 comes from an Address Latch
. This is S.O.P. All '816 systems using multiple banks require an Address Latch to capture the bank-address information that appears on the data bus during the first half of every bus cycle.
Our goal is to use memory "at" SP in Bank 0 for short-term read/write storage, and to use memory "at" SP in Bank 1 for the Forth address lists. So, we reserve a large stack area in Bank 0. We set the '816's Program Bank Register to 1 and we load the address lists into Bank 1 (and the callee machine-code routines, too). We set SP to point to the desired address list.
By using two banks we've provided two separate storage areas, as required. Normally RTS would be unable to access the address list, due to the CPU's policy of steering all stack-related operations to Bank 0. RTS will cause the CPU to do a pull, but it would normally be from Bank 0. To correct this, the circuit below detects RTS opcodes and forces the resulting "stack" memory accesses to remain in the Program Bank instead of going to Bank 0
Animation10.gif [ 27.72 KiB | Viewed 377 times ]
The '574 octal register captures eight signals at the end of every bus cycle. No action results unless the bus cycle is an opcode fetch (SYNC =1). If SYNC =1 the flip-flop may change state in the cycle following the opcode fetch.
If RTS (opcode $60) is detected then the D input will be low and the flip-flop will be cleared. Clearing the flip-flop has the effect of discontinuing pulses on ALE
, the Address Latch Enable signal. The latch ceases to update cycle-by-cycle as it normally would, and the most recent update (indicating the Program Bank) "sticks." All accesses are forced to the Program Bank, and we get the desired effect of RTS pulling an address from the Program Bank rather than Bank 0.
- The circuit as shown is a little sloppy in that it's triggered by opcodes $70, $E0 and $F0 as well as $60. That turns out not to make any difference, since those other instructions access only the Program Bank anyway.
- The XOR gate protects against an obscure hazard: the false opcode fetch that occurs whenever an interrupt is recognized. There's a risk that an RTS opcode will get fetched but not executed. If this happens the circuit mustn't respond; ie, the flip-flop needs to remain set. To recognize the beginning of the cpu's interrupt sequence we look for the unique circumstance of the address bus failing to increment in the cycle following an opcode fetch. (See Table 5-7 of the '816 Data Sheet) No increment means the least-significant address line, A0, will fail to toggle. The output of the XOR gate will be low and the flip-flop will remain set (allowing a successful interrupt with machine state pushed to stack in Bank 0 where it belongs).
So! The upside is that 6-cycle NEXT is all that we'd hoped: we can aim SP at a list of addresses and use RTS to rapidly "Call" each routine in sequence. JSR opcodes are not required, slashing Call overhead from 12 cycles to 6. Explicit pushes & pulls can be used for short-term storage, and interrupts are fully functional. The downsides are:
- you need hardware mod's or a built-from-scratch machine
- The large (and mostly unused) stack area that must be reserved in Bank 0. That's required because for all addresses pointed to by SP in the Program Bank, those corresponding addresses (and some additional bytes below) in Bank 0 are subject to being written at any time by interrupts. Also SP must avoid addresses corresponding to Direct-page, I/O ports and any other sensitive areas in Bank 0.
When coding, you need to bear in mind that SP gets altered every time an RTS (ie, NEXT) executes! Nevertheless, explicit pushes and pulls (PHA & PLA, for example) are
permitted provided that you put SP back where you found it before the next NEXT occurs. The Bank 0 allocation for stack will be wastefully large. Luckily, memory is cheap -- and there's nothing to prevent us populating additional banks.
Edit: improved illustration and text. Link to subsequent post.