Six-cycle NEXT (with a h/w mod to accommodate interrupts)

Dr Jefyll · Post by **Dr Jefyll** » Sat Nov 12, 2011 9:01 am

Usually 65xx Forth implementations keep the Interpretive Pointer (IP) in a memory location, which unfortunately limits performance, because IP must be accessed and updated during NEXT. But where else could IP reside? dclxvi proposes that, on an '816, IP could reside in the 16-bit stack pointer, S. In the thread "65816 indirect threaded NEXT" Bruce discusses an 11-cycle NEXT that's just 2 instructions:

```
PLX
JMP (0,X)
```

and in the thread "65816 direct threaded NEXT" he discusses a 6-cycle NEXT that's just one instruction:

```
RTS
```

Using S is a slick idea for some smokin' fast Forth, but, as Bruce notes, there's a tradeoff, and it's a doozy.

dclxvi wrote:

anything that pushes onto the stack will clobber your program, PHA, JSR, interrupts, etc. For instructions like JSL you can simply temporarily change the stack pointer [...]. However interrupts are far trickier.

This thread deals with the interrupt problem and a hardware solution for it. Ironically, Garth later summarized it better than I did, so I'll quote him here.

Quote:

The circuit below detects RTS opcodes and forces the resulting "stack" memory accesses to remain in the program bank instead of going to Bank 0. Basically, Forth runs in a non-0 bank, and most of bank 0 must be left available for the hardware stack, since it will not be known what the value in the Forth program counter (IP, held in register S) is when an interrupt hits. Stack operations relating to the interrupt will write to and read from bank 0, while RTS serving as Forth's NEXT reads the bank that the program is in, rather than bank 0 where the hardware stack resides.

/edit

The problem lies in getting SP to do the job of two registers. We want to use SP conventionally as a pointer to short-term read/write storage (for interrupts and for explicit pushes and pulls), and (for the fast NEXT) we also want to use SP as a pointer into a list of pseudo-subroutine addresses -- a list that mustn't be written to. This puzzle has been teasing my imagination, and I wondered what a hardware solution might look like. I ended up with two approaches (both untested). The one shown here is the simpler. If I've overlooked anything I hope someone'll mention it.

The '816 is capable of addressing memory as multiple banks, 64KB max per bank. (dclxvi operates using the simplest model, where all banks map to the same 64K.) We can't have two SP registers, but we can trick SP into addressing two different banks. All we have to do is tweak the CPU's policy of steering all stack-related operations to Bank 0.

For ease of discussion let's assume we have a machine with 128K of memory -- a single 128KB chip, perhaps. The RAM's A0 - A15 inputs are fed straight from the CPU, and A16 comes from an Address Latch. This is S.O.P. All '816 systems using multiple banks require an Address Latch to capture the bank-address information that appears on the data bus during the first half of every bus cycle.

Our goal is to use memory "at" SP in Bank 0 for short-term read/write storage, and to use memory "at" SP in Bank 1 for the Forth address lists. So, we reserve a large stack area in Bank 0. We set the '816's Program Bank Register to 1 and we load the address lists into Bank 1 (and the callee machine-code routines, too). We set SP to point to the desired address list.

By using two banks we've provided two separate storage areas, as required. Normally RTS would be unable to access the address list, due to the CPU's policy of steering all stack-related operations to Bank 0. RTS will cause the CPU to do a pull, but it would normally be from Bank 0. To correct this, the circuit above detects RTS opcodes and forces the resulting "stack" memory accesses to remain in the Program Bank instead of going to Bank 0.

The '574 octal register captures eight signals at the end of every bus cycle. No action results unless the bus cycle is an opcode fetch (SYNC = 1). If SYNC = 1 the flip-flop may change state in the cycle following the opcode fetch.

If RTS (opcode $60) is detected then the D input will be low and the flip-flop will be cleared. Clearing the flip-flop has the effect of discontinuing pulses on ALE, the Address Latch Enable signal. The latch ceases to update cycle-by-cycle as it normally would, and the most recent update (indicating the Program Bank) "sticks." All accesses are forced to the Program Bank, and we get the desired effect of RTS pulling an address from the Program Bank rather than Bank 0.

Note-
The circuit as shown is a little sloppy in that it's triggered by opcodes $70, $E0 and $F0 as well as $60. That turns out not to make any difference, since those other instructions access only the Program Bank anyway.
The XOR gate protects against an obscure hazard: the false opcode fetch that occurs whenever an interrupt is recognized. There's a risk that an RTS opcode will get fetched but not executed. If this happens the circuit mustn't respond; i.e., the flip-flop needs to remain set. To recognize the beginning of the CPU's interrupt sequence we look for the unique circumstance of the address bus failing to increment in the cycle following an opcode fetch. (See Table 5-7 of the '816 Data Sheet.) No increment means the least-significant address line, A0, will fail to toggle. The output of the XOR gate will be low and the flip-flop will remain set (allowing a successful interrupt with machine state pushed to stack in Bank 0 where it belongs).
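The two notes above can be sketched as a small Python model of the flip-flop's decision. This is only an illustration: the mask/value pair is an assumption inferred from the opcode list in the note ($60, $70, $E0, $F0), not read off the schematic, and the function names are made up for the example.

```python
# Assumed decode: bits 7 and 4 of the opcode are ignored, the rest must match $60.
RTS_MASK = 0x6F
RTS_VALUE = 0x60

def decode_hits(mask=RTS_MASK, value=RTS_VALUE):
    """Return every opcode the decoder would treat as RTS."""
    return [op for op in range(256) if op & mask == value]

def freeze_bank_latch(opcode, fetch_addr, next_addr):
    """True if the flip-flop should clear, i.e. ALE pulses get suppressed.

    fetch_addr is the address of the opcode fetch, next_addr the address
    on the bus in the following cycle.
    """
    decoded = (opcode & RTS_MASK) == RTS_VALUE
    a0_toggled = ((fetch_addr ^ next_addr) & 1) == 1   # the XOR gate
    return decoded and a0_toggled

print([f"${op:02X}" for op in decode_hits()])
print(freeze_bank_latch(0x60, 0x1234, 0x1235))  # RTS really executes
print(freeze_bank_latch(0x60, 0x1234, 0x1234))  # interrupt: address repeats
```

The last call models the false opcode fetch: the address does not advance, A0 does not toggle, and the latch keeps updating normally so the interrupt pushes land in Bank 0.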

So! The upside is that 6-cycle NEXT is all that we'd hoped: we can aim SP at a list of addresses and use RTS to rapidly "Call" each routine in sequence. JSR opcodes are not required, slashing Call overhead from 12 cycles to 6. Explicit pushes & pulls can be used for short-term storage, and interrupts are fully functional. The downsides are:

You need hardware mods or a built-from-scratch machine.
The large (and mostly unused) stack area that must be reserved in Bank 0. That's required because for all addresses pointed to by SP in the Program Bank, those corresponding addresses (and some additional bytes below) in Bank 0 are subject to being written at any time by interrupts. Also SP must avoid addresses corresponding to Direct-page, I/O ports and any other sensitive areas in Bank 0.
Forth's Return Pointer will have to reside in Y or in memory.

When coding, you need to bear in mind that SP gets altered every time an RTS (ie, NEXT) executes! Nevertheless, pushes and pulls (PHA & PLA, for example) are permitted provided that you put SP back where you found it before the next NEXT occurs. The Bank 0 allocation for stack will be wastefully large. Luckily, memory is cheap -- and there's nothing to prevent us populating additional banks.

-- Jeff

dclxvi · Post by **dclxvi** » Wed Nov 16, 2011 3:28 am

An interesting approach!

One possible complication is literal data (e.g. numbers, strings, branches). For example, LITERAL compiles:

```
DW LIT-1,data
```

and LIT is:

```
LIT PLA     ;get literal data
    DEX     ;push it onto the data stack
    DEX     ;"
    STA 0,X ;"
    RTS     ;NEXT
```

AGAIN and AHEAD compile:

```
DW BRANCH-1,address-1
```

and BRANCH is:

```
BRANCH PLA ;get literal data
       TCS ;store it in IP
       RTS ;NEXT
```
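To make the role of the inline cells concrete, here is a minimal Python model of LIT and BRANCH running from an RTS-threaded list. It assumes pulls read the thread itself (as in a single-bank setup). The routine addresses, the HALT primitive and the sample thread are hypothetical, chosen only for the illustration.

```python
mem = bytearray(0x10000)
LIT, BRANCH, HALT = 0x1000, 0x1100, 0x1200   # made-up routine addresses

def poke16(addr, value):
    mem[addr] = value & 0xFF                 # little-endian cell
    mem[addr + 1] = (value >> 8) & 0xFF

def lay_down(addr, cells):
    for i, cell in enumerate(cells):
        poke16(addr + 2 * i, cell)

def pull16(s):
    """PLA/RTS-style pull: read the cell at S+1, return (value, S+2)."""
    return mem[s + 1] | mem[s + 2] << 8, s + 2

def run(s):
    data_stack = []
    while True:
        target, s = pull16(s)          # RTS (NEXT): fetch routine address - 1
        pc = target + 1
        if pc == LIT:
            value, s = pull16(s)       # PLA: fetch the inline literal
            data_stack.append(value)   # DEX DEX STA 0,X
        elif pc == BRANCH:
            s, _ = pull16(s)           # PLA then TCS: S := address - 1
        elif pc == HALT:
            return data_stack
        else:
            raise ValueError(f"unknown routine at ${pc:04X}")

# Thread at $4000: push 7, branch over "push 99", push 8, stop.
lay_down(0x4000, [LIT - 1, 7, BRANCH - 1, 0x400C - 1,
                  LIT - 1, 99, LIT - 1, 8, HALT - 1])
result = run(0x3FFF)
print(result)
```

The branch cell holds the target minus one for the same reason the list holds routine addresses minus one: the following RTS pulls from S+1.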

(Actually, in my kernel TOS is in A, not 0,X. I think TOS in A is the right choice for STC and probably ITC as well, but I'm now thinking I should've used 0,X for TOS in my (DTC) kernel, since I wind up having to save/restore A more than I thought I would, and I might be able to have the x flag be 1 (it is currently 0, which is slower). But I digress.)

PLA won't access the program bank, but there is an (untested) alternative. For example, if the x flag is 0 (and assuming DBR = PBR):

```
LIT TSC     ;(posted as TCS; corrected later in the thread) copy S into A...
    TAY     ;...and then into Y
    LDA 1,Y ;now get the inline data (S points one byte below it)
    DEX     ;and push it onto the data stack
    DEX     ;"
    STA 0,X ;"
    PLA     ;advance the stack pointer past the inline data
    RTS     ;NEXT
```

If the x flag is 1, this could be used instead:

```
LIT TSC     ;(posted as TCS; corrected later in the thread) copy S into A...
    INC A   ;...step to the inline data (S points one byte below it)...
    STA N   ;...and then into N
    LDA (N) ;now get the inline data
    DEX     ;and push it onto the data stack
    DEX     ;"
    STA 0,X ;"
    PLA     ;advance stack pointer
    RTS     ;NEXT
```

(Self-modifying code could be used in place of STA N and LDA (N) too.)

However, the whole point of the RTS trick is speed, so the key, I think, is to keep gymnastics like the above to a minimum, so as not to negate the speed advantage of the RTS trick.

Dr Jefyll wrote:

(dclxvi operates with no distinction between banks -- they all use the same 64K.)

In a sense, it does make a distinction between banks; all the action takes place in bank 0, so "everything" (code, data, addresses) must be there, but you can use words like L@ ( addrlo addrhi -- data) L! ( etc.) LC@ and LC! to access memory outside bank 0 (cells are 16 bits wide). In theory, my kernel could be written in such a way that code and/or data need not be in bank 0, but I've opted for the all-in-bank-0 approach as a starting point. Besides, most Forth applications will easily fit in 64k.

Dr Jefyll · Post by **Dr Jefyll** » Wed Nov 16, 2011 4:42 am

Cool! I'm happy to see some actual code snippets. On the other hand I'm a little embarrassed, because your code samples show that I've been short-sighted. RTS shouldn't be the only means to do a pull from the Program Bank. We also need to be able to pull from the Program Bank to some other register (A, X or Y), not just to the PC (which is what RTS does).

PLA looks like it'd be a good choice to share the "use Program Bank" attribute with RTS. The circuit can certainly be modified to allow both RTS and PLA to behave this way; in fact the similarity of the opcodes ( $60 vs $68 ) makes the mod pretty trivial. So if we want to play with the idea of 6-cycle NEXT assisted by hardware, let's assume RTS and PLA both pull from the Program Bank.

The revised schematic is here:

[Attachment: Six-cycle_NEXTrev1.gif]

dclxvi wrote:

Dr Jefyll wrote:

(dclxvi operates with no distinction between banks -- they all use the same 64K.)

In a sense, it does make a distinction between banks; all the action takes place in bank 0, so "everything" (code, data, addresses) must be there, but you can use words like L@ ( addrlo addrhi -- data) L! ( etc.) LC@ and LC! to access memory outside bank 0

I wanted to ask you about that, because frankly I was half-guessing as to what hardware you are running. Can you describe your system?

-- Jeff
[Edited for brevity and clarity. Schematic added.]

dclxvi · Post by **dclxvi** » Thu Nov 17, 2011 3:27 am

Dr Jefyll wrote:

I wanted to ask you about that, because frankly I was half-guessing as to what hardware you are running. Can you describe your system?

When I wrote the initial post in the other thread way back when, I figured I'd use an Apple IIgs to develop it. I wanted to keep it generic so that it could be adapted for other 65C816 systems, but that was my starting point for a system. 16-bit cells meant 16-bit addresses, so it seemed simplest to keep everything in a single bank, and bank 0 it was with the use of RTS. Other banks would/could then be accessed via L@ and friends (or via BLOCKs).

However, I've been developing it on a simulator that I wrote. I specifically wrote the simulator for this purpose, so some of the things that this kernel doesn't use (the Direct register, the DBR, the PBR, emulation mode, 8-bit index registers) haven't been implemented. I've added to it as needed (originally it didn't have 8-bit accumulator support but that got unwieldy fast). It is an extremely crude tool, but it has a built-in assembler, it's extensible (e.g. I/O support is not built in, it gets loaded from a text file at run time), it works, and it's less than 3.5k. Essentially it's one 64k block of RAM, and the WDM opcode is used to make system calls (e.g. for I/O). There's a small ROM library/monitor equivalent (in 65C816 assembly, also loaded from a text file at run time) which has things like a hex output routine, and dumps registers when a BRK is encountered. The 65C816 code (including the kernel) easily fits in 64k, so it's been sufficient for getting it up and running and experimenting.

Also, for those of you scoring at home, I just noticed that TCS should be TSC in the last two examples of my previous post.

Dr Jefyll · Post by **Dr Jefyll** » Mon Nov 21, 2011 8:00 am

dclxvi wrote:

When I wrote the initial post in other thread way back when, I figured I'd use an Apple IIgs to develop it. I wanted to keep it generic so that it could be adapted for other 65C816 systems

Well, to add the hardware mod to an existing '816 system would be a little messy but not too bad. A person would have to tap onto the data bus and a few other signals. The only slightly nasty bit is having to cut the trace that supplies the Enable signal to the system's Address Latch (so that the mod circuit can take over as the new source for that signal).

As far as other targets for your RTS-powered Forth, built-from-scratch '816 systems are eligible, of course. 6502 systems that've been retrofitted with an '816 could certainly run the RTS trick. To also have the mod (and the use of interrupts etc.) you'd need to install some additional RAM and at least a one-bit address latch.

Re: the matter of allowing PLA to access the Program Bank (as mentioned in my previous post), I've updated that post to include a revised schematic. The revision consists simply of removing all reference to D3, which for our purposes becomes a don't-care. The revised circuit triggers from opcodes $60 and $68 as required, and also spuriously but harmlessly from opcodes $70, $78, $E0, $E8, $F0 and $F8.
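As a quick cross-check of that opcode list, the revised decode can be enumerated in a few lines of Python. The mask below is an assumption inferred from the list (D3, D4 and D7 treated as don't-cares), not taken from the schematic. The mnemonics are the standard 65816 ones for those opcodes.

```python
REV_MASK = 0x67    # assumed: bits 7, 4 and 3 ignored
REV_VALUE = 0x60

MNEMONICS = {
    0x60: "RTS", 0x68: "PLA",            # the two intended triggers
    0x70: "BVS", 0x78: "SEI",            # spurious triggers: none of these
    0xE0: "CPX #", 0xE8: "INX",          # reads or writes data memory
    0xF0: "BEQ", 0xF8: "SED",
}

def revised_hits():
    """Return every opcode the revised decoder would respond to."""
    return [op for op in range(256) if op & REV_MASK == REV_VALUE]

hits = revised_hits()
for op in hits:
    print(f"${op:02X}  {MNEMONICS[op]}")
```

Every spurious hit is an implied, immediate or relative-branch instruction, which is consistent with the claim that freezing the bank latch during them does no harm.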

To implement the mod using discrete logic I'd suggest using a 74AC138 as the 6-input NAND gate. Also the delay from PHI2 to ALE should be kept short, so a minor change may be in order there.

-- Jeff

Dr Jefyll · Post by **Dr Jefyll** » Sat Apr 14, 2012 6:50 am

I've been asked for a clarification of this topic. (The topic begins here.) But right now I want to backtrack and lay out the basics of how dclxvi's 65816 six-cycle NEXT operates -- as I probably should have done in the first place!

Then in an additional post (below) I'll touch on my hardware mod and the problem it solves.

[Attachment: RTS threading 0 rev1.gif]

In the lower part of this diagram there are three routines which I've named ROUTINE1, ROUTINE2, ROUTINE3. Of course they don't actually do anything because they're mostly NOPs. But we will see how they can be invoked in sequence by CALLER (at the top of the diagram). We want CALLER to execute ROUTINE1, ROUTINE2, ROUTINE3 (etc) almost as if we had explicitly coded JSR ROUTINE1, JSR ROUTINE2 and so on.

But CALLER has no JSR opcodes; it is just a list of addresses -- a form of Direct Threaded Code (DTC). For ease of discussion I've located ROUTINE1 at address 1111, ROUTINE2 at 2222 and so on. If we assume the 65816 Stack Pointer has been set to 3FFF and an RTS has been executed then we'll see the process set in motion. (Some of you will immediately grasp this, but here are the exact details.)

The RTS pops the 16-bit value at ( SP+1 ) and increments SP by 2. The popped value goes to the PC, which is subsequently incremented by 1. In this case the 16-bit value at 4000 is 1110, so the PC ends up at 1111 -- the address of ROUTINE1. SP is left at 4001. Machine code is fetched and executed starting at 1111. When the work (represented by NOPs) is done the routine exits with an RTS.

Again RTS pops the 16-bit value at (SP+1) and increments SP by 2. The popped value goes to the PC, which is subsequently incremented by 1. In this case the 16-bit value at 4002 is 2221, so the PC ends up at 2222 -- the address of ROUTINE2. SP is left at 4003. Machine code is fetched and executed starting at 2222. When the work is done the routine exits with an RTS.

Again RTS pops (SP+1) and the pattern continues. SP is left at 4005 and machine code is fetched and executed starting at 3333. When the work is done ROUTINE3 exits with an RTS.
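The walk-through above can be replayed with a few lines of Python. This is just a sketch of the RTS arithmetic on a flat 64K memory image, using the hexadecimal addresses from the diagram. It does not model instruction execution inside the routines.

```python
mem = bytearray(0x10000)

def poke16(addr, value):
    mem[addr] = value & 0xFF            # low byte first (little-endian)
    mem[addr + 1] = value >> 8

# CALLER: a list of (routine address - 1) cells starting at $4000
for i, routine in enumerate((0x1111, 0x2222, 0x3333)):
    poke16(0x4000 + 2 * i, routine - 1)

def rts(sp):
    """Pop 16 bits from SP+1, add 1, return (new PC, new SP)."""
    lo = mem[(sp + 1) & 0xFFFF]
    hi = mem[(sp + 2) & 0xFFFF]
    return ((hi << 8 | lo) + 1) & 0xFFFF, (sp + 2) & 0xFFFF

sp = 0x3FFF
trace = []
for _ in range(3):                      # one RTS per routine exit
    pc, sp = rts(sp)
    trace.append((pc, sp))
    print(f"PC=${pc:04X}  SP=${sp:04X}")
```

Each pass prints the routine entry address and the value SP is left at, matching the three steps described above.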

This demonstrates how the S register and RTS can be used as an Interpreter to "execute" a list of addresses. The advantage is that no JSR instructions are used -- only RTS. There is a substantial benefit in that the cost for each routine called is 2 bytes (not 3), and 6 cycles -- not 12!
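For a sense of scale, the per-call figures can be tallied in Python. The numbers are the standard ones for JSR absolute (3 bytes, 6 cycles) and RTS (6 cycles). The function names and the 1000-call example are just for illustration.

```python
JSR_BYTES, JSR_CYCLES = 3, 6     # JSR abs
RTS_CYCLES = 6                   # RTS
CELL_BYTES = 2                   # one address in the threaded list

def jsr_cost(calls):
    """(bytes, cycles) for a sequence of JSRs, each callee ending in RTS."""
    return calls * JSR_BYTES, calls * (JSR_CYCLES + RTS_CYCLES)

def threaded_cost(calls):
    """(bytes, cycles) for the same sequence as an RTS-threaded list."""
    return calls * CELL_BYTES, calls * RTS_CYCLES

print(jsr_cost(1000))
print(threaded_cost(1000))
```

This counts only call/return overhead, not the work done inside each routine.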

The S register in this setting acts as the "Interpretive Pointer," i.e., the FORTH "program counter," and RTS is what vectors us to the NEXT routine to execute. FORTH implementations vary widely. dclxvi's use of SP and RTS is radically unusual.

Dr Jefyll · Post by **Dr Jefyll** » Sat Apr 14, 2012 7:03 am

dclxvi wrote:

Your program is stored on the stack (!) which means that anything that pushes onto the stack will clobber your program

To illustrate this, let's run some scenarios (referring to the routines in the post above).

If an interrupt occurs during ROUTINE1, SP will be at 4001, which means 4001 (and the bytes below) is where the 65816 will save its PC and P registers.
If an interrupt occurs during ROUTINE2 then 4003 and the bytes below is where PC and P will get written.
If an interrupt occurs during ROUTINE3 then 4005 and the bytes below get written.

This is the "clobber your program" effect! We see that SP is unpredictable. But the region holding the address lists is guaranteed to take a hit.

The solution is to use the 65C816's facility for multiple 64K banks. For example, 4001 in Bank 0 is distinct from 4001 in any other bank, and that's what saves the day. All we have to do is tweak the CPU's policy of steering all stack-related operations to Bank 0. For example we allow the CPU to use 400_ (whatever) in Bank 0 when an interrupt occurs, but the hardware mod forces activation of 400_ (whatever) in the Program Bank when RTS accesses the list. That's how we manage to use the SP register for two different functions.
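A rough two-bank model in Python shows the effect. It is a sketch under stated assumptions: the frozen address latch is reduced to a `modded` flag, native-mode interrupt entry is modelled as pushing PBR, PCH, PCL and P, and a copy of the list is placed in Bank 0 only to show what a single-bank system would suffer.

```python
banks = {0: bytearray(0x10000), 1: bytearray(0x10000)}
PROGRAM_BANK = 1

def poke16(bank, addr, value):
    banks[bank][addr] = value & 0xFF
    banks[bank][addr + 1] = value >> 8

def rts(sp, modded):
    bank = PROGRAM_BANK if modded else 0   # stock CPU: stack pulls use bank 0
    lo, hi = banks[bank][sp + 1], banks[bank][sp + 2]
    return ((hi << 8 | lo) + 1) & 0xFFFF, sp + 2

def interrupt(sp, pbr, pc, p):
    """Interrupt entry: pushes always go to bank 0, mod or no mod."""
    for byte in (pbr, pc >> 8, pc & 0xFF, p):
        banks[0][sp] = byte
        sp -= 1
    return sp

cells = (0x1110, 0x2221, 0x3332)
for bank in (0, 1):
    for i, cell in enumerate(cells):
        poke16(bank, 0x4000 + 2 * i, cell)
before = {b: bytes(banks[b][0x4000:0x4006]) for b in banks}

pc, sp = rts(0x3FFF, modded=True)          # enter ROUTINE1, SP = $4001
interrupt(sp, PROGRAM_BANK, pc + 2, 0x30)  # interrupt hits during ROUTINE1
after = {b: bytes(banks[b][0x4000:0x4006]) for b in banks}

bank0_clobbered = after[0] != before[0]
bank1_intact = after[1] == before[1]
pc2, sp2 = rts(sp, modded=True)            # after RTI, NEXT still works
print(bank0_clobbered, bank1_intact, f"${pc2:04X}")
```

The Bank 0 copy is overwritten at $4000-$4001 (and two bytes below), while the list in the Program Bank is untouched and the next RTS still lands on ROUTINE2.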

What is perhaps counter-intuitive about the scheme is that SP is somewhat of a loose cannon; we have very little control over where interrupts will write to! All we know for sure is that PC and P will get saved in Bank 0, and it'll be somewhere in the region of Bank 0 that corresponds to the region that holds the address lists in the Program Bank. The diagram below illustrates the two corresponding regions.

The address lists might be a dozen or so Kbytes in size, and the corresponding Bank 0 region slightly larger -- certainly far more memory than any interrupt would need. Nevertheless, the uncertainty about SP forces us to reserve that amount solely for interrupt usage -- a wasteful yet reasonable tradeoff, given memory prices nowadays.

-- Jeff
