Split stack vs. other approaches

chitselb · Post by **chitselb** » Wed Apr 22, 2015 5:19 pm

On another thread, Mike B writes

barrym95838 wrote:

Dr Jefyll wrote:

... "TOS and TOS+1 are at fixed points in ZP." TOS resides someplace other than on stack??? I guess this relates to the split stack you're using. TOS is treated differently than the other stack items so it can be used as an indirect pointer -- is that right?

That's Charlie's deal with his Pettil implementation. It involves some trade-offs. A favorable example: NIP becomes INX. There are pros and cons to any decision like that. For my 65m32 DTC Forth, I keep TOS in the accumulator, but that would be impossible for a 6502, and a dubious optimization for an '802/'816. Perhaps Charlie did some benchmarks that led to his design decision, and would be able to share them briefly with us, here or in a fresh thread?

The idea came from David Wheeler's 6502 page and also here.

Garth has gently suggested that the split stack might not be very good.

GARTHWILSON wrote:

It would take some convincing for me to believe that splitting the stack and having to keep moving things in and out of a fixed-address TOS pair of ZP bytes would be more efficient than keeping byte pairs together in every cell and doing indexing and doing the INX or DEX twice to add or drop stack cells.

There are two 48-byte regions of zero page called 'stackl' and 'stackh', with 'tos' being stored in a separate 2-bytes contiguous region of zero page. The X register is the stack pointer Here are DUP and DROP

Code: Select all

DUP
    dex               ; [2]
    lda tos+1         ; [3]
    sta stackh,x      ; [4]
    lda tos           ; [3]
    sta stackl,x      ; [4]
    jmp next          ; [3]
                      ; [19]
DROP
    lda stackh,x      ; [4]
    sta tos+1         ; [3]
    lda stackl,x      ; [4]
    sta tos           ; [3]
    inx               ; [2]
    jmp next          ; [3]
                      ; [19]

FIG DUP = [23] = 2 + 2 + 4 + 4 + 4 + 4 + 3
FIG DROP = [7] = 2 + 2 + 3

GARTHWILSON · Post by **GARTHWILSON** » Wed Apr 22, 2015 7:45 pm

The normal ones (with bytes together) is:

Code: Select all

DUP
    DEX
    DEX
    LDA  3,X
    STA  1,X
    LDA  2,X
    STA  0,X
    JMP  NEXT   ; FIG's DUP is less efficient, having an extra PHA and PLA, and has
                ; to make NEXT come immediately after PUSH to avoid an extra JMP.

DROP
    INX
    INX
    JMP  NEXT   ; (or in the case of STC, just straightline it with only the two INX's (2 bytes & 4 clocks total), with no JMP)

chitselb · Post by **chitselb** » Wed Apr 22, 2015 8:02 pm

Another way to do it would be splitting the stack without a separate 16-bit TOS. As with FIG, words like @ or ! require a copy of tos before using (ZP),Y addressing

Code: Select all

DUP
    dex               ; [2]
    lda stackl+1,x    ; [4]
    sta stackl,x      ; [4]
    lda stackh+1,x    ; [4]
    sta stackh,x      ; [4]
    jmp next          ; [3]
                      ; [21]
DROP
    inx               ; [2]
    jmp next          ; [3]
                      ; [5]

barrym95838 · Post by **barrym95838** » Thu Apr 23, 2015 2:50 am

I can't say that I've analyzed every permutation, but when it comes to the good old NMOS 6502, I think that perhaps it would be best to concentrate on which words get executed the most often and optimize them as best you can.

Dr. Brad dropped an interesting little tidbit here:

Quote:

For reference, Phil Koopman did a study of how often Forth primitives get called in some benchmark Forth programs. The top ten were CALL a.k.a. ENTER (12.21%), EXIT (11.74%), VARIABLE (5.46%), @ (5.40%), 0BRANCH (4.78%), LIT (4.54%), + (4.18%), SWAP (3.90%), R> (3.89%), >R (3.87%), CONSTANT (3.68%), and DUP (3.05%). Note that CALL, EXIT, R>, and >R combined add up to about 32% of primitive executions.

I'll forgive him for listing twelve words in the top ten, and also mention that NEXT can be a big cycle hog, depending on implementation and host architecture (free for STC, significantly less free for others, but to varying degrees). I think that Pettil is a strong performer in the NMOS 6502 crowd because of its streamlined NEXT, not because of its split stack and separate TOS, but I am most certainly not a expert in that area, so I'll just leave it at a thought. It's a shame that NIP isn't on that list, though, because Pettil crushes the competition for that word.

A parting thought would be to spend some serious time and energy on coding an optimal ENTER and EXIT (unless we're talking STC, in which case they're just JSR and RTS).

Mike B.

Brad R · Post by **Brad R** » Thu Apr 23, 2015 3:05 pm

Oops. I don't know how I miscounted that.

On STC Forths that don't do inline expansion, the function of NEXT (chaining from one machine language primitive to another) is performed by RTS/JSR, so it's not free. On CPUs that can do a single-instruction NEXT, STC can actually be slower.

chitselb · Post by **chitselb** » Thu Apr 23, 2015 5:01 pm

VARIABLE is third? really? I'm also surprised CONSTANT makes this list

Brad R · Post by **Brad R** » Thu Apr 23, 2015 5:58 pm

The key phrase is "some benchmark Forth programs." Koopman tested a fractal landscape generation program, Conway's game of Life, a floating-point package written in high-level Forth, and the Forth compiler itself. VARIABLE was used almost as much as CALL and EXIT in the Life program. In the Forth compiler, VARIABLE didn't make the top twelve. The results I quoted were the average across all four test programs.

barrym95838 · Post by **barrym95838** » Fri Apr 24, 2015 2:02 am

Brad R wrote:

Oops. I don't know how I miscounted that.

On STC Forths that don't do inline expansion, the function of NEXT (chaining from one machine language primitive to another) is performed by RTS/JSR, so it's not free. On CPUs that can do a single-instruction NEXT, STC can actually be slower.

Now it's my turn to say "oops". I have read your Moving Forth series enough times to know better, but maybe I'm due for another round.

I have really become enamored with DTC ... it seems to have the right "feel" in the simple work that I've been doing so far. I am not nearly advanced enough to tackle the subjects of optimizing compilers and other more modern techniques, but I wouldn't mind peeking at a few links, if you have any available.

Thanks,

Mike B.

nyef · Post by **nyef** » Fri Apr 24, 2015 2:16 am

Brad R wrote:

Oops. I don't know how I miscounted that. :)

On STC Forths that don't do inline expansion, the function of NEXT (chaining from one machine language primitive to another) is performed by RTS/JSR, so it's not free. On CPUs that can do a single-instruction NEXT, STC can actually be slower.

So, here's a frightening thought: Using KimKlone-like techniques, it should be possible to maintain and manipulate the "effective" program counter outside of the CPU, basically using a microcode rom to indicate, based on sync and latched opcode, which bytes need to be fetched from the instruction stream, completely ignoring the address provided by the CPU. Taken branches involve an extra instruction before sync, so even that can be dealt with. Next, use a 16-bit data bus and a MUX for memory access by the CPU. Add another 16-bit pointer for use as a Forth instruction pointer. Allow the microcode to kick off a 16-bit aligned read to load into the instruction pointer in a single cycle. Finally, use instruction substitution or similar to set up a two-cycle instruction that will load the externalized CPU instruction pointer from the memory pointed to by the Forth instruction pointer, also forcing a pre- or post-increment.

It's not quite a one-cycle NEXT, as it takes two cycles, but it should still be a win over RTS/JSR. Requires DTC, though. ITC would be three cycles with this sort of scheme.

(edit: swapped DTC and ITC in the context of cycle counts.)

theGSman · Post by **theGSman** » Fri Apr 24, 2015 12:52 pm

Had I been aware of the concept of a split stack at the time, I would definitely have incorporated it into my forth. However, it would be low on my list of priorities to do now. Since adding or removing numbers from the stack involves only one DEX or INX instead of two (saving 2 clock cycles each time) there is a saving to be had but not enough to justify the effort of getting it to work right in an existing forth.

Having a separate TOS is useful in processors like the x86 where you can dedicate a register solely for that purpose - especially if you can reverse the roles of the data and return stack pointers so that refilling the TOS becomes a simple POP operation.

On a 6502 there is not so much of an advantage. Even the fact that the TOS would be a pointer in its own right is only a marginal advantage. There is definitely an advantage when performing the @ operation but not for the corresponding ! operation because like most forth primitives, the TOS has to be refilled each time the word is executed.

In spite of the "research" listed above, I believe that after NEXT, the words that need the most time saved would be DUP and DROP. The DUP operation would be speeded up (very) slightly with a separate TOS but the DROP operation would be slowed down significantly as shown in the OP. In an STC forth, the difference would be even greater because without a separate TOS, DROP could be coded in-line as INX (or INX INX) with no JSR/RTS overhead.

barrym95838 · Post by **barrym95838** » Fri Apr 24, 2015 5:05 pm

theGSman wrote:

... Having a separate TOS is useful in processors like the x86 where you can dedicate a register solely for that purpose - especially if you can reverse the roles of the data and return stack pointers so that refilling the TOS becomes a simple POP operation ...

I like all of your points, especially this one. It is a huge advantage in my custom hobby processor design, because I have just enough registers to dedicate the accumulator to TOS, I have auto-increment and auto-decrement, the accumulator can be used as a pointer without penalty, and pushes to the R stack can be combined with a post-load to the same register with no code space penalty. Now I just need to get the dang thing working ...

Mike B.

Brad R · Post by **Brad R** » Fri Apr 24, 2015 5:26 pm

barrym95838 wrote:

I am not nearly advanced enough to tackle the subjects of optimizing compilers and other more modern techniques, but I wouldn't mind peeking at a few links, if you have any available.

Alas, I don't. Many years ago my friend Charles Curley wrote a book on Forth for the 68000 -- never published; I reviewed a draft -- which had a chapter on some optimization techniques (inline expansion, if I recall correctly). Other than that, what I've learned I have picked up from hints dropped in various conference papers. I know there has been some very good work done on optimizing Forth compilers, but it tends to be proprietary.

Brad R · Post by **Brad R** » Fri Apr 24, 2015 5:28 pm

nyef wrote:

So, here's a frightening thought: Using KimKlone-like techniques, it should be possible to maintain and manipulate the "effective" program counter outside of the CPU, basically using a microcode rom to indicate, based on sync and latched opcode, which bytes need to be fetched from the instruction stream, completely ignoring the address provided by the CPU. ...

Yes, but if you're going to do that, why not just design a Forth CPU?

nyef · Post by **nyef** » Fri Apr 24, 2015 5:47 pm

Brad R wrote:

nyef wrote:

So, here's a frightening thought: Using KimKlone-like techniques, it should be possible to maintain and manipulate the "effective" program counter outside of the CPU, basically using a microcode rom to indicate, based on sync and latched opcode, which bytes need to be fetched from the instruction stream, completely ignoring the address provided by the CPU. ...

Yes, but if you're going to do that, why not just design a Forth CPU?

Because then it would be off-topic for 6502.org?

barrym95838 · Post by **barrym95838** » Fri Apr 24, 2015 11:34 pm

nyef wrote:

Because then it would be off-topic for 6502.org?

I can't wait to see how it turns out ... I hope that you name it the 6502.Borg!!

Locutus B.

Split stack vs. other approaches

Split stack vs. other approaches

Re: Split stack vs. other approaches

Re: Split stack vs. other approaches

Re: Split stack vs. other approaches

Re: Split stack vs. other approaches

Re: Split stack vs. other approaches

Re: Split stack vs. other approaches

Re: Split stack vs. other approaches

Re: Split stack vs. other approaches

Re: Split stack vs. other approaches

Re: Split stack vs. other approaches

Re: Split stack vs. other approaches

Re: Split stack vs. other approaches

Re: Split stack vs. other approaches

Re: Split stack vs. other approaches