Splitting Datastack in ZP

enso1 · Post by **enso1** » Thu Mar 27, 2025 4:50 pm

Splitting the low and high bytes of the datastack is not a new idea. It has been mentioned in several threads here, but I haven't seen a thorough discussion, or any cons mentioned. This makes me wonder if I am missing something, because so few Forths do it.

Pros:
* DROP is a single inx instead of two;
* pushing data onto the datastack may be a little smaller;
* Single-byte operations are more efficient;
* 24-bit (or even 32-bit) operations are possible with no changes to existing code. You could even keep the rarely-used high bytes in non-zp memory.

Cons:
* You cannot address memory directly through TOS without copying H and L bytes together elsewhere first;
* You cannot jump through TOS, although that would be sketchy anyway.

Addressing memory is better as it looks as only one byte needs to be copied: you can put the L byte just below the H for the duration (and technically, adjust X, or be sketchy and careful if calling deeper or have interrupts) .

Am I missing something that makes it less interesting?

related threads:
viewtopic.php?f=9&t=1619

SamCoVT · Post by **SamCoVT** » Thu Mar 27, 2025 6:22 pm

I think you already nailed it with the first con. Forth does a LOT with addresses. Almost anything that is going to interface with hardware is going to be using @, C@, !, C!, etc. Strings are an address and a length (or just an address for older counted strings). Arrays and structures are almost always passed around by address.

It's also common to pass words to other words by address. Here's an example from when I was playing with FAT32 support. I'm only showing the stack comments and the comment block here. The "xt" in the stack comments is the "eXecution Token", which in Tali is the address of a routine to EXECUTE.

Code: Select all

: walkdirectory  ( xt fileid -- addr | 0)
   \ Run routine (given by xt) on every starting address of a
   \ directory entry in directory described by fileid.  The routine
   \ should be ( direntry_addr -- continueflag ) where continueflag
   \ is 1 if directory processing should continue. The return value
   \ of walkdirectory will either be the address of the directory
   \ entry it stopped on (in the sector buffer) or 0 if it reached
   \ the end.

Then I make a helper word named print-fileinfo that prints a single directory entry and leaves a flag saying it wants to go on to the next entry. I also make a helper word named findfile that checks a single directory entry against a given filename and leaves a flag saying to stop when it finds an exact match. These are then used along with walkdirectory like so (with ['] being a word that gets the execution token of the next word in the source and places it on the stack):

Code: Select all

: ls ( -- ) ( List the working directory)
   cr ." FILENAME     ATTRIB   SIZE"
   working_dir init_directory
   ['] print-fileinfo working_dir walkdirectory drop ;

and in my open-file I have (the part in the comments is what is currently on the stack)
   ( fileid ) working_dir  init_directory
   ( fileid ) ['] findfile working_dir walkdirectory
   ( fileid direntry|0 )
   ?dup 0= if
      cr ." Can't open file: " dup 0B type cr
      1 ( ior of 1 = Can't open file ) exit
   then

In other languages, this technique would likely be called "map" or something similar. C has pointers to functions for doing this. It's very powerful in Forth because all words can be passed by their xt, even functions like + (whereas in C you'd need to write a plus function, do plus inside, and then you can have a pointer to that function).

If you also have to interface with assembly routines you did not write, they will usually be expecting 16-bit values in consecutive memory locations.

barrym95838 · Post by **barrym95838** » Thu Mar 27, 2025 7:01 pm

chitselb has PETTIL and I think he does a fantastic job of managing a split stack with separate TOS in a very efficient manner, even with only NMOS instructions. I'm unsure what he's doing currently ... I hope he's okay.

Code: Select all

#if 0
name=-
stack=( n1 n2 -- difference )
tags=forth-83,nucleus
Subtract `n2` from `n1`

#endif
minus
    jsr donegate
                                ; fall through
;--------------------------------------------------------------
#if 0
name=+
stack=( n1 n2 -- sum )
registers=A:X:C:TOS=sum
tags=forth-83,nucleus,math,fig,forth-79,primitive
Return the sum of n1 plus n2

!!! pronounced: "plus"
#endif
plus
    lda stackl,x
plus01                          ; entry point from >BIT
    clc
    adc tos
    sta tos
    lda stackh,x
    adc tos+1
    sta tos+1
    ;fall through
;--------------------------------------------------------------
#if 0
name=NIP
stack=( n1 n2 -- n2 )
tags=nucleus,stack
Remove the second item from the stack

#endif
nip
    inx
qdup01
    jmp next

It's also DTC with a very fast NEXT and the headers stored elsewhere. This is tight and fast stuff right here.

enso1 · Post by **enso1** » Thu Mar 27, 2025 7:25 pm

Yes, I can see that is may be a problem.

My thinking is that, in a loop manipulating memory, you have to dup the pointer, read a byte, increment a 16-bit pointer (unless using y, which involves its own overhead), read another byte, and maybe increment again. With a split stack, we would do a weird dup which shoves the low byte under the high first. The pointer looks like two stacked words, but we only use a half of each.

The extra load and store may be somewhat compensated by the many 8-bit operations.

Another possibility is keeping the low byte of TOS in A, which reduces pointer construction to a single sta (DSPHIGH-1,x). On x86 I generally keep TOS in a register; keeping half makes it more interesting, I suppose.

barrym, yes, I noticed PETTIL in another thread and need to take a look.

Dr Jefyll · Post by **Dr Jefyll** » Thu Mar 27, 2025 8:08 pm

enso1 wrote:

* 24-bit (or even 32-bit) operations are possible with no changes to existing code. You could even keep the rarely-used high bytes in non-zp memory.

IMO, the latter sentence above expresses a GREAT idea. But my post is slightly OT because I still wanna keep the two least-significant bytes together in Z-pg. That's cuzza the #1 Con Enso & Sam mentioned.

I would only split away the top two (or more) bytes that let each stack location hold a dword (or even a float!).

For me the most compelling advantage by far is read- and writeability of the code. Sam made that point for me just a few days ago (see below; emphasis added). I have wrestled with code like that, and trust me it was no fun.

Even after I've tortured it into working there's still no way to make it succinct & maintainable.

-- Jeff

SamCoVT wrote:

FAT32 ended up with a lot of mixed double and single-cell values on the stack (difficult to keep track of)

enso1 · Post by **enso1** » Thu Mar 27, 2025 8:22 pm

Yes, good points. Especially with something like Tali, where there is a ton of handwritten assembler code -- I can see how it would hurt. For a more normal forth, the complexity would be hidden quickly in a handful of words that deal with indirection.

I am still chewing on the lower-half-of-TOS in A.

BruceRMcF · Post by **BruceRMcF** » Thu Mar 27, 2025 8:49 pm

barrym95838 wrote:

chitselb has PETTIL and I think he does a fantastic job of managing a split stack with separate TOS in a very efficient manner, even with only NMOS instructions. I'm unsure what he's doing currently ... I hope he's okay.

Yes, split hi/lo byte Z-stacks naturally goes along with a separate TOS ... just as an S-indexed stack and the rack in the zero page goes along with a separate TOS. It retains the same basic trade-offs -- drop is slower, dup is faster, and nip becomes as fast a drop would be with TOS on the stack itself, a stable (TOS) makes (TOS),Y operations easier, but a TOS,X makes (TOS,X) available.

The trade-offs are fairly close, so much of it would be taste ... and, if relying on an existing code-base as a guide, how they did it.
_________

enso1 wrote:

... With a split stack, we would do a weird dup which shoves the low byte under the high first. The pointer looks like two stacked words, but we only use a half of each. ...

That's why it's much cleaner with a separate TOS. In that set-up, you only have to move pointers when there is a pointer lower down in the stack ... and for words like CMOVE> you want to move the pointers anyway so you can use (U),Y and (V),Y addressing.

enso1 · Post by **enso1** » Fri Mar 28, 2025 2:26 pm

Very tempting...

Since I seem to be writing a bunch of basic low-level crap for different machines, I am considering creating a ROM with basic IO using a datastack for parameter passing.

Dealing with the datastack for low-level code would make it more consistent, no worrying about which registers to pass data in. With basic IO done, I could easily attach an interpreter -- or a compiler to have a working Forth without having to write the annoying bindings becaise IO always uses the wrong registers (on a 6502, every register is usually the wrong one...) It would make low-level code slightly bigger, but not bigger than ad-hoc IO plus Forth bindings, if Forth is the ultimate goal.

enso1 · Post by **enso1** » Fri Mar 28, 2025 4:11 pm

PETTIL is very curious indeed. I am trying to figure out the details. Split stack but not caching TOS in A, I think.

I am more and more convinced that splitting the stack_and_ TOS in A is the recipe for some minor gains (~4 cycles many words I looked at), meaningful in a subroutine-threaded inlining compiler.

BruceRMcF · Post by **BruceRMcF** » Fri Mar 28, 2025 5:52 pm

enso1 wrote:

PETTIL is very curious indeed. I am trying to figure out the details. Split stack but not caching TOS in A, I think.

I am more and more convinced that splitting the stack_and_ TOS in A is the recipe for some minor gains (~4 cycles many words I looked at), meaningful in a subroutine-threaded inlining compiler.

Except for things like C@ C! @ ! where TOS in Zero Page is a win, and you have to be very careful that the ~4 cycles are not charged back in NEXT or ENTER/EXIT.

One advantage of the rack in the hardware stack and data stack on the zero page is you can have bit threaded, direct threaded and subroutine threaded with mostly the same primitives, but if a direct threaded ENTER uses the "JSR ENTER" trick to extract the address of the caller, that ENTER is faster when the hardware stack is the data stack (with separate TOS) and the rack is in zero page.

; +
PLA : CLC : ADC T0 : STA T0
PLA : ADC T1 : STA T1
JMP NEXT

; C@ 65C02
LDA (T0) : STA T0 : STZ T1 : JMP NEXT

; @ 65C02
LDY #1
LDA (T0),Y : TAY : LDA (T0) : STA T0 : STY T1
JMP NEXT

vs

; +
LDA DL,X : CLC : ADC T0 : STA T0
LDA DH,X : ADC T1 : STA T1
INX
JMP NEXT

; C@ 65C02
LDA (T0) : STA T0 : STZ T1 : JMP NEXT

; @ 65C02
LDY #1
LDA (T0),Y : TAY : LDA (T0) : STA T0 : STY T1
JMP NEXT

enso1 · Post by **enso1** » Fri Mar 28, 2025 6:56 pm

Looking at +,

Code: Select all

; Split stack DH, DL, TOS in A

add       clc      
          adc     DL+1,x    ; low byte
          tay
          lda     DH+0,x    ; high byte
          adc     DH+1,x
          sta     DH+1,x
          inx
          tya
          rts

; Split stack, TOS not cached
          clc
          lda     DL+0,x
          adc     DL+1,x
          sta     DL+1,x
          lda     DH+0,x
          adc     DH+1,x
          sta     DH+1,x
          inx
          rts

This is fairly typical: we save two zero-page accesses, but lose with TAY/TYA, net 4 cycle win...

Actually, using Y instead of pushing

BruceRMcF · Post by **BruceRMcF** » Fri Mar 28, 2025 7:18 pm

enso1 wrote:

Looking at +,

Code: Select all

; Split stack DH, DL, TOS in A

add
          clc      
          adc     DL+1,x    ; low byte
          tay
          lda     DH+0,x    ; high byte
          adc     DH+1,x
          sta     DH+1,x
          inx
          tya
          rts [ 24 clock cycles + NEXT]

; Split stack, TOS not cached
          clc
          lda     DL+0,x
          adc     DL+1,x
          sta     DL+1,x
          lda     DH+0,x
          adc     DH+1,x
          sta     DH+1,x
          inx
          rts ;[ 28 clock cycles + NEXT]

...

And split stack, T0, T1 in a zero page vector.

Code: Select all

          clc
          lda     T0
          adc     DL+1,x
          sta     T0
          lda     T1
          adc     DH+1,x
          sta     T1
          inx
          rts ;[ 24 cycles + NEXT]

So the same execution speed as retaining the TOS in the A register, except also gaining the execution benefits on C@/C!/@/! -- don't forget that each access to TOS with TOS in a zp vector location saves a clock versus retaining the TOS on the stack, and with a vector TOS, that's four accesses to TOS, not two, because the "original TOS" and the "new TOS" are in the same place, with no indexing penalty.

Two clock cycles can be gained by reserving the TOS vector for @/! operations and using the high byte of that for the high byte rather than the stack:

Code: Select all

; Split stack DH, DL, TOS in A & "T+1" in zp.

add
          clc      
          adc    DL+0,x    ; low byte
          tay
          lda     T+1    ; high byte
          adc    DH+0,x
          sta     T+1
          inx
          tya
          rts ;[ 22 clock cycles + NEXT]

cfetch ; C@ {65C02} cheaper
          sta T
          lda (T)
          stz T+1
          rts ;[ 11 clock cycles + NEXT]

dup ; dup cheaper by 1 clock
          dex
          sta DL,X
          ldy T+1
          sty DH,X
          rts ;[ 13 clock cycles + NEXT]

drop ; nip cheaper but drop more expensive by 7 clocks
          lda DL+1,X
          ldy DH+1,X
          sty T+1
          inx
          rts ;[ 13 clock cycles + NEXT]

Since DROP is the most common Forth primitive in execution, it may well be that the savings in the split stack, no TOS cached (only 2 clocks + NEXT) on DROP pays for the more expensive primitives elsewhere.

BruceRMcF · Post by **BruceRMcF** » Sat Mar 29, 2025 9:30 pm

I will note another pragmatic reason for picking a split datastack ... what if your Forth system is more comfortable with an x-indexed data stack, but you want a bigger stack than there is room for in your zero page?

Just to give a sample motivation, suppose a board is using a Basic for it's operating system which uses the top half of the zero page, and you wish to avoid trampling those locations.

Setting the question about how big the datastack needs to be (people that do a lot of recursive descent may need more than some other people) ... if there isn't enough available space for the preferred size of data stack in the zero page, you may be able to get half of the zero-page benefit over a stack in main RAM by having half of the split stack in the zero page and half somewhere else (if the hardware stack page is not fully used, the other half of the stack could well be placed at $100).

One of the reason to not split the stack go away when the entire stack cannot fit into the zero page, since (D,x) addressing is no longer an option, so that moving values into zp vector locations is needed for RAM access operations in any event.

enso1 · Post by **enso1** » Mon Mar 31, 2025 3:04 pm

Thank you, those are some good examples. I am sold on not keeping TOS in A.

I think I mentioned up top the ability to use non-zero-page areas to widen the stack as needed -- that is definitely a big plus. Also, 8-bit operations don't need the extra inx/dex, which I suppose is not particularly important but somehow satisfying.

BruceRMcF, do you have a working implementation? It seems like you've worked out a lot of details.

BruceRMcF · Post by **BruceRMcF** » Mon Mar 31, 2025 6:03 pm

enso1 wrote:

Thank you, those are some good examples. I am sold on not keeping TOS in A.

I think I mentioned up top the ability to use non-zero-page areas to widen the stack as needed -- that is definitely a big plus. Also, 8-bit operations don't need the extra inx/dex, which I suppose is not particularly important but somehow satisfying.

BruceRMcF, do you have a working implementation? It seems like you've worked out a lot of details.

I had an almost working implementation ... the xForth16 that I was working on for the Commander X16 board.

Then I had the laptop were I was working on it die, and they had to transition their community web location because of site ownership and maintenance cost issues, and the online source also disappeared.

I might be able to find it using the wayback machine, but the second part of it is the "almost working" part ... my CREATE / DOES> was broken.

But in any event, with the news that the Game Console version of the system will feature a 65816 with access to a full 16MB in its address space, I've decided to try to port the system from its Camel Forth base again, working on a more straightforward implementation aiming for fewer bells and whistles, in order to have a working 65C02 Forth running on the 64KB memory map, before tackling one or more featured 65816 system(s), likely with more bells and whistles.

So this question came up just as I was hammering out the X-indexed vs S-indexed stacks, NEXT/ENTER/EXIT (modern meaning of "ENTER", not fig-Forth), etc., in the re-write.

What I have presently coded is TOS-on-stack, with a split byte stack data stack precisely because I want this version to play nice with the X16 BASIC, so I don't have 128bytes available for a 64 cell data stack on the zero page. If I had gone with an X-indexed Rack, I had enough space for a 32 cell return stack in zero page, but I ended up going with the Rack on the hardware stack.

So I am actually writing these for a mongrel split stack, with the DL,X stack in $40-$7F and the DH,X stack in $0100-$013F ... having "half of it in zero page" as much for the byte saving in the opcodes accessing the lower byte as for the clock saving.

Since it's not aiming to squeeze out every clock, it uses a zero page self-modifying "NEXT0: JMP ($0000)", where "IP" is:

IP = NEXT0+1

... and most of NEXT is among the code words in the dictionary ending with a "JMP NEXT0", so I don't have to worry about assembling the NEXT into the dictionary "as if" it was assembled in the zero page, to be copied into the zero page by the cold start, but rather just:

Code: Select all

SETUP_IP_RTN:
        LDA #$6C
        STA NEXT0
        STZ IP
        STZ IP+1
        RTS

... that is called as part of the warm start routine.

I also had separate name headers in my partly working system, looking ahead to moving name headers into the expansion memory banking, but for anything where this more minimal Forth doesn't have the room to support, there's the possibility of a full 64KB for a 16bit 65816 forth, or even a 128KB 16bit forth with 64KB dictionary space and 64KB dataspace ... so I've dropped that and am going to "hard code" the embedded headers in a way similar to the early fig-Forth sources.

As far as the TOS in register, my feeling is it more often helps with processors that have registers for the core Forth pointers "and one or more to spare". Camel Forth for the Z80 keeps it's TOS in one of the Z80 register pairs. So it really starts to be interesting in a 65816 subroutine threaded system, since then the hardware instruction register acts as the "IP" register, the return stack is the Rack, and with an X-indexed data stack, you have a spare 16bit Y register for indexed memory access and the 16bit A register (in 16bit data mode) can hold the entire TOS value.

Splitting Datastack in ZP

Splitting Datastack in ZP

Re: Splitting Datastack in ZP

Re: Splitting Datastack in ZP

Re: Splitting Datastack in ZP

Re: Splitting Datastack in ZP

Re: Splitting Datastack in ZP

Re: Splitting Datastack in ZP

Re: Splitting Datastack in ZP

Re: Splitting Datastack in ZP

Re: Splitting Datastack in ZP

Re: Splitting Datastack in ZP

Re: Splitting Datastack in ZP

Re: Splitting Datastack in ZP

Re: Splitting Datastack in ZP

Re: Splitting Datastack in ZP