Splitting Datastack in ZP
Splitting Datastack in ZP
Splitting the low and high bytes of the datastack is not a new idea. It has been mentioned in several threads here, but I haven't seen a thorough discussion, or any cons mentioned. This makes me wonder if I am missing something, because so few Forths do it.
Pros:
* DROP is a single inx instead of two;
* pushing data onto the datastack may be a little smaller;
* Single-byte operations are more efficient;
* 24-bit (or even 32-bit) operations are possible with no changes to existing code. You could even keep the rarely-used high bytes in non-zp memory.
Cons:
* You cannot address memory directly through TOS without copying H and L bytes together elsewhere first;
* You cannot jump through TOS, although that would be sketchy anyway.
Addressing memory is better as it looks as only one byte needs to be copied: you can put the L byte just below the H for the duration (and technically, adjust X, or be sketchy and careful if calling deeper or have interrupts) .
Am I missing something that makes it less interesting?
related threads:
viewtopic.php?f=9&t=1619
Pros:
* DROP is a single inx instead of two;
* pushing data onto the datastack may be a little smaller;
* Single-byte operations are more efficient;
* 24-bit (or even 32-bit) operations are possible with no changes to existing code. You could even keep the rarely-used high bytes in non-zp memory.
Cons:
* You cannot address memory directly through TOS without copying H and L bytes together elsewhere first;
* You cannot jump through TOS, although that would be sketchy anyway.
Addressing memory is better as it looks as only one byte needs to be copied: you can put the L byte just below the H for the duration (and technically, adjust X, or be sketchy and careful if calling deeper or have interrupts) .
Am I missing something that makes it less interesting?
related threads:
viewtopic.php?f=9&t=1619
Last edited by enso1 on Sun Mar 30, 2025 11:02 pm, edited 1 time in total.
Re: Splitting Datastack in ZP
I think you already nailed it with the first con. Forth does a LOT with addresses. Almost anything that is going to interface with hardware is going to be using @, C@, !, C!, etc. Strings are an address and a length (or just an address for older counted strings). Arrays and structures are almost always passed around by address.
It's also common to pass words to other words by address. Here's an example from when I was playing with FAT32 support. I'm only showing the stack comments and the comment block here. The "xt" in the stack comments is the "eXecution Token", which in Tali is the address of a routine to EXECUTE. Then I make a helper word named print-fileinfo that prints a single directory entry and leaves a flag saying it wants to go on to the next entry. I also make a helper word named findfile that checks a single directory entry against a given filename and leaves a flag saying to stop when it finds an exact match. These are then used along with walkdirectory like so (with ['] being a word that gets the execution token of the next word in the source and places it on the stack):
In other languages, this technique would likely be called "map" or something similar. C has pointers to functions for doing this. It's very powerful in Forth because all words can be passed by their xt, even functions like + (whereas in C you'd need to write a plus function, do plus inside, and then you can have a pointer to that function).
If you also have to interface with assembly routines you did not write, they will usually be expecting 16-bit values in consecutive memory locations.
It's also common to pass words to other words by address. Here's an example from when I was playing with FAT32 support. I'm only showing the stack comments and the comment block here. The "xt" in the stack comments is the "eXecution Token", which in Tali is the address of a routine to EXECUTE.
Code: Select all
: walkdirectory ( xt fileid -- addr | 0)
\ Run routine (given by xt) on every starting address of a
\ directory entry in directory described by fileid. The routine
\ should be ( direntry_addr -- continueflag ) where continueflag
\ is 1 if directory processing should continue. The return value
\ of walkdirectory will either be the address of the directory
\ entry it stopped on (in the sector buffer) or 0 if it reached
\ the end.
Code: Select all
: ls ( -- ) ( List the working directory)
cr ." FILENAME ATTRIB SIZE"
working_dir init_directory
['] print-fileinfo working_dir walkdirectory drop ;
and in my open-file I have (the part in the comments is what is currently on the stack)
( fileid ) working_dir init_directory
( fileid ) ['] findfile working_dir walkdirectory
( fileid direntry|0 )
?dup 0= if
cr ." Can't open file: " dup 0B type cr
1 ( ior of 1 = Can't open file ) exit
thenIf you also have to interface with assembly routines you did not write, they will usually be expecting 16-bit values in consecutive memory locations.
- barrym95838
- Posts: 2056
- Joined: 30 Jun 2013
- Location: Sacramento, CA, USA
Re: Splitting Datastack in ZP
chitselb has PETTIL and I think he does a fantastic job of managing a split stack with separate TOS in a very efficient manner, even with only NMOS instructions. I'm unsure what he's doing currently ... I hope he's okay.It's also DTC with a very fast NEXT and the headers stored elsewhere. This is tight and fast stuff right here.
Code: Select all
#if 0
name=-
stack=( n1 n2 -- difference )
tags=forth-83,nucleus
Subtract `n2` from `n1`
#endif
minus
jsr donegate
; fall through
;--------------------------------------------------------------
#if 0
name=+
stack=( n1 n2 -- sum )
registers=A:X:C:TOS=sum
tags=forth-83,nucleus,math,fig,forth-79,primitive
Return the sum of n1 plus n2
!!! pronounced: "plus"
#endif
plus
lda stackl,x
plus01 ; entry point from >BIT
clc
adc tos
sta tos
lda stackh,x
adc tos+1
sta tos+1
;fall through
;--------------------------------------------------------------
#if 0
name=NIP
stack=( n1 n2 -- n2 )
tags=nucleus,stack
Remove the second item from the stack
#endif
nip
inx
qdup01
jmp next
Last edited by barrym95838 on Thu Mar 27, 2025 8:55 pm, edited 3 times in total.
Got a kilobyte lying fallow in your 65xx's memory map? Sprinkle some VTL02C on it and see how it grows on you!
Mike B. (about me) (learning how to github)
Mike B. (about me) (learning how to github)
Re: Splitting Datastack in ZP
Yes, I can see that is may be a problem.
My thinking is that, in a loop manipulating memory, you have to dup the pointer, read a byte, increment a 16-bit pointer (unless using y, which involves its own overhead), read another byte, and maybe increment again. With a split stack, we would do a weird dup which shoves the low byte under the high first. The pointer looks like two stacked words, but we only use a half of each.
The extra load and store may be somewhat compensated by the many 8-bit operations.
Another possibility is keeping the low byte of TOS in A, which reduces pointer construction to a single sta (DSPHIGH-1,x). On x86 I generally keep TOS in a register; keeping half makes it more interesting, I suppose.
barrym, yes, I noticed PETTIL in another thread and need to take a look.
My thinking is that, in a loop manipulating memory, you have to dup the pointer, read a byte, increment a 16-bit pointer (unless using y, which involves its own overhead), read another byte, and maybe increment again. With a split stack, we would do a weird dup which shoves the low byte under the high first. The pointer looks like two stacked words, but we only use a half of each.
The extra load and store may be somewhat compensated by the many 8-bit operations.
Another possibility is keeping the low byte of TOS in A, which reduces pointer construction to a single sta (DSPHIGH-1,x). On x86 I generally keep TOS in a register; keeping half makes it more interesting, I suppose.
barrym, yes, I noticed PETTIL in another thread and need to take a look.
Re: Splitting Datastack in ZP
enso1 wrote:
* 24-bit (or even 32-bit) operations are possible with no changes to existing code. You could even keep the rarely-used high bytes in non-zp memory.
I would only split away the top two (or more) bytes that let each stack location hold a dword (or even a float!).
For me the most compelling advantage by far is read- and writeability of the code. Sam made that point for me just a few days ago (see below; emphasis added). I have wrestled with code like that, and trust me it was no fun.
-- Jeff
SamCoVT wrote:
FAT32 ended up with a lot of mixed double and single-cell values on the stack (difficult to keep track of)
In 1988 my 65C02 got six new registers and 44 new full-speed instructions!
https://laughtonelectronics.com/Arcana/ ... mmary.html
https://laughtonelectronics.com/Arcana/ ... mmary.html
Re: Splitting Datastack in ZP
Yes, good points. Especially with something like Tali, where there is a ton of handwritten assembler code -- I can see how it would hurt. For a more normal forth, the complexity would be hidden quickly in a handful of words that deal with indirection.
I am still chewing on the lower-half-of-TOS in A.
I am still chewing on the lower-half-of-TOS in A.
Re: Splitting Datastack in ZP
barrym95838 wrote:
chitselb has PETTIL and I think he does a fantastic job of managing a split stack with separate TOS in a very efficient manner, even with only NMOS instructions. I'm unsure what he's doing currently ... I hope he's okay.
The trade-offs are fairly close, so much of it would be taste ... and, if relying on an existing code-base as a guide, how they did it.
_________
enso1 wrote:
... With a split stack, we would do a weird dup which shoves the low byte under the high first. The pointer looks like two stacked words, but we only use a half of each. ...
Re: Splitting Datastack in ZP
Very tempting...
Since I seem to be writing a bunch of basic low-level crap for different machines, I am considering creating a ROM with basic IO using a datastack for parameter passing.
Dealing with the datastack for low-level code would make it more consistent, no worrying about which registers to pass data in. With basic IO done, I could easily attach an interpreter -- or a compiler to have a working Forth without having to write the annoying bindings becaise IO always uses the wrong registers (on a 6502, every register is usually the wrong one...) It would make low-level code slightly bigger, but not bigger than ad-hoc IO plus Forth bindings, if Forth is the ultimate goal.
Since I seem to be writing a bunch of basic low-level crap for different machines, I am considering creating a ROM with basic IO using a datastack for parameter passing.
Dealing with the datastack for low-level code would make it more consistent, no worrying about which registers to pass data in. With basic IO done, I could easily attach an interpreter -- or a compiler to have a working Forth without having to write the annoying bindings becaise IO always uses the wrong registers (on a 6502, every register is usually the wrong one...) It would make low-level code slightly bigger, but not bigger than ad-hoc IO plus Forth bindings, if Forth is the ultimate goal.
Last edited by enso1 on Fri Mar 28, 2025 4:12 pm, edited 1 time in total.
Re: Splitting Datastack in ZP
PETTIL is very curious indeed. I am trying to figure out the details. Split stack but not caching TOS in A, I think.
I am more and more convinced that splitting the stack_and_ TOS in A is the recipe for some minor gains (~4 cycles many words I looked at), meaningful in a subroutine-threaded inlining compiler.
I am more and more convinced that splitting the stack_and_ TOS in A is the recipe for some minor gains (~4 cycles many words I looked at), meaningful in a subroutine-threaded inlining compiler.
Re: Splitting Datastack in ZP
enso1 wrote:
PETTIL is very curious indeed. I am trying to figure out the details. Split stack but not caching TOS in A, I think.
I am more and more convinced that splitting the stack_and_ TOS in A is the recipe for some minor gains (~4 cycles many words I looked at), meaningful in a subroutine-threaded inlining compiler.
I am more and more convinced that splitting the stack_and_ TOS in A is the recipe for some minor gains (~4 cycles many words I looked at), meaningful in a subroutine-threaded inlining compiler.
One advantage of the rack in the hardware stack and data stack on the zero page is you can have bit threaded, direct threaded and subroutine threaded with mostly the same primitives, but if a direct threaded ENTER uses the "JSR ENTER" trick to extract the address of the caller, that ENTER is faster when the hardware stack is the data stack (with separate TOS) and the rack is in zero page.
; +
PLA : CLC : ADC T0 : STA T0
PLA : ADC T1 : STA T1
JMP NEXT
; C@ 65C02
LDA (T0) : STA T0 : STZ T1 : JMP NEXT
; @ 65C02
LDY #1
LDA (T0),Y : TAY : LDA (T0) : STA T0 : STY T1
JMP NEXT
vs
; +
LDA DL,X : CLC : ADC T0 : STA T0
LDA DH,X : ADC T1 : STA T1
INX
JMP NEXT
; C@ 65C02
LDA (T0) : STA T0 : STZ T1 : JMP NEXT
; @ 65C02
LDY #1
LDA (T0),Y : TAY : LDA (T0) : STA T0 : STY T1
JMP NEXT
Re: Splitting Datastack in ZP
Looking at +,
This is fairly typical: we save two zero-page accesses, but lose with TAY/TYA, net 4 cycle win...
Actually, using Y instead of pushing
Code: Select all
; Split stack DH, DL, TOS in A
add clc
adc DL+1,x ; low byte
tay
lda DH+0,x ; high byte
adc DH+1,x
sta DH+1,x
inx
tya
rts
; Split stack, TOS not cached
clc
lda DL+0,x
adc DL+1,x
sta DL+1,x
lda DH+0,x
adc DH+1,x
sta DH+1,x
inx
rts
Actually, using Y instead of pushing
Re: Splitting Datastack in ZP
enso1 wrote:
Looking at +,
...
Code: Select all
; Split stack DH, DL, TOS in A
add
clc
adc DL+1,x ; low byte
tay
lda DH+0,x ; high byte
adc DH+1,x
sta DH+1,x
inx
tya
rts [ 24 clock cycles + NEXT]
; Split stack, TOS not cached
clc
lda DL+0,x
adc DL+1,x
sta DL+1,x
lda DH+0,x
adc DH+1,x
sta DH+1,x
inx
rts ;[ 28 clock cycles + NEXT]
Code: Select all
clc
lda T0
adc DL+1,x
sta T0
lda T1
adc DH+1,x
sta T1
inx
rts ;[ 24 cycles + NEXT]
Two clock cycles can be gained by reserving the TOS vector for @/! operations and using the high byte of that for the high byte rather than the stack:
Code: Select all
; Split stack DH, DL, TOS in A & "T+1" in zp.
add
clc
adc DL+0,x ; low byte
tay
lda T+1 ; high byte
adc DH+0,x
sta T+1
inx
tya
rts ;[ 22 clock cycles + NEXT]
cfetch ; C@ {65C02} cheaper
sta T
lda (T)
stz T+1
rts ;[ 11 clock cycles + NEXT]
dup ; dup cheaper by 1 clock
dex
sta DL,X
ldy T+1
sty DH,X
rts ;[ 13 clock cycles + NEXT]
drop ; nip cheaper but drop more expensive by 7 clocks
lda DL+1,X
ldy DH+1,X
sty T+1
inx
rts ;[ 13 clock cycles + NEXT]
Re: Splitting Datastack in ZP
I will note another pragmatic reason for picking a split datastack ... what if your Forth system is more comfortable with an x-indexed data stack, but you want a bigger stack than there is room for in your zero page?
Just to give a sample motivation, suppose a board is using a Basic for it's operating system which uses the top half of the zero page, and you wish to avoid trampling those locations.
Setting the question about how big the datastack needs to be (people that do a lot of recursive descent may need more than some other people) ... if there isn't enough available space for the preferred size of data stack in the zero page, you may be able to get half of the zero-page benefit over a stack in main RAM by having half of the split stack in the zero page and half somewhere else (if the hardware stack page is not fully used, the other half of the stack could well be placed at $100).
One of the reason to not split the stack go away when the entire stack cannot fit into the zero page, since (D,x) addressing is no longer an option, so that moving values into zp vector locations is needed for RAM access operations in any event.
Just to give a sample motivation, suppose a board is using a Basic for it's operating system which uses the top half of the zero page, and you wish to avoid trampling those locations.
Setting the question about how big the datastack needs to be (people that do a lot of recursive descent may need more than some other people) ... if there isn't enough available space for the preferred size of data stack in the zero page, you may be able to get half of the zero-page benefit over a stack in main RAM by having half of the split stack in the zero page and half somewhere else (if the hardware stack page is not fully used, the other half of the stack could well be placed at $100).
One of the reason to not split the stack go away when the entire stack cannot fit into the zero page, since (D,x) addressing is no longer an option, so that moving values into zp vector locations is needed for RAM access operations in any event.
Re: Splitting Datastack in ZP
Thank you, those are some good examples. I am sold on not keeping TOS in A.
I think I mentioned up top the ability to use non-zero-page areas to widen the stack as needed -- that is definitely a big plus. Also, 8-bit operations don't need the extra inx/dex, which I suppose is not particularly important but somehow satisfying.
BruceRMcF, do you have a working implementation? It seems like you've worked out a lot of details.
I think I mentioned up top the ability to use non-zero-page areas to widen the stack as needed -- that is definitely a big plus. Also, 8-bit operations don't need the extra inx/dex, which I suppose is not particularly important but somehow satisfying.
BruceRMcF, do you have a working implementation? It seems like you've worked out a lot of details.
Re: Splitting Datastack in ZP
enso1 wrote:
Thank you, those are some good examples. I am sold on not keeping TOS in A.
I think I mentioned up top the ability to use non-zero-page areas to widen the stack as needed -- that is definitely a big plus. Also, 8-bit operations don't need the extra inx/dex, which I suppose is not particularly important but somehow satisfying.
BruceRMcF, do you have a working implementation? It seems like you've worked out a lot of details.
I think I mentioned up top the ability to use non-zero-page areas to widen the stack as needed -- that is definitely a big plus. Also, 8-bit operations don't need the extra inx/dex, which I suppose is not particularly important but somehow satisfying.
BruceRMcF, do you have a working implementation? It seems like you've worked out a lot of details.
Then I had the laptop were I was working on it die, and they had to transition their community web location because of site ownership and maintenance cost issues, and the online source also disappeared.
I might be able to find it using the wayback machine, but the second part of it is the "almost working" part ... my CREATE / DOES> was broken.
But in any event, with the news that the Game Console version of the system will feature a 65816 with access to a full 16MB in its address space, I've decided to try to port the system from its Camel Forth base again, working on a more straightforward implementation aiming for fewer bells and whistles, in order to have a working 65C02 Forth running on the 64KB memory map, before tackling one or more featured 65816 system(s), likely with more bells and whistles.
So this question came up just as I was hammering out the X-indexed vs S-indexed stacks, NEXT/ENTER/EXIT (modern meaning of "ENTER", not fig-Forth), etc., in the re-write.
What I have presently coded is TOS-on-stack, with a split byte stack data stack precisely because I want this version to play nice with the X16 BASIC, so I don't have 128bytes available for a 64 cell data stack on the zero page. If I had gone with an X-indexed Rack, I had enough space for a 32 cell return stack in zero page, but I ended up going with the Rack on the hardware stack.
So I am actually writing these for a mongrel split stack, with the DL,X stack in $40-$7F and the DH,X stack in $0100-$013F ... having "half of it in zero page" as much for the byte saving in the opcodes accessing the lower byte as for the clock saving.
Since it's not aiming to squeeze out every clock, it uses a zero page self-modifying "NEXT0: JMP ($0000)", where "IP" is:
IP = NEXT0+1
... and most of NEXT is among the code words in the dictionary ending with a "JMP NEXT0", so I don't have to worry about assembling the NEXT into the dictionary "as if" it was assembled in the zero page, to be copied into the zero page by the cold start, but rather just:
Code: Select all
SETUP_IP_RTN:
LDA #$6C
STA NEXT0
STZ IP
STZ IP+1
RTS
I also had separate name headers in my partly working system, looking ahead to moving name headers into the expansion memory banking, but for anything where this more minimal Forth doesn't have the room to support, there's the possibility of a full 64KB for a 16bit 65816 forth, or even a 128KB 16bit forth with 64KB dictionary space and 64KB dataspace ... so I've dropped that and am going to "hard code" the embedded headers in a way similar to the early fig-Forth sources.
As far as the TOS in register, my feeling is it more often helps with processors that have registers for the core Forth pointers "and one or more to spare". Camel Forth for the Z80 keeps it's TOS in one of the Z80 register pairs. So it really starts to be interesting in a 65816 subroutine threaded system, since then the hardware instruction register acts as the "IP" register, the return stack is the Rack, and with an X-indexed data stack, you have a spare 16bit Y register for indexed memory access and the 16bit A register (in 16bit data mode) can hold the entire TOS value.