Page 1 of 3
Reentrant Programming: Small versus Big Software Stacks
Posted: Sun Jan 21, 2018 3:25 pm
by Martin_H
I'm noodling around with some ideas around stacks and parameter passing to allow reentrant programming. So I built two separate stack modules to see how many instructions are required for basic operations. I tried to make the interfaces similar enough that I could switch between them and judge their relative merits.
The first is the X register stack I've seen in several Forths (including Tali Forth):
https://github.com/Martin-H1/Lisp65/blo ... /stack.asm
The second is a 16 bit stack pointer in page zero that grows downward from ram top:
https://github.com/Martin-H1/Lisp65/blo ... gstack.asm
It's pretty clear that the Forth style stack is leaner, faster, and more suited for fast light calls. For example I could see it used as the calling standard for a math library, a heap, or I/O. The problem is that it doesn't support that much depth, consumes a register, and half of page zero. That makes interfacing it with reused code a challenge at times.
The second stack seems way too big and slow for operations like push, drop, swap. But on the plus side supports large allocations, stack depth, and a light register and page zero footprint. That allows interfacing with reused code.
Not sure where I am going with this, other than I am curious if anyone has better ideas?
Re: Reentrant Programming: Small versus Big Software Stacks
Posted: Sun Jan 21, 2018 7:06 pm
by Alarm Siren
To save him some typing, I will first point you to Garth's excellent
Stacks Treatise.
As to your own points, yea, I feel there's no single answer. Different solutions are better for different situations, it all depends how you intend to use them.
There is however a third method you may not have considered, where you can use the Actual Stack for parameter passing - even on the 6502. You just have to use the TSX / TXS instructions and the a, x addressing mode with a base offset.
E.g. see this
macro library I wrote. This is basically very similar to your first option, but without permanently tying up the X register - it can still be used for other purposes when you're not accessing the stack.
Re: Reentrant Programming: Small versus Big Software Stacks
Posted: Sun Jan 21, 2018 11:21 pm
by Druzyek
If you want to stick with a 6502 instead of a 65816, you might consider adding extra hardware to do the job. For example, you could map half of zero page to an unused section of RAM and hook up an up/down counter to drive address lines when that half of zero page is accessed. That way you could pulse the counter and have a new half zero page to work with for every function. You can pass parameters on the stack or just copy them to directly where you will need them in the mapped zero page if you aren't copying from another mapped zero page. I suppose you might also be able to set up a part of the memory map to point to the current half zero page plus one so you can copy from the current page to the next one directly saving cycles at the expense of more hardware.
This is all overly complicated but I think it will be faster than just about anything else you could do on the 6502.
EDIT: Alarm Siren, neat macros! Do you need to index by X for the STA in PTA?
Re: Reentrant Programming: Small versus Big Software Stacks
Posted: Mon Jan 22, 2018 11:46 am
by BitWise
You could use a hybrid technique using the CPU stack for arguments and locals but copying/restoring from RAM as it gets too full/empty.
I think BBC BASIC uses this approach when executing deeply nested procedures. Top active procedures are on the CPU stack for fast access while the stack image for parent precedures is copied to a software stack that grows down from below the screen memory (HIMEM).
Re: Reentrant Programming: Small versus Big Software Stacks
Posted: Mon Jan 22, 2018 6:01 pm
by Martin_H
@Druzyek, thanks for the reply, my project is a rewrite of a small Lisp interpreter from C to 6502 assembly, while making it generally reusable. So new hardware isn't something I want to consider. I tried using cc65 for the initial port, but between the cc65 overheard and lisp overheard, there was nothing left for programs.
@Alarm Siren, thanks for the links, I read through them and there was lots of useful information.
@BitWise, I think I will probably end up using a hybrid technique. The low level program machinery (e.g. I/O, memory management, and math) can use a small stack, as there's no recursion, few arguments, and call depths are shallow. It's parsing and evaluation where deep recursion can come into play. So I think a large stack specifically for those functions will be needed.
Re: Reentrant Programming: Small versus Big Software Stacks
Posted: Mon Jan 22, 2018 8:46 pm
by Alarm Siren
EDIT: Alarm Siren, neat macros!
Thanks
Do you need to index by X for the STA in PTA?
Yep. and PTX, PTY and PTZ. Looks like I forgot to put it on that whole class of macro
Apart from checking they compile, I've not had anything complex enough yet to test them in. Still being mucking around trying to get my UART working.
I'll update the file now.
EDIT: Updated.
Re: Reentrant Programming: Small versus Big Software Stacks
Posted: Tue Jan 23, 2018 12:58 am
by Druzyek
I think BBC BASIC uses this approach when executing deeply nested procedures. Top active procedures are on the CPU stack for fast access while the stack image for parent precedures is copied to a software stack that grows down from below the screen memory (HIMEM).
Do you know if every child procedure copies or if they had some sort of system for comparing stack space needed to stack space left before copying?
Re: Reentrant Programming: Small versus Big Software Stacks
Posted: Tue Jan 23, 2018 2:49 am
by GARTHWILSON
[Edit: I think this was in response to something in the macro library Alarm Siren linked to. My comment seems pretty out of place without having said this before.]
In my macros, I like to use prepositions to clarify; so you have for example
Code: Select all
COPY VAR1, to, VAR2 ; Copy a single byte.
COPY2 VAR3, to, VAR4 ; Copy a pair of bytes.
COPY VAR1, from, FOOBAR ; Copy the other direction.
There's no doubt about the direction, ie, which one is the source and which is the destination, and it's more English-like, and it takes no extra bytes or execution time in the final machine code. "to" and "from" are EQUates that the macro uses to conditionally assemble the copying to be one direction or the other. Such parameters can also be put in the parameter list to just improve readability, without actually using them in the macro.
Re: Reentrant Programming: Small versus Big Software Stacks
Posted: Tue Jan 23, 2018 9:47 am
by BitWise
I think BBC BASIC uses this approach when executing deeply nested procedures. Top active procedures are on the CPU stack for fast access while the stack image for parent precedures is copied to a software stack that grows down from below the screen memory (HIMEM).
Do you know if every child procedure copies or if they had some sort of system for comparing stack space needed to stack space left before copying?
I can't remember what the criteria is.
The disassembly here
https://acorn.huininga.nl/pub/docs/sour ... BASIC2.asm
has the stack push at $B1A7 and the pull at $B236.
Re: Reentrant Programming: Small versus Big Software Stacks
Posted: Tue Jan 23, 2018 8:41 pm
by Druzyek
Thanks. I was just wondering myself about the best way to do it. I have been thinking a slightly more sophisticated macro system could figure out automatically how much stack space your local variables need and copy that many bytes off the stack for you. I did something similar to manage zero page memory by pushing enough zero page bytes to hold function arguments and locals. It was not very efficient since it reuses the same memory over and over even if there is more zero page available.
Re: Reentrant Programming: Small versus Big Software Stacks
Posted: Thu Jan 25, 2018 9:14 pm
by Druzyek
E.g. see this
macro library I wrote. This is basically very similar to your first option, but without permanently tying up the X register - it can still be used for other purposes when you're not accessing the stack.
I was thinking about this. Wouldn't this fail if you needed to use the stack for anything? For example, you have this for PKA:
Code: Select all
PKA MACRO mOffset
TSX
LDA HWSTACK_BASE+mOffset+1, X
ENDM
Wouldn't this stop working if you did something like this:
Code: Select all
PKA X1
ADC #5
PHA
PKA X2 ;Wrong address!
I think the secret would be only doing the TSX once at the beginning, but then you have to keep track of it yourself. What do you think?
Re: Reentrant Programming: Small versus Big Software Stacks
Posted: Thu Jan 25, 2018 10:23 pm
by GARTHWILSON
You can save X on the stack too, but then take that into account when formulating the number to add the index to. It not always necessary to re-do the TSX just because you pushed or pulled something. (This is all covered in the 6502 stacks treatise.) The macro will probably need to be modified.
Re: Reentrant Programming: Small versus Big Software Stacks
Posted: Thu Jan 25, 2018 10:29 pm
by GARTHWILSON
It's pretty clear that the Forth style stack is leaner, faster, and more suited for fast light calls. For example I could see it used as the calling standard for a math library, a heap, or I/O. The problem is that it doesn't support that much depth, consumes a register, and half of page zero. That makes interfacing it with reused code a challenge at times.
As stated in the 6502 stacks treatise, the most stack-intensive Forth application I could come up with, including servicing interrupts in high-level Forth, and also having the RTC on NMI interrupts, did not take more than 20% of either page 1 or page 0. Regarding the space it does take, keep in mind that having the ZP data stack reduces the need for other variables and pointers that are normally kept in page 0; so the penalty is little or none. You may even come out ahead.
Re: Reentrant Programming: Small versus Big Software Stacks
Posted: Fri Jan 26, 2018 12:15 am
by Alarm Siren
E.g. see this
macro library I wrote. This is basically very similar to your first option, but without permanently tying up the X register - it can still be used for other purposes when you're not accessing the stack.
I was thinking about this. Wouldn't this fail if you needed to use the stack for anything? For example, you have this for PKA:
Code: Select all
PKA MACRO mOffset
TSX
LDA HWSTACK_BASE+mOffset+1, X
ENDM
Wouldn't this stop working if you did something like this:
Code: Select all
PKA X1
ADC #5
PHA
PKA X2 ;Wrong address!
I think the secret would be only doing the TSX once at the beginning, but then you have to keep track of it yourself. What do you think?
PKA is defined as copying into A the value at Offset from the top of the stack. If you've pushed something onto the stack, you need to adjust your offset to compensate. To avoid these difficulties, I envisage programs using the PHN macro to reserve an amount of space on the stack for locals. So, you'd get something like this:
Code: Select all
PHN LocalSize
PKA LocalSize+X1
ADC #5
PTA LocalSize-3 ; Lets assume LocalSize is big enough to allow this
PKA LocalSize+X2 ; Now it's correct!
I have considered modifying the macros, or adding a new set, that allows you to optionally skip the TSX if you know that you're not overwriting X in the mean-time thus improving performance a little - which would also permit your suggestion that allows you to still use the normal push/pull without clobbering the offsets since X isn't being constantly reloaded, but I originally decided that overcomplicated what I was trying to achieve. Also, the point of the macros is that you're not permanently turning X into an extra stack pointer, so it is expected that X may get used for other purposes in between stack accesses, so logically X will need to be reloaded.
A thought I've just had, would be modify the ENTER and LEAVE macros thusly:
Code: Select all
ENTER MACRO mLocalSize
PHA
PHX
PHY
PHN mLocalSize
ENDM
LEAVE MACRO mLocalSize
PLN mLocalSize
PLY
PLX
PLA
ENDM
My intent with ENTER/LEAVE was to simplify the process of entering and exiting from an Interrupt context, but with this addition it can also be used in ordinary function calls to preserve register state and reserve space for locals on the stack. The caller function's register values (and stack values, for that matter) can still be accessed with an appropriate PKA/PTA offset.
Re: Reentrant Programming: Small versus Big Software Stacks
Posted: Fri Jan 26, 2018 4:40 am
by Druzyek
You can save X on the stack too, but then take that into account when formulating the number to add the index to.
Would you do that by having the macro assign a temporary value that is added to the hard coded base address and replace all push and pull arguments with macros that increase or decrease that value? It seems like it would be tough to keep adding and subtracting to the base yourself, especially if you go back and add a push or pull and have to redo the following ones.
Also, the point of the macros is that you're not permanently turning X into an extra stack pointer, so it is expected that X may get used for other purposes in between stack accesses, so logically X will need to be reloaded.
That makes sense. If it helps you, I think it is good that you have them, even if you sacrifice a little speed for ease of use.
I was curious about maximizing performance, so I tested out a few different ways on a simple function. When A='a', X=3, and Y=4 the function prints "aaabbbcccddd." A returns with the count of characters printed. Here are the cycle counts for the ways I tried, even though I would not say they are generalizable in any way. I imagine doing some of this with macros but they are expanded so you can see what I'm thinking. Could I change anything to make it faster?
I think doing everything in zero page could be the fastest. The fourth way without pushing anything only works for top level functions. I don't think it is possible with macros but a more sophisticated program could automatically rearrange zero page usage so as few bytes as possible have to be pushed.
1. Hardware stack: 322
2. Pushing zero page with loop: 324
3. Pushing zero page with unrolled loop: 277
4. Zero page with no pushing: 229
5. Zero page stack: 323
Code: Select all
LDA #'A' ;Character to start printing with
LDX #3 ;How many times to repeat each character
LDY #4 ;How many sets of characters
JSR func
Output = AAABBBCCCDDD
1. Hardware stack
Code: Select all
func:
PHA ;Copy of character
PHX ;Copy of number to print (3)
TSX ;Make room for one byte to hold return value (characters printed)
;Alarm Siren's macro chooses faster of two subtraction methods
;TXA
;SEC
;SBC #1
DEX
;TAX
TXS
;Save copy of calculated stack offset (replaces TXS)
STX zp_sp_copy
;Zero one byte variable which returns total characters printed
STZ HWSTACK_BASE+0+1,X
loop1:
;Load value passed in through X (number of characters to print = 3)
LDA HWSTACK_BASE+1+1,X
;Save a copy to load into X counter
PHA
;Load the character to print
LDA HWSTACK_BASE+2+1,X
;Number of characters to print = 3
PLX
loop2:
putc
DEX
BNE loop2
;Reload X with calculated stack offset instead of TXS
LDX zp_sp_copy
;Count of characters printed
LDA HWSTACK_BASE+0+1,X
CLC
;Add 3 to total
ADC HWSTACK_BASE+1+1,X
STA HWSTACK_BASE+0+1,X
;Advance to the next character
INC HWSTACK_BASE+2+1,X
;Print five sets
DEY
BNE loop1
;Get character count before X and SP are changed
LDY HWSTACK_BASE+0+1,X
;Empty stack (character to print, characters to print, and total printed)
TSX
TXA
CLC
ADC #3
TAX
TXS
;Character count to A
TYA
RTS
2, 3, 4. Zero page
Code: Select all
;**********2. Pushing zero page with loop**********
func:
;Copy of A (character to print)
STA zp_temp
;Copy of X (characters to print = 3)
STX zp_temp2
LDX #0
;Push zero page bytes so we can use them
loop3:
LDA R0,X
PHA
INX
CPX #3
BNE loop3
;Store A (character to print) in R2
LDA zp_temp
STA R2
;X = characters to print = 3
LDA zp_temp2
STA R1
;Count of total characters printed
STZ R0
;**********3. Pushing zero page with unrolled loop**********
func:
STA zp_temp
LDA R0
PHA
LDA R1
PHA
LDA R2
PHA
LDA zp_temp
STA R2
STX R1
STZ R0
;**********4. No pushing**********
func:
STA R2
STX R1
STZ R0
;**********Body for 2, 3, and 4**********
loop4:
;X = characters to print = 3
LDX R1
;A = character to print
LDA R2
loop5:
putc
DEX
BNE loop5
;Add characters printed (3) to total in R0
LDA R0
CLC
ADC R1
STA R0
;Advance to the next charcter
INC R2
DEY
BNE loop4
;**********Ending for 2**********
LDX #2
;Restore the three bytes we used
loop6:
PLA
STA R0,X
DEX
BPL loop6
;Character count is return value
LDA R0
RTS
;**********Ending for 3**********
PLA
STA R2
PLA
STA R1
PLA
STA R0
;Not needed but macros would not optimize it out
LDA R0
RTS
;**********Ending for 4**********
LDA R0
RTS
5. Zero page stack
Code: Select all
func5:
;Character to print
PHA
;Pointer into zero page stack
LDA zp_ptr
;Increase by 3 for A copy, X copy, and character count
CLC
ADC #3
STA zp_ptr
;Push copy of characters to print (3)
PHX
;X points to zero page stack
TAX
;Characters to print (3)
PLA
STA 1,X
;Character to print
PLA
STA 2,X
;Zero count of total characters printed
STZ 0,X
loop34:
;Count of characters to print (3)
LDA 1,X
;Necessary since we cant do LDX 1,X
PHA
;Character to print
LDA 2,X
;X is now 3, count of characters to print
PLX
loop35:
putc
DEX
BNE loop35
;Pointer to stack in zero page
LDX zp_ptr
;Count of total characters printed
LDA 0,X
CLC
;Add three to total (printed three characters
ADC 1,X
STA 0,X
;Advance to next character
INC 2,X
DEY
BNE loop34
;Free up the three bytes we took on the stack
LDA zp_ptr
SEC
SBC #3
STA zp_ptr
;Returns total characters printed in A
LDA R0
RTS