Reentrant Programming: Small versus Big Software Stacks

Martin_H · Post by **Martin_H** » Sun Jan 21, 2018 3:25 pm

I'm noodling around with some ideas around stacks and parameter passing to allow reentrant programming. So I built two separate stack modules to see how many instructions are required for basic operations. I tried to make the interfaces similar enough that I could switch between them and judge their relative merits.

The first is the X register stack I've seen in several Forths (including Tali Forth): https://github.com/Martin-H1/Lisp65/blo ... /stack.asm

The second is a 16 bit stack pointer in page zero that grows downward from ram top: https://github.com/Martin-H1/Lisp65/blo ... gstack.asm

It's pretty clear that the Forth style stack is leaner, faster, and more suited for fast light calls. For example I could see it used as the calling standard for a math library, a heap, or I/O. The problem is that it doesn't support that much depth, consumes a register, and half of page zero. That makes interfacing it with reused code a challenge at times.

The second stack seems way too big and slow for operations like push, drop, swap. But on the plus side supports large allocations, stack depth, and a light register and page zero footprint. That allows interfacing with reused code.

Not sure where I am going with this, other than I am curious if anyone has better ideas?

Alarm Siren · Post by **Alarm Siren** » Sun Jan 21, 2018 7:06 pm

To save him some typing, I will first point you to Garth's excellent Stacks Treatise.

As to your own points, yea, I feel there's no single answer. Different solutions are better for different situations, it all depends how you intend to use them.

There is however a third method you may not have considered, where you can use the Actual Stack for parameter passing - even on the 6502. You just have to use the TSX / TXS instructions and the a, x addressing mode with a base offset.

E.g. see this macro library I wrote. This is basically very similar to your first option, but without permanently tying up the X register - it can still be used for other purposes when you're not accessing the stack.

Druzyek · Post by **Druzyek** » Sun Jan 21, 2018 11:21 pm

If you want to stick with a 6502 instead of a 65816, you might consider adding extra hardware to do the job. For example, you could map half of zero page to an unused section of RAM and hook up an up/down counter to drive address lines when that half of zero page is accessed. That way you could pulse the counter and have a new half zero page to work with for every function. You can pass parameters on the stack or just copy them to directly where you will need them in the mapped zero page if you aren't copying from another mapped zero page. I suppose you might also be able to set up a part of the memory map to point to the current half zero page plus one so you can copy from the current page to the next one directly saving cycles at the expense of more hardware.

This is all overly complicated but I think it will be faster than just about anything else you could do on the 6502.

EDIT: Alarm Siren, neat macros! Do you need to index by X for the STA in PTA?

BitWise · Post by **BitWise** » Mon Jan 22, 2018 11:46 am

You could use a hybrid technique using the CPU stack for arguments and locals but copying/restoring from RAM as it gets too full/empty.

I think BBC BASIC uses this approach when executing deeply nested procedures. Top active procedures are on the CPU stack for fast access while the stack image for parent precedures is copied to a software stack that grows down from below the screen memory (HIMEM).

Martin_H · Post by **Martin_H** » Mon Jan 22, 2018 6:01 pm

@Druzyek, thanks for the reply, my project is a rewrite of a small Lisp interpreter from C to 6502 assembly, while making it generally reusable. So new hardware isn't something I want to consider. I tried using cc65 for the initial port, but between the cc65 overheard and lisp overheard, there was nothing left for programs.

@Alarm Siren, thanks for the links, I read through them and there was lots of useful information.

@BitWise, I think I will probably end up using a hybrid technique. The low level program machinery (e.g. I/O, memory management, and math) can use a small stack, as there's no recursion, few arguments, and call depths are shallow. It's parsing and evaluation where deep recursion can come into play. So I think a large stack specifically for those functions will be needed.

Alarm Siren · Post by **Alarm Siren** » Mon Jan 22, 2018 8:46 pm

Druzyek wrote:

EDIT: Alarm Siren, neat macros!

Thanks

Druzyek wrote:

Do you need to index by X for the STA in PTA?

Yep. and PTX, PTY and PTZ. Looks like I forgot to put it on that whole class of macro

Apart from checking they compile, I've not had anything complex enough yet to test them in. Still being mucking around trying to get my UART working.
I'll update the file now.

EDIT: Updated.

Druzyek · Post by **Druzyek** » Tue Jan 23, 2018 12:58 am

BitWise wrote:

I think BBC BASIC uses this approach when executing deeply nested procedures. Top active procedures are on the CPU stack for fast access while the stack image for parent precedures is copied to a software stack that grows down from below the screen memory (HIMEM).

Do you know if every child procedure copies or if they had some sort of system for comparing stack space needed to stack space left before copying?

GARTHWILSON · Post by **GARTHWILSON** » Tue Jan 23, 2018 2:49 am

[Edit: I think this was in response to something in the macro library Alarm Siren linked to. My comment seems pretty out of place without having said this before.]

In my macros, I like to use prepositions to clarify; so you have for example

Code: Select all

        COPY   VAR1, to, VAR2       ; Copy a single byte.
        COPY2  VAR3, to, VAR4       ; Copy a pair of bytes.
        COPY   VAR1, from, FOOBAR   ; Copy the other direction.

There's no doubt about the direction, ie, which one is the source and which is the destination, and it's more English-like, and it takes no extra bytes or execution time in the final machine code. "to" and "from" are EQUates that the macro uses to conditionally assemble the copying to be one direction or the other. Such parameters can also be put in the parameter list to just improve readability, without actually using them in the macro.

BitWise · Post by **BitWise** » Tue Jan 23, 2018 9:47 am

Druzyek wrote:

BitWise wrote:

I think BBC BASIC uses this approach when executing deeply nested procedures. Top active procedures are on the CPU stack for fast access while the stack image for parent precedures is copied to a software stack that grows down from below the screen memory (HIMEM).

Do you know if every child procedure copies or if they had some sort of system for comparing stack space needed to stack space left before copying?

I can't remember what the criteria is.

The disassembly here https://acorn.huininga.nl/pub/docs/sour ... BASIC2.asm
has the stack push at $B1A7 and the pull at $B236.

Druzyek · Post by **Druzyek** » Tue Jan 23, 2018 8:41 pm

Quote:

The disassembly here https://acorn.huininga.nl/pub/docs/sour ... BASIC2.asm
has the stack push at $B1A7 and the pull at $B236.

Thanks. I was just wondering myself about the best way to do it. I have been thinking a slightly more sophisticated macro system could figure out automatically how much stack space your local variables need and copy that many bytes off the stack for you. I did something similar to manage zero page memory by pushing enough zero page bytes to hold function arguments and locals. It was not very efficient since it reuses the same memory over and over even if there is more zero page available.

Druzyek · Post by **Druzyek** » Thu Jan 25, 2018 9:14 pm

Quote:

E.g. see this macro library I wrote. This is basically very similar to your first option, but without permanently tying up the X register - it can still be used for other purposes when you're not accessing the stack.

I was thinking about this. Wouldn't this fail if you needed to use the stack for anything? For example, you have this for PKA:

Code: Select all

  PKA         MACRO   mOffset
    TSX
    LDA     HWSTACK_BASE+mOffset+1, X
  ENDM

Wouldn't this stop working if you did something like this:

Code: Select all

PKA X1
ADC #5
PHA
PKA X2 ;Wrong address!

I think the secret would be only doing the TSX once at the beginning, but then you have to keep track of it yourself. What do you think?

GARTHWILSON · Post by **GARTHWILSON** » Thu Jan 25, 2018 10:23 pm

You can save X on the stack too, but then take that into account when formulating the number to add the index to. It not always necessary to re-do the TSX just because you pushed or pulled something. (This is all covered in the 6502 stacks treatise.) The macro will probably need to be modified.

GARTHWILSON · Post by **GARTHWILSON** » Thu Jan 25, 2018 10:29 pm

Martin_H wrote:

It's pretty clear that the Forth style stack is leaner, faster, and more suited for fast light calls. For example I could see it used as the calling standard for a math library, a heap, or I/O. The problem is that it doesn't support that much depth, consumes a register, and half of page zero. That makes interfacing it with reused code a challenge at times.

As stated in the 6502 stacks treatise, the most stack-intensive Forth application I could come up with, including servicing interrupts in high-level Forth, and also having the RTC on NMI interrupts, did not take more than 20% of either page 1 or page 0. Regarding the space it does take, keep in mind that having the ZP data stack reduces the need for other variables and pointers that are normally kept in page 0; so the penalty is little or none. You may even come out ahead.

Alarm Siren · Post by **Alarm Siren** » Fri Jan 26, 2018 12:15 am

Druzyek wrote:

Quote:

E.g. see this macro library I wrote. This is basically very similar to your first option, but without permanently tying up the X register - it can still be used for other purposes when you're not accessing the stack.

I was thinking about this. Wouldn't this fail if you needed to use the stack for anything? For example, you have this for PKA:

Code: Select all

  PKA         MACRO   mOffset
    TSX
    LDA     HWSTACK_BASE+mOffset+1, X
  ENDM

Wouldn't this stop working if you did something like this:

Code: Select all

PKA X1
ADC #5
PHA
PKA X2 ;Wrong address!

I think the secret would be only doing the TSX once at the beginning, but then you have to keep track of it yourself. What do you think?

PKA is defined as copying into A the value at Offset from the top of the stack. If you've pushed something onto the stack, you need to adjust your offset to compensate. To avoid these difficulties, I envisage programs using the PHN macro to reserve an amount of space on the stack for locals. So, you'd get something like this:

Code: Select all

PHN LocalSize
PKA LocalSize+X1
ADC #5
PTA LocalSize-3  ; Lets assume LocalSize is big enough to allow this
PKA LocalSize+X2 ; Now it's correct!

I have considered modifying the macros, or adding a new set, that allows you to optionally skip the TSX if you know that you're not overwriting X in the mean-time thus improving performance a little - which would also permit your suggestion that allows you to still use the normal push/pull without clobbering the offsets since X isn't being constantly reloaded, but I originally decided that overcomplicated what I was trying to achieve. Also, the point of the macros is that you're not permanently turning X into an extra stack pointer, so it is expected that X may get used for other purposes in between stack accesses, so logically X will need to be reloaded.

A thought I've just had, would be modify the ENTER and LEAVE macros thusly:

Code: Select all

ENTER		MACRO mLocalSize
	PHA
	PHX
	PHY
	PHN	mLocalSize
  ENDM
LEAVE		MACRO mLocalSize
	PLN	mLocalSize
	PLY
	PLX
	PLA
  ENDM

My intent with ENTER/LEAVE was to simplify the process of entering and exiting from an Interrupt context, but with this addition it can also be used in ordinary function calls to preserve register state and reserve space for locals on the stack. The caller function's register values (and stack values, for that matter) can still be accessed with an appropriate PKA/PTA offset.

Druzyek · Post by **Druzyek** » Fri Jan 26, 2018 4:40 am

Quote:

You can save X on the stack too, but then take that into account when formulating the number to add the index to.

Would you do that by having the macro assign a temporary value that is added to the hard coded base address and replace all push and pull arguments with macros that increase or decrease that value? It seems like it would be tough to keep adding and subtracting to the base yourself, especially if you go back and add a push or pull and have to redo the following ones.

Quote:

Also, the point of the macros is that you're not permanently turning X into an extra stack pointer, so it is expected that X may get used for other purposes in between stack accesses, so logically X will need to be reloaded.

That makes sense. If it helps you, I think it is good that you have them, even if you sacrifice a little speed for ease of use.

I was curious about maximizing performance, so I tested out a few different ways on a simple function. When A='a', X=3, and Y=4 the function prints "aaabbbcccddd." A returns with the count of characters printed. Here are the cycle counts for the ways I tried, even though I would not say they are generalizable in any way. I imagine doing some of this with macros but they are expanded so you can see what I'm thinking. Could I change anything to make it faster?

I think doing everything in zero page could be the fastest. The fourth way without pushing anything only works for top level functions. I don't think it is possible with macros but a more sophisticated program could automatically rearrange zero page usage so as few bytes as possible have to be pushed.

1. Hardware stack: 322
2. Pushing zero page with loop: 324
3. Pushing zero page with unrolled loop: 277
4. Zero page with no pushing: 229
5. Zero page stack: 323

Code: Select all

LDA #'A' ;Character to start printing with
LDX #3 ;How many times to repeat each character
LDY #4 ;How many sets of characters
JSR func

Output = AAABBBCCCDDD

1. Hardware stack

Code: Select all

	func:
	PHA ;Copy of character
	PHX ;Copy of number to print (3)
	TSX ;Make room for one byte to hold return value (characters printed)
	;Alarm Siren's macro chooses faster of two subtraction methods
	;TXA 
	;SEC
	;SBC #1
	DEX 
	;TAX
	TXS

	;Save copy of calculated stack offset (replaces TXS)
	STX zp_sp_copy
	
	;Zero one byte variable which returns total characters printed
	STZ HWSTACK_BASE+0+1,X
	
loop1: 
	;Load value passed in through X (number of characters to print = 3)
	LDA HWSTACK_BASE+1+1,X
	;Save a copy to load into X counter
	PHA
	;Load the character to print
	LDA HWSTACK_BASE+2+1,X
	;Number of characters to print = 3
	PLX
loop2:
	putc
	DEX
	BNE loop2
	;Reload X with calculated stack offset instead of TXS
	LDX zp_sp_copy
	;Count of characters printed
	LDA HWSTACK_BASE+0+1,X
	CLC
	;Add 3 to total
	ADC HWSTACK_BASE+1+1,X
	STA HWSTACK_BASE+0+1,X
	;Advance to the next character
	INC HWSTACK_BASE+2+1,X
	;Print five sets
	DEY
	BNE loop1
	
	;Get character count before X and SP are changed
	LDY HWSTACK_BASE+0+1,X
	
	;Empty stack (character to print, characters to print, and total printed)
	TSX
	TXA
	CLC
	ADC #3
	TAX
	TXS
		
	;Character count to A
	TYA
	RTS

2, 3, 4. Zero page

Code: Select all

	;**********2. Pushing zero page with loop**********
	func:
	;Copy of A (character to print)
	STA zp_temp
	;Copy of X (characters to print = 3)
	STX zp_temp2
	LDX #0
	;Push zero page bytes so we can use them
loop3:
	LDA R0,X
	PHA
	INX
	CPX #3
	BNE loop3
	;Store A (character to print) in R2
	LDA zp_temp
	STA R2
	;X = characters to print = 3
	LDA zp_temp2
	STA R1
	;Count of total characters printed
	STZ R0

	;**********3. Pushing zero page with unrolled loop**********
	func:
	STA zp_temp
	LDA R0
	PHA
	LDA R1
	PHA
	LDA R2
	PHA
	LDA zp_temp
	STA R2
	STX R1
	STZ R0

	;**********4. No pushing**********
	func:
	STA R2
	STX R1
	STZ R0

	;**********Body for 2, 3, and 4**********
	loop4:
	;X = characters to print = 3
	LDX R1
	;A = character to print
	LDA R2
loop5:
	putc
	DEX
	BNE loop5
	
	;Add characters printed (3) to total in R0
	LDA R0
	CLC
	ADC R1
	STA R0
	;Advance to the next charcter
	INC R2
	DEY
	BNE loop4
	
	;**********Ending for 2**********
	LDX #2
	;Restore the three bytes we used
loop6:
	PLA
	STA R0,X
	DEX
	BPL loop6
	
	;Character count is return value
	LDA R0
	RTS

	;**********Ending for 3**********
	PLA
	STA R2
	PLA
	STA R1
	PLA
	STA R0
	
	;Not needed but macros would not optimize it out
	LDA R0
	RTS

	;**********Ending for 4**********
	LDA R0
	RTS

5. Zero page stack

Code: Select all

	func5:
	;Character to print
	PHA
	;Pointer into zero page stack
	LDA zp_ptr
	;Increase by 3 for A copy, X copy, and character count
	CLC
	ADC #3
	STA zp_ptr
	;Push copy of characters to print (3)
	PHX
	;X points to zero page stack
	TAX
	;Characters to print (3)
	PLA
	STA 1,X
	;Character to print
	PLA
	STA 2,X
	;Zero count of total characters printed
	STZ 0,X
	
loop34:
	;Count of characters to print (3)
	LDA 1,X
	;Necessary since we cant do LDX 1,X
	PHA
	;Character to print
	LDA 2,X
	;X is now 3, count of characters to print
	PLX
loop35:
	putc
	DEX
	BNE loop35
	
	;Pointer to stack in zero page
	LDX zp_ptr
	;Count of total characters printed
	LDA 0,X
	CLC
	;Add three to total (printed three characters
	ADC 1,X
	STA 0,X
	;Advance to next character
	INC 2,X
	DEY
	BNE loop34
	
	;Free up the three bytes we took on the stack
	LDA zp_ptr
	SEC
	SBC #3
	STA zp_ptr
	
	;Returns total characters printed in A
	LDA R0
	RTS

Reentrant Programming: Small versus Big Software Stacks

Reentrant Programming: Small versus Big Software Stacks

Re: Reentrant Programming: Small versus Big Software Stacks

Re: Reentrant Programming: Small versus Big Software Stacks

Re: Reentrant Programming: Small versus Big Software Stacks

Re: Reentrant Programming: Small versus Big Software Stacks

Re: Reentrant Programming: Small versus Big Software Stacks

Re: Reentrant Programming: Small versus Big Software Stacks

Re: Reentrant Programming: Small versus Big Software Stacks

Re: Reentrant Programming: Small versus Big Software Stacks

Re: Reentrant Programming: Small versus Big Software Stacks

Re: Reentrant Programming: Small versus Big Software Stacks

Re: Reentrant Programming: Small versus Big Software Stacks

Re: Reentrant Programming: Small versus Big Software Stacks

Re: Reentrant Programming: Small versus Big Software Stacks

Re: Reentrant Programming: Small versus Big Software Stacks