6502.org Forum  Projects  Code  Documents  Tools  Forum
It is currently Fri Nov 22, 2024 9:49 pm

All times are UTC




Post new topic Reply to topic  [ 33 posts ]  Go to page 1, 2, 3  Next
Author Message
PostPosted: Sun Jan 21, 2018 3:25 pm 
Offline

Joined: Wed Jan 08, 2014 3:31 pm
Posts: 578
I'm noodling around with some ideas around stacks and parameter passing to allow reentrant programming. So I built two separate stack modules to see how many instructions are required for basic operations. I tried to make the interfaces similar enough that I could switch between them and judge their relative merits.

The first is the X register stack I've seen in several Forths (including Tali Forth): https://github.com/Martin-H1/Lisp65/blob/master/stack.asm

The second is a 16 bit stack pointer in page zero that grows downward from ram top: https://github.com/Martin-H1/Lisp65/blob/master/bigstack.asm

It's pretty clear that the Forth style stack is leaner, faster, and more suited for fast light calls. For example I could see it used as the calling standard for a math library, a heap, or I/O. The problem is that it doesn't support that much depth, consumes a register, and half of page zero. That makes interfacing it with reused code a challenge at times.

The second stack seems way too big and slow for operations like push, drop, swap. But on the plus side supports large allocations, stack depth, and a light register and page zero footprint. That allows interfacing with reused code.

Not sure where I am going with this, other than I am curious if anyone has better ideas?


Top
 Profile  
Reply with quote  
PostPosted: Sun Jan 21, 2018 7:06 pm 
Offline
User avatar

Joined: Tue Oct 25, 2016 8:56 pm
Posts: 362
To save him some typing, I will first point you to Garth's excellent Stacks Treatise.

As to your own points, yea, I feel there's no single answer. Different solutions are better for different situations, it all depends how you intend to use them.

There is however a third method you may not have considered, where you can use the Actual Stack for parameter passing - even on the 6502. You just have to use the TSX / TXS instructions and the a, x addressing mode with a base offset.

E.g. see this macro library I wrote. This is basically very similar to your first option, but without permanently tying up the X register - it can still be used for other purposes when you're not accessing the stack.

_________________
Want to design a PCB for your project? I strongly recommend KiCad. Its free, its multiplatform, and its easy to learn!
Also, I maintain KiCad libraries of Retro Computing and Arduino components you might find useful.


Top
 Profile  
Reply with quote  
PostPosted: Sun Jan 21, 2018 11:21 pm 
Offline
User avatar

Joined: Mon May 12, 2014 6:18 pm
Posts: 365
If you want to stick with a 6502 instead of a 65816, you might consider adding extra hardware to do the job. For example, you could map half of zero page to an unused section of RAM and hook up an up/down counter to drive address lines when that half of zero page is accessed. That way you could pulse the counter and have a new half zero page to work with for every function. You can pass parameters on the stack or just copy them to directly where you will need them in the mapped zero page if you aren't copying from another mapped zero page. I suppose you might also be able to set up a part of the memory map to point to the current half zero page plus one so you can copy from the current page to the next one directly saving cycles at the expense of more hardware.

This is all overly complicated but I think it will be faster than just about anything else you could do on the 6502.

EDIT: Alarm Siren, neat macros! Do you need to index by X for the STA in PTA?


Top
 Profile  
Reply with quote  
PostPosted: Mon Jan 22, 2018 11:46 am 
Offline
User avatar

Joined: Tue Mar 02, 2004 8:55 am
Posts: 996
Location: Berkshire, UK
You could use a hybrid technique using the CPU stack for arguments and locals but copying/restoring from RAM as it gets too full/empty.

I think BBC BASIC uses this approach when executing deeply nested procedures. Top active procedures are on the CPU stack for fast access while the stack image for parent precedures is copied to a software stack that grows down from below the screen memory (HIMEM).

_________________
Andrew Jacobs
6502 & PIC Stuff - http://www.obelisk.me.uk/
Cross-Platform 6502/65C02/65816 Macro Assembler - http://www.obelisk.me.uk/dev65/
Open Source Projects - https://github.com/andrew-jacobs


Top
 Profile  
Reply with quote  
PostPosted: Mon Jan 22, 2018 6:01 pm 
Offline

Joined: Wed Jan 08, 2014 3:31 pm
Posts: 578
@Druzyek, thanks for the reply, my project is a rewrite of a small Lisp interpreter from C to 6502 assembly, while making it generally reusable. So new hardware isn't something I want to consider. I tried using cc65 for the initial port, but between the cc65 overheard and lisp overheard, there was nothing left for programs.

@Alarm Siren, thanks for the links, I read through them and there was lots of useful information.

@BitWise, I think I will probably end up using a hybrid technique. The low level program machinery (e.g. I/O, memory management, and math) can use a small stack, as there's no recursion, few arguments, and call depths are shallow. It's parsing and evaluation where deep recursion can come into play. So I think a large stack specifically for those functions will be needed.


Top
 Profile  
Reply with quote  
PostPosted: Mon Jan 22, 2018 8:46 pm 
Offline
User avatar

Joined: Tue Oct 25, 2016 8:56 pm
Posts: 362
Druzyek wrote:
EDIT: Alarm Siren, neat macros!

Thanks

Druzyek wrote:
Do you need to index by X for the STA in PTA?


Yep. and PTX, PTY and PTZ. Looks like I forgot to put it on that whole class of macro :oops:
Apart from checking they compile, I've not had anything complex enough yet to test them in. Still being mucking around trying to get my UART working.
I'll update the file now.

EDIT: Updated.

_________________
Want to design a PCB for your project? I strongly recommend KiCad. Its free, its multiplatform, and its easy to learn!
Also, I maintain KiCad libraries of Retro Computing and Arduino components you might find useful.


Top
 Profile  
Reply with quote  
PostPosted: Tue Jan 23, 2018 12:58 am 
Offline
User avatar

Joined: Mon May 12, 2014 6:18 pm
Posts: 365
BitWise wrote:
I think BBC BASIC uses this approach when executing deeply nested procedures. Top active procedures are on the CPU stack for fast access while the stack image for parent precedures is copied to a software stack that grows down from below the screen memory (HIMEM).
Do you know if every child procedure copies or if they had some sort of system for comparing stack space needed to stack space left before copying?


Top
 Profile  
Reply with quote  
PostPosted: Tue Jan 23, 2018 2:49 am 
Offline
User avatar

Joined: Fri Aug 30, 2002 1:09 am
Posts: 8544
Location: Southern California
[Edit: I think this was in response to something in the macro library Alarm Siren linked to. My comment seems pretty out of place without having said this before.]

In my macros, I like to use prepositions to clarify; so you have for example
Code:
        COPY   VAR1, to, VAR2       ; Copy a single byte.
        COPY2  VAR3, to, VAR4       ; Copy a pair of bytes.
        COPY   VAR1, from, FOOBAR   ; Copy the other direction.

There's no doubt about the direction, ie, which one is the source and which is the destination, and it's more English-like, and it takes no extra bytes or execution time in the final machine code. "to" and "from" are EQUates that the macro uses to conditionally assemble the copying to be one direction or the other. Such parameters can also be put in the parameter list to just improve readability, without actually using them in the macro.

_________________
http://WilsonMinesCo.com/ lots of 6502 resources
The "second front page" is http://wilsonminesco.com/links.html .
What's an additional VIA among friends, anyhow?


Top
 Profile  
Reply with quote  
PostPosted: Tue Jan 23, 2018 9:47 am 
Offline
User avatar

Joined: Tue Mar 02, 2004 8:55 am
Posts: 996
Location: Berkshire, UK
Druzyek wrote:
BitWise wrote:
I think BBC BASIC uses this approach when executing deeply nested procedures. Top active procedures are on the CPU stack for fast access while the stack image for parent precedures is copied to a software stack that grows down from below the screen memory (HIMEM).
Do you know if every child procedure copies or if they had some sort of system for comparing stack space needed to stack space left before copying?


I can't remember what the criteria is.

The disassembly here https://acorn.huininga.nl/pub/docs/sourcecode/Acorn/BASIC%202/BASIC2.asm
has the stack push at $B1A7 and the pull at $B236.

_________________
Andrew Jacobs
6502 & PIC Stuff - http://www.obelisk.me.uk/
Cross-Platform 6502/65C02/65816 Macro Assembler - http://www.obelisk.me.uk/dev65/
Open Source Projects - https://github.com/andrew-jacobs


Top
 Profile  
Reply with quote  
PostPosted: Tue Jan 23, 2018 8:41 pm 
Offline
User avatar

Joined: Mon May 12, 2014 6:18 pm
Posts: 365
Quote:
The disassembly here https://acorn.huininga.nl/pub/docs/sourcecode/Acorn/BASIC%202/BASIC2.asm
has the stack push at $B1A7 and the pull at $B236.

Thanks. I was just wondering myself about the best way to do it. I have been thinking a slightly more sophisticated macro system could figure out automatically how much stack space your local variables need and copy that many bytes off the stack for you. I did something similar to manage zero page memory by pushing enough zero page bytes to hold function arguments and locals. It was not very efficient since it reuses the same memory over and over even if there is more zero page available.


Top
 Profile  
Reply with quote  
PostPosted: Thu Jan 25, 2018 9:14 pm 
Offline
User avatar

Joined: Mon May 12, 2014 6:18 pm
Posts: 365
Quote:
E.g. see this macro library I wrote. This is basically very similar to your first option, but without permanently tying up the X register - it can still be used for other purposes when you're not accessing the stack.
I was thinking about this. Wouldn't this fail if you needed to use the stack for anything? For example, you have this for PKA:
Code:
  PKA         MACRO   mOffset
    TSX
    LDA     HWSTACK_BASE+mOffset+1, X
  ENDM
Wouldn't this stop working if you did something like this:
Code:
PKA X1
ADC #5
PHA
PKA X2 ;Wrong address!

I think the secret would be only doing the TSX once at the beginning, but then you have to keep track of it yourself. What do you think?


Top
 Profile  
Reply with quote  
PostPosted: Thu Jan 25, 2018 10:23 pm 
Offline
User avatar

Joined: Fri Aug 30, 2002 1:09 am
Posts: 8544
Location: Southern California
You can save X on the stack too, but then take that into account when formulating the number to add the index to. It not always necessary to re-do the TSX just because you pushed or pulled something. (This is all covered in the 6502 stacks treatise.) The macro will probably need to be modified.

_________________
http://WilsonMinesCo.com/ lots of 6502 resources
The "second front page" is http://wilsonminesco.com/links.html .
What's an additional VIA among friends, anyhow?


Top
 Profile  
Reply with quote  
PostPosted: Thu Jan 25, 2018 10:29 pm 
Offline
User avatar

Joined: Fri Aug 30, 2002 1:09 am
Posts: 8544
Location: Southern California
Martin_H wrote:
It's pretty clear that the Forth style stack is leaner, faster, and more suited for fast light calls. For example I could see it used as the calling standard for a math library, a heap, or I/O. The problem is that it doesn't support that much depth, consumes a register, and half of page zero. That makes interfacing it with reused code a challenge at times.

As stated in the 6502 stacks treatise, the most stack-intensive Forth application I could come up with, including servicing interrupts in high-level Forth, and also having the RTC on NMI interrupts, did not take more than 20% of either page 1 or page 0. Regarding the space it does take, keep in mind that having the ZP data stack reduces the need for other variables and pointers that are normally kept in page 0; so the penalty is little or none. You may even come out ahead.

_________________
http://WilsonMinesCo.com/ lots of 6502 resources
The "second front page" is http://wilsonminesco.com/links.html .
What's an additional VIA among friends, anyhow?


Top
 Profile  
Reply with quote  
PostPosted: Fri Jan 26, 2018 12:15 am 
Offline
User avatar

Joined: Tue Oct 25, 2016 8:56 pm
Posts: 362
Druzyek wrote:
Quote:
E.g. see this macro library I wrote. This is basically very similar to your first option, but without permanently tying up the X register - it can still be used for other purposes when you're not accessing the stack.
I was thinking about this. Wouldn't this fail if you needed to use the stack for anything? For example, you have this for PKA:
Code:
  PKA         MACRO   mOffset
    TSX
    LDA     HWSTACK_BASE+mOffset+1, X
  ENDM
Wouldn't this stop working if you did something like this:
Code:
PKA X1
ADC #5
PHA
PKA X2 ;Wrong address!

I think the secret would be only doing the TSX once at the beginning, but then you have to keep track of it yourself. What do you think?


PKA is defined as copying into A the value at Offset from the top of the stack. If you've pushed something onto the stack, you need to adjust your offset to compensate. To avoid these difficulties, I envisage programs using the PHN macro to reserve an amount of space on the stack for locals. So, you'd get something like this:

Code:
PHN LocalSize
PKA LocalSize+X1
ADC #5
PTA LocalSize-3  ; Lets assume LocalSize is big enough to allow this
PKA LocalSize+X2 ; Now it's correct!


I have considered modifying the macros, or adding a new set, that allows you to optionally skip the TSX if you know that you're not overwriting X in the mean-time thus improving performance a little - which would also permit your suggestion that allows you to still use the normal push/pull without clobbering the offsets since X isn't being constantly reloaded, but I originally decided that overcomplicated what I was trying to achieve. Also, the point of the macros is that you're not permanently turning X into an extra stack pointer, so it is expected that X may get used for other purposes in between stack accesses, so logically X will need to be reloaded.

A thought I've just had, would be modify the ENTER and LEAVE macros thusly:
Code:
ENTER      MACRO mLocalSize
   PHA
   PHX
   PHY
   PHN   mLocalSize
  ENDM
LEAVE      MACRO mLocalSize
   PLN   mLocalSize
   PLY
   PLX
   PLA
  ENDM

My intent with ENTER/LEAVE was to simplify the process of entering and exiting from an Interrupt context, but with this addition it can also be used in ordinary function calls to preserve register state and reserve space for locals on the stack. The caller function's register values (and stack values, for that matter) can still be accessed with an appropriate PKA/PTA offset.

_________________
Want to design a PCB for your project? I strongly recommend KiCad. Its free, its multiplatform, and its easy to learn!
Also, I maintain KiCad libraries of Retro Computing and Arduino components you might find useful.


Top
 Profile  
Reply with quote  
PostPosted: Fri Jan 26, 2018 4:40 am 
Offline
User avatar

Joined: Mon May 12, 2014 6:18 pm
Posts: 365
Quote:
You can save X on the stack too, but then take that into account when formulating the number to add the index to.

Would you do that by having the macro assign a temporary value that is added to the hard coded base address and replace all push and pull arguments with macros that increase or decrease that value? It seems like it would be tough to keep adding and subtracting to the base yourself, especially if you go back and add a push or pull and have to redo the following ones.

Quote:
Also, the point of the macros is that you're not permanently turning X into an extra stack pointer, so it is expected that X may get used for other purposes in between stack accesses, so logically X will need to be reloaded.
That makes sense. If it helps you, I think it is good that you have them, even if you sacrifice a little speed for ease of use.

I was curious about maximizing performance, so I tested out a few different ways on a simple function. When A='a', X=3, and Y=4 the function prints "aaabbbcccddd." A returns with the count of characters printed. Here are the cycle counts for the ways I tried, even though I would not say they are generalizable in any way. I imagine doing some of this with macros but they are expanded so you can see what I'm thinking. Could I change anything to make it faster?

I think doing everything in zero page could be the fastest. The fourth way without pushing anything only works for top level functions. I don't think it is possible with macros but a more sophisticated program could automatically rearrange zero page usage so as few bytes as possible have to be pushed.

1. Hardware stack: 322
2. Pushing zero page with loop: 324
3. Pushing zero page with unrolled loop: 277
4. Zero page with no pushing: 229
5. Zero page stack: 323

Code:
LDA #'A' ;Character to start printing with
LDX #3 ;How many times to repeat each character
LDY #4 ;How many sets of characters
JSR func
Output = AAABBBCCCDDD

1. Hardware stack
Code:
   func:
   PHA ;Copy of character
   PHX ;Copy of number to print (3)
   TSX ;Make room for one byte to hold return value (characters printed)
   ;Alarm Siren's macro chooses faster of two subtraction methods
   ;TXA
   ;SEC
   ;SBC #1
   DEX
   ;TAX
   TXS

   ;Save copy of calculated stack offset (replaces TXS)
   STX zp_sp_copy
   
   ;Zero one byte variable which returns total characters printed
   STZ HWSTACK_BASE+0+1,X
   
loop1:
   ;Load value passed in through X (number of characters to print = 3)
   LDA HWSTACK_BASE+1+1,X
   ;Save a copy to load into X counter
   PHA
   ;Load the character to print
   LDA HWSTACK_BASE+2+1,X
   ;Number of characters to print = 3
   PLX
loop2:
   putc
   DEX
   BNE loop2
   ;Reload X with calculated stack offset instead of TXS
   LDX zp_sp_copy
   ;Count of characters printed
   LDA HWSTACK_BASE+0+1,X
   CLC
   ;Add 3 to total
   ADC HWSTACK_BASE+1+1,X
   STA HWSTACK_BASE+0+1,X
   ;Advance to the next character
   INC HWSTACK_BASE+2+1,X
   ;Print five sets
   DEY
   BNE loop1
   
   ;Get character count before X and SP are changed
   LDY HWSTACK_BASE+0+1,X
   
   ;Empty stack (character to print, characters to print, and total printed)
   TSX
   TXA
   CLC
   ADC #3
   TAX
   TXS
      
   ;Character count to A
   TYA
   RTS


2, 3, 4. Zero page
Code:
   ;**********2. Pushing zero page with loop**********
   func:
   ;Copy of A (character to print)
   STA zp_temp
   ;Copy of X (characters to print = 3)
   STX zp_temp2
   LDX #0
   ;Push zero page bytes so we can use them
loop3:
   LDA R0,X
   PHA
   INX
   CPX #3
   BNE loop3
   ;Store A (character to print) in R2
   LDA zp_temp
   STA R2
   ;X = characters to print = 3
   LDA zp_temp2
   STA R1
   ;Count of total characters printed
   STZ R0

   ;**********3. Pushing zero page with unrolled loop**********
   func:
   STA zp_temp
   LDA R0
   PHA
   LDA R1
   PHA
   LDA R2
   PHA
   LDA zp_temp
   STA R2
   STX R1
   STZ R0

   ;**********4. No pushing**********
   func:
   STA R2
   STX R1
   STZ R0

   ;**********Body for 2, 3, and 4**********
   loop4:
   ;X = characters to print = 3
   LDX R1
   ;A = character to print
   LDA R2
loop5:
   putc
   DEX
   BNE loop5
   
   ;Add characters printed (3) to total in R0
   LDA R0
   CLC
   ADC R1
   STA R0
   ;Advance to the next charcter
   INC R2
   DEY
   BNE loop4
   
   ;**********Ending for 2**********
   LDX #2
   ;Restore the three bytes we used
loop6:
   PLA
   STA R0,X
   DEX
   BPL loop6
   
   ;Character count is return value
   LDA R0
   RTS

   ;**********Ending for 3**********
   PLA
   STA R2
   PLA
   STA R1
   PLA
   STA R0
   
   ;Not needed but macros would not optimize it out
   LDA R0
   RTS

   ;**********Ending for 4**********
   LDA R0
   RTS


5. Zero page stack
Code:
   func5:
   ;Character to print
   PHA
   ;Pointer into zero page stack
   LDA zp_ptr
   ;Increase by 3 for A copy, X copy, and character count
   CLC
   ADC #3
   STA zp_ptr
   ;Push copy of characters to print (3)
   PHX
   ;X points to zero page stack
   TAX
   ;Characters to print (3)
   PLA
   STA 1,X
   ;Character to print
   PLA
   STA 2,X
   ;Zero count of total characters printed
   STZ 0,X
   
loop34:
   ;Count of characters to print (3)
   LDA 1,X
   ;Necessary since we cant do LDX 1,X
   PHA
   ;Character to print
   LDA 2,X
   ;X is now 3, count of characters to print
   PLX
loop35:
   putc
   DEX
   BNE loop35
   
   ;Pointer to stack in zero page
   LDX zp_ptr
   ;Count of total characters printed
   LDA 0,X
   CLC
   ;Add three to total (printed three characters
   ADC 1,X
   STA 0,X
   ;Advance to next character
   INC 2,X
   DEY
   BNE loop34
   
   ;Free up the three bytes we took on the stack
   LDA zp_ptr
   SEC
   SBC #3
   STA zp_ptr
   
   ;Returns total characters printed in A
   LDA R0
   RTS


Top
 Profile  
Reply with quote  
Display posts from previous:  Sort by  
Post new topic Reply to topic  [ 33 posts ]  Go to page 1, 2, 3  Next

All times are UTC


Who is online

Users browsing this forum: No registered users and 39 guests


You cannot post new topics in this forum
You cannot reply to topics in this forum
You cannot edit your posts in this forum
You cannot delete your posts in this forum
You cannot post attachments in this forum

Search for:
Jump to: