A (WDC) 65816 calling convention

BigDumbDinosaur · Post by **BigDumbDinosaur** » Thu Sep 25, 2025 6:43 am

gfoot wrote:

BigDumbDinosaur wrote:

That only becomes a concern if the function being called is independently loaded from any other program. In most of my libraries, I use a symbol to indicate when a function is being remotely called, which of course implies that the return must be via RTL.

Do you have an example of this? I don't think I understand what it's for.

Here’s the comment block from a library function, the block explaining how to call the function:

Code:

;blkread: READ FROM SCSI BLOCK DEVICE
;
;	———————————————————————————————————————————————————————————————————————
;	Synopsis: This function reads one or more blocks from a SCSI block dev-
;	          ice, e.g., a disk, into a buffer.  The logical unit number is
;	          assumed to be zero.
;
;	          § All parameters are pointers to data.
;
;	          § The target device's bus ID is a 16-bit quantity, with the
;	            MSB set to $00.
;
;	          § The logical block address (LBA) is a 32-bit quantity, even
;	            though the target device may not require or accept a 32-bit
;	            LBA.
;
;	          § The number of blocks to be accessed is a 16-bit quantity,
;	            with the MSB set to $00.  The maximum number of blocks that
;	            may be accessed is 127 ($007F).  If this field is $0000, no
;	            operation will occur & this function will immediately ret-
;	            urn without an error.
;
;	          § The buffer address is expressed as 32 bits, with the MSB of
;	            the MSW set to $00.  The buffer must be of sufficient size
;	            to hold the requested number of blocks multiplied by the
;	            device's block size.
;
;	          § If the buffer pointer is null, it is assumed the buffer
;	            pointer was set by a previous call to the SSBUFS BIOS API.
;	            This sequence is recommended if repetitive accesses are to
;	            be made to the same buffer.  Doing so will avoid the over-
;	            head of a BIOS API call on each access to set the buffer
;	            pointer.  Use this feature with caution!
;
;	          § The target SCSI device must have been enumerated during the
;	            system POST.
;	———————————————————————————————————————————————————————————————————————
;	Invocation example: pea #buf >> 16         ;buffer pointer MSW
;	                    pea #buf & $ffff       ;buffer pointer LSW
;	                    pea #nblk >> 16        ;block count pointer MSW
;	                    pea #nblk & $ffff      ;block count pointer LSW
;	                    pea #lba >> 16         ;LBA pointer MSW
;	                    pea #lba & $ffff       ;LBA pointer LSW
;	                    pea #scsi_id >> 16     ;device ID pointer MSW
;	                    pea #scsi_id & $ffff   ;device ID pointer LSW
;	                    .IF .DEF(_SCSI_)       ;define this symbol...
;	                    JSL blkread            ;if making a “far” call...
;	                    .ELSE
;	                    JSR blkread
;	                    .ENDIF
;	                    BCS ERROR
;
;	Exit registers: .A: entry value ¹
;	                .B: entry value ²
;	                .X: entry value
;	                .Y: entry value
;	                DB: entry value
;	                DP: entry value
;	                PB: entry value
;	                SR: nvmxdizc
;	                    ||||||||
;	                    |||||||+———> 0: okay
;	                    |||||||      1: error
;	                    +++++++————> entry value
;
;	Notes: 1) One of the following if an error:
;
;	            e_ptrnul: null pointer passed
;	            e_ssabt : transaction aborted
;	            e_ssblk : too many blocks
;	            e_sschk : check condition
;	            e_sscfe : controller fifo error
;	            e_sscge : controller general error
;	            e_sscmd : illegal controller command
;	            e_ssdne : device not enumerated
;	            e_ssdnp : device not responding
;	            e_sssnr : SCSI subsystem not ready
;	            e_sstid : device ID out of range
;	            e_ssubp : unsupported bus phase
;	            e_ssubr : unexpected bus reset
;	            e_ssucg : unsupported command group
;	            e_ssudf : unsupported driver function
;	            e_ssudt : unsupported device type
;
;	       2) $00 if an error.
;	———————————————————————————————————————————————————————————————————————

Next are the stack frame definitions, in which conditional assembly is involved:

Code:

;—————————————————————————————————————————————————————————
;
;LOCAL DEFINITIONS
;
.maxblk  =128                  ;max blocks per transaction +1
.s_cdb   =s_cdbg2              ;size of local CDB
.sfbase  .set 0                ;base stack index
.sfidx   .set .sfbase          ;workspace index
;
;—————————> workspace stack frame start <—————————
;
.wsf     =.sfidx               ;start of workspace
;
.cdb     =.sfidx               ;local CDB
.sfidx   .= .sfidx+.s_cdb
.nblks   =.sfidx               ;block count
.sfidx   .= .sfidx+s_word
.lba     =.sfidx               ;LBA
.sfidx   .= .sfidx+s_lba
;
;—————————> workspace stack frame end <—————————
;
.s_wsf   =.sfidx-.sfbase       ;workspace size
.sfbase  .set .sfidx
;
;—————————> register stack frame start <—————————
;
.reg_dp  =.sfidx               ;DP
.sfidx   .= .sfidx+s_mpudpx
.reg_db  =.sfidx               ;DB
.sfidx   .= .sfidx+s_mpudbx
.reg_c   =.sfidx               ;.C
.sfidx   .= .sfidx+s_word
.reg_x   =.sfidx               ;.X
.sfidx   .= .sfidx+s_word
.reg_y   =.sfidx               ;.Y
.sfidx   .= .sfidx+s_word
.reg_sr  =.sfidx               ;SR
.sfidx   .= .sfidx+s_mpusrx
.reg_pc  =.sfidx               ;PC
.sfidx   .= .sfidx+s_mpupcx
	.if .def(_SCSI_)             ;<——— conditional assembly for “far” call
.reg_pb  =.sfidx               ;PB
.sfidx   .= .sfidx+s_mpupbx
	.endif                       ;<——— conditional assembly for “far” call
;
;—————————> register stack frame end <—————————
;
.s_rsf   =.sfidx-.sfbase       ;register frame size
.sfbase  .set .sfidx
;
;—————————> parameter stack frame start <—————————
;
.idptr   =.sfidx               ;*SCSI_ID
.sfidx   .= .sfidx+s_dptr
.lbaptr  =.sfidx               ;*LBA
.sfidx   .= .sfidx+s_dptr
.nblkptr =.sfidx               ;*NBLK
.sfidx   .= .sfidx+s_dptr
.bufptr  =.sfidx               ;*BUF
.sfidx   .= .sfidx+s_dptr
;
;—————————> parameter stack frame end <—————————
;
.s_psf   =.sfidx-.sfbase       ;parameter frame size
;
;—————————————————————————————————————————————————————————
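
To see what the conditional assembly does to the numbers, here is a small Python model of the same running-index bookkeeping. It is a sketch, not assembler output. The thread does not show the values of `s_cdbg2`, `s_lba`, `s_dptr` or the `s_mpu…` symbols, so every size below is an assumption: a 10-byte CDB, a 4-byte LBA, 4-byte pointers and the natural 65C816 register widths.

```python
# Model of the .sfidx bookkeeping in the stack frame definitions.
# Every size here is an ASSUMPTION made for illustration; the real values
# come from the author's include files, which the thread does not show.
S_CDB, S_WORD, S_LBA, S_DPTR = 10, 2, 4, 4
S_DP, S_DB, S_SR, S_PC, S_PB = 2, 1, 1, 2, 1


def layout(far_call):
    """Return (offsets, sizes) for the near- or far-call frame variant."""
    offsets, sizes = {}, {}
    idx = 0

    def section(name, fields):
        nonlocal idx
        base = idx
        for field, size in fields:
            offsets[field] = idx       # like  .field = .sfidx
            idx += size                # like  .sfidx .= .sfidx+size
        sizes[name] = idx - base       # like  .s_xxx = .sfidx-.sfbase

    section("wsf", [("cdb", S_CDB), ("nblks", S_WORD), ("lba", S_LBA)])
    regs = [("reg_dp", S_DP), ("reg_db", S_DB), ("reg_c", S_WORD),
            ("reg_x", S_WORD), ("reg_y", S_WORD), ("reg_sr", S_SR),
            ("reg_pc", S_PC)]
    if far_call:                       # .if .def(_SCSI_)
        regs.append(("reg_pb", S_PB))
    section("rsf", regs)
    section("psf", [("idptr", S_DPTR), ("lbaptr", S_DPTR),
                    ("nblkptr", S_DPTR), ("bufptr", S_DPTR)])
    return offsets, sizes


near_off, near_size = layout(far_call=False)
far_off, far_size = layout(far_call=True)
print("near:", near_size, "idptr at", near_off["idptr"])
print("far: ", far_size, "idptr at", far_off["idptr"])
```

With these assumed sizes the far-call register frame is one byte larger, so every parameter offset moves up by one. That is why the frame definitions and the choice of return instruction have to hang off the same symbol.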

The function’s postamble is as follows—the postamble fixes up the stack:

Code:

.done    rep #m_setr|sr_car    ;16-bit registers & clear carry
         tsc
         adc !#.s_wsf          ;ephemeral workspace size
         tcs                   ;expunge workspace
         adc !#.s_rsf          ;register stack frame size
         tax                   ;“from” address for stack realignment
         adc !#.s_psf          ;parameter stack frame size
         tay                   ;“to” address for stack realignment
         lda !#.s_rsf-1
         mvp #0,#0             ;realign stack, gets rid of parameter frame
         tyx                   ;fix up...
         txs                   ;stack pointer
         pld
         plb
         pla
         plx
         ply
         plp
	.if .def(_SCSI_)             ;<——— conditional assembly for “far” call
         rtl
	.else                        ;<——— conditional assembly for “near” call
         rts
	.endif                       ;<——— conditional assembly end

In the above code fragments, if the symbol _SCSI_ has been defined, the function’s register stack frame will be configured to include the program bank (PB) and the postamble will be assembled to use RTL to return. If _SCSI_ is not defined, assembly will assume that the function is being called with JSR and PB will not be defined as part of the register frame.
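
The stack realignment in the postamble can be checked with a byte-level model. The Python below is only an illustration of the arithmetic, with assumed frame sizes (16-byte workspace, 12-byte register frame for the near-call case, 16-byte parameter frame) and made-up byte values. The `Machine` class is hypothetical scaffolding, not anything from the library.

```python
# Byte-level model of the callee-cleanup postamble (illustration only).
WSF, RSF, PSF = 16, 12, 16      # ASSUMED frame sizes, near-call variant


class Machine:
    """Just enough of a 65C816 to follow the stack pointer in bank $00."""

    def __init__(self, sp):
        self.mem = bytearray(0x10000)
        self.sp = sp

    def push(self, byte):
        self.mem[self.sp] = byte    # store, then decrement
        self.sp -= 1

    def pull(self):
        self.sp += 1                # increment, then load
        return self.mem[self.sp]


m = Machine(sp=0x01FF)
sp_before_call = m.sp

for i in range(PSF):                # caller pushes the parameter frame
    m.push(0xA0 + i)
saved = [0x10 + i for i in range(RSF)]
for byte in saved:                  # return address, then register pushes
    m.push(byte)
m.sp -= WSF                         # callee reserves its workspace

# --- postamble, one comment per instruction group ---
c = m.sp + WSF                      # tsc / adc !#.s_wsf
m.sp = c                            # tcs: workspace is gone
x = c + RSF                         # tax: top byte of the register frame
y = x + PSF                         # tay: top byte of the parameter frame
for _ in range(RSF):                # lda !#.s_rsf-1 / mvp #0,#0
    m.mem[y] = m.mem[x]             # MVP copies from the top down
    x -= 1
    y -= 1
m.sp = y                            # tyx / txs

pulled = [m.pull() for _ in range(RSF)]     # pld ... plp, then rts
print(pulled == saved[::-1], m.sp == sp_before_call)
```

The register frame slides up over the parameter frame, the registers and return address come back off the stack intact, and the stack pointer ends where it was before the caller pushed anything.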

Within the main program, if the SCSI library macros are used to call the BLKREAD function, the call will be made with JSL if _SCSI_ has been defined. It’s all based upon conditional assembly. Here’s the BLKREAD macro:

Code:

;————————————————————————————————————————————————————————
;READ BLOCK(S) FROM SCSI DEVICE
;
;blkread *ID,*LBA,*NBLK,*BUF,'<m1>','<m2>','<m3>','<m4>'
;
;  *ID   — device bus ID
;  *LBA  — logical block address
;  *NBLK — number of blocks to read
;  *BUF  — buffer
;  <m1> - <m4> parameter addressing mode, ‘d’, ‘f’, ‘n’ or ‘r’
;  If addressing mode ‘r’ is specified, the pointer is
;  expected to be in .X (LSW) & .Y (MSW).  Index register
;  widths must be set to 16 bits before invoking this
;  macro.
;————————————————————————————————————————————————————————
;
blkread  .macro ...
.rp	=8
	.if @0 == .rp
.np	    =.rp/2
.ct	    .set .np
.xr	    .set 0
	    .rept .ct
.m	        .= {{@{.ct+.np}} | %00100000}
	        .if .m == 'd'
         pei @.ct+2
         pei @.ct
	        .else
	            .if .m == 'f'
         pea #@.ct >> 16
         pea #@.ct & $ffff
	            .else
	                .if .m == 'n'
	                    .if .def(execbank)
         pei execbank
	                    .else
         phk
         phk
	                    .endif
         per @.ct
	                .else
	                    .if .m == 'r'
	                        .if .xr
	                            .error ""+@0$+": 'r' addressing can only be used once"
	                        .else
.xr	                            .set 1
         phy
         phx
	                        .endif
	                    .else
	                        .error ""+@0$+": mode must be 'd', 'f', 'n' or 'r'"
	                    .endif
	                .endif
	            .endif
	        .endif
.ct	        .= .ct-1
	    .endr
	.else
	    .error ""+@0$+": missing parameters"
	.endif
	.if .def(_SCSI_) ;<——— start of conditional assembly
         jsl blkread
	.else
         jsr blkread
	.endif           ;<——— end of conditional assembly
         .endm

Quote:

...you mean the called function removes things from the stack that were put there by the caller?

Yes.

When I first embarked on writing 816 code and working out how to manage parameter-passing, I decided that since something had to clean up the stack after a function call, it would be more efficient to have the function handle that task—see the above postamble code. Once a function has been fully debugged, it can be called with the confidence of knowing the stack isn’t going to get boogered up by a bug in the calling program—assuming the caller correctly structures the stack frame. The latter is where the macros get into the picture—they will always build a proper stack frame, and will rebuke the programmer (me) if the wrong number or type of parameters is passed.

Incidentally, functions can freely modify the registers by using <offset>,S addressing. When the registers are pulled in the postamble, the caller will see the changes. In many of my functions, DP is pointed to SP+1 after local workspace has been allocated on the stack, which is real convenient for rewriting the registers, accessing the caller’s parameters, etc., using direct-page addressing. Here’s a typical example of how I do it:

Code:

         php
         rep #m_setr|sr_bdm    ;16-bit registers
         phy                   ;save machine state
         phx
         pha
         phb
         phd
         sec
         tsc
         sbc !#.s_wsf          ;reserve workspace
         tcs
         inc                   ;DP is the workspace...
         tcd                   ;stack frame pointer

With the above sequence, the first byte of workspace could be accessed with LDA $00. Something such as STA .reg_c would modify the entry value of the accumulator.
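
A quick way to convince yourself that a store to `.reg_c` really lands on the saved accumulator is to model the pushes. The Python below is a sketch under assumptions: a 16-byte workspace, made-up register values, and a hypothetical `Machine` helper that mimics the 65C816's 16-bit push order (high byte first, so the low byte ends up at the lower address).

```python
# Model of the preamble: with DP = SP+1, workspace and saved registers
# become direct-page offsets.  Sizes and values are ASSUMED for illustration.
S_WSF = 16                  # workspace size
REG_DP = S_WSF              # saved DP (2 bytes) sits right above the workspace
REG_DB = REG_DP + 2         # saved DB (1 byte)
REG_C = REG_DB + 1          # saved .C (2 bytes)


class Machine:
    def __init__(self, sp):
        self.mem = bytearray(0x10000)
        self.sp = sp

    def push8(self, value):
        self.mem[self.sp] = value & 0xFF
        self.sp -= 1

    def push16(self, value):
        self.push8(value >> 8)      # high byte goes on first
        self.push8(value)

    def pull8(self):
        self.sp += 1
        return self.mem[self.sp]

    def pull16(self):
        low = self.pull8()
        return low | (self.pull8() << 8)

    def write16(self, addr, value):
        self.mem[addr] = value & 0xFF
        self.mem[addr + 1] = value >> 8


m = Machine(sp=0x01FF)
m.push8(0x30)               # php
m.push16(0x2222)            # phy
m.push16(0x1111)            # phx
m.push16(0xAAAA)            # pha  (entry value of the accumulator)
m.push8(0x00)               # phb
m.push16(0x0000)            # phd
m.sp -= S_WSF               # sec / tsc / sbc !#.s_wsf / tcs
dp = m.sp + 1               # inc / tcd

m.write16(dp + REG_C, 0x1234)   # sta .reg_c  (direct-page store)

m.sp += S_WSF               # postamble: drop the workspace
m.pull16()                  # pld
m.pull8()                   # plb
a = m.pull16()              # pla
x = m.pull16()              # plx
y = m.pull16()              # ply
print(hex(a), hex(x), hex(y))
```

The accumulator comes back as the value stored through direct page, while .X and .Y keep their entry values.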

Martin_H · Post by **Martin_H** » Thu Sep 25, 2025 2:02 pm

I don't know if any of you have programmed a System 360, but it doesn't have a hardware stack and uses register linkage instead. The caller provides a pointer to a block of memory with the arguments, and the BALR instruction calls the subroutine and places the return address in the link register. The subroutine must preserve the link register contents if it calls another subroutine. Unfortunately, the contents of the parameter block were largely unstructured, which made cross-language calls difficult. In contrast, the VAX has a hardware stack with call and return instructions, and standardized stack frame contents. The call instruction pushed the number of parameters and return address onto the stack. Since the VAX hardware enforced the calling standard, every language was interoperable.

The VAX is a classic CISC machine; however, by the end of the '80s RISC CPU performance had eclipsed it because the RISC designs used register linkage. For example, MIPS uses the JAL instruction, which calls a subroutine and places the return address in the link register. At the time it gave me System 360 déjà vu, but the hardware designers made these changes because stacks don't work well with instruction pipelines and dual-issue CPUs.

Brining this back to 6502 programming. I've often wondered if its small hardware stack was a design error and page zero "registers" with a BALR instruction would have been a better fit. Conversely the 65816's large hardware stack feels like a classic CISC machine.

Martin_H · Post by **Martin_H** » Thu Sep 25, 2025 2:28 pm

With regard to who pushes arguments and who is responsible for pulling them, this has always been a point of debate with single-stack calling standards.

In a two-stack design like a Forth VM it's easy. Pop the arguments as they are consumed because the return address is in a separate stack. So caller pushes and consumer pulls is built into the basic design.

But a single data stack requires popping the return address into a temporary location, consuming the input arguments, pushing the return address, and issuing the return call. It's doable, but more often I've seen caller pushes and caller pulls as the standard in programming languages.

Personally I greatly prefer Forth's two stack approach and wish it caught on.

BigDumbDinosaur · Post by **BigDumbDinosaur** » Thu Sep 25, 2025 3:36 pm

Martin_H wrote:

I don't know if any of you have programmed a System 360...

My programming experience with the S360 was limited to tinkering with the 6502 reference assembler provided by MOS Technology, which assembler was a concoction of FORTRAN and weird-looking macros.

Quote:

Brining (sic) this back to 6502 programming. I've often wondered if its small hardware stack was a design error and page zero "registers" with a BALR instruction would have been a better fit. Conversely the 65816's large hardware stack feels like a classic CISC machine.

The historical record supports the notion that Chuck Peddle and company went with the one-page stack as part of the cost-cutting process. Making the stack “pointer” act as an index would require fewer transistors to implement than if it were a true pointer register. It’s the same thinking that informed other design decisions and led to a major reduction in die area and cost.

For the most part, the one-page stack doesn’t seem to hamper the 6502 too much, although I can recall a number of situations in which more stack-addressing flexibility would have been useful. The 65C816’s 16-bit stack pointer seems luxurious in comparison, but most 6502 programs never use up the full stack. That said, I had at least one occasion in writing code for my POC V1.3 unit in which I used enough stack space to exceed a page and inadvertently trample on an adjacent data structure.

The 816’s stack-relative addressing modes and the ability to read/write the stack as though it were direct page are a significant step up from the eight-bit 65xx architecture. The 16-bit registers are the icing on the cake.

vbc · Post by **vbc** » Thu Sep 25, 2025 3:54 pm

Martin_H wrote:

With regard to who pushes arguments and who is responsible for pulling them, this has always been a point of debate with single-stack calling standards.

In a two-stack design like a Forth VM it's easy. Pop the arguments as they are consumed because the return address is in a separate stack. So caller pushes and consumer pulls is built into the basic design.

Even with separate stacks I would argue that caller-pull is faster in most cases. For many ABIs, there are situations like variable-arguments where the callee does not have an easy way to know the size of the arguments. So you would have to pass the argument size as well. The caller usually knows the size.
Also, there is the option to accumulate the arguments of several calls and pull them at once.

Callee-pull might get smaller code-size if there are several call-sites for a function and the pull-code is only needed once. It is also useful, of course, for systems where you cannot easily modify the stack apart from pulling the topmost element. But those are ill-suited for many languages anyway.

Regarding the 65816/WDC ABI, I did have a look at that when doing the 65816 port of vbcc. There are a number of possible ABIs for the 65816, each with its pros and cons. While I did not do any real comparisons, I decided against the approach of setting dpage to the current stack-frame. Using a non-aligned dpage costs a one-cycle penalty, making dpage-direct addressing pretty much the same as stack-relative addressing (you do get some more addressing options using dpage, though). But you lose the ability to use faster dpage registers and global dpage variables. Also, there is the overhead code for setting the dpage register. It makes more sense for a simple compiler without register-allocation.
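
The cycle argument can be put into numbers. The sketch below encodes the LDA timings as I understand them from the W65C816S opcode table (direct page: 3 cycles, plus 1 for a 16-bit accumulator, plus 1 if the low byte of DP is non-zero; stack relative: 4 cycles, plus 1 for a 16-bit accumulator). The function names are mine, and the figures are worth checking against the data sheet before relying on them.

```python
def lda_direct_page(m16, dp_low_byte):
    """Cycles for LDA dp; dp_low_byte is the low byte of the DP register."""
    return 3 + (1 if m16 else 0) + (1 if dp_low_byte else 0)


def lda_stack_relative(m16):
    """Cycles for LDA offset,S."""
    return 4 + (1 if m16 else 0)


for m16 in (False, True):
    width = "16-bit" if m16 else " 8-bit"
    print(width,
          "dp aligned:", lda_direct_page(m16, 0x00),
          "dp on stack:", lda_direct_page(m16, 0xE6),
          "sr,S:", lda_stack_relative(m16))
```

A direct page pointed into the stack at an unaligned address costs the same per load as stack-relative addressing, and a page-aligned direct page is one cycle faster than either. The extra addressing modes are what is left to argue about.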

BigDumbDinosaur · Post by **BigDumbDinosaur** » Thu Sep 25, 2025 5:12 pm

Martin_H wrote:

Personally I greatly prefer Forth's two stack approach and wish it caught on.

The Commodore 128 took that approach in implementing BASIC 7.0, and it cost dearly in terms of performance.

The run-time stack pointer at $7D-$7E had to be manipulated each time something was pushed or pulled, which happened with each DO loop, FOR-NEXT loop or GOSUB. The only way to manipulate that pointer was with multi-byte arithmetic. Further increasing the workload was the necessity of using indirect addressing to read/write the stack, with .Y having to be set to $00 each time, since the 8502 lacks a true indirect addressing mode. The result was BASIC 7.0 on the C-128 running at 1 MHz was markedly slower than BASIC 2.0 running on the C-64 at the same speed. Even when the C-128 was running at 2 MHz, its BASIC was only marginally faster than the C-64.

vbc wrote:

Even with separate stacks I would argue that caller-pull is faster in most cases. For many ABIs, there are situations like variable-arguments where the callee does not have an easy way to know the size of the arguments.

In calls in which the number of parameters (arguments) will be variable, the function can be written to interpret the bottom-most entry in the parameter stack frame as a count of the number of parameters in the frame (a design feature that may be used to implement some error-checking before disaster caused by an unbalanced stack occurs). In my functions that use that approach, that count is a word that is the number of words in the frame, not including the count itself.

In functions that take a fixed number of parameters, I have, to date, elected to have local stack frame definitions cheerfully assume the caller will correctly load the stack before the call—use of macros for function calls assures this is the case. When I write the kernel for my POC unit, I will uniformly adopt the method of including a parameter count in the stack frame, even in functions that take a fixed number of parameters—mostly so I can use a “one size fits all” macro to load the stack. Here’s a macro for the Kowalski assembler that builds a stack frame in that fashion:

Code:

;———————————————————————————————————————————————————————————————————————————————
;GENERATE FUNCTION CALL STACK FRAME
;
;pushparm parm1 [,parm2 [...,parmN]],mode1 [,mode2 [...,modeN]]
;
;This macro generates a parameter stack frame.  Parameters will be pushed from
;right to left, i.e., PARM1 will be the lowest parameter on the stack.  The form
;of each parameter will depend on the corresponding mode flag that is passed.
;There is a one-for-one correspondence between parameters & mode flags.
;
;Below PARM1 will be a word value that indicates the number of words that were
;pushed.  For example:
;
;  pushparm addr1,addr2,addr3,'f','f','f'
;
;will produce the following stack frame:
;
;  addr3 (MSW)
;  addr3 (LSW)
;  addr2 (MSW)
;  addr2 (LSW)
;  addr1 (MSW)
;  addr1 (LSW)
;  $0006 (words pushed)
;
;For each parameter in the invocation, a corresponding mode must be passed.  The
;recognized modes are:
;
;  'd'  The corresponding parameter is a direct-page address from which
;       the value to be pushed will be fetched.  The resulting instruc-
;       tion sequence will be...
;
;         PEI <parm>+2
;         PEI <parm>
;
;       ...hence pushing a double word (DWORD) in little-endian format.
;
;  'f'  The corresponding parameter is a value that is to be processed
;       as a 32-bit, little-endian quantity, such as a “far” address.
;       The resulting instruction sequence will be...
;
;         PEA #<parm> >> 16
;         PEA #<parm> & $FFFF
;
;       ...hence pushing a double word (DWORD) in little-endian format.
;
;  'i'  The corresponding parameter is a value that is to be processed
;       as a 16-bit, little-endian quantity, with no assumptions made
;       about its purpose.  The resulting instruction sequence will
;       be...
;
;         PEA #<PARM> & $FFFF
;
;       ...hence pushing a WORD.
;
;  'n'  The corresponding parameter is to be processed as a “near” add-
;       ress.  If the direct-page location EXECBANK has been defined,
;       the resulting instruction sequence will be:
;
;         PEI EXECBANK       ;execution bank
;         PER <parm> & $FFFF ;parameter LSW
;
;       Obviously, EXECBANK must contain the execution bank, with the
;       EXECBANK+1 location containing $00.
;
;       If EXECBANK has not been defined, the sequence will be:
;
;         PHK
;         PHK
;         PER <parm>
;
;       Note that although the execution bank is pushed twice in the
;       above sequence, the extra push is computationally irrelevant.
;
;  'r'  The corresponding parameter is assumed to be in the index reg-
;       isters: .X = LSW & .Y = MSW.  The index registers must be set
;       to 16-bits —— there is no check for this.  The corresponding
;       parameter is ignored, but is required for syntax reasons.  It
;       is recommended that it be $00 to make it clear that the value
;       being processed is being passed in the registers.  Although of
;       limited value, this mode may be used multiple times.  The res-
;       ulting instruction sequence will be:
;
;         PHY
;         PHX
;
;  'z'  The corresponding parameter is an address on direct page.  The
;       resulting instruction sequence will be:
;
;         PEA #0
;         PEA #<parm> & $FF
;
;       Note that the 'z' mode is not the same as the 'd' mode, despite
;       a direct-page location being referenced in both cases.  With
;       the 'd' mode, it is the content of the direct-page location that
;       is pushed.  'z' mode pushes the address of the direct-page loca-
;       tion as a 32-bit quantity.
;
;  Addressing modes are case-insensitive.
;———————————————————————————————————————————————————————————————————————————————
;
pushparm .macro ...
.tp	=@0
	.if .tp == 0
	    .error "syntax: "+@0$+" parm1 [,parm2 [,parmN]],mode1 [,mode2 [,modeN]]"
	.endif
	.if .tp @ 2
	    .error ""+@0$+": parameter & mode counts mismatch"
	.endif
.np	=.tp/2
.ix	.set .np
.nw	.set 0
	.rept .np
.m	    .= @{.ix+.np} | %00100000
	    .if .m == 'd'
         pei {@.ix}+2
         pei @.ix
.nw	.= .nw+2
	    .else
	        .if .m == 'f'
         pea #{@.ix} >> 16
         pea #{@.ix} & $ffff
.nw	.= .nw+2
	        .else
	            .if .m == 'i'
         pea #{@.ix} & $ffff
.nw	.= .nw+1
	            .else
	                .if .m == 'n'
	                    .if .def(execbank) && execbank < $0100
         pei execbank
	                    .else
         phk
         phk
	                    .endif
         per @.ix
.nw	.= .nw+2              ;bank word + PER word, counted once
	                .else
	                    .if .m == 'r'
         phy
         phx
.nw	.= .nw+2
	                    .else
	                        .if .m == 'z'
	                            .if @.ix > $ff
	                                .error ""+@0$+": parameter must be a direct-page address"
	                            .endif
         pea #0
         pea #{@.ix} & $ff
.nw	.= .nw+2
	                        .else
	                            .error ""+@0$+": mode must be 'd', 'f', 'i', 'n', 'r' or 'z'"
	                        .endif
	                    .endif
	                .endif
	            .endif
	        .endif
	    .endif
.ix	    .= .ix-1
	.endr
         pea #.nw              ;number of words pushed by the above
         .endm
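
For anyone who wants to sanity-check the word count without an assembler handy, the macro's logic is easy to mirror. The Python function below is a hypothetical model that borrows the macro's name. It follows the push sequences documented in the comment block, and the instruction strings are for reading only, not for assembling. `execbank_defined` is a stand-in for the `.def(execbank)` test.

```python
def pushparm(*args, execbank_defined=False):
    """Return (instructions, words_pushed) for a parameter/mode list."""
    if not args or len(args) % 2:
        raise ValueError("parameter & mode counts mismatch")
    n = len(args) // 2
    parms, modes = args[:n], args[n:]
    code, words = [], 0
    # Parameters are pushed right to left, so PARM1 ends up lowest.
    for parm, mode in reversed(list(zip(parms, modes))):
        mode = mode.lower()                 # modes are case-insensitive
        if mode == "d":
            code += [f"pei {parm}+2", f"pei {parm}"]
            words += 2
        elif mode == "f":
            code += [f"pea #{parm} >> 16", f"pea #{parm} & $ffff"]
            words += 2
        elif mode == "i":
            code += [f"pea #{parm} & $ffff"]
            words += 1
        elif mode == "n":
            code += ["pei execbank"] if execbank_defined else ["phk", "phk"]
            code += [f"per {parm}"]
            words += 2                      # one bank word + one PER word
        elif mode == "r":
            code += ["phy", "phx"]
            words += 2
        elif mode == "z":
            code += ["pea #0", f"pea #{parm} & $ff"]
            words += 2
        else:
            raise ValueError("mode must be 'd', 'f', 'i', 'n', 'r' or 'z'")
    code.append(f"pea #{words}")            # count word goes on last
    return code, words


code, words = pushparm("addr1", "addr2", "addr3", "f", "f", "f")
print("\n".join(code))
print("callee removes", 2 * (words + 1), "bytes")
```

For the three-pointer example in the comment block this gives a count of 6, so a function that cleans up after itself would drop 14 bytes: the six parameter words plus the count word.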

Quote:

Even with separate stacks I would argue that caller-pull is faster in most cases.

Whether one method is faster than the other is something that can be endlessly debated. My many years of writing 6502 machine code have conditioned me to consider code size over speed in most cases—I started out working with machines with very limited memory. Old habits die hard...

It is patent that having a function clean up the stack makes for smaller code as soon as more than one call to that function occurs. Many of my programs make multiple calls to the same functions, especially in the math and string manipulation libraries. So I am reaping the benefits of having only one instance of the stack cleanup code. Performance in itself is not the issue, since the cleanup has to be done somewhere, and with each function call if the stack isn’t to grow to an excessive size and step on something.

Quote:

Also, there is the option to accumulate the arguments of several calls and pull them at once.

...which is fine if you aren’t too concerned about how much stack space you are using before housekeeping occurs.

Quote:

...I decided against the approach of setting dpage to the current stack-frame. Using a non-aligned dpage costs a one-cycle penalty, making dpage-direct addressing pretty much the same as stack-relative addressing (you do get some more addressing options using dpage, though).

I have done some analysis of this, although not exhaustively. I rationalize the use of stack space for direct page because although it is likely DP will not be aligned on a page boundary when pointed to the stack (thus, as you note, costing a cycle with each fetch or store), the expressiveness of DP addressing modes, as well as the convenience of using instructions such as TRB and TSB for bit-field manipulation, often results in smaller and faster code within the function. I have concluded that that gain more than offsets the slight performance loss caused by the page-boundary-alignment penalty.

Quote:

Also, there is the overhead code for setting the dpage register.

That is a one-time penalty per function call, and mostly involves register-to-register copies, which uniformly execute in two cycles on the 816, regardless of register width. The gain of being able to use DP addressing modes seems to me to be much larger than the minuscule performance penalty needed to set up DP.

As the old adage goes, your mileage may vary.

Martin_H · Post by **Martin_H** » Thu Sep 25, 2025 10:15 pm

@BDD, I seem to recall you described using the COP instruction to call your BIOS routines. Do I remember correctly, and do you have a link to a forum discussion?

In the 80s I wrote a batch OS for a System 360 machine and I used the SVC instruction for user programs to call the BIOS. So, it sounds like a similar idea.

BigDumbDinosaur · Post by **BigDumbDinosaur** » Fri Sep 26, 2025 1:56 am

Martin_H wrote:

@BDD, I seem to recall you described using the COP instruction to call your BIOS routines. Do I remember correctly, and do you have a link to a forum discussion?

Despite your advancing age, your memory is correct.

Here’s some discussion on it. Also, this. Searching for more COP stuff turns up tons of debris that I don’t have time right now to filter.

vbc · Post by **vbc** » Fri Sep 26, 2025 5:53 pm

BigDumbDinosaur wrote:

I have done some analysis of this, although not exhaustively. I rationalize the use of stack space for direct page because although it is likely DP will not be aligned on a page boundary when pointed to the stack (thus, as you note, costing a cycle with each fetch or store), the expressiveness of DP addressing modes, as well as the convenience of using instructions such as TRB and TSB for bit-field manipulation, often results in smaller and faster code within the function. I have concluded that that gain more than offsets the slight performance loss caused by the page-boundary-alignment penalty.

Of course the idea is not to leave the dpage unused. The goal is to keep it aligned, avoid the penalty, and then use it e.g. like registers. If parameters are used a lot, or are used in a way that benefits from dpage addressing-modes, you copy them to dpage registers. For example, in a time-critical loop you would try to have most variables in dpage to get maximum speed (without the cycle penalty).

Furthermore, you can put heavily used global variables into dpage for fast access, something not possible when moving dpage around.

BigDumbDinosaur · Post by **BigDumbDinosaur** » Fri Sep 26, 2025 6:16 pm

vbc wrote:

Furthermore, you can put heavily used global variables into dpage for fast access, something not possible when moving dpage around.

For a number of reasons, I tend to be stingy with direct-page usage, probably a habit acquired long ago when the 6502 was new, RAM was expensive and there was never enough direct-page space. Global variables are placed on direct page only if they are accessed often enough to justify doing so. Most of my direct-page usage is for pointers and arithmetic workspace.

As for page-alignment of DP, I usually don’t sweat that one-cycle penalty in functions that ultimately access I/O. The little bit of extra execution time that results is usually an insignificant fraction of the time required to access I/O hardware, especially if wait-stating is involved. I’m usually more focused on minimizing overall execution time by taking advantage of direct-page addressing modes and the more-succinct code that doing so can produce.

One other thing that informs my direct-page usage is the possibility of multitasking. If each program insists on storing its working data on direct page, conflicts are going to occur, unless the operating environment is able to allocate separate direct-page space to each program. Since direct page access is forced into bank $00, space constraints could result in some programs’ direct page not being page-aligned. After all, that same memory range has to accommodate stacks (which can be much bigger than a program’s direct-page needs), hardware vectors, interrupt handlers (at least their front ends), non-indexed indirect jump vectors, etc. It can get awfully crowded in there!

gfoot · Post by **gfoot** » Fri Sep 26, 2025 6:22 pm

Thanks for this discussion, it's been interesting to follow along, and think about the pros and cons of different options. So far in practice I've just felt my way through, exploring what is possible without settling on anything, and it's good to hear different approaches. I might have to try vasm and vbcc.

It feels like a lot of the function entry boilerplate could be handled by a helper function - a pattern I've sometimes seen done on the 6502 but which doesn't seem to have much benefit there. But subtracting a certain amount from the stack, saving all of the registers and arranging for them to be restored on exit feels very reusable, the only variable needed to pass in is the amount to subtract; and the epilogue can then just be a matter of calling "RTS" and letting the wrapper tidy everything up correctly.

That said, I feel like memory is not as precious in this architecture, so perhaps saving 10-20 bytes (?) per function isn't such a big deal.

I also wondered whether this was a good idea in principle:

Code: Select all

myfunc:
    phd     ; save DP
    pha     ; save A
    tsc     ; get SP
    pha     ; save it on the stack
    sec     ; subtract 'amount'
    sbc #amount
    tcs     ; set SP here
    pha
    pld     ; set DP here as well
    ...
    lda amount-1
    tcs     ; restore old SP value
    pla     ; restore registers
    pld
    rts     ; return

Part of the idea here is that it doesn't matter what happens to the stack pointer in the meantime, so you could for example not bother tidying up stack frames for further functions you call, if that suits your purposes, and then collect them all at once here when the function exits. Or you could make SP point somewhere else entirely if you had a reason to do that. Again I find it hard to see why you'd want to do that on the 65816 because you might as well just allocate more memory for the stack in the first place; I know early ARM calling conventions did have some restriction on stack use though and a SWI call (I think) to ask the OS to give you a new stack because you'd run out.

BigDumbDinosaur · Post by **BigDumbDinosaur** » Fri Sep 26, 2025 7:39 pm

gfoot wrote:

It feels like a lot of the function entry boilerplate could be handled by a helper function - a pattern I've sometimes seen done on the 6502 but which doesn't seem to have much benefit there. But subtracting a certain amount from the stack pointer, saving all of the registers and arranging for them to be restored on exit feels very reusable. The only variable that needs to be passed in is the amount to subtract, and the epilogue can then just be a matter of executing "RTS" and letting the wrapper tidy everything up correctly.

Saving and restoring the registers comes down to how important function transparency is to the overall program. Most of my library functions fully preserve machine state, and only change registers as needed to report processing results. Functions typically receive pointers as entry parameters (via the stack) and use those pointers to access and/or change the data being processed.

I tend to go for transparency in an effort to maintain predictable behavior in the program that is calling library functions. If I can safely assume the overall execution environment will be maintained with each function call, I can concentrate on making the mainline code get the central job done, and not have to constantly think about accommodating variations to the environment caused by calling a function...other than returned results. How much that approach affects efficiency and performance is hard to quantify, but it does tend to make my programming time more productive.

The decision on how to handle the stack within a function is strongly influenced by how the function utilizes the stack. In a function that always allocates a fixed amount of stack space and then releases it on exit, stashing the stack pointer somewhere and later restoring it might suffice. The problem with doing so is the function may lose recursivity.

If you are shooting for recursivity, you can’t store the current stack pointer on the stack, because as soon as you arbitrarily change SP, you can no longer access the old SP value. So you have to store the old SP somewhere in RAM. However, unless a function can magically change that “somewhere in RAM” with each recursive call, the new-old SP will overwrite the old-old SP and you will be SOL. Hence the reason I developed the entry and exit procedures I use. Most of my library functions are completely self-maintaining of stack usage and can be recursed as many times as needed—at least until a collision occurs between the stack and something else in bank $00.
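To make the overwrite concrete, here is a minimal sketch, not taken from any of the libraries discussed. OLDSP and its address are hypothetical, 16-bit index registers in native mode are assumed, and the test that would normally end the recursion is left out:

```asm
oldsp    =$2000                ;hypothetical fixed RAM location for the SP copy
;
func     tsx                   ;X = SP on entry
         stx oldsp             ;outer call saves its SP here...
;
;	(workspace allocation & processing would go here)
;
         jsr func              ;...& the inner call overwrites it with a lower SP
;
         ldx oldsp             ;outer call gets the inner call's SP back, not its own
         txs
         rts                   ;so this RTS pulls the wrong return address
```

The inner call returns correctly, because OLDSP still holds its own value at that point. It is the outer call that fails, since the only copy of its entry SP is gone.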

Quote:

That said, I feel like memory is not as precious in this architecture, so perhaps saving 10-20 bytes (?) per function isn't such a big deal.

Memory is always precious in the sense there is a finite amount of it. However, you have to consider the overall picture, especially in separating local data from global data. The technique of allocating workspace on the stack and later getting rid of it may be used to enforce variable scope. The 65C816’s design, especially in being able to temporarily “marry” direct page to the stack, strongly encourages this programming model.

Quote:

I also wondered whether this was a good idea in principle:

Code: Select all

myfunc:
    phd     ; save DP
    pha     ; save A
    tsc     ; get SP
    pha     ; save it on the stack
    sec     ; subtract 'amount'
    sbc #amount
    tcs     ; set SP here
    pha
    pld     ; set DP here as well
    ...
    lda amount-1
    tcs     ; restore old SP value
    pla     ; restore registers
    pld
    rts     ; return

What determines AMOUNT? At one point, you are using it in immediate-mode addressing, which implies that AMOUNT is an assembly-time constant. Later, you are referring to it with absolute addressing, which implies it is now a location in RAM. It’s not making sense to me. If AMOUNT is indeed an assembly-time constant, loading it into the accumulator and then copying it to SP will always set the latter to the same value. However, SP is very much a dynamic register and there is no telling what will be in it when the function call is made.

As for saving SP on the stack, you’ve added two extra stack accesses of dubious value—and don’t forget a pull is more expensive than a push.

In your next paragraph, you explain why the above is not the best way to go about it.

Quote:

Part of the idea here is that it doesn't matter what happens to the stack pointer in the meantime, so you could for example not bother tidying up stack frames for further functions you call, if that suits your purposes, and then collect them all at once here when the function exits. Or you could make SP point somewhere else entirely if you had a reason to do that. Again I find it hard to see why you'd want to do that on the 65816 because you might as well just allocate more memory for the stack in the first place

The raison d'être of the 16-bit stack pointer is to facilitate use of the stack as ephemeral workspace. In terms of execution speed, there is usually little-to-no penalty in allocating a large amount of stack space compared to allocating a small amount. Either way, the same instructions are involved, and it is even possible to dynamically change the amount of allocation that occurs from one function call to the next.

Quote:

I know early ARM calling conventions did have some restriction on stack use though and a SWI call (I think) to ask the OS to give you a new stack because you'd run out.

With the 65C816, you only run out of stack space when SP collides with some other bank $00 structure. If your system has extended memory, you can run all of your programs, even the operating system, in an extended bank, and use almost all of bank $00 for direct page and stack space.

gfoot · Post by **gfoot** » Fri Sep 26, 2025 11:06 pm

BigDumbDinosaur wrote:

I tend to go for transparency in an effort to maintain predictable behavior in the program that is calling library functions. If I can safely assume the overall execution environment will be maintained with each function call, I can concentrate on making the mainline code get the central job done, and not have to constantly think about accommodating variations to the environment caused by calling a function...other than returned results. How much that approach affects efficiency and performance is hard to quantify, but it does tend to make my programming time more productive.

Yes, I see the appeal. I've already tripped over a few cases where bugs in code have led to incorrect M/X states after a function call, with effects that can be subtle or not so subtle; saving and restoring the flags would have made them less likely.

What I meant here though was actually using another subroutine to do the saving, i.e.:

Code: Select all

myfunc:
    pea amount_of_local_storage    ; or something similar
    jsr prologue

"prologue" would save all the registers, reserve the right amount of storage space, and swap the return addresses around a little so that after it returns to "myfunc", the next thing up the stack is the address of an "epilogue" routine - so that when "myfunc" is finished, it just uses RTS to return and "epilogue" tidies up the stack and returns to the caller. It'd be less speed-efficient I expect, but it would reduce the boilerplate in the actual function prologue to just these six bytes.

Quote:

What determines AMOUNT? At one point, you are using it in immediate-mode addressing, which implies that AMOUNT is an assembly-time constant. Later, you are referring to it with absolute addressing, which implies it is now a location in RAM. It’s not making sense to me. If AMOUNT is indeed an assembly-time constant, loading it into the accumulator and then copying it to SP will always set the latter to the same value. However, SP is very much a dynamic register and there is no telling what will be in it when the function call is made.

I might have written it wrongly, I'm new to the 65816 and actually find the syntax confusing - but the idea was that "amount" would be the amount of space to reserve on the stack for local variables, which gets subtracted from SP as an immediate - but then, as DP is now pointed at the bottom of that block, we can use DP-relative addressing to read back that saved stack value later on - at offset "amount-1", the top two bytes of the local variable space.
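The layout described above can be traced through with concrete numbers. This is only a sketch of that idea: the value of "amount" is made up, S stands for whatever value TSC returns, 16-bit registers in native mode are assumed, and the assembler is assumed to pick direct-page addressing for an operand below $100. TCD is substituted for the PHA/PLD pair, which has the same effect on DP.

```asm
amount = 8      ; hypothetical: 6 bytes of locals + 2 for the saved SP

myfunc:
    phd         ; save DP
    pha         ; save A
    tsc         ; A = SP, call this value S
    pha         ; saved SP lands at S-1 (low byte) and S (high byte)
    sec
    sbc #amount ; A = S-amount
    tcs         ; SP = S-amount
    tcd         ; DP = S-amount

    ; DP offset 0                  next free stack byte, overwritten by any push
    ; DP offsets 1 to amount-2     local variables (S-7 to S-2 here)
    ; DP offsets amount-1, amount  saved SP (S-1 and S)

    lda amount-1 ; direct-page read of the saved SP: A = S
    tcs         ; SP = S, whatever happened to it in between
    pla         ; restore A
    pld         ; restore DP
    rts
```

Because DP and SP start out equal, offset 0 is not safe for locals: any push, including one made by an interrupt, writes there.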

BigDumbDinosaur · Post by **BigDumbDinosaur** » Sat Sep 27, 2025 10:23 am

gfoot wrote:

BigDumbDinosaur wrote:

I tend to go for transparency in an effort to maintain predictable behavior in the program that is calling library functions...

Yes, I see the appeal. I've already tripped over a few cases where bugs in code have led to incorrect M/X states after a function call, with effects that can be subtle or not so subtle; saving and restoring the flags would have made them less likely.

I decided relatively early on that the cost of preserving state was less than the cost of chasing obscure bugs. In the past when coaching people on writing assembly language, I have always reminded them that it isn’t the cost of an instruction you should consider when writing a routine. What you should be looking at is how often you have to bear that cost. Minimizing the number of instructions executed and the time required to execute them is fine, but mostly matters in repetitive situations.

When you save state in a function, you only do it once per function call. If you only call that function once in a while, there is little to be gained by obsessing over how many cycles were consumed in saving and restoring state—use that time for something more productive.

On the other hand, there are some functions in my libraries in which the only state I save is register widths (defensive programming)—this is mostly the case with math functions, which often get called inside of loops and thus need to be quick-acting.

As a typical 65C816 system has more RAM to work with, you can usually be a little less obsessive about saving a few program bytes. Oftentimes, taking advantage of the 816’s greater capabilities actually makes for smaller code. For example, incrementing a word with the 816 takes one instruction—doing the same with the 65C02 requires two instructions and a branch. In some situations, you can trade bytes for speed by unrolling loops, since the 816 is usually not as memory-constrained as a 65C02 might be. My 64-bit integer division function is one such example:

Code: Select all

.init020 ldx #s_blword         ;main loop counter
         clc
;
;—————————
;Main Loop
;—————————
;
.main    =*
.o	.set 0
	.rept s_ifacw*2
         rol faca+.o
.o	    .= .o+s_word
	.endr
;
;	—-—-—-—-—-—-—-—-—-—-
;	Do Trial Subtraction
;	—-—-—-—-—-—-—-—-—-—-
;
         sec
.o	.set 0
	.rept s_ovrflo/s_word
         lda faco+.o
         sbc facb+.o
         sta facc+.o
.o	    .= .o+s_word
	.endr
         bcc .main010          ;discard trial difference
;
;	—-—-—-—-—-—-—-—-—-—-—-—-—-—-—-—-—
;	Copy Trial Difference to Overflow
;	—-—-—-—-—-—-—-—-—-—-—-—-—-—-—-—-—
;
.o	.set 0
	.rept s_ifacw
         lda facc+.o
         sta faco+.o
.o	    .= .o+s_word
	.endr
;
.main010 dex
         bne .main             ;next
;
;	—-—-—-—-—-—-—-—-—-
;	Handle Final Carry
;	—-—-—-—-—-—-—-—-—-
;
.o	.set 0
	.rept s_ifacw
         rol faca+.o
.o	    .= .o+s_word
	.endr

     ...etc...

Yes, that does produce a bunch of instructions, but the inner loops are, instead, inline code, which is significantly faster than smaller code with a loop counter and branching. In fact, using the Kowalski simulator’s cycle counter, I saw about a 25 percent reduction in execution time with the unrolled loops.
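For comparison, the first unrolled block could be written as a conventional loop along the following lines. This is only a sketch, not code from the library: it assumes S_WORD is 2, 16-bit index registers, and that the main loop counter has been moved out of .X, since ROL offers no Y-indexed addressing mode.

```asm
         ldx !#0               ;word offset into faca
         ldy !#s_ifacw*2       ;number of words to rotate
.rot     rol faca,x            ;rotate one word through carry
         inx                   ;INX & DEY leave carry untouched...
         inx
         dey
         bne .rot              ;...so carry survives into the next ROL
```

Each pass pays for two INX, a DEY and a BNE on top of the ROL, and the indexed ROL is itself a cycle slower than the non-indexed form. That overhead is what the unrolled version avoids.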

Quote:

What I meant here though was actually using another subroutine to do the saving, i.e.:

Code: Select all

myfunc:
    pea amount_of_local_storage    ; or something similar
    jsr prologue

"prologue" would save all the registers, reserve the right amount of storage space, and swap the return addresses around a little so that after it returns to "myfunc", the next thing up the stack is the address of an "epilogue" routine - so that when "myfunc" is finished, it just uses RTS to return and "epilogue" tidies up the stack and returns to the caller. It'd be less speed-efficient I expect, but it would reduce the boilerplate in the actual function prologue to just these six bytes.

My opinion is you are needlessly complicating things in an effort to save a few bytes. Also, you would actually be increasing the total clock cycles consumed in setup due to the JSR - RTS sequence needed to call PROLOGUE: 12 cycles to be exact (six each for JSR and RTS). Plus the PEA will cost you another five cycles.

The sequence...

Code: Select all

         phd                   ;(4)
         sec                   ;(2)
         tsc                   ;(2)
         sbc !#amount          ;(3) (16-bit subtraction)
         tcs                   ;(2)
         inc a                 ;(2)
         tcd                   ;(2) (direct page at SP+1)

...has nine bytes total and consumes 17 cycles. That’s the same 17 cycles consumed by the PEA and the JSR - RTS needed just to call PROLOGUE, and PROLOGUE itself will eat up even more cycles as it manipulates the stack. This is definitely a case where inline code is better, even though it, or something like it, will be repeated in other functions.

If you want, you could also copy SP to .X and then, after workspace has been allocated and DP pointed to it, write .X to $00 (the first location in the relocated direct page). In that case, the allocation has to grow by an extra word (hence AMOUNT+2) to provide room for the SP copy. For example:

Code: Select all

         phd                   ;(4)
         sec                   ;(2)
         tsc                   ;(2)
         tsx                   ;(2) SP copy...see below
         sbc !#amount+2        ;(3) (16-bit subtraction)
         tcs                   ;(2)
         inc a                 ;(2)
         tcd                   ;(2)
         stx $00               ;(4,5) save unmodified SP

To reverse the above...

Code: Select all

         ldx $00               ;(4,5) get unmodified SP
         txs                   ;(2)
         pld                   ;(5)

As it is likely DP will not be page-aligned, the accesses to $00 will most often take five cycles.
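As a rough sketch of how the two sequences bracket a function body: this assumes the function is called with JSR and that registers are 16 bits wide. S_LOCALS and the offset names are illustrative, not taken from an actual library.

```asm
s_locals =4                    ;hypothetical: bytes of local workspace
;
;	direct-page offsets once DP has been set
;
savsp    =0                    ;SP copy
local1   =savsp+2              ;first local word
local2   =local1+2             ;second local word
parm1    =s_locals+6           ;last-pushed parameter word, above the
;                               saved DP (2) & return address (2)
;
myfunc   phd                   ;save caller's DP
         sec
         tsc
         tsx                   ;X = SP after PHD
         sbc !#s_locals+2      ;workspace + room for SP copy
         tcs
         inc a
         tcd                   ;DP = SP+1
         stx savsp             ;save unmodified SP
;
         lda parm1             ;fetch parameter via direct page
         sta local1            ;work on a local copy
;
         ldx savsp             ;get unmodified SP
         txs                   ;discard workspace
         pld                   ;restore caller's DP
         rts
```

With S_LOCALS set to 4, offsets $00-$01 hold the SP copy, $02-$05 the locals, $06-$07 the saved DP, $08-$09 the return address, and the parameter starts at $0A. In this sketch the caller remains responsible for removing its parameters from the stack.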

Aside from execution time, I think you are going to quickly encounter stack management issues in PROLOGUE—I foresee a bunch of convolution, since PROLOGUE will be working lower in the stack than the function that called it and will have to deal with that pesky return address that was pushed when the caller did JSR PROLOGUE.

Quote:

What determines AMOUNT? At one point, you are using it in immediate-mode addressing, which implies that AMOUNT is an assembly-time constant. Later, you are referring to it with absolute addressing, which implies it is now a location in RAM. It’s not making sense to me...

I might have written it wrongly, I'm new to the 65816 and actually find the syntax confusing...

With the exception of the “long” addressing modes and differentiating between eight- and 16-bit immediate-mode operands, syntax is the same as with the 65C02. Exactly how you indicate that an immediate-mode value is to be generated as a 16-bit operand depends on the assembler you are using. For example, in the Kowalski assembler, the syntax is !#, with the ! telling the assembler the operand is “wide.” For example, LDA !#$12 would result in A9 12 00 being assembled. Absent the !, A9 12 would be assembled. The same syntax is used in Supermon 816.

Other assemblers may have pseudo-ops that tell the assembler how to handle immediate-mode operands. For example, the WDC assembler uses LONGA ON|OFF and LONGI ON|OFF to affect the accumulator and the index registers, respectively. Those are assembler directives—they don’t generate any REP or SEP instructions. My opinion is that is a clunky way to handle things—more typing, for one thing. The !# syntax makes things easier—I don’t want to have to keep track of the state of an invisible assembler flag.
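Either way, the assembler only decides how many operand bytes to emit. Switching the processor's register width is still the programmer's job. A minimal sketch in the !# syntax described above:

```asm
         rep #$20              ;clear m: accumulator is now 16 bits
         lda !#$12             ;assembles as A9 12 00
         sep #$20              ;set m: accumulator is now 8 bits
         lda #$12              ;assembles as A9 12
```

If the operand size and the m flag ever disagree, the processor will treat the extra (or missing) byte as part of the next instruction.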

vbc · Post by **vbc** » Sat Sep 27, 2025 10:44 am

BigDumbDinosaur wrote:

For a number of reasons, I tend to be stingy with direct-page usage, probably a habit acquired long ago when the 6502 was new, RAM was expensive and there was never enough direct-page space. Global variables are placed on direct page only if they are accessed often enough to justify doing so.

Of course. But there often are such global variables and shifting the dpage eliminates that option. Note that I am just discussing what will be faster, I am not trying to convince you to change your ABI. There are valid reasons apart from speed, especially for limited use-case ABIs.

Quote:

One other thing that informs my direct-page usage is the possibility of multitasking. If each program insists on storing its working data on direct page, conflicts are going to occur, unless the operating environment is able to allocate separate direct-page space to each program.

Yes, a multitasking system will have to take care to provide suitable memory spaces to the tasks. Due to the architecture of the 65816, there will usually be several memory spaces for tasks, and the OS has to take care to make good use of bank 0. For example, you can take a look at how the Apple IIGS OS handles this.

If you are dealing with a very large number of tasks (unlikely on a 65816) you could swap out used dpage parts of a task while it is not running (just like the register set on a register-based machine).
