IMPLEMENTING AN OPERATING SYSTEM API IN A 65C816 SYSTEM

BigDumbDinosaur · Post by **BigDumbDinosaur** » Tue Jan 15, 2019 10:59 pm

At Ed’s suggestion, I decided to start a separate topic here to discuss methods of implementing an Application Programming Interface (API) to an operating system running on the 65C816 in native mode. By way of explanation for those who aren’t up to speed on operating system internals, the theory behind an API is that the application programmer is given access to useful operating system kernel functions, such as reading/writing a file or driving a display, using a stable and well-defined methodology. Each API call has a formalized method of accepting parameters and a formalized method of returning results.

In the 6502 universe, the traditional API method has been to treat the kernel as a collection of subroutines. Commodore (CBM), in their eight-bit machines, formalized this concept into the "kernal" jump table, which was guaranteed to stay in the same location despite revisions to the kernel or the introduction of a new machine. For example, if the programmer wants to output a byte to the current output device, he/she can load the accumulator with the byte and execute JSR $FFD2, knowing it will be a valid way of outputting a byte on all CBM eight bit machines.

In a system with no more than 64KB of address space and no hardware memory protection, treating the kernel as a collection of subroutines is practical and efficient, especially if the kernel is running in ROM. In a 65C816 native mode environment, the subroutine approach entails some new considerations. As the address space of a 65C816 machine can substantially exceed 64KB and as programs are effectively “sandboxed” into 64KB boundaries (banks), the use of a subroutine call to access the kernel API becomes potentially troublesome. No longer will a simple JSR <api_addr> suffice, as a JSR target is limited to the bank in which the JSR instruction is located. In practical terms, if a 65C816 kernel is running in, say, bank $00 and the program wishing to make an API call is located in, say, bank $0C, JSR is useless.

Bill Mensch anticipated this limitation in the 65C816 and created the JSL (Jump to Subroutine Long) instruction, which uses a 24-bit address to reach its target. A companion instruction, RTL (ReTurn Long) is used to return from a subroutine called with JSL. RTL is necessary because JSL pushes a 24-bit return address to the stack—RTS would only pull 16 of those 24 bits, causing a major malfunction.
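As a rough sketch of the cross-bank call just described, the following assumes a hypothetical kernel routine named putch at $00F000 and a caller assembled in bank $0C. Both addresses and both labels are made up for illustration, and the `*=` origin syntax varies between assemblers.

```asm
         *=$00f000             ;hypothetical kernel entry point in bank $00
putch    nop                   ;(body of routine omitted)
         rtl                   ;pulls 16 bit PC, then PB
;
;    caller, running in another bank...
;
         *=$0c2000             ;hypothetical load address in bank $0C
caller   sep #%00100000        ;8 bit accumulator
         lda #'A'              ;byte to output
         jsl putch             ;pushes PB, then 16 bit return address
         nop                   ;execution resumes here after RTL
```

Ending putch with RTS instead of RTL would leave the pushed bank byte on the stack and resume in the wrong bank.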

Another method of calling the API from a remote bank is to use a software interrupt, which is how just about all present-day operating systems give a user-space program access to the kernel. The 65C816 has two software interrupts: BRK and COP. In native mode, BRK is independently vectored from an IRQ. Both are two-byte instructions, the second byte commonly referred to as the “signature.” Both also behave in identical fashion in native mode. Upon executing BRK or COP, the 816 pushes PB (program bank), PC (program counter) and SR (status register), sets the I-bit in SR, clears the D-bit in SR, loads PB with $00 and then jumps through the appropriate hardware vector, which is implicitly in bank $00. When the interrupt service routine (ISR) has been completed, an RTI will cause the ’816 to pull SR, PC and PB, and execution will resume at the address formed from PB:PC (see here for a full discussion of the 65C816’s interrupt-processing behavior).

A useful feature of any computer expected to support multiple processes is hardware protection. In general, the protection allows the system to run in either user or kernel mode. By way of explanation, once the kernel has been loaded into memory, access to that memory, as well as memory used by the kernel for data, buffers, etc., would be made off-limits to user-space programs; the kernel itself would also be run in write-protected memory. That way, an errant write instruction in a user program wouldn’t mangle the kernel or its data. However, such arrangements would preclude the use of JSL to access a 65C816 kernel, since hardware protection would raise an exception when the user-space program attempted to enter kernel space via the API jump table. There is a solution, though: the vector pull (VPB) hardware output.

In the 65C816 running in native mode, VPB is driven low during cycles seven and eight of any interrupt sequence, which is when the MPU is fetching the interrupt vector address. It is conceivable that the glue logic could be arranged so the system switches from user mode to kernel mode when an API is called. The resulting relaxed protection would allow access to kernel data structures, as well as hardware registers. Upon exit from the kernel a write to a register in the hardware management unit (an abstract device implemented in a CPLD or FPGA) would return the system to user mode.

In my view, there are clear advantages to using a software interrupt for API access instead of a subroutine call. The handling of hardware protection is one of them. The bank-agnostic nature of this calling method is another. The automatic handling of SR, important in any API in which SR is used to return “success” or “fail” to the caller, is yet another. Furthermore, user programs need have no knowledge of a kernel API jump table, which means if the kernel is relocated in memory, the API call mechanism will continue to work without change. This would not be the case if JSL were the calling method. Last but not necessarily least, a software interrupt is a two-byte instruction; JSL is four bytes.

Of the two software interrupts that are available, it’s my opinion COP should be used for calling the kernel API. BRK is the “traditional” method by which a program is halted for debugging purposes, a “tradition” that I believe should be maintained. As BRK and COP are separately vectored, maintaining this usage is not a problem.
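To illustrate the separate vectoring, here is a minimal sketch of the native mode vector entries for the two software interrupts. The vector addresses ($00FFE4 for COP, $00FFE6 for BRK) are the 65C816's own. The handler labels, their location at $00E000 and the `.word` directive are assumptions for the sake of the example.

```asm
         *=$00e000             ;hypothetical handler location in bank $00
icop     rti                   ;kernel API handler (stub)
ibrk     rti                   ;debugger entry (stub)
;
;    native mode software interrupt vectors...
;
         *=$00ffe4
         .word icop            ;$00FFE4: COP vector
         .word ibrk            ;$00FFE6: BRK vector
```

With this arrangement a user program reaches the kernel with a two-byte COP instruction from any bank, while BRK remains free for debugging.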

Okay, I’ve started the topic. Let the arguing...er...discussion begin!

whartung · Post by **whartung** » Wed Jan 16, 2019 12:33 am

All good info BDD.

As I mentioned in the other thread, the Apple IIGS uses the JSL approach. The only downside of the COP approach is that the kernel cannot trivially use the status register (SR) to indicate success or failure. It has to do stack shenanigans to update the SR saved on the stack to return any values. JSL doesn't suffer that.

If your system is "always" in native mode, the saving of the SR may be mostly wasted time. Simply, the software would be responsible for ensuring that the accumulator and index modes are properly set (to whatever values they need to be) rather than having the kernel routines constantly do this.

Most of this is moot, frankly, because most of the services being offered tend to be coarse, slow services (a few cycles here or there for SR juggling are crushed while waiting for the UART to respond, or the disk drive to load a sector, for example).

whartung · Post by **whartung** » Wed Jan 16, 2019 12:38 am

Pulling from the other thread.

Chromatix wrote:

In which case, how *do* you handle more than one routine

Simply, you push the "function family" as a parameter.

On the IIGS, they have different entry points for different functions. The Toolbox (memory, graphics, etc.) has one entry point, GS/OS (disk/file system) has another. You push the toolbox id of the routine you want on to the stack (along with the parameters), and then invoke the toolbox -- which then dispatches appropriately.

Since the IIGS is a native system, it's a 16-bit toolbox id, which is more than ample.

Even with a byte, that's 250+ routines right there, one of which can offer extended services (1 byte to identify the extended service, another to dispatch within that).

The same thing can happen if you use the signature byte.
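For the signature byte variant, the handler has to fetch the byte from the caller's bank, since it sits just below the return address that COP pushed. A minimal sketch, assuming a hypothetical kernel direct page at kerndp with a 3-byte scratch pointer sigptr in it (neither name comes from an existing system):

```asm
kerndp   =$0200                ;hypothetical kernel direct page (bank $00)
sigptr   =$00                  ;hypothetical 3 byte pointer within it
;
icop     rep #%00110000        ;16 bit registers
         pha                   ;save .A
         phd                   ;save caller's DP
         lda #kerndp
         tcd                   ;switch to kernel direct page
;
;    stack is now: DP 1-2, .A 3-4, SR 5, PC 6-7, PB 8
;
         lda 6,s               ;stacked PC (address after signature)
         dec a                 ;back up to signature byte
         sta sigptr            ;pointer bits 0-15
         sep #%00100000        ;8 bit accumulator
         lda 8,s               ;stacked PB (caller's bank)
         sta sigptr+2          ;pointer bits 16-23
         lda [sigptr]          ;signature byte now in .A
```

From here the signature can be range checked and used as a dispatch index. IRQs are still masked at this point, so the scratch pointer can't be disturbed by another interrupt.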

commodorejohn · Post by **commodorejohn** » Wed Jan 16, 2019 12:49 am

I'm just waiting for one of the resident Forth junkies to ask what on earth you'd ever need an operating system for.

Dr Jefyll · Post by **Dr Jefyll** » Wed Jan 16, 2019 3:44 am

whartung wrote:

The only downside of the COP approach is that the kernel cannot trivially use the status register (SR) to indicate success or failure. It has to do stack shenanigans to update the SR saved on the stack to return any values. JSL doesn't suffer that.

I don't necessarily advocate COP or JSL. But whartung mentioned a downside to COP, and I'm chiming in to say the downside has a workaround. IOW the caller doesn't need stack shenanigans to learn about success or failure.

Consider: what's the difference between RTI and the 2-instruction sequence PLP RTL? The RTI takes you to the return address, whereas PLP RTL takes you to the return address+1. So, in the latter case the byte at the return address doesn't get executed.

The workaround works by putting a CLC (for example) before the COP and a SEC immediately after -- ie, the SEC is in the spot that might not get executed. For its part, the called routine chooses to return either via RTI or via PLP RTL, based on its success/failure.

I hope that's clear. After the SEC you'd code a BCC or BCS. Details could vary, but the key point is how the caller code has one byte whose execution is conditional, as determined by the callee.
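A sketch of how the two sides might look. The signature value $01 and the labels are hypothetical, the COP operand syntax differs between assemblers, and the kernel exits assume any registers the handler pushed have already been pulled, so the stack holds only SR, PC and PB.

```asm
;    caller...
;
caller   clc                   ;carry clear going in
         cop #$01              ;hypothetical signature byte
         sec                   ;skipped if kernel exits with PLP/RTL
         bcs failed            ;carry set: kernel exited with RTI
         rts                   ;success path
failed   rts                   ;failure path, carry set
;
;    kernel exits...
;
kfail    rti                   ;resumes at the SEC
;
kpass    plp                   ;restore caller's SR
         rtl                   ;pulls PC & PB, resumes 1 byte past the SEC
```

Either way the caller's saved SR comes back with carry clear, so it is the SEC alone that distinguishes the two outcomes.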

-- Jeff

BigEd · Post by **BigEd** » Wed Jan 16, 2019 6:03 am

Nice idea Jeff, and nice intro post BDD. The point about supporting possibly more sophisticated hardware and software, such as a kernel-mode address map, when using COP, is an important one.

So, I think we have three ideas for an OS ABI on an '816:

JSL to one or one of several nailed-down addresses
COP with signature byte to select one of several services
COP with an ignored signature byte and with a service selector in a register

It's normal enough to return results in registers, in a parameter block assembled by the caller, to well-known addresses, on the stack, or in the status flags. Or some mixture. As Jeff notes, there's more than one way to return status in the flags even when using COP.

I think it's worth noting that various OSes have various methods and widths of ABI: CP/M's BDOS has a few dozen services all reached by calling a single well-known address in low memory. MS-DOS seems to have a dozen or so software interrupts as an interface, but with INT 21h providing several dozen disk and file related calls. The Kernal used by Commodore's PET, VIC-20 and C64 started with just over a dozen well-known high addresses, some of which vector through well-known low addresses. Acorn's MOS uses a couple of dozen well-known high addresses, some vectored through low memory, and with two of them providing very many services according to the value passed in A. Linux and the BSDs seem to use a single software interrupt: INT 80h with a service selector in the EAX register.

If an '816 system uses COP, it's most like those last, INT 80h, and the natural place for a service selector is A. This might seem odd to a Kernal user, a bit less odd to an Acorn user, and not odd to a unix user.

BigDumbDinosaur · Post by **BigDumbDinosaur** » Wed Jan 16, 2019 6:45 am

Diddling SR to indicate return status is really not much of a chore. If the API processing front end is properly structured, accessing the stack copy of SR is no different than accessing any other register's stack copy. In the code that I have tested on POC V1.1, the API processing front end is as follows:

Code:

;KERNEL API FRONT END — EXECUTED IN RESPONSE TO A COP INSTRUCTION
;
;    ——————————————————————————————————————————————————————————————————
;    .A must be loaded with the 8 bit API index prior to executing COP.
;    ——————————————————————————————————————————————————————————————————
;
icop     rep #%00110000        ;16 bit registers
         pha                   ;preserve MPU state
         phx
         phy
         cli                   ;restart IRQs
         and #$00FF            ;mask noise in .B (16 bit mask)
         beq icop01            ;API index cannot be zero
;
         dec a                 ;zero-align API index
         cmp #maxapi           ;index in range (16 bit comparison)?
         bcs icop01            ;no, error
;
         asl a                 ;double API index for...
         tax                   ;API dispatch table offset
         jmp (apidptab,x)      ;run appropriate code
;
;
;    invalid API index error processing...
;
icop01   ...handle invalid API index...

At the point where IRQs are restarted the stack "picture" will look like the following:

Code:

;    register stack frame...
;
reg_y    =1                    ;16 bit .Y
reg_x    =reg_y+2              ;16 bit .X
reg_c    =reg_x+2              ;16 bit .A
reg_sr   =reg_c+2              ;8 bit SR
reg_pc   =reg_sr+1             ;16 bit PC
reg_pb   =reg_pc+2             ;8 bit PB
s_regsf  =reg_pb+1-reg_y       ;register stack frame size in bytes

The above are positive offsets relative to the stack pointer. There will also likely be other elements higher in the stack, but they aren't relevant for now.

With the above established, the programmer can readily modify any register and thus pass a value back to the user-space program when the MPU state is restored. So, if I want to set carry to tell the caller there was an error I can:

Code:

         lda reg_sr,s          ;stack copy of SR
         ora #%00000001        ;set carry bit &...
         sta reg_sr,s          ;rewrite

The above assumes the m bit in SR is 1 (8-bit accumulator). No other shenanigans are needed. The same technique would make it possible to pass a value back in any register, or even reroute execution of the user-space program, the latter by overwriting the word at offset reg_pc with a different address.
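Putting those pieces together, an error exit from the API back end might be sketched as follows. It reuses the reg_c and reg_sr offsets from the stack frame definitions above. The label apierr and the error code e_badapi are hypothetical, and the assembler is assumed to track register widths for immediate operands.

```asm
e_badapi =1                    ;hypothetical error code
;
apierr   rep #%00110000        ;16 bit registers
         lda #e_badapi         ;error code...
         sta reg_c,s           ;overwrites stack copy of .A
         sep #%00100000        ;8 bit accumulator
         lda reg_sr,s          ;stack copy of SR
         ora #%00000001        ;set carry bit &...
         sta reg_sr,s          ;rewrite
         rep #%00110000        ;16 bit registers, matching the pushes
         ply                   ;restore MPU state
         plx
         pla                   ;.A now holds the error code
         rti                   ;pulls SR, PC & PB
```

The caller sees carry set and the error code in .A, with its original register widths back in effect because RTI reloads the m and x bits from the stacked SR.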

whartung wrote:

On the IIGS, they have different entry points for different functions. The Toolbox (memory, graphics, etc.) has one entry point, GS/OS (disk/file system) has another. You push the toolbox id of the routine you want on to the stack (along with the parameters), and then invoke the toolbox -- which then dispatches appropriately.

That seems to be unnecessarily cumbersome when use of a software interrupt would require only an API index, and the caller wouldn't even have to know where in RAM the kernel entry point is located.

Quote:

If your system is "always" in native mode, the saving of the SR may be mostly wasted time. Simply, the software would be responsible for ensuring that the accumulator and index modes are properly set (to whatever values they need to be) rather than having the kernel routines constantly do this.

Having studied the design of the UNIX kernel in depth, I'd have to say that the code bits that are executed in the API front- and back-ends are trivial compared to the low-level code that does the actual work. You yourself noted this, e.g., waiting on a disk sector to load.

As for saving machine state, any operating system must afford generality to the user-space so programs running in the latter aren't unnecessarily constrained by limitations and/or fallacious assumptions built into the kernel. In that vein, you cannot make assumptions about what was going on at the time of an API call. Since the API front-end won't know whether the registers are set to 8 or 16 bits you will need to save SR so you can put things back the way they were at the time of the API call. If COP is used to call an API the preservation of SR is automatic.

Also, if the operating system is able to support preemptive multitasking, it is imperative that machine state be fully preserved upon entry to the kernel. The reason is it is the kernel that will preempt a process when it is time to do so. Preemption will occur during the final stages of interrupt processing, whether that interrupt is hardware or the software interrupt used to access an API (in simplistic terms, the task switch comes by copying machine state from the stack to user-space storage and then copying the saved machine state of the preempting process from user-space storage to the stack; when interrupt return occurs, the new task will be running). If the API front end hasn't saved machine state and preemption occurs, the interrupted process will not be able to restart when it is its turn to run again.

commodorejohn wrote:

I'm just waiting for one of the resident Forth junkies to ask what on earth you'd ever need an operating system for.

Not to rile up any Forth junkies, but that is very much a niche situation. The overwhelming majority of computers in use today have an operating system, and Forth can certainly avail itself of operating system services, such as file access, time-of-day facilities, etc., but still be Forth.

BigEd wrote:

So, I think we have three ideas for an OS ABI on an '816:

JSL to one or one of several nailed-down addresses
COP with signature byte to select one of several services
COP with an ignored signature byte and with a service selector in a register

That's a good summary.

Quote:

It's normal enough to return results in registers, in a parameter block assembled by the caller, to well-known addresses, on the stack, or in the status flags. Or some mixture. As Jeff notes, there's more than one way to return status in the flags even when using COP.

How to pass returns to the caller is an interesting topic in itself. Using UNIX running on MC68000 hardware as an example, if I create/open a file and the operation is successful, the kernel will say so by clearing carry and loading the file descriptor (a positive integer) into register D0. If the operation fails for any reason, the kernel will say so by setting carry and loading an error code into D0. This is the basic mechanism by which much of the kernel reports back to the user-space caller. (However, note that the standard C library modifies this mechanism, reporting an error in the global variable errno and -1 in place of a file descriptor.)

In some cases, the API that has been called will produce something other than just an integer. For example, in Linux we have mkstemp(), which generates a temporary filename using randomized data to populate a template, and then creates and opens the file. One would call mkstemp() from C as follows:

Code:

char template[] = "payablesXXXXXX";   /* writable buffer; mkstemp() modifies it */
int fd = mkstemp(template);           /* -1 on failure */

template is a pointer to a filename of the form nameXXXXXX, for example payablesXXXXXX. Assuming a successful call, fd will hold a non-negative integer (the file descriptor) and the XXXXXX field in the filename will be populated with randomized characters.

In order for the filename template to be so modified, the kernel has to be able to write into user space, since that is where the filename will be ensconced. This would be accomplished by indirectly writing through the filename pointer that was pushed to the stack prior to the API call, easy enough to do with 68000 and x86 hardware, but a little more involved with the 65C816.
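One possible way to do that write on the 65C816 is stack-relative indirect indexed addressing. The sketch below rests on several assumptions: the caller pushed a 16-bit pointer to the template immediately before COP, the register stack frame is the ten-byte one shown earlier (s_regsf), the index registers are still 16 bits wide, and DB still holds the caller's data bank because the front end has not changed it. The name parm1 is made up for the example.

```asm
parm1    =s_regsf+1            ;hypothetical: caller's pointer, just above the register frame
;
         sep #%00100000        ;8 bit accumulator
         ldy #8                ;offset of first X in "payablesXXXXXX"
         lda #'Q'              ;stand-in for a randomized character
         sta (parm1,s),y       ;write through pointer into caller's data bank
```

The pointer itself is fetched from the stack in bank $00, but the store goes to DB:pointer+Y, so this only works while DB identifies the bank in which the template lives. The caller would also have to remove the parameter from the stack after the API call returns.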

BitWise · Post by **BitWise** » Wed Jan 16, 2019 10:03 am

I suspect that an efficient '816 operating system MUST use both JSL entry points and software interrupts. At least that's the conclusion I have come to in my own development work.

The cost of passing arguments to a JSL or COP accessed function is the same -- it's just a set of stack pushes.

A function accessed by a JSL to a fixed address (which probably then does a JML to the actual address) gets you into the body of the code, and an RTL brings you back. That's an overhead of 8+6+6=20 cycles to get there and back. Most '816 systems will need to have at least a small ROM at the top of bank 0 to implement the vector area, so this is the natural place to put a set of function entry points.

The fastest COP approach to access requires loading a scaled API function number (LDX #2*fn), a software interrupt (COP) and an indirect jump (JMP (table,X)) with an RTI to get back. That's a minimum of 3+8+6+7=24 cycles. More are needed if the function number is passed in A or is range checked or needs shifting to create a table index or if an initial JML is needed to transfer execution from bank 0 to the bank where the OS resides. COP also disables interrupts so the handler must re-enable them or any pending hardware interrupts will be delayed until the software interrupt returns. BDD's code probably spends at least twice the minimum number of cycles before it gets to the point of doing any useful work.

Returning results in flags is simpler from a subroutine than from an interrupt as it doesn't need any stack trickery.

So for trivial enquiry functions (like UNIX's getpid or time) a simple subroutine interface is probably twice as fast as using COP. For more complex functions the overhead of the entry becomes less significant.

If you are building a cooperative or preemptive tasking system then using software interrupts to save the running task's state definitely simplifies the coding. If all the user and bank registers are pushed to the stack then only SP needs to be stored in memory for each inactive task. So there is definitely a place for its use within an OS.
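A minimal sketch of that idea, entered from an interrupt so SR, PC and PB are already on the outgoing task's stack. The table tasksp and the variables curtask and nxttask are hypothetical bank $00 locations, the handler is assumed to run in bank $00, and choosing the next task is left out.

```asm
tasksp   =$0300                ;hypothetical table of saved 16 bit SPs
curtask  =$0340                ;hypothetical: 2 * running task number
nxttask  =$0342                ;hypothetical: 2 * next task number
;
swtask   rep #%00110000        ;16 bit registers
         pha                   ;push remaining state of outgoing task
         phx
         phy
         phd
         phb
         phk
         plb                   ;DB = $00 so the table is reachable
         ldx curtask
         tsc                   ;outgoing task's SP...
         sta tasksp,x          ;is all that needs saving
         ldx nxttask
         stx curtask
         lda tasksp,x          ;incoming task's SP
         tcs                   ;now on incoming task's stack
         plb                   ;restore incoming task's state
         pld
         ply
         plx
         pla
         rti                   ;incoming task resumes
```

A new task would need its stack pre-loaded with a frame of the same shape so the first switch into it pulls sensible values.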

I've been writing an '816 OS on and off for a while targeting the SXB boards, especially the '265 as I have my 1MB RAM extension for it. I use JSL to access functions via a set of standard JMLs which start at $F000. This upper part of the memory also contains device interrupt handlers and the power on reset code. As the SXB has Flash ROM in its upper half of bank 0 I'm putting the rest of the OS there along with an implementation of common C functions and maths utilities so they don't have to be linked into applications (when I get that far) but the OS code could be in another bank. As the '265 uses $DF00-$DFFF for hardware registers and RAM I have to split the OS either side of it. All of the RAM in bank 0 is allocated to task stacks - The SXBs will have 16 tasks with 2K of stack each. Task code and data will reside in banked RAM either as non-shared CODE+DATA (e.g. PBR == DBR) or as shared CODE with non-shared DATA (PBR != DBR).

BigEd · Post by **BigEd** » Wed Jan 16, 2019 10:56 am

One of the simple and tempting things to do to improve the '816 memory map is to hide the vectors until they are needed (using the vector pull signals) - that works well with COP but not with jumps. With the OS in low memory you can then have contiguous memory starting part way through bank 0.

It might be a case of "You Ain't Gonna Need It" but it feels attractive to me to use a software interrupt for this sort of reason: use a COP even on a simple-as-possible build, but keep using COP for later designs which are more sophisticated.

It's true that every system call gets a little more overhead in cycle count, but I'm not sure it would often matter, especially running at 14MHz when compared to the 1 or 2 MHz we're used to. If an application cares about every microsecond then that is a strong constraint on the OS design. Without those constraints, keep things simple, allow ways of extending functionality, and re-use design patterns we see elsewhere.

I think it's clear from the Kernal and CP/M approaches, which seem to use a dispatch table rather than a service ID, that the table keeps getting larger, which changes the ABI. I confess I was a bit surprised that the Acorn table is as large as it is, given the use of service numbers for OSWORD and OSBYTE.

BitWise · Post by **BitWise** » Wed Jan 16, 2019 11:22 am

BigEd wrote:

One of the simple and tempting things to do to improve the '816 memory map is to hide the vectors until they are needed (using the vector pull signals) - that works well with COP but not with jumps. With the OS in low memory you can then have contiguous memory starting part way through bank 0.

But the vectors are only two bytes each so at some point you have to have some executable code in bank 0 either in a ROM or bootstrapped into RAM prior to power on.

BigEd · Post by **BigEd** » Wed Jan 16, 2019 11:34 am

Indeed: I'm supposing there'd be a boot ROM in high memory, temporarily, and that would bootstrap into an OS in low memory. Perhaps the logical place for the OS is 0200. Thinking about it, it would reopen that perennial problem with 6502 memory maps, that the application start address wouldn't be fixed. So perhaps the logical place for the OS in an '816 system is at the top of the highest bank, with only the RAM-based redirection vectors in bank 0, perhaps in page 2. The application start address could be 0400 in bank 0.

Or, maybe I just haven't thought this through enough! Especially as an '816 OS has more space to consider multi-tasking and therefore may need some tactic for relocation... unless it places applications in individual banks.

So: where's the bulk of the OS? Bottom of bank 0, top of bank 0, in its own bank, or even in some shadow bank which isn't visible by the application?

(Just to recap, I was musing on the idea of making a clear flat memory space which runs through the high part of bank 0 into bank 1. And now I'm not so sure about that. There are aspects of the '816 which just don't like flat memory!)

BitWise · Post by **BitWise** » Wed Jan 16, 2019 12:03 pm

On the '816 only the stack is tied to bank zero. You are not as dependent on direct page when you have 16-bit index registers so many of its uses can be simply replaced with indexed addressing and stack indirect indexed (e.g. LDA (dd,S),Y).

The WDC C compiler for example uses DP as a frame pointer for accessing function arguments and local variables. You must have at least some code to redirect interrupt vectors to the handling code but everything else can be in other banks.

BigEd · Post by **BigEd** » Wed Jan 16, 2019 12:07 pm

I quite like vectoring through RAM, as Commodore and Acorn do, to allow hooking into OS services. That's flexibility, again for a few cycles more, if I still hang on to COP as the preferred call interface. (One of the services an OS should provide is the version, which allows the application to adjust or abort, depending. In this case, there should no doubt be an OS service to find the location of the RAM vector(s).)

As you know, in some cases Acorn's MOS provides both a pre-hook and a post-hook.

whartung · Post by **whartung** » Wed Jan 16, 2019 5:15 pm

I think for a multitasking system, you would pretty much have to dedicate the bulk of Bank 0 to stacks for the processes. What I mean by that, is that it should be made available to processes for stacks, rather than coding the kernel in that space. 64K can support a lot of processes.

The dispatch through the API router overhead is marginal in pretty much all cases simply due to either a) the enormity of the task involved (i.e. reading a disk sector), or simply the infrequency of the calls (how often is getpid called, for example).

The GS allocated stack to processes based on meta-data associated with the applications, just like it allocates heap to the processes. In fact, I think it simply used the heap allocation processes to allocate stacks, because the GS allows programs themselves to allocate memory from Bank 0.

On the Mac, for those edge case times where API dispatch times were deemed too onerous, it was straightforward to peek into the API routing tables, and dereference the routines. Then the code could make those calls directly, bypassing the routing mechanism. Obviously this requires intimacy with the implementation, but there's no reason you could not have a high level API call that returns the low level address of an underlying routine in a portable manner.

Folks also patched those calls to add interceptors or decorators to the core APIs. Having first class support for that would be interesting as well.

whartung · Post by **whartung** » Thu Jan 17, 2019 10:42 pm

On another note, another thing that the GS does is it requires the caller to make space on the stack for return values.

This makes a lot of sense because then the routine can actually use that area for not just the return value, but also as a work area before returning. The alternative is potentially having the routine keep its results in some temporary space, only to deposit them on the stack after the parameters have been cleaned off of it.

This detail has no effect on the COP vs JSL/R aspect of invocation, rather just the setting up of the call frame.
