Ultimately, the banking architecture imposed by the 65C02’s limited addressing space was simply too unwieldy to make any reasonable amount of throughput possible with more than two or three processes running at the same time. While I had a multiuser, multitasking environment running on the thing (at one point I had four terminals connected and all functional), it was slower than a snake caught in a snowstorm. I discontinued the project after about two years of work. However, the experience was very useful.
That is pretty much what I'm expecting to find too, especially the bit about experience - I like the idea of making it work, and seeing what hardware augmentation is sufficient for it to work, but ultimately I think it is likely to give worse performance than a well-designed cooperative multitasking system would on similar hardware. Still I'm interested in seeing what can be achieved. I don't really have any specific use case in mind, or a problem on this sort of platform that I think only preemptive multitasking can solve - it's just a technical challenge, and interesting to see what I end up with in comparison to you, Andre, or others who've walked this path.
An advantage I think my system might have over what you were working with before is having a few hardware features designed specifically to support this. I think the way my supervisor mode switch works - especially the switch back to user mode - is just enough to be viable, and does not add much overhead to interrupt handling. Letting the kernel also choose to use another process's page mappings or its own - at least for the bottom 8 pages - while still remaining in supervisor mode, has also been useful, even though I didn't really plan for that - at one point, whether or not supervisor mode was active depended on the active PID. I'm glad I made it a separate state in the end.
I also have way fewer interrupt sources! But less flexible hardware extensibility as a result.
A goal of mine with my POC unit is to write a preemptive kernel to run on it. In principle, that should be possible on POC V1.3, since all of the required hardware elements are there (plus it has much more elegant IRQ hardware than did the Uni-FI unit). Also, given the 65C816’s design, the code required to effect a context switch becomes relatively simple and fast. A limitation will be that unit’s 112K of accessible RAM, which will constrain the number of processes that could be concurrently run. That could be overcome by using disk space to swap out idle processes, but would definitely put a kink in throughput if a lot of task switching occurs in a short period of time. A POC unit with more RAM is, of course, the obvious solution.
For this system I was originally going to have 544K total RAM, but I couldn't bring myself to have less than an old DOS system, so I upped it to 1MB. I figured a system like this ought to have enough RAM that the POST check takes a while! I won't ever be able to support virtual memory paged to disk, as that seems to require a better processor or a coprocessor, neither of which I want to do.
Though having a second CPU running processes in parallel, from the same memory and process pool, would be an interesting feature in a "version 2" of this design.
Now progress updates - I got buffered serial I/O working in both directions, with getchar and putchar system calls - this was fairly simple. I had to do something about blocking - either for when the buffer was full, or when there was no data to read yet. Initially I went ahead with my plan to edit the process's return address on the stack to cause the syscall to run again, and put the process to the back of the queue - with a view to, in future, having a way to mark it as blocked and only unblock it when whatever is blocking it is resolved. But I haven't thought that through yet so I just put it at the back of the queue.
That support for blocked threads worked, but it was quite a chore for the syscall code. I quite liked the approach of arranging for the user process to rerun the syscall though, and realised a nice alternative was to make the kernel modify the flags instead, and let the user mode code choose to loop to repeat the syscall, or accept that it failed. I refactored this, along with the general mechanism for specifying syscall numbers, so the user code now uses a header file like this to interface with syscalls:
Code:
; mtos syscall interface
; Blocking syscall
;
; If the syscall blocks, it returns with N set, and this client code repeats the syscall until it succeeds
#define SYSCALL(n) \
.( :\
again: :\
ldx #n :\
brk : brk :\
bmi again :\
.)
; Nonblocking syscall
;
; For syscalls that never block, this is slightly more efficient client code.
;
; For other syscalls, this is also useful if the client wants to handle the blocking itself.
#define SYSCALLNB(n) \
ldx #n :\
brk : brk
#define SYSCALL_NOOP SYSCALLNB(0)
#define SYSCALL_YIELD SYSCALLNB(1)
#define SYSCALL_PUTCHAR SYSCALL(2)
#define SYSCALL_GETCHAR SYSCALL(3)
#define SYSCALL_EXIT SYSCALLNB(4)
#define SYSCALL_PUTSTRING SYSCALL(5)
There are two macros - one for blocking calls, one for nonblocking calls. The blocking one has a built-in loop to keep running the syscall until it no longer reports being blocked. In the nonblocking case, the user process can check the N flag itself.
Nonblocking calls are thus a bit cheaper, and it makes sense to use them always for syscalls that can never block. No-op, yield, and exit all use the nonblocking macro. Strictly speaking, yield does report itself as blocked (that's how it works: it puts the process at the back of the queue), but the user code ignores the flag.
The N flag works well because normally it's not set when a syscall is called, because the macro has just loaded a positive value into the X register. This means that the kernel only has to fiddle with it when the call blocks. The kernel code to do that looks like this:
Code:
jsr syscall ; dispatch the system call
bcc resume ; resume current process if carry clear
; Carry set means the process is blocked, so we need to arrange for the syscall to repeat.
; We do this by reselecting the user process and adjusting its flags, which are at the
; top of the stack so easy to access.
;
; Syscalls are always of the form "ldx #nn : brk : brk" with 0 <= nn < 128, so as far as
; the flags are concerned, normally N=0. So we can have the user code perform a BMI to
; repeat the syscall, and here we just need to set the N bit to trigger it.
lda zp_prevprocess : sta PID
pla : ora #$80 : pha
stz PID ; required for later code
; We also want to preempt the process, as it can't run at the moment.
bra preempted
Note that the easiest way to manipulate the flags on the stack is just to switch process IDs temporarily and use regular pla/pha operations - there's no need for the kernel to set up its own page mappings. It has to set the PID back to zero afterwards because later code needs to access kernel zero page data, which is paged out when the PID changes.
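To make the retry protocol easier to follow, here is a small Python model of it. This is only a sketch of the idea, not the kernel code: `kernel_entry`, `blocking_syscall` and the `handler` callback are hypothetical names, and the saved flags are simplified to a single byte that starts at zero.

```python
# Model only: the real client is the SYSCALL macro and the real kernel is 65C02 code.
N_FLAG = 0x80  # bit 7 of the 6502 status register

def kernel_entry(saved_flags, handler):
    """Run one syscall attempt and return the flags the user code sees afterwards.

    handler() returns True when the call would block (the carry-set case).
    """
    if handler():
        return saved_flags | N_FLAG   # same effect as "pla : ora #$80 : pha"
    return saved_flags

def blocking_syscall(number, handler):
    """Mimic SYSCALL(n): repeat while N comes back set. Returns attempts made."""
    assert 0 <= number < 128          # ldx #n leaves N clear for these values
    attempts = 0
    while True:
        attempts += 1
        flags = 0x00                  # simplified: N is clear after ldx #n
        flags = kernel_entry(flags, handler)
        if not flags & N_FLAG:        # "bmi again" falls through
            return attempts

# Hypothetical handler: blocked twice, then succeeds
outcomes = iter([True, True, False])
attempts = blocking_syscall(2, lambda: next(outcomes))
```

In this model the kernel never has to clear N on success; it only sets it when the call blocks, which matches the reasoning above.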
The other thing I experimented with was a "putstring" syscall - this is a lot more efficient than user processes calling a syscall for every character. But it's also a nice opportunity to experiment with ways to deal with memory mapping, page crossings, and long-running syscalls - as the string could be quite long and the transmit buffer could fill up.
The putstring syscall takes an address in A and Y, pointing to the first character of the string. In order to make it work with the blocking policy, it needs to update the A and Y registers so that replaying the same syscall will carry on and print the rest of the string, rather than starting again from the beginning. To begin with I made it just print one character at a time, as if it blocked after every character, to make sure that the A/Y update was working correctly.
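The pointer update can be modelled in a few lines of Python. This is a sketch under assumptions that are not stated above: A is taken to hold the low byte and Y the high byte, the string is taken to be zero-terminated, and `putstring_step` is a made-up name.

```python
def putstring_step(a, y, memory, tx_buffer, capacity):
    """One syscall attempt. Returns (a, y, blocked).

    Assumptions: A = low byte, Y = high byte, zero-terminated string.
    """
    addr = (y << 8) | a
    while memory[addr] != 0:
        if len(tx_buffer) >= capacity:
            # Buffer full: hand back the advanced pointer so a replay resumes here
            return addr & 0xFF, addr >> 8, True
        tx_buffer.append(memory[addr])
        addr = (addr + 1) & 0xFFFF
    return addr & 0xFF, addr >> 8, False

memory = bytearray(0x10000)
text = b"hello, world"
memory[0x20FA:0x20FA + len(text)] = text   # the terminating zero is already there
sent = bytearray()
tx = []
a, y, blocked = 0xFA, 0x20, True
calls = 0
while blocked:                             # the client's "bmi again" loop
    a, y, blocked = putstring_step(a, y, memory, tx, capacity=4)
    calls += 1
    sent.extend(tx)                        # pretend the serial hardware drained the buffer
    tx.clear()
```

Because the registers are rewritten before returning, the client can replay exactly the same syscall and the output carries on from where it stopped.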
I also made a helper routine to look up the process's page mapping for a particular address. It's based on code I used to have in the syscall handler itself. Given the high byte of an address, in A, it looks up which physical page (PP) the user process would use for that address. Then it's easy to write that into the pagetable entry for one of the kernel's pages, and use that to access the data.
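As a rough illustration of that helper, here is a Python model. Several things in it are guesses for the sake of the example: 4 KB pages (so the top four bits of the high byte select the logical page), PID 0 being the kernel's own mapping, and the names `lookup_physical_page`, `read_user_byte` and `kernel_window`.

```python
PAGE_SHIFT = 12  # assumed 4 KB pages; the real page size is not stated here

def lookup_physical_page(page_tables, pid, addr_hi):
    """Return the physical page that the process's mapping gives for this high byte."""
    logical_page = addr_hi >> (PAGE_SHIFT - 8)
    return page_tables[pid][logical_page]

def read_user_byte(page_tables, ram, pid, addr, kernel_window=15):
    """Map the user's page into one of the kernel's pages, then read through it."""
    pp = lookup_physical_page(page_tables, pid, addr >> 8)
    page_tables[0][kernel_window] = pp           # assumed: PID 0 is the kernel
    offset = addr & ((1 << PAGE_SHIFT) - 1)
    return ram[(page_tables[0][kernel_window] << PAGE_SHIFT) | offset]

# Hypothetical layout: process 3 has its 16 logical pages at physical pages 0x40..0x4F
page_tables = {0: list(range(16)), 3: [0x40 + i for i in range(16)]}
ram = bytearray(1 << 20)                         # 1MB of physical RAM
ram[(0x42 << PAGE_SHIFT) | 0x345] = 0x41
value = read_user_byte(page_tables, ram, 3, 0x2345)
```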
Having got the character-by-character variant working, I made it just increment its internal pointer and loop back to print the next character if it hasn't wrapped into a new logical page; if it has wrapped into a new page, it also reruns the page mapping code to map the next page before continuing. That all seems to work well. I also gave it a limit on the number of characters it will print in a single syscall, even if buffer space is available, so I can tune that to reduce the impact on other processes if necessary. I guess another option would be to poll for the preempt interrupt flag and stop printing characters if that comes in - maybe that would be better in fact.
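The loop structure, with the remap on a page crossing and the per-call limit, looks roughly like this as a Python model. The 4 KB page size, the value of `MAX_CHARS`, and the function name `putstring_bulk` are assumptions made for the example, not details of the real kernel.

```python
PAGE_SIZE = 0x1000   # assumed 4 KB pages
MAX_CHARS = 8        # hypothetical per-call limit

def putstring_bulk(addr, user_pages, ram, tx_buffer, capacity):
    """Return (next_addr, blocked, remaps); user_pages[logical] -> physical page."""
    pp = user_pages[addr // PAGE_SIZE]           # initial mapping lookup
    remaps = 1
    printed = 0
    while True:
        ch = ram[pp * PAGE_SIZE + addr % PAGE_SIZE]
        if ch == 0:
            return addr, False, remaps           # end of string: done
        if printed == MAX_CHARS or len(tx_buffer) >= capacity:
            return addr, True, remaps            # report blocked so the client replays
        tx_buffer.append(ch)
        printed += 1
        addr = (addr + 1) & 0xFFFF
        if addr % PAGE_SIZE == 0:                # wrapped into a new logical page
            pp = user_pages[addr // PAGE_SIZE]
            remaps += 1

# Two adjacent logical pages backed by non-adjacent physical pages
user_pages = list(range(16))
user_pages[2], user_pages[3] = 0x40, 0x90
ram = bytearray(1 << 20)
for i, b in enumerate(b"page crossing\0"):
    logical = 0x2FFC + i
    ram[user_pages[logical // PAGE_SIZE] * PAGE_SIZE + logical % PAGE_SIZE] = b

out = bytearray()
addr, blocked, calls = 0x2FFC, True, 0
while blocked:
    tx = []
    addr, blocked, _ = putstring_bulk(addr, user_pages, ram, tx, capacity=64)
    out.extend(tx)
    calls += 1
```

Here the string straddles logical pages 2 and 3, so the first call has to remap once partway through and then stops at the character limit even though the buffer still has room.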
Anyway, this routine seems to work well, and having several processes printing messages on top of each other gives roughly the effect I'd expect - chunks of each one's string get printed, until the buffer is full, and then they get to print one character each in a vaguely random order depending on which one got scheduled first after the serial system sent a character, freeing up a buffer space. I think this sort of approach is going to work fairly well for bulk I/O in general.
I think next I need to make a simple shell, so I can use it to choose which programs to start instead of hardcoding the scheduler to queue up some specific ones every time.