
Re: Fairly complete multiprocess computer design

Posted: Sun Dec 17, 2023 12:18 am
by BigDumbDinosaur
gfoot wrote:
The preempt interrupt is the thorny one there, do we allow task switching during a syscall?

The short answer is probably “no.”

Leaving an API call half-done in that fashion will likely result in some major headaches in maintaining state information from the interrupted API call.  Furthermore, you would be creating the potential for multiple API calls to be interrupted “mid-stream,” which would then demand that you have a stack somewhere to store all that state.  Since a stack is a LIFO data structure, now you have to either restart the half-done APIs in the opposite order in which they were interrupted, or you have to shuffle the stack to keep state information correctly organized as you complete half-done APIs.  Can you say “slow, complicated and bug-prone?”  :D

Even nastier is the risk of kernel deadlock.  If an in-process API locks a resource (perhaps to perform an atomic operation) and is then interrupted to run another API, what do you do if the API that is now running needs the resource locked by the interrupted API?

In most preemptive kernels, an API call is run to completion, if possible, with the only interruptions coming from higher-priority events, such as an I/O device that is demanding attention and has time-critical service requirements.  If an API must be suspended due to resource contention, it is done in a way that prevents deadlock—a complex bit of programming that would be best avoided in a 65xx system.

Once an API has completed its work and is ready to return to the caller, the API handler’s back end code would invoke the task scheduling functions that determine if a user process should be preempted.  In a 65xx system, if a decision is made to preempt the current user process, zero page and the stack must be manipulated so the current state is saved in some fashion, and the state of the user process that is to be restarted is made “visible” to the MPU.  If correctly carried out, the result will be a clean task switch when RTI is executed at the end of the API handler’s back end.

Re: Fairly complete multiprocess computer design

Posted: Sun Dec 17, 2023 2:08 am
by gfoot
BigDumbDinosaur wrote:
gfoot wrote:
The preempt interrupt is the thorny one there, do we allow task switching during a syscall?

The short answer is probably “no.”

Leaving an API call half-done in that fashion will likely result in some major headaches in maintaining state information from the interrupted API call.  Furthermore, you would be creating the potential for multiple API calls to be interrupted “mid-stream,” which would then demand that you have a stack somewhere to store all that state.  ...
I agree. In my system there are perhaps some ways this could work, but ultimately I think they'd only really work in cases where the code shouldn't really be a syscall at all, i.e. it's really a helper function that doesn't need to be in the kernel.
Quote:
Even nastier is the risk of kernel deadlock.  If an in-process API locks a resource (perhaps to perform an atomic operation) and is then interrupted to run another API, what do you do if the API that is now running needs the resource locked by the interrupted API?
And locking resources in that case would be almost essential - another process could almost certainly call the same syscall, and need to access the same resource.
Quote:
If an API must be suspended due to resource contention, it is done in a way that prevents deadlock—a complex bit of programming that would be best avoided in a 65xx system.
I've just implemented buffered serial output, and the buffer being full is one form of contention. As a first pass I made the system just hang with an error in that case, but then went back to do something a bit better about it - now if it hits that case it runs the interrupt service routine and loops back to try again. Still not very good - all processes are suspended! - but I think it is OK until I have proper support for processes being blocked on I/O, which should be soon anyway.

I think the support for blocked processes will look something like decrementing that process's PC entry in its stack by two, so when it resumes it rewinds to the BRK instruction that caused the syscall, and then not scheduling it until the I/O situation has changed. Any processes waiting for output space could then go in a queue, and get woken up front to back whenever space becomes available. It means the ISR needs to walk through this queue after processing the interrupt, and again with some care I think this can be done with interrupts re-enabled... eventually. I can make something simpler work first though and then iterate on it.
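Sketched in code, that PC rewind might look something like this - the stack-frame offsets and labels are illustrative assumptions, not the actual kernel layout.  BRK pushes PC+2, so subtracting two points the saved PC back at the BRK that made the call:

Code: Select all

; Sketch only - offsets assume the interrupt/BRK frame plus A, X, Y
; pushed by the handler, with the process's saved stack pointer in X:
;   $0104,x = P    $0105,x = PCL    $0106,x = PCH
rewind_to_brk:
        sec
        lda $0105,x     ; low byte of the saved PC
        sbc #2
        sta $0105,x
        lda $0106,x     ; high byte, with borrow
        sbc #0
        sta $0106,x
        rts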

File I/O is something I haven't thought a lot about yet. I already have support for loading files over the serial connection - especially program code - but I wouldn't want to try to make that play nicely in a multitasking environment, it would require server work that I don't want to do. I do, however, want SD card access code to work nicely. If it uses the VIA's shift register then I'll get one byte every 18 CPU cycles, and it doesn't make sense to try to let something else run in the meantime; but I would want other interrupts to be serviced as a priority.

To do that, I think I could WAI between bytes, and then manually trigger servicing of whatever interrupts have come up, whether it's the shift register or something else. Perhaps non-urgent interrupts should also get disabled during the load operation, so that it's just critical things like serial receive that are still enabled. It may also make sense to restrict it to a certain number of bytes at a time - e.g. 32 - and then let other processes run in turn before resuming the file load operation. So then, to avoid system calls being left hanging while other processes run, this one could exit before it's finished, rewinding the process's instruction pointer as I said above for blocked I/O handling, so that it reissues the system call when its turn to run comes around again. That would allow the kernel to resume the operation.
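As a sketch of the WAI idea (VIA_IFR, VIA_SR, dest and service_other_irq are placeholder names): with the I flag set, a WDC 65C02's WAI halts until an interrupt line asserts but does not vector through the handler, so the code can poll the flag registers and dispatch by hand:

Code: Select all

; Sketch only.  Bit 2 of a 65C22's IFR is the shift-register flag.
        sei
wait_byte:
        wai                     ; halt until some interrupt asserts
        lda VIA_IFR
        and #%00000100          ; shift register finished?
        bne got_byte
        jsr service_other_irq   ; no - hand off to the other handlers
        bra wait_byte
got_byte:
        lda VIA_SR              ; reading SR clears its flag
        sta (dest)              ; store the byte and advance as needed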

There's a lot of complexity there but I think it can be done step by step, with that ultimate solution in mind.
Quote:
Once an API has completed its work and is ready to return to the caller, the API handler’s back end code would invoke the task scheduling functions that determine if a user process should be preempted.  In a 65xx system, if a decision is made to preempt the current user process, zero page and the stack must be manipulated so the current state is saved in some fashion, and the state of the user process that is to be restarted is made “visible” to the MPU.  If correctly carried out, the result will be a clean task switch when RTI is executed at the end of the API handler’s back end.
Yes that bit is working very well already. As I page the whole memory map, these stacks don't really overlap, which might be making it easier. Perhaps the most awkward bit is the stack pointer itself, which is shared - but I save each process's value for that like any other register and restore it via the X register, and the kernel itself has no need of a persistent stack - it essentially starts from scratch on every interrupt, using whatever stack pointer the user process had reached.

Re: Fairly complete multiprocess computer design

Posted: Sun Dec 17, 2023 9:28 am
by BigDumbDinosaur
gfoot wrote:
In my system there are perhaps some ways this could work, but ultimately I think they’d only really work in cases where the code shouldn’t really be a syscall at all, i.e. it’s really a helper function that doesn’t need to be in the kernel.

Although developing a trustworthy, preemptive kernel capable of supporting multitasking is no simple task, it is much, much easier with the 65C816 than with the 65C02, something that became crystal-clear to me over 30 years ago—although I didn’t know it at the time.

Back in 1991, I did some programming work for (now-defunct) Fiscal Information in Florida, who had developed a terminal server for multiplexing a lot of terminals onto a few serial ports on the Point 4 minicomputers they were selling at the time.  Their Uni-FI terminal server was powered by an 8 MHz WDC 65C02 and had a couple of GALs for glue logic.  The unit was equipped with a 1 MB DIMM of the type then common in PCs, which was made to look like 16 banks of 32KB to the MPU, with each RAM bank appearing from $0000-$7FFF.  The range from $C000-$FFFF was common to all banks.  $C000-$FEFF could be RAM or ROM.  $FF00-$FFFF was always ROM, and due to limited address decoding, all I/O appeared in the range $8000-$BFFF :twisted:.  Bank switching involved writing the bank number into a “register” that existed in one of the GALs.

After I had finished development work for Fiscal, they let me keep the two Uni-FI units I had been given.  So I decided to take a whack at writing a lightweight, preemptive kernel to run on one of them.  As I said, there were 16 banks of RAM, so that meant the 65C02 could be made to see 16 different zero pages and 16 different stacks.  Sounded real tempting, although I knew that a 65C02 running at 8 MHz would be majorly loaded down if more than a few processes were simultaneously running.

With all that in mind, I figured I’d run the kernel in the $C000-$FEFF common-RAM range, with ROM mapped out.  Doing so would minimize having to change the memory map just to call kernel APIs.  RAM bank $00 would be workspace for the kernel—including the kernel’s zero page and stack, and the remaining banks could run individual processes, each with their own zero page and stack.  Since the kernel was visible to all banks at all times and hence static, I arranged things to call kernel APIs with JSR, with the jump table starting at $C000 so it would be stable from one version to the next.

In theory, task-switching on this unit would be fairly simple.  Keep track of the current process’ stack pointer (SP) in a kernel table when it (the process) was suspended, and when that process was to be run again, copy its stack pointer to SP, select its RAM bank, do an RTI and off we go.  The MPU would fetch SR (status register) and PC (program counter) from the in-context process’ stack (along with copies of the other registers) and resume where it left off.
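Sketched in code, with invented labels (sp_table and new_pid assumed to live in the common RAM so they stay visible across the bank switch, and the PID assumed to equal the bank number), that switch might have looked like:

Code: Select all

; Sketch only - labels are illustrative.
resume_process:
        ldx new_pid
        lda sp_table,x  ; the SP saved when this process was suspended
        stx BANK_REG    ; map in its zero page and stack
        tax
        txs             ; SP now points into that process's stack
        ply             ; restore the registers pushed at suspension
        plx             ; (save order is an assumption)
        pla
        rti             ; pull P and PC - off we go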

In practice, it wasn’t quite that simple when an IRQ occurred.  The unit had a total of 20 (!) IRQ sources, 19 in an octal UART (OCTART, a Philips 2898) and one in the SCSI host adapter that I built to attach mass storage.  Each time an IRQ hit, the system had to be remapped to expose the kernel bank, since the zero page pointers used for I/O access were in there.  Complications arose because the IRQ front end, which was in the immutable part of ROM at $FF00-$FFFF, would have to push the MPU state to the stack of whichever process was in context at the time—there was no way to switch to the kernel’s bank (and stack) without clobbering a register.  Ergo back-and-forth gyrations between the in-context bank and the kernel bank were necessary to allow the kernel’s interrupt handler access to the MPU state that was saved when the interrupt hit.

Problems snowballed when I got SCSI working and now had to work with disk buffers and foreground process routing during a SCSI transaction (triggered by bus phase IRQs)—which got even more exciting when the OCTART was interrupting at the same time, as well as keeping track of which process was reading or writing mass storage.  It became a battle of trying to access data structures in different banks without accidentally touching something at the wrong time or accidentally selecting a context that sent the MPU off into the asteroid belt.

Ultimately, the banking architecture imposed by the 65C02’s limited addressing space was simply too unwieldy to make any reasonable amount of throughput possible with more than two or three processes running at the same time.  While I had a multiuser, multitasking environment running on the thing (at one point I had four terminals connected and all functional), it was slower than a snake caught in a snowstorm.  I discontinued the project after about two years of work.  However, the experience was very useful.

A goal of mine with my POC unit is to write a preemptive kernel to run on it.  In principle, that should be possible on POC V1.3, since all of the required hardware elements are there (plus it has much more elegant IRQ hardware than the Uni-FI unit did).  Also, given the 65C816’s design, the code required to effect a context switch becomes relatively simple and fast.  A limitation will be that unit’s 112K of accessible RAM, which will constrain the number of processes that could be concurrently run.  That could be overcome by using disk space to swap out idle processes, but would definitely put a kink in throughput if a lot of task switching occurs in a short period of time.  A POC unit with more RAM is, of course, the obvious solution.  :D

————————————————————
EDIT: Fixed two typos that were belatedly noticed.

Re: Fairly complete multiprocess computer design

Posted: Sun Dec 17, 2023 1:03 pm
by barnacle
And then, in the afternoon... :mrgreen:

Neil (still trying to negotiate time between moving boxes to the new house next door to sort out some prototyping :) )

Re: Fairly complete multiprocess computer design

Posted: Mon Dec 18, 2023 12:47 am
by gfoot
BigDumbDinosaur wrote:
Ultimately, the banking architecture imposed by the 65C02’s limited addressing space was simply too unwieldy to make any reasonable amount of throughput possible with more than two or three processes running at the same time.  While I had a multiuser, multitasking environment running on the thing (at one point I had four terminals connected and all functional), it was slower than a snake caught in a snowstorm.  I discontinued the project after about two years of work.  However, the experience was very useful.
That is pretty much what I'm expecting to find too, especially the bit about experience - I like the idea of making it work, and seeing what hardware augmentation is sufficient for it to work, but ultimately I think it is likely to give worse performance than a well-designed cooperative multitasking system would on similar hardware. Still I'm interested in seeing what can be achieved. I don't really have any specific use case in mind, or a problem on this sort of platform that I think only preemptive multitasking can solve - it's just a technical challenge, and interesting to see what I end up with in comparison to you, Andre, or others who've walked this path.

An advantage I think my system might have over what you were working with before is having a few hardware features designed specifically to support this. I think the way my supervisor mode switch works - especially the switch back to user mode - is just enough to be viable, and does not add much overhead to interrupt handling. Letting the kernel also choose to use another process's page mappings or its own - at least for the bottom 8 pages - while still remaining in supervisor mode, has also been useful, even though I didn't really plan for that - at one point whether or not supervisor mode was active depended on the active PID. I'm glad I made it a separate state in the end.

I also have way fewer interrupt sources! But less flexible hardware extensibility as a result.
Quote:
A goal of mine with my POC unit is to write a preemptive kernel to run on it.  In principle, that should be possible on POC V1.3, since all of the required hardware elements are there (plus it has a much more elegant IRQ hardware than did the Uni-FI unit).  Also, given the 65C816’s design, the code required to effect a context switch becomes relatively simple and fast.  A limitation will be that unit’s 112K of accessible RAM, which will constrain the number of processes that could be concurrently run.  That could be overcome by using disk space to swap out idle processes, but would definitely put a kink in throughput if a lot of task switching occurs in a short period of time.  A POC unit with more RAM is, of course, the obvious solution.  :D
For this system I was originally going to have 544K total RAM, but I couldn't bring myself to have less than an old DOS system, so I upped it to 1MB. I figured a system like this ought to have enough RAM that the POST check takes a while! I won't ever be able to support virtual memory paged to disk, as that seems to require a better processor or a coprocessor, neither of which I want to do.

Though having a second CPU running processes in parallel, from the same memory and process pool, would be an interesting feature in a "version 2" of this design.


Now progress updates - I got buffered serial I/O working in both directions, with getchar and putchar system calls - this was fairly simple. I had to do something about blocking - either for when the buffer was full, or when there was no data to read yet. Initially I went ahead with my plan to edit the process's return address on the stack to cause the syscall to run again, and put the process to the back of the queue - with a view to, in future, having a way to mark it as blocked and only unblock it when whatever is blocking it is resolved. But I haven't thought that through yet so I just put it at the back of the queue.

That support for blocked threads worked, but it was quite a chore for the syscall code. I quite liked the approach of arranging for the user process to rerun the syscall though, and realised a nice alternative was to make the kernel modify the flags instead, and let the user mode code choose to loop to repeat the syscall, or accept that it failed. I refactored this, along with the general mechanism for specifying syscall numbers, so the user code now uses a header file like this to interface with syscalls:

Code: Select all

; mtos syscall interface

; Blocking syscall
;
; If the syscall blocks, it returns with N set, and this client code repeats the syscall until it succeeds
#define SYSCALL(n) \
    .(            :\
    again:        :\
        ldx #n    :\
        brk : brk :\
        bmi again :\
    .)

; Nonblocking syscall
;
; For syscalls that never block, this is slightly more efficient client code.
;
; For other syscalls, this is also useful if the client wants to handle the blocking itself.
#define SYSCALLNB(n) \
    ldx #n          :\
    brk : brk

#define SYSCALL_NOOP SYSCALLNB(0)
#define SYSCALL_YIELD SYSCALLNB(1)
#define SYSCALL_PUTCHAR SYSCALL(2)
#define SYSCALL_GETCHAR SYSCALL(3)
#define SYSCALL_EXIT SYSCALLNB(4)
#define SYSCALL_PUTSTRING SYSCALL(5)
There are two macros - one for blocking calls, one for nonblocking calls. The blocking one has a built-in loop to keep running the syscall until it no longer reports being blocked. In the nonblocking case, the user process can check the N flag itself.

Nonblocking calls are thus a bit cheaper, and it makes sense to use them always for syscalls that can never block. No-op, yield, and exit are all non-blocking; in fact yield does block (that's how it works, it puts the process at the back of the queue) but the user code ignores that.

The N flag works well because normally it's not set when a syscall is called, because the macro has just loaded a positive value into the X register. This means that the kernel only has to fiddle with it when the call blocks. The kernel code to do that looks like this:

Code: Select all

    jsr syscall      ; dispatch the system call

    bcc resume       ; resume current process if carry clear

    ; Carry set means the process is blocked, so we need to arrange for the syscall to repeat.
    ; We do this by reselecting the user process and adjusting its flags, which are at the
    ; top of the stack so easy to access.
    ;
    ; Syscalls are always of the form "ldx #nn : brk : brk" with 0 <= nn < 128, so as far as
    ; the flags are concerned, normally N=0.  So we can have the user code perform a BMI to
    ; repeat the syscall, and here we just need to set the N bit to trigger it.

    lda zp_prevprocess : sta PID
    pla : ora #$80 : pha
    stz PID ; required for later code

    ; We also want to preempt the process, as it can't run at the moment.
    bra preempted
Note that the easiest way to manipulate the flags on the stack is just to switch process IDs temporarily and use regular pla/pha operations - there's no need for the kernel to set up its own page mappings. It has to set the PID back to zero afterwards because later code needs to access kernel zero page data, which is paged out when the PID changes.

The other thing I experimented with was a "putstring" syscall - this is a lot more efficient than user processes calling a syscall for every character. But it's also a nice opportunity to experiment with ways to deal with memory mapping, page crossings, and long-running syscalls - as the string could be quite long and the transmit buffer could fill up.

The putstring syscall takes an address in A and Y, pointing to the first character of the string. In order to make it work with the blocking policy, it needs to update the A and Y registers so that replaying the same syscall will carry on and print the rest of the string, rather than starting again from the beginning. To begin with I made it just print one character at a time, as if it blocked after every character, to make sure that the A/Y update was working correctly.

I also made a helper routine to look up the process's page mapping for a particular address. It's based on code I used to have in the syscall handler itself. Given the high byte of an address, in A, it looks up which physical page (PP) the user process would use for that address. Then it's easy to write that into the pagetable entry for one of the kernel's pages, and use that to access the data.

Having got the character-by-character variant working, I made it just increment its internal pointer and loop back to print the next character, if it hadn't wrapped into a new logical page; and if it had wrapped into a new page, I made it rerun the page mapping code to map the next page before continuing. That all seems to work well. I also gave it a limit on the number of characters it will print in a single syscall, even if buffer space is available, so I can tune that to reduce impact on other processes if necessary. I guess another option for that would be to poll for the preempt interrupt flag and stop printing characters if that comes in - maybe that would be better in fact.
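The shape of that loop might be sketched like this - the names, and the try_putchar and map_user_page helpers, are placeholders rather than the real code:

Code: Select all

; Sketch only.  str is a zero-page pointer read through the kernel's
; window page; the caller's A/Y image is kept in step so a replayed
; syscall resumes mid-string rather than restarting.
putstring_step:
        lda (str)           ; next character
        beq done            ; terminator - syscall complete
        jsr try_putchar     ; assumed: returns C set if buffer is full
        bcs blocked         ; give up for now; the replay resumes here
        inc str             ; advance the pointer's low byte
        bne same_page
        inc str+1           ; crossed into a new logical page...
        lda str+1
        jsr map_user_page   ; ...so look up and map its physical page
same_page:
        dec chunk_left      ; per-call character cap
        bne putstring_step
blocked:                    ; fall through - both paths return
done:
        rts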

Anyway this routine seems to work well, and having several processes printing messages on top of each other gives roughly the effect I'd expect - chunks of each one's string get printed, until the buffer is full, and then they get to print one character each in a vaguely random order depending which one got scheduled first after the serial system sent a character, freeing up a buffer space. I think this sort of approach is going to work fairly well for bulk I/O in general.

I think next I need to make a simple shell, so I can use it to choose which programs to start instead of hardcoding the scheduler to queue up some specific ones every time.

Re: Fairly complete multiprocess computer design

Posted: Mon Dec 18, 2023 7:22 am
by BigDumbDinosaur
barnacle wrote:
And then, in the afternoon... :mrgreen:

Neil (still trying to negotiate time between moving boxes to the new house next door to sort out some prototyping :) )

“New house next door”  :?: :?: :?: :shock: :roll:

Re: Fairly complete multiprocess computer design

Posted: Mon Dec 18, 2023 7:48 am
by BigEd
> Anyway this routine seems to work well

Excellent! (Feels like something a little reminiscent of EAGAIN, but that's fine. A simple OS needs to do things simply.)

Re: Fairly complete multiprocess computer design

Posted: Mon Dec 18, 2023 9:02 am
by BigDumbDinosaur
gfoot wrote:
BigDumbDinosaur wrote:
While I had a multiuser, multitasking environment running on the thing...it was slower than a snake caught in a snowstorm.
That is pretty much what I’m expecting to find too, especially the bit about experience - I like the idea of making it work, and seeing what hardware augmentation is sufficient for it to work, but ultimately I think it is likely to give worse performance than a well-designed cooperative multitasking system would on similar hardware.

Cooperative multitasking is okay...until one of the processes doesn’t cooperate and sends the MPU down a rabbit hole.  :evil:

Preemptive multitasking, although more laborious to develop and debug, has the potential to produce higher overall throughput when scheduling rules take into account if a process is compute- or I/O-bound.  Since I/O tends to demand more in the way of system resources and often demands exclusive access to specific bits of hardware, the preemptive scheduler could/should grant more run time to I/O-bound processes to get them done more quickly.  Compute-bound processes that are running strictly in user mode can be given less run time, since they are going to get more done in terms of wall-clock time—they aren’t waiting on a disk or a UART to do something.

While performance is not likely to be anything to write home about, you should keep in mind that some of the smaller minicomputers of the late 1970s achieved more than acceptable performance while supporting up to eight users on a 128KB machine.  One of them, with which I have some familiarity, was the Point 4 mini running the IRIS operating system.  The “Point 4” name came from their first machine, which had a 0.4 microsecond machine cycle time, fast for its day.  That Point 4 mini produced about the same number of MIPS that a 65C02 system running at 12-14 MHz can achieve, although it was more efficient with integer math.  It was the highly refined scheduling algorithm of IRIS that made the system seem to punch above its weight.

So some processing alacrity can be had, but probably not with your present configuration.  The worst case scenario in a typical system of this type would be if all users’ terminals are simultaneously repainting their screens.  It was that, more than anything else, that brought my Uni-FI experiment to its knees.  The interrupt storm was intense.

Quote:
Still I’m interested in seeing what can be achieved...it’s just a technical challenge...

Ultimately, that is what this hobby is all about.  I say as much on my POC website, where I compare my POC unit to my large-scale model locomotive, the latter of which can’t haul passengers from New York to Chicago...the POC unit isn’t going to be able to run Amtrak’s passenger reservations system.  :D

Quote:
An advantage I think my system might have over what you were working with before is having a few specific hardware features designed specifically to support this.

Yep!  The Uni-FI hardware on which I was working was optimized for collecting data from terminals and sending it to a specific serial port, and collecting data from that port and sending it to a specific terminal.  Everything was optimized for serial I/O, using round-robin scheduling with no concept of preemption.  It was not a design on which random task selection could be efficiently carried out.

Quote:
I think the way my supervisor mode switch works - especially the switch back to user mode - is just enough to be viable, and does not add much overhead to interrupt handling.

It helps that the 65C02 is fundamentally “quick on the draw” when it comes to IRQ response.  That characteristic helps to compensate for the limitations of the C02 when it comes to rapidly changing context.

Quote:
Letting the kernel also choose to use another process’s page mappings or its own - at least for the bottom 8 pages - while still remaining in supervisor mode, has also been useful, even though I didn’t really plan for that - at one point whether or not supervisor mode was active depended on the active PID.

In the Linux/UNIX environment, the decision as to whether processing is to be carried out in user or supervisor mode actually is not in the kernel’s purview.  The act of an interrupt being serviced is what makes the switch.  Whether that interrupt is hardware or software doesn’t enter into it.

My first experiences with UNIX were on a PDP-11, and while I never got involved with the kernel, I learned enough about it and the hardware to understand how memory protection was produced and why instructions that were permissible in kernel mode would cause a core dump in user mode.  Then, as now, kernel APIs were called via a vectored software interrupt, which on the PDP-11 was TRAP, followed by a vector index.  On a Motorola-powered machine, the MC68000 instruction was also TRAP with a vector index.  In both cases, TRAP caused the processor to shift from user to supervisor mode.  Provisions are made to allow nested interrupts to be handled without the machine switching back to user mode.

BTW, there is a fairly strong similarity between PDP-11 assembly language and 6502 assembly language, in part because the Motorola 6800 borrowed some of its assembly language conventions from the PDP-11, conventions that traveled to MOS Technology when Chuck Peddle and the gang left Motorola.  :D

Quote:
I also have way fewer interrupt sources!  But less flexible hardware extensibility as a result.

A lot of interrupt sources can be accommodated with suitable “helper” hardware.  A priority encoder is useful, as is having a register somewhere in glue logic that can be read to determine which device(s) is(are) interrupting.  I’ve got that latter setup in POC V1.3, in which IRQ outputs generated by the four serial I/O channels (eight IRQs total) are readable from a faux register fashioned by an inverting buffer chip.  My ISR doesn’t poll the channels to see who’s interrupting.  It merely reads the faux register and gets status for all channels in one operation.
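A sketch of how a faux register like that collapses the polling (IRQ_STAT and service_channel are invented names):

Code: Select all

; Sketch only.  One read of the faux register returns all eight
; channel IRQ flags at once; a shift loop then dispatches.
irq_scan:
        lda IRQ_STAT
        ldx #0
scan:
        lsr                 ; channel X's flag into carry
        bcc skip
        pha                 ; hold the remaining flags
        jsr service_channel ; assumed: services channel X, preserves X
        pla
skip:
        inx
        cpx #8
        bne scan
        rts                 ; other IRQ sources handled by the caller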

Quote:
Quote:
A goal of mine with my POC unit is to write a preemptive kernel to run on it...A limitation will be that unit’s 112K...That could be overcome by using disk space to swap out idle processes...A POC unit with more RAM is, of course, the obvious solution.
For this system I was originally going to have 544K total RAM, but I couldn’t bring myself to have less than an old DOS system, so I upped it to 1MB.  I figured a system like this ought to have enough RAM that the POST check takes a while!

POC V1.3 takes about one second to test 112K of RAM, running at 16 MHz.  I figure checking a meg of RAM would run about 8-9 seconds.  My check includes delays after each write-fetch-compare step, plus it’s a non-destructive test for all RAM outside of pages $00 and $01, and the range from page $B7 to $BF.  That means I have to preserve the location being tested, which means an extra read and write per location.
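One non-destructive step of that sort might look like this sketch - the patterns and addressing are illustrative, and the delays mentioned above are omitted:

Code: Select all

; Sketch only.  ptr is a zero-page pointer to the byte under test.
test_cell:
        lda (ptr)       ; preserve the resident byte
        pha
        lda #$aa        ; write-fetch-compare, pattern 1
        sta (ptr)
        cmp (ptr)
        bne bad_cell
        lda #$55        ; pattern 2
        sta (ptr)
        cmp (ptr)
        bne bad_cell
        pla             ; restore the original contents
        sta (ptr)
        clc             ; carry clear = cell good
        rts
bad_cell:
        pla             ; restore what we can and report failure
        sta (ptr)
        sec
        rts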

Quote:
I won’t ever be able to support virtual memory paged to disk, as that seems to require a better processor or a coprocessor, neither of which I want to do.

You need a really fast mass storage I/O system to make swapping practical.  Even though POC V1.3 has SCSI I/O and can produce a write throughput that is over 700K per second, it’s nowhere near fast enough to be practical with swapping.  I would need to get SCSI I/O running at close to bus speed (3.5 MB/second) to make swapping tolerable.

Re: Fairly complete multiprocess computer design

Posted: Mon Dec 18, 2023 9:41 am
by GARTHWILSON
BigDumbDinosaur wrote:
Cooperative multitasking is okay...until one of the processes doesn’t cooperate and sends the MPU down a rabbit hole.  :evil:
These discussions always bring me back to the question of how important that is for our applications here.  It is disgusting that certain companies schedule, and promise the release of, new software titles and versions for certain dates, regardless of readiness, so in the name of making a big market splash and the accompanying profits, they are released with bugs, including crashes.  That is unforgivable IMO.  If we aren't going to be running anyone else's software, we have full control of everything we run, and we work on something until it is free of bugs, not to mention crashes.  Then the only remaining factor might be that if we have a development system running on the same computer as new software we're working on, we don't want a crash of the new stuff to take the development environment down with it too.

Re: Fairly complete multiprocess computer design

Posted: Mon Dec 18, 2023 10:21 am
by drogon
gfoot wrote:
That is pretty much what I'm expecting to find too, especially the bit about experience - I like the idea of making it work, and seeing what hardware augmentation is sufficient for it to work, but ultimately I think it is likely to give worse performance than a well-designed cooperative multitasking system would on similar hardware
A well-designed co-operative multitasking system DID work well on similar hardware - back in the early 80s on BBC Micros (and others that supported BCPL).

The closest I got to pre-emptive multitasking back then (on a micro) was MP/M with the Z80. The system I used was a Northstar Horizon and had a 64KB RAM bank per user/process. Not cheap, not flexible either, but fairly robust.

Obviously a few years later we had the Acorn Archimedes with Arthur then RISC-OS, which was cooperatively multitasking and also worked extremely well. It seemed pre-emptive might have been a step too far then on those little micros - even though things like the PDP11 and the plethora of other minicomputers were doing it very well (especially Unix) and 68000 systems were rapidly catching up... RISC-OS is still in use today, although I think the model may have changed to pre-emptive multitasking; I've not been keeping up.

(And digging out the history of the early Acorn ARM OSes is also interesting - the UK group got co-operative multitasking working, while the US group, doing a pre-emptive system at the same time, failed to deliver on time.)

But back then (early 80s) I was at uni., studying things like real-time control, robotics, factory automation and so on, and needed a platform to work on; I was finding things like CP/M, etc. to be somewhat inflexible, so when we got some BBC Micros it was a joy. I wrote everything in BCPL, which supported coroutines - cooperative multitasking, although essentially a single-process model. I had threads to monitor the CNC machines it was connected to, the separate 6502-based "PLC" that controlled the mechanicals, and the shared filing system over the LAN (Econet), which was remarkably robust given the somewhat electrically hostile nature of a typical engineering machine shop. Each station (CNC machine, loading/unloading robot, measuring/inspection station, etc.) had its own Beeb and ran autonomously, fetching work lists from a central server. The Beeb had a colour display with a "mimic" diagram of the station and pictures of the incoming and outgoing pallets, etc., and could be manually driven by an operator to override the work tasks. I wish I had photos, but this was pre-personal-camera days and film was expensive, so we didn't waste it on academic projects!

Fast forward to today: I wanted multitasking on my Ruby816 BCPL project, so initially it was going to be coroutines, as that's what BCPL supports natively, but I wanted pre-emptive, so I wound back a few years to the Inmos Transputer (which I'd used a lot in the late 80s) - its instruction set bears a resemblance to that of the BCPL intermediate code (no real coincidence, as they're all from the same place really) and it implemented pre-emptive multitasking in hardware.

So I set about emulating that in my BCPL Cintcode VM and it works remarkably well - that video I posted recently really is 12 clocks running pre-emptively in the background with a foreground process (the Pi calculation) running too. The clocks do call a sleep function rather than spin on the clock waiting for the next second but the Pi program is just a normal program that gets pre-emptively descheduled to allow the clocks to run.

Scheduling is strictly round-robin, but I have a plan to increase the priority of the foreground process (run it every other process, or every 3rd, 4th, etc.). It's not intended to be anything "real time" - just usable - and I had/have plans to run stuff like a printer background task, or to make all the I/O into separate tasks.

And this all happens without any special hardware support. The CLI can spawn processes, and a process can spawn threads which get killed when the main process ends. Each process has its own stack, and the task-switching overhead is very small - to make life easy, I copy the (virtual) machine registers to a per-process save area, copy in the next task's registers and carry on. Relatively speaking it's a lot of RAM to copy (24 bytes each way), but using the '816 block-move instruction it's bearable - process switching takes 22µs at 16 MHz.
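The switch described above can be modelled very compactly - strict round-robin, with the VM's 24-byte register block copied out to the outgoing task's save area and the incoming task's block copied in. The class names and Python form are purely illustrative; only the round-robin policy and the 24-byte copy come from the post:

```python
# Minimal model of the scheduler described above: strict round-robin, with a
# task switch done by copying the live register block to the outgoing task's
# save area and copying the incoming task's saved block in. Names are
# illustrative; the 24-byte register block size is taken from the post.

REGBLOCK = 24  # bytes of virtual-machine registers copied each way

class Task:
    def __init__(self, name):
        self.name = name
        self.saved = bytearray(REGBLOCK)  # per-task register save area

class Scheduler:
    def __init__(self, tasks):
        self.tasks = list(tasks)
        self.current = 0
        self.regs = bytearray(REGBLOCK)   # the "live" VM registers

    def switch(self):
        # save the outgoing task's registers, restore the incoming task's
        self.tasks[self.current].saved[:] = self.regs
        self.current = (self.current + 1) % len(self.tasks)
        self.regs[:] = self.tasks[self.current].saved
        return self.tasks[self.current].name
```

On the '816 the two copies would each be a single MVN block move, which is where the quoted 22µs figure comes from.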

Of-course when a program crashes and writes all over RAM then it all goes horribly wrong and you reach for the big red button - again.

And that's why we have MMUs with memory protection and hardware supervisor/user modes, but then the complexity starts to go up almost exponentially, and I wonder at what point we should consider moving to a more suitable platform. I have to confess I'm more a software person than a hardware one; I do find this level of hardware very interesting, but it's not something I think I'd consider doing myself.

Having a 6502 with banks of 64K RAM seems easy - same for an '816. Writing the code to handle it all is, I think, the harder task!

My system would benefit from hardware write-protection. It doesn't need swapping/paging, just more RAM (currently it has 512KB and I've never run out when doing the usual stuff like editing and compiling). It does have one large shared 'heap' (getvec/malloc) that all processes use, so giving each process its own would be a protection benefit, but this is when you need an MMU to grow that heap and keep it appearing contiguous in RAM, and so the complexity grows...

Am I going to do any huge data processing on it? No. It's just not fast enough for today's data. Could I write an "office suite" for it (wp/spreadsheet/graphics)? Yes, sure. Just look at e.g. AppleWorks back in the early 80s and the memory sizes we had then - also the early Mac with 128KB....

Of-course, a multi-tasking (even multi-user!) BASIC - now that might be a thing :-)

Cheers,

-Gordon

Re: Fairly complete multiprocess computer design

Posted: Mon Dec 18, 2023 8:57 pm
by barnacle
BigDumbDinosaur wrote:
barnacle wrote:
And then, in the afternoon... :mrgreen:

Neil (still trying to negotiate time between moving boxes to the new house next door to sort out some prototyping :) )

“New house next door”  :?: :?: :?: :shock: :roll:
Yeah... we're renting the place where we currently live. Our neighbour unexpectedly died while I was away earlier in the year, and after some negotiation (he left his affairs in a mess) we bought the house from his heirs.

Today I moved about half a ton of books - pushing three thousand - into new shelves... tomorrow I need to lay a floor in my new lair as I'm moving from the basement to the attic. But I did find time to order some chips and sockets from Mouser.

The SVGA/STM DMA approach (half an idea... viewtopic.php?f=4&t=7832) is looking good, but I'm not buying a PCB until I've proved the VGA timing; I don't know why, but I'm sure I recall that the '165 or '166 latched on an edge, whereas the data sheets suggest they work on levels, which complicates things a little.

Neil

Re: Fairly complete multiprocess computer design

Posted: Tue Dec 19, 2023 12:15 am
by BigDumbDinosaur
barnacle wrote:
Today I moved about half a ton of books - pushing three thousand - into new shelves...

3000 books?  Better hope the wall bracing the bookcase doesn’t cave in. :shock:

Quote:
tomorrow I need to lay a floor in my new lair as I'm moving from the basement to the attic.

From the basement to the attic.  That’s what I call movin’ on up!

Re: Fairly complete multiprocess computer design

Posted: Tue Dec 19, 2023 1:31 am
by gfoot
drogon wrote:
Of-course when a program crashes and writes all over RAM then it all goes horribly wrong and you reach for the big red button - again.

And that's why we have MMUs with memory protection and hardware supervisor/user modes but then the complexity starts to go up almost exponentially[...]
That's sort of what I was interested to explore - how much hardware does it really take? I think that "good enough" memory protection is actually not that hard to bolt onto pretty much anything, and it's even easier with a cooperative system; if you're willing to say that a process that doesn't yield frequently enough just gets killed, then you fix a lot of the potential issues of cooperative systems. I appreciate Garth's point that well-written code should run fine in a cooperative environment, and the more coherently designed the software suite is for such a machine the better, but the more diverse multitasking gets, the more you are bound to eventually end up running badly-written code and need a backup plan to deal with it. Perhaps any app that blocks the CPU for too long should be uninstalled from disk!
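As a sketch of how little such "good enough" protection needs: a single per-process base/limit window checked on every write, modelled here in Python. In hardware this would be a pair of bounds registers compared against the address bus; all names below are made up for illustration.

```python
# Minimal model of "good enough" write protection: each process gets one
# contiguous RAM window, and any write outside it faults. In hardware this
# is a base/limit register pair compared against the address bus; the class
# and method names here are illustrative assumptions.

class ProtectedRAM:
    def __init__(self, size):
        self.mem = bytearray(size)
        self.window = (0, size)        # (base, limit) for the current process

    def set_window(self, base, limit):
        self.window = (base, limit)    # kernel loads the bounds registers

    def write(self, addr, value):
        base, limit = self.window
        if not base <= addr < limit:
            # in hardware: block the write and raise an abort/NMI so the
            # kernel can kill the offending process
            raise MemoryError(f"write fault at {addr:#06x}")
        self.mem[addr] = value
```

Reads could be gated the same way, but gating only writes already stops a runaway process from trashing the kernel or other tasks.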
Quote:
My system would benefit from hardware write-protection. It doesn't need swapping/paging just more RAM (currently has 512KB and I've never run out when doing the usual stuff like editing and compiling). It does have one large shared 'heap' (getvec/malloc) that all processes use, so giving each process its own would be a protection benefit, but this is then when you need an MMU to grow that heap and keep it appearing contiguous in RAM, an so the complexity grows...
To make this kind of thing work comfortably, you need a processor with more address space than you have RAM, which you have in the '816. I don't think the hardware or software has to be very complicated (I don't think mine is). The limitation for me though is the 64K address space of the 6502 - while it's possible to allow individual processes to access more than that through explicit paging, that is very intrusive on the software side, akin to sideways RAM, or EMS and XMS in the PC world, and there's not really much you can do to ease that in hardware. Acorn's Turbo system is probably the most transparent thing I'm aware of, allowing use of 24-bit addresses in certain conditions and effectively functioning as an extra addressing mode at the machine code level. (Hmm, maybe I can implement that, and then find a way to run TurMASM or Turbo BASIC on this system...)

With the '816 you're much better off though - like the 386, it has way more address space than you're likely to put physical RAM behind, so presumably you can just spread your resources out (code, heaps, stacks, etc) and grow them on demand by mapping new pages as necessary. I wouldn't want to breadboard that - far too many address lines! - but I think it would be very practical on a PCB. I might be tempted to do something like that with my ARM2 computer at some point, as that too would be a good processor to use for that sort of thing.
barnacle wrote:
The SVGA/STM DMA approach (half an idea... viewtopic.php?f=4&t=7832) is looking good, but I'm not buying a PCB until I've proved the VGA timing; I don't know why, but I'm sure I recall that the '165 or '166 latched on an edge, whereas the data sheets suggest they work on levels, which complicates things a little.
Looking forward to seeing how that goes. The 74xx166 has a synchronous load, in sync with the shift clock - I've used it a lot in the past and get very consistent pixel widths out of it. It looks like the 74xx165 is asynchronous though.

Progress:

I have been busy with work today, but took a look at the interrupt latency in a very high level way - here are oscilloscope shots of serial output, showing the duration of the IRQB pulses and the effect on the output stream:
20231219_003214.jpg
20231219_003251.jpg
It's leading to an almost-doubling of the stop-bit duration (at 9600 baud), and the IRQ response time to the point where the next character is sent appears to be about 64µs. This is with a 4 MHz clock, so that's 256 cycles. When I get more time I'll look into where exactly all those cycles are going.

As an afterthought, here's another shot now showing - on top - the "supervisor" state:
20231219_012349.jpg
It dropped into user mode briefly, then back to super mode just before the interrupt - this is likely a system call - and then you can see it spent about 55µs in the system call, leading to most of the delay in processing the IRQ. I need to look at getting interrupts enabled during system calls, then measure this again.
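For reference, converting the measured times above into clock cycles at the 4 MHz clock quoted in the post:

```python
# Converting the latencies measured above into clock cycles at the 4 MHz
# system clock quoted in the post.
clock_hz = 4_000_000
irq_to_char_us = 64      # IRQ assertion to next character sent, microseconds
syscall_us = 55          # time spent inside the system call, microseconds

irq_cycles = irq_to_char_us * clock_hz // 1_000_000      # 256 cycles
syscall_cycles = syscall_us * clock_hz // 1_000_000      # 220 cycles
```

So roughly 220 of the 256 cycles are spent waiting for the system call to finish, which is why enabling interrupts during syscalls should recover most of the latency.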

Re: Fairly complete multiprocess computer design

Posted: Tue Dec 19, 2023 6:43 am
by GARTHWILSON
gfoot wrote:
drogon wrote:
Of-course when a program crashes and writes all over RAM then it all goes horribly wrong and you reach for the big red button - again.

And that's why we have MMUs with memory protection and hardware supervisor/user modes but then the complexity starts to go up almost exponentially[...]
That's sort of what I was interested to explore - how much hardware does it really take. I think that "good enough" memory protection is actually not that hard to bolt onto pretty much anything, even easier with a cooperative system, and if you're willing to say that a process that doesn't yield frequently enough just gets killed, then you fix a lot of the potential issues of cooperative systems. I appreciate Garth's point that well-written code should run fine in a cooperative environment, and the more coherently designed the software suite is for such a machine the better, but the more diverse multitasking gets, the more you are bound to eventually end up running badly-written code and need a backup plan to deal with it. Perhaps any app that blocks the CPU for too long should be uninstalled from disk!
We had a discussion about this in the past, but I can't seem to find it now to link to.  When I'm developing software, the crashes I get virtually never go writing trash all over RAM; instead they just get into loops whose exit conditions are never met.  So an idea was to have a combination of coöperative and preëmptive, where most handoffs are coöperative and thus enjoy much lower overhead, because each task determines when it's at a good stopping point and there's very little that needs saving; but if a task takes unreasonably long, a timer interrupt seizes control from the offending process, and most likely shuts it down.  It makes the hardware design much simpler.
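The hybrid scheme described above can be sketched as follows - cooperative handoffs as the normal case, with a watchdog (modelled here as a cycle budget rather than a real timer interrupt) that shuts down any task that runs too long before yielding. The budget value and the generator-based task API are illustrative assumptions:

```python
# Sketch of the hybrid cooperative/preemptive scheme described above: tasks
# hand off cooperatively, and a watchdog budget (standing in for the timer
# interrupt) kills any task that runs unreasonably long before yielding.
# The budget value and task representation are assumptions for illustration.

BUDGET = 1000  # max "cycles" a task may consume before it must have yielded

def run(tasks):
    """tasks: list of (name, generator); each yield reports cycles consumed.
    Returns the names of tasks the watchdog had to shut down."""
    killed = []
    while tasks:
        for name, gen in list(tasks):        # cooperative round-robin
            try:
                used = next(gen)             # run task until it yields
            except StopIteration:
                tasks.remove((name, gen))    # task finished cleanly
                continue
            if used > BUDGET:                # watchdog timer fired
                tasks.remove((name, gen))
                killed.append(name)          # offending task is shut down
    return killed
```

Well-behaved tasks never see the watchdog, so they keep the low overhead of cooperative handoffs; only the misbehaving ones pay for the preemption machinery.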

Re: Fairly complete multiprocess computer design

Posted: Tue Dec 19, 2023 9:09 am
by BigEd
Just a quick question George - is the schematic and overview in the head post still a good reference? I like your ideas but haven't yet studied the detail or the implementation.