All times are UTC

Fairly complete multiprocess computer design

gfoot

Post subject: Re: Fairly complete multiprocess computer design

Posted: Fri Dec 15, 2023 2:49 pm

Joined: Fri Jul 09, 2021 10:12 pm
Posts: 741

My multitasking kernel has come together quite well so far - here it is running four dummy tasks:

Code:

gfoot's multitasking computer bootstrap ROM

Checking ZP/stack... OK
Loading second stage...
Stage 1.5
Checking private RAM works... OK
Loading second stage...

mtos kernel starting
32K found
Testing paged RAM... init, test, OK
Clearing paged RAM... OK
debugspawnprocess: got page 01
debugspawnprocess: got PID 01
debugspawnprocess: got page 02
debugspawnprocess: got PID 02
debugspawnprocess: got page 03
debugspawnprocess: got PID 03
debugspawnprocess: got page 04
debugspawnprocess: got PID 04
scheduler: running process 01
scheduler: running process 02
scheduler: running process 03
scheduler: running process 04
scheduler: running process 01
scheduler: running process 02
scheduler: running process 03

These tasks simply execute a no-op system call in a tight loop:

Code:

loop:
    brk : .byte $00
    bra loop

This syscall is non-yielding, so the kernel will return control to the same process afterwards. Tasks may yield by using a different syscall, but the system will also force them to yield when a VIA timer expires, currently set to 1024 cycles. As a result, in execution the trace shows a lot of syscalls being made by each process, before the timer expires and another process is forced to run. My I/O is very slow (blocking reads and writes, 9600 baud) so I don't log anything apart from process switches at the moment - but hoglet's decoder shows what is going on underneath:

Code:

8D08 : 40       : RTI            : 6 : A=00 X=00 Y=00 SP=FF N=0 V=1 D=0 I=0 Z=1 C=0
0200 : 00 00    : BRK #00        : 7 : A=00 X=00 Y=00 SP=FC N=0 V=1 D=0 I=1 Z=1 C=0
8C6F : 8D 00 F4 : STA F400       : 4 : A=00 X=00 Y=00 SP=FC N=0 V=1 D=0 I=1 Z=1 C=0
8C72 : 68       : PLA            : 4 : A=72 X=00 Y=00 SP=FD N=0 V=1 D=0 I=1 Z=0 C=0
8C73 : 48       : PHA            : 3 : A=72 X=00 Y=00 SP=FC N=0 V=1 D=0 I=1 Z=0 C=0
8C74 : 9C 0F 86 : STZ 860F       : 4 : A=72 X=00 Y=00 SP=FC N=0 V=1 D=0 I=1 Z=0 C=0
8C77 : 29 10    : AND #10        : 2 : A=10 X=00 Y=00 SP=FC N=0 V=1 D=0 I=1 Z=0 C=0
8C79 : D0 11    : BNE 8C8C       : 3 : A=10 X=00 Y=00 SP=FC N=0 V=1 D=0 I=1 Z=0 C=0
8C8C : 8E 01 F4 : STX F401       : 4 : A=10 X=00 Y=00 SP=FC N=0 V=1 D=0 I=1 Z=0 C=0
8C8F : 20 7F 8D : JSR 8D7F       : 6 : A=10 X=00 Y=00 SP=FA N=0 V=1 D=0 I=1 Z=0 C=0
8D7F : A6 06    : LDX 06         : 3 : A=10 X=02 Y=00 SP=FA N=0 V=1 D=0 I=1 Z=0 C=0
8D81 : BD 00 80 : LDA 8000,X     : 4 : A=02 X=02 Y=00 SP=FA N=0 V=1 D=0 I=1 Z=0 C=0
8D84 : 8D 00 91 : STA 9100       : 4 : A=02 X=02 Y=00 SP=FA N=0 V=1 D=0 I=1 Z=0 C=0
8D87 : BA       : TSX            : 2 : A=02 X=FA Y=00 SP=FA N=1 V=1 D=0 I=1 Z=0 C=0
8D88 : E8       : INX            : 2 : A=02 X=FB Y=00 SP=FA N=1 V=1 D=0 I=1 Z=0 C=0
8D89 : E8       : INX            : 2 : A=02 X=FC Y=00 SP=FA N=1 V=1 D=0 I=1 Z=0 C=0
8D8A : E8       : INX            : 2 : A=02 X=FD Y=00 SP=FA N=1 V=1 D=0 I=1 Z=0 C=0
8D8B : E8       : INX            : 2 : A=02 X=FE Y=00 SP=FA N=1 V=1 D=0 I=1 Z=0 C=0
8D8C : 38       : SEC            : 2 : A=02 X=FE Y=00 SP=FA N=1 V=1 D=0 I=1 Z=0 C=1
8D8D : BD 00 11 : LDA 1100,X     : 4 : A=02 X=FE Y=00 SP=FA N=0 V=1 D=0 I=1 Z=0 C=1
8D90 : E9 01    : SBC #01        : 2 : A=01 X=FE Y=00 SP=FA N=0 V=0 D=0 I=1 Z=0 C=1
8D92 : 85 00    : STA 00         : 3 : A=01 X=FE Y=00 SP=FA N=0 V=0 D=0 I=1 Z=0 C=1
8D94 : E8       : INX            : 2 : A=01 X=FF Y=00 SP=FA N=1 V=0 D=0 I=1 Z=0 C=1
8D95 : BD 00 11 : LDA 1100,X     : 4 : A=02 X=FF Y=00 SP=FA N=0 V=0 D=0 I=1 Z=0 C=1
8D98 : E9 00    : SBC #00        : 2 : A=02 X=FF Y=00 SP=FA N=0 V=0 D=0 I=1 Z=0 C=1
8D9A : C9 10    : CMP #10        : 2 : A=02 X=FF Y=00 SP=FA N=1 V=0 D=0 I=1 Z=0 C=0
8D9C : 90 15    : BCC 8DB3       : 3 : A=02 X=FF Y=00 SP=FA N=1 V=0 D=0 I=1 Z=0 C=0
8DB3 : 09 10    : ORA #10        : 2 : A=12 X=FF Y=00 SP=FA N=0 V=0 D=0 I=1 Z=0 C=0
8DB5 : 85 01    : STA 01         : 3 : A=12 X=FF Y=00 SP=FA N=0 V=0 D=0 I=1 Z=0 C=0
8DB7 : B2 00    : LDA (00)       : 5 : A=00 X=FF Y=00 SP=FA N=0 V=0 D=0 I=1 Z=1 C=0
8DB9 : 0A       : ASL A          : 2 : A=00 X=FF Y=00 SP=FA N=0 V=0 D=0 I=1 Z=1 C=0
8DBA : AA       : TAX            : 2 : A=00 X=00 Y=00 SP=FA N=0 V=0 D=0 I=1 Z=1 C=0
8DBB : E0 04    : CPX #04        : 2 : A=00 X=00 Y=00 SP=FA N=1 V=0 D=0 I=1 Z=0 C=0
8DBD : B0 03    : BCS 8DC2       : 2 : A=00 X=00 Y=00 SP=FA N=1 V=0 D=0 I=1 Z=0 C=0
8DBF : 7C 77 8D : JMP (8D77,X)   : 6 : A=00 X=00 Y=00 SP=FA N=1 V=0 D=0 I=1 Z=0 C=0
8D7B : 18       : CLC            : 2 : A=00 X=00 Y=00 SP=FA N=1 V=0 D=0 I=1 Z=0 C=0
8D7C : 60       : RTS            : 6 : A=00 X=00 Y=00 SP=FC N=1 V=0 D=0 I=1 Z=0 C=0
8C92 : AE 01 F4 : LDX F401       : 4 : A=00 X=00 Y=00 SP=FC N=0 V=0 D=0 I=1 Z=1 C=0
8C95 : 90 E9    : BCC 8C80       : 3 : A=00 X=00 Y=00 SP=FC N=0 V=0 D=0 I=1 Z=1 C=0
8C80 : A5 06    : LDA 06         : 3 : A=02 X=00 Y=00 SP=FC N=0 V=0 D=0 I=1 Z=0 C=0
8C82 : 8D 0F 86 : STA 860F       : 4 : A=02 X=00 Y=00 SP=FC N=0 V=0 D=0 I=1 Z=0 C=0
8C85 : AD 00 F4 : LDA F400       : 4 : A=00 X=00 Y=00 SP=FC N=0 V=0 D=0 I=1 Z=1 C=0
8C88 : 2C 01 86 : BIT 8601       : 4 : A=00 X=00 Y=00 SP=FC N=0 V=0 D=0 I=1 Z=1 C=0
8C8B : 40       : RTI            : 6 : A=00 X=00 Y=00 SP=FF N=0 V=1 D=0 I=0 Z=1 C=0
0202 : 80 FC    : BRA 0200       : 3 : A=00 X=00 Y=00 SP=FF N=0 V=1 D=0 I=0 Z=1 C=0
0200 : 00 00    : BRK #00        : 7 : A=00 X=00 Y=00 SP=FC N=0 V=1 D=0 I=1 Z=1 C=0
8C6F : 8D 00 F4 : STA F400       : 4 : A=00 X=00 Y=00 SP=FC N=0 V=1 D=0 I=1 Z=1 C=0
8C72 : 68       : PLA            : 4 : A=72 X=00 Y=00 SP=FD N=0 V=1 D=0 I=1 Z=0 C=0
...

In this sequence it resumed a process after an interrupt (8D08), executed a BRK (0200), the kernel determined that it was a BRK not an IRQ (8C6F-8C8F), the BRK handler went to a lot of trouble to read the BRK's operand from the user process's memory space (8D7F-8DB7) then used it to look up in a jump table (8DB9-8DBF), executed the syscall (8D7B-8D7C), then returned to the same process (8C92-8C8B) only to have it loop back and BRK again.

There is a lot of overhead to read the operand to the BRK instruction. I was trying to avoid needing the user to use registers for passing parameters, but I think that's a bad choice - it will cut a lot out if I just use A to select the syscall type, and ignore the BRK operand.

There's also the usual dilemma over how to determine that it was a BRK - here I did it the Acorn way, saving A somewhere and pulling the flags from the stack. It would be quite simple to add hardware to latch this flag and make it easy to read and branch on without needing any registers - we just need to track the last state of the D4 data line during a write operation, then the IRQ handler can begin with a BIT instruction to read that back. It would save a few instructions, or could even be used to make a different vector get called in hardware - however I don't think that's worth it as it's just a BIT and a branch that would be saved.
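
As a rough, untested sketch of what the latched version could look like: BRKLATCH is a hypothetical read-only port whose bit 7 returns the latched D4, and var_saveda, PID and isbrk are the names from the irqhandler listing further down. One assumption worth stating is that the latch follows every write, so the BIT has to come before the handler's own stores or they would overwrite the B bit from the pushed flags.

Code:

irqhandler:
    bit BRKLATCH        ; hypothetical port: bit 7 = D4 of the last bus write (the pushed P), copied into N
    sta var_saveda      ; STA and STZ do not change the flags...
    stz PID             ; switch to process 0
    bmi isbrk           ; ...so N still holds the latched B bit here
    ; the hardware IRQ path carries on from here as before

This replaces the "pla : pha" and "and #$10" with a single BIT, and A no longer has to be free before the test is made.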

Interrupt latency is something I've been conscious of and I'm interested in seeing how bad it is in practice, and what can be done to improve it. As yet I'm not using hardware interrupts for anything other than the preempting timer, but there's a 65C51 there which can issue receive interrupts, and I have it connected to the VIA's PB6 so that VIA T2 can issue transmit interrupts. I may also use the VIA's shift register, CB1 and CB2 for SD cards, PS/2, etc, at some point - so I will get to try these things out soon.

I wrote the code bearing this latency in mind, but there are no doubt more improvements to make. Here's the source code for the irqhandler routine that appears above:

Code:

irqhandler:
.(
    ; This could be an IRQ or a BRK.  We can check the stack to find out which.
    ; There's no need to be reentrant here, but while the active process is still selected,
    ; we mustn't write to zero page or the stack.
    sta var_saveda
    pla : pha

    ; Switch to process 0
    stz PID

    ; Is it a BRK?
    and #$10
    bne isbrk

    ; Handle the interrupt
    jsr irqhandler2

    ; Was it preempted?  If so pick another process
    bcs preempted

resume:
    ; If not, return to the same process
    lda zp_prevprocess : sta PID
    lda var_saveda
    bit ENDSUPER
    rti

isbrk:
    stx var_savedx
    jsr syscall
    ldx var_savedx
    bcc resume      ; resume current process if carry clear, otherwise fall through

preempted:
    ; Save the process's context and run the scheduler.  The process's flags and return
    ; address are already on its stack.  We free up the Y register, and use it to index
    ; into the process register arrays.

    phy   ; save Y temporarily

    ldy zp_prevprocess

    lda var_saveda : sta var_process_regs_a,y  ; restore A's value and save it
    txa : sta var_process_regs_x,y             ; save X's value
    pla : sta var_process_regs_y,y             ; restore Y's value and save it
    tsx : txa : sta var_process_regs_sp,y      ; save the stack pointer too

    ;jmp scheduler_run
.)

At the end it falls through to scheduler_run, so the jump is commented out.

irqhandler2 is a fairly standard hardware interrupt handling routine that queries various devices to see what needs attention. It can prioritise urgent devices of course. It returns with the carry set or clear to indicate whether we should switch processes or resume the same one. Resuming the same process is more efficient, so it only sets the carry if the interrupt was due to the preempting timer running out.

Resuming the same process is efficient because we haven't really corrupted its state much - the stack pointer is exactly as it left it, the X and Y registers haven't been touched, and the A register has been saved in a cheap global variable. All we need to do is restore the PID register setting, restore the A register, and return to user mode (the "resume" block above). If we're switching processes though, at the very least we need to store all the process state somewhere it can stay for the long term, then figure out what to run next, and then restore that process's state. The "preempted" block above does the first bit of this, but there is at least as much work still to follow in scheduler_run. So, returning to the same process is better in general.

Two things really affect interrupt latency. The first is any code that runs before the interrupt is handled, which is minimal here - in the above code that's really just the "stz PID". The second is any other code that runs with interrupts disabled, including the recovery after the interrupt (things like running the scheduler) and syscalls. I haven't implemented this, but I think a decent mitigation for the latter is to re-enable interrupts while still in supervisor mode wherever possible. For example, almost all syscall code can execute with interrupts enabled - the only thing that will interrupt it is an actual hardware interrupt, so a syscall would only need to disable interrupts if it were accessing some complex state that's also manipulated by the interrupt handling code. It should be possible to avoid conflicts between these things in general without having to disable interrupts.

Similarly, the epilogue of the interrupt handler might be able to run with interrupts enabled - as soon as the hardware interrogation is complete, perhaps we can enable interrupts again. This would allow another interrupt to be serviced quickly, if one happened to come in while the previous one was still working out how best to return to user mode.

That said, except in the "preempt" case, where we are switching processes, the return to user mode is not too bad. Here's a full capture of the preempt interrupt being handled, and returning to a different process:

Code:

0200 :          : INTERRUPT !!   : 7 : A=00 X=00 Y=00 SP=FC N=0 V=1 D=0 I=1 Z=1 C=0
8C6F : 8D 00 F4 : STA F400       : 4 : A=00 X=00 Y=00 SP=FC N=0 V=1 D=0 I=1 Z=1 C=0
8C72 : 68       : PLA            : 4 : A=62 X=00 Y=00 SP=FD N=0 V=1 D=0 I=1 Z=0 C=0
8C73 : 48       : PHA            : 3 : A=62 X=00 Y=00 SP=FC N=0 V=1 D=0 I=1 Z=0 C=0
8C74 : 9C 0F 86 : STZ 860F       : 4 : A=62 X=00 Y=00 SP=FC N=0 V=1 D=0 I=1 Z=0 C=0
8C77 : 29 10    : AND #10        : 2 : A=00 X=00 Y=00 SP=FC N=0 V=1 D=0 I=1 Z=1 C=0
8C79 : D0 11    : BNE 8C8C       : 2 : A=00 X=00 Y=00 SP=FC N=0 V=1 D=0 I=1 Z=1 C=0
8C7B : 20 09 8D : JSR 8D09       : 6 : A=00 X=00 Y=00 SP=FA N=0 V=1 D=0 I=1 Z=1 C=0
8D09 : AD 0D 86 : LDA 860D       : 4 : A=C0 X=00 Y=00 SP=FA N=1 V=1 D=0 I=1 Z=0 C=0
8D0C : 10 08    : BPL 8D16       : 2 : A=C0 X=00 Y=00 SP=FA N=1 V=1 D=0 I=1 Z=0 C=0
8D0E : 2D 0E 86 : AND 860E       : 4 : A=C0 X=00 Y=00 SP=FA N=1 V=1 D=0 I=1 Z=0 C=0
8D11 : 2C 0D 86 : BIT 860D       : 4 : A=C0 X=00 Y=00 SP=FA N=1 V=1 D=0 I=1 Z=0 C=0
8D14 : 70 02    : BVS 8D18       : 3 : A=C0 X=00 Y=00 SP=FA N=1 V=1 D=0 I=1 Z=0 C=0
8D18 : A9 40    : LDA #40        : 2 : A=40 X=00 Y=00 SP=FA N=0 V=1 D=0 I=1 Z=0 C=0
8D1A : 8D 0D 86 : STA 860D       : 4 : A=40 X=00 Y=00 SP=FA N=0 V=1 D=0 I=1 Z=0 C=0
8D1D : 38       : SEC            : 2 : A=40 X=00 Y=00 SP=FA N=0 V=1 D=0 I=1 Z=0 C=1
8D1E : 60       : RTS            : 6 : A=40 X=00 Y=00 SP=FC N=0 V=1 D=0 I=1 Z=0 C=1
8C7E : B0 17    : BCS 8C97       : 3 : A=40 X=00 Y=00 SP=FC N=0 V=1 D=0 I=1 Z=0 C=1
8C97 : 5A       : PHY            : 3 : A=40 X=00 Y=00 SP=FB N=0 V=1 D=0 I=1 Z=0 C=1
8C98 : A4 06    : LDY 06         : 3 : A=40 X=00 Y=01 SP=FB N=0 V=1 D=0 I=1 Z=0 C=1
8C9A : AD 00 F4 : LDA F400       : 4 : A=00 X=00 Y=01 SP=FB N=0 V=1 D=0 I=1 Z=1 C=1
8C9D : 99 00 F6 : STA F600,Y     : 5 : A=00 X=00 Y=01 SP=FB N=0 V=1 D=0 I=1 Z=1 C=1
8CA0 : 8A       : TXA            : 2 : A=00 X=00 Y=01 SP=FB N=0 V=1 D=0 I=1 Z=1 C=1
8CA1 : 99 00 F7 : STA F700,Y     : 5 : A=00 X=00 Y=01 SP=FB N=0 V=1 D=0 I=1 Z=1 C=1
8CA4 : 68       : PLA            : 4 : A=00 X=00 Y=01 SP=FC N=0 V=1 D=0 I=1 Z=1 C=1
8CA5 : 99 00 F8 : STA F800,Y     : 5 : A=00 X=00 Y=01 SP=FC N=0 V=1 D=0 I=1 Z=1 C=1
8CA8 : BA       : TSX            : 2 : A=00 X=FC Y=01 SP=FC N=1 V=1 D=0 I=1 Z=0 C=1
8CA9 : 8A       : TXA            : 2 : A=FC X=FC Y=01 SP=FC N=1 V=1 D=0 I=1 Z=0 C=1
8CAA : 99 00 F9 : STA F900,Y     : 5 : A=FC X=FC Y=01 SP=FC N=1 V=1 D=0 I=1 Z=0 C=1
8CAD : A4 06    : LDY 06         : 3 : A=FC X=FC Y=01 SP=FC N=0 V=1 D=0 I=1 Z=0 C=1
8CAF : A2 00    : LDX #00        : 2 : A=FC X=00 Y=01 SP=FC N=0 V=1 D=0 I=1 Z=1 C=1
8CB1 : C8       : INY            : 2 : A=FC X=00 Y=02 SP=FC N=0 V=1 D=0 I=1 Z=0 C=1
8CB2 : B9 00 F5 : LDA F500,Y     : 4 : A=01 X=00 Y=02 SP=FC N=0 V=1 D=0 I=1 Z=0 C=1
8CB5 : D0 06    : BNE 8CBD       : 3 : A=01 X=00 Y=02 SP=FC N=0 V=1 D=0 I=1 Z=0 C=1
8CE6 : 84 06    : STY 06         : 3 : A=02 X=00 Y=02 SP=FC N=0 V=1 D=0 I=1 Z=0 C=0
8CE8 : 8C 0F 86 : STY 860F       : 4 : A=02 X=00 Y=02 SP=FC N=0 V=1 D=0 I=1 Z=0 C=0
8CEB : BE 00 F9 : LDX F900,Y     : 4 : A=02 X=FC Y=02 SP=FC N=1 V=1 D=0 I=1 Z=0 C=0
8CEE : 9A       : TXS            : 2 : A=02 X=FC Y=02 SP=FC N=1 V=1 D=0 I=1 Z=0 C=0
8CEF : B9 00 F6 : LDA F600,Y     : 4 : A=00 X=FC Y=02 SP=FC N=0 V=1 D=0 I=1 Z=1 C=0
8CF2 : 8D 00 F4 : STA F400       : 4 : A=00 X=FC Y=02 SP=FC N=0 V=1 D=0 I=1 Z=1 C=0
8CF5 : BE 00 F7 : LDX F700,Y     : 4 : A=00 X=00 Y=02 SP=FC N=0 V=1 D=0 I=1 Z=1 C=0
8CF8 : B9 00 F8 : LDA F800,Y     : 4 : A=00 X=00 Y=02 SP=FC N=0 V=1 D=0 I=1 Z=1 C=0
8CFB : A8       : TAY            : 2 : A=00 X=00 Y=00 SP=FC N=0 V=1 D=0 I=1 Z=1 C=0
8CFC : AD 02 F4 : LDA F402       : 4 : A=04 X=00 Y=00 SP=FC N=0 V=1 D=0 I=1 Z=0 C=0
8CFF : 8D 05 86 : STA 8605       : 4 : A=04 X=00 Y=00 SP=FC N=0 V=1 D=0 I=1 Z=0 C=0
8D02 : AD 00 F4 : LDA F400       : 4 : A=00 X=00 Y=00 SP=FC N=0 V=1 D=0 I=1 Z=1 C=0
8D05 : 2C 01 86 : BIT 8601       : 4 : A=00 X=00 Y=00 SP=FC N=0 V=0 D=0 I=1 Z=1 C=0
8D08 : 40       : RTI            : 6 : A=00 X=00 Y=00 SP=FF N=0 V=1 D=0 I=0 Z=1 C=0

8C97-8CAA are the "preempted" block in the above code snippet; the code after that is what scheduler_run is doing. It loads the last process ID that ran, increments it, and scans the process array for the next runnable process. It finds it quickly as it's the very next one; however it's a lousy algorithm, I should probably use a run queue instead. Then it restores that process's context and executes the RTI to user mode.
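
A run queue could be as simple as a 256-byte ring of PIDs. This is only a sketch and not what the kernel currently does: runq, runq_head and runq_tail are hypothetical names for a page of kernel RAM and two index bytes, and it assumes each PID is queued at most once, so the ring can never fill up and head = tail always means "empty".

Code:

runq_push:              ; A = PID that has become runnable
    ldx runq_tail
    sta runq,x
    inc runq_tail       ; 8-bit index wraps at 256 by itself
    rts

runq_pop:               ; returns C=0 and A = next PID, or C=1 if nothing is runnable
    ldx runq_head
    cpx runq_tail
    beq runq_empty      ; equal indices: CPX has left C=1
    lda runq,x
    inc runq_head
    clc
    rts
runq_empty:
    rts

The scheduler would push the preempted PID back on the tail and pop the head, so picking the next process costs the same however many blocked processes there are.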

Overall I don't think this is too bad in terms of overheads, but there is still room for improvement, and enabling interrupts as early as possible would mitigate a lot of the effects on interrupt latency.

Something else that's come up is the ease or difficulty of reading user process memory, and especially following pointers there. Here's the code for the syscall routine, that reads the BRK operand. I think the comments explain the context pretty well:

Code:

syscall:
.(
    ; On entry:
    ;
    ;   PID = 0;         caller's PID is in zp_prevprocess
    ;   A is undefined;  caller's A is in var_saveda
    ;   X is caller's, and has been saved to var_savedx
    ;   Y is caller's
    ;   SP is two lower than caller's (our return address was pushed)

    ; Ideally here we'd re-enable interrupts as soon as we can, as servicing a system call is not high priority.
    ; But that's an improvement for later.

    ; We need to determine the type of system call and act accordingly.
    ;
    ; We'll read the user process's LP 0 mapping, and write it into our LP 1 so that we can read the user stack from there.  
    ; Then we'll have a user pointer to the BRK instruction, and can use it directly if it's into the user's LP 0, but will need
    ; to make another mapping change if it's to a different logical page in the user's address space.

    ldx zp_prevprocess
    lda $8000,x               ; read process's LP 0 mapping
    sta $9100                 ; set our LP 1 read mapping to the same page

    ; Here we use "inx" so that it wraps properly and we avoid issues with SP > $FC
    tsx
    inx : inx : inx : inx     ; SP => X and advance it to point at the PCL from the interrupt frame

    ; Copy the PC to zp_ptr, subtracting 1 as we go
    sec : lda $1100,x : sbc #1 : sta zp_ptr
    inx : lda $1100,x : sbc #0

    ; If zp_ptr is in a different LP, we need to map that one now
    cmp #$10 : bcc ismapped

    ; The right page is not mapped, map it now

    tax                       ; save the high byte of zp_ptr for now

    and #$f0                  ; extract LP number
    bpl a15clear : ora #2     ; set bit 1 if bit 7 is set (addr needs to be 1,A14,A13,A12,0,0,A15,RWB,PID7,PID6,...)
a15clear:
    ora #$80                  ; set bit 7
    sta zp_ptr2+1
    lda zp_prevprocess
    sta zp_ptr2

    lda (zp_ptr2)             ; Read the process's write mapping for the LP that zp_ptr is in
    sta $9100                 ; Apply it to our LP 1 (read)

    txa                       ; Restore zp_ptr's high byte

ismapped:
    ; The right page is mapped now - set zp_ptr's LP portion to 1
    ora #$10 : sta zp_ptr+1

    ; Read the value after BRK from user memory via LP 1
    lda (zp_ptr)

    ; Double it and jump via the jump table
    asl
    tax
    cpx #syscalljumptablesize    ; but not if it's too large
    bcs badsyscall

    jmp (syscalljumptable,x)

badsyscall:
    jmp error_badsyscall
.)

So each logical page is 4K - LP0 is from $0000-$0FFF, LP1 is from $1000-$1FFF, etc. The user's stack is in its logical page 0 as the four-character hex address starts with a 0. The kernel uses its own LP0 for its stack and zero page, so instead of mapping that, it maps its LP1 to the same physical page that the user process is using for its LP0. This means the user's stack at $0100 shows up in the kernel's address space at $1100 (i.e. $1000 + $0100).

Having adjusted the X register to point four entries higher than SP - so it points at PCL - we can then read the return address. It points at the instruction after the BRK, and we want the operand instead, so we subtract one from that address while copying it to zp_ptr.

Now if that address is also in the user process's LP 0 (i.e., it's less than $1000) then we're fine, we already have that mapped from the kernel's LP 1, so the kernel can read that byte, it just needs to adjust zp_ptr to point into LP1 instead of LP0 (adding $1000). But if the BRK instruction hadn't been in the user process's LP 0, then the kernel's current LP 1 mapping wouldn't work for it because it's the wrong physical page - so the code above would then look up what the user process's mapping actually is for whatever LP contains the address PC-1, and it would then remap the kernel's LP1 to point to that physical page instead of pointing to the user process's first physical page. After that it can then adjust zp_ptr to point into LP1 as before, and read the byte via LP1.

Note that the pagetable addresses are a bit swizzled - as the comment in the code says, whereas a user process would refer to a page like this:

Code:

    A15  A14  A13  A12  x    x    x    x    x    x    x    x    x    x    x    x

the pagetable entries for this logical page have the following form:

Code:

1 A14 A13 A12 0 0 A15 RWB PID7 PID6 PID5 ... PID0

Dealing with different pages is quite a bit of work, and if we wanted to do something like read a string, or just multiple bytes of any data, we'd have to be cautious about data that spans a logical page boundary in the user's address space, and make appropriate remappings to deal with that. What I wrote above was just my first shot at this, and I'm sure with more calls implemented I'll find cleaner and more generic ways to do this - but for the moment, it is awkward. It could be viable for most syscalls to require the user process to store any data that the kernel needs in its LP0 ($0000-$0FFF), so that the kernel doesn't need to map other pages - but that's a bit of a cop-out.
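
For comparison, here is roughly what the syscall entry might shrink to if the call number came in A instead of the BRK operand, as considered earlier. It is an untested sketch: it assumes syscalljumptablesize is the table size in bytes (the trace's "CPX #04" after doubling suggests so), and relies on X already being saved in var_savedx by the isbrk path.

Code:

syscall:
.(
    lda var_saveda                  ; caller's A now carries the syscall number
    cmp #syscalljumptablesize/2     ; range-check before doubling, so values >= $80 are rejected too
    bcs badsyscall
    asl                             ; two bytes per table entry
    tax
    jmp (syscalljumptable,x)

badsyscall:
    jmp error_badsyscall
.)

None of the page mapping is needed on this path, because nothing is read from the user's address space.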

BigEd

Post subject: Re: Fairly complete multiprocess computer design

Posted: Fri Dec 15, 2023 4:34 pm

Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10986
Location: England

Splendid! Doesn't seem too bad to me to have some restriction on where to put buffers for OS calls - especially if the OS arranges some space for the purpose in the typical user memory map, for example by making &400 the default address to load and run at. Or something.

(If you're running at 2MHz or more, then your interrupt latency is going to be somewhat like the beeb's, and that was fine.)

gfoot

Post subject: Re: Fairly complete multiprocess computer design

Posted: Fri Dec 15, 2023 5:45 pm

Joined: Fri Jul 09, 2021 10:12 pm
Posts: 741

Yes in practice it's probably not too bad to do that. Aside from this though, user processes don't really need to know anything about the paging system, page size, etc, that's why I'm a little resistant to it - and the OS doesn't put any constraints on how user processes use the address space, they get the full 64K of RAM to do whatever they want with, no locations have any special purpose.

One good strategy to avoid the restriction is to use the kernel's logical pages 1 and 2 together to span a region of user data - this can reliably support up to 4K of data with any offset. So trying to follow a linked list still wouldn't be very good, but for an OSWORD-like block of data this would work well and the user process wouldn't need to know anything apart from that the block must be less than 4K in size (which is huge for a 6502).

To see how this would work, consider for example a block of 32 bytes starting at $6FF3 in the user process's memory space, up to $7012. The user process would just send its pointer - $6FF3. The kernel would then see that this is in LP6, so it would map its own LP1 to wherever the user's LP6 is currently mapped, and in addition it would map its LP2 to wherever the user's next logical page - LP7 - is mapped. This way, the kernel now has two logical pages in a row mapped the same as the two logical pages of the user process, and the user's data block is available at $1FF3-$2012.

The second mapping could be skipped in cases where we know the block size doesn't cross a logical page boundary, but where the data size is not known up front I think this method could be useful, and would be easy to encapsulate generically for lots of kernel routines to use.
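
A generic helper for this might look something like the following. It is an untested sketch that reuses the pagetable address calculation from the syscall routine; mapuserblock and lptoptaddr are made-up names, and $A100 as the kernel's LP 2 read mapping is derived from the pagetable address layout shown earlier rather than taken from working code. Like the original, it copies the user's write mapping into the kernel's read mapping.

Code:

; Entry: A = high byte of the user pointer, zp_prevprocess = caller's PID
; Exit:  A = high byte to use for the same data in kernel space ($1x)
mapuserblock:
.(
    pha                     ; keep the user's high byte (kernel stack, PID is 0 here)
    jsr lptoptaddr
    lda (zp_ptr2)           ; user's mapping for the LP containing the pointer
    sta $9100               ; ...becomes our LP 1 (read)
    pla : pha
    clc : adc #$10          ; next logical page, wrapping from LP15 to LP0
    jsr lptoptaddr
    lda (zp_ptr2)           ; user's mapping for the following LP
    sta $a100               ; ...becomes our LP 2 (read) - address assumed from the layout
    pla
    and #$0f : ora #$10     ; same offset within the page, relocated into LP 1
    rts

lptoptaddr:                 ; A = address high byte => zp_ptr2 = pagetable entry address
    and #$f0                ; extract LP number
    bpl a15clear : ora #2   ; A15 moves down to bit 1
a15clear:
    ora #$80                ; bit 7 is always set in pagetable addresses
    sta zp_ptr2+1
    lda zp_prevprocess
    sta zp_ptr2
    rts
.)

With the $6FF3 example, the routine is entered with A=$6F and returns A=$1F, so the kernel reads the block from $1FF3 onwards and simply runs on into LP2 at $2000.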

I have uploaded my code so far in case anybody is interested to see more of it: https://github.com/gfoot/multitasking6502/tree/main/src The OS code is in the "mtos" subdirectory, the rest are test programs and ROM images.

BigDumbDinosaur

Post subject: Re: Fairly complete multiprocess computer design

Posted: Fri Dec 15, 2023 5:56 pm

Joined: Thu May 28, 2009 9:46 pm
Posts: 8514
Location: Midwestern USA

gfoot wrote:

My multitasking kernel has come together quite well so far...There is a lot of overhead to read the operand to the BRK instruction.

That’s the unfortunate reality when the MPU doesn’t have any stack addressing instructions. Things do go faster with the 65C816, since less code is needed. I use COP #<i> to make firmware API calls in my POC V1.3 unit, where <i> is the API index. The most recent iteration of the COP front end uses about 85 clocks to save machine state, fetch the index from the caller’s execution space and direct execution to the selected API...and that is with using stack-relative addressing and 16-bit registers to do the grunt work. It would be a lot slower if I had to do it a byte at a time and without the handy <offset>,S addressing mode.

Quote:

I was trying to avoid needing the user to use registers for passing parameters, but I think that’s a bad choice - it will cut a lot out if I just use A to select the syscall type, and ignore the BRK operand.

One model to follow is the UNIX/Linux method. A register (typically EAX in x86 hardware or D0 in Motorola hardware) is loaded with the API index and parameters needed by the call are pushed to the stack, followed by a software interrupt. With the 65C02 lacking convenient stack access instructions, parsing a stack frame in the API demands the use of complicated, i.e., slow, code. Also, there is the problem of differentiating a BRK from an IRQ.

Probably better with the 65C02 is to point to a parameter block somewhere in user space, using two registers—.A and .Y would be my choice, with the block’s LSB in .A, and load the API index in .X, with the index always being an even value. For example, API #3 would be described in .X as 6, avoiding a shift on the API number to get a jump table index. A JMP (<apitab>,X) instruction would run the desired API, where <apitab> is an internal vector table.

In my POC firmware, I actually use the 816’s JSR (<APITAB>,X) instruction to vector API service requests. With that arrangement, each API module can simply use RTS to return to the API back end. That arrangement also simplifies internal calls to the API.

Quote:

Interrupt latency is something I’ve been conscious of...As yet I’m not using hardware interrupts for anything other than the preempting timer, but there’s a 65C51...

There are several ways you can ease the problem a bit in hardware. One would be to wire IRQ sources to a priority encoder and read the encoder after your IRQ preamble has executed. You’d only service the devices that the encoder says are interrupting. Reading the encoder entails a small amount of overhead, but that is more than offset by not having to poll everything on every IRQ.

Another method is to wire the IRQ outputs of each device to a negative logic bus transceiver, such as a 74AHCT540, as well as to the inputs of an AND gate (each input has to be pulled up if the mating device has an open-drain IRQ output). The latter’s output would drive the C02’s IRQB. The former would be mapped into I/O space and when an IRQ occurs, your ISR will read the transceiver, which will return a bit pattern of active IRQs; a set bit means that particular device is interrupting. A simple loop and some shifting is all it would take to find out who’s interrupting.
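
A minimal sketch of that loop, assuming A and X have already been saved by the ISR preamble. IRQSTAT (where the ’540 is mapped), irqtable (eight handler vectors, bit 0 first) and irqspurious are all hypothetical names:

Code:

irqdispatch:
.(
    lda IRQSTAT             ; set bit = that device is interrupting
    ldx #0                  ; X = table offset for the device on bit 0
next:
    lsr                     ; move the lowest untested bit into carry
    bcs found
    inx : inx               ; two bytes per vector
    cpx #16                 ; all eight bits tested?
    bne next
    jmp irqspurious         ; nothing was asserting
found:
    jmp (irqtable,x)        ; 65C02 indexed indirect jump to that device's handler
.)

Testing from bit 0 upwards gives the device on bit 0 the highest priority, so the most urgent source would be wired there.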

Quote:

I wrote the code bearing this latency in mind, but there are no doubt more improvements to make. Here’s the source code for the irqhandler routine that appears above:

Code:

irqhandler:
.(
    ; This could be an IRQ or a BRK.  We can check the stack to find out which.
    ; There’s no need to be reentrant here, but while the active process is still selected,
    ; we mustn’t write to zero page or the stack.
    sta var_saveda
    pla : pha

    ; Switch to process 0
    stz PID

    ; Is it a BRK?
    and #$10
    bne isbrk

    ; Handle the interrupt
    jsr irqhandler2   <———

...

You should avoid subroutine calls in interrupt handlers. This especially becomes a concern with interrupt-driven I/O, where that JSR - RTS pair may get executed hundreds or even thousands of times per second.

_________________
x86? We ain't got no x86. We don't NEED no stinking x86!

BigEd

Post subject: Re: Fairly complete multiprocess computer design

Posted: Fri Dec 15, 2023 6:05 pm

Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10986
Location: England

Indeed, the operand byte of BRK, or for that matter COP or WDM, is just not very useful. The value in A is a good signifier.

(Acorn's Communicator has a dual ABI - both vector based and software interrupt based - it might be worth a look, or it might be best regarded as a failed experiment)

BigEd

Post subject: Re: Fairly complete multiprocess computer design

Posted: Fri Dec 15, 2023 6:06 pm

Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10986
Location: England

Using two supervisor pages to map an unaligned user data structure seems like a fine idea!

BigDumbDinosaur

Post subject: Re: Fairly complete multiprocess computer design

Posted: Fri Dec 15, 2023 7:23 pm

Joined: Thu May 28, 2009 9:46 pm
Posts: 8514
Location: Midwestern USA

BigEd wrote:

Indeed, the operand byte of BRK, or for that matter COP or WDM, is just not very useful. The value in A is a good signifier.

(Acorn’s Communicator has a dual ABI - both vector based and software interrupt based - it might be worth a look, or it might be best regarded as a failed experiment)

In my nascent 816NIX kernel, I’m looking at calling APIs by pushing a stack frame, loading .A with the API index and then using COP #0 to call the service. COP’s operand could be anything, however, since it would only be present to satisfy syntactical requirements. Using a stack frame to pass parameters into the API frees me of any limitations imposed by having only three 16-bit, general-purpose registers. Stack manipulation with the 816 is fast and relatively easy, so there’s no reason to avoid it.

_________________
x86? We ain't got no x86. We don't NEED no stinking x86!

Proxy

Post subject: Re: Fairly complete multiprocess computer design

Posted: Fri Dec 15, 2023 9:16 pm

Joined: Fri Aug 03, 2018 8:52 am
Posts: 746
Location: Germany

ooh, i'd be interested to hear more about your 816NIX kernel BDD!
i've been trying to get some work done on my own tiny little OS for the 65816, though my idea was to keep the user and kernel stack (and direct page) separate, so passing things through the stack wouldn't really work. instead i want to use the signature byte and accumulator for selecting a specific function, and the X/Y Registers to hold a 32-bit pointer to the process' memory where the data for the system call is located. though i'm kinda thinking of reworking that system to just have a single stack.

BigEd

Post subject: Re: Fairly complete multiprocess computer design

Posted: Fri Dec 15, 2023 9:26 pm

Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10986
Location: England

Sorry George...

drogon

Post subject: Re: Fairly complete multiprocess computer design

Posted: Fri Dec 15, 2023 10:10 pm

Joined: Wed Feb 14, 2018 2:33 pm
Posts: 1488
Location: Scotland

Proxy wrote:

That's effectively the OSWORD call in the Acorn MOS which I more or less wrote my own implementation of for my Ruby 6502 and 816 boards.

e.g. to call OSWORD 0 which is "read in a line of text":

Code:

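        ldx     #<getLineData           ; X = low byte of the parameter block address
        ldy     #>getLineData           ; Y = high byte of the parameter block address
        lda     #0                      ; OSWORD 0: read a line of text
        jsr     osWord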

and elsewhere:

Code:

getLineData:
        .word   lineInput               ; Address of input buffer
        .byte   maxLen                  ; Max length
        .byte   32                      ; Smallest value to accept
        .byte   126                     ; largest...

I decided when moving to the '816 that my executable code would only ever run in bank 0, but there's no fundamental reason for this limitation, and the call itself - in this case it's a JSR - could be made by other means.

OSWORD does machiney sort of stuff like reading text, sound, etc., and there are other calls that work in a similar manner for the filing system. For other local stuff with fewer parameters there is OSBYTE - code in A, parameter bytes in X and Y.

-Gordon

_________________
--
Gordon Henderson.
See my Ruby 6502 and 65816 SBC projects here: https://projects.drogon.net/ruby/

BigDumbDinosaur

Post subject: Re: Fairly complete multiprocess computer design

Posted: Sat Dec 16, 2023 7:05 am

Joined: Thu May 28, 2009 9:46 pm
Posts: 8514
Location: Midwestern USA

Proxy wrote:

ooh, i’d be interested to hear more about your 816NIX kernel BDD!

I’ve mentioned bits and pieces of it in the past around here, but for the moment, can’t find them.

Quote:

i’ve been trying to get some work done on my own tiny little OS for the 65816, though my idea was to keep the user and kernel stack (and direct page) separate, so passing things through the stack wouldn’t really work.

Separate stacks can be done. Picture this...

After the front end of your kernel API handler has taken care of preliminaries, it copies the current stack pointer, which is pointing at the bottom of the user stack -1, into the accumulator, increments the accumulator and then stores the result in a convenient location on the kernel’s direct page, such as at userSP, for example. userSP would be 3 contiguous direct page bytes, with userSP+2 having $00 stored in it. The kernel then loads its own stack pointer that it had saved during a previous API call and goes on with its business.

If the kernel needs to access something on the user’s stack, it does it with indirect addressing. For example, to read the lowest 16-bit word on the user’s stack, all you need is LDA [userSP], assuming the accumulator is set to 16 bits. Need to rewrite the word at userSP+4? LDY #4 followed by STA [userSP],Y will do it.

As an alternative, you could do LDA userSP followed by TCD. Now, the user stack is temporarily direct page. With that done, you would do STA 4 to rewrite the word at userSP+4.
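The steps above can be sketched roughly as follows, assuming native mode. KERNEL_DP, kernelSP and the location of userSP on the kernel's direct page are hypothetical, and what actually sits at each offset on the user stack depends on what the front end pushed before this point:

Code:

        rep     #$30                    ; 16-bit accumulator and index registers
        lda     #KERNEL_DP              ; hypothetical address of the kernel's direct page
        tcd
        tsc                             ; .C = user SP, one below the last byte pushed
        inc     a                       ; address of the lowest byte in use
        sta     userSP                  ; low 16 bits of the 24-bit pointer
        sep     #$20                    ; 8-bit accumulator so STZ writes a single byte
        stz     userSP+2                ; bank byte = $00, the stack always lives in bank 0
        rep     #$20                    ; back to a 16-bit accumulator
        lda     kernelSP                ; kernel SP saved during a previous API call
        tcs                             ; now running on the kernel stack

        lda     [userSP]                ; read the lowest 16-bit word on the user stack
        lda     #$0000                  ; example value to write back
        ldy     #4
        sta     [userSP],y              ; rewrite the word at userSP+4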

Quote:

instead i want to use the signature byte and accumulator for selecting a specific function...

Uh...you want to use one or the other, not both. If you use the signature, it becomes the API index. Otherwise, you load .A with the API index and ignore the signature.

Fetching the signature is a convoluted process, since it will be located in the bank in which the calling program is running, which won’t necessarily be the bank in which the kernel is running. Either you have to diddle with DB (data bank) or you have to set up a 24-bit pointer on direct page to fetch the signature. Neither method is very speedy, something I can tell you from experience. :o
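A rough sketch of the direct page pointer method, assuming native mode, that nothing has been pushed since the COP or BRK was taken, and that the direct page register already points at the kernel's direct page. sigPtr is a hypothetical 3-byte direct page variable:

Code:

        rep     #$20                    ; 16-bit accumulator
        lda     2,s                     ; return address pushed by the interrupt
        dec     a                       ; the signature is the byte just before it
        sta     sigPtr                  ; low 16 bits of the 24-bit pointer
        sep     #$20                    ; 8-bit accumulator
        lda     4,s                     ; caller's program bank (PB)
        sta     sigPtr+2                ; bank byte of the pointer
        lda     [sigPtr]                ; .A = signature byte

That is eight instructions before any dispatching happens, compared with none when the index arrives in a register.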

My preferred method is to push a stack frame with whatever parameters are needed, load a register with the API index and then do COP #0 to call the API, although any signature could be used, as it is strictly there for syntactical reasons.

Quote:

...and the X/Y Registers to hold a 32-bit pointer to the process’ memory where the data for the system call is located. though i’m kinda thinking of reworking that system to just have a single stack.

Pushing pointers to data into a stack frame is more efficient, since the memory usage is ephemeral. The 816 makes this pretty easy to do. Best part is the stack frame can be temporarily turned into direct page, opening the door to all sorts of programming stunts.

gfoot

Post subject: Re: Fairly complete multiprocess computer design

Posted: Sat Dec 16, 2023 11:29 am

Joined: Fri Jul 09, 2021 10:12 pm
Posts: 741

I think using at least one register to pass the API number is sensible, for the 65C02, and making all the API call indexes be even is an interesting idea. I had been avoiding registers because one idea was to support both the OSBYTE and the OSWORD systems, as separate system calls - and each already requires potentially all three registers - so at one point I thought I could use the BRK operand to select an API call (perhaps the last byte of Acorn's entry point for the API) leaving all three registers free for passing the parameters. But it's not worth the overhead just for compatibility with that API.

It's not too hard to read data from the user's stack, in my case - just a load and a store to set up the paging - so using the registers plus pushing any additional data to the stack would work. It takes a few more instructions to figure out the pagetable addresses for an arbitrary pointer - I could simplify that a bit by adding one more multiplexer to the circuit, but I cut that corner. API calls that need a lot of data are probably rather slow anyway so it doesn't matter too much if there's a little overhead here.

So something like an OSFILE equivalent which for the sake of argument uses Y and X to point to a parameter block, whose first entry is a two-byte pointer to a filename string, with the rest of the bytes providing other data about what we want to do with the file (load it, save to it, etc), would require mapping the right page for the block (maybe two pages if it crosses a page boundary), then mapping another two pages for the filename - but whatever we're going to do to the file is going to take ages anyway, and we'll have to do more mappings for reading and writing the actual data, so the efficiency of the initial part is not too important.

These kinds of parameter blocks are really useful, compared to pushing a lot of data onto the stack, because they reduce the amount of data churn considerably. Sometimes you embed a parameter block into a program with its fields already filled in; sometimes you alter a parameter block and reuse it (e.g. changing a "load" into a "save", or making it save to a different filename by changing the filename pointer); and sometimes you just have several different blocks at different locations doing different things. The flexibility is really powerful, and pushing that much data to the stack would be quite expensive, so I'm keen to keep supporting them, and also allow them to be located anywhere in user memory as embedding them in programs is useful and flexible.

But using a register for the initial API number makes sense to at least allow relatively simple syscalls to be decoded and executed quickly. Making them be multiples of two is appealing but I think the cost of then checking that they are multiples of two would actually cancel out the gains. I'll think about it though. It is important that the kernel doesn't just call random addresses when given an odd syscall number. Receiving the syscall number in A allows it to be bounds checked, then doubled, then moved to X fairly efficiently.
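A minimal 65C02 sketch of that bounds check, double and move to X sequence. NUM_SYSCALLS, syscall_table and the handler labels are hypothetical, and the handlers and bad_syscall are assumed to be defined elsewhere in the kernel:

Code:

NUM_SYSCALLS = 2                        ; must not exceed 128 so the ASL cannot overflow

dispatch:
        cmp     #NUM_SYSCALLS           ; bounds check the syscall number in A
        bcs     bad_syscall             ; A >= NUM_SYSCALLS: reject it
        asl     a                       ; double it, each table entry is two bytes
        tax
        jmp     (syscall_table,x)       ; 65C02 indexed indirect jump

syscall_table:
        .word   sys_nop                 ; syscall 0
        .word   sys_yield               ; syscall 1

Because the kernel does the doubling itself, the index into the table is always even and there is nothing extra to validate.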

I think - eventually, and with some care - all the above can be done with interrupts reenabled, as well, so won't cause further issues for interrupt latency - as soon as I know it's a syscall, not a hardware interrupt, they can be turned back on. The preempt interrupt is the thorny one there, do we allow task switching during a syscall? We can't really allow another process to also issue a syscall, I don't think that would be practical to support. So perhaps we'd need to disable (or just ignore) the preempt interrupt during the syscall, but unmask interrupts in general. If a preempt interrupt does occur then the syscall can finish its work, but return control to a new process, so that the preempt is obeyed.

On interrupt latency - the jsr in my interrupt service routine is there because the reset handler also chains to the same nested subroutine, so that it can trigger a cycle of interrupt handling before going ahead and killing the process that stopped responding. I'll see if there's another way to do that - I do find it scary how much code actually needs to run every time a character is received, for example - but then, the early 80s systems also had to do this, at lower clock speeds, and Acorn's at least had quite a substantial amount of code in its ISR, so I guess it just is what it is! The fastest rate I've received serial data on a BBC Micro is 76800 baud, but I had to disable the interrupt handler to do this, and poll the 6850 directly. Any situation with the interrupt handler enabled led to data loss due to latency caused by servicing other devices, combined with the lack of FIFOs in the UART.

I've been looking for an excuse to use a priority encoder, but I don't think it'll be useful here as I only have two devices that can cause interrupts - the 65C51 and the 65C22. The 65C51 can only cause receive interrupts, which are urgent, so I'll poll those first. The 65C22 doesn't have any urgent interrupts yet - it will handle transmit interrupts, and already causes the preempt interrupts, both of which are not time-sensitive. Later it may have more urgent ones, particularly if I use it for PS/2 communication where there's an external clock involved, and I'll just need to check for those first after the 65C51 ones (or before, depending which ends up being more critical or expensive to handle).

Thanks for all the thoughts, opinions, and advice, it is interesting to see different perspectives!

My roadmap at the moment is:

Implement support for loading user processes from the server - at the moment it just pokes specific code into the processes when it starts them, so it's hard to iterate on user code
Implement print character / print string / get character APIs so user processes are more interesting
Make the serial input be interrupt-driven
Make the serial output be interrupt-driven
Measure how bad the interrupt latency is, how close we are to losing data
Port GIBL or something similar - an interesting option here is putting the whole of GIBL in a shared read-only page, so that many user processes can share the interpreter

AndrewP

Post subject: Re: Fairly complete multiprocess computer design

Posted: Sat Dec 16, 2023 7:02 pm

Joined: Mon Aug 30, 2021 11:52 am
Posts: 287
Location: South Africa

gfoot wrote:

I think - eventually, and with some care - all the above can be done with interrupts reenabled, as well, so won't cause further issues for interrupt latency - as soon as I know it's a syscall, not a hardware interrupt, they can be turned back on. The preempt interrupt is the thorny one there, do we allow task switching during a syscall? We can't really allow another process to also issue a syscall, I don't think that would be practical to support. So perhaps we'd need to disable (or just ignore) the preempt interrupt during the syscall, but unmask interrupts in general. If a preempt interrupt does occur then the syscall can finish its work, but return control to a new process, so that the preempt is obeyed.

That's my (half formed) thought too. If the timer interrupt ticks during a syscall then set a flag or something saying the timer has ticked and RTI immediately to the syscall. Then at the end of the syscall check if that flag has been set. If so, do the normal preemptive scheduling work and return control to the next user process afterwards.
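A rough 65C02 sketch of that flag scheme, assuming the preempt tick comes from VIA timer 1 and that the ISR front end has already saved A. VIA_T1CL, in_syscall, preempt_pending, do_preempt, irq_exit and return_to_caller are all hypothetical names:

Code:

timer_irq:
        bit     VIA_T1CL                ; reading T1C-L clears the timer 1 interrupt flag
        bit     in_syscall              ; bit 7 is set while a syscall is running
        bpl     do_preempt              ; user code was interrupted: switch tasks now
        lda     #$80
        sta     preempt_pending         ; remember the tick for later
        jmp     irq_exit                ; restore registers and RTI back into the syscall

syscall_exit:
        bit     preempt_pending         ; did a tick arrive during the syscall?
        bpl     return_to_caller        ; no: resume the same process
        stz     preempt_pending         ; yes: consume the flag...
        jmp     do_preempt              ; ...and hand control to the next process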

drogon

Post subject: Re: Fairly complete multiprocess computer design

Posted: Sat Dec 16, 2023 8:15 pm

Joined: Wed Feb 14, 2018 2:33 pm
Posts: 1488
Location: Scotland

gfoot wrote:

My roadmap at the moment is:

Implement support for loading user processes from the server - at the moment it just pokes specific code into the processes when it starts them, so it's hard to iterate on user code
Implement print character / print string / get character APIs so user processes are more interesting
Make the serial input be interrupt-driven
Make the serial output be interrupt-driven
Measure how bad the interrupt latency is, how close we are to losing data
Port GIBL or something similar - an interesting option here is putting the whole of GIBL in a shared read-only page, so that many user processes can share the interpreter

That would be awesome... GIBL is practically stateless, to the point that you would only need to swap out page 0 per process, plus whatever other pages it needs for data if it won't all fit into ZP.

A trick you can use with interrupt driven serial IO is to make the buffers a power of 2 and aligned on a power of 2, and keep a head and tail pointer. Makes wrapping easy and fast and you don't need to disable interrupts when the reader pulls data out and updates the tail pointer, or the writer pushes data in and updates the head pointer. A "gotcha" might be when using the same serial line for multiple processes - simple characters are fine, but swapping processes when half way through a VDU sequence can produce some interesting results...
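A minimal 65C02 sketch of such a buffer for the receive side. It uses 8-bit indexes rather than pointers, and all names are hypothetical: rx_buf is RX_SIZE bytes of RAM, and rx_head and rx_tail are single bytes initialised to 0. The ISR is assumed to have saved X before calling rx_put:

Code:

RX_SIZE = 64                            ; must be a power of two, at most 256 bytes of RAM are indexed

rx_put:                                 ; ISR side: A = received byte
        ldx     rx_head
        sta     rx_buf,x                ; the slot at head is always free
        inx
        txa
        and     #RX_SIZE-1              ; wrap the index with a mask
        cmp     rx_tail
        beq     rx_full                 ; would catch up with tail: drop the byte
        sta     rx_head                 ; publish the byte with a single store
rx_full:
        rts

rx_get:                                 ; reader side: C=1 and byte in A, or C=0 if empty
        ldx     rx_tail
        cpx     rx_head
        beq     rx_empty
        lda     rx_buf,x
        pha
        inx
        txa
        and     #RX_SIZE-1              ; wrap the index with a mask
        sta     rx_tail                 ; single-byte store, so no SEI/CLI is needed
        pla
        sec
        rts
rx_empty:
        clc
        rts

Only the ISR writes rx_head and only the reader writes rx_tail, which is why neither side needs to mask interrupts. One slot is always left empty so that head = tail unambiguously means "empty".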

Here is your challenge:

https://www.youtube.com/watch?v=ZL1VI8ezgYc

That's all done via a 115200 baud serial line...

-Gordon

gfoot

Post subject: Re: Fairly complete multiprocess computer design

Posted: Sat Dec 16, 2023 8:25 pm

Joined: Fri Jul 09, 2021 10:12 pm
Posts: 741

AndrewP wrote:

Yes, I use a VIA so I can easily disable that interrupt source during a syscall and still check whether the interrupt would have fired if it was enabled.
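A sketch of how that could look on a 6522, assuming the preempt interrupt comes from timer 1 (bit 6 of IFR and IER). VIA_BASE is a placeholder address, and do_preempt and return_to_caller are hypothetical labels:

Code:

VIA_BASE = $8000                        ; placeholder, not the real address
VIA_IFR  = VIA_BASE+13                  ; interrupt flag register
VIA_IER  = VIA_BASE+14                  ; interrupt enable register

syscall_enter:
        lda     #%01000000              ; bit 7 clear = "disable", bit 6 = timer 1
        sta     VIA_IER                 ; timer 1 can no longer raise an IRQ
        cli                             ; other interrupt sources are serviced as usual
        ; ... syscall body ...
        sei
        lda     #%11000000              ; bit 7 set = "enable", bit 6 = timer 1
        sta     VIA_IER
        bit     VIA_IFR                 ; BIT copies bit 6 of IFR into the V flag
        bvs     do_preempt              ; the timer expired during the syscall
        jmp     return_to_caller

The flag in IFR is still set when the timer expires while disabled in IER, so no separate software flag is needed.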

This has made me realise why something I tried earlier today didn't work - after hooking up support for loading program code into new processes from the serial host, I tried making the child process disable interrupts, to check that the watchdog timer still worked:

Code:

    sei                 ; user process masks interrupts
loop:
    brk : brk           ; syscall; the second BRK byte ($00) is the operand
    bra loop

However the timer didn't trigger. The reason is that the BRK is causing supervisor mode to be active, which resets the watchdog, and even though the preempt interrupt doesn't get serviced, the watchdog also never picks up on this.

It is an edge case with a few possible solutions, but is also related to whether syscalls enable interrupts. To some extent it doesn't make sense for code to disable interrupts and then issue a system call. Acorn's stance on this was generally that the OS might enable interrupts while processing the API call.

The value of a user process disabling interrupts at all is questionable and I could clear the I flag on the stack before RTI from the syscall, or make the syscall handler chain into the interrupt service routine at its end in case anything is pending.
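A minimal sketch of the first option on a 65C02, assuming the saved status byte is on the currently mapped stack page and that the handler has pushed A, X and Y on top of the BRK frame. SAVED_REGS is a hypothetical constant for that:

Code:

SAVED_REGS = 3                          ; bytes the handler pushed after the BRK frame

        tsx
        lda     $0101+SAVED_REGS,x      ; status byte pushed by BRK
        and     #%11111011              ; clear the I flag (bit 2)
        sta     $0101+SAVED_REGS,x      ; RTI will now return with interrupts enabled

This runs before the registers are restored, since it clobbers A and X.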

I also wondered a bit about user processes getting callbacks from interrupts, so they can execute their own asynchronous responses. I think I could do this by manipulating the process stack, pushing an extra frame, so that it returns into this callback, and the callback itself would then RTI when it is finished. This would allow things like timed callbacks, vsync callbacks, or notifications when I/O is available, without requiring threads and select().
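A rough 65C02 sketch of the extra-frame idea, assuming the interrupted process's frame (P, PCL, PCH) is already on the current stack and X and Y have been restored. cb_vec and saved_a are hypothetical locations holding the callback address and the process's A:

Code:

        lda     cb_vec+1
        pha                             ; callback address, high byte
        lda     cb_vec
        pha                             ; callback address, low byte
        lda     #%00100000              ; status for the callback: I clear, flags clear
        pha
        lda     saved_a                 ; restore the process's A
        rti                             ; "returns" into the callback

The callback would have to preserve A, X and Y itself and finish with RTI, which then pops the original frame and resumes the interrupted code.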

Threads, now that's a whole other topic...
