I'll be a contrarian. If 256 bytes is enough for a system monitor, then 256 bytes is enough for a task switcher or bank switcher.
I'll describe a hypothetical implementation. BigDumbDinosaur will regard this description as redundant compared to the 65816. Dr Jefyll will regard this description as redundant compared to one or more bank registers which are populated or briefly selected with idle 65C02 opcodes. This allows contiguous access to 64KB banks with relatively little overhead.
Assume we want a 24 bit address-space where each 16 bit bank has page $FE for I/O and page $FF for ROM. The remaining 63.5KB is bank switched RAM. Bank zero has operating system state and each non-zero bank has one application or may be used to hold data. The I/O page has a write-only latch to set the bank number. It also has two addresses: a write to one overrides the bank number and selects bank zero, and a write to the other removes the override. The bank number is set to zero upon reset.
An application may make a system call and the system call may resemble a Commodore or Acorn system call. As an example, JSR $FFD2 may execute JMP $FF00, which may execute STA $FE00, continue execution at $FF03 in bank zero and then jump to an arbitrary address in the bottom 63.5KB of bank zero. At the end of every system call, it is possible to jump back to an arbitrary address within page $FF, STA $FE01 to select the application's bank and RTS. We probably also require some sacrificial NOPs to make interrupts work correctly.
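As a rough sketch of the latch behaviour, here is a small Python model. It only models address translation, not the CPU. The class name, the method names and the latch address ($FE02) are my own assumptions for illustration. The description above only fixes $FE00 as the "force bank zero" address and $FE01 as the "back to the application's bank" address.

```python
IO_PAGE = 0xFE   # page $FE is I/O in every bank
ROM_PAGE = 0xFF  # page $FF is ROM in every bank


class BankLatch:
    """Hypothetical model of the bank latch and its bank-zero override."""

    def __init__(self):
        self.bank = 0            # bank number is zero upon reset
        self.force_zero = False  # assumed reset state of the override

    def write_io(self, offset, value):
        """Model a write to page $FE. Only the write itself matters for
        offsets 0 and 1; the value written is ignored."""
        if offset == 0x00:       # STA $FE00: override, select bank zero
            self.force_zero = True
        elif offset == 0x01:     # STA $FE01: remove override
            self.force_zero = False
        elif offset == 0x02:     # assumed address of the write-only latch
            self.bank = value & 0xFF

    def translate(self, addr):
        """Map a 16 bit CPU address to (region, address)."""
        page = (addr >> 8) & 0xFF
        if page == IO_PAGE:
            return ("io", addr & 0xFF)
        if page == ROM_PAGE:
            return ("rom", addr & 0xFF)
        bank = 0 if self.force_zero else self.bank
        return ("ram", (bank << 16) | (addr & 0xFFFF))


# Walk through one system call from an application in bank 5.
latch = BankLatch()
latch.write_io(0x02, 5)              # operating system starts application
app_view = latch.translate(0x1234)   # application sees bank 5
latch.write_io(0x00, 0)              # STA $FE00 inside page $FF
os_view = latch.translate(0x1234)    # same CPU address, now bank zero
latch.write_io(0x01, 0)              # STA $FE01 before RTS
```

Page $FF translates the same way whichever bank is selected, which is why the switching code has to live there.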
Unfortunately, there is a huge limitation to this arrangement. Very little context is available after bank switching. Specifically, only the contents of RegA, RegX, RegY and flags. If the operating system is required to read the application's memory, then it has to come back and snoop. This is easiest if there are two or more reserved bytes in zero page. If not, one pass is required to collect and save two bytes from the application's zero page. Another pass is required to read or write one byte with (zp,X) or 65C02 (zp) address mode. A third pass is required to restore the application's zero page. At this point, Dr Jefyll may be bristling that we should use the K0, K1, K2 and K3 read/write latch arrangement which was devised around 1988 for this specific purpose.
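The three passes can be sketched in Python, assuming no zero page bytes are reserved. This is only a model of the data movement: the `Machine` class, the `snoop_read` name and the choice of $FB/$FC as the borrowed zero page pair are all hypothetical. On real hardware each pass would be a trip into page $FF with the application's bank selected, carrying values across in the registers.

```python
class Machine:
    """Hypothetical sparse model of banked RAM: (bank, address) -> byte."""

    def __init__(self):
        self.ram = {}

    def peek(self, bank, addr):
        return self.ram.get((bank, addr & 0xFFFF), 0)

    def poke(self, bank, addr, value):
        self.ram[(bank, addr & 0xFFFF)] = value & 0xFF


def snoop_read(machine, app_bank, target, zp=0xFB):
    """Read one byte of application memory by borrowing two zero page bytes.

    Limitation of this sketch: if `target` is one of the borrowed bytes,
    the value returned is the temporary pointer, not the original content.
    """
    # Pass 1: collect two bytes from the application's zero page.
    saved_lo = machine.peek(app_bank, zp)
    saved_hi = machine.peek(app_bank, zp + 1)

    # Pass 2: plant a pointer and read through it, like LDA (zp) on 65C02.
    machine.poke(app_bank, zp, target & 0xFF)
    machine.poke(app_bank, zp + 1, (target >> 8) & 0xFF)
    pointer = machine.peek(app_bank, zp) | (machine.peek(app_bank, zp + 1) << 8)
    value = machine.peek(app_bank, pointer)

    # Pass 3: restore the application's zero page.
    machine.poke(app_bank, zp, saved_lo)
    machine.poke(app_bank, zp + 1, saved_hi)
    return value
```

With two reserved zero page bytes, passes 1 and 3 disappear, which is why that case is the easy one.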
Although this hypothetical implementation is slow, it allows numerous pointers using less hardware. It is also 65802/65816 compatible because it doesn't use any opcodes for signaling. Most significantly, it is entirely compatible with interrupts. If an interrupt occurs when in application-space, a similar process occurs and it may be nested with snooping of the application's memory-space.
In practice, I strongly recommend against eight address line I/O decode because it will have its own overhead. Address decode which has more than four bits will run slowly on an FPGA with four input LUTs, such as Lattice iCE40. Address decode which has more than five bits will run slowly on an FPGA with five input LUTs, such as Xilinx. Address decode which has more than six bits will run slowly on a CPLD with six inputs. With a poor choice of address range, discrete decode may be hindered before reaching six bits.
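The arithmetic behind this can be sketched as a count of cascaded logic levels. This is a simplified model which assumes the decode is a plain tree of fixed-width LUTs. Real tools may pack logic differently or use dedicated carry or wide-function resources. The function name is my own.

```python
def lut_levels(inputs, lut_width):
    """Count cascaded LUT levels needed to combine `inputs` signals into
    one output, assuming a simple tree of `lut_width`-input LUTs."""
    levels = 0
    signals = inputs
    while signals > 1:
        signals = -(-signals // lut_width)  # ceiling division
        levels += 1
    return levels


# Eight address lines on a four input LUT need a second level of logic,
# while four address lines fit in a single LUT.
eight_bit_decode = lut_levels(8, 4)
four_bit_decode = lut_levels(4, 4)
```

Every extra level adds propagation delay to the chip select, which comes straight out of the access time left for the peripheral.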
So, I agree with the recommendations to have 4KB or 8KB for ROM because you should never allocate less than 1KB for ROM or I/O. With this granularity, a good design is a very different proposition. A minimal design might be:
- One 6502, 65C02 or 65816.
- One 74HC139 for read/write qualification and 4*16KB address decode. (Substitute two 74HC138 if '139 unavailable.)
- Two RAM chips.
- Up to ten 65xx peripheral chips where one 6522 is connected to NMI and the others form a tree of IRQs.
- One ROM.
- Connect four bits of one 6522 to upper address lines of first RAM chip with pull-up resistors so that it defaults to 1111 at reset.
- Connect four bits of one 6522 to upper address lines of second RAM chip with pull-up resistors so that it defaults to 1111 at reset.
This will run at 16MHz (or whatever the ROM can achieve) and allows one 16KB window into each RAM chip. This allows, for example, 16KB operating system, 16KB text editor and 16*16KB text buffer. Although 14 bit extended addressing is less than ideal, it is easy to handle interrupts with a readable bank latch. Likewise, it is easy to copy kilobytes of data between banks.
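A minimal Python sketch of the windowing follows, assuming each RAM chip is 256KB (16 banks of 16KB) and that the four 6522 bits drive the upper address lines directly. The class and function names are hypothetical, and the copy routine only models the case where the source is in one RAM chip and the destination is in the other.

```python
WINDOW_BITS = 14                 # 16KB window, so 14 bits come from the CPU
WINDOW_SIZE = 1 << WINDOW_BITS
RESET_BANK = 0b1111              # pull-ups select bank 1111 at reset


class WindowedRam:
    """Hypothetical model of one RAM chip seen through a 16KB window."""

    def __init__(self):
        self.cells = bytearray(16 * WINDOW_SIZE)  # assumed 256KB chip
        self.bank = RESET_BANK                    # four bits from a 6522

    def physical(self, offset):
        """Combine the four bank bits with the 14 bit window offset."""
        return ((self.bank & 0x0F) << WINDOW_BITS) | (offset & (WINDOW_SIZE - 1))

    def read(self, offset):
        return self.cells[self.physical(offset)]

    def write(self, offset, value):
        self.cells[self.physical(offset)] = value & 0xFF


def copy_between_chips(src, src_bank, dst, dst_bank, offset, length):
    """Copy bytes from one chip's bank to the other chip's bank."""
    saved = (src.bank, dst.bank)  # possible because the latch is readable
    src.bank, dst.bank = src_bank, dst_bank
    for i in range(length):
        dst.write(offset + i, src.read(offset + i))
    src.bank, dst.bank = saved    # put both windows back as found
```

Saving and restoring the bank bits is the same thing an interrupt handler would do, which is where the readable latch pays for itself.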