If I'm understanding stuff correctly, the 65816 can do everything but swap. And I know that processes will always be limited to 16 megs per process (Excluding software-driven paging)
My current thinking on the "best" way to do a cheap MMU is probably a piece of SRAM, run off of the top lines of the address bus. This lets you do all kinds of things. So, for the point of argument, let's say that you run the top 8 bits of the address bus, giving you 64k "pages".
You can set up some logic on the outputs of the SRAM to compare one of the bits to various signals. This gives you the ability to make a page read-only, write-only, or no-execute.
You no longer need to map the address space linearly, which then means that IO can go anywhere. This also means that you can move around the zero page and stack much better. Remember, on the 65816 the zero page and stack must live in the first 64k of memory, which tends to get in the way of multiple processes.
There's nothing preventing you from having a 16 bit wide SRAM, instead of an 8 bit wide SRAM. Say we use 4 bits for flags. That means that we have 4 more bits of address space -- now we can fit 256 megs of RAM, but can only see 16 megs at a time. This shouldn't be much of a surprise; just about every z80 or 6502 based system did at least some of that.
Now, what about multiple processes having their own address space? Well, there's a few ways to go about it, depending on how fancy you want to be. The safe way to make sure that the modes are seperated is to set a flip-flop to be set whenever an interupt fires (use the vector pull line, I think) and clearable by accessing an IO address.
You can use an AND gate to compare that signal to a MMU line, to mark parts of the memory as supervisor-only. This way, you can have the MMU SRAM appear somewhere in memory always, but only be changable in supervisor mode.
Now, you can have the supervisor do a copy every time the memory map changes. This means that the first thing you do, right after receiving an interupt, is change to the supervisor mapping. And then, before you can go back to the process, you need to change back. You can make this less painful by having your supervisor always mapped to memory, but that will be memory not usable by the user apps.
Remember, I said only the top 8 lines -- that's 256 bytes. So one possible avenue is to have a 1k SRAM chip and a latch for process number, so now you have four possible processes running at the same time. Or you can have a 512 byte SRAM chip and switch between user and supervisor modes. Remember, no matter what you do, you will run out of MMU SRAM if you have enough processes. So you can just write some logic that copies stuff in and out of the MMU SRAM but keeps the 4 most frequently accessed processes available, say. Or you just say, "There are 4 threads. Deal"
So, what is your PC doing differently? Well, a few things. The most important thing to remember is that all modern processors can restart all stopped instructions, which the 65816 cannot do. With this scheme, there's no way to recover if you reach an invalid address. On a PC, you can take a page of memory, write it to disk and let the process run. If it needs that memory again, it'll fire an interupt, the supervisor retrieves that page from disk, and restarts execution right there. They also use this to avoid storing the entire memory map at once. So they'll have say the last 7 mappings stored on the chip. They compare each of the 7 mappings and then pull from memory if they can't find it in those 7 mappings. The x86 magically does this for you, but some of the other architectures don't.
You can do a hardware assisted mapping-pull, although I bet that the logic would be pretty hefty and require either a CPLD or at least 2 chips per mapping item in your cache.
The x86 also has 4k pages. Which is fine, you can trap the top 12 address lines if you want, but remember that's 4k of mappings to copy every context switch.
You can swap stuff to disk. There's two ways to do it, both of which have been done, but neither of which is nearly as nice as how a moderm PC's MMU does things.
The first way is to swap out entire processes. The second way is to have a mostly-cooperative memory manager. You can allocate "pages" and then lock and unlock them. Unlocked pages may end up on disk.
The biggest problem with the 65816, other than restarting stopped instructions, is that all instructions are equally privelaged. You can make sure that interupts go where they should by taking advantage of the vector pull line. But the stack pointer can be moved around, which means that a user process can really gum things up for the supervisor if it isn't careful. I think that you can come to an acceptable solution if you have a blank page specifically to recover the location of the stack, post interupt. You can just not move the stack in your client application, but that's won't defeat a determined user.