I've been working on a 6502 backend for LLVM, and I'm familiar with gcc as well, so I might be able to speak from a compiler perspective.
In very important senses, both the ARM and AVR instruction set architectures ARE heirs of the 6502; ARM, in particular, can fairly be called a 32-bit 6502. ARM was heavily inspired by the 6502 architecture, and its flags are extremely similar to the 6502's.
Both gcc and LLVM assume, broadly but not absolutely, that they are targeting processors with multiple registers of a given register class. For example, ARM has 16 32-bit general-purpose registers, most of which can be addressed equivalently in any instruction that accesses a register.
This does not mean that these compilers cannot target 6502 style registers; it just means that additional effort is required per instruction to target them.
Any compiler or OS maker will probably treat your 8-bit addressable pages as virtual register banks, in the ARM style, where one or more banks would be user mode and another bank would be privileged mode. The designers of the 6502 were well aware of the limited number of registers in their designs, and so they provided plenty of ways to do 8-bit addressing, both indirectly and directly.
Your stacks are too small for most embedded work; 4KB stacks were the absolute minimum even by the early 1990s. If you have a sufficient amount of 8-bit-addressable memory in general, there's no convincing reason to make your stacks addressable with a single byte. Link-time optimization removes stack pushes and pops when there is sufficient register space. Stack space is one of those things for which every project has its own needs, so I suggest you don't burn stack locations into hardware.
Byte addressing is very important to the embedded market and makes string handling easier.
I know of zero embedded applications that are gated by the speed of string handling; you are optimizing the wrong thing. Far more useful would be a built-in DMA engine for fast memory copies.
Why not have a single, flat 32-bit stack? First of all, this is boring.
God protect all programmers from "interesting" hardware. We LOVE boring. We LOVE industry standard behavior.
Having 4 “fast pages” makes implementing a C compiler much easier.
Keep in mind that all C compiler makers will treat whatever 8-bit-addressable pages you give us as (virtual) registers, so I suggest you think carefully through the cases where we do. For example, it would be extremely useful to have an instruction that takes the four-byte value at an 8-bit location, treats it as a 32-bit pointer, and loads the value at that 32-bit address back into 8-bit-addressable memory. Please think through all the cases of 8-bit, 16-bit, and 32-bit indirect loads and stores, with your 8-bit-addressable page acting as a virtual register file. You might even alias A, X, Y, SR, SP, and PC to the first and/or last few bytes of those pages.
I know folks love their registers, and their C-compilers, but it is not for me.
There has never been a mass-produced CPU that did not have a C compiler. Perhaps what you are trying to create is a purely artistic expression rather than a CPU that might be used in someone else's project.
Having 4 small stacks and 4 “fast pages” also makes “small multitasking” easy to implement, allowing you to run a couple of tasks concurrently without having to swap in and out of memory.
Again, I think you are optimizing the wrong thing. As a hardware designer, you have no control over how many tasks or threads an application designer will want, so it might be better to focus on user vs. protected modes, or MMU vs. no MMU, and let software make the deeper choices.
Of course an MMU could be added later which could isolate stacks between kernel and user space programs. It could also remap the 4 stacks and "fast pages" anywhere in memory, but then they aren’t as fast anymore. I really dislike it when the 65XX starts to look too much like just another “large system” processor.
OK, but you simply cannot run any of those modern operating systems you've listed without an MMU. I've researched this issue. All of them assume that an MMU exists. If you don't have one, we'll have to emulate it.
If you're adding features onto a 32-bit 6502 die, the single most important one is a multiplier that operates on your 8-bit-addressable memory, and the wider the better. Almost all modern applications assume that 32-bit multiplies are fast, while the 6502 takes forever to do even an 8-bit multiply. For bonus points, put several multipliers in parallel and let them all hit 8-bit-addressable memory at once; most modern embedded designs for machine learning depend on this. See in particular the dgemm() operation in BLAS, and think about how fast you can make it on your design.
When you design new hardware, you are designing for a market that already has expectations about how hardware should behave. It is almost always the case that new hardware designs should follow software requirements; quirky or “opinionated” hardware usually gets binned by history. Programmers like flat memory models. Operating systems like MMUs. Compilers like register banks. DSP applications like vector multipliers.
Don't start from hardware features. Think about applications, and then think about the hardware features that those applications need.
I hope some of this is helpful and inspires your designs.