cr1901, if you expected a variety of ideas regarding this topic then you haven't been disappointed!
But in order to evaluate any expansion approach, we need to establish what the goals are.
One common goal is to be able to easily code applications that store and examine large data objects -- for example, arrays whose size exceeds 64K. This sort of thing is effortless for a 68000 family chip, for example. And we want our expanded 65xx to be able to play with the big kids!
An expanded-memory 6502/C02 can handle large objects by simulating a "flat" linear space. But the execution speed varies drastically between the various expansion schemes. The difference is most evident if the application routinely makes lots of little fetches that span the entire array (or other structure), as required for mainstay activities such as searching and sorting. Traversing a linked list is another example.
The poor locality of reference can poison performance when that sort of "big data" application is attempted. Each individual reference accesses a few bytes or even just one -- and yet, each reference incurs dozens of cycles of delay doing the translation (linear to page-and-offset) and outputting the result to the paging hardware so the access can proceed. For applications that make numerous, small references, the ratio of real work to housekeeping plummets.
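To put that ratio in concrete terms, here is a hedged back-of-envelope sketch. The cycle counts are assumptions chosen purely for illustration, not measurements of any particular expansion scheme:

```python
# Back-of-envelope: fraction of time spent on translation housekeeping
# when an application makes many small, random references.
# Both cycle counts below are assumptions for illustration only.

payload_cycles = 4     # e.g. a plain zero-page-indexed LDA on a 65xx
overhead_cycles = 40   # "dozens of cycles" of translate-and-output per access

total = payload_cycles + overhead_cycles
print(f"useful work: {payload_cycles / total:.0%} of each reference")  # → 9%
```

With dozens of overhead cycles wrapped around a one-byte fetch, well under 10% of the time goes to the actual data access -- which is why streamlining the translation matters so much.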
If this impediment is incompatible with your goals, the remedy is to streamline as much as possible the process of linear-to-page-and-offset translation and the subsequent outputting to the paging hardware. Some actual numbers may help here, so an example is in order -- preferably a simple one.
I'll propose a function (aka subroutine / Forth word / whatever) that gets passed a linear address held in zero-page at X, X+1 and X+2. The address is random, so no assumptions can be made. The function begins by fetching the byte at the specified address. How long will it take to fetch that byte?
Code:
LDK1 2,X   ; 2 bytes, 4~. Load address bits 23-16 into bank register K1
K1_        ; 1 byte,  1~. The K1 prefix applies K1 to the following instruction
LDA  0,X   ; 2 bytes, 6~. (An otherwise ordinary LDA instruction)
The KK computer does it in 11 cycles -- which AFAIK is unrivaled in the realm of 65xx memory retrofits. The favorable performance is mostly due to using a page size of 64K (although 256 bytes would work well too, as noted above). If you want to even remotely approach 68000-like nimbleness in regard to accessing randomly-specified locations in the linear map, then these page sizes are clearly attractive options. Yes, 64K pages require some well-thought-out hardware (you need to get a grip on the cycle-by-cycle timing), but the scheme can be built around something as simple as a shift register.
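To see why those two page sizes are so attractive, here is a hedged sketch of the translation arithmetic (the 24-bit address width follows the three-byte example above; the 8K comparison size is my own choice for contrast):

```python
# Sketch: why 64K (or 256-byte) pages make translation cheap in software.
# With a 64K page, a 24-bit linear address splits on a byte boundary:
# the high byte IS the page number and the low two bytes ARE the offset,
# so "translation" is just picking bytes -- no shifting or masking.

addr = 0x123456  # a 24-bit linear address (value chosen for illustration)

# 64K pages: pure byte selection.
page_64k = addr >> 16       # in 65xx terms: just load the third byte
offset_64k = addr & 0xFFFF  # the low two bytes, untouched

# 8K pages (not byte-aligned): genuine mask-and-shift work, which on a
# 65xx means a multi-instruction shift sequence on every reference.
page_8k = addr >> 13
offset_8k = addr & 0x1FFF

print(hex(page_64k), hex(offset_64k))  # 0x12 0x3456
print(hex(page_8k), hex(offset_8k))    # 0x91 0x1456
```

The 256-byte case is byte-aligned in the same way (low byte is the offset, upper two bytes select the page), which is why it also avoids the shift penalty.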
cheers,
Jeff
ETA: another good approach would be to wire the hardware so that it manages the mask-and-shift of the translation itself. With that task no longer burdening the software, a non-64K window would be fine.
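A hedged software model of that idea follows. The class name, register layout, and the 8K window size are all inventions for illustration -- the point is only that software presents the raw address bytes and the paging hardware, not the CPU, derives page and offset:

```python
# Model of paging hardware that performs the mask-and-shift itself.
# Software's only job is to write the three raw linear-address bytes;
# the "hardware" below extracts page and offset at wire speed.
# The 8K window size and register layout are assumptions for illustration.

class PagingHardware:
    WINDOW = 0x2000  # 8K window; any power of two would do

    def __init__(self) -> None:
        self.page = 0
        self.offset = 0

    def latch(self, lo: int, mid: int, hi: int) -> None:
        """Accept three raw address bytes, as software would write them."""
        linear = (hi << 16) | (mid << 8) | lo
        # The hardware does the mask-and-shift, not the CPU:
        self.page = linear >> (self.WINDOW.bit_length() - 1)
        self.offset = linear & (self.WINDOW - 1)

hw = PagingHardware()
hw.latch(0x56, 0x34, 0x12)           # linear address $123456
print(hex(hw.page), hex(hw.offset))  # 0x91 0x1456
```

From the 65xx program's point of view this costs the same few stores regardless of window size, which is what frees the choice of window from the byte-alignment constraint.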
_________________
In 1988 my 65C02 got six new registers and 44 new full-speed instructions!
https://laughtonelectronics.com/Arcana/ ... mmary.html