Alright, I have several ideas, and some of them don't smell very nice although they could perhaps be made to work!
I can't fully address the details right now, but there are three basic strategies.
One strategy we're already discussing is to add more time to the read cycle that inputs the RX bit, thus giving the RC a longer period in which to resolve. For example, to briefly pulse RDY low, we could have a resistor pulling RDY high but also have a capacitor that ties MLB to RDY. With luck, a 3-cycle low pulse on MLB will produce a 1-cycle (approx) low pulse on RDY. Among other downsides, it'd be obligatory for the serial input code to use only RMW instructions to read RX. Also, every RMW instruction would get slowed down, even if it's in the application code (as opposed to the serial input code).
Another strategy perhaps worth considering is to partially "precharge" (actually discharge) the RC so it'll resolve more quickly. As is, data line D7 will always be high in the cycle before the RC curve begins. That's because the final operand byte is being read in that cycle, and bit7 of that operand ($FF) is a 1. So, there's a huge asymmetry. If RX is 1 then no time at all is required for the RC to resolve, but the time required is very substantial if RX is a 0, because the capacitance needs to be dragged from a 1 down to just below 50% of Vcc (the transition point of the C02's inputs). Off the top of my head -- and this is dicey -- I wonder if the CPU's (presently unused) PHI2O output could, by means of a couple of diodes acting as a zener, pull data line D7 down to roughly 50% during the first half of the cycle. Ideally (!), the diodes will instantly turn off in the 2nd half of the cycle, and the RC curve then only has to settle slightly above or slightly below 50%. I have misgivings, but perhaps this or a similar scheme could be made to work!
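To put rough numbers on that asymmetry, here is a minimal Python sketch of an ideal RC node falling toward 0V. It's only an illustration: the normalized time constant and the 55% starting level are my assumptions, not measured values, and a real D7 line won't behave this cleanly.

```python
import math

def time_to_cross(v_start, v_final, v_threshold, rc):
    """Time for an ideal RC node, starting at v_start and heading toward
    v_final, to reach v_threshold. Voltages are fractions of Vcc."""
    return rc * math.log((v_start - v_final) / (v_threshold - v_final))

RC = 1.0  # normalized time constant (hypothetical; actual R and C not given)

# RX = 0 with D7 left at a full 1 by the operand fetch: fall from 100% to 50%.
t_full = time_to_cross(1.0, 0.0, 0.5, RC)

# RX = 0 with D7 pre-discharged to about 55% (assumed figure): fall from 55% to 50%.
t_pre = time_to_cross(0.55, 0.0, 0.5, RC)

print(f"from 100%: {t_full:.3f} RC, from 55%: {t_pre:.3f} RC")
```

Under those assumptions the full swing takes about 0.69 RC, while the pre-discharged case needs roughly a tenth of an RC.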
The third strategy is to change the means by which RX is read. Presently it's memory mapped, but it would also be possible to channel RX into the SOB input. I picture a gate (or a gate equivalent, using just a resistor and a diode) that feeds SOB with the logical OR of RX and MLB. The program clears the V flag, does an RMW instruction, then conditionally branches on the state of V.
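The polarity is easy to get backwards, so here is a small Python model of the idea. It assumes RX holds steady during the instruction, that MLB is low for three cycles of the RMW, and that V is set by a falling edge on SOB; the function name is just for illustration.

```python
def v_flag_after_rmw(rx):
    """Return the V flag after CLV followed by one RMW instruction.
    rx is the RX line level (0 or 1), assumed steady during the instruction."""
    v = 0                           # CLV clears the overflow flag
    mlb_levels = [1, 0, 0, 0, 1]    # MLB idle, three low cycles, idle again
    prev_sob = rx | mlb_levels[0]
    for mlb in mlb_levels[1:]:
        sob = rx | mlb              # the proposed gate: SOB = RX OR MLB
        if prev_sob == 1 and sob == 0:
            v = 1                   # a falling edge on SOB sets V
        prev_sob = sob
    return v

for rx in (0, 1):
    print(f"RX={rx} -> V={v_flag_after_rmw(rx)}")
```

In this model V ends up as the inverse of RX, so the branch on V set would be the "RX was low" path.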
But my favorite proposal uses RX as an address input to the PLD/ROM, which would contain TWO almost identical programs... and the 'C02 would routinely "change horses in midstream" between the two programs as the data comes in. (It's a trick I first used decades ago; see the "slightly OT" paragraphs in this post.) As noted, the programs would be almost identical. But here and there, one will have an instruction that says, "trust me -- no need to test anything; just shift in a 1" and the other sequence will say, "just shift in a 0."
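Here's a toy Python model of the two-program trick, just to show the principle. The CLC/SEC-then-ROL pairing is my guess at what "just shift in a 0/1" could look like, the bits are assumed to arrive MSB first, and RX is assumed stable while each bit's instructions are fetched.

```python
# Two symbolic "programs", identical except at the slot where the bit is captured.
# CLC/SEC followed by ROL is an assumed way to "just shift in" a 0 or a 1.
PROGRAM_RX0 = ["CLC", "ROL"]
PROGRAM_RX1 = ["SEC", "ROL"]
ROM = [PROGRAM_RX0, PROGRAM_RX1]    # RX acts as the extra, top address line

def receive_byte(rx_bits):
    """Run the bit loop once per incoming bit; return the assembled byte."""
    acc, carry = 0, 0
    for rx in rx_bits:
        for pc in range(len(PROGRAM_RX0)):
            opcode = ROM[rx][pc]    # RX selects which program gets fetched
            if opcode == "CLC":
                carry = 0
            elif opcode == "SEC":
                carry = 1
            elif opcode == "ROL":
                # rotate left through carry, 8 bits wide
                acc, carry = ((acc << 1) | carry) & 0xFF, (acc >> 7) & 1
    return acc

print(hex(receive_byte([1, 0, 1, 0, 0, 1, 0, 1])))
```

Note that the "CPU" never tests RX at all; the data arrives purely by way of which opcode the ROM happens to present.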
Each of the programs would be less than 32 bytes, so IIUC there will no longer be any need to feed A5 to the PLD/ROM. That gains us an extra input pin, one to which RX can be connected.
RX needs to be synchronized to the CPU clock (otherwise there'd be a risk the ROM output may be in flux when Phi2 falls), and that means employing one of the flipflops in the PLD. And AFAICT that costs us another pin, because (unlike some PLDs) the 22V10 can't feed both the pin and the Q of the pin's macrocell back into the interconnection matrix.
But it's possible to scrounge another pin if we simply cease to feed A11 to the PLD. This negatively impacts the memory map. There can still be two chunks carved out of the top of the space, but instead of each chunk being $800 in size, each will be $1000 in size.
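To make the memory map cost concrete, here's a short Python sketch that lists the two top chunks for each case. It assumes the chunks are contiguous and sit at the very top of the 64K space, and that chunk size is set by the lowest address line the PLD sees; the function name is hypothetical.

```python
def top_chunks(lowest_decoded_bit, count=2):
    """Return (start, end) address pairs for `count` chunks at the top of
    the 64K space, sized by the lowest address bit the decoder sees."""
    size = 1 << lowest_decoded_bit
    top = 0x10000
    return [(top - (i + 1) * size, top - i * size - 1) for i in range(count)]

for label, bit in (("A11 fed to PLD", 11), ("A11 dropped", 12)):
    ranges = ", ".join(f"${lo:04X}-${hi:04X}" for lo, hi in top_chunks(bit))
    print(f"{label}: {ranges}")
```

So under these assumptions the two chunks would together cover $E000-$FFFF instead of $F000-$FFFF.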
Is that too high a price for opening the door to higher clock rates? It depends on one's priorities, so there's no single "right" answer. But IMO this solution is much cleaner than any of the others I've mentioned.
-- Jeff