And you've chosen to snoop the data bus. I gather that when the LDA ($b0),y reads from the $4000-$DFFF region, the data bus gets copied to the shift reg at the end of the same cycle that does the read. This eliminates the delay otherwise required for a subsequent STA to explicitly write to the shift reg. Nice!
The other trick I had in mind would be overkill for what you're presently doing, but it's a favorite of mine despite being somewhat gnarly... or maybe because it's somewhat gnarly!
Anyway, the way my brain works, I'd be thinking about eliminating the reformatting, and to do that it's necessary to deal with the effect of page crossings on timing. I'd wanna eliminate the single-cycle NOP, and "spend" that cycle in one of two ways:
- arrange for a Wait State that only occurs when there's no Page Crossing. I know that's not entirely straightforward, but I'm sure it's doable, and it would let you guarantee that it always takes 8 cycles to move each byte, thus ensuring consistent timing. Or, ...
- replace each LDA ($b0),y with a STA ($b0),y. STA using (ind),y mode always takes six cycles, regardless of whether or not there's a page crossing. So, rather than generating conditional Wait States, the challenge will be to keep the /WE pin of the RAM high so it reads instead of writing. Also, you'd need to somehow avoid bus contention between the RAM and the CPU, since the CPU drives the data bus during the write cycle. Do those guys connect directly together, or does your CPLD act as a middleman? If they connect directly together then this idea is a dud!
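
To make the cycle arithmetic behind both options concrete, here's a minimal Python sketch. It only models the documented instruction timings (LDA (ind),y is 5 cycles plus 1 on a page crossing; STA (ind),y is always 6) and says nothing about your actual CPLD logic. The function names and the `wait_state_if_no_crossing` flag are made up for illustration.

```python
def page_crossed(base: int, y: int) -> bool:
    """True when adding Y to the pointer's low byte carries into the high byte."""
    return ((base & 0xFF) + (y & 0xFF)) > 0xFF


def lda_ind_y_cycles(base: int, y: int, wait_state_if_no_crossing: bool = False) -> int:
    """Cycles for LDA (zp),y: 5, plus 1 if the indexed address crosses a page.

    wait_state_if_no_crossing is a hypothetical switch modelling option 1:
    external logic stretches the instruction by one cycle only when there
    is NO page crossing, so both cases end up the same length.
    """
    crossed = page_crossed(base, y)
    cycles = 5 + (1 if crossed else 0)
    if wait_state_if_no_crossing and not crossed:
        cycles += 1
    return cycles


def sta_ind_y_cycles(base: int, y: int) -> int:
    """Cycles for STA (zp),y: always 6, page crossing or not (option 2)."""
    return 6


if __name__ == "__main__":
    base = 0x40F0  # example pointer value, chosen so some Y values cross a page
    for y in (0x0F, 0x10):
        print(
            f"y=${y:02X}",
            "plain LDA:", lda_ind_y_cycles(base, y),
            "LDA + conditional wait:", lda_ind_y_cycles(base, y, True),
            "STA:", sta_ind_y_cycles(base, y),
        )
```

Either way the variable 5-or-6 becomes a constant 6, which is what lets the per-byte timing stay fixed without reformatting the data around page boundaries.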
-- Jeff