I like jeffythedragonslayer's recent questions. I presume there is an interest in really understanding the 6502 from the view of a compiler writer, a virtual machine implementer and at the transistor level. This question is perhaps the most intriguing and shows a particular thoroughness. In particular, it reveals a binary incompatible 6502 variant with fewer transistors and more functionality.
I regarded STA immediate as the Technetium of the 6502's opcode chart. It never occurred to me that it has a useful semantic - as implemented by the 6800 designers. As noted in a recent narrated animation about the history of 6502 (and 6800), the designers were very focused on embedded applications, such as automotive. In this context, there is a strict division between program and data. From this view, store immediate is worthless - or even undesirable behavior.
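For anyone who wants to see where that gap sits, here is a minimal sketch of the regular "aaabbbcc" decoding for the cc = 01 column of the opcode chart. The function and table names are my own, purely for illustration, and the sketch only covers that one column. Under this layout, the STA row meets the immediate column at $89, which is the slot that the documented NMOS instruction set leaves empty.

```python
# Hypothetical helper names; the bit layout is the usual aaabbbcc reading
# of the cc = 01 group (ORA, AND, EOR, ADC, STA, LDA, CMP, SBC).
OPERATIONS = ["ORA", "AND", "EOR", "ADC", "STA", "LDA", "CMP", "SBC"]
MODES = ["(zp,X)", "zp", "#imm", "abs", "(zp),Y", "zp,X", "abs,Y", "abs,X"]


def decode_group_one(opcode):
    """Split a cc = 01 opcode into (operation, addressing mode)."""
    if opcode & 0b11 != 0b01:
        raise ValueError("not a cc = 01 opcode")
    aaa = (opcode >> 5) & 0b111  # top three bits select the operation
    bbb = (opcode >> 2) & 0b111  # middle three bits select the mode
    return OPERATIONS[aaa], MODES[bbb]


if __name__ == "__main__":
    for opcode in (0xA9, 0x85, 0x89):
        operation, mode = decode_group_one(opcode)
        print(f"${opcode:02X}: {operation} {mode}")
```

The decoder knows nothing about which combinations are sensible, so it reports $89 as STA #imm just as readily as it reports $A9 as LDA #imm.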
Another example of this code/data separation is the NMOS 6502's RDY signal, which only works during read cycles. This is sufficient for process control from a cheap ROM. In this arrangement, the system cycle time is limited by the worst path in the I/O segment. If you have, for example, eight RIOTs or ten VIAs, I/O speed is fairly equally matched with the processor. Just add another cycle or two to cover slow ROM. That is more than adequate for stepper motors and similar. However, if you want the fastest, general purpose computer and you want to interface a 2MHz processor with a 1MHz peripheral, a RDY signal which is ignored during write cycles is an annoying impediment.
This was corrected in the 65C02, but that only creates a three-way corner case. (RDY stretching, write cycles, NMOS: choose any two.) Unfortunately, I've been affected by this problem.
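A toy model may make the corner case easier to see. This is only a sketch under simplifying assumptions: each bus cycle is a read or a write, a slow device asks for a fixed number of wait states, and the only difference between the two processors is whether RDY is honoured on writes. The names (run_bus, honours_rdy_on_write) are invented for the example and do not come from any datasheet.

```python
def run_bus(cycles, honours_rdy_on_write):
    """Walk a list of (kind, wait_states) bus cycles, kind being 'R' or 'W'.

    Returns (total_clock_cycles, indexes_of_writes_that_were_not_stretched).
    """
    total = 0
    unstretched_writes = []
    for index, (kind, wait_states) in enumerate(cycles):
        if kind == "R" or honours_rdy_on_write:
            # RDY is honoured: the cycle is stretched by the wait states.
            total += 1 + wait_states
        else:
            # NMOS-style write: RDY is ignored and the cycle runs at full speed.
            total += 1
            if wait_states > 0:
                unstretched_writes.append(index)
    return total, unstretched_writes


if __name__ == "__main__":
    # STA absolute to a slow peripheral: three reads, then one write
    # which wants a single wait state.
    sta_absolute = [("R", 0), ("R", 0), ("R", 0), ("W", 1)]
    print("NMOS-style :", run_bus(sta_absolute, honours_rdy_on_write=False))
    print("65C02-style:", run_bus(sta_absolute, honours_rdy_on_write=True))
```

In this model the NMOS-style run finishes in four clocks but flags the write as unstretched, while the 65C02-style run takes five clocks and flags nothing. Reads from slow ROM are stretched in both cases, which is why the limitation goes unnoticed in a ROM-based controller.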
Despite expert advice, people insist on using dodgy, random components. Therefore, I've made sure that my work is NMOS compatible. This is quite easy in software. Like the 6502 designers, I was concerned about NMOS compatible slow ROM access. Unlike the 6502 designers, I didn't clock stretch I/O correctly. It is a regrettable decision made downstream from the 6800 and 6502 design decisions. It happened because I wanted maximum performance rather than a cheap and cheerful system.
In the very hypothetical situation of designing an 8 bit instruction set where more than 1/3 of the opcodes are unused, I'd probably leave STA immediate unallocated and put all of the fiddly exceptions (LDX#, LDY#, INX, DEX, INY, DEY, CLV) into one group. I'd also make RegX and RegY down counters so that DEX and DEY would be one cycle. However, that probably requires more transistors.