Actually, the 816’s only native data size is 16 bits when operating in native mode. The changeable register widths do not affect the ALU, which always processes 16 bits at a time. That is one of the reasons that, for example, a branch across a page boundary in native mode does not consume any more Ø2 cycles than a branch that stays within a page boundary. It is also why register-to-register copies, e.g., TXA, take two Ø2 cycles, regardless of the register sizes.
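The branch timing mentioned above can be put as a rough Python model. It is only a sketch, and it assumes the usual published cycle rules: two cycles for a branch, one more if it is taken, and one more again only when a taken branch crosses a page in emulation mode. The function name and parameters are made up for illustration.

```python
def branch_cycles(taken, pc_after_branch, target, emulation_mode):
    """Estimate the clock cycles used by a conditional branch (illustrative model)."""
    cycles = 2                      # opcode fetch plus offset fetch
    if taken:
        cycles += 1                 # extra cycle to apply the offset
        crosses_page = (pc_after_branch & 0xFF00) != (target & 0xFF00)
        if emulation_mode and crosses_page:
            cycles += 1             # page-crossing penalty, emulation mode only
    return cycles

# A taken branch from $10FE to $1105 crosses a page boundary.
print(branch_cycles(True, 0x10FE, 0x1105, emulation_mode=False))  # 3
print(branch_cycles(True, 0x10FE, 0x1105, emulation_mode=True))   # 4
```

In this model the native-mode result is the same whether or not the target sits on another page.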
As for the 816 not having the simplicity of the 65C02, there’s no surprise in that. The 816 is a significant advancement over the 65C02, just as the 80286 was an advancement over the 8086. When the 286 came on the scene, a whole new programming paradigm arrived with the introduction of protected mode. Rather than grouse about how the “simplicity” of the 8086 had been lost, programmers learned how to use protected mode and other new features, resulting in programs that could do more and at a faster rate.
Something like that is progress, pure and simple.
Incidentally, getting the most out of the 65C816 does demand treating it as a different beast than the 65C02. When I first started writing native-mode 816 code, I often found myself using a stone axe when I had a chainsaw available.
There is a learning curve to it, and some planning is required to minimize register size switching (register sizes, by the way, aren’t “modes”; changing a register’s size has no effect whatsoever on how instructions behave). In many of my programs, I store numerics as 16 or 32 bits, even if a particular numeric will never exceed 255. In many algorithms, the slight additional execution time needed to fetch or store a word instead of a byte is offset by the elimination of frequent REP and SEP instructions. This is especially true of pointer arithmetic: I use 32-bit pointers for “long” accesses, even though only 24 bits are needed. Wasting a byte here and there in this fashion is inconsequential when one considers that pointer arithmetic can then be carried out without having to monkey with the accumulator’s size to manipulate the most significant word.
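The word-wise pointer arithmetic described above can be modeled in a few lines of Python. This is a sketch of the arithmetic only, not 816 code: it assumes the pointer is kept as two 16-bit words (low word first) and that the top byte of the high word is padding. The function names are hypothetical, and the comments name the instructions each step roughly corresponds to.

```python
MASK16 = 0xFFFF

def add_to_pointer(ptr_lo, ptr_hi, offset):
    """Add a 16-bit offset to a pointer held as two 16-bit words."""
    total = ptr_lo + (offset & MASK16)   # roughly CLC / LDA ptr / ADC offset
    new_lo = total & MASK16              # STA ptr
    carry = total >> 16                  # carry out of the low word (0 or 1)
    new_hi = (ptr_hi + carry) & MASK16   # roughly LDA ptr+2 / ADC #0 / STA ptr+2
    return new_lo, new_hi

def effective_address(ptr_lo, ptr_hi):
    """Only the low 24 bits form the address; the top byte is unused padding."""
    return ((ptr_hi << 16) | ptr_lo) & 0xFFFFFF

# Stepping past the end of bank $01 carries into the bank byte.
lo, hi = add_to_pointer(0xFFF0, 0x0001, 0x0020)
print(hex(effective_address(lo, hi)))  # 0x20010
```

Both halves of the addition work on 16-bit words, so nothing in the sequence calls for changing the accumulator width partway through. With a 3-byte pointer, the bank byte would have to be handled as a lone byte, which is where the extra SEP and REP come in.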
Is the 65C816 perfect? Not at all. Having different opcodes determine the data size being manipulated, instead of selecting register widths via a distinct instruction, would make for more streamlined code and reduce the opportunity to add gnarly bugs to one’s programs. Some aspects of the banking scheme can be annoying as well. I wish execution could span banks, since that would make it possible to load a program anywhere convenient. However, the 816 retains the efficiency of the 65C02, as well as its excellent interrupt performance. Plus, the stack acrobatics that can be done with the 816 have no analogue on the 65C02. That’s good enough for me.