Warning: thinking out loud follows...
I'm trying to write a compiler backend for the 6502 (again). The 6502 is rather hard to target because it's a mixed memory-register machine. 8-bit arithmetic is fairly straightforward, but for 16-bit and above it's essentially a three-operand machine where operations look like this:
Code:
lda lhs+n
adc rhs+n
sta dest+n
Worst case, when lhs, rhs and dest are all in normal memory (absolute addressing), this ends up at nine bytes per byte of data: so a 16-bit add is 18 bytes. (Ignoring the clc for now.)
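For reference, the worst case spelled out in full (absolute addressing throughout, and with the clc put back) comes to 19 bytes — each absolute-mode instruction is three bytes, plus one for the clc:

Code:
clc
lda lhs+0
adc rhs+0
sta dest+0
lda lhs+1
adc rhs+1
sta dest+1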
However, what's really bad from my perspective is that lhs, rhs and dest can each have several different addressing modes. lhs and rhs can be one of {constant, memory ref, indirect memory ref}; dest can be one of {memory ref, indirect memory ref}; that's 3 × 3 × 2 = 18 different combinations. This means that my rule-based compiler backend needs 18 different rules. The code quality's also not good for chained operations, because I end up with one operation writing its result to a temporary memory location and the next reading it straight back again.
What if I could keep intermediate results in registers instead?
Obviously this would only work for 16-bit arithmetic. I need to reserve Y for indexing, so my 16-bit values go in XA (X holding the high byte, A the low). The best I've come up with is:
Code:
lda lhs+0
ldx lhs+1
adc rhs+0
pha
txa
adc rhs+1
tax
pla
sta dest+0
stx dest+1
While that's actually worse in total (22 vs 18 bytes), it should very quickly win if I want to do chained operations like i+j+k; I just need to repeat the central section, which is only ten bytes worst case. Plus, I now have three discrete operations (load, add, store), each of which comes in only two or three varieties, which fits the compiler much better.
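As a sketch, i+j+k (made-up labels, and with a clc restored before each add — lda/ldx, pha/pla and tax/txa all leave carry alone, so the clc placement works) would come out something like:

Code:
clc
lda i+0      ; load i into XA
ldx i+1
adc j+0      ; central section: XA := XA + j
pha
txa
adc j+1
tax
pla
clc
adc k+0      ; central section again: XA := XA + k
pha
txa
adc k+1
tax
pla
sta dest+0   ; store XA to dest
stx dest+1

That's 34 bytes, against 38 for doing it memory-to-memory through a temporary (two 19-byte adds), and the gap widens with every extra term.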
Does this seem reasonable? Am I missing anything obvious? Clearly I'll have to go the whole hog for 32-bit arithmetic, but I think this could help 16-bit arithmetic a lot. But I don't like using the stack there...