Hi qus,
I'm aware that I'm entering this thread a bit late, and I may not be fully qualified to discuss it, particularly because I have no practical experience with the 6502 architecture and instruction set. However, I recently implemented an LLVM compiler backend for a custom architecture, so maybe I can point out something useful. Please ignore this if you have already considered it or if it's an idea that doesn't apply.
If I understand it correctly, you are basing code generation on the physical existence of a Data Stack that you use to store temporary values and perform operations, in a Forth-like fashion. So, for example, to perform an addition, you push the first operand onto the stack, then the second operand, and then perform the add operation, whose result is left on top of the stack (please correct me if this is not the case).
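Just to check that I've got this right, here is my guess at what such an 8-bit add might look like on the 6502, assuming the data stack lives in page zero with X as the stack pointer (this is purely an assumption about your setup):
Code:
        LDA a        ; push first operand
        DEX
        STA $00,X
        LDA b        ; push second operand
        DEX
        STA $00,X
        LDA $00,X    ; add: pop both operands...
        INX
        CLC
        ADC $00,X
        STA $00,X    ; ...and push the sum back on top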
From what I have learned while working on LLVM, I think that modern compilers use a different approach. The compiler builds a graph-based Intermediate Representation (IR) of the source code in "Static Single Assignment" form, or SSA. The idea is that the compiler initially assumes an unlimited number of virtual registers, each variable or intermediate result is assigned to a virtual register exactly once, and every value is defined before it is used. So, for example, the expression
Code:
x = a+b*c
will be represented in SSA form like this (assuming all are global variables):
Code:
vr0 = (globaladdressof x)
vr1 = (globaladdressof a)
vr2 = load vr1
vr3 = (globaladdressof b)
vr4 = load vr3
vr5 = (globaladdressof c)
vr6 = load vr5
vr7 = vr4 * vr6
vr8 = vr2 + vr7
(void) = store vr8 -> vr0
To generate that in the proper order, a compile-time stack can be used to handle operator precedence, and in fact this output almost resembles a stack-based machine, but the point is that the stack does not physically exist in the target machine. Since there is no dependence on a physical stack and there is an unlimited number of virtual registers to play with, array indexing or struct member access is not much harder than simple arithmetic: the former can be turned into the latter. In fact, everything except jumps and calls can be turned into sequences of arithmetic or logical operations. For example, the code you proposed:
Code:
objectArrayInstance[4].someField = 69
can be turned into the following SSA sequence:
Code:
vr0 = (globaladdressof objectArrayInstance)
vr1 = (constantsizeof objectArrayInstance)
vr2 = 4
vr3 = vr1 * vr2
vr4 = vr0 + vr3
vr5 = (constantoffsetof somefield)
vr6 = vr4 + vr5
vr7 = 69
(void) = store vr7 -> vr6
The last SSA instruction, the store, does not create any new virtual register because it does not produce a result, but of course both the destination address and the stored value are still available in vr6 and vr7. The SSA instruction sequences are preferably represented as a linked graph in the compiler, but some compilers (GCC, I think) just keep them linearly in arrays, which should definitely be easier to debug, although (according to the LLVM developers) this can make some optimisations harder and more time consuming.
After the SSA sequence (or graph) is generated, it is optimised. This essentially means searching for repeated patterns and for opportunities to reuse or eliminate registers. In the example above, vr1 and vr2 are clearly no longer used once vr3 has been computed, so registers appearing later can be renamed to reuse them, and so on. This step can also apply some target-specific optimisations, or insert or replace IR instructions with convenient 'pseudo' instructions that will be easily converted into physical ones later. Compilers are a kind of 'black art', so I suppose anything that improves the final generated code is allowed anywhere.
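To illustrate with the store example above: the element size, the index and the field offset are all compile-time constants, so this step can in principle fold most of the sequence away, leaving something like this (just a sketch, assuming those constants really are known at compile time):
Code:
vr0 = (constantaddressof objectArrayInstance[4].someField)
vr1 = 69
(void) = store vr1 -> vr0
Only two live virtual registers remain, and vr0 is now a single constant address.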
Next, physical register allocation begins. This essentially means replacing the remaining virtual registers with actual physical registers, and spilling the values for which no physical register is available. Since the 6502 does not have anywhere near the number of physical registers required by this kind of compiled code, my approach would be to reserve a sufficiently large block of memory in page zero and treat that memory as the 'physical registers'. Instead of a Forth-like data stack, I would have maybe 8, 16 or even 32 slots in memory that are treated as real physical registers during code generation. The more 'physical registers' the better, so if you have already allocated some fixed memory for the data stack, the same memory can hold as many registers as will fit in there (!)
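For example, the x = a + b*c sequence from before could end up using just a couple of those page-zero slots plus the accumulator. This is only a sketch: the slot addresses $02/$03 are arbitrary, I'm assuming a, b, c and x are single bytes at absolute addresses, and since the 6502 has no multiply instruction, mul8 is a made-up runtime helper that multiplies the two slots and returns the result in A:
Code:
        LDA b        ; vr4 = load b
        STA $02      ; vr4 assigned to zero-page "register" $02
        LDA c        ; vr6 = load c
        STA $03      ; vr6 assigned to zero-page "register" $03
        JSR mul8     ; vr7 = vr4 * vr6 (hypothetical runtime multiply, result in A)
        CLC
        ADC a        ; vr8 = vr2 + vr7 (the load of a folded into the add)
        STA x        ; store vr8 -> x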
The last step is converting the optimised internal representation into actual 6502 instructions. This offers additional optimisation opportunities, which at this point are purely target specific (for the 6502 processor). For example, several intermediate instructions in a row may get folded into shorter native 6502 sequences, such as load/store pairs or constant stores, etc.
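Going back to the objectArrayInstance example, after constant folding the whole assignment could in the best case collapse to a single constant store, something like this (assuming the assembler can evaluate the address expression; ELEMENT_SIZE and SOMEFIELD_OFFSET are made-up symbols here):
Code:
        LDA #69
        STA objectArrayInstance + 4*ELEMENT_SIZE + SOMEFIELD_OFFSET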
As I said, the above is what I have learned from my recent work on an LLVM backend, and I'm unsure whether any of it would be useful or applicable to a 6502 compiler. Anyway, I hope it was an interesting read.
Joan