Quick Update:
Following the discussions about interrupt handling here, I started playing around with making the "interrupt-friendly" calling convention the default. The experiment seems to have gone well, so I've made the change, and we'll see what breaks in continuous integration.
The short version is that C functions now attempt to "leave no trace" when they execute: they're allowed to freely overwrite only 4 bytes of the zero page. They can touch any of the rest of the zero page given to the compiler, but they have to put it back the way they found it before they return. This means that interrupt handlers only need to push the 4 scratch bytes to the stack, along with A, X, and Y (and a handful of compiler temporaries I'm getting rid of). The C calling convention implicitly handles the rest.
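To make that concrete, here's a rough sketch of what an interrupt handler has to preserve under the new convention before it can safely call into C code. The __rc2 through __rc5 names (and handle_irq) are just illustrative stand-ins for whichever 4 zero page bytes end up being the caller-saved scratch area, not the compiler's actual register names.
Code:
irq:
    pha                  ; save the hardware registers A, X, Y
    txa
    pha
    tya
    pha
    lda __rc2            ; save the 4 caller-saved zero page scratch bytes
    pha
    lda __rc3
    pha
    lda __rc4
    pha
    lda __rc5
    pha
    jsr handle_irq       ; any C function; the convention preserves the rest
    pla                  ; restore everything in reverse order
    sta __rc5
    pla
    sta __rc4
    pla
    sta __rc3
    pla
    sta __rc2
    pla
    tay
    pla
    tax
    pla
    rti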
Code quality took somewhat of a hit from this, but it's not nearly as bad as I'd feared. The obvious downside is that functions begin by saving off one zero page register after another to the soft stack; quite a lot of them, anywhere from 10 to 50 locations. The upshot is that once this is done, the bodies of the functions operate almost entirely in the zero page, as you might expect, while still being fully reentrant. If the function can be proven non-recursive, the soft stack is just an absolute memory region, so the prologue ends up copying bytes out of the zero page into some other page, which is a bit silly.
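Concretely, the prologue and epilogue of such a non-recursive function look something like this (the __rc* names and the foo_sstk frame label are illustrative, not the compiler's actual output):
Code:
foo:
    lda __rc2            ; prologue: evacuate each zero page byte the body
    sta foo_sstk+0       ; wants to use into the function's static frame
    lda __rc3
    sta foo_sstk+1
    ; ... repeated for however many zero page bytes the body needs ...
    ; (the body then runs almost entirely out of the zero page)
    lda foo_sstk+1       ; epilogue: copy everything back the way it was
    sta __rc3
    lda foo_sstk+0
    sta __rc2
    rts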
That silliness was quite illuminating, though, and the problem isn't quite as serious as I'd feared.
Say you have the following function:
Code:
void bar(void);

int foo(int a, int b) {
    int c = a + b;
    bar();
    return c;
}
In order to return a+b, you have to stash it somewhere safe from "bar".
LLVM knows two techniques for this:
1) Save it to the stack before the function call, then reload it from the stack afterwards.
2) Save a callee-saved (zero page) register at function entry, and restore it at function exit. Then, just compute the value right into the callee-saved register; bar has already promised not to touch those.
Without any more information, 2 is almost strictly superior to 1, no matter which function you ask about. They each produce exactly the same number of stores and reloads, so they share the same downside. However, once you've done 2), the register is also free throughout the whole function to use for all kinds of things, so 2 has a unique upside.
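To make the comparison concrete, here's roughly how those two choices lower for foo. This is only a sketch; the __rc* names, returning the int in A (low) and X (high), and labels like c_slot and save are illustrative assumptions, not necessarily what the compiler emits.
Code:
; Technique 1: spill c to a stack slot across the call.
foo_spill:
    ; ... compute a+b into A (low) and X (high) ...
    sta c_slot+0         ; stash c somewhere bar can't touch
    stx c_slot+1
    jsr bar
    lda c_slot+0         ; reload c as the return value
    ldx c_slot+1
    rts

; Technique 2: park c in a callee-saved zero page pair.
foo_csr:
    ldy __rc2            ; prologue: save the pair we're about to use
    sty save+0
    ldy __rc3
    sty save+1
    ; ... compute a+b directly into __rc2/__rc3 ...
    jsr bar              ; bar has promised not to touch __rc2/__rc3
    lda __rc2            ; c survived the call; move it to the return regs
    ldx __rc3
    ldy save+0           ; epilogue: restore the pair for our own caller
    sty __rc2
    ldy save+1
    sty __rc3
    rts
Either way there's one save and one reload per byte kept live across the call; the difference is that in the second form __rc2/__rc3 are also free for the entire body of foo to use.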
It's not quite that straightforward in our case, though. Using static stacks can be almost as cheap as the zero page, and zero page accesses themselves are about a third of the cost of a soft stack access. And the reload in 1) can often be folded into whatever computation uses the value, e.g., LDY #42; EOR (SP),Y. This becomes dramatically more likely (and cheaper!) if static stacks are used: EOR sstk+42. So in a number of cases we'd really, really want to do 1) instead, and those are precisely the cases where it'd be really silly to have a prologue that frees up a bunch of the zero page. That's what the prior caller-save-heavy calling convention was actually buying us: it forced the compiler to choose 1 instead of 2 by limiting the number of callee-saved registers available.
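For a concrete feel for that folding, here's the same reload of a spilled value done against both kinds of stack. The __rc0/__rc1 pair standing in for the soft stack pointer and the foo_sstk label are assumptions of this sketch, and 42 is just an arbitrary frame offset.
Code:
; Soft stack: the reload still folds into the EOR, but it burns Y and an
; indexed-indirect access through the stack pointer in the zero page.
    ldy #42
    eor (__rc0),y

; Static stack: a single absolute-addressed instruction, no Y, no indirection.
    eor foo_sstk+42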
So we can just teach LLVM how to effectively choose between 1) and 2) on MOS, and that should make the compiler quite tactical about when it's willing to save and restore a zero page register for internal use. We can also still do the whole-program zero page allocation discussed previously; this would just have the effect of making reserved zero page registers free to use for the functions they were reserved for. Other functions could still use them, but they'd have to pay to save and restore them.
So, at the very least, I think there's a relatively clear path from here to good, performant code. Once this change percolates through testing, I'll be back to my end-to-end audit of the compiler for correctness.