Well, it's been another month since my last update, and optimization is still progressing. There have been some wins, and there have been some losses. I've discovered that optimizing a compiler can be a bit like trying to compress a balloon: squeeze it down in one spot, and it bulges out somewhere else. Still, more wins than losses overall; the balloon is smaller now than it was a month ago.
I'm getting pretty close to the point where it'll make sense to cut a first beta release. There are only a few more big, obvious things to implement; then I'll do a sweep through the benchmarks looking for anything too horrific that I've missed. I'll almost certainly find a few things, so the timing is still an open question; no firm plans yet.
Some interesting work stands out (the rest is scattershot tuning):
Setjmp/longjmp and the Calling Convention

The biggest change by far involves setjmp/longjmp. Even though freestanding C implementations don't technically have to provide them, I wanted to make sure it was possible to implement these before the first release. They're notorious for being really difficult to MacGyver into an implementation that didn't keep them in mind.
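For the unfamiliar, here's the standard usage pattern that has to be implementable (a generic C sketch, not llvm-mos-specific code): setjmp snapshots enough machine state that longjmp can later jump back through it, which means every location that can hold live values, imaginary registers included, has to be saveable and restorable.
Code:
#include <setjmp.h>

static jmp_buf env;

static void fail(void) {
    longjmp(env, 1);         /* unwind back to the setjmp call below */
}

int checked(void) {
    if (setjmp(env) == 0) {  /* direct return: context saved */
        fail();
        return 0;            /* never reached */
    }
    return -1;               /* reached via longjmp */
}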
Along those lines, I found that there's just no practical way to write them without sharply limiting the number of imaginary registers, which is what I ended up having to do. The mechanism for configuring the number of imaginary registers has been completely removed; the number is now fixed at 32 bytes, organized as 16 pairs of two consecutive bytes each. This is the exact same footprint as SWEET16, which I consider a bit serendipitous, and the same as the Commander X16 ABI, if that project ever gets off the ground.
The good news is that there was no observable performance change in the simulator from cutting the number of imaginary registers from 256 down to 32. A typical function rarely has more than 32 bytes of data live at any given time, and modern compilers are pretty good at packing values into registers.
Sixteen pointer pairs is also not that many more than the fewest I've been able to get the register allocator to accept (IIRC, around 10). So it sucks to *require* 32 bytes of the zero page to use the compiler, but it makes the compiler quite a bit easier to develop and change, it fixes the calling convention across targets, and it makes it feasible to implement libcalls in assembly. Overall it seems worth the tradeoff, in my opinion; but if you have really strong opinions about this, please let me know!
Shift/rotate lowering

I always try to show at least some code in these posts, so here it is. Previously, the compiler special-cased shifts and rotates by 1 and by multiples of 8; any other amount would emit a libcall that uses a loop. Now there are static code patterns for all constant shifts and rotates. There may be better ways to do some of these, but they're all better than loops, for sure. As always, note that the patterns are high-level: the registers used, the instructions used, and even the instructions' order are determined dynamically by the needs of the surrounding context. In the examples below, those needs are mostly due to the C calling convention. Here are some examples:
Code:
unsigned lshrby2(unsigned x) { return x >> 2; }
unsigned lshrby7(unsigned x) { return x >> 7; }
int ashrby7(int x) { return x >> 7; }
Code:
lshrby2:
        stx mos8(__rc2)   ; park the high byte in an imaginary register
        lsr mos8(__rc2)   ; shift high byte right; bit 0 drops into carry
        ror               ; rotate carry into the low byte (in A)
        lsr mos8(__rc2)   ; second shift, same dance
        ror
        ldx mos8(__rc2)   ; reload the high byte
        rts
lshrby7:
        stx mos8(__rc2)
        asl               ; x >> 7 == (x << 1) >> 8: shift left once,
        rol mos8(__rc2)   ; leaving bits 14..7 in __rc2 (the low result byte)
        lda #0
        rol               ; bit 15 becomes the entire high result byte
        tax
        lda mos8(__rc2)
        rts
ashrby7:
        sta mos8(__rc2)
        txa
        bpl .LBB2_2       ; choose the sign-fill byte for the high result
        ldx #-1
        jmp .LBB2_3
.LBB2_2:
        ldx #0
.LBB2_3:
        asl mos8(__rc2)   ; same (x << 1) >> 8 trick as lshrby7;
        rol               ; low result byte lands in A, sign bit in carry
        stx mos8(__rc2)
        rol mos8(__rc2)   ; rotate the sign into the fill byte
        ldx mos8(__rc2)   ; high result byte
        rts
Divmod libcalls

When you compute a / b and a % b for the same a and b, the compiler will now emit a combined division and modulus library call, rather than two separate library calls. This nearly halves the amount of work necessary, although this case isn't terribly frequent.
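For example, both operations in the following (the function and names are mine, just to illustrate the pattern) can now be serviced by a single combined divmod call:
Code:
/* a / b and a % b of the same operands: previously two libcalls,
   now one combined call that produces both results. */
void split(unsigned a, unsigned b, unsigned *quot, unsigned *rem) {
    *quot = a / b;
    *rem  = a % b;
}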
Late optimization pass

The compiler will now try to elide comparisons with zero by looking to see if N and Z were already set by an earlier instruction. This is done rather late, as this type of optimization depends on how the final assembly shook out.
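For instance, in something like the following (a hand-written sketch of the sort of code that benefits, not actual compiler output), the AND that evaluates the condition already sets N and Z, so the == 0 test can branch directly without an explicit CMP #0:
Code:
extern void handle_low(void);   /* hypothetical callee */

void check(unsigned char flags) {
    /* The AND that computes flags & 0x0F sets N and Z as a side
       effect, so the zero test needs no separate comparison. */
    if ((flags & 0x0F) == 0)
        handle_low();
}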
Similarly, an immediate load into X (or Y) can be replaced with INX or DEX (INY or DEY) if the register can be proven to already hold one less or one more than the desired value. This isn't any faster, but it saves a byte.
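As a sketch of where this can fire (actual codegen may differ): if an index register still holds 4 after the first access below, the 5 for the second access can be produced with a one-byte increment instead of a two-byte immediate load.
Code:
/* Hypothetical illustration: adjacent constant indices one apart. */
unsigned char sum_adjacent(const unsigned char *p) {
    return p[4] + p[5];   /* second index = first index + 1 */
}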
Register Hinting

Previously, LLVM didn't really understand that certain registers are more efficient for certain instructions on the 6502. For example, you can ASL either A or a zero-page location, but A is quite a lot better. This didn't bite us too much, since we made the allocator try A first, but there was no full reckoning of the cost of assigning a value to each possible location.
Now there is: for each value, we add up the costs of placing each of its instructions in each possible location, then pick the cheapest. This is done greedily, one value at a time, and it's not at all perfect, but it seems to work fairly well. It turns out to have made a relatively minor difference in the overall quality of the output, since where a value goes is generally decided more by which operation will need it next than by there being a ton of alternatives to try.
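As a small example of the kind of choice this affects (a sketch, not measured output): a chain of shifts wants its value in A the whole time, since ASL on A is far cheaper than ASL on a zero-page imaginary register.
Code:
/* With per-location cost accounting, x should stay in A across all
   three shifts rather than bouncing through the zero page. */
unsigned char shl3(unsigned char x) {
    return (unsigned char)(x << 3);
}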