LLVM 6502 Codegen
-
mysterymath
- Posts: 79
- Joined: 10 Jan 2021
Re: LLVM 6502 Codegen
Sean wrote:
Am I reading this correctly that ".LBB0_1" is the start of the loop for the inlined puts implementation and "jsr 65490" is the call to a subroutine that actually outputs the character?
-
teamtempest
- Posts: 443
- Joined: 08 Nov 2009
- Location: Minnesota
- Contact:
Re: LLVM 6502 Codegen
Quote:
.Lstr:
.asciz "HELLO, 6502!"
First, I don't see the '\n' character here, though that may just be a transcription error. Or the '.asciz' pseudo op may not actually terminate the string with '\0' but with '\n'. (it must be there somewhere, otherwise the main loop never outputs one)
Second, if it does terminate with a zero byte, the main loop does not use that but instead actually determines the length of the string and counts. This implies that any terminating zero byte is essentially wasting space.
Third, the loop starts with the 'H' pre-loaded into the accumulator. Which is fine, but why then bother storing another copy that is never used? (although I'll grant you it's probably simpler just to store it anyway)
Would there be any benefit to having the compiler output something like this instead?
Code: Select all
.Lstr:
.ascii "ELLO, 6502!\n"
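In C terms, the tradeoff here is between a counted loop (which never reads the terminator, so the '\0' is dead weight) and a sentinel loop (which needs it). A hypothetical host-side sketch of the two shapes:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Counted copy: the length is baked in at compile time, so the
 * trailing '\0' of the source string is never read. */
static size_t copy_counted(const char *s, size_t n, char *out) {
    for (size_t i = 0; i < n; i++)
        out[i] = s[i];
    return n;
}

/* Sentinel copy: stops only when it reads the terminator, so the
 * '\0' must actually be stored in the data. */
static size_t copy_sentinel(const char *s, char *out) {
    size_t i = 0;
    while (s[i] != '\0') {
        out[i] = s[i];
        i++;
    }
    return i;
}
```

Both produce the same bytes for "HELLO, 6502!"; the only difference is whether the terminator has to exist in memory at all.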
-
mysterymath
- Posts: 79
- Joined: 10 Jan 2021
Re: LLVM 6502 Codegen
The newline is also in the code, right after the main loop. There's a LDA of 10 (\n) and another call to JSR.
You could trim the string, but the logic required would be rather tricky.
- barrym95838
- Posts: 2056
- Joined: 30 Jun 2013
- Location: Sacramento, CA, USA
Re: LLVM 6502 Codegen
So, LLVM provides the entire zero-terminated string, presumably for other manipulations like strcpy(), but doesn't use the first or last byte when outputting it? That seems to me like a dubious route to optimization for size or speed, but my experience in this arena is definitely limited.
Three types of optimizations I have seen in native assembly are backwards strings, high-bit terminators and "negative indexing", but none of them help much for well-behaved C strings.
Got a kilobyte lying fallow in your 65xx's memory map? Sprinkle some VTL02C on it and see how it grows on you!
Mike B. (about me) (learning how to github)
-
mysterymath
- Posts: 79
- Joined: 10 Jan 2021
Re: LLVM 6502 Codegen
The actual optimization here is "loop rotation", the effects on strings are incidental. If compiling for absolute minimum size (-Oz), LLVM won't do this, but this in turn disables a number of other optimizations. The usual size optimized mode (-Os) still does this transform, since it's pretty essential for the other loop passes to work well, which can also have an impact on size.
I recompiled the SDK and hello world with -Oz to see what happens; here's what it produced. The total execution path of main should be a bit longer in this version. If you walk through it, there's an extra indexed load and an extra branch. These are peeled off by loop rotation, and turned into an immediate load and a fall-through, respectively.
I guess it's a bit like a constant-overhead version of loop unrolling; it's definitely a bit of a speed-size tradeoff, but the examples I've seen aren't very severe (yet). This is one of those things we'd want to revisit if we come across an example where the difference is more than just a couple bytes.
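For readers less familiar with the transform: in C terms, loop rotation peels the first test off a top-tested loop and turns the body into a do-while, so each iteration ends with a single conditional branch back. A hypothetical sketch (not LLVM's actual IR transform, which operates on the CFG):

```c
#include <assert.h>

/* Top-tested loop: every iteration runs the test at the top, then
 * branches back from the bottom (two branches per iteration). */
static int sum_top_test(const int *a, int n) {
    int s = 0;
    int i = 0;
    while (i < n) {
        s += a[i];
        i++;
    }
    return s;
}

/* Rotated loop: the first test is peeled off (it often folds away
 * entirely when the trip count is known), and the loop becomes a
 * do-while with a single backward branch per iteration. */
static int sum_rotated(const int *a, int n) {
    int s = 0;
    int i = 0;
    if (i < n) {
        do {
            s += a[i];
            i++;
        } while (i < n);
    }
    return s;
}
```

The -Oz output below is what you get without the peel: the test block runs even on the first trip through.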
Code: Select all
main:
ldx #0
jmp .LBB0_2
.LBB0_1:
lda .Lstr,x
jsr 65490
inx
.LBB0_2:
cpx #12
bne .LBB0_1
lda #10
jsr 65490
rts
Re: LLVM 6502 Codegen
Oh, -Oz is new to me!
- barrym95838
- Posts: 2056
- Joined: 30 Jun 2013
- Location: Sacramento, CA, USA
Re: LLVM 6502 Codegen
I think I'm starting to understand. In my size-optimizing mentality, I say to myself "Self, the LDA absolute,X is already there, why not use it and get rid of the LDA# and CPX#?". But LLVM has been taught to understand that trading one of them for a LDA# is faster, even if it's outside the loop. Getting rid of the CPX# inside the loop should start to pay off for any non-trivial strings, though, and a string like "Hello, World!" should benefit measurably. Of course, the elephant in the room is the JSR ...
or the slightly faster but slightly trickier (and more dangerous in the general case)
I don't have the foggiest clue how to teach a compiler to do what my dusty old brain just did.
Code: Select all
main:
ldx #0
beq .LBB0_2
.LBB0_1:
jsr 65490
inx
.LBB0_2:
lda .Lstr,x
bne .LBB0_1
lda #10
jmp 65490
Code: Select all
main:
ldx #0
lda #72 ; whatever's @ .Lstr+0
.LBB0_1: ; (please don't be '\0')
jsr 65490
inx
lda .Lstr,x
bne .LBB0_1
lda #10
jmp 65490
Got a kilobyte lying fallow in your 65xx's memory map? Sprinkle some VTL02C on it and see how it grows on you!
Mike B. (about me) (learning how to github)
-
mysterymath
- Posts: 79
- Joined: 10 Jan 2021
Re: LLVM 6502 Codegen
The human writing assembly can start with an idea of their loop, then code it out, then say, "hmm, maybe I should try comparing against 13 instead of looking at the load." Then they'd notice, "nah, that's not actually any faster" in one case, where in another they'd be like "whoa, that's a bit faster." In the first case, they'd go back and undo their mistake.
Compilers essentially never go back. Like a human, a compiler starts out with "a general idea about the loop" in the form of compiler IR. Then it gradually makes decisions about it, one aspect at a time. If a decision to be made later would inform a decision made earlier, then that's just too bad! It makes a guess, and if the guess doesn't pay off later (like rewriting the loop test in our example), then it doesn't pay off.
This is, of course, done for performance reasons, which matters a bit more when compiling a binary in the gigabytes than 64 kilobytes. But LLVM's still the best game in town; phase-coupled code generation is a surprisingly young field of research, as it only recently became feasible. Compiler problems get a lot harder when you try to treat them "all at once" like this, the way a human would. But there's been some breakthroughs in solving the kinds of NP-hard problems that humans are usually good at (Mixed Integer Programming, Neural Net methods, etc).
My initial foray into 6502 compiler development was actually along those lines: straight from high-level IR to 6502 assembly. But the problem was still intractably difficult even for modestly-sized routines, and the techniques I was using to deal with it started to look more and more like LLVM...
I don't think this is actually a 6502 problem though; now that I've got some experience with LLVM, I've noticed some similarly questionable asm fragments in test cases for other targets. My goal has similarly shifted to producing a good LLVM backend for the 6502, commensurate with the ones for MIPS, x86, etc. But that's still probably not going to beat any of you at micro-optimization!
-
teamtempest
- Posts: 443
- Joined: 08 Nov 2009
- Location: Minnesota
- Contact:
Re: LLVM 6502 Codegen
Quote:
The newline is also in the code, right after the main loop. There's a LDA of 10 (\n) and another call to JSR.
I'm not sure if the C64 KERNAL treats them the same way, but IIRC the BASIC interpreter's subroutine to output a newline first sends decimal 13, then checks the file number it is currently outputting to. If it's >= 128 (high bit set), then it also sends a decimal 10.
In any case, the original C program has this:
Code: Select all
void main(void) {
printf("HELLO, 6502!\n");
}
Code: Select all
.Lstr:
.asciz "HELLO, 6502!"
To me it seems shorter to store the decimal 10 and output it as part of the loop, but I'm not a compiler writer. Is the real optimization replacing 'printf()' with 'puts()', thereby avoiding dragging in all the unneeded machinery of 'printf()'?
I'm curious what the compiler does with something like this:
Code: Select all
#include <stdio.h>
void main(void) {
char s[] = "HELLO, 6502!\n";
printf(s);
}
-
mysterymath
- Posts: 79
- Joined: 10 Jan 2021
Re: LLVM 6502 Codegen
teamtempest wrote:
I'm not sure if the C64 KERNAL treats them the same way, but IIRC the BASIC interpreter's subroutine to output a newline first sends decimal 13, then checks the file number it is currently outputting to. If it's >= 128 (high bit set), then it also sends a decimal 10.
I did a quick read-through of the standard, and one approach is to do a DOS/Windows-style translation between newline and CR in the target-specific implementations of things like putchar.
Code: Select all
void __putchar(char c) {
if (__builtin_expect(c == '\n', 0)) c = '\r';
__chrout(c);
}
Code: Select all
main:
ldx #1
lda #72
.LBB0_1:
cmp #10
beq .LBB0_4
.LBB0_2:
jsr 65490
lda .Lstr,x
inx
cpx #13
bne .LBB0_1
lda #13
jsr 65490
rts
.LBB0_4:
lda #13
jmp .LBB0_2
The other approach would be to consider '\n' to be equal to '\r' in the "execution character set" for the C64. This gets into needing different execution character sets per target in LLVM, and the last time I did a cursory sweep through, there isn't great support for that. LLVM just calls out to iconv to do character set translation, so I'd probably be looking at writing and packaging iconv plugins for each tweaked character set if I wanted to go that route, or else building a new character set management system into LLVM.
There are real advantages to keeping LLVM ASCII; character sets are absurdly infectious in C; they even affect the preprocessor. You can say "#if 'a' == 65", and the result of that may differ from "if ('a' == 65)" in the code. The rules for this are exceedingly arcane, and basically no C programmers know them, so I'm inclined to avoid that sort of chicanery.
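A minimal demonstration of the two contexts, assuming an ASCII host compiler (on a cross compiler with a non-ASCII execution character set, the two comparisons below are allowed to disagree):

```c
#include <assert.h>

/* The preprocessor evaluates character constants in its own
 * (translation-time) character set... */
#if 'a' == 97
enum { PP_SAW_ASCII = 1 };
#else
enum { PP_SAW_ASCII = 0 };
#endif

/* ...while this expression is evaluated in the execution character
 * set. The C standard makes it implementation-defined whether the
 * two values match. */
static int execution_sees_ascii(void) {
    return 'a' == 97;
}
```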
It looks like someone's working on building a character set management subsystem in LLVM right now, for IBM System Z's EBCDIC (lmao of course).
https://reviews.llvm.org/D88741
teamtempest wrote:
Is the real optimization replacing 'printf()' with 'puts()', thereby avoiding dragging in all the unneeded machinery of 'printf()'?
teamtempest wrote:
I'm curious what the compiler does with something like this:
Code: Select all
#include <stdio.h>
void main(void) {
char s[] = "HELLO, 6502!\n";
printf(s);
}
-
teamtempest
- Posts: 443
- Joined: 08 Nov 2009
- Location: Minnesota
- Contact:
Re: LLVM 6502 Codegen
Thanks for all your hard work! It didn't occur to me that the compiler would put that string on the stack.
Quote:
'puts()' has "append a newline" as part of its contract
Put this way it sounds as if puts() is supposed to output the string it is given and then send a newline. Shouldn't the example output two newlines then? The one in the input string and then one of puts()'s own?
What happens if the input string has two or more trailing newlines? Do they all get stripped off in favor of the single one output by puts()?
- barrym95838
- Posts: 2056
- Joined: 30 Jun 2013
- Location: Sacramento, CA, USA
Re: LLVM 6502 Codegen
I think the key goal of an optimizer is to produce identical results more efficiently. If (and only if) it can replace printf("Something.\n"); with puts("Something."); then it green-lights the transform, for any pattern of "Something.".
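That "if and only if" can be sketched as a predicate. This is a hypothetical simplification of the condition LLVM's library-call simplifier checks (the real pass handles more cases, e.g. printf("%s\n", s) → puts(s)):

```c
#include <assert.h>
#include <string.h>

/* printf(fmt) can be lowered to puts(<fmt minus trailing newline>)
 * when fmt contains no '%' conversions and ends in '\n' -- puts()
 * supplies that final newline itself as part of its contract. */
static int can_lower_printf_to_puts(const char *fmt) {
    size_t n = strlen(fmt);
    if (n == 0 || fmt[n - 1] != '\n')
        return 0;
    return strchr(fmt, '%') == NULL;
}
```

Anything with a conversion, or without the trailing newline, has to stay a printf call (or become fputs/putchar, which LLVM also tries).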
A simple wrapper for $FFD2 should solve the '\n' issue, and can be expanded to deal with other PETSCII idiosyncrasies as well.
Got a kilobyte lying fallow in your 65xx's memory map? Sprinkle some VTL02C on it and see how it grows on you!
Mike B. (about me) (learning how to github)
-
mysterymath
- Posts: 79
- Joined: 10 Jan 2021
Re: LLVM 6502 Codegen
teamtempest wrote:
Put this way it sounds as if puts() is supposed to output the string it is given and then send a newline. Shouldn't the example output two newlines then? The one in the input string and then one of puts()'s own?
What happens if the input string has two or more trailing newlines? Do they all get stripped off in favor of the single one output by puts()?
barrym95838 wrote:
I think the key goal of an optimizer is to produce identical results more efficiently. If (and only if) it can replace printf("Something.\n"); with puts("Something."); then it green-lights the transform, for any pattern of "Something.".
barrym95838 wrote:
A simple wrapper for $FFD2 should solve the '\n' issue, and can be expanded to deal with other PETSCII idiosyncrasies as well.
-
teamtempest
- Posts: 443
- Joined: 08 Nov 2009
- Location: Minnesota
- Contact:
Re: LLVM 6502 Codegen
I see. A very specific optimization, but workable if you know what it's doing.
-
mysterymath
- Posts: 79
- Joined: 10 Jan 2021
Re: LLVM 6502 Codegen
Well, it's been another month since my last update, and optimization is still progressing. There's been some wins, there's been some losses. I've discovered that optimizing a compiler can be a bit like trying to compress a balloon. Still, more wins than losses overall; the balloon is smaller now than it was a month ago.
I'm getting pretty close to the point where it'll make sense to cut a first beta release; there's only a few more big obvious things to implement, then I'll be doing a sweep through the benchmarks to look for anything too horrific that I've missed. I'll almost certainly find a few things, so it's just something I'm thinking about the timing of; no clear plans yet.
Some interesting work stands out (the rest is scattershot tuning):
Setjmp/longjmp and the Calling Convention
The biggest change by far involves setjmp/longjmp. Even though freestanding C implementations don't technically have to provide them, I wanted to make sure that it was possible to implement these before the first release. They're notorious for being really difficult to MacGyver into an implementation that didn't keep them in mind.
Along those lines, I found that there's just no practical way to write them without sharply limiting the number of imaginary registers. Which is what I ended up having to do. The mechanism for setting the number of imaginary registers has been completely removed. The number is now fixed at 32 bytes; 16 pairs of two consecutive bytes each. This is the same exact footprint as SWEET16, which I consider a bit serendipitous. Same for the Commander X16 ABI, if the project ever gets off the ground.
The good news is that there was no observable performance change from limiting the number of imaginary registers in the simulator from 256 to 32. There really just aren't usually more than 32 bytes of data live in a typical function at a given time, and modern compilers are pretty good at packing values into registers.
16 pointer pairs is also not that many more than the fewest I've been able to get the register allocator to accept (IIRC, it was around 10). So it sucks to *require* 32 bytes of the zero page to use the compiler, but it does make the compiler quite a bit easier to develop and change, it fixes the calling convention across targets, makes it feasible to implement libcalls in assembly, and just seems overall to be worth the tradeoff, in my opinion. But if you have really strong opinions about this, please let me know!
Shift/rotate lowering
I always try to show at least some code in these posts, so here it is. Previously, the compiler special-cased shifts and rotates by 1 and by multiples of 8, but any other amount would emit a libcall that uses a loop. Now, we've got static code patterns for all constant shifts and rotates. There may be better ways to do some of these, but they're all better than loops, for sure. As always, note that patterns are high-level: the registers used, the instructions used, and even the instructions' order, are determined dynamically by the needs of the surrounding context. In the examples below, these needs are mostly due to the C calling convention. Here's some examples:
Code: Select all
unsigned lshrby2(unsigned x) { return x >> 2; }
unsigned lshrby7(unsigned x) { return x >> 7; }
int ashrby7(int x) { return x >> 7; }
Code: Select all
lshrby2:
stx mos8(__rc2)
lsr mos8(__rc2)
ror
lsr mos8(__rc2)
ror
ldx mos8(__rc2)
rts
lshrby7:
stx mos8(__rc2)
asl
rol mos8(__rc2)
lda #0
rol
tax
lda mos8(__rc2)
rts
ashrby7:
sta mos8(__rc2)
txa
bpl .LBB2_2
ldx #-1
jmp .LBB2_3
.LBB2_2:
ldx #0
.LBB2_3:
asl mos8(__rc2)
rol
stx mos8(__rc2)
rol mos8(__rc2)
ldx mos8(__rc2)
rts
Divmod libcalls
When you compute a / b and a % b for the same a and b, the compiler will now emit a combined division and modulus library call, rather than two separate library calls. This nearly halves the amount of work necessary, although this case isn't terribly frequent.
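Standard C exposes the same pairing through div() and ldiv(), which return the quotient and remainder from a single call -- essentially what the combined libcall does at the machine level:

```c
#include <assert.h>
#include <stdlib.h>

/* One call yields both results; a 6502 divmod routine similarly
 * leaves the remainder behind as a byproduct of computing the
 * quotient, so asking for both costs little more than asking
 * for one. */
static void divmod(int a, int b, int *q, int *r) {
    div_t d = div(a, b);
    *q = d.quot;
    *r = d.rem;
}
```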
Late optimization pass
The compiler will now try to elide comparisons with zero by looking to see if N and Z were already set by an earlier instruction. This is done rather late, as this type of optimization depends on how the final assembly shook out.
Similarly, immediate loads of X (or Y) can be replaced with INX or DEX if X can be proven to contain the value minus one or plus one. This isn't any faster, but it saves a byte.
Register Hinting
Previously, LLVM didn't really understand that certain registers were more efficient for certain instructions on the 6502. For example, you can ASL in either A or the zero page, but A is quite a lot better. This didn't bite us too much, since we made it try to assign to A first, but there wasn't a full reckoning of the costs of assigning a value to each possible location.
Now there is; for each value, we add up the costs of placing each instruction in each possible location, then pick the cheapest. This is done greedily, one value at a time, and it's not at all perfect, but it seems to work fairly well. It turns out to have made a relatively minor difference in the overall quality of the output, since where a value goes is generally decided more by what operation will need to use it next than by there being a ton of alternatives to try.
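As a toy illustration of the scheme (entirely hypothetical code, not the actual LLVM pass): sum the per-use costs of each candidate location for a value, then greedily take the cheapest.

```c
#include <assert.h>
#include <limits.h>

enum { LOC_A, LOC_X, LOC_ZP, NUM_LOCS };

/* cost[u][loc] = cost of use u if the value lives in loc. A large
 * entry models an operation the location can't perform directly
 * (e.g. ASL on X, which would need a transfer first). */
static int pick_location(int num_uses, const int cost[][NUM_LOCS]) {
    int best = LOC_A;
    int best_total = INT_MAX;
    for (int loc = 0; loc < NUM_LOCS; loc++) {
        int total = 0;
        for (int u = 0; u < num_uses; u++)
            total += cost[u][loc];
        if (total < best_total) {
            best_total = total;
            best = loc;
        }
    }
    return best;
}
```

With two ASL-like uses that are cheap in A, the picker lands on A, matching the intuition that shift-heavy values belong in the accumulator.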