LLVM 6502 Codegen
-
mysterymath
- Posts: 79
- Joined: 10 Jan 2021
Re: LLVM 6502 Codegen
Sean wrote:
Am I reading this correctly that ".LBB0_1" is the start of the loop for the inlined puts implementation and "jsr 65490" is the call to a subroutine that actually outputs the character?
-
teamtempest
- Posts: 443
- Joined: 08 Nov 2009
- Location: Minnesota
- Contact:
Re: LLVM 6502 Codegen
Quote:
.Lstr:
.asciz "HELLO, 6502!"
First, I don't see the '\n' character here, though that may just be a transcription error. Or the '.asciz' pseudo op may not actually terminate the string with '\0' but with '\n'. (it must be there somewhere, otherwise the main loop never outputs one)
Second, if it does terminate with a zero byte, the main loop does not use that but instead actually determines the length of the string and counts. This implies that any terminating zero byte is essentially wasting space.
Third, the loop starts with the 'H' pre-loaded into the accumulator. Which is fine, but why then bother storing another copy that is never used? (although I'll grant you it's probably simpler just to store it anyway)
Would there be any benefit to having the compiler output something like this instead?
Code: Select all
.Lstr:
.ascii "ELLO, 6502!\n"
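In C terms, the tradeoff here is between a counted loop (which never reads the terminator, so the '\0' is dead weight) and a sentinel loop (which needs it). A hypothetical host-side sketch of the two shapes:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Counted copy: the length is baked in at compile time, so the
 * trailing '\0' of the source string is never read. */
static size_t copy_counted(const char *s, size_t n, char *out) {
    for (size_t i = 0; i < n; i++)
        out[i] = s[i];
    return n;
}

/* Sentinel copy: stops only when it reads the terminator, so the
 * '\0' must actually be stored in the data. */
static size_t copy_sentinel(const char *s, char *out) {
    size_t i = 0;
    while (s[i] != '\0') {
        out[i] = s[i];
        i++;
    }
    return i;
}
```

Both produce the same bytes for "HELLO, 6502!"; the only difference is whether the terminator has to exist in memory at all.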
-
mysterymath
- Posts: 79
- Joined: 10 Jan 2021
Re: LLVM 6502 Codegen
The newline is also in the code, right after the main loop. There's a LDA of 10 (\n) and another call to JSR.
You could trim the string, but the logic required would be rather tricky.
- barrym95838
- Posts: 2056
- Joined: 30 Jun 2013
- Location: Sacramento, CA, USA
Re: LLVM 6502 Codegen
So, LLVM provides the entire zero-terminated string, presumably for other manipulations like strcpy(), but doesn't use the first or last byte when outputting it? That seems to me like a dubious route to optimization for size or speed, but my experience in this arena is definitely limited.
Three types of optimizations I have seen in native assembly are backwards strings, high-bit terminators and "negative indexing", but none of them help much for well-behaved C strings.
Got a kilobyte lying fallow in your 65xx's memory map? Sprinkle some VTL02C on it and see how it grows on you!
Mike B. (about me) (learning how to github)
-
mysterymath
- Posts: 79
- Joined: 10 Jan 2021
Re: LLVM 6502 Codegen
The actual optimization here is "loop rotation", the effects on strings are incidental. If compiling for absolute minimum size (-Oz), LLVM won't do this, but this in turn disables a number of other optimizations. The usual size optimized mode (-Os) still does this transform, since it's pretty essential for the other loop passes to work well, which can also have an impact on size.
I recompiled the SDK and hello world with -Oz to see what happens; here's what it produced. The total execution path of main should be a bit longer in this version. If you walk through it, there's an extra indexed load and an extra branch. These are peeled off by loop rotation, and turned into an immediate load and a fall-through, respectively.
I guess it's a bit like a constant-overhead version of loop unrolling; it's definitely a bit of a speed-size tradeoff, but the examples I've seen aren't very severe (yet). This is one of those things we'd want to revisit if we come across an example where the difference is more than just a couple bytes.
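For readers less familiar with the transform: in C terms, loop rotation peels the first test off a top-tested loop and turns the body into a do-while, so each iteration ends with a single conditional branch back. A hypothetical sketch (not LLVM's actual IR transform, which operates on the CFG):

```c
#include <assert.h>

/* Top-tested loop: every iteration runs the test at the top, then
 * branches back from the bottom (two branches per iteration). */
static int sum_top_test(const int *a, int n) {
    int s = 0;
    int i = 0;
    while (i < n) {
        s += a[i];
        i++;
    }
    return s;
}

/* Rotated loop: the first test is peeled off (it often folds away
 * entirely when the trip count is known), and the loop becomes a
 * do-while with a single backward branch per iteration. */
static int sum_rotated(const int *a, int n) {
    int s = 0;
    int i = 0;
    if (i < n) {
        do {
            s += a[i];
            i++;
        } while (i < n);
    }
    return s;
}
```

The -Oz output below is what you get without the peel: the test block runs even on the first trip through.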
Code: Select all
main:
ldx #0
jmp .LBB0_2
.LBB0_1:
lda .Lstr,x
jsr 65490
inx
.LBB0_2:
cpx #12
bne .LBB0_1
lda #10
jsr 65490
rts
Re: LLVM 6502 Codegen
Oh, -Oz is new to me!
- barrym95838
- Posts: 2056
- Joined: 30 Jun 2013
- Location: Sacramento, CA, USA
Re: LLVM 6502 Codegen
I think I'm starting to understand. In my size-optimizing mentality, I say to myself "Self, the LDA absolute,X is already there, why not use it and get rid of the LDA# and CPX#?". But LLVM has been taught to understand that trading one of them for a LDA# is faster, even if it's outside the loop. Getting rid of the CPX# inside the loop should start to pay off for any non-trivial strings, though, and a string like "Hello, World!" should benefit measurably. Of course, the elephant in the room is the JSR ...
or the slightly faster but slightly trickier (and more dangerous in the general case)
I don't have the foggiest clue how to teach a compiler to do what my dusty old brain just did.
Code: Select all
main:
ldx #0
beq .LBB0_2
.LBB0_1:
jsr 65490
inx
.LBB0_2:
lda .Lstr,x
bne .LBB0_1
lda #10
jmp 65490
Code: Select all
main:
ldx #0
lda #72 ; whatever's @ .Lstr+0
.LBB0_1: ; (please don't be '\0')
jsr 65490
inx
lda .Lstr,x
bne .LBB0_1
lda #10
jmp 65490
Got a kilobyte lying fallow in your 65xx's memory map? Sprinkle some VTL02C on it and see how it grows on you!
Mike B. (about me) (learning how to github)
-
mysterymath
- Posts: 79
- Joined: 10 Jan 2021
Re: LLVM 6502 Codegen
The human writing assembly can start with an idea of their loop, then code it out, then say, "hmm, maybe I should try comparing against 13 instead of looking at the load." Then they'd notice, "nah, that's not actually any faster" in one case, where in another they'd be like "whoa, that's a bit faster." In the first case, they'd go back and undo their mistake.
Compilers essentially never go back. Like a human, a compiler starts out with "a general idea about the loop" in the form of compiler IR. Then it gradually makes decisions about it, one aspect at a time. If a decision to be made later would inform a decision made earlier, then that's just too bad! It makes a guess, and if the guess doesn't pay off later (like rewriting the loop test in our example), then it doesn't pay off.
This is, of course, done for performance reasons, which matters a bit more when compiling a binary in the gigabytes than 64 kilobytes. But LLVM's still the best game in town; phase-coupled code generation is a surprisingly young field of research, as it only recently became feasible. Compiler problems get a lot harder when you try to treat them "all at once" like this, the way a human would. But there's been some breakthroughs in solving the kinds of NP-hard problems that humans are usually good at (Mixed Integer Programming, Neural Net methods, etc).
My initial foray into 6502 compiler development was actually along those lines: straight from high-level IR to 6502 assembly. But the problem was still intractably difficult even for modestly-sized routines, and the techniques I was using to deal with it started to look more and more like LLVM...
I don't think this is actually a 6502 problem though; now that I've got some experience with LLVM, I've noticed some similarly questionable asm fragments in test cases for other targets. My goal has similarly shifted to producing a good LLVM backend for the 6502, commensurate with the ones for MIPS, x86, etc. But that's still probably not going to beat any of you at micro-optimization!
-
teamtempest
- Posts: 443
- Joined: 08 Nov 2009
- Location: Minnesota
- Contact:
Re: LLVM 6502 Codegen
Quote:
The newline is also in the code, right after the main loop. There's a LDA of 10 (\n) and another call to JSR.
I'm not sure if the C64 KERNAL treats them the same way, but IIRC the BASIC interpreter's subroutine to output a newline first sends decimal 13, then checks the file number it is currently outputting to. If it's >= 128 (high bit set), then it also sends a decimal 10.
In any case, the original C program has this:
Code: Select all
void main(void) {
printf("HELLO, 6502!\n");
}
Code: Select all
.Lstr:
.asciz "HELLO, 6502!"
To me it seems shorter to store the decimal 10 and output it as part of the loop, but I'm not a compiler writer. Is the real optimization replacing 'printf()' with 'puts()', thereby avoiding dragging in all the unneeded machinery of 'printf()'?
I'm curious what the compiler does with something like this:
Code: Select all
#include <stdio.h>
void main(void) {
char s[] = "HELLO, 6502!\n";
printf(s);
}
-
mysterymath
- Posts: 79
- Joined: 10 Jan 2021
Re: LLVM 6502 Codegen
teamtempest wrote:
I'm not sure if the C64 KERNAL treats them the same way, but IIRC the BASIC interpreter's subroutine to output a newline first sends decimal 13, then checks the file number it is currently outputting to. If it's >= 128 (high bit set), then it also sends a decimal 10.
I did a quick read-through of the standard, and one approach is to do a DOS/Windows-style translation between newline and CR in the target-specific implementations of things like putchar.
Code: Select all
void __putchar(char c) {
if (__builtin_expect(c == '\n', 0)) c = '\r';
__chrout(c);
}
Code: Select all
main:
ldx #1
lda #72
.LBB0_1:
cmp #10
beq .LBB0_4
.LBB0_2:
jsr 65490
lda .Lstr,x
inx
cpx #13
bne .LBB0_1
lda #13
jsr 65490
rts
.LBB0_4:
lda #13
jmp .LBB0_2
The other approach would be to consider '\n' to be equal to '\r' in the "execution character set" for the C64. This gets into needing different execution character sets per target in LLVM, and the last time I did a cursory sweep through, there isn't great support for that. LLVM just calls out to iconv to do character set translation, so I'd probably be looking at writing and packaging iconv plugins for each tweaked character set if I wanted to go that route, or else building a new character set management system into LLVM.
There are real advantages to keeping LLVM ASCII; character sets are absurdly infectious in C; they even affect the preprocessor. You can say "#if 'a' == 65", and the result of that may differ from "if ('a' == 65)" in the code. The rules for this are exceedingly arcane, and basically no C programmers know them, so I'm inclined to avoid that sort of chicanery.
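A minimal demonstration of the two contexts, assuming an ASCII host compiler (on a cross compiler with a non-ASCII execution character set, the two comparisons below are allowed to disagree):

```c
#include <assert.h>

/* The preprocessor evaluates character constants in its own
 * (translation-time) character set... */
#if 'a' == 97
enum { PP_SAW_ASCII = 1 };
#else
enum { PP_SAW_ASCII = 0 };
#endif

/* ...while this expression is evaluated in the execution character
 * set. The C standard makes it implementation-defined whether the
 * two values match. */
static int execution_sees_ascii(void) {
    return 'a' == 97;
}
```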
It looks like someone's working on building a character set management subsystem in LLVM right now, for IBM System Z's EBCDIC (lmao of course).
https://reviews.llvm.org/D88741
teamtempest wrote:
Is the real optimization replacing 'printf()' with 'puts()', thereby avoiding dragging in all the unneeded machinery of 'printf()'?
teamtempest wrote:
I'm curious what the compiler does with something like this:
Code: Select all
#include <stdio.h>
void main(void) {
char s[] = "HELLO, 6502!\n";
printf(s);
}
-
teamtempest
- Posts: 443
- Joined: 08 Nov 2009
- Location: Minnesota
- Contact:
Re: LLVM 6502 Codegen
Thanks for all your hard work! It didn't occur to me that the compiler would put that string on the stack.
Quote:
'puts()' has "append a newline" as part of its contract
Put this way it sounds as if puts() is supposed to output the string it is given and then send a newline. Shouldn't the example output two newlines then? The one in the input string and then one of puts()'s own?
What happens if the input string has two or more trailing newlines? Do they all get stripped off in favor of the single one output by puts()?
- barrym95838
- Posts: 2056
- Joined: 30 Jun 2013
- Location: Sacramento, CA, USA
Re: LLVM 6502 Codegen
I think the key goal of an optimizer is to produce identical results more efficiently. If (and only if) it can replace printf("Something.\n"); with puts("Something."); then it green-lights the transform, for any pattern of "Something.".
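That "if and only if" can be sketched as a predicate. This is a hypothetical simplification of the condition LLVM's library-call simplifier checks (the real pass handles more cases, e.g. printf("%s\n", s) → puts(s)):

```c
#include <assert.h>
#include <string.h>

/* printf(fmt) can be lowered to puts(<fmt minus trailing newline>)
 * when fmt contains no '%' conversions and ends in '\n' -- puts()
 * supplies that final newline itself as part of its contract. */
static int can_lower_printf_to_puts(const char *fmt) {
    size_t n = strlen(fmt);
    if (n == 0 || fmt[n - 1] != '\n')
        return 0;
    return strchr(fmt, '%') == NULL;
}
```

Anything with a conversion, or without the trailing newline, has to stay a printf call (or become fputs/putchar, which LLVM also tries).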
A simple wrapper for $FFD2 should solve the '\n' issue, and can be expanded to deal with other PETSCII idiosyncrasies as well.
Got a kilobyte lying fallow in your 65xx's memory map? Sprinkle some VTL02C on it and see how it grows on you!
Mike B. (about me) (learning how to github)
-
mysterymath
- Posts: 79
- Joined: 10 Jan 2021
Re: LLVM 6502 Codegen
teamtempest wrote:
Put this way it sounds as if puts() is supposed to output the string it is given and then send a newline. Shouldn't the example output two newlines then? The one in the input string and then one of puts()'s own?
What happens if the input string has two or more trailing newlines? Do they all get stripped off in favor of the single one output by puts()?
barrym95838 wrote:
I think the key goal of an optimizer is to produce identical results more efficiently. If (and only if) it can replace printf("Something.\n"); with puts("Something."); then it green-lights the transform, for any pattern of "Something.".
barrym95838 wrote:
A simple wrapper for $FFD2 should solve the '\n' issue, and can be expanded to deal with other PETSCII idiosyncrasies as well.
-
teamtempest
- Posts: 443
- Joined: 08 Nov 2009
- Location: Minnesota
- Contact:
Re: LLVM 6502 Codegen
I see. A very specific optimization, but workable if you know what it's doing.
-
mysterymath
- Posts: 79
- Joined: 10 Jan 2021
Re: LLVM 6502 Codegen
Well, it's been another month since my last update, and optimization is still progressing. There's been some wins, there's been some losses. I've discovered that optimizing a compiler can be a bit like trying to compress a balloon. Still, more wins than losses overall; the balloon is smaller now than it was a month ago.
I'm getting pretty close to the point where it'll make sense to cut a first beta release; there's only a few more big obvious things to implement, then I'll be doing a sweep through the benchmarks to look for anything too horrific that I've missed. I'll almost certainly find a few things, so it's just something I'm thinking about the timing of; no clear plans yet.
Some interesting work stands out (the rest is scattershot tuning):
Setjmp/longjmp and the Calling Convention
The biggest change by far involves setjmp/longjmp. Even though freestanding C implementations don't technically have to provide them, I wanted to make sure that it was possible to implement these before the first release. They're notorious for being really difficult to MacGyver into an implementation that didn't keep them in mind.
Along those lines, I found that there's just no practical way to write them without sharply limiting the number of imaginary registers. Which is what I ended up having to do. The mechanism for setting the number of imaginary registers has been completely removed. The number is now fixed at 32 bytes; 16 pairs of two consecutive bytes each. This is the same exact footprint as SWEET16, which I consider a bit serendipitous. Same for the Commander X16 ABI, if the project ever gets off the ground.
The good news is that there was no observable performance change from limiting the number of imaginary registers in the simulator from 256 to 32. There really just aren't usually more than 32 bytes of data live in a typical function at a given time, and modern compilers are pretty good at packing values into registers.
16 pointer pairs is also not that many more than the fewest I've been able to get the register allocator to accept (IIRC, it was around 10). So it sucks to *require* 32 bytes of the zero page to use the compiler, but it does make the compiler quite a bit easier to develop and change, it fixes the calling convention across targets, makes it feasible to implement libcalls in assembly, and just seems overall to be worth the tradeoff, in my opinion. But if you have really strong opinions about this, please let me know!
Shift/rotate lowering
I always try to show at least some code in these posts, so here it is. Previously, the compiler special-cased shifts and rotates by 1 and by multiples of 8, but any other amount would emit a libcall that uses a loop. Now, we've got static code patterns for all constant shifts and rotates. There may be better ways to do some of these, but they're all better than loops, for sure. As always, note that patterns are high-level: the registers used, the instructions used, and even the instructions' order, are determined dynamically by the needs of the surrounding context. In the examples below, these needs are mostly due to the C calling convention. Here's some examples:
Code: Select all
unsigned lshrby2(unsigned x) { return x >> 2; }
unsigned lshrby7(unsigned x) { return x >> 7; }
int ashrby7(int x) { return x >> 7; }
Code: Select all
lshrby2:
stx mos8(__rc2)
lsr mos8(__rc2)
ror
lsr mos8(__rc2)
ror
ldx mos8(__rc2)
rts
lshrby7:
stx mos8(__rc2)
asl
rol mos8(__rc2)
lda #0
rol
tax
lda mos8(__rc2)
rts
ashrby7:
sta mos8(__rc2)
txa
bpl .LBB2_2
ldx #-1
jmp .LBB2_3
.LBB2_2:
ldx #0
.LBB2_3:
asl mos8(__rc2)
rol
stx mos8(__rc2)
rol mos8(__rc2)
ldx mos8(__rc2)
rts
Divmod libcalls
When you compute a / b and a % b for the same a and b, the compiler will now emit a combined division and modulus library call, rather than two separate library calls. This nearly halves the amount of work necessary, although this case isn't terribly frequent.
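Standard C exposes the same pairing through div() and ldiv(), which return the quotient and remainder from a single call -- essentially what the combined libcall does at the machine level:

```c
#include <assert.h>
#include <stdlib.h>

/* One call yields both results; a 6502 divmod routine similarly
 * leaves the remainder behind as a byproduct of computing the
 * quotient, so asking for both costs little more than asking
 * for one. */
static void divmod(int a, int b, int *q, int *r) {
    div_t d = div(a, b);
    *q = d.quot;
    *r = d.rem;
}
```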
Late optimization pass
The compiler will now try to elide comparisons with zero by looking to see if N and Z were already set by an earlier instruction. This is done rather late, as this type of optimization depends on how the final assembly shook out.
Similarly, immediate loads of X (or Y) can be replaced with INX or DEX if X can be proven to contain the value minus one or plus one. This isn't any faster, but it saves a byte.
Register Hinting
Previously, LLVM didn't really understand that certain registers were more efficient for certain instructions on the 6502. For example, you can ASL in either A or the zero page, but A is quite a lot better. This didn't bite us too much, since we made it try to assign to A first, but there wasn't a full reckoning of the costs of assigning a value to each possible location.
Now there is; for each value, we add up the costs of placing each instruction in each possible location, then pick the cheapest. This is done greedily, one value at a time, and it's not at all perfect, but it seems to work fairly well. It turns out to have made a relatively minor difference in the overall quality of the output, since where a value goes is generally decided more by what operation will need to use it next than by there being a ton of alternatives to try.
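As a toy illustration of the scheme (entirely hypothetical code, not the actual LLVM pass): sum the per-use costs of each candidate location for a value, then greedily take the cheapest.

```c
#include <assert.h>
#include <limits.h>

enum { LOC_A, LOC_X, LOC_ZP, NUM_LOCS };

/* cost[u][loc] = cost of use u if the value lives in loc. A large
 * entry models an operation the location can't perform directly
 * (e.g. ASL on X, which would need a transfer first). */
static int pick_location(int num_uses, const int cost[][NUM_LOCS]) {
    int best = LOC_A;
    int best_total = INT_MAX;
    for (int loc = 0; loc < NUM_LOCS; loc++) {
        int total = 0;
        for (int u = 0; u < num_uses; u++)
            total += cost[u][loc];
        if (total < best_total) {
            best_total = total;
            best = loc;
        }
    }
    return best;
}
```

With two ASL-like uses that are cheap in A, the picker lands on A, matching the intuition that shift-heavy values belong in the accumulator.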