LLVM 6502 Codegen

BigEd · Post by **BigEd** » Fri Oct 15, 2021 9:01 am

An interesting puzzle. It feels to me like the tail of the routine should look like

  dec mos8(__rc3)
  dec mos8(__rc2)
  dex
  clc
  adc #255
  rts

and we should branch into the appropriate part depending on the trailing zeros we find.

Code: Select all

dec_i32:
  cmp #0
  bne dec0
  cpx #0
  bne dec1
  cmp mos8(__rc2)
  bne dec2
  dec mos8(__rc3)
dec2:
  dec mos8(__rc2)
dec1:
  dex
dec0:
  clc
  adc #255
  rts

Of course I haven't tested this... and it feels like DEC A should be easier than that.

I'm a bit surprised that a 32 bit parameter isn't passed in a soft stack (or a zero page register) - is breaking it into 3 pieces turning out generally to be a good tactic?

barrym95838 · Post by **barrym95838** » Fri Oct 15, 2021 3:44 pm

I think that keeping the low 16 bits of "TOS" in X:A with Y as a helper should be a win in speed, but the resultant code generated probably looks a bit less straightforward to a typical human observer, and it's probably not more space-efficient than the strategy of keeping all of TOS in RAM, due to register spilling and filling. A clever compiler can minimize the spilling and filling based on the surrounding context, for a net win there as well in many cases.

mysterymath · Post by **mysterymath** » Fri Oct 15, 2021 5:49 pm

From the POV of our backend, A, X, Y, and a region of the ZP are all 8-bit processor registers, so it's not that we're treating A and X as a cache for the stack, but that we're using a modern register based calling convention. It's fairly typical for wide arguments to be split across registers in this fashion in RISC calling conventions

There's still a question of whether or not A and X should be used to pass arguments. The downside I've seen is that it's rare in functions with a lot of arguments that the first bytes are the ones you want in A and X, so they're immediately copied to other registers. The upside is that for functions with small amounts of arguments, it's rather likely that you want them in A and X, since it's the most interesting thing to operate on.

There's a lot of variants we could try; just X, X and Y, just A, etc. Once the compiler has better output, I'm planning to whip up a bunch of different varieties and benchmark them all.

BigEd · Post by **BigEd** » Fri Oct 15, 2021 6:35 pm

Sounds great!

mysterymath · Post by **mysterymath** » Fri Oct 15, 2021 6:46 pm

Alas, these results appear to have been quite fragile. Our nightly verification run caught an issue with G_SELECT lowering, and fixing it slightly changes the order in which selects are lowered, which turns the decrement sequence to mush.

The gist of the problem is this: on the 6502, a huge number of operation types can only be efficiently computed using a sequence of branches. We need to introduce these branches before scheduling instructions, since branching produces control flow regions that don't share register pressure; for example, A can have different values on each side of the branch.

However, the branches *themselves* also need to be scheduled, since LLVM isn't capable of moving instructions across a branch! So what's happening is that we're introducing branches really early in the pipeline on essentially unordered generic machine code. LLVM can't move instructions across these cuts, so whatever randomly happened to be on one side stays on one side. That means that any values that are live across the branch stay live across the branch, even if moving an instruction across could shorten the live range to zero. This means that there are a lot more values that the register allocator has to keep live at any given point in the program, which produces a bogus overall compile.

So, we need to run a scheduling pass *before* we introduce branches to make sure that instructions are on the right side of them. Since we can nest branches in branches, we also have to run this scheduling after branches are lowered and instructions are sunk into each side. So the scheduling needs to be a part of the G_SELECT lowering, interleaved with it.

This'll be a fairly major project to fix, but it's becoming apparent that G_SELECT lowering is the single most important part of the backend. The answer to how do I XX on the 6502 apparently always involves a tricky sequence of branches of one form or another.

barrym95838 · Post by **barrym95838** » Fri Oct 15, 2021 7:23 pm

Yeah, the fastest 6502 code seems to be more branch-heavy than the more sophisticated processors because 6502 branches are so efficient. If you want to introduce a branch-less template for this task, maybe you could just teach it:

Code: Select all

dec_i32:
  sec
  sbc #1          ; (or any unsigned 8-bit int)
  tay
  txa
  sbc #0
  tax
  lda mos8(__rc2)
  sbc #0
  sta mos8(__rc2)
  lda mos8(__rc3)
  sbc #0
  sta mos8(__rc3)
  tya
  rts

and get back to the challenge (of speeding it up) later.

mysterymath · Post by **mysterymath** » Fri Oct 15, 2021 8:59 pm

That's actually the sequence it was using before the latest work; but intrinsic in using the INC and DEC instructions for >1 byte is a facility with branching.

But the compiler can get worse before it gets better; that's the biggest reason why I'm doing all this work before a v1 release, since at that point it'll actually be a problem if performance regresses.
Better to speak INC and DEC poorly than not at all; at least that way the code is continuously being exercised and not bit-rotting.

BigEd · Post by **BigEd** » Fri Oct 15, 2021 9:28 pm

I like that Mike! Addition or subtraction with a short constant simpler than increment or decrement.

barrym95838 · Post by **barrym95838** » Fri Oct 15, 2021 9:59 pm

Simpler and more versatile, but certainly not the best on clock ticks in the general case.

BigEd · Post by **BigEd** » Sat Oct 16, 2021 8:55 am

Is it the case that LLVM needs to discover (or re-invent) sequences like these, or is it seeded with handy routines that it can later rejig according to surrounding code?

mysterymath · Post by **mysterymath** » Sat Oct 16, 2021 8:17 pm

The answer is somewhere in between. Our backend doesn't really have anything resembling real 6502 assembly until really really late in the compilation process, one or two passes before the final machine code is generated. All of the work of the compiler occurs on internal representations that very very gradually come to resemble 6502 assembly. So we can teach LLVM generic patterns, but it's difficult to teach it flat instruction sequences.

WARNING: A way-too-detailed look at what LLVM is doing follows.

The code that LLVM sees starts nearly as vague and high level as C; each LLVM pass either makes the sequence faster or makes the sequence more specific.
The advantage to doing things this way is that you avoid needing to go back and correct poor decisions later; instead the compiler attempts to avoid making decisions at all until a point where those decisions are relatively clear. This can't be done perfectly, but there's also hard limits to how many bad decisions you can clean up at the end.

For example, the DEC function under discussion starts out as:

Code: Select all

define i32 @dec_i32(i32 %a) {
entry:
  %0 = sub i32 %a, 1
  ret i32 %0
}

The calling convention is implemented:

Code: Select all

    liveins: $a, $x, $rc2, $rc3
  
    %1:_(s8) = COPY $a
    %2:_(s8) = COPY $x
    %3:_(s8) = COPY $rc2
    %4:_(s8) = COPY $rc3
    %0:_(s32) = G_MERGE_VALUES %1(s8), %2(s8), %3(s8), %4(s8)
    %5:_(s32) = G_CONSTANT i32 1
    %6:_(s32) = G_SUB %0, %5
    %7:_(s8), %8:_(s8), %9:_(s8), %10:_(s8) = G_UNMERGE_VALUES %6(s32)
    $a = COPY %7(s8)
    $x = COPY %8(s8)
    $rc2 = COPY %9(s8)
    $rc3 = COPY %10(s8)
    RTS implicit $a, implicit $x, implicit $rc2, implicit $rc3

Now, there are operations in here you can't natively do on the 6502: a 32-bit decrement. These are replaced with sequences that can plausibly be performed on the 6502 (legalization):

Code: Select all

  bb.1.entry:
    liveins: $a, $x, $rc2, $rc3
  
    %1:_(s8) = COPY $a
    %2:_(s8) = COPY $x
    %3:_(s8) = COPY $rc2
    %4:_(s8) = COPY $rc3
    %16:_(s8) = G_CONSTANT i8 1
    %17:_(s8) = G_SUB %1, %16
    %18:_(s8) = G_CONSTANT i8 0
    %64:_(s1) = G_CONSTANT i1 true
    %75:_(s8), %76:_(s1), %77:_, %78:_, %79:_ = G_SBC %1, %18, %64
    %19:_(s1) = COPY %79(s1)
    %45:_(s8) = G_SUB %2, %16
    %70:_(s8), %71:_(s1), %72:_, %73:_, %74:_ = G_SBC %2, %18, %64
    %46:_(s1) = COPY %74(s1)
    %60:_(s8) = G_SUB %3, %16
    %65:_(s8), %66:_(s1), %67:_, %68:_, %69:_ = G_SBC %3, %18, %64
    %61:_(s1) = COPY %69(s1)
    %62:_(s8) = G_SUB %4, %16
    %63:_(s8) = G_SELECT %61(s1), %62, %4
    %56:_(s8) = G_SELECT %46(s1), %60, %3
    %57:_(s8) = G_SELECT %46(s1), %63, %4
    %37:_(s8) = G_SELECT %19(s1), %45, %2
    %38:_(s8) = G_SELECT %19(s1), %56, %3
    %39:_(s8) = G_SELECT %19(s1), %57, %4
    $a = COPY %17(s8)
    $x = COPY %37(s8)
    $rc2 = COPY %38(s8)
    $rc3 = COPY %39(s8)
    RTS implicit $a, implicit $x, implicit $rc2, implicit $rc3

This is the pattern that I mentioned for decrementing: always decrement the low byte, and decrement the high bytes if and only if the low byte was previously zero.

G_SELECT is like a C ternary: G_SELECT %61, %62, %4 = %62 if %61, otherwise %4.
G_SBC is a generalized SBC operation: it provides an output and all the possible flags, and takes two arguments and a carry in. Depending on what's used, this covers the behavior of the various comparison and subtraction instructions.

Note that at this phase, we haven't picked instructions, what registers things are going to go into, anything like that at all. Just the basic plan of how it's going to decrement exists.

From there, we lower the G_SELECT instructions to (still totally generic) branches.
EDIT: Surprisingly, the lowering below looks alright to me. I'm either missing something, or the problem actually lies elsewhere...

Code: Select all

  bb.1.entry:
    successors: %bb.2(0x40000000), %bb.3(0x40000000)
    liveins: $a, $x, $rc2, $rc3
  
    %1:_(s8) = COPY $a
    %2:_(s8) = COPY $x
    %3:_(s8) = COPY $rc2
    %4:_(s8) = COPY $rc3
    %16:_(s8) = G_CONSTANT i8 1
    %17:_(s8) = G_SUB %1, %16
    %18:_(s8) = G_CONSTANT i8 0
    %64:_(s1) = G_CONSTANT i1 true
    %75:_(s8), %76:_(s1), %77:_, %78:_, %79:_ = G_SBC %1, %18, %64
    G_BRCOND_IMM %79(s1), %bb.2, 1
    G_BR %bb.3
  
  bb.2.entry:
    successors: %bb.5(0x40000000), %bb.6(0x40000000)
  
    %45:_(s8) = G_SUB %2, %16
    %70:_(s8), %71:_(s1), %72:_, %73:_, %74:_ = G_SBC %2, %18, %64
    G_BRCOND_IMM %74(s1), %bb.5, 1
    G_BR %bb.6
  
  bb.5.entry:
    successors: %bb.8(0x40000000), %bb.9(0x40000000)
  
    %60:_(s8) = G_SUB %3, %16
    %65:_(s8), %66:_(s1), %67:_, %68:_, %69:_ = G_SBC %3, %18, %64
    G_BRCOND_IMM %69(s1), %bb.8, 1
    G_BR %bb.9
  
  bb.8.entry:
    successors: %bb.10(0x80000000)
  
    %62:_(s8) = G_SUB %4, %16
    G_BR %bb.10
  
  bb.9.entry:
    successors: %bb.10(0x80000000)
  
    G_BR %bb.10
  
  bb.10.entry:
    successors: %bb.7(0x80000000)
  
    %63:_(s8) = G_PHI %62(s8), %bb.8, %4(s8), %bb.9
    G_BR %bb.7
  
  bb.6.entry:
    successors: %bb.7(0x80000000)
  
    G_BR %bb.7
  
  bb.7.entry:
    successors: %bb.4(0x80000000)
  
    %57:_(s8) = G_PHI %63(s8), %bb.10, %4(s8), %bb.6
    %56:_(s8) = G_PHI %60(s8), %bb.10, %3(s8), %bb.6
    G_BR %bb.4
  
  bb.3.entry:
    successors: %bb.4(0x80000000)
  
    G_BR %bb.4
  
  bb.4.entry:
    %39:_(s8) = G_PHI %57(s8), %bb.7, %4(s8), %bb.3
    %38:_(s8) = G_PHI %56(s8), %bb.7, %3(s8), %bb.3
    %37:_(s8) = G_PHI %45(s8), %bb.7, %2(s8), %bb.3
    $a = COPY %17(s8)
    $x = COPY %37(s8)
    $rc2 = COPY %38(s8)
    $rc3 = COPY %39(s8)
    RTS implicit $a, implicit $x, implicit $rc2, implicit $rc3

From this, we select idealized versions of the 6502 instructions to use. These still aren't actual 6502 opcodes, but they capture the kinds of constraints that exist on the 6502 about where you can put things: for example, you can only natively compare when the LHS is A, X, or Y, but not an imaginary register, and the rhs has to either be an immediate or an imaginary register.

Code: Select all

bb.1.entry:
    successors: %bb.2(0x40000000), %bb.3(0x40000000)
    liveins: $a, $x, $rc2, $rc3
  
    %1:gpr = COPY $a
    %2:gpr = COPY $x
    %3:gpr = COPY $rc2
    %4:gprimag8 = COPY $rc3
    %17:gprimag8 = DEC %1
    %89:cc = CMPImmTerm %1, 0, implicit-def $nz
    BR %bb.2, $z, 1
    JMP %bb.3
  
  bb.2.entry:
    successors: %bb.5(0x40000000), %bb.6(0x40000000)
  
    %45:gprimag8 = DEC %2
    %88:cc = CMPImmTerm %2, 0, implicit-def $nz
    BR %bb.5, $z, 1
    JMP %bb.6
  
  bb.5.entry:
    successors: %bb.8(0x40000000), %bb.9(0x40000000)
  
    %60:gprimag8 = DEC %3
    %87:cc = CMPImmTerm %3, 0, implicit-def $nz
    BR %bb.8, $z, 1
    JMP %bb.9
  
  bb.8.entry:
    successors: %bb.10(0x80000000)
  
    %62:gprimag8 = DEC %4
    JMP %bb.10
  
  bb.9.entry:
    successors: %bb.10(0x80000000)
  
    JMP %bb.10
  
  bb.10.entry:
    successors: %bb.7(0x80000000)
  
    %63:anyi8 = PHI %62, %bb.8, %4, %bb.9
    JMP %bb.7
  
  bb.6.entry:
    successors: %bb.7(0x80000000)
  
    JMP %bb.7
  
  bb.7.entry:
    successors: %bb.4(0x80000000)
  
    %57:anyi8 = PHI %63, %bb.10, %4, %bb.6
    %56:anyi8 = PHI %60, %bb.10, %3, %bb.6
    JMP %bb.4
  
  bb.3.entry:
    successors: %bb.4(0x80000000)
  
    JMP %bb.4
  
  bb.4.entry:
    %39:anyi8 = PHI %57, %bb.7, %4, %bb.3
    %38:anyi8 = PHI %56, %bb.7, %3, %bb.3
    %37:anyi8 = PHI %45, %bb.7, %2, %bb.3
    $a = COPY %17
    $x = COPY %37
    $rc2 = COPY %38
    $rc3 = COPY %39
    RTS implicit $a, implicit $x, implicit $rc2, implicit $rc3

Note that at this stage, LLVM still hasn't decided where to put anything! DEC instructions are generalized: they can operate on A, X, Y, or any ZP register. The A variant uses CLC ADC; and will later use DEC A on the 65C02.

And then, much much later, after a ton of additional manipulations, LLVM allocates registers and decides where to put everything:

Code: Select all

  bb.0.entry:
    successors: %bb.1(0x40000000), %bb.6(0x40000000)
    liveins: $a, $x, $rc2, $rc3
  
    renamable $y = COPY $x
    renamable $x = COPY renamable $a
    renamable $x = DEC killed renamable $x
    renamable $rc4 = COPY killed renamable $x
    dead renamable $c = CMPImmTerm killed renamable $a, 0, implicit-def dead $nz, implicit-def $z
    BR %bb.1, killed $z, 1
    JMP %bb.6
  
  bb.1.entry:
    successors: %bb.2(0x40000000), %bb.5(0x40000000)
    liveins: $y, $rc2, $rc3, $rc4
  
    renamable $x = COPY renamable $y
    renamable $x = DEC killed renamable $x
    dead renamable $c = CMPImmTerm killed renamable $y, 0, implicit-def dead $nz, implicit-def $z
    BR %bb.2, killed $z, 1
    JMP %bb.5
  
  bb.2.entry:
    successors: %bb.3(0x40000000), %bb.4(0x40000000)
    liveins: $x, $rc2, $rc3, $rc4
  
    renamable $y = COPY killed renamable $rc2
    renamable $rc2 = COPY renamable $y
    renamable $rc2 = DEC killed renamable $rc2
    renamable $a = COPY killed renamable $rc4
    dead renamable $c = CMPImmTerm killed renamable $y, 0, implicit-def dead $nz, implicit-def $z
    BR %bb.3, killed $z, 1
    JMP %bb.4
  
  bb.3.entry:
    successors: %bb.7(0x80000000)
    liveins: $a, $x, $rc2, $rc3
  
    renamable $rc3 = DEC killed renamable $rc3
    JMP %bb.7
  
  bb.4.entry:
    successors: %bb.7(0x80000000)
    liveins: $a, $x, $rc2, $rc3
  
    JMP %bb.7
  
  bb.5.entry:
    successors: %bb.7(0x80000000)
    liveins: $x, $rc2, $rc3, $rc4
  
    renamable $a = COPY killed renamable $rc4
    JMP %bb.7
  
  bb.6.entry:
    successors: %bb.7(0x80000000)
    liveins: $y, $rc2, $rc3, $rc4
  
    renamable $x = COPY killed renamable $y
    renamable $a = COPY killed renamable $rc4
    JMP %bb.7
  
  bb.7.entry:
    liveins: $a, $x, $rc2, $rc3
  
    RTS implicit $a, implicit $x, implicit $rc2, implicit $rc3

BigEd · Post by **BigEd** » Sat Oct 16, 2021 8:38 pm

Thanks for the explanation! Of course I only skimmed it, but it'll be good to come back to.

It's really encouraging that LLVM can tackle this problem even reasonably well. I'm very grateful that you're doing the work to make it real!

mysterymath · Post by **mysterymath** » Sat Oct 16, 2021 9:54 pm

It looks like the bad compile was actually caused by the register allocator; needing to keep the original operands alive was quite a challenge.

But there's an easy way to reduce the pressure on the register allocator: compare the result of the decrement with 255 rather than comparing the argument with zero. This isn't optimal, since it prevents the comparisons from ever being folded away. Immediate compares are pretty quick and pretty small though, and the spill code required to keep the arguments alive can apparently be more than that in the worst case.

With that, the sequence is now:

Code: Select all

dec_i32:
  ldy mos8(__rc2)
  clc
  adc #255
  cmp #255
  bne .LBB6_4
  dex
  cpx #255
  bne .LBB6_4
  dey
  cpy #255
  bne .LBB6_4
  dec mos8(__rc3)
.LBB6_4:
  sty mos8(__rc2)
  rts

mysterymath · Post by **mysterymath** » Wed Nov 10, 2021 5:28 am

Optimization work is progressing steadily, but slowly.

As always, "low-hanging fruit" turns out to take 10 times as long as I've estimated.

Still, there's some new stuff!

Standard library call optimizations
LLVM has now been told that our standard library function behave as C standard library functions should. This allows it to, for example, rewrite "printf("Hello, world!")" to "puts("Hello, world!")". This is actually a pretty major optimization; it's common C programmer practice to printf things they should puts (especially while debugging), and LLVM is already capable of inlining the puts.

Better loop optimization
I've mostly taught LLVM about the 6502's addressing modes, which required significantly extending a key LLVM optimization pass. LLVM will now attempt to narrow variables that vary with the loop ("induction variables") to 8-bits wherever possible, zero-extending them to wider types if necessary. Zero-extending is "free" on the 6502, so that's a good tradeoff. This code is deeply immature and imperfect though; it's clear I'm going to have to keep hammering on it throughout the life of the project.

Switch Jump Tables
LLVM now lowers switch statements to jump tables where appropriate. Depending on how the cases are laid out, it may do a binary search to narrow the cases, then several jump tables for cases that are clustered together.

The jump tables are separated into two arrays in classic 6502 fashion: first all the low bytes, then all the high bytes. The size of each array is limited to 256, guaranteeing lookup using the absolute indexed addressing mode.

We're not using the RTS trick at the moment; it's a speed/size tradeoff (4 bytes smaller, 1 byte slower), and it's slightly more complicated to code in LLVM. Instead, we indirect JMP through a zero-page pointer, which also handily works around the NMOS indirect jump bug.

Computed Goto
On the way to the above, we've also enabled GCC's computed goto extension. You can take the address of C labels, stash them away as pointers, and call them via a special goto syntax. This turns into an indirect jump, just like switch lowering.

Hello, world!
The first two optimizations bring a return to form for the essential "Hello, world" example:

Code: Select all

#include <stdio.h>

void main(void) {
  printf("HELLO, 6502!\n");
}

Code: Select all

main:
	ldx	#1
	lda	#72
.LBB0_1:
	;APP
	jsr	65490
	;NO_APP
	lda	.Lstr,x
	inx
	cpx	#13
	bne	.LBB0_1
	lda	#10
	;APP
	jsr	65490
	;NO_APP
	rts

.Lstr:
	.asciz	"HELLO, 6502!"

Until next time!

Sean · Post by **Sean** » Wed Nov 10, 2021 6:10 am

Awesome. It sounds like you're making a lot of progress. I am reading this correctly that ".LBB0_1" is the start of loop for the inlined puts implementation and "jsr 65490" is the call to a subroutine that actually outputs the character?

LLVM 6502 Codegen

Re: LLVM 6502 Codegen

Re: LLVM 6502 Codegen

Re: LLVM 6502 Codegen

Re: LLVM 6502 Codegen

Re: LLVM 6502 Codegen

Re: LLVM 6502 Codegen

Re: LLVM 6502 Codegen

Re: LLVM 6502 Codegen

Re: LLVM 6502 Codegen

Re: LLVM 6502 Codegen

Re: LLVM 6502 Codegen

Re: LLVM 6502 Codegen

Re: LLVM 6502 Codegen

Re: LLVM 6502 Codegen

Re: LLVM 6502 Codegen