LLVM 6502 Codegen

nollkolltroll · Post by **nollkolltroll** » Fri Feb 19, 2021 8:11 am

This looks very good, hoping to get some time in the not-so-distant future to test it out!

mysterymath · Post by **mysterymath** » Wed Mar 31, 2021 2:03 am

Quick project update:

Still alive! Not much new on the code generation front; instead, we've been focused on hooking up an end-to-end compilation flow.
Standout new work:

Assembler and Linker: The code generator and johnwbyrd's assember/linker have been merged together into one llvm-mos codebase. This allows binaries to be generated end-to-end from C to target native file format.
Full Link Time Optimization (LTO) by Default: This is the moral equivalent of merging the entire project into one C file before compilation, but much faster, since it operates on low-level LLVM-IR bitcode, not high-level C source code. This allows us to ship the platform libraries as one set of bitcode objects: since machine code generation is deferred until link-time, the same set of libraries can be optimized on-the-fly based on how they're used and how many zero-page locations are available for a given target. This also allows whole-program inlining, code stripping, specialization.
SDK: A minimal set of platform libraries and linker scripts is under development. At present, only one memory configuration for the C64 (KERNAL, no BASIC, PRG file with BASIC header) is supported, but there's nothing that we know of precluding other platforms from being developed and included. The initial release will probably include a minimal C runtime only for a handful of stand-out platforms; a full libc will be brought in sometime further down the line.
Cleanup/Refactoring: Now that the project isn't just me, a lot of work has been put towards making the code generator cleaner, more understandable, and better documented.
Worse is Better: Much of the optimizations to get really nice code for demonstration purposes has been removed. The code generation isn't terrible now, but it's considerably worse than it was before. We're focusing on reaching C99 compatibility at this point; to get there, we'll want the implementation to be as simple as we can manage, since there's a ton of ground to cover. Once we're a reasonably standards-compliant compiler, we'll add it all back in, but this time, with proper benchmarking on real test suites, not toy examples.

Here's a working example of how the compiler can be invoked. Note that, thanks to LTO, the __chrout routine included with the SDK is fully inlined into its use; the JSR into the KERNAL is emitted immediately without first going through the platform library routine. This is similar to what would occur if __chrout were in the same C file as hello_world, but they're actually linked together from separate files.

Code: Select all

$ cd ..
$ clang-mos --config build/commodore/64.cfg -o hello.prg examples/hello_world.c

$ cat examples/hello_world.c
#include <chrout.h>

int main(void) {
  const char *cur = "HELLO, WORLD!\n";
  while (*cur)
    __chrout(*cur++);
  return 0;
}

$ hexdump -C hello.prg
00000000  01 08 0b 08 5d 1e 9e 32  30 36 31 00 00 00 4c 10  |....]..2061...L.|
00000010  08 a2 00 a9 48 20 d2 ff  bd 25 08 e8 e0 0e d0 f5  |....H ...%......|
00000020  a9 00 a2 00 60 48 45 4c  4c 4f 2c 20 57 4f 52 4c  |....`HELLO, WORL|
00000030  44 21 0a 00                                       |D!..|
00000034

$ clang-mos --config build/commodore/64.cfg -o hello.s -Wl,--lto-emit-asm examples/hello_world.c

$ cat hello.s
	.text
	.file	"ld-temp.o"
	.section	.text.main,"ax",@progbits
	.globl	main
	.type	main,@function
main:
	ldx	#0
	lda	#72
LBB0_1:
	;APP
	jsr	65490
	;NO_APP
	lda	.str+1,x
	inx
	cpx	#14
	bne	LBB0_1
LBB0_2:
	lda	#0
	ldx	#0
	rts
.Lfunc_end0:
	.size	main, .Lfunc_end0-main

	.type	.str,@object
	.section	.rodata.str1.1,"aMS",@progbits,1
.str:
	.asciz	"HELLO, WORLD!\n"
	.size	.str, 15

	.addrsig

johnwbyrd · Post by **johnwbyrd** » Sun Apr 04, 2021 9:37 pm

I would point out in mysterymath's example, that clang may be asked to emit gas-compatible assembly code, which in turn is being assembled by LLVM. Intermediate assembly is not required as an intermediate step however.

The approach for code generation in llvm-mos is somewhat radical. mysterymath has taken the approach of representing the 65xx instruction set in terms of functional groups, so that 6502-specific optimizations and code lowering occur fairly early on in the LLVM pass list. While this approach is counterintuitive, it permits LLVM to come up with many surprising and unlikely optimizations.

As you see above, LLVM preinitializes the 6502 accumulator with the ASCII value for "A", because it's the first letter in the string "HELLO, WORLD!" It's not special-case code within LLVM that figures out this particular 6502 optimization; it's just LLVM's optimization framework itself, with a fair representation of the 6502 instruction set, and how to lower to it.

Even though it's still early days, several common assumptions about 6502 compilers, are now refutable by our work in general.

First, the assumption that a modern compiler framework, such as LLVM, cannot be targeted towards an old 8-bit CPU such as the 6502. LLVM's new GlobalISel architecture can very well be targeted to the 6502, and it can indeed produce superior code, if permitted to do so.

Second, the assumption that because the 6502 is "stackless," it is not a good host for C. In truth, the 6502 is an excellent target for C, if the compiler is designed from the ground up to take advantage of the 6502's architecture. By that, I mean that the compiler must be aware of zero page, and it must be able to manage a user-specified chunk of it. The original 6502 designers were well aware of the 6502's register limitations, and so provided a bunch of zero-page addressing modes to compensate. llvm-mos uses an "imaginary register" abstraction to produce performant 6502 code, by permitting the compiler to manage that zero-page memory in a manner compatible with all existing 6502 targets.

Third, that "simpler is better" for producing a performant compiler for 8-bit targets. llvm-mos's architecture and design choices are not at all simple. I haven't counted, but I think llvm-mos is doing about 100 passes through the code, about 8 of which are specific to the MOS 6502.

Fourth, that the 6502 is implicitly some sort of "special" architecture, and it therefore requires special compilers, linkers, binary file formats, etc. We treat the 6502, ultimately, as just another target within the LLVM framework, and as such it benefits from all the industry-standard ELF-compatible file formats.

Fifth, that because the 6502 is a small target, it requires a smaller compiler and smaller tools. This assumption never really made sense anyway. In fact the opposite is true: if you want to do advanced codegen for the 6502, you need a really intelligent (and large) compiler and toolchain framework, not a small one. The state of the art of optimization has advanced leaps and bounds in the past three decades, and the poor old 6502 has received none of those benefits, until the current work.

Sixth, that peephole optimization produces the best codegen for the 6502. In fact, llvm-mos gets the most benefit out of 6502-specific optimizations relatively early in the LLVM machine function pass pipeline, and the code it produces (in small tests) is quite efficient, even without any 6502-specific peephole optimizer at all.

Seventh, that because the 6502 is small, it requires some sort of specialized language (in the David A. Wheeler sense) in order to generate performant code. In fact, source language representation is not a limiting factor. LLVM will likely be able to handle compiling other languages to the 6502, at some point in the future. C works just fine for the 6502, if you don't start with the non-goal of writing a "simple" compiler.

All these "common knowledge" assumptions have been demonstrated to be unsupportable, by counterexample.

BigEd · Post by **BigEd** » Mon Apr 05, 2021 7:39 am

Very interesting indeed! Many thanks for the notes. (And, of course, for all the efforts which lie behind them.)

mysterymath · Post by **mysterymath** » Thu Apr 29, 2021 7:45 pm

Update:

I've done a ton of work over the last month, and I have absolutely nothing to show for it!

Well, except for this 318K (text) compiler intermediate file (attached).

Toy examples are all well and good, but I'll never be able to call this a compiler unless it has some degree of standards conformance.

The best means at my disposal to measure standards conformance is LLVM's enormous end-to-end test suite. However, even running the simplest of those tests requires a working printf() implementation, since that's how those tests output their results for evaluation.

So, to begin working towards feature completeness and standards compliance, the first step is to get a working printf. I chose the single-file printf https://github.com/mpaland/printf, since it can be easily hooked up to a char-out hook provided by any of a number of existing 6502 simulators.

I've discovered that going from toy to printf is quite a journey; when you break it all the way down to 8 bits, a full printf implementation (sans float support) covers a pretty wide swath of integer bitwise, arithmetic, and comparison instructions.

LLVM operates in passes. When I started, the very first pass that we owned choked on printf. After gradually massaging and generalizing the code generator, I got it through that pass, and the next, and the next.

Getting the code through the register allocator was particularly tricky business, since unlike my toy examples, hundreds and hundreds of live values fly around inside printf, each needing to be in A, X, or Y, the zero page, or the stack at various times, in an intricate dance. Large function calls threw a wrench into that dance; the calling sequence I naively emitted pinned the A, X, and Y registers right at the beginning of the call, so there were no free registers available to load zero-page registers with the rest of the arguments. I had to teach the instruction scheduler that A, X, and Y are precious resources, and that it needed to work very diligently to keep them free around calls.

Along the way I've had to add support for generalized multi-byte signed and unsigned comparisons, and, or, xor, certain 1-bit rotations, certain 1-bit shifts, the V register, using SBC as a comparison operator (for V).

So, I'm now through all the way to stack access emission, which is more than 2/3rds of the way through the pipeline. The generated code so far is also pretty terrible; I'm leaving hundreds of easy optimization opportunities on the floor to try to race to completeness as quickly as possible. Even so, there's only maybe a 20% chance that printf actually works at the end of this process; after this, there will likely be a truly grueling debugging phase to get the reams of generated 6502 code to actually work. But, once it does, the project will have gone from "toy compiler" to "unfinished compiler." Even better, we'll be able to quantify just how unfinished it is, then track progress objectively towards "terrible finished compiler".

BigEd · Post by **BigEd** » Thu Apr 29, 2021 8:04 pm

printf is a beast! I know this because of how much smaller things are when you can do without it.

Well done - great progress!

aleferri · Post by **aleferri** » Thu May 06, 2021 12:30 am

mysterymath, are you using cc65's routines for signed comparisons, 32 bit arithmetic, etc?
Or are you developing them independently?

EDIT: spelling

mysterymath · Post by **mysterymath** » Thu May 06, 2021 5:34 am

aleferri wrote:

mysterymath, are you using cc65's routines for signed comparisons, 32 bit arithmetic, etc?
Or are you developing them independently?

Neither, sort of.

At the moment, the compiler only generates calls to a couple of routines. For now, just integer division/modulus, memset, and memcpy. The attached SDK project will come with implementations of these or users are free to provide their own; they each use the normal C calling convention established by the compiler. At the moment, these aren't implemented at all, and the linker will fail if the compiler emits calls to them.

As for how I'm actually doing things like integer comparison and what not:
LLVM's "legalization" phase lowers a given 32-bit operation to a collection of 8-bit operations that accomplish the same thing.

Once we have a whole function lowered down to a sequence of 8-bit operations, various combine and optimization passes happen to take a look at what's actually used, and how. For example, if you do a 32-bit addition, then use a 32-bit AND and 32-bit >> to extract the high 8 bits into, say, a char, the optimizer will realize that the other 24 bits are totally dead, and the vast majority of the instructions will be marked dead and removed. This turns the whole sequence into an 8-bit addition and an 8-bit AND; the shift is removed entirely. This sort of collapse happens more than you might think.

Even if large operations remain, there are still some advantages to expanding things out. Consider a 32-bit add coded as 4 LDA, ADC, STA triples. This requires only the A register, along with wherever the inputs and outputs are already expected to be. Doing this with a pseudo, you still may need 4 LDA STA pairs just to shuffle the data into whatever fixed location the pseudo expects them to be in, then another 4 to shuffle them from wherever the pseudo places is output to somewhere else (so you can use them for another call to the addition pseudo, etc.)

There's a few things in we can do if we still run into trouble. There's a "machine outliner" that can automatically factor instructions into ad-hoc subroutines, sort of like a JSR-based dictionary cruncher. We can also fall back to cc65-style pseudos in the worst case. I'm intentionally punting on as many of these sorts of decisions as possible at this point; the implementation doesn't preclude doing any of them, and having real examples of working code to look at should help.

BigEd · Post by **BigEd** » Thu May 06, 2021 7:35 am

One interesting case is wide comparisons, because it's possible to start at the high byte and short-circuits when the result is in no doubt. In that way, a comparison is unlike a subtraction. It's probably demanding a lot to do this very well, but how does this come out presently?

BillG · Post by **BillG** » Thu May 06, 2021 8:34 am

mysterymath wrote:

For example, if you do a 32-bit addition, then use a 32-bit AND and 32-bit >> to extract the high 8 bits into, say, a char, the optimizer will realize that the other 24 bits are totally dead, and the vast majority of the instructions will be marked dead and removed. This turns the whole sequence into an 8-bit addition and an 8-bit AND; the shift is removed entirely. This sort of collapse happens more than you might think.

How do you know that adding the lower 24 bits did not carry into the upper 8?

mysterymath · Post by **mysterymath** » Thu May 06, 2021 7:35 pm

BigEd wrote:

One interesting case is wide comparisons, because it's possible to start at the high byte and short-circuits when the result is in no doubt. In that way, a comparison is unlike a subtraction. It's probably demanding a lot to do this very well, but how does this come out presently?

Ah, this is interesting! The default LLVM lowering for comparisons does just as you say: compare the high byte, then the next, then the next. However, it attempts to do so using a SELECT pseudoinstruction to avoid needing to do actual branches (and stalling the pipeline). Based on whether the high byte is equal, it selects either the comparison result for all but the highest byte or the highest byte's comparison result. The lower comparison result is recursively expanded the same way.

The natural way to lower a SELECT pseudo is via a branch between two copies, one for true and one for false, both to the same destination (the "diamond control flow pattern"). The normal short-circuiting way of implementing CMP falls out of this completely automatically: all the selects are emitted as control flow, the generation of the select's arguments are sunk down into the conditional basic blocks, and the copies are elided. The magic of a multipass compiler architecture!

Legalization in LLVM (at least the new way we're doing it) is function global, so we'll also be able to do a really neat optimization eventually. If we detect that there's an existing subtraction with the same arguments as the comparison, we can merge the two together and emit a combined "subtract and make a note of the C, V and N flags" sequence.

EDIT: I actually tried this, and the above doesn't actually happen yet. There's no sinking phase after a select is expanded, so the control flow ends up different than I was thinking, and LLVM becomes unable to untangle it later. It should be relatively straightforward for us to sink the arguments of a select when we're building the control flow; just another thing on the optimization TODO list.

BillG wrote:

How do you know that adding the lower 24 bits did not carry into the upper 8?

This is what happens when you write off the cuff examples at 11PM. You're right; this would only be safe if you're extracting the low 8 bits. In that case, there's no shift, just a truncate. In the original case, the compiler would still need the full 32-bit addition, but it wouldn't need to save the outputs anywhere, since only the carry would be live. Thus it'd emit something like CLC LDA ADC LDA ADC LDA ADC LDA ADC STA.

BillG · Post by **BillG** » Fri May 07, 2021 1:07 am

I had to read it several times to be sure you did not mean a 32-bit OR followed by the and and shift. The lower 24 bits of that can be optimized out.

mysterymath · Post by **mysterymath** » Thu May 13, 2021 5:07 pm

Alright, after implementing simple routines for integer division, multiplication, and modulus and a simple memset, and fixing a half dozen more compiler bugs, a hello_world.c using printf can now make it all the way through the compiler.

It worked on the first try!

Trying printf("HELLO, %s\n", "WORLD") produced "HELLO, <heart>", of course. So now we begin the process of getting printf to actually work.

The whole thing is just under 11KB of terrible, terrible code. I particularly like how it does multibyte equality comparisons by EORing the all the bytes together. Oh and how it uses an extremely long sequence of branches, immediate loads, and an EOR to XOR N and V together for the signed comparison result. Reeeeeeal nice.

It's very difficult to resist the temptation to go in and fix all of this right now, but there's bigger fish to fry.

I've attached an assembly-language version of the output, in case anyone is sufficiently masochistic to wade through it.

pzembrod · Post by **pzembrod** » Fri May 14, 2021 10:05 pm

That's some serious generated assembly code indeed, wow!
Are the sources of hello.c and printf.c viewable somewhere from which it was generated?

mysterymath · Post by **mysterymath** » Sat May 15, 2021 1:01 am

pzembrod wrote:

Are the sources of hello.c and printf.c viewable somewhere from which it was generated?

Yep, everything's in our SDK: https://github.com/llvm-mos/llvm-mos-sdk

The printf code and libcalls are under common/lib/.
You can see the hello_world.c I'm using in the left hand pane of the screenshot, it's just "printf("HELLO, WORLD");".

LLVM 6502 Codegen

Re: LLVM 6502 Codegen

Re: LLVM 6502 Codegen

Re: LLVM 6502 Codegen

Re: LLVM 6502 Codegen

Re: LLVM 6502 Codegen

Re: LLVM 6502 Codegen

Re: LLVM 6502 Codegen

Re: LLVM 6502 Codegen

Re: LLVM 6502 Codegen

Re: LLVM 6502 Codegen

Re: LLVM 6502 Codegen

Re: LLVM 6502 Codegen

Re: LLVM 6502 Codegen

Re: LLVM 6502 Codegen

Re: LLVM 6502 Codegen