LLVM 6502 Codegen

mysterymath · Post by **mysterymath** » Sun Jan 10, 2021 11:53 pm

Hi all,

I've been working on generating 6502 assembly from Clang/LLVM. This has been attempted many times by many different folks, so I'm wary of heaping another attempt on the pile, but I think the results I've gotten thus far are promising enough to share.

https://github.com/mysterymath/clang6502

TL;DR example:

Code: Select all

int main(void) {
	const char *cur = "HELLO, WORLD!\n";
	while (*cur) {
		char c = *cur++;
		asm volatile ("JSR\t$FFD2" : "+a"(c));
	}
}

Code: Select all

.code
.global	main
main:
	LDX	#0
	LDA	#72
LBB0__1:
	;APP
	JSR	$FFD2
	;NO_APP
	LDA	_2Estr+1,X
	INX
	CPX	#14
	BNE	LBB0__1
	LDA	#0
	LDX	#0
	RTS

.rodata
_2Estr:
	.byt	72,69,76,76,79,44,32,87,79,82,76,68,33,10,0

I've been in communication with johnwbyrd; at some point, we'll likely merge together this code generator and his ASM backend. Right now, the code generator spits out ca65-compatible assembly, but cannot natively emit machine code.

The compiler is extremely incomplete. Because I've been tackling problems "riskiest first," the code generator does very little beyond what is necessary to get the milestones working. However, each milestone has been selected to showcase a difficult task in 6502 code generation, and I've honed each until LLVM generates roughly as good of assembly as I can. Once the hard problems are mostly squared away, I'll go back and fill out the code generator, which should be mostly mechanical.

Amongst the hard problems I've got at least partial solutions for:

Interplay of hard and soft stacks. Each frame is presently allowed 4 bytes of hard stack, which it will place the most important values in. Hard stack saves/loads are optimized to PHA and PLA in some cases.
Zero page usage. The zero page is treated as a large register file, then lowered to real memory accesses later. Values that aren't live across JSRs can often live their whole life in the zero page, even in recursive functions.
8-bit loop indices detection. The compiler rewrites loop induction variables to fit within 8 bits wherever possible. This allows using indexed addressing modes and INX for what are logically pointer induction variables.

Things that I've got a good idea how they will work, but don't yet:

Whole-program destackification: LLVM already has a pass that takes global variables and moves them to the stack for non-recursive functions; for the 6502, we'd want to reverse the process to lower stack accesses to globals.
Whole-program outlining: Instead of pre-designating certain "meta-operations" as JSR compiler intrinisics, LLVM supports outlining passes that essentially reverse inlining. That way, common code sequences can be automatically detected and code size reduced in a general fashion, without special casing, and in a way sensitive to the execution frequency of affected code paths.

Please let me know what you think!

BillG · Post by **BillG** » Mon Jan 11, 2021 12:24 am

mysterymath wrote:

Code: Select all

int main(void) {
	const char *cur = "HELLO, WORLD!\n";
	while (*cur) {
		char c = *cur++;
		asm volatile ("JSR\t$FFD2" : "+a"(c));
	}
}

Code: Select all

.code
.global	main
main:
	LDX	#0
	LDA	#72
LBB0__1:
	;APP
	JSR	$FFD2
	;NO_APP
	LDA	_2Estr+1,X
	INX
	CPX	#14
	BNE	LBB0__1
	LDA	#0
	LDX	#0
	RTS

.rodata
_2Estr:
	.byt	72,69,76,76,79,44,32,87,79,82,76,68,33,10,0

It does not matter for this example, but the meaning of the generated code does not really match what the C code says. The C code says to output characters from the string until a zero byte is encountered. The generated code just outputs 14 characters, no matter what they are. If you were to have a zero embedded within the string, the generated code blindly outputs it and keeps going.

What does it generate if the string is not known at compile time?

Worse than that, better code can be generated in this case if it did follow the exact meaning of the C code:

Code: Select all

.code
.global	main
main:
	LDX	#0
	LDA	#72
LBB0__1:
	;APP
	JSR	$FFD2
	;NO_APP
	INX
	LDA	_2Estr,X
	BNE	LBB0__1
	LDA	#0
	LDX	#0
	RTS

.rodata
_2Estr:
	.byt	72,69,76,76,79,44,32,87,79,82,76,68,33,10,0

mysterymath · Post by **mysterymath** » Mon Jan 11, 2021 12:54 am

LLVM runs up to 100 iterations of the routine at compile-time, then uses this to determine an upper bound on the number of possible loop iterations. That's where the 14 comes from in the generated code. If there were a zero byte anywhere else in the string, the bound would change, and if the upper bound could not be proven < 256, the compiler would form the loop using 16-bit pointers as you'd expect.

It is a bit unfortunate that the CMP 14 is retained in the output; LLVM "canonicalizes" the induction variable to a comparison against the loop exit count, which allows it to remember the loop exit count for later passes. This is what lets it use LDA _2Estr+1,X: it can prove that X never wraps around.

It would be cleaner to have it take the exit condition back to the original != 0 check, but I'm not sure how to actually make that work. It's unlikely that I'm ever going to get *perfect* code out of LLVM, only better or worse.

teamtempest · Post by **teamtempest** » Mon Jan 11, 2021 4:04 am

Not that I'm a compiler expert, but

Code: Select all

.code
.global   main
main:
   LDX   #0
   BEQ   LBB0__2

LBB0__1:
   ;APP
   JSR   $FFD2
   ;NO_APP
   INX

LBB0__2:
   LDA   _2Estr,X
   BNE   LBB0__1
   TAX
   RTS

.rodata
_2Estr:
   .byt   72,69,76,76,79,44,32,87,79,82,76,68,33,10,0

seems reasonable to me once the compiler proves the X register will not wrap, even if it does know the exact count. That "LDA #72" in particular seems a bit redundant, since 72 also occurs in the actual string.

Incidentally, if the first char happens to be a zero byte, will the compiler skip generating any code at all?

mysterymath · Post by **mysterymath** » Mon Jan 11, 2021 5:50 am

If I understand LLVM correctly, the LDA #72 exists because of a loop rotation transformation which removes the introductory BEQ from your example. This optimization probably isn't quite as important for the 6502, since there's no branch predictor. There seem to be a number of places where LLVM goes out of its way to avoid branching; heck, the whole compiler-rt library is written with absurd branchless versions of div and mul routines to avoid a mispredict and pipeline stall on modern CPUs.

That seems to be the fundamental tradeoff with the compiler so far: LLVM is a giant cannon's worth of compiling power, but it's aimed the wrong way. At present, I'm trying to turn it as hard as I can, but I have to pick my battles, there's too many otherwise.

Accordingly, there's no way for me to say to LLVM: "for this example, I want you to generate this output". The output is the end result of nearly a hundred disparate optimization passes, each with its own list of caveats. For example, the machine-code copy propagation pass absolutely refuses to remove copies unless there's no branch after the copy. Why? Meh, it's probably fine, and determining whether the copy is used after the branch is hard, and not worth it for X86/Arm/AMDGPU/whathaveyou.

On the other hand, yes, if you replace the string with "", the compiler generates:

Code: Select all

.code
.global	main
main:
	LDA	#0
	LDX	#0
	RTS

And I didn't have to do anything special to get it to do that.

So that's the hope: that broader efficiencies in zero page, stack, and memory layout will pay for somewhat poor code generation. It definitely won't pay for cc65 quality code, but the examples I'm working with are already leagues better. Looking at LLVM examples for X86 and ARM and so forth, that seems to be the broader tradeoff LLVM makes. It's codebases is absolutely chock-full of FIXMEs and caveats, and there's dozens of easy scenarios for a human to optimize that just don't quite fit the model. But a compiler doesn't need to beat the human in the short game, only in the long one.

BigEd · Post by **BigEd** » Mon Jan 11, 2021 9:18 am

Welcome, mysterymath! This looks like a great start, and I very much hope you get to the finish line!

Absolute optimality, for me, is not so important as correctness and (near) completeness. Something which is competitive with cc65 would be great.

teamtempest · Post by **teamtempest** » Tue Jan 12, 2021 5:14 pm

Thanks for clarifying that. It sounds like a huge task; I look forward to seeing what you can manage!

mysterymath · Post by **mysterymath** » Wed Jan 20, 2021 9:30 pm

I've been working on tuning the following milestone:

Code: Select all

char next_char();
void report_counts(int counts[256]);

void char_stats() {
	int counts[256] = {0};
	char c;
	while ((c = next_char())) {
		counts[c]++;
	}
	report_counts(counts);
}

Here the "report_counts" function is external to the program, so it could call "char_stats()" recursively. This forces the use of a soft C-stack, as the 512-byte count array cannot fit into the 4 bytes of hard stack I allow each call frame.

In addition to the soft stack, this example also exercises 16-bit shifts (multiply by 2) and instruction scheduling for register pressure (performing each byte of the count increment in a pipelined fashion; i.e., LDA ADC STA LDA ADC STA, not LDA LDX ADC TAY TXA ADC STY STX or something).

Scheduling instructions into little pipelines is particularly interesting: it generates smaller code than the two-byte-at-a-time approach since it lowers register pressure. In this case, both the low and high bytes of the count can live their entire lives in A, since they're never alive at the same time. This decreases code size and cycle count over a more naive version, but still likely takes more code than breaking out 16-bit operations into subroutines would.

Generated code for this example (-O2):

Code: Select all

.code
.global	char__stats
char__stats:
	CLC
	LDA	#254
	ADC	z:__SPhi
	STA	z:__SPhi
	LDA	z:__SPlo
	STA	z:__ZP__0
	LDA	z:__SPhi
	STA	z:__ZP__1
	LDA	#0
	LDX	#0
	LDY	#2
	JSR	memset
	JMP	LBB0__2
LBB0__1:
	ASL	A
	STA	z:__ZP__0
	LDA	#0
	ROL	A
	STA	z:__ZP__1
	LDA	z:__SPlo
	LDX	z:__SPhi
	CLC
	ADC	z:__ZP__0
	STA	z:__ZP__0
	TXA
	ADC	z:__ZP__1
	STA	z:__ZP__1
	LDY	#0
	LDA	(__ZP__0),Y
	CLC
	ADC	#1
	STA	(__ZP__0),Y
	LDY	#1
	LDA	(__ZP__0),Y
	ADC	#0
	STA	(__ZP__0),Y
LBB0__2:
	JSR	next__char
	CMP	#0
	BNE	LBB0__1
	LDA	z:__SPlo
	STA	z:__ZP__0
	LDA	z:__SPhi
	STA	z:__ZP__1
	JSR	report__counts
	CLC
	LDA	#2
	ADC	z:__SPhi
	STA	z:__SPhi
	RTS

.global	__SPhi
.global	__SPlo
.global	__ZP__0
.global	__ZP__1
.global	memset
.global	next__char
.global	report__counts

pzembrod · Post by **pzembrod** » Mon Jan 25, 2021 10:44 pm

Wow, this is great! I'm excited to see an active LLVM 6502 effort.

mysterymath · Post by **mysterymath** » Sat Feb 06, 2021 6:24 pm

The raison d'etre feature, static stacks, is now working fairly well. Here's the same code example as before, with a twist.

char_lib.h

Code: Select all

__attribute__((leaf)) char next_char();
__attribute__((leaf)) void report_counts(int counts[256]);

char_stats.c

Code: Select all

#include "char_lib.h"

void char_stats() {
	int counts[256] = {0};
	char c;
	while ((c = next_char())) {
		counts[c]++;
	}
	report_counts(counts);
}

The only effective difference from before is that the two external function calls are marked with "__attribute__((leaf))".

Without attribute leaf, the compiler must assume that next_char and report_counts can call any external function in the program. This makes char_stats possibly recursive, so counts must be placed on a stack.

But, attribute leaf on an external function tells the compiler that calling such a function cannot in turn call any other function in the caller's translation unit. Such functions are "leaves" from the point of view of the caller. Very nearly every library routine can be so annotated, whether C standard library, OS. This is standard practice in e.g. glibc, where this enables similar whole-program optimizations.

LLVM can then prove char_stats is not recursive, so it can convert the entire stack frame to a single global array, char_stats_sstk. (sstk stands for "static stack", my name for the concept). This completely removes the stack pointer from the generated assembly, and produces something much closer to what I'd write by hand. Eventually, the static stack frames for each function will be merged together by a big interprocedural pass, since any two functions that cannot be alive at the same time can have overlapping static stack frames. That's why I'm calling it "static stack": the eventual array will mirror what the runtime call stack would have looked like, except that its layout is determined at compile time, not at run time.

Attribute leaf is only needed if the compiler can't see the definitions for the functions, if, for example, Link-time Optimization is used, then it would be perfectly apparent to the compiler that char_stats is not recursive. In fact, the whole trio of routines would probably be inlined into one.

The nice thing about attribute leaf is that it doesn't *require* LTO to produce good code; separable compilation can work as well, so long as the boundaries between translation units are appropriately annotated. Functions themselves don't need to be annotated as non-recursive, so there's no burden on the average programmer, so long as they're using LTO. It falls on the maintainers, not the users, of a binary-only or assembly-langage library to add the appropriate leaf annotations.

Generated code:

Code: Select all

.code
.global	char__stats
char__stats:
	LDA	#0
	LDY	#2
	LDX	#<char__stats__sstk
	STX	z:__ZP__0
	LDX	#>char__stats__sstk
	STX	z:__ZP__1
	TAX
	JSR	memset
	JMP	LBB0__2
LBB0__1:
	ASL	A
	STA	z:__ZP__0
	LDA	#0
	ROL	A
	STA	z:__ZP__1
	LDA	#<char__stats__sstk
	LDX	#>char__stats__sstk
	CLC
	ADC	z:__ZP__0
	STA	z:__ZP__0
	TXA
	ADC	z:__ZP__1
	STA	z:__ZP__1
	LDY	#0
	LDA	(__ZP__0),Y
	CLC
	ADC	#1
	STA	(__ZP__0),Y
	LDY	#1
	LDA	(__ZP__0),Y
	ADC	#0
	STA	(__ZP__0),Y
LBB0__2:
	JSR	next__char
	CMP	#0
	BNE	LBB0__1
LBB0__3:
	LDA	#<char__stats__sstk
	STA	z:__ZP__0
	LDA	#>char__stats__sstk
	STA	z:__ZP__1
	JSR	report__counts
	RTS

.bss
char__stats__sstk:
	.res	512

.global	__ZP__0
.global	__ZP__1
.global	memset
.global	next__char
.global	report__counts

BigEd · Post by **BigEd** » Sat Feb 06, 2021 10:31 pm

> convert the entire stack frame to a single global array

I like this! Recursion is useful, but also unusual.

Sheep64 · Post by **Sheep64** » Tue Feb 16, 2021 3:32 pm

mysterymath on Mon 11 Jan 2021 wrote:

if the upper bound could not be proven < 256, the compiler would form the loop using 16-bit pointers as you'd expect.

You should get a job fixing Atmel's compiler because avr-gcc doesn't do that.

dmsc · Post by **dmsc** » Tue Feb 16, 2021 9:45 pm

Hi!

Sheep64 wrote:

mysterymath on Mon 11 Jan 2021 wrote:

if the upper bound could not be proven < 256, the compiler would form the loop using 16-bit pointers as you'd expect.

You should get a job fixing Atmel's compiler because avr-gcc doesn't do that.

In that code, avr-gcc does the exact same transformation, including the loading of #72 at the start of the loop.

Have Fun!

mysterymath · Post by **mysterymath** » Thu Feb 18, 2021 6:41 am

Project Update:

The calling convention is now C99 complete, ignoring any bugs.
Variable length argument lists and C99 variable length arrays both appear to be working.

The ... portion of arguments are passed through the stack, while named arguments continue to be passed in registers. This is generally considered the most efficient way to implement C varargs, since va_list must point to a valid location even if itself passed to another function. The disadvantage is that the varargs and non-varargs calling conventions differ, which is explicitly allowed by the C standard, but can cause compatibility problems on established platforms.
AArch64, notably, is willing to copy their entire register bank to va_list to only have one calling convention, except on Apple platforms, where they split the convention as I have.

For variable length arrays, a frame pointer must be kept, since otherwise it's unclear what the stack pointer should be returned to on exit. The frame pointer is kept below the local stack area, since the 6502 can only index upward in memory, negative offsets don't exist. Notably, in non-recursive functions, it's likely that the only thing on the soft stack will be the dynamically-sized array. Everything else could be placed on the static stack in such cases.

I've got a few more risky aspects of CodeGen to square away before I start hammering at end-to-end C99 compliance. Once I'm relatively close, I'll work with johnwbyrd to get LLVM's test suite passing on an emulated C64. From there, it's optimization city.

pzembrod · Post by **pzembrod** » Thu Feb 18, 2021 9:19 pm

Amazing progress!

LLVM 6502 Codegen

LLVM 6502 Codegen

Re: LLVM 6502 Codegen

Re: LLVM 6502 Codegen

Re: LLVM 6502 Codegen

Re: LLVM 6502 Codegen

Re: LLVM 6502 Codegen

Re: LLVM 6502 Codegen

Re: LLVM 6502 Codegen

Re: LLVM 6502 Codegen

Re: LLVM 6502 Codegen

Re: LLVM 6502 Codegen

Re: LLVM 6502 Codegen

Re: LLVM 6502 Codegen

Re: LLVM 6502 Codegen

Re: LLVM 6502 Codegen