6502.org Forum  Projects  Code  Documents  Tools  Forum
It is currently Sat Nov 23, 2024 12:21 pm

All times are UTC




Post new topic Reply to topic  [ 155 posts ]  Go to page 1, 2, 3, 4, 5 ... 11  Next
Author Message
 Post subject: LLVM 6502 Codegen
PostPosted: Sun Jan 10, 2021 11:53 pm 
Offline

Joined: Sun Jan 10, 2021 11:25 pm
Posts: 64
Hi all,

I've been working on generating 6502 assembly from Clang/LLVM. This has been attempted many times by many different folks, so I'm wary of heaping another attempt on the pile, but I think the results I've gotten thus far are promising enough to share.

https://github.com/mysterymath/clang6502

TL;DR example:

Code:
int main(void) {
   const char *cur = "HELLO, WORLD!\n";
   while (*cur) {
      char c = *cur++;
      asm volatile ("JSR\t$FFD2" : "+a"(c));
   }
}


Code:
.code
.global   main
main:
   LDX   #0
   LDA   #72
LBB0__1:
   ;APP
   JSR   $FFD2
   ;NO_APP
   LDA   _2Estr+1,X
   INX
   CPX   #14
   BNE   LBB0__1
   LDA   #0
   LDX   #0
   RTS

.rodata
_2Estr:
   .byt   72,69,76,76,79,44,32,87,79,82,76,68,33,10,0


I've been in communication with johnwbyrd; at some point, we'll likely merge together this code generator and his ASM backend. Right now, the code generator spits out ca65-compatible assembly, but cannot natively emit machine code.

The compiler is extremely incomplete. Because I've been tackling problems "riskiest first," the code generator does very little beyond what is necessary to get the milestones working. However, each milestone has been selected to showcase a difficult task in 6502 code generation, and I've honed each until LLVM generates roughly as good of assembly as I can. Once the hard problems are mostly squared away, I'll go back and fill out the code generator, which should be mostly mechanical.

Amongst the hard problems I've got at least partial solutions for:
  • Interplay of hard and soft stacks. Each frame is presently allowed 4 bytes of hard stack, which it will place the most important values in. Hard stack saves/loads are optimized to PHA and PLA in some cases.
  • Zero page usage. The zero page is treated as a large register file, then lowered to real memory accesses later. Values that aren't live across JSRs can often live their whole life in the zero page, even in recursive functions.
  • 8-bit loop indices detection. The compiler rewrites loop induction variables to fit within 8 bits wherever possible. This allows using indexed addressing modes and INX for what are logically pointer induction variables.

Things that I've got a good idea how they will work, but don't yet:
  • Whole-program destackification: LLVM already has a pass that takes global variables and moves them to the stack for non-recursive functions; for the 6502, we'd want to reverse the process to lower stack accesses to globals.
  • Whole-program outlining: Instead of pre-designating certain "meta-operations" as JSR compiler intrinisics, LLVM supports outlining passes that essentially reverse inlining. That way, common code sequences can be automatically detected and code size reduced in a general fashion, without special casing, and in a way sensitive to the execution frequency of affected code paths.

Please let me know what you think!


Top
 Profile  
Reply with quote  
 Post subject: Re: LLVM 6502 Codegen
PostPosted: Mon Jan 11, 2021 12:24 am 
Offline

Joined: Thu Mar 12, 2020 10:04 pm
Posts: 704
Location: North Tejas
mysterymath wrote:
Code:
int main(void) {
   const char *cur = "HELLO, WORLD!\n";
   while (*cur) {
      char c = *cur++;
      asm volatile ("JSR\t$FFD2" : "+a"(c));
   }
}


Code:
.code
.global   main
main:
   LDX   #0
   LDA   #72
LBB0__1:
   ;APP
   JSR   $FFD2
   ;NO_APP
   LDA   _2Estr+1,X
   INX
   CPX   #14
   BNE   LBB0__1
   LDA   #0
   LDX   #0
   RTS

.rodata
_2Estr:
   .byt   72,69,76,76,79,44,32,87,79,82,76,68,33,10,0



It does not matter for this example, but the meaning of the generated code does not really match what the C code says. The C code says to output characters from the string until a zero byte is encountered. The generated code just outputs 14 characters, no matter what they are. If you were to have a zero embedded within the string, the generated code blindly outputs it and keeps going.

What does it generate if the string is not known at compile time?

Worse than that, better code can be generated in this case if it did follow the exact meaning of the C code:

Code:
.code
.global   main
main:
   LDX   #0
   LDA   #72
LBB0__1:
   ;APP
   JSR   $FFD2
   ;NO_APP
   INX
   LDA   _2Estr,X
   BNE   LBB0__1
   LDA   #0
   LDX   #0
   RTS

.rodata
_2Estr:
   .byt   72,69,76,76,79,44,32,87,79,82,76,68,33,10,0


Top
 Profile  
Reply with quote  
 Post subject: Re: LLVM 6502 Codegen
PostPosted: Mon Jan 11, 2021 12:54 am 
Offline

Joined: Sun Jan 10, 2021 11:25 pm
Posts: 64
LLVM runs up to 100 iterations of the routine at compile-time, then uses this to determine an upper bound on the number of possible loop iterations. That's where the 14 comes from in the generated code. If there were a zero byte anywhere else in the string, the bound would change, and if the upper bound could not be proven < 256, the compiler would form the loop using 16-bit pointers as you'd expect.

It is a bit unfortunate that the CMP 14 is retained in the output; LLVM "canonicalizes" the induction variable to a comparison against the loop exit count, which allows it to remember the loop exit count for later passes. This is what lets it use LDA _2Estr+1,X: it can prove that X never wraps around.

It would be cleaner to have it take the exit condition back to the original != 0 check, but I'm not sure how to actually make that work. It's unlikely that I'm ever going to get *perfect* code out of LLVM, only better or worse.


Top
 Profile  
Reply with quote  
 Post subject: Re: LLVM 6502 Codegen
PostPosted: Mon Jan 11, 2021 4:04 am 
Offline

Joined: Sun Nov 08, 2009 1:56 am
Posts: 411
Location: Minnesota
Not that I'm a compiler expert, but

Code:
.code
.global   main
main:
   LDX   #0
   BEQ   LBB0__2

LBB0__1:
   ;APP
   JSR   $FFD2
   ;NO_APP
   INX

LBB0__2:
   LDA   _2Estr,X
   BNE   LBB0__1
   TAX
   RTS

.rodata
_2Estr:
   .byt   72,69,76,76,79,44,32,87,79,82,76,68,33,10,0


seems reasonable to me once the compiler proves the X register will not wrap, even if it does know the exact count. That "LDA #72" in particular seems a bit redundant, since 72 also occurs in the actual string.

Incidentally, if the first char happens to be a zero byte, will the compiler skip generating any code at all?


Top
 Profile  
Reply with quote  
 Post subject: Re: LLVM 6502 Codegen
PostPosted: Mon Jan 11, 2021 5:50 am 
Offline

Joined: Sun Jan 10, 2021 11:25 pm
Posts: 64
If I understand LLVM correctly, the LDA #72 exists because of a loop rotation transformation which removes the introductory BEQ from your example. This optimization probably isn't quite as important for the 6502, since there's no branch predictor. There seem to be a number of places where LLVM goes out of its way to avoid branching; heck, the whole compiler-rt library is written with absurd branchless versions of div and mul routines to avoid a mispredict and pipeline stall on modern CPUs.

That seems to be the fundamental tradeoff with the compiler so far: LLVM is a giant cannon's worth of compiling power, but it's aimed the wrong way. At present, I'm trying to turn it as hard as I can, but I have to pick my battles, there's too many otherwise.

Accordingly, there's no way for me to say to LLVM: "for this example, I want you to generate this output". The output is the end result of nearly a hundred disparate optimization passes, each with its own list of caveats. For example, the machine-code copy propagation pass absolutely refuses to remove copies unless there's no branch after the copy. Why? Meh, it's probably fine, and determining whether the copy is used after the branch is hard, and not worth it for X86/Arm/AMDGPU/whathaveyou.

On the other hand, yes, if you replace the string with "", the compiler generates:
Code:
.code
.global   main
main:
   LDA   #0
   LDX   #0
   RTS


And I didn't have to do anything special to get it to do that.

So that's the hope: that broader efficiencies in zero page, stack, and memory layout will pay for somewhat poor code generation. It definitely won't pay for cc65 quality code, but the examples I'm working with are already leagues better. Looking at LLVM examples for X86 and ARM and so forth, that seems to be the broader tradeoff LLVM makes. It's codebases is absolutely chock-full of FIXMEs and caveats, and there's dozens of easy scenarios for a human to optimize that just don't quite fit the model. But a compiler doesn't need to beat the human in the short game, only in the long one.


Top
 Profile  
Reply with quote  
 Post subject: Re: LLVM 6502 Codegen
PostPosted: Mon Jan 11, 2021 9:18 am 
Offline
User avatar

Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10986
Location: England
Welcome, mysterymath! This looks like a great start, and I very much hope you get to the finish line!

Absolute optimality, for me, is not so important as correctness and (near) completeness. Something which is competitive with cc65 would be great.


Top
 Profile  
Reply with quote  
 Post subject: Re: LLVM 6502 Codegen
PostPosted: Tue Jan 12, 2021 5:14 pm 
Offline

Joined: Sun Nov 08, 2009 1:56 am
Posts: 411
Location: Minnesota
Thanks for clarifying that. It sounds like a huge task; I look forward to seeing what you can manage!


Top
 Profile  
Reply with quote  
 Post subject: Re: LLVM 6502 Codegen
PostPosted: Wed Jan 20, 2021 9:30 pm 
Offline

Joined: Sun Jan 10, 2021 11:25 pm
Posts: 64
I've been working on tuning the following milestone:

Code:
char next_char();
void report_counts(int counts[256]);

void char_stats() {
   int counts[256] = {0};
   char c;
   while ((c = next_char())) {
      counts[c]++;
   }
   report_counts(counts);
}


Here the "report_counts" function is external to the program, so it could call "char_stats()" recursively. This forces the use of a soft C-stack, as the 512-byte count array cannot fit into the 4 bytes of hard stack I allow each call frame.

In addition to the soft stack, this example also exercises 16-bit shifts (multiply by 2) and instruction scheduling for register pressure (performing each byte of the count increment in a pipelined fashion; i.e., LDA ADC STA LDA ADC STA, not LDA LDX ADC TAY TXA ADC STY STX or something).

Scheduling instructions into little pipelines is particularly interesting: it generates smaller code than the two-byte-at-a-time approach since it lowers register pressure. In this case, both the low and high bytes of the count can live their entire lives in A, since they're never alive at the same time. This decreases code size and cycle count over a more naive version, but still likely takes more code than breaking out 16-bit operations into subroutines would.

Generated code for this example (-O2):
Code:
.code
.global   char__stats
char__stats:
   CLC
   LDA   #254
   ADC   z:__SPhi
   STA   z:__SPhi
   LDA   z:__SPlo
   STA   z:__ZP__0
   LDA   z:__SPhi
   STA   z:__ZP__1
   LDA   #0
   LDX   #0
   LDY   #2
   JSR   memset
   JMP   LBB0__2
LBB0__1:
   ASL   A
   STA   z:__ZP__0
   LDA   #0
   ROL   A
   STA   z:__ZP__1
   LDA   z:__SPlo
   LDX   z:__SPhi
   CLC
   ADC   z:__ZP__0
   STA   z:__ZP__0
   TXA
   ADC   z:__ZP__1
   STA   z:__ZP__1
   LDY   #0
   LDA   (__ZP__0),Y
   CLC
   ADC   #1
   STA   (__ZP__0),Y
   LDY   #1
   LDA   (__ZP__0),Y
   ADC   #0
   STA   (__ZP__0),Y
LBB0__2:
   JSR   next__char
   CMP   #0
   BNE   LBB0__1
   LDA   z:__SPlo
   STA   z:__ZP__0
   LDA   z:__SPhi
   STA   z:__ZP__1
   JSR   report__counts
   CLC
   LDA   #2
   ADC   z:__SPhi
   STA   z:__SPhi
   RTS

.global   __SPhi
.global   __SPlo
.global   __ZP__0
.global   __ZP__1
.global   memset
.global   next__char
.global   report__counts


Top
 Profile  
Reply with quote  
 Post subject: Re: LLVM 6502 Codegen
PostPosted: Mon Jan 25, 2021 10:44 pm 
Offline

Joined: Sat Sep 05, 2020 3:12 pm
Posts: 22
Wow, this is great! I'm excited to see an active LLVM 6502 effort.


Top
 Profile  
Reply with quote  
 Post subject: Re: LLVM 6502 Codegen
PostPosted: Sat Feb 06, 2021 6:24 pm 
Offline

Joined: Sun Jan 10, 2021 11:25 pm
Posts: 64
The raison d'etre feature, static stacks, is now working fairly well. Here's the same code example as before, with a twist.

char_lib.h
Code:
__attribute__((leaf)) char next_char();
__attribute__((leaf)) void report_counts(int counts[256]);


char_stats.c
Code:
#include "char_lib.h"

void char_stats() {
   int counts[256] = {0};
   char c;
   while ((c = next_char())) {
      counts[c]++;
   }
   report_counts(counts);
}


The only effective difference from before is that the two external function calls are marked with "__attribute__((leaf))".

Without attribute leaf, the compiler must assume that next_char and report_counts can call any external function in the program. This makes char_stats possibly recursive, so counts must be placed on a stack.

But, attribute leaf on an external function tells the compiler that calling such a function cannot in turn call any other function in the caller's translation unit. Such functions are "leaves" from the point of view of the caller. Very nearly every library routine can be so annotated, whether C standard library, OS. This is standard practice in e.g. glibc, where this enables similar whole-program optimizations.

LLVM can then prove char_stats is not recursive, so it can convert the entire stack frame to a single global array, char_stats_sstk. (sstk stands for "static stack", my name for the concept). This completely removes the stack pointer from the generated assembly, and produces something much closer to what I'd write by hand. Eventually, the static stack frames for each function will be merged together by a big interprocedural pass, since any two functions that cannot be alive at the same time can have overlapping static stack frames. That's why I'm calling it "static stack": the eventual array will mirror what the runtime call stack would have looked like, except that its layout is determined at compile time, not at run time.

Attribute leaf is only needed if the compiler can't see the definitions for the functions, if, for example, Link-time Optimization is used, then it would be perfectly apparent to the compiler that char_stats is not recursive. In fact, the whole trio of routines would probably be inlined into one.

The nice thing about attribute leaf is that it doesn't *require* LTO to produce good code; separable compilation can work as well, so long as the boundaries between translation units are appropriately annotated. Functions themselves don't need to be annotated as non-recursive, so there's no burden on the average programmer, so long as they're using LTO. It falls on the maintainers, not the users, of a binary-only or assembly-langage library to add the appropriate leaf annotations.

Generated code:
Code:
.code
.global   char__stats
char__stats:
   LDA   #0
   LDY   #2
   LDX   #<char__stats__sstk
   STX   z:__ZP__0
   LDX   #>char__stats__sstk
   STX   z:__ZP__1
   TAX
   JSR   memset
   JMP   LBB0__2
LBB0__1:
   ASL   A
   STA   z:__ZP__0
   LDA   #0
   ROL   A
   STA   z:__ZP__1
   LDA   #<char__stats__sstk
   LDX   #>char__stats__sstk
   CLC
   ADC   z:__ZP__0
   STA   z:__ZP__0
   TXA
   ADC   z:__ZP__1
   STA   z:__ZP__1
   LDY   #0
   LDA   (__ZP__0),Y
   CLC
   ADC   #1
   STA   (__ZP__0),Y
   LDY   #1
   LDA   (__ZP__0),Y
   ADC   #0
   STA   (__ZP__0),Y
LBB0__2:
   JSR   next__char
   CMP   #0
   BNE   LBB0__1
LBB0__3:
   LDA   #<char__stats__sstk
   STA   z:__ZP__0
   LDA   #>char__stats__sstk
   STA   z:__ZP__1
   JSR   report__counts
   RTS

.bss
char__stats__sstk:
   .res   512

.global   __ZP__0
.global   __ZP__1
.global   memset
.global   next__char
.global   report__counts


Top
 Profile  
Reply with quote  
 Post subject: Re: LLVM 6502 Codegen
PostPosted: Sat Feb 06, 2021 10:31 pm 
Offline
User avatar

Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10986
Location: England
> convert the entire stack frame to a single global array

I like this! Recursion is useful, but also unusual.


Top
 Profile  
Reply with quote  
 Post subject: Re: LLVM 6502 Codegen
PostPosted: Tue Feb 16, 2021 3:32 pm 
Offline
User avatar

Joined: Tue Aug 11, 2020 3:45 am
Posts: 311
Location: A magnetic field
mysterymath on Mon 11 Jan 2021 wrote:
if the upper bound could not be proven < 256, the compiler would form the loop using 16-bit pointers as you'd expect.


You should get a job fixing Atmel's compiler because avr-gcc doesn't do that.

_________________
Modules | Processors | Boards | Boxes | Beep, Beep! I'm a sheep!


Top
 Profile  
Reply with quote  
 Post subject: Re: LLVM 6502 Codegen
PostPosted: Tue Feb 16, 2021 9:45 pm 
Offline

Joined: Mon Sep 17, 2018 2:39 am
Posts: 138
Hi!

Sheep64 wrote:
mysterymath on Mon 11 Jan 2021 wrote:
if the upper bound could not be proven < 256, the compiler would form the loop using 16-bit pointers as you'd expect.


You should get a job fixing Atmel's compiler because avr-gcc doesn't do that.


In that code, avr-gcc does the exact same transformation, including the loading of #72 at the start of the loop.

Have Fun!


Top
 Profile  
Reply with quote  
 Post subject: Re: LLVM 6502 Codegen
PostPosted: Thu Feb 18, 2021 6:41 am 
Offline

Joined: Sun Jan 10, 2021 11:25 pm
Posts: 64
Project Update:

The calling convention is now C99 complete, ignoring any bugs.
Variable length argument lists and C99 variable length arrays both appear to be working.

The ... portion of arguments are passed through the stack, while named arguments continue to be passed in registers. This is generally considered the most efficient way to implement C varargs, since va_list must point to a valid location even if itself passed to another function. The disadvantage is that the varargs and non-varargs calling conventions differ, which is explicitly allowed by the C standard, but can cause compatibility problems on established platforms.
AArch64, notably, is willing to copy their entire register bank to va_list to only have one calling convention, except on Apple platforms, where they split the convention as I have.

For variable length arrays, a frame pointer must be kept, since otherwise it's unclear what the stack pointer should be returned to on exit. The frame pointer is kept below the local stack area, since the 6502 can only index upward in memory, negative offsets don't exist. Notably, in non-recursive functions, it's likely that the only thing on the soft stack will be the dynamically-sized array. Everything else could be placed on the static stack in such cases.

I've got a few more risky aspects of CodeGen to square away before I start hammering at end-to-end C99 compliance. Once I'm relatively close, I'll work with johnwbyrd to get LLVM's test suite passing on an emulated C64. From there, it's optimization city.


Top
 Profile  
Reply with quote  
 Post subject: Re: LLVM 6502 Codegen
PostPosted: Thu Feb 18, 2021 9:19 pm 
Offline

Joined: Sat Sep 05, 2020 3:12 pm
Posts: 22
Amazing progress!


Top
 Profile  
Reply with quote  
Display posts from previous:  Sort by  
Post new topic Reply to topic  [ 155 posts ]  Go to page 1, 2, 3, 4, 5 ... 11  Next

All times are UTC


Who is online

Users browsing this forum: No registered users and 17 guests


You cannot post new topics in this forum
You cannot reply to topics in this forum
You cannot edit your posts in this forum
You cannot delete your posts in this forum
You cannot post attachments in this forum

Search for:
Jump to:  
cron