LLVM 6502 Codegen
- mysterymath
- Posts: 79
- Joined: 10 Jan 2021
Re: LLVM 6502 Codegen
Well, it looks like I spoke too soon!
I told LLVM that the standard library functions that ship with the compiler are, in fact, the C standard library functions, not some other random functions with the same names.
This allows LLVM passes to reason about the semantics of things like strcpy, so it can manipulate, combine, and optimize calls to them.
This brings the effective performance at -Os to:
11,128 Dhrystones/sec
Why the huge jump? Well, most of the string copies must not actually have achieved anything, so away they go... the size of the binary fell from around 900 bytes to 601 bytes.
I'm more or less presuming that the reasoning is correct, since the test suite still seems to pass at -Os, and all I did was enable an optimization far, far above us in the stack.
Probably won't see another big jump like this; usually, the earlier an optimization can take root, the more effect it has.
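To make the mechanism concrete, here's a toy sketch (my own invented example, not code from Dhrystone or llvm-mos) of the kind of call this enables LLVM to delete: once strcpy is known to be the libc strcpy, a copy into a buffer that is never read afterwards has no observable effect.

```c
#include <string.h>

/* Invented illustration: with libc semantics known, the strcpy below
 * writes to a local buffer that never escapes and is never read again,
 * so the whole call can be proven dead and removed at -Os. */
int observed(void) {
    char buf[16];
    strcpy(buf, "hello");  /* dead: result is never observed */
    return 42;             /* the function reduces to just this return */
}
```

Compiling with -fno-builtin is the usual clang/gcc way to switch that libcall recognition back off, which makes the before/after easy to compare.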
Re: LLVM 6502 Codegen
Is this a case of the compiler being clever enough not to run all the code that the benchmark contains?
Re: LLVM 6502 Codegen
There's a lot of opportunity for a modern compiler to throw away code in the Dhrystone benchmark. The various functions that are called from the main loop are called only once, and will likely be inlined. From there, some of them can easily be optimised down to "variable = constant", and lifted out of the loop.
If the loop over Ch_Index is unrolled (which is likely, as it only loops twice and is very short), the 'if' inside it will always be false, and that strcpy will never execute. That leaves one strcpy to Str_2_loc in the loop, and that can be lifted out too.
I'm looking at the source for version 2.2 linked from wikipedia. LLVM's might be different, but it sounds like it's not different enough.
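The shape being described can be seen in miniature with a toy reduction (an invented stand-in borrowing Dhrystone-style names, not the actual 2.2 source): after inlining makes the control variable a known constant, unrolling the two-iteration loop leaves an always-false 'if', and the strcpy under it is dead.

```c
#include <string.h>

/* Invented stand-in for the pattern described above. After constant
 * propagation the condition can never be true, so the unrolled loop
 * body is empty and the strcpy never runs. */
int pattern(char *dst) {
    int Int_Loc = 3;                     /* constant after inlining */
    for (int Ch_Index = 0; Ch_Index < 2; ++Ch_Index) {
        if (Int_Loc > 10)                /* always false */
            strcpy(dst, "never copied");
    }
    return Int_Loc;                      /* folds to: return 3 */
}
```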
- mysterymath
- Posts: 79
- Joined: 10 Jan 2021
Re: LLVM 6502 Codegen
Okay, I'm axing my hacky CPU timing mechanism; it's just bad. Instead, I'll redo these figures with raw cycle counts from the simulator.
With that, my earlier figures are adjusted to:
753 D/s => 189 D/s
11,128 D/s => 3050 D/s
These are generated by simple cycle counting and the assumption of a 1MHz 6502; they are thus perfectly repeatable. I won't be able to use any of LLVM's tooling to track these figures, but that's a small price to pay for accuracy and repeatability, and I may be able to jury-rig some automation later.
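The conversion from raw cycle counts back to Dhrystones/sec is simple arithmetic; here's a minimal sketch of the scoring (my reconstruction for illustration, not the project's actual harness):

```c
/* Convert a simulated cycle count into Dhrystones/sec, assuming a 1MHz
 * 6502 (1,000,000 cycles per second). My reconstruction, not the
 * project's tooling. */
long long dhrystones_per_sec(long long cycles, long long iterations) {
    const long long clock_hz = 1000000;  /* assumed 1MHz clock */
    return iterations * clock_hz / cycles;
}
```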
On the 6502-specific optimization front, I'm adding a simple combine to fix a corner case where abs,x addressing would be used even though x was a constant.
By itself, this brings the figure to:
3137 D/s
The various optimizations remaining will likely be similarly incremental, so I'll post back once a big slew of them have piled up.
Last edited by mysterymath on Wed Sep 01, 2021 5:11 pm, edited 1 time in total.
Re: LLVM 6502 Codegen
John West wrote:
There's a lot of opportunity for a modern compiler to throw away code in the Dhrystone benchmark. The various functions that are called from the main loop are called only once, and will likely be inlined. From there, some of them can easily be optimised down to "variable = constant", and lifted out of the loop.
If the loop over Ch_Index is unrolled (which is likely, as it only loops twice and is very short), the 'if' inside it will always be false, and that strcpy will never execute. That leaves one strcpy to Str_2_loc in the loop, and that can be lifted out too.
I'm looking at the source for version 2.2 linked from wikipedia. LLVM's might be different, but it sounds like it's not different enough.
I once had a nice talk with the Dhrystone author at a university summer party. He was of course very aware of these (and other) shortcomings at that time (early 2000s).
- mysterymath
- Posts: 79
- Joined: 10 Jan 2021
Re: LLVM 6502 Codegen
vbc wrote:
Nevertheless most optimizing compilers typically only improve the Dhrystone results by a factor of about 2-4 when compared with optimizations turned off.
Just for fun, I ran one billion loops through the x86_64 target of clang:
-O0: 45.0592 s
-Os: 0.3692 s
So the -Os version is ~120x faster on x86_64, while mine is only ~16x faster. I'd wager a lot of that is pipeline and cache effects, which don't exist on a 6502.
- mysterymath
- Posts: 79
- Joined: 10 Jan 2021
Re: LLVM 6502 Codegen
Both of those benchmarks involve signed comparisons, which is one of the poison cases for the code generator at the moment (I'm working on it presently).
On a simulated 2MHz NMOS 6502, fib(30) takes:
As written: 9.9s
With unsigned integers: 7.4 s
Unfortunately, the llvm-mos code generator presently uses up to 4 bytes of hard stack per invocation, so tarai's max stack depth of 55 required 330 bytes of stack.
I'll add a limit to the amount of hard stack used per frame at some point, but it's relatively low on my TODO list.
Still lots of room for improvement, of course.
The compiler can't yet decrement or directly access the stack (has to load to ZP first), both of which would help this benchmark.
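For anyone who wants to reproduce the numbers, the two benchmarks are tiny; here they are in plain C (my stand-ins following the usual definitions, not the exact sources timed above). The 330 bytes quoted above is consistent with 55 frames of 6 bytes each, presumably 4 bytes of hard stack plus the 2-byte JSR return address per call.

```c
/* The fib and tarai benchmarks discussed above, in plain C. These are
 * my stand-ins following the usual definitions, not the exact sources
 * used for the timings. */
int fib(int n) {
    return n < 2 ? n : fib(n - 1) + fib(n - 2);
}

int tarai(int x, int y, int z) {
    if (x <= y)
        return y;
    return tarai(tarai(x - 1, y, z),
                 tarai(y - 1, z, x),
                 tarai(z - 1, x, y));
}
```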
Re: LLVM 6502 Codegen
In this post, I relate my experience with these benchmarks and PL/M-80, which does not have signed variables:
viewtopic.php?p=78835#p78835
I look forward to seeing you beat SDCC and GCC.
Re: LLVM 6502 Codegen
At the risk of side-tracking this thread: while I'm still interested in C, I'm currently using BCPL, and I know that those two benchmarks aren't that good in BCPL on my '816 system. The fib() one isn't bad, though. Bear in mind that my system is an '816 running at 16MHz, interpreting a 32-bit bytecode VM, all numbers are signed 32-bit, and there is overhead to make the 64K pages transparent to user-land. The results are:
Code: Select all
% bench
fib result: 28657 - 3.762
tarai result: 10 - 43.852
The first number is the result, the second the time in seconds to the nearest millisecond. Multiply by 8 to get the effective speed on a 2MHz system, so fib() comes out at 30 seconds, only ~3 times slower than the LLVM one above, which really isn't bad all things considered, and tarai() at 351 seconds. Ah well, can't win 'em all!
(although it may win on code-size at 248 bytes!)
-Gordon
code:
Code: Select all
GET "libhdr.h"
GET "sys.h"

LET fib (n) = VALOF
{
  TEST n < 2 THEN
    RESULTIS n
  ELSE
    RESULTIS fib (n - 1) + fib (n - 2)
}

LET tarai (x, y, z) = VALOF
{
  TEST x <= y THEN
    RESULTIS y
  ELSE
    RESULTIS tarai (tarai (x - 1, y, z),
                    tarai (y - 1, z, x),
                    tarai (z - 1, x, y))
}

AND start () = VALOF
{
  LET tStart, tEnd = ?, ?
  LET r = ?

  tStart := sys (Sys_cputime)
  r := fib (23)
  tEnd := sys (Sys_cputime)
  writef (" fib result: %5d - %6.3d*n", r, tEnd - tStart)

  tStart := sys (Sys_cputime)
  r := tarai (10, 1, 0)
  tEnd := sys (Sys_cputime)
  writef ("tarai result: %5d - %6.3d*n", r, tEnd - tStart)

  RESULTIS 0
}
--
Gordon Henderson.
See my Ruby 6502 and 65816 SBC projects here: https://projects.drogon.net/ruby/
Re: LLVM 6502 Codegen
mysterymath wrote:
It would be cleaner to have it take the exit condition back to the original != 0 check, but I'm not sure how to actually make that work. It's unlikely that I'm ever going to get *perfect* code out of LLVM, only better or worse.
- mysterymath
- Posts: 79
- Joined: 10 Jan 2021
Re: LLVM 6502 Codegen
Project Update
Two big items of note:
- Had a discussion with a member of one of the RISC-V committees about why their calling convention is the way it is, and a lot of the same concerns apply to us as well. Switched out our calling convention for one based loosely on RISC-V's; this generally improves performance, but in a corner case it may also increase interrupt latency. If interrupt latency ends up being too slow in practice, we do have a simple design that can dedicate zero page register sets to interrupt handlers.
- Integer comparisons (signed and unsigned, single-byte and multi-byte) now emit relatively idiomatic 6502 assembly.
The effect of the latter is enormous; there's typically at least one integer comparison per loop iteration, and the new sequences are around 4 times shorter than the old ones. I'm particularly proud of how the code generation came together; I only vaguely waved my hands in the direction of idiomatic 6502 patterns (EOR #$80, etc.), then added key optimizations to select lowering, instruction selection, legalization, etc., and all the details just sort themselves out. This makes the implementation pretty flexible, as it can handle pretty much any variation in where the individual bytes of a comparison come from. As a neat side effect, it can also fold comparisons and subtractions together, reusing the flag outputs from an earlier subtraction. (This doesn't come up much in practice, it seems.)
Here's an example:
Code: Select all
#include <stdbool.h>
bool eq(unsigned a, unsigned b) { return a == b; }
bool uge(unsigned a, unsigned b) { return a >= b; }
bool slt(int a, int b) { return a < b; }
Code: Select all
eq:
cpx mos8(__rc3)
bne .LBB0_3
cmp mos8(__rc2)
bne .LBB0_3
lda #1
rts
.LBB0_3:
lda #0
rts
uge:
cpx mos8(__rc3)
bne .LBB1_3
cmp mos8(__rc2)
bcc .LBB1_4
.LBB1_2:
lda #1
rts
.LBB1_3:
cpx mos8(__rc3)
bcs .LBB1_2
.LBB1_4:
lda #0
rts
slt:
tay
txa
cpy mos8(__rc2)
sbc mos8(__rc3)
bvc .LBB2_2
eor #-128
.LBB2_2:
cmp #0
bpl .LBB2_4
lda #1
rts
.LBB2_4:
lda #0
rts
There are a few redundant comparisons in there; these are best removed by a classic peephole redundant-instruction-removal pass at the very end of the compiler. I'm punting that down the line, since the gains are relatively small for how much work it is.
Also, this work does improve Dhrystone, but looking at the generated assembly, it's now far too silly for me to be comfortable using it to track future optimization work. We'll work to integrate some better benchmarks into our tooling (suggestions appreciated!).
Until next time!
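For readers unfamiliar with the EOR-with-$80 idiom mentioned above: it belongs to the standard 6502 signed-comparison toolkit, and one way to see why flipping sign bits works is the sign-bias trick, which maps a signed comparison onto an unsigned one. A minimal C sketch of the idea (my illustration only, not what the backend emits):

```c
#include <stdint.h>
#include <stdbool.h>

/* Sign-bias trick: XORing the sign bit of both operands maps the signed
 * range [-32768, 32767] monotonically onto the unsigned range
 * [0, 65535], so a plain unsigned compare gives the signed answer.
 * Illustrative C, not llvm-mos output. */
bool slt_biased(int16_t a, int16_t b) {
    uint16_t ua = (uint16_t)a ^ 0x8000u;  /* flip sign bit of high byte */
    uint16_t ub = (uint16_t)b ^ 0x8000u;
    return ua < ub;                       /* plain unsigned compare */
}
```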
- mysterymath
- Posts: 79
- Joined: 10 Jan 2021
Re: LLVM 6502 Codegen
Quick update, two new things:
- Nightly benchmark dashboard! https://llvm-mos.github.io/llvm-test-suite/dev/bench/
This now also has a selection of sgradat@'s 6502 compiler benchmarks. These have been tweaked pretty heavily to fool LLVM, so they're not directly comparable to results for other compilers. They're intended to track compiler performance over time; I'll leave the compiler comparisons to others.
- Multi-byte increments and decrements!
Note that this hasn't taken effect in the dashboard as of 10/14, but probably will tomorrow.
Taught our backend the mathematical structure of how multi-byte increments and decrements work:
- Increment the lowest byte. If the result wrapped to zero, recursively increment the higher-order bytes.
- Decrement the lowest byte. If the byte was zero before being decremented, recursively decrement the higher-order bytes.
From this, and some more G_SELECT optimization legwork, some fairly nice increment and decrement patterns started falling out.
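The recursive rule can be sketched in C over a little-endian byte array (my illustration of the structure, not the backend's actual representation):

```c
#include <stdint.h>

/* Multi-byte increment/decrement over a little-endian byte array,
 * following the recursive rule described above. Illustrative only. */
void inc_bytes(uint8_t *b, int n) {
    for (int i = 0; i < n; ++i)
        if (++b[i] != 0)   /* no wrap to zero: carry stops here */
            break;
}

void dec_bytes(uint8_t *b, int n) {
    for (int i = 0; i < n; ++i) {
        uint8_t was = b[i]--;
        if (was != 0)      /* byte wasn't zero: no borrow needed */
            break;
    }
}
```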
For example, here's a test function that decrements and returns its 32-bit argument. As per the calling convention, this is passed in A, X, RC2, RC3, from low to high.
Code: Select all
dec_i32:
tay
clc
adc #255
cpy #0
bne .LBB6_5
cpx #0
bne .LBB6_4
ldy mos8(__rc2)
dec mos8(__rc2)
cpy #0
bne .LBB6_4
dec mos8(__rc3)
.LBB6_4:
dex
.LBB6_5:
rts
Until next time!
- barrym95838
- Posts: 2056
- Joined: 30 Jun 2013
- Location: Sacramento, CA, USA
Re: LLVM 6502 Codegen
With a bit of thought I can manually code the same task in 19 bytes instead of 24, but the compiler still seems to be doing a pretty decent job! I suppose that trying to teach it all the nuances of the carry flag would be a non-trivial adventure ...
Got a kilobyte lying fallow in your 65xx's memory map? Sprinkle some VTL02C on it and see how it grows on you!
Mike B. (about me) (learning how to github)