LLVM 6502 Codegen
- mysterymath
- Posts: 79
- Joined: 10 Jan 2021
Re: LLVM 6502 Codegen
Well, it looks like I spoke too soon!
I told LLVM that the standard library functions that ship with the compiler are, in fact, the C standard library functions, not some other random functions with the same names.
This allows LLVM passes to reason about the semantics of things like strcpy, so it can manipulate, combine, and optimize calls to them.
This brings the effective performance at -Os to:
11,128 Dhrystones/sec
Why the huge jump? Well, most of the string copies must not actually have achieved anything, so away they go... the size of the binary fell from around 900 bytes to 601 bytes.
I'm more or less presuming that the reasoning is correct, since the test suite still seems to pass at -Os, and all I did was enable an optimization far, far above us in the stack.
Probably won't see another big jump like this; usually, the earlier an optimization can take root, the more effect it has.
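To make the mechanism concrete, here's a toy sketch (my own invented example, not code from Dhrystone or llvm-mos) of the kind of call this enables LLVM to delete: once strcpy is known to be the libc strcpy, a copy into a buffer that is never read afterwards has no observable effect.

```c
#include <string.h>

/* Invented illustration: with libc semantics known, the strcpy below
 * writes to a local buffer that never escapes and is never read again,
 * so the whole call can be proven dead and removed at -Os. */
int observed(void) {
    char buf[16];
    strcpy(buf, "hello");  /* dead: result is never observed */
    return 42;             /* the function reduces to just this return */
}
```

Compiling with -fno-builtin is the usual clang/gcc way to switch that libcall recognition back off, which makes the before/after easy to compare.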
Re: LLVM 6502 Codegen
Is this a case of the compiler being clever enough not to run all the code that the benchmark contains?
Re: LLVM 6502 Codegen
There's a lot of opportunity for a modern compiler to throw away code in the Dhrystone benchmark. The various functions that are called from the main loop are called only once, and will likely be inlined. From there, some of them can easily be optimised down to "variable = constant", and lifted out of the loop.
If the loop over Ch_Index is unrolled (which is likely, as it only loops twice and is very short), the 'if' inside it will always be false, and that strcpy will never execute. That leaves one strcpy to Str_2_loc in the loop, and that can be lifted out too.
I'm looking at the source for version 2.2 linked from wikipedia. LLVM's might be different, but it sounds like it's not different enough.
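The shape being described can be seen in miniature with a toy reduction (an invented stand-in borrowing Dhrystone-style names, not the actual 2.2 source): after inlining makes the control variable a known constant, unrolling the two-iteration loop leaves an always-false 'if', and the strcpy under it is dead.

```c
#include <string.h>

/* Invented stand-in for the pattern described above. After constant
 * propagation the condition can never be true, so the unrolled loop
 * body is empty and the strcpy never runs. */
int pattern(char *dst) {
    int Int_Loc = 3;                     /* constant after inlining */
    for (int Ch_Index = 0; Ch_Index < 2; ++Ch_Index) {
        if (Int_Loc > 10)                /* always false */
            strcpy(dst, "never copied");
    }
    return Int_Loc;                      /* folds to: return 3 */
}
```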
- mysterymath
- Posts: 79
- Joined: 10 Jan 2021
Re: LLVM 6502 Codegen
Okay, I'm axing my hacky CPU timing mechanism; it's just bad. Instead, I'll redo these figures with raw cycle counts from the simulator.
With that, my earlier figures are adjusted to:
753 D/s => 189 D/s
11,128 D/s => 3050 D/s
These are generated by simple cycle counting and the assumption of a 1MHz 6502; they are thus perfectly repeatable. I won't be able to use any of LLVM's tooling to track these figures, but that's a small price to pay for accuracy and repeatability, and I may be able to jury-rig some automation later.
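The conversion from raw cycle counts back to Dhrystones/sec is simple arithmetic; here's a minimal sketch of the scoring (my reconstruction for illustration, not the project's actual harness):

```c
/* Convert a simulated cycle count into Dhrystones/sec, assuming a 1MHz
 * 6502 (1,000,000 cycles per second). My reconstruction, not the
 * project's tooling. */
long long dhrystones_per_sec(long long cycles, long long iterations) {
    const long long clock_hz = 1000000;  /* assumed 1MHz clock */
    return iterations * clock_hz / cycles;
}
```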
On the 6502-specific optimization front, I'm adding a simple combine to fix a corner case where abs,x addressing would be used even though x was a constant.
By itself, this brings the figure to:
3137 D/s
The various optimizations remaining will likely be similarly incremental, so I'll post back once a big slew of them have piled up.
Last edited by mysterymath on Wed Sep 01, 2021 5:11 pm, edited 1 time in total.
Re: LLVM 6502 Codegen
John West wrote:
There's a lot of opportunity for a modern compiler to throw away code in the Dhrystone benchmark. The various functions that are called from the main loop are called only once, and will likely be inlined. From there, some of them can easily be optimised down to "variable = constant", and lifted out of the loop.
If the loop over Ch_Index is unrolled (which is likely, as it only loops twice and is very short), the 'if' inside it will always be false, and that strcpy will never execute. That leaves one strcpy to Str_2_loc in the loop, and that can be lifted out too.
I'm looking at the source for version 2.2 linked from wikipedia. LLVM's might be different, but it sounds like it's not different enough.
I once had a nice talk with the Dhrystone author at a university summer party. He was of course very aware of these (and other) shortcomings at that time (early 2000s).
- mysterymath
- Posts: 79
- Joined: 10 Jan 2021
Re: LLVM 6502 Codegen
vbc wrote:
Nevertheless most optimizing compilers typically only improve the Dhrystone results by a factor of about 2-4 when compared with optimizations turned off.
Just for fun, I ran one billion loops through the x86_64 target of clang:
-O0: 45.0592 s
-Os: 0.3692 s
So the -Os version is ~120x faster on x86_64, while mine is only ~16x faster. I'd wager a lot of that is pipeline and cache effects, which don't exist on a 6502.
- mysterymath
- Posts: 79
- Joined: 10 Jan 2021
Re: LLVM 6502 Codegen
Both of those benchmarks involve signed comparisons, which is one of the poison cases for the code generator at the moment (I'm working on it presently).
On a simulated 2MHz NMOS 6502, fib(30) takes:
As written: 9.9s
With unsigned integers: 7.4 s
Unfortunately, the llvm-mos code generator presently uses up to 4 bytes of hard stack per invocation, so tarai's max stack depth of 55 required 330 bytes of stack.
I'll add a limit to the amount of hard stack used per frame at some point, but it's relatively low on my TODO list.
Still lots of room for improvement, of course.
The compiler can't yet decrement or directly access the stack (has to load to ZP first), both of which would help this benchmark.
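For anyone who wants to reproduce the numbers, the two benchmarks are tiny; here they are in plain C (my stand-ins following the usual definitions, not the exact sources timed above). The 330 bytes quoted above is consistent with 55 frames of 6 bytes each, presumably 4 bytes of hard stack plus the 2-byte JSR return address per call.

```c
/* The fib and tarai benchmarks discussed above, in plain C. These are
 * my stand-ins following the usual definitions, not the exact sources
 * used for the timings. */
int fib(int n) {
    return n < 2 ? n : fib(n - 1) + fib(n - 2);
}

int tarai(int x, int y, int z) {
    if (x <= y)
        return y;
    return tarai(tarai(x - 1, y, z),
                 tarai(y - 1, z, x),
                 tarai(z - 1, x, y));
}
```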
Re: LLVM 6502 Codegen
In this post, I relate my experience with these benchmarks and PL/M-80, which does not have signed variables:
viewtopic.php?p=78835#p78835
I look forward to seeing you beat SDCC and GCC.
Re: LLVM 6502 Codegen
At the risk of side-tracking this thread: while I'm still interested in C, I'm currently using BCPL, and I know that those two benchmarks aren't that good in BCPL on my '816 system. The fib() one isn't bad, though. Bear in mind that my system is an '816 running at 16MHz, interpreting a 32-bit bytecode VM, all numbers are signed 32-bit, and there is overhead to make the 64K pages transparent to user-land. The results are:
Code: Select all
% bench
fib result: 28657 - 3.762
tarai result: 10 - 43.852
The first number is the result, the second the time in seconds to the nearest millisecond. Multiply by 8 to get the effective speed on a 2MHz system, so fib() comes out at 30 seconds, only ~3 times slower than the LLVM one above, which really isn't bad all things considered, and tarai() at 351 seconds. Ah well, can't win 'em all!
(although it may win on code-size at 248 bytes!)
-Gordon
code:
Code: Select all
GET "libhdr.h"
GET "sys.h"

LET fib (n) = VALOF
{
  TEST n < 2 THEN
    RESULTIS n
  ELSE
    RESULTIS fib (n - 1) + fib (n - 2)
}

LET tarai (x, y, z) = VALOF
{
  TEST x <= y THEN
    RESULTIS y
  ELSE
    RESULTIS tarai (tarai (x - 1, y, z),
                    tarai (y - 1, z, x),
                    tarai (z - 1, x, y))
}

AND start () = VALOF
{
  LET tStart, tEnd = ?, ?
  LET r = ?

  tStart := sys (Sys_cputime)
  r := fib (23)
  tEnd := sys (Sys_cputime)
  writef (" fib result: %5d - %6.3d*n", r, tEnd - tStart)

  tStart := sys (Sys_cputime)
  r := tarai (10, 1, 0)
  tEnd := sys (Sys_cputime)
  writef ("tarai result: %5d - %6.3d*n", r, tEnd - tStart)

  RESULTIS 0
}
--
Gordon Henderson.
See my Ruby 6502 and 65816 SBC projects here: https://projects.drogon.net/ruby/
Re: LLVM 6502 Codegen
mysterymath wrote:
It would be cleaner to have it take the exit condition back to the original != 0 check, but I'm not sure how to actually make that work. It's unlikely that I'm ever going to get *perfect* code out of LLVM, only better or worse.
- mysterymath
- Posts: 79
- Joined: 10 Jan 2021
Re: LLVM 6502 Codegen
Project Update
Two big items of note:
- Had a discussion with a member of one of the RISC-V committees about why their calling convention is the way it is, and a lot of the same concerns apply to us as well. Switched out our calling convention for one based loosely on RISC-V's; this generally improves performance, but in a corner case it may also increase interrupt latency. If interrupt latency ends up being too slow in practice, we do have a simple design that can dedicate zero page register sets to interrupt handlers.
- Integer comparisons (signed and unsigned, single-byte and multi-byte) now emit relatively idiomatic 6502 assembly.
The effect of the latter is enormous; there's typically at least one integer comparison per loop iteration, and the new sequences are around 4 times shorter than the old ones. I'm particularly proud of how the code generation came together; I only vaguely waved my hands in the direction of idiomatic 6502 patterns (EOR #$80, etc.), then added key optimizations to select lowering, instruction selection, legalization, etc., and all the details just sort themselves out. This makes the implementation pretty flexible, as it can handle pretty much any variation in where the individual bytes of a comparison come from. As a neat side effect, it can also fold comparisons and subtractions together, reusing the flag outputs from an earlier subtraction. (This doesn't come up much in practice, it seems.)
Here's an example:
Code: Select all
#include <stdbool.h>
bool eq(unsigned a, unsigned b) { return a == b; }
bool uge(unsigned a, unsigned b) { return a >= b; }
bool slt(int a, int b) { return a < b; }
Code: Select all
eq:
cpx mos8(__rc3)
bne .LBB0_3
cmp mos8(__rc2)
bne .LBB0_3
lda #1
rts
.LBB0_3:
lda #0
rts
uge:
cpx mos8(__rc3)
bne .LBB1_3
cmp mos8(__rc2)
bcc .LBB1_4
.LBB1_2:
lda #1
rts
.LBB1_3:
cpx mos8(__rc3)
bcs .LBB1_2
.LBB1_4:
lda #0
rts
slt:
tay
txa
cpy mos8(__rc2)
sbc mos8(__rc3)
bvc .LBB2_2
eor #-128
.LBB2_2:
cmp #0
bpl .LBB2_4
lda #1
rts
.LBB2_4:
lda #0
rts
There are a few redundant comparisons in there; these are best removed by a classic peephole redundant-instruction-removal pass at the very end of the compiler. I'm punting that down the line, since the gains are relatively small for how much work it is.
Also, this work does improve Dhrystone, but looking at the generated assembly, it's now far too silly for me to be comfortable using it to track future optimization work. We'll work to integrate some better benchmarks into our tooling (suggestions appreciated!).
Until next time!
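For readers unfamiliar with the EOR-with-$80 idiom mentioned above: it belongs to the standard 6502 signed-comparison toolkit, and one way to see why flipping sign bits works is the sign-bias trick, which maps a signed comparison onto an unsigned one. A minimal C sketch of the idea (my illustration only, not what the backend emits):

```c
#include <stdint.h>
#include <stdbool.h>

/* Sign-bias trick: XORing the sign bit of both operands maps the signed
 * range [-32768, 32767] monotonically onto the unsigned range
 * [0, 65535], so a plain unsigned compare gives the signed answer.
 * Illustrative C, not llvm-mos output. */
bool slt_biased(int16_t a, int16_t b) {
    uint16_t ua = (uint16_t)a ^ 0x8000u;  /* flip sign bit of high byte */
    uint16_t ub = (uint16_t)b ^ 0x8000u;
    return ua < ub;                       /* plain unsigned compare */
}
```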
- mysterymath
- Posts: 79
- Joined: 10 Jan 2021
Re: LLVM 6502 Codegen
Quick update, two new things:
- Nightly benchmark dashboard! https://llvm-mos.github.io/llvm-test-suite/dev/bench/
This now also has a selection of sgradat@'s 6502 compiler benchmarks. These have been tweaked pretty heavily to fool LLVM, so they're not directly comparable to results for other compilers. They're intended to track compiler performance over time; I'll leave the compiler comparisons to others.
- Multi-byte increments and decrements!
Note that this hasn't taken effect in the dashboard as of 10/14, but probably will tomorrow.
Taught our backend the mathematical structure of how multi-byte increments and decrements work:
- Increment the lowest byte. If the result wrapped to zero, recursively increment the higher-order bytes.
- Decrement the lowest byte. If the byte was zero before being decremented, recursively decrement the higher-order bytes.
From this, and some more G_SELECT optimization legwork, some fairly nice increment and decrement patterns started falling out.
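The recursive rule can be sketched in C over a little-endian byte array (my illustration of the structure, not the backend's actual representation):

```c
#include <stdint.h>

/* Multi-byte increment/decrement over a little-endian byte array,
 * following the recursive rule described above. Illustrative only. */
void inc_bytes(uint8_t *b, int n) {
    for (int i = 0; i < n; ++i)
        if (++b[i] != 0)   /* no wrap to zero: carry stops here */
            break;
}

void dec_bytes(uint8_t *b, int n) {
    for (int i = 0; i < n; ++i) {
        uint8_t was = b[i]--;
        if (was != 0)      /* byte wasn't zero: no borrow needed */
            break;
    }
}
```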
For example, here's a test function that decrements and returns its 32-bit argument. As per the calling convention, this is passed in A, X, RC2, RC3, from low to high.
Code: Select all
dec_i32:
tay
clc
adc #255
cpy #0
bne .LBB6_5
cpx #0
bne .LBB6_4
ldy mos8(__rc2)
dec mos8(__rc2)
cpy #0
bne .LBB6_4
dec mos8(__rc3)
.LBB6_4:
dex
.LBB6_5:
rts
Until next time!
- barrym95838
- Posts: 2056
- Joined: 30 Jun 2013
- Location: Sacramento, CA, USA
Re: LLVM 6502 Codegen
With a bit of thought I can manually code the same task in 19 bytes instead of 24, but the compiler still seems to be doing a pretty decent job! I suppose that trying to teach it all the nuances of the carry flag would be a non-trivial adventure ...
Got a kilobyte lying fallow in your 65xx's memory map? Sprinkle some VTL02C on it and see how it grows on you!
Mike B. (about me) (learning how to github)