6502.org • View topic - LLVM 6502 Codegen

View unanswered posts | View active topics

Board index » 6502.org Users Forum » Programming

All times are UTC

LLVM 6502 Codegen

Page 8 of 11

[ 155 posts ]

Go to page Previous 1 ... 5, 6, 7, 8, 9, 10, 11 Next

Previous topic | Next topic

Author

Message

mysterymath

Post subject: Re: LLVM 6502 Codegen

Posted: Wed Sep 01, 2021 6:05 am

Joined: Sun Jan 10, 2021 11:25 pm
Posts: 64

Well, it looks like I spoke too soon!

I told LLVM that the standard library functions that ship with the compiler are, in fact, the C standard library functions, not some other random functions with the same names.

This allows LLVM passes to reason about the semantics of things like strcpy, so it can manipulate, combine, and optimize calls to them.

This brings the effective performance at -Os to:

11,128 Dhrystones/sec

Why the huge jump? Well, most of the string copies must not actually have achieved anything, so away they go... the size of the binary fell from around 900 bytes to 601 bytes.
I'm more or less presuming that the reasoning is correct, since the test suite still seems to pass at -Os, and all I did was enable an optimization far far above us in the stack.
Probably won't see another big jump like this; usually, the earlier an optimization can take root, the more effect it has.

Top

BigEd

Post subject: Re: LLVM 6502 Codegen

Posted: Wed Sep 01, 2021 8:01 am

Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10985
Location: England

Is this a case of the compiler being clever enough not to run all the code that the benchmark contains?

Top

John West

Post subject: Re: LLVM 6502 Codegen

Posted: Wed Sep 01, 2021 8:22 am

Joined: Tue Sep 03, 2002 12:58 pm
Posts: 336

There's a lot of opportunity for a modern compiler to throw away code in the Dhrystone benchmark. The various functions that are called from the main loop are called only once, and will likely be inlined. From there, some of them can easily be optimised down to "variable = constant", and lifted out of the loop.
If the loop over Ch_Index is unrolled (which is likely, as it only loops twice and is very short), the 'if' inside it will always be false, and that strcpy will never execute. That leaves one strcpy to Str_2_loc in the loop, and that can be lifted out too.
I'm looking at the source for version 2.2 linked from wikipedia. LLVM's might be different, but it sounds like it's not different enough.

Top

mysterymath

Post subject: Re: LLVM 6502 Codegen

Posted: Wed Sep 01, 2021 4:58 pm

Joined: Sun Jan 10, 2021 11:25 pm
Posts: 64

Okay, I'm axing my hacky CPU timing mechanism; it's just bad. Instead, I'll redo these figures with raw cycle counts from the simulator.
With that, my earlier figures are adjusted to:
753 D/s => 189 D/s,
11,128 D/s => 3050 D/s,

These are generated by simple cycle counting and assumption of a 1MHz 6502; these are thus perfectly repeatable. I won't be able to use any of LLVM's tooling to track these figures, but that's a small price to pay for accuracy and repeatability, and I may be able to jury-rig some automation later.

On the 6502-specific optimization front, I'm adding a simple combine to fix a corner case where abs,x addressing would be used even though x was a constant.
By itself, this brings the figure to:
3137 D/s

The various optimizations remaining will likely be similarly incremental, so I'll post back once a big slew of them have piled up.

Last edited by mysterymath on Wed Sep 01, 2021 5:11 pm, edited 1 time in total.

Top

vbc

Post subject: Re: LLVM 6502 Codegen

Posted: Wed Sep 01, 2021 5:00 pm

Joined: Thu Apr 23, 2020 5:04 pm
Posts: 53

John West wrote:

Version 1 of Dhrystone that was mentioned is even more susceptible to code removal. AFAIK that was the main reason for the creation of version 2. However, we are still talking about the 80s and - as you correctly pointed out - modern compilers will still optimize away big parts of that version as well. Nevertheless most optimizing compilers typically only improve the Dhrystone results by a factor of about 2-4 when compared with optimizations turned off. So, the results mentioned here are at least a bit unusual. As the benchmark does not take any input and does not really calculate anything, a compiler could practically eliminate the entire benchmark, though. Another factor is how unoptimized is the unoptimized code.

I once had a nice talk with the Dhrystone author at a university summer party. He was of course very aware of these (and other) shortcomings at that time (early 2000s).

Top

mysterymath

Post subject: Re: LLVM 6502 Codegen

Posted: Wed Sep 01, 2021 5:26 pm

Joined: Sun Jan 10, 2021 11:25 pm
Posts: 64

vbc wrote:

Nevertheless most optimizing compilers typically only improve the Dhrystone results by a factor of about 2-4 when compared with optimizations turned off.

Just for fun, I ran one billion loops through the x86_64 target of clang:
-O0: 45.0592 s
-Os: 0.3692 s

So the -Os version is ~120x faster on x86_64, while mine is only ~16x faster. I'd wager a lot of that is pipeline and cache effects, which don't exist on a 6502.

Top

vbc

Post subject: Re: LLVM 6502 Codegen

Posted: Thu Sep 02, 2021 12:50 am

Joined: Thu Apr 23, 2020 5:04 pm
Posts: 53

mysterymath wrote:

Just for fun, I ran one billion loops through the x86_64 target of clang:
-O0: 45.0592 s
-Os: 0.3692 s

Interesting. With gcc9 on my machine -O0 gives 47s and -O3 gives15s. It seems recent llvm removes almost all the code from the main loop. I did not follow its development, but I think I have seen earlier results from llvm that were pretty close to gcc (probably always using Dhrystone version 2, though).

Quote:

So the -Os version is ~120x faster on x86_64, while mine is only ~16x faster. I'd wager a lot of that is pipeline and cache effects, which don't exist on a 6502.

If much of the code is moved outside the loop, the number of iterations may also increase the relation. You probably did not try a billion for the 6502.

Top

BillG

Post subject: Re: LLVM 6502 Codegen

Posted: Thu Sep 16, 2021 11:07 am

Joined: Thu Mar 12, 2020 10:04 pm
Posts: 704
Location: North Tejas

I am curious what your compiler can do with these two benchmarks:

viewtopic.php?p=78835#p78835

Top

mysterymath

Post subject: Re: LLVM 6502 Codegen

Posted: Fri Sep 17, 2021 3:56 am

Joined: Sun Jan 10, 2021 11:25 pm
Posts: 64

Both of those benchmarks involve signed comparisons, which is one of the poison cases for the code generator at the moment (I'm working on it presently).

On a simulated 2MHz NMOS 6502, fib(30) takes:
As written: 9.9s
With unsigned integers: 7.4 s

Unfortunately, the llvm-mos code generator presently uses up to 4 bytes of hard stack per invocation, so tarai's max stack depth of 55 required 330 bytes of stack.
I'll add a limit to the amount of hard stack used per frame at some point, but it's relatively low on my TODO list.

Still lots of room for improvement, of course.

The compiler can't yet decrement or directly access the stack (has to load to ZP first), both of which would help this benchmark.

Top

BillG

Post subject: Re: LLVM 6502 Codegen

Posted: Fri Sep 17, 2021 9:31 am

Joined: Thu Mar 12, 2020 10:04 pm
Posts: 704
Location: North Tejas

In this post, I relate my experience with these benchmarks and PL/M-80 which does not have signed variables:

viewtopic.php?p=78835#p78835

I look forward to seeing you beat SDCC and GCC.

Top

drogon

Post subject: Re: LLVM 6502 Codegen

Posted: Fri Sep 17, 2021 10:10 am

Joined: Wed Feb 14, 2018 2:33 pm
Posts: 1488
Location: Scotland

At the risk of side-tracking this thread, for now, while still interested in C, I'm currently using BCPL and I know that those 2 benchmarks aren't that good in BCPL on my '816 system, however the fib() one isn't bad either, so bearing in-mind that my system is an '816 running at 16Mhz, interpreting a 32-bit bytecode VM, all numbers are signed 32-bits and there is overhead to make the 64K pages transparent to user-land the results are:

Code:

% bench
  fib result: 28657 -  3.762
tarai result:    10 - 43.852

first number is the result, 2nd the time in seconds to the nearest millisecond. Multiply by 8 to get the effective speed on a 2Mhz system, so fib() comes out at 30 seconds - only ~3 times slower than the LLVM one above which really isn't bad all things considered, and tarai() to 351 seconds. Ah well, can't win 'em all!

(although it may win on code-size at 248 bytes!)

-Gordon

code:

Code:

GET "libhdr.h"
GET "sys.h"

LET fib (n) = VALOF
{
  TEST n < 2 THEN
    RESULTIS n
  ELSE
    RESULTIS fib (n - 1) + fib (n - 2)
}


LET tarai (x, y, z) = VALOF
{
  TEST x <= y THEN
    RESULTIS y
  ELSE
    RESULTIS tarai (tarai (x - 1, y, z),
                    tarai (y - 1, z, x),
                    tarai (z - 1, x, y))
}

AND start () = VALOF
{
  LET tStart, tEnd = ?, ?
  LET r = ?

  tStart := sys (Sys_cputime)
  r := fib (23)
  tEnd   := sys (Sys_cputime)
  
  writef ("  fib result: %5d - %6.3d*n", r, tEnd - tStart)

  tStart := sys (Sys_cputime)
  r := tarai (10, 1, 0)
  tEnd   := sys (Sys_cputime)
  
  writef ("tarai result: %5d - %6.3d*n", r, tEnd - tStart)

  RESULTIS 0
}

_________________
--
Gordon Henderson.
See my Ruby 6502 and 65816 SBC projects here: https://projects.drogon.net/ruby/

Top

geon

Post subject: Re: LLVM 6502 Codegen

Posted: Sun Sep 19, 2021 5:16 pm

Joined: Tue Sep 14, 2021 5:36 pm
Posts: 8

mysterymath wrote:

It would be cleaner to have it take the exit condition back to the original != 0 check, but I'm not sure how to actually make that work. It's unlikely that I'm ever going to get *perfect* code out of LLVM, only better or worse.

Well, you can't expect a C compiler to output asm as a human would write it. And since the observable behavior isn't impacted, the compiler is free to implement it however it sees fit.

Top

mysterymath

Post subject: Re: LLVM 6502 Codegen

Posted: Wed Sep 29, 2021 1:08 am

Joined: Sun Jan 10, 2021 11:25 pm
Posts: 64

Project Update

Two big items of note:

Had a discussion with a member of one of the RISC-V committees about why their calling convention was the way it was, and a lot of the same concerns apply to us as well. Switched out our calling convention for one based loosely on RISC-V; this generally improves performance, but in a corner case, may also increase interrupt latency. If interrupt latency ends up being too slow in practice, then we do have a simple design that can dedicate zero page register sets to interrupt handlers.
Integer comparisons (signed and unsigned, single-byte and multi-byte) now emit relatively idiomatic 6502 assembly.

The biggest effect of the former is that routines may be able to fit their entire working set of SSA values in the zero page registers. This is apparently the biggest reason behind the relatively large number of caller-saved registers in recent architectures; you get diminishing returns much past 16 (we were previously using all 256, but our limited experience is well aligned with this).

The effect of the latter is enormous; there's typically at least one integer comparison per loop iteration, and the new sequences are around 4 times shorter than the old ones. I'm particularly proud of how the code generation came together; I only vaguely waved my hands in the direction of idiomatic 6502 patterns (EOR #80, etc.), then added key optimizations to select lowering, instruction selection, legalization, etc, and all the details just sort themselves out. This makes the implementation pretty flexible, as it can handle pretty much any variation in where the individual bytes of a comparison come from. As a neat side-effect, it can also fold comparisons and subtractions together, re-using the flag outputs from an earlier subtraction. (This doesn't come up much in practice, it seems.)

Here's an example:

Code:

#include <stdbool.h>
bool eq(unsigned a, unsigned b) { return a == b; }
bool uge(unsigned a, unsigned b) { return a >= b; }
bool slt(int a, int b) { return a < b; }

Code:

eq:
   cpx   mos8(__rc3)
   bne   .LBB0_3
   cmp   mos8(__rc2)
   bne   .LBB0_3
   lda   #1
   rts
.LBB0_3:
   lda   #0
   rts

uge:
   cpx   mos8(__rc3)
   bne   .LBB1_3
   cmp   mos8(__rc2)
   bcc   .LBB1_4
.LBB1_2:
   lda   #1
   rts
.LBB1_3:
   cpx   mos8(__rc3)
   bcs   .LBB1_2
.LBB1_4:
   lda   #0
   rts

slt:
   tay
   txa
   cpy   mos8(__rc2)
   sbc   mos8(__rc3)
   bvc   .LBB2_2
   eor   #-128
.LBB2_2:
   cmp   #0
   bpl   .LBB2_4
   lda   #1
   rts
.LBB2_4:
   lda   #0
   rts

There's a few redundant comparisons in there; these are best removed by a classic peephole redundant instruction removal pass at the very end of the compiler. I'm punting that down the line, since the gains are relatively small for how much work it is.

Also, this work does improve Dhrystone, but looking at the generated assembly, it's now far too silly for me to be comfortable using it to track future optimization work. We'll work to integrate some better benchmarks into our tooling (suggestions appreciated!).

Until next time!

Top

mysterymath

Post subject: Re: LLVM 6502 Codegen

Posted: Fri Oct 15, 2021 4:35 am

Joined: Sun Jan 10, 2021 11:25 pm
Posts: 64

Quick update, two new things:

- Nightly benchmark dashboard! https://llvm-mos.github.io/llvm-test-suite/dev/bench/
This now also has a selection of sgradat@'s 6502 compiler benchmarks. These have been tweaked pretty heavily to fool LLVM, so they're not directly comparable to results for other compilers. They're intended to track compiler performance over time; I'll leave the compiler comparisons to others.

- Multi-byte increments and decrements!

Note that this hasn't taken effect in the dashboard as of 10/14, but probably will tomorrow.

Taught our backend the mathematical structure of how increments and decrements are supposed to work:
- Increment the lowest byte. If the result is zero, recursively increment the highest order bytes.
- Decrement the lowest byte. If the byte was zero before being decremented, recursively decrement the highest order bytes.

From this, and some more G_SELECT optimization legwork, some fairly nice increment and decrement patterns started falling out.

For example, here's a test function that decrements and returns its 32-bit argument. As per the calling convention, this is passed in A, X, RC2, RC3, from low to high.

Code:

dec_i32:
  tay
  clc
  adc #255
  cpy #0
  bne .LBB6_5
  cpx #0
  bne .LBB6_4
  ldy mos8(__rc2)
  dec mos8(__rc2)
  cpy #0
  bne .LBB6_4
  dec mos8(__rc3)
.LBB6_4:
  dex
.LBB6_5:
  rts

Until next time!

Top

barrym95838

Post subject: Re: LLVM 6502 Codegen

Posted: Fri Oct 15, 2021 8:03 am

Joined: Sun Jun 30, 2013 10:26 pm
Posts: 1949
Location: Sacramento, CA, USA

With a bit of thought I can manually code the same task in 19 bytes instead of 24, but the compiler still seems to be doing a pretty decent job! I suppose that trying to teach it all the nuances of the carry flag would be a non-trivial adventure ...

_________________
Got a kilobyte lying fallow in your 65xx's memory map? Sprinkle some VTL02C on it and see how it grows on you!

Mike B. (about me) (learning how to github)

Top

Page 8 of 11

[ 155 posts ]

Go to page Previous 1 ... 5, 6, 7, 8, 9, 10, 11 Next

Board index » 6502.org Users Forum » Programming

All times are UTC

Who is online

Users browsing this forum: No registered users and 11 guests

You cannot post new topics in this forum
You cannot reply to topics in this forum
You cannot edit your posts in this forum
You cannot delete your posts in this forum
You cannot post attachments in this forum