LLVM 6502 Codegen

Rob Finch · Post by **Rob Finch** » Tue Aug 17, 2021 12:33 pm

I have been following this topic with interest. Some of the ideas could be applied to other accumulator based processors. The 6809 comes to mind.

Quote:

(Perhaps important to note, the Virtual Machine mentioned here is not for execution, but as an intermediate target for an intermediate code, which can be optimised before translation to an actual hardware instruction set. And then possibly optimised again.)

I wonder how long before someone comes up with a processor that directly executes the VM.

drogon · Post by **drogon** » Tue Aug 17, 2021 1:31 pm

Rob Finch wrote:

I have been following this topic with interest. Some of the ideas could be applied to other accumulator based processors. The 6809 comes to mind.

Quote:

(Perhaps important to note, the Virtual Machine mentioned here is not for execution, but as an intermediate target for an intermediate code, which can be optimised before translation to an actual hardware instruction set. And then possibly optimised again.)

I wonder how long before someone comes up with a processor that directly executes the VM.

While I've not looked at the VM produced by LLVM, something similar (ish) happened in the 80's in the form of the Inmos Transputer. From what I can tell, its microcoded instruction set was influenced by the Cintcode instructions generated by the BCPL compiler - this is a stack based VM (only 2 stack registers in Cintcode) with a variable width instruction set which was more "orthogonalised" in the Transputer. (4-bit wide base instructions and 3 registers in the push-down stack) I'd love to create a microcoded CPU that executed the BCPL Cintcode directly - but it's relatively easy to interpret on a modern 32-bit CPU. Harder on an '816 though, but possible (because I've done it)

-Gordon

mysterymath · Post by **mysterymath** » Wed Aug 25, 2021 4:28 pm

Major Update!

Alright, I've finally finished going through and auditing the compiler for correctness and C99 completeness.

Not a lot changed, but I'll call the fixes out below. The main significance of this milestone is that it's now optimization time! (Finally!

)

First up, I'll be getting our automated test suite capable of producing size and cycle count figures for the Dhrystone included with LLVM. I'll post those figures once I have them. From there, I'll be optimizing the crud out of it, and we'll see how small we can get it! (And no cheating; I'll avoid "optimizations" that make Dhrystone faster and other programs slower.)

Fixes:
https://github.com/llvm-mos/llvm-mos/issues/27
Due to the 6502 indexing bug, indexed addressing may cause spurious reads to memory one page below the final location. This could be a problem if the spurious read hit an I/O register with hardware behavior triggered on read. To deal with this, I've disabled indexed addressing entirely for loads and stores to volatile objects. If the address is dynamic, it's issued through "LDY #0; xDA (ptr),Y" to ensure pages aren't crossed. It's on the programmer to ensure that all indexed accesses to values one page above I/O registers occur through volatile objects. Shouldn't be too onerous, since I/O registers are typically contiguous in high address space, and you usually need volatile objects to refer to those registers anyway.

https://github.com/llvm-mos/llvm-mos/issues/7
The soft stack pointer was always set in low then high order. When decreasing the stack pointer on function entry, if a borrow occurs, the 16-bit stack pointer might temporarily increase past its value on entry to the function. Were an interrupt to occur at this point, the interrupt might overwrite the stack of the interrupted function's caller. There are broadly 3 ways to fix this:
1. A interrupt-dedicated soft stack
2. Disabling interrupts while updating SP
3. Always set the high byte first when decreasing SP.

(1) requires an additional burden on the programmer and time to swap out the stack pointers. (2) doesn't work for NMIs without an additional target-specific mechanism present. (3)'s only downside, AFAICT, is an additional 2 bytes and 7 cycles of PHA/PLA on entry to a function that uses the soft stack. Given the tradeoffs, I went with (3).

https://github.com/llvm-mos/llvm-mos/issues/73
Finally, I was able to remove 3 bytes of compiler temporaries, so these no longer need to be allocated by the linker or saved and restored by interrupt handlers. This decreases the number of bytes (beyond A, X, and Y) of memory that an interrupt handler needs to save and restore in the worst case to 6, and 4 of those bytes are in the zero page.

Ta ta for now!

Proxy · Post by **Proxy** » Thu Aug 26, 2021 12:28 am

oh i didn't realize this was designed around the NMOS 6502 instead of the CMOS one.
i hope sometime in the future it will be able to make use of the 65C02's feature and lack of hardware bugs.

either way i'm following this project with great interest, as i hope to use a somewhat final version of it as a learning tool for getting into writing an LLVM backend myself.
the CPU0 example from online didn't really work for me, so i'm hoping a backend based around a CPU i know will allow me to better understand what is going on with the various files and fomats.

cjs · Post by **cjs** » Thu Aug 26, 2021 4:15 am

BigDumbDinosaur wrote:

It was the Motorola 6800 in which the idea of using zero page... When Chuck Peddle, Bill Mensch et al left Motorola for MOS Technology they "borrowed" the direct page concept (calling it "zero page"), as well as many other 6800 features (e.g., relative branches and most of the 6800 assembly language).... The 6502's main innovation was low cost.

Thanks for the correction giving credit where it's due to the 6800. I should mention that, though low cost was certainly a huge factor in the 6502's success, the 6502 was not entirely without innovation on the ISA front.

In particular, the ability to do indirect indexed access with addresses stored in the zero page is something that doesn't exist on the 6800 and is a very handy feature, giving you essentially dozens of index registers. (That comes theoretically at some cost in access time, but in practice not much at all since the 6800 did not well optimize its indexed access.) It's handy enough that, now having done about as much 6800 programming as 6502 programming, I'd say it easily makes up for the many other limitations the 6502 introduced (8-bit index registers, loss of an accumulator, loss off useful instructions like BRA, tiny stack, etc.).

(This is probably getting rather off-topic for this thread, so if anybody's interested in getting into a substantive discussion about this it's probably best to start a new thread.)

vbc · Post by **vbc** » Thu Aug 26, 2021 7:56 am

mysterymath wrote:

The soft stack pointer was always set in low then high order. When decreasing the stack pointer on function entry, if a borrow occurs, the 16-bit stack pointer might temporarily increase past its value on entry to the function. Were an interrupt to occur at this point, the interrupt might overwrite the stack of the interrupted function's caller. There are broadly 3 ways to fix this:
1. A interrupt-dedicated soft stack
2. Disabling interrupts while updating SP
3. Always set the high byte first when decreasing SP.

(1) requires an additional burden on the programmer and time to swap out the stack pointers. (2) doesn't work for NMIs without an additional target-specific mechanism present. (3)'s only downside, AFAICT, is an additional 2 bytes and 7 cycles of PHA/PLA on entry to a function that uses the soft stack. Given the tradeoffs, I went with (3).

Interesting. When I had to make that decision for vbcc, I immediately chose (1). What burden on the programmer do you see?

Slowing down every (stack using) function in the code to reduce the overhead of an interrupt handler is something I can understand for specific use-cases (although it would usually not be my choice). However, if the interrupt handler calls other (stack using) C functions, will you not loose that improvement in your sub-functions (which, I assume, will have the overhead of the stack pointer adjustment)? Maybe I am overlooking something, but it does not seem beneficial to me.

But more important, as I understand it, this mechanism may incur an additional up to 255 bytes of stack-usage, depending on a - possibly very rare - race condition as well as the layout of the stack. I think you can even produce cases where reducing the stack usage of a function will effectively increase the total stack usage. That seems pretty unmaintainable to me.

GARTHWILSON · Post by **GARTHWILSON** » Thu Aug 26, 2021 8:13 am

I guess I skipped over that earlier. I missed it. If you're doing a software stack, you always make the new stack byte or cell available before putting the data in it, and don't remove it until after you're done with the data. Then there's no problem, ever, and interrupts won't interfere. I do it all the time; and in my applications, I cannot be delaying the interrupt the few instructions between SEI and CLI, so I don't. Stacks, by definition, avoid the problems you're mentioning. I think you're still making the interrupt & stack stuff more difficult than it needs to be.

cjs · Post by **cjs** » Thu Aug 26, 2021 8:31 am

GARTHWILSON wrote:

If you're doing a software stack, you always make the new stack byte or cell available before putting the data in it, and don't remove it until after you're done with the data.

Right, but how do you decrement (or increment) a 16-bit stack (or other) pointer atomically on a 6502? It's easy if there's no carry/borrow, but if there is it seems to me that you always have a period between updating the LSB and MSB where the pointer is invalid.

GARTHWILSON · Post by **GARTHWILSON** » Thu Aug 26, 2021 9:34 am

cjs wrote:

GARTHWILSON wrote:

If you're doing a software stack, you always make the new stack byte or cell available before putting the data in it, and don't remove it until after you're done with the data.

Right, but how do you decrement (or increment) a 16-bit stack (or other) pointer atomically on a 6502? It's easy if there's no carry/borrow, but if there is it seems to me that you always have a period between updating the LSB and MSB where the pointer is invalid.

That's a good point. I've had my head in 65816 stuff recently which does 16-bit increments and decrements atomically. The '02 requires a lot more overhead; and actually, now that I think about it more, an ISR using the same stack space (to service an interrupt in an HLL) won't be so urgent that the SEI...CLI pair in mysterymath's #2 would cause any trouble. (It's all a non-issue in the way I service interrupts in Forth though.)

Otherwise, for the '02, you'll examine the low 8 bits before deciding how to take action, rather than incrementing or decrementing the low 8 bits and then seeing if you need to do the high 8 bits too. How's this for the stack growing downward and you have a stack pointer in ZP we'll call PTR and post-index the indirect with Y:

Code: Select all

        CPY  #0
        IF_EQ
           DEC  PTR+1
        END_IF
        DEY

        STA  (PTR),Y   ; Store something in the new stack byte.

The few times an interrupt hits between the DEC and the DEY, if the ISR uses the same stack, [Edit: and you're on the page border,] you'll hop over a whole page; but you'll get it back when the ISR finishes. The ISR will not have anything straddling that gap, and won't step on stack space that's holding things that are still needed by the background program.

After you're done with a stack byte and want to free it up, you could do something like:

Code: Select all

        CPY  #$FF
        IF_EQ
           INY
           INC  PTR+1
        ELSE_
           INY
        END_IF

Again there's the chance of hopping over a whole page if an interrupt using the same stack hits between the INY and the INC [Edit: and you're on the page border]; but you'll get it back when this ISR finishes, and it won't step on stack space still needed by the background program.

When performance is paramount, the above could be put in macros; otherwise make them subroutines to save memory at the expense of the 12-cycle JSR-RTS pair. The subroutines will be easier to justify when you want to add or delete two or more stack bytes at once. If these always go in byte pairs such that you'll never have an odd number in Y, it should make things a little simpler when you add or remove a 16-bit stack cell, compared to going through the process twice of adding or removing a byte.

It's well past my bedtime. Hopefully I didn't make any dumb mistakes in my bleary-eyed effort.

mysterymath · Post by **mysterymath** » Thu Aug 26, 2021 3:49 pm

vbc wrote:

Interesting. When I had to make that decision for vbcc, I immediately chose (1). What burden on the programmer do you see?

It's possible that you've a different version of (1) in mind. The way I could manage to think of interrupt-specific stacks working, you'd have to find a place in the memory map for each of them, such that they wouldn't conflict with one another or run into anything else. If an interrupt could be interrupted, then you'd need two stacks, and so on. In general, it seems like this ends up being essentially how threads work; dynamically allocated fixed-size stacks per thread. But one of the advantages of the C runtime model is that you can get asynchronicity working on just one stack, and I hoped to preserve this advantage.

vbc wrote:

Slowing down every (stack using) function in the code to reduce the overhead of an interrupt handler is something I can understand for specific use-cases (although it would usually not be my choice). However, if the interrupt handler calls other (stack using) C functions, will you not loose that improvement in your sub-functions (which, I assume, will have the overhead of the stack pointer adjustment)? Maybe I am overlooking something, but it does not seem beneficial to me.

But more important, as I understand it, this mechanism may incur an additional up to 255 bytes of stack-usage, depending on a - possibly very rare - race condition as well as the layout of the stack. I think you can even produce cases where reducing the stack usage of a function will effectively increase the total stack usage. That seems pretty unmaintainable to me.

I'm hoping to avoid use of the soft stack as much as possible, and a PHA/PLA is a rounding error given the current state of the compiler. I've been trading off performance for simplicity of implementation at every stage of the compiler so far, since I'm working on correctness in a vacuum. This will definitely be something I'll have to revisit, in the context where I can compare its impact to other forms of optimization.

Along those lines, there's another approach with a different tradeoff, too. The SP can never be seen to be more than one page above it's true value, so an interrupt handler could be required to summarily increase SP by one page. This removes the PHA/PLA from every non-interrupt function, and adds a SP increment/decrement to the interrupt handler (only if it doesn't already have one). The downside is that 256 bytes are wasted for every interrupt on the call stack, which is the absolute worst case of the PHA/PLA approach. For PHA/PLA, usually zero bytes will be wasted, since usually interrupts don't occur within SP increments. The chances of multiple interrupt invocations wasting space decrease exponentially with the number of interrupt invocations on the stack.

GARTHWILSON wrote:

Otherwise, for the '02, you'll examine the low 8 bits before deciding how to take action, rather than incrementing or decrementing the low 8 bits and then seeing if you need to do the high 8 bits too. How's this for the stack growing downward and you have a stack pointer in ZP we'll call PTR and post-index the indirect with Y:

That's similar in spirit to what I ended up with, just generalizing for the usual case where we're decrementing the stack by a 16-bit constant. Compute the low byte, stash it on the hard stack, and then feed its carry into the high byte sum. Set the high byte, then set the low byte from the stashed value.

In the case where you're decrementing by 1, the carry into the high byte sum reduces to the compare of the low byte with zero, and the use of that carry reduces to the control flow avoiding the decrement of the the high byte.

vbc · Post by **vbc** » Sat Aug 28, 2021 6:15 am

mysterymath wrote:

It's possible that you've a different version of (1) in mind. The way I could manage to think of interrupt-specific stacks working, you'd have to find a place in the memory map for each of them, such that they wouldn't conflict with one another or run into anything else.

Sounds like this would do the job:

Code: Select all

char _ISRstack[MYISRSIZE];

Quote:

Along those lines, there's another approach with a different tradeoff, too. The SP can never be seen to be more than one page above it's true value, so an interrupt handler could be required to summarily increase SP by one page. This removes the PHA/PLA from every non-interrupt function, and adds a SP increment/decrement to the interrupt handler (only if it doesn't already have one).

That actually looks like a much better solution to me.

Quote:

The downside is that 256 bytes are wasted for every interrupt on the call stack, which is the absolute worst case of the PHA/PLA approach.

Which happens to be exactly the space you want to assign to your stack. Code that exhibits its worst-case stack usage during each execution is a dream. Code that only exhibits it every thousand hours, depending on the phase of the moon, is a nightmare to test. So I would actually see this as a benefit. Of course I still think sacrificing 256 bytes of RAM is rarely worth it on a 6502.

mysterymath · Post by **mysterymath** » Sat Aug 28, 2021 5:21 pm

vbc wrote:

That actually looks like a much better solution to me.

The determinism argument is pretty convincing to me; I created https://github.com/llvm-mos/llvm-mos/issues/77 for this work. Probably won't get to it for the initial release, but definitely before we establish backwards compatibility guarantees.

My goal for the default ISR prologue/epilogue is to establish safe defaults that are as easy to use as possible, and improving the determinism of this behavior should help with that. As you've mentioned, there's always an easy out performance wise: you can disable the ISR prologue/epilogue generation and write your own that sets SP to a dedicated stack. This is also an improvement over the way I have things now, since you can't currently turn off the PHA/PLA in function prologues.

The impact of wasting 256 bytes should also be mitigatible in the long run; I've thought off and on about establishing a weird sort of 1-page red zone above the stack pointer, and doing so very carefully would remove the need for ISRs to increase SP at all (the non-atomic SP would always be in a memory region without live values). That's a ton of additional work though, and I haven't designed it end-to-end, just in bits and pieces.

mysterymath · Post by **mysterymath** » Sun Aug 29, 2021 3:20 am

First Benchmark Results!

I removed the malloc from LLVM's copy of Dhrystone v1.1 and got it running.

I also modified our simulator to spin until it takes as much CPU time as an imaginary 10MHz NMOS 6502 would (this time obtained by cycle counting). This allows us to run the tests fairly quickly, while still giving meaningful results for a 1MHz 6502; just divide by 10. We can use LLVM's built in execution time tracking too.

With that configuration, 2,000,000 loops of Dhrystone takes 26.565 sec on our imaginary 10MHz 6502, so it would have taken 265.65 seconds on a 1MHz one, for a final figure of:

EDIT: These figures were wrong by a factor of 10; the correct number of loops was 200,000.

Dhrystone v1.1 is a benchmark known to be extremely sensitive to compiler optimizations; hopefully this will be the case for us as well. High level optimizations have already taken a huge chunk out of the work this benchmark is doing, as seen by the huge jump from O0 to Os.

These numbers may or may not be comparable to any other compiler, but they're definitely comparable to themselves. I'll be back with the new figures once I start picking through the generated assembly.

EDIT: High level optimization might be a bit too effective for this particular benchmark; there's only about 400 lines of assembly left afterwards, and it's all slurped right into main. Everything uses static stack as a result too. Still, there's some pretty bad assembly here, even in this example. Onward!

BigEd · Post by **BigEd** » Sun Aug 29, 2021 11:10 am

Great to see numbers!

Even 240 is pretty good - high 30s were mentioned in this earlier thread.

mysterymath · Post by **mysterymath** » Wed Sep 01, 2021 2:39 am

Well this is embarassing; I missed a zero when counting the number of iterations for the previous post, my figures were inflated by a factor of 10.

The correct figures should be:
At -Os: 753 Dhrystones / second.
At -O0: 24 Dhrystones / second.

Sincerest apologies.

LLVM 6502 Codegen

Re: LLVM 6502 Codegen

Re: LLVM 6502 Codegen

Re: LLVM 6502 Codegen

Re: LLVM 6502 Codegen

Re: LLVM 6502 Codegen

Re: LLVM 6502 Codegen

Re: LLVM 6502 Codegen

Re: LLVM 6502 Codegen

Re: LLVM 6502 Codegen

Re: LLVM 6502 Codegen

Re: LLVM 6502 Codegen

Re: LLVM 6502 Codegen

Re: LLVM 6502 Codegen

Re: LLVM 6502 Codegen

Re: LLVM 6502 Codegen