LLVM 6502 Codegen

mysterymath · Post by **mysterymath** » Sun Aug 01, 2021 5:40 pm

It sounds like the suggestion is to implement more or less the "normal" C interrupt model when the user tags a function as an interrupt: save the (very small number of) caller-saved registers, and store everything in the interrupt handler on the stack. It'd still be desirable to have the main program (i.e., the part not reachable from the interrupt handler) use static stacks and other optimizations. If that's incorrect, let me know.

That's essentially what the "interrupt" attribute does. First, it makes all functions use less zero page so an interrupt handler can be timely. Second, it does a call graph analysis, just to figure out which parts of the program belong to an interrupt handler and which parts don't.

If it didn't make functions use less zero page, then the ISR save sequence would push 256 bytes of zero page to the hard stack and crash. Or I'd have to rewrite it so it pushed 256 bytes of zero page to the soft stack. The entire zero page could plausibly contain live values at the time of the interrupt, and any C function that the interrupt handler calls could plausibly overwrite all of it. We could just carte-blanche limit the amount of zero page that the compiler's allowed to use, whether or not there are interrupts. Or we could reserve dedicated regions of zero page for interrupt routines; that's already in the works.

EDIT: I missed a few posts above; BigEd made a suggestion to have only the interrupt handlers use less zero page. This would work, and I was originally going to do this, but a subtle problem sprang up. We use zero page registers to pass and return arguments, so doing this would amount to having a different calling convention for functions inside and outside interrupt handlers. Trouble is, you can stick them both inside a function pointer, pass them around arbitrarily, and it becomes impossible to tell what calling convention you should use to call the pointee. Without a perfect pointer analysis or user hand-annotation, there doesn't seem to be any way to support two silently-different calling conventions in C. Hence, if there's a shift in calling convention due to the presence of interrupts, it needs to be program-wide.
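A minimal sketch of the function pointer problem described above. The function names are hypothetical and chosen only for illustration; the point is that one indirect call site has to serve both callees.

```c
#include <assert.h>

/* Hypothetical functions; the names are illustrative only. */
typedef int (*handler_fn)(int);

int scale_main(int x) { return x * 2; }   /* only reached from the main program */
int scale_isr(int x)  { return x + 1; }   /* also reached from an interrupt handler */

/* The choice is made at run time, so the compiler cannot tell which
   function a given pointer value refers to. */
handler_fn pick(int from_isr) {
    return from_isr ? scale_isr : scale_main;
}

/* This single indirect call site must work for both callees, which is
   only possible if both follow the same calling convention. */
int dispatch(int from_isr, int x) {
    handler_fn f = pick(from_isr);
    assert(f != 0);
    return f(x);
}
```

If `scale_isr` took its argument in a different set of zero page registers than `scale_main`, the call inside `dispatch` would have no way to know where to put `x`.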

EDIT2: Of course, where the identities of callees are statically known, the calling convention could differ wildly. I'm planning on taking advantage of this later...

If it didn't do the call graph analysis, we'd either have to require the user to tag each individual function as part of an interrupt handler or not, or we'd have to globally disable static stacks, even for the main part of the program. Otherwise there's no way to know which parts of the program can safely use static stacks, and which part can't.

MichaelM wrote:

(1) what is being handled in the function, (2) what is the maximum rate at which the function can be vectored to, (3) what happens if a second event queues before the function completes, etc.

The only information our optimizations need at present is a conservative estimate of which functions can be active at the same time. This is because we can do the static stack optimization if and only if at most one invocation of the same function can ever be active.
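To see why a static stack requires at most one active invocation, here is a small sketch in plain C. It is not compiler output: a `static` local stands in for a local that has been given a fixed memory address, and recursion stands in for any form of re-entry, such as an interrupt handler calling a function that was already running.

```c
#include <assert.h>

/* Reentrant version: each active invocation gets its own copy of local. */
int sum_to_dynamic(int n) {
    assert(n >= 0);
    int local = n;
    if (n > 0) {
        int rest = sum_to_dynamic(n - 1);
        return local + rest;
    }
    return 0;
}

/* "Static stack" version: local lives at one fixed address that is
   shared by every invocation. */
int sum_to_static(int n) {
    static int local;
    assert(n >= 0);
    local = n;
    if (n > 0) {
        int rest = sum_to_static(n - 1);  /* the inner call overwrites local */
        return local + rest;              /* local no longer holds n here */
    }
    return 0;
}
```

With these definitions `sum_to_dynamic(3)` returns 6, while `sum_to_static(3)` returns 0, because the innermost call leaves 0 in the shared location before any outer call reads it back.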

The only effect (2) and (3) have on that question is whether an interrupt handler can be called while another invocation of that same interrupt handler is active. (We assume that different interrupt handlers can interrupt each other.) We allow users to make that determination based on their knowledge of their interrupt handlers: if that can't happen (which is expected to be the usual case), they can annotate it with "interrupt_norecurse." If they can't or don't want to verify this property, they can say "interrupt", which is always safe, but might be slower.
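A sketch of how the two annotations might look in source. The attribute names "interrupt" and "interrupt_norecurse" come from the post; the GNU-style `__attribute__((...))` spelling, the `__mos__` target macro, and the handler and variable names are assumptions made for this example. On any other compiler the macros expand to nothing, so the file still builds as ordinary C.

```c
#include <assert.h>

/* Assumption: GNU-style attribute syntax and a __mos__ predefined macro.
   Elsewhere the annotations compile away. */
#if defined(__mos__)
#define ISR_MAY_NEST __attribute__((interrupt))
#define ISR_NO_NEST  __attribute__((interrupt_norecurse))
#else
#define ISR_MAY_NEST
#define ISR_NO_NEST
#endif

/* Hypothetical state shared with the main program. */
volatile unsigned char tick_count;
volatile unsigned char error_count;

/* The author has checked that this handler can never interrupt itself,
   so the cheaper annotation is used. */
ISR_NO_NEST void timer_isr(void) {
    tick_count = (unsigned char)(tick_count + 1);
}

/* Not checked, so the always-safe annotation is used. */
ISR_MAY_NEST void serial_isr(void) {
    error_count = (unsigned char)(error_count + 1);
}
```

How the handlers get installed into the hardware vectors is platform-specific and is left out here.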

BigDumbDinosaur · Post by **BigDumbDinosaur** » Sun Aug 01, 2021 9:33 pm

As I earlier said, you are stumbling over why the 6502 has never been a good compiler target.

Regarding interrupt handling, a basic rule of any ISR running on a 6502 is it should be small and succinct. "Small and succinct" doesn't describe the machine code generated by a typical C compiler. A fat and slow ISR may not be able to meet deadlines for servicing multiple interrupt sources, leading to inadvertent reentrancy. Reentrancy during interrupt processing on the 6502 can quickly exhaust resources and crash the machine due to stack wrap (been there, done that, long ago).

I'm with Garth and Mike: you are over-complicating it. If an ISR written in C needs use of some zero page, that space should be statically allocated to the ISR only—foreground tasks should never touch it. Frankly, I think attempting to write an ISR in C is not a good idea. There are some things that are best done in assembly language on a 6502, and this is one of them.

mysterymath · Post by **mysterymath** » Sun Aug 01, 2021 10:02 pm

BigDumbDinosaur wrote:

"Small and succinct" doesn't describe the machine code generated by a typical C compiler.

Interrupt handlers are regularly written in C for non-6502 targets. While small and succinct doesn't describe the code produced by the current smattering of 6502 compilers, I'm not convinced that the laws of physics and of thought require that it cannot be so. If I did, I wouldn't have started the project.

Like many hackers, I'm interested in producing the tool that I myself would like to use. I don't actually particularly like writing 6502 assembly by hand; it feels tedious and tiresome to me. On the other hand, I do have great sentimental attachment to 6502 systems; I did all of my programming on them in my youth in a high level language (BASIC). I never really dabbled in "USR statement magic" back then. For me, what I want out of a C compiler is "the best possible BASIC", one that compiles down to efficient, clean 6502 assembly, and where you can write essentially anything that you'd write in 6502 assembly, albeit maybe not with the absolute perfection attainable by hand-optimizing each byte of the 64K addressable space. I also like the challenge of doing something "everyone knows is impossible." Even if I don't end up getting the compiler as efficient as I'd like, I want to walk right up against the line of what I can do, and I'm nowhere close to that point yet.

But for those who would never dream of writing an ISR in C, nothing in the compiler requires you to. The compiler already has a flag to control the number of zero page registers it uses, so it's already fairly easy to reserve a region for exclusive use by assembly-language interrupt handlers. It wouldn't be possible for such handlers to safely call C routines without using the interrupt annotations to inform the compiler that this can occur. But for those who prefer to write ISRs exclusively in assembly, this is no loss at all.

barrym95838 · Post by **barrym95838** » Mon Aug 02, 2021 12:01 am

mysterymath wrote:

Interrupt handlers are regularly written in C for non-6502 targets. While small and succinct doesn't describe the code produced by the current smattering of 6502 compilers, I'm not convinced that the laws of physics and of thought require that it cannot be so. If I did, I wouldn't have started the project.

I am almost certain that I'm not alone in applauding your efforts, regardless of the degree to which they eventually succeed. Your consistent "can-do" attitude impresses me, and I wish you only the best.

P.S. Judicious use of the #pragma directive might be a useful path for certain interrupt stack optimizations that are 6502-specific.

GARTHWILSON · Post by **GARTHWILSON** » Mon Aug 02, 2021 5:27 am

mysterymath wrote:

I also like the challenge of doing something "everyone knows is impossible."

There have been a few times when I could see a way to do something, so when everyone said it was impossible, it served all the more to motivate me. (So in a sense, they did me a favor to tell me it couldn't be done.) That's not to say that what I had in mind was always a good idea, as there were times I really was unknowingly aiming to do something the hard way; but one good one I particularly remember was my zero-overhead interrupt service in high-level Forth which I subsequently wrote up and got published in the Jul/Aug '94 issue of Forth Dimensions magazine. Responses published later in "Letters to the Editor" were very much ones of praise.

I suspect that a C compiler could be written that would be smart enough to understand the execution goal and completely rework the approach to fit the way the 6502 does things, and optimize accordingly, but that it just hasn't been done because there's not the necessary market to justify the many man-years it would take to do it. I do suspect it's possible though.

OTOH, not every use requires maximum execution speed or memory compactness. For example, I still use my very slow but otherwise very capable HP-41cx hand-held computer for certain things. (Those not familiar with it might say it only qualifies as a calculator, not a computer; but it does have boolean functions, a file system, text capability and a text editor, insane (albeit slow) I/O capability to control and take data from lots of lab instruments at once (plus the more-mundane things like monitor, mass storage, printers, etc.), alarms, assembler, and more.)

Quote:

Like many hackers, I'm interested in producing the tool that I myself would like to use. On the other hand, I do have great sentimental attachment to 6502 systems

This kind of thing has greatly enriched the 65 world.

Quote:

I don't actually particularly like writing 6502 assembly by hand; it feels tedious and tiresome to me.

Most assembly-language code I see on the forum or on web pages is totally lacking any visible structure and is quite a chore to follow. Assembly language doesn't have to be that way though. Macros to the rescue, including to make program flow-control structures like HLLs have. It's what my article is about, at http://wilsonminesco.com/StructureMacros/ . (Veterans here have seen this before.) There are a couple of more-extended examples in the last 40% of the page on multitasking, at http://wilsonminesco.com/multitask/, showing nested IF...ELSE...END_IF's, BEGIN...UNTIL, CASE, etc. In most cases, the macros will assemble exactly the same thing you would by hand, meaning there's no penalty in run speed or memory taken; it's just that now you don't have to look at the ugly internal details every time, and your code becomes much more readable, maintainable, and bug-free, and the programmer becomes much more productive. There's more description of how the macros figure out where to branch to and how they keep from confusing the various targets, in the related chapter of the 6502 stacks treatise, at http://wilsonminesco.com/stacks/pgmstruc.html . Near the end of the page, you can mouse over the various words in the CASE structure code example, and a box will come up telling what the macro does there, and what's on the assembler's own stack at that point (not the target 6502's stack, which is not involved).

BigEd · Post by **BigEd** » Mon Aug 02, 2021 8:12 am

Yes, mysterymath, you go ahead - there are people here interested and encouraging. Negative commentary from people who don't know the subject matter is, I think, of limited value. One can watch and learn.

sark02 · Post by **sark02** » Thu Aug 05, 2021 7:42 pm

This is an awesome project. Very impressive.
It's good to have the option of using 'C' to write interrupt handlers (assuming it's within the performance constraints for the project). Not all engineers that might need to maintain the project may be as comfortable with assembly. In my experience, very few are.
It's a bit weird for the presence of an interrupt handler to globally affect the compile. I wonder how that affects libraries (which would presumably be built in isolation without knowledge of whether the final application would use an interrupt handler). Presumably you have good reasons to do it that way, though.

mysterymath · Post by **mysterymath** » Thu Aug 05, 2021 8:15 pm

That's a really good point about libraries; in particular, it broke separable compilation. This is probably a bit too severe a divergence from the usual C runtime model; you wouldn't expect that a module that uses interrupts and one that doesn't could have different calling conventions.

I'll have to go in and make the calling convention switch a compiler flag. Then, it'd be an error to use interrupts without the flag, or to compile together different modules with different flag values. It'd essentially be a target feature that affects the calling convention (I think that's not unheard of.)

I was trying to avoid the extra step, but I can emit a good error message, and it's probably worth letting folks know that using C interrupts comes with some additional program-wide requirements.

BigEd · Post by **BigEd** » Fri Aug 06, 2021 12:17 pm

Could you perhaps explain (again) why you need a separate mode? It might be that a fresh explanation will help. It still feels to me like there shouldn't be a need, which is to say, it shouldn't be too complicated not to have a separate mode.

mysterymath · Post by **mysterymath** » Fri Aug 06, 2021 4:11 pm

Ah, I'm realizing there's a lot of context that goes into a decision like this, so I'll try to provide a more thorough description of the problem space.

TL;DR: After writing all this out, I'm realizing that it's probably better to use the interrupt calling convention basically all the time. It'll be much easier to understand the calling convention, and we can add optimizations to get that to be nearly as efficient as the current calling convention.

Say you have a 6502 floating in the void. It has 256 zero page locations.

Say there's two programmers, Alice and Bob, and each is going to write one subroutine: A, and B, respectively.
At various points, the subroutine A will call the subroutine B.

The question is, how can Alice and Bob share the same zero page?

We'd like to maintain the "separate compilation" guarantee that C implicitly provides: that Alice and Bob can do their work completely separately from one another, sharing only the barest descriptions of their exported functions.

Accordingly, the usual 6502 assembly language answer, "divvy up the zero page between Alice and Bob," doesn't work in this context. If Alice is doing all the work, that works fine; this is akin to link-time optimization. But that should be optional; otherwise the compiler will spit out bad code if you use it the wrong way.

So what other answers are there? We can say that both Alice and Bob are free to use as much zero page as they want for anything they want. But then when A calls B, any values that A is currently storing in the zero page might be overwritten by B. So A is responsible for saving these locations to the stack before the call to B, then restoring them after. This is the "caller-saved" convention, where each register is a "scratch" register.

We could also say that all routines need to put the zero page exactly back the way they found it. Then A doesn't have to worry about calling B; all its zero-page values will still be alright after the call. However, both A and B would be responsible for saving all zero page registers they touch to the stack in their "prologue", right as they're entered, and restoring all zero page registers in their "epilogue", right before they return. This is the "callee-saved" convention.

Both conventions have their advantages and disadvantages.

Scratch registers are free if the values A uses aren't needed after the call to B anyway, which is actually surprisingly common, especially if they're used to compute arguments for the B call. However, if a value needs to be live across a huge number of calls, it sucks to save and restore the value around each use.

Conversely, callee-saved registers HAVE to be saved if they're used even once! But you only ever have to do so once! Once you've saved one, that register becomes completely free for a subroutine to use, and you don't have to worry about subroutine calls clobbering it. So it's perfect for the values that are terrible for scratch registers!
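The two conventions can be sketched with a simulated zero page. This is a host-side model, not 6502 code: `zp`, `soft_stack`, and the register assignments `SCRATCH` and `PRESERVED` are hypothetical names introduced for the example, while `A` and `B` are the subroutines from the post.

```c
#include <assert.h>
#include <stdint.h>

uint8_t zp[256];            /* simulated zero page */
uint8_t soft_stack[64];     /* simulated save area */
unsigned sp;                /* next free slot in soft_stack */

void push(uint8_t v) { assert(sp < sizeof soft_stack); soft_stack[sp++] = v; }
uint8_t pop(void)    { assert(sp > 0); return soft_stack[--sp]; }

/* Hypothetical assignments: one caller-saved and one callee-saved location. */
enum { SCRATCH = 0x10, PRESERVED = 0x20 };

/* B may clobber SCRATCH freely, but must save and restore PRESERVED
   because it uses it. */
void B(void) {
    push(zp[PRESERVED]);            /* prologue */
    zp[SCRATCH] = 0xAA;
    zp[PRESERVED] = 0xBB;
    zp[PRESERVED] = pop();          /* epilogue */
}

/* A keeps one live value in each kind of register across the call to B. */
uint8_t A(void) {
    push(zp[PRESERVED]);            /* prologue: A uses PRESERVED too */
    zp[SCRATCH] = 3;
    zp[PRESERVED] = 4;
    push(zp[SCRATCH]);              /* caller saves its live scratch value */
    B();
    zp[SCRATCH] = pop();            /* and restores it after the call */
    uint8_t result = (uint8_t)(zp[SCRATCH] + zp[PRESERVED]);
    zp[PRESERVED] = pop();          /* epilogue */
    return result;
}
```

`A` pays for `SCRATCH` around every call it makes, but pays for `PRESERVED` only once, in its prologue and epilogue, no matter how many calls sit in between.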

Each zero page location can be assigned a different convention, so long as the whole program agrees on what's being used. Typically, this is defined in stone in some ABI document, so there's not much risk of disagreements.

Each convention is good at different things, so to emit performant code in this context, you need a mix of conventions.

For llvm-mos, I noticed that the fastest way to save and restore zero page registers is on the hard stack, not the soft stack. But there would need to be a really tight cap on how many we save to the hard stack: it's really tiny! So, I capped the number of callee-saved registers to 4. This means each function could have up to six bytes (two bytes for the return address) on its hard stack frame (and an arbitrary number of bytes on its soft stack frame).

The rest would naturally be caller-saved. When LLVM saves caller-saved values, it uses a totally different mechanism than for callee-saved registers, so it's actually kinda hard to get them on the hard stack. So I just let LLVM save them to the soft stack like it wants to.

This model is nice, neat, and clean (IMO), and it works great in a world where there are no interrupts. Which is the world llvm-mos was living in until very recently.

Say Carol writes an interrupt handler. We can think of an interrupt handler as a function that might "be called" from absolutely any line in any function in the whole program. It's possible to reason at a much finer degree of detail than this, but that's not necessarily possible in a world of separate compilation, since Carol might not be able to see the definitions of any of the functions they interrupt.

Because of this, the caller-save convention simply doesn't work for interrupt handlers. The caller is some random line of some function, so the caller can't possibly be able to handle the responsibility of saving all of its live values. Thus, that responsibility must fall completely on the interrupt handler.

Thus, the interrupt handler must additionally save all caller-saved registers it uses. Worse, if it calls a C function outside of Carol's module, then it has to save EVERY caller-saved register, since it has no idea which or how many that function clobbers! But there are up to 250 of them! It's not practical to expect an interrupt handler to actually do this.
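A host-side sketch of the cost involved. It takes the 256-byte zero page and the cap of 4 callee-saved registers at face value, which gives 252 caller-saved bytes, in line with the rough "up to 250" figure above. The layout (caller-saved bytes first) and all names are assumptions made for the example.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

enum { ZP_SIZE = 256, CALLEE_SAVED = 4, CALLER_SAVED = ZP_SIZE - CALLEE_SAVED };

uint8_t zp[ZP_SIZE];       /* simulated zero page; caller-saved bytes come first */
unsigned bytes_saved;      /* save traffic, counted for illustration */

/* A C function from another module: the handler cannot see which
   caller-saved locations it clobbers, so it must assume all of them. */
void opaque_callee(void) {
    memset(zp, 0xFF, CALLER_SAVED);
}

/* Simulated interrupt handler. The interrupted code never had a chance to
   save its live values, so the handler saves them on its behalf. */
void isr(void (*callee)(void)) {
    uint8_t saved[CALLER_SAVED];
    assert(callee != 0);
    memcpy(saved, zp, CALLER_SAVED);
    bytes_saved += CALLER_SAVED;
    callee();
    memcpy(zp, saved, CALLER_SAVED);
}
```

Every entry to `isr` copies 252 bytes out and back in, regardless of how little the callee actually touches, which is the overhead the post calls impractical.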

So that's the problem space. There's a huge number of potential solutions.

One obvious one is to switch the convention: make most registers callee saved. This would limit the number of locations that an interrupt handler would absolutely have to save; instead, functions would save and restore only the registers they actually happen to use, on the fly. We'd still only be able to save 4 callee-saved registers to the hard stack; the rest would need to be saved to the soft stack. LLVM is currently really really callee-saved-register-happy, so until I can get it to avoid them a bit more, all functions would begin with a huge prologue that saves a ton of zero page registers to the stack. The body of the function would afterwards be pretty clean though; it'd operate almost entirely in the zero page. And the register allocator will still minimize the number of zero page locations a register can use without degrading performance.

The option that I had chosen was to do the above switch, but *only* in the case where interrupts are occurring somewhere.

One of the reasons I wanted to type this out is to formalize my own thinking on the topic.

Accordingly, it's increasingly sounding to me that it'd probably be a better idea to just make most registers callee-saved, all the time. We can improve the way LLVM makes the decision of whether to use the soft stack or callee-saved registers. We can also use whole-program analysis to reserve zero page registers to functions; this would remove the need for those functions to save and restore them, but their bodies would be otherwise unchanged. Since this calling convention would put most of the "reentrancy ugliness" in the prologue and epilogue, optimizations that shrink the prologue and epilogue would bring the function asymptotically closer to neat clean 6502 assembly.

BigEd · Post by **BigEd** » Fri Aug 06, 2021 4:50 pm

Thanks for the explanation!

Just thinking aloud...

I think my inclination is to cut the pie in a different place: something has to give, and separate compilation seems like the right thing - because it's difficult to preserve zero page locations, because it's costly, and because 6502 code is never so large that whole-program compilation is a big cost. Also because distributing libraries in source form seems like a good idea.

Having said which, I like a fully-featured OS, and arguably an OS should be something we can augment with a resident library, in which case we need an ABI, and we need a scheme for allocating zero page. (We'd always expect the OS to have an allocation in zero page.)

mysterymath · Post by **mysterymath** » Fri Aug 06, 2021 6:03 pm

Alas, even whole program optimization doesn't save us here.

First, because you still need to be able to write functions in assembly language. Given self-modifying code, a compiler can't safely reason about the zero page usage of such functions. So the calling convention also provides the set of guarantees that assembly routines need to adhere to in order to be callable from C.

Determining where exactly a function pointer points is similarly difficult. If we can't do it, then we need to have a standardized set of assumptions the caller can rely on the callee to adhere to.

The practical upshot of this is that you can freely write code that swaps out behavior behind the scenes. OS syscall vectors are often like this; the basic interface is standardized, but the implementation differs by device type. In C this would be done with function pointers. The only difference is that in C you can take the address of any function, whereas in asm calling conventions typically aren't universally standardized.

In practice, llvm-mos will probably end up splitting functions into a standardized wrapper around an ideal body. If you know which function you're calling, you'll call the body, otherwise, you'll call the wrapper. But this convention really only matters for the wrappers (which is all we have at the moment.)
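A sketch of the wrapper/body split, written as ordinary C over a simulated zero page. This is only a model of the idea; the names `triple`, `triple_body`, and `TMP` are hypothetical, and nothing here reflects how llvm-mos actually emits such pairs.

```c
#include <assert.h>
#include <stdint.h>

uint8_t zp[256];           /* simulated zero page */
enum { TMP = 0x30 };       /* hypothetical zero page register used by the body */

/* "Ideal body": uses TMP freely and makes no promise to preserve it.
   Only callers that know exactly which function they call may use it. */
uint8_t triple_body(uint8_t x) {
    zp[TMP] = x;
    return (uint8_t)(zp[TMP] + zp[TMP] + zp[TMP]);
}

/* Standardized wrapper: honors the program-wide convention (here TMP is
   treated as callee-saved) and then delegates to the body. */
uint8_t triple(uint8_t x) {
    uint8_t saved = zp[TMP];
    uint8_t r = triple_body(x);
    zp[TMP] = saved;
    return r;
}

/* An indirect call cannot know its callee, so pointers refer to wrappers. */
uint8_t call_indirect(uint8_t (*f)(uint8_t), uint8_t x) {
    assert(f != 0);
    return f(x);
}
```

A direct caller that has accounted for `TMP` being clobbered can call `triple_body` and skip the save and restore; anything that goes through a pointer gets `triple`.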

GARTHWILSON · Post by **GARTHWILSON** » Fri Aug 06, 2021 7:55 pm

Think of ZP as 256 bytes of processor registers, more than many higher-scale processors have. You can however have a data stack in ZP, which is what Forth does (keeping the ZP data stack separate from the return stack which is the page-1 hardware stack), which is why we can service interrupts in high-level Forth without any overhead, and even have nested interrupts, ie, that a more-urgent interrupt with quicker service can cut in on the servicing of a less-urgent interrupt that may take longer to service. There's no need to save and restore anything, unless global variables are going to be stepped on. Stacks by nature are re-entrant. I go into these things in depth in my 6502 stacks treatise.

That said, Jonathan Halliday has written a very impressive multitasking OS with GUI for 8-bit Atari computers. You can see it at http://atari8.co.uk/gui/ . It supports up to 16 simultaneous processes, including duplicates, IOW, you can have multiple instances of the same one running at the same time. I asked him about how they share the stack space which could potentially get pretty tight if you have a lot of these running at once, and he said it looks for which tasks have been dormant the longest and swaps out their stack space to use for something that needs it more urgently. That way he's not copying out portions of stack space every time there's a task switch. You can see in the video there that the responsiveness is excellent even at 1.7MHz.

mysterymath · Post by **mysterymath** » Fri Aug 06, 2021 8:08 pm

If you squint at it the right way, a ZP data stack is another type of contract between caller and callee.

It says: no function is allowed to touch the zero page before this dynamic top of stack pointer. Functions are totally free to write to any region below the top of stack pointer. Functions must bump the pointer to protect themselves against interrupts.

The downside is of course that all accesses are relative to a dynamic location, so they require an indexed addressing mode. Still better than a full 16-bit soft stack, but not as good as direct addressing. It places additional pressure on the X register, too. But, I'll have to think about it. Worth noting that you could reserve all zp regs above a certain point as a data stack, so long as it were contiguous (which we don't currently require our zero page to be).
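The contract can be modeled on a host machine as follows. The sketch assumes a stack that grows downward from the top of the zero page, with `top` playing the role of the X register; that direction, and every name below, is an assumption made for illustration.

```c
#include <assert.h>
#include <stdint.h>

enum { ZP_SIZE = 256 };
uint8_t zp[ZP_SIZE];          /* simulated zero page */
unsigned top = ZP_SIZE;       /* stands in for X; the stack grows downward */

/* Reserve n bytes by bumping the pointer. Everything at or above top is
   now owned by some active function. */
unsigned frame_enter(unsigned n) {
    assert(n <= top);
    top -= n;
    return top;               /* base of the new frame */
}

void frame_leave(unsigned n) {
    assert(top + n <= ZP_SIZE);
    top += n;
}

/* Simulated interrupt handler: it takes its own frame, so it cannot
   clobber the frames of whatever it interrupted. */
void isr(void) {
    unsigned base = frame_enter(2);
    zp[base] = 0xEE;          /* on a 6502 this would be an indexed store such as STA 0,X */
    zp[base + 1] = 0xEE;
    frame_leave(2);
}

/* A function holding two live bytes while an interrupt fires. */
uint8_t add_with_interrupt(uint8_t a, uint8_t b) {
    unsigned base = frame_enter(2);
    zp[base] = a;
    zp[base + 1] = b;
    isr();                    /* the interrupt arrives mid-function */
    uint8_t r = (uint8_t)(zp[base] + zp[base + 1]);
    frame_leave(2);
    return r;
}
```

Nothing is saved or restored around `isr`, which matches the Forth experience described earlier in the thread; the price is that every access goes through `base`, the indexed addressing cost mentioned above.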

barrym95838 · Post by **barrym95838** » Fri Aug 06, 2021 8:31 pm

Yeah, maintaining the index plus the extra cycle every time you index certainly can add up, but your compiler should be much simpler than one that has to keep track of the dynamic allocations for every conceivable use case. Trade-offs, trade-offs ...
