6502.org Forum

All times are UTC

 Post subject: 6502 HLL call convention
PostPosted: Sat Apr 13, 2013 7:19 pm 
Joined: Sat Mar 27, 2010 7:50 pm
Posts: 149
Location: Chexbres, VD, Switzerland
I think one of the primary reasons CC65 produces such terrible code is that its calling convention is horrible to begin with. All local variables and call parameters are placed on a "software stack", which is addressed with (),Y instructions.

Does anyone see any problem with my proposal for a new convention (to be implemented on top of whatever compiler; it is just an idea right now)?


Code:
6502 general purpose calling convention.

The arguments or return values are first reordered as a list of bytes.

The first byte is in A, the second byte in Y, the third byte in X.
No function is ever expected to preserve A, X, or Y, even if they are not used to pass arguments or return values.

If there are still more bytes, only THEN are they passed on the stack, and that means the true hardware stack.
Arguments are pushed before the return address, by the caller. It is also the job of the called function to pop them.
If return values must be passed on the stack, the caller should push dummy bytes when the return value has more bytes than the arguments.

In summary, a caller program should push
max (0, arguments_byte_count - 3, return_value_byte_count - 3)
bytes when calling a function.
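
The byte-count rule above can be written down as a small C helper. This is only a sketch: the name `stack_bytes` and its parameters are made up for illustration and are not taken from any existing compiler.

```c
#include <stddef.h>

/* Number of bytes carried in registers under the proposed convention (A, Y, X). */
#define REG_BYTES 3

/* Hypothetical helper: how many bytes the caller pushes before the JSR.
 * Implements max(0, arg_bytes - 3, ret_bytes - 3). */
size_t stack_bytes(size_t arg_bytes, size_t ret_bytes)
{
    size_t from_args = arg_bytes > REG_BYTES ? arg_bytes - REG_BYTES : 0;
    size_t from_ret  = ret_bytes > REG_BYTES ? ret_bytes - REG_BYTES : 0;
    return from_args > from_ret ? from_args : from_ret;
}
```

For a function taking two 16-bit arguments and returning nothing, `stack_bytes(4, 0)` gives 1, which is the single byte that does not fit in A, Y and X.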

Bytes 4 and up are accessed this way:

tsx
lda $103,X   ; Reads argument/return byte 4
sta $104,X   ; Writes argument/return byte 5
etc....

For example :
void function1(u16 a, u16 b)
The low part of a is in A, the high part of a in Y, the low part of b in X and the high part of it on the stack

void function2(u16 a, u8 b)
The low part of a is in A, the high part of a in Y and b is in X

void function3(u8 b, u16 a)
b is in A, the low part of a in Y and the high part of a in X
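
The three examples can be checked mechanically with a small C model of the assignment rule. This is a sketch under the assumption that multi-byte arguments are flattened low byte first, as in the examples; the type `byte_loc` and both function names are hypothetical.

```c
#include <stddef.h>

/* Where one byte of the flattened argument list ends up. */
typedef enum { LOC_A, LOC_Y, LOC_X, LOC_STACK } byte_loc;

/* Hypothetical helper: location of the byte at zero-based position `index`
 * in the flattened, low-byte-first argument list. */
byte_loc arg_byte_location(size_t index)
{
    static const byte_loc regs[3] = { LOC_A, LOC_Y, LOC_X };
    return index < 3 ? regs[index] : LOC_STACK;
}

/* Fill `out` with the location of every byte, given the size in bytes of
 * each argument in `sizes`. Returns the total byte count.
 * `out` must have room for the sum of all sizes. */
size_t assign_locations(const size_t *sizes, size_t nargs, byte_loc *out)
{
    size_t n = 0;
    for (size_t i = 0; i < nargs; i++)
        for (size_t b = 0; b < sizes[i]; b++, n++)
            out[n] = arg_byte_location(n);
    return n;
}
```

With sizes {2, 2} (function1) the four bytes land in A, Y, X and the stack; with sizes {1, 2} (function3) everything fits in registers.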

The following optimisations are impossible for non-static functions, as they could be written in a different language and compiled by a different compiler at a different time.

== optimisation 1 (static functions only) ==
Of course, the code generated for function2 could be more efficient than that for function3 (or vice versa), even though the two are otherwise strictly identical, purely because of the calling convention.

If the function is static, the compiler could try internally remapping its arguments to see whether the function becomes smaller or faster, and keep that version if so.


== optimisation 2 ==
If the compiler detects that one of the arguments or return values is actually just a bool, then that value is passed in one of the processor status flags instead. Other (non-bool) arguments are passed as if the bool argument had been removed from the list.
Exactly which flag is used can vary. C and V are more "general purpose" (affected by few instructions), but Z or N could be good for passing conditions like (variable == 0) or (variable < 0) respectively.

== optimisation 3 ==
If a function can manage to leave the X, Y or A register completely untouched, the calling function can be made "aware" of it and rely on this, so it avoids saving its data in temporary storage.

== optimisation 4 ==
(Available only for arguments, not for return values.) If there are more than 3 bytes of arguments, but *all* of them are "constant", i.e. in every call the arguments evaluate to plain numeric constants, the arguments do not have to be passed on the stack.
Instead they can be included directly in the code like this:

jsr call_a_fun_with_lots_of_args
.db $8   ;Number of bytes of arguments
.dw funct
.dw $04, $05, $06, $ffff    ; A lot of arguments

This particular technique is slow but saves some precious bytes. It would only be used when optimisation for small code size is necessary.
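
What the dispatcher has to work out from the inline data can be modelled in C. This is a rough sketch, not the proposal's actual routine: it assumes the layout shown above (count byte, 16-bit target, then the argument bytes) and relies on the standard 6502 behaviour that JSR pushes the address of its own last byte and RTS adds 1 to the address it pulls. The struct `inline_call` and the function `decode_inline_call` are hypothetical names.

```c
#include <stdint.h>

/* Result of decoding the data that follows the JSR. */
typedef struct {
    uint8_t  nbytes;      /* number of argument bytes */
    uint16_t target;      /* address of the function to call */
    uint16_t args;        /* address of the first argument byte */
    uint16_t new_return;  /* value to push back so RTS resumes after the data */
} inline_call;

/* `mem` models the 64 KiB address space; `pushed_return` is the address JSR
 * left on the stack (the last byte of the JSR instruction itself). */
inline_call decode_inline_call(const uint8_t *mem, uint16_t pushed_return)
{
    inline_call c;
    uint16_t p = (uint16_t)(pushed_return + 1);   /* first data byte */
    c.nbytes = mem[p];
    c.target = (uint16_t)(mem[(uint16_t)(p + 1)] | (mem[(uint16_t)(p + 2)] << 8));
    c.args = (uint16_t)(p + 3);
    /* RTS adds 1 to the pulled address, so point at the last data byte. */
    c.new_return = (uint16_t)(c.args + c.nbytes - 1);
    return c;
}
```

The extra work of pulling, adjusting and re-pushing the return address on every call is where the slowness mentioned above comes from.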

== optimisation 4b (non 6502 specific) ==
If an argument not only reduces to a numeric constant, but the number is also the same in every call, the argument is removed from the list and hard-coded into the function.

== optimisation 5 (non 6502 specific) ==
If a function is only called once, then it has to be inlined.


About types, I'd say that 8, 16 and 24-bit types should be supported natively. 32-bit types should not be supported natively; that way a returned "long" can still fit in the 3 registers and be somewhat efficient. Only returning structs could potentially use the stack for returns. It's just my opinion though, it does not have to be that way.

Sure this is quite complex, but complexity is the key to efficiency (in this case).


PostPosted: Tue Apr 16, 2013 8:35 am
Joined: Tue Jul 24, 2012 2:27 am
Posts: 672
Where would stack frames live? The CPU stack is quite small. You're already putting parameter bytes there, would you still have a 16-bit user stack elsewhere for the rest?

When writing C programs, are you careful to use 8-bit variable sizes as much as possible? Most I've seen use ints more than anything else, plus many passed variables tend to be pointers. So effectively anything above 1 parameter is going to be using the stack anyway, and that's most functions.

Seeing as you're going to do some processing with those parameters, and thus need the registers to do work, the first thing that would often have to happen is creating some temp space in the stack frame and storing the register-passed parameters there regardless. Note that the 1st parameter passed isn't necessarily the first parameter that needs to be operated on.

Loading registers might be difficult. If you're going to stuff A/X/Y with bytes read via indexed addressing modes, in what order are you going to load them? You're going to have to shuffle around with temp space or pha/pla in order to actually use X/Y indexing to perform the register loading. Consider that you want to call foo(*ptr1, *ptr2, *ptr3), all with byte pointers. You're probably going to have to dereference them into temporary space before starting to load registers or stack. It'd be equal or cheaper, and simpler for the compiler, to throw them onto a user stack as they're dereferenced.


Really, the primary reason cc65's code is so terrible is that the C way of doing this is not the 6502 way of doing things.

_________________
WFDis Interactive 6502 Disassembler
AcheronVM: A Reconfigurable 16-bit Virtual CPU for the 6502 Microprocessor


PostPosted: Tue Apr 16, 2013 10:31 am
Joined: Thu Mar 03, 2011 5:56 pm
Posts: 277
David A. Wheeler has some thoughts about 6502 code generation that may be relevant to this discussion: http://www.dwheeler.com/6502/


PostPosted: Tue Apr 16, 2013 10:36 am
Joined: Tue Nov 16, 2010 8:00 am
Posts: 2353
Location: Gouda, The Netherlands
The Keil compiler for the 8051 does a reasonably good job at producing code for a device that was never meant for C. One of the things they do is a complete analysis of all functions, and try to put most of the local variables in global memory. Only for recursive functions do they actually put parameters/locals on a stack.


PostPosted: Tue Apr 16, 2013 11:33 am
Joined: Tue Jul 24, 2012 2:27 am
Posts: 672
Arlet: Yeah, I was going to bring it up, but needed to consider it a bit more. A potentially major downside is that it takes up some of the static footprint of the program (though you could see which routines are mutually exclusive on the call stack and overlap their use). A stack-based system only uses frame memory that is actually in use at the time.
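
The overlapping idea can be sketched in C as a small allocator over the call graph. This is an illustration only, not a description of what Keil actually does: it assumes a recursion-free call graph known in full, and the names `overlay_frames` and `MAX_FUNCS` are invented here.

```c
#include <stddef.h>

#define MAX_FUNCS 16

/* Hypothetical overlay allocator for a recursion-free call graph.
 * calls[c][f] != 0 means function c calls function f.
 * size[f] is the number of bytes of parameters and locals f needs.
 * On return, base[f] is f's offset in the shared static area and the
 * return value is the total size of that area. */
size_t overlay_frames(size_t n, unsigned char calls[][MAX_FUNCS],
                      const size_t *size, size_t *base)
{
    for (size_t f = 0; f < n; f++)
        base[f] = 0;

    /* Relax n times: a frame must start above every caller's frame.
     * In an acyclic graph the longest call chain has at most n functions,
     * so n passes are enough to reach a fixed point. */
    for (size_t pass = 0; pass < n; pass++)
        for (size_t c = 0; c < n; c++)
            for (size_t f = 0; f < n; f++)
                if (calls[c][f] && base[f] < base[c] + size[c])
                    base[f] = base[c] + size[c];

    size_t total = 0;
    for (size_t f = 0; f < n; f++)
        if (total < base[f] + size[f])
            total = base[f] + size[f];
    return total;
}
```

Two functions that are never on the call stack together can end up at the same base, so the static area only needs to be as large as the deepest call chain rather than the sum of all frames.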

The one benefit this method would have is that it'd free up an index register, which is very useful. The overhead of copying in/out of a static area can still be pretty comparable to doing indexed copies of the same for a user space stack, so I don't know how much real gain is to be had in terms of performance.



PostPosted: Tue Apr 16, 2013 5:42 pm
Joined: Thu May 28, 2009 9:46 pm
Posts: 8155
Location: Midwestern USA
rwiker wrote:
David A. Wheeler has some thoughts about 6502 code generation that may be relevant to this discussion: http://www.dwheeler.com/6502/

From Wheeler:

    This page has some information on implementing computer languages (beyond ordinary assembly language) on the extremely old and obsolete 6502 chips.

"...extremely old and obsolete..."? Guess he doesn't know about the 65C02 and its pervasiveness in the embedded world. :D

Kidding aside, Wheeler's thinking is definitely a little eccentric in some ways, but sound in others (I completely agree with his assessment of the Aztec C compiler's clumsy output). Any C compiler for the 65C02 can be made to generate reasonably efficient machine code if the hardware stack is confined to its traditional roles of return address and register storage, and a separate software stack is used for parameter passing. The overhead of maintaining the latter is inescapable with the 65C02, since it lacks any specialized hardware stack addressing features and doesn't allow stack relocation. The resulting code would be fairly bulky once library routines have been linked in, but would perform tolerably with the WDC 'C02, since it can be run at a high Ø2 rate.

The picture considerably brightens with the 65C816, thanks to its set of stack addressing instructions, relocatable direct (zero) page and stack, and 16 bit loads and stores. Although one could port an existing 65C02 compiler to the '816, a compiler that has been scratch-developed to run on the '816 in native mode would perform all parameter passing via the hardware stack and completely eliminate software stack overhead. Instructions such as PEA, PEI and PER, which can push parameters to the stack without clobbering the registers, are fast and efficient. Such features allow the compiler to produce more succinct code and in the process, consume less memory during run time. Using PER, the compiler could be made to emit fully relocatable code, assuming any linked-in libraries are similarly written. I've written a branch-to-subroutine macro for the Kowalski simulator that takes advantage of the BRL and PER instructions to simulate the Motorola 68K BSR instruction.

Arlet wrote:
The Keil compiler for the 8051 does a reasonably good job at producing code for a device that was never meant for C. One of the things they do is a complete analysis of all functions, and try to put most of the local variables in global memory. Only for recursive functions do they actually put parameters/locals on a stack.

That wouldn't be necessary with the 65C816. It almost seems in retrospect that when Bill Mensch was deciding what features to add to the '816 he had C in mind.

_________________
x86?  We ain't got no x86.  We don't NEED no stinking x86!


PostPosted: Tue Apr 16, 2013 10:02 pm
Joined: Fri Aug 30, 2002 1:09 am
Posts: 8428
Location: Southern California
rwiker wrote:
David A. Wheeler has some thoughts about 6502 code generation that may be relevant to this discussion: http://www.dwheeler.com/6502/

On that web page, David Wheeler writes,
Quote:
6502.org lists lots of assemblers. One they've omitted (why?) is Ophis, a 6502 assembler written in Python in 2006-2007; looks nice, though its syntax is a little different.

David--if you're reading this: Ophis will be added shortly. [Edit: done.]

_________________
http://WilsonMinesCo.com/ lots of 6502 resources
The "second front page" is http://wilsonminesco.com/links.html .
What's an additional VIA among friends, anyhow?


PostPosted: Tue Apr 16, 2013 10:38 pm
Joined: Sat Dec 13, 2003 3:37 pm
Posts: 1004
Well, as I mentioned in another thread, the language Action! relied on each procedure having static areas for the parameters and local variables.

The advantage is that your "stack frame" is basically a return address.

The downside is that there is no recursion, and obviously linking will be a little more difficult. So, only static images. It also expands the memory size of the final image, since it must make room for all of the arguments within each procedure. It would encourage global state to save memory. The return result would be in a shared, static space. No reason to allocate the result for each function. You just have to copy your answer out after each function call.

But it IS fast, and simple, and you could even add an optimization pragma for the compiler to put params for certain functions in zero page.

Code:
; allocate block for potential results
result: .ds 16
;
; int myfunc(int a, int b) {
;     int c;
;     c = a + b;
;     return c;
; }

myfunc_a: .ds 2
myfunc_b: .ds 2
myfunc_c: .ds 2
myfunc:
    CLC
    LDA myfunc_a
    ADC myfunc_b
    STA myfunc_c
    LDA myfunc_a + 1
    ADC myfunc_b + 1
    STA myfunc_c + 1
    LDA myfunc_c
    STA result
    LDA myfunc_c + 1
    STA result + 1
    RTS
;
; void main() {
;     int a;
;     a = myfunc(1, 2);
; }
main_a: .ds 2
main:

    LDA #1
    STA myfunc_a
    LDA #0
    STA myfunc_a + 1
    LDA #2
    STA myfunc_b
    LDA #0
    STA myfunc_b + 1
    JSR myfunc
    LDA result
    STA main_a
    LDA result + 1
    STA main_a + 1
    RTS
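
Why static parameter areas rule out recursion can be shown with a small C model. This is only an illustrative sketch: the static variables stand in for per-function slots like myfunc_a and result in the listing above, and all names (`fact_n`, `fact_static`, `call_fact_static`, `fact_stack`) are made up.

```c
/* Static "frame" for the factorial: one parameter slot and one shared
 * result slot, in the spirit of myfunc_a / result in the listing above. */
static unsigned fact_n;
static unsigned result;

/* Models "return n * fact(n - 1)" compiled against static slots. Storing the
 * argument for the inner call overwrites this level's own parameter. */
void fact_static(void)
{
    if (fact_n <= 1) {
        result = 1;
        return;
    }
    fact_n = fact_n - 1;        /* argument for the inner call, same slot */
    fact_static();
    result = fact_n * result;   /* reads the clobbered slot, not the original n */
}

/* Caller side: store the argument, call, copy the answer out. */
unsigned call_fact_static(unsigned n)
{
    fact_n = n;
    fact_static();
    return result;
}

/* Ordinary stack-based recursion for comparison. */
unsigned fact_stack(unsigned n)
{
    return n <= 1 ? 1 : n * fact_stack(n - 1);
}
```

`fact_stack(4)` returns 24, while `call_fact_static(4)` returns 1, because every level ends up multiplying by the innermost value left in the shared slot.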


PostPosted: Wed Apr 17, 2013 5:17 am
Joined: Tue Nov 16, 2010 8:00 am
Posts: 2353
Location: Gouda, The Netherlands
White Flame wrote:
Arlet: Yeah, I was going to bring it up, but needed to consider it a bit more. A potentially major downside is that it takes up some of the static footprint of the program (though you could see which routines are mutually exclusive on the call stack and overlap their use). A stack-based system only uses frame memory that is actually in use at the time.

Static footprint should be the same, assuming you can fully overlap. Whatever you use for static memory, you save on the stack. So, if you put all your local parameters at $0100, and let the stack grow down from $01FF, it should still fit.

Of course, it would be smarter to put frequently used variables in zero page instead.


PostPosted: Wed Apr 17, 2013 11:23 am
Joined: Sat Mar 27, 2010 7:50 pm
Posts: 149
Location: Chexbres, VD, Switzerland
Quote:
The Keil compiler for the 8051 does a reasonably good job at producing code for a device that was never meant for C.

Any serious C compiler targeting the 6502 (by "serious" I mean one that aims to produce reasonable code, less than 2x worse than hand-written assembly) should OF COURSE do this. I didn't mention it because I thought it would be obvious. There is no way a "stack"-based scheme can be implemented efficiently on the 6502, whether it's a software or a hardware stack. However, if you have to choose between the real stack and a software one, the hardware one is a little more efficient, because you can use PHA/PLA to access the top level, and "lda $100,X" style addressing is only required to address subsequent bytes. Addressing a software stack with lda (),Y is an absolutely TERRIBLE solution (performance-wise, I mean).

Unfortunately there are obstacles to such an analysis: function pointers and separate compilation.
Imagine a call through a table of 32 different functions.

In theory the compiler cannot assume they are mutually exclusive, so all 32 functions would have to use different locations for temp storage and so on. This would waste a LOT of RAM.
In practice, they are mutually exclusive, and you'd like the compiler to use the same locations for the temporary storage.

Quote:
When writing C programs, are you careful to use 8-bit variable sizes as much as possible? Most I've seen use ints more than anything else, plus many passed variables tend to be pointers. So effectively anything above 1 parameter is going to be using the stack anyway, and that's most functions.

The size of arguments is the programmer's responsibility. A compiler could detect cases where an "int" can obviously be narrowed to an unsigned char, but it still wouldn't be able to do so in all cases (and in some cases it would actually introduce dangerous bugs if the data was really meant to be an int).
This could also be solved by making "int" 8 bits, which breaks the C standard but makes the code more intuitive. "int" should typically be the size of the processor's native registers, in our case 8 bits.

About the part where most passed variables tend to be pointers, it's up to programming style. The question to ask is: if you wanted to pass 3 pointers in assembly language, how would you do it? The answer should then be built into the compiler as much as possible. In my case, I think I would do it the optimisation 4 way, getting rid of the stack passing. If the pointers are results of a calculation, then sadly it no longer works.

Quote:
Really, the primary reason cc65's code is so terrible is that the C way of doing this is not the 6502 way of doing things.

Honestly I don't know. C was designed in the 70s, and so was the 6502.
Just because nobody has made a (free) really optimising 6502 compiler yet doesn't mean it's impossible. It just means it's not a trivial thing to do.
Chances are that the commercial compilers made during the 80s were of decent quality. However, they are probably lost. Or is there a cracked version of one of them available that could still run somehow today?

I am all ears if you can suggest another high-level language that is better suited to the 6502. I did not really mention C, but since it's the most used language ever, especially for embedded platforms, it would be hard to think of something else.

EDIT: What I have in mind is something close to David A. Wheeler's "3rd solution".
However, the optimisations mentioned would only work for calls that don't use function pointers and that stay within the same .c file. When calling a function through a pointer, or an "external" function, some kind of global convention is required, and in this case I think stack usage is "required" if you want to stay C compatible. Think of the case where you call a function in another .c file, which calls back the same function. There is recursion, but no way to detect it when compiling a single file at a time.


PostPosted: Wed Apr 17, 2013 5:56 pm
Joined: Fri Aug 30, 2002 1:09 am
Posts: 8428
Location: Southern California
Bregalad wrote:
However, if you have to choose between the real stack and a software one, the hardware one is a little more efficient, because you can use PHA/PLA to access the top level, and "lda $100,X" style addressing is only required to address subsequent bytes. Addressing a software stack with lda (),Y is an absolutely TERRIBLE solution (performance-wise, I mean).

LDA ZP,X (as used all the time in Forth) takes 4 clocks, the same as PLA, one more clock than LDA ZP. There is of course the INX or DEX (which take 2 clocks each) to adjust the stack pointer which is register X, but you usually have quite a few operations before or after the INX or DEX. (ZP),Y, which takes one more clock than ZP,X, does not get used as much. (ZP,X) gets used more than (ZP),Y, and takes one more clock than (ZP),Y but uses the X value that's already there instead of taking extra time to load an index register.
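
To put numbers on the comparison, here is a small C table of the cycle counts quoted above plus a helper that adds up a short instruction sequence. It is a sketch: the counts are the standard NMOS 6502 base timings without the one-cycle page-crossing penalty, and the identifiers (`CYC_...`, `sequence_cycles`) are invented for this example.

```c
#include <stddef.h>

/* Base cycle counts on an NMOS 6502, no page-crossing penalty. */
enum {
    CYC_LDA_ZP    = 3,  /* LDA zp                        */
    CYC_LDA_ZP_X  = 4,  /* LDA zp,X                      */
    CYC_LDA_ABS_X = 4,  /* LDA abs,X (e.g. LDA $103,X)   */
    CYC_LDA_IND_Y = 5,  /* LDA (zp),Y                    */
    CYC_LDA_X_IND = 6,  /* LDA (zp,X)                    */
    CYC_PLA       = 4,
    CYC_TSX       = 2,
    CYC_LDY_IMM   = 2,  /* LDY #n                        */
    CYC_INX       = 2
};

/* Sum the cycles of a short instruction sequence. */
unsigned sequence_cycles(const unsigned *cycles, size_t count)
{
    unsigned total = 0;
    for (size_t i = 0; i < count; i++)
        total += cycles[i];
    return total;
}
```

Fetching one parameter byte costs 6 cycles as TSX plus LDA $103,X, 7 cycles as LDY #n plus LDA (zp),Y, and 4 cycles as LDA zp,X when X already holds the data stack pointer. The TSX or LDY cost is shared when several bytes are read in a row.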



PostPosted: Wed Apr 17, 2013 6:17 pm
Joined: Tue Nov 16, 2010 8:00 am
Posts: 2353
Location: Gouda, The Netherlands
Bregalad wrote:
However, the optimisations mentioned would only work for calls that don't use function pointers and that stay within the same .c file. When calling a function through a pointer, or an "external" function, some kind of global convention is required, and in this case I think stack usage is "required" if you want to stay C compatible. Think of the case where you call a function in another .c file, which calls back the same function. There is recursion, but no way to detect it when compiling a single file at a time.

Without global analysis, the compiler isn't going to be able to generate very good code. I'm pretty sure the Keil 8051 compiler does that. Global analysis not only allows you to store most locals in static memory, but also allows call convention to be optimized per function.

Of course, it's a huge project.


PostPosted: Wed Apr 17, 2013 6:41 pm
Joined: Sat Mar 27, 2010 7:50 pm
Posts: 149
Location: Chexbres, VD, Switzerland
Quote:
Without global analysis, the compiler isn't going to be able to generate very good code. I'm pretty sure the Keil 8051 compiler does that. Global analysis not only allows you to store most locals in static memory, but also allows call convention to be optimized per function.

I just checked the documentation of SDCC, which is basically a GNU version of the 8051 compiler (it was ported to multiple platforms, but originated on the 8051).
It says that when you compile multiple files, main.c has to be compiled last, and I imagine this is so that it can do the global analysis when the main function is compiled, and apply all the radical optimisations (elimination of the stack) to the other functions.

Yes, the calling convention can't be avoided when a function pointer is used or when a bridge is made to another language (typically assembly, but in theory you could mix multiple HLLs in a project too, if another compiler using the same convention were made).

Quote:
Of course, it's a huge project.

Of course. In fact I am seriously thinking about attempting either a 6502 port of SDCC or a huge unofficial improvement of CC65 (or anything else that is able to generate efficient 6502 code automatically from a HLL) as my master's project. But it's just an idea; I'll have to find a professor who agrees to it, which won't be easy considering the numerous applications of the 6502 today :roll:

Because there is no way I could do something like this in my free time.

