Cowgol 2.0 compiler for the 6502

Programming the 6502 microprocessor and its relatives in assembly and other languages.
GARTHWILSON
Forum Moderator
Posts: 8775
Joined: 30 Aug 2002
Location: Southern California
Contact:

Re: Cowgol 2.0 compiler for the 6502

Post by GARTHWILSON »

hjalfi wrote:
But the downside is that I can't index anything on the stack [...] I also can't dereference pointers, and access to local variables in outer scopes becomes impossible
Can you give examples of what you mean here? Forth uses a separate data stack, and even on the 6502 it can do all of this. There's no difference between pointers, variables, characters, flags, and other types of cells. It doesn't use typing, which I'm glad for.

resman makes a good point I should have mentioned:
Quote:
Then why not just put the parameters directly into their memory locations instead of pushing/popping the values through the stack intermediary?
There's no need to put the input and output parameters through the page-1 stack.

http://WilsonMinesCo.com/ lots of 6502 resources
The "second front page" is http://wilsonminesco.com/links.html .
What's an additional VIA among friends, anyhow?
hjalfi
Posts: 107
Joined: 12 Oct 2017

Re: Cowgol 2.0 compiler for the 6502

Post by hjalfi »

@resman: Cowgol 1.0 did that! I couldn't make it work. The problem is compiling code like this:

Code: Select all

function1(1, 2, function2(3, 4));
Because function1 and function2 can share the same addresses for their locals, we marshal the 1 and 2 into function1's parameters and then immediately overwrite them with the 3 and 4 for function2's parameters before function1 gets called. The same happens with this:

Code: Select all

function1(1, 2, function1(3, 4))
I could marshal the parameters into temporaries and then copy them into the function parameters just before the call, but if I'm doing that I might as well stack them --- pla and pha are really cheap one-byte instructions. It's worth spending less in the caller (which happens frequently) and instead spending more in the callee (which happens once).
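The clobbering described above can be sketched in C (the slot variables and function bodies are hypothetical stand-ins for Cowgol 1.0's statically-assigned parameter storage):

```c
#include <assert.h>

/* Hypothetical shared parameter slots: function1 and function2 are never
   live at the same time, so a Cowgol 1.0-style linker would let their
   parameter storage overlap.  Nesting one call inside the other's
   argument list breaks that assumption. */
int slot0, slot1;

int function2(void)      { return slot0 * 10 + slot1; }    /* expects 3, 4 */
int function1(int third) { return slot0 + slot1 + third; } /* expects 1, 2 */

/* Naive marshalling for function1(1, 2, function2(3, 4)). */
int demo(void) {
    slot0 = 1; slot1 = 2;      /* function1's first two parameters     */
    slot0 = 3; slot1 = 4;      /* immediately clobbered for function2  */
    int inner = function2();   /* 34, correct                          */
    return function1(inner);   /* sees 3 and 4 instead of 1 and 2      */
}
```

demo() returns 3 + 4 + 34 = 41 rather than the intended 1 + 2 + 34 = 37, which is exactly the bug being described.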

A similar problem happens with return parameters where the values have to be copied to local storage rather than used in-situ in case they get overwritten by a subsequent call to a different function.

If, BTW, anyone has a clever solution then I'm all ears...
GARTHWILSON
Forum Moderator
Posts: 8775
Joined: 30 Aug 2002
Location: Southern California
Contact:

Re: Cowgol 2.0 compiler for the 6502

Post by GARTHWILSON »

Stacks solve those problems. Functions will not step on calling functions' pending parameters. A subroutine's local variables will reside on the stack(s), indexed the same way every time even though the actual addresses may be different each time, depending on how deeply it's nested in the pile of subroutines calling other subroutines. The same concept automatically accommodates recursion.
Chromatix
Posts: 1462
Joined: 21 May 2018

Re: Cowgol 2.0 compiler for the 6502

Post by Chromatix »

Hmm, I get it. The calls to function1 and function2 don't exist in the call stack simultaneously, so their parameters and locals can overlap in storage if they are assigned statically.

But the calling function's locals should be arranged to be disjoint from both called functions'. Those locals include all temporaries, including the temporarily-stored return value of function2. You should then be able to organise the parameters of function1 so that they stay out of function2's space until the latter has returned.

It seems obvious to me that storage assignment should begin with leaf functions and move down from there. Leaf functions have no dependencies, so the choice of local storage is free; you might as well stuff it at one end of the space available. Then calling functions have a free choice of storage except for that used by any of their callees.
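A minimal sketch of that bottom-up assignment, with assumed data structures (recursion is forbidden, so the call graph is a DAG and the walk terminates):

```c
#include <assert.h>

enum { MAXCALLEES = 8 };

typedef struct {
    int size;                 /* bytes of params + locals + temporaries */
    int ncallees;
    int callees[MAXCALLEES];  /* indices of functions this one calls    */
} Func;

/* A function's storage base must sit past the storage of everything it
   can call; leaf functions get base 0, so sibling leaves overlap freely. */
int storage_base(const Func *funcs, int f) {
    int base = 0;
    for (int i = 0; i < funcs[f].ncallees; i++) {
        int c = funcs[f].callees[i];
        int top = storage_base(funcs, c) + funcs[c].size;
        if (top > base) base = top;
    }
    return base;
}
```

With a caller that calls two leaves of sizes 4 and 6, both leaves get base 0 (their storage overlaps, which is safe since they are never live together) and the caller's storage starts at 6, disjoint from both.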
resman
Posts: 154
Joined: 12 Dec 2015
Location: Lake Tahoe
Contact:

Re: Cowgol 2.0 compiler for the 6502

Post by resman »

hjalfi wrote:
@resman: Cowgol 1.0 did that! I couldn't make it work. The problem is compiling code like this:

Code: Select all

function1(1, 2, function2(3, 4));
Because function1 and function2 can share the same addresses for their locals, we marshal the 1 and 2 into function1's parameters and then immediately overwrite them with the 3 and 4 for function2's parameters before function1 gets called. The same happens with this:

Code: Select all

function1(1, 2, function1(3, 4))
Did you mean to make this a recursive call?
hjalfi wrote:
I could marshal the parameters into temporaries and then copy them into the function parameters just before the call, but if I'm doing that I might as well stack them --- pla and pha are really cheap one-byte instructions. It's worth spending less in the caller (which happens frequently) and instead spending more in the callee (which happens once).

A similar problem happens with return parameters where the values have to be copied to local storage rather than used in-situ in case they get overwritten by a subsequent call to a different function.
Ok, but earlier you said this:
hjalfi wrote:
I'm actually doing static variable allocation; by analysing the call graph of the program, I can assign every variable and every expression evaluation temporary an address. If two subroutines are never called at the same time their variables can overlap --- there's never any conflict, and I can prove this at link time. This is the big win of forbidding recursion: many of these old CPUs make stack frames really painful.
So there shouldn't be any overlap with the locals. Otherwise you will have to deal with the overlap of locals regardless of where the parameters are passed.
hjalfi wrote:
If, BTW, anyone has a clever solution then I'm all ears...
Honestly, it sounds like you're working *really* hard to avoid stacks. But there is a reason they are ubiquitous in compiler design - they really solve a lot of problems (refer to Garth's answers). If you are determined not to use stacks, then it's time to sit back and think through your design decisions. I like to go old school and pull out a pad of paper and pencil (and a large eraser) and work through the problems by hand. There are many scenarios to consider, like: How do you plan on dealing with function pointers? Do you really want to overlap local variables to start with?

I would make a set of simple rules to map a function's parameters, locals, and return values. For instance:

Code: Select all

function address + 0: JMP *+sizeof(parameters+return values+locals)
function address + 3: parameters
function address + 3 + sizeof(parameters): return values
function address + 3 + sizeof(parameters + return values) : locals
function address + 3 + sizeof(parameters + return values + locals) : function code
Then you only need to know the address of the function, the number of parameters, and the number of return values. You can easily use function pointers, as all the parameters and return values are at fixed offsets from the function address. It may seem wasteful to start the function with a JMP over its variable storage, but this hides the amount of local storage/temporaries used by the routine. If you notice, this is actually hard-coding the frame stack into the function itself.
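The layout can be mirrored as a C struct for illustration (the field sizes here are hypothetical: two 16-bit parameters, one 16-bit return value, six bytes of locals):

```c
#include <assert.h>
#include <stddef.h>

/* Every field sits at a fixed offset from the function address, so a
   caller holding only a function pointer can find the parameter and
   return-value slots without knowing anything else about the callee. */
typedef struct {
    unsigned char jmp[3];      /* JMP *+sizeof(params+retvals+locals)  */
    unsigned char params[4];   /* function address + 3                 */
    unsigned char retvals[2];  /* function address + 3 + 4             */
    unsigned char locals[6];   /* function address + 3 + 4 + 2         */
    /* function code begins here */
} FuncLayout;
```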

Something to consider.
BillG
Posts: 710
Joined: 12 Mar 2020
Location: North Tejas

Re: Cowgol 2.0 compiler for the 6502

Post by BillG »

resman wrote:
I would make a set of simple rules to map a function's parameters, locals, and return values. For instance:

Code: Select all

function address + 0: JMP *+sizeof(parameters+return values+locals)
function address + 3: parameters
function address + 3 + sizeof(parameters): return values
function address + 3 + sizeof(parameters + return values) : locals
function address + 3 + sizeof(parameters + return values + locals) : function code
Then you only need to know the address of the function, the number of parameters, and the number of return values. You can easily use function pointers, as all the parameters and return values are at fixed offsets from the function address. It may seem wasteful to start the function with a JMP over its variable storage, but this hides the amount of local storage/temporaries used by the routine. If you notice, this is actually hard-coding the frame stack into the function itself.

Something to consider.

Is there a particular reason the entry point must come first? If not, you can do away with that jump...

Code: Select all

"function address": parameters
"function address" + sizeof(parameters): return values
"function address" + sizeof(parameters + return values) : locals
"function address" + sizeof(parameters + return values + locals) : function entry point
resman
Posts: 154
Joined: 12 Dec 2015
Location: Lake Tahoe
Contact:

Re: Cowgol 2.0 compiler for the 6502

Post by resman »

BillG wrote:
resman wrote:
I would make a set of simple rules to map a function's parameters, locals, and return values. For instance:

Code: Select all

function address + 0: JMP *+sizeof(parameters+return values+locals)
function address + 3: parameters
function address + 3 + sizeof(parameters): return values
function address + 3 + sizeof(parameters + return values) : locals
function address + 3 + sizeof(parameters + return values + locals) : function code
Then you only need to know the address of the function, the number of parameters, and the number of return values. You can easily use function pointers, as all the parameters and return values are at fixed offsets from the function address. It may seem wasteful to start the function with a JMP over its variable storage, but this hides the amount of local storage/temporaries used by the routine. If you notice, this is actually hard-coding the frame stack into the function itself.

Something to consider.

Is there a particular reason the entry point must come first? If not, you can do away with that jump...

Code: Select all

"function address": parameters
"function address" + sizeof(parameters): return values
"function address" + sizeof(parameters + return values) : locals
"function address" + sizeof(parameters + return values + locals) : function entry point
Only to hide the local storage. But yeah, an easy optimization would be to order it as such:

Code: Select all

"function address": parameters
"function address" + sizeof(parameters): return values
"function address" + sizeof(parameters + return values) : function entry point
<code>
"function address" + sizeof(parameters + return values + code) : locals
resman
Posts: 154
Joined: 12 Dec 2015
Location: Lake Tahoe
Contact:

Re: Cowgol 2.0 compiler for the 6502

Post by resman »

resman wrote:
hjalfi wrote:
@resman: Cowgol 1.0 did that! I couldn't make it work. The problem is compiling code like this:

Code: Select all

function1(1, 2, function2(3, 4));
Because function1 and function2 can share the same addresses for their locals, we marshal the 1 and 2 into function1's parameters and then immediately overwrite them with the 3 and 4 for function2's parameters before function1 gets called. The same happens with this:

Code: Select all

function1(1, 2, function1(3, 4))
Did you mean to make this a recursive call?
On re-reading this, I see I misinterpreted what you were saying. Even though it isn't quite recursive, it sort of is: the parameters marshalled for the outer call to function1 get overwritten by the inner call to function1. So another idea is to have the caller allocate the parameter/return space. It's less memory efficient, as every call site makes its own space for the parameters and return values. One idea would be to put the parameters/return values directly after the JSR, such as:

Code: Select all

JSR func1
params:
param1 DW 0
param2 DW 0
retvals:
retval1 DW 0
... continue code
The callee would be responsible for advancing the return address to skip the parameters/return values. I'm just trying to avoid pushing and popping the parameters off the H/W stack.

Are you sure you don't want to implement a frame stack? PLASMA limits the per-function frame size to ~250 bytes, so it can easily access the locals with just an 8-bit index register.
hjalfi
Posts: 107
Joined: 12 Oct 2017

Re: Cowgol 2.0 compiler for the 6502

Post by hjalfi »

@resman: Right. It took me ages to see that the first time round. Once you start accessing the subroutine's parameters outside the scope of the subroutine, you're effectively extending the lifetime of the parameters to when you start evaluating the first parameter, rather than just when the jsr to the subroutine happens. Because calls can overlap, this means we have no choice but to deal with cases when we have two different values destined for the same parameter at the same time. So, they have to be marshalled into local storage somehow. Cowgol 1.0 marshalled them into static storage, but this was really expensive --- two or three bytes per byte. The processor stack is much cheaper --- one byte per byte, and less if I can copy stuff in and out with loops. As @GARTHWILSON says, stacks are ideally suited for this kind of problem.

I'm not averse to using stacks --- I just don't want to use stack frames, as they're expensive on 8-bit systems and make it very hard to access variables from an outer scope, which I need to do to make nested subroutines work.

Regarding the problems indexing things on an auxiliary stack: the compiler makes extensive use of the (zp),y and abs,x addressing modes. As pointer variables are statically assigned to zero page, dereferencing them is extremely cheap. If the variables were kept on an auxiliary stack I would need to copy them to a static area of zero page first. This is easy for Forth, because Forth's an interpreter! Likewise, most operations with 32-bit values use loops, looping from y=0 to 3. I get to dereference pointers here for free.

Here's an example of the kind of code I'm producing, scrappily annotated by me:

Code: Select all

sub Expr2Simple(sop: uint8, uop: uint8, lhs: [Node], rhs: [Node]): (result: [Node])
	ResolveUntypedConstantsNeedingNumbers(lhs, rhs);
	var op := uop;
	if IsSNum(lhs.type) != 0 then
		op := sop;
	end if;

	var ltype := lhs.type;
	result := MidC2Op(op, NodeWidth(lhs), lhs, rhs);
	result.type := ltype;
end sub;

Code: Select all

        ; Expr2Simple
f563_Expr2Simple:
        sta ws1+79      ; unmarshal rhs into a variable (passed in xa)
        stx ws1+80
        plx                 ; remove return address from the stack
        pla 
        inx 
        stx c21_0fce   ; poke into return instruction
        sta c21_0fce+1
        ldy #3             ; unmarshal node, uop and sop together
c21_0fcf:
        pla 
        sta ws1+75,y
        dey 
        bpl c21_0fcf

        lda ws1+77 ; marshal lhs onto the stack
        ldx ws1+78
        pha 
        phx 
        lda ws1+79 ; marshal rhs into xa
        ldx ws1+80
        ajsr f556_ResolveUntypedConstantsNeedingNumbers

        lda ws1+76 ; copy uop to sop
        sta ws1+83

        ldy #8
        lda (ws1+77),y ; lhs.type high byte into x
        tax 
        dey 
        lda (ws1+77),y ; lhs.type low byte into a giving xa
        ajsr f238_IsSNum ; returns result in a
        tax                    ; set flag bytes
        beq c21_0fd4

c21_0fd3:

        lda ws1+75      ; op := sop
        sta ws1+83

        bra c21_0fd0

c21_0fd4:

c21_0fd0:

        ldy #8                ; read lhs.type again. Huh, I should move this up
        lda (ws1+77),y
        tax 
        dey 
        lda (ws1+77),y
        sta ws1+84
        stx ws1+85

        lda ws1+83      ; marshal op for call to MidC2Op
        pha 
        lda ws1+77     ; marshal lhs into xa for call to NodeWidth
        ldx ws1+78
        ajsr f178_NodeWidth ; returns result in a
        pha                ; marshal for call to MidC2Op
        lda ws1+77    ; marshal LHS
        ldx ws1+78
        pha 
        phx 
        lda ws1+79    ; marshal RHS
        ldx ws1+80
        ajsr f186_MidC2Op
        sta ws1+81
        stx ws1+82    ; result in xa

        lda ws1+84    ; write ltype to result.type
        ldx ws1+85
        ldy #7
        sta (ws1+81),y
        txa 
        iny 
        sta (ws1+81),y

        lda ws1+81    ; marshal result into xa for return
        ldx ws1+82
c21_0fce=*+1
        jmp $ffff     ; jumps to return address
GARTHWILSON
Forum Moderator
Posts: 8775
Joined: 30 Aug 2002
Location: Southern California
Contact:

Re: Cowgol 2.0 compiler for the 6502

Post by GARTHWILSON »

hjalfi wrote:
I'm not averse to using stacks --- I just don't want to use stack frames, as they're expensive on 8-bit systems and make it very hard to access variables from an outer scope, which I need to do to make nested subroutines work.

On the hardware stack, it's a matter of TSX (done only once, at the beginning of the routine) and then, for example, LDA 1xx,X, where the xx is how far into the stack to index. So ADC 106,X reaches for the sixth byte in the hardware stack and adds it to the accumulator. There's no need for INX and DEX either, as long as LDX #FF, TXS is part of your reset routine. So reaching back into the stack is pretty cheap: ADC abs is four cycles, and so is ADC abs,X as long as you're not crossing a page boundary, which you won't be doing anyway since the stack area on the 6502 never crosses a page boundary. So essentially, stack-frame access is no more expensive than absolute addressing of variables.

So suppose you put four bytes on the stack (or maybe they're left there from previous operations) and then you call your routine that uses them. The 1xx,X addressing can access them without pulling the return address off or copying them anywhere. There's no "marshaling" involved. If you want the routine to return two more bytes of output than it took in as input, the calling routine can do PHA PHA (with so-far irrelevant data) before calling the routine, then the routine doesn't have to juggle the return address either in order to return the right number of bytes. This and lots more is covered in the stacks treatise.

If you use a separate data stack in ZP though, the data will go there instead of on the hardware stack, and again you can use it in place; there's no need to copy it somewhere else. ZP,X indexing does take one cycle more than non-indexed ZP access; so if you're doing a loop where something is accessed hundreds of times and the added execution time really adds up, you can do as Forth does with N, a ZP area of eight or so bytes for local data. N is only used inside primitives, and a primitive is always finished with N's contents by the time it ends. N is never used to pass parameters, or to hold them temporarily while calling something else; so there's never any danger of one routine stepping on data that's still needed by a pending routine. If copying things from the stack to N and back takes less time than doing a lot of indexing in a loop that gets executed many times (like multiplication), then it's worth it.

Quote:
Regarding the problems indexing things on an auxiliary stack: the compiler makes extensive use of the (zp),y and abs,x addressing modes.

(ZP,X) is great too, particularly when a stack cell contains the address of something to access.

Quote:
As pointer variables are statically assigned to zero page, dereferencing them is extremely cheap.

I'm not familiar with the term "dereferencing." What exactly does it mean? Edit: in view of what Chromatix wrote two posts down, it is extremely cheap also to have the address of something on the ZP data stack and access it with a single instruction, with the (ZP,X) addressing mode. Say the cell of interest is at the top of the ZP data stack. To read what's at the address pointed to there, all you need is LDA (0,X). There's no need to move it.

Quote:
If the variables were kept on an auxiliary stack I would need to copy them to a static area of zero page first.

Why? Just use them in place.

Quote:
This is easy for Forth, because Forth's an interpreter!

Forth does not interpret in the sense of looking things up at run time (except for token-threaded Forth, which is rarely used, because of this overhead). In indirect-threaded or direct-threaded code (ITC or DTC) Forth, what gets compiled is a list of addresses of routines, and sometimes there will be inlined data, like literal numbers or strings. In subroutine-threaded code (STC) Forth, it's a list of JSRs, except that routines can be straight-lined to avoid the JSR-RTS overhead if they're short enough or time-critical enough to justify it. For example, to DROP a stack cell, INX INX is four times as fast as JSR DROP, and even saves a byte.
resman
Posts: 154
Joined: 12 Dec 2015
Location: Lake Tahoe
Contact:

Re: Cowgol 2.0 compiler for the 6502

Post by resman »

hjalfi wrote:
@resman: Right. It took me ages to see that the first time round. Once you start accessing the subroutine's parameters outside the scope of the subroutine, you're effectively extending the lifetime of the parameters to when you start evaluating the first parameter, rather than just when the jsr to the subroutine happens. Because calls can overlap, this means we have no choice but to deal with cases when we have two different values destined for the same parameter at the same time. So, they have to be marshalled into local storage somehow. Cowgol 1.0 marshalled them into static storage, but this was really expensive --- two or three bytes per byte. The processor stack is much cheaper --- one byte per byte, and less if I can copy stuff in and out with loops. As @GARTHWILSON says, stacks are ideally suited for this kind of problem.
My thought on this is: if you already forbid recursive calls, then forbid this kind of call sequence too. It's easy enough to declare that calling a function within its own parameter list is semi-recursive. Simply assign the result of function1() to a variable, then use that variable in the next invocation of function1(). Problem solved, and documented as such.
hjalfi wrote:
I'm not averse to using stacks --- I just don't want to use stack frames, as they're expensive on 8-bit systems and make it very hard to access variables from an outer scope, which I need to do to make nested subroutines work.

Regarding the problems indexing things on an auxiliary stack: the compiler makes extensive use of the (zp),y and abs,x addressing modes. As pointer variables are statically assigned to zero page, dereferencing them is extremely cheap. If the variables were kept on an auxiliary stack I would need to copy them to a static area of zero page first. This is easy for Forth, because Forth's an interpreter! Likewise, most operations with 32-bit values use loops, looping from y=0 to 3. I get to dereference pointers here for free.

Here's an example of the kind of code I'm producing, scrappily annotated by me:

..snip
You are literally doing 90% of the work I do in PLASMA to set up and tear down a frame. As for nested subroutines? Easy, nested frame pointers.

I would really commit to one implementation or the other: absolutely no recursion (or semi-recursion), or go for broke with a frame stack. I'm trying not to push the decisions I made with PLASMA too hard, but it solved all the problems you're dealing with and allowed a recursive descent compiler to be easily implemented and self-hosted.
Chromatix
Posts: 1462
Joined: 21 May 2018

Re: Cowgol 2.0 compiler for the 6502

Post by Chromatix »

Quote:
I'm not familiar with the term "dereferencing." What exactly does it mean?
It means to follow the pointer to the data it refers to. In C, the unary * operator is known as the pointer dereference operator, as in:

Code: Select all

void strcpy(char* a, const char* b)
{
    while(*a++ = *b++)
        ;
}
In 6502-speak, it's called an "indirect access", because the 6502 doesn't have an actual address register.
GARTHWILSON
Forum Moderator
Posts: 8775
Joined: 30 Aug 2002
Location: Southern California
Contact:

Re: Cowgol 2.0 compiler for the 6502

Post by GARTHWILSON »

Chromatix wrote:
It means to follow the pointer to the data it refers to.

Thanks. I added an edit to the middle of my post above, showing how easy it is to do that with a ZP data stack. (It's just LDA (0,X) for example, a single instruction.)
dmsc
Posts: 154
Joined: 17 Sep 2018

Re: Cowgol 2.0 compiler for the 6502

Post by dmsc »

Hi!
hjalfi wrote:
http://cowlark.com/cowgol/

...

I estimate the 65c02:Z80 generated-code size ratio to be about 1.1. As the 65c02 compiler compiled for the Z80 is 60210 bytes, that makes the 65c02 compiler compiled for the 65c02 about 66kB. So close... (But of course it also needs some memory to actually run in!)
It is great to see more languages for the 6502, especially ones with a compiler (almost) capable of running on the 6502 :)

Also, I agree that by limiting the use of recursion you make the language potentially faster and more portable, as the usable stack sizes in the 6502 are really small.

About the compiler size: in my FastBasic I opted for a novel approach that fits the whole compiler (plus runtime and full-screen editor) in only 8kB (or 9.5kB with floating-point support). The compiler is programmed in its own interpreted bytecode, a simple "parsing machine".

This is slower than a compiler in assembler, but the parsing machine is really small, as it implements a PEG parser (via recursive descent) in which each rule is only one byte, and the emitted code is also stored in the parsing bytecode.

For example, from https://github.com/dmsc/fastbasic/blob/ ... /basic.syn this is an extract showing how a "terminal expression" is parsed, in the parser grammar:

Code: Select all

T_EXPR: integer constant, variable or function
        emit { TOK_BYTE } E_NUMBER_BYTE
        emit { TOK_NUM } E_NUMBER_WORD
        "-" T_EXPR emit { TOK_NEG }
        "+" T_EXPR
        "NOT" NOT_EXPR emit { TOK_L_NOT }
        ARRAY_WORD_ADDR emit { TOK_DPEEK }
        ARRAY_BYTE_ADDR emit { TOK_PEEK }
        emit { TOK_VAR_LOAD } E_VAR_WORD
        PAR_EXPR
For each rule there is one alternative on each line, and each alternative is composed of all its sub-rules one after the other. So the above means that to parse a "T_EXPR", you generate the VM token "TOK_BYTE" (equivalent to an "LDA #") and then try to parse "E_NUMBER_BYTE" (which parses a number 0-255 and stores it in the code). If this is not possible, the next alternative tries to parse a word (a number from -32768 to 32767); if not, it tries to parse a "-" sign followed by recursively parsing a T_EXPR again, and then emits the code for "TOK_NEG" (negation), etc.

Note that the parser machine will rewind the parser when a rule fails, and all the generated code will be discarded. Also, the "E_*" rules are calls to machine code that implement special functionality (so the number parsing is done in machine language).
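The rewind behaviour can be sketched in C with hypothetical names (only two alternatives of T_EXPR: a single digit, or "-" followed by a recursive T_EXPR; both the input position and the emitted code roll back when an alternative fails):

```c
#include <assert.h>

const char *input;          /* text being parsed       */
int pos;                    /* current parser position */
unsigned char code[64];     /* emitted bytecode        */
int codelen;

enum { TOK_NUM = 1, TOK_NEG = 2 };

void emit(unsigned char tok) { code[codelen++] = tok; }

int t_expr(void) {
    int p = pos, cl = codelen;            /* save state for rewind  */
    if (input[pos] >= '0' && input[pos] <= '9') {   /* alt 1: digit */
        emit(TOK_NUM); emit(input[pos] - '0'); pos++;
        return 1;
    }
    pos = p; codelen = cl;                /* rewind                 */
    if (input[pos] == '-') {              /* alt 2: "-" T_EXPR      */
        pos++;
        if (t_expr()) { emit(TOK_NEG); return 1; }
    }
    pos = p; codelen = cl;                /* rewind, rule fails     */
    return 0;
}
```

Parsing "-7" emits TOK_NUM 7 then TOK_NEG; parsing "x" fails and leaves both the position and the emitted code untouched, which is the discard-on-failure behaviour described above.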

After the parsing is complete, the generated code can be run as-is (in the interpreter), but the compiler does a peephole pass that replaces a lot of the generated expressions with simpler ones, for example, for an array access the parser generates:

Code: Select all

# Code for   B = A(X+1)
 VAR_LOAD A
 PUSH
 VAR_LOAD X
 PUSH
 LOAD_BYTE  #1
 ADD
 USHL
 ADD
 DPEEK
 VAR_STORE B
 
After the peephole, this is optimized to:

Code: Select all

 LOAD_1
 ADD_VAR X
 USHL
 ADD_VAR A
 DPEEK
 VAR_STORE B
As you see, all stack operations (PUSH and ADD) were eliminated, and the code is now much smaller and faster. The good thing is that the peephole pass does not need to be performed at the same time as the parsing, since the original program is not needed for the optimization.
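A one-rule sketch of this kind of peephole pass (token names and encoding are hypothetical, not FastBasic's): in an accumulator-plus-stack VM, "PUSH; LOAD_BYTE n; ADD" computes acc + n, so the three instructions collapse into a single "ADD_BYTE n":

```c
#include <assert.h>

enum { PUSH = 1, LOAD_BYTE = 2, ADD = 3, ADD_BYTE = 4 };

/* Rewrites code[] in place and returns the new length.  A real pass
   would track operand widths so an operand byte is never mistaken for
   an opcode; this sketch keeps the scan deliberately simple. */
int peephole(unsigned char *code, int len) {
    int out = 0;
    for (int i = 0; i < len; ) {
        if (i + 3 < len && code[i] == PUSH && code[i + 1] == LOAD_BYTE
                        && code[i + 3] == ADD) {
            code[out++] = ADD_BYTE;
            code[out++] = code[i + 2];   /* keep the immediate operand */
            i += 4;
        } else {
            code[out++] = code[i++];
        }
    }
    return out;
}
```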

I'm now slowly writing a new pass to transform the bytecode to 6502 assembler. This simply replaces each bytecode with equivalent 6502 assembler; the above would turn into:

Code: Select all

 LDA #1
 LDX #0

 CLC
 ADC VAR_X
 PHA
 TXA
 ADC VAR_X+1
 TAX
 PLA

 ASL
 PHA
 TXA
 ASL
 TAX
 PLA

 CLC
 ADC VAR_A
 PHA
 TXA
 ADC VAR_A+1
 TAX
 PLA

 STA  PTR
 STX  PTR+1
 LDY  #1
 LDA (PTR), Y
 TAX
 DEY
 LDA (PTR), Y

 STA VAR_B
 STX VAR_B+1
So, the generated code is still very bad, but it's at least a lot faster than the bytecode interpreter :)

Have Fun!
hjalfi
Posts: 107
Joined: 12 Oct 2017

Re: Cowgol 2.0 compiler for the 6502

Post by hjalfi »

@GARTHWILSON: Oops, I'd completely forgotten that the (zp,x) addressing mode existed! Yes, indeed, that makes accessing bytes via pointers on a data stack trivial. However, I still believe that (unless there's something else I've missed) accessing indexed data requires either copying the pointer to a fixed location in zero page, or doing pointer arithmetic: there's no (zp,x,y) or (zp,x),y addressing mode. So, for example, 32-bit arithmetic becomes very expensive. The same problem applies to accessing values directly on the hardware stack with abs,x.

BTW, I have implemented Forths...

@resman: can you provide an example of some PLASMA stack-frame code? (BTW, the PLASMA VM should be a really easy Cowgol target.)

It's important to remember, BTW, that this Cowgol is not specific to the 6502; it's intended for other architectures as well. The 6502 is good at indexing, but the 8080 and Z80 really aren't, and my language abstraction has to work there too. Statically assigning variables is important for that.

@dmsc: 9.5kB is really impressive, especially with floating point! I've only briefly played with PEG and packrat parsers; my impression was that they required memoisation to work effectively, which is hard on a small system. Are you doing anything here or just accepting the cost? How are you handling errors?

I'm using the Lemon parser generator, which is a traditional shift-reduce parser. The automaton is pretty small (465 bytes) and the compiled grammar is in about 4kB of tables. Each rule has a subroutine which does typechecking, code emission etc in code. I'm not at all averse to replacing it. The biggest killer is that I don't have function pointers or jump tables, so the rule dispatch table is enormously wasteful --- about 1.5kB. The code generator has another one. I have just come up with what I think is a workable abstraction over function pointers, so I'll have to try implementing it and seeing if that helps.
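For what it's worth, one common way to get compact rule dispatch without function pointers (a generic sketch, not Cowgol's actual abstraction) is to number the rules densely and dispatch through a single switch, which a compiler can lower to a jump table instead of a long comparison chain:

```c
#include <assert.h>

enum { RULE_EXPR_ADD, RULE_EXPR_SUB, RULE_COUNT };

/* Stand-in for the per-rule typechecking/emission actions; each case
   plays the role of one rule subroutine. */
int reduce(int rule, int lhs, int rhs) {
    switch (rule) {
    case RULE_EXPR_ADD: return lhs + rhs;
    case RULE_EXPR_SUB: return lhs - rhs;
    default:            return 0;
    }
}
```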