Cowgol 2.0 compiler for the 6502

hjalfi · Post by **hjalfi** » Sun Jun 07, 2020 1:55 pm

Yup, been there, referred to it extensively, but they don't suggest a top-down loop. What they've got is an unrolled top-down comparison. Because the `eor #0x80` for the sign adjustment trashes the carry then reworking it as a loop isn't trivial. What cunning tricks have I missed?

resman · Post by **resman** » Sun Jun 07, 2020 2:46 pm

Since the backend is already table driven, why not use the indexes into the table as your byte code (intermediate representation)? Writing an interpreter should be almost trivial and you would greatly improve the code density for the 6502, and probably the other 8 bit targets to some degree.

When I first wrote the PLASMA compiler I had it generate in-line 6502, subroutine threaded code, and byte code. The in-line 6502 code was 10x faster and about that much larger than the byte code. The code size for the native generation really limited the kinds of programs that could be written. Conversely, the byte code opened up programs that wouldn't have been worthwhile, otherwise. Don't forget the need to include data space requirements along with your code size. The PLASMA compiler is less than 24K of byte code, leaving enough data space on a 64K machine to compile a reasonably sized module. It wouldn't have been possible if the PLASMA compiler was compiled as native 6502 code.

By spending a lot of time improving the byte code inner interpreter, the latest version of PLASMA is about on-par with the original subroutine threaded code (some improved byte code architecture helped, too). And with the native JIT compiler, it can even exceed the performance of the original native 6502 code in some cases while still being able to run programs that would be too big if entirely natively compiled. Not to mention PLASMA can pick the best interpreter/JIT compiler at runtime for the given CPU vs whatever target was compiled to.

Chromatix · Post by **Chromatix** » Sun Jun 07, 2020 2:51 pm

My approach to this would be to compare the most-significant byte for equality. If unequal then you employ a signed comparison based only on that byte. If equal then you proceed to the next significant byte, and employ an unsigned comparison.

I would also unroll the loop unless code size is crucial, simply because loop overhead is quite high on the 6502; not only because of the cycles and bytes directly consumed, but the scarce registers and status flags it clobbers in the process.

Code: Select all

@highByte:
  SEC
  LDA lhs+3
  SBC rhs+3
  BEQ @secondByte
  BVC *+4    ; signed less-than
  EOR #$80
  BPL @false
  BRA @true
@secondByte:
  LDA lhs+2
  CMP rhs+2
  BNE :+
  LDA lhs+1
  CMP rhs+1
  BNE :+
  LDA lhs+0
  CMP rhs+0
: BCS @false    ; unsigned less-than
@true:

Unfortunately this is nearly twice as big as your routine, but certainly much faster. I think you would want to make this a subroutine - or into an advertisement for the value of VMs on the 6502.

hjalfi · Post by **hjalfi** » Sun Jun 07, 2020 6:39 pm

I got the binary down to 57900 bytes! That's actually small enough to run, and I compiled and linked my first program natively! Of course, I don't have an assembler, but it's the principle that counts...

Right now the compiler has a whole 4kB of heap to work in. I don't think that's enough for anything useful, but I don't really have a handle on how much memory it really needs at the moment, and there's plenty of things I can do to reduce memory usage.

Also, a 57900 byte program takes a really long time to load from disk.

BigEd · Post by **BigEd** » Sun Jun 07, 2020 7:16 pm

Hurrah!

Dr Jefyll · Post by **Dr Jefyll** » Mon Jun 08, 2020 1:17 am

hjalfi wrote:

...which is gruesome. Can anyone improve it?

For what it's worth, the snippet you posted can be shortened by eliminating the use of X. Y can act as both the index and the iteration counter. Y would begin with a non-zero value and the loop would exit when the increment of Y wraps it "up" (wrap-around) to 0.

Of course to compensate you'll need to adjust the base addresses used by the lda lhs, y and sbc rhs, y instructions. But you'll save 3 bytes, and maybe even eliminate the need to save/restore X.

Is this sort of trick too clever to bother with? There may be other minor tweaks you could do.

-- Jeff

hjalfi · Post by **hjalfi** » Mon Jun 08, 2020 8:32 am

@Jeff: huh. I actually knew about that one (I'm using it elsewhere). I wonder why I'm not using it here? The encroaching madness, probably... (I can't use that trick in all cases, because it only works if the neither of the operands are indexed via a pointer, but that hardly ever happens.)

That saved a whole 22 bytes in the tiny compiler! Turns out 32-bit magnitude comparisons don't happen often...

Anyway, nice catch. Thanks.

BTW, while we're talking about microoptimisations:

Parameters are passed on the processor stack, both input and output. Cowgol maintains a flat stack across subroutines (no recursion means that each subroutine can appear on the call stack either zero or once, no more). On entry to a subroutine, parameters are popped off the stack and copied into local variables, and on exit they're copied out of the local variables and pushed onto the stack. This means that the top of each subroutine needs to remove the return address from the stack and stash it somewhere.

What I'm currently doing is:

Code: Select all

  pla
  tax
  pla
  tay
  inx
  bne *+3
  iny
  stx rts_instruction+1
  sty rts_instruction+2
  ...subroutine body here...
rts_instruction:
  jmp 0xffff

Yes, self-modifying code! This seems to be cheaper than trying to pop the address, store it somewhere, and push it again, even including the increment needed because jsr actually pushes the return address minus one. I do skip all this completely if the subroutine doesn't use parameters or passes parameters in registers, because then I can just exit with rts, but there's still a lot of instances of this in the compiler. Every byte counts.

I don't think I can shave any more bytes off this as it stands, but are there any alternative approaches I should consider?

drogon · Post by **drogon** » Mon Jun 08, 2020 8:35 am

hjalfi wrote:

BTW, while we're talking about microoptimisations:

Beware (or be aware) of this:

https://projects.drogon.net/ruby816-pre ... -all-that/

-Gordon

Chromatix · Post by **Chromatix** » Mon Jun 08, 2020 8:42 am

If you're satisfied with targeting the 65C02 rather than retaining NMOS compatibility, the following is smaller and faster:

Code: Select all

  PLX
  PLY
  INX
  STX @return+1
  BNE :+
  INY
: STY @return+2
  ...
@return
  JMP $FFFF

But also, you could analyse your maximum stack depth (since you have no recursion) and decide that you have enough stack space that you don't need any of the above, so you can just use RTS as normal. Or you could ensure that JSR instructions never end on a page boundary, so you never need to increment the upper byte of the return address; this adds one byte of overhead to 1/256 JSRs, but saves several bytes from every subroutine body.

BillG · Post by **BillG** » Mon Jun 08, 2020 2:48 pm

drogon wrote:

BigEd wrote:

Gosh, quite a hill to climb.

I find myself conflicted: it feels right to desire a compiler that targets native code, and a compiler which is composed of native code. And yet, it might be that the instruction set we have doesn't quite have the power (or density) and so that leads to the idea of a virtual machine and a layer of interpretation.

And yet for the BBC Micro at least, there exists many other programming languages that are "self hosting" apart from Basic - Comal, Pascal, Lisp, Logo, Forth, and a few C's and I'm sure others, but all these were written in 6502 assembler and probably quite limited. The only other system I know where the compiler is written in itself and runs on a Beeb is BCPL - however that compiles to an extremely compact bytecode which is then interpreted (and it's only a 16-bit system on the Beeb)

So much as I personally like the idea of a self-hosting system I suspect for a language such as this, that you might end up making more sacrifices that you want to and it remains in the realms of cross-compiling for now.

Of-course there is always the '816 ...

There were a number of "native" language compilers on CP/M:

* BASIC, COBOL, FORTRAN from Microsoft
* CBASIC, PL/I, Pascal MT+ from Digital Research
* BDS C
* Pascal/Z
* Turbo Pascal from Borland

These were all well respected at the time and none were what I would call restricted.

Unless running on vintage hardware, many users today are enjoying faster hardware or even faster emulation. Mass storage is a quantum leap from the slow and small floppy disk.

Turbo Pascal is an excellent example of what can be done with limited resources, both on CP/M and DOS. I don't know why the Macintosh version did not do well.

BillG · Post by **BillG** » Mon Jun 08, 2020 2:53 pm

hjalfi wrote:

BTW, while we're talking about microoptimisations:

Parameters are passed on the processor stack, both input and output. Cowgol maintains a flat stack across subroutines (no recursion means that each subroutine can appear on the call stack either zero or once, no more). On entry to a subroutine, parameters are popped off the stack and copied into local variables, and on exit they're copied out of the local variables and pushed onto the stack. This means that the top of each subroutine needs to remove the return address from the stack and stash it somewhere.

If the number of parameters are not changed by the subroutine, you can access them with something like:

Code: Select all

    tsx
    lda   $103,X

instead of messing with the return address.

hjalfi · Post by **hjalfi** » Mon Jun 08, 2020 4:49 pm

@Chromatix: ooh, that's clever! And it's easy! Adding the nops adds 14 bytes to the size of the compiler, but simplifying the subroutine prologues reduces it again by 410 bytes, and doesn't add any compiler complexity, as it's implemented wholly in an assembler macro. That's almost 1%, and is completely worth having. Thanks very much!

@BillG: that takes three bytes per byte of parameter to access, plus the TSX, so it's probably not worth it when copying parameters into local variables (as I can pop them from the stack with one byte, especially as Chromatix's fix has just removed the overhead). What would be interesting is using the variables in-situ. This would only be possible for anything that didn't need dereferencing or indexing, so one-and-two byte simple variables only. I'm inclined to think it's not worth it, as I would still need a byte for the TSX, I'd still need extra code to retract the stack after a call (currently the callee is responsible for this), and I'd need to add quite a lot of compiler complexity to track these. There may be something cool to do with that trick. It might be possible to copy multiple variables on and off the stack in the subroutine prologue in a single loop using something like this:

Code: Select all

  tsx
  ldy #3
loop:
  lda $0101+3, x
  sta var1, y ; note that these variables are not contiguous
  lda $0105+3, x
  sta var2, y
  lda $0109+3, x
  sta var3, y
  dex
  dey
  bpl loop
  tsx
  clc
  adc #12
  tax
  txs

...but I think this would only be worth it for subroutines that took lots of parameters. That final stack adjustment is a lot of bytes.

@BillG: the only true compiler for a modern language on 8-bit machine that's written in itself that I'm aware of is Hitech C for the Z80, which is a proper ANSI C compiler. I've reverse engineered my way through it and Hitech C's calling conventions are very distinctive; it's clearly been compiled with Hitech C. However, chances are the Z80 binaries were probably cross-compiled from another machine. It's also split into two for size.

GARTHWILSON · Post by **GARTHWILSON** » Mon Jun 08, 2020 7:17 pm

hjalfi, I'm sure you don't want to start over; but a separate data stack in ZP solves a lot of problems. Without it, most of the operations could take place on the page-1 (hardware) stack without copying the input and output parameters to local variables; but if you copy them to ZP variables so those locals have to have their own ZP space, you'll quickly run out of ZP space since you can't always foresee which ones will still be needed by pending routines, so you can't do much doubling up. The ZP data stack may seem like a waste of precious ZP space; but it actually ends up saving you ZP space, because variables disappear when you're done with them, no longer taking any space. There is never any conflict in who uses which addresses. In fact, recursion is automatically accommodated. And when variables are addresses to read or store to, the (ZP,X) addressing mode works out perfectly.

It may seem like the extra cycle needed for the constant indexing is also an undesired penalty; but sometimes this is necessary just to keep control of a big project. I sure got myself tied up in knots and produced bugs in my early years because I wasn't acquainted with these stack concepts.

The 6502 stacks treatise goes into these things in detail. Sections 4, 5, 6, and 14, are especially relevant to these things.

hjalfi · Post by **hjalfi** » Mon Jun 08, 2020 10:45 pm

I'm actually doing static variable allocation; by analysing the call graph of the program, I can assign every variable and every expression evaluation temporary an address. If two subroutines are never called at the same time their variables can overlap --- there's never any conflict, and I can prove this at link time. This is the big win of forbidding recursion: many of these old CPUs make stack frames really painful.

Using a data stack would allow all computation operations to happen using cheap two-byte instructions. I'd advance the data stack frame at the beginning of the subroutine and retard it again afterwards. Output parameters would need to be copied in the subroutine epilogue (unless they were in registers)... so, adding two 16-bit numbers ends up as:

Code: Select all

clc
lda 1, x
adc 3, x
sta 3, x
lda 2, x
adc 4, x
sta 4, x

== 13 bytes, which is better than my current 16. But the downside is that I can't index anything on the stack, so adding two 32-bit numbers ends up as 25 bytes, which is worse than my current 14 bytes. (That's cheaper than the 16-bit arithmetic because the 16-bit arithmetic keeps intermediate values in XA, which punishes small operations but is a win overall.) I also can't dereference pointers, and access to local variables in outer scopes becomes impossible, which Cowgol currently requires.

It's a really neat idea, though. Thanks!

The stripped down Cowgol 65c02 compiler I've been playing with, by the way, is using 1428 bytes of main memory for variable storage, mostly I/O buffers, and 199 bytes of zero page. Plus all the heap it can eat, of course. The zero page is mostly two-byte pointers, but I've also told it to assign byte variables there to fill it up; it makes a big difference to the overall code size.

resman · Post by **resman** » Tue Jun 09, 2020 12:37 am

hjalfi wrote:

I'm actually doing static variable allocation; by analysing the call graph of the program, I can assign every variable and every expression evaluation temporary an address. If two subroutines are never called at the same time their variables can overlap --- there's never any conflict, and I can prove this at link time. This is the big win of forbidding recursion: many of these old CPUs make stack frames really painful.

Then why not just put the parameters directly into their memory locations instead of pushing/popping the values through the stack intermediary? You already know where they go. I believe some early FORTRAN compilers did this (they couldn't do recursion, either).

Cowgol 2.0 compiler for the 6502

Re: Cowgol 2.0 compiler for the 6502

Re: Cowgol 2.0 compiler for the 6502

Re: Cowgol 2.0 compiler for the 6502

Re: Cowgol 2.0 compiler for the 6502

Re: Cowgol 2.0 compiler for the 6502

Re: Cowgol 2.0 compiler for the 6502

Re: Cowgol 2.0 compiler for the 6502

Re: Cowgol 2.0 compiler for the 6502

Re: Cowgol 2.0 compiler for the 6502

Re: Cowgol 2.0 compiler for the 6502

Re: Cowgol 2.0 compiler for the 6502

Re: Cowgol 2.0 compiler for the 6502

Re: Cowgol 2.0 compiler for the 6502

Re: Cowgol 2.0 compiler for the 6502

Re: Cowgol 2.0 compiler for the 6502