Cowgol 2.0 compiler for the 6502

hjalfi · Post by **hjalfi** » Sat Jun 06, 2020 3:59 pm

It lives! It lives! My creation lives! Witness the power etc!

A few years ago I wrote a compiler for small systems. It worked but was terrible. I've rewritten it so it's drastically better.

http://cowlark.com/cowgol/

It's a table-driven bootstrapped compiler written in itself. It targets the 80386, 8080, Z80, 6502 (and 65c02), and C. It's self-hosting on the 80386 (and through C). Unfortunately the compiler's too big to run on a 6502.

The language is a traditional Ada-inspired block-structured language with strong typing. It's got multiple return parameters, nested subroutines, structured data types, pointers, all the usual stuff. The language forbids recursion, which allows it to statically allocate subroutine variables, drastically improving the code on stack-frame-hostile architectures like the 6502. Variables belonging to different subroutines are overlapped when the linker can prove they're not used at the same time, which massively reduces memory usage. The code quality is... adequate, in my opinion. Here's the 65c02 code for the snippet on the linked page:

https://pastebin.com/kq9KdTW1

The backend is table driven and really easy to port (disclaimer: relative to other compilers). The 80386 backend is 1200 loc. The 6502 backend is not, as the 6502 is substantially more complex and unorthogonal, which is one reason why it won't fit (the 8080 compiler will actually fit on a 6502, if you want to cross compile for the 8080 on a 6502).

I estimate the 65c02:Z80 generated code size to be about 1.1. As the 65c02 compiler compiled for the Z80 is 60210 bytes, that makes the 65c02 compiler compiled for the 65c02 to be about 66kB. So close... (But of course it also needs some memory to actually run in!)

BigEd · Post by **BigEd** » Sat Jun 06, 2020 4:54 pm

Splendid! Is there any chance at all to run the compiler on 6502? Something optimised for size not speed, maybe. With a nearly full 64k to play with. Or with paged memory perhaps?

hjalfi · Post by **hjalfi** » Sat Jun 06, 2020 5:04 pm

One option is to remove some of the compiler rules. That would make it produce less good code (or rather, less less poor code), but would reduce the size of the compiler. There are a number of relatively expensive rules like the one for detecting in-place increments which could plausibly be omitted.

Hmm... must think about that. It'd be really easy to do as a compiler variant (you can build multiple backends from the same source file using #ifdef).

You might be interested in this writeup where I talk about jump tables and how they're really not worth it: http://cowlark.com/2020-06-06-cowgol-jumptables It's exactly the lesson I should have learned before making the compiler so complex!

BigEd · Post by **BigEd** » Sat Jun 06, 2020 5:14 pm

Sounds hopeful! (Of course, it's an excellent project as-is.)

drogon · Post by **drogon** » Sat Jun 06, 2020 5:15 pm

Re. trying to make it compile on a 6502... What about splitting the syntax checker and the code generator into 2 parts? (assuming that's possible)...

-Gordon

drogon · Post by **drogon** » Sat Jun 06, 2020 5:29 pm

drogon wrote:

Re. trying to make it compile on a 6502... What about splitting the syntax checker and the code generator into 2 parts? (assuming that's possible)...

-Gordon

So replying to my own query - I'm reading your stuff on jump tables and optimisation and I recently did this:

https://projects.drogon.net/ruby816-pre ... -all-that/

So much for "trusting yourself"..

However it might be worth while looking at what the BCPL compiler does in its switch statements... It outputs a jump table for consecutive switch cases (when's) which is pairs of "test", "relative jump". The run-time interpreter (which could be a subroutine for you) just scans the list, testing the switch value against each target and jumps or not.

And jsut when you think you might have your head round this, the BCPL compiler has a plan B: When sparse cases it sorts them and outputs them in a sort of balanced binary tree fashion, so you can still linearly test or you can skip ahead by increasing the skip through the table by a factor of 2, or 2+1 every time. You'd need to see the documentation to get your head round that one but both those would nee run-time supprt, so

Code: Select all

    jsr handleSwitch
  .word numcases
  .word default
  .word caseTest
  .word target
  .word caseTest
  .word target

with the handleSwitch subroutine popping the return address off the stack to then access the data...

It's quite an interesting sort of language though, but maybe I've been doing too much coding in BCPL recently to appreciate it. I was going to ask how you stop mutual recursion, then I read the bit about forward references... that'll be fun to stop if/when you allow reverse references or function prototypes

Cheers,

-Gordon

hjalfi · Post by **hjalfi** » Sat Jun 06, 2020 5:51 pm

@BigEd: ...ha. I chopped out just those two increment/decrement-in-place rules and the size of the compiler compiled for the Z80 dropped by almost a kilobyte. I'll keep picking away at it and see what happens!

@Gordon: that was actually what Cowgol 1.0 did. It ended up spending almost its entire time in dumping state to disk and reading it back again. I think I got it down to seven minutes to compile 'Hello World' --- here's a real-time video of an earlier version... https://www.youtube.com/watch?v=1wLATW7sVXs

Also, combining the two helps (I think) in that because the front end hands a statement directly to the backend, the backend can access things like symbol information; and then the frontend can free up information about symbols when they go out of scope, which should help overall memory usage. That bit's not done yet, and in fact I think there's a memory leak.

drogon · Post by **drogon** » Sat Jun 06, 2020 6:53 pm

hjalfi wrote:

@BigEd: ...ha. I chopped out just those two increment/decrement-in-place rules and the size of the compiler compiled for the Z80 dropped by almost a kilobyte. I'll keep picking away at it and see what happens!

@Gordon: that was actually what Cowgol 1.0 did. It ended up spending almost its entire time in dumping state to disk and reading it back again. I think I got it down to seven minutes to compile 'Hello World' --- here's a real-time video of an earlier version... https://www.youtube.com/watch?v=1wLATW7sVXs

Also, combining the two helps (I think) in that because the front end hands a statement directly to the backend, the backend can access things like symbol information; and then the frontend can free up information about symbols when they go out of scope, which should help overall memory usage. That bit's not done yet, and in fact I think there's a memory leak.

I think I'll stick to BCPL for the time being as while I'm sure the compiled code is much slower than your native compiler, however it can compile helloWorld is just a few seconds:

https://youtu.be/TBdiJiy8Hts?t=206

It is a far simple language though!

Cheers

-Gordon
Ps. great use of a (emulated) Beeb+Tube processor though!

hjalfi · Post by **hjalfi** » Sat Jun 06, 2020 10:11 pm

I made a bbctiny toolchain variant which is the 65c02 backend with all the non-critical rules chopped out. (In fact, not all of them. I left the ones which allow it to do a + b = c for 32 bit values and not create a temporary on the stack... I thought that would be unreasonably bad!) With that, plus a round of byte shaving and refactoring, I got it down to 62637 bytes. When loaded at 0x0400 this puts TOP (the top of the program) at 0xf8ad and LOMEM (the start of the heap) at 0xfea2. Given that on the BBC Micro Tube HIMEM (the top of available memory) is at 0xf800 then that leaves -1698 bytes of available heap. But at least it fits in the address space now.

There must be something I can rewrite to be smaller and more efficient...

BigEd · Post by **BigEd** » Sun Jun 07, 2020 7:17 am

Gosh, quite a hill to climb.

I find myself conflicted: it feels right to desire a compiler that targets native code, and a compiler which is composed of native code. And yet, it might be that the instruction set we have doesn't quite have the power (or density) and so that leads to the idea of a virtual machine and a layer of interpretation.

I suppose the kind of refactoring mentioned in your blog post, where a subroutine does some heavy lifting with the input of an inline argument, is more like a threaded interpreter, which elevates the machine's capabilities without quite offering a new instruction set.

drogon · Post by **drogon** » Sun Jun 07, 2020 7:30 am

BigEd wrote:

Gosh, quite a hill to climb.

I find myself conflicted: it feels right to desire a compiler that targets native code, and a compiler which is composed of native code. And yet, it might be that the instruction set we have doesn't quite have the power (or density) and so that leads to the idea of a virtual machine and a layer of interpretation.

And yet for the BBC Micro at least, there exists many other programming languages that are "self hosting" apart from Basic - Comal, Pascal, Lisp, Logo, Forth, and a few C's and I'm sure others, but all these were written in 6502 assembler and probably quite limited. The only other system I know where the compiler is written in itself and runs on a Beeb is BCPL - however that compiles to an extremely compact bytecode which is then interpreted (and it's only a 16-bit system on the Beeb)

So much as I personally like the idea of a self-hosting system I suspect for a language such as this, that you might end up making more sacrifices that you want to and it remains in the realms of cross-compiling for now.

Of-course there is always the '816 ...

BigEd wrote:

I suppose the kind of refactoring mentioned in your blog post, where a subroutine does some heavy lifting with the input of an inline argument, is more like a threaded interpreter, which elevates the machine's capabilities without quite offering a new instruction set.

I could see that ending up almost as a sort of pseudo code - or take it further and compile to a bytecode for an "ideal" Cowgol target then interpret it. Acheron-VM needs a compiler

https://www.white-flame.com/acheron/

The downside is that you'd then end up writing the VM interpreter for each separate target you want to support which might be a step backwards if the code generator is already really good.

It's never easy!

Cheers,

-Gordon

hjalfi · Post by **hjalfi** » Sun Jun 07, 2020 8:29 am

It would actually be completely possible to target Cowgol at the Acheron or BCPL VMs. (Or SWEET-16.) I'll admit that I've been mostly thinking about real machine code, because real machine code is cool?

The backend architecture is pretty flexible, and copes well with register machines, stack machines, hybrid machines, and machines with no stack at all (I had an earlier version of the compiler which produced code for the Apollo Guidance Computer).

Was BCPL bytecode standardised like Pascal pCode? Is the Beeb version documented?

drogon · Post by **drogon** » Sun Jun 07, 2020 9:12 am

hjalfi wrote:

It would actually be completely possible to target Cowgol at the Acheron or BCPL VMs. (Or SWEET-16.) I'll admit that I've been mostly thinking about real machine code, because real machine code is cool?

The backend architecture is pretty flexible, and copes well with register machines, stack machines, hybrid machines, and machines with no stack at all (I had an earlier version of the compiler which produced code for the Apollo Guidance Computer).

Was BCPL bytecode standardised like Pascal pCode? Is the Beeb version documented?

The only "standard" is what appears on Martin Richards website. Look for bcplman.pdf. I've been in-touch with him recently about it all - one down-side is that the sources to all the Beeb stuff have been lost, so just the modern 32-bit stuff remains. (and some historical non-beeb stuff)

However I understand that Cintcode is more or less the same as it has been for a very long time and it has the concept of bytes, halfwords (16-bit) and words (32-bit). It's a sort of stack register model with just 3 registers and most arithmetic opcodes "push" register A into B (discarding the old B contents). Register C is only used for byte indexing. All branches are relative (signed 8 or 16-bit offset) and it has opcodes for things like the switch statement.

A modern BCPL compiler has the ability to output 3 different target codes - INTCODE which is the old, original code, SIAL which is a code for some imaginary RISC-like system which is designed to be easy to translate into native code and Cintcode. The Beeb version was (AIUI) cut-down to just generate Cintcode for a 16-bit machine, but all three of these are described in the manual. Some of the OPC projects (See http://anycpu.org/ ) have been running BCPL programs via a SIAL to native convertor (although not self-hosting)

(I decided that as BCPL on the Beeb was 16-bit, then my implementation would be 32-bit on the '816 which has worked well - and I went with Cintcode rather than a SIAL to native translator to reduce the memory footprint. The compiler and cintcode generator is just a 42KB binary and my simple screen editor is about 2K of binary cintcode vs. 16KB of 6502 code for the C version compiled with cc65)

But yes, as you say, native code is cool

It's also often bulky and at the whim of the underlying architecture and targeting the 6502 from a compiler has never been easy.

Cheers,

-Gordon

hjalfi · Post by **hjalfi** » Sun Jun 07, 2020 1:10 pm

I spotted a place where the backend was duplicating code, and did a whole pile of refactoring, including moving a bunch of repeated code from the various backends into the frontend. The generated code got bigger. Sigh. It's now down to 62362 bytes, which is better but still too far to shout.

A specific question:

For doing signed magnitude comparisons (x lt y) on 32-bit numbers, I am generating this:

Code: Select all

  ldx #4
  ldy #0
  sec
label:
  lda lhs, y
  sbc rhs, y
  iny
  dex
  bne label
  tay ; set flags from A
  bvc *+4
  eor #0x80
  bpl false
true:

...which is gruesome. Can anyone improve it?

drogon · Post by **drogon** » Sun Jun 07, 2020 1:33 pm

hjalfi wrote:

I spotted a place where the backend was duplicating code, and did a whole pile of refactoring, including moving a bunch of repeated code from the various backends into the frontend. The generated code got bigger. Sigh. It's now down to 62362 bytes, which is better but still too far to shout.

A specific question:

For doing signed magnitude comparisons (x lt y) on 32-bit numbers, I am generating this:

Code: Select all

  ldx #4
  ldy #0
  sec
label:
  lda lhs, y
  sbc rhs, y
  iny
  dex
  bne label
  tay ; set flags from A
  bvc *+4
  eor #0x80
  bpl false
true:

...which is gruesome. Can anyone improve it?

Maybe have a read though this: http://www.6502.org/tutorials/compare_beyond.html

-Gordon

Cowgol 2.0 compiler for the 6502

Cowgol 2.0 compiler for the 6502

Re: Cowgol 2.0 compiler for the 6502

Re: Cowgol 2.0 compiler for the 6502

Re: Cowgol 2.0 compiler for the 6502

Re: Cowgol 2.0 compiler for the 6502

Re: Cowgol 2.0 compiler for the 6502

Re: Cowgol 2.0 compiler for the 6502

Re: Cowgol 2.0 compiler for the 6502

Re: Cowgol 2.0 compiler for the 6502

Re: Cowgol 2.0 compiler for the 6502

Re: Cowgol 2.0 compiler for the 6502

Re: Cowgol 2.0 compiler for the 6502

Re: Cowgol 2.0 compiler for the 6502

Re: Cowgol 2.0 compiler for the 6502

Re: Cowgol 2.0 compiler for the 6502