Page 1 of 3

Compiler writing

Posted: Sat May 02, 2020 4:31 pm
by Dan Moos
What resources exist to learn how to write a compiler that folks might recommend? C specifically.

I should mention, I'm not doing this because I need a compiler, so I don't need to be directed towards existing ones. This is for fun and education.

A quality book, or YouTube series would be awesome.

Re: Compiler writing

Posted: Sat May 02, 2020 5:25 pm
by drogon
Dan Moos wrote:
What resources exist to learn how to write a compiler that folks might recommend? C specifically.

I should mention, I'm not doing this because I need a compiler, so I don't need to be directed towards existing ones. This is for fun and education.

A quality book, or YouTube series would be awesome.
There are a billion books on the subjects - people have written PhDs on it and it's often a standard part of a Comp. Sci degree (in the UK), so there is plenty to find and a million ways to go about doing it..

Start here: https://compilers.iecc.com/crenshaw/

Gordon

Re: Compiler writing

Posted: Sat May 02, 2020 8:53 pm
by BitWise
Aho’s Dragon book is an often referenced guide to compiler writing
https://docs.google.com/open?id=0B1Mogs ... WR5NWVTSVE

Re: Compiler writing

Posted: Sat May 02, 2020 11:20 pm
by BillG
If you want something which leans toward practice rather than theory, I liked this book back in the day:

https://holub.com/compiler/

Re: Compiler writing

Posted: Sun May 03, 2020 5:08 am
by unclouded
This one is billed "A Compiler Writing Journey" and covers the step by step implementation of a C compiler with buildable examples at the end of each chapter:

https://github.com/DoctorWkt/acwj

Re: Compiler writing

Posted: Sun May 03, 2020 5:23 am
by BigEd
Not a book, but this is a classic:
Let's Build a Compiler, by Jack Crenshaw

via this discussion which has many other pointers, and which headlines this series by Nora Sandler:
Writing a C Compiler:
Quote:
I’ve been working on my own C compiler, nqcc for the past several weeks, using Abdulaziz Ghuloum’s An Incremental Approach to Compiler Construction as a roadmap. I really like Ghuloum’s approach: you start by compiling a tiny, trivial subset of your source language all the way down to assembly. Then you add new language features, one step at a time.

Re: Compiler writing

Posted: Sun May 17, 2020 10:16 pm
by Dan Moos
Thanks all. Lots to look at here!

Re: Compiler writing

Posted: Tue May 19, 2020 2:40 pm
by resman
I saw that one of my profs, Hank Dietz, from college has posted the notes to his compiler class. A pretty quick intro into compilers: http://aggregate.org/EE380/notes.pdf

It's interesting to note that Terrence Parr worked with Prof. Dietz about the time Terrence started ANTLR.

Re: Compiler writing

Posted: Sat May 23, 2020 8:56 pm
by johnwbyrd
The Dragon book, mentioned above, seems to be well respected by older programmers, but it may be a little long in the tooth. I suggest "Engineering: A Compiler" by Cooper and Torczon.

Re: Compiler writing

Posted: Sun May 24, 2020 9:34 pm
by Dan Moos
For now I've got an old copy of the Dragon Book on the way. Everything in my research seems to imply that, although it is very dated in the languages it refers to, the concepts are described in a manner that makes it still very useful.

My hope is to create "enough" of a C compiler such that I can create games on my homebrew.

I have no need for standards compliance (I would do my best to be so though)
I have no need for total completeness (I imagine it would slowly gain features as I needed them)
I have no illusions about having even a fraction of the standard library.

Basically, I just want to have most of the flow control statements, the types and operators that I need, and the ability to have local variables. And the ability to inline assembly when the limits of my feeble compiler are hit.

I'll probably end up with a fairly limited "C like" language, and that is fine! Not caring about portability probably will be freeing in some ways!

Re: Compiler writing

Posted: Sun May 24, 2020 10:17 pm
by BitWise
C is a tricky language to compile properly. Its grammar has some ambiguities that mean you must look ahead several tokens to determine what syntax rule should be followed.

I can't remember the specifics but I think type casts (e.g. '(t) x' -- the (t) looks like a sub expression until you find work out that t is a type) and array definitions in types (e.g. char *x[] -- x is an array of char pointers) are examples. A lot of compilers use LALR(n) based code generators (e.g. yacc, bison, antlr) to handle this.

I've been searching for a good language to write a native 6502 compiler for. I've yet to find one. Early languages like BCPL and B are inherently word based and untyped so you end up doing a lot of address computations to access memory.
PASCAL and MODULA-2 have complicated stack requirements for nested procedures. BASIC and LISP are normally interpreted on 8-bit machines. FORTH can be either compiled or interpreted and works quite well but its like 'Marmite' and you either love it or hate it.

I wonder if some of these languages would be better compiled to pseudo-code and run on a virtual machine rather than native code. The 6502 works well for intepreters.

Re: Compiler writing

Posted: Mon May 25, 2020 9:55 am
by Chromatix
Have you considered Lua?

Re: Compiler writing

Posted: Mon May 25, 2020 2:35 pm
by drogon
BitWise wrote:
BASIC and LISP are normally interpreted on 8-bit machines.
BASIC is a curios one in many respects. FORTRAN too, surprisingly. One way to compile FORTRAN is a one-pass compiler that generates an intermediate code plus a symbol table - a block of variable storage, if you like, followed by an interpreter for this intermediate code. I have such a system on my PDP-8.

BASIC - often 'tokenised' then "detokenised" for the LIST or editing command (With note that some Forths do this too) However some BASICS only tokenise the keywords, so you end up with

Code: Select all

PRINT 3.14
being stored as a token for PRINT, then the textual representation of 3.14 which has to subsequently be parsed at run-time.

Some years back I wrote what I considered my "ideal" BASIC and I took it a stage further - I turned everything into a token. So the above was turned into 2 tokens - one for keyword and one for a constant - stored in floating point binary representation a symbol table. The token was an index into the symbol table. That made run-time fast as there were never any numbers to parse or variables to create storage for. It also made LIST hard because you now had a binary representation of a number to print out (So I cheated and also stored the textual representation for nothing other than the LIST command. I even tokenised comments.

Back to compilers - what I did was essentially a one-pass compiler with the symbol table and some run-time fixups (scan for goto/gosub/proc/func targets) there is also a "just in time" compiler for expression evaluation - I cache the RPN generated by the shunting yard expression evaluator so turning that into a 2-pass compiler would not be that hard - and it's on my "to-do" list once I'm happy with my BCPL system on my '816 board. (Although plan A there is to translate the existing tokeniser/sym. table builder into BCPL then have the run-time in ASM, however after doing a lot more with BCPL I'm now looking at compiling directly into code for the BCPL Cintcode VM)

Cheers,

-Gordon

Re: Compiler writing

Posted: Mon May 25, 2020 4:48 pm
by barrym95838
drogon wrote:
Some years back I wrote what I considered my "ideal" BASIC and I took it a stage further - I turned everything into a token. So the above was turned into 2 tokens - one for keyword and one for a constant - stored in floating point binary representation a symbol table. The token was an index into the symbol table. That made run-time fast as there were never any numbers to parse or variables to create storage for. It also made LIST hard because you now had a binary representation of a number to print out (So I cheated and also stored the textual representation for nothing other than the LIST command. I even tokenised comments.
I am involved in a detailed analysis of WozBASIC, which was the first 6502 BASIC interpreter. Woz hand-assembled it, so there are several obvious non-linear patches and a couple of minor bugs, but his philosophy is quite evident: tokenize as much as you reasonably can, including constants. He even has several different tokens that execute differently but LIST the same, like PRINT a blank line, PRINT a variable, PRINT a string literal, etc. His comma tokens are even more diverse. The run-time subroutines for each one are generally short and simple (sometimes just an RTS), but the tokenizer and the program LISTer are quite complex. He has numerous eccentric coding methods, but the result is compact and very efficient. It's a shame that he was too busy to add floats, a situation that I hope to rectify some day, assuming I am able to complete my analysis and extend his code without disrespecting his legacy.

Re: Compiler writing

Posted: Mon May 25, 2020 6:29 pm
by drogon
barrym95838 wrote:
drogon wrote:
Some years back I wrote what I considered my "ideal" BASIC and I took it a stage further - I turned everything into a token. So the above was turned into 2 tokens - one for keyword and one for a constant - stored in floating point binary representation a symbol table. The token was an index into the symbol table. That made run-time fast as there were never any numbers to parse or variables to create storage for. It also made LIST hard because you now had a binary representation of a number to print out (So I cheated and also stored the textual representation for nothing other than the LIST command. I even tokenised comments.
I am involved in a detailed analysis of WozBASIC, which was the first 6502 BASIC interpreter. Woz hand-assembled it, so there are several obvious non-linear patches and a couple of minor bugs, but his philosophy is quite evident: tokenize as much as you reasonably can, including constants. He even has several different tokens that execute differently but LIST the same, like PRINT a blank line, PRINT a variable, PRINT a string literal, etc. His comma tokens are even more diverse. The run-time subroutines for each one are generally short and simple (sometimes just an RTS), but the tokenizer and the program LISTer are quite complex. He has numerous eccentric coding methods, but the result is compact and very efficient. It's a shame that he was too busy to add floats, a situation that I hope to rectify some day, assuming I am able to complete my analysis and extend his code without disrespecting his legacy.
My float issue in LIST was that one day I entered a constant (I forget what, but lets say 10) and when I LISTed the program saw 10.00000001, so good luck with printing floats in the listing :-)

Sounds like a worthwhile project though.

Cheers,

-Gordon