Crenshaw - Let's Build a Compiler

barnacle · Post by **barnacle** » Thu Oct 17, 2024 2:46 pm

BDD wrote:

Once you do that, I might get a burr under my saddle and see if I can spin a 65C816 rendition.

You will be very welcome. At the moment, it's about a thousand lines of C, with a lot of comments and thought experiments, and compiles to about 26k of x86. I tend to write standard 6502, no clever illegal codes (though I will be using some of the CMOS extensions, particularly around the stack), and I prefer clear and obvious to tricksy.

Also, it's disappearing up its own stack while testing for/next at the moment. A minor detail, I'm sure.

(Nested parentheses are just a recursive call to Expression(); if it finds one, it tries to evaluate a following expression until it finds a closing parenthesis and returns the value calculated so far (or in his examples, pushes the value in the generated code).)

Neil

drogon · Post by **drogon** » Thu Oct 17, 2024 2:55 pm

barnacle wrote:

You will be very welcome. At the moment, it's about a thousand lines of C, with a lot of comments and thought experiments, and compiles to about 26k of x86. I tend to write standard 6502, no clever illegal codes (though I will be using some of the CMOS extensions, particularly around the stack), and I prefer clear and obvious to tricksy.

FWIW: I have an editor application which is about 1500 lines of C that compiles to about 15KB of 65C02 code with cc65. So it might actually cross compile for a 6502..

-Gordon

Yuri · Post by **Yuri** » Thu Oct 17, 2024 5:12 pm

BigDumbDinosaur wrote:

I’m quite fuzzy on how nested parentheses get evaluated, which is kind of key to a lot of language implementations...except, maybe, Forth.

Took a long while for me to wrap my head around that concept with compilers/languages as well. I'll take a stab at an explanation though:

With the compiler/interpreter that Crenshaw is describing is recursive decent parser, which is just a kinda fancy way of saying that you're relying on the nature of a high level language's way to effectively have a new local state with each recursive call to a function.

The magic of recursive decent is that it works backwards through the grammar if you will. That is to say, we start with a call with the thing with the least precedence and work our way down to the thing with the most precedence; effectively allowing the high precedence thing to take a stab at parsing what we need first.

I'll limit this to just +, * and () for the sake of simplicity, but we start of basically with this:

Code: Select all

int parseAdditive(){ return 0; }

Now the parseAdditive() function doesn't really know what token the lexer just found, nor does it really need to care at this point. What it does know, is that it needs a left hand part of an additive expression, and that left hand part needs to follow the rules for multiplicative, so it asks the parseMultiplicative() to figure out what that left hand side is:

Code: Select all

int parseMultiplicative() { return 0; }

int parseAdditive()
{
    return parseMultiplicative(); // Don't know what's there yet, but we'll make the higher precedence figure that out.
}

At this point parseMultiplicative() also has no clue what's going on really, but it knows that it also needs a left hand side, so it asks parseFactor() to give it an item to work with:

Code: Select all

int parseFactor() { return 0; }

int parseMultiplicative()
{
    return parseFactor(); // Hey I need a left hand side please, thanks!
}

int parseAdditive()
{
    return parseMultiplicative(); // Don't know....
}

Now at this point parseFactor() kind has no one else to go to but the lexer itself, so it looks at the token to figure out what to do:

Code: Select all

int parseFactor()
{
    int rval = 0;

    switch (currentToken)
    {
    case IDENT: // We have a variable, look it up!  That's our left hand side!
        rval = getSymValue(tokenLiteral);
        accept(IDENT);
        break;

    case NUMBER: // We have a number, this is our left hand side!
        rval = atoi(tokenLiteral);
        accept(NUMBER);
        break;

    case '(': // Oh hey, it's a paren, the human is telling me there's a nested thing here
        accept('('); // Eat the '(', we don't really need it any more.
        rval = parseAdditive(); // Recursively call the chain
        expect(')'); // Make sure there is a closing ')' or we give out a syntax error
        break;

    default:
        reject(token); // This isn't an expected token, report an error.
        break;
    }

    return rval;
}

The magic happens in that parseFactor() and the fact that on each entrance into the recursive call to parseAdditive(), we'll get a fresh new state on the stack for those calls that holds the state of what we're parsing at that particular moment. If you follow this hopefully you can see that it doesn't matter how many nested parens you have, it will still work out. (Stack overflow potentials not withstanding)

To complete the other two functions, they roughly work out as:

Code: Select all

int parseMultiplicative()
{
    int result = parseFactor();

    while (currentToken == '*')
    {
        accept('*'); // Eat the '*' we don't really need it any more
        int rightSide = parseFactor();
        result *= rightSide;
    }

    return result;
}

int parseAdditive()
{
    int result = parseFactor();

    while (currentToken == '+')
    {
        accept('+'); // Eat the '+' we don't really need it any more
        int rightSide = parseMultiplicative();
        result += rightSide;
    }

    return result;
}

To really simplify this problem consider this:

Code: Select all

int parseFactor()
{
    int rval = 0;

    switch (current)
    {
    case NUMBER:
        rval = atoi(tokenLiteral);
        accept(NUMBER);
        break;

    case '(':
        accept('(');
        rval = parseFactor(); // Recursively call ourselves
        expect(')');
        break;

    default:
        reject(current); // Syntax error
        break;
    }

    return rval;
}

In the above we've really simplified our grammar down to just accepting a number with any amount of parens surrounding it.

barnacle · Post by **barnacle** » Thu Oct 17, 2024 7:02 pm

Thanks Yuri; if Dr Crenshaw had explained it that way it might not have taken me years to get my head around it.

(Although, to be fair, doing it using a HLL I don't speak and an assembly language I don't really grok didn't help

)

Neil

barnacle · Post by **barnacle** » Thu Oct 17, 2024 7:52 pm

So, it turns out that for/next loops require recursion, which surprised me (mostly because I hadn't thought about it!) but that's now in the system:

Code: Select all

> list
   10 for q = 1 to 3
   15   for r = 4 to 6
   20     print q;r
   30     next
   40   print "inner"
   50   next
   60 print "outer"
   70 end
> run
1 4 
1 5 
1 6 
inner
2 4 
2 5 
2 6 
inner
3 4 
3 5 
3 6 
inner
outer
>

It does negative numbers, too.

Neil

Yuri · Post by **Yuri** » Thu Oct 17, 2024 11:22 pm

barnacle wrote:

Thanks Yuri; if Dr Crenshaw had explained it that way it might not have taken me years to get my head around it.

Yeay! I'm glad it helps clarify it.

barnacle · Post by **barnacle** » Fri Oct 18, 2024 6:52 pm

Extended the interpreter slightly to allow an implicit 'let' (saves typing!)

Code: Select all

> 10 a = 10
> 20 b = 7
> 30 c = a*b
> 40 print a; b; c
> run
10 7 70 
>

It still allows 'let', of course.

Neil

barnacle · Post by **barnacle** » Tue Oct 22, 2024 5:03 am

And now it allows an input; either an integer or a string. I toyed with just using the expression parser as the numeric input, but because it can access the variables which are unknown to the (basic program) user, decided against it. So it takes a little more code, since GetNum() can only accept a positive value.

Code: Select all

> list
   10 print "What's your name";
   20 input $
   30 print "Hello, ";$
   40 end
> run
What's your name? Neil
Hello, Neil
>

Neil

barnacle · Post by **barnacle** » Thu Oct 24, 2024 6:09 am

A minor change in direction.

At the moment, the flow control is determined by line number, and the first thing the main loop does is to find the line it has to execute next. Which it did by calling find_line(), which starts at the first location in the program store and checking every number to see what's next.

Which is fine, until you get to a long program, where execution will slow down as you get to the later numbers...

It struck me that I don't actually need to do that. I don't even need to know the numbers, except in a handful of cases, just _where_ the next line starts. Which I know since I know where this line is, and I know how long it is, so moving forward to the next line is a read and an addition...doh. (That's actually how find_line() works, but as I said, it starts at the beginning every time).

Flow control is more complex.

Generally, we just go to the next line - easy.
'for' requires us to proceed with the following line as the loop, and the 'next' is a marker to say the loop has ended. To allow for nested loops, we need to call that section of code recursively, unwound by the 'next's. When the loop is finished, we have to find a way to return the address of the line following.
'if' is similar; it needs to know the address of the line after the 'then'
'do' does nothing but call the following line recursively, but 'while' needs to return either to the line following 'do' or, if the condition fails, the line following itself.

I've got those all working with line numbers and find_line(), but I think things can be improved because in most cases - normal sequential instructions - I don't care what the line number is. And if I do need it, it's easy to get. But 'goto' will always need to start at the beginning since I can't scan backwards.

Neil

Yuri · Post by **Yuri** » Fri Oct 25, 2024 9:17 am

barnacle wrote:

...
I've got those all working with line numbers and find_line(), but I think things can be improved because in most cases - normal sequential instructions - I don't care what the line number is. And if I do need it, it's easy to get. But 'goto' will always need to start at the beginning since I can't scan backwards.

One approach to solving this might be to keep a look up table where each line begins in memory. The table would always have 2 byte indexes so each index would be the "next" line number or something to that effect.

You'd have to do some reshuffling of the table when a user inserts or removes a line number in the middle the program, but that cost would likely not be as painful as re-scanning the entire program on each goto.

The other draw back is that if the program is sufficiently "long" you'll eat up some memory pretty quickly. (2 extra bytes per line after all)

Another option might be to scan the whole program on "run" and look up the line number indexes for those spots where you know you'll need them and build a smaller table of jump locations from there.

Food for thought

barnacle · Post by **barnacle** » Fri Oct 25, 2024 10:17 am

I've considered both of those, but they do get long quite quickly.

Currently, I'm breaking things into a handful of cases. In most cases, we just need the next line, so that can be quickly generated by the execute before the statement line is even called. Gosub and goto need to search the whole memory space, so there may be scope for something there. If needs to find the associated endif, but that's always in a forward direction; similarly 'for' needs to know where to go when it's finished so it needs to find the first next - but again, that's always forwards (and both are called recursively so they can be nested). Do/while needs only to remember where the line after do was, and it's also called recursively; the associated with while can return either a null (i.e. got to the end of the code block) or the following line if the condition fails.

Of course, there is great potential for goto (considered harmful!) to really screw up both the program flow if you e.g. jump out of for/next or do/while or even if/endif and equally if you jump into the middle... but if you can't keep track of your control flow, you probably deserve all you get

Neil

BigEd · Post by **BigEd** » Fri Oct 25, 2024 12:01 pm

Hmm, I don't think FOR needs to find a NEXT - because it's at the NEXT that it detects it's done. (Helpful here is that FOR always executes the first time through.)

Whereas, WHILE...ENDWHILE would be more difficult, as it might need to execute zero times. (The BBC wanted WHILE...ENDWHILE for BBC Basic but were evidently satisfied with REPEAT...UNTIL)

drogon · Post by **drogon** » Fri Oct 25, 2024 12:31 pm

barnacle wrote:

I've considered both of those, but they do get long quite quickly.

And possibly turning into a not quite so tiny basic .... Where do you draw the line?

Quote:

Currently, I'm breaking things into a handful of cases. In most cases, we just need the next line, so that can be quickly generated by the execute before the statement line is even called. Gosub and goto need to search the whole memory space, so there may be scope for something there. If needs to find the associated endif, but that's always in a forward direction; similarly 'for' needs to know where to go when it's finished so it needs to find the first next - but again, that's always forwards (and both are called recursively so they can be nested). Do/while needs only to remember where the line after do was, and it's also called recursively; the associated with while can return either a null (i.e. got to the end of the code block) or the following line if the condition fails.

There are many GOTO/GOSUB scenarios - one might be to simply start at the top and search, looking for the line number - this is pretty "traditional" in MS style tokenising Basics - and once upon a time we all used to put commonly used subroutines at the top of the program on a single line, where possible to maximise speed... Another might be to check if the target linenum is >= to the current line number and search forwards or re-start from the top (or even search backwards) but every time you do something to make it faster you increase the code complexity and size...

I'm surprised you're using recursion to handle FOR - but it's a valid way although I think I'd be looking at a "longjump" to help with error recovery.. I got/get into all sorts of shenanigans trying to unwind the stack when my big Basic does recursion (functions & procedures can call themselves recursively) and when/if I re-write it I'll use longjump as my get out of jail free card...

"Traditional" tiny basics (and my own) maintain a software stack and push target value, controlling variable and the "cursor" value - which in a FOR statement is actually a pointer to the end of the statement, so NEXT never has to search, but can reload the "cursor" pointer directly to effect the GOTO once it's updated the controlling variable. DO/UNTIL works in a similar manner with it's own stack but no controlling variable. The test happens at UNTIL then it can reload the cursor pointer or just carry on to the next statement. (or line).

And, ah, multi-line IF ... Never easy in a Basic. I don't do it in my Tiny, but do do it in the C version.

Quote:

Of course, there is great potential for goto (considered harmful!) to really screw up both the program flow if you e.g. jump out of for/next or do/while or even if/endif and equally if you jump into the middle... but if you can't keep track of your control flow, you probably deserve all you get

Neil

Left as an exercise to the user (in this case the person writing the basic programs!

)

-Gordon

barnacle · Post by **barnacle** » Fri Oct 25, 2024 1:13 pm

@BigEd: Do [this] while [that] though always executes once, while for/next can not execute if the starting condition is untrue. It has to search for the next in the case where it _doesn't_ execute at all... the recursion is sorta like this (I'm making this up as I go along, because I just broke everything by changing from line numbers to memory pointers)

The main loop sees run as a statement (in the terminal) and calls, er, run(). run() executes single lines, one at a time, by calling execute() which is a big case statement that (a) finds the next line address, (b) calls individual statement handlers if necessary, and (c) returns either the address of the next line (precalculated or modified by the statement handler), or NULL if it's run out of lines, met an end, or met a loop terminator. Execute() returns the next address to execute back to run which either runs the next statement, or quits back to the terminal.

So far so good.

If execute() meets a loop start, the loop start handler calls execute() itself - recursively - until it gets to the end of the loop, at which point it is presented with a NULL (done here!) and returns... back to execute the (new) next line - which might be just the next line. But running out of program or meeting an end before the official end of loop will still return the loop, but not necessarily do what you intended.

So you don't get into a mess trying to build a separate stack for control returns; it's all on the stack. Which is getting fatter every day but until I run something I won't really have a feel for how much stack is needed...

@drogon:
The basic is so tiny everything is limited to a single statement per line. So you can't say 'if x < 4 goto 100'; that requires three lines - the comparison, the goto, and the endif. But it gets me multi-line if for free.

I'm looking at find_next_line, which is a no-brainer; but I can't search backwards except byte-at-a-time until I find a CR of the previous line, and then read forwards. I think it will have to start at the current position if the required target is higher than the current line, or start at the beginning if it's lower.

This writing a tiny basic is proving to be an interesting challenge. It's still only around 1200 lines of C, and a lot of those are MISRA style braces; I have hopes that it won't prove too large; under 4k is the target, under 2k the challenge. Which I don't expect to meet.

Neil

BigEd · Post by **BigEd** » Fri Oct 25, 2024 1:15 pm

Ah, you have a different model of FOR...NEXT than BBC Basic (or, I think, MS Basic). Fair enough!

Crenshaw - Let's Build a Compiler

Re: Crenshaw - Let's Build a Compiler

Re: Crenshaw - Let's Build a Compiler

Re: Crenshaw - Let's Build a Compiler

Re: Crenshaw - Let's Build a Compiler

Re: Crenshaw - Let's Build a Compiler

Re: Crenshaw - Let's Build a Compiler

Re: Crenshaw - Let's Build a Compiler

Re: Crenshaw - Let's Build a Compiler

Re: Crenshaw - Let's Build a Compiler

Re: Crenshaw - Let's Build a Compiler

Re: Crenshaw - Let's Build a Compiler

Re: Crenshaw - Let's Build a Compiler

Re: Crenshaw - Let's Build a Compiler

Re: Crenshaw - Let's Build a Compiler

Re: Crenshaw - Let's Build a Compiler