Hi!
@dmsc: 9.5kB is really impressive, especially with floating point! I've only briefly played with PEG and packrat parsers; my impression was that they required memoisation to work effectively, which is hard on a small system. Are you doing anything here or just accepting the cost? How are you handling errors?
The parser does not use memoization; you only need memoization when the grammar requires a lot of look-ahead, causing the parser to rewind and try the same rules again at the same input positions.
In this case, I reduced backtracking by rewriting the rules into a more LL(1)-like form, and also minimized recursion in the rules by designing the language to make parsing easier.
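To illustrate the idea (a toy Python sketch, not FastBasic's actual rules): with a rule like `sum <- num "+" sum / num`, a failed first alternative re-parses the same `num`, which is what memoization would otherwise cache. Factoring the common prefix into `sum <- num ("+" num)*` parses each `num` exactly once, so there is nothing to memoize:

```python
import re

def parse_num(s, pos):
    # Match one or more digits; return the end position or None.
    m = re.match(r"\d+", s[pos:])
    return pos + m.end() if m else None

# Backtracking form:  sum <- num "+" sum / num
# If the first alternative fails, the second alternative re-parses `num`.
def parse_sum_backtracking(s, pos):
    p = parse_num(s, pos)
    if p is not None and p < len(s) and s[p] == "+":
        q = parse_sum_backtracking(s, p + 1)
        if q is not None:
            return q
    return parse_num(s, pos)  # re-parses the same prefix

# Factored, LL(1)-like form:  sum <- num ("+" num)*
# Each `num` is parsed exactly once; no rewinding, no memoization needed.
def parse_sum_factored(s, pos):
    p = parse_num(s, pos)
    if p is None:
        return None
    while p < len(s) and s[p] == "+":
        q = parse_num(s, p + 1)
        if q is None:
            break  # ordered choice: keep what already matched
        p = q
    return p
```

Both forms accept the same inputs here; the factored one simply never visits the same input position twice.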
About error reporting: in the 6502 compiler I only keep track of the rightmost position in the input that the parser has reached, and if the parser returns a non-match, this signals an error at that position.
In the cross-compiler (which uses a C++ implementation of the parsing machine), the parser keeps the rightmost position together with the descriptions of all the rules that failed at that position, and returns both on a non-match. This produces errors like the following, not great but at least usable:
Code: Select all
testsuite/tests/err-bad-pos.bas:2:5: parse error, expected: statement or variable assignment, integer expression, integer constant, variable or function, left parenthesis, floating point constant, variable or function, string expression, variable assignment
IF <--- HERE -->X>10
In that example, the problem is that "X" was not defined before, so it was not recognized as a variable name.
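The bookkeeping behind that is small; here is a toy Python sketch of rightmost-failure tracking (illustrative only, not the actual C++ code): each rule failure records its position, and only the descriptions from the furthest position are kept.

```python
# Rightmost-failure error reporting: remember only the expectations that
# failed at the furthest input position the parser ever reached.

class ErrorTracker:
    def __init__(self):
        self.pos = 0           # rightmost failure position so far
        self.expected = set()  # descriptions of rules that failed there

    def fail(self, pos, what):
        if pos > self.pos:
            self.pos, self.expected = pos, {what}   # new furthest point
        elif pos == self.pos:
            self.expected.add(what)                 # same point: accumulate
        return None            # conventional "no match" result

errs = ErrorTracker()
# Simulate several alternatives all failing at column 3 of "IF X>10":
errs.fail(3, "integer expression")
errs.fail(3, "variable or function")
errs.fail(1, "keyword")        # earlier failure: discarded
print("expected: " + ", ".join(sorted(errs.expected)))
```

Deduplicating the descriptions (the `set`) is my addition; as the sample message above shows, the real implementation can list the same description more than once.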
One of the funny things about using a PEG parser is that some expressions that should be invalid parse OK, so the language seems richer than intended. For example, this is valid code:
Code: Select all
IF = 3
IF IF = 3
PRINT "Yes??"
ENDIF
The parser first tries the rule for "IF (expression)", but it fails, so it then tries the rule for "variable = expression", which succeeds, creating a variable named "IF".
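That is just PEG ordered choice at work. A minimal Python sketch (with regexes standing in for the real expression rules, which are my simplification):

```python
import re

# statement <- "IF " expression / identifier " = " expression
# On "IF = 3" the first alternative fails (no expression after IF), so the
# parser falls back to the assignment rule and "IF" matches as an identifier.
def parse_statement(s):
    m = re.fullmatch(r"IF (\d+)", s)           # stand-in for "IF expression"
    if m:
        return ("if", int(m.group(1)))
    m = re.fullmatch(r"([A-Z]+) = (\d+)", s)   # "variable = expression"
    if m:
        return ("assign", m.group(1), int(m.group(2)))
    return None

print(parse_statement("IF 3"))    # ('if', 3)
print(parse_statement("IF = 3"))  # ('assign', 'IF', 3)
```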
I'm using the Lemon parser generator, which is a traditional shift-reduce parser. The automaton is pretty small (465 bytes) and the compiled grammar fits in about 4kB of tables. Each rule has a subroutine which does typechecking, code emission, etc. I'm not at all averse to replacing it. The biggest killer is that I don't have function pointers or jump tables, so the rule dispatch table is enormously wasteful: about 1.5kB. The code generator has another one. I have just come up with what I think is a workable abstraction over function pointers, so I'll have to try implementing it and see if that helps.
Well, in my parser, each operation in a rule is a single byte:
- 0 is "return not-match from the rule",
- 1 is "return match from the rule",
- 2 is "return match and emit one code byte",
- 3-31 are "emit 1 to 28 bytes of code",
- 32-127 are "match this character from the input",
- 128-255 are "call parsing rule 0 to 127".
Also, any call to a new parsing rule skips white-space in the input, and parsing rules above a certain number are calls to machine-language routines. For each parsing rule, the rule address is also stored in a table, so each rule uses two extra bytes. Overall, the full floating-point parsing bytecode uses 2216 bytes, plus 402 bytes for the interpreter and 955 bytes for the assembly actions: 3573 bytes total.
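The dispatch described above could be sketched like this in Python (the real interpreter is 6502 assembly; the exact count encoding for the emit opcodes and the handling of alternatives on failure are my guesses, and backtracking is omitted):

```python
FAIL, MATCH, MATCH_EMIT1 = 0, 1, 2

def run_rule(prog, rule_table, rule_id, text, pos, out):
    while pos < len(text) and text[pos] == " ":  # rule calls skip white-space
        pos += 1
    pc = rule_table[rule_id]     # per-rule entry address (the 2-byte table)
    while True:
        op = prog[pc]; pc += 1
        if op == FAIL:
            return None                           # return not-match
        if op == MATCH:
            return pos                            # return match
        if op == MATCH_EMIT1:
            out.append(prog[pc])                  # emit one code byte
            return pos
        if op < 32:                               # 3..31: emit literal bytes
            n = op - 2                            # (count encoding is a guess)
            out.extend(prog[pc:pc + n]); pc += n
        elif op < 128:                            # 32..127: match a character
            if pos < len(text) and text[pos] == chr(op):
                pos += 1
            else:
                return None
        else:                                     # 128..255: call rule op-128
            pos = run_rule(prog, rule_table, op - 128, text, pos, out)
            if pos is None:
                return None

# Rule 0: match "A", emit byte 0x10.  Rule 1: call rule 0 twice, then match.
prog = [ord("A"), 2, 0x10,   # rule 0 at address 0
        128, 128, 1]         # rule 1 at address 3
out = []
print(run_rule(prog, [0, 3], 1, "A A", 0, out), out)  # 3 [16, 16]
```

Note how the second call to rule 0 silently skips the space before matching the second "A".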
One limitation of the parser is that certain constructs use a lot of stack space, and as we only have 256 bytes of stack on the 6502, this limits the complexity of the supported expressions. This is an example of an expression that can't be parsed:
Code: Select all
data y()=1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55
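The stack cost is easy to see if the list is parsed right-recursively, as in `list <- num "," list / num` (a sketch of the general shape, not the actual FastBasic rule): each element adds one nested rule call, and on the 6502 every call frame consumes several bytes of the 256-byte hardware stack.

```python
# Why a long comma-separated list eats stack: one nested call per element.

def parse_list(s, pos=0, depth=1):
    """Parse digits ("," digits)* right-recursively; return (end, max depth)."""
    end = pos
    while end < len(s) and s[end].isdigit():
        end += 1
    if end == pos:
        return None                        # no number here: not-match
    if end < len(s) and s[end] == ",":
        rest = parse_list(s, end + 1, depth + 1)   # recurse for the tail
        if rest is not None:
            return rest
    return end, depth

data = ",".join(str(i) for i in range(1, 56))  # 55 elements, as in the DATA line
end, depth = parse_list(data)
print(depth)  # 55 nested rule invocations
```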
I started using PEG parsers from C and C++ code and liked that you don't need a separate lexer pass, so the parser is simpler. I wrote a parser for the Atari BASIC language for writing minimized ten-liners, then expanded it to parse an extended BASIC language, and in all my tests I never encountered exponential parsing time. So I started to write the 6502 parser that became FastBasic. You can see the full code at
https://github.com/dmsc/fastbasic
Have Fun!