Cowgol 2.0 compiler for the 6502

Programming the 6502 microprocessor and its relatives in assembly and other languages.
GARTHWILSON
Forum Moderator
Posts: 8775
Joined: 30 Aug 2002
Location: Southern California

Re: Cowgol 2.0 compiler for the 6502

Post by GARTHWILSON »

hjalfi wrote:
@GARTHWILSON: Oops, I'd completely forgotten that the (zp,x) addressing mode existed! Yes, indeed, that makes accessing bytes via pointers on a data stack trivial. However, I still believe that (unless there's something else I've missed) accessing index data requires either copying the pointer to a fixed location in zero page, or doing pointer arithmetic: there's no (zp,x,y) or (zp,x),y address mode. So, for example, 32-bit arithmetic becomes very expensive. The same problem applies to accessing values directly on the hardware stack with abs,x.

A few options:

  • Instead of INY or DEY for the non-existent (ZP,X),Y addressing mode, you could INC ZP,X or DEC ZP,X. If you keep the applicable range from crossing a page boundary, you won't need to fool with the high byte. INC/DEC ZP,X takes 6 cycles (compared to INY/DEY's 2 cycles), but it's still a single instruction.
  • If you'll be doing a lot of indexing (like a multiplication or division, with their looping), copy parameters from the data stack to N (which is also in ZP).
  • If you use the 65816 and put the parameters on the hardware stack, you can use the (sr,S),Y addressing mode.
  • Self-modifying code might present a possibility or two also.
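As a rough illustration of the first option, here is a high-level model of stepping a pointer with INC ZP,X (a hypothetical Python sketch of the memory behaviour, not 6502 code; the zero-page slot address and data are made up):

```python
# Model of "INC zp,X instead of (zp,x),Y": bump the pointer's low byte
# to step through data, assuming the range never crosses a page boundary.
mem = bytearray(0x10000)
zp_stack = 0x00F0              # hypothetical data-stack slot in zero page
mem[zp_stack] = 0x20           # pointer low byte
mem[zp_stack + 1] = 0x30       # pointer high byte -> points at $3020
mem[0x3020:0x3024] = b"\x01\x02\x03\x04"

def read_indirect(x):
    """LDA (zp,X) equivalent: load via the pointer at zp_stack+x."""
    ptr = mem[zp_stack + x] | (mem[zp_stack + x + 1] << 8)
    return mem[ptr]

lo = read_indirect(0)          # first byte
mem[zp_stack] += 1             # INC zp,X: step to the next byte
hi = read_indirect(0)          # second byte
mem[zp_stack] -= 1             # DEC zp,X: restore the pointer
```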

http://WilsonMinesCo.com/ lots of 6502 resources
The "second front page" is http://wilsonminesco.com/links.html .
What's an additional VIA among friends, anyhow?
dmsc
Posts: 154
Joined: 17 Sep 2018

Re: Cowgol 2.0 compiler for the 6502

Post by dmsc »

Hi!
hjalfi wrote:
@dmsc: 9.5kB is really impressive, especially with floating point! I've only briefly played with PEG and packrat parsers; my impression was that they required memoisation to work effectively, which is hard on a small system. Are you doing anything here or just accepting the cost? How are you handling errors?
The parser does not use memoization; you only need memoization if your grammar needs a lot of look-ahead, which causes the parser to rewind and try the same rules again.

In this case, I tried to reduce the backtracking by rewriting the rules to a more LL(1) form, and also minimized recursion in the rules by designing the language to make parsing easier.

About error reporting: in the 6502 compiler I only keep track of the rightmost position in the input that the parser has reached, and if the parser returns a not-match, this signals an error at that position.
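The rightmost-position scheme is simple enough to model in a few lines. This is a hypothetical Python sketch (not dmsc's actual code) of a backtracking matcher that keeps a high-water mark for error reporting:

```python
# Toy PEG matcher that records the rightmost input position reached.
# On failure, that position is the best guess at where the error is.
class Parser:
    def __init__(self, text):
        self.text = text
        self.pos = 0
        self.err_pos = 0          # rightmost position reached so far

    def match_char(self, c):
        if self.pos < len(self.text) and self.text[self.pos] == c:
            self.pos += 1
            self.err_pos = max(self.err_pos, self.pos)
            return True
        return False

    def match_seq(self, s):
        start = self.pos
        for c in s:
            if not self.match_char(c):
                self.pos = start  # backtrack, but keep err_pos
                return False
        return True

p = Parser("IF X>1O")             # '1O' is a typo for '10'
ok = p.match_seq("IF X>10")
# ok is False; p.err_pos points just past the last good character
```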

In the cross-compiler (which uses a C++ implementation of the parsing machine), the parser keeps the rightmost position and a description of all the rules that failed at that position, and returns the position and the string on non-match. This produces errors like the following; not good, but at least usable:

Code: Select all

testsuite/tests/err-bad-pos.bas:2:5: parse error, expected: statement or variable assignment, integer expression, integer constant, variable or function, left parenthesis, floating point constant, variable or function, string expression, variable assignment
  IF <--- HERE -->X>10 
In that example, the problem is that "X" was not defined before, so it was not recognized as a variable name.

One of the funny things about using a PEG parser is that some expressions that should be invalid parse OK, so the language seems richer. For example, this is valid code:

Code: Select all

IF = 3
IF IF = 3
   PRINT "Yes??"
ENDIF
The parser first tries the rule for "IF (expression)", but that fails, so it tries the rule for "variable = expression", which succeeds, creating the "IF" variable.
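That ordered-choice behaviour can be modelled compactly. The sketch below is hypothetical Python (regexes stand in for the real grammar rules); it shows the first alternative failing and the parse falling through to the assignment rule, which creates the variable:

```python
import re

# Toy ordered choice: try "IF <expr>" first, then "name = <expr>".
# PEG alternatives are tried strictly in order; the first match wins.
variables = set()

def parse_statement(line):
    m = re.fullmatch(r"IF (\w+)\s*([<>=])\s*(\d+)", line)
    if m and m.group(1) in variables:   # conditional on a known variable
        return ("if", m.group(1), m.group(2), m.group(3))
    m = re.fullmatch(r"(\w+)\s*=\s*(\d+)", line)  # fall through: assignment
    if m:
        variables.add(m.group(1))       # this is what creates the variable
        return ("assign", m.group(1), m.group(2))
    return None

first = parse_statement("IF = 3")       # "IF" rule fails; assignment succeeds
second = parse_statement("IF IF = 3")   # now "IF" is a variable, so IF matches
```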
Quote:
I'm using the Lemon parser generator, which is a traditional shift-reduce parser. The automaton is pretty small (465 bytes) and the compiled grammar is in about 4kB of tables. Each rule has a subroutine which does typechecking, code emission etc in code. I'm not at all averse to replacing it. The biggest killer is that I don't have function pointers or jump tables, so the rule dispatch table is enormously wasteful --- about 1.5kB. The code generator has another one. I have just come up with what I think is a workable abstraction over function pointers, so I'll have to try implementing it and seeing if that helps.
Well, in my parser, each rule uses one byte:
- 0 is "return not-match from the rule",
- 1 is "return match from the rule",
- 2 is "return match and emit one code byte",
- 3-31 are "emit 1 to 28 bytes of code",
- 32-127 are "match this character from the input",
- 128-255 are "call the parsing rule 0 to 127".

Also, any call to a new parsing rule skips white-space in the input, and parsing rules above a certain number are calls to machine-language routines. For each parsing rule, the rule address is also stored in a table, so each rule uses two extra bytes. Overall, the full floating-point parsing bytecode uses 2216 bytes, plus 402 bytes for the interpreter and 955 bytes for the assembly actions, so 3573 bytes total.
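The one-byte encoding maps naturally onto a small dispatch step. This is my own Python model of the decode, based only on the ranges listed above (not the real interpreter):

```python
# Classify one byte of the parsing-machine bytecode, following the
# ranges described above (my reading of the scheme, not the real code).
def classify(b):
    if b == 0:
        return ("fail",)              # return not-match from the rule
    if b == 1:
        return ("match",)             # return match from the rule
    if b == 2:
        return ("match_emit",)        # return match and emit one code byte
    if b <= 31:
        return ("emit_code", b)       # emit several literal code bytes
    if b <= 127:
        return ("char", chr(b))       # match this literal character
    return ("call", b - 128)          # call parsing rule 0..127
```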

One limitation of the parser is that certain constructs use a lot of stack space, and as we only have 256 bytes of stack on the 6502, this limits the complexity of the supported expressions. This is an example of an expression that can't be parsed:

Code: Select all

data y()=1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55
I started using PEG parsers from C and C++ code and liked that you don't need a separate lexer pass, so the parser is simpler. I wrote a parser for the Atari BASIC language for writing minimized ten-liners, then expanded it to parse an extended BASIC language, and in all my tests I never encountered exponential parsing time. So I started to write the 6502 parser that became FastBasic. You can see the full code at https://github.com/dmsc/fastbasic

Have Fun!
resman
Posts: 154
Joined: 12 Dec 2015
Location: Lake Tahoe

Re: Cowgol 2.0 compiler for the 6502

Post by resman »

hjalfi wrote:
@resman: can you provide an example of some PLASMA stack frame code? (BTW, the Plasma VM should be a really easy Cowgol target.)
These are the routines to allocate a frame & copy parameters and deallocate it:

Code: Select all

;*
;* ENTER FUNCTION WITH FRAME SIZE AND PARAM COUNT
;*
ENTER   LDA     IFPL            ; ALLOCATE FRAME
        INY
        SEC
        SBC     (IP),Y
        STA     IFPL
        BCS     +
        DEC     IFPH
+       INY
        LDA     (IP),Y           ; COPY PARAMETERS
        BEQ     +
        ASL
        TAY
-       LDA     ESTKH,X
        DEY
        STA     (IFP),Y
        LDA     ESTKL,X
        INX
        DEY
        STA     (IFP),Y
        BNE     -
+       LDY     #$03
        JMP     FETCHOP
;*
;* LEAVE FUNCTION
;*
LEAVE   LDA     IFPL
        INY                     ;+INC_IP
        CLC
        ADC     (IP),Y
        STA     IFPL
        BCS     +
        RTS
+       INC     IFPH
RET     RTS
hjalfi wrote:
It's important to remember, BTW, this Cowgol is not specific to the 6502. It's intended for other architectures as well. The 6502 is good at indexing but the 8080 and Z80 really aren't, and my language abstraction has to work there too. Statically assigning variables is important for that.
I understand not wanting to implement a frame stack. How about disallowing the semi-recursion, so parameters can be written directly into memory, and implementing function pointers? Function pointers are so incredibly useful and not that hard to implement.
teamtempest
Posts: 443
Joined: 08 Nov 2009
Location: Minnesota

Re: Cowgol 2.0 compiler for the 6502

Post by teamtempest »

Some idle thoughts, completely untested and not fully worked out.

If you have a dedicated area for stashing return addresses, you could do this:

Code: Select all

Entry:
  ldx retStkPtr
  pla
  sta retStk,x
  inx
  pla
  sta retStk,x
  inx
  stx retStkPtr

[..function code..]

  ldx retStkPtr
  dex
  lda retStk,x
  pha
  dex
  lda retStk,x
  pha
  stx retStkPtr
  rts
...which limits the depth of calling to at most 128 levels, or fewer if you don't allocate a whole page to storage. Or 256 if you allocate two whole pages and store the lo byte in one page and the hi byte in the other. Or split whatever storage you do use into lo/hi parts:

Code: Select all

Entry:
  ldx retStkPtr
  pla
  sta retStkLo,x
  pla
  sta retStkHi,x
  inc retStkPtr

[..function code..]
 
  dec retStkPtr
  ldx retStkPtr
  lda retStkHi,x
  pha
  lda retStkLo,x
  pha
  rts
How deep do you allow function calls to nest? You might even split the hardware stack in two, a return stack and a parameter stack, say 32 or 64 bytes for return addresses and the remainder for parameters:

Code: Select all

Entry:
  tsx
  stx retStkPtr
  ldx argStkPtr
  txs

[..function code..]

  ldx retStkPtr
  txs
  rts
Mmm, that wouldn't work if you allow other functions to be called within this function (which of course you do; it's only recursion that's disallowed, since that would mess up your current return scheme)…

Code: Select all

Entry:
  tsx
  txa
  ldx retStkPtr
  sta retStkSav
  inc retStkPtr
  ldx argStkPtr  ; don't think this needs to be stacked as only ever need the current value (?)
  txs

[..function code..]

  dec retStkPtr
  ldx retStkPtr
  txs
  rts
Actually I take it back. If the top (say) 64 bytes of the hardware stack are reserved for return addresses (and, um, interrupt servicing), then you know each function call uses two bytes of it. All you need to track is the current number of active calls:

Code: Select all

Entry:
  inc retStkCnt  ; initialized to zero
  ldx argStkPtr
  txs

[..function code..]

  lda retStkCnt  ; if one address on return stack, we want to set hardware stack pointer to $FE
  dec retStkCnt
  asl            ; clears carry ($01 -> $02)
  eor #$ff       ; ($02 -> $FD)
  adc #$01       ; ($FD -> $FE)
  tax
  txs
  rts
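The arithmetic in that epilogue can be sanity-checked with a quick model of the three instructions (Python standing in for ASL / EOR #$FF / ADC #$01, with carry clear as the comments assume):

```python
# Model of ASL / EOR #$FF / ADC #$01 turning an active-call count
# into a hardware stack pointer value, matching the trace in the
# comments above ($01 -> $02 -> $FD -> $FE).
def sp_from_count(n):
    a = (n << 1) & 0xFF    # ASL: carry is clear afterwards for n < 128
    a ^= 0xFF              # EOR #$FF
    a = (a + 1) & 0xFF     # ADC #$01 with carry clear
    return a
```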
or perhaps better:

Code: Select all

Entry:
  dec retStkCnt  ; initialized to zero
  ldx argStkPtr 
  txs

[..function code..]

  lda retStkCnt
  inc retStkCnt
  asl
  tax
  txs
  rts
I guess I'm assuming the parameter stack is also used for return values (overwriting the passed arguments). The caller probably has to deal with manipulating that pointer...or maybe it looks like this:

Code: Select all

Entry:
  dec retStkCnt
  ldx argStkPtr
  txs

[..function code..]

  stx argStkPtr
  lda retStkCnt
  inc retStkCnt
  asl
  tax
  txs
  ldx argStkPtr
  rts
hjalfi
Posts: 107
Joined: 12 Oct 2017

Re: Cowgol 2.0 compiler for the 6502

Post by hjalfi »

@GARTHWILSON: changing the values on the stack frame isn't really an option, as that's where the variables live and I might need to refer to them again. However, it's totally doable with a stack-based architecture, as values on the stack get consumed during computation.

I have actually implemented a DTC backend in the Forth style, in the hope of getting denser code. It hasn't worked as well as I'd hoped --- even with specialist words to, e.g., load and store using addresses or indexes in the instruction stream (avoiding having to encode pushes and adds), the code size is only marginally less than my 6502 backend's. The 8080 compiler compiles to 55894 bytes of native 65c02 code, 54466 bytes of DTC 6502 code, 50420 bytes of 8080 code, and 48661 bytes of Z80 code. So that's not very encouraging. The plus side is that the backend is so much simpler that the DTC 6502 compiler in native 65c02 code is only 47763 bytes, which is a drastic improvement.

I might try a TTC implementation. If I make sure that all non-token routines are 2-byte aligned and all my tokens have the bottom bit set, then I can distinguish between tokens and addresses in the instruction stream by looking at the bottom bit. That ought to be only marginally slower than DTC but significantly denser.
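The bottom-bit trick amounts to a one-bit tag on the first byte of each cell. Here is a hypothetical Python model of the decode step (assuming 2-byte-aligned routine addresses stored little-endian, so the first byte seen for an address is always even):

```python
# Distinguish a one-byte token from a two-byte routine address by bit 0.
# Assumes: routine addresses are even (2-byte aligned), tokens are odd.
def decode(stream, ip):
    b = stream[ip]
    if b & 1:                          # odd -> token
        return ("token", b), ip + 1
    addr = b | (stream[ip + 1] << 8)   # even -> little-endian address
    return ("call", addr), ip + 2

stream = bytes([0x03, 0x00, 0x12])
first, ip = decode(stream, 0)          # 0x03 is odd: a token
second, ip = decode(stream, ip)        # 0x00 is even: address $1200
```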

@resman: I don't actually want to pass parameters by writing into subroutine variables, though --- it's always going to be more expensive than passing them on the stack. I can push a byte onto the stack with a PHA, so that's one byte per byte. Storing to memory is going to be two or three bytes, depending. It does make the subroutine prologue and epilogue easier but they only appear in the program once, while calls appear many times, so saving bytes at the call site is a much bigger win than saving bytes in the prologue and epilogue.

I am actually working on function pointers; the biggest blocker there was coming up with a language abstraction which still allows call graph analysis to work (because I need to be able to tell statically which subroutines can be called from each call site). But I have that sorted now.
hjalfi
Posts: 107
Joined: 12 Oct 2017

Re: Cowgol 2.0 compiler for the 6502

Post by hjalfi »

Huh, that was straightforward! TTC reduces code size by an additional 20%. (But now you need the entire interpreter in every program, which is not small, where previously the linker could discard words the program wasn't using.) The full-fat 65c02 compiler now compiles into 6502 TTC code at 53892 bytes, which should actually be runnable. The 6502 TTC compiler in 6502 TTC code is a relatively lightweight 40796 bytes.

With luck the function pointer change will slash this again...
dmsc
Posts: 154
Joined: 17 Sep 2018

Re: Cowgol 2.0 compiler for the 6502

Post by dmsc »

Hi!
hjalfi wrote:
Huh, that was straightforward! TTC reduces code size by an additional 20%. (But now you need the entire interpreter in every program, which is not small, where previously the linker could discard words the program wasn't using.)
Why?

If you make each token value an export from the token code, the linker will only include the tokens (and code) used in the program.

I use this technique, plus a few macros that automatically build the token address table so it only includes the addresses of the used words.

This is the macro: https://github.com/dmsc/fastbasic/blob/ ... ok.inc#L29

And here is a usage example: https://github.com/dmsc/fastbasic/blob/ ... nd.asm#L41

Have fun!
GARTHWILSON
Forum Moderator
Posts: 8775
Joined: 30 Aug 2002
Location: Southern California

Re: Cowgol 2.0 compiler for the 6502

Post by GARTHWILSON »

hjalfi wrote:
I might try a TTC implementation. If I make sure that all non-token routines are 2-byte aligned and all my tokens have the bottom bit set, then I can distinguish between tokens and addresses in the instruction stream by looking at the bottom bit. That ought to be only marginally slower than DTC but significantly denser.

I have never seen a TTC Forth, let alone its innards, but I've been (mildly) interested in TTC. I imagine it using 254 byte values (probably 1 to $FE) for the most commonly used words, one more (probably 0) signalling that the next two bytes are the CFA of a word that's not among the 254 most common, and one more ($FF) as a NOP for alignment if needed. My HP-41 calculator does a similar thing, and programs are very, very dense, yet functions in modules are virtually unlimited except by the memory-map size (and even then, a module may do page-switching to fit more material into however many 4K pages it takes up in the memory map).
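That layout also reduces to a tiny decode step. A hypothetical Python model (the token numbers and CFA value are made up):

```python
# Decode one item: byte 0 escapes to a two-byte CFA, $FF is a NOP pad,
# anything else is one of the 254 common-word tokens.
def decode41(stream, ip):
    b = stream[ip]
    if b == 0x00:                      # escape: next two bytes are a CFA
        cfa = stream[ip + 1] | (stream[ip + 2] << 8)
        return ("call_cfa", cfa), ip + 3
    if b == 0xFF:                      # NOP, for alignment
        return ("nop",), ip + 1
    return ("token", b), ip + 1        # one of the 254 common words

stream = bytes([0x2A, 0x00, 0x34, 0x12, 0xFF])
t1, ip = decode41(stream, 0)
t2, ip = decode41(stream, ip)
t3, ip = decode41(stream, ip)
```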

Quote:
@GARTHWILSON: changing the values on the stack frame isn't really an option, as that's where the variables live and I might need to refer to them again. However, it's totally doable with a stack-based architecture, as values on the stack get consumed during computation.

In that case, you can put additional space on the stack for the callee to use. For example, after you've pushed your input parameters, do PHA PHA again for two more stack bytes the callee can use however it wants. It doesn't matter what their values are until the callee stores something there. You might want to do that anyway if the callee needs to return more output bytes than you passed to it as input parameters. (The callee could take care of it, but it's more overhead that way, since it has to keep the return address intact for RTS.) Or, for local space, the callee can just use N as mentioned earlier, as long as you're always done with the data stored there when calling or returning from a subroutine. There's still no need for call-graph analysis.
http://WilsonMinesCo.com/ lots of 6502 resources
The "second front page" is http://wilsonminesco.com/links.html .
What's an additional VIA among friends, anyhow?
resman
Posts: 154
Joined: 12 Dec 2015
Location: Lake Tahoe

Re: Cowgol 2.0 compiler for the 6502

Post by resman »

hjalfi wrote:
@resman: I don't actually want to pass parameters by writing into subroutine variables, though --- it's always going to be more expensive than passing them on the stack. I can push a byte onto the stack with a PHA, so that's one byte per byte. Storing to memory is going to be two or three bytes, depending. It does make the subroutine prologue and epilogue easier but they only appear in the program once, while calls appear many times, so saving bytes at the call site is a much bigger win than saving bytes in the prologue and epilogue.

I am actually working on function pointers; the biggest blocker there was coming up with a language abstraction which still allows call graph analysis to work (because I need to be able to tell statically which subroutines can be called from each call site). But I have that sorted now.
I guess it depends on how you define expensive: space or time. But there are many ways to skin that cat, and you have to decide what works best for you.

As for static analysis of function pointers, how do you keep that from exploding?
GARTHWILSON wrote:
I have never seen a TTC Forth, let alone the innards, but I've been (mildly) interested in TTC. I imagine it using 254 byte values (probably 1 to $FE) for the most commonly used words, one more (probably 0) for telling that the next two bytes are the CFA of one that's not among the 254 most common ones, and one more (FF) as a NOP for alignment if needed. My HP-41 calculator does a similar thing, and programs are very, very dense, but still allow functions in modules virtually unlimited except by the memory-map size (and even then, a module may do page-switching to get more material in however many 4K pages it takes up in the memory map).
I stumbled upon this a few days ago: a TTC Forth written in C for portability. Pretty interesting, and I'm sure the concepts could be applied to different front ends: https://github.com/philburk/pforth