6502.org

Posted: **Tue Apr 24, 2018 4:11 am**

Dr Jefyll wrote:

You mentioned Forth using STC (subroutine threaded code), which indeed is quite a lot faster than ITC. I don't understand your comment about the '816 needing better support for X as a data-stack pointer (for the most part I find it adequate). Be that as it may, there's at least one other fast alternative -- namely, Forth that uses DTC (direct-threaded-code).

The 65c816 lacks a post-increment of X (for pulling a datum from the data-stack) and a pre-decrement of X for pushing a datum to the data-stack. Because of this, your code ends up with a lot of INX INX and DEX DEX code sequences --- this makes the code bloaty and slow.
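As a rough illustration of that overhead, the Python sketch below tallies the size and timing of a push and a pop done through X, next to the single-instruction hardware-stack equivalents (PHA and PLA). The byte and cycle figures are assumed from the WDC 65c816 datasheet for native mode with a 16-bit accumulator and a page-aligned direct-page register. The X-based sequences are one plausible encoding, not taken from any particular Forth.

```python
# Each entry is (mnemonic, bytes, cycles).
# Assumptions: native mode, 16-bit accumulator, direct-page register
# page-aligned (otherwise the dp,X instructions cost one more cycle).
X_PUSH = [("DEX", 1, 2), ("DEX", 1, 2), ("STA 0,X", 2, 5)]
X_POP = [("LDA 0,X", 2, 5), ("INX", 1, 2), ("INX", 1, 2)]
S_PUSH = [("PHA", 1, 4)]
S_POP = [("PLA", 1, 5)]


def cost(seq):
    """Return (total_bytes, total_cycles) for an instruction sequence."""
    total_bytes = sum(size for _, size, _ in seq)
    total_cycles = sum(cycles for _, _, cycles in seq)
    return total_bytes, total_cycles


if __name__ == "__main__":
    for name, seq in [("push via X", X_PUSH), ("pop via X", X_POP),
                      ("push via S", S_PUSH), ("pop via S", S_POP)]:
        size, cycles = cost(seq)
        print(f"{name}: {size} bytes, {cycles} cycles")
```

Under these assumptions a push or pop through X costs 4 bytes and 9 cycles, against 1 byte and 4 or 5 cycles for PHA or PLA.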

I've given some thought to the design of a Forth for the 65c816. I don't have any intention of actually implementing it, though, because there aren't enough people interested in the 65c816. The only people I know of who use the 65c816 are on this forum; there are only a handful of them, and most of them are convinced that I am not a "language expert" in regard to Forth. In the unlikely case that a lot of interest develops, I might do it.

I briefly examined Garth's Forth and IMHO there are three problems:

1. It is ITC rather than STC.
2. It uses X rather than S for the data-stack pointer (note that UR/Forth used SP for the data-stack and BP for the return-stack).
3. It holds the entire data-stack in memory, rather than holding the TOS in A (note that UR/Forth held the TOS in BX).
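To make the third point concrete, here is a toy Python model (not 65c816 code) that counts data-stack memory accesses for the same Forth phrase under both layouts. The class names `MemStack` and `CachedStack` are hypothetical, and the model ignores instruction fetches and everything except data-stack reads and writes.

```python
class MemStack:
    """Whole data stack in memory; every read or write is one access."""

    def __init__(self):
        self.mem = []
        self.accesses = 0

    def push(self, value):
        self.mem.append(value)
        self.accesses += 1

    def pop(self):
        self.accesses += 1
        return self.mem.pop()

    def add(self):
        b = self.pop()
        a = self.pop()
        self.push(a + b)            # two reads and one write in total

    def dup(self):
        self.accesses += 1          # read the top item
        self.push(self.mem[-1])     # write the copy

    def top(self):
        return self.mem[-1]


class CachedStack:
    """TOS held in a register (self.tos); deeper items live in memory."""

    def __init__(self):
        self.tos = 0                # dummy value for the empty stack
        self.mem = []
        self.accesses = 0

    def push(self, value):
        self.mem.append(self.tos)   # spill the old TOS (like PHA)
        self.accesses += 1
        self.tos = value            # the literal goes into the register

    def add(self):
        self.tos += self.mem.pop()  # one read of the SOS
        self.accesses += 1

    def dup(self):
        self.mem.append(self.tos)   # one write (like PHA)
        self.accesses += 1

    def top(self):
        return self.tos


def run(stack):
    """Execute the Forth phrase  2 3 + DUP +  on the given stack model."""
    stack.push(2)
    stack.push(3)
    stack.add()
    stack.dup()
    stack.add()
    return stack.top(), stack.accesses
```

In this model `run(MemStack())` needs 10 memory accesses and `run(CachedStack())` needs 5 to leave the same result, 10, on top of the stack.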

I came up with a design that fixes these problems. I would expect the code generated to be about 8 times faster than Garth's Forth for these reasons:

- I'm holding the TOS in A. Always holding the TOS in a register is well known to boost the speed significantly. Note that always holding both the SOS and the TOS in registers doesn't work very well (and is a non-issue on the 65c816, which suffers from a shortage of registers).
- I'm using S as the data-stack pointer. This allows me to use PHA and PLA for DUP and DROP respectively. These are one-byte instructions, so they can be inlined to increase speed, and at 1/3 the size of a JSR they also reduce bloat.
- STC uses JSR/RTS, which is faster than NEXT in ITC (or DTC).
- STC uses BEQ (along with some code to pop the TOS and test it) rather than a 0BRANCH primitive, and BRA rather than a BRANCH primitive. This is much faster, and it speeds up loops, which tend to be the big time-sinks in most programs.
- STC allows small functions such as OVER + @ ! etc. to be inlined, which is much faster than doing a function call, and in some cases is smaller than a JSR.
- STC allows jump-termination (replacing JSR xxx RTS with JMP xxx), which boosts the speed and helps to avoid return-stack overflow in recursive functions.
- STC allows peephole-optimization to be done. This mostly involves using literal values as operands. For example, a @ or ! to an absolute address does not require the address to be loaded into a register and then the register used indirectly, but can be optimized to use either absolute or zero-page addressing-modes.
- STC allows ISRs to be written in Forth and to have very little overhead. Reducing interrupt latency is often the best way to boost the speed of micro-controllers.
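Two of the points above, literal-operand peephole optimization and jump-termination, can be sketched as a toy Python code generator that emits pseudo-assembly text. This is only an illustration under stated assumptions: TOS is in A, the rest of the data stack is pushed with PHA/PLA, the direct-page register is 0, and label names such as `FETCH`, `STORE`, and `PLUS` are hypothetical. It does not attempt to show how return addresses and data share the stack.

```python
# Hypothetical labels for words that are called rather than inlined.
LABELS = {"@": "FETCH", "!": "STORE", "+": "PLUS"}
# Words short enough to inline as a single instruction.
INLINE = {"DUP": ["PHA"], "DROP": ["PLA"], ";": ["RTS"]}


def operand(addr):
    """Format an address as direct-page (2 hex digits) or absolute (4)."""
    return f"${addr:02X}" if addr < 0x100 else f"${addr:04X}"


def compile_words(tokens):
    """Translate tokens (ints are literals, strings are words) to text."""
    out = []
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        nxt = tokens[i + 1] if i + 1 < len(tokens) else None
        if isinstance(tok, int) and nxt == "@":
            # literal address + @ : load straight from memory
            out += ["PHA", "LDA " + operand(tok)]
            i += 2
        elif isinstance(tok, int) and nxt == "!":
            # literal address + ! : store TOS directly, then pop new TOS
            out += ["STA " + operand(tok), "PLA"]
            i += 2
        elif isinstance(tok, int):
            out += ["PHA", f"LDA #${tok:04X}"]   # plain literal
            i += 1
        else:
            out += INLINE.get(tok) or ["JSR " + LABELS.get(tok, tok)]
            i += 1
    return out


def jump_terminate(code):
    """Replace each 'JSR x' directly followed by 'RTS' with 'JMP x'.

    Assumes nothing branches to the RTS being removed.
    """
    out = []
    i = 0
    while i < len(code):
        if (i + 1 < len(code) and code[i].startswith("JSR ")
                and code[i + 1] == "RTS"):
            out.append("JMP " + code[i][4:])
            i += 2
        else:
            out.append(code[i])
            i += 1
    return out
```

For example, `jump_terminate(compile_words([0x2000, "@", "FOO", ";"]))` yields `PHA`, `LDA $2000`, `JMP FOO`: the fetch needs no call and no indirect addressing, and the final call becomes a jump.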

For the most part, STC is all about inlining code and doing peephole-optimization --- code is going to be quite bloaty --- of course, bloaty code is only a problem if you have users who want to write large programs, which makes it a non-problem for the foreseeable future.
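The size side of that trade-off is simple arithmetic, sketched here in Python. The function names are hypothetical; the only facts assumed are that JSR absolute is 3 bytes and RTS is 1 byte.

```python
JSR_BYTES = 3  # JSR absolute: opcode plus 16-bit address
RTS_BYTES = 1


def size_called(body_bytes, call_sites):
    """Total bytes when the word exists once and every use is a JSR."""
    return body_bytes + RTS_BYTES + JSR_BYTES * call_sites


def size_inlined(body_bytes, call_sites):
    """Total bytes when the body is copied into every call site."""
    return body_bytes * call_sites


if __name__ == "__main__":
    # A 1-byte word such as DUP (PHA), and a 12-byte word, each used 10 times.
    for body in (1, 12):
        print(body, size_called(body, 10), size_inlined(body, 10))
```

With 10 call sites, a 1-byte body costs 10 bytes inlined versus 32 bytes called, while a 12-byte body costs 120 bytes inlined versus 43 bytes called. Inlining the larger word buys speed at the price of size.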

I wrote a document describing my design --- it is short enough that I can just inline it in this post:

vino816 Forth design
