vino816 Forth design
Posted: Tue Apr 24, 2018 4:11 am
Dr Jefyll wrote:
You mentioned Forth using STC (subroutine threaded code), which indeed is quite a lot faster than ITC. I don't understand your comment about the '816 needing better support for X as a data-stack pointer (for the most part I find it adequate). Be that as it may, there's at least one other fast alternative -- namely, Forth that uses DTC (direct-threaded-code).
I've given some thought to the subject of the design for a Forth for the 65c816. I don't have any intention of actually implementing this though --- there aren't enough people interested in the 65c816 (the only people I know of who use the 65c816 are on this forum, there are only a handful of them, and most of them are convinced that I am not a "language expert" in regard to Forth) --- in the unlikely case that a lot of interest develops, I might do it.
I briefly examined Garth's Forth and IMHO there are three problems:
- It is ITC rather than STC.
- It uses X rather than S for the data-stack (note that UR/Forth used SP for the data-stack and BP for the return-stack).
- It holds the entire data-stack in memory, rather than told the TOS in A (note that UR/Forth held the TOS in BX).
- I'm holding the TOS in A --- always holding the TOS in a register is well-known to boost the speed significantly --- note that always holding both the SOS and TOS in a register doesn't work very well (and is a non-issue on the 65c816 that suffers from a shortage of registers).
- I'm using S as the data-stack --- this allows me to use PHA and PLA for DUP and DROP respectively --- these are one-byte instructions, so they can be inlined to increase speed and are 1/3 the size of a JSR so they also reduce bloat.
- STC uses JSR/RTS, which is faster than NEXT in ITC (or DTC).
- STC uses BEQ (along with some code to pop the TOS and test it) rather than a 0BRANCH primitive and BRA rather than a BRANCH primitive, which is much faster (and speeds up loops which tend to be the big time-sinks in most programs).
- STC allows small functions such as OVER + @ ! etc. to be inlined, which is much faster than doing a function call, and in some cases is smaller than a JSR.
- STC allows jump-termination (replace JSR xxx RTS with JMP xxx), which boosts the speed and helps to avoid return-stack overflow in recursive functions.
- STC allows peephole-optimization to be done. This mostly involves using literal values as operands. For example, a @ or ! to an absolute address does not require the address to be loaded into a register and then the register used indirectly, but can be optimized to use either absolute or zero-page addressing-modes.
- STC allows ISRs to be written in Forth and to have very little overhead --- reducing interrupt latency is often the best way to boost the speed of micro-controllers.
I wrote a document describing my design --- it is short enough that I can just inline it in this post:
Code: Select all
VINO816.TXT --- copyright: April 2018, Hugh Aguilar
This is called vino816 Forth because a person would have to be drunk to think that programming Forth on the 65c816 is a good idea --- it is slower than C or Pascal, but that is mostly because the 65c816 was designed to support C or Pascal rather than Forth --- it has the advantage over C or Pascal of providing an interactive development environment, which is useful for testing code.
This is STC (subroutine-threaded-code) meaning that the code mostly consists of JSR instructions --- some peephole-optimization can be done, generating inlined machine-code.
register usage:
A TOS (top-of-stack) of data-stack
X return-stack pointer
Y general-purpose
S data-stack pointer
Local variables are possible, but would need a zero-page pointer to be used as the local-stack pointer.
; -----------------------------------------------------------------------------------------------------
; These are a few simple macros:
macro FAST_ENTER ; done at the start of primitives that don't use Y internally
PLY
endm
macro FAST_LEAVE ; done at the end of primitives that start with FAST_ENTER
PHY
RTS
endm
1 equ 1st ; this is the SOS after FAST_ENTER has been done, or is the return-address
3 equ 2nd ; this is the SOS if FAST_ENTER has not been done
5 equ 3rd
7 equ 4th
macro ENTER ; done at the start of colon words or primitives that use Y internally
FAST-ENTER
DEX
DEX
STY 0,X
endm
macro LEAVE ; done at the end of functions that start with ENTER
LDY 0,X
INX
INX
FAST_LEAVE
endm
; Primitives that don't push or pull to the data-stack (such as NEGATE ROT etc.) can just leave the return-address on the data-stack and do RTS at the end.
; Primitives that push or pull to the data-stack can use FAST-ENTER and FAST-LEAVE at the end. This only works if they don't use Y internally.
; Primitives that push or pull to the data-stack and use Y internally have to do ENTER at the start and LEAVE at the end. They are going to be slow.
macro _DUP ; a -- a a
PHA
endm
macro _OVER ; a b -- a b a
_DUP
LDA 2nd,S
endm
macro _NIP ; a b -- b
PLY
endm
macro _DROP ; a --
PLA
endm
macro _SWAP ; a b -- b a
PLY
PHA
TYA
endm
macro _SRAP ; a b c -- c b a
LDY 3,S
STA 3,S
TYA
endm
macro _ROT ; a b c -- b c a
_SWAP ; -- a c b
_SRAP ; -- b c a
endm
macro _NROT ; a b c -- c a b
_SRAP ; -- c b a
_SWAP ; -- c a b
endm
macro _NEGATE ; a -- -a
EOR #$FFFF
CLC
ADC #1
endm
macro _ADD ; a b -- a+b
CLC
ADC 1st,S
_NIP
endm
; These are a few simple primitives:
TRUE: ; -- true ; note that TRUE is 1 rather than -1 as in ANS-Forth
FAST_ENTER
_DUP
LDA #1
FAST_LEAVE
DROP: ; a --
FAST_ENTER
_DROP
FAST_LEAVE
DUP: ; a -- a a
FAST_ENTER
_DUP
FAST_LEAVE
OVER: ; a b -- a b a
FAST_ENTER
_OVER
FAST_LEAVE
SWAP: ; a b -- b a ; don't use _SWAP because it assumes the return-address is removed
LDY 2nd,S ; 1st is still the return-address
STA 2nd,S
TYA
RTS
ROT: ; a b c -- b c a ; don't use _ROT because it assumes the return-address is removed
LDY 3rd,S
STA 3rd,S
TYA ; -- a c b
LDY 4th,S
STA 4th,S
TYA ; -- b c a
RTS
NROT: ; a b c -- c a b ; don't use _NROT because it assumes the return-address is removed
LDY 4th,S
STA 4th,S
TYA ; -- c b a
LDY 3rd,S
STA 3rd,S
TYA ; -- c a b
RTS
MINUS: ; a b -- a-b
ENTER
_NEGATE
_ADD
LEAVE
PLUS: ; a b -- a+b
ENTER
_ADD
LEAVE
; MINUS and PLUS are both slow because they use ENTER and LEAVE rather than FAST_ENTER and FAST_LEAVE --- they should be inlined.
NIP: ; a b -- b
FAST_ENTER
_NIP
FAST_LEAVE
FETCH: ; ptr -- n
TAY
LDA 0,Y
RTS
STORE: ; n ptr --
ENTER
TAY
PLA
STA 0,Y
LEAVE
; STORE is very slow because it uses ENTER and LEAVE --- most effort on peephole-optimization should involve inlining this code.
TO_R: ; a -- ; return: -- a
FAST_ENTER
DEX
DEX
STA 0,X
_DROP
FAST_LEAVE
R_FETCH: ; -- a ; return: a -- a
FAST_ENTER
_DUP
LDA 0,X
FAST_LEAVE
R_FROM: ; -- a ; return: a --
FAST_ENTER
_DUP
LDA 0,X
INX
INX
FAST_LEAVE
RDROP: ; -- ; return: a --
INX
INX
RTS