How text input works with ANSI Forth (with Gforth tricks)
Posted: Mon Jan 02, 2017 12:35 pm
One of the problems I had when first trying to write my own Forth is that the books were all pretty old, as in pre-ANSI old, and still used stuff like WORD and TIB for input. Since I'm at that stage again with Liara Forth, and have a far better grip on the material than with Tali (which should be rewritten), I thought I'd provide an introduction for people who want to create their own Forth. ANSI Forth actually makes sense, though I've found that Gforth's take is even a bit better.
ANSI Forth starts off with ABORT and QUIT that flow into each other, just as they always have, with QUIT providing the endless input loop. The difference is that ANSI tries to provide a common means of access for the four ways that Forth can receive data:
- The keyboard (technically the "user input device")
- A text string via EVALUATE
- A file if there is an underlying OS that supports it
- A block, because some things never die
ANSI does this with the help of a new word, REFILL ( -- f ) (http://forth-standard.org/standard/core/REFILL), which is what QUIT calls. It returns a flag that indicates whether there are new characters in the input buffer, regardless of the current source (except EVALUATE, see below). QUIT then does the parsing and finding and other stuff we'll get to in a second.
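To make the role of REFILL concrete, here is a stripped-down sketch of such a loop. MY-QUIT and INTERPRET-LINE are names I made up for illustration, and a real QUIT also resets the return stack and handles errors, which this leaves out. Instead of looking words up, it just echoes each token.

```forth
\ Hypothetical names: INTERPRET-LINE and MY-QUIT are not standard words.

\ Walk through the current input buffer one token at a time.
: interpret-line ( -- )
   begin
      parse-name dup        \ next token; length is 0 at end of buffer
   while
      type space            \ a real system would look up and execute here
   repeat
   2drop ;                  \ discard the final empty string

\ Keep asking REFILL for input until the source runs dry.
: my-quit ( -- )
   begin
      refill                \ true if the input buffer got new content
   while
      interpret-line ."  ok" cr
   repeat ;
```

Running MY-QUIT from the keyboard should echo each line token by token until the input ends (end of file on the terminal).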
How does REFILL know where to look? If you are still using blocks, you start off with BLK @ (two words), which returns either the block number or zero if the input source is not a block. If it is zero, we use SOURCE-ID ( -- 0 | -1 | fileid ) (http://forth-standard.org/standard/file/SOURCE-ID). That returns 0 for the keyboard, -1 for a string via EVALUATE, or the file id for a file. Since blocks are rarely used anymore, SOURCE-ID is pretty much all you need.
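The decision described above can be sketched as a small word that names the current input source. SOURCE-NAME is a hypothetical name, and the sketch assumes a system that provides BLK (the Block word set) and CASE, as Gforth does.

```forth
\ Hypothetical helper: return a string naming the current input source.
: source-name ( -- c-addr u )
   blk @ if s" block" exit then        \ non-zero BLK: input is a block
   source-id case
      0  of s" keyboard"        endof  \ user input device
      -1 of s" EVALUATE string" endof
      s" file" rot                     \ default: keep selector on top for ENDCASE
   endcase ;
```

Typing `source-name type` at the prompt should report the keyboard, while the same line inside a file being loaded should report a file.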
It is REFILL's job to get the input from these various sources (except from EVALUATE). The obvious and most important one is from the keyboard, which is still done via ACCEPT ( addr u -- u ) (http://forth-standard.org/standard/core/ACCEPT). It in turn is supposed to call KEY, preferably a vectored KEY so the input can come via console or serial line. As far as I can tell, the details of KEY vectoring are left up to the implementation.
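Since the vectoring is left open, here is one possible way to do it, using DEFER and IS as Gforth provides them. MY-KEY and MY-ACCEPT are made-up names, and the sketch ignores backspace and other line editing that a usable ACCEPT needs.

```forth
\ Hypothetical names: MY-KEY and MY-ACCEPT are for illustration only.
defer my-key ( -- char )        \ the vector; point it at any input routine
' key is my-key                 \ default: the normal console KEY

\ Read characters into the buffer until CR/LF or the buffer is full.
: my-accept ( c-addr +n1 -- +n2 )
   over + over                  ( start end cur )
   begin
      2dup u>                   \ room left in the buffer?
   while
      my-key
      dup 13 = over 10 = or     \ CR or LF ends the line
      if drop nip swap - exit then
      dup emit                  \ echo the character
      over c! char+             \ store it and advance
   repeat
   nip swap - ;                 \ buffer full: return the count
```

Switching to a serial line would then only mean something like `' serial-key is my-key`, where SERIAL-KEY stands for whatever word reads a character from that device.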
So where does REFILL put the stuff? There can be various buffers for the various input sources, but you always get the current input buffer and its length from SOURCE ( -- addr u ) (http://forth-standard.org/standard/core/SOURCE). TIB and #TIB are gone, though the pointer to the current index of the current input buffer still lives in >IN. The input buffer is supposed to be treated as read-only.
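As a small illustration of how SOURCE and >IN work together, the following sketch returns the part of the input buffer that has not been parsed yet. The three word names are hypothetical, and /STRING comes from the String word set. Note that only >IN is written to, never the buffer itself.

```forth
\ Hypothetical helpers built on SOURCE and >IN.

\ The unparsed remainder of the current input buffer.
: rest-of-line ( -- c-addr u )
   source >in @ /string ;

\ Mark the whole buffer as consumed by moving >IN to the end.
: skip-line ( -- )
   source nip >in ! ;

\ Print whatever follows on the line, then skip it.
: .rest ( -- )
   rest-of-line type skip-line ;
```

Typing `.rest hello world` at the prompt should print the text that follows `.rest` instead of trying to interpret it.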
The next step is parsing the input. WORD is still around, but considered outdated, not least because it returns the string found as a counted string and does this weird thing where it adds a space after the string (see http://forth-standard.org/standard/core/PARSE for a discussion on why WORD is considered bad these days). The two words that replace it are PARSE ( char "ccc<char>" -- c-addr u ), which is more general because it takes any delimiter char but doesn't skip leading delimiters, and PARSE-NAME ( "name" -- c-addr u ) (http://forth-standard.org/standard/core/PARSE-NAME), which skips leading spaces and is really what you want for input. In Liara and Tali, PARSE-NAME drops through to PARSE after skipping leading spaces and providing a space as the delimiter character.
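In high-level Forth, the "skip spaces, then fall through to PARSE" approach might look roughly like this. It is a sketch only, not the Liara or Tali code (which is assembler). SKIP-SPACES and MY-PARSE-NAME are made-up names, and for simplicity only the space character counts as whitespace, not tabs or other control characters.

```forth
\ Hypothetical names; only BL is treated as whitespace here.

\ Advance >IN past any leading spaces in the input buffer.
: skip-spaces ( -- )
   begin
      >in @ source nip u<                  \ still inside the buffer?
   while
      source drop >in @ chars + c@ bl <>   \ found a non-space character?
      if exit then
      1 >in +!
   repeat ;

\ PARSE-NAME as a thin layer on top of PARSE.
: my-parse-name ( "name" -- c-addr u )
   skip-spaces bl parse ;
```

At the end of the buffer this returns a string of length zero, which is how the caller can tell there is nothing left to parse.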
Now that we have the string, we need to see if it is in the Dictionary. ANSI strangely enough still recommends FIND ( c-addr -- c-addr 0 | xt 1 | xt -1 ) (http://forth-standard.org/standard/core/FIND), which uses counted strings. COUNT turns a counted string into an address and length, but PARSE-NAME already gives us that form, so to use FIND we would have to go the other way, which is a pain. Gforth - but not ANSI - takes the logical next step of defining FIND-NAME ( addr u -- nt | 0 ) (https://www.complang.tuwien.ac.at/forth ... token.html) that takes a string and returns a token for the word if it is found, or zero if it is not. Obviously, the combination of PARSE-NAME FIND-NAME is a lot easier than fooling around with counted strings.
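To show the difference in effort, here are both routes side by side. The word names and the CBUF buffer are hypothetical. The first route only needs standard words, the second relies on Gforth's FIND-NAME.

```forth
\ Hypothetical names: CBUF, >COUNTED, LOOKUP-ANS, LOOKUP-GFORTH.

\ Route 1 (standard): FIND wants a counted string, so copy the
\ parsed string into a buffer with a leading length byte first.
create cbuf 256 chars allot

: >counted ( c-addr u -- c-addr' )
   255 min dup cbuf c!          \ store the (clipped) length
   cbuf char+ swap move         \ copy the characters after it
   cbuf ;

: lookup-ans ( "name" -- c-addr 0 | xt 1 | xt -1 )
   parse-name >counted find ;

\ Route 2 (Gforth only): no copying needed.
: lookup-gforth ( "name" -- nt | 0 )
   parse-name find-name ;
```

With this, `lookup-ans dup` should leave an xt and a non-zero flag, while `lookup-gforth dup` leaves a name token.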
There is one gotcha here: unlike FIND, FIND-NAME doesn't return an Execution Token (xt), but a Name Token (nt). This is because Gforth sees a difference between them; Tali Forth for instance does not. Liara keeps the Dictionary headers separate from the actual code, which allows some more or less clever tricks with code flowing into each other, so FIND-NAME is fine: the nt is the pointer to a word's Dictionary header, and NAME>INT ( nt -- xt ) (https://www.complang.tuwien.ac.at/forth ... token.html) is the Gforth word to find the xt from a given nt.
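Putting the pieces together, the whole chain from input text to execution might be sketched like this in Gforth. RUN-WORD is a made-up name, and the sketch only covers interpreting, not compiling or number conversion. The -13 is the standard THROW code for an undefined word.

```forth
\ Hypothetical word: parse the next name, look it up, and execute it.
: run-word ( i*x "name" -- j*x )
   parse-name find-name         \ nt, or 0 if not in the Dictionary
   dup 0= -13 and throw         \ -13 is the "undefined word" throw code
   name>int                     \ nt -> xt
   execute ;
```

For example, `2 3 run-word +` should leave 5 on the stack, the same as typing `2 3 +` directly.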
That's pretty much it, or at least how I understand it. I have found the Forth Programmer's Handbook (Conklin and Rather, https://www.amazon.de/Forth-Programmers ... 419675494/) very helpful in explaining how the new system works, far more helpful than the standard documents in fact. Liara's code (https://github.com/scotws/LiaraForth) is currently up to PARSE-NAME, with FIND-NAME etc. next on the list, in case that helps anybody with an actual, though very PRE-ALPHA implementation of all of this.