6502.org

Posted: **Mon Jan 02, 2017 12:35 pm**

One of the problems I had when first trying to write my own Forth is that the books were all pretty old, als in pre-ANSI old, and still used stuff like WORD and TIB for input. Since I'm at that stage again with Liara Forth, and have a far better grip on the material than with Tali (which should be rewritten), I thought I'd provide an introduction for people who want to create their own Forth. ANSI Forth actually makes sense, though I've found that Gforth's take is even a bit better.

ANSI Forth starts off with ABORT and QUIT that flow into each other, just like it always was, with QUIT providing the endless input loop. The difference is that ANSI tries to provide a common means of access for the four ways that Forth can receive data:

The keyboard (technically the "user input device")
A text string via EVALUATE
A file if there is an underlying OS that supports it
A block, because some things never die

ANSI does this with help of a new word, REFILL ( -- f ) (http://forth-standard.org/standard/core/REFILL), which is what QUIT calls. It returns a flag that indicates if there are new characters in the input buffer, regardless of the current source (except EVALUATE, see below). QUIT then does the parsing and finding and other stuff we'll get to in a second.

How does REFILL know where to look? If you still are using blocks, you start off with BLK @ (two words) which will return either the block number or a zero if the input source is not a block. In the second case, we use SOURCE-ID ( -- 0 | -1 | fileid ) (http://forth-standard.org/standard/file/SOURCE-ID). That returns 0 for the keyboard, -1 for a string via EVALUATE, or the file id for a file. Since blocks are usually not used anymore, SOURCE-ID is pretty much all you need.

It is REFILL's job to get the input from these various sources (except from EVALUATE). The obvious and most important one is from the keyboard, which is still done via ACCEPT ( addr u - u ) (http://forth-standard.org/standard/core/ACCEPT). It in turn is supposed to call KEY, preferably a vectored KEY so the input can come via console or serial line. As far as I can tell, the details of KEY vectoring are left up to the implementation.

So where does REFILL put the stuff? There can be various buffers for the various input sources, but you always get the current input buffer and its length from SOURCE ( -- addr u ) (http://forth-standard.org/standard/core/SOURCE). TIB and #TIB are gone, though the pointer to the current index of the current input buffer still lives in >IN. The input buffer is supposed to be treated as read-only.

The next step is parsing the input. WORD is still around, but considered outdated, none the least because it returns the string found as a character string and does this weird thing where it adds a space to the end of the input buffer (see http://forth-standard.org/standard/core/PARSE for a discussion on why WORD is considered bad these days). The two words that replace it are PARSE ( char "ccc<char>" -- c-addr u ), which is more general for any delimiter char, but doesn't skip leading delimiters, and PARSE-NAME ("name" -- addr u) (http://forth-standard.org/standard/core/PARSE-NAME) which skips leading strings and is really what you want for input. In Liara and Tali, PARSE-NAME drops through to PARSE after skipping leading strings and providing a space as the delimiter character.

Now that we have the string, we need to see if it is in the Dictionary. ANSI strangely enough still recommends FIND ( c-addr -- c-addr 0 | xt 1 | xt -1 ) (http://forth-standard.org/standard/core/FIND) which uses counted strings. You can get rid of them with COUNT, but this is still a pain. Gforth - but not ANSI - goes the logical next step of defining FIND-NAME ( addr u -- nt | 0 ) (https://www.complang.tuwien.ac.at/forth ... token.html) that takes a string and returns a token to the word if it is found. Obviously, the combination of PARSE-NAME FIND-NAME is a lot easier than fooling around with counted strings.

There is one gotcha here: FIND-NAME doesn't return an Execution Token (xt) like FIND, but a Name Token (nt). This is because Gforth sees a difference between them; Tali Forth for instance does not. Liara keeps the Dictionary headers separate from the actual code, which allows some more or less clever tricks with code flowing into each other, so FIND-NAME is fine: The nt is the pointer to a word's Dictionary header, and NAME>INT ( nt -- xt ) (https://www.complang.tuwien.ac.at/forth ... token.html) is the Gforth word to find the xt from a given nt.

That's pretty much it, or at least how I understand it. I have found the Forth Programmer's Handbook (Conklin and Rather, https://www.amazon.de/Forth-Programmers ... 419675494/) very helpful explaining how the new system works, far more helpful than the standard documents in fact. Liara's code (https://github.com/scotws/LiaraForth) is currently up to PARSE-NAME, with FIND-NAME etc. next on the list, in case that helps anybody with an actual, though very PRE-ALPHA implementation of all of this.

Posted: **Tue Jan 03, 2017 4:52 am**

Nice work (in progress), Scot! Can you briefly explain (or link me to an explanation of) why you changed your initial design and decided to incorporate the contents of dTOS in register Y instead of register A?

Mike B.

Posted: **Tue Jan 03, 2017 11:21 am**

Thank you! If you go to https://github.com/scotws/LiaraForth/tree/master/docs , there are two documents that start with "tos_" - one is a table with comparisons of different common words with a pure Direct Page Data Stack, and TOS for A and Y, and the other is some of the code I used for comparisons (at lot more I did on paper on public transport). The short version is that except for DROP, either Y or A as TOS was faster.

In practice, it is sort of a mixed bag. TYA and TAY are very inexpensive operations - 1 byte, 2 cycles each - as are stuff like the Forth instruction 1+ (INY, another 1 byte / 2 cycle instruction), where INC $00,X would be 2 bytes and 7 cycles (!). It's neat to be able to do LDA $0000,Y (that's lda.y 0000 in the listing) for certain stuff using TOS as the address. Testing TOS for zero is either automatic on LDY or just TYA. On the other hand, every time you push down on the stack, you add a rather expensive STY $00,X (sty.dx 00) for 2 bytes and 5 cycles (because Y register is 16 bit). Words like FIND-NAME use a lot of LDA (DP),Y (ldy.diy dp), and since X is blocked, I'm having to PHY/PLA, which costs horrible, horrible 9 cycles.

What may be worse, though, is that using Y as TOS makes simple things more complicated. Popping the stack is

Code: Select all

ldy.dx 00  ;  LDY $00,X
inx
inx

and I find myself making stupid mistakes because of the extra step. Since I'm doing a lot in straight assembler instead of JSR/RTS again like with Tali, it's not quite as bad; and a macro would help, of course, but that is just putting wallpaper over the problem. A quick INX INX is definitely easier to think about. If I were trying to write a small Forth, this would all be a no-brainer, but the focus here is on speed, and that's turning out to be more tricky.

I'm currently working on FIND-NAME (coded but buggy), and after that I'll be writing a first, provisional version of EXECUTE so I have a complete working loop. Then comes CREATE, to get a feeling for how the internals are going to work. The plan is to then go back and revisit a bunch of fundamental design decisions like what to use for TOS or the design of the dictionary headers before I start cranking out the other words. At the moment, I think it is perfectly possible I'll go back to a DP-only stack, but I'll want to write actual alternative code for a direct comparison before I decide.

Posted: **Tue Jan 03, 2017 8:05 pm**

scotws wrote:

ANSI tries to provide a common means of access for the four ways that Forth can receive data:

The keyboard (technically the "user input device")
A text string via EVALUATE
A file if there is an underlying OS that supports it
A block, because some things never die

Thanks for sharing your work, Scot. I have a Forth which desperately needs to have the input code revised, but I've been putting off the job. I appreciate having the topic summarized.

-- Jeff

Posted: **Thu Jan 05, 2017 4:01 pm**

Dr Jefyll wrote:

I have a Forth which desperately needs to have the input code revised, but I've been putting off the job.

I rebuilt the input code in my (x86 Linux) Forth in November 2013, apparently. Is the code something that you'd be interested in, either as an extract of the input logic or a copy of the entire repository?

Actually, the reason I haven't just shoved the entire thing onto github yet is that it's still called "bootstrap", and I haven't come up with a decent (or at least not horrible) name yet.

Posted: **Thu Jan 05, 2017 5:31 pm**

Thanks, nyef. I'll PM you if I need anything. First, though, will be an evaluation just to see what's the shortest distance between two points. It'd be nice to implement the most elegant and sophisticated solution, but a "good enough" solution may carry the day. OTOH maybe the optimal solution won't be that difficult.

Posted: **Tue Feb 21, 2017 1:47 am**

I forgot to add that there is a reference implementation of QUIT as part of the ANSI standard at http://forth-standard.org/standard/implement

Code: Select all

: QUIT 
   ( empty the return stack and set the input source to the user input device ) 
   POSTPONE [ 
     REFILL 
   WHILE 
     ['] INTERPRET CATCH 
     CASE 
     0 OF STATE @ 0= IF ." OK" THEN CR ENDOF 
     -1 OF ( Aborted ) ENDOF 
     -2 OF ( display message from ABORT" ) ENDOF 
     ( default ) DUP ." Exception # " . 
     ENDCASE 
   REPEAT BYE 
;

Note there is no ACCEPT, since it is hidden as part of REFILL, and you need CATCH and CASE.

Posted: **Wed Feb 22, 2017 2:19 pm**

Since it's related to text input, and I've just coded the first version of it for Liara Forth (https://github.com/scotws/LiaraForth), here is some general info about how numbers are handled by Forth.

If you look at the code for QUIT above, you'll there is a (informal) word called INTERPRET. ANSI defines no such thing (leaving the implementation details vague as always), but what this does is use PARSE-NAME (that's BL WORD COUNT in paleolithic Forths) to isolate the next word in the input stream, returning the string address and length as ( addr u ) on the Data Stack. The next step is for something like ANSI's FIND (or the far superior FIND-NAME ( addr u -- nt | 0 ) in Gforth) to use that string to see if there is such a word in the Dictionary. If yes, it returns a pointer to that word, either as the classical "execution token" (xt, Tali Forth) of the word's executable code, or as a "name token" (nt, Gforth and Liara Forth) which points to the Dictionary entry. Whatever, the word is executed, and control goes back to the loop.

Now, if the string is not found in the Dictionary, INTERPRET checks to see if it is a number.

The word at the heart of the process is >NUMBER ("to-number"). It has a complicated-looking stack comment ( ud1 addr1 u1 -- ud2 addr2 u2 ) and a scary description (http://forth-standard.org/standard/core/toNUMBER):

Quote:

ud2 is the unsigned result of converting the characters within the string specified by c-addr1 u1 into digits, using the number in BASE, and adding each into ud1 after multiplying ud1 by the number in BASE. Conversion continues left-to-right until a character that is not convertible, including any "+" or "-", is encountered or the string is entirely converted. c-addr2 is the location of the first unconverted character or the first character past the end of the string if the string was entirely converted. u2 is the number of unconverted characters in the string.

In practice, it's actually rather simple:

ud1 is the number you start out with, which is almost always zero. Think of it as reserving two cells on the Data Stack for the double-celled result.
addr1 u1 is the string of the word we're checking that we got from PARSE-NAME, which is the point of this all.
ud2 is the number we get out of the conversion, always in double cell format.
addr2 u2 is only important during failure, because it will point to the offending character. In practice, you just check u2 to see if it is zero, because that means all is well. If it is not zero, the conversion has failed and we ABORT the whole interpretation process with "Undefined word", because a number was our last hope after we found out the string wasn't in the Dictionary.

Starting Forth (https://www.forth.com/starting-forth/10 ... operators/) has a fun demonstration of how >NUMBER works. Rewritten with modern words:

Code: Select all

: PLUS ( n "string" -- )  0. PARSE-NAME >NUMBER 2DROP D>S + ." = " . ;

The 0. is ud1 in the official definition and creates a double-celled zero on the stack. PARSE-NAME gets a string from the user, which is then converted by >NUMBER. We drop the addr2 u2 part with 2DROP, leaving the result as a double cell number on the stack, which we convert to single-cell with D>S (in some Forths this can just be DROP, but D>S is more portable). The rest is output that allows this word to be used for infix notation, because

Code: Select all

2 PLUS 13

will give you the result "15". Who says Forth can't do infix?

However, >NUMBER is a pain to use directly. The double cell at the beginning and the usually useless string at the end is only part of the problem: The word will barf at a minus sign and doesn't respect the dot at the end of a number that indicates a double celled word (Gforth lets you puts dots and a whole bunch of other stuff pretty much anywhere to signal a double cell word, but AFAIK ANSI only allows the dot at the end and that's the way Liara does it).

In practice, you use NUMBER ( addr u -- n | d ) to convert numbers. For some reason, this is not part of the ANSI standard, but the Forth Programmer's Handbook (Conklin & Rather) lists it as "common", which I think might be a polite way of saying "sane". NUMBER takes the number string, with any minus and dot, and returns either a single or double cell number. If the conversion fails, it complains and calls ABORT (or THROW, depending on your Forth) with the "Unknown word" error.

In other words, it is a wrapper for >NUMBER. The steps are:

If the first character is a minus, make a note of that fact (set a flag), and remove the first character (in Forth this is usually 1 /STRING, in assembler you add one to the address and decrement the length of the string by one).
If the last character is a dot, make a note of that fact (set a flag), and remove the last character (by DECing the string length, for instance).
Set up everything for >NUMBER by making sure ( 0 0 addr u ) is on the stack.
Call >NUMBER, where (waves hands in the air) stuff is done with things to actually convert the number.
If the length of the string that was returned is not zero, complain about an unknown word and ABORT or THROW.
If all is well, dump the string information (usually 2DROP), leaving the number as a double cell on the stack
Remember if we were given a single or double cell number. If it was single cell, convert it (usually D>S).
Remember if we were given an negative number. If yes, either NEGATE or DNEGATE the result, depending if we are single or double celled.

Which leaves us with the question of how >NUMBER works under the hood. I'll try to get to that later.

Posted: **Thu Feb 23, 2017 3:07 pm**

So, under the hood of >NUMBER. This will be a bit more technical.

As discussed above, we arrive at the word with ( ud addr u ) and want to return the same stack structure. In between, the job is to take a string such as "42" and turn it into the number $2A while taking into account the current radix (which we get from BASE).

>NUMBER is usually realized in high level code. For example, Andrew uses (https://github.com/andrew-jacobs/w65c26 ... -forth.lst):

Code: Select all

BEGIN
   DUP WHILE
       OVER C@ DIGIT?
       0= IF DROP EXIT THEN
       >R 2SWAP BASE @ UD*
       R> M+ 2SWAP
       1 /STRING
REPEAT ;

This shows us the basic steps:

Walk through the string, checking each character
If the character is not a legal digit for the current radix, quit with what we have
Otherwise, multiply the number we got by the radix and add it to the total

A large part of the work revolves around taking an ASCII character, checking if it is a legal digit, and converting it to the correct numerical value. Again, there is no ANSI word for this, and even worse, no "common" name like we see with NUMBER. Andrew uses DIGIT? in the code above, as does Gforth, while pForth uses DIGIT, so the rough consensus seems to be DIGIT? (Liara currently uses CHAR>NUMBER, but I will rename that to DIGIT? as well). Andrew's version again, reformatting his - certainly not my - comments:

Code: Select all

    [ HEX ] DUP 39 > 100 AND +    \ silly looking
    DUP 140 > 107 AND -   30 -    \ but it works!
    DUP BASE @ U<

This mucking about with ASCII values can obviously be done in assembler as well for greater speed. It just has to return two things: A flag as TOS and the numerical value as NOS.

If you're surprised by the non-standard word UD* above, it helps to know that it is defined as

Code: Select all

DUP >R UM* DROP  SWAP R> UM* ROT +

UM* (unsigned mix-size multiplication) takes two single-cell numbers and returns the double-cell result. If we don't define UD*, we end up with a bit messier code - here a snippet of the loop from pForth (https://github.com/philburk/pforth/blob ... mberio.fth):

Code: Select all

    WHILE ( -- ud1 c-addr n  )
        SWAP >R  ( -- ud1lo ud1hi n  )
        SWAP  BASE @ ( -- ud1lo n ud1hi base  )
        UM* DROP ( -- ud1lo n ud1hi*baselo  )
        ROT BASE @ ( -- n ud1hi*baselo ud1lo base )
        UM* ( -- n ud1hi*baselo ud1lo*basello ud1lo*baselhi )
        D+  ( -- ud2 )
        R> 1+     \ increment char*
        R> 1- >R  \ decrement count
    REPEAT

Gforth uses UM* as well, but hides the whole mess in the word ACCUMLATE defined as

Code: Select all

SWAP >R SWAP USERADDR <70>  @ UM* DROP ROT USERADDR <70>  @ UM* D+ R>

UM* is so popular because there are various well-documented ways to implement 16 bit * 16 bit = 32 bit numbers (see http://6502.org/source/integers/fastmult.htm and http://wilsonminesco.com/16bitMathTables/index.html for variants with tables), so we'll skip the details.

For those looking to implement >NUMBER in assembler, the general procedure for the 65816 can be described as follows:

First, shamelessly steal Garth's idea of moving everything to a scratch pad on the Direct Page for speed from his reference implementation of UM/MOD (http://6502.org/source/integers/ummodfix/ummodfix.htm). In fact, you can use the same memory locations. We map them as follows:

Code: Select all

                +-----+-----+-----+-----+-----+-----+-----+-----+
                |   UD-LO   |   UD-HI   |     N     | UD-HI-LO  |
                |           |           |           |           |
                |  S    S+1 | S+2   S+3 | S+4   S+5 | S+6   S+7 |
                +-----+-----+-----+-----+-----+-----+-----+-----+

UD-LO is the low cell of the double-cell number ud we were given as the fourth stack entry, UD-HI the high cell. The next to cells are for temporary storage: N is the number that DIGIT? gave us, and UD-HI-LO is a product of the first multiplication step we'll get to in a minute. We start off by copying UD-LO and UD-HI from the stack to the scratch pad.

Have DIGIT? convert the next character and store the result as N in S+4 (remember, this is 16-bit).
Using UM*, multiply the radix from BASE with UD-HI, giving us a double cell number with the cells UD-HI-LO and UD-HI-HI (speaking of silly).
Discard UD-HI-HI, and store HD-HI-LO in S+6
Again using UM*, multiply N with UD-LO. The result replaces the UD-LO in S and UD-HI in S+2
Use a version of D+ - add double cell numbers - to add ( UD-LO-LO UD-LO-HI ) and ( N UD-HI-LO )
Place the result in S and S+2, and start over with the next character

There is obviously a lot of room for optimization here, starting with an assembler version of UD* instead of calling UM* twice, and not storing stuff every time. But that seems to be the general idea.

When we're done, we copy S and S+1 back to the third and fourth stack locations. If we use ( addr u ) themselves as the pointer/counter for the operation, we don't have to fool around with the stack at all, and can simply return to the caller (usually NUMBER) when were done or something goes wrong.

And that's pretty much it for number conversion.

Posted: **Thu Feb 23, 2017 3:37 pm**

scotws wrote:

Andrew's version again, reformatting his - certainly not my - comments:

Code: Select all

    [ HEX ] DUP 39 > 100 AND +    \ silly looking
    DUP 140 > 107 AND -   30 -    \ but it works!
    DUP BASE @ U<

This mucking about with ASCII values can obviously be done in assembler as well for greater speed. It just has to return two things: A flag as TOS and the numerical value as NOS.

My number parsing is otherwise dreadful, but I use similar logic for picking off the digits (neither version seems to accept minuscule digits, both the above and below only accept majuscule digits after 9):

Code: Select all

: ?digit ( c -- n t/f )
  30 -                  ( displace ASCII 0 to zero )
  DUP 9 > OVER 11 < AND ( gap between ASCII 9 and A )
  OR                    ( set c to -1 if it's in the gap )
  DUP 9 > 7 AND -       ( close the gap )
  DUP BASE @ U<         ( bounds-check against BASE )
 ;

Edit, a couple of hours later: If we add

Code: Select all

DUP 60 > 20 AND XOR

to the start of either word, it should convert lower-case to upper-case before conversion. This isn't a general "toupper" method, though, as it doesn't do a range-check, and thus damages the non-alphabetic characters in the range, but since we're only checking for alphabetics here it is sufficient.

Edit again, a few minutes later, to produce actual working code.

Posted: **Thu Nov 30, 2017 9:57 am**

scotws wrote:

How does REFILL know where to look? So where does REFILL put the stuff?

In my Forth, these kind of things are all done with a small table of vectors. (A class with virtual methods, if you will.) When the input source is changed, the table pointer is changed accordingly. The vectors include REFILL, input buffer address, whether to print a prompt (?prompt below), etc.

scotws wrote:

The next step is parsing the input.

I use the modern PARSE-NAME like everyone else, but my FIND-NAME is only a minor update from FIND.

Code: Select all

create interpreters   ' compile, ,  ' number ,  ' execute ,
: interpret-xt ( xt +-1 | addr u 0 ) 1+ cells interpreters + @ execute ;
: parsed ( addr u -- ) find-name interpret-xt ;
: interpret   begin parse-name dup while parsed repeat 2drop ;
: quit   ... begin refill while interpret ?prompt repeat ;

[ and ] changes the first item in interpreters to execute or compile,.

6502.org

How text input works with ANSI Forth (with Gforth tricks)

How text input works with ANSI Forth (with Gforth tricks)

Re: How text input works with ANSI Forth (with Gforth tricks

Re: How text input works with ANSI Forth (with Gforth tricks

Re: How text input works with ANSI Forth (with Gforth tricks

Re: How text input works with ANSI Forth (with Gforth tricks

Re: How text input works with ANSI Forth (with Gforth tricks

Re: How text input works with ANSI Forth (with Gforth tricks

Re: How text input works with ANSI Forth (with Gforth tricks

Re: How text input works with ANSI Forth (with Gforth tricks

Re: How text input works with ANSI Forth (with Gforth tricks

Re: How text input works with ANSI Forth (with Gforth tricks