PostPosted: Fri Aug 27, 2004 9:44 pm
Joined: Thu Mar 11, 2004 7:42 am
Posts: 362
kc5tja wrote:
I'm getting pretty sick and tired of all the misinformation that exists about Forth.


Getting laughed at for using Forth comes with the territory, unfortunately. A lot of people are less than receptive to Forth, but at least they don't laugh as hard once they see the sort of fast and powerful programs that can be written with it.

kc5tja wrote:
It's also a fallacy that if a language is interpreted, it must necessarily be slow.


True, but most interpreted languages (e.g. BASIC) are/were optimized for space, not for speed. When speed became an issue, they were compiled and the interpreter was ditched. If you look at the heyday of interpreted languages, fast interpretation wasn't exactly a popular approach (e.g. thanks to FIG-Forth -- and the Forth standards -- indirect threading was common on the 6502, even though subroutine threading is much faster, and in many ways simpler -- well, it's not interpreted either, so this isn't a perfect example), and common misconceptions like this arise.


kc5tja wrote:
The representation of the compiled software can range from a vector of addresses (where each address represents another Forth word), all the way up to natively generated machine language. I often have to wonder why BASIC interpreters of old didn't represent their programs in this manner -- it sure would have made them run so very much faster.


Early BASIC interpreters were written with two dominant restrictions: (a) a small amount of ROM to work with (maybe 8k, maybe only 4k), and (b) a small amount of RAM (sometimes as little as 1k, though 4k was more common). With minimal space as the driving force, the result was simple parsers that don't optimize (much) and tokenized programs. Native code did eventually happen -- with BASIC compilers.

I've been jotting down ideas for a fast interpreted BASIC that I haven't started writing yet. In terms of major 6502 projects, right now it's behind at least my write-up of 6502 decimal mode (including undocumented behavior), which I am currently working on (but isn't getting any closer to being completed now that I'm writing long-winded forum posts), and finishing my functional, but incomplete and undocumented, 6502 subroutine threaded Forth.

My BASIC was partly inspired by Apple (Integer) BASIC, which was written by Steve Wozniak. Wozniak used ideas from HP BASICs (he had worked at HP), and of course HP was well-known for their stack-oriented RPN calculators. "Integer" BASIC is really not all that far away from the address interpreter you suggest. It uses a sophisticated parser which does all of its syntax checking at parse time rather than execution time. It also converts constants at parse time to their two's complement representation rather than storing them as ASCII. Therefore, unlike many BASICs (e.g. EhBASIC and Applesoft) you never got a SYNTAX ERROR at run time, and you weren't converting constants (e.g. the statement PRINT 99) or line numbers (e.g. the statement IF A < B THEN 10) each and every time they were encountered.

There were other things that could have been handled at parse time, but weren't, such as checking for too many parentheses in an expression, and determining whether you needed the address of a variable so you could store a value in it (e.g. a LET or an INPUT statement) or if only the contents of the variable were needed (e.g. PRINT A). (Determining the latter at parse time would have sped up expression evaluation considerably.)

To use Wozniak's terms, the tokenized program contained "nouns" and "verbs". Nouns were constants (numbers), string literals, and variable names, and verbs were basically everything else. In Forth terms, an example of a compiled literal (in assembly) is DW LIT,10. LIT would be the verb and 10 would be the noun.

The BASIC program was stored in infix order. All verbs took some action (although in some cases the action was to do nothing). Different tokens could represent the same keyword (verb). For example, there were three tokens that would list as PRINT. One token was for PRINTing numbers, one for strings, and one for CRs (i.e. when PRINT had no arguments). In Forth, these three actions have three different names: . (dot), TYPE, and CR, but in BASIC they all had the name PRINT. Which token to use was determined at parse time based on syntax and context, so the parser was quite clever. Interestingly, even delimiters such as commas and semicolons were verbs. So a statement like PRINT 12+34;56 would be stored as:

Code:
DB PRINT_number_token           ; PRINT (numeric form)
DB literal_number_token         ; the constant 12
DW 12
DB plus_token                   ; +
DB literal_number_token         ; the constant 34
DW 34
DB semicolon_print_number_token ; ; (followed by a number)
DB literal_number_token         ; the constant 56
DW 56


The tokens for PRINT and semicolon were different (but this was only so that LIST knew which one to print), but used the exact same routine (i.e. both went to the same address). The Forth equivalent, 12 34 + . 56 . (usually) compiles to DW LIT,12,LIT,34,PLUS,DOT,LIT,56,DOT. Forth can just execute "verbs" as it encounters them, but since the BASIC program is in infix order, the (inner) interpreter must check precedence to determine if it can execute a verb.

Incidentally, there were other examples of multiple tokens for a single keyword. An equals sign (=) had one token for a numeric comparison (e.g. IF A=1 THEN PRINT), a different token for a string comparison (e.g. IF A$="A" THEN PRINT), another for an assignment statement (e.g. LET A=1), etc. There were over a dozen tokens that LIST as a comma, since the comma in a POKE 0,0 statement takes a different action than the comma in a NEXT A,B statement, etc.

You may have already guessed one idea I've got: use a sophisticated parser AND a sophisticated LIST. The program would be entered in infix order, stored in postfix order, and LISTed in infix order. Thus, the parser would convert infix to postfix, and LIST would convert postfix back to infix. By moving precedence checking from the (inner) interpreter to the parser and to LIST, the interpreter will be very fast (and Forth-like).
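
For example, using the token names from the Integer BASIC example above, the statement PRINT 12+34;56 stored in postfix order might look something like this (a sketch of the idea, not a finished design):

Code:
DB literal_number_token         ; 12
DW 12
DB literal_number_token         ; 34
DW 34
DB plus_token                   ; +
DB PRINT_number_token           ; print the sum
DB literal_number_token         ; 56
DW 56
DB semicolon_print_number_token ; print it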

In Forth, to insert code, you have to recompile everything after the insertion point, since the addresses of any subsequent words will change. In BASIC, I'd like to retain the ability to insert a line painlessly, so I am planning to start with tokens rather than addresses. It's possible to write a token interpreter that compares favorably to an address interpreter, in terms of speed.
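
To give a rough idea, a one-byte-token dispatch loop on the 6502 could look like this (a sketch with hypothetical names: TXTPTR is a zero-page pointer into the tokenized program, W is a zero-page scratch pointer, and TOKEN_LO/TOKEN_HI are tables of handler addresses indexed by token value):

Code:
NEXT   LDY #0
       LDA (TXTPTR),Y ; fetch the next token
       INC TXTPTR     ; bump the program pointer
       BNE NEXT1
       INC TXTPTR+1
NEXT1  TAX
       LDA TOKEN_LO,X ; look up the handler address
       STA W
       LDA TOKEN_HI,X
       STA W+1
       JMP (W)        ; dispatch; each handler ends by jumping back to NEXT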

I've gathered enough other intriguing ideas (e.g. I like the Acorn assembler idea, but I'm planning to try a one-pass assembler instead) that I plan to write this BASIC at some point. Obviously, the plans I've mentioned are subject to change once I start writing and discover what the tradeoffs are.

kc5tja wrote:
The disadvantage of this method is that it consumes slightly more space. But, this is a classic size/speed trade-off issue, and the ratio isn't that bad in practice.


Another Forth misconception! In terms of space, it is not necessarily true that subroutine threading > direct threading > indirect threading.

The code definition example: CODE z NOP, NEXT, END-CODE

Indirect threading (6 bytes):

Code:
z: DW  z+2
   NOP
   JMP NEXT


Direct threading (4 bytes):

Code:
z: NOP
   JMP NEXT


Subroutine threading (2 bytes):

Code:
z: NOP
   RTS


So for code definitions, indirect threading is always the largest, and subroutine threading is always smallest (and fastest, to boot). Subroutine threading also seems to encourage writing more code definitions, so there's quite a bit of space savings to be had.

The colon definition example: : z a b c d ;

Indirect threading (12 bytes):

Code:
z: DW  DO_COLON
   DW  a
   DW  b
   DW  c
   DW  d
   DW  EXIT


Direct threading (13 bytes):

Code:
z: JSR DO_COLON
   DW  a
   DW  b
   DW  c
   DW  d
   DW  EXIT


Subroutine threading (12 bytes):

Code:
z: JSR a
   JSR b
   JSR c
   JMP d


For a shorter definition, subroutine threading will take the least amount of space. The colon definition example: : z a b c d e ;

Indirect threading: 14 bytes
Direct threading: 15 bytes
Subroutine threading: 15 bytes

For anything longer, yes, subroutine threading will take the most space, but good Forth programming practice dictates that definitions should be short, so the space penalty of subroutine threading is smaller than you might think.

You can also save space (and increase speed) with subroutine threading in some cases by inlining assembly, rather than writing a single-use code definition. In subroutine threading, with no NEXT routine, the Y register is usually free (I've even used it as a loop counter, to get a terrific benefit in both speed and space).
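
As a sketch of the Y-as-loop-counter idea (hypothetical labels, and it assumes whatever is JSRed to inside the loop leaves Y alone):

Code:
      LDY #10     ; Y is free in STC, so use it as the counter
LOOP  JSR step    ; the word (or inline code) to repeat
      DEY
      BNE LOOP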

When I decided to write a subroutine threaded Forth, I thought I was going to get clobbered in terms of space. But that hasn't been the case so far. Now, if I could just bring myself to use postfix assembly with subroutine threading, rather than the more-legible (at least to me) prefix assembly...



PostPosted: Sat Aug 28, 2004 6:38 am
Joined: Fri Aug 30, 2002 1:09 am
Posts: 8534
Location: Southern California
Since this discussion subject has quickly evolved so much, we really ought to move it to the Forth area; but since I'm responding to two different things here, I don't know what to call the subject if I move it there myself now, so we're still here.


> When I decided to write a subroutine threaded Forth, I thought I was
> going to get clobbered in terms of space. But that hasn't been the case
> so far.

Interesting. Can you give some idea of how much space you're finding you need for a kernel, compared to what would be needed for the same thing in indirect-threaded?


> Now, if I could just bring myself to use postfix assembly with
> subroutine threading, rather than the more-legible (at least to me)
> prefix assembly...

You don't have to. The quick and dirty way to make a prefix Forth assembler which I used in my Forth was to just make mnemonics that had the address mode built in, like LDA_ABS, LDA(ZP,X), etc.. and comma-in the operand. The mnemonics just compile the appropriate machine-language op code byte. They don't need to even be in a separate vocabulary because they don't conflict with anything else in Forth. For example, AND is always in Forth but if you want the assembly AND, it'll have an address mode associated with it, like AND_ZP or AND#. The branches were a minor challenge, but still not complicated because the assembler in Forth is not normally used for large applications, only writing primitives, runtimes, maybe ISRs, and things like that. I am working on making my branching method more user-friendly though. When I thought of this, I was able to initially write the assembler in an evening. It made me wonder why I didn't do it years earlier. It has been so handy.
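
A minimal sketch of the idea (hypothetical definitions, not the actual source; assumes HEX is in effect and uses the standard C, and , compiling words):

Code:
HEX
: LDA#    A9 C, ;   \ usage:  LDA# 12 C,     ( immediate operand )
: LDA_ZP  A5 C, ;   \ usage:  LDA_ZP 20 C,   ( zero-page address )
: STA_ABS 8D C, ;   \ usage:  STA_ABS 1234 , ( 16-bit absolute address )
: RTS,    60 C, ;   \ no operand at all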



PostPosted: Sat Aug 28, 2004 7:30 am
Joined: Sat Jan 04, 2003 10:03 pm
Posts: 1706
dclxvi wrote:
Getting laughed at for using Forth comes with the territory, unfortunately.


People have a right to laugh if they wish. But when misconceptions and lies are the rule, it's impossible to offer a level field on which to discuss the merits of a language.

Quote:
True, but most interpreted languages (e.g. BASIC) are/were optimized for space, not for speed.


APL, J, and K can be implemented with rather trivial interpreters, and still achieve excellent speeds. The reason is that they are vector processing languages. They try hard to let you write programs without ever relying on explicit iteration. That is the secret to their speed.

J, for example, doesn't even tokenize its source code, and K stores subroutines as strings. :)

Quote:
well, it's not interpreted either, so this isn't a perfect example), and common misconceptions like this arise.


An aside: if you do a careful study of the academic papers written about compilers and interpreters, you'll find very few papers which actually distinguish between the two. This is because any interpreted language can be compiled, and vice versa.

Quote:
Early BASIC interpreters were written with two dominant restrictions: (a) a small amount of ROM to work with (maybe 8k, maybe only 4k), and (b) a small amount of RAM (sometimes as little as 1k, though 4k was more common). With minimal space as the driving force, the result was simple parsers that don't optimize (much) and tokenized programs. Native code did eventually happen -- with BASIC compilers.


BASIC is a horrible example of a language to use for comparison. The utter lack of any standard structure for BASIC meant that it was impossible to arrive at a regular, tokenized encoding for representing source (e.g., you can represent Lisp code as a perfect tree, and Forth code as an array of arrays; BASIC is . . . well.... err... I'll get back to you on that). For this reason, to interpret BASIC, you had to parse the tokenized stream as if it were raw source code, every time you executed a statement. It's this parsing that I was objecting to.

If BASIC designers had concentrated instead on truly optimizing their storage methods, they (a) could have drastically reduced the average storage requirements for a BASIC program, and (b) could have made it run substantially faster.

Quote:
rather than storing them as ASCII. Therefore, unlike many BASICs (e.g. EhBASIC and Applesoft) you never got a SYNTAX ERROR at run time, and you weren't converting constants (e.g. the statement PRINT 99) or line numbers (e.g. the statement IF A < B THEN 10) each and every time they were encountered.


Excellent! This is what I expected to see. I hadn't realized that this was already done. As far as I was aware, AppleBASIC for the Apple IIs didn't do this (at least, I don't remember them doing it), but then again, it was also floating point BASIC too, with the AppleDisk system. Not sure if that made any difference.

Quote:
You may have already guessed one idea I've got: use a sophisticated parser AND a sophisticated LIST. The program would be entered in infix order, stored in postfix order, and LISTed in infix order. Thus, the parser would convert infix to postfix, and LIST would convert postfix back to infix. By moving precedence checking from the (inner) interpreter to the parser and to LIST, the interpreter will be very fast (and Forth-like).


That will definitely work, and there is some prior art already: the Smalltalk programming language does precisely this. Given a chunk of bytecode, produced by a Smalltalk compiler, which is directly "executable" by the Smalltalk virtual machine, Smalltalk can produce the precise set of high-level commands that produced that exact bytecode. Note that Smalltalk's bytecode is based on a stack-architecture virtual machine, not entirely unlike Java's.

Quote:
In Forth, to insert code, you have to recompile everything after the insertion point, since the addresses of any subsequent words will change. In BASIC, I'd like to retain the ability to insert a line painlessly, so I am planning to start with tokens rather than addresses. It's possible to write a token interpreter that compares favorably to an address interpreter, in terms of speed.


Bytecode/tokens is just "token-threaded" -- several Forth implementations have been implemented token-threaded. The most popular of which is OpenFirmware, actually.

However, with that said, in my experience, recompiling software in Forth is so fast that the application I just beckoned has already been compiled before my finger leaves the ENTER key (assuming your Forth compiler's dictionary lookups are sufficiently efficient). Therefore, the issue of addresses, literally, is a complete non-issue for me.

Quote:
Another Forth misconception! In terms of space, it is not necessarily true that subroutine threading > direct threading > indirect threading.


What? Subroutine threading takes 3 bytes to store a reference to a word on the 6502, while both direct and indirect threading take 2. The numbers don't lie. Subroutine threading is smaller only for colon definitions with less than four words in the body. At 4 words, the overhead is the same, and beyond 4 words, ITC/DTC is smaller.
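
(Spelling out the arithmetic: an n-word body costs 2n + 4 bytes in ITC -- n cells plus the code field and EXIT, with DTC one byte more -- versus 3n bytes in STC -- n-1 JSRs plus a final JMP. 3n = 2n + 4 exactly at n = 4.)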

Quote:
For anything longer, yes, subroutine threading will take the most space, but good Forth programming practice dictates that definitions should be short, so the space penalty of subroutine threading is smaller than you might think.


I don't think anyone was contesting this. I know I certainly wasn't. But it nonetheless is true that STC will consume more space (for a 16-bit Forth implementation on a 6502) than a DTC implementation for the average program size. The average number of words in a colon definition turns out to be 8 in most Forth programs, so each definition, one can estimate, will have four bytes of "overhead" to them compared to other methods. Times the number of definitions, and you can see that it adds up...slightly.

Quote:
You can also save space (and increase speed) with subroutine threading in some cases by inlining assembly, rather than writing a single-use code definition.


This is precisely how my Forth cross compiler works under Linux. In fact, this is the single simplest compiler for Forth you can ever possibly have! The core of the compiler consumes only 5 screens -- all total, the whole compiler fits on a single sheet of paper. I also have five more screens to define useful and common words too, but still, it's darn good!



PostPosted: Sat Aug 28, 2004 8:12 am
Joined: Fri Aug 30, 2002 1:09 am
Posts: 8534
Location: Southern California
Quote:
Quote:
In Forth, to insert code, you have to recompile everything after the insertion point, since the addresses of any subsequent words will change

in my experience, recompiling software in Forth is so fast that the application I just beckoned has already been compiled before my finger leaves the ENTER key (assuming your Forth compiler's dictionary lookups are sufficiently efficient). Therefore, the issue of addresses, literally, is a complete non-issue for me.

There may be situations where you don't want to recompile everything past that point. An example might be where there's an array of data you don't want to have to collect again that would get trashed. In that case, it's easy enough to write a new word that has the corrected behavior or expanded code, and make yourself some simple tools to make the beginning of the old word's parameter field refer to the new word, followed by unnest (or some variation thereof). The first system I started out on compiled slowly enough that it was sometimes worth doing it this way.

_________________
http://WilsonMinesCo.com/ lots of 6502 resources
The "second front page" is http://wilsonminesco.com/links.html .
What's an additional VIA among friends, anyhow?


Top
 Profile  
Reply with quote  
 Post subject:
PostPosted: Sat Aug 28, 2004 4:48 pm 
Offline

Joined: Sat Jan 04, 2003 10:03 pm
Posts: 1706
GARTHWILSON wrote:
There may be situations where you don't want to recompile everything past that point. An example might be where there's an array of data you don't want to have to collect again that would get trashed. In that case, it's easy enough to write a new word that has the corrected behavior or expanded code, and make yourself some simple tools to make the beginning of the old word's parameter field refer to the new word, followed by unnest (or some variation thereof). The first system I started out on compiled slowly enough that it was sometimes worth doing it this way.


In this case, I would store the data in blocks. However, if even this isn't feasible, then you could always store the data at a well-known buffer location in unallocated space. Chuck Moore does this a lot with his ColorForth environment. Since Forth programs are so tiny, it's highly unlikely that a loaded or reloaded program will overwrite the data buffer. (BTW, this used to be done a lot in the old BASIC-on-8-bit era.)

However, if even that doesn't satisfy your requirements, I completely agree -- in this case, I would have to say that Forth itself may not be the right language (I never said it was the perfect language for all people). If this were really an issue, I would definitely investigate a language like Scheme or a proper subset of Common Lisp, both of which support proper garbage collection and can easily support in-situ code updates without having to recollect data. As I understand it, making a proper subset of CL is not very hard to do, and I was thinking about exploring it someday. It also has the benefit of hiding pointers from the programmer, so you just plain don't have to think about them (as an experienced programmer fluent in many languages, I consider this a good thing overall). However, I know for a fact as well that native Lisp environments still allow direct access to machine resources too. Here's some sample code:

Code:
; Print the set of integers from 1 to 10, inclusive.  Woohoo!
; Untested code, so I'm not sure if this will work right.  I used to have a CLisp
; interpreter on my box, but don't anymore.  I'm kinda ticked -- I want to know
; where it went.

; (Note: last, reverse, and append shadow Common Lisp built-ins of the same
; names, so a real implementation would want different names or its own package.)

(defun last (xs)
  "Returns very last element of a list as an atom"
  (cond
    ((= (length xs) 1) (car xs))
    (t (last (cdr xs)))))

(defun all-but-last (xs)
  "Returns all but the last element of a list, obviously as a list."
  (cond
    ((< (length xs) 2) ())
    (t (cons (car xs) (all-but-last (cdr xs))))))

(defun reverse (xs)
  "Reverses the order of a list."
  (cond
    ((null xs) ())    ; base case, or the recursion never bottoms out
    (t (cons (last xs) (reverse (all-but-last xs))))))

(defun append (x xs)
  "Appends x onto the list xs"
  (reverse (cons x (reverse xs))))

(defun range (x)
  "Returns a range of integers from 1 to x in a list."
  (cond
    ((= x 1) (list 1))
    (t (append x (range (- x 1))))))

(print (range 10))


God, that's horrible. I know there is a more elegant way to append an atom to the end of a list, but for the life of me, I can't figure it out right now. So, I decided to brute force it. One thing to note, though: the use of variables is typically "bind once, use many times." This is the hallmark of functional programming. I can tell you right now, though, because of my senior moment, the above code would run very, very slowly if it were to be executed (still too fast for a human to notice, since we're dealing only with 10 integers, but still...).

Another language I would definitely look to implement would be Smalltalk -- like Lisp, it is fully garbage collected, and can do in-place code updates. Note that you don't need the full-blown "Smalltalk-80 Environment" -- I'm talking about just the core language. The syntax may be a little bit more agreeable, but it requires a significantly more sophisticated language parser than either Forth or Lisp (still simpler than BASIC though, I think).

Code:
IntegerArray subclass: #IntegerRange
IntegerRange methods

display
  0 to: self length - 1 do: [ :i | console print: (self at: i) ].
  ^ self

range: max
  0 to: max do: [ :i | self append: i ].
  ^ self


Here I'm cheating, because I'm pretty confident Smalltalk's array objects have an append: method built into them.

Anyway, you get the idea.

But now you might worry about the performance of garbage collection versus explicit memory management. As long as you do your homework on the subject, you won't have a problem with this. Generational collectors make collection virtually transparent in most cases, and if you collect incrementally, it has even less of a performance impact. One dirty trick that Oberon (a language derived from Modula-2, and written by Niklaus Wirth; highly recommended too!) uses is to perform garbage collection while waiting for an I/O request to complete, for example. There is no reason why you should not be able to do the same in a Lisp environment either.



PostPosted: Tue Sep 07, 2004 8:16 am
Joined: Thu Mar 11, 2004 7:42 am
Posts: 362
GARTHWILSON wrote:
Interesting. Can you give some idea of how much space you're finding you need for a kernel, compared to what would be needed for the same thing in indirect-threaded?


Here is the current size. These numbers are probably not going to be particularly useful for a couple of reasons. First, it's not currently optimized for space or for speed. I decided to start with the philosophy "nothing too long, nothing too slow", so it's currently in that middle ground. At this point, I'm leaning toward optimizing the finished version for speed. (Another possible project would be to write a subroutine threaded Forth that is optimized for space, using this version as a starting point.)

Second, it was written from scratch and not derived from an extant indirect or direct threaded implementation, so there isn't really anything with which to make a good comparison. There's FIG-Forth, but it might not be so easy to choose which pieces of FIG-Forth to use for comparison, since there are aspects which differ from the FIG-Forth model.

Currently, there are 86 words (that includes aliases, which only take up head space) that have been tested (several more untested words have been written). Basically, it's only the words necessary for making the QUIT (text interpreter) loop function. Not included in the 86 words would be words such as :, IF, and VARIABLE. Here is how the code is currently organized (it will definitely be reorganized in the finished version):

Heads: 628 bytes (all 86 words)
Primitives: 548 bytes (34 words)
Constants, variables, and deferred words: 66 bytes (16 words)
High-level words (with inline assembly intermixed): 851 bytes (34 words)
System-dependent (I/O) words: 37 bytes (2 words)

The total comes to just over 2k. For the purpose of debugging and experimentation (and partly my own curiosity) the Forth data stack does not sit on the zero page (unlike most Forth implementations). The total comes to 69 fewer bytes if the Forth data stack is moved to the zero page.

With no point of comparison, you may be wondering how I can claim that subroutine threaded isn't so bad in terms of space. The answer is that I started out writing the high-level words using only JSRs (and a final JMP of course). Basically, I just started with the colon definition and simply wrote it using JSRs instead of DWs. As I looked at the code that resulted, I noticed a number of ways of optimizing it by replacing JSRs with (inline) assembly. In many instances, the code was not only faster (as you'd expect), but smaller (or at least the same length) too, so there wasn't even a decision to be made about whether to optimize or not. More on this below.

GARTHWILSON wrote:
You don't have to. The quick and dirty way to make a prefix Forth assembler which I used in my Forth was to just make mnemonics that had the address mode built in, like LDA_ABS, LDA(ZP,X), etc.. and comma-in the operand. The mnemonics just compile the appropriate machine-language op code byte. They don't need to even be in a separate vocabulary because they don't conflict with anything else in Forth. For example, AND is always in Forth but if you want the assembly AND, it'll have an address mode associated with it, like AND_ZP or AND#. The branches were a minor challenge, but still not complicated because the assembler in Forth is not normally used for large applications, only writing primitives, runtimes, maybe ISRs, and things like that. I am working on making my branching method more user-friendly though. When I thought of this, I was able to initially write the assembler in an evening. It made me wonder why I didn't do it years earlier. It has been so handy.


No, I don't have to. And you're right that a 6502 Forth assembler is quick and easy to write. It's very easy to convert a postfix assembler into a highly legible prefix assembler. The name conflicts are easy enough to handle; there are only 3 in my assembler: (, #, and AND. When the assembler is active I use AND to mean the 6502 instruction and & to mean the bit-wise operator. Of course, ( is optional -- in my assembler it does nothing -- but I think something like LDA ( 0 ,X) is the most legible. I use the usual prefix handling technique of not storing bytes until the next instruction, which means that the last instruction must be followed by something (usually an END-CODE) to store bytes of that final instruction. All in all, I've been very pleased with how my prefix assemblers have turned out.

However, with subroutine threaded code, inline assembly is available at any time with [ and ]. As I've gained experience, it has become clear to me that an ordinary postfix assembler is the more natural and elegant solution. With a prefix assembler, you need to finish the last instruction and I usually reset the assembler before the first instruction just in case (so it won't store bytes). When you first encounter Forth and postfix notation, you can either think "I'm not comfortable with postfix; I'm going to make Forth behave in a way that I'm more familiar with" or you can simply learn how to work with postfix. I think the decision to use a prefix or a postfix assembler with a subroutine-threaded Forth falls into a similar category.

kc5tja wrote:
APL, J, and K can be implemented with rather trivial interpreters, and still achieve excellent speeds.


But perception is going to be shaped by the most popular languages -- e.g. BASIC and Forth. Even the staunchest critics of Forth acknowledge that it is one of the fastest interpreted languages out there, so the evidence that interpretation can be fast is there. But not so many people realize (or want to admit) that these techniques can be applied to interpretation in general.

BASIC interpreters, on the other hand, tend to be slow. The emphasis of most seems to be the first word of their acronym -- Beginner's -- and I can't really fault slow interpretation when fast interpretation wasn't a goal in the first place. BASIC bears more of a resemblance to other programming languages -- superficially -- than Forth and its postfix nature. So the perception, unfair as it is, is that interpretation is slow, unless you use Forth, no matter how many exotic programming languages exist as counterexamples. A lot of people like to dismiss interpretation by saying that interpreters don't offer the benefits of a good compiler, even though it is equally true that compilers don't offer the benefits of a good interpreter.

kc5tja wrote:
Excellent! This is what I expected to see. I hadn't realized that this was already done. As far as I was aware, AppleBASIC for the Apple IIs didn't do this (at least, I don't remember them doing it), but then again, it was also floating point BASIC too, with the AppleDisk system. Not sure if that made any difference.


Applesoft, the later and better-known version of BASIC, was written by Microsoft (Apple Computer supplied the Apple-specific routines, though). It used floating point math, and its parser did little more than turn keywords into tokens. It was functional for the most part, but far from elegant. Applesoft was far more commonly used (due to the floating point math) and more extensively documented.

Integer BASIC, on the other hand, was replaced in ROM by Applesoft starting with the II+, the second machine in the Apple II line. The only official Integer BASIC documentation was a BASIC tutorial manual (which called it Apple BASIC, rather than Integer BASIC) that didn't even cover all of the commands! Interestingly, the only source code for Integer BASIC was handwritten and hand-assembled by Wozniak himself! (Evidence of this was apparent when the ROM was disassembled. For example, the code would jump from point A to point B back to point A just to insert a single instruction, whereas with an assembler, that instruction would be inserted in the intended place and the remaining code would be moved accordingly.) This was no small task, since the size of the object code was about 5k. Ironically, Apple Computer published a lot of their early source code, but the handwritten source meant that a golden opportunity to share a unique approach was missed. (A few adventurous types did some disassembly and published their findings in the form of magazine articles, so some of the internals of Integer BASIC eventually became known.)

As I alluded to in my earlier post, there were several places the efficiency of the Integer BASIC interpreter could be improved (and there were a handful of minor bugs), but the concept itself was quite elegant. I've done enough hand-assembly (nothing anywhere near that long, though) to cut anyone who assembles a 5k program by hand some slack in terms of efficiency.

Incidentally, both BASICs preceded the Apple disk drive. So the DOS took some fairly unconventional steps to be accessible from either of these ROM-based BASICs, neither of which was designed with existence of DOS in mind.

kc5tja wrote:
Bytecode/tokens is just "token-threaded" -- several Forth implementations have been implemented token-threaded. The most popular of which is OpenFirmware, actually.


I'm aware of token threaded Forths, though I haven't used one. On the processors I've worked with, address threading seems to be the more natural approach, so I haven't seen a compelling reason to implement a token threaded Forth.

kc5tja wrote:
However, with that said, in my experience, recompiling software in Forth is so fast that the application I just beckoned has already been compiled before my finger leaves the ENTER key (assuming your Forth compiler's dictionary lookups are sufficiently efficient). Therefore, the issue of addresses, literally, is a complete non-issue for me.


Compiler speed isn't the issue for me. On one Forth I've written, FIND is a high-level word. If I wanted blazing compilation speed, there are a lot of things I could do, but so far I haven't needed to. I've encountered some of the limitations of address threading, nothing terrible though. For BASIC, I just like the idea of move and insert more than edit and recompile. I wouldn't say addresses are a complete non-issue for me, but I would say they are a very minor issue.

kc5tja wrote:
What? Subroutine threading takes 3 bytes to store a reference to a word on the 6502, while both direct and indirect threading take 2. The numbers don't lie. Subroutine threading is smaller only for colon definitions with less than four words in the body. At 4 words, the overhead is the same, and beyond 4 words, ITC/DTC is smaller.


In rereading my earlier post, I probably wasn't very clear about what my main point was in terms of a size comparison. Yes, a JSR (and a JMP) takes 3 bytes and a DW takes 2. There isn't anything you can do about that. However, just blindly substituting JSRs for DWs doesn't take full advantage of STC in terms of space or speed. There are ways to optimize STC that bring the total number of bytes in an STC Forth much closer to the total number of bytes in an ITC/DTC Forth than the 6/5 ratio (20 bytes vs. 24 bytes for an "average" high-level word) you suggest. That was the point I was trying to make.

Here are some of the reasons why STC is not as space inefficient as it seems, along with some potential space optimizations for the 6502. The code below assumes the typical 6502 data stack implementation (i.e. how FIG-Forth implements it), and (unlike FIG-Forth) separated heads (so that one STC -- and for that matter DTC -- primitive can fall through to another). All of the code is untested.

First, there are those things that take the same amount of space in STC as ITC/DTC. The size of a head doesn't depend on threading. ." String" will usually compile an inline counted string whose size doesn't depend on threading (the preceding JSR PDOTQ will of course be larger than a DW PDOTQ). This doesn't affect the difference in size between STC and ITC/DTC, but it does affect the ratio.

Second, at a speed penalty, you can implement constants and variables as a JSR followed by a DW which is the same as DTC and just 1 byte more than ITC (5 bytes vs. 4 bytes, rather than 3 vs. 2). Literals can also be implemented this way which is 5 bytes (STC) vs. 4 bytes (ITC/DTC). Again, this affects the ratio of sizes rather than the difference.
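
As a sketch, with DOCON standing for a hypothetical runtime that pops its own return address, pushes the cell following the JSR onto the data stack, and returns to the original caller:

Code:
TEN: JSR DOCON ; push the 10 that follows...
     DW  10    ; ...then return to TEN's caller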

Third, there are high level STC constructs that take less space than their ITC/DTC counterparts. An unconditional branch (compiled by AHEAD or AGAIN) is 3 bytes (JMP destination) with STC, vs. 4 bytes (DW BRANCH,destination) with ITC/DTC. Likewise, a deferred word is 3 bytes (JMP address) with STC vs. 4 bytes (DW DO_DEFER,address) with ITC/DTC.

Fourth, since primitives take less space in STC than ITC/DTC, the ratio of the number of primitives to the number of high-level words affects the total size of STC vs. the total size of ITC. An STC primitive has 4 fewer bytes of overhead than an ITC primitive, so if an "average" high-level word is 4 bytes longer with STC than DTC, a simple 50/50 split of primitives and high-level words makes the total sizes of STC and ITC (roughly) equal (of course, there will be other factors).

Fifth, since STC has no NEXT, DO_COLON, or EXIT routine, there are that many fewer total bytes in an STC Forth. Counting the cold start code that prepares the zero page for the NEXT routine (i.e. storing a $6C at W-1), these routines take roughly 60 bytes in FIG-Forth (not counting any debugging code). That's 15 "average" STC high-level words just to break even.
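
For reference, the main routine being eliminated looks roughly like this (the FIG-Forth 6502 NEXT, paraphrased from memory -- see the Repository listing for the real thing; IP and W are zero-page pointers):

Code:
NEXT   LDY #1
       LDA (IP),Y ; fetch the code field address
       STA W+1    ; of the next word to execute
       DEY
       LDA (IP),Y
       STA W
       CLC        ; advance IP by one cell
       LDA IP
       ADC #2
       STA IP
       BCC NEXT1
       INC IP+1
NEXT1  JMP W-1    ; the $6C at W-1 turns this into JMP (W)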

Sixth, STC lends itself to optimizing primitives for space better than ITC/DTC. For example, you can implement the standard words 1-, ABS, NEGATE, and INVERT as follows:

Code:
ONEMINUS LDA 0,X       ; 1- : decrement the 16-bit TOS
         BNE ONEM1
         DEC 1,X       ; borrow into the high byte
ONEM1    DEC 0,X
         RTS
ABS      LDA 1,X       ; sign is in the high byte
         BPL ABS1      ; already non-negative: done
NEGATE   JSR ONEMINUS  ; -n = INVERT(n-1)
INVERT   DEX           ; pre-adjust so the two INXs below
         DEX           ; leave X where it started
         JSR INV1      ; invert the low byte, then fall
INV1     LDA #$FF      ; through to invert the high byte
         EOR 2,X
         STA 2,X
         INX
ABS1     RTS


I can't see a way of doing this with ITC or DTC that doesn't take more than the usual 4 or 2 extra bytes of overhead. Also, note that even though these examples are optimized for space, in terms of speed, they still compare favorably to ITC and DTC that is specifically optimized for speed. Using self-modifying code also allows for some interesting space optimizations (though some of the space savings may be offset by having to copy the routine from ROM to RAM at cold start). For example, the words AND, OR, XOR, -, and + can be implemented as follows:

Code:
AND   LDA #$35    ; $35 = opcode for AND zp,X
      BNE MATH    ; (always taken)
OR    LDA #$15    ; $15 = ORA zp,X
      BNE MATH
XOR   LDA #$55    ; $55 = EOR zp,X
      BNE MATH
MINUS JSR NEGATE  ; a b - = a -b +
PLUS  CLC
      LDA #$75    ; $75 = ADC zp,X
MATH  STA MATH2   ; patch the opcode below
      JSR MATH1   ; do the low bytes, then fall
MATH1 LDA 2,X     ; through to do the high bytes
MATH2 ADC 0,X     ; this opcode byte gets overwritten
      STA 2,X
      INX         ; two passes = two INXs, popping one
      RTS         ; cell, as a two-operand word should


Again, note that these examples are not significantly slower than speed-optimized ITC or DTC.

Seventh, there are several relatively common high-level sequences that can be replaced by shorter inline assembly sequences. For example, with ITC/DTC, the sequence DUP 0< IF would typically compile to DW DUP,ZLT,QBRANCH,destination but with STC, it could be implemented in 4 fewer bytes with:

Code:
LDA 1,X   ; the sign of TOS is in the high byte
BPL label


Of course, this assumes that the branch destination is within half a page of the BPL instruction, but this will almost always be true in Forth. There are also sequences that can be converted to inline assembly for the same space as the ITC/DTC high-level counterparts, or nearly so. For example, a DROP can compile to a pair of INX instructions, which takes only 2 bytes, just as an ITC/DTC DROP does. The sequence 0= IF, which would ordinarily compile to DW ZEQ,QBRANCH,destination (6 bytes), can be implemented in 8 bytes using STC (the two INX instructions drop the cell that 0= IF consumes, and they come first because INX would clobber the Z flag if done after the test):

Code:
INX       ; drop the cell that 0= IF consumes
INX
LDA $FE,X ; then test the just-dropped cell, which sits
ORA $FF,X ; below the pointer (zp,X wraps within page zero)
BNE label


Another example is ACCEPT. KEY is typically a deferred word, so that different input sources can easily be selected. Since KEY is deferred, I usually write ACCEPT as a high-level word (usually with very few editing features). In my subroutine threaded Forth, after KEY, I use inline assembly to put the keystroke in the accumulator and use a CMP immediate followed by a BEQ or BNE to handle keystrokes that require special action (and switching back to high level Forth when convenient). The CMP-branch is only 4 bytes which is fewer bytes than a literal (or constant) followed by = IF or by OF (i.e. a CASE structure), so several bytes are saved this way (plus it's much faster).
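
In code, that looks something like this (a sketch with hypothetical labels, assuming KEY leaves the keystroke on the data stack):

Code:
ACC1 JSR KEY      ; keystroke is now on the data stack
     LDA 0,X      ; fetch it into the accumulator...
     INX          ; ...and drop the cell
     INX
     CMP #$0D     ; carriage return? the line is done
     BEQ ACCDONE
     CMP #$08     ; backspace? handle the edit
     BEQ ACCBS
     JMP ACCSTORE ; ordinary character: go store it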

Eighth, one high-level word can fall through to another, which will save 3 bytes. For example, with ITC, the high-level definitions : z a b ; : y c z ; compile to:

Code:
z: DW DO_COLON ; JSR for DTC
   DW a
   DW b
   DW EXIT

y: DW DO_COLON ; JSR for DTC
   DW c
   DW z
   DW EXIT


Even if you rearrange the assembly so that y precedes z, there is no way for y to fall through to z. In many Forths, replacing DW z,EXIT with DW BRANCH,z+2 (z+3 for DTC) will be faster, but the size is still the same. With STC, the definitions compile to:

Code:
z: JSR a
   JMP b

y: JSR c
   JMP z


If y is moved before z, the assembly will be:

Code:
y: JSR c
   JMP z
z: JSR a
   JMP b


and the JMP z can be eliminated.

Ninth, the definition : z BEGIN a WHILE b REPEAT ; usually is not optimal because the unconditional branch (compiled by REPEAT) will be executed multiple times. By rearranging the definition to : z AHEAD BEGIN b [ 1 CS-ROLL ] THEN a 0= UNTIL ; (the 0= is needed because UNTIL branches back on a false flag, and an STC compiler can fold it into the branch polarity rather than compiling a separate word) the unconditional branch is executed just once. Since there is no DO_COLON routine with STC, the AHEAD can be eliminated (saving 3 bytes) as follows:

Code:
zee0: JSR b
zee:  JSR a
      JSR QBRANCH ; pop the flag, setting Z if it was false
      BNE zee0    ; loop back to b while a's flag is true
      RTS


In other words, the address returned by ' z is zee, not zee0. In my subroutine threaded Forth, the >NUMBER definition uses this technique. Once again, it's also faster.

Finally, I should point out that using these optimizations does not necessarily mean that the compiler must be more sophisticated (and therefore larger). The alternative is to optimize manually by inserting the optimizations yourself. However, adding a few short definitions helps and also makes the code more readable. Most Forth compilers do very little optimization, leaving it up to the programmer to know the most efficient way of doing things. For example, it is up to the programmer to use the single word 1- instead of the two word sequence 1 -.

Of course, there are things that will take more space with STC (manipulating the return stack is one example), but the bottom line is with STC, there are a lot of options available if you wish to optimize for space.

kc5tja wrote:
I don't think anyone was contesting this. I know I certainly wasn't. But it nonetheless is true that STC will consume more space (for a 16-bit Forth implementation on a 6502) than a DTC implementation for the average program size. The average number of words in a colon definition turns out to be 8 in most Forth programs, so each definition, one can estimate, will have four bytes of "overhead" to them compared to other methods. Times the number of definitions, and you can see that it adds up...slightly.


Replacing DWs with a JSR (or JMP) is simply translating from ITC/DTC to STC, and I don't think you'll get as much out of an STC Forth by translating ITC/DTC as you would by rewriting it specifically for STC. The usual response as to why there aren't any programs that translate C to Forth and/or vice versa is that translating a well-written C program to Forth tends to produce poor Forth code, and translating a well-written Forth program to C tends to produce poor C code. I believe this is also true of ITC/DTC and STC, albeit to a far lesser degree (I would say replace "poor" with "not as good"). The differences between colorForth and ANS Forth suggest to me that a slightly different approach is necessary (and I believe Chuck Moore realized this) to produce the best possible STC, regardless of whether it's space or speed you're after.

Based on some of the optimizations listed above, I think there is even a chance that it's possible to write an STC Forth that is actually smaller than an ITC or DTC Forth on the 6502, given some reasonable ground rules, and assuming that both have the same functionality and both were optimized strictly for space. It'd be quite a challenge, though.



PostPosted: Wed Sep 08, 2004 5:18 am
Joined: Sat Jan 04, 2003 10:03 pm
Posts: 1706
dclxvi wrote:
kc5tja wrote:
Bytecode/tokens is just "token-threaded" -- several Forth implementations have been implemented token-threaded. The most popular of which is OpenFirmware, actually.


I'm aware of token threaded Forths, though I haven't used one. On the processors I've worked with, address threading seems to be the more natural approach, so I haven't seen a compelling reason to implement a token threaded Forth.


The point is compactness and uniformity of representation. When shipping the original source code is not feasible (e.g., as with OpenFirmware drivers, which exist in relatively small ROMs on the expansion cards they accompany), but binary independence and economy are requirements, then token-threading is ideal. Like I said, this is why OpenFirmware uses it.

For what it's worth, most OpenFirmware environments will run-time recompile token-threaded representation into a more native and faster executing form.



PostPosted: Wed Sep 08, 2004 7:23 am
Joined: Fri Aug 30, 2002 1:09 am
Posts: 8534
Location: Southern California
Sheesh! And I thought I and Samuel were long-winded! That was great though. Thanks for the detailed examples and points of clarification. I guess I can quit saying STC is less memory-efficient. I hope you'll have opportunities to recycle your "book" and get more uses out of it for your time. Although I'm not going to start over on my 65816 Forth because of it, it does give some ideas for future improvements.

I did "cheat" a little though and do some optimizations in my '816 ITC Forth that are similar to what you're talking about. For example, nested structures sometimes result in two or more consecutive branches. Why not just branch to the intended destination the first time. The only disadvantage I could see for some of these things is that if the user wants to use SEE to examine the insides, there could be some confusing results. SEE terminates when it comes to the unnest at the end of a secondary; but there won't necessarily be an unnest in some cases, especially if one word branches right into another without going through unnest-nest or other such overhead. I could probably also expand my inventory of (probably headerless) words that replace occurrences of things like the DUP 0< IF that you mention. I know it's taboo in Forth, but there's nothing that says the code is supposed to be readable after being de-compiled, and we are after all talking about internal optimizations. If the user wants it readable, he'll have to look at the source code I provide that shows what it looks like before these optimizations are carried out.

Something I found so helpful with the '816 is that since it can handle 16 bits at a time, a lot of words that would have taken so much extra space if defined as primitives on the '02 became easy with the '816, resulting not just in faster code but even shorter and easier to write in many cases—the best of all worlds. Because of this, my '816 Forth has around 350 primitives. Having so much higher percentage of the kernel in primitives dramatically reduces the number of times NEXT, nest, and unnest get executed to carry out a particular job, further speeding things up.

_________________
http://WilsonMinesCo.com/ lots of 6502 resources
The "second front page" is http://wilsonminesco.com/links.html .
What's an additional VIA among friends, anyhow?



PostPosted: Wed Sep 08, 2004 7:35 am
Joined: Fri Aug 30, 2002 1:09 am
Posts: 8534
Location: Southern California
Since we've hijacked pbug's "How to make a BASIC compiler?" thread long enough now and quickly turned it into Forth material that's way too advanced for those who are new to Forth, why don't we continue this discussion under the Forth section of the forum? I'll start a new subject and call it "cont. from BASIC compiler". I am looking forward to more.

<later> Ok, it's at http://www.6502.org/forum/viewtopic.php?p=3336#3336
pbug, you can go back to talking about BASIC compilers. I for one will still be interested—even if I do think you'd be better off going to Forth ;)

_________________
http://WilsonMinesCo.com/ lots of 6502 resources
The "second front page" is http://wilsonminesco.com/links.html .
What's an additional VIA among friends, anyhow?



PostPosted: Wed Sep 29, 2004 5:04 am
Joined: Thu Mar 11, 2004 7:42 am
Posts: 362
It occurred to me that no one has mentioned the BASIC compiler for FIG-Forth in Forth Dimensions. Early issues of Forth Dimensions are online at www.forth.org in the form of .pdf files. The article appears on p. 178 (the .pdf page number; p. 175 is printed at the bottom of the page itself) of the Volume 3 .pdf file (~12MB!). There is also a related article on converting infix to postfix on p. 14 of the Volume 4 Number 6 .pdf file (~3MB). There are also source code files around, so you don't have to type it in from the article.

Anyway, you can use the FIG-Forth source code in the (6502.org) Repository with the Forth Dimensions article to get an example of a BASIC compiler for the 6502. It's the same concept as converting infix to postfix at parse time that I mentioned in an earlier post. The BASIC program (to be compiled) will seem to be spaced a little strangely as compared to a traditional BASIC, because the compiler uses the ordinary FIG-Forth parsing method, i.e. space delimited words. You could write your own parser if you wish to change this. Also, it will compile BASIC to (indirect threaded) Forth, but it could be altered to compile to assembly (or to translate the Forth into assembly).



PostPosted: Sat Aug 21, 2010 8:40 pm
Joined: Sat Jan 04, 2003 10:03 pm
Posts: 1706
Just an update here: it seems the forth.org website no longer has direct links to the FD scans.

You can still find them at the direct URL:

http://www.forth.org/fd/contents.html



PostPosted: Mon Aug 23, 2010 10:46 pm
Joined: Sun Nov 08, 2009 1:56 am
Posts: 408
Location: Minnesota
Ruud wrote:

I'm working on a Pascal compiler but one written in Pascal. The idea is that, once compiled on the PC, the user can improve and compile it on the native machine.

But I must confess there is a little snag in it: the executable so far is over 200 KB and I don't think it will ever be small enough to run it on a C64, 8032, Apple ][ or any other 6502 machine. The Apple2GS or a C64/128 with SuperCPU (=65816) could cope with it.


Have you looked at the P4 compiler?
http://homepages.cwi.nl/~steven/pascal/

It was used as the basis of the UCSD Pascal compiler, which certainly ran on the 6502.



PostPosted: Mon Aug 23, 2010 10:50 pm
Joined: Sat Jan 04, 2003 10:03 pm
Posts: 1706
Also, don't forget that in the early days compilers often couldn't fit into available core, even on "big machines" like the PDP series. C is an example, where you had a master driver program, cc, which in turn invoked c1, c2, and as, in that order, to lex and parse, to generate code for, and to assemble (respectively) the binary you're creating.

If you can split your one executable into multiple components, or even somehow implement an overlay system, that will allow you to run your compiler on the C64 natively.



PostPosted: Mon Aug 23, 2010 10:59 pm
Joined: Sun Nov 08, 2009 1:56 am
Posts: 408
Location: Minnesota
dclxvi wrote:
My BASIC was partly inspired by Apple (Integer) BASIC, which was written by Steve Wozniak. Wozniak used ideas from HP BASICs (he had worked at HP), and of course HP was well-known for their stack-oriented RPN calculators. "Integer" BASIC is really not all that far away from the address interpreter you suggest. It uses a sophisticated parser which does all of its syntax checking at parse time rather than execution time. It also converts constants at parse time to their two's complement representation rather than storing them as ASCII. Therefore, unlike many BASICs (e.g. EhBASIC and Applesoft) you never got a SYNTAX ERROR at run time, and you weren't converting constants (e.g. the statement PRINT 99) or line numbers (e.g. the statement IF A < B THEN 10) each and every time they were encountered.

There were other things that could have been handled at parse time, but weren't, such as checking for too many parentheses in an expression, and determining whether you needed the address of a variable so you could store a value in it (e.g. a LET or an INPUT statement) or if only the contents of the variable were needed (e.g. PRINT A). (Determining the latter at parse time would have sped up expression evaluation considerably.)


There is also Atari BASIC, which contains a lot of these same ideas:
http://users.telenet.be/kim1-6502/6502/absb.html

Although it isn't perfect either:
http://www3.sympatico.ca/maury/other_stuff/atari_basic.html


Last edited by teamtempest on Tue Aug 24, 2010 12:16 am, edited 1 time in total.


PostPosted: Tue Aug 24, 2010 12:07 am
Joined: Wed May 20, 2009 1:06 pm
Posts: 491
teamtempest wrote:
There is also Atari BASIC, which contains a lot of these same ideas:
http://users.telenet.be/kim1-6502/6502/absb.html

Although it isn't perfect either:
http://www3.sympatico.ca/maury/other_stuff/atari_basic.html


The Atari Basic Source Book

http://users.telenet.be/kim1-6502/6502/absb.html

