GARTHWILSON wrote:
Interesting. Can you give some idea of how much space you're finding you need for a kernel, compared to what would be needed for the same thing in indirect-threadded?
Here is the current size. These numbers are probably not going to be particularly useful for a couple of reasons. First, it's not currently optimized for space or for speed. I decided to start with the philosophy "nothing too long, nothing too slow", so it's currently in that middle ground. At this point, I'm leaning toward optimizing the finished version for speed. (Another possible project would be to write a subroutine threaded Forth that is optimized for space, using this version as a starting point.)
Second, it was written from scratch and not derived from an extant indirect or direct threaded implemenation, so there isn't really anything with which to make a good comparison. There's FIG-Forth, but it might not be so easy to choose which pieces of FIG-Forth to use for comparison, since there are aspects which differ from the FIG-Forth model.
Currently, there are 86 words (that includes aliases, which only take up head space) that have been tested (several more untested words have been written). Basically, it's only the words necessary for making the QUIT (text interpreter) loop function. Not included in the 86 words would be words such as :, IF, and VARIABLE. Here is how the code is currently organized (it will definitely be reorganized in the finished version):
Heads: 628 bytes (all 86 words)
Primitives: 548 bytes (34 words)
Constants, variables, and deferred words: 66 bytes (16 words)
High-level words (with inline assembly intermixed): 851 bytes (34 words)
System-dependent (I/O) words: 37 bytes (2 words)
The total comes to just over 2k. For the purpose of debugging and experimentation (and partly my own curiousity) the Forth data stack does not sit on the zero page (unlike most Forth implementations). The total comes to 69 fewer bytes if the Forth data stack is moved to the zero page.
With no point of comparison, you may be wondering how I can claim that subroutine threaded isn't so bad in terms of space. The answer is that I started out writing the high-level words using only JSRs (and a final JMP of course). Basically, I just started with the colon definition and did simply wrote it using JSRs instead of DWs. As I looked at the code that resulted, I noticed a number of ways of optimizing it by replacing JSRs with (inline) assembly. In many instances, the code was not only faster (as you'd expect), but smaller (or at least the same length) too, so there wasn't even a decison to be made about whether to optimize or not. More on this below.
GARTHWILSON wrote:
You don't have to. The quick and dirty way to make a prefix Forth assembler which I used in my Forth was to just make mnemonics that had the address mode built in, like LDA_ABS, LDA(ZP,X), etc.. and comma-in the operand. The mnemonics just compile the appropriate machine-language op code byte. They don't need to even be in a separate vocabulary because they don't conflict with anything else in Forth. For example, AND is always in Forth but if you want the assembly AND, it'll have an address mode associated with it, like AND_ZP or AND#. The branches were a minor challenge, but still not complicated because the assembler in Forth is not normally used for large applications, only writing primitives, runtimes, maybe ISRs, and things like that. I am working on making my branching method more user-friendly though. When I thought of this, I was able to initially write the assembler in an evening. It made me wonder why I didn't do it years earlier. It has been so handy.
No, I don't have to. And you're right that a 6502 Forth assembler is quick and easy to write. It's very easy to convert a postfix assembler into a highly legible prefix assembler. The name conflicts are easy enough to handle; there are only 3 in my assembler: (, #, and AND. When the assembler is active I use AND to mean the 6502 instruction and & to mean the bit-wise operator. Of course, ( is optional -- in my assembler it does nothing -- but I think something like LDA ( 0 ,X) is the most legible. I use the usual prefix handling technique of not storing bytes until the next instruction, which means that the last instruction must be followed by something (usually an END-CODE) to store bytes of that final instruction. All in all, I've been very pleased with how my prefix assemblers have turned out.
However, with subroutine threaded code, inline assembly is available at any time with [ and ]. As I've gained experience, it has become clear to me that an ordinary postfix assembler is the more natural and elegant solution. With a prefix assembler, you need to finish the last instruction and I usually reset the assembler before the first instruction just in case (so it won't store bytes). When you first encounter Forth and postfix notation, you can either think "I'm not comfortable with postfix; I'm going to make Forth behave in a way that I'm more familiar with" or you can simply learn how to work with postfix. I think the decision to use a prefix or a postfix assembler with a subroutine-threaded Forth falls into a similar category.
kc5tja wrote:
APL, J, and K can be implemented with rather trivial interpreters, and still achieve excellent speeds.
But perception is going to be shaped by the most popular languages -- e.g. BASIC and Forth. Even the staunchest critics of Forth acknowledge that it is one of the fastest interpreted languages out there, so the evidence that interpretation can be fast is there. But not so many people realize (or want to admit) that these techniques can be applied to interpretation in general.
BASIC interpreters, on the other hand, tend to be slow. The emphasis of most seems to be the first word of their acronym -- Beginner's -- and I can't really fault slow interpretation when fast interpretation wasn't a goal in the first place. BASIC bears more of a resemblance to other programming languages -- superficially -- than Forth and its postfix nature. So the perception, unfair as it is, is that interpretation is slow, unless you use Forth, no matter how many exotic programming languages exist as counterexamples. A lot of people like to dismiss interpretation by saying that interpreters don't offer the benefits of a good compiler, even though it is equally true that compilers don't offer the benefits of a good interpreter.
kc5tja wrote:
Excellent! This is what I expected to see. I hadn't realized that this was already done. As far as I was aware, AppleBASIC for the Apple IIs didn't do this (at least, I don't remember them doing it), but then again, it was also floating point BASIC too, with the AppleDisk system. Not sure if that made any difference.
Applesoft, the later, and more well-known version of BASIC, was written by Microsoft (Apple Computer supplied the Apple-specific routines, though). It used floating point math, and its parser did little more than turn keywords into tokens. It was functional for the most part, but far from elegant. Applesoft was far more commonly used (due to the floating point math) and more extensively documented.
Integer BASIC, on the other hand, was replaced in ROM by Applesoft starting with the II+, the second machine in the Apple II line. The only official Integer BASIC documentation was a BASIC tutorial manual (which called it Apple BASIC, rather than Integer BASIC) that didn't even cover all of the commands! Interestingly, the only source code for Integer BASIC was handwritten and hand-assembled by Wozniak himself! (Evidence of this was apparent when the ROM was disassembled. For example, the code would jump from point A to point B back to point A just to insert a single instruction, whereas with an assembler, that instruction would be inserted in the intended place and the remaining code would be moved accordingly.) This was no small task, since size of the object code was about 5k. Ironically, Apple computer published a lot of their early source code, but the handwritten source code prevented a golden opportunity to share a unique approach. (A few adventurous types did some disassembly and published their findings in the form of magazine articles, so some of the internals of Integer BASIC eventually became known.)
As I alluded to in my earlier post, there were several places the efficiency of the Integer BASIC interpreter could be improved (and there were a handful of minor bugs), but the concept itself was quite elegant. I've done enough hand-assembly (nothing anywhere near that long, though) to cut anyone who assembles a 5k program by hand some slack in terms of efficiency.
Incidentally, both BASICs preceded the Apple disk drive. So the DOS took some fairly unconventional steps to be accessible from either of these ROM-based BASICs, neither of which was designed with existence of DOS in mind.
kc5tja wrote:
Bytecode/tokens is just "token-threaded" -- several Forth implementations have been implemented token-threaded. The most popular of which is OpenFirmware, actually.
I'm aware of token threaded Forths, though I haven't used one. On the processors I've worked with, address threading seems to be the more natural approach, so I haven't seen a compelling reason to implement a token threaded Forth.
kc5tja wrote:
However, with that said, in my experience, recompiling software in Forth is so fast that the application I just beckoned has already been compiled before my finger leaves the ENTER key (assuming your Forth compiler's dictionary lookups are sufficiently efficient). Therefore, the issue of addresses, literally, is a complete non-issue for me.
Compiler speed isn't the issue for me. On one Forth I've written, FIND is a high-level word. If I wanted blazing compilation speed, there are a lot of things I could do, but so far I haven't needed to. I've encountered some of the limitations of address threading, nothing terrible though. For BASIC, I just like the idea of move and insert more than edit and recompile. I wouldn't say addresses are a complete non-issue for me, but I would say they are a very minor issue.
kc5tja wrote:
What? Subroutine threading takes 3 bytes to store a reference to a word on the 6502, while both direct and indirect threading take 2. The numbers don't lie. Subroutine threading is smaller only for colon definitions with less than four words in the body. At 4 words, the overhead is the same, and beyond 4 words, ITC/DTC is smaller.
In rereading my earlier post, I probably wasn't very clear about what my main point was in terms of a size comparsion. Yes, a JSR (and a JMP) takes 3 bytes and a DW takes 2. There isn't anything you can do about that. However, just blindly substituting JSRs for DWs doesn't take full advantage of STC in terms of space or speed. There are ways to optimize STC that bring the total number of bytes in an STC Forth much closer to the total number of bytes in an ITC/DTC Forth than the 6/5 ratio (20 bytes vs. 24 bytes for an "average" high-level word) you suggest. That's was the point I was trying to make.
Here are some of the reason why STC is not as space inefficient as it seems, along with some potential space optimizations for the 6502. The code below assumes the typical 6502 data stack implementation (i.e. how FIG-Forth implements it), and (unlike FIG-Forth) separated heads (so that one STC -- and for that matter DTC -- primitive can fall through to another). All of the code is untested.
First, there are those things that take the same amount of space in STC as ITC/DTC. The size of a head doesn't depend on threading. ." String" will usually compile an inline counted string whose size doesn't depend on threading (the preceding JSR PDOTQ will of course be larger than a DW PDOTQ). This doesn't affect the difference in size between STC and ITC/DTC, but it does affect the ratio.
Second, at a speed penalty, you can implement constants and variables as a JSR followed by a DW which is the same as DTC and just 1 byte more than ITC (5 bytes vs. 4 bytes, rather than 3 vs. 2). Literals can also be implemented this way which is 5 bytes (STC) vs. 4 bytes (ITC/DTC). Again, this affects the ratio of sizes rather than the difference.
Third, there are high level STC constructs that take less space than their ITC/DTC counterparts. An unconditional branch (compiled by AHEAD or AGAIN) is 3 bytes (JMP destination) with STC, vs. 4 bytes (DW BRANCH,destination) with ITC/DTC. Likewise, a deferred word is 3 bytes (JMP address) with STC vs. 4 bytes (DW DO_DEFER,address) with ITC/DTC.
Fourth, since primitives take less space in STC than ITC/DTC, the ratio of the number of primitives to the number of high-level words affects the total size of STC vs. the total size of ITC. An STC primitive has 4 fewer bytes of overhead than an ITC primitive, so if an "average" high-level word is 4 bytes longer with STC than DTC, a simple 50/50 split of primitives and high level words makes the total size of DTC and ITC (roughly) equal (of course, there will be other factors).
Fifth, since STC has no NEXT, DO_COLON, or EXIT routine, there are that many fewer total bytes in an STC Forth. Counting the cold start code that prepares the zero page for the NEXT routine (i.e. storing a $6C at W-1), these routines take roughly 60 bytes in FIG-Forth (not counting any debugging code). That's 15 "average" STC high-level words just to break even.
Sixth, STC lends itself to optimizing primitives for space better than ITC/DTC. For example, you can implement the standard words 1-, ABS, NEGATE, and INVERT as follows:
Code:
ONEMINUS LDA 0,X
BNE ONEM1
DEC 1,X
ONEM1 DEC 0,X
RTS
ABS LDA 1,X
BNE ABS1
NEGATE JSR ONEMINUS
INVERT JSR INV1
INV1 LDA #$FF
EOR 0,X
STA 0,X
INX
ABS1 RTS
I can't see a way of doing this with ITC or DTC that doesn't take more than the usual 4 or 2 extra bytes of overhead. Also, note that even though these examples are optimized for space, in terms of speed, they still compare favorably to ITC and DTC that is specifically optimized for speed. Using self-modifying code also allows for some interesting space optimizations (though some of the space savings may be offset by having to copy the routine from ROM to RAM at cold start). For example, the words AND, OR, XOR, -, and + can be implemented as follows:
Code:
AND LDA #$35
BNE MATH
OR LDA #$15
BNE MATH
XOR LDA #$55
BNE MATH
MINUS JSR NEGATE
PLUS CLC
LDA #$75
MATH STA MATH2
JSR MATH1
MATH1 LDA 2,X
MATH2 ADC 0,X
STA 2,X
INX
RTS
Again, note that these examples are not significantly slower than speed-optimized ITC or DTC.
Seventh, there are several relatively common high-level sequences that can be replaced by shorter inline assembly sequences. For example, with ITC/DTC, the sequence DUP 0< IF would typically compile to DW DUP,ZLT,QBRANCH,destination but with STC, it could be implemented in 4 fewer bytes with:
Code:
LDA 0,X
BPL label
Of course, this assumes that the branch destination is within half a page of the BPL instruction, but this will almost always be true in Forth. There are also sequences that can be converted to inline assembly that will take the same amount of space as the ITC/DTC high-level counterparts. For example, a DROP can compile to a pair of INX instructions, which takes only 2 bytes, just as an ITC/DTC DROP does. The sequence 0= IF which would ordinarily compile to DW ZEQ,QBRANCH,destination can be implemented in 6 bytes using STC also:
Code:
LDA 0,X
ORA 1,X
BNE label
Another example is ACCEPT. KEY is typically a deferred word, so that different input sources can easily be selected. Since KEY is deferred, I usually write ACCEPT as a high-level word (usually with very few editing features). In my subroutine threaded Forth, after KEY, I use inline assembly to put the keystroke in the accumulator and use a CMP immediate followed by a BEQ or BNE to handle keystrokes that require special action (and switching back to high level Forth when convenient). The CMP-branch is only 4 bytes which is fewer bytes than a literal (or constant) followed by = IF or by OF (i.e. a CASE structure), so several bytes are saved this way (plus it's much faster).
Eighth, one high-level word can fall through to another, which will save 3 bytes. For example, with ITC, the high level definitions : z a b ; : y c z ; compiles to:
Code:
z: DW DO_COLON ; JSR for DTC
DW a
DW b
DW EXIT
y: DW DO_COLON ; JSR for DTC
DW c
DW z
DW EXIT
Even if you rearrange the assembly so that y precedes z, there is no way for y to fall through to z. In many Forths, replacing DW z,EXIT with DW BRANCH,z+2 (z+3 for DTC) will be faster, but the size is still the same. With STC, the definitions compile to:
Code:
z: JSR a
JMP b
y: JSR c
JMP z
If y is moved before z, the assembly will be:
Code:
y: JSR c
JMP z
z: JSR a
JMP b
and the JMP z can be eliminated.
Ninth, the definition : z BEGIN a WHILE b REPEAT ; usually is not optimal because the unconditional branch (compiled by REPEAT) will be executed multiple times. By rearranging the definition to : z AHEAD BEGIN b [ 1 CS-ROLL ] THEN a UNTIL ; the unconditional branch is executed just once. Since there is no DO_COLON routine with STC, the AHEAD can be eliminated (saving 3 bytes) as follows:
Code:
zee0: JSR b
zee: JSR a
JSR QBRANCH ; prepare Z flag and pop data stack
BEQ zee0
RTS
In other words, the address returned by ' z is zee, not zee0. In my subroutine threaded Forth, the >NUMBER definition uses this technique. Once again, it's also faster.
Finally, I should point out that using these optimizations does not necessary mean that the compiler must be more sophisticated (and therefore larger). The alternative is to optimize manually by inserting the optimizations yourself. However, adding a few short definitions helps and also makes the code more readable. Most Forth compilers do very little optimization, leaving it up to the programmer to know the most efficient way of doing things. For example, it is up to the programmer to use the single word 1- instead of the two word sequence 1 -.
Of course, there are things that will take more space with STC (manipulating the return stack is one example), but the bottom line is with STC, there are a lot of options available if you wish to optimize for space.
kc5tja wrote:
I don't think anyone was contesting this. I know I certainly wasn't. But it nonetheless is true that STC will consume more space (for a 16-bit Forth implementation on a 6502) than a DTC implementation for the average program size. The average number of words in a colon definition turns out to be 8 in most Forth programs, so each definition, one can estimate, will have four bytes of "overhead" to them compared to other methods. Times the number of definitions, and you can see that it adds up...slightly.
Replacing DWs with a JSR (or JMP) is simply translating from ITC/DTC to STC, and I don't think you'll get as much out of an STC by translating ITC/DTC as you would by rewriting it specifically for STC. The usual response as to why there aren't any programs that translate C to Forth and/or vice versa is that translating a well-written C program to Forth tends to produce a poor Forth code, and translating a well-written Forth program to C tends to produce poor C code. I believe this is also true of ITC/DTC and STC, albeit to a far lesser degree (I would say replace "poor" with "not as good"). The differences between colorForth and ANS Forth suggest to me that a slightly different approach is necessary (and I believe Chuck Moore realized this) to produce the best possible STC, regardless of whether it's space or speed you're after.
Based on some of the optimizations listed above, I think there is even a chance that it's possible to write a STC Forth that is actually smaller than a ITC or DTC Forth on the 6502, given some reasonable ground rules, and assuming that both have the same functionality and both were optimized strictly for space. It'd be quite a challenge, though.