Posted: Tue Sep 07, 2004 8:16 am
GARTHWILSON wrote:
Interesting. Can you give some idea of how much space you're finding you need for a kernel, compared to what would be needed for the same thing in indirect-threadded?
Second, it was written from scratch and not derived from an extant indirect or direct threaded implemenation, so there isn't really anything with which to make a good comparison. There's FIG-Forth, but it might not be so easy to choose which pieces of FIG-Forth to use for comparison, since there are aspects which differ from the FIG-Forth model.
Currently, there are 86 words (that includes aliases, which only take up head space) that have been tested (several more untested words have been written). Basically, it's only the words necessary for making the QUIT (text interpreter) loop function. Not included in the 86 words would be words such as :, IF, and VARIABLE. Here is how the code is currently organized (it will definitely be reorganized in the finished version):
Heads: 628 bytes (all 86 words)
Primitives: 548 bytes (34 words)
Constants, variables, and deferred words: 66 bytes (16 words)
High-level words (with inline assembly intermixed): 851 bytes (34 words)
System-dependent (I/O) words: 37 bytes (2 words)
The total comes to just over 2k. For the purpose of debugging and experimentation (and partly my own curiousity) the Forth data stack does not sit on the zero page (unlike most Forth implementations). The total comes to 69 fewer bytes if the Forth data stack is moved to the zero page.
With no point of comparison, you may be wondering how I can claim that subroutine threaded isn't so bad in terms of space. The answer is that I started out writing the high-level words using only JSRs (and a final JMP of course). Basically, I just started with the colon definition and did simply wrote it using JSRs instead of DWs. As I looked at the code that resulted, I noticed a number of ways of optimizing it by replacing JSRs with (inline) assembly. In many instances, the code was not only faster (as you'd expect), but smaller (or at least the same length) too, so there wasn't even a decison to be made about whether to optimize or not. More on this below.
GARTHWILSON wrote:
You don't have to. The quick and dirty way to make a prefix Forth assembler which I used in my Forth was to just make mnemonics that had the address mode built in, like LDA_ABS, LDA(ZP,X), etc.. and comma-in the operand. The mnemonics just compile the appropriate machine-language op code byte. They don't need to even be in a separate vocabulary because they don't conflict with anything else in Forth. For example, AND is always in Forth but if you want the assembly AND, it'll have an address mode associated with it, like AND_ZP or AND#. The branches were a minor challenge, but still not complicated because the assembler in Forth is not normally used for large applications, only writing primitives, runtimes, maybe ISRs, and things like that. I am working on making my branching method more user-friendly though. When I thought of this, I was able to initially write the assembler in an evening. It made me wonder why I didn't do it years earlier. It has been so handy.
However, with subroutine threaded code, inline assembly is available at any time with [ and ]. As I've gained experience, it has become clear to me that an ordinary postfix assembler is the more natural and elegant solution. With a prefix assembler, you need to finish the last instruction and I usually reset the assembler before the first instruction just in case (so it won't store bytes). When you first encounter Forth and postfix notation, you can either think "I'm not comfortable with postfix; I'm going to make Forth behave in a way that I'm more familiar with" or you can simply learn how to work with postfix. I think the decision to use a prefix or a postfix assembler with a subroutine-threaded Forth falls into a similar category.
kc5tja wrote:
APL, J, and K can be implemented with rather trivial interpreters, and still achieve excellent speeds.
BASIC interpreters, on the other hand, tend to be slow. The emphasis of most seems to be the first word of their acronym -- Beginner's -- and I can't really fault slow interpretation when fast interpretation wasn't a goal in the first place. BASIC bears more of a resemblance to other programming languages -- superficially -- than Forth and its postfix nature. So the perception, unfair as it is, is that interpretation is slow, unless you use Forth, no matter how many exotic programming languages exist as counterexamples. A lot of people like to dismiss interpretation by saying that interpreters don't offer the benefits of a good compiler, even though it is equally true that compilers don't offer the benefits of a good interpreter.
kc5tja wrote:
Excellent! This is what I expected to see. I hadn't realized that this was already done. As far as I was aware, AppleBASIC for the Apple IIs didn't do this (at least, I don't remember them doing it), but then again, it was also floating point BASIC too, with the AppleDisk system. Not sure if that made any difference.
Integer BASIC, on the other hand, was replaced in ROM by Applesoft starting with the II+, the second machine in the Apple II line. The only official Integer BASIC documentation was a BASIC tutorial manual (which called it Apple BASIC, rather than Integer BASIC) that didn't even cover all of the commands! Interestingly, the only source code for Integer BASIC was handwritten and hand-assembled by Wozniak himself! (Evidence of this was apparent when the ROM was disassembled. For example, the code would jump from point A to point B back to point A just to insert a single instruction, whereas with an assembler, that instruction would be inserted in the intended place and the remaining code would be moved accordingly.) This was no small task, since size of the object code was about 5k. Ironically, Apple computer published a lot of their early source code, but the handwritten source code prevented a golden opportunity to share a unique approach. (A few adventurous types did some disassembly and published their findings in the form of magazine articles, so some of the internals of Integer BASIC eventually became known.)
As I alluded to in my earlier post, there were several places the efficiency of the Integer BASIC interpreter could be improved (and there were a handful of minor bugs), but the concept itself was quite elegant. I've done enough hand-assembly (nothing anywhere near that long, though) to cut anyone who assembles a 5k program by hand some slack in terms of efficiency.
Incidentally, both BASICs preceded the Apple disk drive. So the DOS took some fairly unconventional steps to be accessible from either of these ROM-based BASICs, neither of which was designed with existence of DOS in mind.
kc5tja wrote:
Bytecode/tokens is just "token-threaded" -- several Forth implementations have been implemented token-threaded. The most popular of which is OpenFirmware, actually.
kc5tja wrote:
However, with that said, in my experience, recompiling software in Forth is so fast that the application I just beckoned has already been compiled before my finger leaves the ENTER key (assuming your Forth compiler's dictionary lookups are sufficiently efficient). Therefore, the issue of addresses, literally, is a complete non-issue for me.
kc5tja wrote:
What? Subroutine threading takes 3 bytes to store a reference to a word on the 6502, while both direct and indirect threading take 2. The numbers don't lie. Subroutine threading is smaller only for colon definitions with less than four words in the body. At 4 words, the overhead is the same, and beyond 4 words, ITC/DTC is smaller.
Here are some of the reason why STC is not as space inefficient as it seems, along with some potential space optimizations for the 6502. The code below assumes the typical 6502 data stack implementation (i.e. how FIG-Forth implements it), and (unlike FIG-Forth) separated heads (so that one STC -- and for that matter DTC -- primitive can fall through to another). All of the code is untested.
First, there are those things that take the same amount of space in STC as ITC/DTC. The size of a head doesn't depend on threading. ." String" will usually compile an inline counted string whose size doesn't depend on threading (the preceding JSR PDOTQ will of course be larger than a DW PDOTQ). This doesn't affect the difference in size between STC and ITC/DTC, but it does affect the ratio.
Second, at a speed penalty, you can implement constants and variables as a JSR followed by a DW which is the same as DTC and just 1 byte more than ITC (5 bytes vs. 4 bytes, rather than 3 vs. 2). Literals can also be implemented this way which is 5 bytes (STC) vs. 4 bytes (ITC/DTC). Again, this affects the ratio of sizes rather than the difference.
Third, there are high level STC constructs that take less space than their ITC/DTC counterparts. An unconditional branch (compiled by AHEAD or AGAIN) is 3 bytes (JMP destination) with STC, vs. 4 bytes (DW BRANCH,destination) with ITC/DTC. Likewise, a deferred word is 3 bytes (JMP address) with STC vs. 4 bytes (DW DO_DEFER,address) with ITC/DTC.
Fourth, since primitives take less space in STC than ITC/DTC, the ratio of the number of primitives to the number of high-level words affects the total size of STC vs. the total size of ITC. An STC primitive has 4 fewer bytes of overhead than an ITC primitive, so if an "average" high-level word is 4 bytes longer with STC than DTC, a simple 50/50 split of primitives and high level words makes the total size of DTC and ITC (roughly) equal (of course, there will be other factors).
Fifth, since STC has no NEXT, DO_COLON, or EXIT routine, there are that many fewer total bytes in an STC Forth. Counting the cold start code that prepares the zero page for the NEXT routine (i.e. storing a $6C at W-1), these routines take roughly 60 bytes in FIG-Forth (not counting any debugging code). That's 15 "average" STC high-level words just to break even.
Sixth, STC lends itself to optimizing primitives for space better than ITC/DTC. For example, you can implement the standard words 1-, ABS, NEGATE, and INVERT as follows:
Code: Select all
ONEMINUS LDA 0,X
BNE ONEM1
DEC 1,X
ONEM1 DEC 0,X
RTS
ABS LDA 1,X
BNE ABS1
NEGATE JSR ONEMINUS
INVERT JSR INV1
INV1 LDA #$FF
EOR 0,X
STA 0,X
INX
ABS1 RTS
Code: Select all
AND LDA #$35
BNE MATH
OR LDA #$15
BNE MATH
XOR LDA #$55
BNE MATH
MINUS JSR NEGATE
PLUS CLC
LDA #$75
MATH STA MATH2
JSR MATH1
MATH1 LDA 2,X
MATH2 ADC 0,X
STA 2,X
INX
RTS
Seventh, there are several relatively common high-level sequences that can be replaced by shorter inline assembly sequences. For example, with ITC/DTC, the sequence DUP 0< IF would typically compile to DW DUP,ZLT,QBRANCH,destination but with STC, it could be implemented in 4 fewer bytes with:
Code: Select all
LDA 0,X
BPL label
Code: Select all
LDA 0,X
ORA 1,X
BNE label
Eighth, one high-level word can fall through to another, which will save 3 bytes. For example, with ITC, the high level definitions : z a b ; : y c z ; compiles to:
Code: Select all
z: DW DO_COLON ; JSR for DTC
DW a
DW b
DW EXIT
y: DW DO_COLON ; JSR for DTC
DW c
DW z
DW EXIT
Code: Select all
z: JSR a
JMP b
y: JSR c
JMP z
Code: Select all
y: JSR c
JMP z
z: JSR a
JMP b
Ninth, the definition : z BEGIN a WHILE b REPEAT ; usually is not optimal because the unconditional branch (compiled by REPEAT) will be executed multiple times. By rearranging the definition to : z AHEAD BEGIN b [ 1 CS-ROLL ] THEN a UNTIL ; the unconditional branch is executed just once. Since there is no DO_COLON routine with STC, the AHEAD can be eliminated (saving 3 bytes) as follows:
Code: Select all
zee0: JSR b
zee: JSR a
JSR QBRANCH ; prepare Z flag and pop data stack
BEQ zee0
RTS
Finally, I should point out that using these optimizations does not necessary mean that the compiler must be more sophisticated (and therefore larger). The alternative is to optimize manually by inserting the optimizations yourself. However, adding a few short definitions helps and also makes the code more readable. Most Forth compilers do very little optimization, leaving it up to the programmer to know the most efficient way of doing things. For example, it is up to the programmer to use the single word 1- instead of the two word sequence 1 -.
Of course, there are things that will take more space with STC (manipulating the return stack is one example), but the bottom line is with STC, there are a lot of options available if you wish to optimize for space.
kc5tja wrote:
I don't think anyone was contesting this. I know I certainly wasn't. But it nonetheless is true that STC will consume more space (for a 16-bit Forth implementation on a 6502) than a DTC implementation for the average program size. The average number of words in a colon definition turns out to be 8 in most Forth programs, so each definition, one can estimate, will have four bytes of "overhead" to them compared to other methods. Times the number of definitions, and you can see that it adds up...slightly.
Based on some of the optimizations listed above, I think there is even a chance that it's possible to write a STC Forth that is actually smaller than a ITC or DTC Forth on the 6502, given some reasonable ground rules, and assuming that both have the same functionality and both were optimized strictly for space. It'd be quite a challenge, though.