Liara Forth, an "ANSI(ish) initial Forth" for the W65C265SXB
So it's time to stop fooling around with blinking LEDs and write a Forth for the 265SXB.
From what I've seen, other people here such as Andrew (viewtopic.php?f=9&t=3612) are already working on large Forths for the 65816. My aim is to create a "first" or "initial" Forth for the 265SXB: the one you can download as a binary and install immediately after you've bought the board and some Flash memory. There should be tools to access expanded memory, but it will work "out of the box" with the 32 KB RAM and 24 KB Flash memory, which is the simplest (useful) configuration. As such, it will access the Mensch Monitor routines (at least at first), which is where the other 8 KB are.
Obviously I'll be drawing a lot of inspiration from Tali Forth for the 65c02. There are two things that didn't work out so well there I'll be changing:
Tailored specifically for the 265SXB. Tali Forth did have a hardware basis, but it was sorta, kinda supposed to run on more general hardware, which, to put it politely, led to some lack of structure. This time, the machine is exactly defined. Yes, there will be words to blink the LED and write to Flash.
Dictionary headers and code are kept separate. That allows fallthroughs and all other kinds of neat tricks with the code (see viewtopic.php?p=3331#p3331 for examples) that Tali can't do. (I should probably do a complete rewrite of Tali Forth, but as other people here have pointed out, once you have gone 16 bit with the 65816, it's pretty hard to go back.)
Like Tali, Liara will be based on subroutine threaded code (STC). Again, other people here are working on ITC and DTC versions. And I would still argue that because there are so few registers available, this makes a better fit. Also, like Tali, the ratio of primitives to threaded words will be rather high, for the added speed (fast the 265SXB ain't), the optimizations, and simply because I enjoyed all that coding. STC is good for these things.
(Footnote: I'm probably going to get an RPi 3 soon, which has the ARM Cortex-A53 64-bit CPU where they cleaned up the assembler ("AArch64", see https://www.element14.com/community/ser ... Manual.pdf). If I ever write a Forth for that, I promise to take a more serious look at DTC, because then I'll have about 30 64-bit registers to play with.)
The Return Stack will be the system stack, starting at 03FF and growing down. Direct Page (DP) would start at 0200, avoiding the first two pages completely where the Mensch Monitor does its thing. The Data Stack would start after whatever space the DP variables take and use X as the DSP. More on that in a later post, because I'm considering doing something weird with the stacks.
The other stuff is pretty obvious: 16-bit cell size, max code size 24 KB, terminal access at first via the USB power jack, but with the option of the serial port as an alternative input source. I'm still considering cooperative ("PAUSE") multitasking. There should be enough space for a small editor of some sort, which then should be able to cope with any extra RAM.
I've started a (rather empty) GitHub repository at https://github.com/scotws/LiaraForth . I expect things to be slow going for a while, first because I'll probably be finding bugs galore with my assembler and emulator, second because I'm going to get chased out into the garden a lot in the next few weeks.
Re: Liara Forth, an "ANSI(ish) initial Forth" for the W65C26
For Liara, I'm considering a different configuration of the stacks: Having them grow towards each other. I'm calling this the "Königskinder" (King's Children) design.
We'd put the Direct Page (DP) at 00:0200 to avoid the default Direct Page and Stack use of the Mensch Monitor in the first two pages. The Data Stack (DS) begins in this area after any variables that are used. It then grows "up" (towards 00:FFFF), not down. The Data Stack Pointer (DSP) is X, and points to the top entry on the Data Stack (TOS). The Return Stack (RS) is the normal system stack. It starts at 00:03FF and grows "down" (towards 00:0000) and points to the next free entry.
Code: Select all
   00:200 -> +-------------------------+ <- Direct Page start
             |                         |
             |  Direct Page Variables  |
             |                         |
(unknown) -> +-------------------------+ <- Data Stack Pointer Start (DSP0)
             |                   |     |
             |   Data Stack      |     |
             |                   V     | <- DSP (X)
             |                         |
             /~~~~~~~~~~~~~~~~~~~~~~~~~/
             |                         |
             |                   ^     | <- RSP (S)
             |   Return Stack    |     |
             |                   |     |
   00:3FF -> +-------------------------+ <- Stack Pointer (RSP0)
This configuration means that both stacks grow towards each other, eating up the pool of free space between them. Though the DS is limited to 128 16-bit entries (minus variables), in theory, the Return Stack could keep growing on the 65816 and crash into the DS. This potentially gives the RS more space, and we can test for a collision of the stacks -- an overflow of either the RS or DS -- before it happens by simply testing if DSP == RSP, because one points to its next entry, the other to its current entry.
More to the point, depending on how many DP entries are required for variables, it might be possible to get away with only using one page for both stacks (S would be 00:02FF), making multitasking less of a memory hog. There doesn't seem to be any reason that DP and stack areas can't overlap, and I have the feeling that most stacks are far too large.
(The math: Assume half of the 256 bytes go to the Return Stack, which leaves us 128 bytes of the page. Assume that half of that goes to variables (far too much, but the math is easier). This leaves us with 64 bytes for the Data Stack, or a stack depth of 32 cells at a cell size of 16 bit. That sounds like a very large number for one Forth interpreter thread on a single-user system.)
The name, BTW, comes from the German folk song "Es waren zwei Königskinder" (https://de.wikipedia.org/wiki/Es_waren_ ... nigskinder) about two children of a king who really liked each other, but could never meet because the water was too deep.
Quote:
Es waren zwei Königskinder,
die hatten einander so lieb,
sie konnten beisammen nicht kommen,
das Wasser war viel zu tief
(In English: There were two king's children, / who loved each other so dearly, / they could not come together, / the water was far too deep.)
The people you hear in the background clearing their throats are the Germans here who remember how the story ends (both die). That, dear children, is why you always remember to check for overflow.
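The one-page budget and the collision check described in this post can be tried out in a small Python model. This is only a sketch and not code from the project: the class name `Koenigskinder`, the start address `dsp0=0x0240` (64 bytes of variables above 00:0200) and the bookkeeping in "first free byte" terms are all assumptions made for illustration. In particular, the model tracks the first free byte above the Data Stack instead of having X point at TOS, so the overflow test is "fewer than two free bytes left" and not a literal DSP == RSP comparison.

```python
PAGE_BYTES = 256   # one page shared by DP variables and both stacks
CELL_BYTES = 2     # 16-bit cells


def one_page_budget(page_bytes=PAGE_BYTES, cell_bytes=CELL_BYTES):
    """Repeat the arithmetic from the post: half of the page for the
    Return Stack, half of the rest for variables, remainder for the DS."""
    return_stack = page_bytes // 2
    variables = (page_bytes - return_stack) // 2
    data_stack = page_bytes - return_stack - variables
    return {
        "return_stack_bytes": return_stack,
        "variable_bytes": variables,
        "data_stack_bytes": data_stack,
        "data_stack_cells": data_stack // cell_bytes,
    }


class Koenigskinder:
    """Toy model of two stacks growing towards each other (hypothetical)."""

    def __init__(self, dsp0=0x0240, rsp0=0x02FF):
        self.ds_free = dsp0  # first free byte above the Data Stack (grows up)
        self.rsp = rsp0      # next free byte of the Return Stack (grows down)

    def free_bytes(self):
        # Bytes from ds_free up to and including rsp are still unused.
        return self.rsp - self.ds_free + 1

    def _check(self):
        if self.free_bytes() < CELL_BYTES:
            raise OverflowError("stacks would collide")

    def push_data(self):
        self._check()
        self.ds_free += CELL_BYTES

    def push_return(self):
        self._check()
        self.rsp -= CELL_BYTES


budget = one_page_budget()
print(budget["data_stack_cells"])   # cells left for the Data Stack

stacks = Koenigskinder()
print(stacks.free_bytes())          # bytes shared by both stacks at start
```

Because the free space is shared, the model lets either stack use more than its nominal half as long as the other one stays shallow, which is the point of the design.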
- barrym95838
- Posts: 2056
- Joined: 30 Jun 2013
- Location: Sacramento, CA, USA
Re: Liara Forth, an "ANSI(ish) initial Forth" for the W65C26
Hi Scot,
I believe that the main reason for the data stack growing "down" is because Forth does a lot of indexing to NOS and 3OS in the primitives like OVER, SWAP, ROT, etc. If you try to do this on an "up" growing stack, you end up using a lot of negative indices, which (although not impossible) might require you to use "long" addressing to get the necessary wrap-around on the '816 (or a lot of DEX;DEX/INX;INX combos). Garth uses negative data-stack addressing in a spot or two in his '802 Forth, but he only has a bank zero, so 16-bit negative indices are all that are needed. Everywhere else, he uses fast and compact 8-bit indexing, just like the '02 Forths.
I might be wrong about this "bank-bleeding", but I think that you should consider it before you get too far. Are you going to limit your execution tokens to 16-bit, or use 24-bit, or padded 32-bit?
Mike B.
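To make the "bank-bleeding" concern concrete, here is a small Python sketch (not from the thread) of how an effective address could be formed for an absolute,X access in 65816 native mode with a 16-bit index. It assumes, as I understand the addressing rules, that the index is added to the full 24-bit address so that a carry spills into the next bank. The function names and the example value of X are made up for illustration.

```python
def absolute_x(dbr, base, x):
    """Model of `lda base,x` (absolute indexed): the 16-bit base is
    combined with the data bank register, then X is added across all
    24 bits, so the result can land in the next bank."""
    return (((dbr & 0xFF) << 16) + (base & 0xFFFF) + (x & 0xFFFF)) & 0xFFFFFF


def wrapped_16(base, x):
    """What a negative index is meant to do: wrap around inside 16 bits."""
    return (base + x) & 0xFFFF


x = 0x0210           # hypothetical data stack pointer in bank 0
minus_two = 0xFFFE   # -2 written as a 16-bit base address

print(hex(absolute_x(0x00, minus_two, x)))  # where the access really goes
print(hex(wrapped_16(minus_two, x)))        # where we wanted it to go
```

In this model the access meant for 00:020E ends up in bank 01 instead, which is why small positive offsets on a downward-growing stack are the more comfortable fit.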
- BigDumbDinosaur
- Posts: 9426
- Joined: 28 May 2009
- Location: Midwestern USA (JB Pritzker’s dystopia)
Re: Liara Forth, an "ANSI(ish) initial Forth" for the W65C26
Why not start the hardware stack very high in memory and not worry about a data stack/hardware stack collision? On my POC unit, the hardware stack starts at $00CBFF, which is right below the (fixed) TIA-232 FIFOs. Programs load at $000400, making the likelihood of a collision very remote. With the arrangement you are proposing, you will be consuming clock cycles in checking for free RAM every time an entry is added to the data stack.
x86? We ain't got no x86. We don't NEED no stinking x86!
Re: Liara Forth, an "ANSI(ish) initial Forth" for the W65C26
BigDumbDinosaur wrote:
Why not start the hardware stack very high in memory and not worry about a data stack/hardware stack collision?
Mike - I'll reverse the stack back to normal as well, though I'm not sure negative indexing would be that much of a problem for the machine (thinking about it might hurt my head). I'll be staying on Bank 00, because this is an "initial" Forth, so XTs will be the native cell size, 16 bits. A second, "big" Forth with full memory range would be 32 bits with padding, I think.
As an aside: I briefly considered a "shifted" addressing scheme: Addresses are 24 bits long, but segmented so that the last four bits will always be zero, wasting 8 bytes on average. But they can now be packed in a 16-bit space with one nibble of the Bank Byte.
Code: Select all
Variant A:
16-bit address stored in code: AAAB
24-bit unpacked real address: 1B:AAA0
Variant B:
16-bit address stored in code: BAAA
24-bit unpacked real address: 1B:AAA0
Of course, the whole unpacking part breaks your neck on the 65816; all the unmasking and shifting just takes too much time. It might work on an ARM processor because of the "free" shifting, but there you don't need it. Ah well, it was an interesting exercise.
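For anyone who wants to play with the packing idea, here is a rough Python sketch of both variants. It is an illustration only: the function names are invented, and it assumes the upper nibble of the bank byte is simply zero, which may not be what the "1B" in the listing above was meant to express.

```python
def _split(addr24):
    """Split a 24-bit address into (bank nibble, 16-bit offset) and check
    that it fits the scheme: low four bits zero, bank at most $0F."""
    bank, offset = addr24 >> 16, addr24 & 0xFFFF
    if offset & 0x000F or bank > 0x0F:
        raise ValueError("address cannot be packed into 16 bits")
    return bank, offset


def pack_a(addr24):
    """Variant A (AAAB): offset bits stay in place, bank in the low nibble."""
    bank, offset = _split(addr24)
    return offset | bank


def unpack_a(packed):
    return ((packed & 0x000F) << 16) | (packed & 0xFFF0)


def pack_b(addr24):
    """Variant B (BAAA): bank in the high nibble, offset shifted right by 4."""
    bank, offset = _split(addr24)
    return (bank << 12) | (offset >> 4)


def unpack_b(packed):
    return ((packed >> 12) << 16) | ((packed & 0x0FFF) << 4)


addr = 0x05ABC0
print(hex(pack_a(addr)), hex(unpack_a(pack_a(addr))))
print(hex(pack_b(addr)), hex(unpack_b(pack_b(addr))))
```

Variant A gets by with masking for the offset, while Variant B needs a four-bit shift in each direction, which is the kind of work that costs cycles on a CPU without a barrel shifter.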
- barrym95838
- Posts: 2056
- Joined: 30 Jun 2013
- Location: Sacramento, CA, USA
Re: Liara Forth, an "ANSI(ish) initial Forth" for the W65C26
I wasn't trying to stunt your creativity, Scot. The best way to try it out is to code a few primitives each way and see how it pans out, keeping in mind the '816's "Lovecraftian" bank wrapping and bank crossing personalities for its different execution states.
http://6502.org/tutorials/65c816opcodes.html
Mike B.
Re: Liara Forth, an "ANSI(ish) initial Forth" for the W65C26
Ha! Milestone: It worked in the emulator, and now it works on the machine. Now, all it does is print some strings and then echo what is typed, but since that proves the calls to PUT_CHR and GET_CHR work as intended, that takes care of most interface problems with the Mensch Monitor.
(Of course, I could have had this working four hours earlier if I hadn't stupidly forgotten to re-enable interrupts before calling PUT_CHR. What a difference a byte makes, right?)
Re: Liara Forth, an "ANSI(ish) initial Forth" for the W65C26
Remember when I wrote that PUT_CHR and GET_CHR were working? Well, not so much.
For some reason, I just couldn't get them to work in the FIND-NAME loop, and I got fed up after a day of wasted effort and used Andrew's I/O code from the w65c265sxb-hacker (https://github.com/andrew-jacobs/w65c265sxb-hacker) instead. His routines now live in a file named kernel.tasm (to isolate the hardware dependencies and for licensing reasons) and provide a much shorter, faster, and above all working version of put_chr and get_chr. There is still some weird problem with the Backspace character, possibly related to the terminal program, but the basic loop worked immediately. (I'll push the code when I've fixed the BS thing, pun intended.)
Half of the problem is the headache that comes from trying to figure out what the Mensch Monitor routines actually do. I'm perfectly willing to believe that the problem was my fault, but in the end, the hassle is not worth it if there is a simpler alternative. The MM at least needs better documentation, though since the version shipped with the board is from 1995 (if I remember correctly), a major update might be in order?
So, thank you, Andrew.
Re: Liara Forth, an "ANSI(ish) initial Forth" for the W65C26
There comes a time when you have to admit that your Very Clever Idea might in fact be Very Clever, but in the end is simply impractical and therefore A Bit Too Clever. In my case, it is keeping the top of the data stack (TOS) in Y.
It's not that it doesn't work. On the contrary, stuff like TYA and INY speed lots of words up very nicely. The problem is that it makes thinking about the code harder and some things a lot more complicated. Put differently, keeping everything "on X" might be slower and use more space, but it is the cleaner design. Put even more differently than that, this makes my brain hurt too much.
As an example of the problems you run into, take Forth words such as DEPTH and .S. Both need to know how many elements are on the stack. This is pretty straightforward with X as the stack pointer. With Y as TOS, you have various cases: Stack is completely empty; there is one element on the stack in Y; and there is more than one element, in Y and on the Direct Page via X. Now, this can be done: If X is equal to the initial value (which I called "dsp0"), we know the stack is empty. If X is dsp0+2, there is one element on the stack in Y. Starting at dsp0+4, we have two or more elements on the stack.
Yes, you can build a system this way, and it's faster and smaller than just using X. But this shows that it gets complicated because you keep running into situations where you have to proceed case-by-case. This in turn makes debugging harder - usually my bugs are of the "stupid typo" class, quickly found; with Liara, I'm running into more complicated bugs where Y was passed to A for something and then I got confused where to put it back on the stack. And this is with code where a lot of it is just adapting Tali Forth's 8-bit stuff to 16-bit. Makes me wonder which bugs I haven't found because the logic is more complicated. Be first, but first be right, as journalists say.
So I'll be rewriting Liara with the "common" stack model - slower, longer, but easier to understand - and put a sign with the letters KISS over my desk.
- barrym95838
- Posts: 2056
- Joined: 30 Jun 2013
- Location: Sacramento, CA, USA
Re: Liara Forth, an "ANSI(ish) initial Forth" for the W65C26
scotws wrote:
... As an example of the problems you run into, take Forth words such as DEPTH and .S. Both need to know how many elements are on the stack. This is pretty straightforward with X as the stack pointer. With Y as TOS, you have various cases: Stack is completely empty; there is one element on the stack in Y; and there is more than one element, in Y and on the Direct Page via X. Now, this can be done: If X is equal to the initial value (which I called "dsp0"), we know the stack is empty. If X is dsp0+2, there is one element on the stack in Y. Starting dsp0+4, we have two or more elements on the stack ...
Mike B.
-
White Flame
- Posts: 704
- Joined: 24 Jul 2012
Re: Liara Forth, an "ANSI(ish) initial Forth" for the W65C26
Besides, how often are DEPTH and .S used?
If it makes the most commonly used words faster, then it's a net win for performance.
If it makes most words easier to implement (ie, TOS-oriented ones), then it's a net win for simplicity.
If it makes most words shorter, but just a few like these longer, then it's a net win in size.
- GARTHWILSON
- Forum Moderator
- Posts: 8773
- Joined: 30 Aug 2002
- Location: Southern California
Re: Liara Forth, an "ANSI(ish) initial Forth" for the W65C26
White Flame wrote:
Besides, how often are DEPTH and .S used?
http://WilsonMinesCo.com/ lots of 6502 resources
The "second front page" is http://wilsonminesco.com/links.html .
What's an additional VIA among friends, anyhow?
- barrym95838
- Posts: 2056
- Joined: 30 Jun 2013
- Location: Sacramento, CA, USA
Re: Liara Forth, an "ANSI(ish) initial Forth" for the W65C26
Okay, I'm back. The best way to look at the two methods is to put them side by side (with a bonus TOS-in-a thrown in for good measure). In these methods, I am assuming 65c802/816 in full 16-bit register mode, SP is in x, decrement before push, increment after pull, two address units per cell:
Code: Select all
TOS in $0,x    TOS in y       TOS in a       | 65m32 TOS in a
------------   ------------   ------------   | --------------
NOS in $2,x    NOS in $0,x    NOS in $0,x    | NOS in $0,s
RSP in s       RSP in s       RSP in s       | RSP in x (!!!)

This is how we init the stacks :
ldx #RP0       ldx #RP0       ldx #RP0       | ldx #RP0
txs            txs            txs            | lds #SP0
ldx #SP0       ldx #SP0       ldx #SP0

This is how we DUP :
lda 0,x        dex            dex            | pha
dex            dex            dex            | $NEXT
dex            sty 0,x        sta 0,x
sta 0,x        $NEXT          $NEXT
$NEXT

This is how we DROP :
inx            ldy 0,x        lda 0,x        | pla
inx            inx            inx            | $NEXT
$NEXT          inx            inx
               $NEXT          $NEXT

This is how we OVER :
lda 2,x        dex            dex            | pda ,s
dex            dex            dex            | $NEXT
dex            sty 0,x        sta 0,x
sta 0,x        ldy 4,x        lda 4,x
$NEXT          $NEXT          $NEXT

This is how we SWAP :
lda 0,x        lda 0,x        ldy 0,x        | exa ,s
ldy 2,x        sty 0,x        sta 0,x        | $NEXT
sta 2,x        tay            tya
sty 0,x        $NEXT          $NEXT
$NEXT

This is how we NIP :
lda 0,x        inx            inx            | ins
inx            inx            inx            | $NEXT
inx            $NEXT          $NEXT
sta 0,x
$NEXT

This is how we @ :
lda (0,x)      lda 00,y       tay            | lda ,a
sta 0,x        tay            lda 00,y       | $NEXT
$NEXT          $NEXT          $NEXT

This is how we ! :
lda 2,x        lda 0,x        tay            | sla #,b
sta (0,x)      sta 00,y       lda 0,x        | sla ,b
inx            inx            sta 00,y       | $NEXT
inx            inx            inx
inx            ldy 0,x        inx
inx            inx            lda 0,x
$NEXT          inx            inx
               $NEXT          inx
                              $NEXT

This is how we + :
lda 0,x        tya            clc            | add ,s+
inx            clc            adc 0,x        | $NEXT
inx            adc 0,x        inx
clc            tay            inx
adc 0,x        inx            $NEXT
sta 0,x        inx
$NEXT          $NEXT

This is how we - :
lda 2,x        tya            eor #$ffff     | sub ,s+
sec            eor #$ffff     sec            | cdd #1
sbc 0,x        sec            adc 0,x        | $NEXT
inx            adc 0,x        inx
inx            tay            inx
sta 0,x        inx            $NEXT
$NEXT          inx
               $NEXT

This is how we >R :
lda 0,x        phy            pha            | sla ,-x
inx            ldy 0,x        lda 0,x        | $NEXT
inx            inx            inx
pha            inx            inx
$NEXT          $NEXT          $NEXT

This is how we R> :
pla            dex            dex            | pda ,x+
dex            dex            dex            | $NEXT
dex            sty 0,x        sta 0,x
sta 0,x        ply            pla
$NEXT          $NEXT          $NEXT

This is how we TUCK :
lda 0,x        lda 0,x        ldy 0,x        | exa ,s
ldy 2,x        sty 0,x        sta 0,x        | pda ,s
sta 2,x        dex            dex            | $NEXT
sty 0,x        dex            dex
dex            sta 0,x        sty 0,x
dex            $NEXT          $NEXT
sta 0,x
$NEXT

This is how we PICK :
txa            phx            phx            | add #,s
asl 0,x        tya            asl            | lda ,a
adc 0,x        asl            adc 1,s        | $NEXT
tay            adc 1,s        tax
lda 02,y       tax            lda 0,x
sta 0,x        ldy 0,x        plx
$NEXT          plx            $NEXT
               $NEXT

This is how we DEPTH :
txa            txa            dex            | pda #-1,s
eor #$ffff     eor #$ffff     dex            | cdd #SP0
sec            sec            sta 0,x        | $NEXT
adc #SP0       adc #SP0       txa
lsr            lsr            eor #$ffff
dex            dex            sec
dex            dex            adc #SP0-2
sta 0,x        sty 0,x        lsr
$NEXT          tay            $NEXT
               $NEXT
The only clear winner for TOS in $0,x is DROP, which is in all fairness a very commonly executed word [Edit: ! (store) is also a winner for TOS in RAM]. The other two methods tie or edge it out in machine code size and execution speed for almost everything else, though.
Regarding DEPTH ... you'll notice that it doesn't matter whether x is pointing to TOS or NOS; if (x == SP0), then the stack is empty, and if (x == SP0-2) then the stack contains one cell. The reason for this is that every word (like DUP ) which grows the stack does DEX DEX somewhere inside it, and every word (like DROP ) which shrinks the stack does INX INX somewhere inside.
The strange #SP0-2 in the TOS-in-a version of DEPTH is there to counteract the necessity of performing the machine language equivalent of DUP before trashing the accumulator for the depth calculation. If the stack was empty and DEPTH was called, the DUP would have pushed nonsense, but that's not a problem, as long as the nonsense doesn't enter into the calculation. In fact, it doesn't matter which of the three methods you use; if you DUP an empty stack you're DUPing nonsense. It's just bad luck that TOS-in-a has to DEX DEX before the TXA to maintain non-empty stack integrity, that's all.
So for the TOS-in-register methods, as long as you always remember that x points to NOS instead of TOS, it doesn't matter if either or both of them contain uninitialized nonsense, as long as you don't try to use the uninitialized nonsense by allowing x to increment above SP0.
Let's look at a specific example:
SP0 = 9
x = 9
a = 1234
The 1234 is just nonsense, but it doesn't matter:
Code: Select all
dex \ x is now 8
dex \ x is now 7
sta 0,x \ $07 now contains 1234
txa \ a is now 7
eor #$ffff \ a is now -8
sec
adc #SP0-2 \ a is now -8+1+9-2 = 0
lsr \ a is still 0, which is the correct depth
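The same trace can be replayed for other stack depths with a few lines of Python. This is only a sketch that mimics the 16-bit register arithmetic of the two DEPTH variants from the listing; the function names are invented, and SP0 = 9 is just the value used in the example above.

```python
MASK = 0xFFFF  # 16-bit registers


def depth_tos_in_a(x, sp0):
    """TOS-in-a DEPTH: dex dex / sta 0,x / txa / eor #$ffff / sec /
    adc #SP0-2 / lsr. Returns (new a, new x)."""
    x = (x - 2) & MASK                         # dex dex (makes room for old TOS)
    a = x                                      # txa (old a was saved by sta 0,x)
    a ^= MASK                                  # eor #$ffff
    a = (a + ((sp0 - 2) & MASK) + 1) & MASK    # sec / adc #SP0-2
    a >>= 1                                    # lsr: bytes to cells
    return a, x


def depth_tos_in_x(x, sp0):
    """TOS-in-$0,x DEPTH: txa / eor #$ffff / sec / adc #SP0 / lsr."""
    a = x ^ MASK
    a = (a + sp0 + 1) & MASK
    return a >> 1


SP0 = 9
for x in (9, 7, 5):   # empty stack, one cell, two cells
    print(x, depth_tos_in_a(x, SP0)[0], depth_tos_in_x(x, SP0))
```

Both variants agree for every starting value of x, which illustrates the point that DEPTH does not need a case-by-case treatment when TOS lives in a register.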
Hope this helps,
Mike B.
[Edit: Added in a column for my 65m32, with a hard line to separate fact from fantasy. Maybe I'll break through this year ...]
[Edit #2: Added code samples for >R R> TUCK and PICK ; fixed ! ]
Last edited by barrym95838 on Sat Sep 18, 2021 6:58 pm, edited 11 times in total.
- GARTHWILSON
- Forum Moderator
- Posts: 8773
- Joined: 30 Aug 2002
- Location: Southern California
Re: Liara Forth, an "ANSI(ish) initial Forth" for the W65C26
Thanks for doing that, Mike. So without counting stack initialization (since that's only done before we get to work), the overall length of TOS in 0,X and TOS in Y is only different by less than 2% which is insignificant, and TOS in A is a little over 7% shorter than TOS in Y, which is still minor. I haven't counted cycles yet, but it looks like the difference will be rather minor there, too. As they say, sometimes the best experience to have is someone else's, and now I've found out without doing it myself! 
http://WilsonMinesCo.com/ lots of 6502 resources
The "second front page" is http://wilsonminesco.com/links.html .
What's an additional VIA among friends, anyhow?
Re: Liara Forth, an "ANSI(ish) initial Forth" for the W65C26
Thanks to everybody for the input. Mike, now I feel bad, I should have pointed out that I actually did a whole bunch of comparisons like that before I started out with Liara - see https://github.com/scotws/LiaraForth/bl ... ariants.md for a table of the results including words such as R> and their cycle counts (which I am more interested in than the size on the 65816). Some of the code is in https://github.com/scotws/LiaraForth/bl ... narios.txt though I did most of them on paper while on public transport. I didn't do DEPTH at the time, because I considered it too rare; for the record, your version is in fact better than mine.
There is no question that TOS-A and TOS-Y are superior in size and speed to TOS-in-X, which is why I started out coding that way. TYA is something of a "killer instruction" with TOS-Y at one byte and two cycles in this regard, because its savings add up for things like testing if TOS is zero or a negative number.
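To make the TYA point concrete, here is a minimal sketch of the "is TOS zero or negative?" test in both layouts. It assumes native mode with 16-bit accumulator and index registers, a page-aligned direct page, and conventional WDC-style syntax; the labels are hypothetical and not taken from the Liara source.

```
; Assumes native mode, 16-bit A/X/Y, direct page low byte = 0.
; All labels below are hypothetical, for illustration only.

; Variant 1: TOS kept in Y
test_tos_y:
        tya                 ; 1 byte, 2 cycles: copy TOS to A, sets N and Z
        beq tos_zero        ; taken if TOS = 0
        bmi tos_negative    ; taken if bit 15 of TOS is set
        rts                 ; TOS is positive

; Variant 2: TOS kept in memory at 0,x (X is the data stack pointer)
test_tos_x:
        lda 0,x             ; 2 bytes, 5 cycles: load TOS, sets N and Z
        beq tos_zero
        bmi tos_negative
        rts                 ; TOS is positive

tos_zero:
tos_negative:
        rts
```

The only difference is the first instruction of each variant, but that one byte and those three cycles are paid again in every word that has to inspect TOS before acting on it.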
The problem is what White Flame touched on in his second point: complexity. SWAP and even DEPTH are trivial to understand in any version. FIND-NAME and PARSE-NAME are not; for example, you have to use Y as an index at some point and have to keep straight when Y is being used for what. With TOS-in-X, if you are not touching X, you are not touching the stack. To my surprise, I have found that overall complexity with TOS-Y is higher than I had expected.
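A small sketch of that double duty, assuming TOS in Y, native mode with 16-bit registers on entry, and a hypothetical two-byte direct page scratch pointer called tmp1. The routine is invented for illustration (it is not FIND-NAME or PARSE-NAME) and simply counts characters up to the first space.

```
; ( addr -- n )  count characters up to the first space
; Assumes the string contains a space; there is no length check.
; tmp1 and the labels are hypothetical.
count_to_space:
        sty tmp1            ; Y stops being TOS here: park the address
        ldy #0              ; from now on Y is a plain index
        sep #$20            ; 8-bit A for character compares
cts_loop:
        lda (tmp1),y        ; fetch next character
        cmp #$20            ; is it a space?
        beq cts_done
        iny
        bra cts_loop
cts_done:
        rep #$20            ; back to 16-bit A
        rts                 ; the count in Y is the new TOS
```

Between the `sty` and the `rts`, Y is not the top of the stack, and anything called in between that expects TOS in Y will misbehave. In a TOS-in-X version the address would be fetched with `lda 0,x` and the count written back with `sty 0,x`, so Y would never carry stack data at all.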
The question for me now is if the speed and size gains are really worth the added complexity. Added complexity means more bugs and more difficult bugs, but you could argue that this is a one-time investment, because once it works, it works forever. It makes it harder for other people to understand the code, but - let's be realistic here - there are at least three Forths for the 65816 now, so I'm mostly writing this for myself. Those arguments fall in the "dude, stop whining and fix your damn code" range.
One big advantage of using TOS-in-X would be that I could use a lot of the same code for Liara (65816) and a complete and very necessary rewrite of Tali (65c02). There is no sense in using Y as half-of-TOS for the 8-bit machine (though to be honest, I have never actually done the code, that might be an interesting exercise). However, the idea of a bare-metal Forth is that it fits the machine as best as it can, and TOS-Y works very well on the 65816. And I did set out to make Liara fast(ish).
The good news is that I'm not on the clock. Thanks to the magic of Git, I've opened a TOS-in-X branch (not pushed to master for now, so not on GitHub) and am going to see how much of a difference that actually makes. So far, as Garth has pointed out, not that much, though I haven't calculated the cycle counts yet (my most important metric). Once I have TOS-in-X versions of PARSE-NAME and FIND-NAME working, I should be able to make a final decision.