Extra stacks

GARTHWILSON · Post by **GARTHWILSON** » Fri Jul 09, 2004 7:06 am

I've already discussed this briefly with Bruce but I'll open it up here to the forum.

BACKGROUND

When Forth does floating-point arithmetic, it often uses the regular data stack for FP operations. An FP representation might still be only two cells (like a double-precision number). In the case of the 6502 where a cell is normally 16 bits (so a double is 32 bits), such an FP representation might use three bytes for mantissa and one for exponent.

There have been times I want more precision than you normally get with the 16-bit cells and the 32-bit intermediate results of UM/MOD and M* etc.; but I don't want the complexity or performance penalty of FP, and I don't want the stack mess that comes with having several very long items on the regular data stack (like 4 cells per item). It makes it a nightmare if you have to reach back in the stack with ROLL, PICK, and so on, and keep things straight. I have triple- and quad-precision integer Forth words from Forth Dimensions magazine, but they use the regular data stack, causing the same stack-management difficulties.

THE IDEA

It's common to have a separate FP stack, especially if the FP representations are longer than a regular double; but has anyone ever thought of having a separate stack for double-precision scaled-integer / fixed-point math? A standard cell on this stack could be 4 bytes and a "double" (like the intermediate results of */ ) would be 8, instead of the 2 and 4 bytes of the normal data stack. 64 bits gives a signed range of almost ±1E19 (about ±9.2E18).
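As a small illustration of why the wider intermediate result of */ matters in scaled-integer work, here is the classic multiplication by pi using the rational approximation 355/113. This is only a sketch in standard Forth; the word name PI* is made up for the example and is not from the thread.

```forth
\ Scale n by pi using the rational approximation 355/113.
\ */ keeps the product n*355 as a double-cell intermediate,
\ so it does not overflow a single cell before the division.
: PI* ( n -- n*pi )  355 113 */ ;

10000 PI* .   \ prints 31415 ; the product 3550000 would not fit in 16 bits
```

On a 16-bit Forth the same phrase written as 355 * 113 / would overflow at the multiplication. The proposed H-stack would give the same protection one size up: 32-bit cells with 64-bit intermediates.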

Additionally, it could be a complex stack, with the imaginary part simply being ignored in real-only arithmetic. The various operators could even check for 0i in the involved cells and determine automatically whether complex-number arithmetic is necessary. Hmmm... that raises the issue of initialization.

NAMING CONVENTIONS?

The FP stack arithmetic words are always F* F/ F+ etc., and normal stack integer double-precision words are like D+, so maybe the higher-precision-stack words should start with H, like H+ H* H*/ HDROP HDUP H2DUP HROT etc., unless you know of something else that is already somewhat standardized. Then of course there would be words to transfer things between stacks, like >H H> etc.

WHERE IN 6502 MEMORY?

My original idea for 6502 implementation would be to have it start at the other end of available ZP (or DP, in the case of the '816) space from the regular data stack, and have them grow toward each other, so that all the free space is all together in the middle and you'll never have the situation where one stack is out of space while the other one has plenty of unused space.

Then Bruce suggested that since the numbers on the high-precision stack won't be addresses needing ZP for the indirect addressing modes, this stack could be kept anywhere in RAM; and that furthermore, the various bytes of a "cell" would not have to be kept together, so indexing can be made easier with for example TOSbyte1,X, TOSbyte2,X, TOSbyte3,X, etc., where the value in X is always 0 for TOS (top of stack), 1 for the next "cell", etc. That idea gets even better when considering a complex stack where the doubles really eat up memory (16 bytes each!)

If the stack were limited to 8 levels of complex doubles (i.e., the same as 16 complex singles), it would take half a page of the 6502 memory map; not bad, as long as it doesn't have to be in ZP. The greater consideration might be for the memory needed by the dozens of extra words (H>D, D>H, H*, H_OVER, HDUP, etc.).
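To make the split-byte layout concrete, here is a rough model of it in standard Forth. It is a sketch of the storage scheme only, not 6502 code: it assumes a host Forth whose cells are at least 32 bits, so one H-stack item fits in a single data-stack cell, and the names HDEPTH, HB0 to HB3 and HSP are invented for the example. Each byte of an item lives in its own array ("plane"), and one index selects the same item in every plane, just as X or Y would.

```forth
8 CONSTANT HDEPTH                 \ assumed depth: 8 items
CREATE HB0 HDEPTH ALLOT           \ plane of least-significant bytes
CREATE HB1 HDEPTH ALLOT
CREATE HB2 HDEPTH ALLOT
CREATE HB3 HDEPTH ALLOT           \ plane of most-significant bytes
VARIABLE HSP   HDEPTH HSP !       \ index of the top item; HDEPTH = empty

: >H ( u -- )                     \ push the low 32 bits of u
   HSP @ 0= ABORT" H-stack overflow"
   -1 HSP +!
   DUP           255 AND  HB0 HSP @ + C!
   DUP  8 RSHIFT 255 AND  HB1 HSP @ + C!
   DUP 16 RSHIFT 255 AND  HB2 HSP @ + C!
       24 RSHIFT 255 AND  HB3 HSP @ + C! ;

: H> ( -- u )                     \ pop one 32-bit item
   HSP @ HDEPTH = ABORT" H-stack underflow"
   HB3 HSP @ + C@     8 LSHIFT
   HB2 HSP @ + C@ OR  8 LSHIFT
   HB1 HSP @ + C@ OR  8 LSHIFT
   HB0 HSP @ + C@ OR
   1 HSP +! ;

: HDUP ( -- )  H> DUP >H >H ;     \ duplicate the top H-stack item
: H+   ( -- )  H> H> + >H ;       \ 32-bit add; >H discards any carry out
```

A real 6502 version would do the arithmetic byte by byte in place (e.g. an ADC chain across the planes) rather than moving items through the data stack; going through H> and >H here just keeps the model short.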

WHY THE FUSS?

It seems like 16 bits (with the occasional 32 bits) ought to be enough for most applications. The higher-precision stack interests me now partly because of my problem with my 16-bit scaled-integer sine and cosine words that are producing unreasonably high distortion products in an FFT routine. With potentially thousands of calculations that go into a particular output cell, the round-off and truncation errors get compounded. I'm sure there's a way to calculate these more accurately with the resources that are already there, but I'm not the math specialist to figure it out. [Edit: I found the errors were due to a multiplication routine bug that only shows up under certain very limited circumstances, and I fixed it. Still, there remain many applications where the higher precision is needed. With this FFT, I can only do 2048 7-bit samples without overflowing the 16-bit cells.] When I get the large look-up tables implemented [Edit, 6/25/12: posted here], the sine and cosine problem will disappear since the look-ups will be accurate to all 16 bits [Edit, a year later, in 2005: My improved SIN & COS routines are now accurate usually in all 16 bits, and never off by more than one lsb]; but I know there will be something else later. Sometimes the multiple-precision wish is also just to make it easier to keep things within range and not have to worry about losing precision due to near-underflow conditions or getting totally wrong answers due to overflow conditions in intermediate calculations in a long string of them. It seems like this high-precision stack would be more efficient than FP in most respects, but I'm open to ideas.

kc5tja · Post by **kc5tja** » Fri Jul 09, 2004 4:56 pm

In my FTS1001 simulations, I'm finding that I haven't even come close to fully using the 16-deep data stack. 8 items should be plenty for the H-stack.

dclxvi · Post by **dclxvi** » Mon Jul 12, 2004 9:07 am

GARTHWILSON wrote:

Then Bruce suggested that since the numbers on the high-precision stack won't be addresses needing ZP for the indirect addressing modes, this stack could be kept anywhere in RAM; and that furthermore, the various bytes of a "cell" would not have to be kept together, so indexing can be made easier with for example TOSbyte1,X, TOSbyte2,X, TOSbyte3,X, etc., where the value in X is always 0 for TOS (top of stack), 1 for the next "cell", etc. That idea gets even better when considering a complex stack where the doubles really eat up memory (16 bytes each!)

I'd like to add:

1. If the data stack pointer is the X register (i.e. the usual 6502 implementation), it will often be more convenient to use LDY H_STK_PTR and access the "high-precision" stack with H_STK_0,Y and H_STK_1,Y etc. than using LDX and H_STK_0,X etc. since you won't have to save and restore X in the former case. Of course abs,Y addressing isn't available for all instructions (e.g. ASL).

2. The "high-precision" stack pointer can be decremented with a DEC H_STK_PTR or incremented with an INC H_STK_PTR. Decrementing or incrementing the data stack pointer traditionally takes 4 cycles (a pair of DEXs or a pair of INXs), while an INC zp takes only 5 cycles and an INC abs takes only 6 cycles, so the performance hit is small.

3. For many instructions, abs,Y (or abs,X) takes the same number of cycles as zp,X so as long as the "high-precision" stack is placed in memory where abs,Y won't cross a page boundary, the performance hit is again small. (STA is one exception. STA zp,X takes 4 cycles, but STA abs,Y takes 5 cycles.)

4. There are LDX abs,Y and LDY abs,X instructions but no corresponding STX abs,Y and STY abs,X instructions, so you'll have to use a TXA STA sequence or a TYA STA sequence instead, which adds 2 cycles.

kc5tja · Post by **kc5tja** » Mon Jul 12, 2004 3:32 pm

dclxvi wrote:

3. For many instructions, abs,Y (or abs,X) takes the same number of cycles as zp,X so as long as the "high-precision" stack is placed in memory where abs,Y won't cross a page boundary, the performance hit is again small. (STA is one exception. STA zp,X takes 4 cycles, but STA abs,Y takes 5 cycles.)

How is this possible? There are three opcode bytes to fetch instead of just two. This suggests that abs,Y ought to be one cycle more than zp,X.

Thowllly · Post by **Thowllly** » Tue Jul 13, 2004 12:52 pm

kc5tja wrote:

dclxvi wrote:

3. For many instructions, abs,Y (or abs,X) takes the same number of cycles as zp,X so as long as the "high-precision" stack is placed in memory where abs,Y won't cross a page boundary, the performance hit is again small. (STA is one exception. STA zp,X takes 4 cycles, but STA abs,Y takes 5 cycles.)

How is this possible? There are three opcode bytes to fetch instead of just two. This suggests that abs,Y ought to be one cycle more than zp,X.

zp,X will load the zp address in one cycle and then add X to it in the next cycle. abs,X will load the low byte of the address in one cycle and in the next cycle it will both add X to the low byte and load the high byte of the address. Only if X+low byte overflows is another cycle needed to increment the high byte.

kc5tja · Post by **kc5tja** » Tue Jul 13, 2004 2:40 pm

Thowllly wrote:

zp,X will load the zp address in one cycle and then add X to it in the next cycle. abs,X will load the low byte of the address in one cycle and in the next cycle it will both add X to the low byte and load the high byte of the address. Only if X+low byte overflows is another cycle needed to increment the high byte.

Ah, I forgot about that. I thought that the CPU would feed the ZP byte directly to the ALU, while feeding X to it as well, thus saving a cycle. But, I guess it doesn't.

Thanks.

JimBoyd · Post by **JimBoyd** » Sun Oct 04, 2020 8:18 pm

Has anyone else implemented an extra stack ( or more) in Forth? How did it affect your programming?
I've implemented an extra stack in Fleet Forth that I mostly use for a spare data stack. I've defined the following parallels for the return stack words:

Code:

>R      >A
R>      A>
R@      A@
DUP>R   DUP>A
2>R     2>A
2R>     2A>

Since this extra stack, which I call the auxiliary stack, is right up against the return stack and right up against the area the C64 uses to keep track of which files are open, the words to move data to and from the auxiliary stack ( or aux stack ) test for overflow/underflow.
I usually use the aux stack to hold control flow data so I can do more with CODE words while keeping the source sane and avoiding hand calculated offsets. The aux stack words CS>A and A>CS move the control flow data on the control flow stack ( data stack ) to or from the aux stack. I also use the aux stack to hold temporary addresses to be resolved later, when I want one CODE word to branch or jump into another CODE word at a certain location.
I have also used these words in place of their return stack counterparts when defining a new word ( for my decompiler ) to test it. Once it was working, I changed the aux stack words to the faster return stack words.
It's even been helpful in hand tracing the execution of a system word I was modifying ( to make it easier to support more drive types ).
On another thread, SamCoVT mentioned:

SamCoVT wrote:

I'll also recommend avoiding >r and r> when easy/possible because they make the words harder to test. While they are sometimes the exact right tool for the job, they can only be used in word definitions while compiling.

Aux stack to the rescue!
First, some temporary redefinitions to make things a little safer, just in case:

Code:

: >R >A ; 
REDEFINE: >R
 OK
: 2>R 2>A ; 
REDEFINE: 2>R
 OK
: DUP>R DUP>A ; 
REDEFINE: DUP>R
 OK
: R> A> ; 
REDEFINE: R>
 OK
: 2R> 2A> ; 
REDEFINE: 2R>
 OK
: R@ A@ ; 
REDEFINE: R@
 OK

Here is modified source for one of Fleet Forth's system words:

Code:

// (DR/W)
HEX
NH 2 CONSTANT DSI
: (DR/W)  ( ADR BLK# R/WF CNT -- )
   1- SPLIT 2>R  T&S (IS) DSI
   R> 0
   ?DO
      >R  2OVER 2OVER R@ 100 SR/W
      2>R
      100 UNDER+  DSI + 2 PICK /MOD
      2R>
      ROT UNDER+  R>
   LOOP
   R> 1+ SR/W  DROP ;
' (DR/W) IS DR/W

And here is the log of tracing it by hand:

Code:

HEX  OK
2 DRIVE  OK
PAD 315 1 B/BUF  OK
.S 5934  315    1  400  OK
.AS EMPTY  OK
1- SPLIT  OK
.S 5934  315    1   FF    3  OK
2>A  OK
.S 5934  315    1  OK
T&S81  OK
.S   28 5934   24   50    0    1    1  OK
0 VALUE DSI  OK
TO DSI  OK
.S   28 5934   24   50    0    1  OK
A> 0  OK
.S   28 5934   24   50    0    1    3 
   0  OK
. . 0 3  OK
>A 2OVER 2OVER A@ 100  OK
: .SRW CR . . . . . . ;  OK
.S   28 5934   24   50    0 5934   24 
  50    0    1  100  OK
.A  OK
.S   28 5934   24   50    0 5934   24 
  50    0    1  100    A    0  OK
D. A  OK
.AS   FF    1  OK
.S   28 5934   24   50    0 5934   24 
  50    0    1  100  OK
.SRW 
100 1 0 50 24 5934  OK
2>A 100 UNDER+ DSI + 2 PICK /MOD  OK
.S   28 5A34   25    0  OK
2A> ROT UNDER+ A>  OK
.S   28 5A34   25   50    0    1  OK
>A 2OVER 2OVER A@ 100 .SRW 
100 1 0 50 25 5A34  OK
2>A 100 UNDER+ DSI + 2 PICK /MOD  OK
2A> ROT UNDER+ R>  OK
.S   28 5B34   26   50    0    1  OK
.AS   FF  OK
>A 2OVER 2OVER A@ 100 .SRW 
100 1 0 50 26 5B34  OK
2>A 100 UNDER+ DSI + 2 PICK /MOD  OK
2A> ROT UNDER+ A>  OK
.S   28 5C34   27   50    0    1  OK
R>  OK
.S   28 5C34   27   50    0    1   FF  OK
1+  OK
.SRW DROP 
100 1 0 50 27 5C34  OK
.S EMPTY  OK
CONSOLE

There is one place in the log where I inadvertently typed .A instead of .AS , placing a double on the data stack rather than displaying the contents of the aux stack. I promptly removed it and continued tracing by hand.
Had I accidentally typed >R rather than >A ( because that is what the source has ) it would have been fine thanks to the temporary redefinitions. Accidentally typing ?DO or LOOP would not have caused a problem other than clearing all stacks when it aborted with the message "FOR COMPILING".

SamCoVT · Post by **SamCoVT** » Sun Oct 04, 2020 10:17 pm

JimBoyd wrote:

On another thread, SamCoVT mentioned:

SamCoVT wrote:

I'll also recommend avoiding >r and r> when easy/possible because they make the words harder to test. While they are sometimes the exact right tool for the job, they can only be used in word definitions while compiling.

Aux stack to the rescue!

OK - This is pretty slick and something I wish I had in my brain earlier. For doing the kind of debugging work you show, it doesn't even have to be fast - a simple implementation using some space ALLOTted in the dictionary along with an index or pointer would work fine. Thanks for sharing!
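A minimal sketch of that kind of aux stack in standard Forth, for anyone who wants to try the technique on another system. This is not the Fleet Forth implementation; the depth of 40 cells and the names ASTACK, ADEPTH and ASP are assumptions made for the example. The user-visible words follow the names given earlier in the thread.

```forth
40 CONSTANT ADEPTH                    \ assumed depth in cells
CREATE ASTACK ADEPTH CELLS ALLOT      \ storage ALLOTted in the dictionary
VARIABLE ASP   0 ASP !                \ number of items now on the aux stack

: >A ( x -- )
   ASP @ ADEPTH = ABORT" aux stack overflow"
   ASTACK ASP @ CELLS + !
   1 ASP +! ;

: A> ( -- x )
   ASP @ 0= ABORT" aux stack underflow"
   -1 ASP +!
   ASTACK ASP @ CELLS + @ ;

: A@ ( -- x )                         \ copy the top item without removing it
   ASP @ 0= ABORT" aux stack empty"
   ASTACK ASP @ 1- CELLS + @ ;

: 2>A ( x1 x2 -- )  SWAP >A >A ;      \ same item order as 2>R
: 2A> ( -- x1 x2 )  A> A> SWAP ;      \ same item order as 2R>

: .AS ( -- )                          \ show the aux stack, bottom item first
   ASP @ 0= IF ." EMPTY " EXIT THEN
   ASP @ 0 DO  ASTACK I CELLS + @ .  LOOP ;
```

Because these are ordinary colon definitions with no compile-only restrictions, they can be typed at the interpreter prompt, which is what makes the hand-tracing session above possible.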

IamRob · Post by **IamRob** » Mon Oct 05, 2020 4:03 am

Instead of >r and r>, or even variables, I started using free ZP locations for temporary storage. I call it Z! and Z@, which are defined as,

: Z! 0 ! ;  ( or any free ZP address )
: Z@ 0 @ ;

The advantage of using memory locations compared to >R is that the value doesn't have to be dropped at the end, with R> DROP, if it is not needed.
Another use for the ZP location is the loop variable doesn't get retained when LEAVE is encountered. So I will use: I Z! LEAVE in words that contain a loop that exits prematurely.
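The "I Z! LEAVE" idiom can be sketched in standard Forth with an ordinary VARIABLE standing in for the zero-page location. The names FOUND-AT and FIND-BYTE are made up for this example, and presetting the variable to -1 so the caller can tell whether the loop ended early is an assumption of the sketch, not something stated above.

```forth
VARIABLE FOUND-AT                     \ stands in for the free ZP location

\ Search len bytes at addr for c. Afterwards FOUND-AT holds the index
\ of the first match, or -1 if the loop ran to completion.
: FIND-BYTE ( c addr len -- )
   -1 FOUND-AT !
   0 ?DO
      2DUP I + C@ =                   \ compare c with the byte at addr+I
      IF  I FOUND-AT !  LEAVE  THEN   \ save the index, then exit the loop
   LOOP
   2DROP ;

CREATE BUF  3 C, 5 C, 7 C,            \ three test bytes
7 BUF 3 FIND-BYTE  FOUND-AT @ .       \ prints 2
```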

GARTHWILSON · Post by **GARTHWILSON** » Mon Oct 05, 2020 4:07 am

IamRob wrote:

Another use for the ZP location is the loop variable doesn't get retained when LEAVE is encountered. So I will use: I Z! LEAVE in words that contain a loop that exits prematurely.

How 'bout just having LEAVE store the loop index in a variable, all in the one primitive so it's faster. I think I'll do that myself.

GARTHWILSON · Post by **GARTHWILSON** » Mon Oct 05, 2020 6:34 am

Jim, you have quite a few words above that are neither part of any standard I know of, nor defined above. One I'll ask about however is REDEFINE:. It appears to edit the old word to redirect execution to the new one, for secondaries that are already compiled using it, so those secondaries don't need to be recompiled. Is that what's happening? I've had a way to do that but I like yours more.

JimBoyd · Post by **JimBoyd** » Mon Oct 05, 2020 8:14 pm

GARTHWILSON wrote:

Jim, you have quite a few words above that are neither part of any standard I know of, nor defined above. One I'll ask about however is REDEFINE:. It appears to edit the old word to redirect execution to the new one, for secondaries that are already compiled using it, so those secondaries don't need to be recompiled. Is that what's happening? I've had a way to do that but I like yours more.

No. Sorry, I should have been clear about that. This section:

Code:

: >R >A ; 
REDEFINE: >R
 OK
: 2>R 2>A ; 
REDEFINE: 2>R
 OK
: DUP>R DUP>A ; 
REDEFINE: DUP>R
 OK
: R> A> ; 
REDEFINE: R>
 OK
: 2R> 2A> ; 
REDEFINE: 2R>
 OK
: R@ A@ ; 
REDEFINE: R@
 OK

is part of the log of the interactive session where I hand traced the word (DR/W) . I had modified it to make it easier to support the 1581 disk drive as well as the others.
"REDEFINE: >R" is a message from the system letting me know I redefined >R . It's harder to tell that from the print dump than a live session so I may change that message to something like:
"YOU REDEFINED >R"
or even
">R EXISTS"
or maybe
">R REDEFINED"
or even
">R WAS REDEFINED"

Your comment does give me an idea and I'll have to give it some thought.
As for the other words, the source is from the source for my Forth kernel.
NH sets a flag so the metacompiler compiles the next word headerless. For interactive testing in Forth, not metacompiling, I redefine NH as a no-op

Code:

: NH ;

In the log, the phrase "2 DRIVE" ( "10 DRIVE" would also work ) sets the current drive to drive 10 ( drive 8 being selected with "0 DRIVE" or "8 DRIVE" ) . Commodore 64 disk drives start at device 8 and go up from there.
SPLIT splits a cell into its low byte and high byte. It is seven bytes and is a really fast "$100 /MOD" .
Fleet Forth, like Blazin' Forth, uses direct access to drive sectors for block access ( on disks that are only supposed to be for blocks) .
T&S derives the starting track and sector from the block number.
(IS) is the primitive used by IS and TO to write the value on the data stack into the first cell of the parameter field of the following word in the definition and bump IP past said word.
SR/W is the sector read write word.
UNDER+ has the following stack diagram:
( N1 N2 N3 -- N1+N3 N2 )
(DR/W) is the vector for the deferred word DR/W , disk read write.
Either DR/W or RR/W ( ram read write ) is executed by R/W depending on the block number.
I've modified things since my latest upload: with a block number of $8000 and up, RR/W is executed. RR/W sees a block number $8000 less than the actual block number.
I hope this clarifies things.
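For readers without Fleet Forth, high-level stand-ins for two of these words might look like the following. They are guesses that match the descriptions and stack diagram above, not the Fleet Forth source, where SPLIT is described as a seven-byte word.

```forth
HEX
: SPLIT  ( u -- lo hi )           \ same result as  100 /MOD  for a 16-bit u
   DUP FF AND  SWAP 8 RSHIFT ;

: UNDER+ ( n1 n2 n3 -- n1+n3 n2 ) \ add the top item to the third item
   ROT + SWAP ;

3FF SPLIT . .                     \ prints 3 FF  (high byte is on top)
DECIMAL
```

The 3FF case is the one from the trace log, where "1- SPLIT" turns 400 into FF and 3.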

JimBoyd · Post by **JimBoyd** » Mon Oct 05, 2020 8:34 pm

IamRob wrote:

Instead of >r and r>, or even variables, I started using free ZP locations for temporary storage. I call it Z! and Z@, which are defined as,

: Z! 0 ! ; (or any free ZP memory)
: Z@ 0 @ ;

One disadvantage is keeping track of which ZP locations you are using if you need more than one. The aux stack is an actual stack and my implementation is over 40 cells deep. ( memory the C64 wasn't using below screen memory ) .

Quote:

The advantage of using memory locations compared to >R is that the value doesn't have to be dropped at the end, with R> DROP, if it is not needed.
Another use for the ZP location is the loop variable doesn't get retained when LEAVE is encountered. So I will use: I Z! LEAVE in words that contain a loop that exits prematurely.

Wouldn't you still need to initialize the storage with a sentinel value so you know if you left the loop prematurely? Something like 0 Z! or -1 Z! ?
In one of my system words, I leave a loop like this:

Code:

   ?DO
      DUP I >BT @ =
      IF  DROP I UNLOOP
      ELSE CS>A
   LOOP

UNLOOP discards the loop parameters and I branch out of the loop by moving the control flow data from ELSE to the aux stack. Further in the definition I resolve the ELSE by moving the control flow data back to the data stack ( the control flow stack ) like this:

Code:

   A>CS THEN

Yes, the definition was a bit long. It could not be factored into non trivial smaller parts that got used more than once though.

GARTHWILSON · Post by **GARTHWILSON** » Mon Oct 05, 2020 9:10 pm

JimBoyd wrote:

I hope this clarifies things.

Yes, that clears up a lot.

Quote:

SPLIT splits a cell into its low byte and high byte.

Yes, that one is standard. COMBINE is the complement.

IamRob · Post by **IamRob** » Tue Oct 06, 2020 3:51 am

JimBoyd wrote:

IamRob wrote:

Instead of >r and r>, or even variables, I started using free ZP locations for temporary storage. I call it Z! and Z@, which are defined as,

: Z! 0 ! ; (or any free ZP memory)
: Z@ 0 @ ;

One disadvantage is keeping track of which ZP locations you are using if you need more than one. The aux stack is an actual stack and my implementation is over 40 cells deep. ( memory the C64 wasn't using below screen memory ) .

Quote:

The advantage of using memory locations compared to >R is that the value doesn't have to be dropped at the end, with R> DROP, if it is not needed.
Another use for the ZP location is the loop variable doesn't get retained when LEAVE is encountered. So I will use: I Z! LEAVE in words that contain a loop that exits prematurely.

Wouldn't you still need to initialize the storage with a sentinel value so you know if you left the loop prematurely? Something like 0 Z! or -1 Z! ?
In one of my system words, I leave a loop like this:

Code:

   ?DO
      DUP I >BT @ =
      IF  DROP I UNLOOP
      ELSE CS>A
   LOOP

UNLOOP discards the loop parameters and I branch out of the loop by moving the control flow data from ELSE to the aux stack. Further in the definition I resolve the ELSE by moving the control flow data back to the data stack ( the control flow stack ) like this:

Code:

   A>CS THEN

Yes, the definition was a bit long. It could not be factored into non trivial smaller parts that got used more than once though.

Yes, I have to initialize Z! to zero. I have thought of implementing something like UNLOOP, which is a cleaner exit and probably faster. Is your UNLOOP a primitive or a word?
