W65C816 struggling with DTC

enso · Post by **enso** » Wed Nov 27, 2019 5:36 am

Help! I've implemented and used many Forth-like STC systems over decades now, so am very rusty on classic Forth. I kind of miss the idea of a thin inner interpreter and all the niceties that go with it (such as simplicity and ability to decompile threaded code easily, etc.) I am trying to wrap my noodle around how a DTC inner may be implemented.

Assuming Y is DSP and X is IP; I would like to keep TOS in A. That doesn't leave too many registers, so NEXT can be INX INX JMP (0,X) 5/10. Primitives can inline or jmp to NEXT at the end. So far so good.

EXIT is basically PLX NEXT 6/15.

High level words work fine if I inline the entry code as a thought experiment:

Code: Select all

  PHX       ;1/4 save IP
  LDX $HERE ;3/3 set new IP
  JMP (0,X) ;3/6
HERE:
  dw DUP
  ..
  dw EXIT

That seems likely to work, but that's 7 bytes and 13 cycles. I am not too crazy about hardcoding a literal address, but getting the processor PC any other way is even worse, and we have no registers to keep W (and probably wouldn't want to as it's bigger and longer). Not that I would do this for real, of course, because DOCOL.

Now, DOCOL will need to swap the (return address-1) at the top of the stack with X (IP). Since I have no spare registers, this turns ugly enough for me to give up on the whole idea.

The best I've come up with is to start all high-level words with PHX JSR DOCOL (4/10), and have docol pull the address-1 into X (and increment it). That leaves IP on the return stack, and avoids ugly swapping.

Notes:
* I would have preferred to use X for DSP, but there is no JMP (0,Y).
* I thought about DP as DSP, but incrementing and decrementing it seems too painful.
* If the worst comes to worst, I could give up on A, especially if the win in DOCOL is big enough to compensate for slightly bigger/slower datastack manipulation.
* Overhead - NEXT:5/10 high-level entry 4/10, EXIT 6/15 (including a NEXT). Not too bad, minimum of 20 cycles * vs. 12 for JSR/RTS.

Any suggestions? Am I missing something obvious? Thanks in advance.

GARTHWILSON · Post by **GARTHWILSON** » Wed Nov 27, 2019 7:28 am

enso wrote:

I would like to keep TOS in A.

See this post. Elsewhere, Michael Barry (forum member barrym95838), posted about trying out different ways, and included performance charts. There was a small—almost negligible— performance benefit to keeping TOS in A. Maybe he or someone else has it bookmarked or can remember adequate search terms to find it.

enso · Post by **enso** » Wed Nov 27, 2019 8:09 am

Yes, I saw that.
TOS in a register makes a big difference in an optimized STC compiler that does a lot of inlining. In fact, there is a Forth implementation I really appreciate, FreeForth, that keeps both TOS and NOS in registers, and uses 'register renaming in software' as it compiles code, optimising away things like SWAP altogether! I have to keep reminding myself that it does not matter much here.

barrym95838 · Post by **barrym95838** » Wed Nov 27, 2019 9:08 am

We talk about different scenarios here:

viewtopic.php?f=9&t=3649

... but I don't delve into DTC NEXT (which is just a single instruction on my 65m32).

Quote:

NEXT can be INX INX JMP (0,X) 5/10.

If you use Y for IP then you could try the equivalent INY PHY INY RTS 4/14(?). [Edit: this seems wrong ... I missed a level of indirection.]

X is a very versatile register, and deciding the best way to use it involves careful thought and test coding. Native '816 coding isn't really my thing, so I'll defer further speculation to the 16-bit experts in the gallery.

BitWise · Post by **BitWise** » Wed Nov 27, 2019 3:45 pm

In my '816 DTC Forth I use this

Code: Select all

    80                        ; The CONTINUE macro is used at the end of a native word to invoke the next
    81                        ; word pointer.
    82                        
    83                        CONTINUE        macro
    84                                        tyx                             ; Copy IP to X
    85                                        iny
    86                                        iny
    87                                        jmp     (0,x)                   ; Then execute word
    88                                        endm

I also use DP as the data stack pointer so all access to the stack is via zero page instructions, for example:

Code: Select all

   375                        ; ! ( x a-addr -- )
   376                        ;
   377                        ; Store x at a-addr.
   378                        
   379 00:0546: 35 05 00 01                  HEADER  1,"!",NORMAL
       00:054A: 21 
   380                        STORE:
   381 00:054B: A5 03                        lda     <3                      ; Fetch data value
   382 00:054D: 92 01                        sta     (1)                     ; .. and store
   383 00:054F: 7B                           tdc                             ; Clean up data stack
   384 00:0550: 1A                           inc     a
   385 00:0551: 1A                           inc     a
   386 00:0552: 1A                           inc     a
   387 00:0553: 1A                           inc     a
   388 00:0554: 5B                           tcd
   389 00:0555: BB C8 C8 7C                  CONTINUE                        ; Done
       00:0559: 00 00

enso · Post by **enso** » Wed Nov 27, 2019 7:13 pm

BitWise, that actually looks good. I was a little put off by tdc/inc a/inc a/tcd sequence, but at 4 bytes and 8 cycles, it is not out of the question; the ability to access data on the stack, and, as you showed, indirect through stack data, is really powerful. I also realized that code words that really mess with the stack can usually adjust it once (or twice at most). Forth actually under-utilizes data higher up on the stack; I suppose the intent of DP is to be used as a frame pointer for HLLs. I will make a note to my future self who is busy implementing a Lisp.

I also found your Forth implementation (somehow missed it before, will take a look as it's probably what I need). I suppose using Y as IP gives you access to pre-incremented IP in X to jump through and post-incremented IP to push in DOCOL. The way I was doing saves one instruction in NEXT but requires an extra PHX in every high-level preamble (and pre-increments IP, a little odd I suppose). Your way is a little more compact and clearer.

barrym95838, thanks for the missing detail (I do have that thread in my bookmarks)

MichaelM · Post by **MichaelM** » Wed Nov 27, 2019 9:58 pm

enso:

Mike Barry and Brad Rodriguez helped me sort through this issue. I have a compact side-by-side comparison of the inner interpreter for ITC and DTC here. I derived that table from Brad's Moving Forth articles which you can find on his Camel Forth website.

Dr Jefyll · Post by **Dr Jefyll** » Thu Nov 28, 2019 1:07 am

enso wrote:

[With DP as the data stack pointer] the ability to access data on the stack, and, as you showed, indirect through stack data, is really powerful.

In particular, you could have a 24-bit address on stack, and use it -- in situ -- to instantly fetch and store anywhere in the 16 MB address space.

For me, that's compelling. I hate the idea that anything outside 64K would be clunky and awkward (and slow) to access. What a de-motivation to do anything ambitious!

-- Jeff

enso · Post by **enso** » Thu Nov 28, 2019 7:25 pm

A 24-bit cell size would be quite nice. I wish there was a little more 24-bit support in the processor... a multiply-by-3 instruction, 24-bit registers...

Dr Jefyll · Post by **Dr Jefyll** » Fri Nov 29, 2019 1:25 am

enso wrote:

A 24-bit cell size would be quite nice.

IMO the sensible thing is let each cell be 32 bits. Then it can hold a 16- or 24-bit address, or a 32-bit double-precision integer.

( I know with FIG Forth you'd use two cells to hold a 24- or 32-bit value, but I've been there and done that, and found I HATED it! Suppose you have two values on stack, a 16-bit and a 32-bit, and say you wanna swap the two. What would you rather write (and read) in the source code -- SWAP or ROT ? And that's the most trivial of examples! Trust me, it only gets worse. That's why I insist the proper policy is one cell per value. )

Quote:

I was a little put off by tdc/inc a/inc a/tcd sequence, but at 4 bytes and 8 cycles, it is not out of the question

Two things about this:

- for 32-bit cells, the code to increment the DataStack Ptr becomes tdc / clc / adc# 4 / tcd -- which is only one cycle slower than what's required for 16-bit cells.

- it wouldn't take a lot of extra smarts for Forth's compiler to somewhat reduce the number of up-down adjustments of the DataStack Ptr. There are several different ways this could work, but here's one example...

Consider the Forth word + which, like many words, shrinks the stack. Suppose the most commonly used stack-shrink words have alternative definitions which leave the stack un-shrunken ( ie; they omit the tdc / clc / adc# 4 / tcd ). Now consider the word which appears next in the user's source code. With luck, it'll be one of a different set of common words which have alternative definitions. The alternative definitions for words in this second set expect the stack to be un-shrunken upon entry, and are written accordingly.

A peephole optimizer could look for cases where a word belonging to the first set is followed by a word belonging to the second set. Finding such a case, it makes the substitutions. Then at run-time, rather than getting adjusted twice, the stack gets adjusted once, or possibly not at all.

(Probably this could've been explained more elegantly, but I hope the principle is clear. Look for ways to avoid being so fine-grained with the stack adjustments!)

-- Jeff

enso · Post by **enso** » Fri Nov 29, 2019 7:03 am

Code: Select all

tdc        (1/2)
clc        (1/2)
adc #$0004 (3/3)
tcd        (1/2)
================
            6/9

Yes, I suppose it's worth having a uniform data size.

You could generalize your optimization as follows: at compile time, we maintain a 'notional DSP' and keep an offset from the actual DSP. Words like + are written to operate on the notional stack position at run-time (without adjusting DSP) and update the offset at compile-time. At certain points - loop entry/exit, prior to exiting, or calling a word that is not 'smart' - we make things right by adjusting the real stack by the offset.

In practice, this can be done by compiling different versions of + as you pointed out. Computations are generally not too deep, and you can probably get by with keeping a couple or 3 versions. If you plan for this kind of optimization and make a provision for optimized versions for any word, you can start with regular words, and add optimized versions as needed (with the help of a profiler).

Interesting - I didn't thing you could do much to optimize interpreted Forth. But you can.

BitWise · Post by **BitWise** » Fri Nov 29, 2019 2:54 pm

I've been working on a subroutine threaded forth compiler for the '816 on and off for a while. It applies lots of peep hole optimisation to the code so that something like this

Code: Select all

hex
ff00 constant ACIA:DATA inline
ff01 constant ACIA:STAT inline
ff02 constant ACIA:CMND inline 
ff03 constant ACIA:CTRL inline

: emit ( ch -- )
  begin ACIA:STAT c@ 0010 and until
  ACIA:DATA c!
;

is turned into this

Code: Select all

emit:                                   ; : emit
.L1:
                dex
                dex
                sep     #32
                lda     -255            ; -255 c@
                sta     <1,x
                stz     <2,x
                rep     #32
                lda     <1,x
                and     #16             ; 16 and
                sta     <1,x
                ldy     <1,x
                inx
                inx
                tya
                jeq     .L1             ; (?branch)
                lda     <1,x            ; -256 c!
                sep     #32
                sta     -256
                rep     #32
                inx
                inx
                rts                     ; ;

As you don't need an instruction pointer in STC the code uses X as the data stack pointer. I added JEQ/JNE etc opcodes to my assembler which generate the shortest branch sequence (e.g. BEQ or BNE + JMP) depending on the relative distance to the target address.

The optimiser gets rid of many stack push/pulls.

One day I'll finish it.

Dr Jefyll · Post by **Dr Jefyll** » Sat Nov 30, 2019 6:26 am

enso wrote:

Yes, I suppose it's worth having a uniform data size.

Right -- and FWIW let me remind everyone that 32-bit cells can be achieved in different ways.

I've explained why the D register has a compelling advantage as the DSP for an '816, but on 6502 or 'C02 you may want the 32-bit cells to live as two halves that reside in a high-word region and a low-word region both addressed by X (and the increment value for X would be 2). Probably you'll want the low-word region in zero-page so you can exploit (x,ind) address mode, but if desired the high-word region could be located outside zero-page.

Quote:

You could generalize your optimization as follows: at compile time, we maintain a 'notional DSP' and keep an offset from the actual DSP.

Yes. Nicely put.

Quote:

Interesting - I didn't thing you could do much to optimize interpreted Forth. But you can.

This is maybe pretty obvious, but another prospective optimization is to look for occurrences of common 2-word sequences such as OVER + and replace the sequence with a one-word substitute named OVER&+ (or whatever you like to call it). The main goal is to reduce the number of times NEXT executes, but sometimes (as in this case) you'll eliminate the stack-pointer adjustments, too.

This sort of optimization is easy enough to do manually. In my FIG dictionary I have about half a dozen contractions, and I think Garth has a collection that's quite a bit larger.

-- Jeff

dwight · Post by **dwight** » Sun Dec 01, 2019 6:19 pm

One thought is to use hardware assistance. Keep the return stack under processor control but use an external data stack, with A as the TOS. You'd use two addresses to decode operations, one for pop and one for push. A few more addresses for double operations would be handy. It wouldn't take that much size for typical stack usage but moving both data and return stack off chip makes large recursion more practical. Just my rambling.
Dwight

GARTHWILSON · Post by **GARTHWILSON** » Fri Jan 10, 2020 9:42 am

Dr Jefyll wrote:

enso wrote:

Interesting - I didn't thing you could do much to optimize interpreted Forth. But you can.

This is maybe pretty obvious, but another prospective optimization is to look for occurrences of common 2-word sequences such as OVER + and replace the sequence with a one-word substitute named OVER&+ (or whatever you like to call it). The main goal is to reduce the number of times NEXT executes, but sometimes (as in this case) you'll eliminate the stack-pointer adjustments, too.

This sort of optimization is easy enough to do manually. In my FIG dictionary I have about half a dozen contractions, and I think Garth has a collection that's quite a bit larger.

I was just going through my '816 Forth material as I've had some time to work on it again recently, preparing it for publication. In the main file, I get the following:

Code: Select all

2>R             3DROP           NIP;            3+
2R>             3DUP            0F_AND          3*
2R@             3_PICK          1F_AND          BASE@
RDROP           2_PICK          FF_AND          SPAN@
RSWAP           DUP>R           +@              STATE@
I@              SWAP!           @+              DROP_FALSE
I_C@            !;              SWAP-
I+              C!;             ?EXIT
I@              @>R             ?LEAVE

and then of course there are ones in common Forth usage which some might easily overlook, like that @ . is just ? and @ EXECUTE is just PERFORM.

The ones ending in ; are for when it's the last word in a secondary, eliminating the unnest. The ones starting with ? do the equivalent action of putting the word between IF and THEN, like IF EXIT THEN. Each of these integrates multiple words into a single primitive, making it far more efficient. For example, RDROP is nothing but PLA, and DROP_FALSE is nothing but STZ 0,X. I might have missed some. Also, I have not looked in the INCLude files yet which are for hardware, extended math, time & date functions, the assembler, added string functions, and prioritized interrupt service (I think I have some long ones there, which integrate a lot of words that would normally be secondaries). At some point in the past, I looked for the most common two-word combinations and charted how many times I used them, to determine the value of writing single primitives to replace them.

You can also speed up execution by avoiding a series of branches like you get where execution follows a series of ELSE occurrences, and make the first branch go directly to where the next non-branching execution occurs. This speeds things up a little because you don't have branch foo1, with foo1 saying branch foo2, and foo2 saying branch foo3, etc..

From a web page I'm writing about my '816 ITC Forth:

A Forth could be written with only 30 primitives, or possibly even just 12, but it would be a very slow Forth. Primitives are of course much, much faster than secondaries; but unlike the situation with the 6502, a primitive on the '816 is often even more memory-thrifty and easier to write! Many things are written here as primitives that many other Forths do as time-wasting secondaries, for example -ROT which is often a secondary : -ROT ROT ROT ; . The primitive version is four times as fast. It does not have to run nest and unnest ; also, NEXT only has to run once, not four times.

When you have hundreds of primitives, you can bypass some of the DUPing and DROPing, because if it's inside a primitive, you don't need to DUP something just because the next thing will destroy it. Just use it in place without DUPing or destroying it. Similarly, you can have words you might want headerless like DUP>R , combinations that are commonly used. On the '816, DUP>R is only LDA 0,X, PHA , so you're not actually DUPlicating it or affecting the data stack at all.

Since the main inner loop (ie, NEXT ) takes almost half of the processor's time, we've seen a ton of forum topics on making NEXT as fast as possible, overlooking the fact that you can do even better by reducing the number of times NEXT has to run to get a job done, by turning a lot of the secondaries into primitives.

W65C816 struggling with DTC

W65C816 struggling with DTC

Re: W65C816 struggling with DTC

Re: W65C816 struggling with DTC

Re: W65C816 struggling with DTC

Re: W65C816 struggling with DTC

Re: W65C816 struggling with DTC

Re: W65C816 struggling with DTC

Re: W65C816 struggling with DTC

Re: W65C816 struggling with DTC

Re: W65C816 struggling with DTC

Re: W65C816 struggling with DTC

Re: W65C816 struggling with DTC

Re: W65C816 struggling with DTC

Re: W65C816 struggling with DTC

Re: W65C816 struggling with DTC