6502.org • View topic - W65C816 struggling with DTC

View unanswered posts | View active topics

Board index » 6502.org Users Forum » Forth

All times are UTC

W65C816 struggling with DTC

Page 1 of 2

[ 17 posts ]

Go to page 1, 2 Next

Previous topic | Next topic

Author

Message

enso

Post subject: W65C816 struggling with DTC

Posted: Wed Nov 27, 2019 5:36 am

Joined: Sat Sep 29, 2012 10:15 pm
Posts: 899

Help! I've implemented and used many Forth-like STC systems over decades now, so am very rusty on classic Forth. I kind of miss the idea of a thin inner interpreter and all the niceties that go with it (such as simplicity and ability to decompile threaded code easily, etc.) I am trying to wrap my noodle around how a DTC inner may be implemented.

Assuming Y is DSP and X is IP; I would like to keep TOS in A. That doesn't leave too many registers, so NEXT can be INX INX JMP (0,X) 5/10. Primitives can inline or jmp to NEXT at the end. So far so good.

EXIT is basically PLX NEXT 6/15.

High level words work fine if I inline the entry code as a thought experiment:

Code:

  PHX       ;1/4 save IP
  LDX $HERE ;3/3 set new IP
  JMP (0,X) ;3/6
HERE:
  dw DUP
  ..
  dw EXIT

That seems likely to work, but that's 7 bytes and 13 cycles. I am not too crazy about hardcoding a literal address, but getting the processor PC any other way is even worse, and we have no registers to keep W (and probably wouldn't want to as it's bigger and longer). Not that I would do this for real, of course, because DOCOL.

Now, DOCOL will need to swap the (return address-1) at the top of the stack with X (IP). Since I have no spare registers, this turns ugly enough for me to give up on the whole idea.

The best I've come up with is to start all high-level words with PHX JSR DOCOL (4/10), and have docol pull the address-1 into X (and increment it). That leaves IP on the return stack, and avoids ugly swapping.

Notes:
* I would have preferred to use X for DSP, but there is no JMP (0,Y).
* I thought about DP as DSP, but incrementing and decrementing it seems too painful.
* If the worst comes to worst, I could give up on A, especially if the win in DOCOL is big enough to compensate for slightly bigger/slower datastack manipulation.
* Overhead - NEXT:5/10 high-level entry 4/10, EXIT 6/15 (including a NEXT). Not too bad, minimum of 20 cycles * vs. 12 for JSR/RTS.

Any suggestions? Am I missing something obvious? Thanks in advance.

_________________
In theory, there is no difference between theory and practice. In practice, there is. ...Jan van de Snepscheut

Top

GARTHWILSON

Post subject: Re: W65C816 struggling with DTC

Posted: Wed Nov 27, 2019 7:28 am

Joined: Fri Aug 30, 2002 1:09 am
Posts: 8433
Location: Southern California

enso wrote:

I would like to keep TOS in A.

See this post. Elsewhere, Michael Barry (forum member barrym95838), posted about trying out different ways, and included performance charts. There was a small—almost negligible— performance benefit to keeping TOS in A. Maybe he or someone else has it bookmarked or can remember adequate search terms to find it.

_________________
http://WilsonMinesCo.com/ lots of 6502 resources
The "second front page" is http://wilsonminesco.com/links.html .
What's an additional VIA among friends, anyhow?

Top

enso

Post subject: Re: W65C816 struggling with DTC

Posted: Wed Nov 27, 2019 8:09 am

Joined: Sat Sep 29, 2012 10:15 pm
Posts: 899

Yes, I saw that.
TOS in a register makes a big difference in an optimized STC compiler that does a lot of inlining. In fact, there is a Forth implementation I really appreciate, FreeForth, that keeps both TOS and NOS in registers, and uses 'register renaming in software' as it compiles code, optimising away things like SWAP altogether! I have to keep reminding myself that it does not matter much here.

_________________
In theory, there is no difference between theory and practice. In practice, there is. ...Jan van de Snepscheut

Top

barrym95838

Post subject: Re: W65C816 struggling with DTC

Posted: Wed Nov 27, 2019 9:08 am

Joined: Sun Jun 30, 2013 10:26 pm
Posts: 1929
Location: Sacramento, CA, USA

We talk about different scenarios here:

viewtopic.php?f=9&t=3649

... but I don't delve into DTC NEXT (which is just a single instruction on my 65m32).

Quote:

NEXT can be INX INX JMP (0,X) 5/10.

If you use Y for IP then you could try the equivalent INY PHY INY RTS 4/14(?). [Edit: this seems wrong ... I missed a level of indirection.]

X is a very versatile register, and deciding the best way to use it involves careful thought and test coding. Native '816 coding isn't really my thing, so I'll defer further speculation to the 16-bit experts in the gallery.

_________________
Got a kilobyte lying fallow in your 65xx's memory map? Sprinkle some VTL02C on it and see how it grows on you!

Mike B. (about me) (learning how to github)

Last edited by barrym95838 on Sun Dec 26, 2021 8:00 am, edited 2 times in total.

Top

BitWise

Post subject: Re: W65C816 struggling with DTC

Posted: Wed Nov 27, 2019 3:45 pm

Joined: Tue Mar 02, 2004 8:55 am
Posts: 996
Location: Berkshire, UK

In my '816 DTC Forth I use this

Code:

    80                        ; The CONTINUE macro is used at the end of a native word to invoke the next
    81                        ; word pointer.
    82                        
    83                        CONTINUE        macro
    84                                        tyx                             ; Copy IP to X
    85                                        iny
    86                                        iny
    87                                        jmp     (0,x)                   ; Then execute word
    88                                        endm

I also use DP as the data stack pointer so all access to the stack is via zero page instructions, for example:

Code:

   375                        ; ! ( x a-addr -- )
   376                        ;
   377                        ; Store x at a-addr.
   378                        
   379 00:0546: 35 05 00 01                  HEADER  1,"!",NORMAL
       00:054A: 21 
   380                        STORE:
   381 00:054B: A5 03                        lda     <3                      ; Fetch data value
   382 00:054D: 92 01                        sta     (1)                     ; .. and store
   383 00:054F: 7B                           tdc                             ; Clean up data stack
   384 00:0550: 1A                           inc     a
   385 00:0551: 1A                           inc     a
   386 00:0552: 1A                           inc     a
   387 00:0553: 1A                           inc     a
   388 00:0554: 5B                           tcd
   389 00:0555: BB C8 C8 7C                  CONTINUE                        ; Done
       00:0559: 00 00 

_________________
Andrew Jacobs
6502 & PIC Stuff - http://www.obelisk.me.uk/
Cross-Platform 6502/65C02/65816 Macro Assembler - http://www.obelisk.me.uk/dev65/
Open Source Projects - https://github.com/andrew-jacobs

Top

enso

Post subject: Re: W65C816 struggling with DTC

Posted: Wed Nov 27, 2019 7:13 pm

Joined: Sat Sep 29, 2012 10:15 pm
Posts: 899

BitWise, that actually looks good. I was a little put off by tdc/inc a/inc a/tcd sequence, but at 4 bytes and 8 cycles, it is not out of the question; the ability to access data on the stack, and, as you showed, indirect through stack data, is really powerful. I also realized that code words that really mess with the stack can usually adjust it once (or twice at most). Forth actually under-utilizes data higher up on the stack; I suppose the intent of DP is to be used as a frame pointer for HLLs. I will make a note to my future self who is busy implementing a Lisp.

I also found your Forth implementation (somehow missed it before, will take a look as it's probably what I need). I suppose using Y as IP gives you access to pre-incremented IP in X to jump through and post-incremented IP to push in DOCOL. The way I was doing saves one instruction in NEXT but requires an extra PHX in every high-level preamble (and pre-increments IP, a little odd I suppose). Your way is a little more compact and clearer.

barrym95838, thanks for the missing detail (I do have that thread in my bookmarks)

_________________
In theory, there is no difference between theory and practice. In practice, there is. ...Jan van de Snepscheut

Top

MichaelM

Post subject: Re: W65C816 struggling with DTC

Posted: Wed Nov 27, 2019 9:58 pm

Joined: Mon Apr 23, 2012 12:28 am
Posts: 760
Location: Huntsville, AL

enso:

Mike Barry and Brad Rodriguez helped me sort through this issue. I have a compact side-by-side comparison of the inner interpreter for ITC and DTC here. I derived that table from Brad's Moving Forth articles which you can find on his Camel Forth website.

_________________
Michael A.

Top

Dr Jefyll

Post subject: Re: W65C816 struggling with DTC

Posted: Thu Nov 28, 2019 1:07 am

Joined: Fri Dec 11, 2009 3:50 pm
Posts: 3354
Location: Ontario, Canada

enso wrote:

[With DP as the data stack pointer] the ability to access data on the stack, and, as you showed, indirect through stack data, is really powerful.

In particular, you could have a 24-bit address on stack, and use it -- in situ -- to instantly fetch and store anywhere in the 16 MB address space. :shock:

For me, that's compelling. I hate the idea that anything outside 64K would be clunky and awkward (and slow) to access. What a de-motivation to do anything ambitious!

-- Jeff

_________________
In 1988 my 65C02 got six new registers and 44 new full-speed instructions!
https://laughtonelectronics.com/Arcana/ ... mmary.html

Top

enso

Post subject: Re: W65C816 struggling with DTC

Posted: Thu Nov 28, 2019 7:25 pm

Joined: Sat Sep 29, 2012 10:15 pm
Posts: 899

A 24-bit cell size would be quite nice. I wish there was a little more 24-bit support in the processor... a multiply-by-3 instruction, 24-bit registers...

_________________
In theory, there is no difference between theory and practice. In practice, there is. ...Jan van de Snepscheut

Top

Dr Jefyll

Post subject: Re: W65C816 struggling with DTC

Posted: Fri Nov 29, 2019 1:25 am

Joined: Fri Dec 11, 2009 3:50 pm
Posts: 3354
Location: Ontario, Canada

enso wrote:

A 24-bit cell size would be quite nice.

IMO the sensible thing is let each cell be 32 bits. Then it can hold a 16- or 24-bit address, or a 32-bit double-precision integer.

( I know with FIG Forth you'd use two cells to hold a 24- or 32-bit value, but I've been there and done that, and found I HATED it! Suppose you have two values on stack, a 16-bit and a 32-bit, and say you wanna swap the two. What would you rather write (and read) in the source code -- SWAP or ROT ? And that's the most trivial of examples! Trust me, it only gets worse. That's why I insist the proper policy is one cell per value. )

Quote:

I was a little put off by tdc/inc a/inc a/tcd sequence, but at 4 bytes and 8 cycles, it is not out of the question

Two things about this:

- for 32-bit cells, the code to increment the DataStack Ptr becomes tdc / clc / adc# 4 / tcd -- which is only one cycle slower than what's required for 16-bit cells.

- it wouldn't take a lot of extra smarts for Forth's compiler to somewhat reduce the number of up-down adjustments of the DataStack Ptr. There are several different ways this could work, but here's one example...

Consider the Forth word + which, like many words, shrinks the stack. Suppose the most commonly used stack-shrink words have alternative definitions which leave the stack un-shrunken ( ie; they omit the tdc / clc / adc# 4 / tcd ). Now consider the word which appears next in the user's source code. With luck, it'll be one of a different set of common words which have alternative definitions. The alternative definitions for words in this second set expect the stack to be un-shrunken upon entry, and are written accordingly.

A peephole optimizer could look for cases where a word belonging to the first set is followed by a word belonging to the second set. Finding such a case, it makes the substitutions. Then at run-time, rather than getting adjusted twice, the stack gets adjusted once, or possibly not at all.

(Probably this could've been explained more elegantly, but I hope the principle is clear. Look for ways to avoid being so fine-grained with the stack adjustments!)

-- Jeff

_________________
In 1988 my 65C02 got six new registers and 44 new full-speed instructions!
https://laughtonelectronics.com/Arcana/ ... mmary.html

Top

enso

Post subject: Re: W65C816 struggling with DTC

Posted: Fri Nov 29, 2019 7:03 am

Joined: Sat Sep 29, 2012 10:15 pm
Posts: 899

Code:

tdc        (1/2)
clc        (1/2)
adc #$0004 (3/3)
tcd        (1/2)
================
            6/9

Yes, I suppose it's worth having a uniform data size.

You could generalize your optimization as follows: at compile time, we maintain a 'notional DSP' and keep an offset from the actual DSP. Words like + are written to operate on the notional stack position at run-time (without adjusting DSP) and update the offset at compile-time. At certain points - loop entry/exit, prior to exiting, or calling a word that is not 'smart' - we make things right by adjusting the real stack by the offset.

In practice, this can be done by compiling different versions of + as you pointed out. Computations are generally not too deep, and you can probably get by with keeping a couple or 3 versions. If you plan for this kind of optimization and make a provision for optimized versions for any word, you can start with regular words, and add optimized versions as needed (with the help of a profiler).

Interesting - I didn't thing you could do much to optimize interpreted Forth. But you can.

_________________
In theory, there is no difference between theory and practice. In practice, there is. ...Jan van de Snepscheut

Top

BitWise

Post subject: Re: W65C816 struggling with DTC

Posted: Fri Nov 29, 2019 2:54 pm

Joined: Tue Mar 02, 2004 8:55 am
Posts: 996
Location: Berkshire, UK

I've been working on a subroutine threaded forth compiler for the '816 on and off for a while. It applies lots of peep hole optimisation to the code so that something like this

Code:

hex
ff00 constant ACIA:DATA inline
ff01 constant ACIA:STAT inline
ff02 constant ACIA:CMND inline 
ff03 constant ACIA:CTRL inline

: emit ( ch -- )
  begin ACIA:STAT c@ 0010 and until
  ACIA:DATA c!
;

is turned into this

Code:

emit:                                   ; : emit
.L1:
                dex
                dex
                sep     #32
                lda     -255            ; -255 c@
                sta     <1,x
                stz     <2,x
                rep     #32
                lda     <1,x
                and     #16             ; 16 and
                sta     <1,x
                ldy     <1,x
                inx
                inx
                tya
                jeq     .L1             ; (?branch)
                lda     <1,x            ; -256 c!
                sep     #32
                sta     -256
                rep     #32
                inx
                inx
                rts                     ; ;

As you don't need an instruction pointer in STC the code uses X as the data stack pointer. I added JEQ/JNE etc opcodes to my assembler which generate the shortest branch sequence (e.g. BEQ or BNE + JMP) depending on the relative distance to the target address.

The optimiser gets rid of many stack push/pulls.

One day I'll finish it.

Top

Dr Jefyll

Post subject: Re: W65C816 struggling with DTC

Posted: Sat Nov 30, 2019 6:26 am

Joined: Fri Dec 11, 2009 3:50 pm
Posts: 3354
Location: Ontario, Canada

enso wrote:

Yes, I suppose it's worth having a uniform data size.

Right -- and FWIW let me remind everyone that 32-bit cells can be achieved in different ways.

I've explained why the D register has a compelling advantage as the DSP for an '816, but on 6502 or 'C02 you may want the 32-bit cells to live as two halves that reside in a high-word region and a low-word region both addressed by X (and the increment value for X would be 2). Probably you'll want the low-word region in zero-page so you can exploit (x,ind) address mode, but if desired the high-word region could be located outside zero-page.

Quote:

You could generalize your optimization as follows: at compile time, we maintain a 'notional DSP' and keep an offset from the actual DSP.

Yes. Nicely put.

Quote:

Interesting - I didn't thing you could do much to optimize interpreted Forth. But you can.

This is maybe pretty obvious, but another prospective optimization is to look for occurrences of common 2-word sequences such as OVER + and replace the sequence with a one-word substitute named OVER&+ (or whatever you like to call it). The main goal is to reduce the number of times NEXT executes, but sometimes (as in this case) you'll eliminate the stack-pointer adjustments, too.

This sort of optimization is easy enough to do manually. In my FIG dictionary I have about half a dozen contractions, and I think Garth has a collection that's quite a bit larger.

-- Jeff

_________________
In 1988 my 65C02 got six new registers and 44 new full-speed instructions!
https://laughtonelectronics.com/Arcana/ ... mmary.html

Top

dwight

Post subject: Re: W65C816 struggling with DTC

Posted: Sun Dec 01, 2019 6:19 pm

Joined: Tue Jun 08, 2004 11:51 pm
Posts: 213

One thought is to use hardware assistance. Keep the return stack under processor control but use an external data stack, with A as the TOS. You'd use two addresses to decode operations, one for pop and one for push. A few more addresses for double operations would be handy. It wouldn't take that much size for typical stack usage but moving both data and return stack off chip makes large recursion more practical. Just my rambling.
Dwight

Top

GARTHWILSON

Post subject: Re: W65C816 struggling with DTC

Posted: Fri Jan 10, 2020 9:42 am

Joined: Fri Aug 30, 2002 1:09 am
Posts: 8433
Location: Southern California

Dr Jefyll wrote:

enso wrote:

Interesting - I didn't thing you could do much to optimize interpreted Forth. But you can.

I was just going through my '816 Forth material as I've had some time to work on it again recently, preparing it for publication. In the main file, I get the following:

Code:

2>R             3DROP           NIP;            3+
2R>             3DUP            0F_AND          3*
2R@             3_PICK          1F_AND          BASE@
RDROP           2_PICK          FF_AND          SPAN@
RSWAP           DUP>R           +@              STATE@
I@              SWAP!           @+              DROP_FALSE
I_C@            !;              SWAP-
I+              C!;             ?EXIT
I@              @>R             ?LEAVE

and then of course there are ones in common Forth usage which some might easily overlook, like that @ . is just ? and @ EXECUTE is just PERFORM.

The ones ending in ; are for when it's the last word in a secondary, eliminating the unnest. The ones starting with ? do the equivalent action of putting the word between IF and THEN, like IF EXIT THEN. Each of these integrates multiple words into a single primitive, making it far more efficient. For example, RDROP is nothing but PLA, and DROP_FALSE is nothing but STZ 0,X. I might have missed some. Also, I have not looked in the INCLude files yet which are for hardware, extended math, time & date functions, the assembler, added string functions, and prioritized interrupt service (I think I have some long ones there, which integrate a lot of words that would normally be secondaries). At some point in the past, I looked for the most common two-word combinations and charted how many times I used them, to determine the value of writing single primitives to replace them.

You can also speed up execution by avoiding a series of branches like you get where execution follows a series of ELSE occurrences, and make the first branch go directly to where the next non-branching execution occurs. This speeds things up a little because you don't have branch foo1, with foo1 saying branch foo2, and foo2 saying branch foo3, etc..

From a web page I'm writing about my '816 ITC Forth:

easier

-ROT

: -ROT ROT ROT ;

nest

unnest

NEXT

DUP

DROP

DUP

DUP>R

LDA 0,X

PHA

DUP

NEXT

_________________
http://WilsonMinesCo.com/ lots of 6502 resources
The "second front page" is http://wilsonminesco.com/links.html .
What's an additional VIA among friends, anyhow?

Top

Page 1 of 2

[ 17 posts ]

Go to page 1, 2 Next

Board index » 6502.org Users Forum » Forth

All times are UTC

Who is online

Users browsing this forum: No registered users and 4 guests

You cannot post new topics in this forum
You cannot reply to topics in this forum
You cannot edit your posts in this forum
You cannot delete your posts in this forum
You cannot post attachments in this forum