 Post subject: vino816 Forth design
PostPosted: Tue Apr 24, 2018 4:11 am 

Joined: Fri Jun 03, 2016 3:42 am
Posts: 158
Dr Jefyll wrote:
You mentioned Forth using STC (subroutine threaded code), which indeed is quite a lot faster than ITC. I don't understand your comment about the '816 needing better support for X as a data-stack pointer (for the most part I find it adequate). Be that as it may, there's at least one other fast alternative -- namely, Forth that uses DTC (direct-threaded-code).

The 65c816 lacks a post-increment of X (for pulling a datum from the data-stack) and a pre-decrement of X for pushing a datum to the data-stack. Because of this, your code ends up with a lot of INX INX and DEX DEX code sequences --- this makes the code bloaty and slow.

I've given some thought to the subject of the design for a Forth for the 65c816. I don't have any intention of actually implementing this though --- there aren't enough people interested in the 65c816 (the only people I know of who use the 65c816 are on this forum, there are only a handful of them, and most of them are convinced that I am not a "language expert" in regard to Forth) --- in the unlikely case that a lot of interest develops, I might do it.

I briefly examined Garth's Forth and IMHO there are three problems:
  • It is ITC rather than STC.
  • It uses X rather than S for the data-stack (note that UR/Forth used SP for the data-stack and BP for the return-stack).
  • It holds the entire data-stack in memory, rather than holding the TOS in A (note that UR/Forth held the TOS in BX).

I came up with a design that fixes these problems. I would expect the code generated to be about 8 times faster than Garth's Forth for these reasons:
  • I'm holding the TOS in A --- always holding the TOS in a register is well-known to boost the speed significantly --- note that always holding both the SOS and TOS in a register doesn't work very well (and is a non-issue on the 65c816 that suffers from a shortage of registers).
  • I'm using S as the data-stack --- this allows me to use PHA and PLA for DUP and DROP respectively --- these are one-byte instructions, so they can be inlined to increase speed and are 1/3 the size of a JSR so they also reduce bloat.
  • STC uses JSR/RTS, which is faster than NEXT in ITC (or DTC).
  • STC uses BEQ (along with some code to pop the TOS and test it) rather than a 0BRANCH primitive and BRA rather than a BRANCH primitive, which is much faster (and speeds up loops which tend to be the big time-sinks in most programs).
  • STC allows small functions such as OVER + @ ! etc. to be inlined, which is much faster than doing a function call, and in some cases is smaller than a JSR.
  • STC allows jump-termination (replace JSR xxx RTS with JMP xxx), which boosts the speed and helps to avoid return-stack overflow in recursive functions.
  • STC allows peephole-optimization to be done. This mostly involves using literal values as operands. For example, a @ or ! to an absolute address does not require the address to be loaded into a register and then the register used indirectly, but can be optimized to use either absolute or zero-page addressing-modes.
  • STC allows ISRs to be written in Forth and to have very little overhead --- reducing interrupt latency is often the best way to boost the speed of micro-controllers.
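To make the jump-termination point concrete, here is a rough sketch in Python (not 65c816 code, just a model of the idea). The tuple representation of compiled code and the names `jump_terminate` and `size_in_bytes` are my own assumptions for illustration; they are not part of vino816. It also assumes that nothing branches directly to the final RTS.

```python
# Hypothetical model: a compiled word is a list of (mnemonic, operand) tuples.

def jump_terminate(code):
    """Rewrite a final 'JSR xxx' followed by 'RTS' as 'JMP xxx'.

    Simplification: assumes nothing branches directly to the RTS.
    """
    if len(code) >= 2 and code[-1] == ("RTS", None) and code[-2][0] == "JSR":
        return code[:-2] + [("JMP", code[-2][1])]
    return list(code)


def size_in_bytes(code):
    """Sizes for absolute JSR/JMP (3 bytes each) and RTS (1 byte)."""
    sizes = {"JSR": 3, "JMP": 3, "RTS": 1}
    return sum(sizes[mnemonic] for mnemonic, _ in code)


word = [("JSR", "DUP"), ("JSR", "PLUS"), ("RTS", None)]
optimized = jump_terminate(word)
print(optimized)
print(size_in_bytes(word), "->", size_in_bytes(optimized))
```

The optimized word saves the RTS byte and, more importantly, one level of return-address nesting, which is what helps with recursion depth.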

For the most part, STC is all about inlining code and doing peephole-optimization --- code is going to be quite bloaty --- of course, bloaty code is only a problem if you have users who want to write large programs, which makes it a non-problem for the foreseeable future. :wink:

I wrote a document describing my design --- it is short enough that I can just inline it in this post:
Code:
VINO816.TXT --- copyright: April 2018, Hugh Aguilar

This is called vino816 Forth because a person would have to be drunk to think that programming Forth on the 65c816 is a good idea --- it is slower than C or Pascal, but that is mostly because the 65c816 was designed to support C or Pascal rather than Forth --- it has the advantage over C or Pascal of providing an interactive development environment, which is useful for testing code.

This is STC (subroutine-threaded-code) meaning that the code mostly consists of JSR instructions --- some peephole-optimization can be done, generating inlined machine-code.

register usage:
A    TOS (top-of-stack) of data-stack
X    return-stack pointer
Y    general-purpose
S    data-stack pointer

Local variables are possible, but would need a zero-page pointer to be used as the local-stack pointer.

; -----------------------------------------------------------------------------------------------------

; These are a few simple macros:

macro FAST_ENTER        ; done at the start of primitives that don't use Y internally
    PLY
endm

macro FAST_LEAVE        ; done at the end of primitives that start with FAST_ENTER
    PHY
    RTS
endm

1 equ 1st               ; this is the SOS after FAST_ENTER has been done, or is the return-address
3 equ 2nd               ; this is the SOS if FAST_ENTER has not been done
5 equ 3rd
7 equ 4th

macro ENTER             ; done at the start of colon words or primitives that use Y internally
    FAST_ENTER
    DEX
    DEX
    STY 0,X
endm

macro LEAVE             ; done at the end of functions that start with ENTER
    LDY 0,X
    INX
    INX
    FAST_LEAVE
endm

; Primitives that don't push or pull to the data-stack (such as NEGATE ROT etc.) can just leave the return-address on the data-stack and do RTS at the end.
; Primitives that push or pull to the data-stack can use FAST_ENTER at the start and FAST_LEAVE at the end. This only works if they don't use Y internally.
; Primitives that push or pull to the data-stack and use Y internally have to do ENTER at the start and LEAVE at the end. They are going to be slow.

macro _DUP              ; a -- a a
    PHA
endm

macro _OVER             ; a b -- a b a
    _DUP
    LDA 2nd,S
endm

macro _NIP              ; a b -- b
    PLY
endm   

macro _DROP             ; a --
    PLA
endm

macro _SWAP             ; a b -- b a
    PLY
    PHA
    TYA
endm   

macro _SRAP             ; a b c -- c b a
    LDY 3,S
    STA 3,S
    TYA
endm

macro _ROT              ; a b c -- b c a
    _SWAP               ; -- a c b
    _SRAP               ; -- b c a
endm

macro _NROT             ; a b c -- c a b
    _SRAP               ; -- c b a
    _SWAP               ; -- c a b
endm

macro _NEGATE           ; a -- -a
    EOR #$FFFF
    CLC
    ADC #1
endm

macro _ADD              ; a b -- a+b
    CLC
    ADC 1st,S
    _NIP
endm

; These are a few simple primitives:

TRUE:                   ; -- true               ; note that TRUE is 1 rather than -1 as in ANS-Forth
    FAST_ENTER
    _DUP
    LDA #1
    FAST_LEAVE

DROP:                   ; a --
    FAST_ENTER
    _DROP
    FAST_LEAVE

DUP:                    ; a -- a a
    FAST_ENTER
    _DUP
    FAST_LEAVE

OVER:                   ; a b -- a b a
    FAST_ENTER
    _OVER
    FAST_LEAVE

SWAP:                   ; a b -- b a            ; don't use _SWAP because it assumes the return-address is removed
    LDY 2nd,S           ; 1st is still the return-address
    STA 2nd,S
    TYA
    RTS

ROT:                    ; a b c -- b c a        ; don't use _ROT because it assumes the return-address is removed
    LDY 2nd,S           ; 1st is still the return-address, so 2nd is the SOS
    STA 2nd,S
    TYA                 ; -- a c b
    LDY 3rd,S
    STA 3rd,S
    TYA                 ; -- b c a
    RTS

NROT:                   ; a b c -- c a b        ; don't use _NROT because it assumes the return-address is removed
    LDY 3rd,S           ; 1st is still the return-address, so 3rd is the third item
    STA 3rd,S
    TYA                 ; -- c b a
    LDY 2nd,S
    STA 2nd,S
    TYA                 ; -- c a b
    RTS

MINUS:                  ; a b -- a-b
    ENTER
    _NEGATE
    _ADD
    LEAVE

PLUS:                   ; a b -- a+b
    ENTER
    _ADD
    LEAVE

; MINUS and PLUS are both slow because they use ENTER and LEAVE rather than FAST_ENTER and FAST_LEAVE --- they should be inlined.

NIP:                    ; a b -- b
    FAST_ENTER
    _NIP
    FAST_LEAVE

FETCH:                  ; ptr -- n
    TAY
    LDA 0,Y
    RTS

STORE:                  ;  n ptr --
    ENTER
    TAY                 ; Y = ptr
    PLA                 ; A = n
    STA 0,Y
    _DROP               ; n has been consumed, so pull the new TOS into A
    LEAVE

; STORE is very slow because it uses ENTER and LEAVE --- most effort on peephole-optimization should involve inlining this code.

TO_R:                   ; a --                  ; return: -- a
    FAST_ENTER
    DEX
    DEX
    STA 0,X
    _DROP
    FAST_LEAVE

R_FETCH:                ; -- a                  ; return: a -- a   
    FAST_ENTER
    _DUP
    LDA 0,X
    FAST_LEAVE

R_FROM:                 ; -- a                  ; return: a --
    FAST_ENTER
    _DUP
    LDA 0,X
    INX
    INX
    FAST_LEAVE

RDROP:                  ; --                    ; return: a --
    INX
    INX
    RTS

edit: I fixed a bug in MINUS and PLUS.
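As a sanity check on the stack-effect comments, the macros can be modeled in a few lines of Python. This is only a toy model under my own assumptions: the class `VinoStack` and the helper `run` are hypothetical names, stack items are treated as opaque values, and the return-address is assumed to have been removed already (as it is after FAST_ENTER), so 1,S is the SOS.

```python
class VinoStack:
    """Toy model: TOS lives in A, the rest of the data-stack sits behind S."""

    def __init__(self, *items):
        *memory, tos = items
        self.mem = list(memory)   # the last element is the word at 1,S
        self.a = tos
        self.y = None

    def _index(self, offset):
        # 1,S is the top word in memory, 3,S the next one, and so on.
        return -((offset + 1) // 2)

    # Individual instructions.
    def pha(self): self.mem.append(self.a)
    def pla(self): self.a = self.mem.pop()
    def ply(self): self.y = self.mem.pop()
    def tya(self): self.a = self.y
    def lda_s(self, offset): self.a = self.mem[self._index(offset)]
    def ldy_s(self, offset): self.y = self.mem[self._index(offset)]
    def sta_s(self, offset): self.mem[self._index(offset)] = self.a

    # The macros, transcribed instruction by instruction.
    def swap(self): self.ply(); self.pha(); self.tya()
    def srap(self): self.ldy_s(3); self.sta_s(3); self.tya()
    def rot(self): self.swap(); self.srap()
    def nrot(self): self.srap(); self.swap()
    def over(self): self.pha(); self.lda_s(3)

    def items(self):
        return self.mem + [self.a]


def run(macro, *items):
    """Apply one macro to a fresh stack and return the resulting items."""
    stack = VinoStack(*items)
    getattr(stack, macro)()
    return stack.items()


for name in ("swap", "over"):
    print(name, run(name, "a", "b"))
for name in ("srap", "rot", "nrot"):
    print(name, run(name, "a", "b", "c"))
```

Running it reproduces the stack comments on _SWAP, _OVER, _SRAP, _ROT and _NROT.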


Last edited by Hugh Aguilar on Tue Apr 24, 2018 12:36 pm, edited 1 time in total.

 Post subject: Re: vino816 Forth design
PostPosted: Tue Apr 24, 2018 4:28 am 

Joined: Fri Jun 03, 2016 3:42 am
Posts: 158
Hugh Aguilar wrote:
I briefly examined Garth's Forth and IMHO there are three problems:
  • It is ITC rather than STC.
  • It uses X rather than S for the data-stack (note that UR/Forth used SP for the data-stack and BP for the return-stack).
  • It holds the entire data-stack in memory, rather than holding the TOS in A (note that UR/Forth held the TOS in BX).

Garth seems to be quite attached to ITC --- I don't know why --- it is just as easy to generate a JSR xxx as an xxx (with xxx being the cfa of the word to be executed).

Anyway, he could stick with ITC (he is going to) and still make a substantial boost in speed with these upgrades:
  • Use S for the data-stack and X for the return-stack. This would involve changing his DOCOLON and EXIT to use X instead of S. This would allow him to use PHA PHY PLA PLY for accessing the data-stack, which occurs more often than DOCOLON and EXIT.
  • Hold the TOS in either A or Y. If it is held in A, this boosts the speed of arithmetic such as + etc.. If it is held in Y, this boosts the speed of @ and ! etc.. Currently he uses A in NEXT, so holding TOS in Y would be easy. His NEXT could easily be rewritten to use Y though, which would allow TOS to be held in A.

In my vino816 design I chose to hold TOS in A --- this is a debatable choice --- but either A or Y is better than holding the entire data-stack in memory.


 Post subject: Re: vino816 Forth design
PostPosted: Tue Apr 24, 2018 5:08 am 

Joined: Mon Jan 07, 2013 2:42 pm
Posts: 576
Location: Just outside Berlin, Germany
I created an STC Forth for the 65816 called Liara Forth (see https://github.com/scotws/LiaraForth) with Y as the TOS. There is some technical documentation about the design at https://github.com/scotws/LiaraForth/bl ... TERNALS.md . Note there is quite a bit of inlining. The reason I chose STC was the lack of registers on the 65816 and the simpler design, though now I'm not sure if Y was the best choice for TOS or if it shouldn't be A as in your design.


 Post subject: Re: vino816 Forth design
PostPosted: Tue Apr 24, 2018 12:38 pm 

Joined: Fri Jun 03, 2016 3:42 am
Posts: 158
scotws wrote:
I created an STC Forth for the 65816 called Liara Forth (see https://github.com/scotws/LiaraForth) with Y as the TOS. There is some technical documentation about the design at https://github.com/scotws/LiaraForth/bl ... TERNALS.md . Note there is quite a bit of inlining. The reason I chose STC was the lack of registers on the 65816 and the simpler design, though now I'm not sure if Y was the best choice for TOS or if it shouldn't be A as in your design.

How many users do you have?

Inlining is easy if you have an assembler written in Forth, which is easy for the 65c02 --- other processors, such as the Z80, are much more difficult to write an assembler for.


 Post subject: Re: vino816 Forth design
PostPosted: Wed Apr 25, 2018 3:18 am 

Joined: Mon Jan 07, 2013 2:42 pm
Posts: 576
Location: Just outside Berlin, Germany
Hugh Aguilar wrote:
How many users do you have?
Oh, about zero (on a good day) :D . But that was never the point, I learned a lot and had fun, including writing my first emulator, a poor man's multi-pass assembler, and designing my own mnemonics system. Now that's entertainment! What I learned I fed back into the rewrite of Tali Forth for the 65c02. So I don't regret a minute, and am thinking of a rewrite for the 65816 (though realistically that would probably be next year or such). Which is why I'll have to revisit the question of where to put TOS at some point -- Y worked, but wasn't really ideal, and turns out to be harder to think about than expected. Your suggestion, A, would be my next choice.


 Post subject: Re: vino816 Forth design
PostPosted: Wed Apr 25, 2018 3:36 am 

Joined: Sun Jun 30, 2013 10:26 pm
Posts: 1928
Location: Sacramento, CA, USA
Andrew and Jeff were playing with some interesting ideas here:

viewtopic.php?f=9&t=3612

Andrew talks about his experiment here:

viewtopic.php?f=9&t=3686

His fetch is pretty cool:
Code:
fetch:
     lda  (1)
     sta  1
     $NEXT


but not quite as cool as my (vaporware) 65m32 fetch:
Code:
fetch:
     lda  0,a
     $NEXT


It's a real shame that the '816 doesn't have any op-code space for IND and DED (for the direct page register), because they would offer a huge benefit to a Forth with PSP in DP. It saddens me to admit it, but the '816 just doesn't give me the warm fuzzy feeling that the '02 and 'c02 do. It could have been so much better if it hadn't been handcuffed by legacy binary compatibility ... and those pesky 8-bit bytes!

Mike B.


 Post subject: Re: vino816 Forth design
PostPosted: Wed Apr 25, 2018 6:28 am 

Joined: Fri Jun 03, 2016 3:42 am
Posts: 158
barrym95838 wrote:
my (vaporware) 65m32 fetch:
Code:
fetch:
     lda  0,a
     $NEXT


Well, in vino FETCH would be:
Code:
    tay
    lda 0,y

That is almost as fast as yours --- the TAY is a 1-byte 2-cycle instruction --- unfortunately, there is no (Y) addressing-mode; a 0 offset is needed, and it is 2-byte rather than 1-byte.

There are a lot of trade-offs. For example, this is how 0BRANCH would be inlined in vino:
Code:
    tay     ; this sets ZF as needed
    pla     ; this messes up ZF
    iny
    dey     ; this sets ZF as needed again
    beq xxx

This is how 0BRANCH would be inlined if the entire data-stack were in memory:
Code:
    pla     ; this pops the top value and simultaneously sets ZF as needed
    beq xxx

OTOH, holding the TOS in a register is more efficient in most other cases.
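The flag juggling in the two 0BRANCH sequences can be traced with a small Python model. This is just a sketch under my own assumptions: the function names are hypothetical, registers are 16-bit, and only the Z flag is tracked. It shows why the INY DEY pair is needed when the TOS is in A: the PLA overwrites Z with the state of the new TOS, so Z has to be regenerated from the old TOS that was parked in Y.

```python
def zbranch_tos_in_a(a, stack):
    """tay / pla / iny / dey / beq, returning (branch_taken, new_a)."""
    y = a                      # tay: Y = old TOS (the flag being tested)
    z = (y == 0)
    a = stack.pop()            # pla: A = new TOS, Z now reflects the new TOS
    z = (a == 0)
    y = (y + 1) & 0xFFFF       # iny
    z = (y == 0)
    y = (y - 1) & 0xFFFF       # dey: Y is back to the old TOS, Z reflects it
    z = (y == 0)
    return z, a                # beq branches when z is True


def zbranch_all_in_memory(stack):
    """pla / beq, returning branch_taken."""
    a = stack.pop()            # pla: pops the flag and sets Z in one step
    return a == 0


print(zbranch_tos_in_a(0, [7]))      # flag is zero, new TOS is 7
print(zbranch_tos_in_a(5, [0]))      # flag is non-zero, new TOS is 0
print(zbranch_all_in_memory([7, 0])) # flag on top of the memory stack is zero
```

In the second call the new TOS is zero but the branch is still not taken, which is the case the INY DEY pair exists to get right.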


 Post subject: Re: vino816 Forth design
PostPosted: Wed Apr 25, 2018 12:56 pm 

Joined: Fri Dec 11, 2009 3:50 pm
Posts: 3354
Location: Ontario, Canada
Hugh Aguilar wrote:
There are a lot of trade-offs.
That's for sure. :) But in your lead post you make some convincing points. For example these (below) carry a lot of weight, assuming speed is the #1 priority.

(Which is not always the case. As Mike B notes, PSP in DP offers real benefit but incrementing/decrementing DP is slow. The upside, IMO, is being able to use a 24-bit address in situ on stack, just as you do with a 16-bit address, i.e., no need to copy it somewhere else.)

Quote:
  • STC uses BEQ (along with some code to pop the TOS and test it) rather than a 0BRANCH primitive and BRA rather than a BRANCH primitive, which is much faster (and speeds up loops which tend to be the big time-sinks in most programs).
  • STC allows small functions such as OVER + @ ! etc. to be inlined, which is much faster than doing a function call, and in some cases is smaller than a JSR.


BTW, this...
Hugh Aguilar wrote:
For example, this is how 0BRANCH would be inlined in vino:
Code:
    tay     ; this sets ZF as needed
    pla     ; this messes up ZF
    iny
    dey     ; this sets ZF as needed again
    beq xxx

... can be replaced with (one byte bigger and one cycle faster) this:
Code:
    tay     ; this sets ZF as needed
    pla     ; this messes up ZF
    cpy #0 ; this sets ZF as needed again
    beq xxx

Or it might be doable to rearrange things as follows:
Code:
     beq xxx   ;test ZF first and, either way, do the PLA *afterward*
     pla
     ...

xxx: pla
     ...


-- Jeff

_________________
In 1988 my 65C02 got six new registers and 44 new full-speed instructions!
https://laughtonelectronics.com/Arcana/ ... mmary.html


 Post subject: Re: vino816 Forth design
PostPosted: Thu Apr 26, 2018 1:04 am 

Joined: Fri Jun 03, 2016 3:42 am
Posts: 158
Dr Jefyll wrote:
Or it might be doable to rearrange things as follows:
Code:
     beq xxx   ;test ZF first and, either way, do the PLA *afterward*
     pla
     ...

xxx: pla
     ...


A forward branch, as you showed, is easy in a single-pass compiler.

A backward branch, such as with the UNTIL in Forth, is not so easy. You need a multi-pass compiler to do this efficiently. If you have a single-pass compiler, you would do this:
Code:
start:
    ....
    bit #$FFFF
    bne past
    pla
    bra start
past:
    pla
    ....

Note that so far I have been using traditional assembly syntax (on the assumption that most 6502-forum folks, especially Garth, find this most familiar). I have my own Forth assembler for the 65c02 (and, now, the M65c02A too). A Forth system would, of course, use a Forth assembler (I prefer my own). So, the above would be written:

Code:
    {:
        ...
        $FFFF bit#  bne{  pla  2swap bra}  :}  pla
    ...

Anyway, this is inefficient because in the usual case of repeating the loop you are doing two branches (the BNE doesn't branch and the BRA does branch) rather than one branch (a BEQ back to the start). This inefficiency isn't all that terrible --- the BNE that doesn't branch only takes 2 clock cycles (on the 65c816 in native mode, a conditional branch is 2 clocks when not taken and 3 when taken).

Garth has seen my M65c02A assembler --- and the peephole-optimizer too --- he was not impressed.


 Post subject: Re: vino816 Forth design
PostPosted: Sat Apr 28, 2018 4:44 am 

Joined: Fri Jun 03, 2016 3:42 am
Posts: 158
Hugh Aguilar wrote:
Code:
    {:
        ...
        $FFFF bit#  bne{  pla  2swap bra}  :}  pla
    ...

Anyway, this is inefficient because in the usual case of repeating the loop you are doing two branches (the BNE doesn't branch and the BRA does branch) rather than one branch (a BEQ back to the start). This inefficiency isn't all that terrible --- the BNE that doesn't branch only takes 2 clock cycles (on the 65c816 in native mode, a conditional branch is 2 clocks when not taken and 3 when taken).

BTW: My peephole-optimizer gets rid of that $FFFF BIT# most of the time --- it keeps track of whether the ZF flag is valid for the A register (it usually is, because PLA LDA etc. all set ZF) --- if the ZF flag is already valid, the $FFFF BIT# is discarded as being redundant.

We aren't discussing peephole-optimization right now though.

We are discussing how to inline 0BRANCH with a backwards branch. Here is another possibility that can also be done with a single-pass compiler:
Code:
    pha  {:  pla                      \ this is BEGIN
    ...
    $FFFF bit#  beq}  pla             \ this is UNTIL
    ...

This is efficient for BEGIN UNTIL loops --- we have an extra PHA PLA on the final iteration --- we don't have inefficiency on every iteration.

This is what you get if you have a BEGIN WHILE REPEAT loop:
Code:
    pha  {:  pla                      \ this is BEGIN
    ...
    $FFFF bit#  beq{  pla             \ this is WHILE
    ...
    2swap bra}  }:  pla               \ this is REPEAT
    ...

BEGIN WHILE REPEAT does an extra PHA PLA on the final iteration --- it is not inefficient on every iteration.
This is the best solution yet --- we only have a minor inefficiency on the final iteration (not on every iteration, like we did in my previous solution in the previous post) --- we can do this with a single-pass compiler because BEGIN compiles the same code no matter whether it turns out to be a BEGIN UNTIL or a BEGIN WHILE REPEAT.
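The stack balance of the BEGIN UNTIL sequence can be checked with a short Python model. It is a sketch only, covering the BEGIN UNTIL case and not BEGIN WHILE REPEAT. The names `begin_until` and `countdown_body` are hypothetical, and the loop body is represented as a Python function that leaves the flag in A.

```python
def begin_until(a, stack, body):
    """Model of:  pha  {:  pla  <body>  $FFFF bit#  beq}  pla"""
    iterations = 0
    stack.append(a)            # pha: BEGIN, executed once on entry
    while True:
        a = stack.pop()        # {: pla: undoes the pha, or drops the last flag
        a = body(a, stack)     # loop body leaves its flag in A
        iterations += 1
        if a != 0:             # bit# / beq}: branch back only while flag is zero
            break
    a = stack.pop()            # pla: UNTIL drops the final flag
    return a, stack, iterations


def countdown_body(a, stack):
    """Hypothetical body, equivalent to:  1- DUP 0=  (with TRUE as 1)."""
    a = (a - 1) & 0xFFFF       # 1-
    stack.append(a)            # DUP is a PHA
    return 1 if a == 0 else 0  # 0= leaves the flag in A


result = begin_until(3, [99], countdown_body)
print(result)
```

The item that was underneath the counter (99 here) is untouched, and the single PHA is executed once however many times the loop runs.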


 Post subject: Re: vino816 Forth design
PostPosted: Sun Apr 29, 2018 12:50 am 

Joined: Fri Jun 03, 2016 3:42 am
Posts: 158
Hugh Aguilar wrote:
BTW: My peephole-optimizer gets rid of that $FFFF BIT# most of the time --- it keeps track of whether the ZF flag is valid for the A register (it usually is, because PLA LDA etc. all set ZF) --- if the ZF flag is already valid, the $FFFF BIT# is discarded as being redundant.

We aren't discussing peephole-optimization right now though.

Also, $FFFF BIT# can be replaced with TAY --- this is both faster and smaller.
This always works because I never carry data in Y through a branch --- the sections of code between branches are compiled and peephole-optimized as chunks --- doing peephole-optimization that covers entire control-structures would require a multi-pass compiler, and I only have single-pass at this time.

This thread seems to have devolved into me responding to myself --- apparently there is no interest in vino-816 --- I should not bother with further posts.

I mostly started this thread because of the previous exchange in the game-machine thread:

Hugh Aguilar wrote:
BigDumbDinosaur wrote:
As a well-written Forth kernel tends to be quite efficient (efficiency being one of the hallmarks of Forth), I'd expect that a kernel optimized for the 65C816 running in native mode would see a substantial performance gain.

It is obvious that William Mensch wanted to support C programming, so he added the offset,S and (offset,S),Y addressing-modes, but he was failing to support Forth.

If he doesn't want to support Forth, then I don't want to support his 65c816.

The 65c816 ISA was designed to support C, so supporting Forth is going to require contorted programming. I would do that if I was getting paid --- I'll do most anything for money, so long as it is not immoral or illegal, and I'm totally okay with implementing bad ideas --- do you want to hire me to write a Forth for the 65c816?
You've already said on this forum that you will never be convinced that Forth has any value, so I'm expecting that you are not going to hire me --- you don't actually believe what you said: "a well-written Forth kernel tends to be quite efficient (efficiency being one of the hallmarks of Forth)"

sark02 wrote:
Hugh Aguilar wrote:
do you want to hire me to write a Forth for the 65c816?
The question wasn't to me, but FYI if I wanted a Forth on any system, I'd want to hire a language expert.

What a slap in the face! SARK02 apparently believes that because I said that the 65c816 does not support Forth very well, this implies that I am incompetent at Forth. :cry:
He believes that it is an easy matter to find a "language expert" who knows more about Forth than I do and will say:
Quote:
The 65c816 is perfect for Forth! You are a genius to have chosen the 65c816 for Forth! You can pay by check, but your check has to clear before I begin work, Mr. Genius. I guarantee a kernel optimized for the 65c816 and a substantial performance gain! Woo hoo!

So, I began to wonder how a mighty "language expert" would design a Forth for the 65c816. Thinking about this, I came up with the vino816 design. Now SARK02 has to find a "language expert" who can think up a better design --- good luck with that!

I have no intention of actually implementing vino816 (not for free anyway; as I said above, I'm totally okay with getting paid to implement bad ideas). The only person I know of who uses the 65c816 for commercial projects is Garth Wilson, so I showed it to him first, but he was very unimpressed (he said it was "unnecessarily inefficient" and my code quality was so bad it made him "cringe"). AFAIK, Garth by himself is the entire 65c816 market on Earth, so if he is not interested, then nobody is.

BTW: William Mensch may have been thinking about Pascal rather than C. If so, then he was thinking about a crippled Pascal that lacks nested functions --- there is no register available for a local-frame pointer --- there were several such crippled Pascals in the 1980s, most notably Borland Turbo Pascal for the Z80 and later the 8088 (the bytecode Pascal for the Apple-II was also crippled afaik, although I never used it).


 Post subject: Re: vino816 Forth design
PostPosted: Sun Apr 29, 2018 1:11 am 

Joined: Sun Jun 30, 2013 10:26 pm
Posts: 1928
Location: Sacramento, CA, USA
Wow. I have never heard Garth respond like that to anyone before, publicly or privately. You must have struck a nerve somehow ... I've been told that you have a knack for that, although I seem to be somewhat immune.

Mike B.


 Post subject: Re: vino816 Forth design
PostPosted: Sun Apr 29, 2018 2:50 am 

Joined: Fri Jun 03, 2016 3:42 am
Posts: 158
barrym95838 wrote:
Wow. I have never heard Garth respond like that to anyone before, publicly or privately. You must have struck a nerve somehow ... I've been told that you have a knack for that, although I seem to be somewhat immune.

Mike B.

Well, telling him that my STC Forth would generate code 8 times faster than his ITC Forth was somewhat undiplomatic --- maybe I should have said 2 times faster (he estimated 40% faster, which is less than 1.5 times faster) --- I don't actually know that it would be 8 times faster; I was just guessing.

I think my vino816 design is about as good as any "language expert" is going to do on the 65c816 --- it is still not going to be as fast as C or Pascal --- my ENTER and LEAVE are a huge time-sink, plus I'm tying up X rather than leaving it available for general-purpose use.

I could retarget my vino816 design for the CPU12 (rename it the vino12). My Forth is not going to be as fast as C or Pascal on the CPU12 either though (although, at least, I no longer need ENTER and LEAVE because I can use X for the data-stack pointer and keep SP for the return-stack pointer as it was designed to be used). The reason is that the CPU12 has only a very few registers (D, X, Y, SP and PC), and Forth requires one of them to be used as the data-stack pointer --- this is one less register that is available for general-purpose use --- as a rule, Forth is only competitive with C or Pascal when there are plenty of registers, so tying up one register for use as the data-stack pointer doesn't take such a substantial bite out of your available registers.

The CPU12 is obsolete anyway --- NXP still sells it for use in legacy boards --- I've not heard of anybody using it for new designs.
I don't think there is a CPU12 forum --- there is this 6502 forum, but it is mostly nostalgia --- the CPU12 was never used in popular products (Commodore-64, Apple-IIc, etc.), so there is no nostalgia for it.
The ColdFire is obsolete too --- that was a pretty cool processor --- the ARM Cortex has made everything obsolete! :(

I suppose I could just program the ARM Cortex --- I doubt that I could get hired for that --- if a job were offered, the next day there would be 100,000 job applicants, all of whom have education or experience that I lack.

For the most part, right now, I'm focusing on my TOYF processor --- running Forth, it should be several times faster than the MSP430 running C --- it doesn't have interrupts though, so it can only use a paced loop, but that is adequate for motion-control.

One of the problems with America is that too many people have the attitude: "Nothing but the best for me!"
So, there is a tendency to decide too early what the best is. The ARM Cortex was arguably the best design of the 1990s --- this doesn't necessarily imply that it will continue to be the best design of the 21st century --- it dominates so much now though, that any new design faces an insurmountable uphill climb to gain any recognition.

Similarly we have these Star Wars movies that just get worse and worse --- the geniuses in Hollywood have decided this is a winning formula, so they are sticking with it --- this "best" drives out other movies that could be better (it is actually difficult to imagine any movie being worse).


 Post subject: Re: vino816 Forth design
PostPosted: Sun Apr 29, 2018 6:23 am 

Joined: Sun Jun 30, 2013 10:26 pm
Posts: 1928
Location: Sacramento, CA, USA
According to Dr. Brad, Phil Koopman did a study of how often Forth primitives get executed in some "benchmark Forth programs", whatever that means:

1. ENTER (12.21%)
2. EXIT (11.74%)
3. VARIABLE (5.46%)
4. @ (5.40%)
5. 0BRANCH (4.78%)
6. LIT (4.54%)
7. + (4.18%)
8. SWAP (3.90%)
9. R> (3.89%)
10. >R (3.87%)
11. CONSTANT (3.68%)
12. DUP (3.05%)

Note that ENTER, EXIT, R>, and >R combined add up to almost 32% of primitive executions.
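For anyone who wants to check the arithmetic, here it is as a throwaway Python snippet (the variable names are mine, the percentages are copied from the list above):

```python
# Percentages copied from the list above.
return_stack_traffic = {"ENTER": 12.21, "EXIT": 11.74, "R>": 3.89, ">R": 3.87}

total = sum(return_stack_traffic.values())
print(f"{total:.2f}%")   # 31.71%
```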

I don't know as much as I would like about Forth programming, but if that list is correct, I hesitate to allow myself to be convinced that your PSP in S design is more efficient than PSP in X for STC (even if your compiler is very clever). I started to test-code some comparison primitives so I could hand-compile a small program both ways, but it got too late, so I'll just post this thought and try to share some code tomorrow if I have time between errands.

Mike B.

P.S.
Leepivonka has been laying some groundwork of his own, and offers this interesting little morsel:
Code:
10 fib . 89  ok
4 fib . 5  ok

see fib                     : fib  ( n1 -- n2 )
058A B500       LDA 00,x         dup
058C 38         SEC            2 <
058D E90200     SBC #0002 {' SInEnd0}
0590 20E026     JSR 26E0 {<+003A}
0593 A8         TAY            if
0594 D003       BNE 0599 {fib+000F}
0596 4CA305     JMP 05A3 {fib+0019}
0599 E8         INX              drop
059A E8         INX
059B A90100     LDA #0001 {' SInIndx0+0001}     1
059E CA         DEX
059F CA         DEX
05A0 9500       STA 00,x
05A2 60         RTS              exit
 ok                   then
$5a3 disasm
05A3 B500       LDA 00,x         dup
05A5 3A         DEA            1-
05A6 CA         DEX
05A7 CA         DEX
05A8 9500       STA 00,x
05AA 208A05     JSR 058A {fib}         recurse
05AD 20EB1C     JSR 1CEB {Swap+000C}      swap
05B0 A90200     LDA #0002 {' SInEnd0}      2
05B3 49FFFF     EOR #FFFF {' minval+0026}   -
05B6 38         SEC
05B7 7500       ADC 00,x
05B9 9500       STA 00,x
05BB 208A05     JSR 058A {fib}         recurse
05BE 20361F     JSR 1F36 {++0018}      +
05C1 60         RTS            ;
 ok

5 fact . 120  ok
8 fact . -25216  ok
7 fact . 5040  ok

$5be 10 dump
0005BE 20 36 1F 60 66 61 63 74 04 8A 05 B5 00 A8 F0 02  6.`fact........ ok
see fact                  : fact ( n -- n! )
05C9 B500       LDA 00,x         dup
05CB A8         TAY            0> if
05CC F002       BEQ 05D0 {fact+0007}
05CE 1003       BPL 05D3 {fact+000A}
05D0 4CE305     JMP 05E3 {fact+001A}
05D3 B500       LDA 00,x           dup
05D5 3A         DEA              1-
05D6 CA         DEX
05D7 CA         DEX
05D8 9500       STA 00,x
05DA 20C905     JSR 05C9 {fact}           recurse
05DD 200C22     JSR 220C {*}           *
05E0 4CEC05     JMP 05EC {fact+0023}       else
 ok
$5e3 disasm
05E3 E8         INX              drop
05E4 E8         INX
05E5 A90100     LDA #0001 {' SInIndx0+0001}     1
05E8 CA         DEX
05E9 CA         DEX
05EA 9500       STA 00,x
                   then
05EC 60         RTS            ;
 ok

10 primes 2 3 5 7  ok
50 primes 2 3 5 7 11 13 17 19 23 29 31 37 41 43 47  ok

see primes                  : primes ( n -- )
05F6 A90200     LDA #0002 {' SInEnd0}      2
05F9 20FE38     JSR 38FE {.+0004}      .
05FC A90300     LDA #0003 {' SInEnd0+0001}   3
05FF 20FE38     JSR 38FE {.+0004}      .
0602 A90200     LDA #0002 {' SInEnd0}      2
0605 B400       LDY 00,x         swap
0607 9500       STA 00,x
0609 98         TYA
060A A8         TAY            
060B A90500     LDA #0005 {' SIn_Buf0+0001}   5
060E 5A         PHY            do
060F 48         PHA
0610 B500       LDA 00,x           dup
0612 CA         DEX
0613 CA         DEX
0614 9500       STA 00,x
0616 201022     JSR 2210 {*+0004}        dup *
0619 A301       LDA 01,s           i
061B 20D626     JSR 26D6 {<+0030}        <
061E A8         TAY              if
061F D003       BNE 0624 {primes+002E}
0621 4C2606     JMP 0626 {primes+0030}
0624 F600       INC 00,x             1+
                     then
0626 A90100     LDA #0001 {' SInIndx0+0001}     1
0629 CA         DEX
062A CA         DEX
062B 9500       STA 00,x
062D B502       LDA 02,x           over
062F 1A         INA              1+
0630 A8         TAY
0631 A90300     LDA #0003 {' SInEnd0+0001}     3
0634 5A         PHY              do
0635 48         PHA
0636 A305       LDA 05,s             j
0638 CA         DEX
0639 CA         DEX
063A 9500       STA 00,x
063C A301       LDA 01,s             i
063E 200F23     JSR 230F {Mod+0004}          mod
0641 208924     JSR 2489 {0=+000E}          0=
0644 A8         TAY                if
0645 D003       BNE 064A {primes+0054}
0647 4C5106     JMP 0651 {primes+005B}
064A D600       DEC 00,x               1-
064C A303       LDA 03,s               leave
064E 3A         DEA
064F 8301       STA 01,s
                       then
0651 A90200     LDA #0002 {' SInEnd0}          2
0654 205440     JSR 4054 {+Loop+001D}          +loop
0657 30DD       BMI 0636 {primes+0040}
0659 68         PLA
065A 7A         PLY
065B B500       LDA 00,x           if
065D E8         INX
065E E8         INX
065F A8         TAY
0660 D003       BNE 0665 {primes+006F}
0662 4C6A06     JMP 066A {primes+0074}
0665 A301       LDA 01,s             i
0667 20FE38     JSR 38FE {.+0004}          .
                     then
066A A90200     LDA #0002 {' SInEnd0}        2
066D 205440     JSR 4054 {+Loop+001D}        +loop
0670 309E       BMI 0610 {primes+001A}
0672 68         PLA
0673 7A         PLY
0674 E8         INX            drop
0675 E8         INX
0676 60         RTS            ;
 ok


It looks like DROP 1 is a good candidate for optimization.


 Post subject: Re: vino816 Forth design
PostPosted: Sun Apr 29, 2018 5:21 pm 

Joined: Sat Dec 13, 2003 3:37 pm
Posts: 1004
barrym95838 wrote:
According to Dr. Brad, Phil Koopman did a study of how often Forth primitives get executed in some "benchmark Forth programs", whatever that means:

1. ENTER (12.21%)
2. EXIT (11.74%)
3. VARIABLE (5.46%)
4. @ (5.40%)
5. 0BRANCH (4.78%)
6. LIT (4.54%)
7. + (4.18%)
8. SWAP (3.90%)
9. R> (3.89%)
10. >R (3.87%)
11. CONSTANT (3.68%)
12. DUP (3.05%)

That's a curious list.

Specifically what stands out to me is VARIABLE, as that's a compiling word, not a runtime word. I guess perhaps it's capturing variable references?

i.e. if I had:
Code:
: A B @ + ;

That would be counted as ENTER, VARIABLE (B), FETCH, and PLUS.

I guess that's what we're seeing.
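Under that reading, tallying one execution of A could be sketched like this in Python. The function `count_executions` and the two sets are hypothetical names of mine, and I am assuming EXIT would be counted for the semicolon as well; this is only a guess at how such a tally might work, not a description of the actual study.

```python
from collections import Counter

# Assumptions for this sketch: B is a variable, @ and + are primitives.
VARIABLES = {"B"}
PRIMITIVES = {"@", "+"}


def count_executions(body):
    """Tally what one call of a colon word executes, per the reading above."""
    counts = Counter()
    counts["ENTER"] += 1                 # entering the colon word
    for token in body:
        if token in VARIABLES:
            counts["VARIABLE"] += 1      # a variable reference pushes its address
        elif token in PRIMITIVES:
            counts[token] += 1
        else:
            raise ValueError(f"unknown token: {token}")
    counts["EXIT"] += 1                  # the ; at the end
    return counts


tally = count_executions(["B", "@", "+"])   # : A B @ + ;
print(dict(tally))
```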

