All times are UTC

Fastest 65C02 Forth?

Druzyek

Post subject: Fastest 65C02 Forth?

Posted: Tue Dec 17, 2019 3:40 am

Joined: Mon May 12, 2014 6:18 pm
Posts: 365

I'm curious about comparing performance between assembly and Forth. Which Forth for the 65C02 offers the best execution speed? I don't mind if the code takes up a lot of memory. Cross compiling or native is fine as long as I can run the output on a simulator. Thanks in advance!

barrym95838

Post subject: Re: Fastest 65C02 Forth?

Posted: Tue Dec 17, 2019 7:39 am

Joined: Sun Jun 30, 2013 10:26 pm
Posts: 1949
Location: Sacramento, CA, USA

I haven't yet found an example implementation to provide to you, but my educated guess is that the fastest threaded 'c02 Forth should be STC, because that's likely the quickest method for an unassisted 'c02 to deal with a 16-bit interpreter pointer (i.e., in PC, rather than ZP or two 8-bit registers).

That said, Charlie has a very nimble '02 DTC Forth that could be modified to be even faster by mixing in some 'c02 instructions where appropriate. Pettil is a "longoing" labor of love for Charlie, and he appears to be quite good at optimizing it for size and speed.

_________________
Got a kilobyte lying fallow in your 65xx's memory map? Sprinkle some VTL02C on it and see how it grows on you!

Mike B. (about me) (learning how to github)

BitWise

Post subject: Re: Fastest 65C02 Forth?

Posted: Tue Dec 17, 2019 3:43 pm

Joined: Tue Mar 02, 2004 8:55 am
Posts: 996
Location: Berkshire, UK

I agree with Barry: a Subroutine Threaded Code (STC) Forth will be the quickest, but it will also be bigger, since each word call is a JSR rather than just a 16-bit address, and you need some internal routines for pushing literals to the stack and testing TOS for branches.

For example a word like:

Code:

: emit ( ch -- )
  begin ACIA:STAT c@ 0010 and until
  ACIA:DATA c!
; external

will end up something like this (after a little optimization)

Code:

                .global emit
emit:                                   ; : emit
.L5:
                lda     -255            ; -255 c@
                ldy     #0
                jsr     __push
                lda     #<16            ; 16
                ldy     #>16
                jsr     __push
                jsr     and             ; and
                jsr     __test
                jeq     .L5             ; (?branch)
                jsr     __pull
                sta     -256            ; -256 c!
                rts                     ; ;

On the 6502 Direct Threaded Code (DTC) will be the next fastest and a bit more compact.
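
To put a rough number on the size trade-off mentioned above, here is a minimal sketch. It only counts the bytes spent on word references (3 bytes for a JSR versus 2 bytes for a bare address) and ignores headers, literals and branch helpers. The function name and the call count are made up for illustration.

```python
# Hypothetical size model: bytes per compiled word reference on a 6502.
JSR_BYTES = 3      # JSR opcode + 16-bit address (subroutine threading)
ADDRESS_BYTES = 2  # bare 16-bit address (direct or indirect threading)


def body_size(num_calls: int, bytes_per_call: int) -> int:
    """Bytes used by the word references alone (headers and literals ignored)."""
    return num_calls * bytes_per_call


calls = 7  # illustrative count, not taken from the listing above
stc = body_size(calls, JSR_BYTES)
dtc = body_size(calls, ADDRESS_BYTES)
print(f"STC: {stc} bytes, DTC: {dtc} bytes, difference: {stc - dtc} bytes")
```

So under this simple model STC costs one extra byte per word reference; inlining short words can change the picture in either direction.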

_________________
Andrew Jacobs
6502 & PIC Stuff - http://www.obelisk.me.uk/
Cross-Platform 6502/65C02/65816 Macro Assembler - http://www.obelisk.me.uk/dev65/
Open Source Projects - https://github.com/andrew-jacobs

GARTHWILSON

Post subject: Re: Fastest 65C02 Forth?

Posted: Tue Dec 17, 2019 7:11 pm

Joined: Fri Aug 30, 2002 1:09 am
Posts: 8543
Location: Southern California

From the Forth section of my links page:

Bruce Clark's 2-instruction 65816 NEXT in ITC Forth
Bruce Clark's single-instruction, 6-clock 65816 NEXT in DTC Forth
Bruce Clark explains how the faster-running STC Forth avoids the expected memory penalties. He gives 9 reasons, starting in the middle of his long post in the middle of the page. STC of course eliminates the need for NEXT, nest, and unnest, thus improving speed.
explanation of five different Forth threading methods, by Brad Rodriguez

(Bruce Clark and Brad Rodriguez are both on this forum, but not very active.)
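
For readers who haven't met NEXT, nest and unnest before, here is a toy model of a threaded inner interpreter. It is only a sketch in Python, not how any of the Forths linked above are written, and the names `run`, `primitives` and `words` are invented for the example. In STC the CPU's own JSR and RTS do the nesting and unnesting, and there is no NEXT loop at all.

```python
# Toy model of a threaded-code inner interpreter (illustrative only).
def run(thread, primitives, words):
    """Execute a thread (a list of word names and literals)."""
    stack, rstack = [], []           # data stack and return stack
    ip = (thread, 0)                 # interpreter pointer: (thread, index)
    while True:
        code, i = ip
        if i == len(code):           # unnest: reached the end of a definition
            if not rstack:
                return stack
            ip = rstack.pop()
            continue
        name = code[i]               # NEXT: fetch the next entry ...
        ip = (code, i + 1)           # ... and advance the interpreter pointer
        if name in words:            # nest: save IP, enter the callee's thread
            rstack.append(ip)
            ip = (words[name], 0)
        elif name in primitives:     # primitive: run its "machine code"
            primitives[name](stack)
        else:                        # anything else is treated as a literal
            stack.append(name)


primitives = {
    "dup": lambda s: s.append(s[-1]),
    "*": lambda s: s.append(s.pop() * s.pop()),
}
words = {"square": ["dup", "*"]}
print(run([7, "square"], primitives, words))  # prints [49]
```

Every word executed in this model pays for the fetch-and-advance step, which is the overhead STC removes.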

How well the '02 does in Forth compared to assembly language will depend of course on what you're doing. If you're doing a lot of 8-bit operations, its performance in Forth will be very poor compared to assembly language, since virtually everything in Forth is done in 16-bit (or 32- or 64-bit on higher-caliber processors) and the '02 has to take words apart to "get them through the door" and then put them back together on the other side, so to speak. (The 65816 does much better in this regard.) OTOH, if you're doing something like an FFT which has a ton of 16-bit multiplies, I don't think you'd find much difference in performance between even ITC Forth and assembly language. A nice thing however is that it's so easy to mix assembly language into Forth for the few small parts where performance must be maximized. I initially wrote my Forth assembler in an evening, and then realized later that since it's Forth, macro capability was natural, without any extra effort.
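
As a rough illustration of "taking words apart", this sketch does a 16-bit addition the way an 8-bit CPU has to: low bytes first, then high bytes plus the carry. It is a Python model of the idea rather than code from any Forth, and the function name is made up.

```python
def add16_on_8bit(a: int, b: int) -> int:
    """Add two 16-bit values one byte at a time, as an 8-bit CPU must."""
    lo = (a & 0xFF) + (b & 0xFF)                         # low bytes (CLC, ADC)
    carry = lo >> 8                                      # carry out of the low byte
    hi = ((a >> 8) & 0xFF) + ((b >> 8) & 0xFF) + carry   # high bytes plus carry (ADC)
    return ((hi & 0xFF) << 8) | (lo & 0xFF)              # reassemble, wrapping at 16 bits


print(hex(add16_on_8bit(0x12FF, 0x0001)))  # prints 0x1300
```

When the data really is 8-bit, the second half of that work is wasted, which is where hand-written assembly pulls ahead.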

_________________
http://WilsonMinesCo.com/ lots of 6502 resources
The "second front page" is http://wilsonminesco.com/links.html .
What's an additional VIA among friends, anyhow?

leepivonka

Post subject: Re: Fastest 65C02 Forth?

Posted: Wed Dec 18, 2019 2:39 am

Joined: Fri Apr 15, 2016 1:03 am
Posts: 140

You might find Tali Forth interesting:
http://forum.6502.org/viewtopic.php?f=9&t=2926
https://github.com/scotws/TaliForth2
It's for the 65C02. It's designed to be fast & not horribly big.

I've been working on 65816F, but it is for the 65816, not the 65C02.
http://forum.6502.org/viewtopic.php?f=9&t=4336
It's designed to be very fast & reasonably small. My goal is to generate respectable assembly code.

barrym95838

Post subject: Re: Fastest 65C02 Forth?

Posted: Wed Dec 18, 2019 3:50 am

Joined: Sun Jun 30, 2013 10:26 pm
Posts: 1949
Location: Sacramento, CA, USA

leepivonka wrote:

You might find Tali Forth interesting:

Ah, shoot! I can't believe I forgot Tali Forth from the STC category! Sorry, Scot ...

leepivonka

Post subject: Re: Fastest 65C02 Forth?

Posted: Wed Dec 18, 2019 8:22 am

Joined: Fri Apr 15, 2016 1:03 am
Posts: 140

Here is some sample code on a hacked up version of Tali:

Code:

Tali Forth 2 kernel for 65816s (27 Dec 2018)

-1 strip-underflow !  ok
  ok
: DSum >r 0. r> 1+ 0 do i m+ loop ;  ok
: Timer cc@ 2>r 1000 DSum cc@ 2r> d- d. ." cycles " d. ;  ok
timer 143466 cycles 500500  ok
see dsum
nt: A9A  xt: AC0
flags (CO AN IM NN UF HC R6): 0 0 0 1 0 0 0
size (decimal): 118

0AC0  B5 00 B4 01 E8 E8 5A 48  A9 00 CA CA 95 00 74 01  ......ZH ......t.
0AD0  A9 00 CA CA 95 00 74 01  68 7A CA CA 95 00 94 01  ......t. hz......
0AE0  F6 00 D0 02 F6 01 CA CA  74 00 74 01 38 A9 00 F5  ........ t.t.8...
0AF0  02 A8 A9 80 F5 03 95 03  48 5A 18 98 75 00 A8 B5  ........ HZ..u...
0B00  01 75 03 E8 E8 E8 E8 48  5A DA BA 38 BD 02 01 FD  .u.....H Z..8....
0B10  04 01 A8 BD 03 01 FD 05  01 FA CA CA 95 01 94 00  ........ ........
0B20  20 88 A8 18 68 69 01 A8  68 69 00 48 5A 70 03 4C   ...hi.. hi.HZp.L
0B30  09 0B 68 68 68 68  ..hhhh

AC0      0 lda.zx         >r
AC2      1 ldy.zx
AC4        inx
AC5        inx
AC6        phy
AC7        pha
AC8      0 lda.#         0.
ACA        dex
ACB        dex
ACC      0 sta.zx
ACE      1 stz.zx
AD0      0 lda.#
AD2        dex
AD3        dex
AD4      0 sta.zx
AD6      1 stz.zx
AD8        pla            r>
AD9        ply
ADA        dex
ADB        dex
ADC      0 sta.zx
ADE      1 sty.zx
AE0      0 inc.zx         1+
AE2      2 bne
AE4      1 inc.zx
AE6        dex            0
AE7        dex
AE8      0 stz.zx
AEA      1 stz.zx
AEC        sec            do
AED      0 lda.#
AEF      2 sbc.zx
AF1        tay
AF2     80 lda.#
AF4      3 sbc.zx
AF6      3 sta.zx
AF8        pha
AF9        phy
AFA        clc
AFB        tya
AFC      0 adc.zx
AFE        tay
AFF      1 lda.zx
B01      3 adc.zx
B03        inx
B04        inx
B05        inx
B06        inx
B07        pha
B08        phy
B09        phx              i
B0A        tsx
B0B        sec
B0C    102 lda.x
B0F    104 sbc.x
B12        tay
B13    103 lda.x
B16    105 sbc.x
B19        plx
B1A        dex
B1B        dex
B1C      1 sta.zx
B1E      0 sty.zx
B20   A888 ( ' m+ 3 + ) jsr        m+
B23        clc             loop
B24        pla
B25      1 adc.#
B27        tay
B28        pla
B29      0 adc.#
B2B        pha
B2C        phy
B2D      3 bvs
B2F    B09 ( ' DSum 49 + ) jmp
B32        pla
B33        pla
B34        pla
B35        pla ok
  ok
see timer
nt: B1C  xt: B42
flags (CO AN IM NN UF HC R6): 0 0 0 1 0 0 0
size (decimal): 50

0B42  20 17 80 20 01 AF A9 E8  A0 03 CA CA 95 00 94 01   .. .... ........
0B52  20 C0 0A 20 17 80 20 2A  AF 20 DE A8 20 24 B5 20   .. .. * . .. $.
0B62  61 A3 4C 6E 0B 63 79 63  6C 65 73 20 20 6A B5 20  a.Ln.cyc les  j.
0B72  24 B5  $.

B42   8017 ( ' cc@ ) jsr      cc@
B45   AF01 ( ' 2>r ) jsr      2>r
B48     E8 lda.#         1000
B4A      3 ldy.#
B4C        dex
B4D        dex
B4E      0 sta.zx
B50      1 sty.zx
B52    AC0 ( ' DSum ) jsr      DSum
B55   8017 ( ' cc@ ) jsr      cc@
B58   AF2A ( ' 2r> ) jsr      2r>
B5B   A8DE ( ' d- 3 + ) jsr      d-
B5E   B524 ( ' d. 3 + ) jsr      d.
B61   A361 ( ' sliteral 34 + ) jsr   ." cycles "
B64    B6E ( ' Timer 2C + ) jmp
B67    "cycles "
B6E   B56A ( ' Type ) jsr
B71   B524 ( ' d. 3 + ) jsr      d.
  ok
-------------------------------------------------------------------
Tali Forth 2 kernel for 65816s (27 Dec 2018)

-1 strip-underflow !  ok
  ok
$ff00 constant ACIA:STAT  ok
$ff01 constant ACIA:DATA  ok
  ok
: emit1  compiled
  begin acia:stat c@ $10 and until  compiled
  acia:data c! ;  ok
  ok
  ok
see acia:stat
nt: A9F  xt: AC5
flags (CO AN IM NN UF HC R6): 0 0 0 1 0 0 0
size (decimal): 7

0AC5  A9 00 A0 FF 4C 43 B8  ....LC.

AC5      0 lda.#
AC7     FF ldy.#
AC9   B843 ( ' dup 7 + ) jmp ok
see acia:data
nt: AB5  xt: ADB
flags (CO AN IM NN UF HC R6): 0 0 0 1 0 0 0
size (decimal): 7

0ADB  A9 01 A0 FF 4C 43 B8  ....LC.

ADB      1 lda.#
ADD     FF ldy.#
ADF   B843 ( ' dup 7 + ) jmp ok
  ok
see emit1
nt: AC7  xt: AED
flags (CO AN IM NN UF HC R6): 0 0 0 1 0 0 0
size (decimal): 42

0AED  20 C5 0A A1 00 95 00 74  01 A9 10 CA CA 95 00 74   ......t .......t
0AFD  01 20 78 B3 E8 E8 B5 FE  15 FF D0 03 4C ED 0A 20  . x..... ....L..
0B0D  DB 0A B5 02 81 00 E8 E8  E8 E8  ........ ..

               begin
AED    AC5 ( ' ACIA:STAT ) jsr        ACIA:STAT
AF0      0 lda.zxi           c@
AF2      0 sta.zx
AF4      1 stz.zx
AF6     10 lda.#           $10
AF8        dex
AF9        dex
AFA      0 sta.zx
AFC      1 stz.zx
AFE   B378 ( ' and 3 + ) jsr        and
B01        inx
B02        inx
B03     FE lda.zx          until
B05     FF ora.zx
B07      3 bne
B09    AED ( ' emit1 ) jmp
B0C    ADB ( ' ACIA:DATA ) jsr      ACIA:DATA
B0F      2 lda.zx         c!
B11      0 sta.zxi
B13        inx
B14        inx
B15        inx
B16        inx ok
  ok

For comparison, here is the same sample in 65816F running on a 65816 CPU:

Code:

65816F 2019Oct06
: DSum >r 0. r> 1+ 0 do i m+ loop ;  ok
: Timer cc@ 2>r 1000 DSum cc@ 2r> d- d. d. ;  ok
timer 56195 500500  ok    Runs in 56195 machine cycles
see dsum
04F3 B500       LDA 00,x         >r
04F5 E8         INX
04F6 E8         INX
04F7 48         PHA
04F8 A00000     LDY #0000 {' SInIndx0}      0.
04FB A90000     LDA #0000 {' SInIndx0}
04FE 20D685     JSR 85D6 {PsuYA}
0501 68         PLA            r>
0502 1A         INA            1+
0503 A8         TAY
0504 A90000     LDA #0000 {' SInIndx0}      0
0507 5A         PHY            do
0508 48         PHA
0509 A301       LDA 01,s           i
050B 20D688     JSR 88D6 {M++0004}        m+
050E 68         PLA             loop
050F 1A         INA
0510 C301       CMP 01,s
0512 D0F4       BNE 0508 {DSum+0015}
0514 7A         PLY
0515 60         RTS            ;
 ok
see timer
051E 02F5       COP #F5            cc@
0520 48         PHA            2>r
0521 5A         PHY
0522 A9E803     LDA #03E8         1000
0525 20F704     JSR 04F7 {DSum+0004}      DSum
0528 02F5       COP #F5            cc@
052A 20D685     JSR 85D6 {PsuYA}
052D 7A         PLY            2r>
052E 68         PLA
052F 203C89     JSR 893C {D-+0003}      d-
0532 2003A5     JSR A503 {D.}         d.
0535 2003A5     JSR A503 {D.}         d.
0538 60         RTS            ;
 ok
-------------------------------------------------------
65816F 2019Oct06
  ok
$ff00 constant ACIA:STAT  ok
$ff01 constant ACIA:DATA  ok
: emit1  compiled
  begin ACIA:STAT c@ $10 and until  compiled
  ACIA:DATA c! ;  ok
  ok
see acia:Stat
04F8 A900FF     LDA #FF00 {' ACIA:STAT}
04FB 4C9A85     JMP 859A {PsuA}
 ok
see acia:data
050A A901FF     LDA #FF01 {' ACIA:DATA}
050D 4C9A85     JMP 859A {PsuA}
 ok
see emit1
                  begin
0518 A90000     LDA #0000 {' SInIndx0}        ACIA:STAT c@
051B E220       SEP #20 {Loc+0004}
051D AD00FF     LDA FF00 {' ACIA:STAT}
0520 C220       REP #20 {Loc+0004}
0522 291000     AND #0010 {' SInIndx2}        $10 and
0525 A8         TAY             until
0526 F0F0       BEQ 0518 {emit1}
0528 B500       LDA 00,x         ACIA:DATA c!
052A E8         INX
052B E8         INX
052C E220       SEP #20 {Loc+0004}
052E 8D01FF     STA FF01 {' ACIA:DATA}
0531 C220       REP #20 {Loc+0004}
0533 60         RTS            ;
 ok
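
A quick sanity check on the numbers in the two timer runs above, as a small Python sketch. The cycle counts are copied from the transcripts; the per-iteration figures are only approximate because the totals also include the overhead of calling DSum and reading the counter.

```python
limit = 1000
expected_sum = sum(range(limit + 1))   # DSum adds i for i = 0 .. 1000
iterations = limit + 1                 # the DO loop runs from 0 up to (not including) 1001

tali_cycles = 143466   # Tali Forth 2 run quoted above
f816_cycles = 56195    # 65816F run quoted above

print(expected_sum)                                  # prints 500500
print(round(tali_cycles / iterations, 1), "cycles per iteration (Tali)")
print(round(f816_cycles / iterations, 1), "cycles per iteration (65816F)")
print(round(tali_cycles / f816_cycles, 2), "x ratio")
```

Both runs report the same sum, so the two listings are doing the same work; the ratio is just the quotient of the two quoted cycle counts.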

SamCoVT

Post subject: Re: Fastest 65C02 Forth?

Posted: Wed Dec 18, 2019 4:09 pm

Joined: Sun May 13, 2018 5:49 pm
Posts: 255

leepivonka wrote:

You might find Tali Forth interesting:
http://forum.6502.org/viewtopic.php?f=9&t=2926
https://github.com/scotws/TaliForth2
It's for the 65C02. It's designed to be fast & not horribly big.

For the 65C02, TaliForth2 is reasonably fast but isn't as small as it used to be. It's about 21.5K now, but implements much of the ANS2012 standard including the core (including extension words), block (including extension words), double, facility, string, tools, and search (including extension words) wordsets - along with an assembler and two simple editors.

On the speed front, however, Tali uses native compiling and gets a bit of a speed boost from that (at the expense of code size). In native compiling, rather than using a JSR to another word, Tali just copies the opcodes for that routine into the word being compiled (up to, but not including the RTS). There are also words, such as allow-native, always-native, and never-native to control whether a new word will be native compiled when compiled in a future word. There is also a variable nc_limit that sets the maximum size of word that will be natively compiled - words larger than this will be compiled as a JSR (unless they have been flagged always-native).
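
The inline-or-JSR decision described above can be sketched roughly like this. This is a Python model, not Tali's actual source: the function name and arguments are hypothetical, and whether Tali counts the RTS toward the size limit is an assumption here (this sketch does not count it).

```python
RTS = 0x60  # 65C02 opcode for RTS
JSR = 0x20  # 65C02 opcode for JSR absolute


def compile_call(body: bytes, address: int, nc_limit: int,
                 always_native: bool = False,
                 never_native: bool = False) -> bytes:
    """Return the bytes to append for one word reference.

    Hypothetical model: `body` is the callee's machine code ending in RTS,
    `address` is where the callee lives in a 16-bit address space.
    """
    # Drop the trailing RTS so execution falls through into the caller.
    code = body[:-1] if body and body[-1] == RTS else body
    inline = always_native or (not never_native and len(code) <= nc_limit)
    if inline:
        return bytes(code)  # copy the callee's opcodes in place
    # Otherwise emit JSR with a little-endian target address.
    return bytes([JSR, address & 0xFF, (address >> 8) & 0xFF])


two_inx = bytes([0xE8, 0xE8, RTS])                 # inx inx rts
print(compile_call(two_inx, 0x1234, 20).hex())     # prints e8e8
print(compile_call(two_inx, 0x1234, 1).hex())      # prints 203412
```

In the first call the two INX opcodes are copied inline (2 bytes, no call overhead); in the second the limit is too small, so a 3-byte JSR is emitted instead.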

This gets you closer to assembly speeds with some of the drawbacks that have already been mentioned (all operations are 16-bit and there will sometimes be unneeded stack thrashing or use of the stack when using a register would have been faster).

When a Forth has an assembler, the assembler IS Forth (or perhaps it's the other way around - it gets very meta if you start thinking about it) and is just a different (forth) way to build a new word. If you think of it this way, then the Forth using assembler words is as fast as any other assembler and is really only limited by the skill of the programmer.

If you are interested in timing some different Forths, you'll find that Tali2 runs in py65mon out of the box. You may also find interesting, in the test folder on github, the files talitest.py and cycles.fs. The talitest.py extends py65mon, loads a binary file into the memory space (it's currently set to load taliforth-py65mon.bin but you can change it if you want) and it also provides a cycle counter mapped into the memory space. You read from address 0xF006 to start the cycle counter (ignoring the result), read from 0xF006 to stop the cycle counter (again ignoring the result), and then the cycle count between those two accesses is available as a 32-bit value in memory locations 0xF008-F00B. Just beware that the 32-bit value is presented in Tali2's double format, which is neither little-endian nor big-endian for the entire 32-bit value. This allows it to be read directly using 2@. Because this is memory mapped, it could be used by any Forth that will run in py65mon.
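
If you read those four bytes from outside Forth (say, from a Python test script), the double format has to be unpicked by hand. Here is a minimal sketch, assuming the layout follows the standard definition of 2@: the most-significant cell sits at the lower address and each 16-bit cell is itself little-endian. The function name is invented for the example.

```python
def decode_forth_double(raw: bytes) -> int:
    """Decode 4 bytes stored as a Forth double on a little-endian 16-bit system.

    Assumed layout: most-significant cell first, each cell little-endian.
    """
    if len(raw) != 4:
        raise ValueError("expected exactly 4 bytes")
    hi = raw[0] | (raw[1] << 8)   # cell at the lower address: high 16 bits
    lo = raw[2] | (raw[3] << 8)   # cell at the higher address: low 16 bits
    return (hi << 16) | lo


# 143466 is 0x0002306A, so it would be stored as the bytes 02 00 6A 30.
print(decode_forth_double(bytes([0x02, 0x00, 0x6A, 0x30])))  # prints 143466
```

Reading the same bytes as a plain little-endian 32-bit integer would give a different (wrong) value, which is the trap being warned about.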

barrym95838

Post subject: Re: Fastest 65C02 Forth?

Posted: Wed Dec 18, 2019 8:38 pm

Joined: Sun Jun 30, 2013 10:26 pm
Posts: 1949
Location: Sacramento, CA, USA

Yeah, I think that several small little-endian systems store Forth doubles in middle-endian (NUXI) format.

Druzyek

Post subject: Re: Fastest 65C02 Forth?

Posted: Fri Jan 10, 2020 4:23 pm

Joined: Mon May 12, 2014 6:18 pm
Posts: 365

SamCoVT, this sounds really appealing. I'm going to work on getting it running. Is there a way to produce standalone binaries or would I need to include the whole 21k package with anything I create?
