Fastest 65C02 Forth?
I'm curious about comparing performance between assembly and Forth. Which Forth for the 65C02 offers the best execution speed? I don't mind if the code takes up a lot of memory. Cross compiling or native is fine as long as I can run the output on a simulator. Thanks in advance!
- barrym95838
- Posts: 2056
- Joined: 30 Jun 2013
- Location: Sacramento, CA, USA
Re: Fastest 65C02 Forth?
I haven't yet found an example implementation to provide to you, but my educated guess is that the fastest threaded 'c02 Forth should be STC, because that's likely the quickest method for an unassisted 'c02 to deal with a 16-bit interpreter pointer (i.e., in the PC, rather than in ZP or two 8-bit registers).
That said, Charlie has a very nimble '02 DTC Forth that could be modified to be even faster by mixing in some 'c02 instructions where appropriate. Pettil is a "longoing" labor of love for Charlie, and he appears to be quite good at optimizing it for size and speed.
Got a kilobyte lying fallow in your 65xx's memory map? Sprinkle some VTL02C on it and see how it grows on you!
Mike B. (about me) (learning how to github)
- BitWise
- In Memoriam
- Posts: 996
- Joined: 02 Mar 2004
- Location: Berkshire, UK
- Contact:
Re: Fastest 65C02 Forth?
I agree with Barry: a Subroutine Threaded Code (STC) Forth will be the quickest. It will also be bigger, since each word call is a three-byte JSR rather than just a 16-bit address, and you need some internal routines for pushing literals to the stack and testing TOS for branches.
For example a word like:
Code: Select all
: emit ( ch -- )
begin ACIA:STAT c@ 0010 and until
ACIA:DATA c!
; external
will end up something like this (after a little optimization):
Code: Select all
.global emit
emit: ; : emit
.L5:
lda -255 ; -255 c@
ldy #0
jsr __push
lda #<16 ; 16
ldy #>16
jsr __push
jsr and ; and
jsr __test
jeq .L5 ; (?branch)
jsr __pull
sta -256 ; -256 c!
rts ; ;
On the 6502, Direct Threaded Code (DTC) will be the next fastest and a bit more compact.
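To put rough numbers on the size trade-off between STC and DTC, here is a minimal Python sketch. It counts only the bytes a colon definition spends on references to other words plus its terminator: 3 bytes per JSR and a 1-byte RTS for STC, versus 2 bytes per address (including one for the exit word) for DTC. The helper name `body_size` and the 8-word example are made up for illustration, and the sketch ignores headers, inlined literals and the code field a DTC word starts with, so treat it as an estimate rather than a measurement of any particular Forth.

```python
# Rough size model for one colon definition (illustrative only).
JSR_BYTES = 3    # 65C02 JSR abs: opcode + 16-bit address
RTS_BYTES = 1    # 65C02 RTS
ADDR_BYTES = 2   # threaded code stores a bare 16-bit address


def body_size(word_refs: int, threading: str) -> int:
    """Hypothetical helper: bytes used by word references plus terminator."""
    if threading == "STC":
        return word_refs * JSR_BYTES + RTS_BYTES
    if threading == "DTC":
        # One extra cell for the address of the exit word.
        return (word_refs + 1) * ADDR_BYTES
    raise ValueError(f"unknown threading model: {threading}")


print("STC:", body_size(8, "STC"), "bytes")
print("DTC:", body_size(8, "DTC"), "bytes")
```

Under these assumptions the gap grows by one byte per word referenced, which is the "bigger" cost mentioned above; an STC compiler can win some of it back by inlining short words.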
Andrew Jacobs
6502 & PIC Stuff - http://www.obelisk.me.uk/
Cross-Platform 6502/65C02/65816 Macro Assembler - http://www.obelisk.me.uk/dev65/
Open Source Projects - https://github.com/andrew-jacobs
- GARTHWILSON
- Forum Moderator
- Posts: 8775
- Joined: 30 Aug 2002
- Location: Southern California
- Contact:
Re: Fastest 65C02 Forth?
From the Forth section of my links page:
How well the '02 does in Forth compared to assembly language will depend of course on what you're doing. If you're doing a lot of 8-bit operations, its performance in Forth will be very poor compared to assembly language, since virtually everything in Forth is done in 16-bit (or 32- or 64-bit on higher-caliber processors) and the '02 has to take words apart to "get them through the door" and then put them back together on the other side, so to speak. (The 65816 does much better in this regard.) OTOH, if you're doing something like an FFT which has a ton of 16-bit multiplies, I don't think you'd find much difference in performance between even ITC Forth and assembly language. A nice thing however is that it's so easy to mix assembly language into Forth for the few small parts where performance must be maximized. I initially wrote my Forth assembler in an evening, and then realized later that since it's Forth, macro capability was natural, without any extra effort.
- Bruce Clark's 2-instruction 65816 NEXT in ITC Forth
- Bruce Clark's single-instruction, 6-clock 65816 NEXT in DTC Forth
- Bruce Clark explains how the faster-running STC Forth avoids the expected memory penalties. He gives 9 reasons, starting in the middle of his long post in the middle of the page. STC of course eliminates the need for NEXT, nest, and unnest, thus improving speed.
- explanation of five different Forth threading methods, by Brad Rodriguez
http://WilsonMinesCo.com/ lots of 6502 resources
The "second front page" is http://wilsonminesco.com/links.html .
What's an additional VIA among friends, anyhow?
- leepivonka
- Posts: 168
- Joined: 15 Apr 2016
Re: Fastest 65C02 Forth?
You might find Tali Forth interesting:
viewtopic.php?f=9&t=2926
https://github.com/scotws/TaliForth2
It's for the 65C02. It's designed to be fast & not horribly big.
I've been working on 65816F, but it is for the 65816, not the 65C02.
viewtopic.php?f=9&t=4336
It's designed to be very fast & reasonably small. My goal is to generate respectable assembly code.
- barrym95838
- Posts: 2056
- Joined: 30 Jun 2013
- Location: Sacramento, CA, USA
Re: Fastest 65C02 Forth?
leepivonka wrote:
You might find Tali Forth interesting:
Got a kilobyte lying fallow in your 65xx's memory map? Sprinkle some VTL02C on it and see how it grows on you!
Mike B. (about me) (learning how to github)
- leepivonka
- Posts: 168
- Joined: 15 Apr 2016
Re: Fastest 65C02 Forth?
Here is some sample code on a hacked up version of Tali:
Code: Select all
Tali Forth 2 kernel for 65816s (27 Dec 2018)
-1 strip-underflow ! ok
ok
: DSum >r 0. r> 1+ 0 do i m+ loop ; ok
: Timer cc@ 2>r 1000 DSum cc@ 2r> d- d. ." cycles " d. ; ok
timer 143466 cycles 500500 ok
see dsum
nt: A9A xt: AC0
flags (CO AN IM NN UF HC R6): 0 0 0 1 0 0 0
size (decimal): 118
0AC0 B5 00 B4 01 E8 E8 5A 48 A9 00 CA CA 95 00 74 01 ......ZH ......t.
0AD0 A9 00 CA CA 95 00 74 01 68 7A CA CA 95 00 94 01 ......t. hz......
0AE0 F6 00 D0 02 F6 01 CA CA 74 00 74 01 38 A9 00 F5 ........ t.t.8...
0AF0 02 A8 A9 80 F5 03 95 03 48 5A 18 98 75 00 A8 B5 ........ HZ..u...
0B00 01 75 03 E8 E8 E8 E8 48 5A DA BA 38 BD 02 01 FD .u.....H Z..8....
0B10 04 01 A8 BD 03 01 FD 05 01 FA CA CA 95 01 94 00 ........ ........
0B20 20 88 A8 18 68 69 01 A8 68 69 00 48 5A 70 03 4C ...hi.. hi.HZp.L
0B30 09 0B 68 68 68 68 ..hhhh
AC0 0 lda.zx >r
AC2 1 ldy.zx
AC4 inx
AC5 inx
AC6 phy
AC7 pha
AC8 0 lda.# 0.
ACA dex
ACB dex
ACC 0 sta.zx
ACE 1 stz.zx
AD0 0 lda.#
AD2 dex
AD3 dex
AD4 0 sta.zx
AD6 1 stz.zx
AD8 pla r>
AD9 ply
ADA dex
ADB dex
ADC 0 sta.zx
ADE 1 sty.zx
AE0 0 inc.zx 1+
AE2 2 bne
AE4 1 inc.zx
AE6 dex 0
AE7 dex
AE8 0 stz.zx
AEA 1 stz.zx
AEC sec do
AED 0 lda.#
AEF 2 sbc.zx
AF1 tay
AF2 80 lda.#
AF4 3 sbc.zx
AF6 3 sta.zx
AF8 pha
AF9 phy
AFA clc
AFB tya
AFC 0 adc.zx
AFE tay
AFF 1 lda.zx
B01 3 adc.zx
B03 inx
B04 inx
B05 inx
B06 inx
B07 pha
B08 phy
B09 phx i
B0A tsx
B0B sec
B0C 102 lda.x
B0F 104 sbc.x
B12 tay
B13 103 lda.x
B16 105 sbc.x
B19 plx
B1A dex
B1B dex
B1C 1 sta.zx
B1E 0 sty.zx
B20 A888 ( ' m+ 3 + ) jsr m+
B23 clc loop
B24 pla
B25 1 adc.#
B27 tay
B28 pla
B29 0 adc.#
B2B pha
B2C phy
B2D 3 bvs
B2F B09 ( ' DSum 49 + ) jmp
B32 pla
B33 pla
B34 pla
B35 pla ok
ok
see timer
nt: B1C xt: B42
flags (CO AN IM NN UF HC R6): 0 0 0 1 0 0 0
size (decimal): 50
0B42 20 17 80 20 01 AF A9 E8 A0 03 CA CA 95 00 94 01 .. .... ........
0B52 20 C0 0A 20 17 80 20 2A AF 20 DE A8 20 24 B5 20 .. .. * . .. $.
0B62 61 A3 4C 6E 0B 63 79 63 6C 65 73 20 20 6A B5 20 a.Ln.cyc les j.
0B72 24 B5 $.
B42 8017 ( ' cc@ ) jsr cc@
B45 AF01 ( ' 2>r ) jsr 2>r
B48 E8 lda.# 1000
B4A 3 ldy.#
B4C dex
B4D dex
B4E 0 sta.zx
B50 1 sty.zx
B52 AC0 ( ' DSum ) jsr DSum
B55 8017 ( ' cc@ ) jsr cc@
B58 AF2A ( ' 2r> ) jsr 2r>
B5B A8DE ( ' d- 3 + ) jsr d-
B5E B524 ( ' d. 3 + ) jsr d.
B61 A361 ( ' sliteral 34 + ) jsr ." cycles "
B64 B6E ( ' Timer 2C + ) jmp
B67 "cycles "
B6E B56A ( ' Type ) jsr
B71 B524 ( ' d. 3 + ) jsr d.
ok
-------------------------------------------------------------------
Tali Forth 2 kernel for 65816s (27 Dec 2018)
-1 strip-underflow ! ok
ok
$ff00 constant ACIA:STAT ok
$ff01 constant ACIA:DATA ok
ok
: emit1 compiled
begin acia:stat c@ $10 and until compiled
acia:data c! ; ok
ok
ok
see acia:stat
nt: A9F xt: AC5
flags (CO AN IM NN UF HC R6): 0 0 0 1 0 0 0
size (decimal): 7
0AC5 A9 00 A0 FF 4C 43 B8 ....LC.
AC5 0 lda.#
AC7 FF ldy.#
AC9 B843 ( ' dup 7 + ) jmp ok
see acia:data
nt: AB5 xt: ADB
flags (CO AN IM NN UF HC R6): 0 0 0 1 0 0 0
size (decimal): 7
0ADB A9 01 A0 FF 4C 43 B8 ....LC.
ADB 1 lda.#
ADD FF ldy.#
ADF B843 ( ' dup 7 + ) jmp ok
ok
see emit1
nt: AC7 xt: AED
flags (CO AN IM NN UF HC R6): 0 0 0 1 0 0 0
size (decimal): 42
0AED 20 C5 0A A1 00 95 00 74 01 A9 10 CA CA 95 00 74 ......t .......t
0AFD 01 20 78 B3 E8 E8 B5 FE 15 FF D0 03 4C ED 0A 20 . x..... ....L..
0B0D DB 0A B5 02 81 00 E8 E8 E8 E8 ........ ..
begin
AED AC5 ( ' ACIA:STAT ) jsr ACIA:STAT
AF0 0 lda.zxi c@
AF2 0 sta.zx
AF4 1 stz.zx
AF6 10 lda.# $10
AF8 dex
AF9 dex
AFA 0 sta.zx
AFC 1 stz.zx
AFE B378 ( ' and 3 + ) jsr and
B01 inx
B02 inx
B03 FE lda.zx until
B05 FF ora.zx
B07 3 bne
B09 AED ( ' emit1 ) jmp
B0C ADB ( ' ACIA:DATA ) jsr ACIA:DATA
B0F 2 lda.zx c!
B11 0 sta.zxi
B13 inx
B14 inx
B15 inx
B16 inx ok
ok
For comparison, here is the same sample in 65816F running on a 65816 CPU:
Code: Select all
65816F 2019Oct06
: DSum >r 0. r> 1+ 0 do i m+ loop ; ok
: Timer cc@ 2>r 1000 DSum cc@ 2r> d- d. d. ; ok
timer 56195 500500 ok Runs in 56195 machine cycles
see dsum
04F3 B500 LDA 00,x >r
04F5 E8 INX
04F6 E8 INX
04F7 48 PHA
04F8 A00000 LDY #0000 {' SInIndx0} 0.
04FB A90000 LDA #0000 {' SInIndx0}
04FE 20D685 JSR 85D6 {PsuYA}
0501 68 PLA r>
0502 1A INA 1+
0503 A8 TAY
0504 A90000 LDA #0000 {' SInIndx0} 0
0507 5A PHY do
0508 48 PHA
0509 A301 LDA 01,s i
050B 20D688 JSR 88D6 {M++0004} m+
050E 68 PLA loop
050F 1A INA
0510 C301 CMP 01,s
0512 D0F4 BNE 0508 {DSum+0015}
0514 7A PLY
0515 60 RTS ;
ok
see timer
051E 02F5 COP #F5 cc@
0520 48 PHA 2>r
0521 5A PHY
0522 A9E803 LDA #03E8 1000
0525 20F704 JSR 04F7 {DSum+0004} DSum
0528 02F5 COP #F5 cc@
052A 20D685 JSR 85D6 {PsuYA}
052D 7A PLY 2r>
052E 68 PLA
052F 203C89 JSR 893C {D-+0003} d-
0532 2003A5 JSR A503 {D.} d.
0535 2003A5 JSR A503 {D.} d.
0538 60 RTS ;
ok
-------------------------------------------------------
65816F 2019Oct06
ok
$ff00 constant ACIA:STAT ok
$ff01 constant ACIA:DATA ok
: emit1 compiled
begin ACIA:STAT c@ $10 and until compiled
ACIA:DATA c! ; ok
ok
see acia:Stat
04F8 A900FF LDA #FF00 {' ACIA:STAT}
04FB 4C9A85 JMP 859A {PsuA}
ok
see acia:data
050A A901FF LDA #FF01 {' ACIA:DATA}
050D 4C9A85 JMP 859A {PsuA}
ok
see emit1
begin
0518 A90000 LDA #0000 {' SInIndx0} ACIA:STAT c@
051B E220 SEP #20 {Loc+0004}
051D AD00FF LDA FF00 {' ACIA:STAT}
0520 C220 REP #20 {Loc+0004}
0522 291000 AND #0010 {' SInIndx2} $10 and
0525 A8 TAY until
0526 F0F0 BEQ 0518 {emit1}
0528 B500 LDA 00,x ACIA:DATA c!
052A E8 INX
052B E8 INX
052C E220 SEP #20 {Loc+0004}
052E 8D01FF STA FF01 {' ACIA:DATA}
0531 C220 REP #20 {Loc+0004}
0533 60 RTS ;
ok
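As a sanity check on the two DSum timings above, here is a small Python sketch that reproduces the printed sum and turns the cycle counts into per-iteration figures. The cycle counts are copied from the two transcripts; everything else is plain arithmetic. Note that the measured interval also includes the call to DSum and the stack shuffling around cc@, so the per-iteration numbers are slightly pessimistic.

```python
# Figures copied from the two "timer" runs above.
N = 1000
TALI_CYCLES = 143466    # hacked-up Tali Forth 2 run
F816_CYCLES = 56195     # 65816F run

# DSum adds i for i = 0 .. N inclusive, so the loop body runs N + 1 times.
iterations = N + 1
expected_sum = N * (N + 1) // 2   # closed form for 0 + 1 + ... + N

print("expected sum:", expected_sum)
print("Tali cycles/iteration:", round(TALI_CYCLES / iterations, 1))
print("65816F cycles/iteration:", round(F816_CYCLES / iterations, 1))
print("ratio:", round(TALI_CYCLES / F816_CYCLES, 2))
```

The sum agrees with the 500500 both systems print, and the timings work out to roughly 143 versus 56 cycles per pass through the loop, or about 2.5 times faster for the 65816F version on this particular test.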
Re: Fastest 65C02 Forth?
leepivonka wrote:
You might find Tali Forth interesting:
viewtopic.php?f=9&t=2926
https://github.com/scotws/TaliForth2
It's for the 65C02. It's designed to be fast & not horribly big.
On the speed front, however, Tali uses native compiling and gets a bit of a speed boost from that (at the expense of code size). In native compiling, rather than using a JSR to another word, Tali just copies the opcodes for that routine into the word being compiled (up to, but not including the RTS). There are also words, such as allow-native, always-native, and never-native to control whether a new word will be native compiled when compiled in a future word. There is also a variable nc_limit that sets the maximum size of word that will be natively compiled - words larger than this will be compiled as a JSR (unless they have been flagged always-native).
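The inline-or-JSR decision described above can be modelled in a few lines of Python. This is a toy model, not Tali's actual compiler: the function name `compile_call` and its parameters are invented for illustration, the two boolean arguments merely stand in for the always-native and never-native flags, and it assumes the called word's code ends in a single RTS and that the size compared against nc_limit excludes that RTS.

```python
RTS = 0x60   # 65C02 RTS opcode
JSR = 0x20   # 65C02 JSR abs opcode


def compile_call(body: bytes, addr: int, nc_limit: int,
                 always_native: bool = False,
                 never_native: bool = False) -> bytes:
    """Return the bytes to append when compiling a reference to a word.

    body: machine code of the called word (assumed to end in RTS).
    addr: entry address of the word, used when falling back to JSR.
    """
    # Copy everything up to, but not including, the final RTS.
    code = body[:-1] if body and body[-1] == RTS else body
    inline = always_native or (not never_native and len(code) <= nc_limit)
    if inline:
        return code
    # Otherwise emit JSR with the address in little-endian order.
    return bytes([JSR, addr & 0xFF, (addr >> 8) & 0xFF])


# INX INX RTS is short, so its two INX bytes are copied inline.
print(compile_call(bytes([0xE8, 0xE8, RTS]), 0x1234, nc_limit=20).hex())
# A 30-byte word exceeds the limit and becomes JSR $1234.
print(compile_call(bytes([0xEA] * 30 + [RTS]), 0x1234, nc_limit=20).hex())
```

Inlining saves the 12 cycles of a JSR/RTS pair on every call, which is where the speed boost comes from, at the cost of repeating the word's body wherever it is used.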
This gets you closer to assembly speeds with some of the drawbacks that have already been mentioned (all operations are 16-bit and there will sometimes be unneeded stack thrashing or use of the stack when using a register would have been faster).
When a Forth has an assembler, the assembler IS Forth (or perhaps it's the other way around - it gets very meta if you start thinking about it) and is just a different (forth) way to build a new word. If you think of it this way, then the Forth using assembler words is as fast as any other assembler and is really only limited by the skill of the programmer.
If you are interested in timing some different Forths, you'll find that Tali2 runs in py65mon out of the box. You may also be interested in the files talitest.py and cycles.fs, in the test folder on github. talitest.py extends py65mon, loads a binary file into the memory space (it's currently set to load taliforth-py65mon.bin but you can change it if you want) and also provides a cycle counter mapped into the memory space. You read from address 0xF006 to start the cycle counter (ignoring the result), read from 0xF006 to stop the cycle counter (again ignoring the result), and then the cycle count between those two accesses is available as a 32-bit value in memory locations 0xF008-F00B. Just beware that the 32-bit value is presented in Tali2's double format, which is neither little-endian nor big-endian across the entire 32-bit value. This allows it to be read directly using 2@. Because the counter is memory mapped, it could be used by any Forth that will run in py65mon.
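If you read the four counter bytes from outside Forth (for example from a Python test script), they need reassembling by hand. The sketch below assumes the layout that the standard definition of 2@ implies: the most-significant 16-bit cell sits at the lower address, and each cell is itself little-endian. The function name `decode_tali_double` is made up for this example and the sample bytes are constructed by hand, not captured from a real run.

```python
def decode_tali_double(raw: bytes) -> int:
    """Decode 4 bytes stored as two little-endian 16-bit cells, high cell first.

    Assumption: raw[0:2] is the most-significant cell (what 2@ leaves on
    top of the stack) and raw[2:4] is the least-significant cell.
    """
    if len(raw) != 4:
        raise ValueError("expected exactly 4 bytes")
    high = int.from_bytes(raw[0:2], "little")
    low = int.from_bytes(raw[2:4], "little")
    return (high << 16) | low


# 143466 is 0x0002306A: high cell 0x0002, low cell 0x306A.
sample = bytes([0x02, 0x00, 0x6A, 0x30])
print(decode_tali_double(sample))
```

Decoding the same four bytes as a plain little-endian 32-bit integer would give a very different number, which is the trap the warning above is about.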
- barrym95838
- Posts: 2056
- Joined: 30 Jun 2013
- Location: Sacramento, CA, USA
Re: Fastest 65C02 Forth?
Yeah, I think that several small little-endian systems store Forth doubles in middle-endian (NUXI) format.
Got a kilobyte lying fallow in your 65xx's memory map? Sprinkle some VTL02C on it and see how it grows on you!
Mike B. (about me) (learning how to github)
Re: Fastest 65C02 Forth?
SamCoVT, this sounds really appealing. I'm going to work on getting it running. Is there a way to produce standalone binaries or would I need to include the whole 21k package with anything I create?