6502.org

Posted: **Thu Aug 30, 2018 4:02 pm**

I've been using DurexForth on the C64 (in VICE) to learn Forth. To try and get an idea of Forth's performance comparted to BASIC I did a simple plot of the 'a' character to the whole screen (1,000 locations). Two general methods were tried one was a single loop adding the loop index to screen memory location and plotting. The second was two loops X and Y calculating the memory address with 'X+Y*40'.

Then in Forth only I implemented a lookup table for the Y offsets. This was actually a bit slower when I still used a Y*2 to index through the array (an array of 16bit values). When I did a left shift instead of multiplications it was much faster!

The closest comparison denoted in the comments below are the 'linear' test and the 'X,Y X+Y*40' tests. The 'start' and 'stop' functions in the code below are a Forth module that uses the C64 Jiffy timer to let you profile your code optimizing to execution time. The Forth code below is only the last test ( 0.766 X,Y luY w/Screen added to yoff ) but I'm guessing you know what the other code would have looked like.

Code: Select all

400 constant Screen
00  constant Minscr
3E8 constant Maxscr
91  constant Up
11  constant Down
9D  constant Left
1D  constant Right

variable scrn 400 scrn !
variable xpos 0 xpos !
variable ypos 0 ypos !

( creates array of screen Y offset values )
( i.e. 0400 0428 0450 )
create yoff 32 allot
: cfgyoff
   19 0 do
    I 28 * Screen + ( value )
    I 2 * yoff + !  ( index address )
   loop
;

( retrieves Y value from yoff array )
( adds in xpos and stores 'a' to screen )
: plotline3
   28 0 do
    I xpos !
    1 ypos @ 1 lshift yoff + @ 
    xpos @ + c!
   loop
;

( Linear is loop 0-1000 )
( X,Y is loop Y, loop X )
( 9s BASIC linear )
( 15s Basic X,Y X+Y*40 )
( 0.283s Forth linear )
( 2.133s X,Y X+40*40
( 2.316s X,Y lookup Y w/Y*2 )
( 0.833s X,Y lookup Y w/Y<<1 )
( 0.766 X,Y luY w/Screen added to yoff )
: plotxy2
   cfgyoff
   page start
   19 0 do
    I ypos !
    plotline3
   loop
   stop ." s" cr
;

Code: Select all

10 TIME$ = "000000":rem reset clock
20 rem 'linear' test
30 rem for i = 0 to 1000
40 rem poke 1024+i,1
50 rem next i
60 rem
70 rem X,Y test
80 for y = 0 to 25
90 for x = 0 to 40
100 poke 1024+y*40+x, 1
110 next x
120 next y
130 print TIME$

Posted: **Fri Aug 31, 2018 4:57 pm**

It's a little bit extra unfair to BASIC considering it's always doing floating point math.

Posted: **Fri Aug 31, 2018 7:18 pm**

There is a little benchmark where numerous Basic variants (and program implementations) and other languages have contributed there numbers. There isn't a single Forth contribution so far...

Regards,
Arne

Posted: **Fri Sep 07, 2018 12:57 am**

GaBuZoMeu wrote:

There is a little benchmark where numerous Basic variants (and program implementations) and other languages have contributed there numbers. There isn't a single Forth contribution so far...

Regards,
Arne

It was very easy to port you BASIC benchmark to the C64. I added in a few lines so I could use the Jiffy timer to automatically log execution time. I think spent a fair amount of time writing a Forth version which I got working this afternoon (after finding at least 1,000 ways it would not work). I'll run both versions through all iterations (A-F) in VICE and then set up a C64 this weekend to confirm execution times on real hardware.

I'll post me code and results in the benchmark thread...

Posted: **Fri Sep 07, 2018 3:09 am**

Even then, there's probably a lot that could be improved on the C64 Forth. The 6502 Forth I started with (I imagine they had common roots, maybe from Ragsdale), before I heavily modified it, had for example,

Code: Select all

: DEPTH        SP@ S0 @ SWAP - 2 / ;

Besides the fact that it would have taken less memory as a primitive something like this,

Code: Select all

CODE  DEPTH
        STX  N        ; Put stack pointer in N.
        LDA  #S0adr   ; Get the address of the 1st byte under the stack.
        SEC
        SBC  N        ; Subtract the data stack pointer value (X above).
        LSR  A        ; Divide by 2, to get from number of bytes on stack
        JMP  CPUSH    ; to number of cells.
      END-CODE
 ;-------------------

they even used 2 / instead of 2/ , making it at least dozens of times as slow!

I imagine there's a lot that could be dramatically improved in the C64 Forth if someone wanted to get serious about it. I've said that the Forth in the HP-71 module I got was not a very good one either. Without even resorting to assembly language, I was able to dramatically speed up many of the words. In the extreme case, I got the word 13x as fast, just by re-writing it in Forth.

When I wrote my '816 (actually '802) Forth, the ability to handle 16 bits at a time without taking cells apart "to get them through the door" so so speak, meant that there were lots more things that were suddenly worth it to write as primitives which would have taken too much memory as primitives on the 6502.

Posted: **Fri Sep 07, 2018 9:12 pm**

64Forth was definitely slow. I believe it actually copied the address from the CFA of a word to a location in zero page and performed an indirect jump through the zero page address! That's one way to get around the indirect jump bug, but what a time waster. I much prefer having CREATE determine if the CFA will land on a page boundary and allot 1 byte before the header is created. Blazin' Forth, another indirect threaded Forth and a Forth-83 standard Forth, was considerably faster. Alas, there were things Blazin' Forth could have done better, which motivated me to write my own.

Posted: **Fri Sep 07, 2018 9:43 pm**

Even though the 65c02 (CMOS) fixed that bug, I still kept the alignment for my '02 Forth. IIRC, I think it was to make it easier to decompile (using the ANS Forth word SEE), something I seldom need to do but once in a while it's useful. Without taking the time to review it right now, it seems like it had to do with allowing names to have characters with the high bit set (to get the specials we use in engineering, like Greek letters, degree symbol, etc. in the DOS/ANSI [Edit: that should say IBM437] character set).

Posted: **Fri Sep 07, 2018 10:17 pm**

My Forth has no aligning at all. Create just allots one byte if the CFA would otherwise cross a page boundary and my version of SEE works just fine, even with code words and the odd mix of code and high level. This:

Code: Select all

HEX
: #  ( D1 -- D2 )
   BASE @ UD/MOD ROT
   >ASSEM
   0 ,X LDA,  0A # CMP,  CS IF,
      6 # ADC,
   THEN,
   30 # ADC,  0 ,X STA,
   >FORTH
   HOLD ;

decompiles to this:

Code: Select all

SEE #
#
 1CDC BASE
 1CDE @
 1CE0 UD/MOD
 1CE2 ROT
 1CE4 (>ASSEM)
 1CE6    0 ,X LDA
 1CE8    A  # CMP
 1CEA 1CEE    BCC
 1CEC    6  # ADC
 1CEE   30  # ADC
 1CF0    0 ,X STA
 1CF2  9C4    JSR ' (>FORTH) >BODY
 1CF5 HOLD
 1CF7 EXIT
 OK

So perhaps, as you said, the alignment was to allow characters with the high bit set?

Posted: **Fri Apr 05, 2019 10:01 pm**

As my previous post hinted, Forth has one more advantage over BASIC, the use of low level code. Once the logic of a program is tested, the slower parts ( like inner loops ) can be rewritten as CODE words. A code definition gets compiled to the same dictionary as high level words, at least on the C64 Forths I've seen ( including my own Fleet Forth .) The entire dictionary can be saved to disk as a new Forth with a single command. In Fleet Forth it is FSAVE. In Blazin' Forth it is SAVE-FORTH. In BASIC, someone can still use low level code with the SYS and USR commands, but then has to decide where to put it and usually manage the loading and saving of the low level code separate from the BASIC code.

Posted: **Fri Apr 12, 2019 10:07 pm**

GARTHWILSON wrote:

Even then, there's probably a lot that could be improved on the C64 Forth. The 6502 Forth I started with (I imagine they had common roots, maybe from Ragsdale), before I heavily modified it, had for example,

Code: Select all

: DEPTH        SP@ S0 @ SWAP - 2 / ;

Besides the fact that it would have taken less memory as a primitive something like this,

Code: Select all

CODE  DEPTH
        STX  N        ; Put stack pointer in N.
        LDA  #S0adr   ; Get the address of the 1st byte under the stack.
        SEC
        SBC  N        ; Subtract the data stack pointer value (X above).
        LSR  A        ; Divide by 2, to get from number of bytes on stack
        JMP  CPUSH    ; to number of cells.
      END-CODE
 ;-------------------

they even used 2 / instead of 2/ , making it at least dozens of times as slow!

I imagine there's a lot that could be dramatically improved in the C64 Forth if someone wanted to get serious about it.

There's the rub. Different Forths for the 65XX with the same threading could have vastly different speeds depending on the ratio of primitives ( and which words are primitives ) in each kernel. Which words should be primitive? How do you strike a balance between compact size and speed of a Forth system?
I have some rules of thumb about what should be defined as a primitive. In my Forth I've defined the data stack operators and the math operators as primitives. Anything that is the same size or smaller as a primitive gets defined as a primitive. S>D is the same size so it is defined as a primitive. If a pair of words have a combined size that is smaller as primitives ( possibly because they complement each other ) , they are defined as primitives. [ and ] are an example of this. High level versions in my Forth are:

Code: Select all

: ]  ( -- )  STATE ON ;
: [  ( -- )  STATE OFF ; IMMEDIATE

The bodies are six bytes each for a total of twelve. The primitive versions are:

Code: Select all

HEX
CODE ]
   DEY,
   STATE    STY,
   STATE 1+ STY,
   NEXT JMP,
   END-CODE
CODE [ IMMEDIATE
   -2 ALLOT
   ' ] @ 1+ ,
   END-CODE

The body of ] is ten bytes and the body of [ is zero bytes for a net savings of two bytes.
Sometimes a high level word is simple and small but slow. Rewriting one of the words it uses as a primitive can result in the word being fast enough yet smaller than a version rewritten to be faster without resorting to redefining a component word as a primitive. >BT ( to buffer table ) in my Forth is a word to access the Nth entry in the block buffer table. The primitive is larger, yet helps the high level parts of BLOCK and BUFFER to be fast enough without convoluted ( at least not too badly convoluted ) high level code that would make them bigger.

Posted: **Fri Apr 12, 2019 10:19 pm**

...and the '816 makes it so much more efficient, which is why I have many hundreds of primitives in my '816 ITC Forth. [ and ] are:

Code: Select all

        HEADER "[", IMMEDIATE                   ; ( -- )
Lbrac:  PRIMITIVE
        STZ     STATEdata       ; STATE OFF
        GO_NEXT
 ;-------------------
        HEADER "]", NOT_IMMEDIATE               ; ( -- )
Rbrac:  PRIMITIVE
        LDA     #$FFFF
        STA     STATEdata       ; STATE ON
        GO_NEXT
 ;-------------------

and actually I should probably change ] to:

Code: Select all

        HEADER "]", NOT_IMMEDIATE               ; ( -- )
Rbrac:  PRIMITIVE
        DEC     STATEdata       ; STATE ON
        GO_NEXT
 ;-------------------

so they're just one assembly-language instruction each.

Posted: **Fri Apr 12, 2019 10:58 pm**

GARTHWILSON wrote:

...and the '816 makes it so much more efficient, which is why I have many hundreds of primitives in my '816 ITC Forth.

One of these days I have got to get a 65816.

Posted: **Fri Apr 12, 2019 11:04 pm**

On a 6502, at some point I think that a word gets big enough that the size of the primitive is actually large than the size of comparable high level word. One of Forth's benefits is code density.

Then you have the eForth philosophy of having as few primitives as possible simply to facilitate porting. I think it has, like, 30ish primitive words.

Then there's the "premature optimization" issue. Why make a word primitive if it's very rarely used. If the code is "fast enough", no more work is necessary.

It would be interesting to add a 4 byte space to each word, and modify DOCOL to increment it to act as a crude profiler.

Posted: **Fri Apr 12, 2019 11:41 pm**

whartung wrote:

On a 6502, at some point I think that a word gets big enough that the size of the primitive is actually large than the size of comparable high level word. One of Forth's benefits is code density.

Definitely, but some words such as multiplication and division operators in high level Forth on the 6502 ( or 6510 ) would be intolerably slow.

Quote:

Then you have the eForth philosophy of having as few primitives as possible simply to facilitate porting. I think it has, like, 30ish primitive words.

Fleet Forth has about five times that many primitives. Any Forth for the C64 is already targeting a specific platform, so portability may not be that big of an issue.

Quote:

Then there's the "premature optimization" issue. Why make a word primitive if it's very rarely used. If the code is "fast enough", no more work is necessary.

Because it's fun. If the primitive is smaller, then memory is saved. If it is the same size, then there is no cost in memory but a rarely used speedup here maybe another there and elsewhere. As for the work, if the source is for a metacompiler it can be tested interactively even without using the metacompiler. Once there are enough improvements, a new kernel can be built.

Quote:

It would be interesting to add a 4 byte space to each word, and modify DOCOL to increment it to act as a crude profiler.

An interesting idea. How about a word that patches NEXT the same way that Blazin' Forth does for its trace functionality. The profiler word could build a table with the address of each word profiled and how often it is executed and possibly some other data. The profiler would switch off profiling while it runs. The data could be kept in blocks with block numbers high enough to use RAM or a large enough set of block buffers could be configured to keep the needed blocks in memory. Blocks aren't just for source code, they're a primitive form of virtual memory.

Posted: **Sat Apr 13, 2019 6:20 pm**

whartung wrote:

It's a little bit extra unfair to BASIC considering it's always doing floating point math.

That's perhaps the 2nd contributor to its sluggishness. The 1st is the fact that CBM BASIC is always parsing (tokenized) source code, even when in RUN-mode. Forth doesn't tokenize its source code, it literally compiles it (perhaps into an intermediate representation, like direct- or indirect-threaded code), which means executing it involves a different runtime engine. If it compiles to subroutine-threaded code, it doesn't even need that.

So even if BASIC were to use integers for its variables, you still have the raw interpreter overhead to contend with: constantly reparsing numbers, performing linear scans to find the next program line to execute, and so on.

MS BASIC (from which CBM BASIC is derived) is not an efficient interpreter at all; its sole saving grace is how small it can be implemented in.

6502.org

Speed of Forth vs. BASIC on C64

Speed of Forth vs. BASIC on C64

Re: Speed of Forth vs. BASIC on C64

Re: Speed of Forth vs. BASIC on C64

Re: Speed of Forth vs. BASIC on C64

Re: Speed of Forth vs. BASIC on C64

Re: Speed of Forth vs. BASIC on C64

Re: Speed of Forth vs. BASIC on C64

Re: Speed of Forth vs. BASIC on C64

Re: Speed of Forth vs. BASIC on C64

Re: Speed of Forth vs. BASIC on C64

Re: Speed of Forth vs. BASIC on C64

Re: Speed of Forth vs. BASIC on C64

Re: Speed of Forth vs. BASIC on C64

Re: Speed of Forth vs. BASIC on C64

Re: Speed of Forth vs. BASIC on C64