Robot Game: C vs Forth vs Assembly

Druzyek · Post by **Druzyek** » Tue Mar 17, 2020 10:56 pm

I've been working on porting a game from Python to C, Forth, and assembly to compare each language. The game is simple and only exists to have something to port, so no guarantees about how fun it is. The CC65 and Tali Forth 2 versions are finished. You can try playing them in your browser without downloading anything. The assembly versions will follow soon. Let me know if you find any bugs or differences between the versions please.

Robot Game - CC65 version
Robot Game - Tali Forth 2 version

Keys:
W - up
S - down
A - left
D - right
space - attack/mine/drill
M - open menu

BigEd · Post by **BigEd** » Wed Mar 18, 2020 8:55 am

I think I stepped into a lava field! This is amazing.

BigDumbDinosaur · Post by **BigDumbDinosaur** » Wed Mar 18, 2020 4:26 pm

BigEd wrote:

I think I stepped into a lava field! This is amazing.

You must be really burned up over that!

SamCoVT · Post by **SamCoVT** » Mon Apr 13, 2020 8:07 pm

I especially like how, when running the Forth version, you don't have to type "run" and can just start typing Forth things at it. I am a bit sad that the word "words" appears to be missing (did that get yanked out along with with the wordlists?), but I was able to define a new word and play with the Forth a bit.

I'm super-impressed with the selection/menu system in your game. It looks really nice and is easy to use (press M to get back to the map when done - that's the only part I couldn't figure out at first.) I know how much work can go into making things work that nicely/smoothly when you're starting from scratch.

Druzyek · Post by **Druzyek** » Tue Apr 14, 2020 1:25 am

Thanks! I thought Forth people would appreciate that. You can also press Q in the Forth version once you're in the game to get back to the Forth prompt. This is useful to make sure a word didn't accidentally leave something on the stack. Yes, "words" was scrapped along with every single other thing I could get rid of. The way the memory banking works is each bank has an identical copy of Forth and words get defined in the left over space, so freeing up one byte of Forth frees up one byte each in the five banks. One of the banks had less than 100 bytes left, so I was cutting it close. The version I uploaded shows a lot more space left since I turned the underflow check off. On the other hand, words wouldn't be too useful since it would only show the words in the current bank and those words would be encoded, ie the word "DrawMenuInventoryChar" gets redefined every time I send the source to Tali to something like "x1a." Automatically shortening all the names to 2 and 3 word abbreviations also saves hundreds of bytes.

Hmm, maybe I should let escape work for the map. Was it obvious from the titles that K and R get you to the other two menus?

I intend to put everything on github when I finish the other version but here is the source code in the mean time if you have any comments: Robot Game Forth source
One of the assembly versions is also done: Robot Game traditional assembly source

Druzyek · Post by **Druzyek** » Tue Jun 30, 2020 4:39 am

All four versions of the game are done. Here's a summary I did comparing the performance of each language with links to the games so you can play in your browser: Robot Game summary. C was 20-80% as fast as traditional assembly, Forth was 5-25% as fast, and the optimizer project I posted about here was up to 10% faster. Here's a graph of 28 different tasks the game does internally. Traditional/unoptimized assembly is normalized at 100%. Anything above that is faster and anything below is slower, ie a rating of 20% means five times slower than normal assembly.

litwr · Post by **litwr** » Tue Jun 30, 2020 8:16 pm

It is interesting why is C the fastest with rand8? IMHO you can check C compiler generated assembly and improve it a bit.

GARTHWILSON · Post by **GARTHWILSON** » Tue Jun 30, 2020 8:30 pm

litwr wrote:

It is interesting why is C the fastest with rand8? IMHO you can check C compiler generated assembly and improve it a bit.

In all cases, it will depend on how good the compiler and/or kernel is. The common Forths for the '02 are nowhere near optimized.

Druzyek · Post by **Druzyek** » Tue Jun 30, 2020 9:25 pm

litwr wrote:

It is interesting why is C the fastest with rand8? IMHO you can check C compiler generated assembly and improve it a bit.

I looked at that in the last paragraph of the Results section on the page. The way I implemented modulo in assembly was lazy, which is why C was better for rand8.

Quote:

In all cases, it will depend on how good the compiler and/or kernel is. The common Forths for the '02 are nowhere near optimized.

How do you mean optimized? Tali Forth 2 does not optimize like a C compiler where it modifies the generated assembly to be more efficient, but I think it is pretty well optimized as a Forth from what I can tell.

BigEd · Post by **BigEd** » Tue Jun 30, 2020 9:37 pm

great investigation!

GARTHWILSON · Post by **GARTHWILSON** » Tue Jun 30, 2020 10:43 pm

Druzyek wrote:

GARTHWILSON wrote:

In all cases, it will depend on how good the compiler and/or kernel is. The common Forths for the '02 are nowhere near optimized.

How do you mean optimized? Tali Forth 2 does not optimize like a C compiler where it modifies the generated assembly to be more efficient, but I think it is pretty well optimized as a Forth from what I can tell.

I forgot you said it was Tali Forth (which I have not evaluated). The common Forths for the '02 (usually based on figForth) have a lot of things as secondaries that could be turned into primitives for much better performance.

SamCoVT · Post by **SamCoVT** » Thu Jul 02, 2020 7:47 pm

Druzyek,

Thanks for taking the time to make that graph. I find it quite interesting.

For reference, all of TaliForth2's primitives are written in assembly. There are only a few that are done as JSRs to other words (Tali2 is an STC Forth).

I believe native compiling (where the opcodes for a simple word are compiled directly into another word) was disabled to conserve as much memory as possible (correct me if I'm wrong on that one, Druzyek), so all of the compiled code would end up as JSRs to the primitives or other defined words. I can see having JSR/RTS overhead for every word, along with stack juggling, slowing things like drawing stuff on the screen quite a bit vs. a C or ASM routine. Anything with loops is likely to be noticeably slower, and indeed almost all of the slowest routines have "Draw" in their name.

The native compiling would really have only helped with the JSR/RTS overhead, and it can't be used on words that have flow control (IF/ELSE/THEN/DO/LOOP/etc) as Tali2 doesn't know how to update the JMP addresses (or even which bytes might be addresses) as it just copies the bytes for one word into another.

I suspect the loop constructs Tali2 uses to be a lot of the slowness. The loop index is held on the return stack while the loop is running. Tali2 has to pop it into A/Y, add to it, and then push it back. LOOP is implemented as assembly that puts a 1 on the data stack and then falls right into +LOOP.

Druzyek · Post by **Druzyek** » Mon Jul 06, 2020 1:38 am

SamCoVT,

Did you see in the summary I posted a link to above that I thanked you for your help? Thanks again! I also mentioned that Tali Forth 2 is really great and the poor performance seems to be due to the nature of Forth.

Quote:

I believe native compiling (where the opcodes for a simple word are compiled directly into another word) was disabled to conserve as much memory as possible (correct me if I'm wrong on that one, Druzyek), so all of the compiled code would end up as JSRs to the primitives or other defined words.

I originally left "nc-limit" at the default of 20 then bumped it up to 26 once everything was working right. The results are from that version, so it should be doing a lot of native compiling.

Quote:

I suspect the loop constructs Tali2 uses to be a lot of the slowness. The loop index is held on the return stack while the loop is running. Tali2 has to pop it into A/Y, add to it, and then push it back. LOOP is implemented as assembly that puts a 1 on the data stack and then falls right into +LOOP.

That is a good point. I think a lot of the slowdown also comes from the stack-based nature of the language:
-You're constantly thrashing X, which doesn't happen in C and assembly, with stuff like "drop 5" where you have useless runs like DEX / DEX / INX / INX.
-Constraining everything to the top levels of the stack wastes a lot of cycles in stuff like "swap 1+ swap" which would be only one instruction in assembly since you can step over the topmost stack levels and work on any item.
-Automatically promoting 8-bit values to 16-bit to keep stack items the same size means you're always doing calculations the slowest way.

Another thing that slows everything down not specifically related to stacks is the lack of optimizations in most Forths. Something like "lshift" always generates the same machine code even in the case of "5 lshift" where the number of shifts is known at compile time and better assembly could be generated. C compilers can spot those situations and generate better assembly.

I also got Tali Forth 2 running in my simulator.

GARTHWILSON · Post by **GARTHWILSON** » Mon Jul 06, 2020 2:43 am

SamCoVT wrote:

For reference, all of TaliForth2's primitives are written in assembly.

Yes, that's what "primitive" means, at least in my Forth "upbringing." It's a code definition. Colon definitions OTOH are secondaries, not primitives. -ROT for example is often written as

Code: Select all

: -ROT ROT ROT ;

which in ITC is about four times as slow as the primitive version of the same word. However, I was just looking at my '02 Forth, and the code to do it as a primitive is so much longer than it takes as a secondary that you'd have to evaluate the memory supply before making the change, unlike the situation on the '816. Most of the primitives I write for the '02 now are things where I don't need 16-bit operation. I was looking again at the "Tali Forth for the 65c02" topic which is 13 pages, and although it's a lot to review, it looks like it was well done.

Quote:

I suspect the loop constructs Tali2 uses to be a lot of the slowness. The loop index is held on the return stack while the loop is running. Tali2 has to pop it into A/Y, add to it, and then push it back. LOOP is implemented as assembly that puts a 1 on the data stack and then falls right into +LOOP.

Hmmm... That does sound like a very slow way to do it. My loop (compiled by LOOP) is entirely separate from +loop (compiled by +LOOP), and does not affect or even look at the data stack.

Quote:

with stuff like "drop 5" where you have useless runs like DEX / DEX / INX / INX.

I don't know that I've ever heard of dropping five cells at once. If a word deals with more than about three, it often means a different approach should probably be taken. There are exceptions of course. Re-writing things into primitives in many cases avoids a lot of DUPing and DROPping.

Quote:

-Constraining everything to the top levels of the stack wastes a lot of cycles in stuff like "swap 1+ swap"

If you use it enough to justify it, you can make that a primitive too, INCR_NOS, which on the '816 is only INC 2,X, a single instruction. On the '02 you'd need to add a BNE and an INC 3,X.

You can always add primitives as the need might justify, which amount to a degree of optimization, combining commonly used pairs or sets of words into a single primitive. A list of many of my own '816 ones is at viewtopic.php?p=72734#p72734 .

And like they say, "You can get 90% of the performance of assembly language with only 10% done in assembly, provided it's the right 10%, the critical 10%," or something like that. The actual numbers will depend on the application; but the principle holds. And it's so easy to mix some 65xx assembly into Forth. I know many other processors are not so easy to write assembly language for (to write primitives or runtimes or ISRs, etc.).

Druzyek · Post by **Druzyek** » Mon Jul 06, 2020 3:37 am

Quote:

with stuff like "drop 5" where you have useless runs like DEX / DEX / INX / INX.

I don't know that I've ever heard of dropping five cells at once. If a word deals with more than about three, it often means a different approach should probably be taken. There are exceptions of course. Re-writing things into primitives in many cases avoids a lot of DUPing and DROPping.

I suppose without a single Forth standard I can't assume DROP means the same thing in all contexts. I meant here dropping one stack item and replacing it with 5. That could be just an LDA/STA pair in assembly but is easily 10x the cycles in Forth.

Quote:

-Constraining everything to the top levels of the stack wastes a lot of cycles in stuff like "swap 1+ swap"

If you use it enough to justify it, you can make that a primitive too, INCR_NOS, which on the '816 is only INC 2,X, a single instruction. On the '02 you'd need to add a BNE and an INC 3,X.

You only need the BNE if you're stuck always using 16-bit math, whereas you get away with a single INC in assembly if it's an 8-bit value. I think something like INCR_NOS really illustrates the inefficiency here. You shouldn't have to write custom assembly to add one to a local variable to avoid the 10x slowdown.

Quote:

You can always add primitives as the need might justify, which amount to a degree of optimization, combining commonly used pairs or sets of words into a single primitive. A list of many of my own '816 ones is at viewtopic.php?p=72734#p72734 .

I think this is a bit of a dead end as well. If you just take my lshift example, there are 16 different variations of doing a 16-bit shift if you know at compile time how many places you're shifting. You could replace something like "5 lshift" with a custom "lshift5" but then you are stuck creating two versions of every word that can take a constant.

Quote:

And like they say, "You can get 90% of the performance of assembly language with only 10% done in assembly, provided it's the right 10%, the critical 10%," or something like that. The actual numbers will depend on the application; but the principle holds. And it's so easy to mix some 65xx assembly into Forth.

That is a nice sentiment but even mixing in a lot of assembly, I think it's absolutely not true on the 6502. I hear stuff like this all the time online. Someone was claiming here for example that the overhead is only about 25% more than assembly. It seems like people became convinced of how fast Forth is and have just been quoting this for decades without proof. When you actually look at the data and compare the cycles executed, it's more like Forth is 10% the performance, not 90%. On the other hand, if you have a non-trivial example of getting 90% the performance, it would be really neat to see how that works.

Robot Game: C vs Forth vs Assembly

Robot Game: C vs Forth vs Assembly

Re: Robot Game: C vs Forth vs Assembly

Re: Robot Game: C vs Forth vs Assembly

Re: Robot Game: C vs Forth vs Assembly

Re: Robot Game: C vs Forth vs Assembly

Re: Robot Game: C vs Forth vs Assembly

Re: Robot Game: C vs Forth vs Assembly

Re: Robot Game: C vs Forth vs Assembly

Re: Robot Game: C vs Forth vs Assembly

Re: Robot Game: C vs Forth vs Assembly

Re: Robot Game: C vs Forth vs Assembly

Re: Robot Game: C vs Forth vs Assembly

Re: Robot Game: C vs Forth vs Assembly

Re: Robot Game: C vs Forth vs Assembly

Re: Robot Game: C vs Forth vs Assembly