6502.org Forum  Projects  Code  Documents  Tools  Forum
It is currently Thu Mar 28, 2024 11:11 pm

All times are UTC




Post new topic Reply to topic  [ 19 posts ]  Go to page 1, 2  Next
Author Message
 Post subject: Faster loops
PostPosted: Sat Apr 24, 2021 4:17 pm 
Offline

Joined: Sun May 13, 2018 5:49 pm
Posts: 244
Druzyek's speed comparisons of assembly vs C vs Forth on the 6502 was very interesting to take a look at. If you haven't seen it already, you can find it at
http://calc6502.com/RobotGame/summary.html#Results
Scroll down just a little bit to see the table of results.

One of the things I noticed is that many of the slower (<10% the speed of assembly) items had nested do loops in them and I was just curious what the penalty/overhead of using 16-bit loop variables for small counts (eg. stuff you'd count in X or Y in assembly). Fortunately, because Forth is awesome, I can scratch that curiosity itch and discover that the answer is "not as much as I was expecting, but more than double", which totally makes sense to me for using a 16-bit index vs an 8-bit one.

Leepivonka had provided some "replacement" words for do loops, so that made me want to try that myself. Rather than replace Tali2's do-loop words, these words are meant to compliment them. They use the Y register similar to how one might in assembly to count down to zero, so only the upper limit is given, and it's limited to 8-bits (but you can get 256 loops by specifying 0 as the starting value). Because these words use the Y register, I just made word names with a "Y" prefix. After writing this, I realized that it doesn't really matter which register is used, and A could have been used just as well (and sometimes is used because Tali2 has a PUSH-A macro to get A on the top of the Forth data stack.

For those that are interested, here is my code and the results I got (using the cycle counting tests available in the simulated Tali2 test suite).

Code:
\ PROGRAMMER  : Sam Colwell
\ FILE        : yloop.fs
\ DATE        : 2021-04
\ DESCRIPTION : For small loops of 256 or less, these words will use
\ the Y register in cowntdown mode only (eg load with 8-bit starting
\ count and it will always count down to zero) which is generally useful
\ for code that needs to run a number of times, but doesn't need to
\ control the direction of the counting (in exchange for speed).

\ Add the assembler.
assembler-wordlist >order

: YDO  ( C: -- addr) ( R: n --)  ( Start loop that will run n times )
  0 POSTPONE LDY.X          \ Load Y with the starting value.
  POSTPONE INX POSTPONE INX \ Remove value from stack.
  HERE          \ Save location to loop back to (leave on stack)
  POSTPONE PHY  \ Save current index to return stack.
; IMMEDIATE COMPILE-ONLY

: YLOOP ( C: addr -- ) ( R: -- ) ( Loop back up to start if y nonzero)
  POSTPONE PLY    \ Get current index from return stack.             
  POSTPONE DEY    \ Count this iteration.                             
  3 POSTPONE BEQ  \ If we reached zero, continue on (branch over jump)
  POSTPONE JMP    \ otherwise jump to the top of the loop.
  \ The jump should pull it's address from the stack
  \ It was placed there by the ydo word.
; IMMEDIATE COMPILE-ONLY

: YI
  \ The current index is on the return stack.
  POSTPONE PLA    \ Get the index into A and put it on the Forth stack.
  POSTPONE PHA
  POSTPONE PUSH-A \ This is a macro in Tali to put the value in A on TOS.
; IMMEDIATE COMPILE-ONLY

: YJ
  \ The index from the outer loop is the SECOND byte on the return stack.
  POSTPONE PLY \ Pull the one we don't want into Y
  POSTPONE PLA \ Pull the one we do want into A
  POSTPONE PHA \ Put both of them back (in the right order)
  POSTPONE PHY
  POSTPONE PUSH-A \ Put the J index on the Forth stack.
; IMMEDIATE COMPILE-ONLY

Tali2's assembler words are immediate. By using POSTPONE, that essentially makes these assembly macros. I could have used a runtime word and COMPILEd it, but these were so short it made more sense to me this way.

Next is a basic functionality test:
Code:
\ Try it out.
: testing
  5 ydo
     3 ydo
        cr ." yi=" yi .  ." yj=" yj .
     yloop
  yloop
;


\ Results:
testing
yi=3 yj=5
yi=2 yj=5
yi=1 yj=5
yi=3 yj=4
yi=2 yj=4
yi=1 yj=4
yi=3 yj=3
yi=2 yj=3
yi=1 yj=3
yi=3 yj=2
yi=2 yj=2
yi=1 yj=2
yi=3 yj=1
yi=2 yj=1
yi=1 yj=1  ok

Note that it counts down from the starting value, including the starting value, but does not run with the value 0.

Finally, the cycle counting tests (the cycle-test word takes the XT and prints the number of CPU cycles):
Code:
\ Cycle-Test RESULTS:
decimal
: testingy   255 ydo  yloop ;
: testingdo  255 0 do  loop ;
: testingyi  255 ydo  yi drop  yloop ;
: testingdoi 255 0 do  i drop  loop ;
: testingyy    255 ydo   255 ydo  yloop  yloop ;
: testingdodo  255 0 do  255 0 do  loop  loop ;
: testingyyij    255 ydo   255 ydo yi yj 2drop  yloop  yloop ;
: testingdodoij  255 0 do  255 0 do  i j 2drop  loop  loop ;

' testingy           cycle_test   CYCLES: 3660  ok
' testingdo          cycle_test   CYCLES: 16775  ok

' testingyi          cycle_test   CYCLES: 13605  ok
' testingdoi         cycle_test   CYCLES: 32075  ok

' testingyy          cycle_test   CYCLES: 933900  ok
' testingdodo        cycle_test   CYCLES: 4291340  ok

' testingyyij        cycle_test   CYCLES: 5420625  ok
' testingdodoij      cycle_test   CYCLES: 11053940  ok

These are trivial tests, but the results are enlightening. For each group above, the first one is the simpler 8-bit "Y" version of the loop running 255 times, and then using the regular do loop words running 255 times. I then have tests using i (which I just drop) to show the effects of getting the index, and then I do nested loops (without and then with indices).

I very much enjoy the way these new looping words are just as "valid" as the words that Tali2 comes with, and I could use them in places where I needed to squeeze some extra cycles out of a routine.


Top
 Profile  
Reply with quote  
 Post subject: Re: Faster loops
PostPosted: Mon Apr 26, 2021 2:36 am 
Offline
User avatar

Joined: Mon May 12, 2014 6:18 pm
Posts: 365
Cool idea! Out of curiosity, could you compare the cycles of a DO that still uses the R stack but only the lower 8 bits?


Top
 Profile  
Reply with quote  
 Post subject: Re: Faster loops
PostPosted: Mon Apr 26, 2021 8:53 am 
Offline
User avatar

Joined: Fri Aug 30, 2002 1:09 am
Posts: 8412
Location: Southern California
Interesting idea. I haven't really thought about the 16-bit DO...LOOP's inefficiency on this 8-bit processor, although I did go the other way and write a set of 32-bit 2DO words for the '02, including 2?DO, 2+LOOP, 2BOUNDS, 2LEAVE, 2I, and 2UNLOOP, and the overhead was astronomical! I do use it though. Quite a few words in typical ITC Forth for the '02 use the setup routine, and I did shorter versions of them for 8 bits to improve efficiency in situations like for example CMOVE (calling the new version QCMOVE, for "quick CMOVE") since I usually use it for strings which are well under 256 bytes long, and the setup routine used for these is qsetup. I look forward to further discussion on the 8-bit loops thing though.

The 6502 (and especially the '816) is so easy to write short assembly-language pieces for though that I have never really seen Forth's 6502 slowness as a problem. Like they say, you can get 90% of assembly language's performance with only 10% of the code in assembly language, provided it's the critical 10%... or was it 95% and 5%? Anyway, you get the point. I have no problem writing the speed-critical parts in assembly, while enjoying the interactiveness and other virtues of Forth for the rest. The iconic book "Starting Forth" doesn't have anything about writing primitives until you get to the last chapter, and then there's still nothing about ;CODE (similar to DOES> but for the runtime behavior being defined in assembly language) and writing other things that are not Forth words per se, in assembly language. From discussion on another group, I take it Chuck Moore was not very interested in that. Back in the day, he was more interested in using stack processors for situations that needed more performance.

In the article, Joey does admit that he was new to Forth, which explains the longer development time and some of the resulting performance slowness for him. It hardly seems like a fair comparison. Regardless, it's not really a reflection on Forth except for its execution on 8-bitters. For modern 32- and 64-bit processors, optimizing Forth compilers may yield results that have fewer machine-language instructions than the source code had Forth instructions. Super efficient. Also, Forth, Inc. has a lot of stories from their own experience where for example it was estimated that a project would take a team of C programmers two years to complete a job, and one man came in and did it by himself in six months in Forth, reducing the hardware requirements in the process.

_________________
http://WilsonMinesCo.com/ lots of 6502 resources
The "second front page" is http://wilsonminesco.com/links.html .
What's an additional VIA among friends, anyhow?


Top
 Profile  
Reply with quote  
 Post subject: Re: Faster loops
PostPosted: Mon Apr 26, 2021 11:29 am 
Offline

Joined: Wed Jan 08, 2014 3:31 pm
Posts: 556
I love the Forth language for it's compactness and elegance, but in my limited experience it is slower than assembler. For example:

I wrote an ASCII art Mandelbrot in Forth to demonstrate how much faster it is than Basic. I was pleased with the results because the Forth code was about five times faster, although fixed point integer arithmetic was likely a big reason why.

I then rewrote it in assembler and found that it was four times faster than Forth! See: viewtopic.php?f=2&t=6347

If you read the assembler it uses macros that emulate Forth words, and it is identical in structure to the Forth program with two exceptions. First, the macros inline many operations rather than a JSR to a subroutine. Second, I use 8.8 fixed point and byte operations rather than lshift and rshift to rescale after a multiply.

I am not sure which of those caused the bump in performance. The first is intrinsic to the way many Forths work and harder to fix. But the second might be amenable to Garth's idea that rewriting the rescale function in assembler with byte level operations might lift Forth's performance.


Top
 Profile  
Reply with quote  
 Post subject: Re: Faster loops
PostPosted: Mon Apr 26, 2021 10:31 pm 
Offline
User avatar

Joined: Mon May 12, 2014 6:18 pm
Posts: 365
Quote:
In the article, Joey does admit that he was new to Forth, which explains the longer development time and some of the resulting performance slowness for him. It hardly seems like a fair comparison. Regardless, it's not really a reflection on Forth except for its execution on 8-bitters.
Correct, I did say that some of the slowness in the results is due to lack of experience on my part, so I don't want people to take the numbers in the charts as completely representative. However, I still stand by the conclusion that using Forth on the 6502 entails an extreme loss of performance. That can be demonstrated by analyzing the underlying assembly, and I tried to cover that briefly in my article.

GARTHWILSON wrote:
The 6502 (and especially the '816) is so easy to write short assembly-language pieces for though that I have never really seen Forth's 6502 slowness as a problem. Like they say, you can get 90% of assembly language's performance with only 10% of the code in assembly language, provided it's the critical 10%... or was it 95% and 5%?
You have said this many times, but my objection is this only really applies to some projects. I've gone back and watched years and years of the presentations at the SVFIG meetings and noticed that the programmers there tend to apply Forth to one or a small handful of problems and then claim based on that that it's superior to C and competing languages at all times and in all circumstances. One guy has been showing off a small robot he made in the 90s programmed in Forth and sold to educational markets and claims that Forth is the only language that makes sense for robots. I'm not sure to begin with that it was the best choice in the 90s for an educational robot, much less that it's always the best choice for educational robots nowadays, much less for all robots at all times nowadays. People tend to put on blinders when they find something that works for them and demonstrate the "If it works for me, it ought to be good enough for anyone" fallacy.

On the other hand, I can see how I put on my own blinders and might criticize Forth too heavily sometimes since I care about performance, which isn't a concern for a lot of other people's projects. I got interested in the 6502 to make programmable calculators where you're never thinking about making a device that is "fast enough" but one where you need to maximize performance since you're letting users create their own programs on the device. Squeezing out an extra 5-10% of performance is a huge win if that's horsepower you can hand over to the user to do calculations with. Compare this to something like a microwave. I've noticed that mine has a noticeable delay between the time I press a button and it responds with a beep. Adding more horsepower to get the delay down is a waste of time and won't sell more microwaves. A lot of engineering is aimed at producing "good enough" rather than maximizing performance where it wouldn't make a difference anyway. If you're able to hit those engineering goals with Forth, or by rewriting 5-10% in assembly as you suggested, then that sounds like success to me. My objection is to the idea that this method works for all projects all the time. If over half of my ROM is taken up by calculations that I need to be really fast, there just isn't a 5-10% chunk for you to find and rewrite because it's equally important that all of them be fast. Like I said, however, I can see the appeal of using Forth for something where you don't miss the extra performance, and I don't mean to say it's a bad choice just because it doesn't work for what I want to do.

Quote:
For modern 32- and 64-bit processors, optimizing Forth compilers may yield results that have fewer machine-language instructions than the source code had Forth instructions. Super efficient.
This is really interesting. Have you looked at Mecrisp Forth? It uses a very cheap ARM board as a tether to program MSP430 microcontrollers. It generates optimized assembly for the target, so no kernal or thread overhead on the MSP430. There's also a version that uses an ARM board to program other ARM boards. Some of the guys involved with it claim it's only about 3 times slower than C, which is pretty darn impressive. One of the optimizing Forths for Windows leaves the majority of registers vacant. This seems like a waste, though from people I've talked to who know a lot more about x86 than I do, thrashing the stack and ignoring the extra registers may not really have a speed penalty on modern processors. I would love to know more.

Quote:
Also, Forth, Inc. has a lot of stories from their own experience where for example it was estimated that a project would take a team of C programmers two years to complete a job, and one man came in and did it by himself in six months in Forth, reducing the hardware requirements in the process.
Chuck Moore has said this kind of thing in his presentations before and it's almost certainly nonsense. I think it belongs in the same garbage bin as saying that pure Forth on the 6502 is "only 25% slower" than the equivalent assembly, which pops up on other forums from time to time. Please let me know if you have any way of verifying this (EDIT: this meaning the claim about C programmers). I've often thought about how you would organize a comparison by having a C and Forth team solve the same problem.


Top
 Profile  
Reply with quote  
 Post subject: Re: Faster loops
PostPosted: Tue Apr 27, 2021 3:06 am 
Offline
User avatar

Joined: Sun Jun 30, 2013 10:26 pm
Posts: 1918
Location: Sacramento, CA, USA
Quote:
Also, Forth, Inc. has a lot of stories from their own experience where for example it was estimated that a project would take a team of C programmers two years to complete a job, and one man came in and did it by himself in six months in Forth, reducing the hardware requirements in the process.


If true, that man is to be respected, celebrated and admired. However ... I have a feeling that it might be foolish to depend on him for updates and support, or to expect him to consistently repeat that inspired feat of excellence.

_________________
Got a kilobyte lying fallow in your 65xx's memory map? Sprinkle some VTL02C on it and see how it grows on you!

Mike B. (about me) (learning how to github)


Top
 Profile  
Reply with quote  
 Post subject: Re: Faster loops
PostPosted: Tue Apr 27, 2021 8:34 am 
Offline
User avatar

Joined: Fri Aug 30, 2002 1:09 am
Posts: 8412
Location: Southern California
We're getting off on philosophy again, but I think some of that in any topic is fine as long as we can also instantly get back on topic. Having separate topics on philosophy may lack the background to make them meaningful or attractive.

Well, the only contact I've had with Forth, Inc. is through what they've published, online, and on paper before we had the internet. I don't really have any reason to doubt them, since they've accomplished some very impressive things including what I wrote in this post and this one. I quoted a bit of Elizabeth Rather's story here, how she found Forth to be many times as productive as the other languages she had worked in, even though she was initially very biased against it.

When I came up with my Forth development system for embedded development which gives practically instant turnaround speed between writing a piece of code and compiling it into the embedded system and watching it run, which is apparently the same as what Dr Jefyll developed although he used a parallel link to the host computer instead of serial, and which does not require backgrounding or suspending the editor in DOS nor any TSRs or other special software, my old boss who had moved on to another company said they had just spent $8000 on a C development system and they did not have nearly the interactiveness and development speed that I had with my very simple system.

The SwiftForth compiler optimizer example I cited is from Forth, Inc.'s site, where they take this example:
Attachment:
SwiftXoptimizer1.jpg
SwiftXoptimizer1.jpg [ 62.95 KiB | Viewed 6848 times ]

and it compiles this machine code:
Attachment:
SwiftForthOptimizer.jpg
SwiftForthOptimizer.jpg [ 59.75 KiB | Viewed 6848 times ]

five instructions as a subroutine, four if you were to straightline it so it would not need the RET nor be called with another instruction. In my '816 Forth, it looks like it's about 26 machine-language instructions, including NEXT. There are conditional branches in there, so not all 26 get executed, and I need to shorten it now in light of this recent topic.

Am I biased? Probably, because Forth works the way my brain does, and after meeting Forth, I didn't want any more algebraic languages. I did go through the K&R C book years later, and another beginning C book, and I cannot latch onto it. I have an article about my beef with C, here. The tag-along on the title says, "(You don't have to agree.)" If someone else's brain works that way, I have no criticisms for them; but I myself am done trying.

Quote:
Squeezing out an extra 5-10% of performance is a huge win if that's horsepower you can hand over to the user to do calculations with.

10% speedup is nothing unless it crosses a threshold between not being fast enough to keep up with a certain realtime demand and OTOH keeping up ok. Outside of exacting thresholds, it has been my observation that a 30% speedup is pretty insignificant to the user. 100% speedup is definitely significant, but not monumental. If we really needed maximum horsepower, there are ARMs and other processors that we can move up to. Jim Boyd obviously knows his FleetForth very, very well, although it's running on a 1MHz C64. Apparently it has the performance his uses need. For some things, I still use my 35-year-old HP-41cx calculator every day which only runs about 6300 assembly-language instructions per second (not 63,000, nor 630,000, nor 6,300,000), although it does have 56-bit registers so one instruction can do quite a bit in handling decimal operations. The longest program I've ever run on it was about 2,000 bytes, but most are under 200 bytes. It can control and take data from lots of workbench instruments at once; but it's definitely not the tool to use when I want to service tens of thousands (let alone 100,000+) interrupts per second like I can on my 5MHz 65c02 workbench computer.

It comes down to the a couple of questions, "How much speed does your intended application need?" and "Can you get there with your hardware and with your software methods?" "Software methods" will include language, but probably go much further, and in this case, forming a faster DO...LOOP for 8-bit counters.

_________________
http://WilsonMinesCo.com/ lots of 6502 resources
The "second front page" is http://wilsonminesco.com/links.html .
What's an additional VIA among friends, anyhow?


Top
 Profile  
Reply with quote  
 Post subject: Re: Faster loops
PostPosted: Tue Apr 27, 2021 3:24 pm 
Offline
User avatar

Joined: Fri Dec 11, 2009 3:50 pm
Posts: 3327
Location: Ontario, Canada
Druzyek wrote:
I've gone back and watched years and years of the presentations at the SVFIG meetings and noticed that the programmers there tend to apply Forth to one or a small handful of problems and then claim based on that that it's superior to C and competing languages at all times and in all circumstances.
To say it's superior at all times and in all circumstances is rather a sweeping statement. :roll: No way do I endorse talk like that. Perhaps it's a case of the Vocal Minority making the most noise. I hope no-one makes the mistake of supposing all Forth adherents are afflicted with such an unreasoning excess of zeal.

GARTH WILSON wrote:
which is apparently the same as what Dr Jefyll developed although he used a parallel link to the host computer instead of serial
Minor correction. Yes, nowadays I use a parallel link to the host. But back in my Wonder Years ™ there was no host. :shock:

I credit Forth with letting me pull off stuff like this (MCS48 development system)... long before I acquired my first 8088 box.

-- Jeff

_________________
In 1988 my 65C02 got six new registers and 44 new full-speed instructions!
https://laughtonelectronics.com/Arcana/ ... mmary.html


Top
 Profile  
Reply with quote  
 Post subject: Re: Faster loops
PostPosted: Wed Apr 28, 2021 12:28 am 
Offline
User avatar

Joined: Mon May 12, 2014 6:18 pm
Posts: 365
Quote:
We're getting off on philosophy again, but I think some of that in any topic is fine as long as we can also instantly get back on topic.
I agree. The topic started out about making loops faster, so discussing the role of performance is relevant in my opinion.

Quote:
I quoted a bit of Elizabeth Rather's story here, how she found Forth to be many times as productive as the other languages she had worked in, even though she was initially very biased against it.
Hmm, maybe some historical context would help. I remember waiting 30+ seconds to cross-compile relatively small 68000 binaries written in C on my (old for the time) Windows machine in 2002. Now I can compile much larger binaries with gcc in a second or two. Maybe the interactivity of Forth had more of a payoff at the time than it does today. I do wonder if many of the Forth guys who talk about how insufferably awful C is have touched a C compiler in the last decade or two, at least when they criticize speed and interactivity. No comments on Garth's beefs with C page since they seem accurate and are taste-based.

Quote:
10% speedup is nothing unless it crosses a threshold between not being fast enough to keep up with a certain realtime demand and OTOH keeping up ok.
I think this is a good illustration of the fallacy I mentioned above. It's begging the question to assume I even have a threshold and a realtime demand to keep up with. I don't. If I have to wait 9 seconds for a calculation to complete instead of 10, then I'm really happy about that. This applies to all kinds of other things as well like video games where 10% more performance might let me put 10 objects on the screen at once instead of 9. That lets me make better games, and the user does notice and value the difference.

Quote:
Outside of exacting thresholds, it has been my observation that a 30% speedup is pretty insignificant to the user.
Right, there are all kinds of engineering problems where this is definitely the case. The point I'm making here is that there are also all kinds of other problems where that is not the case and that 30% speedup or even 10% speedup is really, really important. Wouldn't you agree that this is true for some projects even if it's not true for the ones you have in mind?

Quote:
It comes down to the a couple of questions, "How much speed does your intended application need?"
That's a good question. I see where your method is valuable when that has an exact answer. In my case, it doesn't, which is why I'm pointing out that that method doesn't always work.

Quote:
To say it's superior at all times and in all circumstances is rather a sweeping statement. :roll: No way do I endorse talk like that. Perhaps it's a case of the Vocal Minority making the most noise. I hope no-one makes the mistake of supposing all Forth adherents are afflicted with such an unreasoning excess of zeal.
I think you're right. There seems to be an uncommonly high number of just that type of overzealous adherent in the Forth community, but I didn't mean it was the majority.

EDIT: and to bring us back on topic as Garth suggested, didn't someone find out TaliForth 2 was pushing a 1 on the stack then falling through to +LOOP? It might be worth looking at 8 and 16-bit versions that don't do this if you're looking at new DO...LOOP versions.


Top
 Profile  
Reply with quote  
 Post subject: Re: Faster loops
PostPosted: Wed Apr 28, 2021 1:57 am 
Offline
User avatar

Joined: Fri Aug 30, 2002 1:09 am
Posts: 8412
Location: Southern California
Druzyek wrote:
Now I can compile much larger binaries with gcc in a second or two. Maybe the interactivity of Forth had more of a payoff at the time than it does today. I do wonder if many of the Forth guys who talk about how insufferably awful C is have touched a C compiler in the last decade or two, at least when they criticize speed and interactivity.

True; but if it's an embedded system, now you have to transfer the image and put it in flash, which may take some time too. Even that has gotten pretty quick; but to quote my own earlier post,

      ...which you can watch at https://www.youtube.com/watch?v=--IJEl6HV2k . At about 1:26:30, one of the participants says that Forth is the future of robots in space, because of its incremental compilation. He works in this field, and says it's a huge problem if, for example, you have a very slow link to your spacecraft (because it's so many hundreds of millions of miles away and the signal is so weak when it reaches its destination) and you have to feed it some update code in C and you have a quarter of a gigabyte, whereas in Forth you might be able to get away with only a hundred bytes, because it's not necessary to re-compile everything.

Quote:
didn't someone find out TaliForth 2 was pushing a 1 on the stack then falling through to +LOOP? It might be worth looking at 8 and 16-bit versions that don't do this if you're looking at new DO...LOOP versions.

Interesting. I never thought of that. My own +loop is quite a bit longer than loop. Continuing discussion does give me ideas to check for ways to improve the efficiency of my Forth though.

_________________
http://WilsonMinesCo.com/ lots of 6502 resources
The "second front page" is http://wilsonminesco.com/links.html .
What's an additional VIA among friends, anyhow?


Top
 Profile  
Reply with quote  
 Post subject: Re: Faster loops
PostPosted: Wed Apr 28, 2021 5:47 am 
Offline
User avatar

Joined: Sun Jun 30, 2013 10:26 pm
Posts: 1918
Location: Sacramento, CA, USA
Quote:
didn't someone find out TaliForth 2 was pushing a 1 on the stack then falling through to +LOOP? It might be worth looking at 8 and 16-bit versions that don't do this if you're looking at new DO...LOOP versions.


If that's true, it might be partially my fault, since ISTR mentioning an idea like that to Scot a few years back, without considering the potential performance hit.

[edit: yep, guilty!]

_________________
Got a kilobyte lying fallow in your 65xx's memory map? Sprinkle some VTL02C on it and see how it grows on you!

Mike B. (about me) (learning how to github)


Top
 Profile  
Reply with quote  
 Post subject: Re: Faster loops
PostPosted: Wed Apr 28, 2021 8:19 am 
Offline

Joined: Fri Apr 15, 2016 1:03 am
Posts: 134
YLoop could be 3 bytes shorter & a few cycles faster if you compile a short branch instead of a long branch in the cases where a short branch will reach.
Code:
                               : testing
        lda #5                     5
        dex
        dex
        sta 0,x
        stz 1,x
        lda 0,x                    ydo
        inx
        inx
@1:     pha
        lda #<3                       3
        dex
        dex
        sta 0,x
        stz 1,x
        lda 0,x                       ydo
        inx
        inx
@2:     pha
        jsr CR                           cr
        jsr (.")                         ." yi="
        .byte 3,"yi="
        pla                              yi
        pha
        dex
        dex
        sta 0,x
        stz 1,x
        jsr Dot                          .
        jsr (.")                         ." yj="
        .byte 3,"yj="
        ply                              yj
        pla
        pha
        phy
        dex
        dex
        sta 0,x
        stz 1,x
        jsr Dot                          .
        pla                             yloop
        dea
        bne @2
        pla                          yloop
        dea
        bne @1
        rts                        ;


Another way to squeeze out some more cycles from a YLoop iteration, at the cost of more bytes & cycles for loop initialization & termination, is to decrement a zero-page location instead of ply, dey, iny, .
Code:
                                : testing
        lda #5                     5
        dex
        dex
        sta 0,x
        stz 1,x
        ldy YCounter               ydo
        phy
        lda 0,x
        inx
        inx
        sta YCounter
@1:
        lda #<3                       3
        dex
        dex
        sta 0,x
        stz 1,x
        ldy YCounter                  ydo
        phy
        lda 0,x
        inx
        inx
        sta YCounter
@2:
        jsr CR                           cr
        jsr (.")                         ." yi="
        .byte 3,"yi="
        lda YCounter                     yi
        dex
        dex
        sta 0,x
        stz 1,x
        jsr Dot                          .
        jsr (.")                         ." yj="
        .byte 3,"yj="
        pla                              yj
        pha
        dex
        dex
        sta 0,x
        stz 1,x
        jsr Dot                          .
        dec YCounter                    yloop
        bne @2
        pla
        sta YCounter
        dec YCounter                 yloop
        bne @1
        pla
        sta YCounter
        rts                        ;


I've also implemented YDo & friends on 65816F ( a subroutine-threaded FORTH running on the 65816 in native mode using 16 bits)


Attachments:
File comment: 65816F console log
F_YDo.lst.txt [9.48 KiB]
Downloaded 66 times
Top
 Profile  
Reply with quote  
 Post subject: Re: Faster loops
PostPosted: Wed Apr 28, 2021 9:16 am 
Offline

Joined: Sun Apr 26, 2020 3:08 am
Posts: 355
This speed up seems to only work for STC Forths. Is that correct?

When doing a direct jump back to YDO on an ITC Forth, Forth's IP pointer also has to be updated to start processing to just after the YDO that is being returned to. Otherwise the IP pointer just continues through the YLOOP as if nothing happened. I do get one iteration of YI and YJ though before it crashes.

Jumping back to ML code that has Forth words in between would require an extra step to update Forth's IP pointer.


Top
 Profile  
Reply with quote  
 Post subject: Re: Faster loops
PostPosted: Wed Apr 28, 2021 7:47 pm 
Offline

Joined: Fri Apr 15, 2016 1:03 am
Posts: 134
The idea of using an 8-bit down counter instead of a 16-bit index & limit for higher speed should apply to any FORTH where the machine (like a 6502) can implement it in faster machine code.

So far this thread has focused on implementations in Tali FORTH (a Subroutine-Threaded-Code FORTH for the 6502) that generate inline machine code.
The idea can be implemented on an Indirect-Threaded-Code FORTH, but as you've noted, the details of the implementation will be different.

Here is a quick untested attempt for FIG FORTH for 6502 that uses the ITC word structure & flow control.
Code:
( YDo for 6502 FIG FORTH - Indirect Threaded Code )
( Completely untested so far )
( Will still work if the assembler words are not available )

hex

code (YDO) ( n -- )  ( runtime part of YDO )
  00B5 ,                ( lda 0,x       ; get lo byte of n )
  48 C,                 ( pha           ; push as count )
  6C C, ' DROP 2 - ,    ( jmp {DROP}    ; drop n, & Next )
  C;

: YDO  ( start an 8-bit down counter loop )
  COMPILE (YDO) ( compile a call to the runtime word )
  HERE          ( push branch back addr )
  4             ( push pairing code )
  ; IMMEDIATE

code (YLOOP) ( runtime part of YLOOP )
  7A C,                 (   ply         ; pop the count )
  88 C,                 (   dey         ; decrement it )
  04F0 ,                (   beq @2      ; if 0: exit the loop )
  5A C,                 (   phy         ; push updated count )
  4C C, ' BRAN ,        (   jmp Bran+2  ; do the branch back, & Next )
  4C C, ' 0BRANCH 8 + , ( @2: jmp BUMP  ; skip the branch addr, & Next )
  C;

: YLOOP  ( end a YDo )
  4 ?PAIRS      ( check pairing code )
  COMPILE (YLOOP)  ( compile a call to the runtime word )
  BACK          ( compile the branch address, left on stack by YDO )
  ; IMMEDIATE

code YI ( -- n )  ( get the current down counter )
  68 C,                 (   pla         ; get the count )
  48 C,                 (   pha                         )
  4C C, ' SP@ 1+ ,      (   jmp PUSHOA  ; push onto the param stack, & Next )
  C;

code YJ ( -- n )  ( get the next outer down counter )
  7A C,                 (   ply         ; get the next YDo out count )
  68 C,                 (   pla         )
  48 C,                 (   pha         )
  5A C,                 (   phy         )
  4C C, ' SP@ 1+ ,      (   jmp PUSHOA  ; push onto the param stack, & Next )
  C;


Top
 Profile  
Reply with quote  
 Post subject: Re: Faster loops
PostPosted: Sat May 01, 2021 12:58 am 
Offline

Joined: Fri May 05, 2017 9:27 pm
Posts: 841
GARTHWILSON wrote:

Am I biased? Probably, because Forth works the way my brain does, and after meeting Forth, I didn't want any more algebraic languages. I did go through the K&R C book years later, and another beginning C book, and I cannot latch onto it. I have an article about my beef with C, here. The tag-along on the title says, "(You don't have to agree.)" If someone else's brain works that way, I have no criticisms for them; but I myself am done trying.


I completely agree. At one time I mentioned that I grok C. A poor choice of word. I thought it meant something along the lines of a working understanding. It actually implies a far greater familiarity with a subject. A familiarity I do NOT wish to experience with C. I can write working programs in C, but I will never be a C expert. My brain just doesn't work that way and I will not force it. I don't have a problem with C, or rather with the use of C. At one time, C received a lot of attention from me, but I do not wish to repeat that brutal learning curve again. Last time I wrote anything in C, I still had to look up those convoluted precedence rules.

Quote:

Jim Boyd obviously knows his FleetForth very, very well, although it's running on a 1MHz C64. Apparently it has the performance his uses need.



I do have a lot of code words (primitives) as well as some mixed level words.

Quote:

It comes down to the a couple of questions, "How much speed does your intended application need?" and "Can you get there with your hardware and with your software methods?" "Software methods" will include language, but probably go much further, and in this case, forming a faster DO...LOOP for 8-bit counters.



For one's hobby, switching to another language may not be an option. If this other language does not mesh smoothly with the way you think, if it turns your hobby into a headache, if you can't wrap your mind around it then maybe it's time to upgrade the hardware like moving from a 65C02 to a 65C816.

By the way, a loop that takes a single parameter and counts down to zero reminds me of the FOR NEXT loop for Forth. I've heard of it but not seen it implemented.
Just a reminder for those of us who sometimes overlook the obvious. For really small loops: loop unrolling. I recall doing something where I had a DO LOOP that looped exactly three times and only executed one word. That one word was a code word (primitive) that replaced a handful of words. I looked at that and thought,"Why am I using a DO LOOP?" I removed the loop and used the word in the loop three times for a savings of memory and time.


Top
 Profile  
Reply with quote  
Display posts from previous:  Sort by  
Post new topic Reply to topic  [ 19 posts ]  Go to page 1, 2  Next

All times are UTC


Who is online

Users browsing this forum: No registered users and 2 guests


You cannot post new topics in this forum
You cannot reply to topics in this forum
You cannot edit your posts in this forum
You cannot delete your posts in this forum
You cannot post attachments in this forum

Search for:
Jump to: