6502.org Forum  Projects  Code  Documents  Tools  Forum
It is currently Sat Nov 23, 2024 9:12 pm

All times are UTC




Post new topic Reply to topic  [ 10 posts ] 
Author Message
PostPosted: Fri Aug 05, 2022 5:56 pm 
Offline

Joined: Wed Jan 08, 2014 3:31 pm
Posts: 578
In the Programming sub-forum I built a threaded interpreter that executes the BF language. I modeled my threaded interpreter on Forth and in the process I learned more about Forth itself. The next function is where a lot of cycles are spent and it's worth optimization for any threaded implementation. So I decided to re-read FIG Forth's next function to see if I could pick up any tips. But instead theĀ code left me wondering if I'm missing something.

Here's theĀ original with my question afterwards:

Code:
NEXT:   ldy #1
   lda (IP),Y      ; Fetch code field address pointed
   sta W+1         ; to by IP.
   dey
   lda (IP),Y
   sta W
    `TraceNext      ; Macro determines what (if anything) this does
   clc         ; Increment IP by two.
   lda IP
   adc #2
   sta IP
   bcc L54
   inc IP+1
L54:   jmp W-1      ; Jump to an indirect jump (W) which
;                        vectors to code pointed to by a code
;                        field.


Initially I found it confusing to use (IP),Y indexed addressing to load the threaded code address, but then I realized that indirect addressing without an index register was added for the 65c02. They also used adc #2 rather than two word increments because it's about six cycles faster (21 vs 30). Although the ldy #2 and DEY negate that advantage if you optimized for the 65c02.

So far so good.

But what I don't understand is the jmp to jmp where the code wrote the absolute address. Why wouldn't they use a jmp (w) since W isn't on a page boundary?


Top
 Profile  
Reply with quote  
PostPosted: Fri Aug 05, 2022 6:29 pm 
Offline
User avatar

Joined: Fri Aug 30, 2002 1:09 am
Posts: 8546
Location: Southern California
Many of 6502 Forth's words require NEXT to have left 0 in Y anyway; so you cannot neglect Y in NEXT.

The jump is effectively a double indirect. The whole thing will be more efficient if you copy NEXT to ZP RAM before running, and do self-modifying code, using the instructions' operand fields as the IP and W variables.

There definitely are some confusing things in Forth's innards, and once you get those set up and running, you can forget about them. That would include NEXT, DOES>, and a few other things.

_________________
http://WilsonMinesCo.com/ lots of 6502 resources
The "second front page" is http://wilsonminesco.com/links.html .
What's an additional VIA among friends, anyhow?


Top
 Profile  
Reply with quote  
PostPosted: Fri Aug 05, 2022 6:54 pm 
Offline

Joined: Wed Jan 08, 2014 3:31 pm
Posts: 578
GARTHWILSON wrote:
The jump is effectively a double indirect. The whole thing will be more efficient if you copy NEXT to ZP RAM before running, and do self-modifying code, using the instructions' operand fields as the IP and W variables.

Ah, thanks!

That makes sense because FIG Forth is indirect threaded code, so that requires a double indirection. They had that comment about it, and in warm start they wrote the indirect jmp opcode to W-1.
Code:
   lda #$6C
   sta W-1

But it didn't sink in. In any event the code I wrote is direct threaded code so a single indirection works.


Top
 Profile  
Reply with quote  
PostPosted: Sat Aug 06, 2022 5:32 am 
Offline

Joined: Thu Mar 12, 2020 10:04 pm
Posts: 704
Location: North Tejas
GARTHWILSON wrote:
The jump is effectively a double indirect. The whole thing will be more efficient if you copy NEXT to ZP RAM before running, and do self-modifying code, using the instructions' operand fields as the IP and W variables.

This technique was explored in this and the following posts: viewtopic.php?p=78550#p78550


Top
 Profile  
Reply with quote  
PostPosted: Sat Aug 06, 2022 2:13 pm 
Offline

Joined: Wed Jan 08, 2014 3:31 pm
Posts: 578
@BillG, thanks for the pointer. I skimmed it and a 15 cycle NEXT function is an impressive speed boost! Mine is much worse than that due to the self imposed constraints of my programming challenge.

As an easy optimization I put NEXT into a macro and appended it at the end of each function, rather than jumping to NEXT. That eliminated 3 cycles per NEXT and resulted in a very modest improvement (at the cost of code density). It probably wasn't worth it compared to something like relocating it to page zero.

But rather than pursue that I plan to rewrite into a subroutine threaded version to see how using the underlying hardware as the NEXT function effects performance.


Top
 Profile  
Reply with quote  
PostPosted: Sat Aug 06, 2022 5:14 pm 
Offline

Joined: Wed Aug 21, 2019 6:10 pm
Posts: 217
Martin_H wrote:
@BillG, thanks for the pointer. I skimmed it and a 15 cycle NEXT function is an impressive speed boost! Mine is much worse than that due to the self imposed constraints of my programming challenge.

As an easy optimization I put NEXT into a macro and appended it at the end of each function, rather than jumping to NEXT. That eliminated 3 cycles per NEXT and resulted in a very modest improvement (at the cost of code density). It probably wasn't worth it compared to something like relocating it to page zero.


Also note that fig-Forth was an implementation that was very early on in the life of the Forth programing language, and there was more experience in implementing Forth's reflected in Forth implementations being done later in the 80's, so many people will refer to implementations like Bill Muench's eForth, http://www.forth.org/eforth.html, also worked on by Dr. CH Ting and others. eForth is neither speed optimized nor code size optimized: it is system development speed optimized, based on a relatively small number of code words with the rest of the system written in Forth. The original (and many later) eForth implementation was distributed as assembler source, similar to fig-Forth, while others are written to be cross-compiled using an already running Forth implementation.

If NEXT is NOT in the ZP, it may make sense to "write in" NEXT for JUST the most frequently used words ... IMV, DROP DUP PLUS FETCH and STORE would be a reasonable target list for a lot of Forth applications.

(1) If NEXT is not in the Dictionary, and the code for NEXT is NOT at the end of the EXIT routine (for instance, because there is an optimization for incrementing the IP as part of the EXIT process),
it is a common practice to write DROP with the NEXT code at the end and use the address of that NEXT for the JMP NEXT for other words ... once the system is up and running and you have some working Source to go on, doing a frequency list of Forth words in Source you may be using will quickly suggest some other words where you might want to include the NEXT routine rather than a JMP NEXT.

(2) Even if NEXT is in the Dictionary in the form of NOP, DROP is used a LOT more than NOP, so doing (1) and then just having "JMP NEXT" for the NOP code would be a simple speed optimization with very little code space impact.

Quote:
But rather than pursue that I plan to rewrite into a subroutine threaded version to see how using the underlying hardware as the NEXT function effects performance.


Prediction: SRT (SubRoutine Threading) will be a MASSIVE boost compared to indirect threading and a SUBSTANTIAL boost compared to direct threading.

Also, a word of advice from an earlier attempt to implement a Forth that got bogged down: there are a LOT of optimizations for both speed and code density you can pull off with subroutine threading, but it makes a lot of sense to just ignore them up front and get the subroutine threaded code system working, see how performance goes compared to what you would like, and then look at building in the optimizations in order of what looks like the best "bang for the buck" in terms of payoff versus effort required. By the time you have finished making a straightforward SRT system, you will have a MUCH better feel for how much effort will be required by the various possible optimizations.


Top
 Profile  
Reply with quote  
PostPosted: Sat Aug 06, 2022 7:44 pm 
Offline

Joined: Wed Jan 08, 2014 3:31 pm
Posts: 578
BruceRMcF wrote:
Prediction: SRT (SubRoutine Threading) will be a MASSIVE boost compared to indirect threading and a SUBSTANTIAL boost compared to direct threading.

I just finished coding and you're right. My direct threaded code printed the Sierpinski triangle in 222 seconds, while the subroutine threaded code printed it in 148 seconds. I'd call that significant, and it shows the overhead inherent in the NEXT function.

The only tricky bit was no longer having access to the IP to get parameters. Instead I popped the return PC off the hardware stack, used it as a pointer to the inline arguments, increment past them, and pushed the adjusted PC back on the stack.


Top
 Profile  
Reply with quote  
PostPosted: Sat Aug 06, 2022 10:12 pm 
Offline
User avatar

Joined: Sun Jun 30, 2013 10:26 pm
Posts: 1949
Location: Sacramento, CA, USA
BruceRMcF wrote:
Prediction: SRT (SubRoutine Threading) will be a MASSIVE boost compared to indirect threading and a SUBSTANTIAL boost compared to direct threading.

All three of those techniques have their pros and cons, but if we're strictly talking about minimizing clocks/primitive you're probably in truth's ballpark, at least for a 65xx implementation.

https://www.bradrodriguez.com/papers/moving1.htm

_________________
Got a kilobyte lying fallow in your 65xx's memory map? Sprinkle some VTL02C on it and see how it grows on you!

Mike B. (about me) (learning how to github)


Top
 Profile  
Reply with quote  
PostPosted: Sat Aug 06, 2022 10:21 pm 
Offline

Joined: Wed Jan 08, 2014 3:31 pm
Posts: 578
barrym95838 wrote:
https://www.bradrodriguez.com/papers/moving1.htm

To anyone who hasn't read that linked article. I've read through Brad's papers and they really helped me understand Forth threading models, and how to implement one. A really worthwhile read.


Top
 Profile  
Reply with quote  
PostPosted: Thu Aug 11, 2022 12:44 am 
Offline

Joined: Fri May 05, 2017 9:27 pm
Posts: 895
BruceRMcF wrote:
(2) Even if NEXT is in the Dictionary in the form of NOP, DROP is used a LOT more than NOP, so doing (1) and then just having "JMP NEXT" for the NOP code would be a simple speed optimization with very little code space impact.


With an ITC (Indirect Threaded Code) Forth, a primitive normally has a code field which points to its body, but not always. In my Forth, the high level no-op is called NOOP . Since my Forth is an ITC Forth, NOOP is not "JMP NEXT". NOOP has no body and its code field points to NEXT .

My Forth has LIT fall into NEXT but I'm writing an experimental kernel where (LOOP) would fall through into NEXT . The way my DO LOOP's are implemented, LEAVE and EXIT would also fall into NEXT .


Top
 Profile  
Reply with quote  
Display posts from previous:  Sort by  
Post new topic Reply to topic  [ 10 posts ] 

All times are UTC


Who is online

Users browsing this forum: No registered users and 4 guests


You cannot post new topics in this forum
You cannot reply to topics in this forum
You cannot edit your posts in this forum
You cannot delete your posts in this forum
You cannot post attachments in this forum

Search for:
Jump to: