FIG Forth's NEXT function confuses me slightly

Martin_H · Post by **Martin_H** » Fri Aug 05, 2022 5:56 pm

In the Programming sub-forum I built a threaded interpreter that executes the BF language. I modeled my threaded interpreter on Forth and in the process I learned more about Forth itself. The next function is where a lot of cycles are spent and it's worth optimization for any threaded implementation. So I decided to re-read FIG Forth's next function to see if I could pick up any tips. But instead the code left me wondering if I'm missing something.

Here's the original with my question afterwards:

Code: Select all

NEXT:	ldy #1
	lda (IP),Y		; Fetch code field address pointed
	sta W+1			; to by IP.
	dey
	lda (IP),Y
	sta W
 	`TraceNext		; Macro determines what (if anything) this does
	clc			; Increment IP by two.
	lda IP
	adc #2
	sta IP
	bcc L54
	inc IP+1
L54:	jmp W-1		; Jump to an indirect jump (W) which
;                        vectors to code pointed to by a code
;                        field.

Initially I found it confusing to use (IP),Y indexed addressing to load the threaded code address, but then I realized that indirect addressing without an index register was added for the 65c02. They also used adc #2 rather than two word increments because it's about six cycles faster (21 vs 30). Although the ldy #2 and DEY negate that advantage if you optimized for the 65c02.

So far so good.

But what I don't understand is the jmp to jmp where the code wrote the absolute address. Why wouldn't they use a jmp (w) since W isn't on a page boundary?

GARTHWILSON · Post by **GARTHWILSON** » Fri Aug 05, 2022 6:29 pm

Many of 6502 Forth's words require NEXT to have left 0 in Y anyway; so you cannot neglect Y in NEXT.

The jump is effectively a double indirect. The whole thing will be more efficient if you copy NEXT to ZP RAM before running, and do self-modifying code, using the instructions' operand fields as the IP and W variables.

There definitely are some confusing things in Forth's innards, and once you get those set up and running, you can forget about them. That would include NEXT, DOES>, and a few other things.

Martin_H · Post by **Martin_H** » Fri Aug 05, 2022 6:54 pm

GARTHWILSON wrote:

The jump is effectively a double indirect. The whole thing will be more efficient if you copy NEXT to ZP RAM before running, and do self-modifying code, using the instructions' operand fields as the IP and W variables.

Ah, thanks!

That makes sense because FIG Forth is indirect threaded code, so that requires a double indirection. They had that comment about it, and in warm start they wrote the indirect jmp opcode to W-1.

Code: Select all

	lda #$6C
	sta W-1

But it didn't sink in. In any event the code I wrote is direct threaded code so a single indirection works.

BillG · Post by **BillG** » Sat Aug 06, 2022 5:32 am

GARTHWILSON wrote:

The jump is effectively a double indirect. The whole thing will be more efficient if you copy NEXT to ZP RAM before running, and do self-modifying code, using the instructions' operand fields as the IP and W variables.

This technique was explored in this and the following posts: viewtopic.php?p=78550#p78550

Martin_H · Post by **Martin_H** » Sat Aug 06, 2022 2:13 pm

@BillG, thanks for the pointer. I skimmed it and a 15 cycle NEXT function is an impressive speed boost! Mine is much worse than that due to the self imposed constraints of my programming challenge.

As an easy optimization I put NEXT into a macro and appended it at the end of each function, rather than jumping to NEXT. That eliminated 3 cycles per NEXT and resulted in a very modest improvement (at the cost of code density). It probably wasn't worth it compared to something like relocating it to page zero.

But rather than pursue that I plan to rewrite into a subroutine threaded version to see how using the underlying hardware as the NEXT function effects performance.

BruceRMcF · Post by **BruceRMcF** » Sat Aug 06, 2022 5:14 pm

Martin_H wrote:

@BillG, thanks for the pointer. I skimmed it and a 15 cycle NEXT function is an impressive speed boost! Mine is much worse than that due to the self imposed constraints of my programming challenge.

As an easy optimization I put NEXT into a macro and appended it at the end of each function, rather than jumping to NEXT. That eliminated 3 cycles per NEXT and resulted in a very modest improvement (at the cost of code density). It probably wasn't worth it compared to something like relocating it to page zero.

Also note that fig-Forth was an implementation that was very early on in the life of the Forth programing language, and there was more experience in implementing Forth's reflected in Forth implementations being done later in the 80's, so many people will refer to implementations like Bill Muench's eForth, http://www.forth.org/eforth.html, also worked on by Dr. CH Ting and others. eForth is neither speed optimized nor code size optimized: it is system development speed optimized, based on a relatively small number of code words with the rest of the system written in Forth. The original (and many later) eForth implementation was distributed as assembler source, similar to fig-Forth, while others are written to be cross-compiled using an already running Forth implementation.

If NEXT is NOT in the ZP, it may make sense to "write in" NEXT for JUST the most frequently used words ... IMV, DROP DUP PLUS FETCH and STORE would be a reasonable target list for a lot of Forth applications.

(1) If NEXT is not in the Dictionary, and the code for NEXT is NOT at the end of the EXIT routine (for instance, because there is an optimization for incrementing the IP as part of the EXIT process),
it is a common practice to write DROP with the NEXT code at the end and use the address of that NEXT for the JMP NEXT for other words ... once the system is up and running and you have some working Source to go on, doing a frequency list of Forth words in Source you may be using will quickly suggest some other words where you might want to include the NEXT routine rather than a JMP NEXT.

(2) Even if NEXT is in the Dictionary in the form of NOP, DROP is used a LOT more than NOP, so doing (1) and then just having "JMP NEXT" for the NOP code would be a simple speed optimization with very little code space impact.

Quote:

But rather than pursue that I plan to rewrite into a subroutine threaded version to see how using the underlying hardware as the NEXT function effects performance.

Prediction: SRT (SubRoutine Threading) will be a MASSIVE boost compared to indirect threading and a SUBSTANTIAL boost compared to direct threading.

Also, a word of advice from an earlier attempt to implement a Forth that got bogged down: there are a LOT of optimizations for both speed and code density you can pull off with subroutine threading, but it makes a lot of sense to just ignore them up front and get the subroutine threaded code system working, see how performance goes compared to what you would like, and then look at building in the optimizations in order of what looks like the best "bang for the buck" in terms of payoff versus effort required. By the time you have finished making a straightforward SRT system, you will have a MUCH better feel for how much effort will be required by the various possible optimizations.

Martin_H · Post by **Martin_H** » Sat Aug 06, 2022 7:44 pm

BruceRMcF wrote:

Prediction: SRT (SubRoutine Threading) will be a MASSIVE boost compared to indirect threading and a SUBSTANTIAL boost compared to direct threading.

I just finished coding and you're right. My direct threaded code printed the Sierpinski triangle in 222 seconds, while the subroutine threaded code printed it in 148 seconds. I'd call that significant, and it shows the overhead inherent in the NEXT function.

The only tricky bit was no longer having access to the IP to get parameters. Instead I popped the return PC off the hardware stack, used it as a pointer to the inline arguments, increment past them, and pushed the adjusted PC back on the stack.

barrym95838 · Post by **barrym95838** » Sat Aug 06, 2022 10:12 pm

BruceRMcF wrote:

Prediction: SRT (SubRoutine Threading) will be a MASSIVE boost compared to indirect threading and a SUBSTANTIAL boost compared to direct threading.

All three of those techniques have their pros and cons, but if we're strictly talking about minimizing clocks/primitive you're probably in truth's ballpark, at least for a 65xx implementation.

https://www.bradrodriguez.com/papers/moving1.htm

Martin_H · Post by **Martin_H** » Sat Aug 06, 2022 10:21 pm

barrym95838 wrote:

https://www.bradrodriguez.com/papers/moving1.htm

To anyone who hasn't read that linked article. I've read through Brad's papers and they really helped me understand Forth threading models, and how to implement one. A really worthwhile read.

JimBoyd · Post by **JimBoyd** » Thu Aug 11, 2022 12:44 am

BruceRMcF wrote:

(2) Even if NEXT is in the Dictionary in the form of NOP, DROP is used a LOT more than NOP, so doing (1) and then just having "JMP NEXT" for the NOP code would be a simple speed optimization with very little code space impact.

With an ITC (Indirect Threaded Code) Forth, a primitive normally has a code field which points to its body, but not always. In my Forth, the high level no-op is called NOOP . Since my Forth is an ITC Forth, NOOP is not "JMP NEXT". NOOP has no body and its code field points to NEXT .

My Forth has LIT fall into NEXT but I'm writing an experimental kernel where (LOOP) would fall through into NEXT . The way my DO LOOP's are implemented, LEAVE and EXIT would also fall into NEXT .

FIG Forth's NEXT function confuses me slightly

FIG Forth's NEXT function confuses me slightly

Re: FIG Forth's NEXT function confuses me slightly

Re: FIG Forth's NEXT function confuses me slightly

Re: FIG Forth's NEXT function confuses me slightly

Re: FIG Forth's NEXT function confuses me slightly

Re: FIG Forth's NEXT function confuses me slightly

Re: FIG Forth's NEXT function confuses me slightly

Re: FIG Forth's NEXT function confuses me slightly

Re: FIG Forth's NEXT function confuses me slightly

Re: FIG Forth's NEXT function confuses me slightly