Is this the optimal 6502 NEXT?

chitselb · Post by **chitselb** » Sat Aug 21, 2010 7:55 am

I got my PET 2001 working again! To honor the occasion I decided to fire up some Forth on it.

http://www.flickr.com/photos/chitselb/s ... 696682160/

Alas, I couldn't find Forth for the PET anywhere on the Internet! Does anybody know where there is one?

Plenty of Forth for the C=64 though. So I got copies of volksForth, Blazin' Forth (my old favorite), and figForth and began to roll my own (PETTIL, for Personal Electronics Transactor Threaded Interpretive Language, also a play on the surname of the engineer who helped make the 6502 and the PET).

http://volksforth.sourceforge.net/
ftp://ftp.taygeta.com/pub/Forth/Compile ... mmodore64/

The idea here is to learn something new and have some fun creating. For me, the fun is in optimizing for time. After doing way too much further research, I decided to go with a direct threaded model vs. the more typical indirect threaded. I'm storing TOS in a zero page register, using a split parameter stack on zero page, and whatever else I can come up with to make PETTIL the fastest Forth on Commodore 8-bits. I'll be hashing the dictionary to speed up FIND (a simple "XOR all the nybbles of the name field together" 0..15 hash). Any other suggestions are appreciated.
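The nybble hash mentioned above can be sketched in a few lines. This is a minimal Python model, not PETTIL code: the function name `nybble_hash` is made up for illustration, and it assumes only the raw name characters are hashed (no length byte, no high-bit flags).

```python
def nybble_hash(name: bytes) -> int:
    """XOR every 4-bit nybble of the name together, giving a bucket 0..15."""
    h = 0
    for byte in name:
        h ^= byte >> 4      # high nybble
        h ^= byte & 0x0F    # low nybble
    return h


if __name__ == "__main__":
    for word in (b"DUP", b"SWAP", b"DROP"):
        print(word.decode(), nybble_hash(word))
```

Because XOR does not care about order, names built from the same characters always share a bucket, so how evenly the 16 buckets fill depends on the actual word list.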

After a few hours of throwing every trick in the book at it, here's my PETTIL NEXT routine

Code: Select all

                    clocks
0006   85 22      0  3  puta    sta tos
0008   A0 01      0  2  next1   ldy #1
000A   B1 0F      5  1  next    lda (ip),y
000C   85 1D      3             sta w+1
                        ip = *+1
000E   AD FF FF   4             lda $ffff
0011   85 1C      3             sta w
0013   98         2             tya
0014   38         2             sec
0015   65 0F      3             adc ip
0017   85 0F      3             sta ip
0019   B0 03      2  1          bcs t2
                        w = *+1
001B   4C FF FF   3     t1      jmp $ffff
001E   E6 10      0  5  t2      inc ip+1
0020   B0 F9      0  3          bcs t1     ; bra
                        tos = *
                 ===
                  30 (best case)
                  45 (worst case)

For side-by-side comparison, here's the C=64 volksForth NEXT

Code: Select all

0006 4  8D FA 7A   STA $7AFA	; 7-8 is SP
0009 5+ B1 0E      LDA ($0E),Y	; <-- best case entry point, assumes Y=1
000b 3  85 1D      STA $1D
000d 4  AD 3C 3A   LDA $3A3C	; 0e-0f is IP
0010 3  85 1C      STA $1C
0012 2  18         CLC
0013 3  A5 0E      LDA $0E
0015 2  69 02      ADC #$02
0017 3  85 0E      STA $0E
0019 2+ B0 03      BCS $001E
001b 5  6C 90 13   JMP ($1390)	; 1c-1d is W
001e 5  E6 0F      INC $0F
0020 3  B0 F9      BCS $001B	; bra

volksForth comes out to 32 clocks for the best case (entry at $0009) and 42 if page boundaries are crossed. So are there any tricks that I missed out on here?
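As a cross-check on the totals, the per-instruction counts from the two listings can simply be summed. This is only a tally in Python, assuming the counts shown in the listings are right; the list names are made up for illustration.

```python
# Cycle counts copied from the two listings above, one entry per instruction.

# PETTIL, entering at `next`, no page crossing, BCS not taken.
PETTIL_BEST = [5, 3, 4, 3, 2, 2, 3, 3, 2, 3]
# PETTIL, entering at `puta`, page-crossing penalty on LDA (ip),y,
# BCS taken, INC ip+1, BCS back to t1, then the JMP.
PETTIL_WORST = [3, 2, 6, 3, 4, 3, 2, 2, 3, 3, 3, 5, 3, 3]

# volksForth, entering at $0009, no page crossing, BCS not taken.
VOLKS_BEST = [5, 3, 4, 3, 2, 3, 2, 3, 2, 5]
# volksForth, entering at $0009, page-crossing penalty, BCS taken,
# INC $0F, BCS back, then JMP (indirect).
VOLKS_WORST = [6, 3, 4, 3, 2, 3, 2, 3, 3, 5, 3, 5]

if __name__ == "__main__":
    print("PETTIL    ", sum(PETTIL_BEST), sum(PETTIL_WORST))
    print("volksForth", sum(VOLKS_BEST), sum(VOLKS_WORST))
```

One thing the tally makes visible: the two worst cases are not measured from the same entry point. PETTIL's 45 includes the 5 cycles of `puta` and `next1`, while volksForth's 42 starts at $0009.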

kc5tja · Post by **kc5tja** » Sat Aug 21, 2010 4:41 pm

At the cost of one byte per compiled address, I'm personally a fan of subroutine-threading. Thus, S becomes the return stack pointer since you're just using JSR to invoke colon and code definitions alike, and RTS to return. This imposes a 12-cycle overhead and eliminates the need for an inner-interpreter altogether.

Otherwise, I don't think you'll ever find a general-purpose implementation of NEXT that falls below 30 or so cycles, for either direct- or indirect-threaded implementations.

Token-threading (e.g., byte-coding) might yield a performance boost, but you now restrict your number of callable definitions to 256.

GARTHWILSON · Post by **GARTHWILSON** » Sat Aug 21, 2010 7:24 pm

chitselb, welcome to the forum!

Also, take a look at my article on servicing interrupts in high-level Forth with no overhead, as the method involves NEXT. Actually, the '816 was more suitable for having a prioritized list of assembly ISRs and one of Forth ISRs which can be installed or deleted on the fly. If the assembly-language ISRs don't claim responsibility for the IRQ, then it goes to Forth ISRs, so there is a little overhead there; but you still don't have to save environments, set up extra stacks, or anything like that. I did it because I wanted to service interrupts in Forth and saw an easy way to do it, and because I was told it couldn't be done.

It would take some convincing for me to believe that splitting the stack and having to keep moving things in and out of a fixed-address TOS pair of ZP bytes would be more efficient than keeping byte pairs together in every cell and doing indexing and doing the INX or DEX twice to add or drop stack cells.

Another way to spend less time in NEXT is simply not to run it so often. What I found with the '816 was that it was more practical to make far more of the words primitives (code definitions) instead of secondaries (colon definitions). With hundreds of primitives, you don't have to run NEXT nearly as many times to get a job done as you do with common 6502 Forth where more of the common things are secondaries. With the '816, the words I re-wrote as primitives not only ran a lot faster (as expected), but in many cases took less memory (which may be somewhat surprising), and in some cases were even faster to write as primitives than as secondaries (even more surprising).

Quote:

At the cost of one byte per compiled address, I'm personally a fan of subroutine-threading.

Bruce has an excellent post here about how STC Forth is considerably more memory-efficient than expected. (The topic started out about BASIC, but something happened.) He lists his reasons starting 6-7 paragraphs before his first code listing in that post.

kc5tja · Post by **kc5tja** » Sat Aug 21, 2010 7:59 pm

Well, he's optimizing his implementation to a degree that a mechanical translation of code would require a pretty complex compiler for. In other words, his STC environment is efficient because of the efforts he's invested elsewhere. That is, external efficiencies make up for the space overhead of his compiler's generated code.

And that's fine -- he's a smart cookie, and I respect his work and his results. Like I said, even I prefer STC over ITC/DTC on the 65xx architecture, even considering the space overhead it implies.

I also prefer Chuck Moore's approach to dealing with the space overhead of subroutine threading -- compile more often. Instead of having a Forth which has the kitchen sink compiled into the kernel, exploit your block storage system to compile applications as you need them, when you need them.

You will pay in I/O overhead (Commodore's computers aren't known for their breakneck I/O performance), but it should be plenty possible to write a compiler that beats the I/O bottleneck cold, which is all you can really ask for. Hence, a RAM expansion unit can be used to speed up application switching times, by using it as a RAM disk cache for commonly used application blocks.

GARTHWILSON · Post by **GARTHWILSON** » Sat Aug 21, 2010 9:23 pm

chitselb, I was just reminded about the common 6502 Forths' wasteful habit of making branch distances relative instead of absolute. It won't make a big difference in speed, but it doesn't make any sense to me why it was written the less-efficient way.
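To make the difference concrete, here is a small Python model of the two ways BRANCH can compute the new IP. It is a sketch under assumptions: memory is a 64K `bytearray`, IP points at the in-line branch cell when BRANCH runs, and the relative offset is measured from that cell's own address. The function names are made up for illustration.

```python
def read_cell(mem, addr):
    """Fetch a 16-bit little-endian cell."""
    return mem[addr] | (mem[addr + 1] << 8)


def write_cell(mem, addr, value):
    """Store a 16-bit little-endian cell."""
    mem[addr] = value & 0xFF
    mem[addr + 1] = (value >> 8) & 0xFF


def branch_relative(mem, ip):
    """Relative style: fetch the offset, then do a 16-bit add to IP."""
    offset = read_cell(mem, ip)
    return (ip + offset) & 0xFFFF   # wrap-around handles backward branches


def branch_absolute(mem, ip):
    """Absolute style: the in-line cell already is the new IP."""
    return read_cell(mem, ip)


if __name__ == "__main__":
    mem = bytearray(0x10000)
    write_cell(mem, 0x4010, (0x4000 - 0x4010) & 0xFFFF)  # relative offset
    write_cell(mem, 0x4020, 0x4000)                      # absolute target
    print(hex(branch_relative(mem, 0x4010)), hex(branch_absolute(mem, 0x4020)))
```

The absolute form drops the 16-bit addition at run time; the compiler simply stores the destination address instead of a difference.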

Make sure you also fix the bugs in UM/MOD and UM* which is called U* in FIG-Forth. Descriptions, fixes and optimizations are given here. See also http://6502org.wikidot.com/software-math .

Dr Jefyll · Post by **Dr Jefyll** » Sun Aug 22, 2010 10:23 am

chitselb wrote:

I got my PET 2001 working again! To honor the occasion I decided to fire up some Forth on it.

Hello, chitselb, and congrats on getting the PET 2001 working! Your NEXT optimizations are nicely crafted -- and the effort is certainly worthwhile, given that even a "fast" 30-cycle NEXT is still slower than many commonly executed Forth words (e.g., OVER DUP + whatever). I.e., the 6502 spends over half its time in NEXT! Describing NEXT as a bottleneck is rather an understatement. (An aside to Garth: very pertinent point about running NEXT less often, but how do your primitives avoid passing through NEXT -- what threading is used? It sounds like you're compiling in-line machine code! Can you explain, or provide a link? Thanks.) [Afterthought: Hundreds of primitives? I think I understand now.]

kc5tja wrote:

I don't think you'll ever find a general-purpose implementation of NEXT that falls below 30 or so cycles

Many, many times I've tried to speed up traditional NEXT, but the limit indeed seemed to be about 30 cycles. So it completely startled and befuddled me when I took a fresh stab at the problem earlier tonight and immediately hit on a version that typically runs in 25 cycles -- a 20% speedup. Is it a "general-purpose implementation"? That's a matter of opinion, I suppose, but it does use conventional threading.

This implementation of NEXT is not a drop-in replacement. Register Y is reserved, a change that means all of your low-level words will need to be inspected, and some of them altered -- although fairly trivially. (No pain, no gain.) Y is used to hold the least-significant byte of IP, whereas the most-significant byte of IP is held in Zero-Page as usual. Weird? Yes. Clumsy? Not really.
Note: Preceding the IP-high byte in Z-PG is a byte containing zero. Used as a pointer, this pair of bytes has the label ip_ in the code below:

Code: Select all

                    clocks
Next:  lda (ip_),y     5 ;copy low-byte of word addressed by IP
       sta w           3
       iny             2
       lda (ip_),y     5 ;copy hi-byte of word addressed by IP
       sta w+1         3
       iny             2
       beq IncIPH      2 (or 3)
Go:    db $4C          3 ;4C is the JMP Abs op-code
W:     db $FF            ;JMP operand (over-written)
       db $FF            ;JMP operand (over-written)

IncIPH:inc ip_+1       5 ;increment most-significant byte of IP
       jmp (w)         5 (6 for 65c02)
                      ===
                      25 (best case)
                      33 (worst case)

As I say, coding for all low-level words must be reviewed.

- Many words are OK as is because they don't use Y and have no repercussions arising from the 5-cycle speedup. These include DUP SWAP + AND and dozens of others: the operations occurring on Forth's P-stack (which is addressed by X).
- Words requiring the use of Y must save and afterwards restore it (costing 6 cycles, a net slowdown of 1 cycle). This tends to slightly dilute the overall performance gain.
- The "nest" and "un-nest" words (DOCOL and SEMIS) actually yield additional performance gain. I haven't tried coding these yet, but for the NMOS '02 I expect that TYA/TAY can directly replace LDA/STA IP for the low byte, saving one cycle. Have you got a 65C02 in your PET 2001 yet, chitselb? PHY and PLY instructions will make nest and un-nest even faster.
- BRANCH and -- if the branch is taken -- 0BRANCH and (LOOP) will run two cycles faster, because TYA/TAY can directly replace LDA/STA IP while computing the offset. (But, if you're willing to alter the compiler, it would be better to take Garth's suggestion and express absolute branch destinations, not relative offsets.)

I don't claim to be the first person to have thought of this treatment of IP and NEXT, but maybe I am. Can anyone see any major problem with it? I admit the appeal is probably limited to those seeking a more-or-less traditional Forth, whether single- or double-indirect.

I'm sure some of you noticed my code will break if the bytes of the word addressed by IP straddle a page boundary, but luckily that's a direct parallel to the NMOS 6502's buggy JMP-Indirect instruction. An effective solution can be found in Fig-Forth 6502, available in the "Monitors, Assemblers, and Interpreters" section here. (The issue is dealt with at compile time; there is no run-time cost. The word CREATE pre-pads the dictionary with an unused byte in the rare cases when the word about to be CREATEd would otherwise end up with a code-field straddling a page boundary.)
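The behaviour of this NEXT, including the page-straddle failure, can be modelled in a few lines of Python. This is a sketch, not the 6502 code itself: `next_y` is a made-up name, memory is assumed to be a 64K `bytearray`, and the cycle figures are the best and worst case totals from the listing.

```python
def next_y(mem, ip_hi, y):
    """Model of NEXT with IP-low in Y and IP-high in zero page.

    Returns (w, ip_hi, y, cycles). Because the zero-page pointer's low
    byte is always 0, LDA (ip_),y reads from (ip_hi << 8) + y.
    """
    page = ip_hi << 8
    lo = mem[page + y]          # lda (ip_),y / sta w
    y = (y + 1) & 0xFF          # iny
    hi = mem[page + y]          # lda (ip_),y / sta w+1
    y = (y + 1) & 0xFF          # iny
    cycles = 25
    if y == 0:                  # beq IncIPH
        ip_hi = (ip_hi + 1) & 0xFF
        cycles = 33
    return (hi << 8) | lo, ip_hi, y, cycles


if __name__ == "__main__":
    mem = bytearray(0x10000)
    mem[0x41FF], mem[0x4200] = 0x78, 0x56   # a cell straddling a page
    mem[0x4100] = 0xAA                      # what actually gets read as the high byte
    w, ip_hi, y, _ = next_y(mem, 0x41, 0xFF)
    print(hex(w), hex(ip_hi), y)            # high byte comes from $4100, not $4200
```

In the straddling case Y wraps to 0 after the first INY, so the high byte is fetched from the start of the same page and IP-high is never incremented, which is why the compile-time padding is needed.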

Whew! Sorry for the long-winded post!

-- Jeff

GARTHWILSON · Post by **GARTHWILSON** » Sun Aug 22, 2010 9:11 pm

Quote:

(An aside to Garth: very pertinent point about running NEXT less often, but how do your primitives avoid passing through NEXT -- what threading is used? It sounds like you're compiling in-line machine code! Can you explain, or provide a link? Thanks.)

It looks like you figured it out, but to answer anyway, each primitive is still followed by NEXT; but take for a simple example NIP , commonly defined as:

Code: Select all

: NIP  SWAP DROP ;  ( a b c -- a c )

which runs: nest NEXT SWAP NEXT DROP NEXT unnest NEXT, as opposed to making it a primitive which only runs NIP NEXT which is somewhere around four times as fast. NEXT runs only once instead of four times, NIP (as a primitive) is somewhere between the lengths of SWAP and DROP, and nest and unnest are completely avoided. In 6502, NIP would be defined as a primitive as:

Code: Select all

     LDA  0,X
     STA  2,X
     LDA  1,X
     STA  3,X
     JMP  POP

In this case, the primitive takes 11 bytes after the header, whereas the secondary takes only 8. In '816 however, the primitive takes only 7, so there's no memory penalty. Some other words will be more extreme. There's a ton of words that could be written as primitives but aren't in 6502 because it's hard to justify the extra memory consumption, but which have no memory penalty in '816 because it is so much more efficient at handling the 16-bit cells, and, of course, NEXT runs only once for the whole word instead of four or more times.
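The byte counts and NEXT counts above can be tallied as follows. This is only bookkeeping in Python; it assumes an indirect-threaded secondary with a 2-byte code field and 2-byte cells, and the 30-cycle NEXT figure is the ballpark quoted earlier in the thread, not a measurement.

```python
# (instruction, size in bytes) for the 6502 primitive shown above.
PRIMITIVE_NIP = [
    ("LDA 0,X", 2),
    ("STA 2,X", 2),
    ("LDA 1,X", 2),
    ("STA 3,X", 2),
    ("JMP POP", 3),
]

# Cells in the body of ": NIP SWAP DROP ;" after the header.
SECONDARY_NIP = ["code field", "SWAP", "DROP", "unnest"]
CELL_BYTES = 2

primitive_bytes = sum(size for _, size in PRIMITIVE_NIP)
secondary_bytes = CELL_BYTES * len(SECONDARY_NIP)

# Passes through NEXT: nest, SWAP, DROP, unnest versus just NIP.
NEXT_CYCLES = 30  # ballpark figure, an assumption
secondary_next_cycles = 4 * NEXT_CYCLES
primitive_next_cycles = 1 * NEXT_CYCLES

if __name__ == "__main__":
    print(primitive_bytes, secondary_bytes)
    print(primitive_next_cycles, secondary_next_cycles)
```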

Quote:

I'm sure some of you noticed my code will break if the bytes of the word addressed by IP straddle a page boundary, but luckily that's a direct parallel to the NMOS 6502's buggy JMP-Indirect instruction. An effective solution can be found in Fig-Forth 6502, available in the "Monitors, Assemblers, and Interpreters" section here. (The issue is dealt with at compile time; there is no run-time cost. The word CREATE pre-pads the dictionary with an unused byte in the rare cases when the word about to be CREATEd would otherwise end up with a code-field straddling a page boundary.)

ALIGNing works out nicely for other things too, including decompiling. I align starting at the link field, so an odd-length name field doesn't have to take an extra byte if it starts out unaligned, and I save a few hundred bytes altogether. I have a note in my code that things should be aligned even if you're compiling without headers, but I can't off the top of my head remember why. It was many years ago.

bogax · Post by **bogax** » Sun Aug 22, 2010 9:14 pm

Dr Jefyll wrote:

Code: Select all

                    clocks
Next:  lda (ip_),y     5 ;copy low-byte of word addressed by IP
       sta w           3
       iny             2
       lda (ip_),y     5 ;copy hi-byte of word addressed by IP
       sta w+1         3
       iny             2
       beq IncIPH      2 (or 3)
Go:    db $4C          3 ;4C is the JMP Abs op-code
W:     db $FF            ;JMP operand (over-written)
       db $FF            ;JMP operand (over-written)

IncIPH:inc ip_+1       5 ;increment most-significant byte of IP
       jmp (w)         5 (6 for 65c02)
                      ===
                      25 (best case)
                      33 (worst case)

-- Jeff

As you're using self-modifying code and keeping IP aligned (and going for speed), why not do this:

Code: Select all

IP_0
 lda 0000,y
 sta W+1
 iny
IP_1
 lda 0000,y
 sta W+2
 iny
 beq INC_IP
W
 jmp 0000

INC_IP
 inc IP_0+2
 inc IP_1+2 
 bne W

Saves a couple cycles
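A quick tally of where the savings come from, in Python. The counts are standard NMOS 6502 timings; the one assumption is that this NEXT lives in zero page, as in the earlier listings, so the stores into the JMP operand are 3-cycle zero-page stores. The list names are made up for illustration.

```python
# Fast path of the (ip_),y version quoted above.
INDIRECT_Y = [
    ("lda (ip_),y", 5), ("sta w", 3), ("iny", 2),
    ("lda (ip_),y", 5), ("sta w+1", 3), ("iny", 2),
    ("beq (not taken)", 2), ("jmp abs", 3),
]

# Fast path of the absolute,y version. The operand's low byte is always 00,
# so LDA abs,y never pays a page-crossing penalty.
ABSOLUTE_Y = [
    ("lda abs,y", 4), ("sta W+1", 3), ("iny", 2),
    ("lda abs,y", 4), ("sta W+2", 3), ("iny", 2),
    ("beq (not taken)", 2), ("jmp abs", 3),
]

indirect_total = sum(c for _, c in INDIRECT_Y)
absolute_total = sum(c for _, c in ABSOLUTE_Y)

if __name__ == "__main__":
    print(indirect_total, absolute_total, indirect_total - absolute_total)
```

If NEXT were placed outside zero page, each store would cost 4 cycles instead of 3 and the two-cycle saving would be used up.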

Dr Jefyll · Post by **Dr Jefyll** » Sun Aug 22, 2010 10:41 pm

Darn it, bogax -- I missed that option entirely! It took me a minute to catch on to what you're doing. But it doesn't surprise me that you would be the one to tighten up code that was seemingly already tightened to the limit. My hat is off to you ever since I saw your solution to the "Bit Counting" (count the number of one bits in a byte) problem.

Anyway, your optimization (to my optimization) very clearly deserves consideration, and I admit I overlooked it as an alternative. The only downside I can see is that, with two copies of IP-high to maintain, there'll be extra overhead for other code that needs to write to IP, including commonly-executed snippets such as un-nest BRANCH 0BRANCH and (LOOP). But I guess each of these would only suffer 3 extra cycles of overhead (to write the additional copy of IP-high; we don't need to read both copies).

Whether you end up with a net advantage would be determined by what sort of code mix you're running (and whether you've incorporated Garth's potent NEXT-reducing technique). For traditional code with lots of NEXTs occurring, your version probably comes out significantly ahead.

-- Jeff

bogax · Post by **bogax** » Mon Aug 23, 2010 12:32 am

Dr Jefyll wrote:

Anyway, your optimization (to my optimization) very clearly deserves consideration, and I admit I overlooked it as an alternative. The only downside I can see is that, with two copies of IP-high to maintain, there'll be extra overhead for other code that needs to write to IP, including commonly-executed snippets such as un-nest BRANCH 0BRANCH and (LOOP). But I guess each of these would only suffer 3 extra cycles of overhead (to write the additional copy of IP-high; we don't need to read both copies).

If you really wanted to get radical you could split the list and make do with one iny and/or create a word to increment the IP page(s) (let the compiler figure out where it goes)

God knows what it would do to your compiler though

(thinking mostly how do you work around inline parameters)

Dr Jefyll · Post by **Dr Jefyll** » Mon Aug 23, 2010 1:59 am

Quote:

If you really wanted to get radical you could split the list

If I understand what you mean (separate sequences of hi- and low-bytes?) then this absolutely would be a can of worms! I don't think I wanna go there!!

Quote:

create a word to increment the IP page(s) (let the compiler figure out where it goes)

This I can see being do-able within acceptable (?!) limits of twisted-ness. You'd eliminate the BEQ instruction and save two more cycles -- good call. I think I know how to manage the compiler/in-line-parameter problem, but never mind.

The people with the subroutine-threaded-code prob'ly wonder why we're flogging this dead horse, and it's a valid question. Just can't resist the challenge of a gnarly coding problem, I guess -- even if it's questionable whether the prize is worth winning!

-- Jeff

chitselb · Post by **chitselb** » Mon Aug 23, 2010 6:02 am

Thank you all for your help! I am amazed at both the quality and quantity of response. If anyone knows how to get a link to the alleged TPUG Forth for the PET (or any other PET Forth), I would be greatly obliged.

Anton Ertl on comp.lang.forth posted a response to my question (which I crossposted there and also comp.sys.cbm) which got it down even further. If the word calling NEXT didn't mess with the Y register (always a 1 in this implementation) then we enter at $0004 (next) and (usually) will arrive at the target in ... (drum roll)

I'm calling this 17-cycle version my final edit on NEXT. Famous last words. It's a bit of a mix of volksForth and Anton's direct threading suggestion.

Code: Select all

; This NEXT gets copied to zero page at address 0.  
;
; at exit of this routine (primitives may assume on entry)
;    Y register is 1
;    C flag is clear
puta        sta tos        ; 0000
next1       ldy #1         ; 0002 ; 1 -> Y
next        tya            ; 0004 [2]
nexta       sec            ; 0005 [2]
            adc ip         ; 0006 [3]
            sta ip         ; 0008 [3]  IP+2 -> IP
            bcs *+5        ; 000A [2]
            jmp ($ffff)    ; 000C [5] (IP) -> PC
            inc ip+1       ; 000F
nextjmp     clc            ; 0011 ; 0 -> C
            bcc *-6        ; 0012
                           ; 0014 tos

I was a little stymied at first trying to figure out what would I do without a "W" register, and am still struggling a bit getting the rest of it (ENTER, EXIT, EXECUTE) all working together.
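What the NEXT listing above does can be written out as a tiny Python model: bump IP by two, then jump to the address held in the cell IP now points at. This is a sketch, assuming a 64K `bytearray` for memory; `next_dtc` and `FAST_PATH` are made-up names, and the cycle counts are the bracketed figures from the listing.

```python
def read_cell(mem, addr):
    """Fetch a 16-bit little-endian cell."""
    return mem[addr] | (mem[addr + 1] << 8)


def next_dtc(mem, ip):
    """IP+2 -> IP, then (IP) -> PC. Returns (new_ip, pc)."""
    ip = (ip + 2) & 0xFFFF      # tya / sec / adc ip / sta ip (+ inc ip+1 on carry)
    return ip, read_cell(mem, ip)  # jmp (ip)


# Bracketed cycle counts from the listing: tya, sec, adc, sta, bcs, jmp ().
FAST_PATH = [2, 2, 3, 3, 2, 5]

if __name__ == "__main__":
    mem = bytearray(0x10000)
    mem[0x4028:0x402C] = bytes([0x06, 0x40, 0x14, 0x40])  # two example cells
    ip = 0x4026                  # two bytes before the first cell
    ip, pc = next_dtc(mem, ip)
    print(hex(ip), hex(pc), sum(FAST_PATH))
```

Because IP is incremented before the jump, whatever loads IP (ENTER, EXIT, branches) would presumably have to leave it two bytes short of the next cell to execute. That is a reading of the listing, not something stated in the post.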

This version of NEXT uses JMP (indirect), which of course is subject to the NMOS JMP ($xxFF) anomaly. I don't want to go off into the weeds discussing an issue that I could solve by hardware substitution. Since the NMOS 6502 from MOS Technology is what's in my 30-year-old PET 2001 computer, it's an immutable law of the universe for the purposes of this project.

Obviously I won't be inflicting the pain of page boundary checks on NEXT when the compiler can take care of me. Probably something like check each word as it gets compiled into the dictionary and set a BOGUSXXFF flag if a word gets split at the page boundary. If that happens, issue an error message and cease compiling. The developer would then be responsible for juggling things around a bit. Ugly, I know.
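A Python sketch of both halves of this: a model of the NMOS JMP ($xxFF) behaviour, and the kind of compile-time check described above. All names (`nmos_jmp_indirect`, `cell_is_split`, `compile_cell`, `PageSplitError`) are hypothetical, and memory is assumed to be a 64K `bytearray`.

```python
class PageSplitError(Exception):
    """Raised when a cell would straddle a page boundary."""


def nmos_jmp_indirect(mem, ptr):
    """Target of JMP (ptr) on an NMOS 6502.

    The high byte is fetched without carrying into the pointer's page,
    so JMP ($40FF) reads its high byte from $4000 instead of $4100.
    """
    lo = mem[ptr]
    hi = mem[(ptr & 0xFF00) | ((ptr + 1) & 0x00FF)]
    return lo | (hi << 8)


def cell_is_split(addr):
    """True if a 2-byte cell at addr would straddle a page boundary."""
    return (addr & 0xFF) == 0xFF


def compile_cell(mem, here, value):
    """Store one cell at `here`; refuse if it would be split. Returns new here."""
    if cell_is_split(here):
        raise PageSplitError(f"cell at ${here:04X} straddles a page boundary")
    mem[here] = value & 0xFF
    mem[here + 1] = (value >> 8) & 0xFF
    return here + 2


if __name__ == "__main__":
    mem = bytearray(0x10000)
    mem[0x40FF], mem[0x4100], mem[0x4000] = 0x34, 0x12, 0x99
    print(hex(nmos_jmp_indirect(mem, 0x40FF)))  # high byte taken from $4000
```

Padding with a spare byte, as Fig-Forth's CREATE does for code fields, would be the alternative to stopping with an error.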

I considered both the STC and yet-another-ITC models, but my unfamiliarity with DTC and Brad Rodriguez' excellent article here: http://www.bradrodriguez.com/papers/moving1.htm have me committed to implementing the direct threaded approach. Secondaries and primitives will look something like this (using the Blazin' Forth assembler syntax)

Here's my model for the dictionary (figForth/Blazin' Forth), but there's no CFA word because in direct-threaded code there is always machine code (instead of a pointer to it) at a given word's address.

Code: Select all

code foo ( -- ) 
   ascii * # lda,
   hex ffd2 jsr,
   next jmp,
end-code
: baz 1 2 + ;
: bar foo baz . ;

Code: Select all

bit7 = $80

next = 4
enter = 475
exit = 450
one = 500
two = 600
plus = 700
dot = 800

*=$4000
lfa0    .word 0
        .byt foo-*-1|bit7
        .asc "FO","O"|bit7
foo     lda #$2a           ; foo is a primitive
        jsr $ffd2
        jmp next

lfa1    .word lfa0
        .byt baz-*-1|bit7
        .asc "BA","Z"|bit7
baz     jmp enter          ; baz is a secondary
        .word one
        .word two
        .word plus
        .word exit

lfa2    .word lfa1
        .byt bar-*-1|bit7
        .asc "BA","R"|bit7
bar     jmp enter         ; bar is a secondary
        .word foo
        .word baz
        .word dot
        .word exit

4000  00 00 83 46 4f cf a9 2a
4008  20 d2 ff 4c 04 00 00 40
4010  83 42 41 da 4c db 01 f4
4018  01 58 02 bc 02 c2 01 0e
4020  40 83 42 41 d2 4c db 01
4028  06 40 14 40 20 03 c2 01
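The header layout can be checked by rebuilding the same three entries in Python and comparing against the hex dump above. This is a sketch: the helper names are made up, and the kernel addresses are the placeholder values from the listing (`next = 4`, `enter = 475`, and so on).

```python
import struct

BIT7 = 0x80
NEXT, ENTER, EXIT = 4, 475, 450          # placeholder addresses from the listing
ONE, TWO, PLUS, DOT = 500, 600, 700, 800
ORIGIN = 0x4000


def header(link, name):
    """Link field, length byte with bit 7 set, name with bit 7 set on its last char."""
    chars = bytearray(name.encode("ascii"))
    chars[-1] |= BIT7
    return struct.pack("<H", link) + bytes([len(name) | BIT7]) + bytes(chars)


def jmp(addr):
    """JMP absolute: opcode $4C plus little-endian address."""
    return b"\x4c" + struct.pack("<H", addr)


def cells(*addrs):
    """A run of 16-bit little-endian cells."""
    return b"".join(struct.pack("<H", a) for a in addrs)


def build_dictionary():
    image = bytearray()
    lfa0 = ORIGIN + len(image)
    image += header(0, "FOO")
    foo = ORIGIN + len(image)
    image += b"\xa9\x2a" + b"\x20\xd2\xff" + jmp(NEXT)   # lda #$2a / jsr $ffd2 / jmp next
    lfa1 = ORIGIN + len(image)
    image += header(lfa0, "BAZ")
    baz = ORIGIN + len(image)
    image += jmp(ENTER) + cells(ONE, TWO, PLUS, EXIT)
    image += header(lfa1, "BAR")
    bar = ORIGIN + len(image)
    image += jmp(ENTER) + cells(foo, baz, DOT, EXIT)
    return bytes(image), {"foo": foo, "baz": baz, "bar": bar}


DUMP = (
    "00 00 83 46 4f cf a9 2a 20 d2 ff 4c 04 00 00 40 "
    "83 42 41 da 4c db 01 f4 01 58 02 bc 02 c2 01 0e "
    "40 83 42 41 d2 4c db 01 06 40 14 40 20 03 c2 01"
)

if __name__ == "__main__":
    image, labels = build_dictionary()
    print(image == bytes.fromhex(DUMP), {k: hex(v) for k, v in labels.items()})
```

Note that the cells in `bar` hold the addresses of `foo` and `baz` themselves, where the machine code starts, which is the direct-threaded property described above.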

kc5tja · Post by **kc5tja** » Mon Aug 23, 2010 4:33 pm

Dr Jefyll wrote:

Y is used to hold the least-significant byte of IP

Notice I gave the ballpark figure of 30 cycles. 25 is perhaps the absolute minimum without fundamentally changing threading techniques. I, too, have used Y as the least significant byte of IP; however, to ensure fastest operation, it requires that all cell addresses be word-aligned to ensure no page crossings occur in the first INY.

It's still twice as slow as a simple JSR/RTS pair, though, and significantly slower still than the 80486's direct-threading implementation (which, just for comparison's sake, comes in at 2 to 4 cycles IIRC).

But, what can you do?

dclxvi · Post by **dclxvi** » Wed Aug 25, 2010 2:38 am

GARTHWILSON wrote:

Make sure you also fix the bugs in UM/MOD and UM* which is called U* in FIG-Forth.

An optimized version of the UM/MOD fix can be found here (along with the UM* fix):

http://6502.org/tutorials/65c02opcodes.html

kc5tja wrote:

Well, he's optimizing his implementation to a degree that a mechanical translation of code would require a pretty complex compiler for. In other words, his STC environment is efficient because of the efforts he's invested elsewhere. That is, external efficiencies make up for the space overhead of his compiler's generated code.

There is an alternative to optimizing ANS Forth by attempting to recognize various known sequences (I think it's mostly a matter of keeping track of the last N words that were compiled -- you have to do that anyway for N=1 if you're going to optimize SOMEWORD ; as a JMP -- but I haven't attempted to implement it, so I may be overlooking a thing or two or three...) that I alluded to back then. Namely, find the "100 words" that represent the efficient sequences and use those. In other words, most extant Forth compilers don't optimize the 2 word sequence 1 - as the 1 word sequence 1-. In the space-optimized STC case, you'd have a one word name for what would be, say, a four word sequence in ANS Forth, and you might wind up with a bunch of words that don't look much like Forth as we know it, but would produce small code. I don't know what those "100 words" actually are, so my old post is still pretty solidly in the realm of speculation.

The direction I've been leaning for quite some time now with an STC 6502 Forth is to gear it for speed, and if it's smaller as well then great, but if not then I'm willing to take the space hit.

Anyway, here's a DTC NEXT for the 65C02 that I came up with some time ago. It's not particularly practical, since you must preserve the X register, which is the most useful register for the data stack pointer (meaning any savings in NEXT itself are pretty much wiped out). But as an assembly-golf thing for the amusement of you all, here it is:

Code: Select all

NEXT INX           ;2
     INX           ;2
     BEQ :2        ;2
:1   JMP ($FF00,X) ;6
:2   INC :1+2
     JMP :1

kc5tja · Post by **kc5tja** » Wed Aug 25, 2010 2:53 am

NICE!!

Though in this case, I'd use Y as the data stack pointer, and reference the data stack in direct-page using absolute addresses. E.g., instead of:

Code: Select all

LDA $00,X

I'd use:

Code: Select all

LDA $0000,Y

I think LDA in absolute,Y mode takes the same 4 cycles as DP,X mode (STA is one cycle longer), at the cost of one extra byte per instruction, which shouldn't be much of an issue, I'd think.

The only difficulty would come from when you need to invoke @ and !, I think, as here you'd need to tweak the Y register and a reserved direct-page location to serve as your pointer.