Is this the optimal 6502 NEXT?

chitselb · Post by **chitselb** » Fri Mar 21, 2014 6:45 am

phooey. Charlie-brain-damage admission.

I would have to explicitly clear carry to pull this off, making it the same 15 clocks. Or have a contract with every primitive to ensure carry flag is clear before jmp next.

Nevermind!

Dr Jefyll · Post by **Dr Jefyll** » Fri Mar 21, 2014 7:54 am

chitselb wrote:

Or have a contract with every primitive to ensure carry flag is clear before jmp next.

I don't see a problem with that. At worst you break even. What you're doing looks good to me.

Brad R · Post by **Brad R** » Fri Mar 21, 2014 12:30 pm

chitselb wrote:

I'm using a design where the compiler

- realigns at page boundaries to work around the jmp ($xxff) bug
- inserts a call to the word "page" when it compiles definitions (where they cross page boundaries)

All page does when it executes is 'inc ip+1', to cross the page. That unburdens NEXT considerably, at the expense of compiler complexity.

Very clever. I like it! And it adds relatively little to compiler complexity.

E.g., in CamelForth all threads are compiled through a word ,XT (append eXecution Token), which turns out to be equivalent to the ANS word COMPILE, . I would merely change ,XT to test for HERE=$xxFE, and compile the call to "page" in that case. (I assume that you're keeping the threads 16-bit aligned.)

There will be some added complexity for words that fetch an in-line value, such as literal, ." , and the branch operators. They'll need to be smart about page boundaries (or simply use a 16-bit increment for IP).
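As a rough illustration of the compile-time check described above, here is a small Python model (not 6502 code, and not chitselb's or CamelForth's actual compiler). It assumes 2-byte cells on even addresses; the names `compile_thread` and `run_thread` and the `PAGE` token are made up for the sketch. The model NEXT only ever adds 2 to the low byte of IP, and the compiled "page" word is the only thing that touches the high byte.

```python
PAGE = "page"  # hypothetical token standing in for the compiled call to "page"


def compile_thread(here, tokens):
    """Lay out 2-byte cells from `here`, putting PAGE in the cell at $xxFE."""
    assert here % 2 == 0, "threads are assumed to be 16-bit aligned"
    memory = {}
    for tok in tokens:
        if here & 0xFF == 0xFE:      # the HERE=$xxFE test
            memory[here] = PAGE
            here += 2
        memory[here] = tok
        here += 2
    return memory, here


def run_thread(memory, ip, count):
    """Walk the thread with a NEXT that never carries into the high byte of IP."""
    executed = []
    while len(executed) < count:
        tok = memory[ip]
        if tok == PAGE:
            ip = (ip + 0x100) & 0xFFFF               # 'inc ip+1'
        else:
            executed.append(tok)
        ip = (ip & 0xFF00) | ((ip + 2) & 0xFF)       # 8-bit add on the low byte
    return executed


tokens = list(range(300))
memory, end = compile_thread(0x10F0, tokens)
print(run_thread(memory, 0x10F0, len(tokens)) == tokens)
```

Because the cells stay on even addresses, no cell ever starts at $xxFF, which is the address that triggers the jmp ($xxff) bug. Words that fetch in-line values are not modelled here.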

White Flame · Post by **White Flame** » Fri Mar 21, 2014 3:24 pm

Since I'm looking at my AcheronVM again, I noticed its dispatch was pretty quick, all things considered. Could be useful for a token-threaded Forth, I guess:

Code:

mainLoop1:     ; Main loop entry when an instruction consumed an operand byte, so bump instruction pointer an extra 1
 iny
 bmi _iptrOverflow

mainLoop0:     ; Main loop entry when an instruction had no operand bytes
 lda (iptr),y
 iny
 sta *+4
 jmp (dispatchTable)  ; low byte is self-modded

All of the tokens are even numbers, so no ASL is required to index into the 16-bit dispatch table. This encoding allows 128 possible tokens.

The current program counter is iptr + Y. Instruction implementations must preserve .Y, and there is another mainLoopY entry point which does "ldy saveY" for convenience, given that the implementation stashed Y there before trampling it. The code regularly but not exhaustively checks to see if Y is greater than 127 and coalesces the value back into iptr if it reaches that range. It's nice that instructions can freely grab operand bytes using iny without having to always perform this check (though obviously wildly degenerate cases will break), and it doesn't break anything when page boundaries are crossed. The only issue is that the lda (iptr),y instructions take another cycle in those cases. Any branch, jump, etc, sets the new iptr and resets Y to zero.
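To make the iptr + Y bookkeeping concrete, here is a small Python model of the fetch path. It is only a sketch: `_iptrOverflow` is not shown in the post, so folding Y back into iptr is an assumption based on the description, and the `Fetcher` class and `handler_index` function are hypothetical names.

```python
class Fetcher:
    """Program counter kept as a 16-bit base (iptr) plus an 8-bit offset (Y)."""

    def __init__(self, memory, start):
        self.memory = memory
        self.iptr = start
        self.y = 0

    def fetch(self):
        # lda (iptr),y / iny
        value = self.memory[(self.iptr + self.y) & 0xFFFF]
        self.y = (self.y + 1) & 0xFF
        return value

    def coalesce_if_negative(self):
        # Assumed behaviour of the overflow path: once bit 7 of Y is set
        # (the bmi case), add Y into iptr and restart Y at zero.
        if self.y & 0x80:
            self.iptr = (self.iptr + self.y) & 0xFFFF
            self.y = 0


def handler_index(token):
    # Tokens are even, so the token is already the byte offset of a
    # 16-bit entry in the dispatch table: no shift (ASL) is needed.
    if token & 1 or not 0 <= token <= 0xFE:
        raise ValueError("tokens must be even and fit in one byte")
    return token // 2    # which of the 128 entries this token selects
```

The check does not have to run after every fetch: as long as it runs before Y can wrap past 255, iptr + Y keeps pointing at the right byte.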

barrym95838 · Post by **barrym95838** » Thu Mar 27, 2014 7:10 pm

I never get tired of seeing all of the cool ways (zp),y gets used ... it's one of the most important keys to the 6502's power and versatility, IMO.

Mike

chitselb · Post by **chitselb** » Fri May 08, 2015 3:00 am

This was a huge win today! I replaced 27 bytes of EXECUTE code with just 11, eliminating the need for a "W" register (as is typical of Indirect Threaded implementations)

Code:

execute
    lda tos+1
    pha
    lda tos
    pha
    jsr slide      ; identical to DROP but ends in RTS instead of NEXT
    php
    rti

for comparison, here's the old dog chow:
;--------------------------------------------------------------
; W register, used by EXECUTE
;w1
;    .word $dead             ; (for when you just need a W register)
;    .word exit              ; 'fragment secondary' used by EXECUTE

;--------------------------------------------------------------
#if 0
name=EXECUTE
tags=inner
stack=( cfa -- )
Executes the word whose code field address is on the stack.

#endif
;execute
;    lda tos                 ; <-- code field address
;    sta w1                  ; in direct-threaded models, this
;    lda tos+1               ; contains code instead of a pointer
;    sta w1+1                ; [SP] -> [W1]
;    lda ip+1
;    pha
;    lda ip
;    pha
;    lda #<(w1-2)
;    sta ip
;    lda #>(w1-2)
;    sta ip+1
;    jmp pops

GARTHWILSON · Post by **GARTHWILSON** » Fri May 08, 2015 3:32 am

What I'm running for 65c02 is:

Code:

        LDA  0,X
        STA  W
        LDA  1,X
        STA  W+1
        INX
        INX
        JMP  W-1

My '816 ITC has basically the same thing:

Code:

         HEADER "EXECUTE", NOT_IMMEDIATE     ; ( addr -- )
EXECUTE: PRIMITIVE
         LDA  0,X
 xeq1:   STA  W
         INX_INX
         JMP  W-1
 ;-------------------

and the label is for PERFORM:

Code:

         HEADER "PERFORM", NOT_IMMEDIATE      ; ( addr -- )
PERFORM: PRIMITIVE                ; same as  @ EXECUTE
         LDA  (0,X)
         BRA  xeq1
 ;-------------------

barrym95838 · Post by **barrym95838** » Fri May 08, 2015 4:41 am

65m32:

Code:

                 ;  154 ;--------------------------------------------------------------- EXECUTE
0000029e:00000289;  155         NOT_IMM 'EXECUTE'
0000029f:07455845;  155 
000002a0:43555445;  155 
                 ;  156  _execute: ; ( xt --  ) \ Execute Forth word 'xt'
000002a1:62020000;  157         pda  ,x+        ; Pop xt from S: and push it to R:
000002a2:5e0c0000;  158         rts             ; Pop xt from R: and jump to it

This is an example of a disadvantage of keeping TOS-in-A. If I kept it at 0,x then EXECUTE would be only a single machine instruction: jmp (,x+).

Mike B.

php rti is a neat hack, Charlie!
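For anyone wondering why the sequence ends in php / rti rather than rts: RTS adds 1 to the address it pulls, whereas RTI pulls the status byte first and then uses the pulled address unchanged. A tiny Python model of the stack traffic follows. The function names are invented for the sketch, and it ignores `jsr slide`, which pushes and pulls its own return address.

```python
def push_word(stack, address):
    stack.append((address >> 8) & 0xFF)   # lda tos+1 / pha
    stack.append(address & 0xFF)          # lda tos   / pha


def rts(stack):
    lo = stack.pop()
    hi = stack.pop()
    return (((hi << 8) | lo) + 1) & 0xFFFF    # RTS resumes at pulled address + 1


def rti(stack):
    stack.pop()                               # status byte pushed by php
    lo = stack.pop()
    hi = stack.pop()
    return (hi << 8) | lo                     # RTI uses the pulled address as is


def execute_via_rti(tos, status=0x30):
    stack = []
    push_word(stack, tos)
    stack.append(status)                      # php
    return rti(stack)


def execute_via_rts(tos):
    stack = []
    push_word(stack, tos)
    return rts(stack)


print(hex(execute_via_rti(0x1234)), hex(execute_via_rts(0x1234)))
```

With a plain rts the word would have to push tos-1 instead, so php / rti saves the decrement.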

nonarkitten · Post by **nonarkitten** » Sat Sep 24, 2016 6:54 am

So, why not simply use Y to hold the LSB and store it into the zero-page JMP instruction? This gets NEXT down to 12 cycles on a 6502, approaching subroutine-threading performance, without clobbering the C flag (allowing multi-precision arithmetic).

Code:

PAGE    INC IP+2     ; seldom executed
NEXT    INY          ; 2 cycles
        INY          ; 2 cycles
        STY IP+1     ; 3 cycles
IP      JMP ($0000)  ; 5 cycles

It might seem terrible to lose both the X and Y registers, but the Y register can be nuked and then reloaded with LDY IP+1, which only adds three cycles to whichever primitive is being written.

Code:

CLIT    LDY #0       ; 2 cycles
        LDA (IP+1),Y ; 5 cycles
        DEX          ; 2 cycles
        STA (0,X)    ; 6 cycles
        LDY IP+1     ; 3 cycles
        INY          ; 2 cycles
        JMP NEXT     ; 3 cycles

Dr Jefyll · Post by **Dr Jefyll** » Sun Sep 25, 2016 6:43 am

nonarkitten wrote:

Code:

PAGE    INC IP+2     ; seldom executed
NEXT    INY          ; 2 cycles
        INY          ; 2 cycles
        STY IP+1     ; 3 cycles
IP      JMP ($0000)  ; 5 cycles

This is pretty cool. Fast -- I like it.

Minor added detail: 5 is correct clock count for the JMP (ind) "featured" in the NMOS 6502, but the 65C02's bug-fixed JMP (ind) increases that count to 6. (Until now I didn't realize the '816 reduces it back to 5.)
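Putting those counts together, here is a quick tally in Python of the four-instruction NEXT quoted above. The per-instruction counts are the ones given in the thread; the '816 row additionally assumes 8-bit index registers and a page-aligned direct page, so that STY IP+1 stays at 3 cycles.

```python
NEXT_BODY = [("INY", 2), ("INY", 2), ("STY IP+1", 3)]
JMP_INDIRECT = {"NMOS 6502": 5, "65C02": 6, "65816": 5}
LDY_ZP = 3  # the optional 'LDY IP+1' reload after a primitive has used Y


def next_cycles(cpu, reload_y=False):
    total = sum(cycles for _, cycles in NEXT_BODY) + JMP_INDIRECT[cpu]
    return total + (LDY_ZP if reload_y else 0)


for cpu in JMP_INDIRECT:
    print(f"{cpu:10s} NEXT = {next_cycles(cpu)}, with Y reload = {next_cycles(cpu, True)}")
```

So the 12-cycle figure holds for the NMOS part and the '816, and becomes 13 on the 65C02.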

Quote:

the Y register can be nuked and then reloaded with LDY IP+1 which only adds three cycles

Good point. In fact, maybe the Y reload is a better candidate for occupying the fallthrough position preceding NEXT.
IOW, replace this

Code:

PAGE    INC IP+2     ; seldom executed  <-------
NEXT    INY          ; 2 cycles
        INY          ; 2 cycles
        STY IP+1     ; 3 cycles
IP      JMP ($0000)  ; 5 cycles

with this

Code:

FIX_Y   LDY IP+1     ;reload bombed Y <-------
NEXT    INY          ; 2 cycles
        INY          ; 2 cycles
        STY IP+1     ; 3 cycles
IP      JMP ($0000)  ; 5 cycles

(There could still be a PAGE label. But after the INC IP+2 it would JMP or BRA to NEXT rather than falling through.)

-- Jeff

PS- Shouldn't this be STA 0,X? I'd also expect CLIT to write zero to the high byte of TOS at 1,X.

Code:

CLIT    LDY #0       ; 2 cycles
        LDA (IP+1),Y ; 5 cycles
        DEX          ; 2 cycles
        STA (0,X)    ; 6 cycles  <-------
        LDY IP+1     ; 3 cycles
        INY          ; 2 cycles
        JMP NEXT     ; 3 cycles

BigDumbDinosaur · Post by **BigDumbDinosaur** » Sun Sep 25, 2016 7:59 am

Dr Jefyll wrote:

Minor added detail: 5 is correct clock count for the JMP (ind) "featured" in the NMOS 6502, but the 65C02's bug-fixed JMP (ind) increases that count to 6. (Until now I didn't realize the '816 reduces it back to 5.)

You can thank the 16 bit ALU in the '816 for that.

chitselb · Post by **chitselb** » Wed Oct 12, 2016 3:43 pm

nonarkitten wrote:

So, why not simply use Y to hold the LSB and store it into the zero-page JMP instruction? This gets NEXT down to 12 cycles on a 6502, approaching subroutine-threading performance, without clobbering the C flag (allowing multi-precision arithmetic).

It creates a contract that each primitive must return Y = IP[low], and register contracts are scary. Primitives that modify the Y register have to give the 3 cycles back with the LDY IP+1, which also adds two bytes to all such primitives. But if I were to go this route, I'd probably do it like so (this increases the size of the zero-page code from 7 to 9 bytes):

Code:

;this section lives in zero page
NEXTY   LDY IP+1     ;[3] restore instruction pointer
                     ;    (unnecessary if primitive didn't alter Y)
NEXT    INY          ;[2]
        INY          ;[2]
NEXTB   STY IP+1     ;[3]
IP      JMP ($CAFE)  ;[5]

;this part is in high memory
PAGE    INC IP+2     ;[5]
        LDY #0       ;[2]
        JMP NEXTB    ;[3]
 
CLIT    DEX          ;[2]
        LDY #0       ;[2]
        LDA (IP+1),Y ;[5]
        STY STACKH,X ;[4]
        STA STACKL,X ;[4] (I'm using a split stack)
        INC IP+1     ;[5] get past the literal byte
        JMP NEXTY    ;[3]

whartung · Post by **whartung** » Thu Oct 20, 2016 8:54 pm

FIG-Forth already has such a contract with X, since X is the data stack pointer.

JimBoyd · Post by **JimBoyd** » Sat Nov 13, 2021 3:12 am

GARTHWILSON wrote:

FIG-Forth's NEXT leaves Y=0 and then some of the words take advantage of that so they don't have to zero it first. I had considered making NEXT leave C unaffected to make it easier to do multi-precision arithmetic in secondaries.

If I were interested in multi-precision math in secondaries, that is not how I would do it.
Making NEXT preserve the state of the carry flag would burden all Forth words with the extra overhead. This is a quick rough draft of what I might try.

Code:

VARIABLE CARRY
CODE (MP+)  ( N1 N2 -- N3 )  \ multi-precision plus
   CARRY ROR
   0 ,X LDA  2 ,X ADC  2 ,X STA
   1 ,X LDA  3 ,X ADC  3 ,X STA
   CARRY ROL
   POP JMP  END-CODE
CODE +  ( N1 N2 -- N3 )  \ traditional plus
   CLC
   ' (MP+) @ 3 + JMP
   END-CODE

Or, if two consecutive zero-page locations are available:

Code:

<ZP-LOCATION> CONSTANT CARRY
CODE (MP+)  ( N1 N2 -- N3 )  \ multi-precision plus
   CARRY ROR
   0 ,X LDA  2 ,X ADC  2 ,X STA
   1 ,X LDA  3 ,X ADC  3 ,X STA
   CARRY ROL
   POP JMP  END-CODE
CODE +
   CLC
   ' (MP+) @ 2+ JMP
   END-CODE

The code uses only one byte of the variable CARRY, but this is how to clear and set the carry for multi-precision math in high-level Forth:

Code:

CARRY OFF  \ clear carry
CARRY ON   \ set carry

A hypothetical example of multi-precision addition in high-level Forth using (MP+):

Code:

CREATE VALUE1 20 ALLOT
CREATE VALUE2 20 ALLOT
\ add 20 byte (10 cell) numbers in VALUE1 and VALUE2
\ return result in VALUE1
: MP+  ( ADR1 ADR2 #CELLS -- )
   CARRY OFF 2* 0
   ?DO
      OVER I + @  OVER I + @  (MP+)
      OVER I + !
      2 
   +LOOP
   2DROP ;
VALUE2 VALUE1 10 MP+
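To check the carry handling on paper, here is a small Python model of the idea (a sketch, not a translation of any real Forth system). It assumes 16-bit cells stored least significant cell first, models the two ADCs as a single 16-bit add, and assumes ON stores all ones; `Machine`, `mp_add`, `to_cells` and `from_cells` are names invented for the model.

```python
CELL_BITS = 16
CELL_MASK = (1 << CELL_BITS) - 1


class Machine:
    """Just enough state: the CPU carry flag and the one byte of CARRY used."""

    def __init__(self):
        self.c = 0            # processor carry flag
        self.carry_byte = 0   # the byte of CARRY that (MP+) touches

    def carry_off(self):      # CARRY OFF
        self.carry_byte = 0x00

    def carry_on(self):       # CARRY ON (assumed to store all ones)
        self.carry_byte = 0xFF

    def ror_carry(self):      # CARRY ROR: bit 0 -> C, old C -> bit 7
        old_c = self.c
        self.c = self.carry_byte & 1
        self.carry_byte = (old_c << 7) | (self.carry_byte >> 1)

    def rol_carry(self):      # CARRY ROL: C -> bit 0, bit 7 -> C
        old_c = self.c
        self.c = self.carry_byte >> 7
        self.carry_byte = ((self.carry_byte << 1) & 0xFF) | old_c

    def mp_plus(self, n1, n2):            # (MP+)
        self.ror_carry()
        total = n1 + n2 + self.c          # both ADCs as one 16-bit add
        self.c = total >> CELL_BITS
        self.rol_carry()
        return total & CELL_MASK


def mp_add(machine, addend, accumulator):
    """MP+ : add `addend` into `accumulator`, least significant cell first."""
    machine.carry_off()
    for i in range(len(accumulator)):
        accumulator[i] = machine.mp_plus(addend[i], accumulator[i])


def to_cells(value, count):
    return [(value >> (CELL_BITS * i)) & CELL_MASK for i in range(count)]


def from_cells(cells):
    return sum(cell << (CELL_BITS * i) for i, cell in enumerate(cells))


m = Machine()
value1 = to_cells(0xFFFF_FFFF, 10)
mp_add(m, to_cells(1, 10), value1)
print(hex(from_cells(value1)))
```

Note that the ROR/ROL pair leaves the processor's own carry flag as it found it; only bit 0 of the CARRY byte changes between calls.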

agsb · Post by **agsb** » Fri Jan 13, 2023 6:40 pm

;
; using Minimal Indirect Thread Code
; https://github.com/agsb/immu/blob/main/ ... h%20en.pdf
;
; In classic ITC, the inner interpreter always jumps;
; DOCOL pushes onto the return stack and SEMIS pulls from it.
;
; In minimal ITC, the inner interpreter jumps only when the first reference
; is 0x0000, which marks every primitive. Otherwise it does not jump: it nests
; directly, pushing the reference onto the return stack, and every word ends
; with unnest, which pulls the reference back from the return stack.
;
; Because there are fewer primitives than compound words,
; this is an attractive option for the inner interpreter:
; all compound words are one cell shorter
; (they need not begin with DOCOL),
; there is no dependence on IP or W to hold the next reference
; (it is kept on the return stack),
; it is easy for MCUs or CPUs with separate code and data memory
; (compound words can stay in a non-executable 'data' segment, with
; only the inner address interpreter and primitives in the executable 'code' segment),
; and on the RISC-V ISA it gives an easy, fast inner interpreter.
;
;
; First version for 6502, no optimizations
;
; r0, top of return stack, Y indexed, any memory page
; p0, top of parameter stack, X indexed, any memory page
; tos, nos, pseudo registers in page zero
; wrk, ptr, pseudo registers in page zero
; a_save, x_save, y_save, keep values
; no use of A, Y as TOS, just as accumulator
;

unnest: ; aka semis, pull from return stack
lda r0 + 0, y
sta ptr + 0
lda r0 + 1, y
sta ptr + 1
iny
iny
; jmp next

next:
; as is, classic ITC from fig-forth 6502
sty y_save
ldy #0
lda (ptr), y
sta wrk + 0
iny
lda (ptr), y
sta wrk + 1
ldy y_save
; jmp refer

refer:
; as is, classic ITC from fig-forth 6502
; pointer to next reference
clc
lda ptr + 0
adc #CELL_SIZE
sta ptr + 0
bcc @end ; carry is set only when the low byte wrapped
inc ptr + 1
@end:
; jmp leaf

leaf:
; in MITC, every leaf (primitive) starts with NULL
; on the 6502, no code lives in page zero,
; so comparing the high byte is enough
lda wrk + 1
bne nest
; no Forth word lives in page zero
jmp (ptr)

nest: ; aka docol push into return stack
dey
dey
lda ptr + 1
sta r0 + 1, y
lda ptr + 0
sta r0 + 0, y
; jmp link

link:
; next reference
lda wrk + 0
sta ptr + 0
lda wrk + 1
sta ptr + 1
jmp next

Sorry for the long post.
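The control flow of the listing above can be followed more easily in a small Python model. This is only a sketch under stated assumptions: it tests the whole cell for zero rather than just the high byte, it stands in Python callables for the native code that follows the NULL cell, and the `exit_` primitive (which drops its own return address before the unnest) is a guess at how a compound word ends, not something the post spells out. All names are invented for the model.

```python
class MitcVM:
    def __init__(self):
        self.mem = {}         # address -> 16-bit cell
        self.code = {}        # address of native code -> Python callable
        self.rstack = []      # return stack
        self.dstack = []      # parameter stack
        self.here = 0x1000    # arbitrary start, well above page zero
        self.running = False

    def primitive(self, fn):
        addr = self.here
        self.mem[addr] = 0x0000           # NULL cell marks a leaf
        self.code[addr + 2] = fn          # native code follows the NULL
        self.here += 4
        return addr

    def compound(self, refs):
        addr = self.here
        for ref in refs:                  # no DOCOL cell in front
            self.mem[self.here] = ref
            self.here += 2
        return addr

    def run(self, entry):
        ptr = entry
        self.running = True
        while self.running:
            wrk = self.mem[ptr]           # next: fetch the reference
            ptr += 2                      # refer: point at the following cell
            if wrk == 0:                  # leaf: run the code after the NULL
                self.code[ptr](self)
                if self.running:
                    ptr = self.rstack.pop()    # unnest
            else:                         # nest: save our place, then link
                self.rstack.append(ptr)
                ptr = wrk


vm = MitcVM()
two = vm.primitive(lambda m: m.dstack.append(2))
three = vm.primitive(lambda m: m.dstack.append(3))
add = vm.primitive(lambda m: m.dstack.append(m.dstack.pop() + m.dstack.pop()))
# Assumed way to end a compound word: drop our own return address, so the
# unnest that follows resumes the caller of the compound word.
exit_ = vm.primitive(lambda m: m.rstack.pop())
bye = vm.primitive(lambda m: setattr(m, "running", False))

five = vm.compound([two, three, add, exit_])
main = vm.compound([five, five, add, bye])
vm.run(main)
print(vm.dstack)   # prints [10]
```

Note that every reference nests, primitives included; the NULL test is what turns the innermost nest into a jump to machine code.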
