Is this the optimal 6502 NEXT?

Topics relating to various Forth models on the 6502, 65816, and related microprocessors and microcontrollers.
chitselb
Posts: 232
Joined: 21 Aug 2010
Location: Ontonagon MI

Re: Is this the optimal 6502 NEXT?

Post by chitselb »

phooey. Charlie-brain-damage admission.

I would have to explicitly clear carry to pull this off, making it the same 15 clocks. Or have a contract with every primitive to ensure carry flag is clear before jmp next.

Nevermind!
Dr Jefyll
Posts: 3526
Joined: 11 Dec 2009
Location: Ontario, Canada

Post by Dr Jefyll »

chitselb wrote:
Or have a contract with every primitive to ensure carry flag is clear before jmp next.
I don't see a problem with that. At worst you break even. What you're doing looks good to me. :)
In 1988 my 65C02 got six new registers and 44 new full-speed instructions!
https://laughtonelectronics.com/Arcana/ ... mmary.html
Brad R
Posts: 93
Joined: 07 Jan 2014

Post by Brad R »

chitselb wrote:
I'm using a design where the compiler
  • realigns at page boundaries to work around the jmp ($xxff) bug
  • inserts a call to the word "page" when it compiles definitions (where they cross page boundaries.)
All page does when it executes is 'inc ip+1', to cross the page. That unburdens NEXT considerably, at the expense of compiler complexity
Very clever. I like it! And it adds relatively little to compiler complexity.

E.g., in CamelForth all threads are compiled through a word ,XT (append eXecution Token), which turns out to be equivalent to the ANS word COMPILE, . I would merely change ,XT to test for HERE=$xxFE, and compile the call to "page" in that case. (I assume that you're keeping the threads 16-bit aligned.)

There will be some added complexity for words that fetch an in-line value, such as literal, ." , and the branch operators. They'll need to be smart about page boundaries (or simply use a 16-bit increment for IP).
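A minimal sketch of the 16-bit increment mentioned above, for words that step IP past an in-line value. It assumes `ip` is a two-byte zero-page pointer stored low byte first; the label names are made up for illustration and are not taken from CamelForth or from chitselb's source.

```asm
; Assumption: ip is a two-byte zero-page pointer, low byte first.
bump_ip:
        inc ip          ; advance the low byte
        bne bump_done   ; result not $00, so no page boundary was crossed
        inc ip+1        ; low byte wrapped to $00: carry into the high byte
bump_done:
        rts
```

The common case costs only the INC and a taken branch; the extra INC runs once per page crossing.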
Because there are never enough Forth implementations: http://www.camelforth.com
White Flame
Posts: 704
Joined: 24 Jul 2012

Post by White Flame »

Since I'm looking at my AcheronVM again, I noticed its dispatch was pretty quick, all things considered. Could be useful for a token-threaded Forth, I guess:

Code: Select all

mainLoop1:     ; Main loop entry when an instruction consumed an operand byte, so bump instruction pointer an extra 1
 iny
 bmi _iptrOverflow

mainLoop0:     ; Main loop entry when an instruction had no operand bytes
 lda (iptr),y
 iny
 sta *+4
 jmp (dispatchTable)  ; low byte is self-modded
All tokens are even numbers, so no ASL is required to index into the 16-bit dispatch table. That leaves 128 possible tokens in this encoding.

The current program counter is iptr + Y. Instruction implementations must preserve .Y, and there is another mainLoopY entry point which does "ldy saveY" for convenience, given that the implementation stashed Y there before trampling it. The code regularly but not exhaustively checks to see if Y is greater than 127 and coalesces the value back into iptr if it reaches that range. It's nice that instructions can freely grab operand bytes using iny without having to always perform this check (though obviously wildly degenerate cases will break), and it doesn't break anything when page boundaries are crossed. The only issue is that the lda (iptr),y instructions take another cycle in those cases. Any branch, jump, etc, sets the new iptr and resets Y to zero.
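One way the coalescing step could look, as a sketch only: it is not AcheronVM's actual code. It reuses the `iptr`, `mainLoop0` and `_iptrOverflow` names from the listing above; `no_carry` is a made-up local label.

```asm
; Hypothetical body for _iptrOverflow: fold the offset in Y back into iptr.
_iptrOverflow:
        tya             ; A = current offset (128..255 when reached via BMI)
        clc
        adc iptr        ; add the offset to the pointer's low byte
        sta iptr
        bcc no_carry    ; no carry out: the high byte is unchanged
        inc iptr+1      ; carry out: the pointer moves into the next page
no_carry:
        ldy #0          ; the program counter is iptr + Y again, with Y = 0
        jmp mainLoop0   ; resume dispatch
```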
barrym95838
Posts: 2056
Joined: 30 Jun 2013
Location: Sacramento, CA, USA

Post by barrym95838 »

I never get tired of seeing all of the cool ways (zp),y gets used ... it's one of the most important keys to the 6502's power and versatility, IMO.

Mike
chitselb
Posts: 232
Joined: 21 Aug 2010
Location: Ontonagon MI

Post by chitselb »

This was a huge win today! I replaced 27 bytes of EXECUTE code with just 11, eliminating the need for the "W" register that indirect-threaded implementations typically use.

Code: Select all

execute
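    lda tos+1
    pha            ; push the high byte of the target address
    lda tos
    pha            ; push the low byte
    jsr slide      ; identical to DROP but ends in RTS instead of NEXT
    php            ; push a status byte for RTI to pull
    rti            ; pull status, then jump to the pushed address (no +1, unlike RTS)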

for comparison, here's the old dog chow:
;--------------------------------------------------------------
; W register, used by EXECUTE
;w1
;    .word $dead             ; (for when you just need a W register)
;    .word exit              ; 'fragment secondary' used by EXECUTE

;--------------------------------------------------------------
#if 0
name=EXECUTE
tags=inner
stack=( cfa -- )
Executes the word whose code field address is on the stack.

#endif
;execute
;    lda tos                 ; <-- code field address
;    sta w1                  ; in direct-threaded models, this
;    lda tos+1               ; contains code instead of a pointer
;    sta w1+1                ; [SP] -> [W1]
;    lda ip+1
;    pha
;    lda ip
;    pha
;    lda #<(w1-2)
;    sta ip
;    lda #>(w1-2)
;    sta ip+1
;    jmp pops
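For contrast, here is a sketch of the same jump done with RTS instead of PHP/RTI. RTS adds one to the address it pulls, so the target has to be decremented first, which is exactly the work the RTI version avoids. The names `tos` and `slide` are reused from the post above; `execute_rts` and `no_borrow` are made up, and the sketch assumes it is safe to modify `tos` in place because `slide` is about to drop it.

```asm
; Hypothetical RTS-based EXECUTE, for comparison only.
execute_rts:
        lda tos         ; is the low byte of the target $00?
        bne no_borrow
        dec tos+1       ; yes: subtracting 1 borrows from the high byte
no_borrow:
        dec tos         ; tos now holds target-1
        lda tos+1
        pha             ; high byte goes on the stack first
        lda tos
        pha             ; then the low byte
        jsr slide       ; drop the xt from the data stack, as before
        rts             ; pulls target-1, adds 1, lands on the target
```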
GARTHWILSON
Forum Moderator
Posts: 8774
Joined: 30 Aug 2002
Location: Southern California

Post by GARTHWILSON »

What I'm running for 65c02 is:

Code: Select all

        LDA  0,X
        STA  W
        LDA  1,X
        STA  W+1
        INX
        INX
        JMP  W-1
My '816 ITC has basically the same thing:

Code: Select all

         HEADER "EXECUTE", NOT_IMMEDIATE     ; ( addr -- )
EXECUTE: PRIMITIVE
         LDA  0,X
 xeq1:   STA  W
         INX_INX
         JMP  W-1
 ;-------------------
and the label is for PERFORM:

Code: Select all

         HEADER "PERFORM", NOT_IMMEDIATE      ; ( addr -- )
PERFORM: PRIMITIVE                ; same as  @ EXECUTE
         LDA  (0,X)
         BRA  xeq1
 ;-------------------
http://WilsonMinesCo.com/ lots of 6502 resources
The "second front page" is http://wilsonminesco.com/links.html .
What's an additional VIA among friends, anyhow?
barrym95838
Posts: 2056
Joined: 30 Jun 2013
Location: Sacramento, CA, USA

Post by barrym95838 »

65m32:

Code: Select all

                 ;  154 ;--------------------------------------------------------------- EXECUTE
0000029e:00000289;  155         NOT_IMM 'EXECUTE'
0000029f:07455845;  155 
000002a0:43555445;  155 
                 ;  156  _execute: ; ( xt --  ) \ Execute Forth word 'xt'
000002a1:62020000;  157         pda  ,x+        ; Pop xt from S: and push it to R:
000002a2:5e0c0000;  158         rts             ; Pop xt from R: and jump to it
This is an example of a disadvantage of keeping TOS-in-A. If I kept it at 0,x then EXECUTE would be only a single machine instruction: jmp (,x+).

Mike B.

php rti is a neat hack, Charlie!
nonarkitten
Posts: 2
Joined: 17 Oct 2014

Post by nonarkitten »

So why not simply use Y to hold the LSB and store it into the operand of the zero-page JMP instruction? That gets NEXT down to 12 cycles on a 6502, approaching subroutine-threading performance, without clobbering the C flag (so multi-precision arithmetic still works).

Code: Select all

PAGE    INC IP+2     ; seldom executed
NEXT    INY          ; 2 cycles
        INY          ; 2 cycles
        STY IP+1     ; 3 cycles
IP      JMP ($0000)  ; 5 cycles
It might seem terrible to lose both the X and Y registers, but the Y register can be nuked and then reloaded with LDY IP+1, which only adds three cycles to whichever primitive is being written.

Code: Select all

CLIT    LDY #0       ; 2 cycles
        LDA (IP+1),Y ; 5 cycles
        DEX          ; 2 cycles
        STA (0,X)    ; 6 cycles
        LDY IP+1     ; 3 cycles
        INY          ; 2 cycles
        JMP NEXT     ; 3 cycles
-- Zomg! Pewpew!!
Dr Jefyll
Posts: 3526
Joined: 11 Dec 2009
Location: Ontario, Canada

Post by Dr Jefyll »

nonarkitten wrote:

Code: Select all

PAGE    INC IP+2     ; seldom executed
NEXT    INY          ; 2 cycles
        INY          ; 2 cycles
        STY IP+1     ; 3 cycles
IP      JMP ($0000)  ; 5 cycles
This is pretty cool. Fast -- I like it.

Minor added detail: 5 is correct clock count for the JMP (ind) "featured" in the NMOS 6502, but the 65C02's bug-fixed JMP (ind) increases that count to 6. (Until now I didn't realize the '816 reduces it back to 5.)
Quote:
the Y register can be nuked and then reloaded with LDY IP+1 which only adds three cycles
Good point. In fact, maybe the Y reload is a better candidate for occupying the fallthrough position preceding NEXT.
IOW, replace this

Code: Select all

PAGE    INC IP+2     ; seldom executed  <-------
NEXT    INY          ; 2 cycles
        INY          ; 2 cycles
        STY IP+1     ; 3 cycles
IP      JMP ($0000)  ; 5 cycles
with this

Code: Select all

FIX_Y   LDY IP+1     ;reload bombed Y <-------
NEXT    INY          ; 2 cycles
        INY          ; 2 cycles
        STY IP+1     ; 3 cycles
IP      JMP ($0000)  ; 5 cycles
(There could still be a PAGE label. But after the INC IP+2 it would JMP or BRA to NEXT rather than falling through.)

-- Jeff

PS: Shouldn't this be STA 0,X? I'd also expect CLIT to write zero to the high byte of TOS at 1,X. :)

Code: Select all

CLIT    LDY #0       ; 2 cycles
        LDA (IP+1),Y ; 5 cycles
        DEX          ; 2 cycles
        STA (0,X)    ; 6 cycles  <-------
        LDY IP+1     ; 3 cycles
        INY          ; 2 cycles
        JMP NEXT     ; 3 cycles
BigDumbDinosaur
Posts: 9428
Joined: 28 May 2009
Location: Midwestern USA (JB Pritzker’s dystopia)

Post by BigDumbDinosaur »

Dr Jefyll wrote:
Minor added detail: 5 is correct clock count for the JMP (ind) "featured" in the NMOS 6502, but the 65C02's bug-fixed JMP (ind) increases that count to 6. (Until now I didn't realize the '816 reduces it back to 5.)
You can thank the 16 bit ALU in the '816 for that. :D
x86?  We ain't got no x86.  We don't NEED no stinking x86!
chitselb
Posts: 232
Joined: 21 Aug 2010
Location: Ontonagon MI

Post by chitselb »

nonarkitten wrote:
So why not simply use Y to hold the LSB and store it into the operand of the zero-page JMP instruction? That gets NEXT down to 12 cycles on a 6502, approaching subroutine-threading performance, without clobbering the C flag (so multi-precision arithmetic still works).
It creates a contract that each primitive must return Y = IP[low], and register contracts are scary. Primitives that modify the Y register have to give the 3 cycles back with the LDY IP+1, which also adds two bytes to all such primitives. But if I were to go this route, I'd probably do it like so, (increases the size of the zero page code from 7 to 9 bytes):

Code: Select all

;this section lives in zero page
NEXTY   LDY IP+1     ;[3] restore instruction pointer
                     ;    (unnecessary if primitive didn't alter Y)
NEXT    INY          ;[2]
        INY          ;[2]
NEXTB   STY IP+1     ;[3]
IP      JMP ($CAFE)  ;[5]

;this part is in high memory
PAGE    INC IP+2     ;[5]
        LDY #0       ;[2]
        JMP NEXTB    ;[3]
 
CLIT    DEX          ;[2]
        LDY #0       ;[2]
        LDA (IP+1),Y ;[5]
        STY STACKH,X ;[4]
        STA STACKL,X ;[4] (I'm using a split stack)
        INC IP+1     ;[5] get past the literal byte
        JMP NEXTY    ;[3]
whartung
Posts: 1004
Joined: 13 Dec 2003

Post by whartung »

The Fig-Forth already has such a contract with X, as it's the stack pointer.
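A sketch of what honouring that kind of contract looks like in a primitive that needs X for something else. FIG-Forth keeps a scratch byte called XSAVE for this purpose; here `xsave`, `buffer`, `next` and the primitive itself are all hypothetical names chosen for illustration.

```asm
; Hypothetical primitive that borrows X as a loop counter.
; Assumptions: xsave is a one-byte zero-page scratch location,
; buffer is an 8-byte RAM area, next is the inner interpreter entry.
clear8:
        stx xsave       ; X is the data stack pointer: save it first
        ldx #7
        lda #0
clear8_loop:
        sta buffer,x    ; zero buffer+7 down to buffer+0
        dex
        bpl clear8_loop ; loop until X wraps below zero
        ldx xsave       ; honour the contract: X restored before NEXT
        jmp next
```

A Y contract would cost the same pattern, with LDY IP+1 playing the role of the restore.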
JimBoyd
Posts: 931
Joined: 05 May 2017

Re:

Post by JimBoyd »

GARTHWILSON wrote:
FIG-Forth's NEXT leaves Y=0 and then some of the words take advantage of that so they don't have to zero it first. I had considered making NEXT leave C unaffected to make it easier to do multi-precision arithmetic in secondaries.

If I were interested in multi-precision math in secondaries, that is not how I would do it.
Making NEXT preserve the state of the carry flag would burden all Forth words with the extra overhead. This is a quick rough draft of what I might try.

Code: Select all

VARIABLE CARRY
CODE (MP+)  ( N1 N2 -- N3 )  \ multi-precision plus
   CARRY ROR
   0 ,X LDA  2 ,X ADC  2 ,X STA
   1 ,X LDA  3 ,X ADC  3 ,X STA
   CARRY ROL
   POP JMP  END-CODE
CODE +  ( N1 N2 -- N3 )  \ traditional plus
   CLC
   ' (MP+) @ 3 + JMP
   END-CODE

Or if there are two consecutive zero page locations available.

Code: Select all

<ZP-LOCATION> CONSTANT CARRY
CODE (MP+)  ( N1 N2 -- N3 )  \ multi-precision plus
   CARRY ROR
   0 ,X LDA  2 ,X ADC  2 ,X STA
   1 ,X LDA  3 ,X ADC  3 ,X STA
   CARRY ROL
   POP JMP  END-CODE
CODE +
   CLC
   ' (MP+) @ 2+ JMP
   END-CODE

The code uses only one byte of the variable CARRY, but CARRY OFF and CARRY ON then act as clear-carry and set-carry for high-level multi-precision math.

Code: Select all

CARRY OFF  \ clear carry
CARRY ON   \ set carry

A hypothetical example of multi-precision addition in high level Forth using (MP+).

Code: Select all

CREATE VALUE1 20 ALLOT
CREATE VALUE2 20 ALLOT
\ add 20 byte (10 cell) numbers in VALUE1 and VALUE2
\ return result in VALUE1
: MP+  ( ADR1 ADR2 #CELLS -- )
   CARRY OFF 2* 0
   ?DO
      OVER I + @  OVER I + @  (MP+)
      OVER I + !
      2 
   +LOOP
   2DROP ;
VALUE2 VALUE1 10 MP+

agsb
Posts: 31
Joined: 09 Jan 2023

Post by agsb »

;
; using Minimal Indirect Thread Code
; https://github.com/agsb/immu/blob/main/ ... h%20en.pdf
;
; In classic ITC, the inner interpreter always jumps:
; DOCOL pushes onto the return stack and SEMIS pulls from it.
;
; In minimal ITC, the inner interpreter jumps only when the first
; reference is 0x0000, which marks every primitive. Otherwise it does
; not jump; it nests directly, pushing the reference onto the return
; stack, and every word ends with unnest, which pulls the reference
; back from the return stack.
;
; Because there are fewer primitives than compound words, this gives
; the inner interpreter some advantages:
; - every compound word is one cell shorter
;   (it does not need to begin with DOCOL);
; - neither IP nor W is needed to hold the next reference
;   (it is kept on the return stack);
; - it suits MCUs and CPUs with separate code and data memory
;   (compound words can stay in a non-executable 'data' segment, with
;   only the inner interpreter and primitives in the executable
;   'code' segment);
; - on the RISC-V ISA it gives a simple, fast inner interpreter.
;
;
; First version for 6502, no optimizations
;
; r0, top of return stack, y indexed, any memory page
; p0, top of parameter stack, x index, any memory page
; tos, nos, pseudo registers at page zero
; wrk, ptr, pseudo registers at page zero
; a_save, x_save, y_save, keep values
; A and Y are not used to hold TOS, just as accumulators
;

unnest: ; aka semis, pull from return stack
lda r0 + 0, y
sta ptr + 0
lda r0 + 1, y
sta ptr + 1
iny
iny
; jmp next

next:
; as is, classic ITC from fig-forth 6502
sty y_save
ldy #0
lda (ptr), y
sta wrk + 0
iny
lda (ptr), y
sta wrk + 1
ldy y_save
; jmp refer

refer:
; as is, classic ITC from fig-forth 6502
; pointer to next reference
clc
lda ptr + 0
adc #CELL_SIZE
sta ptr + 0
bne @end
inc ptr + 1
@end:
; jmp leaf

leaf:
; in MITC, all leaves start with NULL
; on the 6502, no code lives in page zero,
; so just compare the high byte
lda wrk + 1
bne nest
; no Forth word lives in page zero
jmp (ptr)

nest: ; aka docol push into return stack
dey
dey
lda ptr + 1
sta r0 + 1, y
lda ptr + 0
sta r0 + 0, y
; jmp link

link:
; next reference
lda wrk + 0
sta ptr + 0
lda wrk + 1
sta ptr + 1
jmp next

Sorry for the long post.