Is this the optimal 6502 NEXT?
Re: Is this the optimal 6502 NEXT?
phooey. Charlie-brain-damage admission.
I would have to explicitly clear carry to pull this off, making it the same 15 clocks. Or have a contract with every primitive to ensure carry flag is clear before jmp next.
Nevermind!
Re: Is this the optimal 6502 NEXT?
chitselb wrote:
Or have a contract with every primitive to ensure carry flag is clear before jmp next.
In 1988 my 65C02 got six new registers and 44 new full-speed instructions!
https://laughtonelectronics.com/Arcana/ ... mmary.html
Re: Is this the optimal 6502 NEXT?
chitselb wrote:
I'm using a design where the compiler
- realigns at page boundaries to work around the jmp ($xxff) bug
- inserts a call to the word "page" when it compiles definitions (where they cross page boundaries.)
E.g., in CamelForth all threads are compiled through a word ,XT (append eXecution Token), which turns out to be equivalent to the ANS word COMPILE, . I would merely change ,XT to test for HERE=$xxFE, and compile the call to "page" in that case. (I assume that you're keeping the threads 16-bit aligned.)
There will be some added complexity for words that fetch an in-line value, such as literal, ." , and the branch operators. They'll need to be smart about page boundaries (or simply use a 16-bit increment for IP).
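The compile-time check described above could be sketched roughly as follows. This is a Python model rather than Forth, and `PAGE_XT`, `store_cell`, `compile_xt`, and the dict used as memory are hypothetical names chosen for illustration. It assumes 16-bit aligned threads and little-endian cells.

```python
PAGE_XT = 0x0300  # hypothetical execution token of the "page" word


def store_cell(memory, addr, value):
    """Store a 16-bit cell little-endian, as the 6502 expects."""
    memory[addr] = value & 0xFF
    memory[addr + 1] = (value >> 8) & 0xFF


def compile_xt(memory, here, xt):
    """Model of ,XT: append xt at HERE and return the new HERE.

    If HERE is the last cell of a page ($xxFE), the "page" token is
    compiled there first, so the real xt lands at $xx00 of the next page.
    """
    if here & 0xFF == 0xFE:
        store_cell(memory, here, PAGE_XT)
        here += 2
    store_cell(memory, here, xt)
    return here + 2


# Compile three tokens starting just below a page boundary.
memory = {}
here = 0x10FA
for xt in (0x2000, 0x2010, 0x2020):
    here = compile_xt(memory, here, xt)
```

In this run the third token is pushed to $1100 and the cell at $10FE holds the "page" token instead.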
Because there are never enough Forth implementations: http://www.camelforth.com
- White Flame
Re: Is this the optimal 6502 NEXT?
Since I'm looking at my AcheronVM again, I noticed its dispatch was pretty quick, all things considered. Could be useful for a token-threaded Forth, I guess:
Code: Select all
mainLoop1: ; Main loop entry when an instruction consumed an operand byte, so bump instruction pointer an extra 1
iny
bmi _iptrOverflow
mainLoop0: ; Main loop entry when an instruction had no operand bytes
lda (iptr),y
iny
sta *+4
jmp (dispatchTable) ; low byte is self-modded
All of the tokens are an even number, so no ASL is required to index into the 16-bit dispatch table. 128 possible tokens in this encoding.
The current program counter is iptr + Y. Instruction implementations must preserve .Y, and there is another mainLoopY entry point which does "ldy saveY" for convenience, given that the implementation stashed Y there before trampling it. The code regularly but not exhaustively checks to see if Y is greater than 127 and coalesces the value back into iptr if it reaches that range. It's nice that instructions can freely grab operand bytes using iny without having to always perform this check (though obviously wildly degenerate cases will break), and it doesn't break anything when page boundaries are crossed. The only issue is that the lda (iptr),y instructions take another cycle in those cases. Any branch, jump, etc, sets the new iptr and resets Y to zero.
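For readers who want to poke at the idea, here is a small Python model of this dispatch scheme. It is only a sketch: `make_dispatch_table`, `dispatch`, and `coalesce` are hypothetical names, the handler addresses are made up, and the body of `_iptrOverflow` is not shown in the post, so `coalesce` is an assumption based on the description of folding Y back into iptr.

```python
def make_dispatch_table(handlers):
    """Build a 256-byte table of little-endian 16-bit handler addresses.

    handlers maps an even token (0, 2, ..., 254) to a 16-bit address,
    so the token itself is the byte offset into the table (no ASL needed).
    """
    table = bytearray(256)
    for token, addr in handlers.items():
        if token % 2 or not 0 <= token <= 254:
            raise ValueError("tokens must be even and fit in one byte")
        table[token] = addr & 0xFF
        table[token + 1] = (addr >> 8) & 0xFF
    return table


def dispatch(code, iptr, y, table):
    """Model of mainLoop0: fetch the token at iptr+Y, bump Y, look up the handler."""
    token = code[iptr + y]
    y += 1
    handler = table[token] | (table[token + 1] << 8)
    return handler, y


def coalesce(iptr, y):
    """Assumed overflow path: fold Y back into iptr once Y reaches 128."""
    if y >= 0x80:
        return iptr + y, 0
    return iptr, y


table = make_dispatch_table({0: 0xC000, 2: 0xC040})  # made-up handler addresses
handler, y = dispatch(bytes([2, 0]), 0, 0, table)
```

With 128 even tokens the table fills exactly one page, which is what lets the assembly patch only the low byte of the JMP operand.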
- barrym95838
Re: Is this the optimal 6502 NEXT?
I never get tired of seeing all of the cool ways (zp),y gets used ... it's one of the most important keys to the 6502's power and versatility, IMO.
Mike
Re: Is this the optimal 6502 NEXT?
This was a huge win today! I replaced 27 bytes of EXECUTE code with just 11, eliminating the need for a "W" register (as is typical of Indirect Threaded implementations)
Code: Select all
execute
lda tos+1
pha
lda tos
pha
jsr slide ; identical to DROP but ends in RTS instead of NEXT
php
rti
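As I read it, the trick works because RTI resumes at exactly the address it pulls, whereas RTS resumes at the pulled address plus one, so the xt can be pushed unmodified as long as a status byte sits on top of it. A rough Python model of just the stack behaviour follows; `Stack6502`, `push_target`, `rti`, and `rts` are hypothetical names, and the status byte value is arbitrary.

```python
class Stack6502:
    """Tiny model of the hardware stack (descending, pointer at the next free slot)."""

    def __init__(self):
        self.mem = bytearray(256)
        self.sp = 0xFF

    def push(self, value):
        self.mem[self.sp] = value & 0xFF
        self.sp = (self.sp - 1) & 0xFF

    def pull(self):
        self.sp = (self.sp + 1) & 0xFF
        return self.mem[self.sp]


def push_target(stack, tos):
    """lda tos+1 / pha / lda tos / pha: high byte first, then low byte."""
    stack.push(tos >> 8)
    stack.push(tos & 0xFF)


def rti(stack):
    """RTI pulls P, then PCL, then PCH, and resumes at exactly that address."""
    stack.pull()  # processor status, discarded in this model
    lo = stack.pull()
    hi = stack.pull()
    return (hi << 8) | lo


def rts(stack):
    """RTS pulls PCL, then PCH, and resumes at the pulled address plus one."""
    lo = stack.pull()
    hi = stack.pull()
    return (((hi << 8) | lo) + 1) & 0xFFFF


s = Stack6502()
push_target(s, 0x1234)
s.push(0x30)              # php (arbitrary status value)
target_via_rti = rti(s)   # lands on the xt itself

push_target(s, 0x1234)
target_via_rts = rts(s)   # would land one byte past the xt
```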
for comparison, here's the old dog chow:
;--------------------------------------------------------------
; W register, used by EXECUTE
;w1
; .word $dead ; (for when you just need a W register)
; .word exit ; 'fragment secondary' used by EXECUTE
;--------------------------------------------------------------
#if 0
name=EXECUTE
tags=inner
stack=( cfa -- )
Executes the word whose code field address is on the stack.
#endif
;execute
; lda tos ; <-- code field address
; sta w1 ; in direct-threaded models, this
; lda tos+1 ; contains code instead of a pointer
; sta w1+1 ; [SP] -> [W1]
; lda ip+1
; pha
; lda ip
; pha
; lda #<(w1-2)
; sta ip
; lda #>(w1-2)
; sta ip+1
; jmp pops
- GARTHWILSON
Re: Is this the optimal 6502 NEXT?
What I'm running for 65c02 is:
Code: Select all
LDA 0,X
STA W
LDA 1,X
STA W+1
INX
INX
JMP W-1
My '816 ITC has basically the same thing:
Code: Select all
HEADER "EXECUTE", NOT_IMMEDIATE ; ( addr -- )
EXECUTE: PRIMITIVE
LDA 0,X
xeq1: STA W
INX_INX
JMP W-1
;-------------------
and the label is for PERFORM:
Code: Select all
HEADER "PERFORM", NOT_IMMEDIATE ; ( addr -- )
PERFORM: PRIMITIVE ; same as @ EXECUTE
LDA (0,X)
BRA xeq1
;-------------------
http://WilsonMinesCo.com/ lots of 6502 resources
The "second front page" is http://wilsonminesco.com/links.html .
What's an additional VIA among friends, anyhow?
- barrym95838
Re: Is this the optimal 6502 NEXT?
65m32:
Code: Select all
; 154 ;--------------------------------------------------------------- EXECUTE
0000029e:00000289; 155 NOT_IMM 'EXECUTE'
0000029f:07455845; 155
000002a0:43555445; 155
; 156 _execute: ; ( xt -- ) \ Execute Forth word 'xt'
000002a1:62020000; 157 pda ,x+ ; Pop xt from S: and push it to R:
000002a2:5e0c0000; 158 rts ; Pop xt from R: and jump to it
This is an example of a disadvantage of keeping TOS-in-A. If I kept it at 0,x then EXECUTE would be only a single machine instruction: jmp (,x+).
Mike B.
php rti is a neat hack, Charlie!
- nonarkitten
Re: Is this the optimal 6502 NEXT?
So, why not simply use Y to hold the LSB and store it into the zero page JMP command? This will get 12 cycles on a 6502, approaching subroutine-threading performance without clobbering the C flag (allowing multi-precision arithmetic).
It might seem terrible to lose both the X and Y registers, but the Y register can be nuked and then reloaded with LDY IP+1 which only adds three cycles to whichever primitive is being written.
Code: Select all
PAGE INC IP+2 - seldom executed
NEXT INY ; 2 cycles
INY ; 2 cycles
STY IP+1 ; 3 cycles
IP JMP ($0000) ; 5 cycles
Code: Select all
CLIT LDY #0 ; 2 cycles
LDA (IP+1),Y ; 5 cycles
DEX ; 2 cycles
STA (0,X) ; 6 cycles
LDY IP+1 ; 3 cycles
INY ; 2 cycles
JMP NEXT ; 3 cycles
-- Zomg! Pewpew!!
Re: Is this the optimal 6502 NEXT?
nonarkitten wrote:
Code: Select all
PAGE INC IP+2 - seldom executed
NEXT INY ; 2 cycles
INY ; 2 cycles
STY IP+1 ; 3 cycles
IP JMP ($0000) ; 5 cycles
Minor added detail: 5 is correct clock count for the JMP (ind) "featured" in the NMOS 6502, but the 65C02's bug-fixed JMP (ind) increases that count to 6. (Until now I didn't realize the '816 reduces it back to 5.)
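The arithmetic behind those totals, as a quick Python tally. The JMP (ind) counts are the ones stated above, and the other figures are the per-instruction counts from nonarkitten's listing. The names are hypothetical, and the '816 figure assumes a page-aligned direct page so that STY stays at 3 cycles.

```python
# Cycle counts for JMP (ind) per CPU, as stated in the post above.
JMP_INDIRECT_CYCLES = {"NMOS 6502": 5, "65C02": 6, "65816": 5}

# INY (2) + INY (2) + STY zp (3): the fixed part of the quoted NEXT.
FIXED_CYCLES = 2 + 2 + 3


def next_cycles(cpu):
    """Total cycles for one pass through NEXT on the given CPU."""
    return FIXED_CYCLES + JMP_INDIRECT_CYCLES[cpu]


totals = {cpu: next_cycles(cpu) for cpu in JMP_INDIRECT_CYCLES}
```

So the 12-cycle figure holds for the NMOS part and the '816, and becomes 13 on the 65C02.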
Quote:
the Y register can be nuked and then reloaded with LDY IP+1 which only adds three cycles
IOW, replace this
Code: Select all
PAGE INC IP+2 - seldom executed <-------
NEXT INY ; 2 cycles
INY ; 2 cycles
STY IP+1 ; 3 cycles
IP JMP ($0000) ; 5 cycles
with this:
Code: Select all
FIX_Y LDY IP+1 ;reload bombed Y <-------
NEXT INY ; 2 cycles
INY ; 2 cycles
STY IP+1 ; 3 cycles
IP JMP ($0000) ; 5 cycles
-- Jeff
PS- Shouldn't this be STA 0,X? Also I'd expect CLIT also to write zero to the highbyte of TOS at 1,X.
Code: Select all
CLIT LDY #0 ; 2 cycles
LDA (IP+1),Y ; 5 cycles
DEX ; 2 cycles
STA (0,X) ; 6 cycles <-------
LDY IP+1 ; 3 cycles
INY ; 2 cycles
JMP NEXT ; 3 cycles
In 1988 my 65C02 got six new registers and 44 new full-speed instructions!
https://laughtonelectronics.com/Arcana/ ... mmary.html
- BigDumbDinosaur
Re: Is this the optimal 6502 NEXT?
Dr Jefyll wrote:
Minor added detail: 5 is correct clock count for the JMP (ind) "featured" in the NMOS 6502, but the 65C02's bug-fixed JMP (ind) increases that count to 6. (Until now I didn't realize the '816 reduces it back to 5.)
x86? We ain't got no x86. We don't NEED no stinking x86!
Re: Is this the optimal 6502 NEXT?
nonarkitten wrote:
So, why not simply use Y to hold the LSB and store it into the zero page JMP command. This will get 12 cycles on a 6502 approaching subroutine-threading performance without clobbering the C flag (allowing multi-precision arithmetic).
Code: Select all
;this section lives in zero page
NEXTY LDY IP+1 ;[3] restore instruction pointer
; (unnecessary if primitive didn't alter Y)
NEXT INY ;[2]
INY ;[2]
NEXTB STY IP+1 ;[3]
IP JMP ($CAFE) ;[5]
;this part is in high memory
PAGE INC IP+2 ;[5]
LDY #0 ;[2]
JMP NEXTB ;[3]
CLIT DEX ;[2]
LDY #0 ;[2]
LDA (IP+1),Y ;[5]
STY STACKH,X ;[4]
STA STACKL,X ;[4] (I'm using a split stack)
INC IP+1 ;[5] get past the literal byte
JMP NEXTY ;[3]
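To see how NEXT and PAGE cooperate at run time, here is a rough Python model. It assumes the thread layout discussed earlier in the thread, with the PAGE token compiled into the last cell ($xxFE) of each page. `PAGE_XT`, `read_cell`, `run_thread`, and the token values are hypothetical. Unlike the assembly, the loop reads the cell at IP before bumping Y, which visits the same cells when IP starts on the first one.

```python
PAGE_XT = 0x0300  # hypothetical address of the PAGE routine


def read_cell(memory, addr):
    """Read a little-endian 16-bit cell."""
    return memory[addr] | (memory[addr + 1] << 8)


def run_thread(memory, ip, steps):
    """Follow the thread from `ip` until `steps` ordinary tokens were reached.

    Y mirrors the low byte of IP. NEXT adds 2 to Y with 8-bit wraparound only,
    so the high byte changes solely when the PAGE word is executed.
    """
    ip_hi, y = ip >> 8, ip & 0xFF
    executed = []
    while len(executed) < steps:
        xt = read_cell(memory, (ip_hi << 8) | y)  # IP    JMP ($xxxx)
        if xt == PAGE_XT:
            ip_hi = (ip_hi + 1) & 0xFF            # PAGE  INC IP+2
            y = 0                                 #       LDY #0
            continue                              #       JMP NEXTB
        executed.append(xt)
        y = (y + 2) & 0xFF                        # NEXT  INY / INY
    return executed


# A short thread that crosses from page $10 into page $11.
memory = {}
for addr, xt in ((0x10FA, 0x2000), (0x10FC, 0x2010),
                 (0x10FE, PAGE_XT), (0x1100, 0x2020)):
    memory[addr] = xt & 0xFF
    memory[addr + 1] = xt >> 8

executed = run_thread(memory, 0x10FA, 3)
```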
Re: Is this the optimal 6502 NEXT?
The Fig-Forth already has such a contract with X, as it's the stack pointer.
Re:
GARTHWILSON wrote:
FIG-Forth's NEXT leaves Y=0 and then some of the words take advantage of that so they don't have to zero it first. I had considered making NEXT leave C unaffected to make it easier to do multi-precision arithmetic in secondaries.
If I were interested in multi-precision math in secondaries, that is not how I would do it.
Making NEXT preserve the state of the carry flag would burden all Forth words with the extra overhead. This is a quick rough draft of what I might try.
Code: Select all
VARIABLE CARRY
CODE (MP+) ( N1 N2 -- N3 ) \ multi-precision plus
CARRY ROR
0 ,X LDA 2 ,X ADC 2 ,X STA
1 ,X LDA 3 ,X ADC 3 ,X STA
CARRY ROL
POP JMP END-CODE
CODE + ( N1 N2 -- N3 ) \ traditional plus
CLC
' (MP+) @ 3 + JMP
END-CODE
Or, if there are two consecutive zero page locations available:
Code: Select all
<ZP-LOCATION> CONSTANT CARRY
CODE (MP+) ( N1 N2 -- N3 ) \ multi-precision plus
CARRY ROR
0 ,X LDA 2 ,X ADC 2 ,X STA
1 ,X LDA 3 ,X ADC 3 ,X STA
CARRY ROL
POP JMP END-CODE
CODE +
CLC
' (MP+) @ 2+ JMP
END-CODE
The code only uses one byte of the variable CARRY, but these two phrases serve as clear carry and set carry for high-level multi-precision math.
Code: Select all
CARRY OFF \ clear carry
CARRY ON \ set carry
A hypothetical example of multi-precision addition in high level Forth using (MP+).
Code: Select all
CREATE VALUE1 20 ALLOT
CREATE VALUE2 20 ALLOT
\ add 20 byte (10 cell) numbers in VALUE1 and VALUE2
\ return result in VALUE1
: MP+ ( ADR1 ADR2 #CELLS -- )
CARRY OFF 2* 0
?DO
OVER I + @ OVER I + @ (MP+)
OVER I + !
2
+LOOP
2DROP ;
VALUE2 VALUE1 10 MP+
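The same carry-in-a-variable idea, modelled in Python so the arithmetic can be checked against ordinary big integers. This is a sketch under assumptions: 16-bit cells, least significant cell at the lowest address, and hypothetical helper names (`mp_plus_cell`, `mp_plus`, `to_cells`, `from_cells`) that are not part of the Forth above.

```python
CELL_BITS = 16
CELL_MASK = (1 << CELL_BITS) - 1


def mp_plus_cell(n1, n2, carry):
    """Model of (MP+): add two cells plus the saved carry; return (sum, new carry)."""
    total = n1 + n2 + carry
    return total & CELL_MASK, total >> CELL_BITS


def mp_plus(value1, value2):
    """Model of MP+: add two equal-length cell lists, least significant first.

    The result replaces value1, matching the Forth example. The final carry
    is returned, as it would be left in CARRY.
    """
    carry = 0  # CARRY OFF
    for i in range(len(value1)):
        value1[i], carry = mp_plus_cell(value1[i], value2[i], carry)
    return carry


def to_cells(n, count):
    """Split an integer into `count` cells, least significant first."""
    return [(n >> (CELL_BITS * i)) & CELL_MASK for i in range(count)]


def from_cells(cells):
    """Rebuild an integer from cells, least significant first."""
    return sum(c << (CELL_BITS * i) for i, c in enumerate(cells))


value1 = to_cells(0xFFFFFFFFFFFF, 10)
value2 = to_cells(1, 10)
final_carry = mp_plus(value1, value2)
result = from_cells(value1)
```

Because the carry lives in a variable between calls, nothing the loop words do in between can disturb it, which is the point of the Forth version.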
Re: Is this the optimal 6502 NEXT?
;
; using Minimal Indirect Thread Code
; https://github.com/agsb/immu/blob/main/ ... h%20en.pdf
;
; in classic ITC, the inner interpreter always jumps, and
; DOCOL pushes to the return stack and SEMIS pulls from the return stack
;
; in minimal ITC, the code only jumps when the first reference is 0x0000,
; which marks all primitives; otherwise it does not jump, it just nests directly,
; pushing the reference onto the return stack, and all words end with unnest,
; which pulls the reference from the return stack.
;
; Because there are fewer primitives than compound words,
; this gives the inner interpreter some advantages:
; all compound words are one cell shorter
; (they do not need to begin with DOCOL),
; no dependence on IP or W to hold the next reference
; ( it is kept on the return stack ),
; it is easy for MCUs or CPUs with separate code and data memory
; ( compounds could stay in a non-executable 'data' segment,
; with only the inner address interpreter and primitives in the executable 'code' segment ),
; on the RISC-V ISA it makes for an easy, fast inner interpreter,
;
;
; First version for 6502, no optimizations
;
; r0, top of return stack, y indexed, any memory page
; p0, top of parameter stack, x index, any memory page
; tos, nos, pseudo registers at page zero
; wrk, ptr, pseudo registers at page zero
; a_save, x_save, y_save, keep values
; A and Y are not used as TOS, only as accumulators
;
unnest: ; aka semis, pull from return stack
lda r0 + 0, y
sta ptr + 0
lda r0 + 1, y
sta ptr + 1
iny
iny
; jmp next
next:
; as is, classic ITC from fig-forth 6502
sty y_save
ldy #0
lda (ptr), y
sta wrk + 0
iny
lda (ptr), y
sta wrk + 1
ldy y_save
; jmp refer
refer:
; as is, classic ITC from fig-forth 6502
; pointer to next reference
clc
lda ptr + 0
adc #CELL_SIZE
sta ptr + 0
bne @end
inc ptr + 1
@end:
; jmp leaf
leaf:
; in MICT, all leafs start with NULL
; in 6502, none code at page zero
; just compare high byte
lda wrk + 1
bne nest
; none forth word at page zero
jmp (ptr)
nest: ; aka docol push into return stack
dey
dey
lda ptr + 1
sta r0 + 1, y
lda ptr + 0
sta r0 + 0, y
; jmp link
link:
; next reference
lda wrk + 0
sta ptr + 0
lda wrk + 1
sta ptr + 1
jmp next
sorry long post.