Is this the optimal 6502 NEXT?
The project thus far: http://github.com/chitselb/pettil
The good news is that NEXT / ENTER / EXIT / EXECUTE are running, and NEXT is a mere 17 cycles.
The bad news is that it lacks a README, and the ability to compile for that matter. Feel free to throw down and email your diffs to me at gmail (hint: It's 'chitselb' there too)
For development work, I'm using Linux Mint 9 running:
vice 2.1.dfsg-3 The Versatile Commodore Emulator
xa65 2.3.5-1 cross-assembler and utility suite for 65xx/65816 processors
Charlie
The good news is that NEXT / ENTER / EXIT / EXECUTE are running, and NEXT is a mere 17 cycles.
The bad news is that it lacks a README, and the ability to compile for that matter. Feel free to throw down and email your diffs to me at gmail (hint: It's 'chitselb' there too)
For development work, I'm using Linux Mint 9 running:
vice 2.1.dfsg-3 The Versatile Commodore Emulator
xa65 2.3.5-1 cross-assembler and utility suite for 65xx/65816 processors
Charlie
some fool who sucks at cut and paste wrote:
An optimized version of the UM/MOD fix can be found here (along with the UM* fix):
http://6502org.wikidot.com/errata-software-figforth
kc5tja wrote:
Though in this case, I'd use Y as the data stack pointer, and reference the data stack in direct-page using absolute addresses.
<snip>
I think it's only one cycle longer than DP,X mode, which shouldn't be much of an issue, I'd think.
<snip>
I think it's only one cycle longer than DP,X mode, which shouldn't be much of an issue, I'd think.
I would (blindly) guess that most words change only 1 or 2 cells, so that would be 2 or 4 cycles (one per byte per cell), for the STAs.
kc5tja wrote:
The only difficulty would come from when you need to invoke @ and !, I think, as here you'd need to tweak the Y register and a reserved direct-page location to serve as your pointer.
Code: Select all
; C@
STY :1+1
:1 LDA (0)
STA 0,Y
LDA #0
STA 1,Y
JMP NEXT
However, abs,Y is of course 1 byte longer than zp,X, and there are 2 of those for every 1 byte savings from 1 INY vs. 2 INXs. Plus there's the additional space (and as you mentioned, time) of copying to the zero page so that it can be used as a pointer. So primitives will take up more space, for what may be on average only a slight speed improvement at best. Also, in the 17 cycle NEXT, Y can be overwritten, without penalty (see below), whereas in the 12 cycle NEXT both index registers are used and must be preserved. I seem to have underestimated the 12-cycle NEXT, but I'm still not sure I see a decisive advantage there.
Anyway, shifting gears back to the 17 cycle NEXT, is there a benefit to requiring primitives to preserve Y to save the 2 cycle LDY #1? LDA #imm takes the same 2 cycles as TYA. In other words, why not this:
Code: Select all
NEXT LDA #2
NEXT1 CLC
ADC :1+1
STA :1+1
BCS :2
:1 JMP ($FFFF)
:2 INC :1+2 ; this doesn't affect the carry so the BCS always branches
BCS :1 ; or JMP to ensure it always takes 3 cycles
Code: Select all
DW LIT,0
DW ZBRANCH,label
GARTHWILSON wrote:
Also, take a look at my article on servicing interrupts in high-level Forth with no overhead, as the method involves NEXT.
Quote:
It would take some convincing for me to believe that splitting the stack and having to keep moving things in and out of a fixed-address TOS pair of ZP bytes would be more efficient than keeping byte pairs together in every cell and doing indexing and doing the INX or DEX twice to add or drop stack cells.
A side-by-side example
Code: Select all
+
PETTIL figForth
pluslfa .word $adde L632 .BYTE $81,$AB
.byt (plus-*-1)|bit7 .WORD L619 ; link to V-ADJ
.asc "+"|bit7 PLUS .WORD *+2
plus lda tos CLC
adc stackl,x LDA 0,X
sta tos ADC 2,X
lda tos+1 STA 2,X
adc stackh,x LDA 1,X
sta tos+1 ADC 3,X
inx STA 3,X
jmp next INX
INX
JMP NEXTUnrelated to any of that, I'm thinking about the EDITOR. Of course I'll include the "Starting Forth" editor because most Forth programmers have seen it if they've ever read that book, and who hasn't, right? But I really liked the volksForth64 full screen editor too.
I am envisioning something that looks as much as possible like the ROM screen editor, including the 40/80-column stuff. There's no Control key on original PET or PET 2001-N keyboards, but RVS could do the trick and it's even positioned correctly. For example RVS-S could mean "flush the current block" and RVS-W could mean "toggle to the shadow block" and RVS-RVS could mean "act like I just typed a RVS". Another thing the volksForth editor has is a clear visual indication that you're in the editor (it changes the border color). For PETTIL I could flash the screen when the user presses RVS.
Case sensitive? I think not. A programmer writing a card game could define the shifted A,S,Z,X PETSCII card suit characters as Forth words, etc...
Speeding up Dictionary searches: figForth searches the entire dictionary for a blank-delimited string and then throws it to NUMBER. My approach to this is to break the dictionary up into 16 threads instead of one. A very fast and simple hashing scheme (XOR all the nybbles of the text portion of the word together) will give a hashkey between 0 and 15. A table of 16 words will point to the most recently linked LFA for each of those 16 hashkey values. Two new words RETHREAD and UNTHREAD will toggle every LFA between being threaded into this scheme (RETHREAD) and being set to the constant $DE $AD (UNTHREAD). Otherwise, implementing FORGET would be lots more difficult. The dictionary will be unthreaded at startup and COLD will invoke RETHREAD to get things working.
I'm hosting it all on github.com http://github.com/chitselb/pettil
chitselb wrote:
PETTIL guarantees to primitives that NEXT will always exit with Carry clear and Y=1, saves two clock cycles here. Still, I get 25 cycles for PETTIL '+' vs. 33 cycles for figForth '+'. Not bad.
It seems to me you would always be taking the time whether it's
needed or not and it would likely only save you a few bytes here and
there (?)
- GARTHWILSON
- Forum Moderator
- Posts: 8775
- Joined: 30 Aug 2002
- Location: Southern California
- Contact:
Quote:
because at some future time I will probably wish to introduce preemptive (involuntary) multitasking.
Quote:
PETTIL guarantees to primitives that NEXT will always exit with Carry clear and Y=1, saves two clock cycles here.
Quote:
Speeding up Dictionary searches:
bogax wrote:
chitselb wrote:
PETTIL guarantees to primitives that NEXT will always exit with Carry clear and Y=1, saves two clock cycles here. Still, I get 25 cycles for PETTIL '+' vs. 33 cycles for figForth '+'. Not bad.
It seems to me you would always be taking the time whether it's
needed or not and it would likely only save you a few bytes here and
there (?)
Code: Select all
putya sty tos+1 ; YA to the TOS register
puta sta tos ; your mom
next1 ldy #1 ; entry to reset Y = 1
next tya ;[2] ; entry for most purposes
nexta sec ;[2] ; adds an arbitrary amount to IP
adc ip ;[3]
sta ip ;[3]
bcs nextp ;[2]
ip = *+1 ; dummy value $CAFE
nexto jmp ($cafe) ;[5] ; outtie! doesn't alter IP, Y or C
nextp inc ip+1 ;[17] ; next page
nexty ldy #1 ; like nexto, also fixes CLC & Y
nextx clc ; just fix CLC, then nexto
bcc nexto ; bra
The unnecessary three bytes are the instructions at 'nexty' and 'nextx'. For the price of 2 clock cycles and 1 byte, (which is a price I only pay only whenever the IP crosses a page boundary), I get to know what the C flag is at the start of every primitive. The cost (2 clocks/2 bytes for ldy #1) and logic apply similarly to Y. By carefully choosing which entry point into NEXT at the end of a primitive, I shave even more clocks here and there too. For me, coding on the 6502 is a logic game, a place to entertain myself by optimizing. A little ridiculous, a lot of fun.
Charlie
Consider this revision. It does away with the "Y=1 C=clear after NEXT" contract, and also the handy put stuff that stores the YA register pair to TOS. It still runs in 17 cycles. For clocks, it's a wash. The only win here is I free up some bytes of zero page. And I can use the Y register for something else. And it sort of simplifies things vs. the byzantine NEXT routine in the previous message.
The cost of this move is now every primitive that puts a value to TOS will have to handle that itself before calling NEXT. Every primitive that wants to use the Y register or the C flag will have to initialize them.
But wait a minute! Maybe you're onto something here. Now I can use Y for something else, besides always walking around with a 1 in there. What would I use it for?
The cost of this move is now every primitive that puts a value to TOS will have to handle that itself before calling NEXT. Every primitive that wants to use the Y register or the C flag will have to initialize them.
Code: Select all
minimum wage NEXT
next lda #2 ;[2] ; entry point for just about everything
nexta clc ;[2] ; adds an arbitrary amount (passed in A) to IP
adc ip ;[3]
sta ip ;[3]
bcs nextp ;[2]
ip = *+1 ; dummy value $CAFE
nexto jmp ($cafe) ;[5] ; outtie!
nextp inc ip+1 ;[17] ; next page
bcs nexto ; bra
I've had a day to think about it. It's not a good design, or at least it doesn't seem very Forth-like to me, to have a contract between the Forth virtual machine and primitives as far as the 6502 register contents. So before one or many of you try to convince me of the folly of my ways, here's what I've come up with after a day of soul searching, away from the keyboard. Same 17 clocks. It only affects the S and Z flags, and doesn't affect *any* other registers, so primitives can use A and Y to pass values between themselves. This is it, as fast, uncomplicated and minimal as I could make it.
The compiler's duty is to never break a word in half at a page boundary because of the JMP ($xxFF) bug. It will do this by inserting (or not) a NOP $EA *after* the JMP ENTER at the beginning of a colon definition.
The compiler's duty is to never break a word in half at a page boundary because of the JMP ($xxFF) bug. It will do this by inserting (or not) a NOP $EA *after* the JMP ENTER at the beginning of a colon definition.
Code: Select all
new and improved NEXT and zero page usage
stackl = $00
stackh = $3a
n = $74
up = $7c
tos = $7e
0080 next inc ip+1
0082 inc ip+1
0084 beq nextp
ip = *+1
0086 nexto jmp ($cafe)
0089 nextp inc ip+1
008b bne nextoRegister contracts already exist: for example, on x86 direct-threaded Forth implementations, it's very often the case that (E)SI holds the program counter, while (E)BX holds the top of stack.
That being said, I'm actually caught by surprise by this implementation. It's small, quick, and easy to understand.
Although, the code as listed is buggy:
That being said, I'm actually caught by surprise by this implementation. It's small, quick, and easy to understand.
Although, the code as listed is buggy:
Code: Select all
. . .
inc ip+1 ; should be inc ip
inc ip+1 ; should be inc ip
. . .
chitselb wrote:
The compiler's duty is to never break a word in half at a page boundary because of the JMP ($xxFF) bug. It will do this by inserting (or not) a NOP $EA *after* the JMP ENTER at the beginning of a colon definition.
primitive and get the compiler to insert a NEXTPAGE and lose the page
increment in NEXT?
bogax wrote:
So since you're doing that any way why don't you create a NEXTPAGE
primitive and get the compiler to insert a NEXTPAGE and lose the page
increment in NEXT?
primitive and get the compiler to insert a NEXTPAGE and lose the page
increment in NEXT?
chitselb wrote:
The compiler's duty is to never break a word in half at a page boundary because of the JMP ($xxFF) bug. It will do this by inserting (or not) a NOP $EA *after* the JMP ENTER at the beginning of a colon definition.
Here's my code for ENTER: Please notice particularly the bad ASCII art in the comment
Code: Select all
;--------------------------------------------------------------
;
; ENTER
;
; IP -> [-RP]
; [IP] -> W ; W points to the CFA of this secondary
; W+sizeof(CFA) -> IP
; NEXT
;
; ENTER is headerless, aka DOCOLON in some Forths
;
; Consider how these three secondaries cross a page boundary
;
; : foo ( n1 -- n1 n2 n1 ) dup 2+ over ;
; f8 f9 fa fb fc fd fe ff| 00 01
; +---+---+---+---+---+---+---+---+---+---+---+---+---|---+---+
; | 3 | FOO |jmp enter | dup |twoplus| over | exit |
; |<----NFA------>|<--CFA---->|<-----------PFA--------|------>+
;
; : bar ( n1-- n2 ) 42 swap - ;
; f8 f9 fa fb fc fd fe ff| 00 01 02 03
; +---+---+---+---+---+---+---+---+---+---+---+---+---|---+---+---+---+
; | 3 | BAR |nop|jmp enter |clit |42 | swap | minus | exit |
; |<----NFA------>|<----CFA------>|<----------PFA-----|-------------->+
;
; : baz ( -- ) ." pong" ;
; f8 f9 fa fb fc fd fe ff| 00 01 02
; +---+---+---+---+---+---+---+---+---+---+---+---+---|---+---+---+
; | 3 | BAZ |jmp enter | (.") | 4 | "pon" |"g"| exit |
; |<----NFA------>|<--CFA---->|<----------PFA---------|---------->+
;
; There are two ways for a primitive to get here
;
;(a) NOP
; JMP ENTER
;
;(b) JMP ENTER
;
; The NOP is added by the compiler, as necessary, to align words within
; colon definitions on a page boundary. More often than not, NOP won't be
; present because more often than not, secondaries don't cross a page.
;
; Since the opcodes for NOP and JMP are $EA and $4C respectively, we
; use the LDA instruction and check the N flag.
;
; The NOP comes *before* the JMP for a couple of reasons.
; 1) The Y register already has a zero in it after copying (IP) -> W
; 2) If the NOP is stored after the JMP, there's no way to distinguish
; between a NOP and a CFA with a low-order byte of $EA
;
; IP needs to point at the PFA of the secondary we're entering, so knowing
; whether a secondary's CFA takes up 3 or 4 bytes is pretty important.
;
enter lda ip+1 ;[3]
pha ;[3]
lda ip ;[3]
pha ;[3] ; IP -> [-RP]
ldy #0 ;[2]
clc ;[2] ; assume JMP
lda (ip),y ;[5]
bpl l021 ;[2|3] ; JMP or NOP?
sec ;[2|0] ; guess we were wrong. It's NOP
l021 adc #3 ;[2] ; A = <(W+3) (maybe 4)
pha ;[3]
tya ;[2]
iny ;[2]
adc (ip),y ;[5] ; A = >(W+3) (maybe 4)
sta ip+1 ;[3]
pla ;[4]
sta ip ;[3]
jmp nexto ;[3]
For BAR, when the word is being compiled into the dictionary, when it gets to MINUS the compiler will notice that DP = $xxFF. So it slides (CMOVE>) everything (from the first byte of the CFA of BAR to the byte at $xxFE) upward one byte. Then it stores $EA at the first byte of BAR's CFA. Finally it compiles in the MINUS at $xx00 of the following page.
For BAZ, the IP is incremented (and the page boundary crossed) by (.") so again there is no issue.
I'm not thrilled with this ENTER routine. It seems like a pretty expensive way of doing things. If anybody has a better idea, I'm all ears. Still, I'm willing to pay the price if it keeps NEXT at 17 cycles. I'm also unclear what you mean, exactly how invoking a NEXTPAGE primitive fits in with this version of ENTER. Can you please clarify, perhaps with better ascii art than what I came up with?
chitselb wrote:
With FOO, there is no issue with splitting a word at a page boundary crossing.
For BAR, when the word is being compiled into the dictionary, when it gets to MINUS the compiler will notice that DP = $xxFF. So it slides (CMOVE>) everything (from the first byte of the CFA of BAR to the byte at $xxFE) upward one byte. Then it stores $EA at the first byte of BAR's CFA. Finally it compiles in the MINUS at $xx00 of the following page.
For BAR, when the word is being compiled into the dictionary, when it gets to MINUS the compiler will notice that DP = $xxFF. So it slides (CMOVE>) everything (from the first byte of the CFA of BAR to the byte at $xxFE) upward one byte. Then it stores $EA at the first byte of BAR's CFA. Finally it compiles in the MINUS at $xx00 of the following page.
Code: Select all
inc ip
inc ip
ip = *+1
jmp(xxxx)
Code: Select all
inc ip+1
jmp NEXT
Code: Select all
; : foo ( n1 -- n1 n2 n1 ) dup 2+ over ;
; f8 f9 fa fb fc fd fe ff| 00 01 02 03
; +---+---+---+---+---+---+---+---+---+---+---+---+---|---+---+---+---+
; | 3 | FOO |jmp enter | dup |twoplus|NEXTPAG| over | exit |
; |<----NFA------>|<--CFA---->|<-----------PFA--------|-------------->+
;
; : bar ( n1-- n2 ) 42 swap - ;
; f8 f9 fa fb fc fd fe ff| 00 01 02 03 04 05
; +---+---+---+---+---+---+---+---+---+---+---+---+---|---+---+---+---+---+---+
; | 3 | BAR |nop|jmp enter |clit |42 |NEXTPAG| swap | minus | exit |
; |<----NFA------>|<----CFA------>|<----------PFA-----|-------------->+---+---+
If the jmp enter happens to end on a page boundary with no room for
NEXTPAGE :
Code: Select all
; fd fe ff| 00 01 02 03 04| 05 06 07 08 09
; +---+---+---+---+---+---+---+---+---+---+---+---+---|---+---+---+---+---+
; | 3 | BAR |nop|nop|jmp enter |clit |42 | swap | minus | exit |
; |<----NFA------>|<----CFA------>|<----------PFA-----|------------------>+
Code: Select all
ENTER
pla ;(4)
sta ip_lo_temp ;(3)
pla ;(4)
sta ip_hi_temp ;(3)
lda ip+1 ;(3)
pha ;(3)
lda ip ;(3)
pha ;(3)
lda ip_hi_temp ;(3)
sta ip+1 ;(3)
lda ip_lo_temp ;(3)
sta ip ;(3)
jmp NEXT2 ;(3)
NEXT
inc ip
NEXT2
inc ip
ip = *+1
jmp(xxxx)
Code: Select all
ENTER
pla ;(4)
tay ;(2)
pla ;(4)
sta ip_hi_temp ;(3)
lda ip+1 ;(3)
pha ;(3)
lda ip ;(3)
pha ;(3)
lda ip_hi_temp ;(3)
sta ip+1 ;(3)
iny ;(2)
sty ip ;(3)
jmp NEXTTO ;(3)
NEXT
inc ip
inc ip
ip = *+1
NEXTTO
jmp(xxxx)
Edit: this assumes that words don't cross page boundaries.
Last edited by bogax on Tue Aug 31, 2010 10:57 pm, edited 2 times in total.
What about this?
CODE NEXTPAGE2 IP 1+ INC, NEXT JMP, END-CODE
CODE NEXTPAGE3 IP INC, NEXPAGE2 JMP, END-CODE
Then [COMPILE] changes from
: [COMPILE] ( -- ) ?COMP ' , ;
to
: [COMPILE] ( -- ) ?COMP
here $00ff and $00fe = if ' nextpage2 , then
here $00ff and $00fd = if ' nextpage3 , $ea c, then
' , ;
and we do something similar with COMPILE
CODE NEXTPAGE2 IP 1+ INC, NEXT JMP, END-CODE
CODE NEXTPAGE3 IP INC, NEXPAGE2 JMP, END-CODE
Then [COMPILE] changes from
: [COMPILE] ( -- ) ?COMP ' , ;
to
: [COMPILE] ( -- ) ?COMP
here $00ff and $00fe = if ' nextpage2 , then
here $00ff and $00fd = if ' nextpage3 , $ea c, then
' , ;
and we do something similar with COMPILE
Code: Select all
; fb fc fd fe ff| 00 01 02 03 04| 05 06 07
; +---+---+---+---+---+---+---+---|---+---+---+---+---|---+---+---+
; | 3 | BAR | jmp enter | cl|it |42 | swap | minus | exit |
; |<----NFA------>|<---CFA--->|<--|-----------PFA---------------->+
^ yikes!Code: Select all
; fb fc fd fe ff| 00 01 02 03 04| 05 06 07 08
; +---+---+---+---+---+---+---+---|---+---+---+---+---|---+---+---+---+
; | 3 | BAR | jmp enter |nop| clit |42 | swap | minus | exit |
; |<----NFA------>|<----CFA------>|<----------PFA-------------------->+Code: Select all
enterp sec ; special case CFA = $xxFC
bcs enter+1
enter clc
lda ip+1 ;[3]
pha ;[3]
lda ip ;[3]
pha ;[3] ; IP -> [-RP]
ldy #0 ;[2]
lda (ip),y ;[5]
l021 adc #3 ;[2] ; A = <(W+3+C)
pha ;[3]
tya ;[2]
iny ;[2]
adc (ip),y ;[5] ; A = >(W+3+C)
sta ip+1 ;[3]
pla ;[4]
sta ip ;[3]
jmp nexto ;[3]wups. fixed:
Code: Select all
entern inc ip+1 ; special case CFA = $xxFD, $xxFE, $xxFF
bne enter
enterp inc ip+1
sec ; special case CFA = $xxFC
bcs enter+1
enter clc
lda ip+1 ;[3]
pha ;[3]
lda ip ;[3]
pha ;[3] ; IP -> [-RP]
ldy #0 ;[2]
lda (ip),y ;[5]
l021 adc #3 ;[2] ; A = <(W+3+C)
pha ;[3]
tya ;[2]
iny ;[2]
adc (ip),y ;[5] ; A = >(W+3+C)
sta ip+1 ;[3]
pla ;[4]
sta ip ;[3]
jmp nexto ;[3]