One thing I should point out is that LDA (ZP,X) is equivalent to (assuming 16-bit registers):
Code:
LDY ZP,X ; 5 cycles, zp,X addressing
LDA 0,Y ; 6 cycles, abs,Y addressing
That's 11 cycles as compared to the 7 cycles LDA (ZP,X) takes. A pair of (zp,X) instructions, such as ADC (ZP,X) followed by STA (ZP,X), is equivalent to this:
Code:
LDY ZP,X ; 5 cycles, zp,X addressing
ADC 0,Y ; 6 cycles, abs,Y addressing
STA 0,Y ; 6 cycles, abs,Y addressing
Which is only a difference of 3 cycles (17 vs. 14=7*2). In a new/related core, fewer addressing modes means less logic (or at least more don't-care logic) somewhere. The question is whether the benefit of fewer addressing modes is worth the cost of an extra 3 or 4 cycles in a few places. (Of course, the cycle counts may not be exactly the same as on the 65C816 anyway.)
(It's also interesting to note that with the above sequences you don't necessarily have to use 0,Y; it could be offset,Y, which gives you a (ZP,X),offset addressing mode -- the syntax is analogous to (stack,S),Y addressing on the 65C816.)
Another consideration is the cost of adding an additional register (or registers), as the sequences above overwrite Y and it might be useful to have another register available for clobbering.
GARTHWILSON wrote:
Many of these (DP,X) occurrences are in fairly heavily used words:
It would help if there were definitions for the non-standard words (either as equivalent colon definitions in terms of standard words, or the source code of the primitives). PERFORM and SKIP aren't descriptive enough names for me to have any idea of what they might do. My guesses for the rest are as follows (let me know where I'm wrong). It looks like a lot of them will be similar to +!, so for reference: : +! ( x a -) TUCK @ + SWAP ! ;
Code:
: C_OR_BITS ( c a -) TUCK C@ OR SWAP C! ;
: C_ANDBITS ( c a -) TUCK C@ AND SWAP C! ;
: -! ( x a -) TUCK @ - SWAP ! ; \ or NEGATE +!
: DECR ( a -) DUP @ 1- SWAP ! ;
: INCR ( a -) DUP @ 1+ SWAP ! ;
: SWAP! ( a x -) SWAP ! ;
OFF and ON are commonly defined as equivalent to the following, so I'm guessing that's true here (and I'm guessing C_OFF is just C! instead of !):
Code:
: ON ( a -) TRUE SWAP ! ;
: OFF ( a -) 0 SWAP ! ;
: C_OFF ( a -) 0 SWAP C! ;
I'm guessing TOGGLE is either like DECR or +!
Code:
: TOGGLE ( a -) DUP @ INVERT SWAP ! ; \ like DECR
: TOGGLE ( x a -) TUCK @ XOR SWAP ! ; \ like +!
Since I'll be referring to them later on, I'll define AND! and OR! here; they are like +! except they use AND and OR instead of addition (in other words, they are cell-wide versions of C_ANDBITS and C_OR_BITS as defined above).
Code:
: AND! ( x a -) TUCK @ AND SWAP ! ;
: OR! ( x a -) TUCK @ OR SWAP ! ;
I've used ON and OFF (as I've defined them above) in the past, and I may have used AND! and OR! once or twice, but I've never defined any of the other non-standard words, nor have I even seen them defined or used anywhere else. Certainly none (in my experience) would qualify as heavily used, or even moderately used. Is your experience different? Do you have specific examples where these words are used, heavily or not?
The same technique for optimizing ! @ C! and C@ can be applied to DECR INCR OFF AND! and OR! to generate DEC, INC, STZ, TRB, and TSB absolute instructions (the TRB would need to be preceded by an EOR #$FFFF for AND! as defined above). If you really felt like it, C_OR_BITS and C_ANDBITS could be optimized to generate SEP-TSB-REP and SEP-TRB-REP sequences. More on this below.
Of the standard words, I've used 2@ and 2! maybe once or twice ever, and while I've seen them used by others, it's not heavy usage; certainly not heavy enough that it even matters whether they are colon definitions or primitives.
COUNT is popular (and I've used it many times myself), but it's generally used outside loops, so it's not all that performance-sensitive; a little less than the absolute maximum possible speed should be okay.
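For reference, COUNT is commonly defined as the equivalent of this colon definition (taking a counted string address and returning the address of the first character plus the length byte):
Code:
: COUNT ( a - a+1 c ) DUP 1+ SWAP C@ ;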
If you're concerned about performance, FILL should probably have a loop that looks something like this:
Code:
.1 STA (zp),Y ; 7 cycles
DEY ; 2 cycles
DEY ; 2 cycles
BNE .1 ; 3 cycles
or use MVN or something, rather than having (zp,X) addressing inside a loop. You'll obviously need to do a little bit of preparation there, e.g. copy the address from the data stack to a fixed zp location, determine whether an even or odd number of bytes are being filled, etc., but it won't take a large fill count for the preparation to pay for itself.
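A sketch of the sort of preparation I mean, feeding the loop above (the stack offsets and the even-nonzero-count assumption are just for illustration; an odd trailing byte and a zero count would need separate handling):
Code:
        LDA 4,X   ; addr from the data stack ( addr len char )
        DEC A
        DEC A
        STA zp    ; loop pointer = addr-2, since the loop's last store is at Y=2
        LDA 2,X   ; byte count (assumed even and nonzero for this sketch)
        TAY       ; Y counts down from len to 2 in steps of 2
        LDA 0,X   ; fill value (assumed already doubled into both bytes)
With zp = addr-2 and Y starting at the count, the 16-bit stores cover addr through addr+len-1.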
That leaves ! @ C! C@ and +!. All can be optimized when preceded by a constant, literal, or variable. In fact, +! almost always operates on a variable; I'm hard pressed to come up with a single example that I've seen where +! operated on something other than a variable. More on optimizing +! below.
GARTHWILSON wrote:
People writing kernels often write many of these as secondaries instead of primitives
The point is not secondary vs. primitive, but whether the performance hit (assuming there is any) from replacing (zp,X) in the primitive with some other (slightly) slower sequence (such as the one I listed at the beginning of the post) will actually be significant or noticeable.
On a 65C816, there is of course no reason not to use (zp,X) addressing since it is already available, but the system I had in mind here is a related core where the data is all 16 bits (or all 32 bits) wide.
GARTHWILSON wrote:
no JSRs or their overhead
I don't understand why you're disregarding the overhead of NEXT in DTC/ITC. The only DTC or ITC NEXT implementation that I know of that doesn't use the X register and takes fewer than the 12 cycles that a JSR-RTS pair (the STC equivalent of NEXT) takes is the 6-cycle RTS DTC trick.
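For anyone unfamiliar with that trick: point the hardware stack pointer into the thread itself and compile each code address minus one; NEXT then collapses to a single RTS (the return stack has to live elsewhere, and interrupts need care). A sketch:
Code:
; thread compiled as a list of code addresses minus one:
THREAD  DW WORD1-1
        DW WORD2-1
; with S pointing into THREAD, NEXT at the end of each primitive is simply:
        RTS       ; 6 cycles: pulls the next address-1 and jumps past it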
On the 65816, the body of FETCH is the same in STC as in DTC/ITC (only the NEXT at the end is different):
Code:
FETCH
LDA (0,X)
STA 0,X
RTS ; NEXT
The JSR-RTS pair will almost certainly beat the overhead of DTC/ITC NEXT in any non-goofy DTC/ITC implementation. However, in STC, you'll still have opportunities to optimize things like V @ (V is a variable here), which, in DTC or ITC, will compile to:
Code:
DW V,FETCH
This means DTC/ITC takes however many cycles your implementation of DOVAR (or its equivalent) takes, plus the NEXT at the end of DOVAR, plus LDA (0,X) and STA 0,X plus the NEXT at the end of FETCH. In STC, this can be optimized to:
Code:
LDA V.1 ; i.e. the address of the DW at .1, not the address of the LDA at V
JSR PUSH
where PUSH is:
Code:
PUSH
DEX
DEX
STA 0,X
RTS
and V is:
Code:
V LDA #.1
JMP PUSH
.1 DW 0
You could even inline PUSH after the LDA V.1 (it's only one extra byte) and have no JSR-RTS pairs at all:
Code:
LDA V.1
DEX
DEX
STA 0,X
That's faster than FETCH by itself (I'm counting NEXT as a part of FETCH here) in DTC or ITC, even with a 6-cycle NEXT: roughly 5+2+2+5 = 14 cycles for the inline sequence vs. 7+5+6 = 18 cycles for LDA (0,X), STA 0,X, and the RTS.
The optimization for V ! is:
Code:
JSR PULL
STA V.1
where PULL is what you'd expect:
Code:
PULL
LDA 0,X
INX
INX
RTS
And PULL could also be inlined. The V +! optimization is:
Code:
JSR PULL
CLC
ADC V.1
STA V.1
V AND! is:
Code:
JSR PULL
EOR #$FFFF
TRB V.1
V OR! is:
Code:
JSR PULL
TSB V.1
The V DECR optimization doesn't even have a JSR PULL to inline:
Code:
DEC V.1
V INCR is:
Code:
INC V.1
V OFF is:
Code:
STZ V.1
V ON is:
Code:
LDA #$FFFF
STA V.1
GARTHWILSON wrote:
I think most BASICs put intermediate values on the hardware stack, don't they?
Of the three varieties of BASIC I listed (Apple 1 BASIC and Integer BASIC are cut from the same cloth, as are Applesoft and EhBASIC), only the Applesoft/EhBASIC variety does this. The Apple-1/Integer BASIC variety and Tiny BASIC use a Forth-like data stack, even using zp,X addressing.
For completeness, KIM-1 Focal uses a data stack that can span multiple pages (using (zp),Y addressing). The One Page Assembler does not have intermediate values (essentially you have to put intermediate values in local labels, so I suppose it could be said that intermediate values are put in an array rather than on a stack).
It's possible that addressing modes could have been better chosen, but without some analysis or specific examples, that is just blind speculation.
Dr Jefyll wrote:
If I read you properly, the comments about optimization elsewhere in your post are incidental (although thought provoking). What I get as being central is that the instructions marked <--- use an unnecessarily complex addressing mode since one of the two values added to create the pointer is zero.
The optimization is actually the central point. The call to FETCH is being eliminated entirely (and LDY/LDA immediate are replaced by LDY/LDA absolute), and the (zp,X) addressing doesn't come into play at all, because the code in FETCH is never executed.
You can't eliminate every instance of JSR FETCH; for example, executing @ in interpretation state will call the code at FETCH. And in the following definition of PICK:
: PICK CELLS SP@ + @ ;
The @ will be a JMP FETCH (assuming the compiler performs tail call optimization) since the address to be fetched isn't known at compile time. But you can typically optimize away a number of calls to FETCH and friends. If you look at, e.g., sokoban.fs and tt.fs (in Gforth), you can see a number of instances of @ and ! which can be optimized because they are accessing variables.
Dr Jefyll wrote:
(zp) is a simpler mode, one not requiring addition to yield the pointer. The pointer is simply in the argument following the opcode.
In all of the non-Forth instances I listed above, (zp) would have been used had it been available on the NMOS 6502. Instead, a sequence like:
Code:
LDX #0
LDA (0,X)
was used; see, e.g., POKE, PEEK, DOKE, and DEEK in EhBASIC (all I did was search for ",X)" (without the quotes, of course) in a text editor). In other words, (zp) addressing would have saved a cycle (compared to (zp,X)), used one fewer instruction (no LDX #0), and would not have needed to overwrite the X register. Had Y been the more convenient register to clobber, (zp),Y no doubt would have been used.
For an implementation that doesn't need to run on the NMOS 6502, (zp,X) is needed a combined total of ZERO (!) times in the non-Forth languages listed. Mechanically replace all instances of (zp,X) with (zp) (you don't even need to bother removing any preceding LDX #0 instructions) and you will have a version that runs on the 65C02 with no usage of (zp,X) whatsoever (and even saves a (probably unnoticeable) cycle here and there).
One other thing I should probably point out is that the languages I chose to look at were ones where I had familiarity with the implementation internals. It is not a comprehensive list of 6502 languages, of course. If anyone wants to disassemble/analyze other languages/compilers, or knows of any existing disassemblies, by all means, feel free to post suggestions. It may well be that what I am familiar with is not a good representation of what existed and/or was used.
On that note, I just took a quick look at ATALAN, and if I understand it correctly, it does not generate (zp,X) instructions at all.
I've used Apple Pascal and Apple Logo in the distant past. Apple Pascal used a p-code interpreter, and if I remember correctly, the p-code bytecodes were documented in the manual, but I don't know how the interpreter worked. I know nothing about the internals of Apple Logo.
Another good candidate to investigate might be cc65. I have hardly looked at it much, so I don't have any sort of feel for what sort of code (instructions or addressing modes) it typically generates.
One thing that this exercise has shown me is the importance of doing analysis and looking at specific examples, rather than just blind speculation, because the speculation can be totally wrong. I knew (zp,X) wasn't heavily used, but as Jeff so succinctly put it, wow.