Golly, Mike -- up past your bedtime again?
I am, so forgive me if I just skim what you posted. I'm glad to see the 65m32 flame is still burning. As for the m65c02a snippets, eyeballing the code shows no surprises, but we don't have clock counts available, so I'd prefer to kick the "X vs S" ball around using the '816. The focus is on PHA vs DEX DEX STA 0,X and the equivalent Pull operations. Which Forth stack should use which register?
That's 19 versus 17 clocks in 16-bit mode, so after you add NEXT, the difference in speed is insignificant. It would be nice to have a single-cycle double INX and double DEX. That would take the first version above down to 16 clocks, faster (but again insignificantly so) than the second version.
I don't arrive at the same numbers or the same interpretation.
In fairness I grant that everything hinges on my premise that a fast P-stack is more important than a fast R-stack. My reasoning -- half-baked or maybe 2/3 or 3/4 baked -- is as follows: Forth code that's written for maximum performance does minimal nesting & un-nesting (i.e., R-stack pushing & pulling) in the inner loops that are the hot spots.
Of course "performance" might mean simply lowest latency responding to an input. Nevertheless, if it's written for speed then the code that really matters probably doesn't spend much time on the overhead of nesting & un-nesting; instead it'll be busy doing real work.
As for the numbers, I find the "psp in S" approach saves 4 or 5 clocks, not 2 (19 versus 17). Admittedly Table 5-7 of the '816 datasheet is dodgy to read, so an extra set of eyes would be welcome. I put my rundown in the attached text file. BTW the saving is 4 clocks for the exact code example I posted, but certain similar examples save 5 clocks. It depends on whether the example is a word that grows the stack (and uses PHA) or a word that shrinks the stack (and uses PLA, a slightly slower instruction). But 4.5 is the average saving.
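For anyone who wants to check the arithmetic, here is a rough Python tally of the two swaps. It is only a sketch, not the rundown from the attached file: the per-instruction clock counts are my assumed reading of the datasheet for 16-bit A and X with the direct-page low byte at zero, and the names CYCLES and clocks are made up for illustration.

```python
# Assumed 65816 clock counts: 16-bit accumulator and index registers,
# direct-page register low byte = 0 (no extra direct-page cycle).
CYCLES = {
    "PHA": 4,      # 3 + 1 for a 16-bit accumulator
    "PLA": 5,      # 4 + 1 for a 16-bit accumulator
    "DEX": 2,
    "INX": 2,
    "STA 0,X": 5,  # 4 + 1 for a 16-bit accumulator
    "LDA 0,X": 5,  # 4 + 1 for a 16-bit accumulator
}

def clocks(sequence):
    """Total clocks for a list of instruction mnemonics."""
    return sum(CYCLES[op] for op in sequence)

push_via_s = clocks(["PHA"])
push_via_x = clocks(["DEX", "DEX", "STA 0,X"])
pull_via_s = clocks(["PLA"])
pull_via_x = clocks(["LDA 0,X", "INX", "INX"])

push_saving = push_via_x - push_via_s   # words that grow the stack
pull_saving = pull_via_x - pull_via_s   # words that shrink the stack
average_saving = (push_saving + pull_saving) / 2

print("push:", push_via_x, "vs", push_via_s, "saves", push_saving)
print("pull:", pull_via_x, "vs", pull_via_s, "saves", pull_saving)
print("average saving:", average_saving)
```

Under those assumptions the push case saves 5 clocks and the pull case saves 4, which is where the 4.5 average comes from. A non-zero direct-page low byte would add a clock to each indexed access and widen the gap.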
"after you add NEXT, the difference in speed is insignificant"
Of course the speedup percentage can be diminished by taking a broader view (such as by including NEXT), but does that improve the discussion? For a broader picture I'd rather ask,
- how many words are affected, and are they ones which are commonly executed? And,
- what's the speedup per word?
The speedup is about 4.5 clocks for each word, based strictly on swapping PHA vs DEX DEX STA 0,X or swapping PLA vs LDA 0,X INX INX. Notable words affected include DOCON DOVAR LITERAL DUP OVER + - AND OR XOR R@ 0BRANCH. The price is a corresponding 4.5 clock slowdown for R-stack push-pull words because now they're forced to use X. These words notably include ENTER and EXIT (nest & un-nest).
So, the question becomes: In the small but critical code sections which are the only spots where performance actually matters, how many ENTERs and EXITs occur, versus the "do actual work" words DOCON DOVAR LITERAL DUP OVER + - AND OR R@ and 0BRANCH?
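To put that question in numbers, a back-of-envelope model is enough. The Python sketch below assumes every affected word gains or loses the same 4.5-clock average; the function name net_clocks_saved and the word counts in the example are hypothetical, not measurements of any real Forth.

```python
# Average per-word swing quoted in this post; treated as uniform across
# all affected words, which is a simplifying assumption.
SWING_CLOCKS = 4.5

def net_clocks_saved(work_words, nest_words, swing=SWING_CLOCKS):
    """Net clocks saved per pass by putting psp in S.

    work_words: P-stack push/pull words executed (DUP, OVER, + ...)
    nest_words: R-stack push/pull words executed (ENTER, EXIT ...)
    A positive result favors psp in S, a negative one favors psp in X.
    """
    return swing * (work_words - nest_words)

# Hypothetical hot loop: 10 "do actual work" words, 1 ENTER and 1 EXIT.
print(net_clocks_saved(10, 2))
```

The break-even point is simply equal counts: psp in S wins whenever the hot code executes more P-stack words than nesting words.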
Other things being equal, and if I were writing a new Forth specific to the '816, I'd opt for psp in S. The historical advantage for psp in X is decimated, or even (depending on the code) reversed.
(I also like psp in S because that'd let me pop the high byte of a long address straight into DBR! But if I start onto that topic I'll definitely be de-railing Michael's thread!)
cheers,
Jeff