6502.org Forum
All times are UTC

PostPosted: Thu Jun 13, 2013 2:12 am 
Joined: Thu Mar 11, 2004 7:42 am
Posts: 362
One thing that's come up once or twice (or more) in the various Programmable Logic threads is the value of having (zp,X) addressing. Here is a list of 6502 implementations of several programming languages and their use of (zp,X):

  • Assembly: The One Page Assembler (256 bytes): used in 3 instances
  • BASIC: Apple 1 BASIC (~4K): used in 2 instances
  • BASIC: Apple II "Integer" BASIC (~5K): used in 2 instances
  • BASIC: Applesoft BASIC (~10K): used in 5 instances
  • BASIC: EhBASIC (~12K): used in 11 instances
  • BASIC: Tiny BASIC (~2K): used in 0 instances
  • Focal: KIM-1 Focal (~6K): used in 0 instances

That's 25 instructions with (zp,X) addressing COMBINED in ~40K or so of code. Furthermore, in every single one of those instances the X register is zero (in several cases, this meant adding a preceding LDX #0 instruction). So if the code were meant for a 65C02, (zp) addressing would be used. So is (zp,X) worth adding to a 6502-like core?
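
To make the X = 0 case concrete, here is a small Python model of how the two modes form the effective address on a 6502/65C02. It is only a sketch: the `mem` dictionary and the function names are invented for illustration and do not come from any of the listed sources.

```python
def ea_zp_x(mem, zp, x):
    """Effective address for (zp,X): add X to zp, wrapping within
    zero page, then read a 16-bit pointer from that location."""
    p = (zp + x) & 0xFF
    lo = mem.get(p, 0)
    hi = mem.get((p + 1) & 0xFF, 0)
    return lo | (hi << 8)


def ea_zp(mem, zp):
    """Effective address for the 65C02 (zp) mode: the 16-bit pointer
    is read directly at zp, with no index added."""
    lo = mem.get(zp & 0xFF, 0)
    hi = mem.get((zp + 1) & 0xFF, 0)
    return lo | (hi << 8)


# Hypothetical zero page contents: the pointer $1234 stored at $20/$21.
mem = {0x20: 0x34, 0x21: 0x12}

# With X = 0 the two modes pick the same pointer.
same_when_x_is_zero = ea_zp_x(mem, 0x20, 0) == ea_zp(mem, 0x20)
```

With a nonzero X the same pointer is reached from a different base, e.g. `ea_zp_x(mem, 0x1E, 2)`, which is the case (zp) cannot cover.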

Well, several Forth implementations (e.g. FIG-Forth) use (zp,X) addressing, particularly in @ (fetch) and ! (store). However, it's worth noting that in an STC Forth, it is possible to optimize away (zp,X) addressing when compiling @ and ! (and C@ and C! as well). (IMO, this is one of the most worthwhile optimizations to make, behind tail call optimization, and maybe inlining, but not much else.) Let's illustrate with a (totally untested) example. First, let's define PUSHAY and PULLAY for pushing to and pulling from the Forth data stack (from and to the A and Y registers):

Code:
PUSHAY
   DEX
   DEX
   STY 0,X
   STA 1,X
   RTS
PULLAY
   LDY 0,X
   LDA 1,X
   INX
   INX
   RTS


Now we can define LITERAL as follows:

: LITERAL DUP #, LDY, 8 RSHIFT #, LDA, PUSHAY JSR, ; IMMEDIATE

In other words, 513 LITERAL compiles:

Code:
LDY #1     ; lo byte
LDA #2     ; hi byte
JSR PUSHAY


Now we can define @ and ! as follows:

Code:
FETCH
   LDA (0,X)
   INC 0,X
   BNE .1
   INC 1,X
.1 TAY
   LDA (0,X)
   STA 1,X
   STY 0,X
   RTS
STORE
   LDA 2,X
   STA (0,X)
   INC 0,X
   BNE .1
   INC 1,X
.1 LDA 3,X
   STA (0,X)
   INX
   INX
   INX
   INX
   RTS
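
For readers who want to check the stack effects, here is a rough Python model of PUSHAY, PULLAY, FETCH and STORE. It is a sketch under assumptions: the class name, the method names and the initial stack pointer value $80 are all made up for illustration, and it models only the net effect of each routine, not the instruction-by-instruction behaviour.

```python
class ForthStack:
    """Zero-page data stack indexed by X, growing downward."""

    def __init__(self):
        self.mem = bytearray(65536)
        self.x = 0x80  # hypothetical initial data stack pointer

    def _cell(self, offset):
        """Read the 16-bit cell at offset,X (low byte first)."""
        lo = self.mem[(self.x + offset) & 0xFF]
        hi = self.mem[(self.x + offset + 1) & 0xFF]
        return lo | (hi << 8)

    def push(self, value):
        """PUSHAY: DEX DEX, then store low byte at 0,X and high at 1,X."""
        self.x = (self.x - 2) & 0xFF
        self.mem[self.x] = value & 0xFF
        self.mem[(self.x + 1) & 0xFF] = (value >> 8) & 0xFF

    def pull(self):
        """PULLAY: read the top cell, then INX INX."""
        value = self._cell(0)
        self.x = (self.x + 2) & 0xFF
        return value

    def fetch(self):
        """@ : replace the address on top of stack with the cell it points to."""
        addr = self._cell(0)  # the pointer that (0,X) goes through
        lo = self.mem[addr]
        hi = self.mem[(addr + 1) & 0xFFFF]
        self.mem[self.x] = lo
        self.mem[(self.x + 1) & 0xFF] = hi

    def store(self):
        """! : store the second cell at the address on top, then drop both."""
        addr = self._cell(0)
        self.mem[addr] = self.mem[(self.x + 2) & 0xFF]
        self.mem[(addr + 1) & 0xFFFF] = self.mem[(self.x + 3) & 0xFF]
        self.x = (self.x + 4) & 0xFF
```

A round trip such as pushing a value, pushing 258, calling `store()`, then pushing 258 again and calling `fetch()` should leave the original value on top of the stack.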


So, : A 258 @ B 260 ! ; compiles (assuming tail call optimization):

Code:
A LDY #2
  LDA #1
  JSR PUSHAY
  JSR FETCH
  JSR B
  LDY #4
  LDA #1
  JSR PUSHAY
  JMP STORE


But this can be optimized to:

Code:
A LDY 258
  LDA 258+1
  JSR PUSHAY
  JSR B
  JSR PULLAY
  STY 260
  STA 260+1
  RTS


which is both shorter and (much) faster. CONSTANT and VARIABLE can be defined as follows:

: CONSTANT : POSTPONE LITERAL POSTPONE ; ;
: VARIABLE HERE 7 + CONSTANT 0 , ;

and thus 262 CONSTANT C compiles to:

Code:
C LDY #6
  LDA #1
  JMP PUSHAY


And VARIABLE D compiles to

Code:
D  LDY #<.1
   LDA #>.1
   JMP PUSHAY
.1 DW  0


Thus, a common usage of @ and ! (and similarly for C@ and C! and even +!) can be optimized by checking for (or keeping track of) just two possibilities: (1) a LDY #$xx LDA #$xx JSR PUSHAY sequence (i.e. a LITERAL) was just compiled, or (2) a JSR to a LDY #$xx LDA #$xx JMP PUSHAY routine (i.e. a call to a CONSTANT or a VARIABLE such as a JSR C or JSR D from the preceding examples) was just compiled. Bear in mind that a compiler that is doing tail call optimization (pretty much a must-have for an STC Forth) is already keeping track of whether it just compiled a JSR.
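
The bookkeeping for case (1) can be sketched in a few lines of Python. This is an illustration only, not how any actual Forth compiler is written: the function name, the token representation (ints for literals, strings for words) and the textual output are all assumptions, and it leaves out case (2) as well as tail call optimization.

```python
def compile_words(words):
    """Compile Forth tokens into pseudo-assembly lines, folding a literal
    address into a following @ or ! instead of calling FETCH/STORE."""
    code = []
    last_literal = None  # value of the literal whose push was just compiled
    for w in words:
        if isinstance(w, int):
            # LITERAL: push the 16-bit value via PUSHAY (3 lines).
            code += [f"LDY #{w & 0xFF}", f"LDA #{(w >> 8) & 0xFF}", "JSR PUSHAY"]
            last_literal = w
            continue
        if w == "@" and last_literal is not None:
            # Replace the immediate loads with absolute loads.
            del code[-3:]
            code += [f"LDY {last_literal}", f"LDA {last_literal}+1", "JSR PUSHAY"]
        elif w == "!" and last_literal is not None:
            # Drop the address push; pull the value and store it directly.
            del code[-3:]
            code += ["JSR PULLAY", f"STY {last_literal}", f"STA {last_literal}+1"]
        elif w == "@":
            code.append("JSR FETCH")
        elif w == "!":
            code.append("JSR STORE")
        else:
            code.append(f"JSR {w}")
        last_literal = None
    return code
```

Feeding it `[258, "@", "B", 260, "!"]` reproduces the optimized body of A shown above, minus the final RTS.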

So, while (zp,X) addressing is certainly nice to have, and its functionality should be possible, I'm not (yet) convinced that it needs to be easy or fast in a 6502-like processor. It depends on your goals, of course. If your goal is to run 6502 programs (or make it easy to port 6502 programs, as with the 65org16), then (zp,X) addressing is a must. If your goal is to have a 6502-inspired processor that can run Forth, then you may be better off looking at a stack machine or Forth processor.

I am, of course, open to arguments to the contrary.


PostPosted: Thu Jun 13, 2013 3:56 am

Joined: Fri Aug 30, 2002 1:09 am
Posts: 8543
Location: Southern California
I've never tried an STC Forth, although I would like to after the 9 reasons you gave for it not necessarily being any less memory-efficient than ITC and DTC (starting in the middle of your post, halfway down the long page at viewtopic.php?p=3335). My '816 ITC Forth, however, has 25 occurrences of the (DP,X) addressing mode in just the main file, not counting any that might be in the INCLude files. In 8 of those, the DP value is something other than 0, so a simple indirect is not adequate. Also, the X value must be preserved. Many of these (DP,X) occurrences are in fairly heavily used words:
Code:
TOGGLE
C_OR_BITS  (similar to TSB in assembly, for a single byte instead of a cell)
C_ANDBITS  (similar to TRB in assembly, for a single byte instead of a cell)
@
C@
!
SWAP!  (This one is headerless, but used enough in the kernel to justify merging SWAP and ! into a single primitive.)
C!
+!
-!
INCR  (like INC abs in assembly)
DECR  (like DEC abs in assembly)
2@
2!
FILL
COUNT
PERFORM
ON
OFF
C_OFF
SKIP
People writing kernels often write many of these as secondaries instead of primitives; but with the 816's ability to handle 16 bits at a time, making them primitives not only made a dramatic improvement in the performance, but even shortened the code in many cases. It was not worth doing them all as primitives on the 6502.

Fetch becomes just:
Code:
        LDA  (0,X)
        STA  0,X
having only 4 bytes and no JSRs or their overhead.

I think most BASICs put intermediate values on the hardware stack, don't they? But they're probably written for machines that don't leave the user any space in ZP. If the OS itself were written in Forth, then in spite of the data stack in ZP, they probably wouldn't need as much ZP space, since many of the variables they put in ZP would cease to exist and to take up space when they're not needed. I would be interested to see how much (DP,X) or (ZP,X) is used for other languages. It's not just how many occurrences there are in the whole kernel, but whether those are used heavily, meaning it would really cut performance to do a work-around if you didn't have this addressing mode.

[Added Feb 1, 2017:] BDD posted about his new use of <dp,X> at viewtopic.php?p=50484#p50484, in his long (23 pages so far) topic "POC VERSION TWO," changing the topic name there to "POC VERSION TWO: (<dp>,X) to the Rescue!"

_________________
http://WilsonMinesCo.com/ lots of 6502 resources
The "second front page" is http://wilsonminesco.com/links.html .
What's an additional VIA among friends, anyhow?


PostPosted: Thu Jun 13, 2013 4:30 am

Joined: Fri Dec 11, 2009 3:50 pm
Posts: 3367
Location: Ontario, Canada
Wow -- I'm startled that (zp,X) addressing is so rarely used! :shock: Thanks for the analysis.

dclxvi wrote:
That's 25 instructions with (zp,X) addressing ... in every single one of those instances the X register is zero ... So if it were meant for a 65C02, (zp) addressing would be used.
This part I understand. We don't need to add the zero that's always in X, and really there's no need to use X at all. So, regarding this first type of example, (zp) mode is an acceptable substitute for (zp,X) mode, which could be omitted from the processor spec.

Quote:
So is (zp,X) worth adding to a 6502-like core?
Code:
FETCH
   LDA (0,X)
There's a strong parallel, but is it the same? For all cases of (zp,X) mode in the code you posted, we don't need to add the zero that's always in the 2nd byte of the instruction; it might as well not be there. Regarding this second type of example, (X) mode* -- if it were implemented -- would be an acceptable substitute for (zp,X) mode. But (zp) mode is not.

Hmmm... As for (X) mode, I kinda like that idea, "processor-implementation-wise" :roll: :D

The comments about optimization are very thought provoking. I've never played with STC, but the notion of high-level Forth source generating inline machine code is pretty darn compelling. 8)

cheers,
Jeff

* to be clear, what I call (X) mode would still use zero-page indirection.

_________________
In 1988 my 65C02 got six new registers and 44 new full-speed instructions!
https://laughtonelectronics.com/Arcana/ ... mmary.html


Last edited by Dr Jefyll on Tue Jun 18, 2013 5:24 pm, edited 2 times in total.

PostPosted: Thu Jun 13, 2013 5:18 am

Joined: Fri Aug 30, 2002 1:09 am
Posts: 8543
Location: Southern California
I edited my post above, in light of Jeff's post.

PostPosted: Thu Jun 13, 2013 6:30 am

Joined: Tue Nov 16, 2010 8:00 am
Posts: 2353
Location: Gouda, The Netherlands
With an 8 bit X register, the (zp, X) mode isn't particularly useful I think. However, if you made the X register as wide as the address bus, you could turn the instruction around into (X, #offset), which would be very useful, because it represents a very common scenario where X holds the base address of some data structure, and you can pick different parts of that data structure by the offset.


PostPosted: Thu Jun 13, 2013 6:42 am

Joined: Fri Dec 11, 2009 3:50 pm
Posts: 3367
Location: Ontario, Canada
I wrote:
I'm startled that (zp,X) addressing is so rarely used.
GARTHWILSON wrote:
It's not just how many occurrences in the whole kernel, but whether those are used heavily
Yes, a poor choice of words on my part.
I wrote:
(zp) can't substitute for (zp,X). But such a thing as (X) mode could.
To be clear, I meant (X) as basically (zp,X) with the zp #offset removed. It still does indirection through zpg, though.

PostPosted: Thu Jun 13, 2013 7:44 am

Joined: Fri Aug 30, 2002 1:09 am
Posts: 8543
Location: Southern California
Arlet wrote:
With an 8 bit X register, the (zp, X) mode isn't particularly useful I think. However, if you made the X register as wide as the address bus, you could turn the instruction around into (X, #offset), which would be very useful, because it represents a very common scenario where X holds the base address of some data structure, and you can pick different parts of that data structure by the offset.

The data structure where (DP,X) is used so much in Forth is the data stack (as opposed to the return stack), which is only a portion of ZP, meaning that 8 bits is fine. Actually, my experiments suggest that you'll never need even a quarter of ZP unless you're doing multitasking and splitting up the ZP data stack area into sections and giving one to each task. X is used as the data stack pointer, and the data stack grows down. So regardless of the stack depth, 0,X always refers to the cell at the top of the stack, 2,X always refers to the next one, 4,X to the next, etc. The '816 does of course allow a 16-bit X; and although the direct page is still only 256 bytes, you can index outside of it, unlike the 6502, 65c02, and the 6502-emulation mode of the '802/'816.

Quote:
Quote:
(zp) can't substitute for (zp,X). But such a thing as (X) mode could.

To be clear, I meant (X) as basically (zp,X) with the zp #offset removed. It still does indirection through zpg, though.

A (X) addressing mode would be interesting, as would (A) and (Y); but in the 8 places I mentioned where (DP,X) had a base address other than 0, some were 2 and some were 4, for accessing the 2nd or 3rd cell on the stack. Top of stack is accessed by (0,X). So far, I've had no reason to go further, like (6,X) etc., as I have with DP,X (which is not indirect).

PostPosted: Thu Jun 13, 2013 8:09 am

Joined: Tue Nov 16, 2010 8:00 am
Posts: 2353
Location: Gouda, The Netherlands
GARTHWILSON wrote:
Arlet wrote:
With an 8 bit X register, the (zp, X) mode isn't particularly useful I think. However, if you made the X register as wide as the address bus, you could turn the instruction around into (X, #offset), which would be very useful, because it represents a very common scenario where X holds the base address of some data structure, and you can pick different parts of that data structure by the offset.

The data structure where (DP,X) is used so much in Forth is the data stack (as opposed to the return stack), which is only a portion of ZP, meaning that 8 bits is fine. Actually, my experiments suggest that you'll never need even a quarter of ZP unless you're doing multitasking and splitting up the ZP data stack area into sections and giving one to each task. X is used as the data stack pointer, and the data stack grows down. So regardless of the stack depth, 0,X always refers to the cell at the top of the stack, 2,X always refers to the next one, 4,X to the next, etc.

The way you are using the stack is basically what I was saying about using the addressing mode as (X, #offset), which makes it very useful in higher level languages. It's the same as (ind, X), but it's just a different way of thinking. Rather than thinking about indexing a ZP table with variable X, you think of it as indexing a structure with an 8-bit constant offset. I do think the reason we don't see this very often is because it's limited to zero page. Sometimes that's enough, such as in the Forth example, but often it isn't. With a 16 bit X register, this mode would be much more useful. In fact, with 16 bit index registers, I'd rather get rid of (ZP), Y.


PostPosted: Thu Jun 20, 2013 1:46 am

Joined: Thu Mar 11, 2004 7:42 am
Posts: 362
One thing I should point out is that LDA (ZP,X) is equivalent to (assuming 16-bit registers):

Code:
   LDY ZP,X ; 5 cycles, zp,X addressing
   LDA 0,Y  ; 6 cycles, abs,Y addressing


That's 11 cycles as compared to the 7 cycles LDA (ZP,X) takes. A pair of (zp,X) instructions, such as ADC (ZP,X) STA (ZP,X) is equivalent to this:

Code:
   LDY ZP,X ; 5 cycles, zp,X addressing
   ADC 0,Y  ; 6 cycles, abs,Y addressing
   STA 0,Y  ; 6 cycles, abs,Y addressing


Which is only a difference of 3 cycles (17 vs. 14=7*2). In a new/related core, fewer addressing modes means less logic (or at least more don't care logic) somewhere. The question is whether the benefit of fewer addressing modes is worth the cost of an extra 3 or 4 cycles in a few places. (Of course, the cycle counts may not be exactly the same as the 65C816 anyway.)
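
The arithmetic behind the "3 or 4 cycles" figure can be tallied directly. The Python snippet below is just a sketch: the table holds the cycle counts quoted in this post (65C816 with 16-bit registers), and the dictionary and variable names are made up for illustration.

```python
# Cycle counts as quoted in this post (65C816, 16-bit registers).
CYCLES = {
    "LDY zp,X": 5,
    "LDA abs,Y": 6,
    "ADC abs,Y": 6,
    "STA abs,Y": 6,
    "LDA (zp,X)": 7,
    "ADC (zp,X)": 7,
    "STA (zp,X)": 7,
}


def total(sequence):
    """Sum the quoted cycle counts for a list of instructions."""
    return sum(CYCLES[instruction] for instruction in sequence)


# A single access: one (zp,X) instruction vs. the two-instruction substitute.
single_direct = total(["LDA (zp,X)"])
single_split = total(["LDY zp,X", "LDA abs,Y"])

# A read-modify-write pair: the LDY is paid only once in the substitute.
pair_direct = total(["ADC (zp,X)", "STA (zp,X)"])
pair_split = total(["LDY zp,X", "ADC abs,Y", "STA abs,Y"])

single_penalty = single_split - single_direct  # extra cycles for one access
pair_penalty = pair_split - pair_direct        # extra cycles for the pair
```

The penalty shrinks as more accesses share the same pointer, since the LDY ZP,X is only executed once per pointer.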

(It's also interesting to note that with the above sequences you don't necessarily have to use 0,Y; it could be offset,Y, which gives you a (ZP,X),offset addressing mode -- the syntax is analogous to (stack,S),Y addressing on the 65C816.)

Another consideration is the cost of adding an additional register (or registers), as the sequences above overwrite Y and it might be useful to have another register available for clobbering.

GARTHWILSON wrote:
Many of these (DP,X) occurrences are in fairly heavily used words:


It would help if there were definitions for the non-standard words (either as equivalent colon definitions in terms of standard words, or the source code of the primitive). PERFORM or SKIP aren't descriptive enough names for me to have any idea of what they might do. My guesses for the rest are as follows (let me know where I'm wrong). It looks like a lot of them will be similar to +! so for reference: : +! ( x a -) TUCK @ + SWAP ! ;

Code:
: C_OR_BITS ( c a -) TUCK C@ OR  SWAP C! ;
: C_ANDBITS ( c a -) TUCK C@ AND SWAP C! ;
: -!        ( x a -) TUCK @  SWAP - SWAP ! ; \ or NEGATE +!

: DECR ( a -) DUP @ 1-  SWAP ! ;
: INCR ( a -) DUP @ 1+  SWAP ! ;

: SWAP! ( a x -) SWAP ! ;


OFF and ON are commonly defined as equivalent to the following, so I'm guessing that's true here (and I'm guessing C_OFF is just C! instead of !):

Code:
: ON    ( a -) TRUE SWAP !  ;
: OFF   ( a -) 0    SWAP !  ;
: C_OFF ( a -) 0    SWAP C! ;


I'm guessing TOGGLE is either like DECR or +!

Code:
: TOGGLE ( a -)   DUP  @ INVERT SWAP ! ; \ like DECR
: TOGGLE ( x a -) TUCK @ XOR    SWAP ! ; \ like +!


Since I'll be referring to them later on, I'll define AND! and OR! here; they are like +! except that they use AND and OR instead of addition (in other words, they are cell-wide versions of C_ANDBITS and C_OR_BITS as defined above).

Code:
: AND! ( x a -) TUCK @ AND SWAP ! ;
: OR!  ( x a -) TUCK @ OR  SWAP ! ;


I've used ON and OFF (as I've defined them above) in the past, and I may have used AND! and OR! once or twice, but I've never defined any of the other non-standard words, nor have I even seen them defined or used anywhere else. Certainly none (in my experience) would qualify as heavily used, or even moderately used. Is your experience different? Do you have specific examples where these words are used, heavily or not?

The same technique for optimizing ! @ C! and C@ can be applied to DECR INCR OFF AND! and OR! to generate DEC, INC, STZ, TRB, and TSB absolute instructions (the TRB would need to be preceded by an EOR #$FFFF for AND! as defined above). If you really felt like it, C_OR_BITS and C_ANDBITS could be optimized to generate SEP-TSB-REP and SEP-TRB-REP sequences, respectively. More on this below.
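
Why AND! needs the EOR #$FFFF while OR! does not can be checked with a tiny Python model of the memory effect of TRB and TSB in 16-bit mode. This is a sketch only: the function names are invented, and the flag effects of the real instructions are ignored.

```python
MASK = 0xFFFF  # 16-bit accumulator and memory cell


def trb(cell, a):
    """TRB: clear in memory every bit that is set in A."""
    return cell & ~a & MASK


def tsb(cell, a):
    """TSB: set in memory every bit that is set in A."""
    return (cell | a) & MASK


def and_store(cell, x):
    """V AND! : EOR #$FFFF inverts x, so TRB clears exactly the bits
    that are clear in x, leaving cell AND x."""
    a = x ^ MASK
    return trb(cell, a)


def or_store(cell, x):
    """V OR! : TSB already computes cell OR x, so no inversion is needed."""
    return tsb(cell, x)
```

For example, `and_store(0xF0F0, 0x0FF0)` gives `0x00F0` and `or_store(0xF0F0, 0x0FF0)` gives `0xFFF0`, matching plain AND and OR.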

Of the standard words, I've used 2@ and 2! maybe once or twice ever, and while I've seen it used by others, it's not heavy usage; certainly not heavy enough that it even matters whether they are colon definitions or primitives.

COUNT is popular (and I've used it many times myself), but it's generally used outside loops, so it is not all that performance sensitive; a little less than the absolute maximum possible speed should be okay.

If you're concerned about performance, FILL should probably have a loop that looks something like this:

Code:
.1 STA (zp),Y ; 7 cycles
   DEY        ; 2 cycles
   DEY        ; 2 cycles
   BNE .1     ; 3 cycles


or use MVN or something, rather than having (zp,X) addressing inside a loop. You'll obviously need to do a little bit of preparation there, e.g. copy the address from the data stack to a fixed zp location, determine whether an even or odd number of bytes are being filled, etc., but it won't take a large fill count for the preparation to pay for itself.

That leaves ! @ C! C@ and +!. All can be optimized when preceded by a constant, literal, or variable. In fact, +! almost always operates on a variable; I'm hard pressed to come up with a single example that I've seen where +! operated on something other than a variable. More on optimizing +! below.

GARTHWILSON wrote:
People writing kernels often write many of these as secondaries instead of primitives


The point is not secondary vs. primitive, but whether the performance hit (assuming there is any) from replacing (zp,X) in the primitive with some other (slightly) slower sequence (such as the one I listed at the beginning of the post) will actually be significant or noticeable.

On a 65C816, there is of course no reason not to use (zp,X) addressing since it is already available, but the system I had in mind here is a related core where the data is all 16 bits (or all 32 bits) wide.

GARTHWILSON wrote:
no JSRs or their overhead


I don't understand why you're disregarding the overhead of NEXT in DTC/ITC. The only DTC or ITC NEXT implementation that I know of that doesn't use the X register and takes fewer than the 12 cycles that a JSR-RTS pair (the STC equivalent of NEXT) takes is the 6-cycle RTS DTC trick.

On the 65816, the STC version of FETCH is the same as in DTC/ITC (only NEXT is different):

Code:
FETCH
   LDA (0,X)
   STA 0,X
   RTS       ; NEXT


The JSR-RTS pair will almost certainly beat the overhead of DTC/ITC NEXT in any non-goofy DTC/ITC implementation. However, in STC, you'll still have opportunities to optimize things like V @ (V is a variable here), which, in DTC or ITC, will compile to:

Code:
   DW V,FETCH


This means DTC/ITC takes however many cycles your implementation of DOVAR (or its equivalent) takes, plus the NEXT at the end of DOVAR, plus LDA (0,X) and STA 0,X plus the NEXT at the end of FETCH. In STC, this can be optimized to:

Code:
   LDA V.1  ; i.e. the address of the DW at .1 not the address of the LDA at V
   JSR PUSH


where PUSH is:

Code:
PUSH
   DEX
   DEX
   STA 0,X
   RTS


and V is:

Code:
V  LDA #.1
   JMP PUSH
.1 DW  0


You could even inline PUSH after the LDA V.1 (it's only one extra byte) and have no JSR-RTS pairs at all:

Code:
   LDA V.1
   DEX
   DEX
   STA 0,X


That's faster than FETCH (I'm counting NEXT as a part of FETCH here) by itself -- even with a 6-cycle NEXT -- in DTC or ITC.

The optimization for V ! is:

Code:
   JSR PULL
   STA V.1


where PULL is what you'd expect:

Code:
PULL
   LDA 0,X
   INX
   INX
   RTS


And PULL could also be inlined. The V +! optimization is:

Code:
   JSR PULL
   CLC
   ADC V.1
   STA V.1


V AND! is:

Code:
   JSR PULL
   EOR #$FFFF
   TRB V.1


V OR! is:

Code:
   JSR PULL
   TSB V.1


The V DECR optimization doesn't even have a JSR PULL to inline:

Code:
   DEC V.1


V INCR is:

Code:
   INC V.1


V OFF is:

Code:
   STZ V.1


V ON is:

Code:
   LDA #$FFFF
   STA V.1


GARTHWILSON wrote:
I think most BASICs put intermediate values on the hardware stack, don't they?


Of the three varieties of BASIC I listed (Apple 1 BASIC and Integer BASIC are cut from the same cloth, as are Applesoft and EhBASIC), only the Applesoft/EhBASIC variety does this. The Apple-1/Integer BASIC variety and Tiny BASIC use a Forth-like data stack, even using zp,X addressing.

For completeness, KIM-1 Focal uses a data stack that can span multiple pages (using (zp),Y addressing). The One Page Assembler does not have intermediate values (essentially you have to put intermediate values in local labels, so I suppose it could be said that intermediate values are put in an array rather than on a stack).

It's possible that addressing modes could have been better chosen, but without some analysis or specific examples, that is just blind speculation.

Dr Jefyll wrote:
If I read you properly, the comments about optimization elsewhere in your post are incidental (although thought provoking). What I get as being central is that the instructions marked <--- use an unnecessarily complex addressing mode since one of the two values added to create the pointer is zero.


The optimization is actually the central point. The call to FETCH is being eliminated entirely (and LDY/LDA immediate are replaced by LDY/LDA absolute), and the (zp,X) addressing doesn't come into play at all, because the code in FETCH is not being executed at all.

You can't eliminate every instance of JSR FETCH; for example, executing @ in interpretation state will call the code at FETCH. And in the following definition of PICK:

: PICK CELLS SP@ + @ ;

The @ will be a JMP FETCH (assuming the compiler performs tail call optimization) since the address to be fetched isn't known at compile time. But you can typically optimize away a number of calls to FETCH and friends. If you look at, e.g., sokoban.fs and tt.fs (in Gforth), you can see a number of instances of @ and ! which can be optimized because they are accessing variables.

Dr Jefyll wrote:
(zp) is a simpler mode, one not requiring addition to yield the pointer. The pointer is simply in the argument following the opcode.


In all of the non-Forth instances I listed above, (zp) would have been used had it been available on the NMOS 6502. Instead, a sequence like:

Code:
   LDX #0
   LDA (0,X)


was used, e.g. see POKE, PEEK, DOKE, and DEEK, in EhBASIC (all I did was search for ",X)" (without the quotes, of course) in a text editor). In other words, (zp) addressing would have saved a cycle (compared to (zp,X)), used one less instruction (no LDX #0) and would not need to overwrite the X register. Had Y been the more convenient register to clobber, (zp),Y no doubt would have been used.

For an implementation that doesn't need to run on the NMOS 6502, (zp,X) is needed a combined total of ZERO (!) times in the non-Forth languages listed. Mechanically replace all instances of (zp,X) with (zp) (you don't even need to bother removing any preceding LDX #0 instructions) and you will have a version that runs on the 65C02 with no usage of (zp,X) whatsoever (and even saves a (probably unnoticeable) cycle here and there).

One other thing I should probably point out is that the languages I chose to look at were ones where I had familiarity with the implementation internals. It is not a comprehensive list of 6502 languages, of course. If anyone wants to disassemble/analyze other languages/compilers, or knows of any existing disassemblies, by all means, feel free to post suggestions. It may well be that what I am familiar with is not a good representation of what existed and/or was used.

On that note, I just took a quick look at ATALAN, and if I understand it correctly, it does not generate (zp,X) instructions at all.

I've used Apple Pascal and Apple Logo in the distant past. Apple Pascal used a p-code interpreter, and if I remember correctly, the p-code bytecodes were documented in the manual, but I don't know how the interpreter worked. I know nothing about the internals of Apple Logo.

Another good candidate to investigate might be cc65. I have hardly looked at it, so I don't have any sort of feel for what sort of code (instructions or addressing modes) it typically generates.

One thing that this exercise has shown me is the importance of doing analysis and looking at specific examples, rather than just blind speculation, because the speculation can be totally wrong. I knew (zp,X) wasn't heavily used, but as Jeff so succinctly put it, wow.


PostPosted: Thu Jun 20, 2013 4:35 am

Joined: Fri Aug 30, 2002 1:09 am
Posts: 8543
Location: Southern California
dclxvi wrote:
Another consideration is the cost of adding an additional register (or registers), as the sequences above overwrite Y and it might be useful to have another register available for clobbering.

So are you suggesting/suspecting that having another register to do this with might make for enough possible speed increase in the instruction decoding to justify getting rid of (ZP,X)?

I'll answer several things below without re-quoting.
Quote:
GARTHWILSON wrote:
Many of these (DP,X) occurrences are in fairly heavily used words:

It would help if there were definitions for the non-standard words (either as equivalent colon definitions in terms of standard words, or the source code of the primitive). PERFORM or SKIP aren't descriptive enough names for me to have any idea of what they might do. My guesses for the rest are as follows (let me know where I'm wrong). It looks like a lot of them will be similar to +! so for reference: : +! ( x a -) TUCK @ + SWAP ! ;

I think the only ones I myself named are the three that start with C_ . PERFORM is in common Forth usage, and can be defined as
Code:
: PERFORM @ EXECUTE ;
SKIP is also in common Forth usage for skipping over leading delimiters (usually spaces) in a string, so it gets heavy use during compilation. It (along with SCAN) was apparently introduced by Klaus Schliesiek in 1982, and used in Laxen-Perry's F83. +! is in Starting Forth, Thinking Forth, and ANS Core. I did not find anyone else's usage of ANDing or ORing bits of a byte, so I made up my own names. I use these a lot in I/O. TOGGLE is a common Forth word that XORs the byte at the specified address with the specified mask, also useful in I/O bit-twiddling. INCR and DECR were used by Laboratory Microsystems in their UR/Forth. I don't know if they originated with them. These words just increment or decrement (by 1) the cell at the specified address. On the '816, 2@ and 2! are almost the same number of bytes as primitives as they are as secondaries, but the primitives are nearly ten times as fast, since they omit the running of nest, unnest, and a lot of NEXTs; so if there's no memory penalty, we might as well make them primitives, even though they're not used much. One use that comes to mind immediately is testing my 4Mx8 memory modules, where I also use a double-precision DO LOOP (i.e., 2DO...2I...2LOOP).

Quote:
GARTHWILSON wrote:
no JSRs or their overhead

I don't understand why you're disregarding the overhead of NEXT in DTC/ITC.

The context was where the @ is part of a primitive, so it is simply inlined, needing no NEXT or JSR/RTS. For some similar words like ! , it can go even further, because if the primitive will later need the cells for something else before it's done, it doesn't have to INX INX and later DEX DEX again for each cell.


Quote:
GARTHWILSON wrote:
I think most BASICs put intermediate values on the hardware stack, don't they?

Of the three varieties of BASIC I listed (Apple 1 BASIC and Integer BASIC are cut from the same cloth, as are Applesoft and EhBASIC), only the Applesoft/EhBASIC variety does this. The Apple-1/Integer BASIC variety and Tiny BASIC use a Forth-like data stack, even using zp,X addressing.

For completeness, KIM-1 Focal uses a data stack that can span multiple pages (using (zp),Y addressing). The One Page Assembler does not have intermediate values (essentially you have to put intermediate values in local labels, so I suppose it could be said that intermediate values are put in an array rather than on a stack).

That's good to know.

PostPosted: Thu Jun 20, 2013 7:25 pm

Joined: Sun Nov 08, 2009 1:56 am
Posts: 411
Location: Minnesota
Quote:
Apple Pascal used a p-code interpreter, and if I remember correctly, the p-code bytecodes were documented in the manual, but I don't know how the interpreter worked.


IIRC, when I wrote one of these interpreters decades ago I used (zp,x) once, and only because I couldn't free the Y-register at that point (so yes, the X-register was zero).

But it did occur to me that (zp,x) could be used to manage multiple pointers in zero page, if you're willing to advance those pointers by means of INC instructions. There are situations where more than one pointer is in use and they can't easily be aligned so a single value in the Y-register suffices.

For example, a printf() implementation. The routine might accept as an argument only the format string. Scan along that with (zp),y, and when a '%' format specifier is found, load the associated value or pointer from an agreed location (the hardware stack, say, or perhaps the caller already pre-loaded them into specified zero page locations). If it's a pointer that's required, use (zp,x) to load the associated value.

I actually did put a primitive version of printf() in that interpreter, mainly so I could print out debugging messages which could tell me the values of certain variables at the point of error. It did have the ability to handle however many pointers were handed to it (of course the caller had to be very careful to supply one pointer for each '%', because the callee had no way to check whether or not the number it got was correct). I don't remember if I used (zp,x) in it, though.


PostPosted: Thu Jun 20, 2013 7:49 pm 
Offline
User avatar

Joined: Thu May 28, 2009 9:46 pm
Posts: 8505
Location: Midwestern USA
In the past, I have used (ZP,X) for accessing device drivers by device number (shades of the old Commodore "kernal"). However, with the 65C816 (ZP,X) is somewhat "obsolete," since its functionality is possible with some of the new '816 instructions, without consuming valuable zero page locations. I suppose if I were starting with a blank sheet of paper I'd scrap (ZP,X) and use its opcode for something else. What would be interesting would be (ZP,X),Y. :D

_________________
x86?  We ain't got no x86.  We don't NEED no stinking x86!


PostPosted: Thu Jun 20, 2013 8:01 pm 
Offline
User avatar

Joined: Thu May 28, 2009 9:46 pm
Posts: 8505
Location: Midwestern USA
teamtempest wrote:
But it did occur to me that (zp,x) could be used to manage multiple pointers in zero page, if you're willing to advance those pointers by means of INC instructions. There are situations where more than one pointer is in use and they can't easily be aligned so a single value in the Y-register suffices.

This would be quite practical with the 65C816, as you can increment the full 16-bit pointer in one operation. Be that as it may, the '816 kind of marginalizes such zero page acrobatics, since something like LDA (<offset>,S),Y can be used for essentially the same purpose if you are okay with self-modifying code (you'd be changing <offset> to access different pointers). There's also the possibility of using a 16-bit index, which gives you 64K of addressing range without use of zero page. I use that technique in the SCSI driver code in my POC unit's BIOS to read/write the buffer.
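For illustration, an untested sketch of the one-operation increment, assuming native mode and a hypothetical direct-page pointer table named PTRS:

```asm
; 65C816 native mode; PTRS is a hypothetical direct-page table of 16-bit pointers
; X = 0, 2, 4, ... selects the pointer, as on the 6502
   REP #$20       ; clear the m flag: 16-bit accumulator and memory operations
   INC PTRS,X     ; one instruction advances the whole 16-bit pointer
   SEP #$20       ; set m again: back to an 8-bit accumulator
   LDA (PTRS,X)   ; fetch a byte through the advanced pointer
```

If the surrounding code already runs with 16-bit memory operations, the REP/SEP pair goes away and the increment really is a single instruction.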



PostPosted: Fri Jun 21, 2013 5:10 am 
Offline
User avatar

Joined: Tue Nov 16, 2010 8:00 am
Posts: 2353
Location: Gouda, The Netherlands
From the point of view of processor design, it would be much better to keep (ZP,X) and get rid of (ZP),Y. The latter requires memory lookups for the opcode, the operand, and 2 lookups for the address, and only then is followed by the actual operation. This is very hard to implement efficiently (as we've seen in heated discussions in other threads).

Of course, the (ZP),Y mode is too useful to get rid of without a proper replacement, so what would be a reasonable alternative? I'm thinking that you need two things: more index registers, and bigger ones. If you have 16-bit (same as address bus width) index registers, and you have at least 4 of them, you can keep a bunch of address pointers around in registers, rather than in ZP pairs.


PostPosted: Thu Jun 27, 2013 2:17 am 
Offline
User avatar

Joined: Thu Mar 11, 2004 7:42 am
Posts: 362
GARTHWILSON wrote:
So are you suggesting/suspecting that having another register to do this with might make for enough possible speed increase in the instruction decoding to justify getting rid of (ZP,X)?


It might be nice to have another register available, but it's possible that you don't actually need another register. Y usually winds up being a scratch register in Forth, at least for me, so it's okay if it gets clobbered.

My thought with an additional register was that adding registers seems to be a (relatively) attractive approach, for whatever reasons. Arlet suggested additional registers above, for example. Another example is that the 65org16 was extended by adding registers.

GARTHWILSON wrote:
SKIP is also in common Forth usage for skipping over leading delimiters (usually spaces) in a string, so it gets heavy use during compilation.


I can't recall having seen PERFORM before, but I tend to use @ EXECUTE in those situations rather than define another word, so maybe I actually have encountered it before and simply forgotten.

I've tended to have PARSE and PARSE-WORD (or WORD) be single long-ish definitions, and not factor them into smaller words, simply because I don't wind up needing or using the smaller words. Using smaller words does seem to be a more common approach, but I hadn't remembered that word being called SKIP.

But parsing is another good example where the maximum possible speed isn't absolutely necessary, though obviously you don't want it to be too slow. The whole point of having an inner interpreter is that you compile "once" (well, once when you load the code) and execute many times. It's more important that the "many times" part be fast. Forth gets its performance from the fact that the inner interpreter is a lot faster than parsing and dictionary lookup.

GARTHWILSON wrote:
I did not find anyone else's usage of ANDing or ORing bits of a byte, so I made up my own names. I use these a lot in I/O.


I/O locations tend to be at known (at compile (or load) time) locations and thus potentially subject to the optimizations above. I'm hard pressed to come up with any examples where an I/O address would actually change at run time.
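As an untested sketch of what that might compile to, take a hypothetical phrase PORTB C@, where PORTB is an assumed constant naming an I/O register. With the literal address folded into the fetch, and reusing the PUSHAY routine from the first post (Y = low byte, A = high byte), the generated code could be:

```asm
; PORTB: hypothetical I/O register at an address known at compile time
; Unoptimized, the address is pushed and C@ fetches through it with LDA (0,X)
; Optimized, the literal address is folded into an absolute load:
   LDY PORTB      ; read the I/O register directly, no (zp,X) needed
   LDA #0         ; high byte of the result is zero for a byte fetch
   JSR PUSHAY     ; push the result cell onto the data stack
```

The address never goes onto the data stack at all, which is why a fixed I/O location doesn't need (zp,X).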

GARTHWILSON wrote:
The context was where the @ is part of a primitive


There's nothing that prevents you from combining several operations into a single primitive in STC and you'll still get at least a speed improvement as compared to using secondaries. Without (zp,X) you'll (possibly) have some primitives that are slightly slower, in situations where the optimizations above don't apply. Those situations do exist, but I'm not sure they'll be encountered enough to matter much.

GARTHWILSON wrote:
it doesn't have to INX INX and later DEX DEX again for each cell


This is one of the reasons why keeping TOS in the accumulator is tempting. It can get in the way with DTC/ITC since many NEXT implementations use the accumulator, but in STC, since next is RTS and nest and unnest are JSR and RTS, the accumulator is (more) available. There are a few simple benefits, e.g. 2* is ASL instead of ASL 0,X (both smaller and faster), and there are variations on the ! and @ optimization theme above, e.g. 1234 XOR can be optimized to a single EOR #1234 instruction. It's easy to get carried away daydreaming about optimizations. :)
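To illustrate the difference with a two-operand word, here is an untested sketch of XOR both ways. It assumes a 16-bit accumulator (e.g. 65C816 native mode with m = 0) and a data stack in zero/direct page indexed by X; the labels XOR_MEM and XOR_ACC are made up for the comparison:

```asm
; Assumes a 16-bit accumulator (e.g. 65C816 native mode with m = 0)
; and a data stack in zero/direct page indexed by X

; XOR with TOS in memory: both operands live on the stack
XOR_MEM
   LDA 0,X        ; top cell
   EOR 2,X        ; combine with the second cell
   STA 2,X        ; result replaces the second cell
   INX
   INX            ; drop the old top cell
   RTS

; XOR with TOS in the accumulator: only the second cell is in memory
XOR_ACC
   EOR 0,X        ; A = TOS xor second cell; result is the new TOS
   INX
   INX            ; drop the second cell
   RTS
```

The accumulator version saves the load and the store, at the cost of every word having to respect the convention that A holds the top cell.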

BigDumbDinosaur wrote:
However, with the 65C816 (ZP,X) is somewhat "obsolete," since its functionality is possible with some of the new '816 instructions, without consuming valuable zero page locations.


Bob Bishop noticed that he wasn't using the zero/direct page on the 65C816 much in his Modular Assembly Language series (the parenthetical statement at top left of p. 31 in the Using Page Zero section):

http://bob-bishop.awardspace.com/Apple-IIGS/index.html

Other than Forth, my experience has been similar for native mode code (though my code hasn't used his calling convention). The December 1986 article he references in the Designing Modular Code For The Apple IIGS section can be found here:

http://bob-bishop.awardspace.com/Apple- ... index.html

