Updated the M65C02A repository with a new instruction set map for the M65C02B. The M65C02B is a modification of the base instruction set of the M65C02A soft-core processor.
The M65C02A has five unused opcodes in Column 2. One of these is reserved for a co-processor, and the remaining four were reserved for signed / unsigned multiplication and division. However, Column 3 of the M65C02A instruction set implemented an ip,I++ (register-indirect with auto-increment) addressing mode, using the 16-bit IP register from the M65C02A's built-in Forth VM. The sixteen instructions implemented were:
Code:
ORA/ASL/AND/ROL/EOR/LSR/ADD/ROR/STA/TSB/LDA/TRB/CMP/DEC/SUB/INC ip,I++
This complement of instructions was seen as representing nearly all of the important instructions of the basic 6502/65C02 with this new addressing mode.
I originally saw these Column 3 instructions as being important for the implementation of a fast Forth, but that assumption reflected a woeful lack of understanding of Forth. In the end, only the STA/LDA ip,I++ instructions were used, and the auto-increment of the IP was often unwanted.
In the implementation of the basic extended 6502/65C02 architecture of the M65C02A, the emphasis was toward making the instruction set support the code generator of the Mak Pascal compiler, a recursive descent compiler originally targeting the 8086 instruction set. After studying the compiler's code generator, I determined that the M65C02A would require both BP-relative and SP-relative addressing modes.
The standard pre-indexed (by X) zero page and pre-indexed (by X) absolute addressing modes would easily support the BP-relative addressing mode required to access parameters and local variables held on the stack. A specific modification was needed to use the zp,X addressing mode in the manner required for the BP-relative addressing mode available with the 8086: if the BP in X is greater than 512, the 8-bit offset is treated as signed, else it is treated as unsigned. When X is less than 512, X points into page 0 or page 1, in which case, for backward compatibility, the 8-bit offset in the instruction must be treated as unsigned, as it is in the 6502/65C02 processors. When X is greater than 512, the assumption is made that the extended stack capability of the M65C02A (which supports 16-bit stack pointers with stacks up to 64k in size) is being used, and that the base pointer (BP) placed in X, using the 16-bit TSX.w instruction of the M65C02A, points into a stack with sufficient space to support both parameter passing and local variable declarations. In this situation, the 8-bit offset in the instruction is treated as signed and sign-extended automatically.
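The signed / unsigned offset rule described above can be sketched in a few lines of Python. This is only an illustrative model, not the M65C02A implementation: the function name is hypothetical, the behaviour at exactly X = 512 is not stated in the text (the sketch treats it as signed), and no page-wrap behaviour is modelled.

```python
def zp_x_effective_address(x, offset):
    """Illustrative model of the modified zp,X address calculation.

    x      -- 16-bit value in X (the base pointer, e.g. loaded via TSX.w)
    offset -- 8-bit offset byte from the instruction
    """
    x &= 0xFFFF
    offset &= 0xFF
    if x < 512:
        # X points into page 0 or page 1: offset stays unsigned,
        # as on the 6502/65C02 (page wrap is NOT modelled here).
        return (x + offset) & 0xFFFF
    # Assumption: X == 512 is handled like X > 512.
    # The offset is sign-extended from 8 to 16 bits.
    signed = offset - 0x100 if offset & 0x80 else offset
    return (x + signed) & 0xFFFF


print(hex(zp_x_effective_address(0x0100, 0xFE)))  # unsigned offset
print(hex(zp_x_effective_address(0x8000, 0xFE)))  # signed offset (-2)
```

With the base pointer in an extended stack, negative offsets reach one side of BP and positive offsets the other, which is what the 8086-style frame layout expects.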
The SP-relative addressing mode is easily constructed from the zero page pre-indexed (by X) and absolute pre-indexed (by X) addressing modes. By applying the
OSX (override X with S) prefix instruction, these two addressing modes are converted to the required SP-relative addressing mode, with 8-bit and 16-bit unsigned offsets, respectively. For those instructions in the base 6502/65C02 instruction set which don't support pre-indexed (by X) addressing, applying the OSX prefix instruction is allowed, and will provide SP-relative addressing for those instructions. However, these instructions will not support the BP-relative addressing mode which only applies to instructions supporting the pre-indexed by X addressing modes.
Given those new addressing modes and a few other custom instructions, the code generator of the Mak Pascal compiler was easily modified to support the extended 6502/65C02 architecture of the M65C02A. One problem that came up was that the order of the operands for subtractions and comparisons was not as needed. For the 8086, the evaluation order could be easily reversed, so the order of the operands on the evaluation / operand stack was not an issue. The issue arises primarily because the recursive descent compiler pushes the operands of an expression onto the operand stack in left to right order. This puts the left operand deeper in the operand stack than the right operand of operations, such as subtraction and compare, in which the order of the operands is important. Popping the first operand (right operand) on the stack into the accumulator leaves the second operand (left operand) on the stack. With the 8086, the second operand could be popped into a second register, and the order of the operations using these two registers can be reversed.
I resolved this issue by adding an instruction,
XMA zp,X, that exchanges A with an operand in memory, i.e. in the operand stack when the SP-relative form of the addressing mode is used:
XMA[.w] 1,S. As useful as this instruction is in correctly ordering the operands of subtract / compare operations, it just did not sit well with me that subtractions / comparisons require an additional 7 clock cycles compared to the other two-operand instructions of the 6502/65C02/M65C02A instruction set. Without major changes to the compiler, these additional 7 cycles would be required for all subtraction and comparison operations.
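To make the operand-order problem concrete, here is a minimal Python sketch of the evaluation stack for a subtraction. It is a toy model under stated assumptions: the function name and the `use_xma` flag are hypothetical, values are 16 bits wide, and cycle costs are not modelled.

```python
MASK16 = 0xFFFF


def stack_subtract(left, right, use_xma=True):
    """Toy model of 'left - right' on the operand stack."""
    stack = []
    stack.append(left)    # operands are pushed in left to right order
    stack.append(right)
    a = stack.pop()       # A = right operand; left operand stays on the stack
    if use_xma:
        # XMA 1,S: exchange A with the operand on top of the stack
        a, stack[-1] = stack[-1], a   # now A = left, stack top = right
    a = (a - stack[-1]) & MASK16      # subtract the stacked operand from A
    stack.pop()                       # ADJ #2: discard the stacked operand
    return a


print(stack_subtract(10, 3))                 # operands in source order
print(stack_subtract(10, 3, use_xma=False))  # reversed: computes 3 - 10
```

Without the exchange, the accumulator holds the right operand and the subtraction runs backwards, which is the case the XMA instruction was added to handle.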
The SP-relative indirect addressing mode, created by applying the
OSX and
IND prefix flags to a pre-indexed by X addressing mode, allows the processor to load / access the value of a variable whose address is on the stack. This addressing mode is very useful, and fairly inexpensive, but requires more cycles than if the pointer is first popped off the stack and the
LDA ip,I++ or
STA ip,I++ instructions in Column 3 were used instead. The SP-relative indirect addressing mode uses two additional memory cycles to resolve the address of the operand, and leaves the pointer on the stack, which frequently will require that the operand stack be cleaned up by popping the value at a typical cost of an additional two memory cycles. Pulling the pointer off the operand stack into IP solves the stack cleanup problem, but if the pointer is also to be used to store the value of the expression into the same variable, the auto-increment feature of the ip,I++ addressing mode modifies the pointer in the IP register. Thus, sometimes it would be necessary to use an instruction sequence using the SP-relative indirect addressing mode, especially if the variable whose address is on the stack is being updated with the results of the operation:
CLC [1], ADC.w (1,S) [6], STA.w (1,S) [6], ADJ #2 [2]. Using the register-indirect mechanism supported by the Column 3 ip,I++ instructions is difficult since the pointer loaded in IP is auto-incremented and there's no mechanism to override the auto-increment function, nor is there a decrement IP instruction in the M65C02A instruction set.
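For reference, the bracketed cycle counts quoted for that update sequence can be tallied directly. The short Python snippet below uses only the numbers given in the text; the variable names are just for illustration.

```python
# Cycle counts as quoted in brackets for the SP-relative indirect update sequence
sequence = [
    ("CLC", 1),
    ("ADC.w (1,S)", 6),
    ("STA.w (1,S)", 6),
    ("ADJ #2", 2),
]

total_cycles = sum(cycles for _, cycles in sequence)
print(total_cycles)
```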
After some other work with the compiler and the assembler to speed up basic instruction sequences, I started considering what changes could be made to the instruction set of the M65C02A that would improve the situation with respect to a few bottlenecks that I noticed in the compiled code: (1) comparison operations, which are quite common, require a lot of additional clock cycles because the order of the operands on the evaluation stack is incorrect for the CMP instruction; (2) Pascal boolean operations require a lot of clock cycles to perform; (3) 8-bit SP-relative indirect addressing mode requires a second prefix instruction to implement.
The last of these three is not that important because the Pascal compiler doesn't use 8-bit constants, but it just seemed to me that the 8-bit
LDA (1,S) [7] instruction (
OSX, IND, LDA 1,X) should not be longer than the 16-bit
LDA.w (1,S) [7] instruction (
OIS, LDA 1,X). To make the 8-bit and 16-bit versions of this instruction the same length, another multi-flag prefix instruction would be needed:
OSI {
OSX +
IND}.
Comparisons occur quite frequently, and with the left to right expression parsing of the Pascal compiler, the operands for the CMP instruction are in the wrong order on the evaluation stack. This means that the
XMA.w 1,S [7] instruction (
OSZ, XMA 1,X) must be used to place the left operand in A and the right operand on the stack (after having first popped it into A). Only in this way will the subtraction operation in the CMP instruction yield the comparison results as written in the source code. It is possible to complement the branch instruction used to test the results of the comparison, but I decided that would obfuscate the generated code more than I was willing to allow.
A complete example of the sequence of instructions generated by the compiler for a REPEAT UNTIL construction (from the Sieve of Eratosthenes program) is as follows:
Code:
( 163) ; ; 30: UNTIL prime > limit;
( 164) 033C 4CEA02 ; jmp L_016
( 165) ; L_018
( 166) 033F ABAD180D ; lda.w prime_006
( 167) 0343 AB48 ; pha.w
( 168) 0345 ABAD160D ; lda.w limit_005
( 169) 0349 CB4401 ; xma.w 1,S
( 170) 034C CBD501 ; cmp.w 1,S
( 171) 034F C202 ; adj #2
( 172) 0351 08 ; php
( 173) 0352 A901 ; lda #1
( 174) 0354 28 ; plp
( 175) 0355 AB1002 ; bgt L_020
( 176) 0358 A900 ; lda #0
( 177) ; L_020
( 178) 035A ABC90100 ; cmp.w #1
( 179) 035E F003 ; beq L_012
( 180) 0360 4C8D02 ; jmp L_011
( 181) ; L_012
A couple of peephole optimizations that I added to the assembler transformed the previous code sequence into the following shorter / faster code sequence:
Code:
( 164) ; ; 30: UNTIL prime > limit;
( 165) 02CD 4C9A02 ; jmp L_016
( 166) ; L_018
( 167) 02D0 ABAD710C ; lda.w prime_006
( 171) 02D4 ABCD6F0C ; cmp.w limit_005
( 173) 02D8 AB1004 ; bgt L_020-2
( 174) 02DB A900 ; lda #0
( 175) 02DD 8002 ; bra L_020
( 177) 02DF A901 ; lda #1
( 178) ; L_020
( 180) 02E1 D003 ; bne L_012
( 181) 02E3 4C7002 ; jmp L_011
( 182) ; L_012
If the
bgt L_020-2 instruction left the boolean result in the accumulator, rather than using the code sequence in lines 174-177 (missing line numbers represent code removed by the peephole optimizer), then 6 cycles could be saved from the example loop termination code sequence.
The M65C02B instruction set adds a mode to the branch instructions by using the
OSX prefix instruction so that the branch instruction deposits into the accumulator the value of the internal CC flag, i.e., the Condition Code flag. The unconditional
BRA rel branch instruction is converted to a conditional branch instruction when
OSX is prefixed, and under those conditions, the instruction takes the branch if the accumulator is non-zero, else it does not. The above optimized code sequence would resemble the following:
Code:
( 164) ; ; 30: UNTIL prime > limit;
( 165) 02CD 4C9A02 ; jmp L_016
( 166) ; L_018
( 167) 02D0 ABAD710C ; lda.w prime_006
( 171) 02D4 ABCD6F0C ; cmp.w limit_005
( 173) 02D8 CB1000 ; bgt.s L_020
( 178) ; L_020
( 180) 02E1 CB8003 ; bra.s L_012
( 181) 02E3 4C7002 ; jmp L_011
( 182) ; L_012
(Note: if the range is longer than supported by an 8-bit offset, the range can be easily extended from 8 to 16 bits by using the
OIS prefix instruction.)
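The behaviour of the two OSX-prefixed branch forms in this loop termination test can be sketched in Python. This is a behavioural model only: the function names are hypothetical, and it assumes the deposited boolean is 1 for true and 0 for false, matching the lda #1 / lda #0 values in the earlier listings.

```python
def prefixed_cond_branch(cc):
    """OSX-prefixed conditional branch (e.g. bgt.s): deposit CC into A.

    Assumption: A receives 1 when the condition holds, else 0.
    """
    return 1 if cc else 0


def prefixed_bra_taken(a):
    """OSX-prefixed BRA: taken only when the accumulator is non-zero."""
    return a != 0


def until_exits(prime, limit):
    """Model of 'UNTIL prime > limit': True means the loop exits (L_012)."""
    cc = prime > limit             # cmp.w sets the condition tested by bgt
    a = prefixed_cond_branch(cc)   # bgt.s leaves the boolean in A
    return prefixed_bra_taken(a)   # bra.s L_012, else fall through to jmp L_011


print(until_exits(11, 10))
print(until_exits(10, 10))
```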
In the preceding example from the Sieve of Eratosthenes program, there are no subroutines, so all relevant variables are at level 0 and therefore use absolute addressing. The situation is a bit different if the loop variables are local variables within a procedure / function. In this case, the variables are loaded on the stack in the wrong order. This brings us to the changes that I've made to the instructions in Column 3 of the M65C02A instruction set. For these instructions, the order of the ALU operands is reversed. In other words, the normally left operand is treated as the right operand and vice versa. However, the destination is always the accumulator. For example, the definition of the standard subtraction instruction is A <= A + (~M) + C. For Column 3 instructions, subtraction is defined as A <= M + (~A) + C, where M is either the value read from the memory location addressed by the post-indexed (by Y) IP-indirect addressing mode, or the Y register: A <= [I+Y] + (~A) + C or A <= Y + (~A) + C. The following are the instructions of the M65C02B in Column 3:
Code:
0x03: NEG A // A <= 0 - A
0x13: ORA Y // A <= Y | A
0x23: CLR A // A <= 0
0x33: AND Y // A <= Y & A
0x43: CPL A // A <= ~A
0x53: EOR Y // A <= Y ^ A
0x63: ADC I,Y // M <= [I+Y]; A <= M + A + C
0x73: ADC Y // A <= Y + A + C
0x83: STA I,Y // [I+Y] <= A
0x93: STA I // [I] <= A
0xA3: LDA I,Y // A <= [I+Y]
0xB3: LDA I // A <= [I]
0xC3: CMP I,Y // M <= [I+Y]; (SIZ ? {N,V,Z,C} : {N,Z,C}) <= M - A
0xD3: CMP Y // (SIZ ? {N,V,Z,C} : {N,Z,C}) <= Y - A
0xE3: SBC I,Y // M <= [I+Y]; A <= M + ~A + C
0xF3: SBC Y // A <= Y + ~A + C
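The difference between the standard and the reversed-operand subtraction can be checked with a small Python model of the two definitions. It is a sketch that assumes 16-bit operands and ignores the status flags; the function names are made up for illustration.

```python
MASK16 = 0xFFFF


def sbc_standard(a, m, c):
    """Standard definition: A <= A + (~M) + C."""
    return (a + (~m & MASK16) + c) & MASK16


def sbc_reversed(a, m, c):
    """Column 3 definition: A <= M + (~A) + C (operands swapped)."""
    return (m + (~a & MASK16) + c) & MASK16


# Left operand 10, right operand 3, carry set (no borrow).
print(sbc_standard(10, 3, 1))   # A holds the left operand
print(sbc_reversed(3, 10, 1))   # A holds the right operand, M the left
```

With the carry set, the reversed form yields M - A, so the right operand can stay in the accumulator and no exchange with memory is needed.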
The following are the four instructions added to the M65C02B in Column 2:
Code:
0x22: JSR I,Y
0x42: OSI
0x62: OWI
0x82: JMP I,Y
The
OSI instruction rounds out the prefix instructions, and
OWI adds another prefix instruction which allows W to override IP when the IP register-indirect addressing mode is used. The
JSR I,Y and
JMP I,Y instructions were just added to see if they would be useful in some circumstances not currently supported by the compiler.
JSR I,Y can certainly be useful if support for indirect function calls, or function pointers, were added to the compiler.
Edit: modified the example; the original was incorrect. The given example is not as effective in communicating the advantage of using the OSX prefix with the BRA rel8 instruction.