The 65T2.
A while back, as a thought exercise, I worked on a 6502 alternative that is friendlier to high-level languages, which I called the 65T2.
The 65T2 has the following objectives:
• An 8-bit 6502 type architecture.
• The same MOS transistor and resource budget as a 6502 (or 65C02).
• Better overall performance than a 6502.
• Better suitability for compiled code than a 6502.
It achieves this in the following ways:
• The same set of registers as a 6502 and nearly the same set of core instructions.
• More orthogonal addressing modes reduce decoding.
• BCD support is omitted, reducing complexity.
• More instructions are available across more addressing modes meaning less copying is required. Indexed-stack addressing increases performance in many situations.
• The substitution of zero-page by indexed-stack addressing modes enables better suitability for compiled languages than a 6502. This is also supported by two 16-bit parameter copying instructions (lea and stw) which operate on X and A as a pair of registers (X is the high byte).
Zero-Page Background
The most controversial part of the 65T2 is the substitution of zero-page addressing by indexed stack addressing. Why was this done? The insight is that the zero-page addressing mode is more an artifact that hinders the 6502 than a benefit.
Firstly, it's an artifact originally from the 6800, which in turn is taken from typical minicomputer architectures from the 1960s. These architectures had fixed-length instructions which meant they didn't have enough addressing bits to represent the entire address space in an instruction. Yet, computers need to be able to address absolute, common locations across the whole address space. The solution was to implement a zero-page addressing mode (and zero-page indirect addressing modes).
The 6800 supposedly has zero-page addressing to increase code density, but really it's because the architecture resembles an early minicomputer such as the PDP-8 or HP 2100. Those machines also implemented a zero page, because they had a fixed instruction width containing a base address (which with no modifications is a zero-page address). In addition, because they were designed before recursive languages became popular, they had no stack pointer.
My thinking is that the 6502, as the 6800's cousin, adopted a number of conflicting design features because it appeared just at the point where these languages were starting to be considered: zero-page addressing, byte-oriented instructions, a stack pointer for subroutines and multiple addressing modes.
More problematically, the 6502 incorporated architectural features that were restrictive even by 1970s microprocessor standards, namely the lack of registers and an inability to use any registers to index the whole of memory (contemporary CPUs such as the 8080, the RCA 1802 and the SC/MP all had at least 8 x 8-bit registers which could be paired up). However, the 6502 was redeemed in a big way by the availability of its zero-page indirect indexed addressing mode which enabled the CPU to treat a pair of zero-page locations as a pointer and then index it by its y register.
Nevertheless, the 6502 is a better CPU than the 6800: it's faster, smaller and cheaper by design.
Zero-Page Problem
Redesigning the 6502 to make it more suitable for high-level languages is problematic. The most important criterion is support for an indexed stack addressing mode. The 65816 achieves this in part by adding another 16 or so addressing modes, including indexed stack addressing and also indirect indexed stack addressing, which high-level languages need in order to copy parameters and access memory via pointers within a stack frame.
The problem here though is that there's a chronic lack of instruction space for the extra addressing modes: 256/24 => 10 possible instructions which could use them. As a result few instructions actually support the additional addressing modes. In fact the 6502 itself suffers from the same problem as many instructions are available only with limited addressing modes.
Any attempt to redesign the 6502 would also suffer from the same issue and invariably lead to a 65816 type approach.
Solution
The insight is to realize that (a) zero-page addressing is actually a problem but that (b) indirect indexed addressing is actually a good thing.
This solution therefore dumps all the zero-page addressing modes, leaving room for a number of stack addressing modes. Stack indexed, indirect indexed addressing modes can then be supported. One further observation is that the new addressing modes can be seen as a set of 4 addressing modes and their indirect corollaries.
Code:
Direct modes: #n [Really [PC]++], s+n, X+Abs, Y+Abs.
Indirect modes: Abs [Really [[PC]++]], *s+n, *s+n,x, *s+n,y.
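To make the "4 direct modes plus 4 indirect counterparts" idea concrete, here is a rough Python sketch of how the effective address could be computed for each mode. It is only a model under assumptions that the post does not spell out: the function name, the mode strings and the register dictionary are made up for illustration, pointers are taken to be little-endian, and S is taken to address memory 0x000..0x1FF directly.

```python
def effective_address(mode, operand, regs, mem):
    """Return the address accessed under one of the eight 65T2 modes.

    regs: dict with 'pc' (address of the operand byte), 's', 'x', 'y'.
    mem:  indexable memory image (e.g. a bytearray of 64 KiB).
    All names here are hypothetical; this is a model, not the hardware.
    """
    def word(addr):
        # Little-endian 16-bit read (assumed, as on the 6502).
        return mem[addr] | (mem[addr + 1] << 8)

    s, x, y = regs["s"], regs["x"], regs["y"]
    stack_slot = (s + operand) & 0x1FF  # assumption: 9-bit stack wraps

    # Direct modes
    if mode == "#n":
        return regs["pc"]               # the operand byte itself: [PC]++
    if mode == "s+n":
        return stack_slot
    if mode == "x+abs":
        return (operand + x) & 0xFFFF
    if mode == "y+abs":
        return (operand + y) & 0xFFFF
    # Indirect counterparts
    if mode == "abs":
        return operand & 0xFFFF         # [[PC]++]: operand is the address
    if mode == "*s+n":
        return word(stack_slot)
    if mode == "*s+n,x":
        return (word(stack_slot) + x) & 0xFFFF
    if mode == "*s+n,y":
        return (word(stack_slot) + y) & 0xFFFF
    raise ValueError(f"unknown mode {mode!r}")
```

Seen this way, each indirect mode is the matching direct mode with one extra memory fetch, which is the symmetry described above.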
Two final steps are to support additional copying operations for 16-bit values and supporting an actual add instruction.
16-bit Copying. The 65T2 supports 16-bit copies by using X:A as a register pair. You can copy an effective address to X:A (and thanks to the indirect modes you can copy a 16-bit memory location in some circumstances). You can store a 16-bit value in several ways.
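As an illustration of the X:A pairing, here is a minimal Python model of what lea and stw do with a 16-bit value. The function signatures are invented for the sketch, and the little-endian store order (A at the lower address, X above it) is an assumption, although it matches the way the example code later in this post reads s+0 as the low byte and s+1 as the high byte.

```python
def lea(address):
    """Model of lea: split a 16-bit effective address into (x, a).

    X receives the high byte and A the low byte, as described above.
    """
    return (address >> 8) & 0xFF, address & 0xFF


def stw(mem, address, x, a):
    """Model of stw: store the X:A pair as a 16-bit value.

    Assumption: low byte (A) first, high byte (X) second.
    """
    mem[address] = a
    mem[address + 1] = x


# Usage: copy the constant 0x1234 into a two-byte slot at 0x0100.
memory = bytearray(0x200)
x, a = lea(0x1234)
stw(memory, 0x0100, x, a)
```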
Add. Instruction set frequencies reveal that add is a very common operation, but the 6502 has no add-without-carry instruction. With add available, the frequent 8-bit add sequence drops from clc/lda/adc/sta = 11 cycles to lda/add/sta = 9 cycles, an improvement of about 18%. The 65T2 provides this instruction. As well as this, because of the design of the 6502 carry flag, sub #n works in the same way as add #-n (similarly, adc #n works in the same way as sbc #~n, the one's complement of n).
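The carry-flag point can be checked with a small 8-bit ALU model. This is a sketch in Python, not 65T2 code: adc and sbc follow the 6502's binary-mode behaviour (sbc adds the one's complement of the operand plus carry), while add and the hypothetical sub helper are modelled here as adc with carry forced clear and sbc with carry forced set.

```python
def adc(a, m, carry):
    """6502-style add with carry (binary mode): returns (result, carry_out)."""
    total = a + m + carry
    return total & 0xFF, total >> 8


def sbc(a, m, carry):
    """6502-style subtract: add the one's complement of m plus carry."""
    return adc(a, m ^ 0xFF, carry)


def add(a, m):
    """Model of the 65T2 add: adc with the carry input forced to 0."""
    return adc(a, m, 0)


def sub(a, m):
    """Hypothetical sub: sbc with carry set (no borrow) on entry."""
    return sbc(a, m, 1)


def same_result(n):
    """True if sub #n and add #-n give the same byte for every A."""
    return all(sub(a, n)[0] == add(a, -n & 0xFF)[0] for a in range(256))
```

In this model the result byte always matches, and the carry out matches too except for n = 0, where sub #0 leaves carry set and add #0 leaves it clear.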
Other Differences
There are a few other differences between the 6502 and the 65T2. Branch instructions include blt and bge for signed branches at the loss of bvc and bvs.
There is an unconditional branch instruction (bra), a jump subroutine indirect instruction (jsi) and a jump indirect instruction (jpi). Only four flag operations are supported: clc, sec and a general purpose clp and sep instruction pair, each of which is followed by an immediate. Register transfers are limited to a->reg and reg->a; thus tsx and txs are no longer possible, but tfa s and tfr s, which move values between a and s, are.
Absolute Stack Pointer Management
The 65T2 implements a 9-bit stack. To read the stack address you would execute: lea s+0; which puts the high byte into x and the low byte into a.
On the 65T2, the tfa s instruction copies A to bits 0..7 of S and bit 0 of X to bit 8 of S.
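Here is a small Python sketch of the two directions of stack pointer access described above. The function names are made up, and it assumes the 9-bit S is simply the address of the top of stack in the range 0x000..0x1FF, so that lea s+0 returns S itself split across X and A.

```python
def lea_s0(s):
    """Model of 'lea s+0': return (x, a) with X = high byte, A = low byte."""
    return (s >> 8) & 0x01, s & 0xFF


def tfa_s(x, a):
    """Model of 'tfa s': A goes to bits 0..7 of S, bit 0 of X to bit 8."""
    return ((x & 0x01) << 8) | (a & 0xFF)


# Usage: read S, then write it back unchanged.
x, a = lea_s0(0x1F0)
restored = tfa_s(x, a)
```

Only bit 0 of X matters on the way back in, so any stale upper bits in X are ignored in this model.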
Assembler Syntax
The assembler syntax for the 65T2 has been designed for simple parsing. An instruction format is:
[:Label] [(operation | directive) operand] [;Comment]
Labels are preceded by ':' so that the assembler knows a label is coming before the label text itself. Local labels are just numbers and the definition of a local label resets labels whose numbers are >= itself. For example:
Code:
:example bra 10
:5 inc a
:10 rts
:20 jsr ex2
:ex2 dec a ; Here 10 and 20 still refer to the :10 and :20 above.
:10 bne 20 ; 20 is the :20 below (because defining :10 resets locals >= 10).
:20 bcc 5 ; 5 is the :5 above (because defining :10 doesn't reset :5).
bra 10 ; 10 is the :10 after :ex2 (because it reset locals >= 10).
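The local label rule can be sketched as a small resolver in Python. This is my reading of the example rather than a definitive algorithm: a reference resolves to the most recent definition that is still live, otherwise to the next definition further down, and a label on a line is defined before that line's operand is resolved. The tuple format and function name are invented for the sketch.

```python
def resolve_locals(lines):
    """Resolve numeric local-label references.

    lines: list of (label, ref) pairs, one per source line. label is an
           int for a local label, a str for a global label, or None.
           ref is an int when the operand names a local label, else None.
    Returns {line_index: index_of_the_line_defining_the_target}.
    """
    # Pass 1: remember where every local label is defined.
    definitions = {}
    for index, (label, _) in enumerate(lines):
        if isinstance(label, int):
            definitions.setdefault(label, []).append(index)

    # Pass 2: walk the source, tracking which locals are still live.
    live, resolved = {}, {}
    for index, (label, ref) in enumerate(lines):
        if isinstance(label, int):
            # Defining N resets every local whose number is >= N.
            for number in [n for n in live if n >= label]:
                del live[number]
            live[label] = index
        if isinstance(ref, int):
            if ref in live:
                resolved[index] = live[ref]          # backward reference
            else:
                later = [d for d in definitions.get(ref, []) if d > index]
                if not later:
                    raise ValueError(f"undefined local label {ref}")
                resolved[index] = later[0]           # forward reference
    return resolved


# The example above, one tuple per line (global-label operands omitted).
program = [
    ("example", 10), (5, None), (10, None), (20, None),
    ("ex2", None), (10, 20), (20, 5), (None, 10),
]
targets = resolve_locals(program)
```

Note that global labels such as :ex2 do not reset anything in this model, which matches the comment on that line.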
Every operation with a unique base opcode has a unique mnemonic.
Every operand has an unambiguous format: one of the following:
• No operand.
• A register operand: X, Y, A, S or P.
• An 8-bit branch operand.
• A 16-bit absolute address.
• An effective address.
All effective addresses are decoded as [prefix] value [index] where prefix is one of "#" , "s+", "x+" , "y+" , "*s+" and index is one of ",x" or ",y". Avoiding '(' and ')' as part of an addressing mode means that there's never any ambiguity between memory indirection and expressions. At a simplistic level the default addressing mode (absolute addressing) translates to 4 and prefixes translate as: "#"=>-4, "s+"=>-3, "x+"=>-2, "y+"=> -1, "*s+"=>5, ",x"=>1 and ",y"=>2.
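As a sketch of how simple the parsing can be, here is one way to turn an operand string into a mode number in Python. It rests on an interpretation: I read the numbers above as yielding modes 0..7 in the order #n, s+, x+, y+, absolute, *s+, *s+ with ,x and *s+ with ,y, so "*s+" is treated as selecting mode 5 outright and the index suffixes are only applied to the "*s+" form. The function name is hypothetical, and the optional '@' prefix for absolute is taken from the assembler syntax line further down.

```python
def addressing_mode(operand):
    """Map an effective-address operand string to a mode number 0..7."""
    text = operand.replace(" ", "").lower()
    mode = 4                                   # default: absolute
    if text.startswith("*s+"):
        mode = 5                               # indirect through the stack
        if text.endswith(",x"):
            mode += 1
        elif text.endswith(",y"):
            mode += 2
        return mode
    if text.startswith("@"):
        return mode                            # explicit absolute
    for prefix, offset in (("#", -4), ("s+", -3), ("x+", -2), ("y+", -1)):
        if text.startswith(prefix):
            return mode + offset
    return mode
```

Because no mode uses parentheses, the value part can be handed to the expression evaluator unchanged once the prefix and suffix are stripped.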
In operands, the current compilation address is in '.'.
Directives always start with a '.'. Standard directives are: ".org" , ".equ" , ".db", ".dw". A macro assembler would support ".mac" and ".end".
The 65T2 has the same transistor budget as the 6502, or a lower one.
Pinout: A15..A0 (16)
D7..D0 (8)
R/W (1)
Clk (2)
Vcc/Gnd (2)
Bus (2): 11=ACK, 10=Ins, 01=Data, 00=Internal/Wait.
IRQ (1)
Total: 32.
Fewer pins also means a cheaper package (32 pins is 20% fewer than the 6502's 40).
Interrupt Cycle:
1. Assert IRQ by peripheral for >=1 clk (it will be latched).
2. On the next Ins cycle the CPU sets A15..A2 high (Bus=10, Clk=0). The peripheral asserts its vector on A15..A2 (pull-down). Vector 0xfffc is the default IRQ; 0xfff8 is Reset.
3. The CPU sets F<4> and performs an ACK cycle (Bus=11, Clk=0). The peripheral should deassert its vector.
Thus, a simple peripheral (without ACK) can assert an IRQ by edge, or we could have a simple bit of logic which recognizes Bus=11, Clk=0 as CLR IRQ (flip-flop with simple decoder). Or we could have a more complex Interrupt generator.
Reset is an IRQ: hold IRQ=0 and bit 2=0; on release, the CPU comes out of reset.
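The vector selection can be modelled in a few lines of Python. This is a sketch under assumptions: the bus is treated as wired-AND (the CPU drives A15..A2 high and any peripheral may pull individual lines low), A1..A0 are taken to read as 0, and the function name is invented.

```python
DEFAULT_VECTOR = 0xFFFC  # A15..A2 all high, A1..A0 low


def interrupt_vector(pulled_low_lines=()):
    """Return the vector address seen by the CPU.

    pulled_low_lines: address line numbers (2..15) that a peripheral
    pulls down during the Ins cycle. No lines pulled gives the default.
    """
    vector = DEFAULT_VECTOR
    for line in pulled_low_lines:
        if not 2 <= line <= 15:
            raise ValueError("only A15..A2 carry the vector")
        vector &= ~(1 << line)
    return vector & 0xFFFF
```

With nothing pulled down the CPU sees the default IRQ vector, and pulling only A2 low gives the Reset vector, which is why reset is described above as IRQ with bit 2 = 0.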
Regs And Instruction Set
****************************
Regs: A S X Y all 8-bit. Flags: ??,x,H,i,v,n,c,z
nvcz flags are normal. i is interrupt flag. H is halt flag: the 65T2 will halt after each instruction if H is set. x is reserved for the future 16-bit mode flag for the 65T2 successor, where A=16-bits, X=16-bits, Y=16-bits. [??] is undefined (should be 0).
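For reference, here is the flag layout as a Python sketch, together with a model of the sep and clp pair. Two things are assumed rather than stated: the flags are listed from bit 7 down to bit 0 (which agrees with the interrupt sequence setting F<4>, the i flag), and the immediate that follows sep or clp is a bit mask. The helper names are made up.

```python
# Bit positions, assuming the list '??,x,H,i,v,n,c,z' runs from bit 7 to bit 0.
FLAG_BITS = {"z": 0, "c": 1, "n": 2, "v": 3, "i": 4, "H": 5, "x": 6}


def flag_mask(*names):
    """Build an immediate mask from flag names."""
    mask = 0
    for name in names:
        mask |= 1 << FLAG_BITS[name]
    return mask


def sep(p, mask):
    """Model of 'sep #mask': set the selected flags."""
    return (p | mask) & 0xFF


def clp(p, mask):
    """Model of 'clp #mask': clear the selected flags."""
    return p & ~mask & 0xFF
```

Under these assumptions, sec and clc are just the single-byte shortcuts for sep and clp with the c mask.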
Code:
<mon> modes that operate directly on memory locations (or A):
mode:         A    s+d  x+Abs  y+Abs  @nn  *s+d  *s+d,x  *s+d,y
Extra clocks: -2   +0   +1     +1     +3   +3    +3      +3
<ea> modes that have #n or memory location operands:
mode:         #n   s+d  x+Abs  y+Abs  @nn  *s+d  *s+d,x  *s+d,y
Extra clocks: 0    1    2      2      3    3     3       3
Code:
Assembler syntax: 'a' | '*s+' n [ ',x' | ',y'] | ['s+' | '@' | 'x+' | 'y+' ] n
00-3F <mon> ops: inc, dec, rol, ror, lsl, lsr, asr, bit [4c each]
40-7F <ea> ops:  adc, sbc, and, ora, eor, lda, sta, cmp [2c each]
80-BF Quad3:     ldx, stx, cpx, ads, ldy, sty, cpy, add [2c each]
Quad4:
C0 lea:     ??  sd  xa  ya  Abs *sd *sdx *sdy
C8 stw:     ??  sd  xa  ya  Abs *sd *sdx *sdy
D0 bcc:     beq bne bcc bcs bmi bpl blt bge [2 or 3c]
D8 jmp:     jmp jsr bra rts jpi jsi swi rti
E0 Flags:   sep clp sec clc ??  ??  ??  ??
E8 X/Y:     inx iny ??  ??  dex dey ??  ??
F0 TFA/TFR: x   y   s   p   x   y   s   p
F8 psh/pul: x   y   a   p   x   y   a   p
Ins count: 8+8+8+2+8+8+4+4+2+2 = 54.
There are 10 free opcodes: $C0, $C8, $E4 to $E7, $EA, $EB, $EE and $EF. 65T16 candidate extensions are: LNG (the address is 24 bits, which causes every Abs ref or *s ref to fetch 3 bytes), OPB (ALU byte extension in word mode), MVB (block move up), MVN (block move down) and MUL.
EA: (bit7 ~& bit6) | (bit5 & bit4)
MonEA: ~bit7 & ~bit6 & ~bit2 & ~bit1 & ~bit0
dd' : ~bit1 & bit0 | bit2 & ~bit1 & ~bit0
nn : EA & dd'
n : EA & ~dd' & ~MonEA
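To show how regular the first three quadrants are, here is a rough Python decoder for opcodes $00 to $BF that returns the mnemonic, the mode and a cycle count. It is a sketch built on assumptions: bits 5..3 are taken to select the operation in the order listed, bits 2..0 to select the mode in table order (the lea and stw rows suggest this layout), the cycle count is the base figure plus the "extra clocks" entry, and Quad3 is assumed to use the same modes and timing as the ea group.

```python
MON_OPS = ["inc", "dec", "rol", "ror", "lsl", "lsr", "asr", "bit"]
EA_OPS = ["adc", "sbc", "and", "ora", "eor", "lda", "sta", "cmp"]
QUAD3_OPS = ["ldx", "stx", "cpx", "ads", "ldy", "sty", "cpy", "add"]

# (mode name, extra clocks), indexed by the low three opcode bits.
MON_MODES = [("A", -2), ("s+d", 0), ("x+Abs", 1), ("y+Abs", 1),
             ("@nn", 3), ("*s+d", 3), ("*s+d,x", 3), ("*s+d,y", 3)]
EA_MODES = [("#n", 0), ("s+d", 1), ("x+Abs", 2), ("y+Abs", 2),
            ("@nn", 3), ("*s+d", 3), ("*s+d,x", 3), ("*s+d,y", 3)]


def decode(opcode):
    """Return (mnemonic, mode, cycles) for an opcode in $00..$BF."""
    if not 0x00 <= opcode <= 0xBF:
        raise ValueError("only quadrants 0 to 2 are modelled here")
    quadrant = opcode >> 6
    operation = (opcode >> 3) & 7
    mode = opcode & 7
    if quadrant == 0:
        ops, modes, base = MON_OPS, MON_MODES, 4
    elif quadrant == 1:
        ops, modes, base = EA_OPS, EA_MODES, 2
    else:
        ops, modes, base = QUAD3_OPS, EA_MODES, 2  # assumption for Quad3
    name, extra = modes[mode]
    return ops[operation], name, base + extra
```

For example, under this layout $68 would be lda #n at 2 cycles and $41 would be adc s+d at 3 cycles.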
Example code sequences.
Code:
uint8_t Sum(uint8_t a, uint8_t b) { return a+b; }
ads #-4
lda s+5
add s+6
sta s+5
ads #4
rts
BlockMove: ;ret, (2,s)=src, (4,s)=dst, (6,s)=len
ldy s+6 ;low byte of len.
beq BlockMove20
inc s+7
bra BlockMove20
BlockMove10:
lda *s+2,y
sta *s+4,y
BlockMove20:
dey
bne BlockMove10
dec s+7
bne BlockMove10
lda *s+2,y
sta *s+4,y
rts
To call Block Move:
ads #-6
lea kLen
stw s+4
lea kDst
stw s+2
lea kSrc
stw s+0
jsr BlockMove
ads #6 ;deallocate.
...
65T2 Forth
Here, y=return stack pointer and s=data stack pointer. gRStack is the stack pointer base. gIP is the instruction pointer. It's a direct threaded Forth. Because the 65T2 doesn't support zero page, gRStack and gIP are 16-bit addresses. Nevertheless, it's easy to make it as fast as or faster than the equivalent 6502 version.
Code:
Next:
lda #2 ;gIP is pre-incremented on 65T2 Forth.
Next2:
add gIP
sta gIP
bcs Next10
jpi gIP ;17c, 18b.
Next10:
inc gIP+1
jpi gIP ;23c, 18b.
Enter:
iny
iny ;Pre-increment to make space on the return stack.
lda gIP
ldx gIP+1
stw gRStack,y ;push old IP
lea *s+0 ;get return address from JSR (already inc'd).
stw gIP ;store in IP
ads #2 ;pop rts address from machine stack.
jpi gIP ;jump to next ins: 34c [59KIPs]
Exit:
lda gRStack,y
ldx gRStack+1,y ;8c
dey
dey
add #2 ;
bcc Exit10
inx
Exit10:
stw gIP
jpi gIP ;27c/28c [74KIPs]
GetR:
lda gRStack,y
ldx gRStack+1,y
ads #-2 ;allocate space on datastack.
stw s+0
jmp Next ;17+19 = 38c = 53KIPs.
DoAdd:
lda s+0
add s+2
sta s+2
lda s+1
adc s+3
sta s+3
ads #2 ;pop data stack.
jmp Next ;23c+17c=42c, 46KIPs.
;Same is true for OR, XOR.
Lit:
ldx gIP+1
lda #2
add gIP
bcc Lit10
inx
Lit10:
stw gIP
ads #-2 ;allocate 2 bytes
stw s+0
lda *s+0
ldx #1
ldx *s+0,x
stw s+0 ;
jmp Next ;
Dup:
ads #-2
lea *s+2
stw s+0
jmp Next ;13c
TwoOver: ;al ah bl bh - al ah bl bh al ah
ads #-4
lea *s+4
stw s+0
lea *s+6
stw s+2
jmp Next ;21c.
Swap:
ads #-2
lea *s+4
stw s+0
lea *s+2
stw s+4
lea *s+0
stw s+2
ads #2
jmp Next ;
Loop: ;s+0=I, s+2=I'
inc s+0 ;5c
bne Loop10 ;3c.
inc s+1 ;inc hi byte.
Loop10:
lda s+0 ;3c
cmp s+2 ;3c
bne Branch ;3c Total 17c. 39c total, 51KLoops/s.
lda s+1
cmp s+3 ;hi byte of I'.
bne Branch
lda #4 ;Next2 adds A to gIP.
jmp Next2 ;skip jump target.
Branch:
lda *s+0
ldy #1
ldx *s+0,y ;x:a=Branch target
stw gIP
jpi gIP ;Execute. 22c. 91KIPs.
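The KIPs figures in the comments above can be reproduced from the cycle counts with a one-line calculation. The post never states the clock speed; the numbers are consistent with 2 MHz, so the sketch below treats that as an assumption, and the function name is made up.

```python
def kips(cycles, clock_hz=2_000_000):
    """Thousands of Forth primitives per second for a given cycle count.

    clock_hz defaults to 2 MHz, an assumed figure that reproduces the
    throughput numbers quoted in the comments.
    """
    return round(clock_hz / cycles / 1000)


# Usage: Enter at 34 cycles, Exit at 27 cycles, Branch at 22 cycles.
enter_kips = kips(34)
exit_kips = kips(27)
branch_kips = kips(22)
```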