A Hypothetical C Friendly 6502

load81 · Post by **load81** » Sun Nov 15, 2020 1:42 pm

The 6502 isn't exactly C friendly. Neither is the Z80, really. Both chips have C compilers that target their architecture. Some of them are native compilers, like Power C on the 6502. Others, like cc65, are cross-compilers. So it is possible to target the 6502; it's just that the processor requires quite a few trade-offs to make it work.

I read somewhere that several C compilers implement a tiny Forth-like machine using part of the 6502's 8-bit stack and a few bytes of zero page memory as virtual registers for the stack machine. Ok, that eases the pain a bit.

But, if you were in charge of a project to make a C friendly 6502 what changes would you make? Ideally, your changes would still be reasonably backwards compatible.

The only things I can think of off the top of my head are:

• Implement "direct page" just like some of the WDC variants so that zero page can be relocated.
• INC and DEC can operate on .A, not just on the .X and .Y registers. (Also borrowed from WDC.)
• Allow the stack to be relocated similarly. (This would help with multitasking too.)
• A long BRA instruction that can take a 16-bit memory address.

That would ease a lot of pain. But, I'm not sure how best to make the 6502 work better with C-style structs. A new mode of memory addressing is probably at least part of the answer, but I'm fuzzy on the details. This feels like a good start though.

What do you think?

Martin_H · Post by **Martin_H** » Sun Nov 15, 2020 5:56 pm

To be efficient C needs a large stack that supports stack pointer relative addressing. But if you have a small call depth, the existing stack could work if it supported SP relative addressing.

Register linkage using part of page zero as pseudo registers would work, but you will need a software stack to push those "registers" for the next call. One of those page zero words could also be used for return values which C uses extensively.

The 6502 is already set up well to handle structures via its various indirect addressing modes.
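
As a rough illustration of that point, here is a small Python model of the 6502's (zp),Y addressing mode used to read a struct field. The pointer location `PTR`, the struct address and the field offsets are all made up for the example; only the addressing rule itself comes from the 6502.

```python
# Minimal model of 6502 "LDA (zp),Y": a 16-bit pointer in zero page plus Y.
# The struct layout and all addresses below are hypothetical.

def lda_indirect_y(mem, zp, y):
    """Return the byte at (little-endian pointer stored at zp, zp+1) + Y."""
    base = mem[zp] | (mem[(zp + 1) & 0xFF] << 8)
    return mem[(base + y) & 0xFFFF]

mem = bytearray(0x10000)
PTR = 0x20                            # zero-page pointer (assumed location)
STRUCT = 0x3000                       # one struct instance (assumed address)
OFF_X, OFF_Y, OFF_FLAGS = 0, 1, 2     # field offsets, like C's offsetof

mem[STRUCT:STRUCT + 3] = bytes([10, 20, 0x80])
mem[PTR], mem[PTR + 1] = STRUCT & 0xFF, STRUCT >> 8

# Equivalent of `p->flags` in C: LDY #OFF_FLAGS / LDA (PTR),Y
flags = lda_indirect_y(mem, PTR, OFF_FLAGS)
```

Because Y is 8 bits wide, any field within the first 256 bytes of the struct is reachable without touching the pointer.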

barrym95838 · Post by **barrym95838** » Sun Nov 15, 2020 6:34 pm

I have been known to march to the beat of a different drummer, but I would borrow a hint from the 6809 and add a B and a U register. All of the registers (including A) should have first-class index functionality, like X and Y have now. There's not enough opcode real estate in 8-bits, so one solution is to widen the "bytes" to 9 or 10-bits, preferably ten. I know, that's a bit kinky, but it would eliminate the need for clumsy 2-byte opcodes and ease some of the register pressure and claustrophobic address space for "larger" programs (1KB for stack, 1MB total). Would a "backward binary 'C02-compatible" mode be possible? Probably, but I don't know how kludgy it would look without further analysis.

Purely hypothetical, right?

BigEd · Post by **BigEd** » Sun Nov 15, 2020 7:18 pm

It might be worth adding an instruction cache and then feeling free to add some prefix or suffix bytes to access a larger instruction set.

BigDumbDinosaur · Post by **BigDumbDinosaur** » Sun Nov 15, 2020 7:26 pm

Changes to the 6502 just to accommodate a C compiler? I wouldn't even consider it. I'd use the 65C816.

BillG · Post by **BillG** » Sun Nov 15, 2020 7:34 pm

barrym95838 wrote:

I have been known to march to the beat of a different drummer, but I would borrow a hint from the 6809 and add a B and a U register.

Many of the added 6809 instructions use a prefix byte to escape the 256 opcode limit. The problem is that fetching the prefix consumes an additional clock cycle. The Z80 has the same problem.

barrym95838 wrote:

All of the registers (including A) should have first-class index functionality, like X and Y have now. There's not enough opcode real estate in 8-bits, so one solution is to widen the "bytes" to 9 or 10-bits, preferably ten. I know, that's a bit kinky, but it would eliminate the need for clumsy 2-byte opcodes and ease some of the register pressure and claustrophobic address space for "larger" programs (1KB for stack, 1MB total). Would a "backward binary 'C02-compatible" mode be possible? Probably, but I don't know how kludgy it would look without further analysis.

Purely hypothetical, right?

In an "anything goes" world, you can choose to have memory "bytes" of 10 bits. Ordinarily, your CPU fetches all 10 bits while reading or writing only the lower 8 bits, your registers remain 8 bits (keeping the 256 byte stack size limit) and addresses ignore the upper bits remaining at 16 bits; writing memory automagically clears the upper two new bits. This is how backward compatibility is easily achieved.

You may choose to expand the "zero" page to 1024 bytes without destroying backward compatibility.

Special instructions allow writing the upper bits to be used by the program loader for the new instructions. Classic 8-bit programs continue to be loaded as before. The operating system only needs to be aware of the upper bits when loading new style programs. The debugger needs access to read the upper bits for disassembling code and managing breakpoints.

The only other constituency inconvenienced is self-modifying code.
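
A minimal sketch of that scheme, assuming 10-bit cells and a hypothetical loader-only operation (`write_upper`) for the two extra bits. The class and method names are invented for illustration.

```python
class TenBitMemory:
    """Memory of 10-bit cells; classic 8-bit code only sees the low 8 bits."""

    def __init__(self, size=0x10000):
        self.cells = [0] * size

    def write8(self, addr, value):
        # A classic store keeps 8 bits and clears the two new upper bits.
        self.cells[addr] = value & 0xFF

    def write_upper(self, addr, upper):
        # Hypothetical special instruction used by the program loader.
        self.cells[addr] = (self.cells[addr] & 0xFF) | ((upper & 0x3) << 8)

    def read8(self, addr):
        # Data reads ignore the upper bits.
        return self.cells[addr] & 0xFF

    def fetch_opcode(self, addr):
        # Instruction fetch sees all 10 bits.
        return self.cells[addr] & 0x3FF


mem = TenBitMemory()
mem.write8(0x0200, 0xA9)         # old-style byte
mem.write_upper(0x0200, 0b10)    # loader marks it as a new-style opcode
extended = mem.fetch_opcode(0x0200)
mem.write8(0x0200, 0xA9)         # self-modifying code wipes the upper bits
legacy = mem.fetch_opcode(0x0200)
```

The last two lines show why self-modifying code is the inconvenienced party: an ordinary store silently turns an extended opcode back into a classic one.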

cjs · Post by **cjs** » Mon Nov 16, 2020 4:39 am

BillG wrote:

Many of the added 6809 instructions use a prefix byte to escape the 256 opcode limit. The problem is that fetching the prefix consumes an additional clock cycle. The Z80 has the same problem.

While that does slow some things down a bit, I don't think it's as bad as the alternatives, so long as you allocate the lesser-used instructions (or instructions that the developer can reasonably choose to use less) to the prefixes. For example, on the 6809 CMPX requires no prefix, though CMPY does, so you can try to arrange your code to have the pointers that need comparisons in X rather than Y.

On the 6800, I'd happily have the instructions indexing off of X plus a non-zero fixed offset (LDAA 1,X) use a prefix byte if the zero-offset version (LDAA 0,X or LDAA ,X) could be non-prefixed and drop the offset byte entirely, saving a byte and a cycle (or maybe even more than one cycle! since no add need be done) on by far the most common indexing operation.

Quote:

In an "anything goes" world, you can choose to have memory "bytes" of 10 bits. Ordinarily, your CPU fetches all 10 bits while reading or writing only the lower 8 bits, your registers remain 8 bits (keeping the 256 byte stack size limit) and addresses ignore the upper bits remaining at 16 bits; writing memory automagically clears the upper two new bits. This is how backward compatibility is easily achieved.

That doesn't strike me as "easy" when trying to use programs with the new instructions since, as you point out, you now need special instructions for loading programs into memory. How do you handle storage of these programs on disk, for example?

BigDumbDinosaur wrote:

Changes to the 6502 just to accommodate a C compiler? I wouldn't even consider it. I'd use the 65C816.

That actually sounds like a pretty decent option!

GARTHWILSON · Post by **GARTHWILSON** » Mon Nov 16, 2020 5:26 am

cjs wrote:

While that does slow some things down a bit, I don't think it's as bad as the alternatives, so long as you allocate the lesser-used instructions (or instructions that the developer can reasonably choose to use less) to the prefixes. For example, on the 6809 CMPX requires no prefix, though CMPY does, so you can try to arrange your code to have the pointers that need comparisons in X rather than Y.

Wouldn't it still make instruction decoding more complex, reducing the maximum operating speed? The 6809 never reached clock & bus speeds anywhere near as high as the 65c02 did. The 65816 has the WDM prefix byte that was reserved for future expansion into another 256 op codes, but those next 256 never materialized.

barrym95838 · Post by **barrym95838** » Mon Nov 16, 2020 5:41 am

GARTHWILSON wrote:

Wouldn't it still make instruction decoding more complex, reducing the maximum operating speed? The 6809 never reached clock & bus speeds anywhere near as high as the 65c02 did. The 65816 has the WDM prefix byte that was reserved for future expansion into another 256 op codes, but those next 256 never materialized.

The 68HC12 rearranged the 6809 opcode map to allow S X Y and D to be equally machine code efficient, but sacrificed U in the process. They are available with clocks up to 50 MHz from what I've heard, but they're big-endian, so bleh ...

GARTHWILSON · Post by **GARTHWILSON** » Mon Nov 16, 2020 5:55 am

barrym95838 wrote:

The 68HC12 rearranged the 6809 opcode map to allow S X Y and D to be equally machine code efficient, but sacrificed U in the process. They are available with clocks up to 50 MHz from what I've heard, but they're big-endian, so bleh ...

One would have to compare the geometries. I suspect that for a given die feature size, the instruction set without the op code prefix bytes will allow a faster clock & bus speed.

cjs · Post by **cjs** » Mon Nov 16, 2020 7:22 am

GARTHWILSON wrote:

Wouldn't [prefix bytes] still make instruction decoding more complex, reducing the maximum operating speed? The 6809 never reached clock & bus speeds anywhere near as high as the 65c02 did.

Well, I'm no expert on microprocessor design, but based on what little I know, prefix bytes don't seem to add significantly more complexity than adding instructions in any other way. Nor would I expect much effect on maximum clock speed, given that the extra fetch also adds significant extra time to set up the instruction to be executed.

It's true that the 6809 never exceeded 2 MHz, but there appear to be other reasons for this. Part of it may be that Motorola more or less abandoned development of that processor in the early '80s; Wikipedia claims that, "With little to improve, the 6809 marks the end of the evolution of Motorola's 8-bit processors; Motorola intended that future 8-bit products would be based on an 8-bit data bus version of the 68000 (the 68008)," though they don't provide a source for this. I suspect that the 6809 using random logic rather than microcode may have been an issue; Hitachi's improved version, the 6309, switched to CMOS and microcode, reduced the number of cycles used by some instructions, and was produced in versions rated up to 3 or 3.5 MHz (and apparently could be run as fast as 5 MHz without issues), but they also chose not to take that further.

There were certainly other elements of Motorola's 8-bit designs that probably caused speed issues, too. I used to be a big-endian fan until I finally figured out why the 6502 designers switched to little-endian: it let them start the addition of an index to an address on the LSB before the MSB was read, whereas if you read the MSB first you need to wait for the LSB addition before you know if you need to add a carry to it. I wouldn't be surprised if there are other clever bits of design like this that affect speed (clock or cycle count), but I don't think that prefix bytes (assuming they are well designed) are likely to be one of them.
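
To make the byte-order argument concrete, here is a small sketch of the indexed-address addition done one byte at a time, in the order the operand bytes arrive on a little-endian machine. The function name is invented; the arithmetic is just an 8-bit add with carry.

```python
def indexed_address_little_endian(lo, hi, index):
    """Add an 8-bit index to a 16-bit address, low byte first.

    Returns (effective_address, carry_out_of_low_byte).
    """
    # Step 1: the low byte has been fetched, so the add can start right away,
    # overlapping with the fetch of the high byte.
    lo_sum = lo + index
    carry = lo_sum >> 8
    new_lo = lo_sum & 0xFF

    # Step 2: the high byte arrives; it only needs the carry from step 1.
    new_hi = (hi + carry) & 0xFF
    return (new_hi << 8) | new_lo, carry


no_cross = indexed_address_little_endian(0x10, 0x12, 0x20)   # $1210 + $20
page_cross = indexed_address_little_endian(0xF0, 0x12, 0x20) # $12F0 + $20
```

With the high byte first, step 2 could not complete until step 1 had run, so nothing would overlap with the second fetch. The carry in the second call is the case where the real 6502 spends an extra cycle fixing up the high byte.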

It's certainly interesting to learn how the kinds of things in low-level design, particularly of the 6502, that annoyed me from a programmer's point of view are actually big wins when it comes to implementing the hardware. I still wish that the 6502 designers had gone with 16-bit index registers, though, but perhaps even there it would not have resulted in any speed increase at all due to hardware considerations I'm not aware of. (And hey, anyway, a decent $25 CPU! Who can complain? :-))

GARTHWILSON wrote:

I remember the early articles in the trade magazines about the new RISC concept, which was that with a reduced instruction set, even though you would need more instructions to get the job done, you could turn up the clock speed more than enough to compensate, resulting in better performance at lower cost.

Yeah, that seems to me a bit of a simplistic (as in, over-simplified) explanation of the advantages of RISC. Much of the clock speed issue was actually memory speed, rather than potential processor speed, as I understand it, and so moving to a load-store architecture was about removing that restriction on processor speed, rather than anything much inherent in the processor hardware itself. (I think we see something of this even in the 8080, running at higher clock speeds than the 6800 and 6502 even on older processes.) Obviously you do need an instruction cache to properly take advantage of this.

Beyond that, more complex instructions weren't working for many other reasons than how fast they could be made to run. In many cases, the complex instructions were almost, but not quite, exactly what the compiler writer needed, so they synthesized the one they did need from a set of simpler instructions. Or the CPU could do the correct action quickly but some instructions were not properly optimized in the initial design. (For example, on the VAX-11/780 PUSHL R0 was mysteriously slower than MOVL R0,-(SP), despite both instructions doing exactly the same thing!) In such cases even if you get a fix out, you've still got out in the field older hardware without the fix and compilers already producing code optimized for that older hardware.

And then of course there's the chip manufacture process improvements. If it takes you four years to get out a complex design, even if it is in the same process faster than a simple design that takes only two years, the guys who started two years later started on a more advanced process and get the advantage of that. This also applies to software, it turns out; the TI ASC had a bit-reverse instruction intended for use in FFT calculations, but by the time the computer was put into production it was no longer useful because better algorithms had been found. (Apparently TI offered a never-collected bounty to anyone who could find a use for the instruction.)

There's a good 1980 paper by Patterson and Ditzel, "The Case for the Reduced Instruction Set Computer," that goes into detail and provides references for these arguments and more. I've attached a copy here. (The ACM allows reproduction for the purposes of (non-commercial) personal research; see the policy if you need further details.)

Snial · Post by **Snial** » Thu Nov 26, 2020 10:51 pm

The 65T2.

A while back as a thought exercise I worked on a high-level language friendly 6502 alternative I called the 65T2.

The 65T2 has the following objectives:

• An 8-bit 6502 type architecture.
• The same MOS transistor and resource budget as a 6502 (or 65C02).
• Better overall performance than a 6502.
• Better suitability for compiled code than a 6502.

It achieves this in the following ways:

• The same set of registers as a 6502 and nearly the same set of core instructions.
• More orthogonal addressing modes reduce decoding.
• BCD support is omitted, reducing complexity.
• More instructions are available across more addressing modes meaning less copying is required. Indexed-stack addressing increases performance in many situations.
• The substitution of zero-page by indexed-stack addressing modes enables better suitability for compiled languages than a 6502. This is also supported by two 16-bit parameter copying instructions (lea and stw) which operate on X and A as a pair of registers (X is the high byte).

Zero-Page Background

The most controversial part of the 65T2 is the substitution of zero-page addressing by indexed stack addressing. Why was this done? The insight is to observe that the zero-page addressing mode is more of an artifact that hinders the 6502 than a benefit.

Firstly, it's an artifact originally from the 6800, which in turn is taken from typical minicomputer architectures from the 1960s. These architectures had fixed-length instructions which meant they didn't have enough addressing bits to represent the entire address space in an instruction. Yet, computers need to be able to address absolute, common locations across the whole address space. The solution was to implement a zero-page addressing mode (and zero-page indirect addressing modes).

The 6800 supposedly has zero-page addressing to increase code density, but really it's because the architecture resembles an early minicomputer such as the PDP-8 or HP2100. Those machines also implemented a zero page, because they had a fixed instruction width containing a base address (which with no modifications is a zero-page address). In addition, because they were designed before recursive languages became popular, they had no stack pointer.

My thinking is that the 6502, as the 6800's cousin, adopted a number of conflicting design features because it appeared just at the point where these languages were starting to be considered: zero-page addressing, byte-oriented instructions, a stack pointer for subroutines, and multiple addressing modes.

More problematically, the 6502 incorporated architectural features that were restrictive even by 1970s microprocessor standards, namely the lack of registers and an inability to use any registers to index the whole of memory (contemporary CPUs such as the 8080, the RCA 1802 and the SC/MP all had at least 8 x 8-bit registers which could be paired up). However, the 6502 was redeemed in a big way by the availability of its zero-page indirect indexed addressing mode which enabled the CPU to treat a pair of zero-page locations as a pointer and then index it by its y register.

Nevertheless, the 6502 is a better CPU than the 6800: it's faster, smaller and cheaper by design.

Zero-Page Problem

Redesigning the 6502 to make it more suitable for high-level languages is problematic. The most important criterion is support for an indexed stack addressing mode. The 65816 achieves this in part by adding 16 or so addressing modes, including indexed stack addressing and also indirect indexed stack addressing, to address the need within high-level languages to copy parameters and access memory via pointers within a stack frame.

The problem here though is that there's a chronic lack of instruction space for the extra addressing modes: 256/24 => 10 possible instructions which could use them. As a result few instructions actually support the additional addressing modes. In fact the 6502 itself suffers from the same problem as many instructions are available only with limited addressing modes.
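
The arithmetic behind that estimate, as a tiny sketch. The figure of 24 modes is the post's own count; the comparison with 8 modes assumes the eight-mode layout the 65T2 uses in its mode tables.

```python
OPCODE_SPACE = 256  # one-byte opcodes, no prefix bytes

def max_fully_orthogonal_ops(addressing_modes):
    """How many instructions could support every addressing mode."""
    return OPCODE_SPACE // addressing_modes

with_24_modes = max_fully_orthogonal_ops(24)  # 65816-style mode count
with_8_modes = max_fully_orthogonal_ops(8)    # eight modes, as on the 65T2
```

With 24 modes only about 10 instructions could be fully orthogonal; with 8 modes the same opcode byte has room for 32.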

Any attempt to redesign the 6502 would also suffer from the same issue and invariably lead to a 65816 type approach.

Solution

The insight is to realize that (a) zero-page addressing is actually a problem but that (b) indirect indexed addressing is actually a good thing.

This solution therefore dumps all the zero-page addressing modes, leaving room for a number of stack addressing modes. Stack indexed, indirect indexed addressing modes can then be supported. One further observation is that the new addressing modes can be seen as a set of 4 direct addressing modes and their indirect counterparts.

Code:

Direct modes:   #n [really [PC]++], s+n, X+Abs, Y+Abs.
Indirect modes: Abs [really [[PC]++]], *s+n, *s+n,x, *s+n,y.

Two final steps are to support additional copying operations for 16-bit values and to provide an actual add instruction.

16-bit Copying. The 65T2 supports 16-bit copies by using X:A as a register pair. You can copy an effective address to X:A (and thanks to the indirect modes you can copy a 16-bit memory location in some circumstances). You can store a 16-bit value in several ways.

Add. Instruction set frequencies reveal that add is a very common instruction, but the 6502 doesn't support it. For example, the availability of add cuts the frequent 8-bit add operation from clc/lda/adc/sta = 11 cycles to 9 cycles, roughly an 18% improvement. The 65T2 provides this instruction. In addition, because of the design of the 6502 carry flag, sub #n works in the same way as add #-n (similarly, adc #n works in the same way as sbc #-n).
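
The sub/add equivalence can be checked exhaustively with a small model. This is a sketch under two assumptions that the post does not spell out: `add` behaves like a 6502 binary-mode ADC with the carry-in forced to 0, and `sub` behaves like SBC with the carry-in forced to 1 (no borrow).

```python
def adc(a, operand, carry_in):
    """6502-style binary ADC: returns (result, carry_out)."""
    total = a + operand + carry_in
    return total & 0xFF, total >> 8

def sbc(a, operand, carry_in):
    """6502-style binary SBC, which is A + ~operand + C."""
    return adc(a, operand ^ 0xFF, carry_in)

def add(a, operand):
    # Assumed 65T2 semantics: ADC with carry-in 0.
    return adc(a, operand, 0)

def sub(a, operand):
    # Assumed 65T2 semantics: SBC with carry-in 1.
    return sbc(a, operand, 1)

# Compare sub #n with add #-n for every accumulator value and operand.
mismatches = [(a, n) for a in range(256) for n in range(256)
              if sub(a, n) != add(a, (-n) & 0xFF)]
only_zero_operand = all(n == 0 for _, n in mismatches)
```

Under these assumptions the result byte always matches, and the carry flag matches too except when n is 0, where `sub #0` leaves carry set and `add #0` leaves it clear.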

Other Differences

There are a few other differences between the 6502 and the 65T2. Branch instructions include blt and bge for signed branches at the loss of bvc and bvs.

There is an unconditional branch instruction (bra) and a jump subroutine indirect instruction (jsi), as well as a jump indirect (jpi). Only four flag operations are supported: clc, sec and a general-purpose clp and sep instruction pair, each followed by an immediate. Register transfers are limited to a->reg and reg->a; thus tsx and txs are no longer possible, but tfa s and tfr s, which move a value between a and s, are.

Absolute Stack Pointer Management

The 65T2 implements a 9-bit stack. To read the stack address you would execute: lea s+0; which puts the high byte into x and the low byte into a.

On the 65T2, the tfa s instruction copies A to bits 0..7 of S and bit 0 of X to bit 8 of S.
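
A sketch of how the 9-bit stack pointer maps onto the X:A pair, based on the two sentences above. It assumes the 9-bit S is itself the absolute stack address (so the stack lives in $000..$1FF); the function names simply mirror the mnemonics.

```python
def tfa_s(a, x):
    """Model of 'tfa s': S bits 0..7 come from A, S bit 8 from X bit 0."""
    return ((x & 1) << 8) | (a & 0xFF)

def lea_s0(s):
    """Model of 'lea s+0': returns (x, a) = (high byte, low byte) of S."""
    return (s >> 8) & 0xFF, s & 0xFF


top_of_stack = tfa_s(0xFF, 0x01)      # S = $1FF
x, a = lea_s0(top_of_stack)           # x = $01, a = $FF
```

Only bit 0 of X is used when setting S, so `tfa_s(0x34, 0xFE)` yields $034 rather than anything in a higher page.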

Assembler Syntax

The assembler syntax for the 65T2 has been designed for simple parsing. An instruction format is:

[:Label] [(operation | directive) operand] [;Comment]

Labels are preceded by ':' so that the assembler knows a label is coming before the label text itself. Local labels are just numbers and the definition of a local label resets labels whose numbers are >= itself. For example:

Code:

	:example	bra 10
	:5		inc a
	:10		rts
	:20		jsr ex2

	:ex2		dec a	; :10 and :20 are the labels above.
	:10		bne 20	; :20 is the label below (because 10 resets locals >=10).
	:20		bcc 5	; Is the one above (because 10 doesn't reset :5).
			bra 10	;Is the one after :ex2 (because 10 resets >=10).
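
One way to read the local-label rule is sketched below. It is an interpretation, not the author's assembler: a reference resolves to the most recent definition that is still visible, a definition of N hides every visible local numbered >= N, and a reference to a hidden or not-yet-defined number resolves forward to its next definition. Named labels are assumed not to reset anything.

```python
def resolve_local_labels(lines):
    """lines: list of (local_label, local_ref), each an int or None.

    Returns {line_index: target_line_index} for every line with a local_ref.
    """
    definitions = [(i, lab) for i, (lab, _) in enumerate(lines) if lab is not None]
    visible = {}    # label number -> line index of the visible definition
    resolved = {}
    for i, (label, ref) in enumerate(lines):
        if label is not None:
            # Defining N hides every visible local numbered >= N.
            visible = {n: at for n, at in visible.items() if n < label}
            visible[label] = i
        if ref is not None:
            if ref in visible:
                resolved[i] = visible[ref]
            else:
                # Forward reference: the next definition of that number.
                resolved[i] = next(
                    (at for at, n in definitions if at > i and n == ref), None)
    return resolved


# The listing above, reduced to (local label, local reference) pairs.
listing = [
    (None, 10),    # :example  bra 10
    (5, None),     # :5        inc a
    (10, None),    # :10       rts
    (20, None),    # :20       jsr ex2   (named target, not a local)
    (None, None),  # :ex2      dec a
    (10, 20),      # :10       bne 20
    (20, 5),       # :20       bcc 5
    (None, 10),    #           bra 10
]
targets = resolve_local_labels(listing)
```

Under this reading the four local references resolve to the same lines the comments in the listing describe.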

Every operation with a unique base opcode has a unique mnemonic.

Every operand has an unambiguous format: one of the following:

• No operand.
• A register operand: X, Y, A, S or P.
• An 8-bit branch operand.
• A 16-bit absolute address.
• An effective address.

All effective addresses are decoded as [prefix] value [index] where prefix is one of "#" , "s+", "x+" , "y+" , "*s+" and index is one of ",x" or ",y". Avoiding '(' and ')' as part of an addressing mode means that there's never any ambiguity between memory indirection and expressions. At a simplistic level the default addressing mode (absolute addressing) translates to 4 and prefixes translate as: "#"=>-4, "s+"=>-3, "x+"=>-2, "y+"=> -1, "*s+"=>5, ",x"=>1 and ",y"=>2.
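
The prefix/index arithmetic can be sketched as a tiny operand classifier. This assumes the resulting mode numbers 0..7 follow the column order of the ea mode table later in the post, and that an index suffix is only legal after the "*s+" prefix; `parse_ea` and the table names are invented for the example.

```python
import re

# Absolute addressing is mode 4; each prefix moves away from it as described.
PREFIX_MODE = {"#": 0, "s+": 1, "x+": 2, "y+": 3, "": 4, "*s+": 5}
INDEX_OFFSET = {"": 0, ",x": 1, ",y": 2}

EA_PATTERN = re.compile(r"^(#|\*s\+|s\+|x\+|y\+)?([^,]+)(,x|,y)?$", re.IGNORECASE)

def parse_ea(operand):
    """Return (mode_number, value_text) for an effective-address operand."""
    match = EA_PATTERN.match(operand.replace(" ", ""))
    if match is None:
        raise ValueError(f"not an effective address: {operand!r}")
    prefix = (match.group(1) or "").lower()
    value = match.group(2)
    index = (match.group(3) or "").lower()
    if index and prefix != "*s+":
        # Assumption: only the stack-indirect form takes ,x or ,y.
        raise ValueError(f"index suffix needs the *s+ prefix: {operand!r}")
    return PREFIX_MODE[prefix] + INDEX_OFFSET[index], value


examples = [parse_ea(text) for text in ("#5", "s+6", "x+table", "gIP+1", "*s+0", "*s+2,y")]
```

Because no addressing mode uses parentheses, the value text can be handed to an expression evaluator unchanged.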

In operands, the current compilation address is in '.'.

Directives always start with a '.'. Standard directives are: ".org" , ".equ" , ".db", ".dw". A macro assembler would support ".mac" and ".end".

Has the same (or lower) transistor budget as the 6502.

Pinout: A15..A0 (16)
D7..D0 (8)
R/W (1)
Clk (2)
Vcc/Gnd (2)
Bus (2): 11=ACK, 10=Ins, 01=Data, 00=Internal/Wait.
IRQ (1)
Total: 32.

Fewer pins also means a cheaper package (20% cheaper).

Interrupt Cycle:
1. Assert IRQ by peripheral for >=1 clk (it will be latched).
2. On the next Ins cycle the CPU sets A15..A2 high (Bus=10, Clk=0). The peripheral asserts the vector on A15..A2 (pull-down). Vector 0xfffc is the default IRQ, 0xfff8 is Reset.
3. CPU sets F<4> and performs an ACK cycle (Bus=11, Clk=0). Peripheral should deassert Vector.

Thus, a simple peripheral (without ACK) can assert an IRQ by edge, or we could have a simple bit of logic which recognizes Bus=11, Clk=0 as CLR IRQ (flip-flop with simple decoder). Or we could have a more complex Interrupt generator.

Reset is an IRQ: Hold IRQ=0 and Bit 2=0, on release, CPU comes out of reset.

Regs And Instruction Set
****************************

Regs: A S X Y all 8-bit. Flags: ??,x,H,i,v,n,c,z

nvcz flags are normal. i is interrupt flag. H is halt flag: the 65T2 will halt after each instruction if H is set. x is reserved for the future 16-bit mode flag for the 65T2 successor, where A=16-bits, X=16-bits, Y=16-bits. [??] is undefined (should be 0).

Code:

<mon> modes that operate directly on memory locations (or A):
mode:           A    s+d  x+Abs  y+Abs  @nn  *s+d  *s+d,x  *s+d,y
Extra clocks:   -2   +0   +1     +1     +3   +3    +3      +3
               
<ea> modes that have #n or memory location operands:
mode:           #n   s+d  x+Abs  y+Abs  @nn  *s+d  *s+d,x  *s+d,y
Extra clocks:   0    1    2      2      3    3     3       3

Code:

Assembler syntax: 'a' | '*s+' n [ ',x' | ',y'] | ['s+' | '@' | 'x+' | 'y+' ] n

00-3F <mon> ops: inc, dec, rol, ror, lsl, lsr, asr, bit [ 4c each]
40-7F <ea> ops:  adc, sbc, and, ora, eor, lda, sta, cmp [2c each]
80-BF Quad3:     ldx, stx, cpx, ads, ldy, sty, cpy, add [2c each]
Quad4:

C0 lea          ?? sd xa ya Abs *sd *sdx *sdy
C8 stw          ?? sd xa ya Abs *sd *sdx *sdy
D0 bcc:         beq, bne, bcc, bcs, bmi bpl blt bge [2 or 3 c]
D8 jmp:         jmp jsr bra rts jpi jsi swi rti
E0 Flags:       sep clp sec clc ?? ?? ?? ??
E8 X/Y:         inx iny ?? ?? dex dey ?? ??
F0 TFA/TFR:     x y s p x y s p
F8 psh/pul:     x y a p x y a p

Ins count: 8+8+8+2+8+8+4+4+2+2 = 54.

There are 10 free opcodes: $C0, $C8, $E4 to $E7, $E8, $E9, $EE, $EF. 65T16 candidate extensions are: LNG [Address is 24-bits, causes every Abs ref or *s ref to fetch 3 bytes]. OPB (ALU byte extension in word mode), MVB (block move up), MVN (block move down), MUL.

EA: (bit7 ~& bit6) | (bit5 & bit4)
MonEA: ~bit7 & ~bit6 & ~bit2 & ~bit1 & ~bit0
dd' : ~bit1 & bit0 | bit2 & ~bit1 & ~bit0
nn : EA & dd'
n : EA & ~dd' & ~MonEA

Example code sequences.

Code:

uint8_t Sum(uint8_t a, uint8_t b) { return a + b; }

	ads	#-4
	lda	s+5
	add	s+6
	sta	s+5
	ads	#4
	rts

BlockMove:	;ret, (2,s)=src, (4,s)=dst, (6,s)=len
	ldy s+6	;low byte of len.
	beq BlockMove20
	inc s+7
	bra BlockMove20
BlockMove10:
	lda *s+2,y
	sta *s+4,y
BlockMove20:
	dey
	bne BlockMove10
	dec s+7
	bne BlockMove10
	lda *s+2,y
	sta *s+4,y
	rts

To call Block Move:
	ads #-6
	lea kLen
	stw s+4
	lea kDst
	stw s+2
	lea kSrc
	stw s+0
	jsr BlockMove
	ads #6	;deallocate.
	...

65T2 Forth

Here, y=return stack pointer and s=data stack pointer. gRStack is the base of the return stack. gIP is the instruction pointer. It's a direct-threaded Forth. Because the 65T2 doesn't support zero page, gRStack and gIP are 16-bit addresses. Nevertheless, it's easy to make it as fast as or faster than the equivalent 6502 version.

Code:

Next:
	lda #2	;gIP is pre-incremented on 65T2 Forth.
Next2:
	add gIP
	sta gIP
	bcs Next10
	jpi gIP	;17c, 18b.
Next10:
	inc gIP+1
	jpi gIP	;23c, 18b.

Enter:
	iny
	iny		;Pre-increment to make space on the return stack.
	lda gIP
	ldx gIP+1
	stw gRStack,y	;push old IP
	lea *s+0	;get return address from JSR (already inc'd).
	stw gIP	;store in IP
	ads #2	;pop rts address from machine stack.
	jpi gIP	;jump to next ins: 34c [59KIPs]

Exit:
	lda gRStack,y
	ldx gRStack+1,y	;8c
	dey
	dey
	add #2	;
	bcc Exit10
	inx
Exit10:
	stw gIP
	jpi gIP	;27c/28c [74KIPs]

GetR:
	lda gRStack,y
	ldx gRStack+1,y
	ads #-2	;allocate space on datastack.
	stw s+0
	jmp Next	;17+19 = 38c = 53KIPs.

DoAdd:
	lda s+0
	add s+2
	sta s+2
	lda s+1
	adc s+3
	sta s+3
	ads #2	;pop data stack.
	jmp Next	;23c+17c=42c, 46KIPs.
	;Same is true for OR, XOR.

Lit:
	ldx gIP+1
	lda #2
	add gIP
	bcc Lit10
	inx
Lit10:
	stw gIP
	ads #-2	;allocate 2 bytes
	stw s+0
	lda *s+0
	ldx #1
	ldx *s+0,x
	stw s+0	;
	jmp Next	;

Dup:
	ads #-2
	lea *s+2
	stw s+0
	jmp Next	;13c

TwoOver:	;al ah bl bh - al ah bl bh al ah
	ads #-4
	lea *s+4
	stw s+0
	lea *s+6
	stw s+2
	jmp Next	;21c.

Swap:
	ads #-2
	lea *s+4
	stw s+0
	lea *s+2
	stw s+4
	lea *s+0
	stw s+2
	ads #2
	jmp Next	;

Loop:	;s+0=I, s+2=I'
	inc s+0			;5c
	bne Loop10		;3c.
	inc s+1	;inc hi byte.
Loop10:
	lda s+0			;3c
	cmp s+2			;3c
	bne Branch		;3c Total 17c. 39c total, 51KLoops/s.
	lda s+1
	cmp s+3	;compare hi bytes.
	bne Branch
	lda #4	;Next2 adds a to gIP.
	jmp Next2	;skip jump target.
Branch:
	lda *s+0
	ldy #1
	ldx *s+0,y	;x:a=Branch target
	stw gIP
	jpi gIP	;Execute. 22c. 91KIPs.
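
The KIPS figures in the comments above can be related to the cycle counts with a one-line conversion. The 2 MHz clock is an inference from the numbers (34c with 59KIPs, 27c with 74KIPs, 22c with 91KIPs), not something the post states.

```python
ASSUMED_CLOCK_HZ = 2_000_000  # inferred from the cycle/KIPS pairs, not stated

def kips(cycles, clock_hz=ASSUMED_CLOCK_HZ):
    """Thousands of executions per second for a routine of `cycles` clocks."""
    return clock_hz / cycles / 1000

enter_kips = round(kips(34))    # Enter
exit_kips = round(kips(27))     # Exit, no-carry path
branch_kips = round(kips(22))   # Branch
```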

BigEd · Post by **BigEd** » Sat Nov 28, 2020 9:34 am

Thanks Snial, quite a lot to digest there. On a single quick read, it looks interesting and feasible.

I'm not sure why you extended the stack pointer to just 9 bits: with the mechanism you describe, it could be more. Perhaps you are trying to keep the on chip state to a minimum?

Have you written any tools for or models of this machine? Sometimes that will show up things which you hadn't thought of.

Martin_H · Post by **Martin_H** » Sat Nov 28, 2020 4:58 pm

Interesting. Overall I like it because I use page zero as a data stack anyway. But I have a few questions.

* You state that SP is nine bits, but then state all registers are eight bits. How is the extra stack bit retained?

* Does the nine bit stack grow downward through page zero?

* You state you prefix labels with a colon, but code examples use postfix. I prefer a postfix colon, but that's what I am used to.

cjs · Post by **cjs** » Sat Nov 28, 2020 11:57 pm

Snial wrote:

A while back as a thought exercise I worked on a high-level language friendly 6502 alternative I called the 65T2.

Your analysis is certainly interesting, and I like the general thrust of your proposal. I think that aiming at the same MOS transistor and resource budget is a good idea, though I wonder whether being a little less strict about this (perhaps moving up to "lower transistor/resource budget than the 6800") might allow for significant architectural improvement worthy of the cost.

But I'd like to start with what I believe is a significant misconception of your analysis of the 6502, though I don't know how much a different analysis would have changed the rest of your proposal.

Quote:

The 6800 supposedly has zero-page addressing to increase code density, but really it's because the architecture resembles an early minicomputer such as the PDP-8 or HP2100.

The 6800 and 6502 are my primary microcomputer development platforms, and I've done a fair amount of thinking about their differences. I don't think you're right here.

To be probably overly brief, there are two basic styles of handling addresses in the instruction stream: in-word (PDP-10, PDP-8, etc.) and separate-word (PDP-11). The former executes (perhaps not exclusively) single-word operations containing both the opcode and an address, forcing such addresses to be smaller than word size. The latter uses separate bytes/words for opcode and address and almost invariably the addresses in the instruction stream are both full word size and cover the full address space. The 6800 clearly falls in the latter camp.
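
The two styles can be contrasted with a pair of small decoders. The PDP-8 field layout (3-bit opcode, indirect bit, page bit, 7-bit offset) and the 6800 LDAA opcodes ($96 direct, $B6 extended) are the standard ones; the function names and the choice of LDAA as the example are mine.

```python
def decode_pdp8(word, instruction_addr):
    """In-word style: a 12-bit PDP-8 memory-reference instruction.

    Returns (opcode, indirect, address). The address field is only 7 bits,
    extended by either page zero or the page holding the instruction.
    """
    opcode = (word >> 9) & 0o7
    indirect = bool(word & 0o400)
    current_page = bool(word & 0o200)
    offset = word & 0o177
    page_base = instruction_addr & 0o7600 if current_page else 0
    return opcode, indirect, page_base | offset

def decode_6800_ldaa(code):
    """Separate-word style: opcode byte followed by 1 or 2 address bytes.

    Returns (address, instruction_length).
    """
    if code[0] == 0x96:                       # LDAA direct (zero page)
        return code[1], 2
    if code[0] == 0xB6:                       # LDAA extended (big-endian)
        return (code[1] << 8) | code[2], 3
    raise ValueError("not an LDAA direct/extended opcode")


pdp8_current_page = decode_pdp8(0o1377, 0o2000)
pdp8_page_zero = decode_pdp8(0o1077, 0o2000)
m6800_extended = decode_6800_ldaa(bytes([0xB6, 0x12, 0x34]))
m6800_direct = decode_6800_ldaa(bytes([0x96, 0x40]))
```

On the PDP-8 the short address is forced by the instruction format, while on the 6800 the direct form is simply a shorter encoding of an address the extended form can also reach.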

Nor does the 6800 have the zero-page through-memory indirect addressing modes of the minicomputers, or even the non-zero-page ones. All indirection is done only via the 16-bit X register, never through an address in memory. (This is key in the 6800.)

So it seems clear that zero-page addressing (or "direct addressing," in 6800 terminology) really is just a measure for saving space in the instruction stream, and not anything fundamental to the 6800 architecture. In fact, it's so non-fundamental that they didn't even bother with zero-page addressing for the single-operand instructions (TST, NEG, INC, etc.). My experience analyzing 6800 ROM code and programming it myself also bears out that the zero page is not a big deal architecturally; there's no issue with moving any zero-page storage elsewhere in memory, so long as you don't mind the extra bytes and cycles in your code.

That said, the zero page is a fairly big efficiency win in one way. Because all indirect/indexed addressing uses the single X register, it's very frequently loaded and stored even in inner loops, and thus being able to do this in two bytes instead of three helps significantly here. We'll see more below on how this affected the 6502 design.

Quote:

My thinking is that the 6502 as the 6800's cousin, adopted a number of conflicting design features because it appeared just at the point where these languages were starting to be considered: zero-page addressing; byte-oriented instructions; a stack pointer for subroutines and multiple addressing modes.

Actually, the 6502's changes from the 6800 architecture were a (very successful) attempt to solve the single X register problem I mention above without adding more registers, and entirely changed the relationship of the zero page to the rest of the architecture. The surface similarities between the two zero page designs are misleading: the zero page on the 6502 is integral and essential to the 6502's operation with indirect values in a way it is not on the 6800.

As I mentioned above, the 6800 does a lot of X register loads and stores, particularly in inner loops where this hurts performance significantly. The obvious solution to this is to add another index register (which is exactly what the 6809 and 6811 did), but that of course means more transistors in your CPU.

Quote:

The insight is to observe that the zero-page addressing mode is more of an artifact that hinders to the 6502 than a benefit.

It's quite the opposite: much improved use of the zero page is exactly what gave the 6502 its performance advantage over the 6800.

Rather than add another index register, the 6502 designers took a completely different approach: do indirect accesses through pointers in memory rather than in a register (which remember the 6800 does not do at all), and then use some clever design changes to mitigate (in fact, entirely remove, compared to the 6800) the performance penalty this would otherwise cause. This effectively gives you up to 128 "index registers" which cost no more to use than the 6800's single index register. They did this by:

Using zero-page addressing for the index locations, allowing them to be referenced in only one additional byte after the instruction.
Adding an 8-bit "offset" register (Y) that can be used with these to allow limited (up to 256 bytes, which is good enough for many purposes) modification of the indirect address without loads and stores.
Removing the constant offset when using the index register, saving a byte and a cycle for its use. (The 6800 always supplies this: `LDAA ,X` is actually `LDAA 0,X` and is still a two-byte instruction. Adding another addressing mode that could indirect through X with fixed 0 offset in a single-byte instruction would have been a pretty big win, IMHO, given how often that special case is used.)
Changing to little-endian addresses in memory, allowing parallel addition of the offset to the low byte of the index "register" while loading the high byte.

So now we can see where, using these tricks, the 6502 gets a performance advantage over the 6800 even before we consider X register loads and stores:

      6800:  A6 00  (5)  LDAA 0,X     ; load A from addr in X register
             08     (4)  INX

      6502:  B1 34  (5)  LDA  ($34),Y ; load A from addr stored at $34/$35, offset by Y
             C8     (2)  INY

I have more comments on your proposed new architecture, but this has taken enough time that I think I'll stop here for now. I'd be interested to hear if this view of the 6502 changes any of your proposals, and what your thoughts are on adding registers versus the 6502 approach.
