A Hypothetical C Friendly 6502

Snial · Post by **Snial** » Tue Dec 01, 2020 10:34 pm

Hello Garth,

GARTHWILSON wrote:

Snial, I don't know much about C compilers' innards; but I think you'll want to look into the 816's stack-relative addressing modes...

Thanks, I'm aware of the 65C816 and it's very extensive addressing modes, including stack addressing modes and its DP register which can function well as a frame pointer. It's a really creative CPU and I like it. I have one ready to go into a future DIY SBC, perhaps a little similar to the Veronica project.

Obviously the 65C816 is much better adapted to 'C'; or Pascal (the TML Pascal manual I have has an interesting section in an Appendix on how it generates code including its ABI).

The concept of the 65T2 is really just focussed on being a hypothetical 'C' friendly 6502: it's not a 16-bit processor, for example. I'm just responding to the OP.

I hope this answers your comment.

johnwbyrd · Post by **johnwbyrd** » Wed Dec 02, 2020 8:50 am

Snial wrote:

Redesigning the 6502 to make it more suitable for high-level languages is problematic. The most important criteria is to be able to support an indexed stack addressing mode.

I was with you until this point. While you can obviously choose whatever design criteria that you want, it's not immediately clear to me how any modern C compiler would benefit from this.

I don't understand how you would think that "modern CPUs are stack based." Is ARM stack based? SPARC? PowerPC? SiL? AVR? I think it would be much fairer to say that modern C compilers for these targets do NOT spend time juggling items on stacks -- instead they move parameters into registers, and work with them in registers, and then spill those registers back onto the stack as needed.

I can't see exactly how I might use all those wild addressing modes to improve the performance of llvm. There may be a way, but I can't see it at this time.

I feel that the 6502's Achilles heel, from the perspective of this compiler author, is its paucity of registers. Wozniak noticed this himself, and developed SWEET16 to work around this problem.

Personally, the thing that would permit my compiler to turn out more efficient code, is a 65xx variant that can perform adds, subtracts, multiplies, and divisions, on operands located in zero page, at a precision of 8, 16, 32, or 64 bits. I would give up all the fancy addressing modes for a hardware multiplier, and variable-length virtual registers in zero page. Please make your hardware perform those tasks for me efficiently, please? An 8x8 multiplier would probably be around 600 transistors or so.

GARTHWILSON · Post by **GARTHWILSON** » Wed Dec 02, 2020 10:17 am

johnwbyrd wrote:

I think it would be much fairer to say that modern C compilers for these targets do NOT spend time juggling items on stacks -- instead they move parameters into registers, and work with them in registers, and then spill those registers back onto the stack as needed.

I think the idea was about passing parameters on the hardware stack and accessing them in random order, with no juggling, and using them in place. The 6502 can do it with TSX and then doing for example ADC 107,X, but takes away that index register. The '816 has addressing modes to access them without taking an index register. Forth solves a lot of problems of efficiency in parameter-passing by using a ZP data stack that's separate from the hardware stack. The last half of chapter 5 of the 6502 stacks treatise goes into these things, at http://wilsonminesco.com/stacks/stackaddressing.html .

Quote:

I feel that the 6502's Achilles heel, from the perspective of this compiler author, is its paucity of registers. Wozniak noticed this himself, and developed SWEET16 to work around this problem.

My understanding is that he developed it as a model of frugal coding to manipulate 16-bit pointer data. The '816 does it nearly an order of magnitude more efficiently, without additional processor registers.

Quote:

Personally, the thing that would permit my compiler to turn out more efficient code, is a 65xx variant that can perform adds, subtracts, multiplies, and divisions, on operands located in zero page, at a precision of 8, 16, 32, or 64 bits. I would give up all the fancy addressing modes for a hardware multiplier, and variable-length virtual registers in zero page. Please make your hardware perform those tasks for me efficiently, please? An 8x8 multiplier would probably be around 600 transistors or so.

Memory is cheap now; so if you can dedicate 128KB to each of the desired functions, you can implement large look-up tables which I make available on my website, at http://wilsonminesco.com/16bitMathTables/ . Then you can go beyond just the basic functions, and for example an '816 can get a 16-bit sine (a trig function) in 23 cycles, accurate to all 16 bits, with no interpolation, because every answer is pre-calculated in the tables. An 8x8 multiply is similar. You can't divide with a table, but you can look up the inverse and multiply by it; and I do provide a table of inverses, with 32-bit answers so you can get full resolution across the entire 16-bit range. (It's not always necessary to use all 32 bits though.)

cjs · Post by **cjs** » Wed Dec 02, 2020 11:03 am

johnwbyrd wrote:

I don't understand how you would think that "modern CPUs are stack based."

I have the feeling that you may be talking about "stack-based" as in Forth or the JVM, and Snial is talking about "stack-based" as in frames on the hardware call stack being used for procedure parameters, amongst other context.

GARTHWILSON wrote:

johnwbyrd wrote:

I feel that the 6502's Achilles heel, from the perspective of this compiler author, is its paucity of registers. Wozniak noticed this himself, and developed SWEET16 to work around this problem.

My understanding is that he developed it as a model of frugal coding to manipulate 16-bit pointer data. The '816 does it nearly an order of magnitude more efficiently, without additional processor registers.

If by "frugal" you mean "writing a lot fewer instructions to do 16-bit manipulation," yes, that is my understanding, too. If you have a lot of code doing 16-bit operations, the opcode savings can far outweigh the size of the Sweet 16 interpreter (and it was in ROM on the Apple II, anyway), though of course being somewhat slower.

As for the register thing, if you're going to go with "what could the original MOS 6502 have done differently without adding features that massively increase the cost," adding half a dozen registers is probably out.

This brings up the thought that possibly we should be specifying a little more about what kind of efficiency we're talking about when we use that word. (Code size or speed might be a good start, since the former fairly often comes at the expense of the latter.)

Quote:

An 8x8 multiplier would probably be around 600 transistors or so.

That alone would increase the transistor count of the original MOS 6502 by more than 20%.

Snial · Post by **Snial** » Wed Dec 02, 2020 1:40 pm

johnwbyrd wrote:

Snial wrote:

Redesigning the 6502 to make it more suitable for high-level languages is problematic. The most important criteria is to be able to support an indexed stack addressing mode.

I was with you until this point. While you can obviously choose whatever design criteria that you want, it's not immediately clear to me how any modern C compiler would benefit from this.

I don't understand how you would think that "modern CPUs are stack based." Is ARM stack based? SPARC? PowerPC? SiL? AVR?

I merely meant it in the sense that cjs and garthwilson meant it. 'C' expects to have stack frames and register rich architectures spill registers to the current thread's stack when they call functions. That is, registers act as a cache for the stack, but the language is stack oriented.

The 65T2 should make compiler writing easier than for a 6502, because one can conceive of pseudo registers being aliased in actual local variables and parameters on the stack. If we imagine R0..Rn mapping to S+0 .. S+2n, then operations on A, Rm map to A,S+2m. This is the reverse of what happens in a register rich architecture where actual registers alias actual local variables. There's an interesting discussion on this approach for the 'C' compiler written for the Magic-1 architecture.

Magic-1 is register poor, but the designer ported LCC to it by creating a number of pseudo registers which are then mapped to its stack. The same approach could work for the 65C816 to port to a newer version of LCC.

In RISC processors, the primary addressing mode is: IndexReg+offset; where one register is usually designated as the stack pointer itself. Thus the 65T2 does the same thing, it supports IndexReg+Offset modes for S, X, Y,but it also supports the indirect equivalents for stack locals and parameters: (s+offset),X and (s+offset),Y.

This makes its addressing modes somewhat more versatile than an actual 6502 (but not a 65C816), because indirect, indexed addressing is available for both X and Y; and space on the stack can be easily allocated using ADS #n.

65T2 makes register spilling and filling easy because it can be largely achieved by ADS #n instead of having to move values in and out of zero-page to the stack. Consider:

uint16_t bar(uint16_t *Sum) { return (*Sum)+(*Sum>>1); }
uint16_t foo(void) { uint16_t v=100; return bar(&v); }

On the 65T2 the code could look like this:

Code: Select all

:bar ;Sum is at S+2 and S+3
  lda *s+3 ;hi byte (indirect= ea mode 5 due to the *s, not mode 1 which would be lda s+3)
  asra   ;btw *s+3 means *(s+3).
  pha
  lda *s+3; lo byte.
  rora
  add *s+3;lo byte
  sta *s+3
  pla
  adc *s+3 ;hi byte.
  sta *s+3;hi byte.
  rts

:foo ;
  lda #0
  pha
  lda #100
  pha ;v=100.
  lea s
  phx
  pha
  jsr bar

Passing parameters and accessing the contents of pointers is relatively easy on the 65T2 (at least for these simple examples, Dhrystone would be an interesting and more substantial test). A compiler might pre-allocate space in foo for V and for temporary variables in bar using ads #-n and then use lda / sta instructions rather than pha or pla, but this wouldn't substantially change the code generation.

Back later...

johnwbyrd · Post by **johnwbyrd** » Wed Dec 02, 2020 6:56 pm

Snial wrote:

The 65T2 should make compiler writing easier than for a 6502, because one can conceive of pseudo registers being aliased in actual local variables and parameters on the stack. If we imagine R0..Rn mapping to S+0 .. S+2n, then operations on A, Rm map to A,S+2m. This is the reverse of what happens in a register rich architecture where actual registers alias actual local variables.

Perhaps, but I don't perceive "making compiler writing easier" to be my primary goal. I believe a compiler's goal should be to generate reasonably efficient code.

The kind of register aliasing you're describing, is exactly the thing that modern research tries to avoid. See for example https://www.youtube.com/watch?v=hf8kD-eAaxg .

Snial wrote:

65T2 makes register spilling and filling easy because it can be largely achieved by ADS #n instead of having to move values in and out of zero-page to the stack.

But a key performance goal for modern compilers, is to avoid register spilling and filling as much as possible. Optimal code would move hot data to 8-bit addressable memory, i.e. zero page, or to real registers, and leave it there for as long as possible. Ideally the data in your function would persist in registers or pseudo-registers across function call boundaries.

I am sure some toy compilers exist that allocate registers the way you describe. But the vast majority of modern compiler research assumes that the target has a homogeneous array of n^2 registers, and modern compilers will spend a great deal of effort on minimizing register spills. The more that a 65xx can behave like an n^2 register machine -- minimizing instruction size and maximizing performance within those registers -- the more that modern compilers can generate good-quality code for them.

Please consider making your CPU behave as though zero page is a collection of uniform registers, which may be addressed as 256 8-bit registers, 128 16-bit registers, 64 32-bit registers, and so on.

johnwbyrd · Post by **johnwbyrd** » Wed Dec 02, 2020 7:15 pm

Okay 65xx hardware tinkerers, here is the new instruction type I need to write a better C backend.

Check out this instruction sequence:

; 16 bit Binary Addition
CLC ;Ensure carry is clear
LDA VLA+0 ;Add the two least significant bytes
ADC VLB+0
STA RES+0 ;... and store the result
LDA VLA+1 ;Add the two most significant bytes
ADC VLB+1 ;... and any propagated carry bit
STA RES+1 ;... and store the result

Where all the operations occur on zero page.

I want this whole business to be represented as a single new instruction:

CLC
ADC16 RES, VLA, VLB

In general, I want a new set of instructions of the following format:

[MATHOP]T, [ARG1,] [ARG2]

Where:

MATHOP is a mathematical operation (examples: add, subtract, negate, move, shift left, shift right, increment, decrement, and, or, exclusive or... [multiply?])
SIZE is a number of bits: one of 8, 16, 32, and 64
RESULT is a SIZE-aligned zero-page address to receive the result
ARG1 is the SIZE-aligned zero-page address of the first argument
ARG2 is the SIZE-aligned zero-page address of the second argument

You are welcome to compress this instruction or represent it as as you like. I'm guessing it's a four-byte instruction, but maybe you can make it three.

If you can't do a four byte instruction, then I want this instead:

[MATHOP] ARG1, ARG2

Where ARG1 stores the first argument, and it is overwritten with the result of the operation. But my first desire is the three operand format as described above.

If you need to trash A, X, or Y, that's okay, just document what you trash.

cjs · Post by **cjs** » Wed Dec 02, 2020 7:46 pm

johnwbyrd wrote:

In general, I want a new set of instructions of the following format:

[MATHOP] RESULT, [ARG1,] [ARG2]
...
SIZE is a number of bits: one of 8, 16, 32, and 64

Perhaps it's just me, but this is feeling really not at all like a 6502 any more. I know there's no bright line between what does and what doesn't have "6502-nature," but (to my mind, anyway) moving 32- and 64-bit arithmetic into the CPU and switching to memory-to-memory operations seems to be well on the far side of it, and also my guess is that such a CPU would be many times the cost to build of the original 6502.

Maybe I'm just totally missing something here, and you could describe what about a processor that works this way is still maintaining something of the spirit or practicalities of the 6502, rather than just being a completely different processor?

What you've described above is also clearly a separate topic from the 65T2, since it (as far as I can tell) has no hope of coming anywhere near complying with the "approximately same transistor budget as the original MOS 6502" constraint, but I would be interested in hearing any ideas you have that would comply with that constraint while improving things for you as a compiler writer.

John West · Post by **John West** » Wed Dec 02, 2020 8:31 pm

cjs wrote:

What you've described above is also clearly a separate topic from the 65T2, since it (as far as I can tell) has no hope of coming anywhere near complying with the "approximately same transistor budget as the original MOS 6502" constraint

Well, yes and no. What modern compilers do is irrelevant to the 65T2, because they target modern processors. Modern processors have lots of registers. Something with the same hardware constraints as the 6502 is going to need an entirely different approach.

But it is relevant to the overall thread topic, which is "A Hypothetical C Friendly 6502". The 65T2 is merely one example that has been offered. Go with "must be 6502ish in some vague undefined way, but you've got as many transistors as you need", and you're into territory that the thread hasn't really explored, but in which it might be reasonable to think about what modern compilers do. They co-evolved with modern processors the way they did for a reason.

Using zero page as a substitute for registers could work well if you have a dedicated cache for it, so you aren't going to memory all the time. Then it's more of a register file that happens to shadow part of memory.

I've mentioned my 65020 project before, and it is certainly intended to be a C-friendly 6502. It doesn't feel much like a 6502, but it's binary compatible, so I'll claim it meets the "vague undefined way" requirement. It takes the "decent number of registers" route.

But the 65T2's constraints are fascinating. With infinite time and fewer competing projects, it's definitely the sort of thing I'd enjoy having a go at.

JimBoyd · Post by **JimBoyd** » Thu Dec 03, 2020 12:53 am

cjs wrote:

Perhaps it's just me, but this is feeling really not at all like a 6502 any more.

You're not the only one. It kind of reminds me of x86 assembly language.

BigEd · Post by **BigEd** » Thu Dec 03, 2020 10:49 am

(Thanks for reappearing, snial! I'd still recommend a new thread, especially as further different avenues have opened up. But also nice to see KDF-9 and Magic-1 referenced! It seems you have the technical background, so might you write an emulator or HDL model of 65T2? Or indeed a C compiler for it?!)

GARTHWILSON · Post by **GARTHWILSON** » Thu Dec 03, 2020 11:25 am

johnwbyrd wrote:

Okay 65xx hardware tinkerers, here is the new instruction type I need to write a better C backend.

Check out this instruction sequence:

; 16 bit Binary Addition
CLC ;Ensure carry is clear
LDA VLA+0 ;Add the two least significant bytes
ADC VLB+0
STA RES+0 ;... and store the result
LDA VLA+1 ;Add the two most significant bytes
ADC VLB+1 ;... and any propagated carry bit
STA RES+1 ;... and store the result

Where all the operations occur on zero page.

I want this whole business to be represented as a single new instruction:

CLC
ADC16 RES, VLA, VLB

In general, I want a new set of instructions of the following format:

[MATHOP]T, [ARG1,] [ARG2]

Where:

MATHOP is a mathematical operation (examples: add, subtract, negate, move, shift left, shift right, increment, decrement, and, or, exclusive or... [multiply?])
SIZE is a number of bits: one of 8, 16, 32, and 64
RESULT is a SIZE-aligned zero-page address to receive the result
ARG1 is the SIZE-aligned zero-page address of the first argument
ARG2 is the SIZE-aligned zero-page address of the second argument

You are welcome to compress this instruction or represent it as as you like. I'm guessing it's a four-byte instruction, but maybe you can make it three.

If you can't do a four byte instruction, then I want this instead:

[MATHOP] ARG1, ARG2

Where ARG1 stores the first argument, and it is overwritten with the result of the operation. But my first desire is the three operand format as described above.

If you need to trash A, X, or Y, that's okay, just document what you trash.

How many bytes and cycles would it take? The '816, set to 16-bit accumulator (which I keep it in nearly full time in my Forth), would do:

Code: Select all

        CLC
        LDA  VLA
        ADC  VLB
        STA  RES

7 bytes, 14 cycles, and the average time that an interrupt would have to wait before starting the interrupt sequence is less than 3 cycles. If the accumulator can be used for the beginning and ending numbers, it's down to 3 bytes and 6 cycles, needing only the CLC and ADC.

Snial · Post by **Snial** » Thu Dec 03, 2020 1:18 pm

BigEd wrote:

(Thanks for reappearing, snial! I'd still recommend a new thread, especially as further different avenues have opened up. But also nice to see KDF-9 and Magic-1 referenced! It seems you have the technical background, so might you write an emulator or HDL model of 65T2? Or indeed a C compiler for it?!)

Happy to be part of the community! All of these things are interesting. I hadn't thought of taking the 65T2 further until this post appeared, but I'm happy to start a new thread, in the Hardware Section?

I have a little HDL experience - my MPhil project was written in Verilog, it was a 16-bit RISC cpu, so it doesn't directly transfer to the 65T2. I've also done some work on a 'C' compiler called ItsyC (cf ItsyForth), which is intended to be a direct code compiler with a relative simple and ideally portable CG interface; along with a novel approach to translating expressions into code; and the use of syntactic data structures (incorporating callbacks) with a generalised Grammar Code Interpreter instead of explicit code for parsing 'C'. In this sense it's different to TCC.

I wrote an expression translator and datatype manager before getting diverted, but I could push the code to GitHub / GitLab and see if anyone wants to take it further. One of the primary goals would be for the compiler to fit within the constraints of a 16-bit machine, so that it could be self-hosted on relatively modest systems. The code quality is intended to be 'tolerable'.

BigEd · Post by **BigEd** » Thu Dec 03, 2020 1:27 pm

Great - Programmable Logic is often used for new core designs, because that's the best endpoint for implementation. But I don't think it's too critical.

Snial · Post by **Snial** » Thu Dec 03, 2020 10:41 pm

Druzyek wrote:

Think about something like this: while (*ptr) func(*ptr++);

How might a 65T2 C compiler handle this? It's likely it wouldn't use the Y register.

Code: Select all

:Foo ;ptr is at s+3:s+2
    Bra Foo1
:Foo2
    Lda *s+2;
    Jsr Func
    Inc s+2
    Bne Foo3
    Inc s+3
:Foo3
:Foo1
    Lda *s+2
    Bne Foo2
    Rts

This sequence takes 5+5+(7+5)+5+3+5/256+5+3 = 38+Func cycles /byte. With the y reg optimisation:

Code: Select all

:Foo ;ptr is at s+3:s+2
    Ldy #0
    Bra Foo1
:Foo2
    Jsr Func
    Iny
    Bne Foo3
    Inc s+3
:Foo3
:Foo1
    Lda *s+2,y
    Bne Foo2
    Rts

So we save the extra lda (5c) and inc is replaced by iny saving 3c. => 30 +Func cycles / byte. So optimised version is about 27% faster. So, the compiled version is tolerable IMHO.

A Hypothetical C Friendly 6502

Re: A Hypothetical C Friendly 6502

Re: A Hypothetical C Friendly 6502

Re: A Hypothetical C Friendly 6502

Re: A Hypothetical C Friendly 6502

Re: A Hypothetical C Friendly 6502

Re: A Hypothetical C Friendly 6502

Re: A Hypothetical C Friendly 6502

Re: A Hypothetical C Friendly 6502

Re: A Hypothetical C Friendly 6502

Re: A Hypothetical C Friendly 6502

Re: A Hypothetical C Friendly 6502

Re: A Hypothetical C Friendly 6502

Re: A Hypothetical C Friendly 6502

Re: A Hypothetical C Friendly 6502

Re: A Hypothetical C Friendly 6502