RISCY-V02

Posted: Sat Mar 01, 2025 4:56 am
by mysterymath
I was thinking of restarting a stalled project, and I wanted to consult the 6502 expert oracle about it.

Here was my question: was the 6502 a "historically optimal" design? A lot of people think that it's impossible to have done better at the same size, regardless of what we've learned about how to design processors.

I believe in the idea of progress, so I wanted to question this notion. So, I set out to design as modern a RISC-V-alike CPU as could be made to fit in the footprint of a 6502. I'm calling it the "RISCY-V02" project. (Pronounced "risky five oh two".)

Rules of the game:
- Roughly the same bus as a 6502
- Roughly the same transistor count as a CMOS 65C02 (I'm not going to try to design an NMOS chip, and no one can manufacture one anyway.)
- No extreme regressions on any of code density, interrupt latency, or performance. Making different tradeoffs is fine.
- A very simple target for a compiler, but it should still be possible to write by hand, like modern ARM Thumb2 code.

I spent a lot of time designing something, and mostly completed something really cool. I got bored with actually finishing things out: testing, bug fixing, etc. The basic trick is the RISC one: Replace things like BCD, microcode, decode PLAs, etc., with things like register files, pipelines, and barrel shifters.

My question is this: If I can actually finish the physical design, will I have succeeded in my goals? It seems to me that what I came up with outclasses the 6502 considerably, but it may just be a matter of taste on my part. It would help a lot to know how 6502 fans feel about the usability or lack thereof. I'll follow with a description of the instruction set for y'all to poke holes in.

One last thing: The *way* you program in a modern RISC is dramatically different than how you program a CISC like the 6502. I may have to point out subtleties in how the ISA works. The point wasn't to make the most 6502-like processor; for that you want a 6502 ;)

Here's what I came up with:
- LOL it's a 16-bit CPU.
- 8x 16-bit general purpose registers (including a 16-bit stack pointer)
- All instructions are 16 bits of code.
- Most instructions are two cycles. All arithmetic is 16-bit.
- Loads and stores take 3 cycles for bytes and 4 cycles for 16-bit words.
- 2 cycle variable bit shifts
- Easy/modern position independent code
- Backward branches are predicted taken; forward branches not taken. Predicted branches are 2 cycles; mispredicted, 4.
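
The static prediction rule in the last bullet can be written down as a tiny cost model. This is only a sketch of the rule as stated above, not of the actual fetch unit; the function names are made up for illustration, and treating a zero offset as "forward" is an arbitrary assumption.

```python
def predicted_taken(offset: int) -> bool:
    """Static rule from the list above: backward branches are predicted taken.

    A negative offset counts as backward. Treating offset 0 as forward is an
    assumption; the thread does not say how that case is handled.
    """
    return offset < 0


def branch_cycles(offset: int, taken: bool) -> int:
    """2 cycles when the static guess matches the outcome, 4 when it does not."""
    return 2 if predicted_taken(offset) == taken else 4
```

Under this model a loop-closing branch costs 2 cycles on every iteration except the last, where falling through costs 4.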

The instructions come in broadly 3 encodings:

Code:

- 9-bit immediate:
  - LUI reg, imm9: Load Upper Immediate: Load the immediate into the high 9 bits of reg. Used to materialize 16-bit constants.
  - AUIPC reg, imm9: Add Upper Immediate to PC: Add the immediate to the high 9 bits of the PC, store the results in reg. Used in position independent code and far jumps/calls w/ JALR.
  - JAL reg, imm9: Jump and Link: Jump to PC + imm, place previous PC in reg
  - J reg, imm9: Jump: Jump to PC + imm
  - BZ/BNZ reg, imm9: Compare reg to zero or non-zero; branch to PC + imm

- 7-bit immediate:
  - Loads and stores
    - Src/Dst is any register, base is any of the first 4 registers.
    - LB/LBU/LW dst, base, imm: Load signed byte/unsigned byte/word from base + imm. Signed bytes are sign extended; unsigned bytes are zero extended.
    - SB/SW src, base, imm: Store byte/word to base + imm.
  - Arithmetic
    - These modify a register in place:
      - LI (Load Immediate), ADDI, ANDI, ORI, XORI, SRI (Shift Right), SLI (Shift Left)
    - These place the result in register x3, but take an argument register:
      - SLT/SLTU: Set Less Than Signed/Unsigned. (Sets x3 to 1 if the argument register is less than the sign-extended immediate, otherwise sets it to zero.)
      - JALR: Jump and Link Register. Not technically arithmetic. Adds the sign-extended immediate to the given register, jumps to the result, and stores the current PC in x3.
      - JR: Adds the sign-extended immediate to the given register and jumps to the result.
      - XORIA: Alternative XOR for equality testing

- Reg Reg Reg instructions
  - These are all arithmetic; they take 3 registers as arguments. They perform arithmetic on two and place the result in a third. The arguments and destination can freely overlap; all three could be the same register, for example.
  - ADD, SUB, AND, OR, XOR, SLL (Shift Left Logical), SRL (Shift Right Logical), SRA (Shift Right Arithmetic), SLT (Set Less Than), SLTU (Set Less Than Unsigned)
  - Note that the shifts take the amount in a register, and still complete in 2 cycles.

Re: RISCY-V02

Posted: Sat Mar 01, 2025 5:11 am
by mysterymath
Because I figure this will immediately come up:

How do I do this?

Code:

LDA #42, STA 1234?
That's 5 bytes, 6 cycles.

In RISCY-V02:

Code:

LI r1, 42
LUI r2, hi(1234)
SB r1, r2 + lo(1234)
That's 6 bytes, 7 cycles.

But, afterwards, there is a little 128-byte "zero page" centered on r2:

Code:

LI r3, 43
SB r3, r2 + lo(1235)
That's 4 bytes, 5 cycles.
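
For anyone wondering what hi() and lo() have to compute in the snippets above, here is a rough Python sketch. It assumes LUI puts its 9-bit immediate in bits 15..7 and clears the low 7 bits, and that the store offset is a sign-extended 7-bit value (which is what "centered on r2" suggests). Under those assumptions hi() has to round to compensate for a negative lo(), the same way RISC-V's %hi/%lo pair does. The helper names are hypothetical.

```python
def lo(addr: int) -> int:
    """Low 7 bits of addr, read as a signed offset in -64..63 (assumed)."""
    low = addr & 0x7F
    return low - 0x80 if low >= 0x40 else low


def hi(addr: int) -> int:
    """9-bit value for LUI, adjusted so (hi << 7) + lo gives addr back."""
    return ((addr - lo(addr)) >> 7) & 0x1FF


def lui(imm9: int) -> int:
    """Assumed LUI behavior: imm9 lands in bits 15..7, low 7 bits are zero."""
    return (imm9 & 0x1FF) << 7


base = lui(hi(1234))               # the value left in r2
print(base, lo(1234), lo(1235))    # prints: 1280 -46 -45
```

Both 1234 and 1235 share the same hi() value, which is why the second store can reuse r2 without another LUI.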

And here's a simple memory copy routine. (As in the modern style, it's possible to do better with word loads and stores, but it's more complex.)

Code:

memcpy:
; r1 = dst, r2 = src, r3 = amt
loop:
LBU r4, r2 + 0
SB r4, r1 + 0
ADDI r1, 1
ADDI r2, 1
ADDI r3, -1
BNZ r3, loop
12 bytes, 14 cycles per iteration (16 on the last due to branch misprediction)
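
The per-iteration figures can be turned into a total for an n-byte copy. The sketch below only uses the cycle costs quoted earlier in the thread (3 for byte loads and stores, 2 for ADDI, 2 or 4 for the branch); the table and function names are made up for illustration.

```python
# Cycle costs as quoted earlier in the thread.
CYCLES = {"LBU": 3, "SB": 3, "ADDI": 2, "BNZ_predicted": 2, "BNZ_mispredicted": 4}

# The loop body above, minus the closing branch.
LOOP_BODY = ["LBU", "SB", "ADDI", "ADDI", "ADDI"]
BYTES_PER_INSTRUCTION = 2


def memcpy_cycles(n: int) -> int:
    """Total cycles to copy n bytes (n >= 1) with the byte loop above."""
    if n < 1:
        raise ValueError("the loop always copies at least one byte")
    body = sum(CYCLES[op] for op in LOOP_BODY)
    looping = body + CYCLES["BNZ_predicted"]     # backward branch, taken
    final = body + CYCLES["BNZ_mispredicted"]    # backward branch, falls through
    return (n - 1) * looping + final


code_size = (len(LOOP_BODY) + 1) * BYTES_PER_INSTRUCTION
```

That works out to 14n + 2 cycles for n bytes, with the extra 2 coming from the single misprediction on exit.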

Re: RISCY-V02

Posted: Sat Mar 01, 2025 7:00 am
by johnwbyrd
Seems like a very fun project!

One of the difficult things, in any retro-engineering effort, is making the decision of how historically accurate you would like to be in your work. Do you want physical hardware, or is emulation enough? Is it legal to use modern tools and technologies, or must all your tools be period-correct? All those choices are ultimately creative ones, but they do all clearly influence the quality of the result.

With that in mind, I suggest the following (I think) fascinating article by Jason Sachs, about the physical layout process for the 6502. Particularly interesting is the process for hand-drawing on mylar, with a special type of non-smudging pencil for the job, and Charpentier's description of a "peeling party" where unwanted Rubylith would be removed.

https://www.embeddedrelated.com/showarticle/1453.php

I was kinda hoping to see some VLSI or Verilog, and perhaps an FPGA in a breadboard flashing LEDs or a 16X2 LCD or something, but maybe that's outside your envisioned project scope?

tl;dr: have fun playing, per your own definition of "optimal", which none may steal from you

Re: RISCY-V02

Posted: Sat Mar 01, 2025 1:45 pm
by BigEd
Certainly a fun project!

The following are points of reference, for interest, and might not have any direct bearing on your project:

The smallest RISC-V is about 2k gates, normally reckoned as 8k transistors, so that's not too far adrift. It is, however, a bit-serial machine (always a good way to reduce logic, but at a performance cost.) Otherwise, under 20k gates is reckoned possible, at least by this article.

The smallest synthesisable 6502 is, I think, Arlet's core. See this comparison thread for details of this core and many others. Arlet also implemented a 6502 in four CPLDs which I find very impressive and interesting.

Re: RISCY-V02

Posted: Sat Mar 01, 2025 6:36 pm
by mysterymath
johnwbyrd wrote:
One of the difficult things, in any retro-engineering effort, is making the decision of how historically accurate you would like to be in your work. Do you want physical hardware, or is emulation enough? Is it legal to use modern tools and technologies, or must all your tools be period-correct? All those choices are ultimately creative ones, but they do all clearly influence the quality of the result.
I definitely want an ASIC chip I can hold in my hand at the end of this, and for the project to achieve its goals, I need accurate transistor counts. Accordingly, when/if I restart the project, I'll be using an open source ASIC design flow targeting Tiny Tapeout: https://tinytapeout.com/

Actually answering the question of "would this have been manufacturable in the 1970s" is unfortunately a design challenge beyond my abilities. I just don't know enough about the physical design of NMOS transistors on 1970s processes. The best I can personally do is target a modern CMOS process like Skywater 130nm and use that as a decidedly imperfect proxy for, say, a 1980s CMOS chip. (130nm is circa 2002.) If you squint and replace the pull-up networks with depletion-mode pull-up transistors, then CMOS is a decidedly imperfect proxy for an NMOS chip. So the project at best could supply evidence and suggestion, but never proof.
johnwbyrd wrote:
I was kinda hoping to see some VLSI or Verilog, and perhaps an FPGA in a breadboard flashing LEDs or a 16X2 LCD or something, but maybe that's outside your envisioned project scope?
I do have a private repo full of Verilog, but I made the mistake of not thoroughly unit testing each component, due to how awful writing tests in Verilog is. Now that I've discovered the wonders of cocotb, writing tests is much easier, but the design is so buggy that I need to pick it apart and build up working versions of the modules from the ground up. That, and the Skywater synthesis process supports vastly different Verilog constructs than the open source PDK I was using to get transistor estimates.

I'll open source a repo just as soon as I have something that isn't made of chicken scratch, and once I've figured out my ideal way to license a hardware project.

Re: RISCY-V02

Posted: Wed Mar 05, 2025 4:54 pm
by jgharston
mysterymath wrote:
Here's what I came up with:
- LOL it's a 16-bit CPU.
- 8x 16-bit general purpose registers (including a 16-bit stack pointer)
- All instructions are 16 bits of code.
- Most instructions are two cycles. All arithmetic is 16-bit.
- Loads and stores take 3 cycles for bytes and 4 cycles for 16-bit words.
- 2 cycle variable bit shifts
- Easy/modern position independent code
- Backward branches are predicted taken; forward branches not taken. Predicted branches are 2 cycles; mispredicted, 4.
Looks like you've invented the PDP11 :)

I did something similar with the Z80 when I was at school, and when I discovered the PDP11 realised that I had been well on the way towards a PDP11. ;)

One of these days I'll implement my 32-bit 6502.

Re: RISCY-V02

Posted: Wed Mar 05, 2025 5:27 pm
by mysterymath
jgharston wrote:
Looks like you've invented the PDP11 :)
Kinda, but not really? If you squint at it, the ISA characteristics are similar, but a RISC is a pretty big departure from an orthogonal instruction set. It's decidedly non-orthogonal: there's only one addressing mode, and only dedicated load and store instructions can access memory. I'd also doubt that most PDP-11 implementations are pipelined, but I don't know enough about its implementation details to be confident.

Re: RISCY-V02

Posted: Wed Mar 05, 2025 6:07 pm
by John West
It looks like a fun project. I've considered something like it myself, but always thought the amount of effort required to get something you could compare to a real chip would be too much.

8x16 bit registers and a barrel shifter are the parts that surprise me. That's a lot of transistors! But they're transistors in a nice regular layout that allows a much higher density than the random logic that makes a large chunk of the original 6502.

You have JAL for near calls. It looks like JALR can be used for return as well as far calls. Is that the intention?

Re: RISCY-V02

Posted: Wed Mar 05, 2025 7:13 pm
by mysterymath
John West wrote:
You have JAL for near calls. It looks like JALR can be used for return as well as far calls. Is that the intention?
Yep, this was cribbed from RISC-V. JAL gives a near call, AUIPC/LUI hi + JALR lo gives a far call, and JALR 0 on the "link register" gives a return.

Re: RISCY-V02

Posted: Mon Mar 17, 2025 1:25 pm
by Dr Jefyll
I've been struggling to avoid engaging with this project, as my life has too many distractions already. :lol: But last night I accidentally started drafting a Programmers Reference Card (which for me is the first step in wrapping my head around something like this).
[Attachment: doodle 1.png]
mysterymath, can you please clarify the snippets I've highlighted in color? Is "register x3" simply one of the general registers, but one with special meaning in regard to this instruction group? Also, why the choice to restrict the load/stores to using only a "base" register ("any of the first 4 registers")? Is this to limit that particular bit-field to only two bits so it'll conform to a certain format (as my instruction-format diagram suggests)?

A few other stray questions & observations. What are the capabilities of the 16-bit stack pointer? Is this one of the "base" registers?

Re instruction timing, I find it useful to remind myself that every instruction is 16 bits (2 bytes), and thus takes two cycles to fetch. Presumably this has implications for pipelining. In my diagram I purely guessed at the instruction format, but for pipelining purposes it should be arranged in a way that includes the chewiest bit-fields in the first byte-fetch rather than the second! :)

-- Jeff

ps: AIUI, memory is addressed at byte granularity but all instructions are two bytes in length, which means the Program Counter will always have zero in its lsb. To me that's like catnip... we need to put that bit to work! At the most basic level maybe we could double the code space to 128K. Or, at the risk of abandoning KISS, we could probably get even more clever than that... :twisted: :mrgreen:

Re: RISCY-V02

Posted: Mon Mar 17, 2025 2:23 pm
by barnacle
Dr Jefyll wrote:
every instruction is 16 bits (2 bytes), and thus takes two cycles to fetch.
In my own considerations of 8-bit access to 16-bit opcodes (and e.g. single bytes from text in memory), I came across the surprising, but in retrospect obvious, existence of big memories organised either as sixteen-bit words or as eight bits with a separate high-low pin. I haven't thought it through, but it seems like a simple mechanism to offer eight or sixteen bit loads.

Neil

Re: RISCY-V02

Posted: Mon Mar 17, 2025 3:39 pm
by Dr Jefyll
barnacle wrote:
a simple mechanism to offer eight or sixteen bit loads.
If I'm following correctly, Neil, you're reminding us that a memory that's organized as sixteen bits wide could also do eight-bit accesses... with the advantage that a 16-bit access could occur in just a single cycle.

I feel that too -- The Mad Scientist in me really wishes the bus was 16 bits wide! But in the lead post mysterymath specifies "roughly the same bus as a 6502," and that means 8 bits.

OK, in the big picture that's a serious constraint, because it means we need 2-cycle instruction fetches. But there's a silver lining in the sense that we can probably make good use of that extra cycle to hide latencies such as decoding.

-- Jeff

Re: RISCY-V02

Posted: Mon Mar 17, 2025 6:10 pm
by barnacle
Exactly... but perhaps the OP could be persuaded to change to a 16-bit data bus?

It took me a while to realise that the extra bit is _not_ a0; using the bits from a1 upwards accesses aligned 16-bit words; using a0 (and the mystery control which I can't quite recall at the moment) provides access to the higher or lower byte, but on the 8-bit data port.

So a byte access - read or write - would use a0 and the control bit, and expect/deliver data on the lower eight bits of the bus, while word access - again read or write but most likely, I think, to be an instruction read - would use the full bus width.

So I envisage a similar processor architecture to the OP's but with every memory access using an address register plus an offset register to access either the program code, or address to a stack or a stack frame or wherever (or something along those lines - this is in-the-shower-while-still-asleep thinking). And register zero returns always zero...

Or maybe just take inspiration from Bruce Jacob's Ridiculously Simple processor? https://user.eng.umd.edu/~blj/RiSC/RiSC-isa.pdf

Neil

Re: RISCY-V02

Posted: Mon Mar 17, 2025 7:35 pm
by Dr Jefyll
barnacle wrote:
Bruce Jacob's Ridiculously Simple processor
Also on a somewhat self-deprecating note we have Brad Rodriguez's Pathetic Instruction Set Computer. :) And believe it or not I have a cache of those 74172 4-port (yes four port :shock: ) register file chips if anyone feels moved to start soldering!

Another sixteen-bit machine with only eight instructions is my pal Myron Plichota's Steamer 16, which attracted a lot of attention long ago including that of forumite Sam Falvo aka kc5tja; his version is here. A more recent plaything of Myron's -- a 3-operand RISC -- can be found on anycpu.org, here.

Getting back to RISCY-V02, the additional processing time made available by byte-at-a-time memory access makes me wonder if perhaps most of the ALU logic could be put on a diet and reduced to only 8 bits. Heck, the Z-80 uses a four-bit ALU, and yet it operates on values as wide as 16 bits!

If the RISCY-V02 used an approach that broke each computation into multiple 8-bit chunks then, as well as 16-bit, perhaps it could also (with a slight penalty) execute some 24- or 32-bit operations! But right now I think it's time for me to quit being so darn creative and let mysterymath get a word in edgewise! :wink:
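
To make the chunking idea concrete, here is a minimal Python sketch of a wide add built from repeated 8-bit ALU passes with the carry handed from one pass to the next. It illustrates the arithmetic only and says nothing about how RISCY-V02 or the Z-80 actually sequence their ALUs; the function names are invented.

```python
def add8(a: int, b: int, carry_in: int) -> tuple[int, int]:
    """One pass through an 8-bit adder: returns (sum byte, carry out)."""
    total = (a & 0xFF) + (b & 0xFF) + carry_in
    return total & 0xFF, total >> 8


def add_chunked(a: int, b: int, chunks: int = 2) -> tuple[int, int]:
    """Add two values of chunks * 8 bits, low byte first, rippling the carry."""
    result, carry = 0, 0
    for i in range(chunks):
        byte, carry = add8(a >> (8 * i), b >> (8 * i), carry)
        result |= byte << (8 * i)
    return result, carry
```

With chunks=2 this is a 16-bit add in two passes; chunks=3 or chunks=4 gives the 24- or 32-bit case at the cost of one extra pass per byte.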

-- Jeff

Re: RISCY-V02

Posted: Tue Mar 18, 2025 10:44 pm
by mysterymath
Dr Jefyll wrote:
Is "register x3" simply one of the general registers, but one with special meaning in regard to this instruction group? Also, why the choice to restrict the load/stores to using only a "base" register ("any of the first 4 registers")? Is this to limit that particular bit-field to only two bits so it'll conform to a certain format (as my instruction-format diagram suggests)?
Correct on both counts!
Dr Jefyll wrote:
A few other stray questions & observations. What are the capabilities of the 16-bit stack pointer? Is this one of the "base" registers?
It's conventionally register 0, but there's absolutely nothing special about it. A push is just a regular store followed by an increment or decrement. It's not even necessarily hardware-specified whether the stack grows up or down; AFAICT both are equally efficient in the hardware, so it's a matter of ABI.
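
As a rough model of "a push is just a regular store followed by an increment or decrement", here is a Python sketch. The little-endian word layout, the step of 2 bytes per word, and all of the names are assumptions for illustration; the direction is a parameter because it is an ABI choice rather than a hardware one.

```python
MEM = bytearray(0x10000)  # flat 64K address space


def store_word(addr: int, value: int) -> None:
    """Model of SW. Little-endian byte order is an assumption."""
    MEM[addr & 0xFFFF] = value & 0xFF
    MEM[(addr + 1) & 0xFFFF] = (value >> 8) & 0xFF


def load_word(addr: int) -> int:
    """Model of LW, matching store_word's byte order."""
    return MEM[addr & 0xFFFF] | (MEM[(addr + 1) & 0xFFFF] << 8)


def push(sp: int, value: int, step: int = -2) -> int:
    """Store at sp, then adjust sp (negative step: stack grows down)."""
    store_word(sp, value)
    return (sp + step) & 0xFFFF


def pop(sp: int, step: int = -2) -> tuple[int, int]:
    """Undo the adjustment, then load. Returns (new sp, value)."""
    sp = (sp - step) & 0xFFFF
    return sp, load_word(sp)
```

Passing step=2 instead gives an upward-growing stack with the same two-instruction cost per push.
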
Dr Jefyll wrote:
Re instruction timing, I find it useful to remind myself that every instruction is 16 bits (2 bytes), and thus takes two cycles to fetch. Presumably this has implications for pipelining. In my diagram I purely guessed at the instruction format, but for pipelining purposes it should be arranged in a way that includes the chewiest bit-fields in the first byte-fetch rather than the second! :)
In the implementation I had, IIRC the first cycle didn't actually do all that much work in the fetch unit. The instructions are almost trivial to decode, so the only real work to do is to statically predict the next IP in case of branches. But there's just not going to be a way to do that until the entire instruction has been seen; you need every bit of a branch or jump to make a guess at the next IP. (That also means I need a dedicated 16-bit adder for the fetch unit, which just barely fit.)
Dr Jefyll wrote:
ps: AIUI, memory is addressed at byte granularity but all instructions are two bytes in length, which means the Program Counter will always have zero in its lsb. To me that's like catnip... we need to put that bit to work! At the most basic level maybe we could double the code space to 128K. Or, at the risk of abandoning KISS, we could probably get even more clever than that... :twisted: :mrgreen:
Ewww, Harvard architecture. ;P One way to put that bit to work is to not have it at all. Make the PC 15 bits internally, and save the transistors. Then, shift the PC offsets for branches and jumps by one. That doubles the jump radius for 9-bit immediates (JAL) from 256 bytes to 512 bytes, and the range for 7-bit immediates (JALR) from 64 bytes to 128 bytes.
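
The reach numbers can be checked with a few lines of Python. This sketch assumes the immediates are signed two's complement values, so an n-bit field spans -2^(n-1) to 2^(n-1) - 1 before scaling; the function name is made up.

```python
def reach(bits: int, scale: int = 1) -> tuple[int, int]:
    """Byte range (min, max) covered by a signed immediate of the given width.

    scale=1 models byte offsets; scale=2 models offsets counted in 16-bit
    instructions, i.e. the immediate shifted left by one.
    """
    lowest = -(1 << (bits - 1))
    highest = (1 << (bits - 1)) - 1
    return lowest * scale, highest * scale


for name, bits in (("JAL", 9), ("JALR", 7)):
    print(name, reach(bits), "->", reach(bits, scale=2))
# prints:
# JAL (-256, 255) -> (-512, 510)
# JALR (-64, 63) -> (-128, 126)
```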