RISCY-V02

mysterymath · Post by **mysterymath** » Tue Mar 18, 2025 11:04 pm

Dr Jefyll wrote:

If the RISCY-V02 used an approach that broke each computation into multiple 8-bit chunks then, as well as 16-bit, perhaps it could also (with a slight penalty) execute some 24- or 32-bit operations!

Aha, you discovered my clever trick, but in reverse. The ALU is already 8-bit; I'm taking advantage of your observation to keep the ALU exactly the same width as the one in the 65c02. That's what gives the project a prayer of fitting within the same footprint; a 16-bit ALU is almost certainly too big. (I can't remember whether I ever actually ran the numbers, but a 16-bit barrel shifter is almost certainly a no-go, since an 8-bit one barely fits.)
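To make the chunking idea concrete, here is a minimal Python sketch of a wide addition built from narrow ALU passes. It is an illustration only, not the actual RISCY-V02 datapath; the function name `add_chunked` and its parameters are made up for this example.

```python
def add_chunked(a: int, b: int, width: int = 16, chunk: int = 8):
    """Add two width-bit values using only chunk-bit additions.

    Returns (result, carry_out, passes), where passes is the number of
    narrow ALU operations that were needed.
    """
    mask = (1 << chunk) - 1
    result = 0
    carry = 0
    passes = 0
    for shift in range(0, width, chunk):
        # One ALU pass: add one chunk of each operand plus the carry in.
        total = ((a >> shift) & mask) + ((b >> shift) & mask) + carry
        result |= (total & mask) << shift
        carry = total >> chunk  # carry out feeds the next pass
        passes += 1
    return result, carry, passes
```

A 16-bit add takes two passes through an 8-bit ALU, and the same loop covers 24- or 32-bit operands with three or four passes, which is the "slight penalty" mentioned in the quote.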

Using a 16-bit bus wouldn't actually make the processor any faster without making it bigger. But, a plain-jane 16-bit implementation would also be quite straightforward if one wants merely a very small processor, not a 65c02-sized one. That would introduce the question of misaligned accesses, but that's not that big of a deal; you can just make the processor trap and implement it as a trap handler.

I left this out for brevity, but the ISA also uses a corner of the instruction space for interrupt handling: control status registers (CSRs) and instructions. I didn't actually finish this, so I have much less confidence that it works/fits as is.

The interrupt instructions are:
- BRK (hardware trap)
- RETI (return from interrupt)
- CSRR (read CSR to reg)
- CSRW (write reg to CSR)
- SEI (manipulate CSRs to enable/disable interrupts)
- STP (stop!)
- WAI (wait for interrupt)

The CSRs are:
- IE: whether interrupts are enabled
- PIE: whether interrupts were enabled before the current interrupt.
- EPC: Interrupt return PC
- CAUSE: Whether BRK or interrupt. (NMI has a different vector)
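As a rough illustration of how these four CSRs could interact, here is a small Python model. The entry and return sequences are assumptions borrowed from the usual RISC-V style (save IE into PIE, clear IE, record the return PC), not a description of the finished design; `Csrs`, `take_interrupt`, `reti`, and the vector argument are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Csrs:
    ie: bool = False   # IE: interrupts currently enabled
    pie: bool = False  # PIE: value of IE before the current interrupt
    epc: int = 0       # EPC: PC to return to
    cause: str = ""    # CAUSE: "BRK" or "IRQ"


def take_interrupt(csrs: Csrs, pc: int, cause: str, vector: int) -> int:
    """Assumed entry sequence; returns the new PC (the handler vector)."""
    csrs.pie = csrs.ie   # remember whether interrupts were enabled
    csrs.ie = False      # mask further interrupts while in the handler
    csrs.epc = pc        # remember where to resume
    csrs.cause = cause
    return vector


def reti(csrs: Csrs) -> int:
    """Assumed RETI behaviour: restore IE and return the saved PC."""
    csrs.ie = csrs.pie
    return csrs.epc
```

Under these assumptions, entry and return touch only the CSRs, so no memory traffic is needed for a non-nested interrupt.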

mysterymath · Post by **mysterymath** » Tue Mar 18, 2025 11:16 pm

As an aside, I feel like a pipelined 4 cycle 32-bit RISC-V implementation with roughly the 6502 8-bit bus would be a lot of fun to build using this approach. Maybe some other time. (Or some other person.)

mysterymath · Post by **mysterymath** » Tue Mar 18, 2025 11:27 pm

I also apparently was out of date with my notes compared to my actual implementation doc. It looks like I was able to fit these additional instructions:
- J: Jump (JAL, but leave link register alone)
- JR: Jump register (JALR, but leave link register alone, and with a wider offset)

acharles · Post by **acharles** » Sun Aug 10, 2025 6:51 am

mysterymath wrote:

As an aside, I feel like a pipelined 4 cycle 32-bit RISC-V implementation with roughly the 6502 8-bit bus would be a lot of fun to build using this approach. Maybe some other time. (Or some other person.)

There is tinyQV, which almost does that, but at 8 cycles (4 bits per cycle); it has been taped out on TT06. It uses QSPI PSRAM/flash for code/memory on Tiny Tapeout.

gfoot · Post by **gfoot** » Sun Aug 10, 2025 5:33 pm

A very interesting project! Any updates on its progress?

mysterymath wrote:

I left this out for brevity, but the ISA also uses a corner of the instruction space for interrupt handling: control status registers (CSRs) and instructions. I didn't actually finish this, so I have much less confidence that it works/fits as is.

Speaking of interrupts then - do you swap to a different register file during interrupts, like ARM does? If not, I'm curious how you go about writing an IRQ handler that doesn't corrupt any state. e.g. maybe the OS has its own stack somewhere, that it can save registers on - but with the load-store architecture and no dedicated stack instructions, in order to make use of it, it feels like it's going to have to overwrite some registers before it gets a chance to save them.

Quote:

- CAUSE: Whether BRK or interrupt. (NMI has a different vector)

Why not split BRK onto a separate vector as well? It feels like it wouldn't cost much silicon to do that, and it removes the need for this extra state.

mysterymath · Post by **mysterymath** » Fri Feb 13, 2026 3:18 am

gfoot wrote:

A very interesting project! Any updates on its progress?

I'm still at it! I hadn't done much work on it in a while; dealing with Verilog is *very* tedious, and I couldn't get the tests working. Thankfully, Claude is rather good at Verilog, which has allowed me to try a lot of ideas very quickly, despite not having very much time for the project. I'm actually targeting Tiny Tapeout now, and I've been able to more-or-less fit my design in roughly the area that a 6502 Verilog core would take when synthesized with the same tools for Tiny Tapeout. (With some adjustments for not being able to actually use a register SRAM with this tooling.) I'm still trying to get it down in size for vanity's sake. But it passes the tests! I do want to write a CPU fuzzer, but the deadline is fast approaching!

gfoot wrote:

Speaking of interrupts then - do you swap to a different register file during interrupts, like ARM does? If not, I'm curious how you go about writing an IRQ handler that doesn't corrupt any state. e.g. maybe the OS has its own stack somewhere, that it can save registers on - but with the load-store architecture and no dedicated stack instructions, in order to make use of it, it feels like it's going to have to overwrite some registers before it gets a chance to save them.

My design has an "EPC" register: an architectural register that stores the PC of the interrupted routine. You can push a register using a pre-increment store word on R7, the stack pointer, and you can pop a word using a post-increment load word instruction similarly. Neither actually requires any scratch registers. For nested interrupts, you can move EPC to a regular register, then stack that too. This is the standard RISC "link register" technique extended to interrupts, as done on ARM, RISC-V, etc. But yeah, on most systems, including the 6502, interrupted and interrupting code share the same stack pointer. The stack is usually the mechanism used to gain interrupt reentrancy, and we make stack operations relatively easy and fast.
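A minimal Python sketch of the push/pop idea follows, to show why no scratch register is needed: the address update and the memory access happen in one instruction on R7. The growth direction and byte order are assumptions made for this sketch (R7 moves down by two on a push and back up on a pop, little-endian words); flip the signs if the stack grows the other way. `Machine`, `push`, and `pop` are hypothetical names.

```python
class Machine:
    def __init__(self, sp: int = 0x0200):
        self.mem = bytearray(0x10000)  # 64 KiB byte-addressed memory
        self.regs = [0] * 8            # eight 16-bit registers
        self.regs[7] = sp              # R7 is the stack pointer

    def push(self, reg: int) -> None:
        # Adjust R7 first, then store the 16-bit word at the new address.
        self.regs[7] = (self.regs[7] - 2) & 0xFFFF
        sp = self.regs[7]
        value = self.regs[reg] & 0xFFFF
        self.mem[sp] = value & 0xFF                # low byte (assumed order)
        self.mem[(sp + 1) & 0xFFFF] = value >> 8   # high byte

    def pop(self, reg: int) -> None:
        # Load the word at R7, then adjust R7 afterwards.
        sp = self.regs[7]
        self.regs[reg] = self.mem[sp] | (self.mem[(sp + 1) & 0xFFFF] << 8)
        self.regs[7] = (sp + 2) & 0xFFFF
```

An interrupt handler can therefore save every register it is about to use before modifying any of them, and restore them in reverse order before returning.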

The nice thing is that for non-nested interrupts, no memory operations are actually needed to take one. The CPU just finishes the current instruction and immediately begins fetching from 0x0006 (the interrupt handler locations are fixed). The 4 handlers for RESET, NMI, BRK, and IRQ are arranged such that the IRQ handler can continue executing normally, while the others must issue a 2-byte jump instruction to their actual locations. Such instructions can jump up to 1023 bytes forward in 4 cycles; if necessary, a 2-instruction sequence can jump anywhere in the full 64KiB in 6 cycles.
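The dispatch cost described above can be sketched as a small Python function. The vector addresses match the vector table quoted later in this thread; measuring the 1023-byte reach from the vector address itself is an assumption (the real encoding may be relative to the following instruction), and `dispatch_cycles` is a hypothetical helper, not something from the manual.

```python
# Fixed handler entry points, two bytes apart, with IRQ last.
VECTORS = {"RESET": 0x0000, "NMI": 0x0002, "BRK": 0x0004, "IRQ": 0x0006}

SHORT_JUMP_REACH = 1023  # bytes forward reachable by one 2-byte jump


def dispatch_cycles(vector: str, handler: int) -> int:
    """Extra cycles spent getting from the vector slot to the handler body."""
    if vector == "IRQ":
        # IRQ is the last slot, so its handler code can start right there.
        return 0
    distance = handler - VECTORS[vector]
    if 0 <= distance <= SHORT_JUMP_REACH:
        return 4  # single short forward jump
    return 6      # two-instruction sequence that reaches anywhere in 64 KiB
```

For example, an NMI handler placed at 0x0100 would cost 4 extra cycles, while one at 0x8000 would cost 6.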

gfoot wrote:

Why not split BRK onto a separate vector as well - it feels like it wouldn't cost much silicon to do that, and removes the need for this extra state.

Yep, did that!

mysterymath · Post by **mysterymath** » Sun Mar 01, 2026 4:40 am

Well, after much vibe codery, Claude and I have managed to get a design ready for Tiny Tapeout!
https://github.com/mysterymath/riscyv02

You can see a cool 3d view of the die here: https://mysterymath.github.io/riscyv02

Highlights:

- 8x 16-bit general-purpose registers (vs 3x 8-bit on 6502)
- 2-stage pipeline (Fetch/Execute) with speculative fetch
- 61 fixed 16-bit instructions
- 2-cycle interrupt entry (vs 7 on 6502)
- 1.0-2.6x faster than 6502 across common routines
- 13,280 SRAM-adjusted transistors (vs 13,176 for 6502 on same process)
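
For a sense of scale, the headline figures can be turned into ratios with a few lines of Python. The numbers are the ones quoted in the list; only the variable names are made up here.

```python
riscy_transistors = 13_280    # SRAM-adjusted count for RISCY-V02
mos6502_transistors = 13_176  # 6502 core on the same process

# Relative size difference, in percent.
area_overhead_pct = (
    100 * (riscy_transistors - mos6502_transistors) / mos6502_transistors
)

# Interrupt entry: 7 cycles on the 6502 versus 2 here.
interrupt_entry_speedup = 7 / 2

# General-purpose register storage, in bits.
register_bits_riscy = 8 * 16
register_bits_6502 = 3 * 8
```

That works out to a bit under 1% more transistors, a 3.5x faster interrupt entry, and 128 versus 24 bits of general-purpose register storage.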

Manual here: https://github.com/mysterymath/riscyv02 ... cs/info.md

BigEd · Post by **BigEd** » Sun Mar 01, 2026 8:41 am

Splendid - love the 3D view!

(To some extent the 6502 is jointly transistor-limited and also interconnect-limited - one layer of metal, needs to carry power and clocks and some proportion of signals. But still, an interesting project and an interesting comparison!)

Edit: but I take your point that we've learned a great deal about CPU design and the meaning of performance since then, and all of us have access to lots of information that wasn't available to the 6502 team. And tools too, including computers, so we can simulate our ideas and try them out.

GARTHWILSON · Post by **GARTHWILSON** » Sun Mar 01, 2026 10:38 pm

mysterymath wrote:

You can see a cool 3d view of the die here: https://mysterymath.github.io/riscyv02

This is all I get:

[Attachment: tinytapeout.gif]

and the rest of the screen is blank, and nothing is clickable.

Martin_H · Post by **Martin_H** » Mon Mar 02, 2026 12:46 am

I took a look at your documents and it seems more like a modernized 1802 than a 6502. The 1802 had sixteen 16-bit registers used for call and return, indexing memory, and interrupt handling. The 1802's major design issue was the serial ALU and requirement to use a single 8-bit D register as the ALU input. The ALU bottleneck was somewhat alleviated because registers could increment, decrement, and automatically do either after a load or store from D.

As for the 6502's addressing modes, they're a design holdover from '60s minicomputers, which used main memory rather than registers to address memory, usually because registers were expensive and memory was cheaper. The PDP-8 is the poster child for this approach; it referred to page zero as registers.

Minicomputer designers moved away from that with the DG Nova and PDP-11. But the PDP-11's transistor count was around 120,000, a figure that microprocessors like the 68030 didn't beat until the late '80s.

BigEd · Post by **BigEd** » Mon Mar 02, 2026 10:00 am

Although it's interesting to compare with the 1802, which was a rare case of an early CPU with a good size of register file, I'd be wary of saying this new thing looks more like this old thing rather than that old thing.

I'm working my way through the splendid documentation for this project. I think an interesting question for me is how does it feel to program this CPU in assembly. Does it feel easy? Does it feel familiar? (For me, both for ARM and for OPC, it did feel easy and familiar, and that gives some sense of similarity to 6502.)

I recommend a look at the code comparison doc:
https://github.com/mysterymath/riscyv02 ... parison.md

The other aspect which is highly relevant here is the socket compatibility, or near-compatibility. Here's a CPU with RDY and SYNC, which accesses a 64k byte address space with a byte-size databus and a familiar relationship of input clock with bus cycles, and with predictable and somewhat familiar cycle counts for each instruction.

(Edit: and so, an in-socket replacement into an existing 6502 system should be possible, whether with an FPGA dev board or a Tiny Tapeout custom chip. One need only write the code for a ROM!)

mysterymath · Post by **mysterymath** » Mon Mar 02, 2026 11:26 pm

BigEd wrote:

(To some extent the 6502 is jointly transistor-limited and also interconnect-limited - one layer of metal, needs to carry power and clocks and some proportion of signals. But still, an interesting project and an interesting comparison!)

You're the second person to point this out; it's a valid concern!

Ideally I'd do this up on actual 70s NMOS... but I have no earthly idea how to do that. Even if I had e.g. Claude whip something up; how could I possibly validate it?

Sans that, best I could figure to do was to take an off the shelf Verilog 6502 core (I used Arlet's) that passes the Dormann suite and run it through the same process with the same interface. So, I did that, and at least on a modern process, the areas and max frequency end up quite comparable. (With some fudging of the numbers due to lack of a real SRAM for the register file.) Now, that's with 5 funky metal routing layers, so who knows. Still, the actual design of the CPU on the inside looks a bit like a demented Z80 or 8080 with no microcode, so it seems vaguely plausible that someone could have gotten this working within some percentage of the area of a 6502. Not in the ~months they had to design it of course, and at the time, RISC was just a twinkle in the eye.

EDIT: I'm working on this again for SKY130 at the following link. Note that I've updated the docs to be easier to read, and will continue to do so here: https://github.com/mysterymath/riscyv02-sky

BigEd · Post by **BigEd** » Tue Mar 03, 2026 8:06 am

Nice idea for comparison.

It might be worth looking at the 6530 RRIOT die, compared to the 6502 die, to see what the size of that 16x8 register file might be. Of course, a 2R1W is going to be a bit bigger than a simple SRAM.
http://retro.hansotten.nl/6502-sbc/6530 ... dissected/

The Z80 has, of course, a register file too, 24 bytes total. The Z80 as a whole is about 4x the area of the 6502, IIRC, but you can see the register file in a die shot.
https://www.righto.com/2014/10/how-z80s ... -down.html

mysterymath · Post by **mysterymath** » Tue Mar 03, 2026 7:00 pm

BigEd wrote:

It might be worth looking at the 6530 RRIOT die, compared to the 6502 die, to see what the size of that 16x8 register file might be. Of course, a 2R1W is going to be a bit bigger than a simple SRAM.
http://retro.hansotten.nl/6502-sbc/6530 ... dissected/

The Z80 has, of course, a register file too, 24 bytes total. The Z80 as a whole is about 4x the area of the 6502, IIRC, but you can see the register file in a die shot.
https://www.righto.com/2014/10/how-z80s ... -down.html

Hah, 128 bytes is just a bit bigger than RISCY-V02's 8x16-bit file (so, 16 bytes total). But good point on the Z80, I might be able to do some sleuthing from its die.
The design also has a few registers kept out of the file: 2 bits of status (interrupt disable and a general-purpose T flag), 2 bits of interrupt status shadow, and a 15-bit interrupt PC shadow. But those are just done in DFFs, as their accesses are pretty irregular.

EDIT: Ah, you probably meant I could size it down based on the percentage used. Def.
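
The storage figures above can be tallied as follows. This is just arithmetic on the numbers in this post, with made-up variable names; it says nothing about how the bits are laid out on the die.

```python
# Register file: eight 16-bit general-purpose registers.
REGISTER_FILE_BITS = 8 * 16

# State kept outside the file, implemented as plain flip-flops (DFFs).
dff_fields = {
    "status (I and T flags)": 2,
    "interrupt status shadow": 2,
    "interrupt PC shadow": 15,
}
dff_bits = sum(dff_fields.values())

total_state_bits = REGISTER_FILE_BITS + dff_bits
```

So the file holds 128 bits (16 bytes), with another 19 bits of irregularly accessed state in DFFs, for 147 bits of architectural state in total.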

jgharston · Post by **jgharston** » Wed Mar 04, 2026 12:20 am

Vector table (2-byte spacing; IRQ last for inline handler):

| Vector ID | Address | Trigger |
|---|---|---|
| RESET | $0000 | RESB rising edge |
| NMI | $0002 | NMIB falling edge, non-maskable |
| BRK | $0004 | BRK instruction, unconditional |
| IRQ | $0006 | IRQB low, level-sensitive, masked by I=1 |

It's more useful to have the NMI last so it can be an inline handler. NMIs are more often needed to get away very very quickly without the delays of bouncing off somewhere.

ADD    Add              rd = rs1 + rs2
ADDI   Add immediate    rd += sext(imm8)

I don't like addressing modes being specified in the instruction; I always find it painful trying to read MASM code. It's much clearer specified in the operands.

CLI    Clear interrupt disable    I = 0
SEI    Set interrupt disable      I = 1

It would be simpler if I=0 disabled interrupts, as then on RESET you zero everything, instead of zeroing everything except the I flag, which is set.
