Emulating NES CPU and PPU on PIC32, too slow?

Posted: Wed Dec 04, 2024 3:01 am
by sburrow
Hey everyone, long time!

I've been off programming in PIC32-land for a while now. Very much enjoying it! Here's my GitHub page if you are interested: https://github.com/stevenchadburrow/AcolyteHandPICd32

Anyways, I recently had a crazy idea to run a C-based NES emulator on this PIC32. Why use FPGA's when you got something so much cooler, right?

Wrong. I did manage to actually get it 'running' (I copied code from https://github.com/franzflasch/nes_emu and made the necessary modifications). But it's *magnitudes* slower than what I need. I did some cycle counts and it's literally hundreds if not thousands of PIC32 cycles for each single CPU cycle plus its three PPU cycles. The NES CPU runs at 1.79 MHz and the PPU runs 3 times faster, so that's roughly 4 * 1.79 MHz = 7.16 MHz equivalent. My PIC32 is running at a nominal 100 MHz let's say, thus 100 / 7.16 = 13.966, or about 14 PIC32 cycles per emulated NES cycle. [ This is all very rough math. ]

Thus to keep up the hardware emulation, my PIC32 would have to be doing all the work of the CPU and PPU in about 14 clock cycles. That's literally impossible.
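The rough budget above can be checked with a few lines of C. This is only a sketch of the arithmetic in the post: the helper names `host_cycles_per_step` and `print_budget` are made up for illustration, and the "one CPU step plus three PPU steps per CPU cycle" model is the post's own simplification.

```c
#include <stdio.h>

/* Host clock cycles available per emulated step, using the post's rough
   model: each NES CPU cycle costs 1 CPU step + 3 PPU steps = 4 steps.
   (Hypothetical helper, not taken from nes_emu.) */
static double host_cycles_per_step(double host_hz, double nes_cpu_hz)
{
    const double steps_per_cpu_cycle = 4.0;
    return host_hz / (nes_cpu_hz * steps_per_cpu_cycle);
}

/* Prints the budget for a given host clock in MHz, assuming a 1.79 MHz NES CPU. */
static void print_budget(double host_mhz)
{
    printf("%.0f MHz host: %.1f cycles per step\n",
           host_mhz, host_cycles_per_step(host_mhz * 1e6, 1.79e6));
}
```

Calling `print_budget(100.0)` reports about 14 cycles per step, and `print_budget(200.0)` about 28, so doubling the clock alone does not change the picture much.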

My question to you guys: Am I thinking about all of this correctly?

When I first saw it working at all, that was neat. But then it was SO SLOW, barely crawling. This morning I optimized some code and got it twice as fast as previously, but I think I need like 30+ times the speed to make it run smoothly. Thus *magnitudes*. Simple code optimization won't cut it. Yes?

Thanks for any insight.

Chad

Re: Emulating NES CPU and PPU on PIC32, too slow?

Posted: Wed Dec 04, 2024 4:13 am
by GARTHWILSON
In another topic, someone said ten to one is approximately the best you can expect in having a more-powerful microcontroller emulate a 6502.  So far, I'm not remembering good-enough search terms to find the topic.

Re: Emulating NES CPU and PPU on PIC32, too slow?

Posted: Wed Dec 04, 2024 5:10 am
by Yuri
sburrow wrote:
Thus to keep up the hardware emulation, my PIC32 would have to be doing all the work of the CPU and PPU in about 14 clock cycles. That's literally impossible.
Maybe, it isn't always so cut and dry. Especially when you start talking about a more modern processor like the ARM Cortex M0+ (or M23) which have superscalar pipelined architectures.

Though the pipeline is only two instructions deep on those two processors; it is effectively working on two instructions at the same time. If I'm not mistaken the 6502 has sort of a one instruction deep prefetch pipeline, but doesn't actually start decoding the instruction until it is done with the first.

(Maybe someone with more in depth knowledge of the 6502 silicon can chime in on that)

There might be some other tricks to be had (see below)
Quote:
I did manage to actually get it 'running' (I copied code from https://github.com/franzflasch/nes_emu and made the necessary modifications).
The code you copied is in C, which is fine when you have a fast general purpose CPU and wish to keep it easily read/maintained, but not so hot when you are trying to squeeze performance out of a low power embedded CPU like the ARM Cortex M0+ (M23?)

Many of the original NES/SNES emulators from way back in the day as I recall had to hand craft a lot of the emulation in assembly.

There is also the fact that the ARM is a 32-bit CPU and may have some instructions for doing some SIMD type operations allowing it to compute several 8-bit numbers in one call. I'm not super familiar with all of the bits of the ARM instructions that would do that though, so I don't know if the M0+ or M23 have those types of instructions.

(much like RISC-V, ARM has sort of a smorgasbord of options where you can pick and choose which sets of instructions you need for an application. Doesn't look like the M0+ or M23 have many categories selected.)

If SIMD is possible, this would probably require some sort of an optimization path that could examine the 6502 instructions it was about to process and do some reordering/reworking so they could be run in parallel.

Doing SIMD type things might go double for the PPU, having the 32-bit registers handling 4x 8-bit pixel data (or whatever format the NES uses, I don't recall)
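Even without dedicated SIMD instructions, plain 32-bit integer operations can act on four 8-bit lanes at once (sometimes called SWAR, "SIMD within a register"). A minimal sketch, assuming four 8-bit values are packed into one `uint32_t`; `swar_add8` is a hypothetical helper, and whether this actually helps PPU code on a given chip is untested here:

```c
#include <stdint.h>

/* Adds four packed 8-bit lanes at once; each lane wraps modulo 256
   and no carry leaks into the neighbouring lane. */
static uint32_t swar_add8(uint32_t a, uint32_t b)
{
    /* Add the low 7 bits of every lane; carries stop at bit 7 of each lane. */
    uint32_t low7 = (a & 0x7F7F7F7Fu) + (b & 0x7F7F7F7Fu);
    /* Fold in the top bit of each lane without propagating a carry. */
    return low7 ^ ((a ^ b) & 0x80808080u);
}
```

For example, `swar_add8(0x10203040, 0x01020304)` adds 0x10+0x01, 0x20+0x02, 0x30+0x03 and 0x40+0x04 in a single pass.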

That being said, 100MHz might still be too slow for something like that; you'd certainly have your work cut out for you.
Quote:
Why use FPGA's when you got something so much cooler, right?
FPGAs have an advantage in that they can be restructured on the hardware level. It would not be out of the ordinary to have part of the FPGA effectively emulating the 6502 itself, and another part of that same silicon emulating the PPU.


Another thought, looking at some of the details you have listed on the GitHub for your Acolyte design, if you've left that many ports of the PIC open, could you interface to an actual 6502 and off load the instructions to there?

Or maybe use two boards? One could run the 6502 emulation and forward PPU calls to the other. (There is a reason why the NES had the PPU in a different chip after all)

Re: Emulating NES CPU and PPU on PIC32, too slow?

Posted: Wed Dec 04, 2024 8:47 am
by drogon
sburrow wrote:
Thus to keep up the hardware emulation, my PIC32 would have to be doing all the work of the CPU and PPU in about 14 clock cycles. That's literally impossible.
A wise man once said: It's only impossible until it's not.

It sounds like you need to move to assembler rather than C though. Or another architecture, but I can see the advantage of a PIC32 device here.

I also know that a 1GHz ARM device can emulate a 6502 at just under 300MHz which is pretty good, so there ought to be room for much improvement in the PIC32 (MIPS) environment at 100MHz if you take the bold step of re-writing it all in assembler.

-Gordon

Re: Emulating NES CPU and PPU on PIC32, too slow?

Posted: Wed Dec 04, 2024 9:18 am
by BigEd
The best results these days are from JIT approaches - PiTubeDirect has a very fast model which uses this approach. (Edit: but it uses a lot of RAM and runs on a Pi-sized ARM, rather more capable than your average microcontroller)

The extra difficulties with a console emulation are two-fold
- cycle accuracy will be very important
- you have at least three parallel processes to keep in sync: CPU, sound, video. Possibly more.

You will need extreme efficiency, which means you need a lot of ingenuity and diligence. It might be more than you can presently manage to do.

Re: Emulating NES CPU and PPU on PIC32, too slow?

Posted: Wed Dec 04, 2024 9:33 am
by sark02
With modern architectures, where the fast-clocked CPU core is attached to fast cache backed by (relatively) slow memory, it's important to keep as much hot code and data in the caches as you can.

In the PIC32MZ, I'd suggest turning on microMIPS support (where many instructions use 16-bit opcodes rather than 32-bit opcodes) in order to make the best use of the instruction cache, and carefully select the sizes and locations of your data objects to try to keep all the emulated instruction fetch/execute loop data together in tightly packed cache lines.

Using 'C' should be fine for this... the compiler should do a great job at ordering instructions to avoid stalls, and for instruction cache reasons I'd suggest using -Os (optimize for size), rather than, say, -O3 (maaxxximmuuuum speeeeeed).

I'd try putting the whole emulator loop in a single function, to maximize the use of registers and try to restrict load/stores to emulated processor operations.

I think there's a lot of potential in the MIPS, but you have to be aware of the memory hierarchy and how performance collapses as you step down it.

Re: Emulating NES CPU and PPU on PIC32, too slow?

Posted: Wed Dec 04, 2024 10:13 am
by John West
How are you building it? I don't see a makefile or any build instructions in there (I also can't find the CPU simulation in your repository, so this doesn't look like something that other people can download and try, unless I'm missing something obvious).

The compiler should do a decent job of making a fast executable, but you will need to tell it to optimise. Try the various options, including -Os to optimise for size (that can sometimes give better performing code than the options that are actually aiming for performance, because caches are so important).

Other things you can try:
Force inlining of the flag-manipulation functions.
Rearrange the code to keep the most commonly used instructions and addressing modes together. It doesn't matter if EOR (zp, X) gives a cache miss, because no one ever uses it.
It has been more than 20 years since I last used gcc on MIPS, and the version we used then was already old. But enabling strict aliasing may or may not help.
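As an illustration of the forced-inlining suggestion, here is a minimal sketch using GCC's `always_inline` attribute. The helper name `set_nz` and the flag macros are assumptions for the example, not names from nes_emu; the bit positions are the standard 6502 status register layout.

```c
#include <stdint.h>

#define FLAG_Z 0x02u  /* zero flag, bit 1 of the 6502 status register */
#define FLAG_N 0x80u  /* negative flag, bit 7 */

/* always_inline is a GCC attribute: it asks the compiler to inline this
   helper at every call site, even when optimising for size. */
static inline __attribute__((always_inline))
uint8_t set_nz(uint8_t p, uint8_t value)
{
    p &= (uint8_t)~(FLAG_Z | FLAG_N);   /* clear old Z and N */
    if (value == 0)
        p |= FLAG_Z;                    /* result was zero */
    p |= (uint8_t)(value & FLAG_N);     /* N mirrors bit 7 of the result */
    return p;
}
```

An opcode handler would then do something like `p = set_nz(p, a);` after loading the accumulator, with no call overhead.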

Re: Emulating NES CPU and PPU on PIC32, too slow?

Posted: Wed Dec 04, 2024 10:32 am
by sburrow
GARTHWILSON wrote:
someone said ten to one is approximately the best you can expect
That sounds about right because I'm guesstimating about 30-40 times speed needed, and there are 4 'cycles' I'm needing to compute each go-around.
Yuri wrote:
allowing it to compute several 8-bit numbers in one call
Whoa, that would definitely help in speed. But yes I would definitely need to know that stuff inside and out for that type of optimization.
Yuri wrote:
could you interface to an actual 6502 and off load the instructions to there
Yes I was thinking about that. But honestly the 6502 CPU wasn't taking the majority of the code anyways, it was really the PPU hogging most of the software time. I've already thought of ways to make the 6502 part very quick with lookup tables in specific parts in memory, etc. But that darned PPU is a whole different beast!
drogon wrote:
the bold step of re-writing it all in assembler.
Bold indeed! Though I was getting pretty good at PIC24 assembly for a while, I'm not very familiar with the PIC32 assembly. They are very different.
BigEd wrote:
You will need extreme efficiency, which means you need a lot of ingenuity and diligence. It might be more than you can presently manage to do.
I'm thinking so, indeed.
sark02 wrote:
I'd suggest turning on microMIPS support
Good thinking there. If I were to jump into that side of things, that's a good thought.
John West wrote:
I also can't find the CPU simulation in your repository
I didn't put it there because I'm not really 'solid' on it yet. Sorry about that.

****

So, since a lot of y'all were mentioning 6502 optimizations, here's my thoughts on how to do that. We're looking at a lot of hex code, and the next one is an instruction.

$A9 = LDA#

I take that value, bit shift it, and add something like $9D010000 to it, so that it's located in ROM, then jump to that location.

asm("jal $9D010A90");

At that memory location we have instructions specific for that code.

Code:

void __attribute__((address(0x9D010A90))) inst_a9() {
    // grab next byte
    // store in virtual accumulator
    // increment cycle counter
    // return by going back to the loop
    asm("jal $9D008000");
}
Each 6502 instruction would have its own little home in ROM, and it would be extremely optimized. Sure it's a huge waste of space, but who cares when you have 2MB of ROM on the chip. LDA# uses 2 cycles, and this setup would take about 8-10 PIC32 cycles which is very doable. Again, is my ~100 MHz CPU running at least 4-5 times faster than the 6502 would run? Yes.

But when I was looking at cycle counts earlier, the 6502 emulation did not use nearly as much time as the PPU did. And, looking at the PPU code (in C), there is indeed a lot more going on. It is not just "find it, do it, move on". There's a lot of background stuff going on as well.

Just had a thought though, not sure how familiar y'all are with the NES PPU. What if I were to only do stuff with it upon a change to the PPU? As in, the CPU accesses the PPU, and *that* is when I actually update registers, change things around, etc. Obviously the game code is not accessing the PPU all the time, so perhaps I could store everything in a 'state' in RAM, and then change it when it asks for a change. Still, if a particular game accesses the PPU more often, it could slow everything down a lot.
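The "only update on access" idea is often called catch-up or lazy synchronisation. A minimal sketch, assuming NTSC timing (341 dots per scanline, 262 scanlines per frame, 3 PPU dots per CPU cycle); the `LazyPpu` struct and both function names are hypothetical, and real rendering, NMI and sprite-0 timing are deliberately left out:

```c
#include <stdint.h>

#define DOTS_PER_SCANLINE   341u
#define SCANLINES_PER_FRAME 262u   /* NTSC */

typedef struct {
    uint32_t synced_cpu_cycle;  /* CPU cycle the PPU has been emulated up to */
    uint32_t dot;               /* position within the current scanline */
    uint32_t scanline;
    uint32_t frame;
} LazyPpu;

/* Advances the beam position by a number of dots in one go.
   Real pixel output would be done here, in bulk. */
static void ppu_run_dots(LazyPpu *ppu, uint32_t dots)
{
    uint32_t total = ppu->dot + dots;
    uint32_t lines = ppu->scanline + total / DOTS_PER_SCANLINE;
    ppu->dot = total % DOTS_PER_SCANLINE;
    ppu->frame += lines / SCANLINES_PER_FRAME;
    ppu->scanline = lines % SCANLINES_PER_FRAME;
}

/* Call this before the CPU reads or writes a PPU register ($2000-$2007),
   so the PPU state is current at the moment of the access. */
static void ppu_catch_up(LazyPpu *ppu, uint32_t cpu_cycle_now)
{
    uint32_t elapsed = cpu_cycle_now - ppu->synced_cpu_cycle;
    ppu_run_dots(ppu, elapsed * 3u);   /* 3 PPU dots per CPU cycle */
    ppu->synced_cpu_cycle = cpu_cycle_now;
}
```

Between accesses the CPU loop only increments its cycle counter; the cost moves to the register accesses, which is exactly where a game that pokes the PPU often would pay for it.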

In the end, like you all said, it would require a ton of optimizations, assembly code, etc. And even then, it might not be enough with that PPU emulation. I'll be thinking it over some more, but thank you all for the advice. I appreciate all of the responses!

Chad

Re: Emulating NES CPU and PPU on PIC32, too slow?

Posted: Wed Dec 04, 2024 10:45 am
by BigEd
GARTHWILSON wrote:
In another topic, someone said ten to one is approximately the best you can expect in having a more-powerful microcontroller emulate a 6502.  So far, I'm not remembering good-enough search terms to find the topic.
I think that estimate for an upper bound was surfaced (from Usenet) in this thread:
Other threads which have interesting and relevant findings:
Edit: two more:
The Pi Pico is a 133MHz ARM-based microcontroller, very deterministic. Very different from the bigger Pi models. This thread on stardot reports nearly 32MHz 65C02 emulation, using very expertly hand-tuned ARM code.

Re: Emulating NES CPU and PPU on PIC32, too slow?

Posted: Wed Dec 04, 2024 11:19 am
by drogon
sburrow wrote:
So, since a lot of y'all were mentioning 6502 optimizations, here's my thoughts on how to do that. We're looking at a lot of hex code, and the next one is an instruction.

$A9 = LDA#

I take that value, bit shift it, and add something like $9D010000 to it, so that it's located in ROM, then jump to that location.

asm("jal $9D010A90");

At that memory location we have instructions specific for that code.
There are several strategies and some will depend on the underlying architecture - cache, jump efficiency (delayed branch slots/prediction, whatever...)

My own "pet" bytecode system is the bytecode produced by the BCPL compiler (Cintcode). You can treat 6502 (or 65C02) as a bytecode too as each opcode is one byte long.

Form a table of addresses of the handler for each instruction (you only have 256 instructions, so 256*4 = 1024 bytes of lookup table) then you need a fetch, multiply by 4, lookup and jump sequence for each opcode.
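In C, the table-of-handler-addresses approach might look like the sketch below. Only three opcodes are filled in, opcode $00 is treated as "halt" rather than a real BRK, and all names (`Cpu6502`, `op_table`, `run_table`) are invented for the example.

```c
#include <stdint.h>

typedef struct {
    uint8_t  a, x;
    uint16_t pc;
    uint32_t cycles;
    int      halted;
} Cpu6502;

typedef void (*OpHandler)(Cpu6502 *cpu, uint8_t *mem);

/* LDA #imm: load the byte following the opcode into A (2 cycles). */
static void op_lda_imm(Cpu6502 *cpu, uint8_t *mem) { cpu->a = mem[cpu->pc++]; cpu->cycles += 2; }
/* TAX: copy A to X (2 cycles). */
static void op_tax(Cpu6502 *cpu, uint8_t *mem)     { (void)mem; cpu->x = cpu->a; cpu->cycles += 2; }
/* Stand-in for $00 and every opcode not implemented in this sketch. */
static void op_halt(Cpu6502 *cpu, uint8_t *mem)    { (void)mem; cpu->halted = 1; }

static OpHandler op_table[256];   /* 256 pointers = 1024 bytes on a 32-bit MCU */

static void init_table(void)
{
    for (int i = 0; i < 256; i++)
        op_table[i] = op_halt;
    op_table[0xA9] = op_lda_imm;
    op_table[0xAA] = op_tax;
}

/* Fetch, look up, call: the compiler turns the index into the *4 scaling. */
static void run_table(Cpu6502 *cpu, uint8_t *mem)
{
    while (!cpu->halted) {
        uint8_t opcode = mem[cpu->pc++];
        op_table[opcode](cpu, mem);
    }
}
```

Each instruction costs an indirect call and return here, and the CPU state lives in memory behind a pointer, which is the overhead the assembler version avoids.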

This is where doing it all in assembler can be advantageous - you keep most data in CPU registers.

The ARM (ARM32) has an opcode sequence that could have been purpose designed for executing bytecodes:

Code:

    ldrb    r0,[regPC],#1           @ Fetch byte at PC, increment PC
    ldr     pc,[ptrJ, r0, lsl #2]   @ Take byte in r0, << 2, add to ptrJ,
                                    @ fetch that word, transfer into PC.
Here, regPC is an ARM 32-bit register and ptrJ is another ARM register that has the base address of the jump table. This register is 'fixed' for the life of the system which may seem like a waste but it's OK as I don't have another use for it and I don't need to stick to anyone else's ABI in this system.

So that's just 2 instructions to fetch, increment the PC and jump to the handling code.

I've really no idea about MIPS so don't know how this would translate. The best I came up with in my RISC-V implementation is 6 instructions.

What I also do is in-line this code at the end of each opcode handler. So each handler is 2 instructions longer (so the whole is 256 * 2 * 4 = 2048 bytes of 'overhead' in ARM). There is no "JSR", no stacks to handle - the whole thing just looks like one big program. (And in fact the 65816 version of this bytecode interpreter doesn't use any native stack at all)

However it all depends on jump efficiency. My ARM implementation is some 10KB which fits entirely in the instruction cache on the ARM CPU I'm using it on. (Of-course everything else is data)

Modern C compilers can turn switch statements into something quite efficient too, so a big 256 entry switch statement can make it all work - often better than building up a table of functions to call from C as that will (may) involve stack call/return shenanigans. If I had to do it in C and wanted it fast I would not be afraid to experiment with GOTO rather than have the switch in a loop to see if that was any faster.
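One way to try the "GOTO instead of a switch in a loop" idea in C is GCC's labels-as-values extension (computed goto). This is a sketch only: it is not standard C, whether a particular PIC32 toolchain accepts it is an assumption, and the names `Cpu`, `run_threaded` and `DISPATCH` are made up. Opcode $00 is treated as "halt" rather than a real BRK, and only a few opcodes are implemented.

```c
#include <stdint.h>

typedef struct { uint8_t a, x; uint16_t pc; uint32_t cycles; } Cpu;

/* Runs until opcode $00 or an unimplemented opcode is fetched. */
static void run_threaded(Cpu *cpu, const uint8_t *mem)
{
    static void *handler[256];     /* label addresses, filled on first call */
    static int ready;

    /* Keep the hot CPU state in locals so the compiler can use registers. */
    uint8_t  a = cpu->a, x = cpu->x;
    uint16_t pc = cpu->pc;
    uint32_t cycles = cpu->cycles;

    if (!ready) {
        for (int i = 0; i < 256; i++)
            handler[i] = &&op_stop;
        handler[0xA9] = &&op_lda_imm;
        handler[0xAA] = &&op_tax;
        handler[0xE8] = &&op_inx;
        ready = 1;
    }

/* Fetch the next opcode and jump straight to its handler. */
#define DISPATCH() goto *handler[mem[pc++]]

    DISPATCH();

op_lda_imm: a = mem[pc++]; cycles += 2; DISPATCH();   /* LDA #imm */
op_tax:     x = a;         cycles += 2; DISPATCH();   /* TAX */
op_inx:     x++;           cycles += 2; DISPATCH();   /* INX */

op_stop:    /* $00 and anything not implemented: write state back and leave */
    cpu->a = a; cpu->x = x; cpu->pc = pc; cpu->cycles = cycles;
#undef DISPATCH
}
```

Every handler ends with its own copy of the fetch-and-jump, so there is no central loop and no call/return, which mirrors the in-lined dispatch described for the assembler version.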

Of-course that's just the 6502. Alternating between 6502 and PPU instructions is more a challenge. Are there any multi-core PIC32 MCUs?

-Gordon

Re: Emulating NES CPU and PPU on PIC32, too slow?

Posted: Wed Dec 04, 2024 3:52 pm
by sburrow
What I'm getting from y'all (overall) is that "yes, it is theoretically possible". Seeing the Pi Pico do that makes me think that, yes, it is possible, but only possible.

Can I muse with you for a minute? Say I start re-programming my own emulator (as I already am in my brain), 6502 CPU and PPU alike from the ground up. And I absolutely maximize the performance, however I can. What if, what if it still doesn't work? What if it is still too slow? What if I had to make too many shortcuts to where it wasn't feasible past a single game or two? When do I gamble and start a huge project, all with the possibility of failure not by software but by hardware limitations?

At what point do I take that gamble?

I came here sure that y'all would say, "Chad, you're crazy, there's no way it's fast enough for that!" But I got an entirely different response. So, where do I go from here? Do I say, "Eh, the probability of failure is too high, I give up now." or "Surely there is a way!" I didn't come this far to 'give up', but a wise man should also see real limitations.

Right?

Thank you all, for the encouragement and wisdom.

Chad

Re: Emulating NES CPU and PPU on PIC32, too slow?

Posted: Wed Dec 04, 2024 4:16 pm
by BigEd
I think you should persevere, at least in some way, towards
- a working 6502 system emulation
- a performant 6502 system emulation
- a sophisticated 6502 system emulation (adding sound, video, maybe sprites)
- an accurate 6502 system emulation.

Which is to say, NES is very ambitious, even for a desktop application, and if you haven't previously worked on and studied emulator tactics and organisations, a good way to proceed is to start simple.

Re: Emulating NES CPU and PPU on PIC32, too slow?

Posted: Thu Dec 05, 2024 1:13 am
by resman
Many moons ago I wrote an Apple II emulator for the PalmPilot PDA. Original site: https://palmapple.sourceforge.net/ , archived site with source: https://github.com/dschmenk/Appalm/tree/master. Basically a 68000 with an 8 bit data bus. At 16 MHz. So many concessions had to be made to get anywhere near emulating a 1 MHz 6502. I (re)wrote the 6502 emulation in 68000 assembly, keeping most state in registers. Handling video was a break from traditional emulation - instead of generating cycle accurate video in lock step with CPU emulation, I left the video generation up to a timed framerate. This would break certain advanced techniques, but the vast majority of games worked fine, albeit a little jerky on slow PalmPilots. Faster PalmPilots came on the market and the emulator ran quite nicely and the video framerate could be increased to look much smoother. I also didn't start this project from scratch - it had already been written, in C, based on an emulator for the PC. It sort-of worked, but took about an hour to boot to BASIC. I just went in and replaced components piecemeal. By the time I was done, only a couple of routines for floppy disk emulation remained from the original code base.

It looks like you are in a similar position. You have a working emulator, you just need to get in there and start understanding what your strategy should be. I have no idea what the PPU is, but does it really need to be emulated at 7.16 MHz or can you record the state changes and do much of it all at once? Writing a 6502 emulator in MIPS assembly shouldn't be too taxing if you just can't get the C compiler to cooperate (I also wrote a lot of MIPS assembly many, many moons ago). So go for it - if anything you'll learn much about emulating a system and tons about the PIC32.

Re: Emulating NES CPU and PPU on PIC32, too slow?

Posted: Fri Dec 06, 2024 1:00 am
by sburrow
Well, an update. But not what you expected!

I tried some minor modifications to the NES Emulator code to see if it would help, to check if I could find a starting place to optimize. For example, not rendering the screen with the PPU every frame but perhaps every other frame. I'm talking, have the PPU *not run at all* for a single frame (or more) and then draw it all at once whenever I wanted. Big failure, there are so many intricacies with interrupts and Sprite 0 collisions, it just wasn't working except when I would have it run as expected: 1 CPU cycle, 3 PPU cycles, draw pixels as you go.

Frustrated that I couldn't find much of a starting point for the PPU (as the 6502 CPU wasn't really a concern for me anyways), I started thinking of other options. And another option came up. Why not make a Gameboy Emulator? I found a very clean single-file set of code here: https://github.com/deltabeard/Peanut-GB, and then modified the code from SDL into my own procedures. I had done this with the NES Emulator, so it wasn't much of a problem.

After a day or so of finding issues, more with Elm-Chan's FatFS functions than anything else (or perhaps my logic in using them in weird particular ways?), I got it to work. AND, it runs at a very nice speed without any modification. It might even need a small delay, but running it straight in C seems to have it run at basically full speed. Cool!

You might consider this move a 'lazy' one. *shrug* I'm not really dying to get the NES-specifically working here, but I do want a large software library to play with. Having the Gameboy (and soon the Gameboy Color), I would have a very large library and for very little effort. More time for me, my hobbies, my wife and three kids, and even my job (in that order?!).

Anyways, that's where the story ends here, because I'm no longer in 6502-land again. I have many more plans for this, additional hardware updates, and other neat stuff, but again, that's not for this forum. Check my GitHub (link in first post) for updates when I'm happy to post them.

Thank you everyone! I appreciate the wisdom, encouragement, and support.

Chad

Re: Emulating NES CPU and PPU on PIC32, too slow?

Posted: Sun Feb 09, 2025 1:32 am
by sburrow
Hey everyone!

I'm posting on this topic again because... I've been working on my very own NES emulator! Over the past couple of days I've been coding the 6502 instruction set in C, using a lot of pre-compiler macros. Did you know that STA (d),y uses the indirect addressing to *store* a value, not read it?! Hm! That was 4 hours of debugging today :) [ It was actually a very simple oversight since all the other uses of that type of addressing read instead of write. I was just carelessly copying code it seems... ]
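To make the STA (d),y point concrete: the address calculation is shared between loads and stores, and only the final memory access differs. A minimal sketch, using small functions rather than the macros mentioned above; the function names are hypothetical and `mem` is assumed to be a flat 64 KB array.

```c
#include <stdint.h>

/* Effective address for (d),y: read a 16-bit pointer from zero page at d,
   then add Y. The pointer's high byte is fetched from (d + 1) wrapped
   within page zero, as on a real 6502. */
static uint16_t addr_ind_y(const uint8_t *mem, uint8_t zp, uint8_t y)
{
    uint8_t lo = mem[zp];
    uint8_t hi = mem[(uint8_t)(zp + 1)];
    return (uint16_t)((((uint16_t)hi << 8) | lo) + y);
}

/* LDA (d),y: READ from the effective address. */
static uint8_t lda_ind_y(const uint8_t *mem, uint8_t zp, uint8_t y)
{
    return mem[addr_ind_y(mem, zp, y)];
}

/* STA (d),y: WRITE the accumulator to the effective address.
   The zero-page pointer itself is only read, never modified. */
static void sta_ind_y(uint8_t *mem, uint8_t zp, uint8_t y, uint8_t a)
{
    mem[addr_ind_y(mem, zp, y)] = a;
}
```

Copying the load version and forgetting to turn the final read into a write is the kind of slip described above.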

Anyways, see attached pictures. I really don't have any PPU set up right now, what I'm doing here is 'faking it', but it seems to work none-the-less. I haven't yet implemented sprites, palettes, audio, etc. I just barely have nametables, interrupts, and... that's about it.

But here's the thing: It is running at nearly the exact speed it needs to run, but I would expect it to be running much faster without all the fancy stuff I haven't yet added to it. On top of that, it is only drawing this in a tiny corner of the screen. Read above why I stopped this last time: This PIC32, even at 200 MHz now, is just running too slow. It seems like I'm going to run into a similar problem again!

I'll be playing with it and will update you when I have something good (or bad) to report. Thanks everyone!

Chad