6502 is easy. Getting the 65C816's emulation/native mode and 8/16-bit register size changes correct and working efficiently is more of a challenge.
The first cut of my emulator was far too slow so I'm rewriting it. It doesn't help that the default optimisation setting for the Arduino code framework is for 'space' (-Os) rather than 'speed' (-O3).
How was it far too slow? What kind of changes are you doing to make it faster?
I'm not concerned with raw performance right now, but I'm just curious.
I was getting around 2MHz initially. I had to edit the code generation settings to enable more optimisation, which increased the speed to 2.5MHz, but I think that is still poor. Xtensa assembly code is hard to read and I could not determine how efficiently it had implemented the main opcode switch.
The processor has a few peculiarities I'm not familiar with. The clock speed is 240MHz but the flash is only rated for 80MHz, so that explains some of the slowdown. My memory mapping scheme takes 8-10 instructions per byte access but handles RAM and ROM uniformly, so I think I'll leave that alone in the next version. I have enough RAM that I can probably move some of the opcode functions into it, which I believe will make them run at the full 240MHz.
In the new version I'm implementing opcode execution using a structure containing an array of pointers to functions. There are 5 such structures, one for each CPU state, much like lib65c816.
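For anyone following along, a uniform RAM/ROM mapping of the kind described above could look roughly like this. This is only a sketch, not the poster's actual scheme: the 4 KB page size, the `page_t` layout and the names `map_region`, `mem_read` and `mem_write` are all assumptions made for illustration.

```c
#include <stdint.h>

/* Assumed layout: the 24-bit 65C816 address space split into 4 KB pages. */
#define PAGE_BITS  12u
#define PAGE_SIZE  (1u << PAGE_BITS)
#define PAGE_COUNT (1u << (24u - PAGE_BITS))

typedef struct {
    uint8_t *base;     /* host memory backing this page, NULL if unmapped */
    uint8_t  writable; /* 1 for RAM, 0 for ROM */
} page_t;

static page_t page_table[PAGE_COUNT];

/* Map `size` bytes of host memory at emulated address `addr`.
   Both are assumed to be multiples of PAGE_SIZE. */
static void map_region(uint32_t addr, uint8_t *host, uint32_t size, int writable)
{
    for (uint32_t off = 0; off < size; off += PAGE_SIZE) {
        page_t *p = &page_table[((addr + off) & 0xFFFFFFu) >> PAGE_BITS];
        p->base = host + off;
        p->writable = (uint8_t)writable;
    }
}

/* Reads go through the same path for RAM and ROM.
   Unmapped pages read as 0xFF (an arbitrary choice for this sketch). */
static inline uint8_t mem_read(uint32_t addr)
{
    const page_t *p = &page_table[(addr & 0xFFFFFFu) >> PAGE_BITS];
    return p->base ? p->base[addr & (PAGE_SIZE - 1u)] : 0xFF;
}

/* Writes to ROM or unmapped pages are silently ignored. */
static inline void mem_write(uint32_t addr, uint8_t value)
{
    const page_t *p = &page_table[(addr & 0xFFFFFFu) >> PAGE_BITS];
    if (p->base && p->writable)
        p->base[addr & (PAGE_SIZE - 1u)] = value;
}
```

Each access costs a mask, a shift, a table lookup, a null check and an indexed load, which is in the same ballpark as the 8-10 instructions mentioned above. The table itself is 4096 entries, so the page size is a trade-off between table footprint and mapping granularity.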
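A rough sketch of that kind of table-per-state dispatch follows. It assumes the five states are emulation mode plus the four native-mode combinations of 8/16-bit accumulator and index registers, which the post does not spell out. All names here (`cpu_t`, `op_table_t`, `set_mode`, `execute`) are hypothetical, and only INX ($E8) is filled in; every other slot points at a NOP placeholder.

```c
#include <stdint.h>

typedef struct cpu cpu_t;
typedef void (*op_fn)(cpu_t *);

/* One table of 256 handlers per CPU state. */
typedef struct {
    op_fn op[256];
} op_table_t;

/* Assumed states: emulation, then native with each M/X width combination. */
enum { MODE_E, MODE_M8X8, MODE_M8X16, MODE_M16X8, MODE_M16X16, MODE_COUNT };

struct cpu {
    uint16_t a, x, y;
    uint32_t cycles;
    const op_table_t *table; /* handlers for the current state */
};

static void op_nop(cpu_t *c)   { c->cycles += 2; }
static void op_inx8(cpu_t *c)  { c->x = (uint16_t)((c->x + 1) & 0x00FF); c->cycles += 2; }
static void op_inx16(cpu_t *c) { c->x = (uint16_t)(c->x + 1); c->cycles += 2; }

static op_table_t tables[MODE_COUNT];

static void init_tables(void)
{
    for (int m = 0; m < MODE_COUNT; m++) {
        for (int i = 0; i < 256; i++)
            tables[m].op[i] = op_nop; /* placeholder, not real behaviour */
        int x16 = (m == MODE_M8X16 || m == MODE_M16X16);
        tables[m].op[0xE8] = x16 ? op_inx16 : op_inx8; /* INX */
    }
}

/* Called whenever the E, M or X flags change (XCE, REP, SEP, PLP...). */
static void set_mode(cpu_t *c, int mode) { c->table = &tables[mode]; }

/* The width checks are resolved by table choice, not per instruction. */
static void execute(cpu_t *c, uint8_t opcode) { c->table->op[opcode](c); }
```

The point of the design is that the handlers never test the M and X flags themselves; the cost moves to the comparatively rare instructions that change them.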
Benchmarking these emulators can be a bit tricky, especially if the test code contains any I/O.
I've been testing my new code with some odd bits of code like the Fibonacci calculator I did earlier this year. This 6502 code prints out each of the numbers it generates. When I run it in the emulator on my 3GHz i7 desktop I get only 2MHz equivalent 6502 speed with output enabled and 310MHz with none. Clearly, outputting a character to a command shell window takes a very long time.
I think we need a standard compute-only task that runs for a large number of cycles to get a real feel for the raw emulation speed before environmental factors like I/O speed are added. Possibly a standard Ackermann function configuration.
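Whatever the workload ends up being, the measurement itself is just emulated cycles divided by wall-clock time. A minimal host-side sketch, assuming a hypothetical `step_fn` callback that executes one instruction and returns the cycles it consumed (none of these names come from the emulators discussed here):

```c
#include <stdint.h>
#include <time.h>

/* Hypothetical hook: run one instruction, return the cycles it consumed. */
typedef uint32_t (*step_fn)(void *ctx);

/* Equivalent clock speed in MHz for a cycle count and elapsed time. */
static double emulated_mhz(uint64_t cycles, double seconds)
{
    return seconds > 0.0 ? (double)cycles / seconds / 1e6 : 0.0;
}

/* Run whole instructions until at least `target` cycles have elapsed. */
static uint64_t run_cycles(step_fn step, void *ctx, uint64_t target)
{
    uint64_t total = 0;
    while (total < target)
        total += step(ctx);
    return total;
}

/* Time a compute-only run. clock() measures CPU time, so time spent
   blocked on I/O is not counted; the run must be long enough to exceed
   the timer resolution or the result will read as 0. */
static double benchmark(step_fn step, void *ctx, uint64_t target)
{
    clock_t start = clock();
    uint64_t done = run_cycles(step, ctx, target);
    clock_t end = clock();
    return emulated_mhz(done, (double)(end - start) / CLOCKS_PER_SEC);
}

/* Stand-in for a real emulator core: every "instruction" costs 2 cycles. */
static uint32_t fake_step(void *ctx) { (void)ctx; return 2; }
```

Calling `benchmark(fake_step, 0, 1000000000u)` gives an upper bound for the harness itself. Swapping in a real core running a fixed compute-only program would give figures that can be compared between emulators.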
I agree. There needs to be some type of benchmark really for the various instructions. This could be built into an instruction test program. Each instruction emulation takes a certain amount of time, and some optimize better than others.
There may be a few more code optimisations I can squeeze out of it and I want to try running the emulator on the second core -- which I don't think is running anything else so it might go a tad faster.
With the emulator running on core 0 and the duration of the test code increased to provide a better set of values to work out the speed from I now get ..
I have a working system with UART based I/O, 100Hz timer and interrupts. I need to port over my SXB hacker code to create a simple monitor and then I'll make the repository public.
The CPU execution speed has been reduced to 7.5MHz for now. When I try to move tasks to the second core the system becomes unresponsive. I think I'll have to move to using FreeRTOS directly at some point to make it work reliably, but it's good enough for now.