PostPosted: Sun Aug 08, 2010 1:38 pm 

Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10789
Location: England
WDC's 65c02 and 65816 are rated up to 14MHz. An FPGA T65 seems to promise about 40-50MHz. ASIC implementations are faster, but we don't have access to those.

I was wondering whether any cheap, readily available CPUs - in dev kits or salvageable from consumer electronics - would, when emulating a 6502, make a reasonable route to a faster 6502.

I'm not sure, but I think the 6502 emulator originally written by Acorn for the Archimedes performed at about 1/4 of the ARM's clock speed. (It would have been written in assembly.) (Edit: I no longer believe that figure.)

I ran a quick test of lib6502 on an 800MHz ARM-based storage device, and it looks like it runs at the equivalent of a bit over 100MHz, so that's 1/8 of the host clock. That's written in C. (Edit: with a less trivial benchmark, the emulated speed is only 60MHz.)

It seemed like the mbed LPC1768 dev kit, which has a 96MHz ARM, might be a good platform for an efficient 6502 emulation, for £50 (USB powered, 40-pin 0.1 inch form factor, weird online-only toolchain) - but based on the above, it would be a struggle to emulate at more than 12MHz. (It might be that 24MHz performance would be enough to make this interesting)

Any ideas or thoughts?


PostPosted: Sun Aug 08, 2010 3:07 pm 

Joined: Sat Jan 04, 2003 10:03 pm
Posts: 1706
You can get substantially improved performance with a smarter emulator. For example, transcoding the instructions you emulate into native instructions can get near-native performance levels, provided you don't rely on self-modifying code. (If code has to be self-modifying, you'll constantly be flushing your native cache of the 6502 code.) This is how DEC Alpha systems running Windows NT were able to run binaries compiled for the x86 architecture so fast.

You'd manage the native code the same way a physical CPU manages a hardware code cache, so anything that'd invalidate a code cache line would invalidate your transcompiled code and vice versa.
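
To sketch the bookkeeping involved - a from-scratch illustration, not FX!32's actual mechanism, with translate_block standing in as a placeholder for whatever emits the native code:
Code:
/* Rough sketch of a translation cache for a 6502 dynamic recompiler.
   Translated blocks of native code are indexed by the 6502 address they
   start at; writes into 6502 memory invalidate cached translations,
   which is why self-modifying code hurts so much. */
#include <stdint.h>

typedef void (*native_block_fn)(void);        /* entry point of a translated block */

static native_block_fn block_cache[0x10000];  /* one slot per 6502 start address */

extern native_block_fn translate_block(uint16_t pc);  /* placeholder JIT front end */

static native_block_fn get_block(uint16_t pc)
{
    if (block_cache[pc] == 0)                 /* miss: translate once, reuse after */
        block_cache[pc] = translate_block(pc);
    return block_cache[pc];
}

/* Called on every store into 6502 memory.  A real recompiler also has to
   invalidate blocks that start earlier but span the written address; that
   detail is omitted here. */
static void invalidate(uint16_t addr)
{
    block_cache[addr] = 0;
}

The dispatcher then just calls get_block(PC)() whenever control leaves a translated block.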


PostPosted: Mon Aug 09, 2010 10:23 am 

Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10789
Location: England
Interesting, I hadn't heard of that:
Quote:
After translation, x86 applications run as fast under DIGITAL FX!32 on a 500MHz Alpha system as on a 200MHz Pentium Pro.


See also Hans Wennborg's thesis "Emulator Speed-up Using JIT and LLVM" for some interesting recent work.

Reflecting on this, I realise that for ROM code which doesn't care too much about precise interrupts and has clear distinctions between code and data, the ROM could be translated - once - into a large static C program.

The meat of that translation process would be concatenating many copies of the per-opcode code snippets one would find in an emulator - perhaps inside a monster case statement to deal with branches and the like.

An optimising compiler might then make a decent job of removing dead code and perhaps even unrolling some loops. At minimum, the transition from one opcode to the next would be straightened out, because there's no interpretive loop.

(For dead code, I'm thinking for example of not needing to update the flags after a load, if a subsequent load comes before the next test.)

Another benefit would be that the emulated PC could be statically loaded prior to any JSR or interrupt opportunity - no need to increment for each opcode.
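
To make that concrete, here's a toy fragment of what the output of such a one-time translation might look like - purely illustrative, with invented macro names and addresses, not the output of any real translator:
Code:
/* Toy 6502-to-C static translation of a tiny zero-fill loop:
       LDA #$00 ; loop: STA $0200,X ; DEX ; BNE loop ; RTS
   Each macro is the snippet an interpreter would run for that opcode, pasted
   inline, so there is no fetch/decode loop.  LDA's N and Z results are dead
   here, DEX's Z is folded straight into the branch test (X != 0), and the
   emulated PC is never maintained - it would only need materialising before
   a JSR or an interrupt check. */
#include <stdint.h>

static uint8_t A, X;               /* 6502 registers live in C variables */
static uint8_t mem[0x10000];       /* flat RAM; I/O checks omitted here  */

#define LDA_IMM(v)  (A = (v))
#define STA_ABX(a)  (mem[(uint16_t)((a) + X)] = A)
#define DEX()       (X--)

void run_from_C000(void)
{
    uint16_t target = 0xC000;      /* translated entry point */
    for (;;) {
        switch (target) {          /* the "monster case statement" */
        case 0xC000:
            LDA_IMM(0x00);         /* straight-line code just falls through */
        case 0xC002:
            STA_ABX(0x0200);
            DEX();
            if (X != 0) { target = 0xC002; continue; }   /* BNE loop */
            return;                /* RTS back to the calling harness */
        default:
            return;                /* branch into untranslated code */
        }
    }
}

After that, the compiler just sees an ordinary C loop and can apply its usual optimisations to it.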

For a simple target processor, such as a cheap ARM, the compiler would be making greater comparative gains than for a sophisticated one like a desktop x86.

It would be nice not to have to test every memory access for I/O accesses, only those which sometimes or always are I/O accesses, but that seems like a difficult analysis problem to automate. Perhaps one could analyse the ROM manually and annotate the I/O routines or accesses.


PostPosted: Sat Aug 14, 2010 4:13 pm 

Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10789
Location: England
I've dug up some more info.

(Parenthetically: DEC's just-in-time translator is actually a better-luck-next-time translator: "Later, after the application exits, the profile data directs the background optimizer to generate native Alpha code as replacement for all the frequently executed procedures. When the application is next run, the native Alpha code is used and the application executes much faster. This process is repeated whenever a sufficiently enlarged profile shows that it is warranted")

As for one-time conversion to compiled code, Peter Jennings' Microchess site has a copy of Bill Forster's translation into C, which is worth a read. He defines a macro for each opcode, and constructs a function for each subroutine.
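
For flavour, the subroutine-as-function shape comes out roughly like this (my own invented fragment, not Forster's actual macros or code):
Code:
/* Invented fragment in the same general style: one macro per opcode, one C
   function per 6502 subroutine, so JSR becomes a call and RTS a return. */
#include <stdint.h>

static uint8_t A;
static uint8_t mem[0x10000];

#define LDA_ABS(a)  (A = mem[(a)])
#define STA_ABS(a)  (mem[(a)] = A)

static void copy_byte(void)        /* a translated 6502 subroutine */
{
    LDA_ABS(0x0300);
    STA_ABS(0x0380);
}                                  /* RTS */

static void caller(void)
{
    copy_byte();                   /* JSR to the subroutine above */
}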

I found a couple of threads from 2007 which also discuss translation into C - coincidentally also for chess purposes. (The second thread contains ptorric's posting of his C-based emulator, derived from Bart Trzynadlowski's aminet code.)

Back on the subject of fast emulation being a useful alternative to a real processor, I believe an ARM-based system will be the most efficient because it's so similar to the 6502. (Some other system might be faster just because of clock speed.)

I still don't know how fast an emulated 6502 could be on the 100MHz Cortex-M3 ARM in the mbed.

I did look into Acorn's two emulators, the 65Host and 65Tube programs, using the excellent disassembler Armalyser - it's amazing how dense the ARM instructions are, and very useful that the machine is so like the 6502. 65Tube is the faster of the two: it keeps the 6502 registers in ARM registers and handles PC and SP as pointers into a byte array. Each opcode handler finishes with a fetch and a computed jump into a table of 16-word sections, one per opcode. For example, the code for BCC is just 6 instructions:
Code:
; handler for 0x90 BCC
LDRCCB  R0,[R10],#1           ;  fetch the operand byte
MOVCC   R0,R0,LSL #24         ;  shift left to prepare sign-extension
ADDCC   R10,R10,R0,ASR #24    ;  adjust the PC for branch-taken case
ADDCS   R10,R10,#1            ;  increment PC for branch not taken

; standard postamble: fetch next instruction.  R10 is the 6502 PC, as a byte pointer
LDRB    R0,[R10],#1           ;  ifetch into R0 and PC++
ADD     PC,R12,R0,LSL #6      ;  computed jump to next opcode handler

Notice how the predicated instructions remove the need to branch, and how the ARM's own carry flag serves to emulate the 6502's - same for N, Z and V. All the 6502 state is held in registers throughout. The free shifts, auto-increment and the familiar-looking address modes help a lot too.

I think 65Tube is able to assume very simple address decoding, and may even assume all OS calls are made via the official vectors. Perhaps 65Host is slower because it has to emulate a full BBC Micro: banked memory and memory-mapped video.

Cheers
Ed


PostPosted: Sat Aug 14, 2010 4:16 pm 

Joined: Sat Jan 04, 2003 10:03 pm
Posts: 1706
I had in mind the Java HotSpot JIT compiler, which performs run-time translation of code, when I offered the suggestion. FX!32 is positively ancient technology in comparison. It's effectively emulating a CPU's code cache, if it were to have one.


PostPosted: Sat Aug 14, 2010 4:22 pm 

Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10789
Location: England
kc5tja wrote:
I had in mind the Java HotSpot JIT compiler, which performs run-time translation of code, when I offered the suggestion.

Right.

I've just posted in another thread to mention Tom Walker's dynamic recompilation in his ARM emulator - it's a ground-up construction rather than using something like LLVM. Of course, it's specific to a particular target - in his case, x86. It seems there's still work to do to make use of the x86_64 feature set.


PostPosted: Sat Sep 04, 2010 7:46 pm 

Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10789
Location: England
BigEd wrote:
...the ROM could be translated - once - into a large static C program.


Inevitably, it turns out this has been done: Asteroids Static Binary Translation by David Welch, with various others involved, including Graham Toal; somewhere nearby is Michael Steil of pagetable.com fame, and also nearby is Norbert Kehrer's port to Java.

For light relief, there's a playable static Java translation of the original Asteroids ROM there, and also the sources for that translation - but not, I think, for the translator itself. (It's playable, provided your computer is not too fast.)

I think David's page is saying there was a 3x speedup comparing the static translation to a conventional emulation.


PostPosted: Sun Sep 19, 2010 10:46 am 

Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10789
Location: England
There's an apposite thread on comp.sys.acorn.programmer from 2001:
"Emulating a 6502 with an ARM?"

There's info there on the 65Tube and 65Host emulators, and a suggestion that an ARM at more than a couple of hundred MHz will outpace a 20MHz '02.

The poster of that thread is looking to rejuvenate a 2MHz '02 system design by building an ARM system and emulating it (at least as a starting point).

Many ARMs implement a couple of 16-bit instruction set extensions, which means only half the bandwidth is spent fetching code, at the cost of some restrictions.

Because ARM has such similarities with 6502, this trick applies:
Quote:
Torben Mogensen:
Arithmetic flags are, IMO, best emulated by arithmetic flags. To do this, all bytes that emulate 6502 values are shifted to occupy the most significant byte of a register before addition or subtraction. This way, the resulting N, Z, C and V flags will (as far as I can judge without consulting manuals) be set correctly. However, incoming C flags and C flags in shift instructions will have to be treated differently.


PostPosted: Sun Sep 19, 2010 4:15 pm 

Joined: Sat Jan 04, 2003 10:03 pm
Posts: 1706
My 2.8GHz Athlon can emulate a 65816 at approximately 180 to 220MHz, depending on computer load and with fairly unoptimized code. When writing the Kestrel-2 emulator, I had to insert code to deliberately waste time to slow it down to near real-world speeds. And that is without direct-mapping CPU flags.
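
For the curious, the usual shape of that throttling is something like the sketch below - a generic illustration, not the actual Kestrel-2 code, with step_cpu standing in as a placeholder for the emulator core:
Code:
/* Generic throttle: run the emulated CPU in slices and sleep off whatever
   wall-clock time is left over, so the effective clock stays near TARGET_HZ.
   step_cpu() is a placeholder: it advances the emulation by up to the
   requested number of cycles and returns how many it actually used. */
#define _POSIX_C_SOURCE 200112L
#include <stdint.h>
#include <time.h>

#define TARGET_HZ     14000000ULL     /* say, a 14MHz '816 */
#define SLICE_CYCLES  10000ULL        /* throttle granularity */

extern uint32_t step_cpu(uint32_t max_cycles);   /* assumed emulator hook */

void run_throttled(void)
{
    struct timespec next;
    clock_gettime(CLOCK_MONOTONIC, &next);

    for (;;) {
        uint64_t cycles = 0;
        while (cycles < SLICE_CYCLES)
            cycles += step_cpu((uint32_t)(SLICE_CYCLES - cycles));

        /* this slice is allowed cycles / TARGET_HZ seconds of wall time */
        uint64_t ns = cycles * 1000000000ULL / TARGET_HZ;
        next.tv_sec  += (time_t)(ns / 1000000000ULL);
        next.tv_nsec += (long)(ns % 1000000000ULL);
        if (next.tv_nsec >= 1000000000L) {
            next.tv_sec++;
            next.tv_nsec -= 1000000000L;
        }
        /* sleep until the deadline; returns at once if we are running late */
        clock_nanosleep(CLOCK_MONOTONIC, TIMER_ABSTIME, &next, NULL);
    }
}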


PostPosted: Sun Sep 19, 2010 4:44 pm 

Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10789
Location: England
Well, indeed - the CPU power of a modern desktop system is amazing. And it was along those lines that I was musing on how cost-effective it would be to build a cheap self-contained system with a modern CPU emulating a 65-series part, instead of buying a real 65-series or putting one on an FPGA. Are modern CPUs fast enough and cheap enough to compete when used as emulators?

As a suitable thought experiment, take a chess machine built around a 5MHz 6502 (running on batteries, for extra credit). Is there any hardware you could buy that would fit inside the case and run the original ROM code, on batteries, significantly faster?

I'm presuming ARM is best for power and low emulation overhead.

The really affordable dev kits I can find only offer about 100MHz, which probably won't outpace a real 65-series at 20MHz. It looks like something in the $100-$200 range would be needed, at which point a highly optimised emulation approach might outpace an FPGA. (But it probably won't run on batteries.)

Maybe the one-time compilation approach to native code is the only way that might do it - conventional emulation loses too much speed, and you don't get adequate extra speed at low cost and running on batteries.


PostPosted: Sun Sep 19, 2010 6:51 pm 

Joined: Sat Jan 04, 2003 10:03 pm
Posts: 1706
BigEd wrote:
As a suitable thought experiment, take a chess machine built around a 5MHz 6502 (running on batteries, for extra credit). Is there any hardware you could buy that would fit inside the case and run the original ROM code, on batteries, significantly faster?


I'm assuming your goal is only to accelerate existing 65xx-based architectures, and not considering the creation of new systems?


PostPosted: Sun Sep 19, 2010 7:40 pm 

Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10789
Location: England
kc5tja wrote:
BigEd wrote:
As a suitable thought experiment, take a chess machine built around a 5MHz 6502 (running on batteries, for extra credit). Is there any hardware you could buy that would fit inside the case and run the original ROM code, on batteries, significantly faster?

I'm assuming your goal is only to accelerate existing 65xx-based architectures, and not considering the creation of new systems?

Yes, I think that's a reasonable assumption - if you were creating a new system you might as well use the 'modern CPU' natively. Of course, a new system with a need for backward compatibility is a good case - you don't want two CPUs.

I should say that I'm not trying to do this, but others might be. I'm just interested in what the current solutions might look like. So if it all seems a bit hypothetical, that's because it is.

I think it comes down to the efficiency of emulation, in the different possible approaches. If the emulation penalty is 10x, then a 100MHz ARM is barely worth considering. If it's only 3x, it's a possibility.

That said, maybe there is no 100MHz limit: I see a Chumby hacker board is 450MHz for $90 and a BeagleBoard is 720MHz for $150.


PostPosted: Sun Sep 19, 2010 8:02 pm 

Joined: Fri Aug 30, 2002 1:09 am
Posts: 8422
Location: Southern California
Quote:
My 2.8GHz Athlon can emulate a 65816 at approximately 180 to 220MHz, depending on computer load and with fairly unoptimized code.

Can you comment on how the emulated 200MHz '816 might perform compared to the 2.8GHz Athlon running natively, given that the '816 is far more practical to write tight assembly code for? Plus the differences in requiring (or not requiring) cache, the cache misses, and so on.


PostPosted: Sun Sep 19, 2010 8:15 pm 

Joined: Fri Aug 30, 2002 1:09 am
Posts: 8422
Location: Southern California
Quote:
I think it comes down to the efficiency of emulation, in the different possible approaches. If the emulation penalty is 10x, then a 100MHz ARM is barely worth considering. If it's only 3x, it's a possibility.

It seems like you answered your own question above with the "Emulating a 6502 with an ARM?" link which said it's approximately 10:1. I would guess that it would be approximately double that for an '816, and the '816 runs my Forth at 2-3 times the speed of the '02 running Forth at a given clock speed.


Last edited by GARTHWILSON on Sun Sep 19, 2010 8:32 pm, edited 1 time in total.

PostPosted: Sun Sep 19, 2010 8:32 pm 

Joined: Sat Jan 04, 2003 10:03 pm
Posts: 1706
GARTHWILSON wrote:
Can you comment on how the emulated 200MHz '816 might perform compared to the 2.8GHz Athlon running natively, given that the '816 is far more practical to write tight assembly code for? Plus the differences in requiring (or not requiring) cache, the cache misses, and so on.


I'm not sure I fully understand your questions, but I'll try to answer them based on my current interpretation.

Quote:
Can you comment on how the emulated 200MHz '816 might perform compared to the 2.8GHz Athlon running natively, given that the '816 is far more practical to write tight assembly code for?


A 200MHz Athlon would utterly smoke the 200MHz 65816 on several counts, assuming everything fits in cache:

1) It can handle multiple instructions per clock cycle. I believe my particular model has 7 pipelines, 4 integer, 2 floating-point, and I think one just for control flow instructions. I could be confusing my Athlon with something else though. It's an AthlonXP architecture, and I'm too lazy to look up specifics right now.

2) Although most instructions require 4 to 8 clocks to complete, the pipelined architecture permits one instruction to retire per clock, for each pipeline in use.

Note that the CPU dynamically fills pipelines to the best of its ability, so hand-coding optimized assembly isn't nearly as difficult as you might be led to believe.

That being said, if you find it too hard to schedule instructions for maximum superscalar performance, it's usually a simple task to optimize for a single pipe. Given this, and the 1-instruction-per-clock retirement rate, you're already talking a 3x average performance gain over even the best-optimized 65816.

Lots of transistors are dedicated to optimizing subroutine calls as best as possible. While it isn't possible to statically determine when a CALL or RET will take one cycle to complete, it is usually the case that they do.

3) The CPU is a 64-bit CPU with more useful addressing modes.

So, to answer your question, I'd wager that a 2.0GHz Athlon compared against the 65816 would offer about 30x to 120x performance, depending on how often you work with 32- or 64-bit data. With just 16-bit data and assuming no half-word overhead like the old Pentium Pro has, you're still looking at substantial performance gains.

Quote:
Plus the differences in requiring (or not requiring) cache, the cache misses, and so on.


This is much tougher to answer, because caching overheads tend to be very CPU-specific. The TransWarp accelerator for the Apple IIgs, for example, configured the 65816 to run from a hardware-managed cache, while the SuperCPU takes an explicitly-managed, software-driven cache approach. Even then, cache overheads on an 80486 versus an AthlonXP will be very different.

Basically, in a truly simplistic model, what you end up with is this: when the CPU references a memory location that is not in its cache (this can be determined in phase 1 if your logic is fast enough), the CPU needs to be halted (RDY line low) so that DMA can be used to load the faster static RAM with an image from the slower RAM source (perhaps RAM found over a CPU-independent I/O bus). This might require writing a line back to memory first, to make room, of course. With an average 16-byte cache line, a cache miss can reasonably be expected to delay the CPU by 16 to 32 cycles.

Toshi had posted a response to a disagreement we had where he quoted a figure in the vicinity of 580 cycles or so for a cache miss on (more or less) contemporary CPUs.

Taking these figures into account: at 16.67MHz, 32 cycles implies about 2 microseconds of wait time, while 580 cycles at 2.0GHz implies roughly 290ns of wait time. Both of these figures assume zero overhead for bus arbitration. So, even with a cache miss, the Athlon is still faster than the 65816. Also, Athlon caches hold 64 bytes per line, not 16, so, statement for statement, we can expect the 65816 to miss about 4x as often.
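
A quick back-of-the-envelope check of those penalty figures:
Code:
#include <stdio.h>

/* Cache-miss penalties from the figures quoted above. */
int main(void)
{
    double miss_816    = 32.0  / 16.67e6;   /* 32-cycle miss on a 16.67MHz '816  */
    double miss_athlon = 580.0 / 2.0e9;     /* 580-cycle miss on a 2.0GHz Athlon */

    printf("65816 miss penalty:  %.2f us\n", miss_816    * 1e6);   /* ~1.92 us */
    printf("Athlon miss penalty: %.0f ns\n", miss_athlon * 1e9);   /* ~290 ns  */
    return 0;
}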

