emulator performance on embedded cpu
GARTHWILSON wrote:
I would guess that it would be approximately double that for an '816
However, if I were using a 65816 to emulate an 80386, I suspect it would take much longer (closer to 20:1 or 30:1) simply because of register width issues. There's also that small matter of very complicated instruction decoding as well.
- GARTHWILSON
- Forum Moderator
- Posts: 8773
- Joined: 30 Aug 2002
- Location: Southern California
Wouldn't the Athlon have to do basically more MIPS to get the job done in the same amount of time, just because it's so complex you almost can't code it efficiently in assembly, and have to resort to a higher-level compiled language, meaning the gains are not as great as they initially appear? (Or maybe you were already taking that into account?)
GARTHWILSON wrote:
Wouldn't the Athlon have to do basically more MIPS to get the job done in the same amount of time, just because it's so complex you almost can't code it efficiently in assembly, and have to resort to a higher-level compiled language, meaning the gains are not as great as they initially appear? (Or maybe you were already taking that into account?)
MEANING 1: Single-cycle instruction execution. Answer: no. It already achieves this, and in fact, Intel has been executing many of its simple instructions in a single cycle as far back as the 80486.
MEANING 2: An intelligently designed, complex instruction set that actually gives evidence for the benefits of CISC over RISC, instead of vice versa. Answer: Yes, in some cases.
The 6502/65816, for example, will automatically set flags based on virtually anything coming in off the data bus or ALU output. It seems rare when the CPU inhibits all flags updates. Thus, you don't often see code like:
Code:
LDA someVariable
ORA #0 ; set flags
BEQ fooBar
whereas the x86 equivalent does need the explicit test, because MOV leaves the flags untouched:
Code:
MOV EAX,someVariable
OR EAX,EAX ; set flags
JZ fooBar
However, there's a flip-side to this. I'm sure I'm not the only one who has found writing flags-transparent code annoying:
Code:
PHP ; save the caller's flags
LDA foo
ADC bar
STA baz
PLP ; restore the caller's flags
The degree to which the Intel-architecture processor needs to execute more MIPS than a comparable 65816 depends on how frequently you depend on things like automatic flag settings.
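For anyone writing an emulator, the automatic flag update described above is work the host has to do on nearly every emulated instruction. A minimal sketch in C of what that looks like for a load; the names (cpu_t, set_nz, op_lda) are made up for illustration and are not taken from any emulator mentioned in this thread:

```c
#include <stdint.h>

/* Hypothetical state for an 8-bit 6502-style core (illustrative names). */
typedef struct {
    uint8_t a;   /* accumulator */
    uint8_t p;   /* processor status register */
} cpu_t;

enum { FLAG_Z = 0x02, FLAG_N = 0x80 };   /* 6502 status bit positions */

/* Update N and Z from a result byte, as the 6502 does on loads and ALU ops. */
static void set_nz(cpu_t *cpu, uint8_t value)
{
    cpu->p &= (uint8_t)~(FLAG_N | FLAG_Z);   /* clear both flags first */
    if (value == 0)
        cpu->p |= FLAG_Z;
    cpu->p |= value & FLAG_N;                /* N mirrors bit 7 of the result */
}

/* LDA: load the accumulator, then update the flags. */
static void op_lda(cpu_t *cpu, uint8_t operand)
{
    cpu->a = operand;
    set_nz(cpu, cpu->a);
}
```

Keeping N and Z in the host's own status register instead of recomputing them like this is one way to cut that cost, where the host allows it.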
On the other hand, Intel architecture supports:
* multi-bit rotates and shifts, with and without carry, in a single cycle.
* explicit I/O and memory address spaces, which saves on bank-switching costs on constrained systems. You rarely see code fragments in Intel programs correlating with 6510 code like LDA $01:PHA:LDA #xx:STA $01:....:PLA:STA $01
* real indirect addressing modes for JMP and CALL targets, permitting no more than two-cycle (assuming cache hit) vectored execution, which OO- and functional-programming code relies on extensively.
* Any CPU register can be used as a base address and/or index (saving the need for TAX/TAY instructions), complete with power-of-two scaling (saving the need for TXA,ASL,TAX sequences).
* Saving registers to, and popping registers from, the stack are single cycle, not 3 to 4.
* More useful set of conditional branches, testing for <, <=, =, /=, >=, and > in both signed and unsigned variants.
* Floating point instructions.
So, yes, it's true that Intel-architecture CPUs need more instructions to do what the 65816 can do in one well-chosen instruction. However, exploiting other unique features of the x86 architecture can more than make up for that deficiency. It all depends on the kind of program you're writing.
I do want to point out, though, that I still enjoy coding for the 65816 for one simple reason: the x86 architecture is notoriously difficult to get working in a circuit. There's no question that a contemporary x86 processor with adequate cache support and a NorthBridge chip suited to the specific CPU you're working with will obviously trounce the latest from WDC.
On the other hand, getting your 65xx code to run as fast as an x86 will involve a ton of look-up tables, because that's about the only way you're going to recover the computation losses that x86 overcomes.
GARTHWILSON wrote:
Quote:
I think it comes down to the efficiency of emulation, in the different possible approaches. If the emulation penalty is 10x, then a 100MHz ARM is barely worth considering. If it's only 3x, it's a possibility.
OK, it looks to me like they think 8x slowdown for 6502 is expected, maybe as little as 4x with the emulator written in optimal assembly code. (I suspect 4x is optimistic.)
I don't think anyone on that thread knows about emulating the '816 optimally on an ARM, but there's some guessing around 7x. You're thinking it might be 20x.
Samuel can run an '816 with 14x slowdown using a compiled emulator on a high-end x86. It being a high-end CPU is going to give advantages.
I just built and ran Samuel's lib65816 emulator on an 800MHz ARM: it runs at only about 6MHz. (lib6502 manages about 60MHz.) I think the penalty here for some styles of compiled code is showing up: most things run at acceptable speed but Python is particularly slow on this machine. (I gather ARM sell a compiler which does much better than gcc.)
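To put those figures on the same footing as the "Nx slowdown" numbers above, here is a trivial sketch in C. It assumes slowdown simply means host clock divided by equivalent emulated clock, which is a crude measure; the function name is made up:

```c
/* Slowdown as host clock over equivalent emulated clock, both in MHz.
 * This definition is an assumption; it ignores cycles-per-instruction
 * differences between host and target. */
static double slowdown(double host_mhz, double emulated_mhz)
{
    return host_mhz / emulated_mhz;
}
```

With the numbers measured here, slowdown(800, 6) is roughly 133 for lib65816 and slowdown(800, 60) is roughly 13 for lib6502.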
I've run two of Acorn's ARM-based 6502 emulators: one runs 2.5x-3x the speed of the other, because the faster one runs a flat 64k RAM emulation with I/O done by undocumented instructions - the slower one emulates the BBC with 32k RAM, 8k paged ROM, 8k OS ROM with 3 I/O pages mapped over it.
Note also that the ARMs in the $60 dev boards are low-end CPUs - they won't have the cache, branch prediction, etc. that will be in the $150 products. I think that means most code will run slower, whether emulated or native.
For the '816, the annoying mode-bits might be a help: an emulator can run in several modes, perhaps by using a different base address for the array of 256 code snippets. So no need for each instruction handler to handle all modes. It just might be that the 16-bit operations are similar enough for the ARM (as a 32-bit machine) to handle them with similar tactics to the 8-bit case. So the overhead might be less than expected.
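A rough sketch in C of the idea, under some assumptions: only the accumulator-width bit is modelled, only LDA immediate ($A9) is filled in, and a table pointer swap stands in for the "different base address". All names are hypothetical and nothing here is taken from lib65816:

```c
#include <stdint.h>

typedef struct cpu cpu_t;
typedef void (*handler_t)(cpu_t *);

struct cpu {
    uint16_t a;               /* accumulator; only the low byte changes in 8-bit mode */
    uint16_t pc;              /* program counter within one 64K bank */
    const handler_t *table;   /* active 256-entry handler table */
    uint8_t mem[65536];
};

static void op_unimplemented(cpu_t *cpu) { (void)cpu; }

/* LDA #imm with an 8-bit accumulator: one operand byte, high byte of A kept. */
static void lda_imm8(cpu_t *cpu)
{
    cpu->a = (uint16_t)((cpu->a & 0xFF00u) | cpu->mem[cpu->pc]);
    cpu->pc += 1;
}

/* LDA #imm with a 16-bit accumulator: two operand bytes, little-endian. */
static void lda_imm16(cpu_t *cpu)
{
    uint8_t lo = cpu->mem[cpu->pc];
    uint8_t hi = cpu->mem[(uint16_t)(cpu->pc + 1)];
    cpu->a = (uint16_t)(lo | (hi << 8));
    cpu->pc += 2;
}

static handler_t table_m8[256];    /* handlers for the 8-bit accumulator mode */
static handler_t table_m16[256];   /* handlers for the 16-bit accumulator mode */

static void init_tables(void)
{
    for (int i = 0; i < 256; i++) {
        table_m8[i] = op_unimplemented;
        table_m16[i] = op_unimplemented;
    }
    table_m8[0xA9] = lda_imm8;
    table_m16[0xA9] = lda_imm16;
}

/* A mode change (REP/SEP on the real chip) becomes a pointer swap. */
static void set_accumulator_width(cpu_t *cpu, int is_16bit)
{
    cpu->table = is_16bit ? table_m16 : table_m8;
}

/* Fetch one opcode and dispatch with no per-instruction mode test. */
static void step(cpu_t *cpu)
{
    uint8_t opcode = cpu->mem[cpu->pc++];
    cpu->table[opcode](cpu);
}
```

The point of the design is that no handler ever tests the mode bits: the cost of a mode change is paid once, when the table pointer is swapped.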
BigEd wrote:
I don't think anyone on that thread knows about emulating the '816 optimally on an ARM, but there's some guessing around 7x. You're thinking it might be 20x.
Quote:
For the '816, the annoying mode-bits might be a help: an emulator can run in several modes, perhaps by using a different base address for the array of 256 code snippets.
- BitWise
- In Memoriam
- Posts: 996
- Joined: 02 Mar 2004
- Location: Berkshire, UK
I've been playing around at emulating a 65C02 on a 40MIPS dsPIC. The emulation supports 32K of ROM (read from the program flash memory) and 4, 8 or 16K of RAM (depending on the dsPIC device model) replicated as needed to fill the low 32K address space.
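A minimal sketch in C of one way that memory map could be decoded. It assumes the ROM sits in the upper 32K (so the vectors at $FFFA-$FFFF land in ROM) and that the RAM size is a power of two; a plain pointer stands in for the dsPIC's flash access, and all names are hypothetical:

```c
#include <stdint.h>

#define RAM_SIZE 4096u        /* assumed 4K part; 8K or 16K work the same way */

static uint8_t ram[RAM_SIZE];
static const uint8_t *rom;    /* hypothetical pointer to the 32K ROM image */

/* Read a byte from the emulated 64K address space. */
static uint8_t read_byte(uint16_t addr)
{
    if (addr & 0x8000u)
        return rom[addr & 0x7FFFu];        /* upper 32K: ROM */
    return ram[addr & (RAM_SIZE - 1u)];    /* lower 32K: RAM, mirrored by masking */
}

/* Write a byte; writes aimed at the ROM half are ignored. */
static void write_byte(uint16_t addr, uint8_t value)
{
    if (!(addr & 0x8000u))
        ram[addr & (RAM_SIZE - 1u)] = value;
}
```

Because the RAM size is a power of two, the replication costs a single AND per access and needs no range compare.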
Currently I estimate I'm getting the performance equivalent to a 2.5-3MHz 65C02 with code that keeps the flags and registers in memory. I'm going to switch to using the device status register for N,V,Z,C which will move me closer to my goal of around 4MHz.
Using memory addresses for I/O would have slowed down normal instruction execution, so I've added the 65816's coprocessor opcode (COP) and I'm going to use it to control I/O.
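A hedged sketch in C of how a COP-based I/O trap might look. On the 65816, COP is opcode $02 followed by a one-byte signature; the signature values and the functions they select below are invented for illustration, as are the struct and field names:

```c
#include <stdint.h>

#define OP_COP 0x02   /* 65816 COP opcode, followed by one signature byte */

/* Hypothetical I/O functions selected by the signature byte. */
enum { IO_PUTC = 0x00, IO_GETC = 0x01 };

typedef struct {
    uint8_t a;            /* accumulator */
    uint16_t pc;          /* points at the signature byte on entry to op_cop */
    uint8_t mem[65536];
    uint8_t last_out;     /* stands in for a real UART transmit register */
} cpu_t;

/* Handler for OP_COP: consume the signature byte and perform the I/O. */
static void op_cop(cpu_t *cpu)
{
    uint8_t signature = cpu->mem[cpu->pc++];
    switch (signature) {
    case IO_PUTC:
        cpu->last_out = cpu->a;   /* "transmit" the accumulator */
        break;
    case IO_GETC:
        cpu->a = 0;               /* no input source in this sketch */
        break;
    default:
        break;                    /* unknown signatures are ignored */
    }
}
```

The attraction is that ordinary loads and stores never need an "is this an I/O address?" test; only the COP handler knows about devices.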
Andrew Jacobs
6502 & PIC Stuff - http://www.obelisk.me.uk/
Cross-Platform 6502/65C02/65816 Macro Assembler - http://www.obelisk.me.uk/dev65/
Open Source Projects - https://github.com/andrew-jacobs
- GARTHWILSON
- Forum Moderator
- Posts: 8773
- Joined: 30 Aug 2002
- Location: Southern California
Quote:
I've been playing around at emulating a 65C02 on a 40MIPS dsPIC. [...]
Currently I estimate I'm getting the performance equivalent to a 2.5-3MHz 65C02
- BitWise
- In Memoriam
- Posts: 996
- Joined: 02 Mar 2004
- Location: Berkshire, UK
I found a PIC 24F device with 96K of RAM and have updated the code so that the embedded simulation gives a full 64K of RAM to the 65C02 emulation. On reset I copy a 32K 6502 ROM image from the PIC's flash program memory into the upper RAM area.
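The reset step might look something like this in C. This is only a sketch: a plain pointer stands in for however the PIC24F actually exposes its program flash, and the function and macro names are made up:

```c
#include <stdint.h>
#include <string.h>

#define ROM_BASE 0x8000u   /* assumed: ROM image occupies the upper 32K */
#define ROM_SIZE 0x8000u

static uint8_t ram[0x10000];   /* full 64K for the emulated 65C02 */

/* Copy the ROM image into upper RAM, then fetch the reset vector. */
static void emu_reset(const uint8_t *rom_image, uint16_t *pc)
{
    memcpy(&ram[ROM_BASE], rom_image, ROM_SIZE);
    /* The 6502 reset vector lives at $FFFC/$FFFD, low byte first. */
    *pc = (uint16_t)(ram[0xFFFC] | (ram[0xFFFD] << 8));
}
```

With everything in one flat RAM array, each emulated read becomes a single indexed load with no ROM/RAM decision.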
The device supports USB so I'm working on a composite CDC driver which will provide two serial connections, one to communicate with the simulator (for debugging) and one to communicate with the emulated 65C02. I'm also going to add VIA/PIA, ACIA and SIB emulation using the device hardware features.
The downside to using a 24F is that the maximum clock speed is lower and the simulation will end up being around 1.5-2.0MHz, but that's still better than the real 6502 in the BBC microcomputer I used to own.
Another pain is that the device is only available in surface mount packages. I'm planning on using a Schmartboard to mount the device and its decoupling capacitors, then use normal 0.1" strip board for the rest of the supporting circuitry. Made professionally you could get the whole thing on something the size of a 40-pin DIP, like the GODIL boards.
Should keep me busy over Xmas.
Andrew Jacobs
6502 & PIC Stuff - http://www.obelisk.me.uk/
Cross-Platform 6502/65C02/65816 Macro Assembler - http://www.obelisk.me.uk/dev65/
Open Source Projects - https://github.com/andrew-jacobs
- BitWise
- In Memoriam
- Posts: 996
- Joined: 02 Mar 2004
- Location: Berkshire, UK
I'm using a 24FJ256GB206/210 at 32MHz (16 MIPS). It's my own design, but there's not much to it: MCU + USB + ICSP + power regulator (maybe more later).
I have TQFPs for the 64 and 100 pin devices. I'm hoping Schmartboards will make hand soldering 0.4 and 0.5 mm spaced pins feasible with basic equipment. (They have grooved channels for the pins which are pre-tinned. You drop on the device and run a fine point iron up the groove to solder the pin. The outer edges of the board have 0.1" spaced holes for wire connections).
If the prototype works well then I'll design a proper PCB and see about having some made properly.
Andrew Jacobs
6502 & PIC Stuff - http://www.obelisk.me.uk/
Cross-Platform 6502/65C02/65816 Macro Assembler - http://www.obelisk.me.uk/dev65/
Open Source Projects - https://github.com/andrew-jacobs
faster cheaper ARM dev board
BigEd wrote:
It seemed like the mbed LPC1768 dev kit, which has a 96MHz ARM, might be a good platform for an efficient 6502 emulation, for £50 (USB powered, 40-pin 0.1 inch form factor, weird online-only toolchain) - but based on the above, it would be a struggle to emulate at more than 12MHz.
Again, it's USB-powered, it has 100-pin breakout (but not in compact DIL footprint like Bitwise's PIC-based modules). There's a choice of Windows toolchains but apparently linux can be got to work too (see comments here).
PDFs: technical, marketing and benchmarks.
(The flash doesn't run at 168MHz, but there's a cache and some graphs implying that it keeps up.)
Mouser, Farnell, DigiKey
Edit: since there's a little doubt about the Linux recipes, here are some links:
- http://hackaday.com/2011/10/17/how-to-d ... ing-linux/
- http://cu.rious.org/make/stm32f4-discov ... ith-linux/
- http://stm32.spacevs.com/index.php?opti ... Itemid=103
Edit: 192kByte RAM
Edit: can use a collection of the I/Os as an external memory bus
Edit: can emulate 6502 at about 18MHz. See https://github.com/BigEd/a6502
Last edited by BigEd on Mon May 25, 2015 2:35 pm, edited 3 times in total.
Re: emulator performance on embedded cpu
Right now, TI are taking pre-orders for an 80MHz ARM dev board - just $5 including free shipping worldwide. Max 2 boards per person, and you have to declare (or invent) a company name.
Only 32K RAM and no simple support for external memory (unlike the STM discovery board), but has 36 GPIO available and an FPU.
http://www.ti.com/product/lm4f120h5qr#feature
https://estore.ti.com/Stellaris-LaunchPad.aspx
Re: emulator performance on embedded cpu
Be quick to catch this Kickstarter: from $22 for a 48MHz ARM dev board

(will be available commercially in due course.)
Edit: just 16kByte RAM

Last edited by BigEd on Sat Sep 15, 2012 5:53 pm, edited 2 times in total.