6502.org • View topic - emulator performance on embedded cpu

All times are UTC

emulator performance on embedded cpu

kc5tja

Posted: Sun Sep 19, 2010 8:38 pm

Joined: Sat Jan 04, 2003 10:03 pm
Posts: 1706

GARTHWILSON wrote:

I would guess that it would be approximately double that for an '816

As long as the state of the emulated CPU can be kept (reasonably) within the emulator's register width, the 10:1 ratio is pretty constant. I observe 10:1 for both 6502 and 65816 emulation.

However, if I were using a 65816 to emulate an 80386, I suspect it would take much longer (closer to 20:1 or 30:1) simply because of register width issues. There's also the small matter of very complicated instruction decoding.

GARTHWILSON

Posted: Sun Sep 19, 2010 8:45 pm

Joined: Fri Aug 30, 2002 1:09 am
Posts: 8543
Location: Southern California

Wouldn't the Athlon have to do basically more MIPS to get the job done in the same amount of time, just because it's so complex you almost can't code it efficiently in assembly, and have to resort to a higher-level compiled language, meaning the gains are not as great as they initially appear? (Or maybe you were already taking that into account?)

kc5tja

Posted: Sun Sep 19, 2010 8:59 pm

Joined: Sat Jan 04, 2003 10:03 pm
Posts: 1706

GARTHWILSON wrote:

I see two possible meanings of "efficiency" here.

MEANING 1: Single-cycle instruction execution. Answer: no. It already achieves this, and in fact, Intel has been executing its instructions in a single cycle as far back as the 80486.

MEANING 2: An intelligently designed, complex instruction set that actually gives evidence for the benefits of CISC over RISC, instead of vice versa. Answer: Yes, in some cases.

The 6502/65816, for example, will automatically set flags based on virtually anything coming in off the data bus or ALU output. It seems rare when the CPU inhibits all flag updates. Thus, you don't often see code like:

Code:

LDA someVariable
ORA #0 ; set flags
BEQ fooBar

However, you do see this with Intel's instruction set. Intel's instruction set, you might say, is more modular, and as with anything modular, you need more glue to put the pieces together. So,

Code:

MOV EAX,someVariable
OR EAX,EAX
JZ fooBar

becomes a frequent coding experience.

However, there's a flip-side to this. I'm sure I'm not the only one who has found writing flags-transparent code annoying:

Code:

PHP
LDA foo
ADC bar
STA baz
PLP

The x86 architecture provides fewer cases where this becomes a problem.

The degree to which the Intel-architecture processor needs to execute more MIPS than a comparable 65816 depends on how frequently you depend on things like automatic flag settings.

On the other hand, Intel architecture supports:

* multi-bit rotates and shifts, with and without carry, in a single cycle.

* explicit I/O and memory address spaces, which saves on bank-switching costs on constrained systems. You rarely see code fragments in Intel programs corresponding to 6510 code like LDA $01:PHA:LDA #xx:STA $01:....:PLA:STA $01

* real indirect addressing modes for JMP and CALL targets, permitting no more than two-cycle (assuming cache hit) vectored execution, which OO- and functional-programming code relies on extensively.

* Any CPU register can be used as a base address and/or index (saving the need for TAX/TAY instructions), complete with power-of-two scaling (saving the need for TXA,ASL,TAX sequences).

* Saving registers to, and popping registers from, the stack are single cycle, not 3 to 4.

* More useful set of conditional branches, testing for <, <=, =, /=, >=, and > in both signed and unsigned variants.

* Floating point instructions.

So, yes, it's true that Intel-architecture CPUs need more instructions to do what the 65816 can do in one well-chosen instruction. However, exploiting other unique features of the x86 architecture can more than make up for that deficiency. It all depends on the kind of program you're writing.
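
Seen from the emulator's side, the automatic flag setting described above is work the host has to do explicitly: every emulated load or ALU result must refresh N and Z. A minimal C sketch of that, where `cpu6502`, `set_nz` and `op_lda_imm` are hypothetical names for illustration and are not taken from lib6502 or lib65816:

```c
#include <stdint.h>

/* Hypothetical register file, for illustration only. */
typedef struct {
    uint8_t  a, x, y, sp;
    uint8_t  p;              /* status register: N V - B D I Z C */
    uint16_t pc;
} cpu6502;

enum { FLAG_Z = 0x02, FLAG_N = 0x80 };

/* Refresh N and Z from an 8-bit result, leaving the other flags alone. */
static void set_nz(cpu6502 *cpu, uint8_t value)
{
    cpu->p &= (uint8_t)~(FLAG_N | FLAG_Z);
    cpu->p |= (uint8_t)(value & FLAG_N);   /* N mirrors bit 7 of the result */
    if (value == 0)
        cpu->p |= FLAG_Z;
}

/* LDA #imm: load, then update flags, which the real CPU does for free. */
static void op_lda_imm(cpu6502 *cpu, uint8_t imm)
{
    cpu->a = imm;
    set_nz(cpu, imm);
}
```

The `set_nz` call is the host-side cost of the "free" flag update; an emulator that maps N and Z onto the host's own status flags can sometimes avoid part of it.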

kc5tja

Posted: Sun Sep 19, 2010 9:07 pm

Joined: Sat Jan 04, 2003 10:03 pm
Posts: 1706

I do want to point out, though, that I still enjoy coding for the 65816 for one simple reason: the x86 architecture is notoriously difficult to get working in a circuit. There's no question that a contemporary x86 processor with adequate cache support and a NorthBridge chip suited to the specific CPU you're working with will obviously trounce the latest from WDC.

On the other hand, getting your 65xx code to run as fast as an x86 will involve a ton of look-up tables, because that's about the only way you're going to recover the computation losses that x86 overcomes.

BigEd

Posted: Sun Sep 19, 2010 9:42 pm

Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10985
Location: England

GARTHWILSON wrote:

Quote:

I think it comes down to the efficiency of emulation, in the different possible approaches. If the emulation penalty is 10x, then a 100MHz ARM is barely worth considering. If it's only 3x, it's a possibility.

It seems like you answered your own question above with the "Emulating a 6502 with an ARM?" link which said it's approximately 10:1. I would guess that it would be nearly double that for an '816, and the '816 runs my Forth at 2-3 times the speed of the '02 running Forth at a given clock speed.

Let me re-read that thread...

OK, it looks to me like they think 8x slowdown for 6502 is expected, maybe as little as 4x with the emulator written in optimal assembly code. (I suspect 4x is optimistic.)

I don't think anyone on that thread knows about emulating the '816 optimally on an ARM, but there's some guessing around 7x. You're thinking it might be 20x.

Samuel can run an '816 with 14x slowdown using a compiled emulator on a high-end x86. It being a high-end CPU is going to give advantages.

I just built and ran Samuel's lib65816 emulator on an 800MHz ARM: it runs only about 6MHz. (lib6502 manages about 60MHz.) I think the penalty here for some styles of compiled code is showing up: most things run at acceptable speed but Python is particularly slow on this machine. (I gather ARM sell a compiler which does much better than gcc.)

I've run two of Acorn's ARM-based 6502 emulators: one runs 2.5x-3x the speed of the other, because the faster one runs a flat 64k RAM emulation with I/O done by undocumented instructions - the slower one emulates the BBC with 32k RAM, 8k paged ROM, 8k OS ROM with 3 I/O pages mapped over it.

Note also that the ARMs in the $60 dev boards are low-end CPUs - won't have the cache, branch prediction etc that will be in the $150 products. I think that means most code will run slower, whether emulated or native.

For the '816, the annoying mode-bits might be a help: an emulator can run in several modes, perhaps by using a different base address for the array of 256 code snippets. So no need for each instruction handler to handle all modes. It just might be that the 16-bit operations are similar enough for the ARM (as a 32-bit machine) to handle them with similar tactics to the 8-bit case. So the overhead might be less than expected.
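
A minimal C sketch of the per-mode table idea, assuming only the accumulator-width (M) flag selects the table; a fuller emulator would presumably also key on the index-width flag and the emulation bit. All names here (`cpu816`, `dispatch`, `execute`) are hypothetical and this is not lib65816's actual code:

```c
#include <stdint.h>

/* Hypothetical CPU state; only the fields this sketch needs. */
typedef struct {
    uint16_t a;       /* accumulator, 8 or 16 bits significant */
    uint8_t  m_flag;  /* 1 = 8-bit accumulator, 0 = 16-bit */
} cpu816;

typedef void (*handler)(cpu816 *cpu, uint16_t operand);

/* LDA immediate, 8-bit accumulator: the high byte of A is preserved. */
static void lda_imm_m8(cpu816 *cpu, uint16_t operand)
{
    cpu->a = (uint16_t)((cpu->a & 0xFF00u) | (operand & 0x00FFu));
}

/* LDA immediate, 16-bit accumulator: no mode test needed in here. */
static void lda_imm_m16(cpu816 *cpu, uint16_t operand)
{
    cpu->a = operand;
}

/* Placeholder for every opcode this sketch does not implement. */
static void unimplemented(cpu816 *cpu, uint16_t operand)
{
    (void)cpu;
    (void)operand;
}

/* One 256-entry table per mode: row 0 = 16-bit A, row 1 = 8-bit A. */
static handler dispatch[2][256];

static void init_dispatch(void)
{
    for (int mode = 0; mode < 2; mode++)
        for (int op = 0; op < 256; op++)
            dispatch[mode][op] = unimplemented;
    dispatch[0][0xA9] = lda_imm_m16;   /* $A9 = LDA #imm */
    dispatch[1][0xA9] = lda_imm_m8;
}

/* Picking the row is the "different base address" for the snippets. */
static void execute(cpu816 *cpu, uint8_t opcode, uint16_t operand)
{
    dispatch[cpu->m_flag & 1][opcode](cpu, operand);
}
```

The mode is tested once per instruction when the row is chosen, so the individual handlers stay free of width checks.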

kc5tja

Posted: Sun Sep 19, 2010 9:51 pm

Joined: Sat Jan 04, 2003 10:03 pm
Posts: 1706

BigEd wrote:

I don't think anyone on that thread knows about emulating the '816 optimally on an ARM, but there's some guessing around 7x. You're thinking it might be 20x.

I'm not sure where you got that figure from. I said 20:1 for a 65816 emulating an 80386 because of register widths: that is a 16-bit CPU emulating a 32-bit one, and it should be fairly common sense that it takes 2x as long to perform 32-bit math on a 16-bit ALU. ARM emulating 65816 should be 10:1-ish because ARM is 32-bit and the 65816 is 16-bit. Exploiting ARM's unique features allows it to go a bit faster, but I'm ignorant of ARM's unique abilities in this area. I'm making conservative estimates.
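
To make the register-width point concrete, here is a sketch in C of a 32-bit add done with 16-bit pieces, the way a 16-bit ALU has to do it (on a 65816 this would be an ADC on the low words followed by an ADC on the high words). The function name `add32_via_16` is made up for this example:

```c
#include <stdint.h>

/* Add two 32-bit values using only 16-bit operations:
 * low halves first, then high halves plus the carry out. */
static uint32_t add32_via_16(uint32_t x, uint32_t y)
{
    uint16_t xl = (uint16_t)x, xh = (uint16_t)(x >> 16);
    uint16_t yl = (uint16_t)y, yh = (uint16_t)(y >> 16);

    uint16_t lo    = (uint16_t)(xl + yl);          /* first ALU pass */
    uint16_t carry = (uint16_t)(lo < xl);          /* wrapped, so carry out */
    uint16_t hi    = (uint16_t)(xh + yh + carry);  /* second ALU pass */

    return ((uint32_t)hi << 16) | lo;
}
```

Two ALU passes plus the carry handling is where the roughly 2x comes from, before counting the extra loads and stores for the split registers.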

Quote:

For the '816, the annoying mode-bits might be a help: an emulator can run in several modes, perhaps by using a different base address for the array of 256 code snippets.

This is how lib65816 operates.

BitWise

Posted: Mon Sep 20, 2010 7:33 am

Joined: Tue Mar 02, 2004 8:55 am
Posts: 996
Location: Berkshire, UK

I've been playing around at emulating a 65C02 on a 40MIPS dsPIC. The emulation supports 32K of ROM (read from the program flash memory) and 4, 8 or 16K of RAM (depending on the dsPIC device model) replicated as needed to fill the low 32K address space.

Currently I estimate I'm getting the performance equivalent to a 2.5-3MHz 65C02 with code that keeps the flags and registers in memory. I'm going to switch to using the device status register for N,V,Z,C which will move me closer to my goal of around 4MHz.

Using memory addresses for I/O would have slowed down normal instruction execution so I've added the 65816's coprocessor opcode (COP) and I'm going to use it to control I/O.
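
As a rough illustration of the memory map described above (32K of ROM in the top half, a smaller RAM replicated through the low 32K), here is a sketch in C. The 4K `RAM_SIZE`, the array names and the helper names are assumptions for the example, not the actual dsPIC code:

```c
#include <stdint.h>

#define ROM_SIZE 0x8000u   /* 32K of ROM at $8000-$FFFF */
#define RAM_SIZE 0x1000u   /* assumed 4K part; must be a power of two */

static const uint8_t rom[ROM_SIZE] = {0};  /* stand-in for the flash image */
static uint8_t ram[RAM_SIZE];

/* Read one byte from the emulated 64K address space. */
static uint8_t mem_read(uint16_t addr)
{
    if (addr & 0x8000u)
        return rom[addr & (ROM_SIZE - 1u)];
    return ram[addr & (RAM_SIZE - 1u)];    /* mirrors RAM through the low 32K */
}

/* Write one byte; writes that land in ROM are ignored. */
static void mem_write(uint16_t addr, uint8_t value)
{
    if (!(addr & 0x8000u))
        ram[addr & (RAM_SIZE - 1u)] = value;
}
```

With no I/O addresses to check, each access is one test of the top address bit plus a mask, which is what keeps the common path short.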

_________________
Andrew Jacobs
6502 & PIC Stuff - http://www.obelisk.me.uk/
Cross-Platform 6502/65C02/65816 Macro Assembler - http://www.obelisk.me.uk/dev65/
Open Source Projects - https://github.com/andrew-jacobs

GARTHWILSON

Posted: Mon Sep 20, 2010 7:52 am

Joined: Fri Aug 30, 2002 1:09 am
Posts: 8543
Location: Southern California

Quote:

I've been playing around at emulating a 65C02 on a 40MIPS dsPIC. [...]
Currently I estimate I'm getting the performance equivalent to a 2.5-3MHz 65C02

So, about 60 instructions on the dsPIC to emulate one 65C02 instruction. This does sound more in line with some experiments I did only mentally. The number for emulating an all-32-bit 6502 is staggering.
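
The arithmetic behind a figure like that can be sketched as follows, assuming an average of roughly 4 clock cycles per 65C02 instruction (that average is an assumption for the example, not a number from this thread). The function name is made up:

```c
/* Host instructions spent per emulated instruction.
 *   host_mips:  host throughput, millions of instructions per second
 *   emu_mhz:    apparent clock of the emulated CPU, in MHz
 *   avg_cycles: assumed average cycles per emulated instruction */
static double host_per_emulated(double host_mips, double emu_mhz,
                                double avg_cycles)
{
    /* host instructions per emulated cycle, times cycles per instruction */
    return host_mips / emu_mhz * avg_cycles;
}
```

With 40 MIPS and an equivalent speed of 2.5 to 3MHz, this gives somewhere between about 53 and 64 host instructions per emulated instruction under that assumption.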

kc5tja

Posted: Mon Sep 20, 2010 3:02 pm

Joined: Sat Jan 04, 2003 10:03 pm
Posts: 1706

Remember that a reasonably-optimized emulator for the Intel architecture can see 4 to 7 instructions retire per clock cycle, so the 10:1 ratio seems to mesh also with your estimations, Garth.

BitWise

Posted: Fri Dec 10, 2010 10:08 am

Joined: Tue Mar 02, 2004 8:55 am
Posts: 996
Location: Berkshire, UK

I found a PIC 24F device with 96K of RAM and have updated the code so that the embedded simulation gives a full 64K of RAM to the 65C02 emulation. On reset I copy 32K of 6502 ROM image from the PIC's flash program memory into the upper RAM area.

The device supports USB so I'm working on a composite CDC driver which will provide two serial connections, one to communicate with the simulator (for debugging) and one to communicate with the emulated 65C02. I'm also going to add VIA/PIA, ACIA and SIB emulation using the device hardware features.

The downside to using a 24F is that the maximum clock speed is lower and the simulation will end up being around 1.5-2.0MHz, but that is still better than the real 6502 in the BBC microcomputer I used to own.

Another pain is that the device is only available in surface mount packages. I'm planning on using a Schmartboard to mount the device and its decoupling capacitors, then use normal 0.1" strip board for the rest of the supporting circuitry. Made professionally you could get the whole thing on something the size of a 40-pin DIP, like the GODIL boards.

Should keep me busy over Xmas.

BigEd

Posted: Fri Dec 10, 2010 11:51 am

Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10985
Location: England

Interesting! Is that 32MHz, or lower? Dev board, or your own design? Do you expect to make your emulator sources available?

Cheers
Ed

BitWise

Posted: Fri Dec 10, 2010 2:10 pm

Joined: Tue Mar 02, 2004 8:55 am
Posts: 996
Location: Berkshire, UK

I'm using a 24FJ256GB206/210 at 32MHz (16 MIPS). It's my own design but there's not much to it: MCU + USB + ICSP + power regulator (maybe more later).

I have TQFPs for the 64 and 100 pin devices. I'm hoping Schmartboards will make hand soldering 0.4 and 0.5 mm spaced pins feasible with basic equipment. (They have grooved channels for the pins which are pre-tinned. You drop on the device and run a fine point iron up the groove to solder the pin. The outer edges of the board have 0.1" spaced holes for wire connections).

If the prototype works well then I'll design a proper PCB and see about having some made properly.

BigEd

Post subject: faster cheaper ARM dev board

Posted: Sat Oct 22, 2011 7:39 am

Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10985
Location: England

BigEd wrote:

It seemed like the mbed LPC1768 dev kit, which has a 96MHz ARM, might be a good platform for an efficient 6502 emulation, for £50 (USB powered, 40-pin 0.1 inch form factor, weird online-only toolchain) - but based on the above, it would be a struggle to emulate at more than 12MHz.

ChuckT let me know that Moore's law has brought us faster and cheaper ARM dev kits: the STM32F4DISCOVERY has a Cortex-M4 running at 168MHz for £12 (plus delivery)

Again, it's USB-powered, it has 100-pin breakout (but not in compact DIL footprint like BitWise's PIC-based modules). There's a choice of Windows toolchains but apparently Linux can be got to work too (see comments here).

PDFs: technical, marketing and benchmarks.

(The flash doesn't run at 168MHz, but there's a cache and some graphs implying that it keeps up.)

Mouser, Farnell, DigiKey

Edit: since there's a little doubt about the Linux recipes, here are some links:

http://hackaday.com/2011/10/17/how-to-d ... ing-linux/

http://cu.rious.org/make/stm32f4-discov ... ith-linux/

http://stm32.spacevs.com/index.php?opti ... Itemid=103

(You can use JTAG, but you don't need to - the USB is sufficient. You can connect a gdb session. Apparently.)

Edit: 192kByte RAM
Edit: can use a collection of the I/Os as an external memory bus

Edit: can emulate 6502 at about 18MHz. See https://github.com/BigEd/a6502

Last edited by BigEd on Mon May 25, 2015 2:35 pm, edited 3 times in total.

BigEd

Post subject: Re: emulator performance on embedded cpu

Posted: Tue Sep 11, 2012 7:36 pm

Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10985
Location: England

Right now, TI are taking pre-orders for an 80MHz ARM dev board - just $5 including free shipping worldwide. Max 2 boards per person, and you have to declare (or invent) a company name.

Only 32K RAM and no simple support for external memory (unlike the STM discovery board), but has 36 GPIO available and an FPU.

http://www.ti.com/product/lm4f120h5qr#feature
https://estore.ti.com/Stellaris-LaunchPad.aspx

BigEd

Post subject: Re: emulator performance on embedded cpu

Posted: Wed Sep 12, 2012 7:25 am

Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10985
Location: England

Be quick to catch this Kickstarter: from $22 for 48MHz ARM dev board

(will be available commercially in due course.)

Edit: just 16kByte RAM

Last edited by BigEd on Sat Sep 15, 2012 5:53 pm, edited 2 times in total.
