emulator performance on embedded cpu

Topics pertaining to the emulation or simulation of the 65xx microprocessors and their peripheral chips.
hoglet
Posts: 367
Joined: 29 Jun 2014

Re: emulator performance on embedded cpu

Post by hoglet »

BigEd wrote:
I don't know of a previous effort, or of source. Or even who wrote it! I do see a hint that it might not be 32-bit clean, so maybe watch out for that.
Yes, it wasn't, but I'm past that hurdle.

There's an initial check-in of the code here:
https://github.com/hoglet67/PiTubeClien ... 5tubeasm.S

I'm testing this running bare metal on a Raspberry Pi, acting as second processor to the Beeb. I've now got the startup banner coming up on the Beeb, so quite a lot is working.

The next hurdle is to add interrupt handling so I can do some more testing with real programs, like Klaus Dormann's test suites.

I'm wondering if it might actually be possible to use ARM interrupts directly, briefly enabling and then disabling them in the FETCH_NEXT macro. The problematic bit is that the 6502 IRQ is level sensitive, which is not directly supported by the BCM2835.

Dave

Post by hoglet »

If anyone is interested, I've now got this "Acorn 65tube" derived ARM 65C02 emulation running successfully.

In my environment, it's about 70% faster than using lib6502.

And, after a bit of bug fixing, it passes the 6502 and 65C02 Dormann tests.

The latest code is here.

The limiting factor in speed is now:
- checking for IRQ/NMI/RST every N instructions (N=16 at the moment, which gives ~2us interrupt latency)
- checking all memory accesses against the range FEF8-FEFF for special handling of the tube registers.
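The two checks listed above can be sketched in C. This is only an illustration, not the code from the emulator (which is ARM assembly): `is_tube_register`, `interrupt_poll_due` and `POLL_INTERVAL` are hypothetical names. The address test relies on FEF8-FEFF being an 8-byte aligned block, so a single mask and compare covers the whole range.

```c
#include <stdbool.h>
#include <stdint.h>

#define POLL_INTERVAL 16u   /* N from the post: instructions between interrupt checks */

/* True for the eight tube registers at 0xFEF8..0xFEFF.
   The range is 8-byte aligned, so one mask and one compare suffice. */
bool is_tube_register(uint16_t addr)
{
    return (addr & 0xFFF8u) == 0xFEF8u;
}

/* Hypothetical dispatch-loop helper: returns true once every POLL_INTERVAL
   calls, which is the point where IRQ/NMI/RST would be sampled.
   *countdown must start at POLL_INTERVAL. */
bool interrupt_poll_due(unsigned *countdown)
{
    if (--*countdown != 0)
        return false;
    *countdown = POLL_INTERVAL;
    return true;
}
```

A larger interval lowers the polling cost per instruction but raises the worst-case interrupt latency, which is the trade-off behind the N=16 figure.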

There's a slightly longer update here:
http://stardot.org.uk/forums/viewtopic. ... 47#p127347

What's left to do is:
- run Bruce Clark's BCD tests
- look for a more efficient pattern for trapping accesses to certain address ranges
- cleaner separation of the core 6502 emulation from my use of it as an emulated Beeb Co Processor

Dave
Last edited by hoglet on Mon Dec 21, 2015 10:24 pm, edited 2 times in total.

Post by BigEd »

Brilliant! A big step forward for 6502 on ARM!

Edit: now running a great deal faster than lib6502, see the linked stardot thread for more.
Last edited by BigEd on Fri Jan 08, 2016 7:06 pm, edited 1 time in total.

Post by stecdose »

I'm working on an emulator on an STM32F103 with 256 KB of flash and 48 KB of RAM.

It is written in C. The CPU core is just under 900 lines. It uses ~9.5 KB of flash and ~50 bytes of RAM, excluding the emulated memory.

I've assigned a few KB to this emulator and assembled the skier and the random demo from 6502asm.org. Both run at about 1.03 MHz without graphics.

There is a 3.2" TFT attached to the controller, and a quick hack gave me that 32x32 pixel screen, zoomed by 7, on the TFT.
This slows the emulator down to about 440.7 kHz.

These speed values were "measured" by letting the CPU run for a while and printing status information, including ticks, every 1000 ms.
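A measurement of this kind boils down to dividing a cycle-counter delta by the elapsed time. Here is a minimal sketch, assuming "ticks" means emulated 6502 clock cycles; `emulated_hz` is a hypothetical helper, not taken from the emulator described above.

```c
#include <stdint.h>

/* Emulated clock in Hz from two samples of a running cycle ("tick") counter
   taken elapsed_ms milliseconds apart. Returns 0 if no time has elapsed. */
uint32_t emulated_hz(uint64_t ticks_before, uint64_t ticks_after, uint32_t elapsed_ms)
{
    if (elapsed_ms == 0)
        return 0;
    return (uint32_t)((ticks_after - ticks_before) * 1000u / elapsed_ms);
}
```

With a 1000 ms reporting interval the tick delta is already the frequency in Hz, so 1,030,000 ticks between two reports corresponds to 1.03 MHz.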

I can post more about this soon. If you have questions before then, just ask.

I just registered because I stumbled across this thread while doing the same thing.

Post by hoglet »

BigEd wrote:
Brilliant! A big step forward for 6502 on ARM!
Thanks Ed, I'm sure there is still considerable room for improvement.

I'm not convinced my conversion from 26-bit ARM to 32-bit ARM was optimal.

Here's an example, the CPY instruction, where the value of the V flag needs to be preserved because ARM's CMP sets V but the 6502's CPY does not.

The old 26-bit code was:

// Opcode C0 - CPY #$00
    3cf0:   e4da0001    ldrb    r0, [sl], #1
    3cf4:   e1a0200f    mov r2, pc
    3cf8:   e1580c00    cmp r8, r0, lsl #24
    3cfc:   e022200f    eor r2, r2, pc
    3d00:   e2022201    and r2, r2, #268435456  ; 0x10000000
    3d04:   e132f00f    teqp    r2, pc
// Fetch next
    3d08:   e4da0001    ldrb    r0, [sl], #1
    3d0c:   e08cf300    add pc, ip, r0, lsl #6
The new 32-bit code is:

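// Opcode C0 - CPY #$00
.align I_ALIGN
opcode_C0:
        ldrb    r0, [sl], #1            // fetch the immediate operand, advance the 6502 PC pointer
        mrs     r2, CPSR                // read the flags before the compare
        and     r2, r2, #V_FLAG         // keep only the old V bit
        cmp     r8, r0, lsl #24         // compare Y (r8) with the left-justified operand
        mrs     r1, CPSR                // read the flags produced by the compare
        bic     r1, r1, #V_FLAG         // drop the V bit that cmp produced
        orr     r1, r1, r2              // put the old V bit back
        msr     CPSR_flg, r1            // write the flag bits only
        FETCH_NEXT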
So two additional instructions.
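
For reference, the 6502 behaviour that both versions have to reproduce can be written out in C. This is a sketch of the documented CPY semantics (N, Z and C updated, V untouched), not a translation of the assembly above; `cpy_flags` is a hypothetical name.

```c
#include <stdint.h>

/* 6502 status register bits */
#define P_C 0x01u
#define P_Z 0x02u
#define P_V 0x40u
#define P_N 0x80u

/* CPY: compare Y with an operand. Updates N, Z and C; every other bit of P,
   including V, passes through unchanged. */
uint8_t cpy_flags(uint8_t p, uint8_t y, uint8_t operand)
{
    uint8_t diff = (uint8_t)(y - operand);
    p &= (uint8_t)~(P_N | P_Z | P_C);
    if (y >= operand) p |= P_C;      /* carry set means no borrow */
    if (diff == 0)    p |= P_Z;
    if (diff & 0x80u) p |= P_N;
    return p;
}
```

The case Y=$80, operand=$01 is the awkward one: a signed ARM compare of the left-justified values overflows and sets V, while the 6502 leaves V alone.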

I must admit, I struggled to understand how the original code was working. It only makes sense if TEQP r2, pc toggles the appropriate flag bit. I couldn't find a good description of this behaviour in any documentation though.

Can anyone come up with a more efficient pattern?

Dave

Post by Arlet »

According to my copy of the ARM ARM (DDI 0100B) the TEQP performs an EOR operation between the operands, and then overwrites the flag bits in the PC with the result, so it would toggle the V bit if it was wrong.

Post by BigEd »

According to http://www.heyrick.co.uk/armwiki/TEQ, the V flag is not affected... and that would also be true if Acorn's code is correct, which it almost certainly is. The Peter Cockerell book at http://www.peter-cockerell.net/aalp/ doesn't mention V as a special case. I wonder if this instruction changed...

This is perhaps something to be checked using VisualARM.

Edit: this is the thumb2 code from a6502:

.macro do_compare addressmode reg
	@ somewhat like addsub, but ARM CMP sets V flag whereas 6502 CMP does not
	load_operand_and_ea \addressmode r0 r1	@ in this case we don't need the effective address
	mov	r0, r0, asl #24   @ we left-justify to get N correct
	mov	r1, \reg, asl #24
	setflags_pre
	cmp	r1, r0		@ this has set N and Z, not V (or C)
	MRS	r0, APSR	@ these manipulations feel pedestrian - could probably do better
	bic	r0, #armVflag
	tst	rPbyteHi, #armVflag
	it ne
	orrne	r0, #armVflag
	mov	rPbyteHi, r0
.endm
Last edited by BigEd on Tue Dec 22, 2015 9:21 am, edited 1 time in total.

Post by BigEd »

stecdose wrote:
I just registered, because i stumbled across this thread while doing the same.
Welcome! Feel free to introduce yourself over at the Introduce Yourself thread.

Post by Arlet »

The TEQ instruction does not affect the V flag, but that does not matter here. In this case, the flag bits are copied directly from the ALU output, which is equal to EOR of the two operands (since the TEQ is implemented by EOR).
arm.jpg
See also http://morrow.ece.wisc.edu/ECE353/arm_r ... rm_arm.pdf chapter A8.3.
Here's a picture of R15 (PC/flags register).
reg15.png
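
To make the toggle explicit, here is a small C model of the MOV/CMP/EOR/AND/TEQP sequence from the 26-bit CPY code. It is a sketch under the assumption that R15 is read with its flag bits in each of these instructions, as described above; `flags_after_teqp` is a hypothetical name and the model ignores the mode and interrupt bits.

```c
#include <stdint.h>

/* 26-bit ARM R15 layout: flags N Z C V in bits 31..28, I and F in bits 27..26,
   the word-aligned PC in bits 25..2, and the processor mode in bits 1..0. */
#define R15_V      0x10000000u
#define R15_FLAGS  0xF0000000u

/* Model of the flag-restoring sequence from the 26-bit CPY code.
   r15_before: R15 as read by "mov r2, pc" before the compare
   r15_after:  R15 as read after the compare has set the flags
   Returns the flag bits that TEQP leaves in R15. */
uint32_t flags_after_teqp(uint32_t r15_before, uint32_t r15_after)
{
    uint32_t r2 = r15_before ^ r15_after;  /* eor r2, r2, pc: 1 bits where the two reads differ */
    r2 &= R15_V;                           /* and r2, r2, #0x10000000: keep only "V changed" */
    uint32_t alu = r2 ^ r15_after;         /* teqp r2, pc: the ALU output is the EOR */
    return alu & R15_FLAGS;                /* TEQP copies the result's flag bits into R15 */
}
```

If the compare changed V, bit 28 of r2 is 1 and the final EOR flips V back to its old value; N, Z and C come through from the compare unchanged. The PC bits differ between the two reads, but the AND masks them away.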

Post by JimDrew »

Well, after a year's hiatus I am back working on my CBM disk drive emulator. I have switched to a dsPIC33EP512GM304, which gives me a 25MHz SPI bus - much better for the SD reads/writes. Since I don't need any more than a few MHz in performance (I am at nearly 16MHz 6502 speeds now), I decided to make the instruction fetch an interrupt. The peripherals are also using interrupts now. This sure makes it nice because the main loop now really just handles the OLED, SD control, and UI. Everything stays perfectly in sync because the interrupt clock is adjusted by each instruction's cycle length (taking into account crossing page boundaries and other nuances). I also dropped the CPU speed from 70MIPS to 64MIPS because that made everything an even multiple of 2 for the clock dividers. I should have made this thing interrupt driven before; it sure makes handling non-emulation stuff a breeze. If you aren't looking for the absolute fastest emulation, and need it cycle exact, I would highly recommend making the instruction fetch/6502 emulation an interrupt routine.
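
As a rough illustration of adjusting the interrupt clock per instruction, the C sketch below computes the delay until the next fetch interrupt for an indexed read. It is not the dsPIC code: the function names are hypothetical, and the ratio of 64 host cycles per 6502 cycle assumes a 64 MIPS host emulating a 1MHz 6502.

```c
#include <stdbool.h>
#include <stdint.h>

#define HOST_CYCLES_PER_6502_CYCLE 64u  /* assumed: 64 MIPS host, 1 MHz emulated 6502 */

/* True when indexing moved the effective address into a different 256-byte page. */
bool page_crossed(uint16_t base, uint8_t index)
{
    return ((base + index) & 0xFF00u) != (base & 0xFF00u);
}

/* 6502 cycles for an indexed read such as LDA abs,X:
   4 cycles, plus 1 when the index crosses a page boundary. */
unsigned abs_indexed_read_cycles(uint16_t base, uint8_t index)
{
    return 4u + (page_crossed(base, index) ? 1u : 0u);
}

/* Host timer ticks until the next instruction-fetch interrupt (hypothetical helper). */
uint32_t next_fetch_delay(unsigned cycles_6502)
{
    return cycles_6502 * HOST_CYCLES_PER_6502_CYCLE;
}
```

For example, LDA $1280,X with X=$80 crosses into page $13, so it takes 5 cycles and the next fetch would be scheduled 320 host cycles later.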

Post by BigEd »

Pretty impressive to get 16MHz emulation from a 128MHz CPU! Anything better than a 10x penalty for emulation is good going. It's interesting that even the overhead of taking and returning from interrupts isn't hurting performance so much - I would have expected to have to make every host CPU cycle count. Especially if each instruction has to tweak the time of the next interrupt according to emulated cycle count.

Post by JimDrew »

I *was* getting almost 16MHz average speed @70MIPS (140MHz) without the instruction fetch/emulation being interrupt based. I am not sure what the max speed would be now, but using the interrupt to fetch the next instruction and emulate it is not the way to go if you want the fastest emulation. The interrupt overhead adds about 8 PIC CPU cycles (out of 64 cycles per 1us), so at least a 12.5% penalty. The reason for using the interrupt is to make it cycle exact. There is an instruction clock that I set up on a timer that triggers the interrupt that fetches the next instruction. As part of the instruction decoding, the timer value is updated. I only need a 1MHz (or 2MHz, depending on the drive) CPU, so I have plenty of time left over. I don't have a flat 64K model, so I have to maintain both a logical and a physical address. I actually do that by always using the translated address and changing it only on a jmp/jsr/rts/int. So, it's pretty quick, but still nowhere near as fast as it could be if no translation was required. With CBM drives, the translation is required because the peripherals are not fully decoded, so they appear mirrored in several memory locations. As part of the translation I decode any of the mirrors to the physical address that is used in the PIC's memory to store that peripheral's data.

I might look at doing this on a high speed STM32 (in assembly) to see how fast it could really go. One tip I can add is this... for instructions that do not affect a lot of flags, do a simple bit clear/set for the 6502 flags. All other instructions you are better off using a lookup table to translate the flags... a load/AND/OR (especially if you can store while OR'ing).
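
The lookup-table tip might look something like this in C. It is a sketch only: the host flag layout (`HOST_C`, `HOST_Z`, `HOST_V`, `HOST_N`) is an assumption for illustration and would need to match the real host status register, and `flag_table`, `init_flag_table` and `update_nvzc` are hypothetical names.

```c
#include <stdint.h>

/* Assumed host status bits (hypothetical layout, adjust for the real CPU). */
#define HOST_C 0x1u
#define HOST_Z 0x2u
#define HOST_V 0x4u
#define HOST_N 0x8u

/* 6502 status register bits */
#define P_C 0x01u
#define P_Z 0x02u
#define P_V 0x40u
#define P_N 0x80u
#define P_NVZC (P_N | P_V | P_Z | P_C)

/* Maps the low four host flag bits to the matching 6502 P bits. */
uint8_t flag_table[16];

/* Build the table once at start-up. */
void init_flag_table(void)
{
    for (unsigned h = 0; h < 16; h++) {
        uint8_t p = 0;
        if (h & HOST_C) p |= P_C;
        if (h & HOST_Z) p |= P_Z;
        if (h & HOST_V) p |= P_V;
        if (h & HOST_N) p |= P_N;
        flag_table[h] = p;
    }
}

/* Load / AND / OR: replace N, V, Z and C in p with the translated host flags,
   leaving the other bits (I, D, B) untouched. */
uint8_t update_nvzc(uint8_t p, unsigned host_flags)
{
    return (uint8_t)((p & ~P_NVZC) | flag_table[host_flags & 0xFu]);
}
```

Instructions that touch a single flag, such as SEC or CLC, would skip the table and set or clear the bit in `p` directly.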
Last edited by JimDrew on Thu May 19, 2016 3:13 pm, edited 1 time in total.

Post by BigEd »

Ah, I see, two slightly different goals there. It does make good sense - if you have the horsepower - to use interrupts to ensure cycle accuracy. Although, I imagine you can only make each instruction run for the right length of time - presumably not make each memory access occur in exactly the right cycle?

Post by JimDrew »

Correct, I don't have to worry about memory access timing, except between the CPU and two VIAs. I have a check in the VIA to adjust the read/write of the timers.

Post by hoglet »

Hi Guys,

Just a quick update to say I've got some updated performance figures for the ARM 65C02 emulation code a few of us have been working on over on the StarDot forums.

The headline figure is now 225MHz on a Pi Zero (1000MHz ARM clock / 250MHz core clock)
IMG_0474.JPG
There is a historic summary of the work in a new thread here:
http://www.stardot.org.uk/forums/viewto ... 30&t=11328

The code is in git:
https://github.com/hoglet67/PiTubeClien ... 5tubeasm.S

Dave