Re: emulator performance on embedded cpu
Posted: Sun Dec 20, 2015 7:50 pm
by hoglet
I don't know of a previous effort, or of source. Or even who wrote it! I do see a hint that it might not be 32-bit clean, so maybe watch out for that.
Yes, it wasn't, but I'm past that hurdle.
There's an initial check-in of the code here:
https://github.com/hoglet67/PiTubeClien ... 5tubeasm.S
I'm testing this running bare metal on a Raspberry Pi, acting as second processor to the Beeb. I've now got the startup banner coming up on the Beeb, so quite a lot is working.
The next hurdle is to add interrupt handling so I can do some more testing with real programs, like Klaus Dormann's test suites.
I'm wondering if it might actually be possible to use ARM interrupts directly, briefly enabling and then disabling them in the FETCH_NEXT macro. The problematic bit is that IRQ is level sensitive, which is not supported directly by the BCM2835.
Dave
Re: emulator performance on embedded cpu
Posted: Mon Dec 21, 2015 9:02 pm
by hoglet
If anyone is interested, I've now got this "Acorn 65tube" derived ARM 65C02 emulation running successfully.
In my environment, it's about 70% faster than using lib6502.
And, after a bit of bug fixing, it passes the 6502 and 65C02 Dormann tests.
The latest code is here.
The limiting factor in speed is now:
- checking for IRQ/NMI/RST every N instructions (N=16 at the moment, which gives ~2us interrupt latency)
- checking all memory accesses against the range FEF8-FEFF for special handling of the tube registers.
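The two per-instruction costs listed above can be sketched in C. This is only an illustration under stated assumptions, not the code in 65tubeasm.S: `POLL_INTERVAL`, `is_tube_address` and `poll_due` are hypothetical names. Since FEF8-FEFF is eight bytes on an eight-byte boundary, the range test reduces to one AND and one compare.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define POLL_INTERVAL 16u      /* N from the post */
#define TUBE_BASE     0xFEF8u  /* first tube register address */
#define TUBE_MASK     0xFFF8u  /* ignores the low three address bits */

/* True when addr lies in FEF8-FEFF. */
static bool is_tube_address(uint16_t addr)
{
    return (addr & TUBE_MASK) == TUBE_BASE;
}

/* Hypothetical helper, called once per emulated instruction.
 * Returns true on every POLL_INTERVAL-th call, which is when the
 * IRQ/NMI/RST lines would be sampled. */
static bool poll_due(unsigned *countdown)
{
    assert(*countdown >= 1u && *countdown <= POLL_INTERVAL);
    if (--*countdown != 0u)
        return false;
    *countdown = POLL_INTERVAL;  /* reload for the next interval */
    return true;
}
```

A larger `POLL_INTERVAL` lowers the average cost per instruction but lengthens the worst-case interrupt latency.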
There's a slightly longer update here:
http://stardot.org.uk/forums/viewtopic. ... 47#p127347
What's left to do is:
- run Bruce Clark's BCD tests
- look for a more efficient pattern for trapping accesses to certain address ranges
- cleaner separation of the core 6502 emulation from my use of it as an emulated Beeb Co Processor
Dave
Re: emulator performance on embedded cpu
Posted: Mon Dec 21, 2015 9:08 pm
by BigEd
Brilliant! A big step forward for 6502 on ARM!
Edit: now running a great deal faster than lib6502, see the linked stardot thread for more.
Re: emulator performance on embedded cpu
Posted: Mon Dec 21, 2015 11:16 pm
by stecdose
I'm working on an emulator on an STM32F103 with 256 KB of flash and 48 KB of RAM.
It is written in C. The CPU is just under 900 lines. It uses ~9.5 kB of flash and ~50 bytes of RAM, excluding the emulated memory.
I've assigned a few KB to this emulator and assembled the skier and random demos from 6502asm.org. Both run at about 1.03 MHz without graphics.
There is a 3.2" TFT attached to the controller, and a quick hack gave me that 32x32 pixel screen, zoomed by 7, on the TFT.
This slows the emulator down to about 440.7 kHz.
These speed values were "measured" by letting the CPU run for a while and printing status information, including ticks, every 1000 ms.
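The measurement described above comes down to dividing the ticks counted by the length of the reporting interval. A minimal sketch, assuming a hypothetical helper `emulated_hz` that receives the tick count and the interval in milliseconds:

```c
#include <assert.h>
#include <stdint.h>

/* Emulated clock rate in Hz, given the emulated cycles counted
 * during an interval of interval_ms milliseconds. */
static uint32_t emulated_hz(uint64_t cycles, uint32_t interval_ms)
{
    assert(interval_ms > 0u);
    return (uint32_t)(cycles * 1000u / interval_ms);
}
```

With a 1000 ms interval the tick count is the rate in Hz, so 1,030,000 ticks corresponds to 1.03 MHz.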
I can post more about this soon; if you have questions before then, just ask.
I just registered because I stumbled across this thread while doing the same.
Re: emulator performance on embedded cpu
Posted: Tue Dec 22, 2015 5:17 am
by hoglet
BigEd wrote: "Brilliant! A big step forward for 6502 on ARM!"
Thanks Ed, I'm sure there is still considerable room for improvement.
I'm not convinced my conversion from 26-bit ARM to 32-bit ARM was optimal.
Here's an example, the CPY instruction, where the value of the V flag needs to be preserved.
The old 26-bit code was:
Code:
// Opcode C0 - CPY #$00
3cf0: e4da0001 ldrb r0, [sl], #1
3cf4: e1a0200f mov r2, pc
3cf8: e1580c00 cmp r8, r0, lsl #24
3cfc: e022200f eor r2, r2, pc
3d00: e2022201 and r2, r2, #268435456 ; 0x10000000
3d04: e132f00f teqp r2, pc
// Fetch next
3d08: e4da0001 ldrb r0, [sl], #1
3d0c: e08cf300 add pc, ip, r0, lsl #6
The new 32-bit code is:
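Code:
// Opcode C0 - CPY #$00
.align I_ALIGN
opcode_C0:
    ldrb r0, [sl], #1
    mrs r2, CPSR            // read the flags before the compare
    and r2, r2, #V_FLAG     // keep only the old V bit
    cmp r8, r0, lsl #24     // compare; this updates N, Z, C and V
    mrs r1, CPSR            // read the flags after the compare
    bic r1, r1, #V_FLAG     // discard the new V
    orr r1, r1, r2          // put the old V back
    msr CPSR_flg, r1        // write the flags field
    FETCH_NEXT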
So two additional instructions.
I must admit, I struggled to understand how the original code was working. It only makes sense if TEQP r2, pc toggles the appropriate flag bit. I couldn't find a good description of this behaviour in any documentation though.
Can anyone come up with a more efficient pattern?
Dave
Re: emulator performance on embedded cpu
Posted: Tue Dec 22, 2015 7:40 am
by Arlet
According to my copy of the ARM ARM (DDI 0100B) the TEQP performs an EOR operation between the operands, and then overwrites the flag bits in the PC with the result, so it would toggle the V bit if it was wrong.
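The toggle described above can be modelled in C. This is a sketch that tracks only the flag word, not the PC bits that share R15 in 26-bit mode. The function names are hypothetical, and `V_FLAG` is taken from the `0x10000000` mask in the 26-bit listing. The first function mirrors the EOR/AND/TEQP sequence, the second the BIC/ORR sequence of the 32-bit version.

```c
#include <stdint.h>

#define V_FLAG 0x10000000u  /* bit 28 of the flag word */

/* 26-bit pattern: XOR the flags from before and after the compare,
 * keep only the V bit of that difference, then XOR it into the new
 * flags. V flips back if, and only if, the compare changed it. */
static uint32_t preserve_v_by_toggle(uint32_t before, uint32_t after)
{
    uint32_t changed = (before ^ after) & V_FLAG;
    return after ^ changed;
}

/* 32-bit pattern: clear V in the new flags, then OR in the saved V. */
static uint32_t preserve_v_by_mask(uint32_t before, uint32_t after)
{
    return (after & ~V_FLAG) | (before & V_FLAG);
}
```

Both functions return the same value for every input, which is why the conversion is correct; the difference is only in how many instructions each pattern needs on the host.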
Re: emulator performance on embedded cpu
Posted: Tue Dec 22, 2015 9:18 am
by BigEd
According to http://www.heyrick.co.uk/armwiki/TEQ, the V flag is not affected... and that would also be true if Acorn's code is correct, which it almost certainly is. The Peter Cockerell book at http://www.peter-cockerell.net/aalp/ doesn't mention V as a special case. I wonder if this instruction changed...
This is perhaps something to be checked using VisualARM.
Edit: this is the thumb2 code from a6502:
Code:
.macro do_compare addressmode reg
    @ somewhat like addsub, but ARM CMP sets V flag whereas 6502 CMP does not
    load_operand_and_ea \addressmode r0 r1 @ in this case we don't need the effective address
    mov r0, r0, asl #24 @ we left-justify to get N correct
    mov r1, \reg, asl #24
    setflags_pre
    cmp r1, r0 @ this has set N and Z, not V (or C)
    MRS r0, APSR @ these manipulations feel pedestrian - could probably do better
    bic r0, #armVflag
    tst rPbyteHi, #armVflag
    it ne
    orrne r0, #armVflag
    mov rPbyteHi, r0
.endm
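For reference, the 6502 behaviour that the macro has to reproduce can be written out in C. This is a sketch of the architectural result only, not a translation of a6502; `compare_flags` and the `P_*` names are hypothetical. A 6502 compare updates N, Z and C and leaves V untouched, which is why the host's V has to be patched back after an ARM CMP.

```c
#include <stdint.h>

/* 6502 status register bits */
#define P_C 0x01u
#define P_Z 0x02u
#define P_V 0x40u
#define P_N 0x80u

/* Returns the new status byte after comparing reg with operand
 * (CMP, CPX or CPY). Only N, Z and C are updated. */
static uint8_t compare_flags(uint8_t p, uint8_t reg, uint8_t operand)
{
    uint8_t diff = (uint8_t)(reg - operand);

    p &= (uint8_t)~(P_N | P_Z | P_C);
    if (diff & 0x80u)   p |= P_N;  /* sign of the 8-bit difference */
    if (diff == 0u)     p |= P_Z;  /* operands were equal */
    if (reg >= operand) p |= P_C;  /* carry set means no borrow */
    return p;
}
```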
Re: emulator performance on embedded cpu
Posted: Tue Dec 22, 2015 9:19 am
by BigEd
stecdose wrote: "I just registered because I stumbled across this thread while doing the same."
Welcome! Feel free to introduce yourself over at the Introduce Yourself thread.
Re: emulator performance on embedded cpu
Posted: Tue Dec 22, 2015 9:35 am
by Arlet
The TEQ instruction does not affect the V flag, but that does not matter here. In this case, the flag bits are copied directly from the ALU output, which is equal to EOR of the two operands (since the TEQ is implemented by EOR).
See also http://morrow.ece.wisc.edu/ECE353/arm_r ... rm_arm.pdf chapter A8.3.
Here's a picture of R15 (PC/flags register).
Re: emulator performance on embedded cpu
Posted: Wed May 18, 2016 7:16 pm
by JimDrew
Well, after a year's hiatus I am back working on my CBM disk drive emulator. I have switched to a dsPIC33EP512GM304, which gives me a 25MHz SPI bus - much better for the SD reads/writes. Since I don't need any more than a few MHz in performance (I am at nearly 16MHz 6502 speeds now), I decided to make the instruction fetch an interrupt. The peripherals are also using interrupts now. This sure makes it nice because the main loop really now just handles the OLED, SD control, and UI. Everything is perfectly in sync, adjusting the interrupt clock by the instruction cycle length (taking into account crossing page boundaries and other nuances). I also dropped the CPU speed from 70MIPS to 64MIPS because that made everything an even multiple of 2 for the clock dividers. I should have made this thing interrupt driven before; it sure makes handling non-emulation stuff a breeze. If you aren't looking for the absolute fastest emulation, and need it cycle exact, I would highly recommend looking at making the instruction fetch/6502 emulation an interrupt routine.
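The timer adjustment described above might look something like the following C sketch. It is an illustration, not Jim's dsPIC code: all three function names and the `ticks_per_cycle` parameter are assumptions. It uses LDA absolute,X as the example, which takes 4 cycles on a 6502 plus 1 when the indexed address crosses a page boundary.

```c
#include <stdbool.h>
#include <stdint.h>

/* True when base and base + index lie in different 256-byte pages. */
static bool crosses_page(uint16_t base, uint8_t index)
{
    return ((base ^ (uint16_t)(base + index)) & 0xFF00u) != 0u;
}

/* Cycle count for LDA abs,X: 4, plus 1 on a page crossing. */
static unsigned lda_abs_x_cycles(uint16_t base, uint8_t x)
{
    return 4u + (crosses_page(base, x) ? 1u : 0u);
}

/* Hypothetical: host timer ticks to wait before the next instruction
 * fetch, given how many host ticks make up one emulated cycle. */
static uint32_t next_fetch_ticks(unsigned cycles, uint32_t ticks_per_cycle)
{
    return (uint32_t)cycles * ticks_per_cycle;
}
```

Each instruction handler would compute its own cycle count in this way and reload the timer with the result, so the next fetch interrupt arrives when the real CPU would have started its next instruction.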
Re: emulator performance on embedded cpu
Posted: Thu May 19, 2016 8:12 am
by BigEd
Pretty impressive to get 16MHz emulation from a 128MHz CPU! Anything better than a 10x penalty for emulation is good going. It's interesting that even the overhead of taking and returning from interrupts isn't hurting performance so much - I would have expected to have to make every host CPU cycle count. Especially if each instruction has to tweak the time of the next interrupt according to emulated cycle count.
Re: emulator performance on embedded cpu
Posted: Thu May 19, 2016 3:06 pm
by JimDrew
I *was* getting almost 16MHz average speed @70MIPS (140MHz) without the instruction fetch/emulation being interrupt based. I am not sure what the max speed would be now, but using the interrupt to fetch the next instruction and emulate it is not the way to go if you want the fastest emulation. The interrupt overhead adds about 8 PIC CPU cycles (64 cycles per 1us), so at least a 12.5% penalty. The reason for using the interrupt is to make it cycle exact. There is an instruction clock that I set up on a timer that triggers the interrupt that fetches the next instruction. As part of the instruction decoding, the timer value is updated. I only need a 1MHz (or 2MHz, depending on the drive) CPU, so I have plenty of time left over. I don't have a flat 64K model, so I have to maintain a logical/physical address. I actually do that by always using the translated address and changing it only on a jmp/jsr/rts/int. So, it's pretty quick, but still nowhere near as fast as it could be if no translation was required. With CBM drives, the translation is required because the peripherals are not fully decoded, so they appear mirrored in several memory locations. As part of the translation I decode any of the mirrors to the physical address that is used in the PIC's memory to store that peripheral's data.
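Folding mirrors back to one address can be sketched as below. The memory map here is purely hypothetical and chosen for illustration: a 16-register peripheral at 0x1800 whose upper address bits are not decoded, so it repeats every 16 bytes up to 0x1BFF. The real drive maps and Jim's translation code may differ.

```c
#include <stdint.h>

/* Hypothetical partially decoded peripheral window. */
#define PERIPH_FIRST 0x1800u
#define PERIPH_LAST  0x1BFFu
#define PERIPH_REGS  0x000Fu  /* only the low four bits select a register */

/* Maps any mirror of the peripheral to its first image; all other
 * addresses pass through unchanged. */
static uint16_t canonical_address(uint16_t addr)
{
    if (addr >= PERIPH_FIRST && addr <= PERIPH_LAST)
        return (uint16_t)(PERIPH_FIRST | (addr & PERIPH_REGS));
    return addr;
}
```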
I might look at doing this on a high speed STM32 (in assembly) to see how fast it could really go. One tip I can add is this... for instructions that do not affect a lot of flags, do a simple bit clear/set for the 6502 flags. All other instructions you are better off using a lookup table to translate the flags... a load/AND/OR (especially if you can store while OR'ing).
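The flag tip above can be illustrated in C. This is a sketch under assumptions, not Jim's code: the host flags are taken to arrive as a 4-bit index ordered N, Z, C, V (an ARM-style order), and `host_to_p`, `update_all_flags`, `set_carry` and `clear_carry` are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

/* 6502 status register bits */
#define P_C    0x01u
#define P_Z    0x02u
#define P_V    0x40u
#define P_N    0x80u
#define P_NVZC (P_N | P_V | P_Z | P_C)

/* Index bits: N=8, Z=4, C=2, V=1 (assumed host packing).
 * Each entry holds the matching 6502 N, V, Z and C bits. */
static const uint8_t host_to_p[16] = {
    0x00, 0x40, 0x01, 0x41,  /* N=0 Z=0 */
    0x02, 0x42, 0x03, 0x43,  /* N=0 Z=1 */
    0x80, 0xC0, 0x81, 0xC1,  /* N=1 Z=0 */
    0x82, 0xC2, 0x83, 0xC3,  /* N=1 Z=1 */
};

/* Load, AND, OR: replace N, V, Z and C in p with the translated flags. */
static uint8_t update_all_flags(uint8_t p, unsigned host_nzcv)
{
    assert(host_nzcv < 16u);
    return (uint8_t)((p & (uint8_t)~P_NVZC) | host_to_p[host_nzcv]);
}

/* Cheap path for instructions that touch one flag, such as SEC and CLC. */
static uint8_t set_carry(uint8_t p)   { return (uint8_t)(p | P_C); }
static uint8_t clear_carry(uint8_t p) { return (uint8_t)(p & (uint8_t)~P_C); }
```

The table costs 16 bytes and removes all per-flag branching from instructions such as ADC that update several flags at once.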
Re: emulator performance on embedded cpu
Posted: Thu May 19, 2016 3:13 pm
by BigEd
Ah, I see, two slightly different goals there. It does make good sense - if you have the horsepower - to use interrupts to ensure cycle accuracy. Although, I imagine you can only make each instruction run for the right length of time - presumably not make each memory access occur in exactly the right cycle?
Re: emulator performance on embedded cpu
Posted: Thu May 19, 2016 3:20 pm
by JimDrew
Correct, I don't have to worry about memory access timing, except between the CPU and two VIAs. I have a check in the VIA to adjust the read/write of the timers.
Re: emulator performance on embedded cpu
Posted: Sat Jun 18, 2016 12:59 pm
by hoglet
Hi Guys,
Just a quick update to say I've got some updated performance figures for the ARM 65C02 emulation code a few of us have been working on over on the StarDot forums.
The headline figure is now 225MHz on a Pi Zero (1000MHz ARM clock / 250MHz core clock)
There is a historic summary of the work in a new thread here:
http://www.stardot.org.uk/forums/viewto ... 30&t=11328
The code is in git:
https://github.com/hoglet67/PiTubeClien ... 5tubeasm.S
Dave