Re: emulator performance on embedded cpu
Posted: Wed Jun 24, 2015 11:05 am
by BigEd
Elsewhere:
My PIC based 6502 emulator takes 90.757us to run those 1141us worth of instructions (~12.57MHz). I didn't take out the code for any instruction where a page boundary can be crossed (which corrects the cycle counter), so it could be a bit faster still if I did that. That is using a 70MIPS PIC24EP CPU (assembly code of course - I only write in assembly, no matter what CPU I am using).
I would bet that if I used a PIC32 @200MHz I could get close to 50MHz 6502 emulation. The PIC24EP has 3 cycle memory fetches where the PIC32 has single cycle.
It would be interesting to see this done! The best so far, I think, is about 20MHz on a 168MHz ARM. That code can certainly be improved, as it keeps saving and restoring the flags - with a bit of care it could mostly leave them in place. As it happens, the ARM's use of flags is very like the 6502's (not a coincidence.) But this ARM platform is hampered by slow Flash - there's a cache, but probably not big enough. Moving code around to make use of RAM or better use of the cache might get some gains.
Re: emulator performance on embedded cpu
Posted: Wed Jun 24, 2015 2:47 pm
by JimDrew
My code is set up to hold off instruction fetch to emulate the exact cycle time of a 1MHz 6502. So, I can get faster results with the PIC24 by removing the cycle counting boundary checks for any instruction that can cross a page (which is quite a few). Also, at the end of each instruction is a GOTO W14, which jumps to whatever W14 is pointing to. That typically points to the routine (macro) to fetch the next instruction and then BRA to that handling code (just 2 instructions). So, I could change that macro so that it just does the same lookup/branch after each instruction instead of the GOTO W14, increasing the code size but saving 2 PIC cycles per instruction. An interrupt routine can change W14 to point to whatever housekeeping still has to be done. I use this GOTO W14 because I needed a way to handle specific hardware writes, but I can still do this and save a couple of cycles (14.285ns each) per 6502 instruction.
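For readers who don't know the PIC24, here is a rough C model of the dispatch idea in the paragraph above. It is only a sketch under assumptions: the actual code is PIC24 assembly and is not shown in this thread, and every name here (`next_step`, `fetch_and_dispatch`, `housekeeping`, `request_housekeeping`) is hypothetical. The function pointer `next_step` stands in for W14.

```c
#include <stdint.h>

/* Hypothetical model: `next_step` plays the role of W14. */
typedef struct Cpu Cpu;
typedef void (*Step)(Cpu *);

struct Cpu {
    uint8_t  mem[65536];
    uint16_t pc;
    uint8_t  a;
    Step     next_step;          /* where every handler goes when it finishes */
    unsigned housekeeping_runs;  /* counts diversions, for illustration */
};

void fetch_and_dispatch(Cpu *cpu);

static void op_nop(Cpu *cpu)     { (void)cpu; }
static void op_lda_imm(Cpu *cpu) { cpu->a = cpu->mem[cpu->pc++]; }

/* Diverted path: do the pending work, then restore the normal path. */
void housekeeping(Cpu *cpu)
{
    cpu->housekeeping_runs++;
    cpu->next_step = fetch_and_dispatch;
}

/* Normal path: fetch the opcode and branch to its handler. */
void fetch_and_dispatch(Cpu *cpu)
{
    uint8_t opcode = cpu->mem[cpu->pc++];
    switch (opcode) {
    case 0xA9: op_lda_imm(cpu); break;
    default:   op_nop(cpu);     break;  /* this sketch treats all else as NOP */
    }
}

/* What an interrupt routine would do: redirect the next dispatch. */
void request_housekeeping(Cpu *cpu) { cpu->next_step = housekeeping; }

/* Run n steps through the indirection. */
void run_steps(Cpu *cpu, unsigned n)
{
    while (n--)
        cpu->next_step(cpu);
}
```

The trade-off described above is visible here: inlining `fetch_and_dispatch` at the end of every handler would remove the indirect jump, at the cost of code size and of needing another way to divert execution.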
I think the PIC32 would be really easy to set up in assembly for this, with the whole 64K fitting into memory. Right now with the PIC24 version, I only have 48K of RAM, which is just enough for 16K of RAM, 16K of ROM, and a couple of VIAs mapped, along with space for variables and such. Also, every READ and WRITE access from/to memory has to be translated (physical to logical) for RAM, ROM, and VIAs, and that actually burns up most of the time for any instruction involving memory. A flat 64K model would be so much faster!
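To illustrate the cost being described, here is a minimal C sketch comparing a translated access with a flat one. The memory map is made up for the example (RAM at 0x0000-0x3FFF, two 16-register VIAs at 0x8000 and 0x8010, ROM at 0xC000-0xFFFF); the thread does not give the real layout, and returning 0xFF for unmapped addresses is also an assumption.

```c
#include <stdint.h>

/* Hypothetical layout, for illustration only. */
typedef struct {
    uint8_t ram[0x4000];   /* 16K RAM at 0x0000-0x3FFF */
    uint8_t rom[0x4000];   /* 16K ROM at 0xC000-0xFFFF */
    uint8_t via[2][16];    /* two VIAs at 0x8000 and 0x8010 */
} Banked;

/* Translated read: every access pays for the region decode. */
uint8_t banked_read(const Banked *m, uint16_t addr)
{
    if (addr < 0x4000)
        return m->ram[addr];
    if (addr >= 0xC000)
        return m->rom[addr - 0xC000];
    if (addr >= 0x8000 && addr < 0x8020)
        return m->via[(addr >> 4) & 1][addr & 0x0F];
    return 0xFF;           /* unmapped (assumed value) */
}

/* Translated write: ROM and unmapped writes are dropped. */
void banked_write(Banked *m, uint16_t addr, uint8_t value)
{
    if (addr < 0x4000)
        m->ram[addr] = value;
    else if (addr >= 0x8000 && addr < 0x8020)
        m->via[(addr >> 4) & 1][addr & 0x0F] = value;
}

/* Flat read: one indexed load, no decode. */
uint8_t flat_read(const uint8_t *mem, uint16_t addr)
{
    return mem[addr];
}
```

Even with a flat 64K array, reads and writes to the VIA registers would presumably still need some kind of trap, since a real VIA access has side effects that a plain array does not model.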
Re: emulator performance on embedded cpu
Posted: Wed Jun 24, 2015 3:15 pm
by BitWise
I hope it's a GOTO W14 and not a BRA W14

Re: emulator performance on embedded cpu
Posted: Wed Jun 24, 2015 3:46 pm
by BigEd
Do any others here have any experience with PIC32?
Re: emulator performance on embedded cpu
Posted: Wed Jun 24, 2015 4:26 pm
by BitWise
Do any others here have any experience with PIC32?
I started writing a PIC32 version of my emulator in assembler but Microchip have made working in assembler very hard and I've had to switch to C, which introduces additional inefficiencies, especially as the highest levels of optimisation cost $$$ to switch on. To use a PIC32MZ I had to update to MPLABX 3.0 and their awful 'Harmony' framework as they don't support the MZ devices properly in the old peripheral libraries.
I don't have enough written yet to get any performance figures. There are some hardware performance issues. Flash memory access incurs delay cycles so you need to compile the emulator to run out of RAM for best performance but that then takes away RAM for emulation memory areas.
Re: emulator performance on embedded cpu
Posted: Thu Jun 25, 2015 4:49 am
by JimDrew
I hope it's a GOTO W14 and not a BRA W14

How funny... yes, it is!
I added a switch to turn off all of the cycle counting stuff and I am now down to ~81us to run that test. I could probably optimize it some more, but I really don't need to do that for what I am doing. In fact, it looks like I could do the emulation with a 16MIPS part.
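The arithmetic behind these figures, as a small C sketch. It uses only numbers quoted in this thread (1141us of 1MHz 6502 code, 90.757us and ~81us of host time, a 70MIPS PIC24EP); the function names are invented for the example.

```c
#include <assert.h>

/* Effective emulated clock in MHz. At 1MHz one emulated cycle takes 1us,
 * so the speed-up factor is also the effective clock in MHz. */
double effective_mhz(double emulated_us, double host_us)
{
    assert(host_us > 0.0);
    return emulated_us / host_us;
}

/* Average host instruction cycles spent per emulated 6502 cycle. */
double host_cycles_per_6502_cycle(double host_mips, double emulated_us,
                                  double host_us)
{
    assert(emulated_us > 0.0);
    return host_mips * host_us / emulated_us;
}
```

With cycle counting enabled, `effective_mhz(1141, 90.757)` comes to about 12.57, matching the figure quoted earlier. With it switched off, `effective_mhz(1141, 81)` comes to roughly 14.1, or about 5 PIC cycles per 6502 cycle at 70MIPS.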
Re: emulator performance on embedded cpu
Posted: Thu Jun 25, 2015 4:56 am
by JimDrew
Yeah, and DAW.B is still broken in MPLAB-X v3.05!
Re: emulator performance on embedded cpu
Posted: Thu Jun 25, 2015 5:55 am
by BigEd
Decimal mode is so twisty and seldom used that omitting it, or at least ensuring there's a fast path for binary mode, seems right. Whether or not you 'need' it depends on your intended use of course.
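As a sketch of what a fast path for binary mode could look like, here is an ADC in C that tests the D flag once and only falls into the BCD code when it is set. This is an illustration, not code from any emulator in this thread: the `Regs` layout is invented, the decimal path assumes valid BCD operands, leaves V untouched, and sets N and Z from the decimal result, so the NMOS 6502's decimal-mode flag quirks are not modelled.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical register file: accumulator plus the flags ADC touches. */
typedef struct {
    uint8_t a;
    bool c, z, n, v, d;
} Regs;

void adc(Regs *r, uint8_t m)
{
    unsigned carry_in = r->c ? 1u : 0u;

    if (!r->d) {
        /* Fast path: plain binary add. */
        unsigned sum = (unsigned)r->a + m + carry_in;
        r->v = ((r->a ^ sum) & (m ^ sum) & 0x80u) != 0;
        r->c = sum > 0xFFu;
        r->a = (uint8_t)sum;
    } else {
        /* Slow path: simplified BCD add, valid BCD operands assumed. */
        unsigned lo = (r->a & 0x0Fu) + (m & 0x0Fu) + carry_in;
        if (lo > 9u)
            lo += 6u;                      /* decimal-adjust the low digit */
        unsigned hi = (r->a >> 4) + (m >> 4) + (lo > 0x0Fu ? 1u : 0u);
        if (hi > 9u)
            hi += 6u;                      /* decimal-adjust the high digit */
        r->c = hi > 0x0Fu;
        r->a = (uint8_t)(((hi & 0x0Fu) << 4) | (lo & 0x0Fu));
    }
    r->z = (r->a == 0);
    r->n = (r->a & 0x80u) != 0;
}
```

In binary mode the cost of supporting decimal is then a single test of `d` per ADC; an emulator that needs exact NMOS behaviour would have to replace the slow path with a faithful model.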
Re: emulator performance on embedded cpu
Posted: Thu Jun 25, 2015 8:16 am
by BitWise
Jim's comment refers to a bug I reported 4 years ago. The MPLAB simulator incorrectly sets/clears the carry flag when performing a decimal adjust on a byte. It works correctly on 16-bit words.
The code I used to emulate decimal mode uses the instruction so it fails when simulated but works correctly on silicon.
Re: emulator performance on embedded cpu
Posted: Thu Jun 25, 2015 8:35 am
by BigEd
Wow!
Re: emulator performance on embedded cpu
Posted: Thu Jun 25, 2015 7:36 pm
by JimDrew
Yeah, this drove me nuts for several days. I PM'd Andrew to let him know that his PIC based 6502 emulation was broken - just like mine.

Turns out that he reported this bug to Microchip many years ago. As of v3.05 (just released), the same bug is there. This instruction (DAW) is just like the decimal-adjust instruction on any other CPU that has one, so it was a natural choice. It sort of works in the simulator (the flags are wrong under various conditions). Andrew, did you try this with ICE or some other real in-circuit debugger to see if the problem shows up there?
I have ALL of the 6502 unimplemented opcodes supported, and I need to have decimal mode working perfectly as well. There are many different programs that use unimplemented opcodes as part of their copy protection, so emulating them is not optional for my CBM drive emulator.
Re: emulator performance on embedded cpu
Posted: Sun Jun 28, 2015 4:53 am
by BigEd
It may be that you've got the most challenging 6502 application there, in terms of how much fidelity you need. Do you also need to model stray writes, and the exact timing of interrupts?
Re: emulator performance on embedded cpu
Posted: Sun Jun 28, 2015 5:04 pm
by JimDrew
Yes, everything is sub-cycle accurate. Both (virtual) 6522s can generate interrupts, and both have dual timers. There is a data separator emulation that runs asynchronously to the system, clocking in the flux data and driving the SO line. One VIA is connected to the IEC (serial bus) using the same double-ended open collector logic - so data, clock, and attention signals all have separate read and write lines connected to the VIA. I also have an SD media interface and an OLED screen (which is GREAT for debugging!)
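For anyone unfamiliar with open collector buses, the reason each signal needs separate read and write lines can be shown with a tiny C model. This is a generic sketch of wired-AND behaviour, not code from the drive emulator; `bus_level` and its parameters are names invented here.

```c
#include <stdbool.h>
#include <stddef.h>

/* Open collector line: a device can only pull the line low or release it.
 * The line reads high only when no device is pulling it low. */
bool bus_level(const bool *pulling_low, size_t n_devices)
{
    for (size_t i = 0; i < n_devices; i++) {
        if (pulling_low[i])
            return false;   /* any one device holds the line low */
    }
    return true;            /* all released: the pull-up wins */
}
```

A device that has released the line can still read it as low because another device is holding it, so what a device writes and what it reads back are two different things.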
I still have a few VIA things to add (like the shift register implementation), and I am writing a FAT32 filesystem handler (in assembly like everything else). I don't have tons of free time at the moment to get it done.
Re:
Posted: Sat Dec 19, 2015 6:12 pm
by hoglet
I did look into Acorn's two emulators: the 65Host and 65Tube programs, using the excellent disassembler Armalyser - it's amazing how dense the ARM instructions are, and very useful that the machine is so like the 6502. 65Tube is the faster of the two: it keeps the 6502 registers in ARM registers and handles PC and SP as pointers into a byte array. Each opcode finishes with a fetch and a computed jump, into a table of 16-word sections, one per opcode. For example, the code for BCC is just 6 instructions:
Code:
; handler for 0x90 BCC
LDRCCB R0,[R10],#1 ; fetch the operand byte
MOVCC R0,R0,LSL #24 ; shift left to prepare sign-extension
ADDCC R10,R10,R0,ASR #24 ; adjust the PC for branch-taken case
ADDCS R10,R10,#1 ; increment PC for branch not taken
; standard postamble: fetch next instruction. R10 is the 6502 PC, as a byte pointer
LDRB R0,[R10],#1 ; ifetch into R0 and PC++
ADD PC,R12,R0,LSL #6 ; computed jump to next opcode handler
Notice how the predicated instructions remove the need to branch, and how the ARM's own carry flag serves to emulate the 6502's - same for N, Z and V. All the 6502 state is held in registers throughout. The free shifts, auto-increment and the familiar-looking address modes help a lot too.
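For those who don't read ARM assembly, the two ideas in that listing can be restated in C. This is only a reading aid written from the listing above, not a reconstruction of the 65Tube source: the function names are invented, and the PC is modelled as an index into `mem` where the ARM code keeps a byte pointer in R10.

```c
#include <stdbool.h>
#include <stdint.h>

/* Each opcode owns a 16-word (64-byte) slot, so the handler address is
 * table base + opcode * 64, i.e. the "R12 + (R0 LSL #6)" in the postamble. */
uint32_t handler_address(uint32_t table_base, uint8_t opcode)
{
    return table_base + ((uint32_t)opcode << 6);
}

/* BCC. On entry pc indexes the operand byte; returns the new pc. */
uint16_t bcc(const uint8_t *mem, uint16_t pc, bool carry)
{
    if (carry)
        return (uint16_t)(pc + 1);   /* not taken: step over the operand */

    int offset = mem[pc];            /* fetch the operand byte */
    if (offset & 0x80)
        offset -= 0x100;             /* sign-extend, as LSL #24 / ASR #24 does */
    return (uint16_t)(pc + 1 + offset);
}
```

The ARM version does all of this without a single branch, because each instruction is predicated on the carry flag.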
I wanted to come back to one of Ed's observations from a few years ago.
I've also just been disassembling 65Tube, and it looks like a very efficient 65C02 implementation in ARM.
I'm thinking of trying to reverse engineer the 65C02 core of this back into a buildable source form.
Has this been done already by anyone?
I was also wondering if the original sources were ever released? I did have a poke around the RISC OS Open CVS repository, and couldn't find anything.
Dave
Re: emulator performance on embedded cpu
Posted: Sun Dec 20, 2015 6:35 pm
by BigEd
I don't know of a previous effort, or of source. Or even who wrote it! I do see a hint that it might not be 32-bit clean, so maybe watch out for that.
(I wanted a6502 not to be a derivation so I could license it without any grey area - copyright in Acorn's software might yet be owned by someone. But that is a bit of a restriction and probably meaningless in practice.)