emulator performance on embedded cpu
Re: emulator performance on embedded cpu
Elsewhere:
It would be interesting to see this done! The best so far, I think, is about 20MHz on a 168MHz ARM. That code can certainly be improved, as it keeps saving and restoring the flags - with a bit of care it could mostly leave them in place. As it happens, the ARM's use of flags is very like the 6502's (not a coincidence.) But this ARM platform is hampered by slow Flash - there's a cache, but probably not big enough. Moving code around to make use of RAM or better use of the cache might get some gains.
JimDrew wrote:
My PIC based 6502 emulator takes 90.757us to run those 1141us worth of instructions (~12.57MHz). I didn't take out the code for any instruction where a page boundary can be crossed (which corrects the cycle counter), so it could be a bit faster still if I did that. That is using a 70MIPS PIC24EP CPU (assembly code of course - I only write in assembly, no matter what CPU I am using).
I would bet that if I used a PIC32 @200MHz I could get close to 50MHz 6502 emulation. The PIC24EP has 3 cycle memory fetches where the PIC32 has single cycle.
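For reference, the ~12.57MHz figure follows directly from the two timings quoted above. A minimal sketch of that arithmetic in Python; the function name is made up for illustration and the only inputs are the numbers from the post:

```python
def effective_mhz(real_us: float, emulated_us: float) -> float:
    """Effective emulated clock in MHz.

    real_us: time the test takes on a real 1MHz 6502 (so also its cycle count).
    emulated_us: time the emulator needs to run the same instructions.
    """
    if emulated_us <= 0:
        raise ValueError("emulated_us must be positive")
    return real_us / emulated_us  # cycles per microsecond == MHz


# Timings quoted in the post: 1141us worth of 6502 code run in 90.757us.
print(f"{effective_mhz(1141.0, 90.757):.2f} MHz")  # prints "12.57 MHz"
```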
Re: emulator performance on embedded cpu
My code is set up to hold off instruction fetch so that it emulates the exact cycle timing of a 1MHz 6502. So I can get faster results with the PIC24 by removing the cycle-counting boundary checks for any instruction that can cross a page (which is quite a few). Also, at the end of each instruction is a GOTO W14, which jumps to whatever W14 is pointing to. That typically points to the routine (macro) that fetches the next instruction and then does a BRA to its handling code (just 2 instructions). So I could change that macro so that it does the same lookup/branch directly after each instruction instead of the GOTO W14, increasing the code size but saving 2 PIC cycles per instruction. An interrupt routine can change W14 to point to whatever housekeeping still has to be done. I use this GOTO W14 because I needed a way to handle specific hardware writes, but I can still do this and save a couple of PIC cycles (14.285ns each) per 6502 instruction.
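The GOTO W14 arrangement is essentially an indirect "what happens next" pointer that an interrupt can redirect. A rough sketch of that idea in Python rather than PIC24 assembly; every name here (`Cpu`, `next_hook`, `housekeeping`, `run`) is hypothetical, and the three-opcode instruction set is only there to make the example run:

```python
class Cpu:
    """Toy core; `next_hook` plays the role of the address held in W14."""

    def __init__(self, memory):
        self.mem = memory
        self.pc = 0
        self.a = 0
        self.running = True
        self.log = []
        self.next_hook = self.fetch_and_dispatch  # normal case

    def fetch_and_dispatch(self):
        """Fetch the next opcode and run its handler."""
        opcode = self.mem[self.pc]
        self.pc = (self.pc + 1) & 0xFFFF
        HANDLERS[opcode](self)

    def housekeeping(self):
        """Stand-in for work an interrupt has requested; runs once."""
        self.log.append("housekeeping")
        self.next_hook = self.fetch_and_dispatch  # restore normal dispatch


def op_lda_imm(cpu):
    cpu.a = cpu.mem[cpu.pc]
    cpu.pc = (cpu.pc + 1) & 0xFFFF


def op_nop(cpu):
    pass


def op_stop(cpu):
    cpu.running = False  # toy stand-in, not real BRK behaviour


HANDLERS = {0xA9: op_lda_imm, 0xEA: op_nop, 0x00: op_stop}


def run(cpu):
    # Every instruction ends by going through the hook, like GOTO W14.
    while cpu.running:
        cpu.next_hook()
```

Inlining the fetch and dispatch at the end of every handler removes that one level of indirection, which is the trade of code size for cycles described above.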
I think the PIC32 would be really easy to set up in assembly for this, with the whole 64K fitting into memory. Right now, with the PIC24 version, I only have 48K of RAM, which is just enough for 16K of RAM, 16K of ROM, and a couple of VIAs mapped, along with space for variables and such. Also, every READ and WRITE access from/to memory has to be translated (physical to logical) for RAM, ROM, and VIAs, and that actually burns up most of the time for any instruction involving memory. A flat 64K model would be so much faster!
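To illustrate why the translation step costs so much compared with a flat model, here is a hedged sketch in Python. The memory map (16K RAM at $0000, two 16-register VIAs at $8000 and $8010, 16K ROM at $C000) and the $FF value for unmapped space are assumptions made up for this example, not the layout used in the post:

```python
RAM_SIZE = 0x4000   # assumed: 16K RAM at $0000-$3FFF
ROM_BASE = 0xC000   # assumed: 16K ROM at $C000-$FFFF
VIA1_BASE = 0x8000  # assumed VIA locations, 16 registers each
VIA2_BASE = 0x8010

ram = bytearray(RAM_SIZE)
rom = bytearray(0x10000 - ROM_BASE)
via_regs = [bytearray(16), bytearray(16)]


def read_translated(addr):
    """Banked model: every access is range-checked and rebased."""
    addr &= 0xFFFF
    if addr < RAM_SIZE:
        return ram[addr]
    if addr >= ROM_BASE:
        return rom[addr - ROM_BASE]
    if VIA1_BASE <= addr < VIA1_BASE + 16:
        return via_regs[0][addr - VIA1_BASE]
    if VIA2_BASE <= addr < VIA2_BASE + 16:
        return via_regs[1][addr - VIA2_BASE]
    return 0xFF  # assumed value for unmapped space


flat = bytearray(0x10000)


def read_flat(addr):
    """Flat model: the 16-bit address is the index, no checks needed."""
    return flat[addr & 0xFFFF]
```

Register accesses with side effects (the VIAs) would presumably still need some kind of hook in a flat model, so this only shows the difference for plain RAM and ROM reads.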
Last edited by JimDrew on Thu Jun 25, 2015 4:55 am, edited 1 time in total.
- BitWise
- In Memoriam
- Posts: 996
- Joined: 02 Mar 2004
- Location: Berkshire, UK
- Contact:
Re: emulator performance on embedded cpu
I hope it's a GOTO W14 and not a BRA W14
Andrew Jacobs
6502 & PIC Stuff - http://www.obelisk.me.uk/
Cross-Platform 6502/65C02/65816 Macro Assembler - http://www.obelisk.me.uk/dev65/
Open Source Projects - https://github.com/andrew-jacobs
Re: emulator performance on embedded cpu
Do any others here have any experience with PIC32?
- BitWise
Re: emulator performance on embedded cpu
BigEd wrote:
Do any others here have any experience with PIC32?
I don't have enough written yet to get any performance figures. There are some hardware performance issues. Flash memory access incurs delay cycles, so you need to build the emulator to run out of RAM for best performance, but that then takes RAM away from the emulated memory areas.
Andrew Jacobs
Re: emulator performance on embedded cpu
BitWise wrote:
I hope it's a GOTO W14 and not a BRA W14
I added a switch to turn off all of the cycle counting stuff and I am now down to ~81us to run that test. I could probably optimize it some more, but I really don't need to do that for what I am doing. In fact, it looks like I could do the emulation with a 16MIPS part.
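As a rough sanity check on the 16MIPS remark, the measured time can be scaled to a slower part. This assumes run time scales inversely with instruction rate, which is a simplification (it ignores peripherals and anything else that does not scale with the core clock); the function name is made up:

```python
def scaled_runtime_us(measured_us, measured_mips, target_mips):
    """Estimate run time on a part with a different instruction rate."""
    if target_mips <= 0:
        raise ValueError("target_mips must be positive")
    return measured_us * measured_mips / target_mips


# ~81us measured on the 70MIPS part, scaled to a 16MIPS part.
t16 = scaled_runtime_us(81.0, 70.0, 16.0)

# The same code takes 1141us on a real 1MHz 6502.
headroom = 1141.0 / t16
print(f"{t16:.1f}us at 16MIPS, about {headroom:.1f}x faster than real time")
```

Under that assumption the test would take roughly 354us at 16MIPS, still about three times faster than the real 6502 needs.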
Re: emulator performance on embedded cpu
Yeah, and DAW.B is still broken in MPLAB-X v3.05!
Re: emulator performance on embedded cpu
Decimal mode is so twisty and seldom used that omitting it, or at least ensuring there's a fast path for binary mode, seems right. Whether or not you 'need' it depends on your intended use of course.
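A sketch of what "a fast path for binary mode" might look like, in Python for readability. The names are hypothetical; only the result and carry are modelled, the decimal path is only meant for valid packed-BCD operands, and the other flags (which are quirky in decimal mode on the NMOS 6502) are deliberately left out:

```python
def adc(a, m, carry_in, decimal_flag):
    """Return (result, carry_out) for a 6502-style ADC. Other flags omitted."""
    if not decimal_flag:
        # Fast path: plain binary add, taken by the vast majority of code.
        total = a + m + carry_in
        return total & 0xFF, int(total > 0xFF)
    return adc_decimal(a, m, carry_in)


def adc_decimal(a, m, carry_in):
    """Slow path: nibble-wise packed-BCD add (valid BCD operands assumed)."""
    lo = (a & 0x0F) + (m & 0x0F) + carry_in
    if lo > 9:
        lo += 6  # skip the six unused codes A-F
    hi = (a >> 4) + (m >> 4) + (1 if lo > 0x0F else 0)
    if hi > 9:
        hi += 6
    return ((hi << 4) | (lo & 0x0F)) & 0xFF, int(hi > 0x0F)


print(adc(0x19, 0x28, 0, True))   # 19 + 28 in BCD
print(adc(0xFF, 0x01, 0, False))  # binary wrap-around with carry out
```

The point is only that the D-flag test happens once up front, so binary-mode code never pays for the BCD corrections.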
- BitWise
Re: emulator performance on embedded cpu
Jim's comment refers to a bug I reported 4 years ago. The MPLAB simulator incorrectly sets/clears the carry flag when performing a decimal adjust on a byte. It works correctly on 16-bit words.
The code I used to emulate decimal mode uses that instruction, so it fails when simulated but works correctly on silicon.
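One way to cross-check a suspect simulator result is a small software reference for a byte-wide decimal adjust. The sketch below is a generic "adjust after binary addition" routine in the style of such instructions; it is not claimed to be a bit-exact model of DAW.B, and the function name is made up:

```python
def decimal_adjust_byte(value, half_carry, carry):
    """Adjust the 8-bit result of a binary add of two packed-BCD bytes.

    half_carry and carry are the flags produced by that binary add.
    Returns (adjusted_value, carry_out).
    """
    value &= 0xFF
    if half_carry or (value & 0x0F) > 9:
        value += 0x06  # may ripple into bit 8, handled below
    if carry or (value & 0x1F0) > 0x90:
        value += 0x60
        carry = 1
    return value & 0xFF, int(bool(carry))


# 0x59 + 0x41 = 0x9A in binary, no carries; adjusted it is 00 with carry (100).
print(decimal_adjust_byte(0x9A, 0, 0))
```

Comparing the carry this returns with what the simulator and the real part report for the same inputs should show where the two disagree.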
Andrew Jacobs
Re: emulator performance on embedded cpu
Yeah, this drove me nuts for several days. I PM'd Andrew to let him know that his PIC based 6502 emulation was broken - just like mine.
Turns out that he reported this bug to Microchip many years ago, and as of v3.05 (just released) the same bug is still there. This instruction (DAW) is just like the decimal-adjust instruction on any other CPU that has one, so it was a natural choice. It sort of works in the simulator (the flags are wrong under various conditions). Andrew, did you try this with an ICE or some other real in-circuit debugger to see if the problem shows up there?
I have ALL of the 6502 unimplemented opcodes supported, and I need to have decimal mode working perfectly as well. Many different programs use unimplemented opcodes as part of their copy protection, so emulating these is not optional for my CBM drive emulator.
Re: emulator performance on embedded cpu
It may be that you've got the most challenging 6502 application there, in terms of how much fidelity you need. Do you also need to model stray writes, and the exact timing of interrupts?
Re: emulator performance on embedded cpu
Yes, everything is sub-cycle accurate. Both (virtual) 6522s can generate interrupts, and both have dual timers. There is a data separator emulation that runs asynchronously to the system, clocking in the flux data and driving the SO line. One VIA is connected to the IEC (serial bus) using the same double-ended open collector logic - so data, clock, and attention signals all have separate read and write lines connected to the VIA. I also have an SD media interface and an OLED screen (which is GREAT for debugging!)
I still have a few VIA things to add (like the shift register implementation), and I am writing a FAT32 filesystem handler (in assembly like everything else). I don't have tons of free time at the moment to get it done.
Re:
BigEd wrote:
I did look into Acorn's two emulators: the 65Host and 65Tube programs, using the excellent disassembler Armalyser - it's amazing how dense the ARM instructions are, and very useful that the machine is so like the 6502. 65Tube is the faster of the two: it keeps the 6502 registers in ARM registers and handles PC and SP as pointers into a byte array. Each opcode finishes with a fetch and a computed jump, into a table of 16-word sections, one per opcode. For example, the code for BCC is just 6 instructions:
Code:
; handler for 0x90 BCC
LDRCCB R0,[R10],#1 ; fetch the operand byte
MOVCC R0,R0,LSL #24 ; shift left to prepare sign-extension
ADDCC R10,R10,R0,ASR #24 ; adjust the PC for branch-taken case
ADDCS R10,R10,#1 ; increment PC for branch not taken
; standard postamble: fetch next instruction. R10 is the 6502 PC, as a byte pointer
LDRB R0,[R10],#1 ; ifetch into R0 and PC++
ADD PC,R12,R0,LSL #6 ; computed jump to next opcode handler
Notice how the predicated instructions remove the need to branch, and how the ARM's own carry flag serves to emulate the 6502's - same for N, Z and V. All the 6502 state is held in registers throughout. The free shifts, auto-increment and the familiar-looking address modes help a lot too.
I've also just been disassembling 65Tube, and it looks like a very efficient 65C02 implementation in ARM.
I'm thinking of trying to reverse engineer the 65C02 core of this back into a buildable source form.
Has this been done already by anyone?
I was also wondering if the original sources were ever released? I did have a poke around the RISC OS Open CVS repository, and couldn't find anything.
Dave
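For readers who don't read ARM, the quoted BCC handler boils down to: if carry is clear, read a signed 8-bit offset and add it to the PC; otherwise skip the operand. A sketch in Python with hypothetical names. The LSL #24 / ASR #24 pair in the ARM code is the sign extension done here by `sign_extend8`, and the 16-bit wrap of the PC is an assumption of this sketch (the ARM code works on a host byte pointer instead):

```python
def sign_extend8(value):
    """Interpret an 8-bit value as signed, like LSL #24 followed by ASR #24."""
    value &= 0xFF
    return value - 0x100 if value & 0x80 else value


def bcc(pc, memory, carry):
    """Return the new PC after BCC; `pc` points at the operand byte."""
    if carry:
        return (pc + 1) & 0xFFFF       # not taken: just skip the operand
    offset = sign_extend8(memory[pc])
    return (pc + 1 + offset) & 0xFFFF  # taken: relative to the next opcode


memory = bytearray(0x10000)
memory[0x1000] = 0x90  # BCC opcode
memory[0x1001] = 0xFE  # offset -2: branch back to the BCC itself
print(hex(bcc(0x1001, memory, carry=0)))  # taken
print(hex(bcc(0x1001, memory, carry=1)))  # not taken
```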
Re: emulator performance on embedded cpu
I don't know of a previous effort, or of source. Or even who wrote it! I do see a hint that it might not be 32-bit clean, so maybe watch out for that.
(I wanted a6502 not to be a derivation so I could license it without any grey area - copyright in Acorn's software might yet be owned by someone. But that is a bit of a restriction and probably meaningless in practice.)