65C02 Simulator on a bare metal Raspberry PI

BigEd · Post by **BigEd** » Mon Feb 05, 2018 7:33 pm

Well done - a milestone! It will be interesting to see the progress as you find optimisations.

whartung · Post by **whartung** » Tue Feb 06, 2018 6:07 pm

14MHz on a 700MHz machine is 50 ARM clock cycles per emulated 6502 cycle. Now, I don't know anything about ARM assembly, or how many cycles an ARM instruction takes. On the 6502, a cycle is a clock pulse, so 14MHz is 14M cycles per second.

With an average instruction time of 2 cycles (total swag), that's 7M Instructions (7 MIPS).

So, how many instructions can an ARM do in 50 cycles? That should give you a better idea of what "ultimate performance" might be.

Are you using a table lookup for the instruction processing? I assume so. But, really, 50 cycles is a lot, but it's not that much in this context. 2-3 cycles per instruction, that's 16-25 instructions for each 6502 instruction.
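As a rough check of the arithmetic above, the budget can be worked out in a few lines of C. The function names here are hypothetical, and the 2-3 cycles per ARM instruction figure is the guess from the post, not a measured value.

```c
#include <stdint.h>

/* Host clock cycles available per emulated 6502 clock cycle. */
static uint32_t host_cycles_per_emulated_cycle(uint32_t host_hz, uint32_t emulated_hz)
{
    return host_hz / emulated_hz;
}

/* Host instructions that fit in a cycle budget, given an assumed
   average cost per host instruction (integer division, rounds down). */
static uint32_t host_instructions_in_budget(uint32_t budget_cycles,
                                            uint32_t cycles_per_instruction)
{
    return budget_cycles / cycles_per_instruction;
}
```

With 700MHz and 14MHz the first function gives a budget of 50 cycles, and the second gives 25 instructions at 2 cycles each or 16 at 3 cycles each.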

Basic dispatch is: (contrived CPU Agnostic 6502ish Assembly syntax)

main_loop:
  LDX (PC)
  INC PC
  AND PC, #$FFFF
  JSR (INSTRUCTIONS,X)
  JMP main_loop

There's 4 or 5+ instructions right there, and not "cheap" instructions either.
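For comparison, here is a minimal sketch of the same table-driven dispatch in C, roughly what a C model would do. The struct layout, handler names and the `halted` flag are all assumptions made for illustration; only two real opcodes are filled in, and every other opcode simply stops the loop.

```c
#include <stdint.h>

typedef struct Cpu Cpu;
typedef void (*OpHandler)(Cpu *cpu);

struct Cpu {
    uint16_t pc;          /* 16-bit type wraps at $FFFF, so no AND is needed */
    uint8_t  a;
    uint8_t  mem[65536];
    int      halted;      /* hypothetical flag so the loop can stop */
};

static void op_nop(Cpu *cpu)     { (void)cpu; }                      /* $EA */
static void op_lda_imm(Cpu *cpu) { cpu->a = cpu->mem[cpu->pc++]; }   /* $A9 */
static void op_halt(Cpu *cpu)    { cpu->halted = 1; }   /* stand-in for all other opcodes */

static OpHandler instructions[256];

static void init_table(void)
{
    for (int i = 0; i < 256; i++)
        instructions[i] = op_halt;
    instructions[0xEA] = op_nop;
    instructions[0xA9] = op_lda_imm;
}

static void run(Cpu *cpu)
{
    while (!cpu->halted) {
        uint8_t opcode = cpu->mem[cpu->pc++];   /* LDX (PC) / INC PC */
        instructions[opcode](cpu);              /* JSR (INSTRUCTIONS,X) */
    }                                           /* JMP main_loop */
}
```

Every emulated instruction pays for one indirect call, one return and the loop branch on top of its real work, which is the overhead being counted above.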

Then something as simple as LDA <ABS>:

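LDA_ABS:
    LDA (PC)          ; load through the 16-bit operand at PC
    INC PC,2          ; step past the two operand bytes
    AND PC, #$FFFF    ; keep PC within 16 bits
    STA A_REGISTER
    RTS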

So, counting the dispatch loop, LDA ABS is about 10 of these pseudo-instructions; at 2-3 cycles each, that's close to 60% of the 50-cycle budget already.

Anyway, you may not be able to get much more than 14MHz. It's just interesting how fast it all adds up.

Have you considered JIT compiling the 6502? Zziinnng!!

BigEd · Post by **BigEd** » Tue Feb 06, 2018 6:18 pm

(The PiTubeDirect project manages >200MHz at 1GHz. It's written in very optimised assembly language, and is a fast-as-possible emulator making no effort to match the proportional cycle counts of each instruction.
There's a timeline here showing the progress from the initial 84MHz. The project offers several models of CPU, including another 6502 model based on Ian Piumarta's lib6502 written in C, which runs at about 20MHz. So that's a benchmark for a C model, and is very much in line with 14MHz at 700MHz.)

Druzyek · Post by **Druzyek** » Wed Feb 07, 2018 4:44 am

whartung wrote:

2-3 cycles per instruction, that's 16-25 instructions for each 6502 instruction.

Aren't a lot of ARM instructions single cycle or close to it? You may have more instructions to work with than 25. OTOH, I'm not sure how the Raspberry Pi is set up and how fast it can run from RAM.
https://hardwarebug.org/2014/05/15/cort ... e-timings/

BitWise · Post by **BitWise** » Wed Feb 07, 2018 9:59 am

I found the biggest speed gains in my code were from using the processor's native status word and creating P only when it was actually needed. Also, C is the only flag bit that ever needs to be restored, and then only for shifts and arithmetic.

8BIT · Post by **8BIT** » Wed Feb 07, 2018 12:47 pm

Wow, last night I posted a lengthy response and this morning it's not here. I always do the preview before I submit, so there's a small chance I had done that and forgot to click submit.

I'll try to recap that post and add more when I get home from work. Thank you to everyone who has offered help!!

Daryl

BigEd · Post by **BigEd** » Wed Feb 07, 2018 12:52 pm

(I think if you compose something and someone else posts in the meantime, you get a second confirmation stage - I've lost a post or two that way, thinking I'd posted when I hadn't.)

GARTHWILSON · Post by **GARTHWILSON** » Wed Feb 07, 2018 7:04 pm

I hope you can find it in your browser history, either by re-opening closed tabs, or by using the back arrow ("Go back one page") in the tab history enough times.

White Flame · Post by **White Flame** » Thu Feb 08, 2018 12:11 am

BitWise wrote:

I found the biggest speed gains in my code were from using the processor's native status word and creating P only when it was actually needed. Also, C is the only flag bit that ever needs to be restored, and then only for shifts and arithmetic.

I'm not sure if this is the same as your approach, but what I've done for N and Z is when those flags are modified, save the current 8-bit value that would affect them into a "last value" location. Then when a branch on those is needed, pull that last value and use a native branch. Those 2 flags in particular are often modified but ignored, so it doesn't make sense to put work into computing them all the time.

With a custom native asm environment on ARM, you might even dedicate a register for holding that NZ-affecting value.
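A minimal C sketch of the "last value" idea, assuming a hypothetical `LazyCpu` struct and helper names that are not from anyone's actual code. N and Z are never computed when an instruction runs; they are derived from the saved byte only when a branch or a push of P asks for them.

```c
#include <stdint.h>

typedef struct {
    uint8_t a;
    uint8_t nz_source;   /* last value that would have set N and Z */
    uint8_t carry;       /* C is kept explicitly, 0 or 1 */
} LazyCpu;

/* LDA: one extra store, no flag computation at all. */
static void lda(LazyCpu *cpu, uint8_t value)
{
    cpu->a = value;
    cpu->nz_source = value;
}

/* Evaluated only when BEQ/BNE or BMI/BPL actually needs the flag. */
static int flag_z(const LazyCpu *cpu) { return cpu->nz_source == 0; }
static int flag_n(const LazyCpu *cpu) { return (cpu->nz_source & 0x80) != 0; }

/* Build the N, Z and C bits of P on demand (PHP, BRK, interrupts).
   The other P bits are left out of this sketch. */
static uint8_t pack_nzc(const LazyCpu *cpu)
{
    return (uint8_t)((flag_n(cpu) ? 0x80 : 0) |
                     (flag_z(cpu) ? 0x02 : 0) |
                     (cpu->carry  ? 0x01 : 0));
}
```

One limitation of keeping a single byte: it cannot represent N and Z both set, which PLP or RTI can legitimately produce, so a real implementation would need a special case for restoring P.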

8BIT · Post by **8BIT** » Thu Feb 08, 2018 5:34 am

I don't have the energy to recompose the full message from yesterday, but the summary is this:

Thanks for all the help, the tube project is very intriguing, as they have some awesome performance numbers. I have reviewed some of the comments in the sources and have gained a few ideas that I can try on my build.

I have been simulating the Status Register, which admittedly does use up a lot of instructions. The memory loads and saves also take up quite a bit, due partly to 32-bit word alignment. 16-bit reads are being done with two 8-bit reads and then combining the results. I read something last night about disabling "alignment", which may help me with that issue. I need to experiment some more.
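The two-byte read described above looks roughly like this in C. This is a sketch with a hypothetical `read16` helper, not the macro from the actual build; it never issues an unaligned 16-bit load, at the price of two byte loads and a combine.

```c
#include <stdint.h>

/* Read a little-endian 16-bit value from emulated memory one byte at a
   time. The address of the high byte wraps around at $FFFF. */
static uint16_t read16(const uint8_t *mem, uint16_t addr)
{
    uint8_t lo = mem[addr];
    uint8_t hi = mem[(uint16_t)(addr + 1)];
    return (uint16_t)(lo | (hi << 8));
}
```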

This build uses the same model as my AVR code, using macros to do memory and IO reads and writes, address modes, and opcodes. I also use a macro for the fetch process, which gets added to the end of every opcode. I'll explain why in a minute.

Here's a sample of my LDA ABS instruction:

C_AD:	 			; LDA_ABS - opcode $AD
	ABS			
	C_LDA			
	CYCLE 4			
	FETCH

ABS will read the 2 bytes after the opcode, form the target address, and read the byte at that address (RAM, ROM, or IO).
C_LDA will transfer that data to the Acc register and update the proper flags.
CYCLE 4 is a single-instruction ARM macro that adds the number to the cycle count.
FETCH will get the opcode from the location pointed to by the PC register and branch to the proper code (label C_AD in the case of an LDA ABS).

By placing the fetch macro at the end of each opcode, I avoid subroutine overhead and/or extra branch instructions. The pipeline architecture of the ARM allows for 4 instructions to be moving through the pipe, starting with the fetch and ending with the operation. If a branch occurs, the pipeline stalls (has to be refreshed), which uses up extra cycles. The cool thing is that most instructions can be conditionally executed, which eliminates the need for forward branches and keeps the pipeline moving. If a condition is not met, a 1 cycle nop replaces the instruction.
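The same "fetch at the end of every opcode" structure can be sketched in C using the labels-as-values extension of GCC and Clang (it is not standard C). The `FETCH` and `CYCLE` macros below only imitate the roles described above; the function name, the three implemented opcodes and the halt-on-anything-else behaviour are assumptions for illustration.

```c
#include <stdint.h>

/* Runs from pc until an unimplemented opcode (e.g. BRK, $00) is fetched
   and returns the accumulator. */
static uint8_t run_threaded(const uint8_t *mem, uint16_t pc, uint32_t *cycles)
{
    static void *table[256];      /* labels-as-values: GCC/Clang extension */
    uint8_t a = 0;

    for (int i = 0; i < 256; i++)
        table[i] = &&op_halt;
    table[0xA9] = &&op_lda_imm;
    table[0xAD] = &&op_lda_abs;
    table[0xEA] = &&op_nop;

#define FETCH()  goto *table[mem[pc++]]   /* dispatch straight to the next opcode */
#define CYCLE(n) (*cycles += (n))

    FETCH();

op_lda_imm:                       /* $A9 */
    a = mem[pc++];
    CYCLE(2);
    FETCH();

op_lda_abs: {                     /* $AD */
        uint16_t addr = (uint16_t)(mem[pc] | (mem[(uint16_t)(pc + 1)] << 8));
        pc += 2;
        a = mem[addr];
    }
    CYCLE(4);
    FETCH();

op_nop:                           /* $EA */
    CYCLE(2);
    FETCH();

op_halt:
#undef FETCH
#undef CYCLE
    return a;
}
```

There is no central loop and no call or return: each handler ends with its own indirect jump, so the only branch per emulated instruction is the dispatch itself.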

I also need to find optimizations in calculating flags for an 8-bit value in a 32-bit register, i.e., the ARM has a V flag, but it's looking at overflow at bit 31, not bit 7.

Bottom line: streamlining the code, removing complexity with memory/IO read and write, using native flags when possible vs. simulating them, and reducing pipeline stalls will allow me to speed up the simulation.

More still to come.

Daryl

BigEd · Post by **BigEd** » Thu Feb 08, 2018 8:51 am

Because many ARM instructions offer a free shift, you can consider the technique of keeping 8-bit values left-aligned in their registers - this way, N and V work out much better.
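To see why left alignment helps, here is a small C model of a binary-mode ADC with the operands shifted up by 24 bits. The struct and function names are hypothetical. With the 8-bit values in bits 24-31, the sign, carry-out and signed overflow of the 32-bit add line up with the 6502's N, C and V; on ARM a flag-setting add of the shifted values would produce them in hardware, whereas this sketch computes them by hand. Decimal mode is ignored.

```c
#include <stdint.h>

typedef struct {
    uint8_t result;
    int n, z, c, v;
} AdcResult;

static AdcResult adc_left_aligned(uint8_t a, uint8_t b, int carry_in)
{
    uint32_t a32 = (uint32_t)a << 24;               /* value in bits 24-31 */
    uint32_t b32 = (uint32_t)b << 24;
    uint32_t cin = (uint32_t)(carry_in != 0) << 24; /* carry enters at bit 24 */

    uint64_t wide = (uint64_t)a32 + b32 + cin;      /* keep the carry out of bit 31 */
    uint32_t r32  = (uint32_t)wide;

    AdcResult out;
    out.result = (uint8_t)(r32 >> 24);
    out.n = (int)(r32 >> 31);                       /* bit 31 is the 8-bit sign */
    out.z = (r32 == 0);
    out.c = (int)(wide >> 32);                      /* carry out of bit 31 */
    /* Overflow: operands share a sign and the result's sign differs. */
    out.v = (int)((~(a32 ^ b32) & (a32 ^ r32)) >> 31);
    return out;
}
```

For example, $50 + $50 gives $A0 with N and V set and C clear, exactly as on the 6502.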

dp11 · Post by **dp11** » Thu Feb 08, 2018 11:10 am

With the ARM on the Pi, shifts aren't always free these days; it depends on interlocks.

BigEd · Post by **BigEd** » Thu Feb 08, 2018 11:48 am

Oh, that's a surprise!

8BIT · Post by **8BIT** » Thu Feb 08, 2018 12:59 pm

So far, I haven't run into a problem with "free" shifts for operand 2. I haven't seen anything about interlocks in the assembly docs; can you explain what they are, dp11? I've only been using ARM assembly for about 3 weeks, and still have a lot to learn.

I had not considered keeping the data left aligned, thanks for the tip Ed! I was toying with the idea of filling bits 8-30 with 1's and copying the sign (bit 7) to bit 31. I think I would have the same results, but there'd be too much overhead. I will definitely experiment with left alignment.

Another trick I learned from the tube project was placing simulated memory at address 0x0000 - 0xFFFF, so that the PC can directly address memory. So far, that is working well for me. My only concern is with the VideoCore mailboxes (which I have not played with yet). I think they live in the space between 0x0000 and 0x8000. If that is true and I cannot relocate them, I will not be able to use this trick. I will eventually want to use the video display, especially after I get a 65816 simulator working, which can address 16MB of RAM.

Back to the code.

Daryl

dp11 · Post by **dp11** » Thu Feb 08, 2018 1:28 pm

With newer ARMs, some of what you thought you knew about timings has changed. A lot isn't documented publicly either. It can also change between ARMv6, v7 and v8.

You will only spot the interlocks if you accurately cycle count the ARM instructions.

; code that might simulate an LDA
    LDRB R0,[Rx]
    MOVS R1,R0
; the above takes, say, n cycles

    LDRB R0,[Rx]
    MOVS R1,R0,LSL #24
; the above takes n+1 cycles on the ARM core used in the Pi1

Mailboxes can live wherever you like, thanks to the MMU. Using 0x0000 for the memory doesn't give you a big win in performance. The big wins are from scheduling the loads from memory and keeping the code small enough that it fits in the cache. Dealing with 6502 interrupts can be a big overhead.
