65C02 Simulator on a bare metal Raspberry PI

Topics pertaining to the emulation or simulation of the 65xx microprocessors and their peripheral chips.
BigEd
Posts: 11464
Joined: 11 Dec 2008
Location: England

Post by BigEd »

Well done - a milestone! It will be interesting to see the progress as you find optimisations.
whartung
Posts: 1004
Joined: 13 Dec 2003

Post by whartung »

14MHz on a 700MHz machine is 50 ARM clock cycles per simulated 6502 cycle. Now, I don't know anything about ARM assembly, or how many clock cycles each ARM instruction takes. On the 6502, a cycle is a clock pulse, so 14MHz is 14M cycles per second.

With an average instruction time of 2 cycles (total swag), that's 7M Instructions (7 MIPS).

So, how many instructions can an ARM do in 50 cycles? That should give you a better idea of what "ultimate performance" might be.

Are you using a table lookup for the instruction processing? I assume so. But, really, 50 cycles is a lot, but it's not that much in this context. 2-3 cycles per instruction, that's 16-25 instructions for each 6502 instruction.
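
The budget arithmetic above can be sketched in a few lines of Python. This is only a back-of-envelope sketch: the 700MHz and 14MHz figures and the 2-cycle average come from the thread, while the names (HOST_HZ, arm_instructions_in and so on) are made up for illustration. Note that 700/14 is 50 host clocks per simulated 6502 cycle, so if the 2-cycle average holds, the budget per simulated instruction would be nearer 100 host clocks.

```python
HOST_HZ = 700_000_000        # ARM clock quoted in the thread
SIMULATED_HZ = 14_000_000    # simulated 6502 speed quoted in the thread
AVG_6502_CYCLES = 2          # rough guess from the thread, not a measurement


def arm_instructions_in(host_clocks, clocks_per_arm_instruction):
    """Whole ARM instructions that fit in a budget of host clocks."""
    return host_clocks // clocks_per_arm_instruction


# Host clocks available for each simulated 6502 clock cycle.
host_clocks_per_cycle = HOST_HZ // SIMULATED_HZ

# Simulated instruction rate, in millions of 6502 instructions per second.
simulated_mips = SIMULATED_HZ / AVG_6502_CYCLES / 1_000_000

# Host clocks available for each simulated 6502 instruction.
host_clocks_per_instruction = host_clocks_per_cycle * AVG_6502_CYCLES
```

Calling arm_instructions_in(host_clocks_per_cycle, 3) and arm_instructions_in(host_clocks_per_cycle, 2) reproduces the 16-25 range quoted above.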

Basic dispatch is: (contrived CPU Agnostic 6502ish Assembly syntax)

Code:

main_loop:
  LDX (PC)              ; fetch the opcode at PC
  INC PC                ; step past it
  AND PC, #$FFFF        ; wrap PC to 16 bits
  JSR (INSTRUCTIONS,X)  ; call the handler through the table
  JMP main_loop

There's 4 or 5+ instructions right there, and not "cheap" instructions either.

Then something as simple as LDA <ABS>:

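Code:

LDA_ABS:
    LDA (PC)          ; fetch the operand
    INC PC,2          ; step past the two operand bytes
    AND PC, #$FFFF    ; wrap PC to 16 bits
    STA A_REGISTER    ; store into the simulated accumulator
    RTS

So, LDA ABS is close to 60% of that 14MHz budget already.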
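
For comparison, here is a minimal Python sketch of the same table-driven dispatch. The names (Cpu, fetch_byte, INSTRUCTIONS, step) are made up for illustration and flags are left out. Unlike the simplified listing above, lda_abs does the full absolute-mode work: it fetches a 16-bit little-endian address and then reads the byte that address points to.

```python
class Cpu:
    def __init__(self, mem):
        self.mem = mem    # 64 KiB of simulated memory (a bytearray)
        self.pc = 0       # simulated program counter
        self.a = 0        # simulated accumulator


def fetch_byte(cpu):
    """Read the byte at PC and advance PC, wrapping at 16 bits."""
    value = cpu.mem[cpu.pc]
    cpu.pc = (cpu.pc + 1) & 0xFFFF
    return value


def lda_abs(cpu):
    """LDA absolute: the operand is a little-endian 16-bit address."""
    lo = fetch_byte(cpu)
    hi = fetch_byte(cpu)
    cpu.a = cpu.mem[(hi << 8) | lo]   # N/Z flag updates omitted in this sketch


def nop(cpu):
    pass


# Opcode -> handler table; only two opcodes are filled in here.
INSTRUCTIONS = {0xAD: lda_abs, 0xEA: nop}


def step(cpu):
    """One pass of the main loop: fetch an opcode and call its handler."""
    opcode = fetch_byte(cpu)
    INSTRUCTIONS[opcode](cpu)
```

Even in this form, one LDA absolute costs three memory fetches, three PC updates with masking, a table lookup and a call, which is the point being made about how quickly the work adds up.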

Anyway, you may not be able to get much more than 14MHz. It's just interesting how fast it all adds up.

Have you considered JIT compiling the 6502? Zziinnng!!
BigEd
Posts: 11464
Joined: 11 Dec 2008
Location: England

Post by BigEd »

(The PiTubeDirect project manages >200MHz at 1GHz. It's written in very optimised assembly language, and is a fast-as-possible emulator making no effort to match the proportional cycle counts of each instruction.
There's a timeline here showing the progress from the initial 84MHz. The project offers several models of CPU, including another 6502 model based on Ian Piumarta's lib6502 written in C, which runs at about 20MHz. So that's a benchmark for a C model, and is very much in line with 14MHz at 700MHz.)
Druzyek
Posts: 367
Joined: 12 May 2014

Post by Druzyek »

whartung wrote:
2-3 cycles per instruction, that's 16-25 instructions for each 6502 instruction.
Aren't a lot of ARM instructions single cycle or close to it? You may have more instructions to work with than 25. OTOH, I'm not sure how the Raspberry Pi is set up and how fast it can run from RAM.
https://hardwarebug.org/2014/05/15/cort ... e-timings/
BitWise
In Memoriam
Posts: 996
Joined: 02 Mar 2004
Location: Berkshire, UK

Post by BitWise »

I found the biggest speed gains in my code were from using the processor's native status word and creating P only when it was actually needed. Also, C is the only flag bit that ever needs to be restored, and then only for shifts and arithmetic.
Andrew Jacobs
6502 & PIC Stuff - http://www.obelisk.me.uk/
Cross-Platform 6502/65C02/65816 Macro Assembler - http://www.obelisk.me.uk/dev65/
Open Source Projects - https://github.com/andrew-jacobs
8BIT
Posts: 1787
Joined: 30 Aug 2002
Location: Sacramento, CA

Post by 8BIT »

Wow, last night I posted a lengthy response and this morning it's not here. I always do the preview before I submit, so there's a small chance I had done that and forgot to click submit.

I'll try to recap that post and add more when I get home from work. Thank you to everyone who has offered help!!

Daryl
Please visit my website -> https://sbc.rictor.org/
BigEd
Posts: 11464
Joined: 11 Dec 2008
Location: England

Post by BigEd »

(I think if you compose something and someone else posts in the meantime, you get a second confirmation stage - I've lost a post or two that way, thinking I'd posted when I hadn't.)
GARTHWILSON
Forum Moderator
Posts: 8773
Joined: 30 Aug 2002
Location: Southern California

Post by GARTHWILSON »

I hope you can find it in your browser history, either by re-opening closed tabs, or by using the back arrow ("Go back one page") in the tab history enough times.
http://WilsonMinesCo.com/ lots of 6502 resources
The "second front page" is http://wilsonminesco.com/links.html .
What's an additional VIA among friends, anyhow?
White Flame
Posts: 704
Joined: 24 Jul 2012

Post by White Flame »

BitWise wrote:
I found the biggest speed gains in my code were from using the processor's native status word and creating P only when it was actually needed. Also, C is the only flag bit that ever needs to be restored, and then only for shifts and arithmetic.
I'm not sure if this is the same as your approach, but what I've done for N and Z is when those flags are modified, save the current 8-bit value that would affect them into a "last value" location. Then when a branch on those is needed, pull that last value and use a native branch. Those 2 flags in particular are often modified but ignored, so it doesn't make sense to put work into computing them all the time.

With a custom native asm environment on ARM, you might even dedicate a register for holding that NZ-affecting value.
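
A minimal Python sketch of that "last value" idea, under stated assumptions: the class and method names (LazyFlags, set_nz, pack_p) are invented for illustration, and only N, Z and C are modelled. One caveat: a single byte cannot represent N and Z both being set, which the 6502 allows after PLP or BIT, so a real implementation would need a wider value or a fallback path for those cases.

```python
class LazyFlags:
    """Keeps the last N/Z-affecting byte instead of computing N and Z eagerly."""

    def __init__(self):
        self.nz_value = 1    # nonzero with bit 7 clear, so N=0 and Z=0
        self.carry = 0

    def set_nz(self, value):
        # Called by loads, transfers and ALU ops: one store, no flag work.
        self.nz_value = value & 0xFF

    @property
    def negative(self):
        return (self.nz_value & 0x80) != 0

    @property
    def zero(self):
        return self.nz_value == 0

    def pack_p(self, other_bits=0):
        """Build a 6502-style P byte only when something needs it (e.g. PHP).

        other_bits supplies the bits this sketch does not track; its N, Z
        and C positions (bits 7, 1 and 0) are ignored.
        """
        p = other_bits & 0x7C
        if self.negative:
            p |= 0x80
        if self.zero:
            p |= 0x02
        return p | (self.carry & 1)


def bne_taken(flags):
    """BNE branches when Z is clear; the test runs only when a branch asks."""
    return not flags.zero
```

The point of the design is that set_nz is the only work done on the common path; the N and Z tests and the P byte are computed only when a branch or a push actually asks for them.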
8BIT
Posts: 1787
Joined: 30 Aug 2002
Location: Sacramento, CA

Post by 8BIT »

I don't have the energy to recompose the full message from yesterday, but the summary is this:

Thanks for all the help. The tube project is very intriguing, as they have some awesome performance numbers. I have reviewed some of the comments in the sources and have gained a few ideas that I can try on my build.

I have been simulating the Status Register, which admittedly does use up a lot of instructions. The memory loads and stores also take up quite a bit, due partly to 32-bit word alignment. 16-bit reads are being done with two 8-bit reads and then combining the results. I read something last night about disabling "alignment", which may help me with that issue. I need to experiment some more.

This build uses the same model as my AVR code, using macros to do memory and IO reads and writes, address modes, and opcodes. I also use a macro for the fetch process, which gets added to the end of every opcode. I'll explain why in a minute.

Here's a sample of my LDA ABS instruction:

Code:

C_AD:				; LDA_ABS - opcode $AD
	ABS
	C_LDA
	CYCLE 4
	FETCH

ABS will read the 2 bytes after the opcode, form the target address, and read the byte at that address (RAM, ROM, or IO).
C_LDA will transfer that data to the Acc register and update the proper flags.
CYCLE 4 is a one-ARM-instruction macro that adds the number to the cycle count.
FETCH will get the opcode from the location pointed to by the PC register and branch to the proper code (label C_AD in the case of an LDA ABS).

By placing the fetch macro at the end of each opcode, I avoid subroutine overhead and/or extra branch instructions. The pipeline architecture of the ARM allows for 4 instructions to be moving through the pipe, starting with the fetch and ending with the operation. If a branch occurs, the pipeline stalls (has to be refreshed), which uses up extra cycles. The cool thing is that most instructions can be conditionally executed, which eliminates the need for forward branches and keeps the pipeline moving. If a condition is not met, a 1-cycle nop replaces the instruction.
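
The shape of that macro composition can be sketched in Python, with heavy caveats. Python has no computed branch, so each handler here returns the next handler and a small driver loop stands in for the jump; that loop is exactly the overhead the ARM version avoids. All names (addr_abs, op_lda, cycle, fetch, c_ad, HANDLERS, run) are illustrative stand-ins for the macros described above, not the actual source, and flag handling is reduced to remembering the last value.

```python
class Cpu:
    def __init__(self, mem):
        self.mem = mem        # 64 KiB of simulated memory
        self.pc = 0
        self.a = 0
        self.nz = 1           # last N/Z-affecting value (flags only sketched)
        self.cycles = 0


def next_byte(cpu):
    value = cpu.mem[cpu.pc]
    cpu.pc = (cpu.pc + 1) & 0xFFFF
    return value


def addr_abs(cpu):
    """ABS: read the two operand bytes and return the byte at that address."""
    lo = next_byte(cpu)
    hi = next_byte(cpu)
    return cpu.mem[(hi << 8) | lo]


def op_lda(cpu, data):
    """C_LDA: move the data into the accumulator and note it for N/Z."""
    cpu.a = data
    cpu.nz = data


def cycle(cpu, count):
    """CYCLE: add the instruction's cycle count to the running total."""
    cpu.cycles += count


def fetch(cpu):
    """FETCH: read the next opcode and pick its handler."""
    return HANDLERS[next_byte(cpu)]


def c_ad(cpu):                # LDA absolute, opcode $AD
    op_lda(cpu, addr_abs(cpu))
    cycle(cpu, 4)
    return fetch(cpu)         # every handler ends with its own fetch


def c_ea(cpu):                # NOP, opcode $EA, 2 cycles
    cycle(cpu, 2)
    return fetch(cpu)


HANDLERS = {0xAD: c_ad, 0xEA: c_ea}


def run(cpu, instruction_count):
    """Driver standing in for the direct branch between handlers."""
    handler = fetch(cpu)
    for _ in range(instruction_count):
        handler = handler(cpu)
```

Because each handler fetches the following opcode before returning, PC has already moved past the next opcode when run stops.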

I also need to find optimizations for calculating flags for an 8-bit value in a 32-bit register; e.g., the ARM has a V flag, but it's looking at overflow at bit 31, not bit 7.

Bottom line: streamlining the code, removing complexity from memory/IO reads and writes, using native flags when possible instead of simulating them, and reducing pipeline stalls should allow me to speed up the simulation.

More still to come.

Daryl
BigEd
Posts: 11464
Joined: 11 Dec 2008
Location: England

Post by BigEd »

Because many ARM instructions offer a free shift, you can consider the technique of keeping 8-bit values left-aligned in their registers - this way, N and V work out much better.
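
To see why left alignment helps, here is a small Python model. It assumes a generic 32-bit add that sets N, Z, C and V the way an ARM flag-setting add does; the function names are made up, and carry-in (needed for a real ADC) is deliberately left out. With both operands shifted left by 24, the 32-bit flags come out the same as the flags of the 8-bit add.

```python
MASK32 = 0xFFFFFFFF


def adds32(a, b):
    """Model of a 32-bit add that sets N, Z, C, V from the 32-bit result."""
    full = a + b
    result = full & MASK32
    n = result >> 31
    z = int(result == 0)
    c = int(full > MASK32)
    # Signed overflow: operands share a sign and the result's sign differs.
    v = ((~(a ^ b) & (a ^ result)) >> 31) & 1
    return result, n, z, c, v


def add8_reference(a, b):
    """Plain 8-bit add with flags taken at bit 7, as the 6502 defines them."""
    full = a + b
    result = full & 0xFF
    n = result >> 7
    z = int(result == 0)
    c = int(full > 0xFF)
    v = ((~(a ^ b) & (a ^ result)) >> 7) & 1
    return result, n, z, c, v


def add8_left_aligned(a, b):
    """Same add done on values held in the top byte of a 32-bit word."""
    result, n, z, c, v = adds32(a << 24, b << 24)
    return result >> 24, n, z, c, v
```

For example, add8_left_aligned(0x50, 0x50) returns (0xA0, 1, 0, 0, 1): the result is negative and overflowed, exactly as the 8-bit reference reports, with no extra flag fix-up.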
dp11
Posts: 33
Joined: 11 Nov 2017

Post by dp11 »

With the ARM on the Pi, shifts aren't always free these days; it depends on interlocks.
BigEd
Posts: 11464
Joined: 11 Dec 2008
Location: England

Post by BigEd »

Oh, that's a surprise!
8BIT
Posts: 1787
Joined: 30 Aug 2002
Location: Sacramento, CA

Post by 8BIT »

So far, I haven't run into a problem with "free" shifts for operand 2. I haven't seen anything about interlocks in the assembly docs; can you explain what they are, dp11? I've only been using ARM assembly for about 3 weeks, and still have a lot to learn.

I had not considered keeping the data left aligned, thanks for the tip Ed! I was toying with the idea of filling bits 8-30 with 1's and copying the sign (bit 7) to bit 31. I think I would have the same results, but there'd be too much overhead. I will definitely experiment with left alignment.

Another trick I learned from the tube project was placing simulated memory at address 0x0000 - 0xFFFF, so that the PC can directly address memory. So far, that is working well for me. My only concern is with the VideoCore mailboxes (which I have not played with yet). I think they live in the space between 0x0000 and 0x8000. If that is true and I cannot relocate them, I will not be able to use this trick. I will eventually want to use the video display, especially after I get a 65816 simulator working, which can address 16MB of RAM.

Back to the code.

Daryl
dp11
Posts: 33
Joined: 11 Nov 2017

Post by dp11 »

With newer ARMs, some of what you thought you knew about timings has changed. A lot isn't documented publicly either. It can also change between ARMv6, v7 and v8.

You will only spot the interlocks if you accurately cycle count the ARM instructions.

Code:

; code that might simulate an LDA
    LDRB R0,[Rx]
    MOVS R1,R0
; the above takes, say, n cycles

    LDRB R0,[Rx]
    MOVS R1,R0,LSL #24
; the above takes n+1 cycles on the ARM core used in the Pi1

Mailboxes can live wherever you like, thanks to the MMU. Using 0x0000 for the memory doesn't give you a big win in performance. The big wins come from scheduling the loads from memory and keeping the code small enough that it fits in the cache. Dealing with 6502 interrupts can be a big overhead.
Last edited by dp11 on Thu Feb 08, 2018 3:49 pm, edited 1 time in total.