65C02 Simulator on a bare metal Raspberry PI
Re: 65C02 Simulator on a bare metal Raspberry PI
Well done - a milestone! It will be interesting to see the progress as you find optimisations.
Re: 65C02 Simulator on a bare metal Raspberry PI
14MHz on a 700MHz machine works out to 50 ARM clocks for each simulated 6502 cycle. Now, I don't know anything about ARM assembly, or how many clocks each ARM instruction takes. On the 6502, a cycle is a clock pulse, so 14MHz is 14M cycles per second.
With an average instruction time of 2 cycles (total swag), that's 7M instructions per second (7 MIPS).
So, how many instructions can an ARM do in 50 cycles? That should give you a better idea of what "ultimate performance" might be.
Are you using a table lookup for the instruction processing? I assume so. 50 cycles sounds like a lot, but it's not that much in this context. At 2-3 cycles per instruction, that's 16-25 instructions for each 6502 instruction.
Basic dispatch is: (contrived CPU Agnostic 6502ish Assembly syntax)
Code: Select all
main_loop:
    LDX (PC)              ; fetch the opcode at the simulated PC
    INC PC
    AND PC, #$FFFF        ; wrap the PC to 16 bits
    JSR (INSTRUCTIONS,X)  ; call the handler through the opcode table
    JMP main_loop
There's 4 or 5+ instructions right there, and not "cheap" instructions either.
Then something as simple as LDA <ABS>:
Code: Select all
LDA_ABS:
    LDA (PC)              ; fetch the operand that follows the opcode
    INC PC,2              ; step past the two operand bytes
    AND PC, #$FFFF        ; wrap the PC to 16 bits
    STA A_REGISTER
    RTS
So, LDA ABS is close to 60% of that 14MHz budget already.
Anyway, you may not be able to get much more than 14MHz. It's just interesting how fast it all adds up.
Have you considered JIT compiling the 6502? Zziinnng!!
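For readers who prefer C to the contrived assembly above, here is a minimal sketch of the same table-lookup dispatch. It is not taken from any simulator in this thread: the `Cpu` struct, `handlers` table, `init_handlers` and `step` are hypothetical names, memory is assumed to be a flat 64K array with no I/O, and unimplemented opcodes are routed to a placeholder that does nothing.

```c
#include <stdint.h>

/* Hypothetical simulator state; every name here is illustrative. */
typedef struct {
    uint16_t pc;          /* simulated program counter */
    uint8_t  a;           /* simulated accumulator */
    uint8_t  mem[65536];  /* flat 64K simulated memory, no I/O (assumption) */
} Cpu;

typedef void (*Handler)(Cpu *);

/* LDA absolute: read the 16-bit little-endian operand, then load A from it. */
void lda_abs(Cpu *c) {
    uint16_t lo = c->mem[c->pc];
    uint16_t hi = c->mem[(uint16_t)(c->pc + 1)];
    c->pc = (uint16_t)(c->pc + 2);
    c->a = c->mem[(uint16_t)(lo | (hi << 8))];
}

/* Placeholder for every opcode this sketch does not implement. */
void unimplemented(Cpu *c) { (void)c; }

Handler handlers[256];

void init_handlers(void) {
    for (int i = 0; i < 256; i++) handlers[i] = unimplemented;
    handlers[0xAD] = lda_abs;  /* $AD is LDA absolute */
}

/* One pass of the main loop: fetch, advance PC (uint16_t wraps), dispatch. */
void step(Cpu *c) {
    uint8_t opcode = c->mem[c->pc];
    c->pc = (uint16_t)(c->pc + 1);
    handlers[opcode](c);
}
```

Calling `init_handlers()` once and then `step(&cpu)` in a loop gives the fetch, increment, indirect call, return pattern counted in the post. Flag updates are left out here, so a real handler would cost more than this one.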
Re: 65C02 Simulator on a bare metal Raspberry PI
(The PiTubeDirect project manages >200MHz at 1GHz. It's written in very optimised assembly language, and is a fast-as-possible emulator making no effort to match the proportional cycle counts of each instruction.
There's a timeline here showing the progress from the initial 84MHz. The project offers several models of CPU, including another 6502 model based on Ian Piumarta's lib6502 written in C, which runs at about 20MHz. So that's a benchmark for a C model, and is very much in line with 14MHz at 700MHz.)
Re: 65C02 Simulator on a bare metal Raspberry PI
whartung wrote:
2-3 cycles per instruction, that's 16-25 instructions for each 6502 instruction.
https://hardwarebug.org/2014/05/15/cort ... e-timings/
- BitWise
Re: 65C02 Simulator on a bare metal Raspberry PI
I found the biggest speed gains in my code were from using the processor's native status word and creating P only when it was actually needed. Also, C is the only flag bit that ever needs to be restored, and then only for shifts and arithmetic.
Andrew Jacobs
6502 & PIC Stuff - http://www.obelisk.me.uk/
Cross-Platform 6502/65C02/65816 Macro Assembler - http://www.obelisk.me.uk/dev65/
Open Source Projects - https://github.com/andrew-jacobs
Re: 65C02 Simulator on a bare metal Raspberry PI
Wow, last night I posted a lengthy response and this morning it's not here. I always do the preview before I submit, so there's a small chance I had done that and forgot to click submit.
I'll try to recap that post and add more when I get home from work. Thank you to everyone who has offered help!!
Daryl
Please visit my website -> https://sbc.rictor.org/
Re: 65C02 Simulator on a bare metal Raspberry PI
(I think if you compose something and someone else posts in the meantime, you get a second confirmation stage - I've lost a post or two that way, thinking I'd posted when I hadn't.)
- GARTHWILSON
Re: 65C02 Simulator on a bare metal Raspberry PI
I hope you can find it in your browser history, either by re-opening closed tabs, or by using the back arrow ("Go back one page") in the tab history enough times.
http://WilsonMinesCo.com/ lots of 6502 resources
The "second front page" is http://wilsonminesco.com/links.html .
What's an additional VIA among friends, anyhow?
- White Flame
Re: 65C02 Simulator on a bare metal Raspberry PI
BitWise wrote:
I found the biggest speed gains in my code were from using the processor's native status word and creating P only when it was actually needed. Also, C is the only flag bit that ever needs to be restored, and then only for shifts and arithmetic.
With a custom native asm environment on ARM, you might even dedicate a register for holding that NZ-affecting value.
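The "create P only when needed" idea, combined with keeping the last NZ-affecting value around, might look roughly like this in C. This is a sketch under stated assumptions rather than anyone's actual code: `LazyCpu`, `lda_imm` and `build_p` are hypothetical names, the B flag is ignored, and the remaining flags are simply stored in their P bit positions.

```c
#include <stdint.h>

/* Bit positions of the 6502 status register P. */
enum { FLAG_C = 0x01, FLAG_Z = 0x02, FLAG_I = 0x04, FLAG_D = 0x08,
       FLAG_U = 0x20, FLAG_V = 0x40, FLAG_N = 0x80 };

/* Hypothetical state: N and Z are never stored, only the value they derive from. */
typedef struct {
    uint8_t a;      /* simulated accumulator */
    uint8_t nz;     /* last result that affected N and Z */
    uint8_t other;  /* C, I, D and V, kept in their P bit positions */
} LazyCpu;

/* A load only records the value; no flag arithmetic happens here. */
void lda_imm(LazyCpu *c, uint8_t value) {
    c->a = value;
    c->nz = value;
}

/* Build P on demand, e.g. for PHP or when taking an interrupt. */
uint8_t build_p(const LazyCpu *c) {
    uint8_t p = (uint8_t)(c->other & (FLAG_C | FLAG_I | FLAG_D | FLAG_V));
    p |= FLAG_U;                      /* bit 5 reads as 1 */
    if (c->nz == 0)   p |= FLAG_Z;
    if (c->nz & 0x80) p |= FLAG_N;
    return p;
}
```

One caveat with this simplified form: a single byte cannot represent N and Z both set, which PLP or RTI can legally produce, so restoring P would need a wider `nz` value or a special case.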
Re: 65C02 Simulator on a bare metal Raspberry PI
I don't have the energy to recompose the full message from yesterday, but the summary is this:
Thanks for all the help. The tube project is very intriguing, as they have some awesome performance numbers. I have reviewed some of the comments in the sources and have gained a few ideas that I can try on my build.
I have been simulating the Status Register, which admittedly does use up a lot of instructions. The memory load and save also take up quite a bit, due partly to 32 bit word alignment. 16 bit reads are being done with two 8 bit reads and then combining the results. I read something last night about disabling "alignment", which may help me with that issue. I need to experiment some more.
This build uses the same model as my AVR code, using macros to do memory and IO reads and writes, address modes, and opcodes. I also use a macro for the fetch process, which gets added to the end of every opcode. I'll explain why in a minute.
Here's a sample of my LDA ABS instruction:
Code: Select all
C_AD: ; LDA_ABS - opcode $AD
    ABS
    C_LDA
    CYCLE 4
    FETCH
ABS will read the 2 bytes after the opcode, form the target address, and read the byte at that address (RAM, ROM, or IO).
C_LDA will transfer that data to the Acc register and update the proper flags.
CYCLE 4 is a 1 ARM instruction macro that adds the number to the cycle count.
FETCH will get the opcode from the location pointed to by the PC register and branch to the proper code (label C_AD in the case of an LDA ABS).
By placing the fetch macro at the end of each opcode, I avoid subroutine overhead and/or extra branch instructions. The pipeline architecture of the ARM allows for 4 instructions to be moving through the pipe, starting with the fetch and ending with the operation. If a branch occurs, the pipeline stalls (has to be refreshed), which uses up extra cycles. The cool thing is that most instructions can be conditionally executed, which eliminates the need for forward branches and keeps the pipeline moving. If a condition is not met, a 1 cycle nop replaces the instruction.
I also need to find optimizations in calculating flags for an 8 bit value in a 32 bit register, i.e., the ARM has a V flag, but it's looking at overflows at bit 31, not bit 7.
Bottom line: streamlining the code, removing complexity from memory/IO reads and writes, using native flags when possible instead of simulating them, and reducing pipeline stalls will allow me to speed up the simulation.
More still to come.
Daryl
Please visit my website -> https://sbc.rictor.org/
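To make the 16 bit read issue concrete, here is a small C sketch of the two approaches mentioned: combining two 8 bit reads, versus a single load that relies on unaligned access being allowed. The function names are hypothetical, `mem` is assumed to be a flat 64K array, and the single-load version assumes a little-endian host, which matches the 6502's byte order.

```c
#include <stdint.h>
#include <string.h>

/* Portable version: two byte reads combined, low byte first. */
uint16_t read16_bytes(const uint8_t *mem, uint16_t addr) {
    uint16_t lo = mem[addr];
    uint16_t hi = mem[(uint16_t)(addr + 1)];  /* address wraps at 64K */
    return (uint16_t)(lo | (hi << 8));
}

/* Single-load version. Assumes a little-endian host and addr < 0xFFFF,
 * since it reads two consecutive bytes without wrapping. memcpy is used
 * so the compiler decides whether an unaligned halfword load is legal. */
uint16_t read16_memcpy(const uint8_t *mem, uint16_t addr) {
    uint16_t v;
    memcpy(&v, mem + addr, sizeof v);
    return v;
}
```

Whether the second form actually compiles down to one load depends on the compiler and on how alignment checking is configured on the target, so it is worth inspecting the generated code before relying on it.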
Re: 65C02 Simulator on a bare metal Raspberry PI
Because many ARM instructions offer a free shift, you can consider the technique of keeping 8-bit values left-aligned in their registers - this way, N and V work out much better.
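As an illustration of why left alignment helps, the C sketch below adds two 8 bit values held in the top byte of a 32 bit word, so that the sign, carry and overflow of the 8 bit result appear at bit 31, which is where the native ARM flags look. `AddResult` and `add8_left_aligned` are hypothetical names, and carry-in is deliberately left out to keep the example short.

```c
#include <stdint.h>

typedef struct {
    uint8_t result;  /* 8-bit sum */
    int n, z, c, v;  /* flags as they would fall out of a 32-bit add */
} AddResult;

/* Add two 8-bit values kept in bits 31..24 of a 32-bit word (no carry-in). */
AddResult add8_left_aligned(uint8_t a, uint8_t m) {
    uint32_t x = (uint32_t)a << 24;
    uint32_t y = (uint32_t)m << 24;
    uint32_t r = x + y;                        /* wraps modulo 2^32 */
    AddResult out;
    out.result = (uint8_t)(r >> 24);
    out.n = (int)(r >> 31);                    /* bit 7 of the result is now bit 31 */
    out.z = (r == 0);                          /* low 24 bits are zero by construction */
    out.c = (r < x);                           /* carry out of bit 31 */
    out.v = (int)(((x ^ r) & (y ^ r)) >> 31);  /* signed overflow at bit 31 */
    return out;
}
```

For example, `add8_left_aligned(0x50, 0x50)` yields 0xA0 with N and V set and C clear, the same flags the 6502 would produce for that addition with carry clear. Folding in the carry needs extra care, because the incoming carry has to land at bit 24 rather than bit 0.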
Re: 65C02 Simulator on a bare metal Raspberry PI
With the ARM on the Pi, shifts aren't always free these days; it depends on interlocks.
Re: 65C02 Simulator on a bare metal Raspberry PI
Oh, that's a surprise!
Re: 65C02 Simulator on a bare metal Raspberry PI
So far, I haven't run into a problem with "free" shifts for operand 2. I haven't seen anything about interlocks in the assembly docs. Can you explain what they are, dp11? I've only been using ARM assembly for about 3 weeks, and have a lot to learn still.
I had not considered keeping the data left aligned, thanks for the tip, Ed! I was toying with the idea of filling bits 8-30 with 1's and copying the sign (bit 7) to bit 31. I think I would have the same results, but there'd be too much overhead. I will definitely experiment with left alignment.
Another trick I learned from the tube project was placing simulated memory at address 0x0000 - 0xFFFF, so that the PC can directly address memory. So far, that is working well for me. My only concern is with the VideoCore mailboxes (which I have not played with yet). I think they live in the space between 0x0000 and 0x8000. If that is true and I cannot relocate them, I will not be able to use this trick. I will eventually want to use the video display, especially after I get a 65816 simulator working, which can address 16MB of RAM.
Back to the code.
Daryl
Please visit my website -> https://sbc.rictor.org/
Re: 65C02 Simulator on a bare metal Raspberry PI
With newer ARMs, some of what you thought you knew about timings has changed. A lot isn't documented publicly either. It can also change between ARMv6, v7 and v8.
You will only spot the interlocks if you accurately cycle count the ARM instructions.
Code: Select all
; code that might simulate an LDA
LDRB R0,[Rx]
MOVS R1,R0
; the above takes, say, n cycles
LDRB R0,[Rx]
MOVS R1,R0,LSL #24
; the above takes n+1 cycles on the ARM core used in the Pi 1
Mailboxes can live wherever you like, thanks to the MMU. Using 0x0000 for the memory doesn't give you a big win in performance. The big wins come from scheduling the loads from memory and keeping the code small enough that it fits in the cache. Dealing with 6502 interrupts can be a big overhead.
Last edited by dp11 on Thu Feb 08, 2018 3:49 pm, edited 1 time in total.