Re: 65C02 Simulator on a bare metal Raspberry PI

Posted: Thu Feb 08, 2018 1:33 pm
by hoglet
In case you missed it (because it's a different thread), a while back I wrote up some of the history of the PiTubeDirect project:
http://www.stardot.org.uk/forums/viewto ... 55#p140955

Dave

Re: 65C02 Simulator on a bare metal Raspberry PI

Posted: Thu Feb 08, 2018 1:45 pm
by 8BIT
Yes, I have read that, but just on a cursory level. I do want to go back and dig into it and the source code at some point. The speeds you are getting are outstanding, and I'm sure I could learn a lot from this project. Thanks for sharing this Dave!

Daryl

Re: 65C02 Simulator on a bare metal Raspberry PI

Posted: Thu Feb 08, 2018 1:51 pm
by 8BIT
dp11 wrote:
With newer ARMs, some of what you thought you knew about timings has changed. A lot isn't documented publicly either. It can also change between ARMv6, v7 and v8.

You will only spot the interlocks if you accurately cycle count the ARM instructions.

Code:

; code that might simulate an LDA
    LDRB R0,[Rx]
    MOVS R1,R0
; the above takes, say, n cycles

    LDRB R0,[Rx]
    MOVS R1,R0,LSL #24
; the above takes n+1 cycles on the ARM core used in the Pi 1

The mailbox can live wherever you like, thanks to the MMU. Using 0x0000 for the memory doesn't give you a big win in performance. The big wins come from scheduling the loads from memory and keeping the code small enough that it fits in the cache. Dealing with 6502 interrupts can be a big overhead.
OK, I see what you are saying. Thanks for sharing. I know using 0x0000 won't boost performance, but it will help ease the code a bit. I have not played with the MMU yet; I'm pretty much just using David Welch's basic "set the stack pointer and go" example. I did manage to enable the L1 caches. I know I have a lot to learn yet, especially with the PI hardware on top of the ARM.

Daryl

Re: 65C02 Simulator on a bare metal Raspberry PI

Posted: Thu Feb 08, 2018 10:48 pm
by White Flame
BigEd wrote:
Because many ARM instructions offer a free shift, you can consider the technique of keeping 8-bit values left-aligned in their registers - this way, N and V work out much better.
I think you meant N and Z, not V? I was going to suggest something similar, but refrained because it doesn't support carry coming in from the LSB for addition/subtraction. There are a few solutions that pull in more cases, but they all have downsides:
  1. Do ops in the upper 8 bits: C, N, V, Z work, but not right-input carry (ADC/SBC/ROL) nor ROR/LSR's right-output carry.
  2. Do ops in the upper 8 bits with the lower 24 set: Right-input carry for ADC/SBC now works, but Z breaks.
  3. Do ops in the lower 8 bits, then shift left 24 bits: C, N, Z, right-input, right-output, & left-output carry all work, but V and left-input carry don't.
Across the space of ops, you'd have to find the best balance for the intermediate register storage, but probably handle the individual ops differently. I also admit that I don't know how ARM's carry bit interacts with the shifter, if at all, so that would affect how useful my "shift by 24 after operation to output carry" assumptions are.

You could also do something crazy like keep the accumulator as a 10-bit number to represent what's going on with carry on both ends. But I'm pretty sure you'd always lose V in the process.
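
As a rough illustration of options 1 and 2 in the list above, here is a small C sketch of how N and Z fall out of a left-aligned 8-bit value, and why filling the low 24 bits with ones breaks Z. It is not taken from any emulator mentioned in this thread; `flags_nz`, `left_align` and `left_align_filled` are names made up for the example, and the flags are modelled in plain C rather than read from the ARM status register.

```c
#include <stdint.h>

/* Hypothetical helper type: N and Z as a 32-bit ALU would report them. */
typedef struct { int n, z; } nz_flags;

static nz_flags flags_nz(uint32_t reg)
{
    nz_flags f;
    f.n = (int)(reg >> 31);   /* bit 31 of the word is bit 7 of the 6502 value */
    f.z = (reg == 0);         /* only meaningful when the low 24 bits are clear */
    return f;
}

/* Option 1: value in the upper 8 bits, low 24 bits clear. */
static uint32_t left_align(uint8_t v)
{
    return (uint32_t)v << 24;
}

/* Option 2: value in the upper 8 bits, low 24 bits set to 1. */
static uint32_t left_align_filled(uint8_t v)
{
    return ((uint32_t)v << 24) | 0x00FFFFFFu;
}
```

With option 1 a 6502 value of $00 gives Z set, while with option 2 the same value gives Z clear because the filler bits make the word non-zero; N is correct in both cases.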

Re: 65C02 Simulator on a bare metal Raspberry PI

Posted: Fri Feb 09, 2018 12:30 am
by 8BIT
Having 3 of the 4 flags would still greatly simplify (and speed up) the simulation. V is the hardest to simulate, so ensuring it gets adjusted would be my first priority. N, Z and C are easy to adjust.

I am going to work up a left-aligned model and give that a go. My dev process is pretty fast: downloading the 90k image takes about 20 seconds, and compiling takes 5. It should only take a few hours to work out the bugs.

Daryl

Re: 65C02 Simulator on a bare metal Raspberry PI

Posted: Fri Feb 09, 2018 9:54 am
by BigEd
I've the vaguest recollection that the left-aligned approach works well for everything except ADC and SBC. But you can, if you like, have a look at a reverse-engineered source for one of Acorn's efforts:
http://stardot.org.uk/forums/viewtopic. ... 32#p127732

Re: 65C02 Simulator on a bare metal Raspberry PI

Posted: Sat Feb 10, 2018 1:05 am
by White Flame
If just for ADC and SBC you set the low 24 bits to 1 before the operation, then the input carry should ripple up and have its proper effect. You'd need to clear them afterwards, though, to get Z working again.
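
The idea above can be sketched in C as follows. This is only a model of the arithmetic, not code from any of the emulators discussed: it covers binary-mode ADC only (no BCD), uses a 64-bit sum to stand in for the ARM carry flag, and `adc_left_aligned` and `adc_result` are hypothetical names.

```c
#include <stdint.h>

typedef struct { uint8_t a; int c, v, n, z; } adc_result;

/* 8-bit binary ADC performed in the top byte of a 32-bit word. */
static adc_result adc_left_aligned(uint8_t a, uint8_t m, int carry_in)
{
    uint32_t ra = (uint32_t)a << 24;                  /* accumulator, left-aligned */
    uint32_t rm = ((uint32_t)m << 24) | 0x00FFFFFFu;  /* operand, low 24 bits set */

    /* A carry-in of 1 ripples through the 24 one-bits into bit 24. */
    uint64_t wide = (uint64_t)ra + rm + (uint64_t)(carry_in & 1);
    uint32_t sum  = (uint32_t)wide;

    adc_result r;
    r.c = (int)((wide >> 32) & 1);                    /* carry out of bit 31 */
    /* Signed overflow: operands have the same sign, result has the other. */
    r.v = (int)(((~(ra ^ rm) & (ra ^ sum)) >> 31) & 1);

    sum &= 0xFF000000u;                               /* clear filler so Z works */
    r.n = (int)(sum >> 31);
    r.z = (sum == 0);
    r.a = (uint8_t)(sum >> 24);
    return r;
}
```

For example, $7F + $00 with carry set yields $80 with N and V set, and $FF + $00 with carry set yields $00 with C and Z set, which matches what the 6502 does in binary mode.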

Re: 65C02 Simulator on a bare metal Raspberry PI

Posted: Sun Feb 11, 2018 6:07 pm
by dp11
Just been doing a few speed tests with PiTubeDirect on an RPi Zero at 1GHz.

This is with the C version of the 6502 core, not the high-performance assembler version, which is about 275MHz.

With an older GCC (most likely GCC 6) we get about 86.44MHz equivalent performance.
With GCC 7 we get about 73.47MHz equivalent performance.

I haven't investigated this difference yet. It might be that another compiler flag is needed to get GCC 7 working correctly.

Re: 65C02 Simulator on a bare metal Raspberry PI

Posted: Sun Feb 11, 2018 10:36 pm
by BitWise
Has anyone tried lookup tables? The pi has plenty of RAM.

Re: 65C02 Simulator on a bare metal Raspberry PI

Posted: Mon Feb 12, 2018 8:44 am
by dp11
Lookup tables have been slower in the past, possibly due to the code not fitting into the cache. My spreadsheet simulation suggests lookup tables might now be about 0.5% faster if a few other things hold true.

Re: 65C02 Simulator on a bare metal Raspberry PI

Posted: Wed Feb 14, 2018 3:49 pm
by 8BIT
I've done some experimenting with the left-alignment and its effects on the ARM's flags. This method works well with most 6502 instructions. ROR and ROL need some extra help, and I have not started testing with ADC or SBC yet. I did note that the ARM V flag gets changed by many instructions for which the 65C02 leaves V alone. For instance, INC and DEC have to use ARM ADD and SUB, which change the V flag. It might be more costly to adjust and save/restore the ARM's flags than to continue simulating the 65C02 Status register. Using the left-aligned method does reduce my code some though, so I plan to implement it in the A, X, and Y registers.
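
One way to see why the rotates need extra help: with a left-aligned register the 6502's bit 0 sits at bit 24 of the word, so the carry has to be inserted or extracted there rather than at the bottom of the word. The C sketch below only models that bit movement; it is not Daryl's code, and `rol_left_aligned`, `ror_left_aligned` and `rot_result` are hypothetical names.

```c
#include <stdint.h>

typedef struct { uint32_t reg; int c; } rot_result;

/* ROL: old bit 7 (word bit 31) goes to carry,
   old carry enters at bit 0 of the value (word bit 24). */
static rot_result rol_left_aligned(uint32_t reg, int carry_in)
{
    rot_result r;
    r.c   = (int)(reg >> 31);
    r.reg = (reg << 1) | ((uint32_t)(carry_in & 1) << 24);
    return r;
}

/* ROR: old bit 0 (word bit 24) goes to carry,
   old carry enters at bit 7 of the value (word bit 31). */
static rot_result ror_left_aligned(uint32_t reg, int carry_in)
{
    rot_result r;
    r.c   = (int)((reg >> 24) & 1);
    /* The shift pushes old bit 0 down into bit 23, so mask it off. */
    r.reg = ((reg >> 1) & 0xFF000000u) | ((uint32_t)(carry_in & 1) << 31);
    return r;
}
```

Both functions assume the low 24 bits of `reg` are clear on entry and leave them clear on exit.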

I have also written a script to convert my macro-based source to the full assembly for each opcode. This will allow me to streamline each opcode and eliminate some extra register moves the macros created. The Tube project aligns the opcode processing on 16-word boundaries, which makes the fetch and execute control really efficient. But it gives a maximum of 16 ARM instructions per opcode. About half of mine are more than 16 currently. I could use a 32-word boundary, but the 16-word method allows the full 256 opcodes to reside in the L1 cache, which provides a big speed boost.
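
The arithmetic behind the fixed-slot layout can be sketched in C. This only shows the address calculation and the resulting block sizes; the constant and function names are made up for the example, and the real dispatch would be a couple of ARM instructions rather than C.

```c
#include <stdint.h>

enum {
    WORDS_PER_SLOT = 16,                  /* ARM instructions per opcode handler */
    BYTES_PER_SLOT = WORDS_PER_SLOT * 4,  /* 32-bit ARM instructions: 64 bytes */
    NUM_OPCODES    = 256
};

/* Byte offset of an opcode's handler from the start of the handler block.
   With fixed-size slots no jump table is needed: it is just opcode << 6. */
static uint32_t handler_offset(uint8_t opcode)
{
    return (uint32_t)opcode * BYTES_PER_SLOT;
}

/* Total size of the handler block for a given slot size in words. */
static uint32_t handler_block_bytes(uint32_t words_per_slot)
{
    return NUM_OPCODES * words_per_slot * 4u;
}
```

With 16-word slots the block is 16384 bytes (16 KB); with 32-word slots it doubles to 32768 bytes. The Pi 1's ARM1176 is usually quoted as having a 16 KB L1 instruction cache, which would be consistent with the 16-word layout just fitting.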

I need to focus on the memory decoding of the read and write functions, as testing for IO and write protecting the "ROM" takes too many instructions.
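
One possible way to cut down the decoding, sketched here in C purely as an illustration, is a per-page attribute table so that a write needs a single lookup instead of several address comparisons. The memory map below (I/O at $FE00-$FEFF, "ROM" from $C000 up) is an assumption invented for the example and not Daryl's actual layout; `page_kind_of`, `init_map`, `io_write` and `write6502` are hypothetical names. As noted earlier in the thread, table lookups are not guaranteed to be faster, so this would need measuring.

```c
#include <stdint.h>

/* Hypothetical memory map, for illustration only. */
enum page_kind { PAGE_RAM = 0, PAGE_ROM = 1, PAGE_IO = 2 };

static uint8_t mem[0x10000];        /* 64 KB of simulated 6502 memory */
static uint8_t page_kind_of[256];   /* one attribute per 256-byte page */

/* Assumed layout: I/O at $FE00-$FEFF, "ROM" elsewhere from $C000 up. */
static void init_map(void)
{
    for (int p = 0; p < 256; p++)
        page_kind_of[p] = (p == 0xFE) ? PAGE_IO
                        : (p >= 0xC0) ? PAGE_ROM
                        : PAGE_RAM;
}

/* Stub standing in for a real device write; it only counts calls. */
static int io_writes = 0;
static void io_write(uint16_t addr, uint8_t val)
{
    (void)addr;
    (void)val;
    io_writes++;
}

static void write6502(uint16_t addr, uint8_t val)
{
    switch (page_kind_of[addr >> 8]) {
    case PAGE_RAM: mem[addr] = val;      break;  /* common case: plain store */
    case PAGE_ROM:                       break;  /* write-protected: ignore */
    default:       io_write(addr, val);  break;  /* hand off to I/O */
    }
}
```

After `init_map()`, a write to $1234 lands in `mem`, a write to $C000 is dropped, and a write to $FE10 is routed to the I/O stub.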

This has been a huge learning experience for me... and I'm sure I'll keep finding ways to reduce code and increase speed.

Daryl

Re: 65C02 Simulator on a bare metal Raspberry PI

Posted: Thu Feb 22, 2018 5:40 pm
by 8BIT
Here's an update: I've been working on optimizing the ARM assembly code, mostly in the areas of memory reads and writes. I have also worked out the code for keeping the A, X and Y registers left-aligned, which has really simplified the Status register updates. I'm still keeping the SR simulated vs. using the ARM's own SR for N, V, Z and C. BCD math is still not very optimized, but I may try to tackle that. With all that, I was able to bump the speed up to 25MHz. There are a few places where I cut a corner, though. A 16-bit read at address $FF will not wrap around to $00. Also, a 16-bit read at the last RAM byte will not get its upper value from the IO page, but rather from the hidden RAM behind the IO page. Both are acceptable to me because there's no logical reason for a program to need to do that.
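
To make the first shortcut concrete, here is a C sketch comparing the two behaviours, assuming the $FF case refers to fetching a 16-bit pointer from zero page. The function names and the flat `mem` array are invented for the example.

```c
#include <stdint.h>

static uint8_t mem[0x10000];   /* simulated 6502 memory */

/* Shortcut: the high byte comes from addr+1, so a pointer at $FF
   takes its high byte from $0100. */
static uint16_t read16_zp_nowrap(uint8_t zp)
{
    return (uint16_t)(mem[zp] | (mem[(uint16_t)zp + 1] << 8));
}

/* Exact: the address of the high byte wraps within page zero,
   so a pointer at $FF takes its high byte from $00. */
static uint16_t read16_zp_wrap(uint8_t zp)
{
    return (uint16_t)(mem[zp] | (mem[(uint8_t)(zp + 1)] << 8));
}
```

The two functions agree for every zero-page address except $FF, which is why the shortcut is harmless for ordinary programs.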

Not being satisfied with the speed, I started digging into the ARM Architecture Reference Manual (ARM ARM) as I really didn't understand the cache and MMU functions that well. I also dug into David Welch's references and tried to set up the MMU and caches. It seems the ARM comes out of power-up with the MMU and caches disabled. After 3 or 4 attempts, I finally got both enabled and my simulated speed went up to over 100MHz. Bingo! And, after reading how the caches are organized, found that disabling the data cache increased the speed by about 10-12 more MHz. My speed test now reads around 120MHz!
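
For readers following along, the enable bits involved live in the ARM system control register (CP15 c1). The C sketch below only computes a new register value from the current one for the configuration described above (MMU and instruction cache on, data cache optional). It is not Daryl's code: `sctlr_config` is a hypothetical helper, and the actual register access (MRC/MCR p15, 0, Rd, c1, c0, 0) and the page table setup that must happen before the MMU is switched on are not shown.

```c
#include <stdint.h>

/* ARM system control register (CP15 c1) enable bits. */
#define SCTLR_M (1u << 0)    /* MMU enable */
#define SCTLR_C (1u << 2)    /* data cache enable */
#define SCTLR_I (1u << 12)   /* instruction cache enable */

/* Hypothetical helper: derive the new control value from the current one.
   Reading and writing the register itself is done with MRC/MCR and is
   deliberately left out of this sketch. */
static uint32_t sctlr_config(uint32_t current, int want_dcache)
{
    uint32_t v = current | SCTLR_M | SCTLR_I;   /* MMU + I-cache always on */
    if (want_dcache)
        v |= SCTLR_C;
    else
        v &= ~SCTLR_C;                          /* data cache left disabled */
    return v;
}
```

Starting from a value of 0, this gives 0x1001 with the data cache off and 0x1005 with it on.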

I will most likely go back and fix the shortcuts mentioned above and then start reading about the IO available in the PI to play with. Getting video display working is #1.

I'll package up what I have so far also and post it for those who are interested in giving it a try.

Daryl

Re: 65C02 Simulator on a bare metal Raspberry PI

Posted: Thu Feb 22, 2018 7:33 pm
by BigEd
Excellent progress!

Re: 65C02 Simulator on a bare metal Raspberry PI

Posted: Thu Feb 22, 2018 8:23 pm
by hoglet
8BIT wrote:
Bingo! And, after reading how the caches are organized, found that disabling the data cache increased the speed by about 10-12 more MHz.
Interesting... can you say a bit more about why this would be the case?

Re: 65C02 Simulator on a bare metal Raspberry PI

Posted: Thu Feb 22, 2018 8:33 pm
by BigEd
(Maybe because cache lines are much bigger than bytes, and an emulator's access doesn't have great locality compared to the size of a cache line??)