65C02 Simulator on a bare metal Raspberry PI

8BIT · Post by **8BIT** » Thu Feb 22, 2018 9:22 pm

hoglet wrote:

Interesting... can you say a bit more about why this would be the case?

I went back to the manual to find where I read this. I made a few errors. It was actually in the BCM2835 Technical Reference Manual, not the ARM ARM. ( http://infocenter.arm.com/help/topic/co ... p7_trm.pdf )

And, it had to do with interaction between Tightly Coupled Memory and the data cache. I recalled reading this (pg 7-13)

Quote:

There is a penalty of three clock cycles when the core switches between accessing
cache and TCM, for example if it thinks the access is in TCM, but it is in fact in
cache. So, three cycles for the first non-sequential access to TCM, when the
previous access on that side, I-side or D-side, was to cache and similarly, three
cycles penalty for the first non-sequential access to cache, when the previous
access on that side was to TCM. This is not an issue on the I-side, where code does
not typically branch between TCM and cacheable areas, but can be an issue for
data.

It stuck in my mind that disabling the data cache might increase performance. Since then, I discovered that the BCM2835 does not have TCM. For some reason, I still had it in my mind to test the speed with the data cache enabled and disabled. At this point, I cannot say exactly why it's faster with the data cache off. It could be that, with the randomness of the accesses to the simulated memory areas (zero page, ROM, RAM), there are too many cache misses??

Hope this helped a little.

Daryl

8BIT · Post by **8BIT** » Sat Feb 24, 2018 6:25 am

For anyone interested, here is the working binary (kernel.img) for the 65C02 simulator for the Raspberry PI v1. Put these three files on a FAT-formatted SD card and you should boot up to my Monitor. The only useful IO supported now (besides the serial port) is the ability to turn the ACT LED on and off. A write of any value to $8020 will turn it on, and a write to $8021 will turn it off.
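
As a rough illustration of how a simulator core might decode those two addresses, here is a minimal C sketch. It is not the code from the attached binary (which appears to be written in assembly); `sim_write`, `sim_ram`, and `act_led_on` are hypothetical names, and a plain flag stands in for the real GPIO access. ROM write protection is left out.

```c
#include <stdbool.h>
#include <stdint.h>

#define IO_LED_ON  0x8020u  /* any write turns the ACT LED on  (from the post) */
#define IO_LED_OFF 0x8021u  /* any write turns the ACT LED off (from the post) */

static uint8_t sim_ram[0x10000]; /* hypothetical flat 64 KB simulated memory */
static bool    act_led_on;       /* stand-in for the real GPIO write */

/* Hypothetical store handler, called by the core for every 65C02 write. */
void sim_write(uint16_t addr, uint8_t value)
{
    switch (addr) {
    case IO_LED_ON:
        act_led_on = true;       /* the value written is ignored */
        break;
    case IO_LED_OFF:
        act_led_on = false;
        break;
    default:
        sim_ram[addr] = value;   /* ordinary memory */
        break;
    }
}

bool sim_led_state(void) { return act_led_on; }
```

From the 65C02 side, `STA $8020` followed later by `STA $8021` would blink the LED once, whatever is in the accumulator.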

I fixed the page-crossing shortcuts mentioned previously and my crude speed test is coming in at 111 MHz. If you do give it a try, let me know whether it worked, or about any problems that you encountered.

thanks!

Daryl

dp11 · Post by **dp11** » Sat Feb 24, 2018 9:03 am

Can we see your source? I don't think turning off the cache would give a performance increase, so I suspect there is a configuration issue.

8BIT · Post by **8BIT** » Sat Feb 24, 2018 4:25 pm

Yes, the source will come soon; it's still a mess in terms of readability. You are right, the speed boost came when I turned on the MMU. I already had the caches turned on, but without the MMU, the caching does not work well. I only saw a few MHz difference between caches on and off without the MMU enabled.

Daryl

8BIT · Post by **8BIT** » Sat Feb 24, 2018 7:14 pm

I have cleaned up the sources. There are still ways to add efficiency, but this is where I am at currently.

I am using the latest ARM GNU toolchain on a Windows 10 PC, downloaded from here:
https://developer.arm.com/open-source/g ... /downloads

The included batch file will compile the sources as long as the GNU tools are in the environment path.

To dp11, I was wrong about disabling the data cache. While cleaning up the bios.s file, I discovered I had some duplicate code which had already enabled the data cache, so the code I changed was not actually disabling it. I cannot explain the slowdown I saw. I have removed the duplication and am still in the 112 MHz range. Chalk it up to my ARM inexperience

... moving forward.

My next goal is to start looking at video generation on the PI.

Enjoy!

Daryl

hoglet · Post by **hoglet** » Sat Feb 24, 2018 9:10 pm

8BIT wrote:

Yes, the source will come soon; it's still a mess in terms of readability. You are right, the speed boost came when I turned on the MMU. I already had the caches turned on, but without the MMU, the caching does not work well. I only saw a few MHz difference between caches on and off without the MMU enabled.

Our experience was that the ARM data cache doesn't work unless the MMU is enabled and the TLBs/page tables are correctly configured to mark the pages as cacheable.

i.e. just setting the L1 data cache enable bit in the ARM system control register is not sufficient.

The code for doing that is here:
https://github.com/hoglet67/PiTubeDirec ... che.c#L132

It took us about 3 months to get this working across all the different Pi Models.
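
To give a feel for what "marking the pages as cacheable" involves, here is a minimal sketch that only builds an identity-mapped table of ARMv6 1 MB section descriptors. It is not the PiTubeDirect code linked above. The RAM size passed in, the use of domain 0, and the choice of attributes are assumptions for illustration, and the Pi 1 peripheral window is taken to be at 0x20000000. Loading the table base register, setting up domain access, and finally enabling the MMU and caches through CP15 are not shown.

```c
#include <stdint.h>

/* ARMv6 first-level "section" descriptor fields (1 MB per entry). */
#define SECTION     0x2u         /* bits [1:0] = 0b10 selects a section entry */
#define SECT_B      (1u << 2)    /* bufferable */
#define SECT_C      (1u << 3)    /* cacheable */
#define SECT_AP_RW  (3u << 10)   /* AP = 0b11: full read/write access */

/* Assumed Pi 1 layout: 16 MB of peripherals starting at 0x20000000. */
#define PERIPH_FIRST_MB 0x200u
#define PERIPH_LAST_MB  0x20Fu

/* Fill a 4096-entry identity map; ram_mb is the assumed RAM size in MB. */
void build_section_table(uint32_t table[4096], uint32_t ram_mb)
{
    for (uint32_t mb = 0; mb < 4096; mb++) {
        uint32_t entry = (mb << 20) | SECT_AP_RW | SECTION;  /* domain 0 */
        if (mb < ram_mb)
            entry |= SECT_C | SECT_B;   /* normal RAM: write-back cacheable */
        else if (mb >= PERIPH_FIRST_MB && mb <= PERIPH_LAST_MB)
            entry |= SECT_B;            /* device memory: never cached */
        /* everything else stays strongly ordered (C = 0, B = 0) */
        table[mb] = entry;
    }
}
```

On real hardware the table itself has to be 16 KB aligned, and the part of RAM handed to the GPU may need different treatment, which is one reason the per-model details take so long to get right.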

8BIT · Post by **8BIT** » Sat Feb 24, 2018 10:07 pm

Thanks for the code example... I'll take a look more closely and see if I need to adjust any of my code.

For now, this is just PI v.1

Daryl

hoglet · Post by **hoglet** » Sun Feb 25, 2018 8:10 am

8BIT wrote:

For now, this is just PI v.1

This is an earlier, simpler, version that just supports the Pi One:
https://github.com/hoglet67/PiTubeClien ... lib.c#L140

Dave

dp11 · Post by **dp11** » Sun Feb 25, 2018 10:03 am

Can you publish the speed test program you are using?

8BIT · Post by **8BIT** » Sun Feb 25, 2018 5:10 pm

Here's my timing test. This is a very crude test.

Code:

        *= $1000

        LDA  #$14       ; 2 - 20 million cycles
L1      JSR  C1M        ; 6
        LDX  $00        ; 3
        DEA             ; 2
        BNE  L1         ; 2/3
        RTS             ; 6
C1M     LDY  #$F8       ; 2 - 1 million cycles
L4      JSR  C1000      ; 6
        JSR  C1000      ; 6
        JSR  C1000      ; 6
        JSR  C1000      ; 6
        LDX  $10        ; 3
        DEY             ; 2
        BNE  L4         ; 2/3
        LDX  #$02       ; 2
L6      JSR  C10        ; 6
        DEX             ; 2
        BNE  L6         ; 2/3
        RTS             ; 6
C1000   LDX  #$2F       ; 2 - 1000 cycles
L7      JSR  C10        ; 6
        DEX             ; 2
        BNE  L7         ; 2/3
        NOP             ; 2
        NOP             ; 2
        NOP             ; 2
        RTS             ; 6
C10     NOP             ; 2 - 10 cycles
        NOP             ; 2
        RTS             ; 6

Sorry the formatting is off. This code executes exactly 20,000,000 cycles. To get more accuracy, I call it 100 times using this code:

Code:

$800    LDA  #$64
$802    PHA
        JSR  $1000
        PLA
        DEA
        BNE  $802
        RTS

I then time the execution and divide the total cycles (2,000,000,000) by the time (~18 seconds) to get 111 MHz.

This is crude for many reasons. I am only using a small portion of the 65C02 opcodes, so I am not testing every case. Also, the code at $800 adds ~2300 cycles to the total execution, but I don't count that (I'd rather err on the slow side than the fast side in my estimated speed). Also, timing with a clock is not the most accurate, but it gets me a ballpark answer. Reading and writing to IO will also slow down the execution, so if I were to cycle the LED within a loop, the apparent clock speed would most likely decrease.
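
For anyone who wants to check the arithmetic, the cycle annotations in the timing test can be tallied mechanically. The sketch below is only a bookkeeping aid, not part of the simulator. It assumes the usual 65C02 timings (JSR and RTS 6, NOP 2, immediate load 2, zero-page load 3, DEA/DEX/DEY 2, BNE 3 when taken and 2 when not) and that no branch crosses a page boundary. All function names are made up for the example.

```c
#include <stdint.h>

enum { JSR = 6, RTS = 6, NOP = 2, LD_IMM = 2, LD_ZP = 3, DEC_REG = 2 };

/* Cycles for a count-down loop that runs n times. body covers everything in
   the loop except the closing BNE, which costs 3 when taken and 2 on the
   final pass. */
static uint64_t counted_loop(uint64_t n, uint64_t body)
{
    return n * (body + 3) - 1;
}

/* Each *_cycles() covers a routine from its entry label through its RTS,
   not counting the JSR that called it. */
uint64_t c10_cycles(void) { return NOP + NOP + RTS; }

uint64_t c1000_cycles(void)
{
    return LD_IMM + counted_loop(0x2F, JSR + c10_cycles() + DEC_REG)
         + 3 * NOP + RTS;
}

uint64_t c1m_cycles(void)
{
    return LD_IMM
         + counted_loop(0xF8, 4 * (JSR + c1000_cycles()) + LD_ZP + DEC_REG)
         + LD_IMM + counted_loop(2, JSR + c10_cycles() + DEC_REG)
         + RTS;
}

uint64_t test_cycles(void)
{
    return LD_IMM
         + counted_loop(0x14, JSR + c1m_cycles() + LD_ZP + DEC_REG)
         + RTS;
}

/* Effective simulated clock in MHz: cycles executed over elapsed seconds. */
double effective_mhz(uint64_t cycles, double seconds)
{
    return (double)cycles / seconds / 1e6;
}
```

Under those assumptions `c10_cycles()` is 10, `c1000_cycles()` is 1000, each pass of the outer loop is 1,000,000 cycles while the branch is taken, and `test_cycles()` comes to 20,000,007 once the setup `LDA` and the final `RTS` are included. `effective_mhz(2000000000, 18.0)` gives roughly 111.1.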

I had originally thought of doing cycle counting and then use a system timer to throttle the execution to a fixed number. That would make a more stable platform for emulation of a "system", but at this stage I don't see a need.
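
A minimal sketch of what that throttling idea could look like, written in C for readability rather than taken from the simulator. The struct and function names are hypothetical. It assumes the core keeps a running cycle count and that a free-running microsecond counter is available (the BCM2835 system timer counts at 1 MHz, which would fit).

```c
#include <stdint.h>

/* Microseconds that `cycles` simulated cycles should take at `target_hz`.
   Note: the multiplication overflows for very large counts, so a real
   implementation would rebase the counters periodically. */
uint64_t cycles_to_us(uint64_t cycles, uint32_t target_hz)
{
    return cycles * 1000000ull / target_hz;
}

/* Hypothetical throttle state. */
struct throttle {
    uint64_t start_us;   /* timer value when counting began */
    uint64_t cycles;     /* simulated cycles executed since then */
    uint32_t target_hz;  /* clock speed to imitate */
};

/* How long to stall, in microseconds, so that simulated time does not run
   ahead of real time. Returns 0 if the simulation is already behind. */
uint64_t throttle_wait_us(const struct throttle *t, uint64_t now_us)
{
    uint64_t due = t->start_us + cycles_to_us(t->cycles, t->target_hz);
    return due > now_us ? due - now_us : 0;
}
```

The core would add each instruction's cycle count to `cycles` and call `throttle_wait_us` every few thousand instructions rather than after every opcode, to keep the overhead small.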

Daryl

dp11 · Post by **dp11** » Sun Feb 25, 2018 6:51 pm

Thanks for that. It does explain the speed difference from what I would estimate from the code.

Don't worry about testing every instruction. Less than half are typically used. Only about 50 instructions will make up 90% of the instructions executed for some programs. You may well find that just a few, say 20, instructions make up over 50% of the execution time. As an example, in an adventure game SEI and NOP will hardly be used.
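
If you want to measure this for a particular program, a per-opcode counter in the fetch loop is enough. The following is a sketch with hypothetical names (`profile_opcode`, `opcodes_covering`), not code from any existing simulator. It counts instructions executed, not cycles.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

static uint64_t opcode_hits[256];   /* bumped once per opcode fetch */

void profile_opcode(uint8_t opcode) { opcode_hits[opcode]++; }

/* qsort comparator: larger counts first. */
static int by_count_desc(const void *a, const void *b)
{
    uint64_t x = *(const uint64_t *)a, y = *(const uint64_t *)b;
    return (x < y) - (x > y);
}

/* Smallest number of distinct opcodes that account for at least `percent`
   of all fetches recorded so far (0 if nothing has been recorded). */
int opcodes_covering(unsigned percent)
{
    uint64_t sorted[256], total = 0, running = 0;

    memcpy(sorted, opcode_hits, sizeof sorted);
    qsort(sorted, 256, sizeof sorted[0], by_count_desc);
    for (int i = 0; i < 256; i++)
        total += sorted[i];
    if (total == 0)
        return 0;
    for (int i = 0; i < 256; i++) {
        running += sorted[i];
        if (running * 100 >= total * percent)
            return i + 1;
    }
    return 256;
}
```

Calling `opcodes_covering(90)` after a run gives the number of opcodes worth optimising first for that workload.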

8BIT · Post by **8BIT** » Mon Apr 30, 2018 8:36 pm

I have kind of gotten stalled on this project. I have had too many non-6502 related issues taking a lot of my spare time for one. I was able to set up a video display, but am trying to figure out how to interface a 640x480x16bit display memory into a 64k memory map. My first thought is to jump to a 65816 simulator that can address 16MB of memory.
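
For scale, 640 x 480 at 16 bits per pixel is 614,400 bytes, so it cannot sit in a 64k map directly. One common way around that, short of moving to the 65816, is a bank-switched window. The sketch below is only one possible arrangement: the window address ($A000), its 8 KB size, and all the names are assumptions picked for the example.

```c
#include <stdint.h>

#define FB_BYTES    (640u * 480u * 2u)  /* 614,400 bytes at 16 bits per pixel */
#define WINDOW_BASE 0xA000u             /* hypothetical window address */
#define WINDOW_SIZE 0x2000u             /* hypothetical 8 KB window */
#define BANK_COUNT  ((FB_BYTES + WINDOW_SIZE - 1) / WINDOW_SIZE)

static uint8_t framebuffer[FB_BYTES];   /* stand-in for the real display memory */
static uint8_t fb_bank;                 /* currently selected bank */

/* Would be driven by a write to a (hypothetical) bank register. */
void fb_select_bank(uint8_t bank)
{
    if (bank < BANK_COUNT)
        fb_bank = bank;
}

/* Translate a 65C02 address inside the window to a framebuffer offset.
   Returns -1 for addresses outside the window or past the end. */
int32_t fb_offset(uint16_t addr)
{
    if (addr < WINDOW_BASE || addr >= WINDOW_BASE + WINDOW_SIZE)
        return -1;
    uint32_t off = (uint32_t)fb_bank * WINDOW_SIZE + (addr - WINDOW_BASE);
    return off < FB_BYTES ? (int32_t)off : -1;
}

void fb_write(uint16_t addr, uint8_t value)
{
    int32_t off = fb_offset(addr);
    if (off >= 0)
        framebuffer[off] = value;
}
```

With these numbers the display needs 75 banks, and each pixel takes two byte writes, so filling the screen from 65C02 code is slow. That trade-off is part of why a 16 MB address space looks attractive.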

Also, I had wanted to take advantage of the other on-board IO, such as USB, Ethernet, and sound. In reading the hardware reference, it appears most of the IO is interfaced through a USB controller on chip. All of the controller firmware I have found so far uses C code. I have not found an assembly source yet. I am trying to decide if I want to undertake an assembly driver for the USB controller, move to a C-based 65816, or just stop here.

I'm in no hurry, so will continue to ponder the choices and read more documentation.

Daryl

whartung · Post by **whartung** » Mon Apr 30, 2018 8:42 pm

8BIT wrote:

Also, I had wanted to take advantage of the other on-board IO, such as USB, Ethernet, and sound. In reading the hardware reference, it appears most of the IO is interfaced through a USB controller on chip. All of the controller firmware I have found so far uses C code. I have not found an assembly source yet.

If you're motivated, you can compile the C code to assembly and "drop it in" as an assembly module, perhaps with some tweaking to get the interfacing right. Might be a good middle ground without having to completely reinvent the wheel, even if the generated code is essentially opaque (I don't know if you can compile the C code with the source code as comments to the assembly or not).

Tor · Post by **Tor** » Tue May 01, 2018 8:03 am

whartung wrote:

(I don't know if you can compile the C code with the source code as comments to the assembly or not).

That should in principle be easy with gcc: gcc -fverbose-asm -S source.c
That works fine on Intel. But not with the arm version of gcc on my pi3, for some reason. But you can match source lines to code with objdump if you compile with the -g flag and then use objdump on the .o file:

Code:

$ cat pr.c
int test (int a)
{
        int b;
        int c;
        b = a + 2;
        c = b + 3;
        return (c);
}

$ gcc -c -g pr.c
$ objdump -d -S pr.o

pr.o:     file format elf32-littlearm

Disassembly of section .text:

00000000 <test>:
int test (int a)
{
   0:   e52db004        push    {fp}            ; (str fp, [sp, #-4]!)
   4:   e28db000        add     fp, sp, #0
   8:   e24dd014        sub     sp, sp, #20
   c:   e50b0010        str     r0, [fp, #-16]
        int b;
        int c;
        b = a + 2;
  10:   e51b3010        ldr     r3, [fp, #-16]
  14:   e2833002        add     r3, r3, #2
  18:   e50b3008        str     r3, [fp, #-8]
        c = b + 3;
  1c:   e51b3008        ldr     r3, [fp, #-8]
  20:   e2833003        add     r3, r3, #3
  24:   e50b300c        str     r3, [fp, #-12]
        return (c); 
  28:   e51b300c        ldr     r3, [fp, #-12]
}
  2c:   e1a00003        mov     r0, r3
  30:   e24bd000        sub     sp, fp, #0
  34:   e49db004        pop     {fp}            ; (ldr fp, [sp], #4)
  38:   e12fff1e        bx      lr
pi@

Klaus2m5 · Post by **Klaus2m5** » Tue May 01, 2018 1:33 pm

Have you considered migrating to the newer multicore versions of the Raspberry Pi? On these you could run the bare metal code on just one of the cores while the rest of the cores run a standard Linux image. That way you do not need to recreate all the device drivers. You just need to provide an interface to the Linux device tree through a shared part of RAM. A screen buffer could also be part of the shared RAM area and would be displayed in a window of the Linux GUI.
