6502.org Forum

All times are UTC

Posted: Sun Dec 20, 2015 7:50 pm

Joined: Sun Jun 29, 2014 5:42 am
Posts: 337
BigEd wrote:
I don't know of a previous effort, or of source. Or even who wrote it! I do see a hint that it might not be 32-bit clean, so maybe watch out for that.

Yes, it wasn't, but I'm past that hurdle.

There's an initial check-in of the code here:
https://github.com/hoglet67/PiTubeClien ... 5tubeasm.S

I'm testing this running bare metal on a Raspberry Pi, acting as second processor to the Beeb. I've now got the startup banner coming up on the Beeb, so quite a lot is working.

The next hurdle is to add in interrupt handling so I can do some more testing with real programs, like Klaus Dormann's test suites.

I'm wondering if it might actually be possible to use ARM interrupts directly, and in the FETCH_NEXT macro briefly enable and then disable them. The problematic bit is that IRQ is level sensitive, which is not directly supported by the bcm2835.

Dave


Posted: Mon Dec 21, 2015 9:02 pm

Joined: Sun Jun 29, 2014 5:42 am
Posts: 337
If anyone is interested, I've now got this "Acorn 65tube" derived ARM 65C02 emulation running successfully.

In my environment, it's about 70% faster than using lib6502.

And, after a bit of bug fixing, it passes the 6502 and 65C02 Dormann tests.

The latest code is here.

The limiting factor in speed is now:
- checking for IRQ/NMI/RST every N instructions (N=16 at the moment, which gives ~2us interrupt latency)
- checking all memory accesses against the range FEF8-FEFF for special handling of the tube registers.
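
As a rough illustration of those two per-instruction costs (a sketch in C, not the 65tube assembler itself): because FEF8-FEFF is an aligned block of eight addresses, the range test can be a single mask and compare, and the interrupt check can hang off an instruction counter. The names `is_tube_register`, `interrupt_poll_due`, `TUBE_BASE`, `TUBE_MASK` and `POLL_INTERVAL` are made up for this example.

```c
#include <stdbool.h>
#include <stdint.h>

#define TUBE_BASE     0xFEF8u /* first tube register address (from the post) */
#define TUBE_MASK     0xFFF8u /* drop the low 3 bits: 8 registers, FEF8-FEFF */
#define POLL_INTERVAL 16u     /* N = 16 instructions between interrupt checks */

/* True when addr lies in FEF8-FEFF: one AND and one compare per access. */
static bool is_tube_register(uint16_t addr)
{
    return (addr & TUBE_MASK) == TUBE_BASE;
}

/* Count one executed instruction; returns true on every POLL_INTERVAL-th
   call, which is when an emulator would sample IRQ/NMI/RST. */
static bool interrupt_poll_due(uint32_t *instruction_count)
{
    *instruction_count += 1;
    return (*instruction_count % POLL_INTERVAL) == 0;
}
```

Even in this cheap form, the address test still runs on every memory access, which is why it shows up as a limiting factor.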

There's a slightly longer update here:
http://stardot.org.uk/forums/viewtopic. ... 47#p127347

What's left to do is:
- run Bruce Clark's BCD tests
- look for a more efficient pattern for trapping accesses to certain address ranges
- cleaner separation of the core 6502 emulation from my use of it as an emulated Beeb Co Processor

Dave


Last edited by hoglet on Mon Dec 21, 2015 10:24 pm, edited 2 times in total.

Posted: Mon Dec 21, 2015 9:08 pm

Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10792
Location: England
Brilliant! A big step forward for 6502 on ARM!

Edit: now running a great deal faster than lib6502, see the linked stardot thread for more.


Last edited by BigEd on Fri Jan 08, 2016 7:06 pm, edited 1 time in total.

Posted: Mon Dec 21, 2015 11:16 pm

Joined: Mon Dec 21, 2015 11:00 pm
Posts: 4
Location: DK
I'm working on an emulator on an STM32F103 with 256KB Flash and 48KB RAM.

It is written in C. The CPU is just under 900 lines. It uses ~9.5kB of Flash and ~50 bytes of RAM, excluding the emulated memory.

I've assigned a few k to this emulator and assembled the skier and the random demo from 6502asm.org. Both run at about 1.03MHz without graphics.

There is a 3.2" TFT attached to the controller and a quick hack gave me that 32x32 pixel screen, zoomed by 7, on the TFT.
This slows down the emulator to about 440.7kHz.

These speed values were "measured" by letting the CPU run for a while and printing status information, including the tick count, every 1000ms.
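
For anyone wanting to reproduce that kind of measurement, the arithmetic is just emulated cycles divided by elapsed time. A minimal sketch, assuming the emulator keeps a running cycle counter and the host has a millisecond tick; the function name `emulated_hz` and its parameters are invented for this example.

```c
#include <stdint.h>

/* Effective emulated clock in Hz from two samples of the emulated cycle
   counter taken interval_ms milliseconds apart. Returns 0 for a zero
   interval instead of dividing by zero. */
static uint32_t emulated_hz(uint64_t cycles_start, uint64_t cycles_end,
                            uint32_t interval_ms)
{
    if (interval_ms == 0) {
        return 0;
    }
    return (uint32_t)(((cycles_end - cycles_start) * 1000u) / interval_ms);
}
```

With a 1000ms interval the cycle difference is the speed in Hz directly, so 1,030,000 cycles per report corresponds to 1.03MHz.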

I can post more about this soon; if you have questions before then, just ask.

I just registered because I stumbled across this thread while doing the same.


Posted: Tue Dec 22, 2015 5:17 am

Joined: Sun Jun 29, 2014 5:42 am
Posts: 337
BigEd wrote:
Brilliant! A big step forward for 6502 on ARM!

Thanks Ed, I'm sure there is still considerable room for improvement.

I'm not convinced my conversion from 26-bit ARM to 32-bit ARM was optimal.

Here's an example, the CPY instruction, where the value of the V flag needs to be preserved.

The old 26-bit code was:
Code:
// Opcode C0 - CPY #$00
    3cf0:   e4da0001    ldrb    r0, [sl], #1
    3cf4:   e1a0200f    mov r2, pc
    3cf8:   e1580c00    cmp r8, r0, lsl #24
    3cfc:   e022200f    eor r2, r2, pc
    3d00:   e2022201    and r2, r2, #268435456  ; 0x10000000
    3d04:   e132f00f    teqp    r2, pc
// Fetch next
    3d08:   e4da0001    ldrb    r0, [sl], #1
    3d0c:   e08cf300    add pc, ip, r0, lsl #6

The new 32-bit code is:
Code:
// Opcode C0 - CPY #$00
.align I_ALIGN
opcode_C0:
        ldrb    r0, [sl], #1
        mrs     r2, CPSR
        and     r2, r2, #V_FLAG
        cmp     r8, r0, lsl #24
        mrs     r1, CPSR
        bic     r1, r1, #V_FLAG
        orr     r1, r1, r2
        msr     CPSR_flg, r1
        FETCH_NEXT

So two additional instructions.
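
To make the intent of the 32-bit sequence explicit, here is a small C model of it (a sketch only, not part of the emulator): save V, do the compare on left-justified bytes, then merge the saved V back in. It assumes r8 holds Y in its top byte, which is what the `lsl #24` on the operand suggests. `arm_cmp_flags` and `cpy_flags` are names invented here; the flag constants use the ARM NZCV positions in bits 31-28.

```c
#include <stdint.h>

#define FLAG_N 0x80000000u
#define FLAG_Z 0x40000000u
#define FLAG_C 0x20000000u
#define FLAG_V 0x10000000u

/* Model of ARM "cmp a, b": the fresh NZCV flags for a - b. */
static uint32_t arm_cmp_flags(uint32_t a, uint32_t b)
{
    uint32_t r = a - b;
    uint32_t f = 0;
    if (r & 0x80000000u) f |= FLAG_N;
    if (r == 0)          f |= FLAG_Z;
    if (a >= b)          f |= FLAG_C; /* ARM carry means "no borrow" */
    if (((a ^ b) & (a ^ r)) & 0x80000000u) f |= FLAG_V; /* signed overflow */
    return f;
}

/* CPY #imm: N, Z and C come from the compare, V keeps its old value. */
static uint32_t cpy_flags(uint32_t old_flags, uint8_t y, uint8_t operand)
{
    uint32_t saved_v = old_flags & FLAG_V;                     /* mrs + and */
    uint32_t fresh = arm_cmp_flags((uint32_t)y << 24,
                                   (uint32_t)operand << 24);   /* cmp       */
    return (fresh & ~FLAG_V) | saved_v;                        /* bic + orr */
}
```

Comparing Y=$80 with $01 is a case where the ARM compare itself sets V, so the merge step really is needed.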

I must admit, I struggled to understand how the original code was working. It only makes sense if TEQP r2, pc toggles the appropriate flag bit. I couldn't find a good description of this behaviour in any documentation though.

Can anyone come up with a more efficient pattern?

Dave


Posted: Tue Dec 22, 2015 7:40 am

Joined: Tue Nov 16, 2010 8:00 am
Posts: 2353
Location: Gouda, The Netherlands
According to my copy of the ARM ARM (DDI 0100B) the TEQP performs an EOR operation between the operands, and then overwrites the flag bits in the PC with the result, so it would toggle the V bit if it was wrong.
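
The toggle can be checked with a little XOR arithmetic. The sketch below models only the combined PC+PSR word as a plain integer; `restore_v_by_toggle` and `PSR_V` are names made up for the example, and the comments map each step onto the 26-bit listing above on the assumption that V sits in bit 28 of R15.

```c
#include <stdint.h>

#define PSR_V 0x10000000u /* V is bit 28 of R15 in 26-bit mode */

/* before: PC+PSR word sampled ahead of the compare   (mov r2, pc)
   after:  PC+PSR word as it stands after the compare
   Returns the word whose flag bits TEQP writes back. */
static uint32_t restore_v_by_toggle(uint32_t before, uint32_t after)
{
    /* eor r2, r2, pc ; and r2, r2, #0x10000000
       diff is PSR_V if the compare changed V, otherwise 0. */
    uint32_t diff = (before ^ after) & PSR_V;

    /* teqp r2, pc
       XOR with the current word flips V back only when it changed and
       leaves every other bit as the compare left it. */
    return after ^ diff;
}
```

The AND also throws away the PC bits, which differ between the two reads because the program counter has moved on.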


Posted: Tue Dec 22, 2015 9:18 am

Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10792
Location: England
According to http://www.heyrick.co.uk/armwiki/TEQ, the V flag is not affected... and that would also be true if Acorn's code is correct, which it almost certainly is. The Peter Cockerell book at http://www.peter-cockerell.net/aalp/ doesn't mention V as a special case. I wonder if this instruction changed...

This is perhaps something to be checked using VisualARM.

Edit: this is the thumb2 code from a6502:
Code:
.macro do_compare addressmode reg
   @ somewhat like addsub, but ARM CMP sets V flag whereas 6502 CMP does not
   load_operand_and_ea \addressmode r0 r1   @ in this case we don't need the effective address
   mov   r0, r0, asl #24   @ we left-justify to get N correct
   mov   r1, \reg, asl #24
   setflags_pre
   cmp   r1, r0      @ this has set N and Z, not V (or C)
   MRS   r0, APSR   @ these manipulations feel pedestrian - could probably do better
   bic   r0, #armVflag
   tst   rPbyteHi, #armVflag
   it ne
   orrne   r0, #armVflag
   mov   rPbyteHi, r0
.endm


Last edited by BigEd on Tue Dec 22, 2015 9:21 am, edited 1 time in total.

Posted: Tue Dec 22, 2015 9:19 am

Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10792
Location: England
stecdose wrote:
I just registered because I stumbled across this thread while doing the same.

Welcome! Feel free to introduce yourself over at the Introduce Yourself thread.


Posted: Tue Dec 22, 2015 9:35 am

Joined: Tue Nov 16, 2010 8:00 am
Posts: 2353
Location: Gouda, The Netherlands
The TEQ instruction does not affect the V flag, but that does not matter here. In this case, the flag bits are copied directly from the ALU output, which is equal to EOR of the two operands (since the TEQ is implemented by EOR).
[Attachment: arm.jpg]

See also http://morrow.ece.wisc.edu/ECE353/arm_r ... rm_arm.pdf chapter A8.3.
Here's a picture of R15 (PC/flags register).
[Attachment: reg15.png]


Posted: Wed May 18, 2016 7:16 pm

Joined: Sun Oct 14, 2012 7:30 pm
Posts: 107
Well, after a year of hiatus I am back working on my CBM disk drive emulator. I have switched to a dsPIC33EP512GM304, which gives me a 25MHz SPI bus - much better for the SD reads/writes. Since I don't need any more than a few MHz in performance (I am at nearly 16MHz 6502 speeds now), I decided to make the instruction fetch an interrupt. The peripherals are also using interrupts now. This sure makes it nice because the main loop really now just handles the OLED, SD control, and UI. Everything is perfectly in sync, adjusting the interrupt clock by the instruction cycle length (taking into account crossing page boundaries and other nuances). I also dropped the CPU speed from 70MIPS to 64MIPS because that made everything an even multiple of 2 for the clock dividers. I should have made this thing interrupt driven before, it sure makes handling non-emulation stuff a breeze. If you aren't looking for the absolute fastest emulation, and need it cycle exact, I would highly recommend looking at making the instruction fetch/6502 emulation as an interrupt routine.


Posted: Thu May 19, 2016 8:12 am

Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10792
Location: England
Pretty impressive to get 16MHz emulation from a 128MHz CPU! Anything better than a 10x penalty for emulation is good going. It's interesting that even the overhead of taking and returning from interrupts isn't hurting performance so much - I would have expected to have to make every host CPU cycle count. Especially if each instruction has to tweak the time of the next interrupt according to emulated cycle count.


Posted: Thu May 19, 2016 3:06 pm

Joined: Sun Oct 14, 2012 7:30 pm
Posts: 107
I *was* getting almost 16MHz average speed @70MIPS (140MHz) without the instruction fetch/emulation being interrupt based. I am not sure what the max speed would be now, but using the interrupt to fetch the next instruction and emulate it is not the way to go if you want the fastest emulation. The interrupt overhead adds about 8 PIC CPU cycles (64 cycles per 1us), so at least a 12.5% penalty. The reason for using the interrupt is to make it cycle exact. There is an instruction clock that I set up on a timer that triggers the interrupt that fetches the next instruction. As part of the instruction decoding, the timer value is updated. I only need a 1MHz (or 2MHz, depending on the drive) CPU, so I have plenty of time left over.

I don't have a flat 64K model, so I have to maintain a logical/physical address. I actually do that by always using the translated address and changing it only on any jmp/jsr/rts/int. So, it's pretty quick, but still nowhere near as fast as it could be if no translation was required. With CBM drives, the translation is required because the peripherals are not fully decoded so they appear mirrored in several memory locations. As part of the translation I decode any of the mirrors to the physical address that is used in the PIC's memory to store that peripheral's data.
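
A sketch of what folding mirrors onto one physical address can look like in C. The address map here is purely hypothetical: the select mask, select bits and base are invented for the example and are not taken from this post or from any particular drive, and `fold_mirror` is likewise a made-up name.

```c
#include <stdint.h>

/* HYPOTHETICAL decode: a 16-register peripheral responds whenever
   (addr & SELECT_MASK) == SELECT_BITS, so the address lines that are
   not in the mask create mirrors of it across the memory map. */
#define SELECT_MASK    0x9C00u
#define SELECT_BITS    0x1800u
#define REG_MASK       0x000Fu
#define CANONICAL_BASE 0x1800u

/* Map any mirror of the peripheral to its canonical address; all other
   addresses pass through unchanged. */
static uint16_t fold_mirror(uint16_t addr)
{
    if ((addr & SELECT_MASK) == SELECT_BITS) {
        return (uint16_t)(CANONICAL_BASE | (addr & REG_MASK));
    }
    return addr;
}
```

Doing this only when the program counter is reloaded (jmp/jsr/rts/interrupt), as described above, keeps the cost out of the sequential instruction fetch path.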

I might look at doing this on a high speed STM32 (in assembly) to see how fast it could really go. One tip I can add is this... for instructions that do not affect a lot of flags, do a simple bit clear/set for the 6502 flags. All other instructions you are better off using a lookup table to translate the flags... a load/AND/OR (especially if you can store while OR'ing).
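
One possible reading of that tip, sketched in C. The post does not say what the table is indexed by; this version assumes it is indexed by the 8-bit result and yields the 6502 N and Z bits, so treat the layout as an assumption. `nz_table`, `init_nz_table`, `update_nz`, `set_carry` and `clear_carry` are illustrative names.

```c
#include <stdint.h>

#define P_N 0x80u /* 6502 negative flag */
#define P_Z 0x02u /* 6502 zero flag */
#define P_C 0x01u /* 6502 carry flag */

static uint8_t nz_table[256];

/* Fill the table once at startup: entry v holds the N and Z bits that a
   result of v should produce. */
static void init_nz_table(void)
{
    for (unsigned v = 0; v < 256; v++) {
        nz_table[v] = (uint8_t)((v & P_N) | (v == 0 ? P_Z : 0u));
    }
}

/* The load/AND/OR pattern: clear N and Z in P, then OR in the table entry. */
static uint8_t update_nz(uint8_t p, uint8_t result)
{
    return (uint8_t)((p & (uint8_t)~(P_N | P_Z)) | nz_table[result]);
}

/* Instructions that touch a single flag (SEC, CLC) skip the table and
   just set or clear the bit. */
static uint8_t set_carry(uint8_t p)   { return (uint8_t)(p | P_C); }
static uint8_t clear_carry(uint8_t p) { return (uint8_t)(p & (uint8_t)~P_C); }
```

The table costs 256 bytes and replaces two conditional tests with one indexed load.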


Last edited by JimDrew on Thu May 19, 2016 3:13 pm, edited 1 time in total.

Posted: Thu May 19, 2016 3:13 pm

Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10792
Location: England
Ah, I see, two slightly different goals there. It does make good sense - if you have the horsepower - to use interrupts to ensure cycle accuracy. Although, I imagine you can only make each instruction run for the right length of time - presumably not make each memory access occur in exactly the right cycle?


Posted: Thu May 19, 2016 3:20 pm

Joined: Sun Oct 14, 2012 7:30 pm
Posts: 107
Correct, I don't have to worry about memory access timing, except between the CPU and two VIAs. I have a check in the VIA to adjust the read/write of the timers.


Posted: Sat Jun 18, 2016 12:59 pm

Joined: Sun Jun 29, 2014 5:42 am
Posts: 337
Hi Guys,

Just a quick update to say I've got some updated performance figures for the ARM 65C02 emulation code a few of us have been working on over on the StarDot forums.

The headline figure is now 225MHz on a Pi Zero (1000MHz ARM clock / 250MHz core clock)
[Attachment: IMG_0474.JPG]

There is a historic summary of the work in a new thread here:
http://www.stardot.org.uk/forums/viewto ... 30&t=11328

The code is in git:
https://github.com/hoglet67/PiTubeClien ... 5tubeasm.S

Dave

