All times are UTC

Posted: Sat Oct 23, 2021 3:42 pm

Joined: Sun Jun 29, 2014 5:42 am
Posts: 352
BigEd wrote:
On interrupts, yes, the microcontroller can take an interrupt and perhaps do something useful in the ISR, but to communicate that to the emulated 6502, the 6502 probably needs to take an interrupt to go into its ISR to do what it needs to do, and this has to happen on an emulated instruction boundary - and preferably at low cost.

Possibly "needs to" is a bit strong here.

Dominic's new JIT-6502 Co Processor in PiTubeDirect, for example, will happily allow a 6502 instruction to be interrupted mid-execution, and this hasn't yet caused any incompatibilities. It works, I think, because an interrupt service routine rarely cares exactly where the main program is at. As long as the 6502 instructions execute correctly when interrupted, that should be sufficient.

I'm sure there will be counter examples, but it's worth considering this as an option.

Dave


Posted: Sat Oct 23, 2021 3:49 pm

Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10985
Location: England
Ah, yes, I had temporarily forgotten this new and clever scheme: is it not, in a sense, that a second 6502 emulation context runs the 6502 ISR code?


Posted: Sat Oct 23, 2021 4:13 pm

Joined: Sun Jun 29, 2014 5:42 am
Posts: 352
BigEd wrote:
Ah, yes, I had temporarily forgotten this new and clever scheme: is it not, in a sense, that a second 6502 emulation context runs the 6502 ISR code?

Each invocation of the 6502 IRQ or NMI handler is a separate emulation context with undefined A/X/Y register values on entry. There is, however, still a shared stack and a single stack pointer register.

It ended up like this through necessity rather than choice, and it's obviously a little different from a real 6502.

Dave


Posted: Sun Oct 24, 2021 5:27 pm

Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10985
Location: England
I've just remembered one of the observations about the Teensy vs the Pico: the Teensy's pinout, in the versions we looked at, doesn't make it easy to drive or sample, say, a 16-bit bus in one go. There's more shifting and masking than would be ideal. That said, with a 600 MHz processor overclockable to 900 MHz, perhaps a few extra instructions aren't a big deal.


Posted: Sun Oct 24, 2021 6:37 pm

Joined: Thu Oct 05, 2017 2:04 am
Posts: 62
Quote:
I've just remembered one of the observations about the Teensy vs the Pico: the Teensy's pinout, in the versions we looked at, doesn't make it easy to drive or sample, say, a 16-bit bus in one go. There's more shifting and masking than would be ideal. That said, with a 600 MHz processor overclockable to 900 MHz, perhaps a few extra instructions aren't a big deal.

Yes, I used this shifting and masking on the MCL65+ to improve the Teensy 4.1's IO timing:

Code:
// -------------------------------------------------
// Drive the 6502 Address pins
// -------------------------------------------------
inline void send_address(uint32_t local_address) {
  register uint32_t writeback_data=0;
 
    writeback_data = (0x6DFFFFF3 & GPIO6_DR);   // Read in current GPIOx register value and clear the bits we intend to update
    writeback_data = writeback_data | (local_address & 0x8000)<<10 ;  // 6502_Address[15]   TEENSY_PIN23   GPIO6_DR[25]
    writeback_data = writeback_data | (local_address & 0x2000)>>10 ;  // 6502_Address[13]   TEENSY_PIN0    GPIO6_DR[3]
    writeback_data = writeback_data | (local_address & 0x1000)>>10 ;  // 6502_Address[12]   TEENSY_PIN1    GPIO6_DR[2]
    writeback_data = writeback_data | (local_address & 0x0002)<<27 ;  // 6502_Address[1]    TEENSY_PIN38   GPIO6_DR[28]
    GPIO6_DR       = writeback_data | (local_address & 0x0001)<<31 ;  // 6502_Address[0]    TEENSY_PIN27   GPIO6_DR[31]
   
    writeback_data = (0xCFF3EFFF & GPIO7_DR);   // Read in current GPIOx register value and clear the bits we intend to update
    writeback_data = writeback_data | (local_address & 0x0400)<<2  ;  // 6502_Address[10]   TEENSY_PIN32   GPIO7_DR[12]
    writeback_data = writeback_data | (local_address & 0x0200)<<20 ;  // 6502_Address[9]    TEENSY_PIN34   GPIO7_DR[29]
    writeback_data = writeback_data | (local_address & 0x0080)<<21 ;  // 6502_Address[7]    TEENSY_PIN35   GPIO7_DR[28]
    writeback_data = writeback_data | (local_address & 0x0020)<<13 ;  // 6502_Address[5]    TEENSY_PIN36   GPIO7_DR[18]
    GPIO7_DR       = writeback_data | (local_address & 0x0008)<<16 ;  // 6502_Address[3]    TEENSY_PIN37   GPIO7_DR[19]
               
    writeback_data = (0xFF3BFFFF & GPIO8_DR);   // Read in current GPIOx register value and clear the bits we intend to update
    writeback_data = writeback_data | (local_address & 0x0100)<<14 ;  // 6502_Address[8]    TEENSY_PIN31   GPIO8_DR[22]
    writeback_data = writeback_data | (local_address & 0x0040)<<17 ;  // 6502_Address[6]    TEENSY_PIN30   GPIO8_DR[23]
    GPIO8_DR       = writeback_data | (local_address & 0x0004)<<16 ;  // 6502_Address[2]    TEENSY_PIN28   GPIO8_DR[18]
   
    writeback_data = (0x7FFFFF6F & GPIO9_DR);   // Read in current GPIOx register value and clear the bits we intend to update
    writeback_data = writeback_data | (local_address & 0x4000)>>10 ;  // 6502_Address[14]   TEENSY_PIN2    GPIO9_DR[4]
    writeback_data = writeback_data | (local_address & 0x0800)>>4  ;  // 6502_Address[11]   TEENSY_PIN33   GPIO9_DR[7]
    GPIO9_DR       = writeback_data | (local_address & 0x0010)<<27 ;  // 6502_Address[4]    TEENSY_PIN29   GPIO9_DR[31]
   
    return;
}
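
The same bit mapping can also be written as a table, which makes it easier to check a mask against its shifts or to retarget the routine to another pinout. The following is only a sketch in plain C for a host machine, not part of the MCL65+ source: `pin_map`, `gpio6_map`, `keep_mask` and `scatter_bits` are hypothetical names, and the bit positions are copied from the GPIO6 comments above. A loop like this is likely slower than the hand-unrolled original unless the compiler unrolls it.

```c
#include <stddef.h>
#include <stdint.h>

/* One entry per address line (hypothetical helper type). */
typedef struct {
    uint8_t addr_bit;  /* bit position in the 16-bit 6502 address */
    uint8_t gpio_bit;  /* bit position in the 32-bit GPIO data register */
} pin_map;

/* GPIO6 mapping, copied from the comments in send_address(). */
const pin_map gpio6_map[] = {
    {15, 25}, {13, 3}, {12, 2}, {1, 28}, {0, 31},
};
const size_t gpio6_map_len = sizeof gpio6_map / sizeof gpio6_map[0];

/* Mask with a 0 at every GPIO bit that the table drives. */
uint32_t keep_mask(const pin_map *map, size_t n)
{
    uint32_t driven = 0;
    for (size_t i = 0; i < n; i++)
        driven |= UINT32_C(1) << map[i].gpio_bit;
    return ~driven;
}

/* Merge the mapped address bits into the current register value. */
uint32_t scatter_bits(uint32_t current, uint32_t address,
                      const pin_map *map, size_t n)
{
    uint32_t out = current & keep_mask(map, n);
    for (size_t i = 0; i < n; i++) {
        uint32_t bit = (address >> map[i].addr_bit) & 1u;
        out |= bit << map[i].gpio_bit;
    }
    return out;
}
```

For the GPIO6 table, `keep_mask` reproduces the constant `0x6DFFFFF3` used in the routine above, so the table and the hand-written mask agree.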


Posted: Mon Oct 25, 2021 5:49 am

Joined: Wed Jul 01, 2020 6:15 pm
Posts: 79
Location: Germany
BigEd wrote:
I've just remembered one of the observations about the Teensy vs the Pico: the Teensy's pinout, in the versions we looked at, doesn't make it easy to drive or sample, say, a 16-bit bus in one go. There's more shifting and masking than would be ideal. That said, with a 600 MHz processor overclockable to 900 MHz, perhaps a few extra instructions aren't a big deal.
MicroCoreLabs wrote:
Yes, I used this shifting and masking on the MCL65+ to improve the Teensy 4.1's IO timing:

Thanks for pointing that out, Ed and Ted. Seeing Ted's code is actually encouraging for me: If one can get away with that, then it should not be a problem to use the Teensy CPU in an accelerator design which is meant to support different emulation targets with different pinouts.

Ted, I looked at your code for the MCL65+ on GitHub and noticed that you even do all the address shuffling after having detected the positive clock edge. For write cycles you then send the data byte as eight separate bits, too. And that still works nicely with the Apple's 1 MHz clock, right?

That is quite reassuring. I would trust that with some optimization (preparing as much as possible before waiting for the clock edge, and using assembler code if required) I should be able to keep up with the faster host clocks used e.g. in many chess computers -- 5 MHz 65C02s were commonly used in the late 1980s.
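
The "prepare before the clock edge" idea can be sketched as follows, in plain C for a host machine. This is an illustration under assumptions, not code from the MCL65+: `port_words`, `prepare_address` and `apply_word` are hypothetical names, the shifts and masks are copied from `send_address()` as quoted earlier in the thread, and the real GPIO registers are replaced by ordinary values so the logic can be checked off-target.

```c
#include <stdint.h>

/* Masks copied from send_address(): 0 at every bit the address drives. */
#define KEEP_GPIO6 0x6DFFFFF3u
#define KEEP_GPIO7 0xCFF3EFFFu
#define KEEP_GPIO8 0xFF3BFFFFu
#define KEEP_GPIO9 0x7FFFFF6Fu

/* Address bits already moved to their GPIO positions (hypothetical type). */
typedef struct {
    uint32_t gpio6, gpio7, gpio8, gpio9;
} port_words;

/* Slow part: can run before the clock edge is detected. */
port_words prepare_address(uint32_t a)
{
    port_words w;
    w.gpio6 = ((a & 0x8000) << 10) | ((a & 0x2000) >> 10) |
              ((a & 0x1000) >> 10) | ((a & 0x0002) << 27) |
              ((a & 0x0001) << 31);
    w.gpio7 = ((a & 0x0400) << 2)  | ((a & 0x0200) << 20) |
              ((a & 0x0080) << 21) | ((a & 0x0020) << 13) |
              ((a & 0x0008) << 16);
    w.gpio8 = ((a & 0x0100) << 14) | ((a & 0x0040) << 17) |
              ((a & 0x0004) << 16);
    w.gpio9 = ((a & 0x4000) >> 10) | ((a & 0x0800) >> 4) |
              ((a & 0x0010) << 27);
    return w;
}

/* Fast part: one AND and one OR per port after the clock edge. */
uint32_t apply_word(uint32_t current, uint32_t keep, uint32_t word)
{
    return (current & keep) | word;
}
```

On the target, the result of `apply_word` would be written to the corresponding data register, for example `GPIO6_DR = apply_word(GPIO6_DR, KEEP_GPIO6, w.gpio6);`, so that only four read-modify-write operations remain after the edge.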


Posted: Mon Oct 25, 2021 6:29 am

Joined: Wed Jul 01, 2020 6:15 pm
Posts: 79
Location: Germany
65f02 wrote:
MicroCoreLabs wrote:
Yes, when not running cycle-accurate the acceleration is for code execution and not for the I/O's.
How fast does the MCL65+ run for purely internal code execution? Say, just a loop with many iterations in BASIC, with some computations but no I/O?

Ted, would you have a data point on the above? When you run a program on the Apple which does internal computation only, at what effective speed does the MCL65+ execute it? Thanks!


Posted: Mon Oct 25, 2021 4:45 pm

Joined: Thu Oct 05, 2017 2:04 am
Posts: 62
Hard to say how much faster the MCL65+ is compared to the stock 6502 at 1 MHz when running in an Apple II, VIC-20, or a C64... Eventually some on-motherboard resource such as video needs to be accessed, which will slow things down, but 6502 code which only accesses mirrored memory inside of the microcontroller runs extremely fast.

Here is a video of the difference running Brian's Theme: https://microcorelabs.wordpress.com/2021/10/08/mcl65-running-brians-theme-on-apple-ii/
And here is another one which is not a 6502 emulation and only uses the Apple's video and keyboard: https://microcorelabs.wordpress.com/2021/09/29/ultimate-apple-ii-accelerator-the-mcl65-fast/


Posted: Mon Oct 25, 2021 6:11 pm

Joined: Wed Jul 01, 2020 6:15 pm
Posts: 79
Location: Germany
Thanks, Ted. Brian's Theme will of course access the video RAM quite a bit. (Although the effect is not too bad. With the 65F02 I found that, when mirroring the video RAM on-chip so that reads from the video RAM are fast, I still get more than 50x acceleration on Brian's Theme, compared to the "raw" acceleration of 100x.)

But if you time a simple FOR/NEXT loop in BASIC (without any output in the loop), there will be hardly any I/O overhead -- besides the interpreter polling the keyboard occasionally to check whether you have typed Ctrl-C, which is negligible. That allows you to measure the "raw" acceleration of the emulated 6502 over the original.

It would be great if you could do that little test; it would help in getting an idea of how fast the Teensy 4 processor is. Thanks!


Last edited by 65f02 on Tue Oct 26, 2021 7:46 am, edited 1 time in total.

Posted: Mon Oct 25, 2021 6:41 pm

Joined: Thu Oct 05, 2017 2:04 am
Posts: 62
Quote:
With the 65F02 I found that, when mirroring the video RAM on-chip so that reads from the video RAM are fast, I still get more than 50x acceleration on Brian's Theme, compared to the "raw" acceleration of 100x


I would be interested to see a video of your 65F02 running Brian's Theme.


Posted: Mon Oct 25, 2021 7:15 pm

Joined: Wed Jul 01, 2020 6:15 pm
Posts: 79
Location: Germany
It's not pleasant to look at, flickering at 3 Hz. ;-)

I need to get the 65F02 configured for the Apple again; it's in a chess computer now. (I split the VHDL into two separate configurations, since the routing paths were getting congested in the logic which monitors the address pattern for the different I/O addresses.) And I will need to figure out how best to share a video; I don't have a YouTube account and don't want one. Not sure when I will get around to that, but over the weekend at the latest.


Posted: Tue Oct 26, 2021 7:33 pm

Joined: Wed Jul 01, 2020 6:15 pm
Posts: 79
Location: Germany
Alright, here is a quick clip of "Brian's Theme", the 1979 Apple II demo, being loaded from disk and run with the 65F02 in the CPU socket. For comparison, here is the original speed on a 1 MHz Apple II (someone else's video).

As stated before, the acceleration factor is >50 even with the graphics output. Apparently drawing an angled line still involves quite a bit of internal computation which gets accelerated. And as stated before, accelerating the reads from video RAM (by caching the RAM on-chip) has a surprisingly large benefit. Thanks again to Ed, by the way, who originally suggested caching in last year's thread on the 65F02!

But the thing I am most proud of is actually that the floppy disk still works with the accelerator on, since the 65F02 automatically finds all of Woz's cycle-counted code to drive the disk and time the nibbles. 8)


Posted: Tue Oct 26, 2021 8:48 pm

Joined: Thu Oct 05, 2017 2:04 am
Posts: 62
I guess when you run at 10 ns per instruction you will get a blazingly fast result! That's quite fast indeed...

If we assume there are at least 10 ARM instructions for every emulated opcode, then even the overclocked Teensy would be no faster than the equivalent of 80 MHz... and it's likely that the C code actually compiles to many more than 10 instructions per opcode...

Mirroring everything except the writes to the video memory ranges and eliminating the original 6502's over-fetches and double-writes helps, but not by a lot.
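
The mirroring scheme described here can be sketched roughly as below. This is an illustration only, not the MCL65+ implementation: `mirror`, `is_video`, `mem_read`, `mem_write` and `bus_write` are hypothetical names, `bus_write` is a stand-in that merely counts slow external cycles, and the video range is assumed to be the Apple II text page 1 (0x0400 to 0x07FF) purely as an example. Real hardware would also have to pass I/O addresses through to the bus, which this sketch ignores.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed example range: Apple II text page 1. */
#define VIDEO_START 0x0400u
#define VIDEO_END   0x07FFu

uint8_t  mirror[65536];   /* on-chip copy of the 6502 address space */
unsigned bus_writes;      /* slow external cycles issued so far */

/* Stand-in for a real bus cycle on the CPU socket. */
void bus_write(uint16_t addr, uint8_t value)
{
    (void)addr;
    (void)value;
    bus_writes++;
}

bool is_video(uint16_t addr)
{
    return addr >= VIDEO_START && addr <= VIDEO_END;
}

/* Reads never leave the chip, including reads from video RAM. */
uint8_t mem_read(uint16_t addr)
{
    return mirror[addr];
}

/* Writes update the mirror; only video writes also go to the motherboard.
   Returns true when a slow external cycle was needed. */
bool mem_write(uint16_t addr, uint8_t value)
{
    mirror[addr] = value;
    if (is_video(addr)) {
        bus_write(addr, value);
        return true;
    }
    return false;
}
```

With this split, a program that only touches non-video memory never waits for the 1 MHz bus, while every video write still costs one external cycle.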


Posted: Tue Oct 26, 2021 9:31 pm

Joined: Wed Jul 01, 2020 6:15 pm
Posts: 79
Location: Germany
I would be more optimistic regarding the performance. If you look at dp11's post earlier in this thread, he mentions an emulated 6502 clock of 1/4 the emulator clock for his Pi Pico (RP2040) emulator.

That's for a tightly optimized emulator where the core parts are written in assembler, running on a not-so-powerful Cortex-M0+. Even neglecting the more efficient architecture of the Teensy's Cortex-M7, that should translate to a 200 MHz 6502 clock on a Teensy overclocked to 800 MHz.

There is probably a bit more overhead involved in emulating all 6502 signals on the DIP socket vs. driving the BBC computer's "tube" interface. (Just guessing here; I have not looked into the "tube" in any detail.) Let's assume that eats up the performance benefit of the M7 vs. M0 cores, then we would still expect an emulated speed of 200 MHz. 8)

A few assumptions in the above... I intend to try and port Dominic's emulator to a Teensy 4.1 board to hopefully get confirmation. That will take a while: There are some nice features in the i.MX RT1062 processor when you program it low-level. (E.g. a very flexible crossbar and basic logic unit so you can hard-wire the output clocks Phi1 and Phi2 to Phi0 and don't need to give them any attention in software.) But that comes at the cost of a 3400 page reference manual for the processor, and a heavyweight programming environment. :roll:
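
The two back-of-envelope estimates in this exchange come down to the same division, written out here as a small C sketch. The function name `emulated_mhz` is made up for illustration, and the figures are the ones quoted in the posts (an 800 MHz host, 10 host instructions per emulated opcode in one estimate, a ratio of 4 in the other). It treats one host instruction as one host clock and ignores that 6502 opcodes take several cycles each, so it is only as good as those assumptions.

```c
/* Effective emulated 6502 clock in MHz for a given host clock and
   a given number of host clocks spent per emulated 6502 clock. */
unsigned emulated_mhz(unsigned host_mhz, unsigned host_clocks_per_6502_clock)
{
    return host_mhz / host_clocks_per_6502_clock;
}
```

`emulated_mhz(800, 10)` gives the pessimistic 80 MHz figure, and `emulated_mhz(800, 4)` gives the optimistic 200 MHz figure.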


Posted: Tue Oct 26, 2021 11:12 pm

Joined: Sat Nov 11, 2017 1:08 pm
Posts: 33
Please do play with my stuff. But I will warn you that it is heavily optimised for the Tube. If you spot any performance improvements, do let me know. Decimal ADC feels like it could lose a few extra cycles, but I haven't put any time into it.

