
All times are UTC




Posted: Thu Oct 21, 2021 11:39 pm
Joined: Thu Oct 05, 2017 2:04 am
Posts: 62
Quote:
I am aware of the MCL65+, and actually have been in touch with Ted recently while he was working on language card support for the Apple II. A very nice implementation on the Teensy 4, although I have to object to Ted's claim that it is the fastest Apple II. :wink: The datapoint I have there is 10x acceleration over the Apple's 1 MHz, which I assume to mean more than 10x for code execution, and no acceleration for I/O. (Using a 600 MHz Cortex M7 in the Teensy, although probably not heavily optimized code yet.)

Yes, when not running in cycle-accurate mode, the acceleration applies to code execution and not to I/O. Otherwise it would not work when dropped into an Apple II+, Commodore 64, or VIC-20... :)


Posted: Fri Oct 22, 2021 5:11 am
Joined: Wed Jul 01, 2020 6:15 pm
Posts: 79
Location: Germany
MicroCoreLabs wrote:
Quote:
The datapoint I have there is 10x acceleration over the Apple's 1 MHz, which I assume to mean more than 10x for code execution, and no acceleration for I/O. (Using a 600 MHz Cortex M7 in the Teensy, although probably not heavily optimized code yet.)
Yes, when not running in cycle-accurate mode, the acceleration applies to code execution and not to I/O.

How fast does the MCL65+ run for purely internal code execution? Say, just a loop with many iterations in BASIC, with some computations but no I/O?

Another quick question: When I re-read your blog, I noticed a mention of the Teensy 4.1 having an "800 MHz ARM A9" CPU. I assume the 800 MHz figure is for overclocked operation vs. the nominal 600 MHz. But is the A9 correct -- shouldn't that be M7?


Posted: Fri Oct 22, 2021 7:18 am
Joined: Wed Jul 01, 2020 6:15 pm
Posts: 79
Location: Germany
dp11 wrote:
My RP2040 6502 core referred to earlier has the aim to be fast but not cycle accurate. On average the 6502 clock rate is 1/4 of the ARM M0 clock rate.

It seems that the Teensy 4.x might be an even better platform for an emulator than the Pi Pico/RP2040? Single-core only (which should not matter?), but an M7 core spec'd at 600 MHz and known to run well at 800 MHz.

The MCU is an NXP i.MX RT1062 which comes in a manageable BGA package (0.8 mm pitch, 196 balls). At around 12€ it is quite a bit more expensive than the RP2040. But in contrast to the Pi Zero SOC the chip alone is available from the major distributors -- at least in principle, semiconductor crisis aside...

I don't know whether the M7 vs. M0 difference brings significant performance gains, or how many changes to the optimized assembly core of the emulator would be needed, but I am tempted to play with that. Or has somebody already done an M7 version of 65tubeasm, by any chance?


Posted: Fri Oct 22, 2021 1:11 pm
Joined: Mon May 12, 2014 6:18 pm
Posts: 365
The naming convention for the ARM chips is pretty confusing. There is an A9, which is one of the Cortex-A cores, as well as ARMv9, which is not a core at all but a revision of the instruction set. To add to the confusion, there is also an A7, and both it and the A9 are ARMv7.

If you look at the M0+ in the RP2040 and the M7, you'll see that they have quite a lot of overlap. As I understand it, the M0/M0+ were meant to compete with 8-bit and low-cost MCUs, whereas the M3/M4 were the high-end parts until the M7 came along not long ago. It seems the M0+ does not implement all of the Thumb instructions, while the M7 does. Here's the list for M0+. I'm not sure how much the differences matter for your purposes. If you compare them, you'll also see that the M7 has much better hardware: branch prediction, a 6-stage pipeline compared to 2 stages on the M0+, and a 64-bit-wide instruction bus. I'd expect anything you write for the M0+ to work on the M7, with much better performance cycle for cycle, though not all M7 assembly will run on the M0+.


Posted: Fri Oct 22, 2021 1:28 pm
Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10800
Location: England
It sounds like this 12€ microcontroller with an M7 might be (architecturally) comparable to one of the fancier Pi models:
    Pi Pico uses a Cortex-M0+ (Arm v6M) with partial Thumb 1 and 2
    Pi Zero, and the original A and B models, use an ARM1176JZF-S (Arm v6) with Thumb
    Pi 3 uses a quad core Cortex A53 (Arm v8)
    Pi 4 uses a quad core Cortex A72 (Arm v8) with Thumb-2


Posted: Fri Oct 22, 2021 2:02 pm
Joined: Mon May 12, 2014 6:18 pm
Posts: 365
Druzyek wrote:
The naming convention for the ARM chips is pretty confusing. There is an A9, which is one of the Cortex-A cores, as well as ARMv9, which is not a core at all but a revision of the instruction set. To add to the confusion, there is also an A7, and both it and the A9 are ARMv7.
The other very confusing thing is that the naming convention for older chips is very similar, such as the ARM7, discontinued in 2001. So, ARM7 was ARMv3, 4, or 5, and the modern A7, A9, and M7 are ARMv7. :shock:

BigEd, that's an interesting point about Pi models. I would add that the Cortex A53 and Cortex A72 both run 32-bit code and support Thumb-2, whereas the ARM1176JZF-S supports the original Thumb but, as far as I know, not Thumb-2. You should be able to run your M7 JIT on the Cortex-A cores. I'm not sure what the performance trade-off is when you run 16-bit Thumb code on a core that supports 32-bit code.


Posted: Fri Oct 22, 2021 2:11 pm
Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10800
Location: England
Oh right, a bit more overlap than I thought, thanks.


Posted: Fri Oct 22, 2021 2:55 pm
Joined: Wed Jul 01, 2020 6:15 pm
Posts: 79
Location: Germany
Thank you both for the additional comments, Druzyek and Ed!

Druzyek wrote:
I'd expect anything you write for M0+ to work on the M7 with much better performance cycle for cycle though not all assembly for M7 to run on M0+.

Yes, that's what I also figured (hoped) after having read a bit more this morning. The instruction sets of the better Mx cores should all be backwards compatible with the smaller/earlier versions. So I should be able to use DP11's RP2040 (i.e. M0+) version of the 6502 emulator directly on the M7, hopefully already getting better performance beyond the effect of the increased clock frequency. At a later point I could look through the assembly code for opportunities to put the extended opcodes of the M7 to good use.

I will still need to cobble an interface from Teensy to 6502 socket together -- which might be as simple as building a copy of Ted's MCL65 board. Also need to add the detection of I/O addresses and the external bus interface to the fast emulator. I won't get to that right away, but am definitely keen to try and see how much improvement the fast emulator core brings over Ted's Arduino C implementation. (Which is written with a focus on clarity rather than speed, I think.)


Posted: Fri Oct 22, 2021 3:35 pm
Joined: Wed Jul 01, 2020 6:15 pm
Posts: 79
Location: Germany
Oh, hang on! I just realized that Dominic's Pi Pico emulator for the M0 (https://github.com/dp111/PicoTube) is based on Ed's A6502, which was written for the M4 (https://github.com/BigEd/a6502).

So which one would you recommend as a starting point for porting to the M7? The M4 version, since it already uses the full Thumb instruction set (which is present in the M4 and M7) and is hence faster? Or the M0 version which, being more recent, might incorporate some tweaks?

Thanks for your advice!
Juergen


Posted: Fri Oct 22, 2021 5:43 pm
Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10800
Location: England
Hmm, I'd forgotten that connection! I'd probably start with Dominic's, but would expect an M7 version should get more performance. The performance-critical aspects are, I think:
- dispatch
- dealing with emulation of interrupts
- detection of special vs ordinary accesses
- doing the work of the opcode
where that last one is probably already quite squeezed. Actually, for the Pi model, especially when reworking for Pi 4 performance, I think Dominic did a lot of work on
- reordering ARM accesses to get best ARM instruction rate - reducing hazards, hiding latency
which makes the Pi model a maze of subtle macros.

(I do hope I'm not speaking out of turn, Dominic!)
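To make the "dispatch" item a bit more concrete, here is a minimal sketch in Python of table-driven dispatch. It is not taken from PicoTube or a6502, which are written in ARM assembly; the class name `Cpu`, the handler names, and the four-opcode subset are assumptions chosen for illustration, and status flags are ignored entirely.

```python
# Minimal sketch of table-driven opcode dispatch (illustrative subset only).
# Flags, cycle counting, and external accesses are deliberately left out.

class Cpu:
    def __init__(self, memory):
        self.mem = memory      # bytearray holding the emulated 64 KB
        self.pc = 0
        self.a = 0
        self.x = 0
        self.running = True

    def fetch(self):
        """Read the byte at PC and advance PC, wrapping at 64 KB."""
        value = self.mem[self.pc]
        self.pc = (self.pc + 1) & 0xFFFF
        return value


def op_lda_imm(cpu):       # 0xA9: LDA #imm
    cpu.a = cpu.fetch()


def op_tax(cpu):           # 0xAA: TAX
    cpu.x = cpu.a


def op_inx(cpu):           # 0xE8: INX
    cpu.x = (cpu.x + 1) & 0xFF


def op_brk(cpu):           # 0x00: BRK, used here only to stop the sketch
    cpu.running = False


def op_unimplemented(cpu):
    raise NotImplementedError("opcode not covered by this sketch")


# One handler per opcode byte; the table lookup replaces a long if/else chain.
DISPATCH = [op_unimplemented] * 256
DISPATCH[0xA9] = op_lda_imm
DISPATCH[0xAA] = op_tax
DISPATCH[0xE8] = op_inx
DISPATCH[0x00] = op_brk


def run(cpu):
    """Main loop: fetch an opcode, then jump through the table."""
    while cpu.running:
        DISPATCH[cpu.fetch()](cpu)


mem = bytearray(0x10000)
mem[0:5] = bytes([0xA9, 0x41, 0xAA, 0xE8, 0x00])   # LDA #$41, TAX, INX, BRK
cpu = Cpu(mem)
run(cpu)
```

On a real ARM core the equivalent of `DISPATCH[cpu.fetch()](cpu)` is the part that gets tuned per core, since the cost of the table load and the indirect branch differs between ARM flavours.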


Posted: Fri Oct 22, 2021 7:46 pm
Joined: Wed Jul 01, 2020 6:15 pm
Posts: 79
Location: Germany
Alright then... Step 1 is done, ordered a Teensy 4.1. I guess things will get slightly more complex from here on. 8)

Step 2 will be to determine which programming environment to use. I am inclined to avoid the Arduino/Teensyduino environment -- for its runtime overhead and, unless I am mistaken, the limitation of assembler code to inline __asm statements. Turns out that using NXP's "MCUXpresso" IDE for bare-metal programming is not quite as easy as one might think, but there is a tutorial here.

I'll download and install MCUXpresso next, hoping it is not an abomination similar to STM's STMCubeIDEWhateverIt'sCalled... Will report how that goes if and when I get around to compiling something!


Posted: Sat Oct 23, 2021 8:39 am
Joined: Thu Mar 12, 2020 10:04 pm
Posts: 690
Location: North Tejas
65f02 wrote:
Yes, I have been thinking of an 8-bit latch or two to work around the RP2040's limited number of GPIO pins, and concluded that the address bits would lend themselves best to that.

But I was hoping to give the new PCB full flexibility to also support processors outside of the 65xx family, using bi-directional drivers on all 40 pins. (Plus solder jumpers for GND and Vcc on selected pins, as required by, say, the 65xx, 68xx, Z80. Not unlike the venerable GODIL board, but with a more powerful core and adhering to the DIP-40 form factor.) In which case I no longer know where the address lines are going to be...

Also, the RP2040 won't be fast enough if >= 100 MHz emulation speed is the goal, I think. But it is such a tempting little chip... Maybe worth a separate project, e.g. as a pure replacement for some obsolete chip without much acceleration in mind?


There is likely to be a market for compatible "clones" of unobtanium processors like

65C802
6309

among others.


Posted: Sat Oct 23, 2021 10:19 am
Joined: Wed Jul 01, 2020 6:15 pm
Posts: 79
Location: Germany
BillG wrote:
65f02 wrote:
[...] I was hoping to give the new PCB full flexibility to also support processors outside of the 65xx family, using bi-directional drivers on all 40 pins. (Plus solder jumpers for GND and Vcc on selected pins, as required by, say, the 65xx, 68xx, Z80.
There is likely to be a market for compatible "clones" of unobtanium processors like
65C802
6309
among others.

I have come to realize that the "full flexibility on the pinout" goal might conflict with the idea of using a microcontroller instead of an FPGA. With an FPGA pins can be mapped pretty freely, just incurring some minor routing delays. But with a microcontroller, driving the external bus would become much slower if I can't rely upon the address and data bus being grouped together on 16-bit or 8-bit ports. While this only affects the slow external bus cycles, it might still get in the way of a timely response to external clock edges or interrupts?
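To illustrate why the pin grouping matters, here is a rough Python sketch of the two cases. The function names and the pin assignment in `SCATTERED_PINS` are made up for this example and do not describe any real board or MCU port layout; the point is only the amount of work per bus access.

```python
# Sketch: reading a 16-bit address from a 32-bit GPIO port value.
# SCATTERED_PINS is a hypothetical pin assignment, not a real board layout.

SCATTERED_PINS = [5, 0, 12, 7, 19, 2, 30, 9, 14, 1, 22, 27, 4, 16, 11, 25]


def read_grouped(port_value, shift=0):
    """Address bus on 16 adjacent port bits: one shift and one mask."""
    return (port_value >> shift) & 0xFFFF


def read_scattered(port_value, pin_map):
    """Address bus on arbitrary pins: one extract-and-place step per bit."""
    address = 0
    for bit, pin in enumerate(pin_map):
        address |= ((port_value >> pin) & 1) << bit
    return address


def scatter(address, pin_map):
    """Demo helper: place the address bits onto the scattered pins."""
    port_value = 0
    for bit, pin in enumerate(pin_map):
        port_value |= ((address >> bit) & 1) << pin
    return port_value


grouped = read_grouped(0xABCD1234, shift=16)
scattered = read_scattered(scatter(0xC0DE, SCATTERED_PINS), SCATTERED_PINS)
```

The grouped read is a constant two operations, while the scattered read needs work proportional to the number of bus bits (or a set of lookup tables) on every external access.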

So I'm inclined to limit the flexibility to 65xx family members where at least the data and address bus remain in the same place: 6502, 65C02 (incl. the WDC version), [6510], Atari Sally, 6800, 65816 and 65802 should qualify, unless I overlooked something. But the 6809 and its siblings, as well as other processor families like Z80 and CDP1802, would remain out of scope.

EDIT: Hang on, the 6510 pinout differs much more from the 6502 than I recalled. Commodore did not leave the data bus and address bus in place, but shuffled everything around. Why oh why?! Hmm, not sure what to conclude. While I personally don't have any history with the C64, so many people have, and the ability to support it would be nice.


Last edited by 65f02 on Sat Oct 23, 2021 10:42 am, edited 1 time in total.

Posted: Sat Oct 23, 2021 10:36 am
Joined: Wed Jul 01, 2020 6:15 pm
Posts: 79
Location: Germany
BigEd wrote:
[...] The performance-critical aspects are, I think:
- dispatch
- dealing with emulation of interrupts
- detection of special vs ordinary accesses
- doing the work of the opcode
[...]

Thanks for pointing out the interrupt and external access hurdles, Ed.

To discriminate external (I/O incl. video RAM) accesses, there is probably no way around comparing every address that shows up, as part of the main emulation loop. How long that comparison takes will depend significantly on the host computer: it could be simply checking a few high address bits, or maybe a 256-byte lookup table to check whether a RAM page is mapped externally. But for computers which can dynamically change the location of their video RAM, things get messy...
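The 256-byte lookup table idea could look roughly like this in Python. The names `EXTERNAL_PAGE`, `map_external` and `is_external` are invented for the sketch, and the two mapped ranges are only an illustrative assumption (a text screen at $0400-$07FF and an I/O area at $C000-$CFFF), not a complete memory map of any particular machine.

```python
# Sketch: one flag per 256-byte page, 1 = access must go to the external bus.
EXTERNAL_PAGE = bytearray(256)


def map_external(start, end):
    """Mark every page touched by the inclusive range start..end as external."""
    for page in range(start >> 8, (end >> 8) + 1):
        EXTERNAL_PAGE[page] = 1


def is_external(address):
    """The per-access check: one shift, one table load, one compare."""
    return EXTERNAL_PAGE[(address >> 8) & 0xFF] == 1


# Assumed example mapping, for illustration only.
map_external(0x0400, 0x07FF)   # text screen
map_external(0xC000, 0xCFFF)   # I/O area
```

A machine that relocates its video RAM at run time would have to call something like `map_external` again (and clear the old pages) whenever the relevant control register is written, which is where it gets messy.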

To deal with interrupts (and other "asynchronous" requests from the external computer, like /RESET and RDY), wouldn't it be best to use actual interrupts on the emulator core? That would avoid the overhead of polling these lines inside the main emulation loop. Latency should not be critical as long as it stays short compared to the external clock period, I think. Or am I overlooking something?

As a side note, I am unsure about terminology. I used to refer to the external computer which the 65F02 sits in as the "host" system, e.g. the Apple II. But it seems that in an emulation context, "host" is often used to designate the computer which runs the interpreter or JIT compiler, e.g. the Cortex M7 based MCU? What is the generally accepted terminology for the two computer systems involved, and for the emulated CPU?


Posted: Sat Oct 23, 2021 3:24 pm
Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10800
Location: England
Good point about terminology - "host" is indeed ambiguous, and I don't think I have an answer for that. Perhaps "platform" for the microcontroller running the emulation??

On interrupts, yes, the microcontroller can take an interrupt and perhaps do something useful in the ISR, but to communicate that to the emulated 6502, the 6502 probably needs to take an interrupt to go into its ISR to do what it needs to do, and this has to happen on an emulated instruction boundary - and preferably at low cost.

Something I thought might be useful, although it turns out to be inadequate for BBC Micro Tube purposes, is for the 6502 to take interrupts only on branches or subroutine calls. These happen moderately often, yet checking only there leaves the emulation cost of most instructions untouched. It should work, but the apparent latency in instructions will be greater. Long stretches of straight-line code, as can sometimes be found in demos, might be a problem. At least the latency in microseconds won't be so long, because our emulator is running fast!
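A toy model of that scheme, in Python, might look like the following. Everything here is an assumption for illustration: the class `BranchPollCpu`, the `platform_isr` callback standing in for the microcontroller's real interrupt handler, and the program being reduced to a list of "alu"/"branch" labels. A real 6502 IRQ is level-sensitive and needs vector fetches and stack pushes, none of which is modelled.

```python
# Sketch: the platform ISR only sets a flag; the emulated 6502 looks at the
# flag on branch instructions, so ordinary instructions pay nothing extra.

class BranchPollCpu:
    def __init__(self):
        self.irq_pending = False   # set asynchronously by the platform ISR
        self.irq_disable = False   # stands in for the emulated I flag


def platform_isr(cpu):
    """Hypothetical microcontroller ISR for the external IRQ pin."""
    cpu.irq_pending = True


def step(cpu, kind):
    """Execute one emulated instruction; return True if an IRQ was taken."""
    if kind == "branch" and cpu.irq_pending and not cpu.irq_disable:
        cpu.irq_pending = False
        cpu.irq_disable = True     # the 6502 sets I when entering the handler
        return True
    return False


def run(program, irq_at):
    """Raise the IRQ before instruction irq_at; return where it was taken."""
    cpu = BranchPollCpu()
    taken_at = None
    for index, kind in enumerate(program):
        if index == irq_at:
            platform_isr(cpu)
        if step(cpu, kind) and taken_at is None:
            taken_at = index
    return taken_at


program = ["alu", "alu", "alu", "alu", "branch", "alu", "branch"]
taken = run(program, irq_at=1)           # IRQ arrives at 1, is taken at 4
never = run(["alu"] * 5, irq_at=0)       # straight-line code: never taken
```

The second call shows the straight-line-code problem: with no branch in sight, the pending request just sits there.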

Edit: the point about 'dispatch', which I didn't elaborate on, is that it's quite an overhead, and the best approach for fast dispatch varies according to the flavour of ARM you have. So it is something which can be optimised.

