Meet the 65F02 - a 65C02 at 100 MHz

BigEd · Post by **BigEd** » Sat Jul 25, 2020 7:45 am

(Might be best to discuss those questions over in Jim's thread, or a new thread?)

65f02 · Post by **65f02** » Sat Jul 25, 2020 8:29 am

BigEd wrote:

(Might be best to discuss those questions over in Jim's thread, or a new thread?)

Good point, Ed. Jim, if you could reply here with a link to the most relevant thread to follow up on your design? (It might be the one I mentioned above, or maybe another one I have not found yet?)

Thanks!
Juergen

JimDrew · Post by **JimDrew** » Tue Aug 04, 2020 9:57 am

Yes, that link is valid (although I must admit that I forgot about it, and the design shown is quite dated). I have been kind of quiet about this and super busy since the world was up-ended by the pandemic.

Some answers to your questions: about PHI0 - the 65xxT can either use PHI0 or become PHI0 (generating PHI1/2 on its own without using PHI0 as the reference). I found with some things, like the 1541 disk drive you can reliably run most of the system at almost 3MHz (2MHz reliably) because there is nothing in that system that is tied to PHI0. There is literally a 1MHz clock into PHI0 and everything uses PHI2. So, the RAM works at 3MHz, the ROM works at 3MHz, but the VIAs don't work past about 2MHz (but changing them to the 2MHz 'A' version allows that to work). But I leave the timing exact for the I/O stuff as it requires the standard clocks. This is all pushing/pulling data from the data bus, not using any internal memory for caching. So, PHI0 was ignored and PHI2 was generated by the 65xxT at various speeds during testing.

Originally, I was just trying to emulate a cycle exact 6502 for debugging another project, so I was using interrupts for handling the IRQ (there is no NMI in the 1541, but as you know it's basically the same instruction emulation, much like BRK). When I went for all out speed I changed that to just check (poll) the pin change interrupt register (no interrupt is called but the microcontroller sets a hardware flag anyway when a change occurs). This guarantees that I won't miss a change (which turned out to not be necessary). The interrupt check is at the start of the main loop.

The speed comes from shadowing everything that is not an I/O and not adhering to any instruction timing requirements. Almost all instructions take the same amount of time to be emulated, and there is no adhering to any phase of PHI0. The code is rather large as I put the code for each instruction emulation on a 256 byte boundary which saves several of my CPU instructions by not having a lookup table for the instruction - the code just jumps to the CPU emulation code start + opcode*256. Also, at the end of every instruction emulation is code to fetch and handle the next instruction.

I have 48K of RAM on the CPU. 16K of it is being used to copy the ROMs into RAM, 2K is the 1541's normal memory, 4K is the CPU control map, and the rest is really not used for the 1541. Code is fetched directly from RAM (ROMs were copied there) and only needs to look at the 4K mapping table when there is an instruction that changes the PC or EA, otherwise the PC stays pointed to the RAM. With the 1541, I know you can't execute code that is outside of the internal memory so PC boundary checks are ignored. In other systems the PC might get moved to a block of real memory that has to run cycle exact. Since the ROM gets copied to RAM, the 6502 doesn't have to fetch any instructions (ever) from the external bus. Likewise, the zero page, stack, and normal RAM are also fully internal.
The only time I am actually pulling/pushing data via the normal data bus is when the CPU mapping has an EA within the "slow" (I/O) space. PHI2 is generated at the normal speed to allow peripherals to do their normal clocking (otherwise timers and such would not work). The 1541 is really a perfect case scenario as it can run full speed until there is something that requires a real bus access at which point it grinds to a halt for that instruction cycle. Now that this all works, I am switching to another (much faster) 32-bit CPU and (re)writing the core in assembly code as well. I would like to get to your 100MHz speed when running internally, and I really need about 100K of RAM for doing some nifty caching of I/O (as an experiment). We can continue this discussion over in the other thread if you like.
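
(Editor's illustration, not Jim's firmware.) The two ideas in this post, table-free opcode dispatch and a map that separates shadowed memory from "slow" I/O space, could look roughly like the Python sketch below. The base address, the one-byte-per-16-bytes map granularity, the `ShadowedBus` name and the `0xFF` stand-in for a real bus read are all assumptions made for the example. The 1541 VIA addresses are used only as sample "slow" ranges.

```python
# Sketch under stated assumptions; none of these names come from the 65xxT code.
HANDLER_BASE = 0x10000    # hypothetical start of the emulation code
HANDLER_STRIDE = 256      # every opcode handler sits on a 256-byte boundary


def handler_address(opcode: int) -> int:
    """Jump target for an opcode: base + opcode*256, no lookup table needed."""
    return HANDLER_BASE + (opcode & 0xFF) * HANDLER_STRIDE


MAP_ENTRIES = 4096        # assumed reading of "4K control map": 1 byte per block
BLOCK_SHIFT = 4           # 65536 / 4096 = 16 bytes of 6502 space per entry
FAST, SLOW = 0, 1


class ShadowedBus:
    """Serves shadowed addresses from internal RAM, the rest from the real bus."""

    def __init__(self, slow_ranges):
        self.ram = bytearray(0x10000)       # internal copy of RAM and ROM
        self.map = bytearray(MAP_ENTRIES)   # FAST everywhere by default
        for start, end in slow_ranges:      # inclusive address ranges
            for block in range(start >> BLOCK_SHIFT, (end >> BLOCK_SHIFT) + 1):
                self.map[block] = SLOW
        self.external_accesses = 0

    def is_slow(self, addr: int) -> bool:
        return self.map[(addr & 0xFFFF) >> BLOCK_SHIFT] == SLOW

    def external_read(self, addr: int) -> int:
        """Placeholder for a real, PHI2-timed bus cycle."""
        self.external_accesses += 1
        return 0xFF

    def read(self, addr: int) -> int:
        if self.is_slow(addr):
            return self.external_read(addr)
        return self.ram[addr & 0xFFFF]


# Example: treat the two 1541 VIAs as the only slow regions.
bus = ShadowedBus([(0x1800, 0x180F), (0x1C00, 0x1C0F)])
```

With this layout, only reads that hit a SLOW block ever touch the external bus; everything else is a plain array access, which is where the speed-up described above would come from.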

brain · Post by **brain** » Wed Aug 05, 2020 4:53 am

As a use case, I'm hoping this design becomes open source or can be licensed. As someone who has tried in vain for a decade to resurrect the CMD SuperCPU, this core sounds like a nice base upon which to build such a device.

Jim

65f02 · Post by **65f02** » Wed Aug 05, 2020 8:39 am

Hi Jim (Drew), many thanks for the additional detail!

In a "clean" host design, which relies only on the Phi2 clock coming from the processor, simply driving that clock output from the emulated core (and ignoring Phi0 from the host) seems like the way to go. All the chess computers we have looked at so far fall into that category. For the 65F02, I decided to base the timing on the incoming Phi0 clock anyway, mainly with a view to its use in home computers. It seems that most of these do use Phi0, and/or faster clocks from which Phi0 is derived, to clock parts of their hardware. So the processor and its Phi2 clock need to be in phase with the incoming Phi0.

Probably a software emulation could sync its external bus cycles to Phi0 quite easily. As long as you operate from fast internal RAM, you don't care about Phi0/Phi2 anyway. When you encounter an address which requires an external bus cycle, you have all the time (and processing power) you need to wait for the next Phi0 clock edge and synchronize the bus cycle with that.

Cycle-correct execution is an aspect I had been wondering about. Arlet's FPGA core does use the correct number of instruction cycles, which could be considered a "luxury" when operating in fast, internal mode. (Although the byte-wide memory organization inside the FPGA makes it difficult to go much faster anyway.) But for executing timed code, it is of course important to get the cycle count right. The Apple II, which is close to my heart, is probably one of the worst offenders there, with its low-level disk routines based on cycle counting... So I am glad that I got cycle-exact execution "for free" from Arlet!

From a "bang for the buck" perspective, the software emulators easily beat an FPGA implementation these days, given the amazing cost/performance ratio of the ARM Cortex cores. And they are probably also ahead regarding the achievable emulated clock rate, if you use a sufficiently powerful core (at the expense of somewhat higher supply current). If one wants/needs to stay very close to the original bus and instruction timing, FPGA implementations probably have an advantage.

Regards,
Juergen

65f02 · Post by **65f02** » Wed Aug 05, 2020 8:57 am

brain wrote:

As a use case, I'm hoping this design becomes open source or can be licensed. As someone who has tried in vain for a decade to resurrect the CMD SuperCPU, this core sounds like a nice base upon which to build such a device.

Not sure whether you were thinking of Jim's software emulation or my FPGA core. (Both might be suitable for your purpose.) I definitely plan to make my VHDL code and schematics available, but need some more time for a cleanup...

This might be a good opportunity for a quick status update: After returning home from vacation, materials and PCBs for the final (?) revision of the circuit board had come in, and I first focused on building and testing these. Four out of four boards I built worked right away, so that counts as a success!

USB programming via Luke Valenty's TinyProg is now supported and works like a charm: Luke has developed a USB/SPI bridge which resides inside the FPGA, and which I could adapt to the Spartan-6 by just adding the Xilinx-specific reboot functionality. A simple Python-based command line tool running on the PC allows you to transfer configuration binaries to the SPI flash -- quick (7 seconds) and convenient (it knows which board it's talking to, and at which address the config should be stored, from metadata programmed into reserved pages in the SPI flash). Recommended for your next FPGA project!

The various functional enhancements discussed in this thread (automatically switching into "real time mode" for timed I/O operations, supporting bank-switched RAM, supporting mirrored video RAM) have been implemented but not debugged yet. I wrote them "blind", without access to test hardware, while on vacation, so there definitely will be bugs...

I just started to look into the automatic real-time mode; hope to be back here soon to report that I can boot Apple DOS in accelerated mode. Debugging the bank switching for the Apple language card, with the goal to support Apple Pascal, will be next.

JimDrew · Post by **JimDrew** » Wed Aug 05, 2020 9:34 am

65f02 wrote:

Probably a software emulation could sync its external bus cycles to Phi0 quite easily. As long as you operate from fast internal RAM, you don't care about Phi0/Phi2 anyway. When you encounter an address which requires an external bus cycle, you have all the time (and processing power) you need to wait for the next Phi0 clock edge and synchronize the bus cycle with that.

Yep, in fact there is enough time to put the CPU to sleep and wake on the edge interrupt. This reduces power consumption, but it's only about 60mA max for the entire board when never sleeping. So, power consumption is not a concern.

65f02 wrote:

Cycle-correct execution is an aspect I had been wondering about. Arlet's FPGA core does use the correct number of instruction cycles, which could be considered a "luxury" when operating in fast, internal mode. (Although the byte-wide memory organization inside the FPGA makes it difficult to go much faster anyway.) But for executing timed code, it is of course important to get the cycle count right. The Apple II, which is close to my heart, is probably one of the worst offenders there, with its low-level disk routines based on cycle counting... So I am glad that I got cycle-exact execution "for free" from Arlet!

I am working on the Apple II right now. I am aware of how critical the cycle timing is for the disk drive. I worked at Central Point Software, writing the last version of Copy II 64/128, but I also worked on the ROM code for the Laser 128 there as well as the Option Board project. I got quite a history lesson from Mike Brown (owner of CPS and author of Copy ][+) on the disk drives.

Mapping $C000-$CFFF as "slow" (cycle exact) seems to work fine for stuff I have tried. I would like to have some flexibility here to have some write-through caching. That will require a CPU with more memory in order to do that. Hence the reluctant desire to change CPUs. As a simple drop-in CPU replacement with some acceleration capability, it works fine right now. You can adjust the memory map for the device you are using it with, and it is what it is. I just want it faster.

65f02 wrote:

From a "bang for the buck" perspective, the software emulators easily beat an FPGA implementation these days, given the amazing cost/performance ratio of the ARM Cortex cores. And they are probably also ahead regarding the achievable emulated clock rate, if you use a sufficiently powerful core (at the expense of somewhat higher supply current). If one wants/needs to stay very close to the original bus and instruction timing, FPGA implementations probably have an advantage.

I am not sure what the cost of ARM processors is (I should probably look). I use PIC micros. I am using a $4.56 (@100pcs) dual core 90/100 MIPS 16-bit PIC micro currently. One core is not used, but I plan to use it for diagnostics (stand-up arcade machines). I want to switch to the 200MHz PIC32 which has 512K of RAM available. This would allow switching of ROMs on the fly, as well as having a large cache. I am not thrilled with MIPS assembly, and the CPU itself has some caching quirks you have to work around. For me, assembly is the only way to go as I have total control of everything. There is no mystery about delays and states.

drogon · Post by **drogon** » Wed Aug 05, 2020 10:24 am

JimDrew wrote:

I am not sure what the cost of ARM processors is (I should probably look).

$5 will get you an SBC with a 1GHz ARM, 512MB of RAM and enough IO to interface to a 2MHz BBC Micro's Tube interface (via level shifters) that will emulate a 65C02 at just under 300MHz. Look for the PiTubeDirect project. (Based on a Raspberry Pi Zero)

-Gordon

BitWise · Post by **BitWise** » Wed Aug 05, 2020 12:12 pm

JimDrew wrote:

I want to switch to the 200MHz PIC32 which has 512K of RAM available. This would allow switching of ROMs on the fly, as well as having a large cache. I am not thrilled with MIPS assembly, and the CPU itself has some caching quirks you have to work around. For me, assembly is the only way to go as I have total control of everything. There is no mystery about delays and states.

I've been writing a 65C816 emulator for PIC32 in MIPS assembler. Initially targeting a PIC32MX174 and the PIC32MZ1024.

65f02 · Post by **65f02** » Wed Aug 05, 2020 4:15 pm

JimDrew wrote:

I am working on the Apple II right now. I am aware of how critical the cycle timing is for the disk drive. [...] Mapping $C000-$CFFF as "slow" (cycle exact) seems to work fine for stuff I have tried.

I would think that you need to do more than that to get the disk drive to work. Accessing the actual I/O addresses externally is necessary, of course, and executing the boot ROM slowly can't hurt. But as DOS boots up, the RWTS routine will be loaded into RAM and will also need to be executed with cycle-exact timing.

That's what I hope to achieve by switching into "slow mode" as soon as one of the disk I/O addresses is accessed, and staying in that mode until the code returns "upwards" via an RTS. But I have not proven yet that this works; one can imagine coding patterns where the approach leaves slow mode too early, or gets stuck in slow mode forever. (And at the moment my switching logic is still buggy anyway...)

Quote:

I would like to have some flexibility here to have some write-through caching.

Hmm, at what level would you do the caching? Not the low-level disk I/O addresses, I assume. And even the RWTS sector buffer is not likely to see a write-and-read-back-immediately pattern all that often, I'd think.

Are you planning to keep track of the track/sector numbers accessed by the RWTS routine, and buffer many sectors? That should work, but I would argue that it does not "belong" in a CPU accelerator, because it would need to have quite a bit of knowledge about the software (RWTS parameters and buffer use).

brain · Post by **brain** » Thu Aug 06, 2020 5:48 am

65f02 wrote:

brain wrote:

As a use case, I'm hoping this design becomes open source or can be licensed. As someone who has tried in vain for a decade to resurrect the CMD SuperCPU, this core sounds like a nice base upon which to build such a device.

Not sure whether you were thinking of Jim's software emulation or my FPGA core. (Both might be suitable for your purpose.) I definitely plan to make my VHDL code and schematics available, but need some more time for a cleanup...

Jim is not an Open Source kinda guy, for reasons he can share if desired, so I am referring to your design. As well, even if Jim's design were open source and modifiable, I'm not currently as up to speed on PIC16 as I am on HDL, so an FPGA-based solution fits best for the design.

SuperCPU included a "write through" byte for mirroring. If a value in the mirror range was written, it was written in the SCPU RAM but also loaded into the write through cache. If the code went on and did not touch that address for a bit, the value was written to C64 internal RAM (the delay could be up to 56 µs, I think, because sometimes the VIC-II owned both halves of the cycle in the 64 and the cache could not be committed until the VIC got done with a badline). If, on the other hand, the SCPU code referenced that address again (not sure if both read and write or just write, I think just a write, but could be wrong), the entire CPU stalled until the cache could be emptied.
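
(Editor's illustration.) A one-entry write buffer of the kind described above might be modelled as in the Python sketch below. It is not based on SuperCPU documentation. It assumes that only writes (not reads) can stall, that any second mirrored write stalls while the buffer is occupied, and that the host bus is free during a stall. The class name and the sample mirror range are hypothetical.

```python
class WriteThroughByte:
    """Sketch of a single-byte write-through buffer between fast and host RAM."""

    def __init__(self, mirror_range):
        self.mirror = mirror_range          # addresses that must reach host RAM
        self.fast_ram = bytearray(0x10000)  # accelerator-side RAM
        self.host_ram = bytearray(0x10000)  # slow RAM inside the host machine
        self.pending = None                 # (addr, value) waiting for the host bus
        self.stall_cycles = 0               # host cycles the fast CPU spent waiting

    def host_cycle(self, bus_free=True):
        """One slow host cycle: commit the pending byte if the bus is available."""
        if self.pending is not None and bus_free:
            addr, value = self.pending
            self.host_ram[addr] = value
            self.pending = None

    def write(self, addr, value):
        """Fast-side write; stalls only if the one-byte buffer is still full."""
        self.fast_ram[addr] = value
        if addr not in self.mirror:
            return
        while self.pending is not None:     # assumed stall condition
            self.stall_cycles += 1
            self.host_cycle()               # simplification: bus free while stalled
        self.pending = (addr, value)


# Example: mirror a hypothetical 1K video RAM window.
wt = WriteThroughByte(range(0x0400, 0x0800))
wt.write(0x0400, 0x01)   # buffered, no stall
wt.host_cycle()          # committed to host RAM in the background
wt.write(0x0401, 0x02)   # buffered, no stall
wt.write(0x0402, 0x03)   # buffer still full: fast CPU waits one host cycle
```

Calling `host_cycle(bus_free=False)` models a cycle in which the video chip owns the bus, so the pending byte simply waits, which is where the variable delay mentioned in the post would come from.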

Jim

65f02 · Post by **65f02** » Mon Aug 10, 2020 6:55 am

Quick update: The 65F02 can now boot Apple DOS 3.3 while running in accelerated mode!

After some tweaking, the approach outlined earlier in this thread works nicely: Switch to "real time mode" as soon as an I/O address known to be time-critical is accessed; keep track of the subroutine nesting depth by counting JSR/BRK/interrupts and RTS/RTI, respectively; leave real time mode when exiting "upwards" from the level where it was first triggered. I did have to add a timeout counter also, to make sure that an I/O access from the top level (during startup) does not set real time mode forever. But during normal operation, once DOS is booted, switching between speeds works instantaneously, based on the nesting level.
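
(Editor's illustration.) The switching rule in the paragraph above can be written down as a small state machine. The Python sketch below is a behavioural model, not the 65F02's VHDL. The class and method names, the timeout value, and the choice to re-arm the timeout on every time-critical access are assumptions. The Disk II slot 6 soft switches ($C0E0-$C0EF) serve only as a sample set of time-critical addresses.

```python
class RealTimeModeTracker:
    """Behavioural sketch of nesting-depth based speed switching."""

    def __init__(self, critical_addresses, timeout_cycles=100_000):
        self.critical = frozenset(critical_addresses)
        self.timeout_cycles = timeout_cycles   # hypothetical value
        self.depth = 0               # subroutine/interrupt nesting depth
        self.trigger_depth = None    # depth at which real time mode started
        self.remaining = 0           # cycles left before the timeout fires

    @property
    def real_time(self) -> bool:
        return self.trigger_depth is not None

    def on_access(self, addr: int) -> None:
        """Call for every bus access; enters real time mode on critical I/O."""
        if addr in self.critical:
            if not self.real_time:
                self.trigger_depth = self.depth
            self.remaining = self.timeout_cycles   # assumption: re-arm each time

    def on_call(self) -> None:
        """JSR, BRK, or an interrupt being taken."""
        self.depth += 1

    def on_return(self) -> None:
        """RTS or RTI; leaving 'upwards' from the trigger level ends the mode."""
        if self.real_time and self.depth <= self.trigger_depth:
            self.trigger_depth = None
        self.depth -= 1

    def on_cycle(self) -> None:
        """Timeout so that top-level I/O cannot pin real time mode forever."""
        if self.real_time:
            self.remaining -= 1
            if self.remaining <= 0:
                self.trigger_depth = None


# Example: I/O touched two levels deep, one nested call, then return upwards.
t = RealTimeModeTracker(range(0xC0E0, 0xC0F0), timeout_cycles=5)
t.on_call(); t.on_call()
t.on_access(0xC0EC)      # real time mode starts at depth 2
t.on_call(); t.on_return()   # nested helper: still real time
t.on_return()            # leaves the trigger level: back to accelerated mode
```

The timeout only matters for accesses made at nesting depth 0, where no matching return ever arrives; in the nested case the return path ends real time mode first.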

There's more testing to be done. I have not installed other I/O cards beyond the disk controller yet. Will do that -- but first I want to play with some math and graphics demos and take some benchmark data!

And then on to language card support... I already had to implement a little work-around there: DOS writes a $00 to address $E000 during startup, to make sure the language card (if installed) is considered "empty". Since the 65F02 does not distinguish between accelerated RAM vs. ROM so far, that nicely overwrote a byte in the Applesoft ROM, causing the Apple to crash. As mentioned earlier, I would like to make the language card work, mainly with a view to running UCSD Pascal. More to come...

BillG · Post by **BillG** » Mon Aug 10, 2020 8:11 am

65f02 wrote:

Quick update: The 65F02 can now boot Apple DOS 3.3 while running in accelerated mode!

I'll go first...how much faster does it feel?

65f02 · Post by **65f02** » Mon Aug 10, 2020 8:24 am

Quite satisfyingly so.

The first thing I tried was a simple BASIC program I wrote while in high school, to plot the light diffraction pattern from a slit. Lots of sin and tan calculations, and a bit of high-res plotting. At the original speed it takes 75 seconds; in accelerated mode about 0.8 seconds (looped it ten times to get a reasonable measure). It's a massive difference between watching the line creep across the screen vs. "there it is!".
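
(Editor's illustration.) For scale, the quoted timings work out as below. The 1.023 MHz figure for the stock Apple II clock and the assumption that the core ran at the full 100 MHz are the editor's inputs, and "about 0.8 seconds" is itself a rounded measurement, so the result is only a ballpark.

```python
# Back-of-the-envelope check of the quoted timings (inputs partly assumed).
original_seconds = 75.0       # from the post
accelerated_seconds = 0.8     # from the post ("about 0.8 seconds")
stock_clock_mhz = 1.023       # assumption: nominal Apple II CPU clock
fast_clock_mhz = 100.0        # assumption: core running at the full 100 MHz

speedup = original_seconds / accelerated_seconds
clock_ratio = fast_clock_mhz / stock_clock_mhz

print(f"measured speed-up: {speedup:.1f}x")
print(f"raw clock ratio:   {clock_ratio:.1f}x")
print(f"fraction achieved: {speedup / clock_ratio:.0%}")
```

A measured speed-up this close to the raw clock ratio would suggest that the program spends nearly all of its time in accelerated internal memory rather than waiting on the external bus.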

As a side note, those 40-year-old floppy disks seem to work just fine!

JimDrew · Post by **JimDrew** » Tue Aug 11, 2020 10:51 am

BitWise wrote:

JimDrew wrote:

I want to switch to the 200MHz PIC32 which has 512K of RAM available. This would allow switching of ROMs on the fly, as well as having a large cache. I am not thrilled with MIPS assembly, and the CPU itself has some caching quirks you have to work around. For me, assembly is the only way to go as I have total control of everything. There is no mystery about delays and states.

I've been writing a 65C816 emulator for PIC32 in MIPS assembler. Initially targeting a PIC32MX174 and the PIC32MZ1024.

Yeah, I was looking only at the MZ myself. No sense going with the slower one.
