Meet the 65F02 - a 65C02 at 100 MHz

65f02 · Post by **65f02** » Wed Jul 01, 2020 7:15 pm

My first post here -- but I have been reading for quite a while, and the project I want to report on owes a lot to you guys.

Over the past three months, I have developed yet another 65C02 replica in an FPGA, but one with a nice twist (or so I think). The actual 65C02 core inside is not my own, but is the 6502 core by Arlet with the 65C02 extension by Ed and David -- a big tip o'the hat to you folks!

I packaged this in a Spartan-6 (with 64 kByte on-chip RAM), on a small PCB which is just the size of a 40-pin DIP package, with pins matching the 65C02 pinout. I added a small state machine inside the FPGA which can access the external 65(C)02 bus with the correct timing, based on whatever Phi0 clock is coming in from the host system. Inside the FPGA, the CPU core runs at 100 MHz. I dubbed this the "65F02", where the "F" might stand for FPGA or for "Fast".

The idea is to use this as a "universal" accelerator for 6502 and 65C02 based host computers -- just plug it into the CPU socket. The only thing the FPGA board needs to know about its host is the memory map: Where does the host have memory-mapped I/O? Up to 16 different memory maps can be stored in the FPGA, and selected via a mini DIP switch. Upon power-on, the 65F02 grabs the complete memory content from the host and copies it into the on-chip RAM, except for the I/O area. Then the CPU gets going, using the internal memory at 100 MHz for all bus accesses except for any I/O addresses -- for these, the internal CPU gets halted, and an external bus cycle is started at whatever the external clock speed is.
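The memory-map idea above can be sketched in a few lines of Python. This is only a model of the behaviour described in the post, not the actual HDL: the names `Region`, `EXAMPLE_MAP`, `is_external` and `initial_copy` are made up for illustration, and the single I/O window in `EXAMPLE_MAP` is a placeholder rather than a tested map for any particular host.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    start: int  # first address of the region, inclusive
    end: int    # last address of the region, inclusive


# Hypothetical memory map for illustration: one external (I/O) window.
EXAMPLE_MAP = (Region(0xC000, 0xCFFF),)


def is_external(addr, memory_map):
    """True if an access to addr must go out on the host bus."""
    return any(r.start <= addr <= r.end for r in memory_map)


def initial_copy(host_read, memory_map):
    """Model of the power-on copy: read every non-external address once."""
    ram = bytearray(0x10000)
    for addr in range(0x10000):
        if not is_external(addr, memory_map):
            ram[addr] = host_read(addr) & 0xFF
    return ram
```

After the copy, every access for which `is_external` is false would be served from `ram` at full speed, and everything else would wait for a host bus cycle.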

Kudos go to Roland Langfeld, who suggested this way of integrating an accelerator into almost any host, and has contributed a lot during the testing and debugging effort. Roland's original interest was in 6502-based chess computers (there were some really nice ones in the 1980s). But we have successfully tested the prototype 65F02s in an Apple II and a Commodore 8032, as well as various chess computers. Here it is at work in a "Mephisto Milano" chess computer:

My current PCB includes level converters to provide 5V tolerant inputs with TTL logic thresholds, and full-swing 5V outputs. So it is compatible with TTL as well as CMOS environments. For the bus timing, we seem to have established one "universal" timing (setup and hold times etc.) which works in all host systems mentioned above, which run at 1 MHz to 5 MHz external bus speed.

I am not quite sure where to take this next. Populating and soldering such a board is a bit beyond the typical hobbyist's range -- the FPGA comes in a BGA package, and the level converter packages are also rather finicky. So just open-sourcing it once I have convinced myself that the design is stable wouldn't get us very far. But setting up a cottage industry to make these is not my idea of fun, and selling electronics in Europe is also painfully regulated (EMC testing, anyone?).

Hence this post. Would people even be interested in getting one of these? Any ideas how I could have them produced and, more importantly, distributed?

Of course I'm happy to discuss the technical aspects of this design too! Feel free to ask any questions, shoot holes in it, suggest improvements.

Best regards from Hamburg, Germany --
Juergen

(typos edited later...)

BigEd · Post by **BigEd** » Wed Jul 01, 2020 8:01 pm

Welcome, Juergen! What a splendid first post.

I would say it is worth open-sourcing the design - others can learn from what you've done, and maybe someone will be willing to put in the money to build in small quantity. More usefully, perhaps, others can exercise the machine, in a variety of applications, and help fix anything which might not be as robust as it could be. It has usually turned out, in the past, that once you get a half-dozen parts into the field you start to learn things about margins and variations. It can be a mistake, I think, to skip a thorough beta-testing.

If you have used commodity parts, some PCB manufacturers will offer partial or complete assembly service, and this could be a good way to get difficult boards assembled. JLCPCB have been mentioned in this context - and it looks like Seeed Studio and PCBWay offer it too.

BillG · Post by **BillG** » Wed Jul 01, 2020 9:33 pm

65f02 wrote:

Of course I'm happy to discuss the technical aspects of this design too! Feel free to ask any questions, shoot holes in it, suggest improvements.

Very interesting...

Is there an easy way to specify the I/O region(s) short of reprogramming the chip? Maybe implement a special instruction to give it an I/O range?

What can be done for existing software timing loops other than rewriting them?

hoglet · Post by **hoglet** » Wed Jul 01, 2020 9:36 pm

This looks really great.

Did you hand assemble the board? I'd be very interested in hearing more about that.

A couple of further thoughts:

1. Please do consider publishing the HDL source under an open source license. I would love to play with this on some slightly different hardware.

2. Do you have any spare I/Os on your board that could be used as a serial port? If so, then we ought to be able to build a version of ICE-65(C)02 for that hardware. That would make a very neat full-featured 6502 In-Circuit-Emulator.

Dave

65f02 · Post by **65f02** » Wed Jul 01, 2020 9:54 pm

BillG wrote:

Is there an easy way to specify the I/O region(s) short of reprogramming the chip? Maybe implement a special instruction to give it an I/O range?

Nice idea, thank you! At the moment, it is indeed "reprogramming the chip". I do have spare room in the attached flash ROM though, which could be used to store the memory maps. I am working on an updated PCB which can write the flash via the "TinyProg" programmer (a USB-to-flash bridge in the FPGA plus Python software on the PC). Writing into a data area via TinyProg should be easy; writing to the flash "live" from the FPGA, triggered by a special instruction from the host, should be feasible but more tricky.

Quote:

What can be done for existing software timing loops other than rewriting them?

Ah, right, I had meant to mention that. Very easy in principle: Just declare the memory areas which hold timing loops as "external" in the memory map, like the I/O areas; then this code will be executed from external ROM at the original speed. We are doing that for all the chess computers, which typically have "beeps" timed in software, and for the CBM 8032, where the IEEE interface is driven with some software-defined timing. I have not bothered to identify all timing loops in Apple DOS yet...

It takes away a bit from the universality, since the memory maps also become dependent on the firmware version, but otherwise works nicely.
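To get a feel for how much each "external" region costs, here is a back-of-the-envelope estimate in Python. It rests on simplifying assumptions that are not from the post: every external access takes exactly one host clock period, every internal access takes exactly one 100 MHz period, and synchronization overhead between the two clock domains is ignored. The function name `effective_speedup` is made up for this sketch.

```python
def effective_speedup(external_fraction, host_mhz=1.0, core_mhz=100.0):
    """Rough speedup over the plain host CPU for a given external share.

    external_fraction: share of bus cycles (0.0 to 1.0) that must run
    on the host bus because they hit I/O or timing-critical code.
    """
    host_ns = 1000.0 / host_mhz   # duration of one external bus cycle
    core_ns = 1000.0 / core_mhz   # duration of one internal cycle
    avg_ns = external_fraction * host_ns + (1.0 - external_fraction) * core_ns
    return host_ns / avg_ns
```

Under these assumptions, even a modest share of external cycles dominates the average, so it should pay to keep the external regions as small as the timing loops allow.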

65f02 · Post by **65f02** » Wed Jul 01, 2020 10:13 pm

hoglet wrote:

Did you hand assemble the board? I'd be very interested in hearing more about that.

Yes, I did. I use solder paste applied with stencils ordered from the PCB house (I used Aisler.net), tweezers and a magnifying glass for pick & place, and a manually controlled pizza oven to solder the top side.

The pizza oven is fun. I only got into that technique recently, and fretted a lot whether a PID/temperature profile controller would be needed. But it works like a charm just using a 50€ oven plus a multimeter + temperature sensor taped to a dummy PCB, so I know the current temperature near the PCB to be soldered:

* Turn the oven up to full heat;
* when it reaches 150°C, turn heat off for 30s (the temperature will creep up a bit more and then sit there, for soaking);
* turn heat on again until it reaches 250°C;
* turn off, wait 10 to 15 s, open the door a bit for slowish cooling.
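The steps above were carried out by hand, with no controller involved. Purely as a restatement, the same sequence can be written as a small state machine in Python. The phase names and the function `reflow_step` are invented for this sketch, and the 10 s rest is taken from the lower end of the "10 to 15 s" given above.

```python
def reflow_step(phase, temp_c, secs_in_phase):
    """Return (next_phase, heater_on) for one temperature reading.

    The caller is expected to reset secs_in_phase to zero whenever
    the returned phase differs from the one passed in.
    """
    if phase == "preheat":            # full heat until 150 degrees C
        return ("soak", False) if temp_c >= 150 else ("preheat", True)
    if phase == "soak":               # heater off for 30 s
        return ("reflow", True) if secs_in_phase >= 30 else ("soak", False)
    if phase == "reflow":             # heat again until 250 degrees C
        return ("rest", False) if temp_c >= 250 else ("reflow", True)
    if phase == "rest":               # heater off, door still closed
        return ("cool", False) if secs_in_phase >= 10 else ("rest", False)
    return ("cool", False)            # door slightly open, slow cooling
```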

The bottom side of the PCB, which is pretty busy with voltage regulators and passives, was soldered in a second step via a cheap hot air soldering station. I use lead-free solder paste on the top (since the BGA balls are lead free), and leaded on the bottom for now (so the board can stay well below the melting point for the top side).

Quote:

1. Please do consider publishing the HDL source under an open source license.

I will, as soon as I get it cleaned up a bit more so it's less embarrassing.

Quote:

2. Do you have any spare I/Os on your board that could be used as a serial port? If so, then we ought to be able to build a version of ICE-65(C)02 for that hardware. That would make a very neat full-featured 6502 In-Circuit-Emulator.

Another great idea! There are plenty of unused FPGA pins, but limited room to route traces to them or to place a connector...

I do currently have USB data lines routed to two DIP-40 pins which are unused by the 65C02, for programming the FPGA via a plug-in adapter, and have considered adding a dedicated micro USB jack to the board. Would ICE-65 support a USB CDC device instead of plain old serial? Then that USB connector could do double duty.

BigEd · Post by **BigEd** » Thu Jul 02, 2020 10:05 am

As the Apple II has memory-mapped graphics, I imagine you already have a mode which holds a fast copy of an area of memory, but has a write-through or write-back to the host's physical memory?

As it happens, there's a 10-year-old project for Acorn's BBC Micro which we've recently been re-investigating, for an in-socket accelerator (but a much slower and larger one) and we've seen a couple of things we have to deal with:
- the memory that's used for video can be configured to sit anywhere in the low 32k. So either we have to be very conservative with the write-through, or we have to track and model the configuration register. Or, what we're doing at present, we make the inaccurate assumption that the configuration doesn't change.
- the 16k area starting at 8000 is mappable, so we have to track the configuration register, and decide how many of the mappings to accelerate, and how many not to. If we accelerate more than one mapping, that's more than 64k total - but we have more. In your case, slurping up the newly mapped 16k might be overkill if the mapping is changed again fairly soon. Some kind of cache, with invalidation, might suit better than a full 16k copy, if the complexity is not too much.

I've a feeling Apple II cards might similarly have overlapping areas of memory? I might be wrong. Certainly the C64 has a non-trivial memory map. But for me the Acorn story is the important one!

65f02 · Post by **65f02** » Thu Jul 02, 2020 10:41 am

BigEd wrote:

As the Apple II has memory-mapped graphics, I imagine you already have a mode which holds a fast copy of an area of memory, but has a write-through or write-back to the host's physical memory?

In fact, that had not occurred to me yet -- thanks for the idea! At the moment, the graphics (and text) screen memory is simply declared as fully external, i.e. all read and write accesses go to the host system. That works, but is indeed slower than necessary: The writes could be captured in the internal RAM in parallel, and then reads could be fast, internal only. Makes sense even for scrolling the text screen, which the Apple does fully in software.
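A minimal Python model of that write-through idea, assuming a hypothetical `ShadowedRam` class and a `host_write` callback that stands in for one slow external bus cycle. Nothing like this exists in the design yet; the sketch only shows the intended behaviour for RAM-like regions and leaves out true I/O, which still has to be read externally.

```python
class ShadowedRam:
    """Internal RAM that mirrors writes to selected regions onto the host."""

    def __init__(self, host_write, shadowed):
        self.ram = bytearray(0x10000)
        self.host_write = host_write  # callback: one slow external write
        self.shadowed = shadowed      # e.g. a range covering screen memory

    def write(self, addr, value):
        self.ram[addr] = value & 0xFF         # always captured internally
        if addr in self.shadowed:
            self.host_write(addr, value & 0xFF)  # mirrored to the host

    def read(self, addr):
        return self.ram[addr]                 # reads never leave the chip
```

With the shadowed range set to the screen memory, a software scroll would pay for the external writes only, while all the reads stay internal.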

Quote:

- the 16k area starting at 8000 is mappable, so we have to track the configuration register, and decide how many of the mappings to accelerate, and how many not to. If we accelerate more than one mapping, that's more than 64k total - but we have more. In your case, slurping up the newly mapped 16k might be overkill if the mapping is changed again fairly soon. Some kind of cache, with invalidation, might suit better than a full 16k copy, if the complexity is not too much.

I've a feeling Apple II cards might similarly have overlapping areas of memory? I might be wrong. Certainly the C64 has a non-trivial memory map. But for me the Acorn story is the important one!

The Apple does have bank-switched memory if it is equipped with a "language card", which replaces the upper 16k ROM with extra RAM. A few of the late, high-end chess computers use bank-switched ROM as well, presumably to enable larger libraries of openings.

I have not implemented bank switching yet, and the right approach probably varies depending on the host. For the Apple, switching in the RAM is mostly a one-way street to my knowledge; it never goes back to the ROM after e.g. you boot up UCSD Pascal. So I could just ignore the bank switch, and allow the software to start overwriting the ROM contents that were initially slurped up. For the chess computers, bank switching would probably entail alternating between the slurped-up ROM contents (fast code for generating and evaluating moves) and the slow external ROM (opening library).

The Spartan-6 maxes out at 64 kByte RAM on chip (in the 225-pad BGA I am using, the largest I can fit onto the PCB). I have been tempted by its Spartan-7 cousin, which is available in the same package with up to 180 kByte of RAM. More expensive though, and annoyingly it needs yet another supply voltage. Having 5V, 3.3V, 1.8V and 1.0V supplies on such a little board seems a bit stupid...

Lots to do still; luckily I have vacation time coming up... I also want to support the battery-backed-up RAM in some of the chess computers. In that case, you don't want a write-through for every write cycle, since it would slow down the time-critical core routines. So the idea is to mark changed bytes in the internal RAM (luckily there's a 9th bit in the Spartan block RAM!); then scan the memory for changes in the background, and use vacant external bus cycles to write back any changes. That should be a nice little puzzle to implement...
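One possible shape for that puzzle, sketched in Python under stated assumptions: each memory word is modelled as 9 bits with bit 8 as the "changed" marker, and a hypothetical `flush_next` method is called once per vacant external bus cycle. The class and method names are invented for this sketch, and the real FPGA logic may look quite different.

```python
DIRTY = 0x100  # the ninth bit marks a byte that differs from the host copy


class DirtyTrackingRam:
    def __init__(self, size=0x10000):
        self.words = [0] * size  # bit 8 = dirty flag, bits 7..0 = data

    def write(self, addr, value):
        self.words[addr] = DIRTY | (value & 0xFF)  # fast, internal only

    def read(self, addr):
        return self.words[addr] & 0xFF

    def flush_next(self, host_write, region, cursor):
        """Write back at most one dirty byte, scanning region from cursor.

        region is a range of backed-up addresses; the return value is the
        cursor position to pass in on the next vacant bus cycle.
        """
        n = len(region)
        for i in range(n):
            addr = region[(cursor + i) % n]
            if self.words[addr] & DIRTY:
                host_write(addr, self.words[addr] & 0xFF)
                self.words[addr] &= 0xFF          # clear the dirty flag
                return (cursor + i + 1) % n
        return cursor                             # nothing to write back
```

A byte that is written several times between two scans costs only one external write, which is the point of deferring the write-back.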

BigEd · Post by **BigEd** » Thu Jul 02, 2020 10:57 am

Best use yet of the ninth bit! I like it.

Chromatix · Post by **Chromatix** » Thu Jul 02, 2020 11:49 am

Of course, the easiest and most compatible way to accelerate a BBC Micro is to implement a Second Processor. That would involve only a small, fixed I/O space in which the Tube interface lives.

BigEd · Post by **BigEd** » Thu Jul 02, 2020 12:33 pm

Hold that thought! I intend to start a new thread...

BillG · Post by **BillG** » Thu Jul 02, 2020 12:48 pm

65f02 wrote:

BigEd wrote:

I've a feeling Apple II cards might similarly have overlapping areas of memory? I might be wrong. Certainly the C64 has a non-trivial memory map. But for me the Acorn story is the important one!

The Apple does have bank-switched memory if it is equipped with a "language card", which replaces the upper 16k ROM with extra RAM. A few of the late, high-end chess computers use bank-switched ROM as well, presumably to enable larger libraries of openings.

I have not implemented bank switching yet, and the right approach probably varies depending on the host. For the Apple, switching in the RAM is mostly a one-way street to my knowledge; it never goes back to the ROM after e.g. you boot up UCSD Pascal. So I could just ignore the bank switch, and allow the software to start overwriting the ROM contents that were initially slurped up. For the chess computers, bank switching would probably entail alternating between the slurped-up ROM contents (fast code for generating and evaluating moves) and the slow external ROM (opening library).

Speaking of the Apple II, some computers have memory decoding such that reading a particular address gets ROM whereas writing tickles a "write-only" control port.

Chromatix · Post by **Chromatix** » Fri Jul 03, 2020 12:41 am

More generally, things will start to get sticky when memory mapping hardware comes into play, as it does with the BBC Master (and even some aspects of the BBC Micro). The BBC Master comes with 128KB RAM and 128KB ROM as standard; you should ask yourself how all that fits into the 6502's 64K address space! You might use the FPGA's on-board RAM as a write-through cache (which, since write accesses are comparatively rare, is still a big win over emulating every bus access), but certain I/O writes will force a need to flush (before) and invalidate (thereafter) that cache.

Worse, the BBC Master has some special cleverness which maps one of the 32KB banks of RAM in depending on where the current code is executing from. This is the "shadow RAM" logic which hides the framebuffer and the DFS workspace, making those areas in the "front" RAM available to user programs. So you will additionally need to detect changes of program flow, emulate enough SYNC bus cycles to trigger the shadow logic correctly, and flush/invalidate the cache accordingly to boot.

At some point, it possibly gets easier to emulate the entire machine, rather than interfacing with the original logic. Other machines may have similarly frustrating quirks (particularly the Apple II series and its disk controller), so this is just an illustration.
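The write-through cache with invalidation mentioned in this post could be modelled roughly as follows. This is a sketch under assumptions: a single bank-select register, a single mappable window, and no other I/O. The class `InvalidatingCache` and its parameters are made up for illustration. Because every write goes straight through to the host, there is nothing to flush in this simple model; a write to a cached address just drops the entry so the next read refetches it, which also stays correct when the address is ROM or a write-only port.

```python
class InvalidatingCache:
    def __init__(self, host_read, host_write, window, bank_register):
        self.host_read = host_read
        self.host_write = host_write
        self.window = window              # addresses affected by banking
        self.bank_register = bank_register
        self.lines = {}                   # addr -> byte, valid entries only

    def read(self, addr):
        if addr not in self.lines:        # miss: one slow host bus cycle
            self.lines[addr] = self.host_read(addr)
        return self.lines[addr]

    def write(self, addr, value):
        self.host_write(addr, value)      # write-through, host stays current
        self.lines.pop(addr, None)        # refetch on the next read
        if addr == self.bank_register:    # mapping changed underneath us
            for a in self.window:
                self.lines.pop(a, None)
```

Tracking program flow for something like the shadow RAM logic is outside what this sketch covers.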

rpiguy2 · Post by **rpiguy2** » Fri Jul 03, 2020 2:46 am

How much room is left on the FPGA?

I imagine you could combine this with an FPGA PAL for the C64 to get the mapping right.

65f02 · Post by **65f02** » Fri Jul 03, 2020 5:32 am

There certainly are limitations to where this accelerator can be used in a meaningful way. There's a reason why I put "universal" in quotes in the original post...

@Chromatix: As mentioned above, up to 180k RAM would be feasible via a Spartan-7. But independent of the RAM constraint, I agree that accelerating a machine as complex as the BBC Master would best be done by emulating larger parts of the architecture in an FPGA. In my current concept, the CPU would end up duplicating too much of the host's functionality, just so it can keep track of what's going on in the host -- not elegant at all.

@rpiguy2: I would probably argue that accelerating the C64 makes no sense anyway, because who wants to play accelerated games?

I do intend to deal with the Apple II disk controller though. No special hardware features there, just heavy (and clever) use of exactly timed code, which needs to be executed at the original speed. I have my "Beneath Apple DOS" book ready to define the required address ranges for the 65F02 memory map. And next, I want to see that 100 MHz Apple boot into UCSD Pascal!
