My new verilog 65C02 core.

BigEd · Post by **BigEd** » Thu Oct 22, 2020 2:15 pm

I was idly wondering how well that 239x36 (?) bits of microcode would synthesise down... and it's still slightly interesting, but I also realise that ideally it wouldn't be binary but would also have don't cares, and of course that's a lot more work.

Arlet · Post by **Arlet** » Thu Oct 22, 2020 2:58 pm

Right now it's 241x31 = 7471 bits. The memory is declared as 36 bits, but I still have 5 free bits left.

If you'd just naively store it as binary in LUTs, you'd get 256 bits/slice, or 30 slices total. It would still beat my old core. Obviously, by exploiting the don't cares, you could reduce that. It's not an easy task to find the most densely packed solution, though, because there are many degrees of freedom in the design.

Using a ROM is easy and flexible. It only takes a few minutes to add a new instruction, or to fix a bug.

Arlet · Post by **Arlet** » Thu Oct 22, 2020 3:13 pm

Blinky LED works. I'm running the code from a block RAM, using negedge clock to simulate asynchronous memory.

BigEd · Post by **BigEd** » Thu Oct 22, 2020 5:06 pm

Milestone! Well done.

Arlet · Post by **Arlet** » Thu Oct 22, 2020 6:10 pm

Next milestone done as well... printing "Hello world" over the serial port, with stack & zeropage running in external SRAM.

Now back to the simulator for decimal mode.

Arlet · Post by **Arlet** » Fri Oct 23, 2020 7:20 am

I was thinking last night about the synchronous vs asynchronous memory bus. As I was working with real hardware, and trying to use the block RAMs, I had to use the 'negedge' clock to simulate asynchronous behavior. While this works just fine for a test, it has the severe disadvantage that it essentially cuts the clock period in half, resulting in lower Fmax. On the other hand, writing a core for synchronous memory takes more effort, and results in a bigger and slower design, which is a waste if you're dealing with a naturally asynchronous memory bus. Ideally, I want to use both at the same time, with a typical system setup of external asynchronous memory (or I/O devices), and internal block RAM for the boot code/operating system ROMs.

Now, I do have an early AD signal available, which is a combinatorial version of the address bus. The core does basically this:

Code:

always @(posedge clk)
    AB <= AD;

I had an idea: why not simply feed that AD into the block RAMs instead of the AB signal? That way we get an early address set up before the positive clock edge, and the RAM can get the result exactly when the core needs it. We can still use asynchronous memory in the same system by registering that AD signal in the output pads.

It's an obvious idea, but it fails.

The problem is with writing data. With asynchronous memory, you can have a write in one cycle, and then switch to a read on the next cycle, typically to fetch the next opcode. With synchronous memories, when doing a write, followed by a read, the result from that read won't be available until one cycle later.

However, there is a solution.

The block RAMs inside the FPGA are dual ported, so we can use one port as a read port with the address bits connected to AD. The other port we will use for writing only, using the registered AB for the address. Of course, if you only use the block RAM as read-only (such as an OS/boot ROM), you don't need to use the write port.
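To make the hookup concrete, here is a minimal sketch of such a dual-port block RAM. The module name, port names, and 2K x 8 size are assumptions for illustration, and it assumes WE and DO are valid in the same cycle as the registered AB. It is not taken from the core's actual source.

```verilog
// Hypothetical wrapper: names, widths, and WE/DO timing are assumptions.
module dp_ram #(
    parameter AW = 11                 // 2K x 8
) (
    input  wire          clk,
    input  wire [AW-1:0] AD,          // early (combinatorial) address from the core
    input  wire [AW-1:0] AB,          // registered address (AB <= AD)
    input  wire          WE,          // write enable, assumed valid together with AB
    input  wire [7:0]    DO,          // write data from the core
    output reg  [7:0]    DI           // read data to the core
);
    reg [7:0] mem [0:(1<<AW)-1];

    // Read port: samples AD on the clock edge, so the data is
    // available during the cycle in which AB holds that address.
    always @(posedge clk)
        DI <= mem[AD];

    // Write port: uses the registered address, like an asynchronous
    // memory would see it, and commits at the end of the write cycle.
    always @(posedge clk)
        if (WE)
            mem[AB] <= DO;
endmodule
```

In this simple model, a read and a write to the same address on the same clock edge return the old data, so it does not show any particular collision mode of a real block RAM.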

If you don't have a dual ported RAM available (or you want to use the 2nd port for something else, like a video processor), you can still use a single port, but then add a wait state (using RDY=0) whenever you're trying to switch from write -> read. As long as you're not executing code from that particular memory, I don't think that should ever happen.
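A rough sketch of just the wait-state part, under stated assumptions: `sel` is a hypothetical chip-select decoded from the current address, WE is high during a write cycle, and the core stalls for as long as RDY is low. The address multiplexing that a real single-port hookup would also need is left out.

```verilog
// Hypothetical wait-state generator: module and signal names are assumptions.
module wr_rd_wait (
    input  wire clk,
    input  wire sel,      // high when the current address decodes to this RAM
    input  wire WE,       // high during a write cycle
    output wire RDY       // low for one cycle on a write -> read switch
);
    reg was_write;        // previous cycle was a write to this RAM
    initial was_write = 1'b0;

    always @(posedge clk)
        was_write <= WE & sel;

    // Stall only a read of this RAM that directly follows a write to it.
    // Back-to-back writes and accesses to other devices are not delayed.
    assign RDY = ~(was_write & sel & ~WE);
endmodule
```

During the stalled cycle WE is low, so `was_write` clears on the next edge and the wait state lasts exactly one cycle.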

BigEd · Post by **BigEd** » Fri Oct 23, 2020 8:18 am

Quite an interesting solution! You will perhaps have seen the thread elsewhere, where the fmax of a system-in-FPGA was limited when using a full 64k of block rams, presumably because the routing has to reach all parts of the chip. So, just a thought, but with all block rams in use and double ported, there might well be an fmax impact because there's twice as much routing. Or, maybe not, depending on how FPGA routing is implemented.

(In this case, not quite a full 64k, but the same observation!)

hoglet · Post by **hoglet** » Fri Oct 23, 2020 8:34 am

Hi Arlet,

I'm watching this thread with great interest. I have a project that uses the 65C02 version of your existing core in a Xilinx XC6SLX9 FPGA, currently running at 80MHz, and in practice stable at 100MHz. It would be great to see how the new core compares.

I have a couple of thoughts on the points you raise:

Arlet wrote:

The problem is with writing data. With asynchronous memory, you can have a write in one cycle, and then switch to a read on the next cycle, typically to fetch the next opcode. With synchronous memories, when doing a write, followed by a read, the result from that read won't be available until one cycle later.

Can you think of a situation with the 65C02 where this would actually occur in practice?

The only one I could think of was self-modifying code, where the current instruction updates the opcode of the very next instruction. Even this seems rather contrived.

I could certainly live with this limitation.

Arlet wrote:

The block RAMs inside the FPGA are dual ported, so we can use one port as a read port with the address bits connected to AD. The other port we will use for writing only, using the registered AB for the address.

Not all FPGA families have block RAM that is as flexible as the Xilinx block RAM, in terms of how independent the two ports are.

I found with the Lattice ICE40, one port is a read port, and the other port is a write port. If you violate this, you end up with two copies of the RAM. And I think some parts include large 16Kx16 embedded single port RAM (SPRAM), which would be nice to use.

At the very least, consider a compile time switch to use just a single port.

Dave

Arlet · Post by **Arlet** » Fri Oct 23, 2020 8:52 am

Because the 'write' address is always a single cycle delayed version of the 'read' address, the routing pressure can be avoided by inserting a local register at each RAM. The tools can do this automatically.

Arlet · Post by **Arlet** » Fri Oct 23, 2020 9:15 am

hoglet wrote:

The only one I could think of was self-modifying code, where the current instruction updates the opcode of the very next instruction. Even this seems rather contrived.

This even works correctly, as long as you can configure the block RAM to return the new data instead of the old when both read/write ports have the same address. Currently, my new core expects that behavior anyway, because when doing a read-modify-write on memory, it sets the N/Z flags by looking at read data on the bus, not its own write data (although that can be fixed at small cost in extra logic). Xilinx block RAMs have a configuration feature "write-first" to accomplish that. However, if the block RAM doesn't support that, you can add a small wrapper to check the WE signal, and return the write data on the read port when it's asserted (no need to compare addresses).

Quote:

I found with the Lattice ICE40, one port is a read port, and the other port is a write port. If you violate this, you end up with two copies of the RAM. And I think some parts include large 16Kx16 embedded single port RAM (SPRAM), which would be nice to use.

Dedicated read/write ports are not a problem. That's all I need.

Quote:

At the very least, consider a compile time switch to use just a single port.

Unfortunately, that's not possible, since the core design will never work correctly with single port synchronous memory. There are several ways to mitigate this limitation, such as using separate memories for code areas, or adding a wait state when doing read after write, or using a phase shifted clock to emulate asynchronous memory. All these different options can be done outside the core itself.

Note that the extra wait state is most likely compensated by the overall reduced cycle count that comes from the simpler memory system. (Also the shorter paths in the new core, which hopefully should result in higher fmax)

Arlet · Post by **Arlet** » Fri Oct 23, 2020 9:27 am

Note that the code/data areas in memory don't have to be strict if you combine it with the wait states. If you have different memory areas, implemented by separate RAM blocks, you can write to one block, while reading another. The extra wait state would only be incurred when doing write followed by read on the same block.

However, this does mean that constructing large memories out of smaller blocks requires an extra MUX/OR to combine data outputs. You can't use the bit slice technique.
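As an illustration of that output MUX, here is a minimal read-only sketch with two banks. All names and sizes are assumptions. The bank select is registered so that it lines up with the one-cycle-delayed read data. The write ports and the wait-state logic are omitted.

```verilog
// Hypothetical two-bank read path: names and sizes are assumptions.
module bank_mux #(
    parameter AW = 11                 // address bits per bank
) (
    input  wire        clk,
    input  wire [AW:0] AD,            // early address; top bit selects the bank
    output wire [7:0]  DI             // combined read data to the core
);
    reg [7:0] bank0 [0:(1<<AW)-1];
    reg [7:0] bank1 [0:(1<<AW)-1];
    reg [7:0] q0, q1;                 // per-bank registered read data
    reg       sel;                    // registered bank select

    always @(posedge clk) begin
        q0  <= bank0[AD[AW-1:0]];
        q1  <= bank1[AD[AW-1:0]];
        sel <= AD[AW];                // same delay as q0/q1
    end

    // Extra MUX on the data outputs, as described above.
    assign DI = sel ? q1 : q0;
endmodule
```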

65f02 · Post by **65f02** » Fri Oct 23, 2020 3:06 pm

Wow! Didn't stop by for a few days, and presto, there's a new 65C02 core taking shape!

I am quite interested indeed to use this in my 65F02 plug-in CPU, where your original core is doing a great job. Any prediction regarding the fmax one can expect for the new core running in a Spartan-6, compared to your earlier design?

Arlet · Post by **Arlet** » Fri Oct 23, 2020 3:33 pm

I will do some fmax experiments later today or tomorrow. Right now, it says around 85 MHz for the async model without memories, but obviously that does not include going off chip to external SRAM, which will have a huge impact.

I will hook up internal block RAM using the method above instead, to get some results we can compare to the old core.

Whatever the results are, I expect to be able to improve them with manual instantiation/placement, which is planned for 2nd phase.

Arlet · Post by **Arlet** » Fri Oct 23, 2020 4:59 pm

Keep in mind that you'll lose 2kB of memory with this core, due to microcode claiming one of the block RAMs.

Arlet · Post by **Arlet** » Fri Oct 23, 2020 6:57 pm

Hooking up the core to a small ROM, using XC6SLX9-3, I can get it to pass 116 MHz.
