My new verilog 65C02 core.
Re: My new verilog 65C02 core.
Hmm... I am using one write port to connect the external data bus at the moment, for filling the RAM during initialization only. Hopefully that can be arranged to co-exist with the fast core's read access on the same BRAM port, leaving the other port for the core's write access.
But I must be missing something regarding the reset option of the BRAMs. The reset only acts on the optional output registers, doesn't it? And switching in the output registers adds one additional latency cycle before data from the RAM reaches the output -- how can your core accommodate that?
Re: My new verilog 65C02 core.
If you're not using the optional output register, the reset works directly on the latch value. For the write port, it's possible to add a mux and allow outside access while the CPU is held in reset, or while the core is busy accessing other memory.
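A minimal sketch of that write-port mux, with hypothetical signal names (none of these identifiers are from the actual core): while reset is asserted, the external loader owns the BRAM write port; otherwise the core does.

```verilog
// Sketch only: signal names are illustrative, not the core's actual ones.
// While 'reset' is held, the external loader drives the BRAM write port.
module wport_mux(
    input         reset,       // CPU held in reset during RAM fill
    input  [15:0] ext_addr,    // external loader address
    input  [7:0]  ext_data,
    input         ext_we,
    input  [15:0] cpu_addr,    // core write address
    input  [7:0]  cpu_data,
    input         cpu_we,
    output [15:0] ram_addr,    // to BRAM write port
    output [7:0]  ram_data,
    output        ram_we );

assign ram_addr = reset ? ext_addr : cpu_addr;
assign ram_data = reset ? ext_data : cpu_data;
assign ram_we   = reset ? ext_we   : cpu_we;

endmodule
```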
Re: My new verilog 65C02 core.
Thank you, Arlet -- I had indeed overlooked that. I had thought about the "OR gate instead of a mux" option a couple of months ago, and concluded that it would not be supported in non-registered BRAM mode... I'll give that a try; should give me slightly faster timing with the existing core too.
Re: My new verilog 65C02 core.
I've instantiated the register file and the microcode ROM with Spartan-6 primitives. The next step I want to try is to make a mini top-level project that only has the microcode ROM, the 6502 code ROM, the register file, and the lower address bus logic, all as instantiated blocks, connected together properly. There are no dependencies on the rest of the design, so I will be able to see how this part works in isolation, and experiment with manual placement, without having to worry about everything at the same time.
Re: My new verilog 65C02 core.
Well, that has been enlightening and rather sobering.
Even though you can get manual placement to look nicer, the actual gains are rather modest. Even with short connections, the routing delay still takes nearly 50% of the total path. A problem is that there are a lot of connections between modules, and you can't place everything right next to everything else.
One lesson learned is that using more parallel logic to get speed improvements (like the ABL mux from before) looks good on paper, but in practice you see that all these bigger blocks need to sit physically further away from the sites they are connecting to. It may still be faster in the end, but you have to realize the trade off. And even if it is faster on one run, that doesn't mean that it will be faster when the design is extended, and these bigger blocks force other logic away from their ideal placement.
Another lesson is that some resources I've been using, such as the F7MUX, LUT memories, and carry chains, are not present in all slices. The memory is the most restrictive, only present in 25% of the slices. The F7MUX and carry chains are only present in 50%. This means that if you connect a memory, an F7MUX and a carry chain together (like I have), you have to skip slices that don't have these resources, creating bigger distances. It may be preferable in some cases to use simpler resources, even at the penalty of slower logic, if that means you can get better density.
Re: My new verilog 65C02 core.
Here's placement and timing info for about the best I can get, just for the path that involves the address bus.
It's just under 6 ns, compared to about 6.6 ns for automatic placement. Since this is only part of the address bus, it can only get worse from here.
Another thing I learned: even though there are 8 LUT5s in a slice, and I had 8 of them in part of the design, I could not put all 8 in the same slice, due to restrictions on the total number of inputs into a slice. So, yes, you can put 8 LUTs in a slice, but only if they share a sufficient number of inputs.
But if routing restrictions only let you put 4 in a slice, you might as well use the 6th input.
Big orange thing on the left is microcode ROM. Going to the right, there's the register file, the address mux and the address offset adder.
Re: My new verilog 65C02 core.
Some observations: the 8 bit carry chain only takes about 0.7 ns, whereas the microcode ROM has close to 3 ns of routing+logic delay.
Replacing microcode ROM with distributed logic may be advantageous.
Another interesting observation is that the XC6SLX4 version of the FPGA seems identical in layout to the XC6SLX9, except that half the slices are missing. The remaining slices are still in the same place. Maybe they use the same die and just cripple it, or maybe they look for SLX9s with defects in the right places that can still be sold as SLX4s.
A consequence is that the SLX4 is likely slower than the SLX9, even with the same speed rating, simply because routing distances increase.
Re: My new verilog 65C02 core.
Hi Arlet,
I just wanted to say I'm finding this thread absolutely fascinating.
Please do keep posting updates as you progress.
Dave
Re: My new verilog 65C02 core.
Ok, revisiting the ABL mux logic again, because I did not like the placement/routing consequences of the F7MUX. First of all, it requires 4 slices for 8 bits, which takes up a huge amount of space. Secondly, because the F7MUX is only present in the SLICEL/SLICEM types, it reduces placement options. To make this issue worse, the register file needs to be close as well, and it uses 2 SLICEMs. The ABL adders require carry logic, which means another 2 SLICEM/SLICEL (see table at the bottom showing the SLICE features)
Looking at the ABL mux again, we have 3 input signals that are absolutely required: DB (databus), AHL (address hold), and PCL (program counter). Ruling out 7-input logic, that means we have 3 inputs available for mux selection. I would like to use one of them for the 'cond' signal, which indicates whether a branch is taken or not. That leaves 2 inputs. The ABL mux needs 4 different outputs:
- constant 0
- DB
- AHL
- PCL

So, here's the plan for a mux with 4 choices:
- constant 0.
- DB / constant 0, depending on 'cond'.
- AHL
- PCL
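The plan above can be sketched as follows. This is only an illustration with guessed signal names, not the core's actual code: two microcode bits plus 'cond' select the output, which keeps each bit within a single LUT6 (3 data inputs plus 3 control inputs).

```verilog
// Sketch of the 4-way ABL input mux described above; identifiers are
// illustrative guesses. Two microcode bits 'sel' plus the 'cond'
// branch signal pick one of: 0, DB-if-branch-taken, AHL, or PCL.
module abl_mux(
    input  [1:0]     sel,    // 2 bits straight from microcode
    input            cond,   // 1 = branch taken
    input  [7:0]     DB,     // data bus
    input  [7:0]     AHL,    // address hold register, low byte
    input  [7:0]     PCL,    // program counter, low byte
    output reg [7:0] ABL );  // address bus, low byte

always @(*)
    case( sel )
        2'b00: ABL = 8'h00;             // constant 0
        2'b01: ABL = cond ? DB : 8'h00; // DB / constant 0, per 'cond'
        2'b10: ABL = AHL;
        2'b11: ABL = PCL;
    endcase

endmodule
```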
Re: My new verilog 65C02 core.
Code has been updated. Two bits go straight from the microcode into the ABL mux, joined by the upgraded 'cond' signal.
Good news is that the new path meets the timing. Also, the control logic has been simplified a bit. The three different branch cases have been combined into a single one.
Bad news is that various ALU-related paths seem to have taken a hit.
Re: My new verilog 65C02 core.
Looks like the BCD adjustment logic is the culprit. If I disable BCD I get down to 6.39 ns (without placement hints for the address bus paths, so that's encouraging).
Re: My new verilog 65C02 core.
Arlet wrote:
Some observations: the 8 bit carry chain only takes about 0.7 ns, whereas the microcode ROM has close to 3 ns of routing+logic delay.
Replacing microcode ROM with distributed logic may be advantageous.
Re: My new verilog 65C02 core.
One of the problems is that the BCD correction uses the 'M' register. Normally, the M register holds the previous value from the Data Bus. This goes into the ALU, and then the ALU calculates the result between 'R' (from the register file) and 'M'. BCD correction is done in an additional cycle, where 'M' is loaded with the adjustment term, and then A <= A + M (or A - M) is performed.
The problem is that the 'M' register performs double duty as the 'IR' register, used both for checking the branch condition and for setting flags. For instance, in the case of CLD/SED, the D flag is loaded from M[5].
In addition, 'M' also holds the flags during PLP/RTI instructions. As the flags show up on DB, they are loaded into 'M', which then holds them for 1 or more cycles until the instruction is finished, and the flags are updated.
This means that there's a lot of logic that goes into some of the bits of 'M'. I am thinking that it may be wise to split up the different functions. The bits that go into the ALU don't need to worry about the PLP/RTI details, or the branch condition. By duplicating the flip-flops and giving them a more dedicated purpose, this may also help to relieve some mapping/routing issues. The bits that go into the ALU could be placed near the ALU. Flops are probably the cheapest resource on the FPGA.
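A sketch of that split, again with hypothetical names ('M_alu', 'M_ctl', 'load_m' are not the core's identifiers): both registers capture DB in parallel, but each one only feeds its own downstream logic, so the placer can put them where they belong.

```verilog
// Sketch of the split suggested above (hypothetical names): duplicate
// the DB capture into two registers with dedicated roles, so the ALU
// operand flops can sit next to the ALU while the IR/flag bits stay
// near the control logic. The only cost is extra flip-flops.
reg [7:0] M_alu;    // ALU operand copy: only feeds the adder/BCD logic
reg [7:0] M_ctl;    // IR/flags copy: branch condition, CLD/SED, PLP/RTI

always @(posedge clk)
    if( load_m ) begin      // 'load_m' = hypothetical DB capture enable
        M_alu <= DB;
        M_ctl <= DB;
    end
```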
Re: My new verilog 65C02 core.
One thing that I found with manual placement is that the best location is not always obvious. A method that seems to work reasonably well is to leave all the placement up to the tools, preferably with SmartXplorer. See what the critical paths are, and then check the floorplan. Try to improve placement of one or two blocks, and then run the tools again. Try this a few times with slightly different positions to see what works best. You may be able to realize some decent gains by simply locking 2 or 3 blocks in a certain position.
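For example, locking a couple of blocks in ISE is done with UCF constraints along these lines (instance names and site coordinates here are made up, not from the actual design):

```
# UCF fragment, hypothetical names: pin the microcode ROM to one BRAM
# site, and keep the register file inside a small area group, leaving
# everything else up to the placer.
INST "cpu/microcode" LOC = RAMB16_X0Y4 ;
INST "cpu/regfile"   AREA_GROUP = "AG_regs" ;
AREA_GROUP "AG_regs" RANGE = SLICE_X8Y16:SLICE_X11Y23 ;
```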
Re: My new verilog 65C02 core.
Sounds like a good approach - and you end up with a light touch, which hopefully doesn't overconstrain things.
It feels likely to me that the placer would be optimised for packing in a fairly full design, and not doing too bad a job on the speed, rather than making the ultimately fastest placement of a small design in a large FPGA. (And when I say optimised, I mean that in quite a weak sense: I mean it's likely to have been the developer focus.)