My new verilog 65C02 core.
Re: My new verilog 65C02 core.
Tracing a few of these names in the schematic viewer, it looks like there are a few unnecessary logic layers, because the tools aren't clever enough. I'm curious what some properly designed manual blocks will do instead. And it's not just the number of logic layers, it's also all the unnecessary routing between them, and the stretched out placement.
Re: My new verilog 65C02 core.
Support for BCD has been added, and passes Klaus Dormann's test.
Next I'm going to try instantiating the ALU manually, to see how much the tools' result can be improved, and whether there's a way to change the Verilog source to direct them better.
Taking just the ALU, ISE 14.7 synthesis reports 12 slices with effort on "high" and optimized for area.
Re: My new verilog 65C02 core.
In fact, just isolating the "adder" portion:
Code: Select all
module adder(
    input CI,               // carry in
    input [7:0] R,          // input from register file
    input [7:0] M,          // input from memory
    input [2:0] op,         // 3-bit operation select
    output reg [8:0] adder  // data out
);
wire [7:0] N = ~M;
always @(*)
    case( op[2:0] )
        3'b000: adder = R | M + CI;
        3'b001: adder = R & M + CI;
        3'b010: adder = R ^ M + CI;
        3'b011: adder = R + M + CI;
        3'b100: adder = R + 8'h00 + CI;
        3'b101: adder = R + 8'hff + CI;
        3'b110: adder = R + N + CI;
        3'b111: adder = ~R & M + CI;
    endcase
endmodule
This already requires 11 slices. I will try that part first.
Re: My new verilog 65C02 core.
Okay, that was my mistake. I wasn't thinking about operator precedence in Verilog. The "+ CI" binds tighter than the bitwise logic operators, so R | M + CI is parsed as R | (M + CI), making a mess of my plans. I never noticed this, because CI is always 0 when doing logic operations.
Rewriting it with parentheses, like (R|M) + CI reduces area to 3 slices (probably really 2 slices + 1 LUT for the carry out), which is what I expected to get.
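To make the precedence issue concrete, here is a minimal simulation sketch. The module name and the test values are made up for illustration; only the two expressions come from the posts above.

```verilog
// Hypothetical demo module, not part of the core.
module precedence_demo;
    reg [7:0] R, M;
    reg       CI;
    reg [8:0] unparenthesized, parenthesized;

    initial begin
        R  = 8'h0f;
        M  = 8'hf0;
        CI = 1'b1;
        // '+' has higher precedence than '|', so this is R | (M + CI)
        unparenthesized = R | M + CI;
        // the intended operation: OR first, then add the carry
        parenthesized   = (R | M) + CI;
        $display("R | M + CI   = %h", unparenthesized);
        $display("(R | M) + CI = %h", parenthesized);
        $finish;
    end
endmodule
```

With these inputs the first expression evaluates to 9'h0ff and the second to 9'h100. With CI = 0 both forms give the same value, which is why the mistake went unnoticed in functional tests.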
Re: My new verilog 65C02 core.
Arlet wrote:
Keep in mind that you'll lose 2kB of memory with this core, due to microcode claiming one of the block RAMs.
Could the microcode also reside in distributed RAM? I would not mind the extra slices used, but an fmax penalty would be undesirable of course. The 116 MHz you got in your test is quite nice indeed!
Re: My new verilog 65C02 core.
Yes, you can use distributed RAM, but it will require quite a bit, especially because the current design is rather wasteful with ROM usage.
I'm considering a follow up project without the ROM. I was thinking there may be some potential for speed increases, because the LUTs can be placed near the place where they are needed. I noticed a fairly large routing delay from the ROM.
Re: My new verilog 65C02 core.
"Speed increase" sounds even better!
I will definitely stay tuned.
Re: My new verilog 65C02 core.
Hi Arlet!
Thank you for this new core!
I did some experiments with your new core, using the iCE40 synthesis flow of the open-source Yosys. I had to change the microcode ROM, as the iCE40 architecture uses 4 kbit block RAMs, so the optimal size is a 32-bit microcode word (using 4 block RAMs). Also, I discovered that Yosys can't infer block RAMs for bits that are not used, so I changed the microcode definition to 31 bits ([30:0]). This is my patch:
Code: Select all
@@ -62,7 +62,7 @@ assign flags = control[7:0];
 assign alu_op = { ci, shift, adder };
 assign dp_op = control[21:15];
-reg [35:0] microcode[511:0];
+reg [30:0] microcode[511:0];
 reg [35:0] control;
 /*
@@ -115,7 +115,7 @@ always @(posedge clk)
  * load next control word from ROM
  */
 always @(posedge clk)
-    control <= microcode[pc];
+    control <= { 5'b0, microcode[pc] };
Obviously, this needs removing the first 5 bits from the microcode.hex file.
This is the usage report from synthesis:
Code: Select all
Number of wires: 328
Number of wire bits: 1434
Number of public wires: 328
Number of public wire bits: 1434
Number of memories: 0
Number of memory bits: 0
Number of processes: 0
Number of cells: 656
  SB_CARRY 107
  SB_DFF 18
  SB_DFFE 103
  SB_DFFESR 4
  SB_LUT4 420
  SB_RAM40_4K 4
This is about 40% fewer LUT4s than your original 6502 core, using 4 extra block RAMs.
This is the report for the original 6502 core:
Code: Select all
Number of wires: 493
Number of wire bits: 1620
Number of public wires: 493
Number of public wire bits: 1620
Number of memories: 0
Number of memory bits: 0
Number of processes: 0
Number of cells: 901
  SB_CARRY 27
  SB_DFF 10
  SB_DFFE 114
  SB_DFFESR 9
  SB_DFFESS 4
  SB_DFFR 5
  SB_DFFS 1
  SB_LUT4 731
After packing into an iCE40 up5k, the usages are:
Code: Select all
Arlet 65C02:
------------------------------------------
Info: ICESTORM_LC: 551/ 5280 10%
Info: ICESTORM_RAM: 4/ 30 13%
Info: SB_GB: 3/ 8 37%
Arlet 6502:
------------------------------------------
Info: ICESTORM_LC: 829/ 5280 15%
Info: SB_GB: 4/ 8 50%
As for speed, it is difficult to estimate without a full design. Simply routing all 38 pins to I/O pins in the FPGA gives about 5% more speed in the new design, so both are similar.
Have Fun!
Re: My new verilog 65C02 core.
Ok, nice to know. I will reduce the width of the ROM. I added all bits at the start so that I could just use whatever I needed, but now that BCD has been added, I don't expect to need any additional bits.
By the way, with the fixed ALU parentheses, the ALU Z flag is no longer the longest path, and I was able to push fmax to 120 MHz.
Re: My new verilog 65C02 core.
I've converted part of the ABL module into instantiated LUTs, because that was the longest path, and I saw an opportunity for improvement, especially around this code:
Code: Select all
always @(*)
    case( op[1:0] )
        2'b00: {CO, ADL} = base + 00 + CI;
        2'b01: {CO, ADL} = 9'hx;
        2'b10: {CO, ADL} = base + ABL + CI;
        2'b11: {CO, ADL} = base + REG + CI;
    endcase
Synthesis produced 16 LUTs, even though it can be done in 8. This is what manual instantiation looks like:
Code: Select all
generate for (i = 0; i < 8; i = i + 1 )
begin : adl_loop
    LUT6_2 #(.INIT(64'h665aaaaa88a00000)) adl_lut(
        .O6(P[i]),
        .O5(G[i]),
        .I0(base[i]),
        .I1(REG[i]),
        .I2(ABL[i]),
        .I3(op[0]),
        .I4(op[1]),
        .I5(1'b1) );
end
endgenerate
This creates 8 LUTs, each with two outputs: carry propagate (P) and carry generate (G). The LUTs have 6 inputs, but since you need 2 outputs for the carry chain, each LUT is divided into a pair of LUT5s. The function of the LUT is defined by the hex INIT string, which is a 64-bit truth table. These signals then go into CARRY4 instances that represent the carry chain.
Making the truth table can be done by isolating a module, putting it in a test wrapper that cycles through all possible input combinations, and printing out the hex string. Then you can just copy & paste that into your own code.
Code: Select all
CARRY4 carry_l ( .CO(COL), .O(ADL[3:0]), .CI(CI), .CYINIT(1'b0), .DI(G[3:0]), .S(P[3:0]) );
CARRY4 carry_h ( .CO(COH), .O(ADL[7:4]), .CI(COL[3]), .CYINIT(1'b0), .DI(G[7:4]), .S(P[7:4]) );
With these improvements, the total slice count is down to 49. I'm sure there's more that can be saved. The longest path is only a few logic levels, but I notice there's a lot of routing delay (more than half). The biggest delay is to the AB output pad. This means that if you want the fastest possible design, you also have to keep in mind which I/O pins you use, and make sure that related pins are close together, so that the logic can be mapped next to the pads.
In my board, I just mapped the pins where it was convenient for board layout.
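The truth-table extraction described in this post could be sketched as a small testbench like the one below. It is not the author's actual wrapper. The module name is hypothetical, and it assumes the don't-care case (op = 2'b01) is filled with P = base and G = 0. The LUT6_2 input mapping (I0 = base, I1 = REG, I2 = ABL, I3 = op[0], I4 = op[1], I5 tied high) is taken from the instantiation above. With I5 high, O6 reads INIT[63:32] and O5 reads INIT[31:0].

```verilog
// Hypothetical test wrapper: builds the 64-bit INIT string for one ADL bit.
module adl_init_tb;
    reg [63:0] init;
    integer    idx;
    reg        base, r, abl;   // r stands for one bit of REG
    reg [1:0]  op;
    reg        p, g;

    initial begin
        init = 64'h0;
        // idx enumerates {I4, I3, I2, I1, I0}
        for (idx = 0; idx < 32; idx = idx + 1) begin
            base = idx[0];
            r    = idx[1];
            abl  = idx[2];
            op   = {idx[4], idx[3]};
            case (op)
                2'b10:   begin p = base ^ abl; g = base & abl; end // base + ABL
                2'b11:   begin p = base ^ r;   g = base & r;   end // base + REG
                default: begin p = base;       g = 1'b0;       end // base + 0, and the assumed don't-care fill
            endcase
            init[idx]      = g;  // lower half drives O5 (generate)
            init[idx + 32] = p;  // upper half drives O6 (propagate), selected by I5 = 1
        end
        $display("64'h%h", init);
        $finish;
    end
endmodule
```

Under these assumptions the printed value works out to 64'h665aaaaa88a00000, matching the INIT string in the instantiation. A different choice for the don't-care case would change only bits 8 to 15 of each half.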
Re: My new verilog 65C02 core.
Hi!
Arlet wrote:
Ok, nice to know. I will reduce the width of the ROM. I added all bits at the start so that I could just use whatever I needed, but now that BCD has been added, I don't expect to need any additional bits.
Thank you!
Quote:
By the way, with the fixed ALU parentheses, the ALU Z flag is no longer the longest path, and I was able to push fmax to 120 MHz.
Sadly, in the LUT4 architecture of iCE40 the new ALU is bigger!
The best result comes from simply passing CI only to the operations that use it:
Code: Select all
@@ -64,14 +64,14 @@ wire [7:0] NR = ~R;
 always @(*)
     case( op[2:0] )
-        3'b000: adder = (R | M) + CI;
-        3'b001: adder = (R & M) + CI;
-        3'b010: adder = (R ^ M) + CI;
+        3'b000: adder = (R | M) /* + CI */ ;
+        3'b001: adder = (R & M) /* + CI */ ;
+        3'b010: adder = (R ^ M) /* + CI */ ;
         3'b011: adder = (R + M) + CI;
         3'b100: adder = (R + 8'h00) + CI;
         3'b101: adder = (R + 8'hff) + CI;
         3'b110: adder = (R + NM) + CI;
-        3'b111: adder = (M & NR) + CI;
+        3'b111: adder = (M & NR) /* + CI */ ;
     endcase
 /*
With this change, I got:
Code: Select all
Number of wires: 333
Number of wire bits: 1442
Number of public wires: 333
Number of public wire bits: 1442
Number of memories: 0
Number of memory bits: 0
Number of processes: 0
Number of cells: 655
  SB_CARRY 99
  SB_DFF 18
  SB_DFFE 103
  SB_DFFESR 4
  SB_LUT4 427
  SB_RAM40_4K 4
And:
Code: Select all
Info: ICESTORM_LC: 559/ 5280 10%
Info: ICESTORM_RAM: 4/ 30 13%
Here the longest path goes from "C" to "Z", "N", "abl.base", "abh.CI", up to "ABH[7]". I don't really understand the first flags in the chain, but perhaps it is because of the BCC/BCS opcodes.
Have Fun!
Re: My new verilog 65C02 core.
Yes, probably the branches. The branch target address is calculated in the ABL/ABH modules directly from the current address, so there's a path from the flags to ABL[0] and all the way up the carry chain to ABH[7].
I can see where the +CI is not optimal for the LUT4 architectures, because the carry chain inputs are not so flexible.
Re: My new verilog 65C02 core.
Arlet wrote:
In my board, I just mapped the pins where it was convenient for board layout.
Of course, when you're running at >100 MHz, you can't address the external bus that fast anyway. If you need to run fast, you can use the internal block RAMs for code, and add 1 or 2 wait states for external access, and then add a pipeline stage to the AB outputs. The problem was that I had configured my core for internal block RAM, but still had the AB connected (but not the Data Bus).
When I add a pipeline stage to external AB signals, I can get fmax to 150 MHz.
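A pipeline stage on the external address bus could look roughly like the sketch below. The module and the AB_ext name are hypothetical, and the post does not show how the stage was actually added. The idea is only that the pads are driven from flip-flops, so the long route to the I/O pins no longer sits in the core's critical path.

```verilog
// Hypothetical wrapper: registers the address before it reaches the pads.
module ab_pipeline(
    input             clk,
    input      [15:0] AB,      // address from the core (combinational)
    output reg [15:0] AB_ext   // registered copy that drives the external pins
);

always @(posedge clk)
    AB_ext <= AB;              // external bus sees the address one clock later

endmodule
```

Because AB_ext lags AB by one cycle, external accesses would need the wait states mentioned above, while code running from internal block RAM keeps using AB directly.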
Re: My new verilog 65C02 core.
spectacular!
Re: My new verilog 65C02 core.
Here's a screenshot of the floorplan. You can see the two block RAMs next to each other, one for the bootrom, and the other for the microcode.
The light blue elements are used for the design, but they appear rather scattered, so I would think there's plenty of room for improvements. For some bizarre reason, there are a handful of flip-flops placed even further away, with no logic nearby.