My new verilog 65C02 core.

Arlet · Post by **Arlet** » Fri Oct 23, 2020 7:34 pm

Tracing a few of these names in the schematic viewer, it looks like there are a few unnecessary logic layers, because the tools aren't clever enough. I'm curious what some properly designed manual blocks will do instead. And it's not just the number of logic layers, it's also all the unnecessary routing between them, and the stretched out placement.

Arlet · Post by **Arlet** » Sat Oct 24, 2020 11:48 am

Support for BCD has been added, and passes Klaus Dormann's test.

Next I'm going to try manual instantiation of the ALU, to see how much the tools' result can be improved, and whether there's a way to change the verilog source to direct them better.

Taking just the ALU, ISE 14.7 synthesis reports 12 slices with effort on "high" and optimized for area.

Arlet · Post by **Arlet** » Sat Oct 24, 2020 11:56 am

In fact, just isolating the "adder" portion:

Code: Select all

module adder(
    input CI,               // carry in
    input [7:0] R,          // input from register file
    input [7:0] M,          // input from memory
    input [2:0] op,         // 3-bit operation select
    output reg [8:0] adder  // data out
);

wire [7:0] N = ~M;

always @(*)
    case( op[2:0] )
        3'b000: adder =  R |  M     + CI;
        3'b001: adder =  R &  M     + CI;
        3'b010: adder =  R ^  M     + CI;
        3'b011: adder =  R +  M     + CI;
        3'b100: adder =  R +  8'h00 + CI;
        3'b101: adder =  R +  8'hff + CI;
        3'b110: adder =  R +  N     + CI;
        3'b111: adder = ~R &  M     + CI;
    endcase
endmodule

Already requires 11 slices. I will try that part first.

Arlet · Post by **Arlet** » Sat Oct 24, 2020 1:12 pm

Okay, that was my mistake. I wasn't thinking about the operator precedence in verilog. The "+ CI" apparently binds tighter than the logic operators, making a mess of my plans. I never noticed this, because CI is always 0 when doing logic operations.

Rewriting it with parentheses, like (R|M) + CI reduces area to 3 slices (probably really 2 slices + 1 LUT for the carry out), which is what I expected to get.
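The precedence issue described above can be reproduced in a small simulation. This is only a minimal sketch: the module and wire names are made up for illustration, and it assumes a plain Verilog-2001 simulator.

```verilog
// Hypothetical testbench (names are illustrative, not from the core).
// In Verilog, binary + binds tighter than &, ^ and |.
module precedence_demo;
    reg  [7:0] R, M;
    reg        CI;

    wire [8:0] unparenthesized = R | M + CI;    // parsed as R | (M + CI)
    wire [8:0] parenthesized   = (R | M) + CI;  // the intended operation

    initial begin
        R = 8'h01; M = 8'h01;

        CI = 1'b0; #1;   // both expressions agree while CI is 0
        $display("CI=0: %h %h", unparenthesized, parenthesized);

        CI = 1'b1; #1;   // the results differ once CI is 1
        $display("CI=1: %h %h", unparenthesized, parenthesized);
    end
endmodule
```

With R = M = 8'h01 and CI = 1, the first expression evaluates to 9'h003 and the second to 9'h002. With CI = 0 both give 9'h001, which is why the problem stays hidden as long as CI is 0 during logic operations.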

65f02 · Post by **65f02** » Sat Oct 24, 2020 1:58 pm

Arlet wrote:

Keep in mind that you'll lose 2kB of memory with this core, due to microcode claiming one of the block RAMs.

Ah, right. That would be an unpleasant limitation in my case, since I use the on-chip block RAM exclusively in my 65F02, and for some host systems I do need very close to the 64 kByte available on the XC6SLX9.

Could the microcode also reside in distributed RAM? I would not mind the extra slices used, but an fmax penalty would be undesirable of course. The 116 MHz you got in your test is quite nice indeed!

Arlet · Post by **Arlet** » Sat Oct 24, 2020 2:04 pm

Yes, you can use distributed RAM, but it will require quite a bit, especially because the current design is rather wasteful with ROM usage.

I'm considering a follow up project without the ROM. I was thinking there may be some potential for speed increases, because the LUTs can be placed near the place where they are needed. I noticed a fairly large routing delay from the ROM.

65f02 · Post by **65f02** » Sat Oct 24, 2020 2:31 pm

"Speed increase" sounds even better!

I will definitely stay tuned.

dmsc · Post by **dmsc** » Sat Oct 24, 2020 2:49 pm

Hi Arlet!

Thank you for this new core!

I did some experiments with your new core, using iCE40 synthesis with the open-source Yosys. I had to change the microcode ROM, because the iCE40 architecture uses 4 kbit block RAMs, so the optimal microcode word is 32 bits wide (using 4 block RAMs). I also discovered that Yosys can't infer block RAMs for bits that are not used, so I changed the microcode definition to 31 bits. This is my patch:

Code: Select all

@@ -62,7 +62,7 @@ assign flags = control[7:0];
 assign alu_op = { ci, shift, adder };
 assign dp_op  = control[21:15];
 
-reg [35:0] microcode[511:0];
+reg [30:0] microcode[511:0];
 reg [35:0] control;
 
 /* 
@@ -115,7 +115,7 @@ always @(posedge clk)
  * load next control word from ROM
  */
 always @(posedge clk)
-    control <= microcode[pc];
+    control <= { 5'b0, microcode[pc] };

Obviously, this also requires removing the first 5 bits of each word from the microcode.hex file.

This is the usage report from synthesis:

Code: Select all

   Number of wires:                328
   Number of wire bits:           1434
   Number of public wires:         328
   Number of public wire bits:    1434
   Number of memories:               0
   Number of memory bits:            0
   Number of processes:              0
   Number of cells:                656
     SB_CARRY                      107
     SB_DFF                         18
     SB_DFFE                       103
     SB_DFFESR                       4
     SB_LUT4                       420
     SB_RAM40_4K                     4

This is about 40% fewer LUT4s than your original 6502 core, using 4 extra block RAMs.

This is the report for the original 6502 core:

Code: Select all

   Number of wires:                493
   Number of wire bits:           1620
   Number of public wires:         493
   Number of public wire bits:    1620
   Number of memories:               0
   Number of memory bits:            0
   Number of processes:              0
   Number of cells:                901
     SB_CARRY                       27
     SB_DFF                         10
     SB_DFFE                       114
     SB_DFFESR                       9
     SB_DFFESS                       4
     SB_DFFR                         5
     SB_DFFS                         1
     SB_LUT4                       731

After packing into an iCE40 UP5K, the usage is:

Code: Select all

Arlet 65C02:
------------------------------------------
Info: 	         ICESTORM_LC:   551/ 5280    10%
Info: 	        ICESTORM_RAM:     4/   30    13%
Info: 	               SB_GB:     3/    8    37%

Arlet 6502:
------------------------------------------
Info: 	         ICESTORM_LC:   829/ 5280    15%
Info: 	               SB_GB:     4/    8    50%

As for the speed, it is difficult to estimate without a full design. Simply routing all 38 pins to I/O pins on the FPGA gives about 5% more speed with the new design, so both are similar.

Have Fun!

Arlet · Post by **Arlet** » Sat Oct 24, 2020 3:04 pm

Ok, nice to know. I will reduce the width of the ROM. I added all bits at the start so that I could just use whatever I needed, but now that BCD has been added, I don't expect to need any additional bits.

By the way, with the fixed ALU parentheses, the ALU Z flag is no longer the longest path, and I was able to push fmax to 120 MHz.

Arlet · Post by **Arlet** » Sat Oct 24, 2020 6:20 pm

I've converted part of the ABL module into instantiated LUTs, because that was the longest path, and I saw an opportunity for improvement, especially around this code:

Code: Select all

always @(*)
    case( op[1:0] )
        2'b00: {CO, ADL} = base + 00  + CI;
        2'b01: {CO, ADL} = 9'hx;
        2'b10: {CO, ADL} = base + ABL + CI;
        2'b11: {CO, ADL} = base + REG + CI;
    endcase

Synthesis produced 16 LUTs, even though it can be done in 8. This is what manual instantiation looks like:

Code: Select all

generate for (i = 0; i < 8; i = i + 1 )
begin : adl_loop
LUT6_2 #(.INIT(64'h665aaaaa88a00000)) adl_lut(
    .O6(P[i]),
    .O5(G[i]),
    .I0(base[i]),
    .I1(REG[i]),
    .I2(ABL[i]),
    .I3(op[0]),
    .I4(op[1]),
    .I5(1'b1) );
end
endgenerate

This creates 8 LUTs, each with two outputs: carry propagate (P), and carry generate (G). The LUTs have 6 inputs, but since the carry chain needs 2 outputs, each LUT6 is split into a pair of LUT5s that share the same five inputs (I5 is tied high). O6 reads the upper 32 bits of INIT, and O5 reads the lower 32 bits. The function of the LUT is defined by the hex INIT string, which is a 64 bit string for the truth table. These signals then go into CARRY4 instances that represent the carry chain.
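To illustrate why propagate/generate signals are enough for the carry chain, here is a behavioural sketch that checks the scheme against a plain addition for the `op = 2'b10` case (base + ABL + CI). The testbench name and loop variables are hypothetical, and the per-bit mux/xor is only a model of what a CARRY4 stage does, not the primitive itself.

```verilog
// Hypothetical self-checking model of an 8-bit propagate/generate carry chain.
module carry_chain_check;
    reg  [7:0] base, ABL;
    reg        CI;
    reg  [7:0] P, G, sum;
    reg  [8:0] c;                 // c[0] = carry in, c[8] = carry out
    integer    a, b, k, i, errors;

    initial begin
        errors = 0;
        for (a = 0; a < 256; a = a + 1)
            for (b = 0; b < 256; b = b + 1)
                for (k = 0; k < 2; k = k + 1) begin
                    base = a; ABL = b; CI = k;
                    P = base ^ ABL;                   // propagate: what O6 feeds to S
                    G = base & ABL;                   // generate:  what O5 feeds to DI
                    c[0] = CI;
                    for (i = 0; i < 8; i = i + 1) begin
                        sum[i] = P[i] ^ c[i];         // sum bit = S xor carry
                        c[i+1] = P[i] ? c[i] : G[i];  // pass the carry, or take DI
                    end
                    if ({c[8], sum} !== base + ABL + CI)
                        errors = errors + 1;
                end
        $display("mismatches: %0d", errors);
    end
endmodule
```

When P is 0 the two operand bits are equal, so the carry out of that bit is simply either operand bit, which is what G supplies in that case.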

Making the truth table can be done by isolating a module, putting it in a test wrapper that cycles through all possible signals, and prints out the hex string. Then you can just copy & paste that in your own code.

Code: Select all

CARRY4 carry_l ( .CO(COL), .O(ADL[3:0]), .CI(CI),     .CYINIT(1'b0), .DI(G[3:0]), .S(P[3:0]) );
CARRY4 carry_h ( .CO(COH), .O(ADL[7:4]), .CI(COL[3]), .CYINIT(1'b0), .DI(G[7:4]), .S(P[7:4]) );
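The truth-table approach described above can be sketched as a small testbench. This is not the wrapper actually used for the core, just one possible version: the module name is made up, and the `op = 2'b01` don't-care case is filled with an arbitrary choice (P = base, G = 0).

```verilog
// Hypothetical INIT generator for the LUT6_2 shown earlier.
// Input order matches the instance: I0=base, I1=REG, I2=ABL, I3=op[0], I4=op[1].
module lut_init_gen;
    reg [63:0] init;
    reg        base, REG, ABL;
    reg [1:0]  op;
    reg        p, g;
    integer    idx;

    initial begin
        init = 64'h0;
        for (idx = 0; idx < 32; idx = idx + 1) begin
            {op[1], op[0], ABL, REG, base} = idx[4:0];
            case (op)
                2'b00: begin p = base;       g = 1'b0;       end  // base + 0
                2'b01: begin p = base;       g = 1'b0;       end  // don't care (assumed fill)
                2'b10: begin p = base ^ ABL; g = base & ABL; end  // base + ABL
                2'b11: begin p = base ^ REG; g = base & REG; end  // base + REG
            endcase
            init[32 + idx] = p;   // upper half: O6 (I5 tied to 1)
            init[idx]      = g;   // lower half: O5
        end
        $display("INIT = 64'h%h", init);
    end
endmodule
```

With the don't-care fill chosen here, the printed value is 64'h665aaaaa88a00000, the same string used in the instantiation.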

With these improvements, the total slice count is down to 49. I'm sure there's more that can be saved. The longest path is only a few logic levels, but I notice there's a lot of routing delay (more than half). The biggest delay is to the AB output pad. This means that if you want the fastest possible design, you also have to keep in mind which I/O pins you use, and make sure that related pins are close together, so that the logic can be mapped next to the pads.

In my board, I just mapped the pins where it was convenient for board layout.

dmsc · Post by **dmsc** » Sat Oct 24, 2020 6:41 pm

Hi!

Arlet wrote:

Ok, nice to know. I will reduce the width of the ROM. I added all bits at the start so that I could just use whatever I needed, but now that BCD has been added, I don't expect to need any additional bits.

Thank you!

Quote:

By the way, with the fixed ALU parentheses, the ALU Z flag is no longer the longest path, and I was able to push fmax to 120 MHz.

Sadly, in the LUT4 architecture of iCE40 the new ALU is bigger!

Best result is simply to only pass CI on the operations that use it:

Code: Select all

@@ -64,14 +64,14 @@ wire [7:0] NR = ~R;
 
 always @(*)
     case( op[2:0] )
-        3'b000: adder = (R | M)     + CI;
-        3'b001: adder = (R & M)     + CI;
-        3'b010: adder = (R ^ M)     + CI;
+        3'b000: adder = (R | M)    /* + CI */ ;
+        3'b001: adder = (R & M)    /* + CI */ ;
+        3'b010: adder = (R ^ M)    /* + CI */ ;
         3'b011: adder = (R + M)     + CI;
         3'b100: adder = (R + 8'h00) + CI;
         3'b101: adder = (R + 8'hff) + CI; 
         3'b110: adder = (R + NM)    + CI;
-        3'b111: adder = (M & NR)    + CI;
+        3'b111: adder = (M & NR)   /* + CI */ ;
     endcase
 
 /*

With this change, I got:

Code: Select all

   Number of wires:                333
   Number of wire bits:           1442
   Number of public wires:         333
   Number of public wire bits:    1442
   Number of memories:               0
   Number of memory bits:            0
   Number of processes:              0
   Number of cells:                655
     SB_CARRY                       99
     SB_DFF                         18
     SB_DFFE                       103
     SB_DFFESR                       4
     SB_LUT4                       427
     SB_RAM40_4K                     4

And:

Code: Select all

Info: 	         ICESTORM_LC:   559/ 5280    10%
Info: 	        ICESTORM_RAM:     4/   30    13%

Here the longest path goes from "C" to "Z", "N", "abl.base", "abh.CI", up to "ABH[7]". I don't really understand the first flags in the chain, but perhaps it is because of the BCC/BCS opcodes.

Have Fun!

Arlet · Post by **Arlet** » Sat Oct 24, 2020 7:04 pm

Yes, probably the branches. The branch target address is calculated in the ABL/ABH modules directly from the current address, so there's a path from the flags to ABL[0] all the way up the carry chain to ABH[7].

I can see where the +CI is not optimal for the LUT4 architectures, because the carry chain inputs are not so flexible.

Arlet · Post by **Arlet** » Sat Oct 24, 2020 7:12 pm

Arlet wrote:

In my board, I just mapped the pins where it was convenient for board layout.

I can see the reason now. I've highlighted the AB* pins in my board. Some are near the top of the FPGA, and some are at the bottom. This means that there's no good place for the logic to sit close to both.

Of course, when you're running at >100 MHz, you can't address the external bus that fast anyway. If you need to run fast, you can use the internal block RAMs for code, and add 1 or 2 wait states for external access, and then add a pipeline stage to the AB outputs. The problem was that I had configured my core for internal block RAM, but still had the AB connected (but not the Data Bus).

When I add a pipeline stage to external AB signals, I can get fmax to 150 MHz.
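A pipeline stage of this kind could look roughly like the sketch below. The module and port names are assumptions for illustration and are not taken from the core.

```verilog
// Hypothetical output register for the external address bus.
module ab_pipeline(
    input             clk,
    input      [15:0] AB,      // address straight from the core
    output reg [15:0] AB_ext   // registered copy that drives the pads
);

always @(posedge clk)
    AB_ext <= AB;

endmodule
```

The external address then appears one cycle later than the internal one, so the wait states for external access have to account for that extra cycle.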

BigEd · Post by **BigEd** » Sat Oct 24, 2020 7:56 pm

spectacular!

Arlet · Post by **Arlet** » Sat Oct 24, 2020 9:19 pm

Here's a screenshot of the floorplan. You can see the two block RAMs next to each other, one for the bootrom, and the other for the microcode.

The light blue elements are used for the design, but they appear rather scattered, so I would think there's plenty of room for improvements. For some bizarre reason, there are a handful of flip-flops placed even further away, with no logic nearby.
