My new verilog 65C02 core.

Arlet
Posts: 2353
Joined: 16 Nov 2010
Location: Gouda, The Netherlands

Re: My new verilog 65C02 core.

Post by Arlet »

Tracing a few of these names in the schematic viewer, it looks like there are a few unnecessary logic layers, because the tools aren't clever enough. I'm curious what some properly designed manual blocks will do instead. And it's not just the number of logic layers, it's also all the unnecessary routing between them, and the stretched out placement.

Post by Arlet »

Support for BCD has been added, and passes Klaus Dormann's test.

Next I'm going to try manual instantiation of the ALU, to see how much the tools' output can be improved on, and whether there's a way to change the verilog source to direct them better.

Taking just the ALU, ISE 14.7 synthesis reports 12 slices with effort on "high" and optimized for area.

Post by Arlet »

In fact, just isolating the "adder" portion:

Code: Select all

module adder(
    input CI,               // carry in
    input [7:0] R,          // input from register file
    input [7:0] M,          // input from memory
    input [2:0] op,         // 3-bit operation select
    output reg [8:0] adder  // data out
);

wire [7:0] N = ~M;

always @(*)
    case( op[2:0] )
        3'b000: adder =  R |  M     + CI;
        3'b001: adder =  R &  M     + CI;
        3'b010: adder =  R ^  M     + CI;
        3'b011: adder =  R +  M     + CI;
        3'b100: adder =  R +  8'h00 + CI;
        3'b101: adder =  R +  8'hff + CI;
        3'b110: adder =  R +  N     + CI;
        3'b111: adder = ~R &  M     + CI;
    endcase
endmodule
Already requires 11 slices. I will try that part first.

Post by Arlet »

Okay, that was my mistake. I wasn't thinking about the operator precedence in verilog. The "+ CI" apparently binds tighter than the logic operators, making a mess of my plans. I never noticed this, because CI is always 0 when doing logic operations.

Rewriting it with parentheses, like (R|M) + CI reduces area to 3 slices (probably really 2 slices + 1 LUT for the carry out), which is what I expected to get.
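
The precedence issue can be reproduced in a few lines of simulation-only verilog. This is a minimal sketch, not code from the core: the module name and test values are made up for illustration, and CI is deliberately set to 1 to expose the difference (with CI = 0, as in the real logic operations, both forms agree).

```verilog
// Hypothetical demo module, for simulation only.
module precedence_demo;

reg [7:0] R, M;
reg       CI;
reg [8:0] unparenthesized;
reg [8:0] parenthesized;

initial begin
    R  = 8'h0f;
    M  = 8'hf0;
    CI = 1'b1;

    // '+' binds tighter than '|', so this is parsed as R | (M + CI)
    unparenthesized = R | M + CI;

    // the intended operation: logic first, then add the carry
    parenthesized   = (R | M) + CI;

    $display("R | M + CI   = %h", unparenthesized);
    $display("(R | M) + CI = %h", parenthesized);
end

endmodule
```

With these inputs the first expression evaluates to 9'h0ff and the second to 9'h100, so the two forms describe different hardware even though they match whenever CI is 0.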
65f02
Posts: 79
Joined: 01 Jul 2020
Location: Germany

Post by 65f02 »

Arlet wrote:
Keep in mind that you'll lose 2kB of memory with this core, due to microcode claiming one of the block RAMs.
Ah, right. That would be an unpleasant limitation in my case, since I use the on-chip block RAM exclusively in my 65F02, and for some host systems I do need very close to the 64 kByte available on the XC6SLX9.

Could the microcode also reside in distributed RAM? I would not mind the extra slices used, but an fmax penalty would be undesirable of course. The 116 MHz you got in your test is quite nice indeed!

Post by Arlet »

Yes, you can use distributed RAM, but it will require quite a bit, especially because the current design is rather wasteful with ROM usage.

I'm considering a follow-up project without the ROM. I was thinking there may be some potential for speed increases, because the LUTs can be placed near the place where they are needed. I noticed a fairly large routing delay from the ROM.
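
One way to request distributed RAM for the microcode under ISE is a synthesis attribute on the array. The following is only a sketch under assumptions: the module name, port list and 9-bit `pc` width are hypothetical (the width is inferred from the 512-entry array), loading via `$readmemh` is an assumption about how the ROM is initialized, and it relies on XST honoring its `rom_style` attribute for this array (depending on how the memory is inferred, `ram_style` may be the relevant one). Other tools use different attributes.

```verilog
// Hypothetical wrapper around the microcode ROM; not the core's actual code.
module microcode_rom(
    input             clk,
    input      [8:0]  pc,       // assumed width: 512 entries
    output reg [35:0] control
);

// Ask XST to build the ROM from LUTs instead of a block RAM.
(* rom_style = "distributed" *)
reg [35:0] microcode[511:0];

// Assumed initialization from the hex file mentioned in this thread.
initial
    $readmemh("microcode.hex", microcode);

// Registered read, same shape as the block RAM version.
always @(posedge clk)
    control <= microcode[pc];

endmodule
```

Whether this costs fmax depends on how the tools place the resulting LUTs, so it would need to be measured rather than assumed.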

Post by 65f02 »

"Speed increase" sounds even better! :D
I will definitely stay tuned.
dmsc
Posts: 153
Joined: 17 Sep 2018

Post by dmsc »

Hi Arlet!

Thank you for this new core!

I did some experiments with your new core, using the iCE40 synthesis flow of the open-source Yosys. I had to change the microcode ROM: the iCE40 architecture uses 4 kbit block RAMs, so the optimal width is a 32-bit microcode word (using 4 block RAMs). Also, I discovered that Yosys can't infer block RAMs for bits that are not used, so I changed the microcode definition to 31 bits ([30:0]). This is my patch:

Code: Select all

@@ -62,7 +62,7 @@ assign flags = control[7:0];
 assign alu_op = { ci, shift, adder };
 assign dp_op  = control[21:15];
 
-reg [35:0] microcode[511:0];
+reg [30:0] microcode[511:0];
 reg [35:0] control;
 
 /* 
@@ -115,7 +115,7 @@ always @(posedge clk)
  * load next control word from ROM
  */
 always @(posedge clk)
-    control <= microcode[pc];
+    control <= { 5'b0, microcode[pc] };
Obviously, this needs removing the first 5 bits from the microcode.hex file.

This is the usage report from synthesis:

Code: Select all

   Number of wires:                328
   Number of wire bits:           1434
   Number of public wires:         328
   Number of public wire bits:    1434
   Number of memories:               0
   Number of memory bits:            0
   Number of processes:              0
   Number of cells:                656
     SB_CARRY                      107
     SB_DFF                         18
     SB_DFFE                       103
     SB_DFFESR                       4
     SB_LUT4                       420
     SB_RAM40_4K                     4
This is about 40% fewer LUT4s than your original 6502 core, using 4 extra block RAMs.

This is the report for the original 6502 core:

Code: Select all

   Number of wires:                493
   Number of wire bits:           1620
   Number of public wires:         493
   Number of public wire bits:    1620
   Number of memories:               0
   Number of memory bits:            0
   Number of processes:              0
   Number of cells:                901
     SB_CARRY                       27
     SB_DFF                         10
     SB_DFFE                       114
     SB_DFFESR                       9
     SB_DFFESS                       4
     SB_DFFR                         5
     SB_DFFS                         1
     SB_LUT4                       731
After packing into an iCE40 up5k, the usages are:

Code: Select all

Arlet 65C02:
------------------------------------------
Info: 	         ICESTORM_LC:   551/ 5280    10%
Info: 	        ICESTORM_RAM:     4/   30    13%
Info: 	               SB_GB:     3/    8    37%

Arlet 6502:
------------------------------------------
Info: 	         ICESTORM_LC:   829/ 5280    15%
Info: 	               SB_GB:     4/    8    50%
As for speed, it is difficult to estimate without a full design. Simply routing all 38 pins to I/O pins on the FPGA gives about 5% more speed for the new design, so both are similar.

Have Fun!

Post by Arlet »

Ok, nice to know. I will reduce the width of the ROM. I added all bits at the start so that I could just use whatever I needed, but now that BCD has been added, I don't expect to need any additional bits.

By the way, with the fixed ALU parentheses, the ALU Z flag is no longer the longest path, and I was able to push fmax to 120 MHz.

Post by Arlet »

I've converted part of the ABL module into instantiated LUTs, because that was the longest path, and I saw an opportunity for improvement, especially around this code:

Code: Select all

always @(*)
    case( op[1:0] )
        2'b00: {CO, ADL} = base + 00  + CI;
        2'b01: {CO, ADL} = 9'hx;
        2'b10: {CO, ADL} = base + ABL + CI;
        2'b11: {CO, ADL} = base + REG + CI;
    endcase
Synthesis produced 16 LUTs, even though it can be done in 8. This is what manual instantiation looks like:

Code: Select all

generate for (i = 0; i < 8; i = i + 1 )
begin : adl_loop
LUT6_2 #(.INIT(64'h665aaaaa88a00000)) adl_lut(
    .O6(P[i]),
    .O5(G[i]),
    .I0(base[i]),
    .I1(REG[i]),
    .I2(ABL[i]),
    .I3(op[0]),
    .I4(op[1]),
    .I5(1'b1) );
end
endgenerate
This creates 8 LUTs, each with two outputs: carry propagate (P) and carry generate (G). The LUTs have 6 inputs, but since you need 2 outputs for the carry chain, each LUT is divided into a pair of LUT5s that share five inputs (I5 is tied high). The function of the LUT is defined by the hex INIT string, which is the 64-bit truth table: the upper 32 bits drive O6 and the lower 32 bits drive O5. These signals then go into CARRY4 instances that represent the carry chain.

Making the truth table can be done by isolating a module, putting it in a test wrapper that cycles through all possible input combinations, and prints out the hex string. Then you can just copy & paste that in your own code.
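
The test-wrapper idea can be sketched as follows. This is not the author's actual wrapper: instead of wrapping the synthesized module it evaluates the intended function directly, assuming P = base ^ operand and G = base & operand, with the operand taken as 0 for the don't-care case op == 2'b01. The module and signal names are hypothetical.

```verilog
// Hypothetical generator for the LUT6_2 INIT string, simulation only.
module lut_init_gen;

reg [63:0] init;
reg        base, r, abl;   // I0, I1, I2
reg [1:0]  op;             // {I4, I3}
reg        operand;
integer    i;

initial begin
    for (i = 0; i < 64; i = i + 1) begin
        // decode the truth table index into the LUT inputs I4..I0
        {op[1], op[0], abl, r, base} = i[4:0];

        // operand select, mirroring the case statement on op[1:0]
        case (op)
            2'b10:   operand = abl;
            2'b11:   operand = r;
            default: operand = 1'b0;   // 2'b00, and assumed for 2'b01
        endcase

        // upper half (I5 = 1) is O6 = propagate,
        // lower half (I5 = 0) is O5 = generate
        init[i] = i[5] ? (base ^ operand) : (base & operand);
    end

    $display("64'h%h", init);
end

endmodule
```

Under these assumptions the simulation prints 64'h665aaaaa88a00000, matching the INIT value in the instantiation above. A different choice for the don't-care case would give a different, equally valid string.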

Code: Select all

CARRY4 carry_l ( .CO(COL), .O(ADL[3:0]), .CI(CI),     .CYINIT(1'b0), .DI(G[3:0]), .S(P[3:0]) );
CARRY4 carry_h ( .CO(COH), .O(ADL[7:4]), .CI(COL[3]), .CYINIT(1'b0), .DI(G[7:4]), .S(P[7:4]) );
With these improvements, the total slice count is down to 49. I'm sure there's more that can be saved. The longest path is only a few logic levels, but I notice there's a lot of routing delay (more than half). The biggest delay is to the AB output pad. This means that if you want the fastest possible design, you also have to keep in mind which I/O pins you use, and make sure that related pins are close together, so that the logic can be mapped next to the pads.

In my board, I just mapped the pins where it was convenient for board layout.
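
For simulation without the Xilinx primitives, the propagate/generate structure can be modeled behaviorally. This is a sketch under assumptions, not the core's code: module names are hypothetical, op == 2'b01 is treated as "add zero" (it is a don't-care in the original), and each carry stage is modeled as a mux (carry passes when P is set, otherwise G is selected) followed by an XOR, which is how a CARRY4 stage behaves.

```verilog
// Hypothetical behavioral model of the LUT + carry chain structure.
module adl_model(
    input        CI,
    input  [1:0] op,
    input  [7:0] base,
    input  [7:0] ABL,
    input  [7:0] REG,
    output [7:0] ADL,
    output       CO
);

wire [7:0] operand = (op == 2'b10) ? ABL :
                     (op == 2'b11) ? REG : 8'h00;
wire [7:0] P = base ^ operand;      // propagate, LUT output O6
wire [7:0] G = base & operand;      // generate,  LUT output O5
wire [8:0] C;                       // carry into each bit position

assign C[0] = CI;

genvar i;
generate for (i = 0; i < 8; i = i + 1)
begin : chain
    assign C[i+1] = P[i] ? C[i] : G[i];   // carry mux
    assign ADL[i] = P[i] ^ C[i];          // sum bit
end
endgenerate

assign CO = C[8];

endmodule

// Self-checking testbench: compares the model against plain addition.
module adl_model_tb;

reg        CI;
reg  [1:0] op;
reg  [7:0] base, ABL, REG;
wire [7:0] ADL;
wire       CO;
reg  [8:0] want;
integer    o, c, b, x, errors;

adl_model dut( .CI(CI), .op(op), .base(base), .ABL(ABL), .REG(REG),
               .ADL(ADL), .CO(CO) );

initial begin
    errors = 0;
    for (o = 0; o < 4; o = o + 1)
    for (c = 0; c < 2; c = c + 1)
    for (b = 0; b < 256; b = b + 1)
    for (x = 0; x < 256; x = x + 1) begin
        op = o; CI = c; base = b; ABL = x; REG = ~x;
        #1;
        case (op)
            2'b10:   want = base + ABL + CI;
            2'b11:   want = base + REG + CI;
            default: want = base + CI;
        endcase
        if ({CO, ADL} !== want)
            errors = errors + 1;
    end
    $display("mismatches: %0d", errors);
    $finish;
end

endmodule
```

The testbench sweeps every base value against every operand value for all four op codes and both carry-in values, so a zero mismatch count indicates the P/G formulation is equivalent to the adder it replaces.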
Attachments
path.png

Post by dmsc »

Hi!
Arlet wrote:
Ok, nice to know. I will reduce the width of the ROM. I added all bits at the start so that I could just use whatever I needed, but now that BCD has been added, I don't expect to need any additional bits.
Thank you!
Quote:
By the way, with the fixed ALU parentheses, the ALU Z flag is no longer the longest path, and I was able to push fmax to 120 MHz.
Sadly, in the LUT4 architecture of iCE40 the new ALU is bigger!

The best result comes from simply adding CI only in the operations that use it:

Code: Select all

@@ -64,14 +64,14 @@ wire [7:0] NR = ~R;
 
 always @(*)
     case( op[2:0] )
-        3'b000: adder = (R | M)     + CI;
-        3'b001: adder = (R & M)     + CI;
-        3'b010: adder = (R ^ M)     + CI;
+        3'b000: adder = (R | M)    /* + CI */ ;
+        3'b001: adder = (R & M)    /* + CI */ ;
+        3'b010: adder = (R ^ M)    /* + CI */ ;
         3'b011: adder = (R + M)     + CI;
         3'b100: adder = (R + 8'h00) + CI;
         3'b101: adder = (R + 8'hff) + CI; 
         3'b110: adder = (R + NM)    + CI;
-        3'b111: adder = (M & NR)    + CI;
+        3'b111: adder = (M & NR)   /* + CI */ ;
     endcase
 
 /*
With this change, I got:

Code: Select all

   Number of wires:                333
   Number of wire bits:           1442
   Number of public wires:         333
   Number of public wire bits:    1442
   Number of memories:               0
   Number of memory bits:            0
   Number of processes:              0
   Number of cells:                655
     SB_CARRY                       99
     SB_DFF                         18
     SB_DFFE                       103
     SB_DFFESR                       4
     SB_LUT4                       427
     SB_RAM40_4K                     4
And:

Code: Select all

Info: 	         ICESTORM_LC:   559/ 5280    10%
Info: 	        ICESTORM_RAM:     4/   30    13%
Here the longest path goes from "C" to "Z", "N", "abl.base", "abh.CI", up to "ABH[7]". I don't really understand the first flags in the chain, but perhaps it is because of the BCC/BCS opcodes.

Have Fun!

Post by Arlet »

Yes, probably the branches. The branch target address is calculated in the ABL/ABH modules directly from the current address, so there's a path from the flags to ABL[0] all the way up the carry chain to ABH[7].

I can see where the +CI is not optimal for the LUT4 architectures, because the carry chain inputs are not so flexible.

Post by Arlet »

Arlet wrote:
In my board, I just mapped the pins where it was convenient for board layout.
I can see the reason now. I've highlighted the AB* pins in my board. Some are near the top of the FPGA, and some are at the bottom. This means that there's no good place for the logic to sit close to both.

Of course, when you're running at >100 MHz, you can't address the external bus that fast anyway. If you need to run fast, you can use the internal block RAMs for code, and add 1 or 2 wait states for external access, and then add a pipeline stage to the AB outputs. The problem was that I had configured my core for internal block RAM, but still had the AB connected (but not the Data Bus).

When I add a pipeline stage to external AB signals, I can get fmax to 150 MHz.
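
A pipeline stage on the external address bus amounts to one register between the core and the pads. A minimal sketch, assuming a 16-bit address bus; the module and signal names are hypothetical and not taken from the core:

```verilog
// Hypothetical output register for the external address bus.
module ab_pipeline(
    input             clk,
    input      [15:0] AB,      // address from the core
    output reg [15:0] AB_ext   // registered copy, routed to the I/O pads
);

// The external bus sees the address one clock later than the core,
// which the wait states on external accesses have to absorb.
always @(posedge clk)
    AB_ext <= AB;

endmodule
```

Because the register can be placed close to the pads, the long route to scattered I/O pins no longer sits in the same clock period as the address calculation.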
Attachments
AB.png
BigEd
Posts: 11463
Joined: 11 Dec 2008
Location: England

Post by BigEd »

spectacular!

Post by Arlet »

Here's a screenshot of the floorplan. You can see the two block RAMs next to each other, one for the bootrom, and the other for the microcode.

The light blue elements are used for the design, but they appear rather scattered, so I would think there's plenty of room for improvements. For some bizarre reason, there are a handful of flip-flops placed even further away, with no logic nearby.
Attachments
scattered.png
floorplan.png