6502.org • View topic - My new verilog 65C02 core.

Board index » 6502.org Users Forum » Programmable Logic

All times are UTC

My new verilog 65C02 core.

Page 3 of 16

[ 232 posts ]

Arlet

Post subject: Re: My new verilog 65C02 core.

Posted: Fri Oct 23, 2020 7:34 pm

Joined: Tue Nov 16, 2010 8:00 am
Posts: 2353
Location: Gouda, The Netherlands

Tracing a few of these names in the schematic viewer, I can see a few unnecessary logic layers, because the tools aren't clever enough. I'm curious what some properly designed, manually instantiated blocks will do instead. And it's not just the number of logic layers; it's also all the unnecessary routing between them, and the stretched-out placement.

Arlet

Post subject: Re: My new verilog 65C02 core.

Posted: Sat Oct 24, 2020 11:48 am

Joined: Tue Nov 16, 2010 8:00 am
Posts: 2353
Location: Gouda, The Netherlands

Support for BCD has been added, and the core passes Klaus Dormann's test.

Next, I'm going to try manual instantiation of the ALU, to see how much the tools' result can be improved, and whether there's a way to change the Verilog source to direct them better.

Taking just the ALU, ISE 14.7 synthesis reports 12 slices with effort on "high" and optimized for area.

Arlet

Post subject: Re: My new verilog 65C02 core.

Posted: Sat Oct 24, 2020 11:56 am

Joined: Tue Nov 16, 2010 8:00 am
Posts: 2353
Location: Gouda, The Netherlands

In fact, just isolating the "adder" portion:

Code:

module adder(
    input CI,               // carry in
    input [7:0] R,          // input from register file
    input [7:0] M,          // input from memory
    input [2:0] op,         // 3-bit operation select
    output reg [8:0] adder  // data out (carry out + 8 data bits)
);

wire [7:0] N = ~M;

always @(*)
    case( op[2:0] )
        3'b000: adder =  R |  M     + CI;
        3'b001: adder =  R &  M     + CI;
        3'b010: adder =  R ^  M     + CI;
        3'b011: adder =  R +  M     + CI;
        3'b100: adder =  R +  8'h00 + CI;
        3'b101: adder =  R +  8'hff + CI;
        3'b110: adder =  R +  N     + CI;
        3'b111: adder = ~R &  M     + CI;
    endcase
endmodule

This part alone already requires 11 slices, so I will try it first.

Arlet

Post subject: Re: My new verilog 65C02 core.

Posted: Sat Oct 24, 2020 1:12 pm

Joined: Tue Nov 16, 2010 8:00 am
Posts: 2353
Location: Gouda, The Netherlands

Okay, that was my mistake. I wasn't thinking about operator precedence in Verilog: the + operator binds tighter than the bitwise |, & and ^ operators, so R | M + CI is parsed as R | (M + CI), which made a mess of my plans. I never noticed this, because CI is always 0 when doing logic operations.

Rewriting it with parentheses, as in (R | M) + CI, reduces the area to 3 slices (probably really 2 slices plus 1 LUT for the carry out), which is what I expected to get.
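Python happens to give + the same precedence over |, & and ^ that Verilog does, so the pitfall can be reproduced outside the synthesizer. This is only a behavioral sketch: MASK9 stands in for the 9-bit adder output, and the function names are made up for illustration.

```python
MASK9 = 0x1FF  # models the 9-bit 'adder' output (carry out + 8 data bits)


def unparenthesized(op, r, m, ci):
    """Mirrors the original case items: R | M + CI parses as R | (M + CI)."""
    if op == "or":
        return (r | m + ci) & MASK9
    if op == "and":
        return (r & m + ci) & MASK9
    return (r ^ m + ci) & MASK9


def parenthesized(op, r, m, ci):
    """The corrected form, e.g. (R | M) + CI."""
    if op == "or":
        return ((r | m) + ci) & MASK9
    if op == "and":
        return ((r & m) + ci) & MASK9
    return ((r ^ m) + ci) & MASK9


# With CI = 0, as it always is for logic operations, both forms agree...
for op in ("or", "and", "xor"):
    for r in range(256):
        for m in range(256):
            assert unparenthesized(op, r, m, 0) == parenthesized(op, r, m, 0)

# ...but with CI = 1 they do not:
print(hex(unparenthesized("or", 0x0F, 0xF0, 1)))  # 0xff  (0x0F | 0xF1)
print(hex(parenthesized("or", 0x0F, 0xF0, 1)))    # 0x100 (0xFF + 1)
```

Functionally the two only diverge when CI is 1, which is why the mistake went unnoticed, but the synthesizer still has to build hardware for the unparenthesized expressions.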

65f02

Post subject: Re: My new verilog 65C02 core.

Posted: Sat Oct 24, 2020 1:58 pm

Joined: Wed Jul 01, 2020 6:15 pm
Posts: 79
Location: Germany

Arlet wrote:

Keep in mind that you'll lose 2kB of memory with this core, due to microcode claiming one of the block RAMs.

Ah, right. That would be an unpleasant limitation in my case, since I use the on-chip block RAM exclusively in my 65F02, and for some host systems I do need very close to the 64 kByte available on the XC6SLX9.

Could the microcode also reside in distributed RAM? I would not mind the extra slices used, but an fmax penalty would of course be undesirable. The 116 MHz you got in your test is quite nice indeed!

Arlet

Post subject: Re: My new verilog 65C02 core.

Posted: Sat Oct 24, 2020 2:04 pm

Joined: Tue Nov 16, 2010 8:00 am
Posts: 2353
Location: Gouda, The Netherlands

Yes, you can use distributed RAM, but it will require quite a bit, especially because the current design is rather wasteful with ROM usage.

I'm considering a follow-up project without the ROM. I think there may be some potential for speed increases, because the LUTs can be placed near where they are needed. I noticed a fairly large routing delay from the ROM.

65f02

Post subject: Re: My new verilog 65C02 core.

Posted: Sat Oct 24, 2020 2:31 pm

Joined: Wed Jul 01, 2020 6:15 pm
Posts: 79
Location: Germany

"Speed increase" sounds even better!

I will definitely stay tuned.

dmsc

Post subject: Re: My new verilog 65C02 core.

Posted: Sat Oct 24, 2020 2:49 pm

Joined: Mon Sep 17, 2018 2:39 am
Posts: 132

Hi Arlet!

Thank you for this new core!

I did some experiments with your new core, using the open-source Yosys synthesis for iCE40. I had to change the microcode ROM: the iCE40 architecture uses 4 Kbit block RAMs, so the optimal microcode word is 32 bits wide (using 4 block RAMs). Also, I discovered that Yosys does not trim unused bits when inferring block RAMs, so I reduced the microcode definition to 31 bits ([30:0]). This is my patch:

Code:

@@ -62,7 +62,7 @@ assign flags = control[7:0];
 assign alu_op = { ci, shift, adder };
 assign dp_op  = control[21:15];
 
-reg [35:0] microcode[511:0];
+reg [30:0] microcode[511:0];
 reg [35:0] control;
 
 /* 
@@ -115,7 +115,7 @@ always @(posedge clk)
  * load next control word from ROM
  */
 always @(posedge clk)
-    control <= microcode[pc];
+    control <= { 5'b0, microcode[pc] };

Obviously, this requires removing the first 5 bits of each word in the microcode.hex file.
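Doing that by hand is error-prone, because 5 bits do not line up with hex digits. A small Python sketch could do it, assuming microcode.hex holds one $readmemh word per line (optionally with // comments and @address lines); narrow_microcode and convert_file are hypothetical helpers, not part of the core. It refuses to drop bits that are set, since the patched Verilog zero-fills them with 5'b0.

```python
def narrow_microcode(lines, old_width=36, new_width=31):
    """Drop the top (old_width - new_width) bits of every $readmemh word.

    Assumes one hex word per line; '@address' lines, blank lines and
    '//' comment-only lines are passed through unchanged.
    Raises ValueError if a dropped bit is 1, because the patched Verilog
    re-inserts those bits as zeros ({5'b0, microcode[pc]}).
    """
    dropped = ((1 << old_width) - 1) & ~((1 << new_width) - 1)
    digits = (new_width + 3) // 4  # 8 hex digits for 31 bits
    out = []
    for line in lines:
        text = line.split("//", 1)[0].strip()
        if not text or text.startswith("@"):
            out.append(line.rstrip("\n"))
            continue
        word = int(text, 16)
        if word & dropped:
            raise ValueError(f"non-zero bits would be dropped from {text!r}")
        out.append(f"{word:0{digits}x}")
    return out


def convert_file(src_path, dst_path):
    """Hypothetical helper: rewrite src_path as a 31-bit hex file at dst_path."""
    with open(src_path) as src:
        narrowed = narrow_microcode(src)
    with open(dst_path, "w") as dst:
        dst.write("\n".join(narrowed) + "\n")
```

For example, convert_file("microcode.hex", "microcode31.hex") would write the narrowed ROM image; the output file name is only an illustration.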

This is the usage report from synthesis:

Code:

   Number of wires:                328
   Number of wire bits:           1434
   Number of public wires:         328
   Number of public wire bits:    1434
   Number of memories:               0
   Number of memory bits:            0
   Number of processes:              0
   Number of cells:                656
     SB_CARRY                      107
     SB_DFF                         18
     SB_DFFE                       103
     SB_DFFESR                       4
     SB_LUT4                       420
     SB_RAM40_4K                     4

This is about 40% fewer LUT4s than your original 6502 core, at the cost of 4 extra block RAMs.
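The count of 4 block RAMs follows from the ROM shape described above. The 512-deep microcode in a 4 Kbit SB_RAM40_4K block is 8 bits wide, so a word of up to 32 bits fits in 4 blocks. The original 36-bit word would need a fifth. Here is a quick Python check, assuming only the 256x16, 512x8, 1024x4 and 2048x2 block configurations are used (blocks_needed is an illustrative helper):

```python
import math

BITS_PER_BLOCK = 4096  # SB_RAM40_4K capacity in bits
# Depths of the 256x16, 512x8, 1024x4 and 2048x2 configurations.
SUPPORTED_DEPTHS = (256, 512, 1024, 2048)


def blocks_needed(width, depth=512):
    """Number of SB_RAM40_4K blocks for a width x depth ROM (simple model)."""
    if depth not in SUPPORTED_DEPTHS:
        raise ValueError(f"unsupported depth {depth}")
    block_width = BITS_PER_BLOCK // depth  # 8 bits per block at 512 deep
    return math.ceil(width / block_width)


for width in (36, 32, 31):
    print(f"{width}-bit microcode: {blocks_needed(width)} blocks")
# 36-bit microcode: 5 blocks
# 32-bit microcode: 4 blocks
# 31-bit microcode: 4 blocks
```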

This is the report for the original 6502 core:

Code:

   Number of wires:                493
   Number of wire bits:           1620
   Number of public wires:         493
   Number of public wire bits:    1620
   Number of memories:               0
   Number of memory bits:            0
   Number of processes:              0
   Number of cells:                901
     SB_CARRY                       27
     SB_DFF                         10
     SB_DFFE                       114
     SB_DFFESR                       9
     SB_DFFESS                       4
     SB_DFFR                         5
     SB_DFFS                         1
     SB_LUT4                       731

After packing into an iCE40 UP5K, the utilization is:

Code:

Arlet 65C02:
------------------------------------------
Info:             ICESTORM_LC:   551/ 5280    10%
Info:            ICESTORM_RAM:     4/   30    13%
Info:                   SB_GB:     3/    8    37%

Arlet 6502:
------------------------------------------
Info:             ICESTORM_LC:   829/ 5280    15%
Info:                   SB_GB:     4/    8    50%
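To put the "about 40%" figure into numbers, here is the arithmetic on the counts from the reports above (a trivial Python sketch; percent_fewer is just an illustrative helper):

```python
def percent_fewer(old, new):
    """Relative reduction from old to new, in percent."""
    return 100.0 * (old - new) / old


# (original 6502 core, new 65C02 core), copied from the reports above
counts = {
    "SB_LUT4 after synthesis": (731, 420),
    "ICESTORM_LC after packing": (829, 551),
}
for name, (old, new) in counts.items():
    print(f"{name}: {old} -> {new} ({percent_fewer(old, new):.1f}% fewer)")
# SB_LUT4 after synthesis: 731 -> 420 (42.5% fewer)
# ICESTORM_LC after packing: 829 -> 551 (33.5% fewer)
```

So the LUT4 saving is slightly above 40%, while the saving in packed logic cells is closer to a third, with the 4 block RAMs on top.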

Speed is difficult to estimate without a full design. Simply routing all 38 pins of the core to I/O pins of the FPGA gives about 5% more speed for the new design, so both are similar.

Have Fun!

Arlet

Post subject: Re: My new verilog 65C02 core.

Posted: Sat Oct 24, 2020 3:04 pm

Joined: Tue Nov 16, 2010 8:00 am
Posts: 2353
Location: Gouda, The Netherlands

Ok, nice to know. I will reduce the width of the ROM. I made it extra wide at the start so that I could just use whatever bits I needed, but now that BCD has been added, I don't expect to need any additional bits.

By the way, with the fixed ALU parentheses, the ALU Z flag is no longer the longest path, and I was able to push fmax to 120 MHz.

Arlet

Post subject: Re: My new verilog 65C02 core.

Posted: Sat Oct 24, 2020 6:20 pm

Joined: Tue Nov 16, 2010 8:00 am
Posts: 2353
Location: Gouda, The Netherlands

I've converted part of the ABL module into instantiated LUTs, because that was the longest path, and I saw an opportunity for improvement, especially around this code:

Code:

always @(*)
    case( op[1:0] )
        2'b00: {CO, ADL} = base + 00  + CI;
        2'b01: {CO, ADL} = 9'hx;
        2'b10: {CO, ADL} = base + ABL + CI;
        2'b11: {CO, ADL} = base + REG + CI;
    endcase

Synthesis produced 16 LUTs, even though it can be done in 8. This is what manual instantiation looks like:

Code:

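genvar i;

generate for (i = 0; i < 8; i = i + 1 )
begin : adl_loop
// O6 (INIT[63:32], since I5 = 1): propagate P = base ^ addend
// O5 (INIT[31:0]):                generate  G = base & addend
LUT6_2 #(.INIT(64'h665aaaaa88a00000)) adl_lut(
    .O6(P[i]),
    .O5(G[i]),
    .I0(base[i]),
    .I1(REG[i]),
    .I2(ABL[i]),
    .I3(op[0]),
    .I4(op[1]),
    .I5(1'b1) );
end
endgenerate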

This creates 8 LUTs, each with two outputs: carry propagate (P) and carry generate (G). The LUTs have 6 inputs, but since you need 2 outputs for the carry chain, each LUT is split into a pair of LUT5s sharing inputs I0 to I4, with I5 tied high. The function of the LUT is defined by the hex INIT string, a 64-bit truth table. These signals then go into CARRY4 instances that implement the carry chain.

Making the truth table can be done by isolating a module, putting it in a test wrapper that cycles through all possible input combinations, and printing out the hex string. Then you can just copy & paste that into your own code.
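The same idea also works without an HDL simulator. The following Python sketch evaluates a behavioral model of one ABL bit for all 32 combinations of I0..I4 and packs the results into a LUT6_2 INIT value. It assumes the standard LUT6_2 behavior with I5 tied high: O6 reads INIT[63:32] and O5 reads INIT[31:0]. The don't-care case op = 2'b01 is filled in like op = 2'b00 (an assumption); with that choice the script reproduces the INIT string used above.

```python
def adl_bit(base, reg, abl, op):
    """Propagate and generate for one ABL bit: returns (P, G).

    op 2'b00: base + 0, 2'b10: base + ABL, 2'b11: base + REG.
    The don't-care op 2'b01 is treated like 2'b00 (an assumption).
    """
    if op == 0b10:
        addend = abl
    elif op == 0b11:
        addend = reg
    else:
        addend = 0
    return base ^ addend, base & addend


def lut6_2_init():
    """Build the 64-bit INIT value for LUT6_2 with I5 tied to 1."""
    init = 0
    for idx in range(32):
        i0 = idx & 1            # I0 = base[i]
        i1 = (idx >> 1) & 1     # I1 = REG[i]
        i2 = (idx >> 2) & 1     # I2 = ABL[i]
        op = (idx >> 3) & 0b11  # I3 = op[0], I4 = op[1]
        p, g = adl_bit(i0, i1, i2, op)
        init |= p << (32 + idx)  # O6 with I5 = 1 reads INIT[63:32]
        init |= g << idx         # O5 reads INIT[31:0]
    return f"{init:016x}"


print(lut6_2_init())  # 665aaaaa88a00000
```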

Code:

CARRY4 carry_l ( .CO(COL), .O(ADL[3:0]), .CI(CI),     .CYINIT(1'b0), .DI(G[3:0]), .S(P[3:0]) );
CARRY4 carry_h ( .CO(COH), .O(ADL[7:4]), .CI(COL[3]), .CYINIT(1'b0), .DI(G[7:4]), .S(P[7:4]) );
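To check that P, G and the two chained CARRY4 primitives really implement base + addend + CI, the chain can be modeled bit by bit. This Python sketch assumes the usual CARRY4 behavior with CYINIT tied to 0: each stage outputs O = S ^ carry_in, and passes carry_in on when S is 1, otherwise DI. The helper names are illustrative only.

```python
def carry4(ci, di, s):
    """Behavioral CARRY4 (CYINIT = 0): returns (O, CO) as bit lists, LSB first."""
    o, co = [], []
    c = ci
    for d, p in zip(di, s):
        o.append(p ^ c)    # sum bit: propagate XOR incoming carry
        c = c if p else d  # carry mux: pass the carry on, or take DI (= G)
        co.append(c)
    return o, co


def adl_add(base, abl, reg, op, ci):
    """Return (CO, ADL) as built from 8 P/G LUTs and two chained CARRY4s."""
    addend = {0b10: abl, 0b11: reg}.get(op, 0)  # 2'b00 (and don't-care 2'b01): + 0
    p = [((base ^ addend) >> i) & 1 for i in range(8)]
    g = [((base & addend) >> i) & 1 for i in range(8)]
    o_lo, co_lo = carry4(ci, g[:4], p[:4])        # carry_l
    o_hi, co_hi = carry4(co_lo[3], g[4:], p[4:])  # carry_h, fed by COL[3]
    adl = sum(bit << i for i, bit in enumerate(o_lo + o_hi))
    return co_hi[3], adl


# Exhaustive check against plain addition for the 'base + ABL + CI' case.
for base in range(256):
    for abl in range(256):
        for ci in (0, 1):
            co, adl = adl_add(base, abl, 0, 0b10, ci)
            assert (co << 8) | adl == base + abl + ci
```

Using G = base & addend as DI is one valid choice. When S is 0 the two operand bits are equal, so either of them would also work.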

With these improvements, the total slice count is down to 49. I'm sure more can be saved. The longest path is only a few logic levels, but I notice there's a lot of routing delay (more than half of the total). The biggest delay is to the AB output pad. This means that if you want the fastest possible design, you also have to keep in mind which I/O pins you use, and make sure that related pins are close together, so that the logic can be placed next to the pads.

In my board, I just mapped the pins where it was convenient for board layout.

Attachments:

path.png

dmsc

Post subject: Re: My new verilog 65C02 core.

Posted: Sat Oct 24, 2020 6:41 pm

Joined: Mon Sep 17, 2018 2:39 am
Posts: 132

Hi!

Arlet wrote:

Thank you!

Quote:

By the way, with the fixed ALU parentheses, the ALU Z flag is no longer the longest path, and I was able to push fmax to 120 MHz.

Sadly, in the LUT4 architecture of iCE40 the new ALU is bigger!

The best result comes from simply passing CI only to the operations that use it:

Code:

@@ -64,14 +64,14 @@ wire [7:0] NR = ~R;
 
 always @(*)
     case( op[2:0] )
-        3'b000: adder = (R | M)     + CI;
-        3'b001: adder = (R & M)     + CI;
-        3'b010: adder = (R ^ M)     + CI;
+        3'b000: adder = (R | M)    /* + CI */ ;
+        3'b001: adder = (R & M)    /* + CI */ ;
+        3'b010: adder = (R ^ M)    /* + CI */ ;
         3'b011: adder = (R + M)     + CI;
         3'b100: adder = (R + 8'h00) + CI;
         3'b101: adder = (R + 8'hff) + CI; 
         3'b110: adder = (R + NM)    + CI;
-        3'b111: adder = (M & NR)    + CI;
+        3'b111: adder = (M & NR)   /* + CI */ ;
     endcase
 
 /*
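This change alters behavior only for logic operations with CI = 1, a combination that never occurs because, as noted earlier in the thread, CI is always 0 for logic operations. A Python sketch of both case statements makes that explicit. It models the operations as listed in the diff, with NM = ~M and NR = ~R assumed to be 8-bit inversions; the function names are illustrative.

```python
MASK9 = 0x1FF  # 9-bit 'adder' result


def alu_with_ci(op, r, m, ci):
    """Parenthesized version: CI is added in every case."""
    nm, nr = ~m & 0xFF, ~r & 0xFF
    results = [r | m, r & m, r ^ m, r + m, r + 0x00, r + 0xFF, r + nm, m & nr]
    return (results[op] + ci) & MASK9


LOGIC_OPS = {0b000, 0b001, 0b010, 0b111}


def alu_dmsc(op, r, m, ci):
    """dmsc's version: CI only reaches the arithmetic cases 3'b011..3'b110."""
    return alu_with_ci(op, r, m, 0 if op in LOGIC_OPS else ci)


# Identical whenever logic operations are issued with CI = 0 ...
for op in range(8):
    for r in range(0, 256, 3):
        for m in range(0, 256, 5):
            for ci in ((0,) if op in LOGIC_OPS else (0, 1)):
                assert alu_with_ci(op, r, m, ci) == alu_dmsc(op, r, m, ci)

# ... but not if the microcode ever set CI during, say, an OR:
print(hex(alu_with_ci(0b000, 0x0F, 0xF0, 1)), hex(alu_dmsc(0b000, 0x0F, 0xF0, 1)))
# 0x100 0xff
```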

With this change, I got:

Code:

   Number of wires:                333
   Number of wire bits:           1442
   Number of public wires:         333
   Number of public wire bits:    1442
   Number of memories:               0
   Number of memory bits:            0
   Number of processes:              0
   Number of cells:                655
     SB_CARRY                       99
     SB_DFF                         18
     SB_DFFE                       103
     SB_DFFESR                       4
     SB_LUT4                       427
     SB_RAM40_4K                     4

And:

Code:

Info:             ICESTORM_LC:   559/ 5280    10%
Info:            ICESTORM_RAM:     4/   30    13%

Here the longest path goes from "C" to "Z", "N", "abl.base", "abh.CI", up to "ABH[7]". I don't really understand why the flags are at the start of the chain, but perhaps it is because of the BCC/BCS opcodes.

Have Fun!

Arlet

Post subject: Re: My new verilog 65C02 core.

Posted: Sat Oct 24, 2020 7:04 pm

Joined: Tue Nov 16, 2010 8:00 am
Posts: 2353
Location: Gouda, The Netherlands

Yes, probably the branches. The branch target address is calculated in the ABL/ABH modules directly from the current address, so there's a path from the flags to ABL[0] and all the way up the carry chain to ABH[7].
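A behavioral sketch of why the flags end up at the start of the address carry chain follows. Whether the branch is taken depends on a flag, and that decides whether the 8-bit offset is added to the low byte at all. The carry out of that addition, together with the offset's sign extension, then feeds the high byte. The abl/abh split below mirrors the ABL/ABH module names only; it is an assumption for illustration, not the core's actual logic, and branch_target is a hypothetical helper.

```python
def branch_target(pc, offset, taken):
    """pc: 16-bit address the offset is relative to; offset: raw 8-bit operand.

    'taken' would come from the flags (e.g. C for BCC/BCS), so it sits at
    the very start of the path through the low-byte and high-byte adders.
    """
    pcl, pch = pc & 0xFF, (pc >> 8) & 0xFF
    off = offset if taken else 0
    low_sum = pcl + off                # low byte add (ABL)
    abl = low_sum & 0xFF
    carry = low_sum >> 8               # carry out of the low byte
    ext = 0xFF if off & 0x80 else 0x00 # sign extension of the offset
    abh = (pch + ext + carry) & 0xFF   # high byte add (ABH)
    return (abh << 8) | abl


print(hex(branch_target(0x12F0, 0x20, True)))   # 0x1310 (carry into high byte)
print(hex(branch_target(0x1305, 0xF0, True)))   # 0x12f5 (offset -16)
print(hex(branch_target(0x1305, 0xF0, False)))  # 0x1305 (not taken)
```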

I can see why the + CI is not optimal for LUT4 architectures: their carry chain inputs are not as flexible.

Arlet

Post subject: Re: My new verilog 65C02 core.

Posted: Sat Oct 24, 2020 7:12 pm

Joined: Tue Nov 16, 2010 8:00 am
Posts: 2353
Location: Gouda, The Netherlands

Arlet wrote:

In my board, I just mapped the pins where it was convenient for board layout.

I can see the reason now. I've highlighted the AB* pins on my board. Some are near the top of the FPGA, and some are at the bottom, so there's no good place for the logic to sit close to both.

Of course, when you're running at >100 MHz, you can't address the external bus that fast anyway. If you need to run fast, you can use the internal block RAMs for code, add 1 or 2 wait states for external access, and then add a pipeline stage to the AB outputs. The problem was that I had configured my core for internal block RAM, but still had the AB outputs connected to pins (though not the data bus).

When I add a pipeline stage to external AB signals, I can get fmax to 150 MHz.

Attachments:

AB.png

BigEd

Post subject: Re: My new verilog 65C02 core.

Posted: Sat Oct 24, 2020 7:56 pm

Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10793
Location: England

spectacular!

Arlet

Post subject: Re: My new verilog 65C02 core.

Posted: Sat Oct 24, 2020 9:19 pm

Joined: Tue Nov 16, 2010 8:00 am
Posts: 2353
Location: Gouda, The Netherlands

Here's a screenshot of the floorplan. You can see the two block RAMs next to each other, one for the bootrom, and the other for the microcode.

The light blue elements are used for the design, but they appear rather scattered, so I would think there's plenty of room for improvements. For some bizarre reason, there are a handful of flip-flops placed even further away, with no logic nearby.

Attachments:

scattered.png

floorplan.png
