6502.org Forum

All times are UTC

PostPosted: Fri Oct 23, 2020 7:34 pm 
Joined: Tue Nov 16, 2010 8:00 am
Posts: 2353
Location: Gouda, The Netherlands
Tracing a few of these names in the schematic viewer, it looks like there are a few unnecessary logic layers, because the tools aren't clever enough. I'm curious what some properly designed manual blocks will do instead. And it's not just the number of logic layers, it's also all the unnecessary routing between them, and the stretched out placement.


PostPosted: Sat Oct 24, 2020 11:48 am
Joined: Tue Nov 16, 2010 8:00 am
Posts: 2353
Location: Gouda, The Netherlands
Support for BCD has been added, and passes Klaus Dormann's test.

Next I'm going to try manually instantiating the ALU, to see how much the tools' result can be improved on, and whether there's a way to change the Verilog source to direct them better.

Taking just the ALU, ISE 14.7 synthesis reports 12 slices with effort on "high" and optimized for area.


PostPosted: Sat Oct 24, 2020 11:56 am
Joined: Tue Nov 16, 2010 8:00 am
Posts: 2353
Location: Gouda, The Netherlands
In fact, just isolating the "adder" portion:
Code:
module adder(
    input CI,               // carry in
    input [7:0] R,          // input from register file
    input [7:0] M,          // input from memory
    input [2:0] op,         // 3-bit operation select
    output reg [8:0] adder  // data out
);

wire [7:0] N = ~M;

always @(*)
    case( op[2:0] )
        3'b000: adder =  R |  M     + CI;
        3'b001: adder =  R &  M     + CI;
        3'b010: adder =  R ^  M     + CI;
        3'b011: adder =  R +  M     + CI;
        3'b100: adder =  R +  8'h00 + CI;
        3'b101: adder =  R +  8'hff + CI;
        3'b110: adder =  R +  N     + CI;
        3'b111: adder = ~R &  M     + CI;
    endcase
endmodule

Already requires 11 slices. I will try that part first.


PostPosted: Sat Oct 24, 2020 1:12 pm
Joined: Tue Nov 16, 2010 8:00 am
Posts: 2353
Location: Gouda, The Netherlands
Okay, that was my mistake. I wasn't thinking about operator precedence in Verilog. The "+ CI" binds tighter than the bitwise logic operators, so "R | M + CI" is evaluated as "R | (M + CI)", making a mess of my plans. I never noticed this, because CI is always 0 when doing logic operations.

Rewriting it with parentheses, like (R|M) + CI reduces area to 3 slices (probably really 2 slices + 1 LUT for the carry out), which is what I expected to get.
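
To make the precedence issue concrete, here is a minimal simulation-only sketch. The module name and the test values are made up for illustration; only the two expressions come from the posts above.

```verilog
// Hypothetical demo module, not part of the core.
module precedence_demo;
    reg  [7:0] R, M;
    reg        CI;
    reg  [8:0] loose, tight;

    initial begin
        R  = 8'h0f;
        M  = 8'h01;
        CI = 1'b1;
        loose =  R | M  + CI;   // parsed as R | (M + CI)
        tight = (R | M) + CI;   // OR first, then add the carry in
        $display("R | M + CI   = %h", loose);
        $display("(R | M) + CI = %h", tight);
    end
endmodule
```

With these values the first expression evaluates to 9'h00f and the second to 9'h010, so the two forms only agree when CI is 0.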


PostPosted: Sat Oct 24, 2020 1:58 pm
Joined: Wed Jul 01, 2020 6:15 pm
Posts: 79
Location: Germany
Arlet wrote:
Keep in mind that you'll lose 2kB of memory with this core, due to microcode claiming one of the block RAMs.

Ah, right. That would be an unpleasant limitation in my case, since I use the on-chip block RAM exclusively in my 65F02, and for some host systems I do need very close to the 64 kByte available on the XC6SLX9.

Could the microcode also reside in distributed RAM? I would not mind the extra slices used, but an fmax penalty would be undesirable of course. The 116 MHz you got in your test is quite nice indeed!


PostPosted: Sat Oct 24, 2020 2:04 pm
Joined: Tue Nov 16, 2010 8:00 am
Posts: 2353
Location: Gouda, The Netherlands
Yes, you can use distributed RAM, but it will require quite a bit, especially because the current design is rather wasteful with ROM usage.

I'm considering a follow up project without the ROM. I was thinking there may be some potential for speed increases, because the LUTs can be placed near the place where they are needed. I noticed a fairly large routing delay from the ROM.


PostPosted: Sat Oct 24, 2020 2:31 pm
Joined: Wed Jul 01, 2020 6:15 pm
Posts: 79
Location: Germany
"Speed increase" sounds even better! :D
I will definitely stay tuned.


PostPosted: Sat Oct 24, 2020 2:49 pm
Joined: Mon Sep 17, 2018 2:39 am
Posts: 132
Hi Arlet!

Thank you for this new core!

I did some experiments with your new core, using iCE40 synthesis with the open-source Yosys. I had to change the microcode ROM, as the iCE40 architecture uses 4 kbit block RAMs, so the optimal size is a 32-bit microcode word (using 4 block RAMs). Also, I discovered that Yosys can't infer block RAMs for bits that are not used, so I changed the microcode definition to 31 bits. This is my patch:

Code:
@@ -62,7 +62,7 @@ assign flags = control[7:0];
 assign alu_op = { ci, shift, adder };
 assign dp_op  = control[21:15];
 
-reg [35:0] microcode[511:0];
+reg [30:0] microcode[511:0];
 reg [35:0] control;
 
 /*
@@ -115,7 +115,7 @@ always @(posedge clk)
  * load next control word from ROM
  */
 always @(posedge clk)
-    control <= microcode[pc];
+    control <= { 5'b0, microcode[pc] };

Obviously, this also requires removing the first 5 bits of each word in the microcode.hex file.

This is the usage report from synthesis:
Code:
   Number of wires:                328
   Number of wire bits:           1434
   Number of public wires:         328
   Number of public wire bits:    1434
   Number of memories:               0
   Number of memory bits:            0
   Number of processes:              0
   Number of cells:                656
     SB_CARRY                      107
     SB_DFF                         18
     SB_DFFE                       103
     SB_DFFESR                       4
     SB_LUT4                       420
     SB_RAM40_4K                     4

This is about 40% less LUT4 than your original 6502 core, using 4 extra block rams.

This is the report for the original 6502 core:
Code:
   Number of wires:                493
   Number of wire bits:           1620
   Number of public wires:         493
   Number of public wire bits:    1620
   Number of memories:               0
   Number of memory bits:            0
   Number of processes:              0
   Number of cells:                901
     SB_CARRY                       27
     SB_DFF                         10
     SB_DFFE                       114
     SB_DFFESR                       9
     SB_DFFESS                       4
     SB_DFFR                         5
     SB_DFFS                         1
     SB_LUT4                       731

After packing into an iCE40 up5k, the usages are:
Code:
Arlet 65C02:
------------------------------------------
Info:             ICESTORM_LC:   551/ 5280    10%
Info:            ICESTORM_RAM:     4/   30    13%
Info:                   SB_GB:     3/    8    37%

Arlet 6502:
------------------------------------------
Info:             ICESTORM_LC:   829/ 5280    15%
Info:                   SB_GB:     4/    8    50%

As for speed, it is difficult to estimate without a full design. Simply routing all 38 pins to I/O pins in the FPGA gives about 5% more speed for the new design, so both are similar.

Have Fun!


PostPosted: Sat Oct 24, 2020 3:04 pm
Joined: Tue Nov 16, 2010 8:00 am
Posts: 2353
Location: Gouda, The Netherlands
Ok, nice to know. I will reduce the width of the ROM. I added all bits at the start so that I could just use whatever I needed, but now that BCD has been added, I don't expect to need any additional bits.

By the way, with the fixed ALU parentheses, the ALU Z flag is no longer the longest path, and I was able to push fmax to 120 MHz.


PostPosted: Sat Oct 24, 2020 6:20 pm
Joined: Tue Nov 16, 2010 8:00 am
Posts: 2353
Location: Gouda, The Netherlands
I've converted part of the ABL module into instantiated LUTs, because that was the longest path, and I saw an opportunity for improvement, especially around this code:
Code:
always @(*)
    case( op[1:0] )
        2'b00: {CO, ADL} = base + 00  + CI;
        2'b01: {CO, ADL} = 9'hx;
        2'b10: {CO, ADL} = base + ABL + CI;
        2'b11: {CO, ADL} = base + REG + CI;
    endcase

Synthesis produced 16 LUTs, even though it can be done in 8. This is what manual instantiation looks like:
Code:
generate for (i = 0; i < 8; i = i + 1 )
begin : adl_loop
LUT6_2 #(.INIT(64'h665aaaaa88a00000)) adl_lut(
    .O6(P[i]),
    .O5(G[i]),
    .I0(base[i]),
    .I1(REG[i]),
    .I2(ABL[i]),
    .I3(op[0]),
    .I4(op[1]),
    .I5(1'b1) );
end
endgenerate

This creates 8 LUTs, each with two outputs: carry propagate (P) and carry generate (G). The LUTs have 6 inputs, but since you need 2 outputs for the carry chain, each LUT is divided into a pair of LUT5s (I5 is tied high, so O6 reads the upper 32 bits of INIT and O5 reads the lower 32 bits). The function of the LUT is defined by the hex INIT string, which is a 64-bit truth table. These signals then go into CARRY4 instances that represent the carry chain.

Making the truth table can be done by isolating a module, putting it in a test wrapper that cycles through all possible signals, and prints out the hex string. Then you can just copy & paste that in your own code.
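
As a rough sketch of that idea, the wrapper below walks through all 64 input combinations and prints the INIT value. It is an assumption-laden shortcut: instead of wrapping the isolated module, it computes propagate and generate directly from the case statement above, and the names adl_init_gen and b are hypothetical.

```verilog
// Hypothetical test wrapper: derives the LUT6_2 INIT value for one ADL bit.
module adl_init_gen;
    reg [63:0] init;
    reg        base, REG, ABL;
    reg [1:0]  op;
    reg        b;        // second adder operand, selected by op
    integer    idx;

    initial begin
        init = 64'h0;
        for (idx = 0; idx < 64; idx = idx + 1) begin
            // Same pin order as the instantiation:
            // I0=base, I1=REG, I2=ABL, I3=op[0], I4=op[1], I5=half select
            base = idx[0];
            REG  = idx[1];
            ABL  = idx[2];
            op   = idx[4:3];
            case (op)
                2'b10:   b = ABL;
                2'b11:   b = REG;
                default: b = 1'b0;  // 2'b00 adds zero; 2'b01 is a don't-care, treated as zero here
            endcase
            // Upper half (I5 = 1) feeds O6 = propagate, lower half feeds O5 = generate
            init[idx] = idx[5] ? (base ^ b) : (base & b);
        end
        $display("INIT = 64'h%h", init);
    end
endmodule
```

Under those assumptions the printed value is 64'h665aaaaa88a00000, the same string used in the instantiation above.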
Code:
CARRY4 carry_l ( .CO(COL), .O(ADL[3:0]), .CI(CI),     .CYINIT(1'b0), .DI(G[3:0]), .S(P[3:0]) );
CARRY4 carry_h ( .CO(COH), .O(ADL[7:4]), .CI(COL[3]), .CYINIT(1'b0), .DI(G[7:4]), .S(P[7:4]) );
With these improvements, the total slice count is down to 49. I'm sure there's more that can be saved. The longest path is only a few logic levels, but I notice there's a lot of routing delay (more than half). The biggest delay is to the AB output pad. This means that if you want the fastest possible design, you also have to keep in mind which I/O pins you use, and make sure that related pins are close together, so that the logic can be mapped next to the pads.

In my board, I just mapped the pins where it was convenient for board layout.
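
For readers without the Xilinx primitive library at hand, the two CARRY4 instances above behave roughly like the model below. This is only a behavioral sketch for simulation. The module name carry_chain_model and the internal wire c are made up, and it folds both CARRY4 blocks into one 8-bit chain.

```verilog
// Hypothetical behavioral model of two chained CARRY4 primitives.
module carry_chain_model(
    input  [7:0] P,     // carry propagate, connected to S
    input  [7:0] G,     // carry generate, connected to DI
    input        CI,    // carry in
    output [7:0] ADL,   // sum bits, corresponds to O
    output       CO     // carry out of bit 7
);
    wire [8:0] c;
    assign c[0] = CI;

    genvar i;
    generate for (i = 0; i < 8; i = i + 1)
    begin : chain
        assign c[i+1] = P[i] ? c[i] : G[i];  // carry mux: pass the carry along, or take DI
        assign ADL[i] = P[i] ^ c[i];         // sum bit: propagate XOR incoming carry
    end
    endgenerate

    assign CO = c[8];
endmodule
```

Each LUT therefore only has to supply P and G for its bit. The carry mux and sum XOR are dedicated logic in the slice, which is why 8 LUTs are enough for the 8-bit ADL adder.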


Attachments:
path.png
path.png [ 53.03 KiB | Viewed 703 times ]
PostPosted: Sat Oct 24, 2020 6:41 pm
Joined: Mon Sep 17, 2018 2:39 am
Posts: 132
Hi!

Arlet wrote:
Ok, nice to know. I will reduce the width of the ROM. I added all bits at the start so that I could just use whatever I needed, but now that BCD has been added, I don't expect to need any additional bits.

Thank you!

Quote:
By the way, with the fixed ALU parentheses, the ALU Z flag is no longer the longest path, and I was able to push fmax to 120 MHz.

Sadly, in the LUT4 architecture of iCE40 the new ALU is bigger!

The best result comes from passing CI only to the operations that use it:
Code:
@@ -64,14 +64,14 @@ wire [7:0] NR = ~R;
 
 always @(*)
     case( op[2:0] )
-        3'b000: adder = (R | M)     + CI;
-        3'b001: adder = (R & M)     + CI;
-        3'b010: adder = (R ^ M)     + CI;
+        3'b000: adder = (R | M)    /* + CI */ ;
+        3'b001: adder = (R & M)    /* + CI */ ;
+        3'b010: adder = (R ^ M)    /* + CI */ ;
         3'b011: adder = (R + M)     + CI;
         3'b100: adder = (R + 8'h00) + CI;
         3'b101: adder = (R + 8'hff) + CI;
         3'b110: adder = (R + NM)    + CI;
-        3'b111: adder = (M & NR)    + CI;
+        3'b111: adder = (M & NR)   /* + CI */ ;
     endcase
 
 /*

With this change, I got:
Code:
   Number of wires:                333
   Number of wire bits:           1442
   Number of public wires:         333
   Number of public wire bits:    1442
   Number of memories:               0
   Number of memory bits:            0
   Number of processes:              0
   Number of cells:                655
     SB_CARRY                       99
     SB_DFF                         18
     SB_DFFE                       103
     SB_DFFESR                       4
     SB_LUT4                       427
     SB_RAM40_4K                     4

And:
Code:
Info:             ICESTORM_LC:   559/ 5280    10%
Info:            ICESTORM_RAM:     4/   30    13%


Here the longest path goes from "C" to "Z", "N", "abl.base", "abh.CI", up to "ABH[7]". I don't really understand the first flags in the chain, but perhaps it is because of the BCC/BCS opcodes.

Have Fun!


PostPosted: Sat Oct 24, 2020 7:04 pm
Joined: Tue Nov 16, 2010 8:00 am
Posts: 2353
Location: Gouda, The Netherlands
Yes, probably the branches. The branch target address is calculated in the ABL/ABH modules directly from the current address, so there's a path from the flags to ABL[0] all the way up the carry chain to ABH[7].

I can see where the +CI is not optimal for the LUT4 architectures, because the carry chain inputs are not so flexible.


PostPosted: Sat Oct 24, 2020 7:12 pm
Joined: Tue Nov 16, 2010 8:00 am
Posts: 2353
Location: Gouda, The Netherlands
Arlet wrote:
In my board, I just mapped the pins where it was convenient for board layout.

I can see the reason now. I've highlighted the AB* pins in my board. Some are near the top of the FPGA, and some are at the bottom. This means that there's no good place for the logic to sit close to both.

Of course, when you're running at >100 MHz, you can't address the external bus that fast anyway. If you need to run fast, you can use the internal block RAMs for code, and add 1 or 2 wait states for external access, and then add a pipeline stage to the AB outputs. The problem was that I had configured my core for internal block RAM, but still had the AB connected (but not the Data Bus).

When I add a pipeline stage to external AB signals, I can get fmax to 150 MHz.
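
A pipeline stage on the external address bus can be as simple as one register between the core and the pads. The sketch below assumes a 16-bit AB. The module name ab_pipe and the port AB_ext are hypothetical, and the wait-state logic for external accesses is not shown.

```verilog
// Hypothetical output register for the external address bus.
module ab_pipe(
    input             clk,
    input      [15:0] AB,      // address from the core (internal, fast path)
    output reg [15:0] AB_ext   // registered copy that drives the I/O pads
);

// One clock of delay: the long route to the pads now starts at a
// flip-flop instead of at the end of the address adder carry chain.
always @(posedge clk)
    AB_ext <= AB;

endmodule
```

The external bus then sees each address one cycle later, which is acceptable only if external accesses already take wait states, as described above.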


Attachments:
AB.png
AB.png [ 40.43 KiB | Viewed 696 times ]
PostPosted: Sat Oct 24, 2020 7:56 pm
Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10793
Location: England
spectacular!


PostPosted: Sat Oct 24, 2020 9:19 pm
Joined: Tue Nov 16, 2010 8:00 am
Posts: 2353
Location: Gouda, The Netherlands
Here's a screenshot of the floorplan. You can see the two block RAMs next to each other, one for the bootrom, and the other for the microcode.

The light blue elements are used for the design, but they appear rather scattered, so I would think there's plenty of room for improvements. For some bizarre reason, there are a handful of flip-flops placed even further away, with no logic nearby.


Attachments:
scattered.png
scattered.png [ 19.46 KiB | Viewed 679 times ]
floorplan.png
floorplan.png [ 8.76 KiB | Viewed 679 times ]