I've converted part of the ABL module into instantiated LUTs, because it contained the longest path and I saw an opportunity for improvement, especially around this code:
Code:
always @(*)
    case( op[1:0] )
        2'b00: {CO, ADL} = base + 00  + CI;
        2'b01: {CO, ADL} = 9'hx;
        2'b10: {CO, ADL} = base + ABL + CI;
        2'b11: {CO, ADL} = base + REG + CI;
    endcase
Synthesis produced 16 LUTs, even though it can be done in 8. This is what manual instantiation looks like:
Code:
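genvar i;
generate for (i = 0; i < 8; i = i + 1 )
begin : adl_loop
    // one LUT6_2 per bit: O6 = carry propagate, O5 = carry generate
    LUT6_2 #(.INIT(64'h665aaaaa88a00000)) adl_lut(
        .O6(P[i]),
        .O5(G[i]),
        .I0(base[i]),
        .I1(REG[i]),
        .I2(ABL[i]),
        .I3(op[0]),
        .I4(op[1]),
        .I5(1'b1) );    // tied high so O6 and O5 act as two separate LUT5s
end
endgenerate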
This creates 8 LUTs, each with two outputs: carry propagate (P) and carry generate (G). The LUTs have 6 inputs, but since the carry chain needs 2 outputs per bit, each LUT is divided into a pair of LUT5s that share the same 5 inputs. The function of the LUT is defined by the hex INIT string, which is the 64-bit truth table. With I5 tied high, O6 reads the upper 32 bits of INIT and O5 reads the lower 32 bits. These signals then go into CARRY4 instances that form the carry chain.
You can make the truth table by isolating a module, putting it in a test wrapper that cycles through all possible input combinations, and printing out the hex string. Then you can just copy and paste that into your own code.
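As a rough sketch of such a test wrapper, the simulation-only module below rebuilds the INIT string for the LUT above. It does not wrap the real module. Instead it assumes the per-bit function is P = base ^ operand and G = base & operand, with the operand picked by op as in the case statement, and it assumes the don't-care case (op = 01) is treated as adding zero. The module name and the signal names (base_b, reg_b, abl_b) are made up for illustration.

```verilog
// Simulation-only sketch: regenerate the LUT6_2 INIT string.
// Names here are hypothetical and not part of the actual core.
module adl_lut_init_tb;
    reg [63:0] init;
    reg [5:0]  idx;                    // {I5, I4, I3, I2, I1, I0}
    reg        i5, abl_b, reg_b, base_b;
    reg [1:0]  op;                     // {I4, I3} = {op[1], op[0]}
    reg        operand, p, g;
    integer    n;

    initial begin
        init = 64'd0;
        for (n = 0; n < 64; n = n + 1) begin
            idx = n;
            {i5, op, abl_b, reg_b, base_b} = idx;

            // select the second adder operand, as in the case statement
            case (op)
                2'b00: operand = 1'b0;
                2'b01: operand = 1'b0;   // don't care, assumed to be zero
                2'b10: operand = abl_b;
                2'b11: operand = reg_b;
            endcase

            p = base_b ^ operand;        // carry propagate
            g = base_b & operand;        // carry generate

            // With I5 tied high, O6 reads INIT[63:32] and O5 reads INIT[31:0],
            // so P goes in the upper half and G in the lower half.
            init[idx] = i5 ? p : g;
        end
        $display("64'h%h", init);        // prints 64'h665aaaaa88a00000
        $finish;
    end
endmodule
```

Running this in any Verilog simulator should reproduce the INIT value used in the instantiation above, which is a quick way to check that the string still matches the intended logic after a change.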
Code:
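// COL and COH are 4-bit wires; COL[3] links the two halves,
// and COH[3] is the carry out of the full 8-bit addition
CARRY4 carry_l ( .CO(COL), .O(ADL[3:0]), .CI(CI),     .CYINIT(1'b0), .DI(G[3:0]), .S(P[3:0]) );
CARRY4 carry_h ( .CO(COH), .O(ADL[7:4]), .CI(COL[3]), .CYINIT(1'b0), .DI(G[7:4]), .S(P[7:4]) );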
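For reference, here is a behavioral sketch of what the two chained CARRY4 primitives compute from P and G. Each stage is an XOR for the sum bit and a mux for the carry: when P is 1 the incoming carry is passed on, otherwise G becomes the carry. The module name carry8_model and its port names are invented for illustration, and this is a model for understanding, not a replacement for the primitives.

```verilog
// Behavioral sketch of an 8-bit carry chain built from two CARRY4s.
// Hypothetical module, for illustration only.
module carry8_model(
    input  [7:0] P,     // propagate, connected to S on the CARRY4
    input  [7:0] G,     // generate, connected to DI on the CARRY4
    input        CI,    // carry in
    output [7:0] SUM,   // corresponds to ADL
    output       CO     // corresponds to COH[3]
);
    wire [8:0] c;       // c[k] is the carry into bit k
    assign c[0] = CI;

    genvar k;
    generate for (k = 0; k < 8; k = k + 1)
    begin : chain
        assign SUM[k] = P[k] ^ c[k];          // XOR stage: sum bit
        assign c[k+1] = P[k] ? c[k] : G[k];   // mux stage: pass or generate carry
    end
    endgenerate

    assign CO = c[8];
endmodule
```

This also shows why each LUT only has to produce P and G: the rest of the adder lives in the dedicated carry logic and costs no extra LUTs.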
With these improvements, the total slice count is down to 49. I'm sure there's more that can be saved. The longest path is only a few logic levels, but I notice that routing accounts for more than half of its delay. The biggest delay is the route to the AB output pad. This means that if you want the fastest possible design, you also have to keep in mind which I/O pins you use, and make sure that related pins are close together, so that the logic can be mapped next to the pads.
On my board, I just mapped the pins where it was convenient for board layout.