My new verilog 65C02 core.

Arlet · Post by **Arlet** » Sun Nov 15, 2020 7:03 pm

Yes, I'll do RDY too, but that's easy.

GARTHWILSON · Post by **GARTHWILSON** » Sun Nov 15, 2020 7:14 pm

I've kept quiet until now because I don't have anything to contribute. This is sounding really good though, and I look forward to what appears to be close to being a finished, available product.

Arlet · Post by **Arlet** » Sun Nov 15, 2020 8:07 pm

A quick run through ISE of just the control logic shows 140 LUTs/40 slices. That's a fairly big chunk. It meets 350 MHz timing, though, so that's nice.

I'll have to take a look at the design that was produced, see if there's anything strange going on. I haven't really tried to optimize any of this, yet. There could be some opportunities in using different bit patterns. Except for the ABL bits, most are randomly chosen.

Arlet · Post by **Arlet** » Mon Nov 16, 2020 6:44 am

The BCD logic is giving me a bit of trouble. I may have to make some modifications to the ALU module in order to make it easier to generate the proper control signals. It's mostly the carry handling, both going in and coming out.

Arlet · Post by **Arlet** » Mon Nov 16, 2020 10:11 am

Hi Dave,

I have made a small change in the ALU carry logic (generic only), so that it knows whether to do a regular carry or BCD carry. It passes Dormann's tests, but if you get a chance, I'd appreciate if you can test it on your hardware to make sure I didn't break anything.

The next step will be to simplify the control logic for the carry signal, collapsing 3 distinct cases into 1.

hoglet · Post by **hoglet** » Mon Nov 16, 2020 12:17 pm

Arlet wrote:

I have made a small change in the ALU carry logic (generic only), so that it knows whether to do a regular carry or BCD carry. It passes Dormann's tests, but if you get a chance, I'd appreciate if you can test it on your hardware to make sure I didn't break anything.

It certainly works, but there's a slight impact to the overall speed:
- the old version makes timing at 80MHz (12.404ns)
- the new version doesn't (13.453ns)

Here's the critical path in the old version:

--------------------------------------------------------------------------------
Slack (setup path):     0.096ns (requirement - (data path - clock path skew + uncertainty))
  Source:               Mram_ram19 (RAM)
  Destination:          Mram_ram27 (RAM)
  Requirement:          12.500ns
  Data Path Delay:      12.071ns (Levels of Logic = 6)
  Clock Path Skew:      -0.073ns (0.750 - 0.823)
  Source Clock:         cpu_clk_BUFG rising at 0.000ns
  Destination Clock:    cpu_clk_BUFG rising at 12.500ns
  Clock Uncertainty:    0.260ns

  Clock Uncertainty:          0.260ns  ((TSJ^2 + TIJ^2)^1/2 + DJ) / 2 + PE
    Total System Jitter (TSJ):  0.070ns
    Total Input Jitter (TIJ):   0.000ns
    Discrete Jitter (DJ):       0.450ns
    Phase Error (PE):           0.000ns

  Maximum Data Path at Slow Process Corner: Mram_ram19 to Mram_ram27
    Location             Delay type         Delay(ns)  Physical Resource
                                                       Logical Resource(s)
    -------------------------------------------------  -------------------
    RAMB16_X0Y0.DOB0     Trcko_DOB             2.100   Mram_ram19
                                                       Mram_ram19
    SLICE_X7Y29.A6       net (fanout=1)        2.495   N43
    SLICE_X7Y29.A        Tilo                  0.259   cpu/abl/Madd_n0055_lut<4>
                                                       Mmux_cpu_DI5_SW0
    SLICE_X7Y29.D3       net (fanout=7)        0.420   N8
    SLICE_X7Y29.D        Tilo                  0.259   cpu/abl/Madd_n0055_lut<4>
                                                       cpu/abl/Mmux_base51
    SLICE_X8Y31.A6       net (fanout=2)        0.818   cpu/abl/Madd_n0055_lut<4>
    SLICE_X8Y31.A        Tilo                  0.254   cpu/abh/Madd_base[7]_PWR_7_o_add_3_OUT_cy<3>
                                                       cpu/abl/Madd_n0055_cy<4>11
    SLICE_X8Y31.B6       net (fanout=6)        0.157   cpu/abl/Madd_n0055_cy<4>
    SLICE_X8Y31.B        Tilo                  0.254   cpu/abh/Madd_base[7]_PWR_7_o_add_3_OUT_cy<3>
                                                       cpu/abl/Mmux_CO12
    SLICE_X9Y29.B5       net (fanout=7)        0.464   cpu/abl/Mmux_CO11
    SLICE_X9Y29.B        Tilo                  0.259   cpu_AB<13>
                                                       cpu/abl/Mmux_CO13_SW8
    SLICE_X8Y32.A5       net (fanout=2)        0.689   N168
    SLICE_X8Y32.A        Tilo                  0.254   cpu/abh/Mmux_ADH31
                                                       cpu/abh/Mmux_ADH41_1
    RAMB16_X1Y30.ADDRB11 net (fanout=16)       2.989   cpu/abh/Mmux_ADH41
    RAMB16_X1Y30.CLKB    Trcck_ADDRB           0.400   Mram_ram27
                                                       Mram_ram27
    -------------------------------------------------  ---------------------------
    Total                                     12.071ns (4.039ns logic, 8.032ns route)
                                                       (33.5% logic, 66.5% route)

Here's the critical path in the new version:

Paths for end point Mram_ram26 (RAMB16_X0Y0.ADDRB9), 2727 paths
--------------------------------------------------------------------------------
Slack (setup path):     -0.953ns (requirement - (data path - clock path skew + uncertainty))
  Source:               Mram_ram2 (RAM)
  Destination:          Mram_ram26 (RAM)
  Requirement:          12.500ns
  Data Path Delay:      13.211ns (Levels of Logic = 6)
  Clock Path Skew:      0.018ns (0.706 - 0.688)
  Source Clock:         cpu_clk_BUFG rising at 0.000ns
  Destination Clock:    cpu_clk_BUFG rising at 12.500ns
  Clock Uncertainty:    0.260ns

  Clock Uncertainty:          0.260ns  ((TSJ^2 + TIJ^2)^1/2 + DJ) / 2 + PE
    Total System Jitter (TSJ):  0.070ns
    Total Input Jitter (TIJ):   0.000ns
    Discrete Jitter (DJ):       0.450ns
    Phase Error (PE):           0.000ns

  Maximum Data Path at Slow Process Corner: Mram_ram2 to Mram_ram26
    Location             Delay type         Delay(ns)  Physical Resource
                                                       Logical Resource(s)
    -------------------------------------------------  -------------------
    RAMB16_X1Y0.DOB0     Trcko_DOB             2.100   Mram_ram2
                                                       Mram_ram2
    SLICE_X13Y33.D6      net (fanout=1)        2.937   N9
    SLICE_X13Y33.D       Tilo                  0.259   cpu_AB<15>
                                                       Mmux_cpu_DI1_SW0
    SLICE_X12Y34.B6      net (fanout=8)        0.460   N16
    SLICE_X12Y34.B       Tilo                  0.254   cpu_AB<14>
                                                       cpu/abl/Madd_BUS_0005_GND_5_o_add_9_OUT_Madd_lut<0>
    SLICE_X10Y35.A4      net (fanout=1)        0.892   cpu/abl/Madd_BUS_0005_GND_5_o_add_9_OUT_Madd_lut<0>
    SLICE_X10Y35.COUT    Topcya                0.472   cpu/abl/Madd_BUS_0005_GND_5_o_add_9_OUT_Madd_cy<3>
                                                       cpu/abl/Madd_BUS_0005_GND_5_o_add_9_OUT_Madd_lut<0>_rt
                                                       cpu/abl/Madd_BUS_0005_GND_5_o_add_9_OUT_Madd_cy<3>
    SLICE_X10Y36.CIN     net (fanout=1)        0.003   cpu/abl/Madd_BUS_0005_GND_5_o_add_9_OUT_Madd_cy<3>
    SLICE_X10Y36.DMUX    Tcind                 0.267   cpu/abl/BUS_0005_GND_5_o_add_9_OUT<7>
                                                       cpu/abl/Madd_BUS_0005_GND_5_o_add_9_OUT_Madd_cy<7>
    SLICE_X12Y34.D4      net (fanout=10)       1.015   cpu/abl/Madd_BUS_0005_GND_5_o_add_9_OUT_Madd_cy<7>
    SLICE_X12Y34.D       Tilo                  0.254   cpu_AB<14>
                                                       cpu/abh/Mmux_ADH21_SW2
    SLICE_X13Y32.B3      net (fanout=1)        0.585   N206
    SLICE_X13Y32.B       Tilo                  0.259   cpu_AB_next<9>
                                                       cpu/abh/Mmux_ADH21
    RAMB16_X0Y0.ADDRB9   net (fanout=34)       3.054   cpu_AB_next<9>
    RAMB16_X0Y0.CLKB     Trcck_ADDRB           0.400   Mram_ram26
                                                       Mram_ram26
    -------------------------------------------------  ---------------------------
    Total                                     13.211ns (4.265ns logic, 8.946ns route)
                                                       (32.3% logic, 67.7% route)
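
As a cross-check, the two cycle times quoted earlier in this post (12.404ns and 13.453ns) can be recovered from the reports by plugging their own numbers into the slack and clock-uncertainty formulas they print. This is only a restatement of the reports' arithmetic; the symbol U for clock uncertainty is shorthand introduced here.

```latex
\begin{align*}
U &= \frac{\sqrt{\mathrm{TSJ}^2 + \mathrm{TIJ}^2} + \mathrm{DJ}}{2} + \mathrm{PE}
   = \frac{\sqrt{0.070^2 + 0^2} + 0.450}{2} + 0 = 0.260\ \text{ns} \\
\text{old: }\ & 12.071 - (-0.073) + 0.260 = 12.404\ \text{ns},
   \qquad \text{slack} = 12.500 - 12.404 = 0.096\ \text{ns} \\
\text{new: }\ & 13.211 - 0.018 + 0.260 = 13.453\ \text{ns},
   \qquad \text{slack} = 12.500 - 13.453 = -0.953\ \text{ns}
\end{align*}
```

In both cases routing accounts for roughly two thirds of the data path, so placement differences between runs can easily move the result by this much.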

Let me give Smart Explorer a try in both cases...

Dave

Arlet · Post by **Arlet** » Mon Nov 16, 2020 12:43 pm

Interestingly, this long path has nothing to do with the changes I did.

Anyway, I will continue with the modifications, and when it's all done, then see if the performance can be improved. One area of improvement will be registering the control signals in the new state machine code.

hoglet · Post by **hoglet** » Mon Nov 16, 2020 1:17 pm

FYI, here are the two Smart Explorer runs:

Old:

New:

It does seem to be consistently worse.

Dave

Arlet · Post by **Arlet** » Mon Nov 16, 2020 1:57 pm

I've made some more changes to generic ALU/microcode for BCD carry flag. No idea if this makes things worse or better.

hoglet · Post by **hoglet** » Mon Nov 16, 2020 3:21 pm

Arlet wrote:

I've made some more changes to generic ALU/microcode for BCD carry flag. No idea if this makes things worse or better.

Somewhere in between; there was one run that meets timing at 80MHz:

Dave

Arlet · Post by **Arlet** » Tue Nov 17, 2020 6:52 am

I've added the BCD support to the FSM controller, but that made it grow a lot bigger. Before doing interrupts/reset, I want to see if I can understand why it is so big, and if there's anything that can be done to improve it. Also, I don't like that the outputs are not registered.

If I remove all control signal outputs, and just keep the state machine, I get 29 flops, which makes sense because the state machine is one-hot encoded, and there's a flop for each state. There are 26 LUTs used, which seems reasonable given that some states don't need one, while other states need one or a few for address mode decoding, or for incorporating some choices based on the independent flops.

If you right click on 'Synthesize', you can select 'Process Properties ...' which shows the dialog below. On there, you can choose between various automatic state machine implementations, or 'None', to stick to whatever is written in the source code. When I do 'None', I get 5 flops, which is much smaller, but I get 35 LUTs, which is bigger than for one-hot.

Since each FPGA slice has a total of 8 flops, each driven by a LUT5, it is a good idea to keep the number of flops and LUTs balanced. There's no point in using the compact encoding to save flops, because they will be wasted anyway in the slices that are used for logic only. Also, while it is possible to use flops in a slice without the corresponding LUT, and use the LUT for something else, I think that should be avoided, since it creates more routing pressure. Ideally, you would have a balance, where each state flop uses exactly one LUT to drive it.

Of course, the state machine is only part of the total control module. Since the states drive the control signals, the exact state encoding impacts how much logic is needed for the other parts. The disadvantage for one-hot is that it produces a very wide "bus" of state signals that make it hard to select other muxes.

Arlet · Post by **Arlet** » Tue Nov 17, 2020 7:52 am

Looking at the state machine, and the schematic that was generated, one property sticks out: during the SYNC state, we have to look at the DB inputs to go to the right state for the addressing mode. Outside the SYNC state, we never look at DB.

For instance, the IMM0 state is entered when handling an immediate operand. The only way to get to that state is from SYNC, while the DB has one of these patterns:

                8'b1010_00?0:  next = IMM0;      // LDY#/LDX# 
                8'b11?0_00?0:  next = IMM0;      // CPX#/CPY# 
                8'b???0_1001:  next = IMM0;      // col 9 even

It is clear why the one-hot encoding works so well, because the logic needed to match this IMM0 pattern is only needed in front of the dedicated IMM0 flop. On the other hand, when you choose a compact encoding, you need this logic in front of all/most flops.
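
To make that concrete, here is a minimal sketch of what the dedicated IMM0 flop could look like with one-hot encoding. It is not taken from the actual core: the module name, the `sync` input (standing for the one-hot SYNC state flop) and the `match` signal are assumptions for illustration, and RDY handling is ignored.

```verilog
// Hypothetical sketch: one-hot IMM0 state flop with its private opcode decode.
module imm0_onehot (
    input  wire       clk,
    input  wire       sync,   // assumed: high while the FSM is in the SYNC state
    input  wire [7:0] DB,     // opcode on the data bus during SYNC
    output reg        imm0    // one-hot state bit for IMM0
);
    reg match;

    // The opcode pattern match is only needed here, in front of this flop.
    always @* begin
        casez (DB)
            8'b1010_00?0: match = 1'b1;   // LDY#/LDX#
            8'b11?0_00?0: match = 1'b1;   // CPX#/CPY#
            8'b???0_1001: match = 1'b1;   // col 9 even
            default:      match = 1'b0;
        endcase
    end

    // IMM0 is only ever entered from SYNC.
    always @(posedge clk)
        imm0 <= sync & match;
endmodule
```

With a compact encoding, the same `match` term would instead have to be folded into the next-state logic of every state bit whose value differs between SYNC and IMM0.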

Looking at the state machine, there are 12 states leading out of the SYNC state (including looping back to SYNC). So, maybe, if we use a mixed encoding, with each of these 12 states encoded by a single flop, and then a few additional flops to hold all the other, simpler, states, we get a better result.
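
A toy version of such a mixed encoding might look like the sketch below. Everything in it is hypothetical: only two of the twelve SYNC successors are shown, the `abs0` state, the `go_*` decode inputs and the `tail` counter are invented for illustration, and `go_imm0`/`go_abs0` are assumed to be mutually exclusive.

```verilog
// Hypothetical sketch of a mixed encoding: one flop per state entered
// from SYNC, plus a small binary field for the simple follow-on states.
module mixed_encoding_sketch (
    input  wire       clk,
    input  wire       rst,
    input  wire       go_imm0,  // assumed: opcode decode for IMM0, valid in SYNC
    input  wire       go_abs0,  // assumed: opcode decode for ABS0, valid in SYNC
    output reg        sync,     // one-hot part
    output reg        imm0,
    output reg        abs0,
    output reg  [1:0] tail      // compact part: follow-on states after ABS0
);
    always @(posedge clk) begin
        if (rst) begin
            sync <= 1'b1;
            imm0 <= 1'b0;
            abs0 <= 1'b0;
            tail <= 2'd0;
        end else begin
            // States reached from SYNC each get a dedicated flop.
            imm0 <= sync & go_imm0;
            abs0 <= sync & go_abs0;
            // ABS0 -> tail 1 -> tail 2 -> SYNC, with no DB decoding needed.
            tail <= abs0 ? 2'd1 : (tail == 2'd1) ? 2'd2 : 2'd0;
            // Return to SYNC from IMM0 or the last tail state, or stay there.
            sync <= imm0 | (tail == 2'd2) | (sync & ~go_imm0 & ~go_abs0);
        end
    end
endmodule
```

The point of the split is that the wide opcode decode only feeds the one-hot flops, while the states that never look at DB share a couple of bits.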

BigEd · Post by **BigEd** » Tue Nov 17, 2020 9:45 am

Sounds good to me: a log(n) encoding and a one-hot encoding both fail to take any structure into account. A mixed encoding which brings in some knowledge of the problem should be a win. (I have in the past used two-hot and I think even three-hot, but again it's a form of brute force.)

Arlet · Post by **Arlet** » Tue Nov 17, 2020 9:55 am

I'm doing some experiments with various encoding schemes, running them through synthesis, and seeing what the tools generate. It is interesting, because some of the things you intuitively think are useful don't actually matter. For instance, the IMM0 patterns above show that DB[2] and DB[4] are always zero. The tools have exploited this by creating a 4-input LUT and a 6-input LUT, with the following logic expression:

IMM0 = SYNC & !DB[2] & !DB[4] & LUT6(DB[0], DB[1], DB[3], DB[5], DB[6], DB[7]);

This means that there's no need for any clever don't cares. The LUT6 can pick out any arbitrary set of opcodes from the remaining 64.
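
One way to convince yourself of this is a small self-checking testbench. The sketch below is not part of the core: the module name, `is_imm0` and `lut6` are made up for illustration, and the SYNC term is left out because it is common to both sides. It fills a 64-entry truth table from the three IMM0 patterns with DB[2] and DB[4] held at zero, then compares the factored expression against the original casez for all 256 opcodes.

```verilog
// Hypothetical testbench: compare the casez decode with the factored form
// !DB[2] & !DB[4] & LUT6(DB[0], DB[1], DB[3], DB[5], DB[6], DB[7]).
module imm0_factor_check;
    reg [63:0] lut6;      // truth table of the 6-input LUT
    reg [7:0]  d;
    integer    i, errors;

    // Reference decode, copied from the IMM0 patterns.
    function is_imm0;
        input [7:0] op;
        begin
            casez (op)
                8'b1010_00?0: is_imm0 = 1'b1;
                8'b11?0_00?0: is_imm0 = 1'b1;
                8'b???0_1001: is_imm0 = 1'b1;
                default:      is_imm0 = 1'b0;
            endcase
        end
    endfunction

    initial begin
        // Build the LUT contents; index bits are {DB7, DB6, DB5, DB3, DB1, DB0}.
        for (i = 0; i < 64; i = i + 1) begin
            d = {i[5], i[4], i[3], 1'b0, i[2], 1'b0, i[1], i[0]};
            lut6[i] = is_imm0(d);
        end

        // Compare both forms over every possible opcode.
        errors = 0;
        for (i = 0; i < 256; i = i + 1) begin
            d = i[7:0];
            if (is_imm0(d) !==
                (~d[2] & ~d[4] & lut6[{d[7], d[6], d[5], d[3], d[1], d[0]}]))
                errors = errors + 1;
        end

        $display("mismatches: %0d", errors);
        $finish;
    end
endmodule
```

Any other subset of the 64 opcodes with DB[2] = DB[4] = 0 would just change the contents of `lut6`, not the structure.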

Arlet · Post by **Arlet** » Tue Nov 17, 2020 10:02 am

By the way, I use a simple spreadsheet with a 16x16 opcode grid, and then use background coloring to mark the patterns.
