My new verilog 65C02 core.
Re: My new verilog 65C02 core.
Yes, I'll do RDY too, but that's easy.
- GARTHWILSON
Re: My new verilog 65C02 core.
I've kept quiet until now because I don't have anything to contribute. This is sounding really good though, and I look forward to what appears to be close to being a finished, available product.
http://WilsonMinesCo.com/ lots of 6502 resources
The "second front page" is http://wilsonminesco.com/links.html .
What's an additional VIA among friends, anyhow?
Re: My new verilog 65C02 core.
A quick run through ISE of just the control logic shows 140 LUTs/40 slices. That's a fairly big chunk. It meets 350 MHz timing, though, so that's nice.
I'll have to take a look at the design that was produced, see if there's anything strange going on. I haven't really tried to optimize any of this, yet. There could be some opportunities in using different bit patterns. Except for the ABL bits, most are randomly chosen.
Re: My new verilog 65C02 core.
The BCD logic is giving me a bit of trouble. I may have to make some modifications to the ALU module in order to make it easier to generate the proper control signals. It's mostly the carry handling, both going in and coming out.
Re: My new verilog 65C02 core.
Hi Dave,
I have made a small change in the ALU carry logic (generic only), so that it knows whether to do a regular carry or BCD carry. It passes Dormann's tests, but if you get a chance, I'd appreciate if you can test it on your hardware to make sure I didn't break anything.
The next step will be to simplify the control logic for the carry signal, collapsing 3 distinct cases into 1.
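To illustrate what "regular carry or BCD carry" means for an adder, here is a minimal sketch of an 8-bit add with a mode input. It is not the ALU from this core: the module name `bcd_add_sketch`, its ports, and the "add 6 when a digit exceeds 9" correction are assumptions chosen for illustration, and it covers addition of valid BCD operands only (no subtraction, no flags other than carry).

```verilog
// Hypothetical sketch, not the core's ALU: 8-bit add with selectable
// binary or BCD carry handling. Addition only.
module bcd_add_sketch (
    input  wire [7:0] a,
    input  wire [7:0] b,
    input  wire       ci,   // carry in
    input  wire       bcd,  // 1 = decimal mode, 0 = binary mode
    output wire [7:0] sum,
    output wire       co    // carry out
);
    // Low digit: 5-bit sum keeps the binary carry in bit 4.
    wire [4:0] lo     = a[3:0] + b[3:0] + ci;
    wire       lo_adj = bcd & (lo > 5'd9);
    wire [4:0] lo_fix = lo + (lo_adj ? 5'd6 : 5'd0);

    // Carry between digits: decimal carry in BCD mode, bit 4 otherwise.
    wire       hc     = bcd ? lo_adj : lo[4];

    // High digit, same scheme.
    wire [4:0] hi     = a[7:4] + b[7:4] + hc;
    wire       hi_adj = bcd & (hi > 5'd9);
    wire [4:0] hi_fix = hi + (hi_adj ? 5'd6 : 5'd0);

    assign sum = {hi_fix[3:0], lo_fix[3:0]};
    assign co  = bcd ? hi_adj : hi[4];
endmodule
```

The point of the sketch is that the carry out of each digit comes from a different place depending on the mode, which is the kind of selection the control logic otherwise has to handle as separate cases.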
I have made a small change in the ALU carry logic (generic only), so that it knows whether to do a regular carry or BCD carry. It passes Dormann's tests, but if you get a chance, I'd appreciate if you can test it on your hardware to make sure I didn't break anything.
The next step will be to simplify the control logic for the carry signal, collapsing 3 distinct cases into 1.
Re: My new verilog 65C02 core.
Arlet wrote:
I have made a small change in the ALU carry logic (generic only), so that it knows whether to do a regular carry or BCD carry. It passes Dormann's tests, but if you get a chance, I'd appreciate if you can test it on your hardware to make sure I didn't break anything.
- the old version makes timing at 80MHz (12.404ns)
- the new version doesn't (13.453ns):
Here's the critical path in the old version:
Code: Select all
--------------------------------------------------------------------------------
Slack (setup path): 0.096ns (requirement - (data path - clock path skew + uncertainty))
Source: Mram_ram19 (RAM)
Destination: Mram_ram27 (RAM)
Requirement: 12.500ns
Data Path Delay: 12.071ns (Levels of Logic = 6)
Clock Path Skew: -0.073ns (0.750 - 0.823)
Source Clock: cpu_clk_BUFG rising at 0.000ns
Destination Clock: cpu_clk_BUFG rising at 12.500ns
Clock Uncertainty: 0.260ns
Clock Uncertainty: 0.260ns ((TSJ^2 + TIJ^2)^1/2 + DJ) / 2 + PE
Total System Jitter (TSJ): 0.070ns
Total Input Jitter (TIJ): 0.000ns
Discrete Jitter (DJ): 0.450ns
Phase Error (PE): 0.000ns
Maximum Data Path at Slow Process Corner: Mram_ram19 to Mram_ram27
Location Delay type Delay(ns) Physical Resource
Logical Resource(s)
------------------------------------------------- -------------------
RAMB16_X0Y0.DOB0 Trcko_DOB 2.100 Mram_ram19
Mram_ram19
SLICE_X7Y29.A6 net (fanout=1) 2.495 N43
SLICE_X7Y29.A Tilo 0.259 cpu/abl/Madd_n0055_lut<4>
Mmux_cpu_DI5_SW0
SLICE_X7Y29.D3 net (fanout=7) 0.420 N8
SLICE_X7Y29.D Tilo 0.259 cpu/abl/Madd_n0055_lut<4>
cpu/abl/Mmux_base51
SLICE_X8Y31.A6 net (fanout=2) 0.818 cpu/abl/Madd_n0055_lut<4>
SLICE_X8Y31.A Tilo 0.254 cpu/abh/Madd_base[7]_PWR_7_o_add_3_OUT_cy<3>
cpu/abl/Madd_n0055_cy<4>11
SLICE_X8Y31.B6 net (fanout=6) 0.157 cpu/abl/Madd_n0055_cy<4>
SLICE_X8Y31.B Tilo 0.254 cpu/abh/Madd_base[7]_PWR_7_o_add_3_OUT_cy<3>
cpu/abl/Mmux_CO12
SLICE_X9Y29.B5 net (fanout=7) 0.464 cpu/abl/Mmux_CO11
SLICE_X9Y29.B Tilo 0.259 cpu_AB<13>
cpu/abl/Mmux_CO13_SW8
SLICE_X8Y32.A5 net (fanout=2) 0.689 N168
SLICE_X8Y32.A Tilo 0.254 cpu/abh/Mmux_ADH31
cpu/abh/Mmux_ADH41_1
RAMB16_X1Y30.ADDRB11 net (fanout=16) 2.989 cpu/abh/Mmux_ADH41
RAMB16_X1Y30.CLKB Trcck_ADDRB 0.400 Mram_ram27
Mram_ram27
------------------------------------------------- ---------------------------
Total 12.071ns (4.039ns logic, 8.032ns route)
(33.5% logic, 66.5% route)
Here's the critical path in the new version:
Code: Select all
Paths for end point Mram_ram26 (RAMB16_X0Y0.ADDRB9), 2727 paths
--------------------------------------------------------------------------------
Slack (setup path): -0.953ns (requirement - (data path - clock path skew + uncertainty))
Source: Mram_ram2 (RAM)
Destination: Mram_ram26 (RAM)
Requirement: 12.500ns
Data Path Delay: 13.211ns (Levels of Logic = 6)
Clock Path Skew: 0.018ns (0.706 - 0.688)
Source Clock: cpu_clk_BUFG rising at 0.000ns
Destination Clock: cpu_clk_BUFG rising at 12.500ns
Clock Uncertainty: 0.260ns
Clock Uncertainty: 0.260ns ((TSJ^2 + TIJ^2)^1/2 + DJ) / 2 + PE
Total System Jitter (TSJ): 0.070ns
Total Input Jitter (TIJ): 0.000ns
Discrete Jitter (DJ): 0.450ns
Phase Error (PE): 0.000ns
Maximum Data Path at Slow Process Corner: Mram_ram2 to Mram_ram26
Location Delay type Delay(ns) Physical Resource
Logical Resource(s)
------------------------------------------------- -------------------
RAMB16_X1Y0.DOB0 Trcko_DOB 2.100 Mram_ram2
Mram_ram2
SLICE_X13Y33.D6 net (fanout=1) 2.937 N9
SLICE_X13Y33.D Tilo 0.259 cpu_AB<15>
Mmux_cpu_DI1_SW0
SLICE_X12Y34.B6 net (fanout=8) 0.460 N16
SLICE_X12Y34.B Tilo 0.254 cpu_AB<14>
cpu/abl/Madd_BUS_0005_GND_5_o_add_9_OUT_Madd_lut<0>
SLICE_X10Y35.A4 net (fanout=1) 0.892 cpu/abl/Madd_BUS_0005_GND_5_o_add_9_OUT_Madd_lut<0>
SLICE_X10Y35.COUT Topcya 0.472 cpu/abl/Madd_BUS_0005_GND_5_o_add_9_OUT_Madd_cy<3>
cpu/abl/Madd_BUS_0005_GND_5_o_add_9_OUT_Madd_lut<0>_rt
cpu/abl/Madd_BUS_0005_GND_5_o_add_9_OUT_Madd_cy<3>
SLICE_X10Y36.CIN net (fanout=1) 0.003 cpu/abl/Madd_BUS_0005_GND_5_o_add_9_OUT_Madd_cy<3>
SLICE_X10Y36.DMUX Tcind 0.267 cpu/abl/BUS_0005_GND_5_o_add_9_OUT<7>
cpu/abl/Madd_BUS_0005_GND_5_o_add_9_OUT_Madd_cy<7>
SLICE_X12Y34.D4 net (fanout=10) 1.015 cpu/abl/Madd_BUS_0005_GND_5_o_add_9_OUT_Madd_cy<7>
SLICE_X12Y34.D Tilo 0.254 cpu_AB<14>
cpu/abh/Mmux_ADH21_SW2
SLICE_X13Y32.B3 net (fanout=1) 0.585 N206
SLICE_X13Y32.B Tilo 0.259 cpu_AB_next<9>
cpu/abh/Mmux_ADH21
RAMB16_X0Y0.ADDRB9 net (fanout=34) 3.054 cpu_AB_next<9>
RAMB16_X0Y0.CLKB Trcck_ADDRB 0.400 Mram_ram26
Mram_ram26
------------------------------------------------- ---------------------------
Total 13.211ns (4.265ns logic, 8.946ns route)
(32.3% logic, 67.7% route)
Dave
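For readers following the numbers, the slack lines in the two reports can be reproduced from the formula printed in the report itself. The symbols below ($T_\text{req}$, $t_\text{data}$, $t_\text{skew}$, $t_\text{unc}$) are shorthand introduced here for the report's Requirement, Data Path Delay, Clock Path Skew and Clock Uncertainty fields; all values are in ns and taken from the listings above.

```latex
\begin{align*}
\text{slack} &= T_\text{req} - \left(t_\text{data} - t_\text{skew} + t_\text{unc}\right) \\[4pt]
\text{old:}\quad & 12.500 - \bigl(12.071 - (-0.073) + 0.260\bigr) = 12.500 - 12.404 = 0.096 \\
\text{new:}\quad & 12.500 - \bigl(13.211 - 0.018 + 0.260\bigr) = 12.500 - 13.453 = -0.953 \\[4pt]
t_\text{unc} &= \frac{\sqrt{\mathrm{TSJ}^2 + \mathrm{TIJ}^2} + \mathrm{DJ}}{2} + \mathrm{PE}
             = \frac{\sqrt{0.070^2 + 0^2} + 0.450}{2} + 0 = 0.260
\end{align*}
```

The bracketed totals, 12.404 ns and 13.453 ns, are the two figures quoted in the bullet points, measured against the 12.5 ns period of an 80 MHz clock.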
Re: My new verilog 65C02 core.
Interestingly, this long path has nothing to do with the changes I did.
Anyway, I will continue with the modifications, and when it's all done, then see if the performance can be improved. One area of improvement will be registering the control signals in the new state machine code.
Re: My new verilog 65C02 core.
FYI, here are the two Smart Explorer runs:
Old: New: It does seem to be consistently worse.
Dave
Re: My new verilog 65C02 core.
I've made some more changes to generic ALU/microcode for BCD carry flag. No idea if this makes things worse or better.
Re: My new verilog 65C02 core.
Arlet wrote:
I've made some more changes to generic ALU/microcode for BCD carry flag. No idea if this makes things worse or better.
Re: My new verilog 65C02 core.
I've added the BCD support to the FSM controller, but that made it grow a lot bigger. Before doing interrupts/reset, I want to see if I can understand why it is so big, and if there's anything that can be done to improve it. Also, I don't like that the outputs are not registered.
If I remove all control signal outputs, and just keep the state machine, I get 29 flops, which makes sense because the state machine is one-hot encoded, and there's a flop for each state. There are 26 LUTs used, which seems reasonable given that some states don't need one, while other states need one or a few for address mode decoding, or for incorporating some choices based on the independent flops.
If you right click on 'Synthesize', you can select 'Process Properties ...' which shows the dialog below. On there, you can choose between various automatic state machine implementations, or 'None', to stick to whatever is written in the source code. When I do 'None', I get 5 flops, which is much smaller, but I get 35 LUTs, which is bigger than for one-hot.
Since each FPGA slice has a total of 8 flops, each driven by a LUT5, it is a good idea to keep the number of flops and LUTs balanced. There's no point in using the compact encoding to save flops, because they will be wasted anyway in the slices that are used for logic only. Also, while it is possible to use flops in a slice without the corresponding LUT, and use the LUT for something else, I think that should be avoided, since it creates more routing pressure. Ideally, you would have a balance, where each state flop uses exactly one LUT to drive it.
Of course, the state machine is only part of the total control module. Since the states drive the control signals, the exact state encoding impacts how much logic is needed for the other parts. The disadvantage of one-hot is that it produces a very wide "bus" of state signals, which makes it hard to use them as select inputs for other muxes.
Last edited by Arlet on Tue Nov 17, 2020 7:58 am, edited 1 time in total.
Re: My new verilog 65C02 core.
Looking at the state machine, and the schematic that was generated, one property sticks out: during the SYNC state, we have to look at the DB inputs to go to the right state for the addressing mode. Outside the SYNC state, we never look at DB.
For instance, the IMM0 state is entered when handling an immediate operand. The only way to get to that state is from SYNC, while the DB has one of these patterns:
Code: Select all
8'b1010_00?0: next = IMM0; // LDY#/LDX#
8'b11?0_00?0: next = IMM0; // CPX#/CPY#
8'b???0_1001: next = IMM0; // col 9 even
It is clear why the one-hot encoding works so well, because the logic needed to match this IMM0 pattern is only needed in front of the dedicated IMM0 flop. On the other hand, when you choose a compact encoding, you need this logic in front of all/most flops.
Looking at the state machine, there are 12 states leading out of the SYNC state (including looping back to SYNC). So, maybe, if we use a mixed encoding, with each of these 12 states encoded by a single flop, and then a few additional flops to hold all the other, simpler, states, we get a better result.
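As a rough sketch of the one-hot idea, here is a toy two-state machine with one flop per state. It is not the core's controller: the module name `sync_decode_sketch`, the reset behaviour, and the rule that IMM0 always returns to SYNC are assumptions, and "everything else loops back to SYNC" stands in for the other branches out of SYNC. Only the three IMM0 patterns are taken from the post.

```verilog
// Hypothetical toy FSM, not the core's controller: one flop per state.
module sync_decode_sketch (
    input  wire       clk,
    input  wire       rst,
    input  wire [7:0] DB,
    output reg        SYNC,   // state flop: opcode fetch
    output reg        IMM0    // state flop: immediate operand fetch
);
    // Opcode match for the immediate addressing mode (patterns from the post).
    reg imm0_match;
    always @* begin
        casez (DB)
            8'b1010_00?0: imm0_match = 1'b1; // LDY#/LDX#
            8'b11?0_00?0: imm0_match = 1'b1; // CPX#/CPY#
            8'b???0_1001: imm0_match = 1'b1; // col 9 even
            default:      imm0_match = 1'b0;
        endcase
    end

    always @(posedge clk) begin
        if (rst) begin
            SYNC <= 1'b1;
            IMM0 <= 1'b0;
        end else begin
            // DB is only examined while SYNC is set.
            IMM0 <= SYNC & imm0_match;
            // Assumption: IMM0 returns to SYNC; any other opcode stays in SYNC.
            SYNC <= IMM0 | (SYNC & ~imm0_match);
        end
    end
endmodule
```

In a fuller one-hot version each of the other states reached from SYNC would get its own flop with its own opcode match in front of it, while states that are only ever entered from one predecessor need little or no decode logic at all.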
Re: My new verilog 65C02 core.
Sounds good to me: a log(n) encoding and a one-hot encoding both fail to take any structure into account. A mixed encoding which brings in some knowledge of the problem should be a win. (I have in the past used two-hot and I think even three-hot, but again it's a form of brute force.)
Re: My new verilog 65C02 core.
I'm doing some experiments with various encoding schemes, running them through synthesis to see what the tools generate. It is interesting, because some of the things you intuitively think are useful don't actually matter. For instance, the IMM0 patterns above show that DB[2] and DB[4] are always zero. The tools have exploited this by creating a 4-input LUT and a 6-input LUT, with the following logic expression:
Code: Select all
IMM0 = SYNC & !DB[2] & !DB[4] & LUT6(DB[0], DB[1], DB[3], DB[5], DB[6], DB[7]);
This means that there's no need for any clever don't cares. The LUT6 can pick out any arbitrary set of opcodes from the remaining 64.
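To make that decomposition concrete, here is a hand-written sketch of the same split: two fixed-zero bits plus an arbitrary function of the remaining six. The module name `imm0_lut6_sketch` and the function name `lut6` are made up for illustration, and the 6-bit patterns are simply the three IMM0 patterns with DB[4] and DB[2] removed; this is not the netlist the tools produced.

```verilog
// Hypothetical sketch of the 4-input / 6-input split described in the post.
module imm0_lut6_sketch (
    input  wire       SYNC,
    input  wire [7:0] DB,
    output wire       IMM0
);
    // Stand-in for the LUT6 contents. Input order: {DB7,DB6,DB5,DB3,DB1,DB0}.
    function lut6(input [5:0] x);
        casez (x)
            6'b101_0?0: lut6 = 1'b1; // from 8'b1010_00?0 (LDY#/LDX#)
            6'b11?_0?0: lut6 = 1'b1; // from 8'b11?0_00?0 (CPX#/CPY#)
            6'b???_101: lut6 = 1'b1; // from 8'b???0_1001 (col 9 even)
            default:    lut6 = 1'b0;
        endcase
    endfunction

    // 4-input AND: SYNC, the two always-zero bits, and the LUT6 output.
    assign IMM0 = SYNC & ~DB[2] & ~DB[4]
                & lut6({DB[7], DB[6], DB[5], DB[3], DB[1], DB[0]});
endmodule
```

Because the six-input function is just a 64-entry table, changing which opcodes it selects costs nothing in area, which is why carefully arranged don't-care patterns buy nothing once DB[2] and DB[4] have been factored out.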
Re: My new verilog 65C02 core.
By the way, I use a simple spreadsheet with a 16x16 opcode grid, and then use background coloring to mark the patterns.
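For anyone who prefers a simulator to a spreadsheet, a similar 16x16 view can be printed from the patterns themselves. This is only a sketch: the module name `opcode_grid` and function name `is_imm0` are made up, rows are taken to be the high nibble and columns the low nibble, and only the three IMM0 patterns quoted earlier are included.

```verilog
// Hypothetical helper: print a 16x16 opcode grid, '#' where a pattern matches.
module opcode_grid;
    // Returns 1 when the opcode matches one of the IMM0 patterns.
    function is_imm0(input [7:0] op);
        casez (op)
            8'b1010_00?0: is_imm0 = 1'b1; // LDY#/LDX#
            8'b11?0_00?0: is_imm0 = 1'b1; // CPX#/CPY#
            8'b???0_1001: is_imm0 = 1'b1; // col 9 even
            default:      is_imm0 = 1'b0;
        endcase
    endfunction

    integer   row, col;
    reg [7:0] op;

    initial begin
        $display("   0123456789ABCDEF");
        for (row = 0; row < 16; row = row + 1) begin
            $write("%h: ", row[3:0]);
            for (col = 0; col < 16; col = col + 1) begin
                op = {row[3:0], col[3:0]};  // high nibble = row, low = column
                if (is_imm0(op)) $write("#");
                else             $write(".");
            end
            $write("\n");
        end
    end
endmodule
```

Adding more states is a matter of adding more functions and a different mark character per state, which gives roughly the same picture as the background colouring.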