My new verilog 65C02 core.

Topics relating to PALs, CPLDs, FPGAs, and other PLDs used for the support or creation of 65-family processors, both hardware and HDL.
hoglet
Posts: 367
Joined: 29 Jun 2014

Re: My new verilog 65C02 core.

Post by hoglet »

Hi Arlet,

I've hit possibly the most subtle issue yet; it depends on the combination of:
- RDY being used (this is the most important pre-condition)
- Read-Modify-Write to ZP (e.g. ASL, ROL)
- Decimal Mode (for ASL anyway)

The bug results in the N flag being inconsistent with the result.

In following tests, location &80 is the flags after the ASL, and bit 7 (the N flag) should always match bit 7 of the result (in &81).

First, with decimal mode off:
capture11.png
capture11.png (2.29 KiB) Viewed 1280 times
capture12.png
capture12.png (2.45 KiB) Viewed 1280 times
Then, with decimal mode on:
capture9.png
capture9.png (2.3 KiB) Viewed 1280 times
capture10.png
capture10.png (2.46 KiB) Viewed 1280 times
I think I'm correctly following your suggested pattern for using RDY with syncronous memory. I also checked your implementation of RDY, and this looks solid to me. Every use of posedge clk is qualified by RDY, and this is the only use of RDY.

This is with the generic version of your core (I haven't tried the spartan6 version for this failure)

Looking at the microcode, there does seem to be an unexpected (to me) difference between &14C and &1CC

Code: Select all

@14C  000_00001_01_000_0111_10011_00_00101100     // ASL, NZC   -> @12C
@1CC  100_00001_01_000_0111_10011_00_00011111     // ASL, NZC   -> @11F
If I change the test code to use ROL, then I see failures in both cases (i.e. N=0 in both cases, even though the result is &82).

The microcode for ROL is:

Code: Select all

@152  100_00001_01_000_0111_10011_10_00011111     // ROL, NZC   -> @11F
@1D2  100_00001_01_000_0111_10011_10_00011111     // ROL, NZC   -> @11F
Any ideas?

Should all of these routines finish at &12C/&1AC?

Dave
User avatar
Arlet
Posts: 2353
Joined: 16 Nov 2010
Location: Gouda, The Netherlands
Contact:

Re: My new verilog 65C02 core.

Post by Arlet »

Do you know how the RDY input is behaving ?
User avatar
Arlet
Posts: 2353
Joined: 16 Nov 2010
Location: Gouda, The Netherlands
Contact:

Re: My new verilog 65C02 core.

Post by Arlet »

I haven't been able to reproduce it yet, but I suspect I know what it is, and it is related to the microcode differences you're seeing.
hoglet
Posts: 367
Joined: 29 Jun 2014

Re: My new verilog 65C02 core.

Post by hoglet »

Arlet wrote:
Do you know how the RDY input is behaving ?
That's a bit hard to determine exactly.

This is in a project called BeebAccelerator, which replaces the 6502 in the Beeb with a Xilinx XC6SLX9 FPGA (currently the same hardware as my ICE. For testing with your new core, the FPGA maps 0x0000-0x2FFF, 0x8000-0xFBFF and 0xFF00-0xFFFF to "internal" block RAM.

The core is currently running at 50MHz. If cycle is deemed "internal" then RDY is kept high and the access is served from block RAM. If a cycle is deemed external, then RDY is taken low immediately, until the Beeb's 6502 clock is at an appropriate phase, when RDY is taken high again to allow the external cycle to complete.

With the test case, zero page and stack accesses are internal, but all code fetches are external (&4xxx is in external memory). So RDY will be low most of the time, which just occasional blips high for 1-2 cycles.

However, I still see the bug if I run the code at &2xxx, which is internal memory. In this case, I would expect RDY to be high most of the time. But that depends on what address your core puts out in dead cycles.

Dave
User avatar
Arlet
Posts: 2353
Joined: 16 Nov 2010
Location: Gouda, The Netherlands
Contact:

Re: My new verilog 65C02 core.

Post by Arlet »

The only dead cycle is a double opcode fetch (from the correct address) in the PHA/PLA instructions.

I'm having trouble reproducing it, probably because of differences in the memory interface, so I will just apply the fix that I have in mind, and then you can let me know if it works.
User avatar
Arlet
Posts: 2353
Joined: 16 Nov 2010
Location: Gouda, The Netherlands
Contact:

Re: My new verilog 65C02 core.

Post by Arlet »

Hopefully fix in github.
hoglet
Posts: 367
Joined: 29 Jun 2014

Re: My new verilog 65C02 core.

Post by hoglet »

Arlet wrote:
Hopefully fix in github.
With those changes, the system doesn't boot at all,

Let me switch back to the ICE and see if I can get a bit more info on where it's now failing.

Dave
hoglet
Posts: 367
Joined: 29 Jun 2014

Re: My new verilog 65C02 core.

Post by hoglet »

It looks like it's crashed here:

Code: Select all

DA20 : 26 FC    : ROL $FC
DA22 : 20 EB EE : JSR $EEEB
Bus cycles are:

Code: Select all

0 26 1 1 1
1 fc 0 1 1
2 18 0 1 1
3 30 0 0 1
DA20 : 26 FC    : ROL FC         : 4
0 20 1 1 1
1 20 0 1 1
2 da 0 0 1
3 23 0 0 1
4 eb 0 1 1
DA22 : 20 20 EB : JSR EB20       : 5
The bus cycles for the JSR look all wrong, it looks like the PC has failed to increment after the opcode, so the wrong address is called, and the wrong return address (off by one) is pushed.

Dave
User avatar
Arlet
Posts: 2353
Joined: 16 Nov 2010
Location: Gouda, The Netherlands
Contact:

Re: My new verilog 65C02 core.

Post by Arlet »

Ok, I forgot to update address mode bits when copy/pasting lines in microcode. Fixed.

I guess Klaus doesn't do JSR after ROL :)
hoglet
Posts: 367
Joined: 29 Jun 2014

Re: My new verilog 65C02 core.

Post by hoglet »

Thanks Arlet, that now boots again and seems to have sorted the RDY original problem. :D

Here's it running with an 80MHz clock:
capture14.png
capture14.png (1.83 KiB) Viewed 1270 times
The generic version doesn't quite meet timing at 80MHz:

Code: Select all

Paths for end point Mram_ram1 (RAMB16_X0Y8.ADDRB7), 2365 paths
--------------------------------------------------------------------------------
Slack (setup path):     -0.043ns (requirement - (data path - clock path skew + uncertainty))
  Source:               Mram_ram1 (RAM)
  Destination:          Mram_ram1 (RAM)
  Requirement:          12.500ns
  Data Path Delay:      12.283ns (Levels of Logic = 5)
  Clock Path Skew:      0.000ns
  Source Clock:         cpu_clk_BUFG rising at 0.000ns
  Destination Clock:    cpu_clk_BUFG rising at 12.500ns
  Clock Uncertainty:    0.260ns

  Clock Uncertainty:          0.260ns  ((TSJ^2 + TIJ^2)^1/2 + DJ) / 2 + PE
    Total System Jitter (TSJ):  0.070ns
    Total Input Jitter (TIJ):   0.000ns
    Discrete Jitter (DJ):       0.450ns
    Phase Error (PE):           0.000ns

  Maximum Data Path at Slow Process Corner: Mram_ram1 to Mram_ram1
    Location             Delay type         Delay(ns)  Physical Resource
                                                       Logical Resource(s)
    -------------------------------------------------  -------------------
    RAMB16_X0Y8.DOB0     Trcko_DOB             2.100   Mram_ram1
                                                       Mram_ram1
    SLICE_X12Y38.D6      net (fanout=8)        2.733   ram_dout<0>
    SLICE_X12Y38.CMUX    Topdc                 0.456   cpu/abl/AHL<7>
                                                       Mmux_cpu_DI11_SW2_F
                                                       Mmux_cpu_DI11_SW2
    SLICE_X12Y38.A2      net (fanout=1)        0.741   N144
    SLICE_X12Y38.A       Tilo                  0.254   cpu/abl/AHL<7>
                                                       cpu/abl/Madd_BUS_0005_GND_5_o_add_9_OUT_Madd_lut<0>
    SLICE_X16Y41.A4      net (fanout=1)        0.762   cpu/abl/Madd_BUS_0005_GND_5_o_add_9_OUT_Madd_lut<0>
    SLICE_X16Y41.COUT    Topcya                0.474   cpu/abl/Madd_BUS_0005_GND_5_o_add_9_OUT_Madd_cy<3>
                                                       cpu/abl/Madd_BUS_0005_GND_5_o_add_9_OUT_Madd_lut<0>_rt
                                                       cpu/abl/Madd_BUS_0005_GND_5_o_add_9_OUT_Madd_cy<3>
    SLICE_X16Y42.CIN     net (fanout=1)        0.003   cpu/abl/Madd_BUS_0005_GND_5_o_add_9_OUT_Madd_cy<3>
    SLICE_X16Y42.DQ      Tito_logic            0.798   cpu/abl/BUS_0005_GND_5_o_add_9_OUT<7>
                                                       cpu/abl/Madd_BUS_0005_GND_5_o_add_9_OUT_Madd_cy<7>
                                                       cpu/abl/BUS_0005_GND_5_o_add_9_OUT<7>_rt
    SLICE_X14Y43.C5      net (fanout=1)        0.465   cpu/abl/BUS_0005_GND_5_o_add_9_OUT<7>
    SLICE_X14Y43.CMUX    Tilo                  0.403   cpu_AB<7>
                                                       cpu/abl/Mmux_ADL83_G
                                                       cpu/abl/Mmux_ADL83
    RAMB16_X0Y8.ADDRB7   net (fanout=24)       2.694   cpu_AB_next<7>
    RAMB16_X0Y8.CLKB     Trcck_ADDRB           0.400   Mram_ram1
                                                       Mram_ram1
    -------------------------------------------------  ---------------------------
    Total                                     12.283ns (4.885ns logic, 7.398ns route)
                                                       (39.8% logic, 60.2% route)
For reference, this is your old core, which I had previously been using, that does just make timing:

Code: Select all

Paths for end point Mram_ram8 (RAMB16_X1Y4.ADDRB10), 66 paths
--------------------------------------------------------------------------------
Slack (setup path):     0.574ns (requirement - (data path - clock path skew + uncertainty))
  Source:               cpu/state_FSM_FFd3 (FF)
  Destination:          Mram_ram8 (RAM)
  Requirement:          12.500ns
  Data Path Delay:      11.701ns (Levels of Logic = 3)
  Clock Path Skew:      0.035ns (0.741 - 0.706)
  Source Clock:         cpu_clk_BUFG rising at 0.000ns
  Destination Clock:    cpu_clk_BUFG rising at 12.500ns
  Clock Uncertainty:    0.260ns

  Clock Uncertainty:          0.260ns  ((TSJ^2 + TIJ^2)^1/2 + DJ) / 2 + PE
    Total System Jitter (TSJ):  0.070ns
    Total Input Jitter (TIJ):   0.000ns
    Discrete Jitter (DJ):       0.450ns
    Phase Error (PE):           0.000ns

  Maximum Data Path at Slow Process Corner: cpu/state_FSM_FFd3 to Mram_ram8
    Location             Delay type         Delay(ns)  Physical Resource
                                                       Logical Resource(s)
    -------------------------------------------------  -------------------
    SLICE_X15Y40.CQ      Tcko                  0.430   cpu/state_FSM_FFd29
                                                       cpu/state_FSM_FFd3
    SLICE_X23Y44.A1      net (fanout=4)        2.053   cpu/state_FSM_FFd3
    SLICE_X23Y44.A       Tilo                  0.259   N235
                                                       cpu/state__n16181
    SLICE_X15Y44.B3      net (fanout=2)        1.437   cpu/_n1618
    SLICE_X15Y44.B       Tilo                  0.259   cpu/Mmux_AB1012
                                                       cpu/Mmux_AB101311
    SLICE_X21Y47.A6      net (fanout=16)       1.308   cpu/Mmux_AB10131
    SLICE_X21Y47.A       Tilo                  0.259   cpu_AB<12>
                                                       cpu/Mmux_AB612
    RAMB16_X1Y4.ADDRB10  net (fanout=25)       5.296   cpu_AB_next<10>
    RAMB16_X1Y4.CLKB     Trcck_ADDRB           0.400   Mram_ram8
                                                       Mram_ram8
    -------------------------------------------------  ---------------------------
    Total                                     11.701ns (1.607ns logic, 10.094ns route)
                                                       (13.7% logic, 86.3% route)
The spartan6 version is a tad faster and meets timing at 80MHz, but not by much:

Code: Select all

Paths for end point Mram_ram7 (RAMB16_X1Y4.ADDRB12), 23337 paths
--------------------------------------------------------------------------------
Slack (setup path):     0.072ns (requirement - (data path - clock path skew + uncertainty))
  Source:               Mram_rom16 (RAM)
  Destination:          Mram_ram7 (RAM)
  Requirement:          12.500ns
  Data Path Delay:      12.149ns (Levels of Logic = 6)
  Clock Path Skew:      -0.019ns (0.185 - 0.204)
  Source Clock:         cpu_clk_BUFG rising at 0.000ns
  Destination Clock:    cpu_clk_BUFG rising at 12.500ns
  Clock Uncertainty:    0.260ns

  Clock Uncertainty:          0.260ns  ((TSJ^2 + TIJ^2)^1/2 + DJ) / 2 + PE
    Total System Jitter (TSJ):  0.070ns
    Total Input Jitter (TIJ):   0.000ns
    Discrete Jitter (DJ):       0.450ns
    Phase Error (PE):           0.000ns

  Maximum Data Path at Slow Process Corner: Mram_rom16 to Mram_ram7
    Location             Delay type         Delay(ns)  Physical Resource
                                                       Logical Resource(s)
    -------------------------------------------------  -------------------
    RAMB16_X1Y6.DOA0     Trcko_DOA             2.100   Mram_rom16
                                                       Mram_rom16
    SLICE_X15Y45.D5      net (fanout=1)        3.001   N341
    SLICE_X15Y45.D       Tilo                  0.259   cpu/abh/ABH<3>
                                                       inst_LPM_MUX711
    SLICE_X15Y45.A3      net (fanout=1)        0.359   rom_dout<7>
    SLICE_X15Y45.A       Tilo                  0.259   cpu/abh/ABH<3>
                                                       Mmux_cpu_DI81
    SLICE_X13Y44.A6      net (fanout=7)        0.405   cpu_DI<7>
    SLICE_X13Y44.A       Tilo                  0.259   cpu/abl/base<4>
                                                       cpu/abl/adl_base_mux/mux8_3[7].mux8_3
    SLICE_X12Y43.D5      net (fanout=1)        0.423   cpu/abl/base<7>
    SLICE_X12Y43.DMUX    Topdd                 0.555   cpu_AB_next<7>
                                                       cpu/abl/abl_add/add[7].add/LUT5
                                                       cpu/abl/abl_add/carry_h
    SLICE_X14Y43.A4      net (fanout=8)        0.543   cpu/abl_co
    SLICE_X14Y43.COUT    Topcya                0.495   cpu_AB<10>
                                                       cpu/abh/adh_add/add[0].add/LUT5
                                                       cpu/abh/adh_add/carry_l
    SLICE_X14Y44.CIN     net (fanout=1)        0.003   cpu/abh/adh_add/COL<3>
    SLICE_X14Y44.AMUX    Tcina                 0.210   cpu_AB<15>
                                                       cpu/abh/adh_add/carry_h
    RAMB16_X1Y4.ADDRB12  net (fanout=25)       2.878   cpu_AB_next<12>
    RAMB16_X1Y4.CLKB     Trcck_ADDRB           0.400   Mram_ram7
                                                       Mram_ram7
    -------------------------------------------------  ---------------------------
    Total                                     12.149ns (4.537ns logic, 7.612ns route)
                                                       (37.3% logic, 62.7% route)
If you update the microcode for the spartan6 version with these recent fixes I'll give that a try in the Beeb as well.

Dave
User avatar
Arlet
Posts: 2353
Joined: 16 Nov 2010
Location: Gouda, The Netherlands
Contact:

Re: My new verilog 65C02 core.

Post by Arlet »

spartan 6 fix in github.
User avatar
Arlet
Posts: 2353
Joined: 16 Nov 2010
Location: Gouda, The Netherlands
Contact:

Re: My new verilog 65C02 core.

Post by Arlet »

Timing looks to be dominated by long paths to/from block RAM. Did you run smartexplorer ?

The path you quoted for the old core is not representative, because it's an unusually small bit of logic that happened to get really long routing delays.

An issue that the new core suffers from is that it needs to be tied to a ROM for the microcode, which limits the routing options. Hopefully it will improve with the distributed version.
User avatar
Arlet
Posts: 2353
Joined: 16 Nov 2010
Location: Gouda, The Netherlands
Contact:

Re: My new verilog 65C02 core.

Post by Arlet »

The ASL+RDY+SED problem was because of an earlier method that I came up with for calculating the NZ flags after a read-modify-write. The problem is that the ALU result happens a cycle earlier than other instructions, because the result still needs to be written back to memory, and then the next opcode needs to be fetched.

My solution was to snoop on the data bus during the write cycle, and grab that data to determine the NZ flags (the C flag already needs to be stored in a flop for BCD carry)

Later, I came up with a better method, where I just let the ALU perform the same operation for 2 cycles in a row, and then use the 1st one to write the data to memory, and use the 2nd cycle for the flags. I tested that method with binary ASL, then did INC/DEC, but didn't do the rest until later. But then with all the porting back and forth between generic/spartan, the changes must have gotten lost.

And apparently, snooping the write cycle doesn't work properly if you use RDY.
hoglet
Posts: 367
Joined: 29 Jun 2014

Re: My new verilog 65C02 core.

Post by hoglet »

The Spartan6 microcode changes seem to work.
Arlet wrote:
Timing looks to be dominated by long paths to/from block RAM. Did you run smartexplorer ?
No I haven't - it's not a tool I'm particularly familiar with.

Am I likely to see much of a difference?
Arlet wrote:
The path you quoted for the old core is not representative, because it's an unusually small bit of logic that happened to get really long routing delays.
Understood.

When you use large block RAMs, the system speed does seem to be dominated by routing delays associated with high fanout nets used to construct the large RAM.
Arlet wrote:
An issue that the new core suffers from is that it needs to be tied to a ROM for the microcode, which limits the routing options. Hopefully it will improve with the distributed version.
Yes, this is quite painful in Beeb Accelerator, as I would like to use the full 64K. The Xililx tools don't work well with inferred RAMs that are not a power of 2. So if you try to inferr a 62K RAM (leaving 2K for the microcode ROM), it ends up overallocating and then fails. So to try out your core I've had to constuct the memory myself (currently using a 16K and 32K chunks and an additional mux). This may also be impacting performance.

So I'm very much looking forward to the option of the microcode being implemented as distributed RAM.

Anyway, keep up the good work, it's been fun debugging these (minor) issues.

Dave
hoglet
Posts: 367
Joined: 29 Jun 2014

Re: My new verilog 65C02 core.

Post by hoglet »

Oh, and one last thing.

Even though it doesn't come close to meeting timing at 100MHz, it does in fact run, and appears very stable:
capture15.png
capture15.png (1.84 KiB) Viewed 1259 times
The 15% CPI increase is nice to have.
IMG_2129.JPG
And these are the slower -2 parts....

What speed parts have you been using?

Dave
Last edited by hoglet on Sat Nov 14, 2020 4:47 pm, edited 1 time in total.
Post Reply