My new verilog 65C02 core.

Arlet · Post by **Arlet** » Tue Nov 17, 2020 10:15 am

The RTS0, JSR0, BRK0, RTI0 states are all coming from column 0, so those are implemented like this:

LUT6( DB[4], DB[5], DB[6], DB[7], SYNC, LUT4(DB[3:0]) )

Instead of one-hot encoding these 4 states, we could try to exploit the similarity by putting them in a more compact group, perhaps also incorporating the follow-up sequence states in some way. These 4 instructions alone take care of half the total states, and they all use the stack in some way, so I'm thinking there could be some benefit in grouping them.
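To make the LUT structure above concrete, here is a minimal sketch of the column-0 decode in Verilog. The module and signal names (`col0_decode`, `col0`, `brk0`, ...) are made up for illustration and are not taken from the actual core; it only models entry into these states from SYNC, with one shared LUT4 on DB[3:0] feeding one 6-input function per state.

```verilog
// Sketch only: names are assumptions, not the core's real signals.
module col0_decode(
    input  [7:0] DB,
    input        sync,
    output       brk0, jsr0, rti0, rts0
);
    // Pre-decode layer: a single 4-input function shared by all four states.
    wire col0 = (DB[3:0] == 4'h0);

    // Each output depends on exactly 6 signals: DB[7:4], sync and col0,
    // so each fits in one LUT6.
    assign brk0 = sync & col0 & (DB[7:4] == 4'h0);   // 00 BRK
    assign jsr0 = sync & col0 & (DB[7:4] == 4'h2);   // 20 JSR
    assign rti0 = sync & col0 & (DB[7:4] == 4'h4);   // 40 RTI
    assign rts0 = sync & col0 & (DB[7:4] == 4'h6);   // 60 RTS
endmodule
```

The four outputs share all of their inputs and differ only in the DB[7:4] pattern, which is the similarity a more compact group encoding could exploit.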

Arlet · Post by **Arlet** » Tue Nov 17, 2020 10:45 am

Something else I realized is that some attempts to minimize states could actually hinder optimization. I was looking at all the 12 initial decode states, and most had very simple 2 layer inputs, like the IMM0 ones, with 1xLUT4 and 1xLUT6.

But looking at the ABS0 state, it was 2xLUT6, with a bunch of extra state inputs. The cause of this was some early optimization I did. For instance, when doing a JSR instruction, we go: SYNC -> JSR0 -> JSR1 -> JSR2 -> ABS0 -> ABS1 -> DATA -> SYNC.

For a one-hot encoding that's probably a good idea, because each extra state means an extra flop, so you want to minimize states. For a mixed encoding, it may be better to add JSR3, JSR4, etc., reusing the same state bits. Perhaps a one/two-hot group encoding, plus a T0-T6 sequence counter.

BigEd · Post by **BigEd** » Tue Nov 17, 2020 11:10 am

Just a thought: if you permute the axes of your 16x16 table according to a Gray code, to make a Karnaugh map, does that help anything jump out at you?

Arlet · Post by **Arlet** » Tue Nov 17, 2020 11:24 am

Not really. Gray code brings some things closer together, but moves other things further apart.

I isolated the part of the design where the DB input is used to determine the first state transition from SYNC to any of the 12 initial states, based on addressing mode and special instructions. Each of those 12 states was one-hot encoded.

Total design requires 1 LUT for each output (obviously), plus 5 additional LUTs for 1 layer of "pre-decoding". The don't cares for undocumented opcodes do end up helping, especially to find the default SYNC -> SYNC transition. Because some of the LUTs can be shared in the same LUT6, a total of 12 LUT6s are needed. That's pretty good.

always @(*)
    if( sync )
        casez( DB )
            8'b0000_0000:  next = BRK0;      // BRK
            8'b0010_0000:  next = JSR0;      // JSR
            8'b0100_0000:  next = RTI0;      // RTI
            8'b0110_0000:  next = RTS0;      // RTS
            8'b1000_0000:  next = COND;      // BRA
            8'b???1_0000:  next = COND;      // other branches
            8'b0?10_1000:  next = PULL;      // PLA/PLP
            8'b0?00_1000:  next = PUSH;      // PHA/PHP
            8'b?111_1010:  next = PULL;      // PLX/PLY
            8'b?101_1010:  next = PUSH;      // PHY/PHX
            8'b????_0001:  next = IND0;      // col 1 (ZP,X)/(ZP),Y
            8'b???1_0010:  next = IND0;      // col 2 odd (ZP)
            8'b????_0000:  next = IMM0;      // anything in col 0 not already done
            8'b???0_0010:  next = IMM0;      // col 2 even (LDX#IMM)
            8'b???0_1001:  next = IMM0;      // col 9 even 
            8'b???1_1001:  next = ABS0;      // col 9 odd 
            8'b????_11??:  next = ABS0;      // col c,d,e,f
            8'b????_01??:  next = ZPG0;      // col 4,5,6,7
            default:       next = SYNC;      // implied & undocumented
        endcase

Arlet · Post by **Arlet** » Tue Nov 17, 2020 4:07 pm

I figured out that if you want to define your own state machine encoding, it's important not to set the "FSM Encoding Algorithm" to "None", like I showed in the dialog before. You need to set it to "User". Also, in the 'case' statement for the state machine, make sure to provide all the valid states as options, and do not use a "default" clause. Edit: you can use "default" but it must lead back to a valid state. I had a transition to "xxx" and that blew up. Also, you need to specify a valid initial state, and not export the state registers directly.

When you do it this way, the synthesis program knows that only the defined states can be reached. Otherwise, it will add extra logic to detect invalid states, which can have a huge impact if you have a sparse encoding.
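As a rough illustration of those rules, here is a small FSM written in that style. The state names, encodings and ports are invented for the example. The "User" setting itself is chosen in the synthesis options as described above, so nothing in the code selects it.

```verilog
// Sketch only: a 3-state machine with a hand-picked (user) encoding.
module fsm_sketch(
    input  clk,
    input  rst,
    input  go,
    output busy               // derived output; 'state' itself is not exported
);
    localparam [1:0] IDLE = 2'b00,
                     RUN0 = 2'b01,
                     RUN1 = 2'b11;   // 2'b10 is deliberately left unused

    reg [1:0] state = IDLE;          // valid initial state
    reg [1:0] next;

    always @(*)
        case( state )                // every valid state is listed
            IDLE:    next = go ? RUN0 : IDLE;
            RUN0:    next = RUN1;
            RUN1:    next = IDLE;
            default: next = IDLE;    // leads back to a valid state, never 'x'
        endcase

    always @(posedge clk)
        if( rst ) state <= IDLE;
        else      state <= next;

    assign busy = (state != IDLE);
endmodule
```

With only IDLE, RUN0 and RUN1 reachable, the tool is free to treat 2'b10 as a don't-care instead of building recovery logic for it.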

hoglet · Post by **hoglet** » Wed Nov 18, 2020 7:58 am

Arlet wrote:

I figured out that if you want to define your own state machine encoding, it's important not to set the "FSM Encoding Algorithm" to "None", like I showed in the dialog before. You need to set it to "User".

Interesting. What does setting it to None actually mean then?

Arlet · Post by **Arlet** » Wed Nov 18, 2020 8:06 am

"None" means that it won't try to extract a FSM at all.

Suppose you have 4 states, with one-hot encoding, in "User" mode, and you write something like "if( state == INIT )", then the tools understand that they only need to check the particular hot bit for the INIT state. If you choose "None", then the generated logic will check all 4 bits.
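A tiny sketch of that difference, with made-up names. Both outputs agree whenever `state` really is one-hot; the point is only how many inputs each expression needs.

```verilog
// Sketch only: 4 one-hot states, INIT assumed to be bit 0.
module onehot_check(
    input  [3:0] state,       // assumed one-hot: exactly one bit set
    output       init_full,   // roughly what you get with "None"
    output       init_bit     // what the tool can reduce it to with "User"
);
    localparam [3:0] INIT = 4'b0001;

    assign init_full = (state == INIT);  // compares all 4 bits
    assign init_bit  = state[0];         // checks only the hot bit
endmodule
```

The two only differ for non-one-hot values such as 4'b0011, which is exactly the knowledge the tool lacks when no FSM is extracted.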

Arlet · Post by **Arlet** » Wed Nov 18, 2020 8:12 am

If you're trying to make a state machine, I recommend checking the Synthesis Report. If it's done correctly, it will have a section on how the FSM is implemented, including a full list of all state encodings. If that section is missing, then something is wrong in the code.

Arlet · Post by **Arlet** » Thu Nov 19, 2020 7:21 am

I removed a couple of the 'clever' tricks in my state machine that I had introduced to reduce the number of states. Instead of having the JMP piggyback on the ABS path, and then diverging, the JMP instruction now has its own states: JMP0->JMP1->SYNC. This same path is used by JSR and BRK. The indirect JMP(IND) also got its own states: IND0->IND1->JMP0->JMP1->SYNC.

This increases the number of states, but it reduces the number of extra inputs we have to deal with. The only 2 extra inputs left are for Read-Modify-Write, and for the BCD adjustment step. Removing those two would require duplicating a bunch of states.

By getting rid of the INIT state (which was only necessary for testing without proper reset support), the total number of states is now exactly 32. My idea is to implement these 32 states in 5 bits, using standard binary encoding.

Each bit gets 1 flop to hold the current state, and a bit of logic, according to schematic. Each state flop is controlled by a 6-input LUT, of which 5 inputs are used to track the current state, and the 6th input has multiple functions, depending on current state. Since the FSM LUT knows the full current state, it knows exactly how to use the 6th input. This 6th input has 4 different sources:

- The opcode decoder, which produces the initial state number for the particular addressing mode/special instruction (probably takes 2 LUTs per bit, using different DB bits for each state bit).
- The RMW input, to add an extra write back state.
- The BCD input, to add an extra BCD adjustment state.
- The IRQ input, to go to the BRK0 state (also used for RST/NMI).

These 4 sources are selected based on 2 of the state bits, using appropriate encoding. The FSM LUT knows to ignore the 6th input when it's not in the proper state.
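A minimal sketch of the shared 6th input for one state bit, under loud assumptions: the post does not say which two state bits do the selecting or which code picks which source, so `state[1:0]` and the assignment of codes below are placeholders, and all names are invented.

```verilog
// Sketch only: select bits and select codes are assumptions.
module sixth_input(
    input  [4:0] state,
    input        dec_bit,   // opcode decoder output for this state bit
    input        rmw,       // read-modify-write instruction
    input        bcd,       // BCD adjustment needed
    input        irq,       // interrupt request (also RST/NMI)
    output reg   x          // the 6th input of the FSM LUT
);
    always @(*)
        case( state[1:0] )  // hypothetical choice of selecting bits
            2'b00: x = dec_bit;
            2'b01: x = rmw;
            2'b10: x = bcd;
            2'b11: x = irq;
        endcase
endmodule
```

The FSM LUT for each state bit then sees `{state, x}`, and simply ignores `x` in states where none of the four sources matters.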

Arlet · Post by **Arlet** » Thu Nov 19, 2020 9:00 am

A naive Verilog implementation of this design results in 5 flops and 20 LUTs. The implementation is different from my schematic, because the synthesis tool has performed global optimizations on the whole thing. It looks like it has extracted the essential ideas, and I doubt that the 20 LUT count can be improved.

I've also renamed the BCD Adjust state to "Adjust BCD", so it can be abbreviated to ABCD.

Edit: in the code, as it is now, the RST shares the IRQ pin, so it can only be taken in the SYNC state. This means that a reset will be ignored until the start of the next instruction. I'll have to think about that one.

BigEd · Post by **BigEd** » Thu Nov 19, 2020 9:42 am

I think Reset is always specced to last for several cycles, so you're probably fine waiting for sync.

Arlet · Post by **Arlet** » Thu Nov 19, 2020 10:13 am

If you want to listen to short reset pulses, it's always possible to add some logic to extend the signal.
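One possible form of that extension logic, as a sketch. It assumes the reset pulse is already synchronous and at least one clock wide, and that `sync` is high in the cycle where the state machine samples the request; the module and port names are invented.

```verilog
// Sketch only: hold a short reset request until the FSM reaches SYNC.
module rst_stretch(
    input  clk,
    input  rst_in,     // possibly short (one clock) reset pulse
    input  sync,       // high in the cycle where the FSM samples IRQ/RST
    output rst_req     // request as seen by the FSM
);
    reg pending = 1'b0;

    always @(posedge clk)
        if( sync )        pending <= 1'b0;   // FSM sees the request this cycle
        else if( rst_in ) pending <= 1'b1;   // remember a pulse seen mid-instruction

    assign rst_req = rst_in | pending;
endmodule
```

A pulse that arrives in the middle of an instruction stays visible on `rst_req` until the first SYNC cycle, and is released right after it.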

Another approach could be to make RST + IRQ act at any cycle inside the state machine, but provide some outside logic to mask the IRQ based on SYNC state.

Or, we could use the IRQ only for interrupts, and then attach the RST input to the synchronous reset input of the state flops. This is the simplest solution, but it has the disadvantage that the combinatorial output of the state machine is messed up for the first cycle of RST. That should probably still be okay, because we're not doing anything useful in the first 3 cycles anyway.

BigEd · Post by **BigEd** » Thu Nov 19, 2020 10:19 am

I think your present story is very like the 6502's - the reset input is like BRK and like IRQ and NMI, in that it replaces the IR with 00. So you always have to wait.

Arlet · Post by **Arlet** » Thu Nov 19, 2020 10:55 am

Adding the control (mode) bits for the address data path only adds 2 more LUTs, which makes sense because this is a 4 bit signal, and each bit can be produced by a single LUT5 based on the state. Because they share all their inputs, these 4 LUT5s can be packed in 2 LUT6s. Total 5 flops and 22 LUTs.

Trying the same design with automatic FSM extraction results in total 32 flops + 35 LUTs.

always @(*)
    case( state )
        ABCD:   mode = 0;
        ABS0:   mode = 4;
        ABS1:   mode = 2;
        BRK0:   mode = 9;
        BRK1:   mode = 8;
        BRK2:   mode = 8;
        BRK3:   mode = 15;
        COND:   mode = 7;
        DATA:   mode = 1;
        IDX0:   mode = 3;
        IDX1:   mode = 12;
        IDX2:   mode = 10;
        IMM0:   mode = 4;
        IND0:   mode = 4;
        IND1:   mode = 2;
        JMP0:   mode = 4;
        JMP1:   mode = 2;
        JSR0:   mode = 9;
        JSR1:   mode = 8;
        JSR2:   mode = 1;
        PULL:   mode = 5;
        PUSH:   mode = 11;
        RDWR:   mode = 0;
        RTI0:   mode = 5;
        RTI1:   mode = 5;
        RTI2:   mode = 5;
        RTI3:   mode = 2;
        RTS0:   mode = 5;
        RTS1:   mode = 5;
        RTS2:   mode = 14;
        SYNC:   mode = 4;
        ZERO:   mode = 3;
    endcase

Below is the schematic of the mode[0] bit. On the top right you can see a LUT6 for the output. Connected to that are 2 more LUTs, looking at a combined total of 15 one-hot state flops. Clearly, one-hot can be a good encoding for the state machine, but it is not very efficient for downstream logic.

Arlet · Post by **Arlet** » Thu Nov 19, 2020 10:59 am

In contrast, with "User" encoding as 5 bit binary, you get a single LUT5 for the same signal, as you would expect:
