6502.org

Posted: **Fri Oct 30, 2020 8:40 am**

Right, the more constraints, the more work you may have to revisit every time you make a change.

I also feel it's a good idea to focus more on area than on speed in the beginning of the design. Try to make it small, then try to assist in placement, then see if you could gain something by trading area for speed.

Posted: **Sat Oct 31, 2020 6:14 am**

I have a simple incrementer in the code that takes the AB register, optionally adds 1, and stores it in the PC. When I first started, I figured I would use the carry chain to implement the +1, but seeing how this restricts placement to SLICEL/SLICEM, I decided to look at regular logic, which turns out to be fairly simple.

Adding 1 to a nibble can be done with four LUT5s, each connected to the 4 data bits, and the increment bit (CI).

Code:

LUT5 #(.INIT(32'h5555aaaa)) inc_0( .O(O[0]), .I0(I[0]), .I1(I[1]), .I2(I[2]), .I3(I[3]), .I4(CI) );
LUT5 #(.INIT(32'h6666cccc)) inc_1( .O(O[1]), .I0(I[0]), .I1(I[1]), .I2(I[2]), .I3(I[3]), .I4(CI) );
LUT5 #(.INIT(32'h7878f0f0)) inc_2( .O(O[2]), .I0(I[0]), .I1(I[1]), .I2(I[2]), .I3(I[3]), .I4(CI) );
LUT5 #(.INIT(32'h7f80ff00)) inc_3( .O(O[3]), .I0(I[0]), .I1(I[1]), .I2(I[2]), .I3(I[3]), .I4(CI) );

The 5th bit would require 6 inputs, but instead of using a LUT6, you can simply generate a lookahead carry using another LUT5 and then repeat the same logic for the upper nibble. The advantage of using this method is that you only use LUT5s, and have them share the same inputs. That means you can pack 8 of them (and 8 flops if you need them), in a single slice. For each nibble, you need one additional LUT5 that needs to be placed in a different slice.

If you just write '+ inc' in Verilog, the ISE tools will generate a similar structure, but use a carry lookahead every 5 bits. Over long chains this is slightly faster, but for 8 or 16 bits, it doesn't make a difference. Doing it per 4 bits has the advantage of not needing a LUT6, allowing better packing.
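As a rough behavioral sketch of the per-nibble scheme, the 8-bit incrementer could be written like this. The module name `inc8_sketch` and the wire `c4` are made up for illustration, and this plain Verilog leaves the LUT mapping to the synthesis tool instead of instantiating LUT5 primitives.

```verilog
// Behavioral sketch only; names are illustrative assumptions.
module inc8_sketch(
    input  [7:0] I,   // value to increment
    input        CI,  // increment bit
    output [7:0] O
);
    // Lookahead carry into the upper nibble: set only when CI is set
    // and the lower nibble is all ones. Five inputs, so it fits a LUT5.
    wire c4 = CI & (&I[3:0]);

    // Each nibble is a function of its 4 data bits plus one carry bit,
    // matching the 5-input tables shown above.
    assign O[3:0] = I[3:0] + {3'b000, CI};
    assign O[7:4] = I[7:4] + {3'b000, c4};
endmodule
```

The upper nibble reuses the same four tables as the lower one, with `c4` taking the place of `CI`.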

Posted: **Sat Oct 31, 2020 6:19 am**

By the way, when you look at the core, you might be surprised that the 'PC' often has the wrong value, and does not consistently point to the instruction it is executing. Most of the time, the AB register effectively works as the program counter. The PC is mostly used as a holding register for when the AB needs to point at data, or when PC needs to be pushed on the stack.

Posted: **Sun Nov 01, 2020 7:23 am**

I'm trying to come up with a better way to do the ABH path. It is supposed to be simpler than the ABL path, because it basically needs to incorporate a mux, plus a carry from the ABL. However, there are a couple of annoying properties that make it hard to come up with a very simple solution.

First of all, there are 3 fixed values it must produce: 00 (zeropage), 01 (stack) and FF (vectors), in addition to 3 choices coming from registers: previous ABH, PCH, or DB. In some cases, we need to add a carry from ABL, but in other cases we must not, even though the ABL could produce a carry. For example, when doing a (ZP,X) instruction where ZP + X generates a carry, we don't want that carry to propagate into the upper byte of the address. Instead, we want to wrap around within zero page. Another example is any stack pull instruction when S=FF, where we want to wrap around inside the stack page. And then finally, we have the case of a backwards branch where we have to calculate ABH - 1.

Combining all these options in a nice design is quite challenging. The operations seem simple enough that you'd think that maybe it could all be combined in a single LUT, which was indeed my first try. It almost works, except for having to introduce some logic between the ABL carry out and the ABH carry in, and not being able to produce FF as an output. The FF can be hacked in a number of ways: when the output goes into a flop, we could use the extra 'set' input, or we could use the slice output latch to create an OR gate. Neither is very elegant, but the bigger problem was the extra logic layer in the carry path, since the carry out from the ABL is already slow.

The second plan was to use carry-select, where I would generate both the ABH for when C=0, and C=1, and then pick the winner. The problem is the backwards branch, which means these blocks need to be able to perform +1 and -1 operations, which then means access to carry chain logic, forcing these blocks into SLICEM/SLICEL slices.
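For illustration, the carry-select idea could be sketched behaviorally as below. The module name, the `backward` flag and the `base` input are assumptions made for this example, not signals from the core. It shows why each candidate block needs both +1 and -1: a backward branch adds FF to the high byte, so the two candidates become ABH-1 and ABH instead of ABH and ABH+1.

```verilog
// Carry-select sketch; all names here are hypothetical.
module abh_carry_select_sketch(
    input  [7:0] base,      // selected high byte (ABH, PCH, ...)
    input        backward,  // 1 when the branch offset is negative
    input        C,         // carry out of the ABL adder (arrives late)
    output [7:0] abh
);
    // Sign extension of the branch offset into the high byte.
    wire [7:0] offs = backward ? 8'hFF : 8'h00;

    // Both candidates are computed before C is known.
    wire [7:0] if_c0 = base + offs;          // result when C = 0
    wire [7:0] if_c1 = base + offs + 8'h01;  // result when C = 1

    // The late carry only drives the final 2:1 mux.
    assign abh = C ? if_c1 : if_c0;
endmodule
```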

My latest plan is to have 2 layers: first layer is a simple mux, selecting between 00, ABH, PCH, DB. Second layer then does +0, +1, -1, or +C, which can be done in 2 slices with carry chain logic (instead of 4). Also, by not using the carry input of the slice, but instead feeding the carry into a regular LUT input, we can combine the adder with the select logic (and maybe some microcode decoding) in the same LUT. It's a fairly clean structure, requiring 8 identical paths, and no extra logic. There's no extra logic in the carry path, and the carry isn't needed until the 2nd layer.

In the slice diagram you can see the "CIN" input at the bottom, where you would normally feed the carry input. Instead, I want to feed the carry into the LUT on the left, and then use the LUT logic to suppress it, based on the operation select. In the same diagram on the right, you can see that the ff/latch has an option to turn into an AND/OR gate instead of a latch or flip-flop.

Posted: **Sun Nov 01, 2020 9:50 am**

Just a thought: you have a carry-in, and you have three constants, FF, 00, 01, which differ by 1. So collapsing two of them might be possible, and using carry-in to adjust this case.

Edit: I reckon you already spotted this! Unsurprisingly.

Also edit:
> And then finally, we have the case of a backwards branch where we have to calculate ABH - 1.

yikes, indeed so.

Posted: **Sun Nov 01, 2020 10:46 am**

Indeed, I now make the 3 constants by having the mux select 00, and then perform -1, +0, +1 in the adder. Similarly, when the mux selects the ABH input, we can create ABH-1, ABH, and ABH+1 for backward branches, same page, and next page overflow.
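A behavioral sketch of this two-layer arrangement might look like the following. The module name, port names, and the 2-bit `sel` and `op` encodings are assumptions made for illustration. The real design maps the second layer onto carry chain primitives, which this sketch does not attempt.

```verilog
// Two-layer ABH sketch; encodings and names are hypothetical.
module abh_sketch(
    input  [7:0] ABH, PCH, DB,
    input  [1:0] sel,   // layer 1: which base value to use
    input  [1:0] op,    // layer 2: +0, +1, -1 or +C
    input        C,     // carry out of ABL
    output [7:0] next_abh
);
    localparam SEL_ZERO = 2'd0, SEL_ABH = 2'd1, SEL_PCH = 2'd2, SEL_DB = 2'd3;
    localparam OP_HOLD  = 2'd0, OP_INC  = 2'd1, OP_DEC  = 2'd2, OP_CARRY = 2'd3;

    // Layer 1: plain mux, no carry involved yet.
    reg [7:0] base;
    always @* begin
        case (sel)
            SEL_ZERO: base = 8'h00;
            SEL_ABH:  base = ABH;
            SEL_PCH:  base = PCH;
            default:  base = DB;
        endcase
    end

    // Layer 2: the operation select decides what is added.
    reg [7:0] delta;
    always @* begin
        case (op)
            OP_HOLD: delta = 8'h00;
            OP_INC:  delta = 8'h01;
            OP_DEC:  delta = 8'hFF;       // -1 in two's complement
            default: delta = {7'b0, C};   // OP_CARRY: follow the ABL carry
        endcase
    end

    assign next_abh = base + delta;
endmodule
```

With `sel = SEL_ZERO`, the three operations `OP_DEC`, `OP_HOLD` and `OP_INC` give FF, 00 and 01, so no separate constant inputs are needed.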

Because the carry chain requires 2 inputs (as you can see in the diagram above), the input LUT is configured as 2 LUT5s. One input goes to the mux output, another input to the carry in (only for bit 0). That leaves 3 bits for encoding 3 different operations, which may create some opportunities to decompress microcode at the same time. I'll still have to look into that. Similarly, the mux also has 3 bits for encoding 3 selections.

There is also the option of replacing the carry chain adder with a lookahead adder, but since we have to do both +1 and -1, this requires more resources than the simple incrementer I'm using for the PC.

Posted: **Sun Nov 01, 2020 4:05 pm**

The new ABH design passes the test suite, and has been committed to the GitHub repo (Spartan-6 only).

I'm running ISE with a minimal test design for address logic only. It's got a code ROM, microcode ROM, and the ABL/ABH address logic. The register file is included for address calculations, but there's no ALU.

I think that in the end the ALU path, including the flag updates, is going to be the longest path, so for the address path I'm optimizing for minimal area. With a few little constraints, I'm getting under 6 ns. If the address logic is at the smallest possible area, but still faster than the ALU path, then it's good enough. For the ALU path, I don't see any big area/speed trade-offs, except maybe in the handling of individual flags.

My primary goal was area, so it's nice if the smallest version is also the fastest.

Posted: **Sun Nov 01, 2020 5:08 pm**

This is slightly annoying. In order to use the core with internal memory, I need access to the combinatorial output of the address calculation, but at the same time, I need a registered version to feed back into the logic.

The slice allows both, so I can place the flop inside the slice, and use the "DQ" output for the registered signal, and use the "DMUX" output for the combinatorial one.

However, I also need access to the "COUT" output of the carry chain. The signal that comes out of the top only goes to the next slice and attaches only to the "CIN" signal so that the carry chains can be extended. When you need access to the carry out, it needs to be routed out the DMUX/DQ output, but then it can't be combined with the other 2 outputs I need.

Edit: in addition, the only way to get both the data and the carry out, is to route one of the signals through the latch to the "DQ" output, causing an extra delay.

Edit2: even worse. As soon as one latch is used as pass-through, all the other latches have to be placed in the same configuration, which means that none of the flops can be used.

Posted: **Mon Nov 02, 2020 7:47 am**

I spent some time trying to come up with a clever encoding scheme for the microcode control word so that I could remove the decoder stage from the ABH mux, but I couldn't find anything good.

However, on the Spartan, the block RAMs have 4 parity bits that are not used, so I was thinking I would just use 2 of those. In the generic branch I'll just leave in the decoder, so that the design remains compatible with devices that don't have parity bits. Allocating extra bits also comes with the advantage of reducing fan-out.

Posted: **Tue Nov 03, 2020 7:14 am**

I've been working mostly on the Spartan-6 target code. In the top-level cpu.v, all logic has been removed, except for driving the output databus. All the flag handling has been incorporated into the ALU. The main reason is to make it easier to optimize the logic across hierarchical boundaries.

For instance, detecting the zero flag condition inside the ALU requires 8 inputs, but we also needed logic in the flag handling to decide when to update the Z register and whether to get the new value from the ALU or from PLP. This means that we may need to look at a total of 10-12 inputs to decide what happens to the Z register. Putting all those signals in one place makes it easier to find the optimal way to combine them in a minimal amount of area/delay.
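To make the input count concrete, here is a minimal sketch of a Z flag register, assuming hypothetical control signals `ld_z` (update Z this cycle) and `plp` (take Z from the pulled status byte). These names and the module are not from the core. Z is bit 1 of the 6502 status byte, so the PLP path reads `M[1]`.

```verilog
// Z flag sketch; control signal names are hypothetical.
module zflag_sketch(
    input            clk,
    input      [7:0] alu_out,  // ALU result: 8 inputs for the zero test
    input      [7:0] M,        // byte read from memory (only M[1] is used)
    input            ld_z,     // hypothetical: update Z this cycle
    input            plp,      // hypothetical: source is the pulled P byte
    output reg       Z
);
    // 8 ALU bits + M[1] + ld_z + plp + the old Z value = 12 inputs
    // feeding one flop, which is why keeping them together helps.
    always @(posedge clk)
        Z <= !ld_z ? Z
           : plp   ? M[1]
           :         (alu_out == 8'h00);
endmodule
```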

Posted: **Tue Nov 03, 2020 8:42 am**

I'm looking into the ALU logic for BCD adjust and flags right now, and I realized something useful. In order to calculate the overflow flag and the BCD adjustments, it is necessary to examine the intermediate carry outputs.

As I showed before, when you need both the regular output of the sum as well as the carry, one of the signals is forced to go through the latch, which restricts all latches in that slice and probably also adds a little bit of extra delay.

However, looking at the slice diagram again, there is also a "D" output that comes straight out of the O6 port of the LUT without any muxes or latch in the path. When you're doing an addition, the O6 port produces the XOR of the two inputs (carry propagate), while the O5 produces the AND (carry generate). The O6 signal then goes into the carry chain block, and is XOR'ed again with the carry from below, producing the sum output bit.

That means if we take the "D" output and XOR it with the "DMUX" output, we can reconstruct that internal carry signal. Obviously, this is only useful when you can combine that XOR with other logic in the same LUT so it is free. Otherwise, it would be a better idea to feed the internal carry through the latch.
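The identity behind this can be checked with a small behavioral sketch. Since each sum bit is the propagate bit XOR'ed with the incoming carry, XOR'ing propagate and sum gives that carry back. The module and signal names below are illustrative assumptions; `p` stands in for what the "D" output would carry and `sum` for the "DMUX" output.

```verilog
// Carry reconstruction sketch; names are hypothetical.
module carry_recover_sketch(
    input  [7:0] a, b,
    input        cin,
    output [7:0] sum,
    output [7:0] carry_in   // carry entering each bit position
);
    wire [7:0] p    = a ^ b;         // carry propagate (the O6 function)
    wire [8:0] full = a + b + cin;   // ordinary 8-bit add with carry out

    assign sum      = full[7:0];
    // sum[i] = p[i] ^ c[i], therefore c[i] = p[i] ^ sum[i].
    assign carry_in = p ^ sum;
endmodule
```

Bit 4 of `carry_in` is the half carry used for BCD adjustment, and bit 7 is the carry into the sign bit that the overflow flag calculation needs.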

Posted: **Tue Nov 03, 2020 2:23 pm**

Fun trick: when doing PLP, the C flag needs to be updated with bit 0 of the byte read from memory. However, doing that requires an extra mux input, plus a 3rd mux select bit, which is annoying.

To reduce the logic, we can take the memory byte, tell the ALU to add a zero from the register file, and then shift the output 1 bit to the right, so that M[0] ends up in the ALU carry out bit.

Similarly, when doing CLC or SEC, we can do some ALU operations that always clear/set the carry out bit.

With these tricks combined, the C flag can be updated with a single LUT5.
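A minimal sketch of the PLP part of the trick, assuming a zero is shifted in at the top (the post doesn't say what the core does there) and using made-up module and port names:

```verilog
// PLP carry sketch; names and the shifted-in zero are assumptions.
module plp_carry_sketch(
    input  [7:0] M,     // status byte pulled from the stack
    output [7:0] out,   // shifted ALU output
    output       cout   // ALU carry out, which feeds the C flag
);
    // Adding zero leaves the memory byte unchanged.
    wire [7:0] sum = M + 8'h00;

    // Shift right by one: bit 0 drops into the carry out position.
    assign {out, cout} = {1'b0, sum};
endmodule
```

The C flag logic then only has to look at the ALU carry out, since M[0] arrives on the same wire as any other carry result.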

Posted: **Tue Nov 03, 2020 4:06 pm**

Arlet wrote:
> the C flag can be updated with a single LUT5.

I noticed something else which I had read in the data sheet, but had not fully appreciated. If you have something like this:

Code:

always @(posedge clk)
    if( en )
        C <= D;

always @(posedge clk)
    A <= B;

This synthesizes into 2 flops. The first one could use the 'enable' input of the flop, but if that happens (or if you make it happen), then the second one can no longer share the same slice, because all the 'enable' inputs are tied together inside a slice.

Now, instead, if you create something like this:

Code:

always @(posedge clk)
    C <= en ? D : C;

Where the old state is fed back into the logic, without using the flop enable input, then you can share it. Of course, this requires that the LUT has sufficient inputs for feedback.

In my case, I tried using a LUT5 for the carry, while using another signal as enable, and then I noticed I could not get another flop in that same slice. By changing the logic to a LUT6 and feeding the enable signal in (the data feedback was already present), this could be fixed. This was particularly nice because the other flop was also used as an input in the same logic, and by putting both in the same slice, the signals can stay inside the slice, which is faster and avoids using any global routing resources.

Posted: **Tue Nov 03, 2020 7:59 pm**

Until now, I had never bothered to actually implement the TRB/TSB instructions. I figured that I had everything I needed, and that it was just a matter of the right microcode. It turned out that I was wrong: the combination of not writing the N flag, the "BIT behavior" for the Z flag, and the read-modify-write didn't really fit in my plans. My original plan was to implement the Z flag behavior outside the regular ALU path. So the ALU would worry about the setting/clearing of the bits, and there would be a separate A & M operation just to get the Z flag. That plan didn't work because flags are set during the opcode fetch of the next instruction, and the previous write cycle had already destroyed the M register. I then thought maybe I could do the A & M, store the zero test, and then update the Z flag. That wasn't very appealing, because the Z flag is already very slow, and I did not want an even wider mux.

So I figured I needed to inhibit the update of the M register during TRB/TSB. I already had a bit for that in the control word, but that didn't work, because during write cycles, that same bit already has another purpose (controlling the data output mux). Fortunately, there was still one spare bit available in the control word. I had only used 31, and I've been saving that last bit for a long time.

After allocating the 32nd bit for M load enable, I suddenly realized that because the read-modify-write of the TRB/TSB instructions and the Z flag calculation are not overlapping at all, the ALU is free to perform the TRB/TSB operation first, and then use the next instruction fetch cycle to perform A & M for the Z flag.
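For reference, the two results involved can be sketched combinationally as below. The module and port names are made up, and the sketch ignores the cycle sequencing described above: in the core the write-back value and the A & M zero test are computed in different cycles by the same ALU, with M held by the new load enable bit.

```verilog
// TRB/TSB sketch; names are hypothetical, sequencing is not modeled.
module trb_tsb_sketch(
    input  [7:0] A,        // accumulator
    input  [7:0] M,        // original memory operand (must be preserved)
    input        tsb,      // 1 = TSB (set bits), 0 = TRB (reset bits)
    output [7:0] wr_data,  // value written back to memory
    output       z         // Z flag value, "BIT behavior"
);
    // Read-modify-write result.
    assign wr_data = tsb ? (M | A) : (M & ~A);

    // Z comes from A & M on the unmodified operand, not from wr_data.
    assign z = ((A & M) == 8'h00);
endmodule
```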

It's very satisfying when a disappointment turns into a better solution.

Posted: **Wed Nov 04, 2020 6:29 am**

Nice! I'll keep fingers crossed that you don't run into the need for a 33rd bit anytime soon.

Are there other instructions which still need to be implemented and might bring surprises? (I thought you had mentioned passing the complete Dormann tests a while back, but I must have misread that?)

My new verilog 65C02 core.