6502-Core Comparisons: Fitting a Xilinx Spartan 2 XC2S200

Topics relating to PALs, CPLDs, FPGAs, and other PLDs used for the support or creation of 65-family processors, both hardware and HDL.
BigEd
Posts: 11463
Joined: 11 Dec 2008
Location: England

Post by BigEd »

As I have a Xilinx session running and retromaster's core close at hand, I thought I'd report on some of the synthesis details of that.

First thing to note: this core supports BCD. In the VHDL there's a lookup table, which synthesis converts to a ROM. Not sure how the ROM is mapped.

The VHDL contains an 'fsm' component, but synthesis hasn't recognised it.

The headline size is even more impressive if I nobble the bcd function: it loses 65 slices (120 LUTs), which is 15%. It also goes from 34MHz to 41MHz (not that this is a carefully constrained synthesis!)

The ALU is written as a pair of nibbles, which is authentic (and may help with BCD):

Code:

Macro Statistics
# ROMs                                                 : 2
 20x5-bit ROM                                          : 2
# Adders/Subtractors                                   : 4
 5-bit adder                                           : 4
# Counters                                             : 1
 16-bit up counter                                     : 1
# Registers                                            : 47
 1-bit register                                        : 39
 8-bit register                                        : 8

When we get to a lower level, the adders have been consolidated:

Code:

# Adders/Subtractors                                   : 2
 5-bit adder carry in                                  : 2

Note that the PC is in this case interpreted as a counter.
Arlet
Posts: 2353
Joined: 16 Nov 2010
Location: Gouda, The Netherlands

Post by Arlet »

Since my model doesn't support BCD, I was just thinking about what it would take to support it.

Looking at the 6502 block diagram, they have a separate Decimal Adjust Adder, that adjusts the byte that's loaded into the Accumulator.

The DAA consists of two 4-bit wide adders, performing an add between the ALU result, and a constant, which is equal to:

6 when the (half)carry bit is set after performing an add
-6 (that is, +10 modulo 16) when the (half) carry bit is cleared after performing a subtract.

I assume the (half) carry logic is modified in the ALU to produce a (half) carry, when the nibble result is > 9.
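
The adjust step described above can be sketched in Python. This is only an illustration under stated assumptions: the function name `decimal_adjust` and its arguments are hypothetical, the flags follow the 6502 convention that a clear carry after a subtract means a borrow, and the (half) carry is assumed to have already been produced by the ALU as a decimal carry.

```python
def decimal_adjust(raw, half_carry, carry, subtract):
    """Apply the per-nibble correction to a raw ALU result (hypothetical model).

    raw        -- 8-bit ALU output before adjustment
    half_carry -- (half) carry out of the low nibble
    carry      -- carry out of the high nibble
    subtract   -- True after a subtract, False after an add
    """
    def constant(flag):
        if subtract:
            # Carry clear means a borrow: add 10, which is -6 modulo 16.
            return 0 if flag else 10
        # Carry set means the digit went past 9: add 6.
        return 6 if flag else 0

    lo = ((raw & 0x0F) + constant(half_carry)) & 0x0F
    hi = ((raw >> 4) + constant(carry)) & 0x0F
    return (hi << 4) | lo


# 44 + 56 = 100: both nibbles leave the ALU as 0xA with their carries set.
print(hex(decimal_adjust(0xAA, True, True, subtract=False)))   # 0x0
# 42 - 15 = 27: binary result 0x2D, the low nibble borrowed, the high one did not.
print(hex(decimal_adjust(0x2D, False, True, subtract=True)))   # 0x27
```

Each nibble gets its own constant, so the two 4-bit adders never need to talk to each other at this stage.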

This seems like a clever way to handle this. First of all, it isn't done in the ALU, but in the next cycle, which is good, because the ALU paths are quite long, and the ALU result -> Accumulator store path is fairly short. This means that there may not be much (if any) impact on the speed.

Secondly, it doesn't take many resources. Since bits 0 and 4 aren't modified, we only need 6 LUTs for the additional adders. With a bit of luck, the constant selection may be incorporated in the same LUTs, but otherwise it would only take a few more. The (half) carry logic needs to be modified too, which would take a couple more LUTs.
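
The claim that bits 0 and 4 are never modified follows from both constants being even. A quick exhaustive check in Python (a sketch only; one LUT per remaining output bit is the assumption behind the figure of 6, and nothing here measures real synthesis results):

```python
# Adding 6 or 10 (-6 modulo 16) to a nibble never changes its lowest bit,
# because both constants are even.
for nibble in range(16):
    for constant in (6, 10):
        adjusted = (nibble + constant) & 0x0F
        assert (adjusted & 1) == (nibble & 1)

# So only bits 1..3 of each of the two nibbles need adder logic.
bits_needing_logic = 2 * 3
print(bits_needing_logic)  # 6
```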

I haven't played with this code in a while, so I don't have my environment ready to go. It'll take a bit of work before I can go and try this out.

Post by BigEd »

By coincidence, Ijor brought this post of his to my attention earlier today - it's of interest for BCD implementation.

See also the patent and Bruce's document (with test code)

(I'm not sure about having an extra cycle.)

I think you're right though - the overhead should not need to be so much as in retromaster's core.

Post by Arlet »

I've just added BCD support to my model. I've run it in the simulator, but I haven't checked the ISE output for LUT increase and performance penalty.

http://ladybug.xs4all.nl/arlet/fpga/6502/source/cpu.v
http://ladybug.xs4all.nl/arlet/fpga/6502/source/ALU.v

I didn't check the flags to see if they work exactly the same.

Actually, the ALU path is one of the longest, and it can be shortened a bit by moving the flag calculation out of that stage and into the next one. Instead of examining the internal result of the ALU and then setting a flag register, the flag logic could examine the registered output of the ALU and derive the flags combinatorially.

This would also allow a simple fix to change the behavior of the Z flag so that it would be applied after the BCD adjustments.
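
To illustrate the Z flag point: in decimal mode, a Z flag taken from the binary sum can disagree with one taken from the adjusted result (the NMOS 6502 is commonly documented as doing the former). A minimal sketch, assuming valid BCD operands; the helper names are hypothetical, and the arithmetic is done with plain integers rather than by modelling the hardware:

```python
def bcd_to_int(value):
    """Convert a packed BCD byte such as 0x44 to the integer 44."""
    return (value >> 4) * 10 + (value & 0x0F)


def int_to_bcd(value):
    """Convert an integer 0..99 to a packed BCD byte."""
    return ((value // 10) << 4) | (value % 10)


def z_flags_for_decimal_add(a, b, carry_in=0):
    """Return (adjusted_result, z_from_binary_sum, z_from_adjusted_result)."""
    binary_sum = (a + b + carry_in) & 0xFF
    total = bcd_to_int(a) + bcd_to_int(b) + carry_in
    adjusted = int_to_bcd(total % 100)
    return adjusted, binary_sum == 0, adjusted == 0


result, z_binary, z_adjusted = z_flags_for_decimal_add(0x44, 0x56)
print(hex(result), z_binary, z_adjusted)  # 0x0 False True
```

Here the accumulator ends up as 00, yet the binary sum was 0x9A, so a Z flag computed before the adjustment stays clear.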

Post by BigEd »

Arlet wrote:
I've just added BCD support to my model. I've run it in the simulator, but I haven't checked the ISE output for LUT increase and performance penalty.
Nice, let's see:

Code:

        flops  slices   LUTs RAM16 HDL      Notes
cpu.v     152     256    486    8  verilog  by Arlet Ottens (no bcd)
cpu.v     155     259    493    8  verilog  by Arlet Ottens (plus bcd mode)

The (indicative) speed goes down from 63MHz to 54MHz, with the logic depth going up from 11 to 15.

That's a minimal increase in area cost. The advanced synth report changes from

Code:

# Adders/Subtractors                                   : 2
 16-bit adder                                          : 1
 9-bit adder carry in                                  : 1

to

Code:

# Adders/Subtractors                                   : 5
 16-bit adder                                          : 1
 4-bit adder                                           : 2
 4-bit adder carry in/out                              : 1
 5-bit adder carry in                                  : 1
# Comparators                                          : 2
 3-bit comparator greatequal                           : 2

Cheers
Ed

Post by Arlet »

Code:

        flops  slices   LUTs RAM16 HDL      Notes
cpu.v     152     256    486    8  verilog  by Arlet Ottens (no bcd)
cpu.v     155     259    493    8  verilog  by Arlet Ottens (plus bcd mode)

Good job by the synthesis tools. I was expecting a bigger impact. I tried to minimize the impact on the speed, but I didn't see any elegant way to avoid the new decimal half-carry calculation in the middle of the long ALU path.
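
To put the figures in proportion, the relative changes can be worked out directly. The resource numbers below are copied from the table above and the clock figures are the indicative ones BigEd reported (63MHz and 54MHz); the helper `percent_change` is just a hypothetical convenience:

```python
# (no bcd, plus bcd mode)
figures = {
    "flops": (152, 155),
    "slices": (256, 259),
    "LUTs": (486, 493),
    "MHz": (63, 54),
}


def percent_change(before, after):
    """Relative change from before to after, in percent."""
    return 100.0 * (after - before) / before


for name, (before, after) in figures.items():
    print(f"{name:6s} {before:4d} -> {after:4d}  {percent_change(before, after):+.1f}%")
```

The area grows by roughly 1 to 2 percent, while the indicative clock rate drops by about 14 percent.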

I'm now working on your suggestions to remove PCLHOLD register. I still have the RTS left to do, but the JMP/JMPI/JSR instructions no longer need it.

Post by Arlet »

The PCLHOLD register has been removed, and everything still passes my simulation tests. I also found that some BCD state flops never got properly initialized on reset, so I fixed that too.

To get rid of the PCLHOLD register I used a trick that Ed mentioned the real 6502 does too. When doing a JSR, there is a problem: the LSB of the new program counter (PCL) appears on the data bus, but it's still too early to actually put it in the program counter, because the old program counter has to be stored on the stack first. I previously used a separate 'PCLHOLD' register to store it temporarily.

The real 6502 apparently stores the PCL in the stack pointer register, so I modified my model to do the same thing.

The reason this works is because we only need to store it for 2 cycles, and during that time, the actual stack pointer is kept in the ALU where it is being decremented.
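
A rough Python sketch of the trick, for illustration only. It is not cycle-exact and is not taken from either the real 6502 or the Verilog model; the `jsr` function, the `alu` temporary, and the flat `mem` list are all hypothetical. The point is that S can hold the new PCL while the real stack pointer sits in the ALU being decremented:

```python
def jsr(pc, s, mem):
    """Execute the JSR at address pc; return the new (pc, s)."""
    pc = (pc + 1) & 0xFFFF            # skip the opcode byte

    # The low byte of the target arrives too early to load into PC.
    # Park it in S, and move the real stack pointer into the ALU.
    alu = s
    s = mem[pc]
    pc = (pc + 1) & 0xFFFF            # pc now points at the last byte of the JSR

    # Push the return address (high byte first); the ALU does the decrementing.
    mem[0x0100 | alu] = pc >> 8
    alu = (alu - 1) & 0xFF
    mem[0x0100 | alu] = pc & 0xFF
    alu = (alu - 1) & 0xFF

    # Fetch the high byte of the target, then put everything where it belongs.
    pc = (mem[pc] << 8) | s           # S gives back the parked low byte
    s = alu                           # the decremented stack pointer returns to S
    return pc, s


mem = [0] * 0x10000
mem[0x0200:0x0203] = [0x20, 0x34, 0x12]   # JSR $1234
print([hex(v) for v in jsr(0x0200, 0xFD, mem)])  # ['0x1234', '0xfb']
```

No extra holding register appears anywhere in the sketch: S and the ALU temporary simply swap roles for the duration of the instruction.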

Pretty clever hack :)
ijor
Posts: 16
Joined: 16 Nov 2010

Post by ijor »

Arlet wrote:
Looking at the 6502 block diagram, they have a separate Decimal Adjust Adder, that adjusts the byte that's loaded into the Accumulator.

The DAA consists of two 4-bit wide adders, performing an add between the ALU result, and a constant, which is equal to:

6 when the (half)carry bit is set after performing an add
-6 (10) when the (half) carry bit is cleared after performing subtract.

I assume the (half) carry logic is modified in the ALU to produce a (half) carry, when the nibble result is > 9.

This seems like a clever way to handle this. First of all, it isn't done in the ALU, but in the next cycle...
Actually that's not how the 6502 works. It doesn't have an extra cycle (as the CMOS part does).

The 6502 computes a decimal half carry in parallel with the ALU. There is then a mux at the output of the nibble adder that selects between the binary carry (computed by the ALU), and the decimal carry.

The decimal adjust is performed later in the path. But because the half carry was already taken care of, the combinatorial depth is much smaller.
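
A small Python model of the carry selection described here, as a sketch under assumptions: the function names are hypothetical, and the decimal carry is derived from the nibble sum for brevity, whereas the post says the real chip computes it in parallel with the adder. The result returned is the ALU output before any decimal adjust.

```python
def nibble_add(a, b, carry_in, decimal):
    """Add two 4-bit values; return (4-bit sum, carry out).

    The binary carry comes from the adder itself. The decimal carry is
    derived separately (here simply as "sum > 9") and a mux picks one.
    """
    total = (a & 0x0F) + (b & 0x0F) + carry_in
    binary_carry = total > 0x0F
    decimal_carry = total > 9
    carry_out = decimal_carry if decimal else binary_carry   # the mux
    return total & 0x0F, int(carry_out)


def alu_add(a, b, carry_in, decimal):
    """8-bit add from two nibble adders; the result is NOT yet decimal adjusted."""
    lo, half_carry = nibble_add(a, b, carry_in, decimal)
    hi, carry = nibble_add(a >> 4, b >> 4, half_carry, decimal)
    return (hi << 4) | lo, half_carry, carry


print([hex(alu_add(0x44, 0x56, 0, decimal)[0]) for decimal in (False, True)])
# ['0x9a', '0xaa']
```

In decimal mode the half carry already feeds the high nibble, so the later adjust only has to add a constant to each nibble independently.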

Post by Arlet »

ijor wrote:
The 6502 computes a decimal half carry in parallel with the ALU. There is then a mux at the output of the nibble adder that selects between the binary carry (computed by the ALU), and the decimal carry.

The decimal adjust is performed later in the path. But because the half carry was already taken care of, the combinatorial depth is much smaller.
Yes, the half carry flag is produced in the ALU, but I was talking about the decimal adjust logic block between the SB bus and the accumulator.

I'm looking at the block diagram, and it doesn't have all the clock information, so I could be wrong about the exact cycle when it happens. I assumed that the Adder Hold Register stores the result of the ALU, and that the result then waits for the next cycle to move over the SB bus, through the DAA block, and into the accumulator.

Anyway, that's how I implemented it (except I don't have an exact SB bus, and the AC is in a register file).

Post by ijor »

Arlet wrote:
Yes, the half carry flag is produced in the ALU, but I was talking about the decimal adjust logic block between the SB bus and the accumulator.
Oh, I misunderstood you, sorry.

Yes, that is performed one cycle later. I thought you meant that an extra cycle is taken specifically for decimal mode. That's what the CMOS CPU does, so as to be able to compute the flags correctly in decimal mode.

Post by BigEd »

Anyone can explore the behaviour of the NMOS 6502 using the visual6502 model, which now allows you to run a program of your choice and tabulate the bus and signal activity per clock phase.

For example
http://visual6502.org/JSSim/expert.html ... 18a9446956
runs this program

Code:

sed
clc
lda #$44
adc #$56

(which I assembled at 6502asm.com)
and produces this tabulation:

Code:

cycle  ab    db  rw  sync  pc    a   x   y   s   p         alucin  alua  alub  alu  cout  dasb  sb
    6  0004  69  1   1     0004  44  00  00  fd  nv‑BDIzc  0       44    44    fc   1     44    44
    6  0004  69  1   1     0004  44  00  00  fd  nv‑BDIzc  0       44    44    88   0     88    88
    7  0005  56  1   0     0005  44  00  00  fd  nv‑BDIzc  0       88    88    88   0     88    88
    7  0005  56  1   0     0005  44  00  00  fd  nv‑BDIzc  0       88    88    10   1     ff    ff
    8  0006  02  1   1     0006  44  00  00  fd  nv‑BDIzc  0       44    56    10   1     44    44
    8  0006  02  1   1     0006  44  00  00  fd  nv‑BDIzc  0       44    56    aa   1     00    aa
    9  0007  00  1   0     0007  00  00  00  fd  NV‑BDIzC  0       aa    aa    aa   1     00    aa
    9  0007  00  1   0     0007  00  00  00  fd  NV‑BDIzC  0       aa    aa    54   1     ff    ff
   10  ffff  00  1   0     0008  00  00  00  fd  NV‑BDIzC  0       ff    ff    54   1     ff    ff
   10  ffff  00  1   0     0008  00  00  00  fd  NV‑BDIzC  0       ff    ff    fe   1     ff    ff

(let me know if I mention this too often, but I think it's a very handy tool)
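
For reference, the four-instruction program above can be hand-assembled with a few lines of Python. This is a toy sketch, not the assembler used in the post: `assemble` and the `OPCODES` table are hypothetical and cover only these four instructions, using the standard 6502 opcode values.

```python
# Opcodes for the four instructions used above (standard 6502 encodings).
OPCODES = {
    ("sed", None): 0xF8,
    ("clc", None): 0x18,
    ("lda", "imm"): 0xA9,
    ("adc", "imm"): 0x69,
}


def assemble(source):
    """Assemble implied and immediate ('#$nn') instructions only."""
    out = bytearray()
    for line in source.strip().splitlines():
        parts = line.split()
        mnemonic = parts[0].lower()
        if len(parts) == 1:
            out.append(OPCODES[(mnemonic, None)])
        else:
            operand = parts[1]
            assert operand.startswith("#$"), "only immediate operands handled"
            out.append(OPCODES[(mnemonic, "imm")])
            out.append(int(operand[2:], 16))
    return bytes(out)


program = assemble("""
sed
clc
lda #$44
adc #$56
""")
print(program.hex())  # f818a9446956
```

Bytes 4 and 5 are 69 and 56, which matches the db column at addresses 0004 and 0005 in the tabulation.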

Post by Arlet »

Ed, is there a description of the signal names used in the visual6502 program? I'd never have guessed to add 'dasb' to see the decimal adjustment, for instance.

Post by BigEd »

Hi Arlet, the best I can offer is the comments in the source code, which in this case is http://visual6502.org/JSSim/nodenames.js

Sorry, they are not obvious!

The other thing you can do, in principle, is click around the datapath to find the signal names. But this assumes that you've gained some familiarity and can 'read' the layout - it's certainly possible.

Ideally we'd have a graphical linkage with Hanson's block diagram. Maybe one day someone can code that up.

Edit: and, I remember, there is a standing idea for me to put some help text on the site, at least to document the URL interface and maybe list the most useful signal names. The pre-release version has some more pseudonames too: plaOutputs and DPControl:
http://visual6502.org/stage/JSSim/exper ... 18a9446956

Post by ijor »

BigEd wrote:
Anyone can explore the behaviour of the NMOS 6502 using the visual6502 model, which now allows you to run a program of your choice and tabulate the bus and signal activity per clock phase.
Thanks, Ed.

This shows, btw, that it is not exactly one cycle later, but half a cycle later instead. It probably doesn't matter in this case, because I don't think the accumulator output goes anywhere on the next half cycle.
fachat
Posts: 1123
Joined: 05 Jul 2005
Location: near Heidelberg, Germany

Post by fachat »

These pages on ALU design might be of interest:
http://www.6502.org/users/dieter/index.htm
In part 2 of the ALU design pages, Dieter explains a bit about the 6502 ALU (from reverse-engineered schematics - pre-visual6502 :-)
He also has a discussion on BCD operation.

André