6502-Core Comparisons: Fitting a Xilinx Spartan 2 XC2S200

Arlet · Post by **Arlet** » Tue Nov 16, 2010 8:22 am

Hello,

I'm Arlet Ottens. I'm new to this board, and I noticed you were discussing 6502 FPGA cores, including the one I wrote.

Just to clarify the licensing terms. My core is free to use for whatever purpose, as long as you don't blame me when it doesn't work.

Let me know if you have any other questions.

ElEctric_EyE · Post by **ElEctric_EyE** » Tue Nov 16, 2010 9:11 pm

As author of this thread, let me be the first to welcome you to 6502.org! The Programmable Logic section was just added within the last month.

Thanks for letting us post your work. If you see any errors do let us know.

I personally am not at the stage yet where I can actually implement one of these cores in my designs. It was more of a test to see if I could successfully implement one at all.

One day I would like to see which one is the fastest!

Arlet · Post by **Arlet** » Tue Nov 16, 2010 9:32 pm

ElEctric_EyE wrote:

One day I would like to see which one is the fastest!

I don't know about the others, but mine ran at about 60 MHz on a Spartan-3.

A lot faster than my original 6502 at 1 MHz.

ElEctric_EyE · Post by **ElEctric_EyE** » Tue Nov 16, 2010 10:13 pm

60 MHz is awesome.... From reading your website, I understand your core was meant to be used with synchronous RAM internal to the Spartan-3. I'm sure this has something to do with achieving higher speeds, but...

Big Ed noticed a discrepancy in your design concerning the # of LUT's versus the # of FF's, when compared to the other cores. His post is earlier in this thread...

Can you comment why this is? Or did I do something wrong implementing your core?

Arlet · Post by **Arlet** » Tue Nov 16, 2010 10:42 pm

ElEctric_EyE wrote:

60 MHz is awesome.... From reading your website, I understand your core was meant to be used with synchronous RAM internal to the Spartan-3.

Well, at least the stack must be internal, because some instructions do two write cycles back to back, and this is impossible to do on asynchronous RAMs. The rest of the memory can be external. It would be interesting to see what it does with the speed. The internal memories are fast, but they're also registered, which means that it takes a full clock cycle to get the result. External memories are asynchronous, so you save a clock cycle, but have more delay. As long as the total delay is less than one clock, it may be just as fast.
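The "total delay less than one clock" condition can be written down as a simple budget check. This is only a sketch: the delay figures below are hypothetical placeholders, not measurements from this core or from any particular SRAM, and the function names are made up for illustration.

```python
from typing import Sequence


def clock_period_ns(f_mhz: float) -> float:
    """Clock period in nanoseconds for a frequency given in MHz."""
    return 1000.0 / f_mhz


def fits_in_one_clock(f_mhz: float, delays_ns: Sequence[float]) -> bool:
    """True if the summed path delays fit inside one clock period."""
    return sum(delays_ns) < clock_period_ns(f_mhz)


# Hypothetical figures (assumptions, not measured values):
# FPGA clock-to-pad, SRAM access time, pad-to-register setup.
EXAMPLE_DELAYS_NS = [4.0, 10.0, 2.0]

print(round(clock_period_ns(60), 2))               # 16.67
print(fits_in_one_clock(60, EXAMPLE_DELAYS_NS))    # True
print(fits_in_one_clock(60, [4.0, 25.0, 2.0]))     # False
```

With these made-up numbers a 10 ns part would just fit at 60 MHz and a 25 ns part would not; real values would have to come from the FPGA timing report and the SRAM datasheet.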

On another project, I made a VGA controller that could read external SRAMs on the Spartan-3 starter kit at 100 MHz, so it may be possible to reach 60 MHz in combination with this core.

Quote:

Big Ed noticed a discrepancy in your design concerning the # of LUT's versus the # of FF's, when compared to the other cores. His post is earlier in this thread...

Can you comment why this is? Or did I do something wrong implementing your core?

The LUTs can be used for FF or for combinatorial logic, so I guess it just depends on the design. I've designed this core with the motto in mind that "on an FPGA flip flops are cheap". So, for each little bit of combinatorics, I tried to put the result in a FF wherever I could. If the other designs didn't do that, it would explain the difference in the FF/LUT relationship.

ElEctric_EyE · Post by **ElEctric_EyE** » Fri Nov 19, 2010 12:26 am

Arlet wrote:

...stack must be internal, because some instructions do two write cycles back to back, and this is impossible to do on asynchronous RAMs.

Trying to learn here, forgive my ignorance please.

Are you referring to Read Modify Write instructions like INC & DEC for example?

Arlet · Post by **Arlet** » Fri Nov 19, 2010 6:41 am

ElEctric_EyE wrote:

Are you referring to Read Modify Write instructions like INC & DEC for example?

No, I'm talking about JSR and BRK (and IRQs, but they are handled like BRK).

The BRK instruction, for instance, has 7 cycles. In my code, these cycles are:

DECODE - decode the BRK instruction
BRK0 - write PCH to stack
BRK1 - write PCL to stack
BRK2 - write P to stack
BRK3 - read from FFFE
JMP0 - read from FFFF
JMP1 - fetch opcode from new PC

The critical states here are BRK0, BRK1, and BRK2, in which 3 write cycles are performed in a row. The internal RAM blocks can do this with no problem, but external SRAM needs to have the WE asserted/deasserted for each write cycle. The original 6502 didn't have a problem, because it had a two-phase clock, so it could assert WE on the first phase and deassert it on the second phase, all within the same cycle.
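A small model makes the problem visible. It assumes WE is simply held active for the whole of every write cycle (a single-clock design with no half-cycle control); the state list is the BRK sequence given above, and the function name is hypothetical.

```python
# Bus cycles for BRK as listed above: (state name, is_write)
BRK_CYCLES = [
    ("DECODE", False),
    ("BRK0", True),
    ("BRK1", True),
    ("BRK2", True),
    ("BRK3", False),
    ("JMP0", False),
    ("JMP1", False),
]


def we_pulses(cycles):
    """Count distinct WE pulses when WE is held for each whole write cycle.

    A new pulse starts only when a write follows a non-write, because WE
    never goes inactive between two adjacent write cycles.
    """
    pulses = 0
    previous_was_write = False
    for _name, is_write in cycles:
        if is_write and not previous_was_write:
            pulses += 1
        previous_was_write = is_write
    return pulses


writes = sum(1 for _name, is_write in BRK_CYCLES if is_write)
print(writes, we_pulses(BRK_CYCLES))  # 3 1
```

Three bytes need to be written, but an asynchronous SRAM would only see one long WE pulse.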

There are several approaches you can take. I decided I'd just put the stack in the block RAM, which is the easiest.

You can also forget about cycle accuracy, and make the BRK take 9 cycles, by inserting extra wait states between BRK0/BRK1 and BRK1/BRK2, where the WE is deasserted. Since I managed to get everything else cycle accurate, I didn't want to break it for 2 instructions.

You can also redefine cycle accuracy by mimicking the 2 phase clock with a double clock frequency, so the BRK would be 14 cycles, and every other instruction would be exactly double too. This would be nice and simple, but also slower overall.

Lastly, you could try to reorder the state machine, so you'd get the following bus cycles:

DECODE - decode the BRK instruction
BRK0 - write PCH to stack
BRK1 - read from FFFE
BRK2 - write PCL to stack
BRK3 - read from FFFF
BRK4 - write P to stack
BRK5 - fetch opcode from new PC

By interleaving the write cycles with read cycles, the WE is deasserted during each read in between. This is a bit more complex, and probably requires some more resources. It is also not how the 6502 did it.
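The cycle counts of the alternatives above can be compared with a short sketch. The two state lists are taken from the posts; the helper names and the "WAIT" state name are hypothetical.

```python
BRK_ORIGINAL = [
    ("DECODE", False), ("BRK0", True), ("BRK1", True), ("BRK2", True),
    ("BRK3", False), ("JMP0", False), ("JMP1", False),
]

BRK_INTERLEAVED = [
    ("DECODE", False), ("BRK0", True), ("BRK1", False), ("BRK2", True),
    ("BRK3", False), ("BRK4", True), ("BRK5", False),
]


def has_back_to_back_writes(cycles):
    """True if any write cycle is immediately followed by another write."""
    return any(a[1] and b[1] for a, b in zip(cycles, cycles[1:]))


def with_wait_states(cycles):
    """Insert a non-writing WAIT cycle between adjacent write cycles."""
    out = []
    for name, is_write in cycles:
        if is_write and out and out[-1][1]:
            out.append(("WAIT", False))
        out.append((name, is_write))
    return out


print(len(with_wait_states(BRK_ORIGINAL)))                 # 9
print(2 * len(BRK_ORIGINAL))                               # 14
print(len(BRK_INTERLEAVED),
      has_back_to_back_writes(BRK_INTERLEAVED))            # 7 False
```

Wait states give 9 cycles, the doubled clock gives 14, and the interleaved order stays at 7 with no adjacent writes.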

By the way, the INC ZP instruction has the following states in my core:

DECODE - decode instruction, read zeropage address
ZP0 - read zeropage location
READ - send data to ALU for +1 operation
WRITE - send ALU output to zeropage location
FETCH - get next instruction

In this case, there is only one write cycle on the bus (WRITE) which will assert WE, and it will be deasserted on the FETCH, so there's no problem. None of the other instructions have a problem either. It's only the BRK/JSR because they write multiple bytes to the stack.

GARTHWILSON · Post by **GARTHWILSON** » Fri Nov 19, 2010 7:41 am

So what will you do when you get to the 65816, which always writes two clocks in a row if you're doing 16-bit stores?

kc5tja · Post by **kc5tja** » Fri Nov 19, 2010 7:48 am

I personally don't see the problem, as the Wishbone bus supports back-to-back write cycles, and is regularly used on FPGAs. Perhaps a solution, then, is to use Wishbone internally, and interface to external memory through a bus bridge of some kind.

Arlet · Post by **Arlet** » Fri Nov 19, 2010 8:03 am

GARTHWILSON wrote:

So what will you do when you get to the 65816 which always writes two clocks in a row if you're doing 16-bit stores.

Well, the 65816 uses the PHI2 on both edges. I suppose you could configure the outputs as DDR FF (double data rate).

That would be a good idea too for the 6502. If you'd configure the WE output as a DDR, it would be possible to only assert it for half a cycle. I hadn't thought of that option, but it seems like it would be perfect if you wanted all external SRAM, with a minimum of changes.
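The effect of a half-cycle WE can be sketched by sampling the signal twice per clock. Which half of the cycle WE is active in is an assumption here (the first half), and the function names are hypothetical; this models the waveform only, not an actual DDR output register.

```python
def we_half_cycle_waveform(write_flags):
    """Two samples per clock: WE active in the first half of write cycles only."""
    wave = []
    for is_write in write_flags:
        wave.extend([is_write, False])
    return wave


def count_pulses(wave):
    """Count inactive-to-active transitions in a sampled WE waveform."""
    pulses = 0
    previous = False
    for level in wave:
        if level and not previous:
            pulses += 1
        previous = level
    return pulses


# Write flags for the 7 BRK cycles: DECODE, BRK0..BRK3, JMP0, JMP1
BRK_WRITES = [False, True, True, True, False, False, False]

print(count_pulses(we_half_cycle_waveform(BRK_WRITES)))  # 3
```

Each of the three stack writes now gets its own WE pulse without adding any cycles.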

Arlet · Post by **Arlet** » Fri Nov 19, 2010 8:30 am

kc5tja wrote:

I personally don't see the problem, as the Wishbone bus supports back-to-back write cycles, and is regularly used on FPGAs. Perhaps a solution, then, is to use Wishbone internally, and interface to external memory through a bus bridge of some kind.

Wishbone is synchronous, so it's easy to interface, similar to internal synchronous memory in the FPGA.

The problem is when you try to interface external asynchronous memories, like old-fashioned SRAMs. In that case you'd need a memory interface that either uses a higher clock rate, double edge flip-flops, or extra wait states to handle the back-to-back writes.

When I wrote my 6502 core, I never considered external memories. It was just a hobby project to run my old Acorn Atom inside a Spartan-3 FPGA. Since the Atom never had much memory, the internal block RAM was sufficient for that purpose.

Thinking about it some more, it seems that using a DDR FF for the WE signal would be an elegant solution to the problem.

BigEd · Post by **BigEd** » Fri Nov 19, 2010 11:06 am

Hi Arlet
good to see you here, and thanks for sharing your Verilog model!

You're probably already up to speed on this subject, but anyone tackling synchronous memory interfacing might be interested to read Rob Finch's page on the topic - the site seems down, so here's an archive copy.

Cheers
Ed

BigEd · Post by **BigEd** » Fri Nov 19, 2010 1:49 pm

BigEd wrote:

... I see that Arlet's is the only design which uses LUTs as 16-bit RAMs - this could be a reason why the slice count is by far the smallest.

[...]

Another point or two on density:

How you encode control information will affect the complexity of the decode. One-hot, two-hot, binary, Gray code, etc.

Whether you decode all bits of an encoding, or just the bits which distinguish interesting cases, will make a difference[...]

I decided to push Arlet's core through the 11.1 Xilinx tools - I felt a bit guilty at never having looked carefully at it. I recommend reading his sources: breaking down into small self-contained processes and using case statements keeps it clear and probably helps with the efficient implementation.

A few notable points from the synthesis report:

Code:

Macro Statistics
# FSMs                                                 : 1
# RAMs                                                 : 1
 4x8-bit single-port distributed RAM                   : 1
# Adders/Subtractors                                   : 2  
 16-bit adder                                          : 1  
 9-bit adder carry in                                  : 1
# Registers                                            : 103
 Flip-Flops                                            : 103
# Multiplexers                                         : 2
 1-bit 8-to-1 multiplexer                              : 1
 8-bit 4-to-1 multiplexer                              : 1

The main state machine has been coded in such a way that the tool sees it as a state machine - and has one-hot encoded it. So the code is clear, and the implementation is fast and small. (More flops, less logic.) Also, the decoding of the state bits by the various case statements is going to resemble the PLA approach in the 6502.
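The "more flops, less logic" trade-off of one-hot encoding can be illustrated as follows. The state list is only the BRK subset named earlier in the thread, not the core's full state set, and the helper names are hypothetical.

```python
# Subset of states, for illustration only (the real core has more).
STATES = ["DECODE", "BRK0", "BRK1", "BRK2", "BRK3", "JMP0", "JMP1"]

# One-hot: one flip-flop per state, exactly one bit set at a time.
ONE_HOT = {name: 1 << i for i, name in enumerate(STATES)}


def binary_width(n_states):
    """Flip-flops needed for a plain binary encoding of n_states."""
    return max(1, (n_states - 1).bit_length())


def in_state(state_reg, name):
    """One-hot decode: testing a state is a single-bit test."""
    return bool(state_reg & ONE_HOT[name])


print(len(STATES), binary_width(len(STATES)))   # 7 3
print(in_state(ONE_HOT["BRK1"], "BRK1"))        # True
print(in_state(ONE_HOT["BRK1"], "BRK2"))        # False
```

One-hot spends 7 flip-flops where binary would need 3, but each state test looks at one bit instead of decoding all of them.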

The A, X, Y and S registers were coded as a register file, and that's probably helped get them implemented as a small RAM: 8 LUTs instead of 32, maybe. Note also that there are just 2 adders - somehow the PC hasn't been recognised as only needing an incrementer, but Arlet has arranged to use his ALU to do all address arithmetic and inc/dec operations, including on the stack pointer. As the real 6502 does the same, we knew that it was possible, but I don't know if other cores do it.
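A behavioural sketch of a register file of this shape (4 entries of 8 bits, one port) is given below. The index assigned to each register is an assumption for illustration, not taken from the core's source, and the class is hypothetical.

```python
class RegFile:
    """4x8-bit single-port register file, modelled as a small RAM."""

    # Index assignment is an assumption, not the core's actual mapping.
    SEL = {"A": 0, "X": 1, "Y": 2, "S": 3}

    def __init__(self):
        self.mem = [0] * 4

    def write(self, name, value):
        """Store the low 8 bits of value in the selected register."""
        self.mem[self.SEL[name]] = value & 0xFF

    def read(self, name):
        """The select index acts as the output mux."""
        return self.mem[self.SEL[name]]


regs = RegFile()
regs.write("S", 0x1FF)   # only 8 bits are kept
regs.write("A", 0x42)
print(hex(regs.read("S")), hex(regs.read("A")))  # 0xff 0x42
```

Because reads and writes go through a single indexed port, the select logic doubles as the input and output mux, which fits the low mux count in the report.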

I'm surprised at the small number of muxes. In part, this might be because AXYS are in a RAM, which kind of contains the output and input mux.

From the lower-level report:

Code:

Macro Statistics
# Registers                                            : 46
 1-bit register                                        : 36
 16-bit register                                       : 1
 2-bit register                                        : 2
 3-bit register                                        : 1
 4-bit register                                        : 1
 8-bit register                                        : 5

we see the evidence for the hidden state: pipeline controls and holding registers for the datapath. The NMOS 6502 also has these - often just as a latch, which can be as simple as just a single transistor. (Which is why it can't be clocked as slow as you like: the stored charge will leak away.)

A splendid job by Arlet!

(Although: on a Reset, will this core write to the stack?)

Cheers
Ed

Arlet · Post by **Arlet** » Fri Nov 19, 2010 2:21 pm

Quote:

Note also that there are just 2 adders - somehow the PC hasn't been recognised as only needing an incrementer

At the LUT level, the incrementer is implemented as a full adder. Both need the carry chain logic.
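To see why the two look the same at this level, here is a bit-level sketch of a ripple-carry adder in which incrementing is just adding zero with the carry-in set. It models the arithmetic only, not the actual LUT and carry-chain mapping, and the function names are hypothetical.

```python
def ripple_add(a, b, carry_in, width=16):
    """Add two width-bit values bit by bit, propagating a carry."""
    result = 0
    carry = carry_in
    for i in range(width):
        abit = (a >> i) & 1
        bbit = (b >> i) & 1
        result |= (abit ^ bbit ^ carry) << i
        carry = (abit & bbit) | (carry & (abit ^ bbit))
    return result, carry


def increment(a, width=16):
    """An incrementer is the same chain with b = 0 and carry-in = 1."""
    return ripple_add(a, 0, 1, width)


print(increment(0x00FF))   # (256, 0)
print(increment(0xFFFF))   # (0, 1)
```

Both cases walk the same per-bit carry chain; the incrementer just has one operand tied to zero.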

Quote:

I'm surprised at the small number of muxes. In part, this might be because AXYS are in a RAM, which kind of contains the output and input mux

I'm surprised too. There are many more muxes. I have wide muxes for the PC, AB (address bus), DO (data out). That alone should be 40 muxes (one for each bit). Maybe the tools don't recognize them as such? If that's true, there might be room for some LUT reduction. If the mux gets recognized by the tool, it will try to use the F5MUX, F6MUX muxes in the fabric, instead of the general purpose LUTs. Does the report mention how many F5/F6MUXes were used?

Quote:

(Although: on a Reset, will this core write to the stack?)

I'm afraid it does. Should be easy to fix in the same way the 6502 did it, though.

BigEd · Post by **BigEd** » Fri Nov 19, 2010 2:35 pm

I think (at other times) I've seen the tools notice an incrementer, which isn't to say that it's an advantage. Possibly if you tested your input bit and conditionally added a constant one, instead of adding the bit, it would take notice.

Yes, there are MUX* cells reported:

Code:

#      MUXCY                       : 28
#      MUXF5                       : 25
#      MUXF6                       : 1

Cheers
Ed