6502-Core Comparisons: Fitting a Xilinx Spartan 2 XC2S200
Hello,
I'm Arlet Ottens. I'm new to this board, and I noticed you were discussing 6502 FPGA cores, including the one I wrote.
Just to clarify the licensing terms: my core is free to use for whatever purpose, as long as you don't blame me when it doesn't work.
Let me know if you have any other questions.
-
ElEctric_EyE
- Posts: 3260
- Joined: 02 Mar 2009
- Location: OH, USA
As author of this thread, let me be the first to welcome you to 6502.org! The Programmable Logic section was just added within the last month.
Thanks for letting us post your work. If you see any errors do let us know.
I personally am not at the stage yet where I can actually implement one of these cores in my designs. It was more of a test to see if I could successfully implement one at all.
One day I would like to see which one is the fastest!
-
ElEctric_EyE
60 MHz is awesome.... From reading your website, I understand your core was meant to be used with synchronous RAM internal to the Spartan-3. I'm sure this has something to do with achieving higher speeds, but...
Big Ed noticed a discrepancy in your design concerning the number of LUTs versus the number of FFs, when compared to the other cores. His post is earlier in this thread...
Can you comment on why this is? Or did I do something wrong implementing your core?
ElEctric_EyE wrote:
60 Mhz is awesome.... From reading your website, I understand your core was meant to be used with synchronous RAM internal to the Spartan3.
On another project, I made a VGA controller that could read external SRAMs on the Spartan-3 starter kit at 100 MHz, so it may be possible to reach 60 MHz in combination with this core.
Quote:
Big Ed noticed a discrepancy in your design concerning the # of LUT's versus the # of FF's, when compared to the other cores. His post is earlier in this thread...
Can you comment why this is? Or did I do something wrong implementing your core?
ElEctric_EyE wrote:
Are you referring to Read Modify Write instructions like INC & DEC for example?
The BRK instruction, for instance, has 7 cycles. In my code, these cycles are:
DECODE - decode the BRK instruction
BRK0 - write PCH to stack
BRK1 - write PCL to stack
BRK2 - write P to stack
BRK3 - read from FFFE
JMP0 - read from FFFF
JMP1 - fetch opcode from new PC
The critical states here are BRK0, BRK1, and BRK2, in which 3 write cycles are performed in a row. The internal RAM blocks can do this with no problem, but external SRAM needs to have WE asserted and deasserted for each write cycle. The original 6502 didn't have a problem, because it had a two-phase clock, so it could assert WE on the first phase and deassert it on the second phase, all within the same cycle.
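As a rough illustration of the point above, the BRK states can be written down as a list of bus cycles and scanned for adjacent writes. This is only a sketch in Python, not code from the core; the `R`/`W` labels and the choice to treat DECODE as a non-write cycle are assumptions made for the example.

```python
# Bus activity per BRK state, following the list in the post.
# "W" = write cycle, "R" = cycle without a write.
# Treating DECODE as "R" is an assumption of this sketch.
BRK_CYCLES = [
    ("DECODE", "R"),  # decode the BRK instruction
    ("BRK0", "W"),    # write PCH to stack
    ("BRK1", "W"),    # write PCL to stack
    ("BRK2", "W"),    # write P to stack
    ("BRK3", "R"),    # read from FFFE
    ("JMP0", "R"),    # read from FFFF
    ("JMP1", "R"),    # fetch opcode from new PC
]


def back_to_back_writes(cycles):
    """Return the pairs of adjacent states that both perform a write."""
    return [
        (first[0], second[0])
        for first, second in zip(cycles, cycles[1:])
        if first[1] == "W" and second[1] == "W"
    ]


print(len(BRK_CYCLES))
print(back_to_back_writes(BRK_CYCLES))
```

The two pairs it reports, BRK0/BRK1 and BRK1/BRK2, are the places where an asynchronous SRAM would need WE released between writes.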
There are several approaches you can take. I decided I'd just put the stack in the block RAM, which is the easiest.
You can also forget about cycle accuracy, and make the BRK take 9 cycles, by inserting extra wait states between BRK0/BRK1 and BRK1/BRK2, where the WE is deasserted. Since I managed to get everything else cycle accurate, I didn't want to break it for 2 instructions.
You can also redefine cycle accuracy by mimicking the 2 phase clock with a double clock frequency, so the BRK would be 14 cycles, and every other instruction would be exactly double too. This would be nice and simple, but also slower overall.
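The cycle counts for these two alternatives can be checked with a small Python sketch. It is not taken from the core; the `R`/`W`/`-` labels and the helper name `insert_wait_states` are made up for the illustration, and DECODE is assumed to be a non-write cycle.

```python
# Bus activity for the 7 BRK states: "R" = no write, "W" = write.
# Treating DECODE as "R" is an assumption of this sketch.
BRK = ["R", "W", "W", "W", "R", "R", "R"]


def insert_wait_states(kinds):
    """Insert an idle cycle ("-") between any two adjacent writes."""
    out = []
    for kind in kinds:
        if kind == "W" and out and out[-1] == "W":
            out.append("-")  # WE is deasserted during this wait state
        out.append(kind)
    return out


stretched = insert_wait_states(BRK)
print(stretched, len(stretched))  # wait-state variant: 9 cycles
print(len(BRK) * 2)               # double-clock variant: 14 clocks
```

With wait states only BRK and JSR grow, while the double-clock approach costs every instruction a factor of two.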
Lastly, you could try to reorder the state machine, so you'd get the following bus cycles:
DECODE - decode the BRK instruction
BRK0 - write PCH to stack
BRK1 - read from FFFE
BRK2 - write PCL to stack
BRK3 - read from FFFF
BRK4 - write P to stack
BRK5 - fetch opcode from new PC
By interleaving the write cycles with read cycles, WE is deasserted during each intervening read. This is a bit more complex, and probably requires some more resources. It is also not how the 6502 did it.
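A quick way to convince yourself that the reordered sequence avoids the problem is to list it and check for adjacent writes. This Python sketch only models the bus cycle order given above; the `R`/`W` labels are an assumption of the example, with DECODE treated as a non-write cycle.

```python
# Reordered BRK sequence from the post: writes alternate with reads.
REORDERED_BRK = [
    ("DECODE", "R"),  # decode the BRK instruction
    ("BRK0", "W"),    # write PCH to stack
    ("BRK1", "R"),    # read from FFFE
    ("BRK2", "W"),    # write PCL to stack
    ("BRK3", "R"),    # read from FFFF
    ("BRK4", "W"),    # write P to stack
    ("BRK5", "R"),    # fetch opcode from new PC
]


def has_back_to_back_writes(cycles):
    """True if any two adjacent cycles are both writes."""
    kinds = [kind for _, kind in cycles]
    return any(a == "W" and b == "W" for a, b in zip(kinds, kinds[1:]))


print(len(REORDERED_BRK))
print(has_back_to_back_writes(REORDERED_BRK))
```

The instruction still takes 7 cycles, but the order of bus accesses no longer matches the original 6502.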
By the way, the INC ZP instruction has the following states in my core:
DECODE - decode instruction, read zeropage address
ZP0 - read zeropage location
READ - send data to ALU for +1 operation
WRITE - send ALU output to zeropage location
FETCH - get next instruction
In this case, there is only one write cycle on the bus (WRITE) which will assert WE, and it will be deasserted on the FETCH, so there's no problem. None of the other instructions have a problem either. It's only the BRK/JSR because they write multiple bytes to the stack.
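The INC ZP walk-through can be mimicked in a few lines of Python. This is a behavioural sketch only, not the core's Verilog: the function name `inc_zp`, the memory layout, and the assumption that the READ state performs no bus write are all made up for the illustration.

```python
def inc_zp(memory, operand_addr):
    """Step through the INC zp states and log the bus activity of each."""
    log = []
    zp = memory[operand_addr]       # DECODE: read zeropage address
    log.append(("DECODE", "R"))
    value = memory[zp]              # ZP0: read zeropage location
    log.append(("ZP0", "R"))
    result = (value + 1) & 0xFF     # READ: ALU +1, wraps at 8 bits
    log.append(("READ", "-"))       # assumed: no bus write in this state
    memory[zp] = result             # WRITE: ALU output to zeropage
    log.append(("WRITE", "W"))
    log.append(("FETCH", "R"))      # FETCH: get next instruction
    return log


mem = bytearray(256)
mem[0x01] = 0x10   # hypothetical operand byte: zeropage address $10
mem[0x10] = 0xFF   # value to increment
log = inc_zp(mem, 0x01)
writes = [state for state, kind in log if kind == "W"]
print(writes, hex(mem[0x10]))
```

Only the WRITE state touches WE, and the following FETCH is a read, so WE is released without any extra effort.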
- GARTHWILSON
- Forum Moderator
- Posts: 8773
- Joined: 30 Aug 2002
- Location: Southern California
GARTHWILSON wrote:
So what will you do when you get to the 65816 which always writes two clocks in a row if you're doing 16-bit stores.
That would be a good idea for the 6502 too. If you configured the WE output as a DDR flip-flop, it would be possible to assert it for only half a cycle. I hadn't thought of that option, but it seems like it would be perfect if you wanted all external SRAM, with a minimum of changes.
kc5tja wrote:
I personally don't see the problem, as the Wishbone bus supports back-to-back write cycles, and is regularly used on FPGAs. Perhaps a solution, then, is to use Wishbone internally, and interface to external memory through a bus bridge of some kind.
The problem arises when you try to interface external asynchronous memories, like old-fashioned SRAMs. In that case you'd need a memory interface that uses a higher clock rate, double-edge flip-flops, or extra wait states to handle the back-to-back writes.
When I wrote my 6502 core, I never considered external memories. It was just a hobby project to run my old Acorn Atom inside a Spartan-3 FPGA. Since the Atom never had much memory, the internal block RAM was sufficient for that purpose.
Thinking about it some more, it seems that using a DDR FF for the WE signal would be an elegant solution to the problem.
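To see why a DDR flip-flop on WE would help, the WE line can be sampled twice per clock cycle in a small Python model. This is a sketch under assumptions: which half of the cycle WE is asserted in is chosen arbitrarily here, the `R`/`W` cycle labels and function names are hypothetical, and no real SRAM timing is modelled.

```python
# Bus activity for the 7 BRK states: "R" = no write, "W" = write.
BRK = ["R", "W", "W", "W", "R", "R", "R"]


def we_waveform(kinds, ddr):
    """Return WE (True = asserted) sampled twice per clock cycle."""
    wave = []
    for kind in kinds:
        writing = kind == "W"
        if ddr:
            # Assumption: asserted in the first half, released in the second.
            wave += [writing, False]
        else:
            # Single-edge flip-flop: WE holds its value for the whole cycle.
            wave += [writing, writing]
    return wave


def count_pulses(wave):
    """Count distinct WE pulses (transitions from deasserted to asserted)."""
    return sum(
        1 for prev, cur in zip([False] + wave, wave) if cur and not prev
    )


print(count_pulses(we_waveform(BRK, ddr=False)))
print(count_pulses(we_waveform(BRK, ddr=True)))
```

Without DDR the three stack writes merge into one long WE pulse; with the half-cycle release each write gets its own pulse, and the cycle count stays at 7.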
Hi Arlet
good to see you here, and thanks for sharing your verilog model!
You're probably already up to speed on this subject, but anyone tackling synchronous memory interfacing might be interested to read Rob Finch's page on the topic - the site seems down, so here's an archive copy.
Cheers
Ed
BigEd wrote:
... I see that Arlet's is the only design which uses LUTs as 16-bit RAMs - this could be a reason why the slice count is by far the smallest.
[...]
Another point or two on density:
- How you encode control information will affect the complexity of the decode. One-hot, two-hot, binary, Gray code, etc.
Whether you decode all bits of an encoding, or just the bits which distinguish interesting cases, will make a difference[...]
A few notable points from the synthesis report:
Macro Statistics
# FSMs : 1
# RAMs : 1
4x8-bit single-port distributed RAM : 1
# Adders/Subtractors : 2
16-bit adder : 1
9-bit adder carry in : 1
# Registers : 103
Flip-Flops : 103
# Multiplexers : 2
1-bit 8-to-1 multiplexer : 1
8-bit 4-to-1 multiplexer : 1
The A, X, Y and S registers were coded as a register file, and that's probably helped get them implemented as a small RAM: 8 LUTs instead of 32, maybe. Note also that there are just 2 adders - somehow the PC hasn't been recognised as only needing an incrementer, but Arlet has arranged to use his ALU to do all address arithmetic and inc/dec operations, including on the stack pointer. As the real 6502 does the same, we knew that it was possible, but I don't know if other cores do it.
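The "8 LUTs instead of 32" estimate can be reproduced with some back-of-envelope arithmetic, assuming each LUT acts as a 16x1 RAM (the "LUTs as 16-bit RAMs" mentioned earlier) and reading the 32 as one storage element per register bit. Both the helper names and that reading are assumptions of this Python sketch, not figures from the synthesis report.

```python
import math

LUT_RAM_DEPTH = 16  # one LUT used as a 16x1 distributed RAM


def distributed_ram_luts(depth, width):
    """LUTs needed for a depth x width single-port distributed RAM."""
    return width * math.ceil(depth / LUT_RAM_DEPTH)


def discrete_storage_bits(depth, width):
    """Storage elements needed if every register bit is held separately."""
    return depth * width


# A, X, Y and S: four 8-bit registers, matching the 4x8-bit RAM in the report.
print(distributed_ram_luts(4, 8))
print(discrete_storage_bits(4, 8))
```

Each LUT holds one bit position of all four registers and uses only 4 of its 16 entries, which is where the factor of four comes from.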
I'm surprised at the small number of muxes. In part, this might be because AXYS are in a RAM, which kind of contains the output and input mux.
From the lower-level report:
Macro Statistics
# Registers : 46
1-bit register : 36
16-bit register : 1
2-bit register : 2
3-bit register : 1
4-bit register : 1
8-bit register : 5
A splendid job by Arlet!
(Although: on a Reset, will this core write to the stack?)
Cheers
Ed
Quote:
Note also that there are just 2 adders - somehow the PC hasn't been recognised as only needing an incrementer
Quote:
I'm surprised at the small number of muxes. In part, this might be because AXYS are in a RAM, which kind of contains the output and input mux
Quote:
(Although: on a Reset, will this core write to the stack?)
I think (at other times) I've seen the tools notice an incrementer, which isn't to say that it's an advantage. Possibly if you tested your input bit and conditionally added a constant one, instead of adding the bit, the tools would take notice.
Yes, there are MUX* cells reported:
# MUXCY : 28
# MUXF5 : 25
# MUXF6 : 1
Ed