Proper 65C02 core

MichaelM · Post by **MichaelM** » Wed Nov 14, 2012 12:50 am

Support for dynamic cycle length configuration is not built into the core just released in the manner that I would want in a final product. Currently, automatic cycle length extension is only implemented for BCD arithmetic operations; it changes the cycle length from 1 cycle under normal operation, to 2 cycles when a BCD add or subtraction is to be performed.

The intent of the microcycle length controller is to allow the memory interface, not the instructions, to determine the cycle length. With the microcycle controller in the Microprogram Controller, I expected to implement a processor framework around the core that includes 0.5kB of single cycle LUT-based RAM (pages 0 and 1), 15.5kB of 2 cycle block RAMs, and 48kB of 4 cycle external memory. When I attempted to put that address-dependent cycle length control into the just released core, the path delays were too long. So the present release is limited to accounting for the BCD arithmetic operation. When I integrate this core into a processor framework which includes an external memory interface, internal LUT and block RAM, and an interrupt handler/controller, I expect that I will implement microcycle length control based on the memory map described above. However, I will have to determine how to add the microcycle length control logic such that it is not in the critical memory address path. That may take some noodling and testing to implement.
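As a rough illustration of the address-dependent cycle length described above, here is a minimal Python sketch. The post gives only the sizes of the three regions, so the base addresses below (block RAM directly after page 1, external memory filling the rest of the 64kB space) are assumptions, and `microcycle_length` is a hypothetical name, not part of the M65C02 core.

```python
# Hypothetical memory map: the post states region sizes, not base addresses.
LUT_RAM_END = 0x0200    # pages 0 and 1: 0.5 kB of single-cycle LUT RAM
BLOCK_RAM_END = 0x4000  # assumed: the next 15.5 kB is 2-cycle block RAM
                        # assumed: the remaining 48 kB is 4-cycle external memory


def microcycle_length(address: int) -> int:
    """Return the number of clock cycles for one access to `address`."""
    if not 0 <= address <= 0xFFFF:
        raise ValueError("address outside the 16-bit address space")
    if address < LUT_RAM_END:
        return 1
    if address < BLOCK_RAM_END:
        return 2
    return 4
```

In hardware this decode sits on the address path, which is exactly where the post reports the path delays becoming too long; the sketch shows only the mapping, not a way around that timing problem.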

In another 65C02 core project which I started some time ago, I undertook a significant amount of rework of the ALU, the address generator, and the microprogram organization. In that project, I was trying to address Arlet's comments regarding an expected decrease in performance if the M65C02 core was connected to block RAM instead of single cycle LUT RAM. Most of the objectives were met, and most instructions and addressing modes were working correctly.

In addition, its resource utilization is considerably less than that of the just released core, but I ran into a number of issues when I attempted to deal with the memory read delay of block RAMs. I was able to deal with the read delay issue for sequential instruction and data fetches. However, I did run into several issues in the microcode when the various sequential fetch/read operations had to be flushed and restarted at a non-sequential address. The read delay of the block RAMs was forcing me to add additional states to the microcode.

It took a while, but I finally realized that adding microcode states to deal with the memory read delay was not the correct approach. The additional microprogram complexity is contrary to my objectives for microprogrammed 65C02 cores. Furthermore, the additional states were very specific to particular sequences of instructions and addressing modes. This makes for an implementation which is not very robust, and which is difficult to modify and maintain. Therefore, I stopped working on the project until I could find a way around the memory read delay issue.

I am now thinking of taking another look at that core design, but this time with an MPC modified to support dynamic microcycle lengths. The problem represented by the memory pipeline delay of the block RAMs is exactly the application for which the AMD Am2925 microcycle length controller was intended. I don't know why I didn't come to this realization sooner; I've been aware of the Am2925 for a long time.

BigEd · Post by **BigEd** » Wed Nov 14, 2012 5:52 am

Hi Michael
Do I take it that the 2-tick cycle length for the block RAMs is to allow for the synchronous nature? That during a single microcode state, it allows for one tick to present the address and a second tick to collect the data? Whereas the LUT RAM (distributed RAM) can be combinatorial?

And so, is your 4-tick cycle for external RAM set up for particular timing of the RAM interface, taking specific actions on each tick? Or is it just a simple way to slow down to an appropriate access time and cycle time?

I'm following this with great interest!

Cheers
Ed

Arlet · Post by **Arlet** » Wed Nov 14, 2012 6:09 am

I'm also curious about the 4 cycle external memory. Can't you just make it 2 cycle, like the block RAMs, and use the RDY signal to extend it if necessary ?

MichaelM · Post by **MichaelM** » Wed Nov 14, 2012 1:45 pm

Quote:

Do I take it that the 2-tick cycle length for the block RAMs is to allow for the synchronous nature? That during a single microcode state, it allows for one tick to present the address and a second tick to collect the data? Whereas the LUT RAM (distributed RAM) can be combinatorial?

Your analysis is correct. The purpose of the microcycle length controller is to match the microcycle to the device/memory providing the data required. In the case of the recently released update, rather than having the ALU Valid signal initiate the delay required by BCD addition/subtraction, I dynamically lengthen the cycle when an ADC/SBC instruction is executed when the D flag in P is set.
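The rule in the current release (lengthen the cycle for ADC/SBC when the D flag is set) can be sketched as follows. This is only an illustration: the core makes this decision in its microprogram, not by looking up opcodes in a table, and `alu_cycle_length` is a hypothetical helper. The opcode values are the standard 65C02 encodings of ADC and SBC.

```python
D_FLAG = 0x08  # decimal-mode bit in the processor status register P

# Standard 65C02 opcodes for ADC and SBC in all addressing modes.
ADC_OPCODES = {0x69, 0x65, 0x75, 0x6D, 0x7D, 0x79, 0x61, 0x71, 0x72}
SBC_OPCODES = {0xE9, 0xE5, 0xF5, 0xED, 0xFD, 0xF9, 0xE1, 0xF1, 0xF2}


def alu_cycle_length(opcode: int, p: int) -> int:
    """Return 2 for ADC/SBC executed with the D flag set, otherwise 1."""
    is_add_or_sub = opcode in ADC_OPCODES or opcode in SBC_OPCODES
    return 2 if is_add_or_sub and (p & D_FLAG) else 1
```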

Quote:

I'm also curious about the 4 cycle external memory. Can't you just make it 2 cycle, like the block RAMs, and use the RDY signal to extend it if necessary?

If different memory types are used to enhance the speed of the microprocessor (core plus interrupt handler, memory controller, etc.), then the microcycle length controller can dynamically adjust the length of the microcycle when different memories are accessed. The expectation is that external memory will require no less than two cycles (output address register delay plus input data register delay). If synchronous, fall through memory is used, then an additional cycle is also required. If synchronous, registered memory is used, then there will be a two cycle delay. The fall-through memory is like an FPGA block RAM in that the output is ready on the rising edge after the address is registered. The issue with the fall-through design is that like an asynchronous memory, there is an access delay before the data would arrive back at the FPGA IOB. That delay plus the internal path delays of the FPGA IO path likely means that the data is not valid at the output of the IOB register until the second rising edge. The other synchronous SRAM style imposes an additional clock delay, but it eliminates the access delay and converts it to a register output delay which is generally shorter. This probably makes the interfacing of the FPGA to the registered SyncSRAM a much better proposition. In either of these two cases, I count 4 cycles as being required.

Although the present implementation only supports 1, 2, and 4 cycle microcycle lengths, it is possible to include a 3 clock microcycle. As currently implemented, the microcycle length controller accepts a Wait signal in the second and/or third states. This allows the address to be registered to the external memory, and for the external memory to stretch the access cycle as required. It is also possible for all of the timing to be controlled from the memory access controller, but I had anticipated that the memory controller would also support external memory-mapped I/O devices. Those devices may require additional cycles, which the memory controller could be designed to insert implicitly, but it would be more flexible if I/O devices inserted their own cycle extensions.
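A small behavioural model may make the Wait handling easier to picture. It assumes that an asserted Wait in the second or third state simply holds the controller in that state for one more clock; the post does not spell out the exact mechanism, so treat this as one plausible reading. `four_cycle_ticks` is a hypothetical name.

```python
from typing import Iterable


def four_cycle_ticks(wait_samples: Iterable[bool] = ()) -> int:
    """Count clock ticks for one 4-state microcycle.

    `wait_samples` holds the Wait level seen each time the controller is
    in state 2 or 3 (assumption: Wait is ignored in states 1 and 4).
    Missing samples are treated as Wait deasserted.
    """
    waits = iter(wait_samples)
    state = 1
    ticks = 0
    while state <= 4:
        ticks += 1
        if state in (2, 3) and next(waits, False):
            continue  # Wait asserted: stay in this state for another tick
        state += 1
    return ticks
```

With no Wait the microcycle takes 4 ticks; every sample that is asserted in state 2 or 3 adds one tick.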

On the whole, the microcycle length controller solves some problems, but it tends to leave some cycles on the floor. When a sequence of reads is required from block RAM, then only the first is required to wait one cycle. If the addresses of the second and third accesses immediately follow, then after four cycles all three data bytes have been read. The problem I was having in my second core stems from attempting to read three bytes in four cycles instead of six cycles. The interplay among the various instruction sequences and lengths, and the read-modify-write sequences, played havoc with the microcode.
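The four-versus-six cycle comparison is simple arithmetic, sketched below. The function names are hypothetical, and the model assumes a one-cycle block RAM read latency with a new address presented on every clock.

```python
def fixed_microcycle_reads(n_reads: int, cycles_per_read: int = 2) -> int:
    """Every read pays the full 2-cycle microcycle."""
    return n_reads * cycles_per_read


def pipelined_sequential_reads(n_reads: int, read_latency: int = 1) -> int:
    """Only the first read waits; later sequential reads overlap with it."""
    if n_reads == 0:
        return 0
    return read_latency + n_reads


three_fixed = fixed_microcycle_reads(3)          # 6 cycles
three_pipelined = pipelined_sequential_reads(3)  # 4 cycles
```

The two cycles saved per three-byte fetch are the ones "left on the floor" by the fixed-length approach; recovering them is what complicated the microcode whenever a fetch sequence had to be flushed and restarted.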

legacy · Post by **legacy** » Sat Dec 08, 2012 9:01 pm

hi guys
i'd like to realize a tiny System on Chip with a 65C02 softcore. I have 4 questions

1) do you think spartan3e 100K gates (2K LE) is enough for the cpu core + 8Kbyte of ram + uart ?
2) how could i translate LUT into gates or LE ? ISE 10.1 report is talking about "LUT" and this is confusing me because i have 100K gates in mind about my spartan 3e 100 device
3) which 65c02 soft core do you suggest ? ( i prefer verilog)
4) which compiler do you suggest ? SDCC + his assembler ? TASM ?

let me know =)

Arlet · Post by **Arlet** » Sat Dec 08, 2012 9:07 pm

@legacy: the "gates" measure is not very reliable. The most accurate approach is just to load the design in ISE, specify which device you're using, implement the design, and see if it fits.

legacy · Post by **legacy** » Sat Dec 08, 2012 10:37 pm

ok, but how could i consider this information from an OpenCores Project which i am considering

Quote:

Performance (standalone CPU, synthesis only):

Xilinx XST on Spartan 3 (-5 grade):

204 LUTs plus 1 BRAM @ 80 MHz (optimized for area)
228 LUTs plus 1 BRAM @ 100 MHz (optimized for speed)
618 LUTs @ 53 MHz (optimized for area, no block ram)

Altera Quartus on Cyclone 2:

369 LEs plus 4 M4Ks @ 67 MHz (balanced optimization)

The author is talking about LUTs and BRAM for Spartan 3 and LEs for Cyclone 2: is this not very reliable information?
This is confusing me a bit because this URL compares soft cores by LEs: http://www.1-core.com/library/digital/soft-cpu-cores

Arlet · Post by **Arlet** » Sat Dec 08, 2012 11:55 pm

In footnote 5, a LE is defined as a 4-input LUT, plus a flip-flop. This is only a rough estimate "for reference purposes only" as it says. If one LUT/LE count is 10x higher than another, you can be sure it will take more area. But if it's only 20% higher, it could be that it's actually lower on another brand/type of FPGA.

MichaelM · Post by **MichaelM** » Sun Dec 09, 2012 2:22 am

Another issue is that the Altera LE and Xilinx LUT+FF are not equal in capability even if the logic function is a 4-input look-up table in both FPGAs. Altera and Xilinx implement these functions in a different manner in their FPGAs. An effort is made by Xilinx in their Spartan-3E data sheet to relate their LUT/FF design elements to those of Altera.

They give a measure named Equivalent Logic Cells that tries to give an effectiveness measure for their implementation versus that of another vendor, i.e. Altera. In their data sheet they give that measure as 1.125:1. I find this to be kind of hokey. For one thing, Xilinx uses Configurable Logic Blocks (CLBs) consisting of two slices of two LUT/FFs per slice. One slice has the ability to implement variable length shift registers, and the other does not. Some additional restrictions regarding the clocking and sharing of resources apply to the slices.

My current understanding of the Altera architecture is that LEs are all of the same type. Perhaps they don't support some of the capabilities that the Xilinx slices do, but Altera's elements are symmetric, which should make place and route somewhat easier. In addition, on Altera's side of the ledger is the fact that the Altera Block RAMs can be configured as shift registers while Xilinx's Block RAMs cannot.

For your purposes, I would simply compare the Xilinx LUT/FF directly to the Altera LE. The performance of the Block RAMs in these parts is roughly equivalent, although the Xilinx BRAMs in the Spartan-3E family are 18kb components, which are roughly 4x bigger than the Altera M4K Block RAMs you mention in your post. If the total amount of RAM in the two parts is roughly equal, then under certain conditions the smaller block RAMs in the Altera part may have some benefits over the larger Xilinx block RAMs.
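To put the block RAM sizes in terms of the 8 Kbyte RAM asked about earlier in the thread, here is a back-of-the-envelope count. The per-block byte capacities are assumptions based on using each block 8 bits wide and leaving the parity bits unused (2048 bytes per 18kb Xilinx BRAM, 512 bytes per M4K); check them against the vendor data sheets before relying on them.

```python
import math

# Assumed usable capacity per block when configured 8 bits wide (no parity).
BYTES_PER_XILINX_BRAM = 2048  # 18 kb block: 16 kb data + 2 kb parity
BYTES_PER_ALTERA_M4K = 512    # M4K block: 4 kb data + 512 b parity


def blocks_needed(ram_bytes: int, bytes_per_block: int) -> int:
    """Number of whole block RAMs needed to hold `ram_bytes`."""
    return math.ceil(ram_bytes / bytes_per_block)


ram_bytes = 8 * 1024
xilinx_blocks = blocks_needed(ram_bytes, BYTES_PER_XILINX_BRAM)  # 4
altera_blocks = blocks_needed(ram_bytes, BYTES_PER_ALTERA_M4K)   # 16
```

The finer granularity of the smaller blocks is one of the "certain conditions" where they can help: a design needing several small independent memories wastes less capacity per memory.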

The parts from both of these vendors are more than adequate for your purposes. I would choose the vendor whose tools you are most familiar with. I think that most of the VHDL or Verilog cores you'll find will support either FPGA with only minor modifications necessary to adapt them to one architecture or the other.

From your post, the XC3S100E-5 should be able to provide you a core from which you should be able to extract between 67 MHz and 100 MHz performance. Similarly, it appears to me that the EP2C5 part is equivalent to the Xilinx part, and it too should be adequate for your purposes. In their TQFP packages, either of these FPGAs should provide an adequate number of user I/O that you should be able to integrate several peripherals directly on-chip and still provide an external expansion bus.

MichaelM · Post by **MichaelM** » Tue Dec 11, 2012 3:18 am

I've just completed adding the capability to read the NMI, RST, and IRQ/BRK vectors in the normal manner to my M65C02 core. I decided to use the normal interrupt processing sequence to handle the RST exception. As a consequence, the latest M65C02 core (not yet uploaded to GitHUB) pushes PC and P to the stack.

BigEd posted earlier today on the thread about the 6502 cores that have been evaluated, and specifically about Arlet's core. As I was scanning that thread, a post was made regarding the writing of the stack during reset. That post referred to an external site, which states that the PC and P are not pushed onto the stack following reset.

Question: is the behavior described at the external site correct? If so, I will correct the behavior of the M65C02 core to match.

Arlet · Post by **Arlet** » Tue Dec 11, 2012 5:47 am

That reminds me... I still have to fix my core for the reset handling

MichaelM · Post by **MichaelM** » Tue Dec 11, 2012 6:01 am

Thanks, I guess your post confirms that on RST only the JMP (FFFC) is performed.

BTW cool results on the VGA line draws you and EEyE posted earlier today.

Arlet · Post by **Arlet** » Tue Dec 11, 2012 9:31 am

The external site used the visual6502 model to confirm the results, so I trust that these are correct, and that nothing gets written to the stack during RST. Of course, this applies to the NMOS version only. I don't have any information about the 65C02 behavior, but I think it's a safe assumption that it wouldn't write to the stack either.

MichaelM · Post by **MichaelM** » Tue Dec 11, 2012 8:28 pm

Updated the M65C02 core on GitHUB. Core now incorporates the WAI and STP instructions (not explicitly tested at this time) and jmp (vector) for NMI, RST, and IRQ/BRK. Adjusted the definitions of the Mode output port to allow STP, BRK, and WAI to be detected by external logic. Operation of the RST exception sequence matches that specified by the external website to which BigEd referred in an earlier post: PC and P are not pushed to the stack.
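The difference between RST and the other two exceptions can be sketched as below, using the standard 6502 vector addresses ($FFFA for NMI, $FFFC for RST, $FFFE for IRQ/BRK). This is a simplified model, not the core's microcode: `fetch_vector` and `enter_exception` are hypothetical names, and the updates to the flags in P (such as setting I) are deliberately left out.

```python
VECTORS = {"NMI": 0xFFFA, "RST": 0xFFFC, "IRQ/BRK": 0xFFFE}


def fetch_vector(memory: bytearray, kind: str) -> int:
    """Read the little-endian 16-bit vector, i.e. the jmp (vector) step."""
    base = VECTORS[kind]
    return memory[base] | (memory[base + 1] << 8)


def enter_exception(memory: bytearray, kind: str, pc: int, p: int, sp: int):
    """Return (new_pc, new_sp) after entering the exception `kind`.

    NMI and IRQ/BRK push PCH, PCL, and P to page 1; RST pushes nothing.
    Flag updates in P are omitted from this sketch.
    """
    if kind != "RST":
        for byte in (pc >> 8, pc & 0xFF, p):
            memory[0x0100 + sp] = byte
            sp = (sp - 1) & 0xFF
    return fetch_vector(memory, kind), sp
```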

Will begin attacking the Rockwell instructions. The core's output status signals should now allow easy generation of all 65C02 bus signals, except for the VP (Vector Pull) output pin. With the new microprogram controller with embedded microcycle length controller, it should be easy to generate a 65C02-compatible bus interface. The pipelined execution model used within the microcode will save cycles on many instructions. Thus, the M65C02 core is not a cycle accurate implementation.

ElEctric_EyE · Post by **ElEctric_EyE** » Tue Dec 11, 2012 8:34 pm

Very nice!

MichaelM wrote:

...Will begin attacking the Rockwell instructions...

Are you trying to incorporate all variations of the 6502?
What I mean to ask is if you plan on doing the 65CE02 as well any time in the future?
