Proper 65C02 core

Arlet · Post by **Arlet** » Mon Apr 23, 2012 6:26 pm

BigEd wrote:

A two-cycle branch-taken would help... and indeed, we see Michael has given us both a test program listing m65c02_tst2.lst and a trace of activity M65C02_SV_Output.txt which shows a BRA taking only two cycles (line 36)

Yes, but with internal block RAM or external RAM, a branch taken would take at least 3 cycles.
- 1 cycle for opcode fetch
- 1 cycle for operand fetch
- 1 cycle for changing PC/AB
(opcode fetch in next cycle)
I noticed that the RAM in M65C02_RAM.v is asynchronous, so that could make a two-cycle branch possible, because you can combine the next opcode fetch with changing PC/AB. However, that would mean the program needs to be stored in LUT RAM.
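As a rough illustration of the cycle counting above, here is a minimal Python sketch. The function name `branch_taken_cycles` and the `async_ram` flag are made up for this example and are not part of either core; the model only restates the three steps listed above.

```python
def branch_taken_cycles(async_ram: bool) -> int:
    """Count cycles for a taken branch under the simple model in this post."""
    cycles = 1   # opcode fetch
    cycles += 1  # operand (branch offset) fetch
    if not async_ram:
        # Synchronous (block or external) RAM: the new PC/AB value has to be
        # registered before the next opcode can be read, costing one cycle.
        cycles += 1
    # With asynchronous RAM the PC/AB change overlaps the next opcode fetch.
    return cycles


print("sync RAM :", branch_taken_cycles(async_ram=False))
print("async RAM:", branch_taken_cycles(async_ram=True))
```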

Windfall · Post by **Windfall** » Mon Apr 23, 2012 6:40 pm

GARTHWILSON wrote:

You did say "drop-in replacement" though, which means it goes into existing boards.

If you interpret it literally, yes. I meant metaphorically, as in 'without needing changes to the interface'.

I'm all for some nifty, new flexibility under the hood, but the top level of the design should be a 'drop-in replacement' (see, I used quotes this time).

MichaelM · Post by **MichaelM** » Mon Apr 23, 2012 11:34 pm

Arlet:

Sorry to have missed your earlier post.

My objective in the core implementation was for the basic synchronous logic to operate at 100 MHz. The only way to achieve this is to implement extra adders and decrementers for the next address, i.e. program counter, and the stack pointer, and to add some pipelining. Thus, the core uses several of these additional functional units to perform several operations in parallel. As a consequence, the LUT utilization is not nearly as resource efficient as the core that you developed.

In the main core file at lines 670 and 695, the left and right operands, respectively, for the next address are produced by 16:1 multiplexers. I did design/assign the operand select codes for the next address generator in a fashion that allows some logic reduction. When I first formulated these multiplexers, I created a combinatorial logic loop through the RAM. I broke that loop by requiring the operands for the next address to be sourced from a register: PC, OP2, OP1, or S (StkPtr). In parallel with the NA adder at line 720, the next PC value is computed in the always block at line 740. A multiplexer in that always block performs the operand selection and next PC computation. In contrast to NA, the next PC value may include the current input from the RAM, DI, as an operand. (DI is the memory input bus, and is used to drive the IR, OP1, and OP2 registers, and the address bus of the two microcode block RAMs during the cycle when the instruction opcode is present on the bus.)

Instruction execution is completed in a single cycle (except for BCD mode ADC/SBC instructions) during the instruction fetch cycle of the following instruction. Thus, the condition code status flag which affects the branch instruction is present on the cycle during which the branch address offset is being fetched. At line 750, the PC value is computed using the state of the selected CC flag, and is either the next sequential address (branch not taken) or the current PC + offset + 1. For RTI/RTS instructions, the next instruction's address is computed at line 749.
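The branch case can be sketched in Python as follows. This is only a behavioral illustration of the arithmetic described above, not the Verilog in the core; `sign_extend8` and `next_pc` are hypothetical names, and the sketch assumes PC points at the offset byte while that byte is being fetched.

```python
def sign_extend8(value: int) -> int:
    """Interpret an 8-bit value as a signed two's-complement offset."""
    return value - 0x100 if value & 0x80 else value


def next_pc(pc: int, offset_byte: int, branch_taken: bool) -> int:
    """Next PC while the branch offset is on the data input bus.

    pc is assumed to hold the address of the offset byte.
    """
    if branch_taken:
        # Taken: current PC + offset + 1
        return (pc + sign_extend8(offset_byte) + 1) & 0xFFFF
    # Not taken: next sequential address
    return (pc + 1) & 0xFFFF


# Offset byte at 0x0200 with value 0xFE (-2): the branch loops back to its
# own opcode at 0x01FF.
print(hex(next_pc(0x0200, 0xFE, branch_taken=True)))
```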

These additional two adder functions allow the core to trim 1 clock cycle from branches and interrupt/subroutine returns. With respect to stack operations, the stack pointer is implemented with an add/sub at the S register input, and an adder at the S register output. The adder on the output is used during stack pops to eliminate a dead cycle since the 65C02 points to the next free address location. This characteristic of the 65C02 is great for push operations, but causes pops to incur a one clock cycle penalty as the S register is pre-incremented before the stack read can be performed.
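A small Python model may make the stack pointer arrangement easier to follow. The class and method names are invented for this sketch; it assumes the usual 65C02 page-one stack at 0x0100 to 0x01FF and models only the addresses, not the timing.

```python
class StackPointer:
    """Behavioral sketch of S with an adder on its output for pops."""

    def __init__(self, s: int = 0xFF) -> None:
        self.s = s  # S points to the next free stack location

    def push_address(self) -> int:
        # Push: S already points at the free slot, so use it directly,
        # then decrement S (the add/sub at the register input).
        addr = 0x0100 | self.s
        self.s = (self.s - 1) & 0xFF
        return addr

    def pop_address(self) -> int:
        # Pop: the adder on the S output forms S + 1 in the same cycle,
        # so the read does not wait for S to be incremented first.
        incremented = (self.s + 1) & 0xFF
        self.s = incremented
        return 0x0100 | incremented


sp = StackPointer()
pushed = sp.push_address()
popped = sp.pop_address()
print(hex(pushed), hex(popped))
```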

The MAR register is used to capture the NA adder output, and is used as the basis for the NA during absolute, indexed absolute, and post-indexed indirect addressing.

I ran into a problem with interrupt servicing. In that case, since the core performs the NA and next PC calculations in parallel, the PC value is pointing to the next instruction instead of the last byte of the instruction being interrupted. To correct this issue without changing the PC always block, another register was added to capture the previous value of the PC whenever the PC is modified (line 762), and two in-line multiplexers were added at lines 838 and 839. These multiplexers select the PC's past value instead of the PC's current value whenever an interrupt push operation is performed as a consequence of an instruction interrupt instead of a BRK instruction.

Hope this explanation answers your question.

MichaelM · Post by **MichaelM** » Tue Apr 24, 2012 12:07 am

Windfall:

Just to follow up about the 'proper' core's external interface. Why would you want to burden an FPGA/CPLD programmable core with the ancient interface specification of the 65C02?

Proper design techniques within FPGAs/CPLDs dictate a synchronous interface and would be difficult to achieve using a two-cycle interface based on Phi2 (or E) as the data enable. Without support for a variable memory cycle, the core and the memory would have to run at the same speed in such a system.

My thought process when developing the M65C02 core was to provide the basic interface signals so that an external memory interface and the core logic could be wrapped into a component suitable for whatever internal or external bus a user desired to implement. In the core that I posted, that may be a bit difficult because of the number of logic levels in the NA (AO) data path and the requirement that the Ack be returned in the same cycle. If the clock is slowed down sufficiently to allow additional logic delays in order to generate a wait state request, then the Ack signal can be used by memories or I/O components external to the core to stretch its memory cycle.

MichaelM · Post by **MichaelM** » Tue Apr 24, 2012 12:14 am

Ed:

I would be happy to work with you on-line or via PM to get the upload hosted on GitHub.

Windfall · Post by **Windfall** » Tue Apr 24, 2012 12:46 am

MichaelM wrote:

Just to follow up about the 'proper' core's external interface. Why would you want to burden an FPGA/CPLD programmable core with the ancient interface specification of the 65C02?

I don't suggest replacing the more flexible interface. I suggest providing an 'ancient' one on top of it, so those who need it are not burdened with re-inventing and testing it. When you, as the maker, can probably write one in five minutes.

MichaelM · Post by **MichaelM** » Tue Apr 24, 2012 1:21 am

Agree.

Now that the basic core is complete, I am working toward building an implementation with an internal memory block, and an external interface such that it is very much like that of the original. However, some changes have been required to the next address generator to provide a suitable registered address for both the internal and external memory interfaces. That resulted in a performance drop of the base core that took me a while to resolve. In resolving those issues, I've had to modify several of the microprogram control fields from encoded fields into one-hot decoded fields to reduce the number of logic levels. The cascaded changes have put my testing efforts for the second core/macro component a bit behind.

When I have something, I'll release the second core/macro component under the same license as the first.

Windfall · Post by **Windfall** » Tue Apr 24, 2012 10:59 am

MichaelM wrote:

Agree.

I would also like to suggest something else, if I may : on interrupts, accept a vector address instead of a code address. A code address makes it unnecessarily hard to adapt your design to existing hardware, because it requires hard-wiring vector contents in the interrupt controller, doing memory lookups in the interrupt controller, or rewriting all the software that is to be run by the core (replace vectors with JMPs). All of which is very awkward.

MichaelM · Post by **MichaelM** » Tue Apr 24, 2012 11:31 am

That's the intended purpose of the interrupt concept as implemented in the core. It could stand some additional refinements, however.

As it stands at the moment, the feature that you appear to want included is simply a vectored interrupt table which would exist at some address in memory.

The way I see it, the interrupt vector controller would have to monitor the memory writes into the table and capture those writes into an internal table. The associated interrupt request would function as the address for the vector, and the controller would present that address to the core when the core acknowledges the interrupt. The present core provides the necessary signals for this concept, but the interrupt controller is rudimentary in its implementation.

In my view, the table could be located anywhere in memory. The interrupt handling implementation concept for the core was driven by the fact that in most current implementations, ROM exists at the locations of the vector addresses for RST, NMI, and IRQ. Unfortunately, that prevents an easy remap of these traps without adding a service routine to provide the remapping using jumps, etc. That approach can naturally add significant latency, and defeats one of the features of the 65C02 architecture: low interrupt latency. Furthermore, in a new application of the core, it would be best to further reduce the interrupt latency by eliminating the indirect jump through the vector that the current processor performs. That operation is unnecessary if the vector or a table of vectors is captured and managed by the interrupt controller external to the core logic.

An interesting extension of this concept would involve modifications to the BRK instruction processing sequence. The definition of BRK is such that it effectively functions as BRK #imm. However, although the PC advances to point to the "immediate" operand, the instruction makes no actual use of that "immediate" operand. My understanding of existing use of the BRK opcode is that the debugger code may or may not use the "immediate" operand, and if it does not, then it must adjust the PC prior to the RTI in order to execute the instruction represented by the "immediate" operand to BRK. It is not trivial, but it is possible to modify the instruction sequence for BRK such that the byte following the opcode can be treated as an index into a table of "software" interrupts as is available in some other processors.

Windfall · Post by **Windfall** » Tue Apr 24, 2012 12:35 pm

MichaelM wrote:

That's the intended purpose of the interrupt concept as implemented in the core. It could stand some additional refinements, however.

As it stands at the moment, the feature that you appear to want included is simply a vectored interrupt table which would exist at some address in memory.

The way I see it, the interrupt vector controller would have to monitor the memory writes into the table and capture those writes into an internal table. The associated interrupt request would function as the address for the vector, and the controller would present that address to the core when the core acknowledges the interrupt. The present core provides the necessary signals for this concept, but the interrupt controller is rudimentary in its implementation.

In my view, the table could be located anywhere in memory. The interrupt handling implementation concept for the core was driven by the fact that in most current implementations, ROM exists at the locations of the vector addresses for RST, NMI, and IRQ. Unfortunately, that prevents an easy remap of these traps without adding a service routine to provide the remapping using jumps, etc. That approach can naturally add significant latency, and defeats one of the features of the 65C02 architecture: low interrupt latency. Furthermore, in a new application of the core, it would be best to further reduce the interrupt latency by eliminating the indirect jump through the vector that the current processor performs. That operation is unnecessary if the vector or a table of vectors is captured and managed by the interrupt controller external to the core logic.

Doing the lookup in the core increases interrupt latency by just two cycles. Being forced to build a complicated interrupt controller instead is not a sensible tradeoff, in my opinion. Again, it makes the core very awkward to use in all existing designs, and forces new designs to adopt an interrupt handling system that no other core uses.

Arlet · Post by **Arlet** » Tue Apr 24, 2012 4:29 pm

MichaelM,

Thanks for the explanation. Of course, once you switch to a synchronous RAM, you're going to have to add a cycle to any instruction that changes the PC, like branches taken, to refill the pipeline.

Windfall · Post by **Windfall** » Tue Apr 24, 2012 7:22 pm

MichaelM wrote:

An interesting extension of this concept would involve modifications to the BRK instruction processing sequence. The definition of BRK is such that it effectively functions as BRK #imm. However, although the PC advances to point to the "immediate" operand, the instruction makes no actual use of that "immediate" operand. My understanding of existing use of the BRK opcode is that the debugger code may or may not use the "immediate" operand, and if it does not, then it must adjust the PC prior to the RTI in order to execute the instruction represented by the "immediate" operand to BRK. It is not trivial, but it is possible to modify the instruction sequence for BRK such that the byte following the opcode can be treated as an index into a table of "software" interrupts as is available in some other processors.

I have your core up and running in my project, but there are still issues with interrupts (I process both IRQs and NMIs). E.g. could you perhaps explain, in the absence of documentation of the core interface (I'm trying to figure this out as I go along, really), how BRKs are supposed to be detected and redirected by the 'interrupt controller' ?

MichaelM · Post by **MichaelM** » Wed Apr 25, 2012 12:49 am

Arlet:

You have hit the nail on the head. I have begun attacking this deficiency in the current implementation. A new next address generator and wait state inserter have been included in the core. A 16kB block RAM has been wrapped up with the core to form a more complete 65C02 implementation. I expect to be able to devote this coming weekend to working out the pipelining issues and adjusting the microcode to account for the additional pipeline delay in the block RAM read data path. I've had to make additional architectural changes to maintain the 100 MHz objective, and that's been a good learning exercise.

GARTHWILSON · Post by **GARTHWILSON** » Wed Apr 25, 2012 1:20 am

Quote:

Proper design techniques within FPGAs/CPLDs dictate a synchronous interface and would be difficult to achieve using a two-cycle interface based on Phi2 (or E) as the data enable. Without support for a variable memory cycle, the core and the memory would have to run at the same speed in such a system.

My thought process when developing the M65C02 core was to provide the basic interface signals so that an external memory interface and the core logic could be wrapped into a component suitable for whatever internal or external bus a user desired to implement. In the core that I posted, that may be a bit difficult because of the number of logic levels in the NA (AO) data path and the requirement that the Ack be returned in the same cycle. If the clock is slowed down sufficiently to allow additional logic delays in order to generate a wait state request, then the Ack signal can be used by memories or I/O components external to the core to stretch its memory cycle.

It might be easier to do that in the clock source. It does not necessarily have to be a constant frequency; so the low time or the high time of a single cycle of the phase-0 input can be stretched, with the timing being infinitely variable, and the cycle does not need to be the same length as the one before or after it. IOW, not just the frequency but also the duty cycle could be changed from cycle to cycle, based on the amount of time needed by the memory or peripheral being addressed. The performance would be better than wait states too, because if for example 1 period is too short but 2 periods is more than necessary and constitutes a waste of time, it can be 1.43 or any other ratio of the fastest clock speed, with no obligation to go in multiples. This can be done with a plain 6502, given the right external circuit connected to the clock source.
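To put numbers on the 1.43 example, here is a short Python sketch comparing the two approaches. The function names and the choice of 100 time units for the fastest clock period are illustrative assumptions only.

```python
import math


def cycle_time_wait_states(required: int, period: int) -> int:
    """Cycle length when the access is padded to whole clock periods."""
    return math.ceil(required / period) * period


def cycle_time_stretched(required: int, period: int) -> int:
    """Cycle length when a single clock cycle is stretched as needed."""
    return max(required, period)


period = 100    # fastest clock period, arbitrary time units (assumed)
required = 143  # a device needing 1.43 periods

print("wait states:", cycle_time_wait_states(required, period))
print("stretched  :", cycle_time_stretched(required, period))
```

With wait states the slow access costs two full periods; with a stretched clock it costs only the 1.43 periods the device actually needs.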

MichaelM · Post by **MichaelM** » Wed Apr 25, 2012 1:38 am

Windfall:

The interrupt vector controller interface consists of the IRQ_Msk (output), Int (input), Vector (input), and IntSvc (output) ports. IRQ_Msk is the I bit from the processor status word, PSW[2] (line 827). It is expected that the interrupt controller will provide synchronization and edge detection of an NMI signal, and level detection and masking of the IRQ signal. Those signals are expected to be registered and combined in a priority encoder to drive the Int input signal. The microprogram will process Int on an allowed instruction boundary. When the microprogram begins the processing of the external interrupt, it will assert the IntSvc output and sample the interrupt Vector. If I were to implement that controller now, I would latch a detected NMI edge and hold it until the IntSvc is asserted. While the NMI latch is set, I would drive Vector with the address programmed into the NMI vector locations. On the falling edge of IntSvc, I would clear the NMI latch to allow additional NMI interrupts, or to take an IRQ interrupt. WRT IRQ, I would apply a level detector and mask with the IRQ_Msk signal. The unmasked, level detected IRQ and latched NMI interrupt requests could simply be ORed together to form Int.

I may have muddied the waters earlier with respect to the BRK instruction. The current core implements the BRK instruction in the same manner as the WDC W65C02. I tested the BRK instruction (PC = 0x5D7), which in the assembly language test program I followed by a 0xFF byte so that I did not have to write a full interrupt service routine to adjust the return address prior to the RTI. (All unimplemented instructions execute in a single cycle, and don't demonstrate any side effects: NOPs.) You can see that the value of P pushed onto the stack has the appropriate bits set so that the interrupt service routine can distinguish between BRKs and IRQs.

To implement the BRK in the manner described would require a change to the microcode to fetch the byte following the BRK into the memory operand register, OP1. The external interrupt controller would then apply the value of the OP1 register as an offset to the value of the BRK vector.
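The controller described above could be modeled roughly as follows. This is a Python behavioral sketch, not Verilog from the core: the class name `IntController`, the `step` method, and the use of active-high booleans for NMI and IRQ are all assumptions made for illustration. Each call to `step` stands for one clock.

```python
class IntController:
    """Sketch of an external interrupt controller for the Int/Vector ports."""

    def __init__(self, nmi_vector: int, irq_vector: int) -> None:
        self.nmi_vector = nmi_vector  # address programmed for NMI
        self.irq_vector = irq_vector  # address programmed for IRQ
        self.nmi_prev = False
        self.int_svc_prev = False
        self.nmi_latch = False

    def step(self, nmi: bool, irq: bool, irq_msk: bool, int_svc: bool):
        # Falling edge of IntSvc clears the NMI latch.
        if self.int_svc_prev and not int_svc:
            self.nmi_latch = False
        # Edge detection on NMI sets the latch.
        if nmi and not self.nmi_prev:
            self.nmi_latch = True
        self.nmi_prev = nmi
        self.int_svc_prev = int_svc

        # IRQ is level sensitive and gated by the I bit (IRQ_Msk).
        irq_request = irq and not irq_msk

        # Int is the OR of both requests; NMI has priority on Vector.
        int_out = self.nmi_latch or irq_request
        vector = self.nmi_vector if self.nmi_latch else self.irq_vector
        return int_out, vector


ctrl = IntController(nmi_vector=0x1000, irq_vector=0x2000)
print(ctrl.step(nmi=True, irq=False, irq_msk=False, int_svc=False))
```

In this sketch the latch clear is evaluated before the edge detect, so an NMI edge arriving in the same clock as the falling edge of IntSvc is not lost. That ordering is a design choice of the example, not something stated in the description above.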
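As a minimal sketch of that address calculation in Python: the function name `soft_int_vector` and the `stride` parameter are hypothetical. The description above only says OP1 is applied as an offset, so the default stride of 1 is a plain byte offset, and any other table entry size is an assumption.

```python
def soft_int_vector(brk_vector: int, op1: int, stride: int = 1) -> int:
    """Vector presented for BRK #imm: BRK vector plus an OP1-derived offset.

    stride is a hypothetical entry size; 1 means OP1 is a plain byte offset.
    """
    return (brk_vector + (op1 & 0xFF) * stride) & 0xFFFF


# BRK followed by the byte 0x10, with the BRK vector at 0x8000.
print(hex(soft_int_vector(0x8000, 0x10)))
```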

It is also a change to the microcode to change the behavior of the core to NMI, RST, and IRQ traps. Instead of having Vector point into program RAM, Vector would point to the original 65C02's vector locations. The microcode would capture these original vector addresses into the {OP2, OP1} register pair and then execute the terminal portion of the JMP (abs) instruction sequence.
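That modified trap sequence would behave roughly like the Python sketch below. The function name and the dictionary standing in for memory are illustrative assumptions; the only facts relied on are that 65C02 vectors are stored low byte first and that the IRQ/BRK vector lives at 0xFFFE/0xFFFF.

```python
def fetch_vector_and_jump(memory: dict, vector: int) -> int:
    """Read a two-byte vector into {OP2, OP1} and return the new PC."""
    op1 = memory[vector]                   # low byte of the handler address
    op2 = memory[(vector + 1) & 0xFFFF]    # high byte of the handler address
    # Terminal portion of JMP (abs): PC is loaded from {OP2, OP1}.
    return (op2 << 8) | op1


# IRQ vector at 0xFFFE/0xFFFF pointing at a handler at 0x1234.
memory = {0xFFFE: 0x34, 0xFFFF: 0x12}
print(hex(fetch_vector_and_jump(memory, 0xFFFE)))
```

The two extra reads are the lookup cost Windfall mentions earlier in the thread.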

If your intent is to implement the core to execute a program from external memory, you will probably find that the Tcko values will require that the clock be slowed to 40 MHz or less. I've not sat down to consider how the input and output path delays would impact the achievable operating speed with an external asynchronous RAM.
