M65C02A Core

Tor · Post by **Tor** » Mon Jul 13, 2015 6:50 am

It can move <transfer length> bytes, isn't that a block move? Except that some block move operations will take care of overlap as well (looking at my minicomputer emulators right now - on one of the architectures a multi-byte copy is called just 'move' and the one which handles overlap is called 'bmove').

BigEd · Post by **BigEd** » Mon Jul 13, 2015 7:37 am

I'm supposing that the non-interruptable block move has higher performance, but of course will hurt interrupt latency.

MichaelM · Post by **MichaelM** » Mon Jul 13, 2015 12:00 pm

Dr Jefyll wrote:

In one respect I found your choice of terminology odd. What you refer to as an interruptible block move seems to me to be better described as simply a byte move. There's nothing "block" about it, unless I'm missing something. (I realize it updates the source, destination, count and sets the flags.)

A byte move can be easily implemented using normal load/store instructions. Since MOV requires setting up a transfer length counter (A), a source pointer (X), and a destination pointer (Y), I have described it as a block move. Performing the set up will require 10 memory cycles, and then at least 4 more memory cycles to move a single byte. This is a high overhead ratio. A byte move using normal load/store instructions can be performed in less than 14 memory cycles even if the most complex addressing modes are used. I don't think that forcing a break in the execution of the instruction changes its purpose or description.

Dr Jefyll wrote:

MichaelM wrote:

The IND prefix instruction is used to select the interruptable MOV instruction.

Is it possible this option could instead be controlled by the operand byte you mentioned? It seems the operand byte would have extra bits available, since the src & dest mode require only two bits each. Perhaps you've already thought of this idea but there's a gotcha!

As usual, you are a keen observer.

The quick answer is: No. I had been looking for a way to force the microprogram sequencer to break the transfer loop of the MOV instruction. Without adding any additional logic, the microprogram sequencer has access to a number of the prefix instruction flags, but not to the working register where the MOV mode is stored. However, testing those prefix instruction flags and performing a multi-way branch to support interrupts required a dummy memory cycle in the instruction's termination sequence. That dummy cycle bothered me, so I found another logic-free way to break the instruction sequence and keep a looping construct to 7 memory cycles, i.e. the '816 MVN/MVP instruction's data transfer cycle length.

I may have eventually seen that the mode register could be used for the same purpose as I was using the IND flag register, but your observation allows the instruction to be used in a more cycle efficient manner. My solution had the effect of removing dummy cycles, but I was still left with the need to create an interruptable memory cycle. My initial solution met the 7 cycle transfer length objective, but still left me with the problem of trying to interrupt the instruction pipeline of the core. Your suggestion allows eliminating the IND prefix instruction from the loop, and following the MOV instruction with a NOP before the BNE $-5 conditional branch instruction. Thus, your suggested approach solves the remaining problem while maintaining the 7 cycle transfer loop that I was targeting. Many thanks for the suggestion.

MichaelM · Post by **MichaelM** » Mon Jul 13, 2015 12:23 pm

BigEd wrote:

I'm supposing that the non-interruptable block move has higher performance, but of course will hurt interrupt latency.

Once initialization is complete, the non-interruptable MOV instruction moves data at a rate of 2 memory cycles per byte. A block transfer loop using the interruptable MOV instruction moves data at a rate of 7 memory cycles per byte. Either loop needs to be initialized, and initialization of the MOV instruction requires 10 to 12 memory cycles.
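To put rough numbers on that trade-off, here is a minimal Python sketch using only the figures quoted above (2 cycles per byte, 7 cycles per byte, 10 to 12 cycles of setup). The function names are hypothetical, the setup cost defaults to the lower bound of 10, and the latency figure is only an assumed upper bound for an interrupt that arrives just as the block transfer begins.

```python
# Rough cycle-count model built from the figures quoted in the post above.
# All names here are hypothetical; nothing is taken from the M65C02A sources.
SETUP_CYCLES = 10          # assumed: lower bound of the quoted 10-12 cycle setup
BLOCK_CYCLES_PER_BYTE = 2  # non-interruptable (auto-repeating) MOV
LOOP_CYCLES_PER_BYTE = 7   # single-transfer MOV inside a branch loop


def block_move_cycles(count, setup=SETUP_CYCLES):
    """Total memory cycles for the non-interruptable block MOV."""
    return setup + BLOCK_CYCLES_PER_BYTE * count


def loop_move_cycles(count, setup=SETUP_CYCLES):
    """Total memory cycles for the interruptable MOV loop."""
    return setup + LOOP_CYCLES_PER_BYTE * count


def worst_case_added_latency(count):
    """Assumed upper bound on cycles an interrupt waits behind a block MOV."""
    return BLOCK_CYCLES_PER_BYTE * count


if __name__ == "__main__":
    for n in (1, 16, 256):
        print(n, block_move_cycles(n), loop_move_cycles(n),
              worst_case_added_latency(n))
```

Under these assumptions the block form is always cheaper in total cycles, and the price is paid in latency that grows linearly with the transfer length.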

Nothing good is ever free.

I spent a lot of fruitless time trying to make the MOV instruction interruptable at the microprogram level. None of the approaches I considered produced a satisfactory result. The additional logic complexity needed was unwarranted for a single instruction. The M65C02A, like all 6502-like processors, has a low interrupt latency because it doesn't have any instructions with long execution times. Jeff's suggestion above, which I've implemented, is the final solution to this problem. If interrupt latency is a concern, then an M65C02A programmer has the option of implementing a block transfer loop using the interruptable MOV instruction.

OT: interrupt latency issues are a system design and implementation issue. IMO, unbuffered I/O devices should not be used unless absolutely necessary. Unbuffered I/O is the primary reason interrupt latency is such a concern in 8-bit systems like 6502-based systems. I tend to use buffered I/O devices in order to avoid the majority of interrupt latency issues. I even buffer event timers in some of my designs.

BigEd · Post by **BigEd** » Mon Jul 13, 2015 12:42 pm

Thanks for the explanation Michael. Just a thought: the early computers tended to have a DMA engine, or several, presumably because tying up the CPU to move data to or from devices was a waste, and an obstacle to time-sharing, and not as fast as a dedicated engine. In that approach, the DMA engine is autonomous and has a little state, but the state isn't part of the CPU state and so doesn't need to be preserved across interrupts or task switches. I wonder if it's worth considering that approach? The CPU needs to be able to tell whether the DMA engine is busy or idle, and in the limit it can spin in a conventional code loop waiting for idle: interrupts will look after themselves in that loop.

Dr Jefyll · Post by **Dr Jefyll** » Mon Jul 13, 2015 7:10 pm

MichaelM wrote:

Many thanks for the suggestion.

Glad that works out! As for my remark about terminology, there's evidently some confusion. I'll try again.

       ldx # SrcAddress
       ldy # DestAddress
       lda # CountMinusOne
       mov inc, inc, autorepeat ; <----- 

       ;done -- all bytes moved now

The instruction marked with the arrow is a block move instruction. BTW, I invented my own assembler syntax for the options expressed in the operand byte.

The marked instruction is non-interruptable. It also takes a long time to execute, which raises a concern about interrupt latency.

       ldx # SrcAddress
       ldy # DestAddress
       lda # CountMinusOne
Mv_Lp: mov inc, inc, single       ; <-----
       bne Mv_Lp

       ;done -- all bytes moved now

In this case the instruction marked with the arrow is not a block move instruction; it moves just one byte. (It has properties making it attractive for use in a block-move loop, but it doesn't do a block move.) Once again the marked instruction is non-interruptable, but in this case it executes quickly, thereby avoiding interrupt latency concerns.

All I'm saying is, the designations "interruptable" and "noninterruptable" are not the best, and I believe better terminology can be found -- maybe "REP" or "autorepeat" or something like that. It's possible I'm being swayed by personal preference. However, I believe a newcomer to your design will more quickly understand the mov instruction if its options are carefully and descriptively named.

-- Jeff
[Edit: booboo in second example]

MichaelM · Post by **MichaelM** » Mon Jul 13, 2015 10:02 pm

Dr Jefyll wrote:

All I'm saying is, the designations "interruptable" and "noninterruptable" are not the best, and I believe better terminology can be found -- maybe "REP" or "autorepeat" or something like that. It's possible I'm being swayed by personal preference. However, I believe a newcomer to your design will more quickly understand the mov instruction if its options are carefully and descriptively named.

I agree with your assessment. I like the notation that you suggested in the examples provided above. Using a third operand term as you've suggested above or using a different mnemonic like MOV or SMOV to designate the block move version or the single move version of the instruction is a matter of personal preference. Therefore, I will adopt your recommended terminology when referring to this opcode in the future.

I do like the REP or autorepeat identifiers. I am somewhat fond of REP, as in the REP prefix applied to the x86 INS and OUTS instructions. In defining the prefix instructions for the M65C02A, I really wanted to include a REP prefix.

One comment about your identifiers: the CountMinusOne should simply be Count. I decrement the transfer length during the data memory read cycle so that the condition code is valid during the following data memory write cycle. In this way the ALU flags are set on the instruction cycle that determines if another transfer is required or if the next instruction should be fetched and decoded.
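A small Python model may make the Count versus CountMinusOne point concrete. This is only a sketch of the behaviour as described in this thread, not the actual microprogram: the register dictionary, the function names, and the increment/increment addressing are all assumptions made for illustration.

```python
def mov_single(mem, regs):
    """One single-transfer MOV step, modelled on the description above.

    regs holds 'a' (remaining count), 'x' (source pointer), 'y' (destination
    pointer) and 'z' (zero flag). Both pointers increment (assumed mode).
    """
    value = mem[regs["x"]]        # data memory read cycle
    regs["a"] -= 1                # count is decremented during the read
    regs["z"] = regs["a"] == 0    # so the flag is valid for the write cycle
    mem[regs["y"]] = value        # data memory write cycle
    regs["x"] += 1
    regs["y"] += 1


def move_loop(mem, src, dst, count):
    """Model of 'Mv_Lp: mov inc, inc, single / bne Mv_Lp'; returns bytes moved."""
    if count < 1:
        raise ValueError("this model assumes a count of at least 1")
    regs = {"a": count, "x": src, "y": dst, "z": False}
    moved = 0
    while True:
        mov_single(mem, regs)
        moved += 1
        if regs["z"]:             # BNE falls through once Z is set
            return moved


if __name__ == "__main__":
    mem = bytearray(32)
    mem[0:4] = b"ABCD"
    print(move_loop(mem, 0, 16, 4), bytes(mem[16:20]))
```

Because the decrement happens before the flag is tested, loading A with the plain byte count moves exactly that many bytes in this model; loading CountMinusOne would leave the last byte unmoved.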

MichaelM · Post by **MichaelM** » Mon Jul 13, 2015 10:16 pm

BigEd wrote:

Thanks for the explanation Michael. Just a thought: the early computers tended to have a DMA engine, or several, presumably because tying up the CPU to move data to or from devices was a waste, and an obstacle to time-sharing, and not as fast as a dedicated engine. In that approach, the DMA engine is autonomous and has a little state, but the state isn't part of the CPU state and so doesn't need to be preserved across interrupts or task switches. I wonder if it's worth considering that approach? The CPU needs to be able to tell whether the DMA engine is busy or idle, and in the limit it can spin in a conventional code loop waiting for idle: interrupts will look after themselves in that loop.

As part of the I/O peripheral set, including a DMA engine within a System-On-Chip would be beneficial. The DMA processors/channels of past systems are a concept that warrants consideration. The inability of the BRAMs to perform two read/write accesses per CPU cycle would mean that while a DMA engine is active, the processor would necessarily have to slow down or even stop. A cycle steal DMA engine would disturb the processor the least, but the paucity of dummy memory cycles in a 6502/65C02/M65C02A processor means that cycle stealing would not be very fast. A burst mode DMA engine with programmable burst lengths would give better overall performance.

I think that the M65C02A core, using the COP #imm instruction to access coprocessors, would be able to support a DMA engine or I/O channel concept as used in computers such as the Data General Nova. To complete the M65C02A architecture, I will need to include a coprocessor. Rather than a DMA engine, I will probably implement a Booth or DSP48A-based multiplier as my coprocessor example.

BigEd · Post by **BigEd** » Tue Jul 14, 2015 8:29 am

I did wonder what the story was with memory contention in the early machines. As your machine is very efficient (not many idle cycles) the effect will be worse. For some machines, the answer was multi-banked memory. These days, we'd consider an instruction cache or small instruction prefetch buffer - which would be ideal if indeed the CPU is in a tight polling loop during the DMA. In your case, if your BRAMs are your memory, they are probably dual-ported and can be configured to be quite wide - possibly either of those could help a bit.

nyef · Post by **nyef** » Tue Jul 14, 2015 10:45 am

BigEd wrote:

These days, we'd consider an instruction cache or small instruction prefetch buffer - which would be ideal if indeed the CPU is in a tight polling loop during the DMA.

I've been thinking about using a 16-bit wide data bus to RAM/ROM, multiplexing it down to 8 bits for the CPU, and then having instruction fetches run through a two-byte single-line cache. With this, a little over half of your instruction-fetch cycles no longer need to hit the memory bus, leaving them free for DMA or whatever. And for those crazy self-modifying-code people, you could have it snoop memory write cycles.

Doing the same thing with a write-through data cache is probably overkill.
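For anyone who wants to experiment with the two-byte instruction-fetch cache idea, here is a minimal Python sketch. It assumes a 16-bit memory word aligned on even addresses, a single cached line, and a snoop that updates the line on a matching write; the class and attribute names are hypothetical, and the real hit rate depends on the instruction stream rather than on this toy trace.

```python
class TwoByteFetchCache:
    """Single-line, two-byte instruction-fetch cache in front of 16-bit memory."""

    def __init__(self, memory):
        self.memory = memory      # bytearray of even length (assumed)
        self.tag = None           # word address of the cached line, if any
        self.line = [0, 0]        # the two cached bytes
        self.bus_reads = 0        # fetches that needed the memory bus
        self.hits = 0             # fetches served from the cached line

    def fetch(self, addr):
        """Instruction fetch: only a tag mismatch costs a bus cycle."""
        tag = addr >> 1
        if tag != self.tag:
            base = tag << 1
            self.line = [self.memory[base], self.memory[base + 1]]
            self.tag = tag
            self.bus_reads += 1
        else:
            self.hits += 1
        return self.line[addr & 1]

    def write(self, addr, value):
        """Data write: snoop it so self-modifying code still sees fresh bytes."""
        self.memory[addr] = value
        if (addr >> 1) == self.tag:
            self.line[addr & 1] = value


if __name__ == "__main__":
    cache = TwoByteFetchCache(bytearray(range(16)))
    for pc in range(8):
        cache.fetch(pc)
    print(cache.bus_reads, cache.hits)
```

In this model a straight-line run starting on an even address alternates miss and hit, so every second fetch cycle leaves the bus free.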

MichaelM · Post by **MichaelM** » Tue Jul 14, 2015 12:02 pm

For the current implementation of the M65C02A core there are only three instructions with dummy memory cycles: PHR (Push Relative Effective Address); PHW (Push 16-bit Word); and PLW (Pull/Pop 16-bit Word). That means that in the majority of circumstances, there are no free memory cycles.

For 6502-like processors, the best approach IMO to increasing memory bandwidth economically, in order to provide free memory cycles for concurrent DMA operations, is to create a dual-ported memory where instructions are fetched on one bus and data on the other. Since the majority of memory cycles issued by 6502-like processors (or most other processors) are instruction memory fetches, the data port into the memory would be relatively free and could then easily support burst mode or cycle stealing DMA.

The M65C02A core does not contain a memory interface, interrupt handler, or peripherals. Those features are provided by the application-specific SOC implementation. Thus, the M65C02A core can support separate address spaces for instructions and data, if that is a desirable feature in a particular application. In its demonstration SOC configuration, the M65C02A uses a single, unified address space for instructions and data. The second port of the BRAMs, which implement the SOC demonstrator's memory, could be used to support a DMA engine as a coprocessor or a peripheral function.

BigEd · Post by **BigEd** » Tue Jul 14, 2015 3:49 pm

Even a small icache might be a big help - but as you suggest, that's not part of the core, it's part of the memory subsystem.

Another thought: in a single-tasking context, or a context where I/O isn't especially slow, a single instruction something like WAI could allow the CPU to quiesce while the DMA engine does its thing.

MichaelM · Post by **MichaelM** » Tue Jul 14, 2015 4:11 pm

With a 65C02-compatible core like the M65C02A, both the STP and WAI instructions are implemented. Your suggestion has merit, but implementing DMA while in a WAI-induced processor core sleep cycle provides no better performance than what can be achieved with a block MOV instruction. In fact, the MOV instruction makes efficient use of existing resources within the processor core. Using an external DMA engine will add significantly more hardware.

BigEd · Post by **BigEd** » Tue Jul 14, 2015 4:20 pm

Right, yes, of course - your uninterruptable MOV is more or less an in-CPU DMA operation. The tradeoff then would be between the interrupt latency of the MOV and the complexity of the standalone DMA engine. I've no issue with the choice you've made, just mulling over the various possibilities.

BigDumbDinosaur · Post by **BigDumbDinosaur** » Tue Jul 14, 2015 4:33 pm

BigEd wrote:

Another thought: in a single-tasking context, or a context where I/O isn't especially slow, a single instruction something like WAI could allow the CPU to quiesce while the DMA engine does its thing.

Or the DMA controller (DMAC) could halt the MPU with RDY and tri-state the MPU's buses, making the DMAC the bus master. If it's fast enough, the DMAC won't tie up the system too long and while interrupt performance would briefly suffer, we'd be talking milliseconds at worst. It's a scheme that I periodically contemplate with POC V?.
