M65C02A Core

barrym95838 · Post by **barrym95838** » Mon Aug 11, 2014 3:44 am

On further reflection, I think that you might be onto something very useful with your PLW zp and PLW abs instructions. They significantly lower the instruction count for things like Forth's EXIT (plw IP, jmp NEXT), and they continue the beloved 65xx tradition of keeping active pointers in memory instead of in registers (or even on the stack). The only remaining inefficiency to which I can quickly point is the lack of 16-bit-wide memory increments and decrements, like INW zp, DEW zp and friends. If you keep Forth's data stack TOS in a fixed location of zp, you seem to have the ingredients coming together for an efficient interpretive engine, even without auto-increment/decrement.

Even with its narrow hardware registers, I think that I would really enjoy writing assembly language programs for your processor, and that's not the kind of praise that I give lightly.

Code:

        ...                     ; revised from previous example, to make use of PLW
        bsr  aaa                ; push run-time pointer to table
table:  .dw  value1, value2, value3, value4
aaa:    plw  N                  ; table pointer goes to zp
        ldy  #1
aloop:  lda  (N),y
        tax                     ; low half of value in x
        iny
        lda  (N),y              ; high half of value in a
        bsr  dosomething        ; use 16-bit value for something
        iny
        cpy  #9
        bne  aloop              ; do it 4 times
        bra  elsewhere
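
The indexing in the listing above can be checked with a small behavioural model. This is a sketch in Python, not M65C02A code, and it rests on one assumption that the post does not state outright: that BSR, like JSR, pushes the address of its own last byte, so PLW leaves N pointing at table - 1 (which would explain why the loop starts at Y = 1). The address `TABLE` and the table contents are made up for illustration.

```python
# Behavioural model of the table-walk loop (illustrative, not M65C02A code).
# Assumption: BSR pushes the address of its own last byte, so after PLW
# the zero-page pointer N holds table - 1 and the loop starts at Y = 1.

def walk_table(memory, n):
    """Return the 16-bit values the loop would hand to `dosomething`."""
    values = []
    y = 1                                   # ldy #1
    while True:
        lo = memory[n + y]                  # lda (N),y / tax
        y += 1                              # iny
        hi = memory[n + y]                  # lda (N),y
        values.append((hi << 8) | lo)       # bsr dosomething
        y += 1                              # iny
        if y == 9:                          # cpy #9 / bne aloop
            break
    return values


TABLE = 0x2000                              # hypothetical address of `table`
memory = {}
for i, word in enumerate([0x1111, 0x2222, 0x3333, 0x4444]):
    memory[TABLE + 2 * i] = word & 0xFF     # .dw stores the low byte first
    memory[TABLE + 2 * i + 1] = word >> 8

print([hex(v) for v in walk_table(memory, TABLE - 1)])
```

Under that assumption the loop visits Y = 1..8 and exits when Y reaches 9, which gives the four passes noted in the listing's comment.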

Keep up the good work, sir!

Mike

MichaelM · Post by **MichaelM** » Tue Aug 12, 2014 1:27 am

MichaelM wrote:

Some RTL will need to be modified in the near future if I want to add 16-bit operations (a la '816) as has been suggested by others on this thread. In order to perform 16-bit operations, 8-bit operations must be carried out least significant byte first. As I've patched it at the moment by reading the high byte of the indirect address pointer first, it won't support sequential 8-bit data reads starting with the low byte for 16-bit operations.

I was able to successfully implement the (sp,S),Y addressing mode with a simple microprogramming change that allowed me to take advantage of the microprogram control fields of the address generator. The change allowed me to reverse the order in which the address bytes were fetched from the stack frame. However, as I noted above, an RTL change would be required to extend (using a prefix instruction) stack relative addressing to support 16-bit operations.

On the way home from work today, I determined that I could add another qualifier to the MAR CE (clock enable) in the address generator module. The MAR captures the output of the address generator so that it can be used to generate the next sequential address. To support the PC-relative addressing mode, I had to qualify the MAR CE signal with ~Sel_SP. This means that the MAR is not loaded with the stack address, and will retain the address calculated for PC-relative instructions, in particular the PHR rel16 instruction.

The MAR functions as a temporary register to hold the sum of the PC and the rel16 operand. The M65C02A core does not have a way to pass the PC through the ALU, so the independent 16-bit address generator adder must be used to make the calculation required. One of the RTL changes that I made this past weekend to support PHR/PER, after the discussions of last Thursday, was to add a multiplexer that would allow the PC, MAR, and the {OP2, OP1} register pair to be output on the data bus of the core.

The ~Sel_SP qualifier had to be added to the MAR CE so that when the MAR was being pushed onto the stack, the address of the first stack location did not overwrite the MAR in the process. In the (sp,S),Y addressing mode, the stack index is in the OP1 register. Since Sel_SP is asserted because the address is in the processor stack, the ~Sel_SP prevents the MAR from being loaded with the stack address; OP1 is being replaced by the read of the first byte of the destination pointer on the stack. Since MAR did not capture the address bus, incrementing MAR to address the high byte does not work.

I worked around this problem by changing the microprogram as I described above. However, with the addition of another qualifier to the MAR CE control signal it is possible to preserve the desired PC-relative addressing modes and get the MAR to sequentially address the stack. If the MAR CE qualifier is set to ~(Sel_SP & ~Stk_Off), then the (sp,S),Y addressing mode can use the MAR to access the second byte of the address. The signal Stk_Off selects the 16-bit quantity {0, OP1} to add to the 16-bit stack address {1, S}. Since the (sp,S),Y addressing mode asserts Stk_Off, the MAR will be loaded with the stack address on the rising edge of the clock following the current memory cycle.

Thus, this simple RTL change enables the normal use of the MAR for the (sp,S),Y addressing mode. It also sets up the microprogram to support 16-bit operations from the stack as a future enhancement of the M65C02A core.
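
The two qualifiers discussed in this post can be compared with a small truth table. The sketch below is Python rather than RTL, and it models only the qualifier term; the real MAR CE presumably has other terms that the post does not list. The function names are hypothetical.

```python
# Truth-table model of the MAR clock-enable qualifier (illustrative only).

def mar_ce_original(sel_sp):
    """Qualifier first added for PHR rel16: ~Sel_SP."""
    return not sel_sp


def mar_ce_revised(sel_sp, stk_off):
    """Proposed qualifier: ~(Sel_SP & ~Stk_Off)."""
    return not (sel_sp and not stk_off)


print("Sel_SP Stk_Off original revised")
for sel_sp in (False, True):
    for stk_off in (False, True):
        print(int(sel_sp), int(stk_off),
              int(mar_ce_original(sel_sp)),
              int(mar_ce_revised(sel_sp, stk_off)))
```

The only row that changes is Sel_SP = 1 with Stk_Off = 1: the revised qualifier lets the MAR capture the stack address there, while a plain push (Sel_SP = 1, Stk_Off = 0) still leaves the MAR untouched, as PHR rel16 needs.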

barrym95838 wrote:

The only remaining inefficiency to which I can quickly point is the lack of 16-bit-wide memory increments and decrements, like INW zp, DEW zp and friends.

It appears that there is a need, in order to simplify the FORTH VM interpreter, to include some 16-bit operations. From your comments, 16-bit versions of the increment/decrement direct page instructions would be good to add to the list. I am thinking about how to implement some of your and Jeff's recommendations, but it certainly appears necessary to provide a means for at least easily updating FORTH VM 16-bit registers/pointers held in page zero, if not elsewhere in general memory.

BigEd · Post by **BigEd** » Tue Aug 12, 2014 9:04 am

Great discussion and progress! On the topic of a writeable microcode, I recall that the in-house logic simulator at my first job was said to use custom microcode in the VAX system, to perform fault simulations in parallel - a kind of SIMD approach, probably with 4-state logic values. The timing of this only just works out - it would have been '85 or so.

According to http://en.wikipedia.org/wiki/Control_st ... ble_stores you'd be in good company - IBM's 370 and DG's Eagle machine had it, among others.

I think a writeable control store is much more interesting than an FPGA reload.

Cheers
Ed

Dr Jefyll · Post by **Dr Jefyll** » Tue Aug 12, 2014 1:44 pm

barrym95838 wrote:

... the beloved 65xx tradition of keeping active pointers in memory instead of in registers ...

There's some irony in this phrase, given that the "tradition" arose as a result of the miserly transistor budget of the original MOS 6502. They were too cheap... uh, cost-conscious... to put any proper memory-addressing registers on-chip! That's why those 16-bit registers ended up residing in memory instead. OTOH, they gave us 128 of 'em -- indexable, no less!!

MichaelM wrote:

From your comments, 16-bit versions of the increment/decrement direct page instructions would be good to add to the list.

Yes -- and not only for support of Forth. 16-bit Inc/Dec's have general applicability. It's a conspicuous omission that the 6502 can't do 16-bit adjustments on the "registers" in zero-page. It made sense at the time, but we've outgrown that now I hope.

BigEd wrote:

I think a writeable control store is much more interesting than an FPGA reload.

I'm definitely in favor of writeable control store!! But what's the best implementation? You could either add data paths to the core, or do an FPGA reboot -- both are viable ways to reload the control store. At present I lean slightly toward the FPGA reboot. I'll list the cons before the pros:

Cons:

- an application can't load new microcode except via a reboot. IMO this limitation is trivial.
- there'd have to be write access provided to the configuration ROM -- not a major challenge.
- given that the configuration ROM includes lots of data not pertaining to microcode, you'd need to know where the microcode resides and what the format is -- I mean so you could specifically alter only the contents of the control store. But sussing the format is something that'd only have to be done once. [Edit: or is it?]

Pros:

- the FPGA-reload approach doesn't consume resources or complicate the HDL code, and it places no potential constraint on clock speed. IOW it allows a no-compromise core.

Re that last point, I admit I don't have a good sense of how serious the compromise is. Any comment, Michael?

-- Jeff

MichaelM · Post by **MichaelM** » Wed Aug 13, 2014 1:01 am

Last night, after altering the RTL of the address generator module as described, I also tested the JMP (sp,S),Y instruction, and modified the vectors of the interrupt/trap handler module to place the COP instruction vector at the vector address shown in the '816 datasheet. It's my goal for tonight to test the COP zp and COP #imm instructions.

I will add the INW zp and DEW zp suggested by Mike and seconded by Jeff. This leaves 10 free opcodes. At least one or two have to be reserved for JMP (dp++) and/or JSR (dp++).

I am working on a means by which the operand size of some instructions can be changed using one of the two escape/prefix opcodes that I've reserved. Implementing this is likely going to take a bit of effort. Furthermore, I want the M65C02A to be able to execute standard 6502/65C02 code. Therefore, I think that I don't want to implement this capability in the M65C02A with a sticky operation/operand size bit in the PSW like the '816. These temporary flags will have to be connected to the test pins of the microsequencer, and the microprogram modified to test for one or the other flag as appropriate. I've got approximately 110 36-bit variable microprogram words remaining with which to implement the remaining 12 instructions and to modify the behavior of the other instructions. Thus, in the short term, it is probably better to allocate two opcodes for the 16-bit increment/decrement operation. If the planned implementation of dynamic operation/operand size control is successful, then these two opcodes can be reclaimed since a SIZ prefix applied to the existing INC zp/DEC zp instructions would perform the same function with only a minor 1 clock cycle penalty.
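
For readers following along, the effect of a 16-bit increment on a zero-page pointer can be sketched in a few lines. The Python below is a behavioural model only: it ignores flags and cycle counts, assumes the pointer does not straddle the end of zero page, and uses a hypothetical address `IP` for Forth's instruction pointer. It compares a single INW-style update with the usual 65C02 `inc` / `bne` / `inc` sequence.

```python
# Behavioural model of a 16-bit zero-page increment (flags not modelled).

def inw(zp, addr):
    """Proposed INW zp: increment the little-endian 16-bit value at
    zp[addr], zp[addr + 1], wrapping from 0xFFFF to 0x0000."""
    value = ((zp[addr] | (zp[addr + 1] << 8)) + 1) & 0xFFFF
    zp[addr] = value & 0xFF
    zp[addr + 1] = value >> 8
    return value


def inc16_65c02(zp, addr):
    """The same update done the usual 65C02 way:
           inc addr
           bne skip
           inc addr+1
       skip:
    """
    zp[addr] = (zp[addr] + 1) & 0xFF
    if zp[addr] == 0:                       # low byte wrapped, carry into high byte
        zp[addr + 1] = (zp[addr + 1] + 1) & 0xFF
    return zp[addr] | (zp[addr + 1] << 8)


def make_zp(addr, value):
    """Build a 256-byte zero page holding `value` at `addr` (little-endian)."""
    zp = bytearray(256)
    zp[addr] = value & 0xFF
    zp[addr + 1] = value >> 8
    return zp


IP = 0x10                                   # hypothetical zero-page address
for start in (0x0000, 0x00FF, 0x12FF, 0xFFFF):
    a = inw(make_zp(IP, start), IP)
    b = inc16_65c02(make_zp(IP, start), IP)
    print(hex(start), hex(a), hex(b))
```

Both routines produce the same memory contents; the difference the thread cares about is instruction count, since the three-instruction sequence collapses into one opcode (or one prefixed opcode if the SIZ approach works out).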

On a similar note, I have noticed that the index into the stack frame for stack relative and post-indexed stack relative indirect addressing modes is being specified as a 1 in order to address the top-of-stack value. In the implementation of these two addressing modes, I have selected an index value of 0 to refer to the top of stack value. The M65C02A address generator adds a 1 to the stack relative address calculation to compensate for the stack pointer's value. To me this appears to be more natural. Since the M65C02A is not intended to emulate the '816, I thought this difference would encourage potential users of the M65C02A core to not regard the M65C02A as an '816 lite and expect the behavior of the '816 in emulation mode.
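
The difference between the two indexing conventions is easy to see side by side. The sketch below is Python, with hypothetical function names, and it does not model any page wrapping; it simply applies the address arithmetic described in this thread, with the stack in page 1 and S pointing at the next free location.

```python
# Effective-address arithmetic for stack-relative addressing (no wrap modelled).

STACK_PAGE = 0x0100


def m65c02a_stack_relative(s, offset):
    """M65C02A as described: EA = {1, S} + {0, offset} + 1,
    so offset 0 addresses the top-of-stack byte."""
    return STACK_PAGE + s + offset + 1


def w65c816_stack_relative(s16, offset):
    """'816 convention: EA = S + offset, so offset 1 addresses
    the top-of-stack byte."""
    return (s16 + offset) & 0xFFFF


s = 0xFD                                    # example 8-bit stack pointer value
print(hex(m65c02a_stack_relative(s, 0)))            # top of stack, M65C02A style
print(hex(w65c816_stack_relative(STACK_PAGE + s, 1)))  # top of stack, '816 style
```

Both calls resolve to the same byte; only the offset written in the instruction differs, which is why code written for one convention has to be adjusted by one for the other.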

MichaelM · Post by **MichaelM** » Wed Aug 13, 2014 2:29 am

Jeff:

I am not sure I can resolve all of the points you raise in your post, but I will try. (Any list formatting does seem to be a goner when placed in quote block.)

Dr Jefyll wrote:

I'm definitely in favor of writeable control store!! But what's the best implementation? You could either add data paths to the core, or do an FPGA reboot -- both are viable ways to reload the control store. At present I lean slightly toward the FPGA reboot.

From the perspective of maintaining performance, an FPGA reboot is the most viable approach. While constructing the M65C02A core as a microcomputer, I attempted to partition the internal memory into 4kB blocks that my rudimentary MMU could relocate as desired by the user. The demonstration project uses 4kB of the available block RAM as microprogram memory. The remaining 28 kB is implemented as internal block RAM in three blocks: (1) 16kB of User RAM, (2) 8kB of User RAM/ROM, and (3) 4kB of Monitor RAM/ROM. As it sits today, the M65C02A soft-core microcomputer project reliably and easily synthesizes, MAPs and PARs into a Spartan3A XC3S200A-4VQG100I FPGA at a clock period of just under 33 ns. (My target speed for the part is 29.4912 MHz, a baud rate frequency.) Splitting these block RAMs into 7 independent blocks dramatically reduces the maximum reported clock speed. It is my suspicion that this result is driven by the significant increase in the number of bus connections that the fine-grained BRAMs require from the FPGA. The three-block configuration with which I am meeting my goals must be fitting into a sweet spot for this particular FPGA family. Thus, although all BRAMs are dual-ported, and the M65C02A microprogram memories are no exception, setting up the microprogram BRAMs to be accessible as a Writable Control Store (WCS) may be moving the solution back into that domain where the project won't meet my performance goals.
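
The figures in the paragraph above can be cross-checked with a little arithmetic. The Python below is only a worked example: the 16x oversampling factor and the 115200 baud rate are assumptions chosen to show why 29.4912 MHz counts as a "baud rate frequency"; they are not taken from the project.

```python
# Worked arithmetic for the clock and memory figures (illustrative only).

def period_ns(freq_mhz):
    """Clock period in ns for a frequency in MHz."""
    return 1000.0 / freq_mhz


def freq_mhz(period_ns_value):
    """Clock frequency in MHz for a period in ns."""
    return 1000.0 / period_ns_value


def uart_divisor(clock_hz, baud, oversample=16):
    """(divisor, remainder) for an assumed 16x-oversampling UART."""
    return divmod(clock_hz, oversample * baud)


print(round(freq_mhz(33.0), 2))             # clock rate reached at a 33 ns period
print(round(period_ns(29.4912), 2))         # period required by the target rate
print(uart_divisor(29_491_200, 115_200))    # remainder 0 means an exact baud rate
print(4 + 16 + 8 + 4)                       # kB: microprogram + the three blocks
```

A 33 ns period corresponds to a little over 30 MHz, so the three-block build clears the 29.4912 MHz target, and that target divides evenly down to the assumed baud rate.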

One significant issue with using a reboot of the FPGA to load new microcode is that while the FPGA is being rebooted, special care must be taken in the implementation of all of the external logic. The outputs of the FPGA will float, and for a brief time, between the completion of the configuration image load and the transfer of control to the new user application, may exhibit unreliable logic levels. This issue can be mitigated by ensuring that external circuits are held in reset during configuration by monitoring the FPGA's DONE pin and by ensuring that the configuration control logic releases DONE only after all configuration activities have been fully completed.

Another issue with the FPGA reboot approach is that the external configuration memory must be programmed with an image first. The 32 kb required to program the microprogram memories is a far cry from the nearly 1.2 Mb required to program the entire FPGA. A partial reconfiguration capability of portions of FPGAs is now available with newer generation FPGA families. I have not used this capability in the Virtex 5 family with which I am currently working, so I can't make any statements regarding its applicability for the purpose of WCS updates. Further, Virtex 5 FPGAs are much more expensive for general applications than the Spartan 3A FPGAs, which is the primary reason that I am focused on the XC3S200A-4VQG100I FPGA; it can be purchased for as low as $7.00 in reasonable volumes from Avnet or similar distributors.

This puts me in a quandary: I would like to be able to implement a classic WCS, but I would like to keep the performance for the XC3S200A-4VQG100I FPGA at or above 30 MHz.

Dr Jefyll wrote:

I'll list the cons before the pros:

Cons:

[*] an application can't load new microcode except via a reboot. IMO this limitation is trivial.
[*] there'd have to be write access provided to the configuration ROM -- not a major challenge.
[*] given that the configuration ROM includes lots of data not pertaining to microcode, you'd need to know where the microcode resides and what the format is -- I mean so you could specifically alter only the contents of the control store. But sussing the format is something that'd only have to be done once. [Edit: or is it?]

Pros:

[*] the FPGA-reload approach doesn't consume resources or complicate the HDL code, and it places no potential constraint on clock speed. IOW it allows a no-compromise core.

Xilinx provides some tools that ElEctricEyE, enso, and I have used to reload the contents of the BRAMs without rebuilding the FPGA project. I don't think that Xilinx has released the particular details of how the initialization of the BRAMs is mapped into the bit stream so that a user can write their own utility for this purpose.

Dr Jefyll wrote:

Re that last point, I admit I don't have a good sense of how serious the compromise is. Any comment, Michael?

I don't think the compromise is particularly serious. I am leaning to implementing the user WCS approach on a "let's see what the performance issues are" basis. The WCS concepts used in the PDP 11/60, the IBM 360/370, etc. just appeal to me. Being able to dynamically load a new instruction sequence in the microprogram store and access it from a user application holds a certain appeal for me.

I've had this concept investigated on one of the projects that I lead, and it was successful. I keep this capability out of the discussion regarding options for those projects because there's too much risk of the project becoming classified as "SW". There's been a move to classify HDL-based FPGA projects as "SW", but I resist that each and every time.

On a historical note, I recently went on a buying spree and bought several PDP 11 backplanes, processor cards, and memory cards with the objective to build up a PDP 11/83 processor. In the process, I purchased a PDP 11/03 processor card and while researching that card, I discovered that you could develop user microcode for that family of LSI 11 processors. The WCS was on a card that was mapped into the I/O space on the LSI 11 Q-bus. The internal microcode bus was slow enough that a 40-pin ribbon cable with a DIP-40 header from the WCS card in an adjacent card slot could be plugged into one of the microcode ROM sockets of the processor chip set. (The LSI 11/03 operated at about 2-3 MHz and consisted of 3-4 40-pin chips.)

Edit: Corrected some misspelled words.

MichaelM · Post by **MichaelM** » Sat Aug 16, 2014 11:31 pm

Well, somewhat out of order from my plan, I've validated that the M65C02A is implementing the Rockwell bit-oriented instructions: RMBx/SMBx zp and BBRx/BBSx zp,rel. I had to make some minor tweaks to the microprogram for the BBRx/BBSx instructions to eliminate the RMW tag and to capture the condition code.

Remaining to be tested are the COP zp, COP #imm, the two 16-bit IO move instructions (MWT zp,(Y) and MWF zp,(Y)), and the 16 stack relative instructions: ORA/AND/EOR/ADC/LDA/STA/CMP/SBC sp,S and ORA/AND/EOR/ADC/LDA/STA/CMP/SBC (sp,S),Y.

Dr Jefyll · Post by **Dr Jefyll** » Sun Aug 17, 2014 12:13 am

Nice to hear about all the progress, Michael. As for Writable Control Store, ...

Quote:

I don't think that Xilinx has released the particular details of how the initialization of the BRAMs is mapped into the bit stream so that a user can write their own utility for this purpose.

I've done enough reverse engineering that, to me, it seems reasonable to adopt this approach whether Xilinx has released the doc or not. But I'm not volunteering. And admittedly the effort might not be worthwhile, balanced against whatever the next-best alternative seems to be.

Quote:

The outputs of the FPGA will float, and for a brief time, between the completion of the configuration image load and the transfer of control to the new user application, may exhibit unreliable logic levels.

Doesn't the chip always configure upon powerup, and don't the levels also float then? IOW, isn't this an issue you face in any case?

Quote:

The 32 kb required to program the microprogram memories is a far cry from the nearly 1.2 Mb required to program the entire FPGA.

Memory is cheap. I don't see an issue with wasting almost 1.2MB if that'll achieve the goal. You mentioned a partial reconfiguration capability, but I don't see any great advantage. Maybe I'm missing something. I will say it seems a little backward to put multiplexers on the configuration memory so it can be written in-circuit.

Quote:

On a historical note, I recently went on a buying spree and bought several PDP 11 backplanes, processor cards, and memory cards

Sw-eet!!

But will you have time to play with all your toys?

cheers
Jeff

MichaelM · Post by **MichaelM** » Sun Aug 17, 2014 1:38 am

Dr Jefyll wrote:

MichaelM wrote:

I don't think that Xilinx has released the particular details of how the initialization of the BRAMs is mapped into the bit stream so that a user can write their own utility for this purpose.

I've done enough reverse engineering that, to me, it seems reasonable to adopt this approach whether Xilinx has released the doc or not. But I'm not volunteering. And admittedly the effort might not be worthwhile, balanced against whatever the next-best alternative seems to be.

Xilinx makes some nice tools for doing just this, i.e. patching the contents of their BRAMs, and the thread I linked to relates how EEyE and I used that tool to quickly change the memory contents before reprogramming the FPGA using either direct programming via JTAG or the configuration serial EEPROM. The structure of the FPGA, the location in the bitstream, and the organization of the BRAM make reverse engineering this proprietary data structure a bit daunting, particularly since they provide the tools for free.

Dr Jefyll wrote:

MichaelM wrote:

The outputs of the FPGA will float, and for a brief time, between the completion of the configuration image load and the transfer of control to the new user application, may exhibit unreliable logic levels.

Doesn't the chip always configure upon powerup, and don't the levels also float then? IOW, isn't this an issue you face in any case?

You are absolutely correct that the Xilinx, Altera, and Microsemi RAM-based FPGAs all exhibit this behavior. I am simply pointing out to other readers that dealing with these reconfigurable parts requires some attention to the bitstream options and the external circuits. Therefore, if the interfaces for circuits external to the FPGA are not designed appropriately, then there may be some erratic behavior from both the FPGA and the external circuits. At least with Xilinx FPGAs, there are a number of bitstream options that can be used to reduce/eliminate open/floating pins, erratic output signals and incorrect state machine initialization. I always enable the Pull-ups during configuration option, and I always drive the DONE pin high after configuration is fully complete. The normally selected options for the bit stream generator do not select the options that I recommend. In particular, DONE is released high before the outputs are enabled and internal reset is released. These default configuration options can cause the state machines within the FPGA to misbehave if floating pins from the FPGA to external circuits cause those circuits to oscillate or otherwise behave erratically. The potential for metastability issues on inputs is somewhat elevated during this narrow time window as the FPGA switches from its internal configuration clock to a user clock.

Dr Jefyll wrote:

MichaelM wrote:

The 32 kb required to program the microprogram memories is a far cry from the nearly 1.2 Mb required to program the entire FPGA.

Memory is cheap. I don't see an issue with wasting almost 1.2MB if that'll achieve the goal. You mentioned a partial reconfiguration capability, but I don't see any great advantage. Maybe I'm missing something. I will say it seems a little backward to put multiplexers on the configuration memory so it can be written in-circuit.

The point I was trying to make was that the microprogram memory requirements are only a fraction of those of the full configuration image. Unless there is a significant change in the RTL of the FPGA design, it may be better to simply reprogram the microprogram memory with a 32 kb image. Both the standard configuration image and any "user-defined" microprogram images would be stored in the same serial configuration memory. After configuration, at least with most Xilinx FPGAs, the pins used for reading the configuration image from an external serial (or parallel) EPROM revert to user-defined I/O pins. In the case of the M16C5x, M65C02, and M65C02A soft-microprocessor demonstration projects, the SPI Master peripheral is connected to the configuration memory. This allows the configuration memory to be reprogrammed in circuit using whatever file transfer program the user wants to use.

For me a WCS provides several options: (1) patching/fixing existing instructions without modifying the configuration image; (2) adding instructions to the existing instruction set of a microprogrammed processor core; and (3) replacing the microprogram in an existing microprogrammed processor core in order to implement a completely different machine. In the first case, it may be more reliable, but maybe less fun, to simply replace the configuration image with a new image which simply contains the patched microprogram ROMs. It will certainly be much faster to boot with a patched image than it would be to boot and then, using a subroutine, load the microprogram patches into a WCS. The same reasoning applies to the second option, but with a WCS it may be possible to dynamically load application-specific microprogram sequences. Using the WCS in this way, a standard instruction set could potentially be changed to implement digital filter instructions, graphics primitives, etc. which operate at the speed of the microprogram controller of the soft-core.

The third option provides a way to change a 6502 into a 65C02 or even a completely different microprocessor like a 6809 or 68HC11. The ALU of the M65C02A is fairly flexible and capable, but it only has six addressable registers: A, X, Y, P (6 bits), OP1 and OP2. OP1 and OP2 are working registers for holding the operands of an instruction. Therefore, as it sits today the internal microarchitecture of the M65C02A is unlikely to be useful as an implementation of another microprocessor, but it could be used to implement an application-specific or special purpose processor by replacing the contents of the microprogram ROMs.

Dr Jefyll wrote:

MichaelM wrote:

On a historical note, I recently went on a buying spree and bought several PDP 11 backplanes, processor cards, and memory cards

Sw-eet!!

But will you have time to play with all your toys?

Probably not as much as I'd like.

Something about putting food on the table and paying the mortgage. One reason I'm trying to wrap up the M65C02A on my Chameleon Arduino-compatible FPGA Board as soon as possible is to use it to implement some peripherals for those PDP-11 processor cards. There is apparently a way to boot the PDP-11 processor cards using a serial interface. I am looking forward to completing the M65C02A and using the two serial ports and the remainder of the large Serial Configuration EPROM to implement a TU58 tape unit which interfaces to the PDP-11 using a serial port.

MichaelM · Post by **MichaelM** » Mon Aug 18, 2014 1:01 am

Completed testing of the M65C02A core's enhanced instruction set. The following two diagrams provide a map of the instructions as currently implemented.

[Figure: M65C02A Enhanced Instruction Set Opcode Map: 0x00-0x7F]

[Figure: M65C02A Enhanced Instruction Set Opcode Map: 0x80-0xFF]

Now the issue is how to add the features previously discussed that can be used to support a FORTH VM. Before continuing, I think that I'm going to wrap this thread up and start another that will be focused exclusively on what features and primitive instructions are needed to implement a fast FORTH interpreter with the M65C02A soft-core processor.

Several contributors to this thread have recently tapped on a thread in the FORTH sub-forum, 6519 Forth Processor. I am going to have to take some time to study the instructions in that microcomputer that support DTC FORTH.

Edit: added missing word, time, in the last sentence.
Edit: added link to GitHub repository which has been updated with all the files for this version of the core.

tweakoz · Post by **tweakoz** » Sun Sep 21, 2014 6:31 am

@MichaelM -

Excellent work, I definitely like the M65C02A - I will be using it in a side project of mine.
http://github.com/tweakoz/zed64
Basically a modernized retrocomputer (and purely a creative endeavor).
I will use it for demo coding and possibly even some music production.

I originally started with Arlet's 6502. I recently switched to the MAM65C02, since I value easy modification over cycle accuracy and modeling the original hardware. I have not succeeded yet in getting MAM65C02 to boot (I'm close). Now I will switch again (to the '02A)!

Call me a glutton for punishment, but the challenge of concurrent assembly programming excites me. As such, I was definitely thinking of using multiple '02 cores in an SMP config (possibly with some additional core-local memory for performance purposes). I was prepared to do the modifications myself, though I would definitely be interested in the dual core mod you mentioned earlier in this thread. I noticed Xilinx block RAMs have separable data in and data out ports. With the READ_FIRST option I am guessing this should allow atomic memory operations (at the very least an atomic exchange operation which could be used for a mutex or atomic FSM). I am guessing I would have to split the bidirectional data port on the '02A to accommodate this (like Arlet's does).

At any rate, nice work.

One question: with the '02A cores, should I be using a specific assembler to get at all the opcodes easily? Am I expected to just make a macro to embed an opcode as raw bytes, or is there a toolchain you have modified with the given custom opcodes? I am kind of partial to 64tass now, though I will switch if necessary.

Thanks,
mtm

GARTHWILSON · Post by **GARTHWILSON** » Sun Sep 21, 2014 6:44 am

tweakoz, welcome!

Cross-32, or C32 for short, sold for $99 by Data Sync Engineering, is a great macro assembler that has the processor-specific stuff in separate files, and comes with those tables for dozens of processors, and they tell you how to make your own tables for new processors which could even include one of your own design. I have sensed some resistance here to get an assembler that's not free; but it may be the last one you ever need to get. It's the one I use for 65c02 and '816. (I use the DOS version, since I won't use Windows anymore; but I expect they mostly sell the Windows version. This computer I use for most things runs on Linux.)

MichaelM · Post by **MichaelM** » Sun Sep 21, 2014 1:09 pm

Welcome.

I use the Kingswood AS65 assembler and macros for the Rockwell instructions. I have not developed any macros for the new instructions; I've used direct machine code edits to test all of the new instructions.

PontusO released the source for his assembler. I'm looking into modifying that source to directly support the M65C02A's opcodes.

I will look into Garth's suggestion as well.

MichaelM · Post by **MichaelM** » Mon Jul 13, 2015 2:44 am

I have returned to working on my core. Work has definitely interfered with this hobby over the last 6 months. I have worked on a document describing the core, but I've not had much time to work on the core's RTL implementation.

I have added or decided how all of the instructions will be included. I have begun regression testing the core using Klaus' test program suite. The core, as it is today, still passes that test suite.

I have added all of the instructions except the COP instruction. It took a while to decide how to efficiently add a block move (MOV) instruction. In two modes of the instruction, it operates like the MVN/MVP of the 65C816. The MOV instruction has a single byte operand which determines the source address mode and the destination address mode. The modes are independently specified, and can be specified as hold, decrement, and increment.

There are three modes supported for the source and destination pointers, which means that the M65C02A MOV instruction supports 9 different combinations of the source and destination mode settings. When the source mode is set for increment and the destination mode is also increment, the M65C02A MOV instruction behaves like the MVN instruction. Similarly, with the modes both set for decrement, the instruction behaves like the MVP instruction. When the hold mode is used, the MOV instruction can easily be used to access FIFOs like those used in 16C550-compatible UARTs.

The instruction assumes that the transfer length count is loaded in A, the source pointer is stored in X, and the destination pointer is stored in Y. The instruction can be used as an uninterruptable block move, or it can be combined with a conditional branch instruction to allow the block move to be interrupted. The IND prefix instruction is used to select the interruptable MOV instruction. To facilitate its use with a conditional branch instruction, the decrementing of the transfer length counter (A) sets the N and Z flags (like the DEC A instruction).
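
The register usage and the three pointer modes can be illustrated with a small behavioural model. This Python sketch is not the core's microprogram: the `Mode` names and values are hypothetical (the actual operand-byte encoding is not given in this thread), A, X and Y are treated as 16-bit quantities, flags are not modelled, and the order of the copy relative to the pointer updates is an assumption.

```python
# Behavioural model of the uninterruptable MOV form (illustrative only).
from enum import Enum


class Mode(Enum):
    """Pointer update applied after each byte; values are the step size,
    not the real operand-byte encoding."""
    HOLD = 0
    INC = 1
    DEC = -1


def mov(memory, a, x, y, src_mode, dst_mode):
    """Copy bytes from [X] to [Y] until the count in A reaches zero.
    Returns the final (A, X, Y). A starting count of 0 wraps to 65536."""
    while True:
        memory[y] = memory[x]
        x = (x + src_mode.value) & 0xFFFF
        y = (y + dst_mode.value) & 0xFFFF
        a = (a - 1) & 0xFFFF
        if a == 0:                          # Z set by the decrement ends the move
            break
    return a, x, y


memory = bytearray(65536)
memory[0x1000:0x1004] = b"ABCD"

# Both pointers incrementing: behaves like the MVN-style move described above.
print(mov(memory, 4, 0x1000, 0x2000, Mode.INC, Mode.INC))
print(bytes(memory[0x2000:0x2004]))

# Source held, destination incrementing: repeated reads of one address,
# as when draining a FIFO data register.
print(mov(memory, 3, 0x1000, 0x3000, Mode.HOLD, Mode.INC))
print(bytes(memory[0x3000:0x3003]))
```

With a plain `bytearray` the held source simply returns the same byte each time; on real hardware a FIFO register at that address would supply a new byte on every read.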

The instruction terminates the block transfer when the Z flag is set following the decrementing of A, or when the IND prefix instruction flag is set. An initial value of 0 for the transfer length count will move the entire 64kB address space of the M65C02A. In the uninterruptable mode, the initial transfer requires 4 memory cycles, but each subsequent transfer only requires 2 memory cycles. In the interruptable mode, each transfer cycle requires 5 memory cycles plus 2 memory cycles for the conditional branch for a total of 7 memory cycles per transfer. (I think this is similar to the '816 transfer cycle count for the MVN/MVP instructions.) Setting up the registers for the transfer requires 10 or 12 memory cycles; 12 cycles are required if the transfer length is greater than 255 bytes.
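
The cycle counts quoted above can be turned into a quick estimate. The Python below is a sketch that just applies those numbers as stated (4 cycles for the first uninterruptable transfer and 2 for each one after, 7 per transfer in the interruptable form, 10 or 12 cycles of setup); the function name is hypothetical and nothing here has been measured on the core.

```python
# Memory-cycle estimate for the M65C02A MOV instruction, using the figures
# quoted in the post (not measured).

SETUP_SHORT = 10    # register setup when the length fits in 8 bits
SETUP_LONG = 12     # register setup when the length is greater than 255 bytes


def mov_cycles(length, interruptible=False, include_setup=True):
    """Estimated memory cycles to move `length` bytes (1..65536)."""
    if not 1 <= length <= 65536:
        raise ValueError("length must be in 1..65536")
    if interruptible:
        cycles = 7 * length                 # 5 for MOV + 2 for the branch, per byte
    else:
        cycles = 4 + 2 * (length - 1)       # 4 for the first byte, 2 for the rest
    if include_setup:
        cycles += SETUP_LONG if length > 255 else SETUP_SHORT
    return cycles


for n in (16, 256, 4096):
    print(n, mov_cycles(n), mov_cycles(n, interruptible=True))
```

For long transfers the uninterruptable form approaches 2 cycles per byte, while the interruptable form stays at 7, which is the trade-off between throughput and interrupt latency that the post describes.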

The following figure shows both modes of the M65C02A MOV instruction:

[Figure: M65C02A MOV Instruction Timing]

Dr Jefyll · Post by **Dr Jefyll** » Mon Jul 13, 2015 3:37 am

Nice work, Michael -- glad to hear of further progress on this project! And the ability to blast a bunch of bytes from/to a 16C550 FIFO or other burst source/destination will make for a tidy boost in throughput.

In one respect I found your choice of terminology odd. What you refer to as an interruptible block move seems to me to be better described as simply a byte move. There's nothing "block" about it, unless I'm missing something. (I realize it updates the source, destination, count and sets the flags.)

Quote:

The IND prefix instruction is used to select the interruptable MOV instruction.

Is it possible this option could instead be controlled by the operand byte you mentioned? It seems the operand byte would have extra bits available, since the src & dest mode require only two bits each. Perhaps you've already thought of this idea but there's a gotcha!

cheers,
Jeff
