 Post subject: M65C02A Core (Release)
PostPosted: Thu Nov 26, 2015 11:31 pm 
Joined: Mon Apr 23, 2012 12:28 am
Posts: 760
Location: Huntsville, AL
With the exception of virtual memory support, I have included in the M65C02A core all of the features that I wanted when this project started. Therefore, I believe that this project has reached its end.

My number one goal was the creation of a synthesizable core compatible with the 6502/65C02 instruction set, i.e. W65C02S-compatible, capable of operating from internal block RAM in a single cycle. My previous core, M65C02, met the W65C02S compatibility goal, but required multiple memory cycles to operate from internal block RAM. The M65C02A core can operate from internal block RAM in a single cycle.

My second goal was to reduce the size of the core's logic so that I could add support for additional addressing modes and 16-bit operations, and make the architecture more compatible with high level languages such as C and Pascal. A sub-goal was to maintain seamless compatibility with the W65C02S in order to execute 6502/65C02 code at any time. This goal has been met by a redesign of the ALU and address generation logic of the M65C02 core. Additional controls and multiplexers were provided in both of these components so that several new addressing modes could be added in a manner that is backward compatible with the 6502/65C02. The redesign allowed the M65C02A ALU to support single cycle 16-bit logic, shift/rotate, and arithmetic operations in virtually the same footprint as the 8-bit M65C02 ALU, with the limitation that BCD addition/subtraction operations are limited to 8 bits.

My third goal was to provide instructions to directly support a DTC/ITC FORTH VM. The M65C02A provides a number of instructions that allow the implementation of both DTC and ITC versions of the FORTH VM. The instructions are supported by the addition of a module in the M65C02A core that provides the Interpretive Pointer (IP) and the Working (W) register. It also supports the 16-bit operations on these registers which are needed to use them effectively. A new IP-relative addressing mode was added and supported by three instructions: LDA, ADC, and STA. These instructions allow speedier access to literals, constants, and variables. In addition, the IP-relative ADC instruction can be used to implement relative branches efficiently.

My fourth goal was to restrict the microprogram memories to the same two block RAMs used in previous implementations. The organization of the two microprogram ROMs has been changed significantly and expanded, but the microprogram for the M65C02A core only requires the two block RAMs that all my previous cores have used. To achieve this goal, the microprogram sequencer/controller had to be extensively modified so that more efficient use could be made of the limited microprogram memory. In addition, several single purpose control fields in previous implementations now provide multi-purpose control of core elements that are not used simultaneously.

Finally, my fifth goal was to fit a microcomputer using the new core and sporting a simple vectored interrupt handler, a simple memory management unit (MMU), two buffered UARTs, and a buffered SPI Master into the same Xilinx Spartan-3A FPGA used in my previous projects: XC3S200A-4VQ100I. I have several boards that have Spartan-3A/3AN FPGAs mounted, and targeting these parts sets the lower performance boundary for the core. Using the core with other FPGAs is simple and straightforward. I've avoided using any features that are found exclusively in one FPGA family, so moving to another FPGA only requires selecting the new part and defining the IO pins.

The following two diagrams provide an instruction set map of the M65C02A core. I've ordered the tables with the least significant nibble across the top and the most significant nibble on the left, which results in the common layout. The M65C02A utilizes six prefix instructions to extend the capabilities of the 6502/65C02 architecture. In these figures, instructions using a black font are those found in the 6502, those shown in a red font are those introduced with the 65C02, those shown in a blue font are those introduced by the Rockwell 65C02 and the WDC 65C02S, and those shown in a dark red font are specific to the M65C02A. It is difficult to show the effects of all the prefix instructions on these two diagrams, so only the effects of the IND, SIZ, and ISZ prefix instructions are shown. Instructions unaffected by these prefix instructions are shown with a clear background, instructions that are only affected by SIZ are shown with a green background, instructions that are only affected by IND are shown with a purple/blue background, and instructions that are affected by all three are shown with an orange background.
Attachment: M65C02A Instruction Set Matrix (x0-x7) [M65C02A-InstructionSetMatrix(x0-x7).JPG]

Attachment: M65C02A Instruction Set Matrix (x8-xF) [M65C02A-InstructionSetMatrix(x8-xF).JPG]

_________________
Michael A.


Last edited by MichaelM on Fri Dec 04, 2015 11:58 pm, edited 1 time in total.

PostPosted: Fri Nov 27, 2015 12:03 am 
Joined: Mon Apr 23, 2012 12:28 am
Posts: 760
Location: Huntsville, AL
All 256 opcode slots are used in the M65C02A. The following five diagrams provide additional details regarding the instruction set. The columns labeled W65C02S and M65C02A provide the number of cycles each instruction requires. The column labeled Int indicates whether an instruction is interruptable (Y) or not (N). (The prefix instructions are not interruptable so that their effects do not need to be saved when interrupts are taken.) The columns labeled IND and SIZ indicate whether an instruction supports indirection and 16-bit operations, respectively. (The ind instruction is used to implement secondary functions with opcodes that only use implicit addressing. In other words, instructions where ind cannot have an effect on the address mode.) The columns labeled OSX, OAX, and OAY indicate whether an instruction supports one of the three register override prefix instructions. (osx swaps the functions of the stack pointer with X, allowing stack operations to be performed using an auxiliary stack pointer maintained in X. osx also redirects register reads/writes of X to S instead. This allows S to use the LDX/STX/CPX/TAX/TXA instructions.) The column labeled Op provides the opcode for the instruction. The remaining columns define each cycle of the instruction.
Attachment: M65C02A-ImplictFlowControl.JPG
Attachment: M65C02A-ImmediateZeroPageDirect.JPG
Attachment: M65C02A-ZeroPageIndirectRMW.JPG
Attachment: M65C02A-AbsoluteRMW.JPG
Attachment: M65C02A-Specific.JPG

_________________
Michael A.


PostPosted: Fri Nov 27, 2015 9:08 am 
Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10986
Location: England
Bravo! Not every project gets to this level of finish.


PostPosted: Mon Nov 30, 2015 9:10 pm 
Joined: Fri Dec 11, 2009 3:50 pm
Posts: 3367
Location: Ontario, Canada
Congratulations on your work, Michael -- you've clearly invested many hours in a highly ambitious project. It must feel great to finally release the result!

That said, and with your HDL work done at last, I think the summary you've presented is still less than complete. What's important more than anything is for a newcomer to readily get a grasp of what the M65C02A can do. I suppose the various M65C02A-related threads contain the necessary information, but it'd be nice if this thread were to review the main features one by one. In particular, the capabilities of the prefixes are central.

Quote:
osx swaps the functions of the stack pointer with X allowing stack operations to be performed using an auxiliary stack pointer maintained in X. osx also redirects register reads/writes to X to S instead. This allows S to use the LDX/STX/CPX/TAX/TXA instructions.
This part seems reasonably clear. What about the OAX and OAY prefixes -- should we assume the same reasoning applies?

IMO the most reader-friendly way to present this is to give each of the prefixes a brief paragraph. It'll be helpful to include a specific example of usage for each. For instance I don't understand what you said about the ind prefix; an example would help. And I don't see the ISZ prefix explained at all.

Incidental question: can you please elaborate on which instructions are and aren't interruptable? For example, JSR and RTS are listed as not being interruptable -- yet this is something I would take for granted. Any 65xx instruction must be allowed to complete before an interrupt can be permitted! I have a feeling I'm missing something.

It's nice to see so many single-byte ops that execute in a single cycle! Eliminating dead cycles is a significant performance boost in itself (not to mention 16-bit operations and Forth VM support). Is it possible to estimate a MIPS or MHz figure for the M65C02A?

cheers,
Jeff

_________________
In 1988 my 65C02 got six new registers and 44 new full-speed instructions!
https://laughtonelectronics.com/Arcana/ ... mmary.html


PostPosted: Tue Dec 01, 2015 1:54 am 
Joined: Mon Apr 23, 2012 12:28 am
Posts: 760
Location: Huntsville, AL
I'm pretty happy with the results, and several individuals on this forum have provided great suggestions, some of which have been incorporated into the final version. I certainly appreciate everyone's patience and freely given advice and suggestions for improvement. I've certainly had a lot of fun working on this project, and having the support of the forum has been a motivating factor.

Now that it is complete, I will be moving on to using the core in some of my other projects. I also expect to begin porting the fig-FORTH implementation that I've used so that it utilizes the capabilities built into the M65C02A for FORTH VM support. I have also wanted to port EhBasic to the core, and I may attempt that too.

I've been preparing to release the project on GitHub. I've prepared a README file that should provide some of the details that were missing from the previous two posts. I've included it below:
Code:
M65C02A Microprocessor Core
=======================

Copyright (C) 2014-2015, Michael A. Morris.
All Rights Reserved.

Released under GPL v3.

General Description
-------------------

This project provides a synthesizable enhanced version of the 6502/65C02
processor core. The M65C02A soft-core processor implements the
instruction set architecture (ISA) of the 6502/65C02 microprocessors as
exemplified by the instruction set of the WDC W65C02S. The M65C02A soft-core
processor features a completely reworked microprogrammed control structure
compared to that used in the preceding M65C02 project. The basic logic
structure of the core has been significantly altered to allow the
implementation of a significantly enhanced 8/16-bit version of the 6502/65C02
processors.

This release supersedes any prior releases and provides the completed version
of the planned M65C02A soft-core processor. As provided, the M65C02A soft-core
processor provides the following enhancements to 6502/65C02 processors:

    (1)     M65C02A core allows the 6502/65C02 index registers, X and Y, to be
    used as accumulators. Although the one address, accumulator-based
    architecture of the 6502/65C02 microprocessors is preserved, three on-chip
    accumulators should make it easier for the programmer to keep extended width
    results in on-chip registers rather than loading and storing partial results
    from/to memory;
   
    (2)     M65C02A core allows the basic registers (A, X, Y, S) to be extended to
    16 bits in width. To maintain compatibility with 6502/65C02 microprocessors,
    the default operation width of the registers and ALU operations is 8 bits.
    Internally, the upper byte of any register (A, X, Y, S) or the memory operand
    register (M) is forced to logic 0 (except for S which is forced to 0x01)
    unless the programmer explicitly extends the width of the operation with a
    prefix instruction;
   
    (3)     M65C02A core’s ALU registers (A, X, and Y) are implemented using a
    modified, three level push-down register stack. This provides the programmer
    the ability to preserve intermediate results on-chip. The modification to the
    register stack is that load and store instructions only affect the TOS
    locations of the A, X, and Y register stacks. In other words, the TOS location
    of the register stacks is not automatically pushed on loads from memory, nor
    is it automatically popped on stores to memory. Explicit actions are required
    by the programmer to manage the contents of the register stacks associated
    with A, X, and Y;
   
    (4)     M65C02A core’s X Top-Of-Stack register, XTOS, serves as a base pointer
    for the base-relative addressing modes: bp,B and (bp,B),Y. These addressing
    modes provide the stack frame capability needed by programming languages like
    C and Pascal, which must be emulated on 6502/65C02 microprocessors. (Note:
    base-relative addressing using XTOS is generally associated with the system
    stack, but can be used in a more general way with any data structures in
    memory.)
   
    (5)     M65C02A core’s XTOS can function as a third (auxiliary) stack pointer,
    SX, when instructions are prefixed with the osx instruction. (Note: when used
    as the auxiliary stack pointer, S becomes the source/target for all of the
    6502/65C02 instructions specific to the X register: ldx, stx, cpx, txa, tax,
    plx, phx. This feature provides seven more ways to affect the system stack
    pointer: lds, sts, cps, tsa, tas, pls, phs.)
   
    (6)     The M65C02A core provides support for kernel and user modes. The
    previously unused and unimplemented bit of the processor status word (P), bit
    5, is used to indicate the processor mode, M. The M65C02A core provides kernel
    mode and user mode stack pointers, SK and SU, respectively, for this
    purpose. SU may be manipulated from kernel mode routines, but SK is
    inaccessible to user mode routines. (Note: a 6502/65C02 program will stay in
    the kernel mode unless bit 5 (kernel mode) of the PSW on the system stack is
    cleared when an rti instruction is performed. On reset, the M65C02A defaults
    to kernel mode for compatibility with 6502/65C02 microprocessors.)
   
    (7)     M65C02A core provides automatic support for stacks greater than 256
    bytes. This feature is automatically activated whenever stacks are allocated
    in memory outside of memory page 0 (0x0000-0x00FF) or memory page 1 (0x0100-
    0x01FF). (Note: a limitation of this feature is that if the stack grows into
    page 1, then the mod 256 behavior of normal 6502/65C02 stacks will be
    automatically restored.)
   
    (8)     M65C02A core provides a prefix instruction, ind or isz, to add
    indirection to an addressing mode. When an indirection prefix instruction is
    applied, indirection is performed before indexing. (Note: a consequence of
    this rule is that the indexed zero page direct addressing modes, zp,X and
    zp,Y, are converted to post-indexed indirect addressing modes: (zp),X and
    (zp),Y. Similar behavior applies to the indexed absolute addressing modes.
    When ind or isz is applied to an indirect addressing mode, the result is a
    double indirection as expected. However, the rule that indirection is applied
    before indexing still applies. The result is that pre-indexed indirect
    addressing modes, (zp,X) and (abs,X), translate into post-indexed double
    indirect addressing modes, ((zp)),X and ((abs)),X, instead of into pre-indexed
    double indirect addressing modes, ((zp,X)) and ((abs,X)).)
   
    (9)     M65C02A core provides a prefix instruction, siz or isz, which promotes
    the width of the ALU operation from 8 to 16 bits. The only restriction is that
    BCD operations cannot be promoted from 8-bit to 16-bit; BCD arithmetic is only
    available for 8-bit operands.
   
    (10)    M65C02A core allows the CMP/CPX/CPY instructions to set the V flag
    when a 16-bit operation is being performed. The M65C02A core also implements
    multi-flag conditional branches. The multi-flag conditional branches support
    four signed conditional branches: Less Than (LT), Less Than or Equal (LE),
    Greater Than (GT), and Greater Than or Equal (GE). In addition, four unsigned
    conditional branches are supported: Lower (LO), Lower or Same (LOS), Higher
    (HI), and Higher or Same (HOS).
   
    (11)    M65C02A core provides support for the implementation of virtual
    machines (VMs) for threaded interpreters such as FORTH. The M65C02A core’s IP
    and W are 16 bit registers which support the implementation of DTC/ITC FORTH
    VMs using several dedicated M65C02A instructions. The core's microprogram
    implements the DTC FORTH inner interpreter with a single byte instruction, and
    it also implements the ENTER/DOCOLON operation with another single byte
    instruction. The ITC FORTH versions of these operations are supported using the
    ind prefix instruction. Instructions for pushing, popping, and incrementing IP
    and W are also included. Further, a new addressing mode, IP-relative with
    auto-increment, is applied to the LDA, ADC, and STA instructions. The LDA
    ip,I++ and STA ip,I++ instructions allow faster access to literals, constants,
    and variables. The ADC ip,I++ instruction allows the synthesis of IP-relative
    branches and jumps. Finally, the IP can be transferred to/from, or exchanged
    with, the A.
       
    (12)    M65C02A core provides a multi-purpose move byte instruction. The
    instruction has two operating modes: single cycle, and multi-cycle. In the
    single cycle mode, the instruction terminates after each move is completed.
    The count (A), source pointer (X), and destination pointer (Y) registers are
    updated before the instruction terminates. Decrementing the count register (A)
    sets the ALU ZC flags like a DEC A (DEA) instruction. This allows the
    programmer the option to loop back and continue the execution of the single
    cycle move instruction. The multi-cycle mode transfers the entire block before
    terminating. Because of this behavior, the multi-cycle move instruction is not
    interruptable. The source and destination pointers may be independently
    configured to increment, decrement, or hold. This allows the M65C02A move
    instruction to support a wide range of data transfer tasks.
   
    (13)    M65C02A core provides support for implementing application-specific
    coprocessors. Direct support for application-specific coprocessors allows an
    implementation based on the M65C02A core to be easily extended in a domain-
    specific manner.
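
The flag equations behind the multi-flag branches in item (10) are not spelled
out above. As a hedged sketch, the Python below uses the conventional
6502-style definitions (after a compare, C set means no borrow) and assumes
the M65C02A follows them. The names compare16 and branch_taken are
hypothetical helpers for this illustration, not part of the core or its tools.

```python
# Conventional compare-then-branch flag logic for a 6502-style ALU.
# Assumption: the M65C02A multi-flag branches use these standard equations.

def compare16(a, m):
    """Return (N, V, Z, C) for the 16-bit compare a - m."""
    a &= 0xFFFF
    m &= 0xFFFF
    diff = (a - m) & 0xFFFF
    n = bool(diff & 0x8000)
    z = diff == 0
    c = a >= m                                  # carry set means no borrow
    # Signed overflow: operand signs differ and the result sign differs from a.
    v = bool((a ^ m) & (a ^ diff) & 0x8000)
    return n, v, z, c

def branch_taken(cond, n, v, z, c):
    """Evaluate one of the eight multi-flag conditions."""
    table = {
        "LT": n != v,                 # signed less than
        "GE": n == v,                 # signed greater than or equal
        "GT": (n == v) and not z,     # signed greater than
        "LE": (n != v) or z,          # signed less than or equal
        "LO": not c,                  # unsigned lower
        "HOS": c,                     # unsigned higher or same
        "HI": c and not z,            # unsigned higher
        "LOS": (not c) or z,          # unsigned lower or same
    }
    return table[cond]
```

The V flag matters for the signed conditions: comparing 0x8000 with 0x0001
yields a positive-looking difference, and only N xor V gives the correct
signed ordering.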
   
To demonstrate the use of the M65C02A core, an example of its use to implement
a microcomputer is provided as part of the release. The example microcomputer
constructed using the M65C02A core provides the following features:

    •   M65C02A core (synthesizable, enhanced 6502/65C02-compatible core)
    •   a Multi-Source (16) Interrupt Handler                           
    •   a Memory Management Unit (with support for Kernel and User modes)
    •   28kB of memory (built from synchronous Block RAM)               
    •   2 Universal Asynchronous Receiver/Transmitter (UARTs)           
    •   1 Serial Peripheral Interface (SPI)

Implementation
--------------

The current implementation of the core consists of the following
Verilog source files and several memory initialization files:

    M65C02A.v                           (M65C02A Microcomputer: RAM/ROM/IO)
        M65C02A_Core.v                  (M65C02A Processor Core)
            M65C02A_MPC.v               (M65C02A Microprogram Controller)
                M65C02A_uPgm_ROM.coe    (M65C02A_uPgm_ROM.txt)
                M65C02A_IDecode_ROM.coe (M65C02A_IDecode_ROM.txt)
            M65C02A_AddrGen.v           (M65C02A Address Generator)
                M65C02A_StkPtrV2.v      (M65C02A Dual Stack Pointer)
            M65C02A_ALUv2.v             (M65C02A ALU Wrapper)
                M65C02A_ALU.v           (M65C02A ALU)
                    M65C02A_LST.v       (M65C02A ALU Load/Store/Xfr Mux)
                    M65C02A_LU.v        (M65C02A ALU Logic Unit)
                    M65C02A_SU.v        (M65C02A ALU Shift/Rotate Unit)
                    M65C02A_Add.v       (M65C02A ALU Dual Mode Adder Unit)
                    M65C02A_WrSel.v     (M65C02A ALU Register Write Decoder)
                    M65C02A_RegStk.v    (M65C02A ALU Register Stack)
                    M65C02A_RegStkV2.v  (M65C02A ALU Reg. Stack w/ Stk Ptr)
                    M65C02A_PSWv2.v     (M65C02A ALU Processor Status Word)
        M65C02A_IntHndlr.v              (Interrupt Handler)
        M65C02A_MMU.v                   (Memory Management Unit)
        M65C02A_SPIxIF.v                (SPI Master I/F)
            DPSFmnCE.v                  (Transmit & Receive FIFOs)
            SPIxIF.v                    (Configurable SPI Master)
                fedet.v                 (falling edge detector)
                redet.v                 (rising edge detector)
        UART.v                          (COM0/COM1 Asynch. Serial Ports)
            DPSFmnCS.v                  (Transmit & Receive FIFOs)
            UART_BRG.v                  (UART Baud Rate Generator)
            UART_TXSM.v                 (UART Transmit SM/Shifter)
            UART_RXSM.v                 (UART Receive SM/Shifter)
            UART_INT.v                  (UART Interrupt Generator)
                fedet.v                 (falling edge detector)
                redet.v                 (rising edge detector)

    M65C02A.ucf             - User Constraints File: period and pin LOCs
    M65C02A.bmm             - Block RAM Memory Definition File
    M65C02A.tcl             - Project settings file
   
    tb_M65C02A.v            - Completed M65C02A soft-microcomputer testbench

M65C02A-Specific Instructions
--------------

The M65C02A soft-core processor is fully compatible with all 6502/65C02
instructions and addressing modes. Regression testing using Klaus Dormann's
functional test suite executes without error. Unlike the mode switching used
by the 65816, the M65C02A uses prefix instructions. This decision can be good
or bad. It is good in that 6502/65C02 functions can be interspersed without
any concerns. It is bad in that it requires the programmer to explicitly
specify address indirection, register overrides, and operation size. In most
cases, the additional instruction length required to specify the various
prefix instructions does not result in poorer performance, or increased
program size because the strict 8-bit width of the 6502/65C02 processors
requires a lot more instructions to implement 16-bit operations.

This section will describe, in cursory detail, the various instructions that
are specific to the M65C02A soft-core processor. Only enough detail will be
provided to convey the character and behavior of the M65C02A-specific
instructions. Additional detail can be gleaned from the Verilog source code,
the microprogram listings, and the User Guide (under development).

##Prefix Instructions

The M65C02A gains much of its power from six (6) prefix instructions: IND,
SIZ, ISZ, OAX, OAY, and OSX. The IND and SIZ instructions add indirection and
promote ALU operations to 16 bits, respectively. The ISZ prefix instruction
applies IND and SIZ simultaneously. These three prefix instructions can also
be applied to implicit addressing mode instructions in order to increase the
opcode space. In general, however, IND, SIZ, and ISZ are intended to be
applied to logic, shift/rotate, and arithmetic instructions. IND converts a
direct addressing mode to an indirect addressing mode. When applied to an
indirect addressing mode, IND adds a second level of indirection.

Indirection added by IND is always performed before indexing. This rule has
some consequences. For example, Pre-Indexed Zero Page Direct [zp,X] is
converted to Post-Indexed Zero Page Indirect [(zp),X] when IND is prefixed to
an instruction using the Pre-Indexed Zero Page Direct addressing mode. When
IND is applied to an instruction using the Pre-Indexed Zero Page Indirect
[(zp,X)] addressing mode, the result is Post-Indexed Zero Page Double Indirect
[((zp)),X] not Pre-Indexed Zero Page Double Indirect [((zp,X))].
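
The "indirection before indexing" rule can be made concrete with a small
behavioral model. This is only a sketch in Python, not the core's address
generator: the function names are hypothetical, memory is a flat 64 KiB array,
X is treated as 8 bits wide, and zero page wrap-around of pointer fetches is
not modeled.

```python
# Effective address sketches for zp,X and (zp,X), with and without IND.

def read16(mem, addr):
    """Fetch a little-endian 16-bit pointer."""
    return mem[addr & 0xFFFF] | (mem[(addr + 1) & 0xFFFF] << 8)

def ea_zp_x(zp, x):
    """zp,X : index only, no indirection."""
    return (zp + x) & 0xFF

def ea_ind_zp_x(mem, zp, x):
    """IND zp,X => (zp),X : indirection first, then index."""
    return (read16(mem, zp) + x) & 0xFFFF

def ea_zp_x_ind(mem, zp, x):
    """(zp,X) : standard 65C02 pre-indexed indirect."""
    return read16(mem, (zp + x) & 0xFF)

def ea_ind_zp_x_ind(mem, zp, x):
    """IND (zp,X) => ((zp)),X : both indirections first, then index."""
    return (read16(mem, read16(mem, zp)) + x) & 0xFFFF

mem = bytearray(0x10000)
mem[0x10:0x12] = bytes([0x00, 0x20])        # zero page pointer -> 0x2000
mem[0x2000:0x2002] = bytes([0x34, 0x12])    # second level pointer -> 0x1234
```

With X = 4, zp,X at 0x10 addresses 0x0014, IND zp,X addresses 0x2004, and
IND (zp,X) addresses 0x1238.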

When SIZ is applied to an instruction, the operation is doubled in size. While
operating on 8-bit quantities, the upper half of registers and/or ALU results
is forced to zero. This has the consequence that when mixing 8-bit and 16-bit
operations, the 8-bit registers/values will appear to be unsigned to the 16-bit
registers/values. There are only two limitations regarding promotion of
ALU operations to 16-bit widths: (1) BCD arithmetic is only valid for 8-bit
quantities, and (2) the Rockwell instructions cannot be promoted to 16 bits.
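
The zero extension has a practical consequence for signed data, which the
following Python sketch illustrates. The helper names are hypothetical; the
only behavior taken from the text is that an 8-bit result has its upper byte
forced to zero.

```python
# Why 8-bit values look unsigned to a following 16-bit operation.

def as_signed(value, bits):
    """Interpret the low `bits` bits of value as two's complement."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value & (1 << (bits - 1)) else value

def widen_unprefixed(byte):
    """Register contents after an 8-bit operation: upper byte is zero."""
    return byte & 0x00FF

def sign_extend8(byte):
    """Explicit sign extension the programmer must perform if it is needed."""
    byte &= 0xFF
    return byte | 0xFF00 if byte & 0x80 else byte
```

The byte 0xFF is -1 as an 8-bit signed value, but a later SIZ-prefixed
operation sees 0x00FF, i.e. +255. Code that mixes widths on signed data has to
sign extend explicitly (0xFF becomes 0xFFFF).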

ISZ can be used whenever indirection and a 16-bit ALU operation are desired.
For example, ISZ improves the utility of instructions such as BIT/TRB/TSB by
adding indirection and increasing the width of the operation from 8 to 16
bits. Only IND and SIZ were combined into a single prefix instruction. The
other three prefix instructions may be applied in combination with IND, SIZ,
and ISZ.

The OAX and OAY prefix instructions override the source/target registers.
OAX allows X to function as an accumulator for any instruction which has A
as a source/destination register. For its part, A takes on the pre-index
register role of X in the addressing modes of the instructions prefixed by
OAX. The OAY prefix instruction produces the same results with the Y and A
registers.

The OSX prefix instruction allows the programmer to override the default stack
pointer of an instruction with X. The X register has the capability of
functioning as a third, auxiliary stack pointer. The system stack pointer is
not converted to function as the pre-index register. Instead, S is substituted
for X in any instruction specific to X, i.e. LDX/STX/CPX, TAX/TXA, and
PHX/PLX. This allows S to be more easily manipulated. Note that the PHX/PLX
will use the auxiliary stack whose stack pointer is X.

When multiple prefix instructions are necessary, OSX and OAX are mutually
exclusive, and so are OAY and OAX. Internal flags record the execution of the
six prefix instructions. Execution of a mutually exclusive prefix instruction
will result in the flag register of the previously set mutually exclusive
instruction being reset. Furthermore, the flag registers for prefix
instructions are sticky, and remain set until the completion of the
immediately following non-prefix instruction. Finally, no protection is
provided against an infinitely long series of prefix instructions, all of which
are uninterruptable, or application of prefix instructions to create an
existing addressing mode.
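
The flag rules in this paragraph can be summarized with a small state model.
It is a sketch under assumptions: the class and method names are hypothetical,
and ISZ is modeled as simply setting both the IND and SIZ flags, which is how
the text describes its effect but not necessarily how the core stores it.

```python
# Sticky prefix flags with the two mutually exclusive pairs named above.

PREFIXES = {"IND", "SIZ", "ISZ", "OAX", "OAY", "OSX"}
EXCLUSIVE = {"OAX": {"OSX", "OAY"}, "OSX": {"OAX"}, "OAY": {"OAX"}}

class PrefixFlags:
    def __init__(self):
        self.flags = set()

    def step(self, mnemonic):
        """Feed one instruction; return the flags that modify it."""
        if mnemonic in PREFIXES:
            # A later prefix resets the flag of its mutually exclusive partner.
            self.flags -= EXCLUSIVE.get(mnemonic, set())
            if mnemonic == "ISZ":
                self.flags |= {"IND", "SIZ"}
            else:
                self.flags.add(mnemonic)
            return frozenset()
        # Flags stay set until the next non-prefix instruction completes.
        applied = frozenset(self.flags)
        self.flags.clear()
        return applied
```

Feeding OSX, OAX, TXA leaves only OAX applied to TXA, because OAX resets the
OSX flag. Feeding SIZ, OSX, OAY, TXA applies all three prefixes.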

###Access to User Stack Pointer (SU) from Kernel Mode

When in Kernel mode, access to SU is provided by applying IND or ISZ to the
instruction sequences that access the system stack pointer:

    [SIZ] TSX           =>  TSX : X  <= SK;     IND/ISZ TSX =>  TSX : X  <= SU
    [SIZ] TXS           =>  TXS : SK <= X;      IND/ISZ TXS =>  TXS : SU <= X
       
    [SIZ] OSX/OAX TXA   =>  TSA : A  <= SK;     IND/ISZ TSA =>  TSA : A  <= SU
    [SIZ] OSX/OAX TAX   =>  TAS : SK <= A;      IND/ISZ TAS =>  TAS : SU <= A
   
    [SIZ] OSX OAY TXA   =>  TSY : Y  <= SK;     IND/ISZ TSY =>  TSY : Y  <= SU
    [SIZ] OSX OAY TAX   =>  TYS : SK <= Y;      IND/ISZ TYS =>  TYS : SU <= Y
       
The fastest way to access either SK or SU is to use the standard TXS/TSX
instructions. Use only SIZ to access the 16-bit SK or use only ISZ to access
the 16-bit SU. Use either OAX or OSX in order to transfer SK or SU to/from A.
Use OSX and OAY to transfer SK or SU to/from Y.

The M65C02A implements stacks using mod 256 behavior when they are located in
page 0 or page 1, and mod 65536 behavior when the stacks are not located in
these two pages. Therefore, it is recommended that SIZ or ISZ is used when
reading or writing the system stack pointers in order to transfer the upper
byte from/to the stack pointers. This will preserve or set the behavior of the
stacks.
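
A minimal Python sketch of this behavior follows. It assumes that the page
test is made on the current pointer value at every push or pop, and that a
write without SIZ forces the upper byte of S to 0x01 as described in the
general description; next_sp and write_sp are hypothetical names used only
for illustration.

```python
# Stack pointer update: mod 256 in pages 0 and 1, mod 65536 elsewhere.

def next_sp(sp, delta):
    """delta is -1 for a push and +1 for a pop."""
    sp &= 0xFFFF
    if (sp >> 8) in (0x00, 0x01):
        # Only the low byte counts; the page is preserved.
        return (sp & 0xFF00) | ((sp + delta) & 0x00FF)
    return (sp + delta) & 0xFFFF

def write_sp(value, siz):
    """Value held by S after a transfer with or without SIZ."""
    if siz:
        return value & 0xFFFF
    return 0x0100 | (value & 0x00FF)    # assumed: upper byte forced to 0x01
```

A stack at 0x0200 that pushes one byte moves to 0x01FF and from then on wraps
within page 1, so popping from 0x01FF gives 0x0100 rather than 0x0200.
Writing 0x80FF to S without SIZ yields 0x01FF under the stated assumption,
which is why the 16-bit forms are recommended.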

Without using IND or ISZ, these instruction sequences will access the system
stack pointer, SK or SU, based on the state of the M bit in P. Applying only
IND will transfer only the lower 8-bits of SU. Adding SIZ by using ISZ will
transfer the complete 16-bit value of SU; if only SIZ is used, then only SK
will be accessed.

##Register Stack Manipulation Instructions

A unique feature of the M65C02A soft-core processor is the three level push-down
register stack used to implement each of the three primary registers. Load and
store operations do not automatically perform push and pop operations. This
behavior allows the M65C02A soft-core to seamlessly emulate the 6502/65C02
processors.

All three register stacks are implemented with three 16-bit registers. Deeper
stacks are possible, but a three deep stack strikes a good balance in utility,
complexity, and resource utilization. The three register stacks are supported
by three single cycle instructions: DUP, SWP, and ROT. The operations of these
three instructions are described by the following equations:

    DUP :   {TOS, NOS, BOS} <= {TOS, TOS, NOS};
    SWP :   {TOS, NOS, BOS} <= {NOS, TOS, BOS};
    ROT :   {TOS, NOS, BOS} <= {NOS, BOS, TOS};
   
To implement a stack push, the programmer must push the current TOS value
prior to loading a new value by duplicating the TOS:

    DUP
    LDA abs
   
To implement a stack pop, the programmer must pop the TOS value after a store
to memory by rotating the stack:

    STA (zp)
    ROT
   
Although this preserves the TOS in BOS, it provides the needed popping of the
register stack.
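
The three equations and the push/pop idioms can be checked with a direct
Python transcription. The class below is only a behavioral model with
hypothetical names; it follows the DUP, SWP, and ROT equations given above
and treats load as a write to TOS only.

```python
# Behavioral model of one three level register stack (A, X, or Y).

class RegStack:
    def __init__(self):
        self.tos = self.nos = self.bos = 0

    def load(self, value):
        """LDA/LDX/LDY: only TOS changes, nothing is pushed."""
        self.tos = value & 0xFFFF

    def dup(self):
        self.tos, self.nos, self.bos = self.tos, self.tos, self.nos

    def swp(self):
        self.tos, self.nos, self.bos = self.nos, self.tos, self.bos

    def rot(self):
        self.tos, self.nos, self.bos = self.nos, self.bos, self.tos

    def state(self):
        return (self.tos, self.nos, self.bos)

a = RegStack()
a.load(1)
a.dup(); a.load(2)      # push: DUP then load
a.dup(); a.load(3)      # push again
```

After the sequence above the stack holds (3, 2, 1). A ROT following a store
yields (2, 1, 3): the old TOS is kept in BOS, and NOS has moved up to TOS.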

###Special Behavior of the A Register Stack

In addition to the operations discussed above, the A register stack provides
several special behaviors. Using the IND prefix instruction, the bytes of the
A TOS register can be swapped:

    IND SWP :   {TOS, NOS, BOS} <= {{TOS[7:0], TOS[15:8]}, NOS, BOS};
   
Using the IND prefix instruction, the nibbles of the A TOS register can be
rotated right:

    IND ROT :   {TOS, NOS, BOS} <= {{TOS[3:0], TOS[15:4]}, NOS, BOS};
   
Using the IND prefix instruction, the A TOS register can be written into the
FORTH VM IP register:

    IND DUP :   IP <= ATOS;             => TAI
   
Using the SIZ prefix instruction the FORTH VM IP register can be written into
the A TOS register:

    SIZ DUP :   ATOS <= IP;             => TIA
   
Finally, using the ISZ prefix instruction the A TOS register and the FORTH VM
IP register can be exchanged:

    ISZ DUP :   ATOS <= IP; IP <= ATOS; => XAI
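
The bit-level effect of these special forms can be sketched in Python as
follows. The function names are hypothetical; the bit selections are taken
directly from the equations above.

```python
# A register special behaviors: byte swap, nibble rotate, and IP transfers.

def ind_swp(tos):
    """IND SWP: exchange the two bytes of the 16-bit A TOS."""
    tos &= 0xFFFF
    return ((tos & 0x00FF) << 8) | (tos >> 8)

def ind_rot(tos):
    """IND ROT: rotate A TOS right by one nibble."""
    tos &= 0xFFFF
    return ((tos & 0x000F) << 12) | (tos >> 4)

def dup_with_prefix(prefix, atos, ip):
    """Return the new (ATOS, IP) pair for IND/SIZ/ISZ DUP."""
    if prefix == "IND":     # TAI: IP <= ATOS
        return atos, atos
    if prefix == "SIZ":     # TIA: ATOS <= IP
        return ip, ip
    if prefix == "ISZ":     # XAI: exchange
        return ip, atos
    raise ValueError("unprefixed DUP is the normal register stack DUP")
```

For example, 0x1234 becomes 0x3412 after IND SWP and 0x4123 after IND ROT;
four nibble rotations restore the original value.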
   
###Special Behavior of the X Register Stack

In addition to the behaviors discussed above, the X TOS register implements a
counter function that allows it to operate as a third stack pointer. Further,
the same logic used to implement 6502/65C02 stacks is used to support the
M65C02A move byte instruction. When a push is performed, the X TOS register is
decremented if X is the default stack pointer or when the OSX flag is set and
X is not the default stack pointer. When a pop is performed, the X TOS
register is incremented if X is the default stack pointer or when the OSX flag
is set and X is not the default stack pointer. The M65C02A move byte
instruction uses this functionality to implement the source pointer increment,
decrement, or hold operations.

###Special Behavior of the Y Register Stack

In addition to the behaviors discussed above, the Y TOS register implements a
counter function that allows it to operate as the destination pointer for the
M65C02A move byte instruction. The Y TOS register uses the same counter
implementation as the X TOS register. The M65C02A move byte instruction uses
this functionality to implement the destination pointer increment, decrement,
or hold operations.

##Base-Pointer Relative Addressing Mode Instructions

The M65C02A introduces the base-pointer relative [bp,B] and post-indexed base-pointer
relative indirect [(bp,B),Y] addressing modes. These addressing modes are
designed to support stack frames such as those used by programming languages
like C and Pascal.

The base-pointer (B) is X, and the offset included in the instruction, bp, is
zero based; this makes the value on the top of the stack offset 0. Further,
offset bp is signed so that positive offsets point to parameters of a
function, and negative offsets point to local variables. On entry into the
function, the current base pointer, X, is pushed, and then the stack pointer
is moved into X to mark the stack frame; offset 0 is the previous base pointer,
offset 2 is the return address, and offset 4 and above are the parameters
passed into the function.
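
The effective-address arithmetic for the two modes can be sketched in Python as follows. The sketch assumes that bp is an 8-bit signed offset (the text only says it is signed) and that pointers are stored low byte first, as on the 6502; the function names and the `mem` bytearray are illustrative.

```python
def sign_extend8(value):
    """Interpret an 8-bit value as a signed offset (assumed width of bp)."""
    value &= 0xFF
    return value - 0x100 if value & 0x80 else value


def ea_bp(base, bp):
    """Effective address for bp,B: base pointer plus signed offset."""
    return (base + sign_extend8(bp)) & 0xFFFF


def ea_bp_ind_y(mem, base, bp, y):
    """Effective address for (bp,B),Y: fetch a pointer at bp,B, then add Y."""
    ptr_addr = ea_bp(base, bp)
    ptr = mem[ptr_addr] | (mem[(ptr_addr + 1) & 0xFFFF] << 8)
    return (ptr + y) & 0xFFFF


mem = bytearray(0x10000)
base = 0x01F0                              # frame marker held in X
mem[0x01F4], mem[0x01F5] = 0x00, 0x20      # first parameter: pointer 0x2000

print(hex(ea_bp(base, 4)))                 # first parameter
print(hex(ea_bp(base, 0xFE)))              # local variable at offset -2
print(hex(ea_bp_ind_y(mem, base, 4, 3)))   # element 3 through the pointer
```

With this layout a by-reference parameter (a pointer at a positive offset) is reached with the post-indexed indirect form, while scalars and locals use the plain bp,B form.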

There are 17 M65C02A instructions using this addressing mode. There are eight
standard base-pointer relative instructions:

    ORA/ANL/EOR/ADC/STA/LDA/CMP/SBC bp,B

There are eight standard post-indexed base-pointer relative indirect
instructions:

    ORA/ANL/EOR/ADC/STA/LDA/CMP/SBC (bp,B),Y
   
These instructions support the IND, SIZ, ISZ, OAX and OAY prefix instructions
described above. IND, SIZ, and ISZ have the expected effects. When using OAX,
A becomes the base pointer, and X is the accumulator. When using OAY, A
becomes the post-index register, and Y is the accumulator. Since the base-pointer 
relative addressing mode is not dependent on a stack pointer, the OSX prefix
instruction has no effect. The registers of the X register stack, especially
TOS and NOS, can be used to set up and maintain base pointers into the system
stack and into other memory structures such as the auxiliary stack, which is
also implemented with X.

The last M65C02A instruction supporting this addressing mode is a jump
instruction:

    JMP (bp,B),Y

This instruction supports the IND, OAX, and OAY prefix instructions. IND adds
a second level of indirection, OAX allows A to function as the base-pointer,
and OAY allows A to be used as the post-index register:

    IND JMP (bp,B),Y        =>  JMP ((bp,B)),Y

    OAX JMP (bp,B),Y        =>  JMP (bp,A),Y
    OAX IND JMP (bp,B),Y    =>  JMP ((bp,A)),Y

    OAY JMP (bp,B),Y        =>  JMP (bp,B),A
    OAY IND JMP (bp,B),Y    =>  JMP ((bp,B)),A
   
Recall that OAX and OAY are mutually exclusive. This constraint avoids the
nonsensical situation in which both are applied and A is simultaneously used
as the base-pointer and the post-index register.

##Extended (16-bit) Push/Pop Instructions

The M65C02A provides four push instructions and two pop/pull instructions not
found on the 6502/65C02 processors. These instructions push/pull 16-bit values
to/from the system stack:

    PHR rel16           ; PusH word Relative

    PSH #imm16          ; PuSH word Immediate

    PSH zp              ; PuSH word Zero Page Direct
    PSH abs             ; PuSH word Absolute

    PUL zp              ; PULl word Zero Page Direct             
    PUL abs             ; PULl word Absolute
   
The PHR instruction resolves the absolute address of the 16-bit relative
displacement, and pushes that absolute 16-bit value onto the system stack. The
rel16 parameter is the distance from the address of the next instruction to
the target, i.e. relative to the PC. The stack used (SK/SU or SX) by this
instruction can be changed by adding the OSX prefix instruction. Other prefix
instructions have no effect on this instruction.
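
The address resolution performed by PHR amounts to 16-bit modular arithmetic on the address of the next instruction. The following Python sketch shows the calculation and why the result tracks relocation; the function names are illustrative, and the example addresses are made up.

```python
def phr_value(next_pc, rel16):
    """Absolute address pushed by PHR: next instruction address plus rel16."""
    return (next_pc + rel16) & 0xFFFF


def rel16_for(next_pc, target):
    """Displacement an assembler would emit to reach target from next_pc."""
    return (target - next_pc) & 0xFFFF


# Hypothetical example: the instruction after PHR is at 0x8003, target at 0x7FF0.
disp = rel16_for(0x8003, 0x7FF0)
print(hex(disp), hex(phr_value(0x8003, disp)))

# Loading the same code 0x1000 bytes higher moves the resolved address with it.
print(hex(phr_value(0x9003, disp)))
```

Because only the displacement is stored in the instruction, the pushed address follows the code wherever it is loaded.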

The PHR rel16 instruction can be used to synthesize a BSR rel16 instruction
using an instruction sequence like the following:

        PSH #$1
        PHR rel16
    $1: RTS

Another use of the PHR rel16 is to resolve, in a PC-relative manner, the
address of a constant or variable needed at run time, when the program can be
relocated when it is loaded in memory. Using a base-relative LDA/STA
instruction prefixed with IND, the constant/variable can be accessed using the
pointer pushed onto the stack. Using the post-indexed base relative indirect
version of the same instructions, the constant/variable can be accessed
through the resolved pointer with indexing. Either technique can be used to
access objects in a position-independent manner using PHR to resolve the
address of the object.

The PSH #imm16 instruction simply pushes the 16-bit immediate constant
following the instruction onto the system stack. If the 16-bit immediate
values being pushed are addresses, then a better name for this instruction
might be PEA (Push Effective Address). The stack used (SK/SU or SX) by this
instruction can be changed by adding the OSX prefix instruction. Other prefix
instructions have no effect on this instruction.

The last four extended stack operations support the zero page direct and
absolute addressing modes to write/read 16-bit values to/from memory. They
support the IND and the OSX prefix instructions with the expected results to
the addressing mode (IND) and default stack pointer (OSX). Other prefix
instructions have no effect on these instructions.

##FORTH VM Support

The 6502/65C02 processors have long supported FORTH. The FORTH inner
interpreter can be implemented using the native instruction set. The 65C02-specific
instructions and addressing modes can be used to implement a FORTH
interpreter slightly faster than a FORTH interpreter which only uses 6502
instructions and addressing modes.

When developing the instruction set for the M65C02A soft-core processor, it
was decided that it would include instructions to support a FORTH VM. After an
analysis driven by a review of a 6502-compatible fig-FORTH implementation,
"Threaded Interpretive Languages" by R. G. Loeliger, research by Dr. Philip
Koopman, and the "Moving FORTH" articles written by Dr. Brad Rodriguez
(developer and maintainer of Camel FORTH), it was decided that the M65C02A
would directly implement (as recommended by Jeff Laughton) the FORTH VM using
internal 16-bit registers for the Interpretive Pointer (IP) and the Working
register (W). Further, it was decided to dedicate several instructions to
directly support the implementation of the FORTH VM:

    NXT         ; NEXT
    ENT         ; ENTER/CALL/DOCOLON using Return Stack (RS)
    PHI         ; Push IP on RS
    PLI         ; Pull IP from RS
    INI         ; Increment IP
    LDA ip,I++  ; Load byte into A using IP-relative with autoincrement address
    ADC ip,I++  ; Add byte into A using IP-relative with autoincrement address
    STA ip,I++  ; Store byte in A at IP-relative with autoincrement address
   
NXT and ENT implement the NEXT and ENTER/CALL/DOCOLON functionality needed to
implement a Direct Threaded Code (DTC) FORTH VM. If these instructions are
prefixed by IND, then an Indirect Threaded Code (ITC) FORTH VM is the result.

The FORTH VM supported by the M65C02A will provide the following mapping of
the various FORTH VM registers (as described by Brad Rodriguez):

    IP  - Internal dedicated 16-bit register
    W   - Internal dedicated 16-bit register
    PSP - System Stack Pointer (S), allocated in memory (page 1 an option)
    RSP - Auxiliary Stack Pointer (X), allocated in memory (page 0 an option)
    UP  - Memory (page 0 an option)
    X   - Not needed, {OP2, OP1} or MAR can provide temporary storage required

The following pseudo code defines the operations performed by the FORTH NEXT,
ENTER, and EXIT functions/words in terms of the ITC and the DTC models:

                   ITC                                   DTC
    ================================================================================
    NEXT:   W      <= (IP++) -- Ld *Code_Fld    ; W      <= (IP++) -- Ld *Code_Fld
            PC     <= (W)    -- Jump Indirect   ; PC     <= W      -- Jump Direct
    ================================================================================
    ENTER: (RSP--) <= IP     -- Push IP on RS   ;(RSP--) <= IP     -- Push IP on RS
            IP     <= W + 2  -- => Param_Fld    ; IP     <= W + 2  -- => Param_Fld
    ;NEXT
            W      <= (IP++) -- Ld *Code_Fld    ; W      <= (IP++) -- Ld *Code_Fld
            PC     <= (W)    -- Jump Indirect   ; PC     <= W      -- Jump Direct
    ================================================================================
    EXIT:
            IP     <= (++RSP) -- Pop IP frm RS  ; IP     <= (++RSP)-- Pop IP frm RS
    ;NEXT
            W      <= (IP++) -- Ld *Code_Fld    ; W      <= (IP++) -- Ld *Code_Fld
            PC     <= (W)    -- Jump Indirect   ; PC     <= W      -- Jump Direct
    ================================================================================
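
To make the ITC column of the pseudo code concrete, here is a toy Python model of NEXT, ENTER, and EXIT. It is a sketch under several assumptions that are not from the M65C02A documentation: memory is a dictionary of 16-bit cells keyed by byte address, "machine code" is represented by Python callables registered at made-up addresses, the parameter stack is a Python list, and the word names (TWO, PLUS, FOUR, HALT) are invented for the demo.

```python
class ItcVm:
    """Toy indirect-threaded VM; cells are 16-bit values keyed by byte address."""

    def __init__(self):
        self.mem = {}         # byte address -> 16-bit cell (threads and RS)
        self.prims = {}       # machine-code address -> Python callable
        self.ip = 0
        self.w = 0
        self.rsp = 0x00FE     # return stack pointer, grows downward
        self.ps = []          # parameter stack (a Python list for brevity)
        self.running = True

    def do_next(self):
        self.w = self.mem[self.ip]          # W  <= (IP++)
        self.ip = (self.ip + 2) & 0xFFFF
        return self.mem[self.w]             # PC <= (W); DTC would return self.w

    def do_enter(self):
        self.mem[self.rsp] = self.ip        # (RSP--) <= IP
        self.rsp = (self.rsp - 2) & 0xFFFF
        self.ip = (self.w + 2) & 0xFFFF     # IP <= W + 2
        return self.do_next()

    def do_exit(self):
        self.rsp = (self.rsp + 2) & 0xFFFF  # IP <= (++RSP)
        self.ip = self.mem[self.rsp]
        return self.do_next()

    def run(self, start_ip):
        self.ip = start_ip
        pc = self.do_next()
        while self.running:
            pc = self.prims[pc]()


def build_demo():
    vm = ItcVm()

    def two():
        vm.ps.append(2)
        return vm.do_next()

    def plus():
        b, a = vm.ps.pop(), vm.ps.pop()
        vm.ps.append((a + b) & 0xFFFF)
        return vm.do_next()

    def halt():
        vm.running = False
        return 0

    vm.prims = {0xF000: vm.do_enter, 0xF002: vm.do_exit,
                0xF004: two, 0xF006: plus, 0xF008: halt}
    # Code fields of the primitive words EXIT, TWO, PLUS, HALT
    vm.mem.update({0x1000: 0xF002, 0x1002: 0xF004,
                   0x1004: 0xF006, 0x1006: 0xF008})
    # Colon definition FOUR = TWO TWO PLUS EXIT (code field points at ENTER)
    vm.mem.update({0x2000: 0xF000, 0x2002: 0x1002, 0x2004: 0x1002,
                   0x2006: 0x1004, 0x2008: 0x1000})
    # Main thread: FOUR FOUR PLUS HALT
    vm.mem.update({0x3000: 0x2000, 0x3002: 0x2000,
                   0x3004: 0x1004, 0x3006: 0x1006})
    return vm


vm = build_demo()
vm.run(0x3000)
print(vm.ps)
```

Running the main thread nests into FOUR twice and leaves a single 8 on the parameter stack, with the return stack pointer back at its starting value. The DTC variant differs only in the last step of NEXT, where the jump goes to W itself rather than to the cell W points at.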

EXIT, the FORTH return from subroutine, is not supported by a dedicated
M65C02A instruction. EXIT is implemented as instruction sequences using the
other dedicated instructions:

    ITC             DTC
    ===             ===
    PLI             PLI
    IND NXT         NXT
   
Only a one (or two) cycle performance penalty is incurred by not providing
a dedicated instruction for EXIT.

ENT, PHI, and PLI all default to the RS, which is implemented by the
auxiliary stack feature of the X TOS register. If OSX is prefixed to these
instructions, the PS is used instead. The PS is implemented with the system
stack pointer (SK or SU) of the M65C02A.

PHI, PLI, and INI operate on the FORTH VM IP register. If prefixed by IND,
these instructions operate on the FORTH VM W register: PHW, PLW, INW.

The ip,I++ addressing mode is an M65C02A-specific addressing mode. The
instructions using this addressing mode can be used in a general manner
independently of a FORTH VM. However, the addressing mode was included in
order to access constants, literals, and pointers stored in the FORTH
instruction stream. Like most M65C02A instructions, they default to operating
on 8-bit values, but they support the IND, SIZ, and ISZ prefix instructions
with the expected results to the addressing mode and the operation width.

The LDA ip,I++ instruction loads the byte which follows the current IP
into the accumulator and advances the IP by 1. If this instruction is prefixed
by SIZ, then the word following the current IP is loaded into the accumulator
and the IP is advanced by 2. If prefixed by IND, the instruction becomes LDA
(ip,I++), which uses the 16-bit word following the current IP as a byte
pointer. The IP is advanced by 2, and the byte pointed to by the pointer is
loaded into the accumulator. If prefixed by ISZ, the word following the
current IP is used as a word pointer, while the IP is advanced by 2, to load a
word into the accumulator.
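
The four variants described above can be summarized in a small Python model. It is a sketch that assumes "the byte/word which follows the current IP" means the data at the address held in IP, and that words are stored low byte first as on the 6502; the function name `lda_ip` and the `prefix` strings are illustrative.

```python
def lda_ip(mem, ip, prefix=None):
    """Return (accumulator, new_ip) for LDA ip,I++ with an optional prefix."""

    def byte(addr):
        return mem[addr & 0xFFFF]

    def word(addr):
        return byte(addr) | (byte(addr + 1) << 8)

    if prefix is None:       # LDA ip,I++   : byte from the instruction stream
        return byte(ip), (ip + 1) & 0xFFFF
    if prefix == "SIZ":      # word from the instruction stream
        return word(ip), (ip + 2) & 0xFFFF
    if prefix == "IND":      # LDA (ip,I++) : byte through an in-line pointer
        return byte(word(ip)), (ip + 2) & 0xFFFF
    if prefix == "ISZ":      # word through an in-line pointer
        return word(word(ip)), (ip + 2) & 0xFFFF
    raise ValueError("unknown prefix: %r" % (prefix,))


mem = bytearray(0x10000)
mem[0x1000], mem[0x1001] = 0x34, 0x12     # in-line data at IP: 0x1234
mem[0x1234], mem[0x1235] = 0xCD, 0xAB     # variable the pointer refers to

for p in (None, "SIZ", "IND", "ISZ"):
    a, new_ip = lda_ip(mem, 0x1000, p)
    print(p, hex(a), hex(new_ip))
```

Only the unprefixed form advances IP by 1; every other form consumes a full 16-bit cell from the instruction stream.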

The LDA ip,I++ instruction is matched by the STA ip,I++. Without indirection,
the STA ip,I++ instruction will write directly into the FORTH VM instruction
stream. With indirection, the STA (ip,I++) instruction can be used for
directly updating byte/word variables whose pointers are stored in the FORTH
VM instruction stream. (Although the ability to create self-modifying FORTH
programs may be useful when compiling FORTH programs, the STA ip,I++
instruction is expected to be prefixed with IND or ISZ under normal usage.)

Finally, the ADC ip,I++ instruction allows constants (or relative offsets)
located at the current IP to be added to the accumulator. Like LDA ip,I++ and
STA ip,I++, the ADC ip,I++ supports the IND, SIZ, and ISZ prefix instructions.

For example, IP-relative conditional FORTH branches can be implemented using
the following instruction sequence:
           
            [SIZ] Bxx $1    ; [2[3]] test xx condition and branch if not true
            ISZ DUP [A]     ; [2] exchange A and IP (XAI)
            CLC             ; [1] prepare C flag for addition
            SIZ ADC ip,I++  ; [5] add IP-relative offset to A
            ISZ DUP [A]     ; [2] exchange A and IP (XAI)
    $1:

The IP-relative conditional branch instruction sequence only requires 12[13]
clock cycles, and IP-relative jumps require only 10 clock cycles.

Conditional branches and unconditional jumps to absolute addresses rather than
relative addresses can also be easily implemented. A conditional branch to an
absolute address can be implemented as follows:
           
            [SIZ] Bxx $1    ; [2[3]] test xx condition and branch if not true
            SIZ LDA ip,I++  ; [5] load absolute address and autoincrement IP
            IND DUP [A]     ; [2] transfer A to IP (TAI)
    $1:

Thus, a conditional branch to an absolute address requires 9[10] cycles, and
the unconditional absolute jump only requires 7 clock cycles. Clearly, if the
position independence of IP-relative branches and jumps is not required, then
the absolute address branches and jumps provide greater performance.

    (Note: the M65C02A supports the eight 6502/65C02 branch instructions which
    perform true/false tests of the four ALU flags. When prefixed by SIZ, the
    eight branch instructions support additional tests of the ALU flags which
    support both signed and unsigned comparisons. The four signed conditional
    branches supported are: less than, less than or equal, greater than, and
    greater than or equal. The four unsigned conditional branches supported are:
    lower than, lower than or same, higher than, and higher than or same. These
    conditional branches are enabled by letting the 16-bit comparison instructions
    set the V flag.)
   
    (Note: a general use of the ip,I++ addressing mode is for string operations.)
   
##PC-relative 16-bit Unconditional Branch

The M65C02A provides an unconditional branch to a 16-bit PC-relative address:

        BRL rel16
   
This instruction can be used to implement position-independent code modules. In
addition, it can be used to synthesize calls to position-independent
subroutines:

        PSH #($1+2)
    $1: BRL rel16
   
This instruction's behavior is not modified by any of the M65C02A prefix
instructions.

##Move Byte Instruction

The M65C02A provides a move byte instruction. The instruction includes an
immediate parameter which specifies whether uninterruptable block moves or
interruptable single byte moves are performed. In addition, the parameter
controls whether the source pointer and destination pointers are incremented,
decremented, or unchanged. Further, each pointer can be independently
controlled. Two mnemonics are reserved for this instruction, although only one
opcode is used:

        MVB sm,dm           ; Block Move, MSB of parameter byte is 0
        MVS sm,dm           ; Single Move, MSB of parameter byte is 1

The accumulator is used as the count register. The ZC flags in P are set by
this instruction so that the MVS mode can be used with a conditional branch to
test if the transfer is complete or not.

The X and Y registers function as the source and destination pointers,
respectively. The sm and dm fields are the source and destination pointer
modes. These fields are defined as: I (increment), D (decrement), or H (hold).
The source mode is encoded in bits [1:0] and the destination mode is
encoded in bits [5:4] of the parameter byte. The encodings are 3 - I, 2 - D,
and 0 - H (hold, unchanged).
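
The parameter byte layout can be illustrated with a short Python decoder and a behavioral model of the move itself. This is a sketch under stated assumptions: the accumulator is treated as a byte count that is decremented once per byte moved, the flags are not modeled, encoding 1 is rejected because the text does not define it, and the function names are illustrative.

```python
MODES = {3: "I", 2: "D", 0: "H"}     # encodings given in the text
STEP = {"I": 1, "D": -1, "H": 0}


def decode_mv(param):
    """Split the parameter byte into (single_move, source_mode, dest_mode)."""
    single = bool(param & 0x80)      # MSB: 0 = MVB (block), 1 = MVS (single)
    sm_bits = param & 0x03           # bits [1:0]
    dm_bits = (param >> 4) & 0x03    # bits [5:4]
    if sm_bits not in MODES or dm_bits not in MODES:
        raise ValueError("mode encoding 1 is not described in the text")
    return single, MODES[sm_bits], MODES[dm_bits]


def move(mem, x, y, a, param):
    """Copy bytes from [X] to [Y]; returns the updated (X, Y, A)."""
    single, sm, dm = decode_mv(param)
    while a != 0:
        mem[y] = mem[x]
        x = (x + STEP[sm]) & 0xFFFF
        y = (y + STEP[dm]) & 0xFFFF
        a = (a - 1) & 0xFFFF         # assumed: A counts the bytes remaining
        if single:                   # MVS moves one byte per execution
            break
    return x, y, a


mem = bytearray(0x10000)
mem[0x0100:0x0104] = b"ABCD"
print(move(mem, 0x0100, 0x0200, 4, 0x33), bytes(mem[0x0200:0x0204]))

mem[0x0300] = 0x55
move(mem, 0x0300, 0x0400, 3, 0x30)   # hold source, increment destination
print(bytes(mem[0x0400:0x0403]))
```

Holding the source pointer while incrementing the destination turns the move into a memory fill, and the single-move form would be wrapped in a loop with a conditional branch on the resulting flags.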

_________________
Michael A.


PostPosted: Tue Dec 01, 2015 2:03 am 
The following diagram provides a view of the M65C02A microcomputer that I've been using to test the core. Neither the external memory interface nor the timer module are fully developed at this time; the external memory interface has been used for debug rather than for test programs. All programs written to date use the internal block RAMs as shown in the diagram.
Attachment: M65C02A Microcomputer Block Diagram.JPG

The following two diagrams provide the Programming Model views for the M65C02A core in the 6502/65C02 compatibility mode and the extended capabilities mode.
Attachment: M65C02A Programming Model - Compatibility View.JPG

Attachment: M65C02A Programming Model - Extended Capabilities View.JPG

Finally, the following image provides a block diagram of the M65C02A core logic.
Attachment: M65C02A Core Block Diagram.JPG

I continue to work on a comprehensive MS Word User Guide, but I've attached the document in its current form for those who may be interested. A lot of work remains. I've edited and re-edited large sections of the document. It's surprising how difficult writing a clear and coherent user guide for the M65C02A has been and continues to be. Virtually every time I open the document, I find something that needs to be added, clarified, or simply deleted. A lot more remains to be added, and I'll be working on that while I turn my attention to porting fig-FORTH, EhBasic, and possibly developing a simulator for the M65C02A.

_________________
Michael A.


PostPosted: Tue Dec 01, 2015 9:15 am 
Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10986
Location: England
Love the block diagrams! For most projects documentation is part of the 90% which remains after the first 90% is done, but it is very worthwhile.


PostPosted: Thu Dec 31, 2015 3:14 am 
Just an update.

I was wrong to think that the core was complete. I am continuing to update and improve the User's Guide for the core, but I also began one of those related projects, and as a consequence of that work some important changes have occurred to the M65C02A core's instruction set.

I have been porting the Pascal compiler written by Ronald Mak, which is described in his first book: "Writing Compilers and Interpreters - An Applied Approach". The compiler is written in C, and I've been able to recompile it using my Visual C++ 6.0 development kit with only minor changes required to bring the source code into compliance with the newer C/C++ compiler.

I have succeeded in modifying the code generator to generate M65C02A assembly language. I've made some changes to the code generator to account for the relative simplicity of the assemblers that I expect to use with the compiler. Professor Mak makes extensive use of the text string substitution features of the Microsoft 8086 assembler. Although that simplifies the code generator by putting the addressing mode into the labels, which the assembler resolves, it makes it difficult to determine whether base-pointer relative or direct addressing is required.

I have also included some simple optimizations in the code generator. Given that the compiler is a single pass compiler, there are some areas in the code generation that can certainly be improved. Overall the code generated is fairly good, but when examined it is clear that a peephole optimizer could provide some significant benefits. There are also some improvements that could be made to expression evaluation.

One of the deficiencies in the M65C02A instruction set that became apparent as I was porting the code generation to the core was that the 6502/65C02/M65C02A lack a means to directly add/subtract values to the stack pointer. Even with all of the prefix instructions for register overrides that I included in the M65C02A, I did not include a way for the system stack pointers to be treated as accumulators. I reasoned that providing that capability for the X and Y index registers should be sufficient. However, allocating and deallocating local variables and function/procedure parameters is easily performed by the 8086 through direct manipulation of the stack pointer.

Since I did not have any free opcodes, I decided to replace the BSR rel16 instruction with a stack pointer adjustment instruction, ADJ. At first I thought that it would be implemented with an immediate operand, i.e. ADJ #imm, and could support the SIZ prefix instruction to extend it to a 16 bit immediate value.

That decision turned out to be rather short lived. In Pascal the function/procedure deallocates the parameters. This is elegantly done in the 8086 architecture using the RET N instruction, where N adjusts the stack pointer after the function/procedure return address has been pulled from the stack. In the 6502/65C02 this task is somewhat difficult. To make the penalty for using subroutines as low as possible, the number of cycles needed to remove the parameters from the stack must be kept to a minimum. My solution was to make the ADJ instruction use the contents of the Y register. This allows local variables to be easily allocated and removed, and function parameters to be removed in the calling routine by simply returning Y with the appropriate value and following every call with an ADJ instruction. This approach also works for C and other related HLLs.

I've purchased a meta-assembler, the Cross-32 Assembler recommended by Garth, and am hoping that I will be able to convert/extend one of the provided syntax tables to support the extended instruction set of the M65C02A core. Hopefully, the package will arrive early next week so that I can start to generate some machine code to run through one or more of my development boards. While waiting for that tool to arrive, I've started work on the runtime library. My first task was to write an efficient 16x16 signed multiply routine: _imul. In the process of implementing a Booth multiply routine, it became apparent that I was using a lot of cycles and memory managing the sign bits of the product. It was particularly difficult to account for sign when arithmetic overflow can occur during the computation of the partial products.

After several fruitless attempts, each leading to more complicated implementations, I finally decided to take advantage of a feature that I had included in the core, but had not fully developed: Arithmetic Shift Right or ASR. I overloaded the LSR A instruction using the IND prefix instruction to implement the ASR operation that greatly simplifies the implementation of the Booth signed multiplication routine. A unique feature of the M65C02A ASR instruction is that it can correctly reconstruct the sign bit in the event that an arithmetic overflow is signaled:
Code:
(En_ASR) ? (V ^ ((SIZ) ? Q[15] : Q[7])) : Ci

Thus, whenever an overflow is present, the sign bit is determined to be the complement of the existing sign bit. Without overflow, the existing sign bit is simply copied.
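
The effect of that expression can be checked with a small Python model of an add followed by the ASR. It is only a sketch: the helper names are illustrative, the `bits` parameter stands in for the SIZ selection between Q[7] and Q[15], and only the shifted-in bit from the expression above is modeled.

```python
def add_with_v(a, b, bits=16):
    """Two's-complement add returning (result, V) for the given width."""
    mask = (1 << bits) - 1
    sign = 1 << (bits - 1)
    r = (a + b) & mask
    # V is set when both operands share a sign that the result does not have
    v = 1 if (~(a ^ b)) & (a ^ r) & sign else 0
    return r, v


def asr(q, v, bits=16):
    """Shift right by one, shifting in V ^ Q[msb] as the new sign bit."""
    msb = (q >> (bits - 1)) & 1
    shift_in = v ^ msb
    return (shift_in << (bits - 1)) | (q >> 1)


# 0x7000 + 0x7000 overflows 16 bits, yet halving the sum still gives 0x7000.
r, v = add_with_v(0x7000, 0x7000)
print(hex(r), v, hex(asr(r, v)))

# Without overflow the sign bit is simply copied: (4 + -2) / 2 = 1.
r, v = add_with_v(0x0004, 0xFFFE)
print(hex(r), v, hex(asr(r, v)))
```

In effect V ^ Q[msb] recovers the sign of the full-width sum, so the add-then-halve step needed by the multiply loop never loses the top bit.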

Finally, I found that my Booth implementation was using all of the registers in the A register stack, two of the registers in the Y register stack, and two registers from the X register stack. This high register usage was due primarily to my desire to keep the double length product, multiplier, and the loop counter in registers. (The multiplicand was stored on the stack, and added/subtracted from the most significant product register using the stack-relative addressing mode ADC/SBC instructions.)

The key to a fast implementation of the Booth multiplication algorithm is an efficient method for testing the Booth recoding/guard bit and the least significant bit of the multiplier. In the 6502/65C02 architecture, only the N, V, and C flags can be readily tested. If the Booth recoding/guard bit is found in the C flag, then determining the state of the multiplier's LSB is the problem. I made several attempts using combinations of ROL and ROR, but none of these attempts improved the cycle count in the inner loop of the Booth multiplication routine.

Sometime during the night I came up with the idea of reversing the multiplier so that the lsb is then shifted left into the C flag as the Booth recoding/guard bit and the state of the next most significant bit is captured in the N flag. This organization of the multiplier provides a very efficient mapping onto the 6502/65C02 architecture. The problem is that it will have a fixed cost to perform the reversal of the multiplier using ROR/ROL and two registers.

My final solution was to build the register reversal logic into the TOS register of the A register stack. (That register also supports byte swapping.) The additional resources required were minimal, and the performance was not degraded. The resulting routine efficiently performs signed multiplication of 16-bit integers with a cycle count ranging from 284 cycles to 419 cycles. My current core, targeting an XC3S200A-4 part, can support a 24 MHz clock rate. At that clock rate, the multiplication routine provides a 32-bit product in 11.8 to 17.5 microseconds, which allows for support of fairly reasonable sample rates for industrial control systems if not audio processing. (Although it is possible to attach a hardware multiplier to the core, I wanted to try to optimize the M65C02A core as much as possible before resorting to using the big iron solution. In addition, the additional logic multiplexers needed to interface it to the core will reduce the operating frequency.)

An interesting result from this effort is that I can see my way to using the technique (and instructions) developed for this runtime library function to support 24x24 multiplications and 32x32. The 24x24 operation is what is required to support the Pascal real (single precision) data type.

The routine that I developed for the M65C02A Mak Pascal runtime, which uses all of the registers in the A register stack and one register from the Y register stack, is attached below:
Code:
0000: A900      [2] _imul:  lda #0          ;; {TOS, NOS, BOS} <= { 0,  x,  x}
0002: 0B        [1]         dup a           ;; {TOS, NOS, BOS} <= { 0,  0,  x}
0003: 0B        [1]         dup a           ;; {TOS, NOS, BOS} <= { 0,  0,  0}

0004: A010      [2]         ldy #16         ;; Load loop counter

0006: AB68      [4]         pla.w           ;; {TOS, NOS, BOS} <= {`R,  0,  0}

0008: 18        [1]         clc             ;; Init Booth recoding bit: C
0009: 2B        [1]         rot a           ;; {TOS, NOS, BOS} <= { 0,  0, `R}

000A: 8003      [2]         bra Test_B 
                           
000C: AB0A      [2] Loop:   asl.w a         ;; `R <<< 1 (Arithmetic Shift)
000E: 2B        [1]         rot a           ;; {TOS, NOS, BOS} <= {PH, PL, `R}

000F: 9009      [2] Test_B: bcc Sub_Shift   ;; {C, x} ? Add_Shft : Sub_Shft

                    Add_Shift:
0011: 3010      [2]         bmi Shft_P      ;; {1, N} ? P >> 1 : (P += M) >> 1
0013: 18        [1] Add_M:  clc             ;; Clr C flag before addition of M
0014: AB8B6300  [6]         adc.w 0,S       ;; Add Multiplicand from stack
0018: 8009      [2]         bra Shft_P

                    Sub_Shift:
001A: 1007      [2]         bpl Shft_P      ;; {0, N} ? (P -= M) >> 1 : P >> 1
001C: 38        [1] Sub_M:  sec             ;; Set C before subtraction of M
001D: AB8BE300  [6]         sbc.w 0,S       ;; Sub Multiplicand from stack
0021: 8000      [2]         bra Shft_P
                   
0023: BB4A      [2] Shft_P: asr.w a         ;; PH >>> 1 (Arithmetic Shift)
0025: 2B        [1]         rot a           ;; {TOS, NOS, BOS} <= {PL, `R, PH}
0026: AB6A      [2]         ror.w a         ;; PL >> 1 (Rotate PL Right)
0028: 2B        [1]         rot a           ;; {TOS, NOS, BOS} <= {`R, PH, PL}
                   
0029: 88        [1]         dey             ;; Decrement loop counter
002A: D0E0      [2]         bne Loop        ;; if(Cntr) loop
                           
002C: 2B        [1] _Exit:  rot a           ;; {TOS, NOS, BOS} <= {PH, PL, `R}
002D: A002      [2]         ldy #2          ;; Remove M from stack after return
002F: 60        [3]         rts             ;; Exit


--Edit: Added instruction cycle counts to the code listing. mam, 15L31
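
For readers who want to experiment with the algorithm outside the core, the following Python sketch models the same idea: a radix-2 Booth multiply in which the multiplier is bit-reversed so that its bits are consumed by left shifts, and in which the product is halved with the overflow-corrected arithmetic shift. It follows the structure of the listing but is not a cycle-accurate or register-accurate translation; all function names are illustrative.

```python
def reverse16(x):
    """Reverse the bit order of a 16-bit value."""
    r = 0
    for _ in range(16):
        r = (r << 1) | (x & 1)
        x >>= 1
    return r


def add_sub16(a, b, subtract):
    """Return (16-bit result, V flag) for a + b or a - b."""
    if subtract:
        b = (~b) & 0xFFFF            # a - b computed as a + ~b + 1
        total = a + b + 1
    else:
        total = a + b
    r = total & 0xFFFF
    v = ((~(a ^ b)) & (a ^ r) & 0x8000) >> 15
    return r, v


def booth_mul16(multiplier, multiplicand):
    """Signed 16x16 multiply returning the 32-bit two's-complement product."""
    m = multiplicand & 0xFFFF
    r = reverse16(multiplier & 0xFFFF)
    ph = pl = 0
    guard = 0                        # Booth guard bit starts at 0 (the CLC)
    for _ in range(16):
        cur = (r >> 15) & 1          # current multiplier bit, LSB first
        v = 0
        if (cur, guard) == (0, 1):   # end of a run of ones: P += M
            ph, v = add_sub16(ph, m, subtract=False)
        elif (cur, guard) == (1, 0): # start of a run of ones: P -= M
            ph, v = add_sub16(ph, m, subtract=True)
        sign = v ^ (ph >> 15)        # sign rebuilt as in the ASR expression
        pl = ((ph & 1) << 15) | (pl >> 1)
        ph = (sign << 15) | (ph >> 1)
        guard = cur
        r = (r << 1) & 0xFFFF
    return (ph << 16) | pl


print(hex(booth_mul16(3, 5)))
print(hex(booth_mul16(-32768, -32768)))
```

The second call exercises the case that motivated the ASR change: subtracting a multiplicand of -32768 overflows 16 bits, and the corrected sign bit keeps the partial product valid.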

_________________
Michael A.


PostPosted: Fri Jan 01, 2016 6:29 pm 
Joined: Sun Dec 29, 2002 8:56 pm
Posts: 460
Location: Canada
Quote:
I have succeeded in modifying the code generator to generate M65C02A assembly language.

It's quite a feat to get a compiler working.

It sounds like you've discovered that the 6502 isn't the easiest processor to implement a compiler on. At some point with the number of changes to the original '02 design the cpu becomes a non-6502. Do you still have backwards compatibility ?

_________________
http://www.finitron.ca


PostPosted: Fri Jan 01, 2016 7:52 pm 
Rob Finch wrote:
It's quite a feat to get a compiler working.

Don't want to convey the wrong impression, but most of the hard work was done by Professor Mak. I've simply got it to compile with a more modern compiler, and modified the code generation functions as needed. Professor Mak's compiler and the accompanying text have made this task substantially easier.

I have implemented some optimizations within the code generator, but some additional work remains. For example, there are register saves which are immediately followed by reloads of the same value/variable. Apparently some of these sub-optimal productions are due to the recursive descent, single pass architecture that Professor Mak has used in the compiler. I look forward over the coming year to working with the compiler to implement additional optimizations, and possibly implementing a peephole optimizer for the generated assembly language output.
Rob Finch wrote:
It sounds like you've discovered that the 6502 isn't the easiest processor to implement a compiler on. At some point with the number of changes to the original '02 design the cpu becomes a non-6502. Do you still have backwards compatibility ?

Boy howdy is that an understatement. :D

It was with some of the issues of the base 6502/65C02 instruction set in mind that I designed the extensions incorporated into the M65C02A core. Full backward compatibility was the primary objective. It was for this reason that I chose to use prefix instructions rather than mode bits in the PSW as used by the 65816. I measure backward compatibility by having the core pass Klaus Dormann's 6502_functional_tests.

However, from the number of SIZ prefix instructions that must be output by the code generator, it is clear that the core could see a significant performance improvement if, for example, the operand width defaulted to 16 bits and the SIZ prefix was used to down shift from 16 to 8 bits. If this approach was taken, then the core would no longer be backward compatible. Therefore, for the near term, I am going to continue to focus on maintaining backward compatibility with one caveat: the Rockwell instructions do not appear compatible with HLLs so they will be removed and replaced with additional BP-relative, SP-relative, and IP-relative instructions. Especially needed are the INC bp,B and DEC bp,B instructions. The amount of code required to increment/decrement local FOR loop control variables is a performance limitation that can be easily mitigated by simply supplying BP-relative versions of these two instructions. Thus, I am willing to lose W65C02S compatibility, but retain 65SC02 compatibility.

When I was doing the initial port, I found it advantageous to use the core's Forth VM IP in a manner similar to the BX register: register indirect addressing. I have a soft spot for the threaded code FORTRAN compiler described by DEC's James Bell. So I decided to replace all of those code productions with 6502/65C02 or M65C02A BP-relative addressing mode instructions. I hope to modify the compiler later in the year to produce direct threaded code. I am interested in studying the potential reduction in code size and performance. Given the small address space, getting the most out of the processor, with a limited amount of on-chip block RAM, is an interesting research area.

The use of three 3-deep register stacks has certainly made it much easier to implement an HLL on the M65C02A. As shown in the example code for the signed multiplication routine in a prior post, the register stacks can be used to good advantage. The BP-relative addressing modes, bp,B and (bp,B),Y, are also very handy in implementing the Pascal virtual machine. These two features of the M65C02A are the most used enhancements, beyond the SIZ prefix instruction, that allow the M65C02A to efficiently support the Mak Pascal compiler.

_________________
Michael A.

