M65C02A Core

MichaelM · Post by **MichaelM** » Fri Aug 08, 2014 1:15 am

Dr Jefyll wrote:

I'm interested in the micro-programmability you mentioned. Here's a question that's probably completely ridiculous, but I have to ask: would the core's resources allow end-user microcode to be loaded at runtime?? I realize it's hardly trivial -- a whole new datapath would have to be created, not to mention the documentation you'd need to provide. But the prize would be enormous!! Specifically, it blows away the constraints I just mentioned about generality, and trying to guess what your users need. Loadable microcode would turn your core into a Swiss army knife -- except every tool would be the real deal!!!

I have that on my list to incorporate into the IO page. There are two options. The first would use the dual-port nature of the block RAMs in which the microprogram ROM contents are stored. The primary port is used by the microsequencer, and the other port would be used to write the user microprogram. The second would be to reboot the FPGA with another configuration image altogether.

The first option provides a way for a program, either in the instruction set of the processor or in the microprogram itself, to load specific areas of the microprogram with new sequences. The second option allows the processor core itself to be fully reconfigured. I am leaning toward implementing both capabilities after I've gotten the extended instruction set tested.

I think that I will require the weekend to finish the microcode and write some test code to verify that the extended instruction set behaves as expected. The microcode, as currently partitioned, is composed of two 36-bit microprograms. One controls each instruction sequence, and the other controls the ALU operations. The instruction sequence microprogram operates at the memory cycle level, and the ALU control microprogram operates at the instruction cycle level.

To implement the first approach, I will need to reformulate the microprogram BRAMs as dual-port memories. I have previously done this in order to create a dual core version of the M65C02 core. It is pretty much a matter of pulling the BRAM outside of the core, and adding the port connections to the core. On the other BRAM port, a set of data registers will be necessary to hold the write data until all 36/72 bits have been loaded.
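A rough software model of the write-side holding registers, offered only as an illustration. The 8-bit write path, the least-significant-byte-first load order, the 512-word depth, and the names `MicrocodeWritePort`, `write_byte`, and `commit` are all assumptions for this sketch, not details of the M65C02A design.

```python
class MicrocodeWritePort:
    """Toy model: collect bytes, then write one wide microword to the BRAM."""

    def __init__(self, depth=512, width_bits=36):
        # depth and width_bits are illustrative assumptions
        self.width_bits = width_bits
        self.bytes_per_word = (width_bits + 7) // 8   # 5 bytes for 36 bits
        self.bram = [0] * depth                       # second-port view of the BRAM
        self.staging = []                             # holding registers

    def write_byte(self, value):
        """Latch one byte (assumed LSB first) into the holding registers."""
        if len(self.staging) >= self.bytes_per_word:
            raise RuntimeError("holding registers full; commit first")
        self.staging.append(value & 0xFF)

    def commit(self, address):
        """Write the assembled microword to the BRAM in a single operation."""
        if len(self.staging) != self.bytes_per_word:
            raise RuntimeError("microword not fully loaded")
        word = 0
        for i, b in enumerate(self.staging):
            word |= b << (8 * i)
        word &= (1 << self.width_bits) - 1            # drop the unused upper bits
        self.bram[address] = word
        self.staging = []
        return word


port = MicrocodeWritePort()
for b in (0x21, 0x43, 0x65, 0x87, 0x09):
    port.write_byte(b)
loaded = port.commit(0x010)
```

The point of the holding registers is that the microsequencer never sees a partially updated microword: the BRAM location changes only at `commit`.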

The second approach simply requires instantiating the FPGA-specific configuration control object to the processor IO page. The SPI Master currently included could be used to update the configuration images from data transmitted to the microcomputer over the serial ports. The configuration control object can then be induced to load the desired configuration from two or more images stored in the serial EPROM.

Dr Jefyll · Post by **Dr Jefyll** » Fri Aug 08, 2014 3:13 am

Quote:

I have that on my list

Huh! Well, that's interesting -- at least I think so. A 65xx machine with customizable instructions? That's a new slant -- very intriguing!

Quote:

The second option allows the processor core itself to be fully reconfigured.

Offhand I'd say that makes the first option unnecessary, since all it offers is a way to sidestep rebooting the FPGA. Or am I missing something? Updating µcode on the fly (no reboot) is kind of an esoteric capability, I'd say. Maybe it'd get used, but I'm sure there's an associated cost, and meanwhile the potential may go untapped.

Quote:

The microcode, as currently partitioned, is composed of two 36-bit microprograms. One controls each instruction sequence, and the other controls the ALU operations. The instruction sequence microprogram operates at the memory cycle level, and the ALU control microprogram operates at the instruction cycle level.

Documenting this would be non-trivial, but would make for good reading, IMO. The processor within the processor!

-- Jeff

GARTHWILSON · Post by **GARTHWILSON** » Fri Aug 08, 2014 5:23 am

MichaelM wrote:

A quick summary from your responses so far is that it is definitely advantageous to implement NEXT as a microprogram sequence. At this moment I prefer to implement IP as an external location in zero page. Jeff's suggestion, from his KimKlone work, that the location of IP be implicit has merit and is something that the microprogram architecture can support. (This capability is in the current implementation and was recently enhanced to support 16-bit relative addressing.) It also appears that there are several other key FORTH words that would be worth implementing as microsequences: ENTER, EXIT, @ (load (TOS)), and ! (store (TOS),NOS). It also appears that it would be nice to support the common operations like addition (+), subtraction (-), AND, ORA, EOR, etc. from the parameter stack for cell-sized (16-bit) operations.

Is this a good summary of your recommendations to date?

I also gather from the discussions so far that the issue of where to allocate the parameter stack and which processor register to use is not yet resolved. It does appear that it would be advantageous to be able to support stack operations with either X or S.

From your comments, and those in Brad Rodriguez's Moving Forth articles, the two stack pointers of the 6809 processor provided that processor a definite advantage as a FORTH engine over competitors such as the 8051, Z80, or the 6502. Although not supported by the M65C02A microarchitecture at the moment, it may be possible to add this feature without impacting performance too much.

Now a whole day after the above post, it occurs to me that a second stack pointer could be used in a different way to increase performance further: to use RTS as a single-instruction NEXT in DTC Forth, as Bruce tells of at viewtopic.php?t=586, or PLX, JMP(0,X) as a two-instruction NEXT in ITC Forth, as he tells of at viewtopic.php?t=586 . It would need to specify the second hardware stack though, to avoid the problems, especially with interrupts, which these methods mostly rule out on an '816.
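The RTS-as-NEXT idea can be sketched in a few lines of Python. This is only a model under stated assumptions: a 16-bit stack pointer that walks a thread laid out in ascending memory, and thread cells that hold each word's code address minus one (because RTS increments the pulled address). The names `rts_next` and `build_thread` and the addresses used are hypothetical.

```python
def rts_next(memory, sp):
    """Model RTS: pull a 16-bit address (low byte first), add 1, return (pc, sp)."""
    lo = memory[(sp + 1) & 0xFFFF]
    hi = memory[(sp + 2) & 0xFFFF]
    pc = (((hi << 8) | lo) + 1) & 0xFFFF
    return pc, (sp + 2) & 0xFFFF


def build_thread(memory, base, word_addresses):
    """Store each word's code address minus 1 as little-endian cells.

    Returns the initial stack pointer, which sits one byte below the first cell.
    """
    addr = base
    for target in word_addresses:
        cell = (target - 1) & 0xFFFF
        memory[addr] = cell & 0xFF
        memory[addr + 1] = cell >> 8
        addr += 2
    return (base - 1) & 0xFFFF


memory = bytearray(0x10000)
words = [0x8000, 0x8123, 0x9000]          # hypothetical code addresses of three words
sp = build_thread(memory, 0x2000, words)
visited = []
for _ in words:
    pc, sp = rts_next(memory, sp)         # each "NEXT" is a single RTS
    visited.append(pc)
```

In this model the stack pointer plays the role of the Forth IP, which is why anything else that pushes onto the same stack, such as an interrupt, would overwrite the thread.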

MichaelM · Post by **MichaelM** » Fri Aug 08, 2014 11:58 pm

Dr Jefyll wrote:

MichaelM wrote:

The second option allows the processor core itself to be fully reconfigured.

Offhand I'd say that makes the first option unnecessary, since all it offers is a way to sidestep rebooting the FPGA. Or am I missing something? Updating µcode on the fly (no reboot) is kind of an esoteric capability, I'd say. Maybe it'd get used, but I'm sure there's an associated cost, and meanwhile the potential may go untapped.

Full reconfiguration of Xilinx and Altera FPGAs is a reboot of the FPGA. I personally prefer the first approach, and would go with the second approach if the objective was to switch to a wholly new architecture rather than simply replace the microcode. It is possible, with the second approach, for multiple FPGA images to be stored in the Serial Configuration Flash, and for each image to represent a modified instruction set of a single architecture (6502, 65C02S, M65C02, M65C02A, etc.), or a new architecture altogether (6800, 6809, 68HC11, 8080/8085, Z80, 8088/80188, etc.).

Dr Jefyll wrote:

Documenting this would be non-trivial, but would make for good reading, IMO. The processor within the processor!

I have a post in prep to do this, but I've not gotten through it completely. An incomplete description of the microsequencer that I used and the microprogram for the M65C02 can be found in the wiki for that core on GitHub.

MichaelM · Post by **MichaelM** » Sat Aug 09, 2014 1:13 am

BDD:

I have a question regarding the PER instruction.

BigDumbDinosaur wrote:

MichaelM wrote:

I am not sure that PER would serve much purpose if BRA/BSR rel16 were available unless it was also possible to perform these two operations based on the top two locations of the stack.

PER is very useful, because the value that ends up on the stack is computed at run-time, making fully relocatable code possible. For example, a reference to a data table would become portable if PER were used to push the data table's address, rather than have the address set at assembly time. If all such references are generated by PER, then the program can be loaded anywhere and it will run without alteration.

I have read and reread the description of the PER instruction in the Eyes and Lichty manual several times. The text appears to describe this instruction as an assembly time calculation.

In the highlighted text of your response above, it seems that you are implying that the calculation of the value is a runtime operation. Am I misinterpreting your response?

From the Eyes and Lichty manual, it seems to me that it is the assembler that computes the relative address between the location counter of the instruction following the PER instruction and the target address provided in the operand field of the PER instruction. The resulting 16-bit relative offset is the value that the assembler inserts into the instruction stream.

The code examples in the Eyes and Lichty manual appear to provide incorrect values for the PER rel16 operand. The example in the text appears to be correct, and supports the idea that the operand of the PER is an assembly time calculation rather than a run time calculation.

From the '816 instruction set, I can't see how to use the rel16 offset pushed by PER to access the data in a PC-relative manner. Have I missed an instruction or addressing mode? In other words, how can the PC value of the instruction following PER be used as the base address and the rel16 offset on the stack as the index to access the data in a position-independent manner?

GARTHWILSON · Post by **GARTHWILSON** » Sat Aug 09, 2014 2:08 am

At assembly time, the offset is calculated. At run time, the offset is added to whatever the current address is, to get the effective address to push. To expand on your quote from BDD, a possible application might be to have a complex data structure with code associated with it, kind of OOP-like. The whole thing could be scooted up or down in memory even after being loaded. Code that calls the structure does not have to concern itself with how the structure's own code knows how to access the right data within the structure. It is normal to do this kind of thing in Forth, although it's often not made to be relocatable.
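The two calculations can be written out as a small Python sketch. The function names `per_operand` and `per_push` and the addresses are hypothetical; the only facts relied on are that PER is three bytes long and that the offset is relative to the instruction that follows it.

```python
def per_operand(per_address, target):
    """Assembly time: 16-bit displacement from the instruction after PER to target."""
    next_instruction = per_address + 3      # PER is an opcode plus two operand bytes
    return (target - next_instruction) & 0xFFFF


def per_push(per_address, operand):
    """Run time: PC of the next instruction plus the operand is what gets pushed."""
    return (per_address + 3 + operand) & 0xFFFF


# Hypothetical layout: PER assembled at $C000, data table at $C120.
operand = per_operand(0xC000, 0xC120)
pushed_at_origin = per_push(0xC000, operand)

# The same bytes moved up by $1000: the operand is unchanged, the pushed value follows.
pushed_after_move = per_push(0xD000, operand)
```

Because the operand is a constant baked into the instruction stream, relocating code and table together changes only the run-time sum.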

Dr Jefyll · Post by **Dr Jefyll** » Sat Aug 09, 2014 2:12 am

From section 3.5.20 of the '816 data sheet:

Quote:

The second and third bytes of the instruction are added to the Program Counter, which has been updated to point to the opcode of the next instruction. [...] the result is stored on the stack.

This is the second of the two calculations Garth mentioned.

MichaelM · Post by **MichaelM** » Sat Aug 09, 2014 3:58 am

Dr Jefyll wrote:

From section 3.5.20 of the '816 data sheet:

Quote:

The second and third bytes of the instruction are added to the Program Counter, which has been updated to point to the opcode of the next instruction. [...] the result is stored on the stack.

This is the second of the two calculations Garth mentioned.

Thanks for your patience. What I've been looking for is a concrete, but correct example, that defines the value that the assembler places in the instruction stream. It appears to me that the values given for the PER operand fields in the Eyes and Lichty manual are all incorrect.

The addressing mode discussed in the '816 data sheet paragraph referenced above apparently applies to both PER and BRL. The effect of using PER followed by BRL is to synthesize a BSR rel16 instruction.

The usage being described is as follows:

Code:

     PER RETURN - 1                ; Push return address onto stack
     BRL SUBROUTINE                ; Branch to subroutine using relocatable method
RETURN:
     .
     .
     .
     .
SUBROUTINE:
     .
     .
     RTS                          ; Return to location RETURN

The following is the text from the Eyes and Lichty manual that describes the code example provided above:

Eyes & Lichty - Page 179 wrote:

The 65802 and 65816 can synthesize the BSR function using their PER instruction. You use PER to compute and push the current run-time return address; since its operand is the return address’ relative offset (from the current address of the PER instruction), PER provides relocatability. As Fragment 12.2 shows, once the correct return address is on the stack, a BRA or BRL completes the synthesized BSR operation.

In this case, you specify as the assembler operand the symbolic location of the routine you want to return to minus one. Remember that the return address on the stack is pulled, then incremented, before control is passed to it. The assembler transforms the source code operand, RETURN – 1, into the instruction’s object code operand, a relative displacement from the next instruction to RETURN – 1. In this case, the displacement is $0002, the difference between the first byte of the BRL instruction and its last byte. (Remember, PER works the same as the BRL instruction; in both cases, the assembler turns the location you specify into a relative displacement from the program counter.) When the instruction is executed, the processor adds the displacement ($0002, in this case) to the current program counter address (the address of the BRL instruction); the resulting sum is the current absolute address of RETURN – 1, which is what is pushed onto the stack.

If at run-time the PER instruction is at $1000, then the BRL instruction will be at $1003, and RETURN at $1006. Execution of PER pushes $1005 onto the stack, and the program branches to SUBR1. The RTS at the end of the subroutine causes the $1005 to be pulled from the stack, incremented to $1006 (the address of RETURN), and loaded into the program counter.

If, on the other hand, the instructions are at $2000, $2003, and $2006, then $2005 is pushed onto the stack by execution of PER, then pulled off again when RTS is encountered, incremented to $2006 (the current run-time address of RETURN), and loaded into the program counter.

Am I missing something from this discussion? The whole purpose of the foregoing discussion was to demonstrate that the PER RETURN - 1 (0x620200) pushed the sum of the PC for the BRL instruction (0x1003) plus the operand of the PER instruction (0x0002) onto the stack. It appears to me that the same result could have been achieved by PEA RETURN - 1 instruction (0xF40510) followed by the BRL SUBROUTINE instruction in far fewer cycles.

I suppose that the answer to my earlier question regarding how to access data in a relocatable manner is that the PER instruction pushes the sum of the run-time PC of the instruction which follows the PER instruction plus the relative displacement of the PER instruction. To access the data in a relocatable manner, a post-indexed stack relative instruction must be used: LDA (sp,S),Y; STA (sp,S),Y; etc.
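A minimal Python model of the post-indexed stack-relative access described here, assuming the usual 65xx conventions: the stack grows downward, a 16-bit push stores the high byte first, and S points one byte below the last byte pushed. The data bank register is ignored, and the names `push_word` and `stack_relative_indirect_indexed`, along with the addresses, are hypothetical.

```python
def push_word(memory, s, value):
    """Push a 16-bit value high byte first; S ends one below the low byte."""
    memory[s] = (value >> 8) & 0xFF
    memory[(s - 1) & 0xFFFF] = value & 0xFF
    return (s - 2) & 0xFFFF


def stack_relative_indirect_indexed(memory, s, offset, y):
    """Effective address for (offset,S),Y: 16-bit pointer at S+offset, plus Y."""
    lo = memory[(s + offset) & 0xFFFF]
    hi = memory[(s + offset + 1) & 0xFFFF]
    return (((hi << 8) | lo) + y) & 0xFFFF


memory = bytearray(0x10000)
table = 0x4120                                   # run-time address of the data table
memory[table:table + 4] = bytes([0x11, 0x22, 0x33, 0x44])

s = push_word(memory, 0x01FF, table)             # stands in for PER pushing the address
values = [memory[stack_relative_indirect_indexed(memory, s, 1, y)] for y in range(4)]
```

With the pointer at offset 1 from S, stepping Y walks the table wherever it happens to sit at run time.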

I guess I will have to give the implementation of PER a bit more thought. I have added the capability to the M65C02A core to perform a BSR rel16 operation. Thus, there is no need to synthesize that instruction using PER followed by BRA rel16. I will have to see if the capability to perform rel16 operations will allow me to perform the 16-bit offset calculation required for PER. The architecture of the M65C02A does not currently allow that resultant to be pushed onto the stack; it is used as the target address, i.e. the destination of BRA rel16 or BSR rel16. Since it is possible to push the PC, it may be a simple matter to add a multiplexer to push the MAR (Memory Address Register) instead of {PCH, PCL} when PER is the instruction.

It does seem that if BSR rel16 is available, and the post-indexed stack relative addressing modes are supported, (sp,S),Y, then the M65C02A should strive to support PER rel16. Otherwise, a high cycle count instruction sequence would be required to provide relocatable access to data structures.

Dr Jefyll · Post by **Dr Jefyll** » Sat Aug 09, 2014 4:43 am

Quote:

It appears to me that the same result could have been achieved by PEA RETURN - 1 instruction (0xF40510) followed by the BRL SUBROUTINE instruction in far fewer cycles.

Yes it could -- if the code would only stay in one place! But to use PEA the assembler would have to know the absolute address of RETURN -- and that's undetermined, since we intend it to be relocatable. What the assembler can figure out, however, is displacement between two unknown expressions: RETURN -1 and *. Maybe BDD can confirm that for us.

-- Jeff

BigDumbDinosaur · Post by **BigDumbDinosaur** » Sat Aug 09, 2014 5:04 am

MichaelM wrote:

BDD:

I have a question regarding the PER instruction.

Sorry I missed all this.

As Garth and Jeff noted, PER's operand is a signed 16 bit offset relative to the instruction following PER, and is calculated by the assembler using the same rules as for calculating BRL's operand. When PER is executed, that offset is added by the MPU to the current value in PC and pushed, MSB first. As the offset is a constant, moving the PER instruction effectively "moves" the computed value.

In my 65C816 macros, I synthesize BSR (Branch to SubRoutine) with a combination of PER and BRL. As long as the subroutine is within reach of a long relative branch it works. See the following.

Code:

;	BSR: Branch to Subroutine
;	——————————————————————————————————————————————————————————————————————
;	This macro synthesizes the BSR instruction implemented in the Motorola
;	6800 & 68000 microprocessors.  Programs in which subroutines are call-
;	ed via BSR are fully relocatable,  as the target address is calculated
;	relative to the program counter at run-time.   The target address must
;	be within the range of a long relative branch, +$7FFF or -$8000 bytes.
;	——————————————————————————————————————————————————————————————————————
;
bsr      .macro .sr            ;BSR <sub_addr>
.mib     =$82                  ;BRL opcode
.mip     =$62                  ;PER opcode
.na      =*+3                  ;compute BRL opcode address
.ra      .set .na+2            ;allow for BRL operand size
.ba      =.ra+1                ;BRL offset base address
.ra      .set .ra-.na          ;compute "return" address
.ta      .set .sr-.ba          ;compute BRL offset
         .byte .mip,<.ra,>.ra  ;assemble PER instruction
         .byte .mib,<.ta,>.ta  ;assemble BRL instruction
         .endm
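For readers who want to check the arithmetic, here is a Python sketch of the six bytes the macro above would emit. The function name `bsr_bytes` and the sample addresses are hypothetical; the opcodes ($62 for PER, $82 for BRL) and the offset rules are taken from the macro itself.

```python
def bsr_bytes(origin, subroutine):
    """Bytes of a PER/BRL pair when the PER opcode sits at origin."""
    brl_address = origin + 3                  # .na: address of the BRL opcode
    return_minus_1 = brl_address + 2          # last byte of BRL, i.e. RETURN - 1
    brl_base = return_minus_1 + 1             # .ba: address of the byte after BRL
    per_offset = (return_minus_1 - brl_address) & 0xFFFF   # always $0002
    brl_offset = (subroutine - brl_base) & 0xFFFF          # signed, stored as 16 bits
    return bytes([0x62, per_offset & 0xFF, per_offset >> 8,
                  0x82, brl_offset & 0xFF, brl_offset >> 8])


forward = bsr_bytes(0x1000, 0x1234)           # subroutine ahead of the call
backward = bsr_bytes(0x2000, 0x1F00)          # subroutine behind the call
moved = bsr_bytes(0x5000, 0x5234)             # 'forward' relocated by $4000
```

The relocated pair assembles to exactly the same bytes as the original, which is the property that makes the synthesized BSR position independent.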

MichaelM wrote:

It appears to me that the same result could have been achieved by PEA RETURN - 1 instruction (0xF40510) followed by the BRL SUBROUTINE instruction in far fewer cycles.

Dr Jefyll wrote:

Yes it could -- if the code would only stay in one place! But to use PEA the assembler would have to know the absolute address of RETURN -- and that's undetermined, since we intend it to be relocatable. What the assembler can figure out, however, is displacement between two unknown expressions: RETURN -1 and $. Maybe BDD can confirm that for us.

Jeff is correct. PEA pushes an operand that is determined at assembly time and is used as is by the MPU. If you relocate the PEA instruction the operand remains the same. Hence the need to use PER to synthesize BSR.

As for the cycle count, its importance will be in proportion to how many BSR instructions are used in your code. A total of nine cycles are required to execute the synthesized BSR, compared to the six used by JSR. I look at the slight increase in cycles as a small price to pay for increased programming flexibility.

MichaelM · Post by **MichaelM** » Sat Aug 09, 2014 1:15 pm

I must have been up waaay past my bedtime last night. I completely missed the significance of the difference between the PEA BRL instruction sequence and the PER BRL instruction sequence when I posed the question:

MichaelM wrote:

Am I missing something from this discussion? The whole purpose of the foregoing discussion was to demonstrate that the PER RETURN - 1 (0x620200) pushed the sum of the PC for the BRL instruction (0x1003) plus the operand of the PER instruction (0x0002) onto the stack. It appears to me that the same result could have been achieved by PEA RETURN - 1 instruction (0xF40510) followed by the BRL SUBROUTINE instruction in far fewer cycles.

The task now is to determine a way to implement PER in the M65C02A microarchitecture that does not require too many (or any) changes to the current HDL/RTL. With the BSR Rel16 already supported in the M65C02A microarchitecture, PER would support PC-relative access to data structures.

barrym95838 · Post by **barrym95838** » Sat Aug 09, 2014 3:26 pm

You're forgiven, Michael. For what it's worth, the converse to the PER/BRL to BSR connection is also possible. In other words:

Code:

        ...
        bsr  aaa                ; push run-time pointer to table
table:  .dw  value1, value2, value3, value4
aaa:    ldy  #1
aloop:  lda  (1,s),y
        tax                     ; low half of value in x
        iny
        lda  (1,s),y            ; high half of value in a
        bsr  dosomething        ; use 16-bit value for something
        iny
        cpy  #9
        bne  aloop              ; do it 4 times
        pla
        pla                     ; discard pointer
        bra  elsewhere

Here, the bsr is being used in place of the per. It's not quite the same, but similar, as long as the table is close to the code accessing it. It worked on the venerable 6800, so it's a proven method.

Mike

[Edit: fixed endless loop label snafu.]

MichaelM · Post by **MichaelM** » Sun Aug 10, 2014 6:22 pm

The following image provides the cycle counts for the M65C02A-only instructions being discussed on this thread. The attached image is for the M65C02A-only instructions that I've recently added. I've yet to define and implement the FORTH VM support instructions that have been previously discussed on this thread. In addition, I've not yet added the prefix/escape instructions that I've discussed or that Jeff has suggested. There are currently two opcodes reserved for prefix/escape instructions, but no final definition of their operation has been settled. (The cycle counts for the other M65C02A instructions, which are common with the 6502, 65C02, and W65C02S processors, are provided here.)

[Image: Cycle Counts - M65C02A-Only Instructions]

MichaelM · Post by **MichaelM** » Mon Aug 11, 2014 3:08 am

Have checked out the following new instructions:

BSR rel16 (M65C02A only)
BRA rel16 ('816 BRL rel16)
PHW #imm ('816 PEA abs)
PHW dp ('816 PEI dp)
PHW abs (M65C02A only)
PHR rel16 ('816 PER rel16)
PLW dp (M65C02A only)
PLW abs (M65C02A only)

I have a problem with the generation of sequential stack frame references for the (sp,S),Y addressing mode. The modification that I put into the address generator to support PC-relative addressing is causing a conflict. I will work to resolve this issue over the next week.

The stack-relative addressing for single bytes is fine, but not completely checked out.

MichaelM · Post by **MichaelM** » Mon Aug 11, 2014 3:38 am

Never mind about that (sp,S),Y issue. Add the following new instruction to the verified list:

JSR (sp,S),Y (M65C02A only)

Microprogramming is a great tool; no change required to the core's RTL. Did have to change the way that I pull the indirect address out of the stack frame. I expected to read the low byte first and then the high byte. However, to solve the issue with the post-indexed stack-relative indirect addressing mode, I now read the high byte first and then the low byte. When the low byte is read into the core, the temporary register holding the stack frame index is overwritten with the low byte.

Some RTL will need to be modified in the near future if I want to add 16-bit operations (a la '816) as has been suggested by others on this thread. In order to perform 16-bit operations, 8-bit operations must be carried out least significant byte first. As I've patched it at the moment by reading the high byte of the indirect address pointer first, it won't support sequential 8-bit data reads starting with the low byte for 16-bit operations.
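One way to picture the ordering constraint is the following Python sketch. It is an interpretation, not the core's actual RTL: it assumes a single temporary register that first holds the stack frame index and is then reused for the low byte of the pointer. The name `fetch_pointer` and the addresses are hypothetical.

```python
def fetch_pointer(memory, s, index, high_first):
    """Read a 16-bit pointer at S+index using one shared temporary register."""
    temp = index                           # temp starts out as the stack frame index
    if high_first:
        hi = memory[s + temp + 1]          # index is still intact for this read
        temp = memory[s + temp]            # low byte overwrites the index last
        lo = temp
    else:
        temp = memory[s + temp]            # low byte overwrites the index first
        lo = temp
        hi = memory[s + temp + 1]          # wrong address: low byte used as the index
    return (hi << 8) | lo


memory = bytearray(0x10000)
s = 0x01FD
memory[s + 1] = 0x20                       # low byte of the pointer $4120
memory[s + 2] = 0x41                       # high byte of the pointer

good = fetch_pointer(memory, s, 1, high_first=True)
bad = fetch_pointer(memory, s, 1, high_first=False)
```

Under this assumption, reading the high byte first works because the index is consumed before it is overwritten, which is also why the same patch gets in the way of low-byte-first 16-bit data reads.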
