Proper 65C02 core

MichaelM · Post by **MichaelM** » Sat Nov 03, 2012 10:51 pm

BigEd:

What was I supposed to see at the link you provided? It never completed initializing in my browser.

On the question of my previous post, the tutorials and the WDC Programmer's Manual are not really clear on whether the address arithmetic is modulo 256 for 8-bit zero page addressing. The Programmer's Manual, in particular, describes the indexed page 0 addressing modes as modulo 256 for the 6502 and 65C02 in the introduction, but when the address modes are presented individually later in the same chapter, no mention is made of this specific restriction. The address computation diagrams provided all appear to indicate that a carry from the LSB is propagated into the MSB of the address; this behavior is described in the introduction of the address mode chapter as pertaining only to the 65802/65816.

Windfall had sent me back a note that the addressing fix I implemented still needed some more work to behave correctly. All the research I've done seems to indicate that only the page zero address modes operate as modulo 256, but I can't find an example of the corner cases I posted about earlier.

Arlet · Post by **Arlet** » Sun Nov 04, 2012 7:10 am

MichaelM, here's a screen capture from BigEd's link. The visual6502 site uses JavaScript, so you may have to adjust your browser settings or use a different browser (I use Firefox).

BigEd · Post by **BigEd** » Sun Nov 04, 2012 11:28 am

Sorry, let me paste a slightly simpler test case:

Code: Select all

LDX #$7F
LDA ($80,X)
STA ($FF),Y
NOP

which can be simulated here: http://www.visual6502.org/JSSim/expert. ... a18091ffea
with the result:

Code: Select all

cycle  ab    db  rw  Fetch       pc    a   x   y   s   p
0      0600  a2  1    LDX#       0600  aa  00  00  fd  nv‑BdIZc
0      0600  a2  1    LDX#       0600  aa  00  00  fd  nv‑BdIZc
1      0601  7f  1               0601  aa  00  00  fd  nv‑BdIZc
1      0601  7f  1               0601  aa  00  00  fd  nv‑BdIZc
2      0602  a1  1    LDA(zp,X)  0602  aa  7f  00  fd  nv‑BdIzc
2      0602  a1  1    LDA(zp,X)  0602  aa  7f  00  fd  nv‑BdIzc
3      0603  80  1               0603  aa  7f  00  fd  nv‑BdIzc
3      0603  80  1               0603  aa  7f  00  fd  nv‑BdIzc
4      0080  00  1               0604  aa  7f  00  fd  nv‑BdIzc
4      0080  00  1               0604  aa  7f  00  fd  nv‑BdIzc
5      00ff  00  1               0604  aa  7f  00  fd  nv‑BdIzc
5      00ff  00  1               0604  aa  7f  00  fd  nv‑BdIzc
6      0000  a9  1               0604  aa  7f  00  fd  nv‑BdIzc
6      0000  a9  1               0604  aa  7f  00  fd  nv‑BdIzc
7      a900  00  1               0604  aa  7f  00  fd  nv‑BdIzc
7      a900  00  1               0604  aa  7f  00  fd  nv‑BdIzc
8      0604  91  1    STA(zp),Y  0604  00  7f  00  fd  nv‑BdIZc
8      0604  91  1    STA(zp),Y  0604  00  7f  00  fd  nv‑BdIZc
9      0605  ff  1               0605  00  7f  00  fd  nv‑BdIZc
9      0605  ff  1               0605  00  7f  00  fd  nv‑BdIZc
10     00ff  00  1               0606  00  7f  00  fd  nv‑BdIZc
10     00ff  00  1               0606  00  7f  00  fd  nv‑BdIZc
11     0000  a9  1               0606  00  7f  00  fd  nv‑BdIZc
11     0000  a9  1               0606  00  7f  00  fd  nv‑BdIZc
12     a900  00  1               0606  00  7f  00  fd  nv‑BdIZc
12     a900  00  1               0606  00  7f  00  fd  nv‑BdIZc
13     a900  00  0               0606  00  7f  00  fd  nv‑BdIZc
13     a900  00  0               0606  00  7f  00  fd  nv‑BdIZc
14     0606  ea  1    NOP        0606  00  7f  00  fd  nv‑BdIZc
14     0606  ea  1    NOP        0606  00  7f  00  fd  nv‑BdIZc
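
As a rough cross-check of cycles 4 to 7 of that trace, here is a minimal Python sketch of the address sequence for LDA (zp,X), assuming both the index add and the pointer increment wrap within page zero. The function name and the dictionary used as memory are illustrative only; the byte values are the ones seen on db in the trace.

```python
def indexed_indirect_reads(zp, x, mem):
    """Bus addresses read after the operand fetch of LDA (zp,X).

    Assumes NMOS-style behaviour: a dummy read at zp, then the pointer
    bytes are fetched with all arithmetic done modulo 256.
    """
    lo_addr = (zp + x) & 0xFF        # index add wraps within page zero
    hi_addr = (lo_addr + 1) & 0xFF   # pointer high byte wraps as well
    pointer = mem.get(lo_addr, 0) | (mem.get(hi_addr, 0) << 8)
    return [zp, lo_addr, hi_addr, pointer]


# Memory contents as they appear on the data bus in the trace.
mem = {0x00FF: 0x00, 0x0000: 0xA9}
reads = indexed_indirect_reads(0x80, 0x7F, mem)
print([f"{a:04x}" for a in reads])   # ['0080', '00ff', '0000', 'a900']
```

The four addresses line up with the ab column for cycles 4, 5, 6 and 7: the high byte of the pointer comes from $0000, not $0100.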

I'd be very interested to hear Windfall's observation.
Cheers
Ed

MichaelM · Post by **MichaelM** » Sun Nov 04, 2012 1:39 pm

Thanks very much. That's exactly what I needed. I certainly appreciate the time you took to prepare the test cases and make the simulation run.

It is certainly instructive to see the dummy read cycles which are described in the documentation.

I implemented a fix for the core I posted, but the method that I use to advance the address to the second byte of a pointer is not currently performed modulo 256. Armed with your examples, I will be making a modification to the microcode and address generator logic that will perform a modulo 256 operation when a zp addressing mode is fetching a 16-bit operand from page zero. For other addressing modes, it appears that the computation is treated as a 16-bit operation, so a simple 16-bit adder is all that's required to properly compute (abs,X); abs,X; and abs,Y.

Only the zp indexed and zp indexed indirect modes are not currently operating in the manner expected for a 6502 or 65C02. The core is handling all of these operations as sequential operations using 16-bit arithmetic; pretty much like how the 65816/65802 are expected to perform these operations.

Once again, thanks for the data. It is very helpful.

Arlet · Post by **Arlet** » Sun Nov 04, 2012 1:55 pm

It is interesting to see that you need extra logic to "cripple" the address calculation that was implemented as a shortcut in the original design. I must admit I never gave this issue much thought in my core, but just implemented in the most straightforward way that I could see, which mimics the original shortcut.

MichaelM · Post by **MichaelM** » Sun Nov 04, 2012 3:23 pm

C'est la vie.

When I laid out the objectives for my core, I was willing to add hardware to perform the address calculations in parallel with the ALU instead of using the ALU in a sequential manner to perform the calculations. In this manner, I expected to save one or two cycles in the implementation of many instructions. The method that you used in your implementation certainly results in a more cycle accurate implementation, but also one that uses less resources. The penalty in my approach is more resources and a level of non-compliance that is not necessarily bad unless you are running existing programs.

One of my objectives was to make the core easily synthesizable and portable to Spartan 3AN FPGAs, and another was to save clock cycles, and the core meets those objectives. It was never part of my original objectives to be cycle compatible with either the 6502 or the 65C02 since I never expected to interface the core to "original" equipment and peripherals. And it was not an objective to implement the modulo 256 behavior of the zp addressing; this was my way of handling the JMP (abs) page crossing issue of the 6502. I was naive regarding the contortions that real 6502 programmers would use to get around the zp addressing limitations, so I allowed zp indexing to cross page boundaries. After Windfall asked for a "Proper 65C02" core, I added 65C02 code compatibility as an objective, but failed to put the corner cases I asked for help on earlier into the core's testbench/test program. The result is that zp indexed addressing simply crosses page boundaries like the '802/816 in native mode. (Doing assembler level programming of an Apple II in the early 80's to emulate a synchronous interface to a CDC 6600 was a short term task, and I never really became proficient in the nuances of all of the 6502 addressing modes.)

The fix that I put in (a simple 2:1 multiplexer on the address output), when Windfall reported this issue back to me, fixed the result for the first byte, but allows the second byte to be fetched from the next sequential location. This is incorrect as Ed's simulation data clearly demonstrates.

There are a number of spare bits in the microcode and each addressing mode is implemented in a separate microroutine, which provides a simple means of controlling the address generator so that modulo 256 operations are performed for all of the zp addressing modes. At first blush, this correction should only require (the previously inserted) 2:1 multiplexer directly controlled by the microcode. But I am also adding a programmable microcycle length controller (a la Am2925) that should make it considerably easier to interface the core to distributed single-cycle RAM, two-cycle block RAM, and variable cycle external RAM. I am in the process of testing these changes, and simultaneously trying to maintain 100MHz core performance in a XC3S200AN Spartan 3AN FPGA. Yesterday as I integrated these changes into the baseline, I violated one of my own principles and did not generate a testbench for the modified microprogram controller. It looks like I have an issue in the microcycle length controller, because the core's testbench no longer passes the BCD addition/subtraction tests.

PS: Thanks for the heads up regarding javascript. I don't have it enabled on this computer, and am not likely to upgrade to Chrome or Firefox; too many tools and work products on this computer, and I despise the idea of making changes to or upgrading programs when so much can go wrong. I do have a Linux laptop that I am bringing online, and I'll use it to follow the link.

Arlet · Post by **Arlet** » Sun Nov 04, 2012 3:47 pm

For me, cycle accuracy was a minor goal, but it happened quite naturally by following Hanson's block diagram and minimizing resources. Actually, given the fact that the double phase clock was replaced by a single phase, with extended access time, I was pleasantly surprised that I could make everything fit in the same number of cycles. Of course, the ability to drop a few cycles by having special address calculating logic has an advantage for any code that isn't counting cycles.

BigEd · Post by **BigEd** » Sun Nov 04, 2012 3:56 pm

There is an oddity in cycle-accuracy, where a couple of instructions have a dummy cycle that they really don't need. (Should I look this up, or is it well known?)

Arlet · Post by **Arlet** » Sun Nov 04, 2012 4:17 pm

I think those are the same cycles that Michael's core avoids, and that mine implements. Getting rid of the dummy cycles would take extra resources, and may have a small effect on max clock rate as well.

BigEd · Post by **BigEd** » Sun Nov 04, 2012 4:58 pm

No, these were not cycles that the 6502 needs to do its work, which as you say are also used by your core. These were actually unnecessary. But, now I can't find this situation, or the (very old) posting on this forum which pointed out the discrepancy.
I don't think I dreamt the whole thing.
Cheers
Ed

MichaelM · Post by **MichaelM** » Sun Nov 04, 2012 5:10 pm

Agreed.

There appear to be dummy cycles inserted whenever the ALU is being used to compute an address. In the case of (zp,X), there is a dummy cycle which performs a dummy read of the address pointed to by the zp address while the X register is being added (% 256). According to the WDC Programmer's Reference Manual, the 65C02 does not perform the dummy read, but it will need to perform a dummy cycle. (The 6502's dummy read, read strobe and all, interferes with the behavior of I/O devices, so the 65C02 performs a dummy cycle, but does not drive the bus control lines.) Another place for dummy cycles is when a carry occurs at page boundaries. Finally, there is also a dummy cycle during RMW instructions when the ALU is being used.

Like Arlet says, in my core, I use a separate address generator. The additional resources allow the core to compute the address for the next memory read/write cycle while the operands are being assembled for the ALU. Then when all of the operands are ready, the fetch of the next opcode occurs while the ALU performs its operation.

On average, I can save enough cycles in a general program to provide about a 40% increase in throughput. However, as Arlet has pointed out previously, this optimization comes at considerable cost, and it requires single cycle memory to operate.

In an FPGA, using the block RAMs, two cycles are required to fill the pipeline. Thus, with the appropriate memory interface, it would be possible to perform sequential reads in a burst transaction. However, I have previously attempted this feat, and found that it is difficult to take proper advantage of the sequential fetch operations, which is one reason I've returned to wrap up my original core.

I would say that an address generator no more complex than I have implemented, or the more traditional approach that Arlet uses, is the approach to take with this instruction set architecture. Additional speed improvements come at increased complexity not really justified by the limitations of the processor architecture. This statement does not apply to the '816 or the 65Org16/Org32 projects. The additional performance of these instruction set architectures provides the justification for expending more resources to speed the execution.

I was going to say something stupid like "No additional pipelining can be applied to this architecture because ...", but as so many have demonstrated with x86: where there's a will (and money/time) there's a way to improve the performance of any CISC architecture.

For additional performance improvements in an FPGA, I have toyed with the idea of using the dual port nature of the block RAMs to provide a dual 8-bit fetch path. One port would use the address from the memory address register, and the other port would fetch the next sequential location. In this manner, any two byte instruction or two byte parameter would be fetched in a single cycle. This simple enhancement would greatly speed the execution of a large number of the instructions. Cycle counts for instructions like ADC #xx or ADC zp would be reduced by one. Cycle counts for instructions like ADC (zp,X) could be reduced by two, and instructions like JMP (abs,X) could be reduced by 3.

BigEd · Post by **BigEd** » Sun Nov 04, 2012 5:57 pm

Ah, I finally found the post I'd seen before! Sixth page of a search for 'cycles'. It's an observation by Bruce in 2004 about the 65C02's cycle counts, which explains why I didn't see anything fishy in the 6502 cycle counts:
viewtopic.php?f=2&t=243
"A 65C02 Bug?"

The meat of it is that of the read-modify-write Absolute,X instructions, INC and DEC have different cycle counts to the others. The NMOS 6502 always uses 7, but the CMOS 65C02 short cuts when no page is crossed - except for two cases.
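
To make that rule concrete, here is a small Python sketch of the Absolute,X read-modify-write cycle counts as described above. The value 6 for the 65C02 shortcut is my reading of the WDC cycle tables rather than something measured here, and the function and set names are just illustrative.

```python
SHIFT_ROTATE = {"ASL", "LSR", "ROL", "ROR"}
INC_DEC = {"INC", "DEC"}


def rmw_abs_x_cycles(mnemonic, base, x, cmos):
    """Cycle count for a read-modify-write Absolute,X instruction."""
    if mnemonic not in SHIFT_ROTATE | INC_DEC:
        raise ValueError(f"not an Absolute,X RMW instruction: {mnemonic}")
    if not cmos:
        return 7                      # NMOS 6502: always 7
    if mnemonic in INC_DEC:
        return 7                      # the two exceptions: no shortcut
    page_crossed = (base & 0xFF) + x > 0xFF
    return 7 if page_crossed else 6   # assumed 65C02 shortcut


print(rmw_abs_x_cycles("ASL", 0x1000, 0x10, cmos=True))
print(rmw_abs_x_cycles("INC", 0x1000, 0x10, cmos=True))
```

With no page crossing, the shift comes out one cycle shorter than the increment on the CMOS part, which is the discrepancy Bruce's post points out.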

I wasn't dreaming!

Cheers
Ed

MichaelM · Post by **MichaelM** » Sun Nov 04, 2012 6:25 pm

There certainly is a lot of good information on the site. I ran across some really neat data yesterday on LCC work done 8-9 years ago for the '816. I just hadn't had time to dig that far back in the archives. Need to spend more time digging through those old posts when I have some more free time.

MichaelM · Post by **MichaelM** » Tue Nov 13, 2012 4:08 am

An update has been posted to GitHub for the M65C02 Core. The update corrects the reported issues with respect to zero page addressing modes. The MPC has been updated to include a microcycle length controller which allows the microcycle to be controlled as having lengths of 1, 2, and 4 clock cycles. The length 1 and length 2 microcycles are intended for LUT and Block RAMs, respectively, and do not support insertion of wait states. The length 4 microcycle is intended for external memories, and it supports insertion of wait states. The length 4 microcycle should allow the M65C02 core to be adapted to asynchronous or synchronous external memories. It may also be possible to dynamically control the microcycle length between 1 and 2, and take advantage of the pipelining of Block RAMs so that only the first cycle is a length 2 cycle, and all other cycles are length 1 cycles until the next program branch, or data read/write cycle occurs.
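
The dynamic 1/2 length idea is only a possibility at this point, but a back-of-the-envelope Python model gives a feel for the saving. It assumes that any access which is not at the previous address plus one restarts the Block RAM pipeline and costs 2 clocks, and that every sequential access costs 1. The function name and the access list are hypothetical, not taken from the core.

```python
def clocks_for_accesses(addresses):
    """Total clocks when a non-sequential access costs 2 and a sequential one costs 1."""
    total = 0
    previous = None
    for addr in addresses:
        sequential = previous is not None and addr == (previous + 1) & 0xFFFF
        total += 1 if sequential else 2
        previous = addr
    return total


# Three sequential program fetches, one data read from page zero,
# then the program fetch resumes (which restarts the pipeline).
accesses = [0x0600, 0x0601, 0x0602, 0x0080, 0x0603]
print(clocks_for_accesses(accesses))   # 8
print(2 * len(accesses))               # 10 with a fixed length 2 microcycle
```

The gain depends entirely on how long the sequential runs are, so code dominated by data accesses and branches would see much less benefit.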

To simplify the control of the address generator with respect to zero page wrapping, the Wait control bit (no longer required because of the internal microcycle length controller) was repurposed as the ZP control bit. When ZP is asserted, the address calculation is truncated to 8 bits. The microcode asserts ZP only for zp; zp,X; zp,Y; (zp); (zp,X); and (zp),Y addressing modes, and in these modes, only while fetching the address operands from page zero. An absolute address mode can still address page zero locations, and its indexed address calculations are allowed to cross into page 1.
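
In software terms, the effect of the ZP bit can be sketched roughly as below. This is a behavioural Python model under my own naming, not the Verilog in the core: next_address and its parameters are illustrative.

```python
def next_address(base, offset, zp):
    """Behavioural model of the address adder with a ZP control bit.

    zp=True  : result truncated to 8 bits (stays in page zero)
    zp=False : plain 16-bit address arithmetic
    """
    mask = 0x00FF if zp else 0xFFFF
    return (base + offset) & mask


# zp,X with zp=$80, X=$90 wraps to $0010.
print(f"{next_address(0x80, 0x90, zp=True):04x}")     # 0010
# abs,X with the same base in page zero crosses into page 1.
print(f"{next_address(0x0080, 0x90, zp=False):04x}")  # 0110
# Second byte of a pointer stored at $FF is fetched from $00.
print(f"{next_address(0xFF, 1, zp=True):04x}")        # 0000
```

The same adder serves both cases; only the mask changes, which is why a single microcode bit is enough to select the behaviour.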

The testbench was expanded to run the previous implementation as a check for the new implementation. With the exception of the expected errors in page 0 address wrapping, the new implementation matches the old implementation.

ElEctric_EyE · Post by **ElEctric_EyE** » Tue Nov 13, 2012 6:19 pm

Your 65C02 core seems to be progressing nicely. You definitely had some cycle saving ideas in mind with dynamically controlling the microcode cycle length to speed up operation. Is this an automatic control function for opcodes to have 1, 2, or 4 cycles? I haven't checked your code out yet on GitHub. GitHub has chosen to lock me out for some reason.
