Proper 65C02 core
Re: Proper 65C02 core
BigEd:
What was I supposed to see at the link you provided? It never completed initializing in my browser.
On the question of my previous post, the tutorials and the WDC Programmer's Manual are not really clear on whether the address arithmetic is modulo 256 for 8-bit zero page addressing. The Programmer's Manual, in particular, describes the indexed page 0 addressing modes as modulo 256 for the 6502 and 65C02 in the introduction, but when the address modes are presented individually later in the same chapter, no mention is made of this specific restriction. The address computation diagrams provided all appear to indicate that a carry from the LSB is propagated into the MSB of the address; this behavior is described in the introduction of the address mode chapter as pertaining only to the 65802/65816.
Windfall had sent me back a note that the addressing fix I implemented still needed some more work to behave correctly. All the research I've done seems to indicate that only the page zero address modes operate as modulo 256, but I can't find an example of the corner cases I posted about earlier.
Michael A.
Re: Proper 65C02 core
MichaelM, here's a screen capture from BigEd's link. The visual6502 site uses JavaScript, so you may have to adjust your browser settings or use a different browser (I use Firefox).
Re: Proper 65C02 core
Sorry, let me paste a slightly simpler test case:
Code: Select all
LDX #$7F
LDA ($80,X)
STA ($FF),Y
NOP
which can be simulated here: http://www.visual6502.org/JSSim/expert. ... a18091ffea
with the result:
Code: Select all
cycle ab db rw Fetch pc a x y s p
0 0600 a2 1 LDX# 0600 aa 00 00 fd nv‑BdIZc
0 0600 a2 1 LDX# 0600 aa 00 00 fd nv‑BdIZc
1 0601 7f 1 0601 aa 00 00 fd nv‑BdIZc
1 0601 7f 1 0601 aa 00 00 fd nv‑BdIZc
2 0602 a1 1 LDA(zp,X) 0602 aa 7f 00 fd nv‑BdIzc
2 0602 a1 1 LDA(zp,X) 0602 aa 7f 00 fd nv‑BdIzc
3 0603 80 1 0603 aa 7f 00 fd nv‑BdIzc
3 0603 80 1 0603 aa 7f 00 fd nv‑BdIzc
4 0080 00 1 0604 aa 7f 00 fd nv‑BdIzc
4 0080 00 1 0604 aa 7f 00 fd nv‑BdIzc
5 00ff 00 1 0604 aa 7f 00 fd nv‑BdIzc
5 00ff 00 1 0604 aa 7f 00 fd nv‑BdIzc
6 0000 a9 1 0604 aa 7f 00 fd nv‑BdIzc
6 0000 a9 1 0604 aa 7f 00 fd nv‑BdIzc
7 a900 00 1 0604 aa 7f 00 fd nv‑BdIzc
7 a900 00 1 0604 aa 7f 00 fd nv‑BdIzc
8 0604 91 1 STA(zp),Y 0604 00 7f 00 fd nv‑BdIZc
8 0604 91 1 STA(zp),Y 0604 00 7f 00 fd nv‑BdIZc
9 0605 ff 1 0605 00 7f 00 fd nv‑BdIZc
9 0605 ff 1 0605 00 7f 00 fd nv‑BdIZc
10 00ff 00 1 0606 00 7f 00 fd nv‑BdIZc
10 00ff 00 1 0606 00 7f 00 fd nv‑BdIZc
11 0000 a9 1 0606 00 7f 00 fd nv‑BdIZc
11 0000 a9 1 0606 00 7f 00 fd nv‑BdIZc
12 a900 00 1 0606 00 7f 00 fd nv‑BdIZc
12 a900 00 1 0606 00 7f 00 fd nv‑BdIZc
13 a900 00 0 0606 00 7f 00 fd nv‑BdIZc
13 a900 00 0 0606 00 7f 00 fd nv‑BdIZc
14 0606 ea 1 NOP 0606 00 7f 00 fd nv‑BdIZc
14 0606 ea 1 NOP 0606 00 7f 00 fd nv‑BdIZc
I'd be very interested to hear Windfall's observation.
Cheers
Ed
Re: Proper 65C02 core
Thanks very much. That's exactly what I needed. I certainly appreciate the time you took to prepare the test cases and make the simulation run.
It is certainly instructive to see the dummy read cycles which are described in the documentation.
I implemented a fix for the core I posted, but the method that I use to advance the address to the second byte of a pointer is not currently performed modulo 256. Armed with your examples, I will be making a modification to the microcode and address generator logic that will perform a modulo 256 operation when a zp addressing mode is fetching a 16-bit operand from page zero. For other addressing modes, it appears that the computation is treated as a 16-bit operation, so a simple 16-bit adder is all that's required to properly compute (abs,X); abs,X; and abs,Y.
Only the zp indexed and zp indexed indirect modes are not currently operating in the manner expected for a 6502 or 65C02. The core is handling all of these operations as sequential operations using 16-bit arithmetic; pretty much like how the 65816/65802 are expected to perform these operations.
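As a rough illustration of the difference (a Python sketch, not the core's Verilog or microcode), the pointer fetch for (zp,X) can be modeled with and without the modulo 256 wrap. The function names and the memory dictionary are hypothetical; the bytes at $00FF and $0000 are taken from Ed's trace, and the byte at $0100 is an assumed value chosen only to make the difference visible.

```python
# Sketch: fetching the 16-bit pointer for LDA ($80,X) with X = $7F.
# $00FF and $0000 follow Ed's trace; the byte at $0100 is an assumption.
mem = {0x00FF: 0x00, 0x0000: 0xA9, 0x0100: 0x12}

def pointer_6502(zp, x):
    """6502/65C02 behaviour: both pointer bytes come from page zero."""
    lo_addr = (zp + x) & 0xFF
    hi_addr = (lo_addr + 1) & 0xFF      # wraps from $FF back to $00
    return mem[lo_addr] | (mem[hi_addr] << 8)

def pointer_16bit(zp, x):
    """Plain 16-bit arithmetic: the second byte comes from the next sequential address."""
    lo_addr = (zp + x) & 0xFFFF
    hi_addr = (lo_addr + 1) & 0xFFFF    # crosses into page 1
    return mem[lo_addr] | (mem[hi_addr] << 8)

print(hex(pointer_6502(0x80, 0x7F)))    # 0xa900, the address read in the trace
print(hex(pointer_16bit(0x80, 0x7F)))   # 0x1200 with the assumed byte at $0100
```

With the wrap in place, the pointer resolves to $A900, which is the address the trace shows being read after the two page zero fetches.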
Once again, thanks for the data. It is very helpful.
Michael A.
Re: Proper 65C02 core
It is interesting to see that you need extra logic to "cripple" the address calculation that was implemented as a shortcut in the original design. I must admit I never gave this issue much thought in my core, but just implemented it in the most straightforward way that I could see, which mimics the original shortcut.
Re: Proper 65C02 core
C'est la vie.
When I laid out the objectives for my core, I was willing to add hardware to perform the address calculations in parallel with the ALU instead of using the ALU in a sequential manner to perform the calculations. In this manner, I expected to save one or two cycles in the implementation of many instructions. The method that you used in your implementation certainly results in a more cycle-accurate implementation, but also one that uses fewer resources. The penalty in my approach is more resources and a level of non-compliance that is not necessarily bad unless you are running existing programs.
One of my objectives was to make the core easily synthesizable and portable to Spartan 3AN FPGAs, and another was to save clock cycles, and the core meets those objectives. It was never part of my original objectives to be cycle compatible with either the 6502 or the 65C02 since I never expected to interface the core to "original" equipment and peripherals. And it was not an objective to implement the modulo 256 behavior of the zp addressing; this was my way of handling the JMP (abs) page crossing issue of the 6502. I was naive regarding the contortions that real 6502 programmers would use to get around the zp addressing limitations, so I allowed zp indexing to cross page boundaries. After Windfall asked for a "Proper 65C02" core, I added 65C02 code compatibility as an objective, but failed to put the corner cases I asked for help on earlier into the core's testbench/test program. The result is that zp indexed addressing simply crosses page boundaries like the '802/816 in native mode. (Doing assembler level programming of an Apple II in the early 80's to emulate a synchronous interface to a CDC 6600 was a short term task, and I never really became proficient in the nuances of all of the 6502 addressing modes.)
The fix that I put in (a simple 2:1 multiplexer on the address output), when Windfall reported this issue back to me, fixed the result for the first byte, but allows the second byte to be fetched from the next sequential location. This is incorrect as Ed's simulation data clearly demonstrates.
There are a number of spare bits in the microcode and each addressing mode is implemented in a separate microroutine, which provides a simple means of controlling the address generator so that modulo 256 operations are performed for all of the zp addressing modes. At first blush, this correction should only require (the previously inserted) 2:1 multiplexer directly controlled by the microcode. But I am also adding a programmable microcycle length controller (a la Am2925) that should make it considerably easier to interface the core to distributed single-cycle RAM, two-cycle block RAM, and variable cycle external RAM. I am in the process of testing these changes, and simultaneously trying to maintain 100 MHz core performance in an XC3S200AN Spartan 3AN FPGA. Yesterday, as I integrated these changes into the baseline, I violated one of my own principles and did not generate a testbench for the modified microprogram controller. It looks like I have an issue in the microcycle length controller, because the core's testbench no longer passes the BCD addition/subtraction tests.
PS: Thanks for the heads up regarding javascript. I don't have it enabled on this computer, and am not likely to upgrade to Chrome or Firefox; too many tools and work products on this computer, and I despise the idea of making changes to or upgrading programs when so much can go wrong. I do have a Linux laptop that I am bringing online, and I'll use it to follow the link.
Michael A.
Re: Proper 65C02 core
For me, cycle accuracy was a minor goal, but it happened quite naturally by following Hanson's block diagram and minimizing resources. Actually, given the fact that the double phase clock was replaced by a single phase, with extended access time, I was pleasantly surprised that I could make everything fit in the same number of cycles. Of course, the ability to drop a few cycles by having special address calculating logic has an advantage for any code that isn't counting cycles.
Re: Proper 65C02 core
There is an oddity in cycle-accuracy, where a couple of instructions have a dummy cycle that they really don't need. (Should I look this up, or is it well known?)
Re: Proper 65C02 core
I think those are the same cycles that Michael's core avoids, and that mine implements. Getting rid of the dummy cycles would take extra resources, and may have a small effect on max clock rate as well.
Re: Proper 65C02 core
No, these were not cycles that the 6502 needs to do its work, which as you say are also used by your core. These were actually unnecessary. But, now I can't find this situation, or the (very old) posting on this forum which pointed out the discrepancy.
I don't think I dreamt the whole thing.
Cheers
Ed
Re: Proper 65C02 core
Agreed.
There appear to be dummy cycles inserted whenever the ALU is being used to compute an address. In the case of (zp,X), there is a dummy cycle which performs a dummy read of the address pointed to by the zp address while the X register is being added (% 256). According to the WDC Programmer's Reference Manual, the 65C02 does not perform the dummy read, but it will need to perform a dummy cycle. (The 6502's dummy read, read strobe and all, interferes with the behavior of I/O devices, so the 65C02 performs a dummy cycle, but does not drive the bus control lines.) Another place for dummy cycles is when a carry occurs at page boundaries. Finally, there is also a dummy cycle during RMW instructions when the ALU is being used.
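To make the (zp,X) case concrete, here is a small Python sketch that lists the addresses an NMOS 6502 puts on the bus for LDA (zp,X), including the dummy read. The function name and the memory dictionary are hypothetical; the inputs are the values from Ed's test case, and the resulting sequence can be compared against cycles 2 through 7 of his trace.

```python
# Sketch: address-bus sequence of LDA (zp,X) on an NMOS 6502, one entry per cycle.
def lda_zp_x_indirect_bus(pc, zp, x, mem):
    """Return the six addresses read while executing LDA (zp,X) located at pc."""
    ptr = (zp + x) & 0xFF                  # index is added modulo 256
    lo = mem.get(ptr, 0)
    hi = mem.get((ptr + 1) & 0xFF, 0)      # high byte also fetched from page zero
    return [
        pc,                  # opcode fetch
        (pc + 1) & 0xFFFF,   # zero page operand fetch
        zp,                  # dummy read while X is being added
        ptr,                 # pointer low byte
        (ptr + 1) & 0xFF,    # pointer high byte, wrapped within page zero
        lo | (hi << 8),      # data read at the effective address
    ]

mem = {0x00FF: 0x00, 0x0000: 0xA9}         # values seen in Ed's trace
print([f"{a:04x}" for a in lda_zp_x_indirect_bus(0x0602, 0x80, 0x7F, mem)])
# ['0602', '0603', '0080', '00ff', '0000', 'a900']
```

The third entry ($0080) is the dummy read; the sketch does not attempt to model the 65C02's different handling of that cycle.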
Like Arlet says, in my core, I use a separate address generator. The additional resources allow the core to compute the address for the next memory read/write cycle while the operands are being assembled for the ALU. Then when all of the operands are ready, the fetch of the next opcode occurs while the ALU performs its operation.
On average, I can save enough cycles in a general program to provide about a 40% increase in throughput. However, as Arlet has pointed out previously, this optimization comes at considerable cost, and it requires single-cycle memory to operate.
In an FPGA, using the block RAMs, two cycles are required to fill the pipeline. Thus, with the appropriate memory interface, it would be possible to perform sequential reads in a burst transaction. However, I have previously attempted this feat, and found that it is difficult to take proper advantage of the sequential fetch operations, which is one reason I've returned to wrap up my original core.
I would say that an address generator no more complex than I have implemented, or the more traditional approach that Arlet uses, is the approach to take with this instruction set architecture. Additional speed improvements come at increased complexity not really justified by the limitations of the processor architecture. This statement does not apply to the '816 or the 65Org16/Org32 projects. The additional performance of these instruction set architectures provides the justification for expending more resources to speed the execution.
I was going to say something stupid like "No additional pipelining can be applied to this architecture because ...", but as so many have demonstrated with x86: where there's a will (and money/time) there's a way to improve the performance of any CISC architecture.
For additional performance improvements in an FPGA, I have toyed with the idea of using the dual port nature of the block RAMs to provide a dual 8-bit fetch path. One port would use the address from the memory address register, and the other port would fetch the next sequential location. In this manner, any two byte instruction or two byte parameter would be fetched in a single cycle. This simple enhancement would greatly speed the execution of a large number of the instructions. Cycle counts for instructions like ADC #xx or ADC zp would be reduced by one. Cycle counts for instructions like ADC (zp,X) could be reduced by two, and instructions like JMP (abs,X) could be reduced by 3.
Michael A.
Re: Proper 65C02 core
Ah, I finally found the post I'd seen before! Sixth page of a search for 'cycles'. It's an observation by Bruce in 2004 about the 65C02's cycle counts, which explains why I didn't see anything fishy in the 6502 cycle counts:
viewtopic.php?f=2&t=243
"A 65C02 Bug?"
The meat of it is that of the read-modify-write Absolute,X instructions, INC and DEC have different cycle counts to the others. The NMOS 6502 always uses 7, but the CMOS 65C02 short cuts when no page is crossed - except for two cases.
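A minimal Python sketch of that rule, for anyone wanting to fold it into a cycle-count table. The function name is hypothetical, and the list of "other" read-modify-write mnemonics (ASL, LSR, ROL, ROR) is my reading of which Absolute,X instructions are meant, so treat it as an assumption and check it against the data sheet.

```python
# Sketch: cycle counts for read-modify-write Absolute,X instructions.
# Assumption: the RMW abs,X group is ASL, LSR, ROL, ROR, INC and DEC.
RMW_ABS_X = {"ASL", "LSR", "ROL", "ROR", "INC", "DEC"}

def rmw_abs_x_cycles(mnemonic, base, x, cmos):
    """Return the cycle count for `mnemonic base,X` on an NMOS 6502 or CMOS 65C02."""
    if mnemonic not in RMW_ABS_X:
        raise ValueError(f"{mnemonic} is not a read-modify-write abs,X instruction")
    if not cmos:
        return 7                                   # NMOS 6502 always takes 7
    if mnemonic in ("INC", "DEC"):
        return 7                                   # the two cases with no shortcut
    page_crossed = ((base & 0xFF) + x) > 0xFF
    return 7 if page_crossed else 6                # 65C02 shortcut when no page is crossed

print(rmw_abs_x_cycles("ASL", 0x1200, 0x10, cmos=True))   # 6
print(rmw_abs_x_cycles("INC", 0x1200, 0x10, cmos=True))   # 7
```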
I wasn't dreaming!
Cheers
Ed
Re: Proper 65C02 core
There certainly is a lot of good information on the site. I ran across some really neat data yesterday on LCC work done 8-9 years ago for the '816. I just hadn't had time to dig that far back in the archives. Need to spend more time digging through those old posts when I have some more free time.
Michael A.
Re: Proper 65C02 core
An update has been posted to GitHub for the M65C02 Core. The update corrects the reported issues with respect to zero page addressing modes. The MPC has been updated to include a microcycle length controller which allows the microcycle to be controlled as having lengths of 1, 2, and 4 clock cycles. The length 1 and length 2 microcycles are intended for LUT and Block RAMs, respectively, and do not support insertion of wait states. The length 4 microcycle is intended for external memories, and it supports insertion of wait states. The length 4 microcycle should allow the M65C02 core to be adapted to asynchronous or synchronous external memories. It may also be possible to dynamically control the microcycle length between 1 and 2, and take advantage of the pipelining of Block RAMs so that only the first cycle is a length 2 cycle, and all other cycles are length 1 cycles until the next program branch, or data read/write cycle occurs.
To simplify the control of the address generator with respect to zero page wrapping, the Wait control bit (no longer required because of the internal microcycle length controller) was repurposed as the ZP control bit. When ZP is asserted, the address calculation is truncated to 8 bits. The microcode asserts ZP only for the zp; zp,X; zp,Y; (zp); (zp,X); and (zp),Y addressing modes, and in these modes, only while fetching the address operands from page zero. An absolute address mode can still address page zero locations, and its indexed address calculations are allowed to cross into page 1.
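The effect of the ZP bit can be sketched behaviourally in Python. This is only a model of the rule described above, not the Verilog in the repository; the function name and argument names are made up for the example.

```python
# Sketch: address generator output with and without the ZP control bit.
def next_address(base, offset, zp):
    """Add offset to base; truncate to 8 bits when the (hypothetical) zp flag is set."""
    total = base + offset
    return total & 0x00FF if zp else total & 0xFFFF

# zp,X with zp = $80 and X = $90 wraps within page zero.
print(hex(next_address(0x80, 0x90, zp=True)))     # 0x10
# abs,X with abs = $0080 and X = $90 is allowed to cross into page 1.
print(hex(next_address(0x0080, 0x90, zp=False)))  # 0x110
```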
The testbench was expanded to run the previous implementation as a check for the new implementation. With the exception of the expected differences in page 0 address wrapping, the new implementation matches the old implementation.
Michael A.
ElEctric_EyE
Re: Proper 65C02 core
Your 65C02 core seems to be progressing nicely. You definitely had some cycle-saving ideas in mind with dynamically controlling the microcycle length to speed up operation. Is this an automatic control function for opcodes to have 1, 2, or 4 cycles? I haven't checked your code out yet on GitHub. GitHub has chosen to lock me out for some reason.