Agreed.
There appear to be dummy cycles inserted whenever the ALU is being used to compute an address. In the case of (zp,X), there is a dummy cycle which performs a dummy read of the unindexed zero page address while the X register is being added (mod 256). According to the WDC Programmer's Reference Manual, the 65C02 does not perform the dummy read, but it still needs the dummy cycle. (The 6502's dummy read, read strobe and all, interferes with the behavior of I/O devices, so the 65C02 takes the dummy cycle but does not drive the bus control lines.) Another place for dummy cycles is when a carry occurs across a page boundary. Finally, there is also a dummy cycle during RMW instructions while the ALU is being used.
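As a rough behavioral sketch of that (zp,X) sequence (Python rather than HDL, and the function name and flat memory list are mine, not from any of the cores discussed here), the NMOS bus activity would look like this:

[code]
def lda_zp_x_bus_cycles(pc, zp, x, mem):
    """Cycle-by-cycle bus activity of LDA (zp,X) on an NMOS 6502.
    Behavioral sketch only; 'mem' is assumed to be a 64K list of bytes."""
    cycles = [
        ("fetch opcode",              pc),
        ("fetch zero page operand",   pc + 1),
        # Dummy cycle: the NMOS part re-reads the unindexed zero page
        # address and discards the data while the ALU adds X (mod 256).
        # A 65C02 still spends this cycle, but does not drive a read here.
        ("dummy read, unindexed",     zp),
        ("fetch effective addr low",  (zp + x) & 0xFF),
        ("fetch effective addr high", (zp + x + 1) & 0xFF),
    ]
    ea = mem[(zp + x) & 0xFF] | (mem[(zp + x + 1) & 0xFF] << 8)
    cycles.append(("read operand", ea))
    return cycles  # six cycles total
[/code]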
Like Arlet says, in my core I use a separate address generator. The additional resources allow the core to compute the address for the next memory read/write cycle while the ALU's operands are being assembled. Then, when all of the operands are ready, the fetch of the next opcode occurs while the ALU performs its operation.
On average, I can save enough cycles in a general program to provide about a 40% increase in throughput. However, as Arlet has pointed out previously, this optimization comes at considerable cost, and it requires single-cycle memory to operate.
In an FPGA, using the block RAMs, two cycles are required to fill the pipeline. Thus, with the appropriate memory interface, it would be possible to perform sequential reads as a burst transaction. However, I have previously attempted this feat and found it difficult to take proper advantage of the sequential fetch operations, which is one reason I've returned to wrap up my original core.
I would say that an address generator no more complex than the one I have implemented, or the more traditional approach that Arlet uses, is the approach to take with this instruction set architecture. Additional speed improvements come at increased complexity that is not really justified by the limitations of the processor architecture. This statement does not apply to the '816 or the 65Org16/Org32 projects; the additional performance of those instruction set architectures provides the justification for expending more resources to speed up execution.
I was going to say something stupid like "No additional pipelining can be applied to this architecture because ...", but as so many have demonstrated with x86: where there's a will (and money/time) there's a way to improve the performance of any CISC architecture.
For additional performance improvements in an FPGA, I have toyed with the idea of using the dual-port nature of the block RAMs to provide a dual 8-bit fetch path. One port would use the address from the memory address register, and the other port would fetch the next sequential location. In this manner, any two-byte instruction or two-byte parameter would be fetched in a single cycle. This simple enhancement would greatly speed the execution of a large number of the instructions. Cycle counts for instructions like ADC #xx or ADC zp would be reduced by one, cycle counts for instructions like ADC (zp,X) could be reduced by two, and instructions like JMP (abs,X) could be reduced by three.
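As a sketch of that dual fetch path (again Python standing in for the hardware; the names and the flat list standing in for a dual-ported block RAM are just for illustration):

[code]
def dual_port_fetch(mem, addr):
    """Model of a dual 8-bit fetch path: port A returns the byte at the
    memory address register, port B the next sequential byte, so a
    two-byte instruction or two-byte parameter arrives in one cycle."""
    return mem[addr & 0xFFFF], mem[(addr + 1) & 0xFFFF]

# Example: ADC #$40 -- opcode and immediate operand fetched together.
mem = [0] * 0x10000
mem[0x0200], mem[0x0201] = 0x69, 0x40
opcode, operand = dual_port_fetch(mem, 0x0200)
[/code]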
_________________ Michael A.