Arlet, thank you. Are you Arlet Ottens, the author of https://github.com/Arlet/verilog-6502? I tried your core with my Pool demo, and it dropped in very easily. I closed timing on it at 50 MHz in a Spartan 3 with no problem, and it is very economical on FPGA resources.
Ed, as you probably know, the Spartan 3 FPGA has 4-input LUTs, and the fastest designs are those in which each register's data input is a function of 4 or fewer signals. If a combinatorial expression depends on more than 4 inputs, the tool chain has to cascade several LUTs. That adds logic and routing delay between the flip-flops and can reduce the maximum achievable FPGA clock rate.
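To put rough numbers on the 4-input limit, here is a small Python sketch. The helper name lut4_tree is hypothetical, and it only computes lower bounds for a function that depends on all n inputs (a wide AND/OR reaches both bounds). The actual mapping chosen by the tools may use more LUTs and levels.

```python
from typing import Tuple


def lut4_tree(n_inputs: int) -> Tuple[int, int]:
    """Hypothetical helper: lower bounds on (LUT count, logic levels) for a
    function of n_inputs signals mapped onto 4-input LUTs."""
    if n_inputs < 1:
        raise ValueError("need at least one input")
    # Each 4-LUT takes up to 4 signals and produces 1, so it removes at most
    # 3 signals from the pool; reducing n signals to 1 needs ceil((n-1)/3) LUTs.
    luts = max(1, -(-(n_inputs - 1) // 3))
    # With d levels of 4-input LUTs, the output can depend on at most 4**d inputs.
    levels, reach = 1, 4
    while reach < n_inputs:
        reach *= 4
        levels += 1
    return luts, levels


if __name__ == "__main__":
    for n in (4, 5, 10, 16, 17):
        luts, levels = lut4_tree(n)
        print(f"{n:2d} inputs -> {luts} LUT(s), {levels} level(s)")
```

A 4-input function fits in one LUT with one level of delay; a fifth input already forces a second LUT and a second level between the flops.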
Increasing the depth of combinatorial logic between registers in my design would indeed bring the FPGA clock and the 6502 clock closer together (a smaller ratio). However, it would also make timing closure harder, so the 6502 clock rate ultimately achieved might not improve at all.
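A simple first-order timing model (an assumption for illustration, not a measurement of this design) shows why. Let R_1 be the number of FPGA clocks per 6502 cycle with one logic instance, t_reg the per-clock register overhead (clock-to-output plus setup), and t_logic the delay through one copy of the combinatorial logic. With k copies cascaded between the registers, and assuming R_1 is a multiple of k:

```latex
\begin{aligned}
T_{\mathrm{FPGA}}(k) &\approx t_{\mathrm{reg}} + k\,t_{\mathrm{logic}} \\
R_k &= R_1 / k \\
T_{6502}(k) &= R_k\,T_{\mathrm{FPGA}}(k) \approx R_1\left(\frac{t_{\mathrm{reg}}}{k} + t_{\mathrm{logic}}\right)
\end{aligned}
```

Under this model, deeper logic only removes the register-overhead term. If t_logic dominates t_reg, the 6502 period barely changes, and any extra delay from harder routing makes the result worse still.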
In the file chip_6502.v you will find this wrapper module around the combinatorial logic cloud, which is pulled in with `include:
module LOGIC (
    input  [`NUM_NODES-1:0] i,
    output [`NUM_NODES-1:0] o);
    `include "logic.inc" // this file contains combinatorial assign statements
endmodule
There is only one instance of this module; its outputs are registered and fed back to its inputs. I have experimented with a cascade of two and a cascade of four instances between the registers. In both cases the FPGA clock had to be reduced, so the 6502 clock ended up about the same, and duplicating the entire combinatorial cloud doubled (or quadrupled) the FPGA fabric resources consumed!
So my sledgehammer approach of simply duplicating the entire cloud did not work. I would need to be more selective about which unnecessary registers to remove. Perhaps an algorithm could be devised that understood logic delays and decided which registers to remove, or it could be done by hand. It is an interesting problem, and I agree there is potential here to improve performance.