Hello,
After a long pause, I decided to get back into 6502 hacking, and implement an idea I've been toying with for a few years: using multiple small CPLDs to implement a 6502.
My CPLD of choice was the Xilinx XC9572XL in the 44-pin TQFP package. My original plan was to use 5 or 6 of them, but somewhat to my own surprise, I was able to fit the design into 4. It's a very tight fit, especially for the control logic and the ALU. At first, it seemed completely hopeless, watching the tools allocate big chunks of resources for the simplest expressions, but with a lot of experimenting and reading the fitter reports, I gradually gained an understanding of how to write the code so it would match the capabilities of the CPLDs and the tools.
A few years ago I tried something similar, but noticed that the CPLD was a very poor fit for bigger adders (mostly because there's no fast carry chain, and also because the AND-OR structure is not good for XOR operations), and gave up on the idea. But then a while ago, I was going through the datasheets for another project, and I noticed that nice dedicated XOR gate in each macrocell. I spent a few days going over different ways to turn that XOR into the centerpiece of the ALU.
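To illustrate why a dedicated XOR helps, here is a minimal behavioural sketch in Python of an adder where each sum bit is one AND-OR expression XORed with the carry. It is not the ALU from this project, just a ripple-carry model; the function name `xor_adder` and its parameters are made up for the example.

```python
def xor_adder(a, b, carry_in=0, width=8):
    """Add two width-bit values using AND-OR terms plus one XOR per sum bit."""
    result = 0
    carry = carry_in & 1
    for i in range(width):
        ai = (a >> i) & 1
        bi = (b >> i) & 1
        # AND-OR expression for "a and b differ" at this bit position.
        differs = (ai & (1 - bi)) | ((1 - ai) & bi)
        # The single XOR per bit folds the carry into the sum.
        result |= (differs ^ carry) << i
        # Carry out is a plain sum of products (majority of ai, bi, carry).
        carry = (ai & bi) | (ai & carry) | (bi & carry)
    return result, carry


print(xor_adder(0x12, 0x34))  # (70, 0), i.e. 0x46 with no carry out
print(xor_adder(0xFF, 0x01))  # (0, 1), i.e. wraps to 0 with carry out
```

Without the XOR, the sum bit would have to be expanded into four three-input product terms per bit, which is the kind of blow-up that makes pure AND-OR logic a poor match for adders.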
I had several reasons for picking this particular type of CPLD. It's fairly easy to solder, even by hand, I had previous experience with it, and it also has just enough resources to make this possible, but not too many to make it simple, resulting in a very nice puzzle that has kept me busy for a while.
Everything builds with standard settings, optimized for speed, on ISE 14.7, except for a couple of KEEP attributes at strategic places. Automatic FSM extraction needs to be turned off, because it doesn't respect the ordering in overlapping casez clauses, introducing bugs in the control logic. I did run it with automatic FSM extraction once, copied the state encoding, and then turned it back off. I must say, the tools are pretty amazing when optimizing smallish logic functions, but tend to be clueless about how to deal with more complex stuff, such as deciding when to allocate another macrocell for a subexpression. I always check the fitter reports for extra variables (recognizable by their totally random names with dollar signs). If there are any, I try to rewrite my code to avoid them. The code generation for the built-in "+" expression isn't very good (except for adding/subtracting simple constants), so it's best avoided. Finally, optimizing for size sometimes makes the implementation bigger. I recommend always optimizing for speed, because you get more control over the result.
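For anyone unfamiliar with why clause ordering matters: in a casez statement the first matching item wins, so overlapping patterns only behave correctly if that order is preserved. Below is a small Python model of that rule. The bit patterns and the labels "specific" and "group" are hypothetical, not taken from my control logic.

```python
def casez_match(pattern, value, width=8):
    """True if value matches pattern; '?' and 'z' are don't-care bits."""
    if len(pattern) != width:
        raise ValueError("pattern length must equal width")
    bits = format(value, f"0{width}b")
    return all(p in "?z" or p == v for p, v in zip(pattern, bits))


def decode(value, clauses, width=8):
    """Return the action of the FIRST matching clause, like casez does."""
    for pattern, action in clauses:
        if casez_match(pattern, value, width):
            return action
    return None


# Two overlapping clauses: the first is a special case of the second.
ordered = [("10101001", "specific"), ("101?????", "group")]

print(decode(0b10101001, ordered))                  # specific
print(decode(0b10100000, ordered))                  # group
print(decode(0b10101001, list(reversed(ordered))))  # group (wrong priority)
```

The last line shows the kind of bug you get when a tool treats the clauses as unordered: the special case is silently swallowed by the more general pattern.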
I made a small board, containing the 4 CPLDs, plus an extra CPLD for UART/SPI, some SRAM and Flash, and 6 LED displays on a 74HC595 chain. My goal was to have it run at a speed of 10 MHz. It's currently running stably at 12 MHz. I've added a wait state for the flash, mainly to test the RDY logic, but also because flash is rather slow. I briefly tried it at 24 MHz but it crashed. I haven't tried to find the maximum speed, nor have I done any analysis of the longest path. [Edit: I've added the source code for the extra CPLD as well. Even though it's not strictly part of the project, it can still be useful.]
Bootstrapping was done by writing a simple UART loader and hard-coding it into the I/O CPLD (which is connected to the full address and data bus). The UART loader reads 256 bytes over UART, writes them to memory, and jumps to the first instruction. From there, a secondary loader took more data from the UART and wrote it to flash.
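The first-stage loader's behaviour can be modelled in a few lines of Python. This is only a sketch of the protocol as described above, not the actual 6502 code in the CPLD; the load address 0x0200 and the names `first_stage_load` and `LOADER_SIZE` are assumptions for the example.

```python
import io

LOADER_SIZE = 256  # number of bytes the first-stage loader reads


def first_stage_load(uart, memory, base):
    """Copy LOADER_SIZE bytes from uart into memory at base; return entry point."""
    for offset in range(LOADER_SIZE):
        byte = uart.read(1)
        if not byte:
            raise EOFError("UART stream ended before 256 bytes arrived")
        memory[base + offset] = byte[0]
    # Execution continues at the first byte that was written.
    return base


memory = bytearray(0x10000)   # flat 64 KB address space
payload = bytes(range(256))   # stand-in for the secondary loader image
entry = first_stage_load(io.BytesIO(payload), memory, base=0x0200)
print(hex(entry))             # 0x200
```

Because the first stage always reads exactly 256 bytes, the image sent by the host has to be padded to that size; anything longer belongs to the secondary loader's own protocol.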
The source code still needs to be cleaned up a bit, but I've made a repository on GitHub:
https://github.com/Arlet/cpld-6502
(By the way, if anybody is working in Verilog, I highly recommend the Verilator project:
https://www.veripool.org/wiki/verilator
If I disable all output, it simulates the entire design for 100 million cycles in 21 seconds on my desktop PC. That's more than enough to run Klaus Dormann's verification program.)