Using ISE 12.4, and pushing the synthesis a little (high effort, optimize for timing), I can get my Atom model (6502 CPU + RAM/ROM/PIA/VDG) to meet timing with a 14 ns clock constraint:
All constraints were met.
Data Sheet report:
-----------------
All values displayed in nanoseconds (ns)
clk | 13.893
This is on a Spartan-3 (XC3S200-5FT256). On a Spartan-6, I get a synthesis estimate of 111 MHz. Mapping fails due to bad pin constraints (I just used the Spartan-3 constraints file).
I'm really looking forward to seeing how my 65k core compares, although I think that, as a first-timer writing a completely new core with extra functionality, mine will be much bigger. But it's generic, configurable to 16-, 32- or 64-bit registers and a maximum memory interface width of 8, 16 or 32 bits, so I could at least try to compare the 16-bit register/8-bit memory interface configuration with the numbers here. Maybe I should add a switch to turn off the extra opcodes and registers, just for comparison...
BTW: current status is that I have almost finished writing the fetch/decode logic - which is quite complex (I think this is the most complex part in my design actually) as I do prefix bytes, but it _should_ support decoding one opcode per cycle as long as memory bandwidth permits (so I should be able to break the one opcode / at least two cycles limit of the 6502 :-) It's untested as of now, though.
... Arlet's is the only design which uses LUTs as 16-bit RAMs - this could be a reason why the slice count is by far the smallest.
I just ran a couple of runs with two versions of Arlet's core, one of which avoids using RAMs for the register file. The size difference is there, but not enormous. This is with version 12.4 of the Xilinx tools, targeting an xc3s50a-5-tq144:
Any difference in speed ? Can you show the code ? I'm curious how you implemented it.
/*
 * write to a register. Usually this is the (BCD corrected) output of the
 * ALU, but in case of the JSR0 we use the S register to temporarily store
 * the PCL. This is possible, because the S register itself is stored in
 * the ALU during those cycles.
 */
always @(posedge clk)
    if( write_register & RDY )
        // this version will get us a RAM - denser!
        AXYS[regsel] <= (state == JSR0) ? DIMUX : { ADD[dw-1:4] + ADJH, ADD[3:0] + ADJL };
        //case( state )
        //    JSR0: AXYS[regsel] <= DIMUX;
        //    default : AXYS[regsel] <= { ADD[dw-1:4] + ADJH, ADD[3:0] + ADJL };
        //endcase
So you only changed the code slightly, by using that case statement, and the synthesizer no longer recognized the RAM.
I was thinking that instead of the register array, indexed by 'regsel', it's possible to use the register more directly by rewriting some of the code. For instance, when doing stack operations, the S register can be sent directly to the address bus, instead of going through the AXYS[] array. This may be a little faster (at the cost of more logic).
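To make that idea concrete, here is a minimal sketch of S held in its own flip-flops and driven straight onto the address bus for stack cycles. It is not taken from the core: the module name `s_direct` and the signals `write_s`, `s_next`, `stack_access` and `other_addr` are all made up for illustration, and only the page-1 stack address is standard 6502 behaviour.

```verilog
// Hypothetical sketch: S as a dedicated register instead of an AXYS[] entry.
// All port names are assumptions, not signals from the actual core.
module s_direct (
    input  wire        clk,
    input  wire        RDY,
    input  wire        write_s,      // assumed strobe: load S this cycle
    input  wire [7:0]  s_next,       // assumed new S value (e.g. from the ALU)
    input  wire        stack_access, // assumed: this cycle addresses the stack
    input  wire [15:0] other_addr,   // assumed address for all other cycles
    output wire [15:0] AB
);
    reg [7:0] S;

    // Plain flip-flops: no distributed RAM is inferred for this register.
    always @(posedge clk)
        if (write_s & RDY)
            S <= s_next;

    // The 6502 stack lives in page 1, so S supplies only the low byte.
    assign AB = stack_access ? {8'h01, S} : other_addr;
endmodule
```

The read path to the address bus is now one 2:1 mux per bit instead of a register-file lookup, but S costs eight flip-flops of its own, which is the "more logic" side of the trade.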
Actually, both versions are your code, from this time last year (private emails)! We did discover at that time how fragile (or robust) the RAM-recognition was.
I see what you mean - the S value from the ALU could be routed to the Address pins. I think I have never looked carefully at the timing reports, to see where the critical path is. I did determine that synthesis tactics could make quite a difference to speed.
It would be interesting to see what would happen if the design was changed from the current one, which mimics the way the original 6502 used the ALU to do everything, to one where the registers are separated, and combined with some logic. So, you could have an X-register unit that was capable of holding the value, but also incrementing and decrementing it. That way you could implement INX and DEX without going through the ALU.
This could still be done at the same number of cycles, but you could also choose to deviate from the original 6502 instruction times, and try to make it faster. In theory, you should be able to do INX and DEX in 1 cycle, for instance.
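As a rough illustration of such a register unit, the sketch below gives X its own incrementer and decrementer, so INX and DEX never touch the ALU. The module name `x_reg` and the control inputs `load`, `inc` and `dec` are assumptions: the decoder that would drive them is not shown, and none of this comes from the existing core.

```verilog
// Hypothetical X-register unit with local increment/decrement.
// Control signal names are assumptions for illustration only.
module x_reg (
    input  wire       clk,
    input  wire       RDY,
    input  wire       load,   // assumed: load X from din (LDX, TAX, ...)
    input  wire       inc,    // assumed: INX decoded this cycle
    input  wire       dec,    // assumed: DEX decoded this cycle
    input  wire [7:0] din,
    output reg  [7:0] X,
    output wire       Z_next, // zero flag for the value being written
    output wire       N_next  // negative flag for the value being written
);
    // Next-state logic: load has priority, then inc, then dec, else hold.
    wire [7:0] x_next = load ? din :
                        inc  ? X + 8'd1 :
                        dec  ? X - 8'd1 : X;

    always @(posedge clk)
        if (RDY)
            X <= x_next;

    // INX and DEX update N and Z, so the flags are derived locally as well.
    assign Z_next = (x_next == 8'd0);
    assign N_next = x_next[7];
endmodule
```

With this arrangement the new X value is ready one clock edge after the opcode is decoded, which is what would make a 1-cycle INX or DEX possible in principle.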
Single cycle instructions? I always thought that would mean major rework - be interesting to see it though! And we know the 65CE02 managed it. In an FPGA fabric, an incrementer local to the register might well be efficient. It would presumably simplify push and pop type operations.
I didn't say it wouldn't involve major rework.
For instance, if you have two consecutive INX instructions, the first INX instruction writes the new X register value at the same time that the second INX instruction already needs it. So, to avoid using the old value, it needs an extra MUX to choose between the current value of the X and the new value that it will get.
Also, all these extra muxes may mean the maximum clock rate will drop, which could make the end result slower for typical programs. The problem is that you won't find out until you've already done most of the work.
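The extra mux for the back-to-back INX case could look roughly like this. It is only a sketch of the forwarding idea under an assumed pipelined design: `x_pending` and `x_new` are hypothetical signals that some earlier pipeline stage would have to provide.

```verilog
// Hypothetical forwarding mux for X. Signal names are assumptions.
module x_forward (
    input  wire [7:0] X,         // value currently held in the X register
    input  wire       x_pending, // assumed: an earlier instruction writes X
                                 // at the upcoming clock edge
    input  wire [7:0] x_new,     // assumed: the value that write will store
    output wire [7:0] x_operand  // value the following instruction should use
);
    // Use the not-yet-written value when a write is pending, so the second
    // INX sees the result of the first rather than the stale register.
    assign x_operand = x_pending ? x_new : X;
endmodule
```

Every consumer of X now sits behind one more mux level, and that added delay is the kind of thing that could pull the maximum clock rate down.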