Hmm, I don't think the idea is well-described as NUMA - that's where different parts of the memory system are at different distances from the core in question.
What is a core? Does it have one memory bus? Two? Four? Eight? Or, should we be more precise and consider
masters instead? From the POV of memory, there's no knowledge of how many execution engines exist; only a certain number of "ports", to which a "master" can connect.
One core can easily have two masters: one for instruction fetch, and one for data access (in fact, that's the definition of a Harvard architecture system). Or, you can have two cores, with one general purpose port each. Or you can have one core with six master ports: one for instruction fetch, three for integer load/store, and two for floating point (e.g., as might be found on a superscalar architecture).
The definition of core is both hazy and misleading. It's better to think in terms of bus masters and bus slaves.
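To make the master/slave framing concrete, here's a minimal sketch of a Harvard-style core exposing two bus-master ports. All port and module names here are mine for illustration, not from any real design:

```verilog
// Hypothetical Harvard-style core: two masters belonging to one core.
module harvard_core(
    input             clk,
    // Master 1: instruction fetch
    output reg [15:0] i_adr,   // fetch address
    output            i_cyc,   // fetch strobe
    input      [15:0] i_dat,   // fetched instruction
    // Master 2: data access
    output reg [15:0] d_adr,   // load/store address
    output            d_cyc,   // access strobe
    output            d_we,    // 1 = store, 0 = load
    output reg [15:0] d_out,   // store data
    input      [15:0] d_in     // load data
);
    // The memory system sees only these two masters; whether they
    // belong to one core or to two separate cores is invisible to it.
endmodule
```

The superscalar case is the same idea with six such master interfaces instead of two.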
It seems we're agreed that the '816 takes about 5 times the source (C or HDL) to describe compared to the '02.
PLA logic doesn't necessarily scale like that, and if you write your HDL in the manner of a PLA (which is exactly what SMG enables), you gain that benefit. My first 4 attempts to implement the KCP53000 CPU core failed spectacularly; all hand-written Verilog, using the usual case-statement methods, or ternary logic approaches, or what-have-you. I'm going to go on record and risk my reputation saying this here and now:
Don't do that.
Verilog is a
disgustingly bad language for expressing state machine decoders. It is so bad, in fact, that when I switched my instruction decode logic from nested case statements to the output of compiling my SMG code, besides a
factor-of-4 reduction in lines of code written, the number of logic cells consumed in the FPGA dropped by over a thousand.
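To illustrate the difference in style (this is a toy decoder, not actual KCP53000 or SMG output; the state encoding is made up, though the opcodes happen to be real 65xx immediates):

```verilog
// The same output signal written two ways.
module decode_styles(
    input  [1:0] state,
    input  [7:0] opcode,
    output reg   rd_en_case,  // nested-case version
    output       rd_en_pla    // PLA / sum-of-products version
);
    localparam FETCH  = 2'd0;
    localparam OP_LDA = 8'hA9, OP_LDX = 8'hA2;

    // Nested-case style: every state/opcode pair spelled out by hand.
    // This is what scales so poorly as the instruction set grows.
    always @(*) begin
        rd_en_case = 1'b0;
        case (state)
        FETCH: case (opcode)
               OP_LDA:  rd_en_case = 1'b1;
               OP_LDX:  rd_en_case = 1'b1;
               default: ;
               endcase
        default: ;
        endcase
    end

    // PLA style: one flat sum-of-products per output, each term a
    // product of decoded inputs.  This is the form SMG-compiled
    // decoders take, and shared terms synthesize to shared logic.
    assign rd_en_pla = ((state == FETCH) & (opcode == OP_LDA))
                     | ((state == FETCH) & (opcode == OP_LDX));
endmodule
```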
I'm willing to bet that if someone decapped a 65816 and studied the PLA, they'd find that
most of the minterms are shared across all five modes of operation.
What I think is more significant is that you may end up losing precise timing closure with a real 65816, at least if you're targeting an FPGA. Because the 65816 circuitry is level sensitive and not edge-sensitive, it's difficult (or possibly even impossible) to precisely match its timing characteristics. To get everything to match, you have two choices:
1. You'd need to make everything, even clocked logic, out of asynchronous gates, and FPGAs definitely don't like to synthesize such designs, OR,
2. You need to clock your logic 2x as fast as a real 65816, and just bite the bullet and implement your logic so that things happen on alternating even and odd cycles. (This is the approach taken by the 80386 processor, by the way; a 33MHz 80386 is really clocked at 66MHz on the motherboard.)
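The second option can be sketched roughly like this (a skeleton under my own naming, not working 65816 logic):

```verilog
// Clock the FPGA logic at 2x the 65816's PHI2 rate and use a phase
// bit to stand in for the level-sensitive halves of the real cycle.
module two_phase(input clk2x, input reset);
    reg phase;  // 0 = "PHI2 low" half, 1 = "PHI2 high" half
    always @(posedge clk2x)
        if (reset) phase <= 1'b0;
        else       phase <= ~phase;

    always @(posedge clk2x) begin
        if (!phase) begin
            // even cycle: what the real part does while PHI2 is low
            // (e.g., drive the address and bank byte)
        end else begin
            // odd cycle: what it does while PHI2 is high
            // (e.g., sample or drive the data bus)
        end
    end
endmodule
```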
People here lament the lack of a 65816 core; I cannot speak for why others haven't made one, but there are several reasons why I did not make my own (and went the stack CPU route instead), and why I still wouldn't want to:
1. I didn't have a tool like SMG available at the time. Hand writing PLA logic in Verilog is a pain. This issue is now resolved, however.
2. Stack CPUs map naturally to the fully synchronous circuits that FPGAs prefer. My S16X4 is as fast as a 65816 in practice, but consumes only 300 lines of Verilog. Not even exaggerating.
3. I didn't want to walk over WDC's sole source of income. I would never be able to look at myself in the mirror and feel comfortable with myself if I did. If I were to release a 65816 clone, it would receive a significant overhaul (see below), to the point where folks might not want to support WDC anymore. In particular, my 65816 clone would:
a. Throw away current timing constraints. You don't need them in practice. A subroutine call on the latest Intel CPUs no longer takes 25 cycles to complete. More like 2. Most RET instructions consume 0 cycles today.
b. Throw away the multiplexed bus. Unnecessary on an FPGA.
c. Throw away the 8-bit data bus, and replace it with a real 16-bit bus, minimum. I might actually go with a 32-bit bus, making JML and JSL instructions faster to decode for free.
d. Support single-cycle misaligned accesses via a Motorola 68040-like bus, where all 24 address bits are exposed, and all bus transfers are tagged with a size tag (8-bit, 16-bit, 24-bit, 32-bit transfer, etc). This leaves it up to external logic to split memory accesses into multiple cycles if it wants. If you have a data/instruction cache with long lines, this can really boost performance. It also lets me focus on what's relevant (proper separation of concerns). This is also the approach I take with my KCP53000 CPU, so I speak from experience here.
e. Split instruction and data memory fetches into separate bus masters. Again, it just makes implementing the CPU that much easier. It does push the complexity off into a subsequent stage of circuitry, but it's manageable.
Extra credit:
f. Provide better support for MMUs, privilege modes, hardware coprocessors, multiple cores, and so forth. It's so easy to instantiate multiple cores in Verilog that it'd be irresponsible of me to not consider these things.
g. Macro-op fusion to allow things like STA, DEX, BNE sequences to execute in a fraction of the time it'd normally take.
Of course, all these performance enhancements would compound.
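As a taste of point (g), fusing just the DEX/BNE tail of such a sequence could look something like this in decode. The DEX and BNE encodings are the real 65xx ones; everything else is hypothetical:

```verilog
// Sketch of macro-op fusion detection: if the fetch window holds DEX
// immediately followed by BNE, the sequencer can issue a single fused
// "decrement-and-branch-if-not-zero" micro-op instead of two ops.
module fuse_dex_bne(
    input  [23:0] fetch_window,  // next three instruction-stream bytes
    output        fuse           // 1 = take the fused path
);
    wire [7:0] op0 = fetch_window[7:0];    // first opcode
    wire [7:0] op1 = fetch_window[15:8];   // next opcode
    assign fuse = (op0 == 8'hCA) && (op1 == 8'hD0);  // DEX, then BNE rel
    // On fusion, the sequencer would issue the fused micro-op and
    // advance PC by 3 (DEX is 1 byte, BNE is 2), retiring the loop
    // tail in a fraction of the usual time.
endmodule
```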