I'm a huge fan of multi-core 6502 systems. However, there are significant limitations when a dual core system uses both bus phases.
Firstly, a design which is agnostic between 6502 and 65816 will require tri-state bus transceivers. NMOS 6502 always drives its address lines. 65C02 is more graceful. 65816 uses both bus phases to output its extended address range. A 65C02-only system is less general but more tractable.
Secondly, two sequential writes to SRAM won't work because writes will be scribbled elsewhere when bus control switches. Video displays which use the opposite phase avoid this case because they only read. Commodore's dual core systems used DRAM chips which are now obsolete. If you want a period-accurate system, study Commodore's dual core floppy drives and be prepared to source dodgy DRAM chips. This is not a beginner project. It may be possible to interface contemporary SDRAM. However, this may require a periodic interrupt to a routine which takes the RAM off-line for refreshing. This will hugely affect interrupt response time and is also not a beginner project. It may also be more expensive than SRAM. I'm investigating an SRAM cross-over configuration where one processor writes to an individual chip and the other processor reads it. This requires a minimum of four SRAM chips and has the advantage that each core gets a private stack by default.
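To make the cross-over idea concrete, here is a rough Python model of one possible reading of it. Everything in it is an assumption for illustration rather than a worked-out design: the class name `CrossoverRam`, the split at `SHARED_BASE`, and the rule that each of the four chips has exactly one writing core and one reading core. Under those assumptions, no chip ever sees writes from both bus phases, and page one (the stack) lands in a chip that only its own core touches.

```python
class CrossoverRam:
    """Four single-writer SRAM chips shared by two cores (hypothetical layout).

    chips[(w, r)] is written only by core w and read only by core r.
    """

    SHARED_BASE = 0x8000  # assumed split: below is private, at or above is shared

    def __init__(self, size=0x10000):
        self.chips = {(w, r): bytearray(size) for w in (0, 1) for r in (0, 1)}

    def write(self, core, addr, value):
        if addr < self.SHARED_BASE:
            # Private chip: same core writes and reads (zero page, stack, ...).
            self.chips[(core, core)][addr] = value & 0xFF
        else:
            # Outbound chip: this core writes, the other core reads.
            self.chips[(core, 1 - core)][addr] = value & 0xFF

    def read(self, core, addr):
        if addr < self.SHARED_BASE:
            return self.chips[(core, core)][addr]
        # Inbound chip: written by the other core.
        return self.chips[(1 - core, core)][addr]


ram = CrossoverRam()
ram.write(0, 0x01FF, 0xAA)   # core 0 pushes onto its own stack page
ram.write(1, 0x01FF, 0x55)   # core 1 uses the same address, different chip
ram.write(0, 0x8000, 0x42)   # core 0 posts a byte for core 1
```

One consequence of this particular layout is that a core cannot read back what it wrote to the shared region, so the shared area behaves like a pair of one-way mailboxes rather than common RAM.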
Thirdly, 65xx peripheral chips are really not suited for use on both bus phases. The NMOS versions used the idle phase to draw energy in preparation for the processor reading a register. An NMOS system which overcame this limitation would then encounter the two sequential write problem. If you devise a system where a peripheral chip switches bus phase, timers will drift. This means periodic interrupts will only be approximate. This may be acceptable for task switching but it may be incompatible with tasks such as a software UART. It is preferable for each peripheral chip to have an affinity to one core or one bus phase. If you want two cores to share all RAM then it is preferable for only one core to receive interrupts. This allows cores to be differentiated by idling in a loop until one receives an interrupt.
Fourthly, a parallel bus over multiple boards may greatly limit the maximum operating frequency. This may be incompatible with video or lead to a system where all cores are slower than a single core system. Distance (and cost) can be reduced by stacking processors, cross-over SRAM and I/O chips. Two processor cards may be a boost if they have integrated peripherals. However, a RAM card, a serial port/parallel port card, a mouse/keyboard card, a sound card, a video card and two processor cards will only keep pace if the parallel bus consumes a huge amount of energy. I hope you didn't want a portable version of your design.
jds on Fri 1 Apr 2022 wrote:
It would also be helpful to set up a memory location to identify if the processor is the first or second one, simply by being able to read the phi0 clock.
If two cores have the same RAM (somehow) and the same ROM but only one core has I/O then both may mark a boot-strap state. Both may copy a loop from ROM to RAM. Both may configure an interrupt. (One core does so in vain.) One core receives the interrupt and escapes the loop. This core performs LDX #$7F // TXS. The other core is coaxed out of the loop and performs LDX #$FF // TXS. The stack pointer decrements on each push, so these starting values keep each core within its own 128 byte half of page one and the top bit of RegS never changes. Core number is now stored in the top bit of RegS. It is now possible to determine the core using TSX // TXA // BMI or similar. This eliminates the hardware for a clock phase register, although it does so at the expense of maximum interrupt performance. If interrupts are vectored in RAM then it requires no additional overhead.
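The stack pointer trick can be checked with a small Python model. This is only a sketch: the `Core` class and `ids_seen` helper are made up for illustration and model nothing but an 8 bit stack pointer which decrements after each push. It assumes the aim is for bit 7 of RegS to stay fixed for the life of each core, and compares starting values $7F and $FF against $00 and $80, both of which flip bit 7 on the very first push.

```python
class Core:
    """Models only the 6502 stack pointer; no other registers or memory."""

    def __init__(self, initial_s):
        self.s = initial_s & 0xFF        # equivalent of LDX #initial_s / TXS

    def push(self):
        addr = 0x0100 | self.s           # byte is stored at $0100 + S ...
        self.s = (self.s - 1) & 0xFF     # ... then S decrements, wrapping at 8 bits
        return addr

    def core_id(self):
        # Equivalent of TSX / TXA / BMI: bit 7 of S selects the core.
        return (self.s >> 7) & 1


def ids_seen(initial_s, depth):
    """Return every core id observed while pushing `depth` bytes."""
    core = Core(initial_s)
    seen = {core.core_id()}
    for _ in range(depth):
        core.push()
        seen.add(core.core_id())
    return seen


print(ids_seen(0x7F, 127))   # stays in $0100-$017F
print(ids_seen(0xFF, 127))   # stays in $0180-$01FF
print(ids_seen(0x00, 1))     # wraps to $FF after one push
print(ids_seen(0x80, 1))     # drops to $7F after one push
```

The model also shows the limit of the scheme: each core has at most 128 bytes of stack, and anything which pushes deeper than that crosses into the other core's half and changes the apparent core number.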
jds on Fri 1 Apr 2022 wrote:
Rockwell had a dual processor chip in their data book, but interestingly it wasn't just two 6502's with inverted clocks, they actually shared most parts apart from the internal registers, so things like the ALU were working twice as fast, but this surely was a very efficient multiprocessor.
I strongly suspect that the dual core, NMOS Rockwell processor was vaporware. 6502 uses both clock phases internally and therefore the scope for multiplexing the ALU is very limited. Yield for two cores on one die would be poor. It would also be very awkward to stack two dies in one package. The easiest option is to stack two packaged chips. Even here, gains for NMOS 6502 are small because tri-state bus transceivers are required on all address lines. This is in addition to separate interrupt lines and a two-phase clock.