I've seen reference in the forum archive to an SBC [Single Board Computer] which could be stacked with another instance of itself (DBC [Dual Board Computer]?) and could be partially populated with extra RAM and an optional second processor. This concept can be greatly expanded with extra ROMs, extra serial ports, dual display and extra expansion connectors. However, the maximal system is not the topic of this discussion. After this flight of fancy, I then considered the minimal system. This was aided by reference to a possibly mythical dual core 6502 from Rockwell. Given other Rockwell variants, this has just enough plausibility to be true but I'm not sure how Rockwell would have implemented it.
A typical method to make a hyperthreaded processor is to have one ALU and two or more sets of registers. However, 6502 has circuitry which is clocked on both edges. 6502 also lacks LEA type instructions due to the technique of placing the output of one register on the address bus while loading another. This type of arrangement does not lend itself to hyperthreading. I also considered the mirrored die technique where, for example, quad core processors may resemble a four leaf clover. I'm not sure this process would be suitable for a dual core 6502 either, although I could be wrong. It just strikes me as unsuitable for a vaguely pin compatible implementation in 40 pin DIP.
Perhaps the mythical Rockwell chip is an early example of chip stacking. RAM stacking is generally known and I believe that mvk's Gigatron has functionality to stack two 32KB static RAM chips. It is common practice to stack 2, 4 or 8 units of DIP RAM. Address decode may be provided by 74x139 dual 2-to-4 line decoder or 74x138 3-to-8 line decoder. More recently, chip vias have allowed 96 or more layers of storage within one package. However, I also find this technique unsatisfactory. I presume that it is possible to double wire 6502s within a 40 pin DIP. However, it would be significantly easier to test, package and test single core chips and then stack 'em like RAM chips.
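The address decode for stacked RAM can be sketched in a few lines. This is only a behavioural model of one half of a 74x139 (active-low enable, active-low outputs). The stack of four 16KB chips selected by A15 and A14, and the function names, are assumptions made for illustration and do not describe any particular board.

```python
def decode_2to4(enable_n, a1, a0):
    """Model one half of a 74x139: returns four active-low outputs."""
    if enable_n:
        # Decoder disabled: every output stays high (no chip selected).
        return (1, 1, 1, 1)
    selected = (a1 << 1) | a0
    return tuple(0 if i == selected else 1 for i in range(4))


def chip_selects(address):
    """Hypothetical stack of four 16KB RAMs: A15 and A14 pick the chip."""
    a15 = (address >> 15) & 1
    a14 = (address >> 14) & 1
    return decode_2to4(0, a15, a14)
```

All other pins of the stacked chips are wired in parallel. Only the chip select pin of each chip goes to a separate decoder output, which is why RAM stacking needs just one pin to differ.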
RAM stacking is quite easy because only one pin differs. However, processor stacking is a more difficult proposition because a larger number of signals differ. Starting with 65C02 with tri-state address and data pins, it is possible to gang power, address and data lines. However, clock and interrupts differ significantly. Indeed, to prevent bus contention, it is highly desirable to have two clocks with, for example, 45% duty in phases which never overlap. This in itself creates a problem which may invalidate the whole technique. Specifically, reduced duty at the same frequency reduces system tolerances. This may not be apparent, for example, unless a system is warm. Alternatively, it is possible to maintain the minimum clock phase duration if the oscillator frequency is slightly reduced. In systems where this is possible, the result will be an undesirable reduction in single-threaded performance.
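The trade-off between duty and frequency is simple arithmetic. The sketch below assumes an idealised square clock, a 50% duty baseline and a hypothetical 1 MHz system. The function names are mine and real parts will have additional rise, fall and hold constraints.

```python
def phase_high_ns(freq_mhz, duty):
    """Duration of the high phase of one clock, in nanoseconds."""
    period_ns = 1000.0 / freq_mhz
    return period_ns * duty


def dead_time_ns(freq_mhz, duty):
    """Gap between the two non-overlapping high phases.

    Two clocks of equal duty leave (1 - 2 * duty) of the period unused,
    split into two gaps per period.
    """
    period_ns = 1000.0 / freq_mhz
    return period_ns * (1.0 - 2.0 * duty) / 2.0


def derated_freq_mhz(freq_mhz, duty, original_duty=0.5):
    """Frequency at which the reduced-duty phase lasts as long as the
    original phase did at the original frequency."""
    return freq_mhz * duty / original_duty
```

Under these assumptions, a 1 MHz system at 45% duty has roughly 450 ns phases with roughly 50 ns of dead time on each side. Restoring the original 500 ns phase requires dropping to about 0.9 MHz, which is a 10% loss of single-threaded performance.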
Interrupts may be equally awkward. We now have two NMI and two IRQ signals. However, we have absolutely no atomic instructions which are suitable for locking resources. Specifically, 6502 read-modify-write instructions do not have the desired effect when they occur on opposite phases of a binary clock signal. Furthermore, rather than having a (relatively) deterministic scale of three priority levels, we now have nine cases and no locking. Good luck with that. For this reason, it may be considerably easier to restrict all interrupts to one core or, at most, a doorbell or timer on the other core.
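Both problems can be illustrated with a toy model. It assumes that the three priority levels are main code, IRQ handler and NMI handler, and that two cores on opposite clock phases alternate bus accesses cycle by cycle. The model is a deliberate simplification and is not cycle accurate.

```python
from itertools import product

# Assumed reading of the three priority levels on a single core.
LEVELS = ("main", "IRQ", "NMI")

# With two independent cores, every pairing of levels can occur.
combined_states = list(product(LEVELS, repeat=2))


def interleaved_inc(memory, addr):
    """Two cores each execute INC addr with alternating bus cycles.

    Each core reads, modifies internally, then writes. Because the
    accesses interleave, both cores read the same old value.
    """
    a = memory[addr]                # core A read cycle
    b = memory[addr]                # core B read cycle (opposite phase)
    memory[addr] = (a + 1) & 0xFF   # core A write cycle
    memory[addr] = (b + 1) & 0xFF   # core B write overwrites core A
    return memory[addr]
```

There are nine entries in `combined_states`, and two interleaved increments of a zeroed location leave it at 1 rather than 2. The lost update is the reason that INC, DEC and the shift instructions cannot be used as locks in this arrangement.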
However, I have yet to mention the most fundamental problem. How does each core distinguish itself? I initially thought that the solution would require a tri-state buffer which could be used to selectively reveal clock phase on the common data bus. One processor always sees zero and one processor always sees one. Indeed, for upward compatibility, it is possible to extend this principle across multiple arrangements in a manner which is software compatible. In the trivial case of a mono core system, $FF indicates data bus pulled high and register not present. 0000001x indicates dual core, 000001xx indicates quad core, 00001xxx indicates octo core and suchlike.
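The encoding above can be decoded by finding the highest set bit. The sketch below assumes that the x bits are the core index and that patterns wider than octo core follow the same rule. The function name is hypothetical and zero is rejected because no pattern above produces it.

```python
def decode_core_id(value):
    """Return (core_count, core_index) from the indicator register.

    $FF means the data bus was pulled high, so the register is absent
    and the system is mono core. Otherwise the highest set bit marks
    the core count and the bits below it hold this core's index.
    """
    value &= 0xFF
    if value == 0xFF:
        return (1, 0)
    if value == 0:
        raise ValueError("no marker bit set")
    index_bits = value.bit_length() - 1
    core_count = 1 << index_bits
    core_index = value & (core_count - 1)
    return (core_count, core_index)
```

For example, a core which reads 00000011 is core 1 of 2 and a core which reads 00000110 is core 2 of 4. Software written for the dual core case only needs to test the bottom bit.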
In the trivial case of dual core processor stacking and single core interrupts, it is also possible to omit any indicator register. This requires a rather tenuous boot sequence, although if you are familiar with a boot sequence proposed by Dr Jefyll, you may find this proposal quite straightforward:
- Both cores exit reset. Either core leads but, hopefully, cores are only staggered by one phase. Allowances should be made to improve reliability.
- Both cores ignore interrupts.
- Both cores set peripherals to idle state.
- Both cores copy an idle loop to RAM.
- Both cores wait for a brief period.
- Both cores enter the idle loop and listen for interrupts.
- One core receives timer interrupt.
- While handling interrupt, one core releases the other from the idle loop.
- Interrupted core waits for a brief period.
- Interrupted core releases itself from the idle loop.
- Each core sets a different stack pointer and performs separate tasks.
- Each core can be distinguished by stack pointer range. In the trivial case, the top bit of the stack pointer is sufficient.
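The final step of the list can be sketched as follows. On real hardware the test would be a TSX followed by a branch on the sign flag. The model below only shows the arithmetic, and the initial stack pointer values ($FF for one core, $7F for the other) are assumptions chosen for illustration.

```python
# Hypothetical initial stack pointers: each core gets half of page 1.
SP_INIT_UPPER = 0xFF  # stack occupies $0180-$01FF
SP_INIT_LOWER = 0x7F  # stack occupies $0100-$017F


def core_from_sp(sp):
    """Identify the running core from the top bit of its stack pointer."""
    return (sp >> 7) & 1


def stays_in_range(sp_init, depth):
    """True if pushing `depth` bytes keeps the pointer in its own half."""
    sp = (sp_init - depth) & 0xFF
    return core_from_sp(sp) == core_from_sp(sp_init)
```

This only works while each core stays within its own 128 bytes. A core which pushes all 128 bytes wraps into the other half and is then misidentified, which is why the stack depth figures matter.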
It might be desirable to skew stack allocation by half of the maximum nested interrupt stack depth. Alternatively, processor variants may trivially allow configuration of separate stack pages. However, from the empirical figures of GARTHWILSON and the 64 byte stack allocation of GeckOS, 128 byte stack may be more than sufficient.
Anyhow, the minimal dual core 65C02 computer requires one chip stack, minor modification to clock circuitry, minor reduction in clock speed or reduced bus tolerance and minor modification to software. It does not require I/O ports, FIFOs which incorporate clock phase into the addressing or doorbell circuitry. Although, it might be useful to provide an interrupt of some form to the second core. This arrangement may be retro-fitted to a large number of installations. Furthermore, the result may be compatible with clock stretching or other techniques which are typically used to access slow peripherals.