A 6502 only uses its bus during one half of the clock cycle, so it is relatively trivial to make a dual core system by running two processors on alternate clock phases. However, this arrangement is hostile to further processors and provides no additional benefit even if that obstacle is overcome.
A common technique to increase processing power is a square or rectangular grid of nodes in which adjacent nodes are directly connected. I believe that some of Seymour Cray's designs followed this pattern. It was also a common arrangement for INMOS Transputers and the subsequent XMOS designs, and it is the arrangement used by the GreenArrays chips associated with Chuck Moore. Indeed, if Seymour Cray's fastest designs and Chuck Moore's small, cheap, flexible designs have a common element, it is probably worth investigating.
I found that it is possible to make a two dimensional (or higher) grid of 6502 processors by running adjacent processors on alternate clock phases. In the trivial case, this produces a checkerboard of node phases. Adjacent nodes may communicate via single port RAM, which is available to both processors because they only access it on alternate memory cycles. Practical arrangements should scale beyond 144 cores. Timing within the grid is quite critical; however, a basic technique, practiced with success by Seymour Cray, is to use wires of equal length for clock distribution.
The basic unit may be tessellated at least 36 times:-
Code:
           ^                                 ^
           |                                 |
           v                                 v
      +---------+      +---------+      +---------+      +---------+
      |  6502   |      |         |      |  6502   |      |         |
<---->|  (Even  |<---->| Buffers |<---->|  (Odd   |<---->| Buffers |<---->
      |  Phase) |      |         |      |  Phase) |      |         |
      +---------+      +---------+      +---------+      +---------+
           ^                                 ^
           |                                 |
           v                                 v
      +---------+                       +---------+
      |         |                       |         |
      | Buffers |                       | Buffers |
      |         |                       |         |
      +---------+                       +---------+
           ^                                 ^
           |                                 |
           v                                 v
      +---------+      +---------+      +---------+      +---------+
      |  6502   |      |         |      |  6502   |      |         |
<---->|  (Odd   |<---->| Buffers |<---->|  (Even  |<---->| Buffers |<---->
      |  Phase) |      |         |      |  Phase) |      |         |
      +---------+      +---------+      +---------+      +---------+
           ^                                 ^
           |                                 |
           v                                 v
      +---------+                       +---------+
      |         |                       |         |
      | Buffers |                       | Buffers |
      |         |                       |         |
      +---------+                       +---------+
           ^                                 ^
           |                                 |
           v                                 v
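To make the checkerboard explicit, here is a trivial sketch in C. The grid coordinates and the E/O labels are purely illustrative; the only point is that assigning phase by coordinate parity guarantees every node differs in phase from all four of its neighbours.
Code:
#include <stdio.h>

/* Checkerboard phase assignment: a node at grid position (x, y) runs on the
   even clock phase when x + y is even and on the odd phase otherwise, so
   every node differs in phase from all four of its neighbours.             */
static int node_phase(int x, int y)
{
    return (x + y) & 1;                 /* 0 = even phase, 1 = odd phase */
}

int main(void)
{
    int x, y;

    /* Print a 4 x 4 corner of the grid: E = even phase, O = odd phase. */
    for (y = 0; y < 4; y++) {
        for (x = 0; x < 4; x++) {
            printf("%c ", node_phase(x, y) ? 'O' : 'E');
        }
        printf("\n");
    }
    return 0;
}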
Nodes along the perimeter are most suitable for high bandwidth I/O. For example, it is possible to have eight or more
video displays along one edge and storage along another edge. However, nodes at the corner of the network only have one spare clock phase and therefore may only participate in one high bandwidth I/O function.
Suggested memory map for 6502:-
Code:
$0000-$00FF: Zero page.
$0100-$01FF: Stack.
$0200-$BFFF: 47.5KB of general purpose memory.
$C000-$CFFF: Memory shared with adjacent cores.
$D000-$DFFF: I/O which includes doorbells to invoke adjacent cores.
$E000-$FFFF: Supervisor firmware.
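For concreteness, the same map can be expressed as constants for a hypothetical cc65 build. The figures come directly from the map above; the division of the 4KB shared window into four 1KB slots, one per neighbour, is my assumption rather than part of the map.
Code:
#include <stdint.h>

/* Suggested 6502 memory map, expressed as constants (hypothetical build). */
#define ZERO_PAGE     0x0000u   /* $0000-$00FF zero page                    */
#define STACK_PAGE    0x0100u   /* $0100-$01FF hardware stack               */
#define GP_RAM_BASE   0x0200u   /* $0200-$BFFF 47.5KB general purpose RAM   */
#define SHARED_BASE   0xC000u   /* $C000-$CFFF shared with adjacent cores   */
#define IO_BASE       0xD000u   /* $D000-$DFFF I/O, including doorbells     */
#define ROM_BASE      0xE000u   /* $E000-$FFFF supervisor firmware          */

/* Assumed split of the shared window: one 1KB slot per neighbour,
   0 = north, 1 = east, 2 = south, 3 = west (ordering is arbitrary).        */
#define SHARED_SLOT(n)  ((volatile uint8_t *)(SHARED_BASE + (n) * 0x0400u))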
The memory map reveals immediate limitations of the architecture. Most obviously, four bi-directional channels to adjacent nodes require eight sets of buffers. Even if the memory map is adjusted, it may be undesirable to allocate more than half of the memory for this purpose. Therefore, directly mapped buffers will be a maximum of 4KB and are likely to be much smaller; typically 1KB or less. Furthermore, if a bi-directional channel minimizes hardware, there may be additional overheads. For example, a software FIFO implementation may not distinguish between buffer empty and buffer full unless one byte of the span is left unused. We also require space for multiple pointers. Therefore, the maximum buffer size may never exceed 4KB minus five bytes (or seven 256 byte uni-directional channels) and is more likely to fall annoyingly short of 1KB, or three separate pages.
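As a minimal sketch of such a software FIFO, the following C fragment implements one uni-directional 256 byte channel inside the shared window. The data page, index locations and doorbell address are assumptions for illustration only. Because "full" is detected as "advancing the head would meet the tail", one byte of the span is always sacrificed, exactly as described above.
Code:
#include <stdint.h>

/* One hypothetical uni-directional channel inside the shared window:
   a 256 byte data page at $C000-$C0FF plus two single-byte indices.
   The addresses and the doorbell location are illustrative only.           */
#define CHAN_DATA  ((volatile uint8_t *)0xC000)
#define CHAN_HEAD  (*(volatile uint8_t *)0xC100)   /* written by producer  */
#define CHAN_TAIL  (*(volatile uint8_t *)0xC101)   /* written by consumer  */
#define DOORBELL   (*(volatile uint8_t *)0xD000)   /* wake adjacent core   */

/* The buffer is empty when head == tail and full when advancing the head
   would meet the tail, so one byte of the span is always sacrificed.       */
static int chan_put(uint8_t byte)
{
    uint8_t next = (uint8_t)(CHAN_HEAD + 1);       /* wraps at 256 */

    if (next == CHAN_TAIL) {
        return 0;                       /* full: caller retries later */
    }
    CHAN_DATA[CHAN_HEAD] = byte;
    CHAN_HEAD = next;                   /* publish only after the data write */
    DOORBELL = 1;                       /* ring the neighbour's doorbell     */
    return 1;
}

static int chan_get(uint8_t *byte)
{
    if (CHAN_TAIL == CHAN_HEAD) {
        return 0;                       /* empty */
    }
    *byte = CHAN_DATA[CHAN_TAIL];
    CHAN_TAIL = (uint8_t)(CHAN_TAIL + 1);
    return 1;
}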
Additional capacity can be obtained by page banking the buffers. This is especially desirable if multiplexed RAM is supplied in large quantities. However, this arrangement may reduce communication speed. With the 65816 or W65C265, it is possible to have 64KB or more per uni-directional channel. It is also possible to block copy to and from buffers or, indeed, to block copy directly from buffer to buffer so that data can be passed through the grid.
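A store-and-forward relay might then look like the sketch below, which block copies a message directly from the buffer shared with one neighbour into the buffer shared with the opposite neighbour. The window addresses and the absence of any framing are assumptions; on a 65816 the copy itself could collapse to a single MVN/MVP instruction.
Code:
#include <stdint.h>
#include <string.h>

/* Hypothetical window addresses: an inbound buffer shared with the western
   neighbour and an outbound buffer shared with the eastern neighbour.      */
#define WEST_IN    ((const volatile uint8_t *)0xC000)
#define EAST_OUT   ((volatile uint8_t *)0xC400)
#define EAST_BELL  (*(volatile uint8_t *)0xD001)   /* hypothetical doorbell */

/* Forward a block through the node without staging it in private RAM.      */
static void relay_east(uint16_t length)
{
    memcpy((void *)EAST_OUT, (const void *)WEST_IN, length);
    EAST_BELL = 1;                      /* tell the eastern neighbour */
}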
Suggested memory map for 65816 and W65C265:-
Code:
$000000-$0000FF: Direct page.
$000100-$0001FF: Stack.
$00D000-$00DEFF: I/O which includes doorbells to invoke adjacent cores.
$00DF00-$00DFFF: W65C265 microcontroller internal I/O.
$00E000-$00FFFF: Supervisor firmware.
$010000-$BFFFFF: 12224KB of general purpose memory.
$C00000-$FFFFFF: Memory shared with adjacent cores.
If boards are constructed with 4, 6, 9, 12, 16 or more cores, it is strongly recommended that they all be connected with ribbon cable in which alternate wires are grounded. This creates favorable physics in which signals are largely shielded from each other. A further consideration is that a rectangular grid of boards can be folded at the seams. In this case, the ribbon cables may not all be the same length, but effort should be made to keep lengths to a minimum. In trivial cases, ribbon cables may be less than 2 inches (5cm) while providing easy access to high bandwidth I/O nodes on the perimeter of the grid. This requires minor consideration of hex bolt placement and power distribution.
MTBF of large arrays is a concern. Thankfully, BigDumbDinosaur has shown empirically that 65816 + Static RAM + SCSI Disk (and suitably long hex bolts) has uptime which exceeds 300 days. From this, it is reasonable to extrapolate that 144 cores and eight disks would have uptime which exceeds two days. It is also reasonable to assume that the typical system will be much smaller and therefore much more reliable. Regardless, the sheer cross-sectional area of the silicon is an invitation for bit errors, and these may cascade catastrophically through a grid computer. Without exception, validation and checksums are strongly encouraged. In particular, checksum size should be dimensioned according to the size, speed and longevity of communication. Specifically, a 16 bit CRC is very likely to be inadequate.
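As one example of a checksum wider than 16 bits, a bit-wise CRC-32 using the common reflected polynomial $EDB88320 is small enough to fit comfortably in supervisor firmware. The sketch below is generic C rather than anything tuned for the 6502; a table-driven version would trade 1KB of ROM for considerably more speed.
Code:
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Bit-wise CRC-32 using the common reflected polynomial $EDB88320. */
static uint32_t crc32(const uint8_t *data, uint16_t length)
{
    uint32_t crc = 0xFFFFFFFFuL;
    uint16_t i;
    int bit;

    for (i = 0; i < length; i++) {
        crc ^= data[i];
        for (bit = 0; bit < 8; bit++) {
            if (crc & 1uL) {
                crc = (crc >> 1) ^ 0xEDB88320uL;
            } else {
                crc >>= 1;
            }
        }
    }
    return crc ^ 0xFFFFFFFFuL;
}

int main(void)
{
    const char *check = "123456789";

    /* The standard check value for this CRC is $CBF43926. */
    printf("%08lX\n", (unsigned long)crc32((const uint8_t *)check,
                                           (uint16_t)strlen(check)));
    return 0;
}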
The primary purpose of the supervisor firmware is to maintain the integrity of the system. This includes power on testing, addressing of identical nodes and bad node/link detection. High integrity, portable communication is a secondary task within this framework. An 8KB ROM should be suitable for this task, given that comparable functionality was achieved on Connection Machine and Transputer nodes with less total memory.
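A very rough sketch of the bad link detection might look like the following. The buffer and doorbell addresses, the test pattern and the echo handshake are all my assumptions, not a defined protocol: each node writes a pattern into the shared buffer, rings the neighbour's doorbell and waits a bounded time for the pattern to be echoed back.
Code:
#include <stdint.h>

/* Rough sketch of power-on link testing.  Addresses, pattern and handshake
   are assumptions for illustration only.                                   */
#define LINKS 4

static volatile uint8_t *const link_buf[LINKS] = {
    (volatile uint8_t *)0xC000, (volatile uint8_t *)0xC400,
    (volatile uint8_t *)0xC800, (volatile uint8_t *)0xCC00
};
static volatile uint8_t *const link_bell[LINKS] = {
    (volatile uint8_t *)0xD000, (volatile uint8_t *)0xD001,
    (volatile uint8_t *)0xD002, (volatile uint8_t *)0xD003
};

static uint8_t link_ok[LINKS];          /* 1 = link passed its test */

static void test_links(void)
{
    uint8_t i;
    uint16_t spin;

    for (i = 0; i < LINKS; i++) {
        link_buf[i][0] = 0xA5;          /* place a test pattern        */
        *link_bell[i] = 1;              /* invoke the adjacent core    */
        link_ok[i] = 0;
        for (spin = 0; spin < 50000u; spin++) {
            if (link_buf[i][1] == 0xA5) {   /* neighbour echoed it back */
                link_ok[i] = 1;
                break;
            }
        }
    }
}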
Anyhow, it is possible to start with a dual core or quad core system and, conceptually, scale out to a ludicrous extent.
Further References
In 1987, Clive Sinclair filed a patent regarding wafer-scale integration in which partial or full wafers could be packaged and connectivity of the good nodes could be determined at boot. A one hour lecture by Danny Hillis explains the progression of Connection Machine development and makes passing mention of the Z80 being unsuitable for clustering before covering 1 bit computing, striped RAID, self-test and virtualization. Limitations of this architecture have only recently been overcome with CUDA.