commodorejohn wrote:
Well, there's the AppleCrate II, although that's not a multiprocessor system so much as a networked cluster.
kc5tja wrote:
This assumes that every operation the 6502 performs occupies 5 clock cycles, and so maintains phase relative to its peers. This, however, won't be the case. So, with surprising frequency in fact, your 6502s will conflict with each other while accessing shared memory.
What you're describing is called NUMA (Non-Uniform Memory Access), where CPUs have local memory resources and remote memory resources mapped into a common address space. It's really the only way to scale beyond the traditional SMP limits.
However, straight message passing would actually be faster than NUMA in the general case (particularly if all the links are point-to-point, or a reasonable approximation thereof), since you pass messages directly from one node to another without contending for a shared resource; or, if you do contend, it's software-mitigated, so event-driven code can forward data at adaptively scheduled intervals (think R-ALOHA, for instance). You could use a simple FLIT-routed network, or, if you can afford the hardware, a point-to-point mesh network between your processing elements, and never have to worry about sluggishness or synchronization.
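A minimal sketch of that point-to-point idea (my illustration in Python, not anyone's actual hardware): each directed link gets its own FIFO, so a send only ever touches the link between the two nodes involved and never arbitrates for a shared bus.
Code: Select all
# Point-to-point message passing: one dedicated FIFO per directed link.
# Illustrative only -- node names and the API are made up for this sketch.
from collections import deque

class Node:
    def __init__(self, name):
        self.name = name
        self.links = {}                    # neighbour name -> outgoing FIFO

    def connect(self, other):
        self.links[other.name] = deque()   # link self -> other
        other.links[self.name] = deque()   # link other -> self

    def send(self, dst, word):
        self.links[dst].append(word)       # never blocks on unrelated traffic

    def receive(self, src_node):
        q = src_node.links[self.name]
        return q.popleft() if q else None

a, b = Node("A1"), Node("A2")
a.connect(b)
a.send("A2", 0x42)
print(hex(b.receive(a)))                   # 0x42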
Edit: A simple analysis for a case where each 6502 accesses the shared bus in 1/5 of its cycles gives a 1-cycle delay 20% of the time with two 6502s, 36% of the time with three 6502s, and around 50% of the time with four 6502s. That means each 6502 needs 5.5 cycles (on average) for a shared SRAM access. By using CLK2 HIGH/LOW multiplexing you can have eight 6502s on such a shared bus, and it only slows them down by around 10%.
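For reference, a quick back-of-the-envelope check of those figures (my sketch, assuming each CPU independently wants the shared bus in 1/5 of its cycles and a conflict costs exactly one extra wait cycle):
Code: Select all
# Rough check of the contention figures above.
# Simplified model: ignores having to queue behind more than one CPU,
# which is why it is only a fair approximation for small CPU counts.
p_access = 1 / 5          # probability a CPU wants the shared bus in a given cycle

for n_cpus in range(2, 5):
    # probability that at least one of the other CPUs also wants the bus
    p_conflict = 1 - (1 - p_access) ** (n_cpus - 1)
    avg_cycles = 5 + p_conflict   # 5-cycle operation plus the expected wait
    print(f"{n_cpus} CPUs: conflict {p_conflict:.0%}, "
          f"~{avg_cycles:.2f} cycles per shared access")
# prints: 20% / 36% / ~49%, i.e. ~5.2, ~5.4 and ~5.5 cycles per access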
As of today, I have a single-core solution for my 6502/65C02 multiplexed system, with a 65C22 on a 4-layer "CPU board" in production: it is missing some logic, but I may build a new interface card and add several of these boards to see how well they cope with different shared-memory divisors. There is also a connector for the 65C22 parallel port, which would make point-to-point communication possible (although software-driven). I am not sure there is much to gain by sending data over the 65C22 instead of the shared SRAM in this small system, but maybe there is for larger ones.
Another implementation would be to have two different shared memory areas in a mesh, one along the "horizontal" direction and one along the "vertical" direction:
Code: Select all
A1 A2 A3 A4 .....
B1 B2 B3 B4 .....
C1 C2 C3 C4 .....
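One way to read that layout (my interpretation, not a worked-out design): every CPU shares one memory area with its row (A, B, C, ...) and another with its column (1, 2, 3, ...), so any two CPUs can exchange data in at most two hops via the CPU sitting on the sender's row and the receiver's column. A small sketch of that routing rule:
Code: Select all
# Routing in the row/column shared-memory mesh above (illustrative only).
def route(src: str, dst: str) -> list[str]:
    """Return the chain of CPUs a message passes through, e.g. 'B1' -> 'C4'."""
    if src[0] == dst[0] or src[1:] == dst[1:]:
        return [src, dst]                  # same row or same column: one hop
    relay = src[0] + dst[1:]               # corner node, e.g. 'B4'
    return [src, relay, dst]               # row hop, then column hop

print(route("B1", "C4"))   # ['B1', 'B4', 'C4']
print(route("A2", "A4"))   # ['A2', 'A4']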
Thanks for sharing info on multiprocessor networking. I will do some more reading during the upcoming holidays!