Massively parallel 6502 systems

BigEd · Post by **BigEd** » Wed Apr 03, 2019 12:14 pm

Nearby, in the newbies forum, an off-topic excursion concerning using multiple 6502s to implement a complex graphics subsystem:

railsrust wrote:

So I got this email from "my little helper" yesterday:

Quote:

...what would you think of 4 or more 6502s, closely coupled but yet able to execute independently? If we think of the theme of throwing processors at the problem ...
...
Heh, and a bargraph led display that acts like a speedometer the more 6502s you use.

I know people have done multiprocessor ’02s before; I wonder if there is anything in the open domain, namely the task dispatcher and manager. They have also done a bunch of 8051s, which are more self-contained.

I gotta believe one FPGA can manage a crapload of ‘02s.

Anyone know of a way to manage multiple 6502s like this?

GARTHWILSON wrote:

railsrust wrote:

Anyone know of a way to manage multiple 6502s like this?

It's getting off-topic, but that's ok. It's your own topic. :)

Here are some earlier topics that are very relevant, and kc5tja made valuable contributions on:

idea: a cell-like chip based on many 6502 cores
and André Fachat's post in this one (already cued up), about distributing the work load among the various processors in a multiprocessing system: Parallel Processing with 6502s
See also: Dual Processor SBC.

WDC's W65C02S adds some more signals at the pins, namely ML\ (memory lock not) output (pin 5 on a DIP), BE (bus enable) input (pin 36 on a DIP), and VP\ (vector pull not) output (pin 1 on a DIP).

I have a couple more links to add to Garth's list of previous topics:

6502 Grid Computer (2017, Rob Finch)
Multiprocessing on FPGA using dual-port RAM (pipedream) (2015, BigEd)

but I'll very much second the idea noted in the initial quote: it's the software, the management of tasks and of data transfer, which is the major challenge here.

whartung · Post by **whartung** » Wed Apr 03, 2019 5:46 pm

BigEd wrote:

but I'll very much second the idea noted in the initial quote: it's the software, the management of tasks and of data transfer, which is the major challenge here.

Depends on how tightly the CPUs are integrated. If they're essentially standalone machines (with independent CPU/ROM/RAM) connected through some networking, then, yeah, it's mostly a software issue.

But if they're sharing RAM and/or other devices, where handshaking is done at a hardware level, then it's a different animal.

Folks are struggling to get the 65816 address decoded properly.

Just having several CPUs fighting for a common bus can be an issue.

Druzyek · Post by **Druzyek** » Wed Apr 03, 2019 8:08 pm

I have thought about trying to run two CPUs through one CPLD. When the clock goes low and you are waiting 30ns for the processor's address lines to settle, you could have the CPLD switch to the address lines of a second processor that has already settled and let that access memory while you wait. I don't think you could run them both at full speed but you would at least be doing something useful during that 30ns.

drogon · Post by **drogon** » Wed Apr 03, 2019 8:37 pm

Druzyek wrote:

I have thought about trying to run two CPUs through one CPLD. When the clock goes low and you are waiting 30ns for the processor's address lines to settle, you could have the CPLD switch to the address lines of a second processor that has already settled and let that access memory while you wait. I don't think you could run them both at full speed but you would at least be doing something useful during that 30ns.

Not sure why you even need a CPLD for just 2 x 6502's into a common memory system - after all, this is how video is done on some of the older systems - Apple II, etc. 6502 accesses RAM on one half cycle, video on the other - one reason the clock crystals seemed a bit weird then. (Exact multiples of NTSC or PAL scan frequency)

If you ran both 6502's off the same Ph2 clock, but one inverted, then we know that the 6502 only uses (less than) half a cycle to access RAM/ROM, so that leaves the other half cycle for the other processor.

Some glue might be needed to toggle the BE pins (65C02) appropriately, and deal with R/W which I don't think is tri-stated with BE.

The '816 has the added complication of the upper 8 bits of address being latched on the "dead" half cycle, so you might need a separate tri-state buffer on the output of that latch.

Other than that... am I being too naive?

-Gordon

GARTHWILSON · Post by **GARTHWILSON** » Wed Apr 03, 2019 8:40 pm

whartung wrote:

BigEd wrote:

but I'll very much second the idea noted in the initial quote: it's the software, the management of tasks and of data transfer, which is the major challenge here.

Depends on how tightly the CPUs are integrated. If they're essentially standalone machines (with independent CPU/ROM/RAM) connected through some networking, then, yeah, it's mostly a software issue.

But if they're sharing RAM and/or other devices, where handshaking is done at a hardware level, then it's a different animal.

I'm not concerned with the handshaking so much as the administration job of deciding how to distribute the work load to keep the various processors busy in a way that's productive overall.

Thanks for the additional links, Ed.

GaBuZoMeu · Post by **GaBuZoMeu** » Wed Apr 03, 2019 9:05 pm

whartung wrote:

Just having several CPUs fighting for a common bus can be an issue.

Depends on how you attempt that and what you are willing to pay.

Assume you really wish to use one (extraordinarily fast) memory to serve as a common RAM for n CPUs. Assume further you arrange a clock generator with n outputs, each output delayed by t ns, where t is the cycle time of the common RAM. You then have to mux each CPU's bus to that RAM, put or fetch and latch a byte, and then serve the next CPU. Really challenging, I think, and even with, say, 7 ns RAM and virtually no delay from the muxes, only 10 CPUs (yielding a total cycle time of 70 ns, about 14 MHz) could interact this way.
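As a rough check of that arithmetic, here is a minimal Python sketch. The function name `shared_ram_budget` and its parameters are made up for illustration, and it assumes ideal muxes with zero delay, as the paragraph above does.

```python
def shared_ram_budget(ram_cycle_ns, n_cpus):
    """Return (cpu_cycle_ns, cpu_clock_mhz, phase_offsets_ns) for n_cpus
    taking turns on one RAM. Assumes zero mux delay (an idealisation)."""
    # Every CPU gets exactly one RAM cycle per round, so one CPU cycle
    # has to last as long as n_cpus RAM cycles.
    cpu_cycle_ns = ram_cycle_ns * n_cpus
    cpu_clock_mhz = 1000.0 / cpu_cycle_ns
    # Each CPU clock is the previous one delayed by one RAM cycle.
    phase_offsets_ns = [i * ram_cycle_ns for i in range(n_cpus)]
    return cpu_cycle_ns, cpu_clock_mhz, phase_offsets_ns


cycle_ns, clock_mhz, offsets = shared_ram_budget(7, 10)
print(f"{cycle_ns} ns per CPU cycle, about {clock_mhz:.1f} MHz per CPU")
print("clock phase offsets (ns):", offsets)
```

With 7 ns RAM and 10 CPUs this reproduces the 70 ns cycle and roughly 14 MHz quoted above. Any real mux or latch delay would have to be added to `ram_cycle_ns` and would lower the result.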

But most likely only a fraction of the RAM needs to be "common", e.g. one KB or two. You could then use dual-port RAM and use the "other" side to synchronize all DPRAMs. A 6502 can only write every fourth cycle (every third if zero page, but that would make less sense), so even at a 14 MHz clock one CPU could issue a byte only every 4 x 70 ns = 280 ns. The "other" side of the DPRAM could be operated by some logic at full speed, e.g. 14 ns. This logic could select one DPRAM to deliver its contents while all other DPRAMs are simultaneously written with that data. With no further delays, 280/14 = 20 quasi-simultaneous writes could be served. Here the problem is to fetch all CPU-side accesses and queue them up for processing on the "other" side.
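The same kind of back-of-envelope check works for the dual-port RAM scheme. This is only a sketch: `broadcast_capacity` is a hypothetical helper, and the 70 ns, 4-cycle and 14 ns figures are taken straight from the paragraph above.

```python
def broadcast_capacity(cpu_cycle_ns, cycles_per_write, sync_slot_ns):
    """Return (write_interval_ns, slots): the shortest time between two
    writes from one CPU, and how many broadcast slots fit into it."""
    write_interval_ns = cpu_cycle_ns * cycles_per_write
    # Each slot copies one CPU's byte into all the other DPRAMs.
    slots = write_interval_ns // sync_slot_ns
    return write_interval_ns, slots


# Absolute-addressed store: one write every 4 cycles.
interval_ns, slots = broadcast_capacity(70, 4, 14)
print(f"{interval_ns} ns between writes -> {slots} broadcast slots")

# Zero-page store: one write every 3 cycles, so fewer slots fit.
zp_interval_ns, zp_slots = broadcast_capacity(70, 3, 14)
print(f"{zp_interval_ns} ns between writes -> {zp_slots} broadcast slots")
```

The first case gives the 20 quasi-simultaneous writes mentioned above. The zero-page case shows how the budget shrinks if CPUs can write back to back faster than every fourth cycle.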

Again challenging, I assume.

Using an FPGA with its block RAMs inside might be easier. Inside the FPGA you can operate even faster. On the other hand, each 6502 requires 20 pins (10 AB, 8 DB, PHI2, RWB)....

Methinks a couple of loosely coupled autonomous computers exchanging highly condensed information at low frequency are much, much easier to manage. A W65C265S with its four serial ports could act like a poor man's Transputer.

my 2 cents

Dr Jefyll · Post by **Dr Jefyll** » Wed Apr 03, 2019 9:48 pm

drogon wrote:

Not sure why you even need a CPLD for just 2 x 6502's into a common memory system - after all, this is how video is done on some of the older systems - Apple II, etc. 6502 accesses RAM on one half cycle, video on the other [...] Some glue might be needed to toggle the BE pins

Slightly OT, since 2 processors don't qualify as massively parallel, but since the subject was mentioned, here is a vintage homebrew using two 6809's in that fashion. The two CPU's take turns accessing a shared RAM. And one of the CPU's is also the video system. This project predates CPLD's (although I did exploit programmable logic in the form of 32 x 8 TTL PROM's).

Because DRAM requires multiplexers anyway, this design doesn't do the trick of tying the two CPU address buses together and toggling the BE pins.

Using static RAM you *could* tie the buses together that way (ie, omit the multiplexer). Having each CPU tristated for the first half of every cycle won't affect the RAM because the RAM is fast enough to do the entire access in the remaining half of the cycle. But tristating for the first half of every cycle means extra delay before the address decoder for memory-mapped I/O can begin doing its job, and that might force a reduction in clock speed (as Druzyek mentioned).

-- Jeff

Martin A · Post by **Martin A** » Fri Apr 05, 2019 10:17 pm

One possible way to synchronise multiple CPU's is to alter the clock ratio.

I've done a quick and dirty test on a breadboard. It's got a 25.175 MHz can oscillator (as that's what I had!) driving a 74HC163 4-bit synchronous counter.

The B, C and D outputs from the counter feed the A, B and C inputs of a permanently enabled 74HC138 3-to-8 decoder to produce 8 non-overlapping clocks with a 7:1 high-to-low ratio. Sending the Y0 to Y7 outputs from the decoder through a 74HC240 inverts the signals to produce 8 "CPU clocks".

If the same Y0-Y7 outputs controlled the enable pins for a set of 74HC244 buffers for each CPU, then they would all only be connected to the target memory for 1/8 of the time but appear to have full access.

The test board produced a high time of just under 80 ns, which is comfortably more than the access time for modern SRAM. A 32 MHz master clock would have produced 2 MHz CPU clocks and access periods of about 60 ns.
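For anyone who wants to check those numbers, here is a small Python model of the divider chain. It is only a sketch of the logic as described (4-bit counter, top three bits into a one-of-eight active-low decoder, then an inverter); the function names are made up, and propagation delays are ignored.

```python
def cpu_clock_pattern():
    """One full counter cycle: for each of the 16 master clock ticks,
    return the list of 8 CPU clock levels (1 = high)."""
    pattern = []
    for count in range(16):                # 4-bit counter, QA is bit 0
        select = (count >> 1) & 0x07       # QB, QC, QD -> decoder A, B, C
        y = [0 if i == select else 1 for i in range(8)]  # decoder: one output low
        pattern.append([1 - level for level in y])       # inverter: one clock high
    return pattern


def slot_timing(master_mhz):
    """Return (high_time_ns, cpu_clock_mhz) for a given master clock."""
    master_period_ns = 1000.0 / master_mhz
    # The decoder input changes every 2 master ticks; 8 slots make 16 ticks.
    return 2 * master_period_ns, master_mhz / 16


pattern = cpu_clock_pattern()
for mhz in (25.175, 32.0):
    high_ns, cpu_mhz = slot_timing(mhz)
    print(f"{mhz} MHz master: {high_ns:.1f} ns high time, {cpu_mhz:.3f} MHz per CPU")
```

This gives roughly 79.4 ns and 1.57 MHz for the 25.175 MHz oscillator, and 62.5 ns and 2 MHz for a 32 MHz master clock, in line with the rounded figures above. The pattern also confirms that exactly one CPU clock is high on any tick.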

The question is then whether 8 CPU's in the 1-2 MHz range is a worthwhile goal.
