Multiprocessing on FPGA using dual-port RAM (pipedream)

BigEd · Post by **BigEd** » Fri Oct 31, 2014 5:56 pm

This was a pipe-dream thought, but perhaps worth sharing: a 6502 core on FPGA is pretty small, so it's surely possible to fit 4 or even maybe 16 of them on a reasonably-priced FPGA. As the RAM blocks (at least on Xilinx FPGAs) are dual-ported, it would be simple enough to hook every RAM up to a pair of 6502s, and if every 6502 was hooked up to 2 or to 4 RAMs, it would be possible to make a pipeline or a mesh (or a torus) of processors. It would be easy to share code too, if that makes sense, but most crucially the processors could communicate by posting data to the shared memory and setting a flag to indicate that it's ready.

With 4 neighbours, each 6502 would see 4 blocks of 2k RAM, all of which would be shared, but each block shared with a different neighbour. By convention part of each RAM would be private to one side or the other, and part would be shared. The address map might look interesting: a patchwork, with each patch appearing in a different part of the address map on each side of the shared block.

+------------------------------------------+
|                                          |
| +---+                                    |
| |   |                                    |
| | +-+--+ +---+ +----+ +---+ +----+ +---+ |
+---+6502| |RAM| |6502| |RAM| |6502| |RAM+-+
  | +----+ +---+ +----+ +---+ +----+ +---+  
  |                                         
  | +---+        +---+        +---+         
  | |RAM|        |RAM|        |RAM|         
  | +---+        +---+        +---+         
  |                                         
  | +----+ +---+ +----+ +---+ +----+ +---+  
  | |6502| |RAM| |6502| |RAM| |6502| |RAM|  
  | +----+ +---+ +----+ +---+ +----+ +---+  
  |                                         
  | +---+        +---+        +---+         
  | |RAM|        |RAM|        |RAM|         
  | +-+-+        +---+        +---+         
  |   |                                     
  +---+

At the same time, this gives each processor more memory than it would otherwise have, and connects the processors together.
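As a rough sketch of the wiring (in Python rather than HDL), the torus can be modelled by giving every core one RAM on its east link and one on its south link; the west and north RAMs then belong to the neighbours on those sides. The 4x4 grid size and the names `neighbours` and `links` are assumptions made for illustration only.

```python
# Hypothetical 4x4 torus of 6502 cores. Each link between two neighbours
# is assumed to be one dual-port block RAM.
WIDTH, HEIGHT = 4, 4

def neighbours(x, y):
    """Return the four torus neighbours of the core at (x, y)."""
    return [
        ((x + 1) % WIDTH, y),   # east
        ((x - 1) % WIDTH, y),   # west
        (x, (y + 1) % HEIGHT),  # south
        (x, (y - 1) % HEIGHT),  # north
    ]

def links():
    """Return one (core, neighbour) pair per shared RAM."""
    result = set()
    for y in range(HEIGHT):
        for x in range(WIDTH):
            result.add(((x, y), ((x + 1) % WIDTH, y)))   # RAM on the east link
            result.add(((x, y), (x, (y + 1) % HEIGHT)))  # RAM on the south link
    return result

print(len(links()))  # number of dual-port RAMs the mesh needs
```

For the 4x4 case this counts 32 links, so a 16-core torus with four neighbours per core would need 32 dual-port RAMs.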

Just maybe, the zero page and stack could be implemented as distributed RAM, and therefore be private. Or maybe there's enough block RAM to have a private block as well as the shared blocks - depends on how big the FPGA is, and how many CPUs to squeeze in. As we know from the Atari 2600 and other efforts, we don't need a full page 0 or page 1 to make a viable machine. Even 64 bytes mapped into both pages can be useful.
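A small private RAM mapped into both pages could be decoded along these lines. This is only a sketch: the full mirroring across pages 0 and 1, the name `decode`, and the treatment of everything above page 1 as shared are assumptions, not a worked-out design.

```python
PRIVATE_SIZE = 64  # bytes of private RAM, the figure mentioned above

def decode(addr):
    """Map a 16-bit CPU address to ('private', offset) or ('shared', addr)."""
    if addr < 0x0200:
        # Pages 0 and 1 both mirror the same small private RAM.
        return ("private", addr % PRIVATE_SIZE)
    # Assumption: everything else goes to the shared block RAMs.
    return ("shared", addr)

print(decode(0x0000))  # first zero-page byte
print(decode(0x01FF))  # initial top of stack
```

With this decode, zero-page variables grow up from offset 0 and the stack grows down from offset 63, so the two only collide when the 64 bytes are exhausted.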

As for programming such a network, well that's a software problem!

(The transputer was all about local memory and synchronous communication with up to 4 neighbours over a byte-wide channel, but we don't have an FPGA model for the transputer. It would of course be possible to design a byte-wide channel as a peripheral, but shared memory comes for free.)

BigDumbDinosaur · Post by **BigDumbDinosaur** » Fri Oct 31, 2014 7:57 pm

BigEd wrote:

This was a pipe-dream thought...a 6502 core on FPGA is pretty small, so it's surely possible to fit 4 or even maybe 16 of them on a reasonably-priced FPGA. As the RAM blocks (at least on Xilinx FPGAs) are dual-ported, it would be simple enough to hook every RAM up to a pair of 6502s, and if every 6502 was hooked up to 2 or to 4 RAMs, it would be possible to make a pipeline or a mesh (or a torus) of processors...

What you have described sounds somewhat like a 6502 analog of AMD's HyperTransport feature that is used on Opteron motherboards to keep each processor core aware of what the other cores are doing. I suspect development of the VHDL to accomplish this would be a lengthy process over and above development of the 6502 cores themselves. Even more interesting would be the development of an assembler or compiler that could take advantage of multiple cores. With the ability to write threaded code we would have true multiprocessing with the 65xx family.

BigEd · Post by **BigEd** » Fri Oct 31, 2014 8:17 pm

Looks like HyperTransport is rather more sophisticated than what I was thinking of. The original transputer link protocol, although electrically serial, was byte-based at the logical level. Processes could exchange multi-byte messages, but it was up to the sender and receiver to agree on the number of bytes in each message, or to implement a variable-length message protocol on top of the byte-sized primitive.

In a shared-memory architecture, I think all that's needed is one or more buffers of agreed size (one byte or more) and for each buffer a control byte to indicate the status of the buffer (empty or full, owned by side A or side B, or whatever.) So, two bytes of memory space for each direction of each link would be minimal, and equivalent to the original transputer link.
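The two-byte scheme can be sketched as follows, with a Python `bytearray` standing in for one shared block RAM. The offsets `FLAG_ADDR` and `DATA_ADDR`, the flag values, and the function names are all hypothetical; on a real 6502 each side would poll the control byte in a loop.

```python
# One direction of one link: a data byte plus a control byte.
EMPTY, FULL = 0x00, 0x01
FLAG_ADDR, DATA_ADDR = 0x00, 0x01   # assumed offsets within the shared block

shared_ram = bytearray(2048)        # stands in for one 2 KB dual-port RAM

def try_send(ram, value):
    """Sender side: post a byte if the buffer is empty. Returns True on success."""
    if ram[FLAG_ADDR] != EMPTY:
        return False                # receiver has not consumed the last byte yet
    ram[DATA_ADDR] = value & 0xFF   # write the data first...
    ram[FLAG_ADDR] = FULL           # ...then the flag, so the data is valid once seen
    return True

def try_receive(ram):
    """Receiver side: take a byte if one is waiting, else return None."""
    if ram[FLAG_ADDR] != FULL:
        return None
    value = ram[DATA_ADDR]          # read the data first...
    ram[FLAG_ADDR] = EMPTY          # ...then hand the buffer back to the sender
    return value

sent = try_send(shared_ram, 0x42)
blocked = try_send(shared_ram, 0x43)    # refused: buffer is still full
print(sent, blocked, try_receive(shared_ram), try_receive(shared_ram))
```

Because only the sender ever sets the flag to full and only the receiver ever clears it, no atomic read-modify-write is needed, only the ordering of the data and flag accesses.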

(There was a later transputer link protocol, called DS-Link, which was packet based and implemented multiple virtual channels over each physical link.)

Cheers
Ed

nyef · Post by **nyef** » Fri Oct 31, 2014 8:24 pm

You might not even need dual-port RAM for this. I could see doing this without the FPGA, with a '245 on the data bus between each CPU and RAM pair, something similar for the address lines, and adjacent CPUs operating on opposite sides of the phi2 cycle.

ElEctric_EyE · Post by **ElEctric_EyE** » Fri Oct 31, 2014 8:41 pm

Fascinating idea BigEd!

It wouldn't be that hard to write the Verilog for the block RAMs.
In fact I would like to implement your idea in my project after I add in the HIDs (e.g. mouse, keyboard, touchscreen). As you say, the software for both machines would be taking care of everything.

I picture a common address block as part of the zero page of both CPU1 and CPU2. The remaining zero page and the stack would then be unique to each CPU.

BigEd · Post by **BigEd** » Fri Oct 31, 2014 8:53 pm

It's a fair point, nyef, that you only need a shared byte (or two) to arrange for a synchronous communication channel. So you could apply the same principle in the FPGA, and make all the memory private.

The LX9 has 32 block RAMs of 2 kbyte each, so for a 16-core array, that would be 4 kbyte per node private, or 16 kbyte if it's shared four ways. (In many cases, all the cores would be running the same code, so sharing memory for that purpose is a win.)

Indeed, EEye, if you make a small window of page 0 shared, that will be fast and natural. But the memory map might get messy in the everything-shared case.

I think one could get quite creative in how to map the RAMs into the memory spaces, and how to control write access for maximum safety. Which is to say, I think there's a wealth of different possible ways of doing it.
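One of the many possible mappings, sketched in Python: every CPU has four 2 KB windows, one per neighbour, and the block that appears in one CPU's east window appears in the west window of the neighbour on that side. The base addresses in `WINDOW_BASE` and the name `translate` are invented for illustration.

```python
BLOCK_SIZE = 0x0800  # 2 KB per block RAM

# Hypothetical base addresses of the four shared windows in every CPU's map.
WINDOW_BASE = {"north": 0x0800, "east": 0x1000, "south": 0x1800, "west": 0x2000}
OPPOSITE = {"north": "south", "south": "north", "east": "west", "west": "east"}

def translate(addr):
    """Return (direction, address of the same byte in that neighbour's map),
    or None if addr is not inside a shared window."""
    for direction, base in WINDOW_BASE.items():
        if base <= addr < base + BLOCK_SIZE:
            offset = addr - base
            return direction, WINDOW_BASE[OPPOSITE[direction]] + offset
    return None

print(translate(0x1004))  # a byte in the east window
print(translate(0x0000))  # zero page is not in any shared window here
```

Keeping the same offset on both sides means a buffer layout agreed for one link works unchanged on every other link; only the window base differs.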

nyef · Post by **nyef** » Fri Oct 31, 2014 9:43 pm

BigEd wrote:

It's a fair point, nyef, that you only need a shared byte (or two) to arrange for a synchronous communication channel. So you could apply the same principle in the FPGA, and make all the memory private.

Which wasn't where I was going with that. You can share a 32k RAM (or more!) between two 6502 CPUs if you run one of them on phi2 and the other on negated phi2, and isolate the RAM from whichever CPU is on the "down" part of its cycle so that the CPU on the "up" part of its cycle can do its thing. The Apple ][ uses basically this trick for video refresh, although there the "other CPU" is basically a state machine built out of discrete components.
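A toy model of that interleaving, in Python: the RAM is single-ported, and which CPU is connected to it depends only on the phase of phi2. Which half belongs to which CPU, the one-write-per-cycle simplification, and the names `owner` and `simulate` are assumptions for the sake of the sketch.

```python
ram = bytearray(32 * 1024)  # one single-port 32 KB RAM shared by both CPUs

def owner(half_cycle):
    # Assumption: phi2 is high on odd half-cycles. CPU A accesses memory while
    # phi2 is high; CPU B runs on inverted phi2 and gets the other half.
    return "A" if half_cycle % 2 == 1 else "B"

def simulate(writes_a, writes_b):
    """Apply at most one write per CPU per full cycle; return the order the RAM saw."""
    log = []
    pending = {"A": list(writes_a), "B": list(writes_b)}
    half_cycle = 0
    while pending["A"] or pending["B"]:
        cpu = owner(half_cycle)
        if pending[cpu]:
            addr, value = pending[cpu].pop(0)
            ram[addr] = value      # only the owning CPU is connected to the RAM
            log.append((half_cycle, cpu, addr))
        half_cycle += 1
    return log

history = simulate([(0x10, 1), (0x11, 2)], [(0x10, 9)])
print(history)
```

The two CPUs never contend for the bus, but they can still overwrite each other's data at the same address, so a flag or ownership convention is needed on top.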

BigEd · Post by **BigEd** » Fri Oct 31, 2014 9:47 pm

Oh, yes, understood - you were talking about discrete designs. I was just thinking that the same structure could be done on FPGA too.
