GARTHWILSON wrote:
The one thing I don't know how to deal with is how to split up jobs between the various processors such that the "administrative" overhead and all the communicating between computers doesn't take up so much of the time that you lose the performance advantage you had hoped to gain in having parallel processors. I suppose it depends on what kind of work you want to do with the system. Maybe the administrative overhead would eat up a pretty small percentage of the gains if you want to do something like Fourier transforms on large arrays. Then each separate computer could take its job and go off in its corner and not bother the master computer for a long time. This is one subject I'd like to hear more of from someone experienced with multiprocessor programming.
As with all things, "it depends". It depends on the algorithm or job you want to run on multiple processors. Number-crunching work can often be split into a number of similar jobs that are distributed to multiple, similar processors, but Amdahl's Law (
http://en.wikipedia.org/wiki/Amdahl's_Law ) restricts the speedup you can get, as most algorithms have a serial part - usually the communication - that cannot be parallelized. A good example where a job is massively distributed, and communication is minimized as much as possible, is WorldCommunityGrid (
http://www.worldcommunitygrid.org/ - if you don't mind, I would strongly suggest you think about supporting this initiative with some spare CPU cycles).
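To put a number on that limit, here is a minimal sketch in C (the fractions are made up for illustration): with a fraction p of the work parallelizable over n processors, Amdahl's Law caps the speedup at 1 / ((1 - p) + p/n).

    #include <stdio.h>

    /* Amdahl's Law: speedup on n processors when a fraction p
       of the work can be parallelized (0 <= p <= 1). */
    static double amdahl_speedup(double p, int n)
    {
        return 1.0 / ((1.0 - p) + p / (double)n);
    }

    int main(void)
    {
        /* Even with 95% parallel work, 8 CPUs give only ~5.9x,
           and 64 CPUs only ~15.4x. */
        printf("p=0.95, n=8:  %.2f\n", amdahl_speedup(0.95, 8));
        printf("p=0.95, n=64: %.2f\n", amdahl_speedup(0.95, 64));
        return 0;
    }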
Another example is the combination of two PC graphics cards that alternately compute complex video frames, e.g. in a video game.
A starting point for the parallelization of algorithms is
http://en.wikipedia.org/wiki/Parallelization
(although really only a starting point).
A different point is the offloading of specific (single) tasks that run in parallel with a main task. An example would be a TCP/IP offload engine, or even the graphics card's processor.
The type of load you want to run on a multiprocessor system also influences how much synchronization and communication effort you need. A fixed job's program can be put directly into the coprocessor's ROM, while a general-purpose coprocessor needs to be able to load its program from some other medium (or from the main processor).
There are two ways of coupling processors: with shared memory or with message passing, "either of which may be implemented in terms of the other" - with proper hardware support, I want to add.
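To make the "implemented in terms of the other" point concrete, here is a minimal one-slot mailbox sketch in C - message passing built on shared memory. The names and layout are purely my illustration; on a pair of 6502s this would be a few bytes of the shared RAM and two short routines.

    #include <stdint.h>

    /* A one-slot mailbox in shared RAM. The flag is written
       last, so the receiver never sees a half-written message. */
    #define EMPTY 0
    #define FULL  1

    struct mailbox {
        volatile uint8_t flag;     /* EMPTY or FULL */
        volatile uint8_t cmd;      /* command byte */
        volatile uint8_t data[16]; /* payload */
    };

    void mbox_send(struct mailbox *m, uint8_t cmd,
                   const uint8_t *payload, int len)
    {
        while (m->flag != EMPTY)   /* wait until slot is free */
            ;
        for (int i = 0; i < len; i++)
            m->data[i] = payload[i];
        m->cmd  = cmd;
        m->flag = FULL;            /* publish: written last */
    }

    uint8_t mbox_recv(struct mailbox *m, uint8_t *payload, int len)
    {
        while (m->flag != FULL)    /* wait for a message */
            ;
        for (int i = 0; i < len; i++)
            payload[i] = m->data[i];
        uint8_t cmd = m->cmd;
        m->flag = EMPTY;           /* hand the slot back */
        return cmd;
    }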
In the 6502 world I actually know of only a few examples, but they cover both types:
1) The Commodore disk drives (old-style, IEEE488) use two processors, a 6502 and a 6504, running on opposite phases of PHI2. The 6504 handles the disk drive hardware, the 6502 the bus communication and the file system. The 6504 was later optimized away, and its program moved into the interrupt routine of the main 6502. The two processors use some shared RAM to communicate. As the program is fixed, it sits in separate ROMs, and the job interface was small (in terms of the number of interactions), so the communication is very efficient.
2) There have been demos on the C64 that actually use the VC1541 floppy drive's processor to compute stuff. IIRC one computed a Mandelbrot graphic. Here the job interface was relatively small ("start job with parameters" and "send me the result"), so communication was efficient even though it went through the Commodore serial bus. IIRC the floppy drive loaded its own program directly from the floppy, so the main (C64) CPU was not used for this.
3) My CS/A65 computer implements a general-purpose coprocessor running on the opposite clock phase from the main processor. Both use up to 64k of shared RAM to communicate. The plan is to offload either the TCP/IP or the SCSI-disk-and-filesystem load to it, and to talk to the main CPU only via file I/O calls, using the shared memory to transfer the data. If the main CPU simply had to wait for the coprocessor (e.g. on a disk read), the scheme would cost more than it gains. However, my operating system is a multitasking system, so while the coprocessor reads a file, the calling process on the main processor can sleep and the main processor can execute other jobs. The board will have 32k RAM and 32k of either RAM or ROM. If it is 64k RAM in total, the main CPU has to load the coprocessor program into the shared RAM - good for debugging anyway - but later I want to have it in ROM.
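To give an idea of what a file-I/O interface over shared RAM could look like - this is purely a hypothetical sketch of mine, not the actual CS/A65 layout - a request block might be as simple as this:

    #include <stdint.h>

    /* Hypothetical request block living in the shared RAM. The
       main CPU fills it in, sets status to REQ_PENDING and
       returns to the scheduler; the coprocessor does the disk
       work and sets REQ_DONE (or REQ_ERROR) once the data is
       in the buffer. */
    enum { REQ_FREE, REQ_PENDING, REQ_DONE, REQ_ERROR };

    struct file_request {
        volatile uint8_t status;   /* REQ_* state */
        uint8_t  op;               /* e.g. read, write, open */
        uint8_t  name[16];         /* file name */
        uint16_t length;           /* bytes wanted / returned */
        uint8_t  buffer[256];      /* data is transferred here */
    };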
So, from these examples, how would you optimize the communication and synchronization...?
Basically it is very simple: reduce the number and size of communication calls. You can read a more theoretical approach here
http://www-128.ibm.com/developerworks/l ... nterface1/ and
http://www-128.ibm.com/developerworks/l ... nterface2/
(please bear with the author :-)
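To make the "fewer, bigger messages" point concrete, here is a toy fragment building on the mailbox sketch above (the CMD_* values and the sample array are made up for the illustration):

    /* Builds on struct mailbox / mbox_send from the sketch
       above; CMD_SAMPLE and CMD_SAMPLES are hypothetical. */
    enum { CMD_SAMPLE = 1, CMD_SAMPLES };

    void send_chatty(struct mailbox *mb, const uint8_t samples[64])
    {
        for (int i = 0; i < 64; i++)       /* 64 round-trips */
            mbox_send(mb, CMD_SAMPLE, &samples[i], 1);
    }

    void send_batched(struct mailbox *mb, const uint8_t samples[64])
    {
        for (int i = 0; i < 64; i += 16)   /* only 4 round-trips */
            mbox_send(mb, CMD_SAMPLES, &samples[i], 16);
    }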
Also, make the communication asynchronous if possible. That way one processor does not have to wait for the other (as long as it has something else to do).
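For illustration, a non-blocking variant of the hypothetical mailbox above - the sender only tries to post the message, and otherwise goes on with its own work instead of idling:

    /* Builds on struct mailbox from the sketch above. */
    int mbox_try_send(struct mailbox *m, uint8_t cmd,
                      const uint8_t *payload, int len)
    {
        if (m->flag != EMPTY)
            return 0;              /* busy: caller keeps working */
        for (int i = 0; i < len; i++)
            m->data[i] = payload[i];
        m->cmd  = cmd;
        m->flag = FULL;
        return 1;
    }

    /* Usage: never idle-wait on the other CPU.
       (do_other_work, CMD_READ and req are hypothetical.)

       while (!mbox_try_send(&mb, CMD_READ, req, sizeof req))
           do_other_work();
    */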
If you have a specific algorithm you want to distribute, look at the parallelization techniques linked above.
Parallel processing is not easy and needs a lot of planning. In my 6502 case I have a multitasking OS, so I can "outsource" tasks to another processor without modifying the program itself. But if a task is very "chatty", with lots of communication, the extra effort of talking between coprocessor and main processor may well eat up what was gained by the coprocessor.
I hope this helps a bit.
André