Parallel Processing with 6502s

Let's talk about anything related to the 6502 microprocessor.
fachat
Posts: 1124
Joined: 05 Jul 2005
Location: near Heidelberg, Germany

Post by fachat »

Quote:
For me, if I were to go multiple cpus, I would go with the common ram and each cpu while it has its own rom/ram would access the common ram in a round robin fashion. Each one can simply have data or programs passed back and forth with a token id for memory, or the ram can segment for each cpu.
I have recently published a coprocessor board for my CS/A computer. The CS/A has a 1MByte address space (using an MMU for a 6502), and the coprocessor's 64k address space is mapped as shared memory into the 1M.
In theory you could use multiple coprocessors in that setup with one controlling (main bus) CPU.

More details on http://www.6502.org/users/andre/csa/copro/index.html

André
6502inside
Posts: 102
Joined: 03 Jan 2007
Location: Sunny So Cal

Post by 6502inside »

kc5tja wrote:
Yeah, Commodore's PET-series disk drive units relied on twin processors. I'm not sure if the Commodore VIC-series (e.g., the 1540 through 1581) did or not though.
They do not. The 2031 and 154x/157x/1581 drives are single CPU. However, the CPU still alternately operates in IP and FDC modes, changing "personality" on interrupt, in a manner analogous to the old drive architecture.
fachat
Posts: 1124
Joined: 05 Jul 2005
Location: near Heidelberg, Germany

Post by fachat »

GARTHWILSON wrote:
The one thing I don't know how to deal with is how to split up jobs between the various processors such that the "administrative" overhead and all the communicating between computers doesn't take up so much of the time that you lose the performance advantage you had hoped to gain in having parallel processors. I suppose it depends on what kind of work you want to do with the system. Maybe the administrative overhead would eat up a pretty small percentage of the gains if you want to do something like Fourier transforms on large arrays. Then each separate computer could take its job and go off in its corner and not bother the master computer for a long time. This is one subject I'd like to hear more of from someone experienced with multiprocessor programming.
As with all things, "it depends". It depends on the algorithm or job you want to run on multiple processors. Number-crunching jobs can often be split into a number of similar jobs that can be distributed to multiple, similar processors, but Amdahl's Law (http://en.wikipedia.org/wiki/Amdahl's_Law ) limits the speedup you can get, as most algorithms have a communication part that cannot be parallelized. A good example where a job is massively distributed and communication is minimized as much as possible is WorldCommunityGrid (http://www.worldcommunitygrid.org/ - if you don't mind, I would strongly suggest supporting this initiative with some spare CPU cycles).
Another example is the combination of two PC graphics cards that alternately compute complex video frames, e.g. in a video game.
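Amdahl's Law is easy to put into numbers. A minimal sketch (the function name and the 90%/8-processor figures are just illustrative, not from any specific system):

```python
def amdahl_speedup(parallel_fraction, n_processors):
    """Speedup of a job where only parallel_fraction of the work
    can be spread across n_processors; the rest stays serial."""
    serial = 1.0 - parallel_fraction
    return 1.0 / (serial + parallel_fraction / n_processors)

# Even if 90% of the job parallelizes perfectly, 8 processors give
# less than a 5x speedup, and the limit is 10x no matter how many
# processors you add - the serial 10% dominates.
print(round(amdahl_speedup(0.90, 8), 2))      # -> 4.71
print(round(amdahl_speedup(0.90, 10**9), 2))  # -> 10.0
```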

A starting point for parallelization of algorithms is http://en.wikipedia.org/wiki/Parallelization
(although really only a starting point).

A different point is the offloading of specific (single) tasks that run in parallel to a main task. An example would be a TCP/IP offload engine, or even the graphics card processor.

The type of load you want to run on a multiprocessor system also influences how much synchronization and communication effort you need. A fixed-job program can be put directly into the coprocessor's ROM, while a general-purpose coprocessor needs to be able to load its program from some other medium (or from the main processor).

There are two ways of coupling processors, with shared memory or with message passing, "either of which may be implemented in terms of the other" - with proper hardware support I want to add.
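To illustrate the "either may be implemented in terms of the other" point, here is a minimal Python sketch (just to show the idea, not tied to any 6502 hardware) of message passing built on shared memory: a one-slot mailbox guarded by a condition variable.

```python
import threading

class Mailbox:
    """Message passing on top of shared memory: one shared slot
    plus a flag, guarded by a condition variable."""
    def __init__(self):
        self._slot = None
        self._full = False
        self._cond = threading.Condition()

    def send(self, msg):
        with self._cond:
            while self._full:          # wait until the slot is free
                self._cond.wait()
            self._slot = msg
            self._full = True
            self._cond.notify_all()    # wake a waiting receiver

    def receive(self):
        with self._cond:
            while not self._full:      # wait until a message arrives
                self._cond.wait()
            msg = self._slot
            self._full = False
            self._cond.notify_all()    # wake a waiting sender
            return msg
```

Going the other way (shared memory on top of message passing) would mean sending "read address X" / "write address X" messages to whoever owns the memory, which is where the "proper hardware support" caveat comes in.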

In the 6502 world I actually only know of a few examples, but they cover both types:

1) The Commodore disk drives (old-style, IEEE-488) use two processors, a 6502 and a 6504, running on opposite phases of PHI2. The 6504 handles the disk drive hardware, the 6502 the bus communication and the file system. The 6504 was later optimized away and its program moved into the interrupt routine of the main 6502. The two processors use some shared RAM to communicate. As the program is fixed, it sits in separate ROMs, and the job interface was small (in terms of number of interactions), so the communication is very efficient.
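That shared-RAM handshake can be sketched as a polled job byte. This is a simplified simulation of the style of protocol, with Python threads standing in for the two CPUs; the job and status codes are illustrative, not the actual Commodore DOS values.

```python
import threading
import time

# "Shared RAM": one byte acts as the job slot both CPUs can see.
shared_ram = bytearray(1)
JOB_READ = 0x80    # high bit set = a job is pending (illustrative code)
STATUS_OK = 0x01   # controller overwrites the job with a status

def fdc_loop():
    """The disk-controller side: poll the job byte, do the work,
    then post a completion status in the same location."""
    while True:
        if shared_ram[0] & 0x80:        # high bit set: job pending
            # ... the real controller would seek/read/write here ...
            shared_ram[0] = STATUS_OK   # job byte becomes the status
            return
        time.sleep(0.001)

# The file-system side posts a job and polls for completion.
worker = threading.Thread(target=fdc_loop)
worker.start()
shared_ram[0] = JOB_READ                # "start job"
while shared_ram[0] & 0x80:             # wait for the status
    time.sleep(0.001)
worker.join()
```

Because each interaction is a single byte written and a single byte polled, the per-job overhead stays tiny, which is the point André makes about the small job interface.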

2) There have been demos on the C64 that actually use the VC1541 floppy processor to compute stuff. IIRC one computed a Mandelbrot graphic. Here the job interface is relatively small ("start job with parameters" and "send me the result"), so communication is efficient even though it was through the Commodore serial bus. IIRC the floppy drive loaded its own program directly from the floppy, so the main (C64) CPU was not used for this.

3) My CS/A65 computer implements a general-purpose coprocessor, running on the opposite clock phase from the main processor. Both use up to 64k of shared RAM to communicate. The plan is to use it for offloading either TCP/IP or SCSI-disk and filesystem load, and to communicate with the main CPU only via file I/O calls, using the shared memory to transfer the data. If the main CPU had to wait for the coprocessor (e.g. on a disk read), it would basically cost more than it gains. However, my operating system is a multitasking system, so while the coprocessor reads a file, the calling process on the main processor can wait and the main processor can execute other jobs. The board will have 32k RAM and 32k of either RAM or ROM. If it is 64k RAM total, the main CPU has to load the coprocessor program into the shared RAM - good for debugging anyway - but later I want to have it in ROM.

So, from these examples, how would you optimize the communication and synchronization.....?

Basically it is very simple: reduce the number and size of communication calls. You can read a more theoretical approach here http://www-128.ibm.com/developerworks/l ... nterface1/ and http://www-128.ibm.com/developerworks/l ... nterface2/
(please bear with the author :-)

Also make communication asynchronous if possible. That avoids one processor having to wait for the other (as long as it has something else to do).
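The asynchronous style can be sketched with a job queue, so the sender never blocks. This is a Python stand-in for what would be a ring buffer in shared RAM on real hardware; the squaring "work" and the None shutdown sentinel are just illustrative.

```python
import queue
import threading

jobs = queue.Queue()     # unbounded: put() returns immediately
results = queue.Queue()

def coprocessor():
    """Drains the job queue at its own pace; None means shut down."""
    while (job := jobs.get()) is not None:
        results.put(job * job)   # stand-in for the real work

t = threading.Thread(target=coprocessor)
t.start()
for n in range(4):
    jobs.put(n)          # the "main CPU" keeps going without waiting
jobs.put(None)
t.join()
squares = [results.get() for _ in range(4)]
print(squares)           # -> [0, 1, 4, 9]
```

The sender only pays the cost of a queue insert per job; synchronization happens once, when the results are finally collected.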

If you have a specific algorithm you want to distribute, look at the parallelization algorithms.

Parallel processing is not easy, and needs a lot of planning. In my 6502 case I have a multitasking OS, so I could "outsource" tasks to another processor without modifying the program itself. But if it is very "chatty", with lots of communications, the additional effort to communicate from coprocessor to main processor may then even eat up what has been gained by the coprocessor.

I hope this helps a bit.

André
Nightmaretony
In Memoriam
Posts: 618
Joined: 27 Jun 2003
Location: Meadowbrook

Post by Nightmaretony »

Scratch the above - the more I think about it, the better I simply like dual-port festivities for each CPU. This way, the main can tell the others what to do. I did that stunt for a poker game with 2 65C02s years ago and it worked out nicely, with the secondary being a graphics processor. (Didn't do the code, just the hardware design, and lost the schematic, so bleah - memory is fun!)

but on the thinking, me likes the ports....
"My biggest dream in life? Building black plywood Habitrails"
kc5tja
Posts: 1706
Joined: 04 Jan 2003

Post by kc5tja »

This is the "actor" model of communication, and it is what allowed AmigaOS to achieve its high performance even with a single CPU. Message passing is the only universally applicable concurrency solution that simultaneously eliminates the need for such complicated mechanisms as read/write locking, semaphores, monitors, et al.

In fact, strictly speaking you don't even need queues to implement message passing, as the L4 and QNX microkernels clearly demonstrate. However, they are nice to have if resources permit.

High-level programming languages that explicitly support this model of distributed computation include Erlang, which has an enormous embedded market share in the telecommunications industry. That speaks volumes for the model's applicability.

So don't feel bad for making this choice. It's actually one of the best choices that exists. People have been saying that multithreading is hard. This is true for Windows and Unix. But, AmigaOS developers have been using multithreading since 1985, and nobody complained. Hell, the totality of the OS itself is built on the very same concept of client/server interaction using message passing!