GARTHWILSON wrote:
kc5tja, or anyone, what kinds of jobs would the parallel processing be good for?
Well, actually, the answer to this question depends on the specific architecture, that is, on the interconnection structure between the cores.
I'm going to assume two kinds of interconnects: a shared-bus model (which I think the Cell processor uses), and a flat matrix of self-contained cores.
The shared-bus model is ideal for relatively coarse-grained parallel decomposition of tasks. For example, you can assign one CPU the job of computing the next frame's graphics, another the job of computing the next chunk of audio to play back, and a third the job of handling user I/O and controlling the inputs to all the other CPUs.
OR, as in the Commodore floppy disk drives prior to the 1541, you can have one CPU handle the GPIB (IEEE-488) bus I/O and another CPU handle the commands transferred over the bus. Kind of like what I was suggesting to Daryl for his SPI project.
The limiting factor for parallelism is, obviously, the bus bottleneck. Modern SMP systems tend to top out at around 16 microprocessors on a common bus before they start fighting each other for bandwidth. Give or take a few CPUs, of course, since it depends on how effectively they use their caches and so forth, but 16 is a typical figure.
For the flat matrix of self-contained computers (e.g., Intellasys 40C18 chip), the applications are numerous. Each individual core can be programmed to perform some function somewhat analogous to how you'd program an FPGA CLB. As a result, you can partition one portion of the chip to serve as some dedicated processing logic, and another portion for another task altogether. They needn't inter-communicate, although they can if you want.
Another application of the matrix topology is systolic processing. A CPU 'pipeline' is an example of a systolic processor, albeit a significantly over-simplified one. Each stage of the pipeline performs its sole task and passes its results on to the next stage when complete. If you have N pipeline stages clocked at M cycles per second, then assuming you keep all the stages full, you achieve N x M operations per second in aggregate: only M results retire per second, but all N stages are doing useful work every cycle. For example, a 10-stage pipeline clocked at 40 MHz performs 400 million stage-operations per second, even though it completes only 40 million results per second.
The 40C18 cores have interconnects in the north/south as well as the east/west directions, so more complex forms of systolic processing are possible too. In that case N = array width x array height, i.e., the total number of cores, so as you can imagine, you can get some significant processing power from such an arrangement.
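To make that concrete with 6502-flavored code: a single systolic stage on one core boils down to "wait for a value from the previous stage, do this stage's one small piece of work, hand the result to the next stage." Here is a minimal sketch, assuming blocking memory-mapped ports of the kind I describe further down (IPC_LEFT, IPC_RIGHT, and COEFF are made-up labels):
Code:
; One stage of a systolic pipeline.  Every core in the chain runs a
; loop like this; the blocking ports keep all the stages in lock-step.
stage:  lda IPC_LEFT        ; wait for the previous stage's value
        clc
        adc COEFF           ; this stage's work (here: add the constant held at COEFF)
        sta IPC_RIGHT       ; pass the partial result to the next stage
        jmp stage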
Real-world examples of systolic processing include surface acoustic wave simulations (e.g., if you want to engineer a new kind of SAW filter), audio synthesis (e.g., waveguide synthesis), and image/video applications (wavelet (de)compression, edge detection, motion prediction, etc.).
In terms of home hobby projects, multiple cores can help out with interrupt handling as well. Just as your brain relies on its spinal cord to handle emergency interrupt communications (the "OHMYGODHOLY****THISISHOT!!" message from your temperature-sensing neurons gets translated into "Well, duhh, release the soldering iron immediately!" long before your brain is even aware of the original message), you can program cores to serve as I/O coprocessors on behalf of interior computational processors.
For example, this is how I implemented my VGA "clock" display on the Intellasys 24-core prototype chip (the S24). Core #12 serves as the CRTC, generating HSYNC and VSYNC pulses on its I/O pins. It also communicates which pulse occurred by sending a message to core #13 via its right-hand IPC channel. As soon as the message is sent, core #12 resumes its timebase tasks. This allows core 13 to do whatever it needs to do asynchronously with respect to core 12, since it is of utmost importance that core 12 remain isochronous.
Core 13, a computational node, simply counts the lines drawn so far (resetting its counter upon hearing about a VSYNC event) and issues triggers to core 19 (via its upward IPC channel), telling core 19 what to draw at the current time. After doing so, it goes to sleep, waiting for core 12 to wake it up again with a new notification.
Core 19, then, is responsible for coordinating cores 20, 21, 22, 14, 15, and 16 to generate the graphics representations of the hours, minutes, and seconds (hours: 20/14, minutes: 21/15, seconds: 22/16). When the graphics have been assembled, they are passed back to core 19, which forwards them to core 18, another I/O core, equipped with a DAC. The DAC then drives the green pin on the VGA connector.
Note: Core 13 determines what to draw based on the current line, while core 19 determines what to draw based on the horizontal offset from the left edge.
Like core 12, core 18 must run isochronously, which means once fed data, it must take a constant amount of time to serialize the data it receives. However, core 19 can run faster than core 18, because 19 will "block" if 18 isn't ready for more data. This is a very, very convenient feature of the Intellasys architecture.
Here's the "floorplan" for the chip:
Code:
        +---------+    +---------+    +---------+    +---------+    +---------+
  G <---| Core 18 |<---| Core 19 |<-->| Core 20 |<-->| Core 21 |<-->| Core 22 |
        |         |    |         |    |         |    |         |    |         |
        +---------+    +---------+    +---------+    +---------+    +---------+
                            ^              ^              ^              ^
                            |              |              |              |
                            |              V              V              V
        +---------+    +---------+    +---------+    +---------+    +---------+
_HS <---| Core 12 |--->| Core 13 |    | Core 14 |    | Core 15 |    | Core 16 |
_VS <---|         |    |         |    |         |    |         |    |         |
        +---------+    +---------+    +---------+    +---------+    +---------+
Core Chart:
12: CRTC
13: Vertical graphics gate
14, 15, 16: Moving indicators for hours, minutes, seconds
18: Dot shifter/DAC interface
19: Graphics job scheduler
20, 21, 22: Fixed containers for hours, minutes, seconds
Quote:
Would our 6502.org members benefit from developing some kind of very small (a few square inches, max), standard computer board that could serve as a building block for greater parallelism, even though we're not putting multiple processors on a single IC (assuming no one here has the resources to carry out swatkins' original suggestion)?
I don't have an answer to this question; my guess is that the benefit lies mostly in the educational value of learning to decompose problems in terms of a fixed number of concurrently running, fully isolated processes. I wouldn't expect anything practically useful to come of it, though. I could be wrong, of course.
Quote:
The idea would be that you could keep expanding the computing power by plugging in more and more of these modules, whether into a backplane or into each other. How might these modules efficiently communicate with each other?
Well, I strongly, strongly, strongly, strongly encourage NOT going with a standard backplane. That is a shared-memory, symmetric multiprocessing model, which, due to the 6502's bus timings, limits you to two CPUs before you start to run into bus contention.
I like the flat matrix layout because it's simple. However, you run into problems when node A needs to talk to node C -- it has to go through B first. Thus, accurate timing of A's message to C must necessarily take B into account. Still, it's amazingly versatile, as my example above proves.
To make this work, you'll need to violate Intellasys patents, though.
* First, the interprocessor communications registers work on a synchronous message-passing model. If A wants to send a byte to B, and B isn't already reading from the IPC port, then A will block until B reads. Likewise, if B reads from the port and A hasn't written anything to it yet, B will block until A writes. In this way, both nodes rendezvous in time perfectly, and both carry on their merry way after synchronization (and data transfer) has occurred.
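In 6502 terms, the two ends of a rendezvous could look something like this. This is only a sketch: IPC_RIGHT and IPC_LEFT are hypothetical memory-mapped rendezvous registers, result and buffer are placeholder variables, and the blocking itself would have to come from glue hardware (e.g., wait-stating the CPU via RDY, which on a 65C02 works for write cycles as well as reads):
Code:
; Node A (sender): hand one byte to the neighbor on its right.
        lda result
        sta IPC_RIGHT       ; hardware stalls us here until node B reads

; Node B (receiver): the mirror image of the above.
        lda IPC_LEFT        ; hardware stalls us here until node A writes
        sta buffer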
* Second, and this is patented, you'll want to give 256 bytes or so of address space to each IPC register. Why? Because then you can let the CPU JMP or JSR into "IPC" space, where the CPU will block waiting for instructions to execute. This doesn't sound terribly useful, but it is actually of immense value. It's how you bootstrap the thing, first of all, and second of all, it allows you to write software in a purely event-driven manner. For example:
Code:
.proc hsyncDetected
        inx                     ; increment our line counter
        cpx #GFX_BEGIN          ; top edge of the display area?
        bne L1
        lda #$20                ; JSR opcode...
        sta IPC_UP
        lda #<nodeX_subroutineY ; ...followed by the target address,
        sta IPC_UP              ; low byte first,
        lda #>nodeX_subroutineY
        sta IPC_UP              ; then high byte
        rts                     ; return to our own IPC space!
L1:     cpx #GFX_END            ; bottom edge of the display area?
        bne L2
        lda #$20                ; send "JSR nodeX_subroutineZ" upward
        sta IPC_UP
        lda #<nodeX_subroutineZ
        sta IPC_UP
        lda #>nodeX_subroutineZ
        sta IPC_UP
L2:     rts                     ; return to our own IPC space

.endproc
.proc vsyncDetected
        ldx #$00                ; new frame: reset the line counter
        rts
.endproc
Note how we did not have to write an "event loop" that sits and polls some I/O port, decodes some message, then dispatches based on the unmarshalled message. All that extra overhead and complexity just disappears completely.
NOTE: The Intellasys chips don't have huge address spaces. They get around this with hardware logic which prevents the PC from auto-incrementing while executing instructions from the IPC ports. Unfortunately, that kind of logic doesn't exist for the 65xx architecture, which is why I prescribe a 256-byte region per IPC port.
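For what it's worth, the receiving end of this scheme needs almost no code at all. A minimal sketch, assuming a made-up label IPC_DOWN_EXEC for the base of the 256-byte window mapped onto a node's downward port:
Code:
; After reset, the node simply parks its program counter inside its
; IPC window and waits to be told what to do.
reset:  ldx #$ff
        txs                 ; set up a stack, since neighbors will send us JSRs
        jmp IPC_DOWN_EXEC   ; every opcode fetch now blocks until the
                            ; neighbor sends a byte; each "JSR subroutine"
                            ; it sends ends in an RTS that drops the PC
                            ; right back into the window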
* Third, I've found it essential to be able to write to multiple ports at once when necessary on the Intellasys chips, and I'd expect the same to hold for a 6502 implementation too. Therefore, you're likely going to want 4K of IPC space: with four IPC ports (up, down, left, right) there are 2^4 = 16 distinct port combinations, each getting its own 256-byte space, which totals 4K. This way, a single CPU write can notify three other nodes concurrently.
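One possible way to lay that out, strictly as a sketch (the base address, the bit assignments, and every label below are made up for illustration): let address bits A8-A11 of a 4K block act as a port-select mask, so each combination of ports gets its own 256-byte page, and a store to a multi-bit page presumably completes only once every selected neighbor has accepted the byte:
Code:
; Hypothetical decode: A8 = up, A9 = down, A10 = left, A11 = right.
; The all-zeroes page selects nothing and can be left unmapped.
IPC_BASE  = $E000               ; made-up base of the 4K IPC block
IPC_UP    = IPC_BASE + $0100
IPC_DOWN  = IPC_BASE + $0200
IPC_LEFT  = IPC_BASE + $0400
IPC_RIGHT = IPC_BASE + $0800
IPC_UDL   = IPC_BASE + $0700    ; up + down + left in one page

        lda #$a5                ; some byte the neighbors are waiting for
        sta IPC_UDL             ; one write, three rendezvous at once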
Finally, amazingly, Forth turns out to be quite a pleasant language for working in such a parallel environment. If this doesn't float your boat, you might want to research Occam, a programming language designed for the Transputer architecture (whose channel-based communication works almost identically to the Intellasys IPC model).