idea: a cell-like chip based on many 6502 cores
I was thinking about this a few years ago. The 6502 has about 4,000 transistors, while some modern CPUs (Itanium) have 1.72 billion. So I think it would be possible to create a CPU containing thousands of 6502 cores, maybe 400,000, running at faster than 1 GHz. It would also be necessary to have logic for addressing and connecting those cores into pipelines, and for on-chip RAM.
Such a CPU could potentially perform 400,000 × 2.5 billion 8-bit operations per second, which is one peta-op. That's comparable to the speed of "Roadrunner", the fastest supercomputer in the world, or the total speed of the folding@home network, in a single chip. Those supercomputers are clocking from 1 to 5 petaflops for floating-point operations.
The "cell supercomputer" project aims to create chips with a thousand cores each having 1 million transistors. I think it would be more efficient to use 250 times more 6502 cores having only 4000 transistors each.
The cores could be arranged in a hexagonal lattice such that each core could talk to 6 immediate neighbors and perhaps also 6 more distant neighbors. The 8-bit cpu is very suitable for character operations and multiple cores could be combined in series for 32-bit integer, floating point and other operations.
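One way to picture the proposed lattice is with axial hex coordinates. A small sketch follows; the coordinate scheme and the choice of "distant" neighbours as the six cores two steps out along each axis are my own assumptions for illustration, not part of the proposal:

```python
# Sketch of the hexagonal lattice interconnect using axial coordinates.
# Each core at (q, r) has 6 immediate neighbours; doubling each
# direction vector gives 6 "more distant" neighbours two hops out.
HEX_DIRECTIONS = [(1, 0), (1, -1), (0, -1), (-1, 0), (-1, 1), (0, 1)]

def neighbors(q, r, distance=1):
    """Return the 6 lattice neighbours at the given axial distance."""
    return [(q + dq * distance, r + dr * distance)
            for dq, dr in HEX_DIRECTIONS]

print(neighbors(0, 0))       # the 6 immediate neighbours of the origin
print(neighbors(0, 0, 2))    # the 6 distant neighbours two steps out
```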
Most if not all complex problems can be programmed to take advantage of a large network of processors.
None of the modern processors provide proportionally more power than a 6502 for the number of transistors they use. In fact, for normal integer ops a modern 32-bit processor really only provides about four times more power than a 6502 would running at the same clock speed, yet uses from 10,000 to 400,000 times as many transistors!
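A quick back-of-envelope check of the figures above, using the post's own transistor counts and an assumed 2.5 GHz per-core clock:

```python
# Back-of-envelope check of the figures above (all inputs are the
# post's own estimates, not measured values).
TRANSISTORS_PER_6502 = 4_000        # NMOS 6502, per the post
TRANSISTOR_BUDGET = 1_720_000_000   # Itanium-class die, per the post
CLOCK_HZ = 2.5e9                    # assumed per-core clock

cores = TRANSISTOR_BUDGET // TRANSISTORS_PER_6502
ops_per_second = cores * CLOCK_HZ

print(cores)                        # 430000 cores fit the budget
print(ops_per_second / 1e15)        # ~1.075 peta 8-bit ops/s
```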
Here are some relevant links:
number of transistors in 6502 and other classic processors
http://www.classiccmp.org/pipermail/cct ... 70250.html
cell processors
http://cellsupercomputer.com/how_can_one_chip.php
number of transistors in modern intel processors
http://www.intel.com/pressroom/kits/quickreffam.htm
speed of folding@home distributed computing network
http://fah-web.stanford.edu/cgi-bin/mai ... pe=osstats
IBM Roadrunner supercomputer
http://en.wikipedia.org/wiki/IBM_Roadrunner
With that many CPUs, you will have bus contention issues. With most CPUs, you can only have about 5 CPUs on any single bus before you run into contention. With the 6502, you can only have two, because its bus cycle is so fast.
What you want aren't CPUs, but rather entire computers. That means one CPU, one 64K or larger chunk of RAM, and the I/O facilities needed to communicate with at least a peer computer. As you can imagine, this will cut down on the number of transistors dedicated to the CPU.
BTW, there are about 4000 gates, not transistors, in a 6502. That means the number of transistors used is closer to 8000 to 16000. Still pretty small, though. Remember, the 6502 is now a fully static design (like the 1802 in the article you cited), and nobody uses NMOS anymore.
Another thing to consider is communications overhead. What if node A wants to talk to node C, but can only do so through node B? You'll need to wait for node B first.
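The cost of relaying through intermediate nodes can be modelled crudely. A toy sketch, where the per-hop cycle cost is an illustrative assumption rather than a real figure:

```python
# Toy model of the A -> B -> C forwarding problem: on a
# nearest-neighbour grid, total latency grows with every
# intermediate node the message must pass through.
def route_latency(src, dst, per_hop_cycles=10):
    """Manhattan-distance hop count times an assumed per-hop cost."""
    hops = abs(dst[0] - src[0]) + abs(dst[1] - src[1])
    return hops * per_hop_cycles

print(route_latency((0, 0), (1, 0)))   # adjacent nodes: one hop
print(route_latency((0, 0), (2, 0)))   # A to C through B: twice the cost
```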
I suggest you look at http://www.intellasys.net -- they're already doing what you're looking for, albeit on a microcontroller scale.
- GARTHWILSON
- Forum Moderator
- Posts: 8775
- Joined: 30 Aug 2002
- Location: Southern California
- Contact:
The post at viewtopic.php?p=6194#p6194 and the topic it is in are relevant to the matter of efficiently distributing tasks among multiple processors, albeit on a much smaller scale.
If you want to run the processors as fast as you're talking about, however, i.e., much faster than you could run their memory if it's not in the same IC, you'll have to include that memory in the onboard transistor count. You'll also need CMOS to go fast. I don't know the transistor count of the 65C02, but I expect it's two to three times what it was for the NMOS 6502. The simplest floating-point operation (addition) on a 65(c)02, however, would probably take at least a hundred clock cycles. I've written floating-point routines, but never counted up the cycles.
A friend I rarely see who is a programmer at JPL was advocating more parallelism instead of trying to squeeze so much out of individual processors, with the resulting penalties. After the birthday party where I saw him last, I emailed him trying to get him to elaborate, but never heard back. I am interested, though.
Quote:
In fact for normal integer ops a modern 32-bit processor really only provides four times more power than a 6502 would running at the same clock speed
That's really only the best-case scenario, like fixed-point/scaled-integer addition. A 6502 doing a 32-bit multiply takes thousands of clocks to do what a 32-bit processor with an onboard hardware multiplier may do in just a couple of clocks. I would like to see a true 32-bit, simple 6502-type processor with just a few additions like the hardware multiplier. This processor would have a 32-bit data bus as well as address bus, and 32-bit registers. In essence, the "byte" becomes 32 bits, and all of the four-gigaword memory space is zero page.
When I was writing my 65816 Forth, I found that because of the 16-bit registers, even though the data bus was still only 8 bits, the 65816 required a fraction as many instructions (far less than half) to do many of the Forth words compared to the 6502. It runs Forth 2-3 times as fast as the 6502 does at the same clock speed, in spite of the data bus not being any wider. If it had a 16-bit data bus, the speed difference would increase.
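The gap between a software multiply on an 8-bit CPU and a hardware multiplier can be put in rough numbers. This is only a cycle-count sketch; the per-step cost is an illustrative assumption, not a measured 6502 timing:

```python
# Rough cycle model for a 32-bit software multiply on an 8-bit CPU,
# versus a single-cycle-class hardware multiplier.  The per-step cost
# is an illustrative assumption, not a measured 6502 figure.
def shift_add_multiply_cycles(bits=32, cycles_per_step=60):
    # One shift/test/conditional-add pass per multiplier bit; each pass
    # touches four bytes of multiplicand, product, and carry handling.
    return bits * cycles_per_step

software = shift_add_multiply_cycles()   # ~1920 cycles: "thousands of clocks"
hardware = 2                             # "a couple of clocks" with a multiplier
print(software, hardware, software // hardware)
```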
Last edited by GARTHWILSON on Thu Feb 12, 2009 12:06 am, edited 1 time in total.
http://WilsonMinesCo.com/ lots of 6502 resources
The "second front page" is http://wilsonminesco.com/links.html .
What's an additional VIA among friends, anyhow?
GARTHWILSON wrote:
A friend I rarely see who is a programmer at JPL was advocating more parallelism instead of trying to squeeze so much out of individual processors, and with the resulting penalties.
Quote:
That's really only the best-case scenario, like fixed-point/scaled-integer addition.
Provide a 6502 with a coprocessor that has a multiplier unit with the same number of transistors as that used in the x86 architecture, and then your puny 8-bit 6502 will be every bit as fast as a Pentium-IV at integer multiplication, at the same clock speed of course. Provide that coprocessor with the ability to DMA fetch its data from RAM, and you can now multiply entire vectors, of arbitrary length, at speeds equal to or in excess of current architectures. This concept was the basis behind the Amiga's original blitter design, and is the reason why we even have a GPU market today. Only configuration latency and raw memory bandwidth limits your throughput.
Likewise, strip the Pentium-IV of its caches, its 32-bit bus, its speculative instruction execution, its branch prediction, its register renaming, its pipelines, and you'll end up with a chip that is actually SLOWER than the 6502.
(This is borne out in a 65816 vs 80286 benchmark experiment that WDC used to have up on their site. Not sure if they still do.)
- GARTHWILSON
- Forum Moderator
kc5tja, or anyone, what kinds of jobs would the parallel processing be good for? I think I have a feel for it, but would like to hear it defined better if possible. Would our 6502.org members benefit from developing some kind of very small (a few square inches, max), standard computer board that could serve as a building block for greater parallelism, even though we're not putting multiple processors on a single IC (assuming no one here has the resources to carry out swatkins' original suggestion)? The idea would be that you could keep expanding the computing power by plugging in more and more of these modules, whether into a backplane or into each other. How might these modules efficiently communicate with each other?
GARTHWILSON wrote:
kc5tja, or anyone, what kinds of jobs would the parallel processing be good for?
I'm going to assume two kinds of interconnects: shared bus model (which I think the cell processor uses), and a flat matrix of self-contained cores.
The shared bus model is ideal for relatively coarse-grained parallel decomposition of tasks. For example, you can assign one CPU the job of computing the next frame's graphics, another CPU for computing the next chunk of audio to play back, and a third CPU for handling user I/O and for controlling the inputs to all the other CPUs. OR, as in the Commodore floppy disk drives prior to the 1541, you can have one CPU handle GPIB bus I/O, and another CPU handling the commands transferred over the bus. Kind of like what I was suggesting to Daryl for his SPI project.
The limiting factor for parallelism is, obviously, the bus bottleneck. Modern SMP systems tend to allow about 16 microprocessors to interconnect to a common bus before they start fighting each other for bandwidth. I'll say give or take a few CPUs, because it obviously depends on how they use cache and so forth. But 16 is average.
For the flat matrix of self-contained computers (e.g., Intellasys 40C18 chip), the applications are numerous. Each individual core can be programmed to perform some function somewhat analogous to how you'd program an FPGA CLB. As a result, you can partition one portion of the chip to serve as some dedicated processing logic, and another portion for another task altogether. They needn't inter-communicate, although they can if you want.
Another application of the matrix topology is systolic processing. A CPU 'pipeline' is an example of a systolic processor, albeit a significantly over-simplified one. Each stage of the pipeline performs its sole task and passes its results on to the next stage when complete. If you have N pipeline stages, clocked at M cycles per second, then assuming you keep all the stages full, you can achieve NxM operations per second.
The 40C18 cores have interconnects in the north/south as well as east/west directions, so more complex forms of systolic processing are possible too. In this case, N = core width * core height, so as you can imagine, you can get some significant processing power from such an arrangement.
Real-world examples of systolic processing include surface acoustic simulations (e.g., if you want to engineer a new kind of SAW filter), audio synthesis (e.g., waveguide synthesis), and image/video applications (wavelet (de)compression, edge detection, motion prediction, etc).
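The N stages at M cycles per second arithmetic can be demonstrated with a minimal software model of a systolic pipeline; the three stage functions here are arbitrary examples:

```python
# Minimal systolic pipeline: N stages, each doing one step per tick.
# Once the pipeline fills, it completes one result per tick, so a
# pipeline clocked at M ticks/s sustains N x M stage-operations/s.
def run_pipeline(stages, inputs):
    """Push inputs through `stages` (one function per stage), one tick
    at a time, and return the finished results in order."""
    pipe = [None] * len(stages)
    results = []
    feed = list(inputs) + [None] * len(stages)   # extra ticks to drain
    for x in feed:
        out = pipe[-1]                           # value leaving the pipe
        # shift every value one stage forward, applying each stage
        for i in range(len(stages) - 1, 0, -1):
            pipe[i] = stages[i](pipe[i - 1]) if pipe[i - 1] is not None else None
        pipe[0] = stages[0](x) if x is not None else None
        if out is not None:
            results.append(out)
    return results

stages = [lambda v: v + 1, lambda v: v * 2, lambda v: v - 3]
print(run_pipeline(stages, [1, 2, 3]))   # each input: ((v + 1) * 2) - 3
```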
In terms of home hobby projects, multiple cores can help out with interrupt handling as well. Just as your brain relies on its spinal cord to handle emergency interrupt communications ("OHMYGODHOLY****THISISHOT!!" messages from your temperature-sensing neurons get translated to "Well, duhh, release the soldering iron immediately!" long before your brain is even aware of the original message), you can program cores to serve as I/O coprocessors on behalf of interior computational processors.
For example, this is how I implemented my VGA "clock" display on the Intellasys 24-core prototype chip (the S24). Core #12 served as the CRTC, generating HSYNC and VSYNC pulses on its I/O pins. It also communicates which pulse occurs by sending a message to core #13 via the right-hand IPC channel. As soon as the message is sent, core #12 resumes its timebase tasks. This allows core 13 to do whatever it needs to do asynchronously of core 12, since it is of utmost importance that core 12 remain isochronous.
Core 13, a computational node, simply counts the lines drawn so far (resetting its counter upon hearing about a VSYNC event), and issues triggers to core 19 (via its upward IPC channel), telling core 19 which thing to draw at the current time. After doing so, it goes to sleep, waiting for core 12 to wake it up again with a new notification.
Core 19, then, is responsible for coordinating cores 20, 21, 22, 14, 15, and 16 for generating the hours, minutes, and seconds (hours: 20/14, minutes: 21/15, and seconds: 22/16) graphics representations. When the graphics have been assembled, they pass it back to core 19, where it is then forwarded to core 18, another I/O core, equipped with a DAC. The DAC then drives the green pin on the VGA connector.
Note: Core 13 determines what to draw based on current line, while core 19 handles what to draw based on horizontal offset from the left edge.
Like core 12, core 18 must run isochronously, which means once fed data, it must take a constant amount of time to serialize the data it receives. However, core 19 can run faster than core 18, because 19 will "block" if 18 isn't ready for more data. This is a very, very convenient feature of the Intellasys architecture.
Here's the "floorplan" for the chip:
Code: Select all
+---------+ +---------+ +---------+ +---------+ +---------+
G <---| Core 18 |<---| Core 19 |<-->| Core 20 |<-->| Core 21 |<-->| Core 22 |
| | | | | | | | | |
+---------+ +---------+ +---------+ +---------+ +---------+
^ ^ ^ ^
| | | |
| V V V
+---------+ +---------+ +---------+ +---------+ +---------+
_HS <---| Core 12 |--->| Core 13 | | Core 14 | | Core 15 | | Core 16 |
_VS <---| | | | | | | | | |
+---------+ +---------+ +---------+ +---------+ +---------+
Core Chart:
12: CRTC
13: Vertical graphics gate
14, 15, 16: Moving indicators for hours, minutes, seconds
18: Dot shifter/DAC interface
19: Graphics job scheduler
20, 21, 22: Fixed containers for hours, minutes, seconds
Quote:
Would our 6502.org members benefit from developing some kind of very small (a few square inches, max), standard computer board that could serve as a building block for greater parallelism, even though we're not putting multiple processors on a single IC (assuming no one here has the resources to carry out swatkins' original suggestion)?
Quote:
The idea would be that you could keep expanding the computing power by plugging in more and more of these modules, whether into a backplane or into each other. How might these modules efficiently communicate with each other?
I like the flat matrix layout because it's simple. However, you run into problems when node A needs to talk to node C -- it has to go through B first. Thus, accurate timing of A's message to C must necessarily take B into account. Still, it's amazingly versatile, as my example above proves.
To make this work, you'll need to violate Intellasys patents, though.
* First, the interprocessor communications registers work on a synchronous message-passing model. If A wants to send a byte to B, and B isn't already reading from the IPC port, then A will block until B reads. Likewise, if B reads from the port, and A hasn't written anything to it yet, B will block until A writes. In this way, both nodes rendezvous in time perfectly, and both carry on their merry way after synchronization (and data transfer) has occurred.
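That rendezvous behaviour can be modelled in ordinary software. A minimal sketch using threads, which mimics the blocking semantics described above, not the actual Intellasys hardware protocol:

```python
import threading

class Channel:
    """Unbuffered, CSP-style channel: send() and recv() rendezvous.
    A sketch of the blocking IPC-port semantics, not real hardware."""
    def __init__(self):
        self._cond = threading.Condition()
        self._item = None
        self._full = False
        self._taken = False

    def send(self, value):
        with self._cond:
            while self._full:               # wait until any prior item is taken
                self._cond.wait()
            self._item, self._full, self._taken = value, True, False
            self._cond.notify_all()
            while not self._taken:          # block until a receiver arrives
                self._cond.wait()

    def recv(self):
        with self._cond:
            while not self._full:           # block until a sender arrives
                self._cond.wait()
            value = self._item
            self._full, self._taken = False, True
            self._cond.notify_all()
            return value

# Two "cores" rendezvous over the channel.
ch = Channel()
t = threading.Thread(target=lambda: [ch.send(v) for v in (1, 2, 3)])
t.start()
received = [ch.recv() for _ in range(3)]
t.join()
print(received)
```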
* Second, and this is patented, you'll want to give about 256 bytes of address space or so to each IPC register. Why? Because then you can let the CPU JMP or JSR into "IPC" space, where the CPU will block waiting for instructions to execute. This doesn't sound terribly useful, but it is actually of immense value. It's how you bootstrap the thing, first of all, and second of all, it allows you to write software in a purely event-driven manner. For example:
Code: Select all
.proc hsyncDetected
inx ; increment our line counter
cpx #GFX_BEGIN
bne L1
lda #$20 ; JSR opcode
sta IPC_UP
lda #<nodeX_subroutineY
sta IPC_UP
lda #>nodeX_subroutineY
sta IPC_UP
rts ; return to our own IPC space!
L1:
cpx #GFX_END
bne L2
lda #$20
sta IPC_UP
lda #<nodeX_subroutineZ
sta IPC_UP
lda #>nodeX_subroutineZ
sta IPC_UP
L2:
rts
.endproc
.proc vsyncDetected
ldx #$00
rts
.endproc
NOTE: The Intellasys chips don't have huge address spaces. They get around this by having hardware logic which prevents the PC from auto-incrementing when executing instructions from the IPC ports. Unfortunately, that kind of logic doesn't exist for the 65xx architecture, which is why I prescribe a 256-byte region per IPC port.
* Third, I've found it essential to be able to write to multiple ports at once if necessary on the Intellasys chips. I'd expect the same would happen on the 6502 implementation too. Therefore, you're likely going to want to use 4K of IPC space (with four IPC ports [up, down, left, right], 2^4 = 16 distinct 256-byte spaces, which totals 4K). This way, a single CPU write can notify three other nodes concurrently.
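The proposed 4K decode can be sketched as follows; the base address and the bit-to-port assignment are hypothetical, chosen only for illustration:

```python
# Sketch of the 4 KB IPC decode proposed above: four address bits
# select any subset of the four ports, giving 2^4 = 16 distinct
# 256-byte spaces, so one write can notify up to three neighbours
# at once.  Base address and bit assignment are assumptions.
PORTS = ("up", "down", "left", "right")
IPC_BASE = 0xB000            # assumed base of the 4 KB IPC window

def ports_hit(addr):
    """Return which IPC ports a write to `addr` reaches."""
    select = (addr - IPC_BASE) >> 8   # one 256-byte space per bit pattern
    return [p for i, p in enumerate(PORTS) if select & (1 << i)]

print(ports_hit(IPC_BASE + 0x100))   # bit 0 only: just the "up" port
print(ports_hit(IPC_BASE + 0x700))   # bits 0-2: three ports at once
```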
Finally, amazingly, Forth turns out to be quite a pleasant language for working in such a parallel environment. If this doesn't float your boat, you might want to research Occam, a programming language designed for the Transputer architecture (which works almost identically to the Intellasys architecture).
GARTHWILSON wrote:
kc5tja, or anyone, what kinds of jobs would the parallel processing be good for?
The nice thing is you don't need to worry about the number of physical processors, deadlocks, priority inversion, etc. because it's handled automagically. You write code, and it just works.
Toshi
- GARTHWILSON
- Forum Moderator
Toshi, can you elaborate?
And Samuel, I started a response to your 2nd-to-last post, but I need to find it and finish it.
GARTHWILSON wrote:
Toshi, can you elaborate?
And Samuel, I started a response to your 2nd-to-last post, but I need to find it and finish it.
The original system (developed before I originally came here) was written as an interpreter in C++ using the STL, and it was single-threaded and pretty slow.
So, I decided to rewrite it, and looked at all available languages, and decided none of them were suitable. Erlang and CHARM++ came the closest to being usable, but they weren't quite suitable. So I wound up developing a custom programming language that took the useful characteristics from both, and added some specific features required for our application.
Toshi
- GARTHWILSON
- Forum Moderator
I said I'd respond to this later, and now I've had it mostly idle and taking up a couple of tabs for the last week and a half, so it's high time I wrapped it up and posted it.
Quote:
The shared bus model is ideal for relatively coarse-grained parallel decomposition of tasks. For example, you can assign one CPU the job of...
So it sounds like you're mostly talking about jobs that will always be there and are basically tied to the hardware, IOW, having to do with I/O instead of data processing. Is that right?
Quote:
my guess is that the value lies mostly in the educational value of learning to decompose problems in terms of a fixed number of concurrently running, fully isolated processes. I wouldn't expect anything useful to come from this though. I could be wrong, of course.
That's what I have a hard time imagining too: how to decompose and distribute it, when most of the work I do with my workbench computer is made up of serial tasks, i.e., each little piece depends on the outcome of the previous one. Maybe it's just that I'm too conditioned to think that way. Maybe more could be done in parallel; but with dozens, or even thousands of processors??! OTOH, some of what I do with interrupts could be done this way, and if each interrupt source could have its own processor at its disposal, it could get its jobs done without interfering with other jobs. Actually, you did mention that further up. Now, if I could just find some memory that was more than dual-port, like maybe quad-port or...?...
Quote:
Well, I strongly, strongly, strongly, strongly encourage NOT going with a standard backplane. That is a shared-memory, symmetric multiprocessing model, which due to the 6502's bus timings, limits you to two CPUs before you start to run into bus contention.
Perhaps "backplane" was not the best word to use, but I was referring to some kind of mother board that would connect the modules to each other appropriately, and not necessarily a bus where the same lines run to all the modules.
Quote:
To make this work, you'll need to violate Intellasys patents, though. Smile
Would it matter, as long as we're not making money at it?
Quote:
then A will block until B reads.
Does that mean approximately the same thing as stopping its clock and waiting, so it's not even aware of idle cycles passing? This appears to partially remedy what I see as some loss of control in timing in real-time tasks.
Quote:
NOTE The Intellasys chips don't have huge address spaces. It gets around this by having hardware logic which prevents PC from auto-incrementing when executing instructions from the IPC ports. Unfortunately, that kind of logic doesn't exist for the 65xx architecture, which is why I prescribe a 256 byte region per IPC port
which goes along with the blocking, if I understand you correctly.
Quote:
* Third, I've found it essential to be able to write to multiple ports at once if necessary on the Intellasys chips. I'd expect the same would happen on the 6502 implementation too. Therefore, you're likely going to want to use 4K of IPC space (with four IPC ports [up, down, left, right], 2^4 = 16 distinct 256-byte spaces, which totals 4K). This way, a single CPU write can notify three other nodes concurrently
which basically is what I have with my address-decoding scheme. It allows setting up two or three VIAs at the same time, even starting their counters synchronously, although I have never used that capability.
GARTHWILSON wrote:
(re: shared buses) . . . So it sounds like you're mostly talking about jobs that will always be there and are basically tied to the hardware, IOW, having to do with I/O instead of data processing. Is that right?
It's also how most every contemporary supercomputer is built nowadays, where you often have 8-CPU or 16-CPU motherboards (shared-bus architecture) arranged on a network of some kind. Programs on a single motherboard can exploit shared memory to their advantage when chugging out results for sophisticated math.
Quote:
That's what I have a hard time imagining too, how to decompose and distribute it, when most of the work I do with my workbench computer made up of serial tasks, ie, that each little piece depends on the outcome of the previous one.
Quote:
Maybe it's just that I'm too conditioned to think that way. Maybe more could be done in parallel; but with dozens, or even thousands of processors??!
Quote:
OTOH, some of what I do with interrupts could be done this way, and if each interrupt source could have its own processor at its disposal, it could get its jobs done without interfering with other jobs.
Quote:
Actually, you did mention that further up. Now, if I could just find some memory that was more than dual-port, like maybe quad-port or...?...
Quote:
Perhaps "backplane" was not the best word to use, but I was referring to some kind of mother board that would connect the modules to each other appropriately, and not necessarily a bus where the same lines run to all the modules.
Quote:
Would it matter, as long as we're not making money at it?
Quote:
Does that mean approximately the same thing as stopping its clock and waiting, so it's not even aware of idle cycles passing?
Quote:
This appears to partially remedy what I see as some loss of control in timing in real-time tasks.
Each chip in a circuit will run independently of each other. They might be synchronized via a single clock, but each chip nonetheless performs completely asynchronously with respect to each other. And, yet, this rarely becomes a problem in practice, thanks to bus protocols.
Well, the same kind of logic applies inside a multi-core microcontroller. For example, in my VGA demonstrations with the now-defunct 24-core chip, I used the following "floorplan":
Core 12: CRTC -- generated HSYNC and VSYNC hardware pulses on its I/O pins. It also sent either 0 (HSYNC line) or -1 (VSYNC line) to core 13 a few nanoseconds later.
Core 13: Used the 0s or -1s to maintain a raster line counter. Depending on whether this counter was within a fixed range of values, it would transmit a 0 to core 19. It always transmitted -1 to core 19, because it also needed to know when a VSYNC occurred.
Core 19: Upon detecting a -1, it would reset its own state machine so it knew that its next set of graphics was the first line of graphics to display. Otherwise, it assumed interior graphics. Upon receiving a 0, it would send requests for data to cores 20, 21, 22, and through those, 14, 15, and 16 too. These cores are responsible for computing, in real-time, the graphics to display for the guts of the clock. It'd then pass the data to core 18 for shifting to the DAC, taking care of scheduling the words to display so that you get neat columns on the screen.
Core 18: The green-gun DAC (thus, producing a "green-screen" monochrome display) and pixel "shift register".
Except for cores 18 and 12, all other cores spend most of their time waiting for something to be sent over the IPC channels.
NOTE: The above description describes the original version of the software. Modern versions now use "port-execution" in cores 12, 13, 18, and 19, where cores spoonfeed CALL instructions to other cores as needed. So, e.g., instead of core 13 having an IF/ELSE/THEN construct inside a main event loop, core 12 just CALLs either the HSYNC or VSYNC word in core 13 directly.
Note also that these calls occur asynchronously -- once the CALL opcode is sent to the adjacent core, both cores free-run. When the adjacent core finishes its task, it'll RETURN to the IPC channel, thus blocking for more instructions. Likewise, if core 12 is too fast, it'll attempt to CALL into core 13 again, blocking until core 13 is ready for it.
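The CALL-streaming protocol above can be sketched with blocking FIFO channels. This is a toy Python model, not 6502 or Intellasys code: the core numbers and the hsync/vsync words follow the floorplan above, but the channel depth, sentinel, and everything else are assumptions for illustration only.

```python
import queue
import threading

class Core:
    """Toy model of one core: it blocks on its IPC channel and executes
    whatever 'instruction' (here, a bound-method name) is sent to it."""
    def __init__(self):
        self.channel = queue.Queue(maxsize=1)  # blocking, like an IPC port
        self.line = 0
        self.vsyncs = 0

    def hsync(self):
        self.line += 1            # advance the raster line counter

    def vsync(self):
        self.vsyncs += 1
        self.line = 0             # start of a new frame

    def run(self):
        while True:
            op = self.channel.get()   # block until the neighbour CALLs us
            if op is None:            # shutdown sentinel (model artifact)
                return
            getattr(self, op)()       # "execute from the port"

core13 = Core()
t = threading.Thread(target=core13.run)
t.start()

# Core 12's role: spoon-feed CALLs as the sync pulses occur.
for frame in range(2):
    for _ in range(3):                # 3 visible lines per toy frame
        core13.channel.put("hsync")   # blocks if core 13 is still busy
    core13.channel.put("vsync")

core13.channel.put(None)
t.join()
print(core13.vsyncs, core13.line)     # -> 2 0
```

The `maxsize=1` queue reproduces the flow control described above: if the sender is too fast, its `put` blocks until the receiver is ready, just as a CALL into a busy core would.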
Quote:
NOTE: The Intellasys chips don't have huge address spaces. They get around this by having hardware logic which prevents the PC from auto-incrementing when executing instructions from the IPC ports. Unfortunately, that kind of logic doesn't exist for the 65xx architecture, which is why I prescribe a 256-byte region per IPC port.
I do not see this as relevant to blocking at all. This is Intellasys saying that you have 16 words of memory dedicated to IPC instead of 4096. The only way the chip can be guaranteed not to overrun the IPC space is to disable PC incrementing when fetching instructions from that space.
And, it now occurs to me that 256 bytes of IPC channel space might not be enough; after only a small handful of JSRs passed to the core, you'll again run out of the IPC channel's address space, requiring you to JMP back to its origin again if you want to continue executing instructions sent from another node.
Hence, you'll want to send JMP instructions instead of JSRs, and the jumped-to routine will need to JMP back to the IPC space. This is the only way to prevent PC overflow from occurring.
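The arithmetic behind the JMP-instead-of-JSR rule is easy to check. JSR abs and JMP abs are each 3 bytes on the 6502; the window base address ($0200) below is just an assumed example. This Python sketch only tracks where the auto-incrementing PC ends up in each scheme:

```python
# Toy PC-overflow check for a 256-byte IPC window (assumed at $0200).
# A stream of 3-byte JSRs marches the PC straight out of the window,
# because each RTS returns to the byte after its JSR.

IPC_BASE, IPC_SIZE = 0x0200, 0x100

def jsr_stream(n_calls):
    """PC after executing n_calls JSRs fed through the IPC window."""
    pc = IPC_BASE
    for _ in range(n_calls):
        pc += 3               # RTS returns to pc+3, inching forward
    return pc

def jmp_scheme(n_calls):
    """Each jumped-to routine ends with JMP back to the window origin."""
    pc = IPC_BASE
    for _ in range(n_calls):
        pc += 3               # fetch the 3-byte JMP abs from the window...
        pc = IPC_BASE         # ...and the routine JMPs back to the origin
    return pc

print(jsr_stream(85) < IPC_BASE + IPC_SIZE)   # 85 JSRs (255 bytes) still fit
print(jsr_stream(86) < IPC_BASE + IPC_SIZE)   # the 86th leaves the window
print(jmp_scheme(1000) == IPC_BASE)           # PC never escapes the window
```

So a 256-byte window holds at most 85 consecutive 3-byte calls before the PC runs off the end, whereas the JMP-back scheme keeps the PC in-window indefinitely.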
Quote:
which basically is what I have with my address-decoding scheme. It allows setting up two or three VIAs at the same time, even starting their counters synchronously, although I have never used that capability.
GARTHWILSON wrote:
Would our 6502.org members benefit from developing some kind of very small (a few square inches, max), standard computer board that could serve as a building block for greater parallelism, even though we're not putting multiple processors on a single IC (assuming no one here has the resources to carry out swatkins' original suggestion)? The idea would be that you could keep expanding the computing power by plugging in more and more of these modules, whether into a backplane or into each other. How might these modules efficiently communicate with each other?
IIRC, it already has support for running in an 8-processor configuration (albeit in a non-shared-memory configuration). This would simplify the hardware design. The only problem is that it's not really ideal for a message-passing system, since the cost of sending a message would be higher than on a shared-memory machine.
Toshi
- GARTHWILSON
Quote:
have you looked at the WDC65C134S?
This is kind of hastily written, and still getting away from cell processors, but while we're talking about having multiple processors instead of doing everything with a single processor, I have to ask: what goes on in PCs' video cards, sound cards, and graphics engines to take a big burden off the main processor? Samuel has occasionally touched on how GEOS (software) on the Commodore 64 computed the video for various fonts and sizes, which, although not lightning-fast, was very good for a 1MHz processor. But as I type here on my PC, possibly with audio going on at the same time and occasionally streaming video too, and see the immediate effect of inserting a word which adjusts all the lines below it on the screen with no visible delay regardless of fonts and other factors, I sometimes wonder how much of that work of figuring out what value to give each dot is done by the main processor and how much by the video chipset.

The PC to me is basically an appliance. I know very little about its innards. I don't even know if there's still a separate floating-point co-processor in the same package with the main processor (not that the video calculations use FP very often), or if the main processor has its own FP capabilities.
For the streaming audio and video, I have no doubt that the audio and video chipsets do a lot of buffering, maybe even several seconds' worth, and I suspect these may even access the sets of samples by DMA; but the control of the timing is atrocious, as witnessed by the fact that the audio may lead or lag the video by a tenth of a second or even more, which is totally unacceptable for my kind of work, to put it mildly. This is obviously a common problem as we frequently see it even on TV, from major stations that can afford all the processing power they could want.
So what can the home computer builder apply from all this in order to improve performance without losing control of timing?
http://WilsonMinesCo.com/ lots of 6502 resources
The "second front page" is http://wilsonminesco.com/links.html .
What's an additional VIA among friends, anyhow?