idea: a cell-like chip based on many 6502 cores
I was thinking about this a few years ago. The 6502 has about 4,000 transistors, while some modern CPUs (Itanium) have 1.72 billion. So I think it would be possible to create a CPU containing thousands of 6502 cores, maybe 400,000, running at faster than 1 GHz. It would also be necessary to have logic for addressing and connecting those cores into pipelines, and for on-chip RAM.
Such a CPU could potentially perform 400,000 × 2.5 billion 8-bit operations per second, which is one peta-op. That's comparable to the speed of "Roadrunner", the fastest supercomputer in the world, or the total speed of the folding@home network, in a single chip. Those supercomputers are clocking from 1 to 5 petaflops for floating-point operations.
The "cell supercomputer" project aims to create chips with a thousand cores each having 1 million transistors. I think it would be more efficient to use 250 times more 6502 cores having only 4000 transistors each.
The cores could be arranged in a hexagonal lattice such that each core could talk to 6 immediate neighbors and perhaps also 6 more distant neighbors. The 8-bit cpu is very suitable for character operations and multiple cores could be combined in series for 32-bit integer, floating point and other operations.
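One way to picture the proposed lattice is with axial hex coordinates. A small sketch follows; the coordinate scheme and the choice of "distant" neighbours as the six cores two steps out along each axis are my own assumptions for illustration, not part of the proposal:

```python
# Sketch of the hexagonal lattice interconnect using axial coordinates.
# Each core at (q, r) has 6 immediate neighbours; doubling each
# direction vector gives 6 "more distant" neighbours two hops out.
HEX_DIRECTIONS = [(1, 0), (1, -1), (0, -1), (-1, 0), (-1, 1), (0, 1)]

def neighbors(q, r, distance=1):
    """Return the 6 lattice neighbours at the given axial distance."""
    return [(q + dq * distance, r + dr * distance)
            for dq, dr in HEX_DIRECTIONS]

print(neighbors(0, 0))       # the 6 immediate neighbours of the origin
print(neighbors(0, 0, 2))    # the 6 distant neighbours two steps out
```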
Most if not all complex problems can be programmed to take advantage of a large network of processors.
None of the modern processors provide proportionally more power than a 6502 for the number of transistors they use. In fact, for normal integer ops a modern 32-bit processor really only provides about four times more power than a 6502 would running at the same clock speed, yet uses from 10,000 to 400,000 times as many transistors!
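A quick back-of-envelope check of the figures above, using the post's own transistor counts and an assumed 2.5 GHz per-core clock:

```python
# Back-of-envelope check of the figures above (all inputs are the
# post's own estimates, not measured values).
TRANSISTORS_PER_6502 = 4_000        # NMOS 6502, per the post
TRANSISTOR_BUDGET = 1_720_000_000   # Itanium-class die, per the post
CLOCK_HZ = 2.5e9                    # assumed per-core clock

cores = TRANSISTOR_BUDGET // TRANSISTORS_PER_6502
ops_per_second = cores * CLOCK_HZ

print(cores)                        # 430000 cores fit the budget
print(ops_per_second / 1e15)        # ~1.075 peta 8-bit ops/s
```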
Here are some relevant links:
number of transistors in 6502 and other classic processors
http://www.classiccmp.org/pipermail/cct ... 70250.html
cell processors
http://cellsupercomputer.com/how_can_one_chip.php
number of transistors in modern intel processors
http://www.intel.com/pressroom/kits/quickreffam.htm
speed of folding@home distributed computing network
http://fah-web.stanford.edu/cgi-bin/mai ... pe=osstats
IBM Roadrunner supercomputer
http://en.wikipedia.org/wiki/IBM_Roadrunner
With that many CPUs, you will have bus contention issues. With most CPUs, you can only have about 5 CPUs on any single bus before you run into contention. With the 6502, you can only have two, because its bus cycle is so fast.
What you want aren't CPUs, but rather entire computers. That means one CPU, one 64K or larger chunk of RAM, and the I/O facilities needed to communicate with at least a peer computer. As you can imagine, this will cut down on the number of transistors dedicated to the CPU.
BTW, there are about 4000 gates, not transistors, in a 6502. That means the number of transistors used is closer to 8000 to 16000. Still pretty small, though. Remember, the 6502 is now a fully static design (like the 1802 in the article you cited), and nobody uses NMOS anymore.
Another thing to consider is communications overhead. What if node A wants to talk to node C, but can only do so through node B? You'll need to wait for node B first.
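The cost of relaying through intermediate nodes can be modelled crudely. A toy sketch, where the per-hop cycle cost is an illustrative assumption rather than a real figure:

```python
# Toy model of the A -> B -> C forwarding problem: on a
# nearest-neighbour grid, total latency grows with every
# intermediate node the message must pass through.
def route_latency(src, dst, per_hop_cycles=10):
    """Manhattan-distance hop count times an assumed per-hop cost."""
    hops = abs(dst[0] - src[0]) + abs(dst[1] - src[1])
    return hops * per_hop_cycles

print(route_latency((0, 0), (1, 0)))   # adjacent nodes: one hop
print(route_latency((0, 0), (2, 0)))   # A to C through B: twice the cost
```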
I suggest you look at http://www.intellasys.net -- they're already doing what you're looking for, albeit on a microcontroller scale.
- GARTHWILSON
- Forum Moderator
- Posts: 8775
- Joined: 30 Aug 2002
- Location: Southern California
- Contact:
The post at viewtopic.php?p=6194#p6194 and the topic it is in are relevant to the matter of efficiently distributing tasks among multiple processors, albeit on a much smaller scale.
If you want to run the processors as fast as you're talking about, however, i.e., much faster than you could run their memory if it's not in the same IC, you'll have to include that memory in the onboard transistor count. You'll also need CMOS to go fast. I don't know the transistor count of the 65C02, but I expect it's two to three times what it was for the NMOS 6502. The simplest floating-point operation (addition) on a 65(c)02, however, would probably take at least a hundred clock cycles. I've written floating-point routines, but never counted up the cycles.
A friend I rarely see who is a programmer at JPL was advocating more parallelism instead of trying to squeeze so much out of individual processors, with the resulting penalties. After the birthday party where I saw him last, I emailed him trying to get him to elaborate, but never heard back. I am interested, though.
Quote:
In fact for normal integer ops a modern 32-bit processor really only provides four times more power than a 6502 would running at the same clock speed
That's really only the best-case scenario, like fixed-point/scaled-integer addition. A 6502 doing a 32-bit multiply takes thousands of clocks to do what a 32-bit processor with an onboard hardware multiplier may do in just a couple of clocks. I would like to see a true 32-bit, simple 6502-type processor with just a few additions like the hardware multiplier. This processor would have a 32-bit data bus as well as address bus, and 32-bit registers. In essence, the "byte" becomes 32 bits, and all of the four-gigaword memory space is zero page.
When I was writing my 65816 Forth, I found that because of the 16-bit registers, even though the data bus was still only 8 bits, the 65816 required a fraction as many instructions (far less than half) to do many of the Forth words compared to the 6502. It runs Forth 2-3 times as fast as the 6502 does at the same clock speed, in spite of the data bus not being any wider. If it had a 16-bit data bus, the speed difference would increase.
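The gap between a software multiply on an 8-bit CPU and a hardware multiplier can be put in rough numbers. This is only a cycle-count sketch; the per-step cost is an illustrative assumption, not a measured 6502 timing:

```python
# Rough cycle model for a 32-bit software multiply on an 8-bit CPU,
# versus a single-cycle-class hardware multiplier.  The per-step cost
# is an illustrative assumption, not a measured 6502 figure.
def shift_add_multiply_cycles(bits=32, cycles_per_step=60):
    # One shift/test/conditional-add pass per multiplier bit; each pass
    # touches four bytes of multiplicand, product, and carry handling.
    return bits * cycles_per_step

software = shift_add_multiply_cycles()   # ~1920 cycles: "thousands of clocks"
hardware = 2                             # "a couple of clocks" with a multiplier
print(software, hardware, software // hardware)
```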
Last edited by GARTHWILSON on Thu Feb 12, 2009 12:06 am, edited 1 time in total.
http://WilsonMinesCo.com/ lots of 6502 resources
The "second front page" is http://wilsonminesco.com/links.html .
What's an additional VIA among friends, anyhow?
GARTHWILSON wrote:
A friend I rarely see who is a programmer at JPL was advocating more parallelism instead of trying to squeeze so much out of individual processors, and with the resulting penalties.
Quote:
That's really only the best-case scenario, like fixed-point/scaled-integer addition.
Provide a 6502 with a coprocessor that has a multiplier unit with the same number of transistors as that used in the x86 architecture, and then your puny 8-bit 6502 will be every bit as fast as a Pentium-IV at integer multiplication, at the same clock speed of course. Provide that coprocessor with the ability to DMA fetch its data from RAM, and you can now multiply entire vectors, of arbitrary length, at speeds equal to or in excess of current architectures. This concept was the basis behind the Amiga's original blitter design, and is the reason why we even have a GPU market today. Only configuration latency and raw memory bandwidth limits your throughput.
Likewise, strip the Pentium-IV of its caches, its 32-bit bus, its speculative instruction execution, its branch prediction, its register renaming, its pipelines, and you'll end up with a chip that is actually SLOWER than the 6502.
(This is borne out in a 65816 vs 80286 benchmark experiment that WDC used to have up on their site. Not sure if they still do.)
- GARTHWILSON
- Forum Moderator
kc5tja, or anyone, what kinds of jobs would the parallel processing be good for? I think I have a feel for it, but would like to hear it defined better if possible. Would our 6502.org members benefit from developing some kind of very small (a few square inches, max), standard computer board that could serve as a building block for greater parallelism, even though we're not putting multiple processors on a single IC (assuming no one here has the resources to carry out swatkins' original suggestion)? The idea would be that you could keep expanding the computing power by plugging in more and more of these modules, whether into a backplane or into each other. How might these modules efficiently communicate with each other?
GARTHWILSON wrote:
kc5tja, or anyone, what kinds of jobs would the parallel processing be good for?
I'm going to assume two kinds of interconnects: shared bus model (which I think the cell processor uses), and a flat matrix of self-contained cores.
The shared bus model is ideal for relatively coarse-grained parallel decomposition of tasks. For example, you can assign one CPU the job of computing the next frame's graphics, another CPU for computing the next chunk of audio to play back, and a third CPU for handling user I/O and for controlling the inputs to all the other CPUs. OR, as in the Commodore floppy disk drives prior to the 1541, you can have one CPU handle GPIB bus I/O, and another CPU handling the commands transferred over the bus. Kind of like what I was suggesting to Daryl for his SPI project.
The limiting factor for parallelism is, obviously, the bus bottleneck. Modern SMP systems tend to allow about 16 microprocessors to interconnect to a common bus before they start fighting each other for bandwidth. I'll say give or take a few CPUs, because it obviously depends on how they use cache and so forth. But 16 is average.
For the flat matrix of self-contained computers (e.g., Intellasys 40C18 chip), the applications are numerous. Each individual core can be programmed to perform some function somewhat analogous to how you'd program an FPGA CLB. As a result, you can partition one portion of the chip to serve as some dedicated processing logic, and another portion for another task altogether. They needn't inter-communicate, although they can if you want.
Another application of the matrix topology is systolic processing. A CPU 'pipeline' is an example of a systolic processor, albeit a significantly over-simplified one. Each stage of the pipeline performs its sole task and passes its results on to the next stage when complete. If you have N pipeline stages, clocked at M cycles per second, then assuming you keep all the stages full, you can achieve NxM operations per second.
The 40C18 cores have interconnects in the north/south as well as east/west directions, so more complex forms of systolic processing are possible too. In this case, N = core width * core height, so as you can imagine, you can get some significant processing power from such an arrangement.
Real-world examples of systolic processing include surface acoustic simulations (e.g., if you want to engineer a new kind of SAW filter), audio synthesis (e.g., waveguide synthesis), and image/video applications (wavelet (de)compression, edge detection, motion prediction, etc).
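The N stages at M cycles per second arithmetic can be demonstrated with a minimal software model of a systolic pipeline; the three stage functions here are arbitrary examples:

```python
# Minimal systolic pipeline: N stages, each doing one step per tick.
# Once the pipeline fills, it completes one result per tick, so a
# pipeline clocked at M ticks/s sustains N x M stage-operations/s.
def run_pipeline(stages, inputs):
    """Push inputs through `stages` (one function per stage), one tick
    at a time, and return the finished results in order."""
    pipe = [None] * len(stages)
    results = []
    feed = list(inputs) + [None] * len(stages)   # extra ticks to drain
    for x in feed:
        out = pipe[-1]                           # value leaving the pipe
        # shift every value one stage forward, applying each stage
        for i in range(len(stages) - 1, 0, -1):
            pipe[i] = stages[i](pipe[i - 1]) if pipe[i - 1] is not None else None
        pipe[0] = stages[0](x) if x is not None else None
        if out is not None:
            results.append(out)
    return results

stages = [lambda v: v + 1, lambda v: v * 2, lambda v: v - 3]
print(run_pipeline(stages, [1, 2, 3]))   # each input: ((v + 1) * 2) - 3
```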
In terms of home hobby projects, multiple cores can help out with interrupt handling as well. Just as your brain relies on its spinal cord to handle emergency interrupt communications ("OHMYGODHOLY****THISISHOT!!" messages from your temperature-sensing neurons get translated to "Well, duhh, release the soldering iron immediately!" long before your brain is even aware of the original message), you can program cores to serve as I/O coprocessors on behalf of interior computational processors.
For example, this is how I implemented my VGA "clock" display on the Intellasys 24-core prototype chip (the S24). Core #12 served as the CRTC, generating HSYNC and VSYNC pulses on its I/O pins. It also communicates which pulse occurs by sending a message to core #13 via the right-hand IPC channel. As soon as the message is sent, core #12 resumes its timebase tasks. This allows core 13 to do whatever it needs to do asynchronously of core 12, since it is of utmost importance that core 12 remain isochronous.
Core 13, a computational node, simply counts the lines drawn so far (resetting its counter upon hearing about a VSYNC event), and issues triggers to core 19 (via its upward IPC channel), telling core 19 which thing to draw at the current time. After doing so, it goes to sleep, waiting for core 12 to wake it up again with a new notification.
Core 19, then, is responsible for coordinating cores 20, 21, 22, 14, 15, and 16 for generating the hours, minutes, and seconds (hours: 20/14, minutes: 21/15, and seconds: 22/16) graphics representations. When the graphics have been assembled, they pass it back to core 19, where it is then forwarded to core 18, another I/O core, equipped with a DAC. The DAC then drives the green pin on the VGA connector.
Note: Core 13 determines what to draw based on current line, while core 19 handles what to draw based on horizontal offset from the left edge.
Like core 12, core 18 must run isochronously, which means once fed data, it must take a constant amount of time to serialize the data it receives. However, core 19 can run faster than core 18, because 19 will "block" if 18 isn't ready for more data. This is a very, very convenient feature of the Intellasys architecture.
Here's the "floorplan" for the chip:
Code: Select all
+---------+ +---------+ +---------+ +---------+ +---------+
G <---| Core 18 |<---| Core 19 |<-->| Core 20 |<-->| Core 21 |<-->| Core 22 |
| | | | | | | | | |
+---------+ +---------+ +---------+ +---------+ +---------+
^ ^ ^ ^
| | | |
| V V V
+---------+ +---------+ +---------+ +---------+ +---------+
_HS <---| Core 12 |--->| Core 13 | | Core 14 | | Core 15 | | Core 16 |
_VS <---| | | | | | | | | |
+---------+ +---------+ +---------+ +---------+ +---------+
Core Chart:
12: CRTC
13: Vertical graphics gate
14, 15, 16: Moving indicators for hours, minutes, seconds
18: Dot shifter/DAC interface
19: Graphics job scheduler
20, 21, 22: Fixed containers for hours, minutes, seconds
Quote:
Would our 6502.org members benefit from developing some kind of very small (a few square inches, max), standard computer board that could serve as a building block for greater parallelism, even though we're not putting multiple processors on a single IC (assuming no one here has the resources to carry out swatkins' original suggestion)?
Quote:
The idea would be that you could keep expanding the computing power by plugging in more and more of these modules, whether into a backplane or into each other. How might these modules efficiently communicate with each other?
I like the flat matrix layout because it's simple. However, you run into problems when node A needs to talk to node C -- it has to go through B first. Thus, accurate timing of A's message to C must necessarily take B into account. Still, it's amazingly versatile, as my example above proves.
To make this work, you'll need to violate Intellasys patents, though.
* First, the interprocessor communications registers work on a synchronous message-passing model. If A wants to send a byte to B, and B isn't already reading from the IPC port, then A will block until B reads. Likewise, if B reads from the port, and A hasn't written anything to it yet, B will block until A writes. In this way, both nodes rendezvous in time perfectly, and both carry on their merry way after synchronization (and data transfer) has occurred.
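That rendezvous behaviour can be modelled in ordinary software. A minimal sketch using threads, which mimics the blocking semantics described above, not the actual Intellasys hardware protocol:

```python
import threading

class Channel:
    """Unbuffered, CSP-style channel: send() and recv() rendezvous.
    A sketch of the blocking IPC-port semantics, not real hardware."""
    def __init__(self):
        self._cond = threading.Condition()
        self._item = None
        self._full = False
        self._taken = False

    def send(self, value):
        with self._cond:
            while self._full:               # wait until any prior item is taken
                self._cond.wait()
            self._item, self._full, self._taken = value, True, False
            self._cond.notify_all()
            while not self._taken:          # block until a receiver arrives
                self._cond.wait()

    def recv(self):
        with self._cond:
            while not self._full:           # block until a sender arrives
                self._cond.wait()
            value = self._item
            self._full, self._taken = False, True
            self._cond.notify_all()
            return value

# Two "cores" rendezvous over the channel.
ch = Channel()
t = threading.Thread(target=lambda: [ch.send(v) for v in (1, 2, 3)])
t.start()
received = [ch.recv() for _ in range(3)]
t.join()
print(received)
```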
* Second, and this is patented, you'll want to give about 256 bytes of address space or so to each IPC register. Why? Because then you can let the CPU JMP or JSR into "IPC" space, where the CPU will block waiting for instructions to execute. This doesn't sound terribly useful, but it is actually of immense value. It's how you bootstrap the thing, first of all, and second of all, it allows you to write software in a purely event-driven manner. For example:
Code: Select all
.proc hsyncDetected
inx ; increment our line counter
cpx #GFX_BEGIN
bne L1
lda #$20 ; JSR opcode
sta IPC_UP
lda #<nodeX_subroutineY
sta IPC_UP
lda #>nodeX_subroutineY
sta IPC_UP
rts ; return to our own IPC space!
L1:
cpx #GFX_END
bne L2
lda #$20
sta IPC_UP
lda #<nodeX_subroutineZ
sta IPC_UP
lda #>nodeX_subroutineZ
sta IPC_UP
L2:
rts
.endproc
.proc vsyncDetected
ldx #$00
rts
.endproc
NOTE: The Intellasys chips don't have huge address spaces. They get around this by having hardware logic which prevents the PC from auto-incrementing when executing instructions from the IPC ports. Unfortunately, that kind of logic doesn't exist for the 65xx architecture, which is why I prescribe a 256-byte region per IPC port.
* Third, I've found it essential to be able to write to multiple ports at once if necessary on the Intellasys chips. I'd expect the same would happen on the 6502 implementation too. Therefore, you're likely going to want to use 4K of IPC space (with four IPC ports [up, down, left, right], 2^4 = 16 distinct 256-byte spaces, which totals 4K). This way, a single CPU write can notify three other nodes concurrently.
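The proposed 4K decode can be sketched as follows; the base address and the bit-to-port assignment are hypothetical, chosen only for illustration:

```python
# Sketch of the 4 KB IPC decode proposed above: four address bits
# select any subset of the four ports, giving 2^4 = 16 distinct
# 256-byte spaces, so one write can notify up to three neighbours
# at once.  Base address and bit assignment are assumptions.
PORTS = ("up", "down", "left", "right")
IPC_BASE = 0xB000            # assumed base of the 4 KB IPC window

def ports_hit(addr):
    """Return which IPC ports a write to `addr` reaches."""
    select = (addr - IPC_BASE) >> 8   # one 256-byte space per bit pattern
    return [p for i, p in enumerate(PORTS) if select & (1 << i)]

print(ports_hit(IPC_BASE + 0x100))   # bit 0 only: just the "up" port
print(ports_hit(IPC_BASE + 0x700))   # bits 0-2: three ports at once
```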
Finally, amazingly, Forth turns out to be quite a pleasant language for working in such a parallel environment. If this doesn't float your boat, you might want to research Occam, a programming language designed for the Transputer architecture (which works almost identically to the Intellasys architecture).
GARTHWILSON wrote:
kc5tja, or anyone, what kinds of jobs would the parallel processing be good for?
The nice thing is you don't need to worry about the number of physical processors, deadlocks, priority inversion, etc. because it's handled automagically. You write code, and it just works.
Toshi
- GARTHWILSON
- Forum Moderator
Toshi, can you elaborate?
And Samuel, I started a response to your 2nd-to-last post, but I need to find it and finish it.
GARTHWILSON wrote:
Toshi, can you elaborate?
And Samuel, I started a response to your 2nd-to-last post, but I need to find it and finish it.
The original system (developed before I originally came here) was written as an interpreter in C++ using the STL, and it was single-threaded and pretty slow.
So, I decided to rewrite it, and looked at all available languages, and decided none of them were suitable. Erlang and CHARM++ came the closest to being usable, but they weren't quite suitable. So I wound up developing a custom programming language that took the useful characteristics from both, and added some specific features required for our application.
Toshi
- GARTHWILSON
- Forum Moderator
I said I'd respond to this later, and now I've had it mostly idle and taking up a couple of tabs for the last week and a half, so it's high time I wrapped it up and posted it.
Quote:
The shared bus model is ideal for relatively coarse-grained parallel decomposition of tasks. For example, you can assign one CPU the job of...
So it sounds like you're mostly talking about jobs that will always be there and are basically tied to the hardware, IOW, having to do with I/O instead of data processing. Is that right?
Quote:
my guess is that the value lies mostly in the educational value of learning to decompose problems in terms of a fixed number of concurrently running, fully isolated processes. I wouldn't expect anything useful to come from this though. I could be wrong, of course.
That's what I have a hard time imagining too: how to decompose and distribute it, when most of the work I do with my workbench computer is made up of serial tasks, i.e., each little piece depends on the outcome of the previous one. Maybe it's just that I'm too conditioned to think that way. Maybe more could be done in parallel; but with dozens, or even thousands of processors??! OTOH, some of what I do with interrupts could be done this way, and if each interrupt source could have its own processor at its disposal, it could get its jobs done without interfering with other jobs. Actually, you did mention that further up. Now, if I could just find some memory that was more than dual-port, like maybe quad-port or...?...
Quote:
Well, I strongly, strongly, strongly, strongly encourage NOT going with a standard backplane. That is a shared-memory, symmetric multiprocessing model, which due to the 6502's bus timings, limits you to two CPUs before you start to run into bus contention.
Perhaps "backplane" was not the best word to use, but I was referring to some kind of mother board that would connect the modules to each other appropriately, and not necessarily a bus where the same lines run to all the modules.
Quote:
To make this work, you'll need to violate Intellasys patents, though. Smile
Would it matter, as long as we're not making money at it?
Quote:
then A will block until B reads.
Does that mean approximately the same thing as stopping its clock and waiting, so it's not even aware of idle cycles passing? This appears to partially remedy what I see as some loss of control in timing in real-time tasks.
Quote:
NOTE The Intellasys chips don't have huge address spaces. It gets around this by having hardware logic which prevents PC from auto-incrementing when executing instructions from the IPC ports. Unfortunately, that kind of logic doesn't exist for the 65xx architecture, which is why I prescribe a 256 byte region per IPC port
which goes along with the blocking, if I understand you correctly.
Quote:
* Third, I've found it essential to be able to write to multiple ports at once if necessary on the Intellasys chips. I'd expect the same would happen on the 6502 implementation too. Therefore, you're likely going to want to use 4K of IPC space (with four IPC ports [up, down, left, right], 2^4 = 16 distinct 256-byte spaces, which totals 4K). This way, a single CPU write can notify three other nodes concurrently
which basically is what I have with my address-decoding scheme. It allows setting up two or three VIAs at the same time, even starting their counters synchronously, although I have never used that capability.
GARTHWILSON wrote:
(re: shared buses) . . . So it sounds like you're mostly talking about jobs that will always be there and are basically tied to the hardware, IOW, having to do with I/O instead of data processing. Is that right?
It's also how most every contemporary supercomputer is built nowadays, where you often have 8-CPU or 16-CPU motherboards (shared-bus architecture) arranged on a network of some kind. Programs on a single motherboard can exploit shared memory to their advantage when chugging out results for sophisticated math.
Quote:
That's what I have a hard time imagining too, how to decompose and distribute it, when most of the work I do with my workbench computer made up of serial tasks, ie, that each little piece depends on the outcome of the previous one.
Quote:
Maybe it's just that I'm too conditioned to think that way. Maybe more could be done in parallel; but with dozens, or even thousands of processors??!
Quote:
OTOH, some of what I do with interrupts could be done this way, and if each interrupt source could have its own processor at its disposal, it could get its jobs done without interfering with other jobs.
Quote:
Actually, you did mention that further up. Now, if I could just find some memory that was more than dual-port, like maybe quad-port or...?...
Quote:
Perhaps "backplane" was not the best word to use, but I was referring to some kind of mother board that would connect the modules to each other appropriately, and not necessarily a bus where the same lines run to all the modules.
Quote:
Would it matter, as long as we're not making money at it?
Quote:
Does that mean approximately the same thing as stopping its clock and waiting, so it's not even aware of idle cycles passing?
Quote:
This appears to partially remedy what I see as some loss of control in timing in real-time tasks.
Each chip in a circuit will run independently of each other. They might be synchronized via a single clock, but each chip nonetheless performs completely asynchronously with respect to each other. And, yet, this rarely becomes a problem in practice, thanks to bus protocols.
Well, the same kind of logic applies inside a multi-core microcontroller. For example, in my VGA demonstrations with the now-defunct 24-core chip, I used the following "floorplan":
Core 12: CRTC -- generated HSYNC and VSYNC hardware pulses on its I/O pins. It also sent either 0 (HSYNC line) or -1 (VSYNC line) to core 13 a few nanoseconds later.
Core 13: Used the 0s or -1s to maintain a raster line counter. Depending on whether this counter was within a fixed range of values, it would transmit a 0 to core 19. It always transmitted -1 to core 19, because it also needed to know when a VSYNC occurred.
Core 19: Upon detecting a -1, it would reset its own state machine so it knew that its next set of graphics was the first line of graphics to display. Otherwise, it assumed interior graphics. Upon receiving a 0, it would send requests for data to cores 20, 21, 22, and through those, 14, 15, and 16 too. These cores are responsible for computing, in real-time, the graphics to display for the guts of the clock. It'd then pass the data to core 18 for shifting to the DAC, taking care of scheduling the words to display so that you get neat columns on the screen.
Core 18: The green-gun DAC (thus, producing a "green-screen" monochrome display) and pixel "shift register".
Except for cores 18 and 12, all other cores spend most of their time waiting for something to be sent over the IPC channels.
NOTE: The above description describes the original version of the software. Modern versions now use "port-execution" in cores 12, 13, 18, and 19, where cores spoonfeed CALL instructions to other cores as needed. So, e.g., instead of core 13 having an IF/ELSE/THEN construct inside a main event loop, core 12 just CALLs either the HSYNC or VSYNC word in core 13 directly.
Note also that these calls occur asynchronously -- once the CALL opcode is sent to the adjacent core, both cores free-run. When the adjacent core finishes its task, it'll RETURN to the IPC channel, thus blocking for more instructions. Likewise, if core 12 is too fast, it'll attempt to CALL into core 13 again, blocking until core 13 is ready for it.
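The CALL-streaming protocol above can be sketched with blocking FIFO channels. This is a toy Python model, not 6502 or Intellasys code: the core numbers and the hsync/vsync words follow the floorplan above, but the channel depth, sentinel, and everything else are assumptions for illustration only.

```python
import queue
import threading

class Core:
    """Toy model of one core: it blocks on its IPC channel and executes
    whatever 'instruction' (here, a bound-method name) is sent to it."""
    def __init__(self):
        self.channel = queue.Queue(maxsize=1)  # blocking, like an IPC port
        self.line = 0
        self.vsyncs = 0

    def hsync(self):
        self.line += 1            # advance the raster line counter

    def vsync(self):
        self.vsyncs += 1
        self.line = 0             # start of a new frame

    def run(self):
        while True:
            op = self.channel.get()   # block until the neighbour CALLs us
            if op is None:            # shutdown sentinel (model artifact)
                return
            getattr(self, op)()       # "execute from the port"

core13 = Core()
t = threading.Thread(target=core13.run)
t.start()

# Core 12's role: spoon-feed CALLs as the sync pulses occur.
for frame in range(2):
    for _ in range(3):                # 3 visible lines per toy frame
        core13.channel.put("hsync")   # blocks if core 13 is still busy
    core13.channel.put("vsync")

core13.channel.put(None)
t.join()
print(core13.vsyncs, core13.line)     # -> 2 0
```

The `maxsize=1` queue reproduces the flow control described above: if the sender is too fast, its `put` blocks until the receiver is ready, just as a CALL into a busy core would.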
Quote:
NOTE: The Intellasys chips don't have huge address spaces. They get around this by having hardware logic which prevents the PC from auto-incrementing when executing instructions from the IPC ports. Unfortunately, that kind of logic doesn't exist for the 65xx architecture, which is why I prescribe a 256-byte region per IPC port.
I do not see this as relevant to blocking at all. This is Intellasys saying that you have 16 words of memory dedicated to IPC instead of 4096. The only way the chip can be guaranteed not to overrun the IPC space is to disable PC incrementing when fetching instructions from that space.
And, it now occurs to me that 256 bytes of IPC channel space might not be enough; after only a small handful of JSRs passed to the core, you'll again run out of the IPC channel's address space, requiring you to JMP back to its origin again if you want to continue executing instructions sent from another node.
Hence, you'll want to send JMP instructions instead of JSRs, and the jumped-to routine will need to JMP back to the IPC space. This is the only way to prevent PC overflow from occurring.
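The arithmetic behind the JMP-instead-of-JSR rule is easy to check. JSR abs and JMP abs are each 3 bytes on the 6502; the window base address ($0200) below is just an assumed example. This Python sketch only tracks where the auto-incrementing PC ends up in each scheme:

```python
# Toy PC-overflow check for a 256-byte IPC window (assumed at $0200).
# A stream of 3-byte JSRs marches the PC straight out of the window,
# because each RTS returns to the byte after its JSR.

IPC_BASE, IPC_SIZE = 0x0200, 0x100

def jsr_stream(n_calls):
    """PC after executing n_calls JSRs fed through the IPC window."""
    pc = IPC_BASE
    for _ in range(n_calls):
        pc += 3               # RTS returns to pc+3, inching forward
    return pc

def jmp_scheme(n_calls):
    """Each jumped-to routine ends with JMP back to the window origin."""
    pc = IPC_BASE
    for _ in range(n_calls):
        pc += 3               # fetch the 3-byte JMP abs from the window...
        pc = IPC_BASE         # ...and the routine JMPs back to the origin
    return pc

print(jsr_stream(85) < IPC_BASE + IPC_SIZE)   # 85 JSRs (255 bytes) still fit
print(jsr_stream(86) < IPC_BASE + IPC_SIZE)   # the 86th leaves the window
print(jmp_scheme(1000) == IPC_BASE)           # PC never escapes the window
```

So a 256-byte window holds at most 85 consecutive 3-byte calls before the PC runs off the end, whereas the JMP-back scheme keeps the PC in-window indefinitely.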
Quote:
which basically is what I have with my address-decoding scheme. It allows setting up two or three VIAs at the same time, even starting their counters synchronously, although I have never used that capability.
GARTHWILSON wrote:
Would our 6502.org members benefit from developing some kind of very small (a few square inches, max), standard computer board that could serve as a building block for greater parallelism, even though we're not putting multiple processors on a single IC (assuming no one here has the resources to carry out swatkins' original suggestion)? The idea would be that you could keep expanding the computing power by plugging in more and more of these modules, whether into a backplane or into each other. How might these modules efficiently communicate with each other?
IIRC, it already has support for running in an 8-processor configuration (albeit in a non-shared-memory configuration). This would simplify the hardware design. The only problem is that it's not really ideal for a message-passing system, since the cost of sending a message would be higher than on a shared-memory machine.
Toshi
- GARTHWILSON
Quote:
have you looked at the WDC65C134S?
This is kind of hastily written, and still getting away from cell processors, but while we're talking about having multiple processors instead of doing everything with a single processor, I have to ask: what goes on in PCs' video cards, sound cards, and graphics engines to take a big burden off the main processor? Samuel has occasionally touched on how GEOS (software) on the Commodore 64 computed the video for various fonts and sizes, which, although not lightning-fast, was very good for a 1MHz processor. But as I type here on my PC, possibly with audio going on at the same time and occasionally streaming video too, and see the immediate effect of inserting a word which adjusts all the lines below it on the screen with no visible delay regardless of fonts and other factors, I sometimes wonder how much of that work of figuring out what value to give each dot is done by the main processor and how much by the video chipset.

The PC to me is basically an appliance. I know very little about its innards. I don't even know if there's still a separate floating-point co-processor in the same package with the main processor (not that the video calculations use FP very often), or if the main processor has its own FP capabilities.
For the streaming audio and video, I have no doubt that the audio and video chipsets do a lot of buffering, maybe even several seconds' worth, and I suspect these may even access the sets of samples by DMA; but the control of the timing is atrocious, as witnessed by the fact that the audio may lead or lag the video by a tenth of a second or even more, which is totally unacceptable for my kind of work, to put it mildly. This is obviously a common problem as we frequently see it even on TV, from major stations that can afford all the processing power they could want.
So what can the home computer builder apply from all this in order to improve performance without losing control of timing?
http://WilsonMinesCo.com/ lots of 6502 resources
The "second front page" is http://wilsonminesco.com/links.html .
What's an additional VIA among friends, anyhow?