PostPosted: Wed Dec 30, 2020 10:23 am 

Joined: Tue Aug 11, 2020 3:45 am
Posts: 311
Location: A magnetic field
A 6502 only uses the bus during one clock phase and therefore it is relatively trivial to make dual core systems in which two processors run on alternate phases. However, this arrangement is hostile to further processors and provides no additional benefit even if that hostility is overcome.

A common technique to increase processing power is a square or rectangular grid of nodes where adjacent nodes are directly connected. I believe that some of Seymour Cray's designs followed this pattern. It was also a common arrangement for INMOS Transputers and the subsequent XMOS designs. It is also the arrangement of the GreenArrays chips associated with Chuck Moore. Indeed, if Seymour Cray's fastest designs and Chuck Moore's small, cheap, flexible designs have a common element then it is probably worth investigating.

I found that it is possible to make a two-dimensional (or higher) grid of 6502 processors by making adjacent processors run on alternate clock phases. In the trivial case, we have a checkerboard of node phases. Adjacent nodes may communicate via single-port RAM. This RAM is available to both processors because they only access it on alternate memory cycles. Practical arrangements should scale beyond 144 cores. Timings within the grid are quite critical. However, a basic technique, practiced with success by Seymour Cray, is to use wires of the same length for clock distribution.

The basic unit may be tessellated at least 36 times:-

Code:
           ^                                 ^
           |                                 |
           v                                 v
      +---------+      +---------+      +---------+      +---------+
      |  6502   |      |         |      |  6502   |      |         |
<---->|  (Even  |<---->| Buffers |<---->|  (Odd   |<---->| Buffers |<---->
      |  Phase) |      |         |      |  Phase) |      |         |
      +---------+      +---------+      +---------+      +---------+
           ^                                 ^
           |                                 |
           v                                 v
      +---------+                       +---------+
      |         |                       |         |
      | Buffers |                       | Buffers |
      |         |                       |         |
      +---------+                       +---------+
           ^                                 ^
           |                                 |
           v                                 v
      +---------+      +---------+      +---------+      +---------+
      |  6502   |      |         |      |  6502   |      |         |
<---->|  (Odd   |<---->| Buffers |<---->|  (Even  |<---->| Buffers |<---->
      |  Phase) |      |         |      |  Phase) |      |         |
      +---------+      +---------+      +---------+      +---------+
           ^                                 ^
           |                                 |
           v                                 v
      +---------+                       +---------+
      |         |                       |         |
      | Buffers |                       | Buffers |
      |         |                       |         |
      +---------+                       +---------+
           ^                                 ^
           |                                 |
           v                                 v


Nodes along the perimeter are most suitable for high bandwidth I/O. For example, it is possible to have eight or more video displays along one edge and storage along another edge. However, nodes at the corner of the network only have one spare clock phase and therefore may only participate in one high bandwidth I/O function.

Suggested memory map for 6502:-

Code:
$0000-$00FF: Zero page.
$0100-$01FF: Stack.
$0200-$BFFF: 47.5KB of general purpose memory.
$C000-$CFFF: Memory shared with adjacent cores.
$D000-$DFFF: I/O which includes doorbells to invoke adjacent cores.
$E000-$FFFF: Supervisor firmware.


The memory map reveals immediate limitations of the architecture. Most obviously, four bi-directional channels to adjacent nodes require eight sets of buffers. Even if the memory map is adjusted, it may be undesirable to allocate more than half of the memory for this purpose. Therefore, directly mapped buffers will be a maximum of 4KB and are likely to be much smaller; typically 1KB or less. However, if a bi-directional channel minimizes hardware, there may be further overheads. For example, a software FIFO implementation may not distinguish between buffer empty and buffer full. This restricts usable capacity to one byte short of the full span. We also require space for multiple pointers. Therefore, the maximum buffer size may never exceed 4KB minus five bytes (or seven 256-byte uni-directional channels) and is more likely to be annoyingly short of 1KB or three separate pages.
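
As a concrete illustration, here is a minimal sketch of one uni-directional channel: a 256 byte ring in shared RAM with single byte head and tail indices, which therefore carries at most 255 bytes in flight. The addresses are only placeholders chosen from the regions in the memory map above:

Code:
; One uni-directional channel: 255 usable bytes, single producer and
; single consumer. All addresses are placeholders for this sketch.

FIFO    = $C000         ; 256 byte data ring in shared RAM
HEAD    = $C100         ; write index, only ever written by the producer
TAIL    = $C101         ; read index, only ever written by the consumer
BELL    = $D000         ; doorbell: any write interrupts the neighbor

; Send the byte in A to the neighbor. Spins while the ring is full.
send:   pha
send1:  ldx HEAD
        inx             ; candidate new head; the 8 bit index wraps itself
        cpx TAIL
        beq send1       ; advancing would look empty, so the ring is full
        ldy HEAD
        pla
        sta FIFO,y      ; place the byte at the current head
        stx HEAD        ; publish the new head last
        sta BELL        ; ring the neighbor's doorbell (value ignored)
        rts

; Receive a byte from the neighbor into A. Spins while the ring is empty.
recv:   ldx TAIL
        cpx HEAD
        beq recv        ; head == tail means empty (hence 255, not 256)
        lda FIFO,x
        inx
        stx TAIL        ; publish the new tail
        rts


Because each index is written by exactly one core, no semaphore is required; the alternate clock phases already serialize access to the shared RAM.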

Additional capacity can be obtained by page banking buffers. This is especially desirable if multiplexed RAM is supplied in large quantities. However, this arrangement may reduce communication speed. With the 65816 or 65265, it is possible to have 64KB or more per uni-directional channel. It is also possible to block copy to and from buffers. Or, indeed, to block copy directly from buffer to buffer so that data can be passed through the grid; a sketch of this follows the memory map below.

Suggested memory map for 65816 and 65265:-

Code:
$000000-$0000FF: Direct page.
$000100-$0001FF: Stack.
$00D000-$00DEFF: I/O which includes doorbells to invoke adjacent cores.
$00DF00-$00DFFF: W65C265 microcontroller internal I/O.
$00E000-$00FFFF: Supervisor firmware.
$010000-$BFFFFF: 12224KB of general purpose memory.
$C00000-$FFFFFF: Memory shared with adjacent cores.
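
Where data is passed through the grid by block copy, a sketch using this map might look like the following. The bank numbers are placeholders within the shared region (here bank $C0 receives from one neighbor and bank $E0 feeds another), and the MVN operand order shown follows the usual source-then-destination assembler convention:

Code:
; Block copy 4KB between two shared banks (placeholder bank numbers).
; MVN takes the byte count minus one in C, the source address in X and
; the destination address in Y; afterwards the data bank register is
; left pointing at the destination bank.

relay:  rep #$30        ; 16 bit accumulator and index registers
                        ; (tell the assembler A/X/Y are 16 bits wide here)
        lda #$0FFF      ; 4096 bytes, expressed as count minus one
        ldx #$0000      ; source offset within bank $C0
        ldy #$0000      ; destination offset within bank $E0
        mvn #$C0,#$E0   ; copy from bank $C0 to bank $E0
        sep #$30        ; restore 8 bit registers
        rts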


If boards are constructed with 4, 6, 9, 12, 16 or more cores, it is strongly recommended that they are all connected with ribbon cable where alternate wires are grounded. This creates favorable physics where signals are largely shielded from each other. A further consideration is that a rectangular grid of boards can be folded at the seams. In this case, ribbon cables may not be the same length but effort should be made to keep length to a minimum. In trivial cases, ribbon cables may be less than 2 inches (5cm) while providing easy access to high bandwidth I/O nodes on the perimeter of the grid. This requires minor consideration for hex bolt placement and power distribution.

MTBF of large arrays is a concern. Thankfully, BigDumbDinosaur has shown empirically that 65816 + Static RAM + SCSI Disk (and suitably long hex bolts) has uptime which exceeds 300 days. From this, it is reasonable to extrapolate that 144 cores and eight disks have uptime which exceeds two days. It is also reasonable to assume that the typical system will be much smaller and therefore much more reliable. Regardless, the sheer cross-sectional area of the silicon is an invitation for bit errors and these may cascade catastrophically through a grid computer. Without exception, validation and checksums are strongly encouraged. In particular, checksum size should be dimensioned according to the size, speed and longevity of communication. Specifically, a 16-bit CRC is very likely to be inadequate.
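
To give a sense of cost, a table-free CRC-32 (the reflected $EDB88320 polynomial) needs only four bytes of state and a short loop per byte. A sketch follows, with the 32 bit remainder kept in four arbitrarily chosen zero page locations, low byte first; initialize all four to $FF and complement the result at the end:

Code:
; Fold the byte in A into a running CRC-32 held in CRC0..CRC3.

CRC0    = $F8           ; least significant byte of the remainder
CRC1    = $F9
CRC2    = $FA
CRC3    = $FB           ; most significant byte of the remainder

crc32:  eor CRC0
        sta CRC0
        ldx #8
crcbit: lsr CRC3        ; shift the 32 bit remainder right by one
        ror CRC2
        ror CRC1
        ror CRC0
        bcc crcnxt      ; bit shifted out was 0: no polynomial XOR
        lda CRC3
        eor #$ED        ; apply $EDB88320, one byte at a time
        sta CRC3
        lda CRC2
        eor #$B8
        sta CRC2
        lda CRC1
        eor #$83
        sta CRC1
        lda CRC0
        eor #$20
        sta CRC0
crcnxt: dex
        bne crcbit
        rts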

The primary purpose of the supervisor firmware is to maintain the integrity of the system. This includes power-on testing, addressing of identical nodes and bad node/link detection. High integrity, portable communication is a secondary task within this framework. An 8KB ROM should be suitable for this task, given that similar functionality was achieved on Connection Machine or Transputer nodes with less total memory.

Anyhow, it is possible to start with a dual core or quad core system and, conceptually, scale out to a ludicrous extent.

Further References

In 1987, Clive Sinclair filed a patent regarding wafer-scale integration where partial or full wafers could be packaged. Connectivity of the good nodes could be determined at boot. A one hour lecture by Danny Hillis explains the progression of Connection Machine development and makes passing mention of the Z80 being unsuitable for clustering before covering 1-bit computing, striped RAID, self-test and virtualization. Limitations of this architecture have only recently been overcome with CUDA.

_________________
Modules | Processors | Boards | Boxes | Beep, Beep! I'm a sheep!


PostPosted: Wed Dec 30, 2020 12:22 pm 

Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10986
Location: England
(That diagram looks a little familiar!)


PostPosted: Wed Dec 30, 2020 1:47 pm 

Joined: Fri Dec 21, 2018 1:05 am
Posts: 1120
Location: Albuquerque NM USA
Interesting thought. Wearing the Z80 hat and motivated by the inexpensive 4K 71342 dual-port RAM with hardware semaphores ($2), I've designed a multiprocessor slice based on a Z80, dual-port RAM, and a CPLD. https://www.retrobrewcomputers.org/doku ... asmo:dpram The dual-port RAM serves as message buffers and also holds the bootstrap program: the Z80 is held in reset until the dual-port RAM is loaded with a program, then reset is released with a software command. Other than a few basic memory and I/O mapping functions, the CPLD is mostly free to do other functions yet unspecified.

This basic idea applies to the 6502 and can be modified for the grid computing concept. Each processor slice now has four 4K dual-port RAMs. One particular processor slice has mass storage and can boot itself, then load a program into its neighbor's dual-port RAM and command it to boot, which, in turn, boots up the next neighbor down the grid.
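
Something like this, with made-up addresses for the sake of a sketch (say the neighbor's dual-port RAM appears at $C000-$CFFF and an imaginary CPLD control register at $D000 releases its reset):

Code:
; Copy a 4KB boot image into the neighbor's dual port RAM, then
; release its reset. The addresses and the control register are
; invented for this sketch.

NBRRAM  = $C000         ; window onto the neighbor's dual port RAM
NBRCTL  = $D000         ; imaginary CPLD register: write 1 = run
SRC     = $FC           ; zero page pointer to the boot image
DST     = $FE           ; zero page pointer into the neighbor's RAM

boot:   lda #<image
        sta SRC
        lda #>image
        sta SRC+1
        lda #<NBRRAM
        sta DST
        lda #>NBRRAM
        sta DST+1
        ldx #16         ; 16 pages of 256 bytes = 4KB
        ldy #0
copy:   lda (SRC),y     ; copy one byte
        sta (DST),y
        iny
        bne copy        ; finish the current page...
        inc SRC+1       ; ...then step both pointers to the next page
        inc DST+1
        dex
        bne copy
        lda #1
        sta NBRCTL      ; let the neighbor out of reset
        rts

image:                  ; 4KB boot image, e.g. just loaded from disk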

A processor slice can be built for $20, so what kind of problems can this grid computer solve?


PostPosted: Wed Dec 30, 2020 4:15 pm 

Joined: Wed Feb 14, 2018 2:33 pm
Posts: 1488
Location: Scotland
plasmo wrote:
A processor slice can be built for $20, so what kind of problems can this grid computer solve?


None that I can't solve with a single $5 Raspberry Pi Zero.

That's the reality today, however, building something purely for academic purposes is still worthwhile and probably great fun.

When the Pi came out people did build "Beowulf clusters" out of them - mostly just for fun though - I helped with a trivial demo, I think we had 6 or 7 Pi's computing Pi. The same program ran faster on the Laptop we were hosting it all from, but it was still a nice proof of concept.

I was involved with parallel processing for a good number of years from the late '80s. The Inmos Transputer to start with - and it's well worth having a look at that - a 32-bit CPU with 4K of on-board RAM, hardware floating point (T800) and 4 "link" engines which can transfer byte-wide data with low latency and high speed (for the time).

Making a "link engine" for the 6502 is something I've looked at and seen others looking at in the past... Take a CPLD, give it a 6502 bus interface with on-board DMA engine with a high speed serial output, read Hoare's Communicating sequential processes book (and subsequent updates/research by others) and off you go.

Writing applications for such a beast - that's the crux. There is no magic bullet, but for some problems, breaking them down into simple chunks isn't too hard - the "SIMD" problems - Single Instruction, Multiple Data - so each node/compute element runs the same code, but is fed different data. One trivial example is Mandelbrot - you have the same code on each node and a head end node feeding each node with a coordinate pair and fetching back a result (or even having the node push the result to a downstream display node) - the trick then is to keep each node busy - with the ultimate setup being one node per pixel. Raytracing can be similarly done. Other things need other thoughts and it gets exponentially more complex from then on...

Then there's this "grid" thing... Is that actually the most efficient way? Transputers (and their successors) could be wired up in a grid, but often the best way was simply a linear pipeline ... (sometimes from a physical point of view too - if you ever had to manually wire up 2 cabinets of 200 transputers, you want it to be foolproof, although the company I worked for in the Transputer era developed their own electronic link switch chip)

-Gordon

_________________
--
Gordon Henderson.
See my Ruby 6502 and 65816 SBC projects here: https://projects.drogon.net/ruby/


PostPosted: Wed Dec 30, 2020 4:43 pm 

Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10986
Location: England
Conway's Life might be an amusing application - a bit of local computation, and a bit of communication.

Likewise a physics/physical simulation of 2D or 3D phenomena. Supernova!?! Weather?!? Floating point problems not ideal, of course.

Various kinds of rendering might be suitable (a specific kind of 3D physics, I suppose.)

Looks like arrays of evaluation engines have been used in chess machines in the past.

A Bombe simulator! Lots of scope for parallel work packages in decryption.


PostPosted: Wed Dec 30, 2020 5:18 pm 

Joined: Fri Dec 21, 2018 1:05 am
Posts: 1120
Location: Albuquerque NM USA
Conway's Life was the reason for my little Z80 slave processor. The neighborhood operations need a bit of help even for a small 128x64 OLED display.

There was an eBay auction for 30 or so T805s about a year ago. I thought it would be interesting to build up an array of transputers again, but I quit bidding when it went past $400. I believe it went for quite a bit more than $400. Oh well, I didn't want to build an array of processors that badly and I probably don't have the software anymore anyway.


PostPosted: Wed Dec 30, 2020 8:17 pm 

Joined: Fri Aug 30, 2002 1:09 am
Posts: 8546
Location: Southern California
Sheep64 wrote:
6502 systems typically run on alternate clock phases and therefore it is relatively trivial to make dual core systems. However, this arrangement is hostile to further processors and provides no additional benefit if it is overcome.

A common technique to increase processing power is square or rectangular grid of nodes where adjacent nodes are directly connected.

Be sure to also look at the topic "idea: a cell-like chip based on many 6502 cores."

_________________
http://WilsonMinesCo.com/ lots of 6502 resources
The "second front page" is http://wilsonminesco.com/links.html .
What's an additional VIA among friends, anyhow?


PostPosted: Fri Jan 01, 2021 12:57 am 

Joined: Tue Jul 17, 2018 9:58 am
Posts: 107
Location: Long Island, NY
plasmo wrote:
inexpensive 4K 71342 dual port RAM with hardware semaphores ($2)


Woah, where are these $2? and why are they $18 on Mouser? :shock:


PostPosted: Fri Jan 01, 2021 2:13 am 

Joined: Fri Dec 21, 2018 1:05 am
Posts: 1120
Location: Albuquerque NM USA
Agumander wrote:

Woah, where are these $2? and why are they $18 on Mouser? :shock:


utsource.net. Look up IDT71342LA35J. It is $2.10 each in quantity of 20. I used it in several designs so I've purchased 2 lots of 20 each. They all worked well.


PostPosted: Fri Jan 01, 2021 6:12 pm 

Joined: Mon Sep 14, 2015 8:50 pm
Posts: 112
Location: Virginia USA
The PIB of the 65c265 is suggested in its datasheet to be used in a star network among other possibilities:

pg. 39:

2.21 Parallel Interface Bus (PIB)
2.21.1 The Parallel Interface Bus (PIB) pins are used to communicate between processors in a
"star" network configuration or as a co-processor on a "host" processor bus such as an IBM PC or
compatible or an Apple II or Mac II personal computer. This PIB may also be used as part of the file
server system for large memory systems

I imagine the '265 would be the center of a network of 2 or more CPUs (i.e. 65816s) and communication would be asynchronous, such that the 65816s may operate at 7+ MHz and the 65265 at 3.6 or 7.2 MHz, allowing for 2 or more UARTs.

Mapping of the PIB into the 65816s could be memory mapped to zero page or handled by some other method. Transfers are signaled via the IRQ pin. There are other edge interrupts that share pins on ports 5 and 6 but they would need to be de-multiplexed.

There's no reason (other than tradition) not to map direct page on the '265 to $D000, allowing for somewhat faster I/O, and likewise to move the machine stack pointer to below $D000.

All of this is conjecture though.

Happy New Year all,
Andy


PostPosted: Fri Jan 01, 2021 8:48 pm 

Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10986
Location: England
(Over in the world of Z80, Chris Fenton has recently built ZedRipper, a 16-processor setup using a kind of token ring. See here and here.)


PostPosted: Sat Jan 02, 2021 4:49 am 

Joined: Fri Aug 30, 2002 1:09 am
Posts: 8546
Location: Southern California
GARTHWILSON wrote:
Be sure to also look at the topic "idea: a cell-like chip based on many 6502 cores."
Oh, and I forgot this one: "a multi-core 65c02 system upgraded for Forth"
and this one: "Parallel Processing with 6502s"

I have a lot of 2MHz Rockwell 65c02's and 65c22's and 8Kx8 EPROMs and 2Kx8 and maybe 8Kx8 RAMs too, plus simple 74HC logic, left over from production years ago. One thing I have thought would be fun to do with them is make a lot of small boards that could be connected in a multiprocessor grid.

_________________
http://WilsonMinesCo.com/ lots of 6502 resources
The "second front page" is http://wilsonminesco.com/links.html .
What's an additional VIA among friends, anyhow?


PostPosted: Mon Jan 04, 2021 11:55 am 

Joined: Tue Aug 11, 2020 3:45 am
Posts: 311
Location: A magnetic field
BigEd wrote:
(That diagram looks a little familiar!)


That's uncanny!

The short answer is that I derived this independently.

I thought that I had read the forum archive in full using text-to-speech and encouraged others to do likewise in an effort to reduce duplicate discussion and de-lurk valuable contributors. This would explain my familiarity with the topic and yet my unfamiliarity with the diagram. Indeed, my initial reaction was to determine when I had subconsciously absorbed the information before regurgitating it. This is a minor problem when inhaling a corpus of text. At the very least, I wished to save you from any more of my "original" ideas.

It took me a while to go through my bulk download logs. I was aware of the title of the discussion by Mon 26 Oct 2020. However, the URL has been stuck in my fetch queue since Mon 30 Nov 2020 and your logs will indicate that the idiot scraping the forum with wget on Mondays missed a few discussions, mostly related to FPGA and EhBASIC.

Anyhow, after reading The Transputer Handbook, David May's personal website, outline designs of various supercomputers, mention of a possibly apocryphal dual-core Rockwell 6502 (which doesn't seem to have a part number or datasheet), subsequent reduction to processor stacking and general discussion, the extension to a checkerboard of adjacent nodes was obvious. However, others are many years ahead of me:

nyef on Sun 28 Jul 2013 wrote:
You might not even need dual-port RAM for this. I could see doing this without the FPGA, with a '245 on the data bus between each CPU and RAM pair, something similar for the address lines, and adjacent CPUs operating on opposite sides of the phi2 cycle.


Our diagrams may have been more similar if I had not considered and rejected toroids and hyper-toroids. A hyper-toroid of 6502s is completely impractical because it requires wiring a high-speed parallel bus to six or more adjacent nodes. Good luck with that. If you solve the wiring problem then it only increases pressure on the memory map. Regarding rejection of a continuous checkerboard, David May's recent work incorporates environmental events, such as temperature fluctuation and line level changes, into Communicating Sequential Processes and does so in an autonomous manner. The general principle is that a computer is not a cluster of nodes where I/O is an afterthought. Instead, a computer's useful throughput is limited by interaction with its environment in a manner analogous to Kleiber's law, the metabolic theory of ecology and information theory related to black holes. Surface area is the primary consideration! This should not be a surprise to anyone familiar with heat dissipation and reversible computing.

On a practical basis, it may seem preferable to have nodes more densely connected in a toroid. However, one or more seams in a compute fabric aid high bandwidth I/O, such as video display, networking and storage. In the trivial case, a good ol' rectangular grid with a synchronous clock is best. Although I would really like an 8*8*8 hyper-toroid of 6502s, a 12*12 grid is cheaper and more practical.

_________________
Modules | Processors | Boards | Boxes | Beep, Beep! I'm a sheep!


PostPosted: Mon Feb 22, 2021 1:57 pm 

Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10986
Location: England
Sheep64 wrote:
... mention of a possibly apocryphal dual core Rockwell 6502...


Here's the preliminary datasheet:
R65C00/21 DUAL CMOS MICROCOMPUTER AND R65C29 DUAL CMOS MICROPROCESSOR
(It's possible the part was never made, or never widely sold.)

Edit: mentioned in this thread
Sharing I/O between two CPUs?

