AndrewP, I really like your eight overlapping clock phases which seamlessly switch between two clock speeds. I briefly spent time doodling 4-10 phase schemes but failed to find anything worthwhile. Anyhow, I have a minor optimization which may reduce size and cost. If using 4 bits or less of 74x374 or 74x574, it is possible to use 74x161 in a non-counting configuration. Likewise, if using 4 bits or less of 74x373 or 74x573, it is possible to use 74x157 in a loopback configuration. I learned such devious tricks from Dr Jefyll.
I'm very impressed that you got four cores to share one memory bank. Most people achieve two cores and no further. I got beyond two but the case is restricted to the point that it is probably worthless.
If multi-media is required, for example 48kHz audio and 25.175MHz video, then dual port RAM is highly desirable. Agumander had great success with video blitting and Turing complete audio over dual port RAM. In the general case, cores with dual port RAM may be arranged as an arbitrarily deep tree such that one window of RAM with a fixed address is up-stream and one bank switched window is down-stream to each subordinate core. This arrangement requires the smallest memory allocations for communication and allows the fastest block copying when data is passed through. Furthermore, specialist cores may be compute only or have nothing beyond a DAC shadowed by RAM. A core may be paused, poked or interrupted and therefore no ROM or I/O is required. This allows discrete specialist cores to run above 30MHz while the root node (with all miscellaneous I/O) may be much slower.
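To make the tree arrangement concrete, here is a toy software model of one node. All window addresses and sizes are my own illustrative assumptions (they are not from the post): the up-stream dual port RAM sits at a fixed window and a bank register selects which subordinate core is visible through the down-stream window.

```python
# Toy model of the dual-port-RAM tree. UP_WINDOW, DOWN_WINDOW and
# WINDOW_SIZE are illustrative assumptions, not a real memory map.

UP_WINDOW = 0xC000      # fixed window onto the parent's dual port RAM
DOWN_WINDOW = 0xE000    # bank switched window onto one subordinate core
WINDOW_SIZE = 0x2000

class Node:
    def __init__(self, children=()):
        self.ram = bytearray(WINDOW_SIZE)   # dual port RAM shared with parent
        self.children = list(children)
        self.bank = 0                       # selects which child is mapped

    def read(self, addr):
        if UP_WINDOW <= addr < UP_WINDOW + WINDOW_SIZE:
            return self.ram[addr - UP_WINDOW]
        if DOWN_WINDOW <= addr < DOWN_WINDOW + WINDOW_SIZE:
            return self.children[self.bank].ram[addr - DOWN_WINDOW]
        raise ValueError("outside communication windows")

    def write(self, addr, value):
        if UP_WINDOW <= addr < UP_WINDOW + WINDOW_SIZE:
            self.ram[addr - UP_WINDOW] = value & 0xFF
        elif DOWN_WINDOW <= addr < DOWN_WINDOW + WINDOW_SIZE:
            self.children[self.bank].ram[addr - DOWN_WINDOW] = value & 0xFF
        else:
            raise ValueError("outside communication windows")

# Passing data through a node: read the parent-facing window,
# write straight into the selected child's window.
leaf_a, leaf_b = Node(), Node()
root = Node(children=[leaf_a, leaf_b])
root.ram[0] = 0x42
root.bank = 1                                # point down-stream at leaf_b
root.write(DOWN_WINDOW, root.read(UP_WINDOW))
```

The point of the fixed up-stream window is that every core runs the same communication code regardless of its depth in the tree; only the bank register differs.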
If you have intense tasks and don't care about communication overhead, it is possible to serialize read or write operations over 24 clock cycles or more. This allows 32 or more cores to be wired to one window of memory, optionally with atomic locking. This is an extension of discrete circuitry for double clocked Quad SPI over 2^N+1 clock cycles. It is preferable for each core to have a small FPGA due to the complexity of the logic. However, the FPGA may also perform specialist tasks such as FPU or encryption.
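The serialization scheme amounts to time-division multiplexing of one memory window. A minimal sketch, assuming a fixed slot order (my assumption; the post does not specify the arbitration): each core gets one slot per frame, and because only one core owns the bus during its slot, a write completes atomically without a lock.

```python
# Illustrative time-slot arbiter: many cores share one memory window by
# serializing accesses, one core per slot in a repeating frame.
# Slot order and frame layout are assumptions for illustration.

NUM_CORES = 32

def run_frame(memory, requests):
    """Apply at most one queued request per core, in fixed slot order.

    requests: list indexed by core id; each entry is None,
    ('read', addr) or ('write', addr, value).
    Returns per-core read results.
    """
    results = [None] * NUM_CORES
    for core in range(NUM_CORES):          # one time slot per core
        req = requests[core]
        if req is None:
            continue
        if req[0] == 'read':
            results[core] = memory[req[1]]
        elif req[0] == 'write':
            memory[req[1]] = req[2]        # slot ownership makes this atomic
    return results

memory = bytearray(2048)
requests = [None] * NUM_CORES
requests[3] = ('write', 0x10, 0x55)        # core 3 posts a byte
requests[7] = ('read', 0x10)               # core 7 reads it later in the frame
frame1 = run_frame(memory, requests)
```

In hardware the "slots" would be counted out by the shared clock phases rather than a loop, but the ordering guarantee is the same.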
The case most likely to interest AndrewP is the general, symmetric, high bandwidth option. If you want fast, cheap cores with fast communication, try a chequer-board configuration where adjacent cores run on opposite phases. Unfortunately, this arrangement does not have any unified view of memory. Each pocket of memory is local to one core or shared with one adjacent core. To prevent address glitches which come from sequential writes across two clock phases, two SRAM chips are wired in a cross-over configuration. Any given RAM chip in the cross-over configuration receives writes from one core only. Furthermore, atomic locking can be skipped because each core shadow writes to a read-only window. When only one party can write, the locks are redundant. An optional return path allows full-duplex communication using one region of memory. However, this is not general purpose read/write memory. The semantics are deliberately broken to implement communication between cores. To reduce size and cost, [url=file:///media/trisquel/C9CE-889C/20220326/in/Tech/6502/viewtopic.php?f=4&t=5805]50 cent skinny DIP RAM[/url] may be chip stacked.
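The deliberately broken semantics of the cross-over pair can be shown in a few lines. This is a behavioural sketch only (class and pin names are mine): each SRAM chip is written by exactly one core, and each core's reads come from the chip that the *other* core writes, so the pair behaves as a full-duplex mailbox rather than general purpose RAM.

```python
# Behavioural model of the two-SRAM cross-over configuration.
# Each RAM receives writes from one core only, so no locking is needed.

class CrossOverPair:
    def __init__(self, size=2048):
        self.ram_a = bytearray(size)   # only core A ever writes here
        self.ram_b = bytearray(size)   # only core B ever writes here

    def write(self, core, addr, value):
        # Shadow write: each core writes to its own chip.
        (self.ram_a if core == 'A' else self.ram_b)[addr] = value & 0xFF

    def read(self, core, addr):
        # Each core reads the chip the other core writes,
        # giving full-duplex communication through one region.
        return (self.ram_b if core == 'A' else self.ram_a)[addr]

link = CrossOverPair()
link.write('A', 0, 0xAA)   # A posts a byte to B
link.write('B', 0, 0xBB)   # B posts a byte to A
```

Note that a core can never read back its own writes through this window, which is exactly why it is not general purpose read/write memory.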
Unfortunately, a 2KB window between cores is extremely cramped if it is configured as, for example, seven 255 byte buffers plus pointers. However, four (or more) larger buffers consume too much of a 16 bit address-space. Therefore, large regions of 65816 address-space may be highly desirable for communication. This is a more onerous requirement than a tree of dual port RAM, where the root node has one banked window into shared memory and leaf nodes may have a partially decoded window which fills the address-space.
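A back-of-envelope check of how cramped that window is (the head/tail pointer layout is my assumption; only the seven 255 byte buffers come from the text above):

```python
# Worked numbers for seven 255 byte buffers plus pointers in 2KB.
# One head and one tail byte per buffer is an assumed layout.

WINDOW = 2048
BUFFERS = 7
BUF_SIZE = 255                  # one short of 256 so an 8-bit index can
                                # distinguish a full buffer from an empty one
data = BUFFERS * BUF_SIZE       # 1785 bytes of payload
pointers = BUFFERS * 2          # 14 bytes of head/tail pointers
used = data + pointers          # 1799 bytes committed
spare = WINDOW - used           # 249 bytes left for everything else
```

So the layout fits, but with roughly 250 bytes to spare for locks, status flags or anything larger, which illustrates why bigger windows in 65816 address-space are attractive.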
At present, I feel quite stupid because I tried to tie everything to a central, two phase clock stretching unit. Firstly, this has scaling problems. Secondly, AndrewP shows that it is completely redundant because:
- Local RAM is never clock stretched.
- Shared RAM may be the only shared resource, and any clock stretching may be minimal.
AndrewP on Thu 24 Mar 2022 wrote:
(hopefully) consistency is preserved.
I think you've got the edge case absolutely correct. With 2:1 clock speeds, you only require a nudge of one phase (viewed from the fast clock) or one cycle (viewed from the slow clock) to re-join the slower collective. The tricky part is when to apply the change of speed. However, there is probably some grace if an absolute address is strobed to switch speed because the surrounding instruction and data cycles are (hopefully) unaffected by clock stretching. The exact phase is critical when executing from shared, clock stretched RAM. However, that case is guaranteed to fail if shared RAM is switched off.
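The one-phase nudge can be modelled with a trivial phase counter. This is only a toy (the phase numbering is my assumption): the slow collective ticks on even phases of the fast clock, so a core dropping back to the slow speed needs to stretch by at most one fast phase to land on a slow edge.

```python
# Toy model of re-joining a 2:1 clock. The convention that the slow
# clock edges fall on even fast phases is an illustrative assumption.

def phases_to_rejoin(fast_phase):
    """Fast-clock phases to stretch so the next edge aligns with the
    slow clock: zero if already aligned, otherwise a one-phase nudge."""
    return fast_phase % 2

aligned = phases_to_rejoin(6)   # even phase: no stretch needed
nudged = phases_to_rejoin(7)    # odd phase: one fast phase, which is
                                # one cycle as seen from the slow clock
```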
I didn't try to solve a practical, restricted case. I tried to solve ridiculous, erroneous cases; for example, where one core spends 33 cycles on slow I/O and then another core attempts the same from the same clock stretching hardware. Instead, it may be preferable for each core to have local I/O and local clock stretching. Actually, I'm a little disappointed that some symmetric I/O configurations are impossible. For example, two cores on opposite clock phases won't trivially share one 6522. If the 6522 didn't have a clock input, sequential writes across two phases would glitch; that's no different to RAM. However, reads wouldn't work either. Reads would glitch, but there is another limitation: historically, NMOS 65xx peripherals used the idle phase for charge pumping. They didn't have the speed or power to answer two cores.
This precludes a general SMP system where dozens of cores use one bank of RAM and all peripherals. It means one core cannot use the majority of the system's RAM. Likewise, it is more difficult for all cores to saturate a network connection. Despite such limitations, I wonder if it is possible to fit two cores onto a 100mm*100mm board.
AndrewP on Thu 24 Mar 2022 wrote:
Sheep64 on Tue 22 Mar 2022 wrote:
what else is commonly broken and how would you implement it?
Another post incoming because that has a long answer.
I'm interested in your answer. For me, it is most immediately intellectual curiosity. For you, it might make you more systematic with future projects. A project may be an outrageous success, a mixed result or a complete failure. Many of my best projects had a mixed outcome where some parts worked well, some were unknown and some failed. We often see this with hardware projects which require a few patch wires. If you never require patch wires then you are advancing too cautiously.
AndrewP on Sun 27 Mar 2022 wrote:
SNES controller
I quite like your SNES circuits and I am very impressed with your soldering. The SNES protocol is a good test, and it can be practically extended to a range of input peripherals including keyboard and mouse.
AndrewP on Sun 27 Mar 2022 wrote:
Dang it! I bought four SNES joypads with the 9 pin D connector. I assumed that the socket would be easier to source and that the pins would vary less. Presumably, that's why you have an adapter board.
AndrewP on Thu 31 Mar 2022 wrote:
Logisim is slow. My simulated computer struggles to crack 19Hz when running the entire simulation on a i9 9900K.
That's painfully slow. Make-A-Chip on the ZX Spectrum could simulate a 20 gate ULA, and I presume that was written in BASIC on a system with the effective processing power of a 6502 running at 300kHz.
AndrewP on Thu 31 Mar 2022 wrote:
Logisim but betterer.
I'm looking for KiCAD but betterer. Your requirements increase the scope by approximately half, but it remains feasible.
AndrewP on Thu 31 Mar 2022 wrote:
I have a lot of circuits in my simulation. It takes me about a week to transcribe just one. If I can automate this I may even end up saving myself time in the long run.
I've found many limitations in KiCAD. Many of the inconsistencies between schematic editor and layout editor could be resolved by integrating them into one program. In particular, there would be no requirement to keep two inconsistent file formats synchronized.
Would you prefer to do simulation directly in a schematic editor which allows a true bus structure? Specifically, within one application, you would go directly from simulation to manual board layout.
Automated layout is computationally difficult and may preclude a self-hosting circuit CAD system. However, it may be feasible if the problem is sufficiently constrained.
AndrewP on Tue 12 Apr 2022 wrote:
Elephants have good memory but can you stack them?