I was quite pleased with last week's advance on clock stretching. However, while ruminating in the Sheep Pen, I realized that I've not described much more than a baud rate generator. I am therefore determined to be more original.
The story so far:-
I have been influenced by recent discussion about software programmable clock speed and yet another multi core topology discussion where each core has a dedicated task in a multi-media system. This led to references to previous discussion but omitted any consideration of how to clock or boot a grid of 6502 processors.
Clock Stretching Octo Core And Beyond
One pair of cores may use a slightly asymmetric counting circuit to obtain square waves of opposite polarity and relatively little signal skew. Two pairs of cores may share the counting circuit via AND gates. Consider the typical case where RAM is faster than ROM. When two cores in the same phase access separate banks of RAM, they may do so at the fastest speed. However, if either core (or both cores) access ROM, everything is paused to allow access at a lower speed. This arrangement allows each pair of processors to share one memory-space while also allowing synchronous access to a (typically) smaller pool of memory shared between corresponding pairs of processors.
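To make the shared stretching rule concrete, here is a minimal behavioural sketch in Python. It is not a model of the actual counting circuit. The tick counts (`RAM_TICKS`, `ROM_TICKS`) and the function names are my own assumptions for illustration; the only thing taken from the description above is that one slow access stretches the cycle for every core that shares the counter.

```python
RAM_TICKS = 2  # assumed: the fastest cycle is two oscillator ticks
ROM_TICKS = 4  # assumed: a stretched cycle is four oscillator ticks


def cycle_ticks(accesses):
    """Length of one shared bus cycle in oscillator ticks.

    accesses holds one entry per core in the same phase: "ram" or "rom".
    Each core contributes a 'fast' signal when it targets RAM. The signals
    are ANDed, so a single ROM access stretches the cycle for all cores.
    """
    all_fast = all(access == "ram" for access in accesses)
    return RAM_TICKS if all_fast else ROM_TICKS


def total_ticks(trace):
    """Total ticks consumed by a sequence of shared bus cycles."""
    return sum(cycle_ticks(step) for step in trace)


if __name__ == "__main__":
    trace = [("ram", "ram"), ("ram", "rom"), ("rom", "rom"), ("ram", "ram")]
    print(total_ticks(trace))
```

With these assumed tick counts, a trace which mixes RAM and ROM access costs the slow rate on every cycle where at least one core touches ROM.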
It is possible to scale this process to a larger cluster of synchronous cores. However, as the number of cores increases, the likelihood of accessing slow memory also increases. We also have the problem that a deep tree of AND gates (and the physical distance of signalling) is likely to restrict the technique to eight cores or fewer.
Why eight cores? A trivial variation of the circuit allows a latch to set the minimum duration for clock stretching. This may be used to test system stability or operate at low power. When the latch is reset at power-up, its outputs are ANDed with all clock stretching signals. In this case, all delays are continuously asserted. If the mask is modified, a subset of genuine delay signals may be ignored but a minimum cycle time remains in place. Overall, the maximum number of cores in a cluster is heavily influenced by the fan-in of delay signals, which may include a register to define maximum speed. The eight core limit is derived from two tiers of three-input AND gates, of which eight inputs are cores and the ninth input is a software defined speed limit.
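Both scaling arguments can be put into numbers. The sketch below is only arithmetic, not a circuit. The first function assumes that each core accesses slow memory independently with the same probability per cycle, which is my simplification and not something measured. The second counts the inputs of an AND tree built from gates of one size. All function names are hypothetical.

```python
def stretch_probability(p_slow, cores):
    """Chance that a shared cycle is stretched.

    Assumes each core independently accesses slow memory with
    probability p_slow on any given cycle.
    """
    return 1.0 - (1.0 - p_slow) ** cores


def and_tree_inputs(gate_inputs, tiers):
    """Number of inputs on a full tree of identical AND gates."""
    return gate_inputs ** tiers


def max_cores(gate_inputs=3, tiers=2, reserved_inputs=1):
    """Inputs left for cores after reserving one for the speed limit."""
    return and_tree_inputs(gate_inputs, tiers) - reserved_inputs


if __name__ == "__main__":
    for n in (2, 4, 8):
        print(n, round(stretch_probability(0.10, n), 3))
    print(max_cores())
```

Under the independence assumption, a 10% chance of slow access per core means that an eight core cluster is stretched on more than half of its cycles.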
Other communication techniques may be used between clusters, such as dual port RAM. This is akin to long distance electrical distribution in which synchronous regions of alternating current are joined with DC ties.
Clock Stretching Beyond 30MHz
It is possible to implement two tiers of clock stretching such that a preliminary tier begins clock stretching before address decode occurs in full. Analysis requires categorization of glue logic chips according to number of inputs, complexity and the consequent propagation delay. In particular, address decode of three or more bits falls into a secondary tier. These are the weights I use when designing but they may not be accurate:-
- 74HC logic gates with two inputs have 7ns propagation.
- 74HC139 with three inputs has 12ns propagation.
- 74HC138 with six inputs has 28ns propagation.
- 74HC688 has more than 30ns propagation.
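The weights above can be summed along a decode path and compared against a time budget. This is a rough sketch which uses only the figures in the list; they are design weights, not datasheet values. The 74HC688 entry is entered as 30ns even though the list says "more than 30ns", so treat it as a lower bound. The dictionary keys and function names are hypothetical.

```python
# Design weights from the list above, in nanoseconds (not datasheet values).
WEIGHTS_NS = {
    "gate2": 7,   # 74HC gate with two inputs
    "hc139": 12,  # 74HC139
    "hc138": 28,  # 74HC138
    "hc688": 30,  # 74HC688: lower bound, the list says "more than 30ns"
}

CYCLE_30MHZ_NS = 1000.0 / 30.0  # about 33.3ns


def path_delay_ns(parts):
    """Sum of the propagation weights for chips in series."""
    return sum(WEIGHTS_NS[part] for part in parts)


def fits(parts, budget_ns):
    """True when the serial path settles within the budget."""
    return path_delay_ns(parts) <= budget_ns


if __name__ == "__main__":
    print(path_delay_ns(["gate2", "hc139"]))
    print(fits(["hc138", "hc138"], CYCLE_30MHZ_NS))
```

By these weights, a single gate fits comfortably inside half of a 30MHz cycle but two tiers of 74HC138 exceed the whole cycle, which is why deeper decode belongs in the secondary tier.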
Casual use of 74x688 has declined due to excessive propagation delay (except in video generation, where they should be retired). Likewise, it should be apparent that any address decode for a system exceeding 30MHz should increasingly look like the Garth Wilson Special where only the top two bits of the address-space are significant. In the preferred embodiment, A14 AND A15 is LOW when the address is 0-48KB and A14 NAND A15 is LOW when the address is 48-64KB. This may be run in parallel with dual core bank switching. The result is broadly compatible with several popular memory maps which use similar arrangements to reduce circuitry and increase speed.
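The two-bit decode is easy to check in software. This sketch evaluates the AND and NAND of A14 and A15 for a 16 bit address. The function name is hypothetical and the return values are logic levels, where 0 means LOW.

```python
def coarse_decode(address):
    """Return (A14 AND A15, A14 NAND A15) as logic levels for an address."""
    a14 = (address >> 14) & 1
    a15 = (address >> 15) & 1
    and_out = a14 & a15      # LOW (0) for 0-48KB, HIGH (1) for 48-64KB
    nand_out = 1 - and_out   # HIGH (1) for 0-48KB, LOW (0) for 48-64KB
    return and_out, nand_out


if __name__ == "__main__":
    for addr in (0x0000, 0x8000, 0xBFFF, 0xC000, 0xFFFF):
        print(hex(addr), coarse_decode(addr))
```

The boundary falls at $C000, which is 48KB, so $BFFF is the last address in the lower region.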
After the first phase of clock stretching has been established, address decode may ripple through two or more tiers of 74x138 and selectively request further delay to the secondary tier of the clock stretching circuit. For example, 48-50KB may be slow user defined I/O and 56-64KB may be slow boot ROM. In general, any irregular map of clock stretching may be defined.
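As a sketch of an irregular map, the example regions above can be written as a table of address ranges which request further delay. Only the two ranges come from the text (48-50KB and 56-64KB); the table layout and the function name are my own assumptions, and real hardware would derive the same answer from 74x138 outputs and not from a lookup.

```python
KB = 1024

# (start, end) ranges, end exclusive, which request the secondary tier.
SLOW_REGIONS = [
    (48 * KB, 50 * KB),  # slow user defined I/O
    (56 * KB, 64 * KB),  # slow boot ROM
]


def needs_second_tier(address):
    """True when the address falls in a region which requests more delay."""
    return any(start <= address < end for start, end in SLOW_REGIONS)


if __name__ == "__main__":
    for addr in (0x0200, 0xC000, 0xC800, 0xE000, 0xFFFC):
        print(hex(addr), needs_second_tier(addr))
```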
In the minimal case, two tier, dual core clock stretching requires four 74HC161 chips. However, this arrangement may be restricted to two tiers of +2 wait state only - unless video phase and symmetric access to the slowest I/O is sacrificed. If this is acceptable then six inputs are available for clock stretching. It may be preferable to have two fast counters and four slow counters. The fast tier is restricted to +2 operation while the slow tier obtains a 3x multiplier chosen at the end of the first fast cycle. In this arrangement, phase count is always odd. Specifically, the product of phases is (1 + 2^F)(1 + 2^S) where 2^F and 2^S are never one.
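The claim that the phase count is always odd follows directly from the formula: when 2^F and 2^S are never one, both are even, so each factor is odd and so is the product. A short sketch which evaluates the formula as written, assuming F and S are integers of one or more (the function name is hypothetical):

```python
def phase_product(f, s):
    """Evaluate (1 + 2^F)(1 + 2^S) for integer F >= 1 and S >= 1."""
    if f < 1 or s < 1:
        raise ValueError("2^F and 2^S must never be one")
    return (1 + 2 ** f) * (1 + 2 ** s)


if __name__ == "__main__":
    for f in range(1, 4):
        print([phase_product(f, s) for s in range(1, 4)])
```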
Unfortunately, each tier of clock stretching incurs a minimum of two ticks per cycle. Therefore, a two tier clock stretching circuit requires an oscillator which runs at four times the maximum processor clock rate. For example, a 120MHz oscillator requires a 120MHz tier of counters and a 60MHz tier of counters to obtain a maximum clock rate of 30MHz, which is a minimum cycle time of 33.3ns. It is otherwise suitable to approximate the cycle time of each section of a memory map with 33.3ns granularity or better.
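The oscillator arithmetic generalizes to any number of tiers if each tier divides by at least two. This sketch only restates the example above; the function names are hypothetical.

```python
TICKS_PER_TIER = 2  # each tier incurs a minimum of two ticks per cycle


def oscillator_mhz(max_clock_mhz, tiers):
    """Oscillator frequency needed for a given maximum processor clock."""
    return max_clock_mhz * TICKS_PER_TIER ** tiers


def tier_clocks_mhz(osc_mhz, tiers):
    """Input clock of each tier of counters, fastest tier first."""
    return [osc_mhz / TICKS_PER_TIER ** i for i in range(tiers)]


def cycle_time_ns(clock_mhz):
    """Cycle time in nanoseconds for a clock rate in MHz."""
    return 1000.0 / clock_mhz


if __name__ == "__main__":
    print(oscillator_mhz(30, 2))
    print(tier_clocks_mhz(120, 2))
    print(round(cycle_time_ns(30), 1))
```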
Square Wave Video Generation
I wondered if it is possible to make counters with more than two phases. I have also considered audio and visual applications for square waves with arbitrary duty and duration. Indeed, it is possible to generate horizontal sync, left porch, playfield address, right porch and corresponding vertical signals in a manner which may be hard-wired or software configurable, without the use of 74x688 comparators. I also believe that it is possible to arrange a ring of four registers where one or three self-inhibit. Furthermore, I believe that 1920*1080*60FPS display requires a 16 bit shift register running at 160MHz and 12 74HC161 chips running at 10MHz or less. I already have a ring of 2*2 counters running at 25MHz.
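The 10MHz figure is consistent with a 16 bit shift register being reloaded once per 16 pixels, although that relationship is my reading and not something stated above. The sketch below checks that arithmetic, counts the bits available in 12 74HC161 chips (each is a four bit counter) and, for comparison, computes the pixel clock of the standard 1080p60 timing with a 2200*1125 total raster. Function names are hypothetical.

```python
def word_rate_mhz(pixel_clock_mhz, shift_bits):
    """Rate at which a shift register must be reloaded with a new word."""
    return pixel_clock_mhz / shift_bits


def counter_bits(chips, bits_per_chip=4):
    """Total counter bits; a 74HC161 is a four bit counter."""
    return chips * bits_per_chip


def pixel_clock_mhz(total_width, total_height, frames_per_second):
    """Pixel clock for a raster which includes blanking."""
    return total_width * total_height * frames_per_second / 1e6


if __name__ == "__main__":
    print(word_rate_mhz(160, 16))
    print(counter_bits(12))
    print(pixel_clock_mhz(2200, 1125, 60))
```

The standard timing gives 148.5MHz, so a 160MHz shift register leaves some margin.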