I've been thinking about how to interface a video circuit to a 6502 asynchronously, without having them share a base clock signal, but still maintaining good throughput, and it looks like FIFOs might be the best way. I wanted to share my thoughts so far and see if anyone had past experience hooking them up this way or other thoughts on the matter. The method could be useful for other things than just video, if they have similar needs.
I've made several different video output circuits, mostly connected to 6502 computers of some kind, but they've all been synchronous - the CPU clock and video clock are directly related. Generally the video circuit has divided down the pixel clock to coordinate its internal workings, and from that, supplied the clock signal for the CPU so that the CPU's memory accesses naturally occur at times when the video circuit is not using the memory.
I'm keen to try doing it differently though, for two main reasons - firstly it's good to try new things, and whether they're better or worse, you learn more that way; and secondly, tying the clocks together has often become a thorn in my side later on, as for example supporting different screen resolutions requires upsetting the timing for the CPU and memory. I'm hoping that keeping these things separate with a robust interface layer in the middle will make it easier and safer to make changes on both sides. Related to that, it should also make it easier for other people to replicate and plug in to their own systems, as it will place fewer constraints on the design of the rest of their system.
For this interface I am only interested in writing data, not reading it; if any form of read is necessary, it's OK for it to be quite slow. The data written will be a form of command stream, and the idea is that the CPU is just writing to a memory-mapped I/O port, without any wait states, and the video circuit is picking up the data asynchronously when it's ready for it. It's up to the code to wait long enough between write operations so that the video circuit has had time to consume the data.
My first sketch for how this could work was disappointing. I envisaged the CPU latching data into an 8-bit D flipflop register, and setting a flag in another D flipflop to say that data was available. That's all synchronous to the CPU's clock. Then the video circuit needs to notice this has happened, and consume the data at the right time during its own clock cycle. It could use another D flipflop to sample the state of the first one, but in time with its own clock. Then if it's set, there is data pending. However, there is a chance of metastability in this flipflop, and it would be prudent to add a second similar flipflop to protect against that. The inverted output of the second flipflop can then be used to reset both of the other two, so that its own output is asserted for just one cycle of the video clock:
Attachment: 20230914_132205.jpg (Metastability delays)
In this diagram the signal on the left has a rising edge when PHI2 falls with the right address on the bus - it's also used to clock the data flipflop. WRITE indicates to the video circuit that a write operation is pending. PIX/4 represents the pixel clock divided by 4 - this is just how it worked in the last video circuit I made; you can think of it as being the video clock.
The trouble with this is the amount of latency, leading to poor throughput. I put some numbers under the diagram based on PIX/4 being 9MHz (SVGA resolution) or 6.3MHz (VGA resolution). The centre flipflop will trigger 0-1 video clocks after the left flipflop gets set, and the rightmost flipflop will trigger 1 video clock later than that. The video circuit itself then likely needs another video clock cycle before it has completely finished with the data that was latched. In all we need 2-3 video clock cycles to consume the data, which could be nearly half a microsecond.
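To make the 2-3 cycle figure concrete, here is a minimal Python sketch of the handshake timing described above. The function name and the assumption that video clock edges fall at exact multiples of the period are mine, purely for illustration; it models only the three steps in the text (centre flipflop, rightmost flipflop, one more cycle for the video circuit).

```python
import math

def handshake_done_ns(write_time_ns, video_period_ns):
    """Time at which the video side has finished with a byte written at write_time_ns.

    Assumed model: video clock edges fall at exact multiples of video_period_ns.
    """
    # Centre flipflop samples WRITE on the first video edge strictly after the write.
    first_edge = math.floor(write_time_ns / video_period_ns) + 1
    # Rightmost flipflop fires one video clock later and resets the other two.
    second_edge = first_edge + 1
    # One more video clock for the video circuit to finish with the latched data.
    done_edge = second_edge + 1
    return done_edge * video_period_ns

for label, mhz in (("PIX/4 = 9 MHz", 9.0), ("PIX/4 = 6.3 MHz", 6.3)):
    period = 1000.0 / mhz
    worst = handshake_done_ns(0.0, period)        # write lands just after an edge
    best = handshake_done_ns(period * 0.999, period) - period * 0.999  # just before an edge
    print(f"{label}: roughly {best:.0f} to {worst:.0f} ns per byte")
```

The worst case at 6.3MHz comes out a little under 480ns, which is where the "nearly half a microsecond" comes from.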
During this time, the CPU can't send any more data. Worse, when this time expires, it will take the same amount of time again for the next byte of data to arrive. So the throughput is limited to around 2MHz, whereas a 6MHz or 9MHz video system ought to be able to receive data at three or four times that rate. I would prefer it if the video system was the limiting factor.
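As a rough check of that throughput estimate, assuming the code has to allow for the worst case of three video clocks per byte (the helper name below is just for illustration):

```python
def max_write_rate_mhz(video_clock_mhz, cycles_per_byte=3):
    """Sustainable write rate if every byte must allow for the worst-case handshake."""
    return video_clock_mhz / cycles_per_byte

for mhz in (6.3, 9.0):
    print(f"video clock {mhz} MHz -> about {max_write_rate_mhz(mhz):.1f} MHz of writes")
```

So the handshake alone costs a factor of three against the video clock, before the CPU's own instruction timing is taken into account.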
It is possible to reduce the time by cutting some corners - the centre flipflop could be clocked on the inverse of PIX/4, for example, so that half a cycle is saved between it and the rightmost flipflop. I feel like the more you cut this time, though, the higher the risk of metastability in the rightmost flipflop. It'd also be possible for the video circuit to latch the data byte, perhaps in parallel with the flag, but that would again need two layers of latching to avoid metastability issues, leading to a lot of ICs.
On the whole though, this is looking more and more like a FIFO. I happened to buy some 74HCT40105s from Mouser a few months ago - these are 16-stage 4-bit FIFOs, and quite cheap:
https://www.ti.com/lit/gpn/CD74HC40105
Their timing characteristics are... not great. But I think they are fast enough to be useful here. Looking around more widely, the IDT7200 series look very appealing, providing fast 256x9 storage for example, though much more expensive; Digikey however does have some other similar options that are somewhere in the middle price-wise. So they may be better options - still, the 40105 is what I have, so I started to think around that, and I think it may be fast enough anyway.
The 40105 provides an input port with a "SI" clock signal to shift data in (on its rising edge) and an active-high DIR ("data-in ready") output that tells whether it's ready to receive data (i.e. not busy or full). We can potentially connect SI to an active-low "write" signal in the usual 6502 fashion, qualified with PHI2, so that it rises at the end of PHI2 and shifts in the data. If we are careful with data rates then we don't need to worry about DIR.
However according to the datasheet (which is not very clear in some senses) the SI signal should have a pulse width of at least 24ns (worst case listed). A 6502 system with a 20MHz clock would start to violate that, so we probably want to extend the SI pulse somewhat. It also requires a data hold time of 38ns so we still need to hold the data in a separate register for at least that period. Another interesting statistic that I'm not sure exactly how to interpret is the DIR pulse width requirement of up to 60ns. Given that this is an output, I take this to mean that I need to expect up to 60ns of "busy time" with DIR low after I shift in data, while the data ripples out of the first stage of the device.
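A quick sketch of the SI pulse width concern, assuming SI is simply the PHI2-qualified write and that PHI2 has a 50% duty cycle (both assumptions for illustration; real decode logic would shave a few more nanoseconds off):

```python
SI_MIN_PULSE_NS = 24.0  # worst-case minimum SI pulse width quoted above

def phi2_half_period_ns(cpu_clock_mhz):
    """Width of one PHI2 phase, assuming a 50% duty cycle."""
    return 500.0 / cpu_clock_mhz

def si_pulse_ok(cpu_clock_mhz):
    """True if an unstretched PHI2-qualified pulse meets the minimum width."""
    return phi2_half_period_ns(cpu_clock_mhz) >= SI_MIN_PULSE_NS

for mhz in (8, 14, 20, 25):
    print(f"{mhz} MHz: pulse {phi2_half_period_ns(mhz):.1f} ns, ok={si_pulse_ok(mhz)}")
```

At 20MHz the ideal pulse is 25ns, only 1ns inside the limit, so any gate delay in the qualification logic would push it over. That is the reason for stretching SI.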
Finally there's a propagation delay from SI to DIR of up to 63ns. I am not sure whether this means it could take up to 63ns for DIR to go low after I shift in data, or whether it's just repeating that it'll take 63ns for DIR to go high again after shifting in data. Based on the quoted maximum frequencies I suspect the former, but it's not very clear.
One other figure that's given is the total propagation time from SI to the data being available at the device's output port, which is quite high - e.g. 600ns - but that's not a concern for me, I don't mind the latency so long as the throughput is good.
So overall then, I think that after causing a rising edge on SI we need to wait at least 63ns before the next one. Assuming data can be consumed at much the same rate, this amounts to a throughput rate of about 15MHz, which is fine for my needs.
I believe the fastest rate that a 6502 can reasonably write data to an I/O port is once per three clock cycles, if the port is in page zero - any such write operation requires at least an instruction opcode, a target address, and the data byte itself. So if it were to exceed the capability of the 40105 (15MHz write rate) it would need a clock speed of 45MHz or more - and that's almost certainly beyond the CPU's capabilities. Overall I think this means that the 40105 is indeed fast enough for this purpose.
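The arithmetic behind those figures, as a small sketch. It takes the 63ns SI-to-DIR figure as the minimum spacing between writes, which is my reading of the datasheet rather than something it states outright:

```python
SI_TO_DIR_MAX_NS = 63.0  # worst-case SI-to-DIR delay quoted above
CYCLES_PER_WRITE = 3     # fastest store to a page-zero port, as described above

fifo_rate_mhz = 1000.0 / SI_TO_DIR_MAX_NS
cpu_clock_to_saturate_mhz = fifo_rate_mhz * CYCLES_PER_WRITE

print(f"FIFO write rate: about {fifo_rate_mhz:.1f} MHz")
print(f"CPU clock needed to exceed it: about {cpu_clock_to_saturate_mhz:.1f} MHz")
```

Without rounding, this gives slightly higher numbers than the 15MHz and 45MHz used in the text, so those are on the conservative side.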
The circuit would look something like this:
Attachment: 20230914_132107.jpg (6502-to-40105 schematic)
The flipflop is asynchronously reset by the qualified write signal ("CS^PHI2") and then is set again by the rising edge of PHI2. Its output forms the SI signal, stretched out to the next clock cycle. The data register is either latched during CS^PHI2 or synchronously clocked in on the rising edge of that signal, it doesn't matter which.
I showed 8 bits of data there; it is possible I'd add another 4 or 8 bits from the address bus, though, to provide more context to the operation, for example control lines to select a type of operation to perform, or some form of address offset within the video memory. I would like to keep the interface relatively narrow though - I'm tired of wiring 20-bit address buses to video circuits!
And here's a timing diagram I drew, based on the unlikely case that the I/O port is indeed in page zero and we're writing a continuous stream of constant data with an unrolled loop:
Attachment: 20230914_132045.jpg (6502-to-40105 timing diagram)
Note that SI won't rise until half a clock cycle after the data is latched, plenty of time for the data register to be stable, and the data register will hold this value for at least two clock cycles, meeting the required 38ns hold time. The gap between rising edges on SI is at least three clock cycles, and SI's pulse width is at least one clock cycle.
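These margins can be checked for any CPU clock with a small sketch like the one below. The cycle counts are the ones stated above (SI high for one cycle, data held for two, three cycles between SI rising edges); the names are made up for illustration, and reading the 63ns figure as a minimum SI spacing is still an assumption.

```python
# Worst-case 40105 figures quoted earlier, in nanoseconds
LIMITS_NS = {"si_pulse": 24.0, "data_hold": 38.0, "si_gap": 63.0}
# How many CPU clock cycles the circuit provides for each requirement
CYCLES = {"si_pulse": 1, "data_hold": 2, "si_gap": 3}

def check_timing(cpu_clock_mhz):
    """Return, per requirement, whether the circuit meets it at this CPU clock."""
    period_ns = 1000.0 / cpu_clock_mhz
    return {name: CYCLES[name] * period_ns >= LIMITS_NS[name] for name in LIMITS_NS}

for mhz in (8.0, 14.0, 20.0, 45.0):
    print(mhz, check_timing(mhz))
```

With these numbers the first requirement to fail as the clock rises is the one-cycle SI pulse width, at a little over 41MHz, which is well beyond any realistic 6502 clock.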
The only thing I didn't really consider here is the output end of the device. It has similar characteristics to the input end, as far as I can see, and as such I suspect it would be able to support around 15MHz read rate - which is plenty for my existing video circuits, which would only be able to process data once every four pixels. So I'm not too worried there. Of course if the video circuit is slow to read the data then the CPU would also need to slow down sending the data, either through knowledge of the video circuit's ability, or by having a means to tell when the FIFO is full - or maybe through CPU clock stretching to automatically rate-limit it, but I'd like to avoid that. In practice I think this is unlikely to be a major problem, and it's not caused by the FIFO itself; it's just the limit of the video circuit's speed.
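To get a feel for how much burst a 16-stage FIFO absorbs when the CPU writes faster than the video circuit reads, here is a very simplified sketch. It treats the FIFO as an ideal 16-entry queue with fixed write and read periods and ignores the ripple-through delay entirely, so it is an illustration rather than a model of the real part.

```python
def writes_before_full(write_period_ns, read_period_ns, depth=16, limit=10_000):
    """Count how many writes succeed before a write would find the FIFO full.

    Returns None if the FIFO never fills within `limit` writes.
    Periods are integers in nanoseconds to keep the arithmetic exact.
    """
    level = 0
    next_read = read_period_ns
    for n in range(limit):
        t = n * write_period_ns
        # Apply every read that happened up to the time of this write.
        while next_read <= t:
            if level > 0:
                level -= 1
            next_read += read_period_ns
        if level == depth:
            return n
        level += 1
    return None

print(writes_before_full(100, 200))  # CPU writing twice as fast as the video reads
print(writes_before_full(200, 100))  # video reads faster: never fills
```

The point is just that the FIFO buys a burst of a few dozen bytes at most; for longer streams the code still has to pace itself to the video circuit's sustained read rate.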