PostPosted: Thu Sep 14, 2023 12:34 pm 

Joined: Fri Jul 09, 2021 10:12 pm
Posts: 741
I've been thinking about how to interface a video circuit to a 6502 asynchronously, without having them share a base clock signal, but still maintaining good throughput, and it looks like FIFOs might be the best way. I wanted to share my thoughts so far and see if anyone had past experience hooking them up this way or other thoughts on the matter. The method could be useful for other things than just video, if they have similar needs.

I've made several different video output circuits, mostly connected to 6502 computers of some kind, but they've all been synchronous - the CPU clock and video clock are directly related. Generally the video circuit has divided down the pixel clock to coordinate its internal workings, and from that, supplied the clock signal for the CPU so that the CPU's memory accesses naturally occur at times when the video circuit is not using the memory.

I'm keen to try doing it differently though, for two main reasons - firstly it's good to try new things, and whether they're better or worse, you learn more that way; and secondly, tying the clocks together has often become a thorn in my side later on, as for example supporting different screen resolutions requires upsetting the timing for the CPU and memory. I'm hoping that keeping these things separate with a robust interface layer in the middle will make it easier and safer to make changes on both sides. Related to that, it should also make it easier for other people to replicate and plug into their own systems, as it will place fewer constraints on the design of the rest of their system.

For this interface I am only interested in writing data, not reading it; if any form of read is necessary, it's OK for it to be quite slow. The data written will be a form of command stream, and the idea is that the CPU is just writing to a memory-mapped I/O port, without any wait states, and the video circuit is picking up the data asynchronously when it's ready for it. It's up to the code to wait long enough between write operations so that the video circuit has had time to consume the data.

My first sketch for how this could work was disappointing. I envisaged the CPU latching data into an 8-bit D flipflop register, and setting a flag in another D flipflop to say that data was available. That's all synchronous to the CPU's clock. Then the video circuit needs to notice this has happened, and consume the data at the right time during its own clock cycle. It could use another D flipflop to sample the state of the first one, but in time with its own clock. Then if it's set, there is data pending. However, there is a chance of metastability in this flipflop, and it would be prudent to add a second similar flipflop to protect against that. The inverted output of the second flipflop can then be used to reset both of the other two, so that its own output is asserted for just one cycle of the video clock:

Attachment:
File comment: Metastability delays
20230914_132205.jpg


In this diagram the signal on the left has a rising edge when PHI2 falls with the right address on the bus - it's also used to clock the data flipflop. WRITE indicates to the video circuit that a write operation is pending. PIX/4 represents the pixel clock divided by 4 - this is just how this worked in the last video circuit I made, you can just think of it as being the video clock.

The trouble with this is the amount of latency, leading to poor throughput. I put some numbers under the diagram based on PIX/4 being 9MHz (SVGA resolution) or 6.3MHz (VGA resolution). The centre flipflop will trigger 0-1 video clocks after the left flipflop gets set, and the rightmost flipflop will trigger 1 video clock later than that. The video circuit itself then likely needs another video clock cycle before it has completely finished with the data that was latched. In all we need 2-3 video clock cycles to consume the data, which could be nearly half a microsecond.

During this time, the CPU can't send any more data. Worse, when this time expires, it will take the same amount of time again for the next byte of data to arrive. So the throughput is limited to around 2MHz, whereas a 6MHz or 9MHz video system ought to be able to receive data at three or four times that rate. I would prefer it if the video system was the limiting factor.
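As a rough check on those figures, the sketch below works out how long 2-3 video clocks take at the two PIX/4 rates quoted above, and the write rate that leaves. The function name is made up for illustration; the only inputs are the numbers already given in this post.

```python
def handshake_window_ns(video_clock_mhz, cycles):
    """Time in ns taken by `cycles` periods of the video clock."""
    return cycles * 1000.0 / video_clock_mhz

for name, mhz in (("SVGA, PIX/4 = 9MHz", 9.0), ("VGA, PIX/4 = 6.3MHz", 6.3)):
    best = handshake_window_ns(mhz, 2)
    worst = handshake_window_ns(mhz, 3)
    # The CPU can only safely write again once the worst-case window has passed.
    safe_rate_mhz = 1000.0 / worst
    print(f"{name}: {best:.0f}-{worst:.0f} ns per byte, "
          f"safe write rate about {safe_rate_mhz:.1f} MHz")
```

The VGA worst case comes out at roughly 476ns, which is the "nearly half a microsecond" mentioned above.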

It is possible to reduce the time by cutting some corners - the centre flipflop could be clocked on the inverse of PIX/4, for example, so that half a cycle is saved between it and the rightmost flipflop. I feel like the more you cut this time, though, the higher the risk of metastability in the rightmost flipflop. It'd also be possible for the video circuit to latch the data byte, perhaps in parallel with the flag, but that would again need two layers of latching to avoid metastability issues, leading to a lot of ICs.

On the whole though, this is looking more and more like a FIFO. I happened to buy some 74HCT40105s from Mouser a few months ago - these are 16-stage 4-bit FIFOs, and quite cheap: https://www.ti.com/lit/gpn/CD74HC40105

Their timing characteristics are... not great. But I think they are fast enough to be useful here. Looking around more widely, the IDT7200 series look very appealing, providing fast 256x9 storage for example, though much more expensive; Digikey however does have some other similar options that are somewhere in the middle price-wise. So they may be better options - still, the 40105 is what I have, so I started to think around that, and I think it may be fast enough anyway.

The 40105 provides an input port with a "SI" clock signal to shift data in (on its rising edge) and an active-high DIR ("data-in ready") output that tells whether it's ready to receive data (i.e. not busy or full). We can potentially connect SI to an active-low "write" signal in the usual 6502 fashion, qualified with PHI2, so that it rises at the end of PHI2 and shifts in the data. If we are careful with data rates then we don't need to worry about DIR.

However according to the datasheet (which is not very clear in some senses) the SI signal should have a pulse width of at least 24ns (worst case listed). A 6502 system with a 20MHz clock would start to violate that, so we probably want to extend the SI pulse somewhat. It also requires a data hold time of 38ns so we still need to hold the data in a separate register for at least that period. Another interesting statistic that I'm not sure exactly how to interpret is the DIR pulse width requirement of up to 60ns. Given that this is an output, I take this to mean that I need to expect up to 60ns of "busy time" with DIR low after I shift in data, while the data ripples out of the first stage of the device.

Finally there's a propagation delay from SI to DIR of up to 63ns. I am not sure whether this means it could take up to 63ns for DIR to go low after I shift in data, or whether it's just repeating that it'll take 63ns for DIR to go high again after shifting in data. Based on the quoted maximum frequencies I suspect the former, but it's not very clear.

One other figure that's given is the total propagation time from SI to the data being available at the device's output port, which is quite high - e.g. 600ns - but that's not a concern for me, I don't mind the latency so long as the throughput is good.

So overall then, I think that after causing a rising edge on SI we need to wait at least 63ns before the next one. Assuming data can be consumed at much the same rate, this amounts to a throughput rate of about 15MHz, which is fine for my needs.

I believe the fastest rate that a 6502 can reasonably write data to an I/O port is once per three clock cycles, if the port is in page zero - any such write operation requires at least an instruction opcode, a target address, and the data byte itself. So if it were to exceed the capability of the 40105 (15MHz write rate) it would need a clock speed of 45MHz or more - and that's almost certainly beyond the CPU's capabilities. Overall I think this means that the 40105 is indeed fast enough for this purpose.
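The same estimate without rounding, as a small sketch. It assumes, as above, that the 63ns SI-to-DIR figure is the minimum spacing between writes and that the fastest store takes three cycles; the names are illustrative only.

```python
FIFO_MIN_WRITE_PERIOD_NS = 63   # worst-case SI-to-DIR delay, as read from the datasheet above
CYCLES_PER_WRITE = 3            # fastest 6502 store, to a port in page zero

def fifo_max_write_rate_mhz(period_ns=FIFO_MIN_WRITE_PERIOD_NS):
    """Highest sustained write rate the FIFO input can accept."""
    return 1000.0 / period_ns

def cpu_clock_to_saturate_mhz(cycles_per_write=CYCLES_PER_WRITE):
    """CPU clock at which back-to-back stores would reach that rate."""
    return fifo_max_write_rate_mhz() * cycles_per_write

print(f"FIFO input limit: {fifo_max_write_rate_mhz():.1f} MHz")
print(f"CPU clock needed to hit it: {cpu_clock_to_saturate_mhz():.1f} MHz")
```

The unrounded results are a little higher than the 15MHz and 45MHz used above, so the conclusion is unchanged.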

The circuit would look something like this:

Attachment:
File comment: 6502-to-40105 schematic
20230914_132107.jpg


The flipflop is asynchronously reset by the qualified write signal ("CS^PHI2") and then is set again by the rising edge of PHI2. Its output forms the SI signal, stretched out to the next clock cycle. The data register is either latched during CS^PHI2 or synchronously clocked in on the rising edge of that signal, it doesn't matter which.

I showed 8 bits of data there, it is possible I'd add another 4 or 8 bits from the address bus though to provide more context to the operation, for example control lines to select a type of operation to perform, or some form of address offset within the video memory. I would like to keep the interface relatively narrow though, I'm tired of wiring 20-bit address buses to video circuits!

And here's a timing diagram I drew, based on the unlikely case that the I/O port is indeed in page zero and we're writing a continuous stream of constant data with an unrolled loop:

Attachment:
File comment: 6502-to-40105 timing diagram
20230914_132045.jpg


Note that SI won't rise until half a clock cycle after the data is latched, plenty of time for the data register to be stable, and the data register will hold this value for at least two clock cycles, meeting the required 38ns hold time. The gap between rising edges on SI is at least three clock cycles, and SI's pulse width is at least one clock cycle.
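A minimal sketch of those three checks, for anyone wanting to try other clock speeds. It assumes the datasheet figures as interpreted earlier in this post (24ns SI pulse, 38ns data hold, 63ns between SI rising edges) and the cycle counts from the timing diagram description; all names are illustrative.

```python
SI_MIN_PULSE_NS = 24     # minimum SI pulse width
DATA_HOLD_NS = 38        # data hold time
SI_MIN_PERIOD_NS = 63    # assumed minimum spacing between SI rising edges

def si_margins_ns(cpu_mhz):
    """Margins in ns for the stretched-SI scheme (negative means violated)."""
    cycle_ns = 1000.0 / cpu_mhz
    return {
        "si_pulse": 1.0 * cycle_ns - SI_MIN_PULSE_NS,    # SI high for one cycle
        "data_hold": 2.0 * cycle_ns - DATA_HOLD_NS,      # data held for two cycles
        "si_period": 3.0 * cycle_ns - SI_MIN_PERIOD_NS,  # one write per three cycles
    }

def timing_ok(cpu_mhz):
    return all(margin >= 0 for margin in si_margins_ns(cpu_mhz).values())

def max_clock_unstretched_mhz():
    """Clock limit if SI were just the PHI2-high half cycle, with no stretching."""
    return 1000.0 / (2 * SI_MIN_PULSE_NS)

for mhz in (8, 14, 20, 40, 45):
    print(mhz, "MHz:", "ok" if timing_ok(mhz) else "violated")
print(f"unstretched SI limit: {max_clock_unstretched_mhz():.1f} MHz")
```

Under these assumptions the SI pulse width is the first constraint to fail as the clock rises, and without stretching it would already fail a little above 20MHz.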

The only thing I didn't really consider here is the output end of the device. It has similar characteristics to the input end, as far as I can see, and as such I suspect it would be able to support around 15MHz read rate - which is plenty for my existing video circuits, which would only be able to process data once every four pixels. So I'm not too worried there. Of course if the video circuit is slow to read the data then the CPU would also need to slow down sending the data, either through knowledge of the video circuit's ability, or by having a means to tell when the FIFO is full - or maybe through CPU clock stretching to automatically rate-limit it, but I'd like to avoid that. In practice I think this is unlikely to be a major problem, and it's not caused by the FIFO itself, it's just the limit of the video circuit's speed.


PostPosted: Thu Sep 14, 2023 3:41 pm 

Joined: Fri Aug 03, 2018 8:52 am
Posts: 746
Location: Germany
your idea of using a FIFO to send commands to some clock independent secondary graphics system is pretty much exactly what i've been rambling about over in the General Discussions.
though i would add a second FIFO to make the interface bi-directional. that way the secondary system can be used for general computational work as well, and report back errors, statuses, etc.

the IDT720x FIFOs are cheap if you know where to get them (i got my IDT7202 (1024x9) used from utsource for a few bucks each).

also the easiest way to handle the Empty/Full Flags is by doing it in software. you can just use an octal buffer or similar to allow the CPU to read those status lines itself.
so when the CPU wants to write to the FIFO it checks the Full Flag and if it's not active writes a byte, and then goes back to checking the flag.
it's obviously slower than doing it in hardware (like pausing the CPU when trying to write to a full FIFO) but i think it's much simpler to implement.
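A toy model of that check-then-write loop, just to illustrate the idea. It is a sketch only: the `Fifo` class, the `send` helper and the crude "consumer removes one word every few polls" behaviour are all made up for the example and do not model any particular FIFO chip.

```python
from collections import deque

class Fifo:
    """Toy model of a hardware FIFO with a readable Full flag."""
    def __init__(self, depth):
        self.depth = depth
        self.words = deque()

    @property
    def full(self):
        return len(self.words) >= self.depth

    def write(self, byte):
        if self.full:
            raise OverflowError("write to a full FIFO")
        self.words.append(byte)

    def read(self):
        return self.words.popleft()

def send(fifo, data, drain_every=3):
    """Write `data`, polling the Full flag before every byte.

    The consumer is modelled crudely: it removes one word on every
    `drain_every`-th poll. Returns (received, polls).
    """
    received, polls = [], 0
    for byte in data:
        while True:
            polls += 1
            if polls % drain_every == 0 and fifo.words:
                received.append(fifo.read())
            if not fifo.full:
                break
        fifo.write(byte)
    while fifo.words:              # let the consumer finish draining
        received.append(fifo.read())
    return received, polls

received, polls = send(Fifo(depth=4), list(range(20)))
print(received == list(range(20)), polls)
```

Every byte costs at least one poll, which is the software overhead being traded for the simpler hardware.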


PostPosted: Thu Sep 14, 2023 4:07 pm 

Joined: Fri Jul 09, 2021 10:12 pm
Posts: 741
Thanks Proxy, I'll reread that in more detail as I'd only skim-read it before! And I'll post more thoughts on your plans in that thread if I have them.

Initially I don't want the CPU to have to check whether the FIFO is full because that will significantly reduce the data rate. Longer term though with more of a proper GPU on the other end (or I guess in your case, a DMA system) the CPU won't be as directly involved, so it can and will have to check this sort of thing.

I did consider that I could detect the case where the CPU is trying to write to the FIFO but it is already full, and issue an NMI, with the NMI handler then being responsible for retrying the write operation. Or tying it into my cycle-stretching system of course - but that's tricky without compromising other aspects of my system's general execution speed, and I like the idea of this not being too closely tied to the core system design, instead being something more easy to adapt to other systems.


PostPosted: Thu Sep 14, 2023 4:54 pm 

Joined: Fri Jul 09, 2021 10:12 pm
Posts: 741
Thinking a bit about the other end of the interface, there is still a metastability concern there, as although the FIFO decouples things from the CPU clock, it doesn't really synchronise them with the video clock. It seems to be less of a problem though, and I think this can still work at full video clock speed, at least for 25MHz VGA.

My video circuits typically look something like this - this is based on my most recent one, but may not exactly match the schematics as I drew it from memory:

Attachment:
File comment: Typical video interface circuit
20230914_174817.jpg


At the top left is an array of tristate counters that scan through the video memory to generate the output signal. The video address bus that they feed is also driven by an array of bus transceivers, passing addresses through from the CPU. There's another bus transceiver to pass the CPU's data bus through to the video data bus during write operations, and the video address and data buses also of course go to some video RAM, with the video data bus also going out to some output circuitry that separates pixels from the bytes of data, decodes colours, etc. I tend to output four pixels per byte, and put multiple RAM chips in parallel to get more width on the video data bus to increase the number of colours.

The video RAM has /VRAMOE and /VRAMWE signals, the counters have a /OE signal that's linked to the VRAM's, and the transceivers also have a /XCVROE signal. These used to be generated by separate logic parts, but I moved them to a simple PLD when upgrading to SVGA resolution.

The timings of the signals work like this:

Attachment:
File comment: Typical video control signals
20230914_174336.jpg


PIX is the pixel clock (e.g. 25.175MHz, or 36MHz for SVGA) and PIX/4 is that clock divided by four, as my system mostly processes four pixels at a time. As shown the video address bus is driven with read addresses half of the time, and the other half of the time it may contain write addresses - I didn't show it, but this only happens if a write is actually in progress. /VRAMOE allows the counters to output their address to the video address bus, and during write operations /XCVROE allows the transceivers to drive the video buses.

/VRAMOE also allows the video RAM to output to the video data bus for the output circuitry to latch and process over the course of this cycle, and /VRAMWE is activated for a small portion of the cycle during write operations, after a period to allow the video address bus to stabilise with the write address.

These signals are mostly just inverted or shifted versions of the PIX/4 clock signal, and they tend to require a bit of fine tuning to make the circuit stable, given propagation delays etc. The signals that are only active during write operations are also gated with some address decoding from the CPU side, and this decision is made in this diagram on the falling edge of PIX/4, just in time for the write cycle - this corresponds with PHI2 rising in my existing system.

For the asynchronous version, that decision needs to be based on whether the FIFO has data or not. The 40105 provides DOR ("data-out ready") that is driven high when data is available, and /SO ("shift-out") that causes the data to be removed from the FIFO on its falling edge. But I can't just use DOR directly to drive whether the write signals are active on a particular cycle, because its changes are not synchronised to the video clock - it could change at around the same time as the video clock and violate the setup or hold requirements of the flipflops. So it will be necessary to buffer it through a flipflop to guard a bit against metastability:

Attachment:
File comment: Metastability protection for DOR
20230914_174406.jpg


Here we AND together the DOR signals from two 4-bit FIFOs, so that we only react when both are ready, and gate it with a clock signal that needs to rise at an appropriate time. DORS is a "synchronised" version of DOR that can drive the rest of the circuit.

DORCLK could be driven by the rising edge of PIX/4, giving two pixels' worth of time for any metastability to sort itself out before the DORS output is used - that's about 80ns at VGA resolution, or 55ns at SVGA resolution. Here's what that might look like, including the rest of the derived signals:

Attachment:
File comment: Video timing signals with FIFO
20230914_174346.jpg


/SO needs to go low as soon as we decide we're consuming the data, as it's on the critical timing path - it causes DOR to then go low, and if there's another word of data waiting it could take DOR 60ns to go high again according to the worst case figures in the datasheet. We should hold /SO low for at least 30ns, but the exact timing of when it rises again isn't important and I didn't show it.

So long as DOR does turn itself around within two pixels' time, then if there's data in the FIFO, the video circuit can process another write operation on the very next cycle. This is much better than the best case in my current system, due to the increased possible CPU speed with this new system. The typical stats in the datasheet are easily good enough to do this in SVGA resolution, though the worst case stats are not; in that case, the next word just wouldn't be processed until the cycle afterwards.
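A quick sketch of that comparison, assuming the 60ns worst-case DOR recovery figure quoted above and a window of two pixel periods. The function names are illustrative.

```python
DOR_RECOVERY_WORST_NS = 60   # worst-case time for DOR to go high again

def settle_window_ns(pixel_clock_mhz, pixels=2):
    """Time available between /SO going low and the next DORCLK edge."""
    return pixels * 1000.0 / pixel_clock_mhz

def can_run_back_to_back(pixel_clock_mhz):
    """True if worst-case DOR recovery fits inside the two-pixel window."""
    return settle_window_ns(pixel_clock_mhz) >= DOR_RECOVERY_WORST_NS

for name, pix_mhz in (("VGA", 25.175), ("SVGA", 36.0)):
    window = settle_window_ns(pix_mhz)
    verdict = "fits" if can_run_back_to_back(pix_mhz) else "does not fit"
    print(f"{name}: window {window:.1f} ns, worst-case recovery {verdict}")
```

With the worst-case figure the VGA window has room to spare and the SVGA window falls a few nanoseconds short, which is why the next word may slip to the following cycle there.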

As we are asserting /SO right at the start of the write cycle, we need to latch the data from the FIFO at that point, as it's going to change it - so a further 8-bit (or more) register is needed for that. This is adding back some of the complexity I thought the FIFO would save, but I think it's still efficient and beneficial to use the FIFO here.


PostPosted: Thu Sep 14, 2023 11:40 pm 

Joined: Fri Mar 18, 2022 6:33 pm
Posts: 491
Hi George,

I'm really glad you're doing this! It's a bit over my head still, but back when I was first planning my text VGA circuit I remember Jeff hinting that you could use a FIFO in place of dual-port RAM. I ended up taking the easy path and just tracking down some NOS dual port RAM, but I've always been curious what the FIFO solution would be like.

_________________
"The key is not to let the hardware sense any fear." - Radical Brad


PostPosted: Fri Sep 15, 2023 1:18 pm 

Joined: Fri Dec 11, 2009 3:50 pm
Posts: 3367
Location: Ontario, Canada
Yes I did contemplate using a FIFO for video. The plan remains vague and notional, so perhaps some head-slap show-stoppers lie ahead. But I was looking at FIFOs such as the 7200, 7201 and 7202 types whose datasheets I'm attaching -- not the 74HCT40105s you're considering, George.

gfoot wrote:
My first sketch for how this could work was disappointing. I envisaged the CPU latching data into an 8-bit D flipflop register, and setting a flag in another D flipflop to say that data was available.
I'm not quite following this remark, and the ones that mention metastability. Am I right in thinking they pertain to how bytes will get written into the FIFO? (I mean as opposed to the output section of the FIFO where bytes get read out.) I don't understand where the concern arises. Edit: oh, wait -- later you mention a 15 MHz data rate, so perhaps it's the slowness of the 74HCT40105 that necessitates the added complexity.

In its introduction, the TI 720x datasheet mentions "Read and Write Frequencies up to 40 MHz," and I haven't looked into the details. But it sounds as if the FIFO could be written to at full speed, with no more glue logic than if one were writing to a 74_573 octal latch. Hopefully I'm not mistaken about that.

The 720x series are also much deeper than the 74HCT40105 -- deep enough to contain enough data for an entire horizontal scan line. So, for example, if the scan line is 80 characters (or 640 pixels) wide, I envisioned the CPU managing all of this with a single interrupt. The ISR would simply fetch and write 80 bytes into the FIFO then do its RTI.

So, there'd be no handshaking, at least not on a byte-by-byte level. But on a scan-line by scan-line level, I imagine there does need to be some loose coordination between the ISR that fills the FIFO and the video circuit that empties it. And, hmmm, does the tail wag the dog, or does the dog wag the tail? :)

I expect there are different ways it could work. Off the top of my head, maybe there should be a 6522 VIA timer running at the horizontal frequency, outputting pulses on PB7. Each pulse triggers both the video circuit, causing it to read out 80 bytes, and an interrupt, causing the CPU to write in 80 more bytes. And the CPU would have to "prime the pump" beforehand, loading the first 80 bytes.
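For a feel of the CPU budget such a per-line interrupt implies, here is a rough sketch. Every figure in it other than the 80 bytes per line is an assumption made for illustration: 14 cycles per byte for a simple indexed copy loop, 13 cycles for interrupt entry plus RTI, and the standard 640x480 VGA line rate of about 31.469 kHz.

```python
def cycles_per_line(bytes_per_line, cycles_per_byte, isr_overhead_cycles):
    """CPU cycles spent in the ISR for one scan line."""
    return bytes_per_line * cycles_per_byte + isr_overhead_cycles

def min_cpu_mhz(line_rate_khz, cycles):
    """Clock at which the ISR alone would use all of the CPU's time."""
    return cycles * line_rate_khz / 1000.0

BYTES_PER_LINE = 80
CYCLES_PER_BYTE = 14      # assumed cost of a simple indexed copy loop
ISR_OVERHEAD = 13         # 7 cycles interrupt entry + 6 cycles RTI
LINE_RATE_KHZ = 31.469    # assumed: standard 640x480 VGA horizontal rate

cycles = cycles_per_line(BYTES_PER_LINE, CYCLES_PER_BYTE, ISR_OVERHEAD)
print(f"{cycles} cycles per line, break-even clock "
      f"{min_cpu_mhz(LINE_RATE_KHZ, cycles):.1f} MHz")
```

A lower line rate, a tighter copy loop, or refilling only on some scan lines would change the answer considerably, so treat this as a way to plug in numbers rather than a verdict on the idea.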

That's as far as I've gotten! I hope it's helpful, or at least stirs the pot in an entertaining way! And BTW yes, AFAIK the 720x series FIFOs are no longer in production. :|

-- Jeff


Attachments:
SN74ACT720x TI FIFO series.pdf [310.15 KiB]
IDT720x IDT FIFO series.pdf [249.23 KiB]

_________________
In 1988 my 65C02 got six new registers and 44 new full-speed instructions!
https://laughtonelectronics.com/Arcana/ ... mmary.html
PostPosted: Fri Sep 15, 2023 2:26 pm 

Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10985
Location: England
It strikes me there are (at least!) three kinds of ways of using FIFOs to help with video
- the pixels come from the 6502 side, are pushed through the FIFO by DMA
- or pushed through by the 6502
- or the pixels always live on the far side, and it's higher level commands which get written by the 6502 and executed on the far side

We might perhaps describe these as
- DMA video
- 6502 beam-chasing video
- smart video controller

And each of those will have its own story, and complexity, and cost.

(Hope this is a useful contribution...)


PostPosted: Fri Sep 15, 2023 2:31 pm 

Joined: Tue Jul 05, 2005 7:08 pm
Posts: 1043
Location: near Heidelberg, Germany
I haven't read through all of this yet, but the C128 VDC and some variants of the CRTC chip can do this. You set the video memory address in two CRTC/VDC registers, and access that location through a third register.

It's not asynchronously generating pixels directly, but it still decouples the two clock domains. And it is not too CPU-intensive.

André

_________________
Author of the GeckOS multitasking operating system, the usb65 stack, designer of the Micro-PET and many more 6502 content: http://6502.org/users/andre/


PostPosted: Fri Sep 15, 2023 4:44 pm 

Joined: Fri Dec 21, 2018 1:05 am
Posts: 1117
Location: Albuquerque NM USA
The Compact Flash interface is a 256x16 FIFO. CF supports multi-sector reads, so once a read of a block of sectors has been commanded, data streams out of the CF data port two bytes at a time at high speed. So if a picture is stored in contiguous blocks on the CF, it can be streamed out to the video display from the 16-bit CF data port to video memory. This can be done with DMA or, alternatively, with video logic that snoops the CF data port so that whatever comes out of the CF goes into pixel shift registers, with the CPU supplying the video addresses in beam-racing fashion. Just thinking out loud, I've not done anything like that.
Bill
Edit: actually, the CPU only sets up the CF for a multi-sector read and then just reads the CF data port continuously at a data rate that matches the video display rate. The ancillary logic will automatically take data from the CF data port and fill the RGB video shift registers.


PostPosted: Fri Sep 15, 2023 4:58 pm 

Joined: Fri Jul 09, 2021 10:12 pm
Posts: 741
Jeff, yes that's a little different to what I'm planning - I'm still using a framebuffer, but just changing the way it gets written to, in line with how PC hardware evolved. It's more like the third item on BigEd's list, though initially it will just be an interface for writing data fairly explicitly without any particularly fancy commands (maybe just "set address", "write byte", and perhaps "write byte and increment address" for example). This seems similar to what Andre said about the C128, and could be a stepping stone towards more advanced commands.

In fact what I'm really building is just an interface for data from one clock domain to alter register values in another clock domain, in sync with the target clock and without wait states - what happens to the register contents after that is flexible.

The metastability issue is when the data is pulled out of the FIFO in the target clock domain - the data becoming available in the FIFO might not be well-timed against that clock. It might not matter as much for a live stream of pixel data, where you're mostly keeping the FIFO full with enough data for the whole scanline, and the read circuit can assume that data is available on every cycle.

The 720x series do sound good for the use you're suggesting and I think it's partly what Proxy has in mind too (see his thread in the General forum). As far as I could see these parts were still available new from Mouser and Digikey, but rather expensive, so perhaps getting used ones is indeed more economical.


PostPosted: Sat Sep 16, 2023 11:24 am 

Joined: Fri Jul 09, 2021 10:12 pm
Posts: 741
As far as a generic system goes, here is a fairly complete (but untested) overall circuit design, supporting asynchronous writes to up to 16 8-bit registers (four of which are shown) which are then guaranteed to change in sync with a remote clock signal:
Attachment:
fifo_write_to_registers.png

In practice for video I probably won't use as many registers, and will probably have some non-register things driven by it instead; the requirements are going to be similar enough to the registers' requirements, though, that they're a good stand-in for now.

The address decoding for CS^PHI2 on the left hand side could be anything based on A15-A4. It doesn't matter if it overlaps with something else, so long as that other thing doesn't mind being written to - I will probably overlap it with RAM, either losing 16 bytes of page zero, or somewhere else in the memory map.

U4, U5, U6 are used to latch the data bus for a while due to the 40105's fairly long hold requirement. SI would rise around the falling edge of PHI2, and the 6502 will change the address bus well before the hold time is satisfied. You could perhaps find some other way to generate SI earlier and avoid needing to latch the data and address.

The purpose of the three flipflops to the right of the centre of the diagram is to resynchronize the asynchronous DOR signal with the local clock (DORCLK), generating a short pulse on ACT that's in sync with DORCLK. This then drives two 74HC259s acting as active-high decoders, and their outputs provide clock pulses to the various registers to store the data that was sent. The timing diagram shows an example of DOR changing at an awkward moment leading to metastability in DORS, but which is resolved by the time ACT is triggered. Longer periods of metastability would still cause problems of course.

ACTRST is used to end the ACT pulse fairly quickly; it also stays asserted until DOR finally goes low again, which could take a while (e.g. 60ns? I still don't understand what the datasheet is saying). This may prevent ACT triggering a second time, but really DORCLK needs to be slow enough that DOR will reset in time (e.g. <10MHz).

It would be possible to AND together the DIR signals and feed that back to the CPU to allow it to poll ahead of writing data, or use it to cause clock stretching if a write is performed when there's no space, but I don't think I'm going to need this functionality.


PostPosted: Sat Sep 16, 2023 12:08 pm 

Joined: Sat Oct 09, 2021 11:21 am
Posts: 718
Location: Texas
Was studying your diagram for a bit now George, I think I'm understanding it now. Those 74HC40105s are the FIFOs, which capture both data and address data. Didn't know that even existed!

Are the data stored on the registers always just color data? Or is it address and color data? I'm wondering why you couldn't just directly connect those '40105s to the RAM, with some glue logic of course. You can only shove 8-bits at this thing at a time, so perhaps something like: Write Low Addr, Write High Addr, Write Color Data which could tell the RAM to finally actually look at the FIFOs. This would require a lot of those FIFO chips, but it would not require those 8-bit registers anymore. Just thinking out loud.

I'm also interested in what you discover here, because like you, my video circuits are always synced to the clock. Well, except for my stand-alone video cards, but those use shift registers and are very slow to read from or write to. And like you have hinted at, reading from them isn't the goal here.

What are your target input and target output speeds? When I was using shift registers, I kind of had a 'max input speed' of around 1 MHz. So if my target input computer was 10 MHz, it might have caused problems, but because there are a lot of extra cycles between actually shifting data out of the computer, a 5 MHz speed should be fine. As an example, rough numbers here. For output speed, I'm guessing you are going with 25.175 MHz or a division of that?

I'll be reading the updates, thanks George!

Chad


PostPosted: Sat Sep 16, 2023 1:43 pm 

Joined: Fri Jul 09, 2021 10:12 pm
Posts: 741
sburrow wrote:
Are the data stored on the registers always just color data? Or is it address and color data?
Not necessarily - you can pass through whatever is useful, and I haven't entirely thought that bit through - there are a lot of options. For a 6502 supplying the data, we can get 8 useful bits of data from the data bus, and maybe some more bits from the address bus. Operations like "STA $8000,X" can allow us to essentially write X and A at the same time, for 16 bits of data being passed through. However I'm more inclined to just stick with 8 bits of data, and pass some address lines as either control lines, or register indices - hence including four of those in the circuit.
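To illustrate the "STA $8000,X" trick, here is a small sketch of what the FIFO would see if it captured the low address byte alongside the data byte. It assumes a page-aligned port address, a decoder that claims the whole 256-byte page, and 16 bits of FIFO width; the function names are made up.

```python
FIFO_BASE = 0x8000   # assumed page-aligned port address, as in "STA $8000,X"

def indexed_store(base, x, a):
    """Return (address, data) as they appear on the bus for STA base,X."""
    address = (base + x) & 0xFFFF
    return address, a & 0xFF

def fifo_word(address, data):
    """Pack the low address byte and the data byte into one 16-bit word."""
    return ((address & 0xFF) << 8) | data

addr, data = indexed_store(FIFO_BASE, x=0x12, a=0x34)
print(hex(addr), hex(fifo_word(addr, data)))   # 0x8012 0x1234
```

Because the base is page-aligned, the low address byte is exactly X, so one store carries both registers.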

Quote:
I'm wondering why you couldn't just directly connect those '40105s to the RAM, with some glue logic of course. You can only shove 8-bits at this thing at a time, so perhaps something like: Write Low Addr, Write High Addr, Write Color Data which could tell the RAM to finally actually look at the FIFOs. This would require a lot of those FIFO chips, but it would not require those 8-bit registers anymore. Just thinking out loud.

It's appealing, and they do have tristate outputs - however, I have a feeling it would reduce the overall throughput. I have a lot more faith in the 8-bit registers that I've used before, especially as they have much better characteristics!

Another factor is the width of data I'd need to pass - let's say about 20 address lines and 8 data lines are needed to write a byte into video memory. We can put enough FIFOs in parallel to hold all of that, but then we also need to feed that breadth of data at the input end, and the 6502 can't supply it all at once - so we'd still need some registers at the 6502 end to hold some of the data, and I thought we might as well just put them at the output end instead.

For a first pass I'm going to just use this to set up a video address in registers, then pass some data, and have that written to the selected address. One option is to pass all these bytes in sequence - e.g. three address bytes followed by a data byte - and have the receiving circuit read each in turn; but most of the bytes don't need to change between write operations, as you're usually writing to a similar location to the last one that was written, so it makes more sense to use the address bits to select what the data represents. Maybe the code then looks like this:

Code:
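VIDEO_HADDR = $8000        ; horizontal portion of the video address
VIDEO_VADDR_LO = $8001     ; vertical portion, low byte
VIDEO_VADDR_HI = $8002     ; vertical portion, high byte
VIDEO_WRITEDATA = $8003    ; data byte; storing here triggers the video RAM write
    ...
    stx VIDEO_HADDR        ; select horizontal position X
    sta VIDEO_WRITEDATA    ; write A to video RAM at the selected address
    inx                    ; step to the next horizontal position
    stx VIDEO_HADDR
    sta VIDEO_WRITEDATA
    ... etc ...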


Each of those store instructions puts a byte into the FIFO, but with different address bits, and the receiving end uses those bits to store the byte in the matching video register. Writing to the WRITEDATA register additionally causes the receiving end to write that byte to video RAM at the address held in the other registers.
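
To make the receiving side a bit more concrete, here is a minimal Python model of that decode step. It is only a sketch under assumptions: the FIFO is taken to carry the two low address bits alongside each data byte, the three address registers are combined as (VADDR << 8) | HADDR, and video RAM is just a dictionary. None of those details are settled in the design described here, and the class and function names are made up for illustration.

```python
from collections import deque

# Register indices carried by the low address bits (A1..A0), matching
# the $8000-$8003 equates above. The bit layout is an assumption.
HADDR, VADDR_LO, VADDR_HI, WRITEDATA = 0, 1, 2, 3


class VideoReceiver:
    """Hypothetical model of the video side draining the FIFO."""

    def __init__(self):
        self.regs = [0, 0, 0]   # HADDR, VADDR_LO, VADDR_HI
        self.vram = {}          # address -> byte, stands in for video RAM

    def address(self):
        # Assumed layout: vertical address in the upper bits, horizontal below.
        vaddr = (self.regs[VADDR_HI] << 8) | self.regs[VADDR_LO]
        return (vaddr << 8) | self.regs[HADDR]

    def consume(self, reg, data):
        if reg == WRITEDATA:
            self.vram[self.address()] = data   # this entry triggers the RAM write
        else:
            self.regs[reg] = data              # otherwise just latch the register


def cpu_store(fifo, cpu_address, data):
    """Model a store to $8000-$8003: push (register index, data byte)."""
    fifo.append((cpu_address & 0x03, data & 0xFF))


fifo = deque()
receiver = VideoReceiver()

# Equivalent of: stx HADDR / sta WRITEDATA / inx / stx HADDR / sta WRITEDATA
x, a = 0x10, 0xAA
for _ in range(2):
    cpu_store(fifo, 0x8000, x)
    cpu_store(fifo, 0x8003, a)
    x += 1

# The video side drains the FIFO at its own pace.
while fifo:
    receiver.consume(*fifo.popleft())

print(sorted(hex(addr) for addr in receiver.vram))
```

The point of the model is that only the registers that change need to be resent, so two stores per pixel byte are enough while the vertical address stays put.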

Possibly for a second pass I might make the horizontal portion of the address automatically increment after each write. Another interesting option is to have the address in video registers represent a base address, and again use indexed addressing so that some of the CPU's address lines pass an offset through to the video circuit.

Code:
VIDEO_HADDR = $8000
VIDEO_VADDR_LO = $8001
VIDEO_VADDR_HI = $8002
VIDEO_WRITEDATA = $8008    ; bit 3 = write data; then bits 0,1,2 are an address offset

    ldy #7
loop:
    lda (zp_spritebase),y
    sta VIDEO_WRITEDATA,y
    dey
    bpl loop
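
The offset variant in the loop above can be modelled in the same spirit. This Python sketch assumes four address bits go through the FIFO, that bit 3 marks a data write and bits 0-2 are added to the base address already held in the video registers, and it treats that base as a single integer for brevity. The function names and the example base address are hypothetical.

```python
WRITE_FLAG = 0x08     # address bit 3: this FIFO entry is a data write
OFFSET_MASK = 0x07    # address bits 0-2: offset from the base address


def decode(addr_bits, data, base, vram):
    """Apply one FIFO entry; 'base' is the address already set in the video registers."""
    if addr_bits & WRITE_FLAG:
        vram[base + (addr_bits & OFFSET_MASK)] = data


def draw_sprite_row(sprite, base, vram):
    # Mirrors the 6502 loop: y counts 7..0, sta VIDEO_WRITEDATA,y
    for y in range(7, -1, -1):
        cpu_address = 0x8008 + y
        decode(cpu_address & 0x0F, sprite[y], base, vram)


vram = {}
sprite = [0x3C, 0x42, 0x81, 0x81, 0x81, 0x81, 0x42, 0x3C]
draw_sprite_row(sprite, base=0x1200, vram=vram)
print([hex(vram[0x1200 + i]) for i in range(8)])
```

With this arrangement each byte costs one store rather than two, because the offset travels with the data instead of being sent as a separate register write.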


In the longer term the CPU shouldn't be providing all this kind of data all the time, and should instead be sending commands that do more work on the video side in a GPU. Another stepping stone could be having the data sent by the CPU be a pixel bitmask for the next eight pixels, and some address bits specifying what colour to write into those pixels. Great for drawing text and single-colour graphics primitives, and it leads in to the GPU supporting horizontal line drawing primitives and then more complex operations.
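
As a rough illustration of the bitmask idea, the receiving side could expand each byte along these lines. The sketch assumes the most significant bit is the leftmost pixel, that the colour arrives separately (for example via address bits), and that pixels whose mask bit is clear are left untouched. All three are assumptions for illustration rather than decided details.

```python
def expand_mask(framebuffer, x, mask, colour):
    """Write 'colour' into the pixels selected by an 8-bit mask, starting at x."""
    for bit in range(8):
        if mask & (0x80 >> bit):          # MSB = leftmost pixel (assumed)
            framebuffer[x + bit] = colour
    return framebuffer


row = [0] * 16
expand_mask(row, 4, 0b10110001, colour=7)
print(row)
```

One byte through the FIFO then touches up to eight pixels, which is where the gain for text and single-colour primitives would come from.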

Quote:
What are your target input and output speeds? When I was using shift registers, I kind of had a 'max input speed' of around 1 MHz. So a 10 MHz computer might have caused problems, but because there are a lot of extra cycles between the instructions that actually shift data out of the computer, a 5 MHz computer should be fine. Those are rough numbers, just as an example. For output speed, I'm guessing you are going with 25.175 MHz or a division of that?

I want this to work with my fast PDIP system, without needing clock stretching, so a clock speed of about 35MHz. I think this is viable because, as you said, the write operations into the FIFO won't be on sequential cycles - the fastest case would be once every three cycles, I think, and even that can't really do any useful work other than clearing the screen perhaps. So any useful operations are going to be pretty spaced out, and the effective clock rate would be no more than 1/5th of the actual CPU clock, well below the 10MHz-15MHz threshold that I think these FIFOs can support.
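
That estimate can be sanity-checked with some quick arithmetic on the two code fragments above. The sketch uses the standard 6502 cycle counts: 4 for an absolute store, 2 for INX/DEY, 5 for LDA (zp),Y without a page crossing, 5 for STA abs,Y and 3 for a taken branch. The 35MHz figure is the one quoted here; the function name is just for illustration.

```python
CPU_MHZ = 35.0


def write_rate_mhz(cycles, writes, cpu_mhz=CPU_MHZ):
    """Average FIFO write rate for a sequence taking 'cycles' and doing 'writes' stores."""
    return cpu_mhz * writes / cycles


# stx VIDEO_HADDR (4) + sta VIDEO_WRITEDATA (4) + inx (2): two FIFO writes
column_fill = write_rate_mhz(cycles=4 + 4 + 2, writes=2)

# lda (zp),y (5) + sta abs,y (5) + dey (2) + bpl taken (3): one FIFO write
sprite_loop = write_rate_mhz(cycles=5 + 5 + 2 + 3, writes=1)

print(f"column fill: {column_fill:.2f} MHz")
print(f"sprite loop: {sprite_loop:.2f} MHz")
```

The unrolled column fill averages one write per five cycles (7MHz), in line with the 1/5th figure, and the sprite loop is well under that. These are averages: the two stores in the column fill are only four cycles apart, which is 8.75MHz instantaneous and still below the 10MHz end of the range.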

For the output I'll probably go with a 25.175MHz pixel clock and then see how I feel. I will keep the video memory access rate the same as in my past circuits - a quarter of the pixel clock - and this is the rate at which data will be removed from the FIFOs. Using a quarter of the pixel clock (or even an eighth) makes it a lot easier to generate the control signals for the video RAM, especially write-enable.
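
To get a feel for how the input and output rates interact, here is a small simulation of FIFO occupancy. Assumptions: the CPU pushes one byte every `cpu_cycles_per_write` cycles at 35MHz, the video side pops one byte every four pixel clocks at 25.175MHz whenever the FIFO is non-empty, and ripple-through delay inside the FIFO is ignored. It's a back-of-envelope model, not a timing analysis, and the function name is made up.

```python
def max_occupancy(n_writes, cpu_cycles_per_write,
                  cpu_mhz=35.0, pixel_mhz=25.175, divider=4):
    """Peak number of bytes waiting in the FIFO during a burst of n_writes."""
    write_period = cpu_cycles_per_write / cpu_mhz   # microseconds between writes
    read_period = divider / pixel_mhz               # microseconds between read slots
    occupancy = peak = written = 0
    next_write = 0.0
    next_read = read_period
    while written < n_writes or occupancy > 0:
        if written < n_writes and next_write <= next_read:
            occupancy += 1                          # CPU store lands in the FIFO
            written += 1
            peak = max(peak, occupancy)
            next_write += write_period
        else:
            if occupancy > 0:
                occupancy -= 1                      # video side consumes one byte
            next_read += read_period
    return peak


print("sprite-style loop (15 cycles/write):", max_occupancy(256, 15))
print("column fill (5 cycles/write):", max_occupancy(256, 5))
```

With one write per 15 cycles the video side always keeps up and the FIFO never holds more than one byte. At one write per 5 cycles the CPU averages 7MHz against a drain rate of about 6.29MHz, so the backlog grows slowly during a long burst, and the code would need to pause now and then to stay within the FIFO depth.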


PostPosted: Sat Sep 16, 2023 2:09 pm 
Offline

Joined: Sat Oct 09, 2021 11:21 am
Posts: 718
Location: Texas
gfoot wrote:
I want this to work with my fast PDIP system, without needing clock stretching, so a clock speed of about 35MHz.


Oof, that's fast! I probably ought to read that topic also :)

I think I'm seeing some basic needs. On my separate video cards, the input clock speed is significantly lower than the video clock speed of 25.175 MHz. Thus a (shift) register setup makes sense: Send one byte, send another byte, send a third byte, then have it all go at once, etc. I don't need to store any additional information because by the time I'm ready to do it all over again, the video circuit has processed that LONG AGO. So, no need for a FIFO.

But your input speed is faster than the video circuit, so it is possible for the computer to outrun the video circuit. Thus the need for the FIFOs.

Nathan (Paganini) mentioned dual-port RAM. I know those are far too small to be useful in a larger 'graphical' setting, especially anticipating colors, but could one act kind of like a FIFO itself, or replace it? Just an idea here, but if you were to write address data, color data, commands, sprite data, etc. to the dual-port RAM, the video circuit could pull from that like it would from a FIFO. You'd probably not use all of the dual-port RAM, but it might be within your speed capabilities and reduce chip count significantly.

Ultimately you know what is best. I learned all of my video circuitry from you George, and I am very grateful for that. Everybody has a particular favorite focus, and mine is video, and so I have you to thank for what I am now capable of doing. So, thank you. :)

Chad


PostPosted: Sat Sep 16, 2023 5:13 pm 
Offline

Joined: Fri Jul 09, 2021 10:12 pm
Posts: 741
sburrow wrote:
gfoot wrote:
I want this to work with my fast PDIP system, without needing clock stretching, so a clock speed of about 35MHz.


Oof, that's fast! I probably ought to read that topic also :)

If you prefer, I made a hackaday.io page about it, as more of a summary of where it's got to, without the history and discussions along the way: https://hackaday.io/project/192630-fast ... 2-computer

Quote:
I think I'm seeing some basic needs. On my separate video cards, the input clock speed is significantly lower than the video clock speed of 25.175 MHz. Thus a (shift) register setup makes sense: Send one byte, send another byte, send a third byte, then have it all go at once, etc. I don't need to store any additional information because by the time I'm ready to do it all over again, the video circuit has processed that LONG AGO. So, no need for a FIFO.

But your input speed is faster than the video circuit, so it is possible for the computer to outrun the video circuit. Thus the need for the FIFOs.

I think it varies quite a bit. For streaming simple data, the rates seem to be about the same - the video circuit can read at about 6MHz, and the CPU can send at about that speed, given its overheads. But usually the CPU needs to do quite a bit more work before it has the data to send - e.g. loading it from somewhere in memory - and so the CPU will be the bottleneck. But then, if/when I upgrade the GPU so that the CPU sends fewer, more complex operations, the GPU will then be the bottleneck as it will take many cycles to complete those operations. So I envisage quite a bit of back and forth on this point, and being able to use the FIFOs at full speed will probably be less important in the future.

Quote:
Nathan (Paganini) mentioned dual-port RAM. I know those are far too small to be useful in a larger 'graphical' setting, especially anticipating colors, but could one act kind of like a FIFO itself, or replace it? Just an idea here, but if you were to write address data, color data, commands, sprite data, etc. to the dual-port RAM, the video circuit could pull from that like it would from a FIFO. You'd probably not use all of the dual-port RAM, but it might be within your speed capabilities and reduce chip count significantly.

I think the more capable FIFOs that Jeff, Proxy etc have been suggesting - IDT720x series - are essentially that - a small amount of dual-port RAM with built-in address counters, and one port being write-only and the other read-only. They look like a good solution in general, I just don't like the price :)
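
That description maps onto a classic ring buffer. Here is a minimal Python sketch of the idea, with a write counter on one port and a read counter on the other. The class, its default depth and its overflow behaviour are illustrative assumptions, and it does not model the real parts' flag pins or timing.

```python
class RingFifo:
    """Toy FIFO: a small RAM plus independent write and read counters."""

    def __init__(self, depth=16):
        self.ram = [0] * depth
        self.depth = depth
        self.wr = 0      # write-port address counter (free-running)
        self.rd = 0      # read-port address counter (free-running)

    def count(self):
        return self.wr - self.rd

    def full(self):
        return self.count() == self.depth

    def empty(self):
        return self.count() == 0

    def write(self, value):
        """Write port: returns False (data dropped) if the FIFO is full."""
        if self.full():
            return False
        self.ram[self.wr % self.depth] = value
        self.wr += 1
        return True

    def read(self):
        """Read port: returns None if the FIFO is empty."""
        if self.empty():
            return None
        value = self.ram[self.rd % self.depth]
        self.rd += 1
        return value


fifo = RingFifo(depth=4)
accepted = [fifo.write(v) for v in (10, 20, 30, 40, 50)]   # fifth write overflows
drained = [fifo.read() for _ in range(5)]                  # fifth read underflows
print(accepted, drained)
```

Neither side ever touches the other's counter, which is why the two ports can run from unrelated clocks as long as the full and empty conditions are respected.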

Quote:
Ultimately you know what is best. I learned all of my video circuitry from you George, and I am very grateful for that. Everybody has a particular favorite focus, and mine is video, and so I have you to thank for what I am now capable of doing. So, thank you. :)

Thanks Chad. It's also been great to see your journey, and especially the results with the software you're writing - it's something I'm not sure I'll ever get around to, as it's too close to my day-job, and electronics is meant to be kind of an escape from that!

