PostPosted: Fri Sep 08, 2023 1:40 pm 

Joined: Fri Aug 03, 2018 8:52 am
Posts: 746
Location: Germany
for starters, the general idea is to have a VGA card with an onboard CPU to do some graphics work (drawing lines, moving rectangles, filling areas, etc.) and to interact with the main system (via FIFOs or similar).

so i'll just be using this post to ramble and document my thought process while coming up with different ideas. do note that i'm already planning to build a 65816 SBC around my Altera DE2 FPGA dev board, so i likely won't bother making a card like this right now. but i wanted to at least write it all out in case i come back to it later, or someone else finds something useful in this pile.

plus, even if i never use most of these ideas, i will definitely reuse the DMAC idea in the FPGA SBC and future projects as well.

.

anyways, let's start simple. one way to make such a card is to have the video circuit directly share the bus with the CPU. whenever a byte needs to be read by the video circuit, it either uses the first half of the CPU's cycle to read it... or pauses the CPU, does the read, and unpauses it. the latter is slower but makes the timing less tight.
the main downsides are that the CPU cannot run continuously, and that the CPU and video circuit need to share a common clock.

a slight upgrade to that idea would be to put a FIFO between the video circuit and the bus. so instead of reading 1 byte every few cycles, it could read a whole row of pixels in a single burst. this can again be done either by cycle stealing (in which case it doesn't make much of a difference) or by pausing the CPU.
the latter is again slower, but not as slow as before, because the overhead of pausing and unpausing the CPU is only paid at the start/end of the burst transfer, instead of before/after every byte.

Using FIFOs like that made me think: what if the video circuit only did the sync signal generation and reading from the FIFO, and the CPU itself were responsible for keeping the FIFO fed with data?
That would leave the CPU alone on the bus and even decouple the clocks, so the CPU could run at a different speed than the video circuit.
one downside is that the 65xx CPUs are not good at copying data quickly, specifically at the resolutions and color depths i'm aiming for.
320x200 and 400x300 are my targets (ideally at 8bpp). both are line-doubled, so every 2 consecutive scanlines show the same row of pixels. to avoid the CPU having to load the same set of bytes twice, there would be 2 FIFOs to the video circuit.
on the CPU side the data inputs and write lines are connected in parallel, so the CPU always writes to both FIFOs. on the video side, one FIFO's data is used for even scanlines and the other's for odd scanlines.
this means the CPU only needs to load each byte into the FIFOs once per frame, but that is still too much data for it to handle.

320x200 @ 4bpp = 32000 Bytes. a whole frame has 420000 cycles (at 25MHz), so if the CPU is running at 20MHz that gives 336000 CPU cycles.
loading 32000 Bytes within 336000 cycles is an average of 10.5 cycles per Byte. the same math for 400x300 @ 2bpp = 30000 Bytes gives 11.2 cycles per Byte.
either case would require a very, very tight copying loop, or some overclocking (which can get bottlenecked by the 25ns FIFOs i want to use).
and neither is a good option, because even if you just barely beat that average, it leaves no time for the CPU to do anything else, like communicating with the main system.
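the budget math above can be sanity-checked quickly (assuming the standard 640x480@60 VGA timing of 800x525 total dot clocks per frame, which is where the 420000 comes from):

```python
# Cycle budget for the software-fed FIFO, assuming standard
# 640x480@60 VGA timing: 800x525 total dot clocks per frame at 25 MHz.
DOTS_PER_FRAME = 800 * 525                               # 420000
CPU_MHZ, DOT_MHZ = 20, 25
CPU_CYCLES_PER_FRAME = DOTS_PER_FRAME * CPU_MHZ // DOT_MHZ  # 336000

def cycles_per_byte(width, height, bpp):
    # average CPU cycles available per framebuffer byte, per frame
    frame_bytes = width * height * bpp // 8
    return CPU_CYCLES_PER_FRAME / frame_bytes

print(cycles_per_byte(320, 200, 4))  # 10.5
print(cycles_per_byte(400, 300, 2))  # 11.2
```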

Clearly a software-fed FIFO is not the answer.
but since the video circuit no longer has to generate addresses and read bytes by itself, it frees up some space in the CPLD... so what if that space was used for a very bare-bones DMA controller to assist the CPU? and with "bare bones" i do mean bare bones: it would only ever read from memory to the video FIFO. so it has a single 16-bit source address register, an 8-bit byte count register, and 1 bit in the status/control register to enable it. there is also a hidden 8-bit counter that increments after every write operation to move the read pointer through memory.
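as a behavioural sketch of that register set (the class and register names are mine; in hardware this would just be a handful of counters in the CPLD):

```python
# Behavioural model of the bare-bones DMAC described above:
# a 16-bit source address, an 8-bit byte count, an enable bit,
# and a hidden 8-bit counter that walks the read pointer forward.
class BareBonesDMAC:
    def __init__(self, memory, video_fifo):
        self.memory = memory          # 64kB system RAM (list of ints)
        self.fifo = video_fifo        # write-only video FIFO (list)
        self.src = 0                  # 16-bit source address register
        self.count = 0                # 8-bit byte count register
        self.enable = False           # enable bit in the control register
        self._offset = 0              # hidden 8-bit increment counter

    def run(self):
        # Copies `count` bytes from memory into the video FIFO,
        # then clears the enable bit.
        if not self.enable:
            return
        self._offset = 0
        for _ in range(self.count):
            addr = (self.src + self._offset) & 0xFFFF
            self.fifo.append(self.memory[addr])
            self._offset = (self._offset + 1) & 0xFF
        self.enable = False
```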

now this may seem like a step back, as the video circuit is once again on the same bus as the CPU for reading bytes into the FIFO, except now the CPU has to set the transfer up manually, adding some overhead, where before it was automatic.
but there are some differences that i think make it better.
the first being that the DMAC is still separate from the video circuit, so the CPU and video clocks can remain separate; the DMAC just runs at the same speed as the CPU.
second, even though the CPU has extra overhead for handling the DMAC, the process is more controlled. instead of being forced off the bus, the CPU can now choose exactly when to give up the bus to the DMAC.
but it still needs to keep the FIFO fed, so to make timing easier i would send an IRQ to the CPU at the start of every second scanline. the IRQ line would be held low until the DMAC is activated.

while thinking about the design of the simple DMAC i thought of a slightly upgraded version that would also work on the interface FIFOs (which connect to the main system).
it would still have the same 16-bit address register and 8-bit byte count register, but now there would be 2 extra bits in the control register.
one bit controls the direction (either Memory -> FIFO, or FIFO -> Memory), and the other selects the FIFO (either the video or the interface FIFO). that way you could quickly read/write bytes from/to the main system.
of course the DMAC now has to deal with the FIFO flags as well, to avoid invalid data. specifically, if the FIFO it reads from is empty, or the FIFO it writes to is full, it will abort the current transfer, set a flag in the control register, and assert an IRQ to the CPU until the control register is read.
the byte count register decrements with each operation, so when a transfer is aborted the CPU can simply start it again to have it continue where it left off.
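roughly, that abort/resume behaviour would look like this (a sketch; FIFO_DEPTH and all names are illustrative, and i've folded the hidden increment counter into the address register for simplicity):

```python
# Sketch of the abort/resume behaviour: the byte count decrements per
# byte moved, so restarting after a FIFO stall simply continues the
# transfer. FIFO_DEPTH is an illustrative assumption.
FIFO_DEPTH = 512

class UpgradedDMAC:
    def __init__(self, memory, fifo):
        self.memory, self.fifo = memory, fifo
        self.src = 0              # 16-bit address register
        self.count = 0            # 8-bit byte count, decrements per byte
        self.mem_to_fifo = True   # direction control bit
        self.abort_flag = False   # set on a stall; would raise an IRQ

    def run(self):
        self.abort_flag = False
        while self.count > 0:
            if self.mem_to_fifo:
                if len(self.fifo) >= FIFO_DEPTH:   # destination full
                    self.abort_flag = True         # ...IRQ the CPU
                    return
                self.fifo.append(self.memory[self.src & 0xFFFF])
            else:
                if not self.fifo:                  # source empty
                    self.abort_flag = True
                    return
                self.memory[self.src & 0xFFFF] = self.fifo.pop(0)
            self.src = (self.src + 1) & 0xFFFF
            self.count -= 1
```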

anyways after that upgrade i pretty much immediately thought of another one. if the entire video circuit is moved to a smaller CPLD like an ATF1504, then almost the entire ATF1508 is empty and the DMAC can be made a bit more complex to allow for Memory -> Memory transfers. maybe even rectangular moves!

while theorizing about the registers and internal state machines i thought about how the current DMAC can only do a transfer in a single direction, either to or from a FIFO... and that gave me the idea that i don't even need to change anything about the registers or the way it works.
now, i don't know if anyone else has made a DMAC like this before; if not, i take full credit for the idea because i think it's pretty damn cool.

so, let me explain:

usually a DMAC will calculate the source address, read a byte/word, calculate the destination address, and then write the byte/word. but what if it was a 2-pass operation instead? imagine a FIFO in parallel with the CPU's data bus (i.e. both its input and output connected to it).
on the first pass the DMAC copies from memory into the special FIFO; on the second pass it copies from the FIFO back to memory.
functionally it's very similar, but now the CPU has to run the DMAC twice and swap the source for the destination address.
one pretty big plus is that the source and destination regions can overlap, as the data is copied to a third location before being written.
this also works very well with rectangular copies. if 1 bit in the control register switches between linear and rectangular, you can mix them between reading and writing. so for example a software sprite can be loaded linearly into the FIFO and then written to the screen memory rectangularly.
the speed should also be quite good. since you're directly interfacing the RAM and FIFO ICs, each operation should be done in a single cycle (25ns FIFO with 50ns cycle periods (20MHz), so using the full cycle for the transfer should make that possible). another cycle is then used to update counters, check if the FIFOs are full/empty, check if any counter reached its maximum, and update the state machine accordingly.
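a behavioural sketch of the two-pass idea (Python standing in for the state machine; PITCH and the function names are my own, assuming a 320-byte line pitch):

```python
# Two-pass DMAC sketch: pass 1 drains a region into the side FIFO,
# pass 2 writes the FIFO back out. Linear and rectangular passes can
# be mixed, and source/destination may overlap since the data sits
# in the FIFO in between. PITCH (bytes per screen line) is an
# illustrative assumption.
PITCH = 320

def dma_pass(mem, fifo, addr, count, write):
    """Linear pass: read `count` bytes into the FIFO, or write them back."""
    for i in range(count):
        if write:
            mem[addr + i] = fifo.pop(0)
        else:
            fifo.append(mem[addr + i])

def dma_pass_rect(mem, fifo, addr, width, height, write):
    """Rectangular pass: same, but stepping by PITCH between rows."""
    for row in range(height):
        for col in range(width):
            a = addr + row * PITCH + col
            if write:
                mem[a] = fifo.pop(0)
            else:
                fifo.append(mem[a])

# e.g. blit a linearly-stored 16x16 software sprite to (x, y):
#   dma_pass(mem, fifo, sprite_addr, 256, write=False)           # pass 1
#   dma_pass_rect(mem, fifo, y * PITCH + x, 16, 16, write=True)  # pass 2
```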

.

anyways, even with a DMAC that can do rectangles, it only works so well while each pixel takes up less than a byte.
ideally i'd want 8bpp, so that each pixel takes up exactly 1 Byte. not only does that give you a lot of colors to work with, it also turns the DMAC into a simple blitter, as it can now work at the same granularity as the pixels.
but the main problem with that is memory. 320x200 @ 8bpp requires 64000 Bytes, leaving only 1.5kB for work RAM, IO, and ROM (and 0.5kB of that is already used for the ZP and stack).
400x300 @ 8bpp needs a whopping 120000 Bytes, which won't even fit into the 65C02's address space without banking.
so a different CPU with a larger address space could be used. the 65816 is a pretty good choice, not only allowing for more memory but also having a much more powerful instruction set.
i would go for at least 128kB of RAM, so bank 0 contains the ROM, IO, and work RAM, while all of bank 1 is used for video memory. that still isn't enough for 400x300, but honestly i think that's fine; 320x200 seems plenty for now.
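for reference, the sizes involved:

```python
# Framebuffer size vs. the available address space.
def fb_bytes(w, h, bpp):
    return w * h * bpp // 8

print(fb_bytes(320, 200, 8))            # 64000 -> barely fits a 65C02
print(fb_bytes(400, 300, 8))            # 120000 -> needs a 65816 or banking
print(fb_bytes(320, 200, 8) <= 65536)   # True: fits in bank 1 of 128kB RAM
```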

of course using a 65816 adds some complexity as well, as i need to latch bit 0 of the multiplexed address/data bus for the 17th address line, and a 74'245 to demux the address/data bus. plus the DMAC needs an extra bit to select which bank to access.

more memory also means having to move more data per frame again. even with a DMAC it could get tight.
with the concept i have for the DMAC it should be able to transfer 1 Byte every 2 cycles, so at 20MHz that works out to a total bandwidth of ~9.5MB/s, while 320x200 @ 8bpp needs an average of ~3.66MB/s.
doing a bit more math: a full row of pixels (2 scanlines) takes 1280 CPU cycles, and within that time the DMAC needs to load the next row of pixels (320 Bytes) into the FIFO, taking up ~640 CPU cycles just for the copying, which is around 1 full scanline.
this means that within the active frame the CPU only really runs for half the time; during the vertical blanking period it can run all the time.
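the per-row numbers work out like this (again assuming 800 dot clocks per scanline):

```python
# Per-row DMA budget at a 20 MHz CPU clock: one row of pixels is
# shown on 2 scanlines of 800 dot clocks each (25 MHz dot clock).
CPU_MHZ, DOT_MHZ = 20, 25
DOTS_PER_SCANLINE = 800

row_cpu_cycles = 2 * DOTS_PER_SCANLINE * CPU_MHZ // DOT_MHZ  # 1280
dma_cycles = 320 * 2          # 320 bytes at 1 byte per 2 cycles: 640

print(row_cpu_cycles, dma_cycles)  # 1280 640 -> the CPU keeps ~50%
```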

.

one other (rather crazy) idea that came to mind was to loop all the way back around and go with dual-port RAM again, but still keep the second 65816, FIFOs, and DMAC. like my dumb VGA card, there would be 4x 16kB DP-RAM ICs (for a total of 64kB of VRAM) and an ATF1508 to generate sync and video signals from that RAM. but instead of the main system interfacing the DP-RAM directly, it would be the secondary 65816 with its DMAC in a second ATF1508, which then communicates with the main system via 2 FIFOs.

the PCB for this would be absolutely massive considering the dumb VGA card by itself is already very large (4x 68-pin PLCCs and 1x 84-pin PLCC chip take up quite a lot of space).

but it would be the best of both worlds: the video circuit fetches its own data without interfering with the CPU and its bus, the CPU doesn't have to worry about keeping the video circuit fed with data, and there is a second (separately clocked) 65816 to offload some work, with its own DMAC (which wouldn't work on my SBC directly because it cannot take the CPU off the bus) to speed up graphics work.

.

so, there are a lot of different paths i could take here (not even mentioning FPGAs which would make all of this much easier, but i'll get to them in a later project).

for that last one i could cheat a little... since the dumb VGA card already exists, i would just need to design an expansion card with the 65816, DMAC, RAM, ROM, and another expansion connector on its top. so the CPU card goes into the SBC and the dumb VGA card plugs into the CPU card.
that will make it quite tall and i might have to lay it on its side to avoid it toppling over, but at least i wouldn't have to completely remake the VGA part of the circuit. plus it saves me materials, time, and money for ordering and assembling such a huge board.

.

.

anyways, these were my current ramblings. once i get the FPGA board working i might come back here to talk about FPGA-specific designs that simply wouldn't fit onto CPLDs.


PostPosted: Sun Sep 10, 2023 6:02 am 

Joined: Mon May 21, 2018 8:09 pm
Posts: 1462
I encountered a video about the Fujitsu FM-4 recently, which is one of those weird old Japanese computers with two 6809s in it. The second 6809 is dedicated as a GPU, receiving instructions through a "mailbox" in a shared memory area, leaving the main CPU free to continue processing while a graphics operation is in progress. This was apparently successful enough as a concept that games for the FM-4 have significantly more advanced gameplay than contemporary games for single-CPU computers, even if the graphics are not noticeably more advanced. There is simply more time available to handle game logic when the main CPU isn't bogged down with graphics work.

With the 6809 using a similar bus architecture to the 6502, I considered how this could be translated into a "modern" design using the '816. It's not difficult to set up a two-way FIFO chip to act as a command channel, but a shared SRAM device is also quite feasible, even with the increased clock speeds of modern WDC parts. This would require that, as in the FM-4, there are buffers to isolate the shared memory from both of the buses it's connected to, and that some kind of arbitration mechanism is in place so that simultaneously attempted accesses do not interfere. A shared SRAM also has the definite advantage that, if it's mapped to the right part of the GPU's address space, the GPU can be made ROM-less, with the CPU being responsible for loading the firmware before releasing the GPU from reset. If the CPU and GPU are clocked at the same speed but out of phase, there is an unambiguous "first come, first served" strategy for arbitration available, with the latecoming request having RDY pulled down to delay the access. This would be well within the capability of an SPLD (maybe even overkill unless combined with other functionality in the same device), or it could be built from jellybean logic.

I've previously worked out that with a 48MHz dot clock, 800x512 75Hz SVGA output should be feasible, with only a few output-stage logic parts actually needing to run at that speed. With doubled pixels, the dot clock would be 24MHz and the effective resolution would be 400x256. This suggests a two-mode graphics architecture, one with high resolution but limited colour, the other with better colour but limited resolution. The high-resolution mode could be a single bitmap plus cell-based colour attribute, optimised for text display; the high-colour mode could be 6bpp to cover the colour space uniformly without too much complexity. These would need to fetch 16 and 24 bits respectively, 6 million times a second. Bear that in mind.
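The fetch arithmetic checks out, assuming an 8-pixel fetch granularity (one byte per plane per access):

```python
# Sanity check on the quoted fetch rates: 48 MHz dot clock, one fetch
# per 8 pixels; the high-colour mode runs doubled pixels at 24 MHz.
DOT_HZ = 48_000_000
fetch_hz = DOT_HZ // 8                      # one fetch per 8 dot clocks

hi_res_bits = 8 * 1 + 8                     # 8 bitmap pixels + 8 attribute bits
hi_col_bits = (24_000_000 * 6) // fetch_hz  # 6bpp at the halved dot clock

print(fetch_hz, hi_res_bits, hi_col_bits)   # 6000000 16 24
```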

It's also straightforward to divide the clock one more stage to 12MHz, and use that as both the CPU and GPU clock, without leaving too much on the table in terms of processing speed. The cycle time is 83ns, so basic 55ns SRAM parts will work fine for one access per cycle, but not for two. Bear that in mind, too.

The complete memory architecture for the GPU thus has five distinct memory devices, which minimally could all be 32Kx8 55ns SRAMs. One is directly connected to the bus and is mapped to the lower half of bank 0; the second is the shared memory with the CPU and is mapped to the upper half of bank 0. The other three form the framebuffer, and are connected directly to the video scanout hardware, and only indirectly to the bus, through three individual buffers. This allows half of the GPU cycles to be used for video scanout (with the required total data bus width of 24 bits), while the other half is available for bus accesses to the framebuffer; processing that doesn't directly involve framebuffer accesses still proceeds at full speed. Each plane of the framebuffer is mapped to one of the '816's higher banks. The '816's block-copy instructions could be used on the framebuffer at a rate of 8 cycles per copied byte.

Video scanout is accomplished primarily by latching the data from the framebuffer into 8-bit shift registers, after preconditioning it to resolve the colour attribute and/or pixel doubling. These shift registers are almost the only logic devices that actually need to operate at 48MHz. Everything else operates within the 83ns clock cycle, including the preconditioning logic and the scanout address counters.

In this architecture, it's also possible to consider ways to make the GPU's processing more convenient (and thus faster), such as by providing masked colour write operations. One way to do this would be to provide the 6-bit colour in some of the bank address bits, and then the data bus carries the mask, which would not be presented directly to the SRAMs but to selectors. This would cause all three framebuffer planes to go through a read-modify-write cycle during a single GPU clock cycle; I think there is just time to insert the required 20ns /WE pulse after obtaining valid data. Since this is the tightest timing margin, it would be the most likely justification to fit faster memory for the framebuffer. In a similar fashion, reads from these same addresses could return a mask of where the stored colour matches the one presented.

For maximum performance, we would like to have enough framebuffer to double-buffer, or to scroll freely in all directions without explicitly redrawing. The resolutions quoted each need 25KB per plane (with the high-resolution mode drawing alternate bitmap scanlines from two planes), so for double-buffering we would need to fit 64KB per plane. This is just a matter of fitting either a larger device or twice the number of them, and doesn't noticeably complicate the design.


PostPosted: Thu Sep 14, 2023 6:04 pm 

Joined: Fri Jul 09, 2021 10:12 pm
Posts: 741
I think a versatile DMA controller can be useful for a lot of things. I'm not sure I'd personally use it for this sort of video operation though, as it's always going to be slower and more intrusive than performing the blitting operations within video memory itself. Especially with a 6502, where host memory is very limited, I think it makes more sense to just have a lot of video memory - more than is needed for the framebuffer - and use that to store sprites and other graphics, with dedicated DMA-like circuitry inside the video circuit to support efficient copying of this data. It can do this in between "output" accesses to video memory, without slowing anything else down.

This is more or less how PC video hardware has worked since the late 80s. Ever since EGA, the video memory has been essentially 32 bits wide (four planes with 8 bits per plane), and there's been some form of support for operating on all 32 bits at once. At the core was a simple 32-bit latch: the CPU could execute an 8-bit read from one location then an 8-bit write to another, and the video card would read and write the whole 32 bits. There were also options to selectively change the data in flight (overwriting some of the colour planes for some of the pixels). For fast graphics, using the latch to quickly copy large amounts of data from one area of video memory to another was very important in early games.

Something else you might be interested in considering is the ANTIC processor from early Atari systems - if you're considering having the CPU generate a stream of pixel data for the video circuit to output, rather than the video circuit scanning a framebuffer itself - I think this is sort of what ANTIC did, and you could design a more advanced version these days that essentially dynamically works out where to read strips of pixel data from, to implement hardware sprites and things like that. Perhaps rather than storing a framebuffer then, you store a list of scanline definitions, each consisting of a list of ranges of video memory to output pixels from; the CPU can update that (or you DMA it across) and your graphics processor executes it, streaming pixels to the output circuit.
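A minimal sketch of that data structure, with each scanline described as a list of (address, length) spans (all names here are illustrative):

```python
# Display-list scanout sketch: instead of scanning a framebuffer,
# each scanline is a list of (address, length) spans into video
# memory; the scanout engine streams those pixels out in order.
def scanout_line(vram, spans, emit):
    for addr, length in spans:
        for i in range(length):
            emit(vram[addr + i])

def scanout_frame(vram, display_list, emit):
    for spans in display_list:      # one span list per scanline
        scanout_line(vram, spans, emit)
```

A sprite then becomes just an extra span spliced into the scanlines it covers, at the cost of the CPU (or a DMAC) rebuilding the affected span lists.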

These FIFO/DMAC ideas are very interesting and I look forward to seeing what innovative things you can do with them!


PostPosted: Fri Sep 15, 2023 8:06 pm 

Joined: Fri Aug 03, 2018 8:52 am
Posts: 746
Location: Germany
i like the idea of being able to load functions and such into the GPU's memory to give it different abilities, but i'm not sure about the shared SRAM.
mainly because it needs extra circuitry to keep both buses separate from each other, and it forces both CPU and GPU to use the same clock, something i really, really want to avoid.

using dual-port RAM is also an option. it's fairly cheap and way easier to hook up on both sides, plus it's fully asynchronous, so the clock speed of either side doesn't matter.
you could then preload the DP-RAM with a program and send a reset to the GPU via some control register or similar. one downside is the speed: the IDT700x DP-RAM chips i have come with a 55ns access time, so any 65xx CPU running above 8MHz will need at least 1 wait state to use them, making them as slow as SST39SF0x0 flash chips.

another option (the one my first, failed VGA card used) is to have a ROM onboard (as it can be physically smaller than shared or DP-RAM) containing a small bootloader (~256 Bytes) that loads a program over the bi-directional FIFO into the GPU's own faster RAM and then jumps to it.
it's effectively the same as sharing memory directly, but without the complexity of actually sharing it. though it is slightly slower to load the program.

on another note, how exactly would you generate the addresses for the framebuffer to shift out the data?
if you do the row doubling in hardware using a pair of FIFOs like i described above, then you can simply use a counter that increments and writes to the FIFOs on even scanlines, and does nothing on odd ones. once it reaches the last count value, it resets back to 0 for the next frame.
without FIFOs you have to set the counter back at the end of an even scanline, and let it continue normally on odd ones. which seems more complicated IMO unless you're using programmable logic.
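the counter-rewind variant can be sketched as a generator (purely behavioural, just to show the address sequence):

```python
# Address sequence for row doubling without FIFOs: rewind the counter
# at the end of every even scanline so the same addresses are fetched
# again on the following odd one.
def scan_addresses(bytes_per_line, scanlines):
    addr = 0
    for line in range(scanlines):
        start = addr
        for _ in range(bytes_per_line):
            yield addr
            addr += 1
        if line % 2 == 0:          # even scanline: repeat it next line
            addr = start

# 2 bytes/line over 4 scanlines -> 0, 1, 0, 1, 2, 3, 2, 3
```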

gfoot wrote:
I think a versatile DMA controller can be useful for a lot of things. I'm not sure I'd personally use it for this sort of video operation though, as it's always going to be slower and more intrusive than performing the blitting operations within video memory itself.

I'm not sure i completely follow. the DMAC i proposed would work directly on video memory, or on any other part of memory as well. and its speed is almost the same as a regular DMAC that reads a byte, stores it in a register, and then writes it back, just with all of the reads done first and the writes afterwards.

gfoot wrote:
Especially with a 6502, where host memory is very limited, I think it makes more sense to just have a lot of video memory - more than is needed for the framebuffer - and use that to store sprites and other graphics, with dedicated DMA-like circuitry inside the video circuit to support efficient copying of this data. It can do this in between "output" accesses to video memory, without slowing anything else down. [...]

yea, having more memory than necessary would be one reason to go with a CPU that can directly address more than a 65C02. plus, if you do go the route of using DP-RAM for the framebuffer memory, then the video circuit is completely off the GPU's bus and the DMAC and GPU can do whatever they want without interfering with the video circuit's reading of pixel data.
or use a couple of buffers to separate the GPU/DMAC bus from the video circuit's, so the GPU/DMAC can only access the framebuffer memory when the video circuit isn't reading from it itself (which can be helped with some FIFOs, so the video circuit loads an entire row of pixels at once, giving the GPU/DMAC more continuous time to copy bytes over).
this of course couples the clocks again, so the video circuit and GPU/DMAC have to run at some shared clock speed.

or do you mean something else and i'm misreading this?

gfoot wrote:
Something else you might be interested in considering is the ANTIC processor from early Atari systems - if you're considering having the CPU generate a stream of pixel data for the video circuit to output, rather than the video circuit scanning a framebuffer itself - I think this is sort of what ANTIC did, and you could design a more advanced version these days that essentially dynamically works out where to read strips of pixel data from, to implement hardware sprites and things like that.

i did look at the ANTIC before making the thread; it's one of the reasons i made it in the first place. but hardware sprites are a bit too complex.
just having a DMAC with rectangular copying is fast enough to implement some usable sprites.

gfoot wrote:
Perhaps rather than storing a framebuffer then, you store a list of scanline definitions, each consisting of a list of ranges of video memory to output pixels from; the CPU can update that (or you DMA it across) and your graphics processor executes it, streaming pixels to the output circuit.

now you're confusing me a bit. what you described is pretty much the whole point of having a VGA card with its own processor onboard: to do those kinds of operations to take work off the main system.
so the main system doesn't have to deal with a framebuffer or generate any graphics on its own. it just sends a list of commands and data to the VGA card via the FIFOs (or some DP-RAM), and the CPU on the VGA card (aka the GPU) then constructs the frame in its onboard frame/line buffer and has the video circuit read it out (or the GPU/DMAC pushes the pixels directly to the output).

of course it would never be as good as the ANTIC itself, unless you use an FPGA to implement some 100MHz+ softcore GPU to do all those operations insanely fast (or do it in hardware directly).

or are you proposing another layer of abstraction and have yet another CPU within the video circuit to help the GPU construct a frame of video? :lol:
hmmm, now that i think about it... and this is probably insane... but what about having 5 CPUs on a VGA card?
something like this:
Attachment:
yyXJA5YNPx.png

so you have 1 master CPU on the card that controls 4 "core" CPUs. each core CPU takes care of 1/4th of the framebuffer. this would allow you to do 4 operations at once, as long as they're on separate quarters of the screen.
something more ideal would have all core CPUs able to access the entire framebuffer instead of only 1 quarter of it, but that would require a more complicated bus architecture.
with a few buffers to only allow 1 core to access the framebuffer at a time, and said memory being fast enough to switch between all cores (plus the video circuit) within a cycle, it could effectively create n+1 port memory (n read/write ports, 1 for each core, and 1 pure read port for the output).

again, the whole idea is completely insane and would require a massive PCB, or something modular/stacked where you can plug "core cards" into a main video card PCB and use an FPGA or similar to handle the bus multiplexing to the framebuffer depending on how many cores are installed.

but it is a neat concept...


PostPosted: Fri Sep 15, 2023 8:27 pm 

Joined: Mon May 21, 2018 8:09 pm
Posts: 1462
Quote:
using dual-port RAM is also an option. it's fairly cheap and way easier to hook up on both sides, plus it's fully asynchronous, so the clock speed of either side doesn't matter.
you could then preload the DP-RAM with a program and send a reset to the GPU via some control register or similar. one downside is the speed: the IDT700x DP-RAM chips i have come with a 55ns access time, so any 65xx CPU running above 8MHz will need at least 1 wait state to use them, making them as slow as SST39SF0x0 flash chips.
I just checked the datasheet, and I think a 55ns part can be made to run with a 12MHz 65xx bus. As with most SRAMs, there is a significantly shorter /OE access time and /WE pulse width requirement than /CE and address access times. At that speed, you would only need to insert wait states when a write-to-read contention hazard is signalled by the /BUSY output. Even this could be avoided if you can guarantee (by software design) that no such hazards will occur, or if you use the hardware semaphore feature to mutex them.

If you do want faster running than that, or think that the timings are too tight to guarantee operation, 35ns or faster parts are also listed. These would be suitable for the WDC parts running at any speed within spec, and perhaps even overclocked.


PostPosted: Fri Sep 15, 2023 10:00 pm 

Joined: Fri Jul 09, 2021 10:12 pm
Posts: 741
Proxy wrote:
gfoot wrote:
I think a versatile DMA controller can be useful for a lot of things. I'm not sure I'd personally use it for this sort of video operation though, as it's always going to be slower and more intrusive than performing the blitting operations within video memory itself.

I'm not sure i completely follow. the DMAC i proposed would work directly on video memory, or on any other part of memory as well. and its speed is almost the same as a regular DMAC that reads a byte, stores it in a register, and then writes it back, just with all of the reads done first and the writes afterwards.

I think the bit where we differ here is I don't consider video memory to be "a part of memory" - I think of it as being quite separate, and arranged quite differently as there are a lot of benefits in doing that. As such it's hard for me to imagine a DMA controller that could read and write video memory just as easily as the CPU's regular memory. However I'm thinking of framebuffers here, rather than GPU program memory, and you make a good point:

Quote:
plus if you do go the route of using DP-RAM for the framebuffer memory then the video circuit is completely off the GPU's bus and the DMAC and GPU can do whatever they want without interfering with the video circuit's reading of pixel data.

The GPU + program memory (dual port?) + framebuffer + video circuit as a unit, feels a lot like a whole regular computer, just one that's optimized for and dedicated to drawing graphics. It is of course similar to how people use microcontrollers for this purpose. It also reminds me a bit in that sense of Acorn's Tube interface, where a host computer does all the grunt work of complex graphics operations etc, while another one on the end of a few FIFOs does all the foreground, higher-level work.

Quote:
gfoot wrote:
Perhaps rather than storing a framebuffer then, you store a list of scanline definitions, each consisting of a list of ranges of video memory to output pixels from; the CPU can update that (or you DMA it across) and your graphics processor executes it, streaming pixels to the output circuit.

now you're confusing me a bit, what you described is pretty much exactly the whole point for having a VGA card with it's own processor onboard. to do those kinds of operations to take work off the main system.

Yes sorry, my point was more that regardless of whether the CPU or a GPU is doing the work, you might get away with not having a large framebuffer at all and just having the GPU figure out what needs to be drawn on the next scanline just in time to fill a FIFO and do that. I don't know the pros and cons, it would probably allow for some interesting dynamic effects and would avoid the need for megabytes of framebuffer memory at higher resolutions, but would also present some restrictions I suppose. And yes, maybe this would mean there'd be two stages of GPU... :D

Quote:
hmmm, now that i think about it... and this is probably insane... but what about having 5 CPUs on a VGA card?

Oh at least! :D

Quote:
something like this:
Attachment:
yyXJA5YNPx.png

so you have 1 Master CPU on the card that controls 4 "core" CPUs. each core CPU takes care of 1/4th of the framebuffer. this would allow you to do 4 operations at once as long as they're on separate quarters of the screen.

Interesting. Modern GPU hardware actually benefits more from having multiple execution units processing the same command stream on roughly the same data. I'm not sure if that would be true here as well though - some of the reasons, like cache coherency, don't apply, and others are only really relevant for 3D graphics (being able to detect "gradients" between adjacent pixels, because a 2x2 "quad" of pixels are calculated in tandem, so as soon as you've done some maths, you can immediately compare it to two of your neighbours). This is probably not the sort of operation that's going to matter here though, so having them work on different things might scale better.
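One way to picture the quarter-split from the quote above, assuming each core owns one horizontal band of the framebuffer (the post doesn't say which way the quarters are cut), as a rough Python sketch with made-up names:

```python
# Sketch of dispatching draw operations to 4 "core" CPUs, assuming each
# core owns one horizontal quarter (band) of a 240-line framebuffer.
# All names are hypothetical.

HEIGHT = 240
BAND = HEIGHT // 4  # 60 scanlines per core

def core_for_line(y):
    """Which core CPU owns scanline y."""
    return y // BAND

def cores_for_rect(y0, y1):
    """Cores touched by a rectangle spanning scanlines y0..y1; an
    operation crossing a band boundary has to be split between cores
    (or handed to the master CPU)."""
    return sorted({core_for_line(y) for y in range(y0, y1 + 1)})
```

Operations confined to one band run fully in parallel; anything straddling a boundary needs coordination, which is exactly the cost of this partitioning.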

Quote:
something more ideal would have all core CPUs be able to access the entire framebuffer instead of only 1 quarter of it. but that would require a more complicated bus architecture.

In the PC world, SLI was a technique to allow two video cards to cooperate on a scene with almost no communication between them. They just rendered alternate scanlines of the scene to half-height framebuffers, which were then composited and sent to output. Again, a somewhat different type of rendering, but interesting by comparison. Each only needed to draw half the number of pixels, but all the other work was replicated between them and so there wasn't always any benefit.

Sorry these are mostly random anecdotes from the PC world, but maybe there's some inspiration to be had. In a 6502 world, I can imagine storing odd and even scanlines in different RAM ICs, with separate buses for two GPUs to write into, with the video circuit reading lines alternately from one or the other. Writing pixels is pretty expensive I think, in our context, so it could be more beneficial to share the load than it ultimately was in the PC world.


PostPosted: Fri Sep 15, 2023 10:26 pm 
Offline

Joined: Mon May 21, 2018 8:09 pm
Posts: 1462
The whole "writing pixels is pretty expensive" thing is why I suggested having masked-write hardware built into the framebuffer circuit. For a simple 1bpp bitplane, TSB and TRB do basically the same thing and are nearly as fast.
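For illustration (not from the post itself), here is a quick Python model of what a masked framebuffer write does, alongside what the 65C02's TSB and TRB do on a 1bpp bitplane: TSB ORs the accumulator into memory (setting pixels), TRB ANDs the complement in (clearing them).

```python
# Python model of a masked framebuffer write and the 65C02 TSB/TRB
# equivalents on a 1bpp bitplane. The framebuffer and names are illustrative.

fb = [0x00] * 8  # tiny 1bpp framebuffer, one byte = 8 pixels

def masked_write(addr, data, mask):
    """Hardware masked write: only the bits set in mask are replaced."""
    fb[addr] = (fb[addr] & ~mask) | (data & mask)

def tsb(addr, a):
    """TSB: set every framebuffer bit that is set in A."""
    fb[addr] |= a

def trb(addr, a):
    """TRB: clear every framebuffer bit that is set in A."""
    fb[addr] &= ~a & 0xFF
```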


PostPosted: Sun Sep 17, 2023 4:23 pm 
Offline
User avatar

Joined: Tue Feb 28, 2023 11:39 pm
Posts: 257
Location: Texas
Proxy wrote:
...
on another note, how exactly would you generate the addresses for the frame buffer to shift out the data?
if you do the row doubling in hardware using a pair of FIFOs like i described above, then you can simply use a counter that increments and writes to the FIFOs on even scanlines, and does nothing on odd ones. once it reaches the last count value, it resets back to 0 for the next frame.
without FIFOs you have to set the counter back at the end of an even scanline, and let it continue normally on odd ones. which seems more complicated IMO unless you're using programmable logic.


Honestly, this is probably the most confusing part of your design to me. Assuming the FIFOs are holding pixel data and not commands, then maybe? Reading from memory twice, though, is a simple matter of effectively hard-wiring a single-bit shift into the counters. No fancy extra multiplexing, no complicated logic to add; just start from the two's place.
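The "start from the two's place" trick can be sketched in a couple of lines of Python (names are illustrative): feed the scanline counter into the address generator shifted right by one bit, and each framebuffer row is automatically fetched on two successive scanlines.

```python
# Sketch of line doubling via a hard-wired bit shift: dropping the
# scanline counter's LSB makes every framebuffer row get fetched twice.
# BYTES_PER_ROW is an illustrative value.

BYTES_PER_ROW = 320

def fetch_address(scanline, byte_in_row):
    """Address generator with the scanline counter shifted right by one."""
    return (scanline >> 1) * BYTES_PER_ROW + byte_in_row
```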

If your plan is to only ever read from the FIFOs and never have any actual framebuffer, then I could see this, but then you'd end up multiplexing them just as much as everything else you seem to be trying to avoid.

If the FIFOs are to hold commands for your second "GPU" (For lack of a better term), then I would think you would not want to process those commands more than once per pixel. Think about operations that work with math, such as using XOR to mask/unmask sections for things like sprites or a mouse cursor.


PostPosted: Sun Sep 17, 2023 5:57 pm 
Offline
User avatar

Joined: Fri Aug 03, 2018 8:52 am
Posts: 746
Location: Germany
Yuri wrote:
Honestly, this is probably the most confusing part of your design to me. Assuming the FIFOs are holding pixel data and not commands, then maybe?


that may have been a bit confusing, but my concept would have 2 sets of FIFOs.

one pair form the bidirectional interface between the Main System and the GPU (for commands), and the other pair goes unidirectionally from the Video Memory to the output DAC (for pixel data).

Yuri wrote:
Reading from memory twice, though, is a simple matter of effectively hard-wiring a single-bit shift into the counters. No fancy extra multiplexing, no complicated logic to add; just start from the two's place.

I'm not talking about reading the same memory location twice though. that would just draw the same pixels twice in a row.
but the goal is to draw a whole scanline twice without having to read from memory twice.
so that's why i would use 2 FIFOs towards the output DAC.

let me try to explain it in more detail:

the video circuit has a simple linear counter that steps through memory and fills both FIFOs with the same pixel data at the same time.
then on the next scanline the output side of the circuit reads from the first FIFO and on the scanline after that it reads from the second FIFO. (meanwhile in the background the counter loads the next scanline of data into both FIFOs)
the result is 2 identical scanlines being drawn to the screen while the video circuit only had to read 1 scanline worth of pixels from memory.

and just for completeness, the reason to draw 2 identical scanlines in a row is to effectively halve the vertical resolution but still have the image fill out the whole screen.
halving the horizontal resolution is much simpler, as you just have to shift out pixels at half the speed, causing each pixel to get drawn twice in a row.
that's how you get 320x240 from 640x480, 400x300 from 800x600, etc.
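to make the scheme concrete, here's a small Python simulation of the two-FIFO line doubling (serialized for clarity; in real hardware the fill of the next row overlaps the output of the current pair):

```python
# Simulation of the two-FIFO line-doubling scheme: the fill counter writes
# each fetched row into both FIFOs at once, and the output side drains
# FIFO A on one scanline and FIFO B on the next, so every memory row is
# read once but displayed twice. Sizes and names are illustrative.

from collections import deque

ROW = 4  # pixels per row, kept tiny for the example
memory = [list(range(r * ROW, r * ROW + ROW)) for r in range(3)]  # 3 rows

fifo_a, fifo_b = deque(), deque()
screen = []

for row in memory:
    # fill phase: one pass over memory feeds both FIFOs simultaneously
    fifo_a.extend(row)
    fifo_b.extend(row)
    # output phase: one scanline drains FIFO A, the next drains FIFO B
    screen.append([fifo_a.popleft() for _ in range(ROW)])
    screen.append([fifo_b.popleft() for _ in range(ROW)])
```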

hope that cleared some things up?


PostPosted: Sun Sep 17, 2023 6:45 pm 
Offline

Joined: Fri Dec 21, 2018 1:05 am
Posts: 1117
Location: Albuquerque NM USA
If you use a 720x FIFO, it has a “retransmit” pin that allows you to transmit the same data again from the beginning of the FIFO queue.
Bill


PostPosted: Sun Sep 17, 2023 7:04 pm 
Offline
User avatar

Joined: Fri Aug 03, 2018 8:52 am
Posts: 746
Location: Germany
oh yea i forgot that bit exists, though i don't know how useful it would be.
if i read the datasheet correctly, asserting retransmit just sets the read pointer back to 0, which is not guaranteed to point to the start of the data for the current scanline.
since the data for the next scanline is being loaded while the current one is being drawn, the write pointer just keeps adding onto the existing data. and you can't simply reset the FIFO to align the data with the start of the internal buffer, as that would delete the rest of the current scanline plus the data of the next one.


PostPosted: Sun Sep 17, 2023 8:25 pm 
Offline

Joined: Fri Dec 21, 2018 1:05 am
Posts: 1117
Location: Albuquerque NM USA
My feeling (just a feeling, not tested) is you can manipulate the reset, retransmit, read strobe and write strobe signals to retransmit old data in the FIFO queue, yet still fill the FIFO with new data starting at 0 while the previous set of data is being retransmitted. I don't know, it may not work out, but saving a FIFO is worth trying.
Bill


PostPosted: Sun Sep 17, 2023 9:03 pm 
Offline
User avatar

Joined: Fri Aug 03, 2018 8:52 am
Posts: 746
Location: Germany
the only way i can see it working is if you reset the FIFO between the duplicated and new scanline, which means you have to load the next row of pixels within the horizontal blanking period, giving you almost no time at all.

so the FIFO starts empty, during horizontal blank it's filled with enough data for a row of pixels. then on the first scanline the output side reads from the FIFO like normal.
afterwards the retransmit input is asserted and the read pointer is moved back to 0 (start of the data), then the second scanline is drawn using the same data.
after that the FIFO is reset to move the read and write pointer back to 0, then finally the input side of the FIFO is allowed to fetch new data for the next row of pixels.

for example, 320x200 @ 8bpp means 320 bytes per row, the blanking period is only 160 cycles long, so the write speed of the FIFO would make this impossible.
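the arithmetic behind that claim, sketched in Python (assuming standard VGA line timing of 800 pixel clocks per line with 640 active, so 160 clocks of blanking; these timing figures are my assumption, not from the post):

```python
# Rough budget check: can the FIFO be refilled during horizontal blanking
# alone? Assumes 800 pixel clocks per line, 640 active (standard VGA),
# and 320x200 @ 8bpp, i.e. 320 bytes per displayed row.

H_TOTAL, H_ACTIVE = 800, 640
H_BLANK = H_TOTAL - H_ACTIVE          # 160 pixel clocks of blanking
BYTES_PER_ROW = 320                   # 320x200 @ 8bpp

# Even at one FIFO write per pixel clock, blanking offers only 160 writes,
# half of what a full row needs.
writes_available = H_BLANK
shortfall = BYTES_PER_ROW - writes_available
```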

alternatively you could write dummy data to the FIFO to wrap the write pointer around to 0 and use the Full Flag to write new bytes to the FIFO just as the old ones were re-transmitted.
that would technically give you more time to load the next row of pixels, as you're only bottlenecked by how fast the circuit writes the pixels to the DAC.
you then also need to trigger the re-transmit input after every scanline, since it would otherwise continue drawing using the dummy bytes.
also writing that many dummy bytes takes a long time as well... i only have the 1024x9 variants, so for each row of pixels (320 Bytes) i would need to write an additional 704 Bytes.
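the dummy-byte cost worked out in Python, for the 1024-deep parts mentioned above:

```python
# Wrapping the write pointer of a 1024-deep FIFO back to 0 after a
# 320-byte row means padding the rest of the buffer with dummy writes.

FIFO_DEPTH = 1024
BYTES_PER_ROW = 320
dummy_bytes = FIFO_DEPTH - BYTES_PER_ROW  # wasted writes per row
```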

man, if only the retransmit feature worked a bit differently: have a second input that, when asserted, saves the current read pointer to some temporary register; then when retransmit is asserted, it sets the read pointer to the value in that register instead of to 0.
that would completely solve these issues as you would no longer need to reset the FIFO or write a lot of dummy bytes...


PostPosted: Sun Sep 17, 2023 9:16 pm 
Offline
User avatar

Joined: Tue Feb 28, 2023 11:39 pm
Posts: 257
Location: Texas
Proxy wrote:
Yuri wrote:
Honestly, this is probably the most confusing part of your design to me. Assuming the FIFOs are holding pixel data and not commands, then maybe?


that may have been a bit confusing, but my concept would have 2 sets of FIFOs.

one pair form the bidirectional interface between the Main System and the GPU (for commands), and the other pair goes unidirectionally from the Video Memory to the output DAC (for pixel data).

Yuri wrote:
Reading from memory twice, though, is a simple matter of effectively hard-wiring a single-bit shift into the counters. No fancy extra multiplexing, no complicated logic to add; just start from the two's place.

I'm not talking about reading the same memory location twice though. that would just draw the same pixels twice in a row.
but the goal is to draw a whole scanline twice without having to read from memory twice.
so that's why i would use 2 FIFOs towards the output DAC.

let me try to explain it in more detail:

the video circuit has a simple linear counter that steps through memory and fills both FIFOs with the same pixel data at the same time.
then on the next scanline the output side of the circuit reads from the first FIFO and on the scanline after that it reads from the second FIFO. (meanwhile in the background the counter loads the next scanline of data into both FIFOs)
the result is 2 identical scanlines being drawn to the screen while the video circuit only had to read 1 scanline worth of pixels from memory.

and just for completeness, the reason to draw 2 identical scanlines in a row is to effectively halve the vertical resolution but still have the image fill out the whole screen.
halving the horizontal resolution is much simpler, as you just have to shift out pixels at half the speed, causing each pixel to get drawn twice in a row.
that's how you get 320x240 from 640x480, 400x300 from 800x600, etc.

hope that cleared some things up?


Kinda?

I figured you were doing this to halve the resolution, as you say 320x240 from 640x480 (which is what I hope to get). What I'm not sure about is what benefit the two FIFOs give over just scanning the framebuffer twice. Is the goal to leave more free time for access to the framebuffer for other GPU compute tasks?


PostPosted: Sun Sep 17, 2023 9:38 pm 
Offline
User avatar

Joined: Fri Aug 03, 2018 8:52 am
Posts: 746
Location: Germany
yea, the idea was to reduce the amount of time the video circuit occupies the bus so that the GPU, DMAC, and whatever else have more time to do work.
cycle stealing would be an alternative. it saves you 1 FIFO IC (as you can read the same row of pixels twice without downsides) at the cost of having to run the GPU bus slow enough to avoid wait cycles when accessing FIFOs or a ROM.
you would have to do some math to see which gives you more processing time: running the GPU at 8-10-ish MHz but always active, or running it at 11-20-ish MHz but having it stopped every second scanline for the video circuit to load new data.
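a rough way to set up that math in Python (ballpark clock figures from above; assuming 640x480@60Hz line timing of about 31.77 µs, and the worst case where the video burst halts the GPU for a whole scanline):

```python
# Compare effective GPU cycles per scanline pair under the two schemes.
# LINE_US assumes 640x480@60Hz timing; clocks are the ballpark values
# from the discussion. Ignores the cycles lost to the steals themselves.

LINE_US = 31.77  # microseconds per scanline

def cycles_cycle_stealing(mhz):
    """GPU always runs, but at a lower clock (e.g. ~8-10 MHz)."""
    return mhz * LINE_US * 2  # both scanlines of the doubled pair

def cycles_halted(mhz):
    """GPU runs faster (~11-20 MHz) but is halted every second scanline
    while the video circuit bursts a row into the FIFO (worst case)."""
    return mhz * LINE_US * 1  # only one scanline of the pair

always_on = cycles_cycle_stealing(10.0)
halted = cycles_halted(16.0)
```

under these assumptions, 10 MHz always-on comes out ahead of 16 MHz halted half the time; the halted scheme would need more than double the clock to break even.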

hmm, could you use clock stretching to allow cycle stealing while also having wait states for slower ICs? technically they shouldn't interfere with each other... if the video circuit wants to write to the FIFO it extends the first half of the clock cycle, and if the GPU wants to access a FIFO or a ROM it extends the second half.
but since the video circuit is pretty much always writing to the FIFO, almost every clock cycle would have its first half extended, pretty much nullifying the benefits of a faster clock for the GPU...

