Managing bus contention with 16K VRAM window at 320 x 200 @

AXY · Post by **AXY** » Tue Mar 17, 2015 5:42 am

Hi all,

I'm working on the video portion of my homebrew computer project. I've already implemented a CGA "video card" in an FPGA which interfaces directly with a 16K SRAM window and an 8K font ROM. My problem: how to share this 16K VRAM with the CPU. If you're not familiar with CGA, it basically boils down to the common modes:

- 16K of 2-bit packed pixels, selectable palette, 320 x 200
- 8 x 1K pages of ASCII-ish character data
- 8 x 1K pages of 4-bit/4-bit packed foreground and background color that match with the corresponding character

The FPGA is actually running at 50 MHz with a 25 MHz VGA pixel clock but only buffers bytes as necessary for drawing 320 x 200 to prevent glitches on pixel transitions. There are HSYNC and VSYNC periods where the FPGA mostly doesn't need to access memory, except right before going back to (0, y), where it needs to buffer the first byte of the next row before churning out pixels.

A couple of approaches I've considered:

- Have the 65C02 and the FPGA swap two independent SRAM chips using a boatload of bus transceivers (2 for data, 2 for A0..A7, 2 for A8..A13) during VSYNC, achieving a kind of double-buffering
- Actually use double buffering
- Only draw during HSYNC/VSYNC
- Pause 65C02 with RDY
- Go through a peripheral chip
- Do Apple II-style bus sharing (seems difficult, especially with the FPGA operating at 50 MHz and 25 MHz pixel clock)
- Dual-ported SRAM (never used this before - part recommendations, anyone?)

Am I missing anything? Pros/cons of each?

Rob Finch · Post by **Rob Finch** » Tue Mar 17, 2015 8:59 am

I'd go with Apple II bus sharing if possible. How fast can the 65C02 run? Could you interleave the video access with the 65C02's access by providing a clock from the video subsystem?
For instance, the video byte data is required at 1/4 the pixel clock rate, or about 6 MHz. If the 65C02 could run at 6 MHz you might be able to interleave the accesses Apple II style.

You could run the processor in strips during the horizontal draw. Buffer the video data in the FPGA using a FIFO for the line in the first part of the line draw. It's only 80 bytes for the entire line. The video data would fill the buffer at a 25 MHz rate, then be displayed from the FIFO at a 6.25 MHz rate. That would leave about 3/4 of a scanline free for the 65C02 to access RAM.

The 65C02's clock could be stopped while the video buffer is loading.
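A rough timing sketch of the strip idea above, in Python. The function name and the 800-clock scanline are assumptions for illustration (the thread only gives the 25 MHz fill rate, the 6.25 MHz display rate, and the 80 bytes per line); treat the numbers as estimates, not measurements.

```python
# Timing sketch for "burst-fill a line FIFO, then release the bus".
# Rates and byte count come from the thread; the 800-clock scanline
# (640 visible + 160 blanking) is an assumption about the VGA timing used.

BYTES_PER_LINE = 80          # 320 pixels * 2 bits / 8
FILL_RATE_HZ = 25_000_000    # FIFO fill rate suggested in the thread
DISPLAY_RATE_HZ = 6_250_000  # FIFO drain rate suggested in the thread


def free_fraction(n_bytes, fill_rate_hz, window_s):
    """Share of a time window left over after one burst fill of n_bytes."""
    fill_time_s = n_bytes / fill_rate_hz
    return 1.0 - fill_time_s / window_s


# Window = time needed to display the 80 bytes at the drain rate.
display_window_s = BYTES_PER_LINE / DISPLAY_RATE_HZ
estimate = free_fraction(BYTES_PER_LINE, FILL_RATE_HZ, display_window_s)

# Window = a whole scanline, assuming 800 pixel clocks per line.
full_line_s = 800 / FILL_RATE_HZ
whole_line = free_fraction(BYTES_PER_LINE, FILL_RATE_HZ, full_line_s)

print(f"free during the display window: {estimate:.0%}")
print(f"free over an assumed 800-clock line: {whole_line:.0%}")
```

Measured against the display window this gives the "about 3/4" figure; measured against a full scanline that includes blanking, the free share would be larger.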

AXY · Post by **AXY** » Thu Mar 19, 2015 4:29 am

Rob Finch wrote:

I'd go with Apple II bus sharing if possible. How fast can the 65C02 run? Could you interleave the video access with the 65C02's access by providing a clock from the video subsystem?
For instance, the video byte data is required at 1/4 the pixel clock rate, or about 6 MHz. If the 65C02 could run at 6 MHz you might be able to interleave the accesses Apple II style.

I'm hoping to get at least 10 MHz out of the 65C02. This is a really elegant hack but I think I'm going to save this one as a last resort because it seems like a delicate dance. If I were doing even lower resolution composite video out I would totally give this a try.

Rob Finch wrote:

You could run the processor in strips during the horizontal draw. Buffer the video data in the FPGA using a FIFO for the line in the first part of the line draw. It's only 80 bytes for the entire line. The video data would fill the buffer at a 25 MHz rate, then be displayed from the FIFO at a 6.25 MHz rate. That would leave about 3/4 of a scanline free for the 65C02 to access RAM.

This is an interesting suggestion ... I wonder if I could get the whole row during HSYNC. Right now, I'm only buffering one byte ahead and just in time, too. I'll definitely consider it.

Rob Finch wrote:

The 65C02's clock could be stopped while the video buffer is loading.

This one seems pretty appealing to me from an implementation standpoint, although the performance hit is unfortunate.

There are some interesting dual-port through-hole SRAM chips from IDT at Mouser, but they are something like $15 / 32Kbit, which is just a little too steep for me to care. I think I'm going to try this approach first:

- One single separate SRAM chip for VRAM. The FPGA gets first dibs but will release its hold whenever it doesn't need it - all of the writing will happen during HSYNC and VSYNC.
- (Optional) Make the FPGA buffer the entire line in one burst as fast as it can. This would create larger contiguous access windows.

For Phase 1, I imagine it would be something like the following:

- 3x octal bus transceivers between the CPU bus and VRAM
- Hold RDY low and transceiver /OE high if the CPU's address decodes to the VRAM window and the FPGA isn't in HSYNC or VSYNC. This signal should be really easy to emit from the FPGA and I've still got plenty of I/O there.
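The gating rule in the list above can be written as a small truth function. This is only a behavioural sketch in Python, not Verilog: the window base address `VRAM_BASE` is hypothetical, and keeping /OE high when the CPU is not addressing VRAM at all is an assumption the post does not spell out.

```python
# Behavioural sketch of the RDY / transceiver gating rule.
VRAM_BASE = 0x8000  # hypothetical location of the 16K window
VRAM_SIZE = 0x4000  # 16K


def vram_selected(addr):
    """True when the CPU address decodes to the VRAM window."""
    return VRAM_BASE <= addr < VRAM_BASE + VRAM_SIZE


def arbitrate(addr, in_hsync, in_vsync):
    """Return (rdy, oe_n) for one CPU cycle.

    rdy  = 0 stalls the CPU; oe_n = 1 isolates the CPU bus from VRAM.
    """
    blanking = in_hsync or in_vsync
    selected = vram_selected(addr)
    stall = selected and not blanking   # CPU wants VRAM, FPGA owns it
    grant = selected and blanking       # CPU wants VRAM, FPGA is idle
    rdy = 0 if stall else 1
    oe_n = 0 if grant else 1            # assumption: isolated unless granted
    return rdy, oe_n


for addr, hs, vs in [(0x8000, False, False), (0x8000, True, False), (0x0200, False, False)]:
    print(hex(addr), hs, vs, arbitrate(addr, hs, vs))
```

A real implementation would also have to respect the CPU's setup and hold timing around RDY; this sketch ignores timing entirely.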

It's pretty amazing how difficult things get when you want to do video. Thanks for the suggestions, Rob!

cr1901 · Post by **cr1901** » Thu Mar 19, 2015 11:44 am

Is shared VRAM a hard requirement?

If I were doing this project, I probably wouldn't share VRAM between the CPU and the video controller. I'd do something similar to the Sega Genesis VDP, where the 68k talks to the VDP using a single I/O port and transfers the picture all at once during Vblank or Forced Blank using DMA (might not be as simple without the '816's block-move, or until someone creates a DMA controller that respects 65xx timing).

I have a dual port SRAM on hand- Cypress CY7C130-30PC. It's nowhere NEAR enough RAM for gfx, but this should give you some hints nonetheless.

AXY · Post by **AXY** » Fri Mar 20, 2015 3:39 am

cr1901 wrote:

Is shared VRAM a hard requirement?

Not at all. It seemed like a pretty simple hardware-based approach with minimal Verilog changes, though.

cr1901 wrote:

If I were doing this project, I probably wouldn't share VRAM between the CPU and the video controller. I'd do something similar to the Sega Genesis VDP, where the 68k talks to the VDP using a single I/O port and transfers the picture all at once during Vblank or Forced Blank using DMA (might not be as simple without the '816's block-move, or until someone creates a DMA controller that respects 65xx timing).

I still might try this approach, but at some point, even the FPGA is going to be competing with itself in writing to the VRAM. Right now it assumes read-only. Do you think I can transfer 16K in 0.0014299900695133761 s (VSYNC window)? I don't know if I can at 10 MHz or 12MHz. If I don't do it all in one go, I need to be able to squeeze writes in throughout the frame or have some other second buffer somewhere.

cr1901 wrote:

I have a dual port SRAM on hand- Cypress CY7C130-30PC. It's nowhere NEAR enough RAM for gfx, but this should give you some hints nonetheless.

Hey, 8K can do a lot! It can do 320 x 200 1-bit bitmap and two pages of 80x25 text with a color attribute byte.

I kind of want to take a look at some CGA/VGA schematics now. Cards like that had dedicated, isolated graphics memory, so I assume there was an I/O port to transfer bytes into graphics memory but I'm almost positive that it was blocking I/O and not all of it was transferred at once. I'm sure it's trading complexity on the CPU side for complexity on the graphics side.

nyef · Post by **nyef** » Fri Mar 20, 2015 4:08 am

AXY wrote:

I kind of want to take a look at some CGA/VGA schematics now. Cards like that had dedicated, isolated graphics memory, so I assume there was an I/O port to transfer bytes into graphics memory but I'm almost positive that it was blocking I/O and not all of it was transferred at once. I'm sure it's trading complexity on the CPU side for complexity on the graphics side.

ISTR reading something (by Michael Abrash, maybe?) about a VGA card that added a one-byte FIFO to allow a single non-blocking write to RAM. If it was empty, the write would enter the FIFO and return immediately, otherwise there were wait-states involved. And that FIFO was enough to secure a market lead for quite a while. Presumably reads always triggered a wait-state.

BigDumbDinosaur · Post by **BigDumbDinosaur** » Fri Mar 20, 2015 5:02 am

AXY wrote:

I kind of want to take a look at some CGA/VGA schematics now. Cards like that had dedicated, isolated graphics memory, so I assume there was an I/O port to transfer bytes into graphics memory but I'm almost positive that it was blocking I/O and not all of it was transferred at once. I'm sure it's trading complexity on the CPU side for complexity on the graphics side.

Looking at the 8563/8568 VDC circuitry in the Commodore 128(D) might be interesting. The VDC had 16KB (C-128) or 64KB (C-128D) of video RAM that was not part of the MPU's address space. The VDC showed up in the MPU's address space as two I/O ports, one being a control (write)/status (read) register and the other a read/write data register (the addresses were $D600 and $D601, respectively).

Reading or writing the video RAM was accomplished by setting an address in a pair of registers, using the control register to tell the VDC which registers were to be set, next telling the VDC that the data register is to be selected, again by writing to the control port, and then reading from or writing to the data port. The VDC would inform the program that was doing the read or write when it was ready for each access by setting bit 7 in the status register. The 8568 could also generate an IRQ when that happened.

It was somewhat convoluted and effectively limited the maximum frame rate to about 10-12 FPS, but did get around the problem of trying to keep two different devices from trying to access the same RAM.
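The access sequence described above can be modelled in a few lines of Python. This is a toy model, not the VDC itself: the class and method names are hypothetical, the register numbers follow the 8563 (18/19 for the update address, 31 for data), and the fixed number of busy polls plus the address auto-increment after each data access are simplifying assumptions.

```python
class IndirectVideoPort:
    """Toy model of a two-port, register-indirect video chip."""
    ADDR_HI, ADDR_LO, DATA = 18, 19, 31

    def __init__(self, vram_size=16 * 1024, busy_polls=2):
        self.vram = bytearray(vram_size)
        self.regs = {}
        self.selected = 0
        self.busy_polls = busy_polls  # hypothetical: polls until "ready"
        self._countdown = 0

    def write_control(self, reg):
        """Control port write: select an internal register."""
        self.selected = reg
        self._countdown = self.busy_polls

    def read_status(self):
        """Status port read: bit 7 set means ready for a data access."""
        if self._countdown > 0:
            self._countdown -= 1
            return 0x00
        return 0x80

    def _address(self):
        hi = self.regs.get(self.ADDR_HI, 0)
        lo = self.regs.get(self.ADDR_LO, 0)
        return ((hi << 8) | lo) % len(self.vram)

    def _set_address(self, addr):
        addr %= len(self.vram)
        self.regs[self.ADDR_HI] = addr >> 8
        self.regs[self.ADDR_LO] = addr & 0xFF

    def write_data(self, value):
        """Data port write: goes to VRAM or to the selected register."""
        if self.selected == self.DATA:
            addr = self._address()
            self.vram[addr] = value & 0xFF
            self._set_address(addr + 1)  # assumed auto-increment
        else:
            self.regs[self.selected] = value & 0xFF


def select(port, reg):
    """Select a register, then poll the status bit until ready."""
    port.write_control(reg)
    while not (port.read_status() & 0x80):
        pass


def poke(port, addr, value):
    """Write one byte to video RAM through the two ports."""
    select(port, port.ADDR_HI)
    port.write_data(addr >> 8)
    select(port, port.ADDR_LO)
    port.write_data(addr & 0xFF)
    select(port, port.DATA)
    port.write_data(value)


vdc = IndirectVideoPort()
poke(vdc, 0x1234, 0xAB)
print(hex(vdc.vram[0x1234]))
```

The polling loop in `select` is where the CPU time goes: every register selection costs at least one status read before the data port can be touched.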

AXY · Post by **AXY** » Fri Mar 20, 2015 5:10 am

BigDumbDinosaur wrote:

Reading or writing the video RAM was accomplished by setting an address in a pair of registers, using the control register to tell the VDC which registers were to be set, next telling the VDC that the data register is to be selected, again by writing to the control port, and then reading from or writing to the data port. The VDC would inform the program that was doing the read or write when it was ready for each access by setting bit 7 in the status register. The 8568 could also generate an IRQ when that happened.

This is exactly how I imagined I would do it. I'm not really going for insane performance, mostly just for the most naive solution that will give acceptable performance with minimal debugging.

You know, I thought of something else. CGA gave priority to the CPU when writing to VRAM, with the obvious side effects of making the image look nasty if writing was happening during scan. Wouldn't the simplest solution be to give all priority to the CPU, knocking out the FPGA's address lines, but let the CPU opt into cooperation with an HSYNC and VSYNC interrupt?

It's funny how these little questions in your head can start making you wonder ... maybe I should just talk to a VT100 emulator over serial and call it a day?

BigDumbDinosaur · Post by **BigDumbDinosaur** » Fri Mar 20, 2015 5:38 am

AXY wrote:

... maybe I should just talk to a VT100 emulator over serial and call it a day?

That's what I'm effectively doing with my POC unit. I have a serial terminal running in WYSE 60 emulation mode connected to it.

cr1901 · Post by **cr1901** » Fri Mar 20, 2015 11:17 am

AXY wrote:

I still might try this approach, but at some point, even the FPGA is going to be competing with itself in writing to the VRAM. Right now it assumes read-only. Do you think I can transfer 16K in 0.0014299900695133761 s (VSYNC window)? I don't know if I can at 10 MHz or 12MHz. If I don't do it all in one go, I need to be able to squeeze writes in throughout the frame or have some other second buffer somewhere.

Theoretical bandwidth of the 65xx bus should be equal to your clock speed (one byte per cycle), if a DMA controller that respected 65xx timings existed. 16.0 KB / 0.0014299900695133761 s is about 11.2 MB/sec. So it's within the realm of possibility. Idk how fast the 6502 itself is when xferring data.
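The estimate above can be checked with a short Python calculation. The blanking window is the figure quoted in the thread; the 16,000-byte frame (320 x 200 at 2 bits per pixel) and the 14-cycles-per-byte software copy loop (`LDA abs,X` / `STA abs,X` / `INX` / `BNE`) are assumptions chosen for illustration.

```python
# Can a 16K-class frame be moved inside the vertical blanking window?
VBLANK_S = 0.0014299900695133761  # window quoted in the thread
FRAME_BYTES = 320 * 200 * 2 // 8  # 16,000 bytes of packed 2-bit pixels


def required_rate(n_bytes, window_s):
    """Bytes per second needed to move n_bytes within window_s."""
    return n_bytes / window_s


def bytes_per_window(clock_hz, cycles_per_byte, window_s):
    """Whole bytes that fit in the window at a given cost per byte."""
    return int(clock_hz * window_s / cycles_per_byte)


print(f"required: {required_rate(FRAME_BYTES, VBLANK_S) / 1e6:.1f} MB/s")
for mhz in (10, 12):
    ideal = bytes_per_window(mhz * 1e6, 1, VBLANK_S)    # 1 byte per cycle (DMA)
    looped = bytes_per_window(mhz * 1e6, 14, VBLANK_S)  # assumed copy loop
    print(f"{mhz} MHz: ideal DMA {ideal} bytes, software loop {looped} bytes")
```

Under these assumptions an ideal one-byte-per-cycle transfer falls a little short at 10 MHz and fits at 12 MHz, while a plain software loop moves only about a kilobyte per blanking window, so a CPU-driven copy would have to be spread over many frames.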

AXY wrote:

Hey, 8K can do a lot! It can do 320 x 200 1-bit bitmap and two pages of 80x25 text with a color attribute byte.

Fair enough. Why did I think that particular chip was 2kB only?

AXY wrote:

I kind of want to take a look at some CGA/VGA schematics now. Cards like that had dedicated, isolated graphics memory, so I assume there was an I/O port to transfer bytes into graphics memory but I'm almost positive that it was blocking I/O and not all of it was transferred at once. I'm sure it's trading complexity on the CPU side for complexity on the graphics side.

CGA provided its framebuffer in regular memory, and it wasn't strictly linear, for reasons described in the links below.

I have linked the manual, including schematics, of IBM's original CGA card. It provides CGA and NTSC output using only discrete components. Please don't ask me how the NTSC circuitry works, as I have not analyzed that portion of the card yet- and it is a bit tougher than the other portions.

I should be able to answer other questions re: this card, however. The only special purpose chip is the 6845, which IME can be seen as a special digital counter. One section of bits repeats a sequence of values (0, 1, 2, ..., n - 1) until the second section of bits reaches a programmed threshold, after which the first section of bits takes on a new sequence of values to repeat (n, n + 1, ..., 2n - 1), and the second section counts up to the programmed threshold again. Concatenate the two sections of bits to get the address of the next picture element in VRAM to display.
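The counter behaviour described above can be sketched as a Python generator. The function and parameter names are hypothetical; on the 6845 the two sections correspond to the memory address (MA) and row address (RA) outputs. Placing the row-address bits above the memory-address bits when concatenating is an assumption made here to show why the resulting framebuffer is not strictly linear.

```python
def crtc_addresses(chars_per_row, scanlines_per_row, rows, ma_bits):
    """Yield the combined address for every fetch in one frame.

    The first section (ma) repeats row_start .. row_start + n - 1 once per
    scanline; the second section (ra) counts scanlines up to its threshold,
    after which ma moves on to the next block of n values.
    """
    for row in range(rows):
        row_start = row * chars_per_row
        for ra in range(scanlines_per_row):
            for col in range(chars_per_row):
                ma = row_start + col
                yield (ra << ma_bits) | ma  # assumed concatenation order


# Tiny example: 4 fetches per scanline, 2 scanlines per row, 2 rows.
sequence = list(crtc_addresses(chars_per_row=4, scanlines_per_row=2, rows=2, ma_bits=4))
print(sequence)
# [0, 1, 2, 3, 16, 17, 18, 19, 4, 5, 6, 7, 20, 21, 22, 23]
```

With the scanline counter in the high bits, alternate scanlines land in separate banks of memory, which is one way to read the "not strictly linear" layout mentioned earlier in the post.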

AXY · Post by **AXY** » Sat Mar 21, 2015 8:10 pm

Thanks cr1901 for the CGA documentation! This will be interesting to compare against my pseudo-CGA implementation in Verilog.

I looked at that PDF a little bit and it looks a lot like how my address generator and simple DAC work. The 6845 is the real mystery to me I think, so I started looking at that. It definitely is acting as gatekeeper and uses internal registers and I/O ports.

So now I'm starting to consider using an 8-bit port with a few register locations in the FPGA. Thinking out loud:

Register I/O Only

- R0: Data
- R1: Address Low
- R2: Address High
- R3: Mode Control Register (I need this anyway - right now it's hacked to dedicated pins)
- R4: Text Line Scroll
- R5: Text Page
- etc.

- A write to R0 will trigger a queue to the FIFO, possibly asserting RDY on the MPU if it's full (or just let the CPU hog the bus)
- A read to R0 will read at [Address High][Address Low], possibly asserting RDY on the MPU if VRAM access is happening (or just let the CPU hog the bus)
- Mode control or other registers will latch immediately.

Doing it with standard parallel I/O is much easier to think about. I wonder how much it will affect R/W speeds to graphics memory, though. I'm going to need three separate memory accesses to affect VRAM, not to mention the blocking I/O.
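To make the access-count concern concrete, here is a toy Python model of the register-only scheme. The class name and the numbering of the address registers are hypothetical, and the FIFO and RDY behaviour are left out; the model only counts CPU bus accesses. No address auto-increment is modelled because the proposal does not mention one.

```python
class RegisterPort:
    """Toy model: all VRAM traffic goes through three registers."""
    DATA, ADDR_LO, ADDR_HI = 0, 1, 2  # hypothetical numbering, R0 = Data

    def __init__(self, vram_size=16 * 1024):
        self.vram = bytearray(vram_size)
        self.addr_lo = 0
        self.addr_hi = 0
        self.bus_accesses = 0  # CPU bus cycles spent talking to the port

    def _address(self):
        return ((self.addr_hi << 8) | self.addr_lo) % len(self.vram)

    def write(self, reg, value):
        self.bus_accesses += 1
        value &= 0xFF
        if reg == self.DATA:
            self.vram[self._address()] = value
        elif reg == self.ADDR_LO:
            self.addr_lo = value
        elif reg == self.ADDR_HI:
            self.addr_hi = value

    def read(self, reg):
        self.bus_accesses += 1
        if reg == self.DATA:
            return self.vram[self._address()]
        return self.addr_lo if reg == self.ADDR_LO else self.addr_hi


def poke(port, addr, value):
    """One VRAM byte costs three register writes."""
    port.write(RegisterPort.ADDR_LO, addr & 0xFF)
    port.write(RegisterPort.ADDR_HI, addr >> 8)
    port.write(RegisterPort.DATA, value)


port = RegisterPort()
for offset in range(80):  # one 80-byte line
    poke(port, 0x0100 + offset, 0x55)
print(port.bus_accesses)  # 240
```

A directly mapped window would need 80 writes for the same line, so the register scheme triples the bus traffic unless something like an auto-incrementing address is added.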

Hybrid I/O
Directly latch A0..A13 and D0..D7 from the main bus.

- n: Mode Control Register (I need this anyway - right now it's hacked to dedicated pins)
- n+1: Text Line Scroll
- n+2: Text Page
- etc.

If I have spare pins on the FPGA, I might be able to latch the data and the 14-bit address at the same time. I've already got two independent address and data buses coming into the FPGA for VRAM reads and font ROM reads in parallel (already using up every pixel clock for synchronous logic buffering in data from the memory chips - this was necessary to prevent glitches during byte transitions in the image). Why not have three independent buses!

AXY · Post by **AXY** » Sun Apr 05, 2015 9:14 am

Thanks all for the input. I decided to go with the following; I think it's an improvement and a good compromise.

Text Mode (8x8 font at 640 x 200 operating in 640 x 480 @ 60Hz timings)
Previously, the FPGA would naively prefetch the next character byte, the next color attribute byte, and the 8x8 font row for the next character column during the visible area. Now, the FPGA buffers the entire next row of character data during horizontal blanking when ((vga_y & 0b111) == 0), and the entire next row of color attribute data during HBlank when ((vga_y & 0b111) == 1). The font ROM is on an independent bus and the FPGA owns it, so I still fetch that only right before I need it.

Previous naive VRAM bus usage: (80 chars/row * 25 rows * 25.4 us/row) = 50.8 milliseconds / frame. Yikes.
New VRAM bus usage: (HBLANK * 2 * 25 rows) = 318 microseconds / frame. Much better!

Bitmap mode (320 x 200, 2-bit color, 4 px/byte, operating in 640 x 480 @ 60Hz timings)
This is similar to the savings in text mode, except that I need to prefetch all 80 bytes of the next line at every HBlank, so it's 1 HBlank / VGA line. Still, it's a massive improvement to buffer in bytes during HBlank compared to hogging the VRAM bus for the entire visible scanline.

Basically, this means that I can just use a simple wait on the RDY line if the CPU tries to hop onto the VRAM bus and not feel too terribly bad about it.
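The bus-usage figures in this post can be reproduced with a short Python calculation. The timing constants are assumptions based on common 640 x 480 @ 60 Hz timing (25.175 MHz dot clock, 640 visible plus 160 blanking clocks per line, 525 lines per frame), and the bitmap case assumes the 200 lines are doubled to 400 VGA lines with one HBlank burst each.

```python
# Per-frame VRAM bus occupancy under assumed 640x480@60 timing.
PIXEL_CLOCK_HZ = 25_175_000
VISIBLE_CLOCKS = 640
HBLANK_CLOCKS = 160
LINES_PER_FRAME = 525

VISIBLE_S = VISIBLE_CLOCKS / PIXEL_CLOCK_HZ  # about 25.4 us
HBLANK_S = HBLANK_CLOCKS / PIXEL_CLOCK_HZ    # about 6.4 us
FRAME_S = LINES_PER_FRAME * (VISIBLE_CLOCKS + HBLANK_CLOCKS) / PIXEL_CLOCK_HZ


def text_mode_usage(rows=25, bursts_per_row=2):
    """One HBlank burst for characters and one for attributes per text row."""
    return rows * bursts_per_row * HBLANK_S


def bitmap_mode_usage(vga_lines=400):
    """One HBlank burst per displayed VGA line (assumed 200 lines doubled)."""
    return vga_lines * HBLANK_S


def naive_usage(vga_lines=400):
    """Bus held for the whole visible part of every displayed line."""
    return vga_lines * VISIBLE_S


for name, seconds in [("text, buffered", text_mode_usage()),
                      ("bitmap, buffered", bitmap_mode_usage()),
                      ("naive, per scanline", naive_usage())]:
    print(f"{name}: {seconds * 1e6:.0f} us ({seconds / FRAME_S:.1%} of frame)")
```

The text-mode result matches the 318 microseconds quoted above. The naive figure computed here counts visible scanlines rather than characters; the 50.8 ms quoted above also multiplies by 80 characters per row and would exceed the roughly 16.7 ms frame period, so the per-scanline count is probably the fairer comparison.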

lordbubsy · Post by **lordbubsy** » Sun Apr 05, 2015 12:31 pm

Take a look at the F18A; it's a pin-compatible TMS9918A replacement with true standard VGA 640 x 480 @ 60 Hz output. It has some very nice features over the original VDP, is easy to connect, and is easy to program.

• Horizontal and vertical scroll registers
• Removed the original VDP “speed limit”
• 80-column mode, you can choose a charset yourself
• 64 programmable 12-bit color registers, for creating your own color scheme
• Two 32-bit 100MHz Linear Feedback Shift Register (LFSR) random number generators

http://codehackcreate.com/archives/335
http://codehackcreate.com/archives/30
