Hi all,
I'm working on the video portion of my homebrew computer project. I've already implemented a CGA "video card" in an FPGA, which interfaces directly with a 16K SRAM window and an 8K font ROM. My problem: how to share this 16K VRAM with the CPU. If you're not familiar with CGA, it basically boils down to the common modes:
- 16K of 2-bit packed pixels, selectable palette, 320 x 200
- 8 x 1K pages of ASCII-ish character data
- 8 x 1K pages of 4-bit/4-bit packed foreground and background color that match with the corresponding character
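For anyone following along, addressing a pixel in the 2-bit packed mode can be sketched as below. This is only a minimal Python model, assuming a linear framebuffer with the leftmost pixel in the two most significant bits of each byte; real CGA interleaves even and odd scanlines into two banks, and the function names here are made up for illustration.

```python
WIDTH, HEIGHT = 320, 200
PIXELS_PER_BYTE = 4  # 2 bits per pixel

def pixel_location(x, y):
    """Return (byte_offset, shift) for pixel (x, y) in a linear 2bpp buffer."""
    if not (0 <= x < WIDTH and 0 <= y < HEIGHT):
        raise ValueError("pixel out of range")
    index = y * WIDTH + x
    offset = index // PIXELS_PER_BYTE
    # Assumption: the leftmost pixel of each group of four lives in bits 7..6.
    shift = 6 - 2 * (index % PIXELS_PER_BYTE)
    return offset, shift

def set_pixel(vram, x, y, color):
    """Write a 2-bit color without disturbing the other three pixels in the byte."""
    offset, shift = pixel_location(x, y)
    mask = ~(0b11 << shift) & 0xFF
    vram[offset] = (vram[offset] & mask) | ((color & 0b11) << shift)

def get_pixel(vram, x, y):
    offset, shift = pixel_location(x, y)
    return (vram[offset] >> shift) & 0b11

vram = bytearray(16 * 1024)
set_pixel(vram, 319, 199, 3)  # last pixel lands in the last used byte
```

320 x 200 at 4 pixels per byte uses 16000 of the 16384 bytes in the window.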
The FPGA is actually running at 50 MHz with a 25 MHz VGA pixel clock, but it only buffers bytes as necessary for drawing 320 x 200, to prevent glitches on pixel transitions. There are HSYNC and VSYNC periods where the FPGA mostly doesn't need to access memory, except right before going back to (0, y), where it needs to buffer the first byte of the next row before churning out pixels.
A couple of approaches I've considered:
- Have the 65C02 and the FPGA swap two independent SRAM chips using a boatload of bus transceivers (2 for data, 2 for A0..A7, 2 for A8..A13) during VSYNC, achieving a kind of double-buffering
- Actually use double buffering
- Only draw during HSYNC/VSYNC
- Pause 65C02 with RDY
- Go through a peripheral chip
- Do Apple II-style bus sharing (seems difficult, especially with the FPGA operating at 50 MHz and 25 MHz pixel clock)
- Dual-ported SRAM (never used this before - part recommendations, anyone?)
Am I missing anything? Pros/cons of each?
Managing bus contention with 16K VRAM window at 320 x 200 @
Re: Managing bus contention with 16K VRAM window at 320 x 20
I'd go with Apple II bus sharing if possible. How fast can the 65C02 run? Could you interleave the video access with the 65C02's access by providing a clock from the video subsystem?
For instance, the video byte data is required at 1/4 the pixel clock rate, or about 6 MHz. If the C02 could run at 6 MHz you might be able to interleave the accesses Apple II style.
You could run the processor in strips during the horizontal draw. Buffer the video data for the line in the FPGA using a FIFO during the first part of the line draw. It's only 80 bytes for the entire line. The video data would fill the buffer at a 25 MHz rate, then be displayed from the FIFO at a 6.25 MHz rate. That would leave about 3/4 of a scanline free for the 65C02 to access RAM.
The 65C02's clock could be stopped while the video buffer is loading.
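The "about 3/4 of a scanline" figure falls out of the ratio of the two rates. Here is a rough Python check, assuming the SRAM can really deliver one byte per 25 MHz clock (40 ns) and that the display consumes one byte every 160 ns (6.25 MHz); the names are just for illustration.

```python
LINE_BYTES = 80
FILL_NS_PER_BYTE = 40    # one byte per 25 MHz clock (assumes SRAM keeps up)
DRAIN_NS_PER_BYTE = 160  # one byte per 6.25 MHz display byte clock

def free_fraction(line_bytes, fill_ns, drain_ns):
    """Fraction of the line-draw time during which the FPGA is off the VRAM bus."""
    fill_time = line_bytes * fill_ns   # burst-loading the FIFO
    line_time = line_bytes * drain_ns  # time to display the same bytes
    return 1 - fill_time / line_time

fill_time_ns = LINE_BYTES * FILL_NS_PER_BYTE
free = free_fraction(LINE_BYTES, FILL_NS_PER_BYTE, DRAIN_NS_PER_BYTE)
print(fill_time_ns, free)
```

The burst takes 3200 ns, which is the window during which the 65C02's clock would be stopped.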
Re: Managing bus contention with 16K VRAM window at 320 x 20
Rob Finch wrote:
I'd go with Apple II bus sharing if possible. How fast can the 65C02 run? Could you interleave the video access with the 65C02's access by providing a clock from the video subsystem?
For instance, the video byte data is required at 1/4 the pixel clock rate, or about 6 MHz. If the C02 could run at 6 MHz you might be able to interleave the accesses Apple II style.
Rob Finch wrote:
You could run the processor in strips during the horizontal draw. Buffer the video data for the line in the FPGA using a FIFO during the first part of the line draw. It's only 80 bytes for the entire line. The video data would fill the buffer at a 25 MHz rate, then be displayed from the FIFO at a 6.25 MHz rate. That would leave about 3/4 of a scanline free for the 65C02 to access RAM.
Rob Finch wrote:
The 65C02's clock could be stopped while the video buffer is loading.
There are some interesting dual-port through-hole SRAM chips from IDT at Mouser, but they are something like $15 / 32Kbit, which is just a little too steep for me to care. I think I'm going to try this approach first:
- One single separate SRAM chip for VRAM. The FPGA gets first dibs but will release its hold whenever it doesn't need it - all of the writing will happen during HSYNC and VSYNC.
- (Optional) Make the FPGA buffer the entire line in one burst, as fast as it can. This would create larger contiguous access windows.
- 3x octal bus transceivers between the CPU bus and VRAM
- Hold RDY low and transceiver /OE high if the CPU's address decodes to the VRAM window and the FPGA isn't in HSYNC or VSYNC. This signal should be really easy to emit from the FPGA and I've still got plenty of I/O there.
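The stall logic in that last bullet is small enough to write down as a truth function. This is a Python sketch rather than the actual Verilog; the window base of $8000 is a hypothetical placement, and keeping /OE high whenever the CPU is not addressing VRAM at all is my assumption about the unstated case.

```python
VRAM_BASE = 0x8000  # hypothetical location of the window in the CPU map
VRAM_SIZE = 0x4000  # 16K

def arbitrate(cpu_addr, hsync, vsync):
    """Return (rdy, oe_n) for one CPU cycle.

    rdy  : True = CPU runs, False = RDY held low (CPU stalled)
    oe_n : True = transceivers disabled (/OE high), False = enabled
    """
    selected = VRAM_BASE <= cpu_addr < VRAM_BASE + VRAM_SIZE
    fpga_idle = hsync or vsync           # FPGA has released the VRAM
    stall = selected and not fpga_idle   # CPU wants VRAM while FPGA owns it
    rdy = not stall
    # Assumption: transceivers only open when the CPU is selected AND allowed in.
    oe_n = not (selected and fpga_idle)
    return rdy, oe_n

print(arbitrate(0x8000, hsync=False, vsync=False))  # CPU hits VRAM mid-scanline
```

One thing to keep in mind: RDY halts write cycles on the WDC 65C02, but not on the original NMOS 6502, so this scheme depends on using the CMOS part.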
Re: Managing bus contention with 16K VRAM window at 320 x 20
Is shared VRAM a hard requirement?
If I were doing this project, I probably wouldn't share VRAM between the CPU and the video controller. I'd do something similar to the Sega Genesis VDP, where the 68k talks to the VDP using a single I/O port and transfers the picture all at once during Vblank or Forced Blank using DMA (might not be as simple without the '816's block-move, or until someone creates a DMA controller that respects 65xx timing).
I have a dual port SRAM on hand- Cypress CY7C130-30PC. It's nowhere NEAR enough RAM for gfx, but this should give you some hints nonetheless.
Re: Managing bus contention with 16K VRAM window at 320 x 20
cr1901 wrote:
Is shared VRAM a hard requirement?
cr1901 wrote:
If I were doing this project, I probably wouldn't share VRAM between the CPU and the video controller. I'd do something similar to the Sega Genesis VDP, where the 68k talks to the VDP using a single I/O port and transfers the picture all at once during Vblank or Forced Blank using DMA (might not be as simple without the '816's block-move, or until someone creates a DMA controller that respects 65xx timing).
cr1901 wrote:
I have a dual port SRAM on hand- Cypress CY7C130-30PC. It's nowhere NEAR enough RAM for gfx, but this should give you some hints nonetheless.
I kind of want to take a look at some CGA/VGA schematics now. Cards like that had dedicated, isolated graphics memory, so I assume there was an I/O port to transfer bytes into graphics memory but I'm almost positive that it was blocking I/O and not all of it was transferred at once. I'm sure it's trading complexity on the CPU side for complexity on the graphics side.
- BigDumbDinosaur
Re: Managing bus contention with 16K VRAM window at 320 x 20
AXY wrote:
I kind of want to take a look at some CGA/VGA schematics now. Cards like that had dedicated, isolated graphics memory, so I assume there was an I/O port to transfer bytes into graphics memory but I'm almost positive that it was blocking I/O and not all of it was transferred at once. I'm sure it's trading complexity on the CPU side for complexity on the graphics side.
Reading or writing the video RAM was accomplished by setting an address in a pair of registers, using the control register to tell the VDC which registers were to be set, next telling the VDC that the data register is to be selected, again by writing to the control port, and then reading from or writing to the data port. The VDC would inform the program that was doing the read or write when it was ready for each access by setting bit 7 in the status register. The 8568 could also generate an IRQ when that happened.
It was somewhat convoluted and effectively limited the maximum frame rate to about 10-12 FPS, but did get around the problem of trying to keep two different devices from trying to access the same RAM.
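The select-register / wait-for-ready / access-data sequence described above can be modeled in a few lines. This is a toy Python simulation, not driver code: the register numbers (18 and 19 for the update address, 31 for data) and the auto-increment on data access are how I recall the 8563-family parts, so treat them as assumptions, and `FakeVDC` with its `busy_polls` delay is entirely made up for illustration.

```python
REG_ADDR_HI = 18  # assumed register numbers, see note above
REG_ADDR_LO = 19
REG_DATA = 31

class FakeVDC:
    """Toy model of a video controller that owns its VRAM behind two ports."""
    STATUS_READY = 0x80  # bit 7 of the status register

    def __init__(self, size=16 * 1024, busy_polls=2):
        self.vram = bytearray(size)
        self.regs = {REG_ADDR_HI: 0, REG_ADDR_LO: 0}
        self.selected = 0
        self.busy_polls = busy_polls
        self._busy = 0

    def write_control(self, reg):
        """Select an internal register; the chip is then busy for a while."""
        self.selected = reg
        self._busy = self.busy_polls

    def read_status(self):
        if self._busy:
            self._busy -= 1
            return 0
        return self.STATUS_READY

    def write_data(self, value):
        if self.selected == REG_DATA:
            addr = (self.regs[REG_ADDR_HI] << 8) | self.regs[REG_ADDR_LO]
            self.vram[addr % len(self.vram)] = value & 0xFF
            addr += 1  # assumed auto-increment after each data access
            self.regs[REG_ADDR_HI] = (addr >> 8) & 0xFF
            self.regs[REG_ADDR_LO] = addr & 0xFF
        else:
            self.regs[self.selected] = value & 0xFF

def vdc_write_reg(vdc, reg, value):
    vdc.write_control(reg)
    while not vdc.read_status() & FakeVDC.STATUS_READY:
        pass  # spin until bit 7 is set
    vdc.write_data(value)

def vdc_poke(vdc, addr, value):
    """Three register writes, each with its own ready poll, per byte stored."""
    vdc_write_reg(vdc, REG_ADDR_HI, addr >> 8)
    vdc_write_reg(vdc, REG_ADDR_LO, addr & 0xFF)
    vdc_write_reg(vdc, REG_DATA, value)

vdc = FakeVDC()
vdc_poke(vdc, 0x1234, 0xAB)
```

Counting the polls per byte makes it easy to see where the low frame rate comes from.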
Last edited by BigDumbDinosaur on Fri Mar 20, 2015 5:36 am, edited 1 time in total.
x86? We ain't got no x86. We don't NEED no stinking x86!
Re: Managing bus contention with 16K VRAM window at 320 x 20
BigDumbDinosaur wrote:
Reading or writing the video RAM was accomplished by setting an address in a pair of registers, using the control register to tell the VDC which registers were to be set, next telling the VDC that the data register is to be selected, again by writing to the control port, and then reading from or writing to the data port. The VDC would inform the program that was doing the read or write when it was ready for each access by setting bit 7 in the status register. The 8568 could also generate an IRQ when that happened.
You know, I thought of something else. CGA gave priority to the CPU when writing to VRAM, with the obvious side effects of making the image look nasty if writing was happening during scan. Wouldn't the simplest solution be to give all priority to the CPU, knocking out the FPGA's address lines, but let the CPU opt into cooperation with an HSYNC and VSYNC interrupt?
It's funny how these little questions in your head can start making you wonder ... maybe I should just talk to a VT100 emulator over serial and call it a day?
- BigDumbDinosaur
Re: Managing bus contention with 16K VRAM window at 320 x 20
AXY wrote:
... maybe I should just talk to a VT100 emulator over serial and call it a day? 
Re: Managing bus contention with 16K VRAM window at 320 x 20
AXY wrote:
I still might try this approach, but at some point, even the FPGA is going to be competing with itself in writing to the VRAM. Right now it assumes read-only. Do you think I can transfer 16K in 0.0014299900695133761 s (VSYNC window)? I don't know if I can at 10 MHz or 12MHz. If I don't do it all in one go, I need to be able to squeeze writes in throughout the frame or have some other second buffer somewhere.
AXY wrote:
Hey, 8K can do a lot! It can do 320 x 200 1-bit bitmap and two pages of 80x25 text with a color attribute byte.
AXY wrote:
I kind of want to take a look at some CGA/VGA schematics now. Cards like that had dedicated, isolated graphics memory, so I assume there was an I/O port to transfer bytes into graphics memory but I'm almost positive that it was blocking I/O and not all of it was transferred at once. I'm sure it's trading complexity on the CPU side for complexity on the graphics side.
I have linked the manual, including schematics, of IBM's original CGA card. It provides CGA and NTSC output using only discrete components. Please don't ask me how the NTSC circuitry works, as I have not analyzed that portion of the card yet, and it is a bit tougher than the other portions.
I should be able to answer other questions re: this card, however. The only special purpose chip is the 6845, which IME can be seen as a special digital counter. One section of bits repeats a sequence of values (0, 1, 2, ..., n - 1) until the second section of bits reaches a programmed threshold, after which the first section of bits takes on a new sequence of values to repeat (n, n + 1, ..., 2n - 1), and the second section counts up to the programmed threshold again. Concatenate the two sections of bits to get the address of the next picture element in VRAM to display.
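That counter model can be sketched as a Python generator. It is a simplification under my own naming: `chars_per_row` plays the role of n, `scanlines_per_row` is the programmed threshold, and the blanking and sync counts that a real 6845 also runs through are left out.

```python
def crtc_addresses(chars_per_row, scanlines_per_row, rows):
    """Yield (row_address, memory_address) pairs in display order.

    memory_address repeats base..base+n-1 once per scanline of a character
    row, then the base advances by n for the next character row.
    """
    for row in range(rows):
        base = row * chars_per_row
        for row_address in range(scanlines_per_row):
            for col in range(chars_per_row):
                yield row_address, base + col

# Tiny example: 3 characters per row, 2 scanlines per character row, 2 rows.
sequence = list(crtc_addresses(3, 2, 2))
print(sequence)
```

How the two sections are concatenated into a VRAM or font ROM address is up to the surrounding logic on the card.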
Last edited by cr1901 on Fri Apr 03, 2015 5:58 pm, edited 1 time in total.
Re: Managing bus contention with 16K VRAM window at 320 x 20
Thanks cr1901 for the CGA documentation! This will be interesting to compare against my pseudo-CGA implementation in Verilog.
I looked at that PDF a little bit and it looks a lot like how my address generator and simple DAC work. The 6845 is the real mystery to me, I think, so I started looking at that. It definitely acts as a gatekeeper and uses internal registers and I/O ports.
So now I'm starting to consider using an 8-bit port with a few register locations in the FPGA. Thinking out loud:
Register I/O Only
- R0: Data
- R1: Address Low
- R2: Address High
- R3: Mode Control Register (I need this anyway - right now it's hacked to dedicated pins)
- R4: Text Line Scroll
- R5: Text Page
- etc.
- A write to R0 will queue the byte into the FIFO, possibly pulling RDY low on the MPU if the FIFO is full (or just let the CPU hog the bus)
- A read of R0 will read VRAM at [Address High][Address Low], possibly pulling RDY low on the MPU if a VRAM access is in progress (or just let the CPU hog the bus)
- Mode control or other registers will latch immediately.
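To make the write path concrete, here is a small Python model of the register-only idea. It is a sketch under assumptions: the register numbers (data at 0, address low at 1, address high at 2) follow the list order, the FIFO depth of 4 is arbitrary, and `blanking_tick` stands in for whatever free VRAM slot the FPGA finds.

```python
from collections import deque

R_DATA, R_ADDR_LO, R_ADDR_HI = 0, 1, 2  # assumed numbering

class VramPort:
    FIFO_DEPTH = 4  # hypothetical depth

    def __init__(self):
        self.vram = bytearray(16 * 1024)
        self.addr_lo = 0
        self.addr_hi = 0
        self.fifo = deque()

    def address(self):
        return (self.addr_hi << 8) | self.addr_lo

    def cpu_write(self, reg, value):
        """Return True if the write was accepted, False if RDY must go low."""
        if reg == R_DATA:
            if len(self.fifo) >= self.FIFO_DEPTH:
                return False  # FIFO full: stall the CPU and retry
            self.fifo.append((self.address(), value & 0xFF))
        elif reg == R_ADDR_LO:
            self.addr_lo = value & 0xFF
        elif reg == R_ADDR_HI:
            self.addr_hi = value & 0x3F  # 14-bit address in total
        return True

    def blanking_tick(self):
        """Drain one queued write during a slot when the display logic is idle."""
        if self.fifo:
            addr, value = self.fifo.popleft()
            self.vram[addr] = value

port = VramPort()
port.cpu_write(R_ADDR_LO, 0x34)
port.cpu_write(R_ADDR_HI, 0x12)
accepted = port.cpu_write(R_DATA, 0xAB)
before = port.vram[0x1234]  # still 0: the write is only queued
port.blanking_tick()
after = port.vram[0x1234]
```

The address is captured at queue time, so the CPU can change the address registers before the write has actually landed.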
Hybrid I/O
Directly latch A0..A13 and D0..D7 from the main bus.
n: Mode Control Register (I need this anyway - right now it's hacked to dedicated pins)
n+1: Text Line Scroll
n+2: Text Page
etc.
If I have spare pins on the FPGA, I might be able to latch the data and the 14-bit address at the same time. I've already got two independent address and data buses coming into the FPGA for VRAM reads and font ROM reads in parallel (already using up every pixel clock for synchronous logic buffering in data from the memory chips - this was necessary to prevent glitches during byte transitions in the image). Why not have three independent buses!
Re: Managing bus contention with 16K VRAM window at 320 x 20
Thanks all for the input. I decided to go with the following; I think it's an improvement and a good compromise.
Text Mode (8x8 font at 640 x 200 operating in 640 x 480 @ 60Hz timings)
Previously, the FPGA would naively prefetch the next character byte, the next color attribute byte, and the 8x8 font row for the next character column during the visible area. Now, the FPGA buffers the entire next row of character data during horizontal blanking on (vga_y & 0b111 == 0), and the entire next row of color attribute data during HBlank on (vga_y & 0b111 == 1). The font ROM is on an independent bus and the FPGA owns it, so I still fetch that only right before I need it.
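Previous naive VRAM bus usage: the bus is tied up for the visible part of every text scanline, so roughly (8 scanlines/row * 25 rows * 25.4 us/scanline) = 5.08 milliseconds / frame, or double that if each scanline is repeated to fill the 640 x 480 frame. Yikes.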
New VRAM bus usage: (HBLANK * 2 * 25 rows) = 318 microseconds / frame. Much better!
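The 25.4 us and 318 us figures can be reproduced from the nominal 640 x 480 @ 60 Hz timing (25.175 MHz pixel clock, 800 pixel clocks per line of which 640 are visible, 525 lines per frame). A quick Python check, with variable names of my own choosing; a design running the pixel clock at exactly 25 MHz will come out slightly different.

```python
PIXEL_CLOCK_HZ = 25_175_000  # nominal VGA pixel clock
H_TOTAL = 800                # pixel clocks per scanline
H_VISIBLE = 640
V_TOTAL = 525                # scanlines per frame
TEXT_ROWS = 25
HBLANKS_PER_ROW = 2          # one for characters, one for attributes

def microseconds(pixel_clocks):
    return pixel_clocks / PIXEL_CLOCK_HZ * 1e6

visible_us = microseconds(H_VISIBLE)           # visible part of a scanline
hblank_us = microseconds(H_TOTAL - H_VISIBLE)  # horizontal blanking
frame_us = microseconds(H_TOTAL * V_TOTAL)     # whole frame, about 16.7 ms

text_bus_us = hblank_us * HBLANKS_PER_ROW * TEXT_ROWS
bus_share = text_bus_us / frame_us

print(f"visible {visible_us:.1f} us, hblank {hblank_us:.2f} us")
print(f"text mode: {text_bus_us:.0f} us of {frame_us:.0f} us ({bus_share:.1%})")
```

Any per-frame bus figure has to fit inside that roughly 16.7 ms frame period, so the prefetch scheme leaves the FPGA on the VRAM bus for only about 2% of the frame.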
Bitmap mode (320 x 200, 2-bit color, 4 px/byte, operating in 640 x 480 @ 60Hz timings)
This is similar to the savings in text mode, except that I need to prefetch all 80 bytes of the next line at every HBlank, so it's 1 HBlank / VGA line. Still, it's a massive improvement to buffer in bytes during HBlank compared to hogging the VRAM bus for the entire visible scanline.
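A similar back-of-the-envelope check for bitmap mode, again in Python. The assumptions are mine: the FPGA fetches one byte per 25 MHz clock, HBlank is the nominal 160 pixel clocks at 25.175 MHz, and the 200 bitmap rows are doubled to 400 displayed VGA lines.

```python
LINE_BYTES = 80
FETCH_CLOCK_HZ = 25_000_000  # assumption: one VRAM byte per 25 MHz clock
HBLANK_US = 160 / 25.175     # 160 blanking pixel clocks at 25.175 MHz, in us
DISPLAYED_LINES = 400        # assumption: 200 rows, each shown twice
FRAME_US = 800 * 525 / 25.175

fetch_us = LINE_BYTES / FETCH_CLOCK_HZ * 1e6
fits_in_hblank = fetch_us <= HBLANK_US

# Upper bound: treat the whole HBlank as reserved on every displayed line.
bitmap_bus_us = HBLANK_US * DISPLAYED_LINES
bus_share = bitmap_bus_us / FRAME_US

print(f"fetch {fetch_us:.1f} us vs hblank {HBLANK_US:.2f} us, fits: {fits_in_hblank}")
print(f"bitmap mode: {bitmap_bus_us:.0f} us / frame ({bus_share:.0%})")
```

Under those assumptions the 80-byte burst fits in about half of one HBlank, and the longest single RDY stall the CPU should see is on the order of one HBlank.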
Basically, this means that I can just use a simple wait on the RDY line if the CPU tries to hop onto the VRAM bus and not feel too terribly bad about it.
Re: Managing bus contention with 16K VRAM window at 320 x 20
Take a look at the F18A. It’s a pin-compatible TMS9918A replacement with true standard VGA 640 x 480 @ 60 Hz output. It has some very nice features over the original VDP, and is easy to connect and easy to program.
• Horizontal and vertical scroll registers
• Removed the original VDP “speed limit”
• 80-column mode, you can choose a charset yourself
• 64 programmable 12-bit color registers, for creating your own color scheme
• Two 32-bit 100MHz Linear Feedback Shift Register (LFSR) random number generators
http://codehackcreate.com/archives/335
http://codehackcreate.com/archives/30
Marco