All times are UTC




PostPosted: Sun Dec 01, 2019 3:43 pm 
Joined: Tue Jul 17, 2018 9:58 am
Posts: 106
Location: Long Island, NY
Wanted to share a design I've been working on. For my game console project I need a way to quickly transfer bytes into the frame buffer.

The scheme I'm working on uses a bunch of counter chips and some adder chips to march the target address ranges on the rising edge of the clock,
while strobing the write enable on the low phase. The controller operates on the dual-ported video RAM "V.RAM" and a normal SRAM dedicated as a staging area for
graphics "G.RAM". When a copy operation is happening the V.RAM and G.RAM are cut off from the rest of the system bus by FET switches. The controller copies contiguous
ranges up to 128 bytes (pixels) wide, and can repeat the copy on the next row up to a height of 128 rows. The width and height are controlled by a pair of 40103 down-counters,
one of which resets the X counter and increments the Y counter, while the other resets the flip flop that enables counting. This last reset signal can also optionally trigger an IRQ.

Additionally a comparator (wanted to use an 8-input NOR/OR but those don't seem to be available) checks whether the current byte loaded from G.RAM is zero. If it is, then the
write strobe is disabled. (If the transparency function is enabled, that is.)
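
For anyone who wants to reason about the copy behaviour in software first, here is a minimal sketch of the rectangle copy with the zero-byte transparency check. It is only a model under assumptions: the function name `blit`, the linear addressing, and the 128-byte row stride are illustrative and not taken from the schematic.

```python
def blit(gram, vram, src, dst, width, height, stride=128, transparent=True):
    """Model of the copy: one byte per clock, row by row.

    stride=128 (bytes per row) and linear addressing are assumptions
    made for illustration; the real address mapping may differ.
    """
    for row in range(height):        # Y counter: advances once per finished row
        for col in range(width):     # X counter: restarts at the start of each row
            value = gram[(src + row * stride + col) % len(gram)]
            # Transparency: a zero source byte suppresses the write strobe,
            # so the destination byte is left untouched.
            if transparent and value == 0:
                continue
            vram[(dst + row * stride + col) % len(vram)] = value
    return vram


gram = bytearray(128 * 128)
gram[0:3] = bytes([5, 0, 7])          # first row of a tiny 3x2 sprite
gram[128:131] = bytes([0, 9, 0])      # second row
vram = bytearray([0xFF]) * (128 * 128)
blit(gram, vram, src=0, dst=10, width=3, height=2)
```

With transparency enabled, the 0xFF background shows through wherever the sprite byte is zero.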

Originally I was working on another version of the design that could perform copies in either direction, scale graphics by scaling the counter clocks, and tile the copy if the destination
rectangle was larger than the source. These features might eventually make it into a larger retrocomputer build, or an FPGA-based project. I'm aiming to make my console a little
smaller than the Nintendo Entertainment System, so I cut these features to reduce chip count.

Also at the top of the schematic the _WE2 and _OE2 are simply clock-qualified versions of RWB.
_OE2 = !(Phi2 && RWB)
_WE2 = !(Phi2 && !RWB)
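
The two equations above can be checked with a quick truth table. This is just a sketch; the function name `strobes` is made up, and both outputs are active-low, so False means the strobe is asserted.

```python
def strobes(phi2: bool, rwb: bool):
    """Return (_OE2, _WE2) as active-low levels (False = asserted)."""
    oe2_n = not (phi2 and rwb)        # _OE2 = !(Phi2 && RWB)
    we2_n = not (phi2 and not rwb)    # _WE2 = !(Phi2 && !RWB)
    return oe2_n, we2_n


# Print the full truth table.
for phi2 in (False, True):
    for rwb in (False, True):
        print(phi2, rwb, strobes(phi2, rwb))
```

Neither strobe is asserted while Phi2 is low, and at most one of them is asserted while Phi2 is high.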

The dual-ported RAM symbol looks like the single-ported one because in my Eagle library I'm representing the ports like two gates on the same chip.


Attachment: sVDMA.png
PostPosted: Mon Dec 02, 2019 12:11 am 
Joined: Fri Dec 11, 2009 3:50 pm
Posts: 3367
Location: Ontario, Canada
Interesting project! And you've done a lovely job with the schematic -- it's laid out in a way that quite clearly implies what's going on. I'm not 100% sure I understand your comment about a dual-port RAM, however. And allow me to suggest that schematics for posting should be rendered in monochrome, and also not in such high resolution -- I reduced my copy by 50%.

You don't mention the speed you hope to achieve, but be warned that several of the chips you've chosen are rather slow. For example I would consider upgrading the 4040 and 40103 counters to 74HC4040 and 'HC40103. Also, I'm not sure why you've chosen to use several 4066 FET switches to connect the RAMs to the system data bus, but a single 74CBT3245 FET switch would be easier and probably better. Or, instead of a FET switch, you could consider just an ordinary transceiver such as 'HC(T) or 'AHC(T) 245. Better yet, if you use a '640 (which is the inverting version of a '245) then that multi-input NOR/OR you had trouble finding can be replaced with a multi-input NAND/AND instead -- the 13-input '133 NAND, for example.

On a more fundamental level, I can mention some significant advantages which would result if the RAM addresses were to come directly from loadable counters such as '161 or '163 rather than from '283 summers. But I needn't elaborate if you're at a point where such a reorganization would involve too much backtracking.

Thanks for sharing your project. Have fun, and keep us posted!

-- Jeff

_________________
In 1988 my 65C02 got six new registers and 44 new full-speed instructions!
https://laughtonelectronics.com/Arcana/ ... mmary.html


PostPosted: Mon Dec 02, 2019 11:11 am 
Joined: Tue Jul 17, 2018 9:58 am
Posts: 106
Location: Long Island, NY
Thanks so much for the feedback, Jeff! Fortunately I'm already used to backtracking, as this is maybe my third redraw of the design.
I'd absolutely be interested to hear about the advantages in using the loadable counters directly. I had them in the last sketch but switched to the adders so I could share the same counters between source and destination.

As for the speed, I've got a bunch of 14MHz and 28MHz crystal oscillators from my video experiments. Optimistically I'd like to run at the 14MHz rate, allowing for 14 MB/s transfers, or up to 873 full-screen transfers per second. Even half or a quarter of this would allow plenty of time for drawing complex sprite stacks.
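
The 873 figure can be reproduced with a little arithmetic. The sketch below assumes two things the post does not state outright: that "14MHz" means the common 14.31818 MHz NTSC-derived oscillator, and that a full screen is 128 x 128 bytes (the maximum copy size mentioned earlier). One byte moves per clock.

```python
def full_screen_copies_per_second(clock_hz, width=128, height=128):
    """Whole-frame copies per second at one byte per clock.

    width and height default to 128, which is an assumed frame size.
    """
    bytes_per_frame = width * height
    return clock_hz // bytes_per_frame


# 14.31818 MHz is an assumption about what "14MHz" refers to.
print(full_screen_copies_per_second(14_318_180))  # prints 873
```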

I probably should have added a note or updated the annotations before posting, but the 4040 and 40103 are definitely going to be the 74HC versions. My thinking for the 4066 switches was that I wanted something through-hole and easy to reason about. The 4066 has four switches per IC, so I'd only need two ICs to switch the data lines. If I use the 74HC245, would it be enough to wire DIR to RWB?

Using the 640 and a multi-NAND is an interesting idea... I'd have to introduce another inverter on the signal generation side of the video board, but I'm already using a 573 to steady the pixel value between clocks so I could switch that to the 563.

The dual-ported RAM is how the V.RAM interfaces to my composite video generator, which lives on another sheet. To keep things tidy I chose to represent each "side" of the V.RAM as a separate schematic symbol. I'm using the IDT7007, which has a total of 32 KB that I divide into two screen buffers. Updates to the video generator will go on the other thread since the two circuits run independently of one another.


PostPosted: Mon Dec 02, 2019 5:16 pm 
Joined: Fri Dec 11, 2009 3:50 pm
Posts: 3367
Location: Ontario, Canada
Quote:
the 4040 and 40103 are definitely going to be the 74HC versions
OK, good; the HC versions are a lot faster. But re actual clock rates and prop delays I'll leave the calculations to you. What sort of computer are you attaching to, BTW?

Quote:
Using the 640 [...] I'd have to introduce another inverter on the signal generation side of the video board, but I'm already using a 573 to steady the pixel value between clocks so I could switch that to the 563.
Oh! Right. I overlooked the fact the data also gets read by the other port on the RAM, as you explained.

As for loadable counters, they would improve the timing margins a lot, by eliminating the prop delay of the '283 adders. They'd also reduce the chip count somewhat, as follows.

For the X section the CPU would write to a '573, and from there the value would get re-loaded into a pair of 163's as each X sweep occurs. But we only sweep through the Y values once, isn't that right? So, for Y the CPU can write straight to a pair of 163's -- there's no need to have a 573.
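
A rough software model of that arrangement, in case it helps: the X start value sits in a latch (the '573) and is reloaded into the X counter at the start of every sweep, while Y is loaded once and simply counts. The name `sweep_addresses` and the 8-bit counter width (a pair of '163s) are assumptions for illustration, and how X and Y combine into a RAM address is left open.

```python
def sweep_addresses(x_start, y_start, width, height):
    """Yield (y, x) counter values in the order the hardware would produce them."""
    x_latch = x_start & 0xFF          # '573: holds the X start value
    y_counter = y_start & 0xFF        # pair of '163s loaded directly by the CPU
    for _ in range(height):
        x_counter = x_latch           # reload X counters from the latch each sweep
        for _ in range(width):
            yield (y_counter, x_counter)
            x_counter = (x_counter + 1) & 0xFF   # 8-bit wraparound
        y_counter = (y_counter + 1) & 0xFF       # Y is only swept through once


print(list(sweep_addresses(10, 3, 2, 2)))
```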


:idea: Come to think of it, maybe it'd be better if the system were only capable of doing one X sweep at a time. This further reduces the amount of logic by offloading some of the work to the CPU... and the net result would still be massively faster than using software only. :!:

For each sweep, the CPU would write directly to X counters (no need to have a 573). For Y it would simply write to a 573 (no need for counters!).

For a setup like this I wouldn't use IRQ's to signal completion of the sweeps. After starting a sweep (by writing to X), the software would increment the Y value (held in a CPU register) then immediately attempt to start the following sweep by writing to Y and X again. But until the present scan is completed, writes to the logic would cause the CPU's RDY pin to get pulled low. (With an NMOS CPU, RDY doesn't work for writes; you'd need to stretch the clock instead.)

Just throwing the idea out there. I admit this "one X sweep at a time" is not so appealing if there's other useful work the CPU could be doing while the copies are in progress.

-- Jeff

PostPosted: Mon Dec 02, 2019 10:18 pm 
Joined: Tue Jul 17, 2018 9:58 am
Posts: 106
Location: Long Island, NY
Quote:
What sort of computer are you attaching to, BTW?

I'm building a TV game system based on the 65c02. In addition to the graphics hardware there will be a 4-channel (2 square waves+noise+wavetable) audio generator, two gamepads read through memory-mapped shift registers, 8K of general-purpose SRAM, and a cartridge slot for ROM as well as eventual enhancement modules.

Quote:
As for loadable counters, they would improve the timing margins a lot, by eliminating the prop delay of the '283 adders.


Ah. I just ran the calculation on prop delays and it looks like the difference is indeed pretty big. Not just because of the 283 but also the ripple propagation on the 4040. Even the 74HC version takes 64ns to go from CP to Q7... If I were to run at 14MHz I should have 34ns for the clocked address to settle in before the write phase begins, plus another 7 while the clock passes through the NAND on its way to the write enable. Including the 15ns for the access time on the source RAM, I'm looking at a total of 113 ns from DMACLK to source data valid. By then the write enable has already been strobed. :shock:

Meanwhile the '163 only has 17ns from clock to output, making for a total of 43 ns from clock to data out, with 9 ns to spare before the write strobe propagates. Looks like I'm doing the 163 version if I want a chance at 14MHz!
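
Here is a small sketch of how such a timing budget can be tallied. It only sums the delays that are named individually above (counter clock-to-output, RAM access time, and the NAND in the write-strobe path), so its totals will not match the 113 ns and 43 ns figures, which presumably include delays that are not itemized. The function name and the 14.31818 MHz clock are assumptions.

```python
def write_slack_ns(clock_hz, clk_to_q_ns, ram_access_ns, strobe_gate_ns):
    """Time left between source data valid and the write strobe arriving.

    Negative slack means the strobe fires before the data is valid.
    """
    half_period_ns = 1e9 / clock_hz / 2          # address changes on the rising edge
    budget_ns = half_period_ns + strobe_gate_ns  # strobe is delayed by its gate
    return budget_ns - (clk_to_q_ns + ram_access_ns)


CLOCK_HZ = 14_318_180  # assumed value for "14MHz"
slack_4040 = write_slack_ns(CLOCK_HZ, clk_to_q_ns=64, ram_access_ns=15, strobe_gate_ns=7)
slack_163 = write_slack_ns(CLOCK_HZ, clk_to_q_ns=17, ram_access_ns=15, strobe_gate_ns=7)
print(round(slack_4040, 1), round(slack_163, 1))
```

Even with only these terms, the ripple counter path comes out well short, while the '163 path leaves a few nanoseconds of margin.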

Quote:
But until the present scan is completed, writes to the logic would cause the CPU's RDY pin to get pulled low.


An interesting idea! I'd considered the single sweep option, along with the possibility of pausing the CPU for anywhere-to-anywhere DMA copies instead of just the video banks. There are a few other things I can imagine I'd like the system to be doing during the writes, though, such as game logic or updating the audio registers. The IRQ handler would likely just be a lightweight queue-checking routine that either kicks off another copy or just clears the flag. (Though that last sentence makes me want to try implementing the queue in hardware, I think a software solution will suffice.)
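
To make the queue-checking handler concrete, here is a minimal sketch of the logic in Python rather than 65C02 assembly. Everything named here (`DmaQueue`, `submit`, `on_irq`, the `start_copy` callback) is hypothetical; the real handler would write the controller's registers instead of calling a function.

```python
from collections import deque


class DmaQueue:
    """Software queue of pending copy jobs, drained by the completion IRQ."""

    def __init__(self, start_copy):
        self.pending = deque()
        self.busy = False
        self.start_copy = start_copy   # hypothetical: programs the DMA controller

    def submit(self, job):
        """Start the job now if the controller is idle, otherwise queue it."""
        if self.busy:
            self.pending.append(job)
        else:
            self.busy = True
            self.start_copy(job)

    def on_irq(self):
        """Completion interrupt: kick off the next copy or just clear the flag."""
        if self.pending:
            self.start_copy(self.pending.popleft())
        else:
            self.busy = False


started = []
queue = DmaQueue(started.append)
queue.submit("sprite A")
queue.submit("sprite B")   # controller busy, so this one waits
queue.on_irq()             # first copy done, second one starts
```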


PostPosted: Mon Dec 02, 2019 11:24 pm 
Joined: Fri Dec 11, 2009 3:50 pm
Posts: 3367
Location: Ontario, Canada
Agumander wrote:
I'm building a TV game system based on the 65c02.
Nice! I think I need to take a closer look at your other thread. At first I didn't realize both were from the same person.

Agumander wrote:
There are a few other things I can imagine I'd like the system to be doing during the writes, though. Such as game logic or updating the audio registers.
Well, that's a valid concern.

Honestly, I wonder if you might be better off with a single RAM running at double the CPU clock rate -- or higher. This would let the CPU run flat out at all times, and still leave lots of bandwidth for DMA. Which includes the display function, BTW. I'm not convinced you need a dual-port device for that. Seems to me a single, modern 10ns RAM could serve all the RAM functions. And the block-move function would be able to do "anywhere-to-anywhere" copies, as you suggested a moment ago.

Lots of designs run RAM at double the CPU clock; it's a well-established strategy. Quadruple speed isn't out of the question, but it'd be non-trivial, even with a 10 ns memory.

Another main issue is the multiplexer chips that select the various address sources which need to connect to the RAM. 4-to-1 multiplexers (eg: '253) are only 2 bits wide, which means you'd need 8 of 'em. :|

Triple speed might be the sweet spot, both for timing and multiplexers. The multiplexers could maybe be 2-to-1 devices such as '257, which are 4 bits wide -- you'd only need four of 'em. The third address source would be the CPU itself, whose address bus can tri-state and thus can attach directly to the 257 outputs (which also can tri-state). But, hmm. That maybe gets awkward when the CPU needs to talk to I/O. Need to give that some thought.
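
The chip counts quoted above follow from dividing the address width by the number of bits each multiplexer package handles. A tiny sketch, assuming a 16-bit RAM address (the helper name is made up):

```python
import math


def mux_chips_needed(address_bits, bits_per_chip):
    """Number of multiplexer packages needed to switch a full address bus."""
    return math.ceil(address_bits / bits_per_chip)


ADDRESS_BITS = 16  # assumed address width
print(mux_chips_needed(ADDRESS_BITS, 2))  # '253: dual 4-to-1, 2 bits per chip
print(mux_chips_needed(ADDRESS_BITS, 4))  # '257: quad 2-to-1, 4 bits per chip
```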

There's probably a way to design the DMA logic so it presents the block-move addresses and video-display addresses sequentially on a single bus. That would simplify the multiplexer (i.e., fewer inputs required). Just thinking out loud, here. Seems to me you might end up with more capability and yet have fewer chips! Gonna need some AC(T) or AHC(T) logic, though, and a nice, snappy RAM. Does that "break the rules," in your opinion? :) (Re the retro aspect, I mean.)

PostPosted: Tue Dec 03, 2019 12:07 am 
Joined: Fri Dec 11, 2009 3:50 pm
Posts: 3367
Location: Ontario, Canada
Dr Jefyll wrote:
There's probably a way to design the DMA logic so it presents the block-move addresses and video-display addresses sequentially on a single bus.
Attachment: register ring.png
Like this, for instance. '377 is an 8-bit register, so you'd need to pair 'em up to hold 16 bit addresses. But the idea is you'd have three values continually circulating through the ring. They'd be...
  • the display address
  • the block-move Source, and
  • the block-move Destination.

After each address has appeared at the output, it gets incremented before being fed back to the start of the ring. :) '377 is good for something like this because it can be instructed to ignore the clock signal. That'll be handy when initial values are being loaded into the ring.
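
The ring behaviour can be modelled in a few lines. This is only a sketch of the idea described above: three 16-bit addresses circulate, the one at the output is used for a memory slot, then it is incremented and fed back to the start. The function name and the slot ordering are assumptions.

```python
from collections import deque


def run_ring(display, source, dest, steps):
    """Return the sequence of addresses presented at the ring output."""
    # Each ring stage stands in for a pair of '377 registers (16 bits).
    ring = deque([display, source, dest])
    seen = []
    for _ in range(steps):
        addr = ring.popleft()               # address at the output drives the RAM
        seen.append(addr)
        ring.append((addr + 1) & 0xFFFF)    # incremented, fed back to the start
    return seen


print([hex(a) for a in run_ring(0x0000, 0x4000, 0x8000, 6)])
```

Each address reappears every third slot, one higher than before, which is why a single incrementer can serve all three pointers.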

PostPosted: Tue Dec 03, 2019 1:20 am 
Joined: Tue Jul 17, 2018 9:58 am
Posts: 106
Location: Long Island, NY
Quote:
There's probably a way to design the DMA logic so it presents the block-move addresses and video-display addresses sequentially on a single bus.


By display address do you mean the one that the signal circuit looks at? I like the Rube Goldberg-ness of that circle of registers scheme, though one issue is that the display address needs to read every row twice since my pixels are two scanlines tall. It's also scaled to an odd fraction of the overall clock so that the image pixels all fit on screen without getting lost to overscan.

The 7007 being forty bucks a pop makes finding a way to use a normal SRAM tempting, though. Maybe the G.RAM can switch between the system bus and the DMA bus, and run at some multiple of the pixel clock rate during copy operations.


PostPosted: Tue Dec 03, 2019 2:03 am 
Joined: Sat Dec 01, 2018 1:53 pm
Posts: 727
Location: Tokyo, Japan
Agumander wrote:
The 7007 being forty bucks a pop makes finding a way to use a normal SRAM tempting, though.

This listing on AliExpress claims to be IDT7007S35PF or similar chips for a bit over a buck apiece, with shipping <$5 up to qty. 10. Yeah, pretty dodgy, but given that you can order a dozen for well under half the price of a new chip, it might be worth doing an order as a sort of lottery ticket.

_________________
Curt J. Sampson - github.com/0cjs


PostPosted: Tue Dec 03, 2019 3:00 am 
Joined: Fri Dec 11, 2009 3:50 pm
Posts: 3367
Location: Ontario, Canada
Agumander wrote:
By display address do you mean the one that the signal circuit looks at?
I mean the address from which data is read, to be displayed as video.

Quote:
I like the Rube Goldberg-ness of that circle of registers scheme, though one issue is that the display address needs to read every row twice since my pixels are two scanlines tall.
A 14 MHz CPU is probably fast enough to intervene during the horizontal retrace period, tweaking the value in the ring so the display address gets reset to produce the duplication you want. I admit I'm half guessing, but I do have experience with coding for the H-retrace period using a 1 MHz 6502.
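
For what it's worth, the value the display address needs to be reset to on each scanline is easy to express. The sketch below assumes 128 bytes per row and two scanlines per pixel row, as described in the thread; the function name and the base address are hypothetical.

```python
def display_row_start(scanline, base=0x0000, bytes_per_row=128, lines_per_pixel=2):
    """Start address to load for a given scanline so each row is shown twice."""
    return base + (scanline // lines_per_pixel) * bytes_per_row


# Scanlines 0 and 1 read the same row; scanline 2 moves on to the next row.
print([display_row_start(n) for n in range(4)])
```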

That said, the ring's sequential-only access does make values tricky to reload. Instead, maybe the values should be stored and retrieved from a 4-word register file made of 'HC or 'F670's. Ring or 670, the idea is to replace a large number of counters with a single adder. And, to have a single bus on which the DMA addresses present in sequence (reducing the complexity of the memory-address multiplexer).

Quote:
It's also scaled to an odd fraction of the overall clock so that the image pixels all fit on screen without getting lost to overscan.
Just kicking ideas around. It may be feasible to scale the CPU clock to the pixel clock, rather than vice versa. How much variability do you need? Will your horizontal frequency always be NTSC's 15.75 kHz? Can the pixel clock be fixed to one frequency?

Quote:
The 7007 being forty bucks a pop makes finding a way to use a normal SRAM tempting, though.
I haven't looked at the prices for FIFO memories, but I have a feeling it'd be cheaper to let one of those act as an elastic buffer for you, managing the clock disparity if in fact a disparity must be managed. But it's also possible to cycle-steal from the CPU on an irregular basis. You don't need a tidy, integer ratio between the number of cycles stolen vs the total number of CPU cycles. You only need to know whether or not there's a need to steal a cycle "right now."

Sorry if I'm stirring the pot too much. :oops: And thanks for the "Rube Goldberg" compliment, BTW! :P

-- Jeff

PostPosted: Tue Dec 03, 2019 9:53 am 
Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10938
Location: England
Sorry if this is already mentioned upthread, but a possible tactic is to build a wider memory. It can service 8 bit accesses from the CPU side and (time-sliced) 16 or 32 bit accesses from the video side. The video side needs a shift register to serialise but it makes fewer accesses.

And then a memory subsystem with 2 or 4 banks has even more flexibility, especially if you have a blitter in the picture.

So, there are more routes to improved memory bandwidth than raw access rate.
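
As a rough illustration of the saving, here is how the number of video-side accesses per row shrinks as the memory gets wider. The 128-byte row comes from earlier in the thread; the helper name is made up, and this ignores any overhead in the shift register.

```python
def video_accesses_per_row(bytes_per_row, memory_width_bits):
    """Memory accesses the video side needs to fetch one row of pixels."""
    bytes_per_access = memory_width_bits // 8
    # Round up so a partial final word still costs one access.
    return -(-bytes_per_row // bytes_per_access)


for width_bits in (8, 16, 32):
    print(width_bits, video_accesses_per_row(128, width_bits))
```

Every access the video side no longer needs is a slot that the CPU or a blitter could use instead.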


PostPosted: Wed Dec 04, 2019 12:44 am 
Joined: Tue Jul 17, 2018 9:58 am
Posts: 106
Location: Long Island, NY
Wow! Somehow I missed the fact that these FIFO devices existed... Even if I don't use them to buffer video data I can imagine all kinds of fun projects. Like a command buffer for the audio hardware to get cycle-perfect melody timing regardless of CPU activity!

Quote:
It can service 8 bit accesses from the CPU side and (time-sliced) 16 or 32 bit accesses from the video side.


Certainly something to think about! The video side already has as much overall bandwidth as it needs, though making the needed accesses more infrequent to give the CPU/DMA a turn would be an interesting way to use a normal SRAM.


PostPosted: Mon Dec 16, 2019 7:51 am 
Joined: Tue Jul 17, 2018 9:58 am
Posts: 106
Location: Long Island, NY
A little update on my DMA controller. I've switched from using the adder chips to using 74HC163s, and at least as far as the total delays are concerned I might yet be able to achieve the original 14MHz copy clock I'd hoped for.
The DMA and the composite video circuits together ended up being too many chips to fit within the current tier of EAGLE I'm subscribed to, so I split them into two boards that'll connect with a header.

The boards both have headers matching a backplane-style motherboard I ordered for prototyping. Although in theory the video card could connect directly to the system bus, it's mainly designed to connect to the top of the DMA card which then slots into the motherboard.


Attachments:
vidboard0.png (Composite video generator board)
vdma0.png (DMA controller board for VRAM)