6502.org

Posted: **Fri Aug 15, 2025 9:27 am**

Hi there! first time posting here

I've been (slowly) working on my own 65c816 based computer. I got the CPU running on a prototype breadboard a little while ago, and have tested some other things (keyboard adapter, SD card, UART, and basic VGA output) with a 65c02 based machine. Now I've had an idea for sprite hardware for a potential graphics stack for the actual machine I'd like to build. I'm not sure whether anyone else has taken this approach before, but it seems like it would let me render potentially 256 independent sprites without needing to duplicate a bunch of hardware (my biggest gripe with many of the simple designs I've seen).

The data flow would be as follows:

[Image: the overall data flow of the hardware]

The sprite reader reads data from the sprite attribute memory and does a vertical check using the height and size information (present in the first fetched word). If the sprite would be visible on the *next* scanline (the reader works one scanline ahead of time), more data is read, processed, and passed to the sprite renderer: the X position, width, horizontal flip information, palette nybble, and the starting location in the sprite graphics memory. The renderer then draws the sprite into the back side of the single-line double buffer.
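To make the vertical check concrete, here is a minimal Python sketch of the test the reader would perform for each sprite. The field names `y` (top scanline) and `height`, and both function names, are assumptions made up for illustration; the real attribute word layout isn't decided here.

```python
from dataclasses import dataclass


@dataclass
class SpriteAttr:
    y: int       # hypothetical: top scanline of the sprite
    height: int  # hypothetical: sprite height in scanlines


def visible_on_next_line(sprite: SpriteAttr, current_line: int) -> bool:
    """True if the sprite covers the scanline after the one being displayed."""
    next_line = current_line + 1
    return sprite.y <= next_line < sprite.y + sprite.height


def row_within_sprite(sprite: SpriteAttr, current_line: int) -> int:
    """Which row of the sprite's graphics would be drawn on the next scanline."""
    return current_line + 1 - sprite.y
```

Only sprites that pass this check would need the second, longer fetch (X position, width, flip, palette, graphics address).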

I've made a simulation of the overall approach. It doesn't currently fit real-life chips or their timing requirements, naturally, but it works well enough for me to have some faith in the approach. Here's a debug view and the actual output. In the debug view, the blue channel is renderer active, the green channel is reader active, the red channel is reader "stalled" (reading further sprite information or waiting for the renderer), and intensity is used as an alpha mask for sprites in the output:

[Image: debug view]

[Image: rendered view]

If anyone has any thoughts on this approach, or examples of it being done before, that would be greatly appreciated

Posted: **Fri Aug 15, 2025 10:07 am**

(Welcome! And thanks for sharing your ideas.)

Posted: **Fri Aug 15, 2025 7:07 pm**

catelyn wrote:

Hi there! first time posting here

Welcome!

Quote:

I've been (slowly) working on my own 65c816 based computer,

Excellent...we need more 816 enthusiasts around here.

Quote:

...now I've had an idea for sprite hardware for a potential graphics stack for the actual machine I'd like to build.

In looking at your block diagram, I am curious about how you plan to manage concurrent access of the attribute and graphics memory by the MPU and the video hardware. Are both the MPU and video hardware going to be in the same clock domain? If not, will there be some sort of semaphore arrangement to let each entity know what the other is up to?

How fast do you plan to run the 816 and how much RAM will it have?

Posted: **Fri Aug 15, 2025 7:37 pm**

BigDumbDinosaur wrote:

Excellent...we need more 816 enthusiasts around here.

It's a weird but pleasant little chip!

BigDumbDinosaur wrote:

In looking at your block diagram, I am curious about how you plan to manage concurrent access of the attribute and graphics memory by the MPU and the video hardware. Are both the MPU and video hardware going to be in the same clock domain? If not, will there be some sort of semaphore arrangement to let each entity know what the other is up to?

I'm not entirely sure yet, but I'm thinking of having the CPU only be able to access it during blanking (vertical blank and "forced blanking", both of which disable reading by the sprite hardware). This is still at a very early stage of design, as you may be able to tell.

BigDumbDinosaur wrote:

How fast do you plan to run the 816 and how much RAM will it have?

Ideally "as fast as I can", so 14 MHz, but I would probably settle for somewhere between 4 to 8 MHz, and while the final plan is "it'd be cool to have it fully kitted out with ~14.5 MiB of (work) RAM", I've currently got ~2.5 MiB of RAM available as a plan, with 512 KiB of flash ROM for boot operating storage

I've got some of the hardware planned out, but this might possibly be an eternal prototype where I try things that seem cool to me

Posted: **Fri Aug 15, 2025 7:38 pm**

Hello and welcome!

I've thought a bit about sprites before, and mostly stopped working on it seriously due to the same considerations - requiring quite a lot of counters etc per sprite. Your approach of doing everything per scanline is sound though I think, I'd thought about that too with a slightly different design - I was thinking of also eliminating all the vertical sprite processing hardware, and simply having the CPU store per scanline the necessary data to render any sprites on that scanline (e.g. for each of maybe up to four sprites, the distance to its leftmost pixel, the address of the row of pixels to draw there, and the number of pixels to draw). Then the hardware reads that during the blanking interval and sets up the appropriate counters. This way the only hardware needed is for horizontal timing - essentially the sprites are all 1D - but the trade-off is that the CPU needs to update more data per sprite in order to move them around. I didn't go very far with prototyping that concept though, I got distracted by other things.

Your idea requires less activity from the CPU, if I understand it correctly, but more calculation hardware. It also requires the hardware to have time to essentially loop over all the possible sprites, for every scanline - I guess it means that the total number of sprites on the screen is going to be limited by how much time it takes you to read the data for one sprite and decide if it needs to be rendered in the following scanline. Did you calculate any figures for that?

Regarding prior art - I imagine the sprite systems used in the C64 and Amiga worked a lot like this. It's also perhaps a kind of cut-down version of what the early Atari 8-bit computers did - they had a fascinating arrangement of coprocessors, with one running per scanline to output commands for the other one to execute during the next scanline. It's sort of a similar thing but taken to the next level.

Your limitations are going to be the number of sprites you can display within a scanline (limited by how many copies of the actual sprite hardware you still have - horizontal counters, pixel fetch, blending, etc), the total number of sprites you can support in the whole frame (limited by your ability to scan over all the sprite attribute data within a single scanline to work out what to draw in the next one), and whether you can deal with them overlapping (limited by your ability to perform multiple data reads per pixel and blend them). On that last point in case it's not clear - if three sprites are overlapping for example, then for pixels covered by all of them, you need to read at least the transparency data for all three sprites to work out which one to actually use to colour the pixel - so you need very fast sprite memory, or to store different sprites in different ICs so you can fetch them simultaneously, etc... it's still quite a bit of hardware just to get that bit to work.

I'll be interested to see how this turns out though! Are you using logisim or something similar to prototype it?

Posted: **Fri Aug 15, 2025 7:55 pm**

gfoot wrote:

Your idea requires less activity from the CPU, if I understand it correctly, but more calculation hardware. It also requires the hardware to have time to essentially loop over all the possible sprites, for every scanline - I guess it means that the total number of sprites on the screen is going to be limited by how much time it takes you to read the data for one sprite and decide if it needs to be rendered in the following scanline. Did you calculate any figures for that?

That is where I got the optimistic figure of 256 sprites (20 32x32 per scanline), assuming I can pump through the information needed for vertical sprite data in a single cycle per sprite (big assumption)

gfoot wrote:

Your limitations are going to be the number of sprites you can display within a scanline (limited by how many copies of the actual sprite hardware you still have - horizontal counters, pixel fetch, blending, etc), the total number of sprites you can support in the whole frame (limited by your ability to scan over all the sprite attribute data within a single scanline to work out what to draw in the next one), and whether you can deal with them overlapping (limited by your ability to perform multiple data reads per pixel and blend them). On that last point in case it's not clear - if three sprites are overlapping for example, then for pixels covered by all of them, you need to read at least the transparency data for all three sprites to work out which one to actually use to colour the pixel - so you need very fast sprite memory, or to store different sprites in different ICs so you can fetch them simultaneously, etc... it's still quite a bit of hardware just to get that bit to work.

So how this currently works is that the rendering hardware writes directly to the required pixels, only writing if the alpha mask (lower nybble != 0) says it should. If this can be fast enough (and modern SRAM is pretty fast), it should work out fine, at least in theory?
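A rough Python sketch of that masked write, for anyone who wants to picture it. The pixel layout (palette in the upper nybble, colour index in the lower nybble) and the name `render_sprite_row` are assumptions for illustration only; the only rule taken from the description is "write when the lower nybble is non-zero".

```python
def render_sprite_row(line_buffer, pixels, x, flip_h=False):
    """Draw one row of sprite pixels into the line buffer, skipping transparent ones."""
    if flip_h:
        pixels = pixels[::-1]  # horizontal flip: draw the row right to left
    for offset, pixel in enumerate(pixels):
        column = x + offset
        # Clip to the buffer, and only write where the alpha mask
        # (lower nybble != 0) says the pixel is opaque.
        if 0 <= column < len(line_buffer) and (pixel & 0x0F) != 0:
            line_buffer[column] = pixel
    return line_buffer
```

In this sketch overlap is resolved purely by draw order: a sprite rendered later overwrites earlier ones wherever it is opaque, so no per-pixel comparison between sprites is needed.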

gfoot wrote:

I'll be interested to see how this turns out though! Are you using logisim or something similar to prototype it?

Using Digital right now because of its emulated VGA display node, the simulation doesn't fit real chips yet, but the concept does appear to be at least theoretically functional

I'll try to make a proof of concept board when I'm a bit further with the machine I'd want to hook it up to, I have a non-zero amount of faith in the idea so far, which is a decent start

Posted: **Fri Aug 15, 2025 9:20 pm**

I am now working on my own "computer", which has a "retrocomputer" part (HD6309 CPU) and a "graphics card" part (ATmega2560). They communicate via a shared memory chip, which can either be used by the CPU via the bus or be connected directly to the ATmega2560. The glue logic (an ATF1504 CPLD) arbitrates which part can access the chip and when.

You may find some of the ideas here useful (or it may just be a waste of time, who knows...). The arrangement allows moving large amounts of data between the CPU and the graphics card, each working at a different clock and in a different style.

The idea is that the CPU writes one or more requests to the RAM chip, then the ATmega reads or copies them to its own memory and shows the result on VGA "somehow".

When the CPU "owns" the chip, it is connected through a 74HC245 (octal bus transceiver with 3-state outputs) to the system bus and is mapped like "normal memory", so the CPU can simply read and write there as it wishes.

After the CPU has done everything it wanted, it releases the chip and the ATmega "owns" it.

The ATmega uses some GPIO pins to access the chip directly, reads it, and decides how to handle the requests. It may also write a return code or values there. Then it releases the chip and the CPU can have its way with it again. (All of the ATmega's GPIO pins can be tri-stated, so it can easily "not interfere" with the chip when it does not own it.)

As a first step I plan to have a simple protocol like "Move cursor to: 2,5" and "write this string: 'Hello world!!!'", then add something like VDU calls, and maybe also "copy to rectangle: 12,32-28,48 these bytes: len:256, bytes (256 bytes)" and so on.

A similar arrangement will be used for the "file unit", where an SD card with FAT will be used and text and files will be transferred over the memory chip.

It adds latency to the communication, but the throughput may be high (only a few signals need to switch to "move" ownership of 128 kB of RAM), each party can use it at its own pace and in its own style, and all instructions can be interpreted in the background while the CPU is inventing new requests.

I have the first design of the graphics card here: KiCad files and some text, as part of the https://github.com/githubgilhad/MegaHomeFORTH repository (which is the graphics card part). It is still under construction and the rest of the computer will come later (I hope). But the graphics card may also be used as an SBC on its own (which is what I do now for testing it).

[Image: schematic of the shared memory chip idea]

I do not want to hijack this thread. I may write more about my project once it is more complete. Even though it is not 6502/65c816, it could easily be adapted to it. I just wanted to present the shared RAM idea.

Posted: **Fri Aug 15, 2025 10:46 pm**

catelyn wrote:

So how this currently works is that the rendering hardware writes directly to the required pixels, only writing if the alpha mask (lower nybble != 0) says it should. If this can be fast enough (and modern SRAM is pretty fast), it should work out fine, at least in theory?

Hmm so while the display system is scanning along odd scanline 123 for example, your hardware is going to:

- copy row 124 from the background framebuffer into the even scanline buffer
- scan sprite attribute memory to identify which sprites appear on scanline 124
- for each such sprite in turn, copy its non-transparent pixels into the even scanline buffer, on top of the background data

It feels like a lot but I guess it does achieve your goal as you now only need one physical "sprite drawer", just it's going to run multiple times during a scanline. In a high resolution I'm not sure it'd be a win, except maybe reducing the challenge of finding fast enough sprite memory, but in a double-scanned resolution like 320x240 it would probably have a lot of benefits as on top of everything else it halves the number of reads the display hardware has to do from the framebuffer memory, by caching a scanline in a form that can then easily be displayed twice.

Just doing some calculations on that to check - supposing your memory in the scanline buffer and framebuffer is just fast enough to manage 640x480 resolution - reading 640 pixels from the framebuffer and writing them into the scanline buffer, at the rate of the VGA pixel clock - then you have 160 pixels' worth of time per scanline remaining to go back and draw the sprites on top. Sticking with one pixel per VGA clock, you'll be able to write another 160 pixels of sprite data - if it's up to 32 pixels per sprite then that's 5 sprites at the most.

But at 320x240 you now have 1600 VGA pixels' worth of time per double-scanline, rather than just 800 - and you only have 320 pixels to copy from the framebuffer - so now you have the time to copy up to 1280 pixels from sprites, which is a world of difference - now you can have 40 sprites all visible on one scanline. Even if your memory system is only half the speed of the VGA clock, you still get 20 sprites - and I guess this is where you got that number from?
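The two full-speed estimates above can be checked with a few lines of Python. This is only a back-of-envelope sketch under the same assumptions as the text (one buffer write per VGA pixel clock, 800 pixel clocks per scanline, 32-pixel-wide sprites); the function name `max_sprites_per_line` is made up for illustration.

```python
def max_sprites_per_line(clocks_available, background_pixels, sprite_width=32):
    """Sprites that fit in the time left after copying the background row."""
    spare = clocks_available - background_pixels
    return max(spare, 0) // sprite_width


# 640x480: 800 pixel clocks per scanline, 640 of them spent on the background copy.
print(max_sprites_per_line(800, 640))   # 5
# 320x240 double-scanned: 1600 pixel clocks per pair of scanlines, 320 to copy.
print(max_sprites_per_line(1600, 320))  # 40
```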

I actually think there is far more interesting potential here than just sprite rendering though, and it's well worth looking up how the Atari 8 bit systems worked - ANTIC, POKEY, and GTIA. For example you don't need to actually read the same framebuffer data each frame, you can scroll it around, and maybe even allow for horizontal stretching or other transformation of the data during the copy, it's a powerful technique that could allow a lot of creative effects.

Quote:

I have a non-zero amount of faith in the idea so far, which is a decent start

Better than my faith in most of my projects!

Posted: **Sat Aug 16, 2025 2:11 am**

catelyn wrote:

BigDumbDinosaur wrote:

Excellent...we need more 816 enthusiasts around here.

It's a weird but pleasant little chip!

It has never seemed all that weird to me once I understood the architecture. Having the stack capabilities that it has makes the effort to learn its quirks effort well spent.

Quote:

How fast do you plan to run the 816 and how much RAM will it have?

Ideally "as fast as I can", so 14 MHz, but I would probably settle for somewhere between 4 to 8 MHz, and while the final plan is "it'd be cool to have it fully kitted out with ~14.5 MiB of (work) RAM", I've currently got ~2.5 MiB of RAM available as a plan, with 512 KiB of flash ROM for boot operating storage.

The current production versions of both the C02 and 816 can run well beyond 14 MHz. Production testing is done at 20 MHz, which is the speed at which my POC V1.2 unit can run. The FMAX vs. VCC curve, when extrapolated, suggests an 816 running on 5 volts is stable at ~25 MHz. Bill Chen (user plasmo here) has gotten a C02 to run in the mid-30 MHz range. So there is headroom above the official 14 MHz rating. Your glue logic and RAM speed, of course, will dictate the clock ceiling.

BTW, during testing with the logic analyzer on both my POC V1.2 and 1.3 units, the 816s that I sampled all made the address bus valid right around 12 nanoseconds after the fall of Ø2. In theory, that timing value would suggest that the 816 might be capable of running at 30+ MHz. The problem at that speed will be in latching the bank bits before the rise of the clock.

Quote:

I've got some of the hardware planned out, but this might possibly be an eternal prototype where I try things that seem cool to me.

It’s the journey that is fun to me...but may not be fun to you.

Posted: **Sat Aug 16, 2025 11:22 am**

gfoot wrote:

copy row 124 from the background framebuffer into the even scanline buffer

Ah, no, understandable misunderstanding. The current idea is to have the line buffer be exclusive to the sprite hardware: it gets cleared out as it is read (the hardware reads 2 pixels and then clears 2 pixels, alternately) while the back buffer gets filled with just the sprite data. I'm aware it's a bit of an unusual approach, but it should be fine for handling just sprites, with the rest of the mixing being done further down the graphics pipeline.
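A small Python sketch of that read-then-clear line buffer, just to pin down the behaviour. The class and method names (`LineBuffers`, `scan_out`, `swap`) are hypothetical, and the value 0 is assumed to mean "no sprite pixel here".

```python
class LineBuffers:
    """Two single-line buffers: one displayed (front), one being rendered into (back)."""

    def __init__(self, width):
        self.front = [0] * width
        self.back = [0] * width

    def scan_out(self):
        """Read the front buffer for display, clearing each pair of pixels after reading it."""
        out = []
        for i in range(0, len(self.front), 2):
            pair = self.front[i:i + 2]
            out.extend(pair)                         # read 2 pixels
            self.front[i:i + 2] = [0] * len(pair)    # then clear those 2 pixels
        return out

    def swap(self):
        """At the end of a scanline the two buffers exchange roles."""
        self.front, self.back = self.back, self.front
```

Because the front buffer is already empty by the time it becomes the back buffer again, the renderer only ever has to write opaque sprite pixels and never needs a separate clearing pass.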

gfoot wrote:

I actually think there is far more interesting potential here than just sprite rendering though, and it's well worth looking up how the Atari 8 bit systems worked - ANTIC, POKEY, and GTIA. For example you don't need to actually read the same framebuffer data each frame, you can scroll it around, and maybe even allow for horizontal stretching or other transformation of the data during the copy, it's a powerful technique that could allow a lot of creative effects.

I'll be sure to check that out sooner rather than later, my current idea for framebuffer/tilemap layers would be to do something similar to what James Sharman does for his tilemap in the JAM-1, but display lists (I vaguely remember that's an Atari 8-bit concept from some repair videos?) do also seem like an interesting approach to rendering graphics!

Posted: **Sat Aug 16, 2025 11:30 am**

BigDumbDinosaur wrote:

It has never seemed all that weird to me once I understood the architecture. Having the stack capabilities that it has makes the effort to learn its quirks effort well spent.

The way register sizing is handled makes it a bit more complicated than other architectures, in my opinion, and the additional hardware needed to deal with the data/bank multiplexing does make it a bit more complicated to get working than a 6502 (which is why I started with that for my first prototyping), but it's a fun chip!

BigDumbDinosaur wrote:

It’s the journey that is fun to me...but may not be fun to you.

I do love the journey, and it's a lot of fun learning more things about electronics, even if some of it is learnt the hard way through trial and error.

Posted: **Sat Aug 16, 2025 12:43 pm**

catelyn wrote:

Ah, no, understandable misunderstanding. The current idea is to have the line buffer be exclusive to the sprite hardware: it gets cleared out as it is read (the hardware reads 2 pixels and then clears 2 pixels, alternately) while the back buffer gets filled with just the sprite data. I'm aware it's a bit of an unusual approach, but it should be fine for handling just sprites, with the rest of the mixing being done further down the graphics pipeline.

Ah I see, that's very interesting, sort of between the two - you avoid copying the framebuffer data into the scanline by still doing that blend later on. Nice idea, it makes even more sense now!

catelyn wrote:

I'll be sure to check that out sooner rather than later, my current idea for framebuffer/tilemap layers would be to do something similar to what James Sharman does for his tilemap in the JAM-1, but display lists (I vaguely remember that's an Atari 8-bit concept from some repair videos?) do also seem like an interesting approach to rendering graphics!

It's not something I actually know a lot about, just something I've scratched the surface on and thought was an interesting idea and ahead of its time, and it just keeps popping back into my head when topics like this come up.

Posted: **Sat Aug 16, 2025 3:54 pm**

Many 2d home game console video chips of the late 8-bit CPU generation (TG16/PCE etc) and beyond did this sort of thing, with line buffers and painting tiles/sprites onto them, with one hardware renderer instead of duplicated parallel hardware blocks per sprite, wrangling for who should show in real-time. So yeah, it's perfectly cromulent!

I believe they all scanned through all potential sprites per scanline, to see which is active and in view, and that can take a lot of cycles depending on your maximum number of sprites and clock speed, but it's the most consistent way to handle things. If you had an enable/disable sprite command list that ran per scanline to try to lessen work, then there'd still be a maximum number of events you could handle per scanline, and if you think about big walls of sprites all starting on the same line, a hardwired full scan-through would likely be a lot more efficient.

Posted: **Sat Aug 16, 2025 5:58 pm**

catelyn wrote:

The way register sizing is handled makes it a bit more complicated than other architectures, in my opinion, and the additional hardware needed to deal with the data/bank multiplexing does make it a bit more complicated to get working than a 6502 (which is why I started with that for my first prototyping), but it's a fun chip!

Yep, having to twiddle the m and x bits in SR (status register) can make for some interesting programming. In retrospect, it would have been more efficient if Bill Mensch had used the WDM “escape” opcode followed by the appropriate opcodes to tell the 816 to do a 16-bit load or store, e.g., $42 $AD $34 $12 (WDM LDA $1234) to indicate the load from $1234 is to be a word instead of a byte. I’m not at all certain as to why he adopted the idea of using SR bits and the REP/SEP instructions. Oh well!

In most of my 816 programs, I clear m and x at startup and leave them that way to the end. If a function needs a different combination, it pushes SR and changes m and/or x as required. Many functions set x, giving me a 16-bit accumulator and 8-bit indexes. Higher-level functions usually completely preserve MPU state, which is defensive programming so the mainline code doesn’t have to “worry” about register sizes and content. Lower-level functions, especially primitives, usually preserve only DB (data bank), DP (direct page pointer) and SR—.A, .X and .Y only get preserved if they are to be used to return values to the caller.
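As a rough illustration of the m and x handling described above, here is a small Python model (not 65C816 code) of how REP and SEP change the status register in native mode. The bit positions (m = $20, x = $10) are the documented ones; the function names `rep`, `sep` and `widths` are made up for this sketch.

```python
M_FLAG = 0x20  # m: 1 = 8-bit accumulator, 0 = 16-bit accumulator
X_FLAG = 0x10  # x: 1 = 8-bit index registers, 0 = 16-bit index registers


def rep(sr, mask):
    """REP #mask: clear the status register bits selected by mask."""
    return sr & ~mask & 0xFF


def sep(sr, mask):
    """SEP #mask: set the status register bits selected by mask."""
    return (sr | mask) & 0xFF


def widths(sr):
    """Register widths implied by m and x (native mode only)."""
    return {
        "accumulator": 8 if sr & M_FLAG else 16,
        "index": 8 if sr & X_FLAG else 16,
    }


sr = rep(0x30, M_FLAG | X_FLAG)  # like REP #$30 at startup: everything 16-bit
saved = sr                       # like PHP on entry to a function
sr = sep(sr, X_FLAG)             # like SEP #$10: 16-bit accumulator, 8-bit indexes
inside = widths(sr)
sr = saved                       # like PLP on exit: caller's sizes restored
```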

A lot of my functions point DP to the stack, which is why DP has to be preserved. A function that uses a stack frame for parameter-passing preserves DB because it is affected by the MVP instruction that is part of the stack housekeeping code that is executed when the function returns.

As for the bank latching, that is a nuisance—timing can be critical. It also complicates aspects of the glue logic that determine where things such as ROM and I/O show up in the memory map. Although it can be done with discrete logic, better to use programmable logic and ease the timing problems.

Quote:

BigDumbDinosaur wrote:

It’s the journey that is fun to me...but may not be fun to you.

I do love the journey, and it's a lot of fun learning more things about electronics, even if some of it is learnt the hard way through trial and error.

Trial and error can be good...as long as the error part doesn’t cause vile-smelling smoke to sneak out.

Posted: **Sun Aug 17, 2025 12:01 pm**

White Flame wrote:

Many 2d home game console video chips of the late 8-bit CPU generation (TG16/PCE etc) and beyond did this sort of thing, with line buffers and painting tiles/sprites onto them, with one hardware renderer instead of duplicated parallel hardware blocks per sprite, wrangling for who should show in real-time. So yeah, it's perfectly cromulent!

That's great to hear!

I do have a lot of practical problems I'd still need to find a solution for, but I think I can eventually make it work.

White Flame wrote:

I believe they all scanned through all potential sprites per scanline, to see which is active and in view, and that can take a lot of cycles depending on your maximum number of sprites and clock speed, but it's the most consistent way to handle things. If you had an enable/disable sprite command list that ran per scanline to try to lessen work, then there'd still be a maximum number of events you could handle per scanline, and if you think about big walls of sprites all starting on the same line, a hardwired full scan-through would likely be a lot more efficient.

Yeah, that's what I was thinking too. There's a lot of "clever" ways to do this, but doing it stupid faster might just be better.

A concept for sprite hardware