Many of us want our 6502/65816 systems to be relevant - or sufficient to use as our primary computer. Many of us want our systems to be fast and maybe a little flashy in a world where highly visual work gains disproportionate attention. And many of us would like to achieve that with a blitter. Even BigDumbDinosaur (who is best known for multiple alternate pursuits involving the robust fabrication of metal, the development of robust filing systems, serious business applications and a compact no-nonsense approach to operating systems) wants faster block copy:
BigDumbDinosaur on Thu 9 Jun 2022 wrote:
What is really needed is a 65xx bus-compatible “blitter” to take care of mass memory copies and DMA-type operations.
Various schemes have used FPGA for video. ElEctric_EyE had great success before going dormant. Agumander's GameTank has an audacious 2D discrete implementation and the details are publicly available. Less is known about Radical Brad's design but it is more ambitious and includes 2D blitting with run length compression and a scrolling playfield on 30 breadboards. (I love you guys but I wouldn't want to make five of those.)
2D blitting is a special case which is most useful for bitmap graphics. Beyond that, I'm not sure what anyone would like in a 65xx compatible blitter. I presume that copying from arbitrary offsets is highly desirable, although perhaps page alignment is sufficient. I presume that 1D block copies larger than one page are highly desirable. I also presume that alternate cycles should be read and write. I also assume that an interrupt scheme similar to the GameTank is highly desirable. Specifically, an interrupt occurs when the blitter is finished.
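To make those presumptions concrete, here is a minimal sketch of a 1D blitter that alternates read and write cycles and raises an interrupt flag when it finishes. The class name, register names and the single-byte latch are my own illustrative assumptions, not the GameTank's design or anyone else's.

```python
class Blitter1D:
    """Toy model of a 1D block-copy blitter (hypothetical register layout)."""

    def __init__(self, memory):
        self.memory = memory   # bytearray standing in for bus-visible RAM
        self.src = 0           # next source address (any offset, not page aligned)
        self.dst = 0           # next destination address
        self.remaining = 0     # bytes still to copy; may exceed one page
        self.latch = 0         # byte held between the read and the write cycle
        self.reading = True    # True: next cycle reads, False: next cycle writes
        self.irq = False       # set when the final byte has been written

    def start(self, src, dst, length):
        self.src, self.dst, self.remaining = src, dst, length
        self.reading = True
        self.irq = False

    def cycle(self):
        """Advance the blitter by one bus cycle."""
        if self.remaining == 0:
            return
        if self.reading:
            self.latch = self.memory[self.src]
            self.src += 1
        else:
            self.memory[self.dst] = self.latch
            self.dst += 1
            self.remaining -= 1
            if self.remaining == 0:
                self.irq = True   # interrupt on completion
        self.reading = not self.reading


def run_copy(memory, src, dst, length):
    """Run a copy to completion; return (bus cycles used, irq flag)."""
    blitter = Blitter1D(memory)
    blitter.start(src, dst, length)
    cycles = 0
    while blitter.remaining > 0:
        blitter.cycle()
        cycles += 1
    return cycles, blitter.irq
```

In this model a copy of N bytes costs 2N blitter cycles, which is the price of strictly alternating reads and writes on a single-ported RAM.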
There is the problem of blitter priority. If a blitter steals all bus cycles then the processor is halted and doesn't respond to interrupts. That is invariably a problem in any system involving high throughput storage, UART, Ethernet or video. However, if the blitter is bottom priority then it only uses dead cycles. If you are persistently good at avoiding some addressing modes then the guaranteed bandwidth available to a blitter is zero, and this is easier to achieve when the 65816 is in native mode. The halfway hedge of alternate processor and blitter cycles is also undesirable because it increases interrupt jitter. Apart from that limitation, equal priority solves many problems very economically. For example, it is possible to make a discrete or FPGA, one RAM chip, four phase system where:
- Phase 0: Read-only video cycle.
- Phase 1: Read/write processor cycle.
- Phase 2: Read-only audio cycle.
- Phase 3: Read/write blitter cycle.
In particular, phase 2 or phase 3 can be forfeited. Don't use sound? Have double video resolution. Don't use blitter? Have double processor bandwidth. Using one contemporary 50 cent SRAM, I believe this arrangement approaches or matches the bus bandwidth of a Commodore Amiga 500. Unfortunately, unlike the GameTank which uses dual port RAM to decouple different parts of the system, all of this would be tied to one clock - and that would probably be 25MHz or so for 640*480 VGA or 28MHz or so for NTSC. That means audio may only play at one awkward rate, which varies according to display preference. Also, like the Amiga, it is very good at outputting video and audio but absolutely terrible for inputting video and audio. The main advantage of this arrangement is that it always avoids two sequential writes. Overall, it is a quick hack to get 1990s performance from one SRAM.
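The four phase arrangement and the forfeiting rule can be sketched as a slot table. This is only an illustration: the function names are mine, and I am assuming the RAM performs one access per slot at the quoted clock, which the outline above does not pin down.

```python
# Phase order from the list above: 0 video, 1 processor, 2 audio, 3 blitter.
DEFAULT_SLOTS = ("video", "cpu", "audio", "blitter")
WRITERS = {"cpu", "blitter"}   # video and audio phases are read-only


def build_schedule(use_audio=True, use_blitter=True):
    """Return the owner of each of the four phases after forfeiting."""
    slots = list(DEFAULT_SLOTS)
    if not use_audio:
        slots[2] = "video"     # forfeited audio phase goes to video
    if not use_blitter:
        slots[3] = "cpu"       # forfeited blitter phase goes to the processor
    return tuple(slots)


def bandwidth_share(schedule, slot_rate_hz):
    """Accesses per second for each owner, assuming one RAM access per slot."""
    per_slot = slot_rate_hz / len(schedule)
    shares = {}
    for owner in schedule:
        shares[owner] = shares.get(owner, 0) + per_slot
    return shares


def has_back_to_back_writes(schedule):
    """True if two write-capable phases are adjacent (including wrap-around)."""
    n = len(schedule)
    return any(
        schedule[i] in WRITERS and schedule[(i + 1) % n] in WRITERS
        for i in range(n)
    )
```

Whichever phases are forfeited, a read-only phase always sits between the two write-capable phases, which is the "never two sequential writes" property.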
Beyond block copying, what should a 65xx blitter do? The Amiga blitter could perform logical operations and this extended to decode of the track-at-once floppy disk recording format. The Amiga explicitly allowed exploitation of the blitter pipeline delay and this allowed arbitrary barrel shifting (left or right) up to one word while only ascending through RAM. A 65xx blitter which performs logical operations could implement UART decode, Ethernet (Manchester) encode/decode, S/PDIF (biphase mark) encode/decode, floppy (MFM) encode/decode, cell network encode/decode or similar to data in situ. USB3.x or PCIx may be feasible or desirable. If arithmetic operations are included then de-correlation of multi-media is possible. That includes JPEG encode/decode, MPEG audio encode/decode and surround sound schemes similar to MLP [Meridian Lossless Packing]. This requires nothing more than unconditional execution of addition, subtraction and multiplication. An 8 bit adder with carry can perform 16 bit, 32 bit or 64 bit addition with selective reset of the carry flag every 2, 4 or 8 cycles. With unrestricted addition and barrel shifting, it is possible to implement quirky things like a 1024 bit rational FPU.
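The carry trick is easy to show in software. Below is a sketch of an 8 bit adder streaming over two buffers, with the carry cleared every 2, 4 or 8 bytes. I am assuming little-endian words (low byte first, so the blitter only ascends through RAM); the function names are illustrative.

```python
def add8(a, b, carry_in):
    """One 8 bit add with carry: returns (sum byte, carry out)."""
    total = a + b + carry_in
    return total & 0xFF, total >> 8


def stream_add(src_a, src_b, word_bytes):
    """Add two byte streams as packed little-endian words of word_bytes each.

    The adder is only 8 bits wide; the carry flag is cleared at every word
    boundary (every 2, 4 or 8 cycles) and otherwise ripples to the next byte.
    """
    if len(src_a) != len(src_b) or len(src_a) % word_bytes:
        raise ValueError("buffers must be equal length and a whole number of words")
    out = bytearray()
    carry = 0
    for i, (a, b) in enumerate(zip(src_a, src_b)):
        if i % word_bytes == 0:
            carry = 0              # selective carry reset
        s, carry = add8(a, b, carry)
        out.append(s)
    return bytes(out)
```

The same pair of buffers gives different results depending only on how often the carry is reset, so word size becomes a mode setting rather than extra hardware.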
This doesn't apply to discrete implementation due to the quantity of components required but an ALU blitter may have a special mode which feeds a polygon blitter. If the ALU blitter permits two or more sources then point data and color/texture data may be sourced from separate but equal size buffers. This may allow graphics to exceed Sega's Virtua Racing. Specifically, it may allow more than 6000 polygons to be sourced more than 30 times per second. A suitably rigged system may allow static scenery to be merged with dynamic data. This only requires 6502 to animate and depth sort vehicles or Quake style monsters. The remainder is fed from ALU blitter to polygon blitter with only rotate and translate.
With one RAM chip for processor ($0.50), four RAM chips for blitter ($2.00) and a few tri-state buffers ($4.00), it is possible to exceed the bus bandwidth and processing power of Amiga 1200. For those with an extravagant budget and far less tolerance for breadboards, wire wrapping or soldering, an FPGA development board with DDR sockets may obtain bus bandwidth exceeding 500MB/s. Of course, there is a catch with this new-fangled four phase DRAM. Specifically, the sustained throughput above 100MHz only applies to sequential, pipelined access. Thankfully, that would only be to our sequential, pipelined co-processor. Meanwhile, those really cheap SRAM chips are brilliant for unpipelined, random access at speeds below 30MHz.
At this point, the blitter is more like a vector co-processor where two (or more) inputs are run through a separate ALU, on a separate bus, running at a separate speed. Technically, this is a heterogeneous architecture and these can be hugely awkward. Historically, we've had DTACK Grounded's MC68000 in Commodore PET where I/O sprouted in every direction. Keyboard, text display, audio and floppy disk were connected to 6502 but with 64KB RAM or more, the MC68000 was more suited for bitmap display, harddisk and print spooling. Acorn was a little more organized with all I/O connected to the low latency 6502 but the other processor was optional, arbitrary and connected via a byzantine protocol. Atari and Sega had bizarre designs with seven processors. The quirky Amiga is missed. Nowadays, we have various incompatible x86_64 variants connected to various incompatible GPU variants.
One of the many problems of heterogeneous computing is that resource limits may be hit prematurely. For example, exhausting processor RAM or graphics RAM when the other type is barely used. Likewise for processing power. However, with a vector co-processor on a separate bus, this is a lesser concern. We may have 65816 with maybe 12MB 8 bit SRAM, the co-processor with maybe 16GB 32 bit DRAM (and the ability to sequentially scan it within one minute) and REU style paging between SRAM and DRAM (although, possibly, at the rate of seven cycles per byte or slower). Overall, we have a distinction between unconditionally processed bulk data held in low quality DRAM (Ethernet packets, PCM audio, JPEG tiles) and conditionally processed meta data held in high quality SRAM (program, memory allocation, network addresses, music sequences, file handles, RMS error).
This leads to an arrangement where undifferentiated ports are double buffers or triple buffers to an undifferentiated pool of pipelined DRAM. (See AndrewP's discrete simulation of double buffered video for a coarse example.) ALU blitter inputs/output, polygon blitter output and 2KB or so window available to 6502 are the most differentiated interfaces to bulk DRAM. The remainder may be interchangeable ports for video, audio, network or storage. (Alternatively, ports may be set to a mode where they can be used for PS/2, I2C, SPI or SNES peripherals.) Each port may operate at its own frequency. For example, audio may operate at 24.000MHz, network may operate at 25.000MHz and video may operate at 25.175MHz or more. The limitation is that aggregate memory bandwidth must never be exceeded and processor interrupt frequency must never be exceeded. When the DRAM is four times wider and four times faster than the ports, this can be trivially achieved by keeping the total number of ports to 16 or less.
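The 16 port figure is just width ratio times speed ratio. A back-of-envelope check might look like the sketch below. It assumes byte-wide ports and ignores DRAM refresh and bus turn-around, so treat it as an upper bound; the function names and the 100.7MHz DRAM clock are my own illustrative choices.

```python
def max_ports(width_ratio, speed_ratio):
    """Ports that fit when DRAM is width_ratio wider and speed_ratio faster."""
    return width_ratio * speed_ratio


def fits_budget(port_rates_hz, port_width_bytes, dram_rate_hz, dram_width_bytes):
    """True if the summed port demand does not exceed raw DRAM bandwidth.

    Ignores refresh, turn-around and arbitration overhead (assumption).
    """
    demand = sum(port_rates_hz) * port_width_bytes
    supply = dram_rate_hz * dram_width_bytes
    return demand <= supply


# Example ports from the paragraph above: audio, network, video.
example_ports = [24_000_000, 25_000_000, 25_175_000]
```

For instance, `fits_budget(example_ports, 1, 100_700_000, 4)` asks whether three byte-wide ports fit a 32 bit DRAM clocked at four times the fastest port.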
As demonstrated by Agumander and Radical Brad, the best blitter on a 6502 bus is no blitter on a 6502 bus. By evicting all of the bulk data and blitting from the 8 bit processor bus, we get the killer feature rarely found elsewhere: guaranteed 1000ns interrupt response time (except when doing a Turing complete atomic operation) from a language which can be written by mortals without an "optimizing" compiler. This means you can count the cycles in the next 1000ns and the one after that. When interrupts have fairly deterministic duration, a minimal core may have a magnifying effect. For example, 8 bit 25MHz processor may instigate checksums and routing for 100MB network data per second or instigate 6000 polygon draw operations per video frame.
Bulk memory doesn't have to be accessed with byte alignment - although this is useful for many video and audio processing tasks. Instead, 1GB RAM = 2^30 bytes = 2^20 1KB buffers. These buffers may be used with no regard for fragmentation because everything is the same size. As an example, roughly 48 interrupts per second allow 8 bit 48kHz audio to be played continuously. Roughly 96 interrupts per second allow 16 bit 48kHz audio to be played. Many systems require more processing power to send or receive 9600 baud UART. Nominally, a vector co-processor in its own memory-space may work in kilobytes rather than bytes. Unfortunately, if bulk DRAM exceeds SRAM by more than a factor of 1000 then there is insufficient SRAM to track all buffers independently. The solution is to allocate memory in much larger quantities.
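The buffer arithmetic can be checked in a few lines. This sketch assumes mono audio, one interrupt per 1KB buffer and, for the tracking estimate, a hypothetical one byte of SRAM bookkeeping per buffer. With 1024 byte buffers the rates come out slightly under the round figures of 48 and 96, which correspond to 1000 byte buffers.

```python
BUFFER_BYTES = 1024


def buffer_count(ram_bytes, buffer_bytes=BUFFER_BYTES):
    """Number of fixed-size buffers in a block of bulk RAM."""
    return ram_bytes // buffer_bytes


def interrupts_per_second(sample_rate_hz, bytes_per_sample, buffer_bytes=BUFFER_BYTES):
    """Buffer-refill interrupts needed to sustain a mono PCM stream."""
    return sample_rate_hz * bytes_per_sample / buffer_bytes


def tracking_bytes(ram_bytes, bytes_per_entry=1, buffer_bytes=BUFFER_BYTES):
    """SRAM needed to track every buffer independently (hypothetical layout)."""
    return buffer_count(ram_bytes, buffer_bytes) * bytes_per_entry


GIB = 2 ** 30
MIB = 2 ** 20
```

With these assumptions, `tracking_bytes(16 * GIB)` is 16MB, which already exceeds a 12MB SRAM at one byte per buffer. That is the squeeze which forces allocation in larger units.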
I have a plan which scales from breadboard with no audio or video to PCB with minimal audio and video to a stupendous system with 16GB RAM, 8K video, surround sound and networking. However, that step requires a large FPGA with plenty of FRAM. That is a large jump - and quite expensive. There is one part which is lacking from here to there. I have yet to outline anything beyond a polygon blitter using a cheap FPGA. That's where I'm stuck.