VAN [Video, Audio, Network]
Posted: Wed Jun 01, 2022 4:18 pm
An increasing number of people get 6502/65816 systems to output composite video or VGA. I've suggested schemes to extend such functionality using a minimum number of components. The first of these is to have one set of counter chips and two or more sets of character ROMs. The intention is that the number of symbols can be vastly increased while the hardware is less than doubled. I've also suggested that dual displays, triple displays and suchlike may use the same set of counter chips. However, I've missed a really obvious extension which is likely to be more popular: sound. Specifically, PCM audio.
The requirements for sound are vague but often aligned with the requirements for video. We want to output the maximum bit depth at the maximum rate possible - without overwhelming the processor. To push this further, many systems (Zoran's Vaddis IV chip, Atari Jaguar, GameTank) allocate one processor to video and one processor to sound. However, in the trivial case, we want to:
- Output one or more analog signals.
- Use a resistor network DAC.
- Obtain 1 volt peak-to-peak signals.
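As a back-of-envelope sketch of that last point (assuming an ideal 8 bit R-2R ladder off a 5V rail; the reference voltage and divider ratio are my assumptions, not a fixed design):

```python
def r2r_output(code, vref=5.0, bits=8):
    """Ideal R-2R ladder: output = vref * code / 2**bits."""
    return vref * code / (1 << bits)

def line_level(code, vref=5.0, bits=8):
    # Scale the 0..5V ladder output to roughly 1V peak-to-peak,
    # e.g. with a resistive divider of ratio 1/5.
    return r2r_output(code, vref, bits) / 5.0

print(line_level(255))  # 0.99609375, i.e. just under 1V at full scale
print(line_level(0))    # 0.0
```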
Ignoring that, if you want a quick win, 640*480 pixel VGA (with 9 bit counter) dovetails really nicely with the default PATA/SATA/CF/MicroSD block size of 512 bytes. Furthermore, 31kHz horizontal sync is only moderately less than Compact Disc sample rate and would be an ideal default for 8 bit PCM audio. As a quick hack, a double buffered, vertical blank interrupt video system may play sound at the rate of one disk block per frame. At 0.5KB per block and 60 frames per second, this is exactly 30KB of audio per second. Double or triple this rate for 16 bit or 24 bit audio and multiply that by the total number of channels. How far can we push it? BigDumbDinosaur's POC designs have exceeded 600KB/s. Therefore, 16 bit, 7.1 surround sound is possible on 6502/65816.
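The arithmetic checks out. A quick tally (the 31469Hz figure is the standard 640x480@60Hz horizontal sync rate):

```python
# One 512-byte disk block per 60Hz frame:
block_bytes = 512
fps = 60
print(block_bytes * fps)  # 30720 bytes/s = exactly 30KB/s of 8-bit mono

# 16-bit 7.1 surround, one sample set per horizontal line:
bytes_per_sample = 2
channels = 8
line_rate = 31469  # Hz, 640x480@60Hz horizontal sync
rate = bytes_per_sample * channels * line_rate
print(rate)                  # 503504 bytes/s
print(rate < 600 * 1024)     # True: fits under the 600KB/s budget
```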
I take AndersNielsen's VGA implementation as a concrete example. It reads 64 bytes per horizontal line. One additional RAM chip using the same counters could send 64 bytes of audio to one or more audio DACs. This could be eight sets of audio samples to eight DACs or four sets of samples to 16 DACs. Or a lesser quantity. I'll assume that hardware mixing of four samples is preferable because this only divides the maximum volume by a factor of four. In this arrangement, addressing of the audio RAM is such that the bottom three or four bits correspond with DAC number and the next two or three bits correspond with the audio "voice" which is being mixed.
However, it may be slow and inconvenient to populate the audio RAM in the order that it is rastered to DACs. It is vastly preferable if 512 bytes of one channel can be populated from storage (or network). This is particularly true if the audio buffers are bank switched and therefore not addressable at the same time. This is the conceptual trick. If samples of audio are played at horizontal sync frequency then it is convenient to imagine audio buffered as 64 "vertical stripes" within the audio RAM. Audio is played by rastering the audio RAM left-to-right, top-to-bottom in the same manner as the video RAM. However, unlike the video RAM, windows into the audio RAM are columns rather than rows.
It is obvious that it is possible to stripe audio by byte and therefore it is possible to make a hardware/software interface which is upward and downward compatible with 8/16/24 bit audio. Indeed, it works in a very similar manner to the optional accents with character generation - including use of the same bank latches. I call this voice mixing and 24 bit audio compatibility MAGPIE.
This is it. This is the solution. Audio and video may use the same bank latches. Audio and video RAM may be populated in the same manner. However, for audio, the order of the address lines when writing data is different to the order of the address lines when reading data. In my preferred implementation, the least significant 6 bits - when playing samples - become the most significant 6 bits when bank selecting audio RAM. Audio is very much like video - with the major exception that address lines are shuffled somewhere between writing and rastering.
We have to handle the mundane issue of volume control. I considered doing this in software with a caching multiplication algorithm. It is preferable to use a voltage multiplier with software and hardware volume control inputs. This is especially true if audio will be used in conjunction with storage, video or network. Audio output quality can be maximized by putting everything through one DAC and one volume multiplier. It is then possible to direct analog audio (on its own power rail) via an analog switch to sample-and-hold circuitry where sampling occurs in the *middle* of the cycle.
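The software fallback mentioned above could be as simple as a per-volume lookup table, trading 256 bytes of RAM per volume step for a multiply-free inner loop (a sketch of one plausible caching scheme, not the author's actual algorithm):

```python
def make_volume_table(volume):
    """Precompute (sample * volume) / 256 for all 256 unsigned
    8-bit samples, treating 128 as the zero crossing."""
    return [((s - 128) * volume) // 256 + 128 for s in range(256)]

quarter = make_volume_table(64)   # quarter volume
print(quarter[255])  # 159: full-scale input scaled toward the midpoint
print(quarter[128])  # 128: silence stays at the midpoint
```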
For an encore, I extend the rather fluid concept of row (or column) to networking. Here, the requirements are more vague than audio. The basic requirements for networking are:
- Nodes should be able to send data to each other.
- Don't design any stupid feature which hinders faster networking.
For decode, I adapt one trick from the 6502 Forum's Programmable Logic section. Specifically, Windfall's "Yet another (unnamed) 65C02 core" stores even and odd bytes separately so that any 16 bit value can be retrieved with 8 bit granularity. If two or three 8 bit RAM chips are combined with a barrel shifter made from 74x157 chips, it is possible for any network input buffer to be de-skewed in hardware before the wire format is decoded in software. The occasional bit slip between hosts merely requires incrementing or decrementing through the available range of network buffer aliases.
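A sketch of the even/odd split (the function names are mine): alternate bytes land in two RAMs, so a 74x157-style mux can assemble a 16 bit value starting at any byte offset, and a slipped stream is just a different alias into the same buffer.

```python
def split(buf):
    # Even bytes to one RAM chip, odd bytes to the other.
    return buf[0::2], buf[1::2]

def read16(even, odd, addr):
    # Any byte offset: address the two chips with addr/2 and
    # (addr+1)/2, and swap lanes when addr is odd (the mux's job).
    if addr % 2 == 0:
        lo, hi = even[addr >> 1], odd[addr >> 1]
    else:
        lo, hi = odd[addr >> 1], even[(addr + 1) >> 1]
    return lo | (hi << 8)

even, odd = split(bytes([0x11, 0x22, 0x33, 0x44]))
print(hex(read16(even, odd, 0)))  # 0x2211: aligned read
print(hex(read16(even, odd, 1)))  # 0x3322: straddles the even/odd boundary
```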
So far, one bank latch value may represent:
- One write-only, unique line of video display.
- One write-only, unique "column" of audio samples.
- One write-only, unique network output buffer.
- One read-only, barrel shifted network input buffer.
Four channels at telephone quality is approximately the same as one channel at 31kHz and this may be simplified if the hardware mixes the audio streams. Admittedly, allowing users to join or leave a party-line is difficult - and a variable number of nodes will compound the pacing which is required to compensate for the mismatch in audio sample speed. As a trivial example, consider audio conferencing with two nodes. If the crystal oscillators are at different temperatures then one node will run out of samples to play while the other gets back-logged. After you think of a solution to this problem, extend it to cover three-way calling. Then keep going.
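To put numbers on the drift (assuming ordinary +/-50ppm crystals, which is my assumption about the parts in use):

```python
rate = 31469        # samples/s at horizontal sync frequency
ppm_diff = 100      # worst case: one crystal +50ppm, the other -50ppm
drift = rate * ppm_diff / 1_000_000
print(drift)        # ~3.15 samples/s of backlog on one side,
                    # starvation on the other
```

Over a ten minute call that is nearly 1900 samples of slack to absorb, so the pacing problem is not hypothetical.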
My preferred rate for initial network negotiation is 1/3 of a double clocked 32.768kHz crystal. Starting from 65.536kHz, three ticks may be used for 3x over-sampling. 65.536kHz can be approximated by dividing 25.000MHz by 381 or dividing 25.175MHz by 384. I'd prefer to make the network compatible with the 25.000MHz crystals commonly used in USB2 and USB3 hardware. However, 25.000MHz VGA is 0.7% too slow and may be incompatible with some monitors. I suppose we could obtain some portability by dividing 25.000MHz by 1000 and dividing 25.175MHz by 1007. However, GCD [Greatest Common Divisor] is 25kHz which is low for audio conferencing. In the medium term, audio and network will probably require clock domain crossing. In that case, VAN [Video, Audio, Network] will all be running at different speeds. That may appear to eliminate the reason for grouping them together. However, I'm working toward a generalized peripheral FPGA where all ports may be MicroSD, SNES, S/PDIF, LAN or video. In this general case, it is helpful to group the functions due to their lack of common frequency.
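Checking those divisors is straight arithmetic (no hardware assumed):

```python
target = 65536  # Hz: a double clocked 32.768kHz crystal

for crystal, div in [(25_000_000, 381), (25_175_000, 384)]:
    f = crystal / div
    err = (f - target) / target * 100
    print(f"{crystal / 1e6:.3f}MHz / {div} = {f:.1f}Hz ({err:+.3f}%)")

# 25.000MHz / 381 = 65616.8Hz (+0.123%)
# 25.175MHz / 384 = 65559.9Hz (+0.036%)
```

So the 25.175MHz VGA crystal actually lands closer to the negotiation rate than the round 25.000MHz part.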