16*16 Pixel Tile Video Display And Video Codec
Posted: Mon Dec 21, 2020 1:40 pm
Making video circuitry is currently outside of my ability. Regardless, there is a particular allure to doodling designs for video circuitry. I'd like a video display which outputs 1920*1080p or significantly better, and it would be particularly good if it only used cheap commodity components.
The first problem I identified is how to simultaneously display and write. This problem can be sidestepped but it typically reduces the throughput of screen updates by a factor of five. That is a significant problem when attempting to update two million pixels. It is preferable to avoid being hindered by such a large multiple. Anyhow, the standard solutions are dual port RAM, two sets of RAM, or RAM fast enough to time slice. Two sets of RAM is the cheapest option but it may require two sets of writes to keep the display consistent. So, again, we may be hindered by a factor of two.
The second problem I identified is ripple carry from binary counters, which leaves an increasingly prominent skew on 2^P pixel transitions. How do people solve this problem? Latches on the counters to ensure consistent output? Perhaps it is preferable for an arrangement to not use latches. Instead, techniques to shorten counters may be preferable. Techniques include the Chinese Remainder Theorem, which uses co-prime moduli, and space filling curves, such as Gray coding of address lines. I had much hope for Gray coding because it would be exceptionally useful to dump the contents of a RAM chip while only changing one address line per unit of output. In addition to serializing the contents of any horizontal scan line, this technique may also facilitate hardware scrolling. Unfortunately, the circuitry is unworkable. It is possible to get Gray coding working in "linear time" where G bits of Gray code require logic gates with G-1 inputs. That's completely useless and it is preferable to use a binary counter. Regardless, this consideration of Gray coding found application elsewhere.
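To make the gate-count argument concrete, here is a minimal sketch in C (layout and names are mine): binary-to-Gray is a single XOR per bit, but Gray-to-binary is a prefix XOR across all higher bits, which is why a parallel decoder needs gates nearly as wide as the counter.
Code:
#include <stdint.h>
#include <stdio.h>

/* Binary to Gray: one XOR, cheap in hardware. */
static uint16_t bin_to_gray(uint16_t b) { return b ^ (b >> 1); }

/* Gray to binary: bit i is the XOR of every Gray bit at or above i,
   so a combinational decoder for the low bit spans all G inputs.
   This is the wide-gate problem described above. */
static uint16_t gray_to_bin(uint16_t g) {
    uint16_t b = 0;
    for (; g; g >>= 1) b ^= g;
    return b;
}

int main(void) {
    for (uint16_t i = 0; i < 8; i++)
        printf("%2u -> gray %2u -> back %2u\n", (unsigned)i,
               (unsigned)bin_to_gray(i), (unsigned)gray_to_bin(bin_to_gray(i)));
    return 0;
}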
I continued with the Chinese Remainder Theorem and found that the modulo counters could be incorporated into horizontal sync generation. (CRT for your CRT!) I also made an attempt to find a good palette. In particular, after finding:
Quote:
kc5tja on Sat 4 Jan 2003:
In 640 pixel mode, I have the following layout: RRmGGGBB. The RRm field forms a 3-bit red field. GGG forms a 3-bit green field. BBm forms a 3-bit blue field. Although three 3-bit fields normally creates a 512 color display, the 'm' bit (which stands for magenta) is shared between the red and blue channels, thus halving the number of colors actually viewable to 256. I've done some testing of this mechanism on my own computer using software simulation, and the initial results are quite nice (going to do more testing of course). But it gives a clean 8 shades of grey, and discoloration of cyans and yellows isn't visible to my eye. I think I made a good compromise solution.
my intuition differed. I tried the following on a Unix system with GIMP 2.x:
Code:
$ perl -e 'print "GIMP Palette\nName: SharedRedBlue\n#\n";for($m=0;$m<2;$m++){for($r=$m;$r<8;$r+=2){for($g=0;$g<8;$g+=1){for($b=$m;$b<8;$b+=2){printf("%3i %3i %3i\n",(73*$r)>>1,(73*$g)>>1,(73*$b)>>1)}}}}' > SharedRedBlue.gpl
$ perl -e 'print "GIMP Palette\nName: SharedRedGreen\n#\n";for($m=0;$m<2;$m++){for($r=$m;$r<8;$r+=2){for($g=$m;$g<8;$g+=2){for($b=0;$b<8;$b+=1){printf("%3i %3i %3i\n",(73*$r)>>1,(73*$g)>>1,(73*$b)>>1)}}}}' > SharedRedGreen.gpl
$ perl -e 'print "GIMP Palette\nName: SharedGreenBlue\n#\n";for($m=0;$m<2;$m++){for($r=0;$r<8;$r+=1){for($g=$m;$g<8;$g+=2){for($b=$m;$b<8;$b+=2){printf("%3i %3i %3i\n",(73*$r)>>1,(73*$g)>>1,(73*$b)>>1)}}}}' > SharedGreenBlue.gpl
$ perl -e 'print "GIMP Palette\nName: 4Cubes\n#\n";for($m=0;$m<4;$m++){for($r=0;$r<4;$r++){for($g=0;$g<4;$g++){for($b=0;$b<4;$b++){printf("%3i %3i %3i\n",$m*17+$r*68,$m*17+$g*68,$m*17+$b*68)}}}}' > 4Cubes.gpl
$ sudo cp *.gpl /usr/share/gimp/2.0/palettes

and then tested it against an extensive archive of skin-tones. I thought that SharedGreenBlue would produce the best results. I was wrong. SharedRedBlue produces the best result for light skin but 4Cubes produces the best result for dark skin and the most shades of grey. Anyhow, the best result is to allocate bits symmetrically and then share the remainder symmetrically.
I thought that I had a workable system until I found a comment from White Flame regarding a similar framebuffer:
Quote:
White Flame on Tue 24 Jul 2012:
No, don't make a dumb framebuffer, especially not a 256-color one. The machine won't have enough bandwidth to push it around, and it'll be a programming and memory management nightmare. The Amiga in particular suffered for this in comparison to the C64 vs its peers, ending up being much less colorful and less animated than even underpowered machines like the SNES. You only want a framebuffer if you have a 3d accelerator, or maybe a very fast 2d blitter/shader pushing generic bytes around much faster than the CPU (ie, would allow you to redraw an entire game screen, object by object, at 60fps).
Multiple layers of tile-based graphics (with selectable palettes per tile) is the key to colorful, animated graphics being dynamically pushed around fast, as well as being easier to create & manage within the realm of this class of hardware. Regarding sprites, that's another place where Commodore went wrong, trying to include a smaller number of larger sprites. The more graphically successful arcade & home game consoles created their sprites from a larger number of smaller (usually 8x8 pixel) sprites, which ends up being far less constraining.
I have seen similar sentiment in the influential DTACK Grounded newsletter, issue 40:
Quote:
FNE on Sun 10 Mar 1985:
One way to mix graphics and text is simply to use an exclusive-or gate to mix conventional 25-line, 80-column text with graphics. This requires the pixel rate of the text and graphics to be precisely synchronized, and does NOT permit text to be aligned on bit rather than character boundaries. It also leaves the text with a fixed size. We don't think most folks would consider such a system to be a bit-mapped system.
We believe that the most cost-effective solution to the text-cum-graphics problem at the moment is to use a dedicated text video circuit with nice, conventional 25 lines and 80 columns as the system output. This circuit should be memory-mapped like the Apple and Pet/CBM text screens, NOT at the other end of an RS-232 cable. Graphics then become optional, and require a second CRT.
Ahh, hot dang! They are completely correct. A little processor cannot push around 120MB/s unaided. Even my planned VGA style ModeX hack to write through eight banks of RAM falls short. (Although it simplifies most timings by a factor of eight.) I fear that we are "generals preparing for the last war" and that we are devising the perfect word processing system when the baseline specification is video conferencing.
It is possible to solve any computing problem with another level of indirection. For display, the standard techniques are a blitter or tiling (and sometimes both). Radical Brad and ElEctric_EyE have success with the former. 8BIT and Drass have success with the latter. I'd like to take a different path although much of what I propose is applicable elsewhere. Some of my over-confidence with processor design and circuitry comes from moderate success with codecs and streaming. In this case, I think that I know enough to specify video hardware and a codec which are tied to each other. I am also able to estimate quality of the output prior to implementation.
I'd like to make a self-hosting computer. I wish to use a broad definition in which a person unskilled in the art can learn the techniques of the computer's construction. So, my broad definition is the self-replicating memeplex of a trustworthy computer. This could easily devolve into unscientific woo-woo if it did not include the scientific criterion of replication. For my purposes, it should be possible for the computer to play tutorial videos. However, this does not necessarily include the ability to encode video or play arbitrary codecs. Therefore, it is possible to discard all support for JPEG DCTs and motion deltas. So, it doesn't have to support H.261 or the related DCT schemes used in JPEG, MJPEG, MPEG and elsewhere.
After reading and implementing the annotated JPEG specification, I know more than any sane person should about a hierarchical, progressive, perceptually quantized, Huffman compressed, YIV color-space, 4:2:2 subsampled integer Discrete Cosine Transform with restart markers. I also know a little about JPEG2000 wavelet spirals and EXIF. And, indeed, after working with EXIF, my sanity is questionable. (For reference, it is against specification to embed an unescaped JPEG thumbnail inside a JPEG. Good luck with enforcing that.)
For video, I initially considered a bi-level approximation of the most significant wave of an 8*8 DCT, followed by RLE. This has advantages and disadvantages. Compression is poor but it requires zero bit shifts. This is acceptable for a computer which plays tutorial videos. Given that it only requires 64 tiles, this arrangement has the distinct advantage that it can be played in a window on a Commodore 64, at 60Hz, while simultaneously displaying 7 bit ASCII, window decoration and other tiles. The quality is inferior to MJPEG although the exact amount is explicitly undefined. I assume that bi-level approximation is sqrt(2) worse than DCT. I also assume that a Commodore 64 has a worse color-space than perceptual YIV. Perhaps there is some trick to improve color reproduction but I assume that it is incompatible with decompressing video.
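As a rough sketch of what this bi-level approximation could mean in code (my own reconstruction in C, not a confirmed implementation): remove the block mean, correlate against the sign pattern of each 8*8 DCT basis function, keep the strongest, and emit its index plus a foreground/background pair.
Code:
#include <stdint.h>
#include <stdlib.h>
#include <math.h>

/* Sign pattern of DCT basis (u,v) at pixel (x,y): +1 or -1. */
static int basis_sign(int u, int v, int x, int y) {
    const double PI = 3.14159265358979323846;
    double c = cos((2 * x + 1) * u * PI / 16.0) * cos((2 * y + 1) * v * PI / 16.0);
    return c >= 0.0 ? 1 : -1;
}

/* Encode one 8*8 block as (tile, fg, bg). The tile is the basis whose
   bi-level sign pattern correlates best with the mean-removed block. */
void encode_block(const uint8_t px[8][8], int *tile, uint8_t *fg, uint8_t *bg) {
    long mean = 0;
    for (int y = 0; y < 8; y++) for (int x = 0; x < 8; x++) mean += px[y][x];
    mean /= 64;

    int best = 1; long best_score = -1;
    for (int u = 0; u < 8; u++) for (int v = 0; v < 8; v++) {
        if (u == 0 && v == 0) continue;  /* DC is carried by the fg/bg levels */
        long score = 0;
        for (int y = 0; y < 8; y++) for (int x = 0; x < 8; x++)
            score += basis_sign(u, v, x, y) * (px[y][x] - mean);
        if (labs(score) > best_score) { best_score = labs(score); best = u * 8 + v; }
    }

    /* fg = mean of pixels under the +1 region, bg = mean under the -1 region. */
    long s1 = 0, n1 = 0, s0 = 0, n0 = 0;
    for (int y = 0; y < 8; y++) for (int x = 0; x < 8; x++) {
        if (basis_sign(best / 8, best % 8, x, y) > 0) { s1 += px[y][x]; n1++; }
        else { s0 += px[y][x]; n0++; }
    }
    *tile = best; *fg = (uint8_t)(s1 / n1); *bg = (uint8_t)(s0 / n0);
}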
In JPEG, the color-space is often implicit. In MJPEG, the perceptual quantize tables are often implicit. This makes quality comparison ambiguous. However, the most remarkable part is a comparison of multiple waves. The Joint Photographic Experts Group crowd-sourced the JPEG compression algorithm and image format. The committee's efforts include perceptual testing, a documented standard (sales of which are undermined by the chairperson's cheaper annotated version) and a reference implementation which underlies almost every implementation except Adobe's software. Indeed, the annotated version makes a pointed dig at Adobe for using up to 12 tiles in a macro-block when the standard specifies a maximum of 10. Oh, Adobe, even when you participate in defining a file format, you are unable to follow it.
Anyhow, JPEG perceptual quantize tables are only defined for one viewing distance, one viewing angle, one viewing brightness, one viewing contrast and - most significantly - individual waves. All other behavior is undefined. What is the quality of two waves in one tile? Undefined. This is the most incredible part. For a standard which is concerned exclusively with the perceptual splitting, compression and re-combination of waves, the perceptual effect of such work is explicitly undefined. So, how much worse is the bi-level approximation of the most prominent wave? Undefined but, in practice, I expect it to be at least 40% worse.
So far, we have nothing above a demoscene effect. This is not the basis of a video tutorial format. So, how much further can we push this technique and how much retains compatibility with VIC-II hardware or similar? Actually, we can push it much further, and I am able to estimate how and where quality will suffer. Compared to JPEG, we are not using perceptual quantize tables, not using Huffman or arithmetic compression, not using YIV, not subsampling and only using an approximation of JPEG's 8*8 DCT. What happens if we throw that out too? Actually, we get increased throughput.
Anyone who understands the Discrete Fourier Transform or the related Discrete Cosine Transform will understand that samples are converted to waves and waves are converted to samples. And, indeed, the number of waves equals the number of samples. For JPEG DCT, 64 samples become 64 waves and the most prominent ones are RLE/Huffman compressed or compressed with a technique which was slightly less frivolous than IBM's XOR cursor patent. Anyhow, we are only using 6 bits in a byte because we are representing 2^6 pixels (and therefore a choice of 64 waves). The obvious solution is to use 8 bits to represent 2^8 pixels. Therefore, the basic unit should be 16*16 pixels rather than the historical norm of 8*8 pixels. Historically, this hasn't been used because it requires four times as much processing power to encode and would skew the effectiveness of JPEG Huffman encoding. It also helps if video resolution is higher. And there is one further consideration. Think of those poor photographic experts, who were handed a working algorithm. They'd have to do four times more perceptual testing. Or possibly 16 times if the result was practical. The horror!!! Think of the experts!!!!!
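A sketch of what the 256-entry tile set could look like (my assumption: the bi-level tiles are the sign patterns of the 16*16 DCT basis set, so an 8-bit tile number indexes a wave directly; 256 tiles * 16 rows * 16 bits is 8KB of tile ROM):
Code:
#include <stdint.h>
#include <math.h>

uint16_t tile_rom[256][16];   /* one 16-bit bitmap row per tile row */

void build_tile_rom(void) {
    const double PI = 3.14159265358979323846;
    for (int u = 0; u < 16; u++)
        for (int v = 0; v < 16; v++)
            for (int y = 0; y < 16; y++) {
                uint16_t row = 0;
                for (int x = 0; x < 16; x++) {
                    double c = cos((2 * x + 1) * u * PI / 32.0)
                             * cos((2 * y + 1) * v * PI / 32.0);
                    if (c >= 0.0) row |= (uint16_t)1 << x;
                }
                tile_rom[u * 16 + v][y] = row;
            }
}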
Not thinking in pixels is the crux of the idea. Rather than attempt to push 2MB of raw 8 bit pixels per frame at 60 frames per second (120MB/s), work exclusively in 16*16 waves. If we have, for example, 1024 tiles, we can display an arbitrary frame of the restricted video format concurrently with upscaled PETSCII and other tiles. By using multiple tiles, it is possible to display 16*32 characters for Roman, Greek and Cyrillic scripts. It is also possible to display 32*32 CJK [Chinese, Japanese, Korean] and emoji. Unfortunately, there may not be enough tiles for arbitrary display. 1024 tiles is sufficient to concurrently display video, Roman script and window decoration but it may not be sufficient to display an arbitrary screen of Chinese. The upside is the ability to implement 1920*1080p using significantly less than one bit per pixel. Specifically, I suggest 16 bits or less for the tile number, 16 bits or less for the foreground color and 16 bits or less for the background color. Excluding bi-level tile bitmaps, this is a maximum of 48 bits per 16*16 pixel tile placement. Tile bitmaps require exactly one bit per pixel. However, tile placements may exceed tiles by a factor of four or more.
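To make the bit budget concrete (my arithmetic, using the figures above): 1920/16 = 120 tile columns and 1080/16 = 67.5 rows, which is 8100 placements per frame at a maximum of 48 bits each, plus 1024 shared bitmaps at one bit per pixel.
Code:
#include <stdio.h>

int main(void) {
    const double placements = (1920 / 16) * (1080 / 16.0); /* 120 * 67.5 = 8100 */
    const double placement_bits = placements * 48;   /* tile no. + fg + bg  */
    const double tile_bits = 1024 * 16 * 16;         /* shared tile bitmaps */
    const double pixels = 1920.0 * 1080.0;
    printf("placements per frame: %.0f\n", placements);
    printf("bits per pixel: %.3f\n", (placement_bits + tile_bits) / pixels);
    return 0;
}
This prints roughly 0.314 bits per pixel, comfortably under the one bit per pixel claimed.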
In the most optimistic case, full screen 1920*1080p video is reduced to the task of RLE decompressing 8100 bytes of wave data, 8100 bytes of foreground data (with write through to least significant bits) and 8100 bytes of background data (also with write through). With this scheme, 2K video remains impossible for an unaided 2MHz 6502 but it is tantalizingly close. 640*360p is also a possibility. When decompressing 16 bit color, quality may exceed MJPEG but it is definitely inferior to the obsolete H.264 in perceptual quality and compression efficiency. Regardless, via the use of 8 bit LUTs [Look-Up Tables], it remains possible to render an inferior version on a Commodore 64. This is because 16*16 tiles can be mapped to 8*8 tiles.
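A minimal sketch of the decode path (the (count, value) record format and the write-through convention are my assumptions): expand run-length pairs into an 8100-byte plane, then write each byte through to the low half of a 16-bit color word.
Code:
#include <stddef.h>
#include <stdint.h>

/* Expand (count, value) byte pairs into dst; returns input bytes consumed. */
size_t rle_decode(const uint8_t *src, size_t src_len, uint8_t *dst, size_t dst_len) {
    size_t i = 0, o = 0;
    while (i + 1 < src_len && o < dst_len) {
        uint8_t count = src[i++];
        uint8_t value = src[i++];
        while (count-- && o < dst_len) dst[o++] = value;
    }
    return i;
}

/* Write through to the least significant bits of a 16-bit color word,
   leaving the most significant byte untouched, as described above. */
void write_through(uint16_t *colors, const uint8_t *plane, size_t n) {
    for (size_t i = 0; i < n; i++)
        colors[i] = (uint16_t)((colors[i] & 0xFF00u) | plane[i]);
}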
I have previously attempted to transcode 3840*2160p video to my own format. I used a trailer for Elysium as test data. My conclusion is that everything looks awesome at 4K. Even when the error metrics for 8*8 tiles were completely fubar, the result was excellent because there is a mix of 480*270 flat tiles and other encodings. This is Mrs. Weiler's Law: "Anything is edible if chopped finely enough." Well, 120 columns of PETSCII with 8/16 bit color is definitely chopped finely enough.
A quick win to improve image quality and reduce bandwidth is hierarchical encoding. Specifically, I suggest concurrent use of 16*16, 32*32 and 64*64 tiles. I also suggest XOR of tiles at each scale prior to display. This allows selective addition or subtraction from a base color without incurring ripple carry. It also minimizes sections of dropout when a video is not played on target hardware. Hardware with support for two or more tile sizes allows small video to be upscaled by a factor of two or more while incurring no additional processor load. It also allows thumbnails to be played at lower quality. In particular, the top tier of a hierarchical video can be played on a Commodore 64 if it does not exceed 40 columns.
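Under my reading of the XOR scheme (a sketch, not a confirmed design), display-side composition per pixel is just the XOR of the color contribution from each scale. Flipping bits with XOR can raise or lower a channel without a carry rippling into neighboring bits, and a scale which is absent on smaller hardware simply contributes zero.
Code:
#include <stdint.h>

/* Compose one pixel from the 64*64, 32*32 and 16*16 contributions.
   A missing level is passed as 0 and degrades output locally rather
   than corrupting a carry chain. */
uint16_t compose_pixel(uint16_t base, uint16_t c64, uint16_t c32, uint16_t c16) {
    return base ^ c64 ^ c32 ^ c16;
}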
The hierarchical encoding process is similar to JPEG. Blocks of the broadest scale are processed first and residual data is processed in subsequent passes. This process is compatible with SIMD. If luma/chroma is conventionally summed, then thumbnails on any proposed hardware may incur dark patches in high detail areas. This would be particularly prominent on the lowest specification equipment. Although XOR incurs redundant impulse and worsens compression, it also provides the most graceful degradation. Where bandwidth and storage are not the priority, this atypical use is preferable. If this design choice is in error, it is possible to fix it in hardware by selectively incorporating XOR into a full adder. It is also possible to fix it in software.
The astute may notice that large tile sizes are not compatible with the accepted list of high vertical resolutions. The result may be a strip of half or quarter tiles. Likewise, a tile hierarchy may place constraints upon the user interface. This may include CJK on even columns only and windows which snap to a four column and four row grid. However, all of this is ancillary to the primary purpose: mutually influenced hardware and software which train people to make better hardware and software.
What is the quality of hierarchical, bi-level texture compression? You are probably using it already because most PCI and PCI Express graphics cards manufactured since 2005 support it. Indeed, the fundamental patents regarding texture compression have subsequently expired. My technique of using the order of operands to share commutative opcodes is taken from a common texture compression format where the order of palette entries determines the compression technique. I would not be confident sharing my use of this technique if the patent were still active.
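For reference, the palette-order trick looks like this in S3TC/DXT1 style decoding (a sketch of the published block format, not any particular implementation): each 4*4 block carries two RGB565 endpoints, and their numeric order selects the mode.
Code:
#include <stdint.h>

/* Interpolate two RGB565 colors channel by channel with weights wa:wb. */
static uint16_t mix565(uint16_t a, uint16_t b, int wa, int wb) {
    int r = ((a >> 11) * wa + (b >> 11) * wb) / (wa + wb);
    int g = (((a >> 5) & 63) * wa + ((b >> 5) & 63) * wb) / (wa + wb);
    int bl = ((a & 31) * wa + (b & 31) * wb) / (wa + wb);
    return (uint16_t)((r << 11) | (g << 5) | bl);
}

/* The order of the two endpoints is itself a signal: c0 > c1 selects
   four opaque colors, c0 <= c1 selects three colors plus transparent. */
void dxt1_palette(uint16_t c0, uint16_t c1, uint16_t pal[4], int *transparent) {
    pal[0] = c0;
    pal[1] = c1;
    if (c0 > c1) {
        pal[2] = mix565(c0, c1, 2, 1);
        pal[3] = mix565(c0, c1, 1, 2);
        *transparent = 0;
    } else {
        pal[2] = mix565(c0, c1, 1, 1);
        pal[3] = 0;                    /* transparent black */
        *transparent = 1;
    }
}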
I have outlined a technique to obtain 1920*1080p video at one bit per pixel using the program counter of a 65C02, 65816 or W65C265. This technique is compatible with hierarchical tiling and video decompression. These techniques are also compatible with processor stacking, blitters and various methods of DMA. I also note that my suggestion is compatible with the work of White Flame, Radical Brad, ElEctric_EyE, 8BIT, Drass and others. I hope that these ideas can be incorporated into discrete circuitry or FPGA. Indeed, I have been greatly inspired by an attempt to make a binary compatible extension to the VIC-II, modestly and tentatively called VIC2.5. However, I believe that it is more beneficial to break binary compatibility in a manner which was typical at Commodore. I'm not the only one with such sentiment:
Quote:
The 8Bit Guy on Wed 11 Apr 2018:
Just like the C64 was not compatible with the VIC-20 or the PET or the Plus/4, total compatibility is not required. It just needs the feel. Also, it needs to use PETSCII characters.
While The 8Bit Guy wants 640*480, the DTACK Grounded newsletter, issue 40, provides a long worked example of why this is not feasible on an 8MHz MC68000 with a 16 bit bus (or permutations thereof), while chiding the basic numeracy of people who should know better. The worked example would also explain why Apple Macintosh computers remained monochrome for an extended period. Timings for The 8Bit Guy's suggested 8MHz 6502 or 65816 may be equally stringent.
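The arithmetic behind that objection, roughly (my illustrative figures, not DTACK's exact ones): a full redraw of 640*480 at 8 bits per pixel is 307,200 bytes, and a processor which averages one byte moved per eight cycles at 8MHz manages about 1MB/s.
Code:
#include <stdio.h>

int main(void) {
    const double frame_bytes = 640.0 * 480.0; /* one 8bpp frame           */
    const double cpu_bytes_s = 8e6 / 8.0;     /* ~8 cycles to move a byte */
    printf("full redraws per second: %.1f\n", cpu_bytes_s / frame_bytes);
    return 0;
}
That prints about 3.3 full redraws per second, which is a long way from 60.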
Regardless, my suggestion of a 16 bit tile number, 16 bit foreground color and 16 bit background color may be a pleasing complement to the work on 65Org16 and, in particular, ElEctric_EyE's work with 65Org16 and video. While my preferred embodiment is much closer to the work of 8BIT and Drass, some of my diagrams have been shockingly similar to ElEctric_EyE's diagrams.
My outlined design may not be suitable for gaming due to excessive fringing. This could be corrected with multiple layer sets and alpha mask techniques. For example, it is relatively easy to specify 256 chords across a square tile. I believe this would also be compatible with ElEctric_EyE's work.
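As an illustration of how cheaply such masks could be specified (the 4+4 bit encoding here is purely my invention): one byte selects a chord from a point on the left edge of a 16*16 tile to a point on the right edge, and the alpha mask is everything above the chord.
Code:
#include <stdint.h>

/* Hypothetical 8-bit chord code: high nibble = row where the chord
   meets the left edge, low nibble = row where it meets the right edge.
   Produces 16 rows of mask bits, 1 = above the chord. */
void chord_mask(uint8_t code, uint16_t mask[16]) {
    int y0 = code >> 4, y1 = code & 15;
    for (int y = 0; y < 16; y++) {
        uint16_t row = 0;
        for (int x = 0; x < 16; x++) {
            int yc = y0 + (y1 - y0) * x / 15;  /* chord height at column x */
            if (y < yc) row |= (uint16_t)1 << x;
        }
        mask[y] = row;
    }
}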