In another thread, Garth asked what made GPUs and the like so special. Well, I'm about to describe a basic, 2-D blitter, the most primitive kind of GPU you can imagine. This is 1985 technology here, so what you'll find in modern video cards is perhaps 1000x or more as powerful as what I'm about to describe.
The Problem
Manipulating graphics is inherently a matrix manipulation operation. You're more or less repeating the same algorithm across multiple pixels in a rectangular grid. Boring stuff.
The Solution
Let's assume a 3-source blitter, as found in the Commodore-Amiga hardware. Because you have 3 sources, you can combine these sources in up to 256 different ways. This is because you can generate eight different "minterms":
Code:
(0) Y=/A*/B*/C
(1) Y=/A*/B*C
(2) Y=/A*B*/C
(3) Y=/A*B*C
(4) Y=A*/B*/C
(5) Y=A*/B*C
(6) Y=A*B*/C
(7) Y=A*B*C
Obviously, we're talking about boolean math here -- * implies logical AND, and / implies logical negation. Each minterm corresponds to a single bit in a control register. Setting multiple bits yields an AND/OR relationship. In the above example, using a control value of $0A will produce the logical result Y=/A*B*C+/A*/B*C. Reducing the equation yields the effective result Y=/A*C.
For bitmaps where the number of bits per pixel is less than the smallest addressible memory location (a word in the case of the Amiga's blitter), there exists the chance that the left edge of a bitmap will appear on a non-word boundary. For this reason, both the A and B channels come equipped with a barrel shifter, each performing a multi-word logical shift in either the left or right directions. The data flow is as follows:
Code:
(from DMA)
|
V
+-----------------+-----------------+
| Data from DMA | Previous Word |
+-----------------+-----------------+
------> | |
(shift +-------------------+
amount) |
|
V
final data word
Note that the C channel, the so-called "Pattern" channel, is always assumed to be word-aligned (as all hatches, stipples, and halftone patterns tend to be). For this reason, no silicon is invested for barrel-shifting on the C channel.
Each of the A, B, and C channels are driven from either DMA sources or are spoon-fed from the CPU, depending on whether the DMA channel is enabled for that source. For simplicity, I'm going to assume all have DMA enabled.
The D channel (which I'm going to relabel to the Y channel, to remain consistent with logic equations commonly found in TTL databooks) is the "destination" channel, which also has its own DMA channel. If the DMA channel is disabled, then the results of the blit are ignored, leaving only flags set. I will get into this in a bit below.
Our blitter, then, as implemented above, implements the following pseudocode. Note: I'm not even going to try to code this in Forth:
Code:
unsigned int *pointerA, *pointerB, *pointerC;
unsigned int shiftA, shiftB;
for(y = 0; y < HEIGHT; y++) {
channelA = 0;
channelB = 0;
for(x = 0; x < WIDTH; x++) {
previousA = channelA;
previousB = channelB;
if(dmaEnabledForA) channelA = *pointerA++;
if(dmaEnabledForB) channelB = *pointerB++;
if(dmaEnabledForC) channelC = *pointerC++;
wordA = (previousA >> 16-shiftA) | (channelA << shiftA);
wordB = (previousB >> 16-shiftB) | (channelB << shiftB);
notA = ~wordA;
notB = ~wordB;
notC = ~channelC;
y = 0;
if(ctrl & 0x01) y = y | (notA & notB & notC);
if(ctrl & 0x02) y = y | (notA & notB & channelC);
if(ctrl & 0x04) y = y | (notA & wordB & notC);
if(ctrl & 0x08) y = y | (notA & wordB & channelC);
if(ctrl & 0x10) y = y | (wordA & notB & notC);
if(ctrl & 0x20) y = y | (wordA & notB & channelC);
if(ctrl & 0x40) y = y | (wordA & wordB & notC);
if(ctrl & 0x80) y = y | (wordA & wordB & channelC);
if(dmaEnabledForY) *pointerY++ = y;
}
pointerA = pointerA + moduloA;
pointerB = pointerB + moduloB;
pointerC = pointerC + moduloC;
}
This is a
gross simplification over the simplest possible use-case for the Amiga's blitter.
The Amiga's blitter also can perform area fills and implements one of the first hardware Bresenham's line-drawing algorithm entirely in hardware too.
For collision detection between several software sprites, the blitter also supports a "zero flag". If you wanted to perform a collision detection without also producing an update to some bitmap, then you'd configure the blitter as normal, except that you'd disable the Y DMA channel. Upon completion of the phantom blit, the zero flag will be set or cleared, depending on whether or not a collision between two soft-sprites occurred.
As you can see, having hardware acceleration for implementing this kind of loop, plus for line drawing and area fills, can result in a substantial amount of work performed for only the cost of filling a set of I/O registers. The blitter is pipelined, which enables it to retire
one bitmap word per every fourth clock cycle (remember that four clock cycles are needed: three for the A, B, and C fetches, and one for the Y write-back, if enabled). If the chipset drew data from a cache, it'd be substantially faster, perhaps retiring a single bitmap word every clock cycle.
The block diagram of the business logic of the blitter is:
Code:
A B C
| | |
| | |
+--------+--------+ +--------+--------+ |
| DATA A | PREV A | | DATA B | PREV B | |
+--------+--------+ +--------+--------+ |
| shift | | shift | |
+-------+ +-------+ |
| | |
| | |
+--------+ +--------+ +--------+
| WORD A | | WORD B | | WORD C |
| Q /Q | | Q /Q | | Q /Q |
+--------+ +--------+ +--------+
| | | | | |
| | | | | |
+------------------------------------------------+
| MINTERM GENERATOR (AND/INVERT LOGIC) |
+------------------------------------------------+
| | | | | | | |
| +-- | --- | --- | --- | --- | --- | --- | -----o Y=ABC
| | | +-- | --- | --- | --- | --- | --- | -----o Y=AB/C
| | | | | +-- | --- | --- | --- | --- | -----o Y=A/BC
| | | | | | | +-- | --- | --- | --- | -----o Y=A/B/C
| | | | | | | | | +-- | --- | --- | -----o Y=/ABC
| | | | | | | | | | | +-- | --- | -----o Y=/AB/C
| | | | | | | | | | | | | +-- | -----o Y=/A/BC
| | | | | | | | | | | | | | | +----o Y=/A/B/C
| | | | | | | | | | | | | | | |
+------------------------------------------------+
| MINTERM GATES |
| (16-input, 16-bit AND gates) |
| |
| |
+------------------------------------------------+
| | | | | | | |
| | | | | | | |
+------------------------------------------------+
| 8-input x 16-bit OR gates |
+------------------------------------------------+
|
|
+-----------------+
| Zero-flag and |
| area fill logic |
+-----------------+
|
V
Y
As you can see, you are going to have quite a difficult time competing with this circuit in even the most optimized 6502 or 65816 assembly language.