Let me start by admitting that I don't have any Commodore 8-bit hardware in my possession (no place to put it and no power to supply it), and even if I did, I have never had a SuperCPU. Everything I know about it, I learned from reading other sources.
With that being said . . .
For all I know about the SuperCPU, it appears to just plug onto the board-edge connector at the back of the computer
This is correct. The SuperCPU is not just a processor upgrade. It is, in fact, a completely headless, whole new computer consisting of a 65C816 at 20MHz, an FPGA, static asynchronous RAM (128K IIRC), and some SDRAM (I believe it's SDRAM, at least). It comes with a single expansion slot, which just so happens to be a perfect fit for the C64's expansion slot.
its output still has to go over the C128's 1MHz bus to get into the C64 or C128.
Correct. The SuperCPU performs a number of tricks to keep the CPU running at high speed, but sometimes it simply has to inject wait-states to re-synchronize against the VIC-II's timing.
One of the tricks the SuperCPU employs is a write-buffer. If I write data to a location in bank 0, it's queued by the FPGA for convenient delivery. In essence, writes to bank 0 are scheduled for the next phase-2 sub-cycle on the Commodore bus.
To handle the scheduling, IIRC, the SuperCPU has a one- or two-byte queue, which means you can only write one or two bytes to bank 0 every 20 CPU cycles (one cycle of the 1MHz bus) without incurring wait-states. This lets the SuperCPU compute something else while the data being written is synchronizing against the VIC-II's bus.
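To get a feel for why the queue depth matters, here is a toy model in Python. It is only a sketch under my own assumptions: the function name and parameters are made up for illustration, each queued byte is assumed to occupy the slow bus for exactly 20 fast cycles (20MHz / 1MHz), and it ignores VIC-II bad lines and whatever the real FPGA actually does.

```python
from collections import deque

def simulate_write_buffer(gaps, depth=1, drain_period=20):
    """Return the total stall cycles in a toy write-buffer model.

    gaps         -- fast-clock cycles of other work done before each bank-0 write
    depth        -- assumed queue depth in bytes (the post recalls one or two)
    drain_period -- assumed fast cycles needed to deliver one byte to the slow bus
    """
    pending = deque()  # completion times of writes still sitting in the queue
    now = 0            # current time in fast-clock cycles
    last_done = 0      # time at which the slow bus finishes its latest write
    stalls = 0
    for gap in gaps:
        now += gap
        # Retire writes the slow bus has already delivered.
        while pending and pending[0] <= now:
            pending.popleft()
        # Queue full: the CPU waits until the oldest write is delivered.
        if len(pending) == depth:
            stalls += pending[0] - now
            now = pending.popleft()
        # The new write is delivered one drain period after the bus is free.
        last_done = max(now, last_done) + drain_period
        pending.append(last_done)
    return stalls

print(simulate_write_buffer([20] * 4))          # writes spaced 20 cycles apart
print(simulate_write_buffer([4] * 4, depth=1))  # tight burst, one-byte queue
print(simulate_write_buffer([4] * 4, depth=2))  # tight burst, two-byte queue
```

In this model, writes spaced at least 20 cycles apart never stall, while a tight burst stalls less with the deeper queue because the CPU can run further ahead before it has to wait.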
Note: reads must always block, due to the 65816's unrelenting synchronicity in its bus cycle. Unlike 32-/64-bit RISC processors, the 65816 does not support split address/data bus transactions, and therefore, no means of prefetching exists. So if you read from bank 0, you're just gonna hafta wait for it to finish.
Another trick it employs involves a custom memory management unit, allowing you to map significant chunks of bank 0 into SDRAM (1 or 2 wait-states) or static RAM (0 wait-states). That is, zero-page, stack, ROM, and most of RAM can be mapped to S(D)RAM, which spares the CPU the slow-bus wait-states where they're not needed (for reads as well as writes). Proper placement of frequently used resources in SRAM, plus the hardware-managed write-buffers, effectively means your program will not incur nearly as many wait-states unless it touches I/O registers, color RAM, or the VIC-II-related memory regions.
Note: The Turbo-CPU upgrade to the Apple IIgs treats its SRAM as cache RAM, and translates DRAM accesses in hardware. The SuperCPU relies on software to place its resources appropriately; hence, you might say that the SuperCPU uses a software-managed cache.
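Since placement is left to software, it helps to know which bank-0 addresses will still hit the 1MHz bus. The Python sketch below is a simplified classifier, not the SuperCPU's actual mapping logic: the function name is hypothetical, the I/O range is the standard C64 one ($D000-$DFFF, which contains color RAM at $D800-$DBFF), and the VIC-II-visible regions are a parameter you would set yourself (the default here is only the power-on screen memory at $0400-$07FF).

```python
IO_START, IO_END = 0xD000, 0xDFFF  # standard C64 I/O area, includes color RAM

def touches_slow_bus(addr, vic_regions=((0x0400, 0x07FF),)):
    """Return True if a bank-0 access needs the 1MHz bus in this simplified model.

    vic_regions is an assumed list of (low, high) address ranges that the
    VIC-II reads from; everything else is treated as mapped to fast RAM.
    """
    if IO_START <= addr <= IO_END:
        return True
    return any(low <= addr <= high for low, high in vic_regions)

for addr in (0x00FB, 0x0100, 0x0400, 0xD020, 0xD800):
    kind = "slow bus" if touches_slow_bus(addr) else "fast RAM"
    print(f"${addr:04X}: {kind}")
```

The point of the exercise: keep hot data (zero-page variables, stack, inner-loop tables) outside the ranges that return True, and batch up anything that must go to I/O or video memory.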
The '816 was never intended to compete performance-wise with the 32- and 64-bit multi-GHz processors; but add to the above some updated video hardware, and a surprising amount could still be done with the '816 in a desktop environment.
Remember that the world once thought EGA on an 8MHz 80286 was jaw-dropping graphics and break-neck processor performance. A 20MHz 65816 will compete rather favorably against a 20MHz 68000. Indeed, it's strange to think about, but a 65816 at 16MHz or faster, equipped with an MMU (actually making use of the ABORT# pin), would probably make a nice Unix workstation, giving an early Sun workstation a run for its money, and it would undoubtedly draw less power.