I wondered where this topic would go, if anywhere. We've already considered
multiplication,
squaring,
more multiplication, Maclaurin, Taylor, Newton-Raphson and
CORDIC. I hoped that we might discuss matrix operations,
neural networks, a hypothetical Commodore FPU or a polygon blitter which uses the same floating point format as BASIC, Forth and C. Like many others, I was particularly impressed by
TobyLobster's exhaustive evaluation of more than 60 multiplication algorithms. I hoped there would be a definitive winner but the answer is "it depends": the distribution of inputs affects which algorithm is fastest.
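One family from that survey, table-driven quarter-square multiplication, illustrates why table methods compete so well on a 6502: the identity a*b = f(a+b) - f(|a-b|), with f(x) = floor(x²/4), turns a multiply into two lookups and a subtraction. A sketch in Python (purely illustrative; the real contenders are 6502 assembly):

```python
# Quarter-square multiplication: a*b = f(a+b) - f(|a-b|), where f(x) = x*x // 4.
# The identity is exact because a+b and a-b always share parity, so the two
# floor operations cancel. For 8-bit inputs the table needs indices 0..510.

QUARTER_SQUARES = [x * x // 4 for x in range(511)]

def qmul(a: int, b: int) -> int:
    """Multiply two unsigned 8-bit integers via two table lookups."""
    return QUARTER_SQUARES[a + b] - QUARTER_SQUARES[abs(a - b)]
```

On a 6502 the table costs 512 bytes of RAM or ROM, which is exactly the sort of space/speed trade that makes "it depends" the honest answer.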
I've
previously noted that
ℤ ⊂
float ⊂
ℚ ⊂
ℝ. Floating point is a subset of the fractions, not a superset. Regardless, programmers struggle to understand that binary floating point mangles almost everything. This includes 1/3, 1/5 and everything derived from them. Most notably this includes 1/10, 1/100 and 1/1000. Our most noble, wise and symmetrical tri-quad-overlord,
cjs, suggests Base 12 or Base 60 to avoid rounding error. Indeed, given my understanding of
the Chinese Remainder Theorem, my first thought was Base 360 encoding to hold 1/10 and 1/12 without error. I then downgraded that to Base 180 to fit into one byte and raised it to Base 240 to obtain better encoding density. So, yes, I strongly agree with cjs that Base 60 is a good encoding and - after multiple adjustments - humbly note that it is possible to hold two extra bits per byte, your most quadness. If you are in an obscure case where 1/7 is hugely important (in addition to 1/3 and 1/10), Base 210 should be considered. Otherwise, Base 240 is a good encoding which uses most of the available bit permutations.
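The claims above can be checked mechanically: a reduced fraction has a finite digit expansion in base b exactly when every prime factor of its denominator divides b. A small sketch (my own, for verification only):

```python
from fractions import Fraction
from math import gcd

def terminates(frac: Fraction, base: int) -> bool:
    """True if frac has a finite (terminating) digit expansion in `base`.

    Repeatedly divide the denominator by its common factors with the base;
    the expansion terminates iff the denominator reduces to 1.
    """
    d = frac.denominator
    g = gcd(d, base)
    while g > 1:
        d //= g
        g = gcd(d, base)
    return d == 1
```

Base 240 = 2^4 * 3 * 5 covers denominators built from 2, 3 and 5, so 1/10 and 1/12 are exact; 1/7 needs the factor 7, hence Base 210 = 2 * 3 * 5 * 7.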
Two's complement is preferable for addition, one's complement is preferable for multiplication, either is moot for Taylor approximation but
monotonic representation is preferable for
O(n log n) sort operations which may overwhelm all other considerations. Indeed, I strongly recommend that Base 240 representation follows binary sort order, irrespective of the representation having non-contiguous holes. There is no strict requirement for Base 240 representation to use the first, middle or last values of each byte. However, the last values may be preferable for software and hardware implementation; especially hardware counters. De-normalized Base 240 representation is highly suited to polygon blitting, where modulo 15*16 counters may increment faster than an 8 bit counter. Likewise, multiples of 240 are entirely compatible with high resolutions. Specifically, video resolution is often specified relative to 2K (1920*1080 pixels), which is 8 multiples of 240 pixels wide and 4.5 multiples tall.
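The monotonicity requirement can be illustrated with IEEE-754 doubles: a well-known bit-twiddle maps float ordering onto unsigned integer ordering, which is what byte-wise or radix sorting wants. A Base 240 format laid out sign/exponent/mantissa, most significant byte first, would get the same property by construction. Sketch (illustrative, not a Base 240 implementation):

```python
import struct

def sort_key(x: float) -> int:
    """Map a double to an unsigned 64-bit key whose integer order
    matches numeric order (a total order over finite values).

    Negative floats: invert all bits, so more-negative sorts lower.
    Positive floats: set the sign bit, so they sort above all negatives.
    """
    bits = struct.unpack("<Q", struct.pack("<d", x))[0]
    if bits & (1 << 63):
        return bits ^ 0xFFFFFFFFFFFFFFFF
    return bits | (1 << 63)
```

Once keys compare as plain unsigned bytes, any integer sort routine handles floats untouched, which is the property worth preserving even at the cost of holes in the encoding.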
One of the overlooked advantages of BCD and larger-base representations is that very few exponent bits are required to cover a broad range. With Base 2 representation, every successive value of exponent is a doubling. With Base 10 representation, every successive value of exponent is an order of magnitude. Therefore, a 6 bit exponent covers 64 leading decimal digits - and slightly more if a de-normalized form is allowed. Base 240 encoding with a 6 bit exponent covers more than 152 leading decimal digits. If we follow the development of floating point from the
Apollo Guidance Computer to
MIL-STD-1750A to the laughable "Chinese wall" conflict of interest of
IEEE-754 development, we find that range and precision are always compatible with common aeronautical formulae in the volume of space covering Earth and Moon. From this, we find there is never a reason to take the value 10^50 and cube it. Regardless, I find it ludicrous that it is possible to take an IEEE-754 double precision value, square it and obtain overflow. I also find 80 bit representation unwieldy. Therefore, I strongly recommend Base 240: 32 bit single precision with a 6 bit exponent and 64 bit double precision with a 14 bit exponent.
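The digit-range figures are one line of arithmetic: an e-bit exponent field gives 2^e exponent values, and each step multiplies the magnitude by the base, so the dynamic range spans 2^e * log10(base) decimal orders of magnitude:

```python
from math import log10

def decimal_range(base: int, exponent_bits: int) -> float:
    """Decimal orders of magnitude spanned by a normalized float
    with the given base and exponent field width."""
    return 2 ** exponent_bits * log10(base)
```

Base 10 with 6 exponent bits gives exactly 64 decimal digits of range; Base 240 with the same 6 bits gives 64 * log10(240) ≈ 152.3.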
I have thought further about the interaction of a
caching multiplication algorithm and interrupts. Specifically, this corner case should be avoided at all costs. One of the many barriers to general purpose operating system adoption on 6502 is that implementations (typically written in C) assume that multiplication is a computationally cheap operation which can be used with abandon within interrupt handlers. Even a slow microcontroller, such as AVR ATMEGA328P, can perform 16 bit integer multiplication within 200ns - and
operating system development typically targets hardware which can amortize 32 bit integer multiplication within 1ns by using register renaming. 300 or more cycles, each 40ns or longer, is considered perverse, and 6502 exists in a niche where it is bad at multiplication and therefore doesn't get used for any purpose which requires intensive multiplication. Historically, this was false. Many early 6502 systems were purchased for the primary purpose of running spreadsheet software. Likewise, it is not entirely true today. 6502 systems with SRAM offer considerable advantages in reliability, maintainability and energy consumption. A canonical example is GARTHWILSON's flight computer: a BCD 6502 system which reduced processor speed to save energy. It is possible that people using alternative systems have died in light aircraft crashes.
GARTHWILSON recommends against BCD due to the related problems of poor speed and poor encoding density. Indeed, Base 100 and Base 1000 offer advantages over BCD. I believe that Atari BASIC uses BCD for floating point and I presume the rationale is quite simple. Performance will be awful with any representation. However, decimal means trivial string conversion and no errors with decimal currency. We should remember that Nolan Bushnell, founder of Atari, said "Business is a good game - lots of competition and a minimum of rules. You keep score with money." While Atari failed to count pennies prior to bankruptcy, we should also remember that the world's former richest man didn't bother with decimal floats. Or accuracy.
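The density argument can be quantified as log2(base) divided by the bits each digit group occupies. Worth noting: Base 100 merely matches BCD's density (two decimal digits per byte either way) while halving the digit-loop iterations; the actual density gains come from Base 1000 and Base 240. A quick check:

```python
from math import log2

def density(base: int, bits: int) -> float:
    """Fraction of the bit patterns' information content actually used
    when one base-`base` digit group occupies `bits` bits."""
    return log2(base) / bits

# BCD: one decimal digit per 4-bit nibble        -> ~0.83
# Base 100: two decimal digits per byte          -> ~0.83 (same, but faster loops)
# Base 1000: three decimal digits per 10 bits    -> ~0.997
# Base 240: one digit per byte                   -> ~0.988
```
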
Regarding the interaction of cached multiplication and matrix operations, it is apparent to me that the order of operations is significant. As an example, two dimensional matrix rotation typically uses three distinct values. However, poor evaluation order and unfortunate hashing may cause the same inputs to be evaluated repeatedly. This case can be entirely avoided by modifying the schedule of multiplication operations. In the general case, serpentine evaluation might be most efficient. For small matrix rotation operations, diagonal evaluation is recommended. Obviously, this requires the type of rigorous testing pioneered by TobyLobster. However, in the case of sparse matrix operations (magnify, project, rotate, shear), multiplication by zero is exceedingly common.
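To make the repeated-input point concrete: a 2D rotation x' = x*c - y*s, y' = x*s + y*c uses only the coefficients c and s (plus the negation), so when point coordinates repeat - as they do on any grid - the same products recur regardless of schedule. A toy count of distinct products (my own sketch, not TobyLobster's methodology) bounds what a result cache could save:

```python
# Count total multiplications versus distinct (coefficient, coordinate)
# products when rotating a set of points. The gap between the two numbers
# is the ceiling on what a multiplication cache can save; the schedule
# (serpentine, diagonal, ...) determines how much of it is realised
# with a cache of bounded size.

def distinct_products(points, c, s):
    seen = set()
    total = 0
    for x, y in points:
        for coeff, coord in ((c, x), (s, y), (s, x), (c, y)):
            total += 1
            seen.add((coeff, coord))
    return total, len(seen)
```

For a 4*4 grid of integer points, 64 scheduled multiplications collapse to just 8 distinct products, which is why evaluation order and hashing deserve the rigorous treatment suggested above.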
Regarding a Commodore FPU, a hypothetical 6587 would have required a patient and charitable understanding which didn't exist in 1980s computing. Although
NASA believed that it was fruitful for 6502 to manage up to 10 FPUs, a dedicated 65xx peripheral chip would have been underwhelming. However, don't let that stop you from making something similar with an FPGA. Intel's 8087 was a "show-boating", ambitious over-reach. While 8086 initially ran at 5MHz, any 8087 dragged it down to 3MHz. It was a hugely ornate chip with 80 bit precision and the majority of its operations were completely useless; most notably all of the hyperbolic and arc-hyperbolic functions. It required
two bits per cell to hold the micro-code. Yes, the micro-code was held in Base 4. Obviously, this is why manufacturing yield was poor and devices ran slowly. Commodore could have competed against this insanity with a minimal implementation. Specifically: addition, subtraction, multiplication and not much more. However, it would have been viewed less favorably than National Semiconductor's awful FPU.
A hypothetical 65xx FPU does not require many pins. It requires power, ground, 8 data lines, some address lines, clock, read/write and it is customary for 65xx peripheral chips to have multiple select lines with differing polarity. Alternatively, use of SYNC allows a co-processor to share the instruction stream. At most, this would require a 20 pin narrow DIP. However, this assumes a contemporary level of chip packaging for a large chip. It also ignores the heat dissipation of NMOS. 6587 would have been 24 pin ceramic. With a suitable pin arrangement, one or more memory mapped 6587 chips could have been placed next to SID. However, it is unlikely that 6587 would have been given more than 16 or 32 memory locations. This is inadequate for 64 bit matrix operations of any size. To reduce register pressure, 6587 might have used 40 bit representation. With a strategically placed NC pin, this would have indicated future expansion to 80 bit representation. Given that Commodore had rights to Microsoft BASIC Version 2, it would be quite easy to make BASIC work with an FPU. Indeed, it remains worthwhile to pursue this huge performance boost. It also reduces the size of the BASIC interpreter. This would reduce or eliminate the performance loss when running BASIC on Commodore 16 or similar. Either way, it remains worthwhile to maintain the addresses of historical entry points.
I've previously suggested to Rob Finch that
VIC-II could have been extended with a wider video bus. In this scenario, the 320*200 text/bitmap/sprite system is retained as an overlay. Meanwhile, 640*400 or more can be pursued with minimal compatibility concerns. (Within this arrangement, I've found that 40*25 tile video benefits from 640*400 pixel resolution.) The migration path is a variation of Atari's upwardly compatible video architecture. However, this path allows SuperCPU to retain compatibility with the notoriously tight Commodore video timing while also allowing floating point compatibility across BASIC, FPU and polygon blitter. As we've seen from the success of the MEGA65 project, Commodore's VIC-III matches the throughput of
Amiga OCS graphics. A hypothetical VIC-IV could greatly exceed it, although it might be a 64 pin ceramic monster. When the legacy text/bitmap/sprite system is not required, surplus bandwidth on the 8 bit processor bus can be used to feed a polygon blitter. My doodles indicate that a minimal polygon blitter might take 192 clock cycles or so before the first byte is output. After that, each write may occur every 8, 16 or 32 cycles. However, such a design benefits from several iterations of Moore's law. It also benefits from a wider bus. Therefore, it is plausible to double throughput every two years; possibly for a decade or more. At this point, all ROMs can be reduced to one chip, all peripherals can be reduced to one chip and the design is focused around the wider bus of the graphics system.
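Those doodled figures imply a simple cost model. Assuming the 192-cycle setup and one write per 8, 16 or 32 cycles are in the right ballpark (they are my rough estimates, not measurements), the cost of a span is:

```python
def blit_cycles(n_bytes: int, cycles_per_write: int, setup: int = 192) -> int:
    """Clock cycles to emit one span: fixed setup plus fixed per-byte cost.
    Figures are the rough estimates from the text, not measurements."""
    return setup + n_bytes * cycles_per_write
```

At 8 cycles per write, a 40 byte span costs 512 cycles, so the fixed setup still dominates short spans; that overhead, not the per-byte rate, is what the early design iterations should attack.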
When the majority of transistors are allocated to a single core GPU, further doublings are possible by scaling GPU cores. In this arrangement, it isn't hugely important if the floating point format is 32 bit or 40 bit, Microsoft/Apple/Intel format, Base 240 or Base 256. Historically, Commodore followed Microsoft in preference to Intel. (See
Vintage Computer Festival: Jack and the Machine for rationale.) However, a fresh, consistent implementation should investigate Base 240 representation.