To Proxy, on Thu Nov 11 2021,
I suggested a minimal FPU without division and mentioned Newton-Raphson and the FDIV bug.
To John West, I've previously suggested hardware multiply and software division and modulo. Processors such as the TMS9900 typically supported multiply and divide (or not) as a pair. Since the 1990s RISC era, processors usually have multiply and typically support divide and modulo (or not) as a pair. I am quite impressed that Proxy correctly implemented divide and modulo as an FPGA state machine, although Proxy is encountering the same rationale as ARMv1 for omitting divide and modulo.
I wrongly assumed that ARMv1 omitted division because it is an ugly operation. I assume an understanding of open sets, groups, commutative operations, finite fields, rings and ℤ ⊂ float ⊂ ℚ ⊂ ℝ. An understanding of the Abelian grape joke is optional. Programmers who work with bytes will definitely understand integer overflow and will already be familiar with arithmetic in a finite ring. Likewise, programmers should already be aware that 1/3 or 1/10 incurs rounding error when represented in binary.
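For example, a two line C program shows that the stored value is only the nearest representable double (any IEEE 754 implementation behaves the same way):

    #include <stdio.h>

    int main(void) {
        /* 1/10 has no finite binary expansion, so the stored double
           is merely the nearest representable value, not 0.1 */
        printf("%.20f\n", 0.1);        /* 0.10000000000000000555... */
        printf("%.20f\n", 1.0 / 3.0);  /* 0.33333333333333331483... */
        return 0;
    }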
If you don't care about integer overflow then finite binary addition, subtraction and multiplication are closed. However, division has the exception case of divide by zero and there is inevitable rounding error. No division instruction? No divide by zero exception. No problem. If there are conditions around a division instruction then it isn't a huge hardship to have an iterative software loop instead. I wrongly thought that ARM gave a specious argument that division is infrequent. However, I have no immediate use for 6502 division outside of implementing floating point division for perspective correct texture mapping, or a division operator in BASIC, Forth, C or a spreadsheet application. Analysis of Lisp, Smalltalk and Squeak finds similarly infrequent use - although frequency in source code or bytecode does not equal frequency of execution. For example, scaling textures requires four divisions per triangle, ideally for a minimum of 180000 triangles per second.
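The iterative software loop is short. A sketch in C of restoring shift-and-subtract division - one quotient bit per iteration, sixteen iterations for 16 bit operands, with divide by zero as a branch rather than an exception:

    #include <stdint.h>

    int divmod16(uint16_t n, uint16_t d, uint16_t *q, uint16_t *r) {
        if (d == 0) return -1;          /* the one exception case */
        uint16_t quot = 0;
        uint32_t rem = 0;
        for (int i = 15; i >= 0; i--) {
            rem = (rem << 1) | ((n >> i) & 1);  /* bring down next bit */
            if (rem >= d) {                     /* trial subtraction */
                rem -= d;
                quot |= (uint16_t)1 << i;
            }
        }
        *q = quot;
        *r = (uint16_t)rem;
        return 0;
    }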
With or without easy carry operations, addition, subtraction and multiplication are composable. Specifically, small 8*8 operations (or similar) can be combined into one large operation. This doesn't apply to divide or modulo - unless multiplication by the inverse is performed. Counter-intuitively, if you only have one multiplication operation and it is significantly larger than 8*8 then make both inputs signed. For example, 31*31 signed multiply hardware can perform 16*8 unsigned multiplication without bit shifting, overflow or branching. 8*8 is an awkward case because it is just at the threshold where a look-up table is feasible. I believe that a common attitude is "Go large or go home." An exception is the MC6809, which offers 8*8 unsigned multiply - but I strongly suspect Motorola would have offered 16 bit multiply and divide (signed and unsigned) if the transistor budget had allowed it.
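Both points are easy to demonstrate in C. The first function composes a 16*16 multiply from four 8*8 partial products; the second shows that zero-extended unsigned operands sit comfortably inside a wider signed multiplier:

    #include <stdint.h>

    /* (ah*256 + al) * (bh*256 + bl)
         = ah*bh*65536 + (ah*bl + al*bh)*256 + al*bl */
    uint32_t mul16(uint16_t a, uint16_t b) {
        uint8_t ah = a >> 8, al = a & 0xFF;
        uint8_t bh = b >> 8, bl = b & 0xFF;
        uint32_t p = (uint32_t)al * bl;
        p += ((uint32_t)ah * bl + (uint32_t)al * bh) << 8;
        p += ((uint32_t)ah * bh) << 16;
        return p;
    }

    /* Zero-extended unsigned operands are non-negative, and a 16 bit
       by 8 bit product needs at most 24 bits - well inside the 31
       magnitude bits of a signed 32 bit multiplier.  No shifts, no
       overflow, no sign fix-up branches. */
    uint32_t mul16x8_on_signed_hw(uint16_t a, uint8_t b) {
        int32_t sa = a;                 /* zero-extend into signed width */
        int32_t sb = b;
        return (uint32_t)(sa * sb);     /* single signed multiply */
    }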
Going back to ARMv1, I wrongly assumed the correctness of division, the lack of composable cases and the infrequency of the operations contributed to a decision to not allocate transistors for division or modulo in the first version (and all subsequent versions). Instead, it is preferable to make a smaller, cheaper device with higher yield. Furthermore, a finite state machine to implement division adversely affects interrupt response time. This was in the era when the MC68000 ran at 12MHz and the NS32032 was too slow for Acorn's preferred floppy disk controller. Intel's infamous FDIV bug occurred due to a flawed implementation of radix-4 (base 4) SRT division. I find this highly amusing because the public sector VIPER documents from 1987 state that correctness proofs of floating point were out of scope. Meanwhile, the INMOS T800 documents from 1987 show how to convert provably correct Occam to provably correct microcode. It took Intel a further eight years - and the largest processor recall ever - to discover the merit of verification. Cowboys.
Nowadays, I believe that fast processors perform 64 bit integer division in base 8 for the purpose of keeping out-of-order execution below 32 cycles. This avoids allocating more than 160 shadow registers. While some may appreciate a scheme where 64 bit integer division effectively takes less than 1ns, it is far less impressive to know that this is a fancy FIFO scheme with highly restricted throughput.
The speed and density of decimal representation are poor. The major advantage was that 4 bit hardware could operate on arbitrarily long sequences of decimal digits without rounding error. I like the simplicity of GARTHWILSON's iterative decimal subtraction. I have considered optimized decimal algorithms, such as subtracting 3x. In the worst case, division where a digit is 8 incurs the sequence: subtract 3x, subtract 3x, subtract 3x, underflow, restore previous value, subtract 1x, subtract 1x, subtract 1x, underflow, restore previous value. However, this technique (and powers of two within one decimal digit) is fiddly while providing minimal advantage over the trivial technique.
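For reference, the trivial technique is only a few lines. A sketch in C, using binary integers to show the digit-at-a-time structure (the 3x trick would shorten the inner loop):

    unsigned dec_div(unsigned n, unsigned d, unsigned *rem) {
        if (d == 0) { *rem = n; return 0; }   /* caller must check */
        unsigned q = 0, scale = 1, dd = d;
        while (dd <= n / 10) {                /* align divisor below n */
            dd *= 10;
            scale *= 10;
        }
        while (scale) {
            unsigned digit = 0;
            while (n >= dd) {                 /* at most nine */
                n -= dd;                      /* subtractions per digit */
                digit++;
            }
            q += digit * scale;
            dd /= 10;
            scale /= 10;
        }
        *rem = n;
        return q;
    }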
Given a free choice, I would implement integer addition, subtraction and multiplication in hardware and integer division and modulo in software. If I don't care about modulo, I might be inclined to implement division by casting to float, performing float division and then casting back to integer. If the float mantissa is the same size or larger than the integer representation then precision is never lost and only one implementation is required. Some versions of BASIC worked in a similar manner. If I do care about modulo, it may be very useful to preserve the floating point remainder and cast that to integer. However, it is well known to Forth programmers that one's complement or two's complement arithmetic may affect the rounding of negative inputs. The choice is influenced by available or anticipated hardware support for integer multiply or float divide.
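A sketch in C of the cast-to-float scheme for 32 bit integers, where the 53 bit double mantissa guarantees the quotient is never corrupted by rounding; the trunc/floor choice is exactly where the negative input rounding question appears:

    #include <stdint.h>
    #include <math.h>

    /* assumes d != 0; a double holds any 32 bit integer exactly */
    int32_t div_via_float(int32_t n, int32_t d) {
        /* trunc() rounds toward zero, matching C integer division;
           floor() instead gives the floored division some Forths use,
           which differs for negative inputs */
        return (int32_t)trunc((double)n / (double)d);
    }

    int32_t mod_via_float(int32_t n, int32_t d) {
        /* fmod keeps the sign of n, like C's % operator */
        return (int32_t)fmod((double)n, (double)d);
    }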
I previously suggested a minimal FPU with bitwise support for division, Newton-Raphson and similar techniques. Bit-at-a-time division allows low precision approximations. In particular, it allows the floating point mantissa to never exceed integer precision. It also allows low latency interrupts. I believe this has many of the properties that Proxy seeks.
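As a reminder of why multiply hardware is sufficient, a sketch in C of the Newton-Raphson reciprocal (positive finite inputs only; no handling of zero, infinity or denormals). Division is then n multiplied by the reciprocal:

    #include <math.h>

    /* Each step x' = x * (2 - d*x) squares the relative error, so a
       seed good to two bits reaches full float precision in four
       multiply-subtract iterations - no divide instruction needed. */
    float recip_nr(float d) {
        int e;
        float m = frexpf(d, &e);            /* d = m * 2^e, m in [0.5, 1) */
        float x = ldexpf(2.0f - m, -e);     /* seed: error <= 0.25 */
        for (int i = 0; i < 4; i++)
            x = x * (2.0f - d * x);         /* quadratic convergence */
        return x;
    }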
There is a mismatch of common operations. All hyperbolic functions can be ignored. Inverse square root is highly desirable. In the Rankin algorithms, square root is implemented as log, half, exp. Inverse square root can be implemented as log, half, negate, exp. Triangle and sawtooth functions are commonly implemented in GPUs for texture mapping. I believe that a sawtooth function would also be useful for low precision sine and cosine operating in right angles rather than radians. In the trivial case, cosine may be 1-(x*x) and I've found this or cruder approximations to be sufficient for many graphics applications.
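A sketch in C of the trivial case: the argument is in quarter turns, a sawtooth wraps it into range and the parabola 1-(x*x) does the rest. It is exact at 0 and +/-90 degrees and within roughly 0.06 in between, which matches my experience of it being sufficient; sine follows by a quarter-turn shift:

    #include <math.h>

    float cos_qturn(float x) {                     /* x = 1.0 is 90 degrees */
        x = x - 4.0f * floorf(x / 4.0f + 0.5f);    /* sawtooth into [-2, 2) */
        float sign = 1.0f;
        if (x > 1.0f)  { x = 2.0f - x;  sign = -1.0f; }  /* fold quadrant */
        if (x < -1.0f) { x = -2.0f - x; sign = -1.0f; }
        return sign * (1.0f - x * x);              /* the trivial parabola */
    }

    float sin_qturn(float x) {
        return cos_qturn(x - 1.0f);                /* shift by 90 degrees */
    }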
On Thu 11 Nov 2021, I expressed a strong preference for squaring as a separate operation. This reduces ten 8*8 multiplications to two reads from a 256*16 bit LUT and four 8*8 multiplications. If this is not possible or desirable then caching and ordering of operations will guarantee a 100% cache hit on the redundant commutative multiplies; possibly across multiple invocations.
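Closely related is the quarter-square identity, which turns any product into two squares. A sketch in C with a 511-entry table of floor(x*x/4) - the layout differs from the 256 entry LUT above but the principle is the same:

    #include <stdint.h>

    /* a*b = ((a+b)^2 - (a-b)^2) / 4, so two table reads and two
       subtractions replace a general multiply */
    static uint16_t qsq[511];          /* qsq[x] = x*x/4; a+b reaches 510 */

    void qsq_init(void) {
        for (uint32_t x = 0; x < 511; x++)
            qsq[x] = (uint16_t)((x * x) / 4);
    }

    uint16_t mul8x8(uint8_t a, uint8_t b) {
        /* the floor(x^2/4) errors cancel: a+b and a-b share parity */
        return (uint16_t)(qsq[a + b] - qsq[a >= b ? a - b : b - a]);
    }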
However, stateful multiplication interacts horribly with interrupt latency - with or without memory-mapped hardware. In the case of software-only multiplication, the call graph of subroutines suspected of being outside interrupt (typically, everything) may update a multiplication cache via a semaphore which does not release a wanted interrupt mask. Obviously, this adversely affects interrupt response time. In the case of write-only hardware latches, subroutines outside of interrupt may require duplicate writes to known memory locations or a semaphore to prevent latches being overwritten. In the case of read/write octal latches (where the preferred implementation is LE74HC574K245N), subroutines suspected of being inside interrupt (typically, everything except main) must preserve latch contents before use. Obviously, any implementation of division which uses multiplication is similarly affected.
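For the read/write latch case, the save/restore discipline looks like this in C. The addresses are hypothetical placeholders, not a real memory map:

    #include <stdint.h>

    #define MUL_A   (*(volatile uint8_t *)0xD000)  /* operand latch A */
    #define MUL_B   (*(volatile uint8_t *)0xD001)  /* operand latch B */
    #define MUL_LO  (*(volatile uint8_t *)0xD002)  /* product, low byte  */
    #define MUL_HI  (*(volatile uint8_t *)0xD003)  /* product, high byte */

    /* safe inside interrupt: mainline code sees the latches unchanged */
    uint16_t isr_safe_mul(uint8_t a, uint8_t b) {
        uint8_t save_a = MUL_A, save_b = MUL_B;    /* readable latches */
        MUL_A = a;
        MUL_B = b;
        uint16_t p = MUL_LO | ((uint16_t)MUL_HI << 8);
        MUL_A = save_a;                            /* restore for mainline */
        MUL_B = save_b;
        return p;
    }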
I like the idea of using address latches and ROM for 8*8 multiplication and other functions. In particular, this doesn't require vast quantities of address space and retains 6502 compatibility. For a 16 bit LUT, only one ROM is required because the bottom address line may be wired directly and one 8 bit ROM may perform double duty. I thought of this after studying AndersNielsen's video circuit. I made an XR2801 16 bit ROM programmer with the primary purpose of programming 16*16 pixel fonts but also secondary purposes, such as Gigatron programming. More recently, I found that AndersNielsen's 8*8 pixel font system can be increased to 8*16 or 16*16 with no additional components. In particular, one 8 bit ROM can provide both halves of a 16*16 pixel font face. Similarly, one 8 bit ROM may provide both parts of a 16 bit LUT. I believe this reduces your design to one read strobe and one 128KB ROM or larger, two write strobes and two latches for the 8 bit inputs, and one write strobe and one latch for additional functions. All strobes may be provided by one 74HC138 or similar, which may be shared with other system functionality.
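Generating the ROM image is trivial. A sketch in C, assuming the operands on A1-A16 and the product byte select on A0 - one plausible wiring among several:

    #include <stdint.h>
    #include <stdio.h>

    int main(void) {
        static uint8_t rom[1 << 17];               /* 128KB image */
        for (unsigned a = 0; a < 256; a++) {
            for (unsigned b = 0; b < 256; b++) {
                unsigned p = a * b;
                unsigned addr = (a << 9) | (b << 1);
                rom[addr | 0] = p & 0xFF;          /* A0=0: low byte  */
                rom[addr | 1] = p >> 8;            /* A0=1: high byte */
            }
        }
        FILE *f = fopen("mul8x8.rom", "wb");
        if (!f) return 1;
        fwrite(rom, 1, sizeof rom, f);
        fclose(f);
        return 0;
    }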
I don't trust the long-term storage of cheap flash ROM. I understand that the really cheap ones store up to five bits of data per cell and then use hideous finite field arithmetic to correct bit errors. Understandably, they can only be programmed in blocks due to the checksum calculations. To overcome the inevitable failure of flash ROM, you might want to notch out unreliable rows or columns of the LUT. Unfortunately, performance will degrade unsteadily as the ROM fails, and this does not prevent arithmetic errors during operation. However, it allows a minimal system where flash ROM and latches are optional.
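A sketch in C of the notching scheme: scan the LUT against a software multiply at boot, mark failed rows in a bitmap and fall back to software for those rows. rom_read() and sw_mul() are hypothetical helpers assumed to exist elsewhere:

    #include <stdint.h>

    extern uint8_t rom_read(uint32_t addr);        /* hypothetical */
    extern uint16_t sw_mul(uint8_t a, uint8_t b);  /* shift-and-add */

    static uint8_t bad_row[256 / 8];               /* one bit per row */

    void lut_selftest(void) {
        for (unsigned a = 0; a < 256; a++)
            for (unsigned b = 0; b < 256; b++) {
                uint32_t addr = ((uint32_t)a << 9) | (b << 1);
                uint16_t p = rom_read(addr) |
                             ((uint16_t)rom_read(addr | 1) << 8);
                if (p != sw_mul((uint8_t)a, (uint8_t)b)) {
                    bad_row[a >> 3] |= 1u << (a & 7);
                    break;                         /* row is notched out */
                }
            }
    }

    uint16_t mul8(uint8_t a, uint8_t b) {
        if (bad_row[a >> 3] & (1u << (a & 7)))
            return sw_mul(a, b);                   /* degraded path */
        uint32_t addr = ((uint32_t)a << 9) | (b << 1);
        return rom_read(addr) | ((uint16_t)rom_read(addr | 1) << 8);
    }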