With @granati's implementation of quadruple-precision arithmetic for the 65816, and my own slowly-evolving plans for a 65C02-based desk calculator with quadruple precision (and generally better features and ergonomics than your average hand calculator, otherwise what's the point), I thought it worth looking at what IEEE-754 floating point really means, and how best to implement it in practice.
Most pocket calculators perform all their arithmetic in BCD format. This is awkward at best to implement on a standard CPU, and has some pretty nasty properties regarding precision loss during computations. It's also why most calculators can handle hexadecimal and other non-decimal bases only with integer values. By using binary as the "native" format, we can convert to any base we like for display, including displaying floating-point values in "strange" bases like 12 and 60 (which were traditionally preferred for hand calculations by serious mathematicians).
Traditionally, 8-bit micros have used hand-rolled and rather old-fashioned floating-point formats which don't match the semantics of modern floating-point hardware - notably the 40-bit (5-byte) formats used in Apple BASIC and BBC BASIC. These lack any sophisticated handling of rounding, overflow, and invalid results - probably in favour of simpler code and maximum performance - and for similar reasons may also omit support for denormalised (tiny) values. Granted, these implementations mostly predated the IEEE-754 standard's ratification.
Actually, @granati's code is also not quite IEEE-754 compliant, because it doesn't correctly round the basic operations for which the standard requires correct rounding. I hope to illustrate these deficiencies and their correct solutions in this thread.
The three most common IEEE-754 formats in modern use are 32-bit single-precision (sign + 8b exponent + 23b mantissa), 64-bit double-precision (sign + 11b exponent + 52b mantissa), and 80-bit double-extended precision (typically sign + 15b exponent + explicit leading bit + 63b mantissa). Of these, the last is most convenient for manipulation on an 8-bit micro, because the dividing point between exponent and mantissa is byte-aligned. Quadruple-precision, 128 bits wide, also satisfies this property since it has the same sign-exponent layout as double-extended. However, to illustrate this series I'll use two other formats in preference:
- 16-bit half-precision (sign + 5b exponent + 10b mantissa). This is a useful size for writing easily-readable illustrative examples, though it isn't very useful for real computations.
- 48-bit single-extended precision (sign + 15b exponent + 32b mantissa). This is essentially a truncated version of quadruple-precision format, occupying just 6 bytes instead of 16, which allows for relatively compact code examples while still offering enough range and precision for real-world usage (better in both respects than single-precision). It should be easy to extend code supporting this format to double-extended or quadruple precision, simply by extending the mantissa field, and I'll point out what needs to be done to achieve that.
(Technically, my 48-bit format violates the standard's definition of single-extended precision, because its exponent field is wider than that of double-precision. To that I say "nuts". Since nobody I know of actually uses single-extended precision, there is no compatibility problem.)
In this first post, I'll briefly describe the essential features of IEEE-style arithmetic, without code. Much of this is a paraphrase of material from David Goldberg's "What Every Computer Scientist Should Know About Floating-Point Arithmetic", and I do encourage reading that if you want the really fine detail - but it is a very technical document in places, so I'm pulling out the essential points and dropping the tedious mathematical proofs backing them up. In later posts I'll start introducing C and/or 65C02 code to demonstrate aspects of implementation.
A floating-point value consists of a mantissa, which is essentially an integer, and an exponent which indicates where to insert the radix point in (or to either side of) the mantissa. The radix point is a generalisation of what we usually call the decimal point, in that it applies to more than just decimal numbers. In IEEE format, negative numbers are handled by prepending a separate sign bit to the exponent-mantissa representation, and the leading bit of the mantissa is normally assumed to be set, so is omitted from the representation in memory. The exponent itself is "biased" so that it can be represented as an unsigned integer. These properties distinguish IEEE format from some earlier formats, which used two's-complement signed representations for both fields.
Values in IEEE format fall into several categories, depending on the value of the exponent field, which allow representation of overflowed, invalid, tiny and zero values as well as values in the "normal" range:
- Maximum exponent, non-zero mantissa: NaN (Not A Number). NaNs compare unequal to, and unordered against, all values, including themselves. Nearly all arithmetic operations involving a NaN also produce a NaN. The sign of a NaN is irrelevant. Arithmetic operations should produce a NaN as an error-indicating value instead of throwing an exception, eg. for 0/0, Infinity/Infinity, Infinity-Infinity, and the square root of a negative number.
- Maximum exponent, zero mantissa: Infinity. This is the representation used for overflowed values, and may be positive or negative. Dividing a non-zero, non-NaN value by zero produces an Infinity of the appropriate sign. Arithmetic involving Infinities produces well-defined results.
- Zero exponent, non-zero mantissa: Denormalised (tiny) value. This is the exception to the implicit leading bit normally prepended to the mantissa - here the leading bit is zero - and allows the behaviour known as "gradual underflow" for very small values, with absolute precision equal to that of the smallest "normal" exponent. Normal arithmetic rules apply to these values.
- Zero exponent, zero mantissa: Zero. This can be either a positive or negative zero; the two compare equal to each other and are usually indistinguishable, but do behave differently in specific ways, known as "branch cuts" - eg. 1/+0 gives +Infinity while 1/-0 gives -Infinity.
- All other exponents, any mantissa: Normal value.
The value +1.0 is represented by a cleared sign bit, a zero mantissa, and an exponent with all bits set except the most significant bit, eg. (0 01111 0000000000) for half-precision. Larger exponents indicate successively higher powers of two, eg. (0 10000 0000000000) is +2.0, or (+1.0 * 2^1). Smaller exponents indicate fractional values, eg. (0 01101 0000000000) is +0.25, or (+1.0 * 2^-2). A non-zero mantissa with an exponent in the normal range controls bits to the right of the implicit leading bit, eg. (0 01111 1000000000) is +1.5; compare with the representation of +1.0.
The standard defines several different rounding modes which implementations must support. The default is "round to nearest even", sometimes also known as "banker's rounding", and is generally the most useful since it tends to minimise cumulative and systematic errors. Also available are "round towards zero", also known as truncation, "round towards positive infinity", aka. ceiling, and "round towards negative infinity", aka. floor. All of the "basic" operations - add, subtract, multiply, divide, remainder, square root, convert to/from integer - are defined to round correctly, that is, as if the infinitely-precise result were first calculated, and only then rounded. There are ways of doing that without actually calculating the full result (which in the case of division and square root is often infinitely long), using just three extra bits - conventionally called guard, round and sticky - in a particular way which we'll discuss later.
But those are the essentials. Floating-point isn't really as complicated or arcane as you might assume.