PostPosted: Tue Oct 30, 2018 10:42 am 

Joined: Mon May 21, 2018 8:09 pm
Posts: 1462
With @granati's implementation of quadruple-precision arithmetic for the 65816, and my own slowly-evolving plans for a 65C02-based desk calculator with quadruple precision (and generally better features and ergonomics than your average hand calculator, otherwise what's the point), I thought it worth looking at what IEEE-754 floating point really means, and how best to implement it in practice.

Most pocket calculators perform all their arithmetic in BCD format. This is awkward at best to implement on a standard CPU, and has some pretty nasty properties regarding precision loss during computations. This is also why most calculators can only handle hexadecimal and other non-decimal bases for integer values. By using binary as the "native" format, we can convert to any base we like for display, including displaying floating-point values in "strange" bases like 12 and 60 (which were traditionally preferred for hand calculations by serious mathematicians).

Traditionally, 8-bit micros have used some hand-rolled and rather old-fashioned floating-point format which doesn't match the semantics of modern floating-point hardware. This includes particularly the 40-bit (5-byte) formats used in Apple BASIC and BBC BASIC. They lack any sort of sophisticated handling of rounding, overflow, and invalid results - probably in favour of simplifying the code and obtaining maximum performance - and for similar reasons may also omit support for denormalised (tiny) values. Granted, these implementations mostly predated the IEEE-754 standard's ratification.

Actually, @granati's code is also not quite IEEE-754 compliant, because it doesn't correctly implement rounding for the basic operations which the standard requires perfect rounding for. I hope to be able to illustrate these deficiencies and their correct solution in this thread.

The three most common IEEE-754 formats in modern use are 32-bit single-precision (sign + 8b exponent + 23b mantissa), 64-bit double-precision (sign + 11b exponent + 52b mantissa), and 80-bit double-extended precision (typically sign + 15b exponent + explicit leading bit + 63b mantissa). Of these, the last is most convenient for manipulation on an 8-bit micro, because the dividing point between exponent and mantissa is byte-aligned. Quadruple-precision, 128 bits wide, also satisfies this property since it has the same sign-exponent layout as double-extended. However, to illustrate this series I'll use two other formats in preference:

- 16-bit half-precision (sign + 5b exponent + 10b mantissa). This is a useful size for writing easily-readable illustrative examples, though it isn't very useful for real computations.

- 48-bit single-extended precision (sign + 15b exponent + 32b mantissa). This is essentially a truncated version of quadruple-precision format, occupying just 6 bytes instead of 16, which allows for relatively compact code examples while still offering enough range and precision for real-world usage (better in both respects than single-precision). It should be easy to extend code supporting this format to double-extended or quadruple precision, simply by extending the mantissa field, and I'll point out what needs to be done to achieve that.

(Technically, my 48-bit format violates the standard's definition of single-extended precision, because its exponent field is wider than that of double-precision. To that I say "nuts". Since nobody actually uses single-extended precision that I know of, there is no problem with compatibility.)

In this first post, I'll briefly describe the essential features of IEEE-style arithmetic, without code. Much of this is a paraphrase of material from "What Every Computer Scientist Should Know About Floating-Point Arithmetic", and I do encourage reading that if you want the really fine detail - but it is a very technical document in places, so I'm pulling out the essential points and dropping the tedious mathematical proofs backing them up. In later posts I'll start introducing C and/or 65C02 code to demonstrate aspects of implementation.

A floating-point value consists of a mantissa, which is essentially an integer, and an exponent which indicates where to insert the radix point in (or to either side of) the mantissa. The radix point is a generalisation of what we often call the decimal point, in that it applies to more than just decimal numbers. In IEEE format, negative numbers are handled by prepending a separate sign bit to the exponent-mantissa representation, and the leading bit of the mantissa is normally assumed to be set and so is omitted from the representation in memory. The exponent itself is "biased" so that it can be represented as an unsigned integer. These properties distinguish IEEE format from some earlier formats, which used two's-complement signed representation for both fields.

Values in IEEE format fall into several categories, depending on the value of the exponent field, which allow representation of overflowed, invalid, tiny and zero values as well as values in the "normal" range:

- Maximum exponent, non-zero mantissa: NaN (Not A Number). NaNs compare unequal to and unordered with all values, including themselves. Nearly all arithmetic operations involving a NaN also produce a NaN. The sign of a NaN is irrelevant. Arithmetic operations should produce a NaN as an error-indicating value instead of throwing an exception, eg. for 0/0, Infinity/Infinity, Infinity-Infinity, and sqrt(-x).

- Maximum exponent, zero mantissa: Infinity. This is the representation used for overflowed values, and may be positive or negative. Dividing a non-zero, non-NaN value by zero produces an Infinity of the appropriate sign. Arithmetic involving Infinities produces well-defined results.

- Zero exponent, non-zero mantissa: Denormalised (tiny) value. This is the exception to the "implicit leading bit" normally prepended to the mantissa, and allows behaviour known as "gradual underflow" for very small values, with precision equal to the smallest exponent in the "normal" range. Normal arithmetic rules apply to these values.

- Zero exponent, zero mantissa: Zero. This can be either a positive or negative zero, which compare equal to each other and are usually indistinguishable, but which do behave differently in specific ways, known as "branch cuts".

- All other exponents, any mantissa: Normal value.

The value +1.0 is represented by a cleared sign bit, a zero mantissa, and an exponent with all bits set except the most significant bit, eg. (0 01111 0000000000) for half-precision. Larger exponents indicate successively higher powers of two, eg. (0 10000 0000000000) is +2.0, or (+1.0 * 2^1). Smaller exponents indicate fractional values, eg. (0 01101 0000000000) is +0.25, or (+1.0 * 2^-2). A non-zero mantissa with an exponent in the normal range controls bits to the right of the implicit leading bit, eg. (0 01111 1000000000) is +1.5; compare with the representation of +1.0.
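To make those bit patterns concrete, here's a tiny throwaway C sketch (nothing to do with the calculator code itself; the half_decode name is made up) which unpacks a half-precision pattern and prints the value it represents:
Code:
#include <stdio.h>
#include <stdint.h>
#include <math.h>

/* Decode an IEEE half-precision bit pattern (normal range only) and print
   the value it represents.  Purely illustrative. */
static void half_decode(uint16_t bits)
{
    int      sign     = bits >> 15;             /* 1 sign bit                */
    int      exponent = (bits >> 10) & 0x1F;    /* 5-bit biased exponent     */
    uint16_t mantissa = bits & 0x3FF;           /* 10 explicit mantissa bits */

    /* value = (-1)^sign * 1.mantissa * 2^(exponent - 15) for normal numbers */
    double value = (sign ? -1.0 : 1.0)
                 * (1.0 + mantissa / 1024.0)
                 * pow(2.0, exponent - 15);
    printf("%04X -> sign %d  exp %3d  mantissa %03X  =  %g\n",
           bits, sign, exponent - 15, mantissa, value);
}

int main(void)
{
    half_decode(0x3C00);    /* 0 01111 0000000000 = +1.0  */
    half_decode(0x4000);    /* 0 10000 0000000000 = +2.0  */
    half_decode(0x3400);    /* 0 01101 0000000000 = +0.25 */
    half_decode(0x3E00);    /* 0 01111 1000000000 = +1.5  */
    return 0;
}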

The standard defines several different rounding modes which implementations must support. The default is "round to nearest even", sometimes also known as "banker's rounding", and is generally the most useful since it tends to minimise cumulative and systematic errors. Also available are "round towards zero", also known as truncation, "round towards positive infinity", aka. ceiling, and "round towards negative infinity", aka. floor. All of the "basic" operations - add, subtract, multiply, divide, remainder, square root, convert to/from integer - are defined to round correctly, that is, as if the infinitely-precise result were first calculated, and only then rounded. There are ways of doing that without actually having to calculate the full result (which in the case of division and square-root is often infinitely long), using just three extra guard bits in a particular way which we'll discuss later.
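(As an aside, the same four modes are exposed by C99's <fenv.h>, which is handy for cross-checking results on a host machine. A rough sketch - strictly one should also enable the FENV_ACCESS pragma, and the directed modes only exist where the platform supports them:)
Code:
#include <stdio.h>
#include <math.h>
#include <fenv.h>

/* Round 2.5 and -2.5 to an integer under each of the four IEEE rounding modes. */
int main(void)
{
    const int   modes[] = { FE_TONEAREST, FE_TOWARDZERO, FE_UPWARD, FE_DOWNWARD };
    const char *names[] = { "nearest-even", "toward zero", "toward +inf", "toward -inf" };

    for (int i = 0; i < 4; i++) {
        fesetround(modes[i]);
        /* rint() rounds according to the current rounding mode */
        printf("%-12s   2.5 -> %4.1f   -2.5 -> %4.1f\n",
               names[i], rint(2.5), rint(-2.5));
    }
    fesetround(FE_TONEAREST);    /* restore the default mode */
    return 0;
}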

But those are the essentials. Floating-point isn't really as complicated or arcane as you might assume.


PostPosted: Tue Oct 30, 2018 12:25 pm 

Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10986
Location: England
This should be good! (I might be naive but I'd worry more about, say, getting excellent results for the four basic operations and square root than I would worry about denorms. By excellent, I mean exact, or within half a unit in the last place.)

Glad you linked Goldberg. There are (of course) also writings by William Kahan which are well worth reading. Try this search. Or this on mindlessness and futility.


PostPosted: Tue Oct 30, 2018 12:49 pm 

Joined: Fri Nov 09, 2012 5:54 pm
Posts: 1431
65C02 based desk calculator, now that's a nice idea.

The problem when implementing a floating point package is that the devil hides in the details.
For instance, in the rounding errors when evaluating polynomial approximations for the transcendental functions.

I'm no expert when it comes to floating point, but a search for:
W. J. Cody, Jr. and W. Waite. Software Manual for the Elementary Functions. Prentice Hall, 1980.
brought me to:
Jean-Michel Muller. Elementary Functions: Algorithms and Implementation. Birkhäuser, 2006.

Found a copy of it here.
Don't know if this site has permission from the copyright holder to post it...


PostPosted: Tue Oct 30, 2018 2:17 pm 

Joined: Mon Jun 24, 2013 8:18 am
Posts: 83
Location: Italy
ttlworks wrote:
65C02 based desk calculator, now that's a nice idea.
The problem when implementing a floating point package is that the devil hides in the details.
For instance, in the rounding errors when evaluating polynomial approximations for the transcendental functions.


Rational functions are used, over a restricted argument interval, to keep the ripple within fixed limits (a process very close to the one used in the design of elliptic filters); the cumulative error from evaluating the successive powers is in effect cancelled thanks to the rational function. Of course this computation is very sensitive to the coefficients of the rational function: for this reason the coefficients and the intermediate computations must be carried with a few more bits than the requested maximum error.
Computation of the square root doesn't use this method, because of the fast convergence of Newton's method.
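For illustration only, a minimal sketch of the general shape of such an evaluation - the coefficients below are just a small Padé approximant of tanh(x), not the minimax coefficients a real implementation would use, and the function name is invented:
Code:
#include <stdio.h>
#include <math.h>

/* Evaluate a rational approximation P(x)/Q(x) over a small interval around
   zero.  These are simple Pade coefficients for tanh(x), for illustration. */
static double tanh_rational(double x)
{
    double x2 = x * x;
    double p  = x * (15.0 + x2);      /* numerator,   in Horner form */
    double q  = 15.0 + 6.0 * x2;      /* denominator, in Horner form */
    return p / q;
}

int main(void)
{
    for (double x = 0.0; x <= 0.5; x += 0.125)
        printf("x=%5.3f  approx=%.9f  tanh=%.9f\n", x, tanh_rational(x), tanh(x));
    return 0;
}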

Marco

_________________
http://65xx.unet.bz/ - Hardware & Software 65XX family


PostPosted: Tue Oct 30, 2018 2:28 pm 

Joined: Mon Jun 24, 2013 8:18 am
Posts: 83
Location: Italy
@Chromatix

Actually, for the basic operations my implementation uses a full 128-bit mantissa (15 guard bits), with the usual rounding to nearest. In the division algorithm I use 2 more bits for the last remainder (at some point the infinite cycle must be stopped), and I use the MSB to round the 128-bit mantissa.
Rounding to the 113-bit mantissa of IEEE quadruple precision is done with the standard "ties to even" method; other roundings are done with the standard functions (floor, ceil, ...).

I made a lot of tests, comparing results with some on-line calculators that show the packed IEEE quadruple format, and did not find any discrepancy. Of course, if you find issues and/or errors in my implementation, any suggestion is welcome.

_________________
http://65xx.unet.bz/ - Hardware & Software 65XX family


PostPosted: Tue Oct 30, 2018 7:05 pm 

Joined: Mon May 21, 2018 8:09 pm
Posts: 1462
As it happens, the very first subject is how to handle conversion to and from an expanded internal representation, occupying two extra bytes and in fully normalised form. Because denormalised values are relatively rare (especially given the huge exponent range our 48-bit format supports), the overhead of renormalising and denormalising tiny values will rarely be incurred, so we can mostly ignore it during performance analysis - but the code is not very complex.

The API I use involves passing in pointers to the operands and to the output location; this allows either writing the result to a completely fresh location, or overwriting either of the input operands, or even using the same value for both operands; this is "three operand" semantics, familiar from RISC CPUs. A simple memcpy operation moves the operands and result between user memory and the internal workspace. This memcpy can also be used to byte-reverse the operands if desired. I prefer to work in big-endian, except where the CPU mandates otherwise; the 65C02 only requires addresses to be little-endian, so I'm arranging everything else big-endian in this project. This means that the sign bit is the most-significant bit of the first byte, and the exponent occupies the rest of the first two bytes.
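(The external F48 type itself is never spelled out in this thread; for the purposes of these listings, assume it's simply an opaque 6-byte big-endian container, along the lines of the sketch below. The accessor helpers are purely hypothetical.)
Code:
#include <stdint.h>

/* Assumed declaration of the 48-bit external format: six bytes stored
   big-endian, i.e. byte 0 holds the sign bit and the top 7 exponent bits,
   byte 1 the low 8 exponent bits, and bytes 2-5 the 32-bit mantissa. */
typedef struct {
    uint8_t b[6];
} F48;

/* Hypothetical accessors, not part of the original code. */
static inline uint8_t  f48_sign(const F48* x)       { return x->b[0] >> 7; }
static inline uint16_t f48_biased_exp(const F48* x)
{
    return ((uint16_t)(x->b[0] & 0x7F) << 8) | x->b[1];
}
static inline uint32_t f48_mantissa(const F48* x)
{
    return ((uint32_t)x->b[2] << 24) | ((uint32_t)x->b[3] << 16)
         | ((uint32_t)x->b[4] << 8)  |  (uint32_t)x->b[5];
}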

After fetching each operand, I use one extra byte to extract the sign bit, leaving room to expand the 15-bit biased exponent into a 16-bit two's complement signed exponent (by subtracting the bias of 0x3FFF). This has sufficient range to convert a denormalised mantissa into normalised form, after which arithmetic operations can proceed on the assumption of all numbers (except the special cases of NaN, Infinity, Zero) being normalised, and thus having an implicit leading bit. Zeroes get the most-negative exponent, while NaNs and infinities get the most-positive one.

Code:
typedef struct {
    uint8_t sign;
    int16_t exponent;
    uint32_t mantissa;
    uint8_t guard;
} Workspace;

void fadd(F48* ap, F48* bp, F48* rp)
{
    Workspace a, b, r;
    F48_expand(ap, &a);
    F48_expand(bp, &b);

    // do the add

    F48_round_format(rp, &r, roundingMode);
}
The other extra byte is used to store the guard bits, and is initialised to zero on conversion from external format. Correct rounding needs three extra bits, conventionally named "guard", "round", and "sticky", but it's convenient to provide eight, with the least-significant bit getting the "sticky" semantics; it indicates whether any bits are set in its position or further right. This subtle detail is what's missing from @granati's implementation. The deficiency is only observable in quite specific circumstances, but it should be fairly easy to test for once you know what you're testing for.

The guard byte is therefore updated when shifting right in two ways, depending on the magnitude of the shift. When shifting right by one bit, the guard byte is shifted right just like the rest of the mantissa, then the carry needs to be checked to see if a set bit was shifted out. If so, ORA #1 into the guard byte to preserve the sticky bit. When shifting right by a whole byte, the guard byte is to be replaced entirely by the least-significant byte of the mantissa; but first, test the old guard byte, and if it's non-zero, ORA #1 into the new guard byte. On rarer occasions, it is necessary to shift the mantissa left by one bit, and in that case you can just include the guard byte in the shift.
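In C terms, the two cases look roughly like this (a sketch operating on a mantissa/guard pair as in the Workspace above; the helper names are invented for illustration):
Code:
#include <stdint.h>

/* Shift the mantissa/guard pair right by one bit, keeping bit 0 of the
   guard byte "sticky": once any set bit falls off the end, it stays set. */
static void shift_right_1(uint32_t* mantissa, uint8_t* guard)
{
    uint8_t lost = *guard & 1;                    /* the bit about to fall off */
    *guard = (uint8_t)((*guard >> 1) | ((*mantissa & 1) << 7) | lost);
    *mantissa >>= 1;
}

/* Shift right by a whole byte: the mantissa's low byte becomes the new
   guard byte, and the old guard byte collapses into the new sticky bit. */
static void shift_right_8(uint32_t* mantissa, uint8_t* guard)
{
    uint8_t sticky = (*guard != 0) ? 1 : 0;
    *guard = (uint8_t)(*mantissa & 0xFF) | sticky;
    *mantissa >>= 8;
}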

A wrinkle is that during denormalisation (converting a tiny value to external format), the rounding point moves to a different place than in normalised form. For that reason, I think it's best to defer rounding until any denormalisation has occurred and the value has been coerced into external format, thereby preserving the original guard bits as well as any shifted out by the denormalisation process. Rounding may then (rarely) result in a normalised number, which is very naturally handled by the layout of the IEEE format and requires no special cases (because, except at zero and the infinity limits, nextafter() is just an increment of the bit representation when positive, and a decrement when negative). So in essence, rounding becomes part of the export routine from internal to external format.

Code:
typedef enum {
    nearestEven = 0,
    towardsZero,
    positiveInf,
    negativeInf
} RoundingMode;

void F48_expand(const F48* mem, Workspace* intern)
{
    memcpy(&intern->exponent, mem, sizeof(F48));
    intern->sign = (intern->exponent < 0) ? 0x80 : 0;
    intern->exponent &= 0x7FFF;
    intern->guard = 0;

    if(intern->exponent == 0x7FFF) {
        // infinity or NaN; leave it all alone
    } else if(intern->exponent == 0) {
        if(intern->mantissa == 0) {
            // zero
            intern->exponent = -0x8000;
        } else {
            // denormal - renormalise
            intern->exponent = -0x3FFE;
            // clz() counts the leading zero bits of the 32-bit mantissa (eg. a compiler builtin or a small loop)
            for(uint8_t i = clz(intern->mantissa) + 1; i; i--) {
                intern->exponent--;
                intern->mantissa <<= 1;    // this shift does not involve the guard byte, which is known to be zero
            }
        }
    } else {
        // normal range
        intern->exponent -= 0x3FFF;
    }
}

void F48_round_format(F48* mem, Workspace* intern, RoundingMode rnd)
{
    if(intern->exponent > 0x3FFF) {    // overflow or existing infinity or NaN
        if(intern->exponent != 0x7FFF)
            intern->mantissa = 0;    // flush overflow to infinity
        intern->exponent = 0x7FFF;
    } else if(intern->exponent > -0x3FFF) {    // normal range (before rounding)
        intern->exponent += 0x3FFF;
    } else if(intern->exponent < -(0x3FFF + 32)) {    // zero or value too tiny to represent - *** NOTE 32 is length of mantissa ***
        // keep a sticky bit for any non-zero value (a true zero has internal exponent -0x8000), so the round-towards-infinity modes still work
        intern->guard = (intern->exponent != -0x8000) ? 1 : 0;
        intern->exponent = intern->mantissa = 0;
    } else {
        // denormalised value
        bool first = 1;
        while(intern->exponent < -0x3FFE) {
            intern->guard = (intern->guard & 1) | (intern->guard >> 1) | ((intern->mantissa & 1) << 7);    // preserve sticky bit and include shift-in of mantissa bit
            intern->mantissa = ((uint32_t)first << 31) | (intern->mantissa >> 1);    // re-insert the implicit leading bit on the first shift
            first = 0;
            intern->exponent++;    // walk the exponent up towards the minimum normal value; without this the loop never terminates
        }
        intern->exponent = 0;
    }

    if(intern->exponent != 0x7FFF && intern->guard != 0) {    // rounding only concerns inexact finite values
        if(rnd == nearestEven) {
            if(intern->guard > 0x80 || (intern->guard == 0x80 && (intern->mantissa & 1) == 1))
                if(!(++intern->mantissa)) intern->exponent++;
        } else if((intern->sign && rnd == negativeInf) || (!intern->sign && rnd == positiveInf)) {
            // directed rounding towards the result's own sign rounds the magnitude up
            if(!(++intern->mantissa)) intern->exponent++;
        }
        // towardsZero, and the infinity mode pointing the other way, simply truncate the guard bits
    }

    intern->exponent |= intern->sign << 8;
    memcpy(mem, &intern->exponent, sizeof(F48));
}
It should be pretty obvious how to translate the above into 65C02 assembly, so I won't clutter the post with a listing. The above code is untested, and probably won't even compile without further work. Use at your own risk.


PostPosted: Tue Oct 30, 2018 7:13 pm 

Joined: Mon May 21, 2018 8:09 pm
Posts: 1462
ttlworks wrote:
65C02 based desk calculator, now that's a nice idea.

The problem when implementing a floating point package is that the devil hides in the details.
For instance, in the rounding errors when evaluating polynomial approximations for the transcendental functions.

I'll worry about the transcendentals when I've got a good implementation of the basic operations and input/display routines. Those will be a foundation that everything else relies on.

What thought I've put in so far suggests I should base the log/antilog family functions around the base 2 logarithm, and use scaling factors to implement the common and natural logarithms. But I'll probably need to implement the natural exponential in order to calculate the correct scaling factors to sufficient precision.
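(The scaling is just the change-of-base identity; a quick sketch of what that looks like, with the factors computed on the spot rather than stored as carefully-rounded constants - and it's exactly those constants that need the extra care:)
Code:
#include <stdio.h>
#include <math.h>

/* Change of base: once log2(x) is available, the common and natural logs
   are just scaled copies of it.  log10(2) and ln(2) are the scaling factors. */
int main(void)
{
    double x  = 42.0;
    double l2 = log2(x);

    printf("log2(%g)  = %.15f\n", x, l2);
    printf("log10(%g) = %.15f  (scaled: %.15f)\n", x, log10(x), l2 * log10(2.0));
    printf("ln(%g)    = %.15f  (scaled: %.15f)\n", x, log(x),   l2 * log(2.0));
    return 0;
}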


PostPosted: Wed Oct 31, 2018 12:28 am 

Joined: Mon May 12, 2014 6:18 pm
Posts: 365
Quote:
Most pocket calculators perform all their arithmetic in BCD format. This is awkward at best to implement on a standard CPU, and has some pretty nasty properties regarding precision loss during computations.
Out of curiosity (as opposed to argument), what makes you say this? I've been working on BCD floating point on another processor for a homebrew calculator and haven't found it awkward yet. What is nasty about it?


PostPosted: Wed Oct 31, 2018 9:04 am 

Joined: Mon May 21, 2018 8:09 pm
Posts: 1462
Looking at the 6502 specifically, it has instructions to handle adding and subtracting BCD numbers easily, via the Decimal flag. But multiplication and division are more difficult, and you need significantly more space to achieve the same precision as a pure binary representation. There are ways to do it, but the binary arithmetic is easier and faster and smaller. Modern CPUs don't even have decimal arithmetic instructions any more.

The modern approach to decimal arithmetic is to actually use base-1000 instead of base-10, storing groups of three decimal digits in 10-bit bitfields. This is much more space-efficient (it takes 12 bits to store those same three digits in BCD), but the first thing you have to do is extract those bitfields into something you can actually operate on. The exponent field still refers to base-10 digits, so if it isn't a multiple of three, you have to extract the individual digits and repack them in order to align operands for addition and subtraction.

And for what? To ensure that decimal numbers add and subtract without rounding errors. That's it.
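A minimal demonstration of the one thing binary gets "wrong" here - decimal fractions don't convert exactly on input:
Code:
#include <stdio.h>

/* 0.1 and 0.2 have no exact binary representation, so their rounded sum
   differs from the rounded 0.3 in the last place - the classic motivation
   for decimal arithmetic in calculators and financial code. */
int main(void)
{
    double a = 0.1, b = 0.2, c = 0.3;
    printf("0.1 + 0.2 = %.17g\n", a + b);
    printf("0.3       = %.17g\n", c);
    printf("equal?      %s\n", (a + b == c) ? "yes" : "no");
    return 0;
}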


PostPosted: Wed Oct 31, 2018 1:35 pm 

Joined: Mon Jun 24, 2013 8:18 am
Posts: 83
Location: Italy
The rounding error in the basic operations is independent of the base. The only justification for using a decimal base is the rounding error due to the conversion from binary to a decimal string. This conversion involves scaling by powers of ten until the number falls into a wanted range; the number is then treated as an integer and converted to decimal.

For instance, the "cancellation" error caused by subtracting two very close numbers affects BCD and binary numbers in the same way, because the most significant bits (or nibbles) of the number are cleared by the subtraction. The subsequent normalisation, which shifts the mantissa left by bits (or by nibbles), brings '0' bits (or '0' nibbles) in at the right-hand side of the mantissa, losing precision.
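A small demonstration of that cancellation effect in binary, for anyone who wants to see the digits vanish:
Code:
#include <stdio.h>

/* Subtracting two nearly-equal values cancels the leading digits; the zeros
   shifted in during renormalisation carry no information, so the rounding
   error already present in the operands is hugely magnified. */
int main(void)
{
    float a = 1.000001f;      /* both operands already rounded to ~7 digits */
    float b = 1.000000f;
    float diff = a - b;       /* mathematically 0.000001 */

    printf("a - b          = %.9g\n", diff);
    printf("true value     = %.9g\n", 1e-6);
    printf("relative error = %.3g\n", (diff - 1e-6) / 1e-6);
    return 0;
}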

_________________
http://65xx.unet.bz/ - Hardware & Software 65XX family


PostPosted: Fri Nov 02, 2018 8:07 pm 

Joined: Mon May 21, 2018 8:09 pm
Posts: 1462
In general, I'm building this system from the bottom levels of abstraction upwards. The first piece of code covered a preliminary pair of routines required for all of the arithmetic operations, but we don't have any actual calculations working - we can't yet add two floating-point numbers together. This next topic will get us a step closer to doing exactly that - but it's still another internal routine rather than a complete operation. (NB: this is software engineering in real time - I'm literally writing the code in the forum editor.)

The basic problem is that if you take two floating-point values and add them together, the operation actually performed depends on the signs of the two values involved; add a negative number to a positive one, and you actually need to do a subtraction operation, not an addition. So we need to pre-process both addition and subtraction to find out what really needs to be done, and hand over to some internal routines which actually handle addition and subtraction. This pre-processing should also identify which operand is larger in magnitude than the other, and hence determine what the sign of the result will be, and swap the operands if necessary to put them in magnitude order.

The internal routines can then assume that this has been done, simplifying their implementation. These assumptions are known as preconditions.

Our preconditions will be that the larger-magnitude operand comes first, that the operands are both finite, both non-zero, and not equal in magnitude, and that the correct operation (addition or subtraction) has been pre-identified. We also assume that the outer routine will fill in the result's sign field after control is returned to it, so the internal routines can completely ignore the signs of the input operands. These preconditioning routines will come later - right now we just want to get on with the business of addition.

To add floating-point numbers together, we need to align the mantissas with each other. We do this by taking the difference between the exponents, and shifting the smaller-magnitude mantissa right by that number of bits - updating the guard bits as we go, and not forgetting to insert the implicit leading bit. We then copy the guard byte to the result (because the guard byte of the larger operand will be zero), and add the aligned mantissas together. If there's a carry out from this addition, the result exponent is one higher than the larger operand's exponent, and the result mantissa must be shifted right one more bit (into the guard bits, again). If there is no carry out, the result exponent is that of the larger operand.

If the difference in exponents is large enough, the mantissa of the smaller value will disappear into the guard bits. We can speed up this case considerably for extreme exponent differences by special-casing it, but the basic routine will work correctly without this optimisation.

If the exponents happen to be equal, then the mantissas are already aligned with each other - but they both have hidden leading bits, and the result will always overflow to the next higher exponent. We handle this case specially, by performing a straight addition of the mantissas, then immediately shifting right one bit, leaving the carry of the addition in the most-significant bit of the result mantissa, and the least-significant bit of the result in the most-significant guard bit. This is a reasonably common case, so is worth optimising in this way, as well as being one of the cleaner ways to handle the case.

The case of the second exponent being larger than the first will not occur, because of the preconditions already mentioned.

Rounding and overflow are not handled here - we already took care of those details in the result formatting routine. All we need to do is set the guard byte appropriately.

Code:
// this code is not strictly C; shifts work with a Carry flag, just like a 6502.  Adds and subtracts update Carry but don't consume it unless explicitly specified.
void F48_internal_add(Workspace* R, Workspace* A, Workspace* B)
{
    if(A->exponent == B->exponent) {
        R->exponent = A->exponent + 1;
        R->mantissa = (A->mantissa + B->mantissa) >> 1;  // carry from addition becomes high bit
        R->guard = 0 >> 1;  // carry from mantissa low bit into guard high bit
    } else {
        uint16_t expDiff = A->exponent - B->exponent;
        if(expDiff > 32+8) {
            // mantissas don't overlap
            R->exponent = A->exponent;
            R->mantissa = A->mantissa;
            R->guard = 1;
        } else {
            Carry = 1;
            do {
                B->mantissa >>= 1;
                B->guard >>= 1;
                B->guard |= Carry;
                Carry = 0;
            } while(--expDiff);
            R->exponent = A->exponent;
            R->guard = B->guard;
            R->mantissa = A->mantissa + B->mantissa;
            if(Carry) {
                R->exponent++;
                Carry = 0;
                R->mantissa >>= 1;
                R->guard >>= 1;
                R->guard |= Carry;
            }
        }
    }
}


PostPosted: Sat Nov 03, 2018 8:41 am 

Joined: Mon May 21, 2018 8:09 pm
Posts: 1462
Next up is the internal subtraction routine, which has the same preconditions as the internal addition routine above. The chief difference, in fact, is that the exponent never decreases with addition - so sometimes a shift right by one bit is required - but never increases with subtraction, and can actually decrease by quite a lot, requiring left shifts by potentially as much as the whole mantissa width. If this occurs to values which have previously lost precision in other operations, the result can be catastrophic cancellation.
Code:
// this code is not strictly C; shifts work with a Carry flag, just like a 6502.  Adds and subtracts update Carry but don't consume it unless explicitly specified.
void F48_internal_subtract(Workspace* R, Workspace* A, Workspace* B)
{
    if(A->exponent == B->exponent) {
        // leading implicit bits cancel each other, so a left shift *will* be needed afterwards
        R->guard = 0;
        R->exponent = A->exponent;
        R->mantissa = A->mantissa - B->mantissa;    // will be non-zero, because equal values were eliminated in precondition
        Carry = 0;
        do {
            R->exponent--;
            R->mantissa <<= 1;
        } while(!Carry);
    } else {
        uint16_t expDiff = A->exponent - B->exponent;
        if(expDiff > 32+8) {
            // mantissas don't overlap
            R->guard = -1;
            R->mantissa = A->mantissa - 1;
            R->exponent = A->exponent - !Carry;
        } else {
            Carry = 1;
            do {
                B->mantissa >>= 1;
                B->guard >>= 1;
                B->guard |= Carry;
                Carry = 0;
            } while(--expDiff);
            R->exponent = A->exponent;
            R->guard = A->guard - B->guard;
            R->mantissa = A->mantissa - B->mantissa - !Carry;
            while(!Carry) {
                R->exponent--;
                Carry = 0;
                R->guard <<= 1;
                R->mantissa <<= 1;
            }
        }
    }
}


PostPosted: Sun Nov 04, 2018 12:47 pm 

Joined: Mon May 21, 2018 8:09 pm
Posts: 1462
Now all we need is the preconditioning routines, and we'll actually be able to add and subtract our 48-bit floats with IEEE-754 semantics. There are a lot of special cases to take care of, which we deliberately ignored in the internal routines posted immediately above, as well as the need to put the operands in the right order, generate the result sign, and call the correct internal routine. We also take advantage of the fact that the standard specifies subtraction as being exactly equivalent to addition of the second operand after inverting its sign.

The first and most obvious special case is that of NaN; if either operand is one of those, we simply copy the first one noticed to the result, no questions asked. As noted above, a NaN has a maximum exponent and a non-zero mantissa.

Infinities can also be dealt with without reference to the internal routines. Adding two Infinities of the same sign (or subtracting two of opposite signs, which amounts to the same thing) simply yields the first Infinity again. Adding any finite value to an Infinity, or subtracting one from it, also fails to change it. Subtracting Infinity from Infinity is undefined, and yields a NaN instead.

Adding zero to, or subtracting zero from, any value also doesn't change that value, so we can just copy it to the output. If both operands are zero, we should take the sign of the zero result from the first operand. This satisfies the rule that the sign of an addition must differ from at most one of the original operands.

The final special cases involve both operands being finite, non-zero and with equal exponent and mantissa. If addition is called for (after comparing signs), we can simply double one of them into the result by incrementing the exponent - a speed optimisation. If subtraction, we explicitly construct a zero to avoid an infinite loop of renormalisation in the subtraction routine. The standard specifies that this must be a positive zero except when the rounding mode is "towards negative infinity", when it must be a negative zero. AFAIK, this is the only case in the basic operations where the rounding mode needs to be known outside the rounding routine itself.

Once again, these routines are provided in pseudo-C for readability. After this post, I'll take a break from expanding functionality (since we now have our first working user-visible routines), and start converting them to 65C02 code.
Code:
static RoundingMode roundingMode = nearestEven;

void fadd(F48* rp, F48* ap, F48* bp)
{
    Workspace a, b, r;
    F48_expand(ap, &a);
    F48_expand(bp, &b);

    F48_precondition_add(&r, &a, &b);

    F48_round_format(rp, &r, roundingMode);
}

void fsub(F48* rp, F48* ap, F48* bp)
{
    Workspace a, b, r;
    F48_expand(ap, &a);
    F48_expand(bp, &b);

    // subtraction is just adding a negated value.
    b.sign ^= 0x80;
    F48_precondition_add(&r, &a, &b);

    F48_round_format(rp, &r, roundingMode);
}

void F48_precondition_add(Workspace* R, Workspace* A, Workspace* B)
{
    // sort operands by magnitude; this also means we only need to test one of them for NaN.
    if(A->exponent < B->exponent || (A->exponent == B->exponent && A->mantissa < B->mantissa))
        swap(*A, *B);

    if(A->exponent == 0x7FFF) {
        if(A->mantissa != 0) {    // NaN
            *R = *A;
            return;
        }
        if(B->exponent != 0x7FFF) {    // Infinity ± Finite = Infinity
            *R = *A;
            return;
        }
        if(A->sign == B->sign) {    // Infinity + Infinity = Infinity
            *R = *A;
            return;
        }
        // Only remaining case is Infinity - Infinity = NaN
        R->sign = 0;
        R->guard = 0;
        R->exponent = 0x7FFF;
        R->mantissa = 0xDEADBEEF;  // any non-zero value will do; we could encode the cause of the NaN generation here.
        return;    // without this, the Infinity - Infinity NaN would fall through and be clobbered by the equal-magnitude case below
    }

    if(B->exponent == -0x8000) {  // Anything ± Zero = Anything, including when Anything is itself a Zero
        *R = *A;
        return;
    }

    if(A->exponent == B->exponent && A->mantissa == B->mantissa) {    // Double or Nothing?
        if(A->sign == B->sign) {
            *R = *A;
            R->exponent++;
        } else {
            R->sign = (roundingMode == negativeInf ? 0x80 : 0);
            R->exponent = -0x8000;    // internal representation of zero (matching F48_expand), so F48_round_format emits a true zero
            R->mantissa = 0;
            R->guard = 0;
        }
        return;
    }

    if(A->sign == B->sign) {
        F48_internal_add(R, A, B);
    } else {
        F48_internal_subtract(R, A, B);
    }
    R->sign = A->sign;
}

