6502.org Forum
Author Message
PostPosted: Tue Jul 13, 2004 9:01 am 

Joined: Tue Sep 03, 2002 12:58 pm
Posts: 325
tedaz wrote:
Code:
      HIGH-ORDER MANTISSA BYTE
     01.XXXXXX  Positive mantissa.
     10.XXXXXX  Negative mantissa.
     00.XXXXXX  Unnormalized mantissa.
     11.XXXXXX  Exponent = -128.

I can't understand "00.XXXXXX  Unnormalized mantissa", and especially "11.XXXXXX  Exponent = -128".
What do they mean?


Disclaimer: I never did any Apple programming, and I haven't read all of the documentation that people have been referring to.

My interpretation: 00 and 11 are both "Unnormalized mantissa. Exponent = -128". 00 is for positive denormals (IEEE 754 notation feels more comfortable, if you'll excuse me) and 11 for negative. Using 00 or 11 with an exponent that isn't -128 is presumably considered an error.

Normalized floating point numbers are of the form +/- 1.m * 2^e. The leading 1 is always there, so IEEE doesn't store it (getting an extra bit of precision for free). When working with very small numbers, there's an annoying gap between the smallest representable number and zero. Denormals fill that gap. In IEEE, if the exponent bits are all zero, then the leading 1 changes to a 0: +/- 0.m * 2^-126 (for single precision). That extends the range of small numbers, at the expense of some precision. It's useful, as it preserves the identity x = y <=> x - y = 0 (which would fail if the difference between x and y fell into the gap).
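To see gradual underflow at work, here's a quick Python sketch using IEEE 754 doubles (where the denormal floor is 2^-1022 rather than -128 or -126, but the mechanism is the same):

```python
import sys

# Smallest positive *normalized* IEEE 754 double (2 ** -1022).
tiny = sys.float_info.min

x = 1.5 * tiny
y = tiny

# x != y, and their difference (0.5 * tiny) lies below the normalized
# floor.  Gradual underflow represents it as a denormal instead of
# flushing it to zero, preserving x == y <=> x - y == 0.
diff = x - y
assert x != y and diff != 0.0
assert 0.0 < diff < tiny      # diff is a denormal (subnormal) value
```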


PostPosted: Tue Jul 13, 2004 3:10 pm 

Joined: Sat Jan 04, 2003 10:03 pm
Posts: 1706
tedaz wrote:
I can't understand FLOATING POINT REPRESENTATION of APPLE II in"Apple II Reference Manual (Red Book), January 1978, pages 94-95."

The Red Book said:
HIGH-ORDER MANTISSA BYTE
01.XXXXXX Positive mantissa.
10.XXXXXX Negative mantissa.
00.XXXXXX Unnormalized mantissa.
11.XXXXXX Exponent = -128.
But I can't understand the cases where the high-order mantissa byte starts with 00 or 11. :(

I found that the FLOATING POINT REPRESENTATION of the APPLE II is different from IEEE, and I can't find a reference for it.

Can anybody help me?
Thanks.


What 00xxxxxx and 11xxxxxx mean is, simply, unnormalized numbers, as explained several times above. BUT, what you really want to know is: where do they come from, and how are they handled?

Frankly, they are NEVER stored in variables. EVER. They violate the rules of floating point numbers, just as 125.4x10^1 violates scientific notation. They can therefore occur ONLY WHILE performing a calculation -- in short, they are created only as intermediate results.

Let's use scientific notation for an example, because I'm too lazy to use binary. But the *exact* same rules apply, only with base-2 instead of base-10.

Let's look at what happens when we have to add two numbers:

Code:
1.23x10^2 + 4.56x10^3 = ?


As you learned in grade school, you can't just add the mantissas -- you first have to align the decimal points!

Code:
0.123x10^3 + 4.56x10^3 = ?


Note that 0.123x10^3 is unnormalized. We had to shift the number to the right and increase the exponent to compensate, so that we may arrive at a proper sum.

Code:
0.123x10^3 + 4.56x10^3 = 4.683x10^3


Generally speaking, addition and subtraction of numbers require de-normalization before actually working out the result. But multiplication and division require re-normalization after the result has been attained. Let's compute the fraction 100/2:

Code:
1.0x10^2 / 2.0x10^0 = 0.5x10^2


Notice that the resulting number is no longer in proper scientific notation -- that is, it is now denormalized. We must now shift the digits, and adjust the exponents accordingly, to produce a properly formed number:

Code:
1.0x10^2 / 2.0x10^0 = 5.0x10^1


Apply this basic principle to binary floating-point numbers (which are just scientific notation expressed in binary, with all the same basic rules), and you can see how a positive number might sometimes end up denormalized (top two bits 00), and how a negative number might as well (top two bits 11). Remember that an arithmetic shift right always preserves the sign bit!
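For illustration only, the align-then-add and normalize-after-divide steps above can be sketched in Python using (mantissa, exponent) pairs in base 10. The function names are mine, not from any of the routines discussed:

```python
def align_add(a, b):
    """Add two base-10 floats given as (mantissa, exponent) pairs.
    De-normalize the smaller-exponent operand so both share an
    exponent (the 0.123e3 + 4.56e3 step), then add the mantissas."""
    (ma, ea), (mb, eb) = a, b
    if ea < eb:
        ma, ea = ma / 10 ** (eb - ea), eb   # shift right, bump exponent
    elif eb < ea:
        mb, eb = mb / 10 ** (ea - eb), ea
    return ma + mb, ea

def normalize(m, e):
    """Shift until 1 <= |mantissa| < 10 (the 0.5e2 -> 5.0e1 step)."""
    if m == 0:
        return 0.0, 0
    while abs(m) >= 10:
        m, e = m / 10, e + 1
    while abs(m) < 1:
        m, e = m * 10, e - 1
    return m, e

m, e = align_add((1.23, 2), (4.56, 3))      # 1.23e2 + 4.56e3
assert e == 3 and abs(m - 4.683) < 1e-9
m, e = normalize(0.5, 2)                    # 0.5e2 -> 5.0e1
assert (m, e) == (5.0, 1)
```

The same logic applies in base 2, where "shift right" is a single arithmetic shift instruction.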

Personally, I prefer how Commodore stored its floating point numbers over Apple's. The problem with Apple's is that you lose 1 bit of precision when storing a positive value. It also keeps that silly '1' bit (which you always know is there) hanging around. Thus, in effect, you're losing 2 bits of precision, which can add up over many calculations, and is especially noticeable when using numbers like 1/3, 1/9, and irrational numbers.

But I digress.


PostPosted: Fri Jul 16, 2004 8:07 am 

Joined: Thu Mar 11, 2004 7:42 am
Posts: 362
John West wrote:
My interpretation: 00 and 11 are both "Unnormalized mantissa. Exponent = -128". 00 is for positive denormals (IEEE 754 notation feels more comfortable, if you'll excuse me) and 11 for negative. Using 00 or 11 with an exponent that isn't -128 is presumably considered an error.


It isn't an error, there's just a potential loss of precision. However, the major routines (FADD, FDIV, FMUL, FSUB, etc.) only return an unnormalized number when the exponent is -128.

kc5tja wrote:
Frankly, they are NEVER stored in variables. EVER. Because they violate the rules of floating point numbers, just as 125.4x10^1 violates scientific notation. So they can therefore ONLY occur WHILE performing a calculation. In short, they are created only as intermediate results.


One advantage of the Wozniak & Rankin representation is that it's perfectly fine to store unnormalized numbers in variables, since FP numbers in that format occupy the same amount of memory whether the number is in a floating point "accumulator" or stored elsewhere in memory. In the more common representation (used by EhBASIC), on the other hand, FP numbers are stored in unpacked format in a floating point "accumulator" and in packed format in a variable, so unnormalized numbers cannot be stored in a variable. More on this below.

kc5tja wrote:
Personally, I prefer how Commodore stored its floating point numbers over Apple's. The problem with Apple's is you lost 1 bit of precision when storing a positive value. It also keeps that silly '1' bit (which you always know is there) hanging around. Thus, in effect, you're losing 2 bits of precision, which can add up over many calculations, and is especially noticable when using numbers like 1/3, 1/9, and irrational numbers.


I'm going to call the two representations "signed mantissa" (Wozniak & Rankin routines) and "positive mantissa" (EhBASIC, Applesoft). Since well over 99% of the FP calculations done on Apples used the Applesoft FP routines (the routines in the wozfp3.txt file were rarely used), calling the "signed mantissa" representation the "Apple" representation may be confusing.

Anyway, for a given number of mantissa bits, "signed mantissa" has only 1 less bit of precision than "positive mantissa", not 2 less. The Wozniak & Rankin routines use a 24-bit mantissa which can represent every integer from -8388608 to 8388607. EhBASIC also uses a 24-bit mantissa and can represent every integer from -16777215 to 16777215 except zero (which is represented by a special exponent value, not by a mantissa value), so that is only 1 less bit of precision for both positive and negative values. Either representation can be extended to as much precision as you wish, though.
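A one-liner sanity check of those ranges, in Python:

```python
# A 24-bit two's-complement mantissa ("signed mantissa") vs. a 24-bit
# unsigned mantissa with a separate sign bit ("positive mantissa").
signed_lo, signed_hi = -2**23, 2**23 - 1   # -8388608 .. 8388607
pos_hi = 2**24 - 1                         # magnitudes up to 16777215

assert (signed_lo, signed_hi) == (-8388608, 8388607)
assert pos_hi == 16777215
assert pos_hi == 2 * signed_hi + 1         # one extra bit, not two
```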

Here is a more detailed comparison of the two representations:

24-bit "signed mantissa" representation: (Wozniak & Rankin routines)

Code:
   X1      M1+0     M1+1     M1+2
EEEEEEEE SNMMMMMM MMMMMMMM MMMMMMMM


24-bit "positive mantissa" representation: (EhBASIC)

Unpacked format: in FAC (the floating point accumulator)

Code:
 FAC1_e   FAC1_1   FAC1_2   FAC1_3   FAC1_s
EEEEEEEE NMMMMMMM MMMMMMMM MMMMMMMM SXXXXXXX


Packed format: in a variable (memory)

Code:
 addr+0   addr+1   addr+2   addr+3
EEEEEEEE SMMMMMMM MMMMMMMM MMMMMMMM


The bits are:

E = Exponent
M = Mantissa
N = Mantissa (also indicates whether the mantissa is normalized)
S = Sign
X = Don't care

In the "signed mantissa" representation, the mantissa is normalized when the "normalized" (N) bit is not equal to the sign bit (S). In the "positive mantissa" representation, the mantissa is normalized when the "normalized" bit is 1. When packing the number (i.e. converting from unpacked to packed format), the "normalized" bit is simply overwritten by the sign bit, which is why (a) an unnormalized number cannot be stored in a variable (b) there is one more bit of precision than "signed mantissa" representation.

I should point out that EhBASIC (and Applesoft, which uses a 32-bit mantissa but is otherwise the same) also has a rounding byte, FAC1_r, which is used as an extra byte of precision during calculations and is ultimately rounded into FAC1. However, the absence or presence of the rounding byte depends on the floating point implementation, not on the representation used for FP numbers. An implementation of either representation could use a rounding byte, or could omit it.

The 24-bit "positive mantissa" representation:

unpacked format:

FAC1_e: an 8-bit unsigned integer = 0 to 255
U: the mantissa, a 24-bit unsigned integer = 0 to 16777215
FAC1_s: the sign of the mantissa
FAC1_1: the high byte of U
FAC1_2: the middle byte of U
FAC1_3: the low byte of U

the number represented in unpacked format:

when FAC1_e = 0: 0
when FAC1_e is non-zero and bit 7 of FAC1_s = 0: U * 2 ^ (FAC1_e - 152)
when FAC1_e is non-zero and bit 7 of FAC1_s = 1: -U * 2 ^ (FAC1_e - 152)

Range of negative values: -2 ^ 127 + 2 ^ 103 to -2 ^ -151
Range of positive values: 2 ^ -151 to 2 ^ 127 - 2 ^ 103

packed format:

FAC1_e: an 8-bit unsigned integer = 0 to 255
P: a 23-bit unsigned integer = 0 to 8388607
FAC1_1: bit 7 is the sign; bits 6 to 0 are bits 22 to 16 of P
FAC1_2: bits 15 to 8 of P
FAC1_3: bits 7 to 0 of P

The number represented in packed format:

when FAC1_e = 0: 0
when FAC1_e is non-zero and bit 7 of FAC1_1 = 0: (2 ^ 23 + P) * 2 ^ (FAC1_e - 152)
when FAC1_e is non-zero and bit 7 of FAC1_1 = 1: -(2 ^ 23 + P) * 2 ^ (FAC1_e - 152)

Range of negative values: -2 ^ 127 + 2 ^ 103 to -2 ^ -128
Range of positive values: 2 ^ -128 to 2 ^ 127 - 2 ^ 103
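A small Python decoder for the packed format, following the formulas above (the function name is mine; the example byte values follow directly from those formulas):

```python
def decode_packed(b):
    """Decode a 4-byte packed 'positive mantissa' number:
    0 when the exponent byte is 0, otherwise
    +/- (2^23 + P) * 2^(FAC1_e - 152), sign taken from bit 7 of b[1]."""
    e, b1, b2, b3 = b
    if e == 0:
        return 0.0
    p = ((b1 & 0x7F) << 16) | (b2 << 8) | b3        # the 23-bit P
    mag = (2**23 + p) * 2.0 ** (e - 152)
    return -mag if b1 & 0x80 else mag

assert decode_packed([0x00, 0x00, 0x00, 0x00]) == 0.0
assert decode_packed([0x81, 0x00, 0x00, 0x00]) == 1.0
assert decode_packed([0x81, 0x80, 0x00, 0x00]) == -1.0
assert decode_packed([0x84, 0x20, 0x00, 0x00]) == 10.0
```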

Advantages of this representation:

1. There is 1 more bit of (mantissa) precision as compared to a 24-bit "signed mantissa" representation. This may sound minor, but round-off errors and other such issues can surface rapidly, and proper use of the extra bit can help alleviate some of the difficulties.

2. It's more common, so it's more widely understood. Unfortunately, there are a lot of subtleties and caveats when it comes to floating point, many of which the average programmer is unaware of. There is far too little documentation of floating point as it is, and a representation that has more information available can be a great benefit.

The 24-bit "signed mantissa" representation:

X1: the 8-bit unsigned exponent at X1 = 0 to 255
M1: the 24-bit twos complement mantissa at M1 = -8388608 to 8388607
M1+0: the high byte of M1
M1+1: the middle byte of M1
M1+2: the low byte of M1

The number represented by X1 and M1 is: M1 * 2 ^ (X1 - 150)

Range of negative values: -2 ^ 128 to -2 ^ -150
Range of positive values: 2 ^ -150 to 2 ^ 128 - 2 ^ 105
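And the matching Python decoder for the "signed mantissa" format (again, the function name is mine; the byte values follow from the formula M1 * 2 ^ (X1 - 150) and the diagram above, which puts the high byte of M1 at M1+0):

```python
def decode_signed(b):
    """Decode (X1, M1+0, M1+1, M1+2) as M1 * 2^(X1 - 150), where M1 is
    a 24-bit two's-complement mantissa stored high byte first."""
    x1, m0, m1, m2 = b
    m = (m0 << 16) | (m1 << 8) | m2
    if m & 0x800000:                  # sign-extend the 24-bit value
        m -= 1 << 24
    return m * 2.0 ** (x1 - 150)

# 1.0 is exponent $80 with mantissa $40 00 00 (top bits 01 -> positive,
# normalized): 2^22 * 2^(128 - 150) = 2^0.
assert decode_signed([0x80, 0x40, 0x00, 0x00]) == 1.0
# -1.0 is exponent $7F with mantissa $80 00 00 (top bits 10 -> negative,
# normalized): -2^23 * 2^(127 - 150).
assert decode_signed([0x7F, 0x80, 0x00, 0x00]) == -1.0
```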

Advantages of this representation:

1. Unnormalized numbers can be stored in variables, so numbers closer to zero (such as 2 ^ -150) are not limited to intermediate results.

2. There isn't a packed or unpacked format, so routines to convert between the two aren't needed. In addition to requiring less space, there is also a slight (a very slight) speed increase when moving numbers from a variable to the floating point "accumulator" and vice versa.

3. Calculations can be performed faster than in the "positive mantissa" representation. Neither the Wozniak & Rankin routines nor EhBASIC is really optimized for speed, so this may not be easy to see by comparing those two implementations. For example, both perform subtraction by negating one argument and then adding; a dedicated subtraction routine would be faster, of course.

Anyway, for addition and subtraction, both representations must shift one mantissa (to "align the decimal points") and finish by normalizing the result. In between, "signed mantissa" needs only to add the mantissas, whereas "positive mantissa" must check whether the two mantissas have the same sign, adding them if the signs match and subtracting if they differ. In the latter case, if the subtraction produces a negative mantissa, it must be negated so that it is positive. In BASIC, addition is more common than it may seem, because the NEXT of a FOR...NEXT loop performs a floating point addition to update the loop variable, so it makes sense to optimize for addition.

It might seem like "signed mantissa" multiplication and division routines would be slower than their "positive mantissa" counterparts. In many implementations this will be true, but it does not have to be so. Probably well over 99% of signed multiplication routines perform the multiplication using an unsigned multiplication routine to calculate product = abs(mantissa1) * abs(mantissa2), negating that product if the signs of the two mantissas differ. However, it is possible to write a signed multiplication routine that needs neither the absolute values nor a final negation of the product. When all is said and done, that type of "signed mantissa" multiplication routine and a "positive mantissa" multiplication routine will run at about the same speed. The same is true of division.
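One standard way to multiply without absolute values or a final negation is to do the unsigned multiply on the raw two's-complement bit patterns and then subtract a correction term for each negative operand. A Python sketch of that technique (the routine the poster has in mind may well differ):

```python
def signed_mul(a, b, bits=24):
    """Two's-complement multiply with no abs() and no final negation:
    multiply the unsigned bit patterns, then subtract (operand << bits)
    for each negative operand, all modulo 2^(2*bits)."""
    mask = (1 << bits) - 1
    ua, ub = a & mask, b & mask        # raw two's-complement patterns
    prod = ua * ub
    if a < 0:
        prod -= ub << bits             # correction for negative a
    if b < 0:
        prod -= ua << bits             # correction for negative b
    prod &= (1 << (2 * bits)) - 1
    if prod & (1 << (2 * bits - 1)):   # interpret result as signed
        prod -= 1 << (2 * bits)
    return prod

assert signed_mul(-3, 5) == -15
assert signed_mul(-3, -5) == 15
assert signed_mul(7, 6) == 42
```

On a 6502 the corrections amount to conditional subtractions of one operand from the product's high bytes, which is cheap compared with negating 24-bit values twice.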

Also, in "positive mantissa", zero is represented by a special exponent value, not by a mantissa value, so you must specifically check the exponent for zero which takes a few extra cycles when none of function arguments are zero, which is the most common case.

The bottom line is that the best representation to use isn't clear cut. But that's often the case.

