Lab 4: Floating point

Lab sessions Mon Feb 02 to Thu Feb 05

Lab written by Julie Zelenski

Learning goals

In this lab, you will:

  1. experiment with float/double types and observe their operational behavior
  2. explore the rounding inherent in floating point representation/arithmetic
  3. dissect the internals of the floating point format
  4. get more practice using bitwise operators

Find an open computer and somebody new to sit with. Introduce yourself and share whether you are a CISC or RISC processor when using your favorite text editor.

Lab exercises

  1. Get started. Clone the starter repo using Mercurial

    hg clone /afs/ir/class/cs107/repos/lab4/shared lab4
    

    This creates the lab4 directory which contains source files and a Makefile. Pull up the online lab checkoff and have it open in a browser so you'll be able to jot down things as you go. At the end of the lab period, you will submit that sheet and have the TA check off your work.

  2. Simple behavior of floating point types. First, compile and run the floats program which prints the size and range of various floating point types on this architecture. Use this program and either add code or experiment in gdb to answer the following questions:

    • What are the values of FLT_MAX? FLT_MIN? FLT_DIG? What are those values for type double?
    • Add code to the program to cause overflow and divide by zero. What is the behavior of these errors for floating point types? (A starter sketch follows this list.)
    • What are the exceptional float values? What calculations can produce an exceptional result? What happens when you add/multiply/compare an ordinary float to an exceptional one?
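
    If you want a starting point for the overflow, divide-by-zero, and exceptional-value experiments above, here is a minimal standalone sketch (not part of the floats program; the variable names are just placeholders):

    #include <stdio.h>
    #include <float.h>

    int main(void) {
        float zero = 0.0f;               // a variable sidesteps compile-time divide-by-zero warnings
        float too_big = FLT_MAX * 2.0f;  // overflow quietly produces +inf, no crash or trap
        float pos_inf = 1.0f / zero;     // divide by zero yields +inf
        float not_a_num = zero / zero;   // 0/0 yields NaN ("not a number")

        printf("overflow = %f   1/0 = %f   0/0 = %f\n", too_big, pos_inf, not_a_num);
        printf("inf + 1.0 = %f\n", pos_inf + 1.0f);                      // arithmetic with inf stays inf
        printf("nan == nan evaluates to %d\n", not_a_num == not_a_num);  // NaN compares unequal to everything
        return 0;
    }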

  3. Float bit patterns. Recall that an IEEE 32-bit float is stored as:

    NEEE EEEE ESSS SSSS SSSS SSSS SSSS SSSS
    

    where N is the sign bit (1 if negative), E is the 8-bit exponent (with a bias of 127), and S is the 23-bit significand, with an implicit leading "1." for normalized values. The sign bit occupies the most significant bit position; the final bit of the significand occupies the least. Here are some exercises to work through to visualize float bit patterns. Write code to test your answers and/or experiment in gdb (a small bit-printing helper is sketched after the bullet list below). There is also a neat online viewer you can use to examine float bit patterns.

    • What is the bit pattern for FLT_MAX? FLT_MIN (smallest normalized value)? the smallest (non-zero) denormalized value? the largest denorm? the value 1.0? the value 1.75? how does +0.0 differ from -0.0?
    • How does the bit pattern for a float change when negated? What is the bitwise expression to make that change?
    • How does a normalized float's bit pattern change when multiplied by 2? How does a denormalized float's bit pattern change when multiplied by 2?
    • Consider the normalized float values that represent exact powers of 2. What do all their bit patterns have in common?
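
    One way to check your answers in code, sketched below, is to copy a float's raw bits into an unsigned int (the same memcpy trick used in exercise 8) and print the fields in binary. The helper name and layout are just a suggestion, not part of the starter code:

    #include <stdio.h>
    #include <string.h>

    // Print the sign, exponent, and significand fields of a float (sketch).
    static void show_float_bits(float f) {
        unsigned int bits;
        memcpy(&bits, &f, sizeof(f));            // copy raw bits, no conversion
        printf("%g = ", f);
        for (int i = 31; i >= 0; i--) {
            printf("%d", (bits >> i) & 1);
            if (i == 31 || i == 23) printf(" "); // separate sign | exponent | significand
        }
        printf("\n");
    }

    int main(void) {
        show_float_bits(1.0f);
        show_float_bits(1.75f);
        show_float_bits(-0.0f);
        return 0;
    }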

  4. Limits of a finite representation (nearest representable floats). As a finite representation trying to model values drawn from an infinite range, floating point is inherently a compromise. Although some values are exactly representable, many others will have to be approximated. Let's drill down a bit to understand what precision you can expect. In this particular exercise, we refer to the 8-bit mini-float (1 sign bit, 4 exponent bits, 3 significand bits) described in lecture/textbook. Although there is no such C type, it can be helpful to first work through how the limits apply to the minifloat on paper before trying to generalize to a larger bitwidth such as the 32-bit float. We will see that it is the number of bits dedicated to the significand that dictates the precision of what can be stored/computed.

    • Rounding during assignment from a constant. Many constants cannot be assigned to a float without loss of precision. Floating point representation is based on binary decimal. What if the given constant does not terminate when expressed as a binary decimal? Consider the constant 0.4. This is 4/10, or, in binary, 100/1010. Apply division to that binary fraction and you'll get the repeating binary decimal 0.011001100110..., in which the pattern 0110 repeats forever. There is no terminating sequence of powers-of-two that sums exactly to 0.4, so no hope of exactly representing this value! Let's follow what happens during these assignments:

      minifloat mf = 0.4;     // value stored in mf will be nearest representable minifloat 
      float f = 0.4;          // value stored in f will be nearest representable float
      

      Work on paper for the minifloat. The constant 0.4 is expressed in binary decimal (0.0110011001100...) and normalized to 1.100110011... x 2^-2. We can only record 4 bits of that infinite significand (3 explicit bits + hidden leading 1). The excess bits are not simply truncated; the default approach is round-to-nearest, which adds 1 to the least significant bit if the truncated bits would have contributed more than one-half. This significand is rounded to 1.101. The minifloat mf holds the bit pattern 00101101. This value is not 0.4 but the representable minifloat nearest to 0.4. What value is it? How close is it to what we wanted?

      Now consider assigning the constant 0.4 to a float. The 23-bit significand allows retaining more of the repeated expansion than the minifloat, but the excess bits still must be rounded. Run the floats program under gdb and step through the test_nearest function. After assigning f = 0.4, use x/1wt &f to look at the stored bit pattern. Identify the sign bit, the 8-bit exponent, and the 23-bit significand. Can you see the repeating binary decimal in the bits? Now step through the printf. What value is the nearest representable float to 0.4? How close is it to what we wanted? Although the larger bitwidth of the float versus minifloat (or double versus float) tightens the resolution, rounding is inevitable for any non-terminating binary decimal.
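
      If you would also like to see the rounding numerically, a quick standalone sketch like this (independent of the floats program) prints the stored value with enough digits to expose its distance from 0.4:

      #include <stdio.h>

      int main(void) {
          float f = 0.4;                 // rounded to the nearest representable float
          double d = 0.4;                // also rounded, but to a much nearer double
          printf("float  0.4 stored as %.20f\n", f);
          printf("double 0.4 stored as %.20f\n", d);
          printf("difference of float from double: %g\n", f - d);
          return 0;
      }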

      Some constants are rounded because the binary decimal expansion (though terminating) uses more than the maximum number of significand bits available in the data type. Consider these assignments:

      minifloat mf = 9.25;     // value stored in mf will be nearest representable minifloat 
      float f = 9.25;          // value stored in f will be exact
      

      Trace the minifloat assignment on paper. 9.25 as a normalized binary decimal is 1.00101 x 2^3. The constant has a significand of 6 bits and a minifloat can only store 4 bits, so the excess 2 bits must be rounded. The minifloat mf holds the bit pattern 01010001. This value is not exactly 9.25 but the representable minifloat nearest to it. What value is it?

      The constant 9.25 can be exactly represented as a float, as the 6 bits fit with room to spare in a 23-bit significand. Step through this float assignment in gdb and examine the float bits to identify the sign, exponent, and significand bits. Execute the printf and see that the exact value is printed. Other constants such as 24.25 or 1000.25 can also be represented with no loss of precision, but this is not true for every constant N.25. Find a value of N such that N.25 is rounded when assigned to a float. What is the smallest such N?
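
      To check a candidate N without grinding through the binary expansion by hand, a small sketch along these lines compares the float against a double holding the same constant (the names are placeholders; the double holds N.25 exactly for any reasonable N, so it serves as the reference):

      #include <stdio.h>

      int main(void) {
          double candidate = 1000.25;    // plug in your candidate N.25 here
          float f = candidate;           // conversion rounds to the nearest float
          if ((double)f == candidate) {
              printf("%.2f is exactly representable as a float\n", candidate);
          } else {
              printf("%.2f is rounded; stored as %.10f\n", candidate, f);
          }
          return 0;
      }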

    • Rounding during addition (and subtraction). Even if we start with exactly representable values, floating point operations on those values may introduce rounding into the result. To add/subtract two floating point values, the decimal points must first be aligned: the operand with the smaller exponent is adjusted to match the larger. The significands are then combined and the result is normalized if necessary.

      minifloat mfa = 9, mfb = 1.25;  // both constants exactly representable as minifloat
      minifloat sum = mfa + mfb;      // sum is rounded to nearest representable minifloat
      float fa = 10000000, fb = 1.25; // both constants exactly representable as float
      float fsum = fa + fb;           // fsum is rounded to nearest representable float
      

      Work through the minifloat addition on paper. The normalized binary decimal for 9 is 1.001 x 2^3; for 1.25 it is 1.01 x 2^0. Both of these are exactly representable as minifloat. To add, we temporarily denormalize the smaller-magnitude value to align its decimal point with the larger, raising the exponent and counteracting that by right-shifting the significand bits the same number of positions. 1.25 becomes 0.00101 x 2^3. This is a 6-bit significand with room for only 4, so the excess bits are rounded off. The larger operand is then added to this rounded operand. Trace the addition. The minifloat sum stores the bit pattern 01010010. What value is this? It is the nearest representable minifloat to the true sum. How far off is it from the exact sum?

      Now, trace the float addition by stepping through in gdb and observe the rounded result. How far is it off from what we expected?

      Rounding during addition is responsible for the oddity of floating point values that appear to be their own successor. Work from your understanding of addition/rounding to find a positive float x for which x + 1.0 == x. First, come up with such a number by working with your partner on paper, then write code or evaluate expressions in gdb to verify your answer (a tiny verification sketch follows). What is the smallest such positive float x for which x + 1.0 == x?
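
      For the verification step, a few lines like these are enough (a sketch; substitute your candidate for the placeholder value of x):

      #include <stdio.h>

      int main(void) {
          float x = 123456.0f;           // replace with your candidate
          printf("x       = %f\n", x);
          printf("x + 1.0 = %f\n", x + 1.0f);
          printf("x + 1.0 == x is %d\n", (x + 1.0f) == x);
          return 0;
      }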

    • Rounding during multiplication (and division). Floating point multiplication works by multiplying the significands, adding the exponents, and normalizing the result. If multiplying two N-bit significands results in a product of more than N bits, rounding will be needed.

      minifloat mfc = 7, mfd = 5;     // both constants exactly representable as minifloat
      minifloat product = mfc * mfd;  // product is rounded to nearest representable minifloat
      float fc = 8388607, fd = 5;     // both constants exactly representable as float
      float fproduct = fc * fd;       // fproduct is rounded to nearest representable float
      

      Work through the minifloat multiplication on paper. The normalized binary decimal for 7 is 1.11 x 2^2; for 5 it is 1.01 x 2^2. Both of these are exactly representable as minifloat. Multiplying the significands produces 10.0011, and adding the exponents gives 2^4. Normalizing produces 1.00011 x 2^5. The resulting 6-bit significand has to be rounded to 4 bits. The minifloat product has the bit pattern 01100001. What value is this? How far off is it from the exact product?

      Now trace the above float multiplication in gdb and observe the rounded result. The choice of 7 for minifloat and 8388607 for float to be multiplied by 5 is not random -- what is the pattern to how these values relate to the number of significand bits?
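
      If you prefer to see the float result outside of gdb, this standalone sketch prints the rounded float product next to the exact product computed in a double, which has more than enough significand bits to hold it:

      #include <stdio.h>

      int main(void) {
          float fc = 8388607, fd = 5;      // both exactly representable as float
          float fproduct = fc * fd;        // product is rounded to the nearest float
          double exact = (double)fc * fd;  // double has plenty of bits for the exact product
          printf("float product = %.1f\n", fproduct);
          printf("exact product = %.1f\n", exact);
          printf("difference    = %.1f\n", exact - fproduct);
          return 0;
      }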

  5. Floating point equality. Execute our test_equality function to observe some surprising results about floating point equality. As you saw in the previous exercise, floating point values are subject to rounding on assignment and during arithmetic. Differences in when/where you start and the order/choice of operations can change how rounding affects the result. Floating point values computed in equivalent yet different steps are not guaranteed to get equal results. Naive code (e.g. the loop using a double counter in this function) can easily run afoul of this. As a rule, comparing floats using == is unreliable (heed the warnings you receive from the compiler!) and its use is rarely appropriate. Read the comments for the approx_equal function and follow its suggestions to implement a more appropriate comparison.
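
    As a point of reference, a tolerance-based comparison often looks something like the sketch below. This is only an illustration with made-up names; the comments in the starter's approx_equal may prescribe a different strategy (such as a relative rather than absolute tolerance):

    #include <stdbool.h>
    #include <stdio.h>
    #include <math.h>

    // Sketch only: compare within a tolerance instead of using exact ==.
    bool roughly_equal(double a, double b, double tolerance) {
        return fabs(a - b) <= tolerance;
    }

    int main(void) {
        double tenth_summed = 0;
        for (int i = 0; i < 10; i++) tenth_summed += 0.1;   // rarely exactly 1.0
        printf("== gives %d, roughly_equal gives %d\n",
               tenth_summed == 1.0, roughly_equal(tenth_summed, 1.0, 1e-9));
        return 0;
    }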

  6. Conversion between integer and float. Assigning from int to float or float to int performs a type conversion. The bits from one numeric representation are rearranged to convert the value into a different representation. The transformation between numeric types is implicit on assignment; no explicit typecast is required. Given a 4-byte type (such as int or float), it can hold one of ~4 billion (2^32) distinct bit patterns. This makes for ~4 billion distinct representable ints and ~4 billion representable floats, but note that these are not going to be the same set of 4 billion values. The set of representable integers includes every whole number from INT_MIN to INT_MAX. The set of representable floats spans a much larger magnitude from -FLT_MAX to +FLT_MAX but with gaps between neighboring values. When converting from integer to float or vice versa, only those values that are representable in both camps are value-preserving during conversion. As an example, the value 2 can be exactly represented both as an int and as a float, thus converting 2.0 => 2 or 2 => 2.0 is value-preserving (although the int and float bit patterns for 2/2.0 are not the same). Now let's look into the cases that are not value-preserving during conversion:

    • First consider converting from float => int. Any float with a fractional component is truncated to its integer value, discarding the fraction. You are familiar with this conversion, and it is unsurprising that it is not value-preserving for fractional numbers. However, there are whole-valued floats with no fractional component (e.g. F.0) that do not have a corresponding integer representation and thus do not retain their value across conversion. What are those float values? What happens when you try to assign one of those floats to an int?
    • Now consider converting int => float. You might assume that any integer x has an obvious float equivalent x.0 that is value-preserving. But given that we just showed some of the 4 billion floats don't have a match in the 4 billion ints, the pigeonhole principle implies that some of those ints are guaranteed not to have a match in the floats. What are those ints that don't have a corresponding float representation? What happens when you try to assign one of those ints to a float? (A small sketch for experimenting follows this list.)
    • Consider converting int=>double. A double is basically a bigger version of float that apportions 11 bits for the exponent and 52 for the significand. There are no integer values that are not value-preserving when converted to a double. Why not?
    • Conversion considered harmful. See this bug report of a recent discovery that certain values send Java and PHP into an infinite loop when trying to convert from long to double. This stuff is tricky even for the experts!
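
    Relating to the first two bullets above, a short standalone sketch like this one lets you poke at both directions of the conversion (the particular constants are just placeholders to experiment with):

    #include <stdio.h>

    int main(void) {
        // float => int: a large float with no int counterpart
        float big = 3e9f;                  // within float range, but beyond INT_MAX
        int from_float = big;              // what do you observe in the converted value?
        printf("float %f => int %d\n", big, from_float);

        // int => float: pick an int and check whether the round trip preserves it
        int candidate = 2000000001;        // replace with ints you want to test
        float as_float = candidate;
        printf("int %d => float %.1f (round trip back: %d)\n",
               candidate, as_float, (int)as_float);
        return 0;
    }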

  7. Floating point arithmetic. An important lesson that comes from studying floats is learning why/how to arrange calculations to minimize rounding and loss of precision. Read over the code in the arith.c program. Execute it and observe the result of the sum_doubles function to see another surprising outcome. A sequence of doubles is summed forward and again backwards. The same values are included in the sum, but the order of operations is different. The result is not the same, and not just epsilon different either! Has the law of associativity been repealed? Explain what is going on.

    The relative error for floating point multiplication and division is small, but addition and subtraction have some gotchas to watch out for. For example, neither addition nor subtraction can maintain precision when applied to operands of very different magnitude, as shown by the example above where the lower-significance digits are rounded off. Two operands very close in magnitude can also cause a problem for subtraction (or for addition of values with opposite sign). If two floats have the same exponent and agree in the leading 20 bits of the significand, then only a few bits of precision remain after subtracting the two, and those surviving bits were the least significant ones to begin with (double bummer!). This event is called catastrophic cancellation (the catastrophe is so-called because you lose a large number of significant digits all at once, unlike roundoff error, which tends to accumulate more gradually). A subtraction of nearly equal values ideally should be reformulated to avoid this cancellation.

    Consider the task of calculating the roots of a quadratic equation ax^2 + bx + c = 0. The classic formula to find the roots is (-b ± sqrt(b^2 - 4ac)) / 2a. This computation works out for many values, but when b is large relative to a and c, the sqrt of the discriminant is nearly equal to b. For one of the roots, you subtract those two nearly equal terms in the numerator, possibly losing much precision. In extreme cases, the result can go to zero, which is clearly wrong! Thankfully, some simple algebraic rearrangement can avoid this fate and turn the dreaded subtraction into an addition. Take a look at the code we give for the two find_roots functions. The first uses the classic formulation for both roots. The second function changes things up on the second root to avoid the subtraction. Read the comments and work through the algebra of the revised version. Run the program and play with the inputs to observe the results of the two versions. If a and c are both 1, roughly how big does b need to be before you see a difference in results between the two? How big does b need to be before you get an erroneous zero root from the classic formula?
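
    To make the algebra concrete, here is one common rearrangement, shown as a standalone sketch (variable names are placeholders, and it is not necessarily identical to the starter's second find_roots): compute the root that does not cancel, then derive the other from the identity that the product of the two roots equals c/a, which is equivalent to multiplying the troublesome numerator by its conjugate:

    #include <stdio.h>
    #include <math.h>

    int main(void) {
        double a = 1, b = 1e8, c = 1;                    // b large relative to a and c
        double sqrt_disc = sqrt(b * b - 4 * a * c);

        // Classic formula: one root subtracts two nearly equal quantities.
        double classic_r1 = (-b + sqrt_disc) / (2 * a);  // catastrophic cancellation here
        double classic_r2 = (-b - sqrt_disc) / (2 * a);

        // Rearranged: compute the "safe" root, then use root1 * root2 == c/a.
        double safe_r2 = (-b - sqrt_disc) / (2 * a);
        double safe_r1 = c / (a * safe_r2);

        printf("classic roots: %g %g\n", classic_r1, classic_r2);
        printf("revised roots: %g %g\n", safe_r1, safe_r2);
        return 0;
    }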

  8. Float bit manipulation. For additional practice with bitwise manipulation, implement the function quadruple which is given a float value and returns that value multiplied by 4. The function should operate without using any multiplication, instead directly modifying the exponent bits of the float. The C bitwise operators cannot be directly applied to a float variable; these operators are only legal for int-family types. In order to manipulate the raw float bits, you must first copy those bits into an unsigned int; on that type you can play bitwise games. You don't want to assign the float to an int, as that will do a type conversion; you want the raw bits to be copied unchanged. Our code shows two different ways to accomplish this:

    bits = *(unsigned int *)&f;     // (1) trick compiler to copy bits using cast    OR
    memcpy(&bits, &f, sizeof(f));   // (2) use raw memcpy to copy bits
    

    Once you have the raw bits, use a bitmask to extract the exponent bits, increment the exponent by 2, merge the modified exponent bits back in, and convert back to float.
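
    For reference, one possible shape of that sequence is sketched below; it handles only the ordinary normalized case and is not the official solution:

    #include <stdio.h>
    #include <string.h>

    // Sketch of quadruple: ignores edge cases such as zero, denormalized values,
    // infinity/NaN, and exponents near the top of the range.
    float quadruple(float f) {
        unsigned int bits;
        memcpy(&bits, &f, sizeof(f));                       // copy raw bits, no conversion
        unsigned int exponent = (bits >> 23) & 0xff;        // extract the 8 exponent bits
        exponent += 2;                                      // adding 2 to the exponent multiplies by 4
        bits = (bits & ~(0xffu << 23)) | (exponent << 23);  // merge modified exponent back in
        memcpy(&f, &bits, sizeof(f));                       // copy bits back into a float
        return f;
    }

    int main(void) {
        printf("%f %f\n", quadruple(1.5f), quadruple(-3.25f));
        return 0;
    }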


Check off with TA

Before you leave, be sure to submit your checkoff sheet and have the lab TA come by and approve it so you will be properly credited for the lab. If you don't finish all the exercises in the lab period, we encourage you to work through any remainder on your own.

xkcd float comic