Friday, February 3, 2012

Avoid Floating Point Counters

Every once in a while I encounter code that uses a float to track time, keep count of events, or otherwise add small increments to a running sum. This can lead to some nasty bugs that don't show up during system test.

Consider the following C code:

#include
#define SKIP 0x00FFFFF8

int main(void)
{
  union  { float fv;  int iv; } count;
  long i;

  for (i = 0; i < SKIP; i++)
  { count.fv += 1;
  }

  for (i = 0; i < 16; i++)
  { count.fv += 1;
    printf(" +1 = %8.0f   0x%08X\n", count.fv, count.iv);
  }
 return;
}

At first blush, it seems this counter will be able to count up to a really high number, because even a 32-bit float can handle very large numeric values. But, the problem is that a float only has a 24-bit mantissa value. So if you have a total count that requires more than a 24-bit integer to represent, the increment will stop working due to roundoff error.  In other words, each incremental +1 gets rounded off to be equivalent to +0, and the counter stops counting.

Here's the output of that program compiled with gcc v3.4.4 (header line added):
       Count     Hex Representation
 +1 = 16777209   0x4B7FFFF9
 +1 = 16777210   0x4B7FFFFA
 +1 = 16777211   0x4B7FFFFB
 +1 = 16777212   0x4B7FFFFC
 +1 = 16777213   0x4B7FFFFD
 +1 = 16777214   0x4B7FFFFE
 +1 = 16777215   0x4B7FFFFF
 +1 = 16777216   0x4B800000
 +1 = 16777217   0x4B800000
 +1 = 16777217   0x4B800000
 +1 = 16777217   0x4B800000
 +1 = 16777217   0x4B800000
 +1 = 16777217   0x4B800000
 +1 = 16777217   0x4B800000
 +1 = 16777217   0x4B800000
 +1 = 16777217   0x4B800000

As you can see, once the count hits a 24-bit value (16M), the counter stops counting.

This sort of thing can be hard to find in system test, because you may not run the test long enough to catch the problem. For an example of a similar sort of bug that contributed to a tragedy, read about the Patriot Missile bug at http://www.fas.org/spp/starwars/gao/im92026.htm

Here it is in a nutshell: Never Use A Float For A Running Sum
More generally, avoid floats in embedded systems if possible.
You can often get away with a double, but most times you are better off just using an int or long int.

Detailed notes:
IEEE single precision floating point has a 23 bit mantissa with an implicit leading 1 bit. So I say it is a 24 bit value even though only 23 of those bits show up in the actual binary representation.

The fact that the printed floating point value increments one more time but the hex value doesn't change is interesting. Likely it involves the compiled code temporarily storing the increment value in a double precision (or larger) floating point holding register and using that longer value for the printf. Note that the hex value and thus the 32-bit stored value does stop incrementing at the predicted point.

4 comments:

  1. Reading from count.iv after you have written to count.fv is undefined behavior in C99 (See ANSI C 6.2.6.1.6). Hence, your Hex Representation column could show anything.

    If you use an (unsigned) int instead of float, you have to check for an overflow. How is that different to check for an increment?

    Btw have a look at the nexttowardf function in math.h ;)

    ReplyDelete
  2. In this particular case, the hex representation is showing the equivalent data in the floating point number. This sort of trick is always hazardous for production code, and I used it here just for illustrative purposes.

    Yes, you'd have to worry about an overflow if you used an int, but it wouldn't happen until 4G counts if you used an unsigned 32-bit int. The real point here is that many programmers don't realize that a float will for practical purposes overflow after a count of only 16M. The sole purpose of this code is to illustrate the problem.

    ReplyDelete
  3. @qznc Yes, of course, you are right that an integer also has a "count limit"

    But the problem here is not the limit as such, but the increment: the programmer has assumed that the float can go up to really huge values, so he can just keep incrementing...

    ReplyDelete
  4. Skynet used floating point counters on T2.

    ReplyDelete

Please send me your comments. I read all of them, and I appreciate them. To control spam I manually approve comments before they show up. It might take a while to respond. I appreciate generic "I like this post" comments, but I don't publish non-substantive comments like that.

If you prefer, or want a personal response, you can send e-mail to comments@koopman.us.
If you want a personal response please make sure to include your e-mail reply address. Thanks!

Job and Career Advice

I sometimes get requests from LinkedIn contacts about help deciding between job offers. I can't provide personalize advice, but here are...