Monday, February 27, 2012

Floating Point Comparison Problems

We all learned (or should have learned) that comparing floating point numbers for equality is a risky proposition. The code:
  if (a == b) { ... }
can easily evaluate to false even when a and b ought to be equal, because the two values differ slightly due to round-off error. The usual ways to deal with that involve checking that the difference is a small fraction of the values being compared. But as I said, you've probably seen this one.
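
One common shape for that kind of tolerance check is a relative comparison, sketched here in C. The function name nearly_equal and its tolerance parameters are made up for illustration, and suitable tolerance values depend on the application.

```c
#include <math.h>
#include <stdbool.h>

/* Hypothetical helper: true if a and b differ by no more than a small
 * fraction (rel_tol) of the larger magnitude. abs_tol covers values
 * near zero, where a purely relative test is too strict. */
static bool nearly_equal(double a, double b, double rel_tol, double abs_tol)
{
    double diff = fabs(a - b);
    double mag_a = fabs(a);
    double mag_b = fabs(b);
    double largest = (mag_a > mag_b) ? mag_a : mag_b;

    /* If either input is NaN, diff is NaN and both comparisons are false. */
    return (diff <= abs_tol) || (diff <= rel_tol * largest);
}
```

A call such as nearly_equal(a, b, 1e-9, 1e-12) would then stand in for a == b.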

Today I want to talk about a different comparison problem that is more subtle, and frankly one I never even thought about until I saw it as a failure in a real system that had (everyone thought) been thoroughly tested.

Consider code that looks like this:

#define SAFETYLIMIT 357.9
double MonitoredValue;
. . .
/* ... compute MonitoredValue based on sensor values ... */

if (MonitoredValue > SAFETYLIMIT)
{
  /* ... slow down or shut down system ... */
}

The idea is to measure an actual value and check it against a safety limit. If the value is too high, command a lower actuation set-point or do a safety shutdown. Alternatively, the code might check to make sure that the monitored value is close enough to a commanded value. The details don't matter beyond there being a floating point value comparison of any type involved.

How, you ask, can this fail since it is not an exact equality test? The answer is that the computation of MonitoredValue could hit an invalid operation and produce a NaN ("Not a Number") value. Dividing zero by zero is a classic way to get this problem, but there are other more subtle ways, such as an underflow to zero or an overflow to infinity feeding a later operation (0/0, or infinity minus infinity), or calling a math function such as sqrt or asin outside its domain. Under IEEE 754 arithmetic, every ordered comparison involving a NaN evaluates to false, so the test MonitoredValue > SAFETYLIMIT is false and the shutdown code never runs. (The exact behavior can depend upon your compilation flags, floating point library, and so on.) So, you could have a really fast speed that results in a numerical exception, and the speed limit check won't work. Because this is a numerical exception problem rather than a precision problem, using a double instead of a float usually won't help.
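
The failure mode can be reproduced in a few lines of C. In this sketch the function names compute_ratio and over_limit are invented for illustration, and the behavior described assumes IEEE 754 arithmetic without value-changing optimizations such as GCC's -ffast-math.

```c
#include <math.h>
#include <stdbool.h>

#define SAFETYLIMIT 357.9

/* Hypothetical computation: a ratio of two sensor-derived values.
 * If both are zero, 0.0 / 0.0 yields NaN under IEEE 754. */
static double compute_ratio(double numerator, double denominator)
{
    return numerator / denominator;
}

/* The limit check from the example above. Returns false for NaN,
 * because every ordered comparison with NaN is false. */
static bool over_limit(double monitored_value)
{
    return monitored_value > SAFETYLIMIT;
}
```

With these definitions, over_limit(compute_ratio(0.0, 0.0)) is false, and so is the opposite test compute_ratio(0.0, 0.0) <= SAFETYLIMIT. The NaN is neither over the limit nor under it.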

There are at least three solutions to this problem. One is to always make sure that NaN values result in a "safe" outcome. I'd recommend against this -- it is just too easy to forget, get wrong, or get into a situation where you aren't even sure what the safe action is. (Consider that most code doesn't bother to check for null pointers. In most projects, checking for NaN just isn't going to happen.)
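
For reference, this first option usually takes one of two forms in C: an explicit isnan() test, or a comparison phrased so that a NaN falls on the safe side. The sketch below shows both. The names needs_shutdown and needs_shutdown_negated are made up, and treating a NaN as a reason to shut down is an assumption about what "safe" means for this hypothetical system.

```c
#include <math.h>
#include <stdbool.h>

#define SAFETYLIMIT 357.9

/* Explicit form: a NaN reading is treated the same as an over-limit one. */
static bool needs_shutdown(double monitored_value)
{
    return isnan(monitored_value) || (monitored_value > SAFETYLIMIT);
}

/* Negated form: "not known to be within the limit". A NaN makes the inner
 * comparison false, so the negation is true and the safe action is taken. */
static bool needs_shutdown_negated(double monitored_value)
{
    return !(monitored_value <= SAFETYLIMIT);
}
```

The negated form is easy for a later maintainer to "simplify" back into a plain > test, which is one illustration of how fragile this approach can be.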

Another solution is to use integers or fixed point math instead of floating point math. Fixed point computations are a pain, but at least they don't have a NaN value.
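
As a minimal sketch of the fixed point option, the limit and the monitored value can both be held as integers scaled by ten, so that 357.9 becomes 3579. The scale factor, the type name, and the helper names here are assumptions for illustration; a real design would pick the scaling from the sensor resolution and range.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical fixed point representation: value in tenths of a unit. */
typedef int32_t tenths_t;

#define SAFETYLIMIT_TENTHS ((tenths_t)3579) /* 357.9 units */

/* Hypothetical conversion from a raw sensor count to tenths.
 * Uses a 64-bit intermediate so the multiply cannot overflow, then
 * saturates instead of wrapping if the result does not fit. */
static tenths_t counts_to_tenths(int32_t raw_counts, int32_t tenths_per_count)
{
    int64_t wide = (int64_t)raw_counts * (int64_t)tenths_per_count;
    if (wide > INT32_MAX) { return INT32_MAX; }
    if (wide < INT32_MIN) { return INT32_MIN; }
    return (tenths_t)wide;
}

/* Every integer bit pattern is an ordinary number, so this comparison
 * always gives a meaningful answer. */
static bool over_limit_fixed(tenths_t monitored_tenths)
{
    return monitored_tenths > SAFETYLIMIT_TENTHS;
}
```

Integers bring their own hazards, such as overflow and integer division by zero, so the sketch widens the multiply and avoids division altogether.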

The last one I can think of is to set up your compiler or run-time system to trap on NaN comparisons. (For example, -fsignaling-nans and other related options for GCC if they are supported.) What you do when you trap is not necessarily simple, but at least your system will know there is a problem instead of blindly going along ignoring safety limits.

None of these solutions is perfect -- but having a plan to deal with this situation is a good first step.

(BTW, there is a LOT more to writing safety critical code than putting in a speed limit check in the main code, so please consider this just an example to motivate the NaN problem. If you have a safety-critical system that doesn't have a mechanical safety backup, you need to do a whole lot more than this before you can trust software to get it right.)
