Better Embedded System SW: February 2012

Monday, February 27, 2012

Floating Point Comparison Problems

We all learned (or should have learned) that comparing floating point numbers for equality is a risky proposition. The code:

  if (a == b) { ... }

could easily evaluate as not equal rather than equal if a and b differ due to round-off error. There are ways to deal with that involving making sure that the difference is a small fraction of the value. But as I said, you've probably seen this one.

Today I want to talk about a different comparison problem that is more subtle, and frankly one I never even thought about until I saw it as a failure in a real system that had (everyone thought) been thoroughly tested.

Consider code that looks like this:

#define SAFETYLIMIT 357.9
double MonitoredValue;
. . .
... compute MonitoredValue based on sensor values ...

if (MonitoredValue > SAFETYLIMIT)
{ ... slow down or shut down system ... }

The idea is to measure an actual value and check against a safety limit. If the value is too high, command a lower actuation set-point or do a safety shutdown. Alternately, the code might check to make sure that the monitored value is close enough to a commanded value. The details don't matter beyond there being a floating point value comparison of any type involved.

How, you ask, can this fail since it is not an exact equality test? The answer is that the computation of MonitoredValue could result in a numeric exception and produce a NaN value ("Not a Number"). Division by zero is a classic way to get this problem, but there are other more subtle problems involving numeric underflow, overflow, or hitting a discontinuity in a trig function. Any comparison involving a NaN fails, depending upon your compilation flags, floating point library, and so on. So, you could have a really fast speed that results in a numerical exception and the speed limit won't work. Because this is a numerical exception problem, using a double instead of a float usually won't help.

There are at least three solutions to this problem. One is to always make sure that NaN values result in a "safe" outcome. I'd recommend against this -- it is just too easy to forget, get wrong, or get into a situation where you aren't even sure what the safe action is. (Consider that most code doesn't bother to check for null pointers. In most projects, checking for NaN just isn't going to happen.)

Another solution is to use integers or fixed point math instead of floating point math. Fixed point computations are a pain, but at least they don't have a NaN value.

The last one I can think of is to set up your compiler or run-time system to trap on NaN comparisons. (For example, -fsignaling-nans and other related options for GCC if they are supported.) What you do when you trap is not necessarily simple, but at least your system will know there is a problem instead of blinding going along ignoring safety limits.

None of theses solutions is perfect -- but having a plan to deal with this situation is a good first step.

(BTW, there is a LOT more to writing safety critical code than putting in a speed limit check in the main code, so please consider this just an example to motivate the NaN problem. If you have a safety-critical system that doesn't have a mechanical safety backup, you need to do a whole lot more than this before you can trust software to get it right.)

Sunday, February 19, 2012

Controller Area Network (CAN) Protocol Vulnerabilities

Controller Area Network (CAN)is a very popular embedded network protocol. It's widely used in automotive applications, and the low cost that results in makes it popular in a wide variety of other applications.

CAN is a very nice embedded protocol, and it's very appropriate for a wide variety of applications. But, like most protocols it has some faults. And like many protocol faults, they aren't widely advertised. They are the sort of thing you have to worry about if you need to create an ultra-dependable system. For a robust system architecture for every day control functions, probably they are not a huge deal.

Here is a summary of CAN Protocol faults I've run into.

(1) Unprotected header. The length bits of a CAN message aren't protected by the CRC. That means that a single corrupted bit in the length field can cause the CRC checker to look in the wrong place for the CRC. For example, if a 3-byte payload is corrupted to look like a 1-byte payload, the last two bytes of the payload will be interpreted as a CRC. This could cause an undetected 1-bit error. There are some other framing considerations that might or might not cause the error to be discovered, but it is a vulnerability in CAN that is fixed in later protocols via use of a dedicated header CRC (e.g., as done in FlexRay). I've never seen this written up for CAN in particular, but this is discussed on page 20 of [Driscoll09]

(2) Stuff bit errors compromising the CRC. A pair of bit errors in the bit-stuffed format of a CAN message (i.e., one on the wire) can cause a cascading of bit errors in the message, leading to undetected two-bit errors that are undetected by the CRC. See [Tran99] and slides 15-16 of
http://www.ece.cmu.edu/~ece649/lectures/13_can.pdf
In general there are issues in any embedded network if the physical layer encoding can cause the propagation of one physical data bit fault into affecting multiple data bits.

(NOTE: added 10/16/2013. I just found about about the CAN-FD protocol. Among other things it includes the stuff bits in the CRC calculation to mitigate this problem. Whether or not there are any holes in the scheme I don't know, but it is clearly a step in the right direction.)

(3) CAN is intended to provide exactly-once delivery of messages via the use of a broadcast acknowledgement mechanism. However, there are faults with this mechanism. In one fault scenario, some nodes see an acknowledge but others don't, so some nodes will receive two copies of a message (thinking they are independent copies rather than a retransmission) while others receive just one copy. In another less likely (but plausible) scenario, the above happens but the sender fails before retransmitting, resulting in some recipients having a copy of the message and others never having received a copy. [Rufino98] Further info here: [Tindel20b]

(4) It is possible for a CAN node to lock up the network due to retries if a transmitter sees a different value on the bus than it transmitted. (This is a failure of the transmitter, but the point is one node can take down the network.) [Perez03]

(5) If a CAN nodes' transmitter gets stuck at a dominant bit value, it will jam up the entire network. The error counters inside each CAN controller are supposed to help mitigate this, but if you have a stuck-at-"on" transmitter that ignores "turn off" commands there isn't anything you can do with a standard CAN setup.

(6) CAN suffers from a priority inversion issue, which can cause delays of high priority messages due to queuing issues in nodes that locally use FIFO transmission order. [Tindell20] Note that strictly speaking this is a CAN driver and hardware issue rather than something inherent to the on-wire protocol. To quote Ken Tindell via a social media message: "Any software that does FIFO queuing of CAN frames in a critical application is broken."

There are various countermeasures you might wish to take to combat the above failure modes for highly dependable applications. For example, you might put an auxiliary CRC or checksum in the payload (although that doesn't solve the stuff bit vulnerability). You might shut down the network if the CAN controller error counter shows a high error rate. Or you might just switch to a more robust protocol such as FlexRay.

CAN is also not secure in that provides no message authentication. It was not designed to be secure, so this is more of limitation in scope than a design defect.

Have you heard of any other CAN failure modes? If so, let me know.

References:

[Driscoll09] Driscoll, K., Hall, B., Koopman, P., Ray, J., DeWalt, M., Data Network Evaluation Criteria Handbook, AR-09/24, FAA, 2009.
http://www.ece.cmu.edu/~koopman/pubs/faa09-24_data_network_evaluation_criteria_handbook.pdf

[Tran99] Eushiuan Tran Tsung, Multi-Bit Error Vulnerabilities in the Controller Area Network Protocol, MS Thesis, Carnegie Mellon University ECE Dept, May 1999
http://www.ece.cmu.edu/~koopman/thesis/etran.pdf

[Rufino98] Rufino, J.; Verissimo, P.; Arroz, G.; Almeida, C.; Rodrigues, L., Fault-tolerant broadcasts in CAN, FTCS 1998, pp. 150-159.
http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=689464

[Perez03] Perez, J., Reorda, M., Veiolante, M.; Accurate dependability analysis of CAN-based networked systems, 2003 SBBCCI.
http://porto.polito.it/1418994/

[Tindell20] Tindell, K., CAN Priority Inversion, June 29, 2020.
https://kentindell.github.io/2020/06/29/can-priority-inversion/

[Tindell20b[ Tindell, K., "CAN, atomic broadcast and Buridan's Ass," https://kentindell.github.io/2020/07/11/can-atomic-multicast/

Friday, February 3, 2012

Avoid Floating Point Counters

Every once in a while I encounter code that uses a float to track time, keep count of events, or otherwise add small increments to a running sum. This can lead to some nasty bugs that don't show up during system test.

Consider the following C code:

#include
#define SKIP 0x00FFFFF8

int main(void)
{
union { float fv; int iv; } count;
long i;

for (i = 0; i < SKIP; i++)
{ count.fv += 1;
}

for (i = 0; i < 16; i++)
{ count.fv += 1;
    printf(" +1 = %8.0f   0x%08X\n", count.fv, count.iv);
}
return;
}

At first blush, it seems this counter will be able to count up to a really high number, because even a 32-bit float can handle very large numeric values. But, the problem is that a float only has a 24-bit mantissa value. So if you have a total count that requires more than a 24-bit integer to represent, the increment will stop working due to roundoff error. In other words, each incremental +1 gets rounded off to be equivalent to +0, and the counter stops counting.

Here's the output of that program compiled with gcc v3.4.4 (header line added):
       Count     Hex Representation
+1 = 16777209   0x4B7FFFF9
+1 = 16777210   0x4B7FFFFA
+1 = 16777211   0x4B7FFFFB
+1 = 16777212   0x4B7FFFFC
+1 = 16777213   0x4B7FFFFD
+1 = 16777214   0x4B7FFFFE
+1 = 16777215   0x4B7FFFFF
+1 = 16777216   0x4B800000
+1 = 16777217   0x4B800000
+1 = 16777217   0x4B800000
+1 = 16777217   0x4B800000
+1 = 16777217   0x4B800000
+1 = 16777217   0x4B800000
+1 = 16777217   0x4B800000
+1 = 16777217   0x4B800000
+1 = 16777217   0x4B800000

As you can see, once the count hits a 24-bit value (16M), the counter stops counting.

This sort of thing can be hard to find in system test, because you may not run the test long enough to catch the problem. For an example of a similar sort of bug that contributed to a tragedy, read about the Patriot Missile bug at http://www.fas.org/spp/starwars/gao/im92026.htm

Here it is in a nutshell: Never Use A Float For A Running Sum
More generally, avoid floats in embedded systems if possible.
You can often get away with a double, but most times you are better off just using an int or long int.

Detailed notes:
IEEE single precision floating point has a 23 bit mantissa with an implicit leading 1 bit. So I say it is a 24 bit value even though only 23 of those bits show up in the actual binary representation.

The fact that the printed floating point value increments one more time but the hex value doesn't change is interesting. Likely it involves the compiled code temporarily storing the increment value in a double precision (or larger) floating point holding register and using that longer value for the printf. Note that the hex value and thus the 32-bit stored value does stop incrementing at the predicted point.