Sunday, February 19, 2012

Controller Area Network (CAN) Protocol Vulnerabilities

Controller Area Network (CAN) is a very popular embedded network protocol. It's widely used in automotive applications, and the resulting low cost makes it popular in a wide variety of other applications.

CAN is a very nice embedded protocol, and it's very appropriate for a wide variety of applications. But, like most protocols, it has some faults. And like many protocol faults, they aren't widely advertised. They are the sort of thing you have to worry about if you need to create an ultra-dependable system. For a robust system architecture for everyday control functions, they are probably not a huge deal.

Here is a summary of CAN Protocol faults I've run into.

(1) Unprotected header. The length bits of a CAN message aren't protected by the CRC. That means that a single corrupted bit in the length field can cause the CRC checker to look in the wrong place for the CRC. For example, if a 3-byte payload is corrupted to look like a 1-byte payload, the last two bytes of the payload will be interpreted as a CRC. This could cause an undetected 1-bit error. There are some other framing considerations that might or might not cause the error to be discovered, but it is a vulnerability in CAN that is fixed in later protocols via use of a dedicated header CRC (e.g., as done in FlexRay). I've never seen this written up for CAN in particular, but it is discussed on page 20 of [Driscoll09].
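To make item (1) concrete, here is a minimal Python sketch of where a receiver expects the CRC field to begin. It assumes a simplified base-format frame with no stuff bits, and the names HEADER_BITS, crc_field_start, and flip_bit are made up for illustration rather than taken from any CAN library.

```python
# Simplified CAN base-format layout (stuff bits ignored), widths in bits.
# Illustrative model only, not a full CAN implementation.
HEADER_BITS = 1 + 11 + 1 + 1 + 1 + 4  # SOF, ID, RTR, IDE, r0, DLC


def crc_field_start(dlc: int) -> int:
    """Bit offset at which a receiver expects the CRC field to begin."""
    return HEADER_BITS + 8 * min(dlc, 8)


def flip_bit(value: int, bit: int) -> int:
    """Return value with one bit inverted, modeling a single bit error."""
    return value ^ (1 << bit)


sent_dlc = 0b0011                 # sender transmits a 3-byte payload
seen_dlc = flip_bit(sent_dlc, 1)  # one corrupted length bit: looks like 1 byte

sent_crc_start = crc_field_start(sent_dlc)
seen_crc_start = crc_field_start(seen_dlc)
print("CRC expected at bit", seen_crc_start, "instead of", sent_crc_start)
```

In this model the receiver starts reading the CRC 16 bits early, so bits the sender meant as the last two payload bytes get treated as the CRC field.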

(2) Stuff bit errors compromising the CRC. A pair of bit errors in the bit-stuffed format of a CAN message (i.e., the one on the wire) can cause a cascade of bit errors in the message, leading to two-bit errors that go undetected by the CRC. See [Tran99] and slides 15-16 of
http://www.ece.cmu.edu/~ece649/lectures/13_can.pdf
In general, any embedded network has a problem if its physical layer encoding lets a single physical bit fault propagate into multiple data bit errors.
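The propagation effect in item (2) can be sketched in a few lines of Python. This is a toy model under stated assumptions: stuff and destuff are hypothetical helpers that implement only the "complement bit after five identical bits" rule, the bit pattern is hand-picked, and no CRC is computed. The first flipped bit hides a real stuff bit, and the second creates a false one, so the frame length stays the same while the bits in between shift by one position.

```python
def stuff(bits):
    """Insert a complementary stuff bit after five identical bits."""
    out, run, last = [], 0, None
    for b in bits:
        out.append(b)
        run = run + 1 if b == last else 1
        last = b
        if run == 5:
            out.append(1 - b)      # stuff bit
            last, run = 1 - b, 1   # the stuff bit starts a new run
    return out


def destuff(bits):
    """Drop the bit that follows five identical bits."""
    out, run, last, skip = [], 0, None, False
    for b in bits:
        if skip:
            if b == last:
                raise ValueError("stuff error: six identical bits")
            last, run, skip = b, 1, False
            continue
        out.append(b)
        run = run + 1 if b == last else 1
        last = b
        if run == 5:
            skip = True
    return out


data = [int(c) for c in "100000101001011011010"]
wire = stuff(data)

corrupted = list(wire)
corrupted[3] ^= 1    # breaks the run of five 0s, so the stuff bit is kept as data
corrupted[16] ^= 1   # creates a run of five 1s, so a data bit is dropped as stuff

received = destuff(corrupted)
mismatches = sum(a != b for a, b in zip(data, received))
print(len(data), len(received), mismatches)
```

Two bit errors on the wire leave the received frame the same length as the original, but the destuffed data differs in far more than two positions.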

(NOTE: added 10/16/2013. I just found out about the CAN-FD protocol. Among other things, it includes the stuff bits in the CRC calculation to mitigate this problem. Whether or not there are any holes in the scheme I don't know, but it is clearly a step in the right direction.)

(3) CAN is intended to provide exactly-once delivery of messages via the use of a broadcast acknowledgement mechanism. However, there are faults with this mechanism. In one fault scenario, some nodes see an acknowledge but others don't, so some nodes will receive two copies of a message (thinking they are independent copies rather than a retransmission) while others receive just one copy. In another less likely (but plausible) scenario, the above happens but the sender fails before retransmitting, resulting in some recipients having a copy of the message and others never having received a copy. [Rufino98] Further info here: [Tindell20b]
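The two outcomes in item (3) can be tallied with a very small Python model. It is only a bookkeeping sketch: copies_received is a hypothetical function, the node names are invented, and the bit-level cause of the disagreement is not modeled at all.

```python
def copies_received(receivers, accepted_first_try, sender_retransmits=True):
    """Count how many copies of one message each receiver ends up with.

    accepted_first_try: receivers that accepted the first transmission
    even though the sender concluded it had failed.
    """
    copies = {name: 0 for name in receivers}
    for name in accepted_first_try:
        copies[name] += 1
    if sender_retransmits:
        # The retransmission is assumed to reach every receiver cleanly.
        for name in receivers:
            copies[name] += 1
    return copies


nodes = ["A", "B", "C"]
duplicate_case = copies_received(nodes, {"A"})
omission_case = copies_received(nodes, {"A"}, sender_retransmits=False)
print(duplicate_case)
print(omission_case)
```

In the first case node A sees the message twice while B and C see it once. In the second case, where the sender fails before retransmitting, A has the message and B and C never get it.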

(4) It is possible for a CAN node to lock up the network due to retries if a transmitter sees a different value on the bus than it transmitted. (This is a failure of the transmitter, but the point is one node can take down the network.) [Perez03]

(5) If a CAN node's transmitter gets stuck at a dominant bit value, it will jam up the entire network. The error counters inside each CAN controller are supposed to help mitigate this, but if you have a stuck-at-"on" transmitter that ignores "turn off" commands, there isn't anything you can do with a standard CAN setup.
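For context on the error counters mentioned in item (5), here is a simplified Python sketch of the transmit error counter (TEC) thresholds in CAN fault confinement. It assumes only the basic rule of adding 8 per transmit error and ignores the special cases in the specification. The function names are made up for illustration.

```python
def tec_state(tec: int) -> str:
    """Map a transmit error counter value to a fault confinement state."""
    if tec > 255:
        return "bus-off"
    if tec > 127:
        return "error-passive"
    return "error-active"


def errors_until_bus_off(tec: int = 0) -> int:
    """Count consecutive transmit errors needed to reach bus-off."""
    errors = 0
    while tec_state(tec) != "bus-off":
        tec += 8  # simplified: each transmit error adds 8
        errors += 1
    return errors


print(errors_until_bus_off())
```

The catch is that bus-off is a decision made by the controller logic. If the fault is in a transmitter that drives the bus dominant regardless of what the controller asks for, the counter reaching bus-off does not release the bus.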

(6) CAN suffers from a priority inversion issue, which can cause delays of high priority messages due to queuing issues in nodes that locally use FIFO transmission order. [Tindell20]  Note that strictly speaking this is a CAN driver and hardware issue rather than something inherent to the on-wire protocol.  To quote Ken Tindell via a social media message: "Any software that does FIFO queuing of CAN frames in a critical application is broken."
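The queuing effect in item (6) shows up even in a toy arbitration model. The Python sketch below assumes one frame per bus slot, lowest identifier wins, and a single rival node with a backlog of medium-priority frames. The function slots_until_sent and the identifier values are hypothetical.

```python
def slots_until_sent(node_queue, rival_queue, target_id, fifo):
    """Return the bus slot in which target_id from our node is transmitted.

    fifo=True:  the node always offers the oldest queued frame.
    fifo=False: the node offers its lowest-ID (highest priority) frame.
    """
    node_queue, rival_queue = list(node_queue), list(rival_queue)
    slot = 0
    while node_queue:
        slot += 1
        offer = node_queue[0] if fifo else min(node_queue)
        if rival_queue and rival_queue[0] < offer:
            rival_queue.pop(0)     # rival frame wins arbitration this slot
            continue
        node_queue.remove(offer)   # our frame wins and is sent
        if offer == target_id:
            return slot
    return None


queued = [0x400, 0x100]   # low-priority frame queued ahead of a high-priority one
rival = [0x200] * 5       # medium-priority traffic from another node

fifo_slot = slots_until_sent(queued, rival, 0x100, fifo=True)
prio_slot = slots_until_sent(queued, rival, 0x100, fifo=False)
print(fifo_slot, prio_slot)
```

With FIFO ordering the 0x100 frame waits behind the 0x400 frame, which keeps losing arbitration to the 0x200 traffic. With priority ordering it goes out in the first slot.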

There are various countermeasures you might wish to take to combat the above failure modes for highly dependable applications. For example, you might put an auxiliary CRC or checksum in the payload (although that doesn't solve the stuff bit vulnerability). You might shut down the network if the CAN controller error counter shows a high error rate. Or you might just switch to a more robust protocol such as FlexRay.
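As a rough sketch of the "auxiliary CRC in the payload" idea, the Python below appends a 4-bit rolling counter and a CRC-8 to up to six data bytes so the result still fits an 8-byte CAN payload. The layout, the polynomial 0x1D, the initial value 0xFF, and the names crc8, protect, and check are all assumptions chosen for illustration, not a recommendation from this post. As noted above, this does not close the stuff bit vulnerability.

```python
def crc8(data: bytes, poly: int = 0x1D, init: int = 0xFF) -> int:
    """Bitwise CRC-8, MSB first (polynomial and init value are assumptions)."""
    crc = init
    for byte in data:
        crc ^= byte
        for _ in range(8):
            if crc & 0x80:
                crc = ((crc << 1) ^ poly) & 0xFF
            else:
                crc = (crc << 1) & 0xFF
    return crc


def protect(data: bytes, counter: int) -> bytes:
    """Return data + rolling counter byte + CRC-8 over both."""
    if len(data) > 6:
        raise ValueError("at most 6 data bytes fit with counter and CRC")
    body = bytes(data) + bytes([counter & 0x0F])
    return body + bytes([crc8(body)])


def check(payload: bytes) -> bool:
    """Verify the trailing CRC-8 of a protected payload."""
    return len(payload) >= 2 and crc8(payload[:-1]) == payload[-1]


frame = protect(b"\x10\x20\x30", counter=7)
damaged = bytes([frame[0] ^ 0x04]) + frame[1:]   # single bit error in the data
print(check(frame), check(damaged))
```

A receiver could also compare the counter byte against the last value it saw for that message ID to spot repeated or missing frames, though that logic is not shown here.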

CAN is also not secure, in that it provides no message authentication. It was not designed to be secure, so this is more of a limitation in scope than a design defect.

Have you heard of any other CAN failure modes? If so, let me know.

References:

[Driscoll09] Driscoll, K., Hall, B., Koopman, P., Ray, J., DeWalt, M., Data Network Evaluation Criteria Handbook, AR-09/24, FAA, 2009.
http://www.ece.cmu.edu/~koopman/pubs/faa09-24_data_network_evaluation_criteria_handbook.pdf

[Tran99] Eushiuan Tran Tsung, Multi-Bit Error Vulnerabilities in the Controller Area Network Protocol, MS Thesis, Carnegie Mellon University ECE Dept, May 1999
http://www.ece.cmu.edu/~koopman/thesis/etran.pdf

[Rufino98] Rufino, J.; Verissimo, P.; Arroz, G.; Almeida, C.; Rodrigues, L., Fault-tolerant broadcasts in CAN, FTCS 1998, pp. 150-159.
http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=689464

[Perez03] Perez, J., Reorda, M., Violante, M., Accurate dependability analysis of CAN-based networked systems, SBCCI 2003.
http://porto.polito.it/1418994/

[Tindell20] Tindell, K., CAN Priority Inversion, June 29, 2020.
https://kentindell.github.io/2020/06/29/can-priority-inversion/

[Tindell20b] Tindell, K., CAN, atomic broadcast and Buridan's Ass, July 11, 2020.
https://kentindell.github.io/2020/07/11/can-atomic-multicast/

5 comments:

  1. Hi Phil,

    are you aware of any evaluation of the effectiveness of different countermeasures in the literature? I understand that it is common to see rolling counters/additional CRC/timeout/DLC in the payload in some automotive applications. It is difficult to know which error types the countermeasures are effective against and how effective they are.

    Luke

  2. Hi Luke,

    Unfortunately there isn't really anything on this that I have run into. It's on my "would be nice to dig into someday" list, but that's a pretty big list so I doubt I will get to it soon.

    I've also heard of designers modifying data to ensure no stuff bits are generated, but that doesn't mean you can't get an error that creates false stuff bits. (I haven't looked into that specifically, but I'd be skeptical until I saw some fault injection results.)

    The one piece of advice I can give is to not count on an additional CRC increasing the achieved Hamming Distance past 2. The best you can hope for is an additional 1/2**k attenuation of undetected errors. You might see all sorts of fancy mathematical arguments about why two relatively prime polynomials will give you a better HD, but the bit stuff vulnerability will undermine any error code in CAN that I know of to be at best HD=2.

    -- Phil

  3. Interesting. Zigbee, and in fact all IEEE 802.15.4 radio networks also have a header/length field unprotected by CRC. Pointing out this vulnerability (which is FAR more severe for radio than for wired networks) always met with incredulous looks and reactions of "oh well it was done by clever people so why are you worried about it". Oh dear.

    Replies
    1. You're right: unprotected headers and analogous problems show up in many places. The first work I'm aware of in this vein is G. Funk, Message Error Detecting Properties of HDLC Protocols, IEEE Trans. Comms, Jan 1982, pp. 252-257, which talked about a 1-bit framing vulnerability (not so different). Protocol reliability analysis is a rather specialized skill, and for the most part there just aren't that many folks around who have the skill set to find these sorts of problems, much less the time and resources to devote to a thorough study and publication of the problem. So the bad practices often stay in place without anyone fixing them.

  4. Well, the list of possible faults focuses mainly on the protocol, but it leaves out many of the standard software mitigation mechanisms, like those in AUTOSAR, and ignores the system architecture of the nodes in a vehicle.

