Monday, June 10, 2013

Seven Deadly Sins of CRCs and Checksums

We're wrapping up the final report for an FAA-sponsored study of CRC and Checksum performance for aviation applications, although the results generally apply to all uses of those error detection codes.

As part of our results we came up with an informal list of "Seven Deadly Sins" (bad ideas):
  1. Picking a CRC based on a popularity contest instead of analysis
    • This includes using “standard” polynomials such as IEEE 802.3
  2. Saying that a good checksum is as good as a bad CRC
    • Many “standard” polynomials have poor HD at long lengths
  3. Evaluating with randomly corrupted data instead of BER fault model
    • Any useful error code looks good against random data corruption; independent random bit flips (a BER fault model) are what separate good codes from bad
  4. Blindly using polynomial factorization to choose a CRC
    • It works for long dataword special cases, but not beyond that
    • Divisibility by (x+1) doubles undetected fraction on even # bit errors
  5. Failing to protect message length field
    • Results in pointing to data as FCS, giving HD=1
  6. Failing to pick an accurate fault model and apply it
    • “We added a checksum, so ‘all’ errors will be caught” (untrue!)
    • Assuming a particular standard BER without checking the actual system
  7. Ignoring interaction with bit encoding
    • E.g., bit stuffing compromises HD to give HD=2
    • E.g., 8b10b encoding – seems to be OK, but depends on specific CRC polynomial
(I haven't tried to map it onto the more traditional sin list... if someone comes up with a clever mapping I'll post it!)
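A quick way to avoid sin #1 is to actually measure a polynomial's error detection at your dataword length instead of trusting its popularity. A minimal Python sketch (my illustration, with assumed parameters: CRC-8, polynomial 0x07, MSB-first, zero initial value, no final XOR). Because CRC is linear, an error pattern goes undetected exactly when the CRC of the pattern's dataword bits equals the pattern's FCS bits, so we can scan error patterns directly instead of corrupting real messages:

```python
# Illustrative brute-force HD check (assumptions: CRC-8, poly 0x07,
# MSB-first, init 0, no final XOR).

def crc8(data: bytes, poly: int = 0x07) -> int:
    crc = 0
    for byte in data:
        crc ^= byte
        for _ in range(8):
            crc = ((crc << 1) ^ poly) & 0xFF if crc & 0x80 else (crc << 1) & 0xFF
    return crc

def undetected_2bit(dataword_bytes: int) -> int:
    """Count 2-bit error patterns over dataword + 8 FCS bits that slip through."""
    nbits = dataword_bytes * 8 + 8              # total codeword bits
    count = 0
    for i in range(nbits):
        for j in range(i + 1, nbits):
            pattern = bytearray(dataword_bytes + 1)   # dataword || FCS
            pattern[i // 8] ^= 0x80 >> (i % 8)
            pattern[j // 8] ^= 0x80 >> (j % 8)
            # By linearity: undetected iff CRC(dataword bits) == FCS bits
            if crc8(bytes(pattern[:-1])) == pattern[-1]:
                count += 1
    return count
```

With this polynomial the count is zero at short dataword lengths (it provides HD=4 there) but becomes nonzero for datawords beyond roughly 119 bits, where HD collapses to 2. The same polynomial can be fine at one length and weak at another, which is exactly why analysis beats a popularity contest.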

Thanks to Kevin Driscoll and Brendan Hall at Honeywell for their work as co-investigators. You can read more about the research on my CRC and Checksum Blog. That blog has more detailed postings and slide sets, and will have the final research report when it is made publicly available.


7 comments:

  1. Hello,

    Length field is a subtle one: if your protocol has a 'command number' at the start and different commands have different lengths, one's effectively the other. Glad I realized this yesterday instead of next week when it'd have been too late...

    Thanks

    James

  2. James -- this is a great point! Glad you found it sooner rather than later.

    Cheers,
    -- Phil

  3. Actually, I've been pondering this today and framing seems to be real trouble...

    For instance, in Modbus, a single flipped start bit after a packet would shift reception to half of the CRC + FF, the 'idle 3.5 characters' gap having received an additional character.

    If instead one used a high bit (address etc.) to indicate the first byte of a packet, and the CRC were directly after, two flips would get you checking the CRC at the wrong location as well.

    Do you have any pointers/references/etc. for how to avoid this problem? (Or have I missed something that makes it less of an issue?)

    Thanks

    James

  4. Right again -- this is a good example of how this topic gets pretty tricky if you want to plug all the gaps. (As I get time I'll post things I've run into... but time always seems to be scarce.)

    Quick version: yes, you have to think of any way in which a message can be misunderstood that can affect the length. That includes length implicit in message headers. One way to solve this is to make all messages the same length so the problem can't happen, fragmenting if necessary.

    Another way is to put a CRC on the message header to detect header corruption (FlexRay does this; for a protocol that doesn't do this you could make the first byte of the payload a CRC on the header field).

    A third way is to have headers that have a high Hamming distance from each other. One way to do this is, for example, to use the first few bits of the header as the actual header and the rest of the header bits as a CRC protecting those first few bits. The network protocol will just see this as a sparsely populated header space, but you'll know that it is hard for a single bit flip (or perhaps more) to accidentally change the header and therefore the implicit length.

    The last way I've seen it done is with a high-Hamming-distance set of synchronization bit patterns (I think this was Train Control Network), with the sync bit pattern determining which of a few message types is being used and thus implicitly the length.
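As a concrete sketch of the header-CRC option above, here is a hypothetical frame layout, invented for illustration and not taken from any real protocol: a type byte, a length byte, a CRC-8 over the header, then the payload and a CRC-8 over the payload. The receiver refuses to trust the length field until the header CRC checks out:

```python
# Hypothetical frame layout (illustrative assumption, not a real protocol):
#   [type:1][length:1][hdr_crc:1][payload:length][payload_crc:1]

def crc8(data: bytes, poly: int = 0x07) -> int:
    crc = 0
    for b in data:
        crc ^= b
        for _ in range(8):
            crc = ((crc << 1) ^ poly) & 0xFF if crc & 0x80 else (crc << 1) & 0xFF
    return crc

def build_frame(msg_type: int, payload: bytes) -> bytes:
    header = bytes([msg_type, len(payload)])
    return header + bytes([crc8(header)]) + payload + bytes([crc8(payload)])

def parse_frame(frame: bytes):
    """Return (msg_type, payload), or None if either CRC fails."""
    if len(frame) < 4:
        return None
    header, hdr_crc = frame[:2], frame[2]
    if crc8(header) != hdr_crc:
        return None                       # length field not trustworthy
    length = header[1]
    if len(frame) != 3 + length + 1:
        return None
    payload, payload_crc = frame[3:3 + length], frame[3 + length]
    if crc8(payload) != payload_crc:
        return None
    return header[0], payload
```

A flipped bit in the length byte now fails the header check instead of sending the receiver to the wrong place for the payload FCS.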

    Just one last question... With the 'CRC on the message header' approach, I noticed that on USB the header CRC is HD=3 yet the data is HD=4. I've seen this elsewhere as well. Is there some logic behind why the header can be weaker? This effectively makes the whole packet HD=3...

    Thanks

    James

  6. It does make the whole packet HD=3, and that is probably why some protocols use the same HD for header protection as packet body protection.

    I don't know why USB did it that way. But the obvious argument would be that the header is a smaller target for random bit errors than the packet body. You might, for example, argue that for a particular random independent bit error rate, HD=3 on the header is good enough because it is very unlikely to get 3 errors there, while HD=4 is necessary for the packet body because the body is so big that 3 errors will happen often enough to be a problem. To make this argument for your system you'd need to decide that random independent bit errors is the right fault model for your system, and then do the probability math to decide whether the number of undetected header errors is low enough to be acceptable to you.
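That probability math can be sketched in a few lines, assuming the random independent bit flip model and hypothetical numbers (16-bit header, 8192-bit body, BER of 1e-6); none of these figures come from USB itself:

```python
from math import comb

def p_at_least_k_errors(n_bits: int, ber: float, k: int) -> float:
    """P(k or more bit errors in n_bits) under independent random bit flips."""
    return 1.0 - sum(
        comb(n_bits, i) * ber**i * (1.0 - ber)**(n_bits - i)
        for i in range(k)
    )

# Hypothetical system: HD=3 header can miss 3-bit errors,
# HD=4 body can miss 4-bit errors.
header_risk = p_at_least_k_errors(16, 1e-6, 3)    # small target, tiny risk
body_risk = p_at_least_k_errors(8192, 1e-6, 4)    # big target dominates
```

With these made-up numbers the small header contributes far less undetected-error exposure than the large body even at the weaker HD, which is the shape of the argument; whether it actually holds depends on your system's real fault model and sizes.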


  7. Guess what happens in IEEE 802.15.4 / Zigbee... last I checked, there was no CRC or checksum protection of a leading length field. Nobody I mentioned this to seemed to find it a problem. It's a disaster waiting to happen.

    The most robust protocols get around these kinds of problems by using things like illegal sequences (coding or symbol violations) to insert start / end of frame markers, and then using things like bit stuffing to ensure such symbols can never appear inside a valid data frame (e.g., HDLC).

    The problem is always one of distinguishing control fields in the presence of bit errors. Ignoring the issue does not make it go away. Of course, redundant coding (and thus decent HD) can help, but it's not nice for bandwidth; you don't get something for nothing.
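The HDLC-style bit stuffing mentioned here can be sketched as follows (a simplified illustration; real HDLC also handles flag insertion, aborts, and octet alignment, and sin #7 in the post warns that stuffing itself interacts badly with HD):

```python
FLAG = [0, 1, 1, 1, 1, 1, 1, 0]   # HDLC frame delimiter 0x7E

def stuff(bits):
    """Insert a 0 after every run of five 1s so the payload
    can never contain the flag pattern."""
    out, run = [], 0
    for b in bits:
        out.append(b)
        run = run + 1 if b == 1 else 0
        if run == 5:
            out.append(0)
            run = 0
    return out

def unstuff(bits):
    """Inverse of stuff(): drop the 0 that follows each run of five 1s."""
    out, run, i = [], 0, 0
    while i < len(bits):
        b = bits[i]
        out.append(b)
        run = run + 1 if b == 1 else 0
        if run == 5:
            i += 1                # skip the stuffed 0
            run = 0
        i += 1
    return out
```

After stuffing, six consecutive 1s (and hence the flag) cannot appear inside the payload, so frame boundaries are unambiguous in the error-free case; a single bit flip can still create or destroy a stuffed bit, which is exactly the HD interaction the post warns about.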


Please send me your comments. I read all of them, and I appreciate them. To control spam I manually approve comments before they show up. It might take a while to respond. I appreciate generic "I like this post" comments, but I don't publish non-substantive comments like that.

If you prefer, or want a personal response, you can send e-mail to comments@koopman.us.
If you want a personal response please make sure to include your e-mail reply address. Thanks!