Monday, June 10, 2013

Seven Deadly Sins of CRCs and Checksums

We're wrapping up the final report for an FAA-sponsored study of CRC and Checksum performance for aviation applications, although the results generally apply to all uses of those error detection codes.

As part of our results we came up with an informal list of "Seven Deadly Sins" (bad ideas):
  1. Picking a CRC based on a popularity contest instead of analysis
    • This includes using “standard” polynomials such as IEEE 802.3
  2. Saying that a good checksum is as good as a bad CRC
    • Many “standard” polynomials have poor HD at long lengths
  3. Evaluating with randomly corrupted data instead of BER fault model
    • Any useful error code looks good against random data corruption; independent random bit flips (a BER fault model) are what separate good codes from bad
  4. Blindly using polynomial factorization to choose a CRC
    • It works for long dataword special cases, but not beyond that
    • Divisibility by (x+1) doubles undetected fraction on even # bit errors
  5. Failing to protect message length field
    • Results in pointing to data as FCS, giving HD=1
  6. Failing to pick an accurate fault model and apply it
    • “We added a checksum, so ‘all’ errors will be caught” (untrue!)
    • Assuming a particular standard BER without checking the actual system
  7. Ignoring interaction with bit encoding
    • E.g., bit stuffing compromises HD to give HD=2
    • E.g., 8b10b encoding – seems to be OK, but depends on specific CRC polynomial
(I haven't tried to map it onto the more traditional sin list... if someone comes up with a clever mapping I'll post it!)
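A quick way to avoid sin #1 is to actually measure a polynomial's error detection at your dataword length instead of trusting its popularity. A minimal Python sketch (my illustration, with assumed parameters: CRC-8, polynomial 0x07, MSB-first, zero initial value, no final XOR). Because CRC is linear, an error pattern goes undetected exactly when the CRC of the pattern's dataword bits equals the pattern's FCS bits, so we can scan error patterns directly instead of corrupting real messages:

```python
# Illustrative brute-force HD check (assumptions: CRC-8, poly 0x07,
# MSB-first, init 0, no final XOR).

def crc8(data: bytes, poly: int = 0x07) -> int:
    crc = 0
    for byte in data:
        crc ^= byte
        for _ in range(8):
            crc = ((crc << 1) ^ poly) & 0xFF if crc & 0x80 else (crc << 1) & 0xFF
    return crc

def undetected_2bit(dataword_bytes: int) -> int:
    """Count 2-bit error patterns over dataword + 8 FCS bits that slip through."""
    nbits = dataword_bytes * 8 + 8              # total codeword bits
    count = 0
    for i in range(nbits):
        for j in range(i + 1, nbits):
            pattern = bytearray(dataword_bytes + 1)   # dataword || FCS
            pattern[i // 8] ^= 0x80 >> (i % 8)
            pattern[j // 8] ^= 0x80 >> (j % 8)
            # By linearity: undetected iff CRC(dataword bits) == FCS bits
            if crc8(bytes(pattern[:-1])) == pattern[-1]:
                count += 1
    return count
```

With this polynomial the count is zero at short dataword lengths (it provides HD=4 there) but becomes nonzero for datawords beyond roughly 119 bits, where HD collapses to 2. The same polynomial can be fine at one length and weak at another, which is exactly why analysis beats a popularity contest.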

Thanks to Kevin Driscoll and Brendan Hall at Honeywell for their work as co-investigators. You can read more about the research on my CRC and Checksum Blog. That blog has more detailed postings and slide sets, and will have the final research report when it is made publicly available.


7 comments:

  1. Hello,

    Length field is a subtle one: if your protocol has a 'command number' at the start and different commands have different lengths, one's effectively the other. Glad I realized this yesterday instead of next week when it'd have been too late...

    Thanks

    James

  2. James -- this is a great point! Glad you found it sooner rather than later.

    Cheers,
    -- Phil

  3. Actually, I've been pondering this today and framing seems to be real trouble...

    For instance, in Modbus, a single flipped start bit after a packet would shift reception to half of the CRC + FF, the 'idle 3.5 characters' gap having received an additional character.

    If instead one used a high bit (address etc.) to indicate the first byte of a packet, and the CRC were directly after, two flips would get you checking the CRC at the wrong location as well.

    Do you have any pointers/references/etc. for how to avoid this problem? (Or have I missed something that makes it less of an issue?)

    Thanks

    James

  4. Right again -- this is a good example of how this topic gets pretty tricky if you want to plug all the gaps. (As I get time I'll post things I've run into... but time always seems to be scarce.)

    Quick version: yes, you have to think of any way in which a message can be misunderstood that can affect the length. That includes length implicit in message headers. One way to solve this is to make all messages the same length so the problem can't happen, fragmenting if necessary.

    Another way is to put a CRC on the message header to detect header corruption (FlexRay does this; for a protocol that doesn't do this you could make the first byte of the payload a CRC on the header field).

    A third way is to have headers that have a high Hamming distance from each other. One way to do this is, for example, to use the first few bits of the header as the actual header and the rest of the header bits as a CRC protecting those first few bits. The network protocol will just see this as a sparsely populated header space, but you'll know that it is hard for a single bit flip (or perhaps more) to accidentally change the header and therefore the implicit length.

    The last way I've seen it done is with a high-Hamming-distance set of synchronization bit patterns (I think this was Train Control Network), with the sync bit pattern determining which of a few message types is being used and thus implicitly the length.
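As a concrete sketch of the header-CRC option above, here is a hypothetical frame layout, invented for illustration and not taken from any real protocol: a type byte, a length byte, a CRC-8 over the header, then the payload and a CRC-8 over the payload. The receiver refuses to trust the length field until the header CRC checks out:

```python
# Hypothetical frame layout (illustrative assumption, not a real protocol):
#   [type:1][length:1][hdr_crc:1][payload:length][payload_crc:1]

def crc8(data: bytes, poly: int = 0x07) -> int:
    crc = 0
    for b in data:
        crc ^= b
        for _ in range(8):
            crc = ((crc << 1) ^ poly) & 0xFF if crc & 0x80 else (crc << 1) & 0xFF
    return crc

def build_frame(msg_type: int, payload: bytes) -> bytes:
    header = bytes([msg_type, len(payload)])
    return header + bytes([crc8(header)]) + payload + bytes([crc8(payload)])

def parse_frame(frame: bytes):
    """Return (msg_type, payload), or None if either CRC fails."""
    if len(frame) < 4:
        return None
    header, hdr_crc = frame[:2], frame[2]
    if crc8(header) != hdr_crc:
        return None                       # length field not trustworthy
    length = header[1]
    if len(frame) != 3 + length + 1:
        return None
    payload, payload_crc = frame[3:3 + length], frame[3 + length]
    if crc8(payload) != payload_crc:
        return None
    return header[0], payload
```

A flipped bit in the length byte now fails the header check instead of sending the receiver to the wrong place for the payload FCS.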

    Just one last question... With the 'CRC on the message header' approach, I noticed that on USB the header CRC is HD=3 yet the data is HD=4. I've seen this elsewhere as well. Is there some logic behind why the header can be weaker? This effectively makes the whole packet HD=3...

    Thanks

    James

  6. It does make the whole packet HD=3, and that is probably why some protocols use the same HD for header protection as packet body protection.

    I don't know why USB did it that way. But the obvious argument would be that the header is a smaller target for random bit errors than the packet body. You might, for example, argue that for a particular random independent bit error rate, HD=3 on the header is good enough because it is very unlikely to get 3 errors there, while HD=4 is necessary for the packet body because the body is so big that 3 errors will happen often enough to be a problem. To make this argument for your system you'd need to decide that random independent bit errors is the right fault model for your system, and then do the probability math to decide whether the number of undetected header errors is low enough to be acceptable to you.
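That probability math can be sketched in a few lines, assuming the random independent bit flip model and hypothetical numbers (16-bit header, 8192-bit body, BER of 1e-6); none of these figures come from USB itself:

```python
from math import comb

def p_at_least_k_errors(n_bits: int, ber: float, k: int) -> float:
    """P(k or more bit errors in n_bits) under independent random bit flips."""
    return 1.0 - sum(
        comb(n_bits, i) * ber**i * (1.0 - ber)**(n_bits - i)
        for i in range(k)
    )

# Hypothetical system: HD=3 header can miss 3-bit errors,
# HD=4 body can miss 4-bit errors.
header_risk = p_at_least_k_errors(16, 1e-6, 3)    # small target, tiny risk
body_risk = p_at_least_k_errors(8192, 1e-6, 4)    # big target dominates
```

With these made-up numbers the small header contributes far less undetected-error exposure than the large body even at the weaker HD, which is the shape of the argument; whether it actually holds depends on your system's real fault model and sizes.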


  7. Guess what happens in IEEE 802.15.4 / Zigbee... last I checked, there was no CRC or checksum protection of a leading length field. Nobody I mentioned this to seemed to find it a problem. It's a disaster waiting to happen.

    The most robust protocols get around these kinds of problems by using things like illegal sequences (coding or symbol violations) to insert start / end of frame markers, and then using things like bit stuffing to ensure such symbols can never appear inside a valid data frame (e.g., HDLC).

    The problem is always one of distinguishing control fields in the presence of bit errors. Ignoring the issue does not make it go away. Of course, redundant coding (and thus decent HD) can help, but it's not nice for bandwidth; you don't get something for nothing.
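The HDLC-style bit stuffing mentioned here can be sketched as follows (a simplified illustration; real HDLC also handles flag insertion, aborts, and octet alignment, and sin #7 in the post warns that stuffing itself interacts badly with HD):

```python
FLAG = [0, 1, 1, 1, 1, 1, 1, 0]   # HDLC frame delimiter 0x7E

def stuff(bits):
    """Insert a 0 after every run of five 1s so the payload
    can never contain the flag pattern."""
    out, run = [], 0
    for b in bits:
        out.append(b)
        run = run + 1 if b == 1 else 0
        if run == 5:
            out.append(0)
            run = 0
    return out

def unstuff(bits):
    """Inverse of stuff(): drop the 0 that follows each run of five 1s."""
    out, run, i = [], 0, 0
    while i < len(bits):
        b = bits[i]
        out.append(b)
        run = run + 1 if b == 1 else 0
        if run == 5:
            i += 1                # skip the stuffed 0
            run = 0
        i += 1
    return out
```

After stuffing, six consecutive 1s (and hence the flag) cannot appear inside the payload, so frame boundaries are unambiguous in the error-free case; a single bit flip can still create or destroy a stuffed bit, which is exactly the HD interaction the post warns about.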


Please send me your comments. I read all of them, and I appreciate them. To control spam I manually approve comments before they show up. It might take a while to respond. I appreciate generic "I like this post" comments, but I don't publish non-substantive comments like that.

If you prefer, or want a personal response, you can send e-mail to comments@koopman.us.
If you want a personal response please make sure to include your e-mail reply address. Thanks!