
Monday, May 5, 2014

Mitigating Data Corruption

As previously discussed, data corruption from bit flips can arise from a variety of sources, and the results of such faults can be catastrophic. An effective technique should be used to detect both software- and hardware-caused corruption of critical data. Suitable techniques include error coding and data replication in various forms.

Consequences:
Unintentional modification of data values can cause arbitrarily bad system behavior, even if only one bit is modified. Therefore, safety critical systems should take measures to prevent or detect such corruptions, consistent with the estimated risk of hardware-induced data corruption and anticipated software quality. Note that even best practices may not prevent corruption due to software concurrency defects.

Accepted Practices:

  • When hardware data corruption may occur and automatic error correction is desired, use hardware single error correction/multiple error detection (SECMED) circuitry, a form of Error Detection and Correction (EDAC) circuitry sometimes just called Error Correcting Code (ECC) circuitry, for all bytes of RAM. This protects against hardware memory corruption, including hardware corruption of operating system variables, but it does not protect against software memory corruption.
  • Use a software approach such as a cyclic redundancy code (CRC, preferred) or a checksum to detect a corrupted program image, and test for corruption at least when the system is booted. (A sketch of such a boot-time check follows this list.)
  • Use a software approach such as keeping redundant copies to detect software data corruption of RAM values.
  • Use fault injection to test data corruption detection and correction mechanisms.
  • Perform memory tests to ensure there are no hard faults.
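
As mentioned in the list above, a CRC over the program image can be checked at boot. Here is a minimal sketch (my own illustration, not taken from any particular standard). It uses the common IEEE 802.3 CRC-32 polynomial in reflected form; per the later post on polynomial selection, a real design should pick its polynomial by analysis rather than popularity:

    #include <stdint.h>
    #include <stddef.h>

    /* Bit-at-a-time CRC-32, reflected IEEE 802.3 polynomial (placeholder
     * choice -- select the polynomial by analysis for your application). */
    static uint32_t crc32(const uint8_t *data, size_t len)
    {
        uint32_t crc = 0xFFFFFFFFu;
        for (size_t i = 0; i < len; i++) {
            crc ^= data[i];
            for (int b = 0; b < 8; b++) {
                crc = (crc & 1u) ? (crc >> 1) ^ 0xEDB88320u : (crc >> 1);
            }
        }
        return ~crc;
    }

    /* Compare the image CRC against a value recorded at build time.
     * Returns nonzero if the image appears intact. */
    int program_image_ok(const uint8_t *image, size_t len, uint32_t stored_crc)
    {
        return crc32(image, len) == stored_crc;
    }

A system would call program_image_ok() at boot (and ideally periodically thereafter), and refuse to run or enter a safe state if the check fails.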

Discussion:

Safety critical systems must protect against data corruption because even small changes in data can render the system unsafe. A single bit in memory changing the wrong way could flip a system from safe to unsafe. To guard against this, various schemes for memory corruption detection and prevention are used.

Effect of a bit flip (Source: http://dependablesystem.blogspot.com/2012/05/flip-happens.html)

Hardware and Software Are Both Corruption Sources

Hardware memory corruption occurs when a radiation event, voltage fluctuation, source of electrical noise, or other cause makes one or more bits flip from one value to another. In non-volatile memory such as flash memory, wearout, low programming voltage, or electrical charge leakage over time can also cause bits in memory to have an incorrect value. Mitigation techniques for these types of memory errors include the use of hardware error detection/correction codes (sometimes called “EDAC”) for RAM, and typically the use of a “checksum” for flash memory to ensure that all the bytes, when “added up,” give the same total as they did when the program image was stored in flash.

If hardware memory error detection support is not available, RAM can also be protected with some sort of redundant storage. A common practice is to store two copies of a value in two different places in RAM, often with one copy inverted or otherwise manipulated. The two copies should not be stored next to each other, so that an error corrupting a run of adjacent bits in memory cannot damage both copies. Rather, there should be two entirely different sections of memory for mirrored variables, with each section holding only one copy of each mirrored variable. That way, if a small chunk of memory is arbitrarily corrupted, at most one of the two copies of any mirrored variable can be affected. Error detection codes such as checksums can also be used; compared to simple replication of data, they trade increased computation time on each update for less storage space devoted to error detection information.

Software memory corruption occurs when one part of a program mistakenly writes data to a place that should only be written to by another part of the program due to a software defect. This can happen as a result of a defect that produces an incorrect pointer into memory, due to a buffer overflow (e.g., trying to put 17 bytes of data into a 16 byte storage area), due to a stack overflow, or due to a concurrency defect, among other scenarios.

Hardware error detection does not help in detecting software memory corruption, because the hardware will ordinarily assume that software has permission to make any change it likes. (There may be exceptions if hardware has a way to “lock” portions of memory against modification, but no such mechanism is assumed here.) Software error detection may help if the corruption is random, but may not help if the corruption is the result of defective software following authorized RAM modification procedures that just happen to point to the wrong place when modifications are made. While various approaches to reduce the chance of accidental data corruption can be envisioned, acceptable practice for safety critical systems in the absence of redundant computing hardware calls for, at a minimum, storing redundant copies of data. There must also be a recovery plan, such as system reboot or restoration to defaults, if a corruption is detected.

Data Mirroring

A common approach to providing data corruption protection is to use a data mirroring approach in which a second copy of a variable having a one’s complement value is stored in addition to the ordinary variable value. A one’s complement representation of a number is computed by inverting all the bits in a number. So this means one copy of the number is stored normally, and the second copy of that same number is stored with all the bits inverted (“complemented”). As an example, if the original number is binary “0000” the one’s complement mirror copy would be “1111.” When the number is read, both the “0000” and the “1111” are read and compared to make sure they are exact complements of each other. Among other things, this approach gives protection against a software defect or hardware corruption that sets a number of RAM locations to all be the same value. That sort of corruption can be detected regardless of the constant corruption value put into RAM, because two mirrored copies can’t have the same value unless at least one of the pair has been corrupted (e.g., if all zeros are written to RAM, both copies in a mirrored pair will have the value “0000,” indicating a data corruption has occurred).

Mirroring can also help detect hardware bit flip corruptions. A bit flip is when a binary value (a 0 or 1) is corrupted to have the opposite value (a 1 or 0 respectively), which in turn corrupts the value of the number stored at the memory location suffering one or more bit flips. So long as only one of the two mirror values suffers a bit flip, that corruption will be detectable because the two copies won’t match properly as exact complements of each other.
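
Here is a minimal sketch of the write and read sides of one's complement mirroring (variable and function names are my own, for illustration). A real implementation would also define the fault response when a mismatch is detected:

    #include <stdint.h>

    /* The two copies must be placed in two entirely different memory
     * sections (via compiler/linker placement directives, which are
     * toolchain-specific) -- adjacent placement, which is what plain
     * declarations like these would give, is warned against below. */
    static uint16_t speed_primary;   /* normal value                */
    static uint16_t speed_mirror;    /* one's complement (inverted) */

    void speed_write(uint16_t value)
    {
        speed_primary = value;
        speed_mirror  = (uint16_t)~value;   /* store all bits inverted */
    }

    /* Returns 0 and sets *value on success; nonzero if corruption detected. */
    int speed_read(uint16_t *value)
    {
        uint16_t a = speed_primary;
        uint16_t b = speed_mirror;
        if ((uint16_t)(a ^ b) != 0xFFFFu) { /* must be exact complements */
            return -1;                      /* mismatch: invoke recovery action */
        }
        *value = a;
        return 0;
    }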

A good practice is to ensure that the mirrored values are not adjacent to each other so that an erroneous multi-byte variable update is less likely to modify both the original and mirrored copy. Such mirrored copies are vulnerable to a pair of independent bit flips that just happen to correspond to the same position within each of a pair of complemented stored values. Therefore, for highly critical systems a Cyclic Redundancy Check (CRC) or other more advanced error detection method is recommended.

It is important to realize that all memory values that can conceivably cause a system hazard need to be protected by mirroring, not just a portion of memory. For example, a safety-critical Real Time Operating System will have values in memory that control task scheduling. Corruption of those variables can lead to task death or other problems even if the application software uses mirroring, unless the RTOS also protects its own data integrity. There are multiple ways for an RTOS to protect its data from software and hardware defects, such as using hardware access protection. But if mirroring is the only mechanism a system uses to prevent memory corruption, the RTOS has to use it too, or there is a vulnerability.

Selected Sources

Automotive electronics designers knew as early as 1982 that data corruption could be expected in automotive electronics. Seeger writes: “Due to the electrically hostile environment that awaits a microprocessor based system in an automobile, it is necessary to use extra care in the design of software for those systems to ensure that the system is fault tolerant. Common faults that can occur include program counter faults, altered RAM locations, or erratic sensor inputs.” (Seeger 1982, abstract, emphasis added). Automotive designers generally accepted the fact that RAM location disruptions would happen in automotive electronics (due to electromagnetic interference (EMI), radiation events, or other disturbances), and had to ensure that any such disruption would not result in an unsafe system.

Stepner, in a paper on real time operating systems that features a discussion of OSEK (the trade name of an automotive-specific real time operating system), states with regard to avoiding corruption of data: “One technique is the redundant storage of critical variables, and comparison prior to being used. Another is the grouping of critical variables together and keeping a CRC over each group.” (Stepner 1999, pg. 155).

Brown says “We’ve all heard stories of bit flips that were caused by cosmic rays or EMI” and goes on to describe a two-out-of-three voting scheme to recover from variable corruption. (Brown 1998 pp. 48-49). A variation of keeping only two copies permits detection but not correction of corruption. Brown also acknowledges that designers must account for software data corruption, saying “Another, and perhaps more common, cause of memory corruption is a rogue pointer, which can run wild through memory leaving a trail of corrupted variables in its wake. Regardless of the cause, the designer of safety-critical software must consider the threat that sometime, somewhere, a variable will be corrupted.” (id., p. 48).
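
As a rough sketch of the two-out-of-three voting idea Brown describes (my own illustration), bitwise majority voting over three stored copies recovers the correct value even if any single copy is arbitrarily corrupted:

    #include <stdint.h>

    /* Each result bit takes the majority value of the three copies. */
    uint16_t vote2of3(uint16_t a, uint16_t b, uint16_t c)
    {
        return (a & b) | (a & c) | (b & c);
    }

Writing the voted result back to all three copies scrubs a corrupted copy before a second corruption has a chance to accumulate on top of it.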

Kleidermacher says: “When all of an application’s threads share the same memory space, any thread could—intentionally or unintentionally— corrupt the code, data, or stack of another thread. A misbehaved thread could even corrupt the kernel’s own code or internal data structures. It’s easy to see how a single errant pointer in one thread could easily bring down the entire system, or at least cause it to behave unexpectedly.” (Kleidermacher 2001, pg. 23). Kleidermacher advocates hardware memory protection, but in the absence of a hardware mechanism, software mechanisms are required to mitigate memory corruption faults.

Fault injection is a way to test systems to see how they respond to faults in memory or elsewhere (see also an upcoming post on that topic). Fault injection can be performed in hardware (e.g., by exposing a hardware circuit to a radiation source or by using hardware test features to modify bit values), or injected via software means (e.g., slightly modifying software to permit flipping bits in memory to simulate a hardware fault). In a research paper, Vinter used a hybrid hardware/software fault injection technique to corrupt bits in a computer running an automotive-style engine control application. The conclusions of this paper start by saying: “We have demonstrated that bit-flips inside a central processing unit executing an engine control program can cause critical failures, such as permanently locking the engine’s throttle at full speed.” (Vinter 2001). Fault injection remains a preferred technique for determining whether there are data corruption vulnerabilities that can result in unsafe system behavior.

References:
  • Brown, D., “Solving the software safety paradox,” Embedded System Programming, December 1998, pp. 44-52.
  • Kleidermacher, D. & Griglock, M., Safety-Critical Operating Systems, Embedded Systems Programming, Sept. 2001, pp. 22-36.
  • Seeger, M., Fault-Tolerant Software Techniques, SAE Report 820106, International Congress & Exposition, Society of Automotive Engineers, 1982, pp. 119-125.
  • Stepner, D., Nagarajan, R., & Hui, D., Embedded application design using a real-time OS, Design Automation Conference, 1999, pp. 151-156.
  • Vinter, J., Aidemark, J., Folkesson, P. & Karlsson, J., Reducing critical failures for control algorithms using executable assertions and best effort recovery, International Conference on Dependable Systems and Networks, 2001, pp. 347-356.


Monday, March 3, 2014

Random Hardware Faults


Computer hardware randomly generates incorrect answers once in a while, in some cases due to single event upsets from cosmic rays. Yes, really! Here's a summary of the problem and how often you can expect it to occur, with references for further reading.

---------------------------------------

(As will be the case with some of my posts over the next few months, this text is written more in an academic style, and emphasizes automotive safety applications since that is the area I perform much of my research in. But I hope it is still useful, if less entertaining in style than some of my other posts.)

Hardware corruption in the form of bit flips in memories has been known to occur on terrestrial computers for more than 30 years (e.g., May & Woods 1979; Karnik et al. 2004; Heijmen 2011). Hardware data corruption includes changes made during operation both to RAM and to other CPU hardware resources (for example, bits held in CPU registers that are intermediate results in a computation).

"Soft errors" are those in which a transient fault causes a loss of data, but not damage to a circuit. (Mariani03, pg. 51) Terminology can be slightly confusing in that a “soft error” has nothing to do with software, but instead refers to a hardware data corruption that does not involve permanent chip damage. These “soft” errors can be caused by, among other things, neutrons created by cosmic rays that penetrate the earth’s atmosphere.

Soft errors can be caused by external events such as naturally occurring alpha radiation particle caused by impurities in the semiconductor materials, as well as neutrons from cosmic rays that penetrate the earth’s atmosphere (Mariani03, pp. 51-53). Additionally, random faults may be caused by internal events such as electrical noise and power supply disruptions that only occasionally cause problems. (Mariani03, pg. 51).

As computers evolve to become more capable, they also use smaller and smaller individual devices to perform computation. And, as devices get smaller, they become more vulnerable to data corruption from a variety of sources in which a subtle fleeting event causes a disruption in a data value. For example, a cosmic ray might create a particle that strikes a bit in memory, flipping that bit from a “1” to a “0,” resulting in the change of a data value or death of a task in a real time control system. (Wang 2008 is a brief tutorial.)

Chapter 1 of Ziegler & Puchner (2004) gives an historical perspective of the topic. Chapter 9 of Ziegler & Puchner (2004) lists accepted techniques for protecting against hardware induced errors, including use of error correcting codes (id., p. 9-7) and memory scrubbing (id., p. 9-8). It is noted that techniques that protect memory “are not effective for computational or logic operations” (id., p. 9-8), meaning that only values stored in RAM can be protected by these methods, although special exotic methods can be used to protect the rest of the CPU (id., p. 1-13).

Causes for such failures vary in source and severity depending upon a variety of conditions and the exact type of technology used, but they are always present. The general idea of the failure mode is as follows. SRAM memory uses pairs of inverter logic gates arranged in a feedback pattern to “remember” a “1” or a “0” value. A neutron produced by a cosmic ray can strike atoms in an SRAM cell, inducing a current spike that overrides the remembered value, flipping a 0 to a 1, or a 1 to a 0. Newer technology uses smaller hardware cells to achieve increased density, but smaller components make it easier for a cosmic ray or other disturbance to disrupt the cell’s value. While a cosmic ray strike may seem like an exotic way for computers to fail, this is a well-documented effect that has been known for decades (e.g., May & Woods 1979 discussed radiation-induced errors) and occurs on a regular basis with a predictable average frequency, even though each individual event is fundamentally random.

Radiation strike causing transistor disruption (Gorini 2012).

A paper by Dr. Cristian Constantinescu, then a top Intel dependability expert, summarized the situation about a decade ago (Constantinescu 2003). He differentiates random faults into two cases. Intermittent faults, which might be due to a subtle manufacturing or wearout defect, can cause bursts of memory errors (id., pp. 15-16). Transient faults, on the other hand, can be caused by naturally occurring neutron and alpha particles, electromagnetic interference, and electrostatic discharge (id., pg. 16); he refers to these as “soft errors.” To simplify terminology, we’ll consider both transient and intermittent faults as “random” errors, even though intermittent errors may, strictly speaking, come in occasional bursts rather than having purely random independent statistical distributions. His paper goes on to discuss errors observed in real computers: of 193 corporate server computers studied, 69% experienced one or more random errors in a 16-month period, and 2.6% of the servers reported more than 1000 random errors in that same period due to a combination of random error sources (id., pg. 16):

Random Error Data (Constantinescu 2003, pp. 15-16).


This data supports the well-known observation that random errors can and do occur in all computer systems. Due to minute variations in manufacturing, environment, and the underlying statistics, it is to be expected that some systems will suffer a very large number of such errors in a short period of time, while other systems of the same exact type will exhibit none even over extended periods of operation. The problem is only expected to get worse for newer generations of electronics (id., pg. 17).

Excerpt from Mariani 2003, pg. 50

Mariani (2003, pg. 50) makes it clear that soft errors “are officially considered one of the main causes of loss of reliability in modern digital components.” He even goes so far as to say of “x-by-wire” automotive systems (which include “throttle-by-wire” and other computer-controlled functions): “Of course such systems must be highly reliable, as failure will cause fatal accidents, and therefore soft errors must be taken into account.” MISRA Report 2 likewise makes it clear that “random hardware failures” must be controlled for safety critical automotive applications, with the rigor required to control those failures increasing as safety becomes more important (MISRA Report 2, p. 18).

Semiconductor error rates are often given in terms of FIT (“Failures In Time”), where one FIT equals one failure per billion operating hours. Mariani gives representative numbers as 4000 FIT for a CPU circa 2001 (Mariani 2003, pg. 54), with error rates increasing significantly in subsequent years. Based on a review of various data sources, it seems reasonable to use an error rate of 1000 FIT/Mbit (one error per 1e12 bit-hr) for SRAM (e.g., Heijmen 2011, pg. 2; Wang 2008, pg. 430; Tezzaron 2004). Note that not all failures occur in memory. Any transistor on a chip can fail due to radiation, and not all storage is in RAM. As an example, registers such as a computer’s program counter are exposed to Single Event Upsets (SEUs), and an SEU there can result in the computer having arbitrarily bad behavior unless some form of mitigation is in place. Automotive-specific publications state that hardware-induced data corruption must be considered (Seeger 1982).
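
To put that rate in perspective, here is a rough worked example (my own numbers, for illustration only). At 1000 FIT/Mbit, a device with 1 Mbit of SRAM expects 1000 failures per billion operating hours, or about one upset per million hours -- roughly one per 114 years of continuous operation, which sounds negligible for any single unit. But a fleet of one million such devices operating one hour per day accumulates a million device-hours every day, and so should expect on the order of one SRAM bit flip somewhere in the fleet each day.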

SEU rates vary considerably based on a number of factors such as altitude, with data for Boulder, Colorado showing a factor of 3.5 to 4.8 increase in cosmic-ray induced errors over data collected in Essex Junction, Vermont, which is near sea level (O’Gorman 1994, abstract).

Given the above, it is clear that random hardware errors are to be expected when deploying a large fleet of embedded systems. This means that if your system is safety critical, you need to take such errors into account. In a future post I'll explore how bad the effects of such a fault can be.

References:

Constantinescu, Trends and challenges in VLSI circuit reliability, IEEE Micro, July-Aug 2003, pp. 14-19

Heijmen, Soft errors from space to ground: historical overview, empirical evidence, and future trends (Chapter 1), in: Soft Errors in Modern Electronic Systems, Frontiers in Electronic Testing, 2011, Volume 41, pp. 1-41.

Karnik et al., Characterization of soft errors caused by single event upsets in CMOS processes, IEEE Trans. Dependable and Secure Computing, April-June 2004, pp. 128-143.

Mariani, Soft errors on digital computers, Fault injection techniques and tools for embedded systems reliability evaluation, 2003, pp. 49-60.

May & Woods, Alpha-particle-induced soft errors in dynamic memories, IEEE Trans. On Electronic Devices, vol. 26, 1979

MISRA, Report 2: Integrity, February 1995. (www.misra.org.uk)

O’Gorman, The effect of cosmic rays on the soft error rate of a DRAM at ground level, IEEE Trans. Electron Devices, April 1994, pp. 553-557.

Seeger, M., Fault-Tolerant Software Techniques, SAE Report 820106, International Congress & Exposition, Society of Automotive Engineers, 1982, pp. 119-125.

Tezzaron Semiconductor, Soft Errors in Electronic Memory -- A White Paper, 2004, www.tezzaron.com

Wang & Agrawal, Single event upset: an embedded tutorial, 21st Intl. Conf. on VLSI Design, 2008, pp. 429-434.

Ziegler & Puchner, SER – History, Trends, and Challenges: a guide for designing with memory ICs, Cypress Semiconductor, 2004.





Saturday, November 9, 2013

CRC and Checksum Tutorial Webinar

I've completed my FAA-sponsored look at CRC and Checksum performance for aviation systems. While some of the material is aircraft-specific, it is all relevant to safety critical systems and contains quite a bit of information about CRC vs. checksum performance.

I'm pleased to be able to share a two-hour Webinar recording describing all the work, as well as the slides and draft report. There is a guide to the material below that you may find useful if you are looking for a specific topic.

The general flow of the presentation is as follows, by slide number:
  • 6 - CRC and checksum background and terminology
  • 10 - Why picking the right CRC isn't as easy as you might think
  • 20 - Summary of project technical approach
  • 21 - Project results overview
  • 24 - Checksum results (Checksum, Fletcher Checksum, ATN-32)
  • 30 - CRC-32 compared to checksums
  • 31 - It's all about the Hamming Distance
  • 33 - Table of checksum compared to CRC-32
  • 35 - Good CRC choices (8 & 16 bits)
  • 36 - Good CRC choices (24 & 32 bits)
  • 37 - Aviation industry results (might be enlightening for everyone)
  • 43 - Multiple CRC & memory-specific CRC results
  • 46 - System level effects and cross-layer interactions (e.g., unprotected length field, bit encoding)
  • 48 - ARINC-825 / CAN issues
  • 52 - 8b10b encoding Gbit Ethernet issue
  • 53 - Mapping to criticality levels (how do you know your CRC/checksum is good enough?)
  • 64 - Determining Bit Error Ratio (BER)
  • 71 - Error detection mechanism failure and scrubbing
  • 74 - A simple recipe for how to determine how big and capable a CRC/checksum you need
  • 76 - CRC/Checksum seven deadly sins (bad ideas)
  • 79 - Review of the study project subtasks (most readers can skip this part)

I owe significant thanks to my Co-Investigators Kevin Driscoll and Brendan Hall at Honeywell labs, without whom this work could not have been done. And also a warm and sincere thanks to our FAA contacts, and especially to Chuck Kilgore for his good spirits, kind words, and unfailing support.

NOTES:

"The provided links will give you the option of downloading the video file or view it through your web browser.   Some people have reported problems viewing the video with their web browser due to the 2 hour length.  Others have been able to view the video in a browser without any problems.   If you encounter problems, we recommend that you download the file for offline viewing if possible to avoid any streaming issues."

"This work was supported by the Federal Aviation Administration, Aircraft Certification Service, and Assistant Administration for NextGen, William J. Hughes Technical Center, Aviation Research Division, Atlantic City International Airport, New Jersey 08405. The findings and conclusions in this presentation are those of the author(s) and do not necessarily represent the views of the funding agency.  This presentation does not constitute FAA policy."

The Webinar was scheduled Oct 1, 2013, but was delayed due to the government shutdown. The webinar was actually held on Oct 29, 2013.

Monday, June 10, 2013

Seven Deadly Sins of CRCs and Checksums

We're wrapping up the final report for an FAA-sponsored study of CRC and Checksum performance for aviation applications, although the results in general apply to all uses of those error detection codes.

As part of our results we came up with an informal list of "Seven Deadly Sins" (bad ideas):
  1. Picking a CRC based on a popularity contest instead of analysis
    • This includes using “standard” polynomials such as IEEE 802.3
  2. Saying that a good checksum is as good as a bad CRC
    • Many “standard” polynomials have poor HD at long lengths
  3. Evaluating with randomly corrupted data instead of BER fault model
    • Any useful error code looks good against randomly corrupted data; the discriminating test is small numbers of random bit flips (a BER fault model)
  4. Blindly using polynomial factorization to choose a CRC
    • It works for long dataword special cases, but not beyond that
    • Divisibility by (x+1) doubles undetected fraction on even # bit errors
  5. Failing to protect message length field
    • Results in the receiver looking in the wrong place for the FCS (frame check sequence), giving HD=1
  6. Failing to pick an accurate fault model and apply it
    • “We added a checksum, so ‘all’ errors will be caught” (untrue!)
    • Assuming a particular standard BER without checking the actual system
  7. Ignoring interaction with bit encoding
    • E.g., bit stuffing compromises HD to give HD=2
    • E.g., 8b10b encoding – seems to be OK, but depends on specific CRC polynomial
(I haven't tried to map it onto the more traditional sin list... if someone comes up with a clever mapping I'll post it!)

Thanks to Kevin Driscoll and Brendan Hall at Honeywell for their work as co-investigators. You can read more about the research on my CRC and Checksum Blog.  That blog has more detailed postings, slide sets, and will have the final research report when it is made publicly available.


Tuesday, May 15, 2012

CRC and Checksum Tutorial

I published a tutorial on CRCs and Checksums on the blog I maintain on that topic.  Since I get a lot of hits on those search terms here, I'm providing a pointer for those who are interested:
http://checksumcrc.blogspot.com/2012/05/crc-and-checksum-tutorial-slides.html


Sunday, February 19, 2012

CAN Protocol Vulnerabilities

Controller Area Network (CAN) is a very popular embedded network protocol. It's widely used in automotive applications, and the low cost that results makes it popular in a wide variety of other applications.

CAN is a very nice embedded protocol, and it's very appropriate for a wide variety of applications. But, like most protocols, it has some faults. And like many protocol faults, they aren't widely advertised. They are the sort of thing you have to worry about if you need to create an ultra-dependable system. For a robust system architecture handling everyday control functions, they are probably not a huge deal.

Here is a summary of CAN Protocol faults I've run into.

(1) Unprotected header. The length bits of a CAN message aren't protected by the CRC. That means a single corrupted bit in the length field can cause the CRC checker to look in the wrong place for the CRC. For example, if a 3-byte payload is corrupted to look like a 1-byte payload, the last two bytes of the payload will be interpreted as a CRC. This could cause an undetected 1-bit error. There are some other framing considerations that might or might not cause the error to be discovered, but it is a vulnerability in CAN that is fixed in later protocols via a dedicated header CRC (e.g., as done in FlexRay). I've never seen this written up for CAN in particular, but it is discussed on page 20 of [Driscoll09].

(2) Stuff bit errors compromising the CRC. A pair of bit errors in the bit-stuffed format of a CAN message (i.e., one on the wire) can cause a cascading of bit errors in the message, leading to two-bit errors that are undetected by the CRC. See [Tran99] and slides 15-16 of http://www.ece.cmu.edu/~ece649/lectures/11_can.pdf
In general there are issues in any embedded network if the physical layer encoding can cause the propagation of one physical data bit fault into affecting multiple data bits.

(NOTE: added 10/16/2013. I just found out about the CAN-FD protocol. Among other things, it includes the stuff bits in the CRC calculation to mitigate this problem. Whether or not there are any holes in the scheme I don't know, but it is clearly a step in the right direction.)

(3) CAN is intended to provide exactly-once delivery of messages via the use of a broadcast acknowledgement mechanism. However, there are faults with this mechanism. In one fault scenario, some nodes see an acknowledge but others don't, so some nodes will receive two copies of a message (thinking they are independent copies rather than a retransmission) while others receive just one copy. In another less likely (but plausible) scenario, the above happens but the sender fails before retransmitting, resulting in some recipients having a copy of the message and others never having received a copy. [Rufino98]

(4) It is possible for a CAN node to lock up the network due to retries if a transmitter sees a different value on the bus than it transmitted. (This is a failure of the transmitter, but the point is one node can take down the network.) [Perez03]

(5) If a CAN node's transmitter gets stuck at a dominant bit value, it will jam up the entire network. The error counters inside each CAN controller are supposed to help mitigate this, but if you have a stuck-at-"on" transmitter that ignores "turn off" commands, there isn't anything you can do with a standard CAN setup.

There are various countermeasures you might wish to take to combat the above failure modes for highly dependable applications. For example, you might put an auxiliary CRC or checksum in the payload (although that doesn't solve the stuff bit vulnerability). You might shut down the network if the CAN controller error counter shows a high error rate. Or you might just switch to a more robust protocol such as FlexRay.
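
As a rough illustration of the auxiliary payload CRC countermeasure (all names and the polynomial choice here are mine, for illustration; as noted above, this does not fix the stuff bit vulnerability), here is a sketch that packs an application payload, a sequence counter, and a CRC-8 into the 8-byte CAN data field. The sequence counter also helps detect the duplicated and lost messages of item (3):

    #include <stdint.h>
    #include <stddef.h>

    /* Bit-at-a-time CRC-8; polynomial 0x2F in conventional MSB-first form,
     * shown for illustration -- pick your polynomial by analysis. */
    static uint8_t crc8(const uint8_t *data, size_t len)
    {
        uint8_t crc = 0xFFu;   /* nonzero init catches added leading zero bytes */
        for (size_t i = 0; i < len; i++) {
            crc ^= data[i];
            for (int b = 0; b < 8; b++) {
                crc = (crc & 0x80u) ? (uint8_t)((crc << 1) ^ 0x2Fu)
                                    : (uint8_t)(crc << 1);
            }
        }
        return crc;
    }

    /* Pack a 6-byte application payload + sequence counter + CRC into
     * the 8-byte CAN data field. */
    void pack_can_payload(uint8_t frame[8], const uint8_t payload[6], uint8_t seq)
    {
        for (int i = 0; i < 6; i++) {
            frame[i] = payload[i];
        }
        frame[6] = seq;               /* detects duplicated/missing frames */
        frame[7] = crc8(frame, 7);    /* auxiliary CRC over payload + seq  */
    }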

Have you heard of any other CAN failure modes? If so, let me know.

References:

[Driscoll09] Driscoll, K., Hall, B., Koopman, P., Ray, J., DeWalt, M., Data Network Evaluation Criteria Handbook, AR-09/24, FAA, 2009.
http://www.ece.cmu.edu/~koopman/pubs/faa09-24_data_network_evaluation_criteria_handbook.pdf

[Tran99] Eushiuan Tran Tsung, Multi-Bit Error Vulnerabilities in the Controller Area Network Protocol, MS Thesis, Carnegie Mellon University ECE Dept, May 1999
http://www.ece.cmu.edu/~koopman/thesis/etran.pdf

[Rufino98] Rufino, J.; Verissimo, P.; Arroz, G.; Almeida, C.; Rodrigues, L., Fault-tolerant broadcasts in CAN, FTCS 1998, pp. 150-159.
http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=689464

[Perez03] Perez, J., Reorda, M., Veiolante, M.; Accurate dependability analysis of CAN-based networked systems, 2003 SBBCCI.
http://www.cad.polito.it/db/sbcci2003.pdf

Monday, January 2, 2012

New Blog on Checksums and CRCs

I've decided to start a separate blog for checksum and CRC postings. There are only a few posts at this point, but I expect it will accumulate quite a bit more material over time.

http://checksumcrc.blogspot.com/
Checksum and CRC Central: Cyclic Redundancy Codes, Checksums, and related issues for computer system data integrity.

The most recent post gives CRC polynomials for optimal protection of 8-bit, 16-bit, and 32-bit data words such as you might want for ensuring integrity of key variables. The blog you're reading now (Better Embedded System SW) will continue as before.

Here's wishing everyone a happy and prosperous new year!

Thursday, May 13, 2010

What's the best CRC polynomial to use?


(If you want to know more, see my Webinar on CRCs and checksums based on work sponsored by the FAA.)

If you are looking for a lightweight error detection code, a CRC is usually your best bet. There are plenty of tutorials on CRCs and a web search will turn them up. If you're looking at this post probably you've found them already.

The tricky part is the "polynomial" or "feedback" term that determines how the bits are mixed in the shift-and-XOR process. If you are following a standard of some sort then you're stuck with whatever feedback term the standard requires. But many times embedded system designers don't need to follow a standard -- they just need a "good" polynomial. For a long time folk wisdom was to use the same polynomial other people were using on the presumption that it must be good. Unfortunately, that presumption is often wrong!

Some polynomials in widespread use are OK, but many are mediocre, some are terrible if used the wrong way, and some are just plain wrong due to factors such as a typographical error.

Fortunately, after spending many CPU-years of computer time doing searches, a handful of researchers have come up with optimal CRC polynomials. You can find my results below. They've been cross-checked against other known results and published in a peer-reviewed academic paper. (This doesn't guarantee they are perfect! But they are probably right.)


Here is a thumbnail description of using the table. HD is the Hamming Distance, which is the minimum number of bit errors that can go undetected. For example, HD=4 means all 1-, 2-, and 3-bit errors are detected, but some 4-bit errors are undetected, as are some errors with more than 4 bits corrupted.

The CRC Size is how big the CRC result value is. For a 14-bit CRC, you add 14 bits of error detection to your message or data packet.

The bottom number in each box within the table is the CRC polynomial in implicit "+1" hex format, meaning the trailing "+1" is omitted from the polynomial number. For example, hex 0x583 = binary 101 1000 0011 = x^11 + x^9 + x^8 + x^2 + x + 1. (This is "Koopman" notation on the Wikipedia page. No, I didn't write the Wikipedia entry, and I wasn't trying to be gratuitously different. A lot of the comparison stuff happened after I'd already done too much work to have any hope of changing my notation without introducing mistakes.)
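
To make the notation concrete, here is a minimal bit-at-a-time sketch (my own illustration) using that 0x583 polynomial. To convert from the implicit "+1" form to the conventional MSB-first form, shift left one bit, put the +1 term back in, and drop the (implicit) x^11 term: ((0x583 << 1) | 1) & 0x7FF = 0x307.

    #include <stdint.h>
    #include <stddef.h>

    #define POLY_11  0x307u  /* 0x583 in implicit "+1" notation, MSB-first form */
    #define TOP_11   0x400u  /* highest bit of the 11-bit CRC register */
    #define MASK_11  0x7FFu

    uint16_t crc11_0x583(const uint8_t *data, size_t len)
    {
        uint16_t crc = 0;   /* initial value is a free design choice */
        for (size_t i = 0; i < len; i++) {
            crc ^= (uint16_t)(data[i] << 3);   /* align byte with register top */
            for (int b = 0; b < 8; b++) {
                crc = (crc & TOP_11) ? (uint16_t)(((crc << 1) ^ POLY_11) & MASK_11)
                                     : (uint16_t)((crc << 1) & MASK_11);
            }
        }
        return crc;
    }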

The top number in each box is the maximum data word length you can protect at that HD. For example, the polynomial 0x583 is an 11-bit CRC that can provide HD=4 protection for all data words up to 1012 bits in length.  (1012+11 gives a 1023 bit long combined data word + CRC value.)

You can find the long version in this paper:  Koopman, P. & Chakravarty, T., "Cyclic Redundancy Code (CRC) Polynomial Selection For Embedded Networks," DSN04, June 2004. Table 4 lists many common polynomials, their factorizations, and their relative performance. It covers up to 16-bit CRCs. Longer CRCs are a more difficult search and the results aren't quite published yet.

You can find more discussion about CRCs and Checksums at my blog on that topic: http://checksumcrc.blogspot.com/

Monday, May 10, 2010

Which Error Detection Code Should You Use?

Any time you send a message or save some data that might be corrupted in storage, you should think about using some sort of error detection code so you can tell if the data has been corrupted. If you do a web search you will find a lot of information about error detection codes. Some of it is great stuff. But much of it is incorrect, or on a good day merely suboptimal. It turns out that the usual rule of thumb of "do what the other guys do and you can't go far wrong" works terribly for error detection. There is lots of folk wisdom that just isn't right.


So, here is a guide to simple error detection in embedded systems. There is a journal paper with all the details (see this link), but this is the short version.

If you want the fastest possible computation with basic error detection:
  • Parity is fine if you have only one bit to spend, but takes about as much work to compute as a checksum.
  • Stay away from XOR checksums (often called LRCs or Longitudinal Redundancy Checks). Use an additive checksum instead to get better error detection at the same cost.
  • Use an additive checksum if you want something basic and fast. If possible, use a one's complement additive checksum instead of normal addition. This involves adding up all the bytes or words of your data using one's complement addition and saving the final sum as the checksum. One's complement addition cuts vulnerability to undetected errors in the top bit of the checksum in half. In a pinch normal integer addition will work, but gives up some error detection capability.
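
Here is a minimal sketch of a one's complement additive checksum over 16-bit words (my own illustration; it assumes the data is a whole number of 16-bit words):

    #include <stdint.h>
    #include <stddef.h>

    uint16_t ones_complement_checksum(const uint16_t *words, size_t count)
    {
        uint32_t sum = 0;
        for (size_t i = 0; i < count; i++) {
            sum += words[i];
            sum = (sum & 0xFFFFu) + (sum >> 16);  /* end-around carry */
        }
        return (uint16_t)sum;  /* the Internet checksum stores ~sum instead */
    }

The end-around carry (folding the carry back into the low bits) is what distinguishes one's complement addition from ordinary integer addition, and it is what recovers the error detection capability in the checksum's top bit.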

If you want intermediate computation speed and intermediate error detection:
  • Use a Fletcher checksum. Make sure that you use one's complement addition in computing the parts of that checksum, not normal integer addition. Normal integer addition just kills error detection performance for this approach. (A sketch follows this list.)
  • Don't use an Adler checksum. In most cases it isn't as good as a Fletcher checksum and it is a bit slower to compute. The Adler checksum seems like a cool idea but it doesn't really pay off compared to a Fletcher checksum of the same size.
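
Here is a minimal Fletcher-16 sketch using one's complement style (mod-255) addition, per the advice above (my own illustration; a production version would typically defer the modulo operation for speed):

    #include <stdint.h>
    #include <stddef.h>

    uint16_t fletcher16(const uint8_t *data, size_t len)
    {
        uint16_t sumA = 0;   /* running sum of the data bytes            */
        uint16_t sumB = 0;   /* running sum of sumA: adds position info  */
        for (size_t i = 0; i < len; i++) {
            sumA = (uint16_t)((sumA + data[i]) % 255u);  /* mod 255 acts as one's complement addition */
            sumB = (uint16_t)((sumB + sumA) % 255u);
        }
        return (uint16_t)((sumB << 8) | sumA);
    }
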
If you can afford to spend a little more computation speed to get a lot better error detection:
  • Use a CRC (cyclic redundancy check)
  • If you are worried about speed there are a variety of table lookup methods that trade memory for speed. CRCs aren't really as slow as people think they will be. Probably you can use a CRC, and you should if you can.  Mike Barr has a posting on CRC implementations.
  • Use an optimal CRC polynomial if you don't have to conform to a standard. If you use a commonly used polynomial because other people use it, probably you are missing out on a lot of error detection capability. (More on this topic in a later post.)
You can find more discussion about CRCs and Checksums at my blog on that topic: http://checksumcrc.blogspot.com/