Consequences:
Unintentional modification of data values can cause arbitrarily bad system behavior, even if only one bit is modified. Therefore, safety critical systems should take measures to prevent or detect such corruptions, consistent with the estimated risk of hardware-induced data corruption and anticipated software quality. Note that even best practices may not prevent corruption due to software concurrency defects.
Accepted Practices:
- When hardware data corruption may occur and automatic error correction is desired, use hardware single error correction/multiple error detection (SECMED) circuitry, a form of Error Detection and Correction (EDAC) circuitry sometimes just called Error Correcting Code (ECC) circuitry, for all bytes of RAM. This protects against hardware memory corruption, including hardware corruption of operating system variables. However, it does not protect against software memory corruption.
- Use a software approach such as a cyclic redundancy code (CRC; preferred) or a checksum to detect a corrupted program image, and test for corruption at least when the system is booted (see the boot-time check sketch after this list).
- Use a software approach such as keeping redundant copies to detect software data corruption of RAM values.
- Use fault injection to test data corruption detection and correction mechanisms.
- Perform memory tests to ensure there are no hard faults.
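As an illustration of the boot-time image check practice, here is a minimal C sketch. The image location symbols (FLASH_IMAGE_START, FLASH_IMAGE_SIZE) and the placement of the build-time reference CRC are hypothetical placeholders; a real project would take these from its linker map and would typically use a vetted CRC implementation.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical image location; a real project takes these from its
 * linker map. The build tools append the reference CRC after the image. */
#define FLASH_IMAGE_START ((const uint8_t *)0x08000000u)   /* assumed base address */
#define FLASH_IMAGE_SIZE  ((size_t)0x0001FFFCu)            /* bytes covered by the CRC */
#define FLASH_STORED_CRC  (*(const uint32_t *)(FLASH_IMAGE_START + FLASH_IMAGE_SIZE))

/* Bitwise CRC-32 (reflected polynomial 0xEDB88320): slow but table-free,
 * which suits a one-time check at boot. */
static uint32_t crc32_compute(const uint8_t *data, size_t len)
{
    uint32_t crc = 0xFFFFFFFFu;
    for (size_t i = 0; i < len; i++) {
        crc ^= data[i];
        for (int bit = 0; bit < 8; bit++) {
            crc = (crc >> 1) ^ (0xEDB88320u & (uint32_t)-(int32_t)(crc & 1u));
        }
    }
    return ~crc;
}

/* Returns nonzero if the image in flash matches the CRC stored at build time.
 * Call from the reset handler before starting the application. */
int boot_image_ok(void)
{
    return crc32_compute(FLASH_IMAGE_START, FLASH_IMAGE_SIZE) == FLASH_STORED_CRC;
}
```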
Discussion:
Safety critical systems must protect against data corruption because even small changes in data can render the system unsafe. A single bit in memory changing the wrong way could move a system from safe to unsafe. To guard against this, various schemes for memory corruption detection and prevention are used.
[Figure: Effect of a bit flip. Source: http://dependablesystem.blogspot.com/2012/05/flip-happens.html]
Hardware and Software Are Both Corruption Sources
Hardware memory corruption occurs when a radiation event, voltage fluctuation, source of electrical noise, or other cause makes one or more bits flip from one value to another. In non-volatile memory such as flash memory, wearout, low programming voltage, or electrical charge leakage over time can also cause bits in memory to have an incorrect value. Mitigation techniques for these types of memory errors include the use of hardware error detection/correction codes (sometimes called "EDAC") for RAM, and typically the use of a "checksum" for flash memory to ensure that all the bytes, when "added up," give the same total as they did when the program image was stored in flash.
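The "added up" flash checksum just described can be sketched in a few lines of C (the 16-bit sum width is an assumption; a real system picks a width and a stored-total location to suit its image format):

```c
#include <stdint.h>
#include <stddef.h>

/* Additive checksum over a flash image: sum of all bytes, modulo 2^16.
 * Compare the result against the total stored when the image was built. */
uint16_t flash_checksum(const uint8_t *image, size_t len)
{
    uint16_t sum = 0u;
    for (size_t i = 0; i < len; i++) {
        sum = (uint16_t)(sum + image[i]);
    }
    return sum;
}
```

A simple sum misses some error patterns (for example, two byte errors that cancel each other out), which is one reason the accepted practices above list a CRC as preferred.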
If hardware memory error detection support is not available, RAM can also be protected with some form of redundant storage. A common practice is to store two copies of a value in two different places in RAM, often with one copy inverted or otherwise manipulated. It is important not to store the two copies next to each other, so that an error corrupting adjacent bits in memory cannot affect both copies. Rather, there should be two entirely different sections of memory for mirrored variables, with each section holding only one copy of each mirrored variable. That way, if a small chunk of memory is arbitrarily corrupted, it can affect at most one of the two copies of any mirrored variable. Error detection codes such as checksums can also be used; compared to simple replication of data, they trade increased computation time on every change for less storage space devoted to error detection information.
Software memory corruption occurs when, due to a software defect, one part of a program mistakenly writes data to a location that should only be written by another part of the program. This can happen as a result of a defect that produces an incorrect pointer into memory, a buffer overflow (e.g., trying to put 17 bytes of data into a 16 byte storage area), a stack overflow, or a concurrency defect, among other scenarios.
Hardware error detection does not help in detecting software memory corruption, because the hardware will ordinarily assume that software has permission to make any change it likes. (There may be exceptions if the hardware has a way to "lock" portions of memory against modification, which is not the case here.) Software error detection may help if the corruption is random, but may not help if the corruption results from defective software following authorized RAM modification procedures that just happen to point to the wrong place when modifications are made. While various approaches to reducing the chance of accidental data corruption can be envisioned, acceptable practice for safety critical systems in the absence of redundant computing hardware calls for, at a minimum, storing redundant copies of data. There must also be a recovery plan, such as a system reboot or restoration to defaults, for when a corruption is detected.
Data Mirroring
A common approach to providing data corruption protection is to use a data mirroring approach in which a second copy of a variable having a
one’s complement value is stored in addition to the ordinary variable value. A one’s complement representation of a number is
computed by inverting all the bits in a number. So this means one copy of the
number is stored normally, and the second copy of that same number is stored
with all the bits inverted (“complemented”). As an example, if the original
number is binary “0000” the one’s complement mirror copy would be “1111.” When
the number is read, both the “0000” and the “1111” are read and compared to
make sure they are exact complements of each other. Among other things, this
approach gives protection against a software defect or hardware corruption that
sets a number of RAM locations to all be the same value. That sort of corruption
can be detected regardless of the constant corruption value put into RAM,
because two mirrored copies can’t have the same value unless at least one of
the pair has been corrupted (e.g., if all zeros are written to RAM, both copies
in a mirrored pair will have the value “0000,” indicating a data corruption has
occurred).
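A minimal sketch of this one's complement mirroring scheme in C follows. The array-based layout and function names are illustrative assumptions; a production design would place the two copies in separate RAM sections via the linker script and define an application-specific recovery action.

```c
#include <stdint.h>
#include <stdbool.h>

#define NUM_MIRRORED 16u

/* Two copies in distinct arrays. A production design would place these in
 * separate RAM sections (e.g., via the linker script) so that one errant
 * multi-byte write cannot hit both copies of the same variable. */
static uint16_t mirror_primary[NUM_MIRRORED];
static uint16_t mirror_shadow[NUM_MIRRORED];   /* one's complement copies */

void mirrored_write(uint8_t idx, uint16_t value)
{
    mirror_primary[idx] = value;
    mirror_shadow[idx]  = (uint16_t)~value;    /* store bit-inverted copy */
}

/* Returns true and delivers the value only if the two copies are exact
 * complements; returns false on detected corruption so the caller can
 * run its recovery plan. */
bool mirrored_read(uint8_t idx, uint16_t *out)
{
    uint16_t a = mirror_primary[idx];
    uint16_t b = mirror_shadow[idx];
    if ((uint16_t)(a ^ b) != 0xFFFFu) {
        return false;          /* mismatch: at least one copy corrupted */
    }
    *out = a;
    return true;
}
```

On a failed read, the caller would carry out the recovery plan discussed above, for example restoring a documented safe default value or forcing a system reboot.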
Mirroring can also help detect hardware bit flip corruptions. A bit flip is when a binary value (a 0 or 1) is corrupted to the opposite value (a 1 or 0 respectively), which in turn corrupts the number stored at the memory location suffering one or more bit flips. So long as only one of the two mirror values suffers a bit flip, the corruption will be detectable because the two copies won't be exact complements of each other.
A good practice is to ensure that the mirrored values are not adjacent to each other, so that an erroneous multi-byte variable update is less likely to modify both the original and the mirrored copy. Mirrored copies remain vulnerable to a pair of independent bit flips that happen to land on the same bit position within each of the two complemented stored values. Therefore, for highly critical systems a Cyclic Redundancy Check (CRC) or other more advanced error detection method is recommended (see the sketch below).
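As a sketch of the "group of critical variables protected by a CRC" alternative (a technique Stepner describes in the Selected Sources below), the following is one hypothetical arrangement. The field names are invented for illustration, and crc32_compute() stands in for whatever vetted CRC routine the project uses (such as the one in the boot-check sketch above).

```c
#include <stdint.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical group of critical variables kept together and covered by a
 * CRC; field names are invented for illustration. Fields are ordered so the
 * struct has no padding bytes, which would otherwise pollute the CRC. */
struct critical_group {
    uint16_t throttle_limit;
    uint16_t rpm_limit;
    uint32_t crc;          /* CRC over all fields above this one */
};

static struct critical_group g_crit;

/* Vetted CRC routine supplied elsewhere (e.g., the boot-check sketch). */
extern uint32_t crc32_compute(const uint8_t *data, size_t len);

static uint32_t crit_crc(const struct critical_group *g)
{
    return crc32_compute((const uint8_t *)g, offsetof(struct critical_group, crc));
}

/* All writes go through one routine so the CRC is always kept current. */
void crit_update(uint16_t throttle_limit, uint16_t rpm_limit)
{
    g_crit.throttle_limit = throttle_limit;
    g_crit.rpm_limit      = rpm_limit;
    g_crit.crc            = crit_crc(&g_crit);
}

/* Check integrity immediately before each use of the critical values. */
bool crit_ok(void)
{
    return g_crit.crc == crit_crc(&g_crit);
}
```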
It is important to realize that all memory values that could conceivably cause a system hazard need to be protected by mirroring, not just a portion of memory. For example, a safety-critical Real Time Operating System (RTOS) will have values in memory that control task scheduling. Corruption of these variables can lead to task death or other problems if the RTOS doesn't protect its own data integrity, even if the application software does use mirroring. There are other ways for an RTOS to protect its data from software and hardware defects, such as hardware access protection. But if the only mechanism a system uses to prevent memory corruption is mirroring, the RTOS has to use it too, or the system has a vulnerability.
Selected Sources:
Automotive electronics designers knew as early as 1982 that data corruption could be expected in automotive electronics. Seeger writes: "Due to the electrically hostile environment that awaits a microprocessor based system in an automobile, it is necessary to use extra care in the design of software for those systems to ensure that the system is fault tolerant. Common faults that can occur include program counter faults, altered RAM locations, or erratic sensor inputs." (Seeger 1982, abstract, emphasis added). Automotive designers generally accepted the fact that RAM location disruptions would happen in automotive electronics (due to electromagnetic interference (EMI), radiation events, or other disturbances), and had to ensure that any such disruption would not result in an unsafe system.
Stepner, in a paper on real time operating systems that features a discussion of OSEK (the trade name of an automotive-specific real time operating system), states with regard to avoiding corruption of data: "One technique is the redundant storage of critical variables, and comparison prior to being used. Another is the grouping of critical variables together and keeping a CRC over each group." (Stepner 1999, pg. 155).
Brown says "We've all heard stories of bit flips that were caused by cosmic rays or EMI" and goes on to describe a two-out-of-three voting scheme to recover from variable corruption. (Brown 1998, pp. 48-49). A variation that keeps only two copies permits detection, but not correction, of corruption. Brown also acknowledges that designers must account for software data corruption, saying: "Another, and perhaps more common, cause of memory corruption is a rogue pointer, which can run wild through memory leaving a trail of corrupted variables in its wake. Regardless of the cause, the designer of safety-critical software must consider the threat that sometime, somewhere, a variable will be corrupted." (id., p. 48).
Kleidermacher says: "When all of an application's threads share the same memory space, any thread could—intentionally or unintentionally—corrupt the code, data, or stack of another thread. A misbehaved thread could even corrupt the kernel's own code or internal data structures. It's easy to see how a single errant pointer in one thread could easily bring down the entire system, or at least cause it to behave unexpectedly." (Kleidermacher 2001, pg. 23). Kleidermacher advocates hardware memory protection, but in the absence of a hardware mechanism, software mechanisms are required to mitigate memory corruption faults.
Fault injection is a way to test systems to see how they respond to faults in memory or elsewhere (see also an upcoming post on that topic). Fault injection can be performed in hardware (e.g., by exposing a hardware circuit to a radiation source or by using hardware test features to modify bit values), or via software means (e.g., slightly modifying software to permit flipping bits in memory to simulate a hardware fault). In a research paper, Vinter used a hybrid hardware/software fault injection technique to corrupt bits in a computer running an automotive-style engine control application. The conclusions of this paper start by saying: "We have demonstrated that bit-flips inside a central processing unit executing an engine control program can cause critical failures, such as permanently locking the engine's throttle at full speed." (Vinter 2001). Fault injection remains a preferred technique for determining whether there are data corruption vulnerabilities that can result in unsafe system behavior.
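A software-only fault injection campaign along these lines can be sketched in C as follows. This is a test-build illustration under simplifying assumptions (rand() as the fault location source, a single mirrored pair as the target); real campaigns, like Vinter's hybrid approach, randomize over the whole memory image and across injection times.

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Flip one pseudo-randomly chosen bit in a memory region, simulating a
 * single-bit hardware upset (test builds only). */
static void inject_bit_flip(uint8_t *region, size_t len)
{
    size_t byte = (size_t)rand() % len;
    region[byte] ^= (uint8_t)(1u << (rand() % 8));
}

/* Tiny campaign against a mirrored pair: every single-bit flip breaks the
 * one's complement relationship, so detected should equal trials. */
int main(void)
{
    struct { uint16_t value; uint16_t mirror; } pair;
    unsigned detected = 0u, trials = 1000u;

    for (unsigned i = 0u; i < trials; i++) {
        pair.value  = 0x1234u;                 /* re-arm the target */
        pair.mirror = (uint16_t)~pair.value;
        inject_bit_flip((uint8_t *)&pair, sizeof pair);
        if ((uint16_t)(pair.value ^ pair.mirror) != 0xFFFFu) {
            detected++;                        /* corruption caught */
        }
    }
    printf("detected %u of %u injected faults\n", detected, trials);
    return 0;
}
```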
References:
- Brown, D., "Solving the software safety paradox," Embedded Systems Programming, December 1998, pp. 44-52.
- Kleidermacher, D. & Griglock, M., Safety-Critical Operating Systems, Embedded Systems Programming, Sept. 2001, pp. 22-36.
- Seeger, M., Fault-Tolerant Software Techniques, SAE Report 820106, International Congress & Exposition, Society of Automotive Engineers, 1982, pp. 119-125.
- Stepner, D., Nagarajan, R., & Hui, D., Embedded application design using a real-time OS, Design Automation Conference, 1999, pp. 151-156.
- Vinter, J., Aidemark, J., Folkesson, P. & Karlsson, J., Reducing critical failures for control algorithms using executable assertions and best effort recovery, International Conference on Dependable Systems and Networks, 2001, pp. 347-356.
Comments:
Dear Mr. Koopman,
First of all, thanks a lot for your very interesting blog.
Regarding this post, I have a question about data mirroring and its efficiency/coverage for corruptions due to software faults. We can consider this mechanism well suited to ensuring tolerance of random hardware failures when, for example, a mechanism such as ECC is not available. But what is your position on the added value/coverage of such a mechanism for corruption due to software faults? For software faults, the strategy could first be based on fault avoidance/removal during the development phase through code analysis, in particular with the support of tools that detect runtime errors which could lead to memory corruption (pointer issues, etc.). Second, in the case of coexistence in the same software of components with lower integrity levels, hardware mechanisms such as an MPU could be used. It can be difficult to mirror all the safety-related data of a piece of software, but perhaps mirroring could be applied only to the most safety-critical data (i.e., static variables that can lead directly to an unwanted safety-related event, identified through safety analysis).
Thanks!
Best regards
Mirroring will help with at least some software faults and some hardware faults. How much it helps depends upon the structure of your software and how you implement mirroring (for example, does the subroutine that implements mirrored variable writing check the task ID to make sure it is being called from a critical task and not a non-critical task?). It also, as you suggest, depends on what fraction of memory values are mirrored.
I'd be very careful about mirroring only some things and claiming that you've covered all the critical data -- it can be very difficult to know what indirect results will follow from corrupting data.
If you need to go down that path, I'd suggest running some fault injection campaigns to see what happens when variables get corrupted. Don't forget that data on the stack can be corrupted too.
A significant limitation of mirroring, and of error detection in general, is that some faults happen in the logic and peripherals of the CPU -- not just in data stored in RAM.