Paraphrasing a favorite saying of my colleague Roy Maxion: "You may forget about unlikely faults, but they won't forget about you."
Random faults can and do occur in hardware on an infrequent but inevitable basis. If you have a large number of embedded systems deployed you can count on this happening to your system. But, how severe can the consequences be from a random bit flip? As it turns out, this can be a source of system-killer problems that have hit more than one company, and can cause catastrophic failures of safety critical systems.
While random faults are usually transient events, even the most fleeting of faults will at least sometimes result in a persistent error that can be unsafe. “Hardware component faults may be permanent, transient or intermittent, but design faults will always be permanent. However, it should be remembered that faults of any of these classes may result in errors that persist within the system.” (Storey 1996, pg. 116)
Not every random fault will cause a catastrophic system failure. But based on experimental evidence and published studies, it is virtually certain that at least some random faults will cause catastrophic system failures in a system that is not fully protected against all possible single-point faults and likely combinations of multi-point faults.
Addy presented a case study of an embedded real time control system. The system architecture made heavy use of global variables (which is a bad idea for other reasons (Koopman 2010)). Analysis of test results showed that a single bit flip in a global variable caused an unsafe action for the system studied despite redundant checks in software (Addy 1991, pg. 77). This means that a single bit flip causing unsafe system behavior has to be considered plausible, because it has actually been demonstrated in a safety critical control system.
Addy also reported a fault in which one process set a value, and then another process took action upon that value. Addy identified a failure mode in the first process of his system due to task death that rendered the operator unable to control the safety-critical process:
Excerpt from (Addy 1991, pg. 80)
In Addy’s case a display is frozen and control over the safety critical process is lost. As a result of this experience, Addy recommends a separate safety process (Addy 1991, pg. 81).
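As a concrete illustration of that recommendation (this is a generic sketch, not code from the system Addy studied), one common way to implement an independent safety monitor is shown below: the control task increments a heartbeat counter, and a separate monitor task forces a safe state if the counter stops changing because the control task has died or hung. The names control_heartbeat and enter_safe_state, and the task structure, are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical sketch of a separate safety monitor in the spirit of Addy's
 * recommendation. The control task increments a heartbeat counter; an
 * independent monitor task forces a safe state if the counter stops
 * changing. enter_safe_state() is an assumed system-specific action. */

static volatile uint32_t control_heartbeat;

void control_task_step(void)
{
    /* ... normal control computations ... */
    control_heartbeat++;              /* proof of liveness */
}

extern void enter_safe_state(void);   /* assumed platform-specific */

/* Called periodically from an independent task or timer interrupt. */
void safety_monitor_step(void)
{
    static uint32_t last_seen;
    static bool first_call = true;

    if (!first_call && (control_heartbeat == last_seen)) {
        enter_safe_state();           /* control task appears dead or hung */
    }
    last_seen = control_heartbeat;
    first_call = false;
}
```

For the monitoring to add real protection, the monitor must be independent enough (separate task, separate watchdog timer, or separate processor) that whatever kills the control task does not also kill the monitor.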
Sun Microsystems learned the Single Event Upset (SEU) lesson the hard way in 2000, when it had to recall high-end servers that lacked error detection/correction codes and were crashing due to random faults in SRAM. (Forbes, 2000) Fortunately no deaths were involved, since these were servers rather than safety critical embedded systems.
Excerpt from Forbes 2000.
Cisco also had a significant SEU problem with their 12000 series router line cards. They went so far as to publish a description and workaround notification, stating: “Cisco 12000 line cards may reset after single event upset (SEU) failures. This field notice highlights some of those failures, why they occur, and what work arounds are available.” (Cisco 2003) Again, these were not safety critical systems.
More recently, Skarin & Karlsson presented a study of fault injection results on an automotive brake-by-wire system to determine if random faults can cause an unsafe situation. They found that 30% of random errors caused erroneous outputs, and of those 15% resulted in critical braking failures. (Skarin & Karlsson 2008) In the context of this brake-by-wire study, critical failures were either loss of braking or a locked wheel during braking. (id., p. 148) Thus, it is clear that these sorts of errors can lead to critical operational faults in electronic automotive controls. It is worth noting that random error rates tend to increase dramatically with altitude, making this a relevant environmental condition (Heijmen 2011, p. 6).
Mauser reports on a Siemens Automotive study of electronic throttle control for automobiles (Mauser 1999). The study specifically accounted for random faults (id., p. 732), as well as considering the probability of “runaway” incidents (id., p. 734). It found a possibility of single point failures, and in particular identified dual redundant throttle electrical signals being read by a single shared (multiplexed) analog to digital converter in the CPU (id., p. 739) as a critical flaw.
For automotive safety applications, the MISRA Software Guidelines discuss the need to identify and correct corruption of memory values (MISRA Software Guidelines p. 30).
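As an illustration of the kind of memory-value protection being called for (a generic sketch, not code from the MISRA guidelines), a safety-relevant value can be stored together with its bitwise complement so that corruption of either copy is detected on every read. The protected_u16 type and function names below are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative sketch: a safety-relevant value stored with a complemented
 * shadow copy so that single-bit (and most multi-bit) corruption of either
 * copy is detected each time the value is read. */

typedef struct {
    uint16_t value;
    uint16_t inverse;   /* always maintained as ~value */
} protected_u16;

static void protected_write(protected_u16 *p, uint16_t v)
{
    p->value   = v;
    p->inverse = (uint16_t)~v;
}

/* Returns true and stores the value in *out if both copies agree;
 * returns false on a mismatch so the caller can restore a default
 * value or enter a safe state. */
static bool protected_read(const protected_u16 *p, uint16_t *out)
{
    if (p->value == (uint16_t)~(p->inverse)) {
        *out = p->value;
        return true;
    }
    return false;
}
```

Whether a detected mismatch is repaired from another redundant copy, reloaded from defaults, or treated as a trigger for a safe state is a system-specific design decision.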
Random Faults Can Be Impossible to Reproduce In General
Intermittent faults from SEUs and other sources can and do occur in the field, causing real problems that are not readily reproducible even if the cause is a hard failure. In many cases components with faults are diagnosed as “Trouble Not Identified” (TNI). In my work with embedded system companies I have learned that it is common to have TNI rates of 50% for complex electronic components across many embedded industries. (This means that half of the components returned to the factory as defective are found to be working properly when normal testing procedures are used.) While it is tempting to blame the customer for incorrect operation of equipment, the reality is that many times a defect in software design, hardware design, or manufacturing is ultimately responsible for these elusive problems. Thomas et al. give a case study of a TNI problem on Ford vehicles and conclude that it is crucial to perform root cause analysis of TNI problems in safety critical electronics rather than just assume it is the customer’s fault for not using the equipment properly or falsely reporting an error. (Thomas 2002)
TNI reports must be tracked down to root cause. (Thomas 2002, pg. 650)
The point is not really the exact number, but rather that random failures are the sort of thing that can be expected to happen on a daily basis for a large deployed fleet of safety-critical systems, and therefore must be guarded against rigorously. Moreover, lack of reproducibility does not mean the failure didn’t happen. In fact, irreproducible failures and very-difficult-to-reproduce failures are quite common in modern electronic components and systems.
Random Faults Are Even Harder To Reproduce In System-Level Testing
While random faults happen all the time in large fleets, detecting and reproducing them is not so easy. For example: “Studies have shown that transient faults occur far more often than permanent ones and are also much harder to detect.” (Yu 2003 pg. 11)
It is also well known that some software faults appear random, even though they may be caused by a software defect that could be fixed if identified. Yu explains that this is a common situation:
Software faults can be very difficult to track down (Yu 2003, p. 12)
Random faults cannot be assumed to be benign, and cannot be assumed to always result in a system crash or system reset that puts the system into a safe state. “Hardware component faults may be permanent, transient or intermittent, but design faults will always be permanent. However, it should be remembered that faults of any of these classes may result in errors that persist within the system.” (Storey 1996, p. 116) “Many designers believe that computers fail safe, whereas NASA experience has shown that computers may exhibit hazardous failure modes.” (NASA 2004, p. 21)
Moreover, random faults are just one type of fault that is expected on a regular, if infrequent, basis. “Random faults, like end-of-life failures of electrical components, cannot be designed away. It is possible to add redundancy so that such faults can be easily detected, but no one has ever made a CPU that cannot fail.” (Douglass 1999, pg. 105, emphasis added)
MISRA states that “system design should consider both random and systematic faults.” (MISRA Software Guidelines p. 6). It also states “Any fault whose initiating conditions occur frequently will usually have emerged during testing before production, so that the remaining obscure faults may appear randomly, even though they are entirely systematic in origin.” (MISRA Report 2 p. 7). In other words, software faults can appear random, and are likely to be elusive and difficult to pin down in products – because the easy ones are the ones that get caught during testing.
SEUs and other truly random faults are not reproducible in normal test conditions because they depend upon rare uncontrollable events (e.g., cosmic ray strikes). “The result is an error in one bit, which, however, cannot be duplicated because of its random nature.” (Tang 2003, pg. 111)
While intermittent faults might manifest repeatedly over a long series of tests with a particular individual vehicle or other embedded system, doing so may require precise reproduction of environmental conditions such as vibration, temperature, humidity, voltage fluctuations, operating conditions, and other factors.
Random faults may be difficult to track down, but can result in errors that persist within the system (Storey 1996, pp. 115-116)
It is unreasonable to expect a fault caused by a random error to perform upon demand in a test setup run for a few hours, or even a few days. It can easily take weeks or months to reproduce random software defects, and some may never be reproduced simply by performing system level testing in any reasonable amount of time. If a fleet of hundreds of thousands of vehicles or other safety critical embedded systems only sees random faults a few times a day, testing a single vehicle is unlikely to see that same random fault in a short test time – or possibly it may never see a particular random fault over that single test vehicle’s entire operating life.
Thus, it is to be expected that realistic faults – faults that happen to all types of computers on an everyday basis when large numbers of them are deployed – can’t necessarily be reproduced upon demand in a single vehicle simply by performing vehicle-level tests.
Software Faults Can Also Appear To Be Random and Irreproducible In System-Level Testing
Software data corruption includes changes made during operation to RAM and other hardware resources (for example, registers controlling I/O peripherals that can be altered by software). Unlike hardware data corruption, software data corruption is caused by design defects in software rather than disruption of hardware operating state. The defects may be quite subtle and seemingly random, especially if they are caused by timing problems (also known as race conditions). Because software data corruption is caused by a CPU writing values to memory or registers under software control, just as the hardware would expect to happen with correctly working software, hardware data corruption countermeasures such as error detecting codes do not mitigate this source of errors.
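To make that distinction concrete, here is a minimal, hypothetical C sketch (not taken from any cited system) of one common source of software data corruption: a non-atomic read-modify-write on a variable shared between an interrupt handler and the main loop. The corrupting write is issued by the CPU exactly as the software commands, so ECC memory or similar hardware protection sees nothing wrong; the fix has to come from the software design. The interrupt enable/disable primitives are assumed to be platform-specific.

```c
#include <stdint.h>

/* Hypothetical illustration of software data corruption caused by a race
 * condition: an interrupt handler and the main loop both perform a
 * non-atomic read-modify-write on a shared counter. */

volatile uint32_t shared_count;

void sensor_isr(void)
{
    shared_count++;                     /* read-modify-write inside an ISR */
}

void main_loop_step(void)
{
    /* BUG: if sensor_isr() fires between this read and write, its update is
     * silently lost. The failure depends on timing, so it looks random and
     * is extremely hard to reproduce in system-level testing. */
    shared_count = shared_count + 1u;
}

/* One mitigation: make the update a critical section. */
extern void disable_interrupts(void);   /* assumed platform-specific */
extern void enable_interrupts(void);

void main_loop_step_fixed(void)
{
    disable_interrupts();
    shared_count = shared_count + 1u;
    enable_interrupts();
}
```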
It is well known that software defects can cause corruption of data or control flow information. Since such corruption is due to software defects, one would expect the frequency at which it happens to depend heavily on the quality of the software. Generally developers try to find and eliminate frequent sources of memory corruption, leaving the more elusive and infrequent problems in production code. Horst et al. used modeling and measurement to estimate that a corruption would occur once per month with a population of 10,000 processors (1993, abstract), although that software was written to non-safety-critical levels of software quality. Sullivan and Chillarege used defect reports to understand “overlay errors” (the IBM term for memory corruption), and found that such errors had much higher impact on the system than other types of defects (Sullivan 1991, p. 9).
MISRA recommends using redundant data or a checksum (MISRA Software Guidelines p. 30; MISRA Report 1 p. 21). NASA recommends using multiple copies or a CRC to guard against variable corruption (NASA 2004, p. 93). IEC 61508-3 recommends protecting data structures against corruption (p. 72).
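As a rough sketch of the CRC approach these documents describe (illustrative only; the structure, field names, and choice of CRC-16-CCITT are assumptions for this example, not taken from MISRA, NASA, or IEC 61508), a block of safety-relevant parameters can carry a CRC that is recomputed and checked before the values are trusted:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative sketch of CRC-protecting safety-relevant parameters.
 * Assumes no padding bytes before the crc field. */
typedef struct {
    uint16_t max_rpm;          /* example parameters */
    uint16_t throttle_limit;
    uint16_t crc;              /* CRC over the fields above */
} critical_params;

/* Bitwise CRC-16-CCITT (polynomial 0x1021, initial value 0xFFFF). */
static uint16_t crc16_ccitt(const uint8_t *data, size_t len)
{
    uint16_t crc = 0xFFFFu;
    for (size_t i = 0u; i < len; i++) {
        crc ^= (uint16_t)((uint16_t)data[i] << 8);
        for (int bit = 0; bit < 8; bit++) {
            crc = (crc & 0x8000u) ? (uint16_t)((crc << 1) ^ 0x1021u)
                                  : (uint16_t)(crc << 1);
        }
    }
    return crc;
}

/* Recompute the CRC after any legitimate update... */
static void params_seal(critical_params *p)
{
    p->crc = crc16_ccitt((const uint8_t *)p, offsetof(critical_params, crc));
}

/* ...and check it every time the values are about to be trusted. */
static bool params_ok(const critical_params *p)
{
    return p->crc ==
           crc16_ccitt((const uint8_t *)p, offsetof(critical_params, crc));
}
```

A failed check should be handled like any other detected corruption: restore the data from a redundant copy or defaults, or transition to a safe state.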
It is also well known that some software faults appear random, even though they may be caused by a software defect that could be fixed if identified. Yu explains that this is a common situation, as shown in a previous figure (Yu 2003, p. 12). Software faults can appear to be irreproducible at the system level, making it virtually impossible to reproduce them upon demand with system level testing of an unmodified system.
It is reasonable to expect any software – and especially software that has not been developed in rigorous accordance with high-integrity software procedures – to have residual bugs that can cause memory corruption, crashes, or other malfunctions. The figure below provides an example software malfunction from an in-flight entertainment system. Such examples abound if you keep an eye out for them.
Aircraft entertainment system software failure. February 13, 2008 (photo by author)
Coming at this another way, saying that something can't happen because it can't be reproduced in the lab ignores how random hardware and software faults manifest in the real world. Arbitrarily bad random faults can and will happen eventually in any widely deployed embedded system. Designers have to make a choice between ignoring this reality, or facing it head-on to design systems that are safe despite such faults occurring.
- Addy, E., A case study on isolation of safety-critical software, Proc. Conf Computer Assurance, pp. 75-83, 1991.
- Cisco, Cisco 12000 Single Event Upset Failures Overview and Work Around Summary, August 15, 2003.
- Douglass, Doing Hard Time: Developing Real-Time Systems with UML, Objects, Frameworks, and Patterns, Addison-Wesley Professional, 1999.
- Forbes, Sun Screen, November 11, 2000. Accessed at: http://www.forbes.com/global/2000/1113/0323026a_print.html
- Heijmen, Soft errors from space to ground: historical overview, empirical evidence, and future trends (Chapter 1), in: Soft Errors in Modern Electronic Systems, Frontiers in Electronic Testing, 2011, Volume 41, pp. 1-41.
- Horst et al., The risk of data corruption in microprocessor-based systems, FTCS, 1993, pp. 576-585.
- Koopman, Better Embedded System Software, 2010.
- Mauser, Electronic throttle control – a dependability case study, J. Univ. Computer Science, 5(10), 1999, pp. 730-741.
- MISRA, Development Guidelines for Vehicle Based Software, November 1994 (PDF version 1.1, January 2001).
- MISRA, Report 2: Integrity, February 1995.
- NASA-GB-8719.13, NASA Software Safety Guidebook, NASA Technical Standard, March 31, 2004.
- Skarin & Karlsson, Software implemented detection and recovery of soft errors in a brake-by-wire system, EDCC 2008, pp. 145-154.
- Storey, N., Safety Critical Computer Systems, Addison-Wesley, 1996.
- Sullivan & Chillarege, Software defects and their impact on system availability: a study of field failures in operating systems, Fault Tolerant Computing Symposium, 1991, pp 1-9.
- Tang & Rodbell, Single-event upsets in microelectronics: fundamental physics and issues, MRS Bulletin, Feb. 2003, pp. 111-116.
- Thomas et al., The “trouble not identified” phenomenon in automotive electronics, Microelectronics Reliability 42, 2002, pp. 641-651.
- Yu, Y. & Johnson, B., Fault Injection Techniques: a perspective on the state of research. In: Benso & Prinetto (eds.), Fault Injection Techniques and Tools for Embedded System Reliability Evaluation, Kluwer, 2003, pp. 7-39.