Monday, March 17, 2014

Problems Caused by Random Hardware Faults in Critical Embedded Systems


A previous posting described sources of random hardware faults. This posting shows that even single bit errors can lead to catastrophic system failures. Furthermore, these types of faults are difficult or impossible to identify and reproduce in system-level testing. But just because you can't reproduce such errors on demand doesn't mean they aren't affecting real systems every day.

Paraphrasing a favorite saying of my colleague Roy Maxion: "You may forget about unlikely faults, but they won't forget about you."

Random Faults Cause Catastrophic Errors

Random faults can and do occur in hardware on an infrequent but inevitable basis. If you have a large number of embedded systems deployed, you can count on this happening to your system. But how severe can the consequences of a random bit flip be? As it turns out, this can be a source of system-killer problems that have hit more than one company, and can cause catastrophic failures of safety-critical systems.

While random faults are usually transient events, even the most fleeting of faults will at least sometimes result in a persistent error that can be unsafe. “Hardware component faults may be permanent, transient or intermittent, but design faults will always be permanent. However, it should be remembered that faults of any of these classes may result in errors that persist within the system.” (Storey 1996, pg. 116)

Not every random fault will cause a catastrophic system failure. But based on experimental evidence and published studies, it is virtually certain that at least some random faults will cause catastrophic system failures in a system that is not fully protected against all possible single-point faults and likely combinations of multi-point faults.

Addy presented a case study of an embedded real-time control system. The system architecture made heavy use of global variables (which is a bad idea for other reasons (Koopman 2010)). Analysis of test results showed that a single corrupted memory bit in a global variable caused an unsafe action for the system studied, despite redundant checks in software (Addy 1991, pg. 77). This means that a single bit flip causing unsafe system behavior has to be considered plausible, because it has actually been demonstrated in a safety critical control system.
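
As a purely hypothetical illustration of how a single upset bit can defeat software checks that look redundant, consider the sketch below. Both checks read the same global word, so there is only one underlying copy of the safety-relevant state; the names and constants are mine, not Addy's.

```c
/* Hypothetical sketch (not Addy's code): both "redundant" software checks
 * read the same global word, so a single flipped bit defeats them both. */
#include <stdint.h>
#include <stdio.h>

#define MODE_SAFE    0x0000u
#define MODE_ENABLED 0x0001u

static uint16_t g_actuator_mode = MODE_SAFE;   /* single shared global */

static int check_one(void) { return g_actuator_mode == MODE_ENABLED; }
static int check_two(void) { return g_actuator_mode == MODE_ENABLED; }

static void actuate_if_allowed(void)
{
    /* Both checks consult the same stored value, so they are not
     * truly redundant against memory corruption. */
    if (check_one() && check_two()) {
        puts("actuator energized");
    } else {
        puts("actuator held safe");
    }
}

int main(void)
{
    actuate_if_allowed();            /* prints "actuator held safe" */
    g_actuator_mode ^= MODE_ENABLED; /* simulate a single-bit upset in bit 0 */
    actuate_if_allowed();            /* now prints "actuator energized" */
    return 0;
}
```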


Addy also reported a fault in which one process set a value and a second process took action based on that value. He identified a failure mode in the first process, due to task death, that left the operator unable to control the safety-critical process:

Excerpt from (Addy 1991, pg. 80)

In Addy’s case the display froze and control over the safety-critical process was lost. As a result of this experience, Addy recommends a separate safety process (Addy 1991, pg. 81).
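
The sketch below gives a minimal flavor of what such a separate safety mechanism can look like: an independent monitor watches a heartbeat counter maintained by the control task and forces a safe state if the counter stops changing. This is my illustration of the general idea, not Addy's design; the names and the three-cycle timeout are assumptions.

```c
/* Illustrative sketch of an independent safety monitor (my example, not
 * Addy's design): the control task increments a heartbeat counter each
 * cycle; a separately scheduled monitor forces a safe state if the counter
 * stops changing, e.g. because the control task has died. */
#include <stdint.h>
#include <stdio.h>

static volatile uint32_t g_heartbeat;   /* written only by the control task */

void control_task_cycle(void)
{
    /* ... normal control computations ... */
    g_heartbeat++;                      /* "I am still alive" */
}

void safety_monitor_cycle(void)         /* runs from a separate task or timer */
{
    static uint32_t last_seen;
    static uint32_t stale_cycles;

    if (g_heartbeat == last_seen) {
        if (++stale_cycles == 3u) {     /* assumed timeout: 3 monitor periods */
            puts("control task dead: forcing outputs to a safe state");
        }
    } else {
        last_seen = g_heartbeat;
        stale_cycles = 0u;
    }
}

int main(void)                          /* simulate the control task dying */
{
    for (int i = 0; i < 10; i++) {
        if (i < 4) {
            control_task_cycle();       /* task stops running after cycle 3 */
        }
        safety_monitor_cycle();
    }
    return 0;
}
```

In practice a monitor like this is usually backed by an independent hardware watchdog timer, so that the monitor itself is not a single point of failure.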

Sun Microsystems learned the Single Event Upset (SEU) lesson the hard way in 2000, when it had to recall high-end servers because they had failed to use error detection/correction codes and were suffering crashes due to random faults in SRAM (Forbes, 2000). Fortunately no deaths were involved, since these were servers rather than safety-critical embedded systems.

Excerpt from (Forbes 2000)

Cisco also had a significant SEU problem with their 12000 series router line cards. They went so far as to publish a description and workaround notification, stating: “Cisco 12000 line cards may reset after single event upset (SEU) failures. This field notice highlights some of those failures, why they occur, and what work arounds are available.” (Cisco 2003) Again, these were not safety critical systems.

More recently, Skarin & Karlsson presented a study of fault injection results on an automotive brake-by-wire system to determine whether random faults can cause an unsafe situation. They found that 30% of random errors caused erroneous outputs, and of those 15% resulted in critical braking failures (Skarin & Karlsson 2008). In the context of this brake-by-wire study, critical failures were either loss of braking or a locked wheel during braking (id., p. 148). Thus, it is clear that these sorts of errors can lead to critical operational faults in electronic automotive controls. It is worth noting that random error rates tend to increase dramatically with altitude, making this a relevant environmental condition (Heijmen 2011, p. 6).

Mauser reports on a Siemens Automotive study of electronic throttle control for automobiles (Mauser 1999). The study specifically accounted for random faults (id., p. 732), as well as considering the probability of “runaway” incidents (id., p. 734). It found a possibility of single point failures, and in particular identified dual redundant throttle electrical signals being read by a single shared (multiplexed) analog-to-digital converter in the CPU as a critical flaw (id., p. 739).
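
To make the finding concrete, the sketch below shows the usual cross-check between two redundant throttle signals. The check catches disagreement between the channels, but if both channels are digitized by the same multiplexed converter, a fault in that one converter can corrupt both readings consistently and still pass the check, which is what makes the shared converter a single point of failure. The names and threshold are assumptions for illustration, not the design from the Siemens study.

```c
/* Illustrative cross-check of dual redundant throttle signals (assumed
 * names and threshold; not the code from the Siemens study). A fault in a
 * single shared multiplexed ADC can corrupt both readings consistently and
 * slip past this plausibility check. */
#include <stdint.h>
#include <stdlib.h>
#include <stdio.h>

#define PLAUSIBILITY_LIMIT 50           /* max allowed disagreement, ADC counts */

/* Returns the validated pedal position, or -1 to request a limp-home mode. */
int validate_throttle(int16_t chan_a, int16_t chan_b)
{
    if (abs(chan_a - chan_b) > PLAUSIBILITY_LIMIT) {
        return -1;                      /* channels disagree: not plausible */
    }
    return (chan_a + chan_b) / 2;       /* channels agree: use the average */
}

int main(void)
{
    printf("%d\n", validate_throttle(512, 520));  /* plausible: prints 516 */
    printf("%d\n", validate_throttle(512, 900));  /* implausible: prints -1 */
    return 0;
}
```

The mitigation identified in analyses like this is typically to digitize the two channels through independent conversion paths, so that no single converter fault can affect both readings.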

For automotive safety applications, MISRA Software Guidelines discuss the need to identify and correct corruption of memory values (MISRA Software Guidelines p. 30).

Random Faults Can Be Impossible to Reproduce In General

Intermittent faults from SEUs and other sources can and do occur in the field, causing real problems that are not readily reproducible even if the cause is a hard failure. In many cases components with faults are diagnosed as “Trouble Not Identified” (TNI). In my work with embedded system companies I have learned that it is common to have TNI rates of 50% for complex electronic components across many embedded industries. (This means that half of the components returned to the factory as defective are found to be working properly when normal testing procedures are used.) While it is tempting to blame the customer for incorrect operation of equipment, the reality is that many times a defect in software design, hardware design, or manufacturing is ultimately responsible for these elusive problems. Thomas et al. give a case study of a TNI problem on Ford vehicles and conclude that it is crucial to perform a root cause analysis for TNI problems in safety-critical electronics rather than just assume it is the customer’s fault for not using the equipment properly or falsely reporting an error. (Thomas 2002)
TNI reports must be tracked down to root cause. (Thomas 2002, pg. 650)

The point is not really the exact number, but rather that random failures are the sort of thing that can be expected to happen on a daily basis for a large deployed fleet of safety-critical systems, and therefore must be guarded against rigorously. Moreover, lack of reproducibility does not mean the failure didn’t happen. In fact, irreproducible failures and very-difficult-to-reproduce failures are quite common in modern electronic components and systems.

Random Faults Are Even Harder To Reproduce In System-Level Testing

While random faults happen all the time in large fleets, detecting and reproducing them is not so easy. For example: “Studies have shown that transient faults occur far more often than permanent ones and are also much harder to detect.” (Yu 2003 pg. 11)

It is also well known that some software faults appear random, even though they may be caused by a software defect that could be fixed if identified. Yu explains that this is a common situation:
Software faults can be very difficult to track down (Yu 2003, p. 12)

Random faults cannot be assumed to be benign, and cannot be assumed to always result in a system crash or system reset that puts the system into a safe state. As Storey notes in the passage quoted earlier, faults of any of these classes “may result in errors that persist within the system.” (Storey 1996, p. 116)  “Many designers believe that computers fail safe, whereas NASA experience has shown that computers may exhibit hazardous failure modes.” (NASA 2004, p. 21)

Moreover, random faults are just one type of fault that is expected on a regular, if infrequent basis. “Random faults, like end-of-life failures of electrical components, cannot be designed away. It is possible to add redundancy so that such faults can be easily detected, but no one has ever made a CPU that cannot fail.” (Douglass 1999, pg. 105, emphasis added)

MISRA states that “system design should consider both random and systematic faults.” (MISRA Software Guidelines p. 6). It also states “Any fault whose initiating conditions occur frequently will usually have emerged during testing before production, so that the remaining obscure faults may appear randomly, even though they are entirely systematic in origin.” (MISRA Report 2 p. 7). In other words, software faults can appear random, and are likely to be elusive and difficult to pin down in products – because the easy ones are the ones that get caught during testing.

SEUs and other truly random faults are not reproducible in normal test conditions because they depend upon rare uncontrollable events (e.g., cosmic ray strikes). “The result is an error in one bit, which, however, cannot be duplicated because of its random nature.” (Tang 2003, pg. 111).

While intermittent faults might manifest repeatedly over a long series of tests with a particular individual vehicle or other embedded system, doing so may require precise reproduction of environmental conditions such as vibration, temperature, humidity, voltage fluctuations, operating conditions, and other factors.

Random faults may be difficult to track down, but can result in errors that persist within the system (Storey 1996, pp. 115-116)

It is unreasonable to expect a fault caused by a random error to show up on demand in a test setup run for a few hours, or even a few days. It can easily take weeks or months to reproduce such defects, and some may never be reproduced simply by performing system-level testing in any reasonable amount of time. If a fleet of hundreds of thousands of vehicles, or other safety-critical embedded systems, only sees random faults a few times a day, testing a single vehicle is unlikely to see that same random fault in a short test time – or possibly it may never see a particular random fault over that single test vehicle’s entire operating life.
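
A back-of-the-envelope calculation makes the mismatch concrete. The fleet size and fault rate below are assumptions picked only for illustration, not data from any of the cited studies.

```c
/* Back-of-the-envelope illustration; the fleet size and fault rate are
 * assumed values for the example, not measured data. */
#include <stdio.h>

int main(void)
{
    const double fleet_size     = 500000.0;  /* assumed vehicles in service */
    const double faults_per_day = 3.0;       /* assumed fleet-wide fault rate */

    double per_vehicle_per_day = faults_per_day / fleet_size;
    double days_between_faults = 1.0 / per_vehicle_per_day;

    /* Prints roughly 166667 days, i.e. about 457 years of continuous
     * operation before one test vehicle would be expected to see a fault. */
    printf("Expected wait on a single test vehicle: %.0f days (%.0f years)\n",
           days_between_faults, days_between_faults / 365.0);
    return 0;
}
```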

Thus, it is to be expected that realistic faults – faults that happen to all types of computers on an everyday basis when large numbers of them are deployed – can’t necessarily be reproduced upon demand in a single vehicle simply by performing vehicle-level tests.

Software Faults Can Also Appear To Be Random and Irreproducible In System-Level Testing

Software data corruption includes changes during operation made to RAM and other hardware resources (for example, registers controlling I/O peripherals that can be altered by software). Unlike hardware data corruption, software data corruption is caused by design defects in software rather than disruption of hardware operating state. The defects may be quite subtle and seemingly random, especially if they are caused by timing problems (also known as race conditions). Because software data corruption is caused by a CPU writing values to memory or registers under software control just as the hardware would expect to happen with correctly working software, hardware data corruption countermeasures such as error detecting codes do not mitigate this source of errors.
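
The sketch below shows perhaps the simplest example of this: a lost update caused by a non-atomic read-modify-write shared between an interrupt handler and the main loop. Every store involved is an ordinary, correct-looking CPU write, so error detecting codes and similar hardware countermeasures have nothing to flag. The scenario and names are hypothetical.

```c
/* Hypothetical lost-update race condition: the interleaving in main() is
 * what happens when the interrupt fires in the middle of the main loop's
 * non-atomic read-modify-write. Every write is an ordinary CPU store, so
 * hardware corruption countermeasures such as ECC see nothing wrong. */
#include <stdint.h>
#include <stdio.h>

static volatile uint32_t g_event_count;

void sensor_isr(void)                   /* hypothetical interrupt handler */
{
    g_event_count = g_event_count + 1;
}

int main(void)
{
    uint32_t tmp = g_event_count;       /* main loop: load (reads 0)        */
    sensor_isr();                       /* interrupt fires: count becomes 1 */
    g_event_count = tmp + 1;            /* main loop: store writes 1        */

    /* Two increments occurred, but the count is 1: the ISR's update was
     * silently overwritten. A typical fix is to disable the interrupt (or
     * use an atomic operation) around the read-modify-write. */
    printf("count = %u\n", (unsigned)g_event_count);
    return 0;
}
```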

It is well known that software defects can cause corruption of data or control flow information. Since such corruption is due to software defects, one would expect the frequency at which it happens to depend heavily on the quality of the software. Generally developers try to find and eliminate frequent sources of memory corruption, leaving the more elusive and infrequent problems in production code. Horst et al. used modeling and measurement to estimate that a data corruption would occur once per month in a population of 10,000 processors (Horst 1993, abstract), although that software was written to non-safety-critical levels of software quality. Sullivan and Chillarege used defect reports to understand “overlay errors” (the IBM term for memory corruption), and found that such errors had a much higher impact on the system than other types of defects (Sullivan 1991, p. 9).

MISRA recommends using redundant data or a checksum (MISRA Software Guidelines p. 30; MISRA Report 1 p. 21). NASA recommends using multiple copies or a CRC to guard against variable corruption (NASA 2004, p. 93). IEC 61508-3 recommends protecting data structures against corruption (p. 72).
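
A minimal sketch of one of these recommendations is shown below: a safety-relevant value is stored together with a redundant copy (here its bitwise complement) and both copies are checked on every read. The type and function names are my own; a CRC over a block of variables is a common alternative when more data must be protected.

```c
/* Minimal sketch of protecting a variable with a redundant stored copy
 * (the bitwise complement), in the spirit of the guidelines cited above.
 * Type and function names are assumed for illustration. */
#include <stdint.h>
#include <stdbool.h>
#include <stdio.h>

typedef struct {
    uint16_t value;
    uint16_t inverse;                   /* redundant copy: ~value */
} protected_u16;

static void prot_write(protected_u16 *p, uint16_t v)
{
    p->value   = v;
    p->inverse = (uint16_t)~v;
}

/* Copies the value to *out and returns true only if both copies agree. */
static bool prot_read(const protected_u16 *p, uint16_t *out)
{
    if ((uint16_t)~p->inverse != p->value) {
        return false;                   /* corruption detected */
    }
    *out = p->value;
    return true;
}

int main(void)
{
    protected_u16 speed_limit;
    uint16_t v = 0;

    prot_write(&speed_limit, 120u);
    bool ok = prot_read(&speed_limit, &v);
    printf("ok=%d value=%u\n", (int)ok, (unsigned)v);   /* ok=1 value=120 */

    speed_limit.value ^= 0x0008u;       /* simulate a single-bit upset */
    ok = prot_read(&speed_limit, &v);
    printf("ok=%d\n", (int)ok);         /* ok=0: caller enters a safe state */
    return 0;
}
```
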
As noted earlier, some software faults appear random, even though they may be caused by a software defect that could be fixed if identified; Yu explains that this is a common situation (Yu 2003, p. 12). Such software faults can appear to be irreproducible at the system level, making it virtually impossible to reproduce them upon demand with system-level testing of an unmodified system.

It is reasonable to expect any software – and especially software that has not been developed in rigorous accordance with high-integrity software procedures – to have residual bugs that can cause memory corruption, crashes, or other malfunctions. The figure below provides an example software malfunction from an in-flight entertainment system. Such examples abound if you keep an eye out for them.

Aircraft entertainment system software failure. February 13, 2008 (photo by author)

It should be no surprise if a system suffers random faults in a large-scale deployment that are difficult or impossible to reproduce in the lab via system-level testing, even if their manifestation in the field has been well documented. There simply isn't enough exposure to random fault sources in laboratory testing conditions to experience the variety of faults that will happen in the real world to a large fleet.

Coming at this another way, saying that something can't happen because it can't be reproduced in the lab ignores how random hardware and software faults manifest in the real world. Arbitrarily bad random faults can and will happen eventually in any widely deployed embedded system. Designers have to make a choice between ignoring this reality, or facing it head-on to design systems that are safe despite such faults occurring.

References:

  • Addy, E., A case study on isolation of safety-critical software, Proc. Conf Computer Assurance, pp. 75-83, 1991.
  • Cisco, Cisco 12000 Single Event Upset Failures Overview and Work Around Summary, August 15, 2003.
  • Douglass, Doing Hard Time: Developing Real-Time Systems with UML, Objects, Frameworks, and Patterns, Addison-Wesley Professional, 1999.
  • Forbes, Sun Screen, November 11, 2000. Accessed at: http://www.forbes.com/global/2000/1113/0323026a_print.html
  • Heijmen, Soft errors from space to ground: historical overview, empirical evidence, and future trends (Chapter 1), in: Soft Errors in Modern Electronic Systems, Frontiers in Electronic Testing, 2011, Volume 41, pp. 1-41.
  • Horst et al., The risk of data corruption in microprocessor-based systems, FTCS, 1993, pp. 576-585.
  • Koopman, Better Embedded System Software, 2010.
  • Mauser, Electronic throttle control – a dependability case study, J. Univ. Computer Science, 5(10), 1999, pp. 730-741.
  • MISRA, Development Guidelines for Vehicle Based Software, November 1994 (PDF version 1.1, January 2001).
  • MISRA, Report 2: Integrity, February 1995.
  • NASA-GB-8719.13, NASA Software Safety Guidebook, NASA Technical Standard, March 31, 2004.
  • Skarin & Karlsson, Software implemented detection and recovery of soft errors in a brake-by-wire system, EDCC 2008, pp. 145-154.
  • Storey, N., Safety Critical Computer Systems, Addison-Wesley, 1996.
  • Sullivan & Chillarege, Software defects and their impact on system availability: a study of field failures in operating systems, Fault Tolerant Computing Symposium, 1991, pp 1-9.
  • Tang & Rodbell, Single-event upsets in microelectronics: fundamental physics and issues, MRS Bulletin, Feb. 2003, pp. 111-116.
  • Thomas et al., The “trouble not identified” phenomenon in automotive electronics, Microelectronics Reliability 42, 2002, pp. 641-651.
  • Yu, Y. & Johnson, B., Fault Injection Techniques: a perspective on the state of research. In: Benso & Prinetto (eds.), Fault Injection Techniques and Tools for Embedded System Reliability Evaluation, Kluwer, 2003, pp. 7-39.

