Monday, March 24, 2014

Safety Requires No Single Points of Failure

If someone is likely to die due to a system failure, that means the system can't have any single points of failure. The implications of this are more subtle than they might seem at first glance, because a "single point" isn't a single line of code or a single logic gate, but rather a "fault containment region," which is often an entire chip.

Safety Critical Systems Can't Have Single Points of Failure

One of the basic tenets of safety critical system design is avoiding single points of failure. A single point of failure is a component that, if it is the only thing that fails, can make the system unsafe. In contrast, a double (or triple, etc.) point of failure situation is one in which two (or more) components must fail to result in an unsafe system.

As an example, a single-engine airplane has a single point of failure (the single engine – if it fails, the plane doesn’t have propulsion power any more). A commercial airliner with two engines does not have an engine as a single point of failure, because if one engine fails the second engine is sufficient to operate and land the aircraft in expected operational scenarios. Similarly, cars have braking actuators on more than one wheel in part so that if one fails there are other redundant actuators to stop the vehicle.

MISRA Report 2 states that avoiding such failures is the centerpiece of risk assessment (MISRA Report 2, p. 17).

For systems that are deployed in large numbers for many operating hours, it can reasonably be expected that any complex component such as a computer chip or the software within it will fail at some point. Thus, attaining reasonable safety requires ensuring that no such component is a single point of failure.

When redundancy is used to avoid having a single point of failure, it is important that the redundant components not have any “common mode” failures. In other words, there must not be any way in which both redundant components would fail from the same cause, because any such cause would in effect be a single point of failure. As an example, if two aircraft engines use the same fuel pump, then that fuel pump becomes a single point of failure if it stops pumping fuel to both engines at the same time. Or if two brake pads in a car are closed by pressure from the same hydraulic line, then a hydraulic line rupture could cause both brake pads to fail at the same time due to loss of hydraulic pressure.

Similarly, it is important that there be no sequences of failures that escape detection until a last failure results in a system failure. For example, if one engine on a two-engine aircraft fails, it is essential to be able to detect that failure immediately so that diversion to a nearby airport can be completed before the second engine has a chance to fail as well.

For functions implemented in software, designers must assume that software can fail due to software defects missed in testing or due to hardware malfunctions. Software failures that must be considered include: a task dying or “hanging,” a task missing its deadline, a task producing an unsafe value due to a computational error, and a task producing an unsafe value due to unexpected (potentially abnormal) input values. It is important to note that when a software task “dies” that does not mean the rest of the system stops operating. For example, a software task that periodically computes a servo angle might die and leave the servo commanded to its last computed position, unless some other task within the system notices that the servo angle computation task is no longer alive and takes corrective action.
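
As a concrete illustration of the "task death" concern, here is a minimal C sketch of one common mitigation: the servo task increments a heartbeat counter every period, and a separate monitor task notices if that counter stops changing and commands a safe state. All names, periods, and values below are hypothetical placeholders rather than code from any particular system.

```c
/* Heartbeat-based detection of a dead or hung task (hypothetical sketch). */
#include <stdint.h>

#define SAFE_SERVO_ANGLE 0                    /* hypothetical safe position */

static volatile uint32_t servo_heartbeat;     /* incremented by the servo task */
static uint32_t last_seen_heartbeat;          /* monitor's last observed value */
static int32_t servo_command;                 /* stand-in for the real servo output */

static int32_t compute_servo_angle(void) { return 42; }              /* placeholder control law */
static void command_servo(int32_t angle) { servo_command = angle; }  /* placeholder output driver */

/* Called every control period (e.g., 10 ms) by the servo task. */
void servo_task(void)
{
    command_servo(compute_servo_angle());
    servo_heartbeat++;                        /* "I am still alive" */
}

/* Called by an independent monitor task at a slower period (e.g., 100 ms). */
void servo_monitor_task(void)
{
    uint32_t current = servo_heartbeat;       /* single snapshot of the counter */
    if (current == last_seen_heartbeat) {
        /* Servo task died or hung: command a defined safe state rather than
         * leaving the servo frozen at its last computed position. */
        command_servo(SAFE_SERVO_ANGLE);
    }
    last_seen_heartbeat = current;
}
```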

Avoiding single point failures in a multitasking software system requires that a single task must not manage both normal behavior and failure mode backup behavior, because if a failure is caused by that task dying or misbehaving, the backup behavior is likely to be lost as well. Similarly, a single task should not both perform an action and also monitor faults in that action, because if the task gets the action wrong, one should reasonably expect that it will subsequently fail to perform the monitoring correctly as well.
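
To sketch that separation in code, a "doer" task might compute and publish an actuator command while an entirely separate monitor task does nothing but sanity-check that command and disable the output through its own path. The names, limits, and trivial control law below are hypothetical placeholders:

```c
/* Separate "doer" and monitor tasks (hypothetical sketch). */
#include <stdbool.h>
#include <stdint.h>

#define THROTTLE_MAX 80                       /* hypothetical plausibility limit, percent */

static volatile int32_t throttle_cmd;         /* output of the control task */
static volatile bool    output_enabled = true;

static int32_t compute_throttle(void) { return 25; }   /* placeholder control law */

/* Control ("doer") task: computes and publishes the command. */
void throttle_control_task(void)
{
    throttle_cmd = compute_throttle();
}

/* Independent monitor task: never computes the command itself, only checks it.
 * On an implausible value it disables the output via a separate path, so a
 * single defect in the control task cannot also disable the backup behavior. */
void throttle_monitor_task(void)
{
    if ((throttle_cmd < 0) || (throttle_cmd > THROTTLE_MAX)) {
        output_enabled = false;               /* fail to a safe, de-energized state */
    }
}
```

Of course, on a single CPU both tasks can still fail together from a common cause such as a clock or power fault, which is exactly the fault containment region issue discussed below.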

It is well known in the research literature that avoiding single points of failure is essential in a safety critical system. Leveson says: “To prove the safety of a complex system in the presence of faults, it is necessary to show that no single fault can cause a hazardous effect and that hazards resulting from sequences of failures are sufficiently remote.”  (Leveson 1986, pg. 139)

Storey says, in the context of software safety: “it is important to note that as complete isolation can never be guaranteed, system designers should not rely on the correct operation of a single module to achieve safety, even if that module is well trusted. As in all aspects of safety, several independent means of assuring safety are preferred.” (Storey 1996, pg. 238) In other words, a fundamental tenet of safety critical system design is to first ensure that there are no single points of failure, and then ensure that there are no likely pairs or sequences of failures that can also cause a safety problem. A specific implication of this statement is that software within a single CPU cannot be trusted to operate properly at all times. A second independent CPU or other device must be used to duplicate or monitor computations to ensure safety. Storey gives a variety of redundant hardware patterns to avoid single point failures (id., pp. 131-143).

Douglass addresses single-point failures directly, making it clear that all single point failures must be addressed to attain safety, no matter how remote the probability of failure might be (Douglass 1999, pp. 105-107).

Moreover, random faults are just one type of fault that is expected to occur on a regular, if infrequent, basis. “Random faults, like end-of-life failures of electrical components, cannot be designed away. It is possible to add redundancy so that such faults can be easily detected, but no one has ever made a CPU that cannot fail.” (Douglass 1999, pg. 105, emphasis added)

Fault Containment Regions Count As "Single Points" of Failure

Additionally, it is important to recognize that any line of code on a CPU (or even many lines of code on a CPU) can be a single point fault that creates arbitrary effects in diverse places elsewhere on that same CPU. Addy makes it clear that “all parts of the software must be investigated for errors that could have a safety impact. This is not a part of the software safety analysis that can be slighted without risk, and may require more emphasis than it currently receives. [Addy’s] case study suggests that, in particular, care should be taken to search for errors in noncritical software that have a hidden interface with the safety-critical software through data and system services.” (Addy 1991, pg. 83). Thus, it is inadequate to look at only some of the code to evaluate safety – you have to look at every last line of code to be sure.

A Fault Containment Region (FCR) is a region of a system which is neither vulnerable to external faults, nor causes faults outside the FCR. (Lala 1994, p. 30) The importance of an FCR is that arbitrary faults that originate inside an FCR cannot spread to another FCR, providing the basis of fault containment within a system that has redundancy to tolerate faults. In other words, if an FCR can generate erroneous data, you need a separate, independent FCR to detect that fault as the basis for creating a fault tolerant system (id.).
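
As a rough sketch of what that cross-FCR checking can look like in software, the receiving FCR might validate each message from the other FCR for integrity, freshness, and plausibility before trusting it. The message format and checks below are hypothetical placeholders; a real design would use whatever integrity mechanisms (for example, a proper CRC) its safety analysis calls for:

```c
/* One FCR validating data produced by another FCR (hypothetical sketch). */
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    int32_t  value;      /* e.g., a commanded actuator position */
    uint16_t sequence;   /* incremented by the sending FCR each message */
    uint16_t checksum;   /* simple additive checksum; stand-in for a real CRC */
} fcr_msg_t;

static uint16_t msg_checksum(const fcr_msg_t *m)
{
    uint32_t v = (uint32_t)m->value;
    return (uint16_t)((v & 0xFFFFu) + (v >> 16) + m->sequence);
}

/* Runs in the receiving (monitoring) FCR: returns true only if the message
 * from the other FCR passes integrity, freshness, and range checks. */
bool validate_fcr_msg(const fcr_msg_t *m, uint16_t last_sequence,
                      int32_t min_value, int32_t max_value)
{
    if (msg_checksum(m) != m->checksum)  { return false; }  /* corrupted in transit */
    if (m->sequence == last_sequence)    { return false; }  /* stale or sender dead */
    if ((m->value < min_value) ||
        (m->value > max_value))          { return false; }  /* implausible value    */
    return true;
}
```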

Hammet presents architectures that are fail-safe, such as a “self-checking pair with simplex fault down” that in my experience is also commonly used in rail signaling equipment (Hammet 2002, pp. 19-20). Rail signaling systems I have worked with also use redundancy to avoid single points of failure. So it's no mystery how to do this -- you just have to want to get it right.
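
As a simplified software analogy of the self-checking pair idea (not Hammet's exact avionics architecture), two lanes compute the same result, the outputs are compared, and on any miscompare the pair goes silent so that a backup channel can take over:

```c
/* Self-checking pair comparison logic (simplified, hypothetical sketch). */
#include <stdbool.h>
#include <stdint.h>

static bool pair_healthy = true;

static int32_t lane_a_compute(int32_t input) { return input * 2; }  /* placeholder computation */
static int32_t lane_b_compute(int32_t input) { return input * 2; }  /* placeholder computation */

/* Returns true and sets *output only when both lanes agree; otherwise the
 * pair latches itself failed and produces no further output. */
bool self_checking_pair(int32_t input, int32_t *output)
{
    if (pair_healthy) {
        int32_t a = lane_a_compute(input);
        int32_t b = lane_b_compute(input);
        if (a == b) {
            *output = a;
            return true;
        }
        pair_healthy = false;   /* miscompare: go fail-silent ("fault down") */
    }
    return false;               /* caller falls back to the backup channel */
}
```

In a real system the two lanes would be separate fault containment regions on independent hardware; running both copies on the same CPU would reintroduce the common-mode failure problem discussed above.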

Obermaisser gives a list of failure modes for a “fault containment region” (FCR) that includes “arbitrary failures,” (Obermaisser 2006 pp. 7-8, section 3.2) with the understanding that a safe system must have multiple redundant FCRs to avoid a single point of failure. Put another way, a fault anywhere in a single FCR renders the entire FCR unreliable. Obermaisser also says that “The independence of FCRs can be compromised by shared physical resources (e.g., power supply, timing source), external faults (e.g., Electromagnetic Interference (EMI), spatial proximity) and design.” (id., section 3.1 p. 6)

Kopetz says that an automotive drive-by-wire system design must “assure that replicated subsystems of the architecture for ultra-dependable systems fail independently of each other.” (Kopetz 2004, p. 32, emphasis per original) Kopetz also says that each FCR must be completely independent of the others, including: computer hardware, power supply, timing source, clock synchronization service, and physical space, noting that “all correlated failures of two subsystems residing on the same silicon die cannot be eliminated.” (id., p. 34).

Continuing with automotive-specific references, none of this was news to the authors of the MISRA Software Guidelines, who said the objective of risk assessment is to “show that no single point of failure within the system can lead to a potentially unsafe state, in particular for the higher Integrity Levels.” (MISRA Report 2, 1995, p. 17). In this context, “higher Integrity Levels” are those functions that could cause significant unsafe behavior. That report also says that the risk from multiple faults must be sufficiently low to be acceptable.

Mauser reports on a Siemens Automotive study of electronic throttle control for automobiles (Mauser 1999). The study specifically accounted for random faults (id., p. 732), as well as considering the probability of “runaway” incidents (id., p. 734). It found a possibility of single point failures, and in particular identified dual redundant throttle electrical signals being read by a single shared (multiplexed) analog-to-digital converter in the CPU (id., p. 739) as a critical flaw.
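
The software side of that kind of dual-sensor cross-check is simple, which underscores that the weakness was in the shared hardware path rather than in the comparison logic. In this hypothetical sketch, the comparison only buys safety if the two readings arrive through independent conversion paths; if both are multiplexed through one shared ADC, a fault in that converter can corrupt both readings the same way and slip past the check:

```c
/* Cross-check of two redundant throttle position readings (hypothetical sketch). */
#include <stdbool.h>
#include <stdint.h>

#define AGREE_TOLERANCE 5    /* hypothetical allowed disagreement, percent */

/* Returns true only if the two independently converted readings agree
 * within tolerance; a disagreement should trigger a fail-safe response. */
bool throttle_signals_agree(int32_t sensor1_pct, int32_t sensor2_pct)
{
    int32_t diff = sensor1_pct - sensor2_pct;
    if (diff < 0) {
        diff = -diff;
    }
    return (diff <= AGREE_TOLERANCE);
}
```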

Ademaj says that “independent fault containment regions must be implemented in separate silicon dies.” (Ademaj 2003, p. 5) In other words, any two functions on the same silicon die are subject to arbitrary faults and constitute a single point of failure.

But Ademaj didn’t just say it – he proved it via experimentation on a communication chip specifically designed for safety critical automotive drive-by-wire applications (id., pg. 9 conclusions). Those results required the designers of the TTP protocol chip (based on the work of Prof. Kopetz) to change their approach to achieving fault tolerance, moving to a star topology, because combining the network CPU with the network monitor on the same silicon die was shown to be susceptible to single points of failure even though the die had been specifically designed to physically isolate the monitor from the main CPU. Even though every attempt had been made at on-chip isolation, two completely independent circuits sharing the same chip were observed to fail together from a single fault in a safety-critical automotive drive-by-wire design.

In other words, if a safety critical system has a single point of failure – any single point of failure, even if mitigated with “failsafes” in the same fault containment region – then it is not only unsafe, but it also flies in the face of well-accepted practices for creating safe systems.

In a future post I'll get into the issue of how critical a system has to be for addressing all single point failures to be mandatory. But the short version is that if someone is likely to die as a result of a system failure, you need to avoid all single point failures (every last one of them), at least at the level of FCRs.

References:
  • Addy, E., A case study on isolation of safety-critical software, Proc. Conf Computer Assurance, pp. 75-83, 1991.
  • Ademaj et al., Evaluation of fault handling of the time-triggered architecture with bus and star topology, DSN 2003.
  • Douglass, Doing Hard Time: Developing Real-Time Systems with UML, Objects, Frameworks, and Patterns, Addison-Wesley Professional, 1999.
  • Hammet, Design by extrapolation: an evaluation of fault-tolerant avionics, IEEE Aerospace and Electronic Systems, 17(4), 2002, pp. 17-25.
  • Kopetz, H., On the fault hypothesis for a safety-critical real-time system, ASWSD 2004, LNCS 4147, pp. 31-42, 2006.
  • Lala et al., Architectural principles for safety-critical real-time applications, Proceedings of the IEEE, 82(1), Jan. 1994, pp. 25-40.
  • Leveson, N., Software safety: why, what, how, Computing Surveys, Vol. 18, No. 2, June 1986, pp. 125-163.
  • MISRA, Report 2: Integrity, February 1995.
  • Obermaisser, A Fault Hypothesis for Integrated Architectures, International Workshop on Intelligent Solutions in Embedded Systems, June 2006, pp. 1-18.
  • Storey, N., Safety Critical Computer Systems, Addison-Wesley, 1996.
