Monday, March 31, 2014

How Often Will Random Faults Kill Someone?


If you design a system that doesn't eliminated all single points of failure, eventually your safety-critical system will kill someone.  How often is that?  Depends on your exposure, but in real-world systems it could easily be on a daily basis.

"Realistic" Events Happen Every Million Hours

Understanding software safety requires reasoning about extremely low probability events, which can often lead to results that are not entirely intuitive. Let's attempt to put into perspective where to set the bar for what is “realistic” and why single point fault vulnerabilities are inherently dangerous for systems that can kill people if they fail.

If a human being lives to be 90 years old, that is 90 years * 365.25 days/yr * 24 hrs/day = 788,940 hours.  While the notion of what a "realistic" fault is can be subjective, perhaps an individual will think something is a realistic failure type if (s)he can expect to see it happen in one human lifetime. 

This definition has some problems since everyone’s experience varies. For example, some would say that total loss of dual redundant hydraulic service brakes in a car is unrealistic. But that has actually happened to me due to a common-mode cracking mechanical failure in the brake fluid reservoir of my vehicle, and I had to stop my vehicle using my parking brake. So I’d have to call loss of both service brake hydraulic systems a realistic fault from my point of view. But I have met automotive engineers who have told me it is “impossible” for this failure to happen. The bottom line: intuition as to what is "realistic" isn’t enough for these types of matters -- you have to crunch the numbers and take into account that if you haven't seen it yourself that doesn't mean it can't happen.

So perhaps it is also realistic if a friend has told you a story about such a fault. From a probability point of view let’s just say it's "realistic" if it is likely to happen every 1,000,000 hours (directly to you in your life, plus 25% extra to account for second-hand stories).  

Probability math in the safety critical system area is usually only concerned with rounded powers of ten, and 1 million is just a rounding up of a human lifespan. Put another way, if it happens once per million hours, let’s say it’s “realistic.” By way of contrast, the odds of winning the top jackpot in Powerball are about one in 175 million (http://www.powerball.com/powerball/pb_prizes.asp), and someone eventually wins without buying 175 tickets per hour for an entire million-hour human lifetime, so a case can made for much less frequent events to also be “realistic” -- at least for someone in the population.

I should note at this point that aircraft safety and train safety are at least 1000 times more rigorous than one catastrophic failure per million hours. Then again, trains and jumbo jets hold hundreds of people, who are all being exposed to risk together.  So the point of this essay is simply to make safety probability more human-accessible, and not to argue that one catastrophic fault per million hours is OK for a high criticality system -- it's not!

"Realistic" Catastrophic Failures Will Happen To A Deployed Fleet Daily

Obermaisser gives permanent hardware failure rates as about 100 FIT (Obermaisser p. 10 in the context of drive-by-wire automobiles. Note that 1 “FIT” is 1 failure per billion hours, so 100 FIT is one failure per 10 million operating hours). This means that you aren’t likely to see any particular car component failure in your lifetime. But cars have lots of components all failing at this rate, so there is a reasonable chance you’ll see some component failure and need to buy a replacement part. (These failure rates are for normal operating lifetimes, and do not count the fact that failures are more frequent as parts near end of life due to age and wearout.)

However, transient failures, such as those caused by cosmic rays, voltage surges, and so on, are much more common, reaching rates of 10,000-100,000 FIT (Obermaisser p. 10; 100,000 per billion hours = 100 per million hours). That means that any particular component will fail in a way that can be fixed by rebooting or the like about 10 to 100 times per million hours – well within the realm of realistic by individual human standards. Obermaisser also gives a frequency for arbitrary faults as 1/50th of all faults (id., p. 8). Thus, any particular component, can be expected to fail 10 to 100 times per million hours, and to do so in an “arbitrary” way (which is software safety lingo for “dangerous”) about 0.2 to 2 times per million hours.  This is right around the range of “realistic,” straddling the once per million hours frequency.

 In other words, if you own one single electronic device your whole life, it will suffer a potentially dangerous failure about once in your lifetime. If you continuously own vehicles for your entire life, and they each have 100 computer chips in them, then you can expect to see about one potentially dangerous failure per year since there are 100 of them to fail -- and whether any such failure actually kills you depends upon how well the car's fault tolerance architecture successfully deals with that failure and how lucky you are that day. (We'll get into the point that you don't drive your car 24x7 momentarily.)  That is why it is so important to have redundancy in safety-critical automotive systems. These are general examples to indicate the scale of the issues at hand, and not specifically intended to correspond to exact numbers in any particular vehicle, especially one in which failsafes are likely to somewhat reduce the mishap rate by providing partial redundancy.

 Now let’s see how the probabilities work when you take into account the large size of the fleet of deployed vehicles. Let's say a car company sells 1 million vehicles in a year. Let’s also say that an average vehicle is driven about 1 hour per day (FHA 2009 National Household Travel Survey, p. 31 says 56 minutes per day, but we're rounding off here). That’s 1 million hours per day.  If we multiply that by the range of dangerous transient faults per component (0.2 to 2 per million hours), that means that the fleet of vehicles can expect 0.2 to 2 dangerous transient faults per day for any particular safety critical control component in that fleet. And there are likely more than one safety critical components in each car. In other words, while a dangerous failure may seem unlikely in an individual basis, designers must expect dangerous failures to happen on a daily basis in a large deployed fleet.

While the exact numbers in this calculation are estimates, the important point is that a competent safety-critical designer must take into account arbitrary failures when designing a system for a large deployed fleet. 

 The above is an example calculation as to why redundancy is required to achieve safety. Arbitrary faults can be expected to be dangerous (we’ve already thrown away 98% of faults as benign in these calculations – we’re just keeping 2% of faults as dangerous faults). Multiple fault containment regions (FCRs) must be used to mitigate such faults.

 It is important to note that in the fault tolerant and safety critical computing fields “arbitrary” means just that – it is a completely unconstrained failure mode.  It is not simply a failure that seems “realistic” based on some set of preconceived notions or experiences. Rather, designers consider it to be the worst possible failure of a single chip in which it does the worst possible thing to make the system unsafe. For example, a pyrotechnic device must be expected to fire accidentally at some point unless there is true redundancy that mitigates every possible single point of failure, whether the exact failure mechanism in the chip can be imagined by an engineer or not. The only way to avoid single point failures is via some form of true redundancy that serves as an independent check and balance on safety critical functions.

As a second source for similar failure rate numbers, Kopetz, who specializes in automotive drive-by-wire fault tolerant computer systems, gives an acceptable mean-time-to-failure of safety-critical applications of one critical failure per 1 billion hours (more than 100,000 years for any particular vehicle), saying that the “dependability requirements for a drive-by-wire system are even more stringent than the dependability requirements for a fly-by-wire systems, since the number of exposed hours of humans is higher in the automotive domain.” (Kopetz 2004, p. 32). This cannot be achieved using any simplex hardware scheme, and thus requires redundancy to achieve that safety goal. He gives expected component failure rates as 100 FIT for permanent faults and 100,000 FIT for transient faults (id., p. 37).

 By the same token, any first fault that persists a long time during operation without being detected or mitigated sets the stage for a second fault to happen, resulting in a sequence of faults that is likewise unsafe. For example, consider if a vehicle has a manufacturing fault or other fault that is undetected but disables a redundant sensor input without that failure being detected or fixed. From the time that fault happens forward, the vehicle is placed at the same risk as if it only had a single point failure vulnerability in the first place. Redundancy doesn’t do any good if a redundant component fails, nobody knows about it, nobody fixes it, and the vehicle keeps operating as if nothing is wrong. For example, for this reason a passenger jet with two engines isn’t allowed to take off on an over-ocean flight with only one engine working at takeoff.

Software Faults Only Make Things Worse

 The above fault calculations “assume the absence of software faults,” (Obermaisser, p. 13), which is an unsupported assumption for many systems. Software faults can only be expected to make the arbitrary failure rate worse.Having software-implemented failsafes and partial redundancy may mitigate dangerous faults to more than the computed 2%. But it is impossible for a mitigation technique in the same fault containment region as a failure to be successful at mitigating all possible failure modes.

Software defects manifest as single points of failure in a way that may be counter-intuitive. Software defects are design defects rather than run-time hardware failures, and are therefore present all the time in every copy of a system that is created. (MISRA Report 2, p. 7) Moreover, in the absence of perfect memory and temporal isolation mechanisms, every line of software in a system has the potential to affect the execution of every other line in the system. This is especially problematic in systems which do not use memory protection, which do not have an adequate real time scheduling approach, which make extensive use of global variables, and have other similar problems that result in an elevated risk of the activation of one software defect causing system-wide problems, or causing a cascading activation of other software defects. Therefore, in the absence of compelling proof of complete isolation of software tasks from each other, the entire CPU must be considered a single FCR for software, making the entirety of software on a CPU a single point of failure for which any possible unsafe behavior must be fully and completely mitigated by mechanisms other than software on that same CPU.

Some examples of ways in which multiple apparent software defects might result within a single FCR but still constitute a single point of failure include: corruption of a block of memory (corrupting the values of many global variables), corruption of task scheduling information in and operating system (potentially killing or altering the execution patterns of several tasks), and timing problems that cause multiple tasks to miss their deadlines. It should be understood that these are merely examples – arbitrarily bad faults are possible from even a seemingly trivial software defect.

 MISRA states that FMEA and Fault Tree Analysis that examine specific sources and effects of faults are “not applicable to programmable and complex non-programmable systems because it is impossible to evaluate the very high number of possible failure modes and their resulting effects.” (MISRA Report 2 p. 17). Instead, for any complex electronics (whether or not it contains software), MISRA tells designers to “consider faults at the module level.” In other words, MISRA describes considering the entire integrated circuit for a microcontroller as a single FCR unless there is a very strong isolation argument (and considering the findings of Ademaj (2003), it is difficult to see how that can be done).

Deployed Systems Will See Dangerous Random Faults

The point of all this is to demonstrate that arbitrary dangerous single point faults can be expected to happen on a regular basis when hundreds of thousands of embedded systems such as cars are involved. True redundancy is required to avoid weekly mishaps on any full-scale production vehicle that involves hundreds of thousands of units in the field. No single point fault, no matter how obscure or seemingly unlikely, can be tolerated in a safety critical system. Moreover, multiple point faults cannot be tolerated if they are sufficiently likely to happen via accumulation of undetected faults over time. In particular, software-implemented countermeasures that run on a CPU aren't going to be 100% effective if that same CPU is the one that suffered a fault in the first place. True redundancy is required to achieve safety from catastrophic failures for large deployed fleets in the face of random faults.

It is NOT acceptable practice to start arguing whether any particular fault is “realistic” given a particular design of within a single point fault region such as an integrated circuit. This is in part because designer intuition is fallible for very low (but still important) probability events, and also in part because it is as a practical matter impossible to envision all arbitrary faults that might cause a safety problem. Rather, accepted practice is to simply assert that the worst case possible single-point failure will find a way to happen, and design to ensure that such an event does not render the system unsafe.

References:
  • Ademaj et al., Evaluation of fault handling of the time-triggered architecture with bus and star topology, DSN 2003.
  • Kopetz, H., On the fault hypothesis for a safety-critical real-time system, ASWSD 2004, LNCS 4147, pp. 31-42, 2006.
  • MISRA, Report 2: Integrity, February 1995.
  • Obermaisser, A Fault Hypothesis for Integrated Architectures, International Workshop on Intelligent Solutions in Embedded Systems, June 2006, pp. 1-18.

No comments:

Post a Comment

Please send me your comments. I read all of them, and I appreciate them. To control spam I manually approve comments before they show up. It might take a while to respond. I appreciate generic "I like this post" comments, but I don't publish non-substantive comments like that.

If you prefer, or want a personal response, you can send e-mail to comments@koopman.us.
If you want a personal response please make sure to include your e-mail reply address. Thanks!