If you design a system that doesn't eliminate all single points of failure, eventually your safety-critical system will kill someone. How often will that happen? It depends on your exposure, but in real-world systems it could easily be on a daily basis.
"Realistic" Events Happen Every Million Hours
Understanding software safety requires reasoning
about extremely low probability events, which can often lead to results that
are not entirely intuitive. Let's attempt to put into perspective where
to set the bar for what is “realistic” and why single point fault
vulnerabilities are inherently dangerous for systems that can kill people if they fail.
If a human being lives to be 90 years old, that
is 90 years * 365.25 days/yr * 24 hrs/day = 788,940 hours. While the notion of what a "realistic" fault is can be subjective, perhaps
an individual will consider a failure type realistic if they can
expect to see it happen in one human lifetime.
This definition has some
problems since everyone’s experience varies. For example, some would say that
total loss of dual redundant hydraulic service brakes in a car is unrealistic.
But that has actually happened to me due to a common-mode cracking mechanical
failure in the brake fluid reservoir of my vehicle, and I had to stop my
vehicle using my parking brake. So I’d have to call loss of both service brake
hydraulic systems a realistic fault from my point of view. But I have met
automotive engineers who have told me it is “impossible” for this failure to
happen. The bottom line: intuition about what is "realistic" isn't enough for these matters -- you have to crunch the numbers and remember that just because you haven't seen a failure yourself doesn't mean it can't happen.
So perhaps it is also realistic if a friend has told you a story about such a fault. From a probability point of view, let’s just say it's "realistic" if it is likely to happen about once every 1,000,000 hours (directly to you in your life, plus 25% extra to account for second-hand stories).
Probability math
in the safety critical system area is usually only concerned with rounded powers
of ten, and 1 million hours is just a human lifespan rounded up. Put another
way, if it happens once per million hours, let’s say it’s “realistic.” By way
of contrast, the odds of winning the top jackpot in Powerball are about one in
175 million (http://www.powerball.com/powerball/pb_prizes.asp), and someone
eventually wins even though nobody buys 175 tickets per hour for an entire million-hour
human lifetime, so a case can be made for much less frequent events to also be
“realistic” -- at least for someone in the population.
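For those who want to see the arithmetic spelled out, here is a quick back-of-the-envelope sketch in Python using only the figures already quoted above (the variable names are mine, purely for illustration):

    # Lifetime hours and the rounded "realistic" threshold (figures from the text above).
    lifetime_hours = 90 * 365.25 * 24        # 788,940 hours in a 90-year life
    with_secondhand = lifetime_hours * 1.25  # +25% for friends' stories, ~986,175 hours
    realistic_threshold = 1_000_000          # rounded up to a convenient power of ten
    powerball_odds = 175_000_000             # ~1 in 175 million per ticket
    tickets_per_hour = powerball_odds / realistic_threshold   # 175 tickets/hour to "expect" one win per lifetime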
I should note at this point that aircraft and train safety targets are at least 1000 times more stringent than one catastrophic failure per million hours. Then again, trains and jumbo jets hold hundreds of people, who are all being exposed to risk together. So the point of this essay is simply to make safety probability more human-accessible, and not to argue that one catastrophic fault per million hours is OK for a high-criticality system -- it's not!
"Realistic" Catastrophic Failures Will Happen To A Deployed Fleet Daily
Obermaisser gives permanent hardware failure rates as about 100 FIT in the context of drive-by-wire automobiles (Obermaisser, p. 10; note that 1 “FIT” is 1 failure per billion hours, so 100 FIT is one failure per 10 million operating hours). This means that you aren’t likely to see any particular car component fail in your lifetime. But cars have lots of components all failing at this rate, so there is a reasonable chance you’ll see some component failure and need to buy a replacement part. (These failure rates are for normal operating lifetimes, and do not account for the fact that failures become more frequent as parts near end of life due to aging and wearout.)
However, transient failures, such as those
caused by cosmic rays, voltage surges, and so on, are much more common,
reaching rates of 10,000-100,000 FIT (Obermaisser p. 10; 100,000 per billion
hours = 100 per million hours). That means that any particular component will
fail in a way that can be fixed by rebooting or the like about 10 to 100 times
per million hours – well within the realm of realistic by individual human
standards. Obermaisser also gives the frequency of arbitrary faults as 1/50th
of all faults (id., p. 8). Thus, any particular component can be expected to fail 10 to 100 times per million hours, and to do so
in an “arbitrary” way (which is software safety lingo for “dangerous”) about
0.2 to 2 times per million hours. This
is right around the range of “realistic,” straddling the once per million hours
frequency.
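As a quick sanity check on those rates, here is the same arithmetic as a short Python sketch, using only the Obermaisser figures quoted above:

    # 1 FIT = 1 failure per billion (1e9) operating hours.
    HOURS_PER_MILLION = 1.0e6
    transient_low_fit, transient_high_fit = 10_000, 100_000
    transient_low  = transient_low_fit  * 1e-9 * HOURS_PER_MILLION   # 10 faults per million hours
    transient_high = transient_high_fit * 1e-9 * HOURS_PER_MILLION   # 100 faults per million hours
    arbitrary_fraction = 1 / 50                                      # 1/50th of faults are "arbitrary"
    print(transient_low * arbitrary_fraction,                        # 0.2 per million hours
          transient_high * arbitrary_fraction)                       # 2.0 per million hours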
In other words, if you own one single electronic
device your whole life, it will suffer a potentially dangerous failure about
once in your lifetime. If you continuously own vehicles for your entire life,
and they each have 100 computer chips in them, then you can expect to see about
one potentially dangerous failure per year since there are 100 of them to
fail -- and whether any such failure actually kills you depends upon how well the car's fault tolerance architecture deals with that failure and how lucky you are that day. (We'll get to the fact that you don't drive your car 24x7 in a moment.) That is why it is so important
to have redundancy in safety-critical automotive systems. These are general
examples to indicate the scale of the issues at hand, and not specifically
intended to correspond to exact numbers in any particular vehicle, especially one in which failsafes are likely to somewhat reduce the
mishap rate by providing partial redundancy.
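Here is that individual-owner estimate as a rough Python sketch. The 100 chips and the mid-range rate of about 1 dangerous fault per million hours per chip are the illustrative values used above, not measurements from any particular vehicle, and the sketch deliberately counts calendar hours (the 24x7 simplification noted above):

    chips_per_car = 100                        # illustrative chip count from the text
    dangerous_per_chip_per_hour = 1.0e-6       # mid-range of 0.2 to 2 per million hours
    lifetime_hours = 90 * 365.25 * 24          # ~788,940 calendar hours
    lifetime_faults = chips_per_car * dangerous_per_chip_per_hour * lifetime_hours
    print(lifetime_faults / 90)                # ~0.9 potentially dangerous faults per year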
Now let’s see how the probabilities work when
you take into account the large size of the fleet of deployed vehicles. Let's say a car company sells 1 million vehicles in a year. Let’s also say that an average
vehicle is driven about 1 hour per day (the FHWA 2009 National Household Travel Survey, p. 31, says 56 minutes per day, but we're rounding off here). That’s 1 million vehicle-hours
per day. If we multiply that by the
range of dangerous transient faults per component (0.2 to 2 per million hours),
that means the fleet of vehicles can expect 0.2 to 2 dangerous transient faults per day for any particular safety-critical control component in that fleet. And there is likely more than one safety-critical component in each car. In other words, while a dangerous failure may seem unlikely on an individual basis, designers must expect dangerous failures to happen on a daily basis in a large deployed fleet.
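The fleet-level arithmetic as a sketch, again using only the rounded figures above (1 million vehicles, 1 driving hour per vehicle per day, and 0.2 to 2 dangerous faults per million component-hours):

    fleet_size = 1_000_000                      # vehicles sold in a year
    hours_per_vehicle_per_day = 1               # ~56 minutes/day, rounded up
    fleet_hours_per_day = fleet_size * hours_per_vehicle_per_day      # 1 million hours/day
    low_rate, high_rate = 0.2e-6, 2.0e-6        # dangerous faults per component-hour
    print(fleet_hours_per_day * low_rate,       # 0.2 dangerous faults/day per component type
          fleet_hours_per_day * high_rate)      # 2.0 dangerous faults/day per component type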
While the exact numbers in this calculation are
estimates, the important point is that a competent safety-critical designer
must take into account arbitrary failures when designing a system for a large
deployed fleet.
The above is an example calculation as to why redundancy is required to achieve
safety. Arbitrary faults can be expected to be
dangerous (we’ve already thrown away 98% of faults as benign in these
calculations – we’re just keeping 2% of faults as dangerous faults). Multiple fault containment regions (FCRs) must be used to mitigate such faults.
It is important to note that in the fault
tolerant and safety critical computing fields “arbitrary” means just that – it
is a completely unconstrained failure mode.
It is not simply a failure that seems “realistic” based on some set of
preconceived notions or experiences. Rather, designers consider it to be the
worst possible failure of a single chip in which it does the worst possible
thing to make the system unsafe. For example, a pyrotechnic device must be expected to fire accidentally at some point unless there is true redundancy that mitigates every possible single point of failure, whether the exact failure mechanism in the chip can be imagined by an engineer or not. The only way to avoid single point failures is
via some form of true redundancy that serves as an independent check and
balance on safety critical functions.
As a second source for similar failure rate
numbers, Kopetz, who specializes in automotive drive-by-wire fault tolerant
computer systems, gives an acceptable failure rate for safety-critical
applications of one critical failure per 1 billion hours (more than 100,000
years for any particular vehicle), saying that the “dependability requirements
for a drive-by-wire system are even more stringent than the dependability
requirements for a fly-by-wire systems, since the number of exposed hours of
humans is higher in the automotive domain.” (Kopetz 2004, p. 32). This cannot
be achieved using any simplex hardware scheme, and thus requires redundancy to
achieve that safety goal. He gives expected component failure rates as 100 FIT
for permanent faults and 100,000 FIT for transient faults (id., p. 37).
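To see why no simplex (non-redundant) design can meet that goal, compare the target rate to the component failure rates Kopetz cites; a quick Python sketch using only the numbers above:

    target_rate = 1.0e-9                        # acceptable: 1 critical failure per billion hours
    years_per_failure = (1 / target_rate) / (365.25 * 24)   # > 100,000 years per vehicle
    transient_rate = 100_000 * 1.0e-9           # 100,000 FIT = 1e-4 failures/hour for one component
    shortfall = transient_rate / target_rate    # a single component misses the target by ~100,000x
    print(round(years_per_failure), shortfall)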
By the same token, any first fault that persists
a long time during operation without being detected or mitigated sets the stage
for a second fault to happen, resulting in a sequence of faults that is
likewise unsafe. For example, consider a vehicle with a manufacturing defect or other fault that disables a redundant sensor input without that failure being detected or fixed. From the time that fault happens forward,
the vehicle is placed at the same risk as if it only had a single point failure
vulnerability in the first place. Redundancy doesn’t do any good if a redundant
component fails, nobody knows about it, nobody fixes it, and the vehicle keeps
operating as if nothing is wrong. This is why, for example, a passenger jet with two engines isn't allowed to take off on an over-ocean flight with only one of its engines working.
Software Faults Only Make Things Worse
The above fault calculations “assume the absence of software faults” (Obermaisser, p. 13), which is an unsupported assumption for many systems. Software faults can only be expected to make the arbitrary failure rate worse. Having software-implemented failsafes and partial redundancy may mitigate some of the dangerous faults, reducing the dangerous fraction below the computed 2%. But it is impossible for a mitigation technique in the same fault containment region as a failure to successfully mitigate all possible failure modes.
Software defects manifest as single points of failure in a way that may be counter-intuitive. Software defects are design defects rather than run-time hardware failures, and are therefore present all the time in every copy of a system that is created. (MISRA Report 2, p. 7) Moreover, in the absence of perfect memory and temporal isolation mechanisms, every line of software in a system has the potential to affect the execution of every other line in the system. This is especially problematic in systems which do not use memory protection, which do not have an adequate real time scheduling approach, which make extensive use of global variables, or which have other similar problems that elevate the risk that activating one software defect will cause system-wide problems or trigger a cascading activation of other software defects. Therefore, in the absence of compelling proof of complete isolation of software tasks from each other, the entire CPU must be considered a single FCR for software, making the entirety of software on a CPU a single point of failure for which any possible unsafe behavior must be fully and completely mitigated by mechanisms other than software on that same CPU.
Some examples of ways in which multiple apparent
software defects might result within a single FCR but still constitute a single
point of failure include: corruption of a block of memory (corrupting the
values of many global variables), corruption of task scheduling information in
an operating system (potentially killing or altering the execution patterns of
several tasks), and timing problems that cause multiple tasks to miss their
deadlines. It should be understood that these are merely examples – arbitrarily
bad faults are possible from even a seemingly trivial software defect.
MISRA states that FMEA and Fault Tree Analysis
that examine specific sources and effects of faults are “not applicable to
programmable and complex non-programmable systems because it is impossible to
evaluate the very high number of possible failure modes and their resulting
effects” (MISRA Report 2, p. 17). Instead, for any complex electronic component (whether
or not it contains software), MISRA tells designers to “consider faults at the
module level.” In other words, MISRA describes considering the entire
integrated circuit for a microcontroller as a single FCR unless there is a very strong isolation argument (and considering the findings of Ademaj (2003), it is difficult to see how that can be done).
Deployed Systems Will See Dangerous Random Faults
The point of all this is to demonstrate that
arbitrary dangerous single point faults can be expected to happen on a regular
basis when hundreds of thousands of embedded systems such as cars are involved. True redundancy is
required to avoid weekly mishaps in any full-scale vehicle production run that involves hundreds of
thousands of units in the field. No single point fault, no matter how obscure
or seemingly unlikely, can be tolerated in a safety critical system. Moreover,
multiple point faults cannot be tolerated if they are sufficiently likely to
happen via accumulation of undetected faults over time. In particular, software-implemented countermeasures that run on a CPU aren't going to be 100% effective if that same CPU is the one that suffered a fault in the first place. True redundancy is required to achieve safety from catastrophic failures for large deployed fleets in the face of random faults.
It is NOT acceptable
practice to start arguing whether any particular fault is “realistic” given a
particular design within a single point fault region such as an integrated circuit. This is in part because designer intuition is fallible for very low (but still important) probability events, and in part because it is, as a practical matter, impossible to envision all arbitrary faults that might cause a
safety problem. Rather, accepted practice is to simply assert that the worst
case possible single-point failure will find a way to happen, and design to
ensure that such an event does not render the system unsafe.
References:
- Ademaj et al., Evaluation of fault handling of the time-triggered architecture with bus and star topology, DSN 2003.
- Kopetz, H., On the fault hypothesis for a safety-critical real-time system, ASWSD 2004, LNCS 4147, pp. 31-42, 2006.
- MISRA, Report 2: Integrity, February 1995.
- Obermaisser, A Fault Hypothesis for Integrated Architectures, International Workshop on Intelligent Solutions in Embedded Systems, June 2006, pp. 1-18.