Monday, April 21, 2014

Layered Defenses for Safety Critical Systems

Even if designers mitigate all single point faults in a design, there is always the possibility of some unexpected fault or combination of correlated faults that causes a system to fail in an unexpected way. Such failures should ideally never happen, but in practice no design analysis is perfectly comprehensive, especially if there are unconsidered correlations that make seemingly independent faults happen together, so they do happen. To mitigate such problems, system designers use layers of mitigation, which is a practice sometimes referred to as “defense in depth.”

Consequences: 
If a layered defensive strategy is defective, a failure can bypass intended mitigation strategies and result in a mishap.

Accepted Practices:
  • The accepted practice for layered systems is to ensure that no single point of failure, nor plausible combination of failures, exists which permits a mishap to occur. For layered defense purposes, a single point of failure includes even a redundant component subsystem (e.g., a 2oo2 redundant self-checking CPU pair might fail due to software defect present on both modules, so a layered defense provides an alternate way to recover from such a failure) 
  • The existence of multiple layers of protection is only effective if the net result gives complete, non-single-point-of-failure, coverage of all relevant faults.
  • The goal of layered defenses should be maximizing the fraction of problems that are caught at each layer of defense to reduce the residual probability of a mishap.
Discussion:

A layered defense system typically rests on an application of the principle of fault containment, in which a fault or its effects are contained and isolated so as to have the least effect on the system possible. The starting point for this is using fault containment regions such as 2oo2 systems or similar design patterns. But, a prudent designer admits that software faults or correlated hardware faults might occur, and therefore provides additional layers or protection.


Layered defenses attempt to prevent escalation of fault effects.

The figure above shows the general idea of layered defenses from a fault tolerant computing perspective. First, it is ideal to avoid both design and run-time faults. But, faults do crop up, and so mechanisms and architectural patterns should be in place to detect and contain those faults using fault containment regions. If a fault is not contained as intended, then the system experiences a hazard in that its primary fault tolerance approach has not worked and the system has become potentially unsafe. In other words, some fraction of faults might not be contained, and will result in hazards.

Once a hazard has manifested, a "fail-safe" mitigation strategy can help reduce the chance of a bigger problem occurring. A fail safe might, for example, be an independent safety system triggered by an electro-mechanical monitor (for example, a pressure relief valve on a water heater that releases pressure if steam forms inside the tank). In general, the system is already in an unsafe operating condition when the fail-safe has been activated. But, successful activation of a fail-safe may prevent a worse event much of the time. In other words, the hope is that most hazards will be mitigated by a fail-safe, but a few hazards may not be mitigated, and will result in incidents.

If the fail-safe is not activated then an incident occurs. An incident is a situation in which the system remains unsafe long enough for an accident to happen, but due to some combination of operator intervention or just plain luck, a loss event is avoided. In many systems it is common to catch a lucky break when the system fails, especially if well trained operators are able to find creative ways to recover the system, such as by shifting a car's transmission to neutral or turning off the ignition when a car's engine over-speeds. (It is important to note that recovering such a system doesn't mean that the system was safe; it just means that the driver had time and training to recover the situation and/or got lucky.) On the other hand, if the operator doesn't manage to recover the system, or the failure happens in a situation that is unrecoverable even by the best operator, a mishap will occur resulting in property damage, personal injury, death, or other safety loss event. (The general description of these points is based on Leveson 1986, pp. 149-150.)

A well known principle of creating safety critical systems is that hazardous behavior displayed by individual components is likely to result in an eventual accident. In other words, with a layered defense approach, components that act in a hazardous way might lead to no actual mishap most times, because a higher level safety mechanism takes over, or just because the system gets “lucky.” However, the occurrence of such hazards can be expected to eventually result in an actual mishap, when some circumstance results in which the safety net mechanism fails. 

For example, a fault containment might work 99.9% of the time, and fail-safes might also work 99.9% of the time. Thousands of tests might show that one or another of these safety layers saves the day. But, eventually, assuming the probability the safety layers being effective is random and independent, both will fail for some infrequent situation, causing a mishap. (Two layers at 99.9% give unmitigated faults of: 0.1% * 0.1% = 0.0001%, which is unlikely to be seen in testing, but still isn't zero.)  The safety concept of avoiding single point failures only works if each failure is infrequent enough that double failures are unlikely to ever happen in the entire lifetime of the operational fleet, which can be millions or even billions of hours of exposure for some systems. Doing this in practice for large deployed fleets requires identifying and correcting all situations detected in which single point failures are not immediately and completely mitigated. You need multiple layers to catch infrequent problems, but you should always design the system so that the layers are never exercised in situations that occur in practice.

Selected Sources:

Most NASA space systems employ failure tolerance (as opposed to fault tolerance) to achieve an acceptable degree of safety. Failure tolerance means not only are faults tolerated within a particular component or subsystem, but the failure of an entire subsystem is tolerated. (NASA 2004 pg. 114) These are the famous NASA backup systems. “This is primarily achieved via hardware, but software is also important, because improper software design can defeat the hardware failure tolerance and vice versa.” (NASA 2004 pg. 114, emphasis added)

Some of the layered defenses might be considered to be forms of graceful degradation (e.g., as described by Nace 2001 and Shelton 2002). For example, a system might revert to simple mechanical controls if a 2oo2 computer controller does a safety shut-down. A key challenge for graceful degradation approaches is ensuring that safety is maintained for each possible degraded configuration.

See also previous blog posting on: Safety Requires No Single Points of Failure

References:
  • Leveson, N., Software safety: why, what, how, Computing Surveys, Vol. 18, No. 2, June 1986, pp. 125-163.
  • Nace, W. & Koopman, P., "A Graceful Degradation Framework for Distributed Embedded Systems," Workshop on Reliability in Embedded Systems (in conjunction with Symposium on Reliable Distributed Systems/SRDS-2001), October 2001.
  • NASA-GB-8719.13, NASA Software Safety Guidebook, NASA Technical Standard, March 31, 2004.
  • Shelton, C., & Koopman, P., "Using Architectural Properties to Model and Measure System-Wide Graceful Degradation," Workshop on Architecting Dependable Systems (affiliated with ICSE 2002), May 25 2002.
  • Monday, April 14, 2014

    Redundant Input Processing for Safety

    Redundant analog and digital inputs to a safety critical system must be fed to independent chips to ensure no single point failure exists.  (This posting is a follow-on to a previous post about single points of failure.)

    Consequences: If fully replicated input processing and validation is not implemented with complete avoidance of single points of failure, it is possible for a single fault to result in erroneous input values causing unsafe system operation.

    Accepted Practices:
    • A safety critical system must not have any single point of failure that results in a significant unsafe condition if that failure can reasonably be expected to occur during the operational life of the deployed fleet of systems. Redundant input processing is an accepted practice that can help avoid single point failures.
    Discussion:

    A specific instance of avoiding single points of failure involves the processing of data inputs. It is imperative that safety critical input signals be duplicated and processed independently to avoid a single point of failure in input processing.

    Analog inputs must be converted to digital signals via an A/D converter, which is a relatively complex apparatus that takes up a significant amount of chip area. For this reason, it is common to use a single shared A/D converter with multiple shared (“muxed”) inputs to that single converter. If redundant external inputs are run through the same A/D, this creates a single point of failure in the form of the A/D converter itself and the associated control circuitry.

    Mauser gives an example of this problem applied to automotive throttle control, showing that only a "true 2-channel system" (e.g., a 2oo2 system with redundant inputs) provides safety.


    Figures from Mauser 1999, pp. 731, 738-739 showing an example throttle control system that causes runaway unless a truly redundant system (dual CPUs plus dual A/D conversion) is used.

    Similarly, digital inputs that are processed in the same chip have common circuitry affecting their operation, which in typical chips includes a direction register that determines whether a digital pin is an input or output.

    For both analog and digital inputs, an additional way of looking at single point failures is that if both redundant input signals are processed on the same chip, that one chip is subject to arbitrary faults, with arbitrary fault behavior including the possibility of corrupting both inputs in a way that is both faulty and undetectable by other components in the system. Unless some independent means of ensuring system safety is present, such a single point of failure impairs system safety. An arbitrary fault on a single chip that processes both copies of an analog input sensor might declare the sensor to be fully activated, but within normal operational limits, resulting in that value being processed by the rest of the system whether the input is really active or not. For example, an embedded CPU in a car might think that the brake pedal is depressed, accelerator pedal is depressed, or parking brake is engaged despite the potential presence of redundant sensors on those controls if both sensors pass through the same A/D converter or digital input port and there is a common-mode hardware fault in those input processing circuits.

    Beyond the need to independently computer results on two different chips, there is an additional requirement for safety that each of the two chips independently and fully compare the inputs to detect any faults. For example, if chips A and B both process inputs, but only chip B compares them for correctness, then there is a single point of failure if chip B has a bad input and incorrectly reports the comparison as passing (this only counts as one failure because chip B can fail in an arbitrary way in which it both mis-interprets input B and "lies" about the comparison with input A being OK). A safe way to do such a comparison is that chip A compares both inputs, chip B compares both inputs, and the system only continues operation if both chip A and chip B agree that each of their comparisons validated the inputs as being consistent. Moreover, cross-checks on the outputs based on those inputs must also be performed to detect faults that occur after input processing. That generally leads to a 2oo2 architecture like the one shown below, with each FCR usually being a CPU chip.



    Sometimes both inputs go to both FCRs, but then it must be ensured that there is hardware isolation in place so that a hardware fault on one FCR can't propagate to the inputs of the other FCR via the shared input lines. Another complication is that redundant sensors often do not produce identical output values, and the problem of determining distributed agreement turns out to be very difficult even if all you need is an approximate agreement result. In general, once you have replication, getting agreement across the replicated copies of inputs and computations requires some effort (Poledna 1995). But, these are the sorts of issues that engineers routinely work through when creating safety-critical systems.

    In the end, having a single shared A/D converter or other input circuit for a safety critical system is inadequate. You must have two separate input processing circuits on two separate chips to have two independent Fault Containment Regions (e.g., using a 2oo2 architectural approach with redundant inputs). This is required to achieve safety for a high-integrity application, and any high-integrity embedded system that uses a shared A/D converter on a single chip to process redundant inputs is unsafe.

    References:
    • Mauser, Electronic throttle control – a dependability case study, J. Univ. Computer Science, 5(10), 1999, pp. 730-741.
    • Poledna, S., "Fault tolerance in safety critical automotive applications: cost of agreement as a limiting factor ", Fault-Tolerant Computing, 1995. FTCS-25. Digest of Papers., Twenty-Fifth International Symposium on , 27-30 Jun 1995 Page(s): 73 -8


    Monday, April 7, 2014

    Self-Monitoring and Single Points of Failure


    A previous post discussed single points of failure in general. Creating a safety-critical embedded system requires avoiding single points of failure in both hardware in software. This post is the first part of a discussion about examples of single points of failure in safety critical embedded systems.

     Consequences: A consequence of having a single point of failure is that when a critical single point fails, the system becomes unsafe via taking an unsafe action or ceasing to perform critical functions. 

    Accepted Practices: The following are accepted practices for avoiding single point failures in safety critical systems:
    • A safety critical system must not have any single point of failure that results in a significant unsafe condition if that failure can reasonably be expected to occur during the operational life of the deployed fleet of systems. Because of their high production volume and usage hours, for automobiles, aircraft, and similar systems it must be expected that any single microcontroller hardware chip and software on any single chip will fail in an arbitrarily unsafe manner.
    • Properly implemented monitor-actuator pairs, redundant input processing, and a comprehensive defense-in-depth strategy are all accepted practices for mitigating single point faults (see future blog entries for postings on those topics).
    • Multiple points of failure that can fail at the same time due to the same cause, can accumulate without being detected and mitigated during system operation, or are otherwise likely to fail concurrently, must be treated as having the same severity as a single point of failure.
    Discussion:
    MISRA Report 2 states that the objective of risk assessment is to “show that no single point of failure within the system can lead to a potentially unsafe state, in particular for the higher Integrity Levels.” (MISRA Report 2, 1995, pg. 17). In this context, “higher Integrity levels” are those functions that could cause significant unsafe behavior, typically involving passenger deaths. That report also says that the risk from multiple faults must be sufficiently low to be acceptable.

    Mauser reports on a Siemens Automotive study of electronic throttle control for automobiles (Mauser 1999). The study specifically accounted for random faults (id. p. 732), as well as considering the probability of a “runaway” incidents (id., p. 734) in which an open throttle fault could cause a mishap. It found a possibility of single point failures, and in particular identified dual redundant throttle electrical signals being read by a single shared (multiplexed) analog to digital converter in the CPU (id., p. 739) as a critical flaw.

    Ademaj says that “independent fault containment regions must be implemented in separate silicon dies.” (Ademaj 2003, p. 5) In other words, any two functions on the same silicon die are subject to arbitrary faults and constitute a single point of failure.

    But Ademaj didn’t just say it – he proved it via experimentation on a communication chip specifically designed for safety critical automotive drive-by-wire applications (id., pg. 9 conclusions), and those results required the designers of the TTP protocol chip (based on the work of Prof. Kopetz) to change their approach to achieving fault tolerance to the use of a Star topology because combining a network CPU with the network monitor on the same silicon die was proven to be susceptible to single points of failure even though the die had been specifically designed to physically isolate their network monitor from their main CPU. Even though every attempt had been made for on-chip isolation, two completely independent circuits sharing the same chip were observed to fail together from a single fault in a safety-critical automotive drive-by-wire design.

    A fallacy in designing safety critical systems is thinking that partial redundancy in the form of "fail-safe" hardware or software will catch all problems without taking into account the need for complete isolation of the potentially faulty component and the mitigation component. If both the mitigation and the fault are in the same Fault Containment Region (FCR), then the system can't be made entirely safe.

    To give a more concrete example, consider a single CPU with a self-monitoring feature that has hardware and/or software that detects faults within that same CPU. One could envision such a system signaling to an outside device a self-health report. Such a design pattern is sometimes called a "simplex system with disengagement monitor" and uses "Built-In Test" (BIT) to do the self-checking.  (Note that BIT is a generic term for self-checks, and does not necessarily mean a manufacturing gate-level test or other specific diagnostic.) If self-health checks are false, then the system fails over to a safe state via, for example, shutting down (if shutting down is safe). To be sure, doing this is better than doing nothing. But, it can never get complete coverage. What if the self-health check is compromised by the fault in the chip?

    A look at a research paper on aerospace fault tolerant architectures explains why a simplex (single-FCR) system with BIT is inadequate for high-integrity safety-critical systems. Hammett (2001) figure 5 shows a simplex computer with BIT disengagement features, and says that they “increase the likelihood the computer will fail passive rather than fail active. But it is important to realize that it is impossible to design BIT that can detect all types of computer failures and very difficult to accurately estimate BIT effectiveness.” (id., pg. 1.C.5-4, emphasis added) Such an architecture is said to “Fail Active” after some failures (id., Table 1, p. 1.C.5-7), where “A fail active condition is when the outputs to actuators are active, but uncontrolled. … A fail active condition is a system malfunction rather than a loss of function.” (id., pg. 1.C.5-2, emphasis per original) “For some systems, an annunciated loss of function is an acceptable fail-safe, but a malfunction could be catastrophic.” (id., p. 1.C.5-3, emphasis per original) In particular, with such an architecture depending on the fraction of failures caught (which is not 100%), some “failures will be undetected and the system may fail to a potentially hazardous fail active condition.” (id., p. 1.C.5-4, emphasis added).



    Table 1 from Hammett 2001, below, shows where Simplex with BIT stands in terms of fault tolerance capability. It will fail active (i.e., fail dangerously) for some single point failures, and that's a problem for safety critical systems. .



    Note that dual standby redundancy is also inadequate even though it has two copies of the same computer with the same software. This is because the primary has to self-diagnose that it has a problem before it switches to the backup computer (Hammett Fig. 6, below). If the primary doesn't properly self-diagnose, it never switches over, resulting in a fail-active (dangerous system).


    On the other hand, a self-checking pair (Hammett figure 7 above), sometimes known as a "2 out of 2" or 2oo2 system, can tolerate all single point faults the following way. Each of the computers in a 2oo2 pair runs the same software on identical hardware, usually operating in lockstep. If the outputs don't agree, then the system disables its outputs. Any single failure that affects the computation will, by definition, cause the outputs to disagree (because it can only affect one of the 2 computers, and if it doesn't change the output then it is not affecting the result of the computation). Most dual-point failures will also be detected, except for dual point failures that happen to affect both computers in exactly the same way. Because the two computers are separate FCRs, this is unlikely unless there is a correlated fault such as a software defect or hardware design defect. In practice, the inputs are also replicated to avoid a bad sensor being a single point of failure as well (Hammett's figure is non-specific about inputs, because the focus is on computing patterns). 2oo2 is not a free lunch in many regards, and I'll queue a discussion of the gory details for a future blog post if there is interest. Suffice it to say that you have to pay attention to many details to get this right. But it is definitely possible to build such a system.

    With a 2oo2 system, the second CPU does not improve availability, but in fact reduces it because there are twice as many computers to fail. To attain availability, a redundant failover set of 2oo2 computers can be used (Hammett Fig.9 -- dual self-checking pair). And in fact this is a commonly used architecture in railway switching equipment. Each 2oo2 pair self-checks, and if it detects an error it shuts down, swapping in the other 2oo2 pair.  So having a single 2oo2 pair is done for safety.  The second 2oo2 pair is there to prevent outages (see Hammett figure 9, below).


    From the above we can see that avoiding single points of failure requires at least two CPUs, with care taken to ensure that each CPU is a separate fault containment region. If you need a fail-operational system, then 4 CPUs arranged per figure 9 above will give you that, but at a cost of 4 CPUs.

    Note that we have not at any point attempted to identify some "realistic" way in which a computer can both produce a dangerous output and cause its BIT to fail. Such analysis is not required when building a safe system. Rather, the effects of failure modes in electronics are more subtle and complex than can be readily understood (and some would argue that many real but infrequent failure modes are too complex for anyone to understand). It is folly to try to guess all possible failures and somehow ensure that the BIT will never fail. But even if we tried to do this, the price for getting it wrong in terms of death and destruction with a safety critical system is simply too high to take that chance. Instead, we simply assert that Murphy will find a way to make a simplex system with BIT fail active, and take that as a given.

     By way of analogy, there is no point doing analysis down to single lines of  code or bolt tensile strengths in high-vibration environments within a jet engine to know that flying across the Pacific Ocean in a jet airliner with only one engine working at takeoff is a bad idea.  Even perfectly designed jet engines break, and any single copy of perfectly design jet engine software will eventually fail (due to a single event upset within the CPU it is running in, if for no other reason). The only way to achieve safety is to have true redundancy, with no single point failure whatsoever that can possibly keep the system from entering a safe state.

    In practice the "output if agreement" block shown in these figures can itself be a single point of failure. This is resolved in practical systems by, for example, having each of the computers in a 2oo2 pair control the reset/shutdown line on the other computer in the 2oo2 pair. If either computer detects a mismatch, it both shuts down the other CPU and commits suicide itself, taking down the pair. This system reset also causes the switch in a dual 2oo2 system to change over to the backup pair of computers. And yes, that switch can also be a single point of failure, which can be resolved by for example having redundant actuators that are de-energized when the owner 2oo2 pair shuts down. And, we have to make sure our software doesn't cause correlated faults between pairs by ensuring it is of sufficiently high integrity as well.

    As you can see, flushing out single points of failure is no small thing. But if you want to build a safety critical system, getting rid of single points of failure is the price of admission to the game. And that price includes truly redundant CPUs for performing safety critical computations.

    References:
    • Hammett, Design by extrapolation: an evaluation of fault-tolerant avionics, 20th Conference on Digital Avionics Systems, IEEE, 2001.
    • MISRA, Report 2: Integrity, February 1995.
    • Mauser, Electronic throttle control – a dependability case study, J. Univ. Computer Science, 5(10), 1999, pp. 730-741.

    Monday, March 31, 2014

    How Often Will Random Faults Kill Someone?


    If you design a system that doesn't eliminated all single points of failure, eventually your safety-critical system will kill someone.  How often is that?  Depends on your exposure, but in real-world systems it could easily be on a daily basis.

    "Realistic" Events Happen Every Million Hours

    Understanding software safety requires reasoning about extremely low probability events, which can often lead to results that are not entirely intuitive. Let's attempt to put into perspective where to set the bar for what is “realistic” and why single point fault vulnerabilities are inherently dangerous for systems that can kill people if they fail.

    If a human being lives to be 90 years old, that is 90 years * 365.25 days/yr * 24 hrs/day = 788,940 hours.  While the notion of what a "realistic" fault is can be subjective, perhaps an individual will think something is a realistic failure type if (s)he can expect to see it happen in one human lifetime. 

    This definition has some problems since everyone’s experience varies. For example, some would say that total loss of dual redundant hydraulic service brakes in a car is unrealistic. But that has actually happened to me due to a common-mode cracking mechanical failure in the brake fluid reservoir of my vehicle, and I had to stop my vehicle using my parking brake. So I’d have to call loss of both service brake hydraulic systems a realistic fault from my point of view. But I have met automotive engineers who have told me it is “impossible” for this failure to happen. The bottom line: intuition as to what is "realistic" isn’t enough for these types of matters -- you have to crunch the numbers and take into account that if you haven't seen it yourself that doesn't mean it can't happen.

    So perhaps it is also realistic if a friend has told you a story about such a fault. From a probability point of view let’s just say it's "realistic" if it is likely to happen every 1,000,000 hours (directly to you in your life, plus 25% extra to account for second-hand stories).  

    Probability math in the safety critical system area is usually only concerned with rounded powers of ten, and 1 million is just a rounding up of a human lifespan. Put another way, if it happens once per million hours, let’s say it’s “realistic.” By way of contrast, the odds of winning the top jackpot in Powerball are about one in 175 million (http://www.powerball.com/powerball/pb_prizes.asp), and someone eventually wins without buying 175 tickets per hour for an entire million-hour human lifetime, so a case can made for much less frequent events to also be “realistic” -- at least for someone in the population.

    I should note at this point that aircraft safety and train safety are at least 1000 times more rigorous than one catastrophic failure per million hours. Then again, trains and jumbo jets hold hundreds of people, who are all being exposed to risk together.  So the point of this essay is simply to make safety probability more human-accessible, and not to argue that one catastrophic fault per million hours is OK for a high criticality system -- it's not!

    "Realistic" Catastrophic Failures Will Happen To A Deployed Fleet Daily

    Obermaisser gives permanent hardware failure rates as about 100 FIT (Obermaisser p. 10 in the context of drive-by-wire automobiles. Note that 1 “FIT” is 1 failure per billion hours, so 100 FIT is one failure per 10 million operating hours). This means that you aren’t likely to see any particular car component failure in your lifetime. But cars have lots of components all failing at this rate, so there is a reasonable chance you’ll see some component failure and need to buy a replacement part. (These failure rates are for normal operating lifetimes, and do not count the fact that failures are more frequent as parts near end of life due to age and wearout.)

    However, transient failures, such as those caused by cosmic rays, voltage surges, and so on, are much more common, reaching rates of 10,000-100,000 FIT (Obermaisser p. 10; 100,000 per billion hours = 100 per million hours). That means that any particular component will fail in a way that can be fixed by rebooting or the like about 10 to 100 times per million hours – well within the realm of realistic by individual human standards. Obermaisser also gives a frequency for arbitrary faults as 1/50th of all faults (id., p. 8). Thus, any particular component, can be expected to fail 10 to 100 times per million hours, and to do so in an “arbitrary” way (which is software safety lingo for “dangerous”) about 0.2 to 2 times per million hours.  This is right around the range of “realistic,” straddling the once per million hours frequency.

     In other words, if you own one single electronic device your whole life, it will suffer a potentially dangerous failure about once in your lifetime. If you continuously own vehicles for your entire life, and they each have 100 computer chips in them, then you can expect to see about one potentially dangerous failure per year since there are 100 of them to fail -- and whether any such failure actually kills you depends upon how well the car's fault tolerance architecture successfully deals with that failure and how lucky you are that day. (We'll get into the point that you don't drive your car 24x7 momentarily.)  That is why it is so important to have redundancy in safety-critical automotive systems. These are general examples to indicate the scale of the issues at hand, and not specifically intended to correspond to exact numbers in any particular vehicle, especially one in which failsafes are likely to somewhat reduce the mishap rate by providing partial redundancy.

     Now let’s see how the probabilities work when you take into account the large size of the fleet of deployed vehicles. Let's say a car company sells 1 million vehicles in a year. Let’s also say that an average vehicle is driven about 1 hour per day (FHA 2009 National Household Travel Survey, p. 31 says 56 minutes per day, but we're rounding off here). That’s 1 million hours per day.  If we multiply that by the range of dangerous transient faults per component (0.2 to 2 per million hours), that means that the fleet of vehicles can expect 0.2 to 2 dangerous transient faults per day for any particular safety critical control component in that fleet. And there are likely more than one safety critical components in each car. In other words, while a dangerous failure may seem unlikely in an individual basis, designers must expect dangerous failures to happen on a daily basis in a large deployed fleet.

    While the exact numbers in this calculation are estimates, the important point is that a competent safety-critical designer must take into account arbitrary failures when designing a system for a large deployed fleet. 

     The above is an example calculation as to why redundancy is required to achieve safety. Arbitrary faults can be expected to be dangerous (we’ve already thrown away 98% of faults as benign in these calculations – we’re just keeping 2% of faults as dangerous faults). Multiple fault containment regions (FCRs) must be used to mitigate such faults.

     It is important to note that in the fault tolerant and safety critical computing fields “arbitrary” means just that – it is a completely unconstrained failure mode.  It is not simply a failure that seems “realistic” based on some set of preconceived notions or experiences. Rather, designers consider it to be the worst possible failure of a single chip in which it does the worst possible thing to make the system unsafe. For example, a pyrotechnic device must be expected to fire accidentally at some point unless there is true redundancy that mitigates every possible single point of failure, whether the exact failure mechanism in the chip can be imagined by an engineer or not. The only way to avoid single point failures is via some form of true redundancy that serves as an independent check and balance on safety critical functions.

    As a second source for similar failure rate numbers, Kopetz, who specializes in automotive drive-by-wire fault tolerant computer systems, gives an acceptable mean-time-to-failure of safety-critical applications of one critical failure per 1 billion hours (more than 100,000 years for any particular vehicle), saying that the “dependability requirements for a drive-by-wire system are even more stringent than the dependability requirements for a fly-by-wire systems, since the number of exposed hours of humans is higher in the automotive domain.” (Kopetz 2004, p. 32). This cannot be achieved using any simplex hardware scheme, and thus requires redundancy to achieve that safety goal. He gives expected component failure rates as 100 FIT for permanent faults and 100,000 FIT for transient faults (id., p. 37).

     By the same token, any first fault that persists a long time during operation without being detected or mitigated sets the stage for a second fault to happen, resulting in a sequence of faults that is likewise unsafe. For example, consider if a vehicle has a manufacturing fault or other fault that is undetected but disables a redundant sensor input without that failure being detected or fixed. From the time that fault happens forward, the vehicle is placed at the same risk as if it only had a single point failure vulnerability in the first place. Redundancy doesn’t do any good if a redundant component fails, nobody knows about it, nobody fixes it, and the vehicle keeps operating as if nothing is wrong. For example, for this reason a passenger jet with two engines isn’t allowed to take off on an over-ocean flight with only one engine working at takeoff.

    Software Faults Only Make Things Worse

     The above fault calculations “assume the absence of software faults,” (Obermaisser, p. 13), which is an unsupported assumption for many systems. Software faults can only be expected to make the arbitrary failure rate worse.Having software-implemented failsafes and partial redundancy may mitigate dangerous faults to more than the computed 2%. But it is impossible for a mitigation technique in the same fault containment region as a failure to be successful at mitigating all possible failure modes.

    Software defects manifest as single points of failure in a way that may be counter-intuitive. Software defects are design defects rather than run-time hardware failures, and are therefore present all the time in every copy of a system that is created. (MISRA Report 2, p. 7) Moreover, in the absence of perfect memory and temporal isolation mechanisms, every line of software in a system has the potential to affect the execution of every other line in the system. This is especially problematic in systems which do not use memory protection, which do not have an adequate real time scheduling approach, which make extensive use of global variables, and have other similar problems that result in an elevated risk of the activation of one software defect causing system-wide problems, or causing a cascading activation of other software defects. Therefore, in the absence of compelling proof of complete isolation of software tasks from each other, the entire CPU must be considered a single FCR for software, making the entirety of software on a CPU a single point of failure for which any possible unsafe behavior must be fully and completely mitigated by mechanisms other than software on that same CPU.

    Some examples of ways in which multiple apparent software defects might result within a single FCR but still constitute a single point of failure include: corruption of a block of memory (corrupting the values of many global variables), corruption of task scheduling information in and operating system (potentially killing or altering the execution patterns of several tasks), and timing problems that cause multiple tasks to miss their deadlines. It should be understood that these are merely examples – arbitrarily bad faults are possible from even a seemingly trivial software defect.

     MISRA states that FMEA and Fault Tree Analysis that examine specific sources and effects of faults are “not applicable to programmable and complex non-programmable systems because it is impossible to evaluate the very high number of possible failure modes and their resulting effects.” (MISRA Report 2 p. 17). Instead, for any complex electronics (whether or not it contains software), MISRA tells designers to “consider faults at the module level.” In other words, MISRA describes considering the entire integrated circuit for a microcontroller as a single FCR.

    Deployed Systems Will See Dangerous Random Faults

    The point of all this is to demonstrate that arbitrary dangerous single point faults can be expected to happen on a regular basis when hundreds of thousands of embedded systems such as cars are involved. True redundancy is required to avoid weekly mishaps on any full-scale production vehicle that involves hundreds of thousands of units in the field. No single point fault, no matter how obscure or seemingly unlikely, can be tolerated in a safety critical system. Moreover, multiple point faults cannot be tolerated if they are sufficiently likely to happen via accumulation of undetected faults over time. In particular, software-implemented countermeasures that run on a CPU aren't going to be 100% effective if that same CPU is the one that suffered a fault in the first place. True redundancy is required to achieve safety from catastrophic failures for large deployed fleets in the face of random faults.

    It is NOT acceptable practice to start arguing whether any particular fault is “realistic” given a particular design of within a single point fault region such as an integrated circuit. This is in part because designer intuition is fallible for very low (but still important) probability events, and also in part because it is as a practical matter impossible to envision all arbitrary faults that might cause a safety problem. Rather, accepted practice is to simply assert that the worst case possible single-point failure will find a way to happen, and design to ensure that such an event does not render the system unsafe.

    References:
    • Kopetz, H., On the fault hypothesis for a safety-critical real-time system, ASWSD 2004, LNCS 4147, pp. 31-42, 2006.
    • MISRA, Report 2: Integrity, February 1995.
    • Obermaisser, A Fault Hypothesis for Integrated Architectures, International Workshop on Intelligent Solutions in Embedded Systems, June 2006, pp. 1-18.

    Monday, March 24, 2014

    Safety Requires No Single Points of Failure

    If someone is likely to die due to a system failure, that means the system can't have any single points of failure. The implications for this are more subtle than they might seem at first glance, because a "single point" isn't a single line of code or a single logic gate, but rather a "fault containment region," which is often an entire chip.

    Safety Critical Systems Can't Have Single Points of Failure

    One of the basic tenets of safety critical system design is avoiding single points of failure. A single point of failure is a component that, if it is the only thing that fails, can make the system unsafe. In contrast, a double (or triple, etc.) point of failure situation is one in which two (or more) components must fail to result in an unsafe system.

    As an example, a single-engine airplane has a single point of failure (the single engine – if it fails, the plane doesn’t have propulsion power any more). A commercial airliner with two engines does not have an engine as a single point of failure, because if one engine fails the second engine is sufficient to operate and land the aircraft in expected operational scenarios. Similarly, cars have braking actuators on more than one wheel in part so that if one fails there are other redundant actuators to stop the vehicle.

    MISRA Report 2 states that avoiding such failures is the centerpiece of risk assessment.  (MISRA Report 2 p. 17)

    For systems that are deployed in large numbers for many operating hours, it can reasonably be expected that any complex component such as a computer chip or the software within it will fail at some point. Thus, attaining reasonable safety requires ensuring that no such component is a single point of failure.

    When redundancy is used to avoid having a single point of failure, it is important that the redundant components not have any “common mode” failures. In other words, there must not be any way in which both redundant components would fail from the same cause, because any such cause would in effect be a single point of failure. As an example, if two aircraft engines use the same fuel pump, then that fuel pump becomes a single point of failure if it stops pumping fuel to both engines at the same time. Or if two brake pads in a car are closed by pressure from the same hydraulic line, then a hydraulic line rupture could cause both brake pads to fail at the same time due to loss of hydraulic pressure.

    Similarly, it is important that there be no sequences of failures that escape detection until a last failure results in a system failure. For example, if one engine on a two-engine aircraft fails, it is essential to be able to detect that failure immediately so that diversion to a nearby airport can be completed before the second engine has a chance to fail as well.

    For functions implemented in software, designers must assume that software can fail due to software defects missed in testing or due to hardware malfunctions. Software failures that must be considered include: a task dying or “hanging,” a task missing its deadline, a task producing an unsafe value due to a computational error, and a task producing an unsafe value due to unexpected (potentially abnormal) input values. It is important to note that when a software task “dies” that does not mean the rest of the system stops operating. For example, a software task that periodically computes a servo angle might die and leave the servo angle commanded to the last computed position, depending upon whether other tasks within the system notice whether the servo angle computation task is alive or not.

    Avoiding single point failures in a multitasking software system requires that a single task must not manage both normal behavior and failure mode backup behavior, because if a failure is caused by that task dying or misbehaving, the backup behavior is likely to be lost as well. Similarly, a single task should not both perform an action and also monitor faults in that action, because if the task gets the action wrong, one should reasonably expect that it will subsequently fail to perform the monitoring correctly as well.

    It is well known in the research literature that avoiding single points of failure is essential in a safety critical system. Leveson says: “To prove the safety of a complex system in the presence of faults, it is necessary to show that no single fault can cause a hazardous effect and that hazards resulting from sequences of failures are sufficiently remote.”  (Leveson 1986, pg. 139)

    Storey says, in the context of software safety: “it is important to note that as complete isolation can never be guaranteed, system designers should not rely on the correct operation of a single module to achieve safety, even if that module is well trusted. As in all aspects of safety, several independent means of assuring safety are preferred.” (Storey 1996, pg. 238) In other words, a fundamental tenet of safety critical system design is to first ensure that there are no single points of failure, and then ensure that there are no likely pairs or sequences of failures that can also cause a safety problem. A specific implication of this statement is that software within a single CPU cannot be trusted to operate properly at all times. A second independent CPU or other device must be used to duplicate or monitor computations to ensure safety. Storey gives a variety of redundant hardware patterns to avoid single point failures (id., pp. 131-143)

    Douglass says:

     Douglass on single-point failures (Douglass 1999, pp. 105-107)

    Douglass makes it clear that all single point failures must be addressed to attain safety, no matter how remote the probability of failure might be.

    Moreover, random faults are just one type of fault that is expected on a regular, if infrequent basis. “Random faults, like end-of-life failures of electrical components, cannot be designed away. It is possible to add redundancy so that such faults can be easily detected, but no one has ever made a CPU that cannot fail.” (Douglass 1999, pg. 105, emphasis added)

    Fault Containment Regions Count As "Single Points" of Failure

    Additionally, it is important to recognize that any line of code on a CPU (or even many lines of code on a CPU) can be a single point fault that creates arbitrary effects in diverse places elsewhere on that same CPU. Addy makes it clear that: “all parts of the software must be investigated for errors that could have a safety impact. This is not a part of the software safety analysis that can be slighted without risk, and may require more emphasis than it currently receives. [Addy’s] case study suggests that, in particular, care should be taken to search for errors in noncritical software that have a hidden interface with the safety-critical software through data and system services.” (Addy 1991, pg 83). Thus, it is inadequate to look at only some of the code to evaluate safety – you have to look at every last line of code to be sure.

    A Fault Containment Region (FCR) is a region of a system which is neither vulnerable to external faults, nor causes faults outside the FCR. (Lala 1994, p. 30) The importance of an FCR is that arbitrary faults that originate inside an FCR cannot spread to another FCR, providing the basis of fault containment within a system that has redundancy to tolerate faults. In other words, if an FCR can generate erroneous data, you need a separate, independent FCR to detect that fault as the basis for creating a fault tolerant system (id.)




    Hammet presents architectures that are fail safe, such as a “self-checking pair with simplex fault down” that in my experience is also commonly used in rail signaling equipment.  (Hammet 2002, pp. 19-20). Rail signaling systems I have worked with also use redundancy to avoid single points of failure. So it's no mystery how to do this -- you just have to want to get it right.

    Obermaisser gives a list of failure modes for a “fault containment region” (FCR) that includes “arbitrary failures,” (Obermaisser 2006 pp. 7-8, section 3.2) with the understanding that a safe system must have multiple redundant FCRs to avoid a single point of failure. Put another way, a fault anywhere in a single FCR renders the entire FCR unreliable. Obermaisser also says that “The independence of FCRs can be compromised by shared physical resources (e.g., power supply, timing source), external faults (e.g., Electromagnetic Interference (EMI), spatial proximity) and design.” (id., section 3.1 p. 6)

    Kopetz says that an automotive drive-by-wire system design must “assure that replicated subsystems of the architecture for ultra-dependable systems fail independently of each other.” (Kopetz 2004, p. 32, emphasis per original) Kopetz also says that each FCR must be completely independent of each other, including: computer hardware, power supply, timing source, clock synchronization service, and physical space, noting that “all correlated failures of two subsystems residing on the same silicon die cannot be eliminated.” (id., p. 34).

    Continuing on with automotive specific references, none of this was news to the authors of the MISRA Software Guidelines, who said the objective of risk assessment is to “show that no single point of failure within the system can lead to a potentially unsafe state, in particular for the higher Integrity Levels.” (MISRA Report 2, 1995, p. 17).   In this context, “higher Integrity levels” are those functions that could cause significant unsafe behavior. That report also says that the risk from multiple faults must be sufficiently low to be acceptable.

     Mauser reports on a Siemens Automotive study of electronic throttle control for automobiles (Mauser 1999). The study specifically accounted for random faults (id. p. 732), as well as considering the probability of a “runaway” incidents (id., p. 734). It found a possibility of single point failures, and in particular identified dual redundant throttle electrical signals being read by a single shared (multiplexed) analog to digital converter in the CPU (id., p. 739) as a critical flaw.

    Ademaj says that “independent fault containment regions must be implemented in separate silicon dies.” (Ademaj 2003, p. 5) In other words, any two functions on the same silicon die are subject to arbitrary faults and constitute a single point of failure.
    FCRs must be implemented in separate silicon die (Ademaj 2003, p. 5)

     But Ademaj didn’t just say it – he proved it via experimentation on a communication chip specifically designed for safety critical automotive drive-by-wire applications (id., pg. 9 conclusions), and those results required the designers of the TTP protocol chip (based on the work of Prof. Kopetz) to change their approach to achieving fault tolerance to the use of a Star topology because combining a network CPU with the network monitor on the same silicon die was proven to be susceptible to single points of failure even though the die had been specifically designed to physically isolate their monitor from their Main CPU. Even though every attempt had been made for on-chip isolation, two completely independent circuits sharing the same chip were observed to fail together from a single fault in a safety-critical automotive drive-by-wire design.

    In other words, if a safety critical system has a single point of failure – any single point of failure, even if mitigated with “failsafes” in the same fault containment region – then it is not only unsafe, but also flies in the face of well accepted practices for creating safe systems.

    In a future post I'll get into the issue of how critical a system has to be for addressing all single point failures to be mandatory. But the short version is that is someone is likely to die as a result of system failure, you need to avoid all single point failures (every last one of them) on at least the level of FCRs.

    References:
    • Addy, E., A case study on isolation of safety-critical software, Proc. Conf Computer Assurance, pp. 75-83, 1991.
    • Ademaj et al., Evaluation of fault handling of the time-triggered architecture with bus and star topology, DSN 2003.
    • Douglass, Doing Hard Time: Developing Real-Time Systems with UML, Objects, Frameworks, and Patterns, Addison-Wesley Professional, 1999.
    • Hammet, Design by extrapolation: an evaluation of fault-tolerant avionics, IEEE Aerospace and Electronic Systems, 17(4), 2002, pp. 17-25.
    • Kopetz, H., On the fault hypothesis for a safety-critical real-time system, ASWSD 2004, LNCS 4147, pp. 31-42, 2006.
    • Lala et al., Architectural principles for safety-critical real-time applications, Proceedings of the IEEE, 82(1), Jan. 1994, pp. 25-40.
    • Leveson, N., Software safety: why, what, how, Computing Surveys, Vol. 18, No. 2, June 1986, pp. 125-163.
    • MISRA, Report 2: Integrity, February 1995.
    • Obermaisser, A Fault Hypothesis for Integrated Architectures, International Workshop on Intelligent Solutions in Embedded Systems, June 2006, pp. 1-18.
    • Storey, N., Safety Critical Computer Systems, Addison-Wesley, 1996.

    Monday, March 17, 2014

    Problems Caused by Random Hardware Faults in Critical Embedded Systems


    A previous posting described sources of random hardware faults. This posting shows that even single bit errors can lead to catastrophic system failures. Furthermore, these types of faults are difficult or impossible to identify and reproduce in system-level testing. But, just because you can't make such errors reproduce at the snap of your fingers doesn't mean they aren't affecting real systems every day.

    Paraphrasing a favorite saying of my colleague Roy Maxion: "You may forget about unlikely faults, but they won't forget about you."

    Random Faults Cause Catastrophic Errors

    Random faults can and do occur in hardware on an infrequent but inevitable basis. If you have a large number of embedded systems deployed you can count on this happening to your system. But, how severe can the consequences be from a random bit flip? As it turns out, this can be a source of system-killer problems that have hit more than one company, and can cause catastrophic failures of safety critical systems.

    While random faults are usually transient events, even the most fleeting of faults will at least sometimes result in a persistent error that can be unsafe. “Hardware component faults may be permanent, transient or intermittent, but design faults will always be permanent. However, it should be remembered that faults of any of these classes may result in errors that persist within the system.” (Storey 1996, pg. 116)

    Not every random fault will cause a catastrophic system failure. But based on experimental evidence and published studies, it is virtually certain that at least some random faults will cause catastrophic system failures in a system that is not fully protected against all possible single-point faults and likely combinations of multi-point faults.

    Addy presented a case study of an embedded real time control system. The system architecture made heavy use of global variables (which is a bad idea for other reasons (Koopman 2010)). Analysis of test results showed that a single memory bit written to a global variable caused an unsafe action for the system studied despite redundant checks in software. (Addy 1991, pg. 77). This means that a single bit flip causing an unsafe system has to be considered plausible, because it has actually been demonstrated in a safety critical control system.


    Addy also reported a fault in which one process set a value, and then another process took action upon that value. Addy identified a failure mode in the first process of his system due to task death that rendered the operator unable to control the safety-critical process:

    Excerpt from (Addy 1991, pg. 80)

    In Addy’s case a display is frozen and control over the safety critical process is lost.  As a result of the experience, Addy recommends a separate safety process. (Addy 1991 pg. 81).

    Sun Microsystems learned the Single Event Upset (SEU) lesson the hard way in 2000, when it had to recall high-end servers because they had failed to use error detection/correction codes and were suffering from server crashes due to random faults in SRAM. (Forbes, 2000) Fortunately no deaths were involved since this is a server rather than a safety critical embedded system.

    Excerpt from Forbes 2000.

    Cisco also had a significant SEU problem with their 12000 series router line cards. They went so far as to publish a description and workaround notification, stating: “Cisco 12000 line cards may reset after single event upset (SEU) failures. This field notice highlights some of those failures, why they occur, and what work arounds are available.” (Cisco 2003) Again, these were not safety critical systems.

    More recently, Skarin & Karlsson presented a study of fault injection results on an automotive brake-by-wire system to determine if random faults can cause an unsafe situation. They found that 30% of random errors caused erroneous outputs, and of those 15% resulted in critical braking failures. (Skarin & Karlsson 2008)  In the context of this brake-by-wire study, critical failures were either loss of braking or a locked wheel during braking. (id., p. 148) Thus, it is clear that these sorts of errors can lead to critical operational faults in electronic automotive controls. It is worth noting that random errors tend to increase dramatically with altitude, making this a relevant environmental condition (Heijman 2011, p. 6)

    Mauser reports on a Siemens Automotive study of electronic throttle control for automobiles (Mauser 1999). The study specifically accounted for random faults (id. p. 732), as well as considering the probability of a “runaway” incidents (id., p. 734). It found a possibility of single point failures, and in particular identified dual redundant throttle electrical signals being read by a single shared (multiplexed) analog to digital converter in the CPU (id., p. 739) as a critical flaw.

    For automotive safety applications, MISRA Software Guidelines discuss the need to identify and correct corruption of memory values (MISRA Software Guidelines p. 30).

    Random Faults Can Be Impossible to Reproduce In General

    Intermittent faults from SEUs and other sources can and do occur in the field, causing real problems that are not readily reproducible even if the cause is a hard failure. In many cases components with faults are diagnosed as “Trouble Not Identified” (TNI). In my work with embedded system companies I have learned that is common to have TNI rates of 50% for complex electronic components across many embedded industries. (This means that half of the components returned to the factory as defective are found to be working properly when normal testing procedures are used.) While it is tempting to blame the customer for incorrect operation of equipment, the reality is that many times a defect in software design, hardware design, or manufacturing is ultimately responsible for these elusive problems. Thomas et al. give a case study of a TNI problem on Ford vehicles and conclude that it is crucial to perform a safety critical electronic root cause analysis for TNI problems rather than just assume it is the customer’s fault for not using the equipment properly or falsely reporting an error. (Thomas 2002)
    TNI reports must be tracked down to root cause. (Thomas 2002, pg. 650)

    The point is not really the exact number, but rather that random failures are the sort of thing that can be expected to happen on a daily basis for a large deployed fleet of safety-critical systems, and therefore must be guarded against rigorously. Moreover, lack of reproducibility does not mean the failure didn’t happen. In fact, irreproducible failures and very-difficult-to-reproduce failures are quite common in modern electronic components and systems.

    Random Faults Are Even Harder To Reproduce In System-Level Testing

    While random faults happen all the time in large fleets, detecting and reproducing them is not so easy. For example: “Studies have shown that transient faults occur far more often than permanent ones and are also much harder to detect.” (Yu 2003 pg. 11)

    It is also well known that some software faults appear random, even though they may be caused by a software defect that could fixed if identified. Yu explains that this is a common situation:
    Software faults can be very difficult to track down (Yu 2003, p. 12)

    Random faults cannot be assumed to be benign, and cannot be assumed to always result in a system crash or system reset that puts the system in to a safe state. “Hardware component faults may be permanent, transient or intermittent, but design faults will always be permanent. However, it should be remembered that faults of any of these classes may result in errors that persist within the system.” (Storey 1996, p. 116)  “Many designers believe that computers fail safe, whereas NASA experience has shown that computers may exhibit hazardous failure modes.” (NASA 2004, p. 21)

    Moreover, random faults are just one type of fault that is expected on a regular, if infrequent basis. “Random faults, like end-of-life failures of electrical components, cannot be designed away. It is possible to add redundancy so that such faults can be easily detected, but no one has ever made a CPU that cannot fail.” (Douglass 1999, pg. 105, emphasis added)

    MISRA states that “system design should consider both random and systematic faults.” (MISRA Software Guidelines p. 6). It also states “Any fault whose initiating conditions occur frequently will usually have emerged during testing before production, so that the remaining obscure faults may appear randomly, even though they are entirely systematic in origin.” (MISRA Report 2 p. 7). In other words, software faults can appear random, and are likely to be elusive and difficult to pin down in products – because the easy ones are the ones that get caught during testing.

    SEUs and other truly random faults are not reproducible in normal test conditions because they depend upon rare uncontrollable events (e.g., cosmic ray strikes). “The result is an error in one bit, which, however, cannot be duplicated because of its random nature.” (Tang 2003, pg. 111).

    While intermittent faults might manifest repeatedly over a long series of tests with a particular individual vehicle or other embedded system, doing so may require precise reproduction of environmental conditions such as vibration, temperature, humidity, voltage fluctuations, operating conditions, and other factors.

    Random faults may be difficult to track down, but can result in
    errors that persist within the system (Storey 1996, pp. 115-116)

    It is unreasonable to expect a fault caused by a random error to perform upon demand in a test setup run for a few hours, or even a few days. It can easily take weeks or months to reproduce random software defects, and some may never be reproduced simply by performing system level testing in any reasonable amount of time.  If a fleet of hundreds of thousands of vehicles, or other safety critical embedded system, only sees random faults a few times a day, testing a single vehicle is unlikely to see that same random fault in a short test time – or possibly it may never see a particular random fault over that single test vehicle’s entire operating life.

    Thus, it is to be expected that realistic faults – faults that happen to all types of computers on an everyday basis when large numbers of them are deployed – can’t necessarily be reproduced upon demand in a single vehicle simply by performing vehicle-level tests.

    Software Faults Can Also Appear To Be Random and Irreproducible In System-Level Testing

    Software data corruption includes changes during operation made to RAM and other hardware resources (for example, registers controlling I/O peripherals that can be altered by software). Unlike hardware data corruption, software data corruption is caused by design defects in software rather than disruption of hardware operating state. The defects may be quite subtle and seemingly random, especially if they are caused by timing problems (also known as race conditions). Because software data corruption is caused by a CPU writing values to memory or registers under software control just as the hardware would expect to happen with correctly working software, hardware data corruption countermeasures such as error detecting codes do not mitigate this source of errors.

    It is well known that software defects can cause corruption of data or control flow information. Since such corruption is due to software defects, one would expect the frequency at which it happens depends heavily on the quality of the software. Generally developers try to find and eliminate frequent sources of memory corruption, leaving the more elusive and infrequent problems in production code. Horst et al. used modeling and measurement to estimate a corruption would occur once per month with a population of 10,000 processors (1993, abstract), although that software was written to non-safety-critical levels of software quality. Sullivan and Chillarege used defect reports to understand “overlay errors” (the IBM term for memory corruption), and found that such errors had much higher impact on the system than other types of defects (Sullivan 1991, p. 9).

    MISRA recommends using redundant data or a checksum (MISRA Software Guidelines p. 30; MISRA report 1 p. 21) NASA recommends using multiple copies or a CRC to guard against variable corruption (NASA 2004, p. 93). IEC 61508-3 recommends protecting data structures against corruption (p. 72).
    It is also well known that some software faults appear random, even though they may be caused be a software defect that could fixed if identified. Yu explains that this is a common situation, as shown in a previous figure. (Yu 2003, p. 12) It is well known that software faults can appear to be irreproducible at the system level, making it virtually impossible to reproduce them upon demand with system level testing of an unmodified system.

    It reasonable to expect that any software – and especially software that has not been developed in rigorous accordance with high-integrity software procedures – to have residual bugs that can cause memory corruption, crashes or other malfunctions. The figure below provides an example software malfunction from an in-flight entertainment system. Such examples abound if you keep an eye out for them.

    Aircraft entertainment system software failure. February 13, 2008 (photo by author)

    It should be no surprise if a system suffers random faults in a large-scale deployment that are difficult or impossible to reproduce in the lab via system-level testing, even if their manifestation in the field has been well documented. There simply isn't enough exposure to random fault sources in laboratory testing conditions to experience to the variety of faults that will happen in the real world to a large fleet.

    Coming at this another way, saying that something can't happen because it can't be reproduced in the lab ignores how random hardware and software faults manifest in the real world. Arbitrarily bad random faults can and will happen eventually in any widely deployed embedded system. Designers have to make a choice between ignoring this reality, or facing it head-on to design systems that are safe despite such faults occurring.

    References:

    • Addy, E., A case study on isolation of safety-critical software, Proc. Conf Computer Assurance, pp. 75-83, 1991.
    • Cisco, Cisco 12000 Single Event Upset Failures Overview and Work Around Summary, August 15, 2003.
    • Douglass, Doing Hard Time: Developing Real-Time Systems with UML, Objects, Frameworks, and Patterns, Addison-Wesley Professional, 1999.
    • Forbes, Sun Screen, November 11, 2000. Accessed at: http://www.forbes.com/global/2000/1113/0323026a_print.html
    • Heijmen, Soft errors from space to ground: historical overview, empirical evidence, and future trends (Chapter 1), in: Soft Errors in Modern Electronic Systems, Frontiers in Electronic Testing, 2011, Volume 41, pp. 1-41.
    • Horst et al., The risk of data corruption in microprocessor-based systems, FTCS, 1993, pp. 576-585.
    • Koopman, Better Embedded System Software, 2010.
    • Mauser, Electronic throttle control – a dependability case study, J. Univ. Computer Science, 5(10), 1999, pp. 730-741.
    • MISRA, Development Guidelines for Vehicle Based Software, November 1994 (PDF version 1.1, January 2001).
    • MISRA, Report 2: Integrity, February 1995.
    • NASA-GB-8719.13, NASA Software Safety Guidebook, NASA Technical Standard, March 31, 2004.
    • Skarin & Karlsson, Software implemented detection and recovery of soft errors in a brake-by-wire system, EDCC 2008, pp. 145-154.
    • Storey, N., Safety Critical Computer Systems, Addison-Wesley, 1996.
    • Sullivan & Chillarege, Software defects and their impact on system availability: a study of field failures in operating systems, Fault Tolerant Computing Symposium, 1991, pp 1-9.
    • Tang & Rodbell, Single-event upsets in microelectronics: fundamental physics and issues, MRS Bulletin, Feb. 2003, pp. 111-116.
    • Thomas et al., The “trouble not identified” phenomenon in automotive electronics, Microelectronics Reliability 42, 2002, pp. 641-651.
    • Yu, Y. & Johnson, B., Fault Injection Techniques: a perspective on the state of research. In: Benso & Prinetto (eds.), Fault Injection Techniques and Tools for Embedded System Reliability Evaluation, Kluwer, 2003, pp. 7-39.


    Monday, March 3, 2014

    Random Hardware Faults


    Computer hardware randomly generates incorrect answers once in a while, in some cases due to single event upsets from cosmic rays. Yes, really! Here's a summary of the problem and how often you can expect it to occur, with references for further reading.

    ---------------------------------------

    (As will be the case with some of my posts over the next few months, this text is written more in an academic style, and emphasizes automotive safety applications since that is the area I perform much of my research in. But I hope it is still useful, if less entertaining in style than some of my other posts.)

    Hardware corruption in the form of bit flips in memories has been known to occur on terrestrial computers for more than 30 years (e.g., May & Woods 1979; Karnik et al. 2004, Heijman 2011). Hardware data corruption includes changes during operation made to both RAM and other CPU hardware resources (for example bits held in CPU registers that are intermediate results in a computation).

    "Soft errors" are those in which a transient fault causes a loss of data, but not damage to a circuit. (Mariani03, pg. 51) Terminology can be slightly confusing in that a “soft error” has nothing to do with software, but instead refers to a hardware data corruption that does not involve permanent chip damage. These “soft” errors can be caused by, among other things, neutrons created by cosmic rays that penetrate the earth’s atmosphere.

    Soft errors can be caused by external events such as naturally occurring alpha radiation particle caused by impurities in the semiconductor materials, as well as neutrons from cosmic rays that penetrate the earth’s atmosphere (Mariani03, pp. 51-53). Additionally, random faults may be caused by internal events such as electrical noise and power supply disruptions that only occasionally cause problems. (Mariani03, pg. 51).

    As computers evolve to become more capable, they also use smaller and smaller individual devices to perform computation. And, as devices get smaller, they become more vulnerable to data corruption from a variety of sources in which a subtle fleeting event causes a disruption in a data value. For example, a cosmic ray might create a particle that strikes a bit in memory, flipping that bit from a “1” to a “0,” resulting in the change of a data value or death of a task in a real time control system. (Wang 2008 is a brief tutorial.)

    Chapter 1 of Ziegler & Puchner (2004) gives an historical perspective of the topic. Chapter 9 of Ziegler & Puchner (2004) lists accepted techniques for protecting against hardware induced errors, including use of error correcting codes (id., p. 9-7) and memory scrubbing (id., p. 9-8). It is noted that techniques that protect memory “are not effective for computational or logic operations” (id., p. 9-8), meaning that only values stored in RAM can be protected by these methods, although special exotic methods can be used to protect the rest of the CPU (id., p. 1-13).

    Causes for such failures vary in source and severity depending upon a variety of conditions and the exact type of technology used, but they are always present. The general idea of the failure mode is as follows. SRAM memory uses pairs of inverter logic gates arranged in a feedback pattern to “remember” a “1” or a “0” value. A neutron produced by a cosmic ray can strike atoms in an SRAM cell, inducing a current spike that over-rides the remembered value, flipping a 0 to a 1, or a 1 to a 0. Newer technology uses smaller hardware cells to achieve increased density. But having smaller components makes it easier for a cosmic ray or other disturbance to disrupt the cell’s value. While a cosmic ray strike may seem like an exotic way for computers to fail, this is a well documented effect that has been known for decades (e.g., May & Woods 1979 discussed radiation-induced errors) and occurs on a regular basis with a predictable average frequency, even though it is fundamentally a randomly occurring event.

    Radiation strike causing transistor disruption (Gorini 2012).

    A paper from Dr. Cristian Constantinescu, a top Intel dependability expert at that time, summarized the situation about a decade ago (Constantinescu 2003). He differentiates random faults into two cases. Intermittent faults, which might be due to a subtle manufacturing or wearout defect,  can cause bursts of memory errors (id., pp. 15-16). Transient faults, on the other hand, can be caused by naturally occurring neutron and alpha particles, electromagnetic interference, and electrostatic discharge (id., pg. 16), which he refers to as “soft errors” in general. To simplify terminology, we’ll consider both transient and intermittent faults as “random” errors, even though intermittent errors strictly speaking may come in occasional bursts rather than having purely random independent statistical distributions. His paper goes on to discuss errors observed in real computers. In his study, 193 corporate server computers were studied, and 69% of them observed one or more random errors in a 16-month period. 2.6 percent of the servers reported more than 1000 random errors in that 16-month period due to a combination of random error sources (id., pg. 16):

    Random Error Data (Constantinescu 2003, pp. 15-16).


    This data supports the well known notion that random errors can and do occur in all computer systems. It is to be expected due to minute variations in manufacturing, environments, and statistical distribution properties that some systems will have a very large number of such errors in a short period of time, while many of the same exact type of system will exhibit none even for extended periods of operation. The problem is only expected to get worse for newer generations of electronics (id., pg. 17).

    Excerpt from Mariani 2003, pg. 50

    Mariani (2003, pg. 50) makes it clear that soft errors “are officially considered one of the main causes of loss of reliability in modern digital components.” He even goes so far to say of “x-by-wire” automotive systems (which includes “throttle-by-wire” and other computer-controlled functions): “Of course such systems must be highly reliable, as failure will cause fatal accidents, and therefore soft errors must be taken into account.” Additionally, random faults may be caused by internal events such as electrical noise and power supply disruptions that only occasionally cause problems. (Mariani 2003, pg. 51). MISRA Report 2 makes it clear that “random hardware failures” must be controlled for safety critical automotive applications, with a required rigor to control those failures increasing as safety becomes more important. (MISRA Report 2 p. 18).

    Semiconductor error rates are often given in terms of FIT (“Failures in Time”), where one FIT equals one failure per trillion operating hours.  Mariani gives representative numbers as 4000 FIT for a CPU circa 2001 (Mariani03, pg. 54), with error rates increasing significantly for subsequent years. Based on a review of various data sources considered it seems reasonable to use an error rate of 1000 FIT/Mbit (1 errors per 1e13 bit-hr) for SRAM. (e.g., Heijman 2011 pg. 2, Wang 2008 pg. 430, Tezzaron 2004) Note that not all failures occur in memory. Any transistor on a chip can fail due to radiation, and not all storage is in RAM. As an example, registers such as a computer’s program counter are exposed to Single Event Upsets (SEUs), and the effect of an SEU there can result in the computer having arbitrarily bad behavior unless some form of mitigation is in place. Automotive-specific publications state that hardware induced data corruption must be considered (Seeger 1982).

    SEU rates vary considerably based on a number of factors such as altitude, with data presented for Boulder Colorado showing a factor of 3.5 to 4.8 increase in cosmic-ray induced errors over data collected in Essex Junction Vermont (near sea level) (O’Gorman 1994, abstract).

    Given the above, it is clear that random hardware errors are to be expected when deploying a large fleet of embedded systems. This means that if your system is safety critical, you need to take such errors into account. In a future post I'll explore how bad the effects of such a fault can be.

    References:

    Constantinescu, Trends and challenges in VLSI circuit reliability, IEEE Micro, July-Aug 2003, pp. 14-19

    Heijmen, Soft errors from space to ground: historical overview, empirical evidence, and future trends (Chapter 1), in: Soft Errors in Modern Electronic Systems, Frontiers in Electronic Testing, 2011, Volume 41, pp. 1-41.

    Karnik et al., Characterization of soft errors caused by single event upsets in CMOS processes, IEEE Trans. Dependable and Secure Computing, April-June 2004, pp. 128-143.

    Mariani, Soft errors on digital computers, Fault injection techniques and tools for embedded systems reliability evaluation, 2003, pp. 49-60.

    May & Woods, Alpha-particle-induced soft errors in dynamic memories, IEEE Trans. On Electronic Devices, vol. 26, 1979

    MISRA, Report 2: Integrity, February 1995. (www.misra.org.uk)

    O’Gorman, The effect of cosmic rays on the soft error rate of a DRAM at ground level, IEEE Trans. Electron Devices, April 1994, pp. 553-557.

    Seeger, M., Fault-Tolerant Software Techniques, SAE Report 820106, International Congress & Exposition, Society of Automotive Engineers, 1982, pp. 119-125.

    Tezzaron Semiconductor, Soft Errors in Electronic Memory -- A White Paper, 2004, www.tezzaron.com

    Wang & Agrawal, Single event upset: an embedded tutorial, 21st Intl. Conf. on VLSI Design, 2008, pp. 429-434.

    Ziegler & Puchner, SER – History, Trends, and Challenges: a guide for designing with memory ICs, Cypress Semiconductor, 2004.