Monday, March 31, 2014

How Often Will Random Faults Kill Someone?


If you design a system that doesn't eliminate all single points of failure, eventually your safety-critical system will kill someone.  How often will that happen?  It depends on your exposure, but in real-world systems it could easily be on a daily basis.

"Realistic" Events Happen Every Million Hours

Understanding software safety requires reasoning about extremely low probability events, which can often lead to results that are not entirely intuitive. Let's attempt to put into perspective where to set the bar for what is “realistic” and why single point fault vulnerabilities are inherently dangerous for systems that can kill people if they fail.

If a human being lives to be 90 years old, that is 90 years * 365.25 days/yr * 24 hrs/day = 788,940 hours.  While the notion of what a "realistic" fault is can be subjective, perhaps an individual will think something is a realistic failure type if (s)he can expect to see it happen in one human lifetime. 

This definition has some problems since everyone’s experience varies. For example, some would say that total loss of dual redundant hydraulic service brakes in a car is unrealistic. But that has actually happened to me due to a common-mode cracking mechanical failure in the brake fluid reservoir of my vehicle, and I had to stop my vehicle using my parking brake. So I’d have to call loss of both service brake hydraulic systems a realistic fault from my point of view. But I have met automotive engineers who have told me it is “impossible” for this failure to happen. The bottom line: intuition as to what is "realistic" isn’t enough for these types of matters -- you have to crunch the numbers and take into account that if you haven't seen it yourself that doesn't mean it can't happen.

So perhaps it is also realistic if a friend has told you a story about such a fault. From a probability point of view, let's just say a fault is "realistic" if it is likely to happen about once every 1,000,000 hours (directly to you in your life, plus 25% extra to account for second-hand stories).  

Probability math in the safety critical system area is usually only concerned with rounded powers of ten, and 1 million is just a rounding up of a human lifespan. Put another way, if it happens once per million hours, let’s say it’s “realistic.” By way of contrast, the odds of winning the top jackpot in Powerball are about one in 175 million (http://www.powerball.com/powerball/pb_prizes.asp), and someone eventually wins without buying 175 tickets per hour for an entire million-hour human lifetime, so a case can be made for much less frequent events to also be “realistic” -- at least for someone in the population.

I should note at this point that aircraft safety and train safety are at least 1000 times more rigorous than one catastrophic failure per million hours. Then again, trains and jumbo jets hold hundreds of people, who are all being exposed to risk together.  So the point of this essay is simply to make safety probability more human-accessible, and not to argue that one catastrophic fault per million hours is OK for a high criticality system -- it's not!

"Realistic" Catastrophic Failures Will Happen To A Deployed Fleet Daily

Obermaisser gives permanent hardware failure rates as about 100 FIT in the context of drive-by-wire automobiles (Obermaisser p. 10). Note that 1 “FIT” is 1 failure per billion operating hours, so 100 FIT is one failure per 10 million operating hours. This means that you aren’t likely to see any particular car component fail in your lifetime. But cars have lots of components all failing at this rate, so there is a reasonable chance you’ll see some component failure and need to buy a replacement part. (These failure rates are for normal operating lifetimes, and do not count the fact that failures become more frequent as parts near end of life due to aging and wearout.)

However, transient failures, such as those caused by cosmic rays, voltage surges, and so on, are much more common, reaching rates of 10,000-100,000 FIT (Obermaisser p. 10; 100,000 per billion hours = 100 per million hours). That means that any particular component will fail in a way that can be fixed by rebooting or the like about 10 to 100 times per million hours – well within the realm of realistic by individual human standards. Obermaisser also gives the frequency of arbitrary faults as 1/50th of all faults (id., p. 8). Thus, any particular component can be expected to fail 10 to 100 times per million hours, and to do so in an “arbitrary” way (which is software safety lingo for “dangerous”) about 0.2 to 2 times per million hours.  This is right around the range of “realistic,” straddling the once-per-million-hours frequency.

In other words, if you own one single electronic device your whole life, it will suffer a potentially dangerous failure about once in your lifetime. If you continuously own vehicles for your entire life, and they each have 100 computer chips in them, then you can expect to see about one potentially dangerous failure per year since there are 100 of them to fail -- and whether any such failure actually kills you depends upon how well the car's fault tolerance architecture deals with that failure and how lucky you are that day. (We'll get into the point that you don't drive your car 24x7 momentarily.)  That is why it is so important to have redundancy in safety-critical automotive systems. These are general examples to indicate the scale of the issues at hand, and not specifically intended to correspond to exact numbers in any particular vehicle, especially one in which failsafes are likely to somewhat reduce the mishap rate by providing partial redundancy.

Now let’s see how the probabilities work when you take into account the large size of the fleet of deployed vehicles. Let's say a car company sells 1 million vehicles in a year. Let’s also say that an average vehicle is driven about 1 hour per day (FHA 2009 National Household Travel Survey, p. 31 says 56 minutes per day, but we're rounding off here). That’s 1 million hours of fleet operation per day.  If we multiply that by the range of dangerous transient faults per component (0.2 to 2 per million hours), the fleet can expect 0.2 to 2 dangerous transient faults per day for any particular safety critical control component. And there is likely more than one safety critical component in each car. In other words, while a dangerous failure may seem unlikely on an individual basis, designers must expect dangerous failures to happen on a daily basis in a large deployed fleet.
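To make the arithmetic above easy to check, here is a minimal sketch in C. The inputs are the assumed values from this discussion (Obermaisser's published FIT range, the 1-in-50 arbitrary-fault fraction, a fleet of one million vehicles driven about an hour per day), not measurements of any particular vehicle or component:

```c
#include <stdio.h>

int main(void) {
    /* Assumed values from the discussion above -- illustrative only. */
    const double transient_fit_low  = 10000.0;    /* FIT = failures per 1e9 hours  */
    const double transient_fit_high = 100000.0;
    const double arbitrary_fraction = 1.0 / 50.0; /* ~2% of faults are "arbitrary" */
    const double fleet_size         = 1.0e6;      /* vehicles in the fleet         */
    const double hours_per_day      = 1.0;        /* rounded-off driving per day   */

    /* FIT -> faults per million operating hours for one component. */
    double per_mhr_low  = transient_fit_low  / 1000.0;        /*  10 per 1e6 hours */
    double per_mhr_high = transient_fit_high / 1000.0;        /* 100 per 1e6 hours */

    /* Dangerous ("arbitrary") faults per million hours for one component. */
    double dang_low  = per_mhr_low  * arbitrary_fraction;     /* 0.2 */
    double dang_high = per_mhr_high * arbitrary_fraction;     /* 2.0 */

    /* Fleet exposure: 1e6 vehicles * 1 hr/day = 1e6 operating hours per day. */
    double fleet_mhrs_per_day = fleet_size * hours_per_day / 1.0e6;

    printf("Dangerous faults per component per day, fleet-wide: %.1f to %.1f\n",
           dang_low * fleet_mhrs_per_day, dang_high * fleet_mhrs_per_day);
    return 0;
}
```

Adding more safety-critical chips per vehicle, or more driving hours per day, only pushes the daily count higher.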

While the exact numbers in this calculation are estimates, the important point is that a competent safety-critical designer must take into account arbitrary failures when designing a system for a large deployed fleet. 

 The above is an example calculation as to why redundancy is required to achieve safety. Arbitrary faults can be expected to be dangerous (we’ve already thrown away 98% of faults as benign in these calculations – we’re just keeping 2% of faults as dangerous faults). Multiple fault containment regions (FCRs) must be used to mitigate such faults.

 It is important to note that in the fault tolerant and safety critical computing fields “arbitrary” means just that – it is a completely unconstrained failure mode.  It is not simply a failure that seems “realistic” based on some set of preconceived notions or experiences. Rather, designers consider it to be the worst possible failure of a single chip in which it does the worst possible thing to make the system unsafe. For example, a pyrotechnic device must be expected to fire accidentally at some point unless there is true redundancy that mitigates every possible single point of failure, whether the exact failure mechanism in the chip can be imagined by an engineer or not. The only way to avoid single point failures is via some form of true redundancy that serves as an independent check and balance on safety critical functions.

As a second source for similar failure rate numbers, Kopetz, who specializes in automotive drive-by-wire fault tolerant computer systems, gives an acceptable mean-time-to-failure for safety-critical applications of one critical failure per 1 billion hours (more than 100,000 years for any particular vehicle), saying that the “dependability requirements for a drive-by-wire system are even more stringent than the dependability requirements for a fly-by-wire systems, since the number of exposed hours of humans is higher in the automotive domain.” (Kopetz 2004, p. 32). This cannot be achieved using any simplex hardware scheme, and thus requires redundancy to achieve that safety goal. He gives expected component failure rates as 100 FIT for permanent faults and 100,000 FIT for transient faults (id., p. 37). With transient rates at 100,000 FIT, a simplex component experiences some kind of fault roughly once per 10,000 operating hours, about five orders of magnitude more often than the one-critical-failure-per-billion-hours target, so those faults must be detected and contained by independent hardware.

By the same token, any first fault that persists during operation without being detected or mitigated sets the stage for a second fault, resulting in a sequence of faults that is likewise unsafe. For example, consider a vehicle with a manufacturing defect or other fault that disables a redundant sensor input without being detected or fixed. From the time that fault happens, the vehicle is exposed to the same risk as if it had only had a single point failure vulnerability in the first place. Redundancy doesn’t do any good if a redundant component fails, nobody knows about it, nobody fixes it, and the vehicle keeps operating as if nothing is wrong. This is why, for example, a passenger jet with two engines isn’t allowed to take off on an over-ocean flight with only one engine working.

Software Faults Only Make Things Worse

The above fault calculations “assume the absence of software faults” (Obermaisser, p. 13), which is an unsupported assumption for many systems. Software faults can only be expected to make the arbitrary failure rate worse. Having software-implemented failsafes and partial redundancy may mitigate some of the dangerous faults, reducing the effective fraction below the computed 2%. But a mitigation technique that resides in the same fault containment region as the failure cannot be counted on to mitigate all possible failure modes.

Software defects manifest as single points of failure in a way that may be counter-intuitive. Software defects are design defects rather than run-time hardware failures, and are therefore present all the time in every copy of a system that is created. (MISRA Report 2, p. 7) Moreover, in the absence of perfect memory and temporal isolation mechanisms, every line of software in a system has the potential to affect the execution of every other line in the system. This is especially problematic in systems which do not use memory protection, which do not have an adequate real time scheduling approach, which make extensive use of global variables, and have other similar problems that result in an elevated risk of the activation of one software defect causing system-wide problems, or causing a cascading activation of other software defects. Therefore, in the absence of compelling proof of complete isolation of software tasks from each other, the entire CPU must be considered a single FCR for software, making the entirety of software on a CPU a single point of failure for which any possible unsafe behavior must be fully and completely mitigated by mechanisms other than software on that same CPU.
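As a purely hypothetical illustration of why the whole CPU ends up being a single FCR for software, consider the C sketch below. The task and variable names are invented for this example; the point is only that, without memory protection, a defect in “non-critical” code can silently corrupt state owned by safety-critical code that shares the same address space:

```c
#include <stdint.h>

/* Safety-critical state owned by a brake control task (hypothetical). */
static volatile uint16_t brake_command_pct;   /* 0..100 percent braking */

/* Unrelated buffer owned by a "non-critical" diagnostics task.
 * On many small microcontrollers these globals end up adjacent in RAM. */
static uint8_t diag_log[16];

/* Defect in the non-critical code: an off-by-one loop bound writes past
 * the end of diag_log.  Without an MPU/MMU there is nothing to stop the
 * stray write from landing in memory owned by the safety-critical task. */
void diag_task_step(void) {
    for (int i = 0; i <= 16; i++) {     /* BUG: should be i < 16 */
        diag_log[i] = 0xFF;             /* final iteration corrupts a neighbor */
    }
}

/* The brake task later consumes whatever value is now in its variable,
 * with no indication that unrelated code corrupted it.                  */
void brake_task_step(void) {
    uint16_t cmd = brake_command_pct;   /* may now be garbage */
    (void)cmd;                          /* ... drive actuators from cmd ... */
}
```

Exactly which variable the stray write hits depends on linker placement and is not guaranteed, which is precisely the problem: from the point of view of the rest of the software, the effect of such a defect is arbitrary.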

Some examples of ways in which a single fault within one FCR can manifest as multiple apparent software defects, yet still constitute a single point of failure, include: corruption of a block of memory (corrupting the values of many global variables), corruption of task scheduling information in an operating system (potentially killing or altering the execution patterns of several tasks), and timing problems that cause multiple tasks to miss their deadlines. It should be understood that these are merely examples – arbitrarily bad faults are possible from even a seemingly trivial software defect.

 MISRA states that FMEA and Fault Tree Analysis that examine specific sources and effects of faults are “not applicable to programmable and complex non-programmable systems because it is impossible to evaluate the very high number of possible failure modes and their resulting effects.” (MISRA Report 2 p. 17). Instead, for any complex electronics (whether or not it contains software), MISRA tells designers to “consider faults at the module level.” In other words, MISRA describes considering the entire integrated circuit for a microcontroller as a single FCR unless there is a very strong isolation argument (and considering the findings of Ademaj (2003), it is difficult to see how that can be done).

Deployed Systems Will See Dangerous Random Faults

The point of all this is to demonstrate that arbitrary dangerous single point faults can be expected to happen on a regular basis when hundreds of thousands of embedded systems such as cars are involved. True redundancy is required to avoid weekly mishaps across any full-scale production fleet with hundreds of thousands of units in the field. No single point fault, no matter how obscure or seemingly unlikely, can be tolerated in a safety critical system. Moreover, multiple point faults cannot be tolerated if they are sufficiently likely to happen via accumulation of undetected faults over time. In particular, software-implemented countermeasures that run on a CPU aren't going to be 100% effective if that same CPU is the one that suffered a fault in the first place. True redundancy is required to achieve safety from catastrophic failures for large deployed fleets in the face of random faults.

It is NOT acceptable practice to start arguing about whether any particular fault is “realistic” for a particular design within a single fault containment region such as an integrated circuit. This is in part because designer intuition is fallible for very low (but still important) probability events, and in part because it is, as a practical matter, impossible to envision all the arbitrary faults that might cause a safety problem. Rather, accepted practice is to simply assert that the worst possible single-point failure will find a way to happen, and to design to ensure that such an event does not render the system unsafe.

References:
  • Ademaj et al., Evaluation of fault handling of the time-triggered architecture with bus and star topology, DSN 2003.
  • Kopetz, H., On the fault hypothesis for a safety-critical real-time system, ASWSD 2004, LNCS 4147, pp. 31-42, 2006.
  • MISRA, Report 2: Integrity, February 1995.
  • Obermaisser, A Fault Hypothesis for Integrated Architectures, International Workshop on Intelligent Solutions in Embedded Systems, June 2006, pp. 1-18.

Monday, March 24, 2014

Safety Requires No Single Points of Failure

If someone is likely to die due to a system failure, that means the system can't have any single points of failure. The implications of this are more subtle than they might seem at first glance, because a "single point" isn't a single line of code or a single logic gate, but rather a "fault containment region," which is often an entire chip.

Safety Critical Systems Can't Have Single Points of Failure

One of the basic tenets of safety critical system design is avoiding single points of failure. A single point of failure is a component that, if it is the only thing that fails, can make the system unsafe. In contrast, a double (or triple, etc.) point of failure situation is one in which two (or more) components must fail to result in an unsafe system.

As an example, a single-engine airplane has a single point of failure (the single engine – if it fails, the plane doesn’t have propulsion power any more). A commercial airliner with two engines does not have an engine as a single point of failure, because if one engine fails the second engine is sufficient to operate and land the aircraft in expected operational scenarios. Similarly, cars have braking actuators on more than one wheel in part so that if one fails there are other redundant actuators to stop the vehicle.

MISRA Report 2 states that avoiding such failures is the centerpiece of risk assessment.  (MISRA Report 2 p. 17)

For systems that are deployed in large numbers for many operating hours, it can reasonably be expected that any complex component such as a computer chip or the software within it will fail at some point. Thus, attaining reasonable safety requires ensuring that no such component is a single point of failure.

When redundancy is used to avoid having a single point of failure, it is important that the redundant components not have any “common mode” failures. In other words, there must not be any way in which both redundant components would fail from the same cause, because any such cause would in effect be a single point of failure. As an example, if two aircraft engines use the same fuel pump, then that fuel pump becomes a single point of failure if it stops pumping fuel to both engines at the same time. Or if two brake pads in a car are closed by pressure from the same hydraulic line, then a hydraulic line rupture could cause both brake pads to fail at the same time due to loss of hydraulic pressure.

Similarly, it is important that there be no sequences of failures that escape detection until a last failure results in a system failure. For example, if one engine on a two-engine aircraft fails, it is essential to be able to detect that failure immediately so that diversion to a nearby airport can be completed before the second engine has a chance to fail as well.

For functions implemented in software, designers must assume that software can fail due to software defects missed in testing or due to hardware malfunctions. Software failures that must be considered include: a task dying or “hanging,” a task missing its deadline, a task producing an unsafe value due to a computational error, and a task producing an unsafe value due to unexpected (potentially abnormal) input values. It is important to note that when a software task “dies” that does not mean the rest of the system stops operating. For example, a software task that periodically computes a servo angle might die and leave the servo angle commanded to the last computed position, depending upon whether other tasks within the system notice whether the servo angle computation task is alive or not.

Avoiding single point failures in a multitasking software system requires that a single task must not manage both normal behavior and failure mode backup behavior, because if a failure is caused by that task dying or misbehaving, the backup behavior is likely to be lost as well. Similarly, a single task should not both perform an action and also monitor faults in that action, because if the task gets the action wrong, one should reasonably expect that it will subsequently fail to perform the monitoring correctly as well.
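One way to follow both of these rules is to pair the task doing the work with a separate monitor task that checks only liveness and output plausibility, and that owns the transition to a safe state. The C sketch below is a simplified, hypothetical illustration of that pattern; the names, thresholds, and limits are invented, and in a real design the monitor would ideally run on a separate fault containment region rather than on the same CPU:

```c
#include <stdint.h>

void enter_safe_state(void);   /* hypothetical: provided elsewhere, e.g. de-energize actuator */

/* Written by the servo-control task every cycle; read by the monitor. */
static volatile uint32_t servo_heartbeat;   /* incremented each control cycle */
static volatile int16_t  servo_angle_cmd;   /* last commanded angle, degrees  */

/* --- Doer: normal-behavior task --------------------------------------- */
void servo_control_task(int16_t requested_angle) {
    servo_angle_cmd = requested_angle;      /* ... drive the servo here ...   */
    servo_heartbeat++;                      /* prove we are still alive       */
}

/* --- Checker: independent monitor task --------------------------------
 * Runs on its own schedule.  It does not compute the normal behavior; it
 * only checks liveness and plausibility and owns the safe-state action.  */
void servo_monitor_task(void) {
    static uint32_t last_seen;
    static uint8_t  stale_count;

    if (servo_heartbeat == last_seen) {
        /* No progress since the last check: the doer may have died or hung. */
        if (++stale_count >= 3) {
            enter_safe_state();
        }
    } else {
        stale_count = 0;
        last_seen = servo_heartbeat;
    }

    /* Plausibility check on the commanded value (limits are invented). */
    if (servo_angle_cmd < -45 || servo_angle_cmd > 45) {
        enter_safe_state();
    }
}
```

Because the monitor does not compute the normal behavior, the death or misbehavior of the servo control task does not automatically take the backup behavior down with it; the residual weakness is that both tasks still share one CPU, which is the subject of the fault containment region discussion below.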

It is well known in the research literature that avoiding single points of failure is essential in a safety critical system. Leveson says: “To prove the safety of a complex system in the presence of faults, it is necessary to show that no single fault can cause a hazardous effect and that hazards resulting from sequences of failures are sufficiently remote.”  (Leveson 1986, pg. 139)

Storey says, in the context of software safety: “it is important to note that as complete isolation can never be guaranteed, system designers should not rely on the correct operation of a single module to achieve safety, even if that module is well trusted. As in all aspects of safety, several independent means of assuring safety are preferred.” (Storey 1996, pg. 238) In other words, a fundamental tenet of safety critical system design is to first ensure that there are no single points of failure, and then ensure that there are no likely pairs or sequences of failures that can also cause a safety problem. A specific implication of this statement is that software within a single CPU cannot be trusted to operate properly at all times. A second independent CPU or other device must be used to duplicate or monitor computations to ensure safety. Storey gives a variety of redundant hardware patterns to avoid single point failures (id., pp. 131-143).

Douglass also addresses single-point failures directly (Douglass 1999, pp. 105-107).

Douglass makes it clear that all single point failures must be addressed to attain safety, no matter how remote the probability of failure might be.

Moreover, random faults are just one type of fault that is expected on a regular, if infrequent basis. “Random faults, like end-of-life failures of electrical components, cannot be designed away. It is possible to add redundancy so that such faults can be easily detected, but no one has ever made a CPU that cannot fail.” (Douglass 1999, pg. 105, emphasis added)

Fault Containment Regions Count As "Single Points" of Failure

Additionally, it is important to recognize that any line of code on a CPU (or even many lines of code on a CPU) can be a single point fault that creates arbitrary effects in diverse places elsewhere on that same CPU. Addy makes it clear that: “all parts of the software must be investigated for errors that could have a safety impact. This is not a part of the software safety analysis that can be slighted without risk, and may require more emphasis than it currently receives. [Addy’s] case study suggests that, in particular, care should be taken to search for errors in noncritical software that have a hidden interface with the safety-critical software through data and system services.” (Addy 1991, pg 83). Thus, it is inadequate to look at only some of the code to evaluate safety – you have to look at every last line of code to be sure.

A Fault Containment Region (FCR) is a region of a system which is neither vulnerable to external faults, nor causes faults outside the FCR. (Lala 1994, p. 30) The importance of an FCR is that arbitrary faults that originate inside an FCR cannot spread to another FCR, providing the basis of fault containment within a system that has redundancy to tolerate faults. In other words, if an FCR can generate erroneous data, you need a separate, independent FCR to detect that fault as the basis for creating a fault tolerant system (id.)




Hammet presents architectures that are fail safe, such as a “self-checking pair with simplex fault down” that in my experience is also commonly used in rail signaling equipment.  (Hammet 2002, pp. 19-20). Rail signaling systems I have worked with also use redundancy to avoid single points of failure. So it's no mystery how to do this -- you just have to want to get it right.

Obermaisser gives a list of failure modes for a “fault containment region” (FCR) that includes “arbitrary failures,” (Obermaisser 2006 pp. 7-8, section 3.2) with the understanding that a safe system must have multiple redundant FCRs to avoid a single point of failure. Put another way, a fault anywhere in a single FCR renders the entire FCR unreliable. Obermaisser also says that “The independence of FCRs can be compromised by shared physical resources (e.g., power supply, timing source), external faults (e.g., Electromagnetic Interference (EMI), spatial proximity) and design.” (id., section 3.1 p. 6)

Kopetz says that an automotive drive-by-wire system design must “assure that replicated subsystems of the architecture for ultra-dependable systems fail independently of each other.” (Kopetz 2004, p. 32, emphasis per original) Kopetz also says that each FCR must be completely independent of each other, including: computer hardware, power supply, timing source, clock synchronization service, and physical space, noting that “all correlated failures of two subsystems residing on the same silicon die cannot be eliminated.” (id., p. 34).

Continuing on with automotive specific references, none of this was news to the authors of the MISRA Software Guidelines, who said the objective of risk assessment is to “show that no single point of failure within the system can lead to a potentially unsafe state, in particular for the higher Integrity Levels.” (MISRA Report 2, 1995, p. 17).   In this context, “higher Integrity levels” are those functions that could cause significant unsafe behavior. That report also says that the risk from multiple faults must be sufficiently low to be acceptable.

Mauser reports on a Siemens Automotive study of electronic throttle control for automobiles (Mauser 1999). The study specifically accounted for random faults (id. p. 732), as well as considering the probability of “runaway” incidents (id., p. 734). It found a possibility of single point failures, and in particular identified dual redundant throttle electrical signals being read by a single shared (multiplexed) analog to digital converter in the CPU (id., p. 739) as a critical flaw.

Ademaj says that “independent fault containment regions must be implemented in separate silicon dies.” (Ademaj 2003, p. 5) In other words, any two functions on the same silicon die are subject to arbitrary faults and constitute a single point of failure.

But Ademaj didn’t just say it – he proved it via experimentation on a communication chip specifically designed for safety critical automotive drive-by-wire applications (id., pg. 9 conclusions). Those results required the designers of the TTP protocol chip (based on the work of Prof. Kopetz) to change their approach to achieving fault tolerance and adopt a star topology, because combining a network CPU with the network monitor on the same silicon die was proven to be susceptible to single points of failure, even though the die had been specifically designed to physically isolate the monitor from the main CPU. In other words, despite every attempt at on-chip isolation, two completely independent circuits sharing the same chip were observed to fail together from a single fault in a safety-critical automotive drive-by-wire design.

In other words, if a safety critical system has a single point of failure – any single point of failure, even if mitigated with “failsafes” in the same fault containment region – then it is not only unsafe, but also flies in the face of well accepted practices for creating safe systems.

In a future post I'll get into the issue of how critical a system has to be before addressing all single point failures becomes mandatory. But the short version is that if someone is likely to die as a result of system failure, you need to avoid all single point failures (every last one of them), at least at the level of FCRs.

References:
  • Addy, E., A case study on isolation of safety-critical software, Proc. Conf Computer Assurance, pp. 75-83, 1991.
  • Ademaj et al., Evaluation of fault handling of the time-triggered architecture with bus and star topology, DSN 2003.
  • Douglass, Doing Hard Time: Developing Real-Time Systems with UML, Objects, Frameworks, and Patterns, Addison-Wesley Professional, 1999.
  • Hammet, Design by extrapolation: an evaluation of fault-tolerant avionics, IEEE Aerospace and Electronic Systems, 17(4), 2002, pp. 17-25.
  • Kopetz, H., On the fault hypothesis for a safety-critical real-time system, ASWSD 2004, LNCS 4147, pp. 31-42, 2006.
  • Lala et al., Architectural principles for safety-critical real-time applications, Proceedings of the IEEE, 82(1), Jan. 1994, pp. 25-40.
  • Leveson, N., Software safety: why, what, how, Computing Surveys, Vol. 18, No. 2, June 1986, pp. 125-163.
  • MISRA, Report 2: Integrity, February 1995.
  • Obermaisser, A Fault Hypothesis for Integrated Architectures, International Workshop on Intelligent Solutions in Embedded Systems, June 2006, pp. 1-18.
  • Storey, N., Safety Critical Computer Systems, Addison-Wesley, 1996.

Monday, March 17, 2014

Problems Caused by Random Hardware Faults in Critical Embedded Systems


A previous posting described sources of random hardware faults. This posting shows that even single bit errors can lead to catastrophic system failures. Furthermore, these types of faults are difficult or impossible to identify and reproduce in system-level testing. But, just because you can't make such errors reproduce at the snap of your fingers doesn't mean they aren't affecting real systems every day.

Paraphrasing a favorite saying of my colleague Roy Maxion: "You may forget about unlikely faults, but they won't forget about you."

Random Faults Cause Catastrophic Errors

Random faults can and do occur in hardware on an infrequent but inevitable basis. If you have a large number of embedded systems deployed you can count on this happening to your system. But, how severe can the consequences be from a random bit flip? As it turns out, this can be a source of system-killer problems that have hit more than one company, and can cause catastrophic failures of safety critical systems.

While random faults are usually transient events, even the most fleeting of faults will at least sometimes result in a persistent error that can be unsafe. “Hardware component faults may be permanent, transient or intermittent, but design faults will always be permanent. However, it should be remembered that faults of any of these classes may result in errors that persist within the system.” (Storey 1996, pg. 116)

Not every random fault will cause a catastrophic system failure. But based on experimental evidence and published studies, it is virtually certain that at least some random faults will cause catastrophic system failures in a system that is not fully protected against all possible single-point faults and likely combinations of multi-point faults.

Addy presented a case study of an embedded real time control system. The system architecture made heavy use of global variables (which is a bad idea for other reasons (Koopman 2010)). Analysis of test results showed that corruption of a single memory bit in a global variable caused an unsafe action for the system studied, despite redundant checks in software. (Addy 1991, pg. 77). This means that a single bit flip causing an unsafe system has to be considered plausible, because it has actually been demonstrated in a safety critical control system.


Addy also reported a fault in which one process set a value, and then another process took action upon that value. Addy identified a failure mode in the first process of his system due to task death that rendered the operator unable to control the safety-critical process:

Excerpt from (Addy 1991, pg. 80)

In Addy’s case a display is frozen and control over the safety critical process is lost.  As a result of the experience, Addy recommends a separate safety process. (Addy 1991 pg. 81).

Sun Microsystems learned the Single Event Upset (SEU) lesson the hard way in 2000, when it had to recall high-end servers because they had failed to use error detection/correction codes and were suffering from server crashes due to random faults in SRAM. (Forbes, 2000) Fortunately no deaths were involved, since these were servers rather than safety critical embedded systems.

Excerpt from Forbes 2000.

Cisco also had a significant SEU problem with their 12000 series router line cards. They went so far as to publish a description and workaround notification, stating: “Cisco 12000 line cards may reset after single event upset (SEU) failures. This field notice highlights some of those failures, why they occur, and what work arounds are available.” (Cisco 2003) Again, these were not safety critical systems.

More recently, Skarin & Karlsson presented a study of fault injection results on an automotive brake-by-wire system to determine if random faults can cause an unsafe situation. They found that 30% of random errors caused erroneous outputs, and of those 15% resulted in critical braking failures. (Skarin & Karlsson 2008)  In the context of this brake-by-wire study, critical failures were either loss of braking or a locked wheel during braking. (id., p. 148) Thus, it is clear that these sorts of errors can lead to critical operational faults in electronic automotive controls. It is worth noting that random error rates tend to increase dramatically with altitude, making this a relevant environmental condition (Heijmen 2011, p. 6).

Mauser reports on a Siemens Automotive study of electronic throttle control for automobiles (Mauser 1999). The study specifically accounted for random faults (id. p. 732), as well as considering the probability of “runaway” incidents (id., p. 734). It found a possibility of single point failures, and in particular identified dual redundant throttle electrical signals being read by a single shared (multiplexed) analog to digital converter in the CPU (id., p. 739) as a critical flaw.

For automotive safety applications, MISRA Software Guidelines discuss the need to identify and correct corruption of memory values (MISRA Software Guidelines p. 30).

Random Faults Can Be Impossible to Reproduce In General

Intermittent faults from SEUs and other sources can and do occur in the field, causing real problems that are not readily reproducible even if the cause is a hard failure. In many cases components with faults are diagnosed as “Trouble Not Identified” (TNI). In my work with embedded system companies I have learned that it is common to have TNI rates of 50% for complex electronic components across many embedded industries. (This means that half of the components returned to the factory as defective are found to be working properly when normal testing procedures are used.) While it is tempting to blame the customer for incorrect operation of equipment, the reality is that many times a defect in software design, hardware design, or manufacturing is ultimately responsible for these elusive problems. Thomas et al. give a case study of a TNI problem on Ford vehicles and conclude that it is crucial to perform root cause analysis of TNI problems in safety critical electronics rather than just assume it is the customer’s fault for not using the equipment properly or falsely reporting an error. (Thomas 2002)
TNI reports must be tracked down to root cause. (Thomas 2002, pg. 650)

The point is not really the exact number, but rather that random failures are the sort of thing that can be expected to happen on a daily basis for a large deployed fleet of safety-critical systems, and therefore must be guarded against rigorously. Moreover, lack of reproducibility does not mean the failure didn’t happen. In fact, irreproducible failures and very-difficult-to-reproduce failures are quite common in modern electronic components and systems.

Random Faults Are Even Harder To Reproduce In System-Level Testing

While random faults happen all the time in large fleets, detecting and reproducing them is not so easy. For example: “Studies have shown that transient faults occur far more often than permanent ones and are also much harder to detect.” (Yu 2003 pg. 11)

It is also well known that some software faults appear random, even though they may be caused by a software defect that could be fixed if identified. Yu explains that this is a common situation:
Software faults can be very difficult to track down (Yu 2003, p. 12)

Random faults cannot be assumed to be benign, and cannot be assumed to always result in a system crash or system reset that puts the system into a safe state. As noted above, faults of any class “may result in errors that persist within the system.” (Storey 1996, p. 116)  “Many designers believe that computers fail safe, whereas NASA experience has shown that computers may exhibit hazardous failure modes.” (NASA 2004, p. 21)

Moreover, random faults are just one type of fault that is expected on a regular, if infrequent basis. “Random faults, like end-of-life failures of electrical components, cannot be designed away. It is possible to add redundancy so that such faults can be easily detected, but no one has ever made a CPU that cannot fail.” (Douglass 1999, pg. 105, emphasis added)

MISRA states that “system design should consider both random and systematic faults.” (MISRA Software Guidelines p. 6). It also states “Any fault whose initiating conditions occur frequently will usually have emerged during testing before production, so that the remaining obscure faults may appear randomly, even though they are entirely systematic in origin.” (MISRA Report 2 p. 7). In other words, software faults can appear random, and are likely to be elusive and difficult to pin down in products – because the easy ones are the ones that get caught during testing.

SEUs and other truly random faults are not reproducible in normal test conditions because they depend upon rare uncontrollable events (e.g., cosmic ray strikes). “The result is an error in one bit, which, however, cannot be duplicated because of its random nature.” (Tang 2003, pg. 111).

While intermittent faults might manifest repeatedly over a long series of tests with a particular individual vehicle or other embedded system, doing so may require precise reproduction of environmental conditions such as vibration, temperature, humidity, voltage fluctuations, operating conditions, and other factors.

Random faults may be difficult to track down, but can result in errors that persist within the system (Storey 1996, pp. 115-116).

It is unreasonable to expect a fault caused by a random error to manifest on demand in a test setup run for a few hours, or even a few days. It can easily take weeks or months to reproduce random software defects, and some may never be reproduced by system level testing in any reasonable amount of time.  If a fleet of hundreds of thousands of vehicles, or other safety critical embedded systems, only sees random faults a few times a day, testing a single vehicle is unlikely to see that same random fault in a short test time – or possibly it may never see a particular random fault over that single test vehicle’s entire operating life.
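To put a rough number on that: if a particular dangerous fault occurs at the “realistic” rate of about once per million operating hours, the expected wait to see it on one continuously running test article is on the order of a century. Here is a minimal sketch of that arithmetic in C, with the rate taken as an assumption rather than a measurement:

```c
#include <stdio.h>

int main(void) {
    /* Assumed rate: a "realistic" dangerous fault, once per 1e6 operating hours. */
    const double faults_per_hour = 1.0 / 1.0e6;

    double mean_wait_hours = 1.0 / faults_per_hour;            /* 1,000,000 hours */
    double mean_wait_years = mean_wait_hours / (24.0 * 365.25);

    printf("Expected wait on a single continuously running test article: "
           "%.0f hours, or roughly %.0f years\n", mean_wait_hours, mean_wait_years);
    return 0;
}
```

And that assumes 24x7 operation of the test article; real test hours accumulate far more slowly.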

Thus, it is to be expected that realistic faults – faults that happen to all types of computers on an everyday basis when large numbers of them are deployed – can’t necessarily be reproduced upon demand in a single vehicle simply by performing vehicle-level tests.

Software Faults Can Also Appear To Be Random and Irreproducible In System-Level Testing

Software data corruption includes changes during operation made to RAM and other hardware resources (for example, registers controlling I/O peripherals that can be altered by software). Unlike hardware data corruption, software data corruption is caused by design defects in software rather than disruption of hardware operating state. The defects may be quite subtle and seemingly random, especially if they are caused by timing problems (also known as race conditions). Because software data corruption is caused by a CPU writing values to memory or registers under software control just as the hardware would expect to happen with correctly working software, hardware data corruption countermeasures such as error detecting codes do not mitigate this source of errors.

It is well known that software defects can cause corruption of data or control flow information. Since such corruption is due to software defects, one would expect the frequency at which it happens to depend heavily on the quality of the software. Generally developers try to find and eliminate frequent sources of memory corruption, leaving the more elusive and infrequent problems in production code. Horst et al. used modeling and measurement to estimate that a corruption would occur once per month in a population of 10,000 processors (1993, abstract), although that software was written to non-safety-critical levels of software quality. Sullivan and Chillarege used defect reports to understand “overlay errors” (the IBM term for memory corruption), and found that such errors had a much higher impact on the system than other types of defects (Sullivan 1991, p. 9).

MISRA recommends using redundant data or a checksum (MISRA Software Guidelines p. 30; MISRA Report 1 p. 21). NASA recommends using multiple copies or a CRC to guard against variable corruption (NASA 2004, p. 93). IEC 61508-3 recommends protecting data structures against corruption (p. 72).
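As a minimal sketch of the “multiple copies” idea, the C code below stores a critical variable together with its bitwise complement and checks the pair on every read. This is only one of the techniques those guidelines allow (a CRC over a larger structure is another), and the type and function names here are invented for illustration:

```c
#include <stdbool.h>
#include <stdint.h>

/* Keep a critical value together with its bitwise complement so that
 * corruption of either copy is detectable at the point of use.          */
typedef struct {
    uint32_t value;
    uint32_t inverse;   /* maintained as ~value */
} protected_u32;

static void protected_write(protected_u32 *p, uint32_t v) {
    p->value   = v;
    p->inverse = ~v;
}

/* Returns true and sets *out only if the two copies agree; otherwise the
 * caller should fall back to a safe default or raise a fault.            */
static bool protected_read(const protected_u32 *p, uint32_t *out) {
    if (p->value != (uint32_t)~p->inverse) {
        return false;    /* corruption detected */
    }
    *out = p->value;
    return true;
}
```

Note that this only detects corruption of the stored copies; it does nothing about a defect in the code that computed the value in the first place, which is one reason independent monitoring is still needed.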
It is also well known that some software faults appear random, even though they may be caused by a software defect that could be fixed if identified. Yu explains that this is a common situation, as shown in a previous figure. (Yu 2003, p. 12) It is well known that software faults can appear to be irreproducible at the system level, making it virtually impossible to reproduce them upon demand with system level testing of an unmodified system.

It is reasonable to expect that any software – and especially software that has not been developed in rigorous accordance with high-integrity software procedures – will have residual bugs that can cause memory corruption, crashes, or other malfunctions. The figure below provides an example software malfunction from an in-flight entertainment system. Such examples abound if you keep an eye out for them.

Aircraft entertainment system software failure. February 13, 2008 (photo by author)

It should be no surprise if a system suffers random faults in a large-scale deployment that are difficult or impossible to reproduce in the lab via system-level testing, even if their manifestation in the field has been well documented. There simply isn't enough exposure to random fault sources in laboratory testing conditions to experience the variety of faults that will happen in the real world to a large fleet.

Coming at this another way, saying that something can't happen because it can't be reproduced in the lab ignores how random hardware and software faults manifest in the real world. Arbitrarily bad random faults can and will happen eventually in any widely deployed embedded system. Designers have to make a choice between ignoring this reality, or facing it head-on to design systems that are safe despite such faults occurring.

References:

  • Addy, E., A case study on isolation of safety-critical software, Proc. Conf Computer Assurance, pp. 75-83, 1991.
  • Cisco, Cisco 12000 Single Event Upset Failures Overview and Work Around Summary, August 15, 2003.
  • Douglass, Doing Hard Time: Developing Real-Time Systems with UML, Objects, Frameworks, and Patterns, Addison-Wesley Professional, 1999.
  • Forbes, Sun Screen, November 11, 2000. Accessed at: http://www.forbes.com/global/2000/1113/0323026a_print.html
  • Heijmen, Soft errors from space to ground: historical overview, empirical evidence, and future trends (Chapter 1), in: Soft Errors in Modern Electronic Systems, Frontiers in Electronic Testing, 2011, Volume 41, pp. 1-41.
  • Horst et al., The risk of data corruption in microprocessor-based systems, FTCS, 1993, pp. 576-585.
  • Koopman, Better Embedded System Software, 2010.
  • Mauser, Electronic throttle control – a dependability case study, J. Univ. Computer Science, 5(10), 1999, pp. 730-741.
  • MISRA, Development Guidelines for Vehicle Based Software, November 1994 (PDF version 1.1, January 2001).
  • MISRA, Report 2: Integrity, February 1995.
  • NASA-GB-8719.13, NASA Software Safety Guidebook, NASA Technical Standard, March 31, 2004.
  • Skarin & Karlsson, Software implemented detection and recovery of soft errors in a brake-by-wire system, EDCC 2008, pp. 145-154.
  • Storey, N., Safety Critical Computer Systems, Addison-Wesley, 1996.
  • Sullivan & Chillarege, Software defects and their impact on system availability: a study of field failures in operating systems, Fault Tolerant Computing Symposium, 1991, pp 1-9.
  • Tang & Rodbell, Single-event upsets in microelectronics: fundamental physics and issues, MRS Bulletin, Feb. 2003, pp. 111-116.
  • Thomas et al., The “trouble not identified” phenomenon in automotive electronics, Microelectronics Reliability 42, 2002, pp. 641-651.
  • Yu, Y. & Johnson, B., Fault Injection Techniques: a perspective on the state of research. In: Benso & Prinetto (eds.), Fault Injection Techniques and Tools for Embedded System Reliability Evaluation, Kluwer, 2003, pp. 7-39.


Monday, March 3, 2014

Random Hardware Faults


Computer hardware randomly generates incorrect answers once in a while, in some cases due to single event upsets from cosmic rays. Yes, really! Here's a summary of the problem and how often you can expect it to occur, with references for further reading.

---------------------------------------

(As will be the case with some of my posts over the next few months, this text is written more in an academic style, and emphasizes automotive safety applications since that is the area I perform much of my research in. But I hope it is still useful, if less entertaining in style than some of my other posts.)

Hardware corruption in the form of bit flips in memories has been known to occur on terrestrial computers for more than 30 years (e.g., May & Woods 1979; Karnik et al. 2004; Heijmen 2011). Hardware data corruption includes changes during operation made to both RAM and other CPU hardware resources (for example bits held in CPU registers that are intermediate results in a computation).

"Soft errors" are those in which a transient fault causes a loss of data, but not damage to a circuit. (Mariani03, pg. 51) Terminology can be slightly confusing in that a “soft error” has nothing to do with software, but instead refers to a hardware data corruption that does not involve permanent chip damage. These “soft” errors can be caused by, among other things, neutrons created by cosmic rays that penetrate the earth’s atmosphere.

Soft errors can be caused by external events such as naturally occurring alpha particles from impurities in the semiconductor materials, as well as neutrons from cosmic rays that penetrate the earth’s atmosphere (Mariani 2003, pp. 51-53). Additionally, random faults may be caused by internal events such as electrical noise and power supply disruptions that only occasionally cause problems (Mariani 2003, pg. 51).

As computers evolve to become more capable, they also use smaller and smaller individual devices to perform computation. And, as devices get smaller, they become more vulnerable to data corruption from a variety of sources in which a subtle fleeting event causes a disruption in a data value. For example, a cosmic ray might create a particle that strikes a bit in memory, flipping that bit from a “1” to a “0,” resulting in the change of a data value or death of a task in a real time control system. (Wang 2008 is a brief tutorial.)

Chapter 1 of Ziegler & Puchner (2004) gives an historical perspective of the topic. Chapter 9 of Ziegler & Puchner (2004) lists accepted techniques for protecting against hardware induced errors, including use of error correcting codes (id., p. 9-7) and memory scrubbing (id., p. 9-8). It is noted that techniques that protect memory “are not effective for computational or logic operations” (id., p. 9-8), meaning that only values stored in RAM can be protected by these methods, although special exotic methods can be used to protect the rest of the CPU (id., p. 1-13).

Causes for such failures vary in source and severity depending upon a variety of conditions and the exact type of technology used, but they are always present. The general idea of the failure mode is as follows. SRAM memory uses pairs of inverter logic gates arranged in a feedback pattern to “remember” a “1” or a “0” value. A neutron produced by a cosmic ray can strike atoms in an SRAM cell, inducing a current spike that over-rides the remembered value, flipping a 0 to a 1, or a 1 to a 0. Newer technology uses smaller hardware cells to achieve increased density. But having smaller components makes it easier for a cosmic ray or other disturbance to disrupt the cell’s value. While a cosmic ray strike may seem like an exotic way for computers to fail, this is a well documented effect that has been known for decades (e.g., May & Woods 1979 discussed radiation-induced errors) and occurs on a regular basis with a predictable average frequency, even though it is fundamentally a randomly occurring event.

Radiation strike causing transistor disruption (Gorini 2012).

A paper from Dr. Cristian Constantinescu, a top Intel dependability expert at that time, summarized the situation about a decade ago (Constantinescu 2003). He divides random faults into two cases. Intermittent faults, which might be due to a subtle manufacturing or wearout defect, can cause bursts of memory errors (id., pp. 15-16). Transient faults, on the other hand, can be caused by naturally occurring neutron and alpha particles, electromagnetic interference, and electrostatic discharge (id., pg. 16), which he refers to as “soft errors” in general. To simplify terminology, we’ll consider both transient and intermittent faults to be “random” errors, even though intermittent errors strictly speaking may come in occasional bursts rather than having purely random independent statistical distributions. His paper goes on to discuss errors observed in real computers. The study examined 193 corporate server computers; 69% of them experienced one or more random errors in a 16-month period, and 2.6 percent of the servers reported more than 1000 random errors in that 16-month period due to a combination of random error sources (id., pg. 16):

Random Error Data (Constantinescu 2003, pp. 15-16).


This data supports the well known notion that random errors can and do occur in all computer systems. It is to be expected due to minute variations in manufacturing, environments, and statistical distribution properties that some systems will have a very large number of such errors in a short period of time, while many of the same exact type of system will exhibit none even for extended periods of operation. The problem is only expected to get worse for newer generations of electronics (id., pg. 17).

Excerpt from Mariani 2003, pg. 50

Mariani (2003, pg. 50) makes it clear that soft errors “are officially considered one of the main causes of loss of reliability in modern digital components.” He even goes so far as to say of “x-by-wire” automotive systems (which include “throttle-by-wire” and other computer-controlled functions): “Of course such systems must be highly reliable, as failure will cause fatal accidents, and therefore soft errors must be taken into account.” MISRA Report 2 makes it clear that “random hardware failures” must be controlled for safety critical automotive applications, with the required rigor for controlling those failures increasing as safety becomes more important. (MISRA Report 2 p. 18).

Semiconductor error rates are often given in terms of FIT (“Failures in Time”), where one FIT equals one failure per billion operating hours.  Mariani gives representative numbers as 4000 FIT for a CPU circa 2001 (Mariani 2003, pg. 54), with error rates increasing significantly for subsequent years. Based on a review of various data sources, it seems reasonable to use an error rate of 1000 FIT/Mbit (about 1 error per 1e12 bit-hr) for SRAM. (e.g., Heijmen 2011 pg. 2, Wang 2008 pg. 430, Tezzaron 2004) Note that not all failures occur in memory. Any transistor on a chip can fail due to radiation, and not all storage is in RAM. As an example, registers such as a computer’s program counter are exposed to Single Event Upsets (SEUs), and the effect of an SEU there can result in the computer having arbitrarily bad behavior unless some form of mitigation is in place. Automotive-specific publications state that hardware induced data corruption must be considered (Seeger 1982).
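To get a feel for what 1000 FIT/Mbit means at fleet scale, here is a rough estimate in C for a hypothetical controller with 512 KB (about 4 Mbit) of SRAM, reusing the one-million-vehicle, one-hour-per-day fleet assumption from the post above. The memory size and fleet numbers are assumptions for illustration, not data about any real part:

```c
#include <stdio.h>

int main(void) {
    /* Assumed values for illustration only -- not data for any real part. */
    const double fit_per_mbit        = 1000.0;   /* SEU rate for SRAM, in FIT/Mbit */
    const double sram_mbit           = 4.0;      /* e.g., 512 KB of on-chip SRAM   */
    const double fleet_hours_per_day = 1.0e6;    /* 1e6 vehicles x 1 hr/day        */

    double device_fit      = fit_per_mbit * sram_mbit;   /* 4000 FIT              */
    double upsets_per_hour = device_fit / 1.0e9;         /* per controller-hour   */

    printf("One controller: about one SRAM upset every %.0f operating hours\n",
           1.0 / upsets_per_hour);                       /* ~250,000 hours        */
    printf("Fleet-wide: about %.1f SRAM upsets per day\n",
           upsets_per_hour * fleet_hours_per_day);       /* ~4 per day            */
    return 0;
}
```

Many of these upsets will be masked or harmless, but without ECC, memory scrubbing, or other mitigation, some fraction will land in safety-relevant state.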

SEU rates vary considerably based on a number of factors such as altitude, with data presented for Boulder Colorado showing a factor of 3.5 to 4.8 increase in cosmic-ray induced errors over data collected in Essex Junction Vermont (near sea level) (O’Gorman 1994, abstract).

Given the above, it is clear that random hardware errors are to be expected when deploying a large fleet of embedded systems. This means that if your system is safety critical, you need to take such errors into account. In a future post I'll explore how bad the effects of such a fault can be.

References:

Constantinescu, Trends and challenges in VLSI circuit reliability, IEEE Micro, July-Aug 2003, pp. 14-19

Heijmen, Soft errors from space to ground: historical overview, empirical evidence, and future trends (Chapter 1), in: Soft Errors in Modern Electronic Systems, Frontiers in Electronic Testing, 2011, Volume 41, pp. 1-41.

Karnik et al., Characterization of soft errors caused by single event upsets in CMOS processes, IEEE Trans. Dependable and Secure Computing, April-June 2004, pp. 128-143.

Los Alamos National Laboratory, The Invisible Neutron Threat, https://www.lanl.gov/science/NSS/issue1_2012/story4full.shtml 

Mariani, Soft errors on digital computers, Fault injection techniques and tools for embedded systems reliability evaluation, 2003, pp. 49-60.

May & Woods, Alpha-particle-induced soft errors in dynamic memories, IEEE Trans. On Electronic Devices, vol. 26, 1979

MISRA, Report 2: Integrity, February 1995. (www.misra.org.uk)

O’Gorman, The effect of cosmic rays on the soft error rate of a DRAM at ground level, IEEE Trans. Electron Devices, April 1994, pp. 553-557.

Seeger, M., Fault-Tolerant Software Techniques, SAE Report 820106, International Congress & Exposition, Society of Automotive Engineers, 1982, pp. 119-125.

Tezzaron Semiconductor, Soft Errors in Electronic Memory -- A White Paper, 2004, www.tezzaron.com

Wang & Agrawal, Single event upset: an embedded tutorial, 21st Intl. Conf. on VLSI Design, 2008, pp. 429-434.

Ziegler & Puchner, SER – History, Trends, and Challenges: a guide for designing with memory ICs, Cypress Semiconductor, 2004.
