Paraphrasing a favorite saying of my colleague Roy Maxion: "You may forget about unlikely faults, but they won't forget about you."
Random faults can and do occur in hardware on an infrequent but inevitable basis. If you have a large number of embedded systems deployed, you can count on this happening to your system. But how severe can the consequences of a random bit flip be? As it turns out, bit flips can be a source of system-killer problems that have hit more than one company, and they can cause catastrophic failures of safety-critical systems.
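To get a feel for how much one flipped bit can change a stored value, here is a minimal Python sketch that inverts a single bit in a 16-bit word. The function name `flip_bit`, the variable `throttle_percent`, and the 16-bit width are illustrative assumptions, not taken from any real system.

```python
def flip_bit(value: int, bit: int, width: int = 16) -> int:
    """Return `value` with one bit inverted, roughly what a single-event upset does."""
    if not 0 <= bit < width:
        raise ValueError("bit index out of range")
    mask = (1 << width) - 1          # keep the result within the word width
    return (value ^ (1 << bit)) & mask


# Hypothetical stored command value (assumption for illustration only).
throttle_percent = 5

if __name__ == "__main__":
    # Show what the value becomes for a flip in each possible bit position.
    for bit in range(16):
        print(f"bit {bit:2d} flipped: {flip_bit(throttle_percent, bit)}")
```

A flip in a low-order bit nudges the value slightly, while a flip in a high-order bit turns a small number into a very large one. Which bit gets hit is a matter of chance, so the consequence can fall anywhere in that range.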
While random faults are usually transient events, even the most fleeting fault will at least sometimes result in a persistent error that can be unsafe. “Hardware component faults may be permanent, transient or intermittent, but design faults will always be permanent. However, it should be remembered that faults of any of these classes may result in errors that persist within the system.” (Storey 1996, p. 116)
Not every random fault will cause a catastrophic system failure. But based on experimental evidence and published studies, it is virtually certain that at least some random faults will cause catastrophic system failures in a system that is not fully protected against all possible single-point faults and likely combinations of multi-point faults.
Addy (1991) also reported a fault in which one process set a value and another process then acted on that value. He identified a failure mode in which task death in the first process left the operator unable to control the safety-critical process.
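The set-a-value, act-on-a-value pattern can be sketched in a few lines of Python. This is a hypothetical illustration, not a reconstruction of Addy's system: the names `SharedCommand`, `producer_update`, and `Consumer`, the heartbeat counter, and the `max_missed` threshold are all assumptions. It shows a consumer that would keep acting on a stale value forever after the producer task dies, unless it checks that the producer is still updating.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SharedCommand:
    value: int = 0        # last value written by the producer task
    heartbeat: int = 0    # incremented on every producer update


def producer_update(shared: SharedCommand, new_value: int) -> None:
    """What the producer task does each cycle while it is alive."""
    shared.value = new_value
    shared.heartbeat += 1


class Consumer:
    SAFE_VALUE = 0  # hypothetical safe output, assumed for this sketch

    def __init__(self) -> None:
        self.last_heartbeat: Optional[int] = None
        self.missed = 0   # consecutive cycles with no producer update

    def step(self, shared: SharedCommand, max_missed: int = 2) -> int:
        """Act on the shared value, but fall back if the producer looks dead."""
        if shared.heartbeat == self.last_heartbeat:
            self.missed += 1
        else:
            self.missed = 0
            self.last_heartbeat = shared.heartbeat
        if self.missed > max_missed:
            return self.SAFE_VALUE    # producer presumed dead: stop trusting value
        return shared.value


if __name__ == "__main__":
    shared, consumer = SharedCommand(), Consumer()
    producer_update(shared, 40)       # producer runs once, then "dies"
    for cycle in range(5):
        print(cycle, consumer.step(shared))
```

Without the heartbeat comparison, `step` would return the last written value indefinitely, which is the kind of silent loss of control described above. The check is only one detection idea: the counter itself lives in memory that can be corrupted, so it does not by itself make the system safe.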
Coming at this another way, saying that something can't happen because it can't be reproduced in the lab ignores how random hardware and software faults manifest in the real world. Arbitrarily bad random faults can and will happen eventually in any widely deployed embedded system. Designers have to choose between ignoring this reality and facing it head-on by designing systems that are safe despite such faults occurring.
- Addy, E., A case study on isolation of safety-critical software, Proc. Conf. Computer Assurance, pp. 75-83, 1991.
- Cisco, Cisco 12000 Single Event Upset Failures Overview and Work Around Summary, August 15, 2003.
- Douglass, Doing Hard Time: Developing Real-Time Systems with UML, Objects, Frameworks, and Patterns, Addison-Wesley Professional, 1999.
- Forbes, Sun Screen, November 11, 2000. Accessed at: http://www.forbes.com/global/2000/1113/0323026a_print.html
- Heijmen, Soft errors from space to ground: historical overview, empirical evidence, and future trends (Chapter 1), in: Soft Errors in Modern Electronic Systems, Frontiers in Electronic Testing, 2011, Volume 41, pp. 1-41.
- Horst et al., The risk of data corruption in microprocessor-based systems, FTCS, 1993, pp. 576-585.
- Koopman, Better Embedded System Software, 2010.
- Mauser, Electronic throttle control – a dependability case study, J. Univ. Computer Science, 5(10), 1999, pp. 730-741.
- MISRA, Development Guidelines for Vehicle Based Software, November 1994 (PDF version 1.1, January 2001).
- MISRA, Report 2: Integrity, February 1995.
- NASA-GB-8719.13, NASA Software Safety Guidebook, NASA Technical Standard, March 31, 2004.
- Skarin & Karlsson, Software implemented detection and recovery of soft errors in a brake-by-wire system, EDCC 2008, pp. 145-154.
- Storey, N., Safety Critical Computer Systems, Addison-Wesley, 1996.
- Sullivan & Chillarege, Software defects and their impact on system availability: a study of field failures in operating systems, Fault Tolerant Computing Symposium, 1991, pp. 1-9.
- Tang & Rodbell, Single-event upsets in microelectronics: fundamental physics and issues, MRS Bulletin, Feb. 2003, pp. 111-116.
- Thomas et al., The “trouble not identified” phenomenon in automotive electronics, Microelectronics Reliability 42, 2002, pp. 641-651.
- Yu, Y. & Johnson, B., Fault Injection Techniques: a perspective on the state of research. In: Benso & Prinetto (eds.), Fault Injection Techniques and Tools for Embedded System Reliability Evaluation, Kluwer, 2003, pp. 7-39.