Better Embedded System SW: Random Hardware Faults

Computer hardware randomly generates incorrect answers once in a while, in some cases due to single event upsets from cosmic rays. Yes, really! Here's a summary of the problem and how often you can expect it to occur, with references for further reading.

---------------------------------------

(As will be the case with some of my posts over the next few months, this text is written more in an academic style, and emphasizes automotive safety applications since that is the area I perform much of my research in. But I hope it is still useful, if less entertaining in style than some of my other posts.)

Hardware corruption in the form of bit flips in memories has been known to occur on terrestrial computers for more than 30 years (e.g., May & Woods 1979; Karnik et al. 2004, Heijman 2011). Hardware data corruption includes changes during operation made to both RAM and other CPU hardware resources (for example bits held in CPU registers that are intermediate results in a computation).

"Soft errors" are those in which a transient fault causes a loss of data, but not damage to a circuit. (Mariani03, pg. 51) Terminology can be slightly confusing in that a “soft error” has nothing to do with software, but instead refers to a hardware data corruption that does not involve permanent chip damage. These “soft” errors can be caused by, among other things, neutrons created by cosmic rays that penetrate the earth’s atmosphere.

Soft errors can be caused by external events such as naturally occurring alpha radiation particle caused by impurities in the semiconductor materials, as well as neutrons from cosmic rays that penetrate the earth’s atmosphere (Mariani03, pp. 51-53). Additionally, random faults may be caused by internal events such as electrical noise and power supply disruptions that only occasionally cause problems. (Mariani03, pg. 51).

As computers evolve to become more capable, they also use smaller and smaller individual devices to perform computation. And, as devices get smaller, they become more vulnerable to data corruption from a variety of sources in which a subtle fleeting event causes a disruption in a data value. For example, a cosmic ray might create a particle that strikes a bit in memory, flipping that bit from a “1” to a “0,” resulting in the change of a data value or death of a task in a real time control system. (Wang 2008 is a brief tutorial.)

Chapter 1 of Ziegler & Puchner (2004) gives an historical perspective of the topic. Chapter 9 of Ziegler & Puchner (2004) lists accepted techniques for protecting against hardware induced errors, including use of error correcting codes (id., p. 9-7) and memory scrubbing (id., p. 9-8). It is noted that techniques that protect memory “are not effective for computational or logic operations” (id., p. 9-8), meaning that only values stored in RAM can be protected by these methods, although special exotic methods can be used to protect the rest of the CPU (id., p. 1-13).

Causes for such failures vary in source and severity depending upon a variety of conditions and the exact type of technology used, but they are always present. The general idea of the failure mode is as follows. SRAM memory uses pairs of inverter logic gates arranged in a feedback pattern to “remember” a “1” or a “0” value. A neutron produced by a cosmic ray can strike atoms in an SRAM cell, inducing a current spike that over-rides the remembered value, flipping a 0 to a 1, or a 1 to a 0. Newer technology uses smaller hardware cells to achieve increased density. But having smaller components makes it easier for a cosmic ray or other disturbance to disrupt the cell’s value. While a cosmic ray strike may seem like an exotic way for computers to fail, this is a well documented effect that has been known for decades (e.g., May & Woods 1979 discussed radiation-induced errors) and occurs on a regular basis with a predictable average frequency, even though it is fundamentally a randomly occurring event.

Radiation strike causing transistor disruption (Gorini 2012).

A paper from Dr. Cristian Constantinescu, a top Intel dependability expert at that time, summarized the situation about a decade ago (Constantinescu 2003). He differentiates random faults into two cases. Intermittent faults, which might be due to a subtle manufacturing or wearout defect, can cause bursts of memory errors (id., pp. 15-16). Transient faults, on the other hand, can be caused by naturally occurring neutron and alpha particles, electromagnetic interference, and electrostatic discharge (id., pg. 16), which he refers to as “soft errors” in general. To simplify terminology, we’ll consider both transient and intermittent faults as “random” errors, even though intermittent errors strictly speaking may come in occasional bursts rather than having purely random independent statistical distributions. His paper goes on to discuss errors observed in real computers. In his study, 193 corporate server computers were studied, and 69% of them observed one or more random errors in a 16-month period. 2.6 percent of the servers reported more than 1000 random errors in that 16-month period due to a combination of random error sources (id., pg. 16):

Random Error Data (Constantinescu 2003, pp. 15-16).

This data supports the well known notion that random errors can and do occur in all computer systems. It is to be expected due to minute variations in manufacturing, environments, and statistical distribution properties that some systems will have a very large number of such errors in a short period of time, while many of the same exact type of system will exhibit none even for extended periods of operation. The problem is only expected to get worse for newer generations of electronics (id., pg. 17).

Excerpt from Mariani 2003, pg. 50

Mariani (2003, pg. 50) makes it clear that soft errors “are officially considered one of the main causes of loss of reliability in modern digital components.” He even goes so far to say of “x-by-wire” automotive systems (which includes “throttle-by-wire” and other computer-controlled functions): “Of course such systems must be highly reliable, as failure will cause fatal accidents, and therefore soft errors must be taken into account.” Additionally, random faults may be caused by internal events such as electrical noise and power supply disruptions that only occasionally cause problems. (Mariani 2003, pg. 51). MISRA Report 2 makes it clear that “random hardware failures” must be controlled for safety critical automotive applications, with a required rigor to control those failures increasing as safety becomes more important. (MISRA Report 2 p. 18).

Semiconductor error rates are often given in terms of FIT (“Failures in Time”), where one FIT equals one failure per billion operating hours. Mariani gives representative numbers as 4000 FIT for a CPU circa 2001 (Mariani03, pg. 54), with error rates increasing significantly for subsequent years. Based on a review of various data sources considered it seems reasonable to use an error rate of 1000 FIT/Mbit (1 errors per 1e13 bit-hr) for SRAM. (e.g., Heijman 2011 pg. 2, Wang 2008 pg. 430, Tezzaron 2004) Note that not all failures occur in memory. Any transistor on a chip can fail due to radiation, and not all storage is in RAM. As an example, registers such as a computer’s program counter are exposed to Single Event Upsets (SEUs), and the effect of an SEU there can result in the computer having arbitrarily bad behavior unless some form of mitigation is in place. Automotive-specific publications state that hardware induced data corruption must be considered (Seeger 1982).

SEU rates vary considerably based on a number of factors such as altitude, with data presented for Boulder Colorado showing a factor of 3.5 to 4.8 increase in cosmic-ray induced errors over data collected in Essex Junction Vermont (near sea level) (O’Gorman 1994, abstract).

Given the above, it is clear that random hardware errors are to be expected when deploying a large fleet of embedded systems. This means that if your system is safety critical, you need to take such errors into account. In a future post I'll explore how bad the effects of such a fault can be.

References:

Constantinescu, Trends and challenges in VLSI circuit reliability, IEEE Micro, July-Aug 2003, pp. 14-19

Heijmen, Soft errors from space to ground: historical overview, empirical evidence, and future trends (Chapter 1), in: Soft Errors in Modern Electronic Systems, Frontiers in Electronic Testing, 2011, Volume 41, pp. 1-41.

Karnik et al., Characterization of soft errors caused by single event upsets in CMOS processes, IEEE Trans. Dependable and Secure Computing, April-June 2004, pp. 128-143.

Los Alamos National Laboratory, The Invisible Neutron Threat, https://www.lanl.gov/science/NSS/issue1_2012/story4full.shtml

Mariani, Soft errors on digital computers, Fault injection techniques and tools for embedded systems reliability evaluation, 2003, pp. 49-60.

May & Woods, Alpha-particle-induced soft errors in dynamic memories, IEEE Trans. On Electronic Devices, vol. 26, 1979

MISRA, Report 2: Integrity, February 1995. (www.misra.org.uk)

O’Gorman, The effect of cosmic rays on the soft error rate of a DRAM at ground level, IEEE Trans. Electron Devices, April 1994, pp. 553-557.

Seeger, M., Fault-Tolerant Software Techniques, SAE Report 820106, International Congress & Exposition, Society of Automotive Engineers, 1982, pp. 119-125.

Tezzaron Semiconductor, Soft Errors in Electronic Memory -- A White Paper, 2004, www.tezzaron.com

Wang & Agrawal, Single event upset: an embedded tutorial, 21st Intl. Conf. on VLSI Design, 2008, pp. 429-434.

Ziegler & Puchner, SER – History, Trends, and Challenges: a guide for designing with memory ICs, Cypress Semiconductor, 2004.

Additional material:

Article on cosmic ray memory corruption as a limiting factor in supercomputer size: https://spectrum.ieee.org/computing/hardware/how-to-kill-a-supercomputer-dirty-power-cosmic-rays-and-bad-solder
Cosmic rays and Japanese network malfunctions: "Cosmic rays causing 30,000 network malfunctions in Japan each year, April 5, 2021: https://mainichi.jp/english/articles/20210405/p2g/00m/0bu/028000c
https://www.bbc.com/future/article/20221011-how-space-weather-causes-computer-errors

2 comments:

MikeRSeptember 1, 2015 at 3:21 AM
Hi,
Is a FIT a failure per _billion_ or _trillion_ hours? I saw both numbers used in blog posts on your sites...
Phil KoopmanSeptember 1, 2015 at 7:42 AM
A FIT is 1 failure per billion hours (i.e., failure rate of 1e-9). 4000 FIT = 4 lamda. The numbers above were correct as-is; just that word was wrong. Thanks for pointing out the typo.

Please send me your comments. I read all of them, and I appreciate them. To control spam I manually approve comments before they show up. It might take a while to respond. I appreciate generic "I like this post" comments, but I don't publish non-substantive comments like that.

If you prefer, or want a personal response, you can send e-mail to comments@koopman.us.
If you want a personal response please make sure to include your e-mail reply address. Thanks!

Better Embedded System SW

Monday, March 3, 2014

Random Hardware Faults

2 comments:

Static Analysis Ranked Defect List

Pages

Search This Blog