Monday, February 17, 2014
Many of you are writing software that has safety aspects creeping into it. There's nothing like a real-world case study to bring home the consequences of unsafe software. The Therac 25 story is must-read material. The short version is that several radiation therapy patients were killed by massive radiation overdoses that trace back to bad software. See below for more details...
(This article is written in an academic style rather than as an informal blog post. But hopefully it's informative.)
The Therac 25 accidents form the basis for what is often considered the best-documented software safety case-study available. The experience illustrates a number of principles that are vital to understanding how and why the design and analysis of safety-critical systems must be done in a methodical way according to established principles. The Therac 25 accidents came at a time before many current practices were widespread, and serve as a cautionary tale for why such practices exist and are essential to creating safe systems.
Briefly, the Therac 25 was a medical radiation therapy machine that was supposed to deliver controlled doses of radiation to cancer patients. Basically, this was a “radiation-by-wire” system in which software was used to replace some hardware safety mechanisms. Due to software defects, among other factors, it was involved in six known massive overdose accidents resulting in deaths and serious injuries (Leveson 1993, p. 18). A simple explanation of the likely mechanism for the accidents is that the machine used the beam strength intended for X-ray exposure, but with the metallic conversion target (which turns the electron beam into X-rays and attenuates the beam) not in place, resulting in 100x overdoses. Due to limitations of the dose measurement system, the way patients knew they were overexposed was radiation burns (and, in at least one case, a reported sizzling sound of the radiation dose measurement devices frying).
Some of the characteristics of the Therac 25 development process can be summarized as follows: almost all testing was done at the system level rather than as lower-level unit tests, shared memory variables were unprotected from concurrency defects, and “race conditions due to multitasking without protecting shared variables played an important part in the accidents.” (id., text box pp. 20-21) Operators were taught that there were “so many safety mechanisms” that it was “virtually impossible to overdose a patient.” (id., p. 24)
The manufacturer could not reproduce an initially reported problem involving the Ontario Cancer Foundation mishap in 1985. After analysis, they blamed a patient turntable position measurement sensor. A sensor modification and a software failsafe were added to mitigate the problem (id., pp. 23-26), and the manufacturer claimed a five-order-of-magnitude safety improvement. But this was not an accurate assessment.
Later, after the 1986 East Texas Cancer Center accidents, two manufacturer engineers could not reproduce a malfunction indication reported by the local staff. The manufacturer’s “home office engineer reportedly explained that it was not possible for the Therac-25 to overdose a patient.” But this was found to be untrue after an investigation into a second overdose a month later at the same facility revealed the problem to be a software defect. (id., pp. 27-28) Reproducing the effects of the software defect was difficult, because it was timing-dependent and involved the speed of radiation prescription data entry. (id., p. 28) A number of hardware and software mitigations were added (id., pp. 31-32). But even then, an entirely different timing-dependent software problem emerged to cause the Yakima Valley 1987 overdose mishap (id., pp. 33-34), and potentially another mishap.
At a technical level, some of the factors that contributed to the Therac 25 accidents included: cryptic error messages, using a home-brew real time operating system, mutex operations that were not atomic (and therefore, defective), race conditions between user inputs and machine actions, a problem that only manifested when a counter value rolled over to zero, and generally inadequate testing and reviews.
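Why a non-atomic mutex operation is defective can be shown with a small deterministic simulation. The sketch below is illustrative only and is not the Therac-25 code: the function names and the dictionary-based lock are hypothetical, and Python generators stand in for tasks so that the task switch happens at exactly the point where it does the damage.

```python
def nonatomic_acquire(lock, name):
    """Test and set as two separate steps, with a task switch in between."""
    seen_free = not lock["held"]  # step 1: test
    yield                         # another task may run here
    if seen_free:                 # step 2: set, based on a stale test
        lock["held"] = True
        lock["owners"].append(name)


def atomic_acquire(lock, name):
    """Test and set in one step that no other task can interrupt."""
    if not lock["held"]:
        lock["held"] = True
        lock["owners"].append(name)
    yield


def run_interleaved(tasks):
    """Round-robin scheduler: run each task one step at a time."""
    pending = list(tasks)
    while pending:
        for task in list(pending):
            try:
                next(task)
            except StopIteration:
                pending.remove(task)


def owners_after(acquire):
    """Let tasks A and B both try to take a fresh lock; return who got it."""
    lock = {"held": False, "owners": []}
    run_interleaved([acquire(lock, "A"), acquire(lock, "B")])
    return lock["owners"]


print(owners_after(nonatomic_acquire))  # both tasks believe they hold the lock
print(owners_after(atomic_acquire))     # only one task holds the lock
```

In the non-atomic version both tasks see the lock as free before either one marks it held, so both proceed into the protected region. On real hardware the fix is an indivisible test-and-set (or disabling interrupts around the sequence), which is what the atomic version models.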
When used for treatment, the machines were known to throw lots of error codes. But instead of this being seen as a sign that the machines were exercising their safety mechanisms often (which is a really bad idea), the frequent shutdowns were interpreted as evidence that the machines were safe. In reality, a system that exercises its failsafes all the time is prone to eventually seeing a fault that gets past the failsafes. It's well known in operating safety-critical systems that exercising failsafes is undesirable. They are your last line of defense, and should be a last resort backup that is almost never activated. Regularly exercising failsafes is a hallmark of an unsafe system.
The lessons from the Therac 25 form bedrock principles for the safety critical software community. They include: “Accidents are seldom simple – they usually involve a complex web of interacting events with multiple contributing technical, human, and organizational factors.” (id., p. 38). Do not assume that fixing a particular error will prevent future accidents (“There is always another software bug”) (id.). Higher level system engineering failures are often relevant, such as: lack of follow-through on all reported incidents, overconfidence in the software, less-than-acceptable software engineering practices, and unrealistic risk assessments (which for the Therac 25 included an assessment that the software was defect-free). (id.).
“Designing any dangerous system in such a way that one failure can lead to an accident violates basic system-engineering principles. In this respect, software needs to be treated as a single component.” (id. pp. 38-39). (In context, this refers to the software resident on a single CPU, meaning that if any software defect on one CPU can cause an accident, that is a single point failure that renders the system unsafe.)
Leveson lists “basic software-engineering practices that apparently were violated with the Therac-25” as: documentation should not be an afterthought; software quality assurance practices and standards should be established; designs should be kept simple; ways to get error information should be designed in; and that “the software should be subjected to extensive testing and formal analysis at the module and software level: system testing alone is not adequate.” (id., p. 39) Leveson finishes by saying that although this was a medical system, “the lessons apply to all types of systems where computers control dangerous devices.” (id., p. 41)
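One of the contributing factors noted earlier, a problem that only appears when a counter rolls over to zero, is a good example of what module-level testing can catch and system testing is likely to miss. The sketch below is a simplified illustration, not the actual Therac-25 code: the function name, the 8-bit width, and the convention that a nonzero flag means "run the safety check" are assumptions made for the example.

```python
def skipped_passes(num_passes, use_increment):
    """Return the pass numbers on which the safety check would be skipped.

    A nonzero flag means "run the check". The defective variant marks the
    flag by incrementing an 8-bit value; the fixed variant assigns a constant.
    """
    flag = 0
    skipped = []
    for n in range(1, num_passes + 1):
        if use_increment:
            flag = (flag + 1) & 0xFF  # 8-bit wraparound: 255 + 1 becomes 0
        else:
            flag = 1                  # always nonzero
        if flag == 0:
            skipped.append(n)         # check silently bypassed on this pass
    return skipped


print(skipped_passes(600, use_increment=True))   # [256, 512]
print(skipped_passes(600, use_increment=False))  # []
```

A module-level test can drive the loop through hundreds of passes in microseconds and assert that no pass skips the check. A system-level test would have to land an operator action on exactly the wrong pass, which is why this kind of defect can survive extensive system testing.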
Leveson, An Investigation of the Therac-25 Accidents, IEEE Computer, July 1993, pp. 18-41. (updated version here: http://sunnyday.mit.edu/papers/therac.pdf)
Therac-25 Case materials for teaching: http://www.computingcases.org/case_materials/therac/therac_case_intro.html
CMU 18-649 Software safety lecture (second half covers Therac 25)
Better Embedded System Software. Chapter 28 is on software safety.