Better Embedded System SW: October 2014

Monday, October 13, 2014

Safety Culture

A weak safety culture makes it extremely difficult to create safe systems.

Consequences:
A poor safety culture dramatically elevates the risk of creating an unsafe product. If an organization cuts corners on safety, one should reasonably expect the result to be an unsafe outcome.

Accepted Practices:

Establish a positive safety culture in which all stakeholders put safety first, rigorous adherence to process is expected, and all developers are incentivized to report and correct both process and product problems.

Discussion:
A “safety culture” is the set of attitudes and beliefs employees have to attaining safety. Key aspects of such a culture include a willingness to tell management that there are safety problems, and an insistence that all processes relevant to safety be followed rigorously.

Part of establishing a healthy safety culture in an organization is a commitment to improving processes and products over time. For example, when new practices become accepted in an industry (for example, the introduction of a new version of the MISRA C coding style, or the introduction of a new safety standard such as ISO 26262), the organization should evaluate and at least selectively adopt those practices while formally recording the rationale for excluding and/or slow-rolling the adoption of new practices. (In general, one expects substantially all new accepted practices in an industry to be adopted over time by a company, and it is simply a matter of how aggressively this is done and in what order.)

Ideally, organizations should identify practices that will improve safety proactively instead of reactively. But regardless, it is unacceptable for an organization building safety critical systems to ignore new safety-relevant accepted practices with an excuse such as “that way was good enough before, so there is no reason to improve” – especially in the absence of a compelling proof that the old practice really was “good enough.”

Another aspect of a healthy safety culture is aggressively pursuing every potential safety problem to root cause resolution. In a safety-critical system there is no such thing as a one-off failure. If a system is observed to behave incorrectly, then that behavior must be presumed to be something that will happen again (probably frequently) on a large deployed fleet. It is, however, acceptable to log faults in a hazard log and then prioritize their resolution based on risk analysis such as using a risk table (Koopman 2010, ch. 28).

Along these lines, blaming a person for a design defect is usually not an acceptable root cause. Since people (developers and system operators alike) make mistakes, saying something like “programmer X made a mistake, so we fired him and now the problem is fixed” is simply scapegoating. The new replacement programmer is similarly sure to make mistakes. Rather, if a bug makes it through a supposedly rigorous process, the fact that the process didn’t prevent, detect, and catch the bug is what is broken (for example, perhaps design reviews need to be modified to specifically look for the type of defect that escaped into the field). Similarly, it is all too easy to scapegoat operators when the real problem is a poor design or even when the real problem is a defective product. In short, blaming a person should be the last alternative when all other problems have been conclusively ruled out – not the first alternative to avoid fixing the problem with a broken process or broken safety culture.

Believing that certain classes of defects are impossible to the degree that there is no point even looking for them is a sure sign of a defective safety culture. For example, saying that software defects cannot possibly be responsible for safety problems and instead blaming problems on human operators (or claiming that repeated problems simply didn’t happen) is a sure sign of a defective safety culture. See, for example, the Therac 25 radiation accidents. No software is defect free (although good ones are nearly defect free to begin with, and are improved as soon as new hazards are identified). No system is perfectly safe under all possible operating conditions. An organization with a mature safety culture recognizes this and responds to an incident or accident in a manner that finds out what really happened (with no preconceptions as to whether it might be a software fault or not) so it can be truly fixed. It is important to note that both incidents and accidents must be addressed. A “near miss” must be sufficient to provoke corrective action. Waiting for people to die (or dozens of people to die) after multiple incidents have occurred and been ignored is unacceptable (for an example of this, consider the continual O-ring problems that preceded the Challenger space shuttle accident).

The creation of safe software requires adherence to a defined process with minimal deviation, and the only practical way to ensure this is by having a robust Software Quality Assurance (SQA) function. This is not the same as thorough testing, nor is it the same as manufacturing quality. Rather than being based on testing the product, SQA is based on defining and auditing how well the development process (and other aspects of ensuring system safety) have been followed. No matter how conscientious the workers, independent checks, balances, and quantifiable auditing results are required to ensure that the process is really being followed, and is being followed in a way that is producing the desired results. It is also necessary to make sure the SQA function itself is healthy and operational.

Selected Sources:
Making the transition from creating ordinary software to safety critical software is well known to require a cultural shift that typically involves a change from an all-testing approach to quality to one that has a balance of testing and process management. Achieving this state is typically referred to as having a “safety culture” and is necessary step in achieving safety. (Storey 1996, p. 107) Without a safety culture it is extremely difficult, if not impossible, to create safe software. The concept of a “safety culture” is borrowed from other, non-software fields, such as nuclear power safety and occupational safety.

MISRA Software Guidelines Section 3.1.4 Assessment recommends an independent assessor to ensure that required practices are being followed (i.e., an SQA function).

MISRA provides a section on “human error management” that includes: “it is recommended that a fear free but responsible culture is engendered for the reporting of issues and errors” (MISRA Software Guidelines p. 58) and “It is virtually impossible to prevent human errors from occurring, therefore provision should be made in the development process for effective error detection and correction; for example, reviews by individuals other than the authors.”

References:

Challenger Disaster: http://en.wikipedia.org/wiki/Space_Shuttle_Challenger_disaster
Koopman, Better Embedded System Software, 2010.
MISRA, (MISRA C), Guideline for the use of the C Language in Vehicle Based Software, April 1998. (and later editions)
MISRA, Development Guidelines for Vehicle Based Software, November 1994 (PDF version 1.1, January 2001).
Storey, N., Safety Critical Computer Systems, Addison-Wesley, 1996.
Therac 25 Accidents: http://betterembsw.blogspot.com/2014/02/the-therac-25-case-study-in-unsafe.html

Friday, October 3, 2014

A Case Study of Toyota Unintended Acceleration and Software Safety

Oct 3, 2014: updated with video of the lecture

Here is my case study talk on the Toyota unintended acceleration cases that have been in the news and the courts the past few years.

The talk summary is below and embedded slides are below. Additional pointers:

Vimeo hosted video (higher quality source file)
Youtube hosted video
Streaming video of talk (1 hour - split screen with speaker & slides, requires free account)

(Please see end of post for video download and copyright info.)

= = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = =

A Case Study of Toyota Unintended Acceleration and Software Safety

Abstract:
Investigations into potential causes of Unintended Acceleration (UA) for Toyota vehicles have made news several times in the past few years. Some blame has been placed on floor mats and sticky throttle pedals. But, a jury trial verdict was based on expert opinions that defects in Toyota's Electronic Throttle Control System (ETCS) software and safety architecture caused a fatal mishap. This talk will outline key events in the still-ongoing Toyota UA litigation process, and pull together the technical issues that were discovered by NASA and other experts. The results paint a picture that should inform future designers of safety critical software in automobiles and other systems.

Bio:
Prof. Philip Koopman has served as a Plaintiff expert witness on numerous cases in Toyota Unintended Acceleration litigation, and testified in the 2013 Bookout trial. Dr. Koopman is a member of the ECE faculty at Carnegie Mellon University, where he has worked in the broad areas of wearable computers, software robustness, embedded networking, dependable embedded computer systems, and autonomous vehicle safety. Previously, he was a submarine officer in the US Navy, an embedded CPU architect for Harris Semiconductor, and an embedded system researcher at United Technologies. He is a senior member of IEEE, senior member of the ACM, and a member of IFIP WG 10.4 on Dependable Computing and Fault Tolerance. He has affiliations with the Carnegie Mellon Institute for Software Research (ISR) and the National Robotics Engineering Center (NREC).

= = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = =

I am getting an increasing number of requests to do this talk in person, both as a keynote speaker and for internal corporate audiences. Audiences tell me that while the video is nice, an in-person experience of both the presentation and small-group follow-up discussions has a lot more impact for organizations who need help in coming to terms with creating high quality software and safety critical systems. If you are interested please get in touch for details: koopman@cmu.edu

Other info:

Download copy of full-resolution video file set of talk, Box.com 340 MB .zip file of a web directory with interactive split-screen viewing format. Experts only! Please do not ask me for support -- it works for me but I don't have any details about this format beyond saying to unzip it and open Default.html in a web browser.)
Mirror of full resolution video talk download (dropbox.com of same files as on box.com)
One or more of these download sites might be blocked by company networks, so if you get an error message please try both links at home. If they still don't work, send me an e-mail and I'll see what I can do.
Download medium-bit-rate 720p video from CMU server (.mp4; 124MB)

All materials (slides & video) are licensed under Creative Commons Attribution BY v. 4.0.

Please include "Prof. Philip Koopman, Carnegie Mellon University" as the attribution.

If you are planning on using the materials in a course or similar, I would appreciate it if you let me know so I can track adoption. If you need a variation from the CC BY 4.0 license (for example, to incorporate materials in a situation that is at odds with the license terms) please contact me and it can usually be arranged.

= = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = =