Monday, October 13, 2014

Safety Culture

A weak safety culture makes it extremely difficult to create safe systems.

Consequences:
A poor safety culture dramatically elevates the risk of creating an unsafe product. If an organization cuts corners on safety, one should reasonably expect the result to be an unsafe outcome.

Accepted Practices:
  • Establish a positive safety culture in which all stakeholders put safety first, rigorous adherence to process is expected, and all developers are incentivized to report and correct both process and product problems.

Discussion:
A “safety culture” is the set of attitudes and beliefs employees have toward attaining safety. Key aspects of such a culture include a willingness to tell management that there are safety problems, and an insistence that all processes relevant to safety be followed rigorously.

Part of establishing a healthy safety culture in an organization is a commitment to improving processes and products over time. For example, when new practices become accepted in an industry (such as a new version of the MISRA C coding style, or a new safety standard such as ISO 26262), the organization should evaluate and at least selectively adopt those practices while formally recording the rationale for excluding and/or slow-rolling the adoption of new practices. (In general, one expects substantially all new accepted practices in an industry to be adopted over time by a company, and it is simply a matter of how aggressively this is done and in what order.)

Ideally, organizations should identify practices that will improve safety proactively instead of reactively. But regardless, it is unacceptable for an organization building safety critical systems to ignore new safety-relevant accepted practices with an excuse such as “that way was good enough before, so there is no reason to improve” – especially in the absence of a compelling proof that the old practice really was “good enough.”

Another aspect of a healthy safety culture is aggressively pursuing every potential safety problem to root cause resolution. In a safety-critical system there is no such thing as a one-off failure.  If a system is observed to behave incorrectly, then that behavior must be presumed to be something that will happen again (probably frequently) on a large deployed fleet.  It is, however, acceptable to log faults in a hazard log and then prioritize their resolution based on risk analysis such as using a risk table (Koopman 2010, ch. 28).
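As a rough illustration of logging every observed fault and then prioritizing by risk, here is a minimal Python sketch. The severity and probability categories, the numeric risk ranks, and the names RISK_TABLE, HazardLogEntry, and prioritize are all assumptions made up for this example; they are not taken from Koopman 2010 or from any particular standard, and a real project would use the risk table defined by its own safety process.

```python
from dataclasses import dataclass

# Hypothetical risk table: RISK_TABLE[severity][probability] -> risk rank.
# Higher rank means the entry should be resolved sooner. Values are illustrative only.
RISK_TABLE = {
    "catastrophic": {"frequent": 4, "occasional": 4, "remote": 3, "improbable": 2},
    "critical":     {"frequent": 4, "occasional": 3, "remote": 2, "improbable": 2},
    "major":        {"frequent": 3, "occasional": 2, "remote": 2, "improbable": 1},
    "minor":        {"frequent": 2, "occasional": 1, "remote": 1, "improbable": 1},
}


@dataclass
class HazardLogEntry:
    entry_id: str
    description: str
    severity: str      # key into RISK_TABLE
    probability: str   # key into RISK_TABLE[severity]
    resolved: bool = False

    def risk(self) -> int:
        return RISK_TABLE[self.severity][self.probability]


def prioritize(log):
    """Return unresolved entries, highest risk first.

    Nothing is dropped for being a "one-off": every unresolved entry stays
    in the list. Entries with equal risk keep their original log order.
    """
    open_entries = [e for e in log if not e.resolved]
    return sorted(open_entries, key=lambda e: e.risk(), reverse=True)


# Example log (made-up entries).
log = [
    HazardLogEntry("H-1", "Display flickers at startup", "minor", "frequent"),
    HazardLogEntry("H-2", "Actuator command stuck once in test", "catastrophic", "remote"),
    HazardLogEntry("H-3", "Watchdog reset during bench run", "critical", "occasional"),
]

for entry in prioritize(log):
    print(entry.entry_id, entry.risk(), entry.description)
```

Note that the fault seen only once (H-2) still lands at the top of the list, because the table ranks it by potential consequence rather than by how often it has been observed so far.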

Along these lines, blaming a person for a design defect is usually not an acceptable root cause. Since people (developers and system operators alike) make mistakes, saying something like “programmer X made a mistake, so we fired him and now the problem is fixed” is simply scapegoating. The new replacement programmer is similarly sure to make mistakes. Rather, if a bug makes it through a supposedly rigorous process, the fact that the process didn’t prevent or catch the bug is what is broken (for example, perhaps design reviews need to be modified to specifically look for the type of defect that escaped into the field). Similarly, it is all too easy to scapegoat operators when the real problem is a poor design or even a defective product. In short, blaming a person should be the last alternative when all other problems have been conclusively ruled out – not the first alternative to avoid fixing the problem with a broken process or broken safety culture.

Believing that certain classes of defects are impossible to the degree that there is no point even looking for them is a sure sign of a defective safety culture. For example, saying that software defects cannot possibly be responsible for safety problems and instead blaming problems on human operators (or claiming that repeated problems simply didn’t happen) is a sure sign of a defective safety culture. See, for example, the Therac-25 radiation accidents. No software is defect free (although good software is nearly defect free to begin with, and is improved as soon as new hazards are identified). No system is perfectly safe under all possible operating conditions. An organization with a mature safety culture recognizes this and responds to an incident or accident in a manner that finds out what really happened (with no preconceptions as to whether it might be a software fault or not) so it can be truly fixed. It is important to note that both incidents and accidents must be addressed. A “near miss” must be sufficient to provoke corrective action. Waiting for people to die (or dozens of people to die) after multiple incidents have occurred and been ignored is unacceptable (for an example of this, consider the continual O-ring problems that preceded the Challenger space shuttle accident).

The creation of safe software requires adherence to a defined process with minimal deviation, and the only practical way to ensure this is by having a robust Software Quality Assurance (SQA) function. This is not the same as thorough testing, nor is it the same as manufacturing quality. Rather than being based on testing the product, SQA is based on defining and auditing how well the development process (and other aspects of ensuring system safety) has been followed. No matter how conscientious the workers, independent checks, balances, and quantifiable auditing results are required to ensure that the process is really being followed, and is being followed in a way that is producing the desired results. It is also necessary to make sure the SQA function itself is healthy and operational.
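To make “quantifiable auditing results” a bit more concrete, here is a small Python sketch of one possible process audit check. It assumes a hypothetical process in which every work item must have four recorded artifacts; the artifact names, the WorkItem class, and the audit function are invented for illustration and do not come from MISRA or any other standard. A real SQA audit covers far more than artifact presence, so treat this as a toy example of turning process adherence into a number and a list of deviations.

```python
from dataclasses import dataclass, field

# Hypothetical list of artifacts the defined process requires for each work item.
REQUIRED_ARTIFACTS = (
    "requirements_trace",
    "design_review",
    "code_review",
    "test_report",
)


@dataclass
class WorkItem:
    name: str
    artifacts: set = field(default_factory=set)  # artifacts actually on record


def audit(items, required=REQUIRED_ARTIFACTS):
    """Return (compliance_rate, deviations).

    compliance_rate is the fraction of items with every required artifact.
    deviations maps each non-compliant item name to its missing artifacts.
    An empty audit sample is reported as 0.0, since it gives no evidence
    that the process is being followed.
    """
    deviations = {}
    for item in items:
        missing = [a for a in required if a not in item.artifacts]
        if missing:
            deviations[item.name] = missing
    rate = (len(items) - len(deviations)) / len(items) if items else 0.0
    return rate, deviations


# Example audit sample (made-up records).
sample = [
    WorkItem("brake_monitor", set(REQUIRED_ARTIFACTS)),
    WorkItem("can_driver", set(REQUIRED_ARTIFACTS)),
    WorkItem("throttle_filter", {"requirements_trace", "design_review", "test_report"}),
    WorkItem("diag_logger", {"requirements_trace", "code_review"}),
]

rate, deviations = audit(sample)
print(f"compliance: {rate:.0%}")
for name, missing in deviations.items():
    print(f"  {name}: missing {', '.join(missing)}")
```

The point of a check like this is that it is run by someone independent of the developers, and that each deviation is recorded and followed up rather than explained away.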

Selected Sources:
Making the transition from creating ordinary software to safety critical software is well known to require a cultural shift that typically involves a change from an all-testing approach to quality to one that has a balance of testing and process management. Achieving this state is typically referred to as having a “safety culture” and is a necessary step in achieving safety. (Storey 1996, p. 107) Without a safety culture it is extremely difficult, if not impossible, to create safe software. The concept of a “safety culture” is borrowed from other, non-software fields, such as nuclear power safety and occupational safety.

MISRA Software Guidelines Section 3.1.4 Assessment recommends an independent assessor to ensure that required practices are being followed (i.e., an SQA function).

MISRA provides a section on “human error management” that includes: “it is recommended that a fear free but responsible culture is engendered for the reporting of issues and errors” (MISRA Software Guidelines p. 58) and “It is virtually impossible to prevent human errors from occurring, therefore provision should be made in the development process for effective error detection and correction; for example, reviews by individuals other than the authors.”

References:


2 comments:

  1. Thanks Phil for this interesting article. From your point of view, what are the "most complete" safety standards, the ones that have the strongest requirements? On the other hand, again, from your perspective and with your experience, is there anything you would like to improve in any safety standard? If yes, what?

  2. I think a widely accepted starting point is IEC-61508. From there it depends upon your industry and the specifics of your product. Looking forward there are some who say that a safety argument might be superior to a safety standard, but that is currently a research topic, and does not mean you shouldn't use a safety standard if there is one suitable to your situation.
