Thursday, September 18, 2014

A Case Study of Toyota Unintended Acceleration and Software Safety

Today I gave the Carnegie Mellon ECE department seminar on a case study of the Toyota unintended acceleration cases that have been in the news and the courts the past few years.

The talk summary is below. You can download the complete presentation slides here:
     http://www.ece.cmu.edu/~koopman/pubs/koopman14_toyota_ua_slides.pdf


= = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = =

A Case Study of Toyota Unintended Acceleration and Software Safety 

Abstract:
Investigations into potential causes of Unintended Acceleration (UA) for Toyota vehicles have made news several times in the past few years. Some blame has been placed on floor mats and sticky throttle pedals. But, a jury trial verdict was based on expert opinions that defects in Toyota's Electronic Throttle Control System (ETCS) software and safety architecture caused a fatal mishap.  This talk will outline key events in the still-ongoing Toyota UA litigation process, and pull together the technical issues that were discovered by NASA and other experts. The results paint a picture that should inform future designers of safety critical software in automobiles and other systems.

Bio:
Prof. Philip Koopman has served as a Plaintiff expert witness on numerous cases in Toyota Unintended Acceleration litigation, and testified in the 2013 Bookout trial.  Dr. Koopman is a member of the ECE faculty at Carnegie Mellon University, where he has worked in the broad areas of wearable computers, software robustness, embedded networking, dependable embedded computer systems, and autonomous vehicle safety. Previously, he was a submarine officer in the US Navy, an embedded CPU architect for Harris Semiconductor, and an embedded system researcher at United Technologies.  He is a senior member of IEEE, senior member of the ACM, and a member of IFIP WG 10.4 on Dependable Computing and Fault Tolerance. He has affiliations with the Carnegie Mellon Institute for Software Research (ISR) and the National Robotics Engineering Center (NREC).

= = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = =


Thursday, September 11, 2014

Fail-Safe Mechanisms Must Be Tested

Some systems base their safety arguments on the presence of “fail-safe” behaviors. In other words, if a failure occurs, the argument is that the system will respond in a safe way, such as by shutting down in a safe manner. If you have fail-safe mechanisms, you need to test them with a full range of faults within the intended fault model to make sure they work properly.

Consequences:
Failing to specifically test for mitigation of single points of failure means that there is no way to be sure that the mitigation really works, putting safety of the system into doubt.

As an example, if a hardware watchdog timer is not turned on, it won’t reset the system, but there might be no way to tell whether the watchdog timer is on or not (or set to the wrong value, or otherwise used improperly) without specifically testing whether the watchdog works or not. Thus, you can’t take credit for having a watchdog timer unless you have actually tested that it works for each fault that matters (or, if there are many such faults, argue that you have attained sufficient coverage with the tests that are run).

Accepted Practices:
  • Each and every fail-safe mechanism and fault management mechanism must be tested, preferably on a fully integrated system. Such tests may be difficult to perform in normal functional testing and may require intentional fault injection from the outside of the system (e.g., breaking a sensor) or fault injection at test points inside the system (e.g., intentionally killing a task using special test support infrastructure).
Discussion:
Fault injection is the process of intentionally inducing a hardware or software fault and determining its effect upon the system.

Fault management mechanisms, and especially fail-safe mechanisms, are often the key points upon which an argument as to the safety of a system rests. As an example, a safety case based on a watchdog timer detecting task failures requires that the watchdog timer actually work. While it is of course important to make sure that the system has been designed properly, there is no substitute for testing whether the watchdog timer is actually turned on during system test. (To revisit a point on system testing made elsewhere in my postings – system testing is not sufficient to ensure safety, but thorough system testing is certainly an important thing to do.) It is similarly important to specifically test every fault mode that must be handled by the system to ensure fault handling is done correctly.

Some examples of fault tests that should be performed include: killing each task independently to ensure that the death of any task is caught by the watchdog (and, by extension, cannot cause an unsafe system state); overloading the system to ensure that it behaves safely in an unanticipated CPU overload situation; checking that diagnostic fail-safes detect the faults they are supposed to and react by putting the system into a safe state; disabling sensors; disabling actuators; and others.

Another perspective on this topic is that ensuring safety usually involves arguing that all single points of failure have been mitigated to make the system safe. To demonstrate that the reasoning is accurate, a system must have corresponding failures injected to make sure that the mitigation approaches actually work, since the system’s safety case rests upon that assumption. This might include intentionally corrupting bits in memory, corrupting computations that take place, corrupting stack contents, and so on.

It is important to note that ordinary system functional testing tends to do a poor job at exercising fault mitigation mechanisms. As an example, if a particular task is never supposed to die, and testing has been thorough, then that task won’t die during normal functional testing (if it did, the system would be defective!). The point of detecting task death is to handle situations you missed in testing. But that means the mechanism to detect task death and perform a restart hasn’t been tested by normal system-level functional tests. Therefore, testing fail-safe mechanisms requires special techniques that intentionally introduce faults into the system to activate those fail-safes.

Selected Sources:
Safety critical systems are deemed safe only if they can withstand the occurrence of any single point fault. But, there is no way to know if they will really do that unless testing includes actually injecting representative single point faults to see if the system will respond in a safe manner. You can’t know if a system is safe if you don’t actually test its safety capabilities, and doing so requires fault injection. For example, if you expect a watchdog to detect failed tasks, you need to kill each and every task in turn to see if the watchdog really works. Arlat correctly states that “physical fault injection will always be needed to test the actual implementation of a fault tolerant system” (Arlat 1990, pg. 180)

The need to actually test fail-safe mechanisms to see if they really work should be readily apparent to any engineer. Pullum discusses this topic by suggesting the use of fault injection (intentionally causing faults as a testing technique) in the context of “verification of integration of fault and error processing mechanisms” for creating dependable systems (Pullum 2001, pg. 93).

“Fault injection is important to evaluating the dependability of computer systems. … It is particularly hard to recreate a failure scenario for a large complex system.” (Hsueh et al., 1997 pg. 75, speaking about the need for fault injection as part of testing a system). Mariani refers to the IEC 61508 safety standard and concludes that “fault-injection will be mandatory for soft error sensitivity verification” for safety critical systems (Mariani03, pg. 60). “A fault-tolerant computer system’s dependability must be validated to ensure that its redundancy has been correctly implemented and the system will provide the desired level of reliable service. Fault injection – the deliberate insertion of faults into an operational system to determine its response – offers an effective solution to this problem.” (Clark 1995, pg. 47).

Fault injection must include all possible single-point faults, not just faults that can be conveniently injected via the pins or connectors of a component. Rimen et al. compared internal vs. external fault injection, and found that that only 9%-12% of bit flip faults that occur inside a microcontroller could be tested via external pin fault injection (Rimen et al. 1994, p. 76). In 1994, Karlsson reported on the effectiveness of using a radioactive isotope to inject faults into a microcontroller (Karlsson 1994). Later fault injection work by Karlsson’s research group was performed on automotive brake-by-wire applications, sponsored by Volvo (Aidemark 2002), clearly demonstrating the applicability of fault injection as a relevant technique for safety critical automotive systems. And other similar work found defects in a safety critical automotive network protocol. (Ademaj 2003)

A test specifically on an engine control program using fault injection caused “permanently locking the engine’s throttle at full speed.” (Vinter 2001).

There are numerous other scholarly works in this area.  An early example is Bossen (1981). Some others include: Arlat et al. (1989), Barton et al. (1990), Benso et al. (1999), Han (1995), and Kanawati (1995). As a more recent example, Baumeister et al. performed fault injection on an automotive braking controller via irradiating it and measuring the errors, finding that unprotected SRAM and unprotected microcontroller paths were both sensitive to upsets (Baumeister 2012, pg. 5)

MISRA Software Guidelines take it for granted that fault management capabilities will be tested (e.g., MISRA Software Guidelines 3.4.8.3 pg. 44, MISRA Report 4 p. v) “Fault injection test” is recommended by ISO 26262-6 (pg. 23) for software integration, noting that “This includes injection of arbitrary faults in order to test safety mechanisms (e.g., by corrupting software or hardware components).”

By the late 1990s fault injection tools had become quite sophisticated, and were capable of injecting faults while a system was running at full speed even if source code was not available (e.g., Carreira 1998).
An example of a testing approach along these lines is E-GAS (E-GAS), which includes numerous tests based on auto manufacturer experience to ensure that various faults will be handled safely.

It is important to note that while mitigation techniques such as watchdog timers are a good practice if implemented properly, they are not sufficient to guarantee safety in the face of random errors. For example, Gunneflo presents experimental evidence indicating that watchdog effectiveness is less than perfect, and depends heavily on the particular software being run. Gunneflo recommends: “To accurately estimate coverage and latency for watch-dog mechanisms in a specific system, fault injection experiments must be carried out with the final implementation of the system using the real software.” (Gunneflo 1989, pg. 347). In other words, even if you have a watchdog timer, you need to perform fault injection to understand whether there are holes in your fault tolerance approach.

References:
  • Ademaj et al., Evaluation of fault handling of the time-triggered architecture with bus and star topology, DSN 2003.
  • Aidemark, Experimental evaluation of time-redundant execution for a brake-by-wire application, DSN 2002.
  • Arlat et al., Fault Injection for Dependability Validation of Fault-Tolerant Computing Systems, FTCS, 1989.
  • Arlat et al., Fault Injection for Dependability Validation: a methodology and some applications, IEEE Trans. SW Eng., 16(2), pp. 166-182 Feb. 1990.
  • Barton et al., Fault injection experiments using FIAT, IEEE Trans. Computers, pp. 575-592, April 1990.
  • Baumeister et al., Evaluation of chip-level irradiation effects in a 32-bit safety microcontroller for automotive braking applications, IEEE Workshop on Silicon Errors in Logic – System Effects, 2012.
  • Benso et al., Fault injection for embedded microprocessor-bases systems, J. Universal Computer Science, 5(10), pp. 693-711, 1999.
  • Bossen & Hsiao, ED/FI: a technique for improving computer system RAS, FTCS, 1981.
  • Carreira et al., Xception: a technique for the experimental evaluation of dependability in modern computers, IEEE Trans. Software Engineering, Feb. 1998, pp. 125-136.
  • Clark, J. et al. Fault Injection: a method for validating computer-system dependability, IEEE Computer, June 1995, pp., 47-56
  • EGAS, Standardized E-Gas monitoring concept for engine management systems of gasoline and diesel engines version 4.0, Work Group EGAS, Jan. 30, 2007.
  • Gunneflo et al., Evaluation of error detection schemes using fault injection by heavy-ion radiation, Fault Tolerant Computing Symposium, 1989, pp. 340-347.
  • Han et al., DOCTOR: An integrated software fault injection environment for distributed real-time systems, International Computer and Dependability Symposium, 1995, pp. 204-213.
  • Hsueh, M. et al., Fault injection techniques and tools, IEEE Computer, April 1997, pp. 75-82.
  • ISO 26262, Road vehicles – Functional Safety, International Standard, First Edition, Nov 15, 2011, ISO, part 6.
  • Kanawati et al., FERRARI: a flexible software-based fault and error injection system, IEEE Trans. Computers, 44(2), Feb. 1995, pp. 248-260.
  • Mariani, Soft errors on digital computers, Fault injection techniques and tools for embedded systems reliability evaluation, 2003, pp. 49-60.
  • MISRA, Development Guidelines for Vehicle Based Software, November 1994 (PDF version 1.1, January 2001).
  • MISRA, Report 4: Software in Control Systems, February 1995.
  • Pullum, L., Software Fault Tolerance Techniques and Implementation, Artech House, 2001.
  • Rimen et al., On microprocessor error behavior modeling, Fault Tolerant Computing Symposium, 1994, pp. 76-85.
  • Vinter, J., Aidemark, J., Folkesson, P. & Karlsson, J., Reducing critical failures for control algorithms using executable assertions and best effort recovery, International Conference on Dependable Systems and Networks, 2001, pp. 347-356.

Monday, September 1, 2014

Peer Reviews and Critical Software

Every line of critical embedded software should be peer reviewed via a process that includes a physical face-to-face meeting and that produces an auditable peer review report.

Consequences:
Failing to perform peer reviews can reasonably be expected to increase the defect rate in software for several reasons. All real-world projects have limited time and resources, so by skipping or skimping on peer reviews developers have missed an easy chance to eliminate defects. With inadequate reviews, developers are spread thin chasing down bugs found during testing. Additionally, peer reviews can find defects that are impractical to find in most types of testing, especially in cases of fault management or handling unexpected/infrequent operating scenarios.

Accepted Practices:
  • Every line of code must be reviewed by at least one independent, technically skilled person. That review must include actually reading the entirety of the code rather than just looking at selected portions.
  • Peer reviews must be documented so that it is possible to audit the fact that they took place and the effectiveness of the reviews. At a minimum this includes recording the name of the reviewer(s), the code reviewed, the date of the review, and the number of defects found. If no auditable documentation of software quality is available for incorporated components (e.g., safety certification or peer review reports), then new peer reviews must be performed on that third-party code.
  • Acceptable safety critical system software processes normally require a formal meeting-based review rather than a remote review, e-mail review, or other casual checking mechanism.

Discussion:
Peer reviews involve having an independent person – other than the author –  look at source code and other design documents. The main purposes of the review are to ensure that code conforms to style guidelines and to find defects missed by the author of the code. Running a static analysis tool is not a substitute for a peer review, and neither is an in-person discussion that solely discusses the output of a static analysis tool. A proper peer review requires having an independent person (or, strongly preferable, a small group of independent reviewers) read the code in its entirety to ensure quality. The everyday analogy to a peer review is having someone else proof-read something you’ve written. It is nearly impossible to see all our own mistakes whether we are writing software or writing English prose.

It is well known that more formal reviews provide more efficient and effective results, with the gold standard being what is known as a Fagan Style Inspection (a “code inspection”) that involves a pre-review, a formal meeting with defined roles, a written review report, and follow up actions. Regardless of the type of review, accepted practice is to record the results of reviews and audit them to make sure every single line of code has been reviewed when written, and re-reviewed when a module has been modified.



General code inspection process.

MISRA requires a “structure program review” for SIL 2 and above. (MISRA Report 2 p. ix). MISRA specifically lists “Fagan Inspection” as a type of review (MISRA Software Guidelines p. 12), and devotes two appendices of a report on verification and validation to “walkthroughs,” listing structured walkthroughs, code inspections, Fagan inspections, and peer reviews (MISRA Report 6 pp. 132-136). MISRA points out that walkthroughs (their general term for peer reviews) “are acknowledged to be an effective process for identifying errors in programs – indeed they can be more effective than computer-based testing for certain types of error.”

MISRA also points out that fixing a bug may make things worse instead of better, and says that code reviews and analysis should be used to validate bug fixes. (MISRA Report 5 p. 135)
494. Peer reviews are somewhat labor intensive, and might account for 10% of the effort on a project. However, it is common for good peer reviews to find 50% or more of the defects in a code base, and thus finding defects via peer review is much cheaper than finding them via testing. Ineffective reviews can be diagnosed by the fact that they find far fewer defects. Acceptable peer reviews normally find defects that would be missed by testing, especially in parts of the code that are difficult to test thoroughly (for example, exception and failure management code).

Selected Sources:
McConnell devotes Chapter 24 to a discussion of reviews and inspections (McConnell 1993). Boehm & Basili summarized best practices for reducing software defects, and included the following point relevant to peer reviews: “Peer reviews catch 60 percent of the defects.”  (Boehm 2001, pg. 137).

Ganssle lists four steps that should be the first steps taken to improve software quality. They are: “1. Buy and use a version control system; 2. Institute a Firmware Standards Manual; 3. Start a program of Code Inspection; 4. Create a quiet environment conducive to thinking.” #3 is his term for peer reviews, indicating his recommendation for formal code inspections. He also says that he knows companies that have made all these changes to their software process in a single day. (Ganssle 2000, p. 13). (Ganssle’s #2 item is coding style, discussed in Section 8.6. ).

MISRA Software Guidelines list the following as techniques on a one-picture overview of the software lifecycle: “Walkthrough, Fagan Inspection, Code Inspection, Peer Review, Argument, etc.” (MISRA Guidelines 1994, pg. 20) indicating the importance of formal peer reviews in a safety critical software lifecycle. Integrity Level 2 (which is only somewhat safety critical) and higher integrity levels require a “structured program review” (pg. 29). That document also gives these rules: “3.5.2.2 Before dynamic testing begins the code should be reviewed in accordance with the software verification plan to ensure that it does conform to the design specification” (pg. 56) and “3.5.2.3 Code reviews and/or walkthroughs should be used to identify any inconsistencies with the specifications” (pg. 56) and “4.3.4.3 The communication of information regarding errors to design and development personnel should be as clear as possible. For example, errors found during reviews should be fully recorded at the point of detection.”

MISRA C rule 116 states: “All libraries used in production code shall be written to comply with the provisions of this document, and shall have been subject to appropriate validation” (MISRA C pg. 55). Within the context of embedded systems, an operating system such as OSEK would be expected to count as a “library” in that it is code included in the system that is relied upon for safety, and thus should have been subject to appropriate validation, which would be expected to include peer reviews. If there is no evidence of peer review or safety certification, the system designer should perform peer reviews on the OS code (which is an excellent reason to use a safety certified OS!)

Fagan-style inspections are a formal version of a “peer review,” which involves multiple software developers looking at software and other design artifacts to find defects. Fagan-style inspections originated at IBM (Fagan 1976). A later paper presented updated techniques, concluding that “inspections increase productivity and improve final program quality. Furthermore, improvements in process control and project management are enabled by inspections.” (Fagan 1986). It is widely recognized that Fagan-style inspections are a best practice, and that some sort of effective peer review technique is an accepted practice.

Fagan-style Formal Inspections are recommended by the FAA (FAA 2000, p. J-23). IEC 61503-3 highly recommends performing some sort of design review on all software at all SILs, and recommends Fagan inspections at SIL4. (p. 91).

References:
  • Boehm, B. & Basili, V., Software Defect Reduction Top 10 List, IEEE Computer, pp. 135-137, Jan. 2001.
  • FAA, System Safety Handbook, Appendix J: Software Safety, Federal Aviation Administration, Dec. 2000
  • Fagan, M., "Advances in software inspections," IEEE Trans. Software Engineering, SE-12, July 1986, pp. 744-751.
  • Fagan, M., "Design and code inspections to reduce errors in program development," IBM Systems Journal, 15(3), 1976, pp. 182-211.
  • Ganssle, J., The Art of Designing Embedded Systems, Newnes, 2000.
  • McConnell, Code Complete, Microsoft Press, 1993.
  • MISRA, (MISRA C), Guideline for the use of the C Language in Vehicle Based Software, April 1998.
  • MISRA, Development Guidelines for Vehicle Based Software, November 1994 (PDF version 1.1, January 2001).
  • MISRA, Report 2: Integrity, February 1995.
  • MISRA, Report 5: Software Metrics, February 1995.
  • MISRA, Report 6: Verification and Validation, February 1995.

Monday, August 25, 2014

Use of Static Analysis Tools

Critical embedded software should use static checking tools with a defined and appropriate set of rules, and should have zero warnings from those tools.

Consequences:
While rigorous peer reviews can catch many defects, some misuses of language are easy for humans to miss but straightforward for a static checking tool to find. Failing to use a static checking tool exposes software to a needless risk of defects. Ignoring or accepting the presence of large numbers of warnings similarly exposes software to needless risk of defects.

Accepted Practices:
  • Using a static checking tool that has been configured to automatically check as many coding guideline violations as practicable. For automotive applications, following all or almost all (with defined and justified exceptions) of the MISRA C coding standard rules is an accepted practice.
  • Ensuring that code checks “clean,” meaning that there are no static checking violations.
  • In rare instances in which a coding rule violation has been formally approved, use pragmas to formally document the deviation and direct the static checking tool not to issue a warning.
Discussion:
Static checking tools look for suspicious coding structures and data use within a piece of software. Traditionally, they look for things that are “warnings” instead of errors. The distinction is that an error prevents the compiler from being able to generate code that will run. In contrast, a warning is an instance in which code can be compiled, but in which there is a substantial likelihood that the code the compiler generates will not actually do what the designer wants it to do. Reasons for a warning might include ambiguities in the language standard (the code does something, but it’s unclear whether what it does is what the language standard meant), gaps in the language standard (the code does something arbitrary because the language standard does not standardize behavior for this case), and dangerous coding practices (the code does something that is probably a bad idea to attempt). In other words, warnings point out potential code defects. Static analysis capabilities vary depending upon the tool, but in general are all designed to help find instances of poor use of a programming language and violations of coding rules.

An analogous example to a static checking tool is the Microsoft Word grammar assistant. It tells you when it thinks a phrase is incorrect or awkward, even if all the words are spelled correctly. This is a loose analogy because creativity in expression is important for some writing. But safety critical computer code (and English-language writing describing the details of how such systems work) is better off being methodical, regular, and precise, rather than creatively expressed but ambiguous.

Static checking tools are an important way of checking for coding style violations. They are particularly effective at finding language use that is ambiguous or dangerous. While not every instance of a static checking tool warning means that there is an actual software defect, each warning given means that there is the potential for a defect. Accepted practice for high quality software (especially safety critical software) is to eliminate all warnings so that the code checks “clean.” The reasons for this include the following. A warning may seem to be OK when examined, but might become a bug in the context of other changes made to the code later. A multitude of warnings that have been investigated and deemed acceptable may obscure the appearance of a new warning that indicates an actual bug. The reviewer may not understand some subtle language-dependent aspect of a warning, and thus think things are OK when they are actually not.

Selected Sources:
MISRA Guidelines require the use of “automatic static analysis” for SIL 3 automotive systems and above, which tend to be systems that can kill or severely injure at least one person if they fail (MISRA Guidelines, pg. 29). The guidelines also give this guidance: “3.5.2.6 Static analysis is effective in demonstrating that a program is well structured with respect to its control, data, and information flow. It can also assist in assessing its functional consistency with its specification.”

McConnell says: “Heed your compiler's warnings. Many modern compilers tell you when you have different numeric types in the same expression. Pay attention! Every programmer has been asked at one time or another to help someone track down a pesky error, only to find that the compiler had warned about the error all along. Top programmers fix their code to eliminate all compiler warnings. It's easier to let the compiler do the work than to do it yourself.” (McConnell, pg. 237, emphasis added).

References:
  • McConnell, Code Complete, Microsoft Press, 1993.
  • MISRA, (MISRA C), Guideline for the use of the C Language in Vehicle Based Software, April 1998.
  • MISRA, Development Guidelines for Vehicle Based Software, November 1994 (PDF version 1.1, January 2001).
  • (See also posting on Coding Style Guidelines and MISRA C)



Monday, August 11, 2014

Coding Style Guidelines and MISRA C

Critical embedded software should follow a well-defined set of coding guidelines, enforced with comprehensive static checking tools, with essentially no deviations. MISRA C is an example of an accepted set of such coding guidelines.

Consequences:
Coding style guidelines exist to make it more difficult to make mistakes, and also to make it easier to detect when mistakes have been made. Failing to establish or follow formal, written coding guidelines makes it more difficult to understand code, leading to less effective code reviews and a reasonable expectation of increased levels of software defects. Failing to follow the language usage rules defined by a coding style guideline leads to using the language in a way that can normally be expected to result in poorly defined or incorrect software behaviors, increases the risk of software defects.

Accepted Practices:

  • All projects should follow a written coding style guideline document.
  • Coding guidelines should address formatting, commenting, name use, and other similar topics.
  • Coding guidelines should address good language usage practices to create understandable code and reduce the chance of introducing software defects.
  • Coding guidelines should specifically address which language features and usage patterns should be avoided as being error-prone, dangerous, or undefined by the language standard.
  • Coding guidelines should be followed with essentially no exception. Exceptions should require formal review with written approval and annotations in the source code. If guidelines are inappropriate, the guidelines should be changed.

Discussion:
Style in software can be considered analogous to style in writing. Compilers enforce some basic programming language construction rules that allow the source code to be compiled into executable software. Style, on the other hand, has more to do with how ideas are expressed within the constraints of the programming language used. Some style considerations have to do with variable naming conventions, indentation, and physical organization aspects of lines of code. Other style considerations have to do with the use of the programming language itself. Some constructs in a programming language are ambiguous or easily misunderstood. And some constructions in software, while correct in terms of language definition, are very likely to indicate a software defect.

A classic example of a subtle defect in the C programming language is:
“if (x = y) { …}”   The programmer almost always means to compare “x” and “y” for equality, but the C programming language is defined such that this code instead copies the value of “y” into “x” and then tests to see if the result of that copy operation was non-zero. The correct code would be “if (x == y) { … }” which adds a second “=” to make the operation a comparison instead of an assignment operation. Using “=” instead of “==” in conditionals is a common mistake when creating C programs, easy to confuse visually, and is therefore prohibited by typical style guidelines even though the single “=” version of the code is a valid language construct. A loose analogy might be a prohibition against using multiple negatives in an English language sentence because it is too difficult to not not not not not not (sic) make a mistake with such a sentence even though the meaning is unambiguous if the sentence is carefully (and correctly) analyzed.

It is accepted practice to have a defined set of coding guidelines that cover all relevant aspects of programming language use. Such guidelines are typically tailored for each project, but once defined should be followed rigorously. Guidelines typically cover code formatting, commenting, use of names, language use conventions, and other relevant aspects. While guidelines can be tailored per project, there are nonetheless a number of generally accepted practices for reducing the chance of software defects (such as forbidding a single “=” in a conditional evaluation as just discussed).

Of particular concern in safety critical software are language use rules to avoid ambiguity and hazardous language constructs. It is a generally accepted practice to outright ban hazardous and error-prone language structures to avoid the chance of defects, even if doing so makes software a bit less convenient to write and those structures would otherwise be a legal use of the language. In other words, an essential aspect of coding style for safety critical systems is outlawing code structures that are technically valid but are too dangerous or error-prone to use. The result is a written document that defines coding style in general, and language usage rules in particular. These rules must be applied rigorously and with essentially no exceptions. (The “no exceptions” part is feasible because it is acceptable to tailor the rules to the particular project. So it is not a matter of applying arbitrary rules and making exceptions, but rather picking rules that make sense for the situation and then rigorously sticking to them.) In short, every safety critical software project must have a coding style guide and must follow it rigorously to achieve acceptable levels of safety.

It is often the case that it makes sense to adopt an existing set of language use rules rather than make up your own. The MISRA C set of coding rules was specifically created for safety critical automotive software, and is the most well known C programming language subset for safety critical software. A typical practice when writing safety critical C code is to start with MISRA C, create a defined set of which rules will be followed (usually this is almost all of them), and then follow those rules rigorously. Exceptions to adopted rules should be very few, granted only after a formal written review process, and documented in the code as to the type of exception and reason for granting it. Preferably, automated tools (widely available for MISRA C as discussed in Jones 2002, pg. 56) are used to enforce the rules in addition to a required peer review of code.

It is accepted practice to adopt new coding style rules when better practices come into use, and apply those coding style rules to existing code when that code is being updated and incorporated into a new product.

Selected Sources:
The MISRA C guidelines (MISRA C 1998) are specifically designed for safety critical systems at SIL 2 and above. (MISRA Report 2 p. ix) They consist of a list of rules about coding practices to use and practices to avoid. They concentrate primarily on language use rather than code formatting. While MISRA C was originally developed for automotive applications, it was being set forth as a more general standard for adoption in other domains by 2002 (e.g., Jones 2002).  Over time, MISRA C has transition beyond just automotive applications to mainstream use for high integrity software in other areas. A predecessor of MISRA C is the list of rules in the book Safer C (Hatton, 1995).

More general coding style guidelines abound. It is easy to find a coding style guideline that can be adapted for the specifics of a project. Examples include chapter 18 of McConnell (1993).

NASA says that it is important that all levels of the project agree to the coding standards, and that they are enforced.   (NASA 2004 pg. 146)

Enforcing coding style involves the use of static checking in addition to formal peer reviews. Beyond the general consensus in the software community that following a defined coding style is a good idea, Nagappan and Ball found that “there exists a strong positive correlation between the static analysis defect density and the pre-release defect density determined by testing. Further, the predicted pre-release defect density and the actual pre-release defect density are strongly correlated at a high degree of statistical significance.” (Nagappan 2005, abstract)  In other words, modules that fail to follow a coding style as determined by static analysis have more bugs.

McConnell says: “Heed your compiler's warnings. Many modern compilers tell you when you have different numeric types in the same expression. Pay attention! Every programmer has been asked at one time or another to help someone track down a pesky error, only to find that the compiler had warned about the error all along. Top programmers fix their code to eliminate all compiler warnings. It's easier to let the compiler do the work than to do it yourself.” (McConnell 1993, pg. 237).

MISRA Software Guidelines require the use of “automatic static analysis” for SIL 3 automotive systems and above, which tend to be systems that can kill or severely injure at least one person if they fail (MISRA Guidelines 1994, pg. 29). The guidelines also give this guidance: “3.5.2.6 Static analysis is effective in demonstrating that a program is well structured with respect to its control, data, and information flow. It can also assist in assessing its functional consistency with its specification.” IEC 61508 highly recommends (which more or less means “requires” as an accepted practice) static analysis at SIL 2 and above (IEC 61508-3 pg. 83).

Finally, an automotive manufacturer has published data showing that they expect one “major bug” for every 30 coding rule violations (Kawana 2004):


(Figure from Kawana 2004)

Note: As with my other posts in the last few months this was written regarding practices about a decade ago. There are newer sources for coding style information available now such as an updated version of MISRA C. There is also the the ISO-26262 standard, which is intended to replace the MISRA software guidelines..  But we'll save discussing those for another time.

References:

  • Hatton, Les, Safer C, 1995.
  • Jones, N., MISRA C guidelines, Embedded Systems Programming, Beginner’s Corner, July 2002, pp. 55-56.
  • Kawana et al., Empirical Approach for Reliability Assurance of Vehicle Software, Automotive Software Workshop, San Diego, 2004.
  • MISRA, (MISRA C), Guideline for the use of the C Language in Vehicle Based Software, April 1998.
  • MISRA, Development Guidelines for Vehicle Based Software, November 1994 (PDF version 1.1, January 2001).
  • MISRA, Report 2: Integrity, February 1995.
  • Nagappan & Ball, Static analysis tools as early indicators of pre-release defect density, International Conference on Software Engineering, 2005, pp. 580-586.