Monday, September 1, 2014

Peer Reviews and Critical Software

Every line of critical embedded software should be peer reviewed via a process that includes a physical face-to-face meeting and that produces an auditable peer review report.

Consequences:
Failing to perform peer reviews can reasonably be expected to increase the defect rate in software for several reasons. All real-world projects have limited time and resources, so skipping or skimping on peer reviews forgoes an easy, inexpensive chance to eliminate defects early. With inadequate reviews, those defects instead surface during testing, leaving developers spread thin chasing them down. Additionally, peer reviews can find defects that are impractical to find in most types of testing, especially in fault management code or in handling of unexpected/infrequent operating scenarios.

Accepted Practices:
  • Every line of code must be reviewed by at least one independent, technically skilled person. That review must include actually reading the entirety of the code rather than just looking at selected portions.
  • Peer reviews must be documented so that it is possible to audit the fact that they took place and the effectiveness of the reviews. At a minimum this includes recording the name of the reviewer(s), the code reviewed, the date of the review, and the number of defects found. If no auditable documentation of software quality is available for incorporated components (e.g., safety certification or peer review reports), then new peer reviews must be performed on that third-party code.
  • Acceptable safety critical system software processes normally require a formal meeting-based review rather than a remote review, e-mail review, or other casual checking mechanism.

Discussion:
Peer reviews involve having an independent person – other than the author – look at source code and other design documents. The main purposes of the review are to ensure that code conforms to style guidelines and to find defects missed by the author of the code. Running a static analysis tool is not a substitute for a peer review, and neither is an in-person discussion that solely discusses the output of a static analysis tool. A proper peer review requires having an independent person (or, preferably, a small group of independent reviewers) read the code in its entirety to ensure quality. The everyday analogy to a peer review is having someone else proofread something you’ve written. It is nearly impossible to see all our own mistakes, whether we are writing software or writing English prose.

It is well known that more formal reviews provide more efficient and effective results, with the gold standard being what is known as a Fagan Style Inspection (a “code inspection”) that involves a pre-review, a formal meeting with defined roles, a written review report, and follow up actions. Regardless of the type of review, accepted practice is to record the results of reviews and audit them to make sure every single line of code has been reviewed when written, and re-reviewed when a module has been modified.



(Figure: General code inspection process)

MISRA requires a “structured program review” for SIL 2 and above. (MISRA Report 2 p. ix). MISRA specifically lists “Fagan Inspection” as a type of review (MISRA Software Guidelines p. 12), and devotes two appendices of a report on verification and validation to “walkthroughs,” listing structured walkthroughs, code inspections, Fagan inspections, and peer reviews (MISRA Report 6 pp. 132-136). MISRA points out that walkthroughs (their general term for peer reviews) “are acknowledged to be an effective process for identifying errors in programs – indeed they can be more effective than computer-based testing for certain types of error.”

MISRA also points out that fixing a bug may make things worse instead of better, and says that code reviews and analysis should be used to validate bug fixes. (MISRA Report 5 p. 135)
Peer reviews are somewhat labor intensive, and might account for 10% of the effort on a project. However, it is common for good peer reviews to find 50% or more of the defects in a code base, and thus finding defects via peer review is much cheaper than finding them via testing. Ineffective reviews can be diagnosed by the fact that they find far fewer defects. Acceptable peer reviews normally find defects that would be missed by testing, especially in parts of the code that are difficult to test thoroughly (for example, exception and failure management code).

Selected Sources:
McConnell devotes Chapter 24 to a discussion of reviews and inspections (McConnell 1993). Boehm & Basili summarized best practices for reducing software defects, and included the following point relevant to peer reviews: “Peer reviews catch 60 percent of the defects.”  (Boehm 2001, pg. 137).

Ganssle lists four steps that should be the first steps taken to improve software quality. They are: “1. Buy and use a version control system; 2. Institute a Firmware Standards Manual; 3. Start a program of Code Inspection; 4. Create a quiet environment conducive to thinking.” #3 is his term for peer reviews, indicating his recommendation for formal code inspections. He also says that he knows companies that have made all these changes to their software process in a single day. (Ganssle 2000, p. 13). (Ganssle’s #2 item is coding style, discussed in Section 8.6.)

MISRA Software Guidelines list the following as techniques on a one-picture overview of the software lifecycle: “Walkthrough, Fagan Inspection, Code Inspection, Peer Review, Argument, etc.” (MISRA Guidelines 1994, pg. 20) indicating the importance of formal peer reviews in a safety critical software lifecycle. Integrity Level 2 (which is only somewhat safety critical) and higher integrity levels require a “structured program review” (pg. 29). That document also gives these rules: “3.5.2.2 Before dynamic testing begins the code should be reviewed in accordance with the software verification plan to ensure that it does conform to the design specification” (pg. 56) and “3.5.2.3 Code reviews and/or walkthroughs should be used to identify any inconsistencies with the specifications” (pg. 56) and “4.3.4.3 The communication of information regarding errors to design and development personnel should be as clear as possible. For example, errors found during reviews should be fully recorded at the point of detection.”

MISRA C rule 116 states: “All libraries used in production code shall be written to comply with the provisions of this document, and shall have been subject to appropriate validation” (MISRA C pg. 55). Within the context of embedded systems, an operating system such as OSEK would be expected to count as a “library” in that it is code included in the system that is relied upon for safety, and thus should have been subject to appropriate validation, which would be expected to include peer reviews. If there is no evidence of peer review or safety certification, the system designer should perform peer reviews on the OS code (which is an excellent reason to use a safety certified OS!).

Fagan-style inspections are a formal version of a “peer review,” which involves multiple software developers looking at software and other design artifacts to find defects. Fagan-style inspections originated at IBM (Fagan 1976). A later paper presented updated techniques, concluding that “inspections increase productivity and improve final program quality. Furthermore, improvements in process control and project management are enabled by inspections.” (Fagan 1986). It is widely recognized that Fagan-style inspections are a best practice, and that some sort of effective peer review technique is an accepted practice.

Fagan-style Formal Inspections are recommended by the FAA (FAA 2000, p. J-23). IEC 61508-3 highly recommends performing some sort of design review on all software at all SILs, and recommends Fagan inspections at SIL 4 (p. 91).

References:
  • Boehm, B. & Basili, V., Software Defect Reduction Top 10 List, IEEE Computer, pp. 135-137, Jan. 2001.
  • FAA, System Safety Handbook, Appendix J: Software Safety, Federal Aviation Administration, Dec. 2000
  • Fagan, M., "Advances in software inspections," IEEE Trans. Software Engineering, SE-12, July 1986, pp. 744-751.
  • Fagan, M., "Design and code inspections to reduce errors in program development," IBM Systems Journal, 15(3), 1976, pp. 182-211.
  • Ganssle, J., The Art of Designing Embedded Systems, Newnes, 2000.
  • McConnell, Code Complete, Microsoft Press, 1993.
  • MISRA, (MISRA C), Guideline for the use of the C Language in Vehicle Based Software, April 1998.
  • MISRA, Development Guidelines for Vehicle Based Software, November 1994 (PDF version 1.1, January 2001).
  • MISRA, Report 2: Integrity, February 1995.
  • MISRA, Report 5: Software Metrics, February 1995.
  • MISRA, Report 6: Verification and Validation, February 1995.

Monday, August 25, 2014

Use of Static Analysis Tools

Critical embedded software should use static checking tools with a defined and appropriate set of rules, and should have zero warnings from those tools.

Consequences:
While rigorous peer reviews can catch many defects, some misuses of language are easy for humans to miss but straightforward for a static checking tool to find. Failing to use a static checking tool exposes software to a needless risk of defects. Ignoring or accepting the presence of large numbers of warnings similarly exposes software to needless risk of defects.

Accepted Practices:
  • Using a static checking tool that has been configured to automatically check as many coding guideline violations as practicable. For automotive applications, following all or almost all (with defined and justified exceptions) of the MISRA C coding standard rules is an accepted practice.
  • Ensuring that code checks “clean,” meaning that there are no static checking violations.
  • In rare instances in which a coding rule violation has been formally approved, use pragmas to formally document the deviation and direct the static checking tool not to issue a warning.
Discussion:
Static checking tools look for suspicious coding structures and data use within a piece of software. Traditionally, they look for things that are “warnings” instead of errors. The distinction is that an error prevents the compiler from being able to generate code that will run. In contrast, a warning is an instance in which code can be compiled, but in which there is a substantial likelihood that the code the compiler generates will not actually do what the designer wants it to do. Reasons for a warning might include ambiguities in the language standard (the code does something, but it’s unclear whether what it does is what the language standard meant), gaps in the language standard (the code does something arbitrary because the language standard does not standardize behavior for this case), and dangerous coding practices (the code does something that is probably a bad idea to attempt). In other words, warnings point out potential code defects. Static analysis capabilities vary depending upon the tool, but in general are all designed to help find instances of poor use of a programming language and violations of coding rules.
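As a concrete illustration, consider the made-up C fragment below (the function and variable names are invented for this example). It compiles without errors, yet it contains exactly the kinds of constructs that a static checker, or a compiler with its warnings enabled, is designed to flag:

#include <stdint.h>

/* Illustrative fragment: compiles without errors, but a typical static
 * checker or warning-enabled compiler will flag both conditionals. */
uint8_t check_sensor(int32_t reading, uint32_t limit)
{
    uint8_t fault = 0u;

    /* Mixed signed/unsigned comparison: a negative reading is converted
     * to a large unsigned value, so the test silently misbehaves. */
    if (reading > limit) {
        fault = 1u;
    }

    /* Assignment instead of comparison: legal C, always true here, and
     * almost certainly not what the designer meant. */
    if (fault = 1u) {
        fault = 2u;
    }

    return fault;
}

Both constructs are legal C, which is exactly why a checking tool, rather than the compiler's error checking alone, is needed to catch them.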

An analogous example to a static checking tool is the Microsoft Word grammar assistant. It tells you when it thinks a phrase is incorrect or awkward, even if all the words are spelled correctly. This is a loose analogy because creativity in expression is important for some writing. But safety critical computer code (and English-language writing describing the details of how such systems work) is better off being methodical, regular, and precise, rather than creatively expressed but ambiguous.

Static checking tools are an important way of checking for coding style violations. They are particularly effective at finding language use that is ambiguous or dangerous. While not every instance of a static checking tool warning means that there is an actual software defect, each warning given means that there is the potential for a defect. Accepted practice for high quality software (especially safety critical software) is to eliminate all warnings so that the code checks “clean.” The reasons for this include the following. A warning may seem to be OK when examined, but might become a bug in the context of other changes made to the code later. A multitude of warnings that have been investigated and deemed acceptable may obscure the appearance of a new warning that indicates an actual bug. The reviewer may not understand some subtle language-dependent aspect of a warning, and thus think things are OK when they are actually not.

Selected Sources:
MISRA Guidelines require the use of “automatic static analysis” for SIL 3 automotive systems and above, which tend to be systems that can kill or severely injure at least one person if they fail (MISRA Guidelines, pg. 29). The guidelines also give this guidance: “3.5.2.6 Static analysis is effective in demonstrating that a program is well structured with respect to its control, data, and information flow. It can also assist in assessing its functional consistency with its specification.”

McConnell says: “Heed your compiler's warnings. Many modern compilers tell you when you have different numeric types in the same expression. Pay attention! Every programmer has been asked at one time or another to help someone track down a pesky error, only to find that the compiler had warned about the error all along. Top programmers fix their code to eliminate all compiler warnings. It's easier to let the compiler do the work than to do it yourself.” (McConnell, pg. 237, emphasis added).

References:
  • McConnell, Code Complete, Microsoft Press, 1993.
  • MISRA, (MISRA C), Guideline for the use of the C Language in Vehicle Based Software, April 1998.
  • MISRA, Development Guidelines for Vehicle Based Software, November 1994 (PDF version 1.1, January 2001).
  • (See also posting on Coding Style Guidelines and MISRA C)



Monday, August 11, 2014

Coding Style Guidelines and MISRA C

Critical embedded software should follow a well-defined set of coding guidelines, enforced with comprehensive static checking tools, with essentially no deviations. MISRA C is an example of an accepted set of such coding guidelines.

Consequences:
Coding style guidelines exist to make it more difficult to make mistakes, and also to make it easier to detect when mistakes have been made. Failing to establish or follow formal, written coding guidelines makes it more difficult to understand code, leading to less effective code reviews and a reasonable expectation of increased levels of software defects. Failing to follow the language usage rules defined by a coding style guideline leads to using the language in a way that can normally be expected to result in poorly defined or incorrect software behavior, increasing the risk of software defects.

Accepted Practices:

  • All projects should follow a written coding style guideline document.
  • Coding guidelines should address formatting, commenting, name use, and other similar topics.
  • Coding guidelines should address good language usage practices to create understandable code and reduce the chance of introducing software defects.
  • Coding guidelines should specifically address which language features and usage patterns should be avoided as being error-prone, dangerous, or undefined by the language standard.
  • Coding guidelines should be followed with essentially no exception. Exceptions should require formal review with written approval and annotations in the source code. If guidelines are inappropriate, the guidelines should be changed.

Discussion:
Style in software can be considered analogous to style in writing. Compilers enforce some basic programming language construction rules that allow the source code to be compiled into executable software. Style, on the other hand, has more to do with how ideas are expressed within the constraints of the programming language used. Some style considerations have to do with variable naming conventions, indentation, and physical organization aspects of lines of code. Other style considerations have to do with the use of the programming language itself. Some constructs in a programming language are ambiguous or easily misunderstood. And some constructions in software, while correct in terms of language definition, are very likely to indicate a software defect.

A classic example of a subtle defect in the C programming language is:
“if (x = y) { …}”   The programmer almost always means to compare “x” and “y” for equality, but the C programming language is defined such that this code instead copies the value of “y” into “x” and then tests to see if the result of that copy operation was non-zero. The correct code would be “if (x == y) { … }” which adds a second “=” to make the operation a comparison instead of an assignment operation. Using “=” instead of “==” in conditionals is a common mistake when creating C programs, easy to confuse visually, and is therefore prohibited by typical style guidelines even though the single “=” version of the code is a valid language construct. A loose analogy might be a prohibition against using multiple negatives in an English language sentence because it is too difficult to not not not not not not (sic) make a mistake with such a sentence even though the meaning is unambiguous if the sentence is carefully (and correctly) analyzed.
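Written out as code rather than in-line prose (a minimal sketch; do_something() is a hypothetical placeholder for whatever the conditional is guarding), the defect and its fix look like this:

extern void do_something(void);   /* hypothetical action taken on a match */

void compare_example(int x, int y)
{
    /* Defective: copies y into x, then tests whether the copied value is
     * non-zero. Legal C, but almost never what the programmer intended. */
    if (x = y) {
        do_something();
    }

    /* Intended: compares x and y for equality. */
    if (x == y) {
        do_something();
    }
}

Some style guides go a step further for comparisons against constants and require the constant on the left (e.g., “if (5 == x)”), so that an accidental single “=” becomes a compile error instead of a silent defect.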

It is accepted practice to have a defined set of coding guidelines that cover all relevant aspects of programming language use. Such guidelines are typically tailored for each project, but once defined should be followed rigorously. Guidelines typically cover code formatting, commenting, use of names, language use conventions, and other relevant aspects. While guidelines can be tailored per project, there are nonetheless a number of generally accepted practices for reducing the chance of software defects (such as forbidding a single “=” in a conditional evaluation as just discussed).

Of particular concern in safety critical software are language use rules to avoid ambiguity and hazardous language constructs. It is a generally accepted practice to outright ban hazardous and error-prone language structures to avoid the chance of defects, even if doing so makes software a bit less convenient to write and those structures would otherwise be a legal use of the language. In other words, an essential aspect of coding style for safety critical systems is outlawing code structures that are technically valid but are too dangerous or error-prone to use. The result is a written document that defines coding style in general, and language usage rules in particular. These rules must be applied rigorously and with essentially no exceptions. (The “no exceptions” part is feasible because it is acceptable to tailor the rules to the particular project. So it is not a matter of applying arbitrary rules and making exceptions, but rather picking rules that make sense for the situation and then rigorously sticking to them.) In short, every safety critical software project must have a coding style guide and must follow it rigorously to achieve acceptable levels of safety.
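As another illustration of a technically valid but commonly banned construct (again a made-up fragment; exact rule wording varies between guidelines), many coding standards require every non-empty switch case to end with a break, because silent fall-through is legal C yet is a frequent source of defects:

#include <stdint.h>

void set_output(uint8_t mode, uint8_t *out)
{
    switch (mode) {
        case 0u:
            *out = 10u;
            /* Missing break: execution silently falls through into case 1,
             * so *out ends up as 20 for mode 0. Legal C, but typically banned. */
        case 1u:
            *out = 20u;
            break;
        default:
            *out = 0u;
            break;
    }
}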

It is often the case that it makes sense to adopt an existing set of language use rules rather than make up your own. The MISRA C set of coding rules was specifically created for safety critical automotive software, and is the most well known C programming language subset for safety critical software. A typical practice when writing safety critical C code is to start with MISRA C, create a defined set of which rules will be followed (usually this is almost all of them), and then follow those rules rigorously. Exceptions to adopted rules should be very few, granted only after a formal written review process, and documented in the code as to the type of exception and reason for granting it. Preferably, automated tools (widely available for MISRA C as discussed in Jones 2002, pg. 56) are used to enforce the rules in addition to a required peer review of code.

It is accepted practice to adopt new coding style rules when better practices come into use, and apply those coding style rules to existing code when that code is being updated and incorporated into a new product.

Selected Sources:
The MISRA C guidelines (MISRA C 1998) are specifically designed for safety critical systems at SIL 2 and above. (MISRA Report 2 p. ix) They consist of a list of rules about coding practices to use and practices to avoid. They concentrate primarily on language use rather than code formatting. While MISRA C was originally developed for automotive applications, it was being set forth as a more general standard for adoption in other domains by 2002 (e.g., Jones 2002). Over time, MISRA C has transitioned beyond just automotive applications to mainstream use for high integrity software in other areas. A predecessor of MISRA C is the list of rules in the book Safer C (Hatton, 1995).

More general coding style guidelines abound. It is easy to find a coding style guideline that can be adapted for the specifics of a project. Examples include chapter 18 of McConnell (1993).

NASA says that it is important that all levels of the project agree to the coding standards, and that they are enforced (NASA 2004, pg. 146).

Enforcing coding style involves the use of static checking in addition to formal peer reviews. Beyond the general consensus in the software community that following a defined coding style is a good idea, Nagappan and Ball found that “there exists a strong positive correlation between the static analysis defect density and the pre-release defect density determined by testing. Further, the predicted pre-release defect density and the actual pre-release defect density are strongly correlated at a high degree of statistical significance.” (Nagappan 2005, abstract)  In other words, modules that fail to follow a coding style as determined by static analysis have more bugs.

McConnell says: “Heed your compiler's warnings. Many modern compilers tell you when you have different numeric types in the same expression. Pay attention! Every programmer has been asked at one time or another to help someone track down a pesky error, only to find that the compiler had warned about the error all along. Top programmers fix their code to eliminate all compiler warnings. It's easier to let the compiler do the work than to do it yourself.” (McConnell 1993, pg. 237).

MISRA Software Guidelines require the use of “automatic static analysis” for SIL 3 automotive systems and above, which tend to be systems that can kill or severely injure at least one person if they fail (MISRA Guidelines 1994, pg. 29). The guidelines also give this guidance: “3.5.2.6 Static analysis is effective in demonstrating that a program is well structured with respect to its control, data, and information flow. It can also assist in assessing its functional consistency with its specification.” IEC 61508 highly recommends (which more or less means “requires” as an accepted practice) static analysis at SIL 2 and above (IEC 61508-3 pg. 83).

Finally, an automotive manufacturer has published data showing that they expect one “major bug” for every 30 coding rule violations (Kawana 2004):


(Figure from Kawana 2004)

Note: As with my other posts in the last few months, this was written regarding practices about a decade ago. There are newer sources for coding style information available now, such as an updated version of MISRA C. There is also the ISO 26262 standard, which is intended to replace the MISRA software guidelines. But we'll save discussing those for another time.

References:

  • Hatton, Les, Safer C, 1995.
  • Jones, N., MISRA C guidelines, Embedded Systems Programming, Beginner’s Corner, July 2002, pp. 55-56.
  • Kawana et al., Empirical Approach for Reliability Assurance of Vehicle Software, Automotive Software Workshop, San Diego, 2004.
  • MISRA, (MISRA C), Guideline for the use of the C Language in Vehicle Based Software, April 1998.
  • MISRA, Development Guidelines for Vehicle Based Software, November 1994 (PDF version 1.1, January 2001).
  • MISRA, Report 2: Integrity, February 1995.
  • Nagappan & Ball, Static analysis tools as early indicators of pre-release defect density, International Conference on Software Engineering, 2005, pp. 580-586.


Wednesday, July 23, 2014

Don’t Overflow the Stack

Somewhere in your embedded system is the stack (or several of them for some multitasking systems). If you blow up the stack, your software will crash. Or worse, especially if you don’t have memory protection. For a critical system you need to make sure the stack has some elbow room and make sure that you know when you have a stack size problem.

Also, you shouldn’t ever use recursion in a safety critical embedded system. (This shouldn’t even need to be said, but apparently it does.)

Consequences: The consequences of not understanding maximum stack depth can be seemingly random memory corruption as the stack overwrites RAM variables. Whether or not this actually causes a program crash depends upon a number of factors, including chance, and in the worst case it can cause unsafe program behavior without a crash.

Accepted Practices:
  • Compute worst case stack depth using a static analysis tool.
  • Include a stack sentinel or a related technique in the supervisor task and perform a graceful shutdown or reset prior to an actual overflow.
  • Avoid all recursion to ensure that worst case stack depth is bounded.
Discussion:
The “stack” is an area in memory used for storing certain types of data. For example, in the C programming language this is where non-static local variables go. The stack gets bigger every time a subroutine is called, usually holding the subroutine return address, parameter values that have been passed, and local variables for each currently active subroutine. As nested subroutines are called, the stack keeps getting bigger as more information is added to it to perform each deeper and deeper call. When each subroutine completes, the stack space it used is released for later use, un-nesting the layers of stack usage and shrinking the stack size. Thus the maximum size of the stack is determined by how many subroutines are called on top of each other (the depth of the subroutine call graph), as well as the storage space needed by each of those subroutines, plus additional area needed by any interrupt service routines that may be active at the same time.
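As a minimal sketch (with made-up function names), the call chain below shows how nesting drives stack depth: while leaf() is executing, the frames for all three functions are live on the stack at once, and an interrupt service routine firing at that moment would add its own usage on top of them.

#include <stdint.h>

static int32_t leaf(int32_t x)
{
    int32_t local = x * 2;     /* leaf frame: return address, x, and local */
    return local + 1;
}

static int32_t middle(int32_t x)
{
    int32_t buf[8];            /* middle frame adds eight more words */
    buf[0] = leaf(x);          /* while leaf() runs, both frames are live */
    return buf[0];
}

int32_t top(int32_t x)
{
    /* Worst case depth here is the sum of the frames for top(), middle(),
     * and leaf(), plus whatever any interrupt service routine needs if it
     * fires at the deepest point of the chain. */
    return middle(x);
}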

Some processors have a separate hardware stack for subroutine return information and a different stack for parameters and variables. And many operating systems have multiple stacks in support of multiple tasks. But for the most part similar ideas apply to all embedded controllers, and we’ll just discuss the single-stack case to keep things simple.


It is common for stacks to grow “downward” from high memory locations to low locations, with global variables, Real Time Operating System (RTOS) task information, or other values such as the C heap being allocated from the low memory locations to high memory locations. There is an unused memory space in between the stack and the globals. In other words, the stack shares limited RAM space with other variables. Because RAM space is limited, there is the possibility of the stack growing so large that it overlaps variable storage memory addresses so that the stack corrupts other memory (and/or loads and stores to global memory corrupt the stack). To avoid this, it is accepted practice to determine the worst case maximum stack depth and ensure such overlap is impossible.

Maximum stack depth can be determined by a number of means. One way is to periodically sample the stack pointer while the program is running and find the maximum observed stack size. This approach is a starting point, but one should not expect that sampling will happen to catch the absolute maximum stack depth. Catching the worst case can be quite difficult because interrupt service routines also use the stack and run at times that are generally unpredictable compared to program execution flow. (Moreover, if you use a timer interrupt to sample the stack pointer value, you’ll never see the stack pointer when the timer interrupt is masked, leaving a blind spot in this technique.) So this is not how you should determine worst case stack depth.

A much better approach is to use static analysis tools specifically designed to find the worst case set of subroutine calls in terms of stack depth. This technique is effective unless the code has structures that confound the analysis (e.g., if a program uses recursion, this technique generally doesn’t work). When the technique does work, it has the virtue of giving an absolute bound without having to rely upon whether testing happened to have encountered the worst case. You should use this technique if at all possible. If static analysis isn’t possible because of how you have written your code, you should change your code so you actually can do static analysis for maximum stack depth. In particular, this means you should never use recursion in a critical system, both because it makes static analysis of stack depth impossible and because recursive routines are prone to stack overflows even if they are bug free but happen to be fed an exceptional input value that causes too many levels of recursion. (In general, using recursion in any small-microcontroller embedded system is a bad idea for these reasons even if the system is not safety critical.)



Beyond static analysis of stack depth (or if static analysis isn’t possible and the system isn’t safety critical), an additional accepted practice is to use a “stack sentinel” to find the “high water mark” of the stack (I’ve also heard this called a “stack watermark” and a “stack guard”). With this technique each memory location in the stack area of memory is initialized to a predetermined known value or pattern of known values. As the stack grows in size, it overwrites these known sentinel values with other values, leaving behind new patterns in memory like footprints in fresh snow. Thus, it is easy to tell how big the stack has ever been since the system was started by looking for how far into the unused memory space the “footprints” of stack usage trample past the current stack size. This gives both the maximum stack size during a particular program run and the overall system maximum stack size, assuming that the worst case path has been executed. (Actually identifying the worst case path need not be done – it is sufficient to have run it at some time during program execution without knowing exactly when it happened. It will leave its mark on the stack memory contents when it does happen.)
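A minimal sketch of the sentinel technique follows. How a program learns the bounds of its stack region is toolchain- and linker-specific, so stack_region_start and stack_region_end below are placeholders for however your port exposes them, and the code assumes a stack that grows downward from the high end of the region as described above.

#include <stdint.h>
#include <stddef.h>

#define STACK_SENTINEL  (0xA5A5A5A5UL)   /* arbitrary, easily recognized pattern */

/* Placeholders: typically linker-provided symbols marking the stack region. */
extern uint32_t stack_region_start[];    /* lowest address of the stack region */
extern uint32_t stack_region_end[];      /* just past the highest address */

/* Fill the unused portion of the stack region with the sentinel pattern.
 * Only addresses below the currently active frames are written so that the
 * stack this code is running on is not clobbered; in practice this is often
 * done from startup code instead. */
void stack_sentinel_init(void)
{
    uint32_t marker;                     /* local variable near the current stack top */
    uint32_t *p = stack_region_start;

    while (p < (&marker - 16)) {         /* leave a small gap below active frames */
        *p = STACK_SENTINEL;
        p++;
    }
}

/* High water mark: how many bytes of the region have ever been used since
 * stack_sentinel_init() ran, judged by which sentinels have been overwritten. */
size_t stack_high_water_bytes(void)
{
    const uint32_t *p = stack_region_start;

    while ((p < stack_region_end) && (*p == STACK_SENTINEL)) {
        p++;                             /* skip the untouched "fresh snow" */
    }
    return (size_t)((const uint8_t *)stack_region_end - (const uint8_t *)p);
}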


To convert this idea into a run-time protection technique, extra memory space between the stack and other data is set up as a sacrificial memory area that is there to detect unexpected corruptions before the stack can corrupt the globals or other RAM data. The program is run, and then memory contents are examined periodically to see how many stack area memory words are left unchanged. As part of the design validation process, designers compare the observed worst case stack depth against the static analysis computed worst case stack depth to ensure that they understand their system, have tested the worst case paths, and have left adequate margin in stack memory to prevent stack overflows if they missed some special situation that is even worse than the predicted worst case.

This sentinel technique should also be used in testing and in production systems to periodically check the sentinels in the sacrificial memory area. If you have computed a worst case stack depth at design time and you detect that the computed depth has been exceeded at run time, this is an indication that you have a very serious problem.
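Continuing the sketch above, the run-time check examines a sacrificial guard zone at the end of the region the stack grows toward. Here enter_safe_state() is a placeholder for whatever graceful shutdown or reset your system defines, and the computed worst case depth constant is a placeholder value that would come from the static analysis described earlier.

#define GUARD_WORDS               (32u)     /* size of the sacrificial guard zone */
#define COMPUTED_MAX_STACK_BYTES  (1024u)   /* placeholder: from static analysis */

extern void enter_safe_state(void);         /* placeholder: graceful shutdown or reset */

/* Call periodically, e.g. from the supervisor task or a timer tick. The guard
 * zone sits at the low end of the region because the stack grows downward; if
 * any of its sentinels has been overwritten, the stack has already grown past
 * its allowed limit and the system should stop before globals are corrupted. */
void stack_guard_check(void)
{
    uint32_t i;

    for (i = 0u; i < GUARD_WORDS; i++) {
        if (stack_region_start[i] != STACK_SENTINEL) {
            enter_safe_state();             /* never keep running on a blown stack */
        }
    }

    /* Separately, flag any excursion beyond the depth computed at design time,
     * even if the guard zone itself has not been reached yet. */
    if (stack_high_water_bytes() > COMPUTED_MAX_STACK_BYTES) {
        enter_safe_state();
    }
}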

If the stack overflows past the sacrificial memory space into global variables, the system might crash. Or that might just corrupt global variables or RTOS system state and have who-knows-what effect. (In a safety critical system “who-knows-what” equals “unsafe.”)  Naively assuming that you will always get a system crash detectable by the watchdog on a stack overflow is a dangerous practice. Sometimes the watchdog will catch a stack overflow. But there is no guarantee this will always happen. Consider that a stack-smashing attack is the security version of an intentional stack overflow, but is specifically designed to take over a system, not just merely crash it. So a crash after a stack overflow is by no means a sure thing.

Avoiding stack overflow problems is a matter of considering program execution paths to avoid a deep sequence of calls, and accounting for interrupts adding even more to the stack. And, again, even though we shouldn’t have to say this: never use recursion.

While there isn’t a set number for how much margin to leave in terms of extra memory past the computed maximum stack size, there are two considerations. First, you want to leave enough room for an ample sacrificial area so that a problem with stack depth is unlikely to have enough time to go all the way through the sacrificial area and touch the globals before it is detected by a periodic timer tick checking sentinels. (Note: we didn’t say check current stack depth; we said periodically check sentinels to see if the stack has gotten too big between checks.) Also, you want to leave some extra margin to account for the possibility that you just never encountered the worst-case stack depth in analysis or testing. I’ve heard designers say that a worst case stack depth of 50% of available stack memory is a good idea. Using more than 90% of stack memory (leaving only a 10% sacrificial memory area) is probably a bad idea. But the actual number will depend on the details of your system.

Selected Sources:
Stack sentinels and avoidance of recursion are an entrenched part of embedded systems folklore and accepted practices. Douglass mentions watermarking in Section 9.8.6 (Douglass 2002).
MISRA Software Guidelines discourage recursion, saying that it can cause unpredictable behavior (MISRA Guidelines, pg. 20).

MISRA C required rule 70 explicitly bans recursion, covering functions that call themselves either directly or indirectly.

NASA recommends using stack guards (essentially the same as the technique that I call “stack sentinels”) to check for stack overflow or corruption (NASA 2004, p. 93).

Stack overflow errors are well known to corrupt memory and cause arbitrarily bad behavior of a running program. Regehr et al. provide an overview of research in that area relevant to embedded microcontrollers and ways to mitigate those problems (Regehr 2006). They note that “the potential for stack overflow in embedded systems is hard to detect by testing.” (id., p. 776), with the point being that it is reasonable to expect that a system which has been tested but not thoroughly analyzed will have run-time stack overflows that corrupt memory.

References:
  • Douglass, B. P., Real-Time Design Patterns: robust scalable architecture for real-time systems, Pearson Education, first printing, September 2002, copyright by Pearson in 2003.
  • MISRA, (MISRA C), Guideline for the use of the C Language in Vehicle Based Software, April 1998.
  • MISRA, Development Guidelines for Vehicle Based Software, November 1994 (PDF version 1.1, January 2001).
  • NASA-GB-8719.13, NASA Software Safety Guidebook, NASA Technical Standard, March 31, 2004.
  • Regehr et al., Eliminating stack overflow by abstract interpretation, ACM Trans. Embedded Computing Systems, Nov. 2006, pp. 751-778.

Monday, June 30, 2014

Avoid Concurrency Defects

Accesses to variables shared among multiple threads of execution must be protected via disabling interrupts, using a mutex, or some other rigorously applied concurrency management approach.

Consequences:

Incorrect or lax use of concurrency techniques can be expected to lead to concurrency bugs. Such bugs are usually difficult to detect via testing and difficult to reproduce once detected. A fraction of any such bugs can be reasonably expected to make it past even extensive testing into production fleets.

Accepted Practices:
  • Aggressively minimize the use of globally shared variables. Every variable shared between tasks is a chance for a concurrency defect.
  • For every access to a shared variable, treat the entire time that a copy of the variable is “live” in the computation as a critical section, protecting access via masking interrupts or some other well defined technique.
  • Avoid concurrency management techniques that are “home brewed” or otherwise not part of well proven practice. Similarly, do not modify well known techniques to optimize efficiency or obtain other perceived benefits. Such techniques are extremely difficult to get right, and altering techniques or using non-standard techniques can be reasonably expected to introduce defects.
  • Declare every shared variable “volatile” to ensure that reads and writes do not result in stale data being used due to compiler optimization attempts to improve computation speed.
  • Keep critical sections as short as possible to minimize negative effects on scheduling. (The longest critical section sets a lower bound on worst case blocking time for scheduling purposes.)
Discussion:

One aspect of a modern real time embedded system is that it must appear to do many things at once. For example, an engine controller must look at many inputs, set throttle angle and perform diagnostic checks all at the same time. Typically many of these tasks are written as relatively independent pieces of software that must work together, and they must all appear to run at the same time.

In reality there is only one CPU, so tasks must take turns using that CPU, with that turn-taking supervised by the RTOS. And, tasks often must share things such as memory locations and input/output devices. The turn-taking means that, unless software designers are extremely careful, on occasion shared resources can be left in undefined or incorrect states, resulting in concurrency bugs.

Example concurrency defect.
Source: http://blogs.windriver.com/engblom/2010/06/true-concurrency-is-truly-different-again.html (That blog post has a nice discussion of how situation-dependent such defects are)

Avoiding and fixing concurrency bugs is a major source of design and testing effort on most embedded systems. In part this is because concurrency bugs can be quite subtle, and in part it is because they can be very difficult to activate during testing as well as difficult to isolate even when one is observed (if they are observed at all during testing). The difficulty of detecting and fixing concurrency defects, as well as the reasonable probability that they won’t be seen at all in testing, makes disciplined use of good practices essential.

A variety of techniques are available to avoid concurrency problems. A preferred approach is avoiding situations in which concurrency bugs are possible. For example, avoiding the use of shared global variables avoids associated concurrency defects (because the shared variables simply aren’t there to begin with). But, when sharing can’t be avoided, there are well defined basic techniques that work. Typically such techniques work by “locking” some resource so that other tasks cannot use it. As an analogy, consider a changing room at a clothing store. If you want to make sure that nobody else tries to use the room you are using, you need to “lock” the room when you enter (with an actual door lock, or maybe just by closing the door or curtain all the way), and then “unlock” the room when you leave. If you never lock the door it might be that nothing bad happens for dozens or hundreds of times you try on clothes. But eventually someone will wander into the room by mistake while you are there if you don’t lock the door.

Disabling interrupts is a concurrency design approach that can be thought of as a program “locking” the CPU so that no other task can use it when a shared variable is being accessed. The way this works is that any task wanting to, say, increment a variable first disables interrupts, then increments the variable, then re-enables interrupts. The period of time between the first read of a variable and when the variable is done being updated is known as a “critical section,” and is the time during which no other task can be permitted to access the variable. Disabling interrupts turns off the hardware’s ability to switch tasks or perform anything but the desired computation during the critical section. This ensures that no other task in the system can read or write the shared variable, because disabling interrupts prevents any other task from running. It is essential that every single access to a shared variable disable interrupts for the entire use of the variable for this to be guaranteed to work. If a local copy of the variable is kept and used outside the time during which interrupts are disabled, there are no guarantees as to how the system will behave when that local copy is subsequently used to update the variable. Other techniques are available to manage concurrency beyond disabling interrupts. But, this is a common technique in embedded systems.

Even using these techniques, special care must be used in accessing any shared resource. For example, the keyword volatile must be used for every shared resource to ensure that the most up to date copy is always accessed. (But, even this won’t help if that copy is updated at an unexpected time.)
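Putting the two ideas together (interrupt masking for the critical section and volatile for the shared variable), here is a minimal sketch. The names are made up, and the interrupt enable/disable macros are placeholders because the real mechanism is compiler- and processor-specific (an intrinsic, inline assembly, or an RTOS call).

#include <stdint.h>

/* Placeholder macros: replace with the platform's actual mechanism. */
#define DISABLE_INTERRUPTS()    /* platform-specific */
#define ENABLE_INTERRUPTS()     /* platform-specific */

/* Shared between task context and an interrupt service routine. The volatile
 * qualifier tells the compiler it must not cache a stale copy in a register. */
static volatile uint32_t rx_byte_count = 0u;

/* Task context: the entire read-modify-write of the shared variable is the
 * critical section, so interrupts stay masked for its full duration. */
void note_bytes_consumed(uint32_t consumed)
{
    DISABLE_INTERRUPTS();
    if (rx_byte_count >= consumed) {
        rx_byte_count -= consumed;
    } else {
        rx_byte_count = 0u;
    }
    ENABLE_INTERRUPTS();
}

/* Interrupt context: a task cannot preempt the ISR, so no further locking is
 * needed here, but the update still goes directly to the shared variable. */
void on_rx_byte_interrupt(void)
{
    rx_byte_count++;
}

A production version would typically save the interrupt-enable state before masking and restore that saved state afterward, so that a critical section nested inside another one does not re-enable interrupts early.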

Selected Sources:

Ball describes concurrency defects in terms of being race conditions and prescribes disabling interrupts to solve the problem (Ball 2002, p. 162).

Douglass gives a pattern for a critical section in section 7.2, and in section 7.2.6 says that the most common way to prevent context switching during a critical section is to disable interrupts. In section 7.2.5, Douglass says: “The designers and programmers must show good discipline in ensuring that every resource access locks the resource before performing any manipulation of the source.” (Douglass 2002)

MISRA recommends that developers “Use Test-and-Set instructions or a signaling mechanism, such as Dekker/Dijkstra/Lamport Semaphores, to protect and mark as ‘in-use’ any common resources.” (MISRA Report 3 p. 26) In more modern terminology, this is a recommendation to use a mutex or related semaphore-based “lock” on data. MISRA also cautions that interrupt enable and disable instructions must be used with care. (id.)

Sullivan presents results of a study of defect types, concluding that 11% of high impact memory corruption errors are due to concurrency defects. (Sullivan 1991, p. 6). This means that while most defects are easier to track down, a few race conditions and other concurrency defects can be expected to happen.

Concurrency defects are so difficult to find that specific testing and analysis tools have been developed to find them, prompting the creation of a benchmark suite to evaluate such tools (Jalbert 2011). A significant challenge to creating such a benchmark is the difficulty in reproducing such bugs even when the exact bug is completely understood. (id., first page)

Park et al. provide a summary of work on finding and fixing concurrency defects (Park 2010). Among other things they note that a concurrency defect was responsible in part for the 2003 Northeastern US electricity blackout that left 10 million people without power (id. p. 245), and that such bugs are difficult to reproduce (id.). 

References:
  • Ball, Embedded Microprocessor Systems: Real World Design, Newnes, 2002.
  • Douglass, B. P., Real-Time Design Patterns: robust scalable architecture for real-time systems, Pearson Education, first printing, September 2002, copyright by Pearson in 2003.
  • Jalbert, RADBench: a concurrency bug benchmark suite, HotPar 2011, pp. 2-8.
  • MISRA, Report 3: Noise, EMC and Real-Time, February 1995.
  • Park, Falcon: fault localization in concurrent programs, ICSE 2010, pp. 245-254.
  • Sullivan & Chillarege, Software defects and their impact on system availability: a study of field failures in operating systems, Fault Tolerant Computing Symposium, 1991, pp 1-9.