Monday, August 11, 2014

Coding Style Guidelines and MISRA C

Critical embedded software should follow a well-defined set of coding guidelines, enforced with comprehensive static checking tools, with essentially no deviations. MISRA C is an example of an accepted set of such coding guidelines.

Consequences:
Coding style guidelines exist to make it more difficult to make mistakes, and also to make it easier to detect when mistakes have been made. Failing to establish or follow formal, written coding guidelines makes it more difficult to understand code, leading to less effective code reviews and a reasonable expectation of increased levels of software defects. Failing to follow the language usage rules defined by a coding style guideline leads to using the language in ways that can normally be expected to result in poorly defined or incorrect software behaviors, increasing the risk of software defects.

Accepted Practices:

  • All projects should follow a written coding style guideline document.
  • Coding guidelines should address formatting, commenting, name use, and other similar topics.
  • Coding guidelines should address good language usage practices to create understandable code and reduce the chance of introducing software defects.
  • Coding guidelines should specifically address which language features and usage patterns should be avoided as being error-prone, dangerous, or undefined by the language standard.
  • Coding guidelines should be followed with essentially no exception. Exceptions should require formal review with written approval and annotations in the source code. If guidelines are inappropriate, the guidelines should be changed.

Discussion:
Style in software can be considered analogous to style in writing. Compilers enforce some basic programming language construction rules that allow the source code to be compiled into executable software. Style, on the other hand, has more to do with how ideas are expressed within the constraints of the programming language used. Some style considerations have to do with variable naming conventions, indentation, and physical organization aspects of lines of code. Other style considerations have to do with the use of the programming language itself. Some constructs in a programming language are ambiguous or easily misunderstood. And some constructions in software, while correct in terms of language definition, are very likely to indicate a software defect.

A classic example of a subtle defect in the C programming language is:
“if (x = y) { …}”   The programmer almost always means to compare “x” and “y” for equality, but the C programming language is defined such that this code instead copies the value of “y” into “x” and then tests to see if the result of that copy operation was non-zero. The correct code would be “if (x == y) { … }” which adds a second “=” to make the operation a comparison instead of an assignment operation. Using “=” instead of “==” in conditionals is a common mistake when creating C programs, easy to confuse visually, and is therefore prohibited by typical style guidelines even though the single “=” version of the code is a valid language construct. A loose analogy might be a prohibition against using multiple negatives in an English language sentence because it is too difficult to not not not not not not (sic) make a mistake with such a sentence even though the meaning is unambiguous if the sentence is carefully (and correctly) analyzed.
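As a concrete illustration, here is a minimal sketch of the defect and its fix (the variable names are made up for illustration). Most style guidelines also require compiling with warnings enabled, since common compilers flag this pattern (for example, gcc with -Wall suggests parentheses around an assignment used as a truth value):

    #include <stdio.h>

    int main(void)
    {
        int x = 0;
        int y = 5;

        /* Defective: assigns y to x, then tests the result (5, non-zero),
           so this branch is taken whenever y is non-zero. */
        if (x = y) {
            printf("defect: branch taken, x is now %d\n", x);
        }

        x = 0;

        /* Correct: compares x and y for equality without modifying either. */
        if (x == y) {
            printf("this branch is not taken\n");
        }

        return 0;
    }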

It is accepted practice to have a defined set of coding guidelines that cover all relevant aspects of programming language use. Such guidelines are typically tailored for each project, but once defined should be followed rigorously. Guidelines typically cover code formatting, commenting, use of names, language use conventions, and other relevant aspects. While guidelines can be tailored per project, there are nonetheless a number of generally accepted practices for reducing the chance of software defects (such as forbidding a single “=” in a conditional evaluation as just discussed).

Of particular concern in safety critical software are language use rules to avoid ambiguity and hazardous language constructs. It is a generally accepted practice to outright ban hazardous and error-prone language structures to avoid the chance of defects, even if doing so makes software a bit less convenient to write and those structures would otherwise be a legal use of the language. In other words, an essential aspect of coding style for safety critical systems is outlawing code structures that are technically valid but are too dangerous or error-prone to use. The result is a written document that defines coding style in general, and language usage rules in particular. These rules must be applied rigorously and with essentially no exceptions. (The “no exceptions” part is feasible because it is acceptable to tailor the rules to the particular project. So it is not a matter of applying arbitrary rules and making exceptions, but rather picking rules that make sense for the situation and then rigorously sticking to them.) In short, every safety critical software project must have a coding style guide and must follow it rigorously to achieve acceptable levels of safety.

It is often the case that it makes sense to adopt an existing set of language use rules rather than make up your own. The MISRA C set of coding rules was specifically created for safety critical automotive software, and is the most well known C programming language subset for safety critical software. A typical practice when writing safety critical C code is to start with MISRA C, create a defined set of which rules will be followed (usually this is almost all of them), and then follow those rules rigorously. Exceptions to adopted rules should be very few, granted only after a formal written review process, and documented in the code as to the type of exception and reason for granting it. Preferably, automated tools (widely available for MISRA C as discussed in Jones 2002, pg. 56) are used to enforce the rules in addition to a required peer review of code.

It is accepted practice to adopt new coding style rules when better practices come into use, and apply those coding style rules to existing code when that code is being updated and incorporated into a new product.

Selected Sources:
The MISRA C guidelines (MISRA C 1998) are specifically designed for safety critical systems at SIL 2 and above. (MISRA Report 2 p. ix) They consist of a list of rules about coding practices to use and practices to avoid. They concentrate primarily on language use rather than code formatting. While MISRA C was originally developed for automotive applications, it was being set forth as a more general standard for adoption in other domains by 2002 (e.g., Jones 2002). Over time, MISRA C has transitioned beyond just automotive applications to mainstream use for high integrity software in other areas. A predecessor of MISRA C is the list of rules in the book Safer C (Hatton, 1995).

More general coding style guidelines abound. It is easy to find a coding style guideline that can be adapted for the specifics of a project. Examples include chapter 18 of McConnell (1993).

NASA says that it is important that all levels of the project agree to the coding standards, and that they are enforced.   (NASA 2004 pg. 146)

Enforcing coding style involves the use of static checking in addition to formal peer reviews. Beyond the general consensus in the software community that following a defined coding style is a good idea, Nagappan and Ball found that “there exists a strong positive correlation between the static analysis defect density and the pre-release defect density determined by testing. Further, the predicted pre-release defect density and the actual pre-release defect density are strongly correlated at a high degree of statistical significance.” (Nagappan 2005, abstract)  In other words, modules that fail to follow a coding style as determined by static analysis have more bugs.

McConnell says: “Heed your compiler's warnings. Many modern compilers tell you when you have different numeric types in the same expression. Pay attention! Every programmer has been asked at one time or another to help someone track down a pesky error, only to find that the compiler had warned about the error all along. Top programmers fix their code to eliminate all compiler warnings. It's easier to let the compiler do the work than to do it yourself.” (McConnell 1993, pg. 237).

MISRA Software Guidelines require the use of “automatic static analysis” for SIL 3 automotive systems and above, which tend to be systems that can kill or severely injure at least one person if they fail (MISRA Guidelines 1994, pg. 29). The guidelines also give this guidance: “3.5.2.6 Static analysis is effective in demonstrating that a program is well structured with respect to its control, data, and information flow. It can also assist in assessing its functional consistency with its specification.” IEC 61508 highly recommends (which more or less means “requires” as an accepted practice) static analysis at SIL 2 and above (IEC 61508-3 pg. 83).

Finally, an automotive manufacturer has published data showing that they expect one “major bug” for every 30 coding rule violations (Kawana 2004):


(Figure from Kawana 2004)

Note: As with my other posts in the last few months, this was written regarding practices from about a decade ago. There are newer sources for coding style information available now, such as an updated version of MISRA C. There is also the ISO 26262 standard, which is intended to replace the MISRA software guidelines. But we'll save discussing those for another time.

References:

  • Hatton, Les, Safer C, 1995.
  • Jones, N., MISRA C guidelines, Embedded Systems Programming, Beginner’s Corner, July 2002, pp. 55-56.
  • Kawana et al., Empirical Approach for Reliability Assurance of Vehicle Software, Automotive Software Workshop, San Diego, 2004.
  • MISRA, (MISRA C), Guideline for the use of the C Language in Vehicle Based Software, April 1998.
  • MISRA, Development Guidelines for Vehicle Based Software, November 1994 (PDF version 1.1, January 2001).
  • MISRA, Report 2: Integrity, February 1995.
  • Nagappan & Ball, Static analysis tools as early indicators of pre-release defect density, International Conference on Software Engineering, 2005, pp. 580-586.


Wednesday, July 23, 2014

Don’t Overflow the Stack

Somewhere in your embedded system is the stack (or several of them for some multitasking systems). If you blow up the stack, your software will crash. Or worse, especially if you don’t have memory protection. For a critical system you need to make sure the stack has some elbow room and make sure that you know when you have a stack size problem.

Also, you shouldn’t ever use recursion in a safety critical embedded system. (This shouldn’t even need to be said, but apparently it does.)

Consequences: The consequences of not understanding maximum stack depth can be seemingly random memory corruption as the stack overwrites RAM variables. Whether or not this actually causes a program crash depends upon a number of factors, including chance, and in the worst case it can cause unsafe program behavior without a crash.

Accepted Practices:
  • Compute worst case stack depth using a static analysis tool.
  • Include a stack sentinel or a related technique in the supervisor task and perform a graceful shutdown or reset prior to an actual overflow.
  • Avoid all recursion to ensure that worst case stack depth is bounded.
Discussion:
The “stack” is an area in memory used for storing certain types of data. For example, in the C programming language this is where non-static local variables go. The stack gets bigger every time a subroutine is called, usually holding the subroutine return address, parameter values that have been passed, and local variables for each currently active subroutine. As nested subroutines are called, the stack keeps getting bigger as more information is added to the stack to perform each deeper and deeper call. When each subroutine completes, its stack space is released for later reuse, un-nesting the layers of stack usage and shrinking the stack size. Thus the maximum size of the stack is determined by how many subroutines are called on top of each other (the depth of the subroutine call graph), as well as the storage space needed by each of those subroutines, plus additional area needed by any interrupt service routines that may be active at the same time.

Some processors have a separate hardware stack for subroutine return information and a different stack for parameters and variables. And many operating systems have multiple stacks in support of multiple tasks. But for the most part similar ideas apply to all embedded controllers, and we’ll just discuss the single-stack case to keep things simple.


It is common for stacks to grow “downward” from high memory locations to low locations, with global variables, Real Time Operating System (RTOS) task information, or other values such as the C heap being allocated from the low memory locations to high memory locations. There is an unused memory space in between the stack and the globals. In other words, the stack shares limited RAM space with other variables. Because RAM space is limited, there is the possibility of the stack growing so large that it overlaps variable storage memory addresses so that the stack corrupts other memory (and/or loads and stores to global memory corrupt the stack). To avoid this, it is accepted practice to determine the worst case maximum stack depth and ensure such overlap is impossible.

Maximum stack depth can be determined by a number of means. One way is to periodically sample the stack pointer while the program is running and find the maximum observed stack size. This approach is a starting point, but one should not expect that sampling will happen to catch the absolute maximum stack depth. Catching the worst case can be quite difficult because interrupt service routines also use the stack and run at times that are generally unpredictable compared to program execution flow. (Moreover, if you use a timer interrupt to sample the stack pointer value, you’ll never see the stack pointer when the timer interrupt is masked, leaving a blind spot in this technique.) So this is not how you should determine worst case stack depth.

A much better approach is to use static analysis tools specifically designed to find the worst case set of subroutine calls in terms of stack depth. This technique is effective unless the code has structures that confound the analysis (e.g., if a program uses recursion, this technique generally doesn’t work). When the technique does work, it has the virtue of giving an absolute bound without having to rely upon whether testing happened to have encountered the worst case. You should use this technique if at all possible. If static analysis isn’t possible because of how you have written your code, you should change your code so you actually can do static analysis for maximum stack depth. In particular, this means you should never use recursion in a critical system, both because it makes static analysis of stack depth impossible and because recursive routines are prone to stack overflows even if they are bug free, should they happen to be fed an exceptional input value that causes too many levels of recursion. (In general, using recursion in any small-microcontroller embedded system is a bad idea for these reasons even if the system is not safety critical.)



Beyond static analysis of stack depth (or if static analysis isn’t possible and the system isn’t safety critical), an additional accepted practice is to use a “stack sentinel” to find the “high water mark” of the stack (I’ve also heard this called a “stack watermark” and a “stack guard”). With this technique each memory location in the stack area of memory is initialized to a predetermined known value or pattern of known values. As the stack grows in size, it will overwrite these known sentinel values with other values, leaving behind new patterns in memory that remain behind, like footprints in fresh snow. Thus, it is easy to tell how big the stack has ever been since the system was started by looking for how far into the unused memory space the “footprints” of stack usage trample past the current stack size. This both gives the maximum stack size during a particular program run, and the overall system maximum stack size assuming that the worst case path has been executed. (Actually identifying the worst case path need not be done – it is sufficient to have run it at some time during program execution without knowing exactly when it happened. It will leave its mark on the stack memory contents when it does happen.)
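Here is a minimal sketch of the sentinel/high-water-mark idea. The region size, fill pattern, and array-based stack region are illustrative assumptions; in a real system the stack region usually comes from linker symbols, and the fill is typically done by startup code before main() runs so that the code is not executing on the stack it is initializing.

    #include <stdint.h>
    #include <stddef.h>

    #define STACK_WORDS     256u        /* illustrative stack size in 32-bit words */
    #define STACK_SENTINEL  0xDEADBEEFu /* known fill pattern ("fresh snow")       */

    /* Hypothetical stack region; real systems typically use linker symbols. */
    static uint32_t stack_area[STACK_WORDS];

    /* Fill the stack region with the sentinel pattern before it is used. */
    void stack_sentinel_init(void)
    {
        size_t i;
        for (i = 0u; i < STACK_WORDS; i++) {
            stack_area[i] = STACK_SENTINEL;
        }
    }

    /* Returns the high-water mark: how many words have ever been used.
       Assumes the stack grows downward from the top of stack_area, so the
       scan starts at the low end and stops at the first trampled sentinel. */
    size_t stack_high_water_words(void)
    {
        size_t untouched = 0u;
        while ((untouched < STACK_WORDS) &&
               (stack_area[untouched] == STACK_SENTINEL)) {
            untouched++;
        }
        return STACK_WORDS - untouched;
    }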


To convert this idea into a run-time protection technique, extra memory space between the stack and other data is set up as a sacrificial memory area that is there to detect unexpected corruptions before the stack can corrupt the globals or other RAM data. The program is run, and then memory contents are examined periodically to see how many stack area memory words are left unchanged. As part of the design validation process, designers compare the observed worst case stack depth against the static analysis computed worst case stack depth to ensure that they understand their system, have tested the worst case paths, and have left adequate margin in stack memory to prevent stack overflows if they missed some special situation that is even worse than the predicted worst case.

This sentinel technique should also be used in testing and in production systems to periodically check the sentinels in the sacrificial memory area. If you have computed a worst case stack depth at design time and you detect that the computed depth has been exceeded at run time, this is an indication that you have a very serious problem.
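Building on the stack_high_water_words() sketch above, a supervisor task can use the same scan to react before the sacrificial area is exhausted. The warning threshold and the enter_safe_state() shutdown hook are application-specific assumptions, not part of any particular RTOS:

    #include <stddef.h>

    extern size_t stack_high_water_words(void);   /* from the sketch above       */
    extern void enter_safe_state(void);           /* hypothetical shutdown hook  */

    /* Warning threshold in words; e.g., 7/8 of a 256-word stack area.
       The right number is application-specific (assumption).           */
    #define STACK_WARN_WORDS  224u

    /* Call periodically from a supervisor task or timer tick. */
    void stack_sentinel_check(void)
    {
        if (stack_high_water_words() > STACK_WARN_WORDS) {
            /* The stack has trampled into the sacrificial area; shut down
               gracefully before globals or RTOS state can be corrupted. */
            enter_safe_state();
        }
    }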

If the stack overflows past the sacrificial memory space into global variables, the system might crash. Or that might just corrupt global variables or RTOS system state and have who-knows-what effect. (In a safety critical system “who-knows-what” equals “unsafe.”)  Naively assuming that you will always get a system crash detectable by the watchdog on a stack overflow is a dangerous practice. Sometimes the watchdog will catch a stack overflow. But there is no guarantee this will always happen. Consider that a stack-smashing attack is the security version of an intentional stack overflow, but is specifically designed to take over a system, not just merely crash it. So a crash after a stack overflow is by no means a sure thing.

Avoiding stack overflow problems is a matter of considering program execution paths to avoid a deep sequence of calls, and accounting for interrupts adding even more to the stack. And, again, even though we shouldn’t have to say this – never using recursion.

While there isn’t a set number for how much margin to leave in terms of extra memory past the computed maximum stack size, there are two considerations. First, you want to leave enough room to have an ample sacrificial area so that a problem with stack depth is unlikely to have enough time to go all the way through the sacrificial area and touch the globals before it is detected by a periodic timer tick checking sentinels. (Note: we didn’t say check current stack depth; we said periodically check sentinels to see if the stack has gotten too big between checks.)  Also, you want to leave some extra margin to account for the possibility you just never encountered the worst-case stack depth in analysis or testing. I’ve heard designers say worst case stack depth of 50% of available stack memory is a good idea. Above 90% use of stack memory (10% sacrificial memory area set aside) is probably a bad idea. But the actual number will depend on the details of your system.

Selected Sources:
Stack sentinels and avoidance of recursion are an entrenched part of embedded systems folklore and accepted practices. Douglass mentions watermarking in Section 9.8.6 (Douglass 2002).
MISRA Software Guidelines discourage recursion, saying that it can cause unpredictable behavior (MISRA Guidelines, pg. 20).

MISRA C required Rule 70 explicitly bans recursion, whether direct or indirect.

NASA recommends using stack guards (essentially the same as the technique that I call “stack sentinels”) to check for stack overflow or corruption (NASA 2004, p. 93).

Stack overflow errors are well known to corrupt memory and cause arbitrarily bad behavior of a running program. Regehr et al. provide an overview of research in that area relevant to embedded microcontrollers and ways to mitigate those problems (Regehr 2006). They note that “the potential for stack overflow in embedded systems is hard to detect by testing.” (id., p. 776), with the point being that it is reasonable to expect that a system which has been tested but not thoroughly analyzed will have run-time stack overflows that corrupt memory.

References:
  • Douglass, B. P., Real-Time Design Patterns: robust scalable architecture for real-time systems, Pearson Education, first printing, September 2002, copyright by Pearson in 2003.
  • MISRA, (MISRA C), Guideline for the use of the C Language in Vehicle Based Software, April 1998.
  • MISRA, Development Guidelines for Vehicle Based Software, November 1994 (PDF version 1.1, January 2001).
  • NASA-GB-8719.13, NASA Software Safety Guidebook, NASA Technical Standard, March 31, 2004.
  • Regehr et al., Eliminating stack overflow by abstract interpretation, ACM Trans. Embedded Computing Systems, Nov. 2006, pp. 751-778.

Monday, June 30, 2014

Avoid Concurrency Defects

Accesses to variables shared among multiple threads of execution must be protected via disabling interrupts, using a mutex, or some other rigorously applied concurrency management approach.

Consequences:

Incorrect or lax use of concurrency techniques can be expected to lead to concurrency bugs. Such bugs are usually difficult to detect via testing and difficult to reproduce once detected. A fraction of any such bugs can be reasonably expected to make it past even extensive testing into production fleets.

Accepted Practices:
  • Aggressively minimize the use of globally shared variables. Every variable shared between tasks is a chance for a concurrency defect.
  • For every access to a shared variable, treat the entire time that a copy of the variable is “live” in the computation as a critical section, protecting access via masking interrupts or some other well defined technique.
  • Avoid concurrency management techniques that are “home brewed” or otherwise not part of well proven practice. Similarly, do not modify well known techniques to optimize efficiency or obtain other perceived benefits. Such techniques are extremely difficult to get right, and altering techniques or using non-standard techniques can be reasonably expected to introduce defects.
  • Declare every shared variable “volatile” to ensure that reads and writes do not result in stale data being used due to compiler optimization attempts to improve computation speed.
  • Keep critical sections as short as possible to minimize negative effects on scheduling. (The longest critical section sets a lower bound on blocking time for scheduling purposes.)
Discussion:

One aspect of a modern real time embedded system is that it must appear to do many things at once. For example, an engine controller must look at many inputs, set throttle angle and perform diagnostic checks all at the same time. Typically many of these tasks are written as relatively independent pieces of software that must work together, and they must all appear to run at the same time.

In reality there is only one CPU, so tasks must take turns using that CPU, with that turn-taking supervised by the RTOS. And, tasks often must share things such as memory locations and input/output devices. The turn-taking means that, unless software designers are extremely careful, on occasion shared resources can be left in undefined or incorrect states, resulting in concurrency bugs.

Example concurrency defect.
Source: http://blogs.windriver.com/engblom/2010/06/true-concurrency-is-truly-different-again.html (That blog post has a nice discussion of how situation-dependent such defects are)

Avoiding and fixing concurrency bugs is a major source of design and testing effort on most embedded systems. In part this is because concurrency bugs can be quite subtle, and in part it is because they can be very difficult to activate during testing as well as difficult to isolate even when one is observed (if they are observed at all during testing). The difficulty of detecting and fixing concurrency defects, as well as the reasonable probability that they won’t be seen at all in testing, makes disciplined use of good practices essential.

A variety of techniques are available to avoid concurrency problems. A preferred approach is avoiding situations in which concurrency bugs are possible. For example, avoiding the use of shared global variables avoids associated concurrency defects (because the shared variables simply aren’t there to begin with). But, when sharing can’t be avoided, there are well defined basic techniques that work. Typically such techniques work by “locking” some resource so that other tasks cannot use it. As an analogy, consider a changing room at a clothing store. If you want to make sure that nobody else tries to use the room you are using, you need to “lock” the room when you enter (with an actual door lock, or maybe just by closing the door or curtain all the way), and then “unlock” the room when you leave. If you never lock the door it might be that nothing bad happens for dozens or hundreds of times you try on clothes. But eventually someone will wander into the room by mistake while you are there if you don’t lock the door.

Disabling interrupts is a concurrency design approach that can be thought of as a program “locking” the CPU so that no other task can use it when a shared variable is being accessed. The way this works is that any task wanting to, say, increment a variable first disables interrupts, then increments the variable, then re-enables interrupts. The period of time between the first read of a variable and when the variable is done being updated is known as a “critical section,” and is the time during which no other task can be permitted to access the variable. Disabling interrupts turns off the hardware’s ability to switch tasks or perform anything but the desired computation during the critical section. This ensures that no other task in the system can read or write the shared variable, because disabling interrupts prevents any other task from running. It is essential that every single access to a shared variable disable interrupts for the entire use of the variable for this to be guaranteed to work. If a local copy of the variable is kept and used outside the time during which interrupts are disabled, there are no guarantees as to how the system will behave when that local copy is subsequently used to update the variable. Other techniques are available to manage concurrency beyond disabling interrupts. But, this is a common technique in embedded systems.
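A minimal sketch of this pattern is shown below. The interrupt enable/disable calls are hypothetical placeholders, since the actual mechanism (and whether the prior interrupt state must be saved and restored) depends on the particular processor and RTOS:

    #include <stdint.h>

    /* Shared between the main task and an interrupt service routine.
       Declared volatile so the compiler always reloads it from memory. */
    static volatile uint16_t error_count;

    /* Hypothetical processor-specific primitives. */
    extern void disable_interrupts(void);
    extern void enable_interrupts(void);

    void record_error(void)
    {
        disable_interrupts();       /* begin critical section           */
        error_count++;              /* read-modify-write of shared data */
        enable_interrupts();        /* end critical section             */
    }

    uint16_t get_error_count(void)
    {
        uint16_t snapshot;

        disable_interrupts();       /* even a read may need protection if */
        snapshot = error_count;     /* the variable is wider than the CPU */
        enable_interrupts();        /* can read atomically                */

        return snapshot;
    }

Note that the local copy in get_error_count() is only trustworthy at the instant it was taken; as discussed above, using a stale local copy to later update the shared variable outside the critical section defeats the protection.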

Even using these techniques, special care must be used in accessing any shared resource. For example, the keyword volatile must be used for every shared resource to ensure that the most up to date copy is always accessed. (But, even this won’t help if that copy is updated at an unexpected time.)

Selected Sources:

Ball describes concurrency defects as race conditions and prescribes disabling interrupts to solve the problem. (Ball 2002, p. 162).

Douglass gives a pattern for a critical section in section 7.2, and in section 7.2.6 says that the most common way to prevent context switching during a critical section is to disable interrupts. In section 7.2.5, Douglass says: “The designers and programmers must show good discipline in ensuring that every resource access locks the resource before performing any manipulation of the source.” (Douglass 2002)

MISRA recommends that developers “Use Test-and-Set instructions or a signaling mechanism, such as Dekker/Dijkstra/Lamport Semaphores, to protect and mark as ‘in-use’ any common resources.” (MISRA Report 3 p. 26) In more modern terminology, this is a recommendation to use a mutex or related semaphore-based “lock” on data. MISRA also cautions that interrupt enable and disable instructions must be used with care. (id.)

Sullivan presents results of a study of defect types, concluding that 11% of high impact memory corruption errors are due to concurrency defects. (Sullivan 1991, p. 6). This means that while most defects are easier to track down, a few race conditions and other concurrency defects can be expected to happen.

Concurrency defects are so difficult to find that specific testing and analysis tools have been developed to find them, prompting the creation of a benchmark suite to evaluate such tools (Jalbert 2011). A significant challenge to creating such a benchmark is the difficulty in reproducing such bugs even when the exact bug is completely understood. (id., first page)

Park et al. provide a summary of work on finding and fixing concurrency defects (Park 2010). Among other things they note that a concurrency defect was responsible in part for the 2003 Northeastern US electricity blackout that left 10 million people without power (id. p. 245), and that such bugs are difficult to reproduce (id.). 

References:
  • Ball, Embedded Microprocessor Systems: Real World Design, Newnes, 2002.
  • Douglass, B. P., Real-Time Design Patterns: robust scalable architecture for real-time systems, Pearson Education, first printing, September 2002, copyright by Pearson in 2003.
  • Jalbert, RADBench: a concurrency bug benchmark suite, HotPar 2011, pp. 2-8.
  • MISRA, Report 3: Noise, EMC and Real-Time, February 1995.
  • Park, Falcon: fault localization in concurrent programs, ICSE 2010, pp. 245-254.
  • Sullivan & Chillarege, Software defects and their impact on system availability: a study of field failures in operating systems, Fault Tolerant Computing Symposium, 1991, pp 1-9.

Monday, June 16, 2014

Ensure Code Is Modular

Critical embedded software should be modular, and in particular should limit function size to no more than 1-2 pages of code including comments.

Consequences: Poor modularity can reasonably be expected to result in code that is more complex, more brittle, and in general has a higher defect rate than modular code. This is in part because code with poor modularity is harder to understand (and therefore harder to get right), and in part because it is more difficult to test than code with good modularity.

Accepted Practices:

  • A significant majority of functions, procedures, or subroutines must each be no more than a defined length in the range of 100-225 lines long, nominally 120 lines long, including comments. Longer functions should be rare, and their length should be specifically justified, taking into account whether they have low cyclomatic complexity or other mitigating factors. 
  • Each module should have high cohesion, low coupling, and good information hiding. There are no generally accepted precise values for these metrics.

Discussion:

Code is considered modular if it is written in reasonably small chunks of code (modules) that work together in a way that maintains some key properties. Code that is merely in small chunks but that does not achieve these properties is not considered modular. A code “module” is classically defined as a piece of code that accomplishes a particular function. This may be either a single software subroutine (or equivalent) with associated variables, or may be a small set of collected subroutines that work together to accomplish a function or a set of closely related functions. Key properties of modular code include the following areas.

Small size: each function should be one or two pages of code when printed (up to say 120 lines of code including comments maximum, but with most functions being less than one page of 60-75 lines including comments). Having a function size larger than this makes it difficult for developers and reviewers to understand the entirety of the function at one time, decreasing the effectiveness of finding bugs. Large size also makes it harder to create effective unit tests due to the increased complexity of a large function. (This length limit is historically associated with a function, in terms of a single function fitting on one printed page of about 65 lines.)

High cohesion: each module performs a single job and contains a set of closely related functions rather than a collection of unrelated functions. Having low cohesion results in code that mixes up unrelated functions, making it difficult to test, and difficult to maintain as changes need to be made. Low cohesion also normally leads to bugs due to related functions being scattered around many modules instead of in a single module.

Low coupling: modules should have a low dependency upon other module data. High coupling causes data dependencies among modules that makes testing difficult and tends to lead to subtle data sharing bugs. Good practice is to minimize coupling via avoiding global variables. High coupling can be reasonably expected to increase defect rates.

Information hiding: modules should hide as much about the details of their implementation as possible to reduce implicit dependencies that can cause other parts of the software to fail. For example, the exact algorithm or data format used for calculations internal to a module should not be discernible to any other module. Accepted practice is to use the “static” keyword on identifiers within a .c file to accomplish information hiding, and declare variables within a procedure, instead of globally, if they are only needed within that procedure.
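A small sketch of these scoping conventions in C follows; the file name, identifiers, and filter algorithm are made up purely for illustration:

    /* sensor_filter.c -- illustrative module showing information hiding */

    #include <stdint.h>

    /* Module-private state: "static" at file scope keeps these names
       invisible to every other .c file in the program. */
    static uint16_t filter_history[4];
    static uint8_t  filter_index;

    /* Module-private helper, also hidden from other modules. */
    static uint16_t average_of_history(void)
    {
        uint32_t sum = 0u;
        uint8_t i;
        for (i = 0u; i < 4u; i++) {
            sum += filter_history[i];
        }
        return (uint16_t)(sum / 4u);
    }

    /* The only externally visible entry point; its prototype would go in
       a corresponding sensor_filter.h header. */
    uint16_t sensor_filter_update(uint16_t raw_sample)
    {
        filter_history[filter_index] = raw_sample;   /* local ring buffer */
        filter_index = (uint8_t)((filter_index + 1u) % 4u);
        return average_of_history();
    }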

Selected Sources:

Fenton & Neil discuss the topic of software metrics (1999), and say that to really understand the problem you need to consider many factors at once rather than simple metrics. They give an example a combination of problem complexity, use of IEC1508 (draft version of IEC 61508), code complexity, and operational usage among others (Fenton Fig 3, pg. 684) as a way to ensure good code. In other words, it is clear that these types of factors matter, even if there is no single optimal answer.

McConnell presents a survey of function length practices. A length limit of one to two printed pages is common in industry. McConnell says: “If you want to write routines longer than about 200 lines, be careful. (A line is a non-comment nonblank line of source code.)” He also says: “you’re bound to run into an upper limit of understandability as you pass 200 lines of code” (McConnell 1993, pp. 93-94). These comments are primarily in the context of desktop applications rather than safety critical embedded control software. In my opinion a smaller limit than 200 lines of non-comment code is considered acceptable practice for safety critical embedded systems. McConnell, chapter 5, talks more generally about modular programming in terms of information hiding, coupling, cohesion, and function length.


Selby & Basili found that software modules with high coupling and low coherence had 7.0 times as many errors (per lines of non-comment code) as modules with low coupling and high coherence. (Selby et al. 1991, abstract, p. 146).


Withrow found that the optimum size for minimizing defects was 225 lines of code per module, which is a bit larger than some guidelines (Withrow 1990), but still in the same general range being discussed.

The NASA “Power of Ten” rules for writing safety critical code recommend functions of at most 60 lines of text (Holzman 06, rule 4).  They also recommend declaring data objects at the smallest possible level of scope (id., rule 6).

IEC 61508-3 highly recommends a module size limit (p. 93).

MISRA states that modularity requires high cohesion and low coupling (MISRA Report 5, pp. 9-10).

References:
  • Fenton, A critique of software defect prediction models, IEEE Trans. Software Engineering, Sep/Oct 1999, pp. 675-689.
  • IEC 61508, Functional Safety of Electrical/Electronic/Programmable Electronic Safety-related Systems (E/E/PE, or E/E/PES), International Electrotechnical Commission, 1998. Parts 2,3,7. 
  • Holzman, ''The Power of Ten -- Rules for Developing Safety Critical Code,'' IEEE Computer, June 2006, pp. 93-95.
  • McConnell, Code Complete, Microsoft Press, 1993.
  • MISRA, Report 5: Software Metrics, February 1995.
  • Selby & Basili, Analyzing error-prone system structure, IEEE Trans. Software Engineering, Feb 1991, pp. 141-152.
  • Withrow, Error density and size in Ada software, IEEE Software, Jan. 1990, pp. 26-30.





Monday, June 9, 2014

Minimize Use of Global Variables

Critical embedded software should use the minimum practicable variable scope for each variable, and should minimize use of global variables. As one of the chapters in my book says, Global Variables Are Evil.  Over-use of globals can reasonably be expected to result in significantly increased defect rates and the presence of difficult-to-find software defects that are likely to render a system unsafe.

Consequences: Using too many global variables increases software complexity and can be reasonably expected to increase the number of bugs, as well as make the code difficult to maintain properly. Defining variables as globals that could instead be defined as locals can be reasonably expected to significantly increase the risk of the data being improperly used globally, as well as to make it more difficult to track and analyze the few variables that should legitimately be global. In short, excessive use of globals leads to an increased number of software defects.

Accepted Practices:
  • A significant minority of variables (or fewer) should be global. Ideally zero variables should be global. (Special globals such as mathematical constants and configuration information might be excluded from this metric.) The exact number varies with the system, but an expected range would be from less than 1% to perhaps 10% of all statically allocated variables (even this is probably too high), with an extremely strong preference for the lower side of that range. Exceeding this range can reasonably be expected to lead to an increase in software defects.
  • The need for each global or category of globals should be specifically justified as required for effective software construction. Speed increases are generally not sufficient justification, nor is limited memory space.
  • In any system with more than one task (including systems that just have a main task plus interrupts), every non-constant global should be declared volatile, and every access to a global should be protected by a form of concurrency management (e.g., disabling interrupts or using a mutex). Failing to do either can normally be expected to result in concurrency defects somewhere in your code.
  • Each variable should be declared locally if possible. Variables used by a collection of functions in the same C programming language module should be declared as top-level “static” within the corresponding .c file to limit visibility to functions declared within that .c file. Variables used by only one function should be declared within that function and thus not be visible to any other function.
  • Global variables must be declared consistently throughout the system, including having exactly the same type information (e.g., if a global is declared as “unsigned” in one place, it needs to be “unsigned” everywhere); see the sketch after this list.
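A common way to enforce consistent declarations is to declare each unavoidable global exactly once in a header with extern, and include that header both where the variable is defined and everywhere it is used, so the compiler catches type mismatches. A minimal sketch with hypothetical names (shown as three files in one listing):

    /* engine_state.h -- the single authoritative declaration */
    #ifndef ENGINE_STATE_H
    #define ENGINE_STATE_H
    #include <stdint.h>

    extern volatile uint16_t engine_rpm;   /* one declaration, one type */

    #endif

    /* engine_state.c -- the one and only definition */
    #include "engine_state.h"
    volatile uint16_t engine_rpm;

    /* consumer.c -- users include the same header, so re-declaring
       engine_rpm with a different type here would be a compile-time
       error rather than a silent mismatch. */
    #include "engine_state.h"

    uint16_t rpm_snapshot(void)
    {
        return engine_rpm;
    }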
Discussion:

Global variables (often called “globals” for short) are programming language variables that are visible to the entirety of the program. In contrast, non-global variables (often called local variables, or locals for short), have a reduced scope that makes them visible only in the part of the program where they are used, and invisible to the rest of the program. Excessive use of global variables must be avoided in safety critical software due to the complexity introduced as well as the risk of concurrency hazards.

One reason to avoid globals is that use of many globals can be reasonably expected to lead to high coupling among many disparate portions of a program. Any variable that is globally visible might be read or written from anywhere in the code, increasing program complexity and thus the chance for software defects. While analysis might be performed to determine where globals actually are read and written, doing so for a large number of globals is time consuming, and must be re-done any time any substantive change is made to the code. (For example, it would be expected that such analysis would have to be re-done for each software release.)

Another reason to avoid globals is that they introduce concurrency hazards (see the discussion on concurrency in an upcoming post). Because it is difficult to keep track of what parts of the program are reading and writing a global, safe code must assume that other tasks can access the global and use full concurrency protection each time a global is referenced. This includes both locking access to the global when making a change and declaring the global “volatile” to ensure any changes propagate throughout the software.

Even if concurrency hazards are generally avoided, if more than one place in the program modifies a global it is easy to have unexpected software behavior due to two portions of the program modifying the global’s value in an unanticipated sequence. This can be reasonably expected to lead to infrequent and subtle (but potentially severe) concurrency defects.

As an analogy, consider if you keep your grocery shopping list on a whiteboard on your fridge. If you live alone and are careful, this may work out fine, and you will never accidentally have an item erased or added in error to your list. But if you live in a busy house (perhaps a dormitory), something you wrote on that whiteboard might be changed or accidentally erased, and you might have no idea that it happened or who did it. In this analogy, everything you write on that whiteboard is a global variable – anyone who enters the house can see it, act upon the information, and potentially modify it. As an extension of this analogy, if that whiteboard is a “to do” list, you are using it to communicate information to others in the house. If that list is modified, corrupted, or overwritten, your communication will fail, with that sort of problem becoming more likely as more and more people share the whiteboard for a diversity of purposes.

Using too many globals can be thought of as the data flow version of spaghetti code. With code “control” flow (conditional “if” statements and the like) that is tangled, it is difficult to follow the flow of control of the software. Similarly, with too many globals it is difficult to follow the flow of data through the program – you get “spaghetti data.” In both cases (tangled data flow and tangled control flow) it can reasonably be expected that the spaghetti software will have elevated levels of software defects. Excessive use of global variables makes unit testing difficult, because it requires identifying and setting specific values in all the globals referenced by a module that is being unit tested.

In some cases the risk of a global may seem only theoretical rather than practical, if a variable is only used in a single module but happens to be defined as a global. However, this still presents a latent risk of some other piece of the software modifying the variable intentionally (defeating the principle of information hiding), or by accident via a typographical error that was intended to refer to a similarly named variable defined elsewhere. Therefore, it is an important accepted practice to only define variables as global if it is essential to do so.

There are two related classes of globals that are an exception, and are generally considered acceptable for use. One class is constants that represent physical properties (e.g., the number of degrees in a circle being 360), system configuration information (e.g., the fact that a vehicle has a 6 cylinder engine or a 4-speed transmission), and other values that don’t change at run time. These are properly defined as constant values using the programming language keyword “const” in the code or other similar approach. 

Another class of generally acceptable globals is system state information that must be generally accessible across most software, such as engine speed in a vehicle. These system state variables should be written in exactly one place in the code, should not be duplicated (to avoid confusion and the chance for different copies of the variable to get out of synch), should be kept few in number, and ideally should be read via access functions rather than directly to enforce modularity. These are the sorts of globals that should be kept to a minimum and count toward the statement that having a few well-chosen globals might be OK. In distributed embedded systems these globals usually show up as broadcast bus values rather than as actual memory-based global variables.
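One way to implement such a state variable is to keep the actual storage static within a single module and expose it only through access functions, so every read and write goes through one reviewable place. A hedged sketch with made-up names; the concurrency primitives are hypothetical placeholders for whatever the target system provides:

    /* vehicle_state.c -- the only module that owns the engine speed value */

    #include <stdint.h>

    static volatile uint16_t engine_speed_rpm;   /* written in exactly one place */

    /* Hypothetical concurrency primitives; the real mechanism is system-specific. */
    extern void disable_interrupts(void);
    extern void enable_interrupts(void);

    /* Called only by the module that measures engine speed. */
    void vehicle_state_set_engine_speed(uint16_t rpm)
    {
        disable_interrupts();
        engine_speed_rpm = rpm;
        enable_interrupts();
    }

    /* Any other module reads the value through this accessor instead of
       touching a global directly. */
    uint16_t vehicle_state_get_engine_speed(void)
    {
        uint16_t rpm;

        disable_interrupts();
        rpm = engine_speed_rpm;
        enable_interrupts();

        return rpm;
    }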

Entirely eliminating the use of non-constant globals is sometimes unachievable. But when complete elimination isn’t practical, globals should nonetheless be very few (handfuls, not thousands of them) and strategically selected. The globals that are used should be used carefully, and reviewed to ensure each is essential.

Selected Sources:

Wulf & Shaw in 1973 wrote a paper entitled: “Global variable considered harmful,” (Wulf 1973) describing it as the data version of a “goto” statement – which a previous paper by Dijkstra had famously declared “harmful.” After that publication, the concept of avoiding globals has mostly been a matter of clarifying the gray areas and educating new programmers.



McConnell says: “Global data is like a love letter between routines – it might go where you want it to go, or it might get lost in the mail.” (McConnell 1993, pg. 88). McConnell also says: “global-data coupling is undesirable because the connection between routines is neither intimate nor visible. The connection is so easy to miss that you could refer to it as information hiding's evil cousin – ‘information losing.’” (McConnell 1993, pg. 90).


Ganssle advises “Minimize global variables, both to reduce interaction between routines and to reduce the number of places where interrupts will cause problems.” (Ganssle 1992, pg. 186). A few years later, Ganssle and designers in general appreciated that global variables were even more of a problem than previously thought. He then wrote, resorting to humor to make his point: “One of the greatest evils in the universe, an evil in part responsible for global warming, ozone depletion, and male pattern baldness, is the use of global variables.” Humor aside, he goes on to confirm the problems with globals, and gives recommendations. “Every firmware standard – backed up by the rigorous checks of code inspections – must set rules about global use.” He finishes by saying: “I feel that defining a global is such a source of problems that the team leader should approve every one.” (Ganssle 2000, p. 38)

NASA recommends avoiding too many inter-component dependencies: “components should not depend on each other in a complex way,” which includes as a significant factor the use of globals to communicate data between components (NASA 2004, p. 93).

MISRA Software Guidelines recommend against using global variables (MISRA Report 5, p. 10).  IEC 61508-3 highly recommends a modular approach (p. 79), information hiding/encapsulation (id., p. 93), and a fully defined interface (id., p. 93), which sum up to recommending against the use of global variables.

Update: you can find an example of getting rid of a global variable here:
  http://betterembsw.blogspot.com/2013/09/getting-rid-of-global-variables.html

References:
  • Ganssle, J., The Art of Programming Embedded Systems, Academic Press, 1992.
  • Ganssle, J., The Art of Designing Embedded Systems, Newnes, 2000.
  • IEC 61508, Functional Safety of Electrical/Electronic/Programmable Electronic Safety-related Systems (E/E/PE, or E/E/PES), International Electrotechnical Commission, 1998. Parts 2,3,7. 
  • McConnell, Code Complete, Microsoft Press, 1993.
  • MISRA, Development Guidelines for Vehicle Based Software, November 1994 (PDF version 1.1, January 2001).
  • MISRA, Report 5: Software Metrics, February 1995.
  • NASA-GB-8719.13, NASA Software Safety Guidebook, NASA Technical Standard, March 31, 2004.
  • Wulf & Shaw, Global variable considered harmful, SIGPLAN Notices, Feb 1973, pp. 28-34.