Wednesday, July 23, 2014

Don’t Overflow the Stack

Somewhere in your embedded system is the stack (or several of them for some multitasking systems). If you blow up the stack, your software will crash. Or worse, especially if you don’t have memory protection. For a critical system you need to make sure the stack has some elbow room and make sure that you know when you have a stack size problem.

Also, you shouldn’t ever use recursion in a safety critical embedded system. (This shouldn’t even need to be said, but apparently it does.)

Consequences: The consequences of not understanding maximum stack depth can be seemingly random memory corruption as the stack overwrites RAM variables. Whether or not this actually causes a program crash depends upon a number of factors, including chance, and in the worst case it can cause unsafe program behavior without a crash.

Accepted Practices:
  • Compute worst case stack depth using a static analysis tool.
  • Include a stack sentinel or a related technique in the supervisor task and perform a graceful shutdown or reset prior to an actual overflow.
  • Avoid all recursion to ensure that worst case stack depth is bounded.
Discussion:
The “stack” is an area in memory used for storing certain types of data. For example, in the C programming language this is where non-static local variables go. The stack gets bigger every time a subroutine is called, usually holding the subroutine return address, parameter values that have been passed, and local variables for each currently active subroutine. As nested subroutines are called, the stack keeps getting bigger as more information is added to it to perform each deeper and deeper call. When each subroutine completes, the space it used on the stack is released for later reuse, un-nesting the layers of stack usage and shrinking the stack size. Thus the maximum size of the stack is determined by how many subroutines are called on top of each other (the depth of the subroutine call graph), as well as the storage space needed by each of those subroutines, plus additional area needed by any interrupt service routines that may be active at the same time.

Some processors have a separate hardware stack for subroutine return information and a different stack for parameters and variables. And many operating systems have multiple stacks in support of multiple tasks. But for the most part similar ideas apply to all embedded controllers, and we’ll just discuss the single-stack case to keep things simple.


It is common for stacks to grow “downward” from high memory locations to low locations, with global variables, Real Time Operating System (RTOS) task information, or other values such as the C heap being allocated from the low memory locations to high memory locations. There is an unused memory space in between the stack and the globals. In other words, the stack shares limited RAM space with other variables. Because RAM space is limited, there is the possibility of the stack growing so large that it overlaps variable storage memory addresses so that the stack corrupts other memory (and/or loads and stores to global memory corrupt the stack). To avoid this, it is accepted practice to determine the worst case maximum stack depth and ensure such overlap is impossible.

Maximum stack depth can be determined by a number of means. One way is to periodically sample the stack pointer while the program is running and find the maximum observed stack size. This approach is a starting point, but one should not expect that sampling will happen to catch the absolute maximum stack depth. Catching the worst case can be quite difficult because interrupt service routines also use the stack and run at times that are generally unpredictable compared to program execution flow. (Moreover, if you use a timer interrupt to sample the stack pointer value, you’ll never see the stack pointer when the timer interrupt is masked, leaving a blind spot in this technique.) So this is not how you should determine worst case stack depth.

A much better approach is to use static analysis tools specifically designed to find the worst case set of subroutine calls in terms of stack depth. This technique is effective unless the code has structures that confound the analysis (e.g., if a program uses recursion, this technique generally doesn’t work). When the technique does work, it has the virtue of giving an absolute bound without having to rely upon whether testing happened to have encountered the worst case. You should use this technique if at all possible. If static analysis isn’t possible because of how you have written your code, you should change your code so you actually can do static analysis for maximum stack depth. In particular, this means you should never use recursion in a critical system, both because it makes static analysis of stack depth impossible and because recursive routines are prone to stack overflows even when they are bug free, if they happen to be fed an exceptional input value that causes too many levels of recursion. (In general, using recursion in any small-microcontroller embedded system is a bad idea for these reasons even if the system is not safety critical.)



Beyond static analysis of stack depth (or if static analysis isn’t possible and the system isn’t safety critical), an additional accepted practice is to use a “stack sentinel” to find the “high water mark” of the stack (I’ve also heard this called a “stack watermark” and a “stack guard”). With this technique each memory location in the stack area of memory is initialized to a predetermined known value or pattern of known values. As the stack grows in size, it overwrites these known sentinel values with other values, leaving behind new patterns in memory like footprints in fresh snow. Thus, it is easy to tell how big the stack has ever been since the system was started by looking for how far into the unused memory space the “footprints” of stack usage trample past the current stack size. This gives both the maximum stack size during a particular program run and the overall system maximum stack size, assuming that the worst case path has been executed. (Actually identifying the worst case path need not be done – it is sufficient to have run it at some time during program execution without knowing exactly when it happened. It will leave its mark on the stack memory contents when it does happen.)


To convert this idea into a run-time protection technique, extra memory space between the stack and other data is set up as a sacrificial memory area that is there to detect unexpected corruptions before the stack can corrupt the globals or other RAM data. The program is run, and then memory contents are examined periodically to see how many stack area memory words are left unchanged. As part of the design validation process, designers compare the observed worst case stack depth against the static analysis computed worst case stack depth to ensure that they understand their system, have tested the worst case paths, and have left adequate margin in stack memory to prevent stack overflows if they missed some special situation that is even worse than the predicted worst case.
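As an illustration, here is a minimal C sketch of the sentinel approach, assuming a downward-growing stack; the region size, fill pattern, and function names are made-up placeholders rather than any particular vendor’s API.

  #include <stdint.h>
  #include <stddef.h>

  #define STACK_WORDS      1024u        /* hypothetical stack size in 32-bit words */
  #define SENTINEL_PATTERN 0xDEADBEEFu  /* arbitrary known fill value */

  /* Hypothetical stack region; in a real system the linker places this and
     the stack pointer starts at the high-address end, growing downward. */
  static uint32_t stack_area[STACK_WORDS];

  /* Fill the stack region with sentinels before the stack is put into use. */
  void stack_sentinel_init(void)
  {
      for (size_t i = 0u; i < STACK_WORDS; i++) {
          stack_area[i] = SENTINEL_PATTERN;
      }
  }

  /* High water mark: how many words have ever been used. Untouched
     sentinels remain at the low-address end of the array. */
  size_t stack_high_water_words(void)
  {
      size_t untouched = 0u;
      while ((untouched < STACK_WORDS) &&
             (stack_area[untouched] == SENTINEL_PATTERN)) {
          untouched++;
      }
      return STACK_WORDS - untouched;
  }

  /* Periodic check from a supervisor task: if the stack has trampled into
     the sacrificial area, shut down or reset before globals are corrupted. */
  void stack_sentinel_check(size_t sacrificial_words)
  {
      if (stack_high_water_words() > (STACK_WORDS - sacrificial_words)) {
          /* graceful shutdown or controlled reset goes here */
      }
  }

In practice the fill is done by startup code (or by the RTOS for each task stack) before the stack is used, and the periodic check lives in the supervisor task mentioned in the accepted practices above.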

This sentinel technique should also be used in testing and in production systems to periodically check the sentinels in the sacrificial memory area. If you have computed a worst case stack depth at design time and you detect that the computed depth has been exceeded at run time, this is an indication that you have a very serious problem.

If the stack overflows past the sacrificial memory space into global variables, the system might crash. Or that might just corrupt global variables or RTOS system state and have who-knows-what effect. (In a safety critical system “who-knows-what” equals “unsafe.”)  Naively assuming that you will always get a system crash detectable by the watchdog on a stack overflow is a dangerous practice. Sometimes the watchdog will catch a stack overflow. But there is no guarantee this will always happen. Consider that a stack-smashing attack is the security version of an intentional stack overflow, but is specifically designed to take over a system, not just merely crash it. So a crash after a stack overflow is by no means a sure thing.

Avoiding stack overflow problems is a matter of considering program execution paths to avoid a deep sequence of calls, and accounting for interrupts adding even more to the stack. And, again, even though we shouldn’t have to say this: never use recursion.

While there isn’t a set number for how much margin to leave in terms of extra memory past the computed maximum stack size, there are two considerations. First, you want to leave enough room to have an ample sacrificial area so that a problem with stack depth is unlikely to have enough time to go all the way through the sacrificial area and touch the globals before it is detected by a periodic timer tick checking sentinels. (Note: we didn’t say check current stack depth; we said periodically check sentinels to see if the stack has gotten too big between checks.)  Also, you want to leave some extra margin to account for the possibility you just never encountered the worst-case stack depth in analysis or testing. I’ve heard designers say worst case stack depth of 50% of available stack memory is a good idea. Above 90% use of stack memory (10% sacrificial memory area set aside) is probably a bad idea. But the actual number will depend on the details of your system.
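As a worked example with entirely hypothetical numbers: if static analysis reports a worst case depth of 2 KB, allocating 4 KB of stack puts predicted worst case usage at 50% and leaves a 2 KB sacrificial area for the sentinel check to catch trouble.

  /* Hypothetical sizing, for illustration only. */
  #define WORST_CASE_STACK_BYTES  2048u  /* from the static analysis tool */
  #define STACK_ALLOC_BYTES       4096u  /* RAM actually reserved for the stack */
  #define STACK_MARGIN_BYTES      (STACK_ALLOC_BYTES - WORST_CASE_STACK_BYTES)
  /* Predicted usage = 2048/4096 = 50%, at the comfortable end of the range
     discussed above; at 90%+ usage the sacrificial area would likely be too
     small for the periodic sentinel check to catch an overflow in time. */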

Selected Sources:
Stack sentinels and avoidance of recursion are an entrenched part of embedded systems folklore and accepted practices. Douglass mentions watermarking in Section 9.8.6 (Douglass 2002).
MISRA Software Guidelines discourage recursion, saying that it can cause unpredictable behavior (MISRA Guidelines, pg. 20).

MISRA C required rule 70 explicitly bans recursion (MISRA C 1998).

NASA recommends using stack guards (essentially the same as the technique that I call “stack sentinels”) to check for stack overflow or corruption (NASA 2004, p. 93).

Stack overflow errors are well known to corrupt memory and cause arbitrarily bad behavior of a running program. Regehr et al. provide an overview of research in that area relevant to embedded microcontrollers and ways to mitigate those problems (Regehr 2006). They note that “the potential for stack overflow in embedded systems is hard to detect by testing.” (id., p. 776), with the point being that it is reasonable to expect that a system which has been tested but not thoroughly analyzed will have run-time stack overflows that corrupt memory.

References:
  • Douglass, B. P., Real-Time Design Patterns: robust scalable architecture for real-time systems, Pearson Education, first printing, September 2002, copyright by Pearson in 2003.
  • MISRA, (MISRA C), Guideline for the use of the C Language in Vehicle Based Software, April 1998.
  • MISRA, Development Guidelines for Vehicle Based Software, November 1994 (PDF version 1.1, January 2001).
  • NASA-GB-8719.13, NASA Software Safety Guidebook, NASA Technical Standard, March 31, 2004.
  • Regehr et al., Eliminating stack overflow by abstract interpretation, ACM Trans. Embedded Computing Systems, Nov. 2006, pp. 751-778.

Monday, June 30, 2014

Avoid Concurrency Defects

Accesses to variables shared among multiple threads of execution must be protected via disabling interrupts, using a mutex, or some other rigorously applied concurrency management approach.

Consequences:

Incorrect or lax use of concurrency techniques can be expected to lead to concurrency bugs. Such bugs are usually difficult to detect via testing and difficult to reproduce once detected. A fraction of any such bugs can be reasonably expected to make it past even extensive testing into production fleets.

Accepted Practices:
  • Aggressively minimize the use of globally shared variables. Every variable shared between tasks is a chance for a concurrency defect.
  • For every access to a shared variable, treat the entire time that a copy of the variable is “live” in the computation as a critical section, protecting access via masking interrupts or some other well defined technique.
  • Avoid concurrency management techniques that are “home brewed” or otherwise not part of well proven practice. Similarly, do not modify well known techniques to optimize efficiency or obtain other perceived benefits. Such techniques are extremely difficult to get right, and altering techniques or using non-standard techniques can be reasonably expected to introduce defects.
  • Declare every shared variable “volatile” to ensure that reads and writes do not result in stale data being used due to compiler optimization attempts to improve computation speed.
  • Keep critical sections as short as possible to minimize negative effects on scheduling. (The longest critical section sets a lower bound on blocking time for scheduling purposes.)
Discussion:

One aspect of a modern real time embedded system is that it must appear to do many things at once. For example, an engine controller must look at many inputs, set throttle angle and perform diagnostic checks all at the same time. Typically many of these tasks are written as relatively independent pieces of software that must work together, and they must all appear to run at the same time.

In reality there is only one CPU, so tasks must take turns using that CPU, with that turn-taking supervised by the RTOS. And, tasks often must share things such as memory locations and input/output devices. The turn-taking means that, unless software designers are extremely careful, on occasion shared resources can be left in undefined or incorrect states, resulting in concurrency bugs.

Example concurrency defect.
Source: http://blogs.windriver.com/engblom/2010/06/true-concurrency-is-truly-different-again.html (That blog post has a nice discussion of how situation-dependent such defects are)

Avoiding and fixing concurrency bugs is a major source of design and testing effort on most embedded systems. In part this is because concurrency bugs can be quite subtle, and in part it is because they can be very difficult to activate during testing as well as difficult to isolate even when one is observed (if they are observed at all during testing). The difficulty of detecting and fixing concurrency defects, as well as the reasonable probability that they won’t be seen at all in testing, makes disciplined use of good practices essential.

A variety of techniques are available to avoid concurrency problems. A preferred approach is avoiding situations in which concurrency bugs are possible. For example, avoiding the use of shared global variables avoids associated concurrency defects (because the shared variables simply aren’t there to begin with). But, when sharing can’t be avoided, there are well defined basic techniques that work. Typically such techniques work by “locking” some resource so that other tasks cannot use it. As an analogy, consider a changing room at a clothing store. If you want to make sure that nobody else tries to use the room you are using, you need to “lock” the room when you enter (with an actual door lock, or maybe just by closing the door or curtain all the way), and then “unlock” the room when you leave. If you never lock the door, nothing bad may happen the first dozens or hundreds of times you try on clothes. But eventually someone will wander into the room by mistake while you are there if you don’t lock the door.

Disabling interrupts is a concurrency design approach that can be thought of as a program “locking” the CPU so that no other task can use it when a shared variable is being accessed. The way this works is that any task wanting to, say, increment a variable first disables interrupts, then increments the variable, then re-enables interrupts. The period of time between the first read of a variable and when the variable is done being updated is known as a “critical section,” and is the time during which no other task can be permitted to access the variable. Disabling interrupts turns off the hardware’s ability to switch tasks or perform anything but the desired computation during the critical section. This ensures that no other task in the system can read or write the shared variable, because disabling interrupts prevents any other task from running. It is essential that every single access to a shared variable disable interrupts for the entire use of the variable for this to be guaranteed to work. If a local copy of the variable is kept and used outside the time during which interrupts are disabled, there are no guarantees as to how the system will behave when that local copy is subsequently used to update the variable. Other techniques are available to manage concurrency beyond disabling interrupts. But, this is a common technique in embedded systems.
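A minimal sketch of this pattern in C follows; the interrupt enable/disable calls are placeholders, since the real mechanism is a compiler intrinsic or RTOS call that varies by platform.

  #include <stdint.h>

  /* Placeholder names; a real system uses compiler intrinsics or RTOS calls
     (and typically saves/restores the previous interrupt enable state). */
  extern void disable_interrupts(void);
  extern void enable_interrupts(void);

  /* Shared between a task and an ISR, so it is declared volatile. */
  static volatile uint16_t event_count;

  void increment_event_count(void)
  {
      disable_interrupts();   /* begin critical section */
      event_count++;          /* read-modify-write cannot be interrupted now */
      enable_interrupts();    /* end critical section */
  }

  uint16_t read_event_count(void)
  {
      uint16_t snapshot;

      disable_interrupts();   /* even a read needs protection if it is not atomic on this CPU */
      snapshot = event_count;
      enable_interrupts();

      return snapshot;
  }

Note that the snapshot returned by read_event_count() is only guaranteed to be consistent at the moment it was taken; any computation that later writes a value back to the shared variable must itself be inside a critical section.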

Even using these techniques, special care must be used in accessing any shared resource. For example, the keyword volatile must be used for every shared resource to ensure that the most up to date copy is always accessed. (But, even this won’t help if that copy is updated at an unexpected time.)

Selected Sources:

Ball describes concurrency defects in terms of race conditions and prescribes disabling interrupts to solve the problem (Ball 2002, p. 162).

Douglass gives a pattern for a critical section in section 7.2, and in section 7.2.6 says that the most common way to prevent context switching during a critical section is to disable interrupts. In section 7.2.5, Douglass says: “The designers and programmers must show good discipline in ensuring that every resource access locks the resource before performing any manipulation of the source.” (Douglass 2002)

MISRA recommends that developers “Use Test-and-Set instructions or a signaling mechanism, such as Dekker/Dijkstra/Lamport Semaphores, to protect and mark as ‘in-use’ any common resources.” (MISRA Report 3 p. 26) In more modern terminology, this is a recommendation to use a mutex or related semaphore-based “lock” on data. MISRA also cautions that interrupt enable and disable instructions must be used with care. (id.)

Sullivan presents results of a study of defect types, concluding that 11% of high impact memory corruption errors are due to concurrency defects. (Sullivan 1991, p. 6). This means that while most defects are easier to track down, a few race conditions and other concurrency defects can be expected to happen.

Concurrency defects are so difficult to find that specific testing and analysis tools have been developed to find them, prompting the creation of a benchmark suite to evaluate such tools (Jalbert 2011). A significant challenge to creating such a benchmark is the difficulty in reproducing such bugs even when the exact bug is completely understood. (id., first page)

Park et al. provide a summary of work on finding and fixing concurrency defects (Park 2010). Among other things they note that a concurrency defect was responsible in part for the 2003 Northeastern US electricity blackout that left 10 million people without power (id. p. 245), and that such bugs are difficult to reproduce (id.). 

References:
  • Ball, Embedded Microprocessor Systems: Real World Design, Newnes, 2002.
  • Douglass, B. P., Real-Time Design Patterns: robust scalable architecture for real-time systems, Pearson Education, first printing, September 2002, copyright by Pearson in 2003.
  • Jalbert, RADBench: a concurrency bug benchmark suite, HotPar 2011, pp. 2-8.
  • MISRA, Report 3: Noise, EMC and Real-Time, February 1995.
  • Park, Falcon: fault localization in concurrent programs, ICSE 2010, pp. 245-254.
  • Sullivan & Chillarege, Software defects and their impact on system availability: a study of field failures in operating systems, Fault Tolerant Computing Symposium, 1991, pp 1-9.

Monday, June 16, 2014

Ensure Code Is Modular

Critical embedded software should be modular, and in particular should limit function size to no more than 1-2 pages of code including comments.

Consequences: Poor modularity can reasonably be expected to result in code that is more complex, more brittle, and in general has a higher defect rate than modular code. This is in part because code with poor modularity is harder to understand (and therefore harder to get right), and in part because it is more difficult to test than code with good modularity.

Accepted Practices:

  • A significant majority of functions, procedures, or subroutines must each be no longer than a defined limit in the range of 100-225 lines, nominally 120 lines, including comments. Longer functions should be rare, and their length should be specifically justified, taking into account whether they have low cyclomatic complexity or other mitigating factors.
  • Each module should have high cohesion, low coupling, and good information hiding. There are no generally accepted precise values for these metrics.

Discussion:

Code is considered modular if it is written in reasonably small chunks of code (modules) that work together in a way that maintains some key properties. Code that is merely in small chunks but that does not achieve these properties is not considered modular. A code “module” is classically defined as a piece of code that accomplishes a particular function. This may be either a single software subroutine (or equivalent) with associated variables, or may be a small set of collected subroutines that work together to accomplish a function or a set of closely related functions. Key properties of modular code include the following areas.

Small size: each function should be one or two pages of code when printed (up to say 120 lines of code including comments maximum, but with most functions being less than one page of 60-75 lines including comments). Having a function size larger than this makes it difficult for developers and reviewers to understand the entirety of the function at one time, decreasing the effectiveness of finding bugs. Large size also makes it harder to create effective unit tests due to the increased complexity of a large function. (This length limit is historically associated with a function, in terms of a single function fitting on one printed page of about 65 lines.)

High cohesion: each module performs a single job and contains a set of closely related functions rather than a collection of unrelated functions. Having low cohesion results in code that mixes up unrelated functions, making it difficult to test, and difficult to maintain as changes need to be made. Low cohesion also normally leads to bugs due to related functions being scattered around many modules instead of in a single module.

Low coupling: modules should have a low dependency upon other module data. High coupling causes data dependencies among modules that makes testing difficult and tends to lead to subtle data sharing bugs. Good practice is to minimize coupling via avoiding global variables. High coupling can be reasonably expected to increase defect rates.

Information hiding: modules should hide as much about the details of their implementation as possible to reduce implicit dependencies that can cause other parts of the software to fail. For example, the exact algorithm or data format used for calculations internal to a module should not be discernible to any other module. Accepted practice is to use the “static” keyword on identifiers within a .c file to accomplish information hiding, and declare variables within a procedure, instead of globally, if they are only needed within that procedure.
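A short sketch of what this looks like in C, with made-up module and identifier names:

  /* sensor_filter.c -- hypothetical module illustrating information hiding */
  #include <stdint.h>

  /* Internal state: visible only within this .c file because of "static". */
  static uint16_t filtered_value;
  static uint16_t history[4];

  /* Internal helper: also hidden from other modules. */
  static uint16_t average4(const uint16_t *samples)
  {
      uint32_t sum = 0u;
      for (int i = 0; i < 4; i++) {
          sum += samples[i];
      }
      return (uint16_t)(sum / 4u);
  }

  /* Public interface: the only names other modules see (declared in a
     corresponding sensor_filter.h header). The filtering algorithm and data
     format can change later without affecting any caller. */
  void sensor_filter_update(uint16_t raw_sample)
  {
      for (int i = 3; i > 0; i--) {
          history[i] = history[i - 1];
      }
      history[0] = raw_sample;
      filtered_value = average4(history);
  }

  uint16_t sensor_filter_get(void)
  {
      return filtered_value;
  }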

Selected Sources:

Fenton & Neil discuss the topic of software metrics (1999), and say that to really understand the problem you need to consider many factors at once rather than simple metrics. They give as an example a combination of problem complexity, use of IEC1508 (draft version of IEC 61508), code complexity, and operational usage among others (Fenton Fig 3, pg. 684) as a way to ensure good code. In other words, it is clear that these types of factors matter, even if there is no single optimal answer.

McConnell presents a survey of function length practices. A length limit of one to two printed pages is common in industry. McConnell says: “If you want to write routines longer than about 200 lines, be careful. (A line is a non-comment nonblank line of source code.)” He also says: “you’re bound to run into an upper limit of understandability as you pass 200 lines of code” (McConnell 1993, pp. 93-94). These comments are primarily in the context of desktop applications rather than safety critical embedded control software. In my opinion, a limit smaller than 200 lines of non-comment code is accepted practice for safety critical embedded systems. McConnell, chapter 5, talks more generally about modular programming in terms of information hiding, coupling, cohesion, and function length.


Selby & Basili found that software modules with high coupling and low cohesion had 7.0 times as many errors (per lines of non-comment code) as modules with low coupling and high cohesion (Selby & Basili 1991, abstract, p. 146).


Withrow found that the optimum size for minimizing defects was 225 lines of code per module, which is a bit larger than some guidelines (Withrow 1990), but still in the same general range being discussed.

The NASA “Power of Ten” rules for writing safety critical code recommend functions of at most 60 lines of text (Holzmann 2006, rule 4). They also recommend declaring data objects at the smallest possible level of scope (id., rule 6).

IEC 61508-3 highly recommends a module size limit (p. 93).

MISRA states that modularity requires high cohesion and low coupling (MISRA Report 5, pp. 9-10).

References:
  • Fenton, A critique of software defect prediction models, IEEE Trans. Software Engineering, Sep/Oct 1999, pp. 675-689.
  • IEC 61508, Functional Safety of Electrical/Electronic/Programmable Electronic Safety-related Systems (E/E/PE, or E/E/PES), International Electrotechnical Commission, 1998. Parts 2,3,7. 
  • Holzmann, “The Power of Ten -- Rules for Developing Safety Critical Code,” IEEE Computer, June 2006, pp. 93-95.
  • McConnell, Code Complete, Microsoft Press, 1993.
  • MISRA, Report 5: Software Metrics, February 1995.
  • Selby & Basili, Analyzing error-prone system structure, IEEE Trans. Software Engineering, Feb 1991, pp. 141-152.
  • Withrow, Error density and size in Ada software, IEEE Software, Jan. 1990, pp. 26-30.





Monday, June 9, 2014

Minimize Use of Global Variables

Critical embedded software should use the minimum practicable variable scope for each variable, and should minimize use of global variables. As one of the chapters in my book says, Global Variables Are Evil.  Over-use of globals can reasonably be expected to result in significantly increased defect rates and the presence of difficult-to-find software defects that are likely to render a system unsafe.

Consequences: Using too many global variables increases software complexity and can be reasonably expected to increase the number of bugs, as well as make the code difficult to maintain properly. Defining variables as globals that could instead be defined as locals can be reasonably expected to significantly increase the risk of the data being improperly used globally, as well as to make it more difficult to track and analyze the few variables that should legitimately be global. In short, excessive use of globals leads to an increased number of software defects.

Accepted Practices:
  • A significant minority of variables (or fewer) should be global. Ideally zero variables should be global. (Special globals such as mathematical constants and configuration information might be excluded from this metric.) The exact number varies with the system, but an expected range would be from less than 1% to perhaps 10% of all statically allocated variables (even this is probably too high), with an extremely strong preference for the lower side of that range. Exceeding this range can reasonably be expected to lead to an increase in software defects.
  • The need for each global or category of globals should be specifically justified as required for effective software construction. Speed increases are generally not sufficient justification, nor is limited memory space.
  • In any system with more than one task (including systems that just have a main task plus interrupts), every non-constant global should be declared volatile, and every access to a global should be protected by a form of concurrency management (e.g., disabling interrupts or using a mutex). Failing to do either can normally be expected to result in concurrency defects somewhere in your code.
  • Each variable should be declared locally if possible. Variables used by a collection of functions in the same C programming language module should be declared as top-level “static” within the corresponding .c file to limit visibility to functions declared within that .c file. Variables used by only one function should be declared within that function and thus not be visible to any other function.
  • Global variables must be declared consistently throughout the system, including having exactly the same type information (e.g., if a global is declared as “unsigned” in one place, it needs to be “unsigned” everywhere); a sketch of this declaration pattern appears just after this list.
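Here is a minimal sketch of that declaration pattern, using a made-up variable name; a single extern declaration in a shared header guarantees every file sees the same type, and the one definition lives in exactly one .c file.

  /* system_state.h -- hypothetical shared header, included by every file
     that touches the global, so the type is identical everywhere. */
  #ifndef SYSTEM_STATE_H
  #define SYSTEM_STATE_H
  #include <stdint.h>

  extern volatile uint8_t g_system_mode;   /* volatile: shared between tasks/ISRs */

  #endif

  /* system_state.c -- the one and only definition of the global. */
  #include "system_state.h"
  volatile uint8_t g_system_mode = 0u;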
Discussion:

Global variables (often called “globals” for short) are programming language variables that are visible to the entirety of the program. In contrast, non-global variables (often called local variables, or locals for short), have a reduced scope that makes them visible only in the part of the program where they are used, and invisible to the rest of the program. Excessive use of global variables must be avoided in safety critical software due to the complexity introduced as well as the risk of concurrency hazards.

One reason to avoid globals is that use of many globals can be reasonably expected to lead to high coupling among many disparate portions of a program. Any variable that is globally visible might be read or written from anywhere in the code, increasing program complexity and thus the chance for software defects. While analysis might be performed to determine where globals actually are read and written, doing so for a large number of globals is time consuming, and must be re-done any time any substantive change is made to the code. (For example, it would be expected that such analysis would have to be re-done for each software release.)

Another reason to avoid globals is that they introduce concurrency hazards (see the discussion on concurrency in an upcoming post). Because it is difficult to keep track of what parts of the program are reading and writing a global, safe code must assume that other tasks can access the global and use full concurrency protection each time a global is referenced. This includes both locking access to the global when making a change and declaring the global “volatile” to ensure any changes propagate throughout the software.

Even if concurrency hazards are generally avoided, if more than one place in the program modifies a global it is easy to have unexpected software behavior due to two portions of the program modifying the global’s value in an unanticipated sequence. This can be reasonably expected to lead to infrequent and subtle (but potentially severe) concurrency defects.

As an analogy, consider if you keep your grocery shopping list on a whiteboard on your fridge. If you live alone and are careful, this may work out fine, and you will never accidentally have an item erased or added in error to your list. But if you live in a busy house (perhaps a dormitory), something you wrote on that whiteboard might be changed or accidentally erased, and you might have no idea that it happened or who did it. In this analogy, everything you write on that whiteboard is a global variable – anyone who enters the house can see it, act upon the information, and potentially modify it. As an extension of this analogy, if that whiteboard is a “to do” list, you are using it to communicate information to others in the house. If that list is modified, corrupted, or overwritten, your communication will fail, with that sort of problem becoming more likely as more and more people share the whiteboard for a diversity of purposes.

Using too many globals can be thought of as the data flow version of spaghetti code. With code “control” flow (conditional “if” statements and the like) that is tangled, it is difficult to follow the flow of control of the software. Similarly, with too many globals it is difficult to follow the flow of data through the program – you get “spaghetti data.” In both cases (tangled data flow and tangled control flow) it can reasonably be expected that the spaghetti software will have elevated levels of software defects. Excessive use of global variables makes unit testing difficult, because it requires identifying and setting specific values in all the globals referenced by a module that is being unit tested.

In some cases the risk of a global may seem only theoretical rather than practical, if a variable is only used in a single module but happens to be defined as a global. However, this still presents a latent risk of some other piece of the software modifying the variable intentionally (defeating the principle of information hiding), or by accident via a typographical error that was intended to refer to a similarly named variable defined elsewhere. Therefore, it is an important accepted practice to only define variables as global if it is essential to do so.

There are two related classes of globals that are an exception, and are generally considered acceptable for use. One class is constants that represent physical properties (e.g., the number of degrees in a circle being 360), system configuration information (e.g., the fact that a vehicle has a 6 cylinder engine or a 4-speed transmission), and other values that don’t change at run time. These are properly defined as constant values using the programming language keyword “const” in the code or other similar approach. 
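A small sketch of this first class of acceptable globals (the configuration value is hypothetical):

  /* config.h -- hypothetical header of values that never change at run time. */
  #ifndef CONFIG_H
  #define CONFIG_H
  #include <stdint.h>

  /* Either style keeps the value read-only; "const" adds type checking. */
  static const uint16_t DEGREES_PER_CIRCLE = 360u;
  #define NUM_CYLINDERS  6u   /* hypothetical vehicle configuration value */

  #endif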

Another class of generally acceptable globals is system state information that must be generally accessible across most software, such as engine speed in a vehicle. These system state variables should be written in exactly one place in the code, should not be duplicated (to avoid confusion and the chance for different copies of the variable to get out of synch), should be kept few in number, and ideally should be read via access functions rather than directly to enforce modularity. These are the sorts of globals that should be kept to a minimum and count toward the statement that having a few well-chosen globals might be OK. In distributed embedded systems these globals usually show up as broadcast bus values rather than as actual memory-based global variables.
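A sketch of this second class, using access functions and a single writer; the names and the concurrency primitives are placeholders.

  /* vehicle_state.c -- hypothetical module that owns the engine speed value. */
  #include <stdint.h>

  extern void disable_interrupts(void);   /* placeholder concurrency primitives */
  extern void enable_interrupts(void);

  /* The state variable itself is hidden inside this module ("static"), so
     the only way to touch it is through the access functions below. */
  static volatile uint16_t engine_speed_rpm;

  /* Written in exactly one place: the task that measures engine speed. */
  void vehicle_state_set_engine_speed(uint16_t rpm)
  {
      disable_interrupts();
      engine_speed_rpm = rpm;
      enable_interrupts();
  }

  /* Read-only access for everyone else. */
  uint16_t vehicle_state_get_engine_speed(void)
  {
      uint16_t rpm;
      disable_interrupts();
      rpm = engine_speed_rpm;
      enable_interrupts();
      return rpm;
  }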

Entirely eliminating the use of non-constant globals is sometimes unachievable. But when complete elimination isn’t practical, globals should nonetheless be very few (handfuls, not thousands of them) and strategically selected. The globals that are used should be used carefully, and reviewed to ensure each is essential.

Selected Sources:

Wulf & Shaw in 1973 wrote a paper entitled: “Global variable considered harmful,” (Wulf 1973) describing it as the data version of a “goto” statement – which a previous paper by Dijkstra had famously declared “harmful.” After that publication, the concept of avoiding globals has mostly been a matter of clarifying the gray areas and educating new programmers.



McConnell says: “Global data is like a love letter between routines – it might go where you want it to go, or it might get lost in the mail.” (McConnell 1993, pg. 88). McConnell also says: “global-data coupling is undesirable because the connection between routines is neither intimate nor visible. The connection is so easy to miss that you could refer to it as information hiding's evil cousin – ‘information losing.’” (McConnell 1993, pg. 90).


Ganssle advises “Minimize global variables, both to reduce interaction between routines and to reduce the number of places where interrupts will cause problems.” (Ganssle 1992, pg. 186). A few years later, Ganssle and designers in general appreciated that global variables were even more of a problem than previously thought. He then wrote, resorting to humor to make his point: “One of the greatest evils in the universe, an evil in part responsible for global warming, ozone depletion, and male pattern baldness, is the use of global variables.” Humor aside, he goes on to confirm the problems with globals, and gives recommendations. “Every firmware standard – backed up by the rigorous checks of code inspections – must set rules about global use.” He finishes by saying: “I feel that defining a global is such a source of problems that the team leader should approve every one.” (Ganssle 2000, p. 38)

NASA recommends avoiding too many inter-component dependencies: “components should not depend on each other in a complex way,” which includes as a significant factor the use of globals to communicate data between components (NASA 2004, p. 93).

MISRA Software Guidelines recommend against using global variables (MISRA Report 5, p. 10).  IEC 61508-3 highly recommends a modular approach (p. 79), information hiding/encapsulation (id., p. 93), and a fully defined interface (id., p. 93), which sum up to recommending against the use of global variables.

Update: you can find an example of getting rid of a global variable here:
  http://betterembsw.blogspot.com/2013/09/getting-rid-of-global-variables.html

References:
  • Ganssle, J., The Art of Programming Embedded Systems, Academic Press, 1992.
  • Ganssle, J., The Art of Designing Embedded Systems, Newnes, 2000.
  • IEC 61508, Functional Safety of Electrical/Electronic/Programmable Electronic Safety-related Systems (E/E/PE, or E/E/PES), International Electrotechnical Commission, 1998. Parts 2,3,7. 
  • McConnell, Code Complete, Microsoft Press, 1993.
  • MISRA, Development Guidelines for Vehicle Based Software, November 1994 (PDF version 1.1, January 2001).
  • MISRA, Report 5: Software Metrics, February 1995.
  • NASA-GB-8719.13, NASA Software Safety Guidebook, NASA Technical Standard, March 31, 2004.
  • Wulf & Shaw, Global variable considered harmful, SIGPLAN Notices, Feb 1973, pp. 28-34.

Monday, June 2, 2014

Avoid High Cyclomatic Complexity

Cyclomatic complexity is a measure of the number of paths through a particular piece of code (a higher number means the software is more complex). Critical software functions with high cyclomatic complexity are one type of "spaghetti" code that should be avoided. One can argue an exception for single-level switch statements, so long as there are no nested conditionals within the switch cases themselves.

Consequences: A high cyclomatic complexity for a particular function means that the function will be difficult to understand, and more difficult to test. That in turn means functions with a cyclomatic complexity above 10 or 15 can be reasonably expected to have more undetected defects than simpler functions. In practice, “spaghetti code” with a high cyclomatic complexity can reasonably be expected to have a poor design and poor unit test coverage, resulting in the reasonable expectation that it will have an elevated rate of software defects.

Accepted Practices:
  • Cyclomatic complexity should be limited for almost all software functions to a relatively small, predetermined threshold defined by the project software development plan. A range of 10 to 15 for a significant majority of functions in a project is typical.
  • Functions with a cyclomatic complexity above the defined threshold of 10 to 15 should be subject to special scrutiny, and simplified if practical. Functions with a complexity significantly above 15 (perhaps a predefined threshold in the range of 30-50 and above, depending on the particular system involved) should be strictly prohibited.
  • In some cases, functions with a single switch statement having many cases may be acceptable because of the regular structure of a switch statement, but should not contain nested switch statements, nor contain conditional statements within the first-level switch statement.
Discussion:

McCabe Cyclomatic Complexity (e.g., as discussed in NIST 500-235) is a measure of the complexity of a piece of software. Rather than counting the number of lines of code, cyclomatic complexity (roughly speaking) counts the number of decision points within the software. For example, a piece of software with 1000 lines of code but no “if” statements and no loops has a cyclomatic complexity of 1. But a piece of software 100 lines long might have a much higher cyclomatic complexity if it contains many levels of nested “if” statements and loops. Thus, cyclomatic complexity measures how complicated a piece of software is in terms of flow control structure rather than simply how big it is.
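As a rough worked example, the hypothetical function below has three decision points (the loop condition, the if, and the else-if), so with the common simplification of counting one plus the number of decision points its cyclomatic complexity is 3 + 1 = 4, even though it is only a dozen lines long.

  #include <stdint.h>

  /* Hypothetical example: cyclomatic complexity = decision points + 1 = 4. */
  int16_t clamp_and_sum(const int16_t *samples, uint8_t count)
  {
      int32_t sum = 0;

      for (uint8_t i = 0u; i < count; i++) {   /* decision 1 */
          if (samples[i] > 100) {              /* decision 2 */
              sum += 100;
          } else if (samples[i] < -100) {      /* decision 3 */
              sum += -100;
          } else {
              sum += samples[i];
          }
      }
      return (int16_t)sum;
  }

As discussed in the next paragraph, achieving 100% branch coverage of this function takes at most four unit tests; a function with a complexity of 30 or 50 scales that testing effort accordingly.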

An important use of cyclomatic complexity is in understanding how difficult a piece of software will be to test. A common assessment of unit testing thoroughness is the “branch coverage” of a set of tests. Branch coverage measures whether tests that have been run have exercised both sides of a branch. For example, if a piece of code has a single “if … else” construct, attaining 100% branch coverage requires one test case to exercise the “if” part, and another test case to exercise the “else” part, since a program can’t take both paths at once. But, if there are nested conditionals the number of tests required can increase dramatically due to all the different paths that might have to be taken to exercise all paths through the code. Cyclomatic complexity provides a way to understand the number of tests required for these paths. In particular, depending upon the specifics of the software, it can take up to M unit tests to achieve 100% branch coverage, where M is the McCabe Cyclomatic Complexity of the software under test. (The actual number depends upon the structure of the software, but for a higher complexity number one expects that the effort required to test the software will increase.)

Below are examples of program control flow graphs from NIST publication 500-235. The point of these figures is not the particulars of the program, but rather the visible increase in number of possible paths (from top to bottom in each graph) as the cyclomatic complexity increases. Ensuring that tests exercise each and every path through the graph is going to be a lot harder for a complexity of 17 or 28 than for a complexity of 5.


Cyclomatic complexity examples from NIST 500-235, 1996.

A generally accepted practice is to limit cyclomatic complexity for each software function within a system to a range of 10-15 (NIST 500-235 pg. 25). Functions having a complexity above 10 or 15 require special attention and consideration as to why the deviation is required, and can reasonably be expected to be much more difficult to test thoroughly than functions with a complexity of 10 or below. This means that the effort required to achieve high branch coverage during unit test of each function is reduced with low cyclomatic complexity.

Code with a high cyclomatic complexity normally ends up as what is referred to as “spaghetti code,” because the control flow structure is a tangled mess like a plate of spaghetti. When you pull at one end of a piece of spaghetti on a dinner plate, you have no idea which of the other ends will also move. So it is with trying to understand the details of spaghetti software. Such code has the properties of being difficult to understand, difficult to test, and more prone to defects. (Historical note: "spaghetti code" classically referred to code with many goto statements, but more recently the term has come to encompass poorly structured code in general.)

The primary exception to this rule in practice is when a “switch” statement is used to deal with a large number of alternatives in a very regular pattern, especially if the switch is being used to dispatch different command bytes from a communication protocol parser or the like. But, test effort is still significant for a switch statement with many alternative possible paths, and some compilers have trouble with switch statements having more than 256 cases, so care should be taken with large switch statements. In any event, nested conditionals should not be buried inside cases of switch statements.
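A sketch of the kind of regular, single-level switch this exception has in mind; the command codes and handler functions are made up.

  #include <stdint.h>

  /* Hypothetical command handlers, each a small function of its own. */
  extern void handle_reset(void);
  extern void handle_status_request(void);
  extern void handle_set_speed(uint8_t arg);
  extern void handle_unknown(uint8_t cmd);

  /* Flat dispatcher: one case per command byte, no nested conditionals.
     Each case just calls a handler and breaks. */
  void dispatch_command(uint8_t cmd, uint8_t arg)
  {
      switch (cmd) {
          case 0x01: handle_reset();           break;
          case 0x02: handle_status_request();  break;
          case 0x03: handle_set_speed(arg);    break;
          default:   handle_unknown(cmd);      break;
      }
  }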

Selected Sources:

US National Institute of Standards and Technology (NIST) Special Publication 500-235 (Watson & McCabe 1996) explains the relationship between cyclomatic complexity and testing effort. Page 25 of that publication discusses a complexity limit of 10 being an accepted practice, and says that while limits as high as 15 have been successful, even higher levels of complexity require additional testing effort.

The Reliability Information Analysis Center (“RIAC” – formerly known as “RAC”) is the focus point of the US Department of Defense for reliability information of all types. They summarize the effects of McCabe Cyclomatic Complexity (MCC) as follows: “Empirically, numbers less than ten imply reasonable structure, numbers higher than 30 are of questionable structure. Very high cyclomatic numbers of more than 50 imply the application cannot be tested, while even higher numbers of more than 75 imply that every change may trigger a ‘bad fix’. This metric is widely used for Quality Assurance and test planning purposes.” (RAC 1996, p. 124) In context, the term “bad fix” refers in general to a situation in which every attempt to fix a software defect introduces one or more new software defects, making it impossible to remove all defects regardless of how much effort is expended.



Jack Ganssle, a respected expert on embedded systems, makes it clear that “spaghetti” code must be avoided. He writes “A sure sign of disaster is the programmer who doesn’t seem to have a firm grasp of the program’s data and control flow, or who writes spaghetti code.” (Ganssle 1992, pg. 64)

Walsh found that defect rates increased significantly for modules with a McCabe Cyclomatic Complexity value above 10 on an embedded computer system (Walsh 1979, pg. 766).

Scholarly data on code complexity metrics is mixed because of the difficulty of performing controlled studies across different application areas, different programming team skill levels, different languages, and other confounding factors. For example, different projects start “counting” defects at different points in the development process, and defect counts are responsive to whether testing is thorough or not (inadequate testing may not find bugs, but that doesn’t mean the software is truly defect-free). Safety critical system developers tend to be a conservative lot (for good reason), and prefer to have less complexity because it makes it easier to inspect and test systems to ensure low defect rates.

I can also draw on my own personal experience conducting more than 130 embedded software design reviews. In essentially all cases when I see a complexity warning sign, I can find a defect within an hour (usually just minutes) of looking at code. But, more importantly, it is relatively straightforward to make the code less complex and more understandable (and therefore less defect prone) with very little effort in almost every case of high complexity that I have encountered. And, usually, once I show the programming team how to simplify, they agree it is better and start the practice themselves on other code. So, in my own extensive consulting and teaching experience, what I have found is that code with complexity warning signs is indicative of a software team that needs to improve in some significant way to create acceptable software – which is an undesirable state for developers of safety critical systems. When I find code that is simple, understandable, and devoid of unnecessary complexity, I know that I have found programmers who in general know what they are doing.

The developers of safety critical software standards feel the same way. They make it clear that complexity must be kept low in safety critical systems. Simplicity is one of the tools that is essential for creating safe systems. All of MISRA Report 5 is devoted to the topic of Software Metrics. It describes MCC (MISRA Report 5, pp. 37-38, p. 60) and gives a maximum acceptable complexity of 15 for safety critical systems.

The FAA recommends using a complexity metric such as MCC (FAA 2000, pg. J-17, J-23). NASA recommends reducing complexity using MCC and design for modularity via having weak coupling (NASA 2004, p. 95).

There are many possible complexity metrics for code. I have found that cyclomatic complexity is a useful predictor of likely code quality.

References:
  • FAA, System Safety Handbook, Appendix J: Software Safety, Federal Aviation Administration, Dec. 2000
  • Ganssle, J., The Art of Programming Embedded Systems, Academic Press, 1992.
  • MISRA, Report 5: Software Metrics, February 1995.
  • NASA-GB-8719.13, NASA Software Safety Guidebook, NASA Technical Standard, March 31, 2004.
  • RAC, Introduction to Software Reliability: a state of the art review, Reliability Analysis Center, 1996.
  • Watson, A. & McCabe, T., (NIST 500-235), Structured Testing: a testing methodology using the cyclomatic complexity metric, Special Publication 500-235, National Institute of Standards and Technology, Sept. 1996




Monday, May 26, 2014

Proper Watchdog Timer Use

Watchdog timers are a prevalent mechanism for helping to ensure embedded system reliability. But they only work if you use them properly. Effective watchdog timer use requires that the failure of any periodic task in the system must result in a watchdog timer reset.

Consequences: Improper use of a watchdog timer leads to a false sense of security in which software task deaths and software task time overruns are not detected, causing possible missed deadlines or partial software failures.

Accepted Practices: 
  • If multiple periodic tasks are in the system, each and every such task must contribute directly to the watchdog being kicked to ensure every task is alive.
  • Use of a hardware timer interrupt to directly kick the watchdog is a bad practice. (There is arguably an exception if the ISR keeps a record of all currently live tasks, as described later.)
  • Inferring task health by monitoring the lowest priority task alone is a bad practice. This approach fails to detect dead high priority tasks.
  • The watchdog timeout period should be set to the shortest practical value. The system should remain safe even if any combination of tasks dies for the entire period of the watchdog timeout value.
  • Every time the watchdog timer reset happens during testing of a fully operational system, that fact should be recorded and investigated. 
Discussion

Briefly, a watchdog timer can be thought of as a counter that starts at some predetermined value and counts down to zero. If the watchdog actually gets to zero, it resets the system in the hopes that a system reset will fix whatever problem has occurred. Preventing such a reset requires “kicking” (terms for this vary) the watchdog periodically to set the count back to the original value, preventing a system reset. The idea is for software to convince the hardware watchdog that the system is still alive, forestalling a reset. The idea is not unlike asking a teenager to call in every couple of hours on a date to make sure that everything is going OK.

Watchdog timer arrangement.

Once the watchdog times out, a system reset is the most common reaction, although in some cases a permanent shutdown of the system is preferable if it is deemed better to wait for maintenance intervention before attempting a restart.

Getting the expected benefit from a watchdog timer requires using it in a proper manner. For example, having a hardware timer interrupt trigger unconditional kicking of the watchdog is a specifically bad practice, because it doesn’t indicate that any software task except the hardware timer ISR is working properly. (By analogy, having your teenager set up a computer to automatically call home with a prerecorded “I’m OK” message every hour on Saturday night doesn’t tell you that she’s really OK on her date.)

For a system with multiple tasks it is essential that each and every task contribute to the watchdog being kicked. Hearing from just one task isn’t enough – all tasks need to have some sort of unanimous “vote” on the watchdog being kicked. Correspondingly, a specific bad practice is to have one task or ISR report in that it is running via kicking the watchdog, and infer that this means all other tasks are executing properly. (Again by analogy, hearing from one of three teenagers out on different dates doesn’t tell you how the other two are doing.) As an example, the watchdog “should never be kicked from an interrupt routine” (MISRA Report 3, p. 38), which in general refers to the bad practice of using a timer service ISR to kick the watchdog.

A related bad practice is assuming that if a low priority task is running, this means that all other tasks are running. Higher priority tasks could be “dead” for some reason and actually give more time for low priority tasks to run. Thus, if a low priority task kicks the watchdog or sets a flag that is the sole enabling data for an ISR to kick the watchdog, this method will fail to detect if other tasks have failed to run in a timely periodic manner.

Monitoring CPU load is not a substitute for a watchdog timer. Tasks can miss their deadlines even with CPU loads of 70%-80% because of bursts of momentary overloads that are to be expected in a real time operating system environment as a normal part of system operation. For this reason, another bad practice is using software inside the system being monitored to perform a CPU load analysis or other indirect health check and kick the watchdog periodically by default unless the calculation indicates a problem. (This is a variant of kicking the watchdog from inside an ISR.)

The system software should not be in charge of measuring workload over time -- that is the job of the watchdog. The software being monitored should kick the watchdog if it is making progress. It is up to the watchdog mechanism to decide if progress is fast enough. Thus, any conditional watchdog kick should be done just based on liveness (have tasks actually been run), and not on system loading (do we think tasks are probably running).

One way to kick a watchdog timer in a multi-tasking system is sketched below:


Key attributes of this watchdog approach are: (1) all tasks must be alive to kick the WDT. If even one task is dead the WDT will time out, resetting the system. (2) The tasks do not keep track of time or CPU load on their own, making it impossible for them to have a software defect or execution defect that “lies” to the WDT itself about whether things are alive. Rather than making the CPU’s software police itself and shut down to await a watchdog kick if something is wrong, this software merely has the tasks report in when they finish execution and lets the WDT properly do its job of policing timeliness. More sophisticated versions of this code are possible depending upon the system involved; this is a classroom example of good watchdog timer use. Where "taskw" is run from depends on the scheduling strategy and how tight the watchdog timer interval is, but it is common to run it in a low-priority task.
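A minimal C sketch of this classroom example follows; the task count, the alive-flag array, and the watchdog kick function are hypothetical placeholders for whatever the real system provides.

  #include <stdint.h>
  #include <stdbool.h>

  #define NUM_TASKS 3u                      /* hypothetical number of periodic tasks */

  extern void kick_hardware_watchdog(void); /* placeholder for the real WDT kick */
  extern void disable_interrupts(void);     /* placeholder concurrency primitives */
  extern void enable_interrupts(void);

  /* One "I ran this period" flag per task. */
  static volatile bool task_alive[NUM_TASKS];

  /* Each periodic task calls this with its own ID when it finishes a cycle. */
  void task_report_alive(uint8_t task_id)
  {
      if (task_id < NUM_TASKS) {
          task_alive[task_id] = true;
      }
  }

  /* taskw: run periodically (commonly as a low priority task). The watchdog
     is kicked only if every task has reported in since the last check. */
  void taskw(void)
  {
      bool all_alive = true;

      disable_interrupts();
      for (uint8_t i = 0u; i < NUM_TASKS; i++) {
          if (!task_alive[i]) {
              all_alive = false;
          }
          task_alive[i] = false;            /* clear flags for the next interval */
      }
      enable_interrupts();

      if (all_alive) {
          kick_hardware_watchdog();         /* otherwise the WDT times out and resets */
      }
  }

If any task stops running, its flag stays false, taskw withholds the kick, and the hardware watchdog times out and resets the system.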

Setting the timing of the watchdog system is also important. If the goal is to ensure that a task is being executed at least every 5 msec, then setting the watchdog timer at 800 msec doesn’t tell you there is a problem until that task is 795 msec late. Watchdog timers should be set reasonably close to the period of the slowest task that is kicking them, with just a little extra time beyond what is required to account for execution variation and scheduling jitter.

If watchdog timer resets are seen during testing they should be investigated. If an acceptable real time scheduling approach is used, a watchdog timer reset should never occur unless there has been system failure. Thus, finding out the root cause for each watchdog timer reset recorded is an essential part of safety critical design. For example, in an automotive context, any watchdog timer event recordings could be stored in the vehicle until it is taken in for maintenance. During maintenance, a technician’s computer should collect the event recordings and send them back to the car’s manufacturer via the Internet.

While watchdog timers can't detect all problems, a good watchdog timer implementation is a key foundation of creating a safe embedded control system. It is a negligent design omission to fail to include an acceptable watchdog timer in a safety critical system.

Selected Sources

Watchdog timers are a classical approach to ensuring system reliability, and are a pervasive hardware feature on single-chip microcontrollers for this reason.

An early scholarly reference is a survey paper of existing approaches to industrial process control (Smith 1970, p. 220). Much more recently, Ball discusses the use of watchdog timers, and in particular the need for every task to participate in kicking the watchdog. (Ball 2002, pp. 81-83). Storey points out that while they are easy to implement, watchdog timers do have distinct limitations that must be taken into account (Storey 1996, p. 130). In other words, watchdog timers are an important accepted practice that must be designed well to be effective, but even then they only mitigate some types of faults.

Lantrip sets forth an example of how to ensure multiple tasks work together to use a watchdog timer properly. (Lantrip 1997). Ganssle discusses how to arrange for all tasks to participate in kicking the watchdog, ensuring that some tasks don’t die while others stay alive. (Ganssle 2000, p. 125).

Brown specifically discusses good and bad practices. “I’ve seen some multitasking systems that use an interrupt to tickle the watchdog. This approach defeats the whole purpose for having one in the first place. If all the tasks were blocked and unable to run, the interrupt method would continue to service the watchdog and the reset would never occur. A better solution is to use a separate monitor task that not only tickles the watchdog, but monitors the other system tasks as well.” (Brown 1998 pg. 46).

The MISRA Software Guidelines recommend using a watchdog to detect failed tasks (MISRA Report 1, p. 43), noting that tasks (which they call “processes”) may fail because of noise/EMI, communications failure, software defects, or hardware faults. The MISRA Software Guidelines say that a “watchdog is essential, and must not be inhibited,” while pointing out that returning an engine to idle in a throttle-by-wire application could be unsafe. (MISRA Report 1, p. 49). MISRA also notes that “The consequence of each routine failing must be identified, and appropriate watchdog and default action specified.” (MISRA Report 4 p. 33, emphasis added)

NASA recommends using a watchdog and emphasizes that it must be able to detect death of all tasks (NASA 2004, p. 93). IEC 61508-2 lists a watchdog timer as a form of test by redundant hardware (pg. 115) (without implying that it provides complete redundancy).

Addy identified a task death failure mode in a case study (Addy 1991, pg. 79) due to a task encountering a run-time fault that was not properly caught, resulting in the task never being restarted. Thus, it is reasonably conceivable that a task will die in a multitasking operating system. Inability to detect a task death is a defect in a watchdog timer, and a defective watchdog timer approach undermines the safety of the entire system. With such a defective approach, it would be expected that task deaths or other foreseeable events will go undetected by the watchdog timer.

References:
  • Addy, E., A case study on isolation of safety-critical software, Proc. Conf Computer Assurance, pp. 75-83, 1991.
  • Ball, Embedded Microprocessor Systems: Real World Design, Newnes, 2002.
  • Brown, D., “Solving the software safety paradox,” Embedded System Programming, December 1998, pp. 44-52.
  • Ganssle, J., The Art of Designing Embedded Systems, Newnes, 2000.
  • IEC 61508, Functional Safety of Electrical/Electronic/Programmable Electronic Safety-related Systems (E/E/PE, or E/E/PES), International Electrotechnical Commission, 1998. 
  • Lantrip, D. & Bruner, L., General purpose watchdog timer component for a multitasking system, Embedded Systems Programming, April 1997, pp. 42-54.
  • MISRA, Report 1: Diagnostics and Integrated Vehicle Systems, February 1995.
  • MISRA, Report 3: Noise, EMC and Real-Time, February 1995.
  • MISRA, Report 4: Software in Control Systems, February 1995.
  • NASA-GB-8719.13, NASA Software Safety Guidebook, NASA Technical Standard, March 31, 2004.
  • Smith, Digital control of industrial processes, ACM Computing Surveys, Sept. 1970, pp. 211-241.
  • Storey, N., Safety Critical Computer Systems, Addison-Wesley, 1996.





Monday, May 19, 2014

Task Death In Safety Critical Systems

Safety critical systems must be designed with an expectation that one or more tasks on the same CPU might fail via hanging, might fail to be scheduled, or otherwise might not execute in a periodic manner as intended, leading to a partial software failure.

Consequences:
Failing to monitor each and every task and failing to mitigate task execution faults can lead to a task “dying” without being recovered. If a task dies, it no longer performs its functions even though other tasks are still operational. This can leave the system in an uncontrolled state, cause loss of fail-safe functionality (if the fail-safe is in a dead task), cause stale recorded failure data (if the task collecting that data dies), etc.

Accepted Practice:
  • Each task must be continually monitored for completion within a defined period. This ensures that the task is scheduled and that it is completed on time. A properly designed watchdog timer that monitors all tasks is an accepted way to accomplish this.
  • If a task fails to periodically complete execution, fault mitigation must be performed. Common fault mitigation techniques include restarting the task or restarting the system. 
Discussion

Embedded systems typically have one or more periodic software tasks that must complete execution within some maximum period (each task has its own deadline, and by default the deadline equals the period unless some other deadline is specified). In a properly designed system, careful analysis and testing has been performed to ensure that each task will continue operation and meet its deadline even under worst case operating conditions.

However, competent safety-critical embedded system designers realize that faults may occur, resulting in a partial failure of the software system, and plan accordingly. Each task must be monitored to ensure the following: each task executes at least once in every period that it is supposed to execute; each task finishes execution without abnormally terminating; and each task completes execution within its predetermined deadline. This means that the system must not only make sure that once a task is scheduled to run it meets its deadline, but also make sure that the task is scheduled to run in the first place.
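
As a rough sketch of what such per-task monitoring can look like in C (one possible form; the helper names get_time_ms() and restart_task(), the task periods, and the 25% jitter allowance are all illustrative assumptions rather than features of any particular RTOS):

    #include <stdint.h>

    #define NUM_MONITORED_TASKS 3U

    /* Hypothetical platform hooks; names are illustrative only. */
    extern uint32_t get_time_ms(void);            /* free-running millisecond tick  */
    extern void restart_task(uint8_t task_id);    /* fault mitigation: restart task */

    static const uint32_t task_period_ms[NUM_MONITORED_TASKS] = { 5U, 50U, 500U };
    static uint32_t last_completion_ms[NUM_MONITORED_TASKS];

    /* Each monitored task calls this as the last step of each execution. */
    void task_completed(uint8_t task_id)
    {
        if (task_id < NUM_MONITORED_TASKS) {
            last_completion_ms[task_id] = get_time_ms();
        }
    }

    /* Supervisor check, run periodically: a task that has not completed within
       its period (plus a small allowance for jitter) is treated as dead. */
    void supervisor_check(void)
    {
        const uint32_t now = get_time_ms();
        for (uint8_t i = 0U; i < NUM_MONITORED_TASKS; i++) {
            const uint32_t elapsed = now - last_completion_ms[i];  /* wrap-safe unsigned math */
            if (elapsed > (task_period_ms[i] + (task_period_ms[i] / 4U))) {
                restart_task(i);   /* or escalate to a full system reset */
            }
        }
    }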

As an analogy, consider if you have a set of things you need to do every evening, but have a bad memory. One way you might handle this is to put post-it notes on your fridge to remind yourself what needs to be done every evening before you go to sleep. Let’s say you’re very methodical, so you take each note down in turn, do the action, then put it back on the fridge, in order. One of the notes says “check that basement door is locked” because you’ve found the kids have a habit of forgetting to lock it when they come in from playing. For a while this works fine. But one day you get distracted while locking the basement door and set the note down, forgetting to put it back on the fridge. The next night you might forget to check the door and not even realize it (at least not until you wake up in the middle of the night wondering about it – but embedded CPUs don’t have that feature!). What has happened is that a disruption caused one of your tasks to fall off the scheduling list. Just because you finished the entire list before morning doesn’t mean everything got done, because something was missing from the list. The same thing can happen in a computer if one of its tasks dies and doesn’t get put back on the task to-do list – the computer “thinks” it is running all its tasks, but in reality one of the tasks hasn’t been run because it isn’t even on the to-do list.

Selected Sources

Kleidermacher points out that tasks can die, saying “When a thread faults (for example, due to a stack overflow), the kernel should provide some mechanism whereby notification can be sent to the supervisor thread. If necessary, the supervisor can then make a system call to close down the faulted thread, or the entire process, and restart it. The supervisor might also be hooked into a software ‘watchdog’ setup, whereby thread deadlocks and starvation can be detected as well.” (Kleidermacher 2001, pg. 23).

Ganssle refers to the necessity of checking all tasks by saying “This problem multiplies in a system with an RTOS, as a reliable watchdog monitors all of the tasks. If some of the tasks die but others stay alive – perhaps tickling the [Watchdog Timer] – then the system’s operation is at best degraded.”  (Ganssle 2000, p. 125)

Tasks can be expected to die due to a variety of causes, and this is especially likely to happen in a system without hardware support for memory protection, such as OSEK, because software from one task can accidentally corrupt another task's data. Task death is an expected possibility in such an RTOS. For example, a technical summary of the OSEK operating system shows “Failed” as a task state, indicating that tasks can be expected to fail and need to be restarted to restore system operation. (Feiler 2003, Fig 1, shown below).


OSEK tasks can be expected to fail, requiring a task restart. (Feiler 2003, Fig. 1.)


Another contemporaneous real time operating system, VxWorks, did not support memory protection and did not support automatic task restart. As a result, it was dramatically more prone to subtle but significant partial software failures if software applications misbehaved (Koopman 2008, pp. 220-221, recounting work done several years earlier).  In particular, parts of the system would appear to be working, but key pieces of the system would have failed, giving a superficial appearance of a system that was working (the keyboard accepted commands and displayed them on the computer screen) when in fact the system was mostly dead, and programs could not be run on it. (VxWorks has since added memory protection capabilities, so these experimental results do not necessarily represent current VxWorks systems.)

Safety critical system designers should reasonably expect that a task might die (cease to finish computation in a periodic manner). They should have complete analysis to determine which task deaths might lead to an unsafe system (by default, this is all tasks in the system), and should take proactive measures to detect and recover from one or more tasks dying, regardless of the underlying cause -- even if no specific cause for the task death can be imagined by the designers.

References:
  • Feiler, Real-time application development with OSEK: a review of the OSEK standards, Carnegie Mellon University Software Engineering Institute, CMU/SEI-2003-TN-004, Nov 2003.
  • Ganssle, J., The Art of Designing Embedded Systems, Newnes, 2000.
  • Kleidermacher, D. & Griglock, M., Safety-Critical Operating Systems, Embedded Systems Programming, Sept. 2001, pp. 22-36.
  • Koopman, P., DeVale, K. & DeVale, J., "Interface robustness testing: experiences and lessons learned from the Ballista Project," In: Kanoun, K. & Spainhower, L., Eds., Dependability Benchmarking for Computer Systems, IEEE Press, 2008, pp. 201-226.


Monday, May 12, 2014

Real Time Scheduling Analysis for Critical Systems

Critical embedded software must be created using methodical scheduling analysis that ensures that every task can meet its deadline under worst-case fault-free operational conditions. Looking just at CPU load is not only inadequate to determine if you will meet real time deadlines -- it is a specifically bad practice that is to be avoided.

Consequences:
Failing to do an accurate real time schedule design for a safety critical system means that designers do not know if the system will miss deadlines. It is impractical in a complex system (which means almost any real-world product) to do enough testing with a multitasked RTOS-based system to ensure that deadlines will be met under worst case conditions. Thus, a design team that fails to do scheduling analysis should appreciate that they have an unknown probability of missing deadlines, and can reasonably expect to miss deadlines in the worst case if the CPU load is above about 69% for a large number of non-harmonic tasks.

 Accepted Practices:
  • Using Rate Monotonic Analysis (RMA) when using an RTOS with multi-tasking capability, or using some other mathematically sound scheduling analysis. Rate Monotonic Scheduling (RMS) is what you get as a result of RMA.
  • Documenting the RMA analysis including at least listing: all tasks, WCET for each task, period of each task, and worst case system blocking time. Assumptions used in the analysis should be stated and justified (for example, an assumption that no low priority task blocks a high priority task must be justified via written explanation, or if untrue then advanced techniques must be used to ensure schedulability). CPU utilization must be less than 100% even if a set of favorable assumptions such as harmonic task periods can be documented to hold, and may need to be restricted to as low as 69.3% CPU used (meaning 30.7% unused CPU capacity) in the general case.
  • A specifically bad practice is basing real time performance decisions solely on spare capacity (e.g., “CPU is only 80% loaded on average”) in the absence of mathematical scheduling analysis, because it does not guarantee safety critical tasks will meet their deadlines. Similarly, monitoring spare CPU capacity as the only way to infer whether deadlines are being met is a specifically bad practice, because it does not actually tell you whether high frequency deadlines are being met or not.
  • Permitting more than one instance of a real-time task to queue is a specifically bad practice, because this can only happen when real time deadlines are being missed. This practice is a bandaid for a real-time system, and indicates that the system is missing its real-time deadlines.
Discussion

Real time scheduling is a mathematical approach to ensuring that every task in a real time embedded system meets its deadlines under all specified operating conditions. Using a mathematical approach is required because testing can only exercise some of the system operating conditions. There are almost always real time scheduling situations that can’t be adequately tested in a complex piece of software, requiring either simplifying the system to make testing feasible, or using a rigorous mathematical approach to ensure that complex scheduling is guaranteed to work. When a multi-tasking real time operating system such as OSEK is used, a mathematical approach must be used to ensure deadlines are met.

Rate Monotonic Analysis

The generally accepted method for scheduling critical real time systems is to perform Rate Monotonic Analysis (RMA), possibly with one of a number of adaptations for special circumstances, and create a system with a Rate Monotonic Schedule. RMA has the virtue of providing a simple set of rules that guarantees all tasks in a system will meet their deadlines. To achieve this, RMA requires rules to be followed such as all tasks having a defined fastest time period at which they will run, and the period of tasks being harmonic multiples to permit using 100% of the CPU capacity. (Task periods are considered harmonic if they are exact multiples of each other. For example, periods of 1, 10, 100, 1000 msec are harmonic, but 1, 9, 98, 977 msec are not harmonic.) To implement RMA, designers sort tasks based on period, and assign the fastest task to the highest priority, second fastest task to the second highest priority, and so on. If a designer wishes to have multiple tasks at the same priority that is acceptable, but it is required that all tasks at a given priority execute at the same period. 

The benefit of RMA is that there is a mathematical proof that all tasks will meet deadlines if a certain set of rules is followed. If any of the rules is bent, however, then more complex mathematical analysis must be performed or other special techniques used to ensure that deadlines will be met. In particular, if task periods are not harmonic multiples of each other, guaranteeing that deadlines will be met requires leaving slack capacity on the CPU. For a system with many tasks, RMA can only guarantee that task deadlines will be met if the CPU load is a maximum of 69.3%. (For this reason, it is common for embedded systems to use harmonic multiples of task periods.)
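
For reference, the 69.3% figure comes from the Liu & Layland utilization bound U(n) = n * (2^(1/n) - 1), which approaches ln(2), about 0.693, as the number of tasks grows. The small self-contained C program below (illustrative only) computes that bound:

    #include <math.h>
    #include <stdio.h>

    /* Liu & Layland utilization bound for n independent periodic tasks with
       deadlines equal to periods: U(n) = n * (2^(1/n) - 1). */
    static double rma_utilization_bound(unsigned int n)
    {
        return (double)n * (pow(2.0, 1.0 / (double)n) - 1.0);
    }

    int main(void)
    {
        for (unsigned int n = 1U; n <= 8U; n++) {
            printf("n = %u  bound = %.3f\n", n, rma_utilization_bound(n));
        }
        /* Prints 1.000 for n = 1, 0.828 for n = 2, ... trending toward 0.693. */
        return 0;
    }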


Example execution of a system scheduled with RMA.
(Source: http://blogs.abo.fi/alexeevpetr/2011/11/24/generating-multithread-code-from-simulink-model-for-embedded-target/)

Employing RMA requires at least the following steps at design time: listing all tasks in the system including interrupt service routines, determining the fastest period of each task, determining the worst case execution time (WCET) of each task, and determining the maximum blocking time of the system (longest time during which interrupts are disabled). Given those numbers, tasks can be assigned priorities and designers can know if the system will always meet its deadlines. Given correct RMA scheduling and a fault-free operational system, every task will complete by the end of its period, so it will never be possible to have a second occurrence of a task enqueued while waiting for a first occurrence to complete.
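
To make those steps concrete, below is a sketch of one standard form of detailed analysis (response time analysis), written in C with illustrative task numbers that are not taken from any real product. It checks whether each task's worst case response time, including blocking and preemption by higher priority tasks, fits within its period.

    #include <stdint.h>
    #include <stdbool.h>

    #define NTASKS 4U

    /* Example task set (illustrative numbers only), sorted rate-monotonically:
       index 0 has the shortest period and therefore the highest priority.
       All times are in microseconds. */
    static const uint32_t period_us[NTASKS] = {  5000U, 10000U, 20000U, 40000U };
    static const uint32_t wcet_us[NTASKS]   = {  1000U,  2000U,  5000U, 10000U };
    static const uint32_t blocking_us       = 200U;  /* worst case interrupts-disabled time */

    /* Classic response-time iteration: R = C + B + sum over higher priority
       tasks of ceil(R/Tj)*Cj, repeated until it converges or exceeds the
       deadline (taken here to equal the period). */
    static bool task_meets_deadline(uint8_t i)
    {
        uint32_t response = wcet_us[i] + blocking_us;
        for (;;) {
            uint32_t next = wcet_us[i] + blocking_us;
            for (uint8_t j = 0U; j < i; j++) {
                /* integer ceiling of response / period_us[j] */
                const uint32_t releases = (response + period_us[j] - 1U) / period_us[j];
                next += releases * wcet_us[j];
            }
            if (next > period_us[i]) { return false; }  /* deadline missed */
            if (next == response)    { return true;  }  /* converged: deadline met */
            response = next;
        }
    }

    bool task_set_is_schedulable(void)
    {
        for (uint8_t i = 0U; i < NTASKS; i++) {
            if (!task_meets_deadline(i)) { return false; }
        }
        return true;
    }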

If the CPU is over-subscribed, there are established methods of ensuring that critical tasks always complete while non-critical tasks get whatever CPU resources are left. The simplest method is to simply assign all non-critical tasks priorities lower than the lowest priority critical task (which works so long as those non-critical tasks can’t substantively delay the operation of the critical tasks). If it is important to share available CPU time across a number of noncritical tasks even when the CPU is overloaded, this can be done by having non-critical tasks take turns at the lowest priority. Note that an overloaded CPU does not cause any critical tasks to miss their deadlines in this scenario; it is simply a matter of making non-critical tasks wait a bit longer to execute if the CPU is overloaded. (This assumes that the CPU can handle worst case demand from critical tasks. This assumption must be ensured by the design process for the system to be safe.)

Less Than 100% CPU Load Does Not Guarantee Deadlines Are Met

A specifically bad practice is looking at idle time during testing to determine whether or not the CPU is overloaded, and inferring from less than 100% usage that the system will meet its deadlines. That simply does not account for the worst case, or potentially even just an infrequent heavy load case.

As a non-computer example of missing deadlines with less than 100% loading, let’s say you want to spend five days per week working and three days out of every 12 volunteering at a homeless shelter (the fact that it is out of 12 days instead of out of 14 days makes these periods non-harmonic). Because work has a period of 7 days, it will have higher priority than the 12 day volunteer service period. This means you’ll complete all of your work the first 5 days (Monday – Friday) out of each 7-day work week before you start volunteering on weekends. Most weeks this will work fine (you’ll have enough time for both). But when the 12-day volunteer period starts on a Monday, you’ll only have one weekend (two days) in the 12 day period for volunteering that runs Monday of one week through Friday of the next week.  Thus you’ll be a day short for volunteering whenever the volunteer period starts on a Monday. The amount of time you’ve committed is 5 days out of 7 plus 3 days out of 12:  5/7 + 3/12 = 96.4%. But you’re going to come up a whole day short on volunteering once in a while even though you are less than 100% committed and you are taking some weekend days off, performing neither task. You could solve this by committing 4 days out of 14 to volunteering, which is actually a slightly higher workload (100% total), but changes the period to be harmonic with the weekly work cycle. (3.5 days out of every 14 would also work.) Thus, as shown by this example, non-harmonic task periods can result in missed deadlines even with a workload that is less than 100%.

Example of 79% loaded CPU missing a task deadline with only 4 tasks.

As a computer-based example, the above figure shows a task in a four-task system missing its deadline on a system with non-harmonic periods even though the CPU is only 79% loaded. This scenario would happen rarely, and only when all the tasks’ periods synchronize, which for this example would be only once every Least Common Multiple of (19, 24, 29, 34) = 224,808 time units – and even then only if each task actually took its worst case time to execute. Thus, while Task 4 would be expected to miss its deadline in operation, detecting that situation might be difficult with limited duration testing.

Selected Sources

Rate Monotonic Scheduling was developed to address the problem of creating an easy-to-follow recipe for ensuring that real time schedules can be met in a system with multiple tasks. Influential early papers include (Liu & Layland 1973 and Lehoczky et al. 1989).

Ganssle recommends the use of RMA as early as 1992 (Ganssle 1992, pp. 200-201) and sketches its use, giving a reference for more details.

Douglass provides a pattern for static priority real time scheduling and states: “the most common policy for the selection of policies is rate monotonic scheduling or RMS”.  (Douglass 2002, section 5.9.5; note that rate monotonic scheduling is what you get as a result of rate monotonic analysis).

In the context of safety critical operating systems, Kleidermacher says “Rate monotonic analysis (RMA) is frequently used by system designers to analyze and predict the timing behavior of systems.” (Kleidermacher 2001, pg. 30).

MISRA Report 3 discusses the use of real-time kernels (which for our purposes are operating systems that use some sort of scheduling approach). It notes that there are a number of accepted scheduling techniques, and that the use of “best available technology” such as, for example, a certified RTOS brings benefit, providing a reference to a number of texts.

References:
  • Douglass, Doing Hard Time: Developing Real-Time Systems with UML, Objects, Frameworks, and Patterns, Addison-Wesley Professional, 1999.
  • Ganssle, J., The Art of Programming Embedded Systems, Academic Press, 1992.
  • Kleidermacher, D. & Griglock, M., Safety-Critical Operating Systems, Embedded Systems Programming, Sept. 2001, pp. 22-36.
  • Lehoczky, J.; Sha, L.; Ding, Y. "The rate monotonic scheduling algorithm: exact characterization and average case behavior", IEEE Real-Time Systems Symposium, 1989, pp. 166–171.
  • Liu, C. L.; Layland, J., "Scheduling algorithms for multiprogramming in a hard real-time environment", Journal of the ACM, 1973, (1): 46–61
  • MISRA, Report 3: Noise, EMC and Real-Time, February 1995.

Monday, May 5, 2014

Mitigating Data Corruption

As previously discussed, data corruption from bit flips can happen from a variety of sources. The results from such faults can be catastrophic. An effective technique should be used to detect both software- and hardware-caused corruption of critical data. Effective techniques include error coding and data replication in various forms.

Consequences:
Unintentional modification of data values can cause arbitrarily bad system behavior, even if only one bit is modified. Therefore, safety critical systems should take measures to prevent or detect such corruptions, consistent with the estimated risk of hardware-induced data corruption and anticipated software quality. Note that even best practices may not prevent corruption due to software concurrency defects.

Accepted Practices:

  • When hardware data corruption may occur and automatic error correction is desired, use hardware single error correction/multiple error detection (SECMED) circuitry, a form of Error Detection and Correction (EDAC) circuitry sometimes just called Error Correcting Code (ECC) circuitry, for all bytes of RAM. This protects against hardware memory corruption, including hardware corruption of operating system variables. However, it does not protect against software memory corruption.
  • Use a software approach such as a cyclic redundancy check (CRC, preferred) or a checksum to detect a corrupted program image, and test for corruption at least when the system is booted.
  • Use a software approach such as keeping redundant copies to detect software data corruption of RAM values.
  • Use fault injection to test data corruption detection and correction mechanisms.
  • Perform memory tests to ensure there are no hard faults.

Discussion:

Safety critical systems must protect against data corruption to avoid small changes in data which can render the system unsafe. Even a single bit in memory changing in the wrong way could cause a system to change from being safe to unsafe. To guard against this, various schemes for memory corruption detection and prevention are used.

Effect of a bit flip (Source: http://dependablesystem.blogspot.com/2012/05/flip-happens.html)

Hardware and Software Are Both Corruption Sources

Hardware memory corruption occurs when a radiation event, voltage fluctuation, source of electrical noise, or other cause makes one or more bits flip from one value to another. In non-volatile memory such as flash memory, wearout, low programming voltage, or electrical charge leakage over time can also cause bits in memory to have an incorrect value. Mitigation techniques for these types of memory errors include the use of hardware error detection/correction codes (sometimes called “EDAC”) for RAM, and typically the use of a “checksum” for flash memory to ensure that all the bytes, when “added up,” give the same total as they did when the program image was stored in flash.
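
As a small illustration of the flash-image check described above, the sketch below uses a simple byte-wise additive checksum. A CRC is preferred in practice, and the flash base address, image length, and stored checksum symbols are assumptions invented for the example (the stored checksum itself would be excluded from the summed region).

    #include <stdint.h>
    #include <stdbool.h>

    /* Illustrative boot-time program image check. Assumes (hypothetically) that
       the build tools store the image length and its expected 32-bit additive
       checksum at known locations. */
    #define IMAGE_START   ((const uint8_t *)0x08000000UL)   /* example flash base */
    extern const uint32_t image_length;      /* assumed to be provided by the build */
    extern const uint32_t image_checksum;    /* assumed stored expected value       */

    static uint32_t compute_checksum(const uint8_t *p, uint32_t len)
    {
        uint32_t sum = 0U;
        for (uint32_t i = 0U; i < len; i++) {
            sum += p[i];          /* simple byte-wise additive checksum */
        }
        return sum;
    }

    /* Call at boot; a mismatch means the program image is corrupted. */
    bool program_image_is_intact(void)
    {
        return compute_checksum(IMAGE_START, image_length) == image_checksum;
    }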

If hardware memory error detection support is not available, RAM can also be protected with some sort of redundant storage. A common practice is to store two copies of a value in two different places in RAM, often with one copy inverted or otherwise manipulated. It is important not to store the two copies next to each other, so that an error which corrupts adjacent bits in memory cannot corrupt both copies. Rather, there should be two entirely different sections of memory for mirrored variables, with each section having only one copy of each mirrored variable. That way, if a small chunk of memory is arbitrarily corrupted, it can at most affect one of the two copies of any mirrored variable. Error detection codes such as checksums can also be used, and provide a tradeoff of increased computation time when a change is made vs. requiring less storage space for error detection information as compared to simple replication of data.

Software memory corruption occurs when one part of a program mistakenly writes data to a place that should only be written to by another part of the program due to a software defect. This can happen as a result of a defect that produces an incorrect pointer into memory, due to a buffer overflow (e.g., trying to put 17 bytes of data into a 16 byte storage area), due to a stack overflow, or due to a concurrency defect, among other scenarios.

Hardware error detection does not help in detecting software memory corruption, because the hardware will ordinarily assume that software has permission to make any change it likes. (There may be exceptions if hardware has a way to “lock” portions of memory from modifications, which is not the case here.) Software error detection may help if the corruption is random, but may not help if the corruption is a result of defective software following authorized RAM modification procedures that just happen to point to the wrong place when modifications are made. While various approaches to reduce the chance of accidental data corruption can be envisioned, acceptable practice for safety critical systems in the absence of redundant computing hardware calls for, at a minimum, storing redundant copies of data. There must also be a recovery plan such as system reboot or restoration to defaults if a corruption is detected.

Data Mirroring

A common approach to providing data corruption protection is to use a data mirroring approach in which a second copy of a variable having a one’s complement value is stored in addition to the ordinary variable value. A one’s complement representation of a number is computed by inverting all the bits in a number. So this means one copy of the number is stored normally, and the second copy of that same number is stored with all the bits inverted (“complemented”). As an example, if the original number is binary “0000” the one’s complement mirror copy would be “1111.” When the number is read, both the “0000” and the “1111” are read and compared to make sure they are exact complements of each other. Among other things, this approach gives protection against a software defect or hardware corruption that sets a number of RAM locations to all be the same value. That sort of corruption can be detected regardless of the constant corruption value put into RAM, because two mirrored copies can’t have the same value unless at least one of the pair has been corrupted (e.g., if all zeros are written to RAM, both copies in a mirrored pair will have the value “0000,” indicating a data corruption has occurred).
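
A minimal sketch of this mirroring scheme in C, for a single 16-bit value (the variable names are illustrative; in a real system the primary and mirrored copies would be placed in separate memory sections, for example via linker directives):

    #include <stdint.h>
    #include <stdbool.h>

    static uint16_t speed_primary;   /* normal copy                       */
    static uint16_t speed_mirror;    /* bit-inverted (complemented) copy  */

    void speed_write(uint16_t value)
    {
        speed_primary = value;
        speed_mirror  = (uint16_t)~value;   /* store the one's complement */
    }

    /* Returns true and puts the value in *out if the two copies agree;
       returns false if a corruption has been detected. */
    bool speed_read(uint16_t *out)
    {
        const uint16_t primary = speed_primary;
        const uint16_t mirror  = speed_mirror;
        if (primary == (uint16_t)~mirror) {
            *out = primary;
            return true;
        }
        return false;   /* mismatch: invoke the system's recovery plan */
    }

If speed_read() returns false, the system's recovery plan (such as restoring a default value, resetting, or entering a fail-safe state) should be invoked.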

Mirroring can also help detect hardware bit flip corruptions. A bit flip is when a binary value (a 0 or 1), is corrupted to have the opposite value (changing to a 1 or 0 respectively), which in turn corrupts the value of the number stored at the memory location suffering one or more bit flips. So long as only one of two mirror values suffers a bit flip, that corruption will be detectable because the two copies won’t match properly as exact complements of each other.

A good practice is to ensure that the mirrored values are not adjacent to each other so that an erroneous multi-byte variable update is less likely to modify both the original and mirrored copy. Such mirrored copies are vulnerable to a pair of independent bit flips that just happen to correspond to the same position within each of a pair of complemented stored values. Therefore, for highly critical systems a Cyclic Redundancy Check (CRC) or other more advanced error detection method is recommended.
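
For the CRC-based alternative, one possible form is to group critical variables into a structure and keep a CRC over that group, recomputing it on every update and verifying it before use. The sketch below uses the standard bitwise CRC-32 (polynomial 0xEDB88320); the structure fields and function names are made up for the example.

    #include <stdint.h>
    #include <stdbool.h>
    #include <stddef.h>

    /* Standard reflected CRC-32 (IEEE polynomial), bitwise implementation. */
    static uint32_t crc32_update(uint32_t crc, const uint8_t *data, uint32_t len)
    {
        crc = ~crc;
        for (uint32_t i = 0U; i < len; i++) {
            crc ^= data[i];
            for (uint8_t bit = 0U; bit < 8U; bit++) {
                crc = (crc & 1U) ? ((crc >> 1) ^ 0xEDB88320UL) : (crc >> 1);
            }
        }
        return ~crc;
    }

    typedef struct {
        uint16_t target_speed;
        uint16_t brake_pressure;
        uint32_t stored_crc;      /* CRC over the fields above */
    } critical_data_t;

    static critical_data_t critical;

    static uint32_t critical_crc(const critical_data_t *d)
    {
        /* CRC covers everything except the stored CRC field itself. */
        return crc32_update(0U, (const uint8_t *)d, offsetof(critical_data_t, stored_crc));
    }

    void critical_set_speed(uint16_t speed)
    {
        critical.target_speed = speed;
        critical.stored_crc = critical_crc(&critical);   /* recompute after each update */
    }

    bool critical_data_is_intact(void)
    {
        return critical_crc(&critical) == critical.stored_crc;
    }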

It is important to realize that all memory values that can conceivably cause a system hazard need to be protected by mirroring, not just a portion of memory. For example a safety-critical Real Time Operating System will have values in memory that control task scheduling. Corruption of these variables can lead to task death or other problems if the RTOS doesn't protect data integrity, even if the application software does use mirroring. Note that there are multiple ways for an RTOS to protect its data integrity from software and hardware defects beyond this, such as via using hardware access protection. But, if the only mechanism being used in a system to prevent memory corruption is mirroring, the RTOS has to use it too or you have a vulnerability.

Selected Sources

Automotive electronics designers knew as early as 1982 that data corruption could be expected in automotive electronics. Seeger writes: “Due to the electrically hostile environment that awaits a microprocessor based system in an automobile, it is necessary to use extra care in the design of software for those systems to ensure that the system is fault tolerant. Common faults that can occur include program counter faults, altered RAM locations, or erratic sensor inputs.” (Seeger 1982, abstract, emphasis added). Automotive designers generally accepted the fact that RAM location disruptions would happen in automotive electronics (due to electromagnetic interference (EMI), radiation events, or other disturbances), and had to ensure that any such disruption would not result in an unsafe system.

Stepner, in a paper on real time operating systems that features a discussion of OSEK (the trade name of an automotive-specific real time operating system), states with regard to avoiding corruption of data: “One technique is the redundant storage of critical variables, and comparison prior to being used. Another is the grouping of critical variables together and keeping a CRC over each group.” (Stepner 1999, pg. 155).

Brown says “We’ve all heard stories of bit flips that were caused by cosmic rays or EMI” and goes on to describe a two-out-of-three voting scheme to recover from variable corruption. (Brown 1998 pp. 48-49). A variation of keeping only two copies permits detection but not correction of corruption. Brown also acknowledges that designers must account for software data corruption, saying “Another, and perhaps more common, cause of memory corruption is a rogue pointer, which can run wild through memory leaving a trail of corrupted variables in its wake. Regardless of the cause, the designer of safety-critical software must consider the threat that sometime, somewhere, a variable will be corrupted.” (id., p. 48).

Kleidermacher says: “When all of an application’s threads share the same memory space, any thread could—intentionally or unintentionally— corrupt the code, data, or stack of another thread. A misbehaved thread could even corrupt the kernel’s own code or internal data structures. It’s easy to see how a single errant pointer in one thread could easily bring down the entire system, or at least cause it to behave unexpectedly.” (Kleidermacher 2001, pg. 23). Kleidermacher advocates hardware memory protection, but in the absence of a hardware mechanism, software mechanisms are required to mitigate memory corruption faults.

Fault injection is a way to test systems to see how they respond to faults in memory or elsewhere (see also an upcoming post on that topic). Fault injection can be performed in hardware (e.g., by exposing a hardware circuit to a radiation source or by using hardware test features to modify bit values), or injected via software means (e.g., slightly modifying software to permit flipping bits in memory to simulate a hardware fault). In a research paper, Vinter used a hybrid hardware/software fault injection technique to corrupt bits in a computer running an automotive-style engine control application. The conclusions of this paper start by saying: “We have demonstrated that bit-flips inside a central processing unit executing an engine control program can cause critical failures, such as permanently locking the engine’s throttle at full speed.” (Vinter 2001). Fault injection remains a preferred technique for determining whether there are data corruption vulnerabilities that can result in unsafe system behavior.

References:
  • Brown, D., “Solving the software safety paradox,” Embedded System Programming, December 1998, pp. 44-52.
  • Kleidermacher, D. & Griglock, M., Safety-Critical Operating Systems, Embedded Systems Programming, Sept. 2001, pp. 22-36.
  • Seeger, M., Fault-Tolerant Software Techniques, SAE Report 820106, International Congress & Exposition, Society of Automotive Engineers, 1982, pp. 119-125.
  • Stepner, D., Nagarajan, R., & Hui, D., Embedded application design using a real-time OS, Design Automation Conference, 1999, pp. 151-156.
  • Vinter, J., Aidemark, J., Folkesson, P. & Karlsson, J., Reducing critical failures for control algorithms using executable assertions and best effort recovery, International Conference on Dependable Systems and Networks, 2001, pp. 347-356.