Consequences:
Incorrect or lax use of concurrency techniques can be expected to lead to concurrency bugs. Such bugs are usually difficult to detect via testing and difficult to reproduce once detected. A fraction of any such bugs can be reasonably expected to make it past even extensive testing into production fleets.
Accepted Practices:
- Aggressively minimize the use of globally shared variables. Every variable shared between tasks is a chance for a concurrency defect.
- For every access to a shared variable, treat the entire time that a copy of the variable is “live” in the computation as a critical section, protecting access via masking interrupts or some other well defined technique.
- Avoid concurrency management techniques that are “home brewed” or otherwise not part of well proven practice. Similarly, do not modify well known techniques to optimize efficiency or obtain other perceived benefits. Such techniques are extremely difficult to get right, and altering techniques or using non-standard techniques can be reasonably expected to introduce defects.
- Declare every shared variable “volatile” to ensure that reads and writes do not result in stale data being used due to compiler optimization attempts to improve computation speed.
- Keep critical sections as short as possible to minimize negative effects on scheduling. (The longest critical section sets a lower bound on worst-case blocking time for scheduling purposes.) A sketch illustrating these practices follows this list.
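As an illustration of these practices, below is a minimal sketch in C. It assumes a hypothetical target where interrupts are masked and unmasked through made-up DISABLE_INTERRUPTS()/ENABLE_INTERRUPTS() macros (real compilers and RTOSes provide their own equivalents, such as __disable_irq()/__enable_irq() on ARM Cortex-M), and the shared variable name rx_byte_count is likewise invented for the example; this is a sketch of the pattern, not a drop-in implementation.

/* Minimal sketch of the accepted practices above. The interrupt masking
 * macros are placeholders for whatever the actual compiler/RTOS provides;
 * as written here they expand to nothing. */

#include <stdint.h>

#define DISABLE_INTERRUPTS()   /* placeholder, e.g., __disable_irq() */
#define ENABLE_INTERRUPTS()    /* placeholder, e.g., __enable_irq()  */

/* Shared between an interrupt service routine and task code; declared
 * volatile so every access goes to memory rather than a stale register. */
static volatile uint32_t rx_byte_count = 0u;

/* The entire read-modify-write is inside the critical section, and the
 * critical section contains only the update, keeping it short. */
void count_received_byte(void)
{
    DISABLE_INTERRUPTS();
    rx_byte_count++;
    ENABLE_INTERRUPTS();
}

/* Take a consistent snapshot of the shared variable; the copy is used
 * outside the critical section only as a snapshot, never written back. */
uint32_t get_rx_byte_count(void)
{
    uint32_t snapshot;

    DISABLE_INTERRUPTS();
    snapshot = rx_byte_count;
    ENABLE_INTERRUPTS();

    return snapshot;
}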
Discussion:
One aspect of a modern real time embedded system is that it must appear to do many things at once. For example, an engine controller must look at many inputs, set throttle angle and perform diagnostic checks all at the same time. Typically many of these tasks are written as relatively independent pieces of software that must work together, and they must all appear to run at the same time.
In reality there is only one CPU, so tasks must take turns using that CPU, with that turn-taking supervised by the RTOS. And tasks often must share things such as memory locations and input/output devices. The turn-taking means that, unless software designers are extremely careful, on occasion shared resources can be left in undefined or incorrect states, resulting in concurrency bugs.
Figure: Example concurrency defect. Source: http://blogs.windriver.com/engblom/2010/06/true-concurrency-is-truly-different-again.html (that blog post has a nice discussion of how situation-dependent such defects are)
Avoiding and fixing concurrency bugs is a major source of design and testing effort on most embedded systems. In part this is because concurrency bugs can be quite subtle, and in part it is because they can be very difficult to activate during testing as well as difficult to isolate even when one is observed (if they are observed at all during testing). The difficulty of detecting and fixing concurrency defects, as well as the reasonable probability that they won’t be seen at all in testing, makes disciplined use of good practices essential.
A variety of techniques are available to avoid concurrency problems. A preferred approach is avoiding situations in which concurrency bugs are possible in the first place. For example, avoiding the use of shared global variables avoids the associated concurrency defects (because the shared variables simply aren’t there to begin with). But when sharing can’t be avoided, there are well defined basic techniques that work. Typically such techniques work by “locking” some resource so that other tasks cannot use it. As an analogy, consider a changing room at a clothing store. If you want to make sure that nobody else tries to use the room you are using, you need to “lock” the room when you enter (with an actual door lock, or maybe just by closing the door or curtain all the way), and then “unlock” the room when you leave. If you never lock the door, it may be that nothing bad happens the first dozens or hundreds of times you try on clothes. But if you don’t lock the door, eventually someone will wander into the room by mistake while you are there.
Disabling interrupts is a concurrency design approach that can be thought of as a program “locking” the CPU so that no other task can use it when a shared variable is being accessed. The way this works is that any task wanting to, say, increment a variable first disables interrupts, then increments the variable, then re-enables interrupts. The period of time between the first read of a variable and when the variable is done being updated is known as a “critical section,” and is the time during which no other task can be permitted to access the variable. Disabling interrupts turns off the hardware’s ability to switch tasks or perform anything but the desired computation during the critical section. This ensures that no other task in the system can read or write the shared variable, because disabling interrupts prevents any other task from running. It is essential that every single access to a shared variable disable interrupts for the entire use of the variable for this to be guaranteed to work. If a local copy of the variable is kept and used outside the time during which interrupts are disabled, there are no guarantees as to how the system will behave when that local copy is subsequently used to update the variable. Other techniques are available to manage concurrency beyond disabling interrupts. But, this is a common technique in embedded systems.
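The hazard of keeping a local copy “live” outside the protected window can be made concrete with a small sketch. The names below (error_count, record_error_defective, record_error_correct) and the interrupt masking macros are invented for illustration: the defective version re-enables interrupts while a local copy of the shared variable is still in use, while the correct version covers the entire read-modify-write.

#include <stdint.h>

#define DISABLE_INTERRUPTS()   /* placeholder for the real mechanism */
#define ENABLE_INTERRUPTS()

static volatile uint16_t error_count = 0u;   /* also updated by an ISR */

/* DEFECTIVE: the local copy is updated and written back outside the
 * protected window. If the ISR increments error_count in between, that
 * increment is silently overwritten when temp is stored back. */
void record_error_defective(void)
{
    uint16_t temp;

    DISABLE_INTERRUPTS();
    temp = error_count;          /* protected read...                 */
    ENABLE_INTERRUPTS();

    temp = temp + 1u;            /* ...unprotected update of the copy */

    DISABLE_INTERRUPTS();
    error_count = temp;          /* write-back can lose ISR updates   */
    ENABLE_INTERRUPTS();
}

/* CORRECT: interrupts stay disabled for the entire time the value is
 * "live", so no other task or ISR can interleave with the update. */
void record_error_correct(void)
{
    DISABLE_INTERRUPTS();
    error_count++;
    ENABLE_INTERRUPTS();
}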
Even when using these techniques, special care must be taken in accessing any shared resource. For example, the keyword volatile must be used for every shared variable to ensure that the most up-to-date copy is always accessed. (But even this won’t help if that copy is updated at an unexpected time.)
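A brief sketch of why volatile matters, using an invented flag name (tick_elapsed) and a stand-in ISR (timer_isr): without the volatile qualifier, an optimizing compiler may legally read the flag once, keep it in a register, and never observe the ISR’s update.

#include <stdbool.h>

/* Set by a timer interrupt service routine, polled by task code. The
 * volatile qualifier forces every read in the loop to go to memory. */
static volatile bool tick_elapsed = false;

void timer_isr(void)             /* stand-in for a real ISR */
{
    tick_elapsed = true;
}

void wait_for_tick(void)
{
    while (!tick_elapsed) {
        /* busy-wait; each iteration re-reads tick_elapsed */
    }

    /* Clearing the flag is itself an access to a shared variable and,
     * per the practices above, may need its own protection in a design
     * where the ISR could set it again at an unexpected time. */
    tick_elapsed = false;
}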
Selected Sources:
Ball describes concurrency defects in terms of being race conditions and prescribes disabling interrupts to solve the problem. (Ball 2002, p. 162)
Douglass gives a pattern for a critical section in section 7.2, and in section 7.2.6 says that the most common way to prevent context switching during a critical section is to disable interrupts. In section 7.2.5, Douglass says: “The designers and programmers must show good discipline in ensuring that every resource access locks the resource before performing any manipulation of the resource.” (Douglass 2002)
MISRA recommends that developers “Use Test-and-Set instructions or a signaling mechanism, such as Dekker/Dijkstra/Lamport Semaphores, to protect and mark as ‘in-use’ any common resources.” (MISRA Report 3 p. 26) In more modern terminology, this is a recommendation to use a mutex or related semaphore-based “lock” on data. MISRA also cautions that interrupt enable and disable instructions must be used with care. (id.)
Sullivan presents results of a study of defect types, concluding that 11% of high impact memory corruption errors are due to concurrency defects. (Sullivan 1991, p. 6) This means that while most defects are easier to track down, a few race conditions and other concurrency defects can be expected to happen.
Concurrency defects are so difficult to find that specific testing and analysis tools have been developed to find them, prompting the creation of a benchmark suite to evaluate such tools (Jalbert 2011). A significant challenge to creating such a benchmark is the difficulty in reproducing such bugs even when the exact bug is completely understood. (id., first page)
Park et al. provide a summary of work on finding and fixing concurrency defects (Park 2010). Among other things they note that a concurrency defect was responsible in part for the 2003 Northeastern US electricity blackout that left 10 million people without power (id. p. 245), and that such bugs are difficult to reproduce (id.).
References:
- Ball, Embedded Microprocessor Systems: Real World Design, Newnes, 2002.
- Douglass, B. P., Real-Time Design Patterns: robust scalable architecture for real-time systems, Pearson Education, first printing, September 2002, copyright by Pearson in 2003.
- Jalbert, RADBench: a concurrency bug benchmark suite, HotPar 2011, pp. 2-8.
- MISRA, Report 3: Noise, EMC and Real-Time, February 1995.
- Park, Falcon: fault localization in concurrent programs, ICSE 2010, pp. 245-254.
- Sullivan & Chillarege, Software defects and their impact on system availability: a study of field failures in operating systems, Fault Tolerant Computing Symposium, 1991, pp 1-9.