Somewhere in your embedded system is the stack (or several
of them for some multitasking systems). If you blow up the stack, your software
will crash. Or worse, especially if
you don’t have memory protection. For a critical system you need to make sure
the stack has some elbow room and make sure that you know when you have a stack
size problem.
Also, you shouldn’t ever use recursion in a safety critical
embedded system. (This shouldn’t even need to be said, but apparently it does.)
Consequences: The
consequences of not understanding maximum stack depth can be a seemingly random
memory corruption as the stack overwrites RAM variables. Whether or not this
actually causes a program crash depends upon a number of factors, including
chance, and in the worse case it can cause unsafe program behavior without a
crash.
Accepted
Practices:
- Compute worst case stack depth using a static analysis tool.
- Include a stack sentinel or a related technique in the supervisor task and perform a graceful shutdown or reset prior to an actual overflow.
- Avoid all recursion to ensure that worst case stack depth is bounded.
The “stack” is an area in memory used for storing certain
types of data. For example, in the C programming language this is where
non-static local variables go. The stack gets bigger every time a subroutine is
called, usually holding the subroutine return address, parameter values that
have been passed, and local variables for each currently active subroutine. As
nested subroutines are called, the stack keeps getting bigger as more
information is added to the stack to perform each deeper and deeper call. When
each subroutine is completed, the stack information is returned to the stack
area for later use, un-nesting the layers of stack usage and shrinking the
stack size. Thus the maximum size of the stack is determined by how many
subroutines are called on top of each other (the depth of the subroutine call
graph), as well as the storage space needed by each of those subroutines, plus
additional area needed by any interrupt service routines that may be active at
the same time.
Some processors have a separate hardware stack for
subroutine return information and a different stack for parameters and
variables. And many operating systems have multiple stacks in support of
multiple tasks. But for the most part similar ideas apply to all embedded
controllers, and we’ll just discuss the single-stack case to keep things simple.
It is common for stacks to grow “downward” from high memory
locations to low locations, with global variables, Real Time Operating System
(RTOS) task information, or other values such as the C heap being allocated
from the low memory locations to high memory locations. There is an unused
memory space in between the stack and the globals. In other words, the stack
shares limited RAM space with other variables. Because RAM space is limited,
there is the possibility of the stack growing so large that it overlaps variable
storage memory addresses so that the stack corrupts other memory (and/or loads
and stores to global memory corrupt the stack). To avoid this, it is accepted
practice to determine the worst case maximum stack depth and ensure such
overlap is impossible.
Maximum stack depth can be determined by a number of means.
One way is to periodically sample the stack pointer while the program is
running and find the maximum observed stack size. This approach is a starting
point, but one should not expect that sampling will happen to catch the
absolute maximum stack depth. Catching the worst case can be quite difficult
because interrupt service routines also use the stack and run at times that are
generally unpredictable compared to program execution flow. (Moreover, if you
use a timer interrupt to sample the stack pointer value, you’ll never see the
stack pointer when the timer interrupt is masked, leaving a blind spot in this
technique.) So this is not how you should determine worst case stack depth.
A much better approach is to use static analysis tools specifically
designed to find the worst case set of subroutine calls in terms of stack
depth. This technique is effective unless the code has structures that confound
the analysis (e.g., if a program uses recursion, this technique generally
doesn’t work). When the technique does work, it has the virtue of giving an
absolute bound without having to rely upon whether testing happened to have
encountered the worst case. You should use this technique if at all possible.
If static analysis isn’t possible because of how you have written your code,
you should change your code so you actually can do static analysis for maximum
stack depth. In particular, this means you should never use recursion in a critical system both because it
makes static analysis of stack depth impossible. Moreover, recursive routines
are prone to stack overflows even if they are bug free, but happen to be fed an
exceptional input value that causes too many levels of recursion. (In general,
using recursion in any small-microcontroller embedded system is a bad idea for
these reasons even if the system is not safety critical.)
Beyond static analysis of stack depth (or if static analysis
isn’t possible and the system isn’t safety critical), an additional accepted
practice is to use a “stack sentinel” to find the “high water mark” of the
stack (I’ve also heard this called a “stack watermark” and a “stack guard”).
With this technique each memory location in the stack area of memory is
initialized to a predetermined known value or pattern of known values. As the
stack grows in size, it will overwrite these known sentinel values with other
values, leaving behind new patterns in memory that remain behind, like
footprints in fresh snow. Thus, it is easy to tell how big the stack has ever
been since the system was started by looking for how far into the unused memory
space the “footprints” of stack usage trample past the current stack size. This
both gives the maximum stack size during a particular program run, and the
overall system maximum stack size assuming that the worst case path has been
executed. (Actually identifying the worst case path need not be done – it is
sufficient to have run it at some time during program execution without knowing
exactly when it happened. It will leave its mark on the stack memory contents
when it does happen.)
To convert this idea into a run-time protection technique,
extra memory space between the stack and other data is set up as a sacrificial
memory area that is there to detect unexpected corruptions before the stack can
corrupt the globals or other RAM data. The program is run, and then memory
contents are examined periodically to see how many stack area memory words are
left unchanged. As part of the design validation process, designers compare the
observed worst case stack depth against the static analysis computed worst case
stack depth to ensure that they understand their system, have tested the worst
case paths, and have left adequate margin in stack memory to prevent stack
overflows if they missed some special situation that is even worse than the
predicted worst case.
This sentinel technique should also be used in testing and
in production systems to periodically check the sentinels in the sacrificial
memory area. If you have computed a worst case stack depth at design time and
you detect that the computed depth has been exceeded at run time, this is an
indication that you have a very serious problem.
If the stack overflows past the sacrificial memory space
into global variables, the system might crash. Or that might just corrupt
global variables or RTOS system state and have who-knows-what effect. (In a
safety critical system “who-knows-what” equals “unsafe.”) Naively assuming that you will always get a
system crash detectable by the watchdog on a stack overflow is a dangerous
practice. Sometimes the watchdog will catch a stack overflow. But there is no
guarantee this will always happen. Consider that a stack-smashing attack is the
security version of an intentional stack overflow, but is specifically designed
to take over a system, not just merely crash it. So a crash after a stack
overflow is by no means a sure thing.
Avoiding stack overflow problems is a matter of considering
program execution paths to avoid a deep sequence of calls, and accounting for
interrupts adding even more to the stack. And, again, even though we shouldn’t
have to say this – never using recursion.
While there isn’t a set number for how much margin to leave
in terms of extra memory past the computed maximum stack size, there are two
considerations. First, you want to leave enough room to have an ample
sacrificial area so that a problem with stack depth is unlikely to have enough
time to go all the way through the sacrificial area and touch the globals
before it is detected by a periodic timer tick checking sentinels. (Note: we
didn’t say check current stack depth; we said periodically check sentinels to
see if the stack has gotten too big between checks.) Also, you want to leave some extra margin to
account for the possibility you just never encountered the worst-case stack
depth in analysis or testing. I’ve heard designers say worst case stack depth
of 50% of available stack memory is a good idea. Above 90% use of stack memory
(10% sacrificial memory area set aside) is probably a bad idea. But the actual
number will depend on the details of your system.
Selected Sources:
Stack sentinels and avoidance of recursion are an entrenched
part of embedded systems folklore and accepted practices. Douglass mentions
watermarking in Section 9.8.6 (Douglass 2002).
MISRA Software Guidelines discourage recursion, saying that
it can cause unpredictable behavior (MISRA Guidelines, pg. 20).
MISRA C required rule 70 explicitly bans recursion:
NASA recommends using stack guards (essentially the same as
the technique that I call “stack sentinels”) to check for stack overflow or
corruption (NASA 2004, p. 93).
Stack overflow errors are well known to corrupt memory and
cause arbitrarily bad behavior of a running program. Regehr et al. provide an
overview of research in that area relevant to embedded microcontrollers and
ways to mitigate those problems (Regehr 2006). He notes that “the potential for
stack overflow in embedded systems is hard to detect by testing.” (id., p.
776), with the point being that it is reasonable to expect that a system which
has been tested but not thoroughly analyzed will have run-time stack overflows
that corrupt memory.
References:
- Douglass, B. P., Real-Time Design Patterns: robust scalable architecture for real-time systems, Pearson Education, first printing, September 2002, copyright by Pearson in 2003.
- MISRA, (MISRA C), Guideline for the use of the C Language in Vehicle Based Software, April 1998.
- MISRA, Development Guidelines for Vehicle Based Software, November 1994 (PDF version 1.1, January 2001).
- NASA-GB-8719.13, NASA Software Safety Guidebook, NASA Technical Standard, March 31, 2004.
- Regehr et al., Eliminating stack overflow by abstract interpretation, ACM Trans. Embedded Computing Systems, Nov. 2006, pp. 751-778.