Wednesday, July 23, 2014

Don’t Overflow the Stack

Somewhere in your embedded system is the stack (or several of them for some multitasking systems). If you blow up the stack, your software will crash. Or worse, especially if you don’t have memory protection. For a critical system you need to make sure the stack has some elbow room and make sure that you know when you have a stack size problem.

Also, you shouldn’t ever use recursion in a safety critical embedded system. (This shouldn’t even need to be said, but apparently it does.)

Consequences: The consequences of not understanding maximum stack depth can be seemingly random memory corruption as the stack overwrites RAM variables. Whether or not this actually causes a program crash depends upon a number of factors, including chance, and in the worst case it can cause unsafe program behavior without a crash.

Accepted Practices:
  • Compute worst case stack depth using a static analysis tool.
  • Include a stack sentinel or a related technique in the supervisor task and perform a graceful shutdown or reset prior to an actual overflow.
  • Avoid all recursion to ensure that worst case stack depth is bounded.
Discussion:
The “stack” is an area in memory used for storing certain types of data. For example, in the C programming language this is where non-static local variables go. The stack grows every time a subroutine is called, typically holding the subroutine return address, parameter values that have been passed, and local variables for each currently active subroutine. As nested subroutines are called, the stack keeps growing as more information is pushed onto it to perform each deeper and deeper call. When each subroutine completes, its stack space is released for later use, un-nesting the layers of stack usage and shrinking the stack size. Thus the maximum size of the stack is determined by how many subroutines are called on top of each other (the depth of the subroutine call graph), the storage space needed by each of those subroutines, plus additional space needed by any interrupt service routines that may be active at the same time.
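As a rough illustration (not a production measurement technique), comparing the addresses of local variables in nested calls shows each call claiming its own stack frame. The function names and the 16-byte locals here are made up for this sketch, and the direction of stack growth is architecture dependent:

```c
#include <stddef.h>
#include <stdint.h>

static uintptr_t mark_outer, mark_inner;

static void inner(void)
{
    volatile uint8_t locals[16];        /* lives in inner()'s stack frame */
    locals[0] = 1;
    mark_inner = (uintptr_t)&locals[0];
}

static void outer(void)
{
    volatile uint8_t locals[16];        /* lives in outer()'s stack frame */
    locals[0] = 1;
    mark_outer = (uintptr_t)&locals[0];
    inner();                            /* nesting one level deeper grows the stack */
    locals[1] = 2;                      /* keep this frame live across the call */
}

/* Distance between the two frames' locals: at least the size of one
   set of locals, plus whatever return address and saved state the
   compiler adds per call. */
size_t observed_frame_distance(void)
{
    outer();
    return (mark_outer > mark_inner) ? (size_t)(mark_outer - mark_inner)
                                     : (size_t)(mark_inner - mark_outer);
}
```

Worst case stack depth analysis in effect sums exactly this kind of per-frame cost over the deepest possible call chain, plus interrupt frames.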

Some processors have a separate hardware stack for subroutine return information and a different stack for parameters and variables. And many operating systems have multiple stacks in support of multiple tasks. But for the most part similar ideas apply to all embedded controllers, and we’ll just discuss the single-stack case to keep things simple.


It is common for stacks to grow “downward” from high memory locations to low locations, with global variables, Real Time Operating System (RTOS) task information, or other values such as the C heap being allocated from the low memory locations to high memory locations. There is an unused memory space in between the stack and the globals. In other words, the stack shares limited RAM space with other variables. Because RAM space is limited, there is the possibility of the stack growing so large that it overlaps variable storage memory addresses so that the stack corrupts other memory (and/or loads and stores to global memory corrupt the stack). To avoid this, it is accepted practice to determine the worst case maximum stack depth and ensure such overlap is impossible.

Maximum stack depth can be determined by a number of means. One way is to periodically sample the stack pointer while the program is running and find the maximum observed stack size. This approach is a starting point, but one should not expect that sampling will happen to catch the absolute maximum stack depth. Catching the worst case can be quite difficult because interrupt service routines also use the stack and run at times that are generally unpredictable compared to program execution flow. (Moreover, if you use a timer interrupt to sample the stack pointer value, you’ll never see the stack pointer when the timer interrupt is masked, leaving a blind spot in this technique.) So this is not how you should determine worst case stack depth.

A much better approach is to use static analysis tools specifically designed to find the worst case set of subroutine calls in terms of stack depth. This technique is effective unless the code has structures that confound the analysis (e.g., if a program uses recursion, this technique generally doesn’t work). When the technique does work, it has the virtue of giving an absolute bound without having to rely upon whether testing happened to encounter the worst case. You should use this technique if at all possible. If static analysis isn’t possible because of how you have written your code, you should change your code so you actually can do static analysis for maximum stack depth. In particular, this means you should never use recursion in a critical system, because it makes static analysis of stack depth impossible. Moreover, recursive routines are prone to stack overflows even when they are bug free but happen to be fed an exceptional input value that causes too many levels of recursion. (In general, using recursion in any small-microcontroller embedded system is a bad idea for these reasons even if the system is not safety critical.)



Beyond static analysis of stack depth (or if static analysis isn’t possible and the system isn’t safety critical), an additional accepted practice is to use a “stack sentinel” to find the “high water mark” of the stack (I’ve also heard this called a “stack watermark” and a “stack guard”). With this technique, each memory location in the stack area of memory is initialized to a predetermined known value or pattern of known values. As the stack grows in size, it overwrites these known sentinel values with other values, leaving behind new patterns in memory like footprints in fresh snow. Thus, it is easy to tell how big the stack has ever been since the system was started by looking for how far into the unused memory space the “footprints” of stack usage extend past the current stack size. This gives both the maximum stack size during a particular program run and the overall system maximum stack size, assuming that the worst case path has been executed. (Actually identifying the worst case path need not be done – it is sufficient to have run it at some time during program execution without knowing exactly when it happened. It will leave its mark on the stack memory contents when it does happen.)
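A minimal sketch of the sentinel technique follows. The array size, the sentinel pattern, and the function names are assumptions for illustration; on real hardware the stack area would be the linker-allocated stack region, painted during startup before main() runs:

```c
#include <stddef.h>
#include <stdint.h>

#define STACK_WORDS 256u
#define SENTINEL    0xAA55AA55u   /* arbitrary sentinel pattern, assumption */

/* Simulated stack region for illustration. */
static uint32_t stack_area[STACK_WORDS];

/* Paint the entire stack region with the sentinel before the stack
   is ever used ("fresh snow"). */
void paint_stack(void)
{
    for (size_t i = 0u; i < STACK_WORDS; i++) {
        stack_area[i] = SENTINEL;
    }
}

/* Assume the stack grows downward from stack_area[STACK_WORDS - 1]
   toward stack_area[0].  Scan upward from the low end: the first word
   that no longer holds the sentinel marks the deepest "footprint"
   left since startup. */
size_t high_water_words(void)
{
    size_t untouched = 0u;
    while (untouched < STACK_WORDS && stack_area[untouched] == SENTINEL) {
        untouched++;
    }
    return STACK_WORDS - untouched;   /* words ever used by the stack */
}
```

Note that the scan only observes footprints left behind, so it catches the worst case whenever it has happened, even if it happened between checks.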


To convert this idea into a run-time protection technique, extra memory space between the stack and other data is set up as a sacrificial memory area that is there to detect unexpected corruptions before the stack can corrupt the globals or other RAM data. The program is run, and then memory contents are examined periodically to see how many stack area memory words are left unchanged. As part of the design validation process, designers compare the observed worst case stack depth against the static analysis computed worst case stack depth to ensure that they understand their system, have tested the worst case paths, and have left adequate margin in stack memory to prevent stack overflows if they missed some special situation that is even worse than the predicted worst case.

This sentinel technique should also be used in testing and in production systems to periodically check the sentinels in the sacrificial memory area. If you have computed a worst case stack depth at design time and you detect that the computed depth has been exceeded at run time, this is an indication that you have a very serious problem.
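The run-time check over the sacrificial region might look like the following sketch (the function name and the 0xAA pattern are assumptions for illustration). A supervisor task would call it on each timer tick and force a graceful shutdown or reset when it returns false:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define GUARD_PATTERN 0xAAu   /* arbitrary sentinel byte, assumption */

/* Returns true while every byte of the sacrificial guard zone still
   holds the sentinel pattern; false means the stack (or a wild
   pointer) has trampled into the guard zone, and the system should
   shut down gracefully before globals are corrupted. */
bool stack_guard_intact(const volatile uint8_t *guard, size_t len)
{
    for (size_t i = 0u; i < len; i++) {
        if (guard[i] != GUARD_PATTERN) {
            return false;
        }
    }
    return true;
}
```

Checking the guard zone rather than the current stack pointer is the key design choice: it detects that the stack got too big at some point between checks, not merely that it is big right now.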

If the stack overflows past the sacrificial memory space into global variables, the system might crash. Or it might just corrupt global variables or RTOS system state and have who-knows-what effect. (In a safety critical system “who-knows-what” equals “unsafe.”) Naively assuming that you will always get a system crash detectable by the watchdog on a stack overflow is a dangerous practice. Sometimes the watchdog will catch a stack overflow. But there is no guarantee this will always happen. Consider that a stack-smashing attack is the security version of an intentional stack overflow, but one specifically designed to take over a system, not merely crash it. So a crash after a stack overflow is by no means a sure thing.

Avoiding stack overflow problems is a matter of considering program execution paths to avoid a deep sequence of calls, and accounting for interrupts adding even more to the stack. And, again, even though we shouldn’t have to say this: never use recursion.

While there isn’t a set number for how much margin to leave in terms of extra memory past the computed maximum stack size, there are two considerations. First, you want to leave enough room to have an ample sacrificial area so that a problem with stack depth is unlikely to have enough time to go all the way through the sacrificial area and touch the globals before it is detected by a periodic timer tick checking sentinels. (Note: we didn’t say check current stack depth; we said periodically check sentinels to see if the stack has gotten too big between checks.)  Also, you want to leave some extra margin to account for the possibility you just never encountered the worst-case stack depth in analysis or testing. I’ve heard designers say worst case stack depth of 50% of available stack memory is a good idea. Above 90% use of stack memory (10% sacrificial memory area set aside) is probably a bad idea. But the actual number will depend on the details of your system.
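The rule-of-thumb thresholds above can be captured as a simple design-time check (the function name and parameters are just for illustration):

```c
#include <stdbool.h>
#include <stddef.h>

/* True if the computed worst case stack depth stays within the chosen
   percentage of total stack memory (e.g., 50 per the rule of thumb
   above, with anything past 90 treated as a red flag).  Multiply
   before dividing to avoid integer truncation. */
bool stack_budget_ok(size_t worst_case_bytes, size_t total_bytes,
                     unsigned max_percent)
{
    return (worst_case_bytes * 100u) <= (total_bytes * (size_t)max_percent);
}
```

For example, a 2 KB computed worst case fits the 50% guideline in a 4 KB stack region, while a 3.8 KB worst case does not.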

Selected Sources:
Stack sentinels and avoidance of recursion are an entrenched part of embedded systems folklore and accepted practices. Douglass mentions watermarking in Section 9.8.6 (Douglass 2002).
MISRA Software Guidelines discourage recursion, saying that it can cause unpredictable behavior (MISRA Guidelines, pg. 20).

MISRA C required Rule 70 explicitly bans recursion.

NASA recommends using stack guards (essentially the same as the technique that I call “stack sentinels”) to check for stack overflow or corruption (NASA 2004, p. 93).

Stack overflow errors are well known to corrupt memory and cause arbitrarily bad behavior of a running program. Regehr et al. provide an overview of research in that area relevant to embedded microcontrollers and ways to mitigate those problems (Regehr 2006). They note that “the potential for stack overflow in embedded systems is hard to detect by testing” (id., p. 776), with the point being that it is reasonable to expect that a system which has been tested but not thoroughly analyzed will have run-time stack overflows that corrupt memory.

References:
  • Douglass, B. P., Real-Time Design Patterns: robust scalable architecture for real-time systems, Pearson Education, first printing, September 2002, copyright by Pearson in 2003.
  • MISRA, (MISRA C), Guideline for the use of the C Language in Vehicle Based Software, April 1998.
  • MISRA, Development Guidelines for Vehicle Based Software, November 1994 (PDF version 1.1, January 2001).
  • NASA-GB-8719.13, NASA Software Safety Guidebook, NASA Technical Standard, March 31, 2004.
  • Regehr et al., Eliminating stack overflow by abstract interpretation, ACM Trans. Embedded Computing Systems, Nov. 2006, pp. 751-778.

14 comments:

  1. Regarding pointer-to-function (PTF): These seem to confound static analyzers when it comes to stack depth analysis. Do you know of any analyzers that can (or attempt to) accommodate PTFs?

  2. The short answer is I don't know of one, but they might exist. If you're writing safety critical code it is commonly recommended that you avoid non-const pointers to functions (e.g., according to MISRA C). So I guess I'd look for an analyzer that can handle the special case of const PTFs, which ought to be a lot easier to implement than one that handles pointers that might change.

  3. You might have a look at "gstack" which comes with the Multi2000 IDE from Green Hills. There is also a special compiler construct called "strong pointers".

  4. I wonder why the essential technique of "tail recursion", well-known in functional programming, is totally neglected. Is "tail recursion" really to be banned on the same grounds? Isn't it essentially transforming, from a computational point of view, recursion into iteration?

  5. Yes, in general, tail recursion is also outlawed. There is a common compiler optimization, "tail recursion elimination," that prevents that form of recursion from actually consuming stack space.
    The problem is that for safety critical embedded software, especially on a microcontroller without memory protection, you have to be SURE that the compiler optimizer gets rid of the tail recursion. And you also need to make sure that your stack depth analysis tools can see "through" the tail recursion to compute worst case stack depth, which is likely to be more problematic.
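To make the tail-recursion point concrete, here is a sketch: both functions compute the same sum, but only the loop form has guaranteed constant stack use. The tail-recursive form is constant-stack only if the compiler performs tail-call elimination, which the C standard does not require:

```c
#include <stdint.h>

/* Tail-recursive sum of 1..n: the recursive call is the last action,
   so an optimizer MAY rewrite it as a jump, but C does not guarantee
   tail-call elimination, so the worst case stack depth is O(n) frames. */
uint32_t sum_tail(uint32_t n, uint32_t acc)
{
    if (n == 0u) {
        return acc;
    }
    return sum_tail(n - 1u, acc + n);
}

/* Equivalent iteration: guaranteed constant stack use and trivially
   analyzable by worst case stack depth tools. */
uint32_t sum_loop(uint32_t n)
{
    uint32_t acc = 0u;
    while (n > 0u) {
        acc += n;
        n--;
    }
    return acc;
}
```

This is why critical code uses the loop form even when the tail-recursive form is logically equivalent.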

    1. I thank you for your kind, informative answer and I take due note. Maybe my concern is more about the form of the recommendation than its content. I remember from my studies in the eighties that many people confounded "structured programming" with "avoiding GOTOs". As every moderately educated programmer knows, you can well use GOTOs *under strict control* and still claim good structured programming style, while you certainly cannot achieve proper "structured programming" simply by avoiding GOTOs. Aren't we in a similar situation? Is it not the use of recursion *per se*, but an improper, blind use of it that is the problem? The real issue is "memory-aware programming", which might well include exceptional, controlled use of recursion. Similarly, it's not just the blind avoidance of recursion and other tricks (which can be automated), but a disciplined programming style that guarantees the required properties of the code. (just my take, thanks for the attention and for your blog.)

    2. PS: just to complete my thought: some people translate the "ban" on recursion into the statement that "functional programming style is intrinsically less - or not at all - safe", which I think is, at best, meaningless and, at worst, false and misleading.

  6. There are two different but important points here.

    (1) A disciplined programming style is necessary, but not sufficient, for safety. You not only have to have discipline, you have to be able to prove that you actually followed the discipline (preferably with automated checkers). It is all too easy for a bug to slip by humans, and this sort of bug is easily catastrophic. Recursion often makes static analysis of maximum stack depth impossible, so now you aren't sure if the recursion is really disciplined or not. In a safety critical system with no memory protection hardware, recursion just isn't worth the risk. In a non-critical desktop system with virtual memory it might well make sense, especially if the algorithm being used is inherently recursive (think "Towers of Hanoi").

    (2) The ban on recursion applies to imperative languages. For functional languages the question is not "does it use recursion," but rather for the purposes of this discussion a more relevant question would be "can you make provable guarantees about worst case stack depth" (among other possible properties you care about). And even that assumes that recursion is using stack space instead of heap space (depends on the implementation). I don't want to go off on a tangent of whether functional programming languages are suitable for safety critical systems or not. I know there are very smart people who have been working in that area for decades. My point is simply that recursion is bad in imperative languages such as C, and functional languages would require additional thought and discussion.

    1. Many thanks for your exceptionally swift, authoritative and insightful answers.

  7. Let me first make an observation: the stack is a shared resource that is hidden at the level most programmers work on - it is not visible at the implementation level nor at the design level. Hence potential stack errors are rarely considered. Recursion is just one construct that can cause stack overflow - others are possible as well but may (?) be easier to test or detect by static analysis tools (e.g., nested interrupts and/or an interrupt avalanche). Obviously recursion can trigger fast and "endless" stack growth - hence it may be particularly dangerous.

    But the point is that the problem's root cause is the unconscious use and growth of the hidden shared resource "stack" rather than the programming technique. Isn't it worth fighting this root cause?

    But before going to memory protection, let me ask how to handle recursion in the case of conscious use of this shared resource. E.g., one might limit the maximal recursion depth by passing a call depth counter as a parameter and inhibiting recursion beyond a static limit at run time. This would allow determining the maximum stack size - even more, reaching this limit could be tracked during testing by checking the branch coverage on the statement checking the limit. How would we judge such a recursion with (provable) call depth limitation?
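The depth-counter idea proposed above could be sketched like this (the list-walk example, MAX_DEPTH, and the function name are made up for illustration). The counter turns unbounded recursion into recursion with a statically known worst case frame count, at the cost of an explicit error path that must itself be verified:

```c
#include <stddef.h>

#define MAX_DEPTH 8u   /* static recursion limit, assumption */

struct node {
    struct node *next;
};

/* Returns the list length if it is reachable within MAX_DEPTH frames,
   or -1 to signal that the limit would be exceeded.  Worst case stack
   use is therefore bounded by MAX_DEPTH frames regardless of input. */
int bounded_length(const struct node *n, unsigned depth)
{
    if (n == NULL) {
        return 0;
    }
    if (depth >= MAX_DEPTH) {
        return -1;                 /* inhibit recursion past the limit */
    }
    int rest = bounded_length(n->next, depth + 1u);
    return (rest < 0) ? -1 : rest + 1;
}
```

Testing the branch coverage of the `depth >= MAX_DEPTH` check, as the comment suggests, confirms the limit was actually exercised.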

    And how would we judge recursion in case we have a (certified) OS that uses (HW) memory protection to encapsulate the stack and uses different stacks for different tasks/processes/interrupts? In that case unintended stack growth and the associated falsification of other data can be detected and inhibited by the memory protection exception handler. Meanwhile there are embedded certified OSes and microprocessors that support this. How should we judge recursion in such an environment?

    1. If you want to break the "no recursion" rule then you need a safety case that argues why the system is safe despite the presence of recursion. This is no small thing, and there is a good reason why "no recursion" is a ubiquitous coding rule.

      The types of things you'd have to argue might include:
      - There is no bug in the depth counter logic, and the depth counter can't be corrupted somehow
      - Stack cannot overflow even if the code is later modified with bug fixes, maintenance, etc. (what if someone messes with the depth counter?)
      - If the stack does somehow overflow, it does not corrupt other system state (this assurance is not true of the types of systems I was talking about in my posting)
      - Is there a tool that can prove the depth limitation (what if the person doing the proof makes a mistake?)
      - What happens to the system if a task does recurse past a stack limit and invokes an error handler -- is it still safe? Does the recovery work in all possible scenarios even while the system is operating?

      Moreover, I have yet to meet the piece of embedded software for which recursion was necessary (or even desirable). However, assuming there is such an application, using recursion would have to be worth the pain of going through all the safety proofs. Thus, the default rule is "no recursion" so that it is hard to make a mistake in a safety critical system. If you really really want to use recursion then the onus is upon the developer to provide an extraordinary level of assurance of safety.

      Per other postings, if you want to use a functional programming language that has recursion, I'd expect that the tool chain would need to provide these types of assurances to feed into the safety case.

      (This is an informal comment and is not a fully considered opinion regarding system safety.)

  8. Thanks for the elaborated response. The discussion touches so many interesting and important topics that I hope the thread does not become too big. I try to group with decreasing priority:

    1) Most important seems to be your statement "If the stack does somehow overflow, it does not corrupt other system state (this assurance is not true of the types of systems I was talking about in my posting)" - it seems to me that meanwhile there are OSes and microcontrollers for embedded applications that provide at least the feature of HW memory protection with the properties mentioned above (separate stack for each process, HW inhibited stack overrun). This alone does not make recursion safe, for sure, but I would dare to say that this is almost a must if someone wants to manage the risk involved with recursion. Would you agree on that?

    If so, this could cover part of the things mentioned in your previous comment to be argued (corruption of the depth counter, anything related to "if the stack overflows" because this is inhibited)

    2) Regarding "bug in depth counter logic": What I have in mind is a design pattern that is so obvious that a human being can prove the property of limited recursion depth mathematically. I am not sure which to put more trust in: a mathematical proof done by an expert or an analysis tool. I'd love to see a formal analysis tool that comes close to a formal proof AND is applicable to today's embedded programming techniques and languages. I have seen some simulation based approaches (e.g. TLA+) and some formal approaches (e.g. Isabelle) but neither covers both needs. Any recommendation on this?

    3) Let me propose to distinguish between
    a) systematic faults inside the recursion,
    b) faults outside the recursion that do not corrupt the control and data flow (namely, "only" provide incorrect inputs), and
    c) faults outside the recursion that corrupt the data or control flow (HW faults and nasty SW faults).

    It seems (a) and (b) are manageable by applying design patterns inside the implementation of the recursion with such a low level of complexity that they are provable.
    (c) is not manageable without additional safety mechanisms outside the implementation.

    Do you consider this a reasonable approach to chop the problem in smaller pieces?

    4) Regarding functional programming
    While sharing the expectations on the tool chain ... how could one prove that a tool chain really fulfills these requirements? Especially if we consider faults of type (c) above? And is there a tool chain that provides a reliable analysis of the maximum recursion depth and error handling for a given functional algorithm?

  9. From an academic point of view these ideas seem to have merit, although I haven't done a close check of any pitfalls.

    From a practical point of view ... why go to all this trouble when you could just avoid recursion in the first place?

  10. Thank you for this excellent writeup.

    Some thoughts (I'm not working in the embedded device space or even safety-critical software myself, but I like Prolog, Erlang, Scheme and Esterel and don't like C):

    #1 Does the rule against recursion forbid "non-apparent recursion", i.e. call sequences like function_A -> function_B -> ... -> function_A ? I would say, yes! (This can also easily be checked at runtime and even statically of course).

    #2 Under the condition that the programming language allows one to express recursion cleanly, recursive code can be easier to understand and check than equivalent imperative code, even in simple cases. Iterative code using loops can be unnecessarily complex and can run away and consume limited resources just as easily. A mathematical proof that the loop will terminate will be needed and might be difficult to obtain.

    #2.1 If the complexity is low enough that using "primitive-recursive" FOR loops will do and "partial-recursive" WHILE loops are not needed, so much the better. But that sounds like a borderline case.

    #3 Is it allowed to use potentially unbounded apparent data structures (like linked lists) to replace the potentially unbounded hidden data structure (i.e., the stack)? Again, one danger point is replaced by another. On an unrelated tack, this reminds me of an effort to emulate recursion in COBOL in the 90s, because recursion was the natural way to process the telecom cabling representation. The stack was emulated by operations on the relational database. Hmm... horrific!

    #4 The caveat "the ban on recursion applies to imperative languages" is not really a good one as the machinery "under the hood" will be an imperative language no matter what. The caveat should be "the ban on recursion applies to imperative language code as written by the developer".

    #5 The rule against recursion sounds like a DEFINITION of (a certain class of) safety-critical systems: If the system has reached complexity levels so high that use of recursion can no longer be replaced by a human programmer by equivalent, verifiable code that uses iteration alone THEN it no longer can be regarded as a "safety-critical system" in the present sense.

    With best regards,

    -- Anon

