Monday, January 6, 2014

Do Not Re-Enable Interrupts In An ISR

Summary: It might sound intuitive to re-enable interrupts within an ISR so higher priority interrupts can run without delay. But, in fact, this is probably the worst thing you can do for delay, and might even make the code unsafe due to stack overflow.

A previous post on rules for using interrupts included the rule:
  • "Don't re-enable interrupts within an Interrupt Service Routine (ISR).  That's just asking for subtle race condition and stack overflow problems."
Some developers take a different point of view, and feel that it is best to re-enable interrupts to let higher priority ISRs run without delay. The idea is that high priority interrupts should run as soon as possible, without having to wait for low priority interrupts to complete. Re-enabling interrupts within an ISR seems to let that happen. BUT, while there might be some intuitive appeal to this notion, this is a dangerous practice that makes things worse instead of better.

First, to recap, when an interrupt triggers an ISR one of the first things that happens is that further interrupts get masked by the interrupt handling hardware mechanisms. Once the ISR starts running, at some point (best is at the beginning) it acknowledges the interrupt source, clearing the interrupt request that triggered the ISR. At that point the ISR can re-enable interrupts if it wants to, or leave them masked until the ISR completes execution. (The "return from interrupt" instruction will typically restore interrupt flags, re-enabling interrupts as appropriate when the ISR completes.) If interrupts are re-enabled within the ISR, then another interrupt can suspend the ISR and run some other, second ISR. This means that if a higher priority interrupt comes along, it can run its ISR right away.

The problem is that other, bad, things can also happen once interrupts are re-enabled:
  •  If a lower priority interrupt comes along, it also gets to run, suspending the currently running, higher priority ISR. Interrupt priority hardware does not keep track of interrupt history after an ISR starts and acknowledges its interrupt source, and so loses track of how high the priority is for the running ISR. Worse, if a high and low priority interrupt happen at the same time, this approach guarantees that the high priority ISR waits for the low priority ISR. This happens because the high priority ISR runs first, then gets preempted by the lower priority ISR as soon as interrupts are re-enabled. In the case where no other interrupts are pending, the low priority ISR runs to completion before the high priority ISR gets to finish.
  • The ISRs nest as above rather than running one at a time, filling up the stack. You might be able to account for this by allocating enough stack for all ISRs to be active at the same time.  But if you leave interrupts masked in ISRs. the worst case is only the single biggest ISR stack use. (Some hardware has multiple tiers/levels/classes ... pick your favorite term ... of interrupts, but in that case it is still only one ISR of stack use per tier rather than one per ISR source.)
  • The same ISR might run more than once at a time, especially if it got unlucky and was preempted by other ISRs, delaying its completion time. For example, if you get a burst of noise on an ISR hardware line you might kick of a half dozen or so copies of the same ISR. Or once in a while hardware events happen close together and re-trigger the ISR. This will lead to trouble if your ISR code is not re-entrant. It also could overflow the stack, ending up in memory corruption, etc. unless you can accurately predict or limit how many times ISRs can be re-triggered in absolute worst-case conditions.
You could say "the highest priority ISR doesn't re-enable interrupts"  -- but what about the second-highest priority ISR? Once you get more than a couple ISRs involved this gets hopeless to untangle. You could try to write some sort of ISR handler to mitigate some of these risks, but it's going to be difficult to get right, and add overhead to every ISR. In all, the situation sounds pretty messy and prone to problems .. and it is. You might get away with this on some systems some of the time if you are really good (and never make mistakes). But, getting concurrency-related tricky code right is notoriously difficult.  Re-enabling interrupts is just asking for problems.

So let's look at the alternative. What is the true cost you might be trying to avoid in terms of delaying that oh-so-urgent high priority ISR because you're not re-enabling interrupts in an ISR? 

The worst case is that the longest-running low priority ISR runs to completion, making all the higher priority ISRs wait for it to complete before they can start. But after that all the remaining ISRs that are pending will complete in priority order -- highest to lowest priority. That's exactly what you want except for the low priority ISR clogging up the works. So if you have an obnoxiously long low priority ISR that's a problem. But if none of your ISRs run for very long (which is how you're supposed to write ISRs), you're fine. Put into scheduling terms, you want to make sure none of your ISRs runs long, because a long-running ISR gives you a high blocking time, and blocking time delays high priority tasks from completing. 

Let's compare outcomes for the two alternative strategies. If you re-enable interrupts, the worst case latency for the highest priority ISR in the system is that it arrives, and then gets preempted by every other ISR in the system (including the longest-running ISR if it comes in later, but before the high priority ISR has a chance to complete).  If you leave interrupts masked, the worst case is that the longest-running ISR has to complete, but then the high priority ISR goes immediately afterward. So, leaving interrupts disabled (masked) during every ISR is clearly a win for the worst case, in that you only have to wait for the longest-running ISR to complete before running the highest priority ISR, instead of waiting for all ISRs to complete. The worst case is typically what you care about in a real time embedded system, so you should leave interrupts disabled in ISRs to ensure the fastest worst-case completion time of high priority ISRs. And, leaving interrupts disabled in ISRs also gets rid of the risks of stack overflow and re-triggered ISRs we mentioned.

UPDATE: To avoid confusion, it's important to note that the above is talking about what happens at ONE level of interrupts, such that when one ISR is running no other interrupts run until the ISR completes or ISRs at that level complete. Many architectures have multiple levels, in which one ISR can interrupt another ISR at a lower level even if that lower level has interrupts masked. This corresponds to the comment about one ISR per level being active in the worst case. Also, note that if an architecture can change the priorities of interrupts within a single level that's irrelevant -- it is the existence of levels that are each individually maskable and that are prioritized as groups of interrupts per level that gives a way around some of these problems. So, going back to the title says, do not RE-enable the same level of interrupts within an ISR.

Static Analysis Ranked Defect List

  Crazy idea of the day: Static Analysis Ranked Defect List. Here is a software analysis tool feature request/product idea: So many times we...