Better Embedded System SW: interrupts

Showing posts with label interrupts. Show all posts

Monday, January 6, 2014

Do Not Re-Enable Interrupts In An ISR

Summary: It might sound intuitive to re-enable interrupts within an ISR so higher priority interrupts can run without delay. But, in fact, this is probably the worst thing you can do for delay, and might even make the code unsafe due to stack overflow.

A previous post on rules for using interrupts included the rule:

"Don't re-enable interrupts within an Interrupt Service Routine (ISR). That's just asking for subtle race condition and stack overflow problems."

Some developers take a different point of view, and feel that it is best to re-enable interrupts to let higher priority ISRs run without delay. The idea is that high priority interrupts should run as soon as possible, without having to wait for low priority interrupts to complete. Re-enabling interrupts within an ISR seems to let that happen. BUT, while there might be some intuitive appeal to this notion, this is a dangerous practice that makes things worse instead of better.

First, to recap, when an interrupt triggers an ISR one of the first things that happens is that further interrupts get masked by the interrupt handling hardware mechanisms. Once the ISR starts running, at some point (best is at the beginning) it acknowledges the interrupt source, clearing the interrupt request that triggered the ISR. At that point the ISR can re-enable interrupts if it wants to, or leave them masked until the ISR completes execution. (The "return from interrupt" instruction will typically restore interrupt flags, re-enabling interrupts as appropriate when the ISR completes.) If interrupts are re-enabled within the ISR, then another interrupt can suspend the ISR and run some other, second ISR. This means that if a higher priority interrupt comes along, it can run its ISR right away.

The problem is that other, bad, things can also happen once interrupts are re-enabled:

If a lower priority interrupt comes along, it also gets to run, suspending the currently running, higher priority ISR. Interrupt priority hardware does not keep track of interrupt history after an ISR starts and acknowledges its interrupt source, and so loses track of how high the priority is for the running ISR. Worse, if a high and low priority interrupt happen at the same time, this approach guarantees that the high priority ISR waits for the low priority ISR. This happens because the high priority ISR runs first, then gets preempted by the lower priority ISR as soon as interrupts are re-enabled. In the case where no other interrupts are pending, the low priority ISR runs to completion before the high priority ISR gets to finish.

The ISRs nest as above rather than running one at a time, filling up the stack. You might be able to account for this by allocating enough stack for all ISRs to be active at the same time. But if you leave interrupts masked in ISRs. the worst case is only the single biggest ISR stack use. (Some hardware has multiple tiers/levels/classes ... pick your favorite term ... of interrupts, but in that case it is still only one ISR of stack use per tier rather than one per ISR source.)

The same ISR might run more than once at a time, especially if it got unlucky and was preempted by other ISRs, delaying its completion time. For example, if you get a burst of noise on an ISR hardware line you might kick of a half dozen or so copies of the same ISR. Or once in a while hardware events happen close together and re-trigger the ISR. This will lead to trouble if your ISR code is not re-entrant. It also could overflow the stack, ending up in memory corruption, etc. unless you can accurately predict or limit how many times ISRs can be re-triggered in absolute worst-case conditions.

You could say "the highest priority ISR doesn't re-enable interrupts" -- but what about the second-highest priority ISR? Once you get more than a couple ISRs involved this gets hopeless to untangle. You could try to write some sort of ISR handler to mitigate some of these risks, but it's going to be difficult to get right, and add overhead to every ISR. In all, the situation sounds pretty messy and prone to problems .. and it is. You might get away with this on some systems some of the time if you are really good (and never make mistakes). But, getting concurrency-related tricky code right is notoriously difficult. Re-enabling interrupts is just asking for problems.

So let's look at the alternative. What is the true cost you might be trying to avoid in terms of delaying that oh-so-urgent high priority ISR because you're not re-enabling interrupts in an ISR?

The worst case is that the longest-running low priority ISR runs to completion, making all the higher priority ISRs wait for it to complete before they can start. But after that all the remaining ISRs that are pending will complete in priority order -- highest to lowest priority. That's exactly what you want except for the low priority ISR clogging up the works. So if you have an obnoxiously long low priority ISR that's a problem. But if none of your ISRs run for very long (which is how you're supposed to write ISRs), you're fine. Put into scheduling terms, you want to make sure none of your ISRs runs long, because a long-running ISR gives you a high blocking time, and blocking time delays high priority tasks from completing.

Let's compare outcomes for the two alternative strategies. If you re-enable interrupts, the worst case latency for the highest priority ISR in the system is that it arrives, and then gets preempted by every other ISR in the system (including the longest-running ISR if it comes in later, but before the high priority ISR has a chance to complete). If you leave interrupts masked, the worst case is that the longest-running ISR has to complete, but then the high priority ISR goes immediately afterward. So, leaving interrupts disabled (masked) during every ISR is clearly a win for the worst case, in that you only have to wait for the longest-running ISR to complete before running the highest priority ISR, instead of waiting for all ISRs to complete. The worst case is typically what you care about in a real time embedded system, so you should leave interrupts disabled in ISRs to ensure the fastest worst-case completion time of high priority ISRs. And, leaving interrupts disabled in ISRs also gets rid of the risks of stack overflow and re-triggered ISRs we mentioned.

UPDATE: To avoid confusion, it's important to note that the above is talking about what happens at ONE level of interrupts, such that when one ISR is running no other interrupts run until the ISR completes or ISRs at that level complete. Many architectures have multiple levels, in which one ISR can interrupt another ISR at a lower level even if that lower level has interrupts masked. This corresponds to the comment about one ISR per level being active in the worst case. Also, note that if an architecture can change the priorities of interrupts within a single level that's irrelevant -- it is the existence of levels that are each individually maskable and that are prioritized as groups of interrupts per level that gives a way around some of these problems. So, going back to the title says, do not RE-enable the same level of interrupts within an ISR.

Saturday, May 25, 2013

Adding Prioritization to an Single Level Interrupt Priority System

Summary of technique: Add a software structure that executes only the highest priority pending interrupt within the ISR polling loop. Then start again at the top of the polling loop instead of polling all possible ISRs. This gives you a prioritized non-preemptive interrupt service routine scheduler.

- - - - - - - - - - - - - - - - - - - -

With some microcontrollers, all of your interrupts come in at the same priority level (for example, via an external interrupt request pin). The usual thing to do in that case is create a polling loop to check all the sources of interrupts and see which one needs to be serviced by looking at peripheral status registers. For example:
if(HWTimerTick) { ... ISR to service hardware timer tick ... }
if(ADCReady) { ... ISR to service A to D converter ... }
if(SerialPortDataInReady ) { ... ISR to read a serial port byte... }
if(SerialPortDataOutReady) { ... ISR to write a serial port byte ... }
...
(Of course this isn't real code ... I'm just sketching a flow that you've seen before if you've written this type of ISR that polls all the devices that can cause interrupts to see which one actually needs to be serviced.)

If only one of these devices is active, then this approach should work pretty well. And if you do system-level testing probably things will work fine -- at least most of the time.

But the way you can get into trouble is if one of the interrupts has a short deadline for being serviced. Let's say you have the above code and are seeing serial input bytes being dropped once in a while. What could be happening?

One cause of dropping bytes might be that the HW Timer Tick and/or the ADC Ready interrupts are active at the same time that the serial port data input interrupt is ready. You need to execute them before you can get data from the serial port. If the sum of their two execution times is longer than the time between serial byte arrivals, you're going to take too long to get to the serial port input ISR and will drop bytes.

You might buy a faster processor (which might be unnecessary as we'll see), but before doing that you might reorganize the code to put the serial input first in the list of ISRs so you can get to it faster when an interrupt comes in:
if(SerialPortDataInReady ) { ... read a serial port byte... }
if(HWTimerTick) { ... service hardware timer tick ... }
if(ADCReady) { ... service A to D converter ... }
if(SerialPortDataOutReady) { ... write a serial

And that will *almost* work. Things might get a little better, but it won't cure the problem. (Or, MUCH worse, it will cure the problem in testing only to have the problem reappear in the field after you've shipped a lot of systems!) Now when you get an interrupt you'll service the serial port input ISR first. But, then you'll go off and do the other ISRs. If those other ISRs take enough time, you will be stuck in those other ISRs too long and will miss the next byte -- you won't get back to the top of the list of ISRs in time.

You might try re-enabling interrupts inside any long ISRs to let the serial port get processed sooner. But resist the temptation -- that probably won't work, and will likely result in stack overflows due to recursive interrupt processing. (Simple rule: NEVER re-enable interrupts from inside an ISR.)

What we really need here is prioritization. And it's pretty easy to get even though we don't have hardware interrupt prioritization. All you have to do is (1) put the checks for each ISR in priority order, and (2) only execute the first one in the list each time you process interrupts. This can be done as follows:
if(SerialPortDataInReady ) { ... read a serial port byte... }
else if(HWTimerTick) { ... service hardware timer tick ... }
else if(ADCReady) { ... service A to D converter ... }
else if(SerialPortDataOutReady) { ... write a serial port byte ... }

Now only the first active interrupt will be serviced and the rest ignored. When you drop out of this structure and exit, any pending interrupt will re-trigger the checks from the beginning, again executing the highest priority interrupt that is still active (i.e., the first active one in the list). This will continue until all pending interrupts have been processed. You can use a "while" loop around the code above, or in many systems it may make sense just to exit interrupt processing and let the hardware interrupts re-trigger to re-run the polling code as a new interrupt.

This approach means that the worst case delay between processing serial input bytes is no longer all the ISRs running (if all interrupts are active). Rather, the worst case is the single longest ISR happens to be running, completes, and the serial port input ISR runs next. This happens because the list only runs at most one ISR rather than all of them. If that one ISR runs too long to meet deadlines, then it's probably too "fat" and should be simplified or its job moved out of ISRs and into the main loop.

There is no free lunch. The lowest priority ISR (the one at the end of the list) might starve. Making sure you meet all your ISR deadlines is trickier with this structure. Without the "elseif" approach the worst case timing is easy to compute -- it is the run time of all ISRs. But it might be too slow to live with. With this structure you have a nonpreemptive prioritized scheduling system for ISRs, and need to use suitable math and a suitable scheduling approach. Generally you'd want to use rate monotonic analysis (RMA) suitably adapted for the ISRs being non-preemptive. The analysis may be a little more complex, but this approach might help you salvage a situation in which you're missing deadlines and have already committed to a certain speed of microcontroller.

(Note on terminology: technically the whole thing is one big ISR that calls a different function depending upon what's active. But I'm calling each such function an ISR because that is really what it does ... you're using a software dispatcher to pick which ISR to run instead hardware prioritization logic to pick an ISR.)

Thursday, April 25, 2013

Why Short Interrupt Service Routines Matter

Most embedded systems I see use interrupts to handle high priority events, which are typically triggered by some peripheral device. So far so good. But it is also common for these systems to have significant timing problems even though their CPUs are not 100% loaded.

Let's take an example of three interrupts and see how this type of thing can happen. Let's call their service routines IntH, IntM, and IntL (for high/medium/low priority), and assume this is a single-level interrupt priority system. By that I mean that these Interrupt Service Routines (ISRs) can't be interrupted by any of the others once they start executing.

Say that you write your software and you measure an idle task at taking 80% of the CPU. The most important ISR has highest priority, etc. And maybe this time it works fine. But eventually you'll run into a system which has timing problems. You're only 80% loaded; how could you have timing problems? To find out why, we need to dig deeper.

The first step is to measure the worst case (longest) execution time and worst case (fastest) period for each ISR. Let's say it turns out this way:

IntH: Execution time = 10 msec Period = 1000 msec
IntM: Execution time = 0.01 msec Period = 1 msec
IntL: Execution time = 2 msec Period = 100 msec

Let's take a look a the numbers. This task set is loaded at: (10/1000) + (0.01/1) + (2/100) = 4%.
BUT it will miss deadlines! How can that be?

IntM and IntL are both going to miss their deadlines (if we assume deadline = period) periodically. IntM will miss its deadline up to 10 times every time IntH runs, because the CPU is tied up for 10 msec with IntH, but IntM needs to run every 1 msec. So once per second IntM will miss its deadlines because it is starved by IntH.

OK, so maybe you saw that one coming. But there is a more insidious problem here. IntM can also miss its deadline because of IntL. Once IntL executes, it ties up the CPU for 2 msec, causing IntM to miss its 1 msec period. Even though IntL has a lower priority, once it runs it can't be interrupted, so it hogs the CPU and causes a deadline miss.

There are plenty of bandaids that can be tossed at this system (and I have the feeling I've seen them all in design reviews). The obvious hack of re-enabling interrupts partway through an ISR is dangerous and should not be used under any circumstance. It leads to timing-dependent stack overflows, race conditions and so on. And more importantly, re-enabling interrupts in an ISR is, in my experience, a sign that the designers didn't understand the root cause of the timing problems.

But there is a principled way to solve these problems involving two general rules:
- If possible, sort ISR and task priority by period; shortest period with highest priority. This minimizes effective CPU use when you do scheduling. To understand why this is important you'll need to read up on Rate Monotonic Scheduling and related techniques.
- Keep ISR worst case execution time as small as possible -- only a few handfuls of instructions. If you need to get more done, dump data from the ISR into a buffer and kick off a non-ISR task do do the processing. This prevents one ISR from making another miss its deadline and largely deflects the problem of ISRs not necessarily being assigned the priority you'd like in your particular hardware.

The key insight is that "important" and "priority" are not the same things. Priority is about making real time scheduling math work, and boils down to assigning highest priority to short-period and short-deadline tasks. Getting that to work in turn requires all ISRs (even low priority ones) to be short. The importance of an ISR from the point of view of functionality ("this function is more important to the customer") is largely irrelevant -- the point of real time scheduling is to make sure everything executes every time. Sometimes "important" and "short deadline" correspond, but not always. It is the deadline that should be paid attention to when assigning priorities if you want to meet real-time deadlines. (Or, put another way, "important" means real-time and unimportant means non-real-time.)

The discussion above also applies to systems with multiple levels of interrupt priorities. Within each level of priority (assuming one level can interrupt ISRs in another level), once a pig ISR starts none of the other interrupts at that task level can interrupt it.

Make all your ISRs short, and do the analysis to make sure the worst case clumping of ISR executions doesn't overload your CPU.

Monday, March 25, 2013

Rules for Using Interrupts

Here's a brief guide to rules for good interrupt design.

Keep your Interrupt Service Routine (ISR) short. Ideally half a page of C code max. If you must use assembly code, keep it to one page max. Long ISRs cause timing problems, often in surprising ways.
Keep ISR execution time very short. 100-200 clock cycles tops, although there is room for discussion on the exact number. If you have a lot of work to do, shovel the data into a holding buffer and let the main loop or a non-ISR task do the rest.
Know the worst case ISR execution time so you can do real-time scheduling. Avoid loops, because these make worst case trickier, and an indefinite loop might hang once in a while due to something you didn't think of.
Actually do the real time scheduling, which is a bit tricky because ISRs are non-preemptive within the same ISR priority level. (My book chapter on this works out the math in gory detail.)
Don't waste time in an ISR (for example, don't put in a wait loop for some hardware response).
Save the registers you modify if your hardware doesn't already do that for you. (Seems obvious, but if you have a lot of registers it might take a lot of testing to catch the one place where a register is used in the main code and the ISR clobbers it.)
Acknowledge the interrupt source at the beginning of the ISR (right after you save registers). It makes code reviews easier if it is always in the same place.
Don't re-enable interrupts within an ISR. That's just asking for subtle race condition and stack overflow problems.

There also some system-level issues having to do with playing well with ISRs:

Make sure to disable interrupts when accessing a variable shared with an ISR. Do so for the shortest possible time. Do this even if you "know" it is safe (compiler optimizer behavior is difficult to predict, and code generation may change with a new compiler version). Ideally, protect those variables with access methods so you only have to get this right in one place in the code.
Declare any shared ISR/non-ISR variables as volatile.
When you do timing analysis, don't forget that ISRs consume time too.
When you do stack depth analysis, don't forget worst-case stack-up of interrupts all occurring at the same time (especially if your processor supports multiple levels of interrupts).
Make sure that all interrupt vectors are initialized, even if you don't plan on using them.
Only use Non-Maskable Interrupts for a catastrophic system event such as system reset.
Be sure to initialize all your interrupt-related data structures and hardware that can generate interrupts before you enable interrupts during the boot-up cycle.
Once you turn on the watchdog timer, don't ever mask its interrupt.
If you find yourself doing something weird within an ISR, go back and fix the root cause of the problem. Weird ISRs spell trouble.

If you've been bitten by an interrupt in a way that isn't covered above, let me know!