Thursday, April 25, 2013

Why Short Interrupt Service Routines Matter

Most embedded systems I see use interrupts to handle high priority events, which are typically triggered by some peripheral device. So far so good. But it is also common for these systems to have significant timing problems even though their CPUs are not 100% loaded.

Let's take an example of three interrupts and see how this type of thing can happen.  Let's call their service routines IntH, IntM, and IntL (for high/medium/low priority), and assume this is a single-level interrupt priority system. By that I mean that these Interrupt Service Routines (ISRs) can't be interrupted by any of the others once they start executing.

Say that you write your software and you measure an idle task at taking 80% of the CPU.  The most important ISR has highest priority, etc.  And maybe this time it works fine.  But eventually you'll run into a system which has timing problems.  You're only 80% loaded; how could you have timing problems? To find out why, we need to dig deeper.

The first step is to measure the worst case (longest) execution time and worst case (fastest) period for each ISR.  Let's say it turns out this way:

IntH: Execution time = 10 msec       Period = 1000 msec
IntM: Execution time =  0.01 msec    Period =    1 msec
IntL: Execution time =  2 msec       Period =  100 msec

Let's take a look a the numbers. This task set is loaded at:  (10/1000) + (0.01/1) + (2/100) = 4%.
BUT it will miss deadlines! How can that be?

IntM and IntL are both going to miss their deadlines (if we assume deadline = period) periodically.  IntM will miss its deadline up to 10 times every time IntH runs, because the CPU is tied up for 10 msec with IntH, but IntM needs to run every 1 msec. So once per second IntM will miss its deadlines because it is starved by IntH.

OK, so maybe you saw that one coming.  But there is a more insidious problem here. IntM can also miss its deadline because of IntL.  Once IntL executes, it ties up the CPU for 2 msec, causing IntM to miss its 1 msec period. Even though IntL has a lower priority, once it runs it can't be interrupted, so it hogs the CPU and causes a deadline miss.

There are plenty of bandaids that can be tossed at this system (and I have the feeling I've seen them all in design reviews). The obvious hack of re-enabling interrupts partway through an ISR is dangerous and should not be used under any circumstance.  It leads to timing-dependent stack overflows, race conditions and so on. And more importantly, re-enabling interrupts in an ISR is, in my experience, a sign that the designers didn't understand the root cause of the timing problems.

But there is a principled way to solve these problems involving two general rules:
 - If possible, sort ISR and task priority by period; shortest period with highest priority.  This minimizes effective CPU use when you do scheduling. To understand why this is important you'll need to read up on Rate Monotonic Scheduling and related techniques.
 - Keep ISR worst case execution time as small as possible -- only a few handfuls of instructions.  If you need to get more done, dump data from the ISR into a buffer and kick off a non-ISR task do do the processing. This prevents one ISR from making another miss its deadline and largely deflects the problem of ISRs not necessarily being assigned the priority you'd like in your particular hardware.

The key insight is that "important" and "priority" are not the same things. Priority is about making real time scheduling math work, and boils down to assigning highest priority to short-period and short-deadline tasks. Getting that to work in turn requires all ISRs (even low priority ones) to be short. The importance of an ISR from the point of view of functionality ("this function is more important to the customer") is largely irrelevant -- the point of real time scheduling is to make sure everything executes every time.  Sometimes "important" and "short deadline" correspond, but not always. It is the deadline that should be paid attention to when assigning priorities if you want to meet real-time deadlines.  (Or, put another way, "important" means real-time and unimportant means non-real-time.)

The discussion above also applies to systems with multiple levels of interrupt priorities. Within each level of priority (assuming one level can interrupt ISRs in another level), once a pig ISR starts none of the other interrupts at that task level can interrupt it.

Make all your ISRs short, and do the analysis to make sure the worst case clumping of ISR executions doesn't overload your CPU.

Static Analysis Ranked Defect List

  Crazy idea of the day: Static Analysis Ranked Defect List. Here is a software analysis tool feature request/product idea: So many times we...