Monday, May 12, 2014

Real Time Scheduling Analysis for Critical Systems

Critical embedded software must be created using methodical scheduling analysis that ensures that every task can meet its deadline under worst-case fault-free operational conditions. Looking just at CPU load is not only inadequate to determine if you will meet real time deadlines -- it is a specifically bad practice that is to be avoided.

Consequences:
Failing to do an accurate real time schedule design for a safety critical system means that designers do not know if the system will miss deadlines. It is impractical in a complex system (which means almost any real-world product) to do enough testing with a multitasked RTOS-based system to ensure that deadlines will be met under worst case conditions. Thus, a design team that fails to do scheduling analysis should appreciate that they have an unknown probability of missing deadlines, and can reasonably expect to miss deadlines in the worst case if the CPU load is above about 69% for a large number of non-harmonic tasks.

 Accepted Practices:
  • Using Rate Monotonic Scheduling (RMA) when using an RTOS with multi-tasking capability, or using some other mathematically sound scheduling analysis. Rate Monotonic Scheduling (RMS) is what you get as a result of RMA.
  • Documenting the RMA analysis including at least listing: all tasks, WCET for each task, period of each task, and worst case system blocking time. Assumptions used in the analysis should be stated and justified (for example, an assumption that no low priority task blocks a high priority task must be justified via written explanation, or if untrue then advanced techniques must be used to ensure schedulability). The system use must be less than 100% of the CPU if a set of favorable assumptions such as harmonic task periods can be documented to hold, and may be restricted to as low as 69.3% CPU used (meaning 30.7% unused CPU capacity) in the general case.
  • A specifically bad practice is basing real time performance decisions solely on spare capacity (e.g., “CPU is only 80% loaded on average”) in the absence of mathematical scheduling analysis, because it does not guarantee safety critical tasks will meet their deadlines. Similarly, monitoring spare CPU capacity as the only way to infer whether deadlines are being met is a specifically bad practice, because it does not actually tell you whether high frequency deadlines are being met or not.
  • Permitting more than one instance of a real-time task to queue is a specifically bad practice, because this can only happen when real time deadlines are being missed. This practice is a bandaid for a real-time system, and indicates that the system is missing its real-time deadlines.
Discussion

Real time scheduling is a mathematical approach to ensuring that every task in a real time embedded system meets its deadlines under all specified operating conditions. Using a mathematical approach is required because testing can only exercise some of the system operating conditions. There are almost always real time scheduling situations that can’t be adequately tested in a complex piece of software, requiring either simplifying the system to make testing feasible, or using a rigorous mathematical approach to ensure that complex scheduling is guaranteed to work. When a multi-tasking real time operating system such as OSEK is used, a mathematical approach must be used to ensure deadlines are met.

Rate Monotonic Analysis

The generally accepted method for scheduling critical real time systems is to perform Rate Monotonic Analysis (RMA), possibly with one of a number of adaptations for special circumstances, and create a system with a Rate Monotonic Schedule. RMA has the virtue of providing a simple set of rules that guarantees all tasks in a system will meet their deadlines. To achieve this, RMA requires rules to be followed such as all tasks having a defined fastest time period at which they will run, and the period of tasks being harmonic multiples to permit using 100% of the CPU capacity. (Task periods are considered harmonic if they are exact multiples of each other. For example, periods of 1, 10, 100, 1000 msec are harmonic, but 1, 9, 98, 977 msec are not harmonic.) To implement RMA, designers sort tasks based on period, and assign the fastest task to the highest priority, second fastest task to the second highest priority, and so on. If a designer wishes to have multiple tasks at the same priority that is acceptable, but it is required that all tasks at a given priority execute at the same period. 

The benefit of RMA is that there is a mathematical proof that all tasks will meet deadlines if a certain set of rules is followed. If any of the rules is bent, however, then more complex mathematical analysis must be performed or other special techniques used to ensure that deadlines will be met. In particular, if task periods are not harmonic multiples of each other, guaranteeing that deadlines will be met requires leaving slack capacity on the CPU. For a system with many tasks, RMA can only guarantee that task deadlines will be met if the CPU load is a maximum of 69.3%. (For this reason, it is common for embedded systems to use harmonic multiples of task periods.)


Example execution of a system scheduled with RMA.
(Source: http://blogs.abo.fi/alexeevpetr/2011/11/24/generating-multithread-code-from-simulink-model-for-embedded-target/)

Employing RMA requires at least the following steps at design time: listing all tasks in the system including interrupt service routines, determining the fastest period of each task, determining the worst case execution time (WCET) of each task, and determining the maximum blocking time of the system (longest time during which interrupts are disabled). Given those numbers, tasks can be assigned priorities and designers can know if the system will always meet its deadlines. Given correct RMA scheduling and a fault-free operational system, every task will complete by the end of its period, so it will never be possible to have a second occurrence of a task enqueued while waiting for a first occurrence to complete.

If the CPU is over-subscribed, there are established methods of ensuring that critical tasks always complete while non-critical tasks get whatever CPU resources are left. The simplest method is to simply assign all non-critical tasks priorities lower than the lowest priority critical task (which works so long as those non-critical tasks can’t substantively delay with the operation of the critical tasks). If it is important to share available CPU time across a number of noncritical tasks even when the CPU is overloaded, this can be done by having non-critical tasks take turns at the lowest priority. Note that an overloaded CPU does not cause any critical tasks to miss their deadlines in this scenario; it is simply a matter of making non-critical tasks wait a bit longer to execute if the CPU is overloaded. (This assumes that the CPU can handle worst case demand from critical tasks. This assumption must be ensured by the design process for the system to be safe.)

Less Than 100% CPU Load Does Not Guarantee Deadlines Are Met

A specifically bad practice is looking at idle time during testing to determine whether or not the CPU is overloaded, and inferring from less that 100% usage that the system will meet its deadlines. That simply does not account for the worst case, or potentially even just an infrequent heavy load case.

As a non-computer example of missing deadlines with less than 100% loading, let’s say you want to spend five days per week working and three days out of every 12 volunteering at a homeless shelter (the fact that it is out of 12 days instead of out of 14 days makes these periods non-harmonic). Because work has a period of 7 days, it will have higher priority than the 12 day volunteer service period. This means you’ll complete all of your work the first 5 days (Monday – Friday) out of each 7-day work week before you start volunteering on weekends. Most weeks this will work fine (you’ll have enough time for both). But when the 12-day volunteer period starts on a Monday, you’ll only have one weekend (two days) in the 12 day period for volunteering that runs Monday of one week through Friday of the next week.  Thus you’ll be a day short for volunteering whenever the volunteer period starts on a Monday. The amount of time you’ve committed is 5 days out of 7 plus 3 days out of 12:  5/7 + 3/12 = 96.4%. But you’re going to come up a whole day short on volunteering once in a while even though you are less than 100% committed and you are taking some weekend days off, performing neither task. You could solve this by committing 4 days out of 14 to volunteering, which is actually a slightly higher workload (100% total), but changes the period to be harmonic with the weekly work cycle. (3.5 days out of every 14 would also work.) Thus, as shown by this example, non-harmonic task periods can result in missed deadlines even with a workload that is less than 100%.

Example of 79% loaded CPU missing a task deadline with only 4 tasks.

As a computer-based example, the above figure shows another example in terms of a task in a four-task system missing its deadline on a system with non-harmonic periods even though the CPU is only 79% loaded. This scenario would happen rarely, and only when all the tasks’ periods synchronize, which for this example would be only once every Least Common Multiple of (19,24,29,34) = 224,808 time units – and even then only if each task actually took its worst case time to execute. Thus, while Task 4 would be expected to miss its deadline in operation, detecting that situation might be difficult with limited duration testing.

Selected Sources

Rate Monotonic Scheduling was developed to address the problem of creating an easy-to-follow recipe for ensuring that real time schedules can be met in a system with multiple tasks. Influential early papers include (Liu & Layland 1973 and Lehoczky et al. 1989).

Ganssle recommends the use of RMA as early as 1992 (Ganssle 1992, pg. 200-201) and sketches its use, giving a reference for more details.

Douglass provides a pattern for static priority real time scheduling and states: “the most common policy for the selection of policies is rate monotonic scheduling or RMS”.  (Douglass 2002, section 5.9.5; note that rate monotonic scheduling is what you get as a result of rate monotonic analysis).

In the context of safety critical operating systems, Kleidermacher says “Rate monotonic analysis (RMA) is frequently used by system designers to analyze and predict the timing behavior of systems.” (Kleidermacher 2001, pg. 30).

MISRA Report 3 discusses the use of real-time kernels (which for our purposes are operating systems that use some sort of scheduling approach). It notes that there are a number accepted scheduling techniques, and that the use of “best available technology” such as for example using a certified RTOS brings benefit, providing a reference to a number of text.

References:
  • Douglass, Doing Hard Time: Developing Real-Time Systems with UML, Objects, Frameworks, and Patterns, Addison-Wesley Professional, 1999.
  • Ganssle, J., The Art of Programming Embedded Systems, Academic Press, 1992.
  • Kleidermacher, D. & Griglock, M., Safety-Critical Operating Systems, Embedded Systems Programming, Sept. 2001, pp. 22-36.
  • Lehoczky, J.; Sha, L.; Ding, Y. "The rate monotonic scheduling algorithm: exact characterization and average case behavior", IEEE Real-Time Systems Symposium, 1989, pp. 166–171.
  • Liu, C. L.; Layland, J., "Scheduling algorithms for multiprogramming in a hard real-time environment", Journal of the ACM, 1973, (1): 46–61
  • MISRA, Report 3: Noise, EMC and Real-Time, February 1995.

4 comments:

  1. Thank you for an excellent article. However, I am surprised that you have not mentioned Deadline Monotonic Analysis (DMA) - not, of course, to be confused with Earliest Deadline First (EDF), which is anathema to real-time programming.

    RMA (in its impure form with allowance made for task dependencies, etc.) is ideal for time triggered systems but is seriously sub-optimal for the more general event-driven systems which abound in the world of embedded systems. DMA is much better matched to the realities of such systems and the mathematics of the analysis is almost identical to that of RMA. Only the task ordering criterion changes, from period to deadline.

    DMA, however, does not yield any simple overall "scheduling sufficiency" criterion akin to the magic 69% utilisation figure arising from RMA; each task has to be assessed individually as to its schedulability. This is no bad thing, though. The famous 69% figure is much better known than the theory underlying it and is generally misunderstood and abused, like other measures of utilisation, as you point out in your article. With DMA, though, there are no short cuts; you really have to understand what you're doing!

    ReplyDelete
  2. You're right that DMA is most often the right choice. The fix to the scheduling sufficiency criterion is to create a virtual period = MIN(period, deadline) and then order by virtual period. This wastes some CPU capacity but then gets you back to the magic 69% (or 100% for harmonic periods) if you exclude the amount wasted by using a virtual period in cases where deadline<period. My book chapter on scheduling shows the math for this. You can also see it in the slides for lecture 16 of my graduate-level embedded systems course here: http://www.ece.cmu.edu/~ece649/

    ReplyDelete
  3. Thanks for this. One comment. You said: "To achieve this, RMA requires rules to be followed such as all tasks having a defined fastest time period at which they will run, and the period of tasks being harmonic multiples to permit using 100% of the CPU capacity." Does RMA really require the period of the tasks to be harmonic multiples? I am no expert - but did not remember that as a requirement.

    ReplyDelete
    Replies
    1. Harmonic multiples is only required if you want to use 100% of the CPU. If task periods aren't harmonic, then you get to use the equation based on number of tasks to compute the limit (I'm sure you can find it if you search for RMA tutorials) and this gets you the 69.3% utilization bound mentioned a couple times in this posting.

      Delete

Please send me your comments. I read all of them, and I appreciate them. To control spam I manually approve comments before they show up. It might take a while to respond. I appreciate generic "I like this post" comments, but I don't publish non-substantive comments like that.

If you prefer, or want a personal response, you can send e-mail to comments@koopman.us.
If you want a personal response please make sure to include your e-mail reply address. Thanks!

Job and Career Advice

I sometimes get requests from LinkedIn contacts about help deciding between job offers. I can't provide personalize advice, but here are...