Monday, January 4, 2016

Top Two Mistakes In Main Loop Scheduling


In my previous posting I showed you how to build a multi-rate main loop. In this posting we're going to start digging into the critical issue of how to do timing analysis for a multi-rate main loop cooperative scheduler. I'll explain the top two mistakes I see in industry code that lead to missed task deadlines.

I'm going to assume your multi-rate main loop is implemented more or less the way I described in my previous blog posting. The important property we care about is whether each task gets to run to completion sometime within its assigned time period.
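As a quick refresher, the general shape of the scheduler I have in mind looks something like this. This is a condensed sketch rather than the exact code from the previous posting; the task functions and counter names here are just placeholders (the reload values happen to match the example task set below):

    #include <stdint.h>

    /* Application task functions (placeholders for this discussion). */
    void Task0(void); void Task1(void); void Task2(void); void Task3(void);

    /* One down-counter per task, decremented by a 1 msec timer tick. */
    static volatile uint16_t gTimer0, gTimer1, gTimer2, gTimer3;

    void TimerTickISR(void)   /* hooked to a 1 msec hardware timer */
    {
      if (gTimer0 > 0) { gTimer0--; }
      if (gTimer1 > 0) { gTimer1--; }
      if (gTimer2 > 0) { gTimer2--; }
      if (gTimer3 > 0) { gTimer3--; }
    }

    int main(void)
    {
      for (;;)   /* highest priority ready task wins each time around */
      {
        if      (gTimer0 == 0) { gTimer0 = 5u;   Task0(); }
        else if (gTimer1 == 0) { gTimer1 = 10u;  Task1(); }
        else if (gTimer2 == 0) { gTimer2 = 20u;  Task2(); }
        else if (gTimer3 == 0) { gTimer3 = 100u; Task3(); }
      }
    }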

Multi-Rate Main Loop CPU Utilization

Let's say you have a set of tasks that looks like this, with the table columns being the task number, the period the task runs at (in msec), and the amount of CPU time it takes to run the task (again in msec):

Task  Period  Compute   
 0       5       1    
 1      10       2    
 2      20       3        
 3     100      11     

In this example, Task 2 runs once every 20 msec, and takes 3 msec to run once it has started.

Here is the first big question: is the CPU overloaded to more than 100%? To answer that, for each row in the table add a CPU % that is computed by dividing each task's run time by its period. For example task 2 runs 3 msec out of every 20 msec, so the CPU utilization is 3/20 = 15%.

Task  Period  Compute  CPU Load 
 0       5       1       20%
 1      10       2       20%
 2      20       3       15%    
 3     100      11       11%
         Total CPU Load  66%

OK, total CPU load is only 66%.  This is good news so far.

To do this math in the real world, you also need to add in the CPU time spent by interrupt service routines. Sometimes you end up with more than 100%, and that is obviously a problem.
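If you want to automate this arithmetic, a few lines of code will do it. Here is a minimal sketch; the task table is the example above, and the ISR numbers (a 1 msec tick that costs 0.01 msec) are made up purely for illustration:

    #include <stdio.h>

    typedef struct { double period_ms; double compute_ms; } TaskTiming;

    int main(void)
    {
      /* Example task set from the table above. */
      const TaskTiming tasks[] = { {5,1}, {10,2}, {20,3}, {100,11} };
      const int num_tasks = sizeof(tasks) / sizeof(tasks[0]);

      /* Hypothetical ISR: 1 msec period, 0.01 msec execution time. */
      double total_load = 0.01 / 1.0;

      for (int i = 0; i < num_tasks; i++)
      {
        double load = tasks[i].compute_ms / tasks[i].period_ms;
        printf("Task %d: %4.1f%%\n", i, 100.0 * load);
        total_load += load;
      }
      printf("Total CPU load: %4.1f%%\n", 100.0 * total_load);  /* 67.0% */
      return 0;
    }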

It might seem impossible to get this wrong. But I would not be writing this if I hadn't seen this problem in a real industry design review. More than once. The way teams get this wrong is not by botching the arithmetic. The problem is that they don't do the analysis in the first place, and have no idea what their CPU load is because they never measure it.

An especially tricky problem is when a task might normally run quickly, but once in a while has a very long compute time. This means that the system might meet its deadlines most of the time, but not all the time. If you were to monitor typical CPU load on a testbed things would be fine. But you might still miss deadlines once in a while -- perhaps only once in a really long while that you don't see during system test.

When doing a CPU loading calculation, the compute time has to be the worst case path through each and every task in the code (sometimes called Worst Case Execution Time: WCET), not just a typical path. Determining WCET can be a bit of a pain, but doing so for each task -- even the very infrequent ones -- is a required first step in knowing whether you are going to miss deadlines.
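One pragmatic starting point is to instrument the scheduler to keep a high-water mark of each task's observed execution time. Here is a sketch, assuming a hypothetical ReadMicrosecondTimer() function that returns a free-running timestamp on your particular hardware:

    #include <stdint.h>

    /* Hypothetical: free-running microsecond timestamp for your hardware. */
    extern uint32_t ReadMicrosecondTimer(void);

    #define NUM_TASKS 4
    static uint32_t gMaxTaskTime_us[NUM_TASKS];  /* high-water mark per task */

    static void RunAndMeasure(uint8_t task_num, void (*task_fn)(void))
    {
      uint32_t start = ReadMicrosecondTimer();
      task_fn();
      uint32_t elapsed = ReadMicrosecondTimer() - start;  /* unsigned math handles timer wraparound */
      if (elapsed > gMaxTaskTime_us[task_num])
      {
        gMaxTaskTime_us[task_num] = elapsed;
      }
    }

Keep in mind that a measured maximum is only a lower bound on the true WCET; unless testing happens to exercise the worst case path, you still need to reason about the code itself.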


Blocking Time

But we're not out of the woods yet! It is pretty easy to design a system that misses deadlines despite low CPU loads. Have you spotted the problem with the example task set above yet?

The issue is that in a non-preemptive system like this, once a task starts running, it runs to completion. An infrequent task might account for only a small CPU load, but still hog the CPU for a long time each time it runs.

Take another look at Task 3. It only runs every 100 msec, but it runs for 11 msec. That's only 11% CPU load. But 11 msec is more than twice as long as Task 0's period! That means no matter what you do, once Task 3 starts running, Task 0 will miss a deadline.  If you want to see a simple example, consider the following timeline:

Time 100:   Task 0 runs for 1 msec
Time 101:   Task 3 runs for 11 msec
Time 112:   Task 0 can start running.  But it needed to run again between times 105 and 110, so it has already missed its deadline.

This problem occurs regardless of CPU load. Even if Task 3 executes only once per day, Task 0 would miss a deadline every day. In fact, the longer the period of Task 3, the less frequently the problem will happen and the harder it is going to be to find in testing.  (Sure, some systems can tolerate an occasional missed deadline. But you at least want to know that it is going to happen so you can design the system to be robust to a missed deadline. And for a safety critical system, missing deadlines due to a design flaw is a bad idea regardless of system robustness.)

In general, if any task has a compute time longer than twice the period of the fastest task, you're going to miss deadlines.  (A more precise formulation: if twice the compute time of the fastest task plus the compute time of the longest task exceeds twice the period of the fastest task -- that is, if 2*Cfastest + Clongest > 2*Pfastest -- you'll miss deadlines.) Note that passing this check is necessary but not sufficient for schedulability.
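In code form, the quick check looks something like this (a sketch of the rule of thumb above, with the fastest task assumed to be at index 0):

    #include <stdbool.h>

    /* Rule-of-thumb blocking check.  periods[] and computes[] are in
       msec; index 0 is the fastest (highest priority) task.
       Returns true if the task set is guaranteed to miss deadlines. */
    static bool FailsBlockingCheck(const double periods[],
                                   const double computes[], int num_tasks)
    {
      double longest = 0.0;
      for (int i = 1; i < num_tasks; i++)   /* longest of the other tasks */
      {
        if (computes[i] > longest) { longest = computes[i]; }
      }
      /* Miss if:  2 * Cfastest + Clongest  >  2 * Pfastest */
      return ((2.0 * computes[0]) + longest) > (2.0 * periods[0]);
    }

For the example task set this returns true: 2*1 + 11 = 13 msec, which is more than 2*5 = 10 msec, so Task 0 is guaranteed to miss deadlines.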

In design reviews I see this problem of a low CPU percentage combined with a blocked high priority task frequently.  Very frequently. So don't let it bite you!

Even if a task set passes that check, it is still possible to miss deadlines with less than 100% CPU load, but that analysis takes a while to explain, so it will have to wait for another posting.

If you simply can't wait, the easy-to-explain (but tedious) way to check things is to lay out a timeline of when tasks will execute using a spreadsheet, with one row per msec and starting with all tasks being ready to run at time zero. Put down when each task will run as a series of occupied cells in the timeline. Repeat until you reach the least common multiple (LCM) of all the task periods. If they all fit, you should be OK.  If not, you need to dig deeper into what's going on. (Note that this neglects the effects of jitter in task start times caused by race conditions between the timer ISR and the main loop if..else chain. So exercise care with that analysis.) 
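If a spreadsheet feels too tedious, the same timeline can be generated by a short simulation. Here is a sketch that models the if/else chain at 1 msec resolution with zero start-time jitter (so, per the caution above, it is a sanity check rather than a proof):

    #include <stdio.h>

    #define NUM_TASKS   4
    #define HYPERPERIOD 100   /* LCM of 5, 10, 20, and 100 msec */

    int main(void)
    {
      const int period[NUM_TASKS]  = { 5, 10, 20, 100 };
      const int compute[NUM_TASKS] = { 1,  2,  3,  11 };
      int deadline[NUM_TASKS] = { 0 };  /* deadline of current release  */
      int pending[NUM_TASKS]  = { 0 };  /* released but not yet run?    */
      int busy_until = 0;               /* CPU occupied until this time */

      /* Two hyperperiods, to let the time-zero transient settle. */
      for (int t = 0; t < 2 * HYPERPERIOD; t++)
      {
        /* Release each task at multiples of its period; its deadline
           is its next release time. */
        for (int i = 0; i < NUM_TASKS; i++)
        {
          if ((t % period[i]) == 0)
          {
            if (pending[i])
            { printf("Task %d never started; deadline was %d\n", i, t); }
            pending[i] = 1;
            deadline[i] = t + period[i];
          }
        }

        /* When the CPU frees up, run the highest priority pending
           task to completion (this mirrors the if/else chain). */
        if (t >= busy_until)
        {
          for (int i = 0; i < NUM_TASKS; i++)
          {
            if (pending[i])
            {
              busy_until = t + compute[i];
              if (busy_until > deadline[i])
              { printf("Task %d finishes at %d; deadline was %d\n",
                       i, busy_until, deadline[i]); }
              pending[i] = 0;
              break;
            }
          }
        }
      }
      return 0;
    }

For the example task set this reports Task 0 missing a deadline shortly after each time Task 3 starts running, matching the timeline above.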
