Better Embedded System SW: February 2016

Wednesday, February 24, 2016

A Nice Rant About Representing Computer System Time

Here's a nice rant about dealing with time, time zones, daylight savings time, leap seconds, and why keeping things straight is so difficult. This especially applies to embedded systems which might not have a network connection, let alone access to a networked time service.

The overall video series looks pretty interesting too, although it's more about computers in general than embedded computers specifically. (Scroll down past the video link to see a list of how time-keeping bugs have caused severe outages and even deaths.)

"Summary: Published on Dec 30, 2013 A web app that works out how many seconds ago something happened. How hard can coding that be? Tom Scott explains how time twists and turns like a twisty-turny thing. It's not to be trifled with! (Embedded from YouTube; <computerphile>)"

Bonus content: here is a rogue's gallery of time goofs that I happen to include in my classroom lectures. If you know of other high profile outages or worse please submit as comments and I'll update the list as we go. (Note: this is only bugs caused by bad time-keeping, not all software outages):

Feb. 1991: Patriot missile failure due to floating point time roundoff; 28 deaths (link)
Dec 31, 1999: Y2K
Feb. 2007: F-22 raptor computer system crash due to crossing the international date line (link)
Feb. 2008: Microsoft Zunes basically bricked by leap year bug (link)
Mar. 2011: iPhones spring back instead of springing forward (link)
Mar. 2012: Windows Azure leap-year bug takes down G-cloud (link)
Jul. 1, 2012: Leap second bug wreaks havoc upon on-line services using NTP (link)
Sep. 2013: Deep Impact comet mission ends due to calendar date rollover fault (link)
May 2015: Boeing 787 timer rollover bug crashes engine software after 248 days (link)
Jul. 1, 2015: Another leap second problem, but not nearly as bad as 2012 (link)
Feb. 2016: iPhone prank bricks iPhones if date set back to zero Unix time (link)
Feb. 2016: Leap year bug leaves passengers without bags at Dusseldorf airport (link)
Jan. 19, 2038 03:15:07 GMT: Unix time rolls over ("Y2K for Unix")

(There are relatively few listings before 2010. Don't think for a minute that time suddenly got harder. What got worse was probably more things can be broken and result in headlines due to timekeeping faults.)

Update 11/8/2016: This is an entertaining and enlightening post about the quirks of time: Falsehoods programmers believe about time and it has a follow-up of additional falsehoods

Monday, February 1, 2016

Multi-Rate Main Loop Task Timing

In the past couple posts I've talked about how to build a multi-rate main loop scheduler and the two biggest mistakes I tend to see with timing analysis. In this posting I'll wrap up the topic (at least for now) by describing how to determine whether a particular task in a non-preemptive multi-rate system is going to meet its deadlines. The math gets pretty complex and is pretty painful, so I'll work an example and give the general rules without trying to show the iterative math behind the method. At that, this posting will take more work to understand completely than my usual posts, but it's just a complex topic and this isn't the place for an entire book chapter worth of math. On real systems you can generally do the checking with a spreadsheet once you understand the technique.

Consider the following task set:

Task Period Compute CPU Load
0 7 2 28.6%
1 10 2 20%
2 20 3 15%
3 101 5 5%
4 199 3 1.5%
Total CPU Load 70.1% (numbers rounded to nearest 0.1%)

If we want to find out the worst case response time for Task 2, we need to look at the following worst case:

All the tasks in the task set become ready to run simultaneously. In practice this means that the system timer is at zero or equal to the least common multiple of all periods
The most obnoxious task with priority lower than the task we care about sneaks in and starts running just before the highest priority task gets to run. (Generally this happens because that task got a chance to start right at the last instant before the counter ticked.)
All the tasks with higher priority run one or more times (based on their periods) before the task we care about runs. In general some tasks will run multiple times before our task gets a chance to run.

So for the example case of Task #2, that means:

Identify Task #3 as the task with the largest compute time from those tasks with a priority higher than Task #2. Assume it starts running at time zero because it snuck in ahead of all the other tasks.
Because it is time zero, all other tasks are ready to run starting at time zero. But because Task #3 snuck in first, all the other tasks are in the run queue at time zero.

Now we do an iterative calculation as follows, with each numbered step being an iteration of run the highest queued priority task and compute the new system time when it ends.

Start at time 0 with Task 3 running.
- Tasks 0, 1, 2, 4 queue to run at time zero.
- Running Task 3 takes us to time 5 when Task 3 completes.
- At time 5 tasks still queued are: 0, 1, 2, 4
Time 5: run highest priority queued task: Task 0.
- This runs for 2 msec, bringing us to time 7.
- At time 7 we reach the Task 0 period, so another copy of Task 0 queues to run.
- At time 7 tasks still queued are: 0, 1, 2, 4
At time 7 run highest priority queued task: Task 0.
- This runs for 2 msec, bringing us to time 9.
- At time 9 tasks still queued are: 1, 2, 4
At time 9 run highest priority queued task: Task 1.
- This runs for 2 msec, bringing us to time 11.
- At time 10 another copy of Task 1 is queued (note that we have missed the deadline for Task 1 since it should have completed at time 10, but ran until time 11).
- At time 11 tasks still queued are: 1, 2, 4
At time 11 run highest priority queued task: Task 1.
- This runs for 2 msec, bringing us to time 13.
- At time 13 tasks still queued are: 2, 4
At time 13 run highest priority queued task: Task 2.
- This runs for 3 msec, bringing us to time 16.
- At time 14 another copy of Task 0 is queued.
- At time 14 tasks still queued are: 0, 4.
At time 16 Task #2 has completed, ending our analysis

We've computed the worst case time to complete Task 2 is 16 msec, which is less than its 20 msec period. So Task 2 will meet its deadline. However, along the way we noticed that even though the CPU is only loaded to 70.1% Task 1 is going to miss its deadline.

Graphically, the above scenario looks like this:

The blue arrows show where tasks become queued to run, and the boxes show which task is running when.

To determine if your system is schedulable you need to repeat this analysis for every task in the system. In this case, repeating it for Task 1 will reveal that Task 1 misses its deadlines even though the CPU is only loaded at 70.1%.

In general the technique is to assume the longest-running task (biggest compute time) with priority lower than yours starts running and all other tasks queue at that same time. Then play out the sequence to see if you meet your deadline. There are a few notes on special cases. The longest task with lower priority may vary depending on which task you are evaluating. For example, the longest lower priority task for Task #3 is not Task #3, but rather Task #4. And the lowest priority task doesn't have a lower priority blocking task, so when evaluating Task #4 you can just assume it starts at time zero (there is no initial blocker). This can be expressed as an iterative equation that has to be cycled until it converges. If you really want a math-based approach take a look at the CAN performance lecture slides in my grad course, which is pretty much the same math. But trying to explain the math takes a lot more words and complicated typography than are practical in a blog entry.

If you find out you are missing deadlines and want to fix it, there are two general techniques that can help:

Make any really long-running low priority task run shorter, possibly by breaking it up into multiple shorter tasks that work together, or that can "yield" control back to the main loop every once in a while during their execution and pick up the next time they run where they left off. This will reduce the length of the initial blocking effect for all higher priority tasks.
Schedule tasks so their periods evenly divide each other (this will result in the Least Common Multiple of all periods equals the largest period). This corresponds to the the approach of harmonic task periods discussed in the real time scheduling chapter of my book. For non-preemptive tasking it will NOT support 100% CPU usage, but probably it will make worst case latency better.

Notes for readers who want to be thorough:

If you have interrupts you have to include the ISR contribution to CPU workload when doing this analysis. Doing so can get a bit tricky. The real time scheduling chapter of my book shows how to do this for a single-rate main loop scheduler.
It may be that you want to resolve a situation in which the fastest task gets blocked for too long by putting that faster task into an ISR. If you do that, keep it short and don't forget that it still affects CPU workload. This approach is discussed for a single-rate main loop scheduler, also in the real time scheduling chapter of my book.
We don't include the current task in the "most obnoxious" consideration because we start by assuming the system will meet deadlines where deadline = task period. Any particular task has to have finished running before it can run again unless it misses its deadline. So a task can't self-block in a system that is schedulable.