Monday, July 22, 2013

Making Main Loop Scheduling Timing More Deterministic

Summary: Wasting time in a main loop scheduler can make testing system-level timing a whole lot easier.



It's common enough to see a main loop scheduler in an embedded system along the lines of the following:

for(;;)
{ Task1();
  Task2();
  Task3();
}

I've heard this referred to as a "static scheduler," "cyclic executive," "main loop scheduler," or "static non-preemptive scheduler" among other terms. Regardless of what you call it, the idea is simple: run all the tasks that need to be run, then go back and do it again until the system is shut down or reset. There might be interrupt service routines (ISRs) also running in the system.

The main appeal of this approach is that it is simple. You don't need a real time operating system and, even more importantly, it would appear to be difficult to get wrong. But, there's a little more to it than that...

The first potential problem is those pesky ISRs. They might have timing problems, cause concurrency problems, and so on. Those issues are beyond what I want to talk about today except for making the point that they can affect the execution speed of one iteration through the main loop in ways that may not be obvious. You should do timing analysis for the ISRs (chapter 14 of my book has gory details).  But for today's discussion we're going to assume that you have the ISRs taken care of.

The next problem is timing analysis of the main loop. The worst case response time for running a task is one trip through the loop. But how long that trip is might vary depending on the calculations each task performs and how much time ISRs steal. It can be difficult to figure out the absolute worst case execution time (you should try, but it might not be easy). But the really bad news is, even if you know the theoretical worst case timing you're unlikely to actually see it during testing.

Consider the tester trying to make sure the system will function with worst case timing. How do you get the above static scheduler to take the worst case path through the code a bunch of times in a row to see what breaks? That's a difficult thing to arrange, and most testers probably don't have a way to pull it off. So in practice you are shipping product that has never been tested at its worst case main loop execution time. Will it work?  Who knows.  Do you want to take that chance with 10,000 or 100,000 units in the field? Eventually one of them will see worst case conditions, and you haven't actually tested what will happen.

Fortunately there is an easy way to mitigate this risk. Add a time-waster at the end of the main loop. The time waster should convert the above main loop, which runs as fast as it can, to a main loop that runs exactly once per a defined period (for example, once every 100 msec):

for(;;)
{ StartTimer(100);  // start a 100 msec countdown
  Task1();
  Task2();
  Task3();
  WaitForTimer(0);  // wait for the 100 msec countdown to reach 0
}

This is just a sketch of the code -- how you build it will depend upon your system. The idea is that you waste time in the WaitForTimer routine until 100 msec have elapsed since the top of the main loop, then you run the loop again. Thus, the main loop runs exactly once every 100 msec.  If the tasks finish in less than 100 msec, as measured by a hardware timer, you waste time at the end, waiting for the 100 msec period to be up before starting the next main loop iteration. If the tasks take exactly 100 msec then you just start the main loop again immediately. If the tasks run longer than 100 msec, then you should log an error or perform some other action so you know something went wrong.
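As one possible way to flesh out StartTimer and WaitForTimer (a sketch only -- the g_ticks_ms counter, the overrun flag, and the self-advancing tick in the wait loop are my own inventions so the example can run stand-alone; in a real system a periodic timer ISR would increment the tick counter):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Free-running millisecond tick counter. In a real system a periodic
   timer ISR increments this; here WaitForTimer advances it itself so
   the sketch can run stand-alone. */
static volatile uint32_t g_ticks_ms = 0;

static uint32_t deadline_ms = 0;     /* when the current period expires    */
static bool     overrun     = false; /* set if the tasks blew the period   */

static void StartTimer(uint32_t period_ms)
{
    deadline_ms = g_ticks_ms + period_ms;  /* unsigned math wraps safely */
}

static void WaitForTimer(void)
{
    /* Signed difference of unsigned counters handles wraparound. */
    if ((int32_t)(g_ticks_ms - deadline_ms) > 0) {
        overrun = true;   /* period already expired: tasks ran too long */
    }
    while ((int32_t)(g_ticks_ms - deadline_ms) < 0) {
        g_ticks_ms++;     /* stand-in for the timer ISR ticking away    */
    }
}
```

The overrun flag is the "log an error" hook: checking it each loop tells you the period was blown even when the overrun was too small to trip a watchdog.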

The key benefit to doing this is to ensure that in testing the average timing behavior is identical to the worst case timing behavior. That way, if something works when the system is fast, but breaks when it actually takes 100 msec to complete the main loop, you'll see it right away in testing. A second benefit is that since you are actively managing the main loop timing, you have a way to know the timing ran a little long on some loops even if it isn't bad enough to cause a watchdog reset.






5 comments:

  1. Quite often one has the problem that tasks 1..n run longer than the overall schedule period. In this case it is good design practice to structure them as state machines so they react fast and keep execution time short.

    This design is simple to understand and debug and leads to very robust systems. Maybe it is worth its own blog entry.
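    A minimal sketch of that state machine structure (the states and chunk sizes here are invented for illustration): each call to the task does one short step and returns, so no single pass through the main loop runs long.

    ```c
    #include <assert.h>

    /* A task split into short states so each call returns quickly. */
    enum TaskState { ST_START, ST_WORK, ST_FINISH };
    static enum TaskState state = ST_START;
    static int work_done = 0;

    static void Task1(void)
    {
        switch (state) {
        case ST_START:            /* cheap setup step     */
            work_done = 0;
            state = ST_WORK;
            break;
        case ST_WORK:             /* one small work chunk */
            work_done++;
            if (work_done >= 3) {
                state = ST_FINISH;
            }
            break;
        case ST_FINISH:           /* wrap up, then restart */
            state = ST_START;
            break;
        }
    }
    ```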

    Peter / pmueller@sinelabore.com

  2. Peter -- that's a good suggestion. For timing you then have to figure out which state machine action in each task takes the longest amount of time and use that in your worst case timing analysis, but it is one way to make everything fit into a single list of tasks that all execute once per main loop.

    A related way that I used on a system was to find a natural chunk of work and just do one or more chunks in a task each time through. For example, in a hotel control system application I just did actions for one room each time through the loop and incremented room number for the next time through the loop.
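    As a sketch of that chunking idea (ServiceRoom, NUM_ROOMS, and the bookkeeping array are all made up for illustration), servicing one room per main loop pass might look like:

    ```c
    #include <assert.h>

    #define NUM_ROOMS 100

    static int serviced[NUM_ROOMS];   /* per-room service count (demo only) */
    static int next_room = 0;         /* round-robin position               */

    /* Stand-in for the real per-room control work. */
    static void ServiceRoom(int room)
    {
        serviced[room]++;
    }

    /* One "chunk" per main loop pass: service a single room, then advance. */
    static void RoomTask(void)
    {
        ServiceRoom(next_room);
        next_room = (next_room + 1) % NUM_ROOMS;
    }
    ```

    The worst case time for one loop pass is then bounded by the most expensive single room, not by the whole building.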

    Eventually I'll try to work my way through the spectrum of scheduling options if there is interest (there are a lot of ways to slice up things for scheduling, each with a different set of tradeoffs).

  3. My suggestion is to use a main loop that waits on a timeout (a master semaphore), which gets incremented in an interrupt handler. The main loop can also look at things like communication events (using an async, interrupt-only driven comms stack) so that on message reception the Rx handler can be run immediately. Otherwise all tasks are run once per major executive tick. This approach gives a deterministic regular update, making timers very easy (increment a counter on each tick, so a timer is just another byte or word variable local to the task).

    Tasks in turn use things like state machines to break up large chunks of work.

    Finally, by having the master timer implemented as a semaphore, in the (rare) event that some task does take too long, the main executive loop just runs the tasks as many times as needed to "catch up".

    The main loop is something like
    while (master_semaphore == 0)
    {
    // do nothing -- wait for the timer ISR to increment it
    }
    master_semaphore--;  // consume one tick; missed ticks make the loop "catch up"
    Task A
    Task B

    etc

    That's of course a big simplification, because there also needs to be a break-out from the busy-wait to handle comms events...

    This approach allows a pretty simple event driven system + cooperative multi-tasking to be built, with deterministic behaviour.

  4. What if you don't guarantee a return to the main loop in certain situations? Do you feel it is essential to guarantee a return to the main loop under all conditions? For example:
    main:
    while (1)
    {
    do task 1
    while (condition) { do task 2 }
    }
    We can have a watchdog timer on the second while loop to ensure that we get back to the start of the outer loop within a known worst case time. Do you feel there is any basic stability problem or other issue with this design?

  5. Sounds like a risky idea to me.
    If you use a timer to exit the second while loop, there is a chance you are halfway through some system update or change.
    Just because something theoretically might be made to work doesn't mean it is a good idea. I advocate keeping it simple.
    "Probably right" has a lot more risk than "so simple it is easy to show it is not wrong."


