Monday, November 9, 2015

How Long Should You Set Your Watchdog Timer


Summary: You can often use a watchdog effectively even if you don't know worst case execution time. The more you know about system timing, the tighter you can set the watchdog.



Sometimes there is a situation in which a design team doesn't have resources to figure out a tight bound on worst case execution time for a system. Or perhaps they are worried about false alarm watchdog trips due to infrequent situations in which the software is running properly, but takes an unusually long time to complete a main loop. Sometimes they set the watchdog to the maximum possible time (perhaps several seconds) to avoid false alarm trips. And sometimes they just set it to the maximum possible value because they don't know what else to do.

In other words, some designers turn off the watchdog or set it to the maximum possible setting because they don't have time to do a detailed analysis.  But, almost always, you can do a lot better than that.

To get maximum watchdog effectiveness you want to put a tight bound on worst case execution time with it. However, if your system safety strategy permits it, there are simpler ways to compute watchdog period by analyzing the application instead of the software itself.  Below I'll work through some ways to set the watchdog, going from the more complicated way to a simpler way.

Consider software that heats water in an appliance. Just to make the problem concrete and use easy numbers, lets say that the control loop for heating executes every 100 msec, and takes between 40 and 75 msec to execute (worst case slow and fast speeds).   Let's also say that it uses a single-task main loop scheduler without an RTOS so we don't have to worry about task start time jitter.  How could we set the watchdog for this system? Ideally we'd like the tightest possible timing, but there may be some slack because water takes a while to heat up, and takes a while to boil dry. How long should we set the watchdog timer for?

Classical Watchdog Setup

Classically, you'd want to compute the worst case execution time range of the software (40-75 msec in this case). Let's assume the watchdog kick happens as the last instruction of each execution. Since the software only runs once every 100 msec, then the shortest time between kicks is when one cycle runs 75 msec, waits 25 msec, then the  next cycle runs faster, completing the computation and kicking the watchdog in only 40 msec.    25+40=65 msec shortest time between kicks. In contrast, the longest time between kicks is when a short cycle of 40 msec is followed by 60 msec of waiting, then a long cycle of 75 msec.  60+75=135 msec longest time between kicks. It helps a lot to sketch this out:


If you're setting a conventional watchdog timer, you'd want to set it at 135 msec (or the closest setting greater than that). If you have a windowed watchdog, you'd want to set the minimum setting at 65 msec, or the closest setting lower than that. 

Note that if your running an RTOS, the scheduling might guarantee that the task runs once in every 100 msec, but not when it starts within that period. In that case the worst case shortest time is the task running back-to-back at its shortest length = 40 msec. The longest time will be when a short task runs at the beginning of a period, and the next task completes right at the end of a period, giving 60 msec (idle time at end of period) + 100 msec (one more period) = 160 msec between watchdog kicks. Thus, a windowed watchdog for this system would have to permit a kick interval of 40 to 160 msec.

Watchdog Approximate Setup

Sometimes designers want a shortcut. It is usually a mistake to set the watchdog at exactly the period because of timing jitter in where the watchdog actually gets kicked. Instead, a handy rule of thumb for non-critical applications (for which you don't want to do the detailed analysis) is to set the watchdog timer interval to twice the software execution period. For this example, you'd set the watchdog timer to twice the period of 100 msec = 200 msec.  There are a couple assumptions you need to make: (1) the software always finishes before the end of its period, and (2) the effectiveness of the watchdog at this longer timeout will be good enough to ensure adequate safety for your system. (If you are building a safety critical system you need to dig deeper on this point.) 

This simpler approach sacrifices some watchdog timer effectiveness for detecting faults that perturb system timing compared to the theoretical bound of 135-160 msec. But it will still catch a system that is definitely hung without needing detailed timing analysis.

For a windowed watchdog the rule of thumb is a little more difficult.  That is because, in principle, your task might run the full length of one period and complete instantly on the next period, giving effectively a zero-length minimum watchdog timer kick interval. If you can establish a lower bound on the minimum possible run time of your task, you can set that as an approximate lower bound on watchdog timer kicks. If you don't know the minimum time, you probably can't use the lower bound of a windowed watchdog timer.

Application-Based Watchdog Setup

The previous approaches required knowing the run time of the software. But, what do you do if you know nothing about the run time, or have low confidence in your ability to predict the worst case bounds even after extensive analysis and testing?

An alternate way to approach this problem is to skip the analysis of software timing and instead think about the application. Ask yourself, what is the longest period for which the application can withstand a hung CPU?  For example, for a counter-top appliance heating water, how long can the heater be left full-on due to hung software without causing a problem such as a fire, smoke, or equipment damage?  Probably it's quite a bit longer than 100 msec.  But it might not be 5 or 10 seconds. And probably you don't want to risk melting a thermal fuse or a household smoke alarm by  turning off the watchdog entirely.

As a practical matter, if the system's controls go unstable after a certain amount of time, you'd better make your watchdog timer period shorter than that length of time!

For the above example, the control period is 100 msec. But, let's say the system can withstand 500 msec of no control inputs without becoming uncrecoverable, so long as the control system starts working again within that time and it doesn't happen often. In this case, the watchdog timer must be shorter than 500 msec at least. But there is one other thing to consider. The system probably takes a while to reboot and start running the control loop after the watchdog timer trips. So we need to account for that restart time. For this example, let's say the time between a watchdog reset and the time the software starts controlling the system generally ranges from 70 to 120 msec based on test measurements.

Based on knowing the system reset time and stability grace period, we can get an approximate watchdog setting as follows.  We have 500 msec of no-control grace period, minus 120 msec of worst case restart time.  500-120 = 380 msec. Thus, for this system the maximum permissible watchdog timer value is 380 msec to avoid losing control system stability. Using this approach, the watchdog maximum period should be set at the longest time that is shorter than 380 msec. Without knowing more about software computation time, there is not much we can say about the minimum period for a windowed watchdog.

Note that for this approach we still need to know something about the worst case execution of the software in case you hit a long execution path after the watchdog reset. However, it is often the case that you know the longest time that is likely to be seen (e.g., via measuring a number of runs) even if you don't know the details of the real time scheduling approach used.  And often you might be willing to take the chance that you won't hit an unlikely, even worse running time right after a system reset. For a non-safety critical application this might be appropriate, and is certainly better than just turning the watchdog off entirely.

Finally, it is often useful to combine the period rule of thumb with the control stability rule of thumb (if you know the task execution period). You want the watchdog set shorter than the time required to ensure control stability, but longer than the time it actually takes to execute the software so that it will actually be kicked often enough. For the above example this means setting the watchdog somewhere between two periods and the control stability time limit, giving a range for the maximum watchdog limit of 200-380 msec. This can be set without detailed software execution time analysis beyond knowing the task period and the range of likely system restart times.

Summary
If you know the maximum worst case execution time and the minimum execution time, you should set your watchdog as tightly as you can. If you don't know those values, but at least are confident you'll complete execution within your task period, then you should set the watchdog to twice the period. You should also bound the maximum watchdog timeout setting by taking into account how long the system can operate without a reset before it loses control stability.

Job and Career Advice

I sometimes get requests from LinkedIn contacts about help deciding between job offers. I can't provide personalize advice, but here are...