Monday, May 26, 2014

Proper Watchdog Timer Use

Watchdog timers are a prevalent mechanism for helping to ensure embedded system reliability. But they only work if you use them properly. Effective watchdog timer use requires that the failure of any periodic task in the system result in a watchdog timer reset.

Consequences: Improper use of a watchdog timer leads to a false sense of security in which software task deaths and software task time overruns are not detected, causing possible missed deadlines or partial software failures.

Accepted Practices: 
  • If multiple periodic tasks are in the system, each and every such task must contribute directly to the watchdog being kicked to ensure every task is alive.
  • Use of a hardware timer interrupt to directly kick the watchdog is a bad practice. (There is arguably an exception if the ISR keeps a record of all currently live tasks as described later.)
  • Inferring task health by monitoring the lowest priority task alone is a bad practice. This approach fails to detect dead high priority tasks.
  • The watchdog timeout period should be set to the shortest practical value. The system should remain safe even if any combination of tasks dies for the entire period of the watchdog timeout value.
  • Every time the watchdog timer reset happens during testing of a fully operational system, that fact should be recorded and investigated. 
Discussion

Briefly, a watchdog timer can be thought of as a counter that starts at some predetermined value and counts down to zero. If the watchdog actually gets to zero, it resets the system in the hopes that a system reset will fix whatever problem has occurred. Preventing such a reset requires “kicking” (terms for this vary) the watchdog periodically to set the count back to the original value. The idea is for software to convince the hardware watchdog that the system is still alive, forestalling a reset. This is not unlike asking a teenager to call in every couple of hours on a date to make sure that everything is going OK.
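
The countdown-and-kick behavior can be modeled in a few lines of C. This is only an illustrative software model, not a driver for any real part: the names wdt_model_t, wdt_init, wdt_kick and wdt_tick are hypothetical, and real hardware exposes the same behavior through device-specific registers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical software model of a countdown watchdog. */
typedef struct {
    uint32_t reload;   /* predetermined starting count */
    uint32_t count;    /* current count; reaching zero trips the watchdog */
    bool     tripped;  /* latched once the count reaches zero */
} wdt_model_t;

void wdt_init(wdt_model_t *w, uint32_t reload)
{
    assert(w != NULL && reload > 0u);
    w->reload  = reload;
    w->count   = reload;
    w->tripped = false;
}

/* "Kick": set the count back to the original value.
 * A watchdog that has already tripped cannot be rescued by a late kick. */
void wdt_kick(wdt_model_t *w)
{
    assert(w != NULL);
    if (!w->tripped) {
        w->count = w->reload;
    }
}

/* Advance the model by one clock tick.
 * Returns true once the watchdog has tripped (a real one would reset). */
bool wdt_tick(wdt_model_t *w)
{
    assert(w != NULL);
    if (!w->tripped && w->count > 0u) {
        w->count--;
        if (w->count == 0u) {
            w->tripped = true;
        }
    }
    return w->tripped;
}
```

With a reload value of 3, two ticks followed by a kick put the count back at 3, and three further ticks without a kick trip the model.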

Watchdog timer arrangement.

Once the watchdog trips, a system reset is the most common reaction, although in some cases a permanent shutdown of the system is preferable if it is deemed better to wait for maintenance intervention before attempting a restart.

Getting the expected benefit from a watchdog timer requires using it in a proper manner. For example, having a hardware timer interrupt trigger unconditional kicking of the watchdog is a specifically bad practice, because it doesn’t indicate that any software task except the hardware timer ISR is working properly. (By analogy, having your teenager set up a computer to automatically call home with a prerecorded “I’m OK” message every hour on Saturday night doesn’t tell you that she’s really OK on her date.)

For a system with multiple tasks it is essential that each and every task contribute to the watchdog being kicked. Hearing from just one task isn’t enough – all tasks need to have some sort of unanimous “vote” on the watchdog being kicked. Correspondingly, a specific bad practice is to have one task or ISR report in that it is running via kicking the watchdog, and infer that this means all other tasks are executing properly. (Again by analogy, hearing from one of three teenagers out on different dates doesn’t tell you how the other two are doing.) As an example, the watchdog “should never be kicked from an interrupt routine” (MISRA Report 3, p. 38), which in general refers to the bad practice of using a timer service ISR to kick the watchdog.

A related bad practice is assuming that if a low priority task is running, this means that all other tasks are running. Higher priority tasks could be “dead” for some reason and actually give more time for low priority tasks to run. Thus, if a low priority task kicks the watchdog or sets a flag that is the sole enabling data for an ISR to kick the watchdog, this method will fail to detect if other tasks have failed to run in a timely periodic manner.

Monitoring CPU load is not a substitute for a watchdog timer. Tasks can miss their deadlines even with CPU loads of 70%-80% because of bursts of momentary overloads that are to be expected in a real time operating system environment as a normal part of system operation. For this reason, another bad practice is using software inside the system being monitored to perform a CPU load analysis or other indirect health check and kick the watchdog periodically by default unless the calculation indicates a problem. (This is a variant of kicking the watchdog from inside an ISR.)

The system software should not be in charge of measuring workload over time -- that is the job of the watchdog. The software being monitored should kick the watchdog if it is making progress. It is up to the watchdog mechanism to decide if progress is fast enough. Thus, any conditional watchdog kick should be done just based on liveness (have tasks actually been run), and not on system loading (do we think tasks are probably running).

One way to kick a watchdog timer in a multi-tasking system is sketched by the graphic below:


Key attributes of this watchdog approach are: (1) All tasks must be alive to kick the WDT. If even one task is dead the WDT will time out, resetting the system. (2) The tasks do not keep track of time or CPU load on their own, making it impossible for them to have a software defect or execution defect that “lies” to the WDT itself about whether things are alive. Rather than making the CPU’s software police itself and shut down to await a watchdog kick if something is wrong, this software merely has the tasks report in when they finish execution and lets the WDT properly do its job of policing timeliness. More sophisticated versions of this code are possible depending upon the system involved; this is a classroom example of good watchdog timer use. Where "taskw" is run from depends on the scheduling strategy and how tight the watchdog timer interval is, but it is common to run it in a low-priority task.
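
As a rough sketch of the check-in scheme described above, each task can set its own bit in a shared flag word when it finishes, and "taskw" kicks the watchdog only when every bit is set. Everything here other than the name taskw is an assumption made for illustration: the task count, the names task_report_alive and hw_kick_watchdog, and the counter that stands in for the real hardware kick.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NUM_TASKS        3u                       /* assumed task count */
#define ALL_TASKS_ALIVE  ((1u << NUM_TASKS) - 1u) /* one bit per task */

static volatile uint32_t alive_flags; /* bit i set: task i has reported in */
static uint32_t kick_count;           /* stand-in for the hardware kick */

/* Placeholder: a real system would write the device's kick register here. */
static void hw_kick_watchdog(void)
{
    kick_count++;
}

/* Called by each task when it finishes an execution. The task reports
 * only that it ran; it makes no judgment about time or CPU load.
 * NOTE: in a preemptive system this read-modify-write must be protected
 * (for example by briefly disabling interrupts); omitted in this sketch. */
void task_report_alive(uint32_t task_id)
{
    assert(task_id < NUM_TASKS);
    alive_flags |= (1u << task_id);
}

/* The "taskw" check: kick only if every task has reported in since the
 * last kick, then clear the flags so each task must report in again.
 * Returns true if the watchdog was kicked on this call. */
bool taskw_run(void)
{
    if (alive_flags == ALL_TASKS_ALIVE) {
        alive_flags = 0u;
        hw_kick_watchdog();
        return true;
    }
    return false; /* no kick: the hardware watchdog keeps counting down */
}
```

If any one task stops calling task_report_alive, taskw_run never sees all bits set, no kick happens, and the hardware watchdog times out on its own schedule.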

Setting the timing of the watchdog system is also important. If the goal is to ensure that a task is being executed at least every 5 msec, then setting the watchdog timer at 800 msec doesn’t tell you there is a problem until that task is 795 msec late. Watchdog timers should be set reasonably close to the period of the slowest task that is kicking them, with just a little extra time beyond what is required to account for execution variation and scheduling jitter.
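
The arithmetic behind this can be written down directly. The sketch below is illustrative only: the function names and the idea of passing an explicit jitter margin are assumptions, and the margin itself has to come from analysis of the actual system rather than from any number shown here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Timeout = period of the slowest task that kicks the watchdog, plus a
 * margin for execution variation and scheduling jitter. */
uint32_t wdt_timeout_ms(const uint32_t *task_periods_ms, size_t num_tasks,
                        uint32_t margin_ms)
{
    assert(task_periods_ms != NULL && num_tasks > 0u);
    uint32_t slowest = 0u;
    for (size_t i = 0u; i < num_tasks; i++) {
        if (task_periods_ms[i] > slowest) {
            slowest = task_periods_ms[i];
        }
    }
    return slowest + margin_ms;
}

/* How late a task can be before the watchdog notices: the timeout minus
 * the period at which the task was supposed to run. */
uint32_t wdt_worst_case_lateness_ms(uint32_t timeout_ms,
                                    uint32_t task_period_ms)
{
    return (timeout_ms > task_period_ms) ? (timeout_ms - task_period_ms) : 0u;
}
```

For the example in the text, wdt_worst_case_lateness_ms(800, 5) gives 795 msec, while a timeout of 6 msec for the same 5 msec task would catch the problem after 1 msec.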

If watchdog timer resets are seen during testing they should be investigated. If an acceptable real time scheduling approach is used, a watchdog timer reset should never occur unless there has been system failure. Thus, finding out the root cause for each watchdog timer reset recorded is an essential part of safety critical design. For example, in an automotive context, any watchdog timer event recordings could be stored in the vehicle until it is taken in for maintenance. During maintenance, a technician’s computer should collect the event recordings and send them back to the car’s manufacturer via the Internet.
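
One possible way to make sure such events are recorded is to check the reset cause at startup and count watchdog resets in storage that survives the reset. The sketch below assumes the hardware can report why it last reset and that a small log structure lives in non-volatile memory; the types reset_cause_t and reset_log_t and the function reset_log_record are hypothetical names, not any particular vendor's API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed set of reset causes; real parts report these differently. */
typedef enum { RESET_POWER_ON, RESET_WATCHDOG, RESET_OTHER } reset_cause_t;

/* Assumed to be kept in non-volatile memory so it survives resets. */
typedef struct {
    uint32_t boots;            /* total startups seen */
    uint32_t watchdog_resets;  /* startups caused by a watchdog trip */
} reset_log_t;

/* Call once at startup with the cause reported by the hardware.
 * Returns true when this boot followed a watchdog reset, meaning there
 * is an event that should be investigated. Counters saturate rather
 * than wrap so that a long history is never mistaken for a clean one. */
bool reset_log_record(reset_log_t *log, reset_cause_t cause)
{
    assert(log != NULL);
    if (log->boots < UINT32_MAX) {
        log->boots++;
    }
    if (cause != RESET_WATCHDOG) {
        return false;
    }
    if (log->watchdog_resets < UINT32_MAX) {
        log->watchdog_resets++;
    }
    return true;
}
```

A fuller version would probably also store a timestamp and whatever context is available (for example, which tasks had not reported in), since a bare count says that something happened but not why.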

While watchdog timers can't detect all problems, a good watchdog timer implementation is a key foundation of creating a safe embedded control system. It is a negligent design omission to fail to include an acceptable watchdog timer in a safety critical system.

Selected Sources

Watchdog timers are a classical approach to ensuring system reliability, and are a pervasive hardware feature on single-chip microcontrollers for this reason.

An early scholarly reference is a survey paper of existing approaches to industrial process control (Smith 1970, p. 220). Much more recently, Ball discusses the use of watchdog timers, and in particular the need for every task to participate in kicking the watchdog (Ball 2002, pp. 81-83). Storey points out that while they are easy to implement, watchdog timers do have distinct limitations that must be taken into account (Storey 1996, p. 130). In other words, watchdog timers are an important accepted practice that must be designed well to be effective, but even then they only mitigate some types of faults.

Lantrip sets forth an example of how to ensure multiple tasks work together to use a watchdog timer properly (Lantrip 1997). Ganssle discusses how to arrange for all tasks to participate in kicking the watchdog, so that the death of some tasks is detected even while others stay alive (Ganssle 2000, p. 125).

Brown specifically discusses good and bad practices. “I’ve seen some multitasking systems that use an interrupt to tickle the watchdog. This approach defeats the whole purpose for having one in the first place. If all the tasks were blocked and unable to run, the interrupt method would continue to service the watchdog and the reset would never occur. A better solution is to use a separate monitor task that not only tickles the watchdog, but monitors the other system tasks as well.” (Brown 1998 pg. 46).

The MISRA Software Guidelines recommend using a watchdog to detect failed tasks (MISRA Report 1, p. 43), noting that tasks (which they call “processes”) may fail because of noise/EMI, communications failure, software defects, or hardware faults. The MISRA Software Guidelines say that a “watchdog is essential, and must not be inhibited,” while pointing out that returning an engine to idle in a throttle-by-wire application could be unsafe (MISRA Report 1, p. 49). MISRA also notes that “The consequence of each routine failing must be identified, and appropriate watchdog and default action specified.” (MISRA Report 4, p. 33, emphasis added)

NASA recommends using a watchdog and emphasizes that it must be able to detect death of all tasks (NASA 2004, p. 93). IEC 61508-2 lists a watchdog timer as a form of test by redundant hardware (pg. 115) (without implying that it provides complete redundancy).

Addy identified a task death failure mode in a case study (Addy 1991, pg. 79) due to a task encountering a run-time fault that was not properly caught, resulting in the task never being restarted. Thus, it is reasonably conceivable that a task will die in a multitasking operating system. Inability to detect a task death is a defect in a watchdog timer, and a defective watchdog timer approach undermines the safety of the entire system. With such a defective approach, it would be expected that task deaths or other foreseeable events will go undetected by the watchdog timer.

References:
  • Addy, E., A case study on isolation of safety-critical software, Proc. Conf Computer Assurance, pp. 75-83, 1991.
  • Ball, Embedded Microprocessor Systems: Real World Design, Newnes, 2002.
  • Brown, D., “Solving the software safety paradox,” Embedded System Programming, December 1998, pp. 44-52.
  • Ganssle, J., The Art of Designing Embedded Systems, Newnes, 2000.
  • IEC 61508, Functional Safety of Electrical/Electronic/Programmable Electronic Safety-related Systems (E/E/PE, or E/E/PES), International Electrotechnical Commission, 1998. 
  • Lantrip, D. & Bruner, L., General purpose watchdog timer component for a multitasking system, Embedded Systems Programming, April 1997, pp. 42-54.
  • MISRA, Report 1: Diagnostics and Integrated Vehicle Systems, February 1995.
  • MISRA, Report 3: Noise, EMC and Real-Time, February 1995.
  • MISRA, Report 4: Software in Control Systems, February 1995.
  • NASA-GB-8719.13, NASA Software Safety Guidebook, NASA Technical Standard, March 31, 2004.
  • Smith, Digital control of industrial processes, ACM Computing Surveys, Sept. 1970, pp. 211-241.
  • Storey, N., Safety Critical Computer Systems, Addison-Wesley, 1996.





15 comments:

  1. Very interesting article, especially on the topic of multiple threads, thank you.
    I work in a similar field, i.e. pc embedded in automation machines, running windows embedded, console application, no user interface.

    My plan is to implement a software only watchdog timer, using the techniques you explained, since in windows processes are isolated from each other I think it could be viable.

    The watchdog application must be of course more reliable than the controlled apps, my plan is to adopt a strict abstract data type implementation, like explained in "C interfaces and implementation", David Hanson, and use a subset of MISRA-C 2004.

    Since the estimated size of the program should be relatively small I hope to achieve a good code coverage testing, and also to use it as a sort of template for explaining good programming style and techniques.

    If you think it could be interesting for your blog please let me know, my interest is to grow professionally and make myself known to other professionals.

  2. A software-only watchdog is likely to be better than no watchdog. But, keep in mind that this won't protect you against faults inside the operating system that cause the watchdog task to die or other common-mode problems caused by running the watchdog and the application on the same platform.

  3. Yes, I'm aware of the limitations of a software only approach, however there are no safety requirements.
    Anyway, it is an improvement compared to the no-watchdog situation since it can catch faults previously undetected, like a single thread that is dead or "hung".
    Thanks again for your blog.

  4. Let me put it to you that Watchdog timers themselves are bad practice.

    To accept that a 'random unpredictable reset' is better than an 'un-recoverable halt' is detrimental to design and code quality.

    I believe Watchdog timers give a false sense of security, like a dark cloud hanging over the software engineer from the very beginning of a design. And this leads to sloppy code. Because, no matter what happens 'at least it will reset'.

    Well. I don't use watchdog timers. Ever.

    And I am forced to accept that my code must run reliably, or else disaster.

    What this results in is :
    Reliable firmware.
    Because the designer needs to.

    Ensure that interrupt routines are small, completely predictable, and uninterruptible.

    The main application is essentially the watchdog. Switching predictably between tasks, polling flags, managing execution.


    If you run such an application once.
    You run it a thousand times.
    Because the execution is independent of the unpredictability introduced by interrupts.
    And every possible scenario is defined.

    I am forced to design like this.
    Because there is no other option.
    There is no sloppy watchdog ticking away in the background waiting to clean up after my lazy code over runs the stack.

    A random reset in the middle of a measurement or a mars-rover sample collection, a random reset while controlling the flaps of an aircraft wing. etc.
    Do not use watchdogs.


    Watchdogs are a false sense of security.
    A random reset is unacceptable.
    Don't allow it into your design.

    Replies
    1. I appreciate the sentiment you express. Random resets are unacceptable. But that doesn't mean you shouldn't have a watchdog. I prefer a philosophy that you use a watchdog, but if the watchdog ever trips (even once) that is a full-on, catastrophic failure of the engineering process, because it should never happen if your code is of good quality and your system is well designed. (I.e., a watchdog trip is a sign of a software fault, not a "business as usual" event)

      Even with a perfectly designed piece of software you can get a watchdog trip due to a single event upset in the CPU logic. More likely a big software project is less than perfect and there are a few subtle bugs that just happened to slip through. Completely deterministic design reduces the chance of that bug getting past, but sometimes it might still happen.

      It is interesting to debate whether using a watchdog presents a moral hazard that leads to laziness in coding, but I vote on the side of disciplined design plus watchdog.

      BTW, a watchdog in fact did save the Mars Pathfinder mission, and I'm sure they are glad they had it turned on:
      http://research.microsoft.com/en-us/um/people/mbj/mars_pathfinder/mars_pathfinder.html

  5. In my opinion many software engineers introduce a watchdog mechanism too light-heartedly. You have to ask yourself the question 'What's worse, a stalled program or a reasonable chance my program will randomly reset because of an unexpected delay somewhere?'
    And in many programs the watchdog is implemented wrong, giving a false sense of security when the watchdog is kicked way too often.

    Don't get me wrong, I see good reasons for implementing a watchdog, but on several occasions I've seen the watchdog code introduce hard-to-find bugs.
    99.5% of all embedded applications can do without a watchdog. Only when no human is present to reset the device on stalling or when lives are at stake is a watchdog required. In all other situations the chance of the watchdog code creating more problems than it solves makes me stay away from this feature when I can.

    Replies
    1. This blog material was originally written in the context of safety-critical systems.

      I can't remember a design review I've ever done where a stalled program was an acceptable outcome. Yes, incorrect watchdog use is a bad thing, but it is not all that difficult to use it correctly. If your system is simply not critical and you're really worried about false alarms, I'd rather see a really generous watchdog interval than see it turned off. (How to determine the interval is a topic for another blog post at some point.)

      I really disagree very strongly with your 99.5% number, but I'm putting up the post since it seems to be based on your experience, which apparently is a lot different than mine.

  6. I am working on a system used to control a mechanical clock mechanism, which implies that everything happens really, really slowly compared to the uC. My code is designed as a classical state machine. I want to use a watchdog timer, but I am not quite sure how and where to add feeding the dog. I suppose that the main loop is a bad place for that? Could you please give some words on that?

    Replies
    1. If you are calling the state machine once each time through the main loop, then the main loop is where you'd probably put the watchdog. The purpose of the watchdog in this case is to ensure the main loop is alive so the state machine will be able to take a transition when it's time to do so.

      If you're worried about the state machine doing the right thing that's a different type of problem. You might consider an additional software check to make sure a transition in the state machine happens once in a while (basically a software watchdog), but that's beyond the core watchdog function of making sure tasks are alive and becomes application-specific in a hurry.

      If you're worried about power down then that also becomes application and chip specific quickly (some watchdogs are used to wake the chip up once in a while -- which is arguably not a watchdog function any more). So that sort of thing is beyond scope for this posting.

  7. Thank you for your post. I am an electronics engineer in a small company, where I need to do everything, including writing firmware. I don't have a degree and am mostly self-taught. Most of our systems are not critical, but your post makes me a better embedded software designer.

    Thank you!

    James Wu

  8. Hi, thanks for this great article. I had my first contact with the WDT this week while debugging a bug in my code, and I'm having the following issue. Could you please help?

    I'm working on a project for an automotive system where we use the MPC5748 MCU. The application uses an RTOS based on AUTOSAR OS, and this MPC target supports two types of watchdog: software and hardware (they have used the software WDT).

    My mission is to fit an algorithm into this application. The development of the algorithm is done. The problem is that the algorithm runs in a 1 ms task and needs much more time than is allotted to this function.

    I'm a newbie to the embedded world. By the way, in the algorithm's main function the program resets itself, and this seems to be a timeout generated by the expiration of the watchdog.

    My questions are:

    Can I disable the watchdog timer for this specific function (it must not be disabled in general, but just for testing purposes)? Is it possible to use a longer watchdog timeout for that specific function?
    Must I develop another task with a long delay in order to run the algorithm? The problem is that the algorithm needs to be synchronized with the 1 ms task since we are receiving CAN commands ...
    What other options are there to try?
    NB: This is a general problem with the watchdog timer, and any useful information will be very helpful to me. Sorry that I can't share the code!

    thanks

    Replies
    1. Do not ever disable the watchdog once your program initialization is completed.

      It might be justifiable to lengthen the watchdog period a bit if it is still fast enough to be effective (see my post on how long to set the watchdog from Nov 9, 2015). In general if you have a really long function either you need to break it up into pieces and run it a piece at a time or protect it with a much longer period software watchdog and take the risk that SW watchdog is possibly not as high a level of protection as HW watchdog.

    2. Hi Frankenstein,
      I was just reading your posting, a bit late, but I have AUTOSAR experience. The one thing that baffles me is that persons post their problem, then usually receive an answer and then never post again if the problem was solved and even more useful to the community how it was solved.

      In principle I think you need to either have the algo moved to a different task or you change the task timing. Try 10ms. Ask your AUTOSAR integrator to change the configuration and don't forget to test things and measure your actual algo/calculation.

      Regards

  9. Just to add my two cents...

    I've been working on the design of hardware and software for safety-related products (up to SIL3) for several years and, although I somewhat agree that watchdogs may lead some people to be lazy/sloppy at coding, I fully agree with Phil that, at least for safety-related applications, not having a watchdog is not an option, no matter how thorough and rigorous the design/development process has been.

    Moreover, many of the comments above expressed concern about the consequences of an erratic reset of the CPU due to its failure to kick the watchdog in a timely manner. However, resetting the CPU is not the only option: a different, although quite common, approach in safety-related applications (in particular when the failure of the CPU means that you can no longer rely on it) is to have external, hardware-based watchdog circuitry that, if not refreshed appropriately, causes a safe shutdown of the system that switches all safety outputs to their safe state.

    The safety shutdown mechanism is carried out by the hardware circuitry of the watchdog subsystem and it usually includes a latching mechanism to prevent the automatic (unsafe) restart of the application.

    Additionally, the circuitry of the external watchdog is kept as simple as possible so that its failure modes can be analyzed and taken into account and it also provides some control and feedback signals so that the CPU can check that the external hardware is operating properly. If that's not the case, now it's the CPU which initiates the safe shutdown.

    The principles of a similar approach are already described in http://betterembsw.blogspot.ca/2014/04/monitor-actuator-pair-design-pattern.html

    Obviously, this approach is not suitable for all safety-related applications (e.g. you don't want something like this in the engine control of an aircraft) but still has its place in applications where the consequences of a failure cannot be tolerated (e.g. injury or death) or when they are costlier than the economic impact of the shutdown.

    Thanks and keep up with this excellent blog,
    Picante.//

  10. I sometimes use a regular interrupt as the monitor task whereby it only kicks the dog if all of the other tasks have 'checked in'. But yes, agreed, never just simply kick the dog unconditionally in the ISR. I cannot agree with the position that watchdogs are unnecessary in well designed code. In complex systems there is no way to be 100% sure that some failure cannot happen. In any case, there is always the possibility of memory corruption due to harsh environment or whatever. I would never allow my team to ship a product without a watchdog - our systems are not safety critical, but if they simply locked up in the field without recovery I'd be looking for a new job. I do agree however that it is important to detect, log, and investigate watchdog resets. In our systems where there is an RTOS we can use the Task Control Block to ascertain where the execution locked up and this provides a useful bread crumb to trace back to the origin of the problem.

