Comments on Better Embedded System SW: Proper Watchdog Timer Use

You need to do this to prevent a race condition du...

2021-10-15T21:40:30.292-04:00

You need to do this to prevent a race condition during the operation (i.e., to make it an atomic operation), assuming a single core processor.
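
To make that concrete, here is a minimal sketch of such a check-in function, assuming a single-core processor. The names `disable_interrupts()`, `enable_interrupts()`, and `watch_flag` are hypothetical stand-ins (stubbed so the sketch compiles on a host PC) and are not taken from any particular chip's API.

```c
#include <stdint.h>

/* Hypothetical stand-ins for the chip-specific interrupt mask operations.
 * On a real target these would be compiler intrinsics or register writes. */
static volatile int interrupts_masked = 0;
static void disable_interrupts(void) { interrupts_masked = 1; }
static void enable_interrupts(void)  { interrupts_masked = 0; }

/* One bit per task, shared by every task that checks in. */
static volatile uint16_t watch_flag = 0;

/* Record that the calling task is alive.
 * "watch_flag |= x" is a read-modify-write sequence: load, OR, store.
 * If another task or ISR preempts between the load and the store and sets
 * its own bit, that bit is lost when this task stores its stale copy back.
 * Masking interrupts around the update closes that window. */
void Alive(uint16_t x)
{
    disable_interrupts();
    watch_flag |= x;
    enable_interrupts();
}
```

Note that this sketch unconditionally re-enables interrupts on exit. If `Alive()` could ever be called with interrupts already disabled, a more careful version would save the previous interrupt state and restore it instead.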

Hello, I bought your book is very interesting!!. ...

2021-10-15T20:48:53.182-04:00

Hello, I bought your book and it is very interesting!

I have a question:
Why are interrupts disabled and then enabled inside the "void Alive(uint16 x);" function in the code above? It may be a silly question, but I am new to the world of microcontrollers.

Thanks

You might want extra protection to ensure that onl...

2018-09-06T09:18:59.983-04:00

You might want extra protection to ensure that only authorized tasks check in with the watchdog, that each task sets only one bit, that the bit being set actually corresponds to the task making the call, etc.
This is a sketch of the idea. One reason you should use a commercial operating system instead of a home-made one whenever possible is that these are the types of things that the pros (hopefully) think of and handle.
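
As a rough illustration of those extra checks, the sketch below validates a check-in before accepting it. Everything here is an assumption made for illustration: `CheckedAlive()`, `NUM_TASKS`, and the idea that a trusted `task_id` is available (for example from the RTOS) are hypothetical, and interrupt masking is left out to keep the checks visible.

```c
#include <stdint.h>
#include <stdbool.h>

#define NUM_TASKS 4u  /* hypothetical number of monitored tasks */

static volatile uint16_t watch_flag = 0;

/* True if exactly one bit of x is set. */
static bool is_single_bit(uint16_t x)
{
    return x != 0u && (x & (uint16_t)(x - 1u)) == 0u;
}

/* Hypothetical checked check-in. task_id is assumed to come from a trusted
 * source such as the RTOS, not from the task's own arguments.
 * Returns false (and records nothing) if any check fails. */
bool CheckedAlive(uint8_t task_id, uint16_t x)
{
    if (task_id >= NUM_TASKS) {
        return false;                       /* unknown / unauthorized task */
    }
    if (!is_single_bit(x)) {
        return false;                       /* must set exactly one bit */
    }
    if (x != (uint16_t)(1u << task_id)) {
        return false;                       /* bit must belong to the caller */
    }
    watch_flag |= x;  /* a real version would mask interrupts around this */
    return true;
}
```

The single-bit test is redundant with the ownership test in this sketch; it is kept separate so that each of the checks named above is visible and could be logged separately.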

The idea looks really great. But shouldn't wat...

2018-09-06T04:37:05.568-04:00

The idea looks really great. But shouldn't watch_flag be synchronized? What if I have a task that is not allowed to lock a mutex?

Hi Frankenstein, I was just reading your posting, ...

2017-06-08T09:07:34.455-04:00

Hi Frankenstein,
I was just reading your posting, a bit late, but I have AUTOSAR experience. The one thing that baffles me is that people post their problem, usually receive an answer, and then never post again to say whether the problem was solved and, even more useful to the community, how it was solved.

In principle I think you need to either have the algo moved to a different task or you change the task timing. Try 10ms. Ask your AUTOSAR integrator to change the configuration and don't forget to test things and measure your actual algo/calculation.

Regards

I sometimes use an regular interrupt as the monito...

2016-08-26T07:16:24.186-04:00

I sometimes use a regular interrupt as the monitor task, whereby it only kicks the dog if all of the other tasks have 'checked in'. But yes, agreed, never just simply kick the dog unconditionally in the ISR. I cannot agree with the position that watchdogs are unnecessary in well designed code. In complex systems there is no way to be 100% sure that some failure cannot happen. In any case, there is always the possibility of memory corruption due to a harsh environment or whatever. I would never allow my team to ship a product without a watchdog. Our systems are not safety critical, but if they simply locked up in the field without recovery I'd be looking for a new job. I do agree, however, that it is important to detect, log, and investigate watchdog trips. In our systems where there is an RTOS we can use the Task Control Block to ascertain where the execution locked up, and this provides a useful bread crumb to trace back to the origin of the problem.
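
For readers who want to picture the interrupt-as-monitor arrangement, a minimal sketch follows. The names (`monitor_timer_isr()`, `task_check_in()`, `ALL_TASKS_MASK`) are hypothetical, and `kick_watchdog()` is a host-side stand-in that just counts kicks rather than touching real watchdog hardware.

```c
#include <stdint.h>

#define ALL_TASKS_MASK 0x0007u  /* hypothetical: three monitored tasks */

static volatile uint16_t watch_flag = 0;
static volatile uint32_t kick_count = 0;

/* Stand-in for the chip-specific watchdog service sequence. */
static void kick_watchdog(void) { kick_count++; }

/* Called by each task when it completes a healthy pass.
 * On a real target this update needs interrupt masking, because the
 * monitor ISR below can preempt it. */
void task_check_in(uint16_t task_bit)
{
    watch_flag |= task_bit;
}

/* Periodic timer ISR acting as the monitor: kick only if every task has
 * checked in since the last kick, then clear the flags for the next round.
 * If any task is hung, its bit stays clear and the watchdog times out. */
void monitor_timer_isr(void)
{
    if ((watch_flag & ALL_TASKS_MASK) == ALL_TASKS_MASK) {
        kick_watchdog();
        watch_flag = 0;
    }
}
```

The timer period has to be chosen so that, when all tasks are healthy, a kick still lands inside the hardware watchdog interval.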

Just to add my two cents... I've been working...

2016-06-29T01:16:25.748-04:00

Just to add my two cents...

I've been working on the design of hardware and software for safety-related products (up to SIL3) for several years and, although I somewhat agree that watchdogs may lead some people to be lazy/sloppy at coding, I fully agree with Phil that, at least for safety-related applications, not having a watchdog is not an option, no matter how thorough and rigorous the design/development process has been.

Moreover, many of the comments above expressed concern about the consequences of an erratic reset of the CPU due to its failure to kick the watchdog in a timely manner. However, resetting the CPU is not the only option: a different, although quite common, approach in safety-related applications (in particular when the failure of the CPU means that you can no longer rely on it) is to have external, hardware-based watchdog circuitry that, if not refreshed appropriately, causes a safe shutdown of the system that switches all safety outputs to their safe state.

The safety shutdown mechanism is carried out by the hardware circuitry of the watchdog subsystem and it usually includes a latching mechanism to prevent the automatic (unsafe) restart of the application.

Additionally, the circuitry of the external watchdog is kept as simple as possible so that its failure modes can be analyzed and taken into account and it also provides some control and feedback signals so that the CPU can check that the external hardware is operating properly. If that's not the case, now it's the CPU which initiates the safe shutdown.

The principles of a similar approach are already described in http://betterembsw.blogspot.ca/2014/04/monitor-actuator-pair-design-pattern.html

Obviously, this approach is not suitable for all safety-related applications (e.g. you don't want something like this in the engine control of an aircraft), but it still has its place in applications where the consequences of a failure cannot be tolerated (e.g. injury or death) or when they are costlier than the economic impact of the shutdown.

Thanks and keep up with this excellent blog,
Picante.//

Do not ever disable the watchdog once your program...

2016-04-24T09:56:28.176-04:00

Do not ever disable the watchdog once your program initialization is completed.

It might be justifiable to lengthen the watchdog period a bit if it is still fast enough to be effective (see my post on how long to set the watchdog from Nov 9, 2015). In general, if you have a really long function, either you need to break it up into pieces and run it a piece at a time, or protect it with a much longer-period software watchdog and accept the risk that a SW watchdog is possibly not as high a level of protection as a HW watchdog.
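
One way to picture "break it up into pieces" is a job that keeps its progress in a small struct and does a bounded amount of work per call. The sketch below uses a byte-sum over a buffer as a stand-in for the long computation; `ChecksumJob`, `checksum_start()`, `checksum_step()`, and `CHUNK_SIZE` are all hypothetical names chosen for illustration.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdbool.h>

#define CHUNK_SIZE 64u  /* hypothetical: work done per call, sized to fit the task's time budget */

/* Progress of the long computation, kept between calls. */
typedef struct {
    const uint8_t *data;
    size_t length;
    size_t next;     /* index of the next byte to process */
    uint32_t sum;    /* running result */
} ChecksumJob;

void checksum_start(ChecksumJob *job, const uint8_t *data, size_t length)
{
    job->data = data;
    job->length = length;
    job->next = 0;
    job->sum = 0;
}

/* Process at most CHUNK_SIZE bytes and return true once the job is done.
 * The caller runs one step per task activation, so the normal watchdog
 * check-in still happens between steps. */
bool checksum_step(ChecksumJob *job)
{
    size_t remaining = job->length - job->next;
    size_t n = (remaining < CHUNK_SIZE) ? remaining : CHUNK_SIZE;

    for (size_t i = 0; i < n; i++) {
        job->sum += job->data[job->next++];
    }
    return job->next >= job->length;
}
```

With this shape, the watchdog interval never has to cover the whole computation, only one step of it.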

HI thanks for this great article , i have made thi...

2016-04-23T22:23:14.090-04:00

Hi, thanks for this great article. This week I made my first contact with the WDT while debugging a bug in my code, but I'm having the following issue and would appreciate any help.

I'm working on a project for an automotive system where we use the MPC5748 MCU. The application uses an RTOS based on AUTOSAR OS, and this MPC target supports two types of watchdog: software and hardware (they have used the software WDT).

My mission is to fit an algorithm into this application. The development of the algorithm is done. The problem is that the task in which the algorithm runs is a 1 ms task, and the algorithm needs much more time than is allotted to this function.

I'm a newbie to the embedded world. By the way, in the algorithm's main function the program resets itself, and this seems to be a timeout generated by the expiration of the watchdog.

My questions are:

Can I disable the watchdog timer for this specific function (I know it must not be disabled, but just for testing purposes)? Is it possible to use a longer timeout for the watchdog in that specific function?
Must I develop another task with a longer period in order to run the algorithm? The problem is that the algorithm needs to be synchronized with the 1 ms task, since we are receiving CAN commands ...
What are other options to try?
NB: This is a general problem with the watchdog timer, and any useful information would be very helpful to me. Sorry that I can't share the code!

thanks

Thank you for your post. I am an electronics engin...

2016-04-09T18:54:46.722-04:00

Thank you for your post. I am an electronics engineer in a small company, where I need to do everything, including writing firmware. I don't have a degree and am mostly self-taught. Most of our systems are not critical, but your post makes me a better embedded software designer.

Thank you!

James Wu

If you are calling the state machine once each tim...

2016-02-23T06:42:21.489-05:00

If you are calling the state machine once each time through the main loop, then the main loop is where you'd probably put the watchdog. The purpose of the watchdog in this case is to ensure the main loop is alive so the state machine will be able to take a transition when it's time to do so.

If you're worried about the state machine doing the right thing that's a different type of problem. You might consider an additional software check to make sure a transition in the state machine happens once in a while (basically a software watchdog), but that's beyond the core watchdog function of making sure tasks are alive and becomes application-specific in a hurry.
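
As a rough sketch of that additional software check, the code below counts main-loop passes since the last state change and flags the machine as stuck when a limit is exceeded. The states, `MAX_TICKS_WITHOUT_TRANSITION`, and the function names are hypothetical; a real limit is application-specific and may need to differ per state (an idle state might legitimately last a long time).

```c
#include <stdint.h>
#include <stdbool.h>

#define MAX_TICKS_WITHOUT_TRANSITION 1000u  /* hypothetical, application-specific */

typedef enum { STATE_IDLE, STATE_MOVING, STATE_SETTLING } State;

typedef struct {
    State state;
    uint32_t ticks_in_state;  /* main-loop passes since the last transition */
} Machine;

/* Take a transition and restart the progress counter. */
void machine_set_state(Machine *m, State next)
{
    if (next != m->state) {
        m->state = next;
        m->ticks_in_state = 0;
    }
}

/* Call once per main-loop pass. Returns false when the machine has sat in
 * one state for longer than the limit. The caller can then withhold its
 * watchdog check-in or take some other application-specific action. */
bool machine_progress_ok(Machine *m)
{
    if (m->ticks_in_state < UINT32_MAX) {
        m->ticks_in_state++;   /* saturate instead of wrapping around */
    }
    return m->ticks_in_state <= MAX_TICKS_WITHOUT_TRANSITION;
}
```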

If you're worried about power down then that also becomes application and chip specific quickly (some watchdogs are used to wake the chip up once in a while -- which is arguably not a watchdog function any more). So that sort of thing is beyond scope for this posting.

I am working on a system used to controll and mech...

2016-02-23T05:38:54.920-05:00

I am working on a system used to control a mechanical clock mechanism, which implies that everything happens really, really slowly compared to the uC. My code is designed as a classical state machine. I want to use a watchdog timer, but I am not quite sure how and where to add feeding the dog. I suppose that the main loop is a bad place for that? Could you please say a few words on that?

This blog material was originally written in the c...

2015-11-07T19:13:43.130-05:00

This blog material was originally written in the context of safety-critical systems.

I can't remember a design review I've ever done where a stalled program was an acceptable outcome. Yes, incorrect watchdog use is a bad thing, but it is not all that difficult to use it correctly. If your system is simply not critical and you're really worried about false alarms, I'd rather see a really generous watchdog interval than see it turned off. (How to determine the interval is a topic for another blog post at some point.)

I really disagree very strongly with your 99.5% number, but I'm putting up the post since it seems to be based on your experience, which apparently is a lot different than mine.

In my opinion many software engineers introduce a ...

2015-11-07T18:55:06.277-05:00

In my opinion many software engineers introduce a watchdog mechanism too lightly. You have to ask yourself the question 'What's worse, a stalled program or a reasonable chance my program will randomly reset because of an unexpected delay somewhere?'
And in many programs the watchdog is implemented wrong, giving a false sense of security when the watchdog is kicked way too often.

Don't get me wrong, I see good reasons for implementing a watchdog, but on several occasions I've seen watchdog code introduce hard-to-find bugs.
99.5% of all embedded applications can do without a watchdog. Only when no human is present to reset the device when it stalls, or when lives are at stake, is a watchdog required. In all other situations the chance of the watchdog code creating more problems than it solves makes me stay away from this feature when I can.

I appreciate the sentiment you express. Random r...

2015-03-07T20:12:47.265-05:00

I appreciate the sentiment you express. Random resets are unacceptable. But that doesn't mean you shouldn't have a watchdog. I prefer a philosophy that you use a watchdog, but if the watchdog ever trips (even once) that is a full-on, catastrophic failure of the engineering process, because it should never happen if your code is of good quality and your system is well designed. (I.e., a watchdog trip is a sign of a software fault, not a "business as usual" event)

Even with a perfectly designed piece of software you can get a watchdog trip due to a single event upset in the CPU logic. More likely a big software project is less than perfect and there are a few subtle bugs that just happened to slip through. Completely deterministic design reduces the chance of that bug getting past, but sometimes it might still happen.

It is interesting to debate whether using a watchdog presents a moral hazard that leads to laziness in coding, but I vote on the side of disciplined design plus watchdog.

BTW, a watchdog in fact did save the Mars Pathfinder mission, and I'm sure they are glad they had it turned on:
http://research.microsoft.com/en-us/um/people/mbj/mars_pathfinder/mars_pathfinder.html

Let me put it to you that Watchdog timers themselv...

2015-03-07T18:52:18.262-05:00

Let me put it to you that Watchdog timers themselves are bad practice.

To accept that a 'random unpredictable reset' is better than an 'unrecoverable halt' is detrimental to design and code quality.

I believe watchdog timers give a false sense of security, like a dark cloud hanging over the software engineer from the very beginning of a design. And this leads to sloppy code. Because, no matter what happens, 'at least it will reset'.

Well. I don't use watchdog timers. Ever.

And I am forced to accept that my code must run reliably, or else disaster.

What this results in is:
Reliable firmware.
Because the designer needs to.

Ensure that interrupt routines are small, completely predictable, and uninterruptible.

The main application is essentially the watchdog. Switching predictably between tasks, polling flags, managing execution.

If you run such an application once.
You run it a thousand times.
Because the execution is independent of the unpredictability introduced by interrupts.
And every possible scenario is defined.

I am forced to design like this.
Because there is no other option.
There is no sloppy watchdog ticking away in the background waiting to clean up after my lazy code overruns the stack.

A random reset in the middle of a measurement or a Mars rover sample collection, a random reset while controlling the flaps of an aircraft wing, etc.
Do not use watchdogs.

Watchdogs are a false sense of security.
A random reset is unacceptable.
Don't allow it into your design.

Yes, I'm aware of the limitations of a softwar...

2014-10-12T12:32:16.186-04:00

Yes, I'm aware of the limitations of a software only approach, however there are no safety requirements.
Anyway, it is an improvement compared to the no-watchdog situation, since it can catch faults previously undetected, like a single thread dead, or "hung".
Thanks again for your blog.

A software-only watchdog is likely to be better th...

2014-09-01T12:28:59.604-04:00

A software-only watchdog is likely to be better than no watchdog. But, keep in mind that this won't protect you against faults inside the operating system that cause the watchdog task to die or other common-mode problems caused by running the watchdog and the application on the same platform.

Very interesting article, especially on the topic ...

2014-08-27T16:14:26.036-04:00

Very interesting article, especially on the topic of multiple threads, thank you.
I work in a similar field, i.e. PCs embedded in automation machines, running Windows Embedded, with console applications and no user interface.

My plan is to implement a software-only watchdog timer using the techniques you explained. Since processes in Windows are isolated from each other, I think it could be viable.

The watchdog application must of course be more reliable than the controlled apps. My plan is to adopt a strict abstract data type implementation, as explained in "C Interfaces and Implementations" by David Hanson, and to use a subset of MISRA-C 2004.

Since the estimated size of the program should be relatively small, I hope to achieve good code coverage in testing, and also to use it as a sort of template for explaining good programming style and techniques.

If you think it could be interesting for your blog please let me know, my interest is to grow professionally and make myself known to other professionals.