Monday, November 9, 2015

How Long Should You Set Your Watchdog Timer

Summary: You can often use a watchdog effectively even if you don't know worst case execution time. The more you know about system timing, the tighter you can set the watchdog.

Sometimes a design team doesn't have the resources to figure out a tight bound on worst case execution time for a system. Or perhaps they are worried about false alarm watchdog trips due to infrequent situations in which the software is running properly but takes an unusually long time to complete a main loop. Sometimes they set the watchdog to the maximum possible time (perhaps several seconds) to avoid false alarm trips. And sometimes they just set it to the maximum possible value because they don't know what else to do.

In other words, some designers turn off the watchdog or set it to the maximum possible setting because they don't have time to do a detailed analysis.  But, almost always, you can do a lot better than that.

To get maximum watchdog effectiveness you want to put a tight bound on worst case execution time and set the watchdog based on that bound. However, if your system safety strategy permits it, there are simpler ways to compute the watchdog period by analyzing the application instead of the software itself. Below I'll work through some ways to set the watchdog, going from more complicated approaches to simpler ones.

Consider software that heats water in an appliance. Just to make the problem concrete and use easy numbers, let's say that the control loop for heating executes every 100 msec, and takes between 40 and 75 msec to execute (worst case fast and slow execution times). Let's also say that it uses a single-task main loop scheduler without an RTOS, so we don't have to worry about task start time jitter. How could we set the watchdog for this system? Ideally we'd like the tightest possible timing, but there may be some slack because water takes a while to heat up and takes a while to boil dry. How long should we set the watchdog timer for?
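
To make the setup concrete, here is a minimal sketch of that kind of single-task main loop with the watchdog kick as the last step of each cycle. The function names are hypothetical placeholders for whatever your hardware abstraction actually provides:

  #include <stdint.h>

  /* Hypothetical hardware abstraction routines -- substitute whatever your */
  /* platform actually provides.                                            */
  extern void wait_for_next_period(void);  /* blocks until the next 100 msec tick  */
  extern void run_heater_control(void);    /* takes 40-75 msec in this example     */
  extern void kick_watchdog(void);         /* restarts the hardware watchdog timer */

  int main(void)
  {
      /* ... hardware, control, and watchdog initialization goes here ... */

      for (;;) {
          wait_for_next_period();  /* start of each 100 msec control period  */
          run_heater_control();    /* do the real work                       */
          kick_watchdog();         /* kick only after the work has completed */
      }
  }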

Classical Watchdog Setup

Classically, you'd want to compute the worst case execution time range of the software (40-75 msec in this case). Let's assume the watchdog kick happens as the last instruction of each execution. Since the software only runs once every 100 msec, the shortest time between kicks is when one cycle runs 75 msec, waits 25 msec, and then the next cycle runs faster, completing the computation and kicking the watchdog in only 40 msec. 25+40 = 65 msec is the shortest time between kicks. In contrast, the longest time between kicks is when a short cycle of 40 msec is followed by 60 msec of waiting, then a long cycle of 75 msec. 60+75 = 135 msec is the longest time between kicks. It helps a lot to sketch this out on a timeline.

If you're setting a conventional watchdog timer, you'd want to set it at 135 msec (or the closest setting greater than that). If you have a windowed watchdog, you'd want to set the minimum setting at 65 msec, or the closest setting lower than that. 
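
The same arithmetic can be captured as derived compile-time constants rather than magic numbers. This is just a sketch using the assumed timing values from the example above:

  /* Example timing assumptions from the text (all values in msec). */
  #define PERIOD_MS     100u   /* main loop period                    */
  #define EXEC_MIN_MS    40u   /* fastest execution time              */
  #define EXEC_MAX_MS    75u   /* slowest (worst case) execution time */

  /* Shortest gap between kicks: a slow cycle, idle time, then a fast cycle. */
  #define KICK_MIN_MS  ((PERIOD_MS - EXEC_MAX_MS) + EXEC_MIN_MS)   /*  65 msec */

  /* Longest gap between kicks: a fast cycle, idle time, then a slow cycle.  */
  #define KICK_MAX_MS  ((PERIOD_MS - EXEC_MIN_MS) + EXEC_MAX_MS)   /* 135 msec */

  /* Conventional watchdog: the timeout setting must be at least KICK_MAX_MS. */
  /* Windowed watchdog: the window must span KICK_MIN_MS .. KICK_MAX_MS.      */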

Note that if you're running an RTOS, the scheduling might guarantee that the task runs once in every 100 msec, but not when it starts within that period. In that case the worst case shortest time between kicks is the task running back-to-back at its shortest length = 40 msec. The longest time will be when a short task runs at the beginning of a period, and the next task completes right at the end of the following period, giving 60 msec (idle time at end of the first period) + 100 msec (one more period) = 160 msec between watchdog kicks. Thus, a windowed watchdog for this system would have to permit a kick interval of 40 to 160 msec.

Watchdog Approximate Setup

Sometimes designers want a shortcut. It is usually a mistake to set the watchdog at exactly the period because of timing jitter in where the watchdog actually gets kicked. Instead, a handy rule of thumb for non-critical applications (for which you don't want to do the detailed analysis) is to set the watchdog timer interval to twice the software execution period. For this example, you'd set the watchdog timer to twice the period of 100 msec = 200 msec. There are a couple of assumptions you need to make: (1) the software always finishes before the end of its period, and (2) the effectiveness of the watchdog at this longer timeout will be good enough to ensure adequate safety for your system. (If you are building a safety critical system you need to dig deeper on this point.)
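
As a sketch, the rule of thumb and its first assumption can be written down directly (the constants are the example's assumed values; static_assert requires C11):

  #include <assert.h>   /* static_assert (C11) */

  #define PERIOD_MS            100u
  #define WORST_CASE_EXEC_MS    75u   /* best available estimate, not a proven bound */

  /* Rule of thumb: watchdog timeout = twice the task period. */
  #define WATCHDOG_TIMEOUT_MS  (2u * PERIOD_MS)   /* 200 msec */

  /* Assumption (1): the task always finishes before the end of its period. */
  static_assert(WORST_CASE_EXEC_MS < PERIOD_MS,
                "task must complete before the end of its period");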

This simpler approach sacrifices some watchdog effectiveness for detecting faults that perturb system timing, compared to the analytical bound of 135 msec (or 160 msec with an RTOS). But it will still catch a system that is definitely hung, without needing detailed timing analysis.

For a windowed watchdog the rule of thumb is a little more difficult.  That is because, in principle, your task might run the full length of one period and complete instantly on the next period, giving effectively a zero-length minimum watchdog timer kick interval. If you can establish a lower bound on the minimum possible run time of your task, you can set that as an approximate lower bound on watchdog timer kicks. If you don't know the minimum time, you probably can't use the lower bound of a windowed watchdog timer.

Application-Based Watchdog Setup

The previous approaches required knowing the run time of the software. But, what do you do if you know nothing about the run time, or have low confidence in your ability to predict the worst case bounds even after extensive analysis and testing?

An alternate way to approach this problem is to skip the analysis of software timing and instead think about the application. Ask yourself, what is the longest period for which the application can withstand a hung CPU? For example, for a counter-top appliance heating water, how long can the heater be left full-on due to hung software without causing a problem such as a fire, smoke, or equipment damage? Probably it's quite a bit longer than 100 msec. But it might not be 5 or 10 seconds. And probably you don't want to risk melting a thermal fuse or setting off a household smoke alarm by turning off the watchdog entirely.

As a practical matter, if the system's controls go unstable after a certain amount of time, you'd better make your watchdog timer period shorter than that length of time!

For the above example, the control period is 100 msec. But, let's say the system can withstand 500 msec of no control inputs without becoming unrecoverable, so long as the control system starts working again within that time and it doesn't happen often. In this case, the watchdog timer must be set to less than 500 msec. But there is one other thing to consider. The system probably takes a while to reboot and start running the control loop after the watchdog timer trips, so we need to account for that restart time. For this example, let's say the time between a watchdog reset and the time the software starts controlling the system again ranges from 70 to 120 msec, based on test measurements.

Based on knowing the system reset time and the stability grace period, we can get an approximate watchdog setting as follows. We have 500 msec of no-control grace period, minus 120 msec of worst case restart time. 500-120 = 380 msec. Thus, for this system the maximum permissible watchdog timer value is 380 msec to avoid losing control system stability. Using this approach, the watchdog maximum period should be set at the longest available setting that does not exceed 380 msec. Without knowing more about software computation time, there is not much we can say about the minimum period for a windowed watchdog.

Note that for this approach we still need to know something about the worst case execution of the software in case you hit a long execution path after the watchdog reset. However, it is often the case that you know the longest time that is likely to be seen (e.g., via measuring a number of runs) even if you don't know the details of the real time scheduling approach used.  And often you might be willing to take the chance that you won't hit an unlikely, even worse running time right after a system reset. For a non-safety critical application this might be appropriate, and is certainly better than just turning the watchdog off entirely.

Finally, it is often useful to combine the period rule of thumb with the control stability rule of thumb (if you know the task execution period). You want the watchdog set shorter than the time required to ensure control stability, but longer than the time it actually takes to execute the software so that it doesn't trip during normal operation. For the above example this means setting the watchdog somewhere between two periods and the control stability time limit, giving a range for the maximum watchdog limit of 200-380 msec. This can be set without detailed software execution time analysis beyond knowing the task period and the range of likely system restart times.
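
As a sketch, the combined rule can be expressed as compile-time checks so that a later change to the period or restart time gets flagged. All the numbers are the example's assumed values, and WATCHDOG_TIMEOUT_MS is just an illustrative choice:

  #include <assert.h>   /* static_assert (C11) */

  #define PERIOD_MS            100u   /* control task period                          */
  #define GRACE_MS             500u   /* no-control time the system can tolerate      */
  #define RESTART_MAX_MS       120u   /* worst measured reset-to-control-running time */

  #define TIMEOUT_LOWER_MS     (2u * PERIOD_MS)            /* 200 msec rule of thumb   */
  #define TIMEOUT_UPPER_MS     (GRACE_MS - RESTART_MAX_MS) /* 380 msec stability limit */

  #define WATCHDOG_TIMEOUT_MS  250u   /* the setting you actually pick (example value) */

  static_assert(WATCHDOG_TIMEOUT_MS >= TIMEOUT_LOWER_MS,
                "watchdog timeout too short: risk of false alarm trips");
  static_assert(WATCHDOG_TIMEOUT_MS <= TIMEOUT_UPPER_MS,
                "watchdog timeout too long: control stability at risk");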

If you know the maximum worst case execution time and the minimum execution time, you should set your watchdog as tightly as you can. If you don't know those values, but at least are confident you'll complete execution within your task period, then you should set the watchdog to twice the period. You should also bound the maximum watchdog timeout setting by taking into account how long the system can operate without a reset before it loses control stability.

Monday, September 28, 2015

Open Source IoT Code Is Not The Entire Answer

Summary: Whether or not to open source embedded software is the wrong question. The right question is how we can ensure independent checks and balances on software safety and security. Independent certification agencies have been doing this for decades. So why not use them?

In the wake of the recent Volkswagen diesel software revelations, there has been a call from some that automotive software, and even all Internet of Things software, should be open source. The idea is that if the software is released publicly, then someone will notice if there is a security problem, a safety problem, or skulduggery of some sort. While open source can make sense, this is neither an economically realistic nor a necessary step to apply across the board.

The Pro list for open source is pretty straightforward: if you publish the code, someone will come and read it and find all the problems.

The Con list is, however, more reflective of how things really work. You have to assume that someone with enough technical skill will actually spend the time to look, and will actually find the problem. That doesn't always happen. The relatively simple Heartbleed bug was there for all to see in OpenSSL, and it stayed there for a couple of years even though OpenSSL is a widely used, crucial piece of open source Internet infrastructure software. Presumably a lot more people care about OpenSSL than about your toaster oven's software.

Some of the opponents of open sourcing IoT software invoke the security bogeyman. They say that if you publish the source you'll be vulnerable to attacks. Well sure, it might make it easier to find a way to attack, but it doesn't make you "vulnerable." If your code was already full of vulnerabilities, publishing source code just might make it a little easier for someone to find them.  Did you notice that the automotive security exploits published recently did not rely on source code?  I can believe that exploits could, at least sometimes, be published more quickly for open source code, but I don't see this as a compelling argument for keeping code secret and un-reviewed.

A more fundamental point is that software is often the biggest competitive advantage in making products that would otherwise be commodities. Asking companies to reveal their most important trade secrets (their software), so that a hypothetical person with the time and skills might just happen to find a problem sounds like a hard sell to me.  Especially since there is the well established alternative of having an external, independent certification agency look things over in private.

Safety critical systems have had standards and independent review systems in place for decades. Aviation uses DO-178C and other standards, and has a set of independent reviewers called Designated Engineering Representatives (DERs) who provide design reviews during the development cycle. Rail systems follow EN 50126/8/9 and typically involve oversight from acquisition consultants. The chemical process industry generally follows IEC 61508, and has long used independent certification organizations to check their work (typically the reviews I see have been done by Exida or TUV). The consumer appliance industry has long had Underwriters Laboratories (UL) certification, and is moving to a more comprehensive software safety standard approach based on IEC 60730, including external independent certification. There are also more recent domain-specific security standards that can be applied. (It is worth noting that ensuring safety and security requires a lot more than just source code, but that's a topic for another day.)

Cars have long had the option to use the MISRA software safety guidelines, and more recently the ISO 26262 safety standard. Historically, some companies have had external agencies certify automotive components to those standards. But, at least some car companies have not taken advantage of this external audit opportunity, and thus there has been no independent check and balance on their software until their problems show up in the news. Software safety and security audits are not required to sell cars in the US. (There is some vehicle-level testing according to FMVSS requirements, but it's about vehicle behaviors, not the actual source code.)

For Internet of Things it will be interesting to see how things play out. As I understand it the EU is already requiring IEC 60730 compliance, which means external safety checks for safety critical IoT applications. We could see that mandate spread to more IoT products sold in Europe if there are high-profile problems. And perhaps we'll see a push on automotive software regulation too.

So, there is a well established alternative to open source in the form of external certifying organizations issuing compliance certificates based on international safety and security standards. Rather than get distracted by an open source debate, what we should be doing is asking "what's the most effective way to ensure adequate software safety and dependability in a way that doesn't put companies out of business?" Sometimes that might be open source, especially for underlying infrastructure. But other times, probably most times, independent review by a trusted certification party will be up to the task. The question is really what it will take to make companies produce verifiably adequate software.

Having checks and balances works. We should use them.

(For the record, I made some of my source code public domain before "open source" was even a buzzword, and have released other source code under an older version of GPL (Ballista robustness testing) and Creative Commons BY 4.0 (CRC Hamming Distance length calculation). Some code I copyright and release. And some I keep as a trade secret. My interest here is in the public being able to use safe and secure embedded software. We should focus on that, and not let things get sidetracked into another iteration of the open source vs. proprietary software debate.)

Monday, September 7, 2015

Essential Embedded Software Skills

I spend a lot of time trying to grapple with what makes embedded systems different from desktop computer systems in terms of skills and development processes. Often the answer to this question on discussion groups ends up being something like "everything has to be super-optimized," or "you need to meet real-time deadlines." But those are technical measures that seem to me to be more symptoms of particular embedded system projects than root causes of the differences. And, such answers tend to be a bit one-dimensional.

After some thought, perhaps the distinctive attributes of embedded systems can be summarized in the following way:

Interaction with the physical world:
Embedded systems generally have a primary goal of interacting with the physical world using sensors  and actuators. This in turn encompasses various topics depending on the application, including:
  - Real time responsiveness (scheduling, concurrency management, timekeeping)
  - Analog & digital interfacing
  - Control approaches
  - Signal processing
  - Coordination via networked and Cloud services
  - Reliability, safety, system robustness

Special-purpose computing platform:
Most embedded systems don't use a general purpose computing platform (a desktop computer, laptop, tablet, smart phone, etc.). Rather, they use a customized hardware platform that is permanently embedded into the product. (Even those that do use somewhat standardized hardware often have specialized I/O devices attached.) This in turn encompasses various topics depending on the application, including:
  - Software optimization (squeezing to fit into a cost-constrained platform)
  - Close-to-hardware programming (interrupts, device interfacing)
  - Hardware specialization (application-specific hardware, DSP platforms)
  - Specialized network protocols
  - Special-purpose human interaction devices
  - Hardware-dependent testing approaches
  - Customized operating system (or custom non-OS task manager)
  - Power management

Domain-centric development:
Outside the consumer electronics area, in my experience it is rare to meet a deeply embedded system developer with a primary college degree in computer engineering or computer science. Generally they have a degree more relevant to their product domain. Nonetheless, here they are writing significant amounts of code for a living. Those trained in software development are missing a somewhat different set of pieces. Regardless of background, developers usually need to understand the following areas:
  - General software process and technical practice literacy (for domain experts) / Domain expertise (for software experts)
  - Life-cycle support for long-lived, hard-to-update products
  - Distributed and federated system architecture design
  - Domain-optimized development (e.g., model-based design for control systems)
  - Domain-specific aspects of security

Looking at this list, it becomes clear that skills such as knowing how to write super-optimized code are merely pieces of a larger puzzle. In general, you need to be at least literate in all the topics above to be a well-rounded embedded system developer.  Sure, not everyone and not every project needs deep expertise in everything. But if you're planning on a career in embedded systems you'll likely hit just about everything on the list -- I know that I certainly have. (And, if you're a hiring manager, now you have a shopping list for skills for your senior developers.)

Monday, August 17, 2015

Three Suggestions from Les Chambers

Les Chambers is a knowledgeable software engineer with plenty of experience in critical embedded systems.  He had the following three comments about what kinds of "paper" (documentation) are essential for larger-scale embedded systems, which are all spot on.  (His points are in bullets below with permission and light editing for this format.)
  • Without an architectural design document it is impossible to plan and manage any software project with more than two or three people coding. You've got to know numbers of things to estimate resource requirements: how many functions, how many screens, how many objects. I've regularly seen a problem with big teams coming together and floundering around with nothing productive to do because the architect is still pulling his ideas together and has no vehicle for communicating them to the team.

Les is correct. Any time you have a project with more than a couple of people you really need an architectural document of some sort. Ideally it is a single sheet of paper, usually with boxes and arrows, that shows you what the pieces are and how they fit together. Once you have 5 people on the team, this is absolutely mandatory, and in my experience, as Les says, you will have chaos until that picture is nailed down.

A related problem I've seen is when the architecture document shows all the hardware boxes and communication links, but software is nowhere to be found in the picture. You need to either put software on that same diagram or have a separate picture for the software structure that is compatible with the hardware architecture. Chapter 10 of my book has some general rules to help in constructing these types of diagrams. However, I'll be the first to admit that creating a good architecture is as much art as science. If you want to really delve into this, the best systems architecture book I've found is Rechtin's book on System Architecting (The first edition is by far the best for an initial read, if you can find it. The 3rd edition by Maier & Rechtin has a lot more material, but is a bit more complex if you are just going for the essentials.)
  • Another issue you've hinted at but not explicitly stated is the importance of detailed rationales for design approaches. The symptom here is, six months down the track, someone questions a nonintuitive design approach, spends a week working through the design rationale and then decides, "... oh yes it was right in the first place." Worse: someone, not in possession of the facts, changes an approach that was made for rational reasons and injects bugs.

Yes, I've seen this one too. This can especially be a problem if a design decision is made for an extra-functional purpose such as safety. For example, consider an aircraft in which two cable bundles are run down different sides of the airframe. Someone later might conclude that it is cheaper and easier to run them next to each other. Functionally there is (at least at first glance) no difference. But the point of separating the wires was so that if physical damage occurs to one part of the aircraft only one of the two cable bundles will be affected. (Could this happen? Read about the United Airlines Flight 232 crash, in which all three hydraulic lines were damaged where they ran too close together.)

In general, it is a good idea to capture not just requirements but also design decisions with rationale so that the basis for important decisions is not lost. This is especially important for long-lived systems that are likely to be maintained and updated over periods of decades, which is a common enough situation in the embedded systems world.
  • Another piece of paper I think should be added is the configuration management documentation. Exactly what versions of what software are running on what versions of what hardware where. I once had to tackle this problem on a project with in excess of 200 computers deployed all over a [geographically distributed embedded system] network. The symptoms were: people in the development shop spending a week working on reproducing a bug found on site and being unable to do so because they are working on the wrong version of the code. Large deployment teams turning out to do site installation and having to abandon because they were armed with the wrong version of the software – incompatible with other control computers. The obvious solution was a database application, which took me three months to build, and turned out to be very useful - more useful than a stack of procedures and paper records.

And again, I've seen this one as well. It is common enough for the configuration management for older systems to be a filing cabinet full of hard copy printouts, sometimes with the software for every single field installation having been customized by a field engineer. Usually the paper copy is out of date with reality. If you find a bug, how can you fix it if you don't really even know what software is out in the field?

Configuration management is important not just for your build process, but also to keep track of what's out in the field. The most basic requirement is that your device needs to be able to tell you what version of software is installed (for example, with a start-up message). Beyond that, you really want a database that you can run queries on to find out what's out there. Such databases often get stale, so it's also very helpful to make a configuration audit part of every time you touch the equipment for maintenance to keep the database updated.
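
As a minimal sketch of that most basic requirement, a version string can be compiled into the image and reported at start-up. The FW_VERSION macro and the output routine are placeholders for whatever your build system and console actually provide:

  #include <stdio.h>

  /* Typically injected by the build system, e.g. -DFW_VERSION="\"2.3.1\"" */
  #ifndef FW_VERSION
  #define FW_VERSION "unknown"
  #endif

  static void report_version_at_startup(void)
  {
      /* Replace printf with whatever serial/console output routine you have. */
      printf("Firmware version %s built %s %s\n", FW_VERSION, __DATE__, __TIME__);
  }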

Les writes a thought-provoking (and nicely styled) long-form blog on system engineering:
The stories are interesting and well told, with a few twists and turns along the way.  For example, the article on Fagan inspections provides insight into how culture changes if you are serious about creating high quality software. The title gives you an idea of the style:  "Extreme Review: A Tale of Nakedness, Alsations and Fagan Inspection." Highly recommended reading.

Monday, July 20, 2015

Avoiding EEPROM Wearout

Summary: If you're periodically updating a particular EEPROM value every few minutes (or every few seconds) you could be in danger of EEPROM wearout. Avoiding this requires reducing the per-cell write frequency. For some EEPROM technology anything more frequent than about once per hour could be a problem.

Time Flies When You're Recording Data:

EEPROM is commonly used to store configuration parameters and operating history information in embedded processors. For example, you might have a rolling "flight recorder" function to record the most recent operating data in case there is a system failure or power loss. I've seen specifications for this sort of thing require recording data every few seconds.

The problem is that EEPROM only works for a limited number of write cycles. After perhaps 100,000 to 1,000,000 writes (depending on the particular chip you are using), some of your deployed systems will start exhibiting EEPROM wearout and you'll get a field failure. (Look at your data sheet to find the number. If you are deploying a large number of units, "worst case" is probably more important to you than "typical.") A million writes sounds like a lot, but they go by pretty quickly. Let's work an example, assuming that a voltage reading is being recorded to the same byte in EEPROM every 15 seconds.

1,000,000 writes at one write per 15 seconds is 4 writes per minute:
  1,000,000 writes / (4 writes/min * 60 min/hr * 24 hr/day) = 173.6 days.
In other words, your EEPROM will use up its million-cycle wearout budget in less than 6 months.

Working out the time to wearout (in years) based on the period used to update any particular EEPROM cell, the crossover values for a 10 year product life are one update every 5 minutes 15 seconds for an EEPROM with a million cycle write life, and one update every 52 minutes 36 seconds for a 100K cycle EEPROM. This means any hope of updates every few seconds just isn't going to work out if you expect your product to last years instead of months. Things scale linearly, although in real products secondary factors such as operating temperature and access patterns can also be a factor.
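
The arithmetic behind those crossover values is simple enough to automate. Here is a small sketch that prints time to wearout for a few update periods and write-cycle ratings, assuming uniform writes to a single cell and 365.25-day years:

  #include <stdio.h>

  #define SECONDS_PER_YEAR (365.25 * 24.0 * 60.0 * 60.0)

  int main(void)
  {
      const double write_limits[2] = { 100000.0, 1000000.0 };            /* rated write cycles  */
      const double periods_sec[5]  = { 1.0, 15.0, 60.0, 315.0, 3156.0 }; /* update period (sec) */

      for (int i = 0; i < 2; i++) {
          for (int j = 0; j < 5; j++) {
              double years = write_limits[i] * periods_sec[j] / SECONDS_PER_YEAR;
              printf("%9.0f cycles, one write every %5.0f sec -> %7.2f years to wearout\n",
                     write_limits[i], periods_sec[j], years);
          }
      }
      return 0;
  }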

Reduce Frequency
The least painful way to resolve this problem is to simply record the data less often. In some cases that might be OK to meet your system requirements.

Or you might be able to record only when things change more than a small amount, with a minimum delay between successive data points. However, with event-based recording be mindful of value jitter or scenarios in which a burst of events can wear out EEPROM.
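
Here is a rough sketch of that kind of change-based recording with a minimum spacing between writes. The helper functions, threshold, and interval are hypothetical and would need tuning for a real application:

  #include <stdint.h>
  #include <stdlib.h>   /* abs() */

  #define CHANGE_THRESHOLD    5      /* smallest change worth recording (units of the value) */
  #define MIN_WRITE_GAP_SEC   3600u  /* never write a given cell more than once per hour     */

  extern uint32_t seconds_now(void);               /* hypothetical time-since-boot source */
  extern void eeprom_record_value(int16_t value);  /* hypothetical EEPROM write routine   */

  void maybe_record(int16_t new_value)
  {
      static int16_t  last_recorded   = 0;
      static uint32_t last_write_time = 0;

      uint32_t now = seconds_now();

      /* Record only if the value moved enough AND enough time has passed, so that
       * neither jitter nor a burst of changes can wear out the EEPROM. */
      if ((abs(new_value - last_recorded) >= CHANGE_THRESHOLD) &&
          ((now - last_write_time) >= MIN_WRITE_GAP_SEC)) {
          eeprom_record_value(new_value);
          last_recorded   = new_value;
          last_write_time = now;
      }
  }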

(It would be nice if you could track how often EEPROM has been written. But that requires a counter that's kept in EEPROM ... so that idea just pushes the problem into the counter wearing out.)

Low Power Interrupt
In some processors there is a low power interrupt that can be used to record one last data value in EEPROM as the system shuts down due to loss of power. In general you keep the value of interest in a RAM location, and push it out to EEPROM only when you lose power. Or, perhaps, you record it to EEPROM once in a while, and push another copy out to EEPROM as part of shut-down to make sure you record the most up-to-date value available.

It's important to make sure that there is a hold-up capacitor that will keep the system above the EEPROM programming voltage requirement for long enough.  This can work if you only need to record a value or two rather than a large block of data. But it is easy to get this wrong, so be careful!
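
Here is a rough sketch of the shadow-in-RAM approach. The interrupt hook and EEPROM write routine are hypothetical placeholders; the real mechanism depends on your processor's brown-out or low-voltage detection hardware and on your hold-up capacitor sizing:

  #include <stdint.h>

  /* Hypothetical blocking EEPROM write -- substitute your driver's routine. */
  extern void eeprom_write_byte_blocking(uint16_t addr, uint8_t value);

  static volatile uint8_t latest_reading;   /* shadow copy kept in RAM */

  void control_loop_step(uint8_t new_reading)
  {
      latest_reading = new_reading;   /* cheap: no EEPROM wear during normal operation */
  }

  /* Hypothetical low-voltage / power-fail interrupt handler.  The hold-up
   * capacitor must keep the supply above the EEPROM programming voltage
   * long enough for this one write to finish. */
  void power_fail_isr(void)
  {
      eeprom_write_byte_blocking(0x0000u, latest_reading);
  }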

Rotating Buffer
The classical solution for EEPROM wearout is to use a rotating buffer (sometimes called a circular FIFO) of the last N recorded values. You also need a counter stored in EEPROM so that after a power cycle you can figure out which entry in the buffer holds the most recent copy. This reduces EEPROM wearout proportionally to the number of copies of the data in the buffer. For example, if you rotate through 10 different locations that take turns recording a single monitored value, each location gets modified 1/10th as often, so EEPROM wearout is improved by a factor of 10. You also need to keep a separate counter or timestamp for each of the 10 copies so you can sort out which one is the most recent after a power loss.  In other words, you need two rotating buffers: one for the value, and one to keep track of the counter. (If you keep only one counter location in EEPROM, that counter wears out since it has to be incremented on every update.)  The disadvantage of this approach is that it requires 10 times as many bytes of EEPROM storage to get 10 times the life, plus 10 copies of the counter value.  You can be a bit clever by packing the counter in with the data. And if you are recording a large record in EEPROM then an additional few bytes for the counter copies aren't as big a deal as the replicated data memory. But any way you slice it, this is going to use a lot of EEPROM.
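
Here is one way to sketch the rotating buffer idea, packing a sequence counter in with each record so the newest entry can be found after a power cycle. The EEPROM access routines are hypothetical placeholders, and a real implementation also has to deal with sequence counter wraparound, the erased-EEPROM initial state, and writes interrupted by power loss:

  #include <stdint.h>

  #define NUM_SLOTS 10u

  typedef struct {
      uint16_t seq;     /* monotonically increasing sequence number */
      uint16_t value;   /* the monitored value being recorded       */
  } record_t;

  /* Hypothetical EEPROM access routines -- substitute your driver's. */
  extern void eeprom_read (uint16_t addr, void *buf, uint16_t len);
  extern void eeprom_write(uint16_t addr, const void *buf, uint16_t len);

  /* Find the slot holding the highest sequence number (the newest record). */
  static uint16_t newest_slot(record_t *out)
  {
      uint16_t best_slot = 0;
      record_t best = { 0u, 0u };
      for (uint16_t i = 0; i < NUM_SLOTS; i++) {
          record_t r;
          eeprom_read((uint16_t)(i * sizeof(record_t)), &r, (uint16_t)sizeof r);
          if (r.seq >= best.seq) { best = r; best_slot = i; }
      }
      *out = best;
      return best_slot;
  }

  /* Write the next value into the slot after the newest one, so each slot is
   * written only 1/NUM_SLOTS as often as a single fixed location would be. */
  void record_value(uint16_t value)
  {
      record_t newest;
      uint16_t slot = (uint16_t)((newest_slot(&newest) + 1u) % NUM_SLOTS);
      record_t r = { (uint16_t)(newest.seq + 1u), value };
      eeprom_write((uint16_t)(slot * sizeof(record_t)), &r, (uint16_t)sizeof r);
  }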

Atmel has an application note that goes through the gory details:
AVR-101: High Endurance EEPROM Storage:

Special Case For Remembering A Counter Value
Sometimes you want to keep a count rather than record arbitrary values. For example, you might want to count the number of times a piece of equipment has cycled, or the number of operating minutes for some device.  The worst part of counters is that the bottom bit of the counter changes on every single count, wearing out the bottom count byte in EEPROM.

But, there are special tricks you can play. An application note from Microchip has some clever ideas, such as using a Gray code so that only one byte out of a multi-byte counter has to be updated on each count. They also recommend using error correcting codes to compensate for wearout. (I don't know how effective ECC will be at mitigating wearout, because it will depend upon whether bit failures are independent within the counter data bytes -- so be careful about using that idea.) See this application note:
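
This isn't the Microchip note's exact scheme, but the underlying Gray code property is easy to sketch: consecutive counts differ in exactly one bit, so at most one byte of a multi-byte stored counter changes per increment. The EEPROM routines below are hypothetical placeholders with "update" semantics (skip the write when the stored byte already matches):

  #include <stdint.h>

  /* Hypothetical byte routines with "update" semantics: the write is skipped when
   * the stored byte already matches (e.g., like avr-libc's eeprom_update_byte). */
  extern void    eeprom_update_byte_at(uint16_t addr, uint8_t value);
  extern uint8_t eeprom_read_byte_at(uint16_t addr);

  /* Binary <-> Gray conversion: consecutive Gray codes differ in exactly one bit. */
  static uint32_t bin_to_gray(uint32_t n) { return n ^ (n >> 1); }

  static uint32_t gray_to_bin(uint32_t g)
  {
      for (uint32_t shift = 1u; shift < 32u; shift <<= 1u) {
          g ^= (g >> shift);
      }
      return g;
  }

  /* Store the count as a Gray code; only the byte containing the changed bit
   * actually gets rewritten, spreading wear across the four counter bytes. */
  void store_count(uint16_t base_addr, uint32_t count)
  {
      uint32_t gray = bin_to_gray(count);
      for (uint16_t i = 0; i < 4u; i++) {
          eeprom_update_byte_at((uint16_t)(base_addr + i), (uint8_t)(gray >> (8u * i)));
      }
  }

  uint32_t load_count(uint16_t base_addr)
  {
      uint32_t gray = 0u;
      for (uint16_t i = 0; i < 4u; i++) {
          gray |= ((uint32_t)eeprom_read_byte_at((uint16_t)(base_addr + i))) << (8u * i);
      }
      return gray_to_bin(gray);
  }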

Note: For those who want to know more, Microchip has a tutorial on the details of wearout with some nice diagrams of how EEPROM cells are designed:

If you've run into any other clever ideas for EEPROM wearout mitigation please let me know.