Monday, January 4, 2016

Top Two Mistakes In Main Loop Scheduling


In my previous posting I showed you how to build a multi-rate main loop. In this posting we're going to start digging into the critical issue of how to do timing analysis for a multi-rate main loop cooperative scheduler. I'll explain the top two mistakes I see in industry code that lead to missed task deadlines.

I'm going to assume your multi-rate main loop is implemented more or less the way I described in my previous blog posting. The important property we care about is whether each task gets to run to completion sometime within its assigned time period.

Multi-Rate Main Loop CPU Utilization

Let's say you have a set of tasks that looks like this, with the table columns being the task number, the period the task runs at (in msec), and the amount of CPU time it takes to run the task (again in msec):

Task  Period  Compute   
 0       5       1    
 1      10       2    
 2      20       3        
 3     100      11     

In this example, Task 2 runs once every 20 msec, and takes 3 msec to run once it has started.

Here is the first big question: is the CPU overloaded to more than 100%? To answer that, for each row in the table add a CPU % that is computed by dividing each task's run time by its period. For example task 2 runs 3 msec out of every 20 msec, so the CPU utilization is 3/20 = 15%.

Task  Period  Compute  CPU Load 
 0       5       1       20%
 1      10       2       20%
 2      20       3       15%    
 3     100      11       11%
         Total CPU Load  66%

OK, total CPU load is only 66%.  This is good news so far.
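
If you'd like to sanity check the arithmetic, here is a throwaway desktop-C sketch that computes the utilization for the example task set above (periods and compute times in msec):

// Sketch: compute total CPU utilization for the example task set above.
#include <stdio.h>

#define NUM_TASKS 4

static const double period_msec[NUM_TASKS]  = {  5.0, 10.0, 20.0, 100.0 };
static const double compute_msec[NUM_TASKS] = {  1.0,  2.0,  3.0,  11.0 };

int main(void)
{ double total = 0.0;
  for (int i = 0; i < NUM_TASKS; i++)
  { double load = compute_msec[i] / period_msec[i];   // fraction of CPU this task uses
    printf("Task %d: %4.0f%%\n", i, load * 100.0);
    total += load;
  }
  printf("Total CPU load: %4.0f%%\n", total * 100.0);  // 66% for this task set
  return 0;
}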

To do this math in the real world, you also need to add in the CPU time spent by interrupt service routines. Sometimes you end up with more than 100%, and that is obviously a problem.

It might seem impossible to get this wrong. But I would not be writing this if I hadn't seen this problem in a real industry design review. More than once. The way teams get this wrong is not by getting the math wrong. The problem is that they don't do the analysis in the first place, and have no idea what their CPU load is because they never measure it.

An especially tricky problem is when a task might normally run quickly, but once in a while has a very long compute time. This means that the system might meet its deadlines most of the time, but not all the time. If you were to monitor typical CPU load on a testbed things would be fine. But you might still miss deadlines once in a while -- perhaps only once in a really long while that you don't see during system test.

When doing a CPU loading calculation, the compute time has to be the worst case path through each and every task in the code (sometimes called Worst Case Execution Time: WCET), not just a typical path. Determining WCET can be a bit of a pain, but doing so for each task -- even the very infrequent ones -- is a required first step in knowing whether you are going to miss deadlines.
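
If nothing else, instrumenting the code to track the longest observed run time gives you a measured floor on WCET (keeping in mind that testing may never exercise the true worst case path). Here is a minimal sketch, assuming a hypothetical read_usec_timer() function that returns a free-running microsecond count; substitute whatever hardware timer you actually have:

// Sketch: record the longest observed execution time of one task.
// Measured maxima are a lower bound on true WCET, not a guarantee.
#include <stdint.h>

extern uint32_t read_usec_timer(void);   // hypothetical free-running usec counter
extern void Task2(void);

static uint32_t task2_max_usec = 0;      // longest Task2 run observed so far

void Task2_Timed(void)
{ uint32_t start   = read_usec_timer();
  Task2();                               // the task being measured
  uint32_t elapsed = read_usec_timer() - start;   // unsigned math handles wraparound
  if (elapsed > task2_max_usec) { task2_max_usec = elapsed; }
}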


Blocking Time

But we're not out of the woods yet! It is pretty easy to design a system that misses deadlines despite low CPU loads. Have you spotted the problem with the example task set above yet?

The issue is that in a non-preemptive system like this once a task starts running, it runs to completion. If it is an infrequent task it might account for a small CPU load, but still hog the system for a long time.

Take another look at Task 3. It only runs every 100 msec, but it runs for 11 msec. That's only 11% CPU load. But 11 msec is more than twice as long as Task 0's period! So that means no matter what you do, once task 3 starts running, task 0 will miss a deadline.  If you want to see a simple example, consider the following timeline:

Time 100:   Task 0 runs for 1 msec
Time 101:   Task 3 runs for 11 msec
Time 112:   Task 0 can start running.  But it needed to run again between times 105 and 110, so it has already missed its deadline.

This problem would occur regardless of CPU load. Even if Task 3 executes only once per day, every day Task 0 would miss its deadline. In fact, the longer the period for Task 3, the less frequently the problem will happen and the harder it is going to be to find this problem in testing.  (Sure, some systems can occasionally miss a deadline. But you at least want to know that's going to happen to make sure you can design the system to be robust to a missed deadline. And for a safety critical system, having a system that misses any deadlines due to a design flaw is a bad idea regardless of system robustness.)

In general, if any task has a compute time longer than twice the period of the fastest task, you're going to miss deadlines.  (A more precise formulation is that if twice the compute time of the fastest task plus the compute time of the longest task exceeds twice the period of the fastest task, you'll miss deadlines.) Note that passing this check is a necessary but not sufficient condition for schedulability.

In design reviews I see this problem of low CPU % but blocked high priority task frequently.  Very frequently. So don't let it bite you!

Even if this isn't the case, it is still possible to miss deadlines with less than 100% CPU load, but that analysis takes a while to explain, so it will have to wait for another posting.

If you simply can't wait, the easy-to-explain (but tedious) way to check things is to lay out a timeline of when tasks will execute using a spreadsheet, with one row per msec and starting with all tasks being ready to run at time zero. Put down when each task will run as a series of occupied cells in the timeline. Repeat until you reach the least common multiple (LCM) of all the task periods. If they all fit, you should be OK.  If not, you need to dig deeper into what's going on. (Note that this neglects the effects of jitter in task start times caused by race conditions between the timer ISR and the main loop if..else chain. So exercise care with that analysis.) 
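
If you'd rather have code do the tedious part, here is a minimal simulation sketch of the same idea using the example task set above (1 msec resolution, all tasks released at time zero, ISR overhead and start-time jitter ignored just like the spreadsheet version; it runs two hyperperiods for good measure):

#include <stdio.h>
#include <stdbool.h>

#define NUM_TASKS 4
#define LCM_MSEC  100                   // least common multiple of the periods below

static const int period[NUM_TASKS]  = {  5, 10, 20, 100 };
static const int compute[NUM_TASKS] = {  1,  2,  3,  11 };

int main(void)
{ bool ready[NUM_TASKS] = { false };    // released but not yet completed
  int  running   = -1;                  // index of the task currently running, -1 = idle
  int  remaining = 0;                   // msec of compute time left for the running task
  bool missed    = false;

  for (int t = 0; t < 2 * LCM_MSEC; t++)
  { for (int i = 0; i < NUM_TASKS; i++) // 1 msec timer tick: release tasks that are due
    { if ((t % period[i]) == 0)
      { if (ready[i])                   // previous release never completed by its deadline
        { printf("Task %d missed its deadline at time %d\n", i, t); missed = true; }
        ready[i] = true;
      }
    }
    if (running < 0)                    // idle: pick the highest priority ready task
    { for (int i = 0; i < NUM_TASKS; i++)
      { if (ready[i]) { running = i; remaining = compute[i]; break; } }
    }
    if (running >= 0)                   // burn one msec of the running task's compute time
    { if (--remaining == 0) { ready[running] = false; running = -1; } }
  }
  printf("%s\n", missed ? "Deadline misses found" : "No misses in the simulated window");
  return 0;
}

For this task set it reports deadline misses shortly after Task 3 starts running, consistent with the timeline above.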

Monday, December 7, 2015

Multi-Rate Main Loop Tasking


Recently I was looking for an example of a prioritized, cooperative multitasking main loop and realized it is surprisingly difficult to find one that is (1) readily available and (2) comprehensible.

Sure, once you understand the concept maybe you can dig through an industrial-strength implementation with all sorts of complexity. But good luck getting up the learning curve!  So, here is a simple (I hope!) description of a multi-rate main loop scheduler.

First of all, we're talking about non-preemptive multitasking. This is variously called main loop scheduling, a main loop tasker, a prioritized cooperative multitasker, a cyclic executive, a non-preemptive scheduler, and no doubt a bunch of other terms. (Not all of these terms are precisely descriptive, but in terms of search terms they'll get you in the ballpark.) The general idea is to be able to schedule multiple tasks that operate at different periods without having to use an RTOS and without having to use the generally bad idea of stuffing everything into timer-based interrupts.

Single Rate Main Loop

The starting point is a single-rate main loop. This is a classical main loop schedule in which all tasks are run to completion over and over again:

void main(void)
{ ... initialization ...

  while(1)
  { Task0();
    Task1();
    Task2();
    Task3();
  }
}


The good news about this is that you don't need an RTOS, and it's really hard to get things wrong. (In other words reviews and testing are easy to do well.)

The bad news is that all tasks have to run at the same period. That means that if one task needs to run really quickly you'll miss its deadlines.

I've seen way too many hacks that use conditional execution based on CPU load to sometimes skip tasks. But ad hoc approaches have the problem that you can't really know if you'll miss deadlines. Instead, you should use a methodical approach that can be analyzed mathematically to make sure you'll meet deadlines. The way people generally go is with some variation of a multi-rate main loop.

Multi-Rate Main Loop

The idea behind a multi-rate main loop is that you can run each task at a different periodic rate. Each task (which is just a subroutine) still runs to completion, so this is not a full-up preemptive multitasking system. But it is relatively simple to build, and flexible enough for many embedded systems.

Here is some example code of the main loop itself, with some explanation to follow.

void main(void)
{ ... initialization ...

  while(1)
  { if(       flag0 )  { flag0 = 0; Task0(); }
    else if ( flag1 )  { flag1 = 0; Task1(); }
    else if ( flag2 )  { flag2 = 0; Task2(); } 
    else if ( flag3 )  { flag3 = 0; Task3(); }
  }
}

The way this code works is as follows.  All the tasks that need to be run have an associated flag set to 1.  So if Task1 and Task2 are the only tasks that need to run, flag0 is 0, flag1 is 1, flag2 is 1, and flag3 is 0. The code crawls through an "else if" cascade looking for the first non-zero flag.  When it finds a non-zero flag it executes that task, and only that task.

Note that each flag is set back to zero when its task is dispatched, so the task runs exactly one time each time it is activated by its flag. If all flags are zero then no task is executed and the while(1) loop simply spins away until a flag finally becomes non-zero. More about how flags get set to 1 in a moment.

After executing at most one task, the loop goes back to the beginning. Because at most one task is executed per iteration of the main while(1) loop, the tasks are prioritized. Task0 has the highest priority, and Task3 the lowest priority.

Yes, this prioritization means that if your CPU is overloaded Task0 may execute many times and Task3 may never get a turn. That's why it's important to get scheduling right (this will be a topic in a later blog posting).

Multi-Rate Timers

The main loop wasn't so bad, except we swept under the rug the messy business of getting the flags set properly.  Trying to do that in the main loop generally leads to problems, because a long task will cause many milliseconds to go by between timer checks, and it is too easy to have a bug that misses setting a flag some of the time. Thus, in general you tend to see flag maintenance in the timer interrupt service routine.

Conceptually the code looks like this, and for our example lives in a timer interrupt service routine (ISR) that is called every 1 msec.  A variable called "timer" keeps track of system time and is incremented once every msec.

 // in a timer ISR that is called once every msec
 timer++;
 if ((timer %   5) == 0) { flag0 = 1; } // enable 5 msec task0 

 if ((timer %  10) == 0) { flag1 = 1; } // enable 10 msec task1
 if ((timer %  20) == 0) { flag2 = 1; } // enable 20 msec task2
 if ((timer % 100) == 0) { flag3 = 1; } // enable 100 msec task3
 if (timer >= 100) { timer = 0; } // avoid timer overflow bug


Every 5 msec the timer will be zero modulo 5, every 10 msec the timer will be zero modulo 10, and so on.  So this gives us tasks with periods of 5, 10, 20, and 100 msec.  The modulo operations require division, which is slow on many low-end microcontrollers and best avoided in an ISR. So it is common to see a set of counters (one per flag), where each counter is set to the period of a particular task and counts down toward zero once per msec. When a counter reaches zero the associated flag is set to 1 and the counter is reset to the task's period. This takes a little more RAM, but avoids division. How it's implemented depends upon your system tradeoffs (one possible sketch is shown below).

The last line of this code avoids weird things happening when the timer overflows. The reset to zero should run at the least common multiple of all periods, which in this case happens to be equal to the longest period.
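
For reference, here is one way the countdown-counter variant mentioned above might look (a sketch; the reload values are the task periods, and no shared timer variable, modulo operation, or overflow reset is needed):

 // in a timer ISR that is called once every msec
 // (countdown counters replace the modulo tests; each reload value is the
 //  task's period in msec)
 static uint8_t count0 = 5, count1 = 10, count2 = 20, count3 = 100;

 if (--count0 == 0) { count0 = 5;   flag0 = 1; } // enable 5 msec task0
 if (--count1 == 0) { count1 = 10;  flag1 = 1; } // enable 10 msec task1
 if (--count2 == 0) { count2 = 20;  flag2 = 1; } // enable 20 msec task2
 if (--count3 == 0) { count3 = 100; flag3 = 1; } // enable 100 msec task3

Note that with the initial values shown, each task's first activation happens one full period after startup; initialize the counters to 1 if you want everything released on the very first tick.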

Concurrency Issues

As with any scheduling system there are potential concurrency issues. One subtle one is that the timer ISR can run part-way down the else..if structure in the main loop. This could cause a low-priority task to run before a high priority task if they both have their flags set to 1 on the same timer tick. It turns out that this doesn't make the worst case latency much worse. You could try to synchronize things, but that adds complexity. Another way to handle this is to copy the current time into a temporary variable and do the checks for when to run each task in the main loop instead of the timer ISR.
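
Here is a sketch of that last alternative, with the ISR reduced to just timer++ (letting the 16-bit value wrap) while the main loop works from a single snapshot of the time, so a tick can't land halfway down the cascade. The DISABLE_INTERRUPTS()/ENABLE_INTERRUPTS() macros are placeholders for whatever intrinsics your compiler provides.

 volatile uint16_t timer;     // incremented once per msec in the timer ISR; never reset

 void main(void)
 { ... initialization ...
   uint16_t last0 = 0, last1 = 0, last2 = 0, last3 = 0;

   while(1)
   { DISABLE_INTERRUPTS();    // atomic copy of the multi-byte shared timer
     uint16_t now = timer;
     ENABLE_INTERRUPTS();

     // unsigned subtraction handles timer wraparound; if the loop falls behind,
     // a task catches up by running on consecutive passes
     if      ( (uint16_t)(now - last0) >=   5 ) { last0 +=   5; Task0(); }
     else if ( (uint16_t)(now - last1) >=  10 ) { last1 +=  10; Task1(); }
     else if ( (uint16_t)(now - last2) >=  20 ) { last2 +=  20; Task2(); }
     else if ( (uint16_t)(now - last3) >= 100 ) { last3 += 100; Task3(); }
   }
 }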

It's also important to note that there is a potential concurrency problem in writing the flags in the main loop, since both the ISR and the main loop write the flag variables. In practice the concurrency bug will only hit when you're missing deadlines, but good coding style dictates disabling interrupts when you update the flag values in the main loop. That protection isn't shown in the main loop code above in order to keep the explanation simple.
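
One way to add that protection (a sketch; DISABLE_INTERRUPTS() and ENABLE_INTERRUPTS() stand in for whatever intrinsics or macros your compiler provides) is to pick the task to run inside a short critical section and then call it with interrupts back on:

   // Inside while(1): decide which task to run with interrupts briefly off,
   // then run it with interrupts re-enabled.
   int run = -1;
   DISABLE_INTERRUPTS();              // flags are shared with the timer ISR
   if(       flag0 )  { flag0 = 0; run = 0; }
   else if ( flag1 )  { flag1 = 0; run = 1; }
   else if ( flag2 )  { flag2 = 0; run = 2; }
   else if ( flag3 )  { flag3 = 0; run = 3; }
   ENABLE_INTERRUPTS();

   switch(run)
   { case 0: Task0(); break;
     case 1: Task1(); break;
     case 2: Task2(); break;
     case 3: Task3(); break;
     default: break;                  // nothing ready this pass; just loop
   }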

The Big Picture

OK, that's pretty much it.  We have a main loop that runs each task when its ready-to-run flag is set, and a timer ISR that sets a ready-to-run flag for each task at the desired period. The result is a system that has the following properties:
  • Each task runs once during its assigned period
  • The tasks are prioritized, so for example task 2 only runs when task 0 and task 1 do not need to run
The big benefit is that, so long as you pay attention to schedulability math, you can run both fast and slow tasks without needing a fancy RTOS and without missing deadlines.

In terms of practical application this is quite similar to what I often see in commercial systems. Sometimes developers use arrays of counters, arrays of flags, and sometimes even arrays of pointers to functions if they have a whole lot of functions, allowing the code to be a generic loop rather than spelling out each flag name and each task name. This might be necessary, but I recommend keeping things simple and avoiding arrays and pointers if it is practical for your system.
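
For reference, a sketch of that generic table-driven version might look like the following (the function pointer table is indexed in priority order; per the advice above, only go this route if the added indirection really pays for itself):

 #include <stdint.h>

 #define NUM_TASKS 4
 typedef void (*task_fn)(void);

 extern void Task0(void), Task1(void), Task2(void), Task3(void);

 static const task_fn    task[NUM_TASKS]   = { Task0, Task1, Task2, Task3 }; // priority order
 static const uint16_t   period[NUM_TASKS] = { 5, 10, 20, 100 };             // msec
 static volatile uint8_t flag[NUM_TASKS];        // set by the ISR, cleared by the main loop
 static uint16_t         countdown[NUM_TASKS];   // set each entry to period[i] at startup

 void TimerISR_1msec(void)                       // called once every msec
 { for (uint8_t i = 0; i < NUM_TASKS; i++)
   { if (--countdown[i] == 0) { countdown[i] = period[i]; flag[i] = 1; } }
 }

 void MainLoop(void)                             // replaces the hand-written else..if cascade
 { while(1)
   { for (uint8_t i = 0; i < NUM_TASKS; i++)     // lowest index = highest priority
     { if (flag[i]) { flag[i] = 0; task[i](); break; }
     }
   }
 }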

Coming soon ... real time analysis of a multi-rate main loop

Monday, November 9, 2015

How Long Should You Set Your Watchdog Timer?


Summary: You can often use a watchdog effectively even if you don't know worst case execution time. The more you know about system timing, the tighter you can set the watchdog.



Sometimes there is a situation in which a design team doesn't have resources to figure out a tight bound on worst case execution time for a system. Or perhaps they are worried about false alarm watchdog trips due to infrequent situations in which the software is running properly, but takes an unusually long time to complete a main loop. Sometimes they set the watchdog to the maximum possible time (perhaps several seconds) to avoid false alarm trips. And sometimes they just set it to the maximum possible value because they don't know what else to do.

In other words, some designers turn off the watchdog or set it to the maximum possible setting because they don't have time to do a detailed analysis.  But, almost always, you can do a lot better than that.

To get maximum watchdog effectiveness you want the watchdog interval to put a tight bound on worst case execution time. However, if your system safety strategy permits it, there are simpler ways to compute the watchdog period by analyzing the application instead of the software itself.  Below I'll work through some ways to set the watchdog, going from more complicated approaches to simpler ones.

Consider software that heats water in an appliance. Just to make the problem concrete and use easy numbers, let's say that the control loop for heating executes every 100 msec, and takes between 40 and 75 msec to execute (the fastest and slowest execution paths).   Let's also say that it uses a single-task main loop scheduler without an RTOS so we don't have to worry about task start time jitter.  How could we set the watchdog for this system? Ideally we'd like the tightest possible timing, but there may be some slack because water takes a while to heat up, and takes a while to boil dry. How long should we set the watchdog timer for?

Classical Watchdog Setup

Classically, you'd want to compute the worst case execution time range of the software (40-75 msec in this case). Let's assume the watchdog kick happens as the last instruction of each execution. Since the software only runs once every 100 msec, the shortest time between kicks is when one cycle runs 75 msec, waits 25 msec, and then the next cycle runs fast, completing the computation and kicking the watchdog in only 40 msec: 25+40=65 msec shortest time between kicks. In contrast, the longest time between kicks is when a short cycle of 40 msec is followed by 60 msec of waiting, then a long cycle of 75 msec: 60+75=135 msec longest time between kicks. It helps a lot to sketch this out:


If you're setting a conventional watchdog timer, you'd want to set it at 135 msec (or the closest setting greater than that). If you have a windowed watchdog, you'd want to set the minimum setting at 65 msec, or the closest setting lower than that. 

Note that if you're running an RTOS, the scheduling might guarantee that the task runs once in every 100 msec period, but not when it starts within that period. In that case the worst case shortest time is the task running back-to-back at its shortest length = 40 msec. The longest time will be when a short task runs at the beginning of one period, and the next task completes right at the end of the following period, giving 60 msec (idle time at end of the first period) + 100 msec (one more period) = 160 msec between watchdog kicks. Thus, a windowed watchdog for this system would have to permit a kick interval of 40 to 160 msec.
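
For reference, here is the same bookkeeping expressed as a few lines of C (the numbers are from this example; bcet and wcet are the measured best and worst case execution times):

#include <stdio.h>

int main(void)
{ const int period = 100;   // task period, msec
  const int bcet   =  40;   // best case (fastest) execution time, msec
  const int wcet   =  75;   // worst case (slowest) execution time, msec

  // Main loop scheduler: each run starts at the beginning of its period
  printf("main loop: kicks %d to %d msec apart\n",
         (period - wcet) + bcet,        // 25 + 40 = 65 msec minimum
         (period - bcet) + wcet);       // 60 + 75 = 135 msec maximum

  // RTOS task that may start anywhere within its period
  printf("RTOS task: kicks %d to %d msec apart\n",
         bcet,                          // back-to-back runs: 40 msec minimum
         (period - bcet) + period);     // 60 + 100 = 160 msec maximum
  return 0;
}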

Watchdog Approximate Setup

Sometimes designers want a shortcut. It is usually a mistake to set the watchdog at exactly the period because of timing jitter in where the watchdog actually gets kicked. Instead, a handy rule of thumb for non-critical applications (for which you don't want to do the detailed analysis) is to set the watchdog timer interval to twice the software execution period. For this example, you'd set the watchdog timer to twice the period of 100 msec = 200 msec.  There are a couple assumptions you need to make: (1) the software always finishes before the end of its period, and (2) the effectiveness of the watchdog at this longer timeout will be good enough to ensure adequate safety for your system. (If you are building a safety critical system you need to dig deeper on this point.) 

This simpler approach sacrifices some watchdog timer effectiveness at detecting faults that perturb system timing, compared to the analytically derived bounds (135 msec for the main loop case, or 160 msec with an RTOS). But it will still catch a system that is definitely hung without needing detailed timing analysis.

For a windowed watchdog the rule of thumb is a little more difficult.  That is because, in principle, your task might run the full length of one period and complete instantly on the next period, giving effectively a zero-length minimum watchdog timer kick interval. If you can establish a lower bound on the minimum possible run time of your task, you can set that as an approximate lower bound on watchdog timer kicks. If you don't know the minimum time, you probably can't use the lower bound of a windowed watchdog timer.

Application-Based Watchdog Setup

The previous approaches required knowing the run time of the software. But, what do you do if you know nothing about the run time, or have low confidence in your ability to predict the worst case bounds even after extensive analysis and testing?

An alternate way to approach this problem is to skip the analysis of software timing and instead think about the application. Ask yourself, what is the longest period for which the application can withstand a hung CPU?  For example, for a counter-top appliance heating water, how long can the heater be left full-on due to hung software without causing a problem such as a fire, smoke, or equipment damage?  Probably it's quite a bit longer than 100 msec.  But it might not be 5 or 10 seconds. And probably you don't want to risk melting a thermal fuse or a household smoke alarm by  turning off the watchdog entirely.

As a practical matter, if the system's controls go unstable after a certain amount of time, you'd better make your watchdog timer period shorter than that length of time!

For the above example, the control period is 100 msec. But, let's say the system can withstand 500 msec of no control inputs without becoming unrecoverable, so long as the control system starts working again within that time and it doesn't happen often. In this case, the watchdog timer must be shorter than 500 msec at least. But there is one other thing to consider. The system probably takes a while to reboot and start running the control loop after the watchdog timer trips. So we need to account for that restart time. For this example, let's say the time between a watchdog reset and the time the software starts controlling the system generally ranges from 70 to 120 msec based on test measurements.

Based on knowing the system reset time and stability grace period, we can get an approximate watchdog setting as follows.  We have 500 msec of no-control grace period, minus 120 msec of worst case restart time.  500-120 = 380 msec. Thus, for this system the maximum permissible watchdog timer value is 380 msec to avoid losing control system stability. Using this approach, the watchdog maximum period should be set at the longest time that is shorter than 380 msec. Without knowing more about software computation time, there is not much we can say about the minimum period for a windowed watchdog.

Note that for this approach we still need to know something about the worst case execution of the software in case you hit a long execution path after the watchdog reset. However, it is often the case that you know the longest time that is likely to be seen (e.g., via measuring a number of runs) even if you don't know the details of the real time scheduling approach used.  And often you might be willing to take the chance that you won't hit an unlikely, even worse running time right after a system reset. For a non-safety critical application this might be appropriate, and is certainly better than just turning the watchdog off entirely.

Finally, it is often useful to combine the period rule of thumb with the control stability rule of thumb (if you know the task execution period). You want the watchdog set shorter than the time required to ensure control stability, but longer than the time it actually takes to execute the software so that it will actually be kicked often enough. For the above example this means setting the watchdog somewhere between two periods and the control stability time limit, giving a range for the maximum watchdog limit of 200-380 msec. This can be set without detailed software execution time analysis beyond knowing the task period and the range of likely system restart times.
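
Putting those two rules of thumb together for this example (a sketch; the grace period and restart time would come from your own system analysis and measurements):

#include <stdio.h>

int main(void)
{ const int task_period_msec   = 100;  // control loop period
  const int grace_period_msec  = 500;  // longest tolerable interval with no control
  const int worst_restart_msec = 120;  // watchdog reset until control is running again

  int lower_bound = 2 * task_period_msec;                    // rule-of-thumb floor: 200 msec
  int upper_bound = grace_period_msec - worst_restart_msec;  // application ceiling: 380 msec
  printf("set the watchdog timeout between %d and %d msec\n", lower_bound, upper_bound);
  return 0;
}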

Summary
If you know the maximum worst case execution time and the minimum execution time, you should set your watchdog as tightly as you can. If you don't know those values, but at least are confident you'll complete execution within your task period, then you should set the watchdog to twice the period. You should also bound the maximum watchdog timeout setting by taking into account how long the system can operate without a reset before it loses control stability.

Monday, September 28, 2015

Open Source IoT Code Is Not The Entire Answer


Summary: Whether or not to open source embedded software is the wrong question. The right question is how we can ensure independent checks and balances on software safety and security. Independent certification agencies have been doing this for decades. So why not use them?

In the wake of the recent Volkswagen diesel software revelations, there has been a call from some that automotive software and even all Internet of Things software should be open source. The idea is that if the software is released publicly, then someone will notice if there is a security problem, a safety problem, or skulduggery of some sort. While open source can make sense, this is neither an economically realistic nor necessary step to apply across-the-board.

The Pro list for open source is pretty straightforward: if you publish the code, someone will come and read it and find all the problems.

The Con list is, however, more reflective of how things really work. You have to assume that someone with enough technical skill will actually spend the time to look, and will actually find the problem. That doesn't always happen. The relatively simple Heartbleed bug was there for all to see in OpenSSL, and it stayed there for a couple years despite being a widely used, crucial piece of open source Internet infrastructure software. Presumably a lot more people care about OpenSSL than your toaster oven's software.

Some of the opponents of open sourcing IoT software invoke the security bogeyman. They say that if you publish the source you'll be vulnerable to attacks. Well sure, it might make it easier to find a way to attack, but it doesn't make you "vulnerable." If your code was already full of vulnerabilities, publishing source code just might make it a little easier for someone to find them.  Did you notice that the automotive security exploits published recently did not rely on source code?  I can believe that exploits could, at least sometimes, be published more quickly for open source code, but I don't see this as a compelling argument for keeping code secret and un-reviewed.

A more fundamental point is that software is often the biggest competitive advantage in making products that would otherwise be commodities. Asking companies to reveal their most important trade secrets (their software), so that a hypothetical person with the time and skills might just happen to find a problem sounds like a hard sell to me.  Especially since there is the well established alternative of having an external, independent certification agency look things over in private.

Safety critical systems have had standards and independent review systems in place for decades. Aviation uses DO-178c and other standards, and has a set of independent reviewers called Designated Engineering Representatives (DERs) that provide design reviews during the development cycle. Rail systems follow EN-50126/8/9 and typically involve oversight from acquisition consultants. The chemical process industry generally follows IEC-61508, and has long used independent certification organizations to check their work (typically I see reviews have been done by Exida or TUV). The consumer appliance industry has long had Underwriters Laboratories (UL) certification, and is moving to a more comprehensive software safety standard approach based on IEC 60730, including external independent certification. There are also more recent domain-specific security standards that can be applied. (It is worth noting that ensuring safety and security requires a lot more than just source code, but that's a topic for another day.)

Cars have long had the option to use the MISRA software safety guidelines, and more recently the ISO 26262 safety standard. Historically, some companies have had external agencies certify automotive components to those standards. But, at least some car companies have not taken advantage of this external audit opportunity, and thus there has been no independent check and balance on their software until their problems show up in the news. Software safety and security audits are not required to sell cars in the US. (There is some vehicle-level testing according to FMVSS requirements, but it's about vehicle behaviors, not the actual source code.)

For Internet of Things it will be interesting to see how things play out. As I understand it the EU is already requiring IEC 60730 compliance, which means external safety checks for safety critical IoT applications. We could see that mandate spread to more IoT products sold in Europe if there are high-profile problems. And perhaps we'll see a push on automotive software regulation too.

So, there is a well established alternative to open source in the form of external certifying organizations issuing compliance certificates based on international safety and security standards. Rather than get distracted by an open source debate, what we should be doing is asking "what's the most effective way to ensure adequate software safety and dependability in a way that doesn't put companies out of business." Sometimes that might be open source, especially for underlying infrastructure. But other times, probably most times, independent review by a trusted certification party will be up to the task. The question is really what it will take to make companies produce verifiably adequate software.

Having checks and balances works. We should use them.

(For the record, I made some of my source code public domain before "open source" was even a buzzword, and have released other source code under an older version of GPL (Ballista robustness testing) and Creative Commons BY 4.0 (CRC Hamming Distance length calculation). Some code I copyright and release. And some I keep as a trade secret. My interest here is in the public being able to use safe and secure embedded software. We should focus on that, and not let things get sidetracked into another iteration of the open source vs. proprietary software debate.)

Monday, September 7, 2015

Essential Embedded Software Skills

I spend a lot of time trying to grapple with what makes embedded systems different than desktop computer systems in terms of skills and development processes.  Often the answer to this question on  discussion groups ends up being something like "everything has to be super-optimized," or "you need to meet real-time deadlines." But those are technical measures that seem to me to be more symptoms of particular embedded system projects rather than root cause of the differences.  And, such answers tend to be a bit one-dimensional.

After some thought, perhaps the distinctive attributes of embedded systems can be summarized in the following way:

Interaction with the physical world:
Embedded systems generally have a primary goal of interacting with the physical world using sensors  and actuators. This in turn encompasses various topics depending on the application, including:
  - Real time responsiveness (scheduling, concurrency management, timekeeping)
  - Analog & digital interfacing
  - Control approaches
  - Signal processing
  - Coordination via networked and Cloud services
  - Reliability, safety, system robustness

Special-purpose computing platform:
Most embedded systems don't use a general purpose computing platform (a desktop computer, laptop, tablet, smart phone, etc.).  Rather, they use a customized hardware platform that is permanently embedded into the product. (Even those that do use somewhat standardized hardware often have specialized I/O devices attached.)  This in turn encompasses various topics depending on the application, including:
  - Software optimization (squeezing to fit into a cost-constrained platform)
  - Close-to-hardware programming (interrupts, device interfacing)
  - Hardware specialization (application-specific hardware, DSP platforms)
  - Specialized network protocols
  - Special-purpose human interaction devices
  - Hardware-dependent testing approaches
  - Customized operating system (or custom non-OS task manager)
  - Power management

Domain-centric development:
Outside the consumer electronics area, in my experience it is rare to meet a deeply embedded system developer with a primary college degree in computer engineering or computer science.  Generally they have a degree more relevant to their product domain. Yet, nonetheless, here they are writing significant amounts of code for a living. Those trained in software development are also missing somewhat different pieces. Regardless of background, developers usually need to understand the following areas:
  - General software process and technical practice literacy (for domain experts) / Domain expertise (for software experts)
  - Life-cycle support for long-lived, hard-to-update products
  - Distributed and federated system architecture design
  - Domain-optimized development (e.g., model-based design for control systems)
  - Domain-specific aspects of security

Looking at this list, it becomes clear that skills such as knowing how to write super-optimized code are merely pieces of a larger puzzle. In general, you need to be at least literate in all the topics above to be a well-rounded embedded system developer.  Sure, not everyone and not every project needs deep expertise in everything. But if you're planning on a career in embedded systems you'll likely hit just about everything on the list -- I know that I certainly have. (And, if you're a hiring manager, now you have a shopping list for skills for your senior developers.)

Monday, August 17, 2015

Three Suggestions from Les Chambers

Les Chambers is a knowledgeable software engineer with plenty of experience in critical embedded systems.  He had the following three comments about what kinds of "paper" (documentation) are essential for larger-scale embedded systems, which are all spot on.  (His points are in bullets below with permission and light editing for this format.)
  • Without an architectural design document it is impossible to plan and manage any software project with more than two or three people coding. You've got to know numbers of things to estimate resource requirements. How many functions, how many screens, how many objects. I've seen regularly a problem with big teams coming together and floundering around with nothing productive to do because the architect is still pulling his ideas together and has no vehicle for communicating them to the team.

Les is correct. Any time you have a project with more than a couple people you really need an architectural document of some sort.  Ideally a single sheet of paper, usually with boxes and arrows, that shows you what the pieces are and how they fit together. Once you have 5 people on the team, this is absolutely mandatory, and in my experience, as Les says, you will have just chaos until that picture is nailed down.

A related problem I've seen is when the architecture document shows all the hardware boxes and communication links, but software is nowhere to be found in the picture. You need to either put software on that same diagram or have a separate picture for the software structure that is compatible with the hardware architecture. Chapter 10 of my book has some general rules to help in constructing these types of diagrams. However, I'll be the first to admit that creating a good architecture is as much art as science. If you want to really delve into this, the best systems architecture book I've found is Rechtin's book on System Architecting (The first edition is by far the best for an initial read, if you can find it. The 3rd edition by Maier & Rechtin has a lot more material, but is a bit more complex if you are just going for the essentials.)
  • Another issue you've hinted at but not explicitly stated is the importance of detailed rationales for design approaches. The symptom here is, six months down the track, someone questions a nonintuitive design approach, spends a week working through the design rationale and then decides, "... oh yes it was right in the first place." Worse: someone, not in possession of the facts, changes an approach that was made for rational reasons and injects bugs.

Yes, I've seen this one too. This can especially be a problem if a design decision is made for an extra-functional purpose such as safety. For example, consider an aircraft in which two cable bundles are run down different sides of an aircraft.  Someone later might conclude that it is cheaper and easier to run them next to each other. Functionally there is (at least at first glance) no difference. But the point of separating the wires was so that if physical damage occurs to one part of the aircraft only one of the two cable bundles will be affected. (Could this happen? Read about the United Airlines Flight 232 crash where three hydraulic lines were damaged where they ran too close together.)

In general, it is a good idea to capture not just requirements but also design decisions with rationale so that the basis for important decisions is not lost. This is especially important for long-lived systems that are likely to be maintained and updated over periods of decades, which is a common enough situation in the embedded systems world.
  • Another piece of paper I think should be added is the configuration management documentation. Exactly what versions of what software are running on what versions of what hardware where. I once had to tackle this problem on a project with in excess of 200 computers deployed all over a [geographically distributed embedded system] network. The symptoms were: people in the development shop spending a week working on reproducing a bug found on site and being unable to do so because they are working on the wrong version of the code. Large deployment teams turning out to do site installation and having to abandon because they were armed with the wrong version of the software – incompatible with other control computers. The obvious solution was a database application, which took me three months to build, and turned out to be very useful - more useful than a stack of procedures and paper records.
And again, I've seen this one as well.  It is common enough for the configuration management for older systems to be a filing cabinet full of hard copy printouts, sometimes with the software for every single field installation having been customized by a field engineer. Usually the paper copy is out of date with reality.  If you find a bug, how can you fix it if you don't really even know what software is out in the field?

Configuration management is important not just for your build process, but also to keep track of what's out in the field. The most basic requirement is that your device needs to be able to tell you what version of software is installed (for example, with a start-up message). Beyond that, you really want a database that you can run queries on to find out what's out there. Such databases often get stale, so it's also very helpful to make a configuration audit part of every time you touch the equipment for maintenance to keep the database updated.

Les writes a thought-provoking (and nicely styled) long-form blog on system engineering: http://www.systemsengineeringblog.com/
The stories are interesting and well told, with a few twists and turns along the way.  For example, the article on Fagan inspections provides insight into how culture changes if you are serious about creating high quality software. The title gives you an idea of the style:  "Extreme Review: A Tale of Nakedness, Alsations and Fagan Inspection." Highly recommended reading.

Monday, July 20, 2015

Avoiding EEPROM and Flash Memory Wearout


Summary: If you're periodically updating a particular EEPROM value every few minutes (or every few seconds) you could be in danger of EEPROM wearout. Avoiding this requires reducing the per-cell write frequency. For some EEPROM technology anything more frequent than about once per hour could be a problem. (Flash memory has similar issues.)

Time Flies When You're Recording Data:

EEPROM is commonly used to store configuration parameters and operating history information in embedded processors. For example, you might have a rolling "flight recorder" function to record the most recent operating data in case there is a system failure or power loss. I've seen specifications for this sort of thing require recording data every few seconds.

The problem is that  EEPROM only works for a limited number of write cycles.  After perhaps 100,000 to 1,000,000 (depending on the particular chip you are using), some of your deployed systems will start exhibiting EEPROM wearout and you'll get a field failure. (Look at your data sheet to find the number. If you are deploying a large number of units "worst case" is probably more important to you than "typical.")  A million writes sounds like a lot, but they go by pretty quickly.  Let's work an example, assuming that a voltage reading is being recorded to the same byte in EEPROM every 15 seconds.

1,000,000 writes at one write per 15 seconds is 4 writes per minute:
  1,000,000 / ( 4 * 60 minutes/hr * 24 hours/day ) = 173.6 days.
In other words, your EEPROM will use up its million-cycle wearout budget in less than 6 months.
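
The same arithmetic generalizes easily if you want to try your own numbers (a throwaway desktop-C sketch; change the endurance rating and update period to match your part and your logging rate):

#include <stdio.h>

int main(void)
{ const double endurance    = 1000000.0;                  // rated write cycles per EEPROM cell
  const double update_sec   = 15.0;                       // seconds between writes to one cell
  const double sec_per_year = 365.25 * 24.0 * 60.0 * 60.0;

  double years = (endurance * update_sec) / sec_per_year;
  printf("time to wearout: %.2f years (%.1f days)\n", years, years * 365.25);
  return 0;
}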

Below is a table showing the time to wearout (in years) based on the period used to update any particular EEPROM cell. The crossover values for 10 year product life are one update every 5 minutes 15 seconds for an EEPROM with a million cycle life. For a 100K life EEPROM you can only update a particular cell every 52 minutes 36 seconds.  This means any hope of updates every few seconds just isn't going to work out if you expect your product to last years instead of months. Things scale linearly, although in real products secondary factors such as operating temperature and access mode can also play a role.



Reduce Frequency
The least painful way to resolve this problem is to simply record the data less often. In some cases that might be OK to meet your system requirements.

Or you might be able to record only when things change more than a small amount, with a minimum delay between successive data points. However, with event-based recording be mindful of value jitter or scenarios in which a burst of events can wear out EEPROM.

(It would be nice if you could track how often EEPROM has been written. But that requires a counter that's kept in EEPROM ... so that idea just pushes the problem into the counter wearing out.)

Low Power Interrupt
In some processors there is a low power interrupt that can be used to record one last data value in EEPROM as the system shuts down due to loss of power. In general you keep the value of interest in a RAM location, and push it out to EEPROM only when you lose power.  Or, perhaps, you record it to EEPROM once in a while and push another copy out to EEPROM as part of shut-down to make sure you record the most up-to-date value available.

It's important to make sure that there is a hold-up capacitor that will keep the system above the EEPROM programming voltage requirement for long enough.  This can work if you only need to record a value or two rather than a large block of data. But it is easy to get this wrong, so be careful!
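
Conceptually the code is tiny (a sketch; BrownoutISR(), eeprom_write_byte(), and LOG_ADDRESS are hypothetical names -- use whatever low-voltage-detect interrupt and EEPROM driver your particular part provides):

// Sketch: save the most recent value only when power is failing.
#include <stdint.h>

#define LOG_ADDRESS 0x0010u               // hypothetical EEPROM address for the saved value

extern void eeprom_write_byte(uint16_t addr, uint8_t value);   // hypothetical driver call

static volatile uint8_t latest_reading;   // updated in RAM during normal operation

void BrownoutISR(void)                    // fires as the supply voltage starts to fall
{ eeprom_write_byte(LOG_ADDRESS, latest_reading);  // must complete on hold-up cap energy
}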

Rotating Buffer
The classical solution for EEPROM wearout is to use a rotating buffer (sometimes called a circular FIFO) of the last N recorded values. You also need a counter stored in EEPROM so that after a power cycle you can figure out which entry in the buffer holds the most recent copy. This reduces EEPROM wearout proportionally to the number of copies of the data in the buffer. For example, if you rotate through 10 different locations that take turns recording a single monitored value, each location gets modified 1/10th as often, so EEPROM wearout is improved by a factor of 10.

You also need to keep a separate counter or timestamp for each of the 10 copies so you can sort out which one is the most recent after a power loss.  In other words, you need two rotating buffers: one for the value, and one to keep track of the counter. (If you keep only one counter location in EEPROM, that counter wears out since it has to be incremented on every update.)

The disadvantage of this approach is that it requires 10 times as many bytes of EEPROM storage to get 10 times the life, plus 10 copies of the counter value.  You can be a bit clever by packing the counter in with the data. And if you are recording a large record in EEPROM then an additional few bytes for the counter copies aren't as big a deal as the replicated data memory. But any way you slice it, this is going to use a lot of EEPROM.
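
Here is a minimal sketch of the idea (eeprom_read_byte() and eeprom_write_byte() are hypothetical driver calls -- substitute your part's EEPROM routines -- and real code would also want some integrity checking such as a CRC on each slot):

// Sketch: rotating buffer to spread EEPROM wear across N_SLOTS slots.
// Each slot is 2 bytes: [sequence number][logged value]. Wear per cell
// improves by a factor of N_SLOTS.
#include <stdint.h>

#define N_SLOTS    10
#define BASE_ADDR  0x0000u            // start of the buffer region in EEPROM
#define SLOT_SIZE  2u

extern uint8_t eeprom_read_byte(uint16_t addr);                // hypothetical driver calls
extern void    eeprom_write_byte(uint16_t addr, uint8_t value);

static uint8_t newest;                // index of the slot holding the newest value

// Call once at startup: find the slot with the highest sequence number,
// using wraparound-safe comparison (valid because N_SLOTS < 128).
void log_init(void)
{ newest = 0;
  for (uint8_t i = 1; i < N_SLOTS; i++)
  { uint8_t seq_i    = eeprom_read_byte(BASE_ADDR + i * SLOT_SIZE);
    uint8_t seq_best = eeprom_read_byte(BASE_ADDR + newest * SLOT_SIZE);
    if ((int8_t)(seq_i - seq_best) > 0) { newest = i; }
  }
}

// Record a new value in the next slot so each cell is written 1/N_SLOTS as often.
// Writing the data byte before its counter means a power loss mid-update leaves
// the previous slot as the newest valid one.
void log_write(uint8_t value)
{ uint8_t seq  = eeprom_read_byte(BASE_ADDR + newest * SLOT_SIZE);
  uint8_t next = (uint8_t)((newest + 1u) % N_SLOTS);
  eeprom_write_byte(BASE_ADDR + next * SLOT_SIZE + 1u, value);          // data first
  eeprom_write_byte(BASE_ADDR + next * SLOT_SIZE, (uint8_t)(seq + 1u)); // then counter
  newest = next;
}

// Fetch the most recently recorded value (e.g., after a power cycle).
uint8_t log_read(void)
{ return eeprom_read_byte(BASE_ADDR + newest * SLOT_SIZE + 1u);
}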

Atmel has an application note that goes through the gory details:
AVR-101: High Endurance EEPROM Storage:  http://www.atmel.com/images/doc2526.pdf

Special Case For Remembering A Counter Value
Sometimes you want to keep a count rather than record arbitrary values. For example, you might want to count the number of times a piece of equipment has cycled, or the number of operating minutes for some device.  The worst part of counters is that the bottom bit of the counter changes on every single count, wearing out the bottom count byte in EEPROM.

But, there are special tricks you can play. An application note from Microchip has some clever ideas, such as using a gray code so that only one byte out of a multi-byte counter has to be updated on each count. They also recommend using error correcting codes to compensate for wear-out. (I don't know how effective ECC will be at wear-out, because it will depend upon whether bit failures are independent within the counter data bytes -- so be careful of using that idea). See this application note:   http://ww1.microchip.com/downloads/en/AppNotes/01449A.pdf

Note: For those who want to know more, Microchip has a tutorial on the details of wearout with some nice diagrams of how EEPROM cells are designed:
ftp://ftp.microchip.com/tools/memory/total50/tutorial.html

Don't Re-Write Unchanging Values
Another way to reduce wearout is to read the current value in a memory location before updating. If the value is the same, skip the update, and eliminate the wearout cycle associated with an update that has no effect on the data value. Make sure you account for the worst case (how often can you expect values to be the same?). But even if the worst case is bad, this technique will give you a little extra margin of safety if you get lucky once in a while and can skip writes.
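
In code this is just a thin wrapper around the write routine (same hypothetical driver calls as in the sketch above; it costs an extra read on every update, which is cheap compared to a write):

#include <stdint.h>

extern uint8_t eeprom_read_byte(uint16_t addr);                // hypothetical driver calls,
extern void    eeprom_write_byte(uint16_t addr, uint8_t value); // as in the sketch above

// Only spend a write cycle when the stored value actually changes.
void eeprom_update_byte(uint16_t addr, uint8_t value)
{ if (eeprom_read_byte(addr) != value)
  { eeprom_write_byte(addr, value);
  }
}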

If you've run into any other clever ideas for EEPROM wearout mitigation please let me know.

Learning More
Nash Reilly has a nice series of tutorial postings on how Flash/EEPROM technology works. (I found out about these via Jack Ganssle's newsletter.)
http://cushychicken.github.io/nand-pt1-transistors/
http://cushychicken.github.io/nand-pt2-floating/
http://cushychicken.github.io/nand-pt3-arrays/
http://cushychicken.github.io/nand-pt4-pages-blocks/
http://cushychicken.github.io/nand-pt5-how-nand-breaks/
http://cushychicken.github.io/nand-pt6-dealing-with-flaws/ 
http://cushychicken.github.io/inconvenient-truths/

Oct 2019: Tesla is said to have a flash wearout problem for its SSDs.  https://insideevs.com/news/376037/tesla-mcu-emmc-memory-issue/

