Monday, February 1, 2016

Multi-Rate Main Loop Task Timing

In the past couple posts I've talked about how to build a multi-rate main loop scheduler and the two biggest mistakes I tend to see with timing analysis. In this posting I'll wrap up the topic (at least for now) by describing how to determine whether a particular task in a non-preemptive multi-rate system is going to meet its deadlines.  The math gets pretty complex and is pretty painful, so I'll work an example and give the general rules without trying to show the iterative math behind the method. At that, this posting will take more work to understand completely than my usual posts, but it's just a complex topic and this isn't the place for an entire book chapter worth of math. On real systems you can generally do the checking with a spreadsheet once you understand the technique.

Consider the following task set:

Task  Period  Compute  CPU Load 
 0       7       2       28.6%
 1      10       2       20%
 2      20       3       15%    
 3     101       5        5%
 4     199       3        1.5%
         Total CPU Load  70.1%  (numbers rounded to nearest 0.1%)

If we want to find out the worst case response time for Task 2, we need to look at the following worst case:
  • All the tasks in the task set become ready to run simultaneously. In practice this means that the system timer is at zero or equal to the least common multiple of all periods
  • The most obnoxious task with priority lower than the task we care about sneaks in and starts running just before the highest priority task gets to run. (Generally this happens because that task got  a chance to start right at the last instant before the counter ticked.)
  • All the tasks with higher priority run one or more times (based on their periods) before the task we care about runs. In general some tasks will run multiple times before our task gets a chance to run.
So for the example case of Task #2, that means:
  • Identify Task #3 as the task with the largest compute time from those tasks with a priority higher than Task #2.  Assume it starts running at time zero because it snuck in ahead of all the other tasks.
  • Because it is time zero, all other tasks are ready to run starting at time zero. But because Task #3 snuck in first, all the other tasks are in the run queue at time zero.
Now we do an iterative calculation as follows, with each numbered step being an iteration of run the highest queued priority task and compute the new system time when it ends.
  1. Start at time 0 with Task 3 running.
    - Tasks 0, 1, 2, 4 queue to run at time zero.
    - Running Task 3 takes us to time 5 when Task 3 completes.
    - At time 5 tasks still queued are:  0, 1, 2, 4
  2. Time 5: run highest priority queued task: Task 0.
    - This runs for 2 msec, bringing us to time 7.
    - At time 7 we reach the Task 0 period, so another copy of Task 0 queues to run.
    - At time 7 tasks still queued are: 0, 1, 2, 4
  3. At time 7 run highest priority queued task: Task 0.
    - This runs for 2 msec, bringing us to time 9.
    - At time 9 tasks still queued are: 1, 2, 4
  4. At time 9 run highest priority queued task: Task 1.
    - This runs for 2 msec, bringing us to time 11.
    - At time 10 another copy of Task 1 is queued (note that we have missed the deadline for Task 1 since it should have completed at time 10, but ran until time 11).
    - At time 11 tasks still queued are: 1, 2, 4
  5. At time 11 run highest priority queued task: Task 1.
    - This runs for 2 msec, bringing us to time 13.
    - At time 13 tasks still queued are: 2, 4
  6. At time 13 run highest priority queued task: Task 2.
    - This runs for 3 msec, bringing us to time 16.
    - At time 14 another copy of Task 0 is queued.
    - At time 14 tasks still queued are: 0, 4.
  7. At time 16 Task #2 has completed, ending our analysis
We've computed the worst case time to complete Task 2 is 16 msec, which is less than its 20 msec period. So Task 2 will meet its deadline. However, along the way we noticed that even though the CPU is only loaded to 70.1% Task 1 is going to miss its deadline.

Graphically, the above scenario looks like this:

The blue arrows show where tasks become queued to run, and the boxes show which task is running when.

To determine if your system is schedulable you need to repeat this analysis for every task in the system. In this case, repeating it for Task 1 will reveal that Task 1 misses its deadlines even though the CPU is only loaded at 70.1%.

In general the technique is to assume the longest-running task (biggest compute time) with priority lower than yours starts running and all other tasks queue at that same time. Then play out the sequence to see if you meet your deadline. There are a few notes on special cases. The longest task with lower priority may vary depending on which task you are evaluating. For example, the longest lower priority task for Task #3 is not Task #3, but rather Task #4. And the lowest priority task doesn't have a lower priority blocking task, so when evaluating Task #4 you can just assume it starts at time zero (there is no initial blocker).  This can be expressed as an iterative equation that has to be cycled until it converges. If you really want a math-based approach take a look at the CAN performance lecture slides in my grad course, which is pretty much the same math. But trying to explain the math takes a lot more words and complicated typography than are practical in a blog entry.

If you find out you are missing deadlines and want to fix it, there are two general techniques that can help:
  • Make any really long-running low priority task run shorter, possibly by breaking it up into multiple shorter tasks that work together, or that can "yield" control back to the main loop every once in a while during their execution and pick up the next time they run where they left off. This will reduce the length of the initial blocking effect for all higher priority tasks.
  • Schedule tasks so their periods evenly divide each other (this will result in the Least Common Multiple of all periods equals the largest period). This corresponds to the the approach of harmonic task periods discussed in the real time scheduling chapter of my book. For non-preemptive tasking it will NOT support 100% CPU usage, but probably it will make worst case latency better.

Notes for readers who want to be thorough:
  • If you have interrupts you have to include the ISR contribution to CPU workload when doing this analysis. Doing so can get a bit tricky.  The real time scheduling chapter of my book shows how to do this for a single-rate main loop scheduler.
  • It may be that you want to resolve a situation in which the fastest task gets blocked for too long by putting that faster task into an ISR. If you do that, keep it short and don't forget that it still affects CPU workload. This approach is discussed for a single-rate main loop scheduler, also in the real time scheduling chapter of my book.
  • We don't include the current task in the "most obnoxious" consideration because we start by assuming the system will meet deadlines where deadline = task period. Any particular task has to have finished running before it can run again unless it misses its deadline. So a task can't self-block in a system that is schedulable.

Monday, January 4, 2016

Top Two Mistakes In Main Loop Scheduling

In my previous posting I showed you how to build a multi-rate main loop. In this posting we're going to start digging into the critical issue of how to do timing analysis for a multi-rate main loop cooperative scheduler. I'll explain the top two mistakes I see in industry code that lead to missed task deadlines.

I'm going to assume your multi-rate main loop is implemented more or less the way I described in my previous blog posting. The important property we care about is whether each task gets to run to completion sometime within its assigned time period.

Multi-Rate Main Loop CPU Utilization

Let's say you have a set of tasks that looks like this, with the table columns being the task number, the period the task runs at (in msec), and the amount of CPU time it takes to run the task (again in msec):

Task  Period  Compute   
 0       5       1    
 1      10       2    
 2      20       3        
 3     100      11     

In this example, Task 2 runs once every 20 msec, and takes 3 msec to run once it has started.

Here is the first big question: is the CPU overloaded to more than 100%? To answer that, for each row in the table add a CPU % that is computed by dividing each task's run time by its period. For example task 2 runs 3 msec out of every 20 msec, so the CPU utilization is 3/20 = 15%.

Task  Period  Compute  CPU Load 
 0       5       1       20%
 1      10       2       20%
 2      20       3       15%    
 3     100      11       11%
         Total CPU Load  66%

OK, total CPU load is only 66%.  This is good news so far.

To do this math in the real world, you also need to add in the CPU time spent by interrupt service routines. Sometimes you end up with more than 100%, and that is obviously a problem.

It might seem impossible to get this wrong. But I would not be writing this if I hadn't seen this problem in a real industry design review. More than once. The way teams get this wrong is not by getting the math wrong. The problem is that they don't do the analysis in the first place, and have no idea what their CPU load is because they never measure it.

An especially tricky problem is when a task might normally run quickly, but once in a while has a very long compute time. This means that the system might meet its deadlines most of the time, but not all the time. If you were to monitor typical CPU load on a testbed things would be fine. But you might still miss deadlines once in a while -- perhaps only once in a really long while that you don't see during system test.

When doing a CPU loading calculation, the compute time has to be the worst case path through each and every task in the code (sometimes called Worst Case Execution Time: WCET), not just a typical path. Determining WCET can be a bit of a pain, but doing so for each task -- even the very infrequent ones, is a required first step in knowing if you are going to miss deadlines.

Blocking Time

But we're not out of the woods yet! It is pretty easy to design a system that misses deadlines despite low CPU loads. Have you spotted the problem with the example task set above yet?

The issue is that in a non-preemptive system like this once a task starts running, it runs to completion. If it is an infrequent task it might account for a small CPU load, but still hog the system for a long time.

Take another look at Task 3. It only runs every 100 msec, but it runs for 11 msec. That's only 11% CPU load. But 11 msec is more than twice as long as Task 1's period! So that means no matter what you do, once task 3 starts running, task 1 will miss a deadline.  If you want to see a simple example, consider the following timeline:

Time 100:   Task 0 runs for 1 msec
Time 101:   Task 3 runs for 11 msec
Time 112:   Task 0 can start running.  But it needed to run again between times 105 and 110, so it has already missed its deadline.

This problem would occur regardless of CPU load. Even if Task 3 executes only once per day, every day Task 0 would miss its deadline. In fact, the longer the period for Task 3, the less frequently the problem will happen and the harder it is going to be to find this problem in testing.  (Sure, some systems can occasionally miss a deadline. But you at least want to know that's going to happen to make sure you can design the system to be robust to a missed deadline. And for a safety critical system, having a system that misses any deadlines due to a design flaw is a bad idea regardless of system robustness.)

In general, if any task has a compute time longer than twice the period of the fastest task, you're going to miss deadlines.  (A more precise formulation is if the sum of twice the compute time of the fastest task plus the compute time of the longest task is more than twice the period of the fastest task, you'll miss deadlines.) Note that this is a necessary but not sufficient condition for schedulability.

In design reviews I see this problem of low CPU % but blocked high priority task frequently.  Very frequently. So don't let it bite you!

Even if this isn't the case it is still possible to miss deadlines with less than 100% CPU load, but that analysis takes a while to explain, so it will have to wait for another posting..

If you simply can't wait, the easy-to-explain (but tedious) way to check things is to lay out a timeline of when tasks will execute using a spreadsheet, with one row per msec and starting with all tasks being ready to run at time zero. Put down when each task will run as a series of occupied cells in the timeline. Repeat until you reach the least common multiple (LCM) of all the task periods. If they all fit, you should be OK.  If not, you need to dig deeper into what's going on. (Note that this neglects the effects of jitter in task start times caused by race conditions between the timer ISR and the main loop if..else chain. So exercise care with that analysis.) 

Monday, December 7, 2015

Multi-Rate Main Loop Tasking

Recently I was looking for an example of a prioritized, cooperative multitasking main loop and realized it is surprisingly difficult to find one that is (1) readily available and (2) comprehensible.

Sure, once you understand the concept maybe you can dig through an industrial-strength implementation with all sorts of complexity. But good luck getting up the learning curve!  So, here is a simple (I hope!) description of a multi-rate main loop scheduler.

First of all, we're talking about non-preemptive multitasking. This is variously called main loop scheduling, a main loop tasker, a prioritized cooperative multitasker, a cyclic executive, a non-preemptive scheduler, and no doubt a bunch of other terms. (Not all of these terms are precisely descriptive, but in terms of search terms they'll get you in the ballpark.) The general idea is to be able to schedule multiple tasks that operate at different periods without having to use an RTOS and without having to use the generally bad idea of stuffing everything into timer-based interrupts.

Single Rate Main Loop

The starting point is a single-rate main loop. This is a classical main loop schedule in which all tasks are run to completion over and over again:

void main(void)
{ ... initialization ...

  { Task0();

The good news about this is that you don't need an RTOS, and it's really hard to get things wrong. (In other words reviews and testing are easy to do well.)

The bad news is that all tasks have to run at the same period. That means that if one task needs to run really quickly you'll miss its deadlines.

I've seen way too many hacks that use conditional execution based on CPU load to sometimes skip tasks. But ad hoc approaches have the problem that you can't really know if you'll miss deadlines. Instead, you should use a methodical approach that can be analyzed mathematically to make sure you'll meet deadlines. The way people generally go is with some variation of a multi-rate main loop.

Multi-Rate Main Loop

The idea behind a multi-rate main loop is that you can run each task at a different periodic rate. Each task (which is just a subroutine) still runs to completion, so this is not a full-up preemptive multitasking system. But it is relatively simple to build, and flexible enough for many embedded systems.

Here is some example code of the main loop itself, with some explanation to follow.

void main(void)
{ ... initialization ...

  { if(       flag0 )  { flag0 = 0; Task0(); }
    else if ( flag1 )  { flag1 = 0; Task1(); }
    else if ( flag2 )  { flag2 = 0; Task2(); } 
    else if ( flag3 )  { flag3 = 0; Task3(); }

The way this code works is as follows.  All the tasks that need to be run have an associated flag set to 1.  So if Task1 and Task2 are the only tasks that need to run, flag0 is 0, flag1 is 1, flag2 is 1, and flag3 is 0. The code crawls through an "else if" cascade looking for the first non-zero flag.  When it finds a non-zero flag it executes that task, and only that task.

Note that each task sets its flag to zero so that it runs exactly one time when it is activated by its flag. If all flags are zero then no task is executed and the do..while loop simply spins away until a flag finally becomes non-zero. More about how flags get set to 1 in a moment.

After executing at most one task, the loop goes back to the beginning. Because at most one task is executed per iteration of the main do..while loop, the tasks are prioritized. Task0 has the highest priority, and Task3 the lowest priority.

Yes, this prioritization means that if your CPU is overloaded Task0 may execute many times and Task3 may never get a turn. That's why its important to get scheduling right (this will be a topic in a later blog posting).

Multi-Rate Timers

The main loop wasn't so bad, except we swept under the rug the messy business of getting the flags set properly.  Trying to do that in the main loop generally leads to problems, because a long task will cause many milliseconds to go by between timer checks, and it is too easy to have a bug that misses setting a flag some of the time. Thus, in general you tend to see flag maintenance in the timer interrupt service routine.

Conceptually the code looks like this, and for our example lives in a timer interrupt service routine (ISR) that is called every 1 msec.  A variable called "timer" keeps track of system time and is incremented once every msec.

 // in a timer ISR that is called once every msec
 if ((timer %   5) == 0) { flag0 = 1; } // enable 5 msec task0 

 if ((timer %  10) == 0) { flag1 = 1; } // enable 10 msec task1
 if ((timer %  20) == 0) { flag2 = 1; } // enable 20 msec task2
 if ((timer % 100) == 0) { flag3 = 1; } // enable 100 msec task3
 if (timer >= 100) { timer = 0; } // avoid timer overflow bug

Every 5 msec the timer will be zero modulo 5, every 10 msec the timer will be zero modulo 10, and so on.  So this gives us tasks with periods of 5, 10, 20, and 100 msec.   Division is slow, and in many low-end microcontrollers should be avoided in an ISR. So it is common to see a set of counters (one per flag), where each counter is set to the period of a particular task and counts down toward zero once per msec. When a counter reaches zero the associated flag is set to 1 and the counter is reset to the tasks' period. This takes a little more RAM, but avoids division. How it's implemented depends upon your system tradeoffs.

The last line of this code avoids weird things happening when the timer overflows. The reset to zero should run at the least common multiple of all periods, which in this case happens to be equal to the longest period.

Concurrency Issues

As with any scheduling system there are potential concurrency issues. One subtle one is that the timer ISR can run part-way down the else..if structure in the main loop. This could cause a low-priority task to run before a high priority task if they both have their flags set to 1 on the same timer tick. It turns out that this doesn't make the worst case latency much worse. You could try to synchronize things, but that adds complexity. Another way to handle this is to copy the current time into a temporary variable and do the checks for when to run each task in the main loop instead of the timer ISR.

It's also important to note that there is a potential concurrency problem in writing flags in the main loop since both the ISR and the main task can write the flag variables. In practice the concurrency bug will only hit when you're missing deadlines, but good coding style dictates disabling interrupts when you update the flag values in the main loop, which isn't shown in the main loop code in an attempt to keep things simple for the purpose of explanation.

The Big Picture

OK, that's pretty much it.  We have a main loop that runs each task when its ready-to-run-flag is set, and a timer ISR that sets a ready-to-run flag for each task at the desired period. The result is a system that has the following properties:
  • Each task runs once during its assigned period
  • The tasks are prioritized, so for example task 2 only runs when task 0 and task 1 do not need to run
The big benefit is that, so long as you pay attention to schedulability math, you can run both fast and slow tasks without needing a fancy RTOS and without missing deadlines.

In terms of practical application this is quite similar to what I often see in commercial systems. Sometimes developers use arrays of counters, arrays of flags, and sometimes even arrays of pointers to functions if they have a whole lot of functions, allowing the code to be a generic loop rather than spelling out each flag name and each task name. This might be necessary, but I recommend keeping things simple and avoiding arrays and pointers if it is practical for your system.

Coming soon ... real time analysis of a multi-rate main loop

Monday, November 9, 2015

How Long Should You Set Your Watchdog Timer

Summary: You can often use a watchdog effectively even if you don't know worst case execution time. The more you know about system timing, the tighter you can set the watchdog.

Sometimes there is a situation in which a design team doesn't have resources to figure out a tight bound on worst case execution time for a system. Or perhaps they are worried about false alarm watchdog trips due to infrequent situations in which the software is running properly, but takes an unusually long time to complete a main loop. Sometimes they set the watchdog to the maximum possible time (perhaps several seconds) to avoid false alarm trips. And sometimes they just set it to the maximum possible value because they don't know what else to do.

In other words, some designers turn off the watchdog or set it to the maximum possible setting because they don't have time to do a detailed analysis.  But, almost always, you can do a lot better than that.

To get maximum watchdog effectiveness you want to put a tight bound on worst case execution time with it. However, if your system safety strategy permits it, there are simpler ways to compute watchdog period by analyzing the application instead of the software itself.  Below I'll work through some ways to set the watchdog, going from the more complicated way to a simpler way.

Consider software that heats water in an appliance. Just to make the problem concrete and use easy numbers, lets say that the control loop for heating executes every 100 msec, and takes between 40 and 75 msec to execute (worst case slow and fast speeds).   Let's also say that it uses a single-task main loop scheduler without an RTOS so we don't have to worry about task start time jitter.  How could we set the watchdog for this system? Ideally we'd like the tightest possible timing, but there may be some slack because water takes a while to heat up, and takes a while to boil dry. How long should we set the watchdog timer for?

Classical Watchdog Setup

Classically, you'd want to compute the worst case execution time range of the software (40-75 msec in this case). Let's assume the watchdog kick happens as the last instruction of each execution. Since the software only runs once every 100 msec, then the shortest time between kicks is when one cycle runs 75 msec, waits 25 msec, then the  next cycle runs faster, completing the computation and kicking the watchdog in only 40 msec.    25+40=65 msec shortest time between kicks. In contrast, the longest time between kicks is when a short cycle of 40 msec is followed by 60 msec of waiting, then a long cycle of 75 msec.  60+75=135 msec longest time between kicks. It helps a lot to sketch this out:

If you're setting a conventional watchdog timer, you'd want to set it at 135 msec (or the closest setting greater than that). If you have a windowed watchdog, you'd want to set the minimum setting at 65 msec, or the closest setting lower than that. 

Note that if your running an RTOS, the scheduling might guarantee that the task runs once in every 100 msec, but not when it starts within that period. In that case the worst case shortest time is the task running back-to-back at its shortest length = 40 msec. The longest time will be when a short task runs at the beginning of a period, and the next task completes right at the end of a period, giving 60 msec (idle time at end of period) + 100 msec (one more period) = 160 msec between watchdog kicks. Thus, a windowed watchdog for this system would have to permit a kick interval of 40 to 160 msec.

Watchdog Approximate Setup

Sometimes designers want a shortcut. It is usually a mistake to set the watchdog at exactly the period because of timing jitter in where the watchdog actually gets kicked. Instead, a handy rule of thumb for non-critical applications (for which you don't want to do the detailed analysis) is to set the watchdog timer interval to twice the software execution period. For this example, you'd set the watchdog timer to twice the period of 100 msec = 200 msec.  There are a couple assumptions you need to make: (1) the software always finishes before the end of its period, and (2) the effectiveness of the watchdog at this longer timeout will be good enough to ensure adequate safety for your system. (If you are building a safety critical system you need to dig deeper on this point.) 

This simpler approach sacrifices some watchdog timer effectiveness for detecting faults that perturb system timing compared to the theoretical bound of 135-160 msec. But it will still catch a system that is definitely hung without needing detailed timing analysis.

For a windowed watchdog the rule of thumb is a little more difficult.  That is because, in principle, your task might run the full length of one period and complete instantly on the next period, giving effectively a zero-length minimum watchdog timer kick interval. If you can establish a lower bound on the minimum possible run time of your task, you can set that as an approximate lower bound on watchdog timer kicks. If you don't know the minimum time, you probably can't use the lower bound of a windowed watchdog timer.

Application-Based Watchdog Setup

The previous approaches required knowing the run time of the software. But, what do you do if you know nothing about the run time, or have low confidence in your ability to predict the worst case bounds even after extensive analysis and testing?

An alternate way to approach this problem is to skip the analysis of software timing and instead think about the application. Ask yourself, what is the longest period for which the application can withstand a hung CPU?  For example, for a counter-top appliance heating water, how long can the heater be left full-on due to hung software without causing a problem such as a fire, smoke, or equipment damage?  Probably it's quite a bit longer than 100 msec.  But it might not be 5 or 10 seconds. And probably you don't want to risk melting a thermal fuse or a household smoke alarm by  turning off the watchdog entirely.

As a practical matter, if the system's controls go unstable after a certain amount of time, you'd better make your watchdog timer period shorter than that length of time!

For the above example, the control period is 100 msec. But, let's say the system can withstand 500 msec of no control inputs without becoming uncrecoverable, so long as the control system starts working again within that time and it doesn't happen often. In this case, the watchdog timer must be shorter than 500 msec at least. But there is one other thing to consider. The system probably takes a while to reboot and start running the control loop after the watchdog timer trips. So we need to account for that restart time. For this example, let's say the time between a watchdog reset and the time the software starts controlling the system generally ranges from 70 to 120 msec based on test measurements.

Based on knowing the system reset time and stability grace period, we can get an approximate watchdog setting as follows.  We have 500 msec of no-control grace period, minus 120 msec of worst case restart time.  500-120 = 380 msec. Thus, for this system the maximum permissible watchdog timer value is 380 msec to avoid losing control system stability. Using this approach, the watchdog maximum period should be set at the longest time that is shorter than 380 msec. Without knowing more about software computation time, there is not much we can say about the minimum period for a windowed watchdog.

Note that for this approach we still need to know something about the worst case execution of the software in case you hit a long execution path after the watchdog reset. However, it is often the case that you know the longest time that is likely to be seen (e.g., via measuring a number of runs) even if you don't know the details of the real time scheduling approach used.  And often you might be willing to take the chance that you won't hit an unlikely, even worse running time right after a system reset. For a non-safety critical application this might be appropriate, and is certainly better than just turning the watchdog off entirely.

Finally, it is often useful to combine the period rule of thumb with the control stability rule of thumb (if you know the task execution period). You want the watchdog set shorter than the time required to ensure control stability, but longer than the time it actually takes to execute the software so that it will actually be kicked often enough. For the above example this means setting the watchdog somewhere between two periods and the control stability time limit, giving a range for the maximum watchdog limit of 200-380 msec. This can be set without detailed software execution time analysis beyond knowing the task period and the range of likely system restart times.

If you know the maximum worst case execution time and the minimum execution time, you should set your watchdog as tightly as you can. If you don't know those values, but at least are confident you'll complete execution within your task period, then you should set the watchdog to twice the period. You should also bound the maximum watchdog timeout setting by taking into account how long the system can operate without a reset before it loses control stability.

Monday, September 28, 2015

Open Source IoT Code Is Not The Entire Answer

Summary: Whether or not to open sourcing embedded software is the wrong question. The right question is how can we ensure independent checks and balances on software safety and security. Independent certification agencies have been doing this for decades. So why not use them?

In the wake of the recent Volkswagen diesel software revelations, there has been a call from some that automotive software and even all Internet of Things software should be open source. The idea is that if the software is released publicly, then someone will notice if there is a security problem, a safety problem, or skulduggery of some sort. While open source can make sense, this is neither an economically realistic nor necessary step to apply across-the-board.

The Pro list for open source is pretty straightforward: if you publish the code, someone will come and read it and find all the problems.

The Con list is, however, more reflective of how things really work. You have to assume that someone with enough technical skill will actually spend the time to look, and will actually find the problem. That doesn't always happen. The relatively simple Heartbleed bug was there for all to see in OpenSSL, and it stayed there for a couple years despite being a widely used, crucial piece of open source Internet infrastructure software. Presumably a lot more people care about OpenSSL than your toaster oven's software.

Some of the opponents of open sourcing IoT software invoke the security bogeyman. They say that if you publish the source you'll be vulnerable to attacks. Well sure, it might make it easier to find a way to attack, but it doesn't make you "vulnerable." If your code was already full of vulnerabilities, publishing source code just might make it a little easier for someone to find them.  Did you notice that the automotive security exploits published recently did not rely on source code?  I can believe that exploits could, at least sometimes, be published more quickly for open source code, but I don't see this as a compelling argument for keeping code secret and un-reviewed.

A more fundamental point is that software is often the biggest competitive advantage in making products that would otherwise be commodities. Asking companies to reveal their most important trade secrets (their software), so that a hypothetical person with the time and skills might just happen to find a problem sounds like a hard sell to me.  Especially since there is the well established alternative of having an external, independent certification agency look things over in private.

Safety critical systems have had standards and independent review systems in place for decades. Aviation uses DO-178c and other standards, and has a set of independent reviewers called Designated Engineering Representatives (DERs) that provide design reviews during the development cycle. Rail systems follow EN-50126/8/9 and typically involve oversight from acquisition consultants. The chemical process industry generally follows IEC-61508, and has long used independent certification organizations to check their work (typically I see reviews have been done by Exida or TUV). The consumer appliance industry has long had Underwriters Laboratories (UL) certification, and is moving to a more comprehensive software safety standard approach based on IEC 60730, including external independent certification. There are also more recent domain-specific security standards that can be applied. (It is worth noting that ensuring safety and security requires a lot more than just source code, but that's a topic for another day.)

Cars have long had the option to use the MISRA software safety guidelines, and more recently the ISO 26262 safety standard. Historically, some companies have had external agencies certify automotive components to those standards. But, at least some car companies have not taken advantage of this external audit opportunity, and thus there has been no independent check and balance on their software until we their problems show up in the news. Software safety and security audits are not required to sell cars in the US. (There is some vehicle-level testing according to FMVSS requirements, but it's about vehicle behaviors, not the actual source code.)

For Internet of Things it will be interesting to see how things play out. As I understand it the EU is already requiring IEC 60730 compliance, which means external safety checks for safety critical IoT applications. We could see that mandate spread to more IoT products sold in Europe if there are high-profile problems. And perhaps we'll see a push on automotive software regulation too.

So, there is a well established alternative to open source in the form of external certifying organizations issuing compliance certificates based on international safety and security standards. Rather than get distracted by an open source debate, what we should be doing is asking "what's the most effective way to ensure adequate software safety and dependability in a way that doesn't put companies out of business." Sometimes that might be open source, especially for underlying infrastructure. But other times, probably most times, independent review by a trusted certification party will be up to the task. The question is really what it will take to make companies produce verifiably adequate software.

Having checks and balances works. We should use them.

(For the record, I made some of my source code public domain before "open source" was even a buzzword, and have released other source code under an older version of GPL (Ballista robustness testing) and Creative Commons BY 4.0 (CRC Hamming Distance length calculation). Some code I copyright and release. And some I keep as a trade secret. My interest here is in the public being able to use safe and secure embedded software. We should focus on that, and not let things get sidetracked into another iteration of the open source vs. proprietary software debate.)