Better Embedded System SW: December 2016

Monday, December 26, 2016

Embedded System Security Pitfalls Tutorial (Preview)

Here's a summary video on Security Pitfalls. (Hint: security via obscurity doesn't make you secure!)

Security Pitfalls Preview [ECR]

Other pointers on this topic (my blog posts unless otherwise noted):

Other useful pointers:

Bruce Schneier: Security Pitfalls in Cryptography

Schneier on Security

For more about Edge Case Research and how to subscribe to our video training channel, please see this Blog posting.

Monday, December 19, 2016

A Driver Test For Self-Driving Cars Isn't Enough

I recently read yet another argument that a driving road test should be enough to certify an autonomous vehicle as safe for driving. In general, the idea was that if it's good enough to put a 16 year old on the road, it should be good enough for a self-driving vehicle. I see this idea enough that it's worth explaining why it it's a really bad one.

Even if we were to assume that a self-driving vehicle is no different than a person (which is clearly NOT true), applying the driving test is only half the driver license formula. The other half is the part about being 16 years old. If a 12 year old is proficient at operating a vehicle, we still don't issue a drivers license. In addition to technical skills and book knowledge, we as a society have imposed a maturity requirement in most states of "being 16." It is typical that you don't get an unrestricted license until you're perhaps 18. And even then you're probably not a great driver at any age until you get some experience. But, we won't even let you on the road under adult supervision at 12!

The maturity requirement is essential. As a driver we're expected to have the maturity to recognize when something isn't right, to avoid dangerous situations, to bring the vehicle to a safe state when something has gone wrong, to avoid operating when the vehicle system (vehicle + driver) is impaired, to improvise when something weird happens, and to compensate for other drivers who are having a bad day (or are simply suffering from only being 16). Autonomous driving systems might be able to do some of that, or even most of it in the near term. (I would argue that they are especially bad at self-detecting when they don't know what's going on.) But the point is a normal driving test doesn't come close to demonstrating "maturity" if we could even define that term in a rigorous, testable way. It's not supposed to -- that's why licenses require both testing and "being 16."

To be sure, human age is not a perfect correlation to maturity. But as a society we've come to the situation in which this system is working well enough that we're not changing it except for some tweaks every few years for very young and very old drivers who have historically higher mishap rates. But the big point is if a 12 year old demonstrates they are a whiz at vehicle operation and traffic rules, they still don't get a license. In fact, they don't even get permission to operate on a public road with adult supervision (i.e., no learners permit at 12 in any US state that I know of.) So why does it make sense to use a human driving test analogy to give a driver license, or even a learner permit, to an autonomous vehicle that was designed in the last few months? Where's the maturity?

Autonomy advocates argue that encapsulating skill and fleet-wide learning from diverse situations could help cut down the per-driver learning curve. And it could reduce the frequency of poor choices such as impaired driving and distracted driving. If properly implemented, this could all work well and could improve driving safety -- especially for drivers who are prone to impaired and distracted driving. But while it's plausible to argue that autonomous vehicles won't make stupid choices about driving impaired, that is not at all the same thing as saying that they will be mature drivers who can handle unexpected situations and in general display good judgment comparable to a typical, non-impaired human driver. Much of safe driving is not about technical skill, but rather amounts to driving judgment maturity. In other words, saying that autonomous vehicles won't make stupid mistakes does not automatically make them better than human drivers.

I'd want to at least see an argument of technological maturity as a gate before even getting to a driving skills test. In other words, I want an argument that the car is the equivalent of "being 16" before we even issue the learner permit, let alone the driver license. Suggesting that a driving test is all it takes to put a vehicle on the road means: building a mind-boggling complex software system with technology we really don't understand how to validate, doing an abbreviated acceptance test of basic skills (a few minutes on the road), and then deciding it's fine to put in charge of several thousand pounds of metal, glass, and a human cargo as it hurtles down the highway. (Not to mention the innocent bystanders it puts at risk.)

This is a bad idea for any software. We know from history that a functional acceptance test doesn't prove something is safe (at most it can prove it is unsafe if a mishap occurs during the test). Not crashing during a driver exam is to be sure an impressive technical achievement, but on its own it's not the same as being safe! Simple acceptance testing as the gatekeeper for autonomous vehicles is an even worse idea. For other types of software we have found that in practice that you can't understand software quality for life-critical systems without also considering the rigor of the engineering process. Think of good engineering process as a proxy for "being 16 years old." It's the same for self-driving cars.

(BTW, "life-critical" doesn't mean perfect. It means designed with sufficient engineering rigor to be suitable for its intended criticality. See ISO 26262 or your favorite software safety standard, which is currently NOT required by the government for autonomous vehicles, but should be.)

It should be noted that some think that a few hundred million miles of testing can be a substitute for documenting engineering rigor. That's a different discussion. What this essay is about is saying that a road test -- even an hour long grueling road test -- does not fulfill the operational requirement of "being 16" for issuing a driving license under our current system. I'd prefer a different, more engineering-based method of certifying self-driving vehicles for use on public roads. But if you really want there to be a driver test, please tell me how you plan to argue the part about demonstrating the vehicle, the design team, or some aspect of the development project has the equivalent 16-year-old maturity. If you can't, you're just waving your hands about vehicle safety.

Sunday, December 11, 2016

Better Reboot Your Boeing 787 Every Three Weeks

Crashing after a prolonged up time due to a counter rollover or other problem is a classic mistake in computer software. And, it just bit the Boeing 787. Again.

The Problem:

The Boeing 787 aircraft has three Flight Control Modules (FCMs) that are the subject of a new FAA Airworthiness Directive. Based on that sentence alone, you want to make sure whatever that involves gets fixed before you fly on a 787!

The FAA says there is "a report" that all three FCMs can fail at the same time 22 days after they have been rebooted. If you don't reboot the FCMs the FAA says this "could result in flight control surfaces not moving in response to flight crew inputs for a short time and consequent temporary loss of controllability." This is FAA-speak for the airplane could crash. The're telling airlines to reboot the plane every 21 days to avoid this. Hope nobody forgets to do that! (In fairness, I understand it is likely that most planes get rebooted more often than this anyway. But this is not something you want to leave to chance.)

At this point we can only guess at the cause, but the usual guess is that it is a timer overflow problem. Let's hypothesize a 32-bit signed integer is counting the passing of time in milliseconds. So a value of 32700 in that counter is 32.700 seconds.

How long until it overflows 31 bits of counting into the 32nd bit, which is the sign bit?

0x7FFFFFFF = 2147483647 ==> 2147483.647 seconds
2147483.647 seconds * (1 min/60 sec) (1 hr/60 min)(1 day/24 hr) = 24.9 days

Hmm, a bit longer than the 22 days the FAA reports. Some time spent playing with various multipliers didn't seem to give a likely candidate. Possible factors if it is a timer rollover would include fixed point math (e.g., time keeping in 256ths of a second) or scaling from a 400 Hz aircraft AC frequency. Or there could be some divided-down crystal oscillator frequency on the FCM that is involved.

Or, it could be something completely different. Maybe there is memory that records operating parameters periodically and the system crashes when that fills up that memory (for example, logs that get downloaded every maintenance interval, with an expectation that the maintenance interval is more like a few days than a few weeks).

For now the cause is a bit of a mystery to us. I'll bet the FCM engineers have a pretty good idea at this point. No doubt they'll issue a fix as fast as they can get the FAA to review it.

But the big news is that for the second time, the FAA is telling is telling the airlines they have to do a maintenance reboot of their planes. Last time it was every 248 days. This time it's every 21 days.

It's bad enough that they have to reboot the infotainment systems once in a while. For flight controls, this is not good news. This is the kind of problem that should be caught in design reviews. Always think about what happens if any counter, timer, or data structure overflows.

Other Examples:

This is not the first time a problem with long-running software has happened beyond the usual memory leaks in everyday applications. Some examples are:

Timer rollover bugs:

Another B787 reboot every 22 days [Seattle Times 2020\
A350 needs to be rebooted every 149 hours, likely due to a timer overflow bug [Register][EASA]
B787 needs to be rebooted every 248 days due to a likely timer overflow bug [Blog][NY Times] [FAA]
Air Traffic control loses contact with 400 aircraft due to a 32-bit time rollover in 2004 [IEEE Spectrum]
IBM: Interface adapters hang after 497 days of uptime [IBM]
Windows 95: hang after 49.7 days without reboot, counting in milliseconds [Microsoft] (I met the engineers who found that one. And congratulated them on the significant feat of actually getting Windows 95 to run that long without crashing for some other reason!)
SSD Drives 40,000 and 32,768 hour failures. [BleepingComputer]
AMD EPYC Rome CPU chips crash after 1044 days of uptime [Tom's Hardware]

There are also plenty of date roll-over bugs:

NASA Deep Impact Comet Mission terminated unexpectedly when at 2**32 seconds after Jan 1, 2000 (a time rollover bug). [IEEE Software]
Y2K: on 1 January 2000 (overflow of 2-digit year from 99 to 00) [Wikipedia]
GPS: 1024 week rollover on 22 August 1999 [USCG]
Year 2038: Unix time will roll over on 19 January 2038 [Wikipedia]

There are also somewhat related capacity overflow issues such