Sunday, December 11, 2016

Better Reboot Your Boeing 787 Every Three Weeks

Crashing after a prolonged up time due to a counter rollover or other problem is a classic mistake in computer software.  And, it just bit the Boeing 787. Again.

The Problem:

The Boeing 787 aircraft has three Flight Control Modules (FCMs) that are the subject of a new FAA Airworthiness Directive.  Based on that sentence alone, you want to make sure whatever that involves gets fixed before you fly on a 787!

The FAA says there is "a report" that all three FCMs can fail at the same time 22 days after they have been rebooted.  If you don't reboot the FCMs the FAA says this "could result in flight control surfaces not moving in response to flight crew inputs for a short time and consequent temporary loss of controllability."  This is FAA-speak for "the airplane could crash." They're telling airlines to reboot the plane every 21 days to avoid this. Hope nobody forgets to do that! (In fairness, I understand it is likely that most planes get rebooted more often than this anyway. But this is not something you want to leave to chance.)

At this point we can only guess at the cause, but the usual guess is that it is a timer overflow problem. Let's hypothesize a 32-bit signed integer is counting the passing of time in milliseconds.  So a value of 32700 in that counter is 32.700 seconds.

How long until it overflows 31 bits of counting into the 32nd bit, which is the sign bit?

0x7FFFFFFF = 2147483647 ==> 2147483.647 seconds
2147483.647 seconds * (1 min/60 sec) * (1 hr/60 min) * (1 day/24 hr) = 24.9 days
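
For anyone who wants to check the arithmetic, here is a minimal Python sketch of the same calculation. The 32-bit signed millisecond counter is the hypothesis from above, not a confirmed detail of the FCM, and the function name is just mine for illustration.

```python
SECONDS_PER_DAY = 24 * 60 * 60

def days_until_overflow(max_count: int, ticks_per_second: float) -> float:
    """Days for a counter that starts at zero to reach max_count."""
    return max_count / ticks_per_second / SECONDS_PER_DAY

# Hypothetical counter: 32-bit signed, one tick per millisecond.
INT32_MAX = 0x7FFFFFFF  # largest positive value before the sign bit flips
days = days_until_overflow(INT32_MAX, ticks_per_second=1000)
print(f"{days:.1f} days")  # prints "24.9 days"
```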

Hmm, a bit longer than the 22 days the FAA reports.  Some time spent playing with various multipliers didn't seem to give a likely candidate.  Possible factors if it is a timer rollover would include fixed point math (e.g., time keeping in 256ths of a second) or scaling from a 400 Hz aircraft AC frequency. Or there could be some divided-down crystal oscillator frequency on the FCM that is involved.
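
The "playing with multipliers" exercise can be sketched roughly as below. The tick rates (256ths of a second, 400 Hz, milliseconds) are the ones mentioned above; the counter widths and the one-day tolerance are my own assumptions, and none of this reflects anything known about the real FCM design.

```python
from itertools import product

SECONDS_PER_DAY = 86_400
TARGET_DAYS = 22  # failure time in the FAA report

def rollover_days(bits: int, ticks_per_second: float) -> float:
    """Days until a counter of the given width wraps at the given tick rate."""
    return 2 ** bits / ticks_per_second / SECONDS_PER_DAY

def likely_candidates(rates, widths, tolerance_days=1.0):
    """Return (bits, rate, days) combinations that wrap close to TARGET_DAYS."""
    hits = []
    for bits, rate in product(widths, rates):
        days = rollover_days(bits, rate)
        if abs(days - TARGET_DAYS) <= tolerance_days:
            hits.append((bits, rate, days))
    return hits

# Assumed search space: 31 usable bits (signed) or 32 bits (unsigned).
likely = likely_candidates(rates=[256, 400, 1000], widths=[31, 32])
print(likely or "no obvious candidate")
```

With these particular guesses nothing lands within a day of 22, which matches my experience of not finding a likely candidate.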

Or, it could be something completely different.  Maybe there is memory that records operating parameters periodically and the system crashes when that memory fills up (for example, logs that get downloaded every maintenance interval, with an expectation that the maintenance interval is more like a few days than a few weeks).

For now the cause is a bit of a mystery to us.  I'll bet the FCM engineers have a pretty good idea at this point. No doubt they'll issue a fix as fast as they can get the FAA to review it.

But the big news is that for the second time, the FAA is telling the airlines they have to do a maintenance reboot of their planes.  Last time it was every 248 days. This time it's every 21 days.

It's bad enough that they have to reboot the infotainment systems once in a while.  For flight controls, this is not good news. This is the kind of problem that should be caught in design reviews.  Always think about what happens if any counter, timer, or data structure overflows.
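
As one illustration of thinking about counter overflow at design time, here is a small sketch of a wrap-safe elapsed-time calculation for a free-running 32-bit tick counter. The counter width and the function names are hypothetical. The technique (subtract, then mask to the counter width) only works if fewer than 2**32 ticks really elapsed between the two readings.

```python
MASK32 = 0xFFFFFFFF  # assumed width of the free-running counter

def elapsed_ticks(start: int, now: int) -> int:
    """Ticks between two readings, correct across a single wraparound."""
    return (now - start) & MASK32

def naive_expired(start: int, now: int, timeout: int) -> bool:
    """Buggy comparison: fails once 'now' has wrapped past zero."""
    return now >= start + timeout

def safe_expired(start: int, now: int, timeout: int) -> bool:
    """Wrap-safe comparison based on modular elapsed time."""
    return elapsed_ticks(start, now) >= timeout

# Reading taken just before the wrap, then again just after it.
start, now = 0xFFFFFF00, 0x00000100
print(elapsed_ticks(start, now))       # 512 ticks really elapsed
print(naive_expired(start, now, 500))  # False: the timeout is missed
print(safe_expired(start, now, 500))   # True
```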

Other Examples:

This is not the first time a problem with long-running software has happened beyond the usual memory leaks in everyday applications.  Some examples are:

Timer rollover bugs:
  • B787 needs to be rebooted every 248 days due to a likely timer overflow bug [Blog][NY Times] [FAA]
  • Air Traffic control loses contact with 400 aircraft due to a 32-bit time rollover in 2004 [IEEE Spectrum]
  • IBM: Interface adapters hang after 497 days of uptime [IBM]
  • Windows 95: hang after 49.7 days without reboot, counting in milliseconds [Microsoft]  (I met the engineers who found that one. And congratulated them on the significant feat of actually getting Windows 95 to run that long without crashing for some other reason!)
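
The uptime figures in this list line up suspiciously well with simple counter sizes. The sketch below checks that; only the Windows 95 millisecond tick comes from the list above. The 10 ms (100 ticks per second) rates and the counter widths for the other two entries are my guesses, chosen because they reproduce the published numbers.

```python
SECONDS_PER_DAY = 86_400

def wrap_days(bits: int, ticks_per_second: int) -> float:
    """Days until a counter with the given number of usable bits wraps."""
    return 2 ** bits / ticks_per_second / SECONDS_PER_DAY

# (label, usable counter bits, assumed ticks per second)
GUESSES = [
    ("B787 248-day reboot", 31, 100),
    ("IBM 497-day hang", 32, 100),
    ("Windows 95 49.7-day hang", 32, 1000),
]

for label, bits, rate in GUESSES:
    print(f"{label}: {wrap_days(bits, rate):.1f} days")
```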
There are also plenty of date roll-over bugs:
  • NASA Deep Impact Comet Mission terminated unexpectedly at 2**32 tenths of a second after Jan 1, 2000 (a time rollover bug). [IEEE Software]
  • Y2K: on 1 January 2000 (overflow of 2-digit year from 99 to 00)   [Wikipedia]
  • GPS: 1024 week rollover on 22 August 1999 [USCG]
  • Year 2038: Unix time will roll over on 19 January 2038 [Wikipedia]
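
Two of these dates are easy to reproduce from the epoch and the counter size. This is a sketch under stated assumptions: Unix time held in a signed 32-bit count of seconds since 1 January 1970 UTC, and a 10-bit GPS week number counted from the GPS epoch of 6 January 1980. GPS time does not include leap seconds, so the GPS result is on the GPS timescale rather than UTC.

```python
from datetime import datetime, timedelta, timezone

UNIX_EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)
GPS_EPOCH = datetime(1980, 1, 6)  # naive datetime: GPS timescale, not UTC

# Last second representable in a signed 32-bit seconds counter.
unix_rollover = UNIX_EPOCH + timedelta(seconds=2 ** 31 - 1)

# A 10-bit week number wraps after 2**10 = 1024 weeks.
gps_rollover = GPS_EPOCH + timedelta(weeks=2 ** 10)

print(unix_rollover.isoformat())      # 2038-01-19T03:14:07+00:00
print(gps_rollover.date().isoformat())  # 1999-08-22
```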
There are also somewhat related capacity overflow issues such
And floating-point roundoff issues (thanks to Dan for reminding me of this one):

  • Patriot Missile mishap after operating for 100 hours without a maintenance reboot [GAO]
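
The roundoff here can be estimated with a few lines of exact arithmetic. This follows a commonly cited reconstruction, not the actual Patriot code: it assumes time was counted in tenths of a second and that 0.1 was chopped to 23 fractional bits in a 24-bit fixed-point register. The Scud speed of about 1,676 m/s is likewise an often-quoted figure that I am assuming for illustration.

```python
import math
from fractions import Fraction

FRACTION_BITS = 23           # assumed fixed-point precision
TICK = Fraction(1, 10)       # one clock tick = 0.1 second
SCUD_SPEED_M_PER_S = 1676    # assumed approximate speed

# Chop (not round) 0.1 to the available fractional bits.
scale = 2 ** FRACTION_BITS
stored_tick = Fraction(math.floor(TICK * scale), scale)
error_per_tick = TICK - stored_tick

# Accumulated clock error after 100 hours of uptime.
ticks = 100 * 3600 * 10
drift_seconds = float(error_per_tick * ticks)
miss_meters = drift_seconds * SCUD_SPEED_M_PER_S

print(f"drift: {drift_seconds:.4f} s, about {miss_meters:.0f} m at Scud speed")
```

Under these assumptions the drift comes out to roughly a third of a second, which is more than half a kilometer at that speed.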

If you want to dig further, there is a "zoo" of related problems on Wikipedia:  "Time formatting and storage bugs"

1 comment:

  1. Phil -- also notable was the timing skew that accumulated in the Patriot missile system... after a few hours, the timing drift was so severe that the Patriot missile system would be off by as much as half a kilometer when targeting an incoming Scud missile. Until the root cause was identified and a firmware update was delivered, the work-around was to simply reboot the missile system (wiping out the timing error and starting over). Then again, missile systems don't work too well while they're rebooting!

    It's also a good example of a system that passes all tests, but still contains bugs. Either the initial conditions of the tests always began with "Power up the missile system", or it was a failure of imagination in creating test cases.

    As you said, these types of bugs in long-running software must be caught during design / code review, not in testing.

