The Problem:
The Boeing 787 aircraft has three Flight Control Modules (FCMs) that are the subject of a new FAA Airworthiness Directive. Based on that sentence alone, you want to make sure whatever that involves gets fixed before you fly on a 787!
The FAA says there is "a report" that all three FCMs can fail at the same time 22 days after they have been rebooted. If the FCMs are not rebooted, the FAA says this "could result in flight control surfaces not moving in response to flight crew inputs for a short time and consequent temporary loss of controllability." This is FAA-speak for "the airplane could crash." They're telling airlines to reboot the plane every 21 days to avoid this. Hope nobody forgets to do that! (In fairness, I understand it is likely that most planes get rebooted more often than this anyway. But this is not something you want to leave to chance.)
At this point we can only guess at the cause, but the usual guess is that it is a timer overflow problem. Let's hypothesize a 32-bit signed integer is counting the passing of time in milliseconds. So a value of 32700 in that counter is 32.700 seconds.
How long until it overflows 31 bits of counting into the 32nd bit, which is the sign bit?
0x7FFFFFFF = 2147483647 ==> 2147483.647 seconds
2147483.647 seconds * (1 min/60 sec) * (1 hr/60 min) * (1 day/24 hr) = 24.9 days
Hmm, a bit longer than the 22 days the FAA reports. Some time spent playing with various multipliers didn't seem to give a likely candidate. Possible factors if it is a timer rollover would include fixed point math (e.g., time keeping in 256ths of a second) or scaling from a 400 Hz aircraft AC frequency. Or there could be some divided-down crystal oscillator frequency on the FCM that is involved.
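The arithmetic above is easy to repeat for other tick rates. Here is a minimal Python sketch of that kind of exploration. The candidate rates are assumptions picked purely for illustration (1 ms ticks, the 400 Hz and 256ths-of-a-second ideas mentioned above, and a 10 ms tick); none of them is known to be what the FCM actually uses.

```python
SECONDS_PER_DAY = 60 * 60 * 24
INT32_MAX = 2**31 - 1  # 0x7FFFFFFF, the largest count before the sign bit is set


def days_to_overflow(ticks_per_second: float, max_count: int = INT32_MAX) -> float:
    """Days for a counter starting at 0 to reach max_count at the given tick rate."""
    return max_count / ticks_per_second / SECONDS_PER_DAY


# Hypothetical tick rates, for illustration only.
CANDIDATE_RATES_HZ = [1000, 400, 256, 100]

for rate_hz in CANDIDATE_RATES_HZ:
    print(f"{rate_hz:5d} Hz -> {days_to_overflow(rate_hz):7.2f} days")
```

The 1000 Hz row reproduces the 24.9 day figure. None of these particular rates lands on 22 days, which is consistent with the multiplier search not turning up an obvious candidate.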
Or, it could be something completely different. Maybe there is memory that records operating parameters periodically and the system crashes when that memory fills up (for example, logs that get downloaded every maintenance interval, with an expectation that the maintenance interval is more like a few days than a few weeks).
For now the cause is a bit of a mystery to us. I'll bet the FCM engineers have a pretty good idea at this point. No doubt they'll issue a fix as fast as they can get the FAA to review it.
But the big news is that for the second time, the FAA is telling the airlines they have to do a maintenance reboot of their planes. Last time it was every 248 days. This time it's every 21 days.
It's bad enough that they have to reboot the infotainment systems once in a while. For flight controls, this is not good news. This is the kind of problem that should be caught in design reviews. Always think about what happens if any counter, timer, or data structure overflows.
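To make the "think about overflow" advice concrete, here is a small Python sketch of one common way timer rollover bites, and one common way to avoid it. It simulates a free-running 32-bit unsigned tick counter with a bit mask. The counter width, function names, and values are all assumptions for illustration. This is not a claim about how the 787 FCM code works.

```python
COUNTER_BITS = 32                       # assumed counter width
COUNTER_MASK = (1 << COUNTER_BITS) - 1  # 0xFFFFFFFF


def elapsed_ticks(now: int, start: int) -> int:
    """Ticks from start to now on a free-running counter, modulo 2**32.

    Correct across a rollover as long as the real interval is shorter
    than 2**32 ticks.
    """
    return (now - start) & COUNTER_MASK


def timeout_expired(now: int, start: int, timeout: int) -> bool:
    """Rollover-safe check: compare elapsed time, not absolute deadlines."""
    return elapsed_ticks(now, start) >= timeout


def naive_timeout_expired(now: int, start: int, timeout: int) -> bool:
    """Buggy check: the deadline itself wraps, so the comparison misfires."""
    deadline = (start + timeout) & COUNTER_MASK
    return now >= deadline


# Timer started 16 ticks before rollover, with a 32-tick timeout.
start = 0xFFFFFFF0
timeout = 32
now = 0xFFFFFFF5  # only 5 ticks after start

print("naive check says expired:", naive_timeout_expired(now, start, timeout))
print("rollover-safe check says expired:", timeout_expired(now, start, timeout))
```

In this scenario the naive deadline wraps around to 0x10, so the naive check reports the timeout as expired after only 5 ticks, while the elapsed-time check does not. The design point is to compare differences between counter values rather than absolute values, and to know how long an interval the counter width can represent.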
Other Examples:
This is not the first time a problem with long-running software has happened beyond the usual memory leaks in everyday applications. Some examples are:
Timer rollover bugs:
- Another B787 reboot every 22 days [Seattle Times 2020]
- A350 needs to be rebooted every 149 hours, likely due to a timer overflow bug [Register][EASA]
- B787 needs to be rebooted every 248 days due to a likely timer overflow bug [Blog][NY Times] [FAA]
- Air Traffic control loses contact with 400 aircraft due to a 32-bit time rollover in 2004 [IEEE Spectrum]
- IBM: Interface adapters hang after 497 days of uptime [IBM]
- Windows 95: hang after 49.7 days without reboot, counting in milliseconds [Microsoft] (I met the engineers who found that one. And congratulated them on the significant feat of actually getting Windows 95 to run that long without crashing for some other reason!)
- SSD drives fail after 40,000 and 32,768 hours of operation [BleepingComputer]
- AMD EPYC Rome CPU chips crash after 1044 days of uptime [Tom's Hardware]
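Several of the uptimes in this list line up with power-of-two counter limits. Here is a quick Python check. The millisecond tick for Windows 95 comes from the list above. The 10 ms tick used for the 248 day and 497 day cases is my assumption, picked because it makes the numbers work, and is not something the cited sources are quoted as confirming here.

```python
SECONDS_PER_DAY = 86_400


def rollover_days(bits: int, ticks_per_second: int) -> float:
    """Days for a counter to count 2**bits ticks at the given tick rate."""
    return 2**bits / ticks_per_second / SECONDS_PER_DAY


# 32-bit unsigned counter of milliseconds (Windows 95 case).
print(f"2**32 ticks at 1 ms:  {rollover_days(32, 1000):6.1f} days")

# 31 usable bits (signed 32-bit counter), 10 ms tick -- tick size is an assumption.
print(f"2**31 ticks at 10 ms: {rollover_days(31, 100):6.1f} days")

# 32-bit unsigned counter, 10 ms tick -- tick size is an assumption.
print(f"2**32 ticks at 10 ms: {rollover_days(32, 100):6.1f} days")

# 32,768 hours is exactly 2**15 hours.
print("2**15 hours:", 2**15)
```

These work out to roughly 49.7, 248.6, and 497.1 days, which is at least consistent with the reported figures.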
There are also plenty of date roll-over bugs:
- NASA Deep Impact Comet Mission terminated unexpectedly when its clock reached 2**32 tenths of a second after Jan 1, 2000 (a time rollover bug). [IEEE Software]
- Y2K: on 1 January 2000 (overflow of 2-digit year from 99 to 00) [Wikipedia]
- GPS: 1024 week rollover on 22 August 1999 [USCG]
- Year 2038: Unix time will roll over on 19 January 2038 [Wikipedia]
- 911 outage caused by database record limit [Washington Post]
- 512K day for IPv4 routers
- Lightsail spacecraft dies when log exceeds 32 MB [Space.com]
- Google Chrome version rollover at 100 [Techradar.pro]
- Patriot Missile mishap after operating for 100 hours without a maintenance reboot [GAO]
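Two of the date rollovers in the list above are easy to check with a few lines of Python. This sketch assumes a signed 32-bit count of seconds since 1 January 1970 for the Unix case, and a 10-bit week counter starting from the GPS epoch of 6 January 1980 for the GPS case. The GPS date is computed on the GPS time scale, ignoring the leap-second offset from UTC.

```python
from datetime import date, datetime, timedelta, timezone

# Year 2038: a signed 32-bit seconds counter tops out at 2**31 - 1.
unix_epoch = datetime(1970, 1, 1, tzinfo=timezone.utc)
last_int32_second = unix_epoch + timedelta(seconds=2**31 - 1)

# GPS: a 10-bit week number wraps after 2**10 = 1024 weeks.
gps_epoch = date(1980, 1, 6)
first_gps_rollover = gps_epoch + timedelta(weeks=2**10)

print("Last second a signed 32-bit Unix time can hold:", last_int32_second.isoformat())
print("First GPS week rollover:", first_gps_rollover.isoformat())
```

Both computed dates match the ones given in the list: 19 January 2038 and 22 August 1999.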
If you want to dig further, there is a "zoo" of related problems on Wikipedia: "Time formatting and storage bugs"
Phil -- also notable was the timing skew that accumulated in the Patriot missile system... after a few hours, the timing drift was so severe that the Patriot missile system would be off by as much as half a kilometer when targeting an incoming Scud missile. Until the root cause was identified and a firmware update was delivered, the work-around was to simply reboot the missile system (wiping out the timing error and starting over). Then again, missile systems don't work too well while they're rebooting!
It's also a good example of a system that passes all tests, but still contains bugs. Either the initial conditions of the tests always began with "Power up the missile system", or it was a failure of imagination in creating test cases.
As you said, these types of bugs in long-running software must be caught during design / code review, not in testing.