Sunday, May 3, 2015

Counter Rollover Bites Boeing 787

Counter rollover is a classic mistake in computer software.  And, it has just bit the Boeing 787.

The Problem:

The Boeing 787 aircraft's electrical power control units shut down if powered without interruption for 248 days (a bit over 8 months). In the likely case that all the control units were turned on at about the same time, that means they all shut down at the same time -- potentially in the middle of a flight. Fortunately, the power is usually not left on for 8 continuous months, so apparently this has not actually happened in flight.  But the problem was seen in a long-duration simulation and could happen in a real aircraft. (There are backup power supplies, but do you really want to be relying on them over the middle of an ocean?  I thought not.) The fix is turning off the power and turning it back on every 120 days.

That's right -- the FAA is telling the airlines they have to do a maintenance reboot of their planes every 120 days.

(Sources: NY Times ; FAA)


Analysis:

Just for fun, let's do the math and figure out what's going on.
248 days * 24 hours/day * 60 minute/hour * 60 seconds/minute = 21,427,200
Hmmm ... what if those systems keep time as an 32-bit signed integer in hundredths of a second? The maximum positive value for such a counter would give:
0x7FFFFFFF = 2147483647 / (24*60*60) = 24855 / 100 = 248.55 days.
Bingo!

If they had used a 32-bit unsigned it would still overflow after twice as long = 497.1 days.


Other Examples:

This is not the first time a counter rollover has caused a problem.  Some examples are:

  • IBM: Interface adapters hang after 497 days of uptime [IBM]
  • Windows 95: hang after 49.7 days without reboot, counting in milliseconds [Microsoft]  
There are also plenty of date roll-over bugs:
  • Y2K: on 1 January 2000 (overflow of 2-digit year from 99 to 00)   [Wikipedia]
  • GPS: 1024 week rollover on 22 August 1999 [USCG]
  • Year 2038: Unix time will roll over on 19 January 2038 [Wikipedia]

There are also somewhat related capacity overflow issues such as 512K day for IPv4 routers.

If you want to dig further, there is a "zoo" of related problems on Wikipedia:  "Time formatting and storage bugs"


Friday, May 1, 2015

How To Report An Unintended Acceleration Problem

Every once in a while I get e-mail from someone concerned about unintended acceleration that has happened to them or someone they know.  Commonly they go to the car dealer and get told (directly or indirectly) that it must have been the driver's fault. I'm sure that must be a frustrating experience.

Fortunately, you can do more than just get blown off by the dealer (if you feel like that is what happened to you).  Visit the US Dept. of Transportation's complaint system and file a complaint with their Office of Defects Investigation (ODI):

What this does is put information into the database that DoT uses to look for unsafe trends in vehicles. ODI conducts defect investigations and administers safety recalls, and this database is a primary source of information for them.  Putting in an entry does not mean that anyone will necessarily get back to you about your particular complaint, but eventually if enough drivers have similar problems with a particular vehicle type, ODI is supposed to investigate. You should be sure to use several different words and phrases to thoroughly describe your situation since often this database is searched via key words. (That means that if they are looking for a particular trend they might only look at records that contain a specific word or phrase, not all records for that vehicle.)  You should include specifics, and in particular things that you can recall that would suggest it is not simply driver error. But, realize that the description you type in will be publicly available, so think about what you write. 

To be sure, this should not be the only thing you do.  If you believe you have a problem with your vehicle should talk to the dealer and perhaps escalate things from there. (If it happened to me I would at a minimum demand a written problem report to the manufacturer central defects office be created and demand a written response from the manufacturer customer relations office to leave a record.)  But, if you skip the DoT database then one of the important feedback mechanisms independent of the car companies that triggers recalls won't have the data it needs to work. If you had an incident that did not result in a police report or insurance claim reporting is especially important, since there is no other way for DoT or the manufacturer to even know it happened.

Even if you haven't suffered unintended acceleration, you might be interested to look at complaints others have filed for your vehicle type, which are publicly available. And of course you can report any defect you like, not just acceleration issues. 

I recently came across the web site:  http://www.safercar.gov/
This has general car safety information and also has a way to file a vehicle safety complaint.  It is unclear to me what the different is between this and the IVOQ database above.  One would hope the data ends up in the same place, but I don't have information either way on that.

 (For those who are interested in how the keyword search might be done, you can see a NHTSA Document for an example from the Toyota UA investigations.)

Monday, April 13, 2015

FAA CRC and Checksum Report


I've been working for several years on an FAA document covering good practices for CRC and Checksum use.  At long last this joint effort with Honeywell researchers has been issued by the FAA as an official report.  Those of you who have seen my previous tutorial slides will recognize much of the material. But this is the official FAA-released version.

Selection of Cyclic Redundancy Code and Checksum Algorithms to Ensure Critical Data Integrity
DOT/FAA/TC-14/49
Philip Koopman, Kevin Driscoll, Brendan Hall

http://users.ece.cmu.edu/~koopman/pubs/faa15_tc-14-49.pdf


Abstract:

This report explores the characteristics of checksums and cyclic redundancy codes (CRCs) in an aviation context. It includes a literature review, a discussion of error detection performance metrics, a comparison of various checksum and CRC approaches, and a proposed methodology for mapping CRC and checksum design parameters to aviation integrity requirements. Specific examples studied are Institute of Electrical and Electronics Engineers (IEEE) 802.3 CRC-32; Aeronautical Radio, Incorporated (ARINC)-629 error detection; ARINC-825 Controller Area Network (CAN) error detection; Fletcher checksum; and the Aeronautical Telecommunication Network (ATN)-32 checksum. Also considered are multiple error codes used together, specific effects relevant to communication networks, memory storage, and transferring data from nonvolatile to volatile memory.

Key findings include:
  • (1) significant differences exist in effectiveness between error-code approaches, with CRCs being generally superior to checksums in a wide variety of contexts; 
  • (2) common practices and published standards may provide suboptimal (or sometimes even incorrect) information, requiring diligence in selecting practices to adopt in new standards and new systems; 
  • (3) error detection effectiveness depends on many factors, with the Hamming distance of the error code being of primary importance in many practical situations; 
  • (4) no one-size-fits-all error-coding approach exists, although this report does propose a procedure that can be followed to make a methodical decision as to which coding approach to adopt; and 
  • (5) a number of secondary considerations must be taken into account that can substantially influence the achieved error-detection effectiveness of a particular error-coding approach.
You can see my other CRC and checksum posts via the CRC/Checksum label on this blog.

Official FAA site for the report is here: http://www.tc.faa.gov/its/worldpac/techrpt/tc14-49.pdf



Monday, February 2, 2015

Tester To Developer Ratio Should Be 1:1 For A Typical Embedded Project

Believe it or not, most high quality embedded software I've seen has been created by organizations that spend twice as much on validation as they do on software creation.  That's what it takes to get good embedded software. And it is quite common for organizations that don't have this ratio to be experiencing significant problems with their projects. But to get there, the head count ratio is often about 1:1, since developers should be spending a significant fraction on their time doing "testing" (in broad terms).

First, let's talk about head count. Good embedded software organizations tend to have about equal number of testers and developers (i.e., 50% tester head count, 50% developer head count). Also, typically, the testers are in a relatively independent part of the organization so as to reduce pressure to sign off on software that isn't really ready to ship. Ratios can vary significantly depending on the circumstance.  5:1 tester to developer ratio is something I'd expect in safety-critical flight controls for an aerospace application.  1:5 tester to developer ratio might be something I'd expect on ephemeral web application development. But for a typical embedded software project that is expected to be solid code with well defined functionality, I typically expect to see 1:1 tester to developer staffing ratios.

Beyond the staffing ratio is how the team spends their time, which is not 1:1.  Typically I expect to see the developers each spend about two thirds of their time on development, and one-third of their time on verification/validation (V&V). I tend to think of V&V as within the "testing" bin in terms of effort in that it is about checking the work product rather than creating it. Most of the V&V effort from developers is spent doing peer reviews and unit testing.

For the testing group, effort is split between software testing, system testing, and process quality assurance (SQA).  Confusingly, testing is often called "quality assurance," but by SQA what I mean is time spent creating, training the team on, and auditing software processes themselves (i.e., did you actually follow the process you were supposed to -- not actual testing). A common rule of thumb is that about 5%-6% of total project effort should be SQA, split about evenly between process creation/training and process auditing. So that means in a team of 20 total developers+testers, you'd expect to see one full time equivalent SQA person. And for a team size of 10, you'd expect to see one person half-time on SQA.

Taking all this into account, below is a rough shot at how effort across a project should break down in terms of head counts and staff.

Head count for a 20 person project:
  • 10 developers
  • 9 testers
  • 1 SQA (both training and auditing)
Note that this does not call out management, nor does it deal with after-release problem resolution, support, and so on. And it would be no surprise if more testers were loaded at the end of the project than the beginning, so these are average staffing numbers across the length of the project from requirements through product release.

Typical effort distribution (by hours of effort over the course of the project):
  • 33%: Developer team: Development (left-hand side of "V" including requirements through implementation)
  • 17%: Developer team: V&V (peer reviews, unit test)
  • 44%: V&V team: integration testing and system testing
  • 3%: V&V team: Process auditing
  • 3%: V&V team: Process training, intervention, improvement
If you're an agile team you might slice up responsibilities differently and that may be OK. But in the end the message is that you're probably only spending about a third of your time actually doing development compared to verification and validation if you want to create very high quality software and ensure you have a rigorous process while doing so.  Agile teams I've seen that violated this rule of thumb often did so at the cost of not controlling their process rigor and software quality. (I didn't say the quality was necessarily bad, but rather what I'm saying is they are flying blind and don't know what their software and process quality is until after they ship and have problems -- or don't.  I've seen it turn out both ways, so you're rolling the dice if you sacrifice V&V and especially SQA to achieve higher lines of code per hour.)

There is of course a lot more to running a project than the above. For example, there might need to be a cross-functional build team to ensure that product builds have the right configuration, are cleanly built, and so on. And running a regression test infrastructure could be assigned to either testers or developers (or both) depending on the type of testing that was included. But the above should set rough expectations.  If you differ by a few percent, probably no big deal. But if you have 1 tester for every 5 developers and your effort ratio is also 1:5, and you are expecting to ship a rock-solid industrial-control or similar embedded system, then think again.

(By way of comparison, Microsoft says they have a 1:1 tester to developer head count. See: How We Test Software at Microsoft, Page, Johston & Rollison.)









Monday, November 17, 2014

Not Getting Software Wrong


The majority of time creating software is typically spent making sure that you got the software right (or if it's not, it should be!).  But, sometimes, this focus on making software right gets in the way of actually ensuring the software works.  Instead, in many cases it's more important to make sure that software is not wrong.  This may seem like just a bit of word play, but I've come to believe that it reflects a fundamental difference in how one views software quality.

Consider the following code concept for stopping four wheel motors on a robot:

// Turn off all wheel motors
  for (uint_t motor = 1; motor < maxWheels; motor++)  {
    SetSpeed(motor, OFF);
  }

Now consider this code:
// Turn off all wheel motors
  SetSpeed(leftFront,  OFF);
  SetSpeed(rightFront, OFF);
  SetSpeed(leftRear,   OFF);
  SetSpeed(rightRear,  OFF);

(Feel free to make obvious assumptions, such as leftFront being an const uint_t, and modify to your favorite naming conventions and style guidelines. None of that is the point here.)

Which code is better?   Well, from our intro to programming course probably we want to write the first type of code. It is more flexible, "elegant," and uses the oh-so-cool concept of iteration.  On the other hand, the second type of code is likely to get our fingers smacked with a ruler in a freshman programming class.  Where's the iteration?  Where is the flexibility? What if there are more than four items that need to be turned off? Where is the dazzling display of (not-so-advanced) computer science concepts?

But hold on a moment. It's a four wheeled robot we're building.  Are you really going to change to a six wheeled robot?  (Well, maybe, and I've even worked a bit with such robots.  But sticking on two extra wheels isn't all that common after the frame has been welded!)  So what are you really buying by optimizing for a change that is unlikely to happen?

Let's look at it a different way. Many times I'm not interested in elegance, clever use of computer science concepts, or even number of lines of code. What I am interested in is that the code is actually correct. Have you ever written the loop type of code and gotten it wrong?  Be honest!  There is a reason that "off by one error" has its own wikipedia entry. How long did it take you to look at the loop and make sure it was right? You had to assume or go look up the value of maxWheels, right?  If you are writing unit tests, you need to somehow figure out a way to test that all the motors got turned off -- and account for the fact that maxWheels might not even be set to the correct value.

But, with the second set of code, it's pretty easy to see that all four wheels are getting turned off.  There's no loop to get wrong.  (You do have to make sure that the four wheel consts are set properly.)  There is no loop to test.  Cyclomatic complexity is lower because there is no looping branch.  You don't have to mentally execute the loop to make sure there is no off-by-one error. You don't have to ask whether the loop can (or should) execute zero times. Instead, with the second set of codes you can just look at it and count up that the wheels are being turned off.

Now of course this is a pretty trivial example, and no doubt one that at least someone will take exception to.  But the point I'm making is the following.  The first code segment is what we were all taught to do, and arguably easier to write. But the second code segment is arguably easier to check for correctness.  In peer reviews someone might miss a bug in the first code. I'd say that missing a bug is less likely in the second code segment.

Am I saying that you should avoid loops if you have 10,000 things to initialize?  No, of course not. Am I saying you should cut and paste huge chunks of code to avoid a loop?  Nope. And maybe you actually do want to use the loop approach in your situation even with 3 or 4 things to initialize for a good reason. And if you get lots of lines of code that is likely to be more error-prone than a well-considered loop. All that I'm saying is that when you write code, consider carefully the tradeoff between writing concise, but clever, code and writing code that is hard to get wrong. Importantly, this includes the risk of a varying style causing an increased risk of getting things wrong. How that comes out will depend on your situation.  What I am saying is think about what is likely to change, what is unlikely to change, and what the path is to least risk of having bugs.

By the way, did you see the bug?  If you did, good for you.  If not, well, then I've already proved my point, haven't I?  The wheel number should start at 0 or the limit should be maxWheels+1.  Or the "<" should be "<=" -- depending on whether wheel numbers are base 0 or base 1.  As it is, assuming maxWheels is 4 as you'd expect, only 3 out of 4 wheels actually get turned off.  By the way, this wasn't an intentional trick on the reader. As I was writing this article, after a pretty long day, that's how I wrote the code. (It doesn't help that I've spent a lot of time writing code in languages with base-1 arrays instead of base-0 arrays.) I didn't even realize I'd made that mistake until I went back to check it when writing this paragraph.  Yes, really.  And don't tell me you've never made that mistake!  We all have. Or you've never tried to write code at the end of a long day.  And thus are bugs born.  Most are caught, as this one would have been, but inevitably some aren't.  Isn't it better to write code without bugs to begin with instead of find (most) bugs after you write the code?

Instead of trying to write code and spending all our effort to prove that whatever code we have written is correct, I would argue we should spend some of our effort on writing code that can't be wrong.  (More precisely, code that is difficult to get wrong in the first place, and easy to detect is wrong.)  There are many tools at our disposal to do this beyond the simple example shown.  Writing code in a style that makes very rigorous static analysis tools happy is one way. Many other style practices are designed to help with this as well, chief among them being various type checking strategies, whether automated or implemented via naming conventions. Avoiding high cyclomatic complexity is another way that comes to mind. So is creating a statechart before actually writing the code, then using a switch statement that traces directly to the statechart. Avoiding premature optimization is also important in avoiding bugs.  The list goes on.

So the next time you are looking at a piece of code, don't ask yourself "do I think this is right?"  Instead, ask yourself "how easy is it to be sure that it is not wrong?"  If you have to think hard about it -- then maybe it really is incorrect code even if you can't see a bug.  Ask whether it could be rewritten in a way that is more obviously not wrong.  Then do it.

Certainly I'm not the first to have this idea. But I see small examples of this kind of thing so often that it's worth telling this story once in a while to make sure others have paused to consider it.  Which brings us to a quote that I've come to appreciate more and more over time.  Print it out and stick it on the water cooler at work:

"There are two ways of constructing a software design: One way is to make it so simple that there are obviously no deficiencies and the other way is to make it so complicated that there are no obvious deficiencies."

        — C.A.R. Hoare, The 1980 ACM Turing Award Lecture  (CACM 24(2), Feb 1981, p. 81)

(I will admit that this quote is a bit clever and therefore not a sterling example of making a statement easy to check for correctness.  But then again he is the one who got the Turing Award, so we'll allow some slack for clever wording in his acceptance essay.)