Sunday, June 5, 2011

Nobody Writes Perfect Software -- get over it!

Have you ever written perfect software?  Really?  (And if you did, how exactly do you know that?)  If you're imperfect like the rest of us, how do you take that into account in your software architecture?

As far as I can tell nobody knows how to write perfect software. Nearly perfect is as good as it gets, and that comes at exponentially increasing costs as you approach perfection. While imperfect may be good enough for many cases, the bigger issue is that we all seem to act as if our software is perfect, even when it's not.

I first ran into this issue when I was doing software robustness testing (the Ballista project -- many moons ago). A short version of some of the conversations we'd have went like this.  Me: "If someone passes a null pointer into your routine, things will crash." Most people responded: "well they shouldn't do that" or "passing a null pointer is a bug, and should be fixed."  Or "nobody makes that kind of stupid mistake" (really??).  We even found one-liner programs that provoked kernel panics in commercial desktop operating systems.  Some folks just didn't care. But the folks who concentrated on highly available systems said: "thanks -- we're going to fix everything you find." Because they know that problems happen all the time, and the only way to improve dependability in the presence of buggy software is to make things resilient to bugs.

Now ask yourself about your embedded system. You know there are software bugs in there somewhere. Do you pin all your hopes on debugging finding every last bug?  (Good luck with that.) Or do you plan for the reality that software is imperfect and act accordingly to increase your product's resilience?

Here are some of the techniques that can help.
  • Watchdog timer in case your system wedges (is it turned on? is it kicked properly?)
  • Input parameter sanity checks (check for null pointers, values out of range, other problems)
  • Defaults on switch statements that invoke an error handler (what if you forgot a case and it maps into whatever case you picked as the default?)
  • Run-time assertions (the value of "i" should be positive -- oops it's negative right now)
  • Error return codes (what happens if the subroutine call didn't work?)
  • Robustness testing (some folks call this fuzz testing, although that is just one approach -- toss bad values at your software and see if things fall apart)
  • Error logging (so you can track down problems in units returned for service)
You may have heard of many of these, and you have probably heard the umbrella term "defensive coding" somewhere. Do you have any favorite techniques I've missed?
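To make a few of the items in the list above concrete, here is a minimal C sketch that combines input parameter sanity checks, a switch default that invokes an error handler, a run-time assertion, error return codes, and error logging. Everything in it is hypothetical and made up for illustration: the motor command example, the names status_t, command_t, log_error and set_motor, and logging to stderr (a real embedded system would more likely write to nonvolatile memory). Treat it as one possible shape under those assumptions, not a recommended implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical error codes returned to the caller. */
typedef enum {
    STATUS_OK = 0,
    STATUS_NULL_POINTER,
    STATUS_OUT_OF_RANGE,
    STATUS_BAD_COMMAND
} status_t;

/* Hypothetical motor commands. */
typedef enum { CMD_STOP, CMD_FORWARD, CMD_REVERSE } command_t;

/* Count of logged errors, so problems can be tracked down later. */
unsigned error_count = 0;

/* Error logging: stderr stands in for whatever log storage
 * the real product has (an assumption of this sketch). */
static void log_error(const char *where, status_t code)
{
    error_count++;
    fprintf(stderr, "error %d in %s\n", (int)code, where);
}

/* Converts a command and a speed (0..100 percent) into a signed
 * duty cycle written to *duty_out. Returns an error code instead
 * of crashing when handed bad inputs. */
status_t set_motor(command_t cmd, int speed_percent, int *duty_out)
{
    /* Input parameter sanity checks. */
    if (duty_out == NULL) {
        log_error("set_motor", STATUS_NULL_POINTER);
        return STATUS_NULL_POINTER;
    }
    if (speed_percent < 0 || speed_percent > 100) {
        log_error("set_motor", STATUS_OUT_OF_RANGE);
        return STATUS_OUT_OF_RANGE;
    }

    switch (cmd) {
    case CMD_STOP:
        *duty_out = 0;
        break;
    case CMD_FORWARD:
        *duty_out = speed_percent;
        break;
    case CMD_REVERSE:
        *duty_out = -speed_percent;
        break;
    default:
        /* The default is an error path, not a disguised normal case. */
        log_error("set_motor", STATUS_BAD_COMMAND);
        return STATUS_BAD_COMMAND;
    }

    /* Run-time assertion: the result must be within the legal range. */
    assert(*duty_out >= -100 && *duty_out <= 100);
    return STATUS_OK;
}
```

The other half of the bargain is on the caller's side: every call to set_motor needs to check the returned status and do something sensible when it is not STATUS_OK, otherwise the error return code buys nothing.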

But the real question is do you actually use them?  Or is your software so perfect you needn't bother?