As far as I can tell nobody knows how to write perfect software. Nearly perfect is as good as it gets, and that comes at exponentially increasing costs as you approach perfection. While imperfect may be good enough for many cases, the bigger issue is that we all seem to act as if our software is perfect, even when it's not.
I first ran into this issue when I was doing software robustness testing (the Ballista project -- many moons ago). A short version of a typical conversation went like this. Me: "If someone passes a null pointer into your routine, things will crash." Most people responded: "Well, they shouldn't do that," or "Passing a null pointer is a bug, and should be fixed," or "Nobody makes that kind of stupid mistake" (really??). We even found one-liner programs that provoked kernel panics in commercial desktop operating systems. Some folks just didn't care. But the folks who concentrated on highly available systems said: "Thanks -- we're going to fix everything you find." Because they know that problems happen all the time, and the only way to improve dependability in the presence of buggy software is to make things resilient to bugs.
Now ask yourself about your embedded system. You know there are software bugs in there somewhere. Do you pin all your hopes on debugging finding every last bug? (Good luck with that.) Or do you plan for the reality that software is imperfect and act accordingly to increase your product's resilience?
Here are some of the techniques that can help.
- Watchdog timer in case your system wedges (is it turned on? is it kicked properly? See the first sketch after this list.)
- Input parameter sanity checks (check for null pointers, values out of range, other problems; the second sketch after this list combines these checks with the next few items)
- Defaults on switch statements that invoke an error handler (what if you forgot a case and it maps into whatever case you picked as the default?)
- Run-time assertions (the value of "i" should be positive -- oops, it's negative right now)
- Error return codes (what happens if the subroutine call didn't work?)
- Robustness testing (some folks call this fuzz testing, although that is just one approach -- toss bad values at your software and see if things fall apart)
- Error logging (so you can track down problems in units returned for service)
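To make the watchdog item a bit more concrete, here is a minimal sketch of one common kick pattern, in C. The wdt_kick() call, the task count, and the check-in scheme are all invented for illustration; your MCU's watchdog API and your task structure will differ. The idea is to kick the watchdog only when every task has recently checked in, so that one wedged task eventually forces a reset instead of being silently carried along by the rest of the system.

```c
/* Minimal sketch of a multi-task watchdog kick pattern.  The watchdog is
 * assumed to have been configured and enabled at startup. */
#include <stdbool.h>
#include <stdint.h>

#define NUM_TASKS 3u

static volatile bool task_alive[NUM_TASKS];   /* set by each task each healthy cycle */

/* Hypothetical BSP call: on real hardware this restarts the watchdog
 * countdown.  Stubbed here so the sketch stands alone. */
static void wdt_kick(void) { /* write to the watchdog's kick register here */ }

/* Each task calls this at the end of a healthy iteration of its main loop. */
void task_checkin(uint32_t task_id)
{
    if (task_id < NUM_TASKS) {
        task_alive[task_id] = true;
    }
}

/* Called periodically (e.g., from the idle loop).  The watchdog only gets
 * kicked if *every* task has checked in, so one wedged task leads to a reset. */
void watchdog_service(void)
{
    for (uint32_t i = 0u; i < NUM_TASKS; i++) {
        if (!task_alive[i]) {
            return;   /* someone is missing -- withhold the kick and let the reset happen */
        }
    }
    for (uint32_t i = 0u; i < NUM_TASKS; i++) {
        task_alive[i] = false;   /* require fresh check-ins before the next kick */
    }
    wdt_kick();
}
```

Kicking the watchdog unconditionally from a timer interrupt defeats the purpose -- the kick should depend on evidence that the real work is still getting done.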
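And here is a sketch that rolls several of the other items together: input parameter sanity checks, a defensive switch default, a run-time assertion, and an error return code. The function, the temperature limits, and fault_log() are made up for illustration; the point is the shape of the checks, not the particular API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

typedef enum { MODE_IDLE, MODE_RUN, MODE_CALIBRATE } op_mode_t;

#define TEMP_MIN_C (-40)
#define TEMP_MAX_C (125)

/* Stubs for illustration; a real system would log to nonvolatile storage
 * and take a mode-appropriate safety action. */
static void fault_log(const char *what)     { (void)printf("FAULT: %s\n", what); }
static void error_handler(const char *what) { fault_log(what); /* then shut down, reset, ... */ }

/* Error return code: 0 on success, -1 if the inputs were rejected. */
int set_heater_duty(const int *temp_c, op_mode_t mode, int *duty_out)
{
    /* Input parameter sanity checks: null pointers and out-of-range values. */
    if ((temp_c == NULL) || (duty_out == NULL)) {
        fault_log("set_heater_duty: null pointer");
        return -1;
    }
    if ((*temp_c < TEMP_MIN_C) || (*temp_c > TEMP_MAX_C)) {
        fault_log("set_heater_duty: temperature out of range");
        return -1;
    }

    int duty;
    switch (mode) {
        case MODE_IDLE:      duty = 0;  break;
        case MODE_RUN:       duty = 50; break;
        case MODE_CALIBRATE: duty = 10; break;
        default:
            /* Defensive default: a forgotten case or a corrupted enum lands here
               instead of silently mapping onto whichever case you happened to pick. */
            error_handler("set_heater_duty: unknown mode");
            return -1;
    }

    /* Run-time assertion: catch "impossible" internal states during development. */
    assert((duty >= 0) && (duty <= 100));

    *duty_out = duty;
    return 0;
}
```

The caller has to actually look at that return code, of course -- an ignored error return is the moral equivalent of no check at all. And a quick robustness test is then just a loop that throws null pointers and out-of-range temperatures at this routine and confirms it returns -1 instead of crashing.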
But the real question is do you actually use them? Or is your software so perfect you needn't bother?
It's all very well to say that you should check for null pointers, etc.
But that just raises another question: what action should you take if you do detect a null pointer (or whatever)...?!
Excellent question, Andrew! That all depends upon what you are trying to accomplish. Keep in mind that a null pointer dereference only causes a core dump on some systems, and that other faulty values may not trigger a hardware exception at all.
The big issue is that if you don't check for invalid values you really don't know what might happen. It might be a crash -- or it might not. We saw one OS where the system would gradually grow stupid until it just sat there saying it was alive -- but with nothing happening.
Some of the approaches I've seen are:
- Log the error and do a safety shutdown. (Note the "log the error" part -- don't just let it crash.)
- Log the error and do a system reset (or intentionally hang to invoke the watchdog timer). Again, this is forcing clean error handling rather than undefined results.
- Restart just the offending task in a multitasking system and hope that clears things out.
- If the higher level software understands things well enough to do so, do a retry. This can be a lot of work, so usually only happens in selected high-value situations.
- Convert the bad value to a reasonable value. One widely used OS we saw made null pointer dereferences return zero to avoid crashes (changing that behavior apparently broke _lots_ of code!). Sometimes you can saturate bad sensor values to the min/max acceptable value, as in the sketch after this list. If you mask a fault this way, think about logging it so that silent fault masking doesn't create debugging problems later.
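As a concrete example of that last approach, here is a sketch (again with invented limits and an invented logging function) that saturates an out-of-range sensor reading to the nearest acceptable value while still recording that it happened, so the masking isn't silent.

```c
#include <stdint.h>
#include <stdio.h>

#define PRESSURE_MIN_KPA   0
#define PRESSURE_MAX_KPA 500

static uint32_t clamp_fault_count = 0u;

/* Illustrative fault logger: counts and reports each clamping event. */
static void fault_log(const char *what, int value)
{
    clamp_fault_count++;
    (void)printf("FAULT #%lu: %s (value=%d)\n",
                 (unsigned long)clamp_fault_count, what, value);
}

/* Returns a value guaranteed to be in range; logs whenever clamping was needed. */
int sanitize_pressure_kpa(int raw)
{
    if (raw < PRESSURE_MIN_KPA) {
        fault_log("pressure below range, clamped", raw);
        return PRESSURE_MIN_KPA;
    }
    if (raw > PRESSURE_MAX_KPA) {
        fault_log("pressure above range, clamped", raw);
        return PRESSURE_MAX_KPA;
    }
    return raw;
}
```

In a real unit the fault record would go to nonvolatile memory so it can be read back when the unit comes in for service, which ties this back to the error logging point above.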
I realize it may be impractical to handle every possible error situation cleanly. But it is often reasonable to log errors and do some sort of coarse-grained system recovery so that the system tries to keep working and developers can find out what's going wrong.
Most importantly, keep in mind that bugs are inevitable, and you should have a plan for what to do at run-time when you hit one. That includes run-time system response as well as data collection to feed back fixes to the field (if that is possible).
While I agree wholeheartedly with most of what you say, I take issue with two of your assertions:
(1) "As far as I can tell nobody knows how to write perfect software."
(2) "bugs are inevitable"
It is quite possible to know how to write perfect software (if, by perfect, we mean free of bugs) and, indeed, we should strive to do so. The real problem is that we make mistakes, and we make more mistakes if we are overly pressured. Time is indeed money but a little more time up front can save a lot of time and anguish later on. This is not an argument against taking the measures you describe - on the contrary, they are very important. Rather, it is an argument for taking a little more care when designing and coding, so that there are fewer bugs in there in the first place.
The statement that bugs are inevitable is simply untrue; they are merely quite likely in all but the simplest projects. However, this statement has been made so often that many people seem to believe it, as if bugs were inserted at random by some unseen entity. The software field is the only one I can think of in which perfection is a viable goal. While, as engineers, we should remain pessimistic about achieving it (and design accordingly), to aim for anything less demeans us professionally.