Better Embedded System SW: Robustness Testing

I have been doing research in the area of robustness testing for many years, and once in a while I have to explain how that approach to testing fits into the bigger umbrella of fault injection and related ideas. Here's a summary of typical approaches (there are many variations and extensions beyond these as you might imagine). At the end is a description of the robustness testing work my group has been doing over many years.

Mutation Testing:
Goal: Evaluate coverage/effectiveness of an existing test suite. (Also known as "bebugging.")
Approach: Modify System under Test (SuT) with a hypothetical bug and see if an existing test suite finds it.
Narrative: I have a test suite. I wonder how thorough it is? Let me put a bug (mutation) into my code and see if my test suite finds it. If it finds all the mutations I insert, maybe my test suite is thorough.
Fault Model: Source code bug that is undetected by testing
Strengths: Can find problems even if code was already 100% branch-covered in test suite (e.g., mutate a comparison to > instead of >= to see if test suite exercises the equality case for that comparison)
Limitations: Requires an existing test suite (but, can combine with automated test generation to create additional tests automatically). Effectiveness heavily depends on inserting realistic mutations, which is not necessarily so easy.

Classical Fault Injection Testing:
Goal: Determine robustness of SuT when its code or data is corrupted.
Approach: Corrupt the binary image of the SuT code or corrupt data during run-time to see if system crashes, is unsafe, or tolerates the fault.
Narrative: I have a running system. I wonder what happens if I flip a bit in the data -- does the system crash? I wonder what happens if I corrupt the compiled code -- does the system act in an unsafe way?
Fault Model: Hardware bit flip or software-based memory corruption.
Strengths: Can find realistic failures caused by single-event upsets, hardware faults, and software defects that corrupt computational state.
Limitations: The fault model is memory and program bit-level corruption. Fault injection testing is useful for high-integrity systems deployed in large volume (they have to survive even very infrequent faults), and essential for aviation and space systems that will see high rates of single event upsets. But it is arguably a bit excessive for non-safety-critical applications.

Robustness Testing:
Goal: Determine robustness of SuT when it is fed exceptional or unusual values.
Approach: Corrupt the inputs to the SuT during run-time to see if system crashes, is unsafe, or tolerates the fault.
Narrative: I have a running system or subsystem. I wonder what happens if the inputs from other components or sensors have garbage, unusual, or random values -- does the system crash or act in an unsafe way?
Fault Model: Some other module than the SuT has a bug that results in exceptional data being sent to the SuT.
Strengths: Can find realistic failures caused by likely run-time faults such as null pointers, NaNs (Not-a-Number floating point values), corrupted input data, and in general faults in modules that are not the SuT, but rather other software or sensors present in the system that might have bugs that generate exceptional data.
Limitations: The fault model is generally that some other piece of software has a bug and that bug will generate bad data that kills the SuT. You have decide how likely that is and whether it's OK in such a case for the SuT to misbehave. We have found many situations in which such test results are important, even in systems that are not safety critical.

Fuzzing:
A classical form of robustness testing is "fuzzing," in which random inputs are tossed into a system to see what happens rather than carefully selected specific input values. My research group's work centers on finding efficient ways to do robustness testing so that fewer tests are needed to find system-killer values.

Ballista:
The Ballista project pioneered efficient robustness testing in the late 1990s, and is still active today on stress testing robots and autonomous vehicles.

Two key ideas of Ballista are:

Have a dictionary of interesting exceptional values so you don't have to stumble onto them by chance (e.g., just try a NULL pointer straight out rather than wait for a random number generator to happen to generate a zero value as a fuzzing input value)
Make it easy to generate tests by basing that dictionary on the data types taken by a function call instead of the function being performed. So we don't care if it is a memory management function or a file write being tested - we just say for example that if it's a memory pointer, let's try NULL as an input value to the function. This gets us excellent scalability and portability across systems we test.

A key benefit of Ballista and other robustness testing approaches is that they look for holes in the code itself, rather than holes in the test cases. Consider that most test coverage approaches (including mutation testing) are interested in testing all the code that is there (which is a good thing!). In contrast, robustness testing goes beyond code coverage to find the places where you should have had code to handle exceptional situations, but that code is missing. In other words, robustness testing often finds bugs due to missing code that should have been there. We find that it is pretty typical for software to be non-robust unless this type of testing has been done it identify such problems.

You can find more about our research at the Stress Tests for Autonomy Architectures (STAA) project page, which includes video of what goes wrong when you stress test a couple robotic systems:

https://www.rec.ri.cmu.edu/projects/stress_testing/

Our older work on Ballista can be found here:
http://www.ece.cmu.edu/~koopman/ballista/