
Monday, May 8, 2017

Optimize for V&V, not for writing code



Writing code should be made more difficult so that Verification & Validation can be made easier.

I first heard this notion years ago at a workshop in which several folks from industry who build high assurance software (think flight controls) stood up and said that V&V is what matters. You might expect that from flight control folks, but their reasoning applies to pretty much every embedded project. That's because it is a matter of economics. 

Multiple speakers at that workshop said that aviation software can require 4 or 5 hours of V&V for every 1 hour of creating software. It makes no economic sense to make life easy for the 1 hour side of the ratio at the expense of making life painful for the 5 hour side of the ratio.

Good, but non-life-critical, embedded software requires about 2 hours of V&V for every 1 hour of code creation. So the economic argument still holds, with a still-compelling multiplier of 2:1.  I don't care whether you're using a Vee, Agile, hybrid, or other process model. You're spending time on V&V, including at least some activities such as peer review, unit testing, creating automated tests, performing testing, chasing down bugs, and so on. For embedded products that aren't flaky, you probably spend more time on V&V than you do on creating the code. If you're doing TDD, that approach already builds in a testing viewpoint by starting from tests and working outward. But that's not the only way to benefit from this observation.

The good news is that making code writing "difficult" does not involve gratuitous pain. Rather, it involves being smart and a bit disciplined so that the code you produce is easier for others to perform V&V on. A bit of up front thought and organization can save a lot on downstream effort. Some examples include:
  • Writing concise but helpful code comments so that reviewers can understand what you meant.
  • Writing code to be obvious rather than clever, again to help reviewers.
  • Following a style guide to make your code consistent, and thus easier to understand.
  • Writing code that compiles cleanly under static analysis, avoiding time wasted finding defects in test that a tool could have found, and sparing reviewers from having to puzzle out which warnings matter and which don't.
  • Spending some time to make your unit interfaces easier to test, even if it requires a bit more work designing and coding the unit.
  • Spending time making it easy to trace between your design and the code. For example, if you have a statechart, make sure the statechart uses names that map directly to enum names rather than using arbitrary state variables such as "magic number" integers between 1 and 7. This makes it easier to ensure that the code and design match. (For that matter, just using statecharts to provide a guide to what the code does also helps. A small sketch of this appears after this list.)
  • Spending time up front documenting module interaction so that integration testers don't have to puzzle out how things are supposed to work together. Sequence diagrams can help a lot.
  • Making the requirements both testable and easy to trace. Make every requirement idea a stand-alone sentence or paragraph and give it a number so it's easy to trace to a specific test primarily designed to test that particular requirement. Avoid having requirements in huge paragraphs of free-form text that mix lots of different concepts together.
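To illustrate the design-to-code traceability point above, here is a minimal sketch. The states and transitions are invented for illustration; the point is simply that the enum labels match the statechart labels, so a reviewer can check code against design by inspection.

/* Hypothetical door-controller states; enum names deliberately match the
   statechart labels so design and code can be compared directly. */
typedef enum {
    STATE_DOOR_CLOSED,
    STATE_DOOR_OPENING,
    STATE_DOOR_OPEN,
    STATE_DOOR_CLOSING,
    STATE_DOOR_FAULT
} DoorState;

static DoorState door_state = STATE_DOOR_CLOSED;

void DoorStep(int open_cmd, int at_open_limit)
{
    switch (door_state) {
    case STATE_DOOR_CLOSED:
        if (open_cmd) { door_state = STATE_DOOR_OPENING; }
        break;
    case STATE_DOOR_OPENING:
        if (at_open_limit) { door_state = STATE_DOOR_OPEN; }
        break;
    /* ... remaining transitions follow the statechart one-for-one ... */
    default:
        door_state = STATE_DOOR_FAULT;  /* unexpected state is handled, not ignored */
        break;
    }
}
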
Sure, these sound like a good idea, but many developers skip or skimp on them because they don't think they can afford the time. They don't have time to make their code clean because they're too busy writing bugs to meet a deadline. Then they, and everyone else, pay for this during the test cycle. (I'm not saying the programmers are necessarily the main culprits here, especially if they didn't get a vote on their deadline. But that doesn't change the outcome.)

I'm here to say you can't afford not to follow these basic code quality practices. That's because every hour you save by cutting corners up front probably costs you double (or more) downstream by making V&V more painful than it should be. It's always hard to invest in downstream benefits when the pressure is on, but skimping on code quality now means paying dearly for it later.

Do you have any tricks to make code easier to understand that I missed?

Monday, May 16, 2016

Robustness Testing

I have been doing research in the area of robustness testing for many years, and once in a while I have to explain how that approach to testing fits into the bigger umbrella of fault injection and related ideas. Here's a summary of typical approaches (there are many variations and extensions beyond these as you might imagine).  At the end is a description of the robustness testing work my group has been doing over many years.

Mutation Testing:
Goal: Evaluate coverage/effectiveness of an existing test suite. (Also known as "bebugging.")
Approach: Modify System under Test (SuT) with a hypothetical bug and see if an existing test suite finds it.
Narrative: I have a test suite. I wonder how thorough it is?  Let me put a bug (mutation) into my code and see if my test suite finds it. If it finds all the mutations I insert, maybe my test suite is thorough.
Fault Model: Source code bug that is undetected by testing
Strengths: Can find problems even if code was already 100% branch-covered in test suite (e.g., mutate a comparison to > instead of >= to see if test suite exercises the equality case for that comparison)
Limitations: Requires an existing test suite (but, can combine with automated test generation to create additional tests automatically). Effectiveness heavily depends on inserting realistic mutations, which is not necessarily so easy.
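
As a concrete (and hypothetical) illustration of the >= example above, consider a limit check and a mutant of it; the function and test names are invented:

#include <assert.h>

/* Original: report an over-limit condition when temperature reaches the limit. */
int OverLimit(int temp, int limit)        { return temp >= limit; }

/* Mutant a tool might generate by replacing >= with > in the original. */
int OverLimit_mutant(int temp, int limit) { return temp > limit; }

void TestOverLimit(void)
{
    assert(OverLimit(99, 100) == 0);    /* below the limit */
    assert(OverLimit(101, 100) == 1);   /* above the limit */
    /* Both the original and the mutant pass the two tests above, so the
       mutant "survives" and exposes a hole in the suite.  Adding the
       equality case below kills the mutant: */
    assert(OverLimit(100, 100) == 1);   /* exactly at the limit */
}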

Classical Fault Injection Testing:
Goal: Determine robustness of SuT when its code or data is corrupted.
Approach: Corrupt the binary image of the SuT code or corrupt data during run-time to see if system crashes, is unsafe, or tolerates the fault.
Narrative: I have a running system. I wonder what happens if I flip a bit in the data -- does the system crash?  I wonder what happens if I corrupt the compiled code -- does the system act in an unsafe way?
Fault Model: Hardware bit flip or software-based memory corruption.
Strengths: Can find realistic failures caused by single-event upsets, hardware faults, and software defects that corrupt computational state.
Limitations: The fault model is memory and program bit-level corruption. Fault injection testing is useful for high-integrity systems deployed in large volume (they have to survive even very infrequent faults), and essential for aviation and space systems that will see high rates of single event upsets.  But it is arguably a bit excessive for non-safety-critical applications.
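
A minimal sketch of software-implemented fault injection under this fault model (names are hypothetical; a real campaign injects many faults under tool control and records the outcome of each run):

#include <stdint.h>
#include <stdlib.h>

/* Flip one randomly chosen bit in a target RAM region to emulate a
   single-event upset or memory corruption, then run the system and
   watch for a crash, an unsafe output, or successful fault tolerance. */
void InjectBitFlip(uint8_t *region, size_t region_len)
{
    size_t  byte_index = (size_t)rand() % region_len;
    uint8_t bit_mask   = (uint8_t)(1u << (rand() % 8));
    region[byte_index] ^= bit_mask;    /* the injected fault */
}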

Robustness Testing:
Goal: Determine robustness of SuT when it is fed exceptional or unusual values.
Approach: Corrupt the inputs to the SuT during run-time to see if system crashes, is unsafe, or tolerates the fault.
Narrative: I have a running system or subsystem. I wonder what happens if the inputs from other components or sensors have garbage, unusual, or random values -- does the system crash or act in an unsafe way?
Fault Model: Some other module than the SuT has a bug that results in exceptional data being sent to the SuT.
Strengths: Can find realistic failures caused by likely run-time faults such as null pointers, NaNs (Not-a-Number floating point values), corrupted input data, and in general faults in modules that are not the SuT, but rather other software or sensors present in the system that might have bugs that generate exceptional data.
Limitations: The fault model is generally that some other piece of software has a bug and that bug will generate bad data that kills the SuT.  You have to decide how likely that is and whether it's OK in such a case for the SuT to misbehave. We have found many situations in which such test results are important, even in systems that are not safety critical.

Fuzzing:
A classical form of robustness testing is "fuzzing," in which random inputs are tossed into a system to see what happens rather than carefully selected specific input values. My research group's work centers on finding efficient ways to do robustness testing so that fewer tests are needed to find system-killer values.

Ballista:
The Ballista project pioneered efficient robustness testing in the late 1990s, and that line of work continues today with stress testing of robots and autonomous vehicles.

Two key ideas of Ballista are:
  • Have a dictionary of interesting exceptional values so you don't have to stumble onto them by chance (e.g., just try a NULL pointer straight out rather than wait for a random number generator to happen to generate a zero value as a fuzzing input value)
  • Make it easy to generate tests by basing that dictionary on the data types taken by a function call instead of the function being performed.  So we don't care whether it is a memory management function or a file write being tested - we just say, for example, that if a parameter is a memory pointer, let's try NULL as an input value to the function. This gives us excellent scalability and portability across the systems we test. (A small sketch of this idea appears after this list.)
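Here is a minimal sketch of the data-type dictionary idea (this is an illustration, not the actual Ballista implementation; the function under test and the dictionaries are invented):

#include <stddef.h>
#include <math.h>

/* Exceptional test values keyed by parameter data type, not by function. */
static double *ptr_dictionary[] = { NULL, (double *)1 };
static double  dbl_dictionary[] = { 0.0, -0.0, NAN, INFINITY, -INFINITY, 1e308 };

/* Hypothetical, deliberately non-robust function under test. */
static int ScaleSamples(double *buf, size_t len, double scale)
{
    for (size_t i = 0; i < len; i++) { buf[i] *= scale; }   /* no NULL check */
    return 0;
}

void RobustnessTest(void)
{
    double samples[4] = { 1.0, 2.0, 3.0, 4.0 };

    /* Try every dictionary value in every parameter position; a crash or
       hang on any combination is a robustness failure of ScaleSamples.
       (In practice each call runs in its own child process so one crash
       is recorded rather than ending the whole test campaign.) */
    for (size_t p = 0; p < sizeof(ptr_dictionary)/sizeof(ptr_dictionary[0]); p++) {
        for (size_t d = 0; d < sizeof(dbl_dictionary)/sizeof(dbl_dictionary[0]); d++) {
            (void)ScaleSamples(ptr_dictionary[p], 4, dbl_dictionary[d]);
            (void)ScaleSamples(samples, 4, dbl_dictionary[d]);
        }
    }
}
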
A key benefit of Ballista and other robustness testing approaches is that they look for holes in the code itself, rather than holes in the test cases. Consider that most test coverage approaches (including mutation testing) are interested in testing all the code that is there (which is a good thing!). In contrast, robustness testing goes beyond code coverage to find the places where you should have had code to handle exceptional situations, but that code is missing. In other words, robustness testing often finds bugs due to missing code that should have been there. We find that it is pretty typical for software to be non-robust unless this type of testing has been done to identify such problems.

You can find more about our research at the Stress Tests for Autonomy Architectures (STAA) project page, which includes video of what goes wrong when you stress test a couple of robotic systems.

Saturday, April 16, 2016

Challenges in Autonomous Vehicle Testing and Validation

It's a lot of work to demonstrate that self-driving cars will actually work properly.  Testing alone is probably not going to be enough, and probably there will need to be some clever architectural approaches as well.  Last week I gave a talk at the SAE World Congress in Detroit about this and related challenges.

Link to paper 
Link to slides  (also see slideshare version by scrolling down).

Challenges in Autonomous Vehicle Testing and Validation 
     Philip Koopman & Michael Wagner
     Carnegie Mellon University; Edge Case Research LLC
     SAE World Congress, April 14, 2016

Abstract:
Software testing is all too often simply a bug hunt rather than a well considered exercise in ensuring quality. A more methodical approach than a simple cycle of system-level test-fail-patch-test will be required to deploy safe autonomous vehicles at scale. The ISO 26262 development V process sets up a framework that ties each type of testing to a corresponding design or requirement document, but presents challenges when adapted to deal with the sorts of novel testing problems that face autonomous vehicles. This paper identifies five major challenge areas in testing according to the V model for autonomous vehicles: driver out of the loop, complex requirements, non-deterministic algorithms, inductive learning algorithms, and fail operational systems. General solution approaches that seem promising across these different challenge areas include: phased deployment using successively relaxed operational scenarios, use of a monitor/actuator pair architecture to separate the most complex autonomy functions from simpler safety functions, and fault injection as a way to perform more efficient edge case testing. While significant challenges remain in safety-certifying the type of algorithms that provide high-level autonomy themselves, it seems within reach to instead architect the system and its accompanying design process to be able to employ existing software safety approaches.



Monday, February 2, 2015

Tester To Developer Ratio Should Be 1:1 For A Typical Embedded Project

Believe it or not, most high quality embedded software I've seen has been created by organizations that spend twice as much on validation as they do on software creation.  That's what it takes to get good embedded software. And it is quite common for organizations that don't have this ratio to be experiencing significant problems with their projects. But to get there, the head count ratio is often about 1:1, since developers should be spending a significant fraction of their time doing "testing" (in broad terms).

First, let's talk about head count. Good embedded software organizations tend to have about equal numbers of testers and developers (i.e., 50% tester head count, 50% developer head count). Also, typically, the testers are in a relatively independent part of the organization so as to reduce pressure to sign off on software that isn't really ready to ship. Ratios can vary significantly depending on the circumstance.  A 5:1 tester to developer ratio is something I'd expect in safety-critical flight controls for an aerospace application.  A 1:5 tester to developer ratio might be something I'd expect for ephemeral web application development. But for a typical embedded software project that is expected to be solid code with well defined functionality, I typically expect to see a 1:1 tester to developer staffing ratio.

Beyond the staffing ratio is how the team spends their time, which is not 1:1.  Typically I expect to see the developers each spend about two thirds of their time on development, and one-third of their time on verification/validation (V&V). I tend to think of V&V as within the "testing" bin in terms of effort in that it is about checking the work product rather than creating it. Most of the V&V effort from developers is spent doing peer reviews and unit testing.

For the testing group, effort is split between software testing, system testing, and process quality assurance (SQA).  Confusingly, testing is often called "quality assurance," but by SQA what I mean is time spent creating, training the team on, and auditing software processes themselves (i.e., did you actually follow the process you were supposed to -- not actual testing). A common rule of thumb is that about 5%-6% of total project effort should be SQA, split about evenly between process creation/training and process auditing. So that means in a team of 20 total developers+testers, you'd expect to see one full time equivalent SQA person. And for a team size of 10, you'd expect to see one person half-time on SQA.

Taking all this into account, below is a rough shot at how a project should break down in terms of head count and effort.

Head count for a 20 person project:
  • 10 developers
  • 9 testers
  • 1 SQA (both training and auditing)
Note that this does not call out management, nor does it deal with after-release problem resolution, support, and so on. And it would be no surprise if more testers were loaded at the end of the project than the beginning, so these are average staffing numbers across the length of the project from requirements through product release.

Typical effort distribution (by hours of effort over the course of the project):
  • 33%: Developer team: Development (left-hand side of "V" including requirements through implementation)
  • 17%: Developer team: V&V (peer reviews, unit test)
  • 44%: V&V team: integration testing and system testing
  • 3%: V&V team: Process auditing
  • 3%: V&V team: Process training, intervention, improvement
If you're an agile team you might slice up responsibilities differently and that may be OK. But in the end the message is that you're probably only spending about a third of your time actually doing development compared to verification and validation if you want to create very high quality software and ensure you have a rigorous process while doing so.  Agile teams I've seen that violated this rule of thumb often did so at the cost of not controlling their process rigor and software quality. (I didn't say the quality was necessarily bad, but rather what I'm saying is they are flying blind and don't know what their software and process quality is until after they ship and have problems -- or don't.  I've seen it turn out both ways, so you're rolling the dice if you sacrifice V&V and especially SQA to achieve higher lines of code per hour.)

There is of course a lot more to running a project than the above. For example, there might need to be a cross-functional build team to ensure that product builds have the right configuration, are cleanly built, and so on. And running a regression test infrastructure could be assigned to either testers or developers (or both) depending on the type of testing that was included. But the above should set rough expectations.  If you differ by a few percent, probably no big deal. But if you have 1 tester for every 5 developers and your effort ratio is also 1:5, and you are expecting to ship a rock-solid industrial-control or similar embedded system, then think again.

(By way of comparison, Microsoft says they have a 1:1 tester to developer head count. See: How We Test Software at Microsoft, Page, Johnston & Rollison.)

Monday, September 29, 2014

Go Beyond System Functional Testing To Ensure Safety

Testing alone is insufficient to ensure safety in critical systems. Other technical approaches and software development process management approaches must also be used to assure sufficient software integrity.

Consequences:
Relying upon just system functional testing to achieve safety can be expected to eventually lead to an unsafe situation in a widely released product. Even if system functional testing is completely representative of situations that will happen in practice, such testing normally won’t be long enough to see all of the infrequent events that will occur with a much larger fleet of vehicles deployed for a much longer period of time.

Accepted Practices:

  • Specifically identify and follow a process to design in safety rather than attempting to test it in after the product has already been built. The MISRA Guidelines describe an example of an automotive-specific process.
  • Include defined activities beyond hiring smart designers and performing extensive functional testing. While details might vary depending upon the project, as an example, an acceptable set of practices for critical software by the late 1990s would have included the following (assuming that MISRA Safety Integrity Level 3 were an appropriate categorization of the functions): precisely written functional specifications, use of a restricted language subset (e.g., MISRA C), a way of ensuring compilers produced correct code, configuration management, change management, automated build processes, automated configuration audits, unit testing to a defined level of coverage, stress testing, static analysis, a written safety case, deadlock analysis, justification/demonstration of test coverage, safety training of personnel, and availability of written documentation for assessment of safety (auditability of the process). (The required level of care today is, if anything, even more rigorous for such systems.)

Discussion:
There is a saying about quality: “You can’t test in quality; you have to design it in from the start.” It is well known that the same is true of safety.

Assuring safety requires more than just using capable designers and performing extensive testing (although those two factors are important). Even the best designers – like all humans – are imperfect, and even the most extensive system-level functional testing cannot hope to find everything that can go wrong across a large deployed fleet such as a fleet of automobiles. It should be apparent that everyone can make a mistake, even careful designers. But beyond that, system level functional testing (e.g., driving a car around in a variety of circumstances) cannot be expected to find all the defects in software, because there are just too many situations that can occur to experience them all in testing. This is especially true if a combination of events that causes a software failure just happens to be one that the testers didn’t think of putting into the test plan. (Test plans have bugs and gaps too.) Therefore, it has long been recognized that creating safe software requires more than just trying hard to get the design right and trying really hard to test well.

Accepted practices require a holistic approach to safety, including executing a well-defined process, having a written plan to achieve safety, using techniques to ensure safety such as fault tree analysis, and auditing the process to ensure all required steps are being performed.

An accepted way of ensuring that safety has been considered appropriately is to have a written document that argues why a system is safe (sometimes called a safety case or safety argument). The safety case should give quantitative arguments as to why safety is inherent in the system. An argument that says “we tested for X hours” would be insufficient – unless it also said “and that covered 99.999% of all anticipated operating scenarios as well as thoroughly exercising every line of code” or some other type of argument that testing was thorough. After all, running a car in circles around a track is not the same level of testing as a cross-country drive over mountains. Or one that goes to Alaska in the winter and Death Valley in the summer. Or one that does so with 1000 cars to catch situations in which things inside one of those many cars just happen to line up in just the wrong way to cause a system failure. But even with the significant level of testing done by automotive companies, the safety case must also include things such as the level of peer reviews conducted, whether fault tree analysis revealed single points of failure, and so on. In other words, it’s inadequate to say “we tried really hard” or “we are really smart” or “we spent a whole lot of time testing.” It is essential to also justify that broad coverage was achieved using a variety of relevant techniques.

Selected Sources:
Beatty, in a paper aimed at educating embedded system practitioners, explains that code inspections and testing aren’t sufficient to detect many common types of errors in complex embedded systems (Beatty 2003, pg. 36). He identifies five areas that require special attention: stack overflows, race conditions, deadlocks, timing problems, and reentrancy conditions. He states that “All of these issues are prevalent in systems that employ multitasking real-time designs.”

Lists of techniques that could be applied to ensure safety beyond just testing have been well known for many years, with a relatively comprehensive example being IEC 61508 Part 7.

Even if you could test everything (which you can’t), dealing with low-probability faults that can be expected to affect a huge deployed fleet of automobiles just takes too long. “It is impossible to gain confidence about a system reliability of 100,000 years by testing,” (written in reference specifically to drive-by-wire automobiles and their requirement for a mean-time-to-failure of 1 billion hours) (Kopetz 2004, p. 32, emphasis per original)

Butler and Finelli wrote the classical academic reference on this point, stating that attaining software needed for safety critical applications will “inevitably lead to a need for testing beyond what is practical” because the testing time must be longer than the acceptable catastrophic software failure rate. (Butler 1993, p. 3, paper entitled “The infeasibility of quantifying the reliability of life-critical real-time software.” See also Littlewood and Strigini 1993.)

Knutson gives an overview of software safety practices, and makes it clear that testing isn’t enough to create a safe system: “Even if we are wary of these dangerous assumptions, we still have to recognize the limitations inherent in testing as a means of bringing quality to a system. First of all, testing cannot prove correctness. In other words, testing can show the existence of a defect, but not the absence of faults. The only way to prove correctness via testing would be to hit all possible states, which as we’ve stated previously, is fundamentally intractable.” (Knutson 2000, pg. 34). Knutson suggests peer reviews as a technique beyond testing that will help.

NASA says that “You can’t test everything. Exhaustive testing cannot be done except for the most trivial of systems.” (NASA 2004, p. 77).

Kendall presents a case study for an electronic throttle control (with mechanical fail-safes) using a two-CPU approach (a “sub Processor” and a “Main Processor”). The automotive supplier elected to follow the IEC 1508 draft standard (a draft of the IEC 61508 standard), also borrowing elements from the MISRA software guidelines. Steps that were performed include: preliminary hazard analysis with mapping to MISRA SILs, review of standards and procedures to ensure they were up to date with accepted practices; on-site audits of development processes; FMEA by an independent agency; FTA by an independent agency; Markov modeling (a technique for analyzing failure probabilities); independent documentation review; mathematical proofs of correctness; and safety validation testing. (Kendall 1996)  Important points from this paper relevant to this case include: “it is well accepted that software cannot be shown to be suitable for [its] intended use by testing alone” (id. pg. 6); “Software robustness must be demonstrated by ensuring the process used to develop it is appropriate, and that this process is rigorously followed.” (id., pg. 6); “safety validation must consider the effect of the vehicle under as many failure conditions as is possible to generate.” (id., p. 7).

Roger Rivett from Rover Group wrote a paper in 1997 based on a collaborative government-sponsored research effort that specifically addresses how automotive manufacturers should proceed to ensure the safety of vehicles. He makes an important point that rigorous use of good software practice is required in addition to testing (Rivett 1997, pg. 3). He has four specific conclusions for achieving a level of “good practice” for safety: use a quality management system; use a safety integrity level approach; be compliant with a sector standard (e.g., MISRA Software Guidelines); and use a third party assessment to ensure that high-integrity levels have been achieved. (Rivett 1997, pg. 10).

MISRA Development Guidelines, section 3.6.1, provides a set of points that make it clear that testing is necessary, but not sufficient, to establish safety (MISRA Guidelines, pg. 49):

MISRA Testing Guidance (MISRA Software Guidelines, p. 49)

This last point of the MISRA Guidelines is key – testing can discover if something is unsafe, but testing alone cannot prove that a system is safe.

"Testing on its own is not adequate for assessing safety-related software."  (MISRA report 2 pg. iv) In particular, system-level testing (such as at the vehicle level), cannot hope to uncover all the possible faults or exceptional situations can will result in mishaps.

References:
  • Beatty, Where testing fails, Embedded Systems Programming, Aug 2003, pp. 36-41.
  • Butler et al., The infeasibility of quantifying the reliability of life-critical real-time software,  IEEE Trans. Software Engineering, Jan 1993, pp. 3-12.
  • IEC 61508, Functional Safety of Electrical/Electronic/Programmable Electronic Safety-related Systems (E/E/PE, or E/E/PES), International Electrotechnical Commission, 1998. Part 7.
  • Kendall, “The safety assurance of the AJV8 electronic throttle,” IEE Colloquium on the Electrical System of the Jaguar XK8, Oct 18, 1996, pp. 2/1-2/8.
  • Knutson, C. & Carmichael, S., Safety First: avoiding software mishaps, Embedded Systems Programming, November 2000, pp. 28-40.
  • Kopetz, H., On the fault hypothesis for a safety-critical real-time system, ASWSD 2004, LNCS 4147, pp. 31-42, 2006.
  • Littlewood, B., Strigini, L. (1993) “Validation of Ultra-High Dependability for Software-Based Systems,” Communications of the ACM, 36(11):69-80, November 1993.
  • MISRA, (MISRA C), Guideline for the use of the C Language in Vehicle Based Software, April 1998.
  • MISRA, Development Guidelines for Vehicle Based Software, November 1994 (PDF version 1.1, January 2001).
  • MISRA, Report 2: Integrity, February 1995.
  • NASA-GB-8719.13, NASA Software Safety Guidebook, NASA Technical Standard, March 31, 2004.
  • Rivett, "Emerging Software Best Practice and how to be Compliant", Proceedings of the 6th International EAEC Congress July 1997.

Thursday, September 11, 2014

Fail-Safe Mechanisms Must Be Tested

Some systems base their safety arguments on the presence of “fail-safe” behaviors. In other words, if a failure occurs, the argument is that the system will respond in a safe way, such as by shutting down in a safe manner. If you have fail-safe mechanisms, you need to test them with a full range of faults within the intended fault model to make sure they work properly.

Consequences:
Failing to specifically test for mitigation of single points of failure means that there is no way to be sure that the mitigation really works, putting safety of the system into doubt.

As an example, if a hardware watchdog timer is not turned on, it won’t reset the system, but there might be no way to tell whether the watchdog timer is on or not (or set to the wrong value, or otherwise used improperly) without specifically testing whether the watchdog works or not. Thus, you can’t take credit for having a watchdog timer unless you have actually tested that it works for each fault that matters (or, if there are many such faults, argue that you have attained sufficient coverage with the tests that are run).

Accepted Practices:
  • Each and every fail-safe mechanism and fault management mechanism must be tested, preferably on a fully integrated system. Such tests may be difficult to perform in normal functional testing and may require intentional fault injection from the outside of the system (e.g., breaking a sensor) or fault injection at test points inside the system (e.g., intentionally killing a task using special test support infrastructure).
Discussion:
Fault injection is the process of intentionally inducing a hardware or software fault and determining its effect upon the system.

Fault management mechanisms, and especially fail-safe mechanisms, are often the key points upon which an argument as to the safety of a system rests. As an example, a safety case based on a watchdog timer detecting task failures requires that the watchdog timer actually work. While it is of course important to make sure that the system has been designed properly, there is no substitute for testing whether the watchdog timer is actually turned on during system test. (To revisit a point on system testing made elsewhere in my postings – system testing is not sufficient to ensure safety, but thorough system testing is certainly an important thing to do.) It is similarly important to specifically test every fault mode that must be handled by the system to ensure fault handling is done correctly.

Some examples of fault tests that should be performed include: killing each task independently to ensure that the death of any task is caught by the watchdog (and, by extension, cannot cause an unsafe system state); overloading the system to ensure that it behaves safely in an unanticipated CPU overload situation; checking that diagnostic fail-safes detect the faults they are supposed to and react by putting the system into a safe state; disabling sensors; disabling actuators; and others.

Another perspective on this topic is that ensuring safety usually involves arguing that all single points of failure have been mitigated to make the system safe. To demonstrate that the reasoning is accurate, a system must have corresponding failures injected to make sure that the mitigation approaches actually work, since the system’s safety case rests upon that assumption. This might include intentionally corrupting bits in memory, corrupting computations that take place, corrupting stack contents, and so on.

It is important to note that ordinary system functional testing tends to do a poor job at exercising fault mitigation mechanisms. As an example, if a particular task is never supposed to die, and testing has been thorough, then that task won’t die during normal functional testing (if it did, the system would be defective!). The point of detecting task death is to handle situations you missed in testing. But that means the mechanism to detect task death and perform a restart hasn’t been tested by normal system-level functional tests. Therefore, testing fail-safe mechanisms requires special techniques that intentionally introduce faults into the system to activate those fail-safes.

Selected Sources:
Safety critical systems are deemed safe only if they can withstand the occurrence of any single point fault. But, there is no way to know if they will really do that unless testing includes actually injecting representative single point faults to see if the system will respond in a safe manner. You can’t know if a system is safe if you don’t actually test its safety capabilities, and doing so requires fault injection. For example, if you expect a watchdog to detect failed tasks, you need to kill each and every task in turn to see if the watchdog really works. Arlat correctly states that “physical fault injection will always be needed to test the actual implementation of a fault tolerant system” (Arlat 1990, pg. 180)

The need to actually test fail-safe mechanisms to see if they really work should be readily apparent to any engineer. Pullum discusses this topic by suggesting the use of fault injection (intentionally causing faults as a testing technique) in the context of “verification of integration of fault and error processing mechanisms” for creating dependable systems (Pullum 2001, pg. 93).

“Fault injection is important to evaluating the dependability of computer systems. … It is particularly hard to recreate a failure scenario for a large complex system.” (Hsueh et al., 1997 pg. 75, speaking about the need for fault injection as part of testing a system). Mariani refers to the IEC 61508 safety standard and concludes that “fault-injection will be mandatory for soft error sensitivity verification” for safety critical systems (Mariani03, pg. 60). “A fault-tolerant computer system’s dependability must be validated to ensure that its redundancy has been correctly implemented and the system will provide the desired level of reliable service. Fault injection – the deliberate insertion of faults into an operational system to determine its response – offers an effective solution to this problem.” (Clark 1995, pg. 47).

Fault injection must include all possible single-point faults, not just faults that can be conveniently injected via the pins or connectors of a component. Rimen et al. compared internal vs. external fault injection, and found that only 9%-12% of bit flip faults that occur inside a microcontroller could be tested via external pin fault injection (Rimen et al. 1994, p. 76). In 1994, Karlsson reported on the effectiveness of using a radioactive isotope to inject faults into a microcontroller (Karlsson 1994). Later fault injection work by Karlsson’s research group was performed on automotive brake-by-wire applications, sponsored by Volvo (Aidemark 2002), clearly demonstrating the applicability of fault injection as a relevant technique for safety critical automotive systems. And other similar work found defects in a safety critical automotive network protocol. (Ademaj 2003)

A test specifically on an engine control program using fault injection caused “permanently locking the engine’s throttle at full speed.” (Vinter 2001).

There are numerous other scholarly works in this area.  An early example is Bossen (1981). Some others include: Arlat et al. (1989), Barton et al. (1990), Benso et al. (1999), Han (1995), and Kanawati (1995). As a more recent example, Baumeister et al. performed fault injection on an automotive braking controller via irradiating it and measuring the errors, finding that unprotected SRAM and unprotected microcontroller paths were both sensitive to upsets (Baumeister 2012, pg. 5)

MISRA Software Guidelines take it for granted that fault management capabilities will be tested (e.g., MISRA Software Guidelines 3.4.8.3 pg. 44, MISRA Report 4 p. v). “Fault injection test” is recommended by ISO 26262-6 (pg. 23) for software integration, noting that “This includes injection of arbitrary faults in order to test safety mechanisms (e.g., by corrupting software or hardware components).”

By the late 1990s fault injection tools had become quite sophisticated, and were capable of injecting faults while a system was running at full speed even if source code was not available (e.g., Carreira 1998).
An example of a testing approach along these lines is E-GAS (E-GAS), which includes numerous tests based on auto manufacturer experience to ensure that various faults will be handled safely.

It is important to note that while mitigation techniques such as watchdog timers are a good practice if implemented properly, they are not sufficient to guarantee safety in the face of random errors. For example, Gunneflo presents experimental evidence indicating that watchdog effectiveness is less than perfect, and depends heavily on the particular software being run. Gunneflo recommends: “To accurately estimate coverage and latency for watch-dog mechanisms in a specific system, fault injection experiments must be carried out with the final implementation of the system using the real software.” (Gunneflo 1989, pg. 347). In other words, even if you have a watchdog timer, you need to perform fault injection to understand whether there are holes in your fault tolerance approach.

References:
  • Ademaj et al., Evaluation of fault handling of the time-triggered architecture with bus and star topology, DSN 2003.
  • Aidemark, Experimental evaluation of time-redundant execution for a brake-by-wire application, DSN 2002.
  • Arlat et al., Fault Injection for Dependability Validation of Fault-Tolerant Computing Systems, FTCS, 1989.
  • Arlat et al., Fault Injection for Dependability Validation: a methodology and some applications, IEEE Trans. SW Eng., 16(2), pp. 166-182 Feb. 1990.
  • Barton et al., Fault injection experiments using FIAT, IEEE Trans. Computers, pp. 575-592, April 1990.
  • Baumeister et al., Evaluation of chip-level irradiation effects in a 32-bit safety microcontroller for automotive braking applications, IEEE Workshop on Silicon Errors in Logic – System Effects, 2012.
  • Benso et al., Fault injection for embedded microprocessor-based systems, J. Universal Computer Science, 5(10), pp. 693-711, 1999.
  • Bossen & Hsiao, ED/FI: a technique for improving computer system RAS, FTCS, 1981.
  • Carreira et al., Xception: a technique for the experimental evaluation of dependability in modern computers, IEEE Trans. Software Engineering, Feb. 1998, pp. 125-136.
  • Clark, J. et al., Fault Injection: a method for validating computer-system dependability, IEEE Computer, June 1995, pp. 47-56.
  • EGAS, Standardized E-Gas monitoring concept for engine management systems of gasoline and diesel engines version 4.0, Work Group EGAS, Jan. 30, 2007.
  • Gunneflo et al., Evaluation of error detection schemes using fault injection by heavy-ion radiation, Fault Tolerant Computing Symposium, 1989, pp. 340-347.
  • Han et al., DOCTOR: An integrated software fault injection environment for distributed real-time systems, International Computer and Dependability Symposium, 1995, pp. 204-213.
  • Hsueh, M. et al., Fault injection techniques and tools, IEEE Computer, April 1997, pp. 75-82.
  • ISO 26262, Road vehicles – Functional Safety, International Standard, First Edition, Nov 15, 2011, ISO, part 6.
  • Kanawati et al., FERRARI: a flexible software-based fault and error injection system, IEEE Trans. Computers, 44(2), Feb. 1995, pp. 248-260.
  • Mariani, Soft errors on digital computers, Fault injection techniques and tools for embedded systems reliability evaluation, 2003, pp. 49-60.
  • MISRA, Development Guidelines for Vehicle Based Software, November 1994 (PDF version 1.1, January 2001).
  • MISRA, Report 4: Software in Control Systems, February 1995.
  • Pullum, L., Software Fault Tolerance Techniques and Implementation, Artech House, 2001.
  • Rimen et al., On microprocessor error behavior modeling, Fault Tolerant Computing Symposium, 1994, pp. 76-85.
  • Vinter, J., Aidemark, J., Folkesson, P. & Karlsson, J., Reducing critical failures for control algorithms using executable assertions and best effort recovery, International Conference on Dependable Systems and Networks, 2001, pp. 347-356.

Monday, December 2, 2013

Why You Need A Software-Specific Test Plan

Summary: Pretty much every embedded system goes through some sort of system test. But, you need to do software-specific testing in addition to that.  For example, what if you ship a system with the watchdog timer accidentally turned off?



In essentially every embedded system there is some sort of product testing. Typically there is a list of product-level requirements (what the product does), and a set of tests designed to make sure the product works correctly. For many products there is also a set of tests dealing with fault conditions (e.g., making sure that an overloaded power supply will correctly shed load). And many companies think this is enough... but I've found that such tests usually fall short.

The problem is that there are features built into your software that are difficult or near-impossible to test in traditional product-level testing.  Take the watchdog timer for example. I have heard more than one developer say that there was a case where a product shipped (at least one version of a product) with the watchdog timer accidentally turned off. How could this happen?  Easy: a field problem is reported; developer turns off watchdog to do single-step debugging; bug found and fixed; forgot to turn the watchdog back on; product test doesn't have a way to intentionally crash the software to see if the watchdog is working; new software version ships with watchdog timer still turned off.

Continuing with the watchdog example, how do you solve this? One way is to include user-accessible functions that exercise the watchdog timer by intentionally crashing the software. Sounds a bit dangerous, especially if you are worried about security. More likely you'll need to have some separate, special way of testing functions that you don't want visible to the end user. And you'll need a plan for executing those tests.

And ... well, here we are, needing a Software Test Plan in addition to a Product Test Plan. Maybe the software tests are done by the same testers who do product test, but that's not the point. The point is you are likely to need some strategy for testing things that are there not because the end product user manual lists them as functions, but rather because the software requirements say they are needed to provide reliability, security, or other properties that aren't typically thought of as product functions. ("Recovers from software crashes quickly" is typically not something you boast about in the user manual.) For similar reasons, the normal product testers might not even think to test such things, because they are product experts and not software experts.

So to get this right the software folks are going to have to work with the product testers to create a software-specific test plan that tests what the software requirements need to have tested, even if they have little directly to do with normal product functions. You can put it in product test or not, but I'd suggest making it a separate test plan, because some tests probably need to be done by testers who have particular skill and knowledge in software internals beyond ordinary product testers. Some products have a "diagnostic mode" that, for example, sends test messages on a network interface. Putting the software tests here makes a lot of sense.

But for products that don't have such a diagnostic mode, you might have to do some ad hoc testing before you build the final system by, for example, manually putting infinite loops into each task to make sure the watchdog picks them up. (Probably I'd use conditional compilation to do that -- but have a final product test make sure the conditional compilation flags are off for the final product!)
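
Here's a minimal sketch of that conditional-compilation approach (the macro, variable, and task names are invented; adapt to your build system and watchdog driver):

/* Build with -DWATCHDOG_TEST_HOOKS to enable intentional-crash testing.
   A final release check must confirm this flag is OFF in the shipping build. */
#ifdef WATCHDOG_TEST_HOOKS
static volatile int wdt_test_hang_task = -1;   /* set via a test-only interface */
#endif

void Task1(void)
{
#ifdef WATCHDOG_TEST_HOOKS
    if (wdt_test_hang_task == 1) {
        for (;;) { }   /* hang on purpose; the watchdog should reset the system */
    }
#endif
    /* ... normal Task1 work, including its watchdog kick if applicable ... */
}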

Here are some examples of areas you might want to put in your software test plan:

  • Watchdog timer is turned on and stays turned on; product reboots as desired when it trips
  • Watchdog timer detects timing faults with each and every task, with appropriate recovery (need a way to kill or delay individual tasks to test this)
  • Tasks and interrupts are meeting deadlines (watchdog might not be sensitive enough to detect minor deadline misses, but deadline misses usually are a symptom of a deeper problem)
  • CPU load is as expected (even if it is not 100%, if you predicted an incorrect number it means you have a problem with your scheduling estimates)
  • Maximum stack depth is as expected (see the stack-painting sketch after this list)
  • Correct versions of all code have been included in the build
  • Code included in the build compiles "clean" (no warnings)
  • Run-time error logs are clean at the end of normal product testing
  • Fault injection has been done for systems that are safety critical to test whether single points of failure turn up (of course it can't be exhaustive, but if you find a problem you know something is wrong)
  • Exception handlers have all been exercised to make sure they work properly. (For example, if your code hits the "this can never happen" default in a switch statement, does the system do something reasonable, even if that means a system reset?)
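For the maximum-stack-depth item, one common way to measure it is stack painting. Here is a hedged sketch; the linker symbols are hypothetical and it assumes a single descending stack:

#include <stdint.h>

extern uint32_t __stack_start[];   /* hypothetical linker symbols marking the  */
extern uint32_t __stack_end[];     /* stack region boundaries (descending stack) */

#define STACK_PAINT 0xDEADBEEFu

/* Called early (often from startup code): fill the not-yet-used part of the
   stack with a known pattern, stopping below the region currently in use. */
void StackPaint(void)
{
    uint32_t marker;   /* its address approximates the current stack pointer */
    for (uint32_t *p = __stack_start; (p < &marker) && (p < __stack_end); p++) {
        *p = STACK_PAINT;
    }
}

/* Called at the end of testing: count untouched words to get the headroom,
   then compare against the worst-case stack depth you predicted. */
uint32_t StackHeadroomWords(void)
{
    uint32_t headroom = 0;
    for (uint32_t *p = __stack_start; (p < __stack_end) && (*p == STACK_PAINT); p++) {
        headroom++;
    }
    return headroom;
}
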
Note that some of these are, strictly speaking, not really "tests." For example, making sure the code compiles free of static analysis warnings isn't done by running the code. But, it is properly part of a software test plan if you think of the plan as ensuring that the software you're shipping out meets quality and functionality expectations beyond those that are explicit product functions.

And, while we're at it, if any of the above areas aren't in your software requirements, they should be. Typically you're going to miss tests if there is nothing in the requirements saying that your product should have these capabilities.

If you have any areas like the above that I missed, please leave a comment.  I welcome your feedback!

Monday, July 22, 2013

Making Main Loop Scheduling Timing More Deterministic

Summary: Wasting time in a main loop scheduler can make testing system-level timing a whole lot easier.



It's common enough to see a main loop scheduler in an embedded system along the lines of the following:

for(;;)
{ Task1();
  Task2();
  Task3();
}

I've heard this referred to as a "static scheduler," "cyclic executive," "main loop scheduler," or "static non-preemptive scheduler" among other terms. Regardless of what you call it, the idea is simple: run all the tasks that need to be run, then go back and do it again until the system is shut down or reset. There might be interrupt service routines (ISRs) also running in the system.

The main appeal of this approach is that it is simple. You don't need a real time operating system and, even more importantly, it would appear to be difficult to get wrong. But, there's a little more to it than that...

The first potential problem is those pesky ISRs. They might have timing problems, cause concurrency problems, and so on. Those issues are beyond what I want to talk about today except for making the point that they can affect the execution speed of one iteration through the main loop in ways that may not be obvious. You should do timing analysis for the ISRs (chapter 14 of my book has gory details).  But for today's discussion we're going to assume that you have the ISRs taken care of.

The next problem is timing analysis of the main loop. The worst case response time for running a task is one trip through the loop. But how long that trip is might vary depending on the calculations each task performs and how much time ISRs steal. It can be difficult to figure out the absolute worst case execution time (you should try, but it might not be easy). But the really bad news is, even if you know the theoretical worst case timing you're unlikely to actually see it during testing.

Consider the tester trying to make sure the system will function with worst case timing. How do you get the above static scheduler to take the worst case path through the code a bunch of times in a row to see what breaks? It's a difficult task, and probably most testers don't have a way to pull that off. So what is happening is you are shipping product that has never been tested for worst case main loop execution time. Will it work?  Who knows.  Do you want to take that chance with 10,000 or 100,000 units in the field? Eventually one of them will see worst case conditions and you haven't actually tested what will happen.

Fortunately there is an easy way to mitigate this risk. Add a time-waster at the end of the main loop. The time waster should convert the above main loop, which runs as fast as it can, to a main loop that runs exactly once per a defined period (for example, once every 100 msec):

for(;;)
{ StartTimer(100);  // start a 100 msec countdown
  Task1();
  Task2();
  Task3();
  WaitForTimer(0);  // wait for the 100 msec countdown to reach 0
}

This is just a sketch of the code -- how you build it will depend upon your system. The idea is that you waste time in the WaitForTimer routine until you've spent 100 msec in the main loop, then you run the loop again. Thus, the main loop runs exactly once every 100 msec.  If the tasks complete in less than 100 msec, as measured by a hardware timer, you waste time at the end, waiting for the 100 msec period to expire before starting the next main loop iteration. If the tasks take exactly 100 msec then you just start the main loop again immediately. If the tasks run longer than 100 msec, then you should log an error or perform some other action so you know something went wrong.
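
For example, one way to build those two routines is with a free-running millisecond tick maintained by a timer ISR. This is only a sketch under that assumption (g_ms_ticks, the overrun counter, and the wraparound-safe comparison are all illustrative):

#include <stdint.h>

static volatile uint32_t g_ms_ticks;   /* incremented once per msec by a timer ISR */
static uint32_t deadline_ticks;
static uint32_t overrun_count;         /* main loop iterations that ran too long */

void StartTimer(uint32_t period_ms)
{
    deadline_ticks = g_ms_ticks + period_ms;   /* countdown reaches 0 at this tick */
}

void WaitForTimer(uint32_t remaining_ms)
{
    uint32_t target = deadline_ticks - remaining_ms;
    if ((int32_t)(g_ms_ticks - target) > 0) {
        overrun_count++;               /* period was blown; log or flag the error */
    }
    while ((int32_t)(g_ms_ticks - target) < 0) {
        /* burn time (or enter a low-power idle) until the period expires */
    }
}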

The key benefit to doing this is to ensure that in testing the average timing behavior is identical to the worst case timing behavior. That way, if something works when the system is fast, but breaks when it actually takes 100 msec to complete the main loop, you'll see it right away in testing. A second benefit is that since you are actively managing the main loop timing, you have a way to know the timing ran a little long on some loops even if it isn't bad enough to cause a watchdog reset.

Monday, February 25, 2013

Using Profiling To Improve System-Level Test Coverage


Sometimes I run into confusion about how to deal with "coverage" during system level testing. Usually this happens in a context where system level testing is the only testing that's being done, and coverage is thought about in terms of lines of code executed.  Let me explain those concepts, and then give some ideas about what to do when you find yourself in that situation.

"Coverage" is the fraction of the system you've exercised during testing. For unit tests of small chunks of code, it is typically what fraction of the code was exercised during testing (e.g., 95% means in a 100 line program 95 lines were executed at least once and 5 weren't executed at all).  You'd think it would be easy to get 100%, but in fact getting the last few lines of code in test is really difficult even if you are just testing a single subroutine in isolation (but that's another topic). Let's call this basic block coverage, where a basic block is a segment of code with no branches in or out, so if you execute one line of a basic block's code you have to execute all of them as a set. So in a perfect world you get 100% basic block coverage, meaning every chunk of code in the system has been executed at least once during testing. (That's not likely to be "perfect" testing, but if you have less than 100% basic block coverage you know you have holes in your test plan.)

By system level test I mean that the testing involves more or less a completely running full system. It is common to see this as the only level of testing, especially when complex I/O and timing constraints make it painful to exercise the software any other way. Often system level tests are based purely on the functional specification and not on the way the software is constructed. For example, it is the rare set of system level tests that checks to make sure the watchdog timer operates properly.  Or whether the watchdog is even turned on. But I digress...

The problem comes when you ask the question of what basic block coverage is achieved by a certain set of system-level tests. The question is usually posed less precisely, in the form "how good is our testing?"  The answer is usually that it's pretty low basic block coverage. That's because if it is difficult to reach into the corner cases when testing a single subroutine, it's almost impossible to reach all the dusty corners of the code in a system level test.  Testers think they're good at getting coverage with system level test (I know I used to think that myself), but I have a lot of doubt that coverage from system-level testing is high. I know -- your system is different, and it's just my opinion that your system level test coverage is poor. But consider it a challenge. If you care about how good your testing is for an embedded product that has to work (even the obscure corner cases), then you need data to really know how well you're doing.

If you don't have a fancy test coverage tool available to you, a reasonable way to go about getting coverage data is to use some sort of profiler to see if you're actually hitting all the basic blocks. Normally a profiler helps you know what code is executed the most for performance optimization. But in this case what you want to do is use the profiler while you're running system tests to see what code is *not* executed at all.  That's where your corner cases and imperfect coverage are.  I'm not going to go into profiler technology in detail, but you are likely to have one of two types of profiler. Maybe your profiler inserts instructions into every basic block and counts up executions.  If so, you're doing great and if the count is zero for any basic block you know it hasn't been tested (that would be how a code coverage tool is likely to work). But more likely your profiler uses a timer tick to sample where the program counter is periodically.  That makes it easy to see what pieces of code are executed a lot (which is the point of profiling for optimization), but almost impossible to know whether a particular basic block was executed zero times or one time during a multi-hour test suite.

If your profiler only samples, you'll be left with a set of basic blocks that haven't been seen to execute. If you want to go another step further you may need to put your own counters (or just flags) into those basic blocks by hand and then run your tests to see if they ever get executed. But at some point that may be hard too. Running the test suite a few times may help catch the moderately rare pieces of code that are executed. So may increasing your profile sample rate if you can do that. But there is no silver bullet here -- except getting a code coverage tool if one is available for your system.  (And even then the tool will likely affect system timing, so it's not a perfect solution.)
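
A minimal sketch of that hand-instrumentation idea (all names invented): give each suspect basic block a flag, set the flag when the block runs, and dump the flags after the test suite finishes:

#include <stdbool.h>
#include <stdio.h>

/* One entry per basic block the profiler never saw execute. */
enum { BB_SENSOR_FAULT_PATH, BB_RETRY_EXHAUSTED, BB_SWITCH_DEFAULT, BB_COUNT };
static volatile bool bb_hit[BB_COUNT];

#define MARK_BB(id)  (bb_hit[id] = true)   /* drop one of these into each suspect block */

/* Run at the end of the test suite (or via a diagnostic command). */
void ReportUncoveredBlocks(void)
{
    for (int i = 0; i < BB_COUNT; i++) {
        if (!bb_hit[i]) {
            printf("basic block %d was never executed during the test suite\n", i);
        }
    }
}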

At any rate, after profiling and some hand analysis you'll be left with some idea of what's getting tested and what's not.  Try not to be shocked if it's a low fraction of the code (and be very pleased with yourself if it is above 90%).

If you want to improve the basic block coverage number you can use coverage information to figure out what the untested code does, and add some tests to help. These tests are likely to be more motivated by how the software is built rather than by the product-level function specification.  But that's why you might want both a software test plan in addition to a product test plan.  Product tests never cover all the software corner cases.

Even with expanded testing, at some point it's going to be really, really hard to exercise every last line of code -- or even every last subroutine. For those pieces you can test smaller portions of the software in isolation (unit or subsystem testing) to make it easier to get into the corner cases. Peer reviews are another alternative, and one that is more cost effective than testing if you have limited resources and limited time to do risk reduction before shipping. (But I'm talking about formal peer reviews, not just asking your next cube neighbor to look over your code at lunch time.) Or there is the tried-and-true strategy of using a debugger to set a breakpoint just before a hard-to-reach basic block, and then changing variables to force the software down the path you want it to go. (Whether such a test is actually useful depends upon the skill of the person doing it.)

The profiler-based coverage improvement strategy I've discussed is really about how to get yourself out of a hole if all you have is system tests and you're getting pressure to ship the system. It's better than just shipping blind and finding out later that you had poor test coverage. But the best way to handle coverage is to get very high basic block coverage via unit test, then shake out higher level problems with integration test and system test.

If you have experience using these ideas I'd love to hear about them -- let me know.


Monday, February 21, 2011

Categorizing Bug Severity

One of the imponderables when using a bug tracking system such as Bugzilla is how to assign "severity" to a particular defect entry. It is instinctive to assign severity based on how dramatic the outcome is. In that sort of system an LED that sometimes doesn't blink might be "medium" or "low," and a system crash might be "critical."  But that's usually not the best way to look at severity.

Instead, defect severity should be based on how urgent the fix is for the business environment your product lives in. To use the above example, what if the non-blinking LED causes 1000 field support calls because people think the device isn't working properly (perhaps it is a cable modem and "blinking" means "I'm working properly")?  And what if the system crash is likely to happen to only one out of every 100,000 customers, in a situation where the customer is very likely to just cycle power to clear the problem without any big deal?  In that case the first defect might be "critical" and the second might only be "medium" severity.

So when you are assigning issue severity, consider the business context and not just how the defect feels to a tester or developer at a test bench. The general idea is that defect severity should correspond to the business value of fixing the defect, not to how embarrassing it might be to the developers. (You can read more about good practices for defect tracking in chapter 24 of my book.)

Wednesday, November 17, 2010

Embedded Software Risk Areas -- Verification & Validation

Series Intro: this is one of a series of posts summarizing the different red flag areas I've encountered in more than a decade of doing design reviews of industry embedded system software projects. You can read more about the study here. If one of these bullets applies to your project, you should consider whether that presents undue risk to project success (whether it does or not depends upon your specific project and goals). The results of this study inspired the chapters in my book.

Here are the Verification and Validation red flags:
  • No peer reviews
Code, requirements, design and other documents are not subject to a methodical peer review, or undergo ineffective peer reviews. As a result, most bugs are found late in the development cycle when it is more expensive to fix them.
  • No test plan
Testing is ad hoc, and not according to a defined plan. Typically there is no defined criterion for how much testing is enough. This can result in poor test coverage or an inconsistent depth of testing.
  • No defect tracking
Defects and other issues are not being put into a bug tracking system. This can result in losing track of outstanding bugs and poor prioritization of bug-fixing activities.
  • No stress testing
There is no specific stress testing to ensure that real time scheduling and other aspects of the design can handle worst case expected operating conditions. As a result, products may fail when used for demanding applications.

Monday, September 13, 2010

Testing a Watchdog Timer

I recently got a query about how to test a watchdog timer. This is a special case of the more general question of how you test for a fault that is never supposed to happen. This is a tricky topic in general, but the simple answer (if you can figure out how to do it easily) is to insert the fault intentionally into your system and see what happens.

Here are some ideas that I've seen or thought up that you may find helpful:

(1) Set the watchdog timer period to shorter and shorter values until it trips in normal operation. This will give you an idea of how close to the edge you are. But it is easy to make a mistake when changing the watchdog period back to its normal value, so this isn't a good test for nearly-final code.

(2) Add a timeout loop (for example, a do-nothing loop that you have made sure won't be removed by your optimizer -- typically by making its loop counter volatile). Increase the timeout value until the watchdog timer trips. This similarly gives you an idea of how close to the edge you are in terms of timing, but with a finer level of detail. It also has the advantage of testing operation without modifying the watchdog code itself.

(3) Use a jumper that, when inserted, activates a time-wasting task (similar to idea #2). The idea is that when the jumper is installed, it enables a task that wastes so much CPU time it is guaranteed to trip the watchdog; when the jumper is removed, that task doesn't run and the system operates properly.  You can insert the jumper during system test to make sure the watchdog function works.  (Just jumping to the watchdog handling code isn't the idea -- you have to simulate a CPU overload situation for this to be a realistic test.)  When you ship the system, make sure the jumper has been removed. To avoid problems, put the watchdog test first on the outgoing test plan instead of last; that way the jumper is sure to be removed before shipment. If you want to be really clever, that same jumper hardware could be used to disable the watchdog in early testing, with the code changed to have it trip the watchdog as the system nears completion.  But whether you want to be that tricky is a matter of taste and the type of system you are building.
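Here is a rough sketch of how idea #3 might look in code. The jumper-read function and task name are made up for illustration (use whatever your board support code provides), and the volatile loop counter is the same trick that keeps the do-nothing loop in idea #2 from being optimized away:

    /* Jumper-activated watchdog test -- a sketch with invented hardware names. */
    #include <stdbool.h>

    /* Assumed to return true when the test jumper is physically installed;
     * replace with your board's actual GPIO read. */
    extern bool test_jumper_installed(void);

    void watchdog_test_task(void)
    {
        if (test_jumper_installed()) {
            /* Burn CPU time until the watchdog trips.  If the system resets,
             * the watchdog works; if this loop ever exits, something is wrong. */
            volatile unsigned long spin = 0;
            for (;;) {
                spin++;              /* volatile keeps the loop from being removed */
            }
        }
        /* No jumper: do nothing and let the system run normally. */
    }

The same skeleton works for idea #2 if you replace the infinite loop with a bounded loop whose iteration count you increase between runs.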

Monday, August 9, 2010

How Do You Know You Tested Everything? (Simple Traceability)

The idea of traceability is simple: look at the inputs to a design process and then look at the outputs from that same process. Compare them to make sure you didn't miss anything.

The best place to start using traceability is usually comparing requirements to acceptance tests.  Here's how:

Make a table (spreadsheets work great for this). Label each column with a requirement. Label each row with an acceptance test. For each acceptance test, put an X in the requirement columns that are exercised to a non-trivial degree by that test.

When you're done making the table, you can do traceability analysis. An empty row (an acceptance test with no "X" marks) means you are running a test that isn't required. It might be a really good test to run -- in which case your requirements are missing something. Or it might be a waste of time.

An empty column (a requirement with no "X" marks) means you have a requirement that isn't being tested. This means you have a hole in your testing, a requirement that didn't get implemented, or a requirement that can't be tested. No matter the cause, you've got a problem.  (It's OK to use non-testing validation such as a design review to check some requirements. For traceability purposes put it in as a "test" even though it doesn't involve actually running the code.)
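As a tiny made-up example, suppose you have requirements R1 through R3 and acceptance tests T1 through T3 (the names are invented; real tables are of course much bigger):

            R1    R2    R3
      T1     X     X
      T2     X
      T3

Here the empty row for T3 means that test doesn't trace to any requirement (either it's unnecessary or a requirement is missing), and the empty column for R3 means requirement R3 is never tested -- exactly the two kinds of gaps described above.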

If your table has no empty rows and no empty columns, then you've achieved complete traceability -- congratulations!  This doesn't mean you're perfect, but it does mean you've managed to avoid some easy-to-detect gaps in your software development efforts.

These same ideas can be used elsewhere in the design process to avoid similar mistakes. The point for most projects is to use the idea in a way that doesn't take a lot of time but still catches "stupid" mistakes early on. It may seem simplistic, but in my experience having a written traceability table helps. You can find out more about traceability in my book, Chapter 7: Tracing Requirements To Test.
---

Monday, June 14, 2010

White Box Testing

White box software tests are designed in light of the particular software design and implementation being tested. For example, if you have an if {} else {} construct, a white box test would intentionally try to execute both the if clause and the else clause of that statement (probably using separate tests). This may sound trivial, but designing tests that execute rare cases and fault handling code can be a real challenge.
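As a trivial illustration, here is a sketch of what white box tests for an if/else might look like. The function and the test below are invented for this example; the point is simply that each test case is chosen by looking at the code so that both branches get executed:

    /* White box branch testing -- an invented example for illustration. */
    #include <assert.h>

    /* Code under test: clamp a sensor reading to a maximum value. */
    int clamp_reading(int reading, int max)
    {
        if (reading > max) {
            return max;              /* the "if" branch */
        } else {
            return reading;          /* the "else" branch */
        }
    }

    /* One test case per branch, chosen with knowledge of the implementation. */
    void test_clamp_reading(void)
    {
        assert(clamp_reading(150, 100) == 100);   /* exercises the "if" branch */
        assert(clamp_reading( 50, 100) ==  50);   /* exercises the "else" branch */
    }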

The fraction of the code that is tested is known as the test coverage.  In general, higher coverage is good. For example, white box test coverage might be the fraction of lines of code executed by tests, with 95% being a pretty good result and 98% to 99% often being the best that people achieve without heroic effort.  (95% coverage means 5% of the code is never executed -- not even once -- during testing. It's hard to believe that counts as a pretty good result, but as I said, executing code that handles rare cases can be a challenge.)

Although lines of code executed is the classic coverage metric, there are other possible coverage metrics that might be useful depending on your situation:
  • Testing that code can correctly handle exceptions it might encounter (for example, does it handle malloc failing? See the sketch below for one way to force that.)
  • Testing that all entries of a lookup table are exercised (what if only one table entry is out to lunch?)
  • Testing that all states and arcs of a statechart have been exercised
  • Testing that algorithms have been checked for numerical stability in tricky areas where they might have problems
The point of all the above is that the tester knows exactly how the code is trying to perform its functions, and makes sure that nothing was missed in testing, especially corner cases.
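As one concrete illustration of the malloc point in the list above, a common trick is to route allocations through a wrapper that a test can force to fail. This is only a sketch with made-up names (my_malloc, fail_next_alloc), not a prescription for your code base:

    /* Forcing the allocation-failure path under test -- a minimal sketch. */
    #include <stdlib.h>
    #include <stdbool.h>

    static bool fail_next_alloc = false;   /* a test sets this to inject a failure */

    /* Production code calls this wrapper instead of calling malloc directly. */
    void *my_malloc(size_t size)
    {
        if (fail_next_alloc) {
            fail_next_alloc = false;
            return NULL;                   /* simulate an out-of-memory condition */
        }
        return malloc(size);
    }

    /* In a white box test: */
    void test_allocation_failure_path(void)
    {
        fail_next_alloc = true;
        /* Call the code under test here and check that it handles a NULL
         * return gracefully instead of crashing or corrupting state. */
    }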

It's important to remember that even if you have 100% test coverage, it doesn't mean you have tested the system completely (usually "complete" testing is impossible -- there are too many possibilities). What good coverage does mean is that you haven't left out anything obvious. And that's good enough to make understanding the coverage of your testing worthwhile. Chapter 23 of my book discusses the concepts and practices of embedded software testing in more detail.
---
