Monday, February 2, 2015

Tester To Developer Ratio Should Be 1:1 For A Typical Embedded Project

Believe it or not, most high quality embedded software I've seen has been created by organizations that spend twice as much effort on validation as they do on software creation.  That's what it takes to get good embedded software. And it is quite common for organizations that don't have this ratio to experience significant problems with their projects. But to get there, the head count ratio is often about 1:1, since developers should be spending a significant fraction of their time doing "testing" (in broad terms).

First, let's talk about head count. Good embedded software organizations tend to have about equal numbers of testers and developers (i.e., 50% tester head count, 50% developer head count). Also, typically, the testers are in a relatively independent part of the organization so as to reduce pressure to sign off on software that isn't really ready to ship. Ratios can vary significantly depending on the circumstance.  A 5:1 tester to developer ratio is something I'd expect for safety-critical flight controls in an aerospace application.  A 1:5 tester to developer ratio might be something I'd expect for ephemeral web application development. But for a typical embedded software project that is expected to be solid code with well defined functionality, I typically expect to see a 1:1 tester to developer staffing ratio.

Beyond the staffing ratio is how the team spends their time, which is not 1:1.  Typically I expect to see developers each spend about two-thirds of their time on development and one-third of their time on verification and validation (V&V). I tend to think of V&V as falling in the "testing" bin in terms of effort, in that it is about checking the work product rather than creating it. Most of the developers' V&V effort is spent on peer reviews and unit testing.

For the testing group, effort is split among software testing, system testing, and software quality assurance (SQA).  Confusingly, testing is often called "quality assurance," but by SQA what I mean is time spent creating, training the team on, and auditing the software processes themselves (i.e., checking whether you actually followed the process you were supposed to follow -- not actual testing). A common rule of thumb is that about 5%-6% of total project effort should go to SQA, split about evenly between process creation/training and process auditing. So that means on a team of 20 total developers+testers, you'd expect to see one full-time equivalent SQA person; for a team of 10, you'd expect to see one person spending half their time on SQA.
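
As a quick sanity check on that rule of thumb, here is a minimal sketch in C. The function name and the use of 5% (the low end of the 5%-6% range) are my own illustrative choices, not something prescribed anywhere:

#include <stdio.h>

/* Rule-of-thumb sketch: roughly 5%-6% of total project effort goes to SQA,
 * split about evenly between process creation/training and auditing.
 * 0.05 is the low end of that range, used here purely as an example. */
static double sqa_full_time_equivalents(double team_size)
{
    return team_size * 0.05;
}

int main(void)
{
    printf("Team of 20: %.1f FTE on SQA\n", sqa_full_time_equivalents(20.0));  /* 1.0 */
    printf("Team of 10: %.1f FTE on SQA\n", sqa_full_time_equivalents(10.0));  /* 0.5 */
    return 0;
}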

Taking all this into account, below is a rough shot at how a typical project should break down in terms of head count and effort.

Head count for a 20 person project:
  • 10 developers
  • 9 testers
  • 1 SQA (both training and auditing)
Note that this does not call out management, nor does it deal with after-release problem resolution, support, and so on. And it would be no surprise if more testers were loaded at the end of the project than the beginning, so these are average staffing numbers across the length of the project from requirements through product release.

Typical effort distribution (by hours of effort over the course of the project):
  • 33%: Developer team: Development (left-hand side of "V" including requirements through implementation)
  • 17%: Developer team: V&V (peer reviews, unit test)
  • 44%: V&V team: integration testing and system testing
  • 3%: V&V team: Process auditing
  • 3%: V&V team: Process training, intervention, improvement
If you're an agile team you might slice up responsibilities differently, and that may be OK. But in the end the message is the same: if you want to create very high quality software and ensure you have a rigorous process while doing so, expect only about a third of total project effort to go to actual development, with the rest going to verification and validation. Agile teams I've seen that violated this rule of thumb often did so at the cost of losing control of their process rigor and software quality. (I didn't say the quality was necessarily bad; rather, such teams are flying blind and don't know what their software and process quality is until after they ship and have problems -- or don't.  I've seen it turn out both ways, so you're rolling the dice if you sacrifice V&V, and especially SQA, to achieve more lines of code per hour.)

There is of course a lot more to running a project than the above. For example, there might need to be a cross-functional build team to ensure that product builds have the right configuration, are cleanly built, and so on. And running a regression test infrastructure could be assigned to either testers or developers (or both) depending on the type of testing that was included. But the above should set rough expectations.  If you differ by a few percent, probably no big deal. But if you have 1 tester for every 5 developers and your effort ratio is also 1:5, and you are expecting to ship a rock-solid industrial-control or similar embedded system, then think again.

(By way of comparison, Microsoft says they have a 1:1 tester to developer head count ratio. See: How We Test Software at Microsoft, by Page, Johnston & Rollison.)


Monday, November 17, 2014

Not Getting Software Wrong


The majority of the time spent creating software typically goes to making sure that you got the software right (and if that's not where your time goes, it probably should be!).  But, sometimes, this focus on making software right gets in the way of actually ensuring the software works.  Instead, in many cases it's more important to make sure that software is not wrong.  This may seem like just a bit of word play, but I've come to believe that it reflects a fundamental difference in how one views software quality.

Consider the following code concept for stopping four wheel motors on a robot:

// Turn off all wheel motors
  for (uint_t motor = 1; motor < maxWheels; motor++)  {
    SetSpeed(motor, OFF);
  }

Now consider this code:
// Turn off all wheel motors
  SetSpeed(leftFront,  OFF);
  SetSpeed(rightFront, OFF);
  SetSpeed(leftRear,   OFF);
  SetSpeed(rightRear,  OFF);

(Feel free to make obvious assumptions, such as leftFront being a const uint_t, and modify to your favorite naming conventions and style guidelines. None of that is the point here.)

Which code is better?   Well, based on our intro to programming course, we probably want to write the first type of code. It is more flexible, "elegant," and uses the oh-so-cool concept of iteration.  On the other hand, the second type of code is likely to get our fingers smacked with a ruler in a freshman programming class.  Where's the iteration?  Where is the flexibility? What if there are more than four items that need to be turned off? Where is the dazzling display of (not-so-advanced) computer science concepts?

But hold on a moment. It's a four-wheeled robot we're building.  Are you really going to change to a six-wheeled robot?  (Well, maybe, and I've even worked a bit with such robots.  But sticking on two extra wheels isn't all that common after the frame has been welded!)  So what are you really buying by optimizing for a change that is unlikely to happen?

Let's look at it a different way. Many times I'm not interested in elegance, clever use of computer science concepts, or even the number of lines of code. What I am interested in is whether the code is actually correct. Have you ever written the loop type of code and gotten it wrong?  Be honest!  There is a reason that "off-by-one error" has its own Wikipedia entry. How long did it take you to look at the loop and make sure it was right? You had to assume or go look up the value of maxWheels, right?  If you are writing unit tests, you need to somehow figure out a way to test that all the motors got turned off -- and account for the fact that maxWheels might not even be set to the correct value.

But, with the second set of code, it's pretty easy to see that all four wheels are getting turned off.  There's no loop to get wrong.  (You do have to make sure that the four wheel consts are set properly.)  There is no loop to test.  Cyclomatic complexity is lower because there is no looping branch.  You don't have to mentally execute the loop to make sure there is no off-by-one error. You don't have to ask whether the loop can (or should) execute zero times. Instead, with the second set of code you can just look at it and count that all four wheels are being turned off.

Now of course this is a pretty trivial example, and no doubt one that at least someone will take exception to.  But the point I'm making is the following.  The first code segment is what we were all taught to do, and arguably easier to write. But the second code segment is arguably easier to check for correctness.  In peer reviews someone might miss a bug in the first code. I'd say that missing a bug is less likely in the second code segment.

Am I saying that you should avoid loops if you have 10,000 things to initialize?  No, of course not. Am I saying you should cut and paste huge chunks of code to avoid a loop?  Nope. And maybe you actually do have a good reason to use the loop approach in your situation, even with 3 or 4 things to initialize. (If the unrolled version runs to a lot of lines of code, it is likely to be more error-prone than a well-considered loop.) All I'm saying is that when you write code, consider carefully the tradeoff between writing concise but clever code and writing code that is hard to get wrong. Importantly, that includes the possibility that a varying style will itself increase the risk of getting things wrong. How the tradeoff comes out will depend on your situation.  What I am saying is: think about what is likely to change, what is unlikely to change, and which path carries the least risk of bugs.

By the way, did you see the bug?  If you did, good for you.  If not, well, then I've already proved my point, haven't I?  The wheel number should start at 0, or the limit should be maxWheels+1, or the "<" should be "<=" -- depending on whether wheel numbers are base 0 or base 1.  As it is, assuming maxWheels is 4 as you'd expect, only 3 out of 4 wheels actually get turned off.  By the way, this wasn't an intentional trick on the reader. As I was writing this article, after a pretty long day, that's how I wrote the code. (It doesn't help that I've spent a lot of time writing code in languages with base-1 arrays instead of base-0 arrays.) I didn't even realize I'd made that mistake until I went back to check it when writing this paragraph.  Yes, really.  And don't tell me you've never made that mistake!  We all have. Or that you've never tried to write code at the end of a long day.  And thus are bugs born.  Most are caught, as this one would have been, but inevitably some aren't.  Isn't it better to write code without bugs to begin with instead of finding (most of) the bugs after you write the code?
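
For completeness, here is one corrected version of the loop, under the assumption that the wheel motors are numbered base-0 (0 through maxWheels-1), with everything else as in the original sketch:

// Turn off all wheel motors (assuming base-0 motor numbering, maxWheels == 4)
  for (uint_t motor = 0; motor < maxWheels; motor++)  {
    SetSpeed(motor, OFF);   // executes for motors 0, 1, 2, 3 -- all four wheels
  }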

Instead of trying to write code and spending all our effort to prove that whatever code we have written is correct, I would argue we should spend some of our effort on writing code that can't be wrong.  (More precisely, code that is difficult to get wrong in the first place, and easy to detect when it is wrong.)  There are many tools at our disposal to do this beyond the simple example shown.  Writing code in a style that makes very rigorous static analysis tools happy is one way. Many other style practices are designed to help with this as well, chief among them being various type checking strategies, whether automated or implemented via naming conventions. Avoiding high cyclomatic complexity is another way that comes to mind. So is creating a statechart before actually writing the code, then using a switch statement that traces directly to the statechart. Avoiding premature optimization is also important in avoiding bugs.  The list goes on.
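
As one small illustration of the statechart-plus-switch idea, here is a minimal sketch. The states and inputs (idle, running, fault, a start command, a fault flag) are hypothetical placeholders rather than from any particular project; the point is that each case maps one-to-one to a box on the statechart, so a reviewer can check the code against the diagram:

#include <stdbool.h>

typedef enum { STATE_IDLE, STATE_RUNNING, STATE_FAULT } MotorState;

// Returns the next state given the current state and the current inputs.
MotorState NextState(MotorState current, bool startCmd, bool faultDetected)
{
  MotorState next = current;        // default: remain in the current state

  switch (current) {
    case STATE_IDLE:
      if (startCmd)        { next = STATE_RUNNING; }
      break;
    case STATE_RUNNING:
      if (faultDetected)   { next = STATE_FAULT; }
      else if (!startCmd)  { next = STATE_IDLE; }
      break;
    case STATE_FAULT:
      // Latched; leaving the fault state requires an explicit reset elsewhere.
      break;
    default:
      next = STATE_FAULT;           // unexpected state value: fail safe
      break;
  }
  return next;
}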

So the next time you are looking at a piece of code, don't ask yourself "do I think this is right?"  Instead, ask yourself "how easy is it to be sure that it is not wrong?"  If you have to think hard about it -- then maybe it really is incorrect code even if you can't see a bug.  Ask whether it could be rewritten in a way that is more obviously not wrong.  Then do it.

Certainly I'm not the first to have this idea. But I see small examples of this kind of thing so often that it's worth telling this story once in a while to make sure others have paused to consider it.  Which brings us to a quote that I've come to appreciate more and more over time.  Print it out and stick it on the water cooler at work:

"There are two ways of constructing a software design: One way is to make it so simple that there are obviously no deficiencies and the other way is to make it so complicated that there are no obvious deficiencies."

        — C.A.R. Hoare, The 1980 ACM Turing Award Lecture  (CACM 24(2), Feb 1981, p. 81)

(I will admit that this quote is a bit clever and therefore not a sterling example of making a statement easy to check for correctness.  But then again he is the one who got the Turing Award, so we'll allow some slack for clever wording in his acceptance essay.)

Monday, October 13, 2014

Safety Culture

A weak safety culture makes it extremely difficult to create safe systems.

Consequences:
A poor safety culture dramatically elevates the risk of creating an unsafe product. If an organization cuts corners on safety, one should reasonably expect the result to be an unsafe outcome.

Accepted Practices:
  • Establish a positive safety culture in which all stakeholders put safety first, rigorous adherence to process is expected, and all developers are incentivized to report and correct both process and product problems.
Discussion:
A “safety culture” is the set of attitudes and beliefs employees hold toward attaining safety. Key aspects of such a culture include a willingness to tell management that there are safety problems, and an insistence that all processes relevant to safety be followed rigorously.

Part of establishing a healthy safety culture in an organization is a commitment to improving processes and products over time. For example, when new practices become accepted in an industry (for example, the introduction of a new version of the MISRA C coding style, or the introduction of a new safety standard such as ISO 26262), the organization should evaluate and at least selectively adopt those practices while formally recording the rationale for excluding and/or slow-rolling the adoption of new practices. (In general, one expects substantially all new accepted practices in an industry to be adopted over time by a company, and it is simply a matter of how aggressively this is done and in what order.)

Ideally, organizations should identify practices that will improve safety proactively instead of reactively. But regardless, it is unacceptable for an organization building safety critical systems to ignore new safety-relevant accepted practices with an excuse such as “that way was good enough before, so there is no reason to improve” – especially in the absence of a compelling proof that the old practice really was “good enough.”

Another aspect of a healthy safety culture is aggressively pursuing every potential safety problem to root cause resolution. In a safety-critical system there is no such thing as a one-off failure.  If a system is observed to behave incorrectly, then that behavior must be presumed to be something that will happen again (probably frequently) on a large deployed fleet.  It is, however, acceptable to log faults in a hazard log and then prioritize their resolution based on risk analysis such as using a risk table (Koopman 2010, ch. 28).
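
As a sketch of what a risk table lookup might look like in code, consider the following. The severity and probability categories, and the mapping between them, are illustrative placeholders only, not the specific table from Koopman 2010:

// Illustrative risk table lookup for prioritizing hazard log entries.
typedef enum { SEV_MINOR, SEV_SERIOUS, SEV_CATASTROPHIC } Severity;
typedef enum { PROB_REMOTE, PROB_OCCASIONAL, PROB_FREQUENT } Probability;
typedef enum { RISK_LOW, RISK_MEDIUM, RISK_HIGH } RiskLevel;

static const RiskLevel riskTable[3][3] = {
  /*                      REMOTE        OCCASIONAL    FREQUENT    */
  /* MINOR        */  { RISK_LOW,     RISK_LOW,     RISK_MEDIUM },
  /* SERIOUS      */  { RISK_LOW,     RISK_MEDIUM,  RISK_HIGH   },
  /* CATASTROPHIC */  { RISK_MEDIUM,  RISK_HIGH,    RISK_HIGH   },
};

// The resulting risk level drives how quickly a logged fault must be
// driven to root cause resolution.
RiskLevel AssessRisk(Severity sev, Probability prob)
{
  return riskTable[sev][prob];
}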

Along these lines, blaming a person for a design defect is usually not an acceptable root cause. Since people (developers and system operators alike) make mistakes, saying something like “programmer X made a mistake, so we fired him and now the problem is fixed” is simply scapegoating. The new replacement programmer is similarly sure to make mistakes. Rather, if a bug makes it through a supposedly rigorous process, the fact that the process didn’t prevent, detect, and catch the bug is what is broken (for example, perhaps design reviews need to be modified to specifically look for the type of defect that escaped into the field). Similarly, it is all too easy to scapegoat operators when the real problem is a poor design or even when the real problem is a defective product. In short, blaming a person should be the last alternative when all other problems have been conclusively ruled out – not the first alternative to avoid fixing the problem with a broken process or broken safety culture.

Believing that certain classes of defects are impossible to the degree that there is no point even looking for them is a sure sign of a defective safety culture. For example, saying that software defects cannot possibly be responsible for safety problems and instead blaming problems on human operators (or claiming that repeated problems simply didn’t happen) is a sure sign of a defective safety culture. See, for example, the Therac 25 radiation accidents. No software is defect free (although good ones are nearly defect free to begin with, and are improved as soon as new hazards are identified). No system is perfectly safe under all possible operating conditions. An organization with a mature safety culture recognizes this and responds to an incident or accident in a manner that finds out what really happened (with no preconceptions as to whether it might be a software fault or not) so it can be truly fixed. It is important to note that both incidents and accidents must be addressed. A “near miss” must be sufficient to provoke corrective action. Waiting for people to die (or dozens of people to die) after multiple incidents have occurred and been ignored is unacceptable (for an example of this, consider the continual O-ring problems that preceded the Challenger space shuttle accident).

The creation of safe software requires adherence to a defined process with minimal deviation, and the only practical way to ensure this is by having a robust Software Quality Assurance (SQA) function. This is not the same as thorough testing, nor is it the same as manufacturing quality. Rather than being based on testing the product, SQA is based on defining and auditing how well the development process (and other aspects of ensuring system safety) have been followed. No matter how conscientious the workers, independent checks, balances, and quantifiable auditing results are required to ensure that the process is really being followed, and is being followed in a way that is producing the desired results. It is also necessary to make sure the SQA function itself is healthy and operational.

Selected Sources:
Making the transition from creating ordinary software to safety critical software is well known to require a cultural shift that typically involves a change from an all-testing approach to quality to one that has a balance of testing and process management. Achieving this state is typically referred to as having a “safety culture” and is a necessary step in achieving safety. (Storey 1996, p. 107)  Without a safety culture it is extremely difficult, if not impossible, to create safe software. The concept of a “safety culture” is borrowed from other, non-software fields, such as nuclear power safety and occupational safety.

MISRA Software Guidelines Section 3.1.4 Assessment recommends an independent assessor to ensure that required practices are being followed (i.e., an SQA function).

MISRA provides a section on “human error management” that includes: “it is recommended that a fear free but responsible culture is engendered for the reporting of issues and errors” (MISRA Software Guidelines p. 58) and “It is virtually impossible to prevent human errors from occurring, therefore provision should be made in the development process for effective error detection and correction; for example, reviews by individuals other than the authors.”

References:


Friday, October 3, 2014

A Case Study of Toyota Unintended Acceleration and Software Safety

Oct 3, 2014:  updated with video of the lecture

Here is my case study talk on the Toyota unintended acceleration cases that have been in the news and the courts the past few years.

The talk summary and embedded slides are below.  Additional pointers:
(Please see end of post for video download and copyright info.)

= = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = =

A Case Study of Toyota Unintended Acceleration and Software Safety 

Abstract:
Investigations into potential causes of Unintended Acceleration (UA) for Toyota vehicles have made news several times in the past few years. Some blame has been placed on floor mats and sticky throttle pedals. But, a jury trial verdict was based on expert opinions that defects in Toyota's Electronic Throttle Control System (ETCS) software and safety architecture caused a fatal mishap.  This talk will outline key events in the still-ongoing Toyota UA litigation process, and pull together the technical issues that were discovered by NASA and other experts. The results paint a picture that should inform future designers of safety critical software in automobiles and other systems.

Bio:
Prof. Philip Koopman has served as a Plaintiff expert witness on numerous cases in Toyota Unintended Acceleration litigation, and testified in the 2013 Bookout trial.  Dr. Koopman is a member of the ECE faculty at Carnegie Mellon University, where he has worked in the broad areas of wearable computers, software robustness, embedded networking, dependable embedded computer systems, and autonomous vehicle safety. Previously, he was a submarine officer in the US Navy, an embedded CPU architect for Harris Semiconductor, and an embedded system researcher at United Technologies.  He is a senior member of IEEE, senior member of the ACM, and a member of IFIP WG 10.4 on Dependable Computing and Fault Tolerance. He has affiliations with the Carnegie Mellon Institute for Software Research (ISR) and the National Robotics Engineering Center (NREC).

= = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = =

Other info:
  • Download a copy of the video file set of the talk (340 MB .zip file of a web directory. Experts only!  Please do not ask me for support -- it works for me, but I don't have any details about this format beyond saying to unzip it and open Default.html in a web browser.)
All materials (slides & video) are licensed under Creative Commons Attribution BY v. 4.0.
Please include "Prof. Philip Koopman, Carnegie Mellon University" as the attribution.
If you are planning on using the materials in a course or similar, I would appreciate it if you let me know so I can track adoption.  If you need a variation from the CC BY 4.0 license (for example, to incorporate materials in a situation that is at odds with the license terms) please contact me and it can usually be arranged.

= = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = =

Monday, September 29, 2014

Go Beyond System Functional Testing To Ensure Safety

Testing alone is insufficient to ensure safety in critical systems. Other technical approaches and software development process management approaches must also be used to assure sufficient software integrity.

Consequences:
Relying upon just system functional testing to achieve safety can be expected to eventually lead to an unsafe situation in a widely released product. Even if system functional testing is completely representative of situations that will happen in practice, such testing normally won’t be long enough to see all of the infrequent events that will occur with a much larger fleet of vehicles deployed for a much longer period of time.

Accepted Practices:

  • Specifically identify and follow a process to design in safety rather than attempting to test it in after the product has already been built. The MISRA Guidelines describe an example of an automotive-specific process.
  • Include defined activities beyond hiring smart designers and performing extensive functional testing. While details might vary depending upon the project, as an example, an acceptable set of practices for critical software by the late 1990s would have included the following (assuming that MISRA Safety Integrity Level 3 were an appropriate categorization of the functions): precisely written functional specifications, use of a restricted language subset (e.g., MISRA C), a way of ensuring compilers produced correct code, configuration management, change management, automated build processes, automated configuration audits, unit testing to a defined level of coverage, stress testing, static analysis, a written safety case, deadlock analysis, justification/demonstration of test coverage, safety training of personnel, and availability of written documentation for assessment of safety (auditability of the process). (The required level of care today is, if anything, even more rigorous for such systems.)

Discussion:
There is a saying about quality: “You can’t test in quality; you have to design it in from the start.” It is well known that the same is true of safety.

Assuring safety requires more than just using capable designers and performing extensive testing (although those two factors are important). Even the best designers – like all humans – are imperfect, and even the most extensive system-level functional testing cannot hope to find everything that can go wrong across a large deployed fleet of automobiles. It should be apparent that anyone can make a mistake, even a careful designer. But beyond that, system level functional testing (e.g., driving a car around in a variety of circumstances) cannot be expected to find all the defects in software, because there are just too many situations that can occur to experience them all in testing. This is especially true if a combination of events that causes a software failure just happens to be one that the testers didn’t think of putting into the test plan. (Test plans have bugs and gaps too.) Therefore, it has long been recognized that creating safe software requires more than just trying hard to get the design right and trying really hard to test well.

Accepted practices require a holistic approach to safety, including executing a well-defined process, having a written plan to achieve safety, using techniques to ensure safety such as fault tree analysis, and auditing the process to ensure all required steps are being performed.

An accepted way of ensuring that safety has been considered appropriately is to have a written document that argues why a system is safe (sometimes called a safety case or safety argument). The safety case should give quantitative arguments as to why safety is inherent in the system. An argument that says “we tested for X hours” would be insufficient – unless it also said “and that covered 99.999% of all anticipated operating scenarios as well as thoroughly exercising every line of code” or some other type of argument that testing was thorough. After all, running a car in circles around a track is not the same level of testing as a cross-country drive over mountains. Or one that goes to Alaska in the winter and Death Valley in the summer. Or one that does so with 1000 cars to catch situations in which things inside one of those many cars just happen to line up in just the wrong way to cause a system failure. But even with the significant level of testing done by automotive companies, the safety case must also include things such as the level of peer reviews conducted, whether fault tree analysis revealed single points of failure, and so on. In other words, it’s inadequate to say “we tried really hard” or “we are really smart” or “we spent a whole lot of time testing.” It is essential to also justify that broad coverage was achieved using a variety of relevant techniques.

Selected Sources:
Beatty, in a paper aimed at educating embedded system practitioners, explains that code inspections and testing aren’t sufficient to detect many common types of errors in complex embedded systems (Beatty 2003, pg. 36). He identifies five areas that require special attention: stack overflows, race conditions, deadlocks, timing problems, and reentrancy conditions. He states that “All of these issues are prevalent in systems that employ multitasking real-time designs.”

Lists of techniques that could be applied to ensure safety beyond just testing have been well known for many years, with a relatively comprehensive example being IEC 61508 Part 7.

Even if you could test everything (which you can’t), dealing with low-probability faults that can be expected to affect a huge deployed fleet of automobiles just takes too long. “It is impossible to gain confidence about a system reliability of 100,000 years by testing,” (written in reference specifically to drive-by-wire automobiles and their requirement for a mean-time-to-failure of 1 billion hours) (Kopetz 2004, p. 32, emphasis per original)
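
For scale, the arithmetic behind that comparison is straightforward. A mean-time-to-failure of one billion hours works out to roughly:

    10^9 hours ÷ 8,760 hours/year ≈ 114,000 years

so demonstrating that level of reliability purely by testing would require on the order of a hundred thousand years of continuous, failure-free operation (or an equivalently enormous amount of fleet exposure), which is where the “100,000 years” figure comes from.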

Butler and Finelli wrote the classic academic reference on this point, stating that achieving the software reliability needed for safety critical applications will “inevitably lead to a need for testing beyond what is practical,” because the required testing time is on the order of the inverse of the acceptably low catastrophic software failure rate. (Butler 1993, p. 3, paper entitled “The infeasibility of quantifying the reliability of life-critical real-time software.”)

Knutson gives an overview of software safety practices, and makes it clear that testing isn’t enough to create a safe system: “Even if we are wary of these dangerous assumptions, we still have to recognize the limitations inherent in testing as a means of bringing quality to a system. First of all, testing cannot prove correctness. In other words, testing can show the existence of a defect, but not the absence of faults. The only way to prove correctness via testing would be to hit all possible states, which as we’ve stated previously, is fundamentally intractable.” (Knutson 2000, pg. 34). Knutson suggests peer reviews as a technique beyond testing that will help.

NASA says that “You can’t test everything. Exhaustive testing cannot be done except for the most trivial of systems.” (NASA 2004, p. 77).

Kendall presents a case study for an electronic throttle control (with mechanical fail-safes) using a two-CPU approach (a “sub Processor” and a “Main Processor”). The automotive supplier elected to follow the IEC 1508 draft standard (a draft of the IEC 61508 standard), also borrowing elements from the MISRA software guidelines. Steps that were performed include: preliminary hazard analysis with mapping to MISRA SILs, review of standards and procedures to ensure they were up to date with accepted practices; on-site audits of development processes; FMEA by an independent agency; FTA by an independent agency; Markov modeling (a technique for analyzing failure probabilities); independent documentation review; mathematical proofs of correctness; and safety validation testing. (Kendall 1996)  Important points from this paper relevant to this case include: “it is well accepted that software cannot be shown to be suitable for [its] intended use by testing alone” (id. pg. 6); “Software robustness must be demonstrated by ensuring the process used to develop it is appropriate, and that this process is rigorously followed.” (id., pg. 6); “safety validation must consider the effect of the vehicle under as many failure conditions as is possible to generate.” (id., p. 7).

Roger Rivett from Rover Group wrote a paper in 1997, based on a collaborative government-sponsored research effort, that specifically addresses how automotive manufacturers should proceed to ensure the safety of vehicles. He makes an important point that rigorous use of good software practice is required in addition to testing (Rivett 1997, pg. 3). He has four specific conclusions for achieving a level of “good practice” for safety: use a quality management system; use a safety integrity level approach; be compliant with a sector standard (e.g., the MISRA Software Guidelines); and use a third party assessment to ensure that high integrity levels have been achieved. (Rivett 1997, pg. 10).

MISRA Development Guidelines, section 3.6.1, provides a set of points that make it clear that testing is necessary, but not sufficient, to establish safety (MISRA Guidelines, pg. 49):

[Image: MISRA Testing Guidance list (MISRA Software Guidelines, p. 49)]

This last point of the MISRA Guidelines is key – testing can discover if something is unsafe, but testing alone cannot prove that a system is safe.

"Testing on its own is not adequate for assessing safety-related software."  (MISRA report 2 pg. iv) In particular, system-level testing (such as at the vehicle level), cannot hope to uncover all the possible faults or exceptional situations can will result in mishaps.

References:
  • Beatty, Where testing fails, Embedded Systems Programming, Aug 2003, pp. 36-41.
  • Butler et al., The infeasibility of quantifying the reliability of life-critical real-time software,  IEEE Trans. Software Engineering, Jan 1993, pp. 3-12.
  • IEC 61508, Functional Safety of Electrical/Electronic/Programmable Electronic Safety-related Systems (E/E/PE, or E/E/PES), International Electrotechnical Commission, 1998. Part 7.
  • Knutson, C. & Carmichael, S., Safety First: avoiding software mishaps, Embedded Systems Programming, November 2000, pp. 28-40.
  • Kopetz, H., On the fault hypothesis for a safety-critical real-time system, ASWSD 2004, LNCS 4147, pp. 31-42, 2006.
  • MISRA, (MISRA C), Guideline for the use of the C Language in Vehicle Based Software, April 1998.
  • MISRA, Development Guidelines for Vehicle Based Software, November 1994 (PDF version 1.1, January 2001).
  • MISRA, Report 2: Integrity, February 1995.
  • NASA-GB-8719.13, NASA Software Safety Guidebook, NASA Technical Standard, March 31, 2004.
  • Rivett, "Emerging Software Best Practice and how to be Compliant", Proceedings of the 6th International EAEC Congress July 1997.