Tuesday, September 13, 2016

How Safe Is The Tesla Autopilot?

It was only a matter of time until there was a fatality involving a "self-driving" car, because no car is likely to be 100% safe.  And it happened in June this year.  This week, Tesla announced a software update involving a number of strategy changes.  https://www.tesla.com/blog/upgrading-autopilot-seeing-world-radar

In this post I'm going to try to take a look at what safety claims can, and cannot, be made for the Tesla Autopilot based on publicly available information.  It's sufficient just to look at what Tesla itself says and assume it is 100% accurate to get a broad idea of where they are, and where they aren't.  Short version: they have a really long way to go to demonstrate they are at least as good as a normal vehicle.

The Math:

First, let's take Tesla's blog posting about the tragic death that occurred in June.
Tesla claims 130 million miles of Autopilot driving, and a US fatality every 94 million miles.

One might be tempted to draw an incorrect conclusion from Tesla's statement that Autopilot is safer than manual driving, because 130 million is larger than 94 million. However, that is far too simplistic an approach.  Rather, the jury is still very much out on whether Tesla Autopilot will prove safer than everyday vehicles.

From a purely statistical approach, the question is: how many miles does Tesla have to drive to demonstrate they are at least as good as a human (94 million mile mean time to fatality).

This tool:  http://reliabilityanalyticstoolkit.appspot.com/mtbf_test_calculator tells you how long you need to test to get a 94M mile Mean Time Between Failure (MTBF) with 95% confidence.  Assuming that a failure is a fatal mishap in this case, you need to test 282M miles with no fatalities to be 95% sure that the mean time is only 94 million miles.  Yes, it's hard to get that lucky.  But if you do have a mishap, you have to do more testing to distinguish between 130 million miles having been lucky, or just reflecting normal statistical fluctuations in having achieved a target average mishap rate.  If you only have one mishap, you need a total of 446M miles to reach 95% confidence.  If another mishap occurs, that extends to 592M miles for a failure budget of two mishaps, and so on.  It would be no surprise if Tesla needs about 1-2 billion miles of accumulated experience for the statistical fluctuations to settle out, assuming the system actually is as good as a human driver.

Looking at it another way, given the current data (1 mishap in 130 million miles), this tool: http://reliabilityanalyticstoolkit.appspot.com/field_mtbf_calculator tells us that Tesla has only demonstrated an MTBF of 27.4M miles or better at 95% confidence at this point, which is less than a third of the way to break-even with a human.
(Please note, I did NOT say that they are only a third as safe as a human.  What I am saying is that the data available only supports a claim of about 29.1% as good.  Beyond that, the jury is still out.)

In other words, they need a whole lot more testing to make strong claims about safety.  This is because it is really difficult (usually impractical) to test safety into a system.  You usually need to do something more, which is one of the reasons why software safety standards such as ISO 26262 exist.

Threats To Validity:

With any analysis like this, it's important to be clear about the assumptions and possible flaws in analysis, of which there are often many.  Here are some that come to mind.

- The calculations above assume that it is more or less the same software for all 130 million miles.  If Tesla makes a "major" change to software, pretty much all bets are off as to whether the new software will be better or worse in practice unless a safety critical assurance methodology (e.g., ISO 26262) has been used.  (In fact, one can argue that *any* change invalidates previous test data, but that is a fine point beyond what I want to cover here.) Tesla says they're making a dramatic switch to radar as a primary sensor with this version.  That sounds like it could be a major change.  It would be no surprise if this software version resets the clock to zero miles of experience in terms of software field reliability both for better and for worse.

- The Tesla is most definitely not a fully autonomous vehicle.  Even Tesla makes it quite clear in their public announcements that constant driver attention and supervision is required.  So the Tesla experience isn't really about fully self-driving cars per se. Rather, it is about the safety of a human+car partnership (Level 3 autonomy in NHTSA-speak). That includes both the pros (humans can react to weird things the software might get wrong), and the cons (humans easily "drop out" of systems that don't require their attention).  If Tesla moves to Level 4 autonomy later, at best they're going to have to make a very nuanced argument to take credit for Level 3 autonomy experience.

- The calculations assume random independent failures, which in general is probably not a good assumption for software.  Whether it is a valid assumption for this scenario is an interesting question.

(Calculation note:  I just plugged miles instead of hours into the MTBF calculations, because the units don't really matter as long as you are consistent.  If you want to translate to hours and back, 30 mph is not a bad approximation for vehicle speed accounting for city and highway driving.  If you'd like 90% confidence or 99% confidence feel free to plug the numbers into the tools for yourself.)

Thursday, August 25, 2016

Boehm's Top 10 Software Defect Reduction list

A while back, Barry Boehm & Vic Basili wrote a nice summary of best ways to get better quality software. Their advice still applies to embedded systems today. Below is their list (in bold) with my commentary (the parts not in bold).

1. Finding and fixing a software problem after delivery is often 100 times more expensive than finding and fixing it during the requirements and design phase

Bug fix cost gets worse as your software gets closer to deployment, because you have to not only spend a lot of time tracking down the source of the bug, but also retest the system after the fix.  It is common to find situations in which a bug "escape" into field units costs dramatically more than 100x.  Consider having to recall a fleet of cars to install a bug fix, or do maintenance visits to thousands of sites to manually install a fix.  (Over-the-air fixes have their own problems, but that's a topic for another time.)

2. Current software projects spend about 40 to 50 percent of their effort on avoidable rework.

As the joke goes, the first 90% of the project is spent on design, and the second 90% on debugging.  The cheapest way to debug is to avoid putting the bugs into the software in the first place.  The next best way is to find them at the point of introduction (e.g., peer review of design before code is written) rather than at system test.

3. About 80 percent of avoidable rework comes from 20 percent of the defects

If you have a "bug farm" that's often not because the code is bad, but rather because the underlying requirements and design are bad.  If one module has a lot of bugs you should rewrite the module rather than keep patching it.  However, if during the rewrite you might well discover that the problem isn't really the code, but rather a bad design or unclear requirements. Writing new code for a bad design ultimately won't solve the problem.

4. About 80 percent of the defects come from 20 percent of the modules, and about half the modules are defect free. 

In addition to comments for #3 above, modules with high cyclomatic complexity tend to be difficult to test and tend to be more bug-prone.  Keeping a limit on complexity can help with this problem.

5. About 90 percent of the downtime comes from, at most, 10 percent of the defects.

It makes sense to weight testing on the areas that are the highest risk.  For desktop software this often corresponds to the common use cases.  For embedded systems and other mission-critical systems this also means testing failure detection and recovery for high-cost failures.

6. Peer reviews catch 60 percent of the defects.

Yes, really.  Peer reviews catch the majority of defects.  Why aren't you doing them?  (If you are doing them, are they catching at least half the defects?)

7. Perspective-based reviews catch 35 percent more defects than nondirected reviews.

When you do reviews, give each reviewer a checklist with a different set of areas to think about or different role to play.  For example, control flow, data flow, exception handling, testability, coding style.

8. Disciplined personal practices can reduce defect introduction rates by up to 75 percent.

As much fun as it is to be a coding cowboy, on average even the best programmer will introduce fewer bugs by following a methodical engineering practice rather than slinging code. As mentioned above, the cheapest bugs to fix are the ones that never made it into the code.  Beyond this, there are practices such as PSP and TSP that are shown to dramatically improve quality without really changing productivity.

9. All other things being equal, it costs 50 percent more per source instruction to develop high-dependability software products than to develop low-dependability software products. However, the investment is more than worth it if the project involves significant operations and maintenance costs.

In other words, if a product recall puts your company out of business, it's worth investing in good software quality up front.  In my experience if you are already shipping a mission-critical product you're already spending that 50 percent more (and then some), but still shipping defects.  This isn't saying spend even more.  Rather, doing peer reviews and some other basic software quality practices you can improve quality at the same cost you're already spending.  Testing software into submission is simply not the way to go.

10. About 40 to 50 percent of user programs contain nontrivial defects.

If you have programmable features, your customers will have bugs in what they do.  And don't forget that your financial management spreadsheets are user programs (i.e., can, and often do have bugs).

Items #1 - #10 from:
  Boehm & Basili, Software Defect Reduction Top 10 List, IEEE Computer, Jan. 2001, pp. 135-137.

You can read the original three-page article here:

Monday, July 25, 2016

Road Warrior Items

This is a different flavor of posting.  For the past couple years I've been on the road a lot and built a collection of useful gadgets to have on a trip.  So here is a product recommendation posting. There might be better alternatives, but these are all items I use on my trips every week. As they say, your mileage may vary.  (If you don't want to read a product list, just skip to this post instead.)  I'll get back to technical posting next time.


Second screen for my Dell Windows 7 laptop.  Thin and light.  Plugs into USB.  Doesn't need a power brick.  Sorry, but I have no idea how it plays with Mac.

USB charging protector so that when you plug into a rental car it doesn't try to download your phone book, airplanes don't try to connect to your tablet file system, etc.  Or worse, actually attack your phone via juice jacking. This one also supports at least some phone fast charge signals so you get the quick charge rate.  Lots of people use a "c" word to describe this, but better to avoid it to keep the blog safe for mindless nanny filters.
PortaPow USB Block

Travel wireless router.  Input can be wired Ethernet or hotel WiFi.  Output is a Wifi signal that you can use for your multiple devices.  Handy if you want to pay for one internet connection at a hotel and share with multiple devices.  Arguably provides a little more security due to built-in NAT (but not bulletproof). It takes a couple minutes to set it up on a new location, but pretty painless once set up.
ASUS Pocket Router

Power plug adapter. Universal international power adapter that works in every plug I've ever seen.  It's super clever.  (It is a plug adapter, not a voltage transformer.)
Kikkerland UL03 Power Adapter

USB AC charger. Travel size USB power modules so you can keep one in your suitcase. Slightly bigger than the official Apple ones, but a lot cheaper.  110/220V compatible.
USB car charger. For keeping your phone going when you're using it to serve up travel directions
In-ear headset. Noise reducing in-ear plugs with memory foam for a tight fit.  Great for times you don't want an over-hear headset.
Koss ear phones

Over-ear noise cancelling headset.  Yes, this is the one you always see on frequent travelers.  It really does make a difference in reduce wear and tear on your mind from the constant low-frequency roar of jet engines.  You can still hear people talking (although it is reduced; I pull back one ear when talking to a flight attendant).  Gets multiple flights out of a single AAA battery (if you remember to turn it off before you stow it). The one I got came with an iPhone cable so you can also use it for phone calls (or skype) home from a somewhat noisy location.  I like the ones like these that have full ear "cans" that fully cover the ear so they provide noise reduction even when turned off.
Bose QC 15
I see that these are now discontinued, and I don't have an opinion on newer models, but generally I've had good luck with Bose headsets so probably the newer one is fine too.  (Bose QC 25)

Laser pointer and slide advance.  Bright green laser pointer, forward/backward buttons for slides shows, and a built-in timer.  I use this every week.
Logitech R800

The charger electronics are the "cheap" type which might have random build quality, but none of the above have given me any problems after significant use.


Computer travel backpack.  This is the best backpack I have used for laptop and other stuff.  The top zippered compartment is way more useful than I had expected.  Very durable. Love it.  I see it all the time on other frequent business travelers.
SwissGear Blue Ibex backpack

Small rollaboard. This one is perfect for overnight trips on regional jets.  If fits even into the small bins perfectly.  With just a little bit of care it has held up impressively (I wear out bags quickly and this one is holding its own).  A lot lighter than the Tumi rollaboard that it replaced.  Expensive, but long term worth it compared to buying a new bag every 20 trips.

Semi-disposable canvas bags.  Medium-weight drawstring "beach backpack" bags.  Use for light groceries when they charge for bags at checkout, keeping your stuff together if you have your backpack in the overhead, keeping your pillow clean if you need to stash it on an airplane, dirty laundry, etc.  Folds up to be pretty small.  (No doubt plastic ones would fold up smaller, but I like the canvas fabric in this one).  About $3 apiece but you need to buy a dozen at a time.  Alternately bring your favorite beach backpack.
Drawstring Bags

Miscellany bags. Durable small zippered bags to carry USB cables, chargers, tea bags, and so on.
Travel Pouches

Neck pillow.  This is the only travel neck pillow that really works for me on redeye flights. It squishes down to be moderately compact if you put a shoelace or conference tag lanyard around it to tie it up.  I modified it with safety pins holding a small loop of stretchy paracord to keep it closed.  Note that you put the thick end in front under your chin, and put the opening behind your head.
Ergonomic Travel Pillow

Travel liquid containers. Small and easy to squeeze silicone bottles.  Great for shampoo, skin lotion, sunscreen.  Get the 1.25 oz size for carry-on.  They are color coded (mine are green, blue, pink).
iNeibo Bottles


If you like 3-D printing then I recommend the following free designs:

One-piece business card holder: http://www.thingiverse.com/thing:31281

Card case for holding assorted cards not in your wallet:  http://www.thingiverse.com/thing:66327

Travel battery holders (I travel with two AA and four AAA): http://www.thingiverse.com/thing:57281

Cable clip for USB cables: https://www.thingiverse.com/thing:69983

Travel-size toothpaste refill adapter: http://www.thingiverse.com/thing:158910

All of the above will no doubt require the usual fiddling with scaling, but I've found them very useful once I get them the way I like them.


Finally, there is the stress of dealing with the rental car collision coverage dilemma. Do you buy the expen$ive daily collision waiver from the rental company?  If you don't, do you really believe your company will pay out if there is rental car damage? (Depends on your company I'm sure, but I've heard stories of this not working out well.) Or do you take your chances and hope it comes out OK?

For a couple decades now I've carried a Diners Club personal card that comes with primary collision insurance if you use it to pay for the car rental. The US flavor of Diners Club is now just a particular bank logo on an ordinary Mastercard, so acceptance isn't an issue. I've had to use the insurance twice over the years (minor car damage while parked) and no muss, no fuss.  They just took care of it with no deductible.  There are various other cards with various deals. The thing to look for is "primary" (first-payer) coverage. Your other credit card may say it covers rentals, but most of them are "secondary" and thus only pay what's left after your personal car insurance takes care of most of it and you take the hit to your insurance premiums.  I've found that hotel parking lots are particularly dangerous places for scratch-and-dent damage, so worth looking into this.

Here's a good list of credit cards useful for this (I get no compensation from this referral):
Most come with annual fee or per-use fee, but if you rent even a few days a year it pays off quickly. Of course it is important to pay attention to which locations are covered if you're traveling away from your home country.

Happy Travels!

Monday, May 30, 2016

Top 5 Embedded Software Problem Areas

The biggest problems I see in industry code reviews are code complexity, real time performance, code quality, weak development process, and dependability gaps. Here's an index into blog postings and other sources that explains the problems and how to deal with them.


Several times a year I fly or drive (or webex) to visit an embedded system design team and give them some feedback on their embedded software. I've done this perhaps 175 times so far (and counting). Every project is different and I ask different questions every time. But the following are the top five areas I've found that need attention in the past few years.  "(Blog)" pointers will send you to my previous blog postings on these topics.

How does your project look through the lens of these questions?

(1) How is your code complexity?
  • Is all your code in a single .c file (single huge main.c)?
    • If so, you should break it up into more bite-sized .c and .h files (Blog)
  • Do you have subroutines more than a printed page or so long (more than 50-100 lines of code)?
    • If so, you should improve modularity. (Blog)
  • Do you have "if" statements nested more than 2 or 3 deep?
    • If so, in embedded systems much of the time you should be using a state machine design approach instead of a flow chart (or no chart) design approach. (Book Chapter 13). Or perhaps you need to untangle your exception handling. (Blog)
    • If you have very high cyclomatic complexity you're pretty much guaranteed to have bugs that you won't find in unit test nor peer review. (Blog)
  • Did you follow an appropriate style guideline and use a static analysis tool?
    • Your code should compile with zero warnings for an appropriate warning set. Consider using the MISRA C rule set (Blog) and a good static analysis tool. (Blog)
  • Do you limit variable scope aggressively, or is your code full of global variables?
    • You should have essentially zero global variables. (Blog)  
    • It's not that hard to get rid of globals if you look at things the right way. (Blog)

(2) How do you know your real time code will meet its deadlines?
  • Did you set up your watchdog timer properly to detect timing problems and task death?
    • The watchdog has to detect the death or hang of each and every task in the system to provide a reasonable level of protection. (Blog)
    • How long to set the watchdog is a little trickier than you might think. (Blog)
  • Do you know the worst case execution time and deadline for all your time-sensitive tasks?
    • Just because the system works sometimes in testing doesn't mean it will work all the time in the field, whether the failure is due to a timing fault or other problems. (Blog)
  • Did you do scheduling math for your system, such as main loop scheduling?
    • You need to actually do the scheduling analysis. (Blog ; Blog)
    • Less than 100% CPU usage does not mean you'll meet deadlines unless you can verify you meet some special conditions, and probably you don't meet those conditions if you didn't know what they were. (Blog
  • Did you consider worst case blocking time (interrupts disabled and/or longest non-context-switched task situation)?
    • If you have one long-running task that ties up the CPU only once per day, then you'll miss deadlines when it runs once per day.  But perhaps you get lucky on timing most days and don't notice this in testing. (Blog)
  • Did you follow good practices for interrupts?
    • Interrupts should be short -- really short. (Blog) So short you aren't tempted to re-enable interrupts in them. (Blog)
    • There are rules for good interrupts -- follow them! (Blog)
(3) How do you know your software quality is good enough?
  • What's your unit test coverage?
    • If you haven't exercised, say, 95% of your code in unit test, you're waiting to find those bugs until later, when it's more expensive to find them. (Blog) (There is an assumption that the remaining 5% are exception cases that should "never" happen, but it's even better to exercise them too.)
    • In general, you should have coverage metrics and traceability for all your testing to make sure you are actually getting what you want out of testing. (Blog)
  • What's your peer review coverage?
    • Peer review finds half the defects for 10% of the project cost. (Blog)  But only if you do the reviews! (Blog)
  • Are your peer reviews finding at least 50% of your defects?
    • If you're finding more than 50% of your defects in test instead of peer review, then your peer reviews are broken. It's as simple as that. (Blog)
    • Here's a peer review checklist to get you started. (Blog)
  • Does your testing include software specific aspects?
    • A product-level test plan is pretty much sure to miss some potentially big software bugs that will come back to bite you in the field. You need a software-specific test plan as well. (Blog)
  • How do you know you are actually following essential design practices, such as protecting shared variables to avoid concurrency bugs?
    • Are you actually checking your code against style guidelines using peer review and static analysis tools? (Blog)
    • Do your style guidelines include not just cosmetics, but also technical practices such as disabling task switching or using a mutex when accessing a shared variable? (Blog) Or avoiding stack overflow? (Blog)

(4) Is your software process methodical and rigorous enough?
  • Do you have a picture showing the steps in your software and problem fix process?
    • If it's just in your head then probably every developer has a different mental picture and you're not all following the same process. (Book Chapter 2)
  • Are there gaps in the process that are causing you pain or leading to problems?
    • Very often technical defects trace back to cutting corners in the development process or skipping review/test steps.
    • Skipping peer reviews and unit test in the hopes that product testing catches all the problems is a dangerous game. In the end cutting corners on development process takes at least as long and tends to ship with higher defect rates.
  • Are you doing a real usability analysis instead of just having your engineers wing it?
    • Engineers are a poor proxy for users. Take human usability seriously. (BlogBook Chapter 15)
  • Do you have configuration management, version control, bug tracking, and other basic software development practices in place?
    • You'd think we would not have to ask. But we find that we do.
    • Do you prioritize bugs based on value to project rather than severity of symptoms? (Blog)
  • Is your test to development effort ratio appropriate? Usually you should have twice as many hours on test+reviews than creating the design and implementation
    • Time and again when we poll companies doing a reasonable job on embedded software of decent quality we find the following ratios.  One tester for every developer (1:1 head count ratio).  Two test/review hours (including unit test and peer review) for every development hour (2:1 effort ratio). The companies that go light on test/review usually pay for it with poor code quality. (Blog)
  • Do you have the right amount of paperwork (neither too heavy nor too light)
    • Yes, you need to have some paper even if you're doing Agile. (Blog) It's like having ballast in a ship. Too little and you capsize. Too much and you sink. (Probably you have too little unless you work on military/aerospace projects.)  And you need the right paper, not just paper for paper's sake.  (Book Chapters 3-4)

(5) What about dependability aspects?
  • Have you considered maintenance issues, such as patch deployment?
    • If your product is not disposable, what happens when you need to update the firmware?
  • Have you done stress testing and other evaluation of robustness?
    • If you sell a lot of units they will see things in the field you never imagined and will (you hope) run without rebooting for years in many cases. What's your plan for testing that? (Blog)
    • Don't forget specialty issues such as EEPROM wearout (Blog), time/date management (Blog), and error detection code selection (Blog).
  • Do your requirements and design address safety and security?
    • Maybe safety (Blog) and security (Blog) don't matter for you, but that's increasingly unlikely. (Blog)
    • Probably if this is the first time you've dealt with safety and security you should either consult an internal expert or get external help. Some critical aspects for safety and security take some experience to understand and get right, such as avoiding security pittfalls (Blog) and eliminating single points of failure. (Blog)
    • And while we're at it, you do have written, complete, and measurable requirements for everything, don't you?  (Book Chapters 5-9)
For another take on these types of issues, see my presentation on Top 43 embedded software risk areas (Blog). There is more to shipping a great embedded system than answering all the above questions. (Blog) And I'm sure everyone has their own list of things they like to look for that can be added. But, if you struggle with the above topics, then getting everything else right isn't going to be enough.

Finally, the point of my book is to explain how to detect and resolve the most common issues I see in design reviews. You can find more details in the book on most of these topics beyond the blog postings. But the above list and links to many of the blog postings I've made since releasing the book should get you started.

Monday, May 16, 2016

Robustness Testing

I have been doing research in the area of robustness testing for many years, and once in a while I have to explain how that approach to testing fits into the bigger umbrella of fault injection and related ideas. Here's a summary of typical approaches (there are many variations and extensions beyond these as you might imagine).  At the end is a description of the robustness testing work my group has been doing over many years.

Mutation Testing:
Goal: Evaluate coverage/effectiveness of an existing test suite. (Also known as "bebugging.")
Approach: Modify System under Test (SuT) with a hypothetical bug and see if an existing test suite finds it.
Narrative: I have a test suite. I wonder how thorough it is?  Let me put a bug (mutation) into my code and see if my test suite finds it. If it finds all the mutations I insert, maybe my test suite is thorough.
Fault Model: Source code bug that is undetected by testing
Strengths: Can find problems even if code was already 100% branch-covered in test suite (e.g., mutate a comparison to > instead of >= to see if test suite exercises the equality case for that comparison)
Limitations: Requires an existing test suite (but, can combine with automated test generation to create additional tests automatically). Effectiveness heavily depends on inserting realistic mutations, which is not necessarily so easy.

Classical Fault Injection Testing:
Goal: Determine robustness of SuT when its code or data is corrupted.
Approach: Corrupt the binary image of the SuT code or corrupt data during run-time to see if system crashes, is unsafe, or tolerates the fault.
Narrative: I have a running system. I wonder what happens if I flip a bit in the data -- does the system crash?  I wonder what happens if I corrupt the compiled code -- does the system act in an unsafe way?
Fault Model: Hardware bit flip or software-based memory corruption.
Strengths: Can find realistic failures caused by single-event upsets, hardware faults, and software defects that corrupt computational state.
Limitations: The fault model is memory and program bit-level corruption. Fault injection testing is useful for high-integrity systems deployed in large volume (they have to survive even very infrequent faults), and essential for aviation and space systems that will see high rates of single event upsets.  But it is arguably a bit excessive for non-safety-critical applications.

Robustness Testing:
Goal: Determine robustness of SuT when it is fed exceptional or unusual values.
Approach: Corrupt the inputs to the SuT during run-time to see if system crashes, is unsafe, or tolerates the fault.
Narrative: I have a running system or subsystem. I wonder what happens if the inputs from other components or sensors have garbage, unusual, or random values -- does the system crash or act in an unsafe way?
Fault Model: Some other module than the SuT has a bug that results in exceptional data being sent to the SuT.
Strengths: Can find realistic failures caused by likely run-time faults such as null pointers, NaNs (Not-a-Number floating point values), corrupted input data, and in general faults in modules that are not the SuT, but rather other software or sensors present in the system that might have bugs that generate exceptional data.
Limitations: The fault model is generally that some other piece of software has a bug and that bug will generate bad data that kills the SuT.  You have decide how likely that is and whether it's OK in such a case for the SuT to misbehave. We have found many situations in which such test results are important, even in systems that are not safety critical.

A classical form of robustness testing is "fuzzing," in which random inputs are tossed into a system to see what happens rather than carefully selected specific input values. My research group's work centers on finding efficient ways to do robustness testing so that fewer tests are needed to find system-killer values.

The Ballista project pioneered efficient robustness testing in the late 1990s, and is still active today on stress testing robots and autonomous vehicles.

Two key ideas of Ballista are:
  • Have a dictionary of interesting exceptional values so you don't have to stumble onto them by chance (e.g., just try a NULL pointer straight out rather than wait for a random number generator to happen to generate a zero value as a fuzzing input value)
  • Make it easy to generate tests by basing that dictionary on the data types taken by a function call instead of the function being performed.  So we don't care if it is a memory management function or a file write being tested - we just say for example that if it's a memory pointer, let's try NULL as an input value to the function. This gets us excellent scalability and portability across systems we test.
A key benefit of Ballista and other robustness testing approaches is that they look for holes in the code itself, rather than holes in the test cases. Consider that most test coverage approaches (including mutation testing) are interested in testing all the code that is there (which is a good thing!). In contrast, robustness testing goes beyond code coverage to find the places where you should have had code to handle exceptional situations, but that code is missing. In other words, robustness testing often finds bugs due to missing code that should have been there. We find that it is pretty typical for software to be non-robust unless this type of testing has been done it identify such problems.

You can find more about our research at the Stress Tests for Autonomy Architectures (STAA) project page, which includes video of what goes wrong when you stress test a couple robotic systems: