Monday, May 30, 2016

Top 5 Embedded Software Problem Areas

Several times a year I fly or drive (or webex) to visit an embedded system design team and give them some feedback on their embedded software. I've done this perhaps 175 times so far (and counting). Every project is different and I ask different questions every time. But the following are the top five areas I've found that need attention in the past few years.  "(Blog)" pointers will send you to my previous blog postings on these topics.

How does your project look through the lens of these questions?

(1) How is your code complexity?
  • Is all your code in a single .c file (single huge main.c)?
    • If so, you should break it up into more bite-sized .c and .h files (Blog)
  • Do you have subroutines more than a printed page or so long (more than 50-100 lines of code)?
    • If so, you should improve modularity. (Blog)
  • Do you have "if" statements nested more than 2 or 3 deep?
    • If so, in embedded systems much of the time you should be using a state machine design approach instead of a flow chart (or no chart) design approach. (Book Chapter 13). Or perhaps you need to untangle your exception handling. (Blog)
    • If you have very high cyclomatic complexity you're pretty much guaranteed to have bugs that you won't find in unit test nor peer review. (Blog)
  • Did you follow an appropriate style guideline and use a static analysis tool?
    • Your code should compile with zero warnings for an appropriate warning set. Consider using the MISRA C rule set (Blog) and a good static analysis tool. (Blog)
  • Do you limit variable scope aggressively, or is your code full of global variables?
    • You should have essentially zero global variables. (Blog)  
    • It's not that hard to get rid of globals if you look at things the right way. (Blog)

(2) How do you know your real time code will meet its deadlines?
  • Did you set up your watchdog timer properly to detect timing problems and task death?
    • The watchdog has to detect the death or hang of each and every task in the system to provide a reasonable level of protection. (Blog)
    • How long to set the watchdog is a little trickier than you might think. (Blog)
  • Do you know the worst case execution time and deadline for all your time-sensitive tasks?
    • Just because the system works sometimes in testing doesn't mean it will work all the time in the field, whether the failure is due to a timing fault or other problems. (Blog)
  • Did you do scheduling math for your system, such as main loop scheduling?
    • You need to actually do the scheduling analysis. (Blog ; Blog)
    • Less than 100% CPU usage does not mean you'll meet deadlines unless you can verify you meet some special conditions, and probably you don't meet those conditions if you didn't know what they were. (Blog
  • Did you consider worst case blocking time (interrupts disabled and/or longest non-context-switched task situation)?
    • If you have one long-running task that ties up the CPU only once per day, then you'll miss deadlines when it runs once per day.  But perhaps you get lucky on timing most days and don't notice this in testing. (Blog)
  • Did you follow good practices for interrupts?
    • Interrupts should be short -- really short. (Blog) So short you aren't tempted to re-enable interrupts in them. (Blog)
    • There are rules for good interrupts -- follow them! (Blog)
(3) How do you know your software quality is good enough?
  • What's your unit test coverage?
    • If you haven't exercised, say, 95% of your code in unit test, you're waiting to find those bugs until later, when it's more expensive to find them. (Blog) (There is an assumption that the remaining 5% are exception cases that should "never" happen, but it's even better to exercise them too.)
    • In general, you should have coverage metrics and traceability for all your testing to make sure you are actually getting what you want out of testing. (Blog)
  • What's your peer review coverage?
    • Peer review finds half the defects for 10% of the project cost. (Blog)  But only if you do the reviews! (Blog)
  • Are your peer reviews finding at least 50% of your defects?
    • If you're finding more than 50% of your defects in test instead of peer review, then your peer reviews are broken. It's as simple as that. (Blog)
    • Here's a peer review checklist to get you started. (Blog)
  • Does your testing include software specific aspects?
    • A product-level test plan is pretty much sure to miss some potentially big software bugs that will come back to bite you in the field. You need a software-specific test plan as well. (Blog)
  • How do you know you are actually following essential design practices, such as protecting shared variables to avoid concurrency bugs?
    • Are you actually checking your code against style guidelines using peer review and static analysis tools? (Blog)
    • Do your style guidelines include not just cosmetics, but also technical practices such as disabling task switching or using a mutex when accessing a shared variable? (Blog) Or avoiding stack overflow? (Blog)

(4) Is your software process methodical and rigorous enough?
  • Do you have a picture showing the steps in your software and problem fix process?
    • If it's just in your head then probably every developer has a different mental picture and you're not all following the same process. (Book Chapter 2)
  • Are there gaps in the process that are causing you pain or leading to problems?
    • Very often technical defects trace back to cutting corners in the development process or skipping review/test steps.
    • Skipping peer reviews and unit test in the hopes that product testing catches all the problems is a dangerous game. In the end cutting corners on development process takes at least as long and tends to ship with higher defect rates.
  • Are you doing a real usability analysis instead of just having your engineers wing it?
    • Engineers are a poor proxy for users. Take human usability seriously. (BlogBook Chapter 15)
  • Do you have configuration management, version control, bug tracking, and other basic software development practices in place?
    • You'd think we would not have to ask. But we find that we do.
    • Do you prioritize bugs based on value to project rather than severity of symptoms? (Blog)
  • Is your test to development effort ratio appropriate? Usually you should have twice as many hours on test+reviews than creating the design and implementation
    • Time and again when we poll companies doing a reasonable job on embedded software of decent quality we find the following ratios.  One tester for every developer (1:1 head count ratio).  Two test/review hours (including unit test and peer review) for every development hour (2:1 effort ratio). The companies that go light on test/review usually pay for it with poor code quality. (Blog)
  • Do you have the right amount of paperwork (neither too heavy nor too light)
    • Yes, you need to have some paper even if you're doing Agile. (Blog) It's like having ballast in a ship. Too little and you capsize. Too much and you sink. (Probably you have too little unless you work on military/aerospace projects.)  And you need the right paper, not just paper for paper's sake.  (Book Chapters 3-4)

(5) What about dependability aspects?
  • Have you considered maintenance issues, such as patch deployment?
    • If your product is not disposable, what happens when you need to update the firmware?
  • Have you done stress testing and other evaluation of robustness?
    • If you sell a lot of units they will see things in the field you never imagined and will (you hope) run without rebooting for years in many cases. What's your plan for testing that? (Blog)
    • Don't forget specialty issues such as EEPROM wearout (Blog), time/date management (Blog), and error detection code selection (Blog).
  • Do your requirements and design address safety and security?
    • Maybe safety (Blog) and security (Blog) don't matter for you, but that's increasingly unlikely. (Blog)
    • Probably if this is the first time you've dealt with safety and security you should either consult an internal expert or get external help. Some critical aspects for safety and security take some experience to understand and get right, such as avoiding security pittfalls (Blog) and eliminating single points of failure. (Blog)
    • And while we're at it, you do have written, complete, and measurable requirements for everything, don't you?  (Book Chapters 5-9)
For another take on these types of issues, see my presentation on Top 43 embedded software risk areas (Blog). There is more to shipping a great embedded system than answering all the above questions. (Blog) And I'm sure everyone has their own list of things they like to look for that can be added. But, if you struggle with the above topics, then getting everything else right isn't going to be enough.

Finally, the point of my book is to explain how to detect and resolve the most common issues I see in design reviews. You can find more details in the book on most of these topics beyond the blog postings. But the above list and links to many of the blog postings I've made since releasing the book should get you started.

Monday, May 16, 2016

Robustness Testing

I have been doing research in the area of robustness testing for many years, and once in a while I have to explain how that approach to testing fits into the bigger umbrella of fault injection and related ideas. Here's a summary of typical approaches (there are many variations and extensions beyond these as you might imagine).  At the end is a description of the robustness testing work my group has been doing over many years.

Mutation Testing:
Goal: Evaluate coverage/effectiveness of an existing test suite. (Also known as "bebugging.")
Approach: Modify System under Test (SuT) with a hypothetical bug and see if an existing test suite finds it.
Narrative: I have a test suite. I wonder how thorough it is?  Let me put a bug (mutation) into my code and see if my test suite finds it. If it finds all the mutations I insert, maybe my test suite is thorough.
Fault Model: Source code bug that is undetected by testing
Strengths: Can find problems even if code was already 100% branch-covered in test suite (e.g., mutate a comparison to > instead of >= to see if test suite exercises the equality case for that comparison)
Limitations: Requires an existing test suite (but, can combine with automated test generation to create additional tests automatically). Effectiveness heavily depends on inserting realistic mutations, which is not necessarily so easy.

Classical Fault Injection Testing:
Goal: Determine robustness of SuT when its code or data is corrupted.
Approach: Corrupt the binary image of the SuT code or corrupt data during run-time to see if system crashes, is unsafe, or tolerates the fault.
Narrative: I have a running system. I wonder what happens if I flip a bit in the data -- does the system crash?  I wonder what happens if I corrupt the compiled code -- does the system act in an unsafe way?
Fault Model: Hardware bit flip or software-based memory corruption.
Strengths: Can find realistic failures caused by single-event upsets, hardware faults, and software defects that corrupt computational state.
Limitations: The fault model is memory and program bit-level corruption. Fault injection testing is useful for high-integrity systems deployed in large volume (they have to survive even very infrequent faults), and essential for aviation and space systems that will see high rates of single event upsets.  But it is arguably a bit excessive for non-safety-critical applications.

Robustness Testing:
Goal: Determine robustness of SuT when it is fed exceptional or unusual values.
Approach: Corrupt the inputs to the SuT during run-time to see if system crashes, is unsafe, or tolerates the fault.
Narrative: I have a running system or subsystem. I wonder what happens if the inputs from other components or sensors have garbage, unusual, or random values -- does the system crash or act in an unsafe way?
Fault Model: Some other module than the SuT has a bug that results in exceptional data being sent to the SuT.
Strengths: Can find realistic failures caused by likely run-time faults such as null pointers, NaNs (Not-a-Number floating point values), corrupted input data, and in general faults in modules that are not the SuT, but rather other software or sensors present in the system that might have bugs that generate exceptional data.
Limitations: The fault model is generally that some other piece of software has a bug and that bug will generate bad data that kills the SuT.  You have decide how likely that is and whether it's OK in such a case for the SuT to misbehave. We have found many situations in which such test results are important, even in systems that are not safety critical.

A classical form of robustness testing is "fuzzing," in which random inputs are tossed into a system to see what happens rather than carefully selected specific input values. My research group's work centers on finding efficient ways to do robustness testing so that fewer tests are needed to find system-killer values.

The Ballista project pioneered efficient robustness testing in the late 1990s, and is still active today on stress testing robots and autonomous vehicles.

Two key ideas of Ballista are:
  • Have a dictionary of interesting exceptional values so you don't have to stumble onto them by chance (e.g., just try a NULL pointer straight out rather than wait for a random number generator to happen to generate a zero value as a fuzzing input value)
  • Make it easy to generate tests by basing that dictionary on the data types taken by a function call instead of the function being performed.  So we don't care if it is a memory management function or a file write being tested - we just say for example that if it's a memory pointer, let's try NULL as an input value to the function. This gets us excellent scalability and portability across systems we test.
A key benefit of Ballista and other robustness testing approaches is that they look for holes in the code itself, rather than holes in the test cases. Consider that most test coverage approaches (including mutation testing) are interested in testing all the code that is there (which is a good thing!). In contrast, robustness testing goes beyond code coverage to find the places where you should have had code to handle exceptional situations, but that code is missing. In other words, robustness testing often finds bugs due to missing code that should have been there. We find that it is pretty typical for software to be non-robust unless this type of testing has been done it identify such problems.

You can find more about our research at the Stress Tests for Autonomy Architectures (STAA) project page, which includes video of what goes wrong when you stress test a couple robotic systems:

Saturday, April 16, 2016

Challenges in Autonomous Vehicle Testing and Validation

It's a lot of work to demonstrate that self-driving cars will actually work properly.  Testing alone is probably not going to be enough, and probably there will need to be some clever architectural approaches as well.  Last week I gave a talk at the SAE World Congress in Detroit about this and related challenges.

Link to paper 
Link to slides  (also see slideshare version by scrolling down).

Challenges in Autonomous Vehicle Testing and Validation 
     Philip Koopman & Michael Wagner
     Carnegie Mellon University; Edge Case Research LLC
     SAE World Congress, April 14, 2016

Software testing is all too often simply a bug hunt rather than a well considered exercise in ensuring quality. A more methodical approach than a simple cycle of system-level test-fail-patch-test will be required to deploy safe autonomous vehicles at scale. The ISO 26262 development V process sets up a framework that ties each type of testing to a corresponding design or requirement document, but presents challenges when adapted to deal with the sorts of novel testing problems that face autonomous vehicles. This paper identifies five major challenge areas in testing according to the V model for autonomous vehicles: driver out of the loop, complex requirements, non-deterministic algorithms, inductive learning algorithms, and fail operational systems. General solution approaches that seem promising across these different challenge areas include: phased deployment using successively relaxed operational scenarios, use of a monitor/actuator pair architecture to separate the most complex autonomy functions from simpler safety functions, and fault injection as a way to perform more efficient edge case testing. While significant challenges remain in safety-certifying the type of algorithms that provide high-level autonomy themselves, it seems within reach to instead architect the system and its accompanying design process to be able to employ existing software safety approaches.

Wednesday, March 23, 2016

Automotive Remote Keyless Entry Security

Recently there has been another round of reports on the apparent insecurity of remote keyless entry devices -- the electronic key fobs that open your car doors with a button press or even hands-free.  In this case it's not a lot to get excited about as I'll explain, but in general this whole area could use significant improvement because there are some serious concerns.  You'd think that in the 20+ years since I was first involved in this area the industry would get this stuff right on a routine basis, but the available data suggests otherwise. The difference now is that attackers are paying attention to these types of systems.

The latest attack involves a man-in-the-middle intercept that relays signals back and forth between the car and an owner's key fob:

Here's why this particular risk might be overblown.  In systems of this type typically the signal sent by the car is an inductively coupled low frequency signal with a range of about a meter.  That's how the car knows you're actually standing by the door and not sipping a latte at a cafe 10 meters away.  So having a single intercept box near the car won't work.  Typically it's going to be hard to transmit an inductively coupled signal from your parking lot to your bedroom with any reasonably portable intercept device (unless car is really really close to your bedroom).  That's most likely why this article says there must be TWO intercept devices: one near the car, and one near the keys.  So I wouldn't worry about someone using this particular attack to break into a car in the middle of the night, because if they can get an intercept box onto your bedroom dresser you have bigger problems.

The more likely attack is someone walking near you in a shopping mall or at an airport who has targeted your specific car.  They can carry an intercept device near your pocket/bag while at the same time putting an intercept device near your car.  No crypto hack required -- it's a classic relay attack.  Sure, it could happen, but a little far fetched for most folks unless you have an exceptionally valuable car.  And really, there are often easier ways to go than this. Attackers might be able to parlay this into a playback attack if the car's crypto is stupid enough -- but at some point they have to get within a meter or so of your physical key to ping it and have a shot at such an attack.

If you have a valuable car and already have your passport and non-contact credit cards in a shielded case, it might be worthwhile putting your keys in a Faraday cage when out of the house (perhaps an Altoids box).  But I'd avoid the freezer at home as both unnecessary and possibly producing condensate inside your device that could ruin it.

The more concerning thing is that devices to break into cars have been around for a while and there is no reason to believe they are based on the attack described in this article.  They could simply be exploiting bad security design, possibly without proximity to the legitimate transmitter.  Example scenarios include badly designed crypto (e.g., Keeloq), badly designed re-synch, or badly designed playback attack protection for RF intercepts when the legitimate user is transmitting on purpose.  Clever variations include blanket jamming and later playback, and jamming one of a pair of messages for later playback. Or broken authentication for OnStar-like remote unlock.  If the system  has too few bits in its code and doesn't use a leaky bucket rate limiting algorithm, you can just use a brute force attack.

Here's a video from which you learn both that the problem seems real in practice and that folks like the media, insurance, police, and investigators could do with a bit more education in this area:

Note that similar or identical technology is used for garage door openers.

On a related note, there is also some concern about the safety of smart keys regarding compliance with the Federal safety standard for rollaway protection and whether a car can keep running when nobody is in the car.  These concerns are related to the differences between electronic keys and physical keys that go into a traditional ignition switch. There have been some lawsuits and discussions about changing the Federal safety regulations:

Monday, March 7, 2016

Multiple Returns and Error Checking

Summary: Whether or not to allow multiple returns from a function is a controversial matter, but I recommend having a single return statement AND avoiding use of goto.


If you've been programming robust embedded systems for a while, you've seen code that looks something like this:

int MyRoutine(...)
{ ...
  if(..something fails..) { ret = 0; }
  { .. do something ...
    if(..somethingelse fails..) {ret = 0;}
    { .. do something ..
      if(...yetanotherfail..) {ret = 0;}
      { .. do the computation ...
        ret = value; 
  // perform default function

(Note: the "ret = 0" and "ret = value" parts are usually more complex; I'm keeping this simple for the sake of the discussion.)

The general pattern to this code is doing a bunch of validity checks and then, finally, in the most deeply nested "else" doing the desired action. Generally the code in the most deeply nested "else" is actually the point of the entire routine -- that's the action that you really want to take if all the exception checks pass.  This type of code shows up a lot in a robust embedded system, but can be difficult to understand.

This problem has been dealt with a number of times and most likely every reasonable idea has been re-invented many times. But when I ran up against it in a recent discussion I did some digging and found out I mostly disagree with a lot of the common wisdom, so here is what I think.

A common way to simplify things is the following, which is the guard clause pattern described by Martin Fowler in his refactoring catalog (an interesting resource with accompanying book if you haven't run into it before.)

int MyRoutine(...)
{ ...
  if(..something fails..)     {return(0);}
  if(..somethingelse fails..) {return(0);}
  if(...yetanotherfail..)     {return(0);}
  // perform default function

Reasonable people can (and do!) disagree about how to handle this.   Steve McConnell devotes a few pages of his book Code Complete (Section 17.1, and 17.3 in the second edition) that covers this territory, with several alternate suggestions, including the guard clause pattern and a discussion that says maybe a goto is OK sometimes if used carefully.

I have a somewhat different take, possibly because I specialize in high-dependability systems where being absolutely sure you are not getting things wrong can be more important than subjective code elegance. (The brief point of view is: getting code just a little klunky but obviously right and easy to check for correctness is likely to be safe. Getting code to look nice, but with a residual small chance of missing a subtle bug might kill someone. More this line of thought can be found in a different posting.)

Here is a sketch of some of the approaches I've seen and my thoughts on them:

(1) Another Language. Use another language that does a better job at this.  Sure, but I'm going to assume that you're stuck with just C.

(2) Guard Clauses.  Use the guard class pattern with early returns as shown above . The downside of this is that it breaks the single return per function rule that is often imposed on safety critical code. While a very regular structure might work out OK in many cases, things get tricky if the returns are part of code that is more complex. Rather than deal with shades of gray about whether it is OK to have more than one return, I prefer a strict single-return-per-function rule. The reason is that it avoids having to waste a lot of time hashing out when multiple returns are OK and when they aren't -- and the accompanying risk of getting that judgment call wrong. In other words black-and-white rules take less interpretation.

If you are building a safety critical system, every subjective judgement call as to what is OK and what is not OK is a chance to get it wrong.  (Could you get airtight design rules in place? Perhaps. But you still need to check them accurately after every code modification. If you can't automate the checks, realistically the code quality will likely degrade over time, so it's a path I prefer not to go down.)

(3) Else If Chain.  Use an if/else chain and put the return at the end:

int MyRoutine(...)
{ ...
  if(..something fails..)          {ret=0;}
  else if(..somethingelse fails..) {ret=0;}
  else if(...yetanotherfail..)     {ret=0;}
  else { // perform default function
         ret = value;

Wait, isn't this the same code with flat indenting?  Not quite.  It uses "else if" rather than "else { if". That means that the if statements aren't really nested -- there is really only a single main path through the code (a bunch of checks and the main action in the final code segment), without the possibility of lots of complex conditions in the "if" sidetrack branches.

If your code is flat enough that this works then this is a reasonable way to go. Think of it as guard clauses without the multiple returns.  The regular "else if" structure makes it pretty clear that this is a sequence of alternatives, or often a set of "check that everything is OK and in the end take the action." However, there may be cases in which the code logic is difficult to flatten this way, and in those cases you need something else (keep reading for more ideas).  The trivial case is:

(3a) Simplify To An Else If Chain.  Require that all code be simplified enough that an "else if" chain works.  This is a nice goal, and my first choice of style approaches.  But sometimes you might need a more capable and flexible approach.  Which brings us to the rest of the techniques...

(4) GOTO. Use a "goto" to break out of the flow and jump to the end of the routine.  I have seen many discussion postings saying this is the right thing to do because the code is cleaner than lots of messy if/else structures. At the risk of being flamed for this, I respectfully disagree. The usual argument is that a good programmer exhibiting discipline will do just fine.  But that entirely misses what I consider to be the bigger picture.  I don't care how smart the programmer who wrote the code is (or thinks he is).  I care a lot about whoever has to check and maintain the code. Sure, a good programmer on a good day can use a "goto" and basically emulate a try/throw/catch structure.  But not all programmers are top 10 percentile, and a lot of code is written by newbies who simply don't have enough experience to have acquired mature judgment on such matters. Beyond that, nobody has all good days.

The big issue isn't whether a programmer is likely to get it right. The issue is how hard (and error-prone) it is for a code reviewer and static analysis tools to make sure the programmer got it right (not almost right, or subtly wrong). Using a goto is like pointing a loaded gun at your foot. If you are careful it won't go off. But even a single goto shoves you onto a heavily greased slippery slope. Better not to go there in the first place. Better to find a technique that might seem a little more klunky, but that gets the job done with minimum fuss, low overhead, and minimal chance to make a mistake. Again, see my posting on not getting things wrong.

(Note: this is with respect to unrestricted "goto" commands used by human programmers in C. Generated code might be a different matter, as might be a language where "goto" is restricted.)

(5) Longjmp.  Use setjmp/longjmp to set a target and then jump to that target if the list of "if" error checks wants to return early. In the final analysis this is really the moral equivalent of a "goto," although it is a bit better controlled. Moreover, it uses a pointer to code (the setjmp/longjmp variable), so it is an indirect goto, and that in general can be hazardous. I've used setjmp/longjmp and it can be made to work (or at least seem to work), but dealing with pointers to code in your source code is always a dicey proposition. Jumping to a corrupted or uninitialized pointer can easily crash your system (or sometimes worse).  I'd say avoid using this approach.

I've seen discussion forum posts that wrap longjmp-based approaches up in macros to approximate Try/Throw/Catch. I can certainly appreciate the appeal of this approach, and I could see it being made to work. But I worry about things such as whether it will work if the macros get nested, whether reviewers will be aware of any implementation assumptions, and what will happen with static analysis tools on those structures. In high-dependability code if you can't be sure it will work, you shouldn't do it.

(6) Do..While..Break. Out of the classical C approaches I've seen, the one I like the most (beyond else..if chains) is using a "do..while..break" structure:

int MyRoutine(...)
{ ...
  do { //  start error handling code sequence
    if(..something fails..)     {ret=0; break;}
    if(..somethingelse fails..) {ret=0; break;}
    if(...yetanotherfail..)     {ret=0; break;}
    // perform default function
    ret = value;
  } while (0); // end error handling code sequence

This code is a hybrid of the guard pattern and the "else if" block pattern. It uses a "break" to skip the rest of the code in the code block (jumping to the while(0), which always exits the loop). The while(0) converts this structure from a loop into just a structured block of code with the ability to branch to the end of the code block using the "break" keyword. This code ought to compile to an executable that is has essentially identical efficiency to code using goto or code using an else..if chain. But, it puts an important restriction on the goto-like capability -- all the jumps have to point to the end of the do..while without exception.

What this means in practice is that when reviewing the code (or a change to the code) there is no question as to whether a goto is well behaved or not. There are no gotos, so you can't make an unstructured goto mistake with this approach.

While this toy example looks pretty much the same as the "else if" structure, an important point is that the "break" can be placed anywhere -- even deeply within a nested if statement -- without raising questions as to what happens when the "break" is hit. If in doubt or there is some reason why this technique won't work for you, I'd suggest falling back on restructuring the code so "else if" or this technique works if the code gets too complex to handle. The main problem to keep in mind is that if you nest do..while structures the break will un-nest only one level.

I recognize that this area falls a little bit into a matter of taste and context. My taste is for code that is easy to review and unlikely to have bugs. If that is at odds with subjective notions of elegance, so be it. In part my preference is to outlaw the routine use of techniques that require manual analysis to determine of a potentially unsafe structure is being used the "right" way. Every such judgement call is a chance to get it wrong, and a distraction of human reviewer attention away from more important things. And I dislike arguments of the form that a "good" and experienced programmer won't make a mistake. It is just too easy to miss a subtle bug, especially when you're modifying code in a hurry.

If you've run into another way to handle this problem let me know.