Saturday, April 16, 2016

Challenges in Autonomous Vehicle Testing and Validation

It's a lot of work to demonstrate that self-driving cars will actually work properly.  Testing alone is probably not going to be enough, and probably there will need to be some clever architectural approaches as well.  Last week I gave a talk at the SAE World Congress in Detroit about this and related challenges.

Link to paper 
Link to slides  (also see slideshare version by scrolling down).

Challenges in Autonomous Vehicle Testing and Validation 
     Philip Koopman & Michael Wagner
     Carnegie Mellon University; Edge Case Research LLC
     SAE World Congress, April 14, 2016

Abstract:
Software testing is all too often simply a bug hunt rather than a well considered exercise in ensuring quality. A more methodical approach than a simple cycle of system-level test-fail-patch-test will be required to deploy safe autonomous vehicles at scale. The ISO 26262 development V process sets up a framework that ties each type of testing to a corresponding design or requirement document, but presents challenges when adapted to deal with the sorts of novel testing problems that face autonomous vehicles. This paper identifies five major challenge areas in testing according to the V model for autonomous vehicles: driver out of the loop, complex requirements, non-deterministic algorithms, inductive learning algorithms, and fail operational systems. General solution approaches that seem promising across these different challenge areas include: phased deployment using successively relaxed operational scenarios, use of a monitor/actuator pair architecture to separate the most complex autonomy functions from simpler safety functions, and fault injection as a way to perform more efficient edge case testing. While significant challenges remain in safety-certifying the type of algorithms that provide high-level autonomy themselves, it seems within reach to instead architect the system and its accompanying design process to be able to employ existing software safety approaches.



Wednesday, March 23, 2016

Automotive Remote Keyless Entry Security


Recently there has been another round of reports on the apparent insecurity of remote keyless entry devices -- the electronic key fobs that open your car doors with a button press or even hands-free.  In this case it's not a lot to get excited about as I'll explain, but in general this whole area could use significant improvement because there are some serious concerns.  You'd think that in the 20+ years since I was first involved in this area the industry would get this stuff right on a routine basis, but the available data suggests otherwise. The difference now is that attackers are paying attention to these types of systems.

The latest attack involves a man-in-the-middle intercept that relays signals back and forth between the car and an owner's key fob:
  http://www.wired.com/2016/03/study-finds-24-car-models-open-unlocking-ignition-hack/

Here's why this particular risk might be overblown.  In systems of this type typically the signal sent by the car is an inductively coupled low frequency signal with a range of about a meter.  That's how the car knows you're actually standing by the door and not sipping a latte at a cafe 10 meters away.  So having a single intercept box near the car won't work.  Typically it's going to be hard to transmit an inductively coupled signal from your parking lot to your bedroom with any reasonably portable intercept device (unless the car is really, really close to your bedroom).  That's most likely why this article says there must be TWO intercept devices: one near the car, and one near the keys.  So I wouldn't worry about someone using this particular attack to break into a car in the middle of the night, because if they can get an intercept box onto your bedroom dresser you have bigger problems.

The more likely attack is someone walking near you in a shopping mall or at an airport who has targeted your specific car.  They can carry an intercept device near your pocket/bag while at the same time putting an intercept device near your car.  No crypto hack required -- it's a classic relay attack.  Sure, it could happen, but it's a little far-fetched for most folks unless you have an exceptionally valuable car.  And really, there are often easier ways to go than this. Attackers might be able to parlay this into a playback attack if the car's crypto is stupid enough -- but at some point they have to get within a meter or so of your physical key to ping it and have a shot at such an attack.

If you have a valuable car and already have your passport and non-contact credit cards in a shielded case, it might be worthwhile putting your keys in a Faraday cage when out of the house (perhaps an Altoids box).  But I'd avoid the freezer at home as both unnecessary and possibly producing condensate inside your device that could ruin it.

The more concerning thing is that devices to break into cars have been around for a while and there is no reason to believe they are based on the attack described in this article.  They could simply be exploiting bad security design, possibly without proximity to the legitimate transmitter.  Example scenarios include badly designed crypto (e.g., Keeloq), badly designed re-synch, or badly designed playback attack protection for RF intercepts when the legitimate user is transmitting on purpose.  Clever variations include blanket jamming and later playback, and jamming one of a pair of messages for later playback. Another possibility is broken authentication for OnStar-like remote unlock.  If the system has too few bits in its code and doesn't use a leaky bucket rate limiting algorithm, you can just use a brute force attack.
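
To make the rate limiting idea concrete, here is a minimal sketch in C of a leaky bucket that throttles failed unlock attempts. Everything in it is an assumption for illustration rather than code from any real keyless entry product: the LeakyBucket structure, the function names, and the free-running millisecond tick passed in as now_ms. Each failed attempt adds one unit to the bucket, the bucket drains one unit every leak_ms, and further attempts are refused while the bucket is full.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical leaky bucket state for throttling bad unlock codes. */
typedef struct {
    uint32_t level;     /* failures currently held in the bucket */
    uint32_t capacity;  /* failures tolerated in a burst */
    uint32_t leak_ms;   /* time for one failure to drain; must be > 0 */
    uint32_t last_ms;   /* tick value at the last drain update */
} LeakyBucket;

/* Drain the bucket according to elapsed time. Unsigned subtraction
 * keeps this correct across a single wrap of the 32-bit tick. */
static void bucket_leak(LeakyBucket *b, uint32_t now_ms)
{
    uint32_t elapsed = now_ms - b->last_ms;
    uint32_t drained = elapsed / b->leak_ms;

    if (drained >= b->level) {
        b->level = 0;
        b->last_ms = now_ms;                  /* empty: restart the clock */
    } else {
        b->level -= drained;
        b->last_ms += drained * b->leak_ms;   /* keep the partial interval */
    }
}

/* True if the receiver should even look at an incoming code right now. */
bool bucket_allow_attempt(LeakyBucket *b, uint32_t now_ms)
{
    bucket_leak(b, now_ms);
    return (b->level < b->capacity);
}

/* Call after an incoming code fails authentication. */
void bucket_record_failure(LeakyBucket *b, uint32_t now_ms)
{
    bucket_leak(b, now_ms);
    if (b->level < b->capacity) { b->level++; }
}
```

With capacity 3 and leak_ms 1000, for example, a legitimate user can fumble a few button presses without noticing anything, but a brute force attacker is held to roughly one guess per second once the bucket fills. Whether that is slow enough depends on how many bits are in the code.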

Here's a video from which you learn both that the problem seems real in practice and that folks like the media, insurance, police, and investigators could do with a bit more education in this area:
   https://www.youtube.com/watch?v=97ceREjpIvI


Note that similar or identical technology is used for garage door openers.

On a related note, there is also some concern about the safety of smart keys regarding compliance with the Federal safety standard for rollaway protection and whether a car can keep running when nobody is in the car.  These concerns are related to the differences between electronic keys and physical keys that go into a traditional ignition switch. There have been some lawsuits and discussions about changing the Federal safety regulations: https://www.federalregister.gov/articles/2011/12/12/2011-31441/federal-motor-vehicle-safety-standards-theft-protection-and-rollaway-prevention

Monday, March 7, 2016

Multiple Returns and Error Checking

Summary: Whether or not to allow multiple returns from a function is a controversial matter, but I recommend having a single return statement AND avoiding use of goto.

Discussion:

If you've been programming robust embedded systems for a while, you've seen code that looks something like this:

int MyRoutine(...)
{ ...
  if(..something fails..) { ret = 0; }
  else 
  { .. do something ...
    if(..somethingelse fails..) {ret = 0;}
    else 
    { .. do something ..
      if(...yetanotherfail..) {ret = 0;}
      else 
      { .. do the computation ...
        ret = value; 
      }
    }
  }
  // perform default function
  return(ret);
}

(Note: the "ret = 0" and "ret = value" parts are usually more complex; I'm keeping this simple for the sake of the discussion.)

The general pattern to this code is doing a bunch of validity checks and then, finally, in the most deeply nested "else" doing the desired action. Generally the code in the most deeply nested "else" is actually the point of the entire routine -- that's the action that you really want to take if all the exception checks pass.  This type of code shows up a lot in a robust embedded system, but can be difficult to understand.

This problem has been dealt with a number of times and most likely every reasonable idea has been re-invented many times. But when I ran up against it in a recent discussion I did some digging and found out I mostly disagree with a lot of the common wisdom, so here is what I think.

A common way to simplify things is the following, which is the guard clause pattern described by Martin Fowler in his refactoring catalog (an interesting resource with accompanying book if you haven't run into it before.)

int MyRoutine(...)
{ ...
  if(..something fails..)     {return(0);}
  if(..somethingelse fails..) {return(0);}
  if(...yetanotherfail..)     {return(0);}
  // perform default function
  return(value);
}

Reasonable people can (and do!) disagree about how to handle this.   Steve McConnell devotes a few pages of his book Code Complete (Sections 17.1 and 17.3 in the second edition) to this territory, with several alternate suggestions, including the guard clause pattern and a discussion that says maybe a goto is OK sometimes if used carefully.

I have a somewhat different take, possibly because I specialize in high-dependability systems where being absolutely sure you are not getting things wrong can be more important than subjective code elegance. (The brief point of view is: getting code just a little klunky but obviously right and easy to check for correctness is likely to be safe. Getting code to look nice, but with a residual small chance of missing a subtle bug might kill someone. More on this line of thought can be found in a different posting.)

Here is a sketch of some of the approaches I've seen and my thoughts on them:

(1) Another Language. Use another language that does a better job at this.  Sure, but I'm going to assume that you're stuck with just C.

(2) Guard Clauses.  Use the guard clause pattern with early returns as shown above. The downside of this is that it breaks the single return per function rule that is often imposed on safety critical code. While a very regular structure might work out OK in many cases, things get tricky if the returns are part of code that is more complex. Rather than deal with shades of gray about whether it is OK to have more than one return, I prefer a strict single-return-per-function rule. The reason is that it avoids having to waste a lot of time hashing out when multiple returns are OK and when they aren't -- and the accompanying risk of getting that judgment call wrong. In other words black-and-white rules take less interpretation.

If you are building a safety critical system, every subjective judgement call as to what is OK and what is not OK is a chance to get it wrong.  (Could you get airtight design rules in place? Perhaps. But you still need to check them accurately after every code modification. If you can't automate the checks, realistically the code quality will likely degrade over time, so it's a path I prefer not to go down.)

(3) Else If Chain.  Use an if/else chain and put the return at the end:

int MyRoutine(...)
{ ...
  if(..something fails..)          {ret=0;}
  else if(..somethingelse fails..) {ret=0;}
  else if(...yetanotherfail..)     {ret=0;}
  else { // perform default function
         ret = value;
        }
  return(ret);
}

Wait, isn't this the same code with flat indenting?  Not quite.  It uses "else if" rather than "else { if". That means that the if statements aren't really nested -- there is really only a single main path through the code (a bunch of checks and the main action in the final code segment), without the possibility of lots of complex conditions in the "if" sidetrack branches.

If your code is flat enough that this works then this is a reasonable way to go. Think of it as guard clauses without the multiple returns.  The regular "else if" structure makes it pretty clear that this is a sequence of alternatives, or often a set of "check that everything is OK and in the end take the action." However, there may be cases in which the code logic is difficult to flatten this way, and in those cases you need something else (keep reading for more ideas).  The trivial case is:

(3a) Simplify To An Else If Chain.  Require that all code be simplified enough that an "else if" chain works.  This is a nice goal, and my first choice of style approaches.  But sometimes you might need a more capable and flexible approach.  Which brings us to the rest of the techniques...

(4) GOTO. Use a "goto" to break out of the flow and jump to the end of the routine.  I have seen many discussion postings saying this is the right thing to do because the code is cleaner than lots of messy if/else structures. At the risk of being flamed for this, I respectfully disagree. The usual argument is that a good programmer exhibiting discipline will do just fine.  But that entirely misses what I consider to be the bigger picture.  I don't care how smart the programmer who wrote the code is (or thinks he is).  I care a lot about whoever has to check and maintain the code. Sure, a good programmer on a good day can use a "goto" and basically emulate a try/throw/catch structure.  But not all programmers are top 10 percentile, and a lot of code is written by newbies who simply don't have enough experience to have acquired mature judgment on such matters. Beyond that, nobody has all good days.

The big issue isn't whether a programmer is likely to get it right. The issue is how hard (and error-prone) it is for a code reviewer and static analysis tools to make sure the programmer got it right (not almost right, or subtly wrong). Using a goto is like pointing a loaded gun at your foot. If you are careful it won't go off. But even a single goto shoves you onto a heavily greased slippery slope. Better not to go there in the first place. Better to find a technique that might seem a little more klunky, but that gets the job done with minimum fuss, low overhead, and minimal chance to make a mistake. Again, see my posting on not getting things wrong.

(Note: this is with respect to unrestricted "goto" commands used by human programmers in C. Generated code might be a different matter, as might be a language where "goto" is restricted.)

(5) Longjmp.  Use setjmp/longjmp to set a target and then jump to that target if the list of "if" error checks wants to return early. In the final analysis this is really the moral equivalent of a "goto," although it is a bit better controlled. Moreover, it uses a pointer to code (the setjmp/longjmp variable), so it is an indirect goto, and that in general can be hazardous. I've used setjmp/longjmp and it can be made to work (or at least seem to work), but dealing with pointers to code in your source code is always a dicey proposition. Jumping to a corrupted or uninitialized pointer can easily crash your system (or sometimes worse).  I'd say avoid using this approach.

I've seen discussion forum posts that wrap longjmp-based approaches up in macros to approximate Try/Throw/Catch. I can certainly appreciate the appeal of this approach, and I could see it being made to work. But I worry about things such as whether it will work if the macros get nested, whether reviewers will be aware of any implementation assumptions, and what will happen with static analysis tools on those structures. In high-dependability code if you can't be sure it will work, you shouldn't do it.

(6) Do..While..Break. Out of the classical C approaches I've seen, the one I like the most (beyond else..if chains) is using a "do..while..break" structure:

int MyRoutine(...)
{ ...
  do { //  start error handling code sequence
    if(..something fails..)     {ret=0; break;}
    if(..somethingelse fails..) {ret=0; break;}
    if(...yetanotherfail..)     {ret=0; break;}
    // perform default function
    ret = value;
  } while (0); // end error handling code sequence
  return(ret);
}

This code is a hybrid of the guard pattern and the "else if" block pattern. It uses a "break" to skip the rest of the code in the code block (jumping to the while(0), which always exits the loop). The while(0) converts this structure from a loop into just a structured block of code with the ability to branch to the end of the code block using the "break" keyword. This code ought to compile to an executable that has essentially identical efficiency to code using goto or code using an else..if chain. But, it puts an important restriction on the goto-like capability -- all the jumps have to point to the end of the do..while without exception.

What this means in practice is that when reviewing the code (or a change to the code) there is no question as to whether a goto is well behaved or not. There are no gotos, so you can't make an unstructured goto mistake with this approach.

While this toy example looks pretty much the same as the "else if" structure, an important point is that the "break" can be placed anywhere -- even deeply within a nested if statement -- without raising questions as to what happens when the "break" is hit. If in doubt or there is some reason why this technique won't work for you, I'd suggest falling back on restructuring the code so "else if" or this technique works if the code gets too complex to handle. The main problem to keep in mind is that if you nest do..while structures the break will un-nest only one level.
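
Here is a small compilable sketch of that point, with a "break" sitting inside a nested if. The function name, the limits, and the scaling calculation are all made up for illustration. Every "break" lands in the same place (the while(0)), and there is still exactly one return.

```c
#include <stddef.h>
#include <stdint.h>

#define READING_LIMIT 1000   /* hypothetical valid range is +/- this value */
#define GAIN_MAX       100   /* hypothetical largest allowed gain */

/* Scale a sensor reading. Returns 1 and writes *out on success,
 * returns 0 and leaves *out untouched if any check fails. */
int scale_reading(const int32_t *reading, int32_t gain, int32_t *out)
{
    int ok = 0;   /* assume failure until every check has passed */

    do { /* start error handling code sequence */
        if (reading == NULL || out == NULL) { break; }
        if (gain <= 0 || gain > GAIN_MAX)   { break; }

        if (*reading < 0) {
            /* nested check: break still jumps to the while(0) */
            if (*reading < -READING_LIMIT) { break; }
        } else {
            if (*reading > READING_LIMIT)  { break; }
        }

        /* perform default function */
        *out = *reading * gain;
        ok = 1;
    } while (0); /* end error handling code sequence */

    return ok;
}
```

One caution that goes along with the nesting caveat above: in C a "break" placed inside an inner for/while loop or a switch statement exits only that inner construct, not the enclosing do..while, so checks placed inside those need some other way to signal failure (for example a flag tested after the inner construct ends).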

I recognize that this area falls a little bit into a matter of taste and context. My taste is for code that is easy to review and unlikely to have bugs. If that is at odds with subjective notions of elegance, so be it. In part my preference is to outlaw the routine use of techniques that require manual analysis to determine if a potentially unsafe structure is being used the "right" way. Every such judgement call is a chance to get it wrong, and a distraction of human reviewer attention away from more important things. And I dislike arguments of the form that a "good" and experienced programmer won't make a mistake. It is just too easy to miss a subtle bug, especially when you're modifying code in a hurry.

If you've run into another way to handle this problem let me know.

Wednesday, February 24, 2016

A Nice Rant About Representing Computer System Time

Here's a nice rant about dealing with time, time zones, daylight savings time, leap seconds, and why keeping things straight is so difficult. This especially applies to embedded systems which might not have a network connection, let alone access to a networked time service.

 The overall video series looks pretty interesting too, although it's more about computers in general than embedded computers specifically.  (Scroll down past the video link to see a list of how time-keeping bugs have caused severe outages and even deaths.)

"Summary: Published on Dec 30, 2013 A web app that works out how many seconds ago something happened. How hard can coding that be? Tom Scott explains how time twists and turns like a twisty-turny thing. It's not to be trifled with! (Embedded from YouTube; <computerphile>)"




Bonus content: here is a rogue's gallery of time goofs that I happen to include in my classroom lectures.  If you know of other high profile outages or worse please submit as comments and I'll update the list as we go. (Note: this is only bugs caused by bad time-keeping, not all software outages):
  • Feb. 1991: Patriot missile failure due to floating point time roundoff; 28 deaths (link)
  • Dec 31, 1999: Y2K
  • Feb. 2007: F-22 raptor computer system crash due to crossing the international date line (link)
  • Dec. 2008: Microsoft Zunes basically bricked by leap year bug (link)
  • Mar. 2011:  iPhones spring back instead of springing forward (link)
  • Feb. 29, 2012: Windows Azure leap-year bug takes down G-cloud (link)
  • Jul. 1, 2012: Leap second bug wreaks havoc upon on-line services using NTP (link)
  • Sep. 2013: Deep Impact comet mission ends due to calendar date rollover fault (link)
  • May 2015: Boeing 787 timer rollover bug crashes engine software after 248 days (link)
  • Jul. 1, 2015: Another leap second problem, but not nearly as bad as 2012 (link)
  • Feb. 2016: iPhone prank bricks iPhones if date set back to zero Unix time (link)
  • Feb. 2016: Leap year bug leaves passengers without bags at Dusseldorf airport (link)
  • Jan. 19, 2038 03:14:07 GMT:  Unix time rolls over ("Y2K for Unix")
(There are relatively few listings before 2010.  Don't think for a minute that time suddenly got harder.  More likely, there are now more things that can break and make headlines due to timekeeping faults.)
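
For the Unix rollover entry, the arithmetic is easy to check by hand or in code. The sketch below assumes the classic representation of Unix time as a signed 32-bit count of seconds since Jan. 1, 1970; the SplitTime structure and the function name are hypothetical. It breaks a non-negative count into whole days plus time of day, ignoring leap seconds just as Unix time does.

```c
#include <stdint.h>

/* Hypothetical helper type: elapsed days plus time of day. */
typedef struct {
    int32_t days;     /* whole days since Jan. 1, 1970 */
    int32_t hours;
    int32_t minutes;
    int32_t seconds;
} SplitTime;

/* Split a seconds-since-1970 count; assumes t >= 0. */
SplitTime split_unix_time(int32_t t)
{
    SplitTime s;
    int32_t rem = t % 86400;        /* seconds into the current day */

    s.days    = t / 86400;          /* 86400 seconds per day */
    s.hours   = rem / 3600;
    s.minutes = (rem % 3600) / 60;
    s.seconds = rem % 60;
    return s;
}
```

Feeding in INT32_MAX (2147483647) gives 24855 days plus 03:14:07. Day 24855 counted from Jan. 1, 1970 is Jan. 19, 2038, so that is the last second a signed 32-bit counter can represent before it wraps.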

Monday, February 1, 2016

Multi-Rate Main Loop Task Timing


In the past couple posts I've talked about how to build a multi-rate main loop scheduler and the two biggest mistakes I tend to see with timing analysis. In this posting I'll wrap up the topic (at least for now) by describing how to determine whether a particular task in a non-preemptive multi-rate system is going to meet its deadlines.  The math gets pretty complex and is pretty painful, so I'll work an example and give the general rules without trying to show the iterative math behind the method. At that, this posting will take more work to understand completely than my usual posts, but it's just a complex topic and this isn't the place for an entire book chapter worth of math. On real systems you can generally do the checking with a spreadsheet once you understand the technique.

Consider the following task set:

Task  Period  Compute  CPU Load 
 0       7       2       28.6%
 1      10       2       20%
 2      20       3       15%    
 3     101       5        5%
 4     199       3        1.5%
         Total CPU Load  70.1%  (numbers rounded to nearest 0.1%)
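
The CPU Load column is just compute time divided by period for each task. A minimal sketch of that calculation in C follows; the Task structure and function names are assumptions for illustration, and the times are taken to be in msec to match the worked example.

```c
#include <stddef.h>

/* Hypothetical task descriptor; times in msec. */
typedef struct {
    int period;    /* time between releases */
    int compute;   /* worst-case run time per release */
} Task;

/* Fraction of the CPU consumed by one task (0.0 to 1.0). */
double task_load(const Task *t)
{
    return (double)t->compute / (double)t->period;
}

/* Sum of the per-task loads for the whole task set. */
double total_load(const Task *tasks, size_t n)
{
    double sum = 0.0;
    size_t i;

    for (i = 0; i < n; i++) {
        sum += task_load(&tasks[i]);
    }
    return sum;
}
```

For the task set above this comes out to about 0.7003. The 70.1% total in the table is what you get by adding up the individually rounded rows, so the two differ slightly in the last digit.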

If we want to find out the worst case response time for Task 2, we need to look at the following worst case:
  • All the tasks in the task set become ready to run simultaneously. In practice this means that the system timer is at zero or equal to the least common multiple of all periods.
  • The most obnoxious task with priority lower than the task we care about sneaks in and starts running just before the highest priority task gets to run. (Generally this happens because that task got a chance to start right at the last instant before the counter ticked.)
  • All the tasks with higher priority run one or more times (based on their periods) before the task we care about runs. In general some tasks will run multiple times before our task gets a chance to run.
So for the example case of Task #2, that means:
  • Identify Task #3 as the task with the largest compute time from those tasks with a priority lower than Task #2.  Assume it starts running at time zero because it snuck in ahead of all the other tasks.
  • Because it is time zero, all other tasks are ready to run starting at time zero. But because Task #3 snuck in first, all the other tasks are in the run queue at time zero.
Now we do an iterative calculation as follows, with each numbered step being one iteration: run the highest priority queued task and compute the new system time when it ends.
  1. Start at time 0 with Task 3 running.
    - Tasks 0, 1, 2, 4 queue to run at time zero.
    - Running Task 3 takes us to time 5 when Task 3 completes.
    - At time 5 tasks still queued are:  0, 1, 2, 4
  2. Time 5: run highest priority queued task: Task 0.
    - This runs for 2 msec, bringing us to time 7.
    - At time 7 we reach the Task 0 period, so another copy of Task 0 queues to run.
    - At time 7 tasks still queued are: 0, 1, 2, 4
  3. At time 7 run highest priority queued task: Task 0.
    - This runs for 2 msec, bringing us to time 9.
    - At time 9 tasks still queued are: 1, 2, 4
  4. At time 9 run highest priority queued task: Task 1.
    - This runs for 2 msec, bringing us to time 11.
    - At time 10 another copy of Task 1 is queued (note that we have missed the deadline for Task 1 since it should have completed at time 10, but ran until time 11).
    - At time 11 tasks still queued are: 1, 2, 4
  5. At time 11 run highest priority queued task: Task 1.
    - This runs for 2 msec, bringing us to time 13.
    - At time 13 tasks still queued are: 2, 4
  6. At time 13 run highest priority queued task: Task 2.
    - This runs for 3 msec, bringing us to time 16.
    - At time 14 another copy of Task 0 is queued.
    - At time 16 tasks still queued are: 0, 4.
  7. At time 16 Task #2 has completed, ending our analysis
We've computed the worst case time to complete Task 2 is 16 msec, which is less than its 20 msec period. So Task 2 will meet its deadline. However, along the way we noticed that even though the CPU is only loaded to 70.1% Task 1 is going to miss its deadline.

Graphically, the above scenario looks like this:


The blue arrows show where tasks become queued to run, and the boxes show which task is running when.

To determine if your system is schedulable you need to repeat this analysis for every task in the system. In this case, repeating it for Task 1 will reveal that Task 1 misses its deadlines even though the CPU is only loaded at 70.1%.

In general the technique is to assume the longest-running task (biggest compute time) with priority lower than yours starts running and all other tasks queue at that same time. Then play out the sequence to see if you meet your deadline. There are a few notes on special cases. The longest task with lower priority may vary depending on which task you are evaluating. For example, the longest lower priority task for Task #3 is not Task #3, but rather Task #4. And the lowest priority task doesn't have a lower priority blocking task, so when evaluating Task #4 you can just assume it starts at time zero (there is no initial blocker).  This can be expressed as an iterative equation that has to be cycled until it converges. If you really want a math-based approach take a look at the CAN performance lecture slides in my grad course, which is pretty much the same math. But trying to explain the math takes a lot more words and complicated typography than are practical in a blog entry.
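
If a spreadsheet isn't your thing, the play-it-out procedure can also be sketched in C. This is a simulation of the walkthrough above rather than the closed-form iterative equation, and the Task structure, MAX_TASKS, function name, and time_limit parameter are all assumptions for illustration. Tasks are stored in priority order (index 0 is highest priority), times are in msec, and only the first run of the task of interest after the worst case start is examined.

```c
#include <stddef.h>

#define MAX_TASKS 16   /* hypothetical upper bound on task set size */

typedef struct {
    int period;    /* msec; deadline assumed equal to period; must be > 0 */
    int compute;   /* worst-case run time in msec */
} Task;

/* Worst-case completion time of task k, or -1 if the inputs are invalid
 * or task k has not started by time_limit. */
int worst_case_response(const Task *tasks, size_t n, size_t k, int time_limit)
{
    int pending[MAX_TASKS];        /* queued copies of each task */
    int next_release[MAX_TASKS];   /* time the next copy queues */
    int result = -1;
    int valid = (n > 0 && n <= MAX_TASKS && k < n);
    int t = 0;
    size_t i;

    for (i = 0; valid && i < n; i++) {
        if (tasks[i].period <= 0 || tasks[i].compute < 0) { valid = 0; }
    }

    if (valid) {
        /* Initial blocker: longest lower-priority task runs from time 0. */
        for (i = k + 1; i < n; i++) {
            if (tasks[i].compute > t) { t = tasks[i].compute; }
        }
        /* Task k and everything above it queue at time zero. */
        for (i = 0; i <= k; i++) {
            pending[i] = 1;
            next_release[i] = tasks[i].period;
        }
        while (result < 0 && t <= time_limit) {
            /* Queue new copies whose period boundary has been reached. */
            for (i = 0; i <= k; i++) {
                while (next_release[i] <= t) {
                    pending[i]++;
                    next_release[i] += tasks[i].period;
                }
            }
            /* Find the highest priority queued task (task k is always queued). */
            i = 0;
            while (i < k && pending[i] == 0) { i++; }
            if (i == k) {
                result = t + tasks[k].compute;   /* our task finally runs */
            } else {
                pending[i]--;                    /* run it to completion */
                t += tasks[i].compute;
            }
        }
    }
    return result;
}
```

Run against the example task set, this gives 16 for Task 2 and 11 for Task 1, matching the hand analysis. Comparing each result against that task's period tells you whether the deadline is met.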

If you find out you are missing deadlines and want to fix it, there are two general techniques that can help:
  • Make any really long-running low priority task run shorter, possibly by breaking it up into multiple shorter tasks that work together, or that can "yield" control back to the main loop every once in a while during their execution and pick up the next time they run where they left off. This will reduce the length of the initial blocking effect for all higher priority tasks.
  • Schedule tasks so their periods evenly divide each other (this results in the Least Common Multiple of all periods being equal to the largest period). This corresponds to the approach of harmonic task periods discussed in the real time scheduling chapter of my book. For non-preemptive tasking it will NOT support 100% CPU usage, but probably it will make worst case latency better.
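
Checking the second technique is easy to automate. The sketch below is a hypothetical helper (the name and calling convention are my own for illustration) that tests whether a list of periods, sorted from shortest to longest, is harmonic in the sense that each period evenly divides the next longer one.

```c
#include <stddef.h>

/* Returns 1 if each period evenly divides the next longer one, else 0.
 * Assumes periods[] is sorted in ascending order. */
int periods_are_harmonic(const int *periods, size_t n)
{
    int harmonic = 1;
    size_t i;

    for (i = 1; i < n && harmonic; i++) {
        if (periods[i - 1] <= 0 || (periods[i] % periods[i - 1]) != 0) {
            harmonic = 0;
        }
    }
    return harmonic;
}
```

The example task set (7, 10, 20, 101, 199) fails this check. A set such as 5, 10, 20, 100, 200 (made up for illustration) passes, and its least common multiple is simply the largest period, 200.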

Notes for readers who want to be thorough:
  • If you have interrupts you have to include the ISR contribution to CPU workload when doing this analysis. Doing so can get a bit tricky.  The real time scheduling chapter of my book shows how to do this for a single-rate main loop scheduler.
  • It may be that you want to resolve a situation in which the fastest task gets blocked for too long by putting that faster task into an ISR. If you do that, keep it short and don't forget that it still affects CPU workload. This approach is discussed for a single-rate main loop scheduler, also in the real time scheduling chapter of my book.
  • We don't include the current task in the "most obnoxious" consideration because we start by assuming the system will meet deadlines where deadline = task period. Any particular task has to have finished running before it can run again unless it misses its deadline. So a task can't self-block in a system that is schedulable.