Monday, June 10, 2013

Seven Deadly Sins of CRCs and Checksums

We're wrapping up the final report for an FAA-sponsored study of CRC and Checksum performance for aviation applications, although the results in general apply to all uses of those error detection codes.

As part of our results we came up with an informal list of "Seven Deadly Sins" (bad ideas):
  1. Picking a CRC based on a popularity contest instead of analysis
    • This includes using “standard” polynomials such as IEEE 802.3
  2. Saying that a good checksum is as good as a bad CRC
    • Many “standard” polynomials have poor HD at long lengths
  3. Evaluating with randomly corrupted data instead of BER fault model
    • Any useful error code looks good on random error patterns vs. BER random bit flips
  4. Blindly using polynomial factorization to choose a CRC
    • It works for long dataword special cases, but not beyond that
    • Divisibility by (x+1) doubles undetected fraction on even # bit errors
  5. Failing to protect message length field
    • Results in pointing to data as FCS, giving HD=1
  6. Failing to pick an accurate fault model and apply it
    • “We added a checksum, so ‘all’ errors will be caught” (untrue!)
    • Assuming a particular standard BER without checking the actual system
  7. Ignoring interaction with bit encoding
    • E.g., bit stuffing compromises HD to give HD=2
    • E.g., 8b10b encoding – seems to be OK, but depends on specific CRC polynomial
(I haven't tried to map it onto the more traditional sin list... if someone comes up with a clever mapping I'll post it!)
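To make item 5 concrete, here is a minimal sketch of one way a receiver might avoid trusting an unprotected length field: give the length its own small check and verify it before using the length to locate the FCS. The frame layout, the function names, and the choice of a CRC-8 with polynomial 0x07 are all assumptions made for illustration. Per item 1, a real design should choose its polynomial by analysis for the dataword length involved.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define FCS_BYTES 4u  /* assumed size of the frame check sequence */

/* Bitwise CRC-8, polynomial 0x07, zero initial value (placeholder choice). */
uint8_t crc8(const uint8_t *data, size_t len)
{
    uint8_t crc = 0;
    for (size_t i = 0; i < len; i++) {
        crc ^= data[i];
        for (int bit = 0; bit < 8; bit++) {
            crc = (crc & 0x80u) ? (uint8_t)((crc << 1) ^ 0x07u)
                                : (uint8_t)(crc << 1);
        }
    }
    return crc;
}

/* Hypothetical frame: [length][header check][payload ...][FCS ...] */
bool HeaderOk(uint8_t length, uint8_t header_check)
{
    return crc8(&length, 1) == header_check;
}

/* Returns the offset of the FCS within the frame, or -1 if the length
   field fails its own check or does not fit in the received frame. */
long FcsOffset(const uint8_t *frame, size_t frame_len)
{
    if (frame_len < 2u || !HeaderOk(frame[0], frame[1])) { return -1; }
    if ((size_t)frame[0] + 2u + FCS_BYTES > frame_len)   { return -1; }
    return 2L + frame[0];
}
```

With this structure a corrupted length is rejected before it can point the receiver at payload bytes as if they were the FCS.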

Thanks to Kevin Driscoll and Brendan Hall at Honeywell for their work as co-investigators. You can read more about the research on my CRC and Checksum Blog.  That blog has more detailed postings, slide sets, and will have the final research report when it is made publicly available.


Saturday, May 25, 2013

Adding Prioritization to a Single Level Interrupt Priority System


Summary of technique: Add a software structure that executes only the highest priority pending interrupt within the ISR polling loop. Then start again at the top of the polling loop instead of polling all possible ISRs. This gives you a prioritized non-preemptive interrupt service routine scheduler.

- - - - - - - - - - - - - - - - - - - -

With some microcontrollers, all of your interrupts come in at the same priority level (for example, via an external interrupt request pin). The usual thing to do in that case is create a polling loop to check all the sources of interrupts and see which one needs to be serviced by looking at peripheral status registers.  For example:
if(HWTimerTick)  { ... ISR to service hardware timer tick ... }
if(ADCReady)  { ... ISR to service A to D converter ... }
if(SerialPortDataInReady ) { ... ISR to read a serial port byte... }
if(SerialPortDataOutReady) { ... ISR to write a serial port byte ... }
...
(Of course this isn't real code ... I'm just sketching a flow that you've seen before if you've written this type of ISR that polls all the devices that can cause interrupts to see which one actually needs to be serviced.)

If only one of these devices is active, then this approach should work pretty well. And if you do system-level testing probably things will work fine -- at least most of the time.

But the way you can get into trouble is if one of the interrupts has a short deadline for being serviced. Let's say you have the above code and are seeing serial input bytes being dropped once in a while.  What could be happening?

One cause of dropping bytes might be that the HW Timer Tick and/or the ADC Ready interrupts are active at the same time that the serial port data input interrupt is ready. You need to execute them before you can get data from the serial port. If the sum of their two execution times is longer than the time between serial byte arrivals, you're going to take too long to get to the serial port input ISR and will drop bytes.

You might buy a faster processor (which might be unnecessary as we'll see), but before doing that you might reorganize the code to put the serial input first in the list of ISRs so you can get to it faster when an interrupt comes in:
if(SerialPortDataInReady ) { ... read a serial port byte... }
if(HWTimerTick)  { ... service hardware timer tick ... }
if(ADCReady)  { ... service A to D converter ... }
if(SerialPortDataOutReady) { ... write a serial port byte ... }

And that will *almost* work. Things might get a little better, but it won't cure the problem. (Or, MUCH worse, it will cure the problem in testing only to have the problem reappear in the field after you've shipped a lot of systems!)  Now when you get an interrupt you'll service the serial port input ISR first. But, then you'll go off and do the other ISRs. If those other ISRs take enough time, you will be stuck in those other ISRs too long and will miss the next byte -- you won't get back to the top of the list of ISRs in time.

You might try re-enabling interrupts inside any long ISRs to let the serial port get processed sooner. But resist the temptation -- that probably won't work, and will likely result in stack overflows due to recursive interrupt processing. (Simple rule: NEVER re-enable interrupts from inside an ISR.)

What we really need here is prioritization. And it's pretty easy to get even though we don't have hardware interrupt prioritization. All you have to do is (1) put the checks for each ISR in priority order, and (2) only execute the first one in the list each time you process interrupts. This can be done as follows:

if(SerialPortDataInReady ) { ... read a serial port byte... }
else if(HWTimerTick)  { ... service hardware timer tick ... }
else if(ADCReady)  { ... service A to D converter ... }
else if(SerialPortDataOutReady) { ... write a serial port byte ... }

Now only the first active interrupt will be serviced and the rest ignored. When you drop out of this structure and exit, any pending interrupt will re-trigger the checks from the beginning, again executing the highest priority interrupt that is still active (i.e., the first active one in the list). This will continue until all pending interrupts have been processed. You can use a "while" loop around the code above, or in many systems it may make sense just to exit interrupt processing and let the hardware interrupts re-trigger to re-run the polling code as a new interrupt.
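Here is a rough sketch of the "while" loop variant in compilable C. The flag array, the service log, and the serial_arrives_during_timer switch are stand-ins invented so the sketch can run on a PC. On real hardware the checks would read peripheral status registers and each service routine would acknowledge its own device.

```c
#include <stdbool.h>

/* Interrupt sources in priority order, highest first (assumed names). */
enum { SRC_SERIAL_IN, SRC_TIMER, SRC_ADC, SRC_SERIAL_OUT, SRC_COUNT };

volatile bool pending[SRC_COUNT];   /* stand-in for status register bits */
bool serial_arrives_during_timer;   /* simulation only: inject a new request */
int serviced[16];                   /* order of service, for illustration */
int serviced_count;

static void Service(int src)
{
    pending[src] = false;           /* acknowledge the source */
    if (serviced_count < 16) { serviced[serviced_count++] = src; }
    if (src == SRC_TIMER && serial_arrives_during_timer) {
        serial_arrives_during_timer = false;
        pending[SRC_SERIAL_IN] = true;  /* a byte shows up mid-service */
    }
}

/* Single-level ISR entry point: run only the highest priority pending
   source, then re-check from the top of the list. */
void DispatchISR(void)
{
    for (;;) {
        if      (pending[SRC_SERIAL_IN])  { Service(SRC_SERIAL_IN); }
        else if (pending[SRC_TIMER])      { Service(SRC_TIMER); }
        else if (pending[SRC_ADC])        { Service(SRC_ADC); }
        else if (pending[SRC_SERIAL_OUT]) { Service(SRC_SERIAL_OUT); }
        else { break; }             /* nothing pending: leave the ISR */
    }
}
```

If the timer and ADC are pending and a serial byte arrives while the timer is being serviced, the order of service is timer, serial input, ADC. The serial byte gets in ahead of the ADC instead of waiting for it.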


This approach means that the worst case delay between processing serial input bytes is no longer all the ISRs running (if all interrupts are active). Rather, the worst case is the single longest ISR happens to be running, completes, and the serial port input ISR runs next. This happens because the list only runs at most one ISR rather than all of them. If that one ISR runs too long to meet deadlines, then it's probably too "fat" and should be simplified or its job moved out of ISRs and into the main loop.

There is no free lunch. The lowest priority ISR (the one at the end of the list) might starve. Making sure you meet all your ISR deadlines is trickier with this structure. Without the "elseif" approach the worst case timing is easy to compute -- it is the run time of all ISRs. But it might be too slow to live with. With this structure you have a nonpreemptive prioritized scheduling system for ISRs, and need to use suitable math and a suitable scheduling approach. Generally you'd want to use rate monotonic analysis (RMA) suitably adapted for the ISRs being non-preemptive. The analysis may be a little more complex, but this approach might help you salvage a situation in which you're missing deadlines and have already committed to a certain speed of microcontroller.


(Note on terminology: technically the whole thing is one big ISR that calls a different function depending upon what's active. But I'm calling each such function an ISR because that is really what it does ... you're using a software dispatcher to pick which ISR to run instead of hardware prioritization logic to pick an ISR.)

Thursday, April 25, 2013

Why Short Interrupt Service Routines Matter


Most embedded systems I see use interrupts to handle high priority events, which are typically triggered by some peripheral device. So far so good. But it is also common for these systems to have significant timing problems even though their CPUs are not 100% loaded.

Let's take an example of three interrupts and see how this type of thing can happen.  Let's call their service routines IntH, IntM, and IntL (for high/medium/low priority), and assume this is a single-level interrupt priority system. By that I mean that these Interrupt Service Routines (ISRs) can't be interrupted by any of the others once they start executing.

Say that you write your software and you measure the idle task as taking 80% of the CPU.  The most important ISR has highest priority, etc.  And maybe this time it works fine.  But eventually you'll run into a system which has timing problems.  You're only 20% loaded; how could you have timing problems? To find out why, we need to dig deeper.

The first step is to measure the worst case (longest) execution time and worst case (fastest) period for each ISR.  Let's say it turns out this way:

IntH: Execution time = 10 msec       Period = 1000 msec
IntM: Execution time =  0.01 msec    Period =    1 msec
IntL: Execution time =  2 msec       Period =  100 msec

Let's take a look at the numbers. This task set is loaded at:  (10/1000) + (0.01/1) + (2/100) = 4%.
BUT it will miss deadlines! How can that be?

IntM is going to miss its deadline periodically (if we assume deadline = period).  IntM will miss its deadline up to 10 times every time IntH runs, because the CPU is tied up for 10 msec with IntH, but IntM needs to run every 1 msec. So once per second IntM will miss its deadlines because it is starved by IntH.

OK, so maybe you saw that one coming.  But there is a more insidious problem here. IntM can also miss its deadline because of IntL.  Once IntL executes, it ties up the CPU for 2 msec, causing IntM to miss its 1 msec period. Even though IntL has a lower priority, once it runs it can't be interrupted, so it hogs the CPU and causes a deadline miss.
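The arithmetic above can be sketched as a small check. The struct and function names here are made up for illustration. The test looks only at blocking by one other ISR that has already started, so it is a necessary condition rather than a full schedulability analysis: passing it does not prove a deadline will be met.

```c
#include <stddef.h>

typedef struct {
    const char *name;
    double exec_ms;    /* worst case (longest) execution time */
    double period_ms;  /* worst case (shortest) period; deadline = period */
} Isr;

/* The example numbers from the text. */
const Isr example_set[3] = {
    { "IntH", 10.0,  1000.0 },
    { "IntM",  0.01,    1.0 },
    { "IntL",  2.0,   100.0 },
};

/* Total CPU load: sum of execution time / period. */
double Utilization(const Isr *isr, size_t n)
{
    double u = 0.0;
    for (size_t i = 0; i < n; i++) { u += isr[i].exec_ms / isr[i].period_ms; }
    return u;
}

/* Longest time any one other ISR can hold the CPU once it has started
   (single-level system, so nothing can interrupt it). */
double WorstBlockingMs(const Isr *isr, size_t n, size_t self)
{
    double worst = 0.0;
    for (size_t i = 0; i < n; i++) {
        if (i != self && isr[i].exec_ms > worst) { worst = isr[i].exec_ms; }
    }
    return worst;
}

/* Returns 1 if blocking by a single other ISR is already enough to
   push this ISR past its deadline. */
int BlockedPastDeadline(const Isr *isr, size_t n, size_t self)
{
    return WorstBlockingMs(isr, n, self) + isr[self].exec_ms
           > isr[self].period_ms;
}
```

With the numbers above, utilization comes out at about 4% and the check flags IntM, which can be stuck behind either the 10 msec IntH or the 2 msec IntL.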

There are plenty of bandaids that can be tossed at this system (and I have the feeling I've seen them all in design reviews). The obvious hack of re-enabling interrupts partway through an ISR is dangerous and should not be used under any circumstance.  It leads to timing-dependent stack overflows, race conditions and so on. And more importantly, re-enabling interrupts in an ISR is, in my experience, a sign that the designers didn't understand the root cause of the timing problems.

But there is a principled way to solve these problems involving two general rules:
 - If possible, sort ISR and task priority by period; shortest period with highest priority.  This minimizes effective CPU use when you do scheduling. To understand why this is important you'll need to read up on Rate Monotonic Scheduling and related techniques.
 - Keep ISR worst case execution time as small as possible -- only a few handfuls of instructions.  If you need to get more done, dump data from the ISR into a buffer and kick off a non-ISR task to do the processing. This prevents one ISR from making another miss its deadline and largely deflects the problem of ISRs not necessarily being assigned the priority you'd like in your particular hardware.
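As a sketch of the second rule, the ISR can do nothing more than drop a byte into a small ring buffer and leave the real work to the main loop. The names and buffer size are assumptions. The sketch also assumes a single ISR producer, a single main-loop consumer, and that one-byte index reads and writes are atomic on the target. If that last assumption doesn't hold, the main-loop side needs interrupts disabled around its accesses.

```c
#include <stdbool.h>
#include <stdint.h>

#define RX_BUF_SIZE 16u   /* assumed size; one slot is kept empty */

static volatile uint8_t rx_buf[RX_BUF_SIZE];
static volatile uint8_t rx_head;  /* written only by the ISR */
static volatile uint8_t rx_tail;  /* written only by the main loop */

/* Called from the ISR: a few instructions, no loops, no waiting.
   Returns false (byte dropped) if the buffer is full. */
bool RxIsrPut(uint8_t byte)
{
    uint8_t next = (uint8_t)((rx_head + 1u) % RX_BUF_SIZE);
    if (next == rx_tail) { return false; }
    rx_buf[rx_head] = byte;
    rx_head = next;
    return true;
}

/* Called from the main loop or a non-ISR task to fetch the next byte.
   Returns false if nothing is waiting. */
bool RxGet(uint8_t *out)
{
    if (rx_tail == rx_head) { return false; }
    *out = rx_buf[rx_tail];
    rx_tail = (uint8_t)((rx_tail + 1u) % RX_BUF_SIZE);
    return true;
}
```

The slow processing then happens outside the ISR, where it can be delayed by other interrupts without making any of them miss a deadline.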

The key insight is that "important" and "priority" are not the same things. Priority is about making real time scheduling math work, and boils down to assigning highest priority to short-period and short-deadline tasks. Getting that to work in turn requires all ISRs (even low priority ones) to be short. The importance of an ISR from the point of view of functionality ("this function is more important to the customer") is largely irrelevant -- the point of real time scheduling is to make sure everything executes every time.  Sometimes "important" and "short deadline" correspond, but not always. It is the deadline that should be paid attention to when assigning priorities if you want to meet real-time deadlines.  (Or, put another way, "important" means real-time and unimportant means non-real-time.)

The discussion above also applies to systems with multiple levels of interrupt priorities. Within each level of priority (assuming one level can interrupt ISRs in another level), once a pig ISR starts none of the other interrupts at that priority level can interrupt it.

Make all your ISRs short, and do the analysis to make sure the worst case clumping of ISR executions doesn't overload your CPU.

Monday, March 25, 2013

Rules for Using Interrupts

Here's a brief guide to rules for good interrupt design.

  • Keep your Interrupt Service Routine (ISR) short. Ideally half a page of C code max.  If you must use assembly code, keep it to one page max. Long ISRs cause timing problems, often in surprising ways.
  • Keep ISR execution time very short. 100-200 clock cycles tops, although there is room for discussion on the exact number. If you have a lot of work to do, shovel the data into a holding buffer and let the main loop or a non-ISR task do the rest.
  • Know the worst case ISR execution time so you can do real-time scheduling. Avoid loops, because these make worst case trickier, and an indefinite loop might hang once in a while due to something you didn't think of.
  • Actually do the real time scheduling, which is a bit tricky because ISRs are non-preemptive within the same ISR priority level. (My book chapter on this works out the math in gory detail.)
  • Don't waste time in an ISR (for example, don't put in a wait loop for some hardware response).
  • Save the registers you modify if your hardware doesn't already do that for you. (Seems obvious, but if you have a lot of registers it might take a lot of testing to catch the one place where a register is used in the main code and the ISR clobbers it.)
  • Acknowledge the interrupt source at the beginning of the ISR (right after you save registers). It makes code reviews easier if it is always in the same place.
  • Don't re-enable interrupts within an ISR.  That's just asking for subtle race condition and stack overflow problems.
There are also some system-level issues having to do with playing well with ISRs:
  • Make sure to disable interrupts when accessing a variable shared with an ISR. Do so for the shortest possible time. Do this even if you "know" it is safe (compiler optimizer behavior is difficult to predict, and code generation may change with a new compiler version). Ideally, protect those variables with access methods so you only have to get this right in one place in the code.
  • Declare any shared ISR/non-ISR variables as volatile. 
  • When you do timing analysis, don't forget that ISRs consume time too.
  • When you do stack depth analysis, don't forget worst-case stack-up of interrupts all occurring at the same time (especially if your processor supports multiple levels of interrupts).
  • Make sure that all interrupt vectors are initialized, even if you don't plan on using them.
  • Only use Non-Maskable Interrupts for a catastrophic system event such as system reset.
  • Be sure to initialize all your interrupt-related data structures and hardware that can generate interrupts before you enable interrupts during the boot-up cycle.
  • Once you turn on the watchdog timer, don't ever mask its interrupt.
  • If you find yourself doing something weird within an ISR, go back and fix the root cause of the problem. Weird ISRs spell trouble.
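Here is a minimal sketch of the "access method" idea for a variable shared with an ISR. DISABLE_INTERRUPTS() and ENABLE_INTERRUPTS() are placeholders for whatever your compiler or processor provides. They are stubbed out with a counter here only so the sketch compiles on a PC. The function and variable names are also invented for the example.

```c
#include <stdint.h>

/* Placeholders: replace with the target's interrupt mask operations. */
static int irq_disable_depth;   /* host-side stand-in, not real hardware */
#define DISABLE_INTERRUPTS() (irq_disable_depth++)
#define ENABLE_INTERRUPTS()  (irq_disable_depth--)

/* Shared between the timer ISR and non-ISR code, so it is volatile
   and hidden at file scope behind the access function below. */
static volatile uint32_t tick_count;

/* Hypothetical timer tick ISR: short, no loops. */
void TimerTickISR(void)
{
    tick_count++;
}

/* The one place non-ISR code reads the shared variable. Interrupts are
   off only long enough to copy it, so a multi-byte read can't be torn
   by the ISR running partway through. */
uint32_t GetTickCount(void)
{
    uint32_t copy;
    DISABLE_INTERRUPTS();
    copy = tick_count;
    ENABLE_INTERRUPTS();
    return copy;
}
```

One caveat on the sketch: it unconditionally turns interrupts back on. If GetTickCount() might be called with interrupts already disabled, you'd want to save and restore the previous interrupt state instead.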
If you've been bitten by an interrupt in a way that isn't covered above, let me know!

Monday, February 25, 2013

Using Profiling To Improve System-Level Test Coverage


Sometimes I run into confusion about how to deal with "coverage" during system level testing. Usually this happens in a context where system level testing is the only testing that's being done, and coverage is thought about in terms of lines of code executed.  Let me explain those concepts, and then give some ideas about what to do when you find yourself in that situation.

"Coverage" is the fraction of the system you've exercised during testing. For unit tests of small chunks of code, it is typically what fraction of the code was exercised during testing (e.g., 95% means in a 100 line program 95 lines were executed at least once and 5 weren't executed at all).  You'd think it would be easy to get 100%, but in fact getting the last few lines of code in test is really difficult even if you are just testing a single subroutine in isolation (but that's another topic). Let's call this basic block coverage, where a basic block is a segment of code with no branches in or out, so if you execute one line of a basic block's code you have to execute all of them as a set. So in a perfect world you get 100% basic block coverage, meaning every chunk of code in the system has been executed at least once during testing. (That's not likely to be "perfect" testing, but if you have less than 100% basic block coverage you know you have holes in your test plan.)

By system level test I mean that the testing involves more or less a completely running full system. It is common to see this as the only level of testing, especially when complex I/O and timing constraints make it painful to exercise the software any other way. Often system level tests are based purely on the functional specification and not on the way the software is constructed. For example, it is the rare set of system level tests that checks to make sure the watchdog timer operates properly.  Or whether the watchdog is even turned on. But I digress...

The problem comes when you ask the question of what basic block coverage is achieved by a certain set of system-level tests. The question is usually posed less precisely, in the form "how good is our testing?"  The answer is usually that it's pretty low basic block coverage. That's because if it is difficult to reach into the corner cases when testing a single subroutine, it's almost impossible to reach all the dusty corners of the code in a system level test.  Testers think they're good at getting coverage with system level test (I know I used to think that myself), but I have a lot of doubt that coverage from system-level testing is high. I know -- your system is different, and it's just my opinion that your system level test coverage is poor. But consider it a challenge. If you care about how good your testing is for an embedded product that has to work (even the obscure corner cases), then you need data to really know how well you're doing.

If you don't have a fancy test coverage tool available to you, a reasonable way to go about getting coverage data is to use some sort of profiler to see if you're actually hitting all the basic blocks. Normally a profiler helps you know what code is executed the most for performance optimization. But in this case what you want to do is use the profiler while you're running system tests to see what code is *not* executed at all.  That's where your corner cases and imperfect coverage are.  I'm not going to go into profiler technology in detail, but you are likely to have one of two types of profiler. Maybe your profiler inserts instructions into every basic block and counts up executions.  If so, you're doing great and if the count is zero for any basic block you know it hasn't been tested (that would be how a code coverage tool is likely to work). But more likely your profiler uses a timer tick to sample where the program counter is periodically.  That makes it easy to see what pieces of code are executed a lot (which is the point of profiling for optimization), but almost impossible to know whether a particular basic block was executed zero times or one time during a multi-hour test suite.

If your profiler only samples, you'll be left with a set of basic blocks that haven't been seen to execute. If you want to go another step further you may need to put your own counters (or just flags) into those basic blocks by hand and then run your tests to see if they ever get executed. But at some point that may be hard too. Running the test suite a few times may help catch the moderately rare pieces of code that are executed. So may increasing your profile sample rate if you can do that. But there is no silver bullet here -- except getting a code coverage tool if one is available for your system.  (And even then the tool will likely affect system timing, so it's not a perfect solution.)
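Hand-placed flags can be as simple as the following sketch. The COVER macro, the flag array, and the ClampSpeed example function are all made up for illustration. In a real system you'd put one marker in each basic block the profiler never saw execute, and report the flags over whatever debug channel you have instead of printf.

```c
#include <stdio.h>

#define COVER_POINTS 3                 /* number of instrumented blocks */
unsigned char cover_hit[COVER_POINTS]; /* 0 = never executed */
#define COVER(id) (cover_hit[(id)] = 1)

/* Hypothetical function under test with one marker per basic block. */
int ClampSpeed(int speed)
{
    if (speed < 0)   { COVER(0); return 0; }
    if (speed > 100) { COVER(1); return 100; }
    COVER(2);
    return speed;
}

/* Lists the markers that never fired and returns how many there were. */
int CoverReport(void)
{
    int missed = 0;
    for (int id = 0; id < COVER_POINTS; id++) {
        if (!cover_hit[id]) {
            printf("block %d never executed\n", id);
            missed++;
        }
    }
    return missed;
}
```

Run the system tests, then call CoverReport() at the end. Anything it lists is code your test suite never reached. Keep in mind that the markers themselves add a little execution time, just as a coverage tool would.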

At any rate, after profiling and some hand analysis you'll be left with some level of idea of what's getting tested and what's not.  Try not to be shocked if it's a low fraction of the code (and be very pleased with yourself if it is above 90%).

If you want to improve the basic block coverage number you can use coverage information to figure out what the untested code does, and add some tests to help. These tests are likely to be motivated more by how the software is built than by the product-level function specification.  But that's why you might want a software test plan in addition to a product test plan.  Product tests never cover all the software corner cases.

Even with expanded testing, at some point it's going to be really really hard to exercise every last line of code -- or even every last subroutine. For those pieces you can test portions of the software to make it easier to get into the corner cases. Peer reviews are another alternative that is more cost effective than testing if you have limited resources and limited time to do risk reduction before shipping. (But I'm talking about formal peer reviews, not just asking your next cube neighbor to look over your code at lunch time.) Or there is the tried-and-true strategy of putting a breakpoint in before a hard-to-get-to basic block with a debugger, and then changing variables to force the software down the path you want it to go. (Whether such a test is actually useful depends upon the skill of the person doing it.)

The profiler-based coverage improvement strategy I've discussed is really about how to get yourself out of a hole if all you have is system tests and you're getting pressure to ship the system. It's better than just shipping blind and finding out later that you had poor test coverage. But the best way to handle coverage is to get very high basic block coverage via unit test, then shake out higher level problems with integration test and system test.

If you have experience using these ideas I'd love to hear about them -- let me know.


Friday, January 25, 2013

Exception Handling Fishbone Diagram

Exception handling is the soft underbelly of many software systems. A common observation is that there are a lot more ways for things to go wrong than there are for them to go right, so designing and testing for exceptional conditions is difficult to do well. Anecdotally, it is also where you find many bugs that, while infrequently encountered, can cause dramatic system failures. While not every software system has to be bullet-proof, embedded systems often have to be quite robust. And it's a lot easier to make a system robust if you have a checklist of things to consider when designing and testing.

Fortunately, just such a checklist already exists. Roy Maxion and Bob Olszewski at Carnegie Mellon created a structured list of exceptional conditions to consider when designing a robust system in the form of a fishbone diagram.




(Source: Maxion & Olszewski, Improving Software Robustness with Dependability Cases, FTCS, June 1998.)

The way to read this diagram is that an exception failure could be caused by any of the general causes listed in the boxes at the end of the fish-bone segments, and the arrows into each fishbone are more specific examples of those types of problems.

If you don't have the picture handy, a way to remember the main branches is:
C - Computational problem
H - Hardware problem
I  - I/O and file problem
L - Library function problem
D - Data input problem
R - Return value problem
E - External user/client problem  (in embedded systems this may include control network exceptions)
N - Null pointer or memory problems

There isn't a silver bullet for exception handling -- getting it right takes attention to detail and careful work. But, this fishbone diagram does help developers avoid missing exception vulnerabilities. You can read more about the idea and the human subject experiments showing its effectiveness in the free on-line copy of their conference paper: Improving Software Robustness with Dependability Cases.

You can read more detail in the (non-free unless you have a subscription) journal paper:
Eliminating exception handling errors with dependability cases: a comparative, empirical study, IEEE Transactions on Software Engineering, Sept. 2000.  http://dx.doi.org/10.1109/32.877848



Tuesday, December 25, 2012

Global Variables Are Evil sample chapter

My publisher has authorized me to release a free sample chapter from my book.  They let me pick one, and I decided to go with the one on global variables. If this is a success, a couple more chapters may be released one way or another, so I'd welcome input on which other topics from the table of contents would be best.


Entire chapter in Adobe Acrobat (166 KB):
http://koopman.us/bess/chap19_globals.pdf



Chapter 19
Global Variables Are Evil

• Global variables are memory locations that are directly visible to an entire
software system.
• The problem with using globals is that different parts of the software are
coupled in ways that increase complexity and can lead to subtle bugs.
• Avoid globals whenever possible, and at most use only a handful of globals
in any system.

Contents:

19.1 Chapter overview

19.2 Why are global variables evil?
Globals have system-wide visibility, and aren’t quite the same as static variables.
Globals increase coupling between modules, increase risk of bugs due to
interference between modules, and increase risk of data concurrency problems.
Pointers to globals are even more evil.

19.3 Ways to avoid or reduce the risk of globals
If optimization forces you into using globals, you’re better off getting a faster
processor. Use the heap or static locals instead of globals if possible. Hide
 globals inside objects if possible, and document any globals that can’t be
eliminated or hidden.

19.4 Pitfalls

19.5 For more information



19.1. Overview

“Global Variables Are Evil” is a pretty strong statement! Especially for
something that is so prevalent in older embedded system software. But, if you can
build your system without using global variables, you’ll be a lot better off.

Global variables (called globals for short) can be accessed from any part of a
software system, and have a globally visible scope. In its plainest form, a global
variable is accessed directly by name rather than being passed as a parameter. By
way of contrast, non-global variables can only be seen from a particular module
or set of related methods. We’ll show how to recognize global variables in detail
in the following sections.

19.1.1. Importance of avoiding globals
The typical motivation for using global variables is efficiency. But sometimes
globals are used simply because a programmer learned his trade with a
non-scoped language (for example, classical BASIC doesn’t support variable
scoping – all variables are globals). Programmers moving to scoped languages
such as C need to learn new approaches to take full advantage of the scoping and
parameter passing mechanisms available.

The main problem with using global variables is that they create implicit
couplings among various pieces of the program (various routines might set or
modify a variable, while several more routines might read it). Those couplings are not
well represented in the software design, and are not explicitly represented in the
implementation language. This type of opaque data coupling among modules
results in difficult to find and hard to understand bugs.

19.1.2. Possible symptoms
You may have problems with global variable use if you observe any of the
following when reviewing your implementation:

- More than a handful of variables are defined as globally visible (ideally, there
are none). In C or C++, they are defined outside the scope of any procedure,
so they are visible to all procedures. An indicator is the use of the keyword
extern to access global variables defined outside the scope of your compiled
module.

- Variables are used in a routine that are neither defined locally, nor passed as
parameters. (This is just another way of saying they are global.)

- In assembly language, variables are accessed by label from subroutines or
modules rather than via being passed on the stack as a parameter value. Any
access to a labeled memory location (other than in a single module that has
exclusive access to that location) is using a global.


The use of global variables is sometimes defensible. But their usage should be
for a very few, special values, and not a matter of routine. Consider their
occasional use a necessary evil.

19.1.3. Risks of using globals
The problem with global variables is that they make programs unnecessarily
complex. That can lead to:

- Bugs caused by hidden coupling between modules. For example, misbehaving
code in one place breaks things in another place.

- Bugs created in one module due to a change in a seemingly unrelated second
module. For example, a change to a program that writes a global variable
might break the behavior of code that reads that variable.

It may be necessary to have variables that are global, or at least serve a similar
purpose as global variables. The risk comes from using global variables too
freely, and not using mitigation strategies that are available.
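As a rough illustration of one mitigation mentioned in the chapter summary (hiding a variable rather than exposing it system-wide), a C module might keep the variable at file scope and offer access functions. The variable, the limit, and the function names below are invented for this sketch and are not taken from the book.

```c
#include <stdbool.h>
#include <stdint.h>

/* Instead of a true global such as:
       uint16_t g_setpoint;   (readable and writable from any file)
   the variable is visible only inside this one source file. */
static uint16_t setpoint;

#define SETPOINT_MAX 1000u   /* assumed valid range for the example */

/* The only way other modules can change the value. Because every write
   goes through here, range checking lives in exactly one place. */
bool SetpointSet(uint16_t value)
{
    if (value > SETPOINT_MAX) { return false; }
    setpoint = value;
    return true;
}

/* The only way other modules can read the value. */
uint16_t SetpointGet(void)
{
    return setpoint;
}
```

The coupling between modules is still there, but it is now explicit in the function calls, and a misbehaving module can no longer write an out-of-range value directly.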

19.2. Why are global variables evil?

Global variables should be avoided to the maximum extent possible. At a high
level, you can think of global variables as analogous to GOTO statements in
programming languages (this idea dates back at least as far as Wulf (1973)). Most of
us reflexively avoid GOTO statements. It’s odd that we still think globals are
OK.

First, let’s discuss what globals are, then talk about why they should be
avoided.

... read more ...


(c) Copyright 2010 Philip Koopman

Sunday, December 16, 2012

Software Timing Loops

Once in a while I see someone using code that uses a timing loop to wait for some time to go by. Code might look like this:

// You should *NOT* use this code as-is!
#define SCALE   15501    // System-dependent time scaling factor

void WaitMS(unsigned long ms)
{ unsigned long counter = ms * SCALE; // Compute how long to wait
  while(counter > 0) { counter--;}    // Waste time
}



The idea is that the designer tweaks "SCALE" so that WaitMS takes exactly one msec for each integer value of the input ms.  In other words WaitMS(5) waits for 5 msec. But, there are some significant problems with this code.
  • Different compiler versions and different optimization levels could dramatically change the timing of this loop depending upon the code that is generated. If you aren't careful, your system could stop working for this reason when you do a recompile, without you even knowing the timing has changed or why that has happened.
  • Changes to hardware can change the timing even if the code is not recompiled. Changing the system clock speed is a fairly obvious problem. But other more subtle problems include clock throttling on high-end processors due to thermal management, putting the code in memory with wait states for access, moving to a processor with instruction cache, or using a different CPU variant that has different instruction timings.
  • The timing will change based on interrupt service routines delaying execution of the while loop. You are unlikely to see the worst case timing disruption in testing unless you have a very deterministic system and/or really great testing.
  • This approach ties up the CPU, using power and preventing other tasks from running.
What I recommend is that you don't use software-based timing loops!  Instead you should change WaitMS() to look at a hardware timer, possibly going to sleep (or yielding to other tasks) until the amount of time desired has passed. In this approach, the inner loop checks a hardware timer or a time of day value set by a timer interrupt service routine.
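
As a rough illustration of the timer-based approach, here is a minimal sketch. It assumes a free-running millisecond counter that a periodic timer interrupt increments. The names g_msTicks, TimerTickISR and IdleUntilNextTick are hypothetical placeholders, not from any particular platform. So that the sketch runs on a desktop machine, IdleUntilNextTick simply simulates one tick; on a real system it would sleep or yield until the next interrupt.

```c
#include <stdint.h>

// Hypothetical free-running msec counter, updated by a timer ISR
volatile uint32_t g_msTicks = 0;

// Assumed to be called once per msec from a hardware timer interrupt
void TimerTickISR(void) { g_msTicks++; }

// Placeholder for "sleep or yield until something happens".
// Simulates one timer tick here so the sketch can run on a host.
void IdleUntilNextTick(void) { TimerTickISR(); }

void WaitMS(uint32_t ms)
{ uint32_t start = g_msTicks;          // Remember when we started
  // Unsigned subtraction gives the right elapsed time even if
  // the counter wraps around from 0xFFFFFFFF to 0 while waiting
  while ((uint32_t)(g_msTicks - start) < ms) { IdleUntilNextTick(); }
}
```

Because the counter only has 1 msec resolution, a wait that starts partway through a tick can return up to one tick early. If that matters, wait for one extra tick or use a finer-grained timer.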


Sometimes there isn't a hardware timer or there is some other compelling reason to use a software loop. If that's the case, the following advice might prove helpful.
  • Make sure you put the software timing loop in one place like this instead of having lots of small in-line timing loops all over the place. It is hard enough to get one timing loop right!
  • The variable "counter" should be defined as "volatile" to make sure it is actually decremented each time through the loop. Otherwise the optimizer might decide to just eliminate the entire while loop.
  • You should calibrate the "SCALE" value somehow to make sure it is accurate. In some systems it makes sense to calibrate a variable during system startup, or do an internal sanity check during outgoing system test to make sure that it isn't too far from the right value. This is tricky to do well, but if you skip it you have no safety net in case the timing does change.
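
If a software loop really is unavoidable, the advice above might be sketched roughly as follows. Everything here other than the loop itself is an assumption made for illustration: the names WaitMS_Software, ScaledCount and ScaleIsSane are hypothetical, and how you obtain the iteration count over a known reference interval depends entirely on what independent time source your system has.

```c
#include <stdint.h>
#include <stdbool.h>

// Loop iterations per msec; system-dependent, must be calibrated
uint32_t g_scale = 15501u;

// Saturate instead of silently wrapping if ms * scale overflows
uint32_t ScaledCount(uint32_t ms, uint32_t scale)
{ if (scale != 0u && ms > UINT32_MAX / scale) { return UINT32_MAX; }
  return ms * scale;
}

// The one and only software timing loop in the system
void WaitMS_Software(uint32_t ms)
{ volatile uint32_t counter = ScaledCount(ms, g_scale); // volatile: keep the loop
  while (counter > 0u) { counter--; }                   // Waste time
}

// Sanity check: given loop iterations counted during refMs msec of an
// independent time reference, is g_scale within tolerancePct percent?
bool ScaleIsSane(uint32_t iterations, uint32_t refMs, uint32_t tolerancePct)
{ if (refMs == 0u) { return false; }
  uint32_t measured = iterations / refMs;
  uint32_t diff = (measured > g_scale) ? (measured - g_scale)
                                       : (g_scale - measured);
  return (uint64_t)diff * 100u <= (uint64_t)g_scale * tolerancePct;
}
```

A startup self-test could call ScaleIsSane() with a measured iteration count and refuse to run (or fall back to a safe mode) if the check fails.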

(There are no doubt minor problems that readers will point out as well depending upon style preferences.   As with all my code examples, I'm presenting a simple "how I usually see it" example to explain the concept.  If you can find style problems then probably you already know the point I'm making. Some examples:  WaitMS might check for overflow, uint32_t is better than "unsigned long," "const unsigned long SCALE = 15501L" is better if your compiler supports it, and there may be mixed-length math issues with the multiplication of ms * SCALE if they aren't both 32-bit unsigned integers.  But the above code is what I usually see in code reviews, and these style details tend to distract from the main point of the article.)

Monday, November 26, 2012

Top 10 Best Practices for Peer Reviews

Here is a nice list of best practices for peer reviews from SmartBear. These parallel the recommendations I usually give, but it is nice to have the longer version readily available too (see link below).

1. Review fewer than 200-400 lines of code at a time.
2. Aim for an inspection rate of less than 300-500 LOC/hr (But, see comment below)
3. Take enough time for a proper, slow review, but not more than 60-90 minutes
4. Authors should annotate source code before the review begins
5. Establish quantifiable goals for code review and capture metrics so you can improve your process
6. Checklists substantially improve results for both authors and reviewers
7. Verify that defects are actually fixed
8. Managers must foster a good code review culture in which finding defects is viewed positively
9. Beware the "Big Brother" effect  (don't use metrics to punish people)
10. The Ego Effect: do at least some code review, even if you don't have time to review it all

And now my comments:  The data I've seen shows 300-500 LOC/hr is too high by a factor of 2 or so. I recommend 100-200 lines of code per hour for 60-120 minutes. It may be that SmartBear's tool lets you go faster, but I believe that comes at a cost that exceeds the time saved.

I deleted their best practice #11, which says that lightweight reviews are great stuff, because I don't entirely buy it.  Everything I've seen shows that lightweight reviews (which they advocate) are better than no reviews, and for that reason perhaps they make a good first step. But if you skip the in-person review meeting you're losing out on a lot of potential benefit.  Your mileage may vary.

You can get the full white paper here: http://support.smartbear.com/resources/cc/11_Best_Practices_for_Peer_Code_Review.pdf
 (They are not compensating me for posting this, and I presume they don't mind the free publicity.)

Monday, June 4, 2012

Cool Reliability Calculation Tools

I ran across a cool set of tools for computing reliability properties, including reliability improvements due to redundancy, MTBF based on testing data, availability, spares provisioning, and all sorts of things.  The interfaces are simple but useful, and the tools are a lot easier than looking up the math and doing the calculations from scratch.  If you need to do anything with reliability it's worth a look:

http://reliabilityanalyticstoolkit.appspot.com

The one I like the most right now is the one that tells you how long to test to determine MTBF based on test data, even if you don't see any failures in testing:
http://reliabilityanalyticstoolkit.appspot.com/mtbf_test_calculator

Here is a nice rule of thumb based on the results of that tool. If you want to use testing to ensure that MTBF is at least some value X, you need to test about 3 times longer than X without ANY failures being observed. That's a lot of testing! If you do observe a failure, you have to test even longer to determine if it was an unlucky break or whether MTBF is smaller than it needs to be. (This rule of thumb assumes 95% confidence level and no testing failures observed -- as well as random independent failures. Play with the tool to evaluate other scenarios.)
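
The factor of 3 can be reproduced with a short calculation, assuming a constant failure rate (exponentially distributed time between failures), which is one common way to model random independent failures. This is not necessarily the exact calculation the tool performs. Here X is the MTBF to be demonstrated, T is the failure-free test time, and C is the confidence level:

```latex
\begin{align*}
P(\text{no failures in } T) &= e^{-T/\mathrm{MTBF}} \\
e^{-T/X} \le 1 - C \;&\Longrightarrow\; T \ge -X \ln(1 - C) \\
C = 0.95 \;&\Longrightarrow\; T \ge -X \ln(0.05) \approx 3.0\,X
\end{align*}
```

In words: if the true MTBF were only X, a failure-free run of length 3X would happen only about 5% of the time, so observing one gives about 95% confidence that MTBF is at least X.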