Friday, September 30, 2011

The NOP Trick For Speed Profiling

An important principle in code optimization is speed up the parts that matter. One of the interpretations of Amdahl's Law is that speedup is limited by the fraction of total time taken by a particular piece of code. If for example a particular snippet of code is 10% of overall execution time, then making it execute infinitely fast is only going to give you a 10% speedup. 10% might be worthwhile, but if instead a snippet of code only takes 0.001% of total execution time, then it is unlikely spending time optimizing it is worth your while. In other words, there is no point wasting hours optimizing parts of the code that won't make any difference to overall execution speed.

But, how do you know what parts of the code matter? Intuition can help sometimes, but in practice it is hard to know your code well enough that you always guess the bottleneck locations correctly.

If you have a tool-rich development environment you can use a profiling tool (wikipedia has a description and a list of common tools). But sometimes you are on a platform that is too small to use these tools, or you're in too much of a hurry to use them because you are just absolutely sure you are right about where all the time is being spent.

Here is a trick that can save you a lot of time doing optimization. I use it pretty much every time I'm going to optimize code as a sanity check (even if a tool tells me where to optimize, because tools aren't perfect and I've learned to be careful before investing valuable time optimizing). The trick is: insert some NOPs in the code you want to optimize and see how much slower it goes. If you don't see a speed difference, then you are wasting your time optimizing that part of the code to make it go faster. If you do see a speed difference, that difference can help you estimate how much speedup you are going to get if you do the optimization.

Here's some example C code before optimization:
  for(int i = 0; i < 1000; i++)
  { x[i] += y[i];
  }

Depending on your compiler it might well be that you can optimize this by switching to pointers instead of indexed arrays. Assuming that you've decided it is worth the trouble to optimize your code instead of buying a faster processor (that's a whole other discussion!) then probably this code can be made faster. But, if this code is only called once in a while and is only a small part of your overall program it might not give enough speedup to be worth optimizing.

Let's say you think you can squeeze 5 clock cycles per loop out of this code with a rewrite. Before you optimize it, use a timer (or an oscilloscope watching an output pulse on an I/O pin, or a stopwatch) to run the code as-is, and then time the following code to see if there is a speed difference:

  for(int i = 0; i < 1000; i++)
  { x[i] += y[i];
    asm("nop");
    asm("nop");
    asm("nop");
    asm("nop");
    asm("nop");
  }

This inserts 5 NOP do-nothing instructions that are executed every time the loop is executed. Assuming a NOP instruction takes 1 clock, it slows the loop down by 5 clock cycles.

If the whole program with the NOPs in the loop runs 10% slower, then you know that optimizing the loop to save 5 clocks will likely cause the whole program to run about 10% faster. (The slowdown and speedup aren't quite the same percent if you think about the math, but for small percents it's close enough to not worry about this.)

If on the other hand adding a bunch of NOPs makes no difference to the overall program speed, then you are wasting your time optimizing that loop. If adding instructions doesn't make things noticeably slower, then removing them won't make things faster.