Tuesday, December 25, 2012

Global Variables Are Evil sample chapter

My publisher has authorized me to release a free sample chapter from my book.  They let me pick one, and I decided to go with the one on global variables. If this is a success, a couple more chapters may be released one way or another, so I'd welcome input on which other topics from the table of contents would be the best candidates.


Entire chapter in Adobe Acrobat (166 KB):
http://koopman.us/bess/chap19_globals.pdf



Chapter 19
Global Variables Are Evil

• Global variables are memory locations that are directly visible to an entire
software system.
• The problem with using globals is that different parts of the software are
coupled in ways that increase complexity and can lead to subtle bugs.
• Avoid globals whenever possible, and at most use only a handful of globals
in any system.

Contents:

19.1 Chapter overview

19.2 Why are global variables evil?
Globals have system-wide visibility, and aren’t quite the same as static variables.
Globals increase coupling between modules, increase risk of bugs due to
interference between modules, and increase risk of data concurrency problems.
Pointers to globals are even more evil.

19.3 Ways to avoid or reduce the risk of globals
If optimization forces you into using globals, you’re better off getting a faster
processor. Use the heap or static locals instead of globals if possible. Hide
globals inside objects if possible, and document any globals that can’t be
eliminated or hidden.

19.4 Pitfalls

19.5 For more information



19.1. Overview

“Global Variables Are Evil” is a pretty strong statement! Especially for something that is so prevalent in older embedded system software. But, if you can build your system without using global variables, you’ll be a lot better off.

Global variables (called globals for short) can be accessed from any part of a
software system, and have a globally visible scope. In its plainest form, a global
variable is accessed directly by name rather than being passed as a parameter. By
way of contrast, non-global variables can only be seen from a particular module
or set of related methods. We’ll show how to recognize global variables in detail
in the following sections.

19.1.1. Importance of avoiding globals
The typical motivation for using global variables is efficiency. But sometimes
globals are used simply because a programmer learned his trade with a
non-scoped language (for example, classical BASIC doesn’t support variable
scoping – all variables are globals). Programmers moving to scoped languages
such as C need to learn new approaches to take full advantage of the scoping and
parameter passing mechanisms available.

The main problem with using global variables is that they create implicit couplings among various pieces of the program (various routines might set or modify a variable, while several more routines might read it). Those couplings are not well represented in the software design, and are not explicitly represented in the implementation language. This type of opaque data coupling among modules results in difficult-to-find and hard-to-understand bugs.

19.1.2. Possible symptoms
You may have problems with global variable use if you observe any of the following when reviewing your implementation:

- More than a handful of variables are defined as globally visible (ideally, there
are none). In C or C++, they are defined outside the scope of any procedure,
so they are visible to all procedures. An indicator is the use of the keyword
extern to access global variables defined outside the scope of your compiled
module.

- Variables are used in a routine that are neither defined locally, nor passed as
parameters. (This is just another way of saying they are global.)

- In assembly language, variables are accessed by label from subroutines or modules rather than being passed on the stack as a parameter value. Any access to a labeled memory location (other than in a single module that has exclusive access to that location) is using a global.


The use of global variables is sometimes defensible. But their usage should be for a very few, special values, and not a matter of routine. Consider their occasional use a necessary evil.
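
As a concrete sketch of the difference (the names here are invented for illustration and are not taken from the book's examples), compare a raw global to the same value hidden behind a file-scope static with accessor routines:

/* Hypothetical illustration -- names are made up for this sketch. */

/* The global way: any module that declares "extern int g_speed_setpoint;"
   can read or write this variable, creating hidden coupling. */
int g_speed_setpoint;

/* A better way: the variable is visible only inside this one file, and
   every access goes through a routine you can inspect, instrument, or
   protect (for example, with range checks or interrupt masking). */
static int speed_setpoint;

void SetSpeedSetpoint(int value)
{ speed_setpoint = value;    /* range checking could go here */
}

int GetSpeedSetpoint(void)
{ return speed_setpoint;
}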

19.1.3. Risks of using globals
The problem with global variables is that they make programs unnecessarily
complex. That can lead to:

- Bugs caused by hidden coupling between modules. For example, misbehaving code in one place breaks things in another place.

- Bugs created in one module due to a change in a seemingly unrelated second
module. For example, a change to a program that writes a global variable
might break the behavior of code that reads that variable.

It may be necessary to have variables that are global, or at least serve a similar
purpose as global variables. The risk comes from using global variables too
freely, and not using mitigation strategies that are available.

19.2. Why are global variables evil?

Global variables should be avoided to the maximum extent possible. At a high level, you can think of global variables as analogous to GOTO statements in programming languages (this idea dates back at least as far as Wulf (1973)). Most of us reflexively avoid GOTO statements. It’s odd that we still think globals are OK.

First, let’s discuss what globals are, then talk about why they should be
avoided.

... read more ...


(c) Copyright 2010 Philip Koopman

Sunday, December 16, 2012

Software Timing Loops

Once in a while I see someone using a timing loop to wait for some time to go by. The code might look like this:

// You should *NOT* use this code as-is!
#define SCALE   15501    // System-dependent time scaling factor

void WaitMS(unsigned long ms)
{ unsigned long counter = ms * SCALE; // Compute how long to wait
  while(counter > 0) { counter--; }   // Waste time
}



The idea is that the designer tweaks "SCALE" so that WaitMS takes exactly one msec for each integer value of the input ms.  In other words WaitMS(5) waits for 5 msec. But, there are some significant problems with this code.
  • Different compiler versions and different optimization levels could dramatically change the timing of this loop depending upon the code that is generated. If you aren't careful, your system could stop working for this reason when you do a recompile, without you even knowing the timing has changed or why that has happened.
  • Changes to hardware can change the timing even if the code is recompiled. Changing the system clock speed is a fairly obvious problem. But other more subtle problems include clock throttling on high-end processors due to thermal management, putting the code in memory with wait states for access, moving to a processor with an instruction cache, or using a different CPU variant that has different instruction timings.
  • The timing will change based on interrupt service routines delaying execution of the while loop. You are unlikely to see the worst case timing disruption in testing unless you have a very deterministic system and/or really great testing.
  • This approach ties up the CPU, using power and preventing other tasks from running.
What I recommend is that you don't use software-based timing loops!  Instead you should change WaitMS() to look at a hardware timer, possibly going to sleep (or yielding to other tasks) until the amount of time desired has passed. In this approach, the inner loop checks a hardware timer or a time of day value set by a timer interrupt service routine.
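
For example, here is a minimal sketch of the timer-based approach. The tick variable is hypothetical -- in a real system it would be incremented by a periodic timer interrupt service routine, or replaced by a read of a free-running hardware timer register:

// Sketch only. Assumes g_msec_ticks is incremented once per msec
// by a timer interrupt service routine (not shown).
volatile unsigned long g_msec_ticks;

void WaitMS(unsigned long ms)
{ unsigned long start = g_msec_ticks;
  while ((g_msec_ticks - start) < ms)
  {  // Optionally sleep or yield to other tasks here
  }
}

Because the comparison uses unsigned subtraction, this handles counter rollover correctly as long as the requested wait is shorter than the full range of the counter.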


Sometimes there isn't a hardware timer or there is some other compelling reason to use a software loop. If that's the case, the following advice might prove helpful.
  • Make sure you put the software timing loop in one place like this instead of having lots of small in-line timing loops all over the place. It is hard enough to get one timing loop right!
  • The variable "counter" should be defined as "volatile" to make sure it is actually decremented each time through the loop. Otherwise the optimizer might decide to just eliminate the entire while loop.
  • You should calibrate the "SCALE" value somehow to make sure it is accurate. In some systems it makes sense to calibrate a variable during system startup, or do an internal sanity check during outgoing system test to make sure that it isn't too far from the right value. This is tricky to do well, but if you skip it you have no safety net in case the timing does change.

(There are no doubt minor problems that readers will point out as well depending upon style preferences.   As with all my code examples, I'm presenting a simple "how I usually see it" example to explain the concept.  If you can find style problems then probably you already know the point I'm making. Some examples:  WaitMS might check for overflow, uint32_t is better than "unsigned long," "const unsigned long SCALE = 15501L" is better if your compiler supports it, and there may be mixed-length math issues with the multiplication of ms * SCALE if they aren't both 32-bit unsigned integers.  But the above code is what I usually see in code reviews, and these style details tend to distract from the main point of the article.)

Monday, November 26, 2012

Top 10 Best Practices for Peer Reviews

Here is a nice list of best practices for peer reviews from SmartBear. These parallel the recommendations I usually give, but it is nice to have the longer version readily available too (see link below).

1. Review fewer than 200-400 lines of code at a time.
2. Aim for an inspection rate of less than 300-500 LOC/hr (But, see comment below)
3. Take enough time for a proper, slow review, but not more than 60-90 minutes
4. Authors should annotate source code before the review begins
5. Establish quantifiable goals for code review and capture metrics so you can improve your process
6. Checklists substantially improve results for both authors and reviewers
7. Verify that defects are actually fixed
8. Managers must foster a good code review culture in which finding defects is viewed positively
9. Beware the "Big Brother" effect  (don't use metrics to punish people)
10. The Ego Effect: do at least some code review, even if you don't have time to review it all

And now my comments:  The data I've seen shows 300-500 LOC/hr is too high by a factor of 2 or so. I recommend 100-200 lines of code per hour for 60-120 minutes. It may be that SmartBear's tool lets you go faster, but I believe that comes at a cost that exceeds the time saved.

I deleted their best practice #11, which says that lightweight reviews are great stuff, because I don't entirely buy it.  Everything I've seen shows that lightweight reviews (which they advocate) are better than no reviews, and for that reason perhaps they make a good first step. But if you skip the in-person review meeting you're losing out on a lot of potential benefit.  Your mileage may vary.

You can get the full white paper here: http://support.smartbear.com/resources/cc/11_Best_Practices_for_Peer_Code_Review.pdf
 (They are not compensating me for posting this, and I presume they don't mind the free publicity.)

Monday, June 4, 2012

Cool Reliability Calculation Tools

I ran across a cool set of tools for computing reliability properties, including reliability improvements due to redundancy, MTBF based on testing data, availability, spares provisioning, and all sorts of things.  The interfaces are simple but useful, and the tools are a lot easier than looking up the math and doing the calculations from scratch.  If you need to do anything with reliability it's worth a look:

http://reliabilityanalyticstoolkit.appspot.com

The one I like the most right now is the one that tells you how long to test to determine MTBF based on test data, even if you don't see any failures in testing:
http://reliabilityanalyticstoolkit.appspot.com/mtbf_test_calculator

Here is a nice rule of thumb based on the results of that tool. If you want to use testing to ensure that MTBF is at least some value X, you need to test about 3 times longer than X without ANY failures being observed. That's a lot of testing! If you do observe a failure, you have to test even longer to determine if it was an unlucky break or whether MTBF is smaller than it needs to be. (This rule of thumb assumes 95% confidence level and no testing failures observed -- as well as random independent failures. Play with the tool to evaluate other scenarios.)
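
If you want to see where the factor of 3 comes from, here is a small sketch of the zero-failure test time calculation (assuming random, independent, exponentially distributed failures; the target MTBF value is just an example):

#include <stdio.h>
#include <math.h>

// Zero-failure demonstration test: test time T needed to show
// MTBF >= target at a given confidence, if no failures occur.
//   T = -ln(1 - confidence) * target_MTBF
int main(void)
{ double target_mtbf = 10000.0;   // hours -- example value only
  double confidence  = 0.95;
  double test_hours  = -log(1.0 - confidence) * target_mtbf;
  printf("Test %.0f hours with zero failures (%.2f times the target MTBF)\n",
         test_hours, test_hours / target_mtbf);   // about 3.0 times
  return 0;
}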



Tuesday, May 15, 2012

CRC and Checksum Tutorial

I published a tutorial on CRCs and Checksums on the blog I maintain on that topic.  Since I get a lot of hits on those search terms here, I'm providing a pointer for those who are interested:
http://checksumcrc.blogspot.com/2012/05/crc-and-checksum-tutorial-slides.html


Friday, May 4, 2012

The End of ESP/ESD Magazine


The print edition of Embedded System Design (formerly Embedded System Programming) has published its last issue. For folks who have been doing this for a long time this is a huge change in the embedded systems landscape. Jack Ganssle has an article about his experiences in his last regular column (http://www.eetimes.com/discussion/break-points/4372144/Farewell--ESD). They're going to invest effort into the web version, and I wish them all the best. Hopefully it will be even better, but it won't be the same.

I started doing embedded systems while I was an undergrad at RPI. My first big project was writing microcode for a graphics display system at IBM over a summer, complete with fixed point trig functions. Then I wrote the software for a control system that ran a hotel in the early 1980s. When I say "ran" I mean turning power off to unused rooms, running a timer for a 3-hour TV movie rental system, giving room status to the maid break room monitor, and running a fairly complex cash register. The entire thing ran on an Atari 1200 at first (an original IBM PC later, when it came on the market). Those systems ran at a handful of MHz and had 64KB of memory. Only floppy disks -- hard disks were too expensive. It was all programmed in Forth (no other language of that era could pull that off on an inexpensive personal computer, and the company learned that the hard way by trying others). Since then computers in hotels have become common. But we were there in the first wave on the automation side (others did bookkeeping at about the same time), and it was a fun ride.

I then got my undergrad degree, drove US Navy submarines, did some combat system integration, got a PhD, and then parlayed a CPU design startup venture from my bedroom into a job with Harris Semiconductor doing embedded controller design. I managed to get two articles into Byte magazine along the way (remember that? or at least heard of it?).  One of them was a cover article on a particular style of computer design (April 1987). When I visited the Smithsonian computer history exhibit there was a huge stack of Byte magazines on display, and I could see the spine of that issue buried in the stack. Nothing makes you feel older than seeing something you wrote in a museum!

Jack has posted a copy of the very first ESP magazine from 1988 here: http://www.ganssle.com/misc/firstesp.pdf   In addition to several pictures of original IBM PC "clone" systems (as we called them in the 80s) you'll also see a 4-page spread for the RTX CPU series on pages 53-56. Those are the guys I started working for in 1989, designing the 32-bit version of the processor family and helping with evolution of the 16-bit version they had already been working on. I found and started reading ESP around the second or third issue, and have read every single issue since. I also assign selected articles from back issues to students in my courses today as a way to supplement embedded system textbooks with articles that in many cases are more directly applicable to the real world. I contributed a few articles to ESP over the years and did the Embedded Systems Conference a few times -- most recently about a year ago in Silicon Valley. It's always fun to connect with the community that way.

I've been doing embedded systems one way or another since I got my degrees, including a few years at United Technologies (mostly elevators and automotive, but also time with jet engines, radars, sonars, helicopters, HVAC systems, and things I'm forgetting). Then I went to Carnegie Mellon to work on wearable computers, software robustness testing, embedded networking, graceful degradation, embedded system safety, and embedded system security. And most recently, autonomous vehicle dependability.

Along the way I've kept up with ESP/ESD even though I was really supposed to be spending more time reading journals and other academic-type stuff. But the embedded system articles are more fun! I've noticed the magazine getting thinner and thinner in recent years, and figured it would eventually come to an end. But it is still sad to see it go. Embedded systems as a profession is quite decentralized. For many years the only real common element that professionals were likely to have was a subscription to that magazine. Other publications have appeared, blogs like this one have popped up, and I know the ESD web site will live on. Hopefully in ten years the community will have even better resources available. We'll see how it turns out.

(PS: for those of you who have noticed no blog postings in a while, this blog and I are both still alive. I have just been that busy. I expect to get back to regular posting eventually. And my CRC blog will see more activity when I get some research funding permissions issues cleared up, although that may be a while.)

Monday, February 27, 2012

Floating Point Comparison Problems

We all learned (or should have learned) that comparing floating point numbers for equality is a risky proposition. The code:
  if (a == b) { ... }
could easily evaluate as not equal rather than equal if a and b differ due to round-off error. There are ways to deal with that involving making sure that the difference is a small fraction of the value.  But as I said, you've probably seen this one.
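
One common form of the fix is a relative-tolerance comparison. This is only a sketch -- choosing an appropriate tolerance is application specific, and there are corner cases (values near zero, for instance) that need their own thought:

#include <math.h>

// Treat a and b as "close enough" if they differ by less than a
// small fraction of the larger magnitude. REL_TOL is application
// specific; 1e-6 here is just a placeholder.
#define REL_TOL 1e-6

int NearlyEqual(double a, double b)
{ double diff = fabs(a - b);
  double largest = fmax(fabs(a), fabs(b));
  return (diff <= (largest * REL_TOL));
}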

Today I want to talk about a different comparison problem that is more subtle, and frankly one I never even thought about until I saw it as a failure in a real system that had (everyone thought) been thoroughly tested.

Consider code that looks like this:

#define SAFETYLIMIT 357.9
double MonitoredValue;
. . .
... compute MonitoredValue based on sensor values ...

if (MonitoredValue > SAFETYLIMIT)
{ ... slow down or shut down system ... }

The idea is to measure an actual value and check against a safety limit.  If the value is too high, command a lower actuation set-point or do a safety shutdown. Alternately, the code might check to make sure that the monitored value is close enough to a commanded value.  The details don't matter beyond there being a floating point value comparison of any type involved.

How, you ask, can this fail since it is not an exact equality test?  The answer is that the computation of MonitoredValue could result in a numeric exception and produce a NaN value ("Not a Number"). Division by zero is a classic way to get this problem, but there are other more subtle problems involving numeric underflow, overflow, or hitting a discontinuity in a trig function. Any ordered comparison involving a NaN evaluates as false (the exact behavior can also depend upon your compilation flags, floating point library, and so on). So, you could have a dangerously high speed whose computation results in a numerical exception, and the speed limit check won't trigger. Because this is a numerical exception problem, using a double instead of a float usually won't help.

There are at least three solutions to this problem. One is to always make sure that NaN values result in a "safe" outcome. I'd recommend against this -- it is just too easy to forget, get wrong, or get into a situation where you aren't even sure what the safe action is. (Consider that most code doesn't bother to check for null pointers. In most projects, checking for NaN just isn't going to happen.)
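
If you do decide to take that approach anyway, C99's math.h provides isnan(), so the check itself is simple -- the hard part is making sure it appears everywhere it is needed and that the "safe" action really is safe. A sketch of the pattern, reusing the names from the example above:

#include <math.h>

// Sketch only: explicitly treat a NaN result as over the limit so
// the protective action is taken.
if (isnan(MonitoredValue) || (MonitoredValue > SAFETYLIMIT))
{ /* ... slow down or shut down system ... */ }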

Another solution is to use integers or fixed point math instead of floating point math. Fixed point computations are a pain, but at least they don't have a NaN value.

The last one I can think of is to set up your compiler or run-time system to trap on NaN comparisons.  (For example,  -fsignaling-nans and other related options for GCC if they are supported.) What you do when you trap is not necessarily simple, but at least your system will know there is a problem instead of blindly going along ignoring safety limits.

None of these solutions is perfect -- but having a plan to deal with this situation is a good first step.

(BTW, there is a LOT more to writing safety critical code than putting in a speed limit check in the main code, so please consider this just an example to motivate the NaN problem. If you have a safety-critical system that doesn't have a mechanical safety backup, you need to do a whole lot more than this before you can trust software to get it right.)

Sunday, February 19, 2012

Controller Area Network (CAN) Protocol Vulnerabilities

Controller Area Network (CAN) is a very popular embedded network protocol. It's widely used in automotive applications, and the resulting low cost makes it popular in a wide variety of other applications.

CAN is a very nice embedded protocol, and it's very appropriate for a wide variety of applications. But, like most protocols, it has some faults. And like many protocol faults, they aren't widely advertised. They are the sort of thing you have to worry about if you need to create an ultra-dependable system. For a robust system architecture handling everyday control functions, they are probably not a huge deal.

Here is a summary of CAN Protocol faults I've run into.

(1) Unprotected header. The length bits of a CAN message aren't protected by the CRC. That means that a single corrupted bit in the length field can cause the CRC checker to look in the wrong place for the CRC. For example, if a 3-byte payload is corrupted to look like a 1-byte payload, the last two bytes of the payload will be interpreted as a CRC. This could cause an undetected 1-bit error. There are some other framing considerations that might or might not cause the error to be discovered, but it is a vulnerability in CAN that is fixed in later protocols via use of a dedicated header CRC (e.g., as done in FlexRay). I've never seen this written up for CAN in particular, but this is discussed on page 20 of [Driscoll09]

(2) Stuff bit errors compromising the CRC. A pair of bit errors in the bit-stuffed format of a CAN message (i.e., the format on the wire) can cause a cascade of bit errors in the message, leading to two-bit errors that are undetected by the CRC. See [Tran99] and slides 15-16 of
http://www.ece.cmu.edu/~ece649/lectures/13_can.pdf
In general there are issues in any embedded network if the physical layer encoding can cause the propagation of one physical data bit fault into affecting multiple data bits.

(NOTE: added 10/16/2013.  I just found out about the CAN-FD protocol. Among other things it includes the stuff bits in the CRC calculation to mitigate this problem.  Whether or not there are any holes in the scheme I don't know, but it is clearly a step in the right direction.)

(3) CAN is intended to provide exactly-once delivery of messages via the use of a broadcast acknowledgement mechanism. However, there are faults with this mechanism. In one fault scenario, some nodes see an acknowledgement but others don't, so some nodes will receive two copies of a message (thinking they are independent copies rather than a retransmission) while others receive just one copy. In another less likely (but plausible) scenario, the above happens but the sender fails before retransmitting, resulting in some recipients having a copy of the message and others never having received a copy. [Rufino98]  Further info here: [Tindell20b]

(4) It is possible for a CAN node to lock up the network due to retries if a transmitter sees a different value on the bus than it transmitted. (This is a failure of the transmitter, but the point is one node can take down the network.) [Perez03]

(5) If a CAN node's transmitter gets stuck at a dominant bit value, it will jam up the entire network. The error counters inside each CAN controller are supposed to help mitigate this, but if you have a stuck-at-"on" transmitter that ignores "turn off" commands, there isn't anything you can do with a standard CAN setup.

(6) CAN suffers from a priority inversion issue, which can cause delays of high priority messages due to queuing issues in nodes that locally use FIFO transmission order. [Tindell20]  Note that strictly speaking this is a CAN driver and hardware issue rather than something inherent to the on-wire protocol.  To quote Ken Tindell via a social media message: "Any software that does FIFO queuing of CAN frames in a critical application is broken."

There are various countermeasures you might wish to take to combat the above failure modes for highly dependable applications. For example, you might put an auxiliary CRC or checksum in the payload (although that doesn't solve the stuff bit vulnerability). You might shut down the network if the CAN controller error counter shows a high error rate. Or you might just switch to a more robust protocol such as FlexRay.
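
As a rough illustration of the auxiliary checksum idea, here is a sketch (the function name and byte layout are invented for this example; a real design would more likely use a CRC polynomial chosen for the payload length rather than a simple additive checksum):

#include <stdint.h>

// Sketch: reserve the last byte of an 8-byte CAN payload for a
// simple additive checksum over the first 7 data bytes.
void AddPayloadChecksum(uint8_t payload[8])
{ uint8_t sum = 0;
  int i;
  for (i = 0; i < 7; i++) { sum += payload[i]; }
  payload[7] = sum;
}

The receiver recomputes the checksum over the first 7 bytes and discards the message on a mismatch.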

CAN is also not secure, in that it provides no message authentication. It was not designed to be secure, so this is more of a limitation in scope than a design defect.

Have you heard of any other CAN failure modes? If so, let me know.

References:

[Driscoll09] Driscoll, K., Hall, B., Koopman, P., Ray, J., DeWalt, M., Data Network Evaluation Criteria Handbook, AR-09/24, FAA, 2009.
http://www.ece.cmu.edu/~koopman/pubs/faa09-24_data_network_evaluation_criteria_handbook.pdf

[Tran99] Eushiuan Tran Tsung, Multi-Bit Error Vulnerabilities in the Controller Area Network Protocol, MS Thesis, Carnegie Mellon University ECE Dept, May 1999
http://www.ece.cmu.edu/~koopman/thesis/etran.pdf

[Rufino98] Rufino, J.; Verissimo, P.; Arroz, G.; Almeida, C.; Rodrigues, L., Fault-tolerant broadcasts in CAN, FTCS 1998, pp. 150-159.
http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=689464

[Perez03] Perez, J., Reorda, M., Violante, M., Accurate dependability analysis of CAN-based networked systems, 2003 SBCCI.
http://porto.polito.it/1418994/

[Tindell20] Tindell, K., CAN Priority Inversion, June 29, 2020.
https://kentindell.github.io/2020/06/29/can-priority-inversion/

[Tindell20b] Tindell, K., "CAN, atomic broadcast and Buridan's Ass,"  https://kentindell.github.io/2020/07/11/can-atomic-multicast/

Friday, February 3, 2012

Avoid Floating Point Counters

Every once in a while I encounter code that uses a float to track time, keep count of events, or otherwise add small increments to a running sum. This can lead to some nasty bugs that don't show up during system test.

Consider the following C code:

#include <stdio.h>
#define SKIP 0x00FFFFF8

int main(void)
{
  union  { float fv;  int iv; } count;
  long i;

  count.fv = 0;                      // Start the running sum at zero

  for (i = 0; i < SKIP; i++)         // Count up to just below 2**24
  { count.fv += 1;
  }

  for (i = 0; i < 16; i++)           // Keep counting past 2**24
  { count.fv += 1;
    printf(" +1 = %8.0f   0x%08X\n", count.fv, count.iv);
  }
  return 0;
}

At first blush, it seems this counter will be able to count up to a really high number, because even a 32-bit float can handle very large numeric values. But, the problem is that a float only has a 24-bit mantissa value. So if you have a total count that requires more than a 24-bit integer to represent, the increment will stop working due to roundoff error.  In other words, each incremental +1 gets rounded off to be equivalent to +0, and the counter stops counting.

Here's the output of that program compiled with gcc v3.4.4 (header line added):
       Count     Hex Representation
 +1 = 16777209   0x4B7FFFF9
 +1 = 16777210   0x4B7FFFFA
 +1 = 16777211   0x4B7FFFFB
 +1 = 16777212   0x4B7FFFFC
 +1 = 16777213   0x4B7FFFFD
 +1 = 16777214   0x4B7FFFFE
 +1 = 16777215   0x4B7FFFFF
 +1 = 16777216   0x4B800000
 +1 = 16777217   0x4B800000
 +1 = 16777217   0x4B800000
 +1 = 16777217   0x4B800000
 +1 = 16777217   0x4B800000
 +1 = 16777217   0x4B800000
 +1 = 16777217   0x4B800000
 +1 = 16777217   0x4B800000
 +1 = 16777217   0x4B800000

As you can see, once the count hits a 24-bit value (16M), the counter stops counting.

This sort of thing can be hard to find in system test, because you may not run the test long enough to catch the problem. For an example of a similar sort of bug that contributed to a tragedy, read about the Patriot Missile bug at http://www.fas.org/spp/starwars/gao/im92026.htm

Here it is in a nutshell: Never Use A Float For A Running Sum
More generally, avoid floats in embedded systems if possible.
You can often get away with a double, but most times you are better off just using an int or long int.
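
For example, here is a sketch of the integer alternative (the tick rate and names are made up for illustration; convert to engineering units only at the point of use, if at all):

#include <stdint.h>

// The running count lives in an integer, so every +1 is exact.
uint32_t tick_count = 0;

void TimerTick(void)      // e.g., called once per msec by a timer ISR
{ tick_count++;
}

double ElapsedSeconds(void)
{ return ((double)tick_count) / 1000.0;   // convert only when needed
}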

Detailed notes:
IEEE single precision floating point has a 23 bit mantissa with an implicit leading 1 bit. So I say it is a 24 bit value even though only 23 of those bits show up in the actual binary representation.

The fact that the printed floating point value increments one more time but the hex value doesn't change is interesting. Likely it involves the compiled code temporarily storing the increment value in a double precision (or larger) floating point holding register and using that longer value for the printf. Note that the hex value and thus the 32-bit stored value does stop incrementing at the predicted point.

Monday, January 2, 2012

New Blog on Checksums and CRCs

I've decided to start a separate blog for checksum and CRC postings. There are only a few posts at this point, but I expect it will accumulate quite a bit more material over time.

http://checksumcrc.blogspot.com/
Checksum and CRC Central: Cyclic Redundancy Codes, Checksums, and related issues for computer system data integrity.

The most recent post gives CRC polynomials for optimal protection of 8-bit, 16-bit, and 32-bit data words such as you might want for ensuring integrity of key variables. The blog you're reading now (Better Embedded System SW) will continue as before.

Here's wishing everyone a happy and prosperous new year!
