Monday, May 31, 2010

More than 80% full is too full

It is common for embedded systems to optimize hardware costs without really looking at the effect that has on software development costs. Many of us have spent a few hours searching for a trick that will save a handful of memory bytes. That only makes sense if you believe engineering is free. (It isn't.)

Most developers have a sense that 99%+ is too full for memory and CPU cycles. But it is much less clear where to draw the line. Is 95% too full? How about 90%?

As it turns out, there is very little guidance in this area. But performing some what-if analysis using classic software cost data leads to a rather startling conclusion:

If your memory or CPU is more than 80% full
and you are making fewer than 1 million units
then you should get more memory and a faster CPU.

This may seem too conservative. But what is happening between about 60% full and 80% full is that software is gradually becoming more difficult to develop. In part that is because a lot of optimizations have to be added. And in part that is because there is limited room for run-time monitoring and debug support. You have to crunch the data to see the curves (chapter 18 of my book has the details). But this is where you end up.
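
To make the tradeoff concrete, here is a back-of-the-envelope sketch. Every number in it (unit volume, per-unit hardware savings, baseline software cost, and the effort penalty for running nearly full) is an illustrative assumption, not the cost data from the book -- the point is just the shape of the comparison.

    #include <stdio.h>

    int main(void)
    {
        /* Illustrative assumptions only -- plug in your own numbers. */
        double units             = 100000.0;  /* production volume                        */
        double hw_savings_each   = 1.50;      /* dollars saved per unit by a smaller part */
        double base_sw_cost      = 600000.0;  /* software cost at a comfortable ~60% full */
        double effort_multiplier = 1.5;       /* assumed penalty for running ~90% full    */

        double hw_saved   = units * hw_savings_each;                  /* $150,000 here */
        double sw_penalty = base_sw_cost * (effort_multiplier - 1.0); /* $300,000 here */

        printf("Hardware saved:  $%.0f\n", hw_saved);
        printf("Extra software:  $%.0f\n", sw_penalty);
        printf("%s\n", (hw_saved > sw_penalty) ? "Smaller part wins"
                                               : "Buy the bigger part");
        return 0;
    }

With these made-up numbers the "cheaper" part actually costs an extra $150,000 overall; only at much higher volumes does the hardware saving start to dominate, which is why the rule of thumb includes a volume threshold.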

This is a rule of thumb rather than an absolute cutoff. 81% isn't much different than 80%. Neither is 79%. But by the time you are getting up to 90% full, you are spending dramatically more on the overall product than you really should be. Chapter 18 of my book discusses the true cost of nearly full resources and explains where these numbers come from.
---

Thursday, May 27, 2010

Managing developer staff by head count

Last I checked, engineers were paid pretty good salaries. So why does management act like they are free?

Let me explain.  It is common to manage engineers by headcount. Engineering gets, say, 12 people, and their job is to make products happen. But this approach means engineers are an overhead resource that is just there to use without really worrying about who pays for it. Like electricity. Or free parking spots.

The good part about engineering headcount is it is simple to specify and implement. And it can let the engineering staff concentrate on doing design instead of spending a lot of time marketing themselves to internal customers. But it can cause all sorts of problems. Here are some of the more common ones:
  • Decoupling of cost from workload. Most software developers are over-committed. You get the peanut butter effect. No matter how big your slice of bread, it always seems possible to spread the peanut butter a little thinner to cover it all. After too much spreading you spend all your time fighting fires and no time being productive. (Europeans -- feel free to substitute Nutella in this analogy!  I'm not sure how well the analogy works Down Under -- I've only been brave enough to eat Vegemite once.)
  • Inability to justify tool spending. Usually tools, outside consultants, and other expenses come from "real" money budgets. Usually those budgets are limited. This often results in head-count engineers spending or wasting many hours doing something that they could get from outside much more cheaply. If a $5000 software tool saves you a month of time that's a win. But you can't do it if you don't have the $5000 to spend.
  • Driving use of the lowest possible cost hardware. Many companies still price products based on hardware costs and assume software is free. If you have an engineering headcount based system, engineers are in fact "free" (as far as the accountants can tell). This is a really bad idea for most products. Squeezing software into highly constrained hardware gets really expensive! And even if you can throw enough people at it, resource constraints increase the risk of bugs.
There are no doubt other problems with using an engineering headcount approach (drop me a line if you think of them!).  And there are some benefits as well. But hopefully this gives you the big picture. In my opinion head counts do more harm than good.

Corporate budgeting is not a simple thing, so I don't pretend to have a simple magic wand to fix things. But in my opinion adding some of the elements below will help over the long term:
  • Include engineering cost in product development cost. You probably budget for manufacturing tooling and other up-front (NRE) costs. Why isn't software development budgeted?  This will at the very least screen out ill-conceived projects that put expensive (complex, hard-to-get-right) software into marginally viable products.
  • If you must use head-counts, revisit them annually based on workload and adjust accordingly.
  • Treat headcount as a fixed resource that is rationed, not an infinite resource. Or at least make product lines pay into the engineering pool in proportion to how much engineering they use, even if they can't budget for it beforehand.
There aren't any really easy answers, but establishing some link between engineering workload and cost probably helps both the engineers and the company as a whole. I doubt everyone agrees with my opinions, so I'd like to hear what you have to say!

Monday, May 24, 2010

Improving CAN Bit Error Rates

If you are using the CAN protocol (Controller Area Network) in an embedded system, you should take some care to ensure you have acceptable bit error rates. (Sometimes we talk about embedded hardware even though this is a mostly software blog.)

The first step is to find out if you have a problem. Most CAN interface chips have an error counter that keeps track of how many message errors have occurred, and generates an exception if some number of errors has been detected. Turn that feature on, set the threshold as low as possible, and run the system in as noisy an environment as you reasonably expect. Let it run a good long while (a weekend at least).  See if any errors are reported.  If possible, send varying data patterns on some messages and check the values as you receive them to make sure there are no undetected errors. (CAN has some vulnerabilities that let some specific types of single and double bit errors slip through. So if you have detected CAN errors there are good odds you also have undetected CAN errors slipping through. More on that topic another time.)
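
Here is a rough sketch of what that soak-test check might look like in code. The function names (can_read_tec, can_read_rec, can_read_error_flags) are hypothetical stand-ins for whatever your particular CAN controller's registers or driver provide for the transmit error counter, receive error counter, and error status flags -- check your chip's data sheet for the real names.

    #include <stdint.h>
    #include <stdio.h>

    /* Hypothetical driver hooks -- substitute your CAN controller's actual
     * registers or HAL calls for the error counters and error status flags. */
    extern uint8_t  can_read_tec(void);          /* transmit error counter        */
    extern uint8_t  can_read_rec(void);          /* receive error counter         */
    extern uint32_t can_read_error_flags(void);  /* stuff/CRC/form/ACK error bits */

    /* Call this periodically during a long soak test in a noisy environment
     * and keep a record of the worst values seen. */
    void can_error_soak_check(void)
    {
        static uint8_t worst_tec = 0u;
        static uint8_t worst_rec = 0u;

        uint8_t  tec   = can_read_tec();
        uint8_t  rec   = can_read_rec();
        uint32_t flags = can_read_error_flags();

        if (tec > worst_tec) { worst_tec = tec; }
        if (rec > worst_rec) { worst_rec = rec; }

        if (flags != 0u || tec != 0u || rec != 0u) {
            /* In a real system, log this somewhere persistent rather than printf. */
            printf("CAN errors: flags=0x%08lx TEC=%u REC=%u (worst %u/%u)\n",
                   (unsigned long)flags, (unsigned)tec, (unsigned)rec,
                   (unsigned)worst_tec, (unsigned)worst_rec);
        }
    }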

If you find you are getting errors, here are some things to check and consider:
  • Make sure the cabling is appropriately terminated, grounded, shielded, and so on (see your CAN interface documentation for more).
  •  Use differential signals instead of a single signal (differential signals are the usual practice, but it never hurts to make sure that is what you are using).
  •  Make sure you aren't exceeding the maximum bus length for your chosen bit rate (your CAN chip reference materials should have a description of this issue).
  •  Make sure you haven't hooked up too many nodes for your bus drivers to handle.
Those are all great things to check. And you probably thought of most or all of them. But here are the ones people tend to miss:
  •  If you still have noise, switch to optocouplers (otherwise known as optical isolation). In high noise environments (such as those with large electric motors) that is often what it takes.
  •  Don't use transformer coupling. While that is a great approach to isolation in high noise environments, it won't work with CAN because it doesn't give you bit dominance.
If you know of a trick I've missed, please send it to me!

Wednesday, May 19, 2010

Only 10 lines of code per day. Really??

If you want to estimate how long it's going to take to create a piece of embedded software (and how much it will cost), it's useful to have an idea of how productive you're going to be. Lines of code written per day is a reasonable starting point for this. It's a crude metric to be sure. If you have enough experience that you can criticize this metric you probably don't need to read further. But if you are just starting to keep count, this is a reasonable way to go. For tallying purposes we just consider executable code, and ignore comments as well as blank lines.

It's pretty typical for solid embedded software to come in at between 1 and 2 lines of code (LOC) per developer-hour. That's 8 to 16 LOC per developer each day, or about 2000-4000 LOC per year.

If you want just a single rough number, call it 10 LOC per day per developer. This is relatively language independent, and is for experienced developers producing code that is reasonably reliable, but not intended to be safety critical.

Wow, that's not much code, is it? Heck, most of us can recall cranking out a couple hundred lines in an evening when we were taking programming courses. So what's the deal with such low numbers?

First of all, we're talking about real code that goes into real products, that actually works. (Not kind-of sort-of works like student homework. This stuff actually has to work!)

More importantly, this is the total cost for everything, including requirements, design, testing, meetings, and so on. So to get this number, divide total code by the total person-hours for the whole development project. Generally you should leave out beta testing and marketing, but include all in-house testers and first-level technical managers.
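
For example (with made-up numbers): a project that ships 24,000 lines of executable code and consumes 12,000 person-hours of total project effort -- say six people for a year -- works out to 24,000 / 12,000 = 2 LOC per hour, or about 16 LOC per developer-day.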

It's unusual to see a number less than 1 LOC/hour unless you're developing safety critical code. Up to 3 LOC/hour might be reasonable for an agile development team. BUT, that 3 LOC tends to have little supporting design documentation, which is a problem in some circumstances (agile+embedded is a discussion for another time). It's worth mentioning that any metric can be gamed, and that there are situations in which metrics are misleading or useless. But, it's worth knowing the rules of thumb and applying them when they make sense.

Generally teams with a higher LOC/day number are cutting corners somewhere. Or they are teams composed entirely of the world's most amazing programmers. Or both. Think about it. If your LOC/day number is high, ask yourself what's really going on.

Friday, May 14, 2010

Security for automotive control networks

News is breaking today that automotive control networks are vulnerable to attacks if you inject malicious messages onto them (see this NY Times article as an example). It's good that someone has taken the trouble to demonstrate the attack, but to our research group the fact that such vulnerabilities exist isn't really news. We've been working on countermeasures for several years, sponsored by General Motors.

If this sort of issue affects you, here is a high level overview. Pretty much no embedded network has support for authentication of messages. By that, I mean there is no way to tell if the node sending a particular message is really the node that is supposed to be sending it. It is pretty easy to reverse engineer the messages on a car network and find out, for example, that message header ID #427 is the one that disables the car engine (not the real ID number and not necessarily a real message -- just an example). Once you know that, all you have to do is connect to the network and send that message. Easy to do. Probably a lot of our undergraduates could do it. (Not that we teach them to be malicious -- but they shouldn't get an "A" in the courses I teach if they can't handle something as simple as that!)

The problem is that, historically, embedded networks have been closed systems. Designers assumed there was no way to connect to them from the outside, and by extension assumed they wouldn't be attacked. That is all changing with connectivity to infotainment systems and the Internet.


As I said, we've worked out a solution to this problem. My PhD student Chris Szilagyi published the long version in this paper from 2009. The short version is that what you want to do is add a few bits of cryptographically secure authentication to each network message. You don't have a lot of bits to work with (CAN has a maximum 8 byte payload). So you put in just a handful of authentication bits in each message. Then you accumulate multiple messages over time until the receiver is convinced that the message is authentic enough for its purposes. For something low risk, a couple messages might be fine. For something high risk, you collect more messages to be sure it is unlikely an attacker has faked the message. It's certainly not "free", but the approach seems to provide reasonable tradeoff points among cost, speed, and security.
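
Here is a much-simplified receiver-side sketch of that general idea (not the actual scheme from the paper). The mac8() function is a hypothetical stand-in for a real cryptographic MAC truncated to 8 bits, and key management, counter synchronization, and multi-receiver issues are all glossed over.

    #include <stdint.h>
    #include <stdbool.h>

    #define AUTH_BITS_PER_MSG   8u   /* truncated MAC bits carried in each message        */
    #define HIGH_RISK_AUTH_BITS 32u  /* accumulated bits required for a high-risk command */

    /* Hypothetical: a real cryptographic MAC (e.g., CMAC) truncated to 8 bits,
     * computed over the payload plus a shared message counter. */
    extern uint8_t mac8(const uint8_t *payload, uint8_t len,
                        uint32_t counter, const uint8_t key[16]);

    typedef struct {
        uint32_t counter;    /* shared, monotonically increasing message counter */
        uint32_t auth_bits;  /* authentication bits accumulated so far           */
    } auth_state_t;

    /* Receiver side: byte 7 of each 8-byte CAN payload carries the truncated MAC.
     * Returns true once enough authentication has accumulated to act on the command. */
    bool receive_authenticated(auth_state_t *st, const uint8_t payload[8],
                               const uint8_t key[16])
    {
        uint8_t expected = mac8(payload, 7u, st->counter, key);
        st->counter++;

        if (payload[7] != expected) {
            st->auth_bits = 0u;              /* mismatch: start over      */
            return false;
        }
        st->auth_bits += AUTH_BITS_PER_MSG;  /* 8 more bits of confidence */
        return (st->auth_bits >= HIGH_RISK_AUTH_BITS);
    }

In this sketch a low-risk command might act after a single authenticated message, while something high risk waits for the full 32 accumulated bits.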

There is no such thing as perfectly secure, and it is reasonable for manufacturers to avoid the expense of security measures if attacks aren't realistically going to happen. But if they are going to happen, it is our job as researchers to have countermeasures ready for when they are needed. (And, if you are a product developer, your job to make sure you know about solutions when it is time to deploy them.)


I'm going to get into my car and drive home today without worrying about attacks on my vehicle network at all. But, eventually it might be a real concern.

Thursday, May 13, 2010

What's the best CRC polynomial to use?


(If you want to know more, see my Webinar on CRCs and checksums based on work sponsored by the FAA.)

If you are looking for a lightweight error detection code, a CRC is usually your best bet. There are plenty of tutorials on CRCs and a web search will turn them up. If you're looking at this post probably you've found them already.

The tricky part is the "polynomial" or "feedback" term that determines how the bits are mixed in the shift-and-XOR process. If you are following a standard of some sort then you're stuck with whatever feedback term the standard requires. But many times embedded system designers don't need to follow a standard -- they just need a "good" polynomial. For a long time folk wisdom was to use the same polynomial other people were using on the presumption that it must be good. Unfortunately, that presumption is often wrong!

Some polynomials in widespread use are OK, but many are mediocre, some are terrible if used the wrong way, and some are just plain wrong due to factors such as a typographical error.

Fortunately, after spending many CPU-years of computer time doing searches, a handful of researchers have come up with optimal CRC polynomials. You can find my results below. They've been cross-checked against other known results and published in a reviewed academic paper. (This doesn't guarantee they are perfect!  But they are probably right.)


[Table: optimal CRC polynomials by CRC size and Hamming Distance, giving the polynomial and the maximum protected data word length for each entry.]
(**** NOTE: this data is now a bit out of date. See this page for the latest ****)

Here is a thumbnail description of using the table. HD is the Hamming Distance, which is the minimum number of bit errors that can go undetected. For example, HD=4 means all 1, 2, and 3 bit errors are detected, but some 4-bit errors are undetected, as are some errors with more than 4 bits corrupted.

The CRC Size is how big the CRC result value is. For a 14-bit CRC, you add 14 bits of error detection to your message or data packet.

The bottom number in each box within the table is the CRC polynomial in implicit "+1" hex format, meaning the trailing "+1" is omitted from the polynomial number. For example, hex 0x583 = binary 101 1000 0011 = x^11 + x^9 + x^8 + x^2 + x + 1. (This is "Koopman" notation in the wikipedia page.  No, I didn't write the wikipedia entry, and I wasn't trying to be gratuitously different. A lot of the comparison stuff happened after I'd already done too much work to have any hope of changing my notation without introducing mistakes.) 

The top number in each box is the maximum data word length you can protect at that HD. For example, the polynomial 0x583 is an 11-bit CRC that can provide HD=4 protection for all data words up to 1012 bits in length.  (1012+11 gives a 1023 bit long combined data word + CRC value.)
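
As a concrete (unofficial) example, here is a minimal bit-at-a-time implementation of that 11-bit CRC in C. Converting 0x583 from the implicit "+1" notation to the conventional form used by a left-shifting software CRC means shifting left one bit, adding the +1 term back, and dropping the x^11 term: ((0x583 << 1) | 1) & 0x7FF = 0x307. The initial value is a design choice (zero here), and there is no table-lookup speedup.

    #include <stdint.h>
    #include <stddef.h>

    #define CRC11_POLY 0x307u   /* x^11 + x^9 + x^8 + x^2 + x + 1, with the x^11 term implied */
    #define CRC11_MASK 0x7FFu   /* keep the register to 11 bits */

    uint16_t crc11_0x583(const uint8_t *data, size_t len)
    {
        uint16_t crc = 0u;                     /* initial value: a design choice */

        for (size_t i = 0u; i < len; i++) {
            crc ^= (uint16_t)data[i] << 3;     /* align the byte with the top of the 11-bit register */
            for (int bit = 0; bit < 8; bit++) {
                if (crc & 0x400u) {            /* top (x^10) register bit set? */
                    crc = ((crc << 1) ^ CRC11_POLY) & CRC11_MASK;
                } else {
                    crc = (crc << 1) & CRC11_MASK;
                }
            }
        }
        return crc;
    }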

You can find the long version in this paper:  Koopman, P. & Chakravarty, T., "Cyclic Redundancy Code (CRC) Polynomial Selection For Embedded Networks," DSN04, June 2004. Table 4 lists many common polynomials, their factorizations, and their relative performance. It covers up to 16-bit CRCs. Longer CRCs are a more difficult search and the results aren't quite published yet.

You can find more discussion about CRCs and Checksums at my blog on that topic: http://checksumcrc.blogspot.com/

(Note: updated 8/3/2014 to correct the entry for 0x5D7, which provides HD=5 up to 26 bits. The previous graphic incorrectly gave this value as 25 bits. Thanks to Berthold Gick for pointing out the error.)

Monday, May 10, 2010

Which Error Detection Code Should You Use?

Any time you send a message or save some data that might be corrupted in storage, you should think about using some sort of error detection code so you can tell if the data has been corrupted. If you do a web search you will find a lot of information about error detection codes. Some of it is great stuff. But much of it is incorrect, or on a good day merely suboptimal. It turns out that the usual rule of thumb of "do what the other guys do and you can't go far wrong" works terribly for error detection. There is lots of folk wisdom that just isn't right.


So, here is a guide to simple error detection in embedded systems. There is a journal paper with all the details (see this link), but this is the short version.

If you want the fastest possible computation with basic error detection:
  • Parity is fine if you have only one bit to spend, but takes about as much work to compute as a checksum.
  • Stay away from XOR checksums (often called LRCs or Longitudinal Redundancy Checks). Use an additive checksum instead to get better error detection at the same cost.
  • Use an additive checksum if you want something basic and fast. If possible, use a one's complement additive checksum instead of normal addition. This involves adding up all the bytes or words of your data using one's complement addition and saving the final sum as the checksum. One's complement addition cuts vulnerability to undetected errors in the top bit of the checksum in half. In a pinch normal integer addition will work, but gives up some error detection capability.
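
For example, here is a minimal sketch of a 16-bit one's complement additive checksum: add the data as 16-bit words and fold any carry out of the top bit back into the bottom bit (end-around carry). Handling of odd byte counts and byte ordering are design choices left out of the sketch.

    #include <stdint.h>
    #include <stddef.h>

    uint16_t ones_complement_checksum(const uint16_t *words, size_t count)
    {
        uint32_t sum = 0u;

        for (size_t i = 0u; i < count; i++) {
            sum += words[i];
            if (sum > 0xFFFFu) {
                sum = (sum & 0xFFFFu) + 1u;   /* end-around carry */
            }
        }
        return (uint16_t)sum;
    }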

If you want intermediate computation speed and intermediate error detection:
  • Use a Fletcher checksum. Make sure that you use one's complement addition in computing the parts of that checksum, not normal integer addition. Normal integer addition just kills error detection performance for this approach. (A sketch appears after this list.)
  • Don't use an Adler checksum. In most cases it isn't as good as a Fletcher checksum and it is a bit slower to compute. The Adler checksum seems like a cool idea but it doesn't really pay off compared to a Fletcher checksum of the same size.
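
Here is the Fletcher checksum sketch promised above: a Fletcher-16 with both running sums reduced modulo 255, which is the one's complement style of addition. Reducing after every byte is simple but slow; real implementations usually defer the modulo as an optimization, which is left out of this sketch.

    #include <stdint.h>
    #include <stddef.h>

    uint16_t fletcher16(const uint8_t *data, size_t len)
    {
        uint16_t sum1 = 0u;   /* running sum of data bytes, mod 255  */
        uint16_t sum2 = 0u;   /* running sum of sum1 values, mod 255 */

        for (size_t i = 0u; i < len; i++) {
            sum1 = (uint16_t)((sum1 + data[i]) % 255u);
            sum2 = (uint16_t)((sum2 + sum1) % 255u);
        }
        return (uint16_t)((sum2 << 8) | sum1);
    }
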
If you can afford to spend a little more computation speed to get a lot better error detection:
  • Use a CRC (cyclic redundancy check)
  • If you are worried about speed there are a variety of table lookup methods that trade memory for speed. CRCs aren't really as slow as people think they will be. Probably you can use a CRC, and you should if you can.  Mike Barr has a posting on CRC implementations.
  • Use an optimal CRC polynomial if you don't have to conform to a standard. If you use a commonly used polynomial because other people use it, probably you are missing out on a lot of error detection capability. (More on this topic in a later post.)
You can find more discussion about CRCs and Checksums at my blog on that topic: http://checksumcrc.blogspot.com/

    Thursday, May 6, 2010

    Intangible Benefits of In-Person Peer Reviews

    Beyond finding bugs, in my opinion, in-person reviews also provide the following intangible benefits:
    • Synergy of comments: one reviewer's comment triggers something in another reviewer's head that leads to more thorough reviews.
    • Training: probably not everyone on your team has 25+ years of experience. Reviews are a way for the younger team members to learn about everything having to do with embedded systems.
    • Focus: we'd all rather be doing something than be in a meeting room, but a review meeting masks human interrupts pretty effectively -- if you silence your cell phone and exit your e-mail client.
    • Pride: make a point of saying something nice about code you are reviewing. It will help the ego of the author (we all need ego stroking!) and give the new guys something concrete to learn from.
    • Consistency: a group review is going to be more effective at encouraging code and design consistency and in making sure everything follows whatever standards are relevant. In on-line reviews you might not make the effort to comment upon things that aren't hard-core bugs, but in a meeting it is much easier to make a passing comment about finer points of style that doesn't need to be logged as an issue.
    So if you're going to spend the effort to do reviews, it is probably worth spending the extra effort to make them actual physical meetings rather than e-mail pass-arounds. Chapter 22 of my book discusses peer reviews in more detail.
    ---

    Tuesday, May 4, 2010

    Do On-Line Peer Reviews Work?

    If you don't do peer reviews of your design and code you're missing the boat. It is the most effective way I know of to improve software quality. It really works!

    A more interesting question is whether or not e-mail or on-line tool peer reviews are effective. From what I've seen they often don't work. I have no doubt that if you use a well thought out support tool and have just the right group of developers it can be made to work. But more often I have seen it not work. This includes some cases when I've been able to do more or less head-to-head comparisons, both for students and industry designers. The designers using on-line reviews are capable, hard-working, and really think the reviews are working. But they're not! They aren't finding the usual 40%-60% of defects in reviews (with most of the rest -- hopefully -- found via test).

    I've also seen this effect in external reviews where sometimes I send comments via e-mail, and sometimes I subject myself to the US Air Transportation System and visit a design team. The visits are invariably more productive.

    The reason most people give for electronic reviews is that they are more convenient. I can believe that. But (just to stir the pot) when you say that, what you're really saying is you can't set aside a meeting time for a face to face review because you have more important things to do (like writing code).

    Reviews let you save many hours of debugging for each review hour. If all you care about is getting to buggy code as fast as possible, then sure, skip reviews. But if what you really care about is getting to working product with the least effort possible, then you can't afford to skip reviews or do them in an ineffective way. Thus far I haven't seen data that shows tools are consistently effective.

    If you're using on-line tools for reviews and they really work (or have been burned by them) let me know! If you think they work, please say how you know that they do. Usually when people claim that, I'm looking for them to find about half their bugs via review, but if you have a novel and defensible measurement approach I'd be interested in hearing about it. I'd also be interested in hearing about differences between informal (e-mail pass-around) and tool based review approaches.

    Monday, May 3, 2010

    Effective Use of an External Watchdog Timer

    It's a good idea to use an external watchdog timer chip if you have a critical application. That way if the main microcontroller chip fails you have an independent check on its operation. (Ideally, make sure the external watchdog has its own oscillator so a single failed oscillator doesn't fool the watchdog and CPU into running too slowly.)

    I recently got a question about how to deal with a simple external watchdog chip that didn't give a lot of flexibility in setting a timeout period. You want the timeout period to be reasonably tight to the worst-case watchdog kick period. But with external chips you might have a huge amount of timing slack if the watchdog period settings are really coarse (for example, a 1 second external watchdog when what you really wanted was 300 msec).

    Here's an idea for getting the best of both worlds. Most microcontrollers have an internal watchdog timer. Rather than turn it off and ignore it, set it up for a nice tight kick interval. Probably you will have a lot of control over the internal watchdog interval. Then set the external watchdog timer interval for whatever is convenient, even if it is a pretty long interval. Kick both watchdogs together whenever you normally would kick just a single watchdog.
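
    Here is a minimal sketch of that dual-watchdog setup. The internal_wdt_* and gpio_pulse names are hypothetical hardware abstraction calls -- substitute your microcontroller's internal watchdog API and whatever strobe your external watchdog chip expects.

        #include <stdint.h>

        /* Hypothetical hardware abstraction calls -- replace with your part's API. */
        extern void internal_wdt_init(uint32_t timeout_ms);
        extern void internal_wdt_kick(void);
        extern void gpio_pulse(int pin);

        #define INTERNAL_WDT_TIMEOUT_MS 350u  /* tight bound, just above the worst-case kick period */
        #define EXT_WDT_KICK_PIN        7     /* GPIO pin that strobes the external watchdog chip   */

        void watchdog_init(void)
        {
            /* Internal watchdog gets the tight interval; the external chip keeps
             * its own coarse interval (e.g., 1 second) as the independent safety net. */
            internal_wdt_init(INTERNAL_WDT_TIMEOUT_MS);
        }

        /* Kick both watchdogs together from one place in the code, only after
         * the normal work of the main loop has actually completed. */
        void watchdog_kick(void)
        {
            internal_wdt_kick();
            gpio_pulse(EXT_WDT_KICK_PIN);
        }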

    The outcome is that you can expect the internal watchdog will work most of the time. When it does, you have a tight timing bound. In the rare cases where it doesn't work, you have the external watchdog as a safety net. So for most single-point failures (software hangs) you have tight timing protection. For the -- hopefully -- much rarer double point failures (software hangs AND takes down the internal watchdog with it; or a catastrophic hardware failure takes down the CPU including the internal watchdog), you still get protection from the external watchdog, even if it takes a while longer.

    Note that this approach might or might not provide enough protection for your particular application. The point is that you can do better in a lot of cases by using the internal watchdog rather than turning it off when you add an external watchdog. Chapter 29 of my book discusses watchdog timers in more detail.
    ---

    Sunday, May 2, 2010

    Better Embedded System Software: The Book


    The book is available from Amazon.  Here's a description:  http://koopman.us/index.html


    Book Summary

    This book distills the experience of more than 90 design reviews on real embedded system products into a set of bite-size lessons learned in the areas of software development process, requirements, architecture, design, implementation, verification & validation, and critical system properties. Each chapter describes an area that tends to be a problem in embedded system design, symptoms that tend to indicate you need to make changes, the risks of not fixing problems in this area, and concrete ways to make your embedded system software better. Each chapter is relatively self-sufficient, permitting developers with a busy schedule to cherry-pick the best ideas to make their systems better right away.

    Click on the link for chapter 19 on Global Variables to see the free sample chapter

    Chapters:
    1. Introduction

      Software Development Process
    2. Written development plan
    3. How much paper is enough?
    4. How much paper is too much?

      Requirements & Architecture
    5. Written requirements
    6. Measurable requirements
    7. Tracing requirements to test
    8. Non-functional requirements
    9. Requirement churn
    10. Software architecture
    11. Modularity

      Design
    12. Software design
    13. Statecharts and modes
    14. Real time
    15. User interface design

      Implementation
    16. How much assembly language is enough?
    17. Coding style
    18. The cost of nearly full resources
    19. Global variables are evil
    20. Mutexes and data access concurrency

      Verification & Validation
    21. Static checking and compiler warnings
    22. Peer reviews
    23. Testing and test plans
    24. Issue tracking & analysis
    25. Run-time error logs

      Critical System Properties
    26. Dependability
    27. Security
    28. Safety
    29. Watchdog timers
    30. System reset
    31. Conclusions
    Click Here To View Detailed Table of Contents
    (Requires Acrobat Reader version 8 or higher)

    Click here for errata list.

