Better Embedded System SW: 2010

Wednesday, December 22, 2010

Embedded Software Risk Areas -- People

Series Intro: this is one of a series of posts summarizing the different red flag areas I've encountered in more than a decade of doing design reviews of industry embedded system software projects. You can read more about the study here. If one of these bullets applies to your project, you should consider whether that presents undue risk to project success (whether it does or not depends upon your specific project and goals). The results of this study inspired the chapters in my book.

Here are the People red flags:

High turnover and developer overload

Developers have a high turnover rate. As a result, code quality and style varies. Lack of a robust paper trail makes it difficult to continue development. Often more important is that replacement developers may lack the domain experience necessary for understanding the details of system requirements.

No training for managing outsource relationships

Engineers who are responsible for interacting with outsource partners do not have adequate time and skills to do so, especially for multi-cultural partnering. This can lead to significant ineffectiveness or even failure of such relationships.

Saturday, December 18, 2010

International Shipping To Additional Countries and FBA

My publisher has recently added Fulfillment By Amazon (FBA) support for shipping. What this means is that the book is kept in stock at an Amazon warehouse and can be shipped as if it were any other Amazon.com product. This includes overnight shipping and international shipping to many more countries than can be supported via the Paypal fulfillment channel. Amazon prime shipping rates (where available) and most other Amazon policies apply. The discount from retail is less than on the author web site primarily because Amazon charges a significant fee for providing this service, but it does have advantages for many readers.

You can, of course, choose whichever channel makes sense to you based on total cost, delivery time, and whether you prefer to do business with Amazon. This link has pointers to both options.

Perhaps most importantly for many readers, the Amazon web site seems to indicate they will ship to India and China. If you have feedback about the Amazon service (both good and bad) please let me know. In particular, if you are from India or China and find that the service worked for you that would be very helpful to know. Thanks!

Wednesday, December 15, 2010

Embedded Software Risk Areas -- Project Management -- Part 2

Schedule not taken seriously

The software development schedule is externally imposed on an arbitrary basis or otherwise not grounded in reality. As a result, developers may burn out or simply feel they have no stake in following development schedules.

Presumption in project management that software is free

Project managers and/or customers (and sometimes developers) make decisions that presume software costs virtually nothing to develop or change. This is one contributing cause of requirements churn.

Risk of problems with external tools and components

External tools, software components, and vendors are a critical part of the system development plan, and no strategy is in place to deal with unexpected bugs, personnel turnover, or business failure of partners and vendors.

Disaster recovery not tested

Backups and disaster recovery plans may be in place but untested. Data loss can occur if backups are not being done properly.

Wednesday, December 8, 2010

Embedded Software Risk Areas -- Project Management -- Part 1

No version control

Sometimes source code is not under version control. More commonly, the source code is under version control but associated tools, libraries, and other support software components are not. As a result, it may be difficult or impossible to recreate and modify old software versions to fix bugs.

No backward compatibility and version management plan

There is no plan for dealing with backward compatibility with old products, product migration, or installations with a mix of old and new product versions. The result may be incompatibilities with fielded equipment or a combinatorial explosion of multi-component compatibility testing scenarios necessary for system validation.

Use of cheap tools (software components, etc.) instead of good ones

Developers have inadequate or substandard tools (for example, free demo compilers instead of paid-for full-featured compilers) because tool costs can’t be reckoned against savings in developer time in the cost accounting system being used. As a result, developers spend significant time creating or modifying tools to avoid spending money on tool procurement.

Saturday, December 4, 2010

Jack Ganssle Book Review

Jack Ganssle has published a review of my book. You can see it at this link.

Thanks Jack!

Thursday, December 2, 2010

Embedded Software Risk Areas -- An Industry Study

I've had the opportunity to do many design reviews of real embedded software projects over the past decade or so. About 95 reviews since 1996. For each review I usually had access to the project's source code and design documentation. And in most cases I got to spend a day with the designers in person. The point of the reviews was usually identifying risk areas that they should address before the product went into production. Sometimes the reviews were post mortems -- trying to find out what caused a project failure so it could be fixed. And sometimes the reviews were more narrow (for example, just look at security or safety issues for a project). But in most cases I (sometimes with a co-reviewer) found one or more "red flag" issues that really needed to be addressed.

In other postings I'll summarize the red flag issues I found from all those reviews. Perhaps surprisingly, even engineers with no formal training in embedded systems tend to get the basics right. The books that are out there are good enough for a well trained non-computer engineer to pick up what they need to get basic functionality right most of the time. Where they have problems are in the areas of complex system integration (for example, real time scheduling) and software process. I'm a hard-core lone cowboy techie at heart, and process is something I've learned gradually over the years as that sort of thing proved to be a problem for projects I observed. Well, based on a decade of design reviews I'm here to tell you that process and a solid design methodology matters. A lot. Even for small projects and small teams. Even for individuals. Details to follow in upcoming posts.

I'm giving a keynote talk at an embedded system education workshop at the end of October. But for non-academics, you'd probably just like me to summarize what I found:

(The green bar means it is things most embedded system experts think are the usual problems -- they were still red flags!) In other words, of all the red flag issues identified in these reviews, only about 1/3 were technical issues. The rest were either pieces missing from the software process or things that weren't being written down well enough.

Before you dismiss me as just another process guy who wants you to generate lots of useless paper, consider these points:

I absolutely hate useless paper. Seriously! I'm a techie, not a process guy. So I got dragged kicking and screaming to the conclusion that more process and more paper help (sometimes).

These were not audits or best practice reviews. What I mean by that is we did NOT say "the rule book says you have to have paper of type X, and it is missing, so you get a red flag." What we DID say, probably most of the time, was "this paper is missing, but it's not going to kill you -- so not a red flag." Only once in a while was a process or paperwork problem such high risk that it got a red flag.

Most reviews had only a handful of red flags, and every one was a time bomb waiting to explode. Most of the time bombs were in process and paperwork, not technology.

In other postings (click here to see the list as it grows) I'll summarize the 43 areas that had red flags (yes, including the technology red flags). They'll be organized, loosely, by phase in the design cycle. No doubt you have seen some project-killer risks that aren't on the lists, and I welcome brief war story postings about them.

Wednesday, December 1, 2010

Agile for Embedded Software and SQA (updated)

I'll be the first to admit that I'm not an eager adopter of Agile Methods, but I believe that there are some good ideas in the agile methods arena (and good ideas elsewhere as well). Based on some recent experiences I believe that agile can produce good long-lived embedded software ... but that it requires some things that aren't standard agile practices to get there.

Some folks use the "agile" banner as an excuse to have an ad hoc process. That's not going to work out so well for most embedded software. So we'll just move past that and assume that we're talking about an agile methods approach that uses a well defined process. (I didn't say "tons of paper" -- I said "well defined." That definition could be a single process flow chart on a whiteboard, some powerpoint slides, or the name of a book that is being followed.)

One of my major concerns with Agile as I have seen it practiced (and described by advocates) is that it is too light in Software Quality Assurance (SQA). This is by design -- as I understand it a primary point of Agile is to trust developers to do the right things, and audits aren't part of a typical trust picture. I've seen Agile do well, and I've seen Agile do not so well. But most interestingly, from what I've seen it is very difficult to tell if a gung-ho, competent Agile team that is saying they are doing all the right things is actually doing well or not in terms of resultant code quality. In other words, an agile team that says (and genuinely thinks) they are doing a good job might be producing great code or crummy code. And if nobody is there to find out which is happening, the team won't know the difference until it is much too late to fix things.

SQA is designed to make sure you're following the process you think you are following (whether agile or otherwise). Without an external check and balance it is all too easy to have a process failure. The failure could be skipped steps. It could be following the process too superficially, leading to ineffective efforts. It could be honest misunderstanding of the intent of the process rather than the letter of the process law. But in the end, I believe that you are taking a huge chance with any software process if you have no external check and balance

This isn't an issue of integrity, capability, or any other character flaw in developers. The simple fact is that developers and their management are paid to get the product done. While they can try their best to stay objective about the quality of their process, that isn't their primary job. The actual code is what they concentrate on, not the process. But everyone needs an external check and balance. And without one, upper management has no objective way to judge whether software development processes are working or broken (at least until some disaster happens, but then it's too late).

If I'm evaluating an Agile Methods group I ask the following questions: (1) Show me the written process you are following. (It can be an agile book; it can be a list of steps on a piece of paper; it can be just the Agile Manifesto. It can be anything.) (2) Explain to me how an external person can tell whether you are actually following that process. (3) Explain to me how an external person can tell that the process is producing the quality of software you want. I am flexible in terms of answers that are satisfactory, and none of these things preclude you from using an Agile Methods approach. But questions (2) and (3) typically aren't part of defined Agile Method approaches.

You'd be surprised how hard it is to answer those questions for many agile projects with an answer that amounts to something other than "trust us, we're professionals." I'm a professional too, but I'm not perfect. Nobody is. And I have learned in many different scenarios that everyone needs an external mechanism to make sure they are on track. Sometimes the answer is more or less "Agile always works." It doesn't just automatically work. Nothing does.

Sometimes an agile team works really well because of a single individual leader who makes it work. That's great, and it can be very effective. But in a year, or five, or ten, when that leader is promoted, wins the lottery, retires, or otherwise leaves the scene, you are at very high risk of having the team slip into a situation that isn't so good. So while getting a good leader to get things rolling however it can be done is great, a truly excellent leader will also plan for the day when (s)he isn't there. And that means establishing both a defined process and a quality check-and-balance on that process.

In my mind the above three questions capture the essence of SQA in a practical sense for most embedded system software projects. If you are happy with your answers for these questions, then Agile might work well for you. But if you simply blow off those topics and skimp on SQA, you are increasing your risk. And generally it's the kind of risk that eventually bites you.

Embedded Software Risk Areas -- Dependability -- Part 2

Insufficient consideration of system reset approach

System resets might not ensure a safe state during reboots that occur when the system is already in operation, resulting in unsafe transient actuator commands.

Neither run-time fault instrumentation nor error logs

There is no run-time instrumentation to record anomalous operating conditions, nor are there error logs to record events such as software crashes. This makes it difficult to diagnose problems in devices returned from customers.

No software update plan

There is no plan for distributing patches or software updates, especially for systems which do not have continuous Internet access. This can be an especially significant problem if the security strategy ends up requiring regular patch deployment. Updating software may require technician visits, equipment replacement, or other expensive and inconvenient measures.

No IP protection plan

There is no plan to protect the intellectual property of the product from code extraction, reverse engineering, or hardware/software cloning. (Protection strategies can be legal as well as technical.) As a result, competitors may find it excessively easy to successfully extract and sell products with exact software images or extracted proprietary software technology.

Monday, November 22, 2010

Embedded Software Risk Areas -- Dependability -- Part 1

No or incorrect use of watchdog timers

Watch dog timers are turned off or are serviced in a way that defeats their intended role in the system. For example, a watchdog might be kicked by an interrupt service routine that is triggered by a timer regardless of the status of the rest of the software system. Systems with ineffective watchdog timers may not reset themselves after a software timing fault.

Insufficient consideration of reliability/availability

There is no defined dependability goal or approach for the system, especially with respect to software. In most cases there is no requirement that specifies what dependability means in the context of the application (e.g., is a crash and fast reboot OK, or is it a catastrophic event for typical customer?). As a result, the degree of dependability is not being actively managed.

Insufficient consideration of security

There is no statement of requirements and intentional design approach for ensuring adequate security, especially for network-connected devices. The resulting system may be compromised, with unforeseen consequences.

Insufficient consideration of safety

In some systems that have modest safety considerations, no safety analysis has been done. In systems that are more overtly safety critical (but for which there is no mandated safety certification), the safety approach falls short of recommended practices. The result is exposure to unforeseen legal liability and reputation loss.

Wednesday, November 17, 2010

Embedded Software Risk Areas -- Verification & Validation

No peer reviews

Code, requirements, design and other documents are not subject to a methodical peer review, or undergo ineffective peer reviews. As a result, most bugs are found late in the development cycle when it is more expensive to fix them.

No test plan

Testing is ad hoc, and not according to a defined plan. Typically there is no defined criterion for how much testing is enough. This can result in poor test coverage or an inconsistent depth of testing.

No defect tracking

Defects and other issues are not being put into a bug tracking system. This can result in losing track of outstanding bugs and poor prioritization of bug-fixing activities.

No stress testing

There is no specific stress testing to ensure that real time scheduling and other aspects of the design can handle worst case expected operating conditions. As a result, products may fail when used for demanding applications.

Wednesday, November 10, 2010

Embedded Software Risk Areas -- Implementation -- Part 2

Ignoring compiler warnings

Programs compile with ignored warnings and/or the compilers used do not have robust warning capability. A static analysis tool is not used to make up for poor compiler warning capabilities. The result can be that software defects which could have been caught by the compiler must be found via testing, or miss detection entirely. If assembly language is used extensively, it may contain the types of bugs that a good static analysis tool would have caught in a high level language.

Inadequate concurrency management

Mutexes or other appropriate concurrent data access approaches aren’t being used. This leads to potential race conditions and can result in tricky timing bugs.

Use of home-made RTOS

An in-house developed RTOS is being used instead of an off-the-shelf operating system. While the result is sometimes technically excellent, this approach commits the company to maintaining RTOS development skills as a core competency, which may not be the best strategic use of limited resources.

Wednesday, November 3, 2010

Embedded Software Risk Areas -- Implementation -- Part 1

Inconsistent coding style

Coding style varies dramatically across the code base, and often there is no written coding style guideline. Code comments vary significantly in frequency, level of detail, and type of content. This makes it more difficult to understand and maintain the code.

Resources too full

Memory or CPU resources are overly full, leading to risk of missing real time deadlines and significantly increased development costs. An extreme example is zero bytes of program and data memory left over on a small processor. Significant developer time and energy can be spent squeezing software and data to fit, leaving less time to develop or refine functionality.

Too much assembly language

Assembly language is used extensively when an adequate high level language compiler is available. Sometimes this is due to lack of big enough hardware resources to execute compiled code. But more often it is due to developer preference, reuse of previous project code, or a need to economize on purchasing development tools. Assembly language software is usually more expensive to develop and more bug-prone than high level language code.

Too many global variables

Global variables are used instead of parameters for passing information among software modules. The result is often code that has poor modularity and is brittle to changes.

Wednesday, October 27, 2010

Embedded Software Risk Areas -- Architecture

No defined software architecture

There is no picture showing the system’s software architecture. (Many such pictures might be useful depending upon the context – but often there is no picture at all.) Ill defined architectures often lead to poor designs and poor quality code.

No message dictionary for embedded network

There is no list of the messages, payloads, timing, and other aspects of messages being sent on an embedded real time network such as CAN. As a result, there is no basis for analysis of real time network performance and optimization of message traffic.

Poor modularity of code

The design has poorly chosen interfaces and poorly decomposed functionality, resulting in high coupling, poor cohesion, and overly long modules. In particular, interrupt service routines are often too big and mask interrupts for too long. The result is often increased risk of software defects due to increased complexity.

Thursday, October 21, 2010

Embedded Software Risk Areas -- Design

Design is skipped or is created after code is written

Developers create the design (usually in their heads) as they are writing the code instead of designing each module before that module is implemented. The design might be written down after code is written, but usually there is no written design. As a result, the structure of the implementation is messier than it ought to be.

Flowcharts are used when statecharts would be more appropriate

Flowcharts are used to represent designs for functions that are inherently state-based or modal and would be better represented using a state machine design abstraction. Associated code usually has deeply nested, repetitive “if” condition clauses to determine what state the system is in, rather than having an explicit state variable used to control a case statement structure in the implementation. The result is code that is significantly more bug prone code and difficult to understand than code based on a state-machine based design.

No real time schedule analysis

There is no methodical approach to real time scheduling. Typically an ad hoc approach to real time scheduling is used, frequently featuring conditional execution of some tasks depending upon system load. Testing rather than an analytic approach is used to ensure real time deadlines will be met. Often there is no sure way to know if worst case timing has been experienced during such testing, and there is risk that deadlines will be missed during system operation.

No methodical approach to user interface design

The user interface does not follow established principles (e.g., [5]), making use of the product difficult or error-prone. The interface might not take into account the needs of users in different demographic groups (e.g., users who are colorblind, hearing impaired, or who have trouble with fine motor control).

Monday, October 18, 2010

Embedded Project Risk Areas -- Development Process Part 2

High requirements churn

Functionality required of the product changes so fast the software developers can’t keep up. This is likely to lead to missed deadlines and can result in developer burnout.

No SQA function

Nobody is formally assigned to perform an SQA function, so there is a risk that processes (however light or heavy they might be) aren’t being followed effectively regardless of the good intentions of the development team. Software Quality Assurance (SQA) is, in essence, ensuring that the developers are following the development process they are supposed to be following. If SQA is ineffective, it is possible (and in my experience likely) that some time spent on testing, design reviews, and other techniques to improve quality is also ineffective.

No mechanism to capture technical and non-technical project lessons learned

There is no methodical effort to identify technical, process, and management problems encountered during the course of the project so that the causes of these problems can be corrected. As a result, mistakes are repeated in future projects.

Wednesday, October 13, 2010

Embedded Project Risk Areas -- Development Process Part 1

Here are some of the Development Process red flags (part 1 of 2):

Informal development process

The process used to create embedded software is ad hoc, and not written down. The steps vary from project to project and developer to developer. This can result in uneven quality.

Not enough paper

Too few steps of development result in a paper trail. For example, test results may not be written down. Among other things, this can require re-doing tasks such as testing to make sure they were fully and correctly performed.

No written requirements

Software requirements are not written down or are too informal. They may only address changes for a new product version without any written document stating old version requirements. This can lead to misunderstandings about intended product functions and difficulty in designing adequate tests.

Requirements with poor measurability

Software requirements can’t be tested due to missing or subjective measurement criteria. As a result, it is difficult to know whether a requirement such as “product shall be user friendly” has been met.

Requirements omit extra-functional aspects

Product requirements may state hardware processing speed and hardware reliability, but omit software response times, software reliability, and other non-functional requirements. Implementing and testing these undefined aspects is left at the discretion of developers and might not meet market needs.

Saturday, October 2, 2010

Embedded Software Costs $15-$40 per line of code

Have you ever wondered how much software costs? Most techies vastly underestimate the cost of the software they are writing. Partly this is because engineers and especially software developers are an optimistic lot. Partly it is because our intuition for such things was formed back in the days when we hacked out 1000 lines of code in an all-night session. (I wouldn't trust my life to that code! But that is what our brains remember as being how long it takes to write that much code.)

I've had the opportunity to see how much it really costs embedded system companies to produce code.

The answer is $15 to $40 per line of code. At the $40 end you can get relatively robust, well designed code suitable for industry applications. The $15 end tends to be code with skimpy design packages and skimpy testing. (In other words, some people spend only $15/line, but their code is of doubtful quality.) Use of all-US developers tends to be more expensive than joint development with both US and overseas team members, but the difference is less than you might have imagined.

The methodology used to get those numbers is simply divide COST by code SIZE. Ask a project manager how much a project cost (they almost always know this) and how many lines of code were in the project. Divide, and you have your answer.

COST (in dollars) includes:
- All developer labor plus overhead, fringe benefits, etc.
- Test, SQA, technical writers, etc.
- Expenses (e.g., tools), capital equipment if not in overhead, etc.
- First-level managers for the above
- All activities, including meetings, design reviews, acceptance test, and holiday parties
- Does NOT include support after initial release (which can be huge, but I just don't have data on it)

In other words, you have to account for the whole cost of developing, not just the part where you actually write the code.

SIZE (in source lines of code SLOC) includes:
- All the non-comment lines of code written or substatially rewritten for the project

This metric is far from perfect. For example, tracking developers split among multiple projects gets tricky. So does accounting for reuse of big chunks of code. But if you get a manager to give a reasonable estimate following the above guidelines, the answer always seems to come in at the $15-$40 range for industrial and transportation embedded software.

Sure, you can spend as much or as little on software as you like. But it is hard to find an organization who can field embedded software and have a viable product at less than $15/line. And it is usually just regulated, safety critical applications in which it is worthwhile to spend more than about $40/line. When all is said and done, the results I've seen make sense, and are generally in line with what few other data points I've heard on this topic.

UPDATE, October 2015. It's probably more like $25-$50 per line of code now. Costs for projects outsourced to Asia have done up dramatically as wages and competition for scarce coders have increased.

Monday, September 13, 2010

Testing a Watchdog Timer

I recently got a query about how to test a watchdog timer. This is a special case of the more general question of how you test for a fault that is never supposed to happen. This is a tricky topic in general, but the simple answer (if you can figure out how to do it easily) is to insert the fault intentionally into your system and see what happens.

Here are some ideas that I've seen or thought up that you may find helpful:

(1) Set the watchdog timer period to shorter and shorter periods until it trips in normal operation. This will give you an idea how close to the edge you are. But, it is easy to make a mistake changing the watchdog period back to the normal value. So it isn't a good test for nearly-final code.

(2) Add a timeout loop (for example, a do-nothing loop that you have made sure won't be removed by your optimizer). Increase the timeout value until the watchdog timer trips. This similarly gives you an idea how close to the edge you are in terms of timing, but with a nicer level of detail. It also has the advantage of testing operation without modifying the watchdog code itself.

(3) Use a jumper that, when inserted, activates a time-wasting task (similar to idea #2). The idea is when a jumper is installed it enables the running of a task that wastes so much time it is guaranteed to trip the watchdog. When the jumper is removed, that task doesn't run and the system operates properly. You can insert the jumper during system test to make sure the watchdog function works properly. (So just jumping to the watchdog handling code isn't the idea -- you have to simulate a situation of CPU overload for this to be a realistic test.) When you ship the system, you make sure the jumper has been removed. Just to avoid problems, put the watchdog test first on the outgoing test plan instead of last. That way the jumper is sure to be removed before shipment. If you want to be really clever, that same jumper hardware could be used to disable the watchdog in early testing, and code could be changed to have it trip the watchdog as the system nears completion. But whether you want to be that tricky is a matter of taste and the type of system you are building.

Wednesday, September 1, 2010

Using Car Remote Entry Key Fobs For Payments?

There is a news story that suggests people will be using the remote keyless entry systems for their cars as payment systems. (So now your car key transmitter might have "lock", "unlock", and "pay" buttons.) Do you think this is a good idea?

I did some design work on a previous generation of this technology. Cryptographic algorithms I developed jointly with Alan Finn were in a lot of model year 1994-2004 cars. What struck me about this market was the extreme cost sensitivity of things. No way could we afford industrial strength crypto. The competition all had laughable crypto or made what I consider rookie mistakes (for example, using the same manufacturer key in all devices). I never heard that my algorithm was broken during its designed 10-year life (but I'll bet the NSA had a nice chuckle when they saw it). But other algorithms have proved to be insecure. For example, the Keeloq system has become the target of numerous attacks. While the first published attacks are relatively recent, people likely knew about the weaknesses and could have attacked them a lot earlier if they had wanted to do so. While technology has changed over the years, these important lessons probably are the same:

You have to have real security experts work on these products. You can't just put something together without really knowing your stuff or you will make rookie mistakes.
You have to have some appreciation for security by the customers buying the systems (often it is a car manufacturer deciding how good is good enough). Too often decisions are made on "cheap and not obviously bad" rather than "is actually secure." One of the best things that happened to me in my experience was that the customer had someone (his name is Tom) who understood crypto and was willing to back us when I said we couldn't go below a certain cost threshold without compromising security beyond the required level.
You have to use real crypto now, not cheesy crypto. Attackers have gained in sophistication and if someone has time to attack an old Keeloq system they'll attack your system.
Once you let a key fob control money, it's a lot more attractive to attack. So it is sure to be attacked.
Cars have a 10-15 year life, multi-year production runs, and easily a 3-5 year lead time. So you have to plan for your approach to be secure 20-25 years from now. That's hard to do in any system, much less one that is supposed to be inexpensive.

I predict next generation devices will have vulnerabilities due to failure to appreciate the above bullets. It has happened in the past, and it will happen in the future. I don't plan to use my key fob transmitter to make payments any time soon, and I used to design these things for a living. But then again, as long as you read about problems in the news before you're the victim of an attack, maybe you'll be OK.

Wednesday, August 25, 2010

Malware may have contributed to airline crash

A recent story is that "Malware may have been a contributory cause of a fatal Spanair crash that killed 154 people two years ago." (See this link for the full story.) The gist of the scenario is that a diagnostic monitoring computer (presumably that runs Windows) got infected with malware and stopped monitoring. The loss of monitoring meant nobody knew when problems occurred. Also, it meant that a warning system didn't work, so it didn't catch a critical pilot error upon takeoff. The pilots made the error, but the warning system didn't tell them they had made a mistake, and so the plane crashed.

If the above scenario is verified by the investigation, in one sense this is a classic critical system failure in which multiple things had to go wrong to result in a loss event (operators make a mistake AND operators fail to catch their own mistake with checklists AND automated warning system fails).
But what is a bit novel is that one of the failures was caused by malware, even if it wasn't intentionally targeted at aircraft. So the security problem didn't on its own cause the crash, but it tangibly contributed to the crash by removing a layer of safety.

Now, let's fast-forward to the future. What if someone created malware that would modify pilot checklists? (I know pilots are trained to know the checklists, but in a stressful situation someone could easily fall for a craftily bogus checklist.) What if someone intentionally attacked the warning system and caused some more subtle failure? For example, what if you managed to get all the aircraft in a fleet to give a bogus alarm on every takeoff attempt, and put in a time delay so it would happen on a particular day?

Security problems for embedded systems are going to get a lot worse unless people start taking this threat more seriously. This is just the tip of the iceberg. Hopefully things will get better sooner rather than later.

Monday, August 16, 2010

100% CPU Use With Rate Monotonic Scheduling

Rate Monotonic Scheduling (RMS) is probably what you should be using if you are using an RTOS. With rate monotonic scheduling you assign fixed priorities based solely on task execution periods. The fastest period gets the highest priority, and the slower periods get progressively lower priorities. It's that simple -- except for the math that limits maximum allowable CPU use.

The descriptions of Rate Monotonic theory (often called Rate Monotonic Analysis -- RMA) point out that you can't use 100% of the CPU if you want to guarantee schedulability. Maximum usage ranges between 69% to 85% of the CPU (see, for example Wikipedia for this analysis). And, as a result, many developers shy away from RMS. Who wants to pay up to about a third of their CPU capacity to the gods of scheduling? Nobody. Not even if this is an optimal way to schedule using fixed priorities.

The good news is that you can have your cake and eat it too.

The secret to making Rate Monotonic Scheduling practical is to use harmonic task periods. This requires that every task period evenly divide the period of every longer period. So, for example, the periods {20, 60, 120} are harmonic because 20 divides 60 evenly, 20 divides 120 evenly, and 60 divides 120 evenly. The period set {20, 67, 120} is not harmonic because, for example, 20 doesn't divide 67 evenly.

If you have harmonic periods, you can schedule up to 100% of the CPU. You don't need to leave slack! (Well, I'd leave a little slack in the real world, but the point is the math says you can go to 100% instead of saying you top out at a much lower utilization.) So, in a real system, you can use RMA without giving up a lot of CPU capacity.

What if your tasks aren't harmonic? Them make them so by running some tasks faster. For example, if your task periods are {20, 67, 120} change them to {20, 60, 120}. In other words, run the 67 msec task at 60 msec, which is a little faster than you would otherwise need to run it. Counter-intuitively, speeding up one task will actually lower your total CPU requirements in typical situations because now you don't have to leave unused CPU capacity to make the scheduling math work. Let's face it; in real systems task frequencies often have considerably leeway. So make your life easy and choose them to be harmonic.

You can find out more about RMA and how to do scheduling analysis in Chapter 14 of my book.

Monday, August 9, 2010

How Do You Know You Tested Everything? (Simple Traceability)

The idea of traceability is simple: look at the inputs to a design process and then look at the outputs from that same process. Compare them to make sure you didn't miss anything.

The best place to start using traceability is usually comparing requirements to acceptance tests. Here's how:

Make a table (spreadsheets work great for this). Label each column with a requirement. Label each row with an acceptance test. For each acceptance test, put an X in the requirement columns that are exercised to a non-trivial degree by that test.

When you're done making the table, you can do traceability analysis. An empty row (an acceptance test with no "X" marks) means you are running a test that isn't required. It might be a really good test to run -- in which case your requirements are missing something. Or it might be a waste of time.

An empty column (a requirement with no "X" marks) means you have a requirement that isn't being tested. This means you have a hole in your testing, a requirement that didn't get implemented, or a requirement that can't be tested. No matter the cause, you've got a problem. (It's OK to use non-testing validation such as a design review to check some requirements. For traceability purposes put it in as a "test" even though it doesn't involve actually running the code.)

If you have a table with no missing columns and no missing rows, then you've achieved complete traceability -- congratulations! This doesn't mean you're perfect. But it does mean you've managed to avoid some easy-to-detect gaps in your software development efforts.

These same ideas can be used elsewhere in the design process to avoid similar mistakes. The point for most projects is to use this idea in a way that doesn't take a lot of time, but catches "stupid" mistakes early on. It may seem too simplistic, but in my experience having written traceability tables helps. You can find out more about traceability in my book, Chapter 7: Tracing Requirements To Test.
---

Monday, August 2, 2010

How To Pick A Good Embedded Systems Design Book

If you want to know if an embedded system design book really talks about everything you need to know, I recommend you look for the word "watchdog" in the index. If it's not there, move on to the next book.

Let me explain...

There is a lot of confusion in the education and book market about the difference between learning about the underlying technology of embedded systems and learning how to build embedded systems. I think the difference is critical, because you are going to have trouble succeeding with complex embedded system projects if you are missing skills in the area of system integration and system architecture.

A book about the technology of embedded systems talks about how microcontrollers work, how analog to digital conversion works, interrupts, assembly language, and those sorts of things. In other words, it gives you the building blocks of the basic technology. This is essential information. But, it isn't enough to succeed when you get beyond relatively small and simple projects. If you took a course in college that was a typical entry-level "Introduction to Microcontrollers" course, this is what you might have seen. Again, it was essential, but far from complete.

A book about how to build embedded systems also talks about the parts that make the system function as a whole. For example, it explains how to address difficult areas such as concurrency management, reliability, software design, and real time scheduling. It need not go into serious depth about underlying theory if it is an introductory book, but it should at least give some commonly used design patterns, such as using a watchdog timer. In other words, it helps you understand the system level picture for building solid applications rather than just giving the basic technology building blocks.

There are other areas that I think are critical to designing good embedded systems that aren't covered in most embedded design books. Most notable is a notion of lightweight but robust software processes. There aren't many books that really cover everything. But, knowing what you don't know (and what a textbook leaves out) is an important step to achieving higher levels of understanding in any area.

If you learned embedded systems from a book or college course that didn't even mention watchdog timers, then you missed out a lot in that experience. You might or might not have found and filled all the gaps by now. Do yourself a favor and take a look at a book that covers these topics to make sure you don't have any big skill gaps left. If you think the "watchdog test" isn't fair, give your favorite book a second chance by looking for keywords such as statechart, mutex, and real time scheduling. These are also typical topics covered in books that talk about system level aspects rather than just building blocks.

Monday, July 26, 2010

Your Embedded System Might Be Safety Critical

Perhaps surprisingly, most embedded systems have some element of safety involved with their use. If you're lucky the safety part is dealt with in hardware. But sometimes software is involved too.

Safety problems are often caused by the uncontrolled release of energy (or, in some cases, information) into the environment. Since most embedded systems have actuators, and most embedded systems have software bugs, there is the potential for uncontrolled release of energy due to a software bug. Thus, most embedded systems have some aspect of safety that must be considered during design.

The good news is that in many cases the potential outcomes aren't that bad, so nobody is going to get killed because of a software defect. But that isn't always the case, and it may not be obvious there is a safety problem if you don't give it some serious thought.

Ask yourself this: if the software does the worst possible thing, could someone get hurt? Not the worst possible thing you've designed it to do. We're talking about bugs here -- so you don't have any idea what the bug will make the system do. Instead, take a healthy serving of Murphy's Law and ask yourself if the system software really wanted to do the worst possible thing, what would that be?

Here's an example. If you design a microwave oven you don't want the cooking energy to be released when the door is open. If you use a hardware switch that interrupts power, nothing your software can do will cook the user. But if you rely purely on software, you can't just say "seems to work OK." You have to make sure there is no possible software bug that can turn the cooking power on with the door open (even if the software "wanted" to in accordance with Murphy's Law). You may not think a bug this severe is there, but you can't really say your system is safe until you've taken a methodical approach to knowing such a bug isn't there.

In a sense, the art of getting safety right is thinking up all the worst possible things the system could do and then demonstrating it can't do them. There are many software safety techniques that address this issue, but you have to use them or you won't know if your system is safe. You can find out more about embedded system safety in Chapter 28 of my book.
---

Monday, June 21, 2010

Don't require perfection

A common problem with requirements is that they mandate perfection that is unattainable. For example, it is common for embedded software requirements to state the the software shall never crash, shall be perfectly safe, and shall be defect-free. (In truth, more often these things aren't even written down, but those are the answers you get when you ask what the requirements are for dependability, safety, and software defect rates for high quality embedded systems.)

Perfect doesn't ever happen. "Never" is longer than you have available to test for software dependability. And it is the rare everyday embedded system that has taken a rigorous approach to ensuring safety.

It is also true that in many areas it is too risky from a liability point of view to write down a concrete requirement for less than perfection. And you may be guessing as to a target of less than perfection even if you do specify one. (Is it OK for your system to crash once every 1000 years, or will 900 years do? Did you guess when you answered that, or do you have a concrete basis for making that tradeoff?)

If you can and are permitted to specify a concrete, non-perfect set of requirements for your product, you should. But if you can't, consider instead defining a set of acceptance tests that will at least let you perform actual measurements to validate your system is good enough. These can be either process requirements or actual tests. Some examples include:

System shall not crash during one full week of stress tests.
All sources of crashes during testing shall be tracked down to root cause, and eliminated if appropriate.
System shall perform an emergency shutdown if a defined safety requirement is violated at run time. (This assumes you are able to monitor these requirements effectively at run time.)
All system errors shall be logged for analysis in failed units returned for factory service.

None of these will get you to perfection. But they, and other possible criteria like them, will give you a concrete way of knowing if you have worked hard enough in relevant areas before you release your software. You can find out more by reading Chapter 6 of my book, which discusses creating measurable requirements.
---

Thursday, June 17, 2010

When is color worse than B&W?

Generally, color displays are better than black and white ones (or monochrome displays, depending on your display technology). In addition to making products look more sophisticated, color lets us communicate more information for a given display size. There's nothing like red to tell you there is a problem and green to tell you things are OK.

Unless you're red/green colorblind.

About 10% of males can have problems with red/green colorblindness, varying with the population you are considering. Most colorblind people don't just see gray, but it is very common for them to have problems distinguishing particular hues and intensities of red from corresponding greens.

If you are designing a product that uses red and green to display important information, then make sure there is a secondary way to obtain that information that works even if you can't tell the colors apart. Some example strategies include:

Positional information. Traffic lights are OK because the red light is always on top, so you know what color it is by its position.
Use color only as auxiliary information. If the display is the red text "FAIL" vs. green text "OK" then colorblind folks will do just fine.
Blinking rates. If you have a bicolor LED, then consider flashing for red and solid for green (which may be a good idea anyway, since flashing lights attract attention). Or a distinctly different blinking rate.
Significantly different luminosity or brightness. A very dark red vs. a bright green may work out OK, but you should do some testing or dig deeper to be sure you got it right.

Fortunately for me, I'm not colorblind. (This also means I'm not a personal expert on what tricks might work.) But enough people are that this is the sort of thing you don't want to miss when you are making an embedded system. Chapter 15 of my book discusses user interface design and user demographics in more detail.
---

Monday, June 14, 2010

White Box Testing

White box software tests are designed in light of the particular software design and implementation being tested. For example, if you have an if {} else {} code construct, a white box test would intentionally try to execute both the if side and the else clause of that statement (probably using separate tests). This may sound trivial, but designing tests to execute rare cases and fault handling code can be a real challenge.

The fraction of code that is tested is known as the test coverage. In general, higher coverage is good. For example, white box test code coverage might be the fraction of lines of code executed by tests, with 95% being a pretty good result and 98% to 99% is often the best that people do without heroic effort. (95% coverage means 5% of the code is never executed -- not even once -- during testing. Hard to believe this is a pretty good result. But as I said, executing code that handles rare cases can be a challenge.)

Although lines of code executed is the classic coverage metric, there are other possible coverage metrics that might be useful depending on your situation:

Testing that code can correctly handle exceptions it might encounter (for example, does it handle malloc failing?)
Testing that all entries of a lookup table are exercised (what if only one table entry is out to lunch?)
Testing that all states and arcs of a statechart have been exercised
Testing that algorithms have been checked for numerical stability in tricky areas where they might have problems

The point of all the above is that the tester knows exactly how the code is trying to perform its functions, and makes sure that nothing was missed in testing, especially corner cases.

It's important to remember that even if you have 100% test coverage, it doesn't mean you have tested the system completely (usually "complete" testing is impossible -- there are too many possibilities). What good coverage does mean is that you haven't left out anything obvious. And that's good enough to make understanding the coverage of your testing worthwhile. Chapter 23 of my book discusses the concepts and practices of embedded software testing in more detail.
---

Thursday, June 10, 2010

Is your software dependable enough?

Most embedded system software has to be reasonably dependable. For example, customers are likely to be unhappy if their software crashes once per minute. But how dependable is good enough can be a slippery subject. For example, is it OK if your software crashes once every 10 minutes? Every 10 hours? Every 10 days? Every 10 years? Is that number written down anywhere? Or is it just a guess as to what might be acceptable?

We suggest that every product have written dependability requirements. This probably has two parts: Mean Time Between Failures (MTBF) for hardware, and mean time between crashes for software. (You can add a lot more if you like, but if you are missing either of these you have a big hole in your requirements.)

Once you have set your requirements, how do you know you meet them? For hardware you can use well established reliability calculation approaches that ultimately rest upon an assumption of random independent failures. But for software there is no reasonable failure rate to make predictions with. So that leaves you with testing to determine software dependability.

Testing to determine your software crashes less often than once per minute is pretty easy. But when your dependability target is many years between software crashes, then testing longer than that is likely to be a problem. So, for most systems we recommend defining not only a target operational dependability, but also a concrete acceptance test for dependable that is easily measurable.

For example, set a requirement that the system has to survive 1 week of intense stress testing without a crash before it ships. This certainly doesn't guarantee you'll get 10 years between crashes in the field, but at least it is a concrete, measurable requirement that everyone can discuss and agree upon during the requirements process. It's far better to have a concrete, defined dependability acceptance test than to just leave dependability out of the requirements and hope things turn out OK. Chapter 26 of my book discusses embedded system dependability in more detail.
---

Monday, June 7, 2010

CAN Tutorial

If you are looking for a Controller Area Network (CAN) tutorial, you may find the slides I use in teaching one of my courses useful.

Have a look at this Acrobat file: http://www.ece.cmu.edu/~ece649/lectures/14_can.pdf
which covers:

CAN overview
Bit dominance and binary countdown
Bit stuffing (including the bit stuffing error vulnerability in CAN)
Message headers
Message header filtering
Network length restrictions
Devicenet overview

While there are a number of web pages and articles on CAN, sometimes it helps to have lecture slides to browse through.

Thursday, June 3, 2010

Top two mistakes with watchdog timers

Watchdog timers provide a useful fallback mechanism for tasks that hang or otherwise violate timing expectations. In brief, application software must occasionally kick (or "pet") the watchdog to demonstrate things are still working properly. If the watchdog hasn't seen a pet operation in too long, it times out, resetting the system. The idea is that if the system hangs, the watchdog will reset the system to restore proper operation.

The #1 mistake with watchdog timers is not using one. It won't work if you don't turn it on and use it.

The #2 mistake is using an interrupt hooked up to a counter/timer to service the watchdog. For example, if your watchdog trips after 250 msec, you might have a hardware timer/counter generate an interrupt every 200 msec that runs a task to pet the watchdog. This is, in some ways, WORSE than leaving the watchdog turned off entirely. The reason is that it fools people into thinking the watchdog timer is providing benefit, when in fact it's really not doing much for you at all.

The point of the watchdog timer is to detect that the main application has hung. If you have an interrupt that pets the watchdog, the main application could be hung and the watchdog will get petted anyway. You should always pet the watchdog from within the main application loop, not from a timer-triggered interrupt service routine. (As with any rule you can bend this one, but if it is possible to pet the watchdog when your application has hung, then you aren't using the watchdog properly.) Chapter 29 of my book discusses how to use watchdog timers in more detail.
---

Monday, May 31, 2010

More than 80% full is too full

It is common for embedded systems to optimize hardware costs without really looking at the effect that has on software development costs. Many of us have spent a few hours searching for a trick that will save a handful of memory bytes. That only makes sense if you believe engineering is free. (It isn't.)

Most developers have a sense that 99%+ is too full for memory and CPU cycles. But it is much less clear where to draw the line. Is 95% too full? How about 90%?

As it turns out, there is very little guidance on this area. But performing some what-if analysis with some classic software cost data leads to a rather startling conclusion:

If your memory or CPU is more than 80% full
and you are making fewer than 1 million units
then you should get more memory and a faster CPU.

This may seem too conservative. But what is happening between about 60% full and 80% full is that software is gradually becoming more difficult to develop. In part that is because a lot of optimizations have to be added. And in part that is because there is limited room for run-time monitoring and debug support. You have to crunch the data to see the curves (chapter 18 of my book has the details). But this is where you end up.

This is a rule of thumb rather than an absolute cutoff. 81% isn't much different than 80%. Neither is 79%. But by the time you are getting up to 90% full, you are spending dramatically more on the overall product than you really should be. Chapter 18 of my book discusses the true cost of nearly full resources and explains where these numbers come from.
---

Thursday, May 27, 2010

Managing developer staff by head count

Last I checked, engineers were paid pretty good salaries. So why does management act like they are free?

Let me explain. It is common to manage engineers by headcount. Engineering gets, say, 12 people, and their job is to make products happen. But this approach means engineers are an overhead resource that is just there to use without really worrying about who pays for it. Like electricity. Or free parking spots.

The good part about engineering headcount is it is simple to specify and implement. And it can let the engineering staff concentrate on doing design instead of spending a lot of time of marketing themselves to internal customers. But it can cause all sorts of problems. Here are some of the more common ones:

Decoupling of cost from workload. Most software developers are over-committed. You get the peanut butter effect. No matter how big your slice of bread, it always seems possible to spread the peanut butter a little thinner to cover it all. After too much spreading you spend all your time fighting fires and no time being productive. (Europeans -- feel free to substitute Nutella in this analogy! I'm not sure how well the analogy works Down Under -- I've only been brave enough to eat Vegemite once.)
Inability to justify tool spending. Usually tools, outside consultants, and other expenses come from "real" money budgets. Usually those budgets are limited. This often results in head-count engineers spending or wasting many hours doing something that they could get from outside much more cheaply. If a $5000 software tool saves you a month of time that's a win. But you can't do it if you don't have the $5000 to spend.
Driving use of the lowest possible cost hardware. Many companies still price products based on hardware costs and assume software is free. If you have an engineering headcount based system, engineers are in fact "free" (as far as the accountants can tell). This is a really bad idea for most products. Squeezing things into constrained software gets really expensive! And even if you can throw enough people at it, resource constraints increase the risk of bugs.

There are no doubt other problems with using an engineering headcount approach (drop me a line if you think of them!). And there are some benefits as well. But hopefully this gives you the big picture. In my opinion head counts do more harm than good.

Corporate budgeting is not a simple thing, so I don't pretend to have a simple magic wand to fix things. But in my opinion adding elements of the below will help over the long term:

Include engineering cost in product development cost. You probably budget for manufacturing tooling and other up-front (NRE) costs. Why isn't software development budgeted? This will at the very least screen out ill conceived products with expensive (complex, hard to get right) software put into marginally viable products.
If you must use head-counts, revisit them annually based on workload and adjust accordingly.
Treat headcount as a fixed resource that is rationed, not an infinite resource. Or at least make product lines pay into the engineering pool in proportion to how much engineering they use, even if they can't budget for it beforehand.

There aren't any really easy answers, but establishing some link between engineering workload and cost probably helps both the engineers and the company as a whole. I doubt everyone agrees with my opinions, so I'd like to hear what you have to say!

Monday, May 24, 2010

Improving CAN Bit Error Rates

If you are using the CAN protocol (Controller Area Network) in an embedded system, you should take some care to ensure you have acceptable bit error rates. (Sometimes we talk about embedded hardware even though this is a mostly software blog.)

The first step is to find out if you have a problem. Most CAN interface chips have an error counter that keeps track of how many message errors have occurred, and generates an exception if some number of errors has been detected. Turn that feature on, set the threshold as low as possible, and run the system in as noisy an environment as you reasonably expect. Let it run a good long while (a weekend at least). See if any errors are reported. If possible, send varying data patterns on some messages and check the values as you receive them to make sure there are no undetected errors. (CAN has some vulnerabilities that let some specific types of single and double bit errors slip through. So if you have detected CAN errors there are good odds you also have undetected CAN errors slipping through. More on that topic another time.)

If you find you are getting errors, here are some things to check and consider:

Make sure the cabling is appropriately terminated, grounded, shielded, and so on (see your CAN interface documentation for more).

Use differential signals instead of a single signal (differential signals are the usual practice, but it never hurts to make sure that is what you are using).

Make sure you aren't exceeding the maximum bus length for your chosen bit rate (your CAN chip reference materials should have a description of this issue).

Make sure you haven't hooked up too many nodes for your bus drivers to handle.

Those are all great things to check. And you probably thought of most or all of them. But here are the ones people tend to miss:

If you still have noise, switch to optocouplers (otherwise known as optical isolation). In high noise environments (such as those with large electric motors) that is often what it takes.

Don't use transformer coupling. While that is a great approach to isolation in high noise environments, it won't work with CAN because it doesn't give you bit dominance.

If you know of a trick I've missed, please send it to me!

Wednesday, May 19, 2010

Only 10 lines of code per day. Really??

If you want to estimate how long it's going to take to create a piece of embedded software (and how much it will cost), it's useful to have an idea of how productive you're going to be. Lines of code written per day is a reasonable starting point for this. It's a crude metric to be sure. If you have enough experience that you can criticize this metric you probably don't need to read further. But if you are just starting to keep count, this is a reasonable way to go. For tallying purposes we just consider executable code, and ignore comments as well as blank lines.

It's pretty typical for solid embedded software to come in at between 1 and 2 lines of code (LOC) per developer-hour. That's 8 to 16 LOC per developer each day, or about 2000-4000 LOC per year.

If you want just a single rough number, call it 10 LOC per day per developer. This is relatively language independent, and is for experienced developers producing code that is reasonably reliable, but not intended to be safety critical.

Wow, that's not much code, is it? Heck, most of us can recall cranking out a couple hundred lines in an evening when we were taking programming courses. So what's the deal with such low numbers?

First of all, we're talking about real code that goes into real products, that actually works. (Not kind-of sort-of works like student homework. This stuff actually has to work!)

More importantly, this is the total cost for everything, including requirements, design, testing, meetings, and so on. So to get this number, divide total code by the total person-hours for the whole development project. Generally you should leave out beta testing and marketing, but include all in-house testers and first-level technical managers.

It's unusual to see a number less than 1 LOC/hour unless you're developing safety critical code. Up to 3 LOC/hour might be reasonable for an agile development team. BUT, that 3 LOC tends to have little supporting design documentation, which is a problem in some circumstances (agile+embedded is a discussion for another time). It's worth mentioning that any metric can be gamed, and that there are situations in which metrics are misleading or useless. But, it's worth knowing the rules of thumb and applying them when they make sense.

Generally teams with a higher LOC/day number are cutting corners somewhere. Or they are teams composed entirely of the world's most amazing programmers. Or both. Think about it. If your LOC/day number is high, ask yourself what's really going on.

Friday, May 14, 2010

Security for automotive control networks

News is breaking today that automotive control networks are vulnerable to attacks if you inject malicious messages onto them (see this NY Times article as an example). It's good that someone has taken the trouble to demonstrate the attack, but to our research group the fact that such vulnerabilities exist isn't really news. We've been working on countermeasures for several years, sponsored by General Motors.

If this sort of issue affects you, here is a high level overview. Pretty much any embedded network doesn't have support for authentication of messages. By that, I mean there is no way to tell if the node sending a particular message is really the node that is supposed to be sending it. It is pretty easy to reverse engineer the messages on a car network and find out, for example, that message header ID #427 is the one that disables the car engine (not the real ID number and not necessarily a real message -- just an example). Once you know that, all you have to do is connect to the network and send that message. Easy to do. Probably a lot of our undergraduates could do it. (Not that we teach them to be malicious -- but they shouldn't get an "A" in the courses I teach if they can't handle something as simple as that!)

The problem is that, historically, embedded networks have been closed systems. Designers assumed there was no way to connect to them from the outside, and by extension assumed they wouldn't be attacked. That is all changing with connectivity to infotainment systems and the Internet.

As I said, we've worked out a solution to this problem. My PhD student Chris Szilagyi published the long version in this paper from 2009. The short version is that what you want to do is add a few bits of cryptographically secure authentication to each network message. You don't have a lot of bits to work with (CAN has a maximum 8 byte payload). So you put in just a handful of authentication bits in each message. Then you accumulate multiple messages over time until the receiver is convinced that the message is authentic enough for its purposes. For something low risk, a couple messages might be fine. For something high risk, you collect more messages to be sure it is unlikely an attacker has faked the message. It's certainly not "free", but the approach seems to provide reasonable tradeoff points among cost, speed, and security.

There is no such thing as perfectly secure, and it is reasonable for manufacturers to avoid the expense of security measures if attacks aren't realistically going to happen. But if they are going to happen, it is our job as researchers to have countermeasures ready for when they are needed. (And, if you are a product developer, your job to make sure you know about solutions when it is time to deploy them.)

I'm going to get into my car and drive home today without worrying about attacks on my vehicle network at all. But, eventually it might be a real concern.

Thursday, May 13, 2010

What's the best CRC polynomial to use?

(If you want to know more, see my Webinar on CRCs and checksums based on work sponsored by the FAA.)

If you are looking for a lightweight error detection code, a CRC is usually your best bet. There are plenty of tutorials on CRCs and a web search will turn them up. If you're looking at this post probably you've found them already.

The tricky part is the "polynomial" or "feedback" term that determines how the bits are mixed in the shift-and-XOR process. If you are following a standard of some sort then you're stuck with whatever feedback term the standard requires. But many times embedded system designers don't need to follow a standard -- they just need a "good" polynomial. For a long time folk wisdom was to use the same polynomial other people were using on the presumption that it must be good. Unfortunately, that presumption is often wrong!

Some polynomials in widespread use are OK, but many are mediocre, some are terrible if used the wrong way, and some are just plain wrong due factors such as a typographical error.

Fortunately, after spending a many CPU-years of computer time doing searches, a handful of researchers have come up with optimal CRC polynomials. You can find my results below. They've been cross-checked against other known results and published in a reviewed academic paper. (This doesn't guarantee they are perfect! But they are probably right.)

(click for larger version)
(**** NOTE: this data is now a bit out of date. See this page for the latest ****)

Here is as thumbnail description of using the table. HD is the Hamming Distance, which is minimum number of bit errors undetected. For example, HD=4 means all 1, 2, and 3 bit errors are detected, but some 4-bit errors are undetected, as are some errors with more than 4 bits corrupted.

The CRC Size is how big the CRC result value is. For a 14-bit CRC, you add 14 bits of error detection to your message or data packet.

The bottom number in each box within the table is the CRC polynomial in implicit "+1" hex format, meaning the trailing "+1" is omitted from the polynomial number. For example, hex 0x583 = binary 101 1000 0011 = x^11 + x^9 + x^8 + x^2 + x + 1. (This is "Koopman" notation in the wikipedia page. No, I didn't write the wikipedia entry, and I wasn't trying to be gratuitously different. A lot of the comparison stuff happened after I'd already done too much work to have any hope of changing my notation without introducing mistakes.)

The top number in each box is the maximum data word length you can protect at that HD. For example, the polynomial 0x583 is an 11-bit CRC that can provide HD=4 protection for all data words up to 1012 bits in length. (1012+11 gives a 1023 bit long combined data word + CRC value.)

You can find the long version in this paper: Koopman, P. & Chakravarty, T., "Cyclic Redundancy Code (CRC) Polynomial Selection For Embedded Networks," DSN04, June 2004. Table 4 lists many common polynomials, their factorizations, and their relative performance. It covers up to 16-bit CRCs. Longer CRCs are a more difficult search and the results aren't quite published yet.

You can find more discussion about CRCs and Checksums at my blog on that topic: http://checksumcrc.blogspot.com/

(Note: updated 8/3/2014 to correct the entry for 0x5D7, which provides HD=5 up to 26 bits. The previous graphic incorrectly gave this value as 25 bits. Thanks to Berthold Gick for pointing out the error.)

Monday, May 10, 2010

Which Error Detection Code Should You Use?

Any time you send a message or save some data that might be corrupted in storage, you should think about using some sort of error detection code so you can tell if the data has been corrupted. If you do a web search you will find a lot of information about error detection codes. Some of it is great stuff. But much of it is incorrect, or on a good day merely suboptimal. It turns out that the usual rule of thumb of "do what the other guys do and you can't go far wrong" works terribly for error detection. There is lots of folk wisdom that just isn't right.

So, here is a guide to simple error correction in embedded systems. There is a journal paper with all the details (see this link), but this is the short version.

If you want the fastest possible computation with basic error detection:

Parity is fine if you have only one bit to spend, but takes about as much work to compute as a checksum.
Stay away from XOR checksums (often called LRCs or Longitudinal Redundancy Checks). Use an additive checksum instead to get better error detection at the same cost.
Use an additive checksum if you want something basic and fast. If possible, use a one's complement additive checksum instead of normal addition. This involves adding up all the bytes or words of your data using one's complement addition and saving the final sum as the checksum. One's complement addition cuts vulnerability to undetected errors in the top bit of the checksum in half. In a pinch normal integer addition will work, but gives up some error detection capability.

If you want intermediate computation speed and intermediate error detection:

Use a Fletcher checksum. Make sure that you use one's complement addition in computing the parts of that checksum, not normal integer addition. Normal integer addition just kills error detection performance for this approach.
Don't use an Adler checksum. In most cases it isn't as good as a Fletcher checksum and it is a bit slower to compute. The Adler checksum seems like a cool idea but it doesn't really pay off compared to a Fletcher checksum of the same size.

If you can afford to spend a little more computation speed to get a lot better error detection:

Use a CRC (cyclic redundancy check)
If you are worried about speed there are a variety of table lookup methods that trade memory for speed. CRCs aren't really as slow as people think they will be. Probably you can use a CRC, and you should if you can. Mike Barr has a posting on CRC implementations.
Use an optimal CRC polynomial if you don't have to conform to a standard. If you use a commonly used polynomial because other people use it, probably you are missing out on a lot of error detection capability. (More on this topic in a later post.)

You can find more discussion about CRCs and Checksums at my blog on that topic: http://checksumcrc.blogspot.com/

Thursday, May 6, 2010

Intangible Benefits of In-Person Peer Reviews

Beyond finding bugs, in my opinion, in-person reviews also provide the following intangible benefits:

Synergy of comments: one reviewer's comment triggers something in another reviewer's head that leads to more thorough reviews.
Training: probably not everyone on your team has 25 years+ experience. Reviews are a way for the younger team members to learn about everything having to do with embedded systems.
Focus: we'd all rather be doing something than be in a meeting room, but a review meeting masks human interrupts pretty effectively -- if you silence your cell phone and exit your e-mail client.
Pride: make a point of saying something nice about code you are reviewing. It will help the ego of the author (we all need ego stroking!) and give the new guys something concrete to learn from.
Consistency: a group review is going to be more effective at encouraging code and design consistency and in making sure everything follows whatever standards are relevant. In on-line reviews you might not make the effort to comment upon things that aren't hard-core bugs, but in a meeting it is much easier to make a passing comment about finer points of style that doesn't need to be logged as an issue.

So if you're going to spend the effort to do reviews, it is probably worth spending the extra effort to make them actual physical meetings rather than e-mail pass-arounds. Chapter 22 of my book discusses peer reviews in more detail.
---

Tuesday, May 4, 2010

Do On-Line Peer Reviews Work?

If you don't do peer reviews of your design and code you're missing the boat. It is the most effective way I know of to improve software quality. It really works!

A more interesting question is whether or not e-mail or on-line tool peer reviews are effective. From what I've seen they often don't work. I have no doubt that if you use a well thought out support tool and have just the right group of developers it can be made to work. But more often I have seen it not work. This includes some cases when I've been able to do more or less head-to-head comparisons, both for students and industry designers. The designers using on-line reviews are capable, hard-working, and really think the reviews are working. But they're not! They aren't finding the usual 40%-60% of defects in reviews (with most of the rest -- hopefully -- found via test).

I've also seen this effect in external reviews where sometimes I send comments via e-mail, and sometimes I subject myself to the US Air Transportation System and visit a design team. The visits are invariably more productive.

The reasons most people have for electronic reviews are that they are more convenient. I can believe that. But (just to stir the pot) when you say that, what you're really saying is you can't set aside a meeting time for a face to face review because you have more important things to do (like writing code).

Reviews let you save many hours of debugging for each review hour. If all you care about is getting to buggy code as fast as possible, then sure, skip reviews. But if what you really care about is getting to working product with the least effort possible, then you can't afford to skip reviews or do them in an ineffective way. Thus far I haven't seen data that shows tools are consistently effective.

If you're using on-line tools for reviews and they really work (or have been burned by them) let me know! If you think they work, please say how you know that they do. Usually when people claim that I'm looking for them to find about half their bugs via review, but if you have a novel and defensible measurement approach I'd be interested in hearing about it. I'd also be interested in hearing about differences between informal (e-mail pass-around) and tool based review approaches.

Monday, May 3, 2010

Effective Use of an External Watchdog Timer

It's a good idea to use an external watchdog timer chip if you have a critical application. That way if the main microcontroller chip fails you have an independent check on its operation. (Ideally, make sure the external watchdog has its own oscillator so a single failed oscillator doesn't fool the watchdog and CPU into running too slowly.)

I recently got a question about how to deal with a simple external watchdog chip that didn't give a lot of flexibility in setting a timeout period. You want the timeout period to be reasonably tight to the worst-case watchdog kick period. But with external chips you might have a huge amount of timing slack if the watchdog period settings are really coarse (for example, a 1 second external watchdog when what you really wanted was 300 msec).

Here's an idea for getting the best of both worlds. Most microcontrollers have an internal watchdog timer. Rather than turn it off and ignore it, set it up for a nice tight kick interval. Probably you will have a lot of control over the internal watchdog interval. Then set the external watchdog timer interval for whatever is convenient, even if it is a pretty long interval. Kick both watchdogs together whenever you normally would kick just a single watchdog.

The outcome is that you can expect the internal watchdog will work most of the time. When it does, you have a tight timing bound. In the rare cases where it doesn't work, you have the external watchdog as a safety net. So for most single-point failures (software hangs) you have tight timing protection. For the -- hopefully -- much rarer double point failures (software hangs AND takes down the internal watchdog with it; or a catastrophic hardware failure takes down the CPU including the internal watchdog), you still get protection from the external watchdog, even if it takes a while longer.

Note that this approach might or might not provide enough protection for your particular application. The point is that you can do better in a lot of cases by using the internal watchdog rather than turning it off when you add an external watchdog. Chapter 29 of my book discusses watchdog timers in more detail.
---

Sunday, May 2, 2010

Better Embedded System Software: The Book

The book is available from Amazon. Here's a description: http://koopman.us/index.html

Book Summary

This book distills the experience of more than 90 design reviews on real embedded system products into a set of bite-size lessons learned in the areas of software development process, requirements, architecture, design, implementation, verification & validation, and critical system properties. Each chapter describes an area that tends to be a problem in embedded system design, symptoms that tend to indicate you need to make changes, the risks of not fixing problems in this area, and concrete ways to make your embedded system software better. Each chapter is relatively self-sufficient, permitting developers with a busy schedule to cherry-pick the best ideas to make their systems better right away.

Click on the link for chapter 19 on Global Variables to see the free sample chapter

Chapters:

Introduction

Software Development Process
Written development plan
How much paper is enough?
How much paper is too much?

Requirements & Architecture
Written requirements
Measureable requirements
Tracing requirements to test
Non-functional requirements
Requirement churn
Software architecture
Modularity

Design
Software design
Statecharts and modes
Real time
User interface design

Implementation
How much assembly language is enough?
Coding style
The cost of nearly full resources
Global variables are evil
Mutexes and data access concurrency

Verification & Validation
Static checking and compiler warnings
Peer reviews
Testing and test plans
Issue tracking & analysis
Run-time error logs

Critical System Properties
Dependability
Security
Safety
Watchdog timers
System reset
Conclusions

Click Here To View Detailed Table of Contents
(Requires Acrobat Reader version 8 or higher)

Click here for errata list.

-----