Better Embedded System SW: August 2010

Wednesday, August 25, 2010

Malware may have contributed to airline crash

A recent story is that "Malware may have been a contributory cause of a fatal Spanair crash that killed 154 people two years ago." (See this link for the full story.) The gist of the scenario is that a diagnostic monitoring computer (presumably that runs Windows) got infected with malware and stopped monitoring. The loss of monitoring meant nobody knew when problems occurred. Also, it meant that a warning system didn't work, so it didn't catch a critical pilot error upon takeoff. The pilots made the error, but the warning system didn't tell them they had made a mistake, and so the plane crashed.

If the above scenario is verified by the investigation, in one sense this is a classic critical system failure in which multiple things had to go wrong to result in a loss event (operators make a mistake AND operators fail to catch their own mistake with checklists AND automated warning system fails).
But what is a bit novel is that one of the failures was caused by malware, even if it wasn't intentionally targeted at aircraft. So the security problem didn't on its own cause the crash, but it tangibly contributed to the crash by removing a layer of safety.

Now, let's fast-forward to the future. What if someone created malware that would modify pilot checklists? (I know pilots are trained to know the checklists, but in a stressful situation someone could easily fall for a craftily bogus checklist.) What if someone intentionally attacked the warning system and caused some more subtle failure? For example, what if you managed to get all the aircraft in a fleet to give a bogus alarm on every takeoff attempt, and put in a time delay so it would happen on a particular day?

Security problems for embedded systems are going to get a lot worse unless people start taking this threat more seriously. This is just the tip of the iceberg. Hopefully things will get better sooner rather than later.

Monday, August 16, 2010

100% CPU Use With Rate Monotonic Scheduling

Rate Monotonic Scheduling (RMS) is probably what you should be using if you are using an RTOS. With rate monotonic scheduling you assign fixed priorities based solely on task execution periods. The fastest period gets the highest priority, and the slower periods get progressively lower priorities. It's that simple -- except for the math that limits maximum allowable CPU use.

The descriptions of Rate Monotonic theory (often called Rate Monotonic Analysis -- RMA) point out that you can't use 100% of the CPU if you want to guarantee schedulability. Maximum usage ranges between 69% to 85% of the CPU (see, for example Wikipedia for this analysis). And, as a result, many developers shy away from RMS. Who wants to pay up to about a third of their CPU capacity to the gods of scheduling? Nobody. Not even if this is an optimal way to schedule using fixed priorities.

The good news is that you can have your cake and eat it too.

The secret to making Rate Monotonic Scheduling practical is to use harmonic task periods. This requires that every task period evenly divide the period of every longer period. So, for example, the periods {20, 60, 120} are harmonic because 20 divides 60 evenly, 20 divides 120 evenly, and 60 divides 120 evenly. The period set {20, 67, 120} is not harmonic because, for example, 20 doesn't divide 67 evenly.

If you have harmonic periods, you can schedule up to 100% of the CPU. You don't need to leave slack! (Well, I'd leave a little slack in the real world, but the point is the math says you can go to 100% instead of saying you top out at a much lower utilization.) So, in a real system, you can use RMA without giving up a lot of CPU capacity.

What if your tasks aren't harmonic? Them make them so by running some tasks faster. For example, if your task periods are {20, 67, 120} change them to {20, 60, 120}. In other words, run the 67 msec task at 60 msec, which is a little faster than you would otherwise need to run it. Counter-intuitively, speeding up one task will actually lower your total CPU requirements in typical situations because now you don't have to leave unused CPU capacity to make the scheduling math work. Let's face it; in real systems task frequencies often have considerably leeway. So make your life easy and choose them to be harmonic.

You can find out more about RMA and how to do scheduling analysis in Chapter 14 of my book.

Monday, August 9, 2010

How Do You Know You Tested Everything? (Simple Traceability)

The idea of traceability is simple: look at the inputs to a design process and then look at the outputs from that same process. Compare them to make sure you didn't miss anything.

The best place to start using traceability is usually comparing requirements to acceptance tests. Here's how:

Make a table (spreadsheets work great for this). Label each column with a requirement. Label each row with an acceptance test. For each acceptance test, put an X in the requirement columns that are exercised to a non-trivial degree by that test.

When you're done making the table, you can do traceability analysis. An empty row (an acceptance test with no "X" marks) means you are running a test that isn't required. It might be a really good test to run -- in which case your requirements are missing something. Or it might be a waste of time.

An empty column (a requirement with no "X" marks) means you have a requirement that isn't being tested. This means you have a hole in your testing, a requirement that didn't get implemented, or a requirement that can't be tested. No matter the cause, you've got a problem. (It's OK to use non-testing validation such as a design review to check some requirements. For traceability purposes put it in as a "test" even though it doesn't involve actually running the code.)

If you have a table with no missing columns and no missing rows, then you've achieved complete traceability -- congratulations! This doesn't mean you're perfect. But it does mean you've managed to avoid some easy-to-detect gaps in your software development efforts.

These same ideas can be used elsewhere in the design process to avoid similar mistakes. The point for most projects is to use this idea in a way that doesn't take a lot of time, but catches "stupid" mistakes early on. It may seem too simplistic, but in my experience having written traceability tables helps. You can find out more about traceability in my book, Chapter 7: Tracing Requirements To Test.
---

Monday, August 2, 2010

How To Pick A Good Embedded Systems Design Book

If you want to know if an embedded system design book really talks about everything you need to know, I recommend you look for the word "watchdog" in the index. If it's not there, move on to the next book.

Let me explain...

There is a lot of confusion in the education and book market about the difference between learning about the underlying technology of embedded systems and learning how to build embedded systems. I think the difference is critical, because you are going to have trouble succeeding with complex embedded system projects if you are missing skills in the area of system integration and system architecture.

A book about the technology of embedded systems talks about how microcontrollers work, how analog to digital conversion works, interrupts, assembly language, and those sorts of things. In other words, it gives you the building blocks of the basic technology. This is essential information. But, it isn't enough to succeed when you get beyond relatively small and simple projects. If you took a course in college that was a typical entry-level "Introduction to Microcontrollers" course, this is what you might have seen. Again, it was essential, but far from complete.

A book about how to build embedded systems also talks about the parts that make the system function as a whole. For example, it explains how to address difficult areas such as concurrency management, reliability, software design, and real time scheduling. It need not go into serious depth about underlying theory if it is an introductory book, but it should at least give some commonly used design patterns, such as using a watchdog timer. In other words, it helps you understand the system level picture for building solid applications rather than just giving the basic technology building blocks.

There are other areas that I think are critical to designing good embedded systems that aren't covered in most embedded design books. Most notable is a notion of lightweight but robust software processes. There aren't many books that really cover everything. But, knowing what you don't know (and what a textbook leaves out) is an important step to achieving higher levels of understanding in any area.

If you learned embedded systems from a book or college course that didn't even mention watchdog timers, then you missed out a lot in that experience. You might or might not have found and filled all the gaps by now. Do yourself a favor and take a look at a book that covers these topics to make sure you don't have any big skill gaps left. If you think the "watchdog test" isn't fair, give your favorite book a second chance by looking for keywords such as statechart, mutex, and real time scheduling. These are also typical topics covered in books that talk about system level aspects rather than just building blocks.