Monday, August 17, 2015

Three Suggestions from Les Chambers

Les Chambers is a knowledgeable software engineer with plenty of experience in critical embedded systems.  He had the following three comments about what kinds of "paper" (documentation) are essential for larger-scale embedded systems, which are all spot on.  (His points are in bullets below with permission and light editing for this format.)
  • Without an architectural design document it is impossible to plan and manage any software project with more than two or three people coding. You've got to know numbers of things to estimate resource requirements. How many functions, how many screens, how many objects. I've seen regularly a problem with big teams coming together and floundering around with nothing productive to do because the architect is still pulling his ideas together and has no vehicle for communicating them to the team.

Les is correct. Any time you have a project with more than a couple people you really need an architectural document of some sort.  Ideally a single sheet of paper, usually with boxes and arrows, that shows you what the pieces are and how they fit together. Once you have 5 people on the team, this is absolutely mandatory, and in my experience as Les says you will have just chaos until that picture is nailed down.

A related problem I've seen is when the architecture document shows all the hardware boxes and communication links, but software is nowhere to be found in the picture. You need to either put software on that same diagram or have a separate picture for the software structure that is compatible with the hardware architecture. Chapter 10 of my book has some general rules to help in constructing these types of diagrams. However, I'll be the first to admit that creating a good architecture is as much art as science. If you want to really delve into this, the best systems architecture book I've found is Rechtin's book on System Architecting (The first edition is by far the best for an initial read, if you can find it. The 3rd edition by Maier & Rechtin has a lot more material, but is a bit more complex if you are just going for the essentials.)
  • Another issue you've hinted at but not explicitly stated is the importance of detailed rationales for design approaches. The symptom here is, six months down the track, someone questions a nonintuitive design approach, spends a week working through the design rationale and then decides, "... oh yes it was right in the first place." Worse: someone, not in possession of the facts, changes an approach that was made for rational reasons and injects bugs.

Yes, I've seen this one too. This can especially be a problem if a design decision is made for an extra-functional purpose such as safety. For example, consider an aircraft in which two cable bundles are run down different sides of an aircraft.  Someone later might conclude that it is cheaper and easier to run them next to each other. Functionally there is (at least at first glance) no difference. But the point of separating the wires was so that if physical damage occurs to one part of the aircraft only one of the two cable bundles will be affected. (Could this happen? Read about the United Airlines Flight 232 crash where three hydraulic lines were damaged where they ran too close together.)

In general, it is a good idea to capture not just requirements but also design decisions with rationale so that the basis for important decisions is not lost. This is especially important for long-lived systems that are likely to be maintained and updated over periods of decades, which is a common enough situation in the embedded systems world.
  • Another piece of paper I think should be added is the configuration management documentation. Exactly what versions of what software are running on what versions of what hardware where. I once had to tackle this problem on a project with in excess of 200 computers deployed all over a [geographically distributed embedded system] network. The symptoms were: people in the development shop spending a week working on reproducing a bug found on site and being unable to do so because they are working on the wrong version of the code. Large deployment teams turning out to do site installation and having to abandon because they were armed with the wrong version of the software – incompatible with other control computers. The obvious solution was a database application, which took me three months to build, and turned out to be very useful - more useful than a stack of procedures and paper records.
And again, I've seen this one as well.  It is common enough for the configuration management for older systems to be a filing cabinet full of hard copy printouts, sometimes with the software for every single field installation having been customized by a field engineer. Usually the paper copy is out of date with reality.  If you find a bug, how can you fix it if you don't really even know what software is out in the field?

Configuration management is important not just for your build process, but also to keep track of what's out in the field. The most basic requirement is that your device needs to be able to tell you what version of software is installed (for example, with a start-up message). Beyond that, you really want a database that you can run queries on to find out what's out there. Such databases often get stale, so it's also very helpful to make a configuration audit part of every time you touch the equipment for maintenance to keep the database updated.

Les writes a thought-provoking (and nicely styled) long-form blog on system engineering :  http://www.systemsengineeringblog.com/
The stories are interesting and well told, with a few twists and turns along the way.  For example, the article on Fagan inspections provides insight into how culture changes if you are serious about creating high quality software. The title gives you an idea of the style:  "Extreme Review: A Tale of Nakedness, Alsations and Fagan Inspection." Highly recommended reading.

Monday, July 20, 2015

Avoiding EEPROM Wearout


Summary: If you're periodically updating a particular EEPROM value every few minutes (or every few seconds) you could be in danger of EEPROM wearout. Avoiding this requires reducing the per-cell write frequency. For some EEPROM technology anything more frequent than about once per hour could be a problem.

Time Flies When You're Recording Data:

EEPROM is commonly used to store configuration parameters and operating history information in embedded processors. For example, you might have a rolling "flight recorder" function to record the most recent operating data in case there is a system failure or power loss. I've seen specifications for this sort of thing require recording data every few seconds.

The problem is that  EEPROM only works for a limited number of write cycles.  After perhaps 100,000 to 1,000,000 (depending on the particular chip you are using), some of your deployed systems will start exhibiting EEPROM wearout and you'll get a field failure. (Look at your data sheet to find the number. If you are deploying a large number of units "worst case" is probably more important to you than "typical.")  A million writes sounds like a lot, but they go by pretty quickly.  Let's work an example, assuming that a voltage reading is being recorded to the same byte in EEPROM every 15 seconds.

1,000,000 writes at one write per 15 seconds is 4 writes per minute:
  1,000,000 / ( 4 * 60 minutes/hr * 24 hours/day ) = 173.6 days.
In other words, your EEPROM will use up its million-cycle wearout budget in less than 6 months.

Below is a table showing the time to wearout (in years) based on the period used to update any particular EEPROM cell. The crossover values for 10 year product life are one update every 5 minutes 15 seconds for an EEPROM with a million cycle life. For a 100K life EEPROM you can only update a particular cell every 52 minutes 36.  This means any hope of updates every few seconds just aren't going to work out if you expect your product to last years instead of months. Things scale linearly, although in real products secondary factors such as operating temperature and access mode can play a factor.



Reduce Frequency
The least painful way to resolve this problem is to simply record the data less often. In some cases that might be OK to meet your system requirements.

Or you might be able to record only when things change more than a small amount, with a minimum delay between successive data points. However, with event-based recording be mindful of value jitter or scenarios in which a burst of events can wear out EEPROM.

(It would be nice if you could track how often EEPROM has been written. But that requires a counter that's kept in EEPROM ... so that idea just pushes the problem into the counter wearing out.)

Low Power Interrupt
In some processors there is a low power interrupt that can be used to record one last data value in EEPROM as the system shuts down due to loss of power. In general you keep the value you're interested in a RAM location, and push it out to EEPROM only when you use power.  Or, perhaps, you record it to EEPROM once in a while and push another copy out to EEPROM as part of shut-down to make sure you record the most up-to-date value available.

It's important to make sure that there is a hold-up capacitor that will keep the system above the EEPROM programming voltage requirement for long enough.  This can work if you only need to record a value or two rather than a large block of data. But it is easy to get this wrong, so be careful!

Rotating Buffer
The classical solution for EEPROM wearout is to use a rotating buffer (sometimes called a circular FIFO) of the last N recorded values. You also need a counter stored in EEPROM so that after a power cycle you can figure out which entry in the buffer holds the most recent copy. This reduces EEPROM wearout proportionally to the number of copies of the data in the buffer. For example, if you rotate through 10 different locations that take turns recording a single monitored value, each location gets modified 1/10th as often, so EEPROM wearout is improved by a factor of 10. You also need to keep a separate counter or timestamp for each of the 10 copies so you can sort out which one is the most recent after a power loss.  In other words, you need two rotating buffers: one for the value, and one to keep track of the counter. (If you keep only one counter location in EEPROM, that counter wears out since it has to be incremented on every update.)  The disadvantage of this approach is that it requires 10 times as many bytes of EEPROM storage to get 10 times the life, plus 10 copies of the counter value.  You can be a bit clever by packing the counter in with the data. And if you are recording a large record in EEPROM then an additional few bytes for the counter copies aren't as big a deal as the replicated data memory. But any way you slice it, this is going to use a lot of EEPROM.

Atmel has an application note that goes through the gory details:
AVR-101: High Endurance EEPROM Storage:  http://www.atmel.com/images/doc2526.pdf

Special Case For Remembering A Counter Value
Sometimes you want to keep a count rather than record arbitrary values. For example, you might want to count the number of times a piece of equipment has cycled, or the number of operating minutes for some device.  The worst part of counters is that the bottom bit of the counter changes on every single count, wearing out the bottom count byte in EEPROM.

But, there are special tricks you can play. An application note from Microchip has some clever ideas, such as using a gray code so that only one byte out of a multi-byte counter has to be updated on each count. They also recommend using error correcting codes to compensate for wear-out. (I don't know how effective ECC will be at wear-out, because it will depend upon whether bit failures are independent within the counter data bytes -- so be careful of using that idea). See this application note:   http://ww1.microchip.com/downloads/en/AppNotes/01449A.pdf

Note: For those who want to know more, Microchip has a tutorial on the details of wearout with some nice diagrams of how EEPROM cells are designed:
ftp://ftp.microchip.com/tools/memory/total50/tutorial.html

If you've run into any other clever ideas for EEPROM wearout mitigation please let me know.

Sunday, May 3, 2015

Counter Rollover Bites Boeing 787

Counter rollover is a classic mistake in computer software.  And, it has just bit the Boeing 787.

The Problem:

The Boeing 787 aircraft's electrical power control units shut down if powered without interruption for 248 days (a bit over 8 months). In the likely case that all the control units were turned on at about the same time, that means they all shut down at the same time -- potentially in the middle of a flight. Fortunately, the power is usually not left on for 8 continuous months, so apparently this has not actually happened in flight.  But the problem was seen in a long-duration simulation and could happen in a real aircraft. (There are backup power supplies, but do you really want to be relying on them over the middle of an ocean?  I thought not.) The fix is turning off the power and turning it back on every 120 days.

That's right -- the FAA is telling the airlines they have to do a maintenance reboot of their planes every 120 days.

(Sources: NY Times ; FAA)


Analysis:

Just for fun, let's do the math and figure out what's going on.
248 days * 24 hours/day * 60 minute/hour * 60 seconds/minute = 21,427,200
Hmmm ... what if those systems keep time as an 32-bit signed integer in hundredths of a second? The maximum positive value for such a counter would give:
0x7FFFFFFF = 2147483647 / (24*60*60) = 24855 / 100 = 248.55 days.
Bingo!

If they had used a 32-bit unsigned it would still overflow after twice as long = 497.1 days.


Other Examples:

This is not the first time a counter rollover has caused a problem.  Some examples are:

  • IBM: Interface adapters hang after 497 days of uptime [IBM]
  • Windows 95: hang after 49.7 days without reboot, counting in milliseconds [Microsoft]  
There are also plenty of date roll-over bugs:
  • Y2K: on 1 January 2000 (overflow of 2-digit year from 99 to 00)   [Wikipedia]
  • GPS: 1024 week rollover on 22 August 1999 [USCG]
  • Year 2038: Unix time will roll over on 19 January 2038 [Wikipedia]

There are also somewhat related capacity overflow issues such as 512K day for IPv4 routers.

If you want to dig further, there is a "zoo" of related problems on Wikipedia:  "Time formatting and storage bugs"


Friday, May 1, 2015

How To Report An Unintended Acceleration Problem

Every once in a while I get e-mail from someone concerned about unintended acceleration that has happened to them or someone they know.  Commonly they go to the car dealer and get told (directly or indirectly) that it must have been the driver's fault. I'm sure that must be a frustrating experience.

Fortunately, you can do more than just get blown off by the dealer (if you feel like that is what happened to you).  Visit the US Dept. of Transportation's complaint system and file a complaint with their Office of Defects Investigation (ODI):

What this does is put information into the database that DoT uses to look for unsafe trends in vehicles. ODI conducts defect investigations and administers safety recalls, and this database is a primary source of information for them.  Putting in an entry does not mean that anyone will necessarily get back to you about your particular complaint, but eventually if enough drivers have similar problems with a particular vehicle type, ODI is supposed to investigate. You should be sure to use several different words and phrases to thoroughly describe your situation since often this database is searched via key words. (That means that if they are looking for a particular trend they might only look at records that contain a specific word or phrase, not all records for that vehicle.)  You should include specifics, and in particular things that you can recall that would suggest it is not simply driver error. But, realize that the description you type in will be publicly available, so think about what you write. 

To be sure, this should not be the only thing you do.  If you believe you have a problem with your vehicle should talk to the dealer and perhaps escalate things from there. (If it happened to me I would at a minimum demand a written problem report to the manufacturer central defects office be created and demand a written response from the manufacturer customer relations office to leave a record.)  But, if you skip the DoT database then one of the important feedback mechanisms independent of the car companies that triggers recalls won't have the data it needs to work. If you had an incident that did not result in a police report or insurance claim reporting is especially important, since there is no other way for DoT or the manufacturer to even know it happened.

Even if you haven't suffered unintended acceleration, you might be interested to look at complaints others have filed for your vehicle type, which are publicly available. And of course you can report any defect you like, not just acceleration issues. 

I recently came across the web site:  http://www.safercar.gov/
This has general car safety information and also has a way to file a vehicle safety complaint.  It is unclear to me what the different is between this and the IVOQ database above.  One would hope the data ends up in the same place, but I don't have information either way on that.

 (For those who are interested in how the keyword search might be done, you can see a NHTSA Document for an example from the Toyota UA investigations.)

Monday, April 13, 2015

FAA CRC and Checksum Report


I've been working for several years on an FAA document covering good practices for CRC and Checksum use.  At long last this joint effort with Honeywell researchers has been issued by the FAA as an official report.  Those of you who have seen my previous tutorial slides will recognize much of the material. But this is the official FAA-released version.

Selection of Cyclic Redundancy Code and Checksum Algorithms to Ensure Critical Data Integrity
DOT/FAA/TC-14/49
Philip Koopman, Kevin Driscoll, Brendan Hall

http://users.ece.cmu.edu/~koopman/pubs/faa15_tc-14-49.pdf


Abstract:

This report explores the characteristics of checksums and cyclic redundancy codes (CRCs) in an aviation context. It includes a literature review, a discussion of error detection performance metrics, a comparison of various checksum and CRC approaches, and a proposed methodology for mapping CRC and checksum design parameters to aviation integrity requirements. Specific examples studied are Institute of Electrical and Electronics Engineers (IEEE) 802.3 CRC-32; Aeronautical Radio, Incorporated (ARINC)-629 error detection; ARINC-825 Controller Area Network (CAN) error detection; Fletcher checksum; and the Aeronautical Telecommunication Network (ATN)-32 checksum. Also considered are multiple error codes used together, specific effects relevant to communication networks, memory storage, and transferring data from nonvolatile to volatile memory.

Key findings include:
  • (1) significant differences exist in effectiveness between error-code approaches, with CRCs being generally superior to checksums in a wide variety of contexts; 
  • (2) common practices and published standards may provide suboptimal (or sometimes even incorrect) information, requiring diligence in selecting practices to adopt in new standards and new systems; 
  • (3) error detection effectiveness depends on many factors, with the Hamming distance of the error code being of primary importance in many practical situations; 
  • (4) no one-size-fits-all error-coding approach exists, although this report does propose a procedure that can be followed to make a methodical decision as to which coding approach to adopt; and 
  • (5) a number of secondary considerations must be taken into account that can substantially influence the achieved error-detection effectiveness of a particular error-coding approach.
You can see my other CRC and checksum posts via the CRC/Checksum label on this blog.

Official FAA site for the report is here: http://www.tc.faa.gov/its/worldpac/techrpt/tc14-49.pdf