Monday, July 20, 2015

Avoiding EEPROM Wearout


Summary: If you're periodically updating a particular EEPROM value every few minutes (or every few seconds) you could be in danger of EEPROM wearout. Avoiding this requires reducing the per-cell write frequency. For some EEPROM technology anything more frequent than about once per hour could be a problem.

Time Flies When You're Recording Data:

EEPROM is commonly used to store configuration parameters and operating history information in embedded processors. For example, you might have a rolling "flight recorder" function to record the most recent operating data in case there is a system failure or power loss. I've seen specifications for this sort of thing require recording data every few seconds.

The problem is that EEPROM only works for a limited number of write cycles. After perhaps 100,000 to 1,000,000 writes (depending on the particular chip you are using), some of your deployed systems will start exhibiting EEPROM wearout and you'll get a field failure. (Look at your data sheet to find the number. If you are deploying a large number of units, "worst case" is probably more important to you than "typical.")  A million writes sounds like a lot, but they go by pretty quickly.  Let's work an example, assuming that a voltage reading is being recorded to the same byte in EEPROM every 15 seconds.

1,000,000 writes at one write per 15 seconds is 4 writes per minute:
  1,000,000 writes / ( 4 writes/minute * 60 minutes/hour * 24 hours/day ) = 173.6 days.
In other words, your EEPROM will use up its million-cycle wearout budget in less than 6 months.

Below is a table showing the approximate time to wearout based on how often any particular EEPROM cell is updated (update period multiplied by the write-cycle budget):

  Update period     100K-cycle EEPROM     1M-cycle EEPROM
  1 second          1.2 days              11.6 days
  15 seconds        17.4 days             174 days
  1 minute          69 days               1.9 years
  5 minutes         0.95 years            9.5 years
  15 minutes        2.9 years             28.5 years
  1 hour            11.4 years            114 years

The crossover values for a 10 year product life are one update every 5 minutes 15 seconds for an EEPROM with a million-cycle life, and one update every 52 minutes 36 seconds for a 100K-cycle EEPROM. This means updates every few seconds just aren't going to work out if you expect your product to last years instead of months. Things scale linearly, although in real products secondary factors such as operating temperature and access mode can also play a role.

Reduce Frequency
The least painful way to resolve this problem is to simply record the data less often. In some cases that might be OK to meet your system requirements.

Or you might be able to record only when things change more than a small amount, with a minimum delay between successive data points. However, with event-based recording be mindful of value jitter or scenarios in which a burst of events can wear out EEPROM.
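Here is a minimal sketch of that idea in C, assuming hypothetical get_time_ms(), read_voltage_mv(), and eeprom_log_sample() helpers (the thresholds are placeholders you would tune for your system):

/* Change-triggered recording with a minimum spacing between EEPROM writes.
 * get_time_ms(), read_voltage_mv(), and eeprom_log_sample() are assumed
 * helpers, not any particular vendor's API. */
#include <stdint.h>
#include <stdbool.h>

#define CHANGE_THRESHOLD_MV    50u         /* ignore jitter below this     */
#define MIN_WRITE_INTERVAL_MS  3600000ul   /* at most one write per hour   */

extern uint32_t get_time_ms(void);              /* assumed */
extern uint16_t read_voltage_mv(void);          /* assumed */
extern void     eeprom_log_sample(uint16_t mv); /* assumed */

void maybe_record_sample(void)
{
    static uint16_t last_recorded_mv = 0;
    static uint32_t last_write_ms = 0;
    static bool     first_sample = true;

    uint16_t mv    = read_voltage_mv();
    uint32_t now   = get_time_ms();
    uint16_t delta = (mv > last_recorded_mv) ? (mv - last_recorded_mv)
                                             : (last_recorded_mv - mv);

    /* Record only on a meaningful change, and never more often than the
     * minimum interval, so a burst of changes can't wear out the cell. */
    if (first_sample ||
        ((delta >= CHANGE_THRESHOLD_MV) &&
         ((uint32_t)(now - last_write_ms) >= MIN_WRITE_INTERVAL_MS))) {
        eeprom_log_sample(mv);
        last_recorded_mv = mv;
        last_write_ms    = now;
        first_sample     = false;
    }
}

The unsigned timestamp subtraction keeps working across timer wraparound, and the minimum-interval check caps worst-case wear no matter how noisy the measured value is.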

(It would be nice if you could track how often EEPROM has been written. But that requires a counter that's kept in EEPROM ... so that idea just pushes the problem into the counter wearing out.)

Low Power Interrupt
In some processors there is a low power interrupt that can be used to record one last data value in EEPROM as the system shuts down due to loss of power. In general you keep the value you're interested in in a RAM location, and push it out to EEPROM only when you lose power.  Or, perhaps, you record it to EEPROM once in a while and push another copy out to EEPROM as part of shut-down to make sure you record the most up-to-date value available.

It's important to make sure that there is a hold-up capacitor that will keep the system above the EEPROM programming voltage requirement for long enough.  This can work if you only need to record a value or two rather than a large block of data. But it is easy to get this wrong, so be careful!
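As a rough illustration of the pattern (the interrupt handler name and EEPROM routine here are hypothetical, not any particular chip's API), the shutdown path only has to push out a byte or two that was being maintained in RAM:

/* Sketch: keep the live value in RAM, write it to EEPROM only on power
 * failure. The vector name and eeprom_write_byte() are assumed; the real
 * code must finish within the hold-up time your capacitor provides. */
#include <stdint.h>

#define EE_LAST_VALUE_ADDR 0x0010u

extern void eeprom_write_byte(uint16_t addr, uint8_t v);   /* assumed HAL */

static volatile uint8_t g_live_value;    /* updated freely, no EEPROM wear */

void update_value(uint8_t v)
{
    g_live_value = v;
}

/* Hypothetical power-fail / brown-out ISR; use your part's actual vector. */
void POWER_FAIL_IRQHandler(void)
{
    /* Keep this short: a value or two, not a large block of data. */
    eeprom_write_byte(EE_LAST_VALUE_ADDR, g_live_value);
}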

Rotating Buffer
The classical solution for EEPROM wearout is to use a rotating buffer (sometimes called a circular FIFO) of the last N recorded values. You also need a counter stored in EEPROM so that after a power cycle you can figure out which entry in the buffer holds the most recent copy. This reduces EEPROM wearout proportionally to the number of copies of the data in the buffer. For example, if you rotate through 10 different locations that take turns recording a single monitored value, each location gets modified 1/10th as often, so EEPROM wearout is improved by a factor of 10. You also need to keep a separate counter or timestamp for each of the 10 copies so you can sort out which one is the most recent after a power loss.  In other words, you need two rotating buffers: one for the value, and one for the counter. (If you keep only one counter location in EEPROM, that counter wears out, since it has to be incremented on every update.)

The disadvantage of this approach is that it requires 10 times as many bytes of EEPROM storage to get 10 times the life, plus 10 copies of the counter value.  You can be a bit clever by packing the counter in with the data, and if you are recording a large record in EEPROM then a few additional bytes for the counter copies aren't as big a deal as the replicated data memory. But any way you slice it, this approach uses a lot of EEPROM.
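Below is a minimal sketch of one way to implement this, assuming a hypothetical byte-wide EEPROM read/write HAL. Each of the N slots has a data byte and a status byte; the status bytes hold consecutive sequence numbers, so the newest slot is the one whose successor breaks the "+1" pattern (which also copes with 8-bit wraparound):

/* Rotating buffer for a single monitored byte, spread across SLOTS cells.
 * eeprom_read_byte()/eeprom_write_byte() are an assumed HAL, and the
 * addresses are placeholders. */
#include <stdint.h>

#define SLOTS          16u
#define EE_VALUE_BASE  0x0020u   /* SLOTS data bytes   */
#define EE_STATUS_BASE 0x0030u   /* SLOTS status bytes */

extern uint8_t eeprom_read_byte(uint16_t addr);             /* assumed HAL */
extern void    eeprom_write_byte(uint16_t addr, uint8_t v); /* assumed HAL */

/* The most recently written slot is the one whose successor's status byte
 * is not (this slot's status + 1) modulo 256. */
static uint8_t current_slot(void)
{
    for (uint8_t i = 0; i < SLOTS; i++) {
        uint8_t next   = (uint8_t)((i + 1u) % SLOTS);
        uint8_t s_i    = eeprom_read_byte(EE_STATUS_BASE + i);
        uint8_t s_next = eeprom_read_byte(EE_STATUS_BASE + next);
        if ((uint8_t)(s_i + 1u) != s_next) {
            return i;
        }
    }
    return 0;   /* fresh or corrupted part: fall back to slot 0 */
}

uint8_t value_read(void)
{
    return eeprom_read_byte(EE_VALUE_BASE + current_slot());
}

void value_write(uint8_t v)
{
    uint8_t cur  = current_slot();
    uint8_t next = (uint8_t)((cur + 1u) % SLOTS);
    uint8_t seq  = (uint8_t)(eeprom_read_byte(EE_STATUS_BASE + cur) + 1u);

    eeprom_write_byte(EE_VALUE_BASE + next, v);     /* data first ...     */
    eeprom_write_byte(EE_STATUS_BASE + next, seq);  /* ... then mark it   */
}

Each update touches one data cell and one status cell, cycling through all SLOTS locations, so per-cell wear drops by roughly a factor of SLOTS. Writing the data byte before its status byte means a power loss in mid-update leaves the previous value intact.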

Atmel has an application note that goes through the gory details:
AVR-101: High Endurance EEPROM Storage:  http://www.atmel.com/images/doc2526.pdf

Special Case For Remembering A Counter Value
Sometimes you want to keep a count rather than record arbitrary values. For example, you might want to count the number of times a piece of equipment has cycled, or the number of operating minutes for some device.  The problem with counters is that the bottom bit of the count changes on every single increment, wearing out the EEPROM cell that holds the bottom byte of the counter.

But, there are special tricks you can play. An application note from Microchip has some clever ideas, such as using a Gray code so that only one byte out of a multi-byte counter has to be updated on each count. They also recommend using error correcting codes to compensate for wearout. (I don't know how effective ECC will be against wearout, because it will depend upon whether bit failures are independent within the counter data bytes -- so be careful with that idea.) See this application note:   http://ww1.microchip.com/downloads/en/AppNotes/01449A.pdf
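As an illustration of the general idea of spreading counter writes across cells (this is a hypothetical tally-ring scheme with an assumed EEPROM HAL, not the Gray-code technique from the app note):

/* Wear-spreading event counter: a 4-byte epoch plus a ring of tally bytes.
 * Each increment marks one tally byte; the epoch changes only once every
 * TALLY_SLOTS counts, so each cell sees roughly 2/TALLY_SLOTS of the writes
 * a single-location counter would take. Assumes the layout was initialized
 * once (epoch = 0, all tally bytes 0xFF); power-loss atomicity and ECC are
 * ignored here -- see the app notes for those details. */
#include <stdint.h>

#define TALLY_SLOTS   32u
#define EE_EPOCH_ADDR 0x0000u    /* 4-byte epoch counter    */
#define EE_TALLY_ADDR 0x0004u    /* TALLY_SLOTS tally bytes */
#define TALLY_EMPTY   0xFFu
#define TALLY_MARKED  0x00u

extern uint8_t eeprom_read_byte(uint16_t addr);             /* assumed HAL */
extern void    eeprom_write_byte(uint16_t addr, uint8_t v); /* assumed HAL */

static uint32_t read_epoch(void)
{
    uint32_t e = 0;
    for (uint8_t i = 0; i < 4u; i++) {
        e |= (uint32_t)eeprom_read_byte(EE_EPOCH_ADDR + i) << (8u * i);
    }
    return e;
}

static void write_epoch(uint32_t e)
{
    for (uint8_t i = 0; i < 4u; i++) {
        eeprom_write_byte(EE_EPOCH_ADDR + i, (uint8_t)(e >> (8u * i)));
    }
}

static uint8_t count_marked(void)
{
    uint8_t marked = 0;
    for (uint8_t i = 0; i < TALLY_SLOTS; i++) {
        if (eeprom_read_byte(EE_TALLY_ADDR + i) == TALLY_MARKED) {
            marked++;
        }
    }
    return marked;
}

/* The count survives power loss: epoch * TALLY_SLOTS + marked slots. */
uint32_t counter_read(void)
{
    return (read_epoch() * TALLY_SLOTS) + count_marked();
}

void counter_increment(void)
{
    uint8_t marked = count_marked();

    if (marked < TALLY_SLOTS) {
        eeprom_write_byte(EE_TALLY_ADDR + marked, TALLY_MARKED); /* 1 write */
    } else {
        /* Ring full: bump the epoch and clear slots 1..N-1; slot 0 stays
         * marked, so the new count is (epoch + 1) * TALLY_SLOTS + 1. */
        write_epoch(read_epoch() + 1u);
        for (uint8_t i = 1; i < TALLY_SLOTS; i++) {
            eeprom_write_byte(EE_TALLY_ADDR + i, TALLY_EMPTY);
        }
    }
}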

Note: For those who want to know more, Microchip has a tutorial on the details of wearout with some nice diagrams of how EEPROM cells are designed:
ftp://ftp.microchip.com/tools/memory/total50/tutorial.html

If you've run into any other clever ideas for EEPROM wearout mitigation please let me know.

Sunday, May 3, 2015

Counter Rollover Bites Boeing 787

Counter rollover is a classic mistake in computer software.  And it has just bitten the Boeing 787.

The Problem:

The Boeing 787 aircraft's electrical power control units shut down if powered without interruption for 248 days (a bit over 8 months). In the likely case that all the control units were turned on at about the same time, that means they all shut down at the same time -- potentially in the middle of a flight. Fortunately, the power is usually not left on for 8 continuous months, so apparently this has not actually happened in flight.  But the problem was seen in a long-duration simulation and could happen in a real aircraft. (There are backup power supplies, but do you really want to be relying on them over the middle of an ocean?  I thought not.) The fix is turning off the power and turning it back on every 120 days.

That's right -- the FAA is telling the airlines they have to do a maintenance reboot of their planes every 120 days.

(Sources: NY Times ; FAA)


Analysis:

Just for fun, let's do the math and figure out what's going on.
248 days * 24 hours/day * 60 minutes/hour * 60 seconds/minute = 21,427,200 seconds
Hmmm ... what if those systems keep time as a 32-bit signed integer counting hundredths of a second? The maximum positive value for such a counter works out to:
0x7FFFFFFF = 2,147,483,647 counts;  2,147,483,647 / 100 counts/second / (24*60*60 seconds/day) = 248.55 days.
Bingo!

If they had used a 32-bit unsigned integer it would still overflow, just after twice as long: 497.1 days.
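Here's a tiny C program that checks this arithmetic (the counter format is my speculation, not anything published about the actual 787 code):

/* Back-of-the-envelope check of the rollover math, assuming a signed 32-bit
 * tick counter in hundredths of a second. */
#include <stdio.h>
#include <stdint.h>

int main(void)
{
    int32_t max_ticks = INT32_MAX;                /* 0x7FFFFFFF hundredths */
    double  seconds   = max_ticks / 100.0;
    double  days      = seconds / (24.0 * 60.0 * 60.0);

    printf("Signed 32-bit, 10 ms ticks: %.2f days\n", days);       /* ~248.55 */
    printf("Unsigned 32-bit:            %.2f days\n", 2.0 * days); /* ~497.1  */
    return 0;
}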


Other Examples:

This is not the first time a counter rollover has caused a problem.  Some examples are:

  • IBM: Interface adapters hang after 497 days of uptime [IBM]
  • Windows 95: hang after 49.7 days without reboot, counting in milliseconds [Microsoft]  
There are also plenty of date roll-over bugs:
  • Y2K: on 1 January 2000 (overflow of 2-digit year from 99 to 00)   [Wikipedia]
  • GPS: 1024 week rollover on 22 August 1999 [USCG]
  • Year 2038: Unix time will roll over on 19 January 2038 [Wikipedia]

There are also somewhat related capacity overflow issues, such as the "512K day" for IPv4 routers.

If you want to dig further, there is a "zoo" of related problems on Wikipedia:  "Time formatting and storage bugs"


Friday, May 1, 2015

How To Report An Unintended Acceleration Problem

Every once in a while I get e-mail from someone concerned about unintended acceleration that has happened to them or someone they know.  Commonly they go to the car dealer and get told (directly or indirectly) that it must have been the driver's fault. I'm sure that must be a frustrating experience.

Fortunately, you can do more than just get blown off by the dealer (if you feel like that is what happened to you).  Visit the US Dept. of Transportation's complaint system and file a complaint with their Office of Defects Investigation (ODI):

What this does is put information into the database that DoT uses to look for unsafe trends in vehicles. ODI conducts defect investigations and administers safety recalls, and this database is a primary source of information for them.  Putting in an entry does not mean that anyone will necessarily get back to you about your particular complaint, but eventually if enough drivers have similar problems with a particular vehicle type, ODI is supposed to investigate. You should be sure to use several different words and phrases to thoroughly describe your situation since often this database is searched via key words. (That means that if they are looking for a particular trend they might only look at records that contain a specific word or phrase, not all records for that vehicle.)  You should include specifics, and in particular things that you can recall that would suggest it is not simply driver error. But, realize that the description you type in will be publicly available, so think about what you write. 

To be sure, this should not be the only thing you do.  If you believe you have a problem with your vehicle you should talk to the dealer and perhaps escalate things from there. (If it happened to me I would at a minimum demand that a written problem report be created with the manufacturer's central defects office, and demand a written response from the manufacturer's customer relations office, to leave a record.)  But, if you skip the DoT database then one of the important feedback mechanisms independent of the car companies that triggers recalls won't have the data it needs to work. If you had an incident that did not result in a police report or insurance claim, reporting is especially important, since there is no other way for DoT or the manufacturer to even know it happened.

Even if you haven't suffered unintended acceleration, you might be interested to look at complaints others have filed for your vehicle type, which are publicly available. And of course you can report any defect you like, not just acceleration issues. 

I recently came across the web site:  http://www.safercar.gov/
This has general car safety information and also has a way to file a vehicle safety complaint.  It is unclear to me what the difference is between this and the IVOQ database above.  One would hope the data ends up in the same place, but I don't have information either way on that.

 (For those who are interested in how the keyword search might be done, you can see a NHTSA Document for an example from the Toyota UA investigations.)

Monday, April 13, 2015

FAA CRC and Checksum Report


I've been working for several years on an FAA document covering good practices for CRC and Checksum use.  At long last this joint effort with Honeywell researchers has been issued by the FAA as an official report.  Those of you who have seen my previous tutorial slides will recognize much of the material. But this is the official FAA-released version.

Selection of Cyclic Redundancy Code and Checksum Algorithms to Ensure Critical Data Integrity
DOT/FAA/TC-14/49
Philip Koopman, Kevin Driscoll, Brendan Hall

http://users.ece.cmu.edu/~koopman/pubs/faa15_tc-14-49.pdf


Abstract:

This report explores the characteristics of checksums and cyclic redundancy codes (CRCs) in an aviation context. It includes a literature review, a discussion of error detection performance metrics, a comparison of various checksum and CRC approaches, and a proposed methodology for mapping CRC and checksum design parameters to aviation integrity requirements. Specific examples studied are Institute of Electrical and Electronics Engineers (IEEE) 802.3 CRC-32; Aeronautical Radio, Incorporated (ARINC)-629 error detection; ARINC-825 Controller Area Network (CAN) error detection; Fletcher checksum; and the Aeronautical Telecommunication Network (ATN)-32 checksum. Also considered are multiple error codes used together, specific effects relevant to communication networks, memory storage, and transferring data from nonvolatile to volatile memory.

Key findings include:
  • (1) significant differences exist in effectiveness between error-code approaches, with CRCs being generally superior to checksums in a wide variety of contexts; 
  • (2) common practices and published standards may provide suboptimal (or sometimes even incorrect) information, requiring diligence in selecting practices to adopt in new standards and new systems; 
  • (3) error detection effectiveness depends on many factors, with the Hamming distance of the error code being of primary importance in many practical situations; 
  • (4) no one-size-fits-all error-coding approach exists, although this report does propose a procedure that can be followed to make a methodical decision as to which coding approach to adopt; and 
  • (5) a number of secondary considerations must be taken into account that can substantially influence the achieved error-detection effectiveness of a particular error-coding approach.
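For readers who want a concrete starting point, here is a minimal bit-at-a-time implementation of the IEEE 802.3 CRC-32 examined in the report (a reference sketch only; production code would normally use a table-driven or hardware implementation):

/* Bit-at-a-time IEEE 802.3 CRC-32 (reflected polynomial 0xEDB88320). */
#include <stdint.h>
#include <stddef.h>

uint32_t crc32_ieee(const uint8_t *data, size_t len)
{
    uint32_t crc = 0xFFFFFFFFu;                  /* standard initial value */
    for (size_t i = 0; i < len; i++) {
        crc ^= data[i];
        for (int bit = 0; bit < 8; bit++) {
            crc = (crc & 1u) ? (crc >> 1) ^ 0xEDB88320u : (crc >> 1);
        }
    }
    return crc ^ 0xFFFFFFFFu;                    /* final XOR */
}

As a sanity check, crc32_ieee() over the ASCII string "123456789" should return the standard check value 0xCBF43926.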
You can see my other CRC and checksum posts via the CRC/Checksum label on this blog.

Official FAA site for the report is here: http://www.tc.faa.gov/its/worldpac/techrpt/tc14-49.pdf



Monday, February 2, 2015

Tester To Developer Ratio Should Be 1:1 For A Typical Embedded Project

Believe it or not, most high quality embedded software I've seen has been created by organizations that spend twice as much on validation as they do on software creation.  That's what it takes to get good embedded software. And it is quite common for organizations that don't have this ratio to be experiencing significant problems with their projects. But to get there, the head count ratio is often about 1:1, since developers should be spending a significant fraction of their time doing "testing" (in broad terms).

First, let's talk about head count. Good embedded software organizations tend to have about an equal number of testers and developers (i.e., 50% tester head count, 50% developer head count). Also, typically, the testers are in a relatively independent part of the organization so as to reduce pressure to sign off on software that isn't really ready to ship. Ratios can vary significantly depending on the circumstance.  A 5:1 tester to developer ratio is something I'd expect for safety-critical flight controls in an aerospace application.  A 1:5 tester to developer ratio might be something I'd expect for ephemeral web application development. But for a typical embedded software project that is expected to be solid code with well defined functionality, I typically expect to see 1:1 tester to developer staffing ratios.

Beyond the staffing ratio is how the team spends their time, which is not 1:1.  Typically I expect to see the developers each spend about two thirds of their time on development, and one-third of their time on verification/validation (V&V). I tend to think of V&V as within the "testing" bin in terms of effort in that it is about checking the work product rather than creating it. Most of the V&V effort from developers is spent doing peer reviews and unit testing.

For the testing group, effort is split between software testing, system testing, and software quality assurance (SQA).  Confusingly, testing is often called "quality assurance," but by SQA what I mean is time spent creating, training the team on, and auditing software processes themselves (i.e., did you actually follow the process you were supposed to -- not actual testing). A common rule of thumb is that about 5%-6% of total project effort should be SQA, split about evenly between process creation/training and process auditing. So that means in a team of 20 total developers+testers, you'd expect to see one full time equivalent SQA person. And for a team size of 10, you'd expect to see one person half-time on SQA.

Taking all this into account, below is a rough shot at how a project should break down in terms of head count and effort distribution.

Head count for a 20 person project:
  • 10 developers
  • 9 testers
  • 1 SQA (both training and auditing)
Note that this does not call out management, nor does it deal with after-release problem resolution, support, and so on. And it would be no surprise if more testers were loaded at the end of the project than the beginning, so these are average staffing numbers across the length of the project from requirements through product release.

Typical effort distribution (by hours of effort over the course of the project):
  • 33%: Developer team: Development (left-hand side of "V" including requirements through implementation)
  • 17%: Developer team: V&V (peer reviews, unit test)
  • 44%: V&V team: integration testing and system testing
  • 3%: V&V team: Process auditing
  • 3%: V&V team: Process training, intervention, improvement
If you're an agile team you might slice up responsibilities differently and that may be OK. But in the end the message is that you're probably only spending about a third of your time actually doing development compared to verification and validation if you want to create very high quality software and ensure you have a rigorous process while doing so.  Agile teams I've seen that violated this rule of thumb often did so at the cost of not controlling their process rigor and software quality. (I didn't say the quality was necessarily bad, but rather what I'm saying is they are flying blind and don't know what their software and process quality is until after they ship and have problems -- or don't.  I've seen it turn out both ways, so you're rolling the dice if you sacrifice V&V and especially SQA to achieve higher lines of code per hour.)

There is of course a lot more to running a project than the above. For example, there might need to be a cross-functional build team to ensure that product builds have the right configuration, are cleanly built, and so on. And running a regression test infrastructure could be assigned to either testers or developers (or both) depending on the type of testing that was included. But the above should set rough expectations.  If you differ by a few percent, probably no big deal. But if you have 1 tester for every 5 developers and your effort ratio is also 1:5, and you are expecting to ship a rock-solid industrial-control or similar embedded system, then think again.

(By way of comparison, Microsoft says they have a 1:1 tester to developer head count. See: How We Test Software at Microsoft, Page, Johnston & Rollison.)