
Monday, July 20, 2015

Avoiding EEPROM and Flash Memory Wearout


Summary: If you're periodically updating a particular EEPROM value every few minutes (or every few seconds), you could be in danger of EEPROM wearout. Avoiding this requires reducing the per-cell write frequency. For some EEPROM technology, anything more frequent than about once per hour could be a problem. (Flash memory has similar issues.)

Time Flies When You're Recording Data:

EEPROM is commonly used to store configuration parameters and operating history information in embedded processors. For example, you might have a rolling "flight recorder" function to record the most recent operating data in case there is a system failure or power loss. I've seen specifications for this sort of thing require recording data every few seconds.

The problem is that EEPROM only works for a limited number of write cycles. After perhaps 100,000 to 1,000,000 writes (depending on the particular chip you are using), some of your deployed systems will start exhibiting EEPROM wearout and you'll get a field failure. (Look at your data sheet to find the number. If you are deploying a large number of units, "worst case" is probably more important to you than "typical.") A million writes sounds like a lot, but they go by pretty quickly. Let's work an example, assuming that a voltage reading is being recorded to the same byte in EEPROM every 15 seconds.

1,000,000 writes at one write per 15 seconds is 4 writes per minute:
  1,000,000 writes / (4 writes/minute * 60 minutes/hour * 24 hours/day) = 173.6 days
In other words, your EEPROM will use up its million-cycle wearout budget in less than 6 months.
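If you want to play with the numbers for your own design, here's a minimal back-of-the-envelope sketch in C (the endurance and update-period values below are examples only; substitute the numbers from your data sheet):

#include <stdio.h>

/* Back-of-the-envelope EEPROM wearout estimate.
 * endurance:  rated write cycles per cell (from the data sheet)
 * period_sec: seconds between writes to the same cell
 */
static double years_to_wearout(double endurance, double period_sec)
{
    const double SEC_PER_YEAR = 365.25 * 24.0 * 60.0 * 60.0;
    return (endurance * period_sec) / SEC_PER_YEAR;
}

int main(void)
{
    printf("1M cycles,   15 sec period: %.2f years\n", years_to_wearout(1.0e6, 15.0));
    printf("1M cycles,    5 min period: %.2f years\n", years_to_wearout(1.0e6, 300.0));
    printf("100K cycles, 1 hour period: %.2f years\n", years_to_wearout(1.0e5, 3600.0));
    return 0;
}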

Below is a table showing the time to wearout (in years) based on the period used to update any particular EEPROM cell. The crossover value for a 10 year product life is one update every 5 minutes 15 seconds for an EEPROM with a million cycle life. For a 100K cycle EEPROM you can only update a particular cell every 52 minutes 36 seconds. This means updates every few seconds just aren't going to work out if you expect your product to last years instead of months. Things scale linearly, although in real products secondary factors such as operating temperature and access mode can also play a role.

[Table: time to wearout in years vs. per-cell update period, for 100K and 1M write-cycle EEPROM]



Reduce Frequency
The least painful way to resolve this problem is simply to record the data less often. In some cases a slower recording rate might still meet your system requirements.

Or you might record only when things change by more than a small amount, with a minimum delay between successive data points (as sketched below). However, with event-based recording, be mindful of value jitter and of scenarios in which a burst of events can wear out EEPROM.
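For illustration, here's a minimal sketch of that idea in C. The deadband, the minimum period, and the eeprom_write_value()/now_ticks() hooks are all hypothetical placeholders for whatever your platform provides:

#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

#define DEADBAND         5     /* ignore changes smaller than this */
#define MIN_PERIOD_TICKS 3600u /* never write more often than this */

extern void     eeprom_write_value(uint16_t value); /* platform-specific */
extern uint32_t now_ticks(void);                    /* platform-specific */

void maybe_record(uint16_t new_value)
{
    static bool     first_write = true;
    static uint16_t last_recorded;
    static uint32_t last_write_ticks;

    /* Unsigned subtraction is wraparound-safe for a free-running tick. */
    uint32_t elapsed = now_ticks() - last_write_ticks;
    bool changed = abs((int)new_value - (int)last_recorded) > DEADBAND;

    if (first_write || (changed && elapsed >= MIN_PERIOD_TICKS)) {
        eeprom_write_value(new_value);
        last_recorded    = new_value;
        last_write_ticks = now_ticks();
        first_write      = false;
    }
}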

(It would be nice if you could track how often EEPROM has been written. But that requires a counter that's kept in EEPROM ... so that idea just pushes the problem into the counter wearing out.)

Low Power Interrupt
In some processors there is a low power interrupt that can be used to record one last data value in EEPROM as the system shuts down due to loss of power. In general you keep the value of interest in a RAM location, and push it out to EEPROM only when you lose power. Or, perhaps, you record it to EEPROM once in a while and push another copy out to EEPROM as part of shut-down to make sure you record the most up-to-date value available.

It's important to make sure that there is a hold-up capacitor that will keep the system above the EEPROM programming voltage requirement for long enough.  This can work if you only need to record a value or two rather than a large block of data. But it is easy to get this wrong, so be careful!
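Here's a rough sketch of the idea; every name here is invented, since each microcontroller does this differently, and the whole scheme depends on the hold-up capacitor actually sustaining the programming voltage long enough:

#include <stdint.h>

volatile uint16_t g_live_value;  /* updated in RAM during normal operation */

/* Hypothetical blocking EEPROM write primitive. */
extern void eeprom_write_block_blocking(uint16_t addr, const void *src,
                                        uint16_t len);

/* Hooked to the low-voltage / brownout-warning interrupt. Keep it short:
 * there is only enough energy in the hold-up cap for a small write. */
void low_voltage_isr(void)
{
    eeprom_write_block_blocking(0x0010u, (const void *)&g_live_value,
                                sizeof(g_live_value));
    /* Power is going away; optionally halt here and wait for the end. */
}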

Rotating Buffer
The classical solution for EEPROM wearout is a rotating buffer (sometimes called a circular FIFO) holding the last N recorded values. This reduces EEPROM wearout in proportion to the number of copies of the data in the buffer. For example, if you rotate through 10 different locations that take turns recording a single monitored value, each location gets modified 1/10th as often, so EEPROM wearout is improved by a factor of 10.

The catch is that you also need a counter or timestamp for each copy, so that after a power cycle you can figure out which entry in the buffer holds the most recent value. (If you keep only one counter location in EEPROM, that counter wears out, since it has to be incremented on every update.) In other words, you need two rotating buffers: one for the values, and one for the counters. The disadvantage of this approach is that it requires 10 times as many bytes of EEPROM storage to get 10 times the life, plus 10 copies of the counter value. You can be a bit clever by packing the counter in with the data, and if you are recording a large record in EEPROM then a few extra bytes for the counter copies aren't as big a deal as the replicated data memory. But any way you slice it, this is going to use a lot of EEPROM.
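Here's a minimal sketch of the packed variant (counter stored alongside the value in each slot). The eeprom_read()/eeprom_write() primitives are hypothetical, and handling of erased cells, sequence number wraparound, and power loss in mid-write is omitted for brevity:

#include <stdint.h>

#define N_SLOTS   8u
#define BASE_ADDR 0x0020u

typedef struct {
    uint32_t seq;     /* incremented on every write; newest slot has max */
    uint16_t value;   /* the monitored value */
} slot_t;

extern void eeprom_read(uint16_t addr, void *dst, uint16_t len);
extern void eeprom_write(uint16_t addr, const void *src, uint16_t len);

#define SLOT_ADDR(i) ((uint16_t)(BASE_ADDR + (i) * sizeof(slot_t)))

/* Scan all slots once (e.g., at power-up) to find the newest entry. */
static uint8_t find_newest(slot_t *newest)
{
    uint8_t idx = 0;
    slot_t s;
    newest->seq = 0;
    newest->value = 0;
    for (uint8_t i = 0; i < N_SLOTS; i++) {
        eeprom_read(SLOT_ADDR(i), &s, sizeof(s));
        if (s.seq >= newest->seq) { *newest = s; idx = i; }
    }
    return idx;
}

/* Write the next value one slot past the newest, spreading the wear. */
void record_value(uint16_t value)
{
    slot_t newest;
    uint8_t idx = find_newest(&newest);
    slot_t next = { newest.seq + 1u, value };
    eeprom_write(SLOT_ADDR((uint8_t)((idx + 1u) % N_SLOTS)),
                 &next, sizeof(next));
}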

Atmel has an application note that goes through the gory details:
AVR-101: High Endurance EEPROM Storage:  http://www.atmel.com/images/doc2526.pdf

Special Case For Remembering A Counter Value
Sometimes you want to keep a count rather than record arbitrary values. For example, you might want to count the number of times a piece of equipment has cycled, or the number of operating minutes for some device.  The worst part of counters is that the bottom bit of the counter changes on every single count, wearing out the bottom count byte in EEPROM.

But there are special tricks you can play. An application note from Microchip has some clever ideas, such as using a gray code so that only one byte of a multi-byte counter has to be updated on each count. They also recommend using error correcting codes to compensate for wearout. (I don't know how effective ECC will be at mitigating wearout, because that depends upon whether bit failures are independent within the counter data bytes -- so be careful with that idea.) See this application note:   http://ww1.microchip.com/downloads/en/AppNotes/01449A.pdf
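To be clear about what's being illustrated: the sketch below is not the app note's algorithm. It shows a simpler tally-array trick based on the same principle of spreading counter writes across many cells so that no single byte takes every increment:

#include <stdint.h>

#define N_TICKS 16u /* per-cell wear is cut by a factor of N_TICKS */

/* Hypothetical byte-level EEPROM primitives. */
extern uint8_t eeprom_read_byte(uint16_t addr);
extern void    eeprom_write_byte(uint16_t addr, uint8_t v);

#define TICK_ADDR(i) (uint16_t)(0x0040u + (i))

/* Count one event. Assumes the tick array was initialized to zero.
 * Bumping the smallest cell round-robins the writes across all cells. */
void count_one_event(void)
{
    uint8_t min_i = 0, min_v = 0xFF;
    for (uint8_t i = 0; i < N_TICKS; i++) {
        uint8_t v = eeprom_read_byte(TICK_ADDR(i));
        if (v < min_v) { min_v = v; min_i = i; }
    }
    eeprom_write_byte(TICK_ADDR(min_i), (uint8_t)(min_v + 1u));
}

/* Total count is the sum of all tick cells. A real design would fold
 * the ticks into a wider base value before any cell can saturate. */
uint32_t read_count(void)
{
    uint32_t total = 0;
    for (uint8_t i = 0; i < N_TICKS; i++) {
        total += eeprom_read_byte(TICK_ADDR(i));
    }
    return total;
}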

Note: For those who want to know more, Microchip has a tutorial on the details of wearout with some nice diagrams of how EEPROM cells are designed:
ftp://ftp.microchip.com/tools/memory/total50/tutorial.html

Don't Re-Write Unchanging Values
Another way to reduce wearout is to read the current value in a memory location before updating it. If the value is the same, skip the update, eliminating the wearout cycle for a write that would have no effect on the data value. Make sure you account for the worst case (how often can you expect values to be the same?). But even if the worst case is bad, this technique gives you a little extra margin of safety whenever you get lucky and can skip a write.
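This is a common enough trick that avr-libc, for example, provides it as eeprom_update_byte(). A generic version might look like this, assuming hypothetical platform read/write primitives:

#include <stdbool.h>
#include <stdint.h>

extern uint8_t eeprom_read_byte(uint16_t addr);
extern void    eeprom_write_byte(uint16_t addr, uint8_t v);

/* Returns true if a write (and therefore a wear cycle) was spent. */
bool eeprom_update_if_changed(uint16_t addr, uint8_t v)
{
    if (eeprom_read_byte(addr) == v) {
        return false; /* value already stored: no write, no wear */
    }
    eeprom_write_byte(addr, v);
    return true;
}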

If you've run into any other clever ideas for EEPROM wearout mitigation please let me know.

Learning More
Nash Reilly has a nice series of tutorial postings on how Flash/EEPROM technology works. (I found out about these via Jack Ganssle's newsletter.)
http://cushychicken.github.io/nand-pt1-transistors/
http://cushychicken.github.io/nand-pt2-floating/
http://cushychicken.github.io/nand-pt3-arrays/
http://cushychicken.github.io/nand-pt4-pages-blocks/
http://cushychicken.github.io/nand-pt5-how-nand-breaks/
http://cushychicken.github.io/nand-pt6-dealing-with-flaws/ 
http://cushychicken.github.io/inconvenient-truths/

Oct 2019: Tesla is said to have a flash wearout problem with the eMMC flash memory in its media control units.  https://insideevs.com/news/376037/tesla-mcu-emmc-memory-issue/


Monday, April 7, 2014

Self-Monitoring and Single Points of Failure


A previous post discussed single points of failure in general. Creating a safety-critical embedded system requires avoiding single points of failure in both hardware and software. This post is the first part of a discussion of examples of single points of failure in safety critical embedded systems.

Consequences: When a critical single point of failure fails, the system becomes unsafe, either by taking an unsafe action or by ceasing to perform critical functions.

Accepted Practices: The following are accepted practices for avoiding single point failures in safety critical systems:
  • A safety critical system must not have any single point of failure that results in a significant unsafe condition, if that failure can reasonably be expected to occur during the operational life of the deployed fleet of systems. Because of the high production volumes and usage hours of automobiles, aircraft, and similar systems, it must be expected that any single microcontroller chip, and the software on any single chip, will fail in an arbitrarily unsafe manner.
  • Properly implemented monitor-actuator pairs, redundant input processing, and a comprehensive defense-in-depth strategy are all accepted practices for mitigating single point faults (see future blog entries for postings on those topics).
  • Multiple points of failure that can fail at the same time due to the same cause, that can accumulate without being detected and mitigated during system operation, or that are otherwise likely to fail concurrently must be treated as having the same severity as a single point of failure.
Discussion:
MISRA Report 2 states that the objective of risk assessment is to “show that no single point of failure within the system can lead to a potentially unsafe state, in particular for the higher Integrity Levels.” (MISRA Report 2, 1995, pg. 17). In this context, “higher Integrity levels” are those functions that could cause significant unsafe behavior, typically involving passenger deaths. That report also says that the risk from multiple faults must be sufficiently low to be acceptable.

Mauser reports on a Siemens Automotive study of electronic throttle control for automobiles (Mauser 1999). The study specifically accounted for random faults (id., p. 732), as well as considering the probability of “runaway” incidents (id., p. 734) in which an open throttle fault could cause a mishap. It found a possibility of single point failures, and in particular identified dual redundant throttle electrical signals being read by a single shared (multiplexed) analog to digital converter in the CPU (id., p. 739) as a critical flaw.

Ademaj says that “independent fault containment regions must be implemented in separate silicon dies.” (Ademaj 2003, p. 5) In other words, any two functions on the same silicon die are subject to arbitrary faults and constitute a single point of failure.

But Ademaj didn’t just say it: he proved it via experimentation on a communication chip specifically designed for safety critical automotive drive-by-wire applications (id., pg. 9, conclusions). Those results forced the designers of the TTP protocol chip (based on the work of Prof. Kopetz) to change their approach to fault tolerance and adopt a star topology, because combining the network CPU with the network monitor on the same silicon die proved susceptible to single points of failure, even though the die had been specifically designed to physically isolate the network monitor from the main CPU. In other words, despite every attempt at on-chip isolation, two completely independent circuits sharing the same chip were observed to fail together from a single fault in a safety-critical automotive drive-by-wire design.

A fallacy in designing safety critical systems is thinking that partial redundancy in the form of "fail-safe" hardware or software will catch all problems without taking into account the need for complete isolation of the potentially faulty component and the mitigation component. If both the mitigation and the fault are in the same Fault Containment Region (FCR), then the system can't be made entirely safe.

To give a more concrete example, consider a single CPU with a self-monitoring feature that has hardware and/or software that detects faults within that same CPU. One could envision such a system signaling a self-health report to an outside device. Such a design pattern is sometimes called a "simplex system with disengagement monitor" and uses "Built-In Test" (BIT) to do the self-checking.  (Note that BIT is a generic term for self-checks, and does not necessarily mean a manufacturing gate-level test or other specific diagnostic.) If a self-health check fails, the system fails over to a safe state by, for example, shutting down (if shutting down is safe). To be sure, doing this is better than doing nothing. But it can never give complete coverage. What if the self-health check is compromised by the very fault it needs to detect?

A look at a research paper on aerospace fault tolerant architectures explains why a simplex (single-FCR) system with BIT is inadequate for high-integrity safety-critical systems. Hammett (2001) figure 5 shows a simplex computer with BIT disengagement features, and says that they “increase the likelihood the computer will fail passive rather than fail active. But it is important to realize that it is impossible to design BIT that can detect all types of computer failures and very difficult to accurately estimate BIT effectiveness.” (id., pg. 1.C.5-4, emphasis added) Such an architecture is said to “Fail Active” after some failures (id., Table 1, p. 1.C.5-7), where “A fail active condition is when the outputs to actuators are active, but uncontrolled. … A fail active condition is a system malfunction rather than a loss of function.” (id., pg. 1.C.5-2, emphasis per original) “For some systems, an annunciated loss of function is an acceptable fail-safe, but a malfunction could be catastrophic.” (id., p. 1.C.5-3, emphasis per original) In particular, with such an architecture depending on the fraction of failures caught (which is not 100%), some “failures will be undetected and the system may fail to a potentially hazardous fail active condition.” (id., p. 1.C.5-4, emphasis added).



Table 1 from Hammett 2001, below, shows where Simplex with BIT stands in terms of fault tolerance capability. It will fail active (i.e., fail dangerously) for some single point failures, and that's a problem for safety critical systems.



Note that dual standby redundancy is also inadequate even though it has two copies of the same computer with the same software. This is because the primary has to self-diagnose that it has a problem before it switches to the backup computer (Hammett Fig. 6, below). If the primary doesn't properly self-diagnose, it never switches over, resulting in a fail-active (dangerous) system.


On the other hand, a self-checking pair (Hammett figure 7, above), sometimes known as a "2 out of 2" or 2oo2 system, can tolerate all single point faults in the following way. Each of the computers in a 2oo2 pair runs the same software on identical hardware, usually operating in lockstep. If the outputs don't agree, the system disables its outputs. Any single failure that affects the computation will, by definition, cause the outputs to disagree: it can only affect one of the two computers, and if it doesn't change that computer's output then it hasn't affected the result of the computation. Most dual-point failures will also be detected; the exception is dual-point failures that happen to affect both computers in exactly the same way. Because the two computers are separate FCRs, that is unlikely unless there is a correlated fault such as a software defect or hardware design defect. In practice, the inputs are also replicated, so that a bad sensor isn't a single point of failure either (Hammett's figure is non-specific about inputs, because the focus is on computing patterns).

2oo2 is not a free lunch in many regards, and I'll queue a discussion of the gory details for a future blog post if there is interest. Suffice it to say that you have to pay attention to many details to get this right. But it is definitely possible to build such a system.
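As a conceptual illustration only (real designs cross-wire the comparison and shutdown in hardware rather than centralizing them in code like this), the agreement check running in each computer of the pair might look something like the following sketch, with all of the function names invented:

#include <stdint.h>

extern uint16_t my_command(void);       /* this CPU's computed output      */
extern uint16_t peer_command(void);     /* exchanged with the other CPU    */
extern void actuator_output(uint16_t);  /* drive the (energized) outputs   */
extern void force_safe_state(void);     /* de-energize outputs, reset pair */

/* Run every control cycle on BOTH computers of the 2oo2 pair. */
void output_stage(void)
{
    uint16_t mine  = my_command();
    uint16_t other = peer_command();

    if (mine == other) {
        actuator_output(mine);
    } else {
        force_safe_state(); /* any disagreement shuts the pair down */
    }
}

Note that the comparison itself must not become a single point of failure, which is why each CPU typically controls the other's reset line, as discussed below.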

With a 2oo2 system, the second CPU does not improve availability; in fact it reduces availability, because there are twice as many computers that can fail. To attain availability as well, a redundant failover set of 2oo2 computers can be used (Hammett Fig. 9, dual self-checking pair), and in fact this is a commonly used architecture in railway switching equipment. Each 2oo2 pair self-checks, and if it detects an error it shuts down, swapping in the other 2oo2 pair. So a single 2oo2 pair is there for safety; the second 2oo2 pair is there to prevent outages (see Hammett figure 9, below).


From the above we can see that avoiding single points of failure requires at least two CPUs, with care taken to ensure that each CPU is a separate fault containment region. If you need a fail-operational system, then 4 CPUs arranged per figure 9 above will give you that, but at a cost of 4 CPUs.

Note that we have not at any point attempted to identify some "realistic" way in which a computer can both produce a dangerous output and cause its BIT to fail. Such analysis is not required when building a safe system. Rather, the effects of failure modes in electronics are more subtle and complex than can be readily understood (and some would argue that many real but infrequent failure modes are too complex for anyone to understand). It is folly to try to guess all possible failures and somehow ensure that the BIT will never fail. But even if we tried to do this, the price for getting it wrong in terms of death and destruction with a safety critical system is simply too high to take that chance. Instead, we simply assert that Murphy will find a way to make a simplex system with BIT fail active, and take that as a given.

By way of analogy, there is no point doing analysis down to single lines of code or bolt tensile strengths in high-vibration environments within a jet engine to know that flying across the Pacific Ocean in a jet airliner with only one engine working at takeoff is a bad idea. Even perfectly designed jet engines break, and any single copy of perfectly designed jet engine software will eventually fail (due to a single event upset within the CPU it is running on, if for no other reason). The only way to achieve safety is to have true redundancy, with no single point of failure whatsoever that can possibly keep the system from entering a safe state.

In practice the "output if agreement" block shown in these figures can itself be a single point of failure. This is resolved in practical systems by, for example, having each of the computers in a 2oo2 pair control the reset/shutdown line on the other computer in the pair. If either computer detects a mismatch, it both shuts down the other CPU and commits suicide itself, taking down the pair. This system reset also causes the switch in a dual 2oo2 system to change over to the backup pair of computers. And yes, that switch can also be a single point of failure, which can be resolved by, for example, having redundant actuators that are de-energized when the owning 2oo2 pair shuts down. And, we have to make sure our software doesn't cause correlated faults between pairs by ensuring it is of sufficiently high integrity as well.

As you can see, flushing out single points of failure is no small thing. But if you want to build a safety critical system, getting rid of single points of failure is the price of admission to the game. And that price includes truly redundant CPUs for performing safety critical computations.

References:
  • Hammett, Design by extrapolation: an evaluation of fault-tolerant avionics, 20th Conference on Digital Avionics Systems, IEEE, 2001.
  • MISRA, Report 2: Integrity, February 1995.
  • Mauser, Electronic throttle control – a dependability case study, J. Univ. Computer Science, 5(10), 1999, pp. 730-741.

Monday, March 3, 2014

Random Hardware Faults


Computer hardware randomly generates incorrect answers once in a while, in some cases due to single event upsets from cosmic rays. Yes, really! Here's a summary of the problem and how often you can expect it to occur, with references for further reading.

---------------------------------------

(As will be the case with some of my posts over the next few months, this text is written more in an academic style, and emphasizes automotive safety applications since that is the area I perform much of my research in. But I hope it is still useful, if less entertaining in style than some of my other posts.)

Hardware corruption in the form of bit flips in memories has been known to occur on terrestrial computers for more than 30 years (e.g., May & Woods 1979; Karnik et al. 2004; Heijmen 2011). Hardware data corruption includes changes made during operation to both RAM and other CPU hardware resources (for example, bits held in CPU registers that are intermediate results in a computation).

“Soft errors” are those in which a transient fault causes a loss of data, but not damage to a circuit (Mariani 2003, pg. 51). Terminology can be slightly confusing in that a “soft error” has nothing to do with software; it refers to a hardware data corruption that does not involve permanent chip damage.

Soft errors can be caused by external events such as naturally occurring alpha particles emitted by impurities in the semiconductor materials, as well as neutrons from cosmic rays that penetrate the earth’s atmosphere (Mariani 2003, pp. 51-53). Additionally, random faults may be caused by internal events such as electrical noise and power supply disruptions that only occasionally cause problems (Mariani 2003, pg. 51).

As computers evolve to become more capable, they also use smaller and smaller individual devices to perform computation. And, as devices get smaller, they become more vulnerable to data corruption from a variety of sources in which a subtle fleeting event causes a disruption in a data value. For example, a cosmic ray might create a particle that strikes a bit in memory, flipping that bit from a “1” to a “0,” resulting in the change of a data value or death of a task in a real time control system. (Wang 2008 is a brief tutorial.)

Chapter 1 of Ziegler & Puchner (2004) gives an historical perspective of the topic. Chapter 9 of Ziegler & Puchner (2004) lists accepted techniques for protecting against hardware induced errors, including use of error correcting codes (id., p. 9-7) and memory scrubbing (id., p. 9-8). It is noted that techniques that protect memory “are not effective for computational or logic operations” (id., p. 9-8), meaning that only values stored in RAM can be protected by these methods, although special exotic methods can be used to protect the rest of the CPU (id., p. 1-13).

Causes for such failures vary in source and severity depending upon a variety of conditions and the exact technology used, but they are always present. The general idea of the failure mode is as follows. SRAM memory uses pairs of inverter logic gates arranged in a feedback pattern to “remember” a “1” or a “0” value. A neutron produced by a cosmic ray can strike atoms in an SRAM cell, inducing a current spike that overrides the remembered value, flipping a 0 to a 1, or a 1 to a 0. Newer technology uses smaller hardware cells to achieve increased density, and smaller components make it easier for a cosmic ray or other disturbance to disrupt the cell’s value. While a cosmic ray strike may seem like an exotic way for computers to fail, this is a well documented effect that has been known for decades (e.g., May & Woods 1979 discussed radiation-induced errors) and occurs on a regular basis with a predictable average frequency, even though any individual upset is a randomly occurring event.

Radiation strike causing transistor disruption (Gorini 2012).

A paper from Dr. Cristian Constantinescu, a top Intel dependability expert at that time, summarized the situation about a decade ago (Constantinescu 2003). He divides random faults into two cases. Intermittent faults, which might be due to a subtle manufacturing or wearout defect, can cause bursts of memory errors (id., pp. 15-16). Transient faults, on the other hand, can be caused by naturally occurring neutron and alpha particles, electromagnetic interference, and electrostatic discharge (id., pg. 16), which he refers to as “soft errors” in general. To simplify terminology, we’ll consider both transient and intermittent faults to be “random” errors, even though intermittent errors strictly speaking may come in occasional bursts rather than having purely random independent statistical distributions. His paper goes on to discuss errors observed in real computers. In his study of 193 corporate servers, 69% of the machines experienced one or more random errors in a 16-month period, and 2.6% of them reported more than 1,000 random errors in that same period due to a combination of random error sources (id., pg. 16):

Random Error Data (Constantinescu 2003, pp. 15-16).


This data supports the well known notion that random errors can and do occur in all computer systems. Due to minute variations in manufacturing, environments, and statistical distribution properties, it is to be expected that some systems will have a very large number of such errors in a short period of time, while many other systems of the exact same type will exhibit none even for extended periods of operation. The problem is only expected to get worse for newer generations of electronics (id., pg. 17).

Excerpt from Mariani 2003, pg. 50

Mariani (2003, pg. 50) makes it clear that soft errors “are officially considered one of the main causes of loss of reliability in modern digital components.” He even goes so far as to say of “x-by-wire” automotive systems (which include “throttle-by-wire” and other computer-controlled functions): “Of course such systems must be highly reliable, as failure will cause fatal accidents, and therefore soft errors must be taken into account.” MISRA Report 2 likewise makes it clear that “random hardware failures” must be controlled for safety critical automotive applications, with the rigor required to control those failures increasing as safety becomes more important (MISRA Report 2, p. 18).

Semiconductor error rates are often given in terms of FIT (“Failures In Time”), where one FIT equals one failure per billion operating hours. Mariani gives a representative rate of 4000 FIT for a CPU circa 2001 (Mariani 2003, pg. 54), with error rates increasing significantly in subsequent years. Based on a review of various data sources, it seems reasonable to use an error rate of 1000 FIT/Mbit (one error per 10^12 bit-hours) for SRAM (e.g., Heijmen 2011, pg. 2; Wang 2008, pg. 430; Tezzaron 2004). Note that not all failures occur in memory. Any transistor on a chip can fail due to radiation, and not all storage is in RAM. As an example, registers such as a computer’s program counter are exposed to Single Event Upsets (SEUs), and an SEU there can result in arbitrarily bad behavior unless some form of mitigation is in place. Automotive-specific publications state that hardware induced data corruption must be considered (Seeger 1982).
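To get a feel for what 1000 FIT/Mbit means for a deployed fleet, here is a quick back-of-the-envelope calculation in C. The fleet-size and memory-size numbers are hypothetical, chosen only for illustration:

#include <stdio.h>

/* Back-of-the-envelope fleet SEU estimate.
 * 1 FIT = 1 failure per 1e9 device-hours. */
int main(void)
{
    const double fit_per_mbit = 1000.0;  /* representative SRAM error rate */
    const double sram_mbit    = 4.0;     /* e.g., 512 KB of on-chip SRAM   */
    const double fleet_units  = 1.0e5;   /* 100,000 deployed units         */
    const double hours_per_yr = 8766.0;  /* 365.25 days * 24 hours         */

    double fit_per_unit   = fit_per_mbit * sram_mbit;       /* 4000 FIT */
    double upsets_per_hr  = fit_per_unit / 1.0e9;           /* per unit */
    double fleet_per_year = upsets_per_hr * fleet_units * hours_per_yr;

    printf("Expected fleet SEUs per year: %.0f\n", fleet_per_year);
    /* ~3500 per year: rare for any one unit, but routine fleet-wide. */
    return 0;
}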

SEU rates vary considerably based on a number of factors such as altitude, with data presented for Boulder Colorado showing a factor of 3.5 to 4.8 increase in cosmic-ray induced errors over data collected in Essex Junction Vermont (near sea level) (O’Gorman 1994, abstract).

Given the above, it is clear that random hardware errors are to be expected when deploying a large fleet of embedded systems. This means that if your system is safety critical, you need to take such errors into account. In a future post I'll explore how bad the effects of such a fault can be.

References:

Constantinescu, Trends and challenges in VLSI circuit reliability, IEEE Micro, July-Aug 2003, pp. 14-19

Heijmen, Soft errors from space to ground: historical overview, empirical evidence, and future trends (Chapter 1), in: Soft Errors in Modern Electronic Systems, Frontiers in Electronic Testing, 2011, Volume 41, pp. 1-41.

Karnik et al., Characterization of soft errors caused by single event upsets in CMOS processes, IEEE Trans. Dependable and Secure Computing, April-June 2004, pp. 128-143.

Los Alamos National Laboratory, The Invisible Neutron Threat, https://www.lanl.gov/science/NSS/issue1_2012/story4full.shtml 

Mariani, Soft errors on digital computers, Fault injection techniques and tools for embedded systems reliability evaluation, 2003, pp. 49-60.

May & Woods, Alpha-particle-induced soft errors in dynamic memories, IEEE Trans. On Electronic Devices, vol. 26, 1979

MISRA, Report 2: Integrity, February 1995. (www.misra.org.uk)

O’Gorman, The effect of cosmic rays on the soft error rate of a DRAM at ground level, IEEE Trans. Electron Devices, April 1994, pp. 553-557.

Seeger, M., Fault-Tolerant Software Techniques, SAE Report 820106, International Congress & Exposition, Society of Automotive Engineers, 1982, pp. 119-125.

Tezzaron Semiconductor, Soft Errors in Electronic Memory -- A White Paper, 2004, www.tezzaron.com

Wang & Agrawal, Single event upset: an embedded tutorial, 21st Intl. Conf. on VLSI Design, 2008, pp. 429-434.

Ziegler & Puchner, SER – History, Trends, and Challenges: a guide for designing with memory ICs, Cypress Semiconductor, 2004.

Sunday, November 27, 2011

Avoiding EEPROM corruption problems

EEPROM is invaluable for storing operating parameters, error codes, and other non-volatile data. But it can also be a source of problems if stored values get corrupted.  Here are some good practices for avoiding EEPROM problems and increasing EEPROM reliability.
  • Pay attention to error return codes when writing values, and make sure that the data you wanted to write actually gets written. Read the value back after the write and check that it is correct.
  • It takes a while to write data. If your EEPROM supports a "finished" signal, use that instead of a fixed timeout that might or might not be long enough.
  • Don't access the EEPROM with marginal voltage. It is common for an external EEPROM chip to need a higher minimum voltage than your microcontroller. That means the microcontroller can be running happily while the EEPROM doesn't have enough voltage to operate properly, causing corrupted writes. In other words, you might have to set your brownout protection at a higher voltage than normal to protect EEPROM operation.
  • Have a plan for power loss during a write cycle. Will your hold-up cap keep things afloat long enough to finish the write? Do you re-initialize the EEPROM when the CPU powers up in case the EEPROM was left half-way through a write cycle when the CPU was reset?
  • Don't use address zero of the EEPROM. It is common for corruption problems to hit address zero, which can be the default address pointed to in the EEPROM if that chip is reset or otherwise has a problem during a write cycle.
  • Watch out for wearout. EEPROM can only be written a finite number of times. Make sure you stay well within the wearout rating for your EEPROM.
  • Consider using error coding such as CRC protection for critical data items or blocks of data items stored in the EEPROM (see the sketch below). Make sure the CRC isn't re-written so often that it causes EEPROM wearout.
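As one example of the CRC idea, here is a sketch of a CRC-protected record. The EEPROM primitives are hypothetical, and CRC-16-CCITT is used purely for illustration; pick a CRC suited to your data size and integrity requirements:

#include <stdbool.h>
#include <stdint.h>

typedef struct {
    uint16_t value_a;
    uint16_t value_b;
    uint16_t crc;        /* CRC over the preceding fields */
} record_t;

extern void eeprom_read(uint16_t addr, void *dst, uint16_t len);
extern void eeprom_write(uint16_t addr, const void *src, uint16_t len);

/* Bitwise CRC-16-CCITT (poly 0x1021, init 0xFFFF). */
static uint16_t crc16_ccitt(const uint8_t *p, uint16_t len)
{
    uint16_t crc = 0xFFFFu;
    while (len--) {
        crc ^= (uint16_t)(*p++) << 8;
        for (uint8_t b = 0; b < 8; b++) {
            crc = (crc & 0x8000u) ? (uint16_t)((crc << 1) ^ 0x1021u)
                                  : (uint16_t)(crc << 1);
        }
    }
    return crc;
}

/* Returns true only if the stored CRC matches the stored data. */
bool record_load(uint16_t addr, record_t *r)
{
    eeprom_read(addr, r, sizeof(*r));
    return crc16_ccitt((const uint8_t *)r,
                       sizeof(*r) - sizeof(r->crc)) == r->crc;
}

void record_store(uint16_t addr, record_t *r)
{
    r->crc = crc16_ccitt((const uint8_t *)r, sizeof(*r) - sizeof(r->crc));
    eeprom_write(addr, r, sizeof(*r));
}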
If you've seen other problems or strategies that you'd like to share, let me know!

You might also be interested in my blog post on EEPROM and flash memory wearout.
