Better Embedded System SW: Redundant Input Processing for Safety

Redundant analog and digital inputs to a safety critical system must be fed to independent chips to ensure no single point failure exists. (This posting is a follow-on to a previous post about single points of failure.)

Consequences: Unless input processing and validation are fully replicated, with no single point of failure anywhere in the input path, a single fault can produce erroneous input values that cause unsafe system operation.

Accepted Practices:

A safety critical system must not have any single point of failure that results in a significant unsafe condition if that failure can reasonably be expected to occur during the operational life of the deployed fleet of systems. Redundant input processing is an accepted practice that can help avoid single point failures.

Discussion:

A specific instance of avoiding single points of failure involves the processing of data inputs. It is imperative that safety critical input signals be duplicated and processed independently to avoid a single point of failure in input processing.

Analog inputs must be converted to digital signals via an A/D converter, which is a relatively complex apparatus that takes up a significant amount of chip area. For this reason, it is common to use a single shared A/D converter with multiple shared (“muxed”) inputs to that single converter. If redundant external inputs are run through the same A/D, this creates a single point of failure in the form of the A/D converter itself and the associated control circuitry.
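To make the shared-converter problem concrete, here is a minimal sketch in Python. It is only a toy model, not anything taken from Mauser or from a real part: the `MuxedAdc` class, the stuck mux-select fault, the voltages, and the 0.1 V tolerance are all assumptions chosen for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class MuxedAdc:
    """Toy model of one A/D converter shared by several muxed input pins."""
    pin_volts: List[float]                # voltage present on each input pin
    stuck_channel: Optional[int] = None   # hypothetical fault: mux select stuck

    def read(self, channel: int) -> float:
        # With a stuck mux select, every conversion samples the same pin.
        actual = channel if self.stuck_channel is None else self.stuck_channel
        return self.pin_volts[actual]


def inputs_agree(a: float, b: float, tolerance: float = 0.1) -> bool:
    """Plausibility check: the two redundant readings must roughly match."""
    return abs(a - b) <= tolerance


def shared_converter_case():
    # Pedal sensors on pins 0 and 1 both sit at 0.5 V (released);
    # pin 2 carries an unrelated 4.0 V signal. Mux select is stuck at 2.
    adc = MuxedAdc([0.5, 0.5, 4.0], stuck_channel=2)
    a, b = adc.read(0), adc.read(1)
    return a, b, inputs_agree(a, b)


def separate_converter_case():
    # Each pedal sensor has its own converter on its own chip;
    # the same stuck-mux fault hits only converter A.
    adc_a = MuxedAdc([0.5, 4.0], stuck_channel=1)
    adc_b = MuxedAdc([0.5, 4.0])
    a, b = adc_a.read(0), adc_b.read(0)
    return a, b, inputs_agree(a, b)
```

In the shared case both redundant readings come back as the same wrong value, so the comparison passes and the fault is invisible to software. With separate converters the same fault produces a mismatch that the comparison can catch.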

Mauser gives an example of this problem applied to automotive throttle control, showing that only a "true 2-channel system" (e.g., a 2oo2 system with redundant inputs) provides safety.

Figures from Mauser 1999, pp. 731, 738-739 showing an example throttle control system that causes runaway unless a truly redundant system (dual CPUs plus dual A/D conversion) is used.

Similarly, digital inputs that are processed in the same chip have common circuitry affecting their operation, which in typical chips includes a direction register that determines whether a digital pin is an input or output. A fault in that shared circuitry can therefore affect both redundant inputs at once.

For both analog and digital inputs, an additional way of looking at single point failures is this: if both redundant input signals are processed on the same chip, that one chip is subject to arbitrary faults, and arbitrary fault behavior includes corrupting both inputs in a way that is both faulty and undetectable by other components in the system. Unless some independent means of ensuring system safety is present, such a single point of failure impairs system safety. For instance, an arbitrary fault on a single chip that processes both copies of an analog input sensor might report the sensor as fully activated but within normal operational limits, and the rest of the system will process that value whether the input is really active or not. An embedded CPU in a car might therefore think that the brake pedal is depressed, the accelerator pedal is depressed, or the parking brake is engaged, despite redundant sensors on those controls, if both sensors pass through the same A/D converter or digital input port and there is a common-mode hardware fault in those input processing circuits.

Beyond the need to independently compute results on two different chips, there is an additional requirement for safety: each of the two chips must independently and fully compare the inputs to detect any faults. For example, if chips A and B both process inputs, but only chip B compares them for correctness, then there is a single point of failure if chip B has a bad input and incorrectly reports the comparison as passing. (This only counts as one failure because chip B can fail in an arbitrary way in which it both misinterprets input B and "lies" about the comparison with input A being OK.) A safe way to do such a comparison is that chip A compares both inputs, chip B compares both inputs, and the system only continues operation if both chip A and chip B agree that each of their comparisons validated the inputs as being consistent. Moreover, cross-checks on the outputs based on those inputs must also be performed to detect faults that occur after input processing. That generally leads to a 2oo2 architecture like the one shown below, with each Fault Containment Region (FCR) usually being a CPU chip.
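The difference between one chip doing the comparison and both chips doing it can be sketched in a few lines of Python. This is an illustrative model only: the `Chip` class, its single `faulty` flag, and the 0.05 tolerance are assumptions, and the fault shown (misread the input as full scale and report every comparison as passing) is just one of the many behaviors an arbitrary fault could take. Output cross-checks are not modeled.

```python
from dataclasses import dataclass


@dataclass
class Chip:
    """Toy model of one fault containment region (hypothetical)."""
    faulty: bool = False

    def read(self, true_value: float) -> float:
        # Assumed fault: the chip misreads its input as full scale (1.0).
        return 1.0 if self.faulty else true_value

    def inputs_consistent(self, own: float, other: float,
                          tolerance: float = 0.05) -> bool:
        # Assumed fault: the same chip also "lies" that the check passed.
        if self.faulty:
            return True
        return abs(own - other) <= tolerance


def only_b_compares(chip_a: Chip, chip_b: Chip,
                    true_a: float, true_b: float) -> bool:
    """Unsafe scheme: chip B alone decides whether the inputs agree."""
    reading_a = chip_a.read(true_a)
    reading_b = chip_b.read(true_b)
    return chip_b.inputs_consistent(reading_b, reading_a)


def both_compare(chip_a: Chip, chip_b: Chip,
                 true_a: float, true_b: float) -> bool:
    """Safer scheme: continue only if both chips report a passing check."""
    reading_a = chip_a.read(true_a)
    reading_b = chip_b.read(true_b)
    ok_a = chip_a.inputs_consistent(reading_a, reading_b)
    ok_b = chip_b.inputs_consistent(reading_b, reading_a)
    return ok_a and ok_b
```

With a healthy chip A, a faulty chip B, and both sensors actually at 0.0, `only_b_compares` lets the bad value through, while `both_compare` stops because chip A sees 0.0 against 1.0 and refuses to validate.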

Sometimes both inputs go to both FCRs, but then it must be ensured that there is hardware isolation in place so that a hardware fault on one FCR can't propagate to the inputs of the other FCR via the shared input lines. Another complication is that redundant sensors often do not produce identical output values, and the problem of determining distributed agreement turns out to be very difficult even if all you need is an approximate agreement result. In general, once you have replication, getting agreement across the replicated copies of inputs and computations requires some effort (Poledna 1995). But these are the sorts of issues that engineers routinely work through when creating safety-critical systems.
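As a rough illustration of why non-identical sensor values complicate things, the sketch below shows a common style of local check: a tolerance window plus a limit on consecutive mismatches before a fault is latched. The class name, the tolerance, and the mismatch limit are all assumptions made for this example. It is a per-chip plausibility check only and does not solve the distributed agreement problem discussed by Poledna.

```python
def within_tolerance(x: float, y: float, tolerance: float) -> bool:
    """True if two redundant readings are close enough to be called equal."""
    return abs(x - y) <= tolerance


class MismatchMonitor:
    """Latches a fault after too many consecutive out-of-tolerance samples."""

    def __init__(self, tolerance: float, mismatch_limit: int):
        self.tolerance = tolerance
        self.mismatch_limit = mismatch_limit
        self.consecutive_mismatches = 0
        self.faulted = False

    def update(self, x: float, y: float) -> bool:
        """Feed one pair of samples; return True while inputs are trusted."""
        if within_tolerance(x, y, self.tolerance):
            self.consecutive_mismatches = 0   # brief disagreement forgiven
        else:
            self.consecutive_mismatches += 1
            if self.consecutive_mismatches >= self.mismatch_limit:
                self.faulted = True           # latched until explicit reset
        return not self.faulted
```

Note that a tolerance window does not by itself make two chips reach the same verdict. If chip A samples the pair as (0.50, 0.549) and chip B samples it a moment later as (0.50, 0.551), a 0.05 tolerance passes on A and fails on B, which is one small example of why agreement across replicas takes real design effort.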

In the end, having a single shared A/D converter or other input circuit for a safety critical system is inadequate. You must have two separate input processing circuits on two separate chips to have two independent Fault Containment Regions (e.g., using a 2oo2 architectural approach with redundant inputs). This is required to achieve safety for a high-integrity application, and any high-integrity embedded system that uses a shared A/D converter on a single chip to process redundant inputs is unsafe.

References:

Mauser, "Electronic throttle control – a dependability case study," J. Univ. Computer Science, 5(10), 1999, pp. 730-741.
Poledna, S., "Fault tolerance in safety critical automotive applications: cost of agreement as a limiting factor," Fault-Tolerant Computing, 1995. FTCS-25. Digest of Papers, Twenty-Fifth International Symposium on, 27-30 Jun 1995, Page(s): 73 -8

Monday, April 14, 2014
