Leveraging Application-Specific Knowledge to Guide Statistical Fault Injection

Fault tolerance mechanisms are commonly tested with extensive fault injection (FI) experiments that mimic the effects of radiation, such as soft errors (bit flips), and then observe the system's behavior. The number of possible injections is enormous: in principle, every bit can be flipped in every cycle. Together these possibilities span the so-called fault space. One of the first steps is therefore to determine sets of injection points that lead to the same system behavior, so that fewer injections are needed to assess the functional reliability of the system.

One way to reduce the number of injections is to draw a representative sample of the fault space and inject only the sampled points. In the literature this is called statistical fault injection (statistical FI), and it is reported to reduce the number of required injections substantially.
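To give a feeling for the size of such a sample, the following sketch computes the number of injections with a common finite-population sample-size formula. The function name, the parameter names, and the fault-space dimensions are assumptions made for illustration; whether the formula matches the notation of the linked paper should be checked against the paper itself.

```python
import math
from statistics import NormalDist


def sample_size(population, margin, confidence, p=0.5):
    """Injections needed for a finite fault space (hypothetical helper).

    population: total number of points in the fault space
    margin:     accepted margin of error, e.g. 0.01 for 1 %
    confidence: confidence level, e.g. 0.95
    p:          assumed failure probability; 0.5 is the conservative choice
    """
    # Two-sided critical value of the standard normal distribution.
    t = NormalDist().inv_cdf(0.5 + confidence / 2.0)
    n = population / (1.0 + margin**2 * (population - 1) / (t**2 * p * (1.0 - p)))
    return math.ceil(n)


# Assumed toy dimensions: 1,000,000 cycles, 16 registers of 32 bits each.
fault_space = 1_000_000 * 16 * 32
print(sample_size(fault_space, margin=0.01, confidence=0.95))
```

With these assumed numbers the sample is several orders of magnitude smaller than the fault space, which is the reduction that statistical FI relies on.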

The goal of this thesis is to determine whether statistical FI is actually viable and gives an accurate picture of failure rates, in particular of how they change when the program under test changes. To this end, a sample generator needs to be implemented with a configurable confidence level, margin of error, and, potentially, sampling method (see the linked paper). The generator's results are then compared with the ground truth provided by a systematic fault injection campaign. To test whether statistical FI accurately captures changes in failure rates, a hardened version of the benchmarks used needs to be implemented, e.g., using triple modular redundancy (TMR).
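The comparison between a sampled estimate and the ground truth could look roughly like the sketch below. It assumes uniform sampling without replacement over (cycle, bit) pairs; `toy_outcome` is a made-up stand-in for a real injection experiment, and all names are hypothetical rather than part of any existing FI tool.

```python
import random


def draw_sample(num_cycles, num_bits, n, seed=0):
    """Draw n distinct (cycle, bit) injection points uniformly at random."""
    rng = random.Random(seed)
    total = num_cycles * num_bits
    indices = rng.sample(range(total), n)
    # Map each flat index back to a (cycle, bit) coordinate.
    return [divmod(i, num_bits) for i in indices]


def failure_rate(points, outcome):
    """Fraction of injection points for which the experiment fails."""
    return sum(1 for cycle, bit in points if outcome(cycle, bit)) / len(points)


def toy_outcome(cycle, bit):
    # Stand-in for a real injection run: fails for the low 8 bits in even cycles.
    return bit < 8 and cycle % 2 == 0


num_cycles, num_bits = 1000, 32
full_space = [(c, b) for c in range(num_cycles) for b in range(num_bits)]

truth = failure_rate(full_space, toy_outcome)        # systematic campaign
sample = draw_sample(num_cycles, num_bits, n=2000)   # statistical campaign
estimate = failure_rate(sample, toy_outcome)

print(f"truth={truth:.4f} estimate={estimate:.4f} error={abs(estimate - truth):.4f}")
```

Running the same comparison on an unhardened and a hardened benchmark would show whether the sampled estimate tracks the change in the true failure rate.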

A second, open-ended part is to investigate whether the disadvantages of sampling (loss of meaning in the results, e.g., which line of code is affected by silent data corruptions (SDCs)?) can be alleviated by sampling at a certain granularity (e.g., user-defined, function, or basic block) and, if needed, injecting the full fault space of, e.g., a function.
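One possible reading of this idea is sketched below: the fault space is partitioned per function, each partition is sampled, and a partition is injected exhaustively only if its sampled failure rate exceeds a threshold. The partitioning, the threshold, and every name in the code are assumptions for illustration; the thesis may arrive at a different scheme.

```python
import random


def refine_by_region(regions, outcome, n_per_region, threshold, seed=0):
    """Sample each region; fall back to exhaustive injection where needed.

    regions: dict mapping a region name (e.g. a function) to its list of
             (cycle, bit) injection points
    Returns a dict: name -> (failure_rate, was_injected_exhaustively)
    """
    rng = random.Random(seed)
    report = {}
    for name, points in regions.items():
        k = min(n_per_region, len(points))
        sample = rng.sample(points, k)
        rate = sum(outcome(c, b) for c, b in sample) / k
        exhaustive = rate > threshold
        if exhaustive:
            # Suspicious region: inject its full fault space for exact results.
            rate = sum(outcome(c, b) for c, b in points) / len(points)
        report[name] = (rate, exhaustive)
    return report


def toy_outcome(cycle, bit):
    # Stand-in for a real injection run: only cycles >= 100 are vulnerable.
    return cycle >= 100 and bit < 4


# Assumed mapping of cycle ranges to two hypothetical functions.
regions = {
    "init": [(c, b) for c in range(0, 100) for b in range(8)],
    "compute": [(c, b) for c in range(100, 200) for b in range(8)],
}
report = refine_by_region(regions, toy_outcome, n_per_region=50, threshold=0.1)
for name, (rate, exhaustive) in report.items():
    print(f"{name}: rate={rate:.3f} exhaustive={exhaustive}")
```

In this sketch the exhaustive pass restores per-location detail only for the regions where it is likely to matter, at the cost of extra injections there.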

Keywords

Statistical Fault Injection: Quantified Error and Confidence
R. Leveugle, A. Calvez, P. Maistri, P. Vanhauwaert. Proceedings of the Conference on Design, Automation and Test in Europe. European Design and Automation Association, 2009.
DOI: 10.1109/DATE.2009.5090716

Further Information