Budget-Driven Fault Injection Campaigns for Continuous Integration Testing

Context

Testing fault-tolerance mechanisms, and thereby, indirectly, safety-critical systems, is commonly done by performing extensive fault injection (FI) experiments that mimic physical causes of soft errors (bit flips) and by observing the system’s behaviour. There are many possible injection points, for instance every bit in every cycle. The size of the fault space also heavily depends on the layer in the system stack. In our case, we perform ISA-level fault injection, and FAIL*1 is the fault injection framework of our choice.

Currently, fault injection campaigns are performed on a by-need basis. However, as advocated by the community2, testing early and often leads to a more reliable system. Ideally, testing is done continuously from the project start, for instance in a continuous integration (CI) setting.

As an FI campaign covering the whole fault space takes a long time3, we need to shorten the campaign runtime to enable FI in a CI setting. One solution is to inject only a sample of the fault space4. However, as the samples are typically chosen uniformly at random, the sample might paint a misleading picture of the dependability of the program under test (PUT), especially if we want to use the results to guide developers.
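To illustrate the baseline we want to improve on, the following minimal sketch draws a uniform random sample from a simplified ISA-level fault space modelled as (dynamic instruction, bit position) pairs. The function name and the flat-index encoding are our own illustration, not part of FAIL*:

```python
import random

def uniform_sample(trace_length, word_bits, budget, seed=0):
    """Draw `budget` single-bit fault candidates uniformly at random
    from a fault space of `trace_length` dynamic instructions times
    `word_bits` bit positions (a deliberate simplification)."""
    rng = random.Random(seed)
    space = trace_length * word_bits
    picks = rng.sample(range(space), budget)
    # Decode each flat index into an (instruction, bit) injection pilot.
    return [(p // word_bits, p % word_bits) for p in picks]
```

Because every point is equally likely, rarely executed but SDC-prone code regions may receive no pilots at all under a tight budget, which is exactly the problem outlined above.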

Problem

Given an injection budget, we envision a mechanism that selects the most promising candidates for silent data corruptions (SDCs), the worst-case outcome of a fault. However, predicting the probability of an SDC is a chicken-and-egg problem: if we could predict whether an injection leads to an SDC, we would not need the FI campaign in the first place.

Thus, we need another solution. The idea is based on Ebrahimi et al.5, who build a k-ary tree to bundle FI pilots into larger chunks (multi-bit injections) to save overall campaign runtime. We could adapt this approach and combine it with knowledge about the program under test.


Goal

The main goal of this thesis is to enable meaningful FI results under a tight budget of injections.

Ebrahimi et al.5 propose to bundle FI pilots into multi-bit injections, where each multi-bit injection is a node in a k-ary tree. The leaves of the tree correspond to single-bit injections, and each inner node bundles the injections of its children. In their implementation, the bundled leaves (single injections) are chosen randomly.
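Our reading of the tree-based idea can be sketched as follows. The `inject` oracle is hypothetical (in practice it would run a FAIL* experiment) and returns whether a multi-bit injection of the given pilot set leads to an SDC; a benign multi-bit result is taken to imply that all contained single-bit injections are benign, which is exactly the independence assumption revisited below:

```python
def prune_campaign(pilots, inject, k=2):
    """Recursively narrow down the SDC-causing single-bit pilots.
    `inject(pilot_chunk)` is a hypothetical oracle returning True iff
    injecting all pilots in the chunk at once leads to an SDC."""
    sdc_pilots = []
    stack = [list(pilots)]
    while stack:
        chunk = stack.pop()
        if not inject(chunk):
            continue                          # benign: prune whole subtree
        if len(chunk) == 1:
            sdc_pilots.append(chunk[0])       # leaf: single-bit SDC found
            continue
        step = (len(chunk) + k - 1) // k      # split into up to k children
        stack.extend(chunk[i:i + step] for i in range(0, len(chunk), step))
    return sdc_pilots
```

If SDCs are rare, whole subtrees are discarded after a single experiment, which is where the runtime savings come from.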

However, in our case we could try to bundle them in a semantically meaningful way (based on, e.g., static analysis of the program under test). Each layer of the tree could represent a finer granularity: basic blocks, function boundaries, maybe module (translation-unit) boundaries, etc. We could then calculate the budget needed to answer, for a certain level of granularity, whether a part of the PUT is susceptible to SDCs. Based on these test results, the developer could decide whether the results of the deeper parts of the subtree are needed.
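As a minimal sketch of such semantic bundling, pilots could be grouped into coarser tree layers via a hypothetical static-analysis mapping from instruction addresses to program structure (the `addr_info` format and function name are assumptions for illustration):

```python
from collections import defaultdict

def semantic_layers(pilots, addr_info):
    """Group single-bit pilots, given as (address, bit) pairs, into
    coarser tree layers. `addr_info` maps an instruction address to a
    (translation_unit, function, basic_block) triple -- hypothetical
    output of a static analysis of the PUT."""
    by_function = defaultdict(list)
    by_block = defaultdict(list)
    for address, bit in pilots:
        tu, func, block = addr_info[address]
        # The same pilot appears on every layer, just in coarser buckets.
        by_function[(tu, func)].append((address, bit))
        by_block[(tu, func, block)].append((address, bit))
    return by_function, by_block
```

Each dictionary then corresponds to one tree level: a benign result for a function bucket would spare all basic-block and single-bit experiments below it.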

In the CI setting we could also utilise knowledge from injections for past commits/patches: if a certain subtree was not affected by SDCs in the past, we could keep it combined on the next run and only re-split it if the combined multi-bit injection leads to an SDC.
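This re-use of past results could be planned per semantic group, for example like this (a sketch; the group names and the shape of the cached benign set are assumptions):

```python
def plan_next_run(groups, benign_last_run):
    """Decide the starting granularity per semantic group for the next
    CI run. Groups that were benign in the previous campaign start as a
    single combined multi-bit pilot; all other groups are scheduled at
    the finer granularity (i.e. re-split) right away."""
    combined, fine_grained = [], []
    for name, pilots in groups.items():
        if name in benign_last_run:
            combined.append((name, pilots))      # one multi-bit injection
        else:
            fine_grained.append((name, pilots))  # descend immediately
    return combined, fine_grained
```

Only if a combined injection now yields an SDC (e.g. because a patch touched that code) would the corresponding subtree be split again as in the tree scheme above.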

But the mechanism of Ebrahimi et al. is based on the finding that single-bit injections leading to SDCs are independent of each other and, for instance, do not cancel each other out (this happens only with a very low probability). If we switch from randomly bundled injections to injections that are somehow related, we need to re-verify this property and need a model of how probable it actually is that injections mask each other.

For the evaluation, several targets come to mind:

To summarize, the following is needed:

  1. Reevaluation of the hypothesis of Ebrahimi et al.
    • With randomly grouped injections
    • With injections that are grouped in a semantically meaningful way
  2. Model the relations of the semantic hierarchy (i.e. the levels of the tree)
    • Translation unit -> (static) functions -> (dynamic) function executions, basic blocks -> ??? -> single injections
    • Predict (or at least measure) the potential error of the bundling at each level
  3. Customize FAIL* to allow the bundled injections
    • Evaluate on MiBench (some tooling already exists)
    • Maybe one of the other stated projects
  4. Utilize the CI context
    • For concurrent FI campaigns, adjust the tree depending on past benign results

Topics: C++ (for FAIL*), Python (for data visualization and benchmark automation; I am open to other tooling), Docker/Podman (environment to build and run FAIL*), dependability analysis

References