Fault Tolerance in Linear Algebraic Methods using Erasure Coded Computations
As parallel and distributed systems scale to hundreds of thousands of cores and beyond, fault tolerance becomes increasingly important -- particularly on systems with limited I/O capacity and bandwidth. Error correcting codes (ECCs) are used in communication systems where errors arise when bits are corrupted silently in a message. Error correcting codes can detect and correct erroneous bits. Erasure codes, an instance of error correcting codes that deal with data erasures, are widely used in storage systems. An erasure code addsredundancy to the data to tolerate erasures.
In this thesis, erasure coded computations are proposed as a novel approach to dealing with processor faults in parallel and distributed systems. We first give a brief review of traditional fault tolerance methods, error correcting codes, and erasure coded storage. The benefits and challenges of erasure coded computations with respect to coding scheme, fault models and system support are also presented.
In the first part of my thesis, I demonstrate the novel concept of erasure coded computations for linear system solvers. Erasure coding augments a given problem instance with redundant data. This augmented problem is executed in a fault oblivious manner in a faulty parallel environment. In the event of faults, we show how we can compute the true solution from potentially fault-prone solutions using a computationally inexpensive procedure. The results on diverse linear systems show that our technique has several important advantages: (i) as the hardware platform scales in size and in number of faults, our scheme yields increasing improvement in resource utilization, compared to traditional schemes; (ii) the proposed scheme is easy to code as the core algorithm remains the same; (iii) the general scheme is flexible to accommodate a range of computation and communication trade-offs.
We propose a new coding scheme for augmenting the input matrix that satisfies the recovery equations of erasure coding with high probability in the event of random failures. This coding scheme also minimizes fill (non-zero elements introduced by the coding block), while being amenable to efficient partitioning across processing nodes. Our experimental results show that the scheme adds minimal overhead for fault tolerance, yields excellent parallel efficiency and scalability, and is robust to different fault arrival models and fault rates.
Building on these results, we show how we can minimize, to optimality, the overhead associated with our problem augmentation techniques for linear system solvers. Specifically, we present a technique that adaptively augments the problem only when faults are detected. At any point during execution, we only solve a system with the same size as the original input system. This has several advantages in terms of maintaining the size and conditioning of the system, as well as in only adding the minimal amount of computation needed to tolerate the observed faults. We present, in details, the augmentation process, the parallel formulation, and the performance of our method. Specifically, we show that the proposed adaptive fault tolerance mechanism has minimal overhead in terms of FLOP counts with respect to the original solver executing in a non-faulty environment, has good convergence properties, and yields excellent parallel performance.
Based on the promising results for linear system solvers, we apply the concept of erasure coded computation to eigenvalue problems, which arise in many applications including machine learning and scientific simulations. Erasure coded computation is used to design a fault tolerant eigenvalue solver. The original eigenvalue problem is reformulated into a generalized eigenvalue problem defined on appropriate augmented matrices. We present the augmentation scheme, the necessary conditions for augmentation blocks, and the proofs of equivalence of the original eigenvalue problem and the reformulated generalized eigenvalue problem. Finally, we show how the eigenvalues can be derived from the augmented system in the event of faults.
We present detailed experiments, which demonstrate the excellent convergence properties of our fault tolerant TraceMin eigensolver in the average case. In the worst case where the row-column pairs that have the most impact on eigenvalues are erased, we present a novel scheme that computes the augmentation blocks as the computation proceeds, using the estimates of leverage scores of row-column pairs as they are computed by the iterative process. We demonstrate low overhead, excellent scalability in terms of the number of faults, and the robustness to different fault arrival models and fault rates for our method.
In summary, this thesis presents a novel approach to fault tolerance based on erasure coded computations, demonstrates it in the context of important linear algebra kernels, and validates its performance on a diverse set of problems on scalable parallel computing platforms. As parallel systems scale to hundreds of thousands of processing cores and beyond, these techniques present the most scalable fault tolerant mechanisms currently available.