Date of Award
2025
Document Type
Thesis
Degree Name
Master of Science (MS)
Department
Computer Science
Committee Chair
Joshua Booth
Committee Member
Jacob Hauenstein
Committee Member
Evan Miller
Research Advisor
Joshua Booth
Subject(s)
High performance computing, Iterative methods (Mathematics)
Abstract
The increasing complexity of modern high-performance computing (HPC) systems, with their vast number of processing units and extensive memory hierarchies, introduces significant challenges to system resilience, particularly in the face of soft errors. Krylov subspace methods, which play a pivotal role in solving large-scale sparse linear systems and eigenvalue problems, are iterative in nature and thus susceptible to the propagation of soft errors. In this paper, we examine the fault tolerance properties of two widely used Krylov subspace methods: the Lanczos method and the Bi-Conjugate Gradient (BiCG) method. Specifically, we analyze the impact of soft errors introduced during the critical Sparse Matrix-Vector Multiplication (SpMV) operation on the accuracy of eigenvalue computations in the Lanczos method and on the convergence behavior of the BiCG method. Our empirical results reveal that while both methods demonstrate an intrinsic resilience to errors, the BiCG method exhibits superior self-correcting characteristics. Moreover, we observe a strong correlation between the row-2-norm of the sparse matrix and the observed slowdowns, suggesting that targeted fault protection can enhance overall algorithmic robustness. Additionally, we present a comparative analysis of the fault tolerance of the BiCG and Preconditioned Conjugate Gradient (PCG) methods, emphasizing the importance of BiCG among Krylov subspace methods.
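As a rough illustration of the kind of experiment the abstract describes, the following minimal sketch injects a single bit flip into one output entry of a CSR-format SpMV and computes per-row 2-norms of the matrix. The fault model (one flipped bit in one entry of the product), the reading of "row-2-norm" as the Euclidean norm of each row, and every name here (`flip_bit`, `spmv_csr`, `row_2_norms`, the `fault` parameter) are assumptions made for illustration; the abstract does not specify how the thesis injects faults.

```python
import math
import struct


def flip_bit(value: float, bit: int) -> float:
    """Return `value` with one bit of its IEEE 754 binary64 encoding flipped."""
    (raw,) = struct.unpack("<Q", struct.pack("<d", value))
    (flipped,) = struct.unpack("<d", struct.pack("<Q", raw ^ (1 << bit)))
    return flipped


def spmv_csr(indptr, indices, data, x, fault=None):
    """Compute y = A @ x for a matrix A stored in CSR form.

    `fault` is a hypothetical (row, bit) pair: once that row's dot product
    has been accumulated, one bit of the result is flipped to mimic a
    soft error.
    """
    y = []
    for i in range(len(indptr) - 1):
        acc = 0.0
        for k in range(indptr[i], indptr[i + 1]):
            acc += data[k] * x[indices[k]]
        if fault is not None and fault[0] == i:
            acc = flip_bit(acc, fault[1])
        y.append(acc)
    return y


def row_2_norms(indptr, data):
    """Euclidean norm of each row's stored entries (assumed meaning of row-2-norm)."""
    return [
        math.sqrt(sum(v * v for v in data[indptr[i]:indptr[i + 1]]))
        for i in range(len(indptr) - 1)
    ]


# Small 3x3 example: [[4, 1, 0], [1, 3, 0], [0, 0, 2]]
indptr = [0, 2, 4, 5]
indices = [0, 1, 0, 1, 2]
data = [4.0, 1.0, 1.0, 3.0, 2.0]
x = [1.0, 1.0, 1.0]

clean = spmv_csr(indptr, indices, data, x)
# Flip the lowest exponent bit (bit 52) of the second output entry.
faulty = spmv_csr(indptr, indices, data, x, fault=(1, 52))
norms = row_2_norms(indptr, data)
```

In a study of this kind, the faulty product would be fed back into the Lanczos or BiCG iteration, and the per-row norms could serve as one possible way to rank which rows to protect.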
Recommended Citation
Pandit, Sandesh, "Fault tolerance in Krylov subspace methods" (2025). Theses. 743.
https://louis.uah.edu/uah-theses/743