Purdue University Graduate School
Browse

Compiler and Architecture Co-design for Reliable Computing

Download (7.45 MB)
thesis
posted on 2024-07-24, 22:02 authored by Jianping ZengJianping Zeng

Reliability against errors, such as soft errors—transient bit flips in transistors caused by energetic particle strikes—and crash inconsistency arising from power failure, is as crucial as performance and power efficiency for a wide range of computing devices, from embedded systems to datacenters. If not properly addressed, these errors can lead to massive financial losses and even endanger human lives. Furthermore, the dynamic nature of modern computing workloads complicates the implementation of reliable systems as the likelihood and impact of these errors increase. Consequently, system designers often face a dilemma: sacrificing reliability for performance and cost-effectiveness or incurring high manufacturing and/or run-time costs to maintain high system dependability. This trade-off can result in reduced availability and increased vulnerability to errors when reliability is not prioritized or escalated costs when it is.

In light of this, this dissertation, for the first time, demonstrates that with a synergistic compiler and architecture co-design, it is possible to achieve reliability while maintaining high performance and low hardware cost. We begin by illustrating how compiler/architecture co-design achieves near-zero-overhead soft error resilience for embedded cores (Chapter 2). Next, we introduce ReplayCache (Chapter 3), a software-only approach that ensures crash consistency for energy harvesting systems (backed with embedded cores) and outperforms the state-of-the-art by 9x. Apart from embedded cores, reliability for server-class cores is more vital due to their widespread adoption in performance-critical environments. With that in mind, we then propose VeriPipe (Chapter 4), which showcases how a straightforward microarchitectural technique can achieve near-zero-overhead soft error resilience for server-class cores with a storage overhead of just three registers and one countdown timer. Finally, we present two approaches to achieving performant crash consistency for server-class cores by leveraging existing dynamic register renaming in out-of-order cores (Chapter 5) and repurposing Intel’s non-temporal path (Chapter 6), respectively. Through these innovations, this dissertation paves the way for more reliable and efficient computing systems, ensuring that reliability does not come at the cost of performance degradation or hardware complexity.

History

Degree Type

  • Doctor of Philosophy

Department

  • Computer Science

Campus location

  • West Lafayette

Advisor/Supervisor/Committee Chair

Changhee Jung

Additional Committee Member 2

Xuehai Qian

Additional Committee Member 3

Mohammadkazem Taram

Additional Committee Member 4

Muhammad Shahbaz

Additional Committee Member 5

Yongle Zhang

Usage metrics

    Licence

    Exports

    RefWorks
    BibTeX
    Ref. manager
    Endnote
    DataCite
    NLM
    DC