Node downtime and failed jobs in a computing cluster translate into wasted resources and user dissatisfaction, so understanding why nodes and jobs fail in HPC clusters is essential. This paper analyzes node and job failures in two university-wide computing clusters at two Tier I US research universities. We analyzed approximately 3.0M job execution records from System A and 2.2M from System B, drawing on accounting logs, resource usage for all primary local and remote resources (memory, I/O, network), and node failure data. We observe several kinds of correlations between failures and resource usage and propose a job failure prediction model that triggers event-driven checkpointing to avoid wasted work. We also provide generalizable insights for cluster management to improve reliability, such as the observation that local contention dominates in some execution environments while system-wide contention dominates in others.
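The event-driven checkpointing idea can be illustrated with a minimal sketch, not the thesis's actual model: a classifier trained on per-job resource-usage features estimates failure probability, and a checkpoint is requested when the estimate crosses a threshold. The feature names, threshold, synthetic training data, and `maybe_checkpoint` helper below are all illustrative assumptions.

```python
# Illustrative sketch only: predict job-failure risk from resource-usage
# features and trigger a checkpoint when the risk is high. The features
# (mem_util, io_wait, net_contention), the threshold, and the synthetic
# data are assumptions, not the model described in the thesis.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Hypothetical training data: one row per job, columns = [mem_util, io_wait, net_contention]
X_train = rng.random((1000, 3))
# Hypothetical labels: 1 = job failed, 0 = job completed
y_train = (X_train[:, 0] + 0.5 * X_train[:, 2] + rng.normal(0, 0.1, 1000) > 1.0).astype(int)

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

CHECKPOINT_THRESHOLD = 0.7  # assumed trigger level


def maybe_checkpoint(job_id: int, features: np.ndarray) -> bool:
    """Request an event-driven checkpoint if predicted failure risk is high."""
    p_fail = model.predict_proba(features.reshape(1, -1))[0, 1]
    if p_fail >= CHECKPOINT_THRESHOLD:
        print(f"job {job_id}: failure risk {p_fail:.2f} -> checkpoint now")
        return True
    return False


# Example: evaluate a running job's current resource usage.
maybe_checkpoint(42, np.array([0.9, 0.4, 0.8]))
```

In practice the checkpoint call would hand off to the cluster's checkpoint/restart mechanism rather than printing; the point of the sketch is only the predict-then-trigger structure.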
Funding
NSF Grant Nos. CNS-1548114 and CNS-1405906
History
Degree Type
Master of Science in Electrical and Computer Engineering