Scalable Parallel Machine Learning on High Performance Computing Systems–Clustering and Reinforcement Learning

Zheng, Weijian

doi:10.25394/PGS.21680567.v1

thesis_Weijian_Zheng_deposit.pdf (3.34 MB)

Scalable Parallel Machine Learning on High Performance Computing Systems–Clustering and Reinforcement Learning

thesis

posted on 2022-12-08, 17:29 authored by Weijian ZhengWeijian Zheng

High-performance computing (HPC) and machine learning (ML) have been widely adopted by both academia and industries to address enormous data problems at extreme scales. While research has reported on the interactions of HPC and ML, achieving high performance and scalability for parallel and distributed ML algorithms is still a challenging task. This dissertation first summarizes the major challenges for applying HPC to ML applications: 1) poor performance and scalability, 2) loss of the convergence rate, 3) lower quality of the trained model, and 4) a lack of performance optimization techniques designed for specific applications. Researchers can address the four challenges in new ML applications. This dissertation shows how to solve them for two specific applications: 1) a clustering algorithm and 2) graph optimization algorithms that use reinforcement learning (RL).

As to the clustering algorithm, we first propose an algorithm called the simulated-annealing clustering algorithm. By combining a blocked data layout and asynchronous local optimization within each thread, the simulated-annealing enhanced clustering algorithm has a convergence rate that is comparable to the K-means algorithm but with much higher performance. Experiments with synthetic and real-world datasets show that the simulated-annealing enhanced clustering algorithm is significantly faster than the MPI K-means library using up to 1024 cores. However, the optimization costs (Sum of Square Error (SSE)) of the simulated-annealing enhanced clustering algorithm became higher than the original costs. To tackle this problem, we devise a new algorithm called the full-step feel-the-way clustering algorithm. In the full-step feel-the-way algorithm, there are L local steps within each block of data points. We use the first local step’s results to compute accurate global optimization costs. Our results show that the full-step algorithm can significantly reduce the global number of iterations needed to converge while obtaining low SSE costs. However, the time spent on the local steps is greater than the benefits of the saved iterations. To improve this performance, we next optimize the local step time by incorporating a sampling-based method called reassignment-history-aware sampling. Extensive experiments with various synthetic and real world datasets (e.g., MNIST, CIFAR-10, ENRON, and PLACES-2) show that our parallel algorithms can outperform the fastest open-source MPI K-means implementation by up to 110% on 4,096 CPU cores with comparable SSE costs.

Our evaluations of the sampling-based feel-the-way algorithm establish the effectiveness of the local optimization strategy, the blocked data layout, and the sampling methods for addressing the challenges of applying HPC to ML applications. To explore more parallel strategies and optimization techniques, we focus on a more complex application: graph optimization problems using reinforcement learning (RL). RL has proved successful for automatically learning good heuristics to solve graph optimization problems. However, the existing RL systems either do not support graph RL environments or do not support multiple or many GPUs in a distributed setting. This has compromised RL’s ability to solve large scale graph optimization problems due to the lack of parallelization and high scalability. To address the challenges of parallelization and scalability, we develop OpenGraphGym-MG, a high performance distributed-GPU RL framework for solving graph optimization problems. OpenGraphGym-MG focuses on a class of computationally demanding RL problems in which both the RL environment and the policy model are highly computation intensive. In this work, we distribute large-scale graphs across distributed GPUs and use spatial parallelism and data parallelism to achieve scalable performance. We compare and analyze the performance of spatial and data parallelism and highlight their differences. To support graph neural network (GNN) layers that take data samples partitioned across distributed GPUs as input, we design new parallel mathematical kernels to perform operations on distributed 3D sparse and 3D dense tensors. To handle costly RL environments, we design new parallel graph environments to scale up all RL-environment-related operations. By combining the scalable GNN layers with the scalable RL environment, we are able to develop high performance OpenGraphGym-MG training and inference algorithms in parallel.

To summarize, after proposing the major challenges for applying HPC to ML applications, this thesis explores several parallel strategies and performance optimization techniques using two ML applications. Specifically, we propose a local optimization strategy, a blocked data layout, and sampling methods for accelerating the clustering algorithm, and we create a spatial parallelism strategy, a parallel graph environment, agent, and policy model, and an optimized replay buffer, and multi-node selection strategy for solving large optimization problems over graphs. Our evaluations prove the effectiveness of these strategies and demonstrate that our accelerations can significantly outperform the state-of-the-art ML libraries and frameworks without loss of quality in trained models.

Funding

This research was funded by the U.S. Department of Energy (DOE), Office of Science, Advanced Scientific Computing Research (ASCR) (Interoperable Design of Extreme-scale Application Software (IDEAS)).

Framework: Software: Collaborative Research: CyberWater--An open and sustainable framework for diverse data and model integration with provenance and access to HPC

Directorate for Computer & Information Science & Engineering

Find out more...

This research used resources of the Oak Ridge Leadership Computing Facility, located in the National Center for Computational Sciences at Oak Ridge National Laboratory (ORNL), which is managed by UT-Battelle LLC for the Department of Energy under contract DE-AC05-00OR22725.

History

Degree Type

Doctor of Philosophy

Department

Computer Science

Campus location

West Lafayette

Advisor/Supervisor/Committee Chair

Fengguang Song

Advisor/Supervisor/Committee co-chair

Zhiyuan Li

Additional Committee Member 2

Yao Liang

Additional Committee Member 3

Pan Li

Usage metrics

Keywords

High Performance Computing (HPC)Clustering Algorithm Reinforcement Learning combinatorial optimization problems graph problems Travelling salesperson problem Minimum Vertex Cover Problem Distributed processing of data NP-Hard optimization problems model parallelism data parallelism

Licence

CC BY 4.0

Exports

RefWorks

BibTeX

Ref. manager

Endnote

DataCite

NLM

DC

Scalable Parallel Machine Learning on High Performance Computing Systems–Clustering and Reinforcement Learning

Funding

This research was funded by the U.S. Department of Energy (DOE), Office of Science, Advanced Scientific Computing Research (ASCR) (Interoperable Design of Extreme-scale Application Software (IDEAS)).

Framework: Software: Collaborative Research: CyberWater--An open and sustainable framework for diverse data and model integration with provenance and access to HPC

This research used resources of the Oak Ridge Leadership Computing Facility, located in the National Center for Computational Sciences at Oak Ridge National Laboratory (ORNL), which is managed by UT-Battelle LLC for the Department of Energy under contract DE-AC05-00OR22725.

History

Degree Type

Department

Campus location

Advisor/Supervisor/Committee Chair

Advisor/Supervisor/Committee co-chair

Additional Committee Member 2

Additional Committee Member 3

Usage metrics

Categories

Keywords

Licence

Exports