Purdue University Graduate School
thesis.pdf (2.08 MB)


Download (2.08 MB)
posted on 2019-05-14, 15:36 authored by Jianqiao LiuJianqiao Liu
N-body problems, such as simulating the motion of stars in a galaxy and evaluating the spatial statistics through n-point correlation function, are popularly solved. The naive approaches to n-body problems are typically O(n^2) algorithms. Tree codes take advantages of the fact that a group of bodies can be skipped or approximated as a union if their distance is far away from one body’s sight. It reduces the complexity from O(n^2) to O(n*lgn). However, tree codes rely on pointer chasing and have massive branch instructions. These are highly irregular and thus prevent tree codes from being easily parallelized.

GPU offers the promise of massive, power-efficient parallelism. However, exploiting this parallelism requires the code to be carefully structured to deal with the limitations of the SIMT execution model. This dissertation focusses on optimizations for n-body problems on the heterogeneous system. A general inspector-executor based framework is proposed to automatically schedule GPU threads to achieve high performance. Essentially, the framework lets the GPU execute partial of the tree codes and profile threads behaviors, then it assigns the CPU to re-organize these threads to minimize the divergence before executing the remaining portion of the traversals on the GPU. We apply this framework to six tree traversal algorithms, achieving significant speedups over optimized GPU code that does not perform application-specific scheduling. Further, we show that in many cases, our hybrid approach is able to deliver better performance even than GPU code that uses hand tuned, application-specific scheduling.
For large scale input, ChaNGa is the best-of-breed n-body platform. It uses an asymp-totically-efficient tree traversal strategy known as a dual-tree walk to quickly provide an accurate simulation result. On GPUs, ChaNGa uses a hybrid strategy where the CPU performs the tree walk to determine which bodies interact while the GPU performs the force computation. In this dissertation, we show that a highly optimized single-tree walk approach is able to achieve better GPU performance by significantly accelerating the tree walk and reducing CPU/GPU communication. Our experiments show that this new design can achieve a 8.25x speedup over baseline ChaNGa using one node, one process per node configuration. We also point out that ChaNGa's implementation doesn't satisfy the inclusion condition so that GPU-centric remote tree walk doesn't perform well.


Compiler and Run-Time Approaches to Enable Large Scale Irregular Programs

Office of Advanced Scientific Computing Research

Find out more...

OAC-1550525 SI2-SSI: Collaborative Research: ParaTreet: Parallel Software for Spatial Trees in Simulation and Analysis


Degree Type

  • Doctor of Philosophy


  • Electrical and Computer Engineering

Campus location

  • West Lafayette

Advisor/Supervisor/Committee Chair

Milind Kulkarni

Additional Committee Member 2

Anand Raghunathan

Additional Committee Member 3

Mithuna S. Thottethodi

Additional Committee Member 4

Samuel P. Midkiff