Purdue University Graduate School
Posted on 2021-12-03, 18:14. Authored by Hongyu Miao.

A large volume of data is continuously generated by data centers, humans, and the Internet of Things (IoT). To yield useful insights, this data must be processed in a timely manner with high throughput, low latency, and high accuracy. To meet these performance demands, vendors are shipping a wide range of new hardware, such as multicore CPUs, 3D-stacked memory, embedded microcontrollers, and other accelerators.

However, traditional operating systems (OSes) and data analytics frameworks, the key layer that bridges high-level data processing applications and low-level hardware, fail to meet these requirements because of rapidly evolving hardware and the explosive growth of data. For instance, general-purpose OSes are not aware of the unique characteristics and demands of data processing applications. Data analytics engines for stream processing, e.g., Apache Spark and Beam, handle more data by adding more machines but leave each machine underutilized because they do not fully exploit the underlying hardware features, which leads to poor efficiency. Data analytics frameworks for machine learning inference on IoT devices cannot run neural networks that exceed SRAM size, which rules out many important use cases.

To bridge the gap between the performance demands of data analytics and the new features of emerging hardware, this thesis explores runtime system designs that let high-level data processing applications exploit low-level features of modern hardware. We study two important data analytics applications, real-time stream processing and on-device machine learning inference, on three important hardware platforms across the Cloud and the Edge: multicore CPUs, hybrid memory systems combining 3D-stacked memory and general DRAM, and embedded microcontrollers with limited resources.

To speed up and enable these two data analytics applications on the three hardware platforms, this thesis contributes three related research projects. In StreamBox, we exploit the parallelism and memory hierarchy of modern multicore hardware on a single machine for stream processing, achieving scalable and highly efficient performance. In StreamBox-HBM, we exploit hybrid memories to balance bandwidth and latency, achieving memory scalability and highly efficient performance. Both StreamBox and StreamBox-HBM offer orders-of-magnitude performance improvements over the prior state of the art, opening up new applications with higher data processing needs. In SwapNN, we investigate a system solution that lets microcontrollers (MCUs) execute neural network (NN) inference out of core without losing accuracy, enabling new use cases and significantly expanding the scope of NN inference on tiny MCUs.

We report the system designs, implementations, and experimental results. Based on our experience building these systems, we provide general guidance on designing runtime systems across the hardware/software stack for a wider range of new applications on future hardware platforms.


Funding

NSF Award #1718702

NSF Award #1619075

Google Faculty Award

NSF Award #1464357


Degree Type

  • Doctor of Philosophy


Department

  • Electrical and Computer Engineering

Campus location

  • West Lafayette

Advisor/Supervisor/Committee Chair

Felix Xiaozhu Lin

Additional Committee Member 2

Kathryn S. McKinley

Additional Committee Member 3

Mithuna S. Thottethodi

Additional Committee Member 4

Y. Charlie Hu