Purdue University Graduate School

System Optimizations for SLO-aware ML Inference Serving

thesis
posted on 2025-07-25, 16:02 authored by Zhaoning Kong
<p dir="ltr">The rapid growth of machine learning (ML)-powered applications, including interactive augmented reality (AR), large-scale video analytics, to large language models (LLMs)-based chatbots, has introduced the need to run large ML models that are typically too computationally intensive for mobile devices such as phones or laptops. As a result, ML inference requests are often offloaded to powerful servers, making efficient inference serving more critical than ever. These systems need to serve high volumes of real-time requests under strict latency and/or accuracy requirements, while operating under practical constraints such as heterogeneous hardware, fluctuating workloads, and energy consumption concerns. Meeting these demands requires rethinking traditional inference serving architectures to achieve high throughput, low latency, and energy efficiency simultaneously. However, designing such inference systems poses a number of technical challenges.</p><p dir="ltr">First, latency-optimal serving strategies often conflict with throughput-optimal strategies such as batching, especially under dynamic workloads. Second, serving on clusters with heterogeneous GPUs under tight SLO constraints introduces complexities in resource provisioning and request scheduling. Third, large-scale deployments of models, especially LLMs, require energy-aware designs to tame the growing power costs without violating stringent service-level objectives (SLOs). Addressing these challenges requires novel policies and mechanisms in scheduling, resource allocation, and system optimizations.</p><p dir="ltr">This thesis first presents ARISE, a proactive inference serving system for edge-assisted AR applications. ARISE improves server throughput by coordinating client offloading schedules with server-side batching, while satisfying the accuracy and latency demands of AR workloads. It introduces a lightweight accuracy estimator and a centralized scheduler that jointly optimize offloading frequency and batching efficiency. ARISE demonstrates up to 6.9x improvement in serving capacity compared to baseline systems on representative AR tasks like depth estimation and object detection.</p><p><br></p><p dir="ltr">Next, the thesis introduces PPipe, a model serving system that exploits pipeline parallelism across heterogeneous GPU clusters to improve inference throughput. Unlike traditional pipeline parallelism which groups GPUs into multiple pipelines where each pipeline is a chain of GPUs, PPipe employs a novel pool-based pipeline parallelism strategy which assigns a pool of GPUs to each model partition, enabling flexible scheduling across GPUs with different capabilities. It employs an MILP-based control plane to determine optimal pipeline configurations, and an adaptive batching data plane to handle bursty and asynchronous request arrivals. PPipe achieves up to 75.1% higher throughput and over 60% greater utilization of lower-tier GPUs.</p><p dir="ltr">Finally, this thesis presents Proteus, an energy-efficient inference framework for LLMs that performs fine-grained GPU frequency scaling using Dynamic Voltage and Frequency Scaling (DVFS) at sub-second granularity. Proteus is designed to respond to rapid workload fluctuations, which conventional autoscaling mechanisms, typically operating at minute-level timescales, are too slow to handle. Proteus leverages predictive models and model-predictive control (MPC) to dynamically select GPU frequencies that minimize power draw while maintaining latency SLOs. 
Compared to the baselines, Proteus reduces energy consumption by up to 51.8% under low-load conditions, and up to 38.5% under high-load conditions.</p>
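
To make the abstract's ARISE idea concrete, the following is a minimal illustrative sketch of deadline-aware batch-size selection under a known client offloading interval. It is not the thesis's actual algorithm; the function names and the simple latency model are hypothetical stand-ins for ARISE's accuracy estimator and centralized scheduler.

```python
# Illustrative sketch (hypothetical names and latency model): pick the largest
# server batch size whose worst-case latency (time for the batch to fill plus
# batched inference time) still meets each request's deadline.

def pick_batch_size(num_clients: int,
                    offload_interval_ms: float,
                    batch_latency_ms,  # callable: batch size -> inference latency (ms)
                    deadline_ms: float) -> int:
    """Largest batch size whose fill time + batched inference meets the deadline."""
    arrival_rate = num_clients / offload_interval_ms  # requests per ms
    best = 1
    for b in range(1, num_clients + 1):
        fill_time_ms = (b - 1) / arrival_rate          # waiting for the batch to fill
        if fill_time_ms + batch_latency_ms(b) <= deadline_ms:
            best = b                                   # larger batches -> higher throughput
    return best

if __name__ == "__main__":
    # Hypothetical profile: batched inference latency grows sub-linearly per request.
    profile = lambda b: 12.0 + 3.0 * b
    print(pick_batch_size(num_clients=8, offload_interval_ms=33.3,
                          batch_latency_ms=profile, deadline_ms=100.0))
```

In this toy setting, higher offloading frequency fills batches faster but leaves less slack for batched execution, which is the throughput-versus-latency tension the abstract describes.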
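The PPipe paragraph describes pool-based pipeline parallelism planned by an MILP control plane. The sketch below illustrates only the core idea under simplified assumptions: a two-stage pipeline whose throughput is bounded by its slowest GPU pool, with the MILP replaced by a tiny brute-force search. All latencies, GPU tiers, and names are hypothetical.

```python
# Illustrative sketch of pool-based pipeline parallelism: each model partition
# is served by a pool of (possibly heterogeneous) GPUs; pipeline throughput is
# limited by the slowest pool. A brute-force search stands in for the MILP.

from itertools import product

# Hypothetical per-layer latency (ms) of one model on two GPU tiers.
LAYER_MS = {"high_tier": [2, 3, 4, 3], "low_tier": [5, 7, 9, 6]}
GPU_COUNTS = {"high_tier": 2, "low_tier": 4}

def pool_throughput(pool, split, stage):
    """Requests/ms a pool sustains for one partition (layers before/after `split`)."""
    layers = slice(0, split) if stage == 0 else slice(split, None)
    return sum(n / sum(LAYER_MS[g][layers]) for g, n in pool.items() if n)

def best_two_stage_plan():
    best = None
    n_layers = len(LAYER_MS["high_tier"])
    # Enumerate split points and ways to divide each GPU tier between the two pools.
    for split in range(1, n_layers):
        for h0, l0 in product(range(GPU_COUNTS["high_tier"] + 1),
                              range(GPU_COUNTS["low_tier"] + 1)):
            pools = [{"high_tier": h0, "low_tier": l0},
                     {"high_tier": GPU_COUNTS["high_tier"] - h0,
                      "low_tier": GPU_COUNTS["low_tier"] - l0}]
            tput = min(pool_throughput(p, split, s) for s, p in enumerate(pools))
            if best is None or tput > best[0]:
                best = (tput, split, pools)
    return best

if __name__ == "__main__":
    tput, split, pools = best_two_stage_plan()
    print(f"split after layer {split}, pools={pools}, throughput={tput:.3f} req/ms")
```

The point of the pool abstraction is visible even in this toy: lower-tier GPUs can contribute to whichever partition suits them rather than being forced into a full end-to-end chain.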
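For the Proteus paragraph, the sketch below shows the general shape of SLO-aware frequency selection at a sub-second control interval: pick the lowest-power candidate frequency whose predicted latency still meets the SLO. The latency and power predictors are hypothetical placeholders, not the model-predictive controller developed in the thesis.

```python
# Illustrative sketch of SLO-aware GPU frequency selection (hypothetical models).

CANDIDATE_FREQS_MHZ = [900, 1100, 1300, 1500, 1700]

def predicted_latency_ms(freq_mhz: float, load_tokens_per_s: float) -> float:
    # Hypothetical model: latency scales inversely with frequency, rises with load.
    return 40.0 * (1700.0 / freq_mhz) * (1.0 + load_tokens_per_s / 5000.0)

def predicted_power_w(freq_mhz: float) -> float:
    # Hypothetical model: power grows roughly cubically with frequency.
    return 80.0 + 250.0 * (freq_mhz / 1700.0) ** 3

def choose_frequency(load_tokens_per_s: float, slo_ms: float) -> int:
    """Lowest-power candidate frequency whose predicted latency meets the SLO;
    falls back to the highest frequency if none does."""
    feasible = [f for f in CANDIDATE_FREQS_MHZ
                if predicted_latency_ms(f, load_tokens_per_s) <= slo_ms]
    return min(feasible, key=predicted_power_w) if feasible else max(CANDIDATE_FREQS_MHZ)

if __name__ == "__main__":
    # Re-evaluate at each sub-second control interval (e.g. every 100 ms) as load changes.
    for load in (500.0, 3000.0, 8000.0):
        f = choose_frequency(load, slo_ms=80.0)
        print(f"load={load:>6.0f} tok/s -> {f} MHz, ~{predicted_power_w(f):.0f} W")
```

Because the decision is recomputed every interval, the controller can drop to low-power frequencies during lulls and ramp back up within a fraction of a second when load spikes, which is the gap minute-scale autoscaling leaves open.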

History

Degree Type

  • Doctor of Philosophy

Department

  • Electrical and Computer Engineering

Campus location

  • West Lafayette

Advisor/Supervisor/Committee Chair

Yu Charlie Hu

Additional Committee Member 2

Lu Su

Additional Committee Member 3

Qiang Qiu

Additional Committee Member 4

Sonia Fahmy