Purdue University Graduate School

<b>Efficient Deep Learning Online Inference Service</b>

thesis
posted on 2025-07-28, 20:14 authored by Zhiwei ChuZhiwei Chu
<p dir="ltr">Deep learning (DL) has emerged as a transformative technology in many fields. However, deploying deep neural network (DNN) online inference services with low response time and high throughput at scale remains challenging because of the high computational demands of DNNs, strict service latency constraints, and complex GPU resource management. Although GPUs are commonly used to accelerate DNN inference, utilizing them efficiently is difficult in practice. Addressing these challenges is critical for deploying scalable, low-latency, high-throughput DNN online inference systems that support real-time AI applications such as autonomous driving, fraud detection, and conversational AI. Efficient GPU utilization also reduces operational costs and enables reliable large-scale deployment.</p><p dir="ltr">This dissertation investigates how DNN online inference services can be designed and implemented to achieve low service response time, high throughput, and efficient GPU utilization. First, it analyzes the impact of data preprocessing on DNN online inference efficiency. It then shows how data preprocessing bottlenecks can be identified and mitigated in DNN online inference services, and evaluates whether scalable, parallel data preprocessing enables high-performance inference with low latency, high throughput, and improved GPU utilization. For efficient GPU utilization, the dissertation introduces nvidia-mig-py, a Multi-Instance GPU (MIG) resource management tool that enables a single GPU to be shared among multiple inference workloads. Additionally, backend optimization through model quantization with the NVIDIA Triton Inference Server is explored to improve execution efficiency and resource utilization.</p><p dir="ltr">Overall, this dissertation provides practical engineering solutions and insights for improving the scalability and efficiency of DNN online inference services, while identifying limitations and suggesting directions for future work. Experimental results demonstrate that the proposed approaches deliver a high-throughput, low-latency inference system with efficient GPU utilization.</p>
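The preprocessing bottleneck described above can be illustrated with a minimal sketch: input decoding and normalization run in a pool of worker processes in parallel, so the inference stage receives a fully prepared batch rather than blocking on per-request preprocessing. All function and variable names here are illustrative stand-ins, not the dissertation's actual pipeline, and the inference call is a placeholder for a real backend such as a Triton client.

```python
# Illustrative sketch: parallel preprocessing feeding a batched inference
# stage. Real systems replace these stand-ins with image decoding and a
# GPU-backed model server.
from multiprocessing import Pool

def preprocess(sample):
    # Stand-in for decode + resize + normalize of one input.
    return [x / 255.0 for x in sample]

def run_inference(batch):
    # Stand-in for a GPU inference call (e.g., via a Triton client).
    return [sum(item) for item in batch]

if __name__ == "__main__":
    raw = [[0, 128, 255]] * 8          # eight tiny placeholder "images"
    with Pool(processes=4) as pool:    # parallel preprocessing stage
        batch = pool.map(preprocess, raw)
    print(run_inference(batch))        # one prepared batch to the model
```

The key design point is that preprocessing throughput scales with the number of worker processes independently of the GPU, which keeps the accelerator busy instead of idle while waiting on CPU-side data preparation.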

Funding

Elements: Data: Integrating Human and Machine for Post-Disaster Visual Data Analytics: A Modern Media-Oriented Approach

Directorate for Computer & Information Science & Engineering


RESILIENT EXTRATERRESTRIAL HABITATS

National Aeronautics and Space Administration



Degree Type

  • Doctor of Philosophy

Department

  • Computer and Information Technology

Campus location

  • West Lafayette

Advisor/Supervisor/Committee Chair

Thomas J. Hacker

Additional Committee Member 2

John A. Springer

Additional Committee Member 3

Jin Wei-Kocsis

Additional Committee Member 4

Deepak Nadig Anantha