<p dir="ltr">Deep learning (DL) has emerged as a transformative technology across many fields. However, deploying high-performance deep neural network (DNN) online inference services at scale, with low service response time and high throughput, remains difficult due to the high computational demands of DNNs, strict service latency constraints, and complex GPU resource management. Although GPUs are commonly used to accelerate DNN inference, utilizing GPU resources efficiently remains a significant challenge. Addressing these challenges is critical for deploying scalable, low-latency, high-throughput DNN online inference systems that support real-time AI applications such as autonomous driving, fraud detection, and conversational AI. Efficient GPU utilization also reduces operational costs and supports reliable large-scale deployment.</p><p dir="ltr">This dissertation investigates how DNN online inference services can be designed and implemented to achieve low service response time, high throughput, and efficient GPU utilization. First, it analyzes the impact of data preprocessing on DNN online inference efficiency. It then shows how data preprocessing bottlenecks can be identified and mitigated in DNN online inference services, and evaluates whether scalable, parallel data preprocessing enables high-performance inference with low latency, high throughput, and improved GPU utilization. For efficient GPU utilization, this dissertation introduces nvidia-mig-py, a Multi-Instance GPU (MIG) resource management tool that enables a single GPU to be shared among multiple inference workloads.
Additionally, backend optimization through model quantization with the NVIDIA Triton Inference Server is explored to improve execution efficiency and resource utilization.</p><p dir="ltr">Overall, this dissertation offers practical engineering solutions and insights for improving the scalability and efficiency of DNN online inference services, while identifying limitations and suggesting directions for future work. Experimental results demonstrate that the proposed approaches deliver a high-throughput, low-latency inference system with efficient GPU utilization.</p>
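<p dir="ltr">The scalable, parallel data-preprocessing idea summarized above can be illustrated with a minimal sketch using Python's standard library. Everything here is an illustrative assumption rather than the dissertation's implementation: the <code>preprocess</code> function stands in for whatever decode/resize/normalize work precedes inference, and the worker count is a placeholder to tune per host.</p>

```python
# Minimal sketch: run CPU-side preprocessing for many requests concurrently
# so the GPU is not left idle waiting on a single preprocessing thread.
# All names (preprocess, NUM_WORKERS, preprocess_batch) are illustrative
# assumptions, not the dissertation's actual code.
from concurrent.futures import ThreadPoolExecutor

NUM_WORKERS = 4  # assumed; tune to the host's CPU cores and request rate


def preprocess(raw: bytes) -> list[float]:
    """Stand-in for the decode/resize/normalize step before inference."""
    return [b / 255.0 for b in raw]


def preprocess_batch(requests: list[bytes]) -> list[list[float]]:
    # Fan the requests out across a worker pool; pool.map preserves
    # input order, so results line up with the incoming requests.
    with ThreadPoolExecutor(max_workers=NUM_WORKERS) as pool:
        return list(pool.map(preprocess, requests))


if __name__ == "__main__":
    batch = [bytes([0, 128, 255]), bytes([64, 64])]
    print(preprocess_batch(batch))
```

<p dir="ltr">In a real service the same pattern would typically feed preprocessed batches into an inference queue; a process pool (or a dedicated preprocessing tier) may be preferable when the preprocessing work is CPU-bound rather than I/O-bound.</p>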
Funding
Elements: Data: Integrating Human and Machine for Post-Disaster Visual Data Analytics: A Modern Media-Oriented Approach
Directorate for Computer & Information Science & Engineering