Purdue University Graduate School

Building A More Efficient Mobile Vision System Through Adaptive Video Analytics

thesis
posted on 2024-12-17, 21:20 authored by Junpeng Guo

Mobile vision is becoming the norm, transforming our daily lives. It powers numerous applications that enable seamless interaction between the digital and physical worlds, such as augmented reality and real-time object detection. The popularity of mobile vision has spurred advancements from both the computer vision (CV) and mobile edge computing (MEC) communities. The former focuses on improving analytics accuracy through well-suited deep neural networks (DNNs), while the latter addresses the resource limitations of mobile environments by coordinating tasks between mobile and edge devices, determining which data to transmit and process to achieve real-time performance.

Despite recent advancements, existing approaches typically integrate the functionalities of the two camps at a basic task level. They rely on a uniform on-device processing scheme that streams the same type of data and uses the same DNN model for identical CV tasks, regardless of the analytical complexity of the current input, input size, or latency requirements. This lack of adaptability to dynamic contexts limits their ability to achieve optimal efficiency in scenarios involving diverse source data, varying computational resources, and differing application requirements.

Our approach seeks to move beyond task-level adaptation by emphasizing customized optimizations tailored to dynamic use scenarios. This involves three key adaptive strategies: dynamically compressing source data based on contextual information, selecting the appropriate computing model (e.g., DNN or sub-DNN) for the vision task, and establishing a feedback mechanism for context-aware runtime tuning. Additionally, for scenarios involving movable cameras, the feedback mechanism guides the data capture process to further enhance performance. These innovations are explored across three use cases categorized by the capture setup: a stationary camera, a moving camera, and cross-camera analytics.

My dissertation begins with a stationary camera scenario, where we improve efficiency by adapting to the use context on both the device and edge sides. On the device side, we explore a broader compression space and implement adaptive compression based on data context. Specifically, we leverage changes in confidence scores as feedback to guide on-device compression, progressively reducing data volume while preserving the accuracy of visual analytics. On the edge side, instead of training a specialized DNN for each deployment scenario, we adaptively select the best-fit sub-network for the given context. A shallow sub-network is used to “test the waters”, accelerating the search for a deep sub-network that maximizes analytical accuracy while meeting latency requirements.
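To make the device-side feedback loop concrete, the sketch below shows how confidence feedback can steer compression: quality is lowered step by step until the accuracy proxy degrades, at which point the previous setting is kept. It is a minimal illustration of the idea, not the system's actual pipeline; the helper names (encode_frame, run_detector), the fake confidence model, and the thresholds are hypothetical stand-ins for the on-device encoder and the edge-side detector.

    # Minimal sketch of confidence-guided adaptive compression (illustrative only).
    # encode_frame and run_detector are hypothetical placeholders for the
    # on-device encoder and the edge-side detector described above.

    def encode_frame(frame, quality):
        """Placeholder: compress `frame` at a JPEG-style quality (0-100)."""
        return {"data": frame, "quality": quality}

    def run_detector(encoded):
        """Placeholder: return a mean detection confidence for the encoded frame."""
        # A real system would run a DNN here; this fake score stays high until
        # quality gets low, so the control loop below is executable end to end.
        return 0.9 if encoded["quality"] >= 40 else 0.6

    def pick_quality(frame, q=90, q_min=20, step=10, tolerance=0.03):
        """Lower encoding quality step by step until confidence feedback degrades."""
        baseline = run_detector(encode_frame(frame, q))
        while q - step >= q_min:
            candidate = run_detector(encode_frame(frame, q - step))
            if baseline - candidate > tolerance:  # accuracy proxy fell too far
                break                             # keep the previous quality
            q, baseline = q - step, candidate
        return q

    print(pick_quality(frame="dummy_frame"))  # -> 40 with the fake detector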

Next, we explore scenarios involving a moving camera, such as those mounted on drones. These introduce new challenges, including increased data encoding demands due to camera movement and degraded analytics performance (e.g., tracking) caused by changing perspectives. To address these issues, we leverage drone-specific domain knowledge to optimize compression for object detection by applying global motion compensation and assigning different resolutions at a tile-granularity level based on the far-near effect. Furthermore, we tackle the more complex task of object tracking and following, where the analytics results directly influence the drone’s navigation. To enable effective target following with minimal processing overhead, we design an adaptive frame rate tracking mechanism that dynamically adjusts based on changing contexts.
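As a rough illustration of the adaptive frame-rate idea, the sketch below adjusts how often tracking is run based on how far the target moved since the last update. The pixel thresholds and the doubling/halving policy are illustrative assumptions, not the controller used in the dissertation.

    # Minimal sketch of adaptive frame-rate tracking (illustrative policy only).
    # When the target barely moves between processed frames, widen the skip
    # interval; when it moves quickly, shrink it so the follower reacts in time.

    def next_skip(displacement_px, skip, min_skip=1, max_skip=8,
                  low_motion=5.0, high_motion=30.0):
        """Return how many frames to skip before the next tracking update."""
        if displacement_px < low_motion:      # target nearly static: relax
            return min(skip * 2, max_skip)
        if displacement_px > high_motion:     # target moving fast: tighten
            return max(skip // 2, min_skip)
        return skip                           # moderate motion: keep the rate

    # Example: per-update target displacement (pixels) observed by the tracker.
    skip = 4
    for d in [3.0, 2.0, 40.0, 35.0, 10.0]:
        skip = next_skip(d, skip)
        print(f"displacement={d:5.1f}px -> run tracker every {skip} frame(s)")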

Last but not least, we extend the work to cross-camera analytics, focusing on coordination between one stationary ground-based camera and one moving aerial camera. The primary challenge lies in addressing significant misalignments (e.g., scale, rotation, and lighting variations) between the two perspectives. To overcome these issues, we propose a multi-exit matching mechanism that prioritizes local feature matching while incorporating global features and additional cues, such as color and location, to refine matches as needed. This approach ensures accurate identification of the same target across viewpoints while minimizing computational overhead by dynamically adapting to the complexity of the matching task.
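The multi-exit idea can be sketched as a short cascade: cheap local-feature matching runs first, and costlier global features and extra cues are consulted only when the earlier stage is inconclusive. The stage names, toy scorers, and thresholds below are hypothetical and serve only to show the control flow.

    # Minimal sketch of a multi-exit matching cascade (hypothetical thresholds).
    def multi_exit_match(query, candidate, stages, accept=0.8, reject=0.2):
        """Run matching stages in order, exiting as soon as a score is decisive."""
        score = 0.5
        for name, scorer in stages:
            score = scorer(query, candidate)
            if score >= accept:
                return True, name          # confident match: exit early
            if score <= reject:
                return False, name         # confident non-match: exit early
        return score >= 0.5, "all-stages"  # still ambiguous after every stage

    # Toy scorers standing in for local features, global features, and extra
    # cues such as color and location.
    stages = [
        ("local",  lambda q, c: 0.5),      # inconclusive -> keep going
        ("global", lambda q, c: 0.9),      # decisive -> exits here
        ("cues",   lambda q, c: 0.6),      # never reached in this example
    ]
    print(multi_exit_match("ground_view_crop", "aerial_view_crop", stages))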

While the current work primarily addresses ideal conditions, assuming favorable weather, optimal lighting, and reliable network performance, it establishes a solid foundation for future innovations in adaptive video processing under more challenging conditions. Future efforts will focus on enhancing robustness against adversarial factors, such as sensing data drift and transmission losses. Additionally, we plan to explore multi-camera coordination and multimodal data integration, leveraging the growing potential of large language models to further advance this field.

Funding

National Science Foundation (NSF) CAREER Award 1750953

History

Degree Type

  • Doctor of Philosophy

Department

  • Computer Science

Campus location

  • West Lafayette

Advisor/Supervisor/Committee Chair

Chunyi Peng

Additional Committee Member 2

Bharat Bhargava

Additional Committee Member 3

Bruno Ribeiro

Additional Committee Member 4

Fengqing Maggie Zhu

Additional Committee Member 5

Raymond Yeh
