TOWARDS MOTION-RELATED VISUAL LEARNING
Despite recent advances in deep learning, computer vision systems still face significant challenges in understanding and processing video content. While specialized solutions exist for individual tasks such as optical flow estimation and video depth perception, current approaches are limited by task-specific constraints, insufficient temporal modeling, and heavy reliance on densely annotated data. These limitations contrast sharply with human visual perception, which processes motion holistically and functions effectively even with incomplete information.
In this dissertation, we address the research question: how can we develop more efficient and generalizable visual understanding systems that better align with human-like visual perception? Our investigation yields three major contributions. First, we present an optical flow estimation framework that effectively models long-range dependencies in video sequences. This approach addresses the challenges of occlusions and geometric variations by leveraging information from multiple frames through transformer-based spatial and temporal attention mechanisms. Second, we introduce a unified prototypical transformer framework that bridges the gap between task-specific solutions and holistic motion understanding, achieving superior results in both optical flow and depth estimation through feature denoising and prototypical learning mechanisms. Third, we explore a label-efficient learning framework that generates high-quality simulated masks from sparsely annotated frames using optical flow estimation and differentiable warping, significantly reducing the annotation burden for video object segmentation. A conceptual sketch of this flow-guided warping idea is given below.
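To make the third contribution concrete, the sketch below shows how a mask annotated on one frame can be propagated to a neighboring frame by backward-warping it with an estimated optical flow field. This is a minimal PyTorch sketch of flow-guided differentiable warping in general, not the dissertation's actual implementation; the function name, tensor layout, and the choice of bilinear grid sampling are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def warp_mask_with_flow(mask, flow):
    """Backward-warp a segmentation mask from an annotated source frame to a
    target frame using optical flow.

    mask: (B, 1, H, W) soft mask of the annotated source frame.
    flow: (B, 2, H, W) displacement, for each target pixel, to its
          corresponding source pixel.
    Returns a (B, 1, H, W) simulated (pseudo-label) mask for the target frame.
    """
    b, _, h, w = mask.shape
    # Base sampling grid of pixel coordinates (x, y) for the target frame.
    ys, xs = torch.meshgrid(
        torch.arange(h, dtype=mask.dtype, device=mask.device),
        torch.arange(w, dtype=mask.dtype, device=mask.device),
        indexing="ij",
    )
    grid = torch.stack((xs, ys), dim=0).unsqueeze(0).expand(b, -1, -1, -1)
    # Shift the grid by the flow, then normalize coordinates to [-1, 1]
    # as required by grid_sample.
    coords = grid + flow
    coords_x = 2.0 * coords[:, 0] / (w - 1) - 1.0
    coords_y = 2.0 * coords[:, 1] / (h - 1) - 1.0
    sample_grid = torch.stack((coords_x, coords_y), dim=-1)  # (B, H, W, 2)
    # Differentiable bilinear sampling produces the warped mask.
    return F.grid_sample(mask, sample_grid, mode="bilinear",
                         padding_mode="zeros", align_corners=True)
```

Under these assumptions, the warped output can serve as a pseudo-label for an unannotated frame, and because every operation is differentiable the warping step can sit inside an end-to-end training loop.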
Degree Type
- Doctor of Philosophy
Department
- Computer Graphics Technology
Campus location
- West Lafayette