Sequential Semantic Segmentation of Streaming Scenes for Autonomous Driving
In traffic scene perception for autonomous vehicles, driving videos are available from in-car sensors such as camera and LiDAR for road detection and collision avoidance. There are some existing challenges in computer vision tasks for video processing, including object detection and tracking, semantic segmentation, etc. First, due to that consecutive video frames have a large data redundancy, traditional spatial-to-temporal approach inherently demands huge computational resource. Second, in many real-time scenarios, targets move continuously in the view as data streamed in. To achieve prompt response with minimum latency, an online model to process the streaming data in shift-mode is necessary. Third, in addition to shape-based recognition in spatial space, motion detection also replies on the inherent temporal continuity in videos. While current works either lack long-term memory for reference or consume a huge amount of computation.
The purpose of this work is to achieve strongly temporal-associated sensing results in real-time with minimum memory, which is continually embedded to a pragmatic framework for speed and path planning. It takes a temporal-to-spatial approach to cope with fast moving vehicles in autonomous navigation. It utilizes compact road profiles (RP) and motion profiles (MP) to identify path regions and dynamic objects, which drastically reduces video data to a lower dimension and increases sensing rate. Specifically, we sample one-pixel line at each video frame, the temporal congregation of lines from consecutive frames forms a road profile image; while motion profile consists of the average lines by sampling one-belt pixels at each frame. By applying the dense temporal resolution to compensate the sparse spatial resolution, this method reduces 3D streaming data into 2D image layout. Based on RP and MP under various weather conditions, there have three main tasks being conducted to contribute the knowledge domain in perception and planning for autonomous driving.
The first application is semantic segmentation of temporal-to-spatial streaming scenes, including recognition of road and roadside, driving events, objects in static or motion. Since the main vision sensing tasks for autonomous driving are identifying road area to follow and locating traffic to avoid collision, this work tackles this problem by using semantic segmentation upon road and motion profiles. Though one-pixel line may not contain sufficient spatial information of road and objects, the consecutive collection of lines as a temporal-spatial image provides intrinsic spatial layout because of the continuous observation and smooth vehicle motion. Moreover, by capturing the trajectory of pedestrians upon their moving legs in motion profile, we can robustly distinguish pedestrian in motion against smooth background. The experimental results of streaming data collected from various sensors including camera and LiDAR demonstrate that, in the reduced temporal-to-spatial space, an effective recognition of driving scene can be learned through Semantic Segmentation.
The second contribution of this work is that it accommodates standard semantic segmentation to sequential semantic segmentation network (SE3), which is implemented as a new benchmark for image and video segmentation. As most state-of-the-art methods are greedy for accuracy by designing complex structures at expense of memory use, which makes trained models heavily depend on GPUs and thus not applicable to real-time inference. Without accuracy loss, this work enables image segmentation at the minimum memory. Specifically, instead of predicting for image patch, SE3 generates output along with line scanning. By pinpointing the memory associated with the input line at each neural layer in the network, it preserves the same receptive field as patch size but saved the computation in the overlapped regions during network shifting. Generally, SE3 applies to most of the current backbone models in image segmentation, and furthers the inference by fusing temporal information without increasing computation complexity for video semantic segmentation. Thus, it achieves 3D association over long-range while under the computation of 2D setting. This will facilitate inference of semantic segmentation on light-weighted devices.
The third application is speed and path planning based on the sensing results from naturalistic driving videos. To avoid collision in a close range and navigate a vehicle in middle and far ranges, several RP/MPs are scanned continuously from different depths for vehicle path planning. The semantic segmentation of RP/MP is further extended to multi-depths for path and speed planning according to the sensed headway and lane position. We conduct experiments on profiles of different sensing depths and build up a smoothly planning framework according to their them. We also build an initial dataset of road and motion profiles with semantic labels from long HD driving videos. The dataset is published as additional contribution to the future work in computer vision and autonomous driving.
- Doctor of Philosophy
- Computer Science