Font Size: a A A

Spatiotemporal Deep Neural Network For Video Salient Object Detection

Posted on:2021-01-07Degree:DoctorType:Dissertation
Country:ChinaCandidate:Y TangFull Text:PDF
GTID:1488306110487394Subject:Information and Communication Engineering
Abstract/Summary:PDF Full Text Request
Salient object detection aims to pop out the salient regions of the natural scene that are most attractive to human's attention.Compared with high-level visual tasks,such as object detection,tracking,and image retrieval,salient object detection belongs to the image pre-processing method,and its results can be used to improve the accuracy and accelerate the speed of visual tasks.Salient object detection can be divided into image salient object detection and video salient object detection based on the types of input data.Compared with the image salient object detection,video salient object detection needs to consider both spatial features in an image and temporal features in video sequences,thus is more complicated and challenging.Our research focuses on video salient object detection with deep learning.Deep learning-based methods have recently become a hot research topic for various visual tasks.However,there are still several difficulties in video salient object detection with deep learning models: 1)The shortage of training data.Deep learning is a data-driven model,which is trained by a large number of labeled data.In the community of video salient object detection,the training data needs to be labeled at the pixel-wise level,and the labeling data should be consecutive.However,labeling of this kind of data is difficult.2)The shortage of robust sequential features.Video salient object detection not only needs to contain static information but also to consider the more complicated sequential information.However,the traditional deep learning models are difficult to extract robust sequential information.3)The exploration between the spatial and temporal cues is insufficient.The current methods use directly the weighted addition to fuse spatial and temporal cues,which cannot fully explore the contextual relationship between the two kinds of information.This thesis aims to propose new deep learning models to solve the above difficulties,thus improving the accuracy of video salient object detection.The proposed methods can be summarized as follows:(1)A weakly supervised salient object detection model with spatiotemporal cascade neural networks is proposed to mitigate the effect of scarce training data.The proposed weakly supervised learning mechanism produces a large number of pixel-wise weak labels in the training processing,which are used for efficient network training.A spatiotemporal cascade neural network is then proposed to generate a spatial prior in the first sub-network and combine the prior with the optical flow-based temporal prior to guide the second sub-network to learn the spatiotemporal features,thus generating the saliency maps with higher accuracy.(2)A two streams based network for video salient object detection is proposed to respectively extract static features and dynamic features in parallel.These features are then refined by the sequential modules.At the top of the network,we propose an attentive module to generate robust fusing spatiotemporal features by automatically learning the relationship between spatial and temporal features.(3)A novel inter-frame spatiotemporal feature enhancement module for video salient object detection is proposed to deal with the missing salient objects in consecutive frames.The module uses the recursive feature aggregation and historical saliency supervision to generate more robust spatiotemporal features.To further explore the contextual dependencies between the spatial and temporal features,a mutual attention block is proposed to learn the weights of spatial and temporal features and then implement the fusion of the features.This approach further improves the performance and efficiency of video saliency detection.(4)A fast lightweight video salient object detection model is proposed with the usage of spatiotemporal knowledge distillation.On the one hand,the multi-scale feature encoding and the spatial distillation are adopted to refine the spatial features.On the other hand,the interframe feature embedding and the temporal distillation are applied in the network to learn the robust sequential features.Without the use of the sequential structures,such as optical flow modules and RNN-based modules,the proposed lightweight network can dramatically improve efficiency while retaining satisfactory detection performance.
Keywords/Search Tags:Salient object detection, Deep models, Weakly supervised learning, Spatiotemporal feature fusion, Spatiotemporal feature enhancement, Spatiotemporal knowledge distillation
PDF Full Text Request
Related items