
Pixel-wise Scene Understanding Based On Fully Convolutional Networks

Posted on: 2021-08-10
Degree: Master
Type: Thesis
Country: China
Candidate: Z Z Jiang
Full Text: PDF
GTID: 2518306104987159
Subject: Control Science and Engineering
Abstract/Summary:
Scene understanding is an important research topic in computer vision, with wide applications in autonomous navigation, autonomous driving, unmanned aerial vehicles, and assistive systems for the blind. Depth information and semantic information are key to understanding a scene: from a single RGB image, pixel-wise depth and semantic labels can be obtained by monocular depth estimation and semantic segmentation, respectively. In recent years, deep-learning-based monocular depth estimation and semantic segmentation algorithms have made great progress, but they still face many challenges due to the complexity and diversity of scenes. In view of these problems, this thesis carries out the following research work.

This thesis designs an end-to-end pixel-wise scene understanding framework based on fully convolutional networks, which can be applied to either monocular depth estimation or semantic segmentation. The framework adopts an encoder-decoder structure: it takes ResNet as the encoder for feature extraction and uses dilated convolution to enlarge the receptive field. In the decoder, bilinear interpolation performs step-by-step upsampling, and feature maps of the same size in the encoder and decoder are concatenated via skip connections.

For monocular depth estimation, this thesis proposes a joint loss function with three terms, combining errors in depth, gradient, and surface normal. Because the three terms are complementary, the predicted depth map is more accurate and better preserves the geometric details of objects and the spatial structure of the scene. For semantic segmentation, this thesis proposes an edge-aware loss that uses edge information as an explicit constraint; this reduces the blurring of boundaries between different categories and improves segmentation accuracy.

For pixel-wise dense prediction tasks, both low-level and high-level features are important, yet the
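The abstract names the three loss terms but not their exact formulation. As an illustration only, the NumPy sketch below shows one plausible combination of depth, gradient, and surface-normal errors; the term weights, the L1 error form, and the finite-difference normal estimate are assumptions, not the thesis's stated design.

```python
import numpy as np

def joint_depth_loss(pred, gt, w_grad=1.0, w_normal=1.0):
    """Sketch of a three-term depth loss: depth + gradient + surface normal.

    pred, gt: 2-D depth maps of the same shape.
    w_grad, w_normal: assumed balancing weights.
    """
    # Depth term: mean absolute error between predicted and ground-truth depth.
    l_depth = np.mean(np.abs(pred - gt))

    # Gradient term: finite-difference image gradients should also match,
    # which penalizes blurred or displaced depth edges.
    gy_p, gx_p = np.gradient(pred)
    gy_g, gx_g = np.gradient(gt)
    l_grad = np.mean(np.abs(gx_p - gx_g) + np.abs(gy_p - gy_g))

    # Surface-normal term: normals derived from the gradients,
    # n = (-dz/dx, -dz/dy, 1) normalized, compared by cosine similarity.
    def normals(gx, gy):
        n = np.stack([-gx, -gy, np.ones_like(gx)], axis=-1)
        return n / np.linalg.norm(n, axis=-1, keepdims=True)

    n_p = normals(gx_p, gy_p)
    n_g = normals(gx_g, gy_g)
    l_normal = np.mean(1.0 - np.sum(n_p * n_g, axis=-1))

    return l_depth + w_grad * l_grad + w_normal * l_normal
```

A constant offset between `pred` and `gt` is penalized only by the depth term, while the gradient and normal terms focus on local geometry, which is the complementarity the abstract points to.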
feature fusion achieved by plain skip connections is not ideal. We propose an Adaptive Multi-scale Feature Fusion (AMFF) module, which can replace the skip connection and be applied flexibly in fully convolutional networks with an encoder-decoder structure. Experimental results show that AMFF fuses low-level high-resolution features with high-level low-resolution features more effectively and improves the performance of the algorithm.

Depth and semantic information in a scene are correlated, and a multi-task model can obtain both kinds of information simultaneously, reducing computation and improving efficiency. This thesis therefore proposes a multi-task scene understanding network that performs monocular depth estimation and semantic segmentation at the same time. The network uses a single encoder for feature extraction, shared by the two visual tasks. For decoding, this thesis designs two schemes, independent decoding and interactive decoding; the latter adds a Lateral Interaction Unit (LIU) on top of the former, responsible for the exchange between depth cues and semantic information. Experimental results show that interactive decoding outperforms independent decoding, indicating that depth and semantic information are indeed correlated and can be learned jointly to guide and benefit each other.
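The abstract does not describe AMFF's internal mechanism. Purely as an illustration of the idea of replacing a fixed skip connection with an input-dependent fusion, the sketch below blends a high-resolution low-level map with an upsampled low-resolution high-level map using softmax gates computed from global average pooling; the gating scheme, the nearest-neighbour upsampling, and the function name `amff_fuse` are all hypothetical, not the thesis's design.

```python
import numpy as np

def amff_fuse(low_feat, high_feat):
    """Hypothetical adaptive fusion of two feature maps.

    low_feat:  low-level, high-resolution features, shape (C, H, W).
    high_feat: high-level, low-resolution features, shape (C, H/s, W/s).
    Real AMFF presumably learns its fusion weights end to end.
    """
    c, h, w = low_feat.shape
    ch, hh, wh = high_feat.shape
    assert c == ch and h % hh == 0 and w % wh == 0

    # Nearest-neighbour upsampling so both branches share a spatial size.
    up = high_feat.repeat(h // hh, axis=1).repeat(w // wh, axis=2)

    # Input-dependent scalar gates: global average pooling + softmax,
    # so the mix between branches varies with the input ("adaptive").
    scores = np.array([low_feat.mean(), up.mean()])
    gates = np.exp(scores - scores.max())
    gates /= gates.sum()

    # Gated sum replaces the plain concatenation of a skip connection.
    return gates[0] * low_feat + gates[1] * up
```

Because the gates sum to one, the output stays a convex combination of the two branches; a learned version would replace the pooling-based scores with a small trained subnetwork.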
Keywords/Search Tags: Scene understanding, Monocular depth estimation, Semantic segmentation, Fully convolutional network, Multi-task learning