
Lightweight Networks For Dense Pixel Perception

Posted on: 2022-09-03
Degree: Master
Type: Thesis
Country: China
Candidate: R Q Luo
Full Text: PDF
GTID: 2518306563478144
Subject: Computer Science and Technology
Abstract/Summary:
Scene perception has long been an important research topic in computer vision. As tasks become more fine-grained, image-level information can no longer meet the demand, and scene perception has entered a new stage that requires detailed information about every pixel. This thesis focuses on two important pixel-level tasks: image semantic segmentation and monocular depth estimation. The former answers what each pixel in a given image is, and the latter answers how far each pixel is from the camera. The most accurate algorithms at present are based on deep learning. These methods stack large numbers of convolution operations to improve accuracy, but they also add many parameters and much computation, leading to higher memory consumption and longer inference time. This makes them hard to deploy in real-time applications such as autonomous driving and the Internet of Things, where most systems are not only limited in computing resources but must also meet latency constraints. The key challenge is therefore how to balance the accuracy and the speed of the model. In this thesis, we propose a lightweight network for image semantic segmentation and another for monocular depth estimation, improving both accuracy and speed.

First, we propose a fully convolutional residual network for fast semantic segmentation. The network follows the conventional encoder-decoder structure. To reduce computation, we build a tiny encoder from residual blocks, and we compare three decoder structures and keep the best one. On top of this, we introduce a channel attention mechanism to strengthen the representational ability of the model, and we adopt different feature fusion strategies according to the characteristics of different layers: for high-level features, an atrous spatial pyramid pooling module further extracts semantic information from the feature maps; for low-level features, attention over both channel and space restores clearer object boundaries. Finally, we use an online hard example mining strategy to alleviate class imbalance.

Second, we propose a lightweight network based on asymmetric convolution blocks for monocular depth estimation. It reuses the lightweight encoder above, modified by introducing asymmetric convolution blocks to improve accuracy. During training, the asymmetric convolutions enhance the representational ability of the standard convolution kernels; at test time, the multi-branch computation is folded back into the original structure by fusing the parameters, so the model improves accuracy without increasing inference time. In the decoder, we propose a new upsampling method that effectively extracts features of objects at different scales, restores more spatial detail, and further improves the accuracy of the network.

To verify the effectiveness of the proposed methods, we conduct ablation experiments on the Cityscapes and NYU-Depth V2 datasets, respectively. The results show that the proposed lightweight encoder applies well to dense pixel-wise tasks such as semantic segmentation and monocular depth estimation. Our methods improve segmentation and depth estimation accuracy while keeping the model real-time, and compared with other methods they also achieve competitive results.
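The abstract does not include implementation details, so the following is only a minimal sketch of the kind of channel attention block referred to above: a generic squeeze-and-excitation style module in PyTorch. The module name, the reduction ratio, and the use of 1x1 convolutions are illustrative assumptions, not the thesis's actual design.

```python
import torch
import torch.nn as nn


class ChannelAttention(nn.Module):
    """Squeeze-and-excitation style channel attention: pool the feature map to a
    per-channel descriptor, map it to per-channel weights in (0, 1), and rescale
    the input channels with those weights. Illustrative sketch only."""

    def __init__(self, channels, reduction=4):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)  # squeeze: B x C x H x W -> B x C x 1 x 1
        self.fc = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, kernel_size=1),
            nn.Sigmoid(),  # per-channel weights in (0, 1)
        )

    def forward(self, x):
        return x * self.fc(self.pool(x))  # excite: reweight each channel


# Example: refine a low-level feature map before fusing it with decoder features.
if __name__ == "__main__":
    feat = torch.randn(1, 64, 128, 256)
    print(ChannelAttention(64)(feat).shape)  # torch.Size([1, 64, 128, 256])
```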
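The parameter-fusion step of the asymmetric convolution block can also be made concrete. The sketch below is again a hedged illustration rather than the thesis's implementation: it shows how a training-time block with parallel 3x3, 1x3, and 3x1 branches can be collapsed into a single 3x3 convolution at test time by zero-padding the asymmetric kernels into the centre row and column of a 3x3 kernel and summing kernels and biases. Batch normalization folding, which such blocks usually also require, is omitted for brevity.

```python
import torch
import torch.nn as nn


class ACBlock(nn.Module):
    """Training-time asymmetric convolution block: parallel 3x3, 1x3, and 3x1
    convolutions whose outputs are summed (batch normalization omitted)."""

    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.conv3x3 = nn.Conv2d(in_ch, out_ch, (3, 3), padding=(1, 1))
        self.conv1x3 = nn.Conv2d(in_ch, out_ch, (1, 3), padding=(0, 1))
        self.conv3x1 = nn.Conv2d(in_ch, out_ch, (3, 1), padding=(1, 0))

    def forward(self, x):
        return self.conv3x3(x) + self.conv1x3(x) + self.conv3x1(x)

    def fuse(self):
        """Collapse the three branches into one 3x3 convolution for inference:
        since convolution is linear in its weights, padding the 1x3 / 3x1 kernels
        to 3x3 and summing kernels and biases reproduces the multi-branch output."""
        fused = nn.Conv2d(self.conv3x3.in_channels, self.conv3x3.out_channels,
                          (3, 3), padding=(1, 1))
        with torch.no_grad():
            k = self.conv3x3.weight.clone()
            k[:, :, 1:2, :] += self.conv1x3.weight   # place 1x3 kernel on centre row
            k[:, :, :, 1:2] += self.conv3x1.weight   # place 3x1 kernel on centre column
            fused.weight.copy_(k)
            fused.bias.copy_(self.conv3x3.bias + self.conv1x3.bias + self.conv3x1.bias)
        return fused


# Sanity check: the fused convolution reproduces the multi-branch output.
if __name__ == "__main__":
    block = ACBlock(8, 16).eval()
    x = torch.randn(1, 8, 32, 32)
    print(torch.allclose(block(x), block.fuse()(x), atol=1e-5))  # True
```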
Keywords/Search Tags: Lightweight Network, Real-time Inference, Scene Perception, Semantic Segmentation, Monocular Depth Estimation