Image semantic segmentation is a research topic in the field of computer vision. It assigns a predefined category to each pixel of a given image, enabling finer-grained scene analysis than image classification or object detection. As a result, image semantic segmentation has been widely applied in autonomous driving, augmented reality, and medical image processing. In recent years, following the proposal of the Fully Convolutional Network (FCN), research on semantic segmentation based on deep convolutional neural networks has made great progress. However, improvements in segmentation accuracy are often accompanied by increases in network size, computational complexity, and inference latency, which limits the application of semantic segmentation models in practical engineering.

This paper studies the semantic segmentation of images in vehicle-mounted scenes. The main challenges of semantic segmentation in such scenes are as follows. First, objects in vehicle-mounted scenes exhibit large scale diversity, while conventional standard convolution has a fixed receptive field and therefore cannot handle the multi-scale object problem well. Second, it is difficult to achieve high accuracy and high efficiency (in terms of parameter count, FLOPs, etc.) at the same time. Third, the data collected in vehicle-mounted scenes are video data, and single-frame segmentation methods ignore temporal contextual information. This paper carries out research on these issues; the specific research contents are as follows:

1) A cascaded dilated convolution module is proposed to extract multi-scale features from each layer of the backbone network, addressing the multi-scale object problem in vehicle-mounted scenes. Meanwhile, to gather more contextual information at the high layers of the backbone network, a context aggregation module is constructed from the cascaded dilated convolution module and a channel attention mechanism. A spatial detail module is also proposed, which transmits features rich in detail information from the lower layers of the network to the higher layers and fuses them with semantic information to make the segmentation results more refined. Based on these modules, this paper constructs a multi-layer, multi-scale feature aggregation network that can run in real time.

2) To balance model accuracy and efficiency, depthwise convolution and convolution factorization are adopted as design strategies to reduce the number of parameters and the computational cost. Specifically, this paper lightens the convolutions in each module of the multi-layer, multi-scale feature aggregation network. Through extensive experiments, the influence of these lightweight techniques on model accuracy and efficiency is analyzed, providing a reference for choosing an appropriate network model under different accuracy and speed requirements.

3) Based on the TDNet video semantic segmentation framework, this paper exploits the temporal contextual information between video frames to improve accuracy and reduce computation. An analysis of TDNet's structure reveals that its backbone network is relatively complex and its frame rate is low. The cascaded dilated convolution module is therefore used to replace the basic residual blocks in the high layers of the backbone network, and the module's feature channel capacity is reduced, cutting the computation by at least 86%. In addition, this paper designs an inter-frame feature extraction algorithm which, combined with the previous improvement, increases the speed by at least 5.7 times.
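The multi-scale and lightweighting ideas behind contributions 1) and 2) can be illustrated with a rough back-of-the-envelope sketch. The dilation rates, kernel size, and channel count below are illustrative assumptions, not the thesis's actual configuration: cascading 3x3 convolutions with increasing dilation rapidly enlarges the receptive field, while replacing a standard convolution with a depthwise-separable one sharply reduces its parameter count.

```python
def receptive_field(dilations, kernel=3):
    """Receptive field of a cascade of KxK convolutions with the given dilations.
    Each dilated conv in the cascade enlarges the receptive field by (kernel - 1) * d."""
    rf = 1
    for d in dilations:
        rf += (kernel - 1) * d
    return rf

def conv_params(c_in, c_out, kernel=3, depthwise=False):
    """Weight count of one convolution; the depthwise-separable variant is
    a depthwise KxK (one filter per input channel) followed by a 1x1 pointwise conv."""
    if depthwise:
        return c_in * kernel * kernel + c_in * c_out
    return c_in * c_out * kernel * kernel

# A single 3x3 conv vs. a cascade with dilations 1, 2, 4 (assumed rates):
print(receptive_field([1]))        # → 3
print(receptive_field([1, 2, 4]))  # → 15, covering much larger objects

# Lightweighting a 64-channel 3x3 conv (assumed width):
std = conv_params(64, 64)
dws = conv_params(64, 64, depthwise=True)
print(std, dws, round(std / dws, 1))  # → 36864 4672 7.9
```

The same arithmetic explains why the thesis can trade a modest accuracy change for a large efficiency gain: the parameter reduction from depthwise-separable factorization grows with the channel width, and stacking a few dilated convolutions reaches a receptive field that would otherwise require a much deeper (or wider-kernel) network.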