Semantic segmentation, one of the three basic tasks of computer vision, has a wide range of applications. Early research relied mainly on convolutional networks to capture the contextual information and abstract semantic features of an image and generate dense predictions for segmentation. However, because the receptive field of a convolution is limited, many convolutional layers must be stacked to obtain good results; network structures that rely on convolution to obtain image context therefore have large parameter counts and are difficult to train. In recent years, the Transformer has achieved good results in computer vision. Its structure gives the model a global receptive field over the image in the early stages of processing, without stacking many layers. The Transformer therefore performs well in image semantic segmentation, but its segmentation accuracy is still not ideal: a deep network is needed to perform well, and there remains considerable room for improvement.

Aiming to improve the accuracy of neural networks in image semantic segmentation while reducing model parameters, this paper proposes an improvement based on the SegFormer network. The specific research contents are as follows:

(1) Through a study of the DenseNet and SegFormer models, this paper integrates a densely connected architecture into the encoder stage of SegFormer, so that multi-scale feature maps compensate for the loss of location information in deep layers and the lack of abstract semantic information in shallow layers. This architecture reduces the amount of semantic information the encoder loses from the original image during repeated downsampling.

(2) To address the large number of parameters otherwise required for good performance, this paper reduces the number of output channels of each encoder layer on top of the dense connections.

(3) To address insufficient restoration of semantic information, multiple layers of transposed convolution are added to the channel fusion module. The transposed convolutions upsample the feature maps, helping the network fuse multi-scale feature maps while restoring the semantic information of the image.

The experiments are trained and tested on the ADE20K semantic segmentation dataset. The results show that the proposed Dense-SegFormer, which integrates dense connections and transposed convolutions, achieves a higher mean intersection over union (mIoU) and faster inference than the original SegFormer while reducing the number of network model parameters.
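The interaction between dense connectivity and per-layer channel reduction in contributions (1) and (2) can be illustrated by tracking channel counts through a DenseNet-style block. The sketch below is a toy channel-bookkeeping model under assumed channel sizes, not the actual Dense-SegFormer encoder (which uses Transformer attention blocks): each layer consumes the concatenation of the block input and all previous layer outputs, and appends a fixed number of new channels (the growth rate), so shrinking the per-layer output width directly shrinks every downstream layer's input.

```python
def dense_block_channels(in_channels, growth_rate, num_layers):
    """Track channel counts through a DenseNet-style dense block.

    Layer i sees the concatenation of the block input and all i
    previous layer outputs, then contributes `growth_rate` new
    channels. Returns (input width seen by each layer, final width).
    """
    per_layer_inputs = []
    channels = in_channels
    for _ in range(num_layers):
        per_layer_inputs.append(channels)
        channels += growth_rate  # each layer appends growth_rate channels
    return per_layer_inputs, channels

# Hypothetical example: a 4-layer block with 64 input channels.
# With growth rate 32, layer inputs grow 64 -> 96 -> 128 -> 160,
# ending at 192 channels; halving the growth rate to 16 would end
# at only 128, which is the parameter-saving effect exploited here.
inputs, out_width = dense_block_channels(64, 32, 4)
```

Because every later layer re-reads earlier feature maps, dense connectivity lets the block stay expressive even when each individual layer emits fewer channels, which is why reducing per-layer output width is viable on top of dense connections.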
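The upsampling role of the transposed convolutions in contribution (3) can be shown with a minimal 1-D sketch. This is a hypothetical pure-Python illustration, not the model's actual (learned, 2-D) layers: each input element scatters a scaled copy of the kernel into the output at `stride`-spaced positions, so the output is longer than the input, which is how transposed convolution restores spatial resolution during feature fusion.

```python
def transposed_conv1d(x, kernel, stride=2):
    """Toy 1-D transposed convolution (no padding).

    Each input element x[i] adds x[i] * kernel to the output starting
    at position i * stride; overlaps are summed. Output length is
    (len(x) - 1) * stride + len(kernel), i.e. an upsampled signal.
    """
    out_len = (len(x) - 1) * stride + len(kernel)
    y = [0.0] * out_len
    for i, v in enumerate(x):
        for j, k in enumerate(kernel):
            y[i * stride + j] += v * k
    return y

# A 2-tap kernel with stride 2 doubles the length of the signal:
# [1, 2, 3] -> [1, 1, 2, 2, 3, 3]
upsampled = transposed_conv1d([1.0, 2.0, 3.0], [1.0, 1.0], stride=2)
```

In the real network the kernel weights are learned, so the layer both enlarges the feature map and reconstructs semantic detail, rather than simply repeating values as this fixed kernel does.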