Image semantic segmentation has long been a challenging research topic in computing, but convolutional neural networks have gradually approached saturation in this field, and the resulting plateau in semantic segmentation research is an urgent problem to be solved. Researchers have therefore turned their attention to the Transformer, which has been highly successful in natural language processing. Built on a self-attention mechanism that does not depend on local interactions, the Transformer can capture long-range dependencies and supports parallel computation, achieving experimental results comparable to those of convolutional neural networks. However, the loss of local correlation in images and heavy computation have remained the pain points of Transformer network research. In this paper, we build on the Transformer to design a lightweight encoder-decoder image segmentation network that preserves the local correlation of images. The details of the research are as follows:

(1) This paper proposes a Transformer-based image semantic segmentation framework. It combines the segmentation idea of the image pyramid to extract semantic information from different levels of the image and obtain semantic feature maps of different dimensions, adopts the encoder-decoder structure commonly used in computer vision, designs a model structure better suited to the downstream segmentation task, and further adapts the Transformer for image segmentation.

(2) Encoding part. An overlapping image cutting module is proposed to preserve the semantic correlation of adjacent positions in images. Unlike natural language processing, this module retains the local correlation of the images themselves, maintains a high-resolution segmentation feature map, and improves semantic segmentation accuracy. In the intermediate stages of the network, the same overlapping cutting is applied to the input so that high resolution is maintained at every level.

(3) Decoding part. This paper adopts a simpler and more effective decoding structure; unlike previous complex designs, it reduces the computation of the model. The lightweight decoder proposed in this paper fits the Transformer backbone, greatly reducing the number of redundant parameters while substantially improving accuracy.

Based on the above research, and compared with traditional convolutional neural network models, the model in this paper achieves 40.18%, 77.61% and 64.64% accuracy on the ADE20K, Cityscapes and VOC 2012 datasets, respectively. These results further verify the validity of the model and lay a foundation for practical applications.
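To make the two core ideas above more concrete, the following is a minimal PyTorch-style sketch of an overlapping cutting (patch embedding) module and a lightweight multi-scale decoder. It is only an illustration under stated assumptions, not the thesis implementation; the module names, channel widths, strides and class count are hypothetical choices.

# Minimal sketch (not the thesis code): an overlapping patch embedding that
# preserves local correlation between neighbouring patches, and a lightweight
# decoder that fuses multi-scale feature maps. All hyperparameters are
# illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class OverlappingPatchEmbed(nn.Module):
    """Cut the image into overlapping patches with a strided convolution
    whose kernel is larger than its stride, so adjacent patches share
    pixels and the local correlation of the image is preserved."""

    def __init__(self, in_ch=3, embed_dim=64, patch_size=7, stride=4):
        super().__init__()
        self.proj = nn.Conv2d(in_ch, embed_dim, kernel_size=patch_size,
                              stride=stride, padding=patch_size // 2)
        self.norm = nn.LayerNorm(embed_dim)

    def forward(self, x):
        x = self.proj(x)                      # (B, C, H/stride, W/stride)
        b, c, h, w = x.shape
        x = x.flatten(2).transpose(1, 2)      # (B, N, C) token sequence
        return self.norm(x), h, w


class LightweightDecoder(nn.Module):
    """Project each pyramid level to a common dimension, upsample to the
    highest resolution, concatenate, and predict per-pixel classes."""

    def __init__(self, in_dims=(64, 128, 320, 512), dim=256, num_classes=150):
        super().__init__()
        self.proj = nn.ModuleList([nn.Linear(d, dim) for d in in_dims])
        self.fuse = nn.Conv2d(dim * len(in_dims), dim, kernel_size=1)
        self.head = nn.Conv2d(dim, num_classes, kernel_size=1)

    def forward(self, feats):                 # list of (B, C_i, H_i, W_i)
        target = feats[0].shape[2:]           # highest-resolution level
        outs = []
        for f, proj in zip(feats, self.proj):
            b, c, h, w = f.shape
            f = proj(f.flatten(2).transpose(1, 2))       # (B, N, dim)
            f = f.transpose(1, 2).reshape(b, -1, h, w)   # (B, dim, H, W)
            outs.append(F.interpolate(f, size=target, mode="bilinear",
                                      align_corners=False))
        return self.head(self.fuse(torch.cat(outs, dim=1)))

In such a design, the overlapping convolutional cutting keeps neighbouring patches correlated (unlike non-overlapping word-style tokenisation in NLP), while the decoder stays lightweight because it only uses linear projections, bilinear upsampling and 1x1 convolutions rather than a deep, stage-specific decoding network.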