
Research on a Multi-Scale Residual-Cascade Vision Transformer Method for Semantic Segmentation

Posted on: 2024-08-21
Degree: Master
Type: Thesis
Country: China
Candidate: F Lin
Full Text: PDF
GTID: 2558306920954909
Subject: Computer technology
Abstract/Summary:
Image semantic segmentation is an important task in computer vision, aiming to classify every pixel in an image. Deep learning algorithms led by convolutional neural networks (CNNs) have made great progress in vision, but a CNN lacks the ability to capture long-range information. Although attention mechanisms can improve the network's feature-extraction efficiency, the convolution operation itself is only a local operator, and capturing long-range information still depends on repeatedly stacking layers. The contextual information and long-range dependencies in the feature map are therefore often difficult to capture, which makes semantic recovery difficult in the decoding stage and remains a challenge in image processing. In contrast, the Transformer's global information interaction facilitates feature extraction and quickly builds a global receptive field for more accurate scene understanding. This thesis therefore adopts the vision Transformer as its research method, treating image segmentation from a sequence perspective. The main research work is as follows:

A semantic segmentation model, RCPVT, based on a multi-scale residual-cascade vision Transformer, is proposed. The network follows an encoder-decoder architecture. In the encoding stage, the image is divided into equal-size patches by Patch Embedding; the patches are then serialized into token vectors and fed to cascaded Transformer modules to extract context information, while residual connections are added to promote the flow of semantic information. To reduce the model's computational cost, an LSRA module is added to the multi-head self-attention in the encoding stage, cutting the computation spent on token mappings. According to a hyperparameter, LSRA sets a reduction ratio and remaps the dimensions of the Key and Value token vectors for
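The tokenization step described above can be sketched as follows. This is a minimal NumPy illustration of patch embedding, not the thesis implementation: the projection matrix is random here (it would be learned in RCPVT), and the patch size and embedding dimension are arbitrary choices for the example.

```python
import numpy as np

def patch_embed(image, patch_size=4, embed_dim=64, rng=None):
    """Split an (H, W, C) image into non-overlapping patches and linearly
    project each flattened patch to an embed_dim token vector.
    NOTE: the projection is random for illustration; in the real model
    it is a learned layer."""
    rng = np.random.default_rng(0) if rng is None else rng
    H, W, C = image.shape
    assert H % patch_size == 0 and W % patch_size == 0
    ph, pw = H // patch_size, W // patch_size
    # (H, W, C) -> (ph, pw, patch_size, patch_size, C) -> (N, patch_size^2 * C)
    patches = (image.reshape(ph, patch_size, pw, patch_size, C)
                    .transpose(0, 2, 1, 3, 4)
                    .reshape(ph * pw, patch_size * patch_size * C))
    W_proj = rng.standard_normal((patches.shape[1], embed_dim)) * 0.02
    return patches @ W_proj  # token sequence, shape (N, embed_dim)

tokens = patch_embed(np.zeros((32, 32, 3)), patch_size=4, embed_dim=64)
print(tokens.shape)  # (64, 64): an 8x8 grid of patches, each a 64-d token
```

The resulting token sequence is what the cascaded Transformer blocks consume; each block's residual connection then adds its input back onto its output to keep semantic information flowing through the cascade.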
compression, reducing the model's parameter count. An up-sampling module, TUS, based on Transformer feature fusion is also proposed. The feature map carrying high-level semantic information is fused with the corresponding encoder-stage feature map, then patch-embedded again to form tokens that are fed into a Transformer. On the one hand, this compensates for the semantic loss caused by zero-value filling during up-sampling; on the other, it strengthens the fusion of low-level and high-level semantic information to obtain richer context dependencies. Two public datasets, Cityscapes and ADE20K, were used for RCPVT's comparison and ablation experiments, respectively. The experimental results show that the RCPVT model outperforms existing deep learning algorithms in mIoU, parameter and computation cost, and segmentation visualization.
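The Key/Value compression performed by LSRA can be sketched as spatial-reduction attention. This is a single-head NumPy sketch under stated assumptions: average pooling stands in for the learned reduction used in the thesis, and the Q/K/V projections are taken as identity to keep the example short.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def sr_attention(tokens, ratio=4):
    """Single-head self-attention with spatial reduction on Key/Value.
    Each query attends over N/ratio compressed tokens instead of all N,
    shrinking the attention matrix from N x N to N x (N/ratio).
    Average pooling stands in for the learned remapping; Q/K/V
    projections are identity for this sketch."""
    N, d = tokens.shape
    q = tokens
    kv = tokens.reshape(N // ratio, ratio, d).mean(axis=1)  # compress K/V
    attn = softmax(q @ kv.T / np.sqrt(d))  # (N, N/ratio) attention weights
    return attn @ kv                        # output keeps shape (N, d)

x = np.random.default_rng(0).standard_normal((64, 32))
out = sr_attention(x, ratio=4)
print(out.shape)  # (64, 32)
```

The output retains the full token resolution; only the Key/Value side is compressed, which is why the reduction ratio trades accuracy against cost without changing the sequence length seen by later layers.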
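The fusion step in the TUS-style decoder can be illustrated with a small sketch. Assumptions not taken from the abstract: nearest-neighbour repetition is used for up-sampling, the channel dimensions of the two feature maps are equal, and fusion is elementwise addition; the thesis fuses the maps and then re-applies patch embedding before the Transformer.

```python
import numpy as np

def tus_fuse(decoder_feat, encoder_feat):
    """Upsample the low-resolution decoder feature map by nearest-neighbour
    repetition, fuse it with the same-scale encoder feature map, and flatten
    the result into a token sequence for the next Transformer block.
    Assumes equal channel counts and additive fusion (illustrative only)."""
    h, w, c = decoder_feat.shape
    H, W, _ = encoder_feat.shape
    scale = H // h
    up = decoder_feat.repeat(scale, axis=0).repeat(scale, axis=1)  # (H, W, c)
    fused = up + encoder_feat        # encoder detail compensates upsampling loss
    return fused.reshape(H * W, c)   # token sequence for re-embedding

tokens = tus_fuse(np.ones((8, 8, 16)), np.ones((16, 16, 16)))
print(tokens.shape)  # (256, 16)
```

Re-tokenizing the fused map, rather than filling missing positions with zeros, is what lets the decoder recover semantics that plain transposed-convolution up-sampling would lose.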
Keywords/Search Tags:image semantic segmentation, vision transformer, self-attention, pyramid pooling