In recent years, with the rapid development of deep learning, great progress has been made in computer vision research. The study of visual backbone networks is of great significance for core vision tasks such as image classification, object detection, semantic segmentation, and instance segmentation. Recently, Transformer networks have become the mainstream approach for visual backbone networks, showing great potential when trained on large amounts of annotated data to obtain well-fitted models. Existing methods that build complex networks by stacking a large number of Transformer blocks have achieved significant performance gains, but they consume substantial computational resources, which makes them difficult to deploy in practical application scenarios. This thesis studies Transformer-based backbone network design from several perspectives, including the application of attention mechanisms, lightweight network design, and the design of efficient feature extraction modules, and proposes several efficient backbone network algorithms. The effectiveness of these algorithms is verified through training, validation, and testing on several benchmark datasets. Specifically, this thesis presents an in-depth study of accurate and efficient lightweight Transformer-based backbone networks, as follows.

(1) From the perspective of reducing the computational complexity of self-attention structures, this thesis proposes a lightweight Transformer-based backbone network, the Single-Head Transformer (SVT) backbone. This network reduces the computational resource consumption of Transformer-based backbones by constructing a self-attention structure with low computational complexity. SVT introduces a novel Single-Head Self-Attention (SHSA) module, which includes a pyramid-pooling feature extraction module to extract multi-scale features. Unlike previous Transformers that compute Multi-Head Self-Attention (MHSA), SHSA restricts the representation of the input tokens to a single head, enabling low-dimensional embeddings and significantly reducing computational complexity. Although this adds a small number of model parameters, SHSA substantially reduces the number of input tokens. The method achieves a balance between classification accuracy and efficiency, making SVT an efficient Transformer-based backbone network.
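As an illustration of the single-head idea described above, the following PyTorch-style sketch computes attention with one head against a pyramid-pooled set of keys and values; the module name, pooling ratios, and dimensions are illustrative assumptions, not the thesis's actual SHSA implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SingleHeadPoolingAttention(nn.Module):
    """Illustrative single-head self-attention with pyramid-pooled keys/values.

    Keys and values are built from multi-scale average-pooled copies of the
    feature map, so attention is computed against far fewer tokens than the
    full-resolution query set. Pool ratios and dimensions are assumptions.
    """

    def __init__(self, dim, pool_ratios=(1, 2, 4, 8)):
        super().__init__()
        self.pool_ratios = pool_ratios
        self.q = nn.Linear(dim, dim)
        self.kv = nn.Linear(dim, dim * 2)
        self.proj = nn.Linear(dim, dim)
        self.scale = dim ** -0.5  # single head: the whole embedding is one head

    def forward(self, x, H, W):
        # x: (B, N, C) with N = H * W
        B, N, C = x.shape
        q = self.q(x)  # (B, N, C)

        # Build a reduced token set by pooling the feature map at several scales.
        feat = x.transpose(1, 2).reshape(B, C, H, W)
        pooled = []
        for r in self.pool_ratios:
            p = F.adaptive_avg_pool2d(feat, (max(H // r, 1), max(W // r, 1)))
            pooled.append(p.flatten(2).transpose(1, 2))  # (B, h*w, C)
        ctx = torch.cat(pooled, dim=1)  # (B, M, C), M much smaller than N

        k, v = self.kv(ctx).chunk(2, dim=-1)  # (B, M, C) each

        # Single-head attention: no head splitting, no per-head reshaping.
        attn = (q @ k.transpose(-2, -1)) * self.scale  # (B, N, M)
        attn = attn.softmax(dim=-1)
        return self.proj(attn @ v)  # (B, N, C)

# Example: 14x14 feature map with 64 channels.
x = torch.randn(2, 14 * 14, 64)
y = SingleHeadPoolingAttention(64)(x, 14, 14)  # -> (2, 196, 64)
```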
(2) Although SVT achieves good performance using SHSA, two problems remain: 1) the number of model parameters is still high; 2) the accuracy of the model still has considerable room for improvement. To address these problems, this thesis proposes a Multi-branch Lightweight Transformer (MLT) backbone, which consists of the SHSA module and a Multi-scale Feature Extraction Module (MFEM). The MFEM replaces the multi-layer perceptron (MLP) with a set of parallel convolutional layers with different receptive fields, extracting features at different levels from the input to improve accuracy, while the convolutional layers make extensive use of depth-wise convolution to reduce the overall number of parameters. In a comprehensive experimental comparison, MLT is evaluated on image classification, object detection, and semantic segmentation. With only 14.11M parameters, it achieves competitive results in image classification (79.3% Top-1 accuracy on ImageNet-1K), semantic segmentation (77.4% and 44.3% mIoU on Cityscapes and ADE20K, respectively), and object detection and instance segmentation (42.3% box AP and 38.6% mask AP).
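As an illustration of the MFEM idea, the sketch below replaces the Transformer MLP with parallel depth-wise convolutions of different kernel sizes; the module name, kernel sizes, and expansion ratio are assumptions rather than the thesis's actual configuration.

```python
import torch
import torch.nn as nn

class MultiScaleFeatureExtraction(nn.Module):
    """Illustrative multi-branch replacement for the Transformer MLP.

    Parallel depth-wise convolutions with different kernel sizes capture
    features at several receptive fields; 1x1 convolutions expand and fuse
    the channels. Kernel sizes and expansion ratio are assumptions.
    """

    def __init__(self, dim, kernel_sizes=(3, 5, 7), expansion=2):
        super().__init__()
        hidden = dim * expansion
        self.expand = nn.Conv2d(dim, hidden, kernel_size=1)
        # Depth-wise branches: groups=hidden keeps the parameter count low.
        self.branches = nn.ModuleList(
            nn.Conv2d(hidden, hidden, kernel_size=k, padding=k // 2, groups=hidden)
            for k in kernel_sizes
        )
        self.act = nn.GELU()
        self.fuse = nn.Conv2d(hidden, dim, kernel_size=1)

    def forward(self, x):
        # x: (B, C, H, W)
        x = self.act(self.expand(x))
        # Sum the parallel branches so the spatial and channel shapes are kept.
        x = sum(branch(x) for branch in self.branches)
        return self.fuse(self.act(x))

# Example: 64-channel feature map at 14x14 resolution.
out = MultiScaleFeatureExtraction(64)(torch.randn(2, 64, 14, 14))  # -> (2, 64, 14, 14)
```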