Recently, the vision Transformer (ViT) has achieved prevailing success across a series of vision tasks owing to its capability of capturing global information, making it a hot topic in both academic research and industrial applications of computer vision. However, existing ViTs often rely on extensive computation to achieve high performance, which makes them burdensome to deploy on resource-constrained or computing-power-constrained devices. Therefore, this thesis focuses on the architecture design and lightweight strategies of the vision Transformer. To address issues such as large parameter counts, high computational cost, slow inference, and difficult deployment, it studies both the lightweight architecture design of ViT and the hybrid design of ViT and CNN.

First, to alleviate the large parameter count and high computational complexity of the vision Transformer, this thesis studies lightweight architecture design and proposes a depthwise separable vision Transformer, abbreviated as SepViT. Inspired by the lightweight ideology of depthwise separable convolution, SepViT rethinks the computation of the self-attention components in the Transformer and proposes depthwise separable self-attention, which achieves local-global information interaction within and among windows, in sequential order, inside a single Transformer block. Meanwhile, to model the attention relationships among windows efficiently, SepViT designs a window token embedding scheme that learns a global feature representation of each window at negligible cost. In addition, drawing on grouped convolution, SepViT extends depthwise separable self-attention to grouped self-attention, which establishes long-range visual interactions across multiple windows and further improves performance. Finally, extensive experiments on classification, segmentation, and object detection tasks show that SepViT achieves the best trade-off between performance and latency.

On the other hand, because of the inefficient large matrix operations in the Transformer, most existing ViTs cannot run as efficiently as CNNs on various hardware devices or inference frameworks, e.g., TensorRT and CoreML. Therefore, this thesis also studies the hybrid design of the vision Transformer and CNN. From the perspective of inference speed during deployment, it proposes the next-generation vision Transformer (Next-ViT), which infers as fast as a CNN while performing as powerfully as a ViT. Next-ViT first designs the next convolution block (NCB) and the next Transformer block (NTB) to capture local and global information, respectively. It then proposes the next hybrid strategy (NHS) to stack NCBs and NTBs efficiently, which improves both inference speed and performance on downstream tasks. Experiments on various benchmark visual tasks show that Next-ViT not only outperforms recent ViTs but also matches the inference speed of well-known CNNs, and it is deployment-friendly on different hardware devices and inference frameworks.
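
The depthwise separable self-attention described above can be made concrete with a short sketch. The following PyTorch code is a minimal, single-head illustration under this author's own assumptions: a learnable window token is prepended to each window for the within-window ("depthwise") stage, and the window tokens then attend to one another in the among-window ("pointwise") stage before being broadcast back. All class and variable names are illustrative; this is not the SepViT authors' implementation.

```python
import torch
import torch.nn as nn

class DepthwiseSeparableSelfAttention(nn.Module):
    """Minimal sketch: within-window attention, then attention among window tokens."""

    def __init__(self, dim):
        super().__init__()
        self.scale = dim ** -0.5
        # One learnable window token is prepended to every window; it summarizes
        # that window during the within-window stage (an assumption of this sketch).
        self.window_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.qkv_dw = nn.Linear(dim, dim * 3)  # "depthwise" within-window attention
        self.qkv_pw = nn.Linear(dim, dim * 3)  # "pointwise" among-window attention

    def _attend(self, x, qkv):
        # Plain scaled dot-product self-attention over the last two dimensions.
        q, k, v = qkv(x).chunk(3, dim=-1)
        attn = (q @ k.transpose(-2, -1)) * self.scale
        return attn.softmax(dim=-1) @ v

    def forward(self, x):
        # x: (num_windows, tokens_per_window, dim), the windows of one image.
        nw = x.shape[0]
        # Depthwise stage: attention inside each window, window token included.
        tokens = torch.cat([self.window_token.expand(nw, -1, -1), x], dim=1)
        tokens = self._attend(tokens, self.qkv_dw)
        win_tok, x = tokens[:, :1], tokens[:, 1:]
        # Pointwise stage: the nw window tokens attend to each other, so global
        # information is exchanged at the cost of only nw tokens.
        win_tok = self._attend(win_tok.transpose(0, 1), self.qkv_pw).transpose(0, 1)
        # Broadcast each globally informed window token back to its window.
        return x + win_tok
```

For example, `DepthwiseSeparableSelfAttention(dim=96)(torch.randn(16, 49, 96))` processes sixteen 7x7 windows in the two stages above.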
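
Grouped self-attention can likewise be sketched by analogy with grouped convolution: several neighbouring windows are merged into one longer token sequence before attention is applied, so interactions span multiple windows at once. The helper below is hypothetical and assumes the same (num_windows, tokens, dim) tensor layout as the sketch above.

```python
def grouped_self_attention(x, attend, group_size):
    # x: (num_windows, tokens_per_window, dim) torch tensor; `attend` is any
    # self-attention callable over a token sequence. Merging `group_size`
    # windows lets attention establish longer-range interactions across them.
    nw, n, d = x.shape
    assert nw % group_size == 0, "windows must divide evenly into groups"
    x = x.reshape(nw // group_size, group_size * n, d)
    x = attend(x)
    return x.reshape(nw, n, d)

# e.g., reuse any window attention module over groups of four windows:
# y = grouped_self_attention(x, my_attention_module, group_size=4)
```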
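
For Next-ViT, the next hybrid strategy can be pictured as a stacking rule over the two block types: each stage repeats several convolution blocks and ends with one Transformer block, so costly global attention is applied sparsely. The snippet below is a schematic only; the NCB and NTB internals are omitted, and the constructors and stage depths are assumptions rather than the paper's configuration.

```python
import torch.nn as nn

def next_hybrid_stage(make_ncb, make_ntb, dim, num_ncb):
    # One stage under the stacking rule sketched above: num_ncb convolution
    # blocks (NCB) followed by a single Transformer block (NTB).
    # `make_ncb` / `make_ntb` are assumed block constructors taking a width.
    blocks = [make_ncb(dim) for _ in range(num_ncb)]
    blocks.append(make_ntb(dim))
    return nn.Sequential(*blocks)

# A four-stage backbone in this spirit (widths and depths are illustrative):
# stages = nn.Sequential(*[next_hybrid_stage(NCB, NTB, d, n)
#                          for d, n in [(96, 2), (192, 2), (384, 6), (768, 2)]])
```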