Recently, the vision Transformer (ViT) has achieved prevailing success across a series of vision tasks owing to its capability of capturing global information, making it a hot topic in both academic research and industrial applications of computer vision. However, existing ViTs often rely on extensive computation to achieve high performance, which makes them burdensome to deploy on resource-constrained or computing-power-constrained devices. Therefore, this thesis focuses on the architecture design and lightweight strategies of the vision Transformer. To address issues such as large parameter counts, high computational cost, slow inference, and difficult deployment, it studies both the lightweight architecture design of ViT and the hybrid design of ViT and CNN.

First, to alleviate the large parameter count and high computational complexity of the vision Transformer, this thesis studies lightweight architecture design and proposes a depthwise separable vision Transformer, abbreviated as SepViT. Inspired by the lightweight ideology of depthwise separable convolution, SepViT rethinks the computation of the self-attention components in the Transformer and proposes depthwise separable self-attention, which achieves local-global information interaction within and among windows, in sequential order, inside a single Transformer block. Meanwhile, to model the attention relationships among windows efficiently, SepViT designs a window token embedding scheme that learns a global feature representation of each window at negligible cost. In addition, drawing on grouped convolution, SepViT extends depthwise separable self-attention to grouped self-attention, which establishes long-range visual interactions across multiple windows and further improves performance. Finally, extensive experiments on classification, segmentation, and object detection tasks show that SepViT achieves the best trade-off between performance and latency.

On the other hand, because of the inefficient large matrix operations in the Transformer, most existing ViTs cannot run as efficiently as CNNs on various hardware devices or inference frameworks, e.g., TensorRT and CoreML. Therefore, this thesis also studies the hybrid design of the vision Transformer and CNN. From the perspective of inference speed during deployment, it proposes the next-generation vision Transformer (Next-ViT), which infers as fast as a CNN while performing as powerfully as a ViT. Next-ViT first designs the next convolution block (NCB) and the next Transformer block (NTB) to capture local and global information, respectively. It then proposes the next hybrid strategy (NHS) to stack NCBs and NTBs efficiently, which improves both inference speed and performance on downstream tasks. Experiments on various benchmark visual tasks show that Next-ViT not only outperforms recent ViTs but also matches the inference speed of well-known CNNs, and it is deployment-friendly on different hardware devices and inference frameworks.
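
The depthwise separable self-attention described above can be made concrete with a short sketch. The following PyTorch code is a minimal, single-head illustration under this author's own assumptions: a learnable window token is prepended to each window for the within-window ("depthwise") stage, and the window tokens then attend to one another in the among-window ("pointwise") stage before being broadcast back. All class and variable names are illustrative; this is not the SepViT authors' implementation.

```python
import torch
import torch.nn as nn

class DepthwiseSeparableSelfAttention(nn.Module):
    """Minimal sketch: within-window attention, then attention among window tokens."""

    def __init__(self, dim):
        super().__init__()
        self.scale = dim ** -0.5
        # One learnable window token is prepended to every window; it summarizes
        # that window during the within-window stage (an assumption of this sketch).
        self.window_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.qkv_dw = nn.Linear(dim, dim * 3)  # "depthwise" within-window attention
        self.qkv_pw = nn.Linear(dim, dim * 3)  # "pointwise" among-window attention

    def _attend(self, x, qkv):
        # Plain scaled dot-product self-attention over the last two dimensions.
        q, k, v = qkv(x).chunk(3, dim=-1)
        attn = (q @ k.transpose(-2, -1)) * self.scale
        return attn.softmax(dim=-1) @ v

    def forward(self, x):
        # x: (num_windows, tokens_per_window, dim), the windows of one image.
        nw = x.shape[0]
        # Depthwise stage: attention inside each window, window token included.
        tokens = torch.cat([self.window_token.expand(nw, -1, -1), x], dim=1)
        tokens = self._attend(tokens, self.qkv_dw)
        win_tok, x = tokens[:, :1], tokens[:, 1:]
        # Pointwise stage: the nw window tokens attend to each other, so global
        # information is exchanged at the cost of only nw tokens.
        win_tok = self._attend(win_tok.transpose(0, 1), self.qkv_pw).transpose(0, 1)
        # Broadcast each globally informed window token back to its window.
        return x + win_tok
```

For example, `DepthwiseSeparableSelfAttention(dim=96)(torch.randn(16, 49, 96))` processes sixteen 7x7 windows in the two stages above.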
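
Grouped self-attention can likewise be sketched by analogy with grouped convolution: several neighbouring windows are merged into one longer token sequence before attention is applied, so interactions span multiple windows at once. The helper below is hypothetical and assumes the same (num_windows, tokens, dim) tensor layout as the sketch above.

```python
def grouped_self_attention(x, attend, group_size):
    # x: (num_windows, tokens_per_window, dim) torch tensor; `attend` is any
    # self-attention callable over a token sequence. Merging `group_size`
    # windows lets attention establish longer-range interactions across them.
    nw, n, d = x.shape
    assert nw % group_size == 0, "windows must divide evenly into groups"
    x = x.reshape(nw // group_size, group_size * n, d)
    x = attend(x)
    return x.reshape(nw, n, d)

# e.g., reuse any window attention module over groups of four windows:
# y = grouped_self_attention(x, my_attention_module, group_size=4)
```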
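
For Next-ViT, the next hybrid strategy can be pictured as a stacking rule over the two block types: each stage repeats several convolution blocks and ends with one Transformer block, so costly global attention is applied sparsely. The snippet below is a schematic only; the NCB and NTB internals are omitted, and the constructors and stage depths are assumptions rather than the paper's configuration.

```python
import torch.nn as nn

def next_hybrid_stage(make_ncb, make_ntb, dim, num_ncb):
    # One stage under the stacking rule sketched above: num_ncb convolution
    # blocks (NCB) followed by a single Transformer block (NTB).
    # `make_ncb` / `make_ntb` are assumed block constructors taking a width.
    blocks = [make_ncb(dim) for _ in range(num_ncb)]
    blocks.append(make_ntb(dim))
    return nn.Sequential(*blocks)

# A four-stage backbone in this spirit (widths and depths are illustrative):
# stages = nn.Sequential(*[next_hybrid_stage(NCB, NTB, d, n)
#                          for d, n in [(96, 2), (192, 2), (384, 6), (768, 2)]])
```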