
Research On Two-Stream Action Recognition Method Based On Transformer

Posted on: 2024-05-08    Degree: Master    Type: Thesis
Country: China    Candidate: X C Qu    Full Text: PDF
GTID: 2568307106999499    Subject: Computer Science and Technology
Abstract/Summary:
With the introduction of deep learning-based video recognition models, action recognition has become an important research topic in computer vision. Action recognition aims to identify the actions of people in videos. It requires the computer not only to understand the semantic content of spatial features, but also to capture the temporal features embedded in a video from the motion of the people in it, and thereby to develop the model's temporal reasoning ability. Because of the complexity and feature diversity of videos, the accuracy and efficiency of action recognition are lower than those of ordinary image classification, which underlines how demanding the task is. Efficiently extracting spatial-temporal features with suitable models and finding satisfactory mappings between those features and action categories is therefore a problem that must be solved.

In recent years, Transformer-based models have been able to model global features efficiently through the attention mechanism, and their suitability for parallel computation makes them adapt well to large-scale datasets. Nevertheless, Transformer-based action recognition still faces serious challenges: on the one hand, how to use Transformers to extract spatial-temporal features efficiently and model them with high accuracy; on the other hand, how to reduce the size of high-accuracy Transformer models, where accuracy, model size, and computational complexity must be traded off. To this end, new solutions are proposed for some of the existing deficiencies in action recognition. The main work and contributions are as follows.

First, Convolutional Neural Networks and Transformers capture different kinds of features, and neither alone can effectively combine local and global features to model action recognition with higher accuracy, whereas Two-Stream networks can significantly improve recognition accuracy through the interaction between their streams. Therefore, in order to extract as much spatial-temporal information as possible and model it more effectively for higher recognition performance, a Transformer-based Two-Stream model, the Sparse Dense Transformer Network (SDTN), is proposed for action recognition. The Sparse and Dense pathways of this model sample frames from the video at different temporal resolutions; the sampled frames are segmented into non-overlapping pixel blocks (Patches), and the Patches are projected into vectors that are fed into a spatial Transformer and a temporal Transformer to capture spatial and temporal features, respectively. Finally, a late-fusion method integrates the two pathways to obtain the video-level prediction of the model.

In addition, to investigate the robustness of the model, SDTN is compared with mainstream methods on the Kinetics-400, UCF101, and HMDB51 action classification datasets, and the model is analyzed quantitatively in terms of performance and number of parameters. The Top-1 and Top-5 accuracies of these comparison experiments verify that the model improves network performance for action recognition.
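To make the pipeline concrete, the following is a minimal sketch of a Sparse-Dense two-stream layout with patch embedding, factorized spatial and temporal attention, and score-level (late) fusion. The layer sizes, frame counts, sampling stride, and module names are illustrative assumptions; positional embeddings and class tokens are omitted, and this is not the thesis' exact implementation.

```python
# Minimal sketch of a Sparse-Dense two-stream Transformer with factorized
# spatial/temporal attention and late fusion. Layer sizes, frame counts, and
# the sampling stride are illustrative assumptions, not the thesis' design.
import torch
import torch.nn as nn

class FactorizedStream(nn.Module):
    """One pathway: patch embedding -> spatial attention within each frame
    -> temporal attention across frames -> classification head."""
    def __init__(self, num_classes=400, dim=192, heads=3, patch=16, img=224):
        super().__init__()
        self.num_patches = (img // patch) ** 2
        self.embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.spatial = nn.TransformerEncoder(layer, num_layers=2)
        self.temporal = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, clip):                          # clip: (B, T, 3, H, W)
        b, t = clip.shape[:2]
        x = self.embed(clip.flatten(0, 1))            # (B*T, dim, H/p, W/p)
        x = x.flatten(2).transpose(1, 2)              # (B*T, P, dim) patch tokens
        x = self.spatial(x)                           # attention over patches of one frame
        x = x.reshape(b, t, self.num_patches, -1)     # (B, T, P, dim)
        x = x.transpose(1, 2).flatten(0, 1)           # (B*P, T, dim)
        x = self.temporal(x)                          # attention over frames per patch position
        x = x.reshape(b, self.num_patches, t, -1).mean(dim=(1, 2))
        return self.head(x)                           # (B, num_classes)

class SparseDenseNet(nn.Module):
    """Two pathways over the same video at different temporal resolutions,
    combined by late fusion of their class scores."""
    def __init__(self, num_classes=400):
        super().__init__()
        self.sparse = FactorizedStream(num_classes)   # few, widely spaced frames
        self.dense = FactorizedStream(num_classes)    # many, closely spaced frames

    def forward(self, video):                         # video: (B, 32, 3, 224, 224)
        return self.sparse(video[:, ::4]) + self.dense(video)

model = SparseDenseNet()
logits = model(torch.randn(2, 32, 3, 224, 224))       # (2, 400) video-level scores
```

Summing the two pathways' logits is one simple form of late fusion; the point of the sketch is only the overall layout of differential temporal sampling, patch projection, and separate spatial/temporal attention.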
Then, the Spatial-Temporal-Clip Transformer Network (STCTN) is proposed on the basis of SDTN to address SDTN's shortcomings, namely model bloat, training difficulty, and recognition performance that still leaves room for improvement. STCTN adopts a multi-stage Two-Stream Transformer structure arranged progressively as a Spatial Transformer, a Temporal Transformer, and a Clip Transformer. By analyzing the internal structure and computation of the Transformer at each stage, STCTN places the lightweight spatial Transformer, a Patch-based Transformer that captures spatial features, at the shallowest layer of the network; the temporal Transformer operates on frames and extracts temporal features to improve the temporal reasoning of the model; and the final lightweight Clip Transformer operates on clips and integrates the spatial-temporal features of each clip.

In addition, two preprocessing methods for the sampled frames are designed (both are sketched below): Patch Crop crops frames in whole-Patch units, which makes the model focus more on the central Patches; Frame Alignment reduces the model's computation by comparing the input frames of the Sparse and Dense pathways. The model is then compared with state-of-the-art methods on the public datasets, and the improvements in accuracy demonstrate the superiority of STCTN. Finally, STCTN is evaluated against SDTN on the Tiny-Kinetics-400 dataset: it improves performance while reducing model size and computational complexity, measured among other metrics by the number of parameters, which validates the feasibility of STCTN for action recognition research.

In summary, this thesis proposes the Sparse Dense Transformer Network (SDTN) for action recognition by integrating the Two-Stream framework with the Transformer; to address the shortcomings of SDTN, a lightweight spatial Transformer is combined with a novel lightweight Clip Transformer to construct the Spatial-Temporal-Clip Transformer Network (STCTN), together with the Patch Crop and Frame Alignment preprocessing methods. Finally, the effectiveness and robustness of the Two-Stream Transformer are verified by extensive ablation experiments in terms of Top-1 accuracy, Top-5 accuracy, and the number of parameters.
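As a concrete illustration of the two preprocessing steps named above, the sketch below shows one plausible reading of Patch Crop (cropping in whole-Patch units around the frame centre) and Frame Alignment (detecting which densely sampled frames are also used by the Sparse pathway so their features need not be computed twice). The crop size, the alignment rule, and the function names are assumptions; the abstract does not specify the exact procedures.

```python
# Hedged sketch of the Patch Crop and Frame Alignment preprocessing ideas.
# Crop size, alignment rule, and names are assumptions, not the thesis' exact design.
import torch

def patch_crop(frames: torch.Tensor, patch: int = 16, keep: int = 10) -> torch.Tensor:
    """Crop each frame to a centred window whose side length is a whole number
    of patches, so the patch grid stays aligned and attention concentrates on
    the central Patches. frames: (T, C, H, W)."""
    _, _, h, w = frames.shape
    side = keep * patch
    top = (h - side) // 2 // patch * patch    # snap the crop to patch boundaries
    left = (w - side) // 2 // patch * patch
    return frames[:, :, top:top + side, left:left + side]

def frame_alignment(sparse_idx: torch.Tensor, dense_idx: torch.Tensor):
    """Compare the frame indices sampled by the Sparse and Dense pathways.
    Frames shared by both pathways can have their token embeddings computed
    once and reused, which reduces overall computation."""
    shared = torch.isin(dense_idx, sparse_idx)    # dense frames also in the sparse path
    return shared, dense_idx[~shared]             # reusable frames, frames still to embed

# Example: 32 densely sampled frames, every 4th of which also feeds the sparse path.
dense_idx = torch.arange(32)
sparse_idx = dense_idx[::4]
shared, remaining = frame_alignment(sparse_idx, dense_idx)
print(int(shared.sum()), "of", len(dense_idx), "dense frames can reuse sparse-path features")
```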
Keywords/Search Tags:Deep learning, Action recognition, Transformer, Two-Stream