| Video action recognition has become an important research topic in computer vision. Because video data is highly varied, recognizing people or objects in complex scenes is difficult. With the continuous development of artificial intelligence, video action recognition has advanced rapidly, and a variety of models based on convolutional neural networks have been proposed for feature extraction and classification of video actions. However, video action recognition still faces serious challenges, such as low recognition accuracy and a large number of training parameters. We therefore propose a video action recognition method based on a spatio-temporal transformer. The research addresses two questions: first, how to apply the transformer model to video action recognition; second, how to optimize the network structure while reducing GPU hardware cost, so as to improve utilization. The main contributions of this thesis are as follows. First, two schemes are designed for the patch embedding module, one without convolution and one with convolution, to extract features from each patch of an image or video frame. Experiments show that the convolutional scheme provides strong feature extraction and improves network performance to a certain extent. Second, a method based on a spatio-temporal transformer module is proposed. It comprises an LSTM (Long Short-Term Memory) module and a spatio-temporal transformer module: the initial LSTM module is first connected to a fusion layer and then combined with the spatio-temporal transformer module to form the R-TST (LSTM-Time Space Transformer) module. Experimental results show that the model is effective for video action recognition. Finally, ablation experiments on video action recognition are conducted on the HMDB51 and UCF101 datasets. With these as benchmark data, the results show that the proposed model recognizes actions effectively while improving network performance, reducing the number of parameters, saving GPU hardware cost, and improving utilization. |