Action recognition in video refers to the task of identifying specific action categories from video. It has a wide range of applications in video surveillance, video retrieval, and human-computer interaction. However, most methods are still limited by complex backgrounds in videos, so the action recognition task still faces considerable difficulties and challenges. To address this problem, we combine the attention mechanism with a basic action recognition model, design the deep learning network structure, suppress the interference of background information, and improve the recognition ability of the action recognition model on videos with complex interference. The contributions of this article are as follows:

(1) This work summarizes the research status of action recognition based on traditional features, CNN models, and RNN models, and focuses on the basic deep learning models ResNet and LSTM. Finally, the attention mechanism is briefly introduced. This lays a theoretical foundation for the attention-based deep learning model developed in this work.

(2) In view of the redundant frames in videos, which reduce the reliability of action expression, this paper proposes a new temporal-attention LSTM action recognition model based on sequential verification. The model uses an SVM to discriminate the sequential relationship between video frames and learns the temporal attention of each frame by temporally pooling this sequential relationship, so as to obtain an enhanced action representation and suppress low-quality redundant frames (a minimal sketch of this weighting scheme is given below). After the enhanced features are obtained, an LSTM is used to learn the temporal dependencies between action features. The model was validated on two widely used datasets, UCF101 and HMDB51, and achieves reliable action recognition.

(3) To handle the spatial background information in a single video frame, we add a spatial attention module in the preprocessing stage of the network structure and propose an action recognition method based on a spatio-temporal attention two-stream network. This model is designed with a convolutional structure that combines average pooling and max pooling to realize spatial attention, which is used to suppress the spatial background (a minimal sketch is also given below). Experiments on the same two datasets, UCF101 and HMDB51, show that this further improves action recognition performance.
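
For the temporal attention model in contribution (2), the following is a minimal PyTorch sketch of the general idea: per-frame features are weighted by a temporal attention score and then fed to an LSTM. The SVM-based sequential verification step is not fully specified above, so it is approximated here by a hypothetical learned scoring layer (frame_scorer); this is an illustration of the weighting-plus-LSTM pipeline under that assumption, not the exact method of the paper.

```python
# Minimal sketch of temporal attention over per-frame features followed by an LSTM.
# Assumptions: per-frame features are already extracted (e.g. by a CNN backbone);
# the SVM-based sequential verification is approximated by a learned scoring layer.
import torch
import torch.nn as nn

class TemporalAttentionLSTM(nn.Module):
    def __init__(self, feat_dim=2048, hidden_dim=512, num_classes=101):
        super().__init__()
        self.frame_scorer = nn.Linear(feat_dim, 1)   # stand-in for the sequence-verification score
        self.lstm = nn.LSTM(feat_dim, hidden_dim, batch_first=True)
        self.classifier = nn.Linear(hidden_dim, num_classes)

    def forward(self, frame_feats):                  # frame_feats: (B, T, feat_dim)
        scores = self.frame_scorer(frame_feats)      # (B, T, 1) per-frame relevance
        attn = torch.softmax(scores, dim=1)          # temporal attention weights over the T frames
        weighted = frame_feats * attn                # down-weight low-quality / redundant frames
        _, (h_n, _) = self.lstm(weighted)            # model temporal dependencies of the actions
        return self.classifier(h_n[-1])              # (B, num_classes) class logits

# Example: a batch of 4 clips, each with 16 frames of 2048-d features.
feats = torch.randn(4, 16, 2048)
logits = TemporalAttentionLSTM()(feats)
print(logits.shape)  # torch.Size([4, 101])
```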
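
For the spatial attention module in contribution (3), the sketch below shows one common way to combine average pooling and max pooling along the channel dimension into a spatial attention map. The kernel size and the exact placement in the two-stream pipeline are assumptions, since the summary above does not specify them.

```python
# Minimal sketch of a spatial attention module that combines average pooling and
# max pooling over channels; the 7x7 convolution kernel is an assumed choice.
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    def __init__(self, kernel_size=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x):                               # x: (B, C, H, W) frame or feature map
        avg_map = x.mean(dim=1, keepdim=True)           # (B, 1, H, W) channel-wise average pooling
        max_map = x.max(dim=1, keepdim=True)[0]         # (B, 1, H, W) channel-wise max pooling
        attn = torch.sigmoid(self.conv(torch.cat([avg_map, max_map], dim=1)))
        return x * attn                                 # background regions get down-weighted

# Example: apply the module to input frames before the two-stream network.
frames = torch.randn(2, 3, 224, 224)
out = SpatialAttention()(frames)
print(out.shape)  # torch.Size([2, 3, 224, 224])
```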