
Study On Human Action Recognition Based On Non-local Spatial-temporal Residual Attention Mechanism

Posted on: 2021-03-21  Degree: Master  Type: Thesis
Country: China  Candidate: J Luo  Full Text: PDF
GTID: 2518306107985799  Subject: Instrument Science and Technology
Abstract/Summary:
Human action recognition is one of the most active topics in computer vision, with a wide range of applications and considerable research value. Current approaches fall into two groups: handcrafted-feature methods and deep-learning methods. Handcrafted features must be designed manually and are easily influenced by the designer's experience, so deep-learning methods, which use neural networks to learn features adaptively, have become the main research direction. Although progress has been made, several problems remain. First, almost every model assigns the same weight to every part of a video, which introduces noise irrelevant to recognition. Second, motion features are extracted from video by a manual algorithm, a step that the recognition model cannot complete automatically. Third, current convolutional models can extract only local information because of the limited extent of the convolution kernel. To address these problems, the following work has been done:

(1) A temporal attention module is proposed. The module consists of intra-frame attention and inter-frame attention. Using non-local connections, the two sub-modules capture the global dependencies within and between frames. By analyzing the dependencies captured by the intra-frame and inter-frame sub-modules, the module estimates the probability that a frame belongs to the foreground and whether the frame differs noticeably from other frames. This information lets the temporal attention module ignore background and redundant frames and pay more attention to frames that are highly relevant to the recognition result.

(2) A spatial attention mechanism for video is constructed with non-local connections. Non-local connections treat points with high dependence as key points, and the model pays more attention to these points. Because the features extracted by a neural network are redundant, the dependencies between feature channels are also modeled and an attention score over the output channels is produced, so that the model ignores highly repetitive, redundant features. This information makes the model focus further on the key points of motion.

(3) Based on the definition of optical flow, a motion feature is extracted. The spatio-temporal gradient, computed on the attention mask output by the attention mechanism, is used directly to express motion, and it requires only spatial filtering and subtraction. The whole motion representation model is differentiable and can be integrated into any neural network for further learning.

Experiments were conducted on the UCF-101 and HMDB51 datasets, where the model reaches recognition accuracies of 97.1% and 78.0%, respectively. The attention mechanism improves accuracy over the baseline recognition model by 7.6% and 7.2%. Compared with other models that also use attention mechanisms, accuracy improves by at least 1.6% and 5.3%. Compared with methods that use optical-flow-based features, accuracy improves by 1.1% and 3.8%.
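Two ideas in the abstract can be illustrated with a small NumPy sketch: non-local weighting with a residual connection, and a spatio-temporal gradient built from spatial filtering and frame subtraction. The abstract does not give the thesis's actual layers, so everything here is an assumption for illustration: the function names, the scaled dot-product similarity with softmax, the Sobel kernel, and the plain `(T, H, W)` array standing in for the attention mask.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def non_local_residual(x):
    """Hypothetical non-local block over N positions (frames or pixels).

    x: (N, C) feature array. Every position attends to every other one,
    so the dependency is global rather than limited by a kernel size.
    Returns the residual output (N, C) and the attention weights (N, N).
    """
    n, c = x.shape
    scores = x @ x.T / np.sqrt(c)        # pairwise similarity (assumed form)
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return x + weights @ x, weights      # residual connection


def spatiotemporal_gradient(frames):
    """Hypothetical motion feature from filtering and subtraction only.

    frames: (T, H, W) array. Returns (gx, gy, gt), each (T-1, H, W):
    Sobel spatial gradients of each frame except the last, and the
    difference between consecutive frames as the temporal gradient.
    """
    p = np.pad(frames[:-1], ((0, 0), (1, 1), (1, 1)), mode="edge")
    # Sobel x: weighted right column minus weighted left column.
    gx = (p[:, :-2, 2:] + 2 * p[:, 1:-1, 2:] + p[:, 2:, 2:]) \
        - (p[:, :-2, :-2] + 2 * p[:, 1:-1, :-2] + p[:, 2:, :-2])
    # Sobel y: weighted bottom row minus weighted top row.
    gy = (p[:, 2:, :-2] + 2 * p[:, 2:, 1:-1] + p[:, 2:, 2:]) \
        - (p[:, :-2, :-2] + 2 * p[:, :-2, 1:-1] + p[:, :-2, 2:])
    gt = frames[1:] - frames[:-1]        # temporal gradient by subtraction
    return gx, gy, gt
```

Both functions use only matrix products, additions, and subtractions, so the same operations written in an automatic-differentiation framework would be differentiable, which is the property the abstract relies on when it says the motion representation can be integrated into a neural network.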
Keywords/Search Tags: Action Recognition, Non-local Connection, Temporal Attention, Spatial Attention, Spatial-temporal Gradient Feature