
Human Action Recognition Based On Attention Mechanism And Multi-Modality Feature Fusion

Posted on: 2020-09-14    Degree: Master    Type: Thesis
Country: China    Candidate: H Q Wu    Full Text: PDF
GTID: 2428330575963085    Subject: Computer Science and Technology
Abstract/Summary:
With the development of computer software and hardware technology, video data has grown exponentially, and how to manage and utilize it effectively is an urgent problem. Action recognition, a key technology of video analysis, has therefore attracted wide attention from researchers. The goal of the action recognition task is to analyze the behavior of the people in a video clip and assign the corresponding label. Although action recognition has been studied extensively, progress on video-based action recognition has been slow due to objective factors such as viewpoint changes and individual differences in behavior. Starting from the problem of extracting robust video-level features, this thesis carries out action recognition research based on the characteristics of actions themselves and on problems in current work. The main contributions are as follows:

(1) An action recognition model based on spatial and temporal attention is proposed. A video recording an action consists of a series of image frames, each encoding the state of the action at a certain moment; several consecutive frames represent a small motion phase, and all phases together constitute the complete action. To extract video-level features effectively, the model uses two attention mechanisms to learn frame-level features and their aggregation into a video-level feature. First, a spatial attention mechanism locates action-related regions in each frame and suppresses the expression of irrelevant information; to localize the action accurately, the spatial attention heat map is computed from the convolutional feature maps of the frame. Second, because each sub-stage of an action contributes differently to distinguishing categories, a temporal attention module learns the distribution of temporal weights over the sub-stages, with a regularization term applied to the weights during training to make their distribution more reasonable. The model consists of two separate networks, taking image frames and stacked optical flows as inputs, and uses a score-fusion strategy at test time to obtain the final classification result. Experimental results show that this method extracts video-level features effectively and achieves higher classification accuracy than competing methods.

(2) Optical flow images have been widely used as auxiliary information alongside image frames, but fusing classification scores alone lacks interaction between the frame features and the optical flow features, so the resulting classification performance is not ideal. To address this defect, an action recognition model based on a multi-modality temporal attention mechanism is proposed. First, a global temporal attention pooling layer is designed to fuse the features of multiple frames; since the sub-stages of an action are temporally correlated, a bidirectional LSTM models the time-series information. Image frames and stacked optical flows are treated as two modalities, each with its own temporal weights, yielding a video-level feature for each modality. Second, the fused frame and optical flow features form hybrid features that are also fed into the global temporal attention pooling layer to obtain a corresponding video-level feature. The three video-level features are then fused into a single representation of the video, on which classification is performed. To speed up convergence, training proceeds in two stages: in the first stage, the spatial network based on image frames and the temporal network based on stacked optical flows are trained independently; in the second stage, their parameters are frozen and only the fusion network is trained. In addition, because successive frames differ little from one another, both the training and test phases use sparse sampling to select 10 frames (or the corresponding stacked optical flows) to represent the entire video. The model achieves 94.5% and 71.1% classification accuracy on the UCF101 and HMDB51 datasets, respectively.
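The two attention mechanisms of model (1) can be illustrated with a minimal NumPy sketch. This is not the thesis implementation: the spatial energy here is simply the channel-mean of the conv feature map, the temporal scorer `w` stands in for a learned module, and the L2 penalty is only a placeholder for the unspecified regularization term.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def spatial_attention(feat_map):
    """feat_map: (C, H, W) conv features for one frame.
    The heat map is a softmax over spatial positions of the channel-mean
    activation (a stand-in for the learned attention of the thesis).
    Returns the attention-weighted frame feature (C,) and the heat map."""
    C, H, W = feat_map.shape
    energy = feat_map.mean(axis=0).reshape(-1)          # (H*W,)
    heat = softmax(energy)                              # spatial weights
    return feat_map.reshape(C, -1) @ heat, heat.reshape(H, W)

def temporal_attention(frame_feats, w, reg_coeff=0.01):
    """frame_feats: (T, C) per-frame features; w: (C,) hypothetical scorer.
    Returns the video-level feature (C,), the temporal weights (T,), and a
    placeholder L2 regularization term on the weights (training only)."""
    alpha = softmax(frame_feats @ w)                    # (T,)
    video_feat = alpha @ frame_feats                    # (C,)
    reg = reg_coeff * np.sum(alpha ** 2)
    return video_feat, alpha, reg
```

In the full model this computation would run once for the frame network and once for the stacked-optical-flow network, with their classification scores fused at test time.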
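The sparse sampling and three-way feature fusion of model (2) can likewise be sketched, under loud assumptions: the bidirectional LSTM is omitted (raw features stand in for its outputs), the scorer `w` is hypothetical, and the element-wise mean used to build the hybrid features is a guess at a combination the abstract does not specify.

```python
import numpy as np

def sparse_sample(num_frames, num_segments=10):
    """Pick the center frame of each of num_segments equal chunks — a
    deterministic test-time variant of the thesis's sparse sampling."""
    edges = np.linspace(0, num_frames, num_segments + 1)
    return ((edges[:-1] + edges[1:]) / 2).astype(int)

def global_temporal_attention_pool(feats, w):
    """feats: (T, D) per-frame features (in the thesis these first pass
    through a bidirectional LSTM, omitted here); w: (D,) scorer.
    Pools T frame features into one video-level feature (D,)."""
    e = np.exp(feats @ w - (feats @ w).max())
    alpha = e / e.sum()
    return alpha @ feats

def fuse_modalities(rgb, flow, w):
    """rgb, flow: (T, D) frame and stacked-optical-flow features.
    Pools each modality plus a hybrid (element-wise mean, an assumed
    combination), then concatenates the three video-level features."""
    hybrid = (rgb + flow) / 2
    parts = [global_temporal_attention_pool(m, w) for m in (rgb, flow, hybrid)]
    return np.concatenate(parts)                        # (3*D,)
```

The concatenated (3*D,) vector corresponds to the unique video representation on which the final classifier operates.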
Keywords/Search Tags:Action recognition, Attention mechanism, Temporal attention pooling, Deep learning, Multi-modality feature fusion