
Attention Mechanism Based Deep Network For Human Action Recognition In Video

Posted on: 2019-09-24
Degree: Master
Type: Thesis
Country: China
Candidate: H D Yang
Full Text: PDF
GTID: 2428330611493349
Subject: Control Science and Engineering
Abstract/Summary:
Human action recognition in video is a long-standing topic in the computer vision community. It underpins a broad range of high-impact societal applications, from video surveillance to human-computer interaction, and has drawn extensive attention from scholars worldwide. With the arrival of the big-data era, the quantity of video is growing at an astonishing rate; yet, owing to the complexity and variety of human actions, recognizing actions in video efficiently remains a challenging task. In this work, we propose a deep encoder-decoder architecture with an attention mechanism to address the two main tasks of human action recognition: representing actions and recognizing them. Drawing on deep-learning techniques, we represent actions with features extracted by a convolutional neural network and recognize them with a deep sequential network. We further establish an integral framework, optimized during learning, that unifies action representation and recognition. In detail, the main contributions of this thesis are:

(1) We propose the attention-again model, which adapts to the temporal information of videos. Most conventional attention mechanisms focus on spatial information and therefore cannot capture the whole picture. Our attention-again model is rooted in the reading-again model, itself inspired by human reading habits. We aggregate neighboring frames and exploit the long-term dependencies of the LSTM: the bottom LSTM receives global information, which then guides the top LSTM in recognizing features. Our method outperforms the baseline and is superior to methods under the same experimental conditions (RGB data) on three benchmark datasets: UCF11, HMDB51, and UCF101. Specifically, the accuracy is 91.2% on UCF11, 54.4% on HMDB51, and 87.7% on UCF101.

(2) We propose a bi-directional hierarchical LSTM with spatial-temporal attention to improve the recognition of similar actions. Since conventional methods have difficulty with this problem, we modify our deep encoder-decoder architecture. We first introduce a simple hypothesis: every action is composed of many atomic actions. Based on this hypothesis, we propose spatial-temporal attention, which refines frames along the temporal axis and attends to the region of interest within each frame; we also introduce the bi-directional hierarchical LSTM. To represent actions efficiently, we add 3D features to enrich the representation. Comprehensive experiments on two benchmark datasets, HMDB51 and UCF101, verify the effectiveness of the proposed methods and show that they significantly outperform current state-of-the-art methods. Specifically, the accuracy is 71.9% on HMDB51 and 94.8% on UCF101.
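The core idea of the attention-again model, reading the clip once for global context and a second time to weight individual frames, can be illustrated with a minimal NumPy sketch. This is an assumption-laden simplification: the mean over frame features stands in for the bottom LSTM's summary, the bilinear matrix `W` and the feature shapes are hypothetical, and no learning is shown.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Hypothetical CNN frame features for one clip: T frames, D dims each.
T, D = 8, 16
frames = rng.normal(size=(T, D))

# First "reading" (bottom pass): summarize the whole clip into a
# global vector; here a mean stands in for the bottom LSTM's state.
global_ctx = frames.mean(axis=0)

# Second "reading" (top pass): score each frame conditioned on the
# global context, so the second pass already knows what the whole
# video contains before weighting individual frames.
W = rng.normal(size=(D, D)) * 0.1     # hypothetical attention weights
scores = frames @ W @ global_ctx      # one score per frame, shape (T,)
alpha = softmax(scores)               # temporal attention, sums to 1
clip_repr = alpha @ frames            # attended clip representation, (D,)
```

In the full model, `global_ctx` would come from the bottom LSTM and `clip_repr` would feed the top LSTM's recognition step; the sketch only shows how global information can condition per-frame weights.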
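The spatial-temporal attention of contribution (2) factors into two stages: pool the regions within each frame into one frame vector (spatial), then weight the frames themselves (temporal). The sketch below assumes hypothetical conv-map features of shape (T, R, D) and simple dot-product scoring vectors `w_sp` and `w_tm`; the bi-directional hierarchical LSTM and 3D features are omitted.

```python
import numpy as np

rng = np.random.default_rng(1)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Hypothetical conv-map features: T frames, R spatial regions, D channels.
T, R, D = 6, 49, 32
feats = rng.normal(size=(T, R, D))

# Spatial attention: score the regions of each frame and pool them,
# attending to the region of interest within the frame.
w_sp = rng.normal(size=(D,)) * 0.1        # hypothetical scoring vector
sp_alpha = softmax(feats @ w_sp, axis=1)  # (T, R), rows sum to 1
frame_vecs = np.einsum('tr,trd->td', sp_alpha, feats)  # (T, D)

# Temporal attention: weight the frames, so the atomic actions that
# matter most dominate the final clip representation.
w_tm = rng.normal(size=(D,)) * 0.1
tm_alpha = softmax(frame_vecs @ w_tm, axis=0)  # (T,), sums to 1
clip_repr = tm_alpha @ frame_vecs              # (D,)
```

Factoring attention this way keeps the two stages interpretable: `sp_alpha` can be visualized as a per-frame saliency map, while `tm_alpha` shows which frames the model treats as the action's decisive moments.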
Keywords/Search Tags: Human action recognition, Attention mechanism, Encoder-decoder architecture, Attention-again model, Spatial-temporal attention mechanism