
Video Based Human Action Recognition With Deep Learning

Posted on: 2019-01-22   Degree: Master   Type: Thesis
Country: China   Candidate: Y F Wang   Full Text: PDF
GTID: 2428330542494079   Subject: Information and Communication Engineering
Abstract/Summary:
With the growth of GPU computing power and the availability of large-scale labeled visual data, deep learning has achieved state-of-the-art performance on many computer vision tasks. In video-based action recognition, multi-stream ConvNets and 3D ConvNets have achieved promising results. However, because contexts and cues in videos are not modeled explicitly, it is hard for a ConvNet to exploit this useful information. Moreover, since the temporal extent of an action in a video is uncertain, an attention model is needed to focus on the most informative parts of the video. To address these issues, we propose three works here.

First, we propose a semantic attention model based on the multi-stream framework, aiming to exploit contexts and cues in videos to improve action recognition. The multi-stream ConvNet is a widely used deep learning model for action recognition: it first learns features for several modalities separately and then combines them, merging information from multiple domains. However, videos also contain informative regions and objects that can help recognize the action. We first use object detection to find candidate objects and cues, then feed these context regions into the ConvNet through an ROI-pooling layer, and finally use a fully connected layer and a softmax layer to weigh the responses of the context regions and determine the action in the video.
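The abstract does not give the exact architecture, but a minimal sketch of such a context-attention branch, assuming PyTorch and torchvision's ROI pooling, might look as follows. The module name, feature dimensions, and the single-frame setting are illustrative assumptions, not the thesis's implementation.

```python
# Hypothetical sketch of a context-attention branch: detected regions are
# ROI-pooled from a conv feature map, scored, and softmax-weighted before
# classification. All names and sizes are assumptions for illustration.
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.ops import roi_pool

class ContextAttention(nn.Module):
    def __init__(self, in_channels=512, num_classes=101, pool_size=7):
        super().__init__()
        self.pool_size = pool_size
        region_dim = in_channels * pool_size * pool_size
        self.score = nn.Linear(region_dim, 1)           # attention score per region
        self.classify = nn.Linear(region_dim, num_classes)

    def forward(self, feat_map, boxes):
        # feat_map: (1, C, H, W) conv features of one frame
        # boxes: (K, 5) rows of (batch_idx, x1, y1, x2, y2) from an external
        #        detector, already scaled to feature-map coordinates
        regions = roi_pool(feat_map, boxes, (self.pool_size, self.pool_size))
        regions = regions.flatten(1)                    # (K, C*P*P)
        attn = F.softmax(self.score(regions), dim=0)    # weights over K regions
        pooled = (attn * regions).sum(dim=0)            # attention-weighted context
        return self.classify(pooled)                    # class logits

# Usage sketch: logits = ContextAttention()(conv_features, detected_boxes)
```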
Second, we propose a visual-attribute-based 3D convolutional neural network to recognize actions in videos. Another popular ConvNet architecture for action recognition is the 3D ConvNet, which expands the convolution and pooling kernels to three dimensions in order to jointly learn spatial and temporal structure in videos. Although I3D, the state-of-the-art 3D ConvNet, achieves very high accuracy on video action recognition benchmarks, it lacks explicit learning of visual attributes, so it has difficulty distinguishing videos that are similar in both spatial and temporal patterns. To solve this problem, we propose a structure that mines and classifies visual attributes in order to learn them explicitly from videos. Combined with the I3D ConvNet, it improves recognition accuracy on the UCF101 and HMDB51 datasets.

Finally, we propose a generalized attentional pooling model to recognize actions. In a video, an action is a motion pattern that lasts for an uncertain period of time, and most video clips contain no action; an attention model can therefore be used to discover the spatial locations of the active segments and actions. Based on this, we propose a generalized attentional pooling model that uses low-rank nonlinear operations to approximate second-order pooling; serving at the same time as an attention model, it further improves recognition performance when combined with human-body key-point data (a minimal sketch of this pooling appears after the summary below). Experiments show that our method is highly complementary to human key-point information.

In summary, within a common deep-learning-based video analysis framework, this work has studied the weighting of semantic information in videos, the mining of visual attributes, and attention models for video. Through these three experiments, this thesis verifies the feasibility and effectiveness of explicitly learning the key content in a video.
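As referenced in the third contribution above, here is a minimal sketch of rank-1 attentional pooling: a bilinear (second-order) classifier tr(a_k b^T X^T X) factors into a class-specific map (X a_k) and a shared, class-agnostic attention map (X b). This follows the standard attentional-pooling formulation rather than the thesis's exact code; all names and shapes are assumptions.

```python
# Rank-1 approximation of second-order pooling acting as an attention model.
# Illustrative sketch in PyTorch; dimensions and names are assumed.
import torch
import torch.nn as nn

class AttentionalPooling(nn.Module):
    def __init__(self, channels=1024, num_classes=101):
        super().__init__()
        self.top_down = nn.Linear(channels, num_classes, bias=False)  # a_k per class
        self.bottom_up = nn.Linear(channels, 1, bias=False)           # shared b

    def forward(self, feats):
        # feats: (B, N, C) -- N = T*H*W locations from a 3D ConvNet backbone
        cls_maps = self.top_down(feats)              # (B, N, K) class-specific scores
        attn_map = self.bottom_up(feats)             # (B, N, 1) class-agnostic attention
        logits = (cls_maps * attn_map).mean(dim=1)   # (B, K) pooled class scores
        return logits, attn_map.squeeze(-1)          # attention map is inspectable

# The returned attention map could, in principle, be supervised or combined
# with human key-point data, as the abstract describes.
```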
Keywords/Search Tags: Action Recognition, Deep Learning, Convolutional Neural Networks, Contextual Information, Visual Attribute Mining, Attention Model