
Deep Learning Based Human Action Recognition And Video Description Generation

Posted on: 2018-08-20    Degree: Master    Type: Thesis
Country: China    Candidate: X H Wang    Full Text: PDF
GTID: 2348330512984853    Subject: Computer software and theory
Abstract/Summary:
The computer vision community has worked on video analysis and video content understanding for decades, tackling problems such as video semantic segmentation, tracking, video search, human action recognition, and video description generation. To bridge the gap between video content and high-level semantics, this work focuses on two applications: 1) human action recognition and 2) video description generation. Specifically, we view action recognition as a problem of low-level semantic classification, while video description generation is a problem of high-level semantic generation that requires understanding visual content and human language simultaneously. These challenges motivate us to study two problems: 1) how to design an algorithm that captures video patterns, and 2) how to build an effective framework that bridges the gap between video content and high-level semantics.

For human action recognition, conventional methods treat the task as multi-class classification, and various approaches to video-level feature extraction have been proposed. However, most of the extracted features are based on low-level information such as texture or motion estimation, which leads to suboptimal results. The convolutional neural network (CNN), as one of the deep learning techniques, has recently been applied to action recognition in videos as an integrated pipeline of feature learning and classification. However, existing ConvNets impose three "artificial" requirements that may reduce the quality of video analysis: 1) they require fixed-size input videos; 2) most ConvNets require fixed-length input (i.e., video shots with a fixed number of frames); and 3) conventional ConvNets can only model short-term temporal structure. To tackle these issues, we propose an end-to-end pipeline named Two-stream 3D ConvNet Fusion, which can recognize human actions in videos of arbitrary size. Specifically, we decompose a video into spatial and temporal shots. Taking a sequence of shots as input, each stream is implemented with a Spatial Temporal Pyramid Pooling (STPP) ConvNet followed by a Long Short-Term Memory (LSTM) network. The STPP ConvNet extracts equal-dimensional descriptors for each variable-size shot, and the LSTM learns a global description of the input video from these time-varying descriptors. We empirically evaluate our method on action recognition in videos, and the experimental results show that it outperforms state-of-the-art methods (both 2D- and 3D-based) on three standard benchmark datasets (UCF101, HMDB51, and ACT).

For video description generation, the encoder-decoder framework has been widely used and achieves promising results, and various attention mechanisms have been proposed to further improve performance. While temporal attention determines where to look, semantic attention determines which context to use. However, the combination of semantic and temporal attention had not previously been exploited for video captioning. To tackle this issue, we propose an end-to-end pipeline named Fused GRU with Semantic-Temporal Attention (STA-FG), which explicitly incorporates high-level visual concepts into the generation of semantic-temporal attention for video description. The encoder network extracts visual features from the videos and predicts their semantic concepts, while the decoder network focuses on efficiently generating coherent sentences using both the visual features and the semantic concepts. Specifically, the decoder combines the visual and semantic representations, and incorporates a semantic and temporal attention mechanism in a fused GRU network to accurately generate sentences for video captioning. We experimentally evaluate our approach on two prevalent datasets, MSVD and MSR-VTT, and the results show that STA-FG achieves the best performance to date on both BLEU and METEOR.
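
To make the variable-size-to-fixed-length idea concrete, the following is a minimal, hypothetical PyTorch sketch of pyramid pooling over a shot's feature maps followed by an LSTM over the per-shot descriptors. It is a simplified single-stream, 2D illustration of the general mechanism, not the thesis's actual two-stream 3D ConvNet Fusion; all module names, layer sizes, and pyramid levels are assumptions chosen for brevity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialPyramidPooling(nn.Module):
    """Pools a variable-sized feature map into a fixed-length vector by
    max-pooling over a pyramid of grid levels (e.g. 1x1, 2x2, 4x4)."""
    def __init__(self, levels=(1, 2, 4)):
        super().__init__()
        self.levels = levels

    def forward(self, x):                      # x: (batch, channels, H, W)
        pooled = []
        for n in self.levels:
            # adaptive pooling yields an n x n grid regardless of H and W
            p = F.adaptive_max_pool2d(x, output_size=n)
            pooled.append(p.flatten(start_dim=1))
        return torch.cat(pooled, dim=1)        # (batch, channels * sum(n*n))


class ShotSequenceModel(nn.Module):
    """Encodes each variable-size shot with a small ConvNet plus pyramid
    pooling, then aggregates the per-shot descriptors with an LSTM."""
    def __init__(self, in_channels=3, hidden=256, num_classes=101):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_channels, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, kernel_size=3, padding=1), nn.ReLU(),
        )
        self.spp = SpatialPyramidPooling(levels=(1, 2, 4))
        feat_dim = 128 * (1 + 4 + 16)          # fixed length, any input size
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.classifier = nn.Linear(hidden, num_classes)

    def forward(self, shots):                  # list of (batch, C, H_i, W_i)
        descriptors = [self.spp(self.conv(s)) for s in shots]
        seq = torch.stack(descriptors, dim=1)  # (batch, num_shots, feat_dim)
        _, (h, _) = self.lstm(seq)
        return self.classifier(h[-1])          # video-level class scores
```

Because the pyramid pooling always produces the same descriptor length, shots of different spatial resolutions and videos with different numbers of shots can share one classifier, which is the property the abstract attributes to the STPP + LSTM design.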
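Similarly, the decoder's joint use of temporal and semantic attention can be illustrated with a small, hypothetical GRU decoding step: one attention module weights per-frame visual features (where to look), another weights embeddings of predicted semantic concepts (which context to use), and both contexts are fed into the GRU together with the previous word. This is only a generic sketch of the attention-fusion idea under assumed dimensions and names, not the exact STA-FG architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SemanticTemporalAttentionDecoder(nn.Module):
    """One GRU decoding step that attends over per-frame visual features
    (temporal attention) and over semantic-concept embeddings (semantic
    attention), then fuses both contexts with the previous word."""
    def __init__(self, vocab_size, word_dim=256, feat_dim=512,
                 sem_dim=256, hidden=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, word_dim)
        self.att_v = nn.Linear(hidden + feat_dim, 1)   # temporal scorer
        self.att_s = nn.Linear(hidden + sem_dim, 1)    # semantic scorer
        self.gru = nn.GRUCell(word_dim + feat_dim + sem_dim, hidden)
        self.out = nn.Linear(hidden, vocab_size)

    def step(self, word_ids, h, frame_feats, concept_embs):
        # frame_feats: (batch, T, feat_dim); concept_embs: (batch, K, sem_dim)
        def attend(scorer, feats):
            h_exp = h.unsqueeze(1).expand(-1, feats.size(1), -1)
            scores = scorer(torch.cat([h_exp, feats], dim=-1)).squeeze(-1)
            weights = F.softmax(scores, dim=1)
            return torch.bmm(weights.unsqueeze(1), feats).squeeze(1)

        v_ctx = attend(self.att_v, frame_feats)    # where to look in time
        s_ctx = attend(self.att_s, concept_embs)   # which concepts to use
        x = torch.cat([self.embed(word_ids), v_ctx, s_ctx], dim=-1)
        h = self.gru(x, h)
        return self.out(h), h                      # next-word logits, new state
```

Running this step once per output word, with the previous hidden state and previously generated word fed back in, yields a caption whose attention over frames and concepts is recomputed at every step.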
Keywords/Search Tags:action recognition, video description generation, deep learning, convolutional neural network, feature learning