
Research on Action Recognition in Videos

Posted on: 2020-03-28
Degree: Doctor
Type: Dissertation
Country: China
Candidate: G L Yao
Full Text: PDF
GTID: 1368330596475780
Subject: Signal and Information Processing
Abstract/Summary:
Vision-based information acquisition is one of the most important ways of obtaining information today. Action recognition in video sequences has become an important and active research topic in artificial intelligence, computer vision and multimedia applications. Action recognition is divided into two kinds of tasks: action recognition in trimmed videos and action recognition in untrimmed videos. The former is a classification task, which assigns a set of predefined action labels to trimmed videos; the latter is a detection task, which not only assigns the predefined action labels but also determines the start and end times of the actions in untrimmed videos. At present, action recognition in videos has become an important technology in video retrieval, intelligent surveillance, human-computer interaction, robotics and other fields. Although researchers worldwide have made notable achievements in video action recognition, it remains a challenging research topic because of variation in viewpoint, occlusion, cluttered backgrounds, the diversity of actions, the association with semantics, and so on. Inspired by the success of convolutional neural networks (CNNs) in the image domain, CNN architectures and CNN-based methodologies have been extended from images to video tasks, including video action recognition. In recent years, CNN-based action recognition has advanced significantly and now dominates research on video action recognition, with numerous CNN-based approaches emerging. This dissertation focuses on practical problems of existing CNN-based action recognition methods and conducts further research on action recognition using theories from image processing, computer vision, machine learning and deep learning. The main research of this dissertation is three-fold: 1) studying multi-temporal-scale deep information for action recognition in trimmed videos; 2) studying temporal modeling of action atoms for action recognition in trimmed videos; and 3) studying action recognition in untrimmed videos from fine to coarse granularity. The main contributions are summarized as follows.

(1) A method of multi-temporal-scale deep spatiotemporal learning is proposed for action recognition in trimmed videos. CNN-based action recognition methods in the literature suffer from a limitation: they use spatiotemporal information at a single temporal scale. A typical human action contains spatiotemporal information at various temporal scales, and learning and fusing multi-temporal-scale information makes action recognition more reliable in terms of recognition accuracy. Therefore, we create variants of Res3D, a 3D convolutional neural network (CNN) architecture, to extract spatiotemporal information at multiple temporal scales. At each temporal scale, we transfer the knowledge learned from RGB to 3-channel optical flow (OF) and learn information from both the RGB and OF fields. We also propose Parallel Pair Discriminant Correlation Analysis (PPDCA) to fuse the multi-temporal-scale information into an action representation, and feed the representation into a support vector machine (SVM) for action recognition. The experimental results show that the multi-temporal-scale method outperforms the single-temporal-scale method in recognition accuracy and yields action representations of lower dimension and stronger discriminability.
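To make the first contribution concrete, the Python sketch below illustrates the general idea of extracting spatiotemporal features at several temporal scales, fusing them, and classifying with an SVM. It is a minimal illustration under assumptions that are not from the dissertation: torchvision's r3d_18 stands in for the Res3D variants, clip lengths of 8, 16 and 32 frames stand in for the temporal scales, simple concatenation replaces PPDCA (whose formulation is not given in this abstract), and the optical-flow stream is omitted.

```python
# Minimal sketch of multi-temporal-scale feature learning with hypothetical
# stand-ins: r3d_18 replaces the Res3D variants, clip lengths {8, 16, 32}
# act as the temporal scales, and concatenation replaces PPDCA fusion.
import torch
import torch.nn as nn
from torchvision.models.video import r3d_18
from sklearn.svm import LinearSVC

def build_extractor():
    """3D CNN backbone with the classification head removed."""
    model = r3d_18(weights=None)
    model.fc = nn.Identity()          # expose the 512-d spatiotemporal feature
    return model.eval()

extractor = build_extractor()

@torch.no_grad()
def multi_scale_feature(video, clip_lengths=(8, 16, 32)):
    """video: tensor of shape (3, T, H, W); returns one fused feature vector."""
    feats = []
    for length in clip_lengths:                  # one temporal scale per clip length
        clip = video[:, :length]                 # first `length` frames of the video
        feats.append(extractor(clip.unsqueeze(0)).squeeze(0))  # (512,) per scale
    return torch.cat(feats)                      # simple stand-in for PPDCA fusion

# Usage with random data: 10 "videos" of 32 RGB frames at 112x112.
videos = [torch.randn(3, 32, 112, 112) for _ in range(10)]
X = torch.stack([multi_scale_feature(v) for v in videos]).numpy()
y = [i % 2 for i in range(10)]                   # dummy action labels
svm = LinearSVC().fit(X, y)                      # SVM on the fused representation
```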
(2) A method is proposed to recognize actions in trimmed videos by temporal modeling of action atoms. An action can be considered a temporal sequence of action units, referred to as action atoms, which capture the key semantic and characteristic spatiotemporal features of actions at different temporal scales. We extract deep spatiotemporal information from the RGB and OF fields at multiple temporal scales. At each temporal scale, we mine the action atoms in the spatiotemporal space and use long short-term memory (LSTM) to model the temporal evolution of the atoms for action recognition. The experimental results show that the proposed multi-temporal-scale spatiotemporal atom modeling method achieves competitive recognition accuracy.

(3) A method is proposed to recognize actions in untrimmed videos from fine to coarse granularity. A fine-granular classifier tends to offer precise temporal boundaries of an action, while a coarse-granular classifier considers the dependence between the frames or segments of one action instance. We exploit the strengths of classifiers at different granularities and propose to recognize actions in untrimmed videos from fine to coarse granularity, which is also consistent with how people detect actions. The proposed method is built in the 'proposal then classification' framework. We design segment-level (fine-granular) and window-level (coarse-granular) classifiers for each of the proposal and classification steps, and each step is executed from the segment level to the window level. The experimental results show that our method not only achieves detection performance comparable to that of state-of-the-art methods, but also performs relatively consistently across different action categories.
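For the second contribution, the sketch below shows how a sequence of action-atom features can be modeled with an LSTM for classification. It is a hypothetical illustration: the atom-mining step is not reproduced, and the 512-dimensional atom features, hidden size and number of classes are assumed values rather than the dissertation's settings.

```python
# Minimal sketch of temporal modeling over action atoms with an LSTM.
# The atom features and dimensions are assumed for illustration only.
import torch
import torch.nn as nn

class AtomLSTM(nn.Module):
    def __init__(self, feat_dim=512, hidden=256, num_classes=101):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, num_classes)

    def forward(self, atoms):            # atoms: (batch, num_atoms, feat_dim)
        _, (h_n, _) = self.lstm(atoms)   # h_n: (1, batch, hidden)
        return self.head(h_n[-1])        # class scores from the final hidden state

# Usage with random data: a batch of 4 videos, 8 mined atoms each.
model = AtomLSTM()
atom_features = torch.randn(4, 8, 512)   # stand-in for mined atom features
logits = model(atom_features)            # (4, 101) action scores
```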
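For the third contribution, the following sketch outlines a generic 'proposal then classification' pass from the segment level to the window level. The thresholding-and-merging proposal rule, the callable classifiers and the feature dimensions are all assumptions made for illustration and are not taken from the dissertation.

```python
# Minimal sketch of a fine-to-coarse 'proposal then classification' pipeline
# for untrimmed videos; the proposal rule and classifiers are hypothetical.
from typing import Callable, List, Tuple
import numpy as np

def propose_windows(segment_scores: np.ndarray, thr: float = 0.5) -> List[Tuple[int, int]]:
    """Group consecutive above-threshold segments into candidate windows."""
    windows, start = [], None
    for i, s in enumerate(segment_scores):
        if s >= thr and start is None:
            start = i                                   # a window opens
        elif s < thr and start is not None:
            windows.append((start, i)); start = None    # the window closes
    if start is not None:
        windows.append((start, len(segment_scores)))
    return windows

def detect_actions(segments: np.ndarray,
                   segment_clf: Callable[[np.ndarray], np.ndarray],
                   window_clf: Callable[[np.ndarray], Tuple[int, float]]):
    """segments: (num_segments, feat_dim) features of one untrimmed video."""
    actionness = segment_clf(segments)                  # fine-granular scores
    detections = []
    for start, end in propose_windows(actionness):      # coarse-granular pass
        label, score = window_clf(segments[start:end])
        detections.append((start, end, label, score))
    return detections

# Usage with dummy classifiers on random features.
rng = np.random.default_rng(0)
feats = rng.normal(size=(20, 512))
seg_clf = lambda x: rng.uniform(size=len(x))                    # stand-in scores
win_clf = lambda x: (int(rng.integers(0, 5)), float(x.mean()))  # (label, score)
print(detect_actions(feats, seg_clf, win_clf))
```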
Keywords/Search Tags: Action Recognition, Action Representation, Deep Feature, Convolutional Neural Network, Long Short-Term Memory