
Research On Efficient Action Recognition Based On Deep Learning

Posted on: 2023-10-27
Degree: Doctor
Type: Dissertation
Country: China
Candidate: Z Huang
GTID: 1528306902459094
Subject: Information and Communication Engineering
Abstract/Summary:
Video is becoming one of the most important data modalities in the era of big data. Compared with static images, videos are more complex because they carry additional motion and audio information, which makes video-based tasks such as video processing, video classification and video retrieval more challenging. Human action recognition is one of the most important topics in video understanding. It can be applied to virtual reality, human-computer interaction, intelligent surveillance and video retrieval, and it provides technical support for online short-video services, live-streaming platforms and video websites. The human body is the most important subject in videos, and machines cannot understand videos until they are able to understand the patterns of human action. However, previous methods for human action recognition ignore the natural structure of the human body, which limits the representation ability of their spatio-temporal features. On the other hand, directly increasing the number of model parameters to improve performance inevitably introduces higher computational complexity and makes the models harder to train. To solve these two problems, this thesis improves both the modeling of human body structure and model acceleration for different modalities of video data. It proposes 3D local convolutional neural networks, spatio-temporal inception graph convolutional networks, and inheritance and exploration knowledge distillation, which together increase the representation ability of features and reduce the computational cost. The contributions and novelty are summarized as follows.

(1) To better model the features of human body parts in video-based action recognition, this thesis introduces 3D local convolutional neural networks. It introduces 3D local operations as a generic family of building blocks that extract 3D local features from adaptive 3D local neighborhoods. The proposed 3D local operations support the extraction of local 3D volumes of body parts in a sequence with adaptive spatial and temporal scales, locations and lengths. In this way, the spatio-temporal patterns of each body part are learned from a 3D local neighborhood with part-specific scales, locations, frequencies and lengths. Based on this 3D local operation, the thesis proposes a simple but effective 3D local convolutional network for action recognition. The network can be combined with any existing architecture to explicitly extract the distinctive motion features of different human body parts and improve the representation ability of spatio-temporal features.

(2) To model the graph structure of skeleton sequences in skeleton-based action recognition, this thesis introduces spatio-temporal inception graph convolutional networks. By exploring the graph structure of the joints in a skeleton sequence, it proposes to extract and merge multi-scale graph features within the graph network. To implement this idea, the thesis proposes a simple but highly modularized graph convolutional neural network for skeleton-based action recognition. The approach overcomes the limitations of previous methods by extracting and synthesizing information at different scales and under different transformations from different paths at different layers. It also aggregates multi-scale information along both the spatial and temporal dimensions. Extensive experiments show that it outperforms previous state-of-the-art methods on multiple datasets while greatly reducing the number of parameters and the computational cost.

(3) To further reduce the computational and training cost, this thesis introduces a model distillation algorithm that improves computational efficiency while preserving the high performance of neural networks. Directly pushing the student model to mimic the probabilities or features of the teacher model largely limits the student's ability to learn undiscovered knowledge and features. To address this issue, the thesis proposes a novel inheritance and exploration distillation framework in which the student model is split into two parts: inheritance and exploration. The inheritance part is trained with a similarity loss that transfers the knowledge already learned by the teacher model to the student model, while the exploration part is encouraged, through a dissimilarity loss, to learn representations different from the inherited ones. The proposed framework is generic and can easily be combined with existing distillation or knowledge-transfer methods for training deep neural networks. Extensive experiments demonstrate that the two parts jointly push the student model to learn more diversified and effective representations.
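Contribution (1) centres on extracting a part-specific 3D neighborhood (time x height x width) from a video feature volume. The sketch below is a minimal, non-learned illustration of that idea only, assuming a feature volume stored as nested lists indexed `[t][y][x]`. The function names `crop_local_volume` and `local_average` are hypothetical; the center and half-extent that the thesis learns adaptively per body part are fixed integers here, and average pooling over the cropped patch stands in for the local 3D convolution.

```python
from typing import List, Tuple

# Feature volume indexed as vol[t][y][x] (time, height, width).
Volume = List[List[List[float]]]


def crop_local_volume(vol: Volume, center: Tuple[int, int, int],
                      half_extent: Tuple[int, int, int]) -> Volume:
    """Crop the 3D neighborhood around `center`, clipped to the volume bounds.

    Hypothetical interface: `center` plays the role of a body part's location
    and `half_extent` its temporal/spatial scale. The thesis learns these
    adaptively; here they are fixed integers supplied by the caller.
    """
    sizes = (len(vol), len(vol[0]), len(vol[0][0]))
    bounds = []
    for c, h, size in zip(center, half_extent, sizes):
        # Inclusive window [c - h, c + h], clipped to [0, size).
        lo = max(0, c - h)
        hi = max(lo, min(size, c + h + 1))  # never below lo: avoids negative slices
        bounds.append((lo, hi))
    (t0, t1), (y0, y1), (x0, x1) = bounds
    return [[row[x0:x1] for row in frame[y0:y1]] for frame in vol[t0:t1]]


def local_average(vol: Volume, center: Tuple[int, int, int],
                  half_extent: Tuple[int, int, int]) -> float:
    """Average-pool the cropped neighborhood (a stand-in for a local 3D conv)."""
    patch = crop_local_volume(vol, center, half_extent)
    values = [v for frame in patch for row in frame for v in row]
    if not values:
        raise ValueError("empty neighborhood: center lies outside the volume")
    return sum(values) / len(values)


if __name__ == "__main__":
    # Toy 3x4x4 volume whose value encodes its own (t, y, x) position.
    vol = [[[100 * t + 10 * y + x for x in range(4)] for y in range(4)]
           for t in range(3)]
    # Two hypothetical "body parts" at different locations.
    print(local_average(vol, (1, 1, 1), (1, 1, 1)))  # interior 3x3x3 window: 111.0
    print(local_average(vol, (0, 0, 0), (1, 1, 1)))  # clipped 2x2x2 window: 55.5
```

Because each part carries its own center and extent, two parts can be summarized over windows of different size and duration; in a learned model these would be predicted per sample rather than hard-coded.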
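The multi-scale graph idea in contribution (2) can be illustrated with plain neighborhood averaging on a toy skeleton: each branch propagates joint features over k hops, and the branch outputs are concatenated along the channel axis. This is a simplified sketch under stated assumptions, not the network from the thesis: the helper names are hypothetical, the learnable per-branch transformations and the temporal branches are omitted, and mean aggregation over self-loop-augmented neighbors is an assumed normalization.

```python
from typing import Dict, List, Sequence, Set, Tuple

# Joint features: features[joint][channel].
Features = List[List[float]]


def build_neighbors(num_joints: int,
                    edges: Sequence[Tuple[int, int]]) -> Dict[int, Set[int]]:
    """Undirected skeleton adjacency with a self-loop on every joint."""
    nbrs = {j: {j} for j in range(num_joints)}
    for a, b in edges:
        nbrs[a].add(b)
        nbrs[b].add(a)
    return nbrs


def propagate(x: Features, nbrs: Dict[int, Set[int]]) -> Features:
    """One hop of mean aggregation: each joint averages itself and its neighbors."""
    out = []
    for j in range(len(x)):
        group = sorted(nbrs[j])
        out.append([sum(x[i][c] for i in group) / len(group)
                    for c in range(len(x[j]))])
    return out


def multi_scale_aggregate(x: Features, nbrs: Dict[int, Set[int]],
                          scales: Sequence[int] = (1, 2)) -> Features:
    """Run one branch per scale k (k hops of propagation), then concatenate
    the branch outputs along the channel axis for every joint."""
    branches = []
    for k in scales:
        h = x
        for _ in range(k):
            h = propagate(h, nbrs)
        branches.append(h)
    return [[v for branch in branches for v in branch[j]] for j in range(len(x))]


if __name__ == "__main__":
    # Toy 3-joint chain (e.g. shoulder-elbow-wrist) with one channel per joint.
    nbrs = build_neighbors(3, [(0, 1), (1, 2)])
    x = [[1.0], [2.0], [3.0]]
    print(multi_scale_aggregate(x, nbrs))  # [[1.5, 1.75], [2.0, 2.0], [2.5, 2.25]]
```

With `scales=(1, 2)` each joint ends up with two channels, its 1-hop and 2-hop averages, so the end joints see context from the whole chain only in the second branch. In the architecture the abstract describes, each path would also apply its own transformation, and analogous branches would aggregate over different temporal extents.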
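A minimal sketch of the two losses in contribution (3), assuming the student's feature vector is split evenly into an inheritance half and an exploration half, each the same size as the teacher feature. Here the similarity loss is taken as 1 minus the cosine similarity between the inheritance half and the teacher feature, and the dissimilarity loss as the squared cosine similarity between the exploration half and the teacher feature, which is smallest when the two are orthogonal. These loss forms, the weights `alpha` and `beta`, and all function names are assumptions for illustration; the thesis's exact formulation (for example, how teacher features are encoded before comparison) may differ.

```python
import math
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float], eps: float = 1e-12) -> float:
    """Cosine similarity; `eps` guards against division by zero."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / (norm + eps)


def split_student(feat: Sequence[float]) -> Tuple[List[float], List[float]]:
    """Split the student feature evenly into (inheritance, exploration) halves."""
    if len(feat) % 2:
        raise ValueError("student feature length must be even")
    half = len(feat) // 2
    return list(feat[:half]), list(feat[half:])


def ie_distillation_loss(student_feat: Sequence[float],
                         teacher_feat: Sequence[float],
                         alpha: float = 1.0,
                         beta: float = 1.0) -> Tuple[float, float, float]:
    """Return (weighted total, inheritance loss, exploration loss).

    Assumed loss forms (illustrative only):
      inheritance: 1 - cos(inherit, teacher)   -> 0 when aligned with the teacher
      exploration: cos(explore, teacher) ** 2  -> 0 when orthogonal to the teacher
    """
    inherit, explore = split_student(student_feat)
    if len(inherit) != len(teacher_feat):
        raise ValueError("each student half must match the teacher feature size")
    l_inherit = 1.0 - cosine(inherit, teacher_feat)
    l_explore = cosine(explore, teacher_feat) ** 2
    return alpha * l_inherit + beta * l_explore, l_inherit, l_explore


if __name__ == "__main__":
    teacher = [1.0, 0.0]
    # Inheritance half aligned with the teacher, exploration half orthogonal: both terms near 0.
    print(ie_distillation_loss([2.0, 0.0, 0.0, 3.0], teacher))
    # Roles swapped: both terms are now close to their maximum of 1.
    print(ie_distillation_loss([0.0, 3.0, 2.0, 0.0], teacher))
```

In actual training these terms would be computed on batched tensors with automatic differentiation and added to the usual classification loss; plain floats are used here only to make the arithmetic of the two objectives visible.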
Keywords/Search Tags: Deep Learning, Action Recognition, Convolutional Neural Network, Graph Convolutional Neural Network, Model Distillation