
Research On Human Action Analysis And Recognition Method Based On Deep Learning

Posted on: 2022-09-01
Degree: Doctor
Type: Dissertation
Country: China
Candidate: W Wang
Full Text: PDF
GTID: 1488306728965379
Subject: Signal and Information Processing
Abstract/Summary:
With the rapid development of Internet multimedia technology and the large-scale adoption of video capture devices such as smartphones and surveillance cameras, video data has grown explosively. Human action is an important cue for characterizing video content, and human action analysis and recognition for video understanding has become one of the most active research topics in computer vision in recent years, with strong application demand and broad prospects in intelligent video surveillance, human-computer interaction, autonomous driving, and related fields. Although human action recognition has made great progress, the performance of current models in complex scenes is still far from the level required for large-scale use in daily life.

This dissertation focuses on video action analysis and recognition tasks, including cross-modal video temporal grounding, temporal action detection, online action detection, and action prediction, and studies existing methods in depth with respect to multi-modal feature representation, action category similarity measurement, parallelizable model training, and temporal modeling. Under this technical route, the main research contents and contributions of this dissertation are as follows.

(1) To address the effective representation and fusion of multi-modal features in cross-modal video temporal grounding, a feature fusion strategy based on cross-modal collaborative attention interaction is proposed (see the sketch below). An interactive attention mechanism integrates and enhances the visual and language features, and a residual stacking scheme builds a hierarchical interaction network for multi-stage deep fusion, capturing fine-grained video feature representations. Experimental results show that the proposed model effectively improves the accuracy of video temporal grounding and achieves the best performance under multiple evaluation metrics.

(2) To address intra-class differences and inter-class overlaps, this dissertation proposes a siamese network (IVS-Net) for joint identification and verification, which introduces action category similarity measurement into temporal action detection. By adding a distance measure to the cross-entropy loss, IVS-Net jointly optimizes the action classification loss and the action similarity loss, achieving intra-class contraction and inter-class separation for different human actions. The proposed model can be attached after any temporal segment proposal network to further improve the accuracy of action segment classification.
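As a concrete illustration of contribution (1), the sketch below stacks residual cross-modal attention blocks in which video features query language features and vice versa, giving multi-stage deep fusion. This is a minimal sketch under assumptions: the abstract does not give layer sizes, depth, or module names, so everything here (CoAttentionBlock, HierarchicalFusion, dimensions) is illustrative rather than the dissertation's actual architecture.

```python
# Minimal sketch of cross-modal collaborative attention fusion (illustrative only;
# sizes and module names are assumptions, not the dissertation's implementation).
import torch
import torch.nn as nn


class CoAttentionBlock(nn.Module):
    """One interaction stage: video attends to language and vice versa,
    with residual connections so blocks can be stacked hierarchically."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.vid_from_lang = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.lang_from_vid = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_v = nn.LayerNorm(dim)
        self.norm_l = nn.LayerNorm(dim)

    def forward(self, video, lang):
        # video: (B, T, D) clip features; lang: (B, L, D) word features
        v_enh, _ = self.vid_from_lang(video, lang, lang)   # video queries language
        l_enh, _ = self.lang_from_vid(lang, video, video)  # language queries video
        # residual connections enable the "residual stack" multi-stage fusion
        return self.norm_v(video + v_enh), self.norm_l(lang + l_enh)


class HierarchicalFusion(nn.Module):
    """Stack several co-attention blocks for progressively deeper fusion."""

    def __init__(self, dim: int = 512, depth: int = 3):
        super().__init__()
        self.blocks = nn.ModuleList([CoAttentionBlock(dim) for _ in range(depth)])

    def forward(self, video, lang):
        for blk in self.blocks:
            video, lang = blk(video, lang)
        return video, lang
```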
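For contribution (2), the joint identification-verification objective can be sketched as a cross-entropy classification term plus a contrastive distance term over segment pairs, which pulls same-action pairs together and pushes different-action pairs apart. The margin, distance metric, and loss weighting below are assumptions; only the idea of jointly optimizing the two losses comes from the abstract.

```python
# Minimal sketch of a joint classification + similarity loss in the spirit of
# IVS-Net (margin, weighting, and distance choice are assumptions).
import torch.nn.functional as F


def joint_loss(logits_a, logits_b, emb_a, emb_b, labels_a, labels_b,
               margin: float = 1.0, lam: float = 0.5):
    # identification: classify each segment of the pair
    cls_loss = F.cross_entropy(logits_a, labels_a) + F.cross_entropy(logits_b, labels_b)

    # verification: contrastive loss on the embedding distance of the pair
    dist = F.pairwise_distance(emb_a, emb_b)               # (B,)
    same = (labels_a == labels_b).float()                  # 1 if same action class
    ver_loss = same * dist.pow(2) + (1 - same) * F.relu(margin - dist).pow(2)

    # intra-class contraction + inter-class separation, weighted against classification
    return cls_loss + lam * ver_loss.mean()
```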
(3) Considering the incomplete video structure and the lack of future temporal information in online action recognition, this dissertation fixes the feature extraction network and the classification loss, studies how different temporal modeling methods perform on this task, finds that different temporal models are complementary, and then proposes a temporal hybrid model for online action recognition (see the sketch below). Four common families of temporal modeling are examined, namely temporal convolution, temporal pooling, temporal attention, and recurrent neural networks and their variants, covering eleven temporal models in total and providing a fair and comprehensive comparative analysis for research in this area.

(4) To address the shortcomings of recurrent encoder-decoder frameworks in human action prediction, namely training that is hard to parallelize and inflexible modeling of temporal dependencies, a progressive action prediction model based on temporal attention (TTPP) is proposed. Built on the encoder-decoder design, TTPP uses a temporal multi-head self-attention module to aggregate historical information and a lightweight network module to progressively predict future action features. Because multi-head self-attention does not depend on the computation of previous time steps, it supports parallel computation well; at the same time, by computing attention between the current moment and historical time steps, it effectively captures long-term dependencies. Experimental results show that the proposed model greatly improves both the accuracy and the efficiency of action prediction.
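For contribution (3), the temporal hybrid idea can be sketched as combining complementary temporal models (a temporal convolution branch, a recurrent branch, and attention pooling) on top of frozen per-frame features, then fusing them for the per-frame decision. The concrete branches, dimensions, and fusion rule below are assumptions; only the general notion of mixing complementary temporal models is taken from the abstract.

```python
# Minimal sketch of a temporal hybrid head for online action recognition
# (branches and fusion are illustrative assumptions).
import torch
import torch.nn as nn


class TemporalHybridHead(nn.Module):
    def __init__(self, feat_dim: int = 2048, hidden: int = 512, num_classes: int = 30):
        super().__init__()
        self.tconv = nn.Conv1d(feat_dim, hidden, kernel_size=3, padding=1)  # temporal convolution
        self.gru = nn.GRU(feat_dim, hidden, batch_first=True)               # recurrent branch
        self.att = nn.Linear(feat_dim, 1)                                   # temporal attention pooling
        self.proj = nn.Linear(feat_dim, hidden)
        self.classifier = nn.Linear(hidden * 3, num_classes)

    def forward(self, feats):
        # feats: (B, T, D) frozen features of the frames observed so far
        conv_out = self.tconv(feats.transpose(1, 2)).mean(dim=2)  # (B, hidden)
        _, h = self.gru(feats)                                    # h: (1, B, hidden)
        rnn_out = h[-1]                                           # (B, hidden)
        w = torch.softmax(self.att(feats), dim=1)                 # (B, T, 1) attention weights
        att_out = self.proj((w * feats).sum(dim=1))               # (B, hidden)
        fused = torch.cat([conv_out, rnn_out, att_out], dim=1)
        return self.classifier(fused)                             # scores for the current frame
```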
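For contribution (4), the TTPP design can be sketched as a temporal multi-head self-attention encoder that aggregates the observed history in parallel (no step-by-step recurrence), followed by a lightweight module that progressively rolls out future action features. Layer counts, dimensions, the prediction horizon, and the exact progressive decoder are assumptions for illustration.

```python
# Minimal sketch of a TTPP-style encoder + progressive predictor
# (architecture details are assumptions, not the dissertation's implementation).
import torch
import torch.nn as nn


class TTPPSketch(nn.Module):
    def __init__(self, dim: int = 512, num_heads: int = 8, horizon: int = 4):
        super().__init__()
        enc_layer = nn.TransformerEncoderLayer(d_model=dim, nhead=num_heads,
                                               batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=2)
        # lightweight predictor: maps the aggregated state to the next feature
        self.predictor = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.horizon = horizon

    def forward(self, history):
        # history: (B, T, D) features of the observed frames
        ctx = self.encoder(history)          # self-attention over all time steps at once
        futures = []
        feat = ctx[:, -1]                    # summary of the most recent state
        for _ in range(self.horizon):        # progressively predict future features
            feat = self.predictor(feat)
            futures.append(feat)
        return torch.stack(futures, dim=1)   # (B, horizon, D) predicted action features
```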
Keywords/Search Tags:deep learning, online action detection, action prediction, temporal action detection, cross-modal video temporal grounding