
Research On Human Action Recognition Based On Multi-modal Video

Posted on: 2023-05-26
Degree: Doctor
Type: Dissertation
Country: China
Candidate: H Zheng
GTID: 1528306902453264
Subject: Computer software and theory

Abstract/Summary:
Nowadays, in the era of "smart cities" driven by big data, the effective analysis and understanding of video data is of great significance to national innovation, comprehensive social governance, and criminal investigation under the new situation. Among the many video understanding tasks, human action recognition, one of the most representative human-centric visual understanding tasks, has received continuous attention from academia and industry. Video is essentially a carrier of multimodal information, containing several distinct modalities such as Red-Green-Blue (RGB) images, optical flow images, and skeleton sequences. The core problem of multimodal video action recognition is to extract joint spatio-temporal features that accurately describe human actions from videos whose semantic content is complex and abstract and easily disturbed by external environmental factors. This thesis exploits the different modalities in video to mine high-level semantic features that effectively describe the essential characteristics of human action. The main research work is as follows:

1. The problem of RGB video action recognition is studied. To compensate for the difference, at the spatio-temporal scale, between the RGB images that carry static information and the optical flow images that carry dynamic information, an implicit modal alignment network based on weakly supervised learning and an explicit modal alignment network based on subspace learning are proposed. For the fusion of these two modalities, an unsupervised sparse contractive auto-encoder is proposed that cooperates with a deep belief network to perform modal feature learning and feature fusion simultaneously. Experimental results show that the proposed alignment networks effectively compensate for the differences between the modalities and lay a good foundation for modal fusion, while the proposed fusion network mines high-level feature representations that are highly robust and strongly discriminative.

2. The problem of skeleton video action recognition is studied. To capture the complex intra-view dependencies among joints and among bones in the skeleton modality, a wavelet graph convolutional network based on an attention mechanism is proposed to mine the internal features of the joint view and the bone view separately; the basic graph-convolution operation is sketched below. To capture the inter-view dependencies between joints and bones, a unified graph fusion network is proposed that treats the high-dimensional information from the two views as the same type of target object for the convolution operation, so that information circulates between the joint and bone views; combined with transfer learning, the method better mines the correlated and complementary information between them. Experimental results show that the proposed attention-based wavelet graph convolutional network effectively mines discriminative and comprehensive feature representations within each view, and that the proposed transfer-learning-based fusion network effectively mines the correlated, consistent, and complementary feature representations across views.
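The thesis's wavelet and attention components are not detailed in the abstract, so the following is only a minimal sketch of the basic operation such a network builds on: a graph convolution that propagates joint features along a normalized adjacency matrix of the human body graph. The class name, layer sizes, and the 25-joint layout (in the style of NTU RGB+D skeletons) are illustrative assumptions, not the thesis's implementation.

```python
import torch
import torch.nn as nn

class SkeletonGraphConv(nn.Module):
    """Basic graph convolution over skeleton joints: X' = ReLU(A_hat X W).

    A_hat is the symmetrically normalized adjacency of the joint graph with
    self-loops. This is a generic GCN layer, not the thesis's
    wavelet/attention variant.
    """
    def __init__(self, in_features: int, out_features: int, adjacency: torch.Tensor):
        super().__init__()
        # A_hat = D^-1/2 (A + I) D^-1/2
        a = adjacency + torch.eye(adjacency.size(0))
        d = a.sum(dim=1).pow(-0.5)
        self.register_buffer("a_hat", d.unsqueeze(1) * a * d.unsqueeze(0))
        self.linear = nn.Linear(in_features, out_features)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, joints, in_features) -> (batch, joints, out_features)
        return torch.relu(self.linear(self.a_hat @ x))

# Toy usage: 25 joints with 3-D coordinates as input features.
adj = torch.zeros(25, 25)  # a real skeleton would set adj[i, j] = 1 per bone
layer = SkeletonGraphConv(3, 64, adj)
out = layer(torch.randn(8, 25, 3))  # -> (8, 25, 64)
```

In a two-view setting like the one described above, one such stack would process joint coordinates and another would process bone vectors (differences between connected joints) before fusion.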
3. The problem of multi-modal video action recognition is studied. To handle the variety of modal information in video and the differences between homogeneous and heterogeneous modalities, a two-stage fusion strategy is proposed that fuses homogeneous and heterogeneous modalities in turn. Specifically, a homogeneous modal fusion network based on adversarial learning and transfer learning and a heterogeneous modal fusion network based on adaptive learning and transfer learning are proposed for the human action recognition task. Experimental results show that the proposed two-stage strategy reduces the differences between homogeneous and heterogeneous modalities and effectively improves recognition performance, and that the proposed fusion networks better capture and mine the specificity, correlation, and complementarity between different modal types.

4. The problem of action recognition on enhanced low-quality video is studied. To protect the privacy of people in the video while still recognizing actions accurately, and to cope with the low quality (resolution) of the video itself, a multi-scale video reconstruction method is proposed. First, a two-dimensional discrete wavelet transform is applied to each video frame to obtain sub-images of different frequency bands (this step is sketched below); then, according to the characteristics of each band, a band-adaptive model is proposed to restore the lost details, and for the restoration of high-frequency details in particular, a generative adversarial network based on the wavelet transform is proposed. Because the modalities in the video contribute differently to the task, the optical flow network is set as the main branch and the RGB network as the auxiliary branch, and a two-stream Transformer-based model is proposed for joint spatio-temporal feature learning. For the fusion of spatio-temporal features, a cross-attention mechanism in a further Transformer in the fusion branch fuses the features output by the two-stream Transformer encoders. Experimental results show that the proposed multi-scale reconstruction method effectively restores the lost details, and that the proposed Transformer-based action recognition network effectively mines consistent and complementary feature representations across modalities.
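As a concrete illustration of the decomposition step only, the sketch below splits a single grayscale frame into one low-frequency and three high-frequency sub-bands with a single-level 2-D discrete wavelet transform, using the PyWavelets library. The Haar wavelet, the frame size, and the function name are illustrative assumptions; the band-adaptive restoration model and the wavelet-based GAN are not reproduced here.

```python
import numpy as np
import pywt

def decompose_frame(frame: np.ndarray, wavelet: str = "haar"):
    """Single-level 2-D DWT of a grayscale frame into four sub-bands."""
    cA, (cH, cV, cD) = pywt.dwt2(frame, wavelet)
    # cA: low-frequency approximation; cH/cV/cD: horizontal, vertical, and
    # diagonal detail bands -- the inputs a band-adaptive model would restore.
    return {"low": cA, "horizontal": cH, "vertical": cV, "diagonal": cD}

frame = np.random.rand(128, 128)  # placeholder for a low-resolution frame
bands = decompose_frame(frame)

# idwt2 inverts the transform; restored sub-bands would be reassembled
# into an enhanced frame the same way.
restored = pywt.idwt2(
    (bands["low"], (bands["horizontal"], bands["vertical"], bands["diagonal"])),
    "haar",
)
```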
Keywords: action recognition, multi-modal, RGB, skeleton, fusion, super-resolution, video