Research On Action Recognition Based On Multi-modal Information

Posted on:2023-03-10

Degree:Master

Type:Thesis

Country:China

Candidate:X H Hu

GTID:2568307025965959

Subject:Computer Science and Technology

Abstract/Summary:

With the rapid development of the Internet and the widespread use of wearable devices, people increasingly use video as a recording medium. Because human action recognition is widely used in real-world applications, a growing body of research focuses on analyzing human actions in videos. With the rapid development of deep neural networks, many video action recognition models based on deep learning have been proposed and have achieved excellent performance. However, existing algorithms still have two main problems. First, they make insufficient use of multi-modal information: they either use only single-modal data or do not train on multi-modal data jointly. Second, multi-precision information is not thoroughly mined. To address these problems, this thesis studies action recognition in videos and proposes two action recognition models, one based on multi-modal information fusion and one based on multi-level features. The research consists of the following two parts:

(1) The use of multi-modal information. Many current video action recognition models use only single-modal data, or train the network on each modality separately, and therefore do not fully exploit the complementarity and correlation of the multi-modal information contained in the data. This thesis proposes a multi-modal feature fusion module that maps each modality to both a shared feature space and a modality-specific feature space, and then enhances the features in the specific feature space according to the similarity of the two modalities in the shared feature space. The fused features not only combine the characteristics of the two modalities but also markedly enhance the invariance between modalities. Extensive comparative and ablation experiments are carried out, and the results show that the proposed model achieves excellent recognition performance compared with other models.

(2) The extraction of multi-precision information. In video action recognition, the model must understand both long-term and short-term events accurately, because action segments differ in duration. Many current models build only single-level features and do not model multi-precision information explicitly. To solve this problem, this thesis proposes a model based on multi-level features. Features at different levels are constructed explicitly through different degrees of feature aggregation to represent information at different precisions, and a corresponding sub-network processes the features at each level. Finally, the outputs of all sub-networks are aggregated to obtain the final prediction. A complete set of comparative experiments is carried out, and the results show that multi-level features significantly improve model performance compared with single-level features.
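
The two ideas summarized above can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the abstract gives no formulas, so every concrete choice here is an assumption made for illustration. These include the linear projections with random stand-in weights, the enhancement rule that scales specific features by (1 + cosine similarity), fusion by concatenation, non-overlapping average pooling to build levels, a temporal max followed by a linear head as the sub-network, and averaging of class probabilities. All function and parameter names are hypothetical.

```python
import math
import random

def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def cosine(u, v):
    """Cosine similarity, with a small epsilon to avoid division by zero."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / (norm + 1e-8)

def random_matrix(rows, cols, rng):
    """Stand-in for learned weights (assumption: a real model would train these)."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]

def fuse(x_a, x_b, shared_a, shared_b, specific_a, specific_b):
    """Fuse two modality vectors via shared and specific projections."""
    s_a, s_b = matvec(shared_a, x_a), matvec(shared_b, x_b)      # shared space
    p_a, p_b = matvec(specific_a, x_a), matvec(specific_b, x_b)  # specific spaces
    # Assumed enhancement rule: scale specific features by (1 + similarity).
    gain = 1.0 + cosine(s_a, s_b)
    # Assumed fusion rule: concatenate the enhanced specific features.
    return [gain * v for v in p_a] + [gain * v for v in p_b]

def pool_time(frames, window):
    """Average non-overlapping windows of frames to build a coarser level."""
    pooled = []
    for start in range(0, len(frames), window):
        chunk = frames[start:start + window]
        pooled.append([sum(col) / len(chunk) for col in zip(*chunk)])
    return pooled

def softmax(z):
    """Numerically stable softmax over a list of scores."""
    m = max(z)
    e = [math.exp(v - m) for v in z]
    total = sum(e)
    return [v / total for v in e]

def predict_multilevel(frames, windows, heads):
    """Run one stand-in sub-network per level and average class probabilities."""
    level_probs = []
    for window, head in zip(windows, heads):
        level = pool_time(frames, window)
        # Stand-in sub-network: temporal max followed by a linear classifier.
        clip = [max(col) for col in zip(*level)]
        level_probs.append(softmax(matvec(head, clip)))
    n = len(level_probs)
    return [sum(p[c] for p in level_probs) / n for c in range(len(level_probs[0]))]

# Demo with random data: two modality vectors, then an 8-frame feature sequence.
rng = random.Random(0)
x_a = [rng.uniform(-1, 1) for _ in range(6)]
x_b = [rng.uniform(-1, 1) for _ in range(4)]
fused = fuse(x_a, x_b,
             random_matrix(3, 6, rng), random_matrix(3, 4, rng),
             random_matrix(5, 6, rng), random_matrix(5, 4, rng))
frames = [[rng.uniform(-1, 1) for _ in range(5)] for _ in range(8)]
heads = [random_matrix(3, 5, rng) for _ in range(3)]
probs = predict_multilevel(frames, windows=(1, 2, 4), heads=heads)
print(len(fused), len(probs))
```

In this sketch a window of 1 keeps frame-level detail for short actions, while larger windows smooth the sequence so that longer events dominate. A trained model would replace both the random weights and the stand-in sub-networks.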

Keywords/Search Tags:

Multi-modal fusion, Multi-level ensembling, Action recognition, Deep learning

Related items

1	Multi-modal Human Action Recognition Based On Deep Learning
2	Research On Multi-modal Biometric Identification Method Based On Convolutional Neural Network
3	Research On Human Action Recognition Based On Multimodal Information Fusion
4	Human Action Recognition Algorithm Based On Multi-modal
5	Research On Human Action Recognition Based On Multi-modal Video
6	Research On Multi-modal Human Action Recognition Based On Features Fusion And Attention Mechanisms
7	Multi-modal Emotion Recognition Based On Deep Learning
8	Temporal Action Proposal Generation Based On Multi-modal Information Fusion
9	Action Recognition Method Based On Multi-modal Data Multi-speed Feature Fusion
10	Action Recognition Based On Graph Convolutional Networks