Research On Action Recognition Based On Multi-modal Information

Posted on: 2023-03-10
Degree: Master
Type: Thesis
Country: China
Candidate: X H Hu
GTID: 2568307025965959
Subject: Computer Science and Technology
Abstract/Summary:
With the rapid development of the Internet and the widespread use of wearable devices, people increasingly use video as a recording medium. Because human action recognition has many real-world applications, a growing body of research focuses on analyzing human actions in videos. With the rapid development of deep neural networks, many deep-learning-based video action recognition models have been proposed and have achieved excellent performance. However, existing algorithms still have two main problems. First, they make insufficient use of multi-modal information: they either use single-modal data only or do not train on the different modalities jointly. Second, multi-precision information is not thoroughly mined. To address these problems, this thesis studies action recognition in videos and proposes two action recognition models, one based on multi-modal information fusion and one based on multi-level features. The research consists of the following two parts:

(1) The use of multi-modal information. Many current video action recognition models use only single-modal data, or train a separate network for each modality, which does not fully exploit the complementarity and correlation of the multi-modal information contained in the data. This thesis therefore proposes a multi-modal feature fusion module that maps each modality into both a shared feature space and a modality-specific feature space, and then enhances the features in the specific feature space according to the similarity of the two modalities in the shared feature space. The fused features not only combine the characteristics of the two modalities but also markedly enhance the invariance between modalities. Comparative and ablation experiments show that the proposed model achieves excellent recognition performance compared with other models.

(2) The extraction of multi-precision information. In video action recognition, the model must understand both long-term and short-term events accurately, because action segments differ in duration. Many current models build only single-level features and do not model multi-precision information explicitly. To solve this problem, this thesis proposes a model based on multi-level features. Different degrees of feature aggregation are used to explicitly construct features at different levels, each representing information at a different precision, and a corresponding sub-network processes each level. Finally, the outputs of all sub-networks are aggregated to obtain the final prediction. Comparative experiments show that multi-level features significantly improve model performance over single-level features.
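The two ideas above can be illustrated with a small, dependency-free Python sketch. It is not the thesis implementation: the abstract does not give the actual architecture, so the function names, the cosine-similarity gate, the non-overlapping average pooling, and the averaging of sub-network outputs are all assumptions chosen for illustration only.

```python
import math


def project(x, weights):
    """Multiply vector x by a weight matrix given as a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def cosine(a, b):
    """Cosine similarity of two vectors (0.0 if either has zero norm)."""
    dot = sum(p * q for p, q in zip(a, b))
    norm = math.sqrt(sum(p * p for p in a) * sum(q * q for q in b))
    return dot / norm if norm else 0.0


def fuse_modalities(x_a, x_b, shared_a, shared_b, specific_a, specific_b):
    """Fuse feature vectors from two modalities (hypothetical scheme).

    Each modality is projected into a shared space and a specific space.
    The similarity measured in the shared space drives how strongly the
    specific features are enhanced before concatenation.
    """
    similarity = cosine(project(x_a, shared_a), project(x_b, shared_b))
    gate = (1.0 + similarity) / 2.0  # assumption: map [-1, 1] onto [0, 1]
    enhanced_a = [(1.0 + gate) * v for v in project(x_a, specific_a)]
    enhanced_b = [(1.0 + gate) * v for v in project(x_b, specific_b)]
    return enhanced_a + enhanced_b  # list concatenation of both modalities


def aggregate(frames, window):
    """Average per-frame feature vectors over non-overlapping windows."""
    levels = []
    for start in range(0, len(frames), window):
        chunk = frames[start:start + window]
        levels.append([sum(col) / len(chunk) for col in zip(*chunk)])
    return levels


def multi_level_predict(frames, windows, heads):
    """Build one feature level per window size, run the matching
    sub-network (any callable returning class scores) on each level,
    and average the per-level scores into the final prediction."""
    scores = [head(aggregate(frames, w)) for w, head in zip(windows, heads)]
    return [sum(col) / len(scores) for col in zip(*scores)]


if __name__ == "__main__":
    identity = [[1.0, 0.0], [0.0, 1.0]]
    fused = fuse_modalities([1.0, 0.0], [0.0, 1.0],
                            identity, identity, identity, identity)
    print(fused)

    frames = [[1.0], [3.0], [5.0], [7.0]]
    mean_head = lambda feats: [sum(f[0] for f in feats) / len(feats)]
    print(multi_level_predict(frames, [1, 2, 4], [mean_head] * 3))
```

In a real model the projections and sub-networks would be learned layers and the gate would likely be a learned function; the sketch only shows the data flow: shared-space similarity modulating specific-space features, and several aggregation levels feeding separate heads whose outputs are combined.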
Keywords/Search Tags: Multi-modal fusion, Multi-level ensembling, Action recognition, Deep learning