With the rapid development of computer vision and machine learning, action analysis has shifted from identifying current action states to predicting future ones. Action recognition and action prediction are both major research areas in computer vision, with applications in video surveillance, human-computer interaction, and autonomous driving. While action recognition infers human actions from complete executions, action prediction anticipates them from incomplete executions. A single modality, however, often fails to provide sufficient information for accurate prediction; multimodal fusion combines information from multiple modalities so that they complement one another, improving prediction accuracy and enhancing model robustness. This thesis analyzes the current state of research on multimodal-fusion-based human action recognition and prediction at home and abroad, and investigates three main approaches: human action recognition based on depth and inertial sensors, human action recognition based on neural architecture search, and human motion prediction based on the Transformer. The main contributions are as follows.

(1) To address the inability of feature-level and decision-level fusion performed at a single level or stage to map real semantic information from the data to the classifier, a human action recognition method based on depth and inertial sensors is proposed. First, the depth and inertial data are preprocessed, and the processed data are transformed into multimodal inputs through local ternary patterns. Then features are extracted from each modality with a residual network, and discriminative correlation analysis is used for feature-level fusion, since it maximizes the correlation between corresponding features in the two feature sets while eliminating the correlation between different categories within each set. Finally, experimental results on two public datasets demonstrate that the proposed multimodal fusion method effectively improves recognition accuracy.

(2) To address the inability of manually designed networks to effectively exploit temporal features of different periods during shallow and deep feature extraction, a human action recognition method based on neural architecture search is proposed. First, neural architecture search selects features between and within modalities from pre-trained single-modality backbones. Then, to reduce the model's memory usage and accelerate the search, the mixture of candidate operations is applied only to a subset of channels when searching for units; each search unit automatically reconstructs a candidate operation in the search pool, so the reconstructed search units have different structures. Finally, experimental results on three public datasets demonstrate that the proposed method significantly improves recognition accuracy in human action recognition.

(3) To address error accumulation during motion prediction and to avoid the discontinuity between the last observed frame and the first predicted frame, a human motion prediction method based on the Transformer is proposed. To better capture the smoothness of human motion, spatial trajectories are used to model the human motion sequence, and the temporal information is encoded with the discrete cosine transform. In addition, a multi-head attention layer is added at the end of the Transformer model so that the decoder can predict rapidly and accurately across all learned frames. Finally, experimental results on public datasets show that the proposed method significantly reduces the Euler angle error and the mean per joint position error.
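As a sketch of the preprocessing step in method (1): local ternary patterns code each neighbor of a pixel as +1, -1, or 0 against a threshold band around the center, then split the ternary code into upper and lower binary pattern maps. The function name and the threshold `t=5` below are illustrative assumptions, not the thesis's exact implementation:

```python
import numpy as np

def ltp_maps(image, t=5):
    """Local ternary patterns: each of a pixel's 8 neighbors is coded
    +1 (above center + t), -1 (below center - t), or 0; the ternary
    code is then split into the conventional upper (+1 bits) and
    lower (-1 bits) binary pattern maps."""
    h, w = image.shape
    upper = np.zeros((h, w), dtype=np.uint8)
    lower = np.zeros((h, w), dtype=np.uint8)
    # 8-neighborhood offsets, clockwise from the top-left corner
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    img = image.astype(int)
    for i in range(1, h - 1):
        for j in range(1, w - 1):
            c = img[i, j]
            up = lo = 0
            for k, (di, dj) in enumerate(offsets):
                n = img[i + di, j + dj]
                if n > c + t:
                    up |= 1 << k       # neighbor clearly brighter
                elif n < c - t:
                    lo |= 1 << k       # neighbor clearly darker
            upper[i, j] = up
            lower[i, j] = lo
    return upper, lower
```

The tolerance band is what makes the ternary pattern more robust to sensor noise than a plain local binary pattern, which matters for noisy depth maps and inertial signal images.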
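The channel-subset trick in method (2) can be illustrated with a minimal numpy sketch, assuming a partial-channel scheme in the spirit of PC-DARTS: the softmax-weighted mixture of candidate operations is applied to only 1/k of the channels, while the rest bypass the mixture, cutting memory and search time. The candidate operations here are stand-in lambdas, not the convolutions a real search pool would contain:

```python
import numpy as np

def partial_channel_mixed_op(x, alphas, ops, k=4):
    """Apply the softmax(alphas)-weighted mixture of candidate
    operations to the first 1/k of the channels only; the remaining
    channels pass through unchanged.
    x: (C, T) feature map with C channels, T time steps."""
    c = x.shape[0]
    c_sel = c // k
    selected, bypass = x[:c_sel], x[c_sel:]
    w = np.exp(alphas - alphas.max())
    w /= w.sum()                      # softmax over candidate operations
    mixed = sum(wi * op(selected) for wi, op in zip(w, ops))
    return np.concatenate([mixed, bypass], axis=0)
```

For example, with two candidates (identity and a doubling op) and equal architecture weights, the selected channels become their average while the bypassed channels are untouched.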
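The DCT encoding and the padding that prevents a jump at the last observed frame, as used in method (3), can be sketched as follows. This is a minimal illustration of the standard trick (pad with the last observed frame, keep a truncated set of DCT-II coefficients per joint trajectory); the function names and shapes are assumptions for the example:

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II basis matrix: row k, sample i."""
    i = np.arange(n)
    m = np.cos(np.pi * np.outer(np.arange(n), 2 * i + 1) / (2 * n))
    m[0] *= 1.0 / np.sqrt(n)
    m[1:] *= np.sqrt(2.0 / n)
    return m

def encode_trajectories(observed, horizon, n_coeffs):
    """Pad the sequence by repeating its last observed frame over the
    prediction horizon, then keep the first n_coeffs DCT coefficients
    of every joint trajectory. Padding ties the predicted segment to
    the last observation, avoiding a first-frame discontinuity.
    observed: (T, J) array of T frames, J joint coordinates."""
    t_obs, _ = observed.shape
    padded = np.concatenate(
        [observed, np.repeat(observed[-1:], horizon, axis=0)], axis=0)
    basis = dct_matrix(t_obs + horizon)
    return basis[:n_coeffs] @ padded          # (n_coeffs, J)

def decode_trajectories(coeffs, total_len):
    """Inverse transform: smooth trajectories from truncated coefficients."""
    basis = dct_matrix(total_len)
    return basis[:coeffs.shape[0]].T @ coeffs  # (total_len, J)
```

Because the basis is truncated, the decoded trajectories are smooth by construction, which is why the representation suits modeling the continuity of human motion.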