| Human motion prediction based on 3D skeleton data is a classic research topic in computer vision. It aims to predict future human motions from captured ones and has been applied in various human-machine interaction applications. A human motion is a sequence of human states, each of which consists of the 3D positions of the joints of the human skeleton in the camera coordinate system at a given moment. Current research mainly uses deep learning frameworks to design prediction models. A prediction model is generally divided into two parts: an encoder and a decoder. The former extracts the motion features of a captured human motion; the latter uses these features to perform motion prediction. Although existing prediction models perform well, three main problems remain: 1) insufficient study of the spatial structure of the human skeleton; 2) failure to capture the laws of human motion; and 3) weak anti-interference ability and a lack of semantic commonality learning ability. This dissertation addresses these problems, aiming to further improve the performance of prediction models. The main contributions of this dissertation are as follows. To address the first problem, a graph neural network based on a directed acyclic skeleton space representation is proposed. First, a directed acyclic skeleton space representation method is designed. The method regards a human state as a directed acyclic graph (DAG), whose nodes and directed edges represent the joints and bones in the human state, respectively. The method obtains the 3D position information of joints and the spatial attributes (i.e., length and orientation) of bones simultaneously. Then, a directed acyclic graph neural network (DA-GNN) is designed, whose encoder consists of multiple encoding modules. In an encoding module, a DAG update unit extracts the space-dependent features of the joints and bones of each human state, and explores the linkage relationship between joints and bones 
to update the space-dependent features. Then, a sequential dimension reduction unit reduces the sequential dimension of the updated space-dependent features. It can be seen that the DAG update unit considers not only the spatial dependencies of joints and bones but also their linkage relationship, thus sufficiently studying the spatial structure of the human skeleton. Multiple encoding modules perform the above update and reduction processes step by step to obtain the final motion features. Based on the motion features, the decoder cyclically uses a directed acyclic gated recurrent unit and multilayer perceptrons to perform human motion prediction. The experimental results show that DA-GNN achieves effective predictions on mainstream human motion datasets. To address the second problem, a semantic correlation attention-based multi-order multi-scale feature fusion network (SCAFF) is proposed. The encoder of SCAFF uses a multi-order difference calculation module to calculate the multi-order difference information of the joints’ 3D position information and the bones’ spatial attributes in each human state, and extracts the multi-scale features of the multi-order difference information through multiple graph calculation modules. A graph calculation module extracts the space-dependent features of joints and bones from the difference information of one order, and uses a semantic correlation attention (SCA) unit to capture the semantic correlations between body parts (i.e., joints and bones) and motion time to improve the space-dependent features. The semantic correlations describe the movements of each body part at different times, thus effectively characterizing the laws of human motion. Then, the module learns the temporal dynamics of the space-dependent features and reduces their dimension, thus obtaining the output features. Afterward, multi-order feature fusion modules and multi-scale feature fusion modules are used to fuse the output features of all graph calculation modules, thus attaining the final motion features. Based 
on the motion features, the decoder cyclically uses a compound gated recurrent unit and multilayer perceptrons to perform human motion prediction; meanwhile, residual connections are introduced to stabilize the prediction. The experimental results show that SCAFF achieves excellent performance in predicting human motions of different categories. To address the third problem, a self-supervised pretraining and fine-tuning method for prediction models is proposed. First, an encoder is constructed using the DAG update units and SCA units designed in this dissertation. Then, self-supervised pretraining is performed on the encoder through two pretext tasks, i.e., noise-free action reconstruction and semantic-aware contrastive learning. The former extracts the motion features of a noise-added human motion to reconstruct the noise-free human motion. This pretext task improves the anti-interference ability of the encoder. The latter pulls human motions of the same category closer together while separating human motions of different categories, by minimizing a contrastive loss. This pretext task enables the encoder to learn semantic commonality. Afterward, a decoder is constructed using the compound gated recurrent unit designed in this dissertation. The pretrained encoder and the decoder are combined to form a new prediction model, which is then fine-tuned on a labeled human motion dataset. The experimental results show that this prediction model outperforms state-of-the-art prediction models. It is worth noting that the prediction model can not only sufficiently study the spatial structure of the human skeleton, but also effectively capture the laws of human motion. It also has strong anti-interference ability and semantic commonality learning ability, so it is a high-performance prediction model. Moreover, we apply the high-performance prediction model to human-robot interaction. First, a human motion early recognition 
framework is proposed. The framework feeds part of a human motion into the high-performance prediction model to predict the following part, and then inputs both parts into a motion recognition model to obtain a category recognition result. Then, a human-robot interaction system based on human motion early recognition is designed. In this system, a user performs a motion; a NAO robot recognizes the motion category in advance through the proposed framework and quickly performs the corresponding response action, thus completing the interaction. The experimental results show that the proposed framework improves the response speed of the NAO robot while maintaining high recognition accuracy, so that the human-robot interaction system meets real-time requirements. |
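
The skeleton representation and the multi-order difference information described above can be illustrated with a minimal NumPy sketch. It is not the dissertation's implementation. The five-joint skeleton `BONES`, the function names `bone_attributes` and `multi_order_differences`, and the reading of "multi-order difference" as repeated differences between consecutive frames are all assumptions made for illustration.

```python
import numpy as np

# Hypothetical 5-joint skeleton (an assumption, not the dissertation's layout).
# Each entry is (parent_joint, child_joint); edges point away from the root
# (joint 0), so the resulting graph is directed and acyclic.
BONES = [(0, 1), (1, 2), (1, 3), (1, 4)]


def bone_attributes(joints, bones=BONES, eps=1e-8):
    """Compute bone lengths and unit orientations from joint positions.

    joints: array of shape (..., J, 3) with 3D joint positions.
    Returns (lengths, orientations) with shapes (..., B) and (..., B, 3).
    """
    parents = np.array([p for p, _ in bones])
    children = np.array([c for _, c in bones])
    # A bone is the vector from its parent joint to its child joint.
    vectors = joints[..., children, :] - joints[..., parents, :]
    lengths = np.linalg.norm(vectors, axis=-1)
    # Guard against division by zero for degenerate (zero-length) bones.
    orientations = vectors / np.maximum(lengths[..., None], eps)
    return lengths, orientations


def multi_order_differences(sequence, max_order=2):
    """Return [x, dx, d2x, ...] taken along the time axis (axis 0).

    Assumes the k-th order difference is the difference between
    consecutive frames applied k times, so each order has one frame fewer.
    """
    diffs = [sequence]
    for _ in range(max_order):
        diffs.append(np.diff(diffs[-1], axis=0))
    return diffs
```

For a motion stored as an array of shape `(T, J, 3)`, `bone_attributes(motion)` yields per-frame bone lengths and orientations, and `multi_order_differences(motion)` yields the positions together with their first- and second-order frame differences, which can be read as velocity-like and acceleration-like signals.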