| Human action prediction has important applications in human-computer interaction, service robots, intelligent monitoring, and other scenarios. It aims to predict information about ongoing or imminent actions before they are completed. Human action has multi-level characteristics, including the pose level and the action-semantic level, and prediction tasks at different levels are complementary. The pose level is a fine-grained prediction task that predicts the future spatio-temporal trajectories of human body joints. The action-semantic level is a coarse-grained prediction task that predicts the semantic labels of future actions. Predictions at the pose and action-semantic levels are therefore the two main components of human action prediction. Targeting multi-level human action, this research carries out a series of works on multi-level human action prediction, including pose-level fine-grained prediction, action semantic-level coarse-grained prediction, and joint pose-level and action semantic-level coarse-grained and fine-grained action prediction. For prediction at the pose level, the human motion sequence is typical time-series data, and the core research question is how to extract a robust spatio-temporal representation of human motion for human pose prediction. Current research shows that spatial and temporal information have different characteristics yet interact with each other. For example, spatial information reflects the static aspects of human action, temporal information reflects the dynamics of action-related regions evolving over time, and spatio-temporal information can be regarded as an organic whole in which the two are coupled in a high-dimensional spatio-temporal space. In summary, how to build a spatio-temporal relationship model that better handles the different spatial and temporal characteristics and their interactions is an urgent problem. For the prediction of the 
action-semantic level, unlike the traditional action recognition task, action semantic-level prediction takes as input a partial video containing only part of the action and predicts its action semantics as early as possible. Because the observed information is limited, human action-semantic prediction faces the challenges of scarce useful information and highly ambiguous action semantics. Therefore, how to reduce the action-semantic ambiguity of partial videos is the core research question of human action-semantic prediction. To address these difficulties, this research conducts a series of studies progressing from fine granularity to coarse granularity and then to joint fine and coarse granularity, including pose-level fine-grained action prediction, action semantic-level coarse-grained action prediction, and joint pose-level and action semantic-level multi-level human action prediction. First, for the fine-grained pose-level prediction task, a progressive research approach is adopted. From the perspectives of decoupled, coupled, and semi-coupled spatio-temporal modeling, the spatio-temporal relationship representation of human action is progressively improved to extract a robust spatio-temporal representation and thus better realize pose-level prediction. Second, for action semantic-level coarse-grained prediction, this research studies how to use the action-semantic consistency knowledge shared by partial videos with different observation ratios to reduce the action-semantic ambiguity of partial videos with lower observation ratios, and thus better achieve action semantic-level prediction. Finally, for the coarse-grained and fine-grained action prediction tasks together, this research studies how to jointly realize pose-based and action semantic-based multi-level human action prediction 
tasks. By combining the coarse-grained and fine-grained action prediction tasks, the model can use information about future human motion and future action semantics to improve action-semantic prediction and human motion prediction, respectively, so that the two tasks complement each other. The main research contents and innovations of this research are summarized as follows. (1) For the different characteristics of space and time, this research proposes decoupled spatio-temporal relationship representation-based human motion prediction. By modeling space and time separately, the model can capture the spatial and temporal characteristics well and thus extract better spatio-temporal features. Specifically, this study proposes pseudo-image sequence evolution-based human motion prediction. First, we convert the human motion sequence into a pseudo-image sequence and build spatial and temporal convolutional modules to model space and time separately. Then, we build a hierarchical framework to model the multi-scale and coupled spatio-temporal features of the human motion sequence. Experimental results on the FNTU and G3D datasets show the effectiveness of the decoupled spatio-temporal relationship modeling, and the non-recursive prediction of future poses effectively alleviates error accumulation. (2) For the interactions between space and time, this research proposes a coupled spatio-temporal relationship representation-based human motion prediction model. The 3D skeletal sequence is represented as a 3D tensor with the joints as width, the dimension of the joints as height, and time as depth. The research extracts coupled spatio-temporal features of the human motion sequence by exploiting CNNs, whose convolutions cover the width, height, and depth of the 3D tensor simultaneously. In terms of methodology, this research proposes a novel TrajectoryCNN, which models trajectory transformation, coupled spatio-temporal representation, and the prediction 
of future poses end to end. This method introduces a new trajectory space and transforms the human motion sequence from the position space to the trajectory space. The new space helps encode long-term information and mine coupled spatio-temporal trajectory information for long-term prediction. The method is evaluated on multiple diverse datasets, including Human3.6M, CMU-Mocap, and 3DPW. It achieves state-of-the-art performance, showing the effectiveness of coupled spatio-temporal and long-term modeling. Moreover, the coupled spatio-temporal model can be adapted to other video understanding tasks. (3) For the different characteristics of space and time and their interactions, this research proposes a semi-coupled spatio-temporal relationship representation-based human motion prediction model. A new semi-coupled spatio-temporal learning mechanism is built from a decoupled strategy and a merged strategy. The decoupled strategy lets the model focus on the characteristics of space and time separately, while the merged strategy lets the model focus on the interactions between space and time. In terms of methodology, this research proposes a Multi-Scale Semi-coupled spatio-temporal Learning network (MSSL), which integrates the semi-coupled spatio-temporal learning mechanism with a hierarchical framework. In this way, the model repeatedly applies semi-coupled spatio-temporal modeling to capture both the semi-coupled and the multi-scale spatio-temporal features of the whole human motion sequence. The method is evaluated on multiple diverse datasets, including Human3.6M, CMU-Mocap, and 3DPW. It outperforms the decoupled and coupled spatio-temporal models in both short-term and long-term prediction, which shows the effectiveness of semi-coupled spatio-temporal modeling. This method can inspire researchers in sequential analysis to explore more robust spatio-temporal representations, so as to better achieve 
downstream tasks. (4) For the limited information and highly ambiguous semantics, this research proposes exploiting the spatio-temporal action-semantic consistency of partial videos with arbitrary observation ratios to predict action-semantic labels. Existing works ignore that partial videos of the same human motion sequence share the same action semantics under different observation ratios. Using the relationship modeling advantage of graph convolution, this research builds an Action Semantic Consistency learning Network (ASCNet) for arbitrary observation ratios under a teacher-student framework. Taking partial videos with different observation ratios as nodes and the action-semantic relationships between them as edges, we build action-semantic consistency graphs of partial videos with different observation ratios. During optimization, the model mines the complete action-semantic knowledge from the full video and transfers it to the partial videos via a distillation loss, which helps reconstruct the missing action knowledge of the partial videos. Experimental results on the UCF101, HMDB51, and Sthsth-v2 datasets show that using the action-semantic consistency knowledge of partial videos with different observation ratios improves the performance of the early action prediction model. (5) Because the coarse-grained and fine-grained action prediction tasks complement each other, this research proposes joint pose-based and action semantic-based multi-level human action prediction. Most existing works model human motion prediction and action-semantic prediction separately, which ignores their complementary nature and thus limits their applications. For example, human motion prediction can promote the early prediction of action semantics, and vice versa. Using convolutional neural networks, this research studies a unified framework, Multihead Trajectory CNN, jointly modeling coarse-grained and 
fine-grained prediction tasks at the pose and action-semantic levels. The proposed method is evaluated on the NTU RGB+D dataset, and the results verify the effectiveness of the proposed framework for coarse-grained and fine-grained multi-level action prediction. The experimental results show that human motion prediction can effectively assist action-semantic prediction, and that action-semantic prediction can also help human motion prediction to a certain extent. |
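
To make the teacher-student idea in contribution (4) more concrete, the following is a minimal sketch of how knowledge from a full video might be transferred to partial videos through a distillation loss. It is not the ASCNet implementation: the abstract does not specify the loss, so the temperature-softened KL divergence used here is an assumed, commonly used form, and the function names (`partial_video`, `softmax`, `distillation_loss`), the temperature value, and the logits are all hypothetical.

```python
import math


def partial_video(frames, ratio):
    """Keep the first `ratio` fraction of the frames (at least one frame)."""
    n = max(1, math.ceil(len(frames) * ratio))
    return frames[:n]


def softmax(logits, temperature=1.0):
    """Temperature-softened softmax over a list of class logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """Assumed loss form: T^2 * KL(teacher || student) on softened outputs."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return temperature ** 2 * kl


# Hypothetical class logits: the teacher sees the full video, while the
# students see partial videos with low and high observation ratios.
teacher = [4.0, 1.0, 0.5]
student_low = [1.2, 1.0, 0.9]   # ambiguous prediction from few frames
student_high = [3.5, 1.0, 0.6]  # closer to the teacher

loss_low = distillation_loss(teacher, student_low)
loss_high = distillation_loss(teacher, student_high)
```

Under these assumptions, the student with the lower observation ratio is penalized more strongly because its softened class distribution is further from the teacher's, which is the pressure that pushes partial-video predictions toward the semantics of the full video.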