
End-to-end Modeling Method Of Action Prediction

Posted on: 2024-04-25 | Degree: Master | Type: Thesis
Country: China | Candidate: X F Liu | Full Text: PDF
GTID: 2568306944967419 | Subject: Degree of mechanical engineering
Abstract/Summary:
Robots have wide-ranging application and research value in industry and daily life. As an important component of the robotics industry, service robots have developed rapidly in recent years. Human action prediction, a fundamental technology for human-machine interaction in service robot systems, aims to predict the semantic category of a human action while it is still in progress, enhancing the interactive experience of intelligent systems. This research focuses on action prediction from RGB video streams: using RGB video streams captured in real scenes, deep learning algorithms analyze the spatiotemporal information of human motion contained in the streams and predict their semantic categories. Action prediction is challenging, however, because an incomplete motion sequence carries limited information and because the appearance and motion patterns at different stages of different actions are ambiguous, making efficient and accurate semantic prediction from evolving video motion difficult. Based on computer vision and deep learning technology, this research studies the key issues in action prediction. The specific content is as follows:

(1) Current action prediction methods adopt a two-stage approach, which complicates modeling and separates feature extraction from prediction, so the spatiotemporal features learned by the model deviate from the prediction task itself. To solve this problem, this research proposes an end-to-end spatiotemporal coupling modeling method for semantic prediction of incomplete video actions using a convolutional neural network (CNN) + Long Short-Term Memory (LSTM) architecture, aiming to achieve fine-grained spatiotemporal information extraction of actions through reasonable input preprocessing and 2D CNN design, and to integrate the global spatiotemporal information of evolving actions using the LSTM. First, addressing the incompleteness and evolving nature of the video, a more efficient and reasonable input processing scheme is realized through analysis and
experimental verification of video redundancy and improved input preprocessing, making it suitable for dynamically evolving motion sequences and an end-to-end architecture. Then, using static frames and RGB differences, the static and dynamic information in the video scene are extracted respectively through a deep 2D CNN architecture; after fusion and further deep layers, deeply fused features with stronger representation ability are obtained. Finally, to capture the global spatiotemporal information of the evolving action in the partial video, an LSTM fuses the local spatiotemporal features to produce predictions from the observed global context. To verify the proposed end-to-end CNN+LSTM model, comparative experiments were conducted against existing mainstream two-stage action prediction methods; the proposed model exceeded their prediction accuracy, demonstrating its effectiveness.

(2) To address the relatively weak ability of convolutional neural networks and LSTMs to model global information, this research proposes an attention-based spatiotemporal coupling method that enhances local and historical global spatiotemporal features, aiming to further improve their robustness and representation power. Specifically, to increase the robustness of the local spatiotemporal features extracted by the 2D CNN, channel attention is computed on the deep features in the local spatiotemporal feature extraction module, suppressing low-activation responses and amplifying high-activation ones to obtain more powerful local spatiotemporal features. Using a Transformer module based on the self-attention mechanism, the local features, embedded via input embedding and positional embedding, are input
into a parallel Transformer to model global spatiotemporal features in a multi-scale manner. The 'CNN+Transformer' model is compared with state-of-the-art methods and with the 'CNN+LSTM' model of Chapter III, verifying the effectiveness of the attention mechanism. In addition, ablation experiments verify the effects of different scales of local spatiotemporal features.
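The input preprocessing described in contribution (1) can be illustrated with a minimal NumPy sketch: from the observed prefix of a partially completed video, frames are sampled uniformly to reduce redundancy, yielding a static-frame stream (appearance) and an RGB-difference stream (motion). The function name, sampling scheme, and shapes here are illustrative assumptions, not the thesis implementation.

```python
import numpy as np

def preprocess_partial_video(frames, obs_ratio, n_samples=8):
    """Build the two input streams from a partially observed video.

    frames:    (T, H, W, 3) RGB video array
    obs_ratio: fraction of the full action observed so far, in (0, 1]
    Returns (static_frames, rgb_diffs), each of shape (n_samples, H, W, 3).
    """
    t_obs = max(2, int(len(frames) * obs_ratio))      # observed prefix only
    observed = frames[:t_obs].astype(np.float32)
    # Uniformly sample indices over the observed prefix (redundancy removal).
    idx = np.linspace(0, t_obs - 2, n_samples).astype(int)
    static_frames = observed[idx]                     # appearance stream
    rgb_diffs = observed[idx + 1] - observed[idx]     # motion stream (RGB difference)
    return static_frames, rgb_diffs
```

In a full model, each sampled pair would be passed to the 2D CNN's static and dynamic branches before fusion.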
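The LSTM fusion step of contribution (1) amounts to running local spatiotemporal feature vectors through a recurrent cell and taking the final hidden state as a global summary. The following is a single-cell NumPy sketch under assumed shapes and gate ordering; the thesis architecture is deeper and trained end-to-end.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_aggregate(features, Wx, Wh, b):
    """Fuse a sequence of local spatiotemporal features into one global vector
    with a single LSTM cell (gates stacked as [i, f, g, o]).

    features: (T, Din) per-clip feature vectors from the CNN
    Wx: (4*D, Din), Wh: (4*D, D), b: (4*D,)
    Returns the final hidden state h of shape (D,).
    """
    D = Wh.shape[1]
    h, c = np.zeros(D), np.zeros(D)
    for x in features:
        z = Wx @ x + Wh @ h + b
        i, f, g, o = np.split(z, 4)
        c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)  # update cell memory
        h = sigmoid(o) * np.tanh(c)                   # expose gated output
    return h  # global spatiotemporal summary of the observed sequence
```

The final `h` would feed a classifier that outputs the predicted action category.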
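The channel attention of contribution (2), which suppresses low-activation channels and amplifies high-activation ones, can be sketched in squeeze-and-excitation style: global average pooling produces per-channel statistics, a small gating network maps them to weights in (0, 1), and the feature map is reweighted channel-wise. The weight shapes and gating network here are assumptions for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_attention(feat, w1, w2):
    """Squeeze-and-excitation style channel attention.

    feat: (C, H, W) deep feature map
    w1:   (C // r, C) reduction weights, w2: (C, C // r) expansion weights
    Returns the channel-reweighted feature map, same shape as feat.
    """
    squeeze = feat.mean(axis=(1, 2))                       # global average pool -> (C,)
    gates = sigmoid(w2 @ np.maximum(w1 @ squeeze, 0.0))    # per-channel gates in (0, 1)
    return feat * gates[:, None, None]                     # suppress weak, keep strong channels
```

The Transformer stage would then consume these reweighted local features, after input and positional embedding, to model global spatiotemporal relations at multiple scales.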
Keywords/Search Tags:action prediction, end-to-end model, convolutional neural network, Long Short-Term Memory network, attention mechanism