
Action Recognition Based On Human Skeleton Graph Convolution And Image Convolution Fusion

Posted on: 2021-01-26
Degree: Master
Type: Thesis
Country: China
Candidate: W Q Zheng
GTID: 2428330602483369
Subject: Control Science and Engineering
Abstract/Summary:
Computer vision is an important field of artificial intelligence, and human action recognition plays an important role within it, attracting growing attention and offering broad application prospects. In recent years, with the development of deep neural networks and the growth of computing power, deep learning has become the dominant approach to computer vision problems. Convolutional neural networks in particular have achieved great success in recognizing and classifying still images, but they have not shown a comparably clear advantage in action recognition in videos. Current action recognition methods mainly comprise the Two-Stream method, the 3D convolution method, and human skeleton-based methods, each with its own advantages and disadvantages. The Two-Stream method achieves high recognition accuracy, but its features are drawn from the entire video frame and pay too little attention to the human action itself, so its accuracy on particular datasets is only moderate. The 3D convolution method has a simple model and good real-time performance, but its recognition accuracy is lower. Skeleton-based methods discard other information in the video and focus only on human movement; the extracted features are few, but they are more targeted and convincing. Their drawback is that skeleton extraction is not very accurate, which generally leads to a low recognition rate. To address these problems, this thesis studies action recognition methods that fuse human skeleton and image information from videos, retaining the targeted focus of skeleton-based methods while using image information to improve recognition accuracy.

First, the thesis presents the research background of action recognition, its theoretical significance, and its practical application prospects in the context of artificial intelligence development, and reviews the current state of research and the open problems in this direction in China and abroad.

Second, action recognition based on the spatial temporal graph convolutional network (ST-GCN) model is studied. The spatial temporal graph model uses graph convolution to process the relationships between skeleton joints within a single frame, and temporal convolution to learn how the features of corresponding joints change across adjacent frames. The human skeleton in each video is extracted with the OpenPose pose estimation algorithm, the skeleton spatial temporal graph is constructed from the multi-frame skeleton graphs and used as input, and the network is trained end to end. The UCF-101 dataset is used for training and testing, and a special UCF-31 dataset is constructed for comparison experiments. The experiments verify that the spatial temporal graph convolution model recognizes actions well when skeleton extraction is good. Further comparison experiments analyze the advantages of the spatial temporal graph convolution model over the optical flow method: when video brightness changes, the recognition accuracy of the optical flow method drops sharply, whereas the performance of the spatial temporal graph convolution model is almost unaffected, which demonstrates good robustness.

Finally, following the idea of the original two-stream method, a new two-stream model with a skeleton stream and an image stream is constructed by fusing the spatial temporal graph convolution model based on human skeleton information with a convolution model based on video image information. Action recognition based on image information is studied first. An action spans many frames of a video, so for conventional 2D image convolution a single randomly extracted frame is often not enough to represent the key features of the whole action, and recognition performance is mediocre. Drawing on the sparse sampling strategy of the temporal segment network, each input video is divided into segments and randomly sampled to form a sparse set of video frames that jointly determine the recognition result. Image-based action recognition captures richer scene information, which is precisely the weak point of skeleton-based action recognition. Therefore, following the two-stream idea, late fusion is used to combine the skeleton-based and image-based action recognition models into a new two-stream model. Experiments verify the validity of the proposed model, and comparison with other action recognition methods is used to analyze its advantages and disadvantages.
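
The abstract's single-frame graph convolution over skeleton joints can be illustrated with a minimal sketch. It is not the thesis implementation. It assumes the commonly used symmetric normalization D^-1/2 (A + I) D^-1/2 and a hypothetical 5-joint chain skeleton (`BONES`), whereas OpenPose skeletons have more joints. The function names are illustrative, and the temporal convolution across frames is omitted.

```python
import math

def normalized_adjacency(num_joints, bones):
    """Return D^-1/2 (A + I) D^-1/2 for an undirected skeleton graph."""
    a = [[0.0] * num_joints for _ in range(num_joints)]
    for i in range(num_joints):
        a[i][i] = 1.0  # self-loop so each joint keeps its own features
    for i, j in bones:
        a[i][j] = a[j][i] = 1.0
    deg = [sum(row) for row in a]
    return [[a[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(num_joints)]
            for i in range(num_joints)]

def matmul(x, y):
    """Plain nested-list matrix product (rows of x times columns of y)."""
    return [[sum(x[i][k] * y[k][j] for k in range(len(y)))
             for j in range(len(y[0]))]
            for i in range(len(x))]

def graph_conv(features, adj_norm, weight):
    """One spatial graph convolution: aggregate neighbours, then project channels."""
    return matmul(matmul(adj_norm, features), weight)

# Hypothetical toy skeleton: 5 joints connected in a chain (assumption).
BONES = [(0, 1), (1, 2), (2, 3), (3, 4)]

adj = normalized_adjacency(5, BONES)
joints_xy = [[float(i), 0.0] for i in range(5)]   # (x, y) per joint, one frame
weight = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]       # 2 input -> 3 output channels
out = graph_conv(joints_xy, adj, weight)          # 5 joints x 3 channels
```

In a full spatial temporal model, a step like this would be applied to every frame and followed by a convolution along the time axis for each joint.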
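
The sparse sampling strategy mentioned in the abstract can be sketched as follows. This is an illustration, not the thesis code. It assumes one frame is drawn uniformly at random from each of `num_segments` equal-length segments, and that per-frame class scores are combined by simple averaging. The function names and the choice of averaging are assumptions.

```python
import random

def sparse_sample(num_frames, num_segments, rng=None):
    """Pick one random frame index from each equal-length segment of a video."""
    if num_frames < num_segments:
        raise ValueError("need at least one frame per segment")
    rng = rng or random.Random()
    # Segment k covers frame indices [bounds[k], bounds[k + 1]).
    bounds = [k * num_frames // num_segments for k in range(num_segments + 1)]
    return [rng.randrange(bounds[k], bounds[k + 1]) for k in range(num_segments)]

def segment_consensus(per_frame_scores):
    """Average the class scores of the sampled frames (assumed consensus rule)."""
    n = len(per_frame_scores)
    return [sum(col) / n for col in zip(*per_frame_scores)]

# Example: a 90-frame video split into 3 segments of 30 frames each.
indices = sparse_sample(90, 3, random.Random(0))
video_scores = segment_consensus([[0.2, 0.8], [0.4, 0.6], [0.6, 0.4]])
```

Because every segment contributes a frame, the sampled set covers the whole action rather than a single moment, which is the motivation given in the abstract.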
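
Late fusion of the skeleton stream and the image stream can be sketched as a weighted average of the two streams' class probabilities. The abstract does not state the fusion rule or weights, so the softmax step, the `skeleton_weight` parameter, and the example scores below are all assumptions made for illustration.

```python
import math

def softmax(scores):
    """Convert raw class scores to probabilities (numerically stable)."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def late_fusion(skeleton_scores, image_scores, skeleton_weight=0.5):
    """Weighted average of the two streams' class probabilities (assumed rule)."""
    p_skel = softmax(skeleton_scores)
    p_img = softmax(image_scores)
    w = skeleton_weight
    return [w * a + (1.0 - w) * b for a, b in zip(p_skel, p_img)]

def predict(fused):
    """Index of the class with the highest fused probability."""
    return max(range(len(fused)), key=fused.__getitem__)

# Made-up scores for three classes from each stream.
fused = late_fusion([2.0, 1.0, 0.1], [0.5, 2.5, 0.3])
label = predict(fused)
```

Each stream is trained and evaluated separately and only the final scores are combined, so either stream can be swapped or reweighted without retraining the other.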
Keywords/Search Tags: action recognition, spatial temporal graph convolution, OpenPose, sparse sampling strategy, two-stream model