
Multi-view Feature Learning Based On Skeleton And Image Data And Its Application In Behavior Recognition

Posted on: 2021-03-24
Degree: Master
Type: Thesis
Country: China
Candidate: D S Guo
GTID: 2428330614465976
Subject: Electronic and communication engineering
Abstract/Summary:
With the development of artificial intelligence and the improvement of computing power, human behavior recognition based on deep learning has become a hot research topic, and it remains a very difficult problem. Because human behavior recognition technology has a wide range of applications in everyday social life, research on it is of great practical value.

Existing behavior recognition methods usually use only single-modal data such as images or skeletons. Images and video contain intuitive scene information but are easily affected by lighting and occlusion. Skeleton node data represents the three-dimensional coordinates of human joint points in each video frame, and it includes both the spatial structure of the skeleton and its temporal dynamics. Skeleton node data is also robust to occlusion and complex background interference, but it lacks appearance details. Image data and skeleton data are therefore highly complementary. In this paper, multi-view feature learning is performed on the two modalities, skeleton and image, and their complementary information is combined to improve the accuracy of behavior recognition. Deep neural network models suited to the characteristics of each of the two kinds of data are studied separately.

For continuous video frame data, because video can be decomposed into image data and optical flow data, we use a dual-stream convolutional neural network architecture to extract the spatiotemporal information of the video. However, the traditional dual-stream network cannot learn long-term spatiotemporal features of the human body in the video. To address this shortcoming, this paper proposes a convolutional recursive fusion method. The method uses a recurrent neural network to model long sequences of video frames and extract their long-term dependencies. At the same time, it combines the convolution operation with the recurrent neural network architecture to fuse the spatiotemporal features output by the dual-stream network, making full use of the complementarity between images and optical flow to learn long-term human movement characteristics in video. In addition, this paper proposes an RNN attention mechanism that allows the network to learn to focus on the areas related to human behavior at different moments.

For skeleton data, graph convolutional networks are more suitable for modeling such non-Euclidean data. The joint points of the human body are connected to form an irregular undirected graph, and the graph convolutional network can extract and combine the local spatial features and the time-series features of the sequence of human key points.

To enable the algorithm to capture the long-term human motion characteristics of behavior videos while also combining posture and joint information to improve recognition accuracy, this paper proposes an efficient dual-stream network for feature learning on skeleton and image data. Because of the huge number of parameters involved, end-to-end training of large-scale neural networks is difficult and slow to converge. In this paper, we first train the convolutional recursive fusion network and the graph convolutional network, and finally fuse the scores of the two. This reduces the difficulty of training and parameter adjustment and improves the overall accuracy. The method is tested on two behavior recognition datasets, UCF101 and HMDB51, and comparison with current mainstream video behavior recognition methods verifies its effectiveness...
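
The abstract says the image-based network and the skeleton-based network are trained first and their scores are fused at the end, but it does not say how the fusion is done. The following is a minimal sketch of one common option, a weighted average of per-class softmax scores. The function names, the `image_weight` parameter, and the example logits are all hypothetical and are not taken from the thesis.

```python
import math


def softmax(logits):
    """Convert raw class scores into probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_scores(image_logits, skeleton_logits, image_weight=0.5):
    """Late fusion: weighted average of the two streams' class probabilities.

    image_weight is an assumed hyperparameter; the thesis abstract does not
    state how the two scores are weighted.
    """
    if len(image_logits) != len(skeleton_logits):
        raise ValueError("both streams must score the same set of classes")
    if not 0.0 <= image_weight <= 1.0:
        raise ValueError("image_weight must lie in [0, 1]")
    p_image = softmax(image_logits)
    p_skeleton = softmax(skeleton_logits)
    return [
        image_weight * a + (1.0 - image_weight) * b
        for a, b in zip(p_image, p_skeleton)
    ]


def predict(fused_scores):
    """Return the index of the class with the highest fused score."""
    return max(range(len(fused_scores)), key=fused_scores.__getitem__)


# Made-up logits for a 3-class toy problem: the image stream favors class 0,
# the skeleton stream strongly favors class 2.
fused = fuse_scores([2.0, 1.0, 0.1], [0.1, 0.2, 3.0])
label = predict(fused)
```

Because each stream is trained on its own and only the output scores are combined, the fusion step adds no trainable parameters beyond the optional weight, which fits the abstract's stated aim of reducing the difficulty of training and parameter adjustment.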
Keywords/Search Tags:Behavior recognition, recurrent neural network, attention mechanism, graph convolution, multi-view feature extraction