Font Size: a A A

Research On Video Action Recognition Method Based On Visual Representation And Deep Neural Networks

Posted on:2021-02-02Degree:MasterType:Thesis
Country:ChinaCandidate:C X LiFull Text:PDF
GTID:2518306050970439Subject:Circuits and Systems
Abstract/Summary:PDF Full Text Request
The innate habit is satisfied,that vision is an important way of human communication and information acquiring,with the development of communication technologies such as the Internet and 5G.Video multimedia has increasingly become the main way of information exchange in manufacturing and everyday life.With the rapid growth of video data,the intelligent perception of video information has become an urgent and important demand.Video action recognition has become a hotspot as one of the core technologies.Deep learning,as an artificial intelligence method,has become the mainstream method for computer vision research due to its ability to automatically learn feature representations from data,especially in recent years when the accuracy of natural image tasks is greatly improved.This paper mainly studies video action recognition method based on deep learning algorithms and video representations learning.The main works are as follows:(1)The video scene can be divided into foreground and background.The subject of motion behavior belongs to the foreground,and the environment of behavior belongs to the background.These two kinds of information have different statistical characteristics,and ultimately affect the behavior category together.We proposed an action recognition method based on foreground segmentation and dual-channel network fusion.The algorithm firstly divides the video into multiple temporal segments and randomly select a frame from each segment as input.Then use the foreground segmentation algorithm to decompose the input into foreground and background,and feed them into spatial network and temporal network respectively,and finally fuse them into the video representation.Experiments show that the foreground can be used to learn an effective video representation,and its fusion of background is better.(2)The 3D convolutional neural networks can directly learn the video representation form video frames.However,It's accuracy is limited to its generalization capability.We therefore study and specifically analyze the influence of network structure and parameter initialization methods on its performance and show that the latter operation seems to be the major factor.And an unsupervised visual representation learning method based on 3D convolutional auto-encoder is proposed.The algorithm firstly takes the 3D model as an encoder and uses the 3D transposed convolution layer to construct the decoder.Then we use the whole 3D auto-encoder to reconstruct the sequence of video frames while making predictions to the subsequent input frames.Experiments show that our methods can help the 3D networks to obtain a better video representation and improve the accuracy of action recognition tasks.(3)Deep learning algorithms often use end-to-end training methods to learn video representations,which lacks clear physical significance.Inspired by the work in the fourth chapter,an unsupervised video representation learning method based on the brightness constancy constraint of optical flow is proposed.The algorithm firstly uses the 3D models to construct auto-encoder.Then we regard the optical flow as an image transformation between adjacent frames and use the auto-encoder to learn it.Specifically,an odd frame sequence of a continuous video sequence is used as an input,the 3D auto-encoder is used to predict its optical flow transformation which exactly is the subsequent even frame sequence.And adversarial training is used to measure the difference between the prediction and input frame sequences.When the adversarial training is completed,we believe the brightness constancy constraint of optical flow is learned by the 3D model.Experiments show that our algorithm can help the 3D convolutional network obtain better video representation without large-scale labeled datasets pre-training,which outperforms existing unsupervised and self-supervised algorithms.
Keywords/Search Tags:action recognition, deep learning, visual representation, unsupervised learning, 3D convolution
PDF Full Text Request
Related items