
Human Action Recognition Via Dual Spatio-temporal Network Flow And Attention Mechanism Fusion

Posted on: 2018-10-30
Degree: Master
Type: Thesis
Country: China
Candidate: Q W Qiao
Full Text: PDF
GTID: 2348330536979538
Subject: Signal and Information Processing
Abstract/Summary:
Human action recognition from videos is one of the important problems in computer vision. The goal is to analyze, extract and represent human behavior information from videos. Inspired by the visual mechanism of the human brain, deep learning frameworks have made great progress in machine learning and opened a new research direction for human action recognition. However, deep learning has several limitations: it requires massive amounts of data, it has many network parameters to learn, and it is not easily tailored to specific tasks. This thesis therefore focuses on exploiting data under limited-data constraints and presents deep neural networks with strong generalization and fewer parameters for recognizing human actions from videos.

In general, a neural network that learns the spatial and temporal relationships in behavior sequences overfits easily when only a small amount of original video data is available. On the basis of the dual-stream hypothesis of the ventral and dorsal pathways in the human visual cortex, this thesis proposes a human action recognition method that fuses dual spatial and temporal network streams within a neural-network framework. The spatial stream performs object recognition on the video frames, while the temporal stream performs motion identification on the corresponding dense optical flow. First, coarse-to-fine Lucas-Kanade optical flow estimation and a Munsell color conversion system are adopted to extract optical flow features that capture motion information from the RGB frames and to convert them into three-channel optical flow maps aligned with the corresponding frames. Then, a GoogLeNet network with transferred model parameters convolves, layer by layer, the original appearance images and the corresponding optical flow feature maps within the selected time window, automatically aggregating edges, corners, lines and other low-level visual cues into high-level visual features for the spatial and temporal streams. These high-level semantic feature sequences of the original images and of the corresponding optical flow maps are then fed into cross-connected multi-layer Long Short-Term Memory (LSTM) recurrent neural networks, and a candidate feature description for each frame in the video window is obtained by decoding the hidden-state layer over the time window. Finally, a softmax classifier computes the class-label probability distribution for each frame of the video sequence, and the category label of the video sequence is decided by the majority principle.

Experiments on the UCF-101 dataset show that, compared with traditional methods, the dual spatial-temporal stream structure improves human behavior recognition from video sequences and attains high recognition accuracy. The spatial and temporal networks cross-transmit the hidden-state parameters of their multi-layer LSTM recurrent networks to prevent over-fitting: the spatial network recognizes objects and supplements the texture information missing from the temporal network, while the temporal network limits the sparseness of the overall network parameters.
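As a concrete illustration, the following is a minimal PyTorch-style sketch of the dual-stream architecture described above, not the thesis implementation: the class and variable names are assumptions, torchvision's ImageNet-pretrained GoogLeNet stands in for the "transferred model parameters", and the cross-transmission of LSTM hidden states between the streams is rendered in one simple form (each stream's final hidden state initializes the other stream's LSTM).

    import torch
    import torch.nn as nn
    from torchvision import models

    class TwoStreamLSTM(nn.Module):
        """Hypothetical two-stream model over RGB frames and flow maps."""
        def __init__(self, num_classes, hidden=512, layers=2):
            super().__init__()
            # Transferred GoogLeNet backbones; the classifier head is replaced
            # so every frame yields a 1024-d high-level feature vector.
            self.cnn_rgb = models.googlenet(weights="DEFAULT")
            self.cnn_rgb.fc = nn.Identity()
            self.cnn_flow = models.googlenet(weights="DEFAULT")
            self.cnn_flow.fc = nn.Identity()
            self.lstm_rgb = nn.LSTM(1024, hidden, layers, batch_first=True)
            self.lstm_flow = nn.LSTM(1024, hidden, layers, batch_first=True)
            self.cls = nn.Linear(2 * hidden, num_classes)

        def forward(self, rgb, flow):  # each (B, T, 3, 224, 224)
            B, T = rgb.shape[:2]
            f_rgb = self.cnn_rgb(rgb.flatten(0, 1)).view(B, T, -1)
            f_flow = self.cnn_flow(flow.flatten(0, 1)).view(B, T, -1)
            # Cross-transmission of hidden-state parameters: each stream's
            # final (h, c) initializes the other stream's LSTM.
            _, (h_r, c_r) = self.lstm_rgb(f_rgb)
            _, (h_f, c_f) = self.lstm_flow(f_flow)
            out_rgb, _ = self.lstm_rgb(f_rgb, (h_f, c_f))
            out_flow, _ = self.lstm_flow(f_flow, (h_r, c_r))
            return self.cls(torch.cat([out_rgb, out_flow], -1))  # (B, T, C)

    def majority_vote(per_frame_logits):
        # Per-frame class decisions combined by the majority principle.
        return per_frame_logits.argmax(-1).mode(dim=1).values  # (B,)

The flow-stream input would be produced offline, e.g. by a coarse-to-fine Lucas-Kanade estimator whose displacement fields are mapped to three channels before being fed to the flow-stream CNN, as the abstract describes.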
Inspired by the cognitive principle that attention often focuses on a certain area in order to accurately capture the salient objects in a given scene, this thesis presents, on the basis of the dual spatial-temporal network stream architecture, a spatial attention selection model that focuses on the significant and salient regions of the given video frames while simulating the human attention-shift mechanism. First, the GoogLeNet deep convolutional neural network in the spatial perception stream generates high-level visual and structural features from the original images. Second, the multi-layer LSTM network in the temporal perception stream decodes the high-level semantic feature sequence of the corresponding optical flow feature maps and outputs a visual descriptor sequence, and the softmax function computes the significance weight coefficient matrix of spatial attention. Third, the feature map sequence of the spatial network stream is weighted by this coefficient matrix to generate a sequence of activation maps, whose high-valued locations form the region of interest that supplies the salient input features for the multi-layer LSTM network in the spatial perception stream. Finally, the softmax classifier in the spatio-temporal perception stream computes the behavior labels of the video sequence. Experiments on the UCF-11 dataset show that the dual spatio-temporal network streams focus on the most significant region of human behavior in the video image and that the current behavior can be distinguished from the background. These advantages reduce the cost of computing region correlations in the video sequence and improve the discriminability of human actions.

In view of the fact that videos usually contain a large number of redundant and confusing frames, this thesis further proposes, on the basis of the dual spatial-temporal network stream architecture, a temporal attention selection mechanism that selects the key frames for distinguishing human actions according to the relevance of each frame to the behavior in the video sequence. First, multi-layer LSTM networks in the spatio-temporal perception streams decode the sequences of original images and corresponding optical flow features and output two sequences of visual descriptors, and the softmax classifier computes the probability distribution matrix of the video sequence in the spatial perception stream. Second, from these two visual descriptor sequences, a relative entropy cost function computes attention confidence scores for human actions; each attention confidence score then multiplies the corresponding column vector of the probability distribution matrix of the spatial perception stream, yielding a scaled probability distribution over action categories for each frame of the video sequence. Finally, the softmax classifier identifies the human action category of the video sequence. Experiments on the UCF-11 dataset show that the temporal attention model selects key frames according to the action confidence scores and the temporal dependency of human behaviors within the given time window. The model eliminates the redundant and confusing frames in the video sequence and achieves better performance in terms of human action recognition accuracy.
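To make the two attention mechanisms concrete, here is a small hedged sketch in the same PyTorch style; the shapes, names, and exact form of the relative-entropy score are assumptions rather than the thesis formulation. In particular, it reads the relative-entropy criterion as assigning high confidence to frames on which the two streams' per-frame distributions agree.

    import torch
    import torch.nn.functional as F

    def spatial_attention(feat_map, h):
        """feat_map: (B, C, H, W) conv feature maps; h: (B, C) LSTM state
        (hidden size assumed equal to the channel count C)."""
        B, C, H, W = feat_map.shape
        feats = feat_map.flatten(2)                    # (B, C, H*W)
        scores = torch.einsum("bc,bcl->bl", h, feats)  # location relevance
        alpha = F.softmax(scores, dim=1)               # significance weights
        # Weighted activation map collapsed to the attended feature vector
        # that feeds the spatial-stream LSTM.
        return torch.einsum("bl,bcl->bc", alpha, feats)

    def temporal_attention(p_rgb, p_flow, logits):
        """p_rgb, p_flow: (B, T, K) per-frame class distributions from the
        two streams; logits: (B, T, K) spatial-stream per-frame scores."""
        eps = 1e-8
        # Relative entropy (KL divergence) between the two streams per frame.
        kl = (p_rgb * (p_rgb.clamp_min(eps).log()
                       - p_flow.clamp_min(eps).log())).sum(-1)  # (B, T)
        conf = F.softmax(-kl, dim=1)        # attention confidence per frame
        scaled = conf.unsqueeze(-1) * F.softmax(logits, dim=-1)
        return scaled.sum(1).argmax(-1)     # video-level action label

Under this reading, redundant or confusing frames produce divergent stream descriptors, receive low confidence, and contribute little to the final video-level decision, which matches the behavior the abstract reports on UCF-11.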
Keywords/Search Tags: optical flow features, deep learning, attention mechanism, CNN, LSTM