A Series-stream Deep Network Model For Video Action Recognition

Posted on: 2020-10-15
Degree: Master
Type: Thesis
Country: China
Candidate: B Wen
Full Text: PDF
GTID: 2428330575494172
Subject: Computer technology
Abstract/Summary:
The purpose of action recognition is to enable computers to understand human actions and respond accordingly. It is both an active research topic and a hard problem in computer vision. Unlike image recognition, action recognition must consider an entire video, which involves many different types of features, so it has not progressed as quickly or become as efficient as image recognition. From the early traditional methods to the deep learning methods of recent years, the speed and accuracy of video action recognition have improved steadily. Among deep learning methods, the two-stream method, which combines spatial and temporal information, is currently the mainstream approach.

Building on the two-stream method, this thesis improves the spatial stream and temporal stream models separately and then connects them in series to form the overall series-stream network model. A single spatial stream and a single temporal stream emphasize different video features: one focuses on spatial information, the other on temporal information. For multi-class classification, the proposed model uses the spatial stream as the first-level classifier. The classification score of the spatial stream model is used to judge whether the spatial features are ambiguous, and therefore whether the video should be passed to the second-level temporal stream classifier. The same rule is then applied to the temporal stream's result to decide whether to enter the third-level spatiotemporal fusion model. This multi-level method adapts the fusion scheme to the characteristics of the input video, which reduces the amount of computation and saves resources while retaining the advantages of the two-stream fusion model.

The spatial stream takes a video frame as input. To improve it, this thesis builds a twin network based on ResNet50 and trains the two branches through iterative interaction to obtain the feature extractor. The classifier is then fine-tuned on top of the trained feature extractor. The resulting spatial stream model learns spatial features more completely, which improves the final classification accuracy.

The temporal stream takes optical flow images as input. As in the classical two-stream method, the optical flow images are computed first, and 20 of them are used as input. To integrate the temporal information across multiple optical flow images and fully exploit their temporal characteristics, this thesis adds two convolution blocks containing 1×1 convolutional layers in front of the basic ResNet50 when constructing the temporal stream model. The 1×1 convolution kernels integrate the timing information across the optical flow images, so that more temporal characteristics are learned and the single temporal stream model is improved.

The overall series-stream deep model is tested on the UCF101 dataset. The improved spatial stream scores 1.21% higher than the original method, and the improved temporal stream scores 1.42% higher. The final model achieves a larger gain of about 6% over a single spatial stream or temporal stream.
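
The three-level decision described in the abstract can be illustrated with a small sketch. The abstract does not state the exact ambiguity rule or the fusion scheme, so the code below assumes both: a prediction is treated as ambiguous when its highest class score falls below a threshold, and the fusion level simply averages the two streams' scores. The names `cascade_predict`, `is_ambiguous`, and `threshold`, and the default value 0.8, are hypothetical and are not taken from the thesis.

```python
from typing import Callable, List, Sequence, Tuple

# A model here is any callable that maps a video to a list of class
# probabilities. Real spatial and temporal networks are stood in for
# by plain functions in this sketch.
Model = Callable[[object], List[float]]


def argmax(scores: Sequence[float]) -> int:
    """Return the index of the highest score."""
    return max(range(len(scores)), key=lambda i: scores[i])


def is_ambiguous(scores: Sequence[float], threshold: float) -> bool:
    """Assumed rule: ambiguous when the top score is below the threshold."""
    return max(scores) < threshold


def cascade_predict(
    video: object,
    spatial_model: Model,
    temporal_model: Model,
    threshold: float = 0.8,  # hypothetical value, not from the thesis
) -> Tuple[int, str]:
    """Classify a video and report which level produced the decision."""
    # Level 1: spatial stream only.
    spatial = spatial_model(video)
    if not is_ambiguous(spatial, threshold):
        return argmax(spatial), "spatial"

    # Level 2: the temporal stream runs only for ambiguous videos.
    temporal = temporal_model(video)
    if not is_ambiguous(temporal, threshold):
        return argmax(temporal), "temporal"

    # Level 3: assumed fusion by averaging the two score vectors.
    fused = [(s + t) / 2 for s, t in zip(spatial, temporal)]
    return argmax(fused), "fusion"
```

Under these assumptions, a video whose spatial score is already confident never triggers the temporal model at all, which is where the reduction in computation mentioned in the abstract would come from. Only videos that stay ambiguous at both levels pay the cost of fusion.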
Keywords/Search Tags: Action recognition, Deep learning, Temporal stream, Spatial stream, Series-stream