
Research On Video Action Recognition Method Based On Spatial-Temporal Feature Fusion And Deep Learning

Posted on: 2020-07-30
Degree: Master
Type: Thesis
Country: China
Candidate: X Y Li
GTID: 2428330602950604
Subject: Circuits and Systems
Abstract/Summary:
With the explosive growth of video data and the development of artificial intelligence, there is an urgent need for a comprehensive intelligent video analysis system. As one of the core technologies of such a system, action recognition has naturally become a research hotspot. Human action recognition has important application value in intelligent video analysis, intelligent transportation systems and medical monitoring, and has broad research prospects.

Deep learning has gradually replaced methods based on hand-crafted features because of its excellent feature extraction ability, and it has achieved great success in image processing. Because action recognition operates on video, and building on that success with images, deep learning has become the mainstream approach in current action recognition research. However, video differs from a static image in that it contains not only static spatial information but also temporal motion information. How to effectively integrate spatial and temporal features is therefore the central difficulty in action recognition research. This thesis studies video action recognition methods based on deep learning and spatial-temporal feature fusion. The main contributions are as follows:

(1) An action recognition algorithm based on a 3D residual network and spatial-temporal feature fusion is proposed. 3D convolution operates over both the spatial and temporal dimensions of video, so it can extract spatial and temporal features from video frames. In addition, the residual network structure is used to reduce the difficulty of network training. Considering that the spatial information extracted from a single frame can already distinguish between different actions, we propose to fuse the spatial-temporal features extracted by the 3D residual network with the purely spatial features extracted by a 2D residual network. While the original temporal features are retained, the ability of the extracted features to represent spatial information is enhanced. Experimental results show that, compared with some existing algorithms, the proposed algorithm achieves a certain degree of improvement in action recognition accuracy.

(2) An action recognition algorithm based on a 3D multi-fiber network and temporal linear coding is proposed. A 3D multi-fiber module replaces 3D convolution for extracting the spatial-temporal features of video, which effectively reduces the number of parameters to be optimized and makes the network model easier to train. In addition, to address the limitation that traditional 3D convolution methods can only extract the spatial-temporal features of local video clips, we propose to add a temporal linear coding layer after the convolution layers of the 3D multi-fiber network to fuse the spatial and temporal features of multiple clips from the same video. A spatial-temporal feature representation of the whole video, covering its long-term structure, can thus be obtained to improve the accuracy of action recognition.

(3) An action recognition algorithm based on temporal segmentation and a (2+1)D convolutional neural network is proposed. Following the idea of temporal segmentation, the algorithm sparsely samples consecutive video frames, preserving the overall temporal information of the video while removing a large amount of redundancy. Using (2+1)D convolution instead of 3D convolution improves the non-linear expressive ability of the network. In addition, the network can effectively learn a spatial-temporal feature representation of long-term structure from the sampled frames. Experimental results show that the efficiency of the algorithm is improved while a high recognition rate is maintained.
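
The sparse sampling described in item (3) can be illustrated with a minimal Python sketch. The abstract does not give the sampling details, so this sketch assumes the common scheme of dividing the video into equal-length segments and taking one frame from each: a random frame during training and the middle frame otherwise. The function name `sample_segment_indices` and its parameters are hypothetical and are not taken from the thesis.

```python
import random


def sample_segment_indices(num_frames, num_segments, train=False, rng=None):
    """Return one frame index per temporal segment (hypothetical helper).

    The video is split into `num_segments` roughly equal segments.
    In training mode a random frame is drawn from each segment;
    otherwise the middle frame of each segment is used.
    """
    if num_frames <= 0 or num_segments <= 0:
        raise ValueError("num_frames and num_segments must be positive")
    rng = rng or random
    indices = []
    for k in range(num_segments):
        # Integer segment boundaries: [start, end)
        start = (k * num_frames) // num_segments
        end = ((k + 1) * num_frames) // num_segments
        # Short videos can yield empty segments; keep at least one frame.
        end = max(end, start + 1)
        if train:
            idx = rng.randrange(start, end)
        else:
            idx = (start + end - 1) // 2
        indices.append(min(idx, num_frames - 1))
    return indices


if __name__ == "__main__":
    # A 90-frame video reduced to 3 frames spanning its whole duration.
    print(sample_segment_indices(90, 3))
```

Under these assumptions, a 90-frame video is reduced to three frames that still span its full duration, which is the redundancy reduction the item refers to. The sampled frames would then be passed to the (2+1)D network.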
Keywords/Search Tags: action recognition, deep learning, feature fusion, 3D convolution, multifiber network, temporal linear encoding