Research On Video Action Recognition Methods Based On Two-stream Networks

Posted on: 2022-07-08
Degree: Master
Type: Thesis
Country: China
Candidate: X Q Xiong
Full Text: PDF
GTID: 2518306536954839
Subject: Computer Science and Technology
Abstract/Summary:
Video action recognition is a representative task in computer vision. It aims to automatically assign semantic labels to human actions by analyzing the spatiotemporal information in videos. With the rapid development of deep learning and the continuous improvement of hardware performance, video action recognition has advanced considerably, and many successful deep neural network models have been proposed for extracting and classifying action features. At the same time, developing effective models for action recognition remains challenging. Two-stream networks are among the most popular and effective methods for video action recognition. However, the traditional two-stream architecture cannot model long-term temporal motion information and lacks interaction between spatial and temporal features. To model spatiotemporal information in videos more effectively, this thesis proposes two kinds of video action recognition models based on two-stream networks. The main contents are as follows:

First, to model long-term video sequences, two kinds of spatiotemporal residual networks are proposed by extending the 2D spatial residual network to the 3D domain. Two kinds of spatiotemporal residual units, based on residual scaling and identity mapping, are constructed to learn local temporal motion features. Stacking several such units through the hierarchy of the network yields a 3D spatiotemporal residual network with an extended temporal receptive field, making it possible to learn global motion information. Different 3D architectures for modeling long-term motion information and different methods for initializing the temporal kernels are explored. The results show that the proposed methods model long-term motion information effectively and that global spatiotemporal features are more effective than local features for video action recognition.

Second, to build interaction between spatial and temporal features in two-stream networks, two cross-stream interaction strategies (additive and multiplicative interaction) are introduced, making it possible to fuse the two streams at multiple levels of abstraction. Various alternatives for connecting the two streams are systematically explored, and the results show that effective cross-stream interaction further improves performance.

Experiments on the UCF101 and HMDB51 datasets show that the proposed models outperform the traditional two-stream method, which indicates that the optimized methods make better use of spatiotemporal information for video action recognition.
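The two cross-stream interaction strategies named in the abstract can be illustrated with a minimal sketch. The abstract does not specify how the thesis implements them, so the following is only an assumption: features from the spatial (appearance) stream and the temporal (motion) stream are taken to have the same shape and are combined element-wise. The function names `additive_interaction` and `multiplicative_interaction`, and the use of flat Python lists in place of real feature maps, are hypothetical and chosen for illustration.

```python
from typing import List

# A flattened feature map; a real network would use multi-dimensional tensors.
Vector = List[float]


def _check_same_length(appearance: Vector, motion: Vector) -> None:
    """Both streams must produce features of identical shape to be combined."""
    if len(appearance) != len(motion):
        raise ValueError("appearance and motion features must have the same length")


def additive_interaction(appearance: Vector, motion: Vector) -> Vector:
    """Assumed additive interaction: element-wise sum of the two streams' features."""
    _check_same_length(appearance, motion)
    return [a + m for a, m in zip(appearance, motion)]


def multiplicative_interaction(appearance: Vector, motion: Vector) -> Vector:
    """Assumed multiplicative interaction: motion features gate appearance features."""
    _check_same_length(appearance, motion)
    return [a * m for a, m in zip(appearance, motion)]


if __name__ == "__main__":
    appearance = [1.0, 2.0, 0.5]  # hypothetical spatial-stream activations
    motion = [0.0, 1.0, 2.0]      # hypothetical temporal-stream activations
    print(additive_interaction(appearance, motion))        # [1.0, 3.0, 2.5]
    print(multiplicative_interaction(appearance, motion))  # [0.0, 2.0, 1.0]
```

In this toy example, a zero motion activation leaves the additive result unchanged at that position but suppresses the multiplicative result entirely, which is one intuition for why the two strategies can behave differently when applied at several layers.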
Keywords/Search Tags: Video Action Recognition, Two-Stream Networks, Two-Stream Fusion Networks, Spatiotemporal Residual Network, Additive Interaction, Multiplicative Interaction