Spatiotemporal Squeeze-and-Excitation Residual Multiplier Networks For Video Action Recognition

Posted on: 2020-09-13
Degree: Master
Type: Thesis
Country: China
Candidate: K Tong
GTID: 2428330575494239
Subject: Electronic and communication engineering
Abstract/Summary:
As one of the main carriers of information, video is shared by people more and more widely, so understanding and analyzing these massive amounts of video data is crucial. Human action recognition in videos has become a challenging research topic in computer vision. It is widely used not only in video information retrieval, daily-life security, and public video surveillance, but also in human-computer interaction, scientific cognition, and other fields.

First, the research background, significance, and difficulties of action recognition are briefly introduced, and then action recognition methods based on deep learning models are comprehensively reviewed from three aspects: the types and numbers of input signals, the combination with traditional feature extraction methods, and the datasets used for pre-training. Furthermore, the performance of some typical methods on the UCF101 and HMDB51 datasets is summarized and analyzed. Finally, possible future research directions are discussed from three perspectives: video data preprocessing, the representation of human motion features in video, and model training. This summary and analysis of current deep-model-based video action recognition methods is intended as a reference for researchers in the field.

The two-stream deep model, which combines temporal and spatial information, is the most typical method in video action recognition. Based on the two-stream network structure, a spatiotemporal squeeze-and-excitation residual multiplier network for action recognition is proposed, which effectively improves recognition performance. The squeeze-and-excitation residual network learns spatial and temporal features better than shallow networks or traditional deep networks for action recognition. Long-term temporal dependence is captured by injecting identity mapping kernels into the network as temporal filters. In the feature-level fusion phase of the two-stream network, multiplicative fusion of spatiotemporal features is used to further enhance the interaction between the spatial and temporal information of the squeeze-and-excitation residual networks. In addition, extensive ablation experiments were conducted to study how the method, number, and location of the multiplicative fusion between the spatial and temporal streams affect the performance of the proposed model. Three different strategies are also proposed to generate model ensembles, and the average and the weighted average of the outputs of a model ensemble are computed to obtain the final recognition result. Experimental results on the UCF101 and HMDB51 datasets show that the proposed method performs well in video action recognition.
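
The final ensemble step described in the abstract (plain and weighted averaging of model outputs) can be illustrated with a minimal sketch. It is not taken from the thesis: the function names `ensemble_scores` and `predict`, the example scores, and the weights are all hypothetical, and the abstract does not say how the weights were actually chosen.

```python
from typing import List, Optional, Sequence


def ensemble_scores(
    model_scores: Sequence[Sequence[float]],
    weights: Optional[Sequence[float]] = None,
) -> List[float]:
    """Combine per-class scores from several models into one score vector.

    With weights=None this is a plain average; otherwise it is a weighted
    average whose weights are normalised by their sum.
    """
    if not model_scores:
        raise ValueError("need at least one model")
    num_classes = len(model_scores[0])
    if any(len(s) != num_classes for s in model_scores):
        raise ValueError("all models must score the same classes")
    if weights is None:
        weights = [1.0] * len(model_scores)
    if len(weights) != len(model_scores):
        raise ValueError("one weight per model is required")
    total = float(sum(weights))
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return [
        sum(w * s[c] for w, s in zip(weights, model_scores)) / total
        for c in range(num_classes)
    ]


def predict(scores: Sequence[float]) -> int:
    """Return the index of the highest-scoring class."""
    return max(range(len(scores)), key=lambda c: scores[c])


# Hypothetical softmax outputs of three models for one video, three classes.
scores = [
    [0.6, 0.3, 0.1],
    [0.3, 0.6, 0.1],
    [0.5, 0.4, 0.1],
]
plain = ensemble_scores(scores)                              # equal weights
weighted = ensemble_scores(scores, weights=[1.0, 3.0, 1.0])  # assumed weights
```

With these made-up numbers the plain average favours class 0, while giving the second model three times the weight shifts the decision to class 1, which shows why the two averaging schemes can yield different final predictions.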
Keywords/Search Tags: Action recognition, Spatiotemporal stream, Squeeze-and-Excitation residual networks, Multiplication fusion, Multi-model ensemble