Human Action Recognition Based On Spatiotemporal Two Stream Convolution Network

Posted on: 2022-03-18
Degree: Master
Type: Thesis
Country: China
Candidate: L Lu
Full Text: PDF
GTID: 2518306491996759
Subject: Computer Science and Technology
Abstract/Summary:
In video-based action recognition research, learning the spatiotemporal information in video is a challenging task. Action recognition is widely used in video surveillance, human-computer interaction, athlete training, and industrial operation processes. In recent years, with the development of deep learning in this field, many researchers at home and abroad have proposed action recognition methods based on deep learning, all of which show good performance. Extracting more effective features is particularly important. However, because actions differ in appearance and their continuous information is difficult to capture, improving recognition accuracy remains a serious challenge. Designing a reasonable network structure that extracts better feature information, and constructing an appropriate loss function, can therefore improve model performance and thus the accuracy of action recognition. Based on general action recognition and a detailed analysis of mainstream action recognition methods, this thesis designs a two-stream network structure that better learns the intermediate-layer features of the network and the relationships between those features. The main research results are as follows:

1. Improved MARS algorithm. To help the action recognition network better capture temporal and spatial information, and to address defects of the MARS algorithm such as insufficient learning of network-layer features and of the relationships between features, the distillation strategy is retained, the structure of the two-stream convolutional network is improved, and a more reasonable loss function is constructed, which improves the network's ability to learn features. Specifically, each layer of the network is distilled to handle the appearance and continuous-action features that are difficult to capture, and a linear multi-level MAE loss function and a multi-level Gram loss function are used so that the intermediate-layer features and the relationships between features are learned better. The Dual-action Stream network proposed in this thesis currently achieves its best accuracy of 97.8% on UCF-101 and 81.2% on HMDB-51. With 16-frame input clips, accuracy is 0.6% higher than the baseline network MARS, and with 64-frame input it is 0.1% higher. A second pair of gains over the baseline is also reported: 1.4% with 16-frame input clips and 0.2% with 64-frame input.

2. Score fusion module based on an attention mechanism. To combine spatiotemporal feature information effectively, a score fusion module based on an attention mechanism is designed. By adding a linear module on top of the two-stream architecture and adjusting the weight of each stream in the spatiotemporal two-stream architecture, a neural network replaces the original averaging and SVM score fusion methods without harming recognition accuracy; recognition accuracy in fact improves. The experimental results show that this method improves both the use of spatiotemporal information and detection efficiency. With this module, recognition accuracy is 1.1% higher than the baseline network on UCF-101 and 0.4% higher on HMDB-51. Built on the Dual-action Stream network proposed in this thesis, the two-stream network with this module reaches a best accuracy of 97.8% on UCF-101 and 81.2% on HMDB-51, which is 0.3% higher than MARS.
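The abstract names its two contributions but gives no formulas, so the NumPy sketch below shows only one plausible reading of them. It assumes the multi-level loss sums, over matching intermediate layers of the two streams, a mean absolute error (MAE) term on the features plus a term on their Gram matrices (channel-by-channel correlations, standing in for "the relationship between features"). It also assumes the attention-based score fusion is a softmax-weighted sum of per-stream class scores. The function names, the `level_weights` and `gram_weight` parameters, and the squared difference used for the Gram term are all hypothetical and are not taken from the thesis.

```python
import numpy as np


def gram_matrix(feat):
    """Channel-by-channel Gram matrix of a feature map shaped (channels, ...)."""
    flat = np.asarray(feat, dtype=float).reshape(np.shape(feat)[0], -1)
    return flat @ flat.T / flat.shape[1]


def multi_level_loss(student_feats, teacher_feats, level_weights=None, gram_weight=1.0):
    """Sum an MAE term and a Gram-matrix term over paired intermediate layers.

    Assumption: one weight per level and a squared Gram difference.
    """
    if level_weights is None:
        level_weights = [1.0] * len(student_feats)
    total = 0.0
    for w, s, t in zip(level_weights, student_feats, teacher_feats):
        s = np.asarray(s, dtype=float)
        t = np.asarray(t, dtype=float)
        mae = np.mean(np.abs(s - t))                                # feature term
        gram = np.mean((gram_matrix(s) - gram_matrix(t)) ** 2)      # relation term
        total += w * (mae + gram_weight * gram)
    return float(total)


def attention_fuse(stream_scores, attention_logits):
    """Fuse per-stream class scores with softmax weights over the streams."""
    logits = np.asarray(attention_logits, dtype=float)
    weights = np.exp(logits - logits.max())   # numerically stable softmax
    weights /= weights.sum()
    scores = np.asarray(stream_scores, dtype=float)  # shape (streams, classes)
    return weights @ scores


# Toy data: two feature levels with different shapes, and two streams' scores.
rng = np.random.default_rng(0)
teacher = [rng.normal(size=(4, 2, 3, 3)), rng.normal(size=(6, 5))]
student = [f + 0.1 for f in teacher]
loss = multi_level_loss(student, teacher, level_weights=[0.5, 1.0])
fused = attention_fuse([[0.2, 0.8], [0.6, 0.4]], attention_logits=[0.0, 0.0])
```

With equal attention logits the fusion reduces to plain score averaging, the method the abstract says the module replaces. In a trained model the logits would be learned parameters, or the output of a small linear layer, rather than fixed values.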
Keywords/Search Tags: Action recognition, Distillation strategy, Multi-level MAE loss, Multi-level Gram loss, Dual-action Stream network, Attention mechanism