Action Recognition Based On Spatio-temporal Features

Posted on: 2023-04-28
Degree: Master
Type: Thesis
Country: China
Candidate: Z Z Mao
Full Text: PDF
GTID: 2568306818495304
Subject: Computer Science and Technology
Abstract/Summary:
Human action recognition has significant technical value and broad application prospects in human-computer interaction, intelligent monitoring, medical assistance, motion assistance, and other domains. Through the continuous efforts of researchers, the field has reached many milestones. In practical applications, however, factors such as complicated backgrounds, camera movement, changes in light intensity, and changes in posture during an action make the task more challenging. Most existing methods focus on the feature extraction ability of the spatial or temporal feature extraction network, but neglect the spatio-temporal feature fusion strategy and the question of whether the input features of the feature extraction network contain enough information. In this thesis, three action recognition algorithms are proposed and implemented by integrating feature enhancement and feature extraction methods. The main work and contributions are as follows:

(1) A human action recognition method based on a non-fair feature fusion strategy is proposed. Both the spatial and the temporal feature extraction network inevitably produce some low-confidence features. Conventional spatio-temporal feature fusion cannot distinguish high-confidence features from low-confidence ones, which lowers the accuracy of the network. To reduce the influence of low-confidence features on the fused features, a non-fair feature fusion strategy is proposed. By introducing a dropout layer and non-fair treatment into the conventional feature fusion strategy, it reduces the number of low-confidence spatial or temporal features as much as possible before fusion, so that spatio-temporal feature fusion relies only on high-confidence features and achieves higher accuracy. The method achieves 81.6% and 48.7% accuracy on the UCF101 and HMDB51 datasets respectively, which is competitive with methods that take only RGB frames as input.

(2) A human action recognition method based on a multi-view reinforced attention mechanism is proposed. Although the spatial feature extraction network is strong at extracting spatial features, it is weak at extracting temporal features, so its output features lose a large amount of temporal information. In an end-to-end architecture, the output of the spatial feature extraction network is used directly as the input of the temporal feature extraction network, which seriously degrades the performance of the latter. To make better use of the temporal feature extraction network, a multi-view reinforced attention mechanism is proposed. It reinforces the input features of the temporal feature extraction network by introducing two reinforced input features with abundant temporal information as attention features, compensating for the lost temporal information. The method achieves 82.4% and 50.6% accuracy on the UCF101 and HMDB51 datasets respectively, which is competitive with methods that take only RGB frames as input.

(3) A human action recognition method based on a multi-view temporal feature extractor is proposed. The multi-view reinforced attention mechanism requires reinforced input features that contain enough temporal information and as little misleading information as possible. Taking multi-view features obtained directly from the input data as the reinforced input features retains the largest amount of original temporal information, but also introduces a large amount of misleading information into the multi-view reinforced attention mechanism. To reduce this misleading information, a multi-view temporal feature extractor is proposed. It extracts multi-view temporal features along the time dimension by stacking convolutional layers, removing misleading information in the process. Using these multi-view temporal features as the new reinforced input features both meets the requirements of the multi-view reinforced attention mechanism for enhancing the input features and reduces the impact of misleading information on the temporal feature extraction network. The method achieves 83.1% and 52.1% accuracy on the UCF101 and HMDB51 datasets respectively, which is competitive with methods that take only RGB frames as input.
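
The idea behind contribution (1), fusing two streams while discounting low-confidence features, can be illustrated with a small sketch. The abstract does not specify how confidence is measured or what the non-fair treatment is, so everything below is an assumption made for illustration: the function name `non_fair_fuse`, the use of the top softmax probability as a confidence score, the `threshold` parameter, and the hard drop of a low-confidence stream (a simple stand-in for the dropout layer the thesis mentions, not the thesis's actual design).

```python
import math


def softmax(scores):
    """Convert raw class scores into probabilities."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def non_fair_fuse(spatial_scores, temporal_scores, threshold=0.5):
    """Fuse two streams' class scores, favoring the more confident stream.

    Hypothetical rule (not taken from the thesis): a stream's confidence is
    its top softmax probability. A stream below `threshold` is dropped
    (weight 0) unless both streams fall below it, in which case both are
    kept. Surviving streams are weighted by their confidence.
    """
    probs = [softmax(spatial_scores), softmax(temporal_scores)]
    conf = [max(p) for p in probs]

    keep = [c >= threshold for c in conf]
    if not any(keep):
        keep = [True, True]  # never discard every stream

    weights = [c if k else 0.0 for c, k in zip(conf, keep)]
    total = sum(weights)
    weights = [w / total for w in weights]

    n_classes = len(spatial_scores)
    fused = [
        sum(w * p[i] for w, p in zip(weights, probs))
        for i in range(n_classes)
    ]
    return fused, weights


# Spatial stream is confident about class 0; temporal stream is nearly uniform.
fused, weights = non_fair_fuse([4.0, 0.0, 0.0], [0.1, 0.0, 0.2])
```

In this example the temporal stream's top probability falls below the threshold, so only the spatial stream contributes to the fused prediction. A conventional equal-weight average would instead let the near-uniform temporal scores dilute the confident spatial prediction.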
Keywords/Search Tags:Human action recognition, Feature fusion, Multi-views feature, Attention mechanism, Feature extractor