
Human Action Recognition Based On Spatial Temporal Feature Fusion

Posted on: 2023-04-25
Degree: Master
Type: Thesis
Country: China
Candidate: B Ma
Full Text: PDF
GTID: 2558307073495374
Subject: Computer technology
Abstract/Summary:
Human action recognition has broad application value and is an active, challenging topic in computer vision research. Compared with object detection, human action recognition must additionally process time-series information, and modeling the temporal and spatial dimensions together is difficult, which seriously affects recognition accuracy. Current mainstream methods use a single feature extraction method to model either motion information or spatial information, and so model complementary information poorly. To address this, this thesis mines the complementary information between multiple features and constructs suitable fusion networks to achieve efficient and accurate human action recognition.

First, this thesis proposes a feature enhancement method based on distillation learning to pretrain the backbone network. Optical flow contains human motion information and local spatial information, but it is expensive to extract. For this reason, the thesis trains the backbone network with optical flow distillation. The network is divided into an optical flow teacher branch and an RGB student branch. With the parameters of the optical flow branch fixed, the student branch learns to extract features containing optical flow information from RGB input alone. For the human action recognition task, the pretrained network is combined with a spatial feature extraction model: a spatial attention mechanism enhances the modeling of spatial features, and a channel attention mechanism performs the feature fusion. Experimental results show that the network outperforms existing methods on public datasets.

Second, this thesis proposes a Transformer-based multi-feature fusion network. The basic feature extraction module uses 2D convolution and 3D convolution to extract spatial features and motion features, respectively, providing different levels of information for the fusion stage. The fusion module applies position encoding along the channel dimension and uses the Transformer self-attention mechanism to analyze the dependencies between local and global features, achieving more accurate action classification and localization. This fusion method guides the model to attend to the more useful information among multiple features and reduces the influence of irrelevant information. The results show that the network is competitive in both performance and number of model parameters.

Finally, this thesis proposes an end-to-end real-time sparse detection method for abnormal behavior recognition in real surveillance scenarios. Using the above fusion network, detection is run sparsely at intervals of multiple frames and combined with an interpolation algorithm to generate a dense detection set, which improves inference speed and gives the model real-time detection capability. A center-of-gravity-assisted discrimination algorithm is proposed to address the persistent discrimination problem for ambiguous abnormal behaviors. The real-time performance and recognition accuracy of the model meet practical needs, and it performs well on the self-collected dataset.
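
The abstract does not say which interpolation algorithm turns the sparse detections into a dense detection set. As a minimal sketch of the idea, assuming a single tracked person, boxes in `(x1, y1, x2, y2)` form, and plain linear interpolation between detected frames, the densification step might look like the following. The names `interpolate_boxes` and `densify` are hypothetical and are not taken from the thesis.

```python
from typing import Dict, Tuple

# Assumed box format: (x1, y1, x2, y2) in pixels.
Box = Tuple[float, float, float, float]


def interpolate_boxes(start: Box, end: Box, t: float) -> Box:
    """Linearly blend two boxes; t=0 returns start, t=1 returns end."""
    return tuple(a + (b - a) * t for a, b in zip(start, end))


def densify(keyframes: Dict[int, Box]) -> Dict[int, Box]:
    """Fill in every frame between consecutive sparse detections."""
    dense: Dict[int, Box] = {}
    frames = sorted(keyframes)
    for left, right in zip(frames, frames[1:]):
        span = right - left
        for f in range(left, right):
            t = (f - left) / span
            dense[f] = interpolate_boxes(keyframes[left], keyframes[right], t)
    if frames:
        # The last detected frame is copied through unchanged.
        dense[frames[-1]] = keyframes[frames[-1]]
    return dense


# Detector run only on frames 0 and 4 (an interval of 4 frames, chosen for illustration).
sparse = {0: (10.0, 10.0, 50.0, 50.0), 4: (18.0, 10.0, 58.0, 50.0)}
dense = densify(sparse)
```

Here the expensive network runs on one frame in four and the intermediate boxes cost only a few arithmetic operations, which is the source of the speed-up that sparse detection aims for. A real system would also need to match detections across frames when several people are present.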
Keywords/Search Tags: Human action recognition, Feature fusion, Transformer, Attention mechanism, Knowledge distillation