Video is a continuous sequence of images that changes over time and is an important form of multimedia data. The understanding and analysis of video has long been a research hotspot in computer vision. Action recognition, the task of classifying short, trimmed videos into unique labels, is an essential research direction in the field of video understanding. As a time-varying sequence, video carries both spatial and temporal information, so the study of action recognition is in essence an exploration of spatiotemporal modeling methodology. Early research on action recognition applied conventional methods that focus on constructing handcrafted features to represent the spatiotemporal characteristics of actions. However, due to the high complexity of video data, the modeling capability of such conventional methods is limited. With the remarkable progress of deep learning in image recognition and natural language processing, techniques for extracting purely spatial or purely temporal information with deep networks have gradually matured. The exploration of spatiotemporal modeling for action recognition builds on these achievements but also requires innovations of its own. The difficulties of action recognition can be summarized in three aspects. The first is data complexity: the cost of transmitting and computing on video is large. The second is algorithm precision: real-world applications demand high accuracy. The third is the limitation of data volume and computing resources: real-world applications often face limited training data and limited computing resources. To address these three challenges, research on action recognition can be roughly divided into the following three directions. The first is the input feature level, which extracts effective feature representations from complex video data; it optimizes the input rather than the main body of the
deep network to improve the performance of the algorithm; we call this pre-recognition research. The second is the model structure level: by adjusting and optimizing the network structure of deep learning algorithms, more effective spatiotemporal modeling methods are proposed to decouple the video content and capture temporal correlations, improving recognition performance; we call this recognition methodology research. The third is the practical effectivity level: with practical usage in mind, the characteristics of the data are comprehensively exploited to adaptively strengthen the robustness of the algorithm for specific tasks, and the behavior of existing algorithms in practical applications is studied; we call this post-recognition research. This thesis conducts a systematic and comprehensive exploration of action recognition along these three directions. In the pre-recognition research, the two main input modalities, skeleton and RGB video, are explored separately, so four lines of research are conducted in total. The main contributions are as follows: (1) Descriptors for skeleton motion are proposed. Unlike raw RGB input, skeleton features carry strong semantic information, so effective feature representations are easier to extract from them than from RGB data. Previous skeleton-based action recognition algorithms use various features, but these are all obtained by linear transformations of the skeleton coordinates along the temporal dimension. Such representations differ little from the raw coordinates and are easily subsumed by the deep learning process. To solve this problem, we propose rotation descriptors that are intrinsic to skeleton motion. The rotation descriptors are completely independent of the positions of the skeleton joints, exhibit good orthogonality with the original coordinate data, and can easily be applied to various deep learning models to construct multi-stream networks for improving
existing algorithms. There are two kinds of rotation descriptors. The Rotation Angle Representation (RAR) strictly observes the constraints of the 3D rotation group when describing joint motion; its motion description is accurate and usually yields higher performance. The Two-Directional Difference Representation (2DDR) relaxes the rotation-group constraint to a linear transformation, giving better robustness and lower computational cost than RAR. (2) A short-term motion feature generation method for RGB input is proposed. Owing to the complexity of RGB data, dense frame sampling incurs large computational and input/output (I/O) costs, so deep learning algorithms usually sample video sequences sparsely for spatiotemporal modeling. Adding short-term motion features greatly strengthens action recognition methods, but collecting them requires densely sampled frames as input, which challenges both data quality and transmission speed in practical applications. Traditional sparse sampling already captures the overall course of an action, and humans can imagine the entire action process from sparse samples, so we propose a motion feature generator that emulates this ability. The generator takes sparsely sampled frames as input and, through an encoder-decoder structure, estimates short-term motion features from the long-term motion between the sparse samples. The generated short-term motion is a feature-level enhancement that can be flexibly adopted by existing methods, making it a generally applicable feature enhancement technique. The method improves recognition accuracy with no additional input, a simple structure, and almost no increase in computational complexity, and can be applied in various scenarios. (3) A spatiotemporal modeling method based on foreground
feature extraction is proposed. Since actions are performed by the foreground of a video, an essential challenge in spatiotemporal modeling is to decouple the foreground from the background. Because the foreground usually occupies only a small part of each frame, directly conducting spatiotemporal modeling over-models the background and degrades performance. The concepts of foreground and background are relative: only during action execution can an attention mechanism identify the dynamic part as the foreground and strengthen it. To solve this problem, we propose a Foreground EXtraction (FEX) strategy. During spatiotemporal modeling, the features are simply fused along the temporal dimension to encode the relatively stable, static background; subtracting this from the original features yields the foreground. In the spatial dimension the FEX strategy is implemented as a Scene Segregation (SS) module, and in the channel dimension as a Foreground Enhancement (FE) module. The proposed FEXNet combines the two modules and improves recognition performance. (4) A spatiotemporal modeling method for irregular data and low computing resources is proposed. Common spatiotemporal modeling algorithms are ideally trained on public datasets with sufficient, uniformly distributed data; in that case, selecting suitable hyper-parameters can fully exploit the performance of the algorithms, and post-recognition research is unnecessary. In real-world situations, however, the characteristics of the data are unpredictable, the class distribution is unlikely to be uniform, and the scale is often small. To overcome this challenge, we design a 2D Progressive Fusion module that can be flexibly embedded in 2D CNN backbone networks. The module uses a novel convolution, named Variation Attenuating Convolution, to fuse the features extracted by the backbone network in both
spatial and temporal dimensions. By gradually reducing the temporal dimension, it decreases the number of parameters to suit small data volumes. The proposed structure constrains the change in channel semantics introduced by spatiotemporal modeling, so the classification model of the pretrained backbone network can be fully exploited; the network thus converges well on small-scale datasets and is insensitive to hyper-parameters. The proposed network structure is therefore well suited to real-world usage.
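To make the two rotation descriptors concrete, the following sketch derives RAR-style and 2DDR-style features from a joint-coordinate sequence. The exact formulations are given in the corresponding chapter; here RAR is read as the axis-angle rotation between consecutive bone directions and 2DDR as forward/backward linear differences of those directions. These readings, the function names, and the `parents` array are illustrative assumptions, not the thesis definitions.

```python
import numpy as np

def bone_vectors(joints, parents):
    """joints: (T, J, 3) coordinates; parents[j] is the parent index of joint j."""
    return joints - joints[:, parents, :]

def rar(joints, parents, eps=1e-8):
    """RAR-style feature: axis-angle of the rotation taking each bone's
    direction at frame t to its direction at frame t+1 (position-independent)."""
    b = bone_vectors(joints, parents)
    u = b / (np.linalg.norm(b, axis=-1, keepdims=True) + eps)
    u0, u1 = u[:-1], u[1:]
    axis = np.cross(u0, u1)
    axis /= np.linalg.norm(axis, axis=-1, keepdims=True) + eps
    angle = np.arccos(np.clip((u0 * u1).sum(-1), -1.0, 1.0))
    return axis * angle[..., None]                      # (T-1, J, 3)

def tddr(joints, parents, eps=1e-8):
    """2DDR-style feature: the rotation constraint relaxed to plain linear
    forward and backward differences of the bone directions."""
    b = bone_vectors(joints, parents)
    u = b / (np.linalg.norm(b, axis=-1, keepdims=True) + eps)
    d = u[1:] - u[:-1]
    zero = np.zeros_like(u[:1])
    fwd = np.concatenate([d, zero], axis=0)             # difference toward t+1
    bwd = np.concatenate([zero, d], axis=0)             # difference from t-1
    return np.concatenate([fwd, bwd], axis=-1)          # (T, J, 6)
```

Because both descriptors are built from bone directions alone, translating the whole skeleton leaves them unchanged, which reflects the position-independence and orthogonality-to-coordinates properties the thesis exploits in multi-stream networks.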
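The short-term motion generator can be pictured as an encoder-decoder residual over the features of two sparse samples. The sketch below is a toy, fully connected stand-in with random weights (the class name and shapes are hypothetical; the real module uses learned convolutional layers) and only illustrates the dataflow: long-term difference in, estimated short-term motion out, added back at the feature level with no extra frames read from the video.

```python
import numpy as np

class MotionFeatureGenerator:
    """Toy stand-in for the encoder-decoder motion generator: it maps the
    long-term motion between two sparsely sampled features to an estimated
    short-term motion feature."""
    def __init__(self, channels, hidden, seed=0):
        rng = np.random.default_rng(seed)
        self.enc = rng.normal(scale=0.1, size=(channels, hidden))
        self.dec = rng.normal(scale=0.1, size=(hidden, channels))

    def __call__(self, feat_a, feat_b):
        long_term = feat_b - feat_a           # coarse motion between sparse frames
        code = np.tanh(long_term @ self.enc)  # encode
        return code @ self.dec                # decode: short-term motion estimate

# Feature-level enhancement: fuse appearance with the estimated motion.
gen = MotionFeatureGenerator(channels=64, hidden=16)
rng = np.random.default_rng(1)
f1, f2 = rng.normal(size=(64,)), rng.normal(size=(64,))
enhanced = f1 + gen(f1, f2)
```

Because the generator consumes only features already computed from the sparse samples, it adds no input cost, matching the "no additional input" claim above.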
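The FEX idea that temporal fusion encodes the background while subtraction leaves the foreground reduces to a few tensor operations. The sketch below assumes a (C, T, H, W) feature layout, a plain temporal mean for the fusion, and a simple sigmoid gate for the channel-wise FE step; the actual SS and FE modules are learned and more elaborate, so treat this only as the core arithmetic.

```python
import numpy as np

def foreground_extract(feat):
    """FEX core: fuse along time into a quasi-static background estimate,
    then subtract it from the original features to expose the foreground.
    feat: (C, T, H, W)."""
    background = feat.mean(axis=1, keepdims=True)   # temporal fusion
    foreground = feat - background                  # original minus background
    return foreground, background

def foreground_enhance(feat):
    """FE-like step (illustrative): gate each channel by the strength of its
    foreground response, so motion-sensitive channels are amplified."""
    fg, _ = foreground_extract(feat)
    score = np.abs(fg).mean(axis=(1, 2, 3))         # (C,) foreground energy
    gate = 1.0 / (1.0 + np.exp(-score))             # sigmoid
    return feat * gate[:, None, None, None]
```

A perfectly static clip yields an all-zero foreground, which is exactly the behavior that keeps the background from being over-modeled.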
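The 2D Progressive Fusion idea of gradually collapsing the temporal dimension can be sketched as a strided temporal operation applied repeatedly. Since the abstract does not define Variation Attenuating Convolution, the sketch substitutes a stride-k temporal average, one plausible semantics-preserving starting point, so that the output channels keep the meaning the pretrained 2D backbone assigned to them; function names and the frame-dropping behavior for ragged lengths are assumptions.

```python
import numpy as np

def fuse_step(feat, k=2):
    """One progressive-fusion step: collapse groups of k neighboring frames.
    A temporal average stands in for the learned Variation Attenuating
    Convolution; at this initialization channel semantics are unchanged.
    feat: (C, T, H, W). Trailing frames not filling a group are dropped."""
    C, T, H, W = feat.shape
    if T <= k:                                   # final step: fuse what is left
        return feat.mean(axis=1, keepdims=True)
    T2 = T // k
    return feat[:, :T2 * k].reshape(C, T2, k, H, W).mean(axis=2)

def progressive_fusion(feat, k=2):
    """Repeat the step until one (C, H, W) map remains for classification."""
    while feat.shape[1] > 1:
        feat = fuse_step(feat, k)
    return feat[:, 0]
```

Each step removes temporal positions rather than adding channels, which is how the module keeps its parameter count small enough for small-scale datasets.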