Action Recognition Based On Feature Encoding And Pooling

Posted on: 2021-04-15
Degree: Doctor
Type: Dissertation
Country: China
Candidate: X S Lu
Full Text: PDF
GTID: 1368330614950636
Subject: Computer application technology
Abstract/Summary:
With the development of mobile communication technologies, video data occupies an increasing proportion of Internet traffic, and our lives are inextricably linked to various video applications. Intelligent video analysis technology will play an important role in many fields, such as intelligent monitoring, automatic driving, and violent video detection. Action recognition is a fundamental task in the field of video analysis. Research on action recognition not only improves recognition performance but also provides basic models for other video-related tasks.

Feature encoding and pooling play an important role in the study of feature representation, which is a core problem in computer vision. Feature encoding means encoding input features to obtain a higher-level representation, and feature pooling refers to the process of aggregating visual features within a specific spatial or temporal region. Both operations transform local, redundant features into global, compact features by statistical methods.

In this dissertation, we study feature encoding and pooling methods for the action recognition task in videos. Two pieces of work are first carried out in the feature encoding area: encoding low-level features (e.g., optical flow) with a local aggregating method, and designing an attentional encoding layer to encode the spatial features extracted by a convolutional network. Two studies are then performed in the feature pooling area: adopting the trajectory prior to pool 3D convolutional features, and putting forward a spatial-temporal gated pyramid pooling layer to pool convolutional features. The main contents and contributions of this dissertation can be summarized in the following four aspects:

(1) Applying the local aggregating idea to the design of action features and presenting locally aggregated histogram encoding descriptors. The disadvantage of conventional histogram encoding descriptors is that they only count the number of data points falling into each bin. Motivated by the local aggregating idea in VLAD, this dissertation constructs the LA-HOF and LA-MBH descriptors based on the original HOF and MBH descriptors, respectively. The experimental results on action recognition datasets verify the effectiveness of the proposed descriptors. This dissertation also extends the idea to the HOG descriptor and constructs the LA-HOG descriptor, which achieves good results on the object recognition task. The results indicate that the local aggregating idea is suitable for different kinds of descriptors.

(2) Introducing the attention mechanism and a feature encoding method into the design of neural networks and proposing a two-stream convolutional attention-based encoding network. The designed attentional encoding layer can simultaneously encode the convolutional features of video frames as a whole and in part: the global encoding division encodes the entire video frame, and the local encoding division encodes multiple local salient regions. This dissertation also proposes and compares two different multi-branch encoding structures. The proposed network achieves good results on public action recognition datasets.

(3) Utilizing the trajectory prior and proposing multi-scale trajectory-pooled 3D convolutional descriptors. Motion information is contained in trajectories, which are often used for temporal modeling in traditional features. We first extract feature maps from the input video using a 3D convolutional network, then project the multi-scale trajectories calculated from the original video onto the feature maps, and finally conduct max pooling along the projected trajectories to obtain the final features. The experimental results on public datasets show that the proposed descriptor performs better than the original C3D descriptor, because it makes better use of the temporal information in videos.

(4) Combining the pyramid pooling method with the gating mechanism and proposing deep convolutional networks with a spatiotemporal gated pyramid pooling (STGPP) method. The designed STGPP layer contains a pyramid pooling module and a gating module. The pyramid pooling module splits the features in space-time by the pyramid method. The gating module can be divided into spatio-temporal and channel gating operations. This dissertation proposes and compares two gating structures, which connect these two operations in series and in parallel, respectively. The experimental results on public action recognition datasets show that the proposed STGPP layer further enhances the discriminative ability of convolutional features.
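
The trajectory pooling step in contribution (3) can be illustrated with a minimal sketch. It is not the dissertation's implementation: the function name `trajectory_max_pool`, the nested-list feature-map layout, the stride parameters, and the nearest-cell projection by truncation are all assumptions made for illustration, and only a single scale is shown. A multi-scale variant would presumably repeat the same pooling on trajectories computed at several scales.

```python
from typing import List, Sequence, Tuple

# Assumed layout: feature_map[t][y][x] is a list of C channel activations.
FeatureMap = List[List[List[List[float]]]]
# Assumed trajectory format: (frame, x, y) points in video coordinates.
Trajectory = Sequence[Tuple[int, float, float]]


def trajectory_max_pool(
    feature_map: FeatureMap,
    trajectory: Trajectory,
    temporal_stride: float,
    spatial_stride: float,
) -> List[float]:
    """Max-pool feature-map activations along one projected trajectory."""
    if not trajectory:
        raise ValueError("trajectory must contain at least one point")

    t_max = len(feature_map) - 1
    y_max = len(feature_map[0]) - 1
    x_max = len(feature_map[0][0]) - 1
    channels = len(feature_map[0][0][0])

    pooled = [float("-inf")] * channels
    for frame, x, y in trajectory:
        # Project the video-space point onto the coarser feature-map grid
        # (hypothetical rule: divide by the stride, truncate, then clamp).
        ft = min(max(int(frame / temporal_stride), 0), t_max)
        fy = min(max(int(y / spatial_stride), 0), y_max)
        fx = min(max(int(x / spatial_stride), 0), x_max)
        cell = feature_map[ft][fy][fx]
        # Channel-wise max over every cell the trajectory passes through.
        pooled = [max(p, v) for p, v in zip(pooled, cell)]
    return pooled
```

Each trajectory yields one vector with as many entries as the feature map has channels, so the descriptor length is independent of how many points the trajectory contains.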
Keywords/Search Tags: Computer vision, Action recognition, Feature encoding, Feature pooling